<a href="https://colab.research.google.com/github/freny-caicedo-endava/Pio.ML/blob/master/MLIntro_P2_RefiningTheModel_init.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


> ![answer](https://drive.google.com/uc?export=view&id=1yj7jPO0w4Ayq1OxkpflWOE-BBdMd_O4k)

## Introduction to Machine Learning
### PART TWO: Refining the model

# 1.  EVALUATION METRICS

A **Confusion Matrix** is a performance measurement for machine learning classification problems. It is relatively simple to understand, but the related terminology can be confusing.

> ![answer](https://drive.google.com/uc?export=view&id=1nbipIp7_oIEFQzkuf9o9F4bB_9W2L9rc)

> ![answer](https://drive.google.com/uc?export=view&id=1Euhh_mSTMZwdyuanrFeDd4OeGgBRbFZn)
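
As a minimal sketch of how the four cells of the matrix can be read with scikit-learn, the snippet below uses made-up labels (they are not from the workshop data). It assumes binary labels where 1 is the positive class.

```python
from sklearn.metrics import confusion_matrix

# hypothetical labels, chosen only for illustration (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# for binary labels 0/1, rows are actual classes and columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Note that scikit-learn orders the rows and columns by sorted label value, so the negative class comes first.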







In [0]:
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 1]

# get confusion matrix

**Accuracy:** Overall, how often is the classifier correct?

(TP+TN)/total 

**Precision:** When it predicts the positive result, how often is it correct?
Fraction of positive predictions that are actually positive

TP/(TP+FP)

**Recall:** When it is actually the positive result, how often does it predict correctly?

It is the fraction of actual positives that were predicted to be positive.

TP/(TP+FN)

> ![answer](https://drive.google.com/uc?export=view&id=1jPebdA_bJKgq9vFuQhpA8sbF-v8LoMKT)
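
The three formulas above can be sketched directly from the confusion matrix counts. The counts below are hypothetical, and the helper name `classification_metrics` is only illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total   # (TP+TN)/total
    precision = tp / (tp + fp)     # TP/(TP+FP)
    recall = tp / (tp + fn)        # TP/(TP+FN)
    return accuracy, precision, recall

# hypothetical counts for 100 samples
accuracy, precision, recall = classification_metrics(tp=30, tn=40, fp=10, fn=20)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

With these counts the classifier is right 70% of the time overall, 75% of its positive predictions are correct, and it finds 60% of the actual positives.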


---

*When is Precision more important?*

YouTube recommendations: should it recommend a product?

FP: bad user experience!

FN: not a big problem

---

*When is Recall more important?*

Lung cancer warning from an X-ray

FP: not a big problem

FN: person loses chance to live!!!

So, precision is important to avoid false positives.
And recall is important to avoid false negatives.

---


**F1 Score:** the harmonic mean of precision and recall.

The harmonic mean is a kind of average whose result is closer to the lower number, so the F1 score is closest to the smaller of Precision and Recall.
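
To see why the harmonic mean is pulled toward the lower number, here is a small sketch comparing it with the ordinary (arithmetic) mean. The precision and recall values are hypothetical.

```python
# hypothetical scores: perfect recall but mediocre precision
precision, recall = 0.5, 1.0

arithmetic_mean = (precision + recall) / 2
# harmonic mean of two numbers: 2ab / (a + b)
f1 = 2 * precision * recall / (precision + recall)

print("arithmetic mean:", arithmetic_mean)
print("harmonic mean (F1):", round(f1, 4))
```

The arithmetic mean is 0.75, while the harmonic mean is about 0.667, closer to the weaker of the two scores.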

In [0]:
import statistics as s

values = [0.1, 0.8]

# get harmonic mean

Note that F1 score gives equal importance to precision and recall.

**F-Beta score:** is the weighted harmonic mean of precision and recall.

The beta parameter determines the weight of recall in the combined score. 

*   *beta < 1* lends more weight to precision
*   *beta > 1* favors recall
*   *beta = 1* is the plain harmonic mean (the F1 score)
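
The effect of beta can be sketched with the usual F-beta formula, (1 + beta²) · P · R / (beta² · P + R). The helper `f_beta` and the input values are illustrative only; in practice `sklearn.metrics.fbeta_score` computes the same quantity from labels.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# hypothetical scores: low precision, high recall
precision, recall = 0.5, 1.0

f_half = f_beta(precision, recall, beta=0.5)  # leans toward precision
f_one = f_beta(precision, recall, beta=1.0)   # F1 score
f_two = f_beta(precision, recall, beta=2.0)   # leans toward recall

print(round(f_half, 4), round(f_one, 4), round(f_two, 4))
```

Because recall is the stronger score here, the result rises as beta grows and more weight shifts to recall.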

In [0]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, fbeta_score

y_true = [1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0]

# get and print accuracy, precision and recall

# print harmonic mean and fbeta score

# 2.  DETECTING ERRORS

## 2.1 TYPES OF ERRORS 

**UNDERFITTING**

Does not do well on the training set.
Error due to bias.

**OVERFITTING**

Does well on the training set, but it tends to memorize it instead of learning its general characteristics.
Error due to variance.


> ![answer](https://drive.google.com/uc?export=view&id=1psGrlora0dAitbFqKYnHWpAoKLrieW7V)
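
One way to spot both kinds of error is to compare the error on the training set with the error on unseen data. The sketch below is not part of the workshop: it fits polynomials of different degrees to synthetic data with NumPy, and the data, degrees, and noise level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# training set: 10 evenly spaced points on a noisy quadratic (synthetic data)
x_train = np.linspace(-1, 1, 10)
y_train = x_train ** 2 + rng.normal(0, 0.1, x_train.size)

# test set: fresh points drawn from the same process
x_test = rng.uniform(-1, 1, 100)
y_test = x_test ** 2 + rng.normal(0, 0.1, x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (0, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    test_mse[degree] = mse(coeffs, x_test, y_test)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

The degree-0 model (a constant) is too simple and usually shows high error on both sets, which is underfitting. The degree-9 model can pass almost exactly through the ten training points, so its training error is close to zero while its test error is typically larger, which is overfitting.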

## 2.2 CROSS VALIDATION

Cross-validation is a statistical method used to estimate the skill of machine learning models.

In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. 

The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation.

Very useful when we have little data.

> ![answer](https://drive.google.com/uc?export=view&id=1t_V8pJo000_ybbohxeuHqZlnrjz-RsPi)
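
The partitioning described above can be sketched in plain Python. The helper `k_fold_indices` is hypothetical and only illustrates the idea; scikit-learn offers `KFold` and `cross_val_score` for real use.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k (nearly) equal-sized folds
    for i in range(k):
        validation = folds[i]                     # one fold held out
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(n_samples=10, k=5))
for training, validation in splits:
    print("train:", sorted(training), "validation:", sorted(validation))
```

Each sample index appears in exactly one validation fold, so every data point is used for validation once and for training k − 1 times.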



# 3. DATA PRE-PROCESSING

## 3.1 Handling Null Values

In [0]:
import pandas as pd
from io import StringIO

csv_data = \
'''
A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
9.0,10.0,11.0,
'''

df = pd.read_csv(StringIO(csv_data))

# preview data

# show null positions

# show nulls by column

**How to fix it?**
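
One possible fix, sketched under the assumption that replacing each missing value with its column mean is acceptable for the data at hand, uses `SimpleImputer`. The array below repeats the values of the CSV above, with the empty cells written as `np.nan`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# same values as the CSV above; empty cells become NaN
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, np.nan, 8.0],
              [9.0, 10.0, 11.0, np.nan]])

# replace each NaN with the mean of the known values in its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Other strategies such as `"median"` or `"most_frequent"` exist; dropping the affected rows is another option, at the cost of losing data.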

In [0]:
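# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer from sklearn.impute replaces it
from sklearn.impute import SimpleImputer

# create SimpleImputer instance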

# get rid of nulls on affected columns

## 3.2 Standardization 

In Standardization we transform our values such that the mean of the values is 0 and the standard deviation is 1.

![answer](https://drive.google.com/uc?export=view&id=1lNdkOnerB69nZ5IpgeKDY8Ci-63-O9Du)

![answer](https://drive.google.com/uc?export=view&id=1uw9iuGjtipnNYZ7Rctha2s39KDnjh2Ys)
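
The transformation itself is z = (x − mean) / standard deviation. A minimal NumPy sketch, using the Salary values from the table in the next cell, looks like this:

```python
import numpy as np

# Salary values from the workshop table
salary = np.array([72000.0, 48000.0, 54000.0, 61000.0, 63000.0])

# subtract the mean and divide by the (population) standard deviation
standardized = (salary - salary.mean()) / salary.std()

print(standardized)
print("mean:", round(float(standardized.mean()), 6), "std:", round(float(standardized.std()), 6))
```

`np.std` uses the population standard deviation by default, which is also what `StandardScaler` uses.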

In [0]:

from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt
import numpy as np

csv_data = \
'''
Country,Age,Salary
Colombia,44.0,72000.0
Spain,27.0,48000.0
Germany,30.0,54000.0
Colombia,38.0,61000.0
Germany,70.0,63000.0
'''

df = pd.read_csv(StringIO(csv_data))

# print data and histogram for Salary column

# standardize Age and Salary columns

# print new data and histogram for Salary column


## 3.3 Handling Categorical Variables

Categorical variables are discrete rather than continuous. They are further divided into two types, which need to be preprocessed differently.


### 3.3.1 ORDINAL

Can be ordered, e.g. the size of a T-shirt: we can say that M < L < XL.
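
Because the order carries meaning, an ordinal variable can be encoded with a hand-written mapping that preserves it. The mapping values below are an assumption; any increasing sequence would keep the order.

```python
# explicit mapping that preserves S < M < L < XL
size_order = {"S": 1, "M": 2, "L": 3, "XL": 4}

sizes = ["M", "L", "S", "M"]
encoded_sizes = [size_order[size] for size in sizes]

print(encoded_sizes)
```

With a pandas DataFrame, the same dictionary can be passed to `Series.map`.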

In [0]:
df_cat = pd.DataFrame(data = 
                     [['green','M'],
                      ['blue','L'],
                      ['green','S'],
                      ['white','M']])
df_cat.columns = ['color','size']

df_cat

In [0]:
# map size column values

### 3.3.2 NOMINAL

Can’t be ordered, e.g. the color of a T-shirt: we can’t say that Blue < Green.

**Label Encoder**

Encode labels with value between 0 and n_classes-1.
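
A short sketch of `LabelEncoder` on made-up color values (not the workshop data) shows how the integer codes are assigned:

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical nominal values
colors = ["red", "black", "red", "yellow"]

encoder = LabelEncoder()
encoded_colors = encoder.fit_transform(colors)

print(list(encoder.classes_))   # classes are stored in sorted order
print(list(encoded_colors))     # each value replaced by its class index
```

The codes follow alphabetical order, so they suggest a ranking (black < red < yellow) that a nominal variable does not really have. That is one reason One-Hot Encoding is often preferred for nominal features.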

In [0]:
from sklearn.preprocessing import LabelEncoder

# encode color columns values


**One-Hot Encoding**

This method creates *n* columns, where *n* is the number of unique values that the nominal variable can take. For each encoded value, exactly one column has value = 1 and all the others have value = 0.
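
To make the idea concrete, here is a plain-Python sketch of what one-hot encoding produces. The color values are hypothetical; scikit-learn's `OneHotEncoder` or `pandas.get_dummies` do this for real datasets.

```python
# hypothetical nominal values
colors = ["red", "black", "red", "yellow"]

# one column per unique value, in sorted order
categories = sorted(set(colors))

# each row has a single 1 in the column of its own category
one_hot = [[1 if color == category else 0 for category in categories]
           for color in colors]

print(categories)
for row in one_hot:
    print(row)
```

Every row sums to 1, and no ordering between the colors is implied.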

In [0]:
from sklearn.preprocessing import OneHotEncoder

df_cat = pd.DataFrame(data = 
                      [['green'],
                       ['blue'],
                       ['green'],
                       ['white']])
df_cat.columns = ['color']

# print data

# reshape and encode color values

# 4. TUNING

## 4.1 LOSS FUNCTION

It’s a method of evaluating how well a specific algorithm models the given data. If the predictions deviate too much from the actual results, the loss function will output a large number.

A basic loss function simply measures the absolute difference between our prediction and the actual value and averages it across the whole dataset.

For a single sample, that difference can be written as abs(y_predicted - y); averaging it over all samples gives the mean absolute error.


![answer](https://drive.google.com/uc?export=view&id=1-nM2-cT6I0KMlzQLT3PSmxBTu97IcQPb)
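
The basic loss described above (the mean absolute error) can be sketched in a few lines. The values are made up for illustration.

```python
def mean_absolute_error(y_true, y_predicted):
    """Average of abs(y_predicted - y) over the whole dataset."""
    errors = [abs(pred - actual) for actual, pred in zip(y_true, y_predicted)]
    return sum(errors) / len(errors)

# hypothetical actual values and predictions
y = [3.0, 5.0, 2.0]
y_predicted = [2.5, 5.0, 4.0]

loss = mean_absolute_error(y, y_predicted)
print(round(loss, 4))
```

The individual errors are 0.5, 0 and 2, so the loss is 2.5 / 3. Worse predictions would push this number up.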



## 4.2 HYPERPARAMETERS

In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. Hyperparameters are not derived via training.

Hyperparameter optimization finds the values that yield an optimal model, that is, one that minimizes a loss function on data held out from training (for example, a validation set).
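
As a rough sketch of hyperparameter optimization, the snippet below runs a small grid search with cross-validation. The iris dataset and the `max_depth` candidates are assumptions chosen to keep the example short; they are not the workshop's census data or grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# small built-in dataset, used here only for illustration
X, y = load_iris(return_X_y=True)

# candidate values for one hyperparameter of the decision tree
param_grid = {"max_depth": [1, 2, 3, 5]}

# try every candidate with 5-fold cross-validation and keep the best one
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

Each candidate is scored on folds that were not used to fit it, which is what keeps the chosen value from simply rewarding memorization of the training data.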



## 4.3 BENCHMARK MODEL

Benchmarking is the process of comparing your result to existing methods. 

You may compare to published results from another paper, for example. Or you might compare to a very simple model (a simple regression, K Nearest Neighbors). 

If the field is well studied, you should probably benchmark against the current published state of the art (and possibly against human performance when relevant).
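
A very simple benchmark can also be a model that ignores the features entirely. The sketch below compares such a baseline with a decision tree; the use of `DummyClassifier` and the iris dataset are assumptions for illustration, not the benchmark used in this workshop.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# baseline: always predict the most frequent class seen in training
baseline_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)

# candidate model to compare against the baseline
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("Baseline accuracy: %0.2f (+/- %0.2f)" % (baseline_scores.mean(), baseline_scores.std() * 2))
print("Tree accuracy:     %0.2f (+/- %0.2f)" % (tree_scores.mean(), tree_scores.std() * 2))
```

A model is only worth keeping if it clearly beats the benchmark it is compared with.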


# 5. REFINING PREVIOUS MODEL

## 5.1 DATA PREPROCESSING

In [0]:
import pandas as pd

data_url = 'https://raw.githubusercontent.com/freny-caicedo-endava/Pio.ML/master/census.csv'

# load to Pandas Dataframe and preview

In [0]:
from sklearn.preprocessing import LabelEncoder

# encode income values

In [0]:
# remove target column

In [0]:
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
from sklearn.preprocessing import StandardScaler


numeric_cols = ['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']

# handle Null values and Standardize data in numerical columns

In [0]:
from sklearn.impute import SimpleImputer

categorical_cols = ['workclass', 'education_level', 'marital_status',
                    'occupation', 'relationship', 'race', 'sex', 
                    'native_country']

# Handle Null values in categorical cols
    

# Map education level values as Categorical Ordinal

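# codes must increase with the education level for the encoding to be ordinal;
# the order below is assumed to match the dataset's education_num column
eduLevel_mapping = {
    ' Preschool': 0,
    ' 1st-4th': 1,
    ' 5th-6th': 2,
    ' 7th-8th': 3,
    ' 9th': 4,
    ' 10th': 5,
    ' 11th': 6,
    ' 12th': 7,
    ' HS-grad': 8,
    ' Some-college': 9,
    ' Assoc-voc': 10,
    ' Assoc-acdm': 11,
    ' Bachelors': 12,
    ' Masters': 13,
    ' Prof-school': 14,
    ' Doctorate': 15
}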

In [0]:

# Apply OneHot encoder to Categorical Nominal

## 5.2 TUNING

In [0]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# get best parameters for Decision Tree algorithm

param_grid = { 
    'max_depth': [15, 32, 64],
    'min_samples_split': [0.000001, 0.00001, 0.0001],
    'min_samples_leaf' : [0.0001, 0.001, 0.01]
}

In [0]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# apply found optimal values


# measure accuracy again
#
# Benchmark model = Decision Tree trained in previous workshop
# Benchmark model Accuracy: 0.82 (+/- 0.00812)