# Chapter 7 - Prepare Your Data for Machine Learning
- Rescale Data
- Standardize Data
- Normalize Data
- Binarize Data


Each algorithm makes assumptions about your data and may require different transformations.

**It is recommended that you try various transforms of your data and exercise a variety of algorithms to narrow in on the transforms that expose the structure of your problem.**

1. Load the dataset
2. Split the dataset into input and output variables for ML
3. Apply a preprocessing transform to the input variables
4. Summarize the data to show the change

Scikit-learn has two standard idioms for transforming data:
- Fit and Multiple Transform (preferred approach)
- Combined Fit-And-Transform

Call the `fit()` function to prepare the parameters of the transform once on your data.
<br>Then call the `transform()` function on the same data to prepare it for modeling, and again on the test or validation dataset.

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
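The fit-then-transform idiom can be sketched without scikit-learn. `TinyMinMaxScaler` below is a hypothetical stand-in written for these notes, not scikit-learn's implementation, and the train and test values are made up. It shows that the parameters are learned once from the training data and then reused on other data.

```python
class TinyMinMaxScaler:
    """Hypothetical stand-in that mimics the fit()/transform() idiom."""

    def fit(self, column):
        # Learn the transform parameters once, from the training data only.
        self.min_ = min(column)
        self.max_ = max(column)
        return self

    def transform(self, column):
        # Reuse the learned parameters on any dataset.
        span = self.max_ - self.min_
        return [(x - self.min_) / span for x in column]


train = [0.0, 5.0, 10.0]  # made-up training values
test = [2.5, 12.0]        # made-up test values

scaler = TinyMinMaxScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.2]
```

The test value 1.2 falls outside the 0 to 1 range because the minimum and maximum came from the training data only. That is the expected result of reusing the fitted parameters.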



### Load Dataset

In [2]:
# Load the Pima Indians diabetes dataset
from pandas import read_csv
from numpy import set_printoptions

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values

In [4]:
dataframe

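| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |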


### Split Dataset

In [5]:
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

In [12]:
array.shape

(768, 9)

In [11]:
X.shape

(768, 8)

In [13]:
Y.shape

(768,)

## Scale Data

Rescales each attribute to the range 0 to 1. This is useful for optimization algorithms at the core of ML, like gradient descent, and for algorithms that weight inputs, like regression and neural networks.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [17]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = min_max_scaler.fit_transform(X)

### Summarize the data to show the change

In [18]:
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:]) # printing just the first 5 rows

[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]]


**This rescaling leaves all the values in the range between 0 and 1.**
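As a rough sketch of the arithmetic behind min-max scaling, the pure-Python function below (a hypothetical helper, not scikit-learn code) rescales one column to a target range. The sample column holds the five `preg` values shown earlier. Its results differ from the notebook output, which takes the minimum and maximum over all 768 rows.

```python
def min_max(column, feature_range=(0.0, 1.0)):
    """Rescale a list of numbers linearly into feature_range."""
    a, b = feature_range
    lo, hi = min(column), max(column)
    return [a + (x - lo) * (b - a) / (hi - lo) for x in column]


preg_sample = [6, 1, 8, 1, 0]  # first five 'preg' values from the dataframe

unit_range = min_max(preg_sample)
symmetric_range = min_max(preg_sample, feature_range=(-1.0, 1.0))

print(unit_range)       # [0.75, 0.125, 1.0, 0.125, 0.0]
print(symmetric_range)  # [0.5, -0.75, 1.0, -0.75, -1.0]
```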

### Standardize Data 

A useful technique for transforming attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

Best when the technique assumes a Gaussian distribution in the input variables, such as linear regression, logistic regression, and linear discriminant analysis.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [24]:
from sklearn.preprocessing import StandardScaler

stdscaler = StandardScaler().fit(X)
rescaledX = stdscaler.transform(X)

In [25]:
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]


**Each column (attribute) now has a mean of 0 and a standard deviation of 1.**
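Standardization works per column: subtract the column mean and divide by the column standard deviation. Here is a minimal sketch on a made-up column, using the population standard deviation (divide by n).

```python
import math

column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up attribute values

mean = sum(column) / len(column)
variance = sum((x - mean) ** 2 for x in column) / len(column)  # population variance
std = math.sqrt(variance)

z_scores = [(x - mean) / std for x in column]

print(mean, std)  # 5.0 2.0
print(z_scores)   # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The standardized column has mean 0 and standard deviation 1. The individual rows have no such property.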

### Normalize Data

Rescales each observation (row) to have a length of 1, called a unit norm (a vector of length 1 in linear algebra).

Good for sparse data (data with lots of zeros) and for algorithms that weight input values, like neural networks, or that use distance measures, like k-Nearest Neighbors.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html

In [26]:
from sklearn.preprocessing import Normalizer

norm_scaler = Normalizer().fit(X)
normalizedX = norm_scaler.transform(X)

In [27]:
# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])

[[0.034 0.828 0.403 0.196 0.    0.188 0.004 0.28 ]
 [0.008 0.716 0.556 0.244 0.    0.224 0.003 0.261]
 [0.04  0.924 0.323 0.    0.    0.118 0.003 0.162]
 [0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
 [0.    0.596 0.174 0.152 0.731 0.188 0.01  0.144]]


**The rows are normalized to length 1.**
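Normalization works on one row at a time, so a single row can be checked by hand. The sketch below divides the first row of the dataset by its Euclidean (L2) length, assuming the L2 norm is the one being used. The result matches the first row of the output above.

```python
import math

# First observation from the dataframe (inputs only, no class label).
row = [6, 148, 72, 35, 0, 33.6, 0.627, 50]

length = math.sqrt(sum(v * v for v in row))  # Euclidean (L2) length of the row
unit_row = [v / length for v in row]

print([round(v, 3) for v in unit_row])
# [0.034, 0.828, 0.403, 0.196, 0.0, 0.188, 0.004, 0.28]

# The rescaled row has length 1 (up to floating-point error).
print(math.sqrt(sum(v * v for v in unit_row)))
```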

### Binarise Data (Make Binary) (aka Thresholding your data)

Uses a binary threshold to transform each value into a 1 or a 0.

Useful when you have probabilities that you want to transform into crisp values.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html

In [28]:
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)

In [29]:
# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5,:])

[[1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1.]]


**All values less than or equal to 0 are marked 0 and all values above 0 are marked 1.**
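The thresholding rule is simple enough to write by hand. This sketch uses a hypothetical `binarize` helper on rows 0 and 4 of the dataset and reproduces the matching rows of the output above.

```python
def binarize(row, threshold=0.0):
    """Map values above the threshold to 1.0 and all others to 0.0."""
    return [1.0 if x > threshold else 0.0 for x in row]


row_0 = [6, 148, 72, 35, 0, 33.6, 0.627, 50]
row_4 = [0, 137, 40, 35, 168, 43.1, 2.288, 33]

binary_0 = binarize(row_0)
binary_4 = binarize(row_4)

print(binary_0)  # [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
print(binary_4)  # [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```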

-------
# Chapter 8 - Feature Selection For Machine Learning

- Univariate Selection
- Recursive Feature Elimination
- Principal Component Analysis
- Feature Importance

### Feature Selection

The process of automatically selecting the features in your data that contribute the most to your prediction variable or output. You don't want to include irrelevant features. Selecting features:
- Reduces overfitting
- Improves Accuracy
- Reduces Training Time

http://scikit-learn.org/stable/modules/feature_selection.html

### Univariate Selection
Many statistical tests can be used with this selection method.

The ANOVA F-value is appropriate for numerical inputs and a categorical output, such as the Pima dataset.

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

In [30]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, Y)

In [31]:
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)

[ 39.67  213.162   3.257   4.304  13.281  71.772  23.871  46.141]


In [32]:
# summarize selected features
features = fit.transform(X)
print(features[0:5,:])

[[  6.  148.   33.6  50. ]
 [  1.   85.   26.6  31. ]
 [  8.  183.   23.3  32. ]
 [  1.   89.   28.1  21. ]
 [  0.  137.   43.1  33. ]]


names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 

We set k=4 in SelectKBest, which selected the attributes at indices 0 (preg), 1 (plas), 5 (mass), and 7 (age).

Each feature was scored and the 4 features with the highest scores were chosen; those are the columns printed from `features`.
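To see how the four columns follow from the scores, this sketch ranks the printed scores by hand and keeps the top `k`. It is only an illustration of the selection rule, not SelectKBest itself.

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
scores = [39.67, 213.162, 3.257, 4.304, 13.281, 71.772, 23.871, 46.141]
k = 4

# Indices ordered from highest to lowest score.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Keep the k best, then restore the original column order.
top_k = sorted(ranked[:k])
selected = [names[i] for i in top_k]

print(top_k)     # [0, 1, 5, 7]
print(selected)  # ['preg', 'plas', 'mass', 'age']
```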

### Recursive Feature Elimination (RFE)

Recursively removes attributes and builds a model on the attributes that remain.

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html

In [34]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# feature extraction
model = LogisticRegression(solver='liblinear')
rfe = RFE(estimator=model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]


names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

The top 3 features are the ones marked `True` in the support mask and ranked 1 in the feature ranking: preg, mass, and pedi.

### Principal Component Analysis (PCA)

Uses linear algebra to transform the dataset into a compressed form. 

You can choose the number of dimensions or principal components in the transformation.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [35]:
from sklearn.decomposition import PCA

# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)

In [36]:
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [ 2.265e-02  9.722e-01  1.419e-01 -5.786e-02 -9.463e-02  4.697e-02
   8.168e-04  1.402e-01]
 [ 2.246e-02 -1.434e-01  9.225e-01  3.070e-01 -2.098e-02  1.324e-01
   6.400e-04  1.255e-01]]


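The results here are unrecognisable because PCA does not select existing attributes: each component is a new feature built as a linear combination of all eight original attributes.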
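One way to read the explained variance ratios is to accumulate them. This small sketch uses the three printed values, which are rounded, so the total is approximate.

```python
from itertools import accumulate

explained_variance_ratio = [0.889, 0.062, 0.026]  # values printed above

cumulative = list(accumulate(explained_variance_ratio))

for n, total in enumerate(cumulative, start=1):
    print("first %d component(s): %.1f%% of the variance" % (n, total * 100.0))
```

On these numbers the first component alone carries about 88.9% of the variance and all three together about 97.7%.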

### Feature Importance
Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

We construct an `ExtraTreesClassifier`.

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

In [39]:
from sklearn.ensemble import ExtraTreesClassifier

# feature extraction
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X, Y)
print(model.feature_importances_)

[0.111 0.227 0.098 0.078 0.075 0.148 0.118 0.145]


The larger the value, the more important the attribute: plas, mass, and age score highest.

I can see that the idea is to mix and match the preprocessing from Ch 7 and the feature selection from Ch 8.

- **Univariate Selection:** preg, plas, mass, and age
- **Recursive Feature Elimination:** preg, mass, and pedi
- **Principal Component Analysis:** unrecognisable
- **Feature Importance:** plas, age, and mass
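A quick way to compare the methods is to count how often each attribute was picked. This tally is my own illustration and not something from the chapter; PCA is left out because it does not pick original attributes.

```python
from collections import Counter

picks = {
    'Univariate Selection': ['preg', 'plas', 'mass', 'age'],
    'Recursive Feature Elimination': ['preg', 'mass', 'pedi'],
    'Feature Importance': ['plas', 'age', 'mass'],
}

votes = Counter(name for chosen in picks.values() for name in chosen)

for name, count in votes.most_common():
    print(name, count)
```

Only mass is chosen by all three methods. preg, plas, and age are each chosen twice and pedi once.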

# Chapter 9 - Evaluate the Performance of Machine Learning Algorithms with Resampling

This evaluates how your algorithms perform on unseen data.

1) Make a prediction for data you already have the answer for
#### OR
2) Use clever techniques from statistics called resampling

If an algorithm remembers every observation it sees it will overfit and not perform well on new data. 

So we need to split the data intelligently to keep the model from doing this. We need to estimate performance on new data, then re-train the final algorithm.

- Train and Test Sets
- k-fold Cross-Validation
- Leave One Out Cross-Validation
- Repeated Random Test-Train Splits

### Train Test Splits
- the simplest method: split the dataset into 2 parts
- train the algorithm on the first part (commonly 67%)
- make predictions on the second part (commonly 33%)
- this is fast and ideal for very large datasets
- the downside is high variance: differences between the training and test sets can produce meaningful differences in the estimated accuracy
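The mechanics of a 67/33 split can be sketched with the standard library. This assumes the test count is rounded up, which agrees with the 254 test rows reported in the classification report later in these notes. The shuffle uses Python's `random` module, so it will not reproduce the exact partition that scikit-learn makes.

```python
import math
import random

n_samples = 768
test_size = 0.33
seed = 7

n_test = math.ceil(test_size * n_samples)  # assumption: round the test count up
n_train = n_samples - n_test

indices = list(range(n_samples))
random.Random(seed).shuffle(indices)  # fixed seed gives a reproducible shuffle

test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(n_train, n_test)  # 514 254
```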



In [42]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

test_size = 0.33
seed = 7
# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=test_size,random_state=seed)

In [43]:
# fit the model
model = LogisticRegression(solver='liblinear') 
model.fit(X_train, Y_train)

In [44]:
# Summarize the sampling
result = model.score(X_test, Y_test) 
print("Accuracy: %.3f%%" % (result*100.0))

Accuracy: 75.591%


We have fixed the random_state seed so that we always have the same reproducible division of data.

We can make apples to apples comparisons and later we can compare the same model under different configurations

### K-fold Cross-Validation
- Splits the dataset into k parts, where each part is called a fold.
- The algorithm is trained on k-1 folds with one held back for testing.
- This is repeated so that each fold is given a chance to be held back for testing.
- You end up with k scores that you can summarise with a mean and standard deviation.
- Best for modest-sized datasets with tens of thousands of records
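The fold bookkeeping described above can be sketched in a few lines. `k_fold_indices` is a hypothetical helper that does no shuffling. It assumes that, when the rows do not divide evenly, the first folds get one extra row each.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + 1 if fold < extra else base  # spread the remainder
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(768, 10))
fold_sizes = [len(test_idx) for _, test_idx in folds]

print(fold_sizes)  # [77, 77, 77, 77, 77, 77, 77, 77, 76, 76]
```

Every row lands in exactly one test fold, so each observation is used for testing once and for training k-1 times.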

In [48]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

kfold = KFold(n_splits=10, random_state=7, shuffle=True)

In [49]:
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 77.086% (5.091%)


This reports the mean and (standard deviation)

### Leave One Out Cross-Validation
- computationally more expensive than the k-fold
- configure the cross-validation so that the size of the fold is 1, this means k is set to the number of observations

In [50]:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

loocv = LeaveOneOut()
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.823% (42.196%)


This score has more variance than k-fold: each fold holds a single observation, so every fold score is either 0 or 1.
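The large standard deviation follows directly from the fold scores being 0 or 1. For such scores the population standard deviation is sqrt(p * (1 - p)), where p is the mean accuracy. The count of 590 correct predictions below is inferred from the printed 76.823% on 768 rows; it is not printed by the notebook.

```python
import math

n_observations = 768
n_correct = 590  # assumption: inferred from the printed accuracy of 76.823%

p = n_correct / n_observations  # mean of the 0/1 fold scores
std = math.sqrt(p * (1 - p))    # population std of 0/1 scores

print("Accuracy: %.3f%% (%.3f%%)" % (p * 100.0, std * 100.0))
# Accuracy: 76.823% (42.196%)
```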

### Repeated Random Test-Train Splits
- Another variation on k-fold cross-validation
- repeats the random train/test split multiple times
- This has the speed of a train/test split while reducing the variance in the estimated performance, as k-fold cross-validation does
- This can introduce redundancy because much of the same data may appear in the train or test split from run to run


In [51]:
from sklearn.model_selection import ShuffleSplit

n_splits = 10
test_size = 0.33
seed = 7
kfold = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))


Accuracy: 76.496% (1.698%)


- K-fold cross-validation is the gold standard, with k set to 3, 5, or 10
- A train/test split is good for speed when using a slow algorithm, and produces performance estimates with lower bias when using large datasets
- Leave-one-out cross-validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, training speed, and dataset size

-----
# Chapter 10 - Machine Learning Algorithm Performance Metrics
- Descriptive Statistics
- Visualisation
- Preprocessing
- Feature Selection
- Splits and Folds
- **Performance Metrics**

Metrics influence how you weight the importance of different characteristics in the results and your ultimate choice of which algorithm to choose.




- for classification metrics, we use a binary classification problem where all input variables are numeric (the Pima Indians diabetes dataset)
- for regression metrics, we use a regression problem where all input variables are numeric (the housing dataset)

The examples use logistic regression for classification and linear regression for regression.

http://scikit-learn.org/stable/modules/model_evaluation.html

- Classification Accuracy
- Logistic Loss
- Area Under ROC Curve
- Confusion Matrix
- Classification Report

### Classification Accuracy
- the number of correct predictions made as a ratio of all predictions
- the most common evaluation metric for classification problems, and also the most misused
- only really suitable when there is an equal number of observations in each class and all predictions and prediction errors are equally important
- this is almost never the case


In [52]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
scoring = 'accuracy'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring) 
print("Accuracy: %.3f (%.3f)" % (results.mean(), results.std()))

Accuracy: 0.771 (0.051)


This is an accuracy of about 77%.

### Logistic Loss

Logistic loss (logloss) evaluates predicted probabilities of membership in a given class. Each probability is a value between 0 and 1 that acts as a confidence value, and predictions are rewarded or punished in proportion to the confidence of the prediction.

In [55]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
scoring = 'neg_log_loss'
results = cross_val_score(model, X,Y, cv=kfold, scoring=scoring)
print("Logloss: %.3f (%.3f)" % (results.mean(), results.std()))

Logloss: -0.494 (0.042)


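Smaller logloss is better, with 0 being perfect. scikit-learn reports the negated value (`neg_log_loss`) so that larger scores are always better, so the printed -0.494 corresponds to a logloss of 0.494. For reference, always predicting a probability of 0.5 on a binary problem gives a logloss of ln 2 ≈ 0.693, so this model does better than uninformed guessing.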
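To get a feel for the scale of logloss, here is a hand-written version of the binary formula applied to made-up labels and probabilities. The `log_loss` function below is a sketch for these notes, not the scikit-learn function of the same name.

```python
import math


def log_loss(y_true, p_pred):
    """Mean binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]  # made-up labels

confident_right = log_loss(y_true, [0.9, 0.1, 0.9, 0.1])
uninformed = log_loss(y_true, [0.5, 0.5, 0.5, 0.5])
confident_wrong = log_loss(y_true, [0.1, 0.9, 0.1, 0.9])

print("%.3f %.3f %.3f" % (confident_right, uninformed, confident_wrong))
# 0.105 0.693 2.303
```

Confident correct predictions are rewarded with a loss near 0, and confident wrong predictions are punished far more than hedged ones.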

### Area Under ROC Curve (ROC AUC)
- for binary classification problems
- a 1.0 represents perfect predictions
- a 0.5 is only as good as random


In [59]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
scoring = 'roc_auc'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

AUC: 0.826 (0.050)


This suggests some skill in prediction, since the AUC is relatively close to 1 and well above 0.5.
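AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it by comparing every positive/negative pair on made-up scores. This is fine for illustration but slow on large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])

print(auc)      # 0.75
print(perfect)  # 1.0
```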


### Confusion Matrix 
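- good for problems with 2 or more classes

- each row is an actual class and each column is a predicted class (scikit-learn's convention)
- [[TrueNeg, FalsePos] for actual class 0
- [FalseNeg, TruePos] for actual class 1]
- here class 1 is treated as the positive class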

In [60]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [61]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)


In [64]:
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)

In [65]:
predicted = model.predict(X_test)
matrix = confusion_matrix(Y_test, predicted)

In [66]:
print(matrix)

[[141  21]
 [ 41  51]]


The majority of predictions fall on the diagonal of the matrix, which holds the correct predictions.
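Accuracy can be recovered from the matrix by dividing the diagonal by the total. Using the printed values, this reproduces the 75.591% from the train/test split in Chapter 9, which used the same split and model.

```python
matrix = [[141, 21],
          [41, 51]]  # values printed above

correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
total = sum(sum(row) for row in matrix)
accuracy = correct / total

print(correct, total)                           # 192 254
print("Accuracy: %.3f%%" % (accuracy * 100.0))  # Accuracy: 75.591%
```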

### Classification Report
Displays a quick report of different scores for each class
- precision
- recall
- F1-score
- support

In [67]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [68]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.33, random_state=7)

In [69]:
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
report = classification_report(Y_test, predicted)
print(report)

              precision    recall  f1-score   support

         0.0       0.77      0.87      0.82       162
         1.0       0.71      0.55      0.62        92

    accuracy                           0.76       254
   macro avg       0.74      0.71      0.72       254
weighted avg       0.75      0.76      0.75       254
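The per-class rows of the report can be rebuilt from the confusion matrix printed in the previous section. This sketch assumes rows are actual classes and columns are predicted classes; `per_class_metrics` is a hypothetical helper.

```python
matrix = [[141, 21],
          [41, 51]]  # rows = actual class, columns = predicted class


def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1, support) for one class."""
    true_pos = matrix[cls][cls]
    support = sum(matrix[cls])                   # actual members of the class
    predicted = sum(row[cls] for row in matrix)  # times the class was predicted
    precision = true_pos / predicted
    recall = true_pos / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


report_rows = {cls: per_class_metrics(matrix, cls) for cls in (0, 1)}

for cls, (precision, recall, f1, support) in report_rows.items():
    print("%d  %.2f  %.2f  %.2f  %d" % (cls, precision, recall, f1, support))
# 0  0.77  0.87  0.82  162
# 1  0.71  0.55  0.62  92
```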



## Regression Metrics
- Mean Absolute Error
- Mean Squared Error
- R2

### Mean Absolute Error (MAE)
- the average of the absolute differences between predictions and actual values, which gives an idea of how wrong the predictions were
- gives the magnitude of the error but not its direction (over or under)


In [82]:


filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO',
'B', 'LSTAT', 'MEDV']
dataframe = read_csv(filename, sep=r'\s+', names=names)  # raw string avoids an invalid-escape warning

array  = dataframe.values
X=array[:,0:13]
Y=array[:,13]

In [87]:
from sklearn.linear_model import LinearRegression
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearRegression()

In [89]:
scoring = 'neg_mean_absolute_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (results.mean(), results.std()))

MAE: -3.387 (0.667)


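A value of 0 indicates no error. scikit-learn negates the score (`neg_mean_absolute_error`) so that larger is always better, so the printed -3.387 corresponds to an MAE of 3.387.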
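MAE is easy to compute by hand. The sketch below uses four made-up target values and predictions (not the housing data) and also shows the negated form that the scoring string reports.

```python
y_true = [3.0, -0.5, 2.0, 7.0]  # made-up actual values
y_pred = [2.5, 0.0, 2.0, 8.0]   # made-up predictions

absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae = sum(absolute_errors) / len(absolute_errors)
neg_mae = -mae  # the sign convention used by 'neg_mean_absolute_error'

print(absolute_errors)  # [0.5, 0.5, 0.0, 1.0]
print(mae, neg_mae)     # 0.5 -0.5
```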

### Mean Squared Error (MSE) 
- provides a gross idea of the magnitude of error
- taking the square root of the MSE converts the units back to the original units of the output variable; this is the RMSE (Root Mean Squared Error), which can be helpful for description and presentation



In [90]:
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring) 
print("MSE: %.3f (%.3f)" % (results.mean(), results.std()))

MSE: -23.747 (11.143)


Because scikit-learn reports the negated MSE, remember to take the absolute value before calculating the RMSE.
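Applied to the mean score printed above, the conversion looks like this. The RMSE is in the same units as the output variable.

```python
import math

neg_mse = -23.747  # mean 'neg_mean_squared_error' score printed above

mse = abs(neg_mse)  # undo the sign flip
rmse = math.sqrt(mse)

print("RMSE: %.3f" % rmse)  # RMSE: 4.873
```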

### R2 
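- provides an indication of the "goodness" of fit
- also called the coefficient of determination
- values typically range between 0 for no fit and 1 for a perfect fit; a model that fits worse than always predicting the mean gets a negative value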

In [91]:
scoring = 'r2'
results =cross_val_score(model,X,Y, cv=kfold, scoring=scoring)
print("R^2: %.3f (%.3f)" % (results.mean(), results.std()))


R^2: 0.718 (0.099)


The mean R^2 of 0.718 is closer to 1 than to 0, so the fit is reasonably good.
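R^2 compares the model's squared error with the squared error of always predicting the mean. Here is a hand-written sketch on made-up values (not the housing data); `r_squared` is a hypothetical helper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]  # made-up actual values

good = r_squared(y_true, [2.5, 0.0, 2.0, 8.0])
mean_only = r_squared(y_true, [2.875] * 4)  # always predict the mean
worse = r_squared(y_true, [7.0, 2.0, -0.5, 3.0])

print("%.3f %.3f %.3f" % (good, mean_only, worse))
```

Predicting the mean scores exactly 0, a close fit scores near 1, and the badly shuffled predictions score below 0.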