# Task 1: confusion matrix, Precision, Recall, F1-score

In machine learning, and specifically in *statistical classification*, a **confusion matrix** (also known as an **error matrix**) is a table layout that visualizes the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix):

<img src="wikipedia-confusion.png">

Interpretations (taken from [this article](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62)):

**True Positive:**
*Interpretation: You predicted positive and it’s true.*
You predicted that a woman is pregnant and she actually is.

**True Negative:**
*Interpretation: You predicted negative and it’s true.*
You predicted that a man is not pregnant and he actually is not.

**False Positive: (Type 1 Error)**
*Interpretation: You predicted positive and it’s false.*
You predicted that a man is pregnant but he actually is not.

**False Negative: (Type 2 Error)**
*Interpretation: You predicted negative and it’s false.*
You predicted that a woman is not pregnant but she actually is.
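
As a minimal sketch of how these four cells are counted, the snippet below tallies them from two label lists. The helper name `confusion_counts` and the labels are made up for illustration (1 = positive, 0 = negative); in practice `sklearn.metrics.confusion_matrix` does this job.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted positive, and it is true
        elif predicted == 0 and actual == 0:
            tn += 1  # predicted negative, and it is true
        elif predicted == 1 and actual == 0:
            fp += 1  # Type 1 error
        else:
            fn += 1  # Type 2 error
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# hypothetical labels, not taken from the moons data
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)
```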

**The F1 score** is the *harmonic mean of precision and recall*: it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score.

Why is this so important?

High scores for both (precision and recall) show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels. An ideal system with high precision and high recall will return many results, with all results labeled correctly.
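
To make this trade-off concrete, here is a small sketch that computes precision, recall and F1 from confusion-matrix counts. The function name `precision_recall_f1` and the counts for the two systems are hypothetical, chosen only to mimic the two cases described above.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical system A: returns many results, most of them wrong
# (90 of 100 real positives found, but 210 false alarms)
high_recall = precision_recall_f1(tp=90, fp=210, fn=10)

# hypothetical system B: returns very few results, most of them right
# (only 20 of 100 real positives found, with 1 false alarm)
high_precision = precision_recall_f1(tp=20, fp=1, fn=80)

for name, (p, r, f1) in [("high recall", high_recall),
                         ("high precision", high_precision)]:
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In both cases F1 stays well below the larger of the two scores, because the harmonic mean is pulled toward the smaller value.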

# Task 2: Moons dataset


In [1]:
#importing dependencies
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import make_moons
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import scale
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report  
from sklearn.metrics import confusion_matrix  

In [3]:
X, y = make_moons(n_samples = 1000, noise = 0.275)
X = scale(X)
print("X shape: ", X.shape)
print("Labels shape", y.shape)

X shape:  (1000, 2)
Labels shape (1000,)


## 2.1: Descriptive characteristics

**What is descriptive statistics?** Descriptive statistics involves summarizing and organizing the data so they can be easily understood.
For elementary theory and an intuitive interpretation see [here](https://towardsdatascience.com/understanding-descriptive-statistics-c9c2b0641291).

In [4]:
#mean
print("X mean: ", X.mean())
#median
print("X median: ", np.median(X))
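#mode -- X has a bimodal distribution; the features are continuous, so values rarely repeat
#and stats.mode then just returns the smallest value of each column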
print("X and y mode: ", stats.mode(X)[0], stats.mode(y)[0])

X mean:  -1.9539925233402755e-16
X median:  0.011443085578461798
X and y mode:  [[-2.22832973 -2.72338303]] [0]


In [5]:
#let's look at measures of spread / dispersion -- the variability of the data
# 1) standard deviation:
print("Standard deviation of X", np.std(X))
# 2) median absolute deviation (MAD), computed per column
#    (newer SciPy versions provide this function as stats.median_abs_deviation):
print("Median absolute deviation of X", stats.median_absolute_deviation(X))
# 3) variance:
print("Variance of X", (np.std(X)) ** 2)
# 4) quartiles:
print("Q1 quantile of X : ", np.quantile(X, .25)) 
print("Q2 quantile of X : ", np.quantile(X, .50)) 
print("Q3 quantile of X : ", np.quantile(X, .75)) 
# the 0.1 quantile is the 10th percentile
print("10th percentile of X : ", np.quantile(X, .1)) 
# 5) interquartile range (IQR = Q3 - Q1):
print("IQR of X:",  np.quantile(X, .75) - np.quantile(X, .25))
# 6) skewness:
#  6.1) mode-based; stats.mode returns (mode, count), so take element [0]:
print("Pearson First Coefficient of Skewness of X: \n", (np.mean(X) - stats.mode(X, axis = 0)[0])/np.std(X))
#  6.2) median-based:
print("Pearson Second Coefficient of Skewness of X: ", 3 * (np.mean(X) - np.median(X))/np.std(X))
# 7) feature correlation (rowvar = False treats columns as features):
#print("Correlation between features of X: \n", np.corrcoef(X, rowvar = False))

Standard deviation of X 1.0
Median absolute deviation of X [1.08866909 1.17328649]
Variance of X 1.0
Q1 quantile of X :  -0.7548718755258662
Q2 quantile of X :  0.011443085578461798
Q3 quantile of X :  0.7580688035501697
10th percentile of X :  -1.3863686145592549
IQR of X: 1.5129406790760358
Pearson First Coefficient of Skewness of X: 
 [[2.22832973 2.72338303]]
Pearson Second Coefficient of Skewness of X:  -0.034329256735385984
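
To see how the quartiles, the IQR and the median-based skewness coefficient relate to each other, the following sketch recomputes them with the standard library only. The sample is made up for illustration (it is not the moons data), and `method="inclusive"` is chosen on the assumption that it matches the linear interpolation `np.quantile` uses by default.

```python
import statistics

# hypothetical right-skewed sample, not the moons data
sample = [2, 4, 4, 5, 7, 9, 12, 21]

mean = statistics.mean(sample)
median = statistics.median(sample)
std = statistics.pstdev(sample)  # population std, like np.std

# the three cut points that split the data into four equal parts
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

# Pearson's second coefficient: 3 * (mean - median) / std
skew2 = 3 * (mean - median) / std

print("mean:", mean, "median:", median)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Pearson second skewness:", round(skew2, 3))
```

A positive coefficient means the mean lies above the median, which is what a long right tail produces; the near-zero value reported for `X` above points to an almost symmetric distribution.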


## 2.2: Classification task

In [6]:
#having read the task, I understood that we need only train and test sets -- no validation or dev sets!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
print("Train size: ", len(y_train))
print("Test size: ", len(y_test))

Train size:  750
Test size:  250


In [9]:
#running baseline classifiers
log_reg_model = LogisticRegression()
log_reg_model.fit(X_train, y_train)

knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)

decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)

random_forest_model = RandomForestClassifier()
random_forest_model.fit(X_train, y_train)

native_bayes_model = BernoulliNB()
native_bayes_model.fit(X_train, y_train)

svc_model = SVC(kernel = 'rbf')
svc_model.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [10]:
#making predictions
y_log_reg = log_reg_model.predict(X_test)
y_knn = knn_model.predict(X_test)
y_decision_tree = decision_tree_model.predict(X_test)
y_random_forest = random_forest_model.predict(X_test)
y_native_bayes = native_bayes_model.predict(X_test)
y_svc = svc_model.predict(X_test)

print("Log. Regression: \n", classification_report(y_true = y_test, y_pred = y_log_reg))
print("KNN: \n", classification_report(y_true = y_test, y_pred = y_knn))
print("Decision Tree: \n", classification_report(y_true = y_test, y_pred = y_decision_tree))
print("Random Forest: \n", classification_report(y_true = y_test, y_pred = y_random_forest))
print("Naive Bayes: \n", classification_report(y_true = y_test, y_pred = y_native_bayes))
print("SVC: \n", classification_report(y_true = y_test, y_pred = y_svc))

Log. Regression: 
               precision    recall  f1-score   support

           0       0.83      0.83      0.83       115
           1       0.85      0.86      0.86       135

    accuracy                           0.84       250
   macro avg       0.84      0.84      0.84       250
weighted avg       0.84      0.84      0.84       250

KNN: 
               precision    recall  f1-score   support

           0       0.93      0.88      0.90       115
           1       0.90      0.94      0.92       135

    accuracy                           0.91       250
   macro avg       0.91      0.91      0.91       250
weighted avg       0.91      0.91      0.91       250

Decision Tree: 
               precision    recall  f1-score   support

           0       0.91      0.87      0.89       115
           1       0.89      0.93      0.91       135

    accuracy                           0.90       250
   macro avg       0.90      0.90      0.90       250
weighted avg       0.90      0.
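
The `macro avg` and `weighted avg` rows of these reports can be reproduced by hand. The sketch below does so for the precision column of the Log. Regression report above; because the per-class values are copied from the rounded report, the results only agree to two decimals.

```python
# per-class precision and support, copied from the Log. Regression report
precision = {0: 0.83, 1: 0.85}
support = {0: 115, 1: 135}

total = sum(support.values())

# macro avg: plain mean over classes, every class counts the same
macro_avg = sum(precision.values()) / len(precision)

# weighted avg: mean over classes, weighted by the number of true samples
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print(f"macro avg:    {macro_avg:.2f}")
print(f"weighted avg: {weighted_avg:.2f}")
```

The two averages nearly coincide here because the classes are almost balanced (115 vs 135 samples); on a strongly imbalanced test set they can differ noticeably.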

## Model refinements

In [12]:
#looking for the best estimator for Decision tree classifier
from sklearn.model_selection import GridSearchCV
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state = 42), params, verbose = 1)
grid_search_cv.fit(X_train, y_train)
grid_search_cv.best_estimator_

Fitting 3 folds for each of 294 candidates, totalling 882 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 882 out of 882 | elapsed:    1.3s finished


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=7,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=42, splitter='best')
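
The fit count in the log above follows directly from the size of the grid. As a rough illustration (a standard-library sketch of the enumeration, not how `GridSearchCV` is implemented internally), every combination of parameter values is one candidate, and each candidate is fitted once per fold:

```python
from itertools import product

# same grid as in the decision tree search above
params = {"max_leaf_nodes": list(range(2, 100)), "min_samples_split": [2, 3, 4]}
n_folds = 3  # the fold count reported in the log above

# every combination of parameter values is one candidate
candidates = [dict(zip(params, values)) for values in product(*params.values())]

n_candidates = len(candidates)
n_fits = n_candidates * n_folds
print(n_candidates, "candidates,", n_fits, "fits")
```

The same arithmetic explains why grids grow quickly: each extra parameter multiplies the number of candidates by the number of values tried for it.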

In [14]:
#unnecessary tree visualization:
from sklearn.tree import export_graphviz
export_graphviz(grid_search_cv.best_estimator_, 
                out_file = ("moons_tree.dot"),
                feature_names = None, 
                class_names = None,
                filled = True)

In [21]:
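#looking for the best estimator for logistic regression
#TODO: l1_ratio is only used when penalty = 'elasticnet', so searching over it here
#multiplies the number of fits without affecting the result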
params = {'l1_ratio': list(np.linspace(0.01, 1, 100)),
          'C' : list(np.linspace(0.01, 1, 100)), 'penalty': ['l1', 'l2']}
grid_search_cv = GridSearchCV(LogisticRegression(), params, verbose = 1, cv = 3)
grid_search_cv.fit(X_train, y_train)
best_linear_reg = grid_search_cv.best_estimator_  
print(best_linear_reg)

Fitting 3 folds for each of 20000 candidates, totalling 60000 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


LogisticRegression(C=0.02, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=0.01, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)


[Parallel(n_jobs=1)]: Done 60000 out of 60000 | elapsed:  1.4min finished


In [22]:
y_pred = best_linear_reg.predict(X_test)
print("Log. Regression after grid search (L1 penalty): \n", classification_report(y_true = y_test, y_pred = y_pred))

Log. Regression after grid search (L1 penalty): 
               precision    recall  f1-score   support

           0       0.83      0.83      0.83       115
           1       0.85      0.86      0.86       135

    accuracy                           0.84       250
   macro avg       0.84      0.84      0.84       250
weighted avg       0.84      0.84      0.84       250

