
# Assignment 2

Anna Okhapkina

In [0]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss, make_scorer
# from collections import defaultdict

## Defining GridSearchCV + useful functions and variables

In [0]:
elem_in_class = 40 

In [0]:
def accuracy(y_true, y_pred):
  y_pred = list(y_pred)
  n = len(y_true)
  shots = 0
  for i in range(n):
    shots += int(y_pred[i] == y_true[i])
  return shots / n

In [0]:
def crossvalidate_knn(X_train, y_train, X_test, y_test, metric='minkowski'):
  """
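  Runs grid search with cross-validation over n_neighbors for k-NN,
  then refits a model with the best value and scores it on the test set.

  Relies on the global `neigh_lim` (exclusive upper bound for n_neighbors).

  :param X_train: list of lists of integers
  :param y_train: list of integers
  :param X_test: list of lists of integers
  :param y_test: list of integers
  :param metric: distance metric passed to KNeighborsClassifier

  :return: accuracy of the best model on the test set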
  """
  params = {'n_neighbors': list(range(1, neigh_lim))}
  neigh = KNeighborsClassifier(metric=metric)
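  # make_scorer scores predict() output by default, so log_loss sees hard labels here, not probabilities
  gs_knn = GridSearchCV(neigh, params, return_train_score=True, scoring=make_scorer(log_loss, greater_is_better=False))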
  gs_knn.fit(X_train, y_train)
  best_n_neigh = gs_knn.best_params_['n_neighbors']
  print('Best n_neighbors param: {}'.format(best_n_neigh))
  neigh = KNeighborsClassifier(n_neighbors=best_n_neigh, metric=metric)
  neigh.fit(X_train, y_train)
  y_pred = neigh.predict(X_test)
  return accuracy(y_test, y_pred)

In [0]:
def crossvalidate_lr(X_train, y_train, X_test, y_test, random_state=42):
  params = {
            # 'penalty': ['l1', 'l2', 'elasticnet', 'none'], 
            'C': [(int(0.1 * i * 100) / 100) for i in range(1, 11)], # see next two cells
            'fit_intercept': [True, False],
            # 'l1_ratio': [(int(0.1 * i * 100) / 100) for i in range(1, 11)]
            }
  lr = LogisticRegression(solver='liblinear', multi_class='ovr', random_state=random_state) # liblinear because I train the model on a small dataset
  gs_lr = GridSearchCV(lr, params, return_train_score=True, scoring=make_scorer(log_loss, greater_is_better=False))
  gs_lr.fit(X_train, y_train)
  best_params = gs_lr.best_params_
  print('Best param: {}'.format(best_params))
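  # Note: this refit uses the default solver, not the liblinear/ovr setup that was grid-searched above
  lr = LogisticRegression(C=best_params['C'], fit_intercept=best_params['fit_intercept'], random_state=random_state)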
  lr.fit(X_train, y_train)
  y_pred = lr.predict(X_test)
  return accuracy(y_test, y_pred)

In [74]:
[0.1 * i for i in range(1, 11)]

[0.1,
 0.2,
 0.30000000000000004,
 0.4,
 0.5,
 0.6000000000000001,
 0.7000000000000001,
 0.8,
 0.9,
 1.0]

In [75]:
[(int(0.1 * i * 100) / 100) for i in range(1, 11)]

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
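The two cells above show why the `C` grid is built with the `int(...) / 100` trick: multiplying by `0.1` accumulates floating-point error. As a minimal sketch of an alternative (not what the notebook uses), dividing the integer by 10 or rounding to one decimal should give the same clean grid without relying on truncation. The names `c_grid_div` and `c_grid_round` are introduced here only for illustration.

```python
# Alternative ways to build the grid 0.1, 0.2, ..., 1.0 without truncation.
# c_grid_div and c_grid_round are illustrative names, not from the notebook.
c_grid_div = [i / 10 for i in range(1, 11)]              # one correctly rounded division per value
c_grid_round = [round(0.1 * i, 1) for i in range(1, 11)]  # round away the accumulated error

print(c_grid_div)
print(c_grid_round)
```

Both variants avoid `int()`, which truncates toward zero and would silently drop a step if a product ever landed just below a whole number.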

## Task 1. k-NN worse than LR

In [0]:
X1 = [[0,0]]
first_par = 0
sec_par = 0
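for i in range(1, elem_in_class * 2):
  first_par += i  # x0: running sum 1 + 2 + ... + i
  if i % 2 > 0:
    sec_par += 1  # magnitude of x1 grows by one on every odd step
  pos = (-1) ** (i % 2 + 1)  # +1 for odd i, -1 for even i
  X1.append([first_par, sec_par * pos])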

y1 = [1, 0] * elem_in_class
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, random_state=42)
neigh_lim = min(len(X1_train), len(X1_test)) + 1
neigh_lim

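Euclidean distance carries little information about class membership here: consecutive points alternate between the two classes, and x0 grows so fast that same-class points end up no closer than points of the other class. As a result, k-NN with the Euclidean metric performs poorly, and so does the default Minkowski metric (with its default p=2 it is the same distance).

In contrast, LR can put most of its weight on x1 (from the input features [x0, x1]), whose sign separates the two classes (positive for class 0, non-positive for class 1), so it can make the right decision.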
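The reasoning above can be checked directly on the generated data. The following is a rough sketch, not part of the original experiment: it rebuilds the Task 1 points, scores a hand-written rule on the sign of x1, and compares it with a leave-one-out 1-nearest-neighbour check under Euclidean distance. It runs on the full data set rather than on the train/test split, and the helper names (`sign_rule_acc`, `nearest_label`, `nn_acc`) are introduced here for illustration only.

```python
import math

elem_in_class = 40  # same value as in the notebook

# Rebuild the Task 1 data with the same recipe as the cell above.
X1 = [[0, 0]]
first_par, sec_par = 0, 0
for i in range(1, elem_in_class * 2):
    first_par += i
    if i % 2 > 0:
        sec_par += 1
    pos = (-1) ** (i % 2 + 1)
    X1.append([first_par, sec_par * pos])
y1 = [1, 0] * elem_in_class

# Hand-written rule on x1 alone: class 1 if x1 <= 0, class 0 otherwise.
sign_rule_acc = sum(
    int((x1 <= 0) == (label == 1)) for (_, x1), label in zip(X1, y1)
) / len(X1)


def nearest_label(idx):
    """Label of the closest other point under Euclidean distance."""
    candidates = [(math.dist(X1[idx], X1[j]), j) for j in range(len(X1)) if j != idx]
    return y1[min(candidates)[1]]


# Leave-one-out 1-NN: how often does the nearest neighbour share the class?
nn_acc = sum(int(nearest_label(i) == y1[i]) for i in range(len(X1))) / len(X1)

print(sign_rule_acc, nn_acc)
```

A linear rule on x1 is exactly the kind of boundary LR can learn, which is consistent with the scores reported in the cells that follow.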

In [302]:
crossvalidate_knn(X1_train, y1_train, X1_test, y1_test, metric='euclidean')

Best n_neighbors param: 17


0.3

In [303]:
crossvalidate_knn(X1_train, y1_train, X1_test, y1_test)

Best n_neighbors param: 17


0.3

In [304]:
crossvalidate_lr(X1_train, y1_train, X1_test, y1_test)

Best param: {'C': 0.1, 'fit_intercept': True}


0.95

The only k-NN model that LR could not beat was the one with cosine distance:

In [301]:
crossvalidate_knn(X1_train, y1_train, X1_test, y1_test, metric='cosine')

Best n_neighbors param: 1


0.95

## Task 2. LR worse than k-NN

In [331]:
X2 = [[i, i] for i in range(elem_in_class * 2)]
y2 = [0] * (elem_in_class - int(elem_in_class / 2)) + [1] * elem_in_class + [0] * int(elem_in_class / 2)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, random_state=42)
neigh_lim = min(len(X2_train), len(X2_test)) + 1
neigh_lim

21

One of the classes is split into two parts that lie on either side of the other class. All points sit on the line x0 = x1, so there is no single dividing line, and hence no threshold, that lets LR separate the classes.
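To make the "no right threshold" argument concrete, here is a small sketch under one simplifying assumption: because every point has the form [i, i], any linear decision rule reduces to a single threshold on i. The code tries every threshold in both directions on the full Task 2 labels (not on the train/test split) and records the best accuracy. The names `acc` and `best_threshold_acc` are hypothetical helpers introduced for this illustration.

```python
elem_in_class = 40  # same value as in the notebook
half = elem_in_class // 2

# Task 2 labels along the diagonal: a block of 0s, then 1s, then 0s again.
xs = list(range(elem_in_class * 2))
y2 = [0] * (elem_in_class - half) + [1] * elem_in_class + [0] * half


def acc(pred):
    """Fraction of positions where the prediction matches y2."""
    return sum(int(p == t) for p, t in zip(pred, y2)) / len(y2)


# Try every cut point t, predicting class 1 on either side of it.
best_threshold_acc = 0.0
for t in range(len(xs) + 1):
    above = [int(x >= t) for x in xs]  # class 1 for x >= t
    below = [1 - p for p in above]     # class 1 for x < t
    best_threshold_acc = max(best_threshold_acc, acc(above), acc(below))

print(best_threshold_acc)
```

Whatever cut is chosen, one of the two class-0 segments ends up on the wrong side, so a single threshold cannot classify all points correctly.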

In [332]:
crossvalidate_knn(X2_train, y2_train, X2_test, y2_test, metric='cosine')

Best n_neighbors param: 4


0.65

In [333]:
crossvalidate_knn(X2_train, y2_train, X2_test, y2_test, metric='euclidean')

Best n_neighbors param: 2


1.0

In [334]:
crossvalidate_knn(X2_train, y2_train, X2_test, y2_test)

Best n_neighbors param: 2


1.0

In [335]:
crossvalidate_lr(X2_train, y2_train, X2_test, y2_test)

Best param: {'C': 0.1, 'fit_intercept': False}


0.55

## Task 3. Cosine distance k-NN better than Euclidean distance k-NN

In [86]:
ones = [[i, i+1] for i in range(elem_in_class)]
zeros = [[j, j-1] for j in range(elem_in_class)]
X3 = ones + zeros
y3 = [1] * elem_in_class + [0] * elem_in_class
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, random_state=42)
neigh_lim = min(len(X3_train), len(X3_test)) + 1
neigh_lim

21

Each class lies on its own line, one above and one below the bisector x1 = x0, so points of the same class are close in angle, while points of different classes are always separated by the bisector.

So, as long as the train sample contains elements of both classes (which it has to, in practice), the nearest neighbors by cosine distance of every element in the test sample should be elements of its actual class.
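The angular argument can be illustrated with a short NumPy sketch. It is an assumption-laden check rather than the notebook's experiment: it uses the full Task 3 data set instead of the train/test split, computes cosine distance by hand from unit vectors, and asks how often a point's nearest neighbor (excluding itself) has the same label. The variable names are introduced here for illustration.

```python
import numpy as np

elem_in_class = 40  # same value as in the notebook

# Rebuild the Task 3 data: class 1 above the bisector, class 0 below it.
ones = [[i, i + 1] for i in range(elem_in_class)]
zeros = [[j, j - 1] for j in range(elem_in_class)]
X3 = np.array(ones + zeros, dtype=float)
y3 = np.array([1] * elem_in_class + [0] * elem_in_class)

# Cosine distance = 1 - cosine similarity of the normalised vectors.
unit = X3 / np.linalg.norm(X3, axis=1, keepdims=True)
cos_dist = 1.0 - unit @ unit.T
np.fill_diagonal(cos_dist, np.inf)  # a point must not be its own neighbor

# Index of the angularly closest other point for every sample.
nearest = cos_dist.argmin(axis=1)
same_class_fraction = float((y3[nearest] == y3).mean())

print(same_class_fraction)
```

Under Euclidean distance the same points are less clear-cut: for example, [1, 2] is exactly as far from [2, 1] (the other class) as from [2, 3] (its own class), so ties between classes can occur.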

In [87]:
crossvalidate_knn(X3_train, y3_train, X3_test, y3_test, metric='euclidean')

Best n_neighbors param: 1


0.75

In [88]:
crossvalidate_knn(X3_train, y3_train, X3_test, y3_test, metric='cosine')

Best n_neighbors param: 1


1.0