<a href="https://colab.research.google.com/github/adellabr/Classification/blob/main/ML4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Download data from Don’tGetKicked competition.

In [None]:
import pandas as pd
import numpy as np
from google.colab import drive
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KDTree
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import auc

In [None]:
drive.mount('/content/drive', force_remount=True)
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/ML4_5/data/training.csv')

Mounted at /content/drive


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72983 entries, 0 to 72982
Data columns (total 34 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   RefId                              72983 non-null  int64  
 1   IsBadBuy                           72983 non-null  int64  
 2   PurchDate                          72983 non-null  object 
 3   Auction                            72983 non-null  object 
 4   VehYear                            72983 non-null  int64  
 5   VehicleAge                         72983 non-null  int64  
 6   Make                               72983 non-null  object 
 7   Model                              72983 non-null  object 
 8   Trim                               70623 non-null  object 
 9   SubModel                           72975 non-null  object 
 10  Color                              72975 non-null  object 
 11  Transmission                       72974 non-null  obj

In [None]:
df.describe(include='all')

Unnamed: 0,RefId,IsBadBuy,PurchDate,Auction,VehYear,VehicleAge,Make,Model,Trim,SubModel,...,MMRCurrentRetailAveragePrice,MMRCurrentRetailCleanPrice,PRIMEUNIT,AUCGUART,BYRNO,VNZIP1,VNST,VehBCost,IsOnlineSale,WarrantyCost
count,72983.0,72983.0,72983,72983,72983.0,72983.0,72983,72983,70623,72975,...,72668.0,72668.0,3419,3419,72983.0,72983.0,72983,72983.0,72983.0,72983.0
unique,,,517,3,,,33,1063,134,863,...,,,2,2,,,37,,,
top,,,11/23/2010,MANHEIM,,,CHEVROLET,PT CRUISER,Bas,4D SEDAN,...,,,NO,GREEN,,,TX,,,
freq,,,384,41043,,,17248,2329,13950,15236,...,,,3357,3340,,,13596,,,
mean,36511.428497,0.122988,,,2005.343052,4.176644,,,,,...,8775.723331,10145.385314,,,26345.842155,58043.059945,,6730.934326,0.02528,1276.580985
std,21077.241302,0.328425,,,1.731252,1.71221,,,,,...,3090.702941,3310.254351,,,25717.351219,26151.640415,,1767.846435,0.156975,598.846788
min,1.0,0.0,,,2001.0,0.0,,,,,...,0.0,0.0,,,835.0,2764.0,,1.0,0.0,462.0
25%,18257.5,0.0,,,2004.0,3.0,,,,,...,6536.0,7784.0,,,17212.0,32124.0,,5435.0,0.0,837.0
50%,36514.0,0.0,,,2005.0,4.0,,,,,...,8729.0,10103.0,,,19662.0,73108.0,,6700.0,0.0,1155.0
75%,54764.5,0.0,,,2007.0,5.0,,,,,...,10911.0,12309.0,,,22808.0,80022.0,,7900.0,0.0,1623.0


PurchDate: Raw dates are not suitable for analysis unless recurring indicators (such as month, day, day of the week) are extracted from them or the intervals between dates are calculated.

VehYear: "VehicleAge" is already present in the data and is a better alternative.

Model, Trim and SubModel: These features have many classes (1063, 134 and 863 unique values respectively), and additional domain expertise is required to merge the low-frequency classes.

WheelTypeID: "WheelType" is already present in the data and is a better alternative.

BYRNO: It's just an ID.

VNZIP1 and VNST: Location fields often do not directly contribute to predictive power unless specific location-based insights are needed.

PRIMEUNIT and AUCGUART: More than 95% of the values are missing (only 3419 of 72983 rows are non-null).
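The PurchDate note above mentions recurring indicators and intervals. A minimal sketch of how they could be extracted, assuming the month/day/year format seen in the `describe` output. The toy frame and the column names `PurchMonth`, `PurchDayOfWeek` and `DaysSinceFirst` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the competition data (hypothetical rows).
toy = pd.DataFrame({'PurchDate': ['12/7/2009', '11/23/2010', '1/5/2010']})
purch = pd.to_datetime(toy['PurchDate'], format='%m/%d/%Y')

date_feats = pd.DataFrame({
    'PurchMonth': purch.dt.month,                     # 1..12
    'PurchDayOfWeek': purch.dt.dayofweek,             # Monday=0 .. Sunday=6
    'DaysSinceFirst': (purch - purch.min()).dt.days,  # interval from the earliest purchase
})
print(date_feats)
```

Features like these would have to be computed before `PurchDate` is dropped from the frame.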


In [None]:
processed_df = df.drop(columns=['PRIMEUNIT', 'AUCGUART', 'Model', 'Trim', 'SubModel', 'VNST', 'VNZIP1', 'VehYear', 'BYRNO', 'WheelTypeID'])

numerical_columns = processed_df.select_dtypes(include=['number']).columns.tolist()
categorial_columns = processed_df.select_dtypes(include='object').columns.to_list()

for col in numerical_columns:
  median_value = processed_df[col].median()
  processed_df[col] = processed_df[col].fillna(median_value)

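for col in categorial_columns:
  # mode() returns a Series; take its first element so fillna receives a scalar
  # (filling with the Series itself only aligns on index 0 and leaves the NaNs in place).
  mode_value = processed_df[col].mode()[0]
  processed_df[col] = processed_df[col].fillna(mode_value)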

In [None]:
processed_df[categorial_columns].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72983 entries, 0 to 72982
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   PurchDate             72983 non-null  object
 1   Auction               72983 non-null  object
 2   Make                  72983 non-null  object
 3   Color                 72975 non-null  object
 4   Transmission          72974 non-null  object
 5   WheelType             69809 non-null  object
 6   Nationality           72978 non-null  object
 7   Size                  72978 non-null  object
 8   TopThreeAmericanName  72978 non-null  object
dtypes: object(9)
memory usage: 5.0+ MB


2. Design train/validation/test split. Use “PurchDate” field for splitting, test must be later in time than validation, the same goes for validation and train: train.PurchDate < valid.PurchDate < test.PurchDate. Use the first 33% of dates for train, last 33% of dates for test, and middle 33% for validation set. Don’t use the test dataset until the end!

In [None]:
processed_df['PurchDate'] = pd.to_datetime(df['PurchDate'])
processed_df = processed_df.sort_values(by='PurchDate')
n_part = processed_df.shape[0] // 3

processed_df = processed_df.drop(columns='PurchDate')
categorial_columns.remove('PurchDate')

train = processed_df.iloc[: n_part].copy()
valid = processed_df.iloc[n_part : 2 * n_part].copy()
test = processed_df.iloc[2 * n_part :].copy()
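The split above cuts by row position, so rows with the same purchase date can end up on both sides of a boundary. A possible alternative, sketched here on toy data, cuts on the list of distinct dates instead. `split_by_date` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def split_by_date(frame, date_col):
    """Split into three chronological parts that never share a date."""
    dates = frame[date_col].drop_duplicates().sort_values().reset_index(drop=True)
    # Cut points are taken on the distinct dates, not on row positions.
    first_cut = dates.iloc[len(dates) // 3]
    second_cut = dates.iloc[2 * len(dates) // 3]
    train_part = frame[frame[date_col] < first_cut]
    valid_part = frame[(frame[date_col] >= first_cut) & (frame[date_col] < second_cut)]
    test_part = frame[frame[date_col] >= second_cut]
    return train_part, valid_part, test_part

toy = pd.DataFrame({
    'PurchDate': pd.to_datetime(['2010-01-01', '2010-01-01', '2010-01-02',
                                 '2010-01-03', '2010-01-03', '2010-01-04']),
    'x': range(6),
})
tr, va, te = split_by_date(toy, 'PurchDate')
print(len(tr), len(va), len(te))
```

With this variant the three parts generally differ in row count, because the number of purchases per date is not constant.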

3. Use LabelEncoder or OneHotEncoder from sklearn to preprocess categorical variables. Be careful with data leakage (fit Encoder on train and apply on validation & test). Consider another encoding approach if you meet new categorical values in valid & test (unseen in the training dataset), for example: https://contrib.scikit-learn.org/category_encoders/count.html

In [None]:
encoder_onehot = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
onehot_columns = encoder_onehot.fit_transform(train[categorial_columns])

train_encoded = pd.DataFrame(onehot_columns, index=train.index, columns=encoder_onehot.get_feature_names_out())

onehot_columns = encoder_onehot.transform(valid[categorial_columns])
valid_encoded = pd.DataFrame(onehot_columns, index=valid.index, columns=encoder_onehot.get_feature_names_out())

onehot_columns = encoder_onehot.transform(test[categorial_columns])
test_encoded = pd.DataFrame(onehot_columns, index=test.index, columns=encoder_onehot.get_feature_names_out())
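With `handle_unknown='ignore'`, a category unseen in train is encoded as an all-zero row. The count encoding mentioned in the task can be sketched with plain pandas. The toy values below are made up, and mapping unseen categories to a count of 0 is an assumption, not the only reasonable choice.

```python
import pandas as pd

train_col = pd.Series(['FORD', 'FORD', 'DODGE', 'KIA'])
valid_col = pd.Series(['FORD', 'KIA', 'TESLA'])  # 'TESLA' never appears in train

# Fit: category frequencies are learned from train only (no leakage).
counts = train_col.value_counts()

# Transform: map every split through the train counts; unseen categories get 0.
train_enc = train_col.map(counts)
valid_enc = valid_col.map(counts).fillna(0).astype(int)
print(valid_enc.tolist())
```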

4. Train LogisticRegression, GaussianNB, KNN from sklearn on the training dataset and check the quality of your algorithms on the validation dataset. The dependent variable (IsBadBuy) is binary. Don’t forget to normalize your datasets before training models.

In [None]:
y_train = train['IsBadBuy'].copy()
y_valid = valid['IsBadBuy'].copy()
y_test = test['IsBadBuy'].copy()

train.drop(columns='IsBadBuy', inplace=True)
valid.drop(columns='IsBadBuy', inplace=True)
test.drop(columns='IsBadBuy', inplace=True)

numerical_columns.remove('IsBadBuy')

In [None]:
MinMax = MinMaxScaler()
train_norm = pd.DataFrame(MinMax.fit_transform(train[numerical_columns]), index=train.index, columns=numerical_columns)
valid_norm = pd.DataFrame(MinMax.transform(valid[numerical_columns]), index=valid.index, columns=numerical_columns)
test_norm = pd.DataFrame(MinMax.transform(test[numerical_columns]), index=test.index, columns=numerical_columns)

X_train = pd.concat([train_norm, train_encoded], axis=1)
X_valid = pd.concat([valid_norm, valid_encoded], axis=1)
X_test = pd.concat([test_norm, test_encoded], axis=1)

*LogisticRegression*

In [None]:
logReg = LogisticRegression(random_state=42)
logReg.fit(X_train, y_train)

y_pred_train_lg_pr = logReg.predict_proba(X_train)
y_pred_valid_lg_pr = logReg.predict_proba(X_valid)
y_pred_test_lg_pr = logReg.predict_proba(X_test)

giniTrainLG = 2 * roc_auc_score(y_train, y_pred_train_lg_pr[:, 1]) - 1
giniValidLG = 2 * roc_auc_score(y_valid, y_pred_valid_lg_pr[:, 1]) - 1
giniTestLG = 2 * roc_auc_score(y_test, y_pred_test_lg_pr[:, 1]) - 1

In [None]:
print('Gini score Logistic Regression:')
print('Train:', giniTrainLG)
print('Valid:', giniValidLG)
print('Test:', giniTestLG)

Gini score Logistic Regression:
Train: 0.5053060839414765
Valid: 0.4786878179261531
Test: 0.4895512041852712


*GaussianNB*

In [None]:
gaussNB = GaussianNB()
gaussNB.fit(X_train, y_train)

y_pred_train_gnb_pr = gaussNB.predict_proba(X_train)
y_pred_valid_gnb_pr = gaussNB.predict_proba(X_valid)
y_pred_test_gnb_pr = gaussNB.predict_proba(X_test)

giniTrainGNB = 2 * roc_auc_score(y_train, y_pred_train_gnb_pr[:, 1]) - 1
giniValidGNB = 2 * roc_auc_score(y_valid, y_pred_valid_gnb_pr[:, 1]) - 1
giniTestGNB = 2 * roc_auc_score(y_test, y_pred_test_gnb_pr[:, 1]) - 1

In [None]:
print('Gini score GaussianNB:')
print('Train:', giniTrainGNB)
print('Valid:', giniValidGNB)
print('Test:', giniTestGNB)

Gini score GaussianNB:
Train: 0.4339873399690395
Valid: 0.3787812352653981
Test: 0.38608982396736846


*KNN*

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

y_pred_train_knn_pr = knn.predict_proba(X_train)
y_pred_valid_knn_pr = knn.predict_proba(X_valid)
y_pred_test_knn_pr = knn.predict_proba(X_test)

giniTrainKNN = 2 * roc_auc_score(y_train, y_pred_train_knn_pr[:, 1]) - 1
giniValidKNN = 2 * roc_auc_score(y_valid, y_pred_valid_knn_pr[:, 1]) - 1
giniTestKNN = 2 * roc_auc_score(y_test, y_pred_test_knn_pr[:, 1]) - 1

In [None]:
print('Gini score KNN:')
print('Train:', giniTrainKNN)
print('Valid:', giniValidKNN)
print('Test:', giniTestKNN)

Gini score KNN:
Train: 0.8141329977749521
Valid: 0.2731743460383025
Test: 0.2622217362927062


LogisticRegression gives the best results. GaussianNB assumes normally distributed data and independent features. KNN is prone to overfitting: its train Gini (0.81) is far above its validation Gini (0.27).

5. Implement Gini score calculation. You can use 2*ROC AUC - 1 approach, so you need to implement ROC AUC calculation. Check if your metric approximately equals abs(2\*sklearn.metrics.roc_auc_score - 1).

In [None]:
def Gini_score(y_true, y_score):
  df_score_class = pd.DataFrame({'y_score': y_score, 'y_true': y_true})
  df_score_class = df_score_class.sort_values(by='y_score', ascending=False)

  n_pos = np.sum(y_true)
  n_neg = len(y_true) - n_pos

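  if n_pos:
    tpr = np.cumsum(df_score_class['y_true']) / n_pos
  else:
    # No positives: keep an array so np.concatenate below still works.
    tpr = np.zeros(len(df_score_class))

  if n_neg:
    fpr = np.cumsum(1 - df_score_class['y_true']) / n_neg
  else:
    # No negatives: same fallback for the false positive rate.
    fpr = np.zeros(len(df_score_class))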

  tpr = np.concatenate(([0], tpr, [1]))
  fpr = np.concatenate(([0], fpr, [1]))

  auc_roc = np.trapz(tpr, fpr)

  return 2 * auc_roc - 1

In [None]:
gini = Gini_score(y_train, y_pred_train_lg_pr[:, 1])
print('sklearn version:', 2 * roc_auc_score(y_train, y_pred_train_lg_pr[:, 1]) - 1)
print('my version of gini', gini)

sklearn version: 0.5053060839414765
my version of gini 0.5053060839414765
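The cumulative-sum approach above does not treat tied scores specially. One alternative, shown as a sketch and not as the notebook's method, is the rank-sum (Mann-Whitney) form of ROC AUC, where tied scores receive their average rank. `gini_by_ranks` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def gini_by_ranks(y_true, y_score):
    """Gini = 2 * AUC - 1, with AUC computed from the rank sum of the positives."""
    y_true = np.asarray(y_true)
    # Average ranks (1-based) so that tied scores contribute half a concordant pair.
    ranks = pd.Series(np.asarray(y_score)).rank(method='average').to_numpy()
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError('Both classes must be present to compute ROC AUC.')
    auc_roc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2 * auc_roc - 1

print(gini_by_ranks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(gini_by_ranks([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))
```

On data without tied scores both formulations should agree.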


6. Implement your own versions of LogisticRegression, KNN and NaiveBayes classifiers. For LogisticRegression compute gradients with respect to the loss and use stochastic gradient descent. Can you reproduce the results from step 4?


 - Guidance for this task: Your model must be represented by a class with methods fit, predict (predict_proba with 0.5 threshold), predict_proba. For the LR model, compute the loss gradient with respect to parameters w and parameter b in the fit function. Use a simple SGD approach to estimate optimal values of parameters.

*Logistic Regression*

In [None]:
class LogisticRegressionMy:
  def __init__(self, learning_rate=0.01, n_iterations=100, batch_size=100, threshold=0.5, random_state=None) -> None:
    self.learning_rate = learning_rate
    self.n_iterations = n_iterations
    self.batch_size = batch_size
    self.threshold = threshold
    self.weights = None
    self.bias = None
    self.random_state = random_state


  def sigmoid(self, y_pred_):
    return (1 / (1 + np.exp(-y_pred_)))


  def fit(self, X, y):
    if self.random_state is not None:  # a plain truthiness check would skip random_state=0
      np.random.seed(self.random_state)

    n_samples, n_features = X.shape
    self.weights = np.random.randn(n_features) * 0.01
    self.bias = np.random.randn() * 0.01


    for i in range(self.n_iterations):
      start_ind = np.random.randint(0, n_samples - self.batch_size - 1)
      end_ind = start_ind + self.batch_size
      y_pred = X[start_ind : end_ind] @ self.weights + self.bias
      sigm_func = self.sigmoid(y_pred)
      dw = (X[start_ind : end_ind].T @ (sigm_func - y[start_ind : end_ind])) / self.batch_size
      db = np.sum(sigm_func - y[start_ind : end_ind]) / self.batch_size

      self.weights -= self.learning_rate * dw
      self.bias -= self.learning_rate * db


  def predict_proba(self, X):
    y_pred = X @ self.weights + self.bias
    return self.sigmoid(y_pred)


  def predict(self, X):
    return (self.predict_proba(X) >= self.threshold).astype(int)

In [None]:
lr_my = LogisticRegressionMy()
lr_my.fit(X_train, y_train)

y_pred_train_lr_my = lr_my.predict_proba(X_train)
y_pred_valid_lr_my = lr_my.predict_proba(X_valid)
y_pred_test_lr_my = lr_my.predict_proba(X_test)

giniTrainLGmy = 2 * roc_auc_score(y_train, y_pred_train_lr_my) - 1
giniValidLGmy = 2 * roc_auc_score(y_valid, y_pred_valid_lr_my) - 1
giniTestLGmy = 2 * roc_auc_score(y_test, y_pred_test_lr_my) - 1

In [None]:
print('Gini score Logistic Regression my version:')
print('Train:', giniTrainLGmy)
print('Valid:', giniValidLGmy)
print('Test:', giniTestLGmy)

Gini score Logistic Regression my version:
Train: 0.16807436545681198
Valid: 0.14668215816634378
Test: 0.1465990474185217
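The gap to the sklearn result is plausibly due to the training loop: 100 updates at a learning rate of 0.01, each on a contiguous window of date-sorted rows. A more conventional mini-batch SGD loop reshuffles the rows every epoch, as sketched below on synthetic data. The function name, defaults and toy data are assumptions for illustration.

```python
import numpy as np

def sgd_logreg(X, y, learning_rate=0.1, n_epochs=50, batch_size=32, seed=0):
    """Mini-batch SGD on the logistic loss; returns weights and bias."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)  # new random order every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-(X[idx] @ w + b)))
            error = p - y[idx]  # gradient of the log loss w.r.t. the logits
            w -= learning_rate * (X[idx].T @ error) / len(idx)
            b -= learning_rate * error.mean()
    return w, b

# Synthetic, linearly separable toy problem.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)
w, b = sgd_logreg(X_toy, y_toy)
accuracy = np.mean(((X_toy @ w + b) > 0) == y_toy)
print(w, b, accuracy)
```

Inside the class above, the same idea would mean passing NumPy arrays (for example `X.to_numpy()`) and indexing them with the shuffled `idx`.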


*KNN*

In [None]:
class KNeighborsClassifierMy:
  def __init__(self, n_neighbors=5, threshold=0.5, learn_count=200) -> None:
    self.n_neighbors = n_neighbors
    self.X_ = None
    self.y_ = None
    self.threshold = threshold
    self.learn_count = learn_count

  def euclidean_distance_(self, x_i):
    return np.sqrt(np.sum((self.X_ - x_i)**2, axis=1))


  def fit(self, X, y):
    indices_X = list(range(X.shape[0]))  # sample from the data passed to fit, not the global X_test
    np.random.shuffle(indices_X)
    ind = indices_X[:self.learn_count]
    self.X_ = np.array(X.iloc[ind])
    self.y_ = np.array(y.iloc[ind])


  def predict_proba(self, X):
    y_pred_list = []
    for x_i in np.array(X):
      dist = self.euclidean_distance_(np.array(x_i))
      indices = np.argsort(dist)[:self.n_neighbors]
      count_class = np.bincount(self.y_[indices], minlength=2)
      y_pred_list.append([count_class[0] / self.n_neighbors, count_class[1] / self.n_neighbors])

    return np.array(y_pred_list)


  def predict(self, X):
    y_pred = self.predict_proba(X)
    return (y_pred[:, 1] > self.threshold).astype(int)

In [None]:
knn_my = KNeighborsClassifierMy()
knn_my.fit(X_train, y_train)
y_knn_train = knn_my.predict_proba(X_train)
y_knn_valid = knn_my.predict_proba(X_valid)
y_knn_test = knn_my.predict_proba(X_test)

In [None]:
giniTrainKNNmy = 2 * roc_auc_score(y_train, y_knn_train[:, 1]) - 1
giniValidKNNmy = 2 * roc_auc_score(y_valid, y_knn_valid[:, 1]) - 1
giniTestKNNmy = 2 * roc_auc_score(y_test, y_knn_test[:, 1]) - 1

print('Gini score my KNN:')
print('Train:', giniTrainKNNmy)
print('Valid:', giniValidKNNmy)
print('Test:', giniTestKNNmy)

Gini score my KNN:
Train: 0.1010396853545128
Valid: 0.12442210790210462
Test: 0.12289145019766146


*NaiveBayes*

In [None]:
class GaussianNBMy:
  def __init__(self) -> None:
    self.classes_ = None
    self.class_count_ = None
    self.class_prior_ = None
    self.mean_ = None
    self.var_ = None


  def fit(self, X, y):
    self.classes_, self.class_count_ = np.unique(y, return_counts=True)
    self.class_prior_ = self.class_count_ / len(y)
    self.mean_ = []
    self.var_ = []
    eps = 1e-1

    for cls in self.classes_:
      self.mean_.append(np.mean(X[y == cls], axis=0))
      self.var_.append(np.var(X[y==cls], axis=0))

    self.mean_ = np.array(self.mean_)
    self.var_ = np.array(self.var_)  + eps


  def _prob_density(self, x, cls_i):
    return (1 / np.sqrt(2 * np.pi * self.var_[cls_i]) * np.exp(-((x - self.mean_[cls_i])**2) / (2 * self.var_[cls_i])) )


  def _calculate_posteriors(self, X):
    posteriors = []
    log_prior = np.log(self.class_prior_)
    for cls_i in range(self.classes_.size):
      pdf = np.clip(self._prob_density(X, cls_i), a_min=1e-2, a_max=None)
      posteriors.append(self.class_prior_[cls_i] * np.prod(pdf, axis=1))
    return posteriors


  def predict(self, X):
    posteriors = self._calculate_posteriors(X)
    posteriors = np.array(posteriors)
    return self.classes_[np.argmax(posteriors, axis=0)]

In [None]:
gauss_my = GaussianNBMy()
gauss_my.fit(X_train, y_train)
y_gauss_train_my = gauss_my.predict(X_train)
y_gauss_valid_my = gauss_my.predict(X_valid)
y_gauss_test_my = gauss_my.predict(X_test)

# Separate names so the sklearn GaussianNB scores from step 4 are not overwritten.
giniTrainGNBmy = 2 * roc_auc_score(y_train, y_gauss_train_my) - 1
giniValidGNBmy = 2 * roc_auc_score(y_valid, y_gauss_valid_my) - 1
giniTestGNBmy = 2 * roc_auc_score(y_test, y_gauss_test_my) - 1

In [None]:
print('Gini score my GaussianNB:')
print('Train:', giniTrainGNBmy)
print('Valid:', giniValidGNBmy)
print('Test:', giniTestGNBmy)

Gini score my GaussianNB:
Train: 0.20678296012236852
Valid: 0.19728960384727956
Test: 0.2160831330040187
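The class above exposes only `predict`, so its Gini score is computed from hard labels, which discards the ranking information ROC AUC relies on. A sketch of how class probabilities could be obtained in log space, avoiding the product of many small densities. `gnb_predict_proba` and its arguments are hypothetical; it assumes per-class means and variances of shape (n_classes, n_features).

```python
import numpy as np

def gnb_predict_proba(X, means, variances, priors):
    """Posterior class probabilities of a Gaussian naive Bayes model."""
    X = np.asarray(X, dtype=float)
    # Sum of per-feature log densities: shape (n_samples, n_classes).
    log_lik = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :]
        + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    ).sum(axis=2)
    joint = log_lik + np.log(priors)[None, :]
    joint -= joint.max(axis=1, keepdims=True)  # stabilise before exponentiating
    proba = np.exp(joint)
    return proba / proba.sum(axis=1, keepdims=True)

means = np.array([[0.0, 0.0], [2.0, 2.0]])
variances = np.ones((2, 2))
priors = np.array([0.5, 0.5])
points = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 1.0]])
proba = gnb_predict_proba(points, means, variances, priors)
print(proba)
```

The second column of such an output could then be passed to `roc_auc_score` in the same way as for the sklearn models.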


7. Try to create non-linear features, for example:

fractions: feature1/feature2
groupby features: df['categorical_feature'].map(df.groupby('categorical_feature')['continuous_feature'].mean())

Add new features to your pipeline, repeat step 4. Did you manage to increase your Gini score (you should!)?

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24327 entries, 32367 to 15838
Data columns (total 22 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   RefId                              24327 non-null  int64  
 1   Auction                            24327 non-null  object 
 2   VehicleAge                         24327 non-null  int64  
 3   Make                               24327 non-null  object 
 4   Color                              24319 non-null  object 
 5   Transmission                       24319 non-null  object 
 6   WheelType                          23402 non-null  object 
 7   VehOdo                             24327 non-null  int64  
 8   Nationality                        24327 non-null  object 
 9   Size                               24327 non-null  object 
 10  TopThreeAmericanName               24327 non-null  object 
 11  MMRAcquisitionAuctionAveragePrice  24327 non-null  floa

In [None]:
scaler = MinMaxScaler()
train_feat = pd.DataFrame()
valid_feat = pd.DataFrame()
test_feat = pd.DataFrame()

train_feat['MMRAcquisitionAuctionAveragePrice_div_VehicleAge'] = X_train['MMRAcquisitionAuctionAveragePrice'] / (X_train['VehicleAge'] + 1e-6)
valid_feat['MMRAcquisitionAuctionAveragePrice_div_VehicleAge'] = X_valid['MMRAcquisitionAuctionAveragePrice'] / (X_valid['VehicleAge'] + 1e-6)
test_feat['MMRAcquisitionAuctionAveragePrice_div_VehicleAge'] = X_test['MMRAcquisitionAuctionAveragePrice'] / (X_test['VehicleAge'] + 1e-6)

train_feat['VehOdo_div_VehicleAge'] = X_train['VehOdo'] / (X_train['VehicleAge'] + 1e-6)
valid_feat['VehOdo_div_VehicleAge'] = X_valid['VehOdo'] / (X_valid['VehicleAge'] + 1e-6)
test_feat['VehOdo_div_VehicleAge'] = X_test['VehOdo'] / (X_test['VehicleAge'] + 1e-6)

train_feat['MMRCurrentRetailAveragePrice_div_WarrantyCost'] = X_train['MMRCurrentRetailAveragePrice'] / (X_train['WarrantyCost'] + 1e-6)
valid_feat['MMRCurrentRetailAveragePrice_div_WarrantyCost'] = X_valid['MMRCurrentRetailAveragePrice'] / (X_valid['WarrantyCost'] + 1e-6)
test_feat['MMRCurrentRetailAveragePrice_div_WarrantyCost'] = X_test['MMRCurrentRetailAveragePrice'] / (X_test['WarrantyCost'] + 1e-6)

# Compute the group means on train only and reuse them for valid/test, so every split
# uses the same mapping; makes unseen in train fall back to the overall train mean.
grouped_make_train = train.groupby('Make')['MMRAcquisitionAuctionCleanPrice'].mean()
global_mean_train = train['MMRAcquisitionAuctionCleanPrice'].mean()
train_feat['Make_MMRAcquisitionAuctionCleanPrice'] = train['Make'].map(grouped_make_train)
valid_feat['Make_MMRAcquisitionAuctionCleanPrice'] = valid['Make'].map(grouped_make_train).fillna(global_mean_train)
test_feat['Make_MMRAcquisitionAuctionCleanPrice'] = test['Make'].map(grouped_make_train).fillna(global_mean_train)

# X_train[['MMRAcquisitionAuctionAveragePrice_div_VehicleAge', 'VehOdo_div_VehicleAge', 'MMRCurrentRetailAveragePrice_div_WarrantyCost', ]] = \
#  scaler.fit_transform(X_train[['MMRAcquisitionAuctionAveragePrice_div_VehicleAge', 'VehOdo_div_VehicleAge', 'MMRCurrentRetailAveragePrice_div_WarrantyCost', ]])

train_feat = pd.DataFrame(scaler.fit_transform(train_feat), index=train_feat.index, columns=train_feat.columns)
valid_feat = pd.DataFrame(scaler.transform(valid_feat), index=valid_feat.index, columns=valid_feat.columns)
test_feat = pd.DataFrame(scaler.transform(test_feat), index=test_feat.index, columns=test_feat.columns)

train_common = pd.concat([X_train, train_feat], axis=1)
valid_common = pd.concat([X_valid, valid_feat], axis=1)
test_common = pd.concat([X_test, test_feat], axis=1)

*LogisticRegression*

In [None]:
logReg_nlf = LogisticRegression(random_state=42)
logReg_nlf.fit(train_common, y_train)

y_pred_train_lg_nlf = logReg_nlf.predict_proba(train_common)
y_pred_valid_lg_nlf = logReg_nlf.predict_proba(valid_common)
y_pred_test_lg_nlf = logReg_nlf.predict_proba(test_common)

giniTrainLG_nlf = 2 * roc_auc_score(y_train, y_pred_train_lg_nlf[:, 1]) - 1
giniValidLG_nlf = 2 * roc_auc_score(y_valid, y_pred_valid_lg_nlf[:, 1]) - 1
giniTestLG_nlf = 2 * roc_auc_score(y_test, y_pred_test_lg_nlf[:, 1]) - 1

print('Gini score Logistic Regression with non-linear features:')
print('Train:', giniTrainLG_nlf)
print('Valid:', giniValidLG_nlf)
print('Test:', giniTestLG_nlf)

Gini score Logistic Regression with non-linear features:
Train: 0.5059022955593353
Valid: 0.4784009290837663
Test: 0.48955642834218316


In [None]:
print('Gini score Logistic Regression:')
print('Train:', giniTrainLG)
print('Valid:', giniValidLG)
print('Test:', giniTestLG)

Gini score Logistic Regression:
Train: 0.5053060839414765
Valid: 0.4786878179261531
Test: 0.4895512041852712


*GaussianNB*

In [None]:
gaussNB_nlf = GaussianNB()
gaussNB_nlf.fit(train_common, y_train)

y_pred_train_gnb_nlf = gaussNB_nlf.predict_proba(train_common)
y_pred_valid_gnb_nlf = gaussNB_nlf.predict_proba(valid_common)
y_pred_test_gnb_nlf = gaussNB_nlf.predict_proba(test_common)

giniTrainGNB_nlf = 2 * roc_auc_score(y_train, y_pred_train_gnb_nlf[:, 1]) - 1
giniValidGNB_nlf = 2 * roc_auc_score(y_valid, y_pred_valid_gnb_nlf[:, 1]) - 1
giniTestGNB_nlf = 2 * roc_auc_score(y_test, y_pred_test_gnb_nlf[:, 1]) - 1

In [None]:
print('Gini score GaussianNB with non-linear features:')
print('Train:', giniTrainGNB_nlf)
print('Valid:', giniValidGNB_nlf)
print('Test:', giniTestGNB_nlf)

Gini score GaussianNB with non-linear features:
Train: 0.43296992404094437
Valid: 0.3819719290393815
Test: 0.3860872274369984


In [None]:
print('Gini score GaussianNB:')
print('Train:', giniTrainGNB)
print('Valid:', giniValidGNB)
print('Test:', giniTestGNB)

Gini score GaussianNB:
Train: 0.20678296012236852
Valid: 0.19728960384727956
Test: 0.2160831330040187


*KNN*

In [None]:
knn_nlf = KNeighborsClassifier()
knn_nlf.fit(train_common, y_train)

y_pred_train_knn_nlf = knn_nlf.predict_proba(train_common)
y_pred_valid_knn_nlf = knn_nlf.predict_proba(valid_common)
y_pred_test_knn_nlf = knn_nlf.predict_proba(test_common)

giniTrainKNN_nlf = 2 * roc_auc_score(y_train, y_pred_train_knn_nlf[:, 1]) - 1
giniValidKNN_nlf = 2 * roc_auc_score(y_valid, y_pred_valid_knn_nlf[:, 1]) - 1
giniTestKNN_nlf = 2 * roc_auc_score(y_test, y_pred_test_knn_nlf[:, 1]) - 1

In [None]:
print('Gini score KNN with non-linear features:')
print('Train:', giniTrainKNN_nlf)
print('Valid:', giniValidKNN_nlf)
print('Test:', giniTestKNN_nlf)

Gini score KNN with non-linear features:
Train: 0.8135660399192184
Valid: 0.26805237213315003
Test: 0.26063244203308145


In [None]:
print('Gini score KNN:')
print('Train:', giniTrainKNN)
print('Valid:', giniValidKNN)
print('Test:', giniTestKNN)

Gini score KNN:
Train: 0.8141329977749521
Valid: 0.2731743460383025
Test: 0.2622217362927062


8. Determine the best features for the problem using the coefficients of the logistic model. Try to eliminate useless features by hand and by L1 regularization. Which approach is better in terms of Gini score?

*Eliminate useless features by hand*

In [None]:
coeffs = logReg.coef_[0]
feat_coeffs = pd.DataFrame({'features': X_train.columns, 'coeff': np.abs(coeffs)})
feat_coeffs = feat_coeffs.sort_values(by=['coeff'], ascending=False)

feature_hand = feat_coeffs[feat_coeffs['coeff'] > 0.3]['features'].to_list()

*LogisticRegression*

In [None]:
logReg_hand = LogisticRegression(random_state=42)
logReg_hand.fit(X_train[feature_hand], y_train)

y_pred_train_lg_hand = logReg_hand.predict_proba(X_train[feature_hand])
y_pred_valid_lg_hand = logReg_hand.predict_proba(X_valid[feature_hand])
y_pred_test_lg_hand = logReg_hand.predict_proba(X_test[feature_hand])

giniTrainLG_hand = 2 * roc_auc_score(y_train, y_pred_train_lg_hand[:, 1]) - 1
giniValidLG_hand = 2 * roc_auc_score(y_valid, y_pred_valid_lg_hand[:, 1]) - 1
giniTestLG_hand = 2 * roc_auc_score(y_test, y_pred_test_lg_hand[:, 1]) - 1

print('Gini score Logistic Regression with top features (by hand):')
print('Train:', giniTrainLG_hand)
print('Valid:', giniValidLG_hand)
print('Test:', giniTestLG_hand)

Gini score Logistic Regression with top features (by hand):
Train: 0.496012329928849
Valid: 0.47975099422440914
Test: 0.4880849574786277


In [None]:
print('Gini score Logistic Regression:')
print('Train:', giniTrainLG)
print('Valid:', giniValidLG)
print('Test:', giniTestLG)

Gini score Logistic Regression:
Train: 0.5053060839414765
Valid: 0.4786878179261531
Test: 0.4895512041852712


*Eliminate useless features by L1 regularization*

In [None]:
logReg_l1 = LogisticRegression(penalty='l1', solver='liblinear', random_state=42)
logReg_l1.fit(X_train, y_train)
l1_coefficients = logReg_l1.coef_[0]
features_l1 = X_train.columns[np.where(l1_coefficients != 0)]

In [None]:
y_pred_train_l1 = logReg_l1.predict_proba(X_train)
y_pred_valid_l1 = logReg_l1.predict_proba(X_valid)
y_pred_test_l1 = logReg_l1.predict_proba(X_test)

print('Gini score Logistic Regression with top features (by l1):')
print('Train:', Gini_score(y_train, y_pred_train_l1[:, 1]))
print('Valid:', Gini_score(y_valid, y_pred_valid_l1[:, 1]))
print('Test:', Gini_score(y_test, y_pred_test_l1[:, 1]))

Gini score Logistic Regression with top features (by l1):
Train: 0.5066765562112203
Valid: 0.4823711361329823
Test: 0.4953879867837536


Manual feature selection and L1 regularization give roughly the same quality (validation Gini 0.480 vs. 0.482).

10. Select your best model (algorithm + feature set) and tweak its hyperparameters to increase the Gini score on the validation dataset. Which hyperparameters have the most impact?

In [None]:
param_grid = {
    'C': [0.01, 0.1, 1]
}

def gini_scorer(estimator, X, y):
  # Score on predicted probabilities of the positive class, not on hard labels.
  return Gini_score(y, estimator.predict_proba(X)[:, 1])

grid_search = GridSearchCV(estimator=logReg_l1, param_grid=param_grid, scoring=gini_scorer)
grid_search.fit(X_valid, y_valid)
print("Best hyperparameters: ", grid_search.best_params_)

Best hyperparameters:  {'C': 1}
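`GridSearchCV` as used above cross-validates and refits inside the validation set. An alternative that keeps the roles of the splits separate is a manual loop: fit each candidate on train, score it on validation. The sketch below uses synthetic data; all variable names are illustrative and not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the train/validation split (hypothetical data).
rng = np.random.default_rng(0)
X_all = rng.normal(size=(600, 5))
y_all = (X_all[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)
X_tr, y_tr = X_all[:400], y_all[:400]
X_va, y_va = X_all[400:], y_all[400:]

valid_gini = {}
for C in [0.01, 0.1, 1]:
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C, random_state=42)
    model.fit(X_tr, y_tr)                     # fit on train only
    scores = model.predict_proba(X_va)[:, 1]  # score on validation only
    valid_gini[C] = 2 * roc_auc_score(y_va, scores) - 1

best_C = max(valid_gini, key=valid_gini.get)
print(valid_gini, best_C)
```

With this setup the validation Gini stays an out-of-sample estimate, and the test set remains untouched until the final check.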


11. Check the Gini scores on all three datasets for your best model: training Gini, valid Gini, test Gini. Do you see a drop in performance when comparing the valid quality to the test quality? Is your model overfitted or not? Explain.


In [None]:
y_train_best = grid_search.best_estimator_.predict_proba(X_train)
y_valid_best = grid_search.best_estimator_.predict_proba(X_valid)
y_test_best = grid_search.best_estimator_.predict_proba(X_test)

print('Gini score train:', Gini_score(y_train, y_train_best[:, 1]))
print('Gini score valid:', Gini_score(y_valid, y_valid_best[:, 1]))
print('Gini score test:', Gini_score(y_test, y_test_best[:, 1]))

Gini score train: 0.4654406858198805
Gini score valid: 0.529588529416456
Gini score test: 0.48471506530864694


The model is not overfitted: the test Gini score differs only slightly from the training and validation scores.

12. Implement calculation of Recall, Precision, F1 score and AUC PR metrics.
Compare your algorithms on the test dataset using AUC PR metric.

In [None]:
def recall(y_true, y_pred):
  TP = np.sum((y_true == 1) & (y_pred == 1))
  FN = np.sum((y_true == 1) & (y_pred == 0))

  return TP / (TP + FN) if (TP + FN) > 0 else 0.0

In [None]:
def precision(y_true, y_pred):
  TP = np.sum((y_true == 1) & (y_pred == 1))
  FP = np.sum((y_true == 0) & (y_pred == 1))

  return TP / (TP + FP) if (TP + FP) > 0 else 0.0

In [None]:
def f1_score_my(y_true, y_pred):
  TP = np.sum((y_true == 1) & (y_pred == 1))
  FN = np.sum((y_true == 1) & (y_pred == 0))
  FP = np.sum((y_true == 0) & (y_pred == 1))

  return TP / (TP + (FP + FN) / 2) if (TP + (FP + FN) / 2) > 0 else 0.0
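The expression TP / (TP + (FP + FN) / 2) used above is the harmonic mean of precision and recall written in terms of counts. A small check on made-up labels (the arrays and variable names are illustrative):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))

precision_value = tp / (tp + fp)
recall_value = tp / (tp + fn)

f1_counts = tp / (tp + (fp + fn) / 2)
f1_harmonic = 2 * precision_value * recall_value / (precision_value + recall_value)
print(f1_counts, f1_harmonic)
```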

In [None]:
def auc_pr(y_true, y_score):
  df_score_class = pd.DataFrame({'y_score': y_score, 'y_true': y_true})
  df_score_class = df_score_class.sort_values(by='y_score', ascending=False)

  n_pos = np.sum(y_true)

  tp = np.cumsum(df_score_class['y_true'])
  fp = np.cumsum(1 - df_score_class['y_true'])
  fn = n_pos - tp

  recall = np.where((tp + fn) == 0, 0, (tp / (tp + fn)))
  precision = np.where((tp + fp) == 0, 0, (tp / (tp + fp)))

  return auc(recall, precision)

In [None]:
print('AUC PR:', auc_pr(y_test, y_test_best[:, 1]))  # pass the positive-class column, not the (n, 2) array

AUC PR: 0.3202488755010808


13. Which hard label metric do you prefer for the task of detecting "lemon" cars?

"Lemon" cars are defective vehicles. To catch as many defective cars as possible, the number of false negatives has to be kept low, and that is what recall measures.
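Recall depends on the decision threshold: lowering it catches more lemons, usually at the cost of precision. A toy illustration with made-up labels and scores (`recall_precision_at` is a hypothetical helper):

```python
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

def recall_precision_at(threshold):
    """Recall and precision of the hard labels obtained at a given threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tp / (tp + fp)

r_default, p_default = recall_precision_at(0.5)
r_low, p_low = recall_precision_at(0.3)
print(r_default, p_default)
print(r_low, p_low)
```

In this toy case the lower threshold finds every positive, while a larger share of the flagged cars turn out to be good buys.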