## Preparing training and test sets, cross-validation, and hyperparameter tuning, using the k-nearest neighbors method as an example

### 1. Choose a dataset

In [None]:
!wget https://archive.ics.uci.edu/static/public/850/raisin.zip

--2024-03-24 15:21:44--  https://archive.ics.uci.edu/static/public/850/raisin.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘raisin.zip’

raisin.zip              [ <=>                ] 111.99K  --.-KB/s    in 0.1s    

2024-03-24 15:21:45 (792 KB/s) - ‘raisin.zip’ saved [114677]



In [None]:
!unzip raisin.zip
!unzip Raisin_Dataset.zip

Archive:  raisin.zip
  inflating: Raisin_Dataset.zip      
Archive:  Raisin_Dataset.zip
   creating: Raisin_Dataset/
  inflating: Raisin_Dataset/Raisin_Dataset.arff  
  inflating: Raisin_Dataset/Raisin_Dataset.txt  
  inflating: Raisin_Dataset/Raisin_Dataset.xlsx  


In [None]:
import numpy as np
import pandas as pd
from scipy.io.arff import loadarff
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    RandomizedSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
raw_data = loadarff('Raisin_Dataset/Raisin_Dataset.arff')
df = pd.DataFrame(raw_data[0])
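
As a small illustration of what `loadarff` returns, the sketch below parses a tiny ARFF document from memory. The ARFF text and the names `arff_text` and `toy_df` are invented for this example and are not part of the raisin data. It shows that the function returns a pair (records, metadata) and that nominal attributes arrive as byte strings, which is why the `Class` values in this notebook print as `b'Kecimen'`.

```python
from io import StringIO

import pandas as pd
from scipy.io.arff import loadarff

# A tiny, made-up ARFF document (not the raisin data).
arff_text = """@relation toy
@attribute length numeric
@attribute kind {a,b}
@data
1.5,a
2.5,b
"""

# loadarff returns a structured NumPy array and a metadata object.
data, meta = loadarff(StringIO(arff_text))
toy_df = pd.DataFrame(data)

print(meta.names())             # attribute names taken from the header
print(toy_df.dtypes)            # numeric column and an object column of bytes
print(toy_df["kind"].tolist())  # nominal values are byte strings
```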

In [None]:
df.head()

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter,Class
0,87524.0,442.246011,253.291155,0.819738,90546.0,0.758651,1184.04,b'Kecimen'
1,75166.0,406.690687,243.032436,0.801805,78789.0,0.68413,1121.786,b'Kecimen'
2,90856.0,442.267048,266.328318,0.798354,93717.0,0.637613,1208.575,b'Kecimen'
3,45928.0,286.540559,208.760042,0.684989,47336.0,0.699599,844.162,b'Kecimen'
4,79408.0,352.19077,290.827533,0.564011,81463.0,0.792772,1073.251,b'Kecimen'


In [None]:
df.shape

(900, 8)

### 2. Check whether missing values need to be removed or filled, and whether categorical features need encoding

Check the dataset for missing values.

In [None]:
df.isnull().sum()

Area               0
MajorAxisLength    0
MinorAxisLength    0
Eccentricity       0
ConvexArea         0
Extent             0
Perimeter          0
Class              0
dtype: int64

Encode the categorical data. The only categorical column is the target, `Class`.

In [None]:
df.Class.unique()

array([b'Kecimen', b'Besni'], dtype=object)

In [None]:
le = LabelEncoder()
df["Class"]= le.fit_transform(df["Class"])
df.head()

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter,Class
0,87524.0,442.246011,253.291155,0.819738,90546.0,0.758651,1184.04,1
1,75166.0,406.690687,243.032436,0.801805,78789.0,0.68413,1121.786,1
2,90856.0,442.267048,266.328318,0.798354,93717.0,0.637613,1208.575,1
3,45928.0,286.540559,208.760042,0.684989,47336.0,0.699599,844.162,1
4,79408.0,352.19077,290.827533,0.564011,81463.0,0.792772,1073.251,1
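
To make the encoding easier to read back, the following sketch decodes byte-string labels to ordinary strings before encoding and then inspects the mapping. The three labels are made up for the illustration and the decoding step is an optional variant, not something the notebook itself does. `LabelEncoder` sorts the classes, which is consistent with `Kecimen` being encoded as 1 in the output above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels in the same byte-string form that loadarff produces.
labels = pd.Series([b"Kecimen", b"Besni", b"Kecimen"])

decoded = labels.str.decode("utf-8")   # bytes -> str
le = LabelEncoder()
codes = le.fit_transform(decoded)

print(list(le.classes_))                     # sorted class names; position = code
print(codes.tolist())                        # integer codes
print(le.inverse_transform(codes).tolist())  # codes mapped back to names
```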


In [None]:
df.drop("Class", axis=1)

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter
0,87524.0,442.246011,253.291155,0.819738,90546.0,0.758651,1184.040
1,75166.0,406.690687,243.032436,0.801805,78789.0,0.684130,1121.786
2,90856.0,442.267048,266.328318,0.798354,93717.0,0.637613,1208.575
3,45928.0,286.540559,208.760042,0.684989,47336.0,0.699599,844.162
4,79408.0,352.190770,290.827533,0.564011,81463.0,0.792772,1073.251
...,...,...,...,...,...,...,...
895,83248.0,430.077308,247.838695,0.817263,85839.0,0.668793,1129.072
896,87350.0,440.735698,259.293149,0.808629,90899.0,0.636476,1214.252
897,99657.0,431.706981,298.837323,0.721684,106264.0,0.741099,1292.828
898,93523.0,476.344094,254.176054,0.845739,97653.0,0.658798,1258.548


### 3. Split the data into training and test sets with train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('Class', axis=1), df['Class'], test_size=0.2, random_state=42
)
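
The split above is purely random. As an optional variant that this notebook does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions equal in both parts. The sketch below demonstrates this on synthetic, deliberately imbalanced labels; the arrays and the names `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (80 zeros, 20 ones); not the raisin data.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of class 1 in the training and test parts.
print(y_tr.mean(), y_te.mean())
```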

In [None]:
y_train

10     1
334    1
244    1
678    0
306    1
      ..
106    1
270    1
860    0
435    1
102    1
Name: Class, Length: 720, dtype: int64

In [None]:
X_test

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter
70,95347.0,451.526154,280.226153,0.784111,99256.0,0.674956,1255.245
827,61861.0,345.943650,235.430468,0.732706,67390.0,0.702280,1063.621
231,52693.0,283.504239,242.113954,0.520265,54860.0,0.737749,895.745
588,112808.0,542.504780,267.201878,0.870293,116961.0,0.743155,1390.400
39,49882.0,287.264327,222.185873,0.633852,50880.0,0.766378,843.764
...,...,...,...,...,...,...,...
897,99657.0,431.706981,298.837323,0.721684,106264.0,0.741099,1292.828
578,129038.0,540.814829,306.817764,0.823494,134796.0,0.648758,1459.345
779,103915.0,516.485501,260.105445,0.863933,106499.0,0.691085,1285.063
25,75620.0,368.224284,263.459255,0.698627,77493.0,0.726277,1059.186


In [None]:
X_train

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter
10,80437.0,449.454581,232.325506,0.856043,84460.0,0.674236,1176.305
334,72483.0,334.417609,282.680889,0.534303,74945.0,0.706180,1052.159
244,85739.0,380.370379,288.256159,0.652452,87052.0,0.762152,1094.576
678,182788.0,621.206763,379.424446,0.791796,188848.0,0.733061,1679.075
306,62835.0,421.169338,191.169862,0.891051,64406.0,0.786145,1018.553
...,...,...,...,...,...,...,...
106,48945.0,269.370411,239.162166,0.460121,51456.0,0.711244,872.289
270,54968.0,300.954432,234.389570,0.627246,56851.0,0.751340,893.644
860,166654.0,607.996465,349.658989,0.818083,169060.0,0.753518,1574.164
435,28216.0,245.401295,150.245582,0.790669,30316.0,0.622293,683.004


### 4. Train a nearest-neighbors model with an arbitrarily chosen hyperparameter K

In [None]:
clf_knn = KNeighborsClassifier(n_neighbors=80)
clf_knn.fit(X_train, y_train)
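
To show what the classifier does conceptually, here is a minimal NumPy sketch of k-nearest-neighbors prediction: find the K closest training points and take a majority vote. It assumes Euclidean distance and equal vote weights, which correspond to the defaults of `KNeighborsClassifier`. The function `knn_predict` and the toy data are hypothetical and much simpler than the library implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the label of one point x by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    votes = np.bincount(y_train[nearest])        # label counts among the neighbors
    return int(np.argmax(votes))                 # most frequent label (ties go to the smaller label)

# Toy 2-D data, invented for illustration: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)
```

Because the vote depends on raw distances, features measured on large scales (such as `Area` here) influence the result more than features on small scales (such as `Eccentricity`).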

In [None]:
y_pred = clf_knn.predict(X_test)

In [None]:
y_pred

array([0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 1, 1])

Evaluate the quality of the model with the F1 score and accuracy.

In [None]:
f1_score(y_test, y_pred)

0.8629441624365481

In [None]:
accuracy_score(y_test, y_pred)

0.85
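
For reference, both metrics can be computed by hand from the counts of true positives, false positives, and false negatives. The sketch below does this on invented labels (`y_true`, `y_hat`) and compares the result with scikit-learn; it only illustrates the definitions and does not reproduce the scores above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Invented labels, only to show how the numbers are built.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_hat = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_hat == 1))  # true positives
fp = np.sum((y_true == 0) & (y_hat == 1))  # false positives
fn = np.sum((y_true == 1) & (y_hat == 0))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)  # harmonic mean
acc_manual = np.mean(y_true == y_hat)                      # share of correct predictions

print(f1_manual, f1_score(y_true, y_hat))
print(acc_manual, accuracy_score(y_true, y_hat))
```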

### 5. Tune the hyperparameter K with GridSearchCV, RandomizedSearchCV, and cross-validation

In [None]:
n_range = np.array(range(1,20))
tuned_parameters = [{'n_neighbors': n_range}]
tuned_parameters

[{'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
         18, 19])}]

GridSearchCV is a tool for automatically tuning the hyperparameters of machine learning models. It finds the best parameters by exhaustive search: it fits and cross-validates a model for every combination of parameter values in the grid and keeps the combination with the best mean score. The result is the best setting within the grid.
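
The exhaustive search can be written out as an explicit loop. The sketch below, on synthetic data from `make_classification` rather than the raisin features, scores each candidate K with 5-fold cross-validation by hand and checks that `GridSearchCV` reports the same best score. The candidate list `k_values` and the variable names are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data stands in for the raisin features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
k_values = [1, 3, 5, 7, 9]

# Manual search: mean 5-fold F1 for every candidate K.
manual_scores = {
    k: cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X, y, cv=5, scoring="f1"
    ).mean()
    for k in k_values
}
best_k_manual = max(manual_scores, key=manual_scores.get)

# The same search through GridSearchCV.
gs = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": k_values}, cv=5, scoring="f1")
gs.fit(X, y)

print(best_k_manual, gs.best_params_["n_neighbors"])
print(manual_scores[best_k_manual], gs.best_score_)
```

After fitting, `gs.cv_results_` holds the per-candidate scores, which is useful for checking how close the runner-up values of K are.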

In [None]:
# GridSearch
clf_gs = GridSearchCV(KNeighborsClassifier(), tuned_parameters, cv=5, scoring='f1')
clf_gs.fit(X_train, y_train)

In [None]:
clf_gs.best_params_

{'n_neighbors': 5}

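RandomizedSearchCV implements a randomized search over parameters: each candidate setting is sampled from a distribution over the possible parameter values. By default it evaluates only `n_iter=10` sampled settings, so with the 19 candidates defined here it checks a subset of the grid. The result is the best setting among those sampled, which may or may not coincide with the grid-search optimum.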
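
The sampling behavior is easier to see with `n_iter` and `random_state` set explicitly. The sketch below is an illustration on synthetic data, with `n_iter=8` and `random_state=42` chosen arbitrarily for the example; it lists which values of K were actually tried.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data stands in for the raisin features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rs = RandomizedSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 20))},  # 19 candidate values of K
    n_iter=8,          # evaluate only 8 of them
    cv=5,
    scoring="f1",
    random_state=42,   # makes the sampled candidates reproducible
)
rs.fit(X, y)

tried = sorted(p["n_neighbors"] for p in rs.cv_results_["params"])
print(tried)            # the K values that were sampled
print(rs.best_params_)  # best K among the sampled ones
```

Without `random_state`, repeated runs can sample different candidates and therefore report different best parameters.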

In [None]:
# RandomSearch
clf_rs = RandomizedSearchCV(KNeighborsClassifier(), tuned_parameters, scoring='f1')
clf_rs.fit(X_train, y_train)

In [None]:
clf_rs.best_params_

{'n_neighbors': 5}

K-Fold cross-validator.

It provides train/test indices for splitting the data into train/test sets. The dataset is split into k consecutive folds (without shuffling by default).

Each fold is then used once as the validation set while the k - 1 remaining folds form the training set.
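
The index logic is easy to inspect on a toy array. This sketch uses six made-up samples and three folds (both numbers are arbitrary choices for the example) and prints the train and test indices of each split.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(6).reshape(-1, 1)  # six toy samples with indices 0..5

# Collect (train indices, test indices) for every fold.
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in KFold(n_splits=3).split(X_demo)
]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample appears in exactly one test fold, and because there is no shuffling the test folds are consecutive blocks of indices.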

In [None]:
# Cross-validation with the tuned K (n_neighbors=5)
kf = KFold(n_splits=6)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                         X_train,y_train, scoring='f1',
                         cv=kf)
scores.mean()

0.8251623375972116

In [None]:
scores

array([0.85714286, 0.82706767, 0.82170543, 0.83185841, 0.77192982,
       0.84126984])

Stratified K-Fold cross-validator.

It provides train/test indices for splitting the data into train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds. (Stratification is the process of dividing a population into smaller groups, or strata, based on one or more characteristics.) The folds preserve the percentage of samples of each class.
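
The difference from plain KFold shows up most clearly when the labels are sorted by class. The sketch below uses twelve invented labels (eight of class 0 followed by four of class 1) and counts the class-1 samples in each test fold for both splitters. The helper `class1_per_fold` is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted by class: 8 samples of class 0, then 4 of class 1.
y_demo = np.array([0] * 8 + [1] * 4)
X_demo = np.zeros((12, 1))  # features are irrelevant for the split itself

def class1_per_fold(cv):
    """Number of class-1 samples in the test part of each fold."""
    return [int(y_demo[test_idx].sum()) for _, test_idx in cv.split(X_demo, y_demo)]

plain = class1_per_fold(KFold(n_splits=4))
stratified = class1_per_fold(StratifiedKFold(n_splits=4))
print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With plain KFold some test folds contain no class-1 samples at all, while the stratified folds each keep the 2:1 class ratio of the full set.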

In [None]:
skf = StratifiedKFold(n_splits=6)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                         X_train,y_train, scoring='f1',
                         cv=skf)
scores.mean()

0.8326490530764062

In [None]:
scores

array([0.85483871, 0.81967213, 0.80991736, 0.82644628, 0.84375   ,
       0.84126984])
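
A mean alone hides how much the score varies between folds. The sketch below takes the per-fold F1 scores printed above for K=5 and reports the mean together with the standard deviation; the variable names are assumptions for the example.

```python
import numpy as np

# Per-fold F1 scores copied from the two K=5 runs above.
kfold_scores = np.array([0.85714286, 0.82706767, 0.82170543,
                         0.83185841, 0.77192982, 0.84126984])
skf_scores = np.array([0.85483871, 0.81967213, 0.80991736,
                       0.82644628, 0.84375, 0.84126984])

for name, s in [("KFold", kfold_scores), ("StratifiedKFold", skf_scores)]:
    print(f"{name}: mean={s.mean():.4f}, std={s.std():.4f}")
```

On these numbers the stratified folds show a smaller spread, although six folds on a single split are too few to draw a firm conclusion.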

In [None]:
# Cross-validation with the arbitrarily chosen K (n_neighbors=80)
kf = KFold(n_splits=6)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=80),
                         X_train,y_train, scoring='f1',
                         cv=kf)
scores.mean()

0.8157086389584182

In [None]:
skf = StratifiedKFold(n_splits=6)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=80),
                         X_train,y_train, scoring='f1',
                         cv=skf)
scores.mean()

0.8198506732253271