### 4. KNN-признаки (3 балла).
Запрограммируйте алгоритм построения KNN-based признаков, рассмотренных на лекции.
Можно вычислять любые подвиды KNN-признаков, например:
 * значение j-го признака $x^j_{neighbor(i)}$  у ближайшего соседа (для рассматриваемого объекта $x_i$).
 * разница $x^j_i$ – $x^j_{neighbor(i)}$ для количественного признака или индикатор $[x^j_i= x^j_{neighbor(i)}]$ для категориального.
 * расстояние до ближайшего соседа: $d(x_i, x_{neighbor(i)})$.


Для реализации можно пользоваться библиотеками *sklearn*, *annoy* и т.п.


Выберите любые 2 алгоритма из списка: логистическая регрессия, cлучайный лес, градиентный бустинг, MLP.


Проведите ряд экспериментов на любом *непопулярном* датасете (Титаник, UCI-датасеты запрещены). Какие из полученных признаков наиболее полезны в вашей задаче? Какие алгоритмы лучше всего реагируют на добавление KNN-based признаков?

### Выполнение

Используется датасет наблюдений за дневной погодой в Австралии. В качестве таргета будет использоваться переменная RainTomorrow.

Ссылка на датасет https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

Загружаем датасет и производим предобработку данных

In [1]:
import pandas as pd 

data = pd.read_csv("WeatherAUS.csv")
data.RainTomorrow.value_counts()

No     110316
Yes     31877
Name: RainTomorrow, dtype: int64

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

In [3]:
categorical = data.columns[data.dtypes==object]
numerical = data.columns[data.dtypes=='float64']

In [4]:
categorical

Index(['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm',
       'RainToday', 'RainTomorrow'],
      dtype='object')

In [5]:
numerical

Index(['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
       'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm'],
      dtype='object')

In [6]:
for col in data.columns:
    print(round((1-data[col].count()/len(data.index)),2)*100,col)

0.0 Date
0.0 Location
1.0 MinTemp
1.0 MaxTemp
2.0 Rainfall
43.0 Evaporation
48.0 Sunshine
7.000000000000001 WindGustDir
7.000000000000001 WindGustSpeed
7.000000000000001 WindDir9am
3.0 WindDir3pm
1.0 WindSpeed9am
2.0 WindSpeed3pm
2.0 Humidity9am
3.0 Humidity3pm
10.0 Pressure9am
10.0 Pressure3pm
38.0 Cloud9am
41.0 Cloud3pm
1.0 Temp9am
2.0 Temp3pm
2.0 RainToday
2.0 RainTomorrow


Поскольку мы будем предсказывать переменную RainTomorrow, то объекты для которых значение данной переменной отсутствует, будут выкинуты из датасета.

In [7]:
indexes = data[~data['RainTomorrow'].isnull()].index.tolist()
data = data.iloc[indexes]

Заменим пропущенные данные в категориальных колонках модой и средним в числовых 

In [8]:
pd.options.mode.chained_assignment = None 

for col in data.columns:
    if str(data[col].dtype)=="object":
        data[col]=data[col].fillna(data[col].mode()[0])
    else:
        data[col]=data[col].fillna(data[col].mean())
        

In [9]:
data

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,5.469824,7.624853,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.000000,4.503167,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,5.469824,7.624853,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,4.437189,4.503167,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,5.469824,7.624853,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,4.437189,2.000000,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,5.469824,7.624853,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,4.437189,4.503167,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,5.469824,7.624853,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.000000,8.000000,17.8,29.7,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145454,2017-06-20,Uluru,3.5,21.8,0.0,5.469824,7.624853,E,31.0,ESE,...,59.0,27.0,1024.7,1021.2,4.437189,4.503167,9.4,20.9,No,No
145455,2017-06-21,Uluru,2.8,23.4,0.0,5.469824,7.624853,E,31.0,SE,...,51.0,24.0,1024.6,1020.3,4.437189,4.503167,10.1,22.4,No,No
145456,2017-06-22,Uluru,3.6,25.3,0.0,5.469824,7.624853,NNW,22.0,SE,...,56.0,21.0,1023.5,1019.1,4.437189,4.503167,10.9,24.5,No,No
145457,2017-06-23,Uluru,5.4,26.9,0.0,5.469824,7.624853,N,37.0,SE,...,53.0,24.0,1021.0,1016.8,4.437189,4.503167,12.5,26.1,No,No


In [10]:
data.RainToday = data.RainToday.apply(lambda x: 1 if x=='Yes' else 0)
data.RainTomorrow = data.RainTomorrow.apply(lambda x: 1 if x.strip()=='Yes' else 0)

In [11]:
data['year']=pd.DatetimeIndex(data['Date']).year
data['month']=pd.DatetimeIndex(data['Date']).month
data['day']=pd.DatetimeIndex(data['Date']).day

data.drop('Date',axis=1,inplace=True)

In [12]:
categorical = categorical.drop("Date")
categorical = categorical.append(data.columns[-3:])

In [13]:
categorical

Index(['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday',
       'RainTomorrow', 'year', 'month', 'day'],
      dtype='object')

In [14]:
data

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,year,month,day
0,Albury,13.4,22.9,0.6,5.469824,7.624853,W,44.0,W,WNW,...,1007.1,8.000000,4.503167,16.9,21.8,0,0,2008,12,1
1,Albury,7.4,25.1,0.0,5.469824,7.624853,WNW,44.0,NNW,WSW,...,1007.8,4.437189,4.503167,17.2,24.3,0,0,2008,12,2
2,Albury,12.9,25.7,0.0,5.469824,7.624853,WSW,46.0,W,WSW,...,1008.7,4.437189,2.000000,21.0,23.2,0,0,2008,12,3
3,Albury,9.2,28.0,0.0,5.469824,7.624853,NE,24.0,SE,E,...,1012.8,4.437189,4.503167,18.1,26.5,0,0,2008,12,4
4,Albury,17.5,32.3,1.0,5.469824,7.624853,W,41.0,ENE,NW,...,1006.0,7.000000,8.000000,17.8,29.7,0,0,2008,12,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145454,Uluru,3.5,21.8,0.0,5.469824,7.624853,E,31.0,ESE,E,...,1021.2,4.437189,4.503167,9.4,20.9,0,0,2017,6,20
145455,Uluru,2.8,23.4,0.0,5.469824,7.624853,E,31.0,SE,ENE,...,1020.3,4.437189,4.503167,10.1,22.4,0,0,2017,6,21
145456,Uluru,3.6,25.3,0.0,5.469824,7.624853,NNW,22.0,SE,N,...,1019.1,4.437189,4.503167,10.9,24.5,0,0,2017,6,22
145457,Uluru,5.4,26.9,0.0,5.469824,7.624853,N,37.0,SE,WNW,...,1016.8,4.437189,4.503167,12.5,26.1,0,0,2017,6,23


In [15]:
from sklearn.preprocessing import LabelEncoder

data['Location']= LabelEncoder().fit_transform(data['Location']) 
data['WindGustDir']= LabelEncoder().fit_transform(data['WindGustDir']) 
data['WindDir9am']= LabelEncoder().fit_transform(data['WindDir9am']) 
data['WindDir3pm']= LabelEncoder().fit_transform(data['WindDir3pm']) 

In [16]:
data

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,year,month,day
0,2,13.4,22.9,0.6,5.469824,7.624853,13,44.0,13,14,...,1007.1,8.000000,4.503167,16.9,21.8,0,0,2008,12,1
1,2,7.4,25.1,0.0,5.469824,7.624853,14,44.0,6,15,...,1007.8,4.437189,4.503167,17.2,24.3,0,0,2008,12,2
2,2,12.9,25.7,0.0,5.469824,7.624853,15,46.0,13,15,...,1008.7,4.437189,2.000000,21.0,23.2,0,0,2008,12,3
3,2,9.2,28.0,0.0,5.469824,7.624853,4,24.0,9,0,...,1012.8,4.437189,4.503167,18.1,26.5,0,0,2008,12,4
4,2,17.5,32.3,1.0,5.469824,7.624853,13,41.0,1,7,...,1006.0,7.000000,8.000000,17.8,29.7,0,0,2008,12,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145454,41,3.5,21.8,0.0,5.469824,7.624853,0,31.0,2,0,...,1021.2,4.437189,4.503167,9.4,20.9,0,0,2017,6,20
145455,41,2.8,23.4,0.0,5.469824,7.624853,0,31.0,9,1,...,1020.3,4.437189,4.503167,10.1,22.4,0,0,2017,6,21
145456,41,3.6,25.3,0.0,5.469824,7.624853,6,22.0,9,3,...,1019.1,4.437189,4.503167,10.9,24.5,0,0,2017,6,22
145457,41,5.4,26.9,0.0,5.469824,7.624853,3,37.0,9,14,...,1016.8,4.437189,4.503167,12.5,26.1,0,0,2017,6,23


In [24]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(data.drop(columns=['RainTomorrow']), data.RainTomorrow, test_size=0.3,
                                                    random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((99535, 24), (42658, 24), (99535,), (42658,))

Для проверки созданных признаков будут использоваться метрика accuracy и следующие алгоритмы: логистическая регрессия (из библиотеки sklearn), градиентный бустинг (LightGBM).

In [25]:
test_results = pd.DataFrame(0,index = ["Base Model","5NN_mean_distance","10NN_mean_distance","25NN_mean_distance",
                                       "50NN_mean_distance","1NN_day","1NN_distance"],
                            columns=["Logistic_Regression","LightGBM"])

In [26]:
import warnings
from sklearn.exceptions import ConvergenceWarning

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    lr = LogisticRegression(random_state=42)
    lr.fit(X_train,y_train)
    y_pred = lr.predict(X_test)


    test_results["Logistic_Regression"].iloc[0] = accuracy_score(y_test,y_pred)

In [27]:
from lightgbm import LGBMClassifier

gb = LGBMClassifier(random_state=42)
gb.fit(X_train,y_train)
y_pred = gb.predict(X_test)

test_results["LightGBM"].iloc[0] = accuracy_score(y_test,y_pred)

In [28]:
test_results

Unnamed: 0,Logistic_Regression,LightGBM
Base Model,0.83928,0.85726
5NN_mean_distance,0.0,0.0
10NN_mean_distance,0.0,0.0
25NN_mean_distance,0.0,0.0
50NN_mean_distance,0.0,0.0
1NN_day,0.0,0.0
1NN_distance,0.0,0.0


Производим генерацию различных признаков на основе  k-ближайших соседей

Сперва рассчитаем для каждого объекта среднее евклидово расстояние до k-соседей

In [29]:
ks = [3,5,10,25]

In [30]:
from sklearn.neighbors import NearestNeighbors
import numpy as np
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)
    for i,k in enumerate(ks):
        NN = NearestNeighbors(n_neighbors=k,metric="euclidean",n_jobs=6)
        nbrs = NN.fit(X_train)
        distances = nbrs.kneighbors()[0]
        X_train[f"Mean_dist_{k}NN"] = np.mean(distances,axis=1)
        distances = nbrs.kneighbors(X_test)[0]
        X_test[f"Mean_dist_{k}NN"] = np.mean(distances,axis=1)

        lr = LogisticRegression(random_state=42)
        lr.fit(X_train,y_train)
        y_pred = lr.predict(X_test)

        test_results["Logistic_Regression"].iloc[i+1] = accuracy_score(y_test,y_pred)

        gb = LGBMClassifier(random_state=42)
        gb.fit(X_train,y_train)
        y_pred = gb.predict(X_test)

        test_results["LightGBM"].iloc[i+1] = accuracy_score(y_test,y_pred)

        X_test = X_test.drop(columns = [f"Mean_dist_{k}NN"])
        X_train = X_train.drop(columns = [f"Mean_dist_{k}NN"])

In [31]:
test_results

Unnamed: 0,Logistic_Regression,LightGBM
Base Model,0.83928,0.85726
5NN_mean_distance,0.839139,0.858338
10NN_mean_distance,0.839421,0.857916
25NN_mean_distance,0.839256,0.857565
50NN_mean_distance,0.838788,0.857354
1NN_day,0.0,0.0
1NN_distance,0.0,0.0


Значение признака Location для ближайшего соседа

In [32]:
test_results = test_results.rename(index={"1NN_day":"1NN_Location"})

In [33]:
NN = NearestNeighbors(n_neighbors=1,metric="euclidean",n_jobs=6)
nbrs = NN.fit(X_train)
distances = nbrs.kneighbors()[1]
l=[]
for d in distances:
    l.append(data["Location"].iloc[d[0]])
    
X_train["1NN_Location"] = l

distances = nbrs.kneighbors(X_test)[1]
l=[]
for d in distances:
    l.append(data["Location"].iloc[d[0]])
X_test["1NN_Location"]  = l

lr = LogisticRegression(random_state=42)
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
    
test_results["Logistic_Regression"].iloc[5] = accuracy_score(y_test,y_pred)

gb = LGBMClassifier(random_state=42)
gb.fit(X_train,y_train)
y_pred = gb.predict(X_test)

test_results["LightGBM"].iloc[5] = accuracy_score(y_test,y_pred)

X_test = X_test.drop(columns = ["1NN_Location"])
X_train = X_train.drop(columns = ["1NN_Location"])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Расстояние для ближайшего соседа

In [34]:
NN = NearestNeighbors(n_neighbors=1,metric="euclidean",n_jobs=6)
nbrs = NN.fit(X_train)
distances = nbrs.kneighbors()[0]
X_train["1NN_dist"] = distances

distances = nbrs.kneighbors(X_test)[0]
X_test["1NN_dist"] = distances
    
lr = LogisticRegression(random_state=42)
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
    
test_results["Logistic_Regression"].iloc[6] = accuracy_score(y_test,y_pred)

gb = LGBMClassifier(random_state=42)
gb.fit(X_train,y_train)
y_pred = gb.predict(X_test)

test_results["LightGBM"].iloc[6] = accuracy_score(y_test,y_pred)

X_test = X_test.drop(columns = ["1NN_dist"])
X_train = X_train.drop(columns = ["1NN_dist"])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Итоговая таблица результатов метрики accuracy при добавлении KNN-признаков 

In [35]:
test_results

Unnamed: 0,Logistic_Regression,LightGBM
Base Model,0.83928,0.85726
5NN_mean_distance,0.839139,0.858338
10NN_mean_distance,0.839421,0.857916
25NN_mean_distance,0.839256,0.857565
50NN_mean_distance,0.838788,0.857354
1NN_Location,0.839796,0.857659
1NN_distance,0.839702,0.857565


Наиболее полезными признаками являются 1NN_Location (значение признака Location для ближайшего соседа), 5NN_mean_distance и 10NN_mean_distance	(среднее расстояние для ближайших 5 и 10 соседей) и 1NN_distance (расстояние для ближайшего соседа).
.

Лучше всего реагирует на добавление KNN-based признаков LightGBM (6 из 6 признаков привели к улучшению метрики), для логистической регрессии (3 из 6 признаков привели к улучшению метрики).

Также изначально использовался StandardScaler для численных переменных (numerical), но с ним результаты были хуже.