#**Machine Learning IB-2024**

#**Homework 1.**
#Regression, KNN, LinearRegression.

In this homework we will build models that predict apartment prices in Russia. Some of the dataset columns are described below.

date - date the listing was published

price - price in rubles

level - floor on which the apartment is located

levels - number of floors in the building

rooms - number of rooms in the apartment. If the value is -1, the unit is classified as apartments (апартаменты).

area - total area of the apartment.

kitchen_area - kitchen area.

geo_lat - Latitude

geo_lon - Longitude

building_type - building material. 0 - Don't know. 1 - Other. 2 - Panel. 3 - Monolithic. 4 - Brick. 5 - Blocky. 6 - Wooden

#Part 0. Getting started

To start working with the data, we import the libraries needed for this assignment.

In [1]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns

We install the folium library to display the data on a map by coordinates.

In [6]:
%pip install folium

Note: you may need to restart the kernel to use updated packages.


We load the data from the csv file into a dataframe.

In [61]:
df = pd.read_csv('input_data.csv', sep=';')

In [62]:
df.head()

Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,geo_lat,geo_lon,building_type,object_type,postal_code,street_id,id_region,house_id
0,2021-01-01,2451300,15,31,1,30.3,0.0,56.780112,60.699355,0,2,620000.0,,66,1632918.0
1,2021-01-01,1450000,5,5,1,33.0,6.0,44.608154,40.138381,0,0,385000.0,,1,
2,2021-01-01,10700000,4,13,3,85.0,12.0,55.54006,37.725112,3,0,142701.0,242543.0,50,681306.0
3,2021-01-01,3100000,3,5,3,82.0,9.0,44.608154,40.138381,0,0,385000.0,,1,
4,2021-01-01,2500000,2,3,1,30.0,9.0,44.738685,37.713668,3,2,353960.0,439378.0,23,1730985.0


We plot the coordinates of the buildings on a map.

In [24]:
import folium
from IPython.display import display

# Plot only the first rows to keep the map responsive
# (.loc[:1000] is label-based and includes row 1000)
map_df = df.loc[:1000]

# Centre the map on Moscow
m = folium.Map(location=[55.751244, 37.618423], zoom_start=10)

# Add one marker per listing at (latitude, longitude)
for lat, lon in zip(map_df['geo_lat'], map_df['geo_lon']):
    folium.Marker(location=[lat, lon]).add_to(m)

display(m)

# Part 1. Preparing the data for machine learning models.

**0.5 points**. The observations in our dataset span a very wide geography. However, we know that apartment prices in Moscow and Saint Petersburg are much higher than the Russian average. Let's create features that indicate whether an apartment lies within 20 kilometers of the center of Moscow or within 20 kilometers of the center of Saint Petersburg.

Create two features, is_Moscow and is_Saint_Peterburg. Use the haversine_distance function to compute the distance from coordinates.
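For reference, the haversine formula gives the great-circle distance $d$ between two points with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ (all in radians) on a sphere of radius $R$. Taking $R \approx 6371$ km for the Earth is an assumption of this note, not a value stated in the assignment.

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right), \qquad d = 2R \arcsin\sqrt{a}
$$

A listing gets is_Moscow = 1 when its distance $d$ to the center of Moscow is below 20 km, and is_Saint_Peterburg is defined the same way for Saint Petersburg.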

In [63]:
def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, NumPy arrays or Series in degrees."""
    earth_radius_km = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# City-center coordinates (latitude, longitude)
MOSCOW = (55.75222, 37.61556)
SAINT_PETERSBURG = (59.93863, 30.31413)

# Vectorized over the whole column instead of a Python loop over every row
df['is_Moscow'] = (haversine_distance(df['geo_lat'], df['geo_lon'], *MOSCOW) < 20).astype(int)
df['is_Saint_Peterburg'] = (haversine_distance(df['geo_lat'], df['geo_lon'], *SAINT_PETERSBURG) < 20).astype(int)

df.head()

Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,geo_lat,geo_lon,building_type,object_type,postal_code,street_id,id_region,house_id,is_Moscow,is_Saint_Peterburg
0,2021-01-01,2451300,15,31,1,30.3,0.0,56.780112,60.699355,0,2,620000.0,,66,1632918.0,0,0
1,2021-01-01,1450000,5,5,1,33.0,6.0,44.608154,40.138381,0,0,385000.0,,1,,0,0
2,2021-01-01,10700000,4,13,3,85.0,12.0,55.54006,37.725112,3,0,142701.0,242543.0,50,681306.0,0,0
3,2021-01-01,3100000,3,5,3,82.0,9.0,44.608154,40.138381,0,0,385000.0,,1,,0,0
4,2021-01-01,2500000,2,3,1,30.0,9.0,44.738685,37.713668,3,2,353960.0,439378.0,23,1730985.0,0,0


**0.5 points**. Our dataset contains features that we could in theory use, such as postal_code, but doing so within this homework would take far too long. We therefore suggest removing the unneeded features from the dataframe.

We drop geo_lat, geo_lon, object_type, postal_code, street_id, id_region, house_id.

In [64]:
list_to_drop = ['geo_lat', 'geo_lon', 'object_type', 'postal_code', 'street_id', 'id_region', 'house_id']
df.drop(list_to_drop, axis=1, inplace=True)

df.head()

Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,building_type,is_Moscow,is_Saint_Peterburg
0,2021-01-01,2451300,15,31,1,30.3,0.0,0,0,0
1,2021-01-01,1450000,5,5,1,33.0,6.0,0,0,0
2,2021-01-01,10700000,4,13,3,85.0,12.0,3,0,0
3,2021-01-01,3100000,3,5,3,82.0,9.0,0,0,0
4,2021-01-01,2500000,2,3,1,30.0,9.0,3,0,0


**0.5 points**. To begin with, analyze the remaining features (columns) in the dataset. Which columns are categorical? Which are numeric?

Categorical: building_type, is_Moscow, is_Saint_Peterburg

Numeric: price, level, levels, rooms, area, kitchen_area

Let's encode the categorical features with one-hot encoding. is_Moscow and is_Saint_Peterburg are already binary, so only building_type needs encoding. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [81]:
from sklearn.preprocessing import OneHotEncoder

df_onehot = df.copy()
enc = OneHotEncoder(sparse_output = False)
 
encoded_df = pd.DataFrame(enc.fit_transform(df_onehot[['building_type']]))
encoded_df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [82]:
# 0 - Don't know. 1 - Other. 2 - Panel. 3 - Monolithic. 4 - Brick. 5 - Blocky. 6 - Wooden
encoded_df.columns = ["building_type_Don't know", 'building_type_Other', 'building_type_Panel', 'building_type_Monolithic', 'building_type_Brick', 'building_type_Blocky', 'building_type_Wooden']
df_onehot = df_onehot.join(encoded_df)
df_onehot.drop('building_type', axis=1, inplace=True)
df_onehot.head()

Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,is_Moscow,is_Saint_Peterburg,building_type_Don't know,building_type_Other,building_type_Panel,building_type_Monolithic,building_type_Brick,building_type_Blocky,building_type_Wooden
0,2021-01-01,2451300,15,31,1,30.3,0.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2021-01-01,1450000,5,5,1,33.0,6.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2021-01-01,10700000,4,13,3,85.0,12.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,2021-01-01,3100000,3,5,3,82.0,9.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2021-01-01,2500000,2,3,1,30.0,9.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
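What one-hot encoding does to building_type can be sketched in plain NumPy. The helper `one_hot` and the list `BUILDING_TYPES` are hypothetical names introduced for illustration. Fixing the category list up front is a design choice of this sketch: the column names stay aligned even if some building type never occurs in the data.

```python
import numpy as np

# Category names in code order 0..6, as listed in the column description
BUILDING_TYPES = ["Don't know", "Other", "Panel", "Monolithic", "Brick", "Blocky", "Wooden"]

def one_hot(codes, n_categories):
    """Return a (n_rows, n_categories) matrix with a single 1.0 per row."""
    codes = np.asarray(codes, dtype=int)
    encoded = np.zeros((codes.size, n_categories))
    encoded[np.arange(codes.size), codes] = 1.0
    return encoded

codes = [0, 0, 3, 0, 3]  # building_type of the first five rows shown above
encoded = one_hot(codes, len(BUILDING_TYPES))
columns = [f"building_type_{name}" for name in BUILDING_TYPES]
print(columns[3], encoded[2])
```

Each row sums to one, and the position of the 1.0 is the original category code.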


**0.5 points**. Let's work with the numeric features:


1.   Add two features to your dataset. The first is the number of days since the first observation (the difference between listing dates). The second is the ratio of the apartment's floor to the number of floors in the building, since the floor itself may matter less for the price than this ratio. Once the new features are added, the date column can be dropped.



In [98]:
df_onehot['level_ratio'] = df_onehot['level'] / df_onehot['levels']
df_onehot.drop(['level', 'levels'], axis=1, inplace=True)
df_onehot.head()

Unnamed: 0,date,price,rooms,area,kitchen_area,is_Moscow,is_Saint_Peterburg,building_type_Don't know,building_type_Other,building_type_Panel,building_type_Monolithic,building_type_Brick,building_type_Blocky,building_type_Wooden,level_ratio
0,2021-01-01,2451300,1,30.3,0.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.483871
1,2021-01-01,1450000,1,33.0,6.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,2021-01-01,10700000,3,85.0,12.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.307692
3,2021-01-01,3100000,3,82.0,9.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6
4,2021-01-01,2500000,1,30.0,9.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.666667


In [117]:
# Days elapsed since the first observation, computed in one vectorized pass
dates = pd.to_datetime(df_onehot['date'], format='%Y-%m-%d')
df_onehot['days'] = (dates - dates.min()).dt.days
df_onehot.head()

Unnamed: 0,date,price,rooms,area,kitchen_area,is_Moscow,is_Saint_Peterburg,building_type_Don't know,building_type_Other,building_type_Panel,building_type_Monolithic,building_type_Brick,building_type_Blocky,building_type_Wooden,level_ratio,days
0,2021-01-01,2451300,1,30.3,0.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.483871,0
1,2021-01-01,1450000,1,33.0,6.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
2,2021-01-01,10700000,3,85.0,12.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.307692,0
3,2021-01-01,3100000,3,82.0,9.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0
4,2021-01-01,2500000,1,30.0,9.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.666667,0


In [119]:
df_onehot.drop("date", axis=1, inplace=True)
df_onehot

Unnamed: 0,price,rooms,area,kitchen_area,is_Moscow,is_Saint_Peterburg,building_type_Don't know,building_type_Other,building_type_Panel,building_type_Monolithic,building_type_Brick,building_type_Blocky,building_type_Wooden,level_ratio,days
0,2451300,1,30.3,0.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.483871,0
1,1450000,1,33.0,6.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.000000,0
2,10700000,3,85.0,12.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.307692,0
3,3100000,3,82.0,9.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.600000,0
4,2500000,1,30.0,9.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.666667,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11358145,6099000,3,65.0,0.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.444444,364
11358146,2490000,2,56.9,0.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.100000,364
11358147,850000,2,37.0,5.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.000000,364
11358148,4360000,1,36.0,9.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.000000,364


2.   Numeric features can differ by orders of magnitude. Let's normalize the numeric features with StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [127]:
from sklearn.preprocessing import StandardScaler

# Standardize the two area columns to zero mean and unit variance
scaled_cols = ["area", "kitchen_area"]
area_scaled = pd.DataFrame(StandardScaler().fit_transform(df_onehot[scaled_cols]), columns=scaled_cols)
area_scaled

Unnamed: 0,area,kitchen_area
0,-0.840577,0.082486
1,-0.741051,0.267565
2,1.175756,0.452644
3,1.065171,0.360105
4,-0.851636,0.360105
...,...,...
11358145,0.438523,0.082486
11358146,0.139943,0.082486
11358147,-0.593604,0.236719
11358148,-0.630466,0.360105


In [128]:
df_onehot.drop(["area", "kitchen_area"], axis=1, inplace=True)
df_onehot = df_onehot.join(area_scaled)

df_onehot.head()

Unnamed: 0,price,rooms,is_Moscow,is_Saint_Peterburg,building_type_Don't know,building_type_Other,building_type_Panel,building_type_Monolithic,building_type_Brick,building_type_Blocky,building_type_Wooden,level_ratio,days,area,kitchen_area
0,2451300,1,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.483871,0,-0.840577,0.082486
1,1450000,1,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,-0.741051,0.267565
2,10700000,3,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.307692,0,1.175756,0.452644
3,3100000,3,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0,1.065171,0.360105
4,2500000,1,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.666667,0,-0.851636,0.360105
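The scaling step can be sketched by hand to show what standardization computes: each column is shifted by its mean and divided by its standard deviation. The helpers `fit_standardizer` and `standardize` are hypothetical names, and the toy numbers are not taken from the dataset. The sketch also estimates the statistics on the training part only and reuses them for the test part, which is a common way to avoid leaking test information. The notebook itself scales the full dataframe before splitting.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-column mean and standard deviation estimated on the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)            # population std (ddof=0), as StandardScaler uses
    std = np.where(std == 0, 1.0, std)   # avoid dividing by zero for constant columns
    return mean, std

def standardize(X, mean, std):
    return (X - mean) / std

# Toy area / kitchen_area values (hypothetical)
X_train = np.array([[30.0, 6.0], [50.0, 8.0], [70.0, 10.0]])
X_test = np.array([[50.0, 12.0]])

mean, std = fit_standardizer(X_train)
Z_train = standardize(X_train, mean, std)
Z_test = standardize(X_test, mean, std)   # test data reuses the training statistics
print(Z_train.mean(axis=0), Z_train.std(axis=0))
```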


**2 points**. Implement a KNNRegressor class that performs regression with the k-nearest-neighbors method.

In [163]:
class KNNRegressor:
  def __init__(self, n_neighbors=5, metric='euclidean'):
    self.n_neighbors = n_neighbors
    self.metric = metric  # only the Euclidean metric is implemented below

  def fit(self, X, y):
    # KNN is a lazy learner: fitting only stores the training data
    self.X_train = np.asarray(X, dtype=float)
    self.y_train = np.asarray(y, dtype=float).ravel()

  def predict(self, X):
    return np.array([self.make_prediction(x) for x in X])

  def distances(self, x_test_iter):
    # Euclidean distance from one test point to every training point
    return np.sqrt(np.sum((self.X_train - x_test_iter) ** 2, axis=1))

  def make_prediction(self, x_test_iter):
    dist = self.distances(x_test_iter)
    k_nearest = np.argsort(dist)[:self.n_neighbors]
    targets = self.y_train[k_nearest]

    # Regression: the prediction is the mean target of the k nearest neighbors
    # (calling .argmax() on that scalar mean always returned 0)
    return np.mean(targets)

In [167]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

feature_cols = ["is_Moscow", "is_Saint_Peterburg", "building_type_Don't know", "building_type_Other", "building_type_Panel", "building_type_Monolithic", "building_type_Brick",
                "building_type_Blocky", "building_type_Wooden", "level_ratio", "days", "area", "kitchen_area"]

# 80/20 split, as Part 2 of the assignment asks; the target is passed as a 1-D array
X_train, X_test, y_train, y_test = train_test_split(df_onehot[feature_cols].values, df_onehot["price"].values, test_size=0.2, random_state=0)

help_me = KNNRegressor()
help_me.fit(X_train, y_train)
knn_pred_res = help_me.predict(X_test)
# accuracy_score is a classification metric; a regressor is scored with an error metric such as MSE
knn_mse = mean_squared_error(y_test, knn_pred_res)

print(f'KNN regressor MSE: {knn_mse}')
print(knn_pred_res)

# The problem with this code is that on a large sample, finding the nearest neighbors takes too long. For every test object it computes the distance to every training object, so the total run time becomes prohibitive.

KeyboardInterrupt: 
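One way to reduce the per-object Python overhead behind the interrupted run is to compute distances for a whole batch of test points with a single matrix product. The sketch below is an illustration and not the notebook's implementation: `knn_predict_batched` and its `batch_size` parameter are hypothetical. It relies on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b and uses `np.argpartition` to pick the k smallest distances without a full sort.

```python
import numpy as np

def knn_predict_batched(X_train, y_train, X_test, n_neighbors=5, batch_size=1024):
    """Hypothetical helper: mean target of the k nearest training points, batch by batch."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float).ravel()
    X_test = np.asarray(X_test, dtype=float)
    k = min(n_neighbors, len(X_train))
    train_sq = np.sum(X_train ** 2, axis=1)  # ||b||^2, computed once
    preds = np.empty(len(X_test))
    for start in range(0, len(X_test), batch_size):
        batch = X_test[start:start + batch_size]
        # Squared distances of shape (batch, n_train); the ordering matches Euclidean distance
        d2 = np.sum(batch ** 2, axis=1)[:, None] + train_sq[None, :] - 2.0 * batch @ X_train.T
        # Indices of the k smallest distances per row, in arbitrary order
        nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
        preds[start:start + len(batch)] = y_train[nearest].mean(axis=1)
    return preds

# Toy check on a single feature (hypothetical data)
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 100.0])
toy_pred = knn_predict_batched(X_toy, y_toy, np.array([[1.1], [9.0]]), n_neighbors=3)
print(toy_pred)
```

The cost is still proportional to the number of training rows times the number of test rows, so with millions of listings it may remain impractical. Working on a random subsample, or using a tree-based index as sklearn's KNeighborsRegressor can, are the usual ways out.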

**3 points**. Implement a LinearRegression class that supports training with the SGD, Momentum, and AdaGrad gradient descent variants. Use the gradient of the MSE loss function for optimization.

In [None]:
class LinearRegression:
    def __init__(self, learning_rate=0.01, optimization='SGD', epsilon=1e-8, decay_rate=0.9, max_iter=1000):
        self.learning_rate = learning_rate
        self.optimization = optimization
        self.epsilon = epsilon
        self.decay_rate = decay_rate
        self.max_iter = max_iter
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        ...

    def predict(self, X):
        ...
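The skeleton above leaves fit and predict unimplemented. A minimal sketch of how they could be filled in follows, under stated assumptions: the class name `LinearRegressionSketch` and the `batch_size` and `random_state` parameters are hypothetical additions, and `decay_rate` is interpreted as the momentum coefficient. For a mini-batch of size m the MSE gradient with respect to the parameters is (2/m) X^T (Xθ - y), with the bias handled as an extra column of ones.

```python
import numpy as np

class LinearRegressionSketch:
    """Hypothetical sketch: linear regression trained on MSE with SGD, Momentum or AdaGrad."""

    def __init__(self, learning_rate=0.01, optimization='SGD', epsilon=1e-8,
                 decay_rate=0.9, max_iter=1000, batch_size=32, random_state=0):
        self.learning_rate = learning_rate
        self.optimization = optimization
        self.epsilon = epsilon
        self.decay_rate = decay_rate      # assumption: used as the momentum coefficient
        self.max_iter = max_iter
        self.batch_size = batch_size      # assumption: not part of the original skeleton
        self.random_state = random_state  # assumption: added for reproducibility
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).ravel()
        n_samples, n_features = X.shape
        rng = np.random.default_rng(self.random_state)
        theta = np.zeros(n_features + 1)    # last entry is the bias
        velocity = np.zeros_like(theta)     # Momentum state
        grad_sq_sum = np.zeros_like(theta)  # AdaGrad state
        for _ in range(self.max_iter):
            idx = rng.choice(n_samples, size=min(self.batch_size, n_samples), replace=False)
            Xb = np.hstack([X[idx], np.ones((len(idx), 1))])
            error = Xb @ theta - y[idx]
            grad = 2.0 * Xb.T @ error / len(idx)  # gradient of the mini-batch MSE
            if self.optimization == 'SGD':
                theta -= self.learning_rate * grad
            elif self.optimization == 'Momentum':
                velocity = self.decay_rate * velocity + self.learning_rate * grad
                theta -= velocity
            elif self.optimization == 'AdaGrad':
                grad_sq_sum += grad ** 2
                theta -= self.learning_rate * grad / (np.sqrt(grad_sq_sum) + self.epsilon)
            else:
                raise ValueError(f"Unknown optimization: {self.optimization}")
        self.weights, self.bias = theta[:-1], theta[-1]
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.weights + self.bias

# Toy check on noiseless data y = 3x + 2 (hypothetical)
X_lin = np.linspace(-1, 1, 200).reshape(-1, 1)
y_lin = 3 * X_lin.ravel() + 2
model = LinearRegressionSketch(learning_rate=0.1, optimization='Momentum', max_iter=2000).fit(X_lin, y_lin)
print(model.weights, model.bias)
```

On the real data the price target is in the millions while the features are of order one, so the learning rate would likely need tuning, or the target rescaling, for the updates to stay stable.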

# Part 2. Experiments with machine learning models.

**3 points**. Run experiments with the machine learning methods you implemented. Split the data into training and test sets in a 0.8 to 0.2 ratio. Measure the MSE, MAE, and RMSE errors. Use KNeighborsRegressor and LinearRegression from the sklearn library and compare the quality of your solutions with the library ones.
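The three error measures requested here can be sketched directly in NumPy. The helper `regression_errors` is a hypothetical name and the toy prices are made up. sklearn.metrics also provides mean_squared_error and mean_absolute_error for the same purpose.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Hypothetical helper returning MSE, MAE and RMSE as a dict."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    return {"MSE": mse, "MAE": mae, "RMSE": np.sqrt(mse)}

# Toy prices in rubles (hypothetical values)
y_true_toy = [2_000_000, 3_000_000, 5_000_000]
y_pred_toy = [2_500_000, 3_000_000, 4_000_000]
errors = regression_errors(y_true_toy, y_pred_toy)
print(errors)
```

RMSE and MAE are in rubles, like the target, which makes them easier to interpret than MSE. RMSE penalizes large misses more strongly than MAE does.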

In [166]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

sk_knn_reg = KNeighborsRegressor()
sk_knn_reg.fit(X_train, y_train)
sk_knn_reg_pred_res = sk_knn_reg.predict(X_test)
sk_knn_reg_r2 = r2_score(y_test, sk_knn_reg_pred_res)

print(f'sk KNN regressor R2 score: {sk_knn_reg_r2}')
print(sk_knn_reg_pred_res)

[[ 0.00000000e+00  0.00000000e+00  1.00000000e+00 ...  1.87000000e+02
   2.54214373e-01  3.29258142e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  1.44000000e+02
  -2.61849069e-01 -3.00216806e+00]
 [ 0.00000000e+00  0.00000000e+00  1.00000000e+00 ...  1.61000000e+02
  -8.51635861e-01  2.67565064e-01]
 ...
 [ 0.00000000e+00  0.00000000e+00  1.00000000e+00 ...  8.10000000e+01
  -5.27253126e-01  2.67565064e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  5.70000000e+01
  -1.54950213e-01  3.29258142e-01]
 [ 0.00000000e+00  0.00000000e+00  1.00000000e+00 ...  2.69000000e+02
  -3.35572418e-01  2.67565064e-01]]


ValueError: Input X contains NaN.
KNeighborsRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
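The error above means that at least one feature value is NaN. The notebook does not show which column is affected. One possible cause, stated here as an assumption, is level_ratio: dividing level by levels gives NaN for 0 / 0 and inf for a nonzero floor divided by 0. The sketch below locates non-finite values and drops the affected rows before fitting. `drop_non_finite_rows` is a hypothetical helper and the toy values are made up.

```python
import numpy as np

def drop_non_finite_rows(X, y):
    """Hypothetical helper: keep only rows where every feature and the target are finite."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    mask = np.isfinite(X).all(axis=1) & np.isfinite(y)
    return X[mask], y[mask], mask

# Toy level / levels values (hypothetical): 0 / 0 yields NaN, 1 / 0 yields inf
level = np.array([3.0, 0.0, 1.0, 5.0])
levels = np.array([5.0, 0.0, 0.0, 9.0])
with np.errstate(divide="ignore", invalid="ignore"):
    level_ratio = level / levels
X_raw = np.column_stack([level_ratio, [0.1, 0.2, 0.3, 0.4]])
y_raw = np.array([1.0, 2.0, 3.0, 4.0])

X_clean, y_clean, mask = drop_non_finite_rows(X_raw, y_raw)
print(np.isnan(X_raw).sum(axis=0))  # NaN count per column
print(mask)
```

On the dataframe itself, `df_onehot.isna().sum()` would show which columns contain missing values, which helps decide between dropping rows and imputing.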