#**Machine Learning ИБ-2024**

#**Homework 1.**
#Regression, KNN, LinearRegression.

In this homework we will build models that predict apartment prices in Russia. Some of the dataset's columns are described below.

date - date the listing was published

price - price in rubles

level - floor on which the apartment is located

levels - number of floors in the building

rooms - number of rooms in the apartment. A value of -1 means the unit is classified as "apartments" (апартаменты).

area - total area of the apartment.

kitchen_area - area of the kitchen.

geo_lat - Latitude

geo_lon - Longitude

building_type - building material. 0 - Don't know. 1 - Other. 2 - Panel. 3 - Monolithic. 4 - Brick. 5 - Blocky. 6 - Wooden

#Part 0. Getting started

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


To start working with the data, import the libraries needed for this assignment.

In [43]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns

Install the folium library to display the data on a map by coordinates.

In [44]:
!pip install folium



Unpack the data from the archive.

In [7]:
!unzip /content/drive/MyDrive/archiveData.zip

Archive:  /content/drive/MyDrive/archiveData.zip
  inflating: input_data.csv          


Load the data from the csv file into a dataframe.

**Take a 0.2% sample of the dataset to speed up training and keep Google Colab responsive :)**

In [45]:
df = pd.read_csv('input_data.csv', sep=';')

In [46]:
df = df.sample(frac=0.002, random_state=17).reset_index(drop=True)


In [47]:
print(len(df['date']))

22716


Plot the coordinates of the buildings on a map.

In [48]:
import folium
from IPython.display import display

# Plot only the first rows (index labels 0..1000) to keep the map responsive
map_df = df.loc[:1000]

# Center the map on Moscow
m = folium.Map(location=[55.751244, 37.618423], zoom_start=10)

# Add one marker per listing, using its latitude and longitude
for lat, lon in zip(map_df["geo_lat"], map_df["geo_lon"]):
    folium.Marker(location=[lat, lon]).add_to(m)

display(m)

# Part 1. Preparing the data for machine learning models.

**0.5 points**. The observations in the dataset cover a very wide geography. However, we know that apartment prices in Moscow and St. Petersburg are much higher than the Russian average. Let's create features that indicate whether an apartment lies within 20 kilometers of the center of Moscow or within 20 kilometers of the center of St. Petersburg.

Create two features, is_Moscow and is_Saint_Peterburg. To compute the distance from coordinates, use the haversine_distance function.

In [49]:
from math import radians, cos, sin, asin, sqrt

def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371  # mean Earth radius in kilometers
    return c * r

# Moscow city center
moscow_center_lat = 55.7522
moscow_center_lon = 37.6156
# St. Petersburg city center
spb_center_lat = 59.9386
spb_center_lon = 30.3141

# The task names these features is_Moscow / is_Saint_Peterburg;
# this solution stores them as from_msc / from_spb
fromMsc = []
fromSpb = []
for lat, lon in zip(df['geo_lat'], df['geo_lon']):
    fromMsc.append(int(haversine_distance(lat, lon, moscow_center_lat, moscow_center_lon) <= 20))
    fromSpb.append(int(haversine_distance(lat, lon, spb_center_lat, spb_center_lon) <= 20))
df['from_msc'] = fromMsc
df['from_spb'] = fromSpb
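
The row-by-row loop above can also be written with NumPy array operations. The following is a minimal sketch, not part of the original solution: the helper name `haversine_km` and the three sample points are assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat, lon, center_lat, center_lon):
    """Vectorized haversine distance (km) from arrays of points to one center."""
    lat, lon = np.radians(lat), np.radians(lon)
    center_lat, center_lon = np.radians(center_lat), np.radians(center_lon)
    a = (np.sin((center_lat - lat) / 2) ** 2
         + np.cos(lat) * np.cos(center_lat) * np.sin((center_lon - lon) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical sample points: Moscow center, a point 0.1 degrees north of it,
# and the St. Petersburg center
lats = np.array([55.7522, 55.8522, 59.9386])
lons = np.array([37.6156, 37.6156, 30.3141])

dist = haversine_km(lats, lons, 55.7522, 37.6156)
is_moscow = (dist <= 20).astype(int)
print(is_moscow)
```

With a dataframe, the same call would take `df['geo_lat'].values` and `df['geo_lon'].values` and return the whole distance column at once.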

In [50]:
# First 5 Moscow rows, as a sanity check
print(df[df['from_msc']==1].head())

          date     price  level  levels  rooms  area  kitchen_area    geo_lat  \
17  2021-12-20  15200000     10      16      2  52.0           9.9  55.828060   
18  2021-09-10  11400000      4      18      3  75.9          11.1  55.910578   
20  2021-09-20  18000000      8      22      1  42.9          12.0  55.673731   
25  2021-02-18  10000000      1       5      2  46.0           6.5  55.867146   
47  2021-10-27  22700000      3      10      1  42.0          14.0  55.774104   

      geo_lon  building_type  object_type  postal_code  street_id  id_region  \
17  37.576930              0            0     127322.0   497775.0         77   
18  37.736358              0            0     141000.0        NaN         50   
20  37.496760              0            0     119454.0   525896.0         77   
25  37.665436              2            0     129327.0   226222.0         77   
47  37.589782              0            0     125047.0   426171.0         77   

     house_id  from_msc  from_sp

**0.5 points**. The dataset contains features that we could in principle use, such as postal_code, but handling them would take far too long within this homework. We therefore suggest dropping the unneeded features from the dataframe.

Drop geo_lat, geo_lon, object_type, postal_code, street_id, id_region, house_id.

In [51]:
df = df.drop(columns=["geo_lat", "geo_lon", "object_type", "postal_code", "street_id", "id_region", "house_id"])

**0.5 points**. First, analyze the remaining features (columns) in the dataset. Which columns are categorical? Which are numeric?

Categorical: (your answer)

Numeric: (your answer)

Let's encode the categorical features with one-hot encoding. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [52]:
# Categorical: building_type, from_msc, from_spb
# Numeric: date (once converted to a day count), price, level, levels, rooms, area, kitchen_area
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
categorical_columns = ['building_type', 'from_msc', 'from_spb']
one_hot_encoded = encoder.fit_transform(df[categorical_columns])
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))
df_encoded = pd.concat([df, one_hot_df], axis=1)
df_encoded = df_encoded.drop(categorical_columns, axis=1)
df_encoded.head()

Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,building_type_0,building_type_1,building_type_2,building_type_3,building_type_4,building_type_5,building_type_6,from_msc_0,from_msc_1,from_spb_0,from_spb_1
0,2021-08-22,3900000,7,9,1,40.0,-100.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,2021-08-14,1800000,5,9,1,34.0,-100.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
2,2021-05-03,2300000,15,16,-1,26.65,5.1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,2021-11-17,9000000,3,9,3,55.0,8.2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,2021-11-09,7150000,2,9,3,65.0,8.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
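
Because from_msc and from_spb are already 0/1 flags, one-hot encoding them yields pairs of complementary columns (from_msc_0 is always 1 - from_msc_1). A possible alternative, sketched here with `pandas.get_dummies` instead of the `OneHotEncoder` used above, is to encode only building_type. The toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with hypothetical values; the column names follow the notebook
toy = pd.DataFrame({
    "building_type": [0, 2, 4, 2],
    "from_msc": [1, 0, 0, 1],
    "area": [40.0, 34.0, 26.65, 55.0],
})

# Encode only the multi-valued categorical column; binary flags stay as they are
encoded = pd.get_dummies(toy, columns=["building_type"], dtype=float)
print(encoded.columns.tolist())
```

Only the categories present in the data get a column, so a category that is missing from the sample would have no column here.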


**0.5 points**. Let's work with the numeric features:


1.   Add two features to your dataset. The first is the number of days since the first observation (the difference between listing dates). The second is the ratio of the apartment's floor to the number of floors in the building, which may matter more for price prediction than the floor itself. After adding the new features, the date column can be dropped.
2.   Numeric features can differ by orders of magnitude. Normalize them with StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html



In [53]:
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)

days = []
avg_floor = []

first_date = '2021-01-01'
for i in range(0, len(df_encoded['date'])):
  x = days_between(first_date, df_encoded['date'][i])
  days.append(x)
  if df_encoded['levels'][i] == 0:
    y = 1
  else:
    y = float(df_encoded['level'][i])/float(df_encoded['levels'][i])
  avg_floor.append(y)

df_encoded['days_from'] = days
df_encoded['avg_floor'] = avg_floor

df_encoded.head()

Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,building_type_0,building_type_1,building_type_2,building_type_3,building_type_4,building_type_5,building_type_6,from_msc_0,from_msc_1,from_spb_0,from_spb_1,days_from,avg_floor
0,2021-08-22,3900000,7,9,1,40.0,-100.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,233,0.777778
1,2021-08-14,1800000,5,9,1,34.0,-100.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,225,0.555556
2,2021-05-03,2300000,15,16,-1,26.65,5.1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,122,0.9375
3,2021-11-17,9000000,3,9,3,55.0,8.2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,320,0.333333
4,2021-11-09,7150000,2,9,3,65.0,8.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,312,0.222222
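
Both features can also be computed without an explicit loop. The sketch below is one possible variant on a toy frame with hypothetical rows. It measures days from the earliest date found in the data, which is an assumption: the cell above hard-codes 2021-01-01 as the first date.

```python
import pandas as pd

# Toy frame with hypothetical rows; the column names follow the notebook
toy = pd.DataFrame({
    "date": ["2021-01-01", "2021-08-22", "2021-05-03"],
    "level": [7, 5, 15],
    "levels": [9, 0, 16],
})

dates = pd.to_datetime(toy["date"], format="%Y-%m-%d")
# Days elapsed since the earliest listing in the data
toy["days_from"] = (dates - dates.min()).dt.days

# Relative floor; rows with levels == 0 get the ratio 1, as in the loop above
ratio = toy["level"] / toy["levels"].where(toy["levels"] != 0)
toy["avg_floor"] = ratio.fillna(1.0)
print(toy)
```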


In [54]:
df_encoded = df_encoded.drop(columns=["date"])

In [55]:
# Normalize the numeric features with StandardScaler
from sklearn.preprocessing import StandardScaler
nums = ['price', 'level', 'levels', 'rooms', 'area','kitchen_area', 'days_from', 'avg_floor']
scaler = StandardScaler()
df_encoded[nums] = scaler.fit_transform(df_encoded[nums])
df_encoded.head()

Unnamed: 0,price,level,levels,rooms,area,kitchen_area,building_type_0,building_type_1,building_type_2,building_type_3,building_type_4,building_type_5,building_type_6,from_msc_0,from_msc_1,from_spb_0,from_spb_1,days_from,avg_floor
0,-0.037595,0.112684,-0.380081,-0.623121,-0.495171,-3.025231,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.452432,0.711604
1,-0.061001,-0.262945,-0.380081,-0.623121,-0.722279,-3.025231,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.37289,-0.054575
2,-0.055428,1.615197,0.584187,-2.356186,-1.000487,0.238018,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,-0.65121,1.262296
3,0.019248,-0.638573,-0.380081,1.109944,0.072599,0.33427,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.317449,-0.820755
4,-0.001372,-0.826387,-0.380081,1.109944,0.451113,0.343585,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.237907,-1.203845
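
The cell above fits the scaler on the whole dataset, before any train/test split. A common alternative is to estimate the mean and standard deviation on the training part only and reuse them for the test part. This keeps test rows out of the preprocessing statistics. The sketch below uses synthetic data and is an illustration, not the notebook's own procedure.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(17)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical numeric features
y = rng.normal(size=200)                             # hypothetical target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17
)

scaler = StandardScaler().fit(X_train)    # statistics come from the training part only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # the test part reuses the training mean and std
```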


**2 points**. Implement a KNNRegressor class that performs regression with the k-nearest-neighbors method.

In [56]:
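from sklearn.metrics import pairwise_distances

class KNNRegressor:
    """k-nearest-neighbors regression: predict the mean target of the k closest training points."""

    def __init__(self, n_neighbors=5, metric='euclidean'):
        self.n_neighbors = n_neighbors
        self.metric = metric
        self.X_train = None
        self.Y_train = None

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data
        self.X_train = np.asarray(X)
        self.Y_train = np.asarray(y)

    def predict(self, X):
        # Distance from every query row to every training row
        dst = pairwise_distances(X, self.X_train, metric=self.metric)
        # Indices of the n_neighbors closest training rows for each query row
        neighbours = np.argsort(dst, axis=1)[:, :self.n_neighbors]
        neighbours_val = self.Y_train[neighbours]
        return np.mean(neighbours_val, axis=1)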
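
The same prediction rule can be checked on a tiny example with NumPy alone. The sketch below is a hypothetical standalone function (`knn_predict` is not part of the assignment). It uses `np.argpartition` to pick the k smallest distances without fully sorting each row, which may help when the training set is large.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, n_neighbors=3):
    """Average the targets of the n_neighbors closest training points (Euclidean)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise squared distances via broadcasting: shape (n_query, n_train)
    diff = X_query[:, None, :] - X_train[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    # After argpartition, the first n_neighbors indices of each row
    # point at the n_neighbors smallest distances (in arbitrary order)
    idx = np.argpartition(dist2, n_neighbors - 1, axis=1)[:, :n_neighbors]
    return y_train[idx].mean(axis=1)

# Toy data: the three nearest neighbours of 1.1 are 0, 1 and 2
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
pred = knn_predict(X_train, y_train, np.array([[1.1]]), n_neighbors=3)
print(pred)
```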

**3 points**. Implement a LinearRegression class that supports training with the SGD, Momentum, and AdaGrad variants of gradient descent. Optimize the MSE loss using its gradient.

In [57]:
class LinearRegression:
    def __init__(self, learning_rate=0.01, optimization='SGD', epsilon=1e-8, decay_rate=0.9, max_iter=1000):
        self.learning_rate = learning_rate
        self.optimization = optimization
        self.epsilon = epsilon
        self.decay_rate = decay_rate
        self.max_iter = max_iter
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        if self.optimization == 'Momentum':
            self.momentum_weights = np.zeros(n_features)
            self.momentum_bias = 0
        elif self.optimization == 'AdaGrad':
            self.adagrad_weights = np.zeros(n_features)
            self.adagrad_bias = 0

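        for _ in range(self.max_iter):
            y_pred = np.dot(X, self.weights) + self.bias

            # Gradient of the halved MSE over the full training set; the factor 2
            # of the exact MSE gradient is absorbed into the learning rate
            errors = y_pred - y
            dw = (1 / n_samples) * np.dot(X.T, errors)
            db = (1 / n_samples) * np.sum(errors)

            if self.optimization == 'SGD':
                # Plain gradient step; note that each step uses all samples, not a mini-batch
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db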

            elif self.optimization == 'Momentum':
                self.momentum_weights = self.decay_rate * self.momentum_weights + (1 - self.decay_rate) * dw
                self.momentum_bias = self.decay_rate * self.momentum_bias + (1 - self.decay_rate) * db
                self.weights -= self.learning_rate * self.momentum_weights
                self.bias -= self.learning_rate * self.momentum_bias

            elif self.optimization == 'AdaGrad':
                self.adagrad_weights += dw ** 2
                self.adagrad_bias += db ** 2
                adjusted_dw = dw / (np.sqrt(self.adagrad_weights) + self.epsilon)
                adjusted_db = db / (np.sqrt(self.adagrad_bias) + self.epsilon)
                self.weights -= self.learning_rate * adjusted_dw
                self.bias -= self.learning_rate * adjusted_db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias
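
For reference, the update rules implemented by the class above can be summarized as follows. The notation is an assumption introduced here: $n$ is `n_samples`, $\eta$ is `learning_rate`, $\beta$ is `decay_rate`, $\epsilon$ is `epsilon`, and $v$ and $G$ are the momentum and AdaGrad accumulators. The bias $b$ is updated in the same way using $g_b$.

```latex
\begin{aligned}
& \hat{y} = Xw + b, \qquad e = \hat{y} - y \\
& g_w = \frac{1}{n} X^{\top} e, \qquad g_b = \frac{1}{n} \sum_{i=1}^{n} e_i \\
& \text{SGD:} \quad w \leftarrow w - \eta\, g_w \\
& \text{Momentum:} \quad v \leftarrow \beta v + (1 - \beta)\, g_w, \qquad w \leftarrow w - \eta\, v \\
& \text{AdaGrad:} \quad G \leftarrow G + g_w \odot g_w, \qquad w \leftarrow w - \eta\, \frac{g_w}{\sqrt{G} + \epsilon}
\end{aligned}
```

Here $g_w$ is the gradient of $\frac{1}{2n}\sum_i e_i^2$, that is, half the gradient of the MSE. The missing factor 2 only rescales the effective learning rate.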

# Part 2. Experiments with machine learning models.

**3 points**. Run experiments with the machine learning methods you implemented. Split the data into training and test sets in a 0.8 / 0.2 ratio. Measure the MSE, MAE, and RMSE errors. Also use the KNN regressor (KNeighborsRegressor) and LinearRegression from the sklearn library, and compare the quality of your implementations with the library ones.

**Split the dataset as the task requires**

In [58]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor as SklearnKNNRegressor
from sklearn.linear_model import LinearRegression as SklearnLinearRegression

X = df_encoded.drop('price', axis=1).values
y = df_encoded['price'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)


**Run our own KNN implementation**

In [59]:
knn = KNNRegressor(n_neighbors=7, metric='euclidean')
knn.fit(X_train, y_train)

predict = knn.predict(X_test)

**Run the library KNN**

In [60]:
neigh = SklearnKNNRegressor(n_neighbors=7, metric='euclidean')
neigh.fit(X_train, y_train)

lib_predict = neigh.predict(X_test)

**Write a function that reports the final results for our metrics**

In [61]:
def check(real, predicted):
  mse = mean_squared_error(real, predicted)
  mae = mean_absolute_error(real, predicted)
  rmse = np.sqrt(mse)
  print(f"mse={mse}, mae={mae}, rmse={rmse}")
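
The three metrics printed by `check` follow directly from the residuals. As a minimal sketch, here they are computed with NumPy alone on three made-up values. The function name `regression_errors` is hypothetical.

```python
import numpy as np

def regression_errors(real, predicted):
    """Return MSE, MAE and RMSE computed from the residuals."""
    real = np.asarray(real, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = real - predicted
    mse = np.mean(residuals ** 2)       # mean squared error
    mae = np.mean(np.abs(residuals))    # mean absolute error
    rmse = np.sqrt(mse)                 # root of the MSE, in the units of the target
    return {"mse": mse, "mae": mae, "rmse": rmse}

# Residuals are 1, 0 and -2
errors = regression_errors([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(errors)
```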


**Compare the results**

In [62]:
check(y_test, lib_predict)  # library implementation

mse=0.16813241164169723, mae=0.04026457297678573, rmse=0.4100395244872099


In [63]:
check(y_test, predict)  # our implementation

mse=0.16813243373787465, mae=0.04026636003835215, rmse=0.41003955143116944


**Repeat the same steps for linear regression**

In [64]:
lreg = LinearRegression(learning_rate=0.01, optimization='SGD', epsilon=1e-8, decay_rate=0.9, max_iter=1000)
lreg.fit(X_train, y_train)

linear_predict = lreg.predict(X_test)

In [65]:
lregAda = LinearRegression(learning_rate=0.01, optimization='AdaGrad', epsilon=1e-8, decay_rate=0.9, max_iter=1000)
lregAda.fit(X_train, y_train)

linear_predict_Ada = lregAda.predict(X_test)

In [66]:
lregMomentum = LinearRegression(learning_rate=0.01, optimization='Momentum', epsilon=1e-8, decay_rate=0.9, max_iter=1000)
lregMomentum.fit(X_train, y_train)

linear_predict_Momentum = lregMomentum.predict(X_test)

In [67]:
lib_lreg = SklearnLinearRegression()
lib_lreg.fit(X_train, y_train)

lib_lreg_predict = lib_lreg.predict(X_test)

In [68]:
check(y_test, lib_lreg_predict)  # library implementation

mse=0.00976030343136711, mae=0.04833860963349121, rmse=0.09879424796701025


In [69]:
check(y_test, linear_predict)  # our implementation, SGD

mse=0.009323962495438898, mae=0.044854279336539, rmse=0.09656066743472157


In [70]:
check(y_test, linear_predict_Ada)  # our implementation, AdaGrad

mse=0.009760303431318909, mae=0.04833860963333111, rmse=0.09879424796676631


In [71]:
check(y_test, linear_predict_Momentum)  # our implementation, Momentum

mse=0.009323910293347774, mae=0.044889085938269, rmse=0.09656039712712336
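
Since price was included in the columns passed to StandardScaler, all errors above are expressed in standardized price units rather than rubles. A rough sketch of converting them back is shown below. It uses five prices from the `df_encoded.head()` output and hypothetical predictions, so the resulting numbers are illustrative only.

```python
import numpy as np

# Five prices in rubles taken from the df_encoded.head() output above;
# the real scaler was fitted on all rows, so these statistics are illustrative
prices = np.array([3_900_000, 1_800_000, 2_300_000, 9_000_000, 7_150_000], dtype=float)

mu, sigma = prices.mean(), prices.std()  # StandardScaler uses the population std (ddof=0)
z_true = (prices - mu) / sigma           # standardized targets, as in the notebook

# Hypothetical predictions in standardized units
z_pred = z_true + np.array([0.1, -0.1, 0.1, -0.1, 0.1])

rmse_z = np.sqrt(np.mean((z_true - z_pred) ** 2))
mae_z = np.mean(np.abs(z_true - z_pred))

# RMSE and MAE scale linearly with sigma (MSE scales with sigma ** 2)
rmse_rub = rmse_z * sigma
mae_rub = mae_z * sigma

# Cross-check: invert the scaling first, then measure the error in rubles
pred_rub = z_pred * sigma + mu
rmse_rub_direct = np.sqrt(np.mean((prices - pred_rub) ** 2))
print(rmse_rub, rmse_rub_direct)
```

In the notebook itself, `sigma` and `mu` would correspond to `scaler.scale_[0]` and `scaler.mean_[0]`, because price is the first entry of `nums`.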
