#**Машинное обучение ИБ-2024**

#**Домашнее задание 1.**
#Регрессия, KNN, LinearRegression.

В данной домашней работе мы будем строить модели для предсказания цены квартиры в России. Ниже приведено описание некоторых колонок набора данных.

date - дата публикации объявления

price - цена в рублях

level- этаж, на котором находится квартира

levels - количество этажей в квартире

rooms - количество комнат в квартире. Если значение -1, то квартира считается апартаментами.

area - площадь квартиры.

kitchen_area - площадь кухни.

geo_lat - Latitude

geo_lon - Longitude

building_type - материал застройки. 0 - Don't know. 1 - Other. 2 - Panel. 3 - Monolithic. 4 - Brick. 5 - Blocky. 6 - Wooden

#Часть 0. Начало работы

Для начала работы с данными импортируем библиотеки, которые понадобятся в данном задании.

In [1]:
import math
import pandas as pd
import numpy as np
import matplotlib as plsource
import sklearn
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt

Загрузим библиотеку folium для отображения данных на карте по координатам.

In [2]:
!pip install folium




[notice] A new release of pip is available: 23.2.1 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Распакуем наши данные из архива.

In [3]:
#!unzip ...

Загрузим данные из csv файла в датафрейм.

In [4]:
df = pd.read_csv("input_data.csv", sep=";")

In [6]:
df.head()

Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,geo_lat,geo_lon,building_type,object_type,postal_code,street_id,id_region,house_id
0,2021-01-01,2451300,15,31,1,30.3,0.0,56.780112,60.699355,0,2,620000.0,,66,1632918.0
1,2021-01-01,1450000,5,5,1,33.0,6.0,44.608154,40.138381,0,0,385000.0,,1,
2,2021-01-01,10700000,4,13,3,85.0,12.0,55.54006,37.725112,3,0,142701.0,242543.0,50,681306.0
3,2021-01-01,3100000,3,5,3,82.0,9.0,44.608154,40.138381,0,0,385000.0,,1,
4,2021-01-01,2500000,2,3,1,30.0,9.0,44.738685,37.713668,3,2,353960.0,439378.0,23,1730985.0


Отобразим на карте координаты наших построек.

In [7]:
import folium
from IPython.display import display

map_df = df.loc[:1000]

m = folium.Map(location=[55.751244, 37.618423], zoom_start=10)

# Список точек с широтой и долготой
lats = map_df['geo_lat'].loc[:1000]
longs = map_df['geo_lon'].loc[:1000]
# Добавляем точки на карту
for point in zip(lats, longs):
    folium.Marker(
        location=[point[0], point[1]]
    ).add_to(m)

display(m)

# Часть 1. Подготовим данные для обработки моделями машинного обучения.

**0.5 Балл**. География наших наблюдений в наборе данных крайне большая. Однако мы знаем, что стоимость квартир в Москве и Санкт-Петербурге намного выше, чем в среднем по России. Давайте сделаем признаки, который показывают, находится ли квартира в 20 килиметрах от центра Москвы или находится ли квартира в 20 килиметрах от центра Санкт-Петербурга.

Создайте два признака is_Moscow и is_Saint_Peterburg. Для нахождения расстояния по координатам используйте функцию haversine_distance.

In [8]:
from math import radians


def haversine_distance(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = radians(lat1), radians(lon1), radians(lat2), radians(lon2)
    return 2 * np.arcsin(np.sqrt((np.sin((lat1 - lat2) / 2) ** 2) + np.cos(lat1) * np.cos(lat2) * (np.sin((lon1 - lon2) / 2) ** 2)))

In [9]:
# координаты брал тут - https://time-in.ru/coordinates/saint-petersburg
msk_lat, msk_lon = 55.7522, 37.6156
st_pet_lat, st_pet_lon = 59.9386, 30.3141

In [10]:
# домножаем на средний радиус Земли в метрах и делим на 1000, чтобы получить примерное(с определенной погрешностью) расстояние
# в км между двумя точками
haversine_distance(msk_lat, msk_lon, st_pet_lat, st_pet_lon) * (6371000/1000)

np.float64(634.4331164612089)

In [11]:
df["is_Moscow"] = [False] * len(df)
df["is_Saint_Petersburg"] = [False] * len(df)

for i in tqdm(range(len(df))):
    curr_lat, curr_lon = df.loc[i, "geo_lat"], df.loc[i, "geo_lon"]
    dist_to_msk = haversine_distance(msk_lat, msk_lon, curr_lat, curr_lon) * (6371000/1000)
    dist_to_st_pet = haversine_distance(st_pet_lat, st_pet_lon, curr_lat, curr_lon) * (6371000/1000)
    #df.iloc[i]
    if dist_to_msk <= 20:
        df.loc[i, "is_Moscow"] = True
    if dist_to_st_pet <= 20:
        df.loc[i, "is_Saint_Petersburg"] = True
    

100%|██████████| 11358150/11358150 [12:05<00:00, 15656.26it/s]


**0.5 Балла**. В нашем наборе данных есть признаки, которые мы теоретически можем использовать, например postal_code, но мы это будем делать в рамках домашней работы очень-очень долго. Поэтому предлагается удалить ненужные признаки из датафрейма.

Удалим geo_lat,	geo_lon,	object_type,	postal_code,	street_id,	id_region,	house_id.

In [12]:
df = df.drop(columns=["geo_lat", "geo_lon", "object_type", "postal_code", "street_id", "id_region", "house_id"])
df.head()

Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,building_type,is_Moscow,is_Saint_Petersburg
0,2021-01-01,2451300,15,31,1,30.3,0.0,0,False,False
1,2021-01-01,1450000,5,5,1,33.0,6.0,0,False,False
2,2021-01-01,10700000,4,13,3,85.0,12.0,3,False,False
3,2021-01-01,3100000,3,5,3,82.0,9.0,0,False,False
4,2021-01-01,2500000,2,3,1,30.0,9.0,3,False,False


**0.5 Балл**. Для начала Вам предлагается проанализировать Ваши оставшиеся признаки (колонки) в наборе данных. Какие колонки категориальные? Какие числовые?

Категориальные: rooms, building_type, is_Moscow, is_Saint_Peterburg

Числовые: level, levels, area, kitchen_area

price - целевая переменная

Давайте закодируем категориальные признаки с помощью OneHot-Encoding. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11358150 entries, 0 to 11358149
Data columns (total 10 columns):
 #   Column               Dtype  
---  ------               -----  
 0   date                 object 
 1   price                int64  
 2   level                int64  
 3   levels               int64  
 4   rooms                int64  
 5   area                 float64
 6   kitchen_area         float64
 7   building_type        int64  
 8   is_Moscow            bool   
 9   is_Saint_Petersburg  bool   
dtypes: bool(2), float64(2), int64(5), object(1)
memory usage: 714.9+ MB


In [14]:
df.describe()

Unnamed: 0,price,level,levels,rooms,area,kitchen_area,building_type
count,11358150.0,11358150.0,11358150.0,11358150.0,11358150.0,11358150.0,11358150.0
mean,6787516.0,6.426675,11.76266,1.719417,53.10356,-2.674071,1.01782
std,197711800.0,5.283144,7.218441,1.157606,27.12845,32.41855,1.562077
min,0.0,0.0,0.0,-1.0,1.0,-100.0,0.0
25%,2600000.0,2.0,5.0,1.0,36.5,0.0,0.0
50%,3995000.0,5.0,10.0,2.0,46.7,6.5,0.0
75%,6500000.0,9.0,17.0,2.0,63.0,10.5,2.0
max,635552400000.0,50.0,50.0,9.0,499.9,408.0,6.0


In [15]:
df.rooms.value_counts()

rooms
 1    3947858
 2    3843393
 3    2271401
-1     838919
 4     384776
 5      54027
 6      15459
 7       1425
 8        527
 9        365
Name: count, dtype: int64

In [16]:
df.building_type.value_counts()

building_type
0    7535937
4    1439326
2    1230098
3     718991
1     251398
5     159719
6      22681
Name: count, dtype: int64

In [17]:
# кодирование
from sklearn.preprocessing import OneHotEncoder

#ohe_rooms = OneHotEncoder(drop="first").fit(df.rooms.values.reshape(-1, 1))
#ohe_msk = OneHotEncoder(drop="first").fit(df.is_Moscow.values.reshape(-1, 1))
#ohe_st_pet = OneHotEncoder(drop="first").fit(df.is_Saint_Petersburg.values.reshape(-1, 1))
#ohe_b_type = OneHotEncoder(drop="first").fit(df.building_type.values.reshape(-1, 1))

ohe_rooms = OneHotEncoder().fit(df.rooms.values.reshape(-1, 1))
ohe_msk = OneHotEncoder(drop="if_binary").fit(df.is_Moscow.values.reshape(-1, 1))
ohe_st_pet = OneHotEncoder(drop="if_binary").fit(df.is_Saint_Petersburg.values.reshape(-1, 1))
ohe_b_type = OneHotEncoder().fit(df.building_type.values.reshape(-1, 1))

In [18]:
ohe_rooms.categories_
# [str(item) for item in ohe_rooms.categories_[0]]

[array([-1,  1,  2,  3,  4,  5,  6,  7,  8,  9])]

In [19]:
ohe_rooms.transform(df.rooms.values.reshape(-1, 1)).toarray().shape

(11358150, 10)

In [20]:
ohe_msk.categories_

[array([False,  True])]

In [21]:
ohe_msk.transform(df.is_Moscow.values.reshape(-1, 1)).toarray().shape

(11358150, 1)

In [296]:
new_df = df.copy()

In [297]:
for curr_ohe, col_name in tqdm(zip([ohe_rooms, ohe_msk, ohe_st_pet, ohe_b_type], ["rooms", "is_Moscow", "is_Saint_Petersburg", "building_type"])):
    curr_arr = curr_ohe.transform(df[col_name].values.reshape(-1, 1)).toarray()
    n = curr_arr.shape[1]
    if n == 1:
        new_col_name = col_name + "_ohe"
        new_df[new_col_name] = curr_arr
        new_df = new_df.drop(columns=[col_name])
    else:
        for i in range(n):
            new_col_name = col_name + "_" + str(curr_ohe.categories_[0][i]) + "_ohe"
            new_df[new_col_name] = curr_arr[:, i]
        new_df = new_df.drop(columns=[col_name])

4it [00:08,  2.18s/it]


In [298]:
new_df.head()

Unnamed: 0,date,price,level,levels,area,kitchen_area,rooms_-1_ohe,rooms_1_ohe,rooms_2_ohe,rooms_3_ohe,...,rooms_9_ohe,is_Moscow_ohe,is_Saint_Petersburg_ohe,building_type_0_ohe,building_type_1_ohe,building_type_2_ohe,building_type_3_ohe,building_type_4_ohe,building_type_5_ohe,building_type_6_ohe
0,2021-01-01,2451300,15,31,30.3,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2021-01-01,1450000,5,5,33.0,6.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2021-01-01,10700000,4,13,85.0,12.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,2021-01-01,3100000,3,5,82.0,9.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2021-01-01,2500000,2,3,30.0,9.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


**0.5 Балл**. Поработаем с числовыми признаками:


1.   Добавьте в ваш датасет два признака: количество дней со дня первого наблюдения (разница между датами объявлений). Возможно, для предсказания цены не так важен этаж, как важно отношение этажа квартиры на количество этажей в доме, добавьте этот признак. После добавления нового признака колонку date можно удалить.
2.   Числовые признаки могут иметь разные порядки. Давайте отнормируем числовые признаки с помощью StandartScaller https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.



In [299]:
new_df["date_dt"] = pd.to_datetime(new_df["date"])
min_date = new_df.date_dt.min()

new_df["day_diff"] = new_df["date_dt"].apply(lambda x: (x - min_date).days)

In [300]:
new_df = new_df.drop(columns=["date", "date_dt"])

In [301]:
new_df.sample(5)

Unnamed: 0,price,level,levels,area,kitchen_area,rooms_-1_ohe,rooms_1_ohe,rooms_2_ohe,rooms_3_ohe,rooms_4_ohe,...,is_Moscow_ohe,is_Saint_Petersburg_ohe,building_type_0_ohe,building_type_1_ohe,building_type_2_ohe,building_type_3_ohe,building_type_4_ohe,building_type_5_ohe,building_type_6_ohe,day_diff
3226286,15950000,5,6,29.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,110
9735018,3180000,1,9,42.4,6.1,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,311
9283157,1900000,5,5,47.0,-100.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,297
10684371,3850000,5,9,53.08,10.8,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,339
3393399,5040000,1,6,91.0,17.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,116


In [302]:
new_df[new_df.levels == 0]

Unnamed: 0,price,level,levels,area,kitchen_area,rooms_-1_ohe,rooms_1_ohe,rooms_2_ohe,rooms_3_ohe,rooms_4_ohe,...,is_Moscow_ohe,is_Saint_Petersburg_ohe,building_type_0_ohe,building_type_1_ohe,building_type_2_ohe,building_type_3_ohe,building_type_4_ohe,building_type_5_ohe,building_type_6_ohe,day_diff
33786,16250000,4,0,60.00,13.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,3
39636,4350000,1,0,81.00,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,4
40172,2550000,4,0,47.50,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,4
41315,4620000,0,0,74.30,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,4
79626,2400000,6,0,30.00,6.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11294929,4550000,13,0,60.00,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,360
11298187,6900000,3,0,87.00,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,360
11311944,5500000,14,0,54.67,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,361
11322091,5000000,5,0,48.00,7.7,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,361


In [303]:
# часть квартир имеет 0 в столбце levels, данные примеры я буду считать ошибочными и удалю их из датасета
new_df = new_df[new_df.levels != 0]
new_df.index = list(range(len(new_df)))

In [304]:
# отношение этажа квартиры на количество этажей в доме

new_df["ratio"] = new_df["level"].values / new_df["levels"].values

In [305]:
# применим StandardScaler к столбцам
from sklearn.preprocessing import StandardScaler


numb_cols = ["level", "levels", "area", "kitchen_area", "ratio"]

for col in tqdm(numb_cols):
    new_df[col] = StandardScaler().fit_transform(new_df[col].values.reshape(-1, 1))


100%|██████████| 5/5 [00:01<00:00,  4.31it/s]


In [306]:
new_df.sample(5)

Unnamed: 0,price,level,levels,area,kitchen_area,rooms_-1_ohe,rooms_1_ohe,rooms_2_ohe,rooms_3_ohe,rooms_4_ohe,...,is_Saint_Petersburg_ohe,building_type_0_ohe,building_type_1_ohe,building_type_2_ohe,building_type_3_ohe,building_type_4_ohe,building_type_5_ohe,building_type_6_ohe,day_diff,ratio
9810895,7000000,-0.648608,-0.798759,-0.612036,0.366278,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,313,-0.253538
5802697,2831000,-0.45932,0.032576,-0.376474,0.452641,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,193,-0.822304
1464460,2500000,0.108543,-0.244535,0.438593,0.252156,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,57,0.428982
1613051,7500000,-0.837895,-1.07587,0.560244,0.329266,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,61,-0.253538
6883302,5740000,0.108543,-0.244535,-0.261827,0.082515,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,226,0.428982


**2 Балла**. Реализуйте класс KNNRegressor, который должен делать регрессию методом k ближайших соседей.

In [307]:
class KNNRegressor:
    def __init__(self, n_neighbors=5, metric='euclidean'):
        self.n_neighbors = n_neighbors
        self.p = 1
        if metric == "euclidean":
            self.p = 2

    def fit(self, X: np.ndarray, y: np.ndarray):
        self.X_train = X
        self.y_train = y

    def predict(self, X: np.ndarray):
        #  X: число элементов x число признаков
        distances = self.e_dist(X)
        neighbors_idxs = np.argsort(distances)[:, :self.n_neighbors]
        temp_lst = []
        for arr in tqdm(neighbors_idxs):
            temp_lst.append(np.mean(self.y_train[arr]))
        
        return temp_lst
    
    def e_dist(self, X: np.ndarray):
        mink = lambda x: np.power(np.sum(np.power(np.abs(self.X_train - x), self.p), axis=1), 1 / self.p)
        dist_matrix = np.apply_along_axis(mink,1,X)
        return dist_matrix

**3 Балла**. Реализуйте класс LinearRegression, поддерживающий обучение градиентными спусками SGD, Momentum, AdaGrad. Используйте градиент для оптимизации функции потерь MSE.

In [324]:
from sklearn.metrics import mean_squared_error


class MyLinearRegression:
    def __init__(self, learning_rate=0.01, optimization='SGD', epsilon=1e-8, decay_rate=0.9, max_iter=10, momentum=0.1):
        self.learning_rate = learning_rate
        self.optimization = optimization
        self.epsilon = epsilon
        self.decay_rate = decay_rate
        self.max_iter = max_iter
        self.momentum = momentum
        self.weights = None
        self.bias = None

        self.v_w = None
        self.v_b = None
        self.G_w = None
        self.G_b = None

    def fit(self, X, y):

        self.weights = np.random.normal(size=X.shape[1]).reshape(-1, 1)
        self.bias = np.random.normal()
        #self.weights = np.zeros((X.shape[1], 1))
        #self.bias = 0

        # мне тут надо проиницииализировать веса и смещение???
        if self.optimization == "SGD":
            self.fit_sgd(X, y)
        elif self.optimization == "Momentum":
            self.fit_momentum(X, y)
        elif self.optimization == "Adagrad":  # adagrad
            self.fit_adagrad(X, y)
        else:  # оставляем случайно инициализированные веса
            pass

    def fit_sgd(self, X, y):
        for i in range(self.max_iter):
            # перемешаем X и y
            idxs =np.arange(X.shape[0])
            np.random.shuffle(idxs)
            
            #for j in tqdm(idxs):
            for j in idxs:

                X_curr, y_curr = X[j, :].reshape(1, -1), y[j, :][0]
                
                pred = self.predict(X_curr)
                
                err = pred - y_curr
                
                grad_w = (2 * X_curr * err).reshape(-1, 1) + 2 * self.decay_rate * self.weights
                
                grad_b = 2 * err
                
                self.weights -= self.learning_rate * grad_w
                
                self.bias -= self.learning_rate * grad_b
                
    def fit_momentum(self, X, y):
        self.v_w = np.zeros((X.shape[1], 1))
        self.v_b = 0
        for i in range(self.max_iter):
            # перемешаем X и y
            idxs =np.arange(X.shape[0])
            np.random.shuffle(idxs)
            
            #for j in tqdm(idxs):
            for j in idxs:

                X_curr, y_curr = X[j, :].reshape(1, -1), y[j, :][0]
                
                pred = self.predict(X_curr)
                
                err = pred - y_curr
                
                grad_w = (2 * X_curr * err).reshape(-1, 1) + 2 * self.decay_rate * self.weights
                
                grad_b = 2 * err
                
                self.v_w = self.momentum * self.v_w - self.learning_rate * grad_w
                
                self.v_b = self.momentum * self.v_b - self.learning_rate * grad_b

                self.weights += self.v_w
                
                self.bias += self.v_b

    def fit_adagrad(self, X, y):
        self.G_w = np.zeros((X.shape[1], 1))
        self.G_b = 0
        for i in range(self.max_iter):
            # перемешаем X и y
            idxs =np.arange(X.shape[0])
            np.random.shuffle(idxs)
            
            #for j in tqdm(idxs):
            for j in idxs:

                X_curr, y_curr = X[j, :].reshape(1, -1), y[j, :][0]
                
                pred = self.predict(X_curr)
                
                err = pred - y_curr
                
                grad_w = (2 * X_curr * err).reshape(-1, 1) + 2 * self.decay_rate * self.weights
                
                grad_b = 2 * err

                self.G_w += grad_w ** 2
                
                self.G_b += grad_b ** 2

                self.weights -= self.learning_rate * grad_w / (np.sqrt(self.G_w) + self.epsilon)
                
                self.bias -= self.learning_rate * grad_b / (np.sqrt(self.G_b) + self.epsilon)    

    def predict(self, X):
        return X @ self.weights

# Часть 2. Эксперименты с моделями машинного обучения.

**3 Балла**. Проведите эксперименты с написанными Вами методами машинного обучения. Выделите обучающую и тестовую выборки в отношении 0,8 и 0,2 соответственно. Измерьте ошибку MSE, MAE, RMSE. Заиспользуйте методы KNNRegressor и LinearRegression из библиотеки sklearn, сравните качество Ваших решений и библиотечных.

In [319]:
# для тестирования кода возьмем только 0.2 процента от имеющегося датасета(самописный knn регрессор слишком медленный(( )
new_df_1 = new_df.sample(int(len(new_df) * 0.002))
new_df_1.index = list(range(len(new_df_1)))

In [320]:
len(new_df_1)

22711

In [321]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error


y = new_df_1[["price"]]
X = new_df_1.drop(columns=["price"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [322]:
knn_skl = KNeighborsRegressor(n_neighbors=5)
knn_skl.fit(X_train, y_train)

preds = knn_skl.predict(X_test)

print(mean_squared_error(y_test, preds))
print(mean_absolute_error(y_test, preds))
print(root_mean_squared_error(y_test, preds))

207697844241289.1
3589528.037464231
14411725.928607201


In [323]:
knn_my = KNNRegressor(n_neighbors=5)
knn_my.fit(X_train.to_numpy(), y_train.to_numpy())

preds = knn_my.predict(X_test.to_numpy())

print(mean_squared_error(y_test, preds))
print(mean_absolute_error(y_test, preds))
print(root_mean_squared_error(y_test, preds))

100%|██████████| 4543/4543 [00:00<00:00, 62882.31it/s]


207698244564388.38
3589492.818445961
14411739.817398466


In [325]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error
from sklearn.linear_model import LinearRegression

In [326]:
lr_skl = LinearRegression()
lr_skl.fit(X_train, y_train)

preds = lr_skl.predict(X_test)

print(mean_squared_error(y_test, preds))
print(mean_absolute_error(y_test, preds))
print(root_mean_squared_error(y_test, preds))

117629819103887.9
4183358.4024970196
10845728.150008552


In [327]:
lr_my = MyLinearRegression(max_iter=50, learning_rate=0.000001)
lr_my.fit(X_train.to_numpy(), y_train.to_numpy())

preds = lr_my.predict(X_test.to_numpy())

print(mean_squared_error(y_test, preds))
print(mean_absolute_error(y_test, preds))
print(root_mean_squared_error(y_test, preds))

159809921658628.22
3908009.5688915863
12641594.901697658


In [332]:
lr_my = MyLinearRegression(max_iter=10, learning_rate=0.000001, optimization="Momentum", momentum=0.3)
lr_my.fit(X_train.to_numpy(), y_train.to_numpy())

preds = lr_my.predict(X_test.to_numpy())

print(mean_squared_error(y_test, preds))
print(mean_absolute_error(y_test, preds))
print(root_mean_squared_error(y_test, preds))

175165114956327.5
3852636.461590731
13234995.842701558


In [329]:
lr_my = MyLinearRegression(max_iter=50, learning_rate=10000, optimization="Adagrad")
lr_my.fit(X_train.to_numpy(), y_train.to_numpy())

preds = lr_my.predict(X_test.to_numpy())

print(mean_squared_error(y_test, preds))
print(mean_absolute_error(y_test, preds))
print(root_mean_squared_error(y_test, preds))

188342118168041.8
4272489.190471745
13723779.296099227
