### Course project for the "Python for Data Science" course

Project materials (files):
train.csv
test.csv

Task:
Using the data from the training dataset (train.csv), build a model that predicts real estate (apartment) prices.
Use the resulting model to predict prices for the apartments in the test dataset (test.csv).

Target variable:
Price

Quality metric:
R2, the coefficient of determination (sklearn.metrics.r2_score)
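
As a reference for the metric, here is a minimal sketch of what the coefficient of determination measures, computed by hand as 1 - SS_res / SS_tot. The helper name `r2` and the sample numbers are made up for illustration; in the project itself `sklearn.metrics.r2_score` is the function named by the task.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative values only (not taken from train.csv)
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
score = r2(y_true, y_pred)
print(round(score, 3))  # 0.949
```

A perfect prediction gives 1.0, and always predicting the mean of the true values gives 0.0, so the requirement R2 > 0.6 asks for a model that explains more than 60% of the variance in Price.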

Solution requirements:
1. R2 > 0.6
2. A Jupyter Notebook with the code of your solution, named following the pattern {FullName}_solution.ipynb, for example SShirkin_solution.ipynb
3. A CSV file with the predicted target values for the test dataset, named following the pattern {FullName}_predictions.csv, for example SShirkin_predictions.csv
The file must contain two fields, Id and Price, and 5001 lines (a header plus 5000 predictions).
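
The file requirements above can be checked mechanically before submission. The following is a sketch only: the helper `check_predictions` is hypothetical, and it assumes the header is exactly `Id,Price` in that order, which the task does not state explicitly.

```python
import csv
import io

def check_predictions(csv_text, n_predictions=5000):
    """Return a list of problems found in a predictions CSV (empty list = OK)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    problems = []
    if not rows or rows[0] != ["Id", "Price"]:
        problems.append("header must be exactly: Id,Price")
    if len(rows) != n_predictions + 1:
        problems.append(f"expected {n_predictions + 1} lines, got {len(rows)}")
    # Report only the first malformed data row
    for line_no, row in enumerate(rows[1:], start=2):
        try:
            int(row[0])
            float(row[1])
        except (IndexError, ValueError):
            problems.append(f"line {line_no}: bad Id or Price")
            break
    return problems

# Tiny synthetic file with 3 predictions instead of 5000
sample = "Id,Price\n1,100000.0\n2,150000.5\n3,99000\n"
print(check_predictions(sample, n_predictions=3))  # []
```

For a real submission the text would come from reading the predictions file, with `n_predictions` left at its default of 5000.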

Deadline:
The project must be submitted within 72 hours after the end of the last webinar. Grades for work submitted before the deadline will be presented as a ranking ordered by the specified quality metric. Projects submitted after the deadline, or resubmitted, are not included in the ranking, but you will still be able to find out your result.

Recommendations for the code file (ipynb):
1. The file should contain headings and comments (markdown)
2. Repeated operations are better wrapped in functions
3. Do not print large numbers of table rows (5-10 are enough)
4. Where possible, add plots that describe the data (about 3-5)
5. Include only the best model, i.e. do not include every variant of the solution in the code
6. The project script must run from start to finish (from loading the data to exporting the predictions)
7. The whole project must be in a single script (ipynb file).
8. You may use the Python libraries and machine learning models that were covered in this course.

Dataset description:
- Id - apartment identification number
- DistrictId - district identification number
- Rooms - number of rooms
- Square - area
- LifeSquare - living area
- KitchenSquare - kitchen area
- Floor - floor
- HouseFloor - number of floors in the building
- HouseYear - year the building was constructed
- Ecology_1, Ecology_2, Ecology_3 - environmental indicators of the area
- Social_1, Social_2, Social_3 - social indicators of the area
- Healthcare_1, Helthcare_2 - health care indicators of the area
- Shops_1, Shops_2 - indicators related to the presence of shops and shopping centres
- Price - apartment price

In [255]:
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [293]:
import warnings
warnings.filterwarnings('ignore')

In [256]:
houses_train = pd.read_csv('./train.csv')

In [257]:
houses_train.head()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
0,14038,35,2.0,47.981561,29.442751,6.0,7,9.0,1969,0.08904,B,B,33,7976,5,,0,11,B,184966.93073
1,15053,41,3.0,65.68364,40.049543,8.0,7,9.0,1978,7e-05,B,B,46,10309,1,240.0,1,16,B,300009.450063
2,4765,53,2.0,44.947953,29.197612,0.0,8,12.0,1968,0.049637,B,B,34,7759,0,229.0,1,3,B,220925.908524
3,5809,58,2.0,53.352981,52.731512,9.0,8,17.0,1977,0.437885,B,B,23,5735,3,1084.0,0,5,B,175616.227217
4,10783,99,1.0,39.649192,23.776169,7.0,11,12.0,1976,0.012339,B,B,35,5776,1,2078.0,2,4,B,150226.531644


#### Check the feature types:

In [258]:
houses_train.dtypes

Id                 int64
DistrictId         int64
Rooms            float64
Square           float64
LifeSquare       float64
KitchenSquare    float64
Floor              int64
HouseFloor       float64
HouseYear          int64
Ecology_1        float64
Ecology_2         object
Ecology_3         object
Social_1           int64
Social_2           int64
Social_3           int64
Healthcare_1     float64
Helthcare_2        int64
Shops_1            int64
Shops_2           object
Price            float64
dtype: object

#### Change the types of these features:
- Rooms → int64
- HouseFloor → int64

In [259]:
houses_train['Rooms'] = houses_train['Rooms'].astype('int')
houses_train['HouseFloor'] = houses_train['HouseFloor'].astype('int')
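
Casting a float column to an integer type raises an error if the column contains NaN and silently truncates any fractional values, so it can be worth checking both conditions first. A small sketch on a made-up frame; the helper `safe_to_int` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def safe_to_int(series):
    """Cast a float Series to int64 only if no information would be lost."""
    if series.isna().any():
        raise ValueError(f"{series.name}: contains NaN")
    if not (series % 1 == 0).all():
        raise ValueError(f"{series.name}: contains fractional values")
    return series.astype("int64")

# Made-up values in the same shape as the two columns converted above
demo = pd.DataFrame({"Rooms": [2.0, 3.0, 1.0], "HouseFloor": [9.0, 12.0, 17.0]})
for col in ["Rooms", "HouseFloor"]:
    demo[col] = safe_to_int(demo[col])
print(demo.dtypes)
```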

In [260]:
print(f"Rooms: {houses_train.dtypes['Rooms']}, HouseFloor: {houses_train.dtypes['HouseFloor']}")

Rooms: int64, HouseFloor: int64


In [261]:
houses_train.head()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
0,14038,35,2,47.981561,29.442751,6.0,7,9,1969,0.08904,B,B,33,7976,5,,0,11,B,184966.93073
1,15053,41,3,65.68364,40.049543,8.0,7,9,1978,7e-05,B,B,46,10309,1,240.0,1,16,B,300009.450063
2,4765,53,2,44.947953,29.197612,0.0,8,12,1968,0.049637,B,B,34,7759,0,229.0,1,3,B,220925.908524
3,5809,58,2,53.352981,52.731512,9.0,8,17,1977,0.437885,B,B,23,5735,3,1084.0,0,5,B,175616.227217
4,10783,99,1,39.649192,23.776169,7.0,11,12,1976,0.012339,B,B,35,5776,1,2078.0,2,4,B,150226.531644


#### Check the features of type object

In [262]:
houses_train_obj = houses_train.select_dtypes(include='object')
houses_train_obj.head()

Unnamed: 0,Ecology_2,Ecology_3,Shops_2
0,B,B,B
1,B,B,B
2,B,B,B
3,B,B,B
4,B,B,B


In [263]:
houses_train_obj['Ecology_2'].value_counts()

B    9903
A      97
Name: Ecology_2, dtype: int64

In [264]:
houses_train_obj['Ecology_3'].value_counts()

B    9725
A     275
Name: Ecology_3, dtype: int64

In [265]:
houses_train_obj['Shops_2'].value_counts()

B    9175
A     825
Name: Shops_2, dtype: int64
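
The three cells above repeat the same call, so in line with the recommendation to wrap repeated operations in functions they could be folded into one helper. A sketch on a made-up frame; `category_counts` is a hypothetical name.

```python
import pandas as pd

def category_counts(df, columns):
    """Map each listed column to a {value: count} dictionary."""
    return {col: df[col].value_counts().to_dict() for col in columns}

# Made-up rows; the real columns are Ecology_2, Ecology_3 and Shops_2
demo = pd.DataFrame({
    "Ecology_2": ["B", "B", "A"],
    "Shops_2": ["B", "A", "A"],
    "Price": [1.0, 2.0, 3.0],
})
counts = category_counts(demo, ["Ecology_2", "Shops_2"])
print(counts)
```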

Check that the data in the area columns *Square*, *LifeSquare* and *KitchenSquare* is consistent: *Square* should be greater than both *LifeSquare* and *KitchenSquare*.

In [266]:
# LifeSquare > Square
houses_train.loc[(houses_train['LifeSquare'] > houses_train['Square'])]

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
28,8054,23,1,42.530043,43.967759,1.0,3,9,2014,0.034656,B,B,0,168,0,,0,0,B,95338.198549
44,10521,38,3,104.211396,106.340403,0.0,20,0,2017,0.060753,B,B,15,2787,2,520.0,0,7,B,435462.048070
52,2301,1,2,61.400054,65.224603,0.0,17,22,2016,0.007122,B,B,1,264,0,,0,1,B,199215.452229
123,8753,25,3,85.952306,89.803753,1.0,4,3,2017,0.069753,B,B,53,13670,4,,1,11,B,309688.592681
153,9870,62,1,51.831473,53.491301,1.0,5,1,2015,0.072158,B,B,2,629,1,,0,0,A,131797.472284
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9884,41,30,2,59.015896,59.439082,12.0,5,5,2016,0.000078,B,B,22,6398,141,1046.0,3,23,B,126281.142781
9889,12918,23,2,51.440463,53.134243,51.0,3,17,2017,0.005767,B,B,1,388,0,,0,0,B,88150.012510
9895,2737,27,3,123.430072,125.806981,123.0,5,10,2015,0.017647,B,B,2,469,0,,0,0,B,234194.837047
9902,14001,73,1,44.098768,44.267551,1.0,7,24,2014,0.042032,B,B,37,6856,84,1940.0,2,5,B,381937.404161


Assume that the two values were simply entered in the wrong order and swap them, i.e.
Square, LifeSquare = LifeSquare, Square

In [267]:
h1 = houses_train.loc[(houses_train['LifeSquare'] > houses_train['Square']), ['LifeSquare']]
h2 = houses_train.loc[(houses_train['LifeSquare'] > houses_train['Square']), ['Square']]
houses_train.loc[(houses_train['LifeSquare'] > houses_train['Square']), ['Square']] = h1
houses_train.loc[(houses_train['LifeSquare'] > houses_train['Square']), ['LifeSquare']] = h2
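
One thing to watch with swaps like this: when the right-hand side of a `.loc` assignment is a DataFrame, pandas aligns it to the target on index and column labels, so a frame whose only column is `LifeSquare` may not land in `Square` as intended (unmatched labels become NaN). Recomputing the mask after the first assignment can also select a different set of rows. A minimal sketch on a made-up frame, assuming the goal is a plain positional swap: compute the mask once and assign a NumPy array so that no label alignment takes place.

```python
import pandas as pd

# Made-up values; rows 0 and 2 have LifeSquare > Square
demo = pd.DataFrame({"Square": [40.0, 50.0, 60.0],
                     "LifeSquare": [45.0, 30.0, 66.0]})

# Build the mask once, before either column is modified
mask = demo["LifeSquare"] > demo["Square"]

# Select the columns in swapped order and drop the labels with .to_numpy(),
# so the values are written back by position
demo.loc[mask, ["Square", "LifeSquare"]] = (
    demo.loc[mask, ["LifeSquare", "Square"]].to_numpy()
)
print(demo)
```

After the swap, counting the non-null values in both columns is a quick way to confirm that no NaN was introduced.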

In [268]:
# Check the result of the swap; there should be 0 rows
houses_train.loc[(houses_train['LifeSquare'] > houses_train['Square'])]

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price


Check KitchenSquare against Square: Square should always be greater than KitchenSquare

In [269]:
houses_train.loc[(houses_train['KitchenSquare'] > houses_train['Square'])]

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
1064,14656,62,1,47.100719,46.44796,2014.0,4,1,2014,0.072158,B,B,2,629,1,,0,0,A,108337.484207
5149,13703,42,1,38.071692,19.723548,73.0,9,10,2006,0.158249,B,B,21,5731,0,,1,0,B,160488.033165
7088,6569,27,1,38.220258,18.716856,84.0,4,17,2018,0.011654,B,B,4,915,0,,0,0,B,99079.960518
8584,14679,81,1,32.276663,19.278394,1970.0,6,1,1977,0.006076,B,B,30,5285,0,645.0,6,6,B,105539.556275


For the rows with Id 13703 and 6569, do the same as for LifeSquare, i.e. swap the KitchenSquare and Square values. The rows with Id 14656 and 14679 clearly contain an error: the kitchen area is far too large. These are outliers, and we will deal with them later together with the other outliers.

In [270]:
h1 = houses_train.loc[(houses_train['KitchenSquare'] > houses_train['Square']), ['KitchenSquare']]
h2 = houses_train.loc[(houses_train['KitchenSquare'] > houses_train['Square']), ['Square']]
houses_train.loc[(houses_train['KitchenSquare'] > houses_train['Square']), ['Square']] = h1
houses_train.loc[(houses_train['KitchenSquare'] > houses_train['Square']), ['KitchenSquare']] = h2
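
The text above singles out two Ids for the swap, whereas a mask based only on `KitchenSquare > Square` selects all four rows. A sketch of how the mask might be narrowed with `isin`, using rounded values from the table above; the variable `swap_ids` is introduced here for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "Id": [14656, 13703, 6569, 14679],
    "Square": [47.1, 38.1, 38.2, 32.3],
    "KitchenSquare": [2014.0, 73.0, 84.0, 1970.0],
})

swap_ids = [13703, 6569]  # rows judged to be simple mix-ups
mask = (demo["KitchenSquare"] > demo["Square"]) & demo["Id"].isin(swap_ids)

# Assign a NumPy array so the values are written by position, not by label
demo.loc[mask, ["Square", "KitchenSquare"]] = (
    demo.loc[mask, ["KitchenSquare", "Square"]].to_numpy()
)
print(demo)
```

The rows with Id 14656 and 14679 are left untouched, so they remain available for the outlier step.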

In [271]:
# Check the result of the swap; there should be 0 rows
houses_train.loc[(houses_train['KitchenSquare'] > houses_train['Square'])]

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price


#### Handling outliers
Look at the summary statistics for each feature by calling describe()

In [272]:
houses_train.describe()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Price
count,10000.0,10000.0,10000.0,9514.0,7887.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,5202.0,10000.0,10000.0,10000.0
mean,8383.4077,50.4008,1.8905,56.200741,37.199645,6.2733,8.5267,12.6094,3990.166,0.118858,24.687,5352.1574,8.0392,1142.90446,1.3195,4.2313,214138.857399
std,4859.01902,43.587592,0.839512,20.555715,86.241209,28.560917,5.241148,6.775974,200500.3,0.119025,17.532614,4006.799803,23.831875,1021.517264,1.493601,4.806341,92872.293865
min,0.0,0.0,0.0,2.377248,0.370619,0.0,1.0,0.0,1910.0,0.0,0.0,168.0,0.0,0.0,0.0,0.0,59174.778028
25%,4169.5,20.0,1.0,41.827352,22.769832,1.0,4.0,9.0,1974.0,0.017647,6.0,1564.0,0.0,350.0,0.0,1.0,153872.633942
50%,8394.5,36.0,2.0,52.466545,32.78126,6.0,7.0,13.0,1977.0,0.075424,25.0,5285.0,2.0,900.0,1.0,3.0,192269.644879
75%,12592.5,75.0,2.0,65.84278,45.128803,9.0,12.0,17.0,2001.0,0.195781,36.0,7227.0,5.0,1548.0,2.0,6.0,249135.462171
max,16798.0,209.0,19.0,641.065193,7480.592129,2014.0,42.0,117.0,20052010.0,0.521867,74.0,19083.0,141.0,4849.0,6.0,23.0,633233.46657


The maximum values of LifeSquare and KitchenSquare are very large. Since the maximum of Square is 641.06 and neither LifeSquare nor KitchenSquare should exceed it, display the rows where LifeSquare or KitchenSquare is greater than that maximum:

In [273]:
houses_train.loc[(houses_train['LifeSquare'] > houses_train['Square'].max()) | (houses_train['KitchenSquare'] > houses_train['Square'].max()), :]

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
1064,14656,62,1,,46.44796,2014.0,4,1,2014,0.072158,B,B,2,629,1,,0,0,A,108337.484207
4328,16550,27,3,,7480.592129,1.0,9,17,2016,0.017647,B,B,2,469,0,,0,0,B,217357.492366
8584,14679,81,1,,19.278394,1970.0,6,1,1977,0.006076,B,B,30,5285,0,645.0,6,6,B,105539.556275


Since Square is missing in these rows while the living or kitchen area is clearly too large, replace those values with the corresponding medians:

In [274]:
houses_train.loc[(houses_train['LifeSquare'] > houses_train['Square'].max()), 'LifeSquare'] = houses_train['LifeSquare'].median()

In [275]:
houses_train.loc[(houses_train['KitchenSquare'] > houses_train['Square'].max()), 'KitchenSquare'] = houses_train['KitchenSquare'].median()

In [276]:
# Check the result of the replacement; there should be 0 rows
houses_train.loc[(houses_train['LifeSquare'] > houses_train['Square'].max()) | (houses_train['KitchenSquare'] > houses_train['Square'].max()), :]

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price


Check the summary statistics again by calling describe()

In [277]:
houses_train.describe()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Price
count,10000.0,10000.0,10000.0,9514.0,7887.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,5202.0,10000.0,10000.0,10000.0
mean,8383.4077,50.4008,1.8905,56.200741,36.25533,5.8761,8.5267,12.6094,3990.166,0.118858,24.687,5352.1574,8.0392,1142.90446,1.3195,4.2313,214138.857399
std,4859.01902,43.587592,0.839512,20.555715,20.273876,5.174014,5.241148,6.775974,200500.3,0.119025,17.532614,4006.799803,23.831875,1021.517264,1.493601,4.806341,92872.293865
min,0.0,0.0,0.0,2.377248,0.370619,0.0,1.0,0.0,1910.0,0.0,0.0,168.0,0.0,0.0,0.0,0.0,59174.778028
25%,4169.5,20.0,1.0,41.827352,22.769832,1.0,4.0,9.0,1974.0,0.017647,6.0,1564.0,0.0,350.0,0.0,1.0,153872.633942
50%,8394.5,36.0,2.0,52.466545,32.78126,6.0,7.0,13.0,1977.0,0.075424,25.0,5285.0,2.0,900.0,1.0,3.0,192269.644879
75%,12592.5,75.0,2.0,65.84278,45.125018,9.0,12.0,17.0,2001.0,0.195781,36.0,7227.0,5.0,1548.0,2.0,6.0,249135.462171
max,16798.0,209.0,19.0,641.065193,638.163193,123.0,42.0,117.0,20052010.0,0.521867,74.0,19083.0,141.0,4849.0,6.0,23.0,633233.46657


The next feature with anomalous data is HouseYear (year of construction), whose maximum is 20052011. Display the rows where HouseYear is greater than the current year, 2020:

In [278]:
houses_train[(houses_train['HouseYear'] > 2020)]

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
1497,10814,109,1,37.26507,20.239714,9.0,9,12,20052011,0.13633,B,B,30,6141,10,262.0,3,6,B,254084.534396
4189,11607,147,2,44.791836,28.360393,5.0,4,9,4968,0.319809,B,B,25,4756,16,2857.0,5,8,B,243028.603096


For Id 10814, it looks like a full date, 20.05.2011, was entered instead of the year, so replace this value with 2011:

In [282]:
houses_train.loc[(houses_train['HouseYear'] == 20052011), 'HouseYear'] = 2011

For Id 11607, the error is presumably in the first digit, so replace 4968 with 1968:

In [283]:
houses_train.loc[(houses_train['HouseYear'] == 4968), 'HouseYear'] = 1968
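
The two manual corrections can also be written as a single mapping, followed by a range check that flags anything still implausible. This is a sketch on a made-up Series; the lower bound of 1900 and the names `fixes` and `suspicious` are assumptions made here, not part of the original notebook.

```python
import pandas as pd

CURRENT_YEAR = 2020  # the upper bound used in this notebook
fixes = {20052011: 2011, 4968: 1968}  # the two corrections described above

years = pd.Series([1969, 20052011, 4968, 2014], name="HouseYear")
years = years.replace(fixes)

# Anything still outside the assumed plausible range needs a closer look
suspicious = years[(years < 1900) | (years > CURRENT_YEAR)]
print(years.tolist())   # [1969, 2011, 1968, 2014]
print(len(suspicious))  # 0
```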

In [284]:
houses_train.describe()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Price
count,10000.0,10000.0,10000.0,9514.0,7887.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,5202.0,10000.0,10000.0,10000.0
mean,8383.4077,50.4008,1.8905,56.200741,36.25533,5.8761,8.5267,12.6094,1984.8663,0.118858,24.687,5352.1574,8.0392,1142.90446,1.3195,4.2313,214138.857399
std,4859.01902,43.587592,0.839512,20.555715,20.273876,5.174014,5.241148,6.775974,18.412271,0.119025,17.532614,4006.799803,23.831875,1021.517264,1.493601,4.806341,92872.293865
min,0.0,0.0,0.0,2.377248,0.370619,0.0,1.0,0.0,1910.0,0.0,0.0,168.0,0.0,0.0,0.0,0.0,59174.778028
25%,4169.5,20.0,1.0,41.827352,22.769832,1.0,4.0,9.0,1974.0,0.017647,6.0,1564.0,0.0,350.0,0.0,1.0,153872.633942
50%,8394.5,36.0,2.0,52.466545,32.78126,6.0,7.0,13.0,1977.0,0.075424,25.0,5285.0,2.0,900.0,1.0,3.0,192269.644879
75%,12592.5,75.0,2.0,65.84278,45.125018,9.0,12.0,17.0,2001.0,0.195781,36.0,7227.0,5.0,1548.0,2.0,6.0,249135.462171
max,16798.0,209.0,19.0,641.065193,638.163193,123.0,42.0,117.0,2020.0,0.521867,74.0,19083.0,141.0,4849.0,6.0,23.0,633233.46657


#### Handling missing values
Check the number of non-null values for each feature by calling info()

In [285]:
houses_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             10000 non-null  int64  
 1   DistrictId     10000 non-null  int64  
 2   Rooms          10000 non-null  int64  
 3   Square         9514 non-null   float64
 4   LifeSquare     7887 non-null   float64
 5   KitchenSquare  10000 non-null  float64
 6   Floor          10000 non-null  int64  
 7   HouseFloor     10000 non-null  int64  
 8   HouseYear      10000 non-null  int64  
 9   Ecology_1      10000 non-null  float64
 10  Ecology_2      10000 non-null  object 
 11  Ecology_3      10000 non-null  object 
 12  Social_1       10000 non-null  int64  
 13  Social_2       10000 non-null  int64  
 14  Social_3       10000 non-null  int64  
 15  Healthcare_1   5202 non-null   float64
 16  Helthcare_2    10000 non-null  int64  
 17  Shops_1        10000 non-null  int64  
 18  Shops_2

Square, LifeSquare and Healthcare_1 have missing values.
Replace the missing Square values with the median

In [289]:
houses_train.loc[houses_train['Square'].isna(), 'Square'] = houses_train['Square'].median()

Missing LifeSquare values cannot simply be replaced with the median, because the result could exceed Square. Let's try linear regression instead, with Square and KitchenSquare as the input features and LifeSquare as the target.
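
As a compact picture of where this approach could lead, here is a sketch of regression-based imputation on synthetic data: fit on the rows where LifeSquare is known, predict the missing rows, then cap the prediction at Square. The capping step and all values are assumptions of this sketch, not something the notebook itself does.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the three area columns (not real data)
demo = pd.DataFrame({
    "Square":        [40.0, 50.0, 60.0, 70.0, 55.0, 45.0],
    "KitchenSquare": [6.0, 8.0, 9.0, 10.0, 8.0, 6.0],
    "LifeSquare":    [25.0, 31.0, 38.0, 45.0, np.nan, np.nan],
})
features = ["Square", "KitchenSquare"]
known = demo["LifeSquare"].notna()

# Fit only on rows where the target is present
model = LinearRegression().fit(demo.loc[known, features],
                               demo.loc[known, "LifeSquare"])

# Predict the missing rows, then cap at Square so that
# LifeSquare <= Square holds (the cap is an assumption of this sketch)
predicted = model.predict(demo.loc[~known, features])
demo.loc[~known, "LifeSquare"] = np.minimum(
    predicted, demo.loc[~known, "Square"].to_numpy()
)
print(demo)
```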

In [317]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

First build the table X1, containing the Square and KitchenSquare features for the rows where LifeSquare is not null:

In [303]:
X1 = houses_train.loc[(houses_train['LifeSquare'].notna()), ['Square', 'KitchenSquare']]

Then build the table y1, containing the corresponding LifeSquare values:

In [304]:
y1 = houses_train.loc[(houses_train['LifeSquare'].notna()), ['LifeSquare']]

Split the data into training and validation sets, holding out 15% for validation (test_size=0.15)

In [334]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.15)
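
For reference, `test_size=0.15` sends 15% of the rows to the validation part and leaves 85% for training. The sketch below shows this on 100 made-up rows; `random_state=42` is an addition made here so that the split is reproducible between runs, and is not used in the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 made-up rows, 2 features
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42  # random_state is an added assumption
)
print(len(X_train), len(X_test))  # 85 15
```

Without a fixed `random_state`, each run of the notebook produces a different split, so the error values reported further down will vary slightly from run to run.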

Create and fit the model:

In [335]:
lr1 = LinearRegression()
lr1.fit(X1_train, y1_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Predict the target values for the validation set:

In [336]:
y1_pred = lr1.predict(X1_test)

In [347]:
check_test1 = pd.DataFrame({
    "y1_test": y1_test['LifeSquare'],
    "y1_pred": y1_pred.flatten(),
    #"error": check_test1["y1_pred"] - check_test1["y1_test"]
})
df1 = pd.concat([X1_test, check_test1], axis=1)
df1["sq-lsq"] = df1["Square"] - df1["y1_test"]
df1[(df1['sq-lsq'] < 0 )]

Unnamed: 0,Square,KitchenSquare,y1_test,y1_pred,sq-lsq
5185,52.466545,1.0,63.923208,36.861351,-11.456663
498,52.466545,1.0,60.823136,36.861351,-8.356591
2684,52.466545,1.0,78.324716,36.861351,-25.858171
4600,52.466545,0.0,73.22528,37.254104,-20.758735
8314,52.466545,0.0,63.595616,37.254104,-11.129071
1009,52.466545,1.0,80.762909,36.861351,-28.296364
342,52.466545,1.0,78.533293,36.861351,-26.066748
6227,52.466545,1.0,80.101945,36.861351,-27.6354
3915,52.466545,0.0,81.452946,37.254104,-28.986401
9469,52.466545,0.0,87.730225,37.254104,-35.26368


In [338]:
mean_squared_error(check_test1["y1_test"], check_test1["y1_pred"])

183.9372075820431

In [341]:
mean_absolute_error(check_test1["y1_test"], check_test1["y1_pred"])

7.561951338990085
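
The MSE above is in squared units, so its square root (RMSE) is easier to compare with the MAE, which is in the same units as LifeSquare. A short worked calculation using the two values reported above:

```python
import math

mse = 183.9372075820431  # value reported above
mae = 7.561951338990085  # value reported above

rmse = math.sqrt(mse)
print(round(rmse, 2))        # 13.56
print(round(rmse / mae, 2))  # 1.79
```

RMSE is never smaller than MAE, and a gap this wide suggests that a relatively small number of large errors dominates the squared loss. That would be consistent with the rows shown earlier in which the true LifeSquare exceeds Square.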

In [292]:
houses_train.describe()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Price
count,10000.0,10000.0,10000.0,10000.0,7887.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,5202.0,10000.0,10000.0,10000.0
mean,8383.4077,50.4008,1.8905,56.019259,36.25533,5.8761,8.5267,12.6094,1984.8663,0.118858,24.687,5352.1574,8.0392,1142.90446,1.3195,4.2313,214138.857399
std,4859.01902,43.587592,0.839512,20.066012,20.273876,5.174014,5.241148,6.775974,18.412271,0.119025,17.532614,4006.799803,23.831875,1021.517264,1.493601,4.806341,92872.293865
min,0.0,0.0,0.0,2.377248,0.370619,0.0,1.0,0.0,1910.0,0.0,0.0,168.0,0.0,0.0,0.0,0.0,59174.778028
25%,4169.5,20.0,1.0,42.160931,22.769832,1.0,4.0,9.0,1974.0,0.017647,6.0,1564.0,0.0,350.0,0.0,1.0,153872.633942
50%,8394.5,36.0,2.0,52.466545,32.78126,6.0,7.0,13.0,1977.0,0.075424,25.0,5285.0,2.0,900.0,1.0,3.0,192269.644879
75%,12592.5,75.0,2.0,65.121818,45.125018,9.0,12.0,17.0,2001.0,0.195781,36.0,7227.0,5.0,1548.0,2.0,6.0,249135.462171
max,16798.0,209.0,19.0,641.065193,638.163193,123.0,42.0,117.0,2020.0,0.521867,74.0,19083.0,141.0,4849.0,6.0,23.0,633233.46657


In [None]:
plt.hist(houses_train['Price'])
# plt.hist(houses_train['coeff_1']> 1000 , bins=20, density=True, label='coeff_1', alpha=0.5)
plt.show()
# houses_train['coeff_1'].hist()
# plt.ylabel('count')
# plt.xlabel('coeff_1')