## Tutors - expected math exam results

#### Predict average math exam results for students of the tutors

Your task in this competition is to predict the average math exam score
achieved by the students of the tutors listed in the test.csv dataset.
You are given two datasets: train.csv (features and the target variable) and test.csv (features only).

https://www.kaggle.com/c/tutors-expected-math-exam-results

Evaluation metric: the coefficient of determination:

https://en.wikipedia.org/wiki/Coefficient_of_determination

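\[ R^2 = 1 - \frac{\sigma^2}{\sigma_y^2} \]

Here \( \sigma^2 \) is the variance of the model's residuals and \( \sigma_y^2 \) is the variance of the target variable.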
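The metric can be computed with NumPy alone. The sketch below assumes that \( \sigma^2 \) is estimated as the mean squared residual and \( \sigma_y^2 \) as the variance of the observed target, so the ratio reduces to a ratio of sums of squares. The helper name `r2_score` and the predicted values are illustrative and are not part of the competition materials.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when y_true is constant.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Target values of the first five training rows; predictions are made up.
y_true = np.array([63.0, 86.0, 53.0, 56.0, 59.0])
y_pred = np.array([60.0, 80.0, 55.0, 55.0, 60.0])
print(r2_score(y_true, y_pred))
```

A perfect prediction yields 1, and always predicting the mean of the target yields 0.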



You can only use these imports:
``` python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
```

### Plan
 - Domain analysis.
 - Data cleaning and formatting.
 - Exploratory data analysis.
 - Feature engineering and selection.
 - Comparing the metrics of several machine learning models.
 - Hyperparameter tuning of the best model.
 - Evaluating the best model on the test set.
 - Interpreting the model's results.

### Domain analysis

A tutor profile includes the following fields (this is a hypothesis; the task statement does not describe them):

- **age** - age
- **years_of_experience** - number of years in the profession
- **lesson_price** - price of a lesson
- **qualification** - qualification
- **physics**, **chemistry**, **biology**, **english**, **geography**, **history** - additional qualifications
- **mean_exam_points** - average exam score (target variable)

The additional-qualification fields **physics**, **chemistry**, **biology**, **english**, **geography**, and **history** raise the most questions.
It is reasonable to assume that an additional qualification in **physics** or **chemistry** correlates with teaching math
and therefore affects the target variable.

>#### Hypothesis-01-EXT
Check the correlation of the **physics** and **chemistry** features with the target variable
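One way to check this hypothesis is to compute the Pearson correlation of each feature with `mean_exam_points`. The sketch below is only an illustration: the helper `target_correlations` is a hypothetical name, and the toy frame holds six rows copied from the `train` output shown in this notebook, far too few to draw any conclusion. On the real data the same call would be applied to `train`.

```python
import pandas as pd

def target_correlations(df, features, target="mean_exam_points"):
    """Pearson correlation of each feature with the target,
    ordered by absolute value (strongest first)."""
    corr = df[features].corrwith(df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy sample: rows 0-4 and 9995 of the training set, three columns only.
toy = pd.DataFrame({
    "physics":          [1, 1, 0, 0, 0, 1],
    "chemistry":        [0, 0, 0, 0, 0, 1],
    "mean_exam_points": [63, 86, 53, 56, 59, 78],
})

result = target_correlations(toy, ["physics", "chemistry"])
print(result)
```

Because the subject columns are binary, the Pearson coefficient here is the point-biserial correlation, so its sign shows whether tutors with the extra qualification have higher or lower average scores.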

>#### Hypothesis-02-EXT
Drop the **biology**, **english**, **geography**, and **history** features
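If this hypothesis is accepted, the columns can be dropped in one call. This is a minimal sketch: `EXTRA_SUBJECTS` and `drop_extra_subjects` are hypothetical names, and the one-row frame is a stand-in for `train`. The same function would need to be applied to both `train` and `test` so that their feature sets stay aligned.

```python
import pandas as pd

# Columns named in the hypothesis above.
EXTRA_SUBJECTS = ["biology", "english", "geography", "history"]

def drop_extra_subjects(df, columns=EXTRA_SUBJECTS):
    """Return a copy of df without the listed columns.

    errors="ignore" keeps the call safe if a column is already absent.
    """
    return df.drop(columns=columns, errors="ignore")

# One-row stand-in with the feature columns of the training set.
sample = pd.DataFrame([{
    "age": 40.0, "years_of_experience": 0.0, "lesson_price": 1400.0,
    "qualification": 1.0, "physics": 1.0, "chemistry": 0.0,
    "biology": 0.0, "english": 0.0, "geography": 1.0, "history": 0.0,
}])

reduced = drop_extra_subjects(sample)
print(list(reduced.columns))
```

Returning a new frame instead of using `inplace=True` leaves the original data available for comparing models with and without these features.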

### Data cleaning and formatting

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

In [6]:
train.head()

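|   | Id | age | years_of_experience | lesson_price | qualification | physics | chemistry | biology | english | geography | history | mean_exam_points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 40.0 | 0.0 | 1400.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 63.0 |
| 1 | 1 | 48.0 | 4.0 | 2850.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 86.0 |
| 2 | 2 | 39.0 | 0.0 | 1200.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 53.0 |
| 3 | 3 | 46.0 | 5.0 | 1400.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 56.0 |
| 4 | 4 | 43.0 | 1.0 | 1500.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 59.0 |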


> Drop the unneeded columns

In [15]:
train.drop(columns='Id', inplace=True)
train

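|   | age | years_of_experience | lesson_price | qualification | physics | chemistry | biology | english | geography | history | mean_exam_points |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.0 | 0.0 | 1400.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 63.0 |
| 1 | 48.0 | 4.0 | 2850.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 86.0 |
| 2 | 39.0 | 0.0 | 1200.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 53.0 |
| 3 | 46.0 | 5.0 | 1400.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 56.0 |
| 4 | 43.0 | 1.0 | 1500.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 59.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 44.0 | 0.0 | 1700.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 78.0 |
| 9996 | 51.0 | 0.0 | 1700.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64.0 |
| 9997 | 34.0 | 1.0 | 1250.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 58.0 |
| 9998 | 33.0 | 3.0 | 1100.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 51.0 |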


In [16]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  10000 non-null  float64
 1   years_of_experience  10000 non-null  float64
 2   lesson_price         10000 non-null  float64
 3   qualification        10000 non-null  float64
 4   physics              10000 non-null  float64
 5   chemistry            10000 non-null  float64
 6   biology              10000 non-null  float64
 7   english              10000 non-null  float64
 8   geography            10000 non-null  float64
 9   history              10000 non-null  float64
 10  mean_exam_points     10000 non-null  float64
dtypes: float64(11)
memory usage: 859.5 KB


In [10]:
train.describe().T


|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 10000.0 | 4999.5 | 2886.89568 | 0.0 | 2499.75 | 4999.5 | 7499.25 | 9999.0 |
| age | 10000.0 | 45.878 | 8.043929 | 23.0 | 40.0 | 46.0 | 51.0 | 68.0 |
| years_of_experience | 10000.0 | 1.9868 | 1.772213 | 0.0 | 0.0 | 2.0 | 3.0 | 10.0 |
| lesson_price | 10000.0 | 1699.105 | 524.886654 | 200.0 | 1300.0 | 1500.0 | 2150.0 | 3950.0 |
| qualification | 10000.0 | 1.7195 | 0.792264 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| physics | 10000.0 | 0.375 | 0.484147 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| chemistry | 10000.0 | 0.1329 | 0.339484 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| biology | 10000.0 | 0.1096 | 0.312406 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| english | 10000.0 | 0.0537 | 0.225436 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| geography | 10000.0 | 0.0321 | 0.176274 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [19]:
train.isnull().sum()

age                    0
years_of_experience    0
lesson_price           0
qualification          0
physics                0
chemistry              0
biology                0
english                0
geography              0
history                0
mean_exam_points       0
dtype: int64

> There are no missing values


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier

### Exploratory data analysis

