---
# Crash Course Python for Data Science — Predictive Modelling
---
# 01 - Regression modelling
---
## STOP! BEFORE GOING ANY FURTHER...  

Remember, these exercises are open book, open neighbour, open everything! Try to do them on your own before looking at the sample solutions.

---
<br>

### 1. Experiment with Nearest Neighbor parameter

Use the following code to load the same 10 training and test data points from the workshop.

In [26]:
# Run this first!

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

# plot tuning
plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)

columns = ['carat', 'cut', 'price']

features = ['carat', 'cut']
target = 'price'

train = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 422],
        [0.31, 'Ideal', 489],
        [0.42, 'Premium', 737],
        [0.5, 'Ideal', 1415],
        [0.51, 'Premium', 1177],
        [0.7, 'Fair', 1865],
        [0.73, 'Fair', 2351],
        [1.01, 'Good', 3768],
        [1.18, 'Very Good', 3965],
        [1.18, 'Ideal', 4838]])

test  = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 432],
        [0.34, 'Ideal', 687],
        [0.37, 'Premium', 1124],
        [0.4, 'Good', 720],
        [0.51, 'Ideal', 1397],
        [0.51, 'Very Good', 1284],
        [0.59, 'Ideal', 1437],
        [0.7, 'Ideal', 3419],
        [0.9, 'Premium', 3484],
        [0.9, 'Fair', 2964]])

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

Then, train a `KNeighborsRegressor` model with `n_neighbors=1`.

Use both `carat` and `cut` features.

Calculate the mean absolute error on the training data and on the test data.

In [27]:
# Step 1. Create instance of the model

### YOUR CODE GOES HERE ###
KNNmodel = KNeighborsRegressor(n_neighbors=1)

# Step 2. Train the algorithm

### YOUR CODE GOES HERE ###
KNNmodel.fit(train[features], train[target])

# Step 3. Make predictions

### YOUR CODE GOES HERE ###
y_true_train = train[target]
y_true_test = test[target]
training_preds = KNNmodel.predict(train[features])
val_preds = KNNmodel.predict(test[features])

# Step 4. Evaluate the model

### YOUR CODE GOES HERE ###
print(f"\nTraining MAE: {round(mean_absolute_error(y_true_train, training_preds), 4)}")
print(f"\nValidation MAE: {round(mean_absolute_error(y_true_test, val_preds), 4)}")


Training MAE: 0.0

Validation MAE: 1128.8


How do the train error and test error compare to the previous `KNeighborsRegressor` model from the lesson? (The previous model used `n_neighbors=2` and only the `carat` feature.)

Is this new model overfitting or underfitting? Why do you think this is happening here? 
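One way to explore these questions is to refit the model for a few values of `n_neighbors` and compare the errors. The sketch below rebuilds the same 10-row train and test sets so that it runs on its own. The helper name `knn_errors` and the choice of k values are illustrative assumptions, not part of the workshop code.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
columns = ['carat', 'cut', 'price']

# Same 10 training and 10 test rows as in the loading cell above
train = pd.DataFrame(columns=columns, data=[
    [0.3, 'Ideal', 422], [0.31, 'Ideal', 489], [0.42, 'Premium', 737],
    [0.5, 'Ideal', 1415], [0.51, 'Premium', 1177], [0.7, 'Fair', 1865],
    [0.73, 'Fair', 2351], [1.01, 'Good', 3768], [1.18, 'Very Good', 3965],
    [1.18, 'Ideal', 4838]])
test = pd.DataFrame(columns=columns, data=[
    [0.3, 'Ideal', 432], [0.34, 'Ideal', 687], [0.37, 'Premium', 1124],
    [0.4, 'Good', 720], [0.51, 'Ideal', 1397], [0.51, 'Very Good', 1284],
    [0.59, 'Ideal', 1437], [0.7, 'Ideal', 3419], [0.9, 'Premium', 3484],
    [0.9, 'Fair', 2964]])
train['cut'] = train['cut'].map(cut_ranks)
test['cut'] = test['cut'].map(cut_ranks)


def knn_errors(k, features):
    """Hypothetical helper: fit KNN with k neighbours, return (train MAE, test MAE)."""
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(train[features], train['price'])
    train_mae = mean_absolute_error(train['price'], model.predict(train[features]))
    test_mae = mean_absolute_error(test['price'], model.predict(test[features]))
    return train_mae, test_mae


# Compare a few neighbourhood sizes using both features
results = {k: knn_errors(k, ['carat', 'cut']) for k in (1, 2, 3)}
for k, (train_mae, test_mae) in results.items():
    print(f"k={k}: train MAE={train_mae:.1f}, test MAE={test_mae:.1f}")
```

With `n_neighbors=1`, every training point is its own nearest neighbour, so the training error is zero while the test error stays large. Note also that `cut` ranges from 1 to 5 while `carat` spans less than 1 in this sample, so unscaled distances are dominated by `cut`.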



### 2. More data, two features, linear regression

Use the following code to load data for diamonds under $5,000, and split the data into train and test sets. The training data has almost 30,000 rows, and the test data has almost 10,000 rows.

In [28]:
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = sns.load_dataset('diamonds')
df = df[df.price < 5000]
train, test = train_test_split(df.copy(), random_state=0)
train.shape, test.shape

((29409, 10), (9804, 10))

In [29]:
# Run this to check the dataset loaded and looks ok
df.head()

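|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|-------|-----|-------|---------|-------|-------|-------|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |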


In [30]:
# Run this to encode the ordinal features as numbers
cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

Then, train a Linear Regression model with the `carat` and `cut` features. Calculate the mean absolute error on the training data and on the test data.

Use this model to predict the price of a half-carat diamond with a "Very Good" cut.

In [33]:
### YOUR CODE GOES HERE ###
train.isnull().sum()

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64
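A possible shape for the rest of this exercise is sketched below. To stay self-contained it fits on the 10-row sample from exercise 1 (with `cut` already encoded) rather than the full diamonds data, so the numbers it prints will differ from yours. Replace the stand-in `train` and `test` frames with the ones created above to answer the actual question.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

features = ['carat', 'cut']
target = 'price'

# ASSUMPTION: tiny stand-in data (the 10-row sample from exercise 1), not the full dataset
train = pd.DataFrame({
    'carat': [0.3, 0.31, 0.42, 0.5, 0.51, 0.7, 0.73, 1.01, 1.18, 1.18],
    'cut':   [5, 5, 4, 5, 4, 1, 1, 2, 3, 5],
    'price': [422, 489, 737, 1415, 1177, 1865, 2351, 3768, 3965, 4838]})
test = pd.DataFrame({
    'carat': [0.3, 0.34, 0.37, 0.4, 0.51, 0.51, 0.59, 0.7, 0.9, 0.9],
    'cut':   [5, 5, 4, 2, 5, 3, 5, 5, 4, 1],
    'price': [432, 687, 1124, 720, 1397, 1284, 1437, 3419, 3484, 2964]})

# Fit an ordinary least squares model on the two features
model = LinearRegression()
model.fit(train[features], train[target])

# Mean absolute error on the training and test data
train_mae = mean_absolute_error(train[target], model.predict(train[features]))
test_mae = mean_absolute_error(test[target], model.predict(test[features]))
print(f"Training MAE: {train_mae:.2f}")
print(f"Test MAE: {test_mae:.2f}")

# Half-carat diamond with a "Very Good" cut (rank 3 in cut_ranks)
query = pd.DataFrame({'carat': [0.5], 'cut': [3]})
predicted_price = model.predict(query)[0]
print(f"Predicted price: {predicted_price:.2f}")
```

Passing the query as a DataFrame with the same column names as the training features keeps the column order explicit.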

### 3. More data, more features, any model

You choose what features and model type to use! Try to get a better mean absolute error on the test set than your model from the last question.

Refer to [this documentation](https://ggplot2.tidyverse.org/reference/diamonds.html) for more explanation of the features.

Besides `cut`, there are two more ordinal features, `color` and `clarity`, which you'd need to encode as numbers if you want to use them in your model.
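The encoding can follow the same pattern as `cut_ranks`. The sketch below uses the orderings given in the linked documentation: color runs from D (best) to J (worst), and clarity from I1 (worst) to IF (best). The dictionary names `color_ranks` and `clarity_ranks`, and the small `sample` frame built from the rows printed by `df.head()`, are assumptions for illustration.

```python
import pandas as pd

# Higher rank = better grade, mirroring cut_ranks
color_ranks = {'J': 1, 'I': 2, 'H': 3, 'G': 4, 'F': 5, 'E': 6, 'D': 7}
clarity_ranks = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4,
                 'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}

# Stand-in rows copied from the df.head() output above
sample = pd.DataFrame({
    'color':   ['E', 'E', 'E', 'I', 'J'],
    'clarity': ['SI2', 'SI1', 'VS1', 'VS2', 'SI2']})

# astype(int) gives plain integers even if the source column is categorical
sample['color'] = sample['color'].map(color_ranks).astype(int)
sample['clarity'] = sample['clarity'].map(clarity_ranks).astype(int)
print(sample)
```

The same two `map` calls can then be applied to both `train` and `test` so that the encoded columns are usable as model features.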

In [32]:
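# Run this to see the description of the color and clarity features.
# These columns are not stored with the object dtype here (seaborn loads them
# as pandas categoricals), so include=['object'] finds nothing to describe.
# Selecting the columns directly avoids depending on their dtype.
train[['color', 'clarity']].describe()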

### Below I've written an example solution using K-Nearest Neighbors, Linear Regression, and XGBoost, a regression algorithm we didn't cover in the crash course. I strongly encourage you to come up with **your own** solution before looking at mine!

In [None]:
# Add as many extra cells as you need. 


### YOUR CODE GOES HERE ###


