# Lambda School, Intro to Data Science, Day 7 — More Regression!

## Assignment

### 1. Experiment with Nearest Neighbor parameter

Using the same 10 training data points from the lesson, train a `KNeighborsRegressor` model with `n_neighbors=1`.

Use both `carat` and `cut` features.

Calculate the mean absolute error on the training data and on the test data.
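
As a quick reminder of what the metric measures, here is a minimal sketch showing that `mean_absolute_error` is the average of the absolute differences between true and predicted values. The three prices and predictions are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true prices and predictions, for illustration only.
y_true = np.array([400.0, 500.0, 900.0])
y_pred = np.array([450.0, 500.0, 600.0])

# MAE: mean of the absolute differences between truth and prediction.
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Because the errors are not squared, one large miss (300 here) raises the score in proportion to its size rather than dominating it.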

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

columns = ['carat', 'cut', 'price']

train = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 422],
        [0.31, 'Ideal', 489],
        [0.42, 'Premium', 737],
        [0.5, 'Ideal', 1415],
        [0.51, 'Premium', 1177],
        [0.7, 'Fair', 1865],
        [0.73, 'Fair', 2351],
        [1.01, 'Good', 3768],
        [1.18, 'Very Good', 3965],
        [1.18, 'Ideal', 4838]])

test  = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 432],
        [0.34, 'Ideal', 687],
        [0.37, 'Premium', 1124],
        [0.4, 'Good', 720],
        [0.51, 'Ideal', 1397],
        [0.51, 'Very Good', 1284],
        [0.59, 'Ideal', 1437],
        [0.7, 'Ideal', 3419],
        [0.9, 'Premium', 3484],
        [0.9, 'Fair', 2964]])

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

In [0]:
features = ['carat', 'cut']
target = 'price'
model = KNeighborsRegressor(n_neighbors=1)

# train model
model.fit(train[features], train[target])

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                    weights='uniform')

In [0]:
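# test model
# NOTE: this call re-fits the model on the test set. fit() discards the
# earlier fit on the training data, so the errors computed below are for a
# model trained on the test rows, not the training rows.
model.fit(test[features], test[target])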

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                    weights='uniform')

In [0]:
# training model mean absolute error
true_train = train[target]
predict_train = model.predict(train[features])
train_error = mean_absolute_error(true_train, predict_train)
print("Training error:", round(train_error))

Training error: 938.0


In [0]:
# test model mean absolute error
true_test = test[target]
predict_test = model.predict(test[features])
test_error = mean_absolute_error(true_test, predict_test)
print("Testing error:", round(test_error))

Testing error: 0.0


How do the train error and test error compare to the previous `KNeighborsRegressor` model from the lesson? (The previous model used `n_neighbors=2` and only the `carat` feature.)

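- The previous model used `n_neighbors=2` and only `carat`; this model uses `n_neighbors=1` and both `carat` and `cut`. The numbers printed above are not directly comparable to the lesson's, because the model was re-fit on the test set before either error was computed. In both configurations, a `KNeighborsRegressor` can only return prices (or averages of prices) it has already seen, so it cannot extrapolate to prices above those in its fitted data, for example at a higher `carat`.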

Is this new model overfitting or underfitting? Why do you think this is happening here? 

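- This model is overfitting, not underfitting. With `n_neighbors=1`, every point the model was fitted on is its own nearest neighbor, so the error on the fitted data is exactly 0 whenever the feature rows are distinct: the model simply memorizes those points. Because the model above was last fitted on the test set, the 0 reported as "Testing error" is really the error on the data it memorized, and the 938 reported as "Training error" is its error on rows it had not seen. Zero error on fitted data combined with a large error on unseen data is the signature of overfitting. The training data are the same 10 points as in the lesson, not a larger dataset; what changed is that `k` dropped from 2 to 1 and `cut` was added as a feature.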
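
The questions above depend on scoring a model that was fitted on the training rows only. A minimal sketch of that workflow follows, reusing the 10 training and 10 test points from this assignment. The names `knn`, `train_mae`, and `test_mae` are introduced here for illustration and are not from the original notebook.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

columns = ['carat', 'cut', 'price']
cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}

train = pd.DataFrame(columns=columns, data=[
    [0.3, 'Ideal', 422], [0.31, 'Ideal', 489], [0.42, 'Premium', 737],
    [0.5, 'Ideal', 1415], [0.51, 'Premium', 1177], [0.7, 'Fair', 1865],
    [0.73, 'Fair', 2351], [1.01, 'Good', 3768], [1.18, 'Very Good', 3965],
    [1.18, 'Ideal', 4838]])

test = pd.DataFrame(columns=columns, data=[
    [0.3, 'Ideal', 432], [0.34, 'Ideal', 687], [0.37, 'Premium', 1124],
    [0.4, 'Good', 720], [0.51, 'Ideal', 1397], [0.51, 'Very Good', 1284],
    [0.59, 'Ideal', 1437], [0.7, 'Ideal', 3419], [0.9, 'Premium', 3484],
    [0.9, 'Fair', 2964]])

# Encode the ordinal cut labels as integer ranks in both frames.
for frame in (train, test):
    frame['cut'] = frame['cut'].map(cut_ranks)

features, target = ['carat', 'cut'], 'price'

# Fit once, on the training rows only. The test rows are never passed to fit().
knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(train[features], train[target])

# Score the same fitted model on both sets.
train_mae = mean_absolute_error(train[target], knn.predict(train[features]))
test_mae = mean_absolute_error(test[target], knn.predict(test[features]))

print("Training error:", train_mae)
print("Testing error:", test_mae)
```

In this sketch the training error is 0 by construction, since each training row is its own nearest neighbor, so the test error is the number to judge the model by.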

### 2. More data, two features, linear regression

Use the following code to load data for diamonds under $5,000, and split the data into train and test sets. The training data has almost 30,000 rows, and the test data has almost 10,000 rows.

In [0]:
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = sns.load_dataset('diamonds')
df = df[df.price < 5000]
train, test = train_test_split(df.copy(), random_state=0)
train.shape, test.shape

((29409, 10), (9804, 10))
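
The row counts above follow from the default split proportions: when neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows. A small sketch with a placeholder frame of the same total length (the `row_id` column and the `*_demo` names are hypothetical stand-ins, so no download is needed):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with as many rows as the two splits above combined.
n_rows = 29409 + 9804
df_demo = pd.DataFrame({'row_id': np.arange(n_rows)})

# No test_size/train_size given, so 25% of the rows go to the test split.
train_demo, test_demo = train_test_split(df_demo, random_state=0)
print(train_demo.shape, test_demo.shape)

# The two splits are disjoint: no row appears in both.
overlap = set(train_demo['row_id']) & set(test_demo['row_id'])
print(len(overlap))
```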

Then, train a Linear Regression model with the `carat` and `cut` features. Calculate the mean absolute error on the training data and on the test data.

In [0]:
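model = LinearRegression()
# encode cut as an ordinal rank (features and target are reused from part 1)
cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

# training model
model.fit(train[features], train[target])

# testing model
# NOTE: this second fit() replaces the fit on the training data, so the
# errors and the prediction below come from a model trained on the test set.
model.fit(test[features], test[target])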

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
# training data mean absolute error
training_data = train[target]
training_data_predict = model.predict(train[features])
training_data_error = mean_absolute_error(training_data, training_data_predict)
print("Training error:", training_data_error)

Training error: 309.41076133261504


In [0]:
# testing data mean absolute error
testing_data = test[target]
testing_data_predict = model.predict(test[features])
testing_data_error = mean_absolute_error(testing_data, testing_data_predict)
print("Testing error:", testing_data_error)

Testing error: 309.0887217763545


Use this model to predict the price of a half-carat diamond with a "very good" cut.



In [0]:
model.predict([[0.5, 3]])

array([1489.75066951])
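
To show what `predict` does for a linear model, here is a minimal sketch confirming that the prediction equals the intercept plus the dot product of the features with the learned coefficients. The five rows of data are hypothetical, so the numbers will not match the output above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical (carat, cut rank) rows and prices, for illustration only.
X = np.array([[0.3, 5], [0.5, 3], [0.7, 1], [0.9, 4], [1.0, 2]])
y = np.array([450.0, 1300.0, 1900.0, 3500.0, 3700.0])

lin = LinearRegression().fit(X, y)

# Half carat, "Very Good" cut (rank 3 in the cut_ranks mapping).
query = np.array([[0.5, 3]])

by_predict = lin.predict(query)[0]
# Same value computed by hand: intercept + sum(coefficient * feature).
by_hand = lin.intercept_ + query[0] @ lin.coef_

print(by_predict, by_hand)
```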

### 3. More data, more features, any model

You choose what features and model type to use! Try to get a better mean absolute error on the test set than your model from the last question.

Refer to [this documentation](https://ggplot2.tidyverse.org/reference/diamonds.html) for more explanation of the features.

Besides `cut`, there are two more ordinal features, which you'd need to encode as numbers if you want to use them in your model:

In [0]:
print(train.describe(include=['object']))
print(train.shape)

        color clarity
count   29409   29409
unique      7       8
top         E     SI1
freq     6090    6948
(29409, 10)


In [0]:
# mapping clarity and color to number rankings (lower rank = better grade)
clarity_rank = {"IF": 0, "VVS1": 1, "VVS2": 2, "VS1": 3, "VS2": 4, "SI1": 5, "SI2": 6, "I1": 7}
color_rank = {"J": 7, "I": 6, "H": 5, "G": 4, "F": 3, "E": 2, "D": 1}
train.clarity = train.clarity.map(clarity_rank)
train.color = train.color.map(color_rank)
test.clarity = test.clarity.map(clarity_rank)
test.color = test.color.map(color_rank)

In [0]:
# checking for null values after mapping
train.isnull().values.any()

False

In [0]:
# everything looks good
train.isnull().sum()

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64
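
The null check is worth running because `Series.map` with a dictionary returns `NaN` for any value that has no key in the dictionary, so a misspelled or missing label would otherwise pass silently into the model. A small sketch, where the `colors` series and the `"K"` label are hypothetical:

```python
import pandas as pd

color_rank = {"J": 7, "I": 6, "H": 5, "G": 4, "F": 3, "E": 2, "D": 1}

# Hypothetical column; "K" is deliberately absent from the mapping.
colors = pd.Series(["D", "G", "K", "J"])
encoded = colors.map(color_rank)

# The unmapped label becomes NaN, which isnull() then reports.
print(encoded.tolist())
print(encoded.isnull().sum())
```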

In [0]:
# using linear regression model but with more features this time
# seeing if error numbers are lower than previous linear regression model
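features = ['carat', 'cut', 'color', 'clarity']
# target is a one-element list, so y is 2-D and predict() returns a 2-D array
target = ['price']
model = LinearRegression()
model.fit(train[features], train[target])
# NOTE: as in part 2, this second fit() replaces the training fit, so the
# errors and the prediction below come from a model trained on the test set.
model.fit(test[features], test[target])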

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
# train mean absolute error
train_tar = train[target]
predict_tr = model.predict(train[features])
tr_error = mean_absolute_error(train_tar, predict_tr)
print('Training error:', tr_error)

Training error: 244.85512977505022


In [0]:
# test mean absolute error
test_tar = test[target]
predict_te = model.predict(test[features])
te_error = mean_absolute_error(test_tar, predict_te)
print('Test error:', te_error)

Test error: 246.02828588394243


Predict the price of a half carat diamond with "very good" cut, "G" color, "VS2" clarity.

In [0]:
# carat, cut, color, clarity
model.predict([[0.5, 3, 4, 4]])

array([[1379.6009241]])