# Lambda School, Intro to Data Science, Day 7 — More Regression!

## Assignment

### 1. Experiment with Nearest Neighbor parameter

Using the same 10 training data points from the lesson, train a `KNeighborsRegressor` model with `n_neighbors=1`.

Use both `carat` and `cut` features.

Calculate the mean absolute error on the training data and on the test data.

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

columns = ['carat', 'cut', 'price']

train = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 422],
        [0.31, 'Ideal', 489],
        [0.42, 'Premium', 737],
        [0.5, 'Ideal', 1415],
        [0.51, 'Premium', 1177],
        [0.7, 'Fair', 1865],
        [0.73, 'Fair', 2351],
        [1.01, 'Good', 3768],
        [1.18, 'Very Good', 3965],
        [1.18, 'Ideal', 4838]])

test  = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 432],
        [0.34, 'Ideal', 687],
        [0.37, 'Premium', 1124],
        [0.4, 'Good', 720],
        [0.51, 'Ideal', 1397],
        [0.51, 'Very Good', 1284],
        [0.59, 'Ideal', 1437],
        [0.7, 'Ideal', 3419],
        [0.9, 'Premium', 3484],
        [0.9, 'Fair', 2964]])

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

In [0]:
features = ['carat', 'cut']
target = ['price']

model = KNeighborsRegressor(n_neighbors=1)

model.fit(train[features],train[target])

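def mean_abs_error():
    """Print the mean absolute error of the current `model` on train and test.

    Reads the notebook-level globals `model`, `train`, `test`, `features`
    and `target`, so it reports on whichever model was fitted last.
    """
    # Error on the training data
    y_true = train[target]
    y_predict = model.predict(train[features])
    train_error = mean_absolute_error(y_true, y_predict)

    # Error on the test data
    y_true = test[target]
    y_predict = model.predict(test[features])
    test_error = mean_absolute_error(y_true, y_predict)

    # The braces build a one-element set, which is why the output shows {...}
    print('Train Error: $', {round(train_error)})
    print('Test Error: $', {round(test_error)})


mean_abs_error()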




Train Error: $ {0.0}
Test Error: $ {1129.0}
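
The `mean_abs_error` helper reads `model`, `train`, `test`, `features` and `target` from the notebook's global scope, so its result depends on which cell ran last. A minimal sketch of an alternative that takes everything as arguments and returns the two errors is shown here. The name `mae_report` and the `demo_*` variables are hypothetical, and the demo uses only a few rows copied from the tables above, with `carat` as the single feature.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor


def mae_report(model, train, test, features, target):
    """Return (train MAE, test MAE) for an already-fitted model.

    Hypothetical helper, not part of the assignment or of scikit-learn.
    """
    train_mae = mean_absolute_error(train[target], model.predict(train[features]))
    test_mae = mean_absolute_error(test[target], model.predict(test[features]))
    return train_mae, test_mae


# A few rows copied from the train/test tables above (carat and price only)
demo_train = pd.DataFrame({'carat': [0.3, 0.42, 0.5, 0.7],
                           'price': [422, 737, 1415, 1865]})
demo_test = pd.DataFrame({'carat': [0.3, 0.4],
                          'price': [432, 720]})

demo_model = KNeighborsRegressor(n_neighbors=1)
demo_model.fit(demo_train[['carat']], demo_train['price'])

train_mae, test_mae = mae_report(demo_model, demo_train, demo_test, ['carat'], 'price')
print(f'Train Error: ${train_mae:.2f}')
print(f'Test Error: ${test_mae:.2f}')
```

Returning the values instead of printing them makes it easier to compare several models in one cell.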


How do the train error and test error compare to the previous `KNeighborsRegressor` model from the lesson? (The previous model used `n_neighbors=2` and only the `carat` feature.)

Is this new model overfitting or underfitting? Why do you think this is happening here? 



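**The current model is overfitting: it uses `n_neighbors=1`, whereas the previous model used `n_neighbors=2`. With a single neighbor, every training point is its own nearest neighbor, so the model memorizes the training data (train error of $0) but generalizes poorly to unseen diamonds (test error of $1,129).**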
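
One way to see the effect of `n_neighbors` is to refit the same model for a few values of `k` and compare the errors. The sketch below is only an illustration: it rebuilds the 10-row train and test sets from this section under the hypothetical names `train_small` and `test_small`, and the choice of `k` values (1, 2, 5) is an assumption made for the example.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

columns = ['carat', 'cut', 'price']
cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}

# Same 10 training and 10 test points as in the cells above
train_small = pd.DataFrame(columns=columns, data=[
    [0.3, 'Ideal', 422], [0.31, 'Ideal', 489], [0.42, 'Premium', 737],
    [0.5, 'Ideal', 1415], [0.51, 'Premium', 1177], [0.7, 'Fair', 1865],
    [0.73, 'Fair', 2351], [1.01, 'Good', 3768], [1.18, 'Very Good', 3965],
    [1.18, 'Ideal', 4838]])
test_small = pd.DataFrame(columns=columns, data=[
    [0.3, 'Ideal', 432], [0.34, 'Ideal', 687], [0.37, 'Premium', 1124],
    [0.4, 'Good', 720], [0.51, 'Ideal', 1397], [0.51, 'Very Good', 1284],
    [0.59, 'Ideal', 1437], [0.7, 'Ideal', 3419], [0.9, 'Premium', 3484],
    [0.9, 'Fair', 2964]])

# Encode the ordinal `cut` labels as numeric ranks
for frame in (train_small, test_small):
    frame['cut'] = frame['cut'].map(cut_ranks)

knn_features = ['carat', 'cut']
errors_by_k = {}
for k in (1, 2, 5):  # illustrative values, not prescribed by the assignment
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(train_small[knn_features], train_small['price'])
    train_mae = mean_absolute_error(
        train_small['price'], knn.predict(train_small[knn_features]))
    test_mae = mean_absolute_error(
        test_small['price'], knn.predict(test_small[knn_features]))
    errors_by_k[k] = (train_mae, test_mae)
    print(f'k={k}: train MAE ${train_mae:.0f}, test MAE ${test_mae:.0f}')
```

For `k=1` this reproduces the train error of $0 and test error of about $1,129 reported above. A large gap between the two numbers is the usual sign of overfitting.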

### 2. More data, two features, linear regression

Use the following code to load data for diamonds under $5,000, and split the data into train and test sets. The training data has almost 30,000 rows, and the test data has almost 10,000 rows.

In [0]:
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = sns.load_dataset('diamonds')
df = df[df.price < 5000]
train, test = train_test_split(df.copy(), random_state=0)
train.shape, test.shape
df.head(20)


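| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| 5 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
| 6 | 0.24 | Very Good | I | VVS1 | 62.3 | 57.0 | 336 | 3.95 | 3.98 | 2.47 |
| 7 | 0.26 | Very Good | H | SI1 | 61.9 | 55.0 | 337 | 4.07 | 4.11 | 2.53 |
| 8 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 |
| 9 | 0.23 | Very Good | H | VS1 | 59.4 | 61.0 | 338 | 4.0 | 4.05 | 2.39 |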


Then, train a Linear Regression model with the `carat` and `cut` features. Calculate the mean absolute error on the training data and on the test data.

In [0]:
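features = ['carat', 'cut']
target = ['price']

# Encode `cut` in the new train/test split with the same ranks as before
# (run this only once; mapping already-encoded values would produce NaN)
train['cut'] = train['cut'].map(cut_ranks).astype(int)
test['cut'] = test['cut'].map(cut_ranks).astype(int)

model = LinearRegression()
model.fit(train[features], train[target])
mean_abs_error()
model.coef_, model.intercept_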

Train Error: $ {171.0}
Test Error: $ {257.0}


(array([[4770.55908807,   82.45286436]]), array([-1448.9474415]))

Use this model to predict the price of a half-carat diamond with a "Very Good" cut.

In [0]:
# Manual estimate with rounded coefficients: carat = 0.5, cut = 3 ('Very Good')
4770 * 0.5 + 82 * 3 - 1448

1183.0

In [0]:
model.predict([[0.5, 3]])

array([[1183.69069562]])
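
The hand calculation and `model.predict` agree because a linear regression prediction is the dot product of the coefficients and the feature values, plus the intercept. The sketch below repeats that arithmetic with the unrounded `coef_` and `intercept_` values printed above. The variable names are chosen for this example only.

```python
import numpy as np

# Coefficients and intercept copied from the fitted model's output above
coef = np.array([4770.55908807, 82.45286436])  # order: [carat, cut]
intercept = -1448.9474415

# Half-carat diamond with a "Very Good" cut (rank 3)
x = np.array([0.5, 3])

# prediction = coef . x + intercept
manual_prediction = float(coef @ x + intercept)
print(round(manual_prediction, 2))
```

Up to rounding, the result matches the value returned by `model.predict([[0.5, 3]])`.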

### 3. More data, more features, any model

You choose what features and model type to use! Try to get a better mean absolute error on the test set than your model from the last question.

Refer to [this documentation](https://ggplot2.tidyverse.org/reference/diamonds.html) for more explanation of the features.

Besides `cut`, there are two more ordinal features, `color` and `clarity`, which you'd need to encode as numbers if you want to use them in your model:

In [0]:
# train.describe(include=['object'])

In [0]:
# Count how many diamonds there are of each color grade
df['color'].value_counts().sort_index().to_dict()

{'D': 5492, 'E': 8070, 'F': 7246, 'G': 7933, 'H': 5481, 'I': 3367, 'J': 1624}

In [0]:
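# Color grades run from D (best) to J (worst), so a higher rank means a worse color
color_rank = {'D': 1, 'E': 2, 'F': 3, 'G': 4, 'H': 5, 'I': 6, 'J': 7}
train['color'] = train['color'].map(color_rank).astype(int)
test['color'] = test['color'].map(color_rank).astype(int)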

In [0]:
# count_clarity = {}
# for i in df['clarity']:
#   if i in count_clarity:
#     count_clarity[i] += 1
#   else:
#     count_clarity[i] = 1
# count_clarity

In [0]:
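# Clarity grades run from IF (best) to I1 (worst), so a higher rank means worse clarity
clarity_rank = {'IF': 0, 'VVS1': 1, 'VVS2': 2, 'VS1': 3,
                'VS2': 4, 'SI1': 5, 'SI2': 6, 'I1': 7}
train['clarity'] = train['clarity'].map(clarity_rank).astype(int)
test['clarity'] = test['clarity'].map(clarity_rank).astype(int)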
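
Writing the rank dictionaries by hand works, but the same encoding can be derived from an ordered list of labels. The sketch below is one possible approach using `pd.Categorical`. The helper name `encode_ordinal` and the `sample` series are hypothetical, and the grade order is assumed to be the one implied by `clarity_rank` (IF best, I1 worst).

```python
import pandas as pd

# Clarity grades from best (IF) to worst (I1), in the same order as clarity_rank
clarity_order = ['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1']


def encode_ordinal(values, order):
    """Hypothetical helper: replace each label with its position in `order`.

    Labels that are missing from `order` are encoded as -1.
    """
    categorical = pd.Categorical(values, categories=order, ordered=True)
    return pd.Series(categorical.codes, index=values.index)


sample = pd.Series(['SI2', 'IF', 'VS1', 'I1'])
encoded = encode_ordinal(sample, clarity_order)
print(encoded.tolist())
```

Because unknown labels become -1 rather than NaN, a typo in the label list shows up as an obviously invalid code instead of a missing value.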

In [0]:
features = ['color', 'clarity']
target = ['price']

model = LinearRegression()
model.fit(train[features], train[target])
mean_abs_error()
model.coef_, model.intercept_

Train Error: $ {1031.0}
Test Error: $ {1036.0}


(array([[ 77.75742165, 259.22094468]]), array([648.46824772]))

In [0]:
#model.predict([[4,4]])

array([[970.]])

In [0]:
# Using KNN Model
features = ['color', 'clarity']
target = ['price']

model = KNeighborsRegressor(n_neighbors=2)
model.fit(train[features],train[target])
mean_abs_error()

Train Error: $ {1125.0}
Test Error: $ {1141.0}


In [0]:
df.isnull().sum()

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64