# <Anonymized BootCamp>, Intro to Data Science, Day 7 — More Regression!

## Assignment

### 1. Experiment with Nearest Neighbor parameter

Using the same 10 training data points from the lesson, train a `KNeighborsRegressor` model with `n_neighbors=1`.

Use both `carat` and `cut` features.

Calculate the mean absolute error on the training data and on the test data.

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as knr
from sklearn.metrics import mean_absolute_error


columns = ['carat', 'cut', 'price']

train = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 422],
        [0.31, 'Ideal', 489],
        [0.42, 'Premium', 737],
        [0.5, 'Ideal', 1415],
        [0.51, 'Premium', 1177],
        [0.7, 'Fair', 1865],
        [0.73, 'Fair', 2351],
        [1.01, 'Good', 3768],
        [1.18, 'Very Good', 3965],
        [1.18, 'Ideal', 4838]])

test  = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 432],
        [0.34, 'Ideal', 687],
        [0.37, 'Premium', 1124],
        [0.4, 'Good', 720],
        [0.51, 'Ideal', 1397],
        [0.51, 'Very Good', 1284],
        [0.59, 'Ideal', 1437],
        [0.7, 'Ideal', 3419],
        [0.9, 'Premium', 3484],
        [0.9, 'Fair', 2964]])

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

In [2]:
#WHAT DO TRAIN.CUT AND TEST.CUT LOOK LIKE?
print(' ' , ['train','test'])
for i in range(min(len(train.cut), len(test.cut))):   #MIN AVOIDS A KeyError IF THE LENGTHS EVER DIFFER
  print(i, [train.cut[i], test.cut[i]])

  ['train', 'test']
0 [5, 5]
1 [5, 5]
2 [4, 4]
3 [5, 2]
4 [4, 5]
5 [1, 3]
6 [1, 5]
7 [2, 5]
8 [3, 4]
9 [5, 1]


In [3]:
#CREATE NEAREST NEIGHBORS MODEL, K=1
model = knr(n_neighbors = 1)
features = ['carat', 'cut']
target = 'price'
model.fit(train[features], train[target])

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=1, p=2,
          weights='uniform')

In [4]:
#WHAT DO ACTUAL & PREDICTED TRAIN.PRICE VALUES LOOK LIKE (BASED ON CARAT & CUT)?
y_true = train[target]
y_pred = model.predict(train[features])
train_error = mean_absolute_error(y_true, y_pred)

print(' ' , ['Act','Pred'])
for i in range(len(y_true)):
  print(i, [y_true[i], y_pred[i]] ) 
print('')
print('Train Error: ' +str(train_error))

  ['Act', 'Pred']
0 [422, 422.0]
1 [489, 489.0]
2 [737, 737.0]
3 [1415, 1415.0]
4 [1177, 1177.0]
5 [1865, 1865.0]
6 [2351, 2351.0]
7 [3768, 3768.0]
8 [3965, 3965.0]
9 [4838, 4838.0]

Train Error: 0.0


**Andrew: This makes sense to me. With k=1, the nearest neighbor of each training point is that point itself, so the model predicts every training price from its own row and the train error is zero.**
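
The reasoning above can be checked directly with `kneighbors`, which reports the neighbor each prediction is based on. This is a minimal sketch rather than part of the assignment: `train_small`, `knn`, `nearest_is_self`, and `max_distance` are names introduced here for illustration, and the `cut` column is typed in as its rank.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Same 10 training points as above, with cut already mapped to its rank.
train_small = pd.DataFrame(
    columns=['carat', 'cut', 'price'],
    data=[[0.30, 5, 422], [0.31, 5, 489], [0.42, 4, 737], [0.50, 5, 1415],
          [0.51, 4, 1177], [0.70, 1, 1865], [0.73, 1, 2351], [1.01, 2, 3768],
          [1.18, 3, 3965], [1.18, 5, 4838]])
features = ['carat', 'cut']

knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(train_small[features], train_small['price'])

# For every training row, look up its single nearest neighbor in the training set.
distances, indices = knn.kneighbors(train_small[features])

# Each row should find itself, at a distance of zero (up to floating-point rounding).
nearest_is_self = bool((indices.ravel() == np.arange(len(train_small))).all())
max_distance = float(distances.max())
print('Nearest neighbor is the row itself:', nearest_is_self)
print('Largest neighbor distance:', max_distance)
```

This only works out so cleanly because no two training rows share the same `carat` and `cut` pair; with duplicated feature rows and different prices, the train error at k=1 would not have to be zero.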

In [5]:
#WHAT DO ACTUAL & PREDICTED TEST.PRICE VALUES LOOK LIKE (BASED ON CARAT & CUT)?
y_true = test[target]
y_pred = model.predict(test[features])
test_error = mean_absolute_error(y_true, y_pred)

print(' ' , ['Act','Pred'])
for i in range(len(y_true)):
  print(i, [y_true[i], y_pred[i]])
print('')
print('Test Error: ' +str(test_error))

  ['Act', 'Pred']
0 [432, 422.0]
1 [687, 489.0]
2 [1124, 737.0]
3 [720, 3768.0]
4 [1397, 1415.0]
5 [1284, 3965.0]
6 [1437, 1415.0]
7 [3419, 1415.0]
8 [3484, 1177.0]
9 [2964, 2351.0]

Test Error: 1128.8
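
One way to see where a miss such as test row 3 comes from (actual 720, predicted 3768) is to compute the distances by hand. The sketch below assumes the default Euclidean metric shown in the model repr above; `train_small`, `query`, and `nearest` are illustrative names, not part of the lesson. Because the features are unscaled, one step of `cut` rank counts as a distance of 1.0, which is more than the whole `carat` spread in this training set (0.30 to 1.18).

```python
import numpy as np
import pandas as pd

# Same 10 training points as above, with cut already mapped to its rank.
train_small = pd.DataFrame(
    columns=['carat', 'cut', 'price'],
    data=[[0.30, 5, 422], [0.31, 5, 489], [0.42, 4, 737], [0.50, 5, 1415],
          [0.51, 4, 1177], [0.70, 1, 1865], [0.73, 1, 2351], [1.01, 2, 3768],
          [1.18, 3, 3965], [1.18, 5, 4838]])

# Test row 3: 0.40 carat, 'Good' cut (rank 2), actual price 720.
query = np.array([0.40, 2])

# Euclidean distance from the query to every training row, on the raw features.
diffs = train_small[['carat', 'cut']].to_numpy() - query
train_small['distance'] = np.sqrt((diffs ** 2).sum(axis=1))

ranked = train_small.sort_values('distance')
nearest = ranked.iloc[0]
print(ranked)
print('Nearest training price:', nearest['price'])
```

The closest row is the only other 'Good' diamond, at 1.01 carat, even though several training diamonds are much closer in weight. Whether rescaling the features would help on this tiny data set is not something the lesson establishes.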


How do the train error and test error compare to the previous `KNeighborsRegressor` model from the lesson? (The previous model used `n_neighbors=2` and only the `carat` feature.)

Is this new model overfitting or underfitting? Why do you think this is happening here? 



**Andrew: Not surprisingly, the model overfits the training data: the test error is large (1128.8) in spite of the 0.0 train error. As I mentioned above, with k=1 the model predicts each training point from that same point, so it memorizes the training set, noise included. The model is too flexible (high variance), so it does not generalize to diamonds it has not seen.**
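
To put the k=1 result in context, here is a rough sketch that refits the same two-feature model for several values of `n_neighbors` and records both errors. It reuses the 10 train and 10 test points from the top of this notebook with `cut` typed in as its rank; `train_small`, `test_small`, and `errors` are names made up for this sketch.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

cols = ['carat', 'cut', 'price']
train_small = pd.DataFrame(
    columns=cols,
    data=[[0.30, 5, 422], [0.31, 5, 489], [0.42, 4, 737], [0.50, 5, 1415],
          [0.51, 4, 1177], [0.70, 1, 1865], [0.73, 1, 2351], [1.01, 2, 3768],
          [1.18, 3, 3965], [1.18, 5, 4838]])
test_small = pd.DataFrame(
    columns=cols,
    data=[[0.30, 5, 432], [0.34, 5, 687], [0.37, 4, 1124], [0.40, 2, 720],
          [0.51, 5, 1397], [0.51, 3, 1284], [0.59, 5, 1437], [0.70, 5, 3419],
          [0.90, 4, 3484], [0.90, 1, 2964]])
features = ['carat', 'cut']

# Map each k to its (train MAE, test MAE) pair.
errors = {}
for k in range(1, 6):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(train_small[features], train_small['price'])
    train_mae = mean_absolute_error(train_small['price'], knn.predict(train_small[features]))
    test_mae = mean_absolute_error(test_small['price'], knn.predict(test_small[features]))
    errors[k] = (train_mae, test_mae)
    print('k =', k, '| train MAE:', round(train_mae, 1), '| test MAE:', round(test_mae, 1))
```

The k=1 row should reproduce the 0.0 and 1128.8 reported above. With only 10 points the other rows are noisy, so they are better read as an illustration of the train/test gap than as a tuning result.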

### 2. More data, two features, linear regression

Use the following code to load data for diamonds under $5,000, and split the data into train and test sets. The training data has almost 30,000 rows, and the test data has almost 10,000 rows.

In [6]:
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split as tts

df = sns.load_dataset('diamonds')
df = df[df.price < 5000]
train, test = tts(df.copy(), random_state=0)
train.shape, test.shape


((29409, 10), (9804, 10))

In [7]:
#REMINDER OF WHAT THE TRAIN DATA LOOKS LIKE
print(train.head())

       carat        cut color clarity  depth  table  price     x     y     z
43601   0.31  Very Good     E     SI1   61.2   58.0    507  4.34  4.38  2.67
52706   0.74       Fair     H     VS2   66.1   61.0   2553  5.60  5.57  3.69
1986    0.81  Very Good     G     SI1   62.3   59.0   3095  5.93  5.98  3.71
48617   0.70       Fair     G     SI2   61.5   66.0   1999  5.55  5.60  3.43
10947   0.87      Ideal     G     VS2   61.8   56.0   4899  6.11  6.13  3.78


Then, train a Linear Regression model with the `carat` and `cut` features. Calculate the mean absolute error on the training data and on the test data.

In [0]:
#MAP CUT RANKS TO INTEGERS

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

In [9]:
#CREATE LINEAR REGRESSION MODEL 
model = LinearRegression()
features = ['carat', 'cut']
target = 'price'

model.fit(train[features], train[target])


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [10]:
#CHECK WHAT THE TRAIN DATA LOOKS LIKE NOW
print(train.head())

       carat  cut color clarity  depth  table  price     x     y     z
43601   0.31    3     E     SI1   61.2   58.0    507  4.34  4.38  2.67
52706   0.74    1     H     VS2   66.1   61.0   2553  5.60  5.57  3.69
1986    0.81    3     G     SI1   62.3   59.0   3095  5.93  5.98  3.71
48617   0.70    1     G     SI2   61.5   66.0   1999  5.55  5.60  3.43
10947   0.87    5     G     VS2   61.8   56.0   4899  6.11  6.13  3.78


In [11]:
#CALCULATE TRAIN ABSOLUTE ERROR - PRICE VS CARAT,CUT
y_train_true = train[target]
y_train_pred = model.predict(train[features])
train_error = mean_absolute_error(y_train_true, y_train_pred)

#CALCULATE TEST ABSOLUTE ERROR - PRICE VS CARAT,CUT
y_test_true = test[target]
y_test_pred = model.predict(test[features])
test_error = mean_absolute_error(y_test_true, y_test_pred)

print("Train Error: " + str(train_error))
print("Test Error: " + str(test_error))

Train Error: 309.46586653861294
Test Error: 309.5202765379711


**Andrew: I'm impressed by how close those errors are to each other! This makes sense to me: with almost 30,000 training rows and only two features, a linear model has very little room to overfit.**

Use this model to predict the price of a half-carat diamond with a "Very Good" cut.

In [12]:
#PREDICT PRICE BASED ON CARAT AND CUT QUALITY  - ONE INSTANCE
carat = 0.5
cut_str = 'Very Good'
cut = cut_ranks[cut_str]
print('Predicted price for ' + str(carat) + ' carat & ' + str(cut_str) 
      + ' cut quality: $' + str(model.predict([[carat,cut]])))

Predicted price for 0.5 carat & Very Good cut quality: $[1489.45526366]
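
A linear regression prediction is just the intercept plus each coefficient times its feature value, which can be verified by hand. The sketch below is fitted on the 10 small training points from question 1 so that it runs without downloading anything, so its number will not match the $1489.46 above; `train_small`, `linreg`, `by_hand`, and `by_predict` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The 10 training points from question 1, with cut already mapped to its rank.
train_small = pd.DataFrame(
    columns=['carat', 'cut', 'price'],
    data=[[0.30, 5, 422], [0.31, 5, 489], [0.42, 4, 737], [0.50, 5, 1415],
          [0.51, 4, 1177], [0.70, 1, 1865], [0.73, 1, 2351], [1.01, 2, 3768],
          [1.18, 3, 3965], [1.18, 5, 4838]])
features = ['carat', 'cut']

linreg = LinearRegression().fit(train_small[features], train_small['price'])

carat, cut_rank = 0.5, 3   # half carat, 'Very Good'

# price = intercept + coef_carat * carat + coef_cut * cut_rank
by_hand = linreg.intercept_ + linreg.coef_[0] * carat + linreg.coef_[1] * cut_rank

# Same prediction through the model; [0] unwraps the one-element array.
query = pd.DataFrame([[carat, cut_rank]], columns=features)
by_predict = linreg.predict(query)[0]

print('By hand:   ', by_hand)
print('By predict:', by_predict)
```

Indexing the result with `[0]` is also what would remove the square brackets from the printed price in the cell above.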


In [13]:
#SANITY CHECK FOR ME, IS THE ABOVE A REASONABLE PREDICTION?

#filter by carat
f_carat = df[df.carat == carat]

#filter by cut
f_cut = f_carat[f_carat.cut == 'Very Good']
f_cut = f_cut.sort_values(by='carat')
print(f_cut.head(20))


       carat        cut color clarity  depth  table  price     x     y     z
3343     0.5  Very Good     D      IF   62.9   59.0   3378  4.99  5.09  3.17
44856    0.5  Very Good     D     VS2   63.4   57.0   1627  5.05  5.08  3.21
44855    0.5  Very Good     D     VS2   62.8   60.0   1627  5.02  5.08  3.17
44820    0.5  Very Good     G     VS1   62.1   55.0   1624  5.06  5.22  3.19
44818    0.5  Very Good     E     VS2   62.9   59.0   1624  5.06  5.09  3.19
44810    0.5  Very Good     G     VS1   61.0   59.0   1624  5.09  5.14  3.12
44805    0.5  Very Good     E     VS2   61.3   59.0   1624  5.05  5.10  3.11
44803    0.5  Very Good     E     VS2   61.8   58.0   1624  5.05  5.11  3.14
44865    0.5  Very Good     D     VS2   61.2   56.0   1628  5.14  5.16  3.15
44802    0.5  Very Good     E     VS2   61.3   58.0   1624  5.06  5.12  3.12
44798    0.5  Very Good     G     VS1   59.4   60.0   1624  5.13  5.17  3.06
44796    0.5  Very Good     E     VS2   61.5   56.0   1624  5.07  5.11  3.13

### 3. More data, more features, any model

You choose what features and model type to use! Try to get a better mean absolute error on the test set than your model from the last question.

Refer to [this documentation](https://ggplot2.tidyverse.org/reference/diamonds.html) for more explanation of the features.

Besides `cut`, there are two more ordinal features, which you'd need to encode as numbers if you want to use them in your model:

In [14]:
train.describe(include=['object'])

        color clarity
count   29409   29409
unique      7       8
top         E     SI1
freq     6090    6948


In [0]:
#CONVERT CLARITY AND COLOR TO INTEGERS FOR TRAIN DATA
clarity_rank = {"IF":0,"VVS1":1, "VVS2":2,"VS1":3, "VS2":4,"SI1":5, "SI2":6, "I1":7}
color_rank = {"J":7, "I":6, "H":5, "G":4, "F":3, "E":2, "D":1 }

train.clarity = train.clarity.map(clarity_rank) 
train.color = train.color.map(color_rank)

#CONVERT CLARITY AND COLOR TO INTEGERS FOR THE TEST DATA
test.clarity = test.clarity.map(clarity_rank)
test.color = test.color.map(color_rank)
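
When a dictionary is passed to `Series.map`, any value missing from the dictionary becomes `NaN`, which is why a null check after encoding is worthwhile. A small sketch of that behavior, using a made-up `sample` series with one deliberately misspelled grade (hypothetical data, not from the diamonds set):

```python
import pandas as pd

clarity_rank = {"IF": 0, "VVS1": 1, "VVS2": 2, "VS1": 3,
                "VS2": 4, "SI1": 5, "SI2": 6, "I1": 7}

# Hypothetical values; 'si2' is lowercase on purpose, so it is not a key in clarity_rank.
sample = pd.Series(['SI1', 'VS2', 'IF', 'si2'])

# Values without a matching key are encoded as NaN rather than raising an error.
encoded = sample.map(clarity_rank)

# List the original labels that failed to map.
unmapped = sample[encoded.isna()].unique().tolist()
print(encoded)
print('Unmapped labels:', unmapped)
```

An empty `unmapped` list on the real columns would mean every category was covered by the dictionary.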

In [16]:
#CHECKPOINT - DOES THE DATA LOOK RIGHT?
print(test.head())
print(test.isnull().sum())
print(train.head())
print(train.isnull().sum())

       carat  cut  color  clarity  depth  table  price     x     y     z
9742    1.20    5      7        6   61.7   56.0   4659  6.79  6.87  4.21
9374    0.32    5      6        1   62.5   55.0    589  4.37  4.40  2.74
10683   1.01    5      3        6   62.7   57.0   4843  6.36  6.39  4.00
4589    1.01    2      7        4   62.8   58.0   3655  6.30  6.35  3.97
2196    0.90    2      2        6   63.4   62.0   3139  6.00  6.02  3.81
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64
       carat  cut  color  clarity  depth  table  price     x     y     z
43601   0.31    3      2        5   61.2   58.0    507  4.34  4.38  2.67
52706   0.74    1      5        4   66.1   61.0   2553  5.60  5.57  3.69
1986    0.81    3      4        5   62.3   59.0   3095  5.93  5.98  3.71
48617   0.70    1      4        6   61.5   66.0   1999  5.55  5.60  3.43
10947   0.87    5      4        4   61.8   56.0   4899  6.11  6.13  3.78

In [17]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

#CREATE MODEL - DEFINE FEATURES & TARGET
features3 = ['carat','clarity']
target3 = ['price']

for n in range(1,13):      #CHOOSE N-DEGREES HERE
  model3 = make_pipeline(PolynomialFeatures(degree=n), LinearRegression())
  #model3 = make_pipeline(knr(n_neighbors=n))
  model3.fit(train[features3],train[target3])

  #CALCULATE TRAINING ERROR
  y3_train_true = train[target3]
  y3_train_pred = model3.predict(train[features3])
  train3_error = mean_absolute_error(y3_train_true, y3_train_pred)

  #CALCULATE TEST ERROR
  y3_test_true = test[target3]
  y3_test_pred = model3.predict(test[features3])
  test3_error = mean_absolute_error(y3_test_true, y3_test_pred)

  print('Poly degree = ' + str(n))
  print('Train error: ' + str(train3_error))
  print('Test error: ' + str(test3_error))

Poly degree = 1
Train error: 275.2655373468392
Test error: 277.1366277761029
Poly degree = 2
Train error: 265.95922717693406
Test error: 269.0307649852026
Poly degree = 3
Train error: 231.54625637823963
Test error: 234.00827371362365
Poly degree = 4
Train error: 229.59751922966666
Test error: 232.48735358760447
Poly degree = 5
Train error: 225.40781623187647
Test error: 228.57704872190365
Poly degree = 6
Train error: 224.0968923233701
Test error: 227.43857181456644
Poly degree = 7
Train error: 222.9775461213722
Test error: 226.1173117594289
Poly degree = 8
Train error: 221.3747324243301
Test error: 224.31572911086545
Poly degree = 9
Train error: 219.90953983237884
Test error: 223.18580451511133
Poly degree = 10
Train error: 220.0088266401906
Test error: 223.11644305630296
Poly degree = 11
Train error: 223.63519368882405
Test error: 227.82317593280703
Poly degree = 12
Train error: 219.85686920928777
Test error: 223.39862746179094


**Andrew: `carat` + `color` gives an error of about 292 at degree 18, and `carat` + `clarity` gives about 223 at degree 13, on both train and test. Both beat the ~309.5 test error from the last question.**
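
Rather than reading the best degree off a long printout, the errors can be collected in a table and the minimum looked up. This is only a sketch: it runs on the 10 small train and test points from question 1 with `carat` and `cut` (not the `carat` and `clarity` columns of the full data), and `train_small`, `test_small`, `results`, and `best_degree` are names invented here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

cols = ['carat', 'cut', 'price']
train_small = pd.DataFrame(
    columns=cols,
    data=[[0.30, 5, 422], [0.31, 5, 489], [0.42, 4, 737], [0.50, 5, 1415],
          [0.51, 4, 1177], [0.70, 1, 1865], [0.73, 1, 2351], [1.01, 2, 3768],
          [1.18, 3, 3965], [1.18, 5, 4838]])
test_small = pd.DataFrame(
    columns=cols,
    data=[[0.30, 5, 432], [0.34, 5, 687], [0.37, 4, 1124], [0.40, 2, 720],
          [0.51, 5, 1397], [0.51, 3, 1284], [0.59, 5, 1437], [0.70, 5, 3419],
          [0.90, 4, 3484], [0.90, 1, 2964]])
features = ['carat', 'cut']

rows = []
for degree in range(1, 4):
    poly_model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    poly_model.fit(train_small[features], train_small['price'])
    rows.append({
        'degree': degree,
        'train_mae': mean_absolute_error(train_small['price'], poly_model.predict(train_small[features])),
        'test_mae': mean_absolute_error(test_small['price'], poly_model.predict(test_small[features])),
    })

# One row per degree, indexed by degree, so idxmin returns the degree itself.
results = pd.DataFrame(rows).set_index('degree')
best_degree = results['test_mae'].idxmin()
print(results)
print('Lowest test MAE at degree', best_degree)
```

Picking the degree by its test error makes that test error an optimistic estimate; holding out a separate validation split for the choice is the usual way around that.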

In [25]:
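#NOTE: accuracy_score IS A CLASSIFICATION METRIC, SO IT DOES NOT APPLY TO PREDICTED PRICES;
#MEAN ABSOLUTE ERROR (USED ABOVE) IS THE REGRESSION METRIC HERE
#from sklearn.metrics import accuracy_score
#poly_test_acc = accuracy_score(y3_test_true, y3_test_pred)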
train.describe()

          carat       cut     color   clarity      depth      table        price         x         y         z
count   29409.0   29409.0   29409.0   29409.0    29409.0    29409.0      29409.0   29409.0   29409.0   29409.0
mean   0.572621  3.921385  3.411643  3.852698  61.764184  57.334574  1912.448842  5.212862  5.218321  3.221193
std    0.259543  1.135751  1.666277   1.70532   1.424072   2.269379  1326.834107  0.771722  0.779769  0.510568
min         0.2       1.0       1.0       0.0       43.0       43.0        326.0       0.0      3.68       0.0
25%        0.34       3.0       2.0       3.0       61.1       56.0        802.0      4.49       4.5      2.77
50%        0.51       4.0       3.0       4.0       61.8       57.0       1412.0      5.14      5.14      3.18
75%        0.73       5.0       5.0       5.0       62.5       59.0       2825.0       5.8      5.81      3.59
max        1.74       5.0       7.0       7.0       79.0       79.0       4999.0      7.62      31.8      31.8

