
Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
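
Several of these ideas take only a line or two of pandas. The sketch below runs on a made-up toy frame that borrows the dataset's column names. The new column names (`has_description`, `perk_count`, and so on) and the short `perk_cols` list are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

# Toy rows using the dataset's column names; the values are made up.
toy = pd.DataFrame({
    'description': ['Sunny two bedroom near the park', None, '   '],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Hypothetical perk list; the real data has 24 such 0/1 columns.
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']

# Treat missing or whitespace-only descriptions as empty.
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

toy['perk_count'] = toy[perk_cols].sum(axis=1)
toy['cats_or_dogs'] = ((toy['cats_allowed'] == 1) | (toy['dogs_allowed'] == 1)).astype(int)
toy['cats_and_dogs'] = ((toy['cats_allowed'] == 1) & (toy['dogs_allowed'] == 1)).astype(int)
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']
```

A beds-to-baths ratio would need extra care for listings with zero bathrooms, so it is left out of this sketch.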

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

## Import

In [1]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

Initialized empty Git repository in /content/.git/
remote: Enumerating objects: 156, done.
remote: Total 156 (delta 0), reused 0 (delta 0), pack-reused 156
Receiving objects: 100% (156/156), 19.30 MiB | 19.57 MiB/s, done.
Resolving deltas: 100% (71/71), done.
From https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master
Collecting category_encoders==2.0.0 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
Collecting eli5==0.10.0 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/e6/ea/47bd5844bb609d45821114aa7e0bc9e4422053fe24a6cf6b357f0d3f74d3/eli5-0.10.0-py2.py3-none-any.whl (105kB)

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [0]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import plotly.express as px
import matplotlib.pyplot as plt
# These imports are used later in the notebook

# Assignment

## Split by date

In [42]:
# Look at df
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [157]:
# Check dtype of created column
df['created'].dtypes

dtype('O')

In [0]:
# Change to datetime dtype for easier filtering
df['created'] = pd.to_datetime(df['created'])

In [159]:
# Check dtype change
df['created'].dtypes

dtype('<M8[ns]')

In [160]:
# Check min and max
min(df['created']), max(df['created'])

(Timestamp('2016-04-01 22:12:41'), Timestamp('2016-06-29 21:41:47'))

In [0]:
# Filter train and test datasets using June 1, 2016 as the cutoff for test
train = df[df['created'] < '2016-06-01']
test = df[df['created'] >= '2016-06-01']

In [51]:
# Check date ranges are correct
min(train['created']), max(train['created']), min(test['created']), max(test['created'])

(Timestamp('2016-04-01 22:12:41'),
 Timestamp('2016-05-31 23:10:48'),
 Timestamp('2016-06-01 01:10:37'),
 Timestamp('2016-06-29 21:41:47'))
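
One thing to watch: `train` and `test` are created here, before any new columns are added to `df`, so features engineered afterwards will not appear in them. A minimal sketch of one way around that is to wrap the split in a small helper and call it after feature engineering. The helper name `split_by_date` and the toy rows are assumptions for illustration only.

```python
import pandas as pd

def split_by_date(data, column, cutoff):
    """Return (train, test): rows before the cutoff, and rows on or after it."""
    cutoff = pd.Timestamp(cutoff)
    is_test = data[column] >= cutoff
    # .copy() gives independent frames, so later column assignments
    # on either half do not touch the original.
    return data[~is_test].copy(), data[is_test].copy()

# Toy listings (made-up values) spanning the cutoff.
toy = pd.DataFrame({
    'created': pd.to_datetime(['2016-04-17 03:26:41', '2016-05-31 23:10:48',
                               '2016-06-01 01:10:37', '2016-06-24 07:54:24']),
    'bedrooms': [1, 2, 3, 3],
    'bathrooms': [1.0, 1.0, 1.5, 1.5],
})

# Engineer features first, then split, so both halves carry the new column.
toy['Rooms'] = toy['bedrooms'] + toy['bathrooms']
train_toy, test_toy = split_by_date(toy, 'created', '2016-06-01')
```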

## Engineer 2 new features

In [0]:
# Let's create a feature called Rooms that has the total number of bedrooms and 
# bathrooms
df['Rooms'] = df['bathrooms'] + df['bedrooms']

In [137]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,Rooms
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.5
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0


In [0]:
# Let's create another new feature that factors in many features to create 
# a "deluxe" rating, which tells you how many of the amenities the property has
df['Deluxe'] = ''

In [139]:
# Testing for how I want to sum all values
df.iloc[:,10:35]

Unnamed: 0,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,Rooms
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.5
1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0
2,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
3,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
4,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6.0
6,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0
7,1,0,1,0,1,1,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,3.0
8,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
9,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6.0


In [140]:
df.iloc[:,10:35].values

array([[0. , 0. , 0. , ..., 0. , 0. , 4.5],
       [1. , 1. , 0. , ..., 0. , 0. , 3. ],
       [0. , 0. , 1. , ..., 0. , 0. , 2. ],
       ...,
       [1. , 1. , 0. , ..., 0. , 0. , 2. ],
       [1. , 1. , 0. , ..., 0. , 0. , 1. ],
       [0. , 0. , 1. , ..., 0. , 0. , 3. ]])

In [141]:
# Not what I want: this adds the rows together, giving one total per column
sum(df.iloc[:,10:35].values)

array([ 25621.,  23348.,  23348.,  21852.,  20740.,  20263.,  17920.,
         2576.,  13105.,   9063.,   8587.,   6481.,   6756.,   5020.,
         4257.,   2952.,   2695.,   2534.,   2255.,   2114.,   2085.,
         1920.,   1329.,   1281., 133707.])

In [145]:
# There we go: transposing first gives one total per row
sum(df.iloc[:,10:35].values.T)

array([4.5, 8. , 5. , ..., 7. , 6. , 4. ])

In [0]:
df['Deluxe'] = sum(df.iloc[:,10:35].values.T)

In [154]:
# This gives us a rough grading scale of how 'nice' an apartment is by how many
# features and rooms it has
df['Deluxe'].value_counts()

5.0     5952
6.0     5841
4.0     5072
7.0     4853
3.0     3964
8.0     3824
9.0     3200
10.0    2666
2.0     2255
11.0    2178
12.0    1846
13.0    1484
14.0    1196
15.0     957
16.0     680
1.0      613
17.0     480
18.0     317
19.0     204
20.0     130
9.5      102
8.5      101
11.5      87
7.5       78
6.5       72
12.5      68
10.5      68
5.5       66
13.5      55
21.0      50
15.5      47
4.5       47
0.0       42
3.5       38
22.0      33
14.5      32
16.5      22
17.5      19
18.5      18
19.5      15
20.5      11
2.5       11
23.0       7
24.0       6
25.0       3
21.5       2
22.5       2
23.5       2
1.5        1
Name: Deluxe, dtype: int64

In [155]:
# Observed scores range from 0 to 25 (the 24 amenity flags plus Rooms)
df['Deluxe'].min(), df['Deluxe'].max()

(0.0, 25.0)

## Fit a linear regression model with at least two features

In [0]:
# Let's start with longitude and latitude as our two features

In [174]:
model = LinearRegression()
features = ['longitude', 'latitude']
target = 'price'
# Note: this fits on the full DataFrame (df), not only the April-May train split
X_train = df[features]
y_train = df[target]
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [175]:
# The magnitudes are large because each coefficient is dollars per degree,
# and the listings span less than half a degree of longitude or latitude
print('Intercept', model.intercept_)
coefficients = pd.Series(model.coef_, features)
print(coefficients.to_string())

Intercept -1302468.0701019831
longitude   -16360.381742
latitude      2351.491444
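
The intercept is the predicted price at longitude 0 and latitude 0, which is nowhere near New York, so its size says little on its own. Centering each coordinate on its mean leaves the slopes unchanged and turns the intercept into the mean price. The sketch below shows this on made-up data. It uses plain least squares with NumPy as a stand-in for `LinearRegression`, and the helper name `fit_ols` is an assumption.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(X)), X])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[0], solution[1:]

rng = np.random.default_rng(0)
# Made-up coordinates in roughly the range of the data, and made-up prices.
lon = rng.uniform(-74.09, -73.70, size=200)
lat = rng.uniform(40.57, 40.99, size=200)
price = (3500 - 4000 * (lon + 73.9) + 1000 * (lat - 40.75)
         + rng.normal(0, 200, size=200))

X = np.column_stack([lon, lat])
raw_intercept, raw_coef = fit_ols(X, price)

# Subtract each column's mean so that "zero" is the centre of the data.
X_centered = X - X.mean(axis=0)
centered_intercept, centered_coef = fit_ols(X_centered, price)
```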


In [0]:
# Let's try again with our new features and write a quick function
def model_2_features(data, feature1, feature2, target):
  model = LinearRegression()
  features = [feature1, feature2]
  # Use the DataFrame passed in as `data` rather than the global df
  X_train = data[features]
  y_train = data[target]
  model.fit(X_train, y_train)
  print('Intercept', model.intercept_)
  coefficients = pd.Series(model.coef_, features)
  print(coefficients.to_string())

In [177]:
model_2_features(df, 'Deluxe', 'bedrooms', 'price')

Intercept 1536.0138729832158
Deluxe      147.037881
bedrooms    620.495945


In [178]:
model_2_features(df, 'bathrooms', 'bedrooms', 'price')

Intercept 464.92844095548935
bathrooms    2099.112156
bedrooms      385.099606


## Get regression metrics RMSE, MAE, and $R^2$ for both the train and test data

In [0]:
model = LinearRegression()
features = ['longitude', 'latitude']
target = 'price'

X_train = train[features]
y_train = train[target]
model.fit(X_train, y_train)
X_test = test[features]
y_test = test[target]

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [436]:
train_mae = mean_absolute_error(y_train, y_train_pred)
print(train_mae)
test_mae = mean_absolute_error(y_test, y_test_pred)
print(test_mae)

1147.1493278231853
1139.700457630829


In [437]:
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
print(train_rmse)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(test_rmse)

1704.2079552598782
1703.0806186783316


In [438]:
train_r2 = r2_score(y_train, y_train_pred)
print(train_r2)
test_r2 = r2_score(y_test, y_test_pred)
print(test_r2)

0.06463820907125062
0.06677485649195314
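
For reference, the three metrics can be written out by hand. The sketch below computes them with NumPy on made-up numbers; the helper name `regression_metrics` is an assumption, and the results should agree with `mean_absolute_error`, `mean_squared_error` (after the square root), and `r2_score`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    # R^2 compares the model with always predicting the mean of y_true.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return {'mae': mae, 'rmse': rmse, 'r2': r2}

# Made-up rents and predictions.
metrics = regression_metrics([3000, 5000, 2800, 3200], [3100, 4600, 3000, 3100])
```

Read this way, the $R^2$ of roughly 0.065 above means longitude and latitude, used as plain linear terms, explain only about 6.5% of the variance in price.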


## Searching for best MAE

In [208]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,Rooms,Deluxe
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.5,4.5
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,8.0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,5.0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,4.0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0,6.0


In [560]:
min(df['longitude']), max(df['longitude']), min(df['latitude']), max(df['latitude'])

(-74.0873, -73.7001, 40.5757, 40.9894)

In [564]:
(df['longitude']<=-73.97).value_counts()

True     29295
False    19522
Name: longitude, dtype: int64

In [565]:
# Let's separate it at latitude 40.75 and longitude -73.97
df['East_West'] = (df['longitude'] <= -73.97).astype(int)
df['East_West'].head(20)

0     0
1     0
2     1
3     0
4     0
5     1
6     0
7     1
8     0
9     1
10    0
11    0
12    1
13    1
14    0
15    1
16    1
17    0
18    0
19    1
Name: East_West, dtype: int64

In [566]:
df['North_South'] = (df['latitude'] >= 40.75).astype(int)
df['North_South'].head(20)

0     0
1     1
2     0
3     1
4     1
5     0
6     1
7     0
8     1
9     0
10    1
11    1
12    0
13    1
14    1
15    0
16    0
17    1
18    1
19    0
Name: North_South, dtype: int64

In [0]:
df['North_South'] = df['North_South'].replace({0:2, 1:4})

In [568]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,Rooms,Deluxe,East_West,North_South,Sector
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.5,4.5,0,2,2
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,8.0,0,4,1
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,5.0,1,2,2
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,4.0,0,4,1
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0,6.0,0,4,1


In [0]:
df['Sector'] = df['North_South'] + df['East_West']

In [570]:
df['Sector'].value_counts()

3    19834
4    15469
5     9461
2     4053
Name: Sector, dtype: int64

In [0]:
df['Sector'] = df['Sector'].replace({2:0, 5:1, 3:2, 4:3})
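
Fed to a linear regression as a single number, the codes 0 to 3 imply that price moves by the same amount with each step up, even though the labels are arbitrary. One common alternative is one-hot encoding, sketched here with `pd.get_dummies` on made-up rows. This is an illustration of the technique, not something the notebook does.

```python
import pandas as pd

# Made-up rows using the 0-3 sector codes from the cell above.
toy = pd.DataFrame({'Sector': [0, 2, 3, 1, 2],
                    'price': [2900, 3100, 3600, 4200, 3000]})

# One 0/1 column per sector; drop_first avoids a redundant column when
# the model also fits an intercept.
sector_dummies = pd.get_dummies(toy['Sector'], prefix='Sector',
                                drop_first=True, dtype=int)
toy_encoded = pd.concat([toy, sector_dummies], axis=1)
```

With this encoding each sector gets its own coefficient, measured relative to the dropped sector 0.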

In [508]:
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space', 'Rooms',
       'Deluxe', 'East_West', 'North_South', 'Sector'],
      dtype='object')

In [0]:
def my_train_test(data, feature, target, testsize=1/3):
  X_train, X_test, y_train, y_test = train_test_split(data[feature], 
                                                      data[target], 
                                                      test_size=testsize,
                                                      random_state=42)
  model = LinearRegression()
  model.fit(X_train, y_train)
  y_train_pred = model.predict(X_train)
  y_test_pred = model.predict(X_test)
  
  print(f'For {feature} to {target} with testsize {testsize:.2f}')
  
  train_mae = mean_absolute_error(y_train, y_train_pred)
  print(f'Training MAE: {train_mae:.05f}')
  
  train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
  print(f'Train RMSE: {train_rmse:.05f}')
  
  train_r2 = r2_score(y_train, y_train_pred)
  print(f'Train R^2: {train_r2:.05f}')
  
  
  test_mae = mean_absolute_error(y_test, y_test_pred)
  print(f'Test MAE: {test_mae:.05f}')
  
  test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
  print(f'Test RMSE: {test_rmse:.05f}')
  
  test_r2 = r2_score(y_test, y_test_pred)
  print(f'Test R^2: {test_r2:.05f}')
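
Printing makes the scores hard to compare across many feature sets. A variant that returns the scores lets them be collected into a table and sorted by test MAE. The sketch below uses the same random split as `my_train_test` on made-up data; the name `score_features` is hypothetical. It also skips the target column when looping, since predicting `price` from `price` is trivially perfect.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def score_features(data, features, target, testsize=1/3):
    """Fit on a random split and return train and test MAE as a dict."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[features], data[target], test_size=testsize, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    return {
        'features': ', '.join(features),
        'train_mae': mean_absolute_error(y_train, model.predict(X_train)),
        'test_mae': mean_absolute_error(y_test, model.predict(X_test)),
    }

# Made-up data: price depends on bedrooms, "noise" is unrelated.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'bedrooms': rng.integers(0, 5, size=300),
                    'noise': rng.normal(size=300)})
toy['price'] = 2000 + 800 * toy['bedrooms'] + rng.normal(0, 300, size=300)

# Score each numeric column except the target itself.
candidates = [c for c in toy.select_dtypes(include=[np.number]).columns
              if c != 'price']
results = (pd.DataFrame([score_features(toy, [c], 'price') for c in candidates])
           .sort_values('test_mae')
           .reset_index(drop=True))
```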

In [590]:
my_train_test(df, feature=['Deluxe', 'Rooms'], target='price', testsize=.01)

For ['Deluxe', 'Rooms'] to price with testsize 0.01
Training MAE: 860.50361
Train RMSE: 1293.00753
Train R^2: 0.46074
Test MAE: 927.66654
Test RMSE: 1464.40811
Test R^2: 0.41617


In [483]:
my_train_test(df, feature=['longitude', 'latitude'], target='Deluxe')

For ['longitude', 'latitude'] to Deluxe with testsize 0.33
Training MAE: 3.03917
Train RMSE: 3.83331
Train R^2: 0.04348
Test MAE: 3.01536
Test RMSE: 3.80419
Test R^2: 0.04577


In [484]:
my_train_test(df, ['Deluxe'], 'bedrooms')

For ['Deluxe'] to bedrooms with testsize 0.33
Training MAE: 0.79096
Train RMSE: 0.98639
Train R^2: 0.20461
Test MAE: 0.79548
Test RMSE: 0.99444
Test R^2: 0.19150


In [499]:
my_train_test(df, ['latitude', 'longitude'], 'price')

For ['latitude', 'longitude'] to price with testsize 0.33
Training MAE: 1139.87523
Train RMSE: 1695.28278
Train R^2: 0.06361
Test MAE: 1147.65989
Test RMSE: 1720.89967
Test R^2: 0.06844


In [487]:
my_train_test(df, ['Deluxe'], 'price')

For ['Deluxe'] to price with testsize 0.33
Training MAE: 1006.72135
Train RMSE: 1509.59981
Train R^2: 0.25750
Test MAE: 1032.56100
Test RMSE: 1557.22701
Test R^2: 0.23721


In [553]:
my_train_test(df, ['Sector'], 'price')

For ['Sector'] to price with testsize 0.33
Training MAE: 1182.71826
Train RMSE: 1745.74384
Train R^2: 0.00704
Test MAE: 1196.65742
Test RMSE: 1776.23143
Test R^2: 0.00758


In [506]:
my_train_test(df, ['cats_allowed', 'dogs_allowed'], 'price')

For ['cats_allowed', 'dogs_allowed'] to price with testsize 0.33
Training MAE: 1189.31343
Train RMSE: 1748.39320
Train R^2: 0.00402
Test MAE: 1203.98830
Test RMSE: 1780.13775
Test R^2: 0.00321


In [578]:
# All of the numeric columns (this includes 'price' itself, which trivially gives a perfect fit)
for col in df.select_dtypes(include=[np.number]).columns.tolist():
  my_train_test(df, [col], 'price') 
  print('\n')

For ['bathrooms'] to price with testsize 0.33
Training MAE: 885.33630
Train RMSE: 1274.70701
Train R^2: 0.47059
Test MAE: 895.44764
Test RMSE: 1291.17525
Test R^2: 0.47559


For ['bedrooms'] to price with testsize 0.33
Training MAE: 967.25919
Train RMSE: 1474.93360
Train R^2: 0.29121
Test MAE: 991.73393
Test RMSE: 1515.03263
Test R^2: 0.27799


For ['latitude'] to price with testsize 0.33
Training MAE: 1191.18795
Train RMSE: 1750.92074
Train R^2: 0.00114
Test MAE: 1205.32235
Test RMSE: 1781.86255
Test R^2: 0.00127


For ['longitude'] to price with testsize 0.33
Training MAE: 1137.49190
Train RMSE: 1697.64862
Train R^2: 0.06099
Test MAE: 1144.87522
Test RMSE: 1722.74478
Test R^2: 0.06644


For ['price'] to price with testsize 0.33
Training MAE: 0.00000
Train RMSE: 0.00000
Train R^2: 1.00000
Test MAE: 0.00000
Test RMSE: 0.00000
Test R^2: 1.00000


For ['interest_level'] to price with testsize 0.33
Training MAE: 1192.55864
Train RMSE: 1751.22747
Train R^2: 0.00079
Test MAE: 1206.84734
Tes

In [0]:
df['interest_level'] = df['interest_level'].astype('category').cat.codes

In [559]:
my_train_test(df, ['interest_level'], 'price')

For ['interest_level'] to price with testsize 0.33
Training MAE: 1192.55864
Train RMSE: 1751.22747
Train R^2: 0.00079
Test MAE: 1206.84734
Test RMSE: 1781.99689
Test R^2: 0.00112
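
A possible reason for the weak score: `.cat.codes` numbers the categories alphabetically (high=0, low=1, medium=2), so the codes do not follow the natural low, medium, high order. A small sketch of an explicit mapping on made-up values follows; the `interest_order` dictionary is an assumption about the intended ordering.

```python
import pandas as pd

toy = pd.Series(['medium', 'low', 'high', 'low'], name='interest_level')

# Default category order is alphabetical: high=0, low=1, medium=2.
alphabetical_codes = toy.astype('category').cat.codes

# An explicit mapping keeps the natural order low < medium < high.
interest_order = {'low': 0, 'medium': 1, 'high': 2}
ordered_codes = toy.map(interest_order)
```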
