Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
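
The fitting and scoring steps in the checklist can be sketched on a tiny made-up dataset before applying them to the rental data. The arrays and the `toy_` names below are hypothetical and chosen so that the true relationship is known in advance; only the scikit-learn calls mirror what the assignment asks for.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical toy data: rent = 1000 + 500 * bedrooms + 200 * bathrooms (no noise)
X_toy = np.array([[1, 1], [2, 1], [2, 2], [3, 1], [3, 2], [4, 2]], dtype=float)
y_toy = 1000 + 500 * X_toy[:, 0] + 200 * X_toy[:, 1]

# Fit the model, then read off the learned coefficients and intercept
toy_model = LinearRegression().fit(X_toy, y_toy)
print('Coefficients:', toy_model.coef_)
print('Intercept:', toy_model.intercept_)

# Score the predictions with the three metrics named in the checklist
y_hat = toy_model.predict(X_toy)
toy_mae = mean_absolute_error(y_toy, y_hat)
toy_rmse = np.sqrt(mean_squared_error(y_toy, y_hat))
toy_r2 = r2_score(y_toy, y_hat)
print('MAE:', toy_mae, 'RMSE:', toy_rmse, 'R^2:', toy_r2)
```

Because the toy target is an exact linear function of the two features, the fit should recover the coefficients almost perfectly. Real rental data will not behave this cleanly.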


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
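
A few of the ideas above could be sketched as follows. The three-row `toy` frame is a made-up stand-in for the listings, and the new column names (`has_description`, `description_length`, `perk_count`, `cats_or_dogs`, `total_rooms`) are illustrative choices rather than names from the dataset.

```python
import pandas as pd

# Hypothetical three-row stand-in for the listings; only a few columns are included
toy = pd.DataFrame({
    'description': ['Sunny one bedroom', None, '   '],
    'bedrooms': [1, 2, 0],
    'bathrooms': [1.0, 1.5, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Treat missing or whitespace-only descriptions as "no description"
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Count perks by summing the 0/1 amenity columns (a subset is listed here)
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']
toy['perk_count'] = toy[perk_cols].sum(axis=1)

# Cats OR dogs allowed, and total rooms (beds + baths)
toy['cats_or_dogs'] = ((toy['cats_allowed'] == 1) | (toy['dogs_allowed'] == 1)).astype(int)
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

print(toy[['has_description', 'description_length', 'perk_count',
           'cats_or_dogs', 'total_rooms']])
```

The beds-to-baths ratio is left out of this sketch because listings with zero bathrooms would need a decision about how to handle division by zero.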

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [0]:
# Check out head
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
# convert the created series into a datetime type
df['created'] = pd.to_datetime(df['created'])

In [0]:
# checking type
type(df['created'].iloc[0])

pandas._libs.tslibs.timestamps.Timestamp

In [0]:
# Train on listings created in April and May 2016.
# The interval is half-open: start is included, end is excluded.
start = '2016-04-01'
end = '2016-06-01'
train = df[(df['created'] >= start) & (df['created'] < end)]

In [0]:
# checking dates
train['created'].sort_values()

5186    2016-04-01 22:12:41
7945    2016-04-01 22:56:00
6424    2016-04-01 22:57:15
7719    2016-04-01 23:26:07
1723    2016-04-02 00:48:13
                ...        
15311   2016-05-31 22:07:36
20331   2016-05-31 22:39:35
35264   2016-05-31 22:46:46
22002   2016-05-31 22:46:47
15697   2016-05-31 23:10:48
Name: created, Length: 31844, dtype: datetime64[ns]

In [0]:
# checking dates
train['created'].sort_values().tail()

15311   2016-05-31 22:07:36
20331   2016-05-31 22:39:35
35264   2016-05-31 22:46:46
22002   2016-05-31 22:46:47
15697   2016-05-31 23:10:48
Name: created, dtype: datetime64[ns]

In [0]:
# Test on listings created in June 2016.
# Use a half-open interval ending at July 1 so that listings created
# during the day on June 30 would not be dropped.
start = '2016-06-01'
end = '2016-07-01'
test = df[(df['created'] >= start) & (df['created'] < end)]
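
A half-open interval (start included, end excluded) is one way to split by month without having to reason about the last timestamp of the final day. The sketch below uses a hypothetical four-row frame, `toy_dates`, with made-up timestamps. The comparison of a datetime column against a date string relies on pandas parsing the string for you.

```python
import pandas as pd

# Hypothetical listings, including one created late on the last day of June
toy_dates = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-01 22:12:41',
        '2016-05-31 23:10:48',
        '2016-06-01 01:10:37',
        '2016-06-30 18:00:00',
    ]),
    'price': [3000, 2850, 3275, 3350],
})

# April and May: from April 1 (inclusive) up to June 1 (exclusive)
toy_train = toy_dates[(toy_dates['created'] >= '2016-04-01') &
                      (toy_dates['created'] < '2016-06-01')]

# June: from June 1 (inclusive) up to July 1 (exclusive)
toy_test = toy_dates[(toy_dates['created'] >= '2016-06-01') &
                     (toy_dates['created'] < '2016-07-01')]

print(len(toy_train), len(toy_test))
```

With an inclusive upper bound of `'2016-06-30'`, which pandas reads as midnight at the start of that day, the 18:00 listing on June 30 would fall outside the test set.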

In [0]:
# Checking dates
test['created'].sort_values()

11474   2016-06-01 01:10:37
19176   2016-06-01 01:11:06
16226   2016-06-01 01:11:12
37756   2016-06-01 01:11:52
17946   2016-06-01 01:12:22
                ...        
19943   2016-06-29 17:47:34
16801   2016-06-29 17:56:12
32633   2016-06-29 18:14:48
20560   2016-06-29 18:30:41
17743   2016-06-29 21:41:47
Name: created, Length: 16973, dtype: datetime64[ns]

In [0]:
# Checking dates
test['created'].sort_values().tail()

19943   2016-06-29 17:47:34
16801   2016-06-29 17:56:12
32633   2016-06-29 18:14:48
20560   2016-06-29 18:30:41
17743   2016-06-29 21:41:47
Name: created, dtype: datetime64[ns]

In [0]:
# Loop-based (slow) version: flag listings that allow both cats and dogs
cats_dogs_train = []

for i in range(len(train)):
  if train['cats_allowed'].iloc[i] == 1 and train['dogs_allowed'].iloc[i] == 1:
    cats_dogs_train.append(1)
  else:
    cats_dogs_train.append(0)

train['dogs_and_cats_allowed'] = cats_dogs_train

cats_dogs_test = []

for i in range(len(test)):
  if test['cats_allowed'].iloc[i] == 1 and test['dogs_allowed'].iloc[i] == 1:
    cats_dogs_test.append(1)
  else:
    cats_dogs_test.append(0)

test['dogs_and_cats_allowed'] = cats_dogs_test

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


In [0]:
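# Alternative and much shorter vectorized code.
# Note: this adds the column to df, so the train and test frames
# that were already split off above do not receive it.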

df['cats_and_dogs_allowed'] = ((df['cats_allowed'] == 1) & (df['dogs_allowed'] == 1)).astype(int)
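
The "A value is trying to be set on a copy of a slice" message shown earlier is pandas' SettingWithCopyWarning. It appears because `train` and `test` were produced by filtering `df`, and pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, sketched here on a hypothetical `toy_df`, is to take an explicit `.copy()` of the filtered rows and then build the flag with the same vectorized expression.

```python
import pandas as pd

# Hypothetical four-row frame standing in for df
toy_df = pd.DataFrame({
    'cats_allowed': [1, 0, 1, 1],
    'dogs_allowed': [1, 0, 0, 1],
    'price': [3000, 2500, 2800, 4000],
})

# .copy() makes the filtered rows an independent DataFrame,
# so adding a column to it is unambiguous
toy_subset = toy_df[toy_df['price'] >= 2800].copy()

# Vectorized flag: 1 where both cats and dogs are allowed, else 0
toy_subset['dogs_and_cats_allowed'] = (
    (toy_subset['cats_allowed'] == 1) & (toy_subset['dogs_allowed'] == 1)
).astype(int)

print(toy_subset)
```

The new column exists only on `toy_subset`. The original `toy_df` is left unchanged, which is also why a column added to `df` after the split does not show up in `train` or `test`.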

In [0]:
# Combining bedrooms and bathrooms into a new feature
train['bedrooms_and_bathrooms'] = train['bedrooms'] + train['bathrooms']
test['bedrooms_and_bathrooms'] = test['bedrooms'] + test['bathrooms']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
  


In [0]:
#checking head to see if the new features are implemented
train.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,dogs_and_cats_allowed,bedrooms_and_bathrooms
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6.0
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3.0


In [0]:
# Checking head to see if the new features are implemented
test.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,dogs_and_cats_allowed,bedrooms_and_bathrooms
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.5
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3.0
11,1.0,1,2016-06-03 03:21:22,Check out this one bedroom apartment in a grea...,W. 173rd Street,40.8448,-73.9396,1675,644 W. 173rd Street,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
14,1.0,1,2016-06-01 03:11:01,Spacious 1-Bedroom to fit King-sized bed comfo...,East 56th St..,40.7584,-73.9648,3050,315 East 56th St..,low,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
24,2.0,4,2016-06-07 04:39:56,SPRAWLING 2 BEDROOM FOUND! ENJOY THE LUXURY OF...,W 18 St.,40.7391,-73.9936,7400,30 W 18 St.,medium,1,1,1,1,1,1,0,0,1,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,1,6.0


In [0]:
# Import the necessary libraries and instantiate the class
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
model = LinearRegression()

In [0]:
def train_test(feat1, feat2):
  '''
  Fit a linear regression on two user-chosen features and print the
  coefficients, intercept, MSE, RMSE, MAE, and R^2 for the train and test sets.
  '''

  model = LinearRegression()
  target = 'price'
  y_train = train[target]

  features = [feat1, feat2]
  X_train = train[features]

  y_test = test[target]
  X_test = test[features]

  # Fit on the training data and score the training predictions
  model.fit(X_train, y_train)
  y_pred = model.predict(X_train)
  mae1 = mean_absolute_error(y_train, y_pred)
  mse1 = mean_squared_error(y_train, y_pred)
  rmse1 = np.sqrt(mse1)
  r2_1 = r2_score(y_train, y_pred)
  train_coef = model.coef_
  train_inter = model.intercept_

  # Score the same fitted model on the held-out test data
  y_pred = model.predict(X_test)
  mae2 = mean_absolute_error(y_test, y_pred)
  mse2 = mean_squared_error(y_test, y_pred)
  rmse2 = np.sqrt(mse2)
  r2_2 = r2_score(y_test, y_pred)

  print('Train coefficient:', train_coef)
  print('Train Intercept', train_inter)
  print('Train mean squared error:', mse1)
  print('Train root mean squared error:', rmse1)
  print('Train mean absolute error:', mae1)
  print('Train R^2', r2_1, '\n')
  print('test mean squared error:', mse2)
  print('test root mean squared error:', rmse2)
  print('test mean absolute error:', mae2)
  print('test R^2', r2_2)

In [0]:
# Best results I got by hand
train_test('bedrooms_and_bathrooms','longitude')

Train coefficient: [   822.30767871 -16724.61096234]
Train Intercept -1235838.4289918754
Train mean squared error: 1564574.3696600895
Train root mean squared error: 1250.8294726540823
Train mean absolute error: 805.4563291145259
Train R^2 0.4961158933612979 

test mean squared error: 1559250.3888837276
test root mean squared error: 1248.6994790115546
test mean absolute error: 819.0280292993083
test R^2 0.49831411869293063
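
To see how the printed coefficients and intercept turn into a prediction, the linear equation can be evaluated by hand. The numbers below are copied from the output above. The example listing (2 total rooms at longitude -73.99) is a hypothetical input chosen for illustration.

```python
# Coefficients and intercept copied from the printed output above
coef_rooms = 822.30767871
coef_longitude = -16724.61096234
intercept = -1235838.4289918754

# Hypothetical listing: 2 total rooms (beds + baths) at longitude -73.99
rooms = 2.0
longitude = -73.99

# price = intercept + coef_rooms * rooms + coef_longitude * longitude
predicted_rent = intercept + coef_rooms * rooms + coef_longitude * longitude
print(round(predicted_rent, 2))
```

Read this way, the fitted model adds roughly $822 per additional room and lowers the predicted rent as longitude increases, that is, as a listing moves east. The large negative intercept has no meaning on its own; it offsets the longitude term, since every longitude in the data is near -74.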


In [0]:
# Create a feature dataframe: drop the datetime and free-text columns and the target.
# interest_level is still a string at this point and is converted in the next cell.
train1 = train.drop(columns=['created', 'description', 'display_address', 'street_address', 'price'])

In [0]:
# Convert interest_level into an ordinal numeric code (low=0, medium=1, high=2)
interest_codes = {'low': 0, 'medium': 1, 'high': 2}
train1['interest_level'] = train1['interest_level'].map(interest_codes)

In [0]:
# Checking head
train.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,dogs_and_cats_allowed,bedrooms_and_bathrooms
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6.0
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3.0


In [0]:
# checking head
train1.head()

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,dogs_and_cats_allowed,bedrooms_and_bathrooms
2,1.0,1,40.7388,-74.0018,2,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
3,1.0,1,40.7539,-73.9677,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
4,1.0,4,40.8241,-73.9493,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0
5,2.0,4,40.7429,-74.0028,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6.0
6,1.0,2,40.8012,-73.966,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3.0


In [0]:
#checking shape
train.shape

(31844, 36)

In [0]:
#checking shape
train1.shape

(31844, 31)

In [0]:
# Brute-force random search for the combination of 4 columns with the lowest train MAE

# import random library
import random

# Create four lists of 5000 random column indexes, each between 0 and 30 inclusive
rand1 = [random.randrange(0,31) for i in range(5000)]
rand2 = [random.randrange(0,31) for i in range(5000)]
rand3 = [random.randrange(0,31) for i in range(5000)]
rand4 = [random.randrange(0,31) for i in range(5000)]

# Instantiate the class
model = LinearRegression()

# Identify the target of the dataframe
target = 'price'

# Set the y_train to the target
y_train = train[target]

# Starting threshold: only combinations with an MAE below this value are reported
low_mae = 1000

# For loop that randomly selects different columns and does a mae test on the randomized columns
for x,y,z,a in zip(rand1,rand2,rand3,rand4):

  # Set the features with the x, y, z, and a indexes
  features = [train1.columns[x], train1.columns[y], train1.columns[z], train1.columns[a]]
 
  # Set x_train to the randomized features
  X_train = train1[features]

  # fit the model
  model.fit(X_train, y_train)

  # Predict on the training data
  y_pred = model.predict(X_train)

  # Calculate the mean absolute error (on the training data, despite the variable name)
  mae_test = mean_absolute_error(y_train, y_pred)

  # Check whether the calculated MAE is lower than the current low_mae
  if mae_test < low_mae:

    # Sets low_mae to the new lowest mae value
    low_mae = mae_test

    # print feature one
    print('feature 1:', train1.columns[x])

    # print feature two
    print('feature 2:', train1.columns[y])

    # print feature three
    print('feature 3:', train1.columns[z])

    print('feature 4:', train1.columns[a])

    # print lowest mae
    print('mae:', mae_test, '\n')


feature 1: terrace
feature 2: common_outdoor_space
feature 3: bedrooms_and_bathrooms
feature 4: cats_allowed
mae: 887.5400029425207 

feature 1: laundry_in_building
feature 2: bathrooms
feature 3: latitude
feature 4: elevator
mae: 861.3177704389567 

feature 1: exclusive
feature 2: bedrooms_and_bathrooms
feature 3: bedrooms
feature 4: dishwasher
mae: 810.9010899722233 

feature 1: bathrooms
feature 2: bedrooms
feature 3: laundry_in_building
feature 4: laundry_in_unit
mae: 809.737152451661 

feature 1: longitude
feature 2: swimming_pool
feature 3: dogs_allowed
feature 4: bedrooms_and_bathrooms
mae: 800.8628985482976 

feature 1: interest_level
feature 2: roof_deck
feature 3: bathrooms
feature 4: bedrooms
mae: 787.2465052579074 

feature 1: balcony
feature 2: bathrooms
feature 3: doorman
feature 4: bedrooms_and_bathrooms
mae: 771.5643950956313 

feature 1: doorman
feature 2: bedrooms
feature 3: bedrooms_and_bathrooms
feature 4: elevator
mae: 770.1802758871628 

feature 1: bedrooms_and_ba