Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
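
A few of the ideas above can be sketched in pandas. This is only a minimal illustration on a hypothetical three-row sample (the `sample` frame and the new column names are assumptions, not part of the assignment); the real columns come from the renthop data loaded below.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample that mimics a few renthop columns
sample = pd.DataFrame({
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
})

cats = sample['cats_allowed'] == 1
dogs = sample['dogs_allowed'] == 1

# Are cats or dogs allowed? Are cats and dogs both allowed?
sample['cats_or_dogs'] = (cats | dogs).astype(int)
sample['cats_and_dogs'] = (cats & dogs).astype(int)

# Total number of rooms (beds + baths)
sample['total_rooms'] = sample['bedrooms'] + sample['bathrooms']

# Ratio of beds to baths; zero baths becomes NaN to avoid dividing by zero
sample['bed_bath_ratio'] = sample['bedrooms'] / sample['bathrooms'].replace(0, np.nan)

print(sample)
```

Rows with zero bathrooms end up with a missing ratio, so they would need to be filled or dropped before fitting a linear regression.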

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [0]:
print(df.shape)
df.head(2)

(48817, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
# Checking the data type of the 'created' column
df['created'].dtype

dtype('O')

In [0]:
# Will convert the date each listing was created into a datetime object
import datetime

In [0]:
df['created'] = pd.to_datetime(df['created'])
# Looking at the range of years in the 'created' column
df['created'].dt.year.value_counts(dropna=False)

2016    48817
Name: created, dtype: int64

In [0]:
# Won't use anything with the date (for feature engineering),
# because all of the listings are from the same year.

In [0]:
# I will make a new feature that will count the number of perks that each
# apartment has. Bedrooms and bathrooms will not be considered perks.

# Making my function to add up the perks
def addPerks(row):
  numPerks = 0
  # Positions 10 through 33 are the 0/1 perk columns ('elevator' onward).
  # A fixed end keeps columns added later (such as numPerks) out of the count.
  for i in range(10, 34):
    if row.iloc[i] == 1:
      numPerks = numPerks + 1
  return numPerks

In [0]:
# Using the function to make the new feature
df['numPerks'] = df.apply(addPerks, axis=1)

In [0]:
# Checking to see what values are held in the numPerks column
df['numPerks'].value_counts()

2     10670
3      7259
4      5766
5      4942
6      3853
0      3729
7      2985
8      2459
9      2039
10     1552
11     1221
12      893
13      659
14      393
15      214
16      107
17       60
18       12
19        4
Name: numPerks, dtype: int64
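
As an alternative to the row-by-row `apply`, the same count can be obtained by summing the 0/1 perk columns across each row. The sketch below uses a hypothetical three-row sample with only three perk columns; the `perk_cols` list is an assumption and would need to name all of the real perk columns.

```python
import pandas as pd

# Hypothetical sample with three 0/1 perk columns
sample = pd.DataFrame({
    'bedrooms': [3, 2, 1],
    'elevator': [0, 1, 0],
    'doorman': [0, 1, 0],
    'dishwasher': [0, 1, 1],
})

# Name the perk columns explicitly instead of relying on column positions
perk_cols = ['elevator', 'doorman', 'dishwasher']

# Summing the 0/1 flags across each row counts the perks
sample['numPerks'] = sample[perk_cols].sum(axis=1)

print(sample['numPerks'].tolist())
```

Selecting the columns by name also means that re-running the cell cannot accidentally count a newly added column.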

In [0]:
# Going to create a second feature
# Will try to use the descriptions
print(df['description'].dtype)
df['description'].isnull().sum()



object

In [0]:
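# I think that I will make the second feature be the length of the description
# Here is my function that will find the length of the description string
def descriptionLen(descriptionStr):
  # Missing descriptions (NaN) count as length 0; str(NaN) would give 'nan'
  if pd.isna(descriptionStr):
    return 0
  theString = str(descriptionStr)
  # Descriptions that are empty or only whitespace also count as length 0
  if not theString.strip():
    return 0
  else:
    return len(theString)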

In [0]:
df['descriptionLen'] = df['description'].apply(descriptionLen)
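
The same length feature can also be sketched with the pandas string accessor instead of a Python function. This assumes missing and whitespace-only descriptions should both count as length 0; the three-value `sample` series is hypothetical.

```python
import pandas as pd

# Hypothetical descriptions: normal text, whitespace only, and missing
sample = pd.Series(['Top West Village location', '        ', None], name='description')

# Treat missing values as empty strings
filled = sample.fillna('')

# Keep the raw length unless the description is blank after stripping
lengths = filled.str.len().where(filled.str.strip() != '', 0)

print(lengths.tolist())
```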

In [0]:
df.head(4)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,numPerks,descriptionLen
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,588
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,691
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,492


In [33]:
df['created'].dt.month.value_counts()

6    16973
4    16217
5    15627
Name: created, dtype: int64

In [34]:
# Splitting my data into train (April & May) and test (June) groups
train = df[df['created'].dt.month < 6 ]
test = df[df['created'].dt.month == 6]
print(train.shape, test.shape)

(31844, 36) (16973, 36)


In [0]:
# doing the imports for sklearn
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error , mean_squared_error, r2_score

In [0]:
# I will try using 3 different features
# "bedrooms", 'numPerks', and 'descriptionLen'
features = ['bedrooms', "numPerks",'descriptionLen']
target = 'price'

In [0]:
# showing what the baseline would be if we just used the mean of the price
dummyModel = DummyRegressor()
dummyModel.fit(train[features], train[target])
dummyPred = dummyModel.predict(test[features])

In [65]:
# Printing out the mean absolute error for the dummy model(baseline)
maeDummy = mean_absolute_error(test[target], dummyPred )
first = "The mean absolute error using the mean"
second = f"as a baseline with the test group is: {maeDummy:.2f}"
print(first + "\n" + second)

The mean absolute error using the mean
as a baseline with the test group is: 1197.71
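
To see what the baseline is doing, here is a small sketch on hypothetical prices (not the renthop data). With `strategy='mean'`, which is its default, `DummyRegressor` ignores the features and predicts the mean of the training target for every row.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical prices, not taken from the renthop data
y_train = np.array([2000.0, 3000.0, 4000.0])
y_test = np.array([2500.0, 5500.0])

# The features are ignored by DummyRegressor, so zeros are enough here
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
dummy_pred = baseline.predict(X_test)

# Same thing by hand: predict the training mean for every test row
manual_pred = np.full(len(y_test), y_train.mean())
manual_mae = np.abs(y_test - manual_pred).mean()

print(dummy_pred, mean_absolute_error(y_test, dummy_pred), manual_mae)
```

Any model worth keeping should get a lower test MAE than this baseline.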


In [0]:
myModel = LinearRegression()

In [0]:
# fitting my model
myModel.fit(train[features], train[target])
# predicting on the training set to show the error of the training set
trainPred = myModel.predict(train[features])


In [72]:
# Getting all the scores for the train set
mae = mean_absolute_error(train[target], trainPred)
print(f"Train mean absolute error is: {mae:.3f}")

# Getting the root mean squared error and R squared
rmse = np.sqrt(mean_squared_error(train[target], trainPred))
r_squared = r2_score(train[target], trainPred)

print(f"The Train RMSE is: {rmse:.3f}")
print(f"The Train R squared score is: {r_squared:.3f}")

Train mean absolute error is: 927.144
The Train RMSE is: 1426.756
The Train R squared score is: 0.344


In [0]:
# testing the model with the test group
y_preds = myModel.predict(test[features])


In [74]:
# getting the mean absolute error for the test group
mae = mean_absolute_error(test[target], y_preds)
print(f"The mean absolute error for the test is {mae:.2f}")

# Getting the root mean squared error and R squared
rmse = np.sqrt(mean_squared_error(test[target], y_preds))
r_squared = r2_score(test[target], y_preds)

print(f"The Test RMSE is: {rmse:.3f}")
print(f"The Test R squared score is: {r_squared:.3f}")

The mean absolute error for the test is 936.18
The Test RMSE is: 1426.310
The Test R squared score is: 0.345
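
For reference, the three metrics can be computed by hand with NumPy. The sketch below uses four hypothetical prices and predictions and checks the results against the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true prices and predictions
y_true = np.array([3000.0, 5000.0, 2500.0, 3500.0])
y_pred = np.array([3200.0, 4500.0, 2500.0, 3900.0])

errors = y_true - y_pred

# MAE: average size of the errors, in dollars
mae = np.abs(errors).mean()

# RMSE: square root of the average squared error, penalizes large misses more
rmse = np.sqrt((errors ** 2).mean())

# R squared: 1 minus (squared error of the model / squared error of the mean)
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

print(f"MAE {mae:.2f}, RMSE {rmse:.2f}, R squared {r2:.3f}")
```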


In [57]:
print(f"The coefficients are: {myModel.coef_}, and the intercept is: {myModel.intercept_}")

The coefficients are: [8.10806204e+02 1.24307779e+02 3.53696343e-02], and the intercept is: 1717.3220948264493


In [45]:
# Showing the coefficients in a pandas Series, labeled by feature name
myCoef = pd.Series(myModel.coef_, features)
print('Intercept', myModel.intercept_)
print(myCoef)

Intercept 1717.3220948264493
bedrooms          810.806204
numPerks          124.307779
descriptionLen      0.035370
dtype: float64
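
One way to read these numbers: a prediction is the intercept plus each coefficient times its feature value. The sketch below redoes that arithmetic for a hypothetical listing (2 bedrooms, 5 perks, 500-character description), with the intercept and coefficients copied from the output above.

```python
# Intercept and coefficients copied from the output above
intercept = 1717.3220948
coefs = {
    'bedrooms': 810.806204,
    'numPerks': 124.307779,
    'descriptionLen': 0.0353696343,
}

# Hypothetical listing: 2 bedrooms, 5 perks, 500-character description
listing = {'bedrooms': 2, 'numPerks': 5, 'descriptionLen': 500}

# Linear regression prediction: intercept + sum of (coefficient * value)
predicted = intercept + sum(coefs[name] * listing[name] for name in coefs)

print(f"${predicted:.2f}")
```

Under this model, each extra bedroom adds about $811 to the predicted rent and each extra perk about $124, holding the other features fixed.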


In [0]:
# This is a helper function that prints the scores:
# mean absolute error, RMSE, and R squared
def getScores(y_trues, y_preds):
  mae = mean_absolute_error(y_trues, y_preds)
  rmse = np.sqrt(mean_squared_error(y_trues, y_preds))
  r = r2_score(y_trues, y_preds)
  t = f"The mean absolute error is: {mae:.2f}"
  u = f"The RMSE is {rmse:.2f}"
  v = f"The r_squared is {r:.2f}"

  print(t + "\n" + u + "\n" + v)

In [0]:
# Creating a function that fits a linear regression on the given features
# and prints the scores for the test group
def makePred(mfeatures=features, mTarget='price'):
  model = LinearRegression()
  y_test = test[mTarget]
  y_train = train[mTarget]
  model.fit(train[mfeatures], y_train)
  y_pred = model.predict(test[mfeatures])

  # returning the scores
  return getScores(y_test, y_pred)

In [81]:
# Trying the method with the same features and then maybe others
makePred()

The mean absolute error is: 936.18
The RMSE is 1426.31
The r_squared is 0.35


In [82]:
# Trying a new prediction after adding bathrooms to the features
myfeatures = ['bathrooms', 'bedrooms', 'numPerks', 'descriptionLen']
makePred(mfeatures=myfeatures)

The mean absolute error is: 800.84
The RMSE is 1192.96
The r_squared is 0.54


In [0]:
# Creating the function that can be used with ipywidgets to show the price.
# Uses the same 4-feature model as above; the model is refit on every call.
def whatPrice(tBathrooms, tBedrooms, tNumPerks=df['numPerks'].max(), descriptionLen=df['descriptionLen'].max()):
  model = LinearRegression()
  model.fit(train[myfeatures], train[target])

  # One-row DataFrame so the column names match the training features
  row = pd.DataFrame([[tBathrooms, tBedrooms, tNumPerks, descriptionLen]], columns=myfeatures)
  y_pred = model.predict(row)

  print(f"The predicted price for this type of apartment is: ${y_pred[0]:.0f}")


In [110]:
whatPrice(2,3,5,100)

The predicted price for this type of apartment is: $5710


In [0]:
# Importing interact from ipywidgets
from ipywidgets import interact

In [115]:
interact(whatPrice, tBathrooms=(0,5), tBedrooms=(0,10), tNumPerks=(0,8), descriptionLen=(0,df['descriptionLen'].max()));

interactive(children=(IntSlider(value=2, description='tBathrooms', max=5), IntSlider(value=5, description='tBe…