 Lambda School Data Science, Unit 2: Predictive Modeling

 # Regression & Classification, Module 2

 ## Assignment

 You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

 - [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
 - [ ] Engineer at least two new features. (See below for explanation & ideas.)
 - [ ] Fit a linear regression model with at least two features.
 - [ ] Get the model's coefficients and intercept.
 - [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
 - [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
 - [ ] As always, commit your notebook to your fork of the GitHub repo.


 #### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

 > "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

 > "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf)

 > Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

 #### Feature Ideas
 - Does the apartment have a description?
 - How long is the description?
 - How many total perks does each apartment have?
 - Are cats _or_ dogs allowed?
 - Are cats _and_ dogs allowed?
 - Total number of rooms (beds + baths)
 - Ratio of beds to baths
 - What's the neighborhood, based on address or latitude & longitude?
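
A few of the ideas above can be sketched on a toy table. This is only an illustration: the `toy` rows are made up, `add_features` is a hypothetical helper, and the column names (`description`, `bedrooms`, `bathrooms`, `cats_allowed`, `dogs_allowed`, `elevator`, `doorman`) are assumed to match the renthop columns used later in this notebook.

```python
import pandas as pd

# Made-up listings standing in for the renthop data (illustration only)
toy = pd.DataFrame({
    'description': ['Sunny 2BR near the park', '   ', None],
    'bedrooms': [2, 0, 3],
    'bathrooms': [1.0, 1.0, 2.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})


def add_features(df, perk_columns):
    """Hypothetical helper: return a copy of df with engineered columns added."""
    out = df.copy()

    # Treat missing and whitespace-only descriptions as "no description"
    text = out['description'].fillna('').str.strip()
    out['has_description'] = (text.str.len() > 0).astype(int)
    out['description_length'] = text.str.len()

    # Total number of perks: sum of the 0/1 perk indicator columns
    out['perk_count'] = out[perk_columns].sum(axis=1)

    cats = out['cats_allowed'] == 1
    dogs = out['dogs_allowed'] == 1
    out['cats_or_dogs'] = (cats | dogs).astype(int)
    out['cats_and_dogs'] = (cats & dogs).astype(int)

    out['total_rooms'] = out['bedrooms'] + out['bathrooms']

    # Guard against division by zero: a bath count of 0 gives a ratio of 0
    safe_baths = out['bathrooms'].replace(0, float('nan'))
    out['bed_bath_ratio'] = (out['bedrooms'] / safe_baths).fillna(0)
    return out


featured = add_features(
    toy, perk_columns=['cats_allowed', 'dogs_allowed', 'elevator', 'doorman'])
print(featured[['has_description', 'perk_count', 'cats_or_dogs',
                'cats_and_dogs', 'total_rooms', 'bed_bath_ratio']])
```

Filling the ratio with 0 when there are no bathrooms is one arbitrary choice among several; dropping those rows or using a large sentinel value would be equally defensible.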

 ## Stretch Goals
 - [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
 - [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
 - [ ] Add your own stretch goal(s)!

In [1]:
import os, sys


In [2]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
# import warnings
# warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')



In [3]:
import numpy
import pandas


In [4]:

# Read New York City apartment rental listing data
df = pandas.read_csv('./data/apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= numpy.percentile(df['price'], 0.5)) & 
        (df['price'] <= numpy.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= numpy.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < numpy.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= numpy.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= numpy.percentile(df['longitude'], 99.95))]
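
The percentile filter above repeats the same pattern for each column, so it could be wrapped in a small helper. The sketch below is an assumption-laden illustration: `trim_extremes` is a hypothetical name, the `toy` prices are synthetic, and the helper keeps both bounds inclusive, which differs slightly from the strict `<` used for the upper latitude bound in the cell above.

```python
import numpy as np
import pandas as pd


def trim_extremes(df, column, lower_pct, upper_pct):
    """Hypothetical helper: keep rows whose value in `column` lies between
    the given lower and upper percentiles (both bounds inclusive)."""
    low, high = np.percentile(df[column], [lower_pct, upper_pct])
    return df[df[column].between(low, high)]


# Synthetic prices 1..1000, used only to show the effect of the filter
toy = pd.DataFrame({'price': list(range(1, 1001))})
trimmed = trim_extremes(toy, 'price', 0.5, 99.5)
print(len(toy), len(trimmed))
```

On this synthetic column the filter drops the lowest and highest 0.5% of values, which together make up the "most extreme 1%" described in the comment.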


In [5]:
df.head()


| | bathrooms | bedrooms | created | description | display_address | latitude | longitude | price | street_address | interest_level | ... | high_speed_internet | balcony | swimming_pool | new_construction | terrace | exclusive | loft | garden_patio | wheelchair_access | common_outdoor_space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.5 | 3 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | 40.7145 | -73.9425 | 3000 | 792 Metropolitan Avenue | medium | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1.0 | 2 | 2016-06-12 12:19:27 | | Columbus Avenue | 40.7947 | -73.9667 | 5465 | 808 Columbus Avenue | low | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1.0 | 1 | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | 40.7388 | -74.0018 | 2850 | 241 W 13 Street | high | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1.0 | 1 | 2016-04-18 02:22:02 | Building Amenities - Garage - Garden - fitness... | East 49th Street | 40.7539 | -73.9677 | 3275 | 333 East 49th Street | low | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1.0 | 4 | 2016-04-28 01:32:41 | Beautifully renovated 3 bedroom flex 4 bedroom... | West 143rd Street | 40.8241 | -73.9493 | 3350 | 500 West 143rd Street | low | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [6]:
# Addresses that appear in at most 5 listings
address_counts = df['display_address'].value_counts()
address_counts[address_counts <= 5].index


Index(['Franklin Avenue', 'E 40 St.', 'E 21 Street', 'E 90th St',
       'West 120th Street', 'E 77th', 'W.118th St', '86th Street',
       'Bradhurst Avenue', 'E 62 St.',
       ...
       'Ft Washington and W 164 St', '324 Pearl Street',
       '*** Low Fee ** Low East 80's ***', 'Hanover', '9th Ave & W 38th St',
       '1br on w 15 st', '457 Lafayette ve', ' Washington Avenue',
       '380 South 4th Street', '14th Ave  Dyker Heights'],
      dtype='object', length=7356)

In [7]:
cleaned = df.copy()

# 1 if the listing has a non-blank description, else 0
cleaned['has_description'] = (cleaned['description'].notna() & (cleaned['description'].str.strip().str.len() > 0)).astype(int)
cleaned['created_dt'] = pandas.to_datetime(cleaned['created'])

# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
cleaned['created_week'] = cleaned['created_dt'].dt.isocalendar().week.astype(int)
# Encode the ordinal interest level as 1 (low), 2 (medium), 3 (high)
cleaned['interest_numeric'] = cleaned['interest_level'].map({'low': 1, 'medium': 2, 'high': 3})
# cleaned['is_broadway'] = cleaned['display_address']=='Broadway'

# Normalize addresses, keep those with at least 15 listings, and lump the rest into 'other'
cleaned['display_address'] = cleaned['display_address'].str.strip().str.lower()
address_counts = cleaned['display_address'].value_counts()
top_addresses = list(address_counts[address_counts >= 15].index)
cleaned['top_addresses'] = cleaned['display_address'].where(cleaned['display_address'].isin(top_addresses), other='other')
# prefix='address_' plus the default separator '_' yields columns named 'address__<name>'
cleaned = cleaned.join(pandas.get_dummies(cleaned['top_addresses'], prefix='address_'))

# cleaned['perk_count'] = cleaned[['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman', 'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio', 'wheelchair_access', 'common_outdoor_space']].sum(axis=1)

In [8]:
# top_addresses
cleaned['top_addresses'].value_counts()


other                 16030
broadway                439
east 34th street        355
second avenue           351
wall street             336
                      ...  
e 16th st.               15
west 171st street        15
e 61st st.               15
central park south       15
berry street             15
Name: top_addresses, Length: 676, dtype: int64

In [9]:
cleaned['created_dt'].dt.month



0        6
1        6
2        4
3        4
4        4
        ..
49347    6
49348    4
49349    4
49350    4
49351    4
Name: created_dt, Length: 48817, dtype: int64

In [10]:
import sklearn.model_selection as model_selection

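# A random split is not used here; the assignment asks for a time-based split:
# train, test = model_selection.train_test_split(cleaned)
# Train on April & May 2016, test on June 2016
created_month = cleaned['created_dt'].dt.month
train = cleaned[created_month.isin([4, 5])]
test = cleaned[created_month == 6]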
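
An equivalent way to express the same split is a single date cutoff, which also makes it easy to check that no test listing predates the training data. The sketch below assumes, as the assignment states, that the data only covers April through June 2016. The `toy` rows reuse a few timestamps and prices from the `df.head()` output above, plus one made-up May listing.

```python
import pandas as pd

# Three rows taken from the head() output above, plus one invented May row
toy = pd.DataFrame({
    'created': ['2016-04-17 03:26:41', '2016-05-02 10:00:00',  # May row is made up
                '2016-06-12 12:19:27', '2016-06-24 07:54:24'],
    'price': [2850, 3100, 5465, 3000],
})
toy['created_dt'] = pd.to_datetime(toy['created'])

# Everything before June 1 trains the model; everything from June 1 on tests it
cutoff = pd.Timestamp('2016-06-01')
toy_train = toy[toy['created_dt'] < cutoff]
toy_test = toy[toy['created_dt'] >= cutoff]

# Sanity checks: no rows lost, and every test row comes after every train row
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train['created_dt'].max() < toy_test['created_dt'].min()
print(len(toy_train), len(toy_test))
```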


In [11]:
cleaned.columns



Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       ...
       'address__west houston street', 'address__west st', 'address__west st.',
       'address__west street', 'address__williamsburg', 'address__worth',
       'address__worth street', 'address__york ave', 'address__york ave.',
       'address__york avenue'],
      dtype='object', length=715)

In [12]:
# test.loc[11956]


In [13]:
target = 'price'
# Use every non-object column as a feature, except the target and the raw datetime column
features = cleaned.columns[cleaned.dtypes != 'object']
features = features.drop([target, 'created_dt'])
# features = ['bathrooms', 'bedrooms', 'longitude', 'elevator', 'doorman', 'dishwasher', 'fitness_center', 'laundry_in_unit', 'dining_room', 'interest_numeric']

# ['bathrooms', 'bedrooms', 'interest_numeric', 'longitude', 'elevator', 'doorman', 'terrace', 'dishwasher', 'fitness_center']
# cleaned[features].isna().sum()


In [14]:
from sklearn.linear_model import LinearRegression

# The `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2,
# so the model is created with its default settings.
lr_model = LinearRegression()

lr_model.fit(train[features], train[target])


LinearRegression()
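
The assignment also asks for the model's coefficients and intercept. In scikit-learn these are exposed as the fitted attributes `coef_` and `intercept_`. The sketch below shows the pattern on a tiny synthetic dataset rather than on the 707-feature model above; the names `toy`, `toy_features`, and `toy_model` are hypothetical, and the prices are generated from a known formula so the recovered values can be checked by eye.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship:
# price = 1000 + 500 * bedrooms + 300 * bathrooms
toy = pd.DataFrame({
    'bedrooms': [0, 1, 2, 3, 1, 2],
    'bathrooms': [1, 1, 1, 2, 2, 2],
})
toy['price'] = 1000 + 500 * toy['bedrooms'] + 300 * toy['bathrooms']

toy_features = ['bedrooms', 'bathrooms']
toy_model = LinearRegression().fit(toy[toy_features], toy['price'])

# Pair each coefficient with its feature name for readability
coefficients = pd.Series(toy_model.coef_, index=toy_features)
print(f'Intercept: {toy_model.intercept_:.2f}')
print(coefficients.round(2))
```

The same pattern, `pd.Series(lr_model.coef_, index=features)`, should work for the model fitted above, although with hundreds of address dummies it is probably more useful to sort the series and inspect only the largest values.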

In [15]:
import sklearn.metrics as metrics

# Model predictions for the train and test sets
pred_train = lr_model.predict(train[features])
pred_test = lr_model.predict(test[features])

train_rmse = numpy.sqrt(metrics.mean_squared_error(train[target], pred_train))
train_mae = metrics.mean_absolute_error(train[target], pred_train)
train_r2 = metrics.r2_score(train[target], pred_train)
test_rmse = numpy.sqrt(metrics.mean_squared_error(test[target], pred_test))
test_mae = metrics.mean_absolute_error(test[target], pred_test)
test_r2 = metrics.r2_score(test[target], pred_test)

# Baseline: predict the overall mean price for every listing in the cleaned data
mean = numpy.mean(cleaned[target])
baseline_pred = numpy.full(len(cleaned), mean)
baseline_rmse = numpy.sqrt(metrics.mean_squared_error(cleaned[target], baseline_pred))
baseline_mae = metrics.mean_absolute_error(cleaned[target], baseline_pred)
baseline_r2 = metrics.r2_score(cleaned[target], baseline_pred)

print(f'Features: {features}')
print(f'Baseline Root Mean Squared Error: {baseline_rmse}')
print(f'Baseline Mean Absolute Error: {baseline_mae}')
print(f'Baseline R^2 score: {baseline_r2}')
print(f'Train Root Mean Squared Error: {train_rmse}')
print(f'Train Mean Absolute Error: {train_mae}')
print(f'Train R^2 score: {train_r2}')
print(f'Test Root Mean Squared Error: {test_rmse}')
print(f'Test Mean Absolute Error: {test_mae}')
print(f'Test R^2 score: {test_r2}')



Features: Index(['bathrooms', 'bedrooms', 'latitude', 'longitude', 'elevator',
       'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
       'dishwasher',
       ...
       'address__west houston street', 'address__west st', 'address__west st.',
       'address__west street', 'address__williamsburg', 'address__worth',
       'address__worth street', 'address__york ave', 'address__york ave.',
       'address__york avenue'],
      dtype='object', length=707)
Baseline Root Mean Squared Error: 1762.4127206231178
Baseline Mean Absolute Error: 1201.532252154329
Baseline R^2 score: 0.0
Train Root Mean Squared Error: 960.0094157508804
Train Mean Absolute Error: 603.0352381924381
Train R^2 score: 0.703185280887316
Test Root Mean Squared Error: 963.771460800702
Test Mean Absolute Error: 626.3325318152359
Test R^2 score: 0.7011425120344504
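
The baseline above uses the mean of the whole cleaned dataset and scores it on that same data, which is why its $R^2$ is exactly 0. A stricter variant, sketched below under assumptions, takes the mean from the training prices only and scores it on the test prices. `regression_report` is a hypothetical helper and the toy prices are made up; they are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Hypothetical helper: return RMSE, MAE, and R^2 as a dict of floats."""
    return {
        'rmse': float(np.sqrt(mean_squared_error(y_true, y_pred))),
        'mae': float(mean_absolute_error(y_true, y_pred)),
        'r2': float(r2_score(y_true, y_pred)),
    }


# Made-up prices, for illustration only
y_train_toy = np.array([2000.0, 3000.0, 4000.0])
y_test_toy = np.array([2500.0, 4500.0])

# Baseline: always predict the mean of the *training* prices
baseline_value = y_train_toy.mean()
baseline_report = regression_report(
    y_test_toy, np.full(len(y_test_toy), baseline_value))
print(baseline_report)
```

With this setup the baseline $R^2$ on the test set can be negative, because the training mean need not equal the test mean. That gives a more honest reference point for the test metrics.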


In [16]:
# df.describe().loc['std']


In [17]:

# cleaned.corr().loc['price']

