Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
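
A few of the ideas above could be sketched roughly as follows. This is only an illustration on a made-up three-row DataFrame: the column names mirror the renthop data, but the rows, the `perk_cols` subset, and the new feature names are assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in for the renthop data (rows are made up for illustration).
toy = pd.DataFrame({
    'description': ['Sunny 2BR near park', '   ', None],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Treat missing descriptions as empty strings before measuring them.
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Count perks by summing 0/1 amenity columns (hypothetical subset).
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']
toy['perk_count'] = toy[perk_cols].sum(axis=1)

# Pets: at least one kind allowed vs. both kinds allowed.
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Total number of rooms (beds + baths).
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

print(toy[['has_description', 'description_length', 'perk_count',
           'cats_or_dogs', 'cats_and_dogs', 'total_rooms']])
```

A beds-to-baths ratio would need extra care, since a listing with zero bathrooms would cause a division by zero.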

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, and Chapter 3.2, Multiple Linear Regression.
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views).
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
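
To see what a percentile filter like the one above does, here is a small sketch on made-up prices (the integers 1 through 1000, not the renthop data). Keeping values between the 0.5th and 99.5th percentiles drops roughly the most extreme 1% overall. `Series.between` is used here as a compact, inclusive alternative to chaining `>=` and `<=`.

```python
import numpy as np
import pandas as pd

# Made-up "prices": the integers 1..1000.
prices = pd.Series(np.arange(1, 1001))

# Cutoffs at the 0.5th and 99.5th percentiles.
low, high = np.percentile(prices, [0.5, 99.5])

# between() is inclusive on both ends by default.
trimmed = prices[prices.between(low, high)]

print(len(prices) - len(trimmed), 'rows removed')
```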

In [0]:
# infer_datetime_format is deprecated in pandas 2.x; the default parsing behaves the same way
df['created'] = pd.to_datetime(df['created'])

In [31]:
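# Check whether 2016 is the only year in the data.
# Named year_check so it is not confused with the test set defined later.
year_check = pd.DataFrame()
year_check['test'] = df['created']
year_check['new'] = year_check['test'].apply(lambda x: x.year)
year_check['new'].value_counts()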

2016    48817
Name: new, dtype: int64

In [0]:
# Add a month feature, used below to split the data into train and test sets
df['month'] = df['created'].dt.month

In [0]:
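# Split the data: April & May for training, June for testing.
# .copy() avoids SettingWithCopyWarning when adding columns to the subsets later.
train = df[(df['month'] == 4) | (df['month'] == 5)].copy()
test = df[df['month'] == 6].copy()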
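
The time-based split can be illustrated on a tiny made-up frame. The rows below are invented for the example, and `toy_train` / `toy_test` are hypothetical names. The sketch uses the `.dt` accessor and `isin` as one idiomatic way to select months, and takes explicit copies so that new columns can be added to each subset safely.

```python
import pandas as pd

# Made-up listings spanning April through June 2016.
toy = pd.DataFrame({
    'created': ['2016-04-03 10:00:00', '2016-05-17 08:30:00',
                '2016-06-01 12:15:00', '2016-06-29 23:59:00'],
    'price': [3000, 2500, 4100, 1900],
})
toy['created'] = pd.to_datetime(toy['created'])
month = toy['created'].dt.month

# Earlier months train the model; the latest month is held out for testing.
toy_train = toy[month.isin([4, 5])].copy()
toy_test = toy[month == 6].copy()

# Adding a column to a copy leaves the original frame untouched.
toy_train['price_thousands'] = toy_train['price'] / 1000

print(len(toy_train), len(toy_test))
```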

In [0]:
# Make two new features

# Total number of rooms
train['total_rooms'] = train['bathrooms'] + train['bedrooms']
test['total_rooms'] = test['bathrooms'] + test['bedrooms']

# Whether a description is present (1) or blank (0).
# Note: str(NaN) is 'nan', so a missing description counts as present here.
train['description_presence'] = train['description'].apply(
    lambda x: str(x).strip()).apply(lambda x: 0 if x == '' else 1)
test['description_presence'] = test['description'].apply(
    lambda x: str(x).strip()).apply(lambda x: 0 if x == '' else 1)

In [0]:
from sklearn.linear_model import LinearRegression

In [77]:
model = LinearRegression()

features = ['total_rooms', 'description_presence']
target = 'price'
X_train = train[features]
y_train = train[target]

model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [78]:
#the coefficient for number of total rooms
model.coef_[0]

810.1229041779217

In [80]:
#coefficient for the presence of a description
model.coef_[1]

-215.7200904925805

In [85]:
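# The intercept: the model's prediction when every feature is 0
# (equivalent to reading model.intercept_ directly)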
y_pred = model.predict([[0,0]])
y_pred[0]

1570.7920066709694
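
The coefficients and intercept combine into a prediction as intercept + coefficient × feature, summed over the features. The sketch below checks this on made-up data generated from a known linear rule. The numbers and the name `toy_model` are assumptions for illustration, not values from the renthop model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated exactly by:
#   price = 1000 + 800 * rooms - 200 * has_description
X = np.array([[1, 0], [2, 1], [3, 0], [4, 1], [2, 0]])
y = 1000 + 800 * X[:, 0] - 200 * X[:, 1]

toy_model = LinearRegression().fit(X, y)

# Rebuild the predictions by hand from the fitted parameters.
manual = toy_model.intercept_ + X @ toy_model.coef_

# Predicting at all-zero features returns the intercept.
at_zero = toy_model.predict([[0, 0]])[0]

print(toy_model.coef_, toy_model.intercept_, at_zero)
```

Read this way, each coefficient is the change in predicted price for a one-unit increase in that feature while the other features are held fixed.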

In [0]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def errors(list_of_features):
    '''
    Fit a linear regression on the given features and print RMSE, MAE,
    and R^2 for both the train and test sets.

    list_of_features: column names to use as features; all must be numeric.
    Uses the global `train` and `test` DataFrames and returns None.
    '''
    model = LinearRegression()

    features = list_of_features
    target = 'price'
    
    
    X_train = train[features]
    y_train = train[target]
    X_test = test[features]
    y_test = test[target]

    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
    mae_train = mean_absolute_error(y_train, y_pred_train)
    r2_train = r2_score(y_train, y_pred_train)

    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
    mae_test = mean_absolute_error(y_test, y_pred_test)
    r2_test = r2_score(y_test, y_pred_test)

    print('The train data has the following errors:' 
          + f'\n     Root Mean Squared Error: {rmse_train}'
          + f'\n     Mean Absolute Error: {mae_train}'
          + f'\n     R Squared: {r2_train}')
    
    print('\nThe test data has the following errors:' 
          + f'\n     Root Mean Squared Error: {rmse_test}'
          + f'\n     Mean Absolute Error: {mae_test}'
          + f'\n     R Squared: {r2_test}')
    return
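
For intuition about the three metrics that `errors` prints, they can also be computed by hand with NumPy. The four true and predicted prices below are made up for the example.

```python
import numpy as np

# Made-up true and predicted rents.
y_true = np.array([3000.0, 2500.0, 4000.0, 1500.0])
y_hat = np.array([2800.0, 2700.0, 3500.0, 1500.0])

residuals = y_true - y_hat

# MAE: average size of the errors, in dollars.
mae = np.mean(np.abs(residuals))

# RMSE: square root of the average squared error; penalizes large misses more.
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus (squared error of the model / squared error of always
# predicting the mean of y_true).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'MAE: {mae:.2f}, RMSE: {rmse:.2f}, R^2: {r2:.4f}')
```

RMSE is never smaller than MAE, and the gap between them widens when a few predictions are far off.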

In [97]:
errors(['total_rooms', 'description_presence'])

The train data has the following errors:
     Root Mean Squared Error: 1340.345530608617
     Mean Absolute Error: 893.0715474078747
     R Squared: 0.421413908230891

The test data has the following errors:
     Root Mean Squared Error: 1338.3696776935487
     Mean Absolute Error: 909.1474017082991
     R Squared: 0.4236740228591944


In [111]:
errors(['bedrooms', 'bathrooms', 'total_rooms', 'description_presence', 'latitude', 'longitude', 
        'elevator', 'cats_allowed', 'dogs_allowed', 'no_fee', 'swimming_pool', 'hardwood_floors',
        'doorman', 'dishwasher', 'laundry_in_building'])

The train data has the following errors:
     Root Mean Squared Error: 1109.248788833535
     Mean Absolute Error: 702.9479729631745
     R Squared: 0.6037289231579661

The test data has the following errors:
     Root Mean Squared Error: 1088.3920272885584
     Mean Absolute Error: 706.0884703851164
     R Squared: 0.6188580762564161


In [116]:
df.columns.to_list()

['bathrooms',
 'bedrooms',
 'created',
 'description',
 'display_address',
 'latitude',
 'longitude',
 'price',
 'street_address',
 'interest_level',
 'elevator',
 'cats_allowed',
 'hardwood_floors',
 'dogs_allowed',
 'doorman',
 'dishwasher',
 'no_fee',
 'laundry_in_building',
 'fitness_center',
 'pre-war',
 'laundry_in_unit',
 'roof_deck',
 'outdoor_space',
 'dining_room',
 'high_speed_internet',
 'balcony',
 'swimming_pool',
 'new_construction',
 'terrace',
 'exclusive',
 'loft',
 'garden_patio',
 'wheelchair_access',
 'common_outdoor_space',
 'month']

In [0]:
# Dropped the features that are not numeric.
# Could possibly encode interest_level as a numeric value and add it back in.
all_possible_features = ['bathrooms',
 'bedrooms',
 'latitude',
 'longitude',
 'elevator',
 'cats_allowed',
 'hardwood_floors',
 'dogs_allowed',
 'doorman',
 'dishwasher',
 'no_fee',
 'laundry_in_building',
 'fitness_center',
 'pre-war',
 'laundry_in_unit',
 'roof_deck',
 'outdoor_space',
 'dining_room',
 'high_speed_internet',
 'balcony',
 'swimming_pool',
 'new_construction',
 'terrace',
 'exclusive',
 'loft',
 'garden_patio',
 'wheelchair_access',
 'common_outdoor_space']
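
Instead of editing the column list by hand, the numeric columns could be selected programmatically, and `interest_level` could be given an ordinal encoding. The sketch below uses a made-up three-row frame. The category labels `low`, `medium`, `high`, the 1/2/3 mapping, and the column name `interest_level_num` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with a mix of numeric and non-numeric columns.
toy = pd.DataFrame({
    'bathrooms': [1.0, 2.0, 1.0],
    'bedrooms': [1, 3, 0],
    'description': ['nice', '', 'cozy'],
    'interest_level': ['low', 'high', 'medium'],
    'price': [2500, 5200, 1900],
})

# Keep numeric columns only, then drop the target so it is not used as a feature.
numeric_features = (toy.select_dtypes(include='number')
                       .columns.drop('price')
                       .tolist())

# Ordinal encoding of interest_level (mapping values are an assumption).
interest_order = {'low': 1, 'medium': 2, 'high': 3}
toy['interest_level_num'] = toy['interest_level'].map(interest_order)

print(numeric_features)
print(toy['interest_level_num'].tolist())
```

Applied to the notebook's `df`, the same selection would also pick up the engineered `month` column, which would need to be dropped alongside `price`.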

In [123]:
errors(all_possible_features)

The train data has the following errors:
     Root Mean Squared Error: 1090.1129619730764
     Mean Absolute Error: 692.7740796618217
     R Squared: 0.6172832623502627

The test data has the following errors:
     Root Mean Squared Error: 1078.6112459598567
     Mean Absolute Error: 701.3000059272376
     R Squared: 0.6256775228842276
