<a href="https://colab.research.google.com/github/worldwidekatie/DS-Unit-2-Linear-Models/blob/master/module2-regression-2/LS_DS_212_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
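
A few of the ideas above can be sketched on a tiny, made-up frame. The column names mirror the renthop data, but the three rows and the new column names (`perk_count`, `cats_or_dogs`, `cats_and_dogs`, `beds_per_bath`) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical three-row frame; values are invented for illustration
listings = pd.DataFrame({
    'bedrooms': [1, 2, 0],
    'bathrooms': [1.0, 1.5, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'elevator': [1, 1, 0],
    'doorman': [0, 1, 0],
})

# Total perks: sum the 0/1 perk columns row by row
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']
listings['perk_count'] = listings[perk_cols].sum(axis=1)

# Cats OR dogs vs. cats AND dogs
cats = listings['cats_allowed'] == 1
dogs = listings['dogs_allowed'] == 1
listings['cats_or_dogs'] = (cats | dogs).astype(int)
listings['cats_and_dogs'] = (cats & dogs).astype(int)

# Total rooms and ratio of beds to baths
listings['total_rooms'] = listings['bedrooms'] + listings['bathrooms']
listings['beds_per_bath'] = listings['bedrooms'] / listings['bathrooms']

print(listings)
```

On real data, listings with zero bathrooms would make the ratio infinite or undefined, so that feature likely needs extra handling before it goes into a linear model.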

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
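
The percentile filter above keeps only rows between a low and a high cutoff. A minimal sketch of the same idea on a made-up price series (the values and the 10th/90th percentile cutoffs are assumptions chosen so the effect is visible on eight rows):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme value at each end
prices = pd.Series([100, 1500, 2000, 2500, 3000, 3500, 4000, 90000])

# Cutoffs at the 10th and 90th percentiles
low, high = np.percentile(prices, [10, 90])

# Series.between is inclusive on both ends by default
trimmed = prices[prices.between(low, high)]
print(low, high)
print(trimmed.tolist())
```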

# 1. Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

In [143]:
print(df['created'].min())
print(df['created'].max())

2016-04-01 22:12:41
2016-06-29 21:41:47


In [144]:
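# `created` is an ISO-formatted string, so string comparison follows date order.
# Using < and >= on the same cutoff puts every row in exactly one of the two sets.
train = df[df['created'] < '2016-06-01']
test = df[df['created'] >= '2016-06-01']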
train.shape, test.shape

((31844, 34), (16973, 34))
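
As an alternative sketch, the same split can be done on parsed timestamps rather than strings. The four rows below are made up; only the `created` and `price` column names come from the dataset.

```python
import pandas as pd

# Hypothetical listings straddling the June 1 cutoff
listings = pd.DataFrame({
    'created': ['2016-04-15 10:00:00', '2016-05-31 23:59:59',
                '2016-06-01 00:00:00', '2016-06-20 08:30:00'],
    'price': [2500, 3100, 2800, 4000],
})

# Parse strings into datetimes so comparisons are true date comparisons
listings['created'] = pd.to_datetime(listings['created'])
cutoff = pd.Timestamp('2016-06-01')

train_demo = listings[listings['created'] < cutoff]
test_demo = listings[listings['created'] >= cutoff]
print(train_demo.shape, test_demo.shape)
```

Because the two masks are exact complements, no row is dropped or counted twice.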

# 2. Engineer at least two new features.

## I engineered desirability (interest level + exclusive) and total rooms

### Then I later engineered has_description and description length

In [145]:
train['interest_level'] = train['interest_level'].replace({'low': 0, 'medium': 1, 'high': 2})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

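The warning above appears because `train` is a filtered slice of `df`, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, sketched here on a made-up frame, is to take an explicit `.copy()` when creating the subset. The names `df_demo` and `train_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical frame standing in for the full listings data
df_demo = pd.DataFrame({
    'created': ['2016-04-15', '2016-06-02', '2016-05-03'],
    'interest_level': ['low', 'high', 'medium'],
})

# .copy() makes the subset an independent DataFrame
train_demo = df_demo[df_demo['created'] < '2016-06-01'].copy()

# Assigning a new or modified column no longer triggers the warning
train_demo['interest_level'] = train_demo['interest_level'].map(
    {'low': 0, 'medium': 1, 'high': 2})

print(train_demo)
```

The original `df_demo` keeps its string labels, since only the copy was modified.
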

In [146]:
train['desirability'] = train['interest_level'] + train['exclusive']
train['desirability'].value_counts()

0    21117
1     8059
2     2516
3      152
Name: desirability, dtype: int64

In [147]:
train['total_rooms'] = train['bathrooms'] + train['bedrooms'] + train['dining_room']
train['total_rooms'].value_counts()

2.0     9630
3.0     7053
1.0     5833
4.0     4607
5.0     2140
6.0     1241
7.0      473
4.5      166
3.5      136
5.5      127
0.0       95
8.0       91
2.5       75
6.5       70
9.0       40
7.5       30
10.0      11
8.5       10
1.5        6
9.5        5
11.0       4
12.0       1
Name: total_rooms, dtype: int64

In [148]:
test['interest_level'] = test['interest_level'].replace({'low': 0, 'medium': 1, 'high': 2})
test['desirability'] = test['interest_level'] + test['exclusive']
test['total_rooms'] = test['bathrooms'] + test['bedrooms']+ test['dining_room']


In [149]:
# 1 if the listing has a description, 0 if the description is missing
test['has_description'] = test['description'].notnull().astype(int)
test['has_description'].value_counts()

1    16517
0      456
Name: has_description, dtype: int64

In [150]:
# 1 if the listing has a description, 0 if the description is missing
train['has_description'] = train['description'].notnull().astype(int)
train['has_description'].value_counts()

1    30875
0      969
Name: has_description, dtype: int64

In [151]:
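# Count space-separated tokens in each description.
# Note: ''.split(' ') returns [''], so a missing description gets length 1, not 0.
test['length'] = test['description'].fillna('').str.split(' ').str.len()
test['length']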

0         95
1          9
11       105
14        86
24       137
        ... 
49305     52
49310    349
49320     44
49332     84
49347    115
Name: length, Length: 16973, dtype: int64

In [152]:
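# Count space-separated tokens in each description.
# Note: ''.split(' ') returns [''], so a missing description gets length 1, not 0.
train['length'] = train['description'].fillna('').str.split(' ').str.len()
train['length']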

2         94
3         80
4         68
5          9
6         87
        ... 
49346     45
49348    191
49349     84
49350     99
49351    133
Name: length, Length: 31844, dtype: int64
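
Splitting on a single space counts an empty description as one token and counts repeated spaces as extra tokens. A possible alternative, sketched below as a plain helper (the name `word_count` is hypothetical and is not used elsewhere in this notebook), splits on any run of whitespace and treats missing text as zero words.

```python
def word_count(text):
    """Count whitespace-separated words; missing or blank text counts as 0."""
    if not isinstance(text, str):  # covers None and NaN
        return 0
    return len(text.split())  # split() with no argument collapses whitespace runs

sample = 'Sunny  two bedroom'  # note the double space

print(len(''.split(' ')))      # 1
print(word_count(''))          # 0
print(len(sample.split(' ')))  # 4
print(word_count(sample))      # 3
```

It could be applied with something like `train['description'].apply(word_count)`. Doing so would change the `length` values slightly, so the metrics reported in this notebook would not be reproduced exactly.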

# 3. Fit a linear regression model with at least two features.

## I'm using desirability and total rooms

In [153]:
features = ['desirability', 'total_rooms']
target = ['price']

X_train = train[features]
X_test = test[features]

y_train = train[target]
y_test = test[target]

from sklearn.linear_model import LinearRegression
model = LinearRegression()

model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

# 4. Get the model's coefficients and intercept.

In [154]:
print('Intercept:', model.intercept_)
print('Features:', features)
print("Coefficient:", model.coef_)

Intercept: [1631.34546662]
Features: ['desirability', 'total_rooms']
Coefficient: [[-531.15423833  765.13914658]]
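
To make the numbers above concrete, the fitted model is just intercept + coefficient × feature, summed over the features. The sketch below plugs the printed values into that formula by hand; the helper name `predict_price` and the example apartment are assumptions for illustration.

```python
# Values copied from the printed intercept and coefficients
intercept = 1631.34546662
coefficients = {'desirability': -531.15423833, 'total_rooms': 765.13914658}

def predict_price(desirability, total_rooms):
    """Linear prediction: intercept plus each coefficient times its feature."""
    return (intercept
            + coefficients['desirability'] * desirability
            + coefficients['total_rooms'] * total_rooms)

# Hypothetical apartment: desirability 1, three total rooms
example = predict_price(desirability=1, total_rooms=3)
print(round(example, 2))  # 3395.61
```

Read this way, each additional room is associated with roughly $765 more in predicted rent, and each additional point of desirability with roughly $531 less, holding the other feature fixed.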


# 5. Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
## **Self-Made Stretch Goal:** Make a function so it's easier to reuse later

In [0]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def lin_func(X_, y_):
  """Print MSE, RMSE, MAE, and R^2 for the current global `model` on (X_, y_)."""
  y_pred = model.predict(X_)
  mse = mean_squared_error(y_, y_pred)
  rmse = np.sqrt(mse)  # RMSE is in the same units as the target (dollars)
  mae = mean_absolute_error(y_, y_pred)
  r2 = r2_score(y_, y_pred)
  print('Mean Squared Error:', mse)
  print('Root Mean Squared Error:', rmse)
  print('Mean Absolute Error:', mae)
  print('R^2:', r2)

In [156]:
print("Training Data Stats")
lin_func(X_train, y_train)
print("")
print("Test Data Stats")
lin_func(X_test, y_test)

Training Data Stats
Mean Squared Error: 1651698.1471388913
Root Mean Squared Error: 1285.184090758554
Mean Absolute Error: 860.2693437827477
R^2: 0.4680569607639088

Test Data Stats
Mean Squared Error: 1629488.6022558818
Root Mean Squared Error: 1276.5142389553992
Mean Absolute Error: 865.2595762286496
R^2: 0.475715105584927
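
For reference, the three metrics can also be computed by hand. This is a minimal pure-Python sketch of the standard formulas on three made-up prices, not a replacement for the scikit-learn functions used above. The helper name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 computed directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)             # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    return {
        'rmse': math.sqrt(ss_res / n),
        'mae': sum(abs(e) for e in errors) / n,
        'r2': 1 - ss_res / ss_tot,
    }

# Made-up actual and predicted rents
metrics = regression_metrics([1000, 2000, 3000], [1100, 1900, 3200])
print(metrics)
```

RMSE squares the errors before averaging, so the single 200-dollar miss pulls it above the MAE.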


# 6. What's the best test MAE you can get? Share your score and features used with your cohort on Slack!

## All numeric features

In [157]:
features = ['bathrooms', 'bedrooms', 'latitude', 'longitude', 
            'interest_level', 'elevator', 'cats_allowed', 'hardwood_floors', 
            'dogs_allowed', 'doorman', 'dishwasher', 'no_fee', 'laundry_in_building', 
            'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
            'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool', 
            'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio', 
            'wheelchair_access', 'common_outdoor_space', 'has_description']

target = ['price']
X_train = train[features]
X_test = test[features]

y_train = train[target]
y_test = test[target]

model = LinearRegression()
model.fit(X_train, y_train)

print("ALL NUMERIC FEATURES")

print("")
print("Training Data Stats")
lin_func(X_train, y_train)

print(" ")
print("Test Data Stats")
lin_func(X_test, y_test)

ALL NUMERIC FEATURES

Training Data Stats
Mean Squared Error: 1128130.693886775
Root Mean Squared Error: 1062.1349697127832
Mean Absolute Error: 673.1300421928227
R^2: 0.6366761862625078
 
Test Data Stats
Mean Squared Error: 1096955.8318152728
Root Mean Squared Error: 1047.3565924818886
Mean Absolute Error: 676.7836426562577
R^2: 0.6470565233380152


## Adding a new feature, has_description

In [135]:
features = ['bathrooms', 'bedrooms', 'latitude', 'longitude', 
            'interest_level', 'elevator', 'cats_allowed', 'hardwood_floors', 
            'dogs_allowed', 'doorman', 'dishwasher', 'no_fee', 'laundry_in_building', 
            'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
            'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool', 
            'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio', 
            'wheelchair_access', 'common_outdoor_space', 'has_description']

target = ['price']
X_train = train[features]
X_test = test[features]

y_train = train[target]
y_test = test[target]

model = LinearRegression()
model.fit(X_train, y_train)

print('ALL NUMERIC + HAS A DESCRIPTION')
print("")
print("Training Data Stats")
lin_func(X_train, y_train)

print(" ")
print("Test Data Stats")
lin_func(X_test, y_test)

ALL NUMERIC + HAS A DESCRIPTION

Training Data Stats
Mean Squared Error: 1128209.340477661
Root Mean Squared Error: 1062.1719919474722
Mean Absolute Error: 673.2867749623326
R^2: 0.6366508574779148
 
Test Data Stats
Mean Squared Error: 1096846.776683801
Root Mean Squared Error: 1047.3045291049787
Mean Absolute Error: 676.889002309455
R^2: 0.6470916116215479


## Adding a new feature, length of description

In [158]:
features = ['bathrooms', 'bedrooms', 'latitude', 'longitude', 
            'interest_level', 'elevator', 'cats_allowed', 'hardwood_floors', 
            'dogs_allowed', 'doorman', 'dishwasher', 'no_fee', 'laundry_in_building', 
            'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
            'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool', 
            'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio', 
            'wheelchair_access', 'common_outdoor_space', 'has_description', 'length']

target = ['price']
X_train = train[features]
X_test = test[features]

y_train = train[target]
y_test = test[target]

model = LinearRegression()
model.fit(X_train, y_train)

print('ALL NUMERIC + HAS A DESCRIPTION + DESCRIPTION LENGTH')
print("")

print("Training Data Stats")
lin_func(X_train, y_train)

print(" ")
print("Test Data Stats")
lin_func(X_test, y_test)

ALL NUMERIC + HAS A DESCRIPTION + DESCRIPTION LENGTH

Training Data Stats
Mean Squared Error: 1127768.8304615894
Root Mean Squared Error: 1061.964608855488
Mean Absolute Error: 673.118239540206
R^2: 0.6367927273693166
 
Test Data Stats
Mean Squared Error: 1097403.7877360764
Root Mean Squared Error: 1047.5704213732251
Mean Absolute Error: 676.8045073359533
R^2: 0.6469123943626325
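
The three experiments above repeat the same fit-and-score steps with different feature lists. One way to cut the repetition, sketched here under stated assumptions, is a helper that fits a fresh model per feature list and returns the test MAE. The helper names (`score_feature_set`, `make_demo`) and the synthetic data are hypothetical; only `LinearRegression` and `mean_absolute_error` come from scikit-learn as used in this notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def score_feature_set(features, train_df, test_df, target='price'):
    """Fit on train_df with the given features and return the MAE on test_df."""
    model = LinearRegression()
    model.fit(train_df[features], train_df[target])
    return mean_absolute_error(test_df[target], model.predict(test_df[features]))

def make_demo(beds, baths):
    """Build a synthetic frame where price is an exact linear function."""
    frame = pd.DataFrame({'bedrooms': beds, 'bathrooms': baths})
    frame['price'] = 1000 + 500 * frame['bedrooms'] + 200 * frame['bathrooms']
    return frame

train_demo = make_demo([0, 1, 2, 3, 1, 2, 3, 0], [1, 1, 1, 2, 2, 1, 2, 1])
test_demo = make_demo([1, 2, 3, 0], [1, 2, 1, 2])

feature_sets = {
    'beds only': ['bedrooms'],
    'beds + baths': ['bedrooms', 'bathrooms'],
}
results = {name: score_feature_set(cols, train_demo, test_demo)
           for name, cols in feature_sets.items()}
best = min(results, key=results.get)  # feature set with the lowest test MAE

print(results)
print('Best:', best)
```

Because the synthetic price depends on both columns, the two-feature model recovers it almost exactly while the one-feature model cannot.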
