<a href="https://colab.research.google.com/github/LambdaTheda/DS-Unit-2-Linear-Models/blob/master/pt5_F_mar20_REGRSS2_unit2_spr1_mod2_212_assmnt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
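
A few of the ideas above can be sketched on a toy DataFrame. The column names mirror the renthop data (`bedrooms`, `bathrooms`, `cats_allowed`, `dogs_allowed`, and 0/1 perk columns), but the rows, the `perk_cols` list, and the new feature names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy listings; values are invented for illustration only.
toy = pd.DataFrame({
    'bedrooms':     [1, 2, 0],
    'bathrooms':    [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 0],
    'dogs_allowed': [1, 1, 0],
    'elevator':     [1, 0, 0],
    'dishwasher':   [1, 1, 0],
})

# Hypothetical list of 0/1 perk columns to count.
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'dishwasher']

toy['perk_count'] = toy[perk_cols].sum(axis=1)
toy['cats_or_dogs'] = ((toy['cats_allowed'] == 1) | (toy['dogs_allowed'] == 1)).astype(int)
toy['cats_and_dogs'] = ((toy['cats_allowed'] == 1) & (toy['dogs_allowed'] == 1)).astype(int)
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

# Listings with zero bathrooms would divide by zero, so treat 0 as missing.
toy['beds_per_bath'] = toy['bedrooms'] / toy['bathrooms'].replace(0, np.nan)

print(toy[['perk_count', 'cats_or_dogs', 'cats_and_dogs', 'total_rooms', 'beds_per_bath']])
```

Storing the pet flags as 0/1 integers rather than booleans or strings keeps them ready to pass to a linear model.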

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, and Chapter 3.2, Multiple Linear Regression.
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views).
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Show every row when displaying a DataFrame
pd.set_option('display.max_rows', None)

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
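
The trimming step above keeps rows that fall between a lower and an upper percentile. A minimal sketch on made-up numbers (the `prices` Series is hypothetical) shows the same idea with `Series.quantile` and `Series.between`, which is inclusive on both ends by default:

```python
import pandas as pd

# 1..1000 as stand-in prices; invented data, not the renthop column.
prices = pd.Series(range(1, 1001), name='price')

# Cut points at the 0.5th and 99.5th percentiles.
low, high = prices.quantile([0.005, 0.995])

# Keep only the values inside the cut points.
trimmed = prices[prices.between(low, high)]

print(low, high, len(trimmed))
```

Cutting 0.5% from each tail removes roughly 1% of the rows in total, which matches the "most extreme 1% prices" wording in the comment.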

In [0]:
df.head(200)

In [0]:
df

In [0]:
df.shape

In [0]:
# Train/test split. Use data from April & May 2016 to train. 

df['created'] = pd.to_datetime(df['created'])
train = df[(df['created'].dt.year == 2016) & (df['created'].dt.month.isin([4, 5]))]
train.shape


In [0]:
# Explore the column to use for the train/test split.
# The first positional parameter of Series.value_counts is `normalize`, so
# value_counts(48817) returned fractions below 1; call it without arguments for raw counts.
df['created'].value_counts()


In [0]:
df['created'].value_counts().head(300)
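
The first positional parameter of `Series.value_counts` is `normalize`, so any truthy argument, such as a row count, yields fractions instead of counts. A small sketch with made-up timestamps (the `created` Series here is hypothetical) contrasts the two modes and counts rows per month, which is the view that matters for the split:

```python
import pandas as pd

# Invented timestamps standing in for the `created` column.
created = pd.to_datetime(pd.Series([
    '2016-04-01 10:00:00', '2016-04-15 09:30:00',
    '2016-05-02 12:00:00', '2016-06-20 08:00:00',
]))

months = created.dt.month

counts = months.value_counts()                 # raw counts per month
shares = months.value_counts(normalize=True)   # fractions that sum to 1

print(counts.sort_index())
print(shares.sort_index())
```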

In [0]:
# explore column to use for train/test split
df['created'].describe()

In [0]:
df['created'].shape

# Train/test split

In [0]:
# ATTEMPT 3: parse `created` as datetime, then extract only the year
df['created'] = pd.to_datetime(df['created'])  # dot notation also works: pd.to_datetime(df.created); from https://www.youtube.com/watch?v=yCgJGsg0Xa4
df['year'] = df['created'].dt.year
df['year']

In [0]:
#temp df for making train set from df with df['created'].dt.year == 2016
df_train_yr = df[(df['year'] == 2016)]
df_train_yr.head(200)


In [0]:
df_train_yr.nunique()

In [0]:
# df_train_yr.value_counts()  # AttributeError on older pandas: DataFrame.value_counts was added in pandas 1.1; before that, call value_counts() on a single column (a Series)


In [0]:
# Filter df_train_yr for April and May rows (works)
df_train_yr_and_months = df_train_yr[df_train_yr['created'].dt.month.isin([4, 5])]
df_train_yr_and_months.head(200)

In [0]:
df_train_yr_and_months.sample(300)

In [0]:
df_train = df_train_yr_and_months
df_train.head(300)

In [0]:
df_train.shape

# Filter df_train_yr to make the test set from June 2016 rows

In [0]:
# ATTEMPT 1: filter df_train for June 2016 to build the test set. Returns no rows, because df_train holds only April and May.
'''
df_test = df_train[(df_train['created'].dt.month == 6)]
df_test.head()
'''

In [0]:
df['created'] = pd.to_datetime(df['created'])
df['created'].dtype

In [0]:
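# ATTEMPT 2: AttributeError: 'str' object has no attribute 'dt', because .dt.month is called on the string 'created' (the closing bracket is misplaced) rather than on the column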
'''
df['created'] = pd.to_datetime(df['created'])

df['month'] = df['created'].dt.month

df_test = df_train_yr[df_train_yr['created'.dt.month] == 6] 
df_test
'''
 

In [0]:
# ATTEMPT 3: WORKS

df['created'] = pd.to_datetime(df['created'])
#df['created'].dtype            # RETURNS dtype('<M8[ns]')
df_test = df_train_yr[df_train_yr['created'].dt.month.isin([6])]
df_test.head(300)

In [0]:
df_test.shape

In [0]:
df_test['created'].value_counts()
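
The same split can be written in one pass with boolean masks on the parsed `created` column. This is a sketch on invented rows; the `listings` frame and its values are assumptions, and only the column name and the April/May versus June rule come from the assignment.

```python
import pandas as pd

# Invented rows standing in for the renthop listings.
listings = pd.DataFrame({
    'created': ['2016-04-03 01:00:00', '2016-05-20 02:00:00',
                '2016-06-08 03:00:00', '2016-06-24 04:00:00'],
    'price': [2500, 3100, 2800, 3000],
})
listings['created'] = pd.to_datetime(listings['created'])

is_2016 = listings['created'].dt.year == 2016
month = listings['created'].dt.month

# April and May go to train, June goes to test.
train = listings[is_2016 & month.isin([4, 5])]
test = listings[is_2016 & (month == 6)]

# Every row should land in exactly one of the two sets.
assert len(train) + len(test) == len(listings)
print(train.shape, test.shape)
```

Filtering both sets from the full frame avoids the empty result seen in Attempt 1, where June rows were looked up in a frame that only held April and May.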

# Engineer at least two new features. 

In [0]:
# 1) Does the apartment have a description?

# Option considered: loop over the column
# for descrp in range(len(df['description'])):

# Chris suggests apply()

# Use a regex to find the blank string that seems to represent a missing description
'''
import re 
no = re.compile('        ')
no    # RETURNS: re.compile(r'        ', re.UNICODE)
'''

re.compile(r'        ', re.UNICODE)

In [0]:
# set the new column to the dataframe column with all values EXCEPT '        '
df['has_description'] = ~df['description'].isin(['        '])

df['has_description'] = df['has_description'].map({True: 'Yes', False: 'No'})

df['has_description']

38874    Yes
40302    Yes
39067    Yes
34277    Yes
48618    Yes
37772    Yes
26186    Yes
43612    Yes
21470    Yes
17999    Yes
41152    Yes
26889    Yes
32154    Yes
6641     Yes
11051    Yes
10363    Yes
34925    Yes
4696     Yes
15310    Yes
37729    Yes
27670    Yes
26530    Yes
42044    Yes
6688     Yes
25352     No
17591    Yes
21462    Yes
36170    Yes
14425     No
47071    Yes
28986    Yes
43701    Yes
41654    Yes
4158     Yes
3075     Yes
47219    Yes
25627    Yes
39357    Yes
10693    Yes
18580    Yes
6598     Yes
25404    Yes
3141     Yes
35286     No
48836    Yes
27641    Yes
24424    Yes
42481    Yes
33673    Yes
21019    Yes
47009    Yes
43737    Yes
16126    Yes
45290    Yes
6418     Yes
21795    Yes
30783    Yes
14337    Yes
46785    Yes
30246    Yes
11426    Yes
4891     Yes
23430    Yes
877      Yes
22666    Yes
24917    Yes
15215    Yes
2443     Yes
28257    Yes
41788    Yes
5276     Yes
29055    Yes
5819     Yes
8475     Yes
32896    Yes
40037    Yes
43722    Yes
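
Matching one exact run of spaces misses descriptions that are missing (`NaN`) or blank with a different amount of whitespace. A more defensive sketch, on made-up rows, strips whitespace first and stores the result as 0/1 so it can be passed to a linear model directly. The `description_length` column is an extra feature name chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Invented descriptions covering text, blanks of two lengths, and NaN.
toy = pd.DataFrame({
    'description': ['Sunny one bedroom', '        ', np.nan, '  ', 'Great deal'],
})

# Treat missing values as empty strings, then strip surrounding whitespace.
cleaned = toy['description'].fillna('').str.strip()

toy['has_description'] = (cleaned != '').astype(int)
toy['description_length'] = cleaned.str.len()

print(toy)
```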

In [0]:
df.sample(200)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,year,total_rooms,has_description
28497,1.0,2,2016-05-30 04:58:13,,Greenpoint Avenue,40.7305,-73.9524,2462,181 Greenpoint Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2016,3.0,Yes
35127,2.0,2,2016-05-11 06:47:36,The finest rental residences in New York City ...,W 39th St.,40.7556,-73.9922,4995,330 W 39th St.,low,1,0,1,0,1,1,1,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,2016,4.0,Yes
37480,1.0,2,2016-05-20 02:27:10,GORGEOUS NO FEE GIGANTIC 2BDR WITH NEW RENOVAT...,W 144 Street,40.8257,-73.9502,2200,561 W 144 Street,low,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,3.0,Yes
32816,1.0,2,2016-05-16 02:28:15,This Two Bed has bedrooms on opposite sides of...,Second Avenue,40.7791,-73.9512,2799,1691 Second Avenue,low,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2016,3.0,Yes
9085,1.0,1,2016-04-09 05:46:29,Great deal for this good size one bedroom apar...,West 104th Street,40.8005,-73.9704,2500,314 West 104th Street,low,0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,2.0,Yes
28619,1.0,1,2016-05-20 02:43:20,This building is a luxury full service doorman...,W 64 St.,40.7716,-73.9812,3695,20 W 64 St.,medium,1,0,1,0,1,1,1,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,2016,2.0,Yes
41776,1.0,0,2016-04-14 06:22:45,Fantastic studio apartment in the Upper East S...,E 73rd St.,40.7704,-73.9599,1975,201 E 73rd St.,low,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,1.0,Yes
49054,2.0,4,2016-04-14 21:24:28,A FABULOUS 4BR IN WASHINGTON HEIGHTS!\r\r PERF...,WASHINGTON HEIGHTS 4BR! WHAT A DEAL!,40.847,-73.9382,2850,W 176 & BROADWAY,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,6.0,Yes
11479,2.0,2,2016-06-08 01:34:49,Over 1400sf massive 2 bedroom 2 bath home with...,East 66th Street,40.7661,-73.9636,9500,165 East 66th Street,low,0,1,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,4.0,Yes
34140,2.0,2,2016-06-15 06:08:51,To view this apartment contact Edy Gar...,West End Avenue,40.7752,-73.9887,6300,101 West End Avenue,low,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,4.0,Yes


In [0]:
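# ATTEMPT: use findall() with the compiled pattern. Fails because `no` is a regex object, not a Series attribute; the pandas string accessor would be df['description'].str.findall(...)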
'''
no.findall('        ')
print(df['description'].no.findall('        ')) # RETURNS: AttributeError: 'Series' object has no attribute 'no'
'''

In [0]:
# 2) Total number of rooms (beds + baths)
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,year,total_rooms
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,4.5
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,3.0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,2.0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,2.0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016,5.0


In [0]:
df.loc[1, 'description'] # RETURNS: '        '

'        '

In [0]:
df['description'].dtype # RETURNS: dtype('O'), the object dtype pandas uses for Python strings

dtype('O')
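
The remaining checklist items (fit a linear regression, read its coefficients and intercept, and report MAE, RMSE, and $R^2$ on train and test) are not reached in the cells above. A minimal sketch with scikit-learn follows. It uses synthetic data in place of `df_train` and `df_test`, and the feature names, the `make_fake_listings` and `report` helpers, and the generated prices are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)

def make_fake_listings(n):
    """Synthetic stand-in for the renthop rows; not real data."""
    frame = pd.DataFrame({
        'total_rooms': rng.integers(1, 7, size=n).astype(float),
        'has_description': rng.integers(0, 2, size=n),
    })
    noise = rng.normal(0, 50, size=n)
    frame['price'] = (1000 + 800 * frame['total_rooms']
                      + 100 * frame['has_description'] + noise)
    return frame

fake_train = make_fake_listings(200)
fake_test = make_fake_listings(100)

features = ['total_rooms', 'has_description']
target = 'price'

# Fit on the training split only.
model = LinearRegression()
model.fit(fake_train[features], fake_train[target])

print('Intercept:', model.intercept_)
for name, coef in zip(features, model.coef_):
    print(f'{name}: {coef:.2f}')

def report(label, frame):
    """Print and return MAE, RMSE, and R^2 for one data split."""
    y_true = frame[target]
    y_pred = model.predict(frame[features])
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    print(f'{label}: MAE={mae:.2f}  RMSE={rmse:.2f}  R^2={r2:.3f}')
    return mae, rmse, r2

train_scores = report('Train', fake_train)
test_scores = report('Test', fake_test)
```

With the real data, `fake_train` and `fake_test` would be replaced by the April/May and June frames, and any engineered feature would need to be numeric (for example 0/1 rather than 'Yes'/'No') before fitting.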