<a href="https://colab.research.google.com/github/arewelearningyet/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/module2-regression-2/LS_DS_212_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
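
The split and metric items in the checklist can be sketched on a tiny made-up frame. The column names `created` and `price` follow the dataset. The toy values and the helper name `regression_metrics` are assumptions for illustration only. The metrics are written out in NumPy so the definitions are visible; `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics` should give the same numbers.

```python
import numpy as np
import pandas as pd

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 as a dict (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        'rmse': float(np.sqrt(np.mean(errors ** 2))),
        'mae': float(np.mean(np.abs(errors))),
        'r2': float(1 - ss_res / ss_tot),
    }

# Toy stand-in for the listings data (values are made up)
toy = pd.DataFrame({
    'created': pd.to_datetime(['2016-04-03', '2016-05-10', '2016-05-28',
                               '2016-06-02', '2016-06-20']),
    'price': [2000, 3000, 4000, 2500, 3500],
})

# Time-based split: April & May to train, June to test
train_toy = toy[toy['created'].dt.month < 6]
test_toy = toy[toy['created'].dt.month == 6]

# Mean baseline: always predict the average training price
baseline = train_toy['price'].mean()
baseline_pred = np.full(len(test_toy), baseline)
metrics = regression_metrics(test_toy['price'], baseline_pred)
print(metrics)
```

Scoring a mean baseline first gives a reference point: a fitted model is only useful if its test MAE beats this number.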


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
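
A few of these ideas, sketched on a three-row toy frame. The input column names (`description`, `cats_allowed`, `dogs_allowed`, `bedrooms`, `bathrooms`) follow the dataset. The toy values and the new feature names are assumptions, not part of the assignment.

```python
import pandas as pd

# Toy stand-in for the listings data (values are made up)
toy = pd.DataFrame({
    'description': ['Sunny one bedroom', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'bedrooms': [1, 2, 0],
    'bathrooms': [1.0, 1.5, 1.0],
})

# Treat missing or whitespace-only descriptions as empty
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Bitwise OR / AND work as logical operators on 0/1 columns
toy['cats_or_dogs'] = (toy['cats_allowed'] | toy['dogs_allowed']).astype(int)
toy['cats_and_dogs'] = (toy['cats_allowed'] & toy['dogs_allowed']).astype(int)

# Total number of rooms (beds + baths)
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

print(toy[['has_description', 'description_length',
           'cats_or_dogs', 'cats_and_dogs', 'total_rooms']])
```

A beds-to-baths ratio would need extra care, since some listings report zero bathrooms and a plain division would produce infinite values.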

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, and Chapter 3.2, Multiple Linear Regression.
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views).
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [0]:
df=df.fillna({'description':'nodescription'})
df=df.fillna({'street_address':'noaddress'})

In [4]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
df.describe()

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
count,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0
mean,1.201794,1.537149,40.75076,-73.97276,3579.585247,0.524838,0.478276,0.478276,0.447631,0.424852,0.415081,0.367085,0.052769,0.268452,0.185653,0.175902,0.132761,0.138394,0.102833,0.087203,0.060471,0.055206,0.051908,0.046193,0.043305,0.042711,0.039331,0.027224,0.026241
std,0.470711,1.106087,0.038954,0.028883,1762.430772,0.499388,0.499533,0.499533,0.497255,0.494326,0.492741,0.482015,0.223573,0.443158,0.38883,0.380741,0.33932,0.345317,0.303744,0.282136,0.238359,0.228385,0.221844,0.209905,0.203544,0.202206,0.194382,0.162738,0.159852
min,0.0,0.0,40.5757,-74.0873,1375.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,40.7283,-73.9918,2500.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,40.7517,-73.978,3150.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,2.0,40.774,-73.955,4095.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,10.0,8.0,40.9894,-73.7001,15500.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
df.dtypes

bathrooms               float64
bedrooms                  int64
created                  object
description              object
display_address          object
latitude                float64
longitude               float64
price                     int64
street_address           object
interest_level           object
elevator                  int64
cats_allowed              int64
hardwood_floors           int64
dogs_allowed              int64
doorman                   int64
dishwasher                int64
no_fee                    int64
laundry_in_building       int64
fitness_center            int64
pre-war                   int64
laundry_in_unit           int64
roof_deck                 int64
outdoor_space             int64
dining_room               int64
high_speed_internet       int64
balcony                   int64
swimming_pool             int64
new_construction          int64
terrace                   int64
exclusive                 int64
loft                      int64
garden_p

In [0]:
# Parse listing timestamps; infer_datetime_format is deprecated in pandas 2.x
# and the format is inferred by default
df['created'] = pd.to_datetime(df['created'])

In [8]:
# Ratio of June (test) rows to April-May (train) rows
16973/31844

0.5330046476573295

In [0]:
import plotly.express as px

In [10]:
# Engineer two new features
# 1 of 2: flag is 1 only when both cats and dogs are allowed
df['cats&dogs'] = (df['cats_allowed'] & df['dogs_allowed']).astype('int32')
df.head(10)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,cats&dogs
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
7,2.0,1,2016-04-13 06:01:42,"This huge sunny ,plenty of lights 1 bed/2 bath...",West 21st Street,40.7427,-73.9957,5645,155 West 21st Street,low,1,0,1,0,1,1,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0
8,1.0,1,2016-04-20 02:36:35,<p><a website_redacted,Hamilton Terrace,40.8234,-73.9457,1725,63 Hamilton Terrace,medium,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
9,2.0,4,2016-04-02 02:58:15,This is a spacious four bedroom with every bed...,522 E 11th,40.7278,-73.9808,5800,522 E 11th,low,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [11]:
# Engineer two new features
# 2 of 2: total number of perks per listing
perks = ['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
         'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center', 'pre-war',
         'laundry_in_unit', 'roof_deck', 'outdoor_space', 'dining_room',
         'high_speed_internet', 'balcony', 'swimming_pool', 'new_construction',
         'terrace', 'exclusive', 'loft', 'garden_patio', 'wheelchair_access',
         'common_outdoor_space']
df['totalfeat'] = df[perks].sum(axis=1)
df['totalfeat']

0        0
1        5
2        3
3        2
4        1
        ..
49347    5
49348    9
49349    5
49350    5
49351    1
Name: totalfeat, Length: 48817, dtype: int64

In [12]:
df.totalfeat.describe()

count    48817.000000
mean         4.672594
std          3.414865
min          0.000000
25%          2.000000
50%          4.000000
75%          7.000000
max         19.000000
Name: totalfeat, dtype: float64

In [0]:
df['ave']=df['street_address'].str.contains('venue')
df['Ave']=df['street_address'].str.contains('Ave')
df['street']=df['street_address'].str.contains('treet')
df['noadd']=df['street_address'].str.contains('noaddress')
df['east']=df['street_address'].str.contains('ast')
df['west']=df['street_address'].str.contains('est')
df['East']=df['street_address'].str.contains('E ')

In [0]:
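# Note: str.contains treats each pattern as a regular expression, so a trailing
# '*' means "zero or more of the preceding character" (e.g. "landscap*" also
# matches "landsca"); it is not a shell-style wildcard.
df['landsc']=df['description'].str.contains("landscap*")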
df['wine']=df['description'].str.contains("wine")
df['beaut']=df['description'].str.contains("beaut*")
df['lux']=df['description'].str.contains("luxur*")
df['motor']=df['description'].str.contains("motor")
df['estate']=df['description'].str.contains("estate")
df['mag']=df['description'].str.contains("magnificent")
df['powder']=df['description'].str.contains("powder")
df['parlor']=df['description'].str.contains("parlor")
df['theat']=df['description'].str.contains("theat*")
df['basketball']=df['description'].str.contains("basketball")
df['sustain']=df['description'].str.contains('sustainable')
df['location']=df['description'].str.contains('location')
df['pergola']=df['description'].str.contains('pergola')
df['crystal']=df['description'].str.contains('crystal')
df['entertain']=df['description'].str.contains('entertaining')
df['chand']=df['description'].str.contains('chandelier')
df['foyer']=df['description'].str.contains('foyer')
df['scraped']=df['description'].str.contains('scraped')
df['surveil']=df['description'].str.contains('surveil*')
df['secu']=df['description'].str.contains('security')
df['view']=df['description'].str.contains('view')
df['ebony']=df['description'].str.contains('ebony')
df['lava']=df['description'].str.contains('lava')
df['marble']=df['description'].str.contains('marble')
df['brazilian']=df['description'].str.contains('brazilian')
df['granite']=df['description'].str.contains("granite")

df['nodescr']=df['description'].str.contains('nodescription')
df['asis']=df.description.str.contains('as is')
df['asiss']=df.description.str.contains('asis')
df['cozy']=df.description.str.contains('cozy')
df['effic']=df.description.str.contains('efficien*')
df['needs']=df.description.str.contains('needs')
df['TLC']=df.description.str.contains('TLC')
df['potent']=df.description.str.contains('potential')
df['fixer']=df.description.str.contains('fixer')
df['handy']=df.description.str.contains('handy')
df['bones']=df.description.str.contains('bones')
df['diy']=df.description.str.contains('DIY')
df['lift']=df.description.str.contains('lift')
df['coin']=df.description.str.contains('coin')
df['top']=df.description.str.contains('top')
df['village']=df.description.str.contains('illage')
df['brownstone']=df.description.str.contains('brownstone')
df['elite']=df.description.str.contains('elite')
df['steal']=df.description.str.contains('steal')
df['deal']=df.description.str.contains('deal')
df['must']=df.description.str.contains('must')

In [229]:
df.must.value_counts(dropna=False)

False    47909
True       908
Name: must, dtype: int64

In [0]:
# Convert the boolean keyword flags to 0/1 integers
flag_cols = ['village', 'top', 'coin', 'lift', 'diy', 'bones', 'handy', 'fixer',
             'potent', 'TLC', 'asiss', 'needs', 'effic', 'cozy', 'asis', 'nodescr',
             'beaut', 'landsc', 'marble', 'granite', 'wine', 'lux', 'motor', 'estate',
             'mag', 'powder', 'parlor', 'theat', 'basketball', 'pergola', 'view',
             'secu', 'surveil', 'scraped', 'foyer', 'chand', 'entertain', 'crystal',
             'location', 'sustain', 'brazilian', 'ebony', 'lava', 'brownstone', 'elite',
             'East', 'Ave', 'must', 'deal', 'steal']
df[flag_cols] = df[flag_cols].astype(int)

In [0]:
# Convert the remaining address flags from boolean to 0/1 integers
addr_cols = ['ave', 'street', 'noadd', 'east', 'west']
df[addr_cols] = df[addr_cols].astype(int)

In [0]:
df['luxseo']=df.landsc + df.east + df.East + df.elite + df.location + df.granite + df.wine + df.lux + df.motor + df.estate + df.mag + df.powder + df.parlor + df.theat + df.basketball + df.marble + df.pergola + df.view + df.secu + df.surveil + df.foyer + df.entertain + df.chand + df.crystal + df.sustain + df.brazilian + df.lava + df.beaut + df.ebony
df['redflags']=0 - df.steal - df.deal - df.brownstone - df.nodescr - df.asis - df.asiss - df.cozy - df.noadd - df.effic - df.needs - df.TLC - df.coin - df.potent - df.fixer - df.handy - df.bones - df.diy - df.lift

In [188]:
df.estate.describe()

count    48817.000000
mean         0.117070
std          0.321507
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: estate, dtype: float64

In [33]:
df.luxseo.value_counts(dropna=False)

0    17460
1    14351
2     8837
3     4856
4     2374
5      703
6      196
7       30
8       10
Name: luxseo, dtype: int64

In [34]:
df.redflags.value_counts(dropna=False)

 0    45335
-1     3422
-2       59
-3        1
Name: redflags, dtype: int64

In [0]:
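# Encode interest level as an ordinal integer; assign the result back rather
# than relying on an in-place replace through attribute access
df['interest_level'] = df['interest_level'].replace({'high': 3, 'medium': 2, 'low': 1})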

In [22]:
from shapely.geometry import Point
df['latlon']=[Point(xy) for xy in zip(df['longitude'], df['latitude'])]
df.head(10)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,cats&dogs,totalfeat,ave,street,noadd,east,...,estate,mag,powder,parlor,theat,basketball,sustain,location,pergola,crystal,entertain,chand,foyer,scraped,surveil,secu,view,ebony,lava,marble,brazilian,granite,nodescr,asis,asiss,cozy,effic,needs,TLC,potent,fixer,handy,bones,diy,lift,coin,top,village,luxseo,latlon
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,POINT (-73.9425 40.7145)
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,5,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,POINT (-73.9667 40.7947)
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,3,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,1,0,0,...,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,6,POINT (-74.0018 40.7388)
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,POINT (-73.96769999999999 40.7539)
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,POINT (-73.94929999999999 40.8241)
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,POINT (-74.00279999999999 40.7429)
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,POINT (-73.96600000000002 40.8012)
7,2.0,1,2016-04-13 06:01:42,"This huge sunny ,plenty of lights 1 bed/2 bath...",West 21st Street,40.7427,-73.9957,5645,155 West 21st Street,1,1,0,1,0,1,1,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,8,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,POINT (-73.9957 40.7427)
8,1.0,1,2016-04-20 02:36:35,<p><a website_redacted,Hamilton Terrace,40.8234,-73.9457,1725,63 Hamilton Terrace,2,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,4,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,POINT (-73.9457 40.8234)
9,2.0,4,2016-04-02 02:58:15,This is a spacious four bedroom with every bed...,522 E 11th,40.7278,-73.9808,5800,522 E 11th,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,POINT (-73.9808 40.7278)


In [0]:
#import seaborn as sns
#sns.pairplot(df)

In [0]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [232]:
# Train/test split: April & May 2016 to train, June 2016 to test
train = df[df['created'].dt.month < 6]
test = df[df['created'].dt.month == 6]
train.shape, test.shape

((31844, 94), (16973, 94))

In [233]:
target='price'
y_train = train[target]
y_test = test[target]

# derive a mean baseline
guess=y_train.mean()

y_pred= [guess] * len(y_train)
(y_pred-y_train)

2         725.604007
3         300.604007
4         225.604007
5       -4419.395993
6         -24.395993
            ...     
49346    -924.395993
49348    -374.395993
49349     980.604007
49350     225.604007
49351    1375.604007
Name: price, Length: 31844, dtype: float64

In [234]:
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error (listings before June): ${mae:,.0f}')

Train Error (listings before June): $1,202


In [235]:
# Test error of the mean baseline
y_pred= [guess] * len(y_test)
mae= mean_absolute_error(y_test, y_pred)
print(f'Test Error (June listings): ${mae:,.0f}')

Test Error (June listings): $1,198


In [0]:
# import estimator
from sklearn.linear_model import LinearRegression
# instantiate
model=LinearRegression()
# train on at least two features
features=['longitude','bathrooms','noadd','wine','bedrooms','interest_level','coin','nodescr','village','East','steal','landsc','lux','east','elite','must','brownstone','hardwood_floors','deal','top','ave','street','view','totalfeat','Ave','redflags']

X_train=train[features]
X_test=test[features]

In [247]:
#fit model
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [248]:
y_pred=model.predict(X_train)
mae=mean_absolute_error(y_train, y_pred)
mse=mean_squared_error(y_train, y_pred)
rmse = np.sqrt(mse)
r2=r2_score(y_train, y_pred)
print(f'train error: ${mae:,.2f}')
print(f'mean squared error: {mse:,}')
print(f'root mse: {rmse:,}')
print(f'r2: {r2}')

train error: $685.74
mean squared error: 1,173,388.9790861148
root mse: 1,083.2308060086339
r2: 0.6221003814635173


In [249]:
y_pred=model.predict(X_test)
mae=mean_absolute_error(y_test, y_pred)
mse=mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2=r2_score(y_test, y_pred)
print(f'test error: ${mae:,.2f}')
print(f'mean squared error: {mse:,}')
print(f'root mse: {rmse:,}')
print(f'r2: {r2}')

test error: $685.62
mean squared error: 1,137,749.3779350792
root mse: 1,066.6533541573285
r2: 0.6339312765638863


In [250]:
# Get the model's intercept and coefficients
model.intercept_, model.coef_

(-980739.1328737703,
 array([-13269.82973259,   1796.94486792,   -563.59869202,    487.97825274,
           466.75992696,   -450.5550276 ,   -404.12543918,    318.31992071,
           299.0527274 ,    283.87305098,   -251.31413884,    254.97157237,
           187.10951149,    201.09628725,   -150.40763076,    146.09945664,
          -147.07970661,   -136.98313036,    -99.45772999,     99.43084395,
            86.86320677,     85.73232011,     67.91239425,     59.4932971 ,
            53.64950069,     27.67337298]))

In [251]:
coefficients=pd.DataFrame(data=model.coef_, index=features)
coefficients

Unnamed: 0,0
longitude,-13269.829733
bathrooms,1796.944868
noadd,-563.598692
wine,487.978253
bedrooms,466.759927
interest_level,-450.555028
coin,-404.125439
nodescr,318.319921
village,299.052727
East,283.873051


In [0]:
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def squared_errors(df, feature, target, m, b):
    """
    Visualize linear regression, with squared errors,
    in 2D: 1 feature + 1 target.
    
    Use the m & b parameters to "fit the model" manually.
    
    df : Pandas DataFrame
    feature : string, feature column in df
    target : string, target column in df
    m : numeric, slope for linear equation
    b : numeric, intercept for linear equation
    """
    
    # Plot data
    fig = plt.figure(figsize=(7,7))
    ax = plt.axes()
    df.plot.scatter(feature, target, ax=ax)
    
    # Make predictions
    x = df[feature]
    y = df[target]
    y_pred = m*x + b
    
    # Plot predictions
    ax.plot(x, y_pred)
    
    # Plot squared errors
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    scale = (xmax-xmin)/(ymax-ymin)
    for x, y1, y2 in zip(x, y, y_pred):
        bottom_left = (x, min(y1, y2))
        height = abs(y1 - y2)
        width = height * scale
        ax.add_patch(Rectangle(xy=bottom_left, width=width, height=height, alpha=0.1))
    
    # Print regression metrics
    mse = mean_squared_error(y, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y, y_pred)
    r2 = r2_score(y, y_pred)
    print('Mean Squared Error:', mse)
    print('Root Mean Squared Error:', rmse)
    print('Mean Absolute Error:', mae)
    print('R^2:', r2)

In [0]:
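# Get regression metrics RMSE, MAE, and R^2 for train data, using a one-feature fit.
# Assumption: `feature` and `beta1` were never defined above, so pick one numeric
# column and estimate the slope and intercept by least squares.
feature = 'bedrooms'
beta1, beta0 = np.polyfit(train[feature], y_train, deg=1)
squared_errors(train, feature, target, m=beta1, b=beta0)

In [0]:
# Get the same metrics for test data, reusing the line fitted on train
squared_errors(test, feature, target, m=beta1, b=beta0)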