<a href="https://colab.research.google.com/github/jcs-lambda/DS-Unit-2-Linear-Models/blob/master/module2-regression-2/LS_DS_212_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install -U category_encoders pandas-profiling

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

# my work

In [3]:
df

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49347,1.0,2,2016-06-02 05:41:05,"30TH/3RD, MASSIVE CONV 2BR IN LUXURY FULL SERV...",E 30 St,40.7426,-73.9790,3200,230 E 30 St,medium,1,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
49348,1.0,1,2016-04-04 18:22:34,"HIGH END condo finishes, swimming pool, and ki...",Rector Pl,40.7102,-74.0163,3950,225 Rector Place,low,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1
49349,1.0,1,2016-04-16 02:13:40,Large Renovated One Bedroom Apartment with Sta...,West 45th Street,40.7601,-73.9900,2595,341 West 45th Street,low,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
49350,1.0,0,2016-04-08 02:13:33,Stylishly sleek studio apartment with unsurpas...,Wall Street,40.7066,-74.0101,3350,37 Wall Street,low,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
df.dtypes

bathrooms               float64
bedrooms                  int64
created                  object
description              object
display_address          object
latitude                float64
longitude               float64
price                     int64
street_address           object
interest_level           object
elevator                  int64
cats_allowed              int64
hardwood_floors           int64
dogs_allowed              int64
doorman                   int64
dishwasher                int64
no_fee                    int64
laundry_in_building       int64
fitness_center            int64
pre-war                   int64
laundry_in_unit           int64
roof_deck                 int64
outdoor_space             int64
dining_room               int64
high_speed_internet       int64
balcony                   int64
swimming_pool             int64
new_construction          int64
terrace                   int64
exclusive                 int64
loft                      int64
garden_p

In [5]:
df.describe()

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
count,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0
mean,1.201794,1.537149,40.75076,-73.97276,3579.585247,0.524838,0.478276,0.478276,0.447631,0.424852,0.415081,0.367085,0.052769,0.268452,0.185653,0.175902,0.132761,0.138394,0.102833,0.087203,0.060471,0.055206,0.051908,0.046193,0.043305,0.042711,0.039331,0.027224,0.026241
std,0.470711,1.106087,0.038954,0.028883,1762.430772,0.499388,0.499533,0.499533,0.497255,0.494326,0.492741,0.482015,0.223573,0.443158,0.38883,0.380741,0.33932,0.345317,0.303744,0.282136,0.238359,0.228385,0.221844,0.209905,0.203544,0.202206,0.194382,0.162738,0.159852
min,0.0,0.0,40.5757,-74.0873,1375.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,40.7283,-73.9918,2500.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,40.7517,-73.978,3150.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,2.0,40.774,-73.955,4095.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,10.0,8.0,40.9894,-73.7001,15500.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
df.describe(exclude='number')

Unnamed: 0,created,description,display_address,street_address,interest_level
count,48817,47392.0,48684,48807,48817
unique,48148,37853.0,8674,15135,3
top,2016-04-08 01:14:27,,Broadway,3333 Broadway,low
freq,3,1627.0,435,174,33946


In [7]:
df.isnull().sum()

bathrooms                  0
bedrooms                   0
created                    0
description             1425
display_address          133
latitude                   0
longitude                  0
price                      0
street_address            10
interest_level             0
elevator                   0
cats_allowed               0
hardwood_floors            0
dogs_allowed               0
doorman                    0
dishwasher                 0
no_fee                     0
laundry_in_building        0
fitness_center             0
pre-war                    0
laundry_in_unit            0
roof_deck                  0
outdoor_space              0
dining_room                0
high_speed_internet        0
balcony                    0
swimming_pool              0
new_construction           0
terrace                    0
exclusive                  0
loft                       0
garden_patio               0
wheelchair_access          0
common_outdoor_space       0
dtype: int64

## Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.


In [8]:
created = pd.to_datetime(df['created'], errors='coerce', infer_datetime_format=True)
created.describe()

count                   48817
unique                  48148
top       2016-05-14 01:11:03
freq                        3
first     2016-04-01 22:12:41
last      2016-06-29 21:41:47
Name: created, dtype: object

In [9]:
created.isnull().sum()

0

In [10]:
created.dt.year.value_counts()

2016    48817
Name: created, dtype: int64

In [11]:
created.dt.month.value_counts()

6    16973
4    16217
5    15627
Name: created, dtype: int64

In [0]:
df['created'] = created

In [13]:
df_train = df[df['created'].dt.month != 6].copy()
df_test = df[df['created'].dt.month == 6].copy()
df_train.shape, df_test.shape, df_train.shape[0] + df_test.shape[0] == df.shape[0]

((31844, 34), (16973, 34), True)
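
Filtering on the month number works here because every row is from 2016. As a minimal sketch of an alternative, the split can be written against an explicit timestamp cutoff, which still behaves correctly if the data ever spans more than one year. The `listings` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the rental data; the values are invented for illustration.
listings = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-01 22:12:41', '2016-05-14 01:11:03',
        '2016-05-31 23:59:59', '2016-06-01 00:00:00',
        '2016-06-29 21:41:47',
    ]),
    'price': [3000, 2850, 3275, 3350, 5465],
})

# Rows created before the cutoff go to train, the rest go to test.
cutoff = pd.Timestamp('2016-06-01')
train = listings[listings['created'] < cutoff].copy()
test = listings[listings['created'] >= cutoff].copy()

print(len(train), len(test))
```

Splitting by time rather than at random keeps the test set strictly later than the training set, which matches how the model would be used on new listings.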

## Engineer at least two new features. (See below for explanation & ideas.)


### Does the apartment have a description?

In [14]:
df['description'].describe()

count        47392
unique       37853
top               
freq          1627
Name: description, dtype: object

In [15]:
df['description'].isnull().sum()

1425

In [16]:
(df['description'] == '').sum()

0

In [17]:
df['description']

0        A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...
1                                                         
2        Top Top West Village location, beautiful Pre-w...
3        Building Amenities - Garage - Garden - fitness...
4        Beautifully renovated 3 bedroom flex 4 bedroom...
                               ...                        
49347    30TH/3RD, MASSIVE CONV 2BR IN LUXURY FULL SERV...
49348    HIGH END condo finishes, swimming pool, and ki...
49349    Large Renovated One Bedroom Apartment with Sta...
49350    Stylishly sleek studio apartment with unsurpas...
49351    Look no further!!!  This giant 2 bedroom apart...
Name: description, Length: 48817, dtype: object

In [18]:
df['description'][1]

'        '

In [0]:
description = df['description']

In [20]:
description = description.str.lstrip().str.rstrip()
(description == '').sum()

1863

In [21]:
((description == '') | (description.isnull())).sum()

3288

In [22]:
1863 + 1425

3288

In [23]:
(description.fillna('') == '').sum()

3288

In [24]:
(description.fillna('') != '').sum()

45529

In [25]:
3288 + 45529, len(df)

(48817, 48817)
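
The checks above show that missing descriptions come in two forms: nulls and whitespace-only strings. A small sketch on made-up strings shows how `fillna('')` followed by `str.strip()` collapses both into the empty string, so that one comparison covers every case.

```python
import pandas as pd

# Toy descriptions: real text, whitespace only, missing, and already empty.
raw = pd.Series(['Sunny 2BR  ', '        ', None, ''])

# Replace nulls with '' and remove leading and trailing whitespace.
clean = raw.fillna('').str.strip()

has_description = clean != ''
description_length = clean.str.len()

print(has_description.tolist())
print(description_length.tolist())
```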

### How long is the description?

In [26]:
description = df['description'].fillna('').str.lstrip().str.rstrip()
description.str.len()

0         587
1           0
2         690
3         491
4         478
         ... 
49347     786
49348    1125
49349     670
49350     734
49351     798
Name: description, Length: 48817, dtype: int64

### How many total perks does each apartment have?

In [27]:
perk_features = ['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman', 'dishwasher',
                 'no_fee', 'laundry_in_building', 'fitness_center', 'pre-war', 'laundry_in_unit',
                 'roof_deck', 'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
                 'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
                 'wheelchair_access', 'common_outdoor_space'
                ]

df[perk_features]

Unnamed: 0,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49347,1,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
49348,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1
49349,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
49350,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [28]:
df[perk_features].isnull().any().any()

False

In [29]:
df[perk_features].sum(axis=1)

0        0
1        5
2        3
3        2
4        1
        ..
49347    5
49348    9
49349    5
49350    5
49351    1
Length: 48817, dtype: int64

### Are cats or dogs allowed?

In [30]:
pets_features = df.columns[(df.columns.str.contains('cat') | (df.columns.str.contains('dog')))].to_list()
pets_features

['cats_allowed', 'dogs_allowed']

In [31]:
df[pets_features].sum(axis=1).value_counts()

0    25433
2    21816
1     1568
dtype: int64

In [32]:
df[pets_features].sum(axis=1) > 0

0        False
1         True
2        False
3        False
4        False
         ...  
49347    False
49348     True
49349     True
49350     True
49351    False
Length: 48817, dtype: bool

In [33]:
(df[pets_features].sum(axis=1) > 0).sum()

23384

In [34]:
23384 + 25433 == len(df)

True

### Are cats and dogs allowed?

In [35]:
df[pets_features].sum(axis=1) == 2

0        False
1         True
2        False
3        False
4        False
         ...  
49347    False
49348     True
49349     True
49350     True
49351    False
Length: 48817, dtype: bool

In [36]:
(df[pets_features].sum(axis=1) == 2).sum()

21816
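
Summing the two pet columns and comparing the total works because both are 0/1 flags. An equivalent way to express the same two features, sketched here on a toy frame with all four combinations, is to use the boolean operators `|` and `&` directly.

```python
import pandas as pd

# Toy frame covering every combination of the two flags.
pets = pd.DataFrame({
    'cats_allowed': [0, 1, 1, 0],
    'dogs_allowed': [0, 0, 1, 1],
})

cats = pets['cats_allowed'].astype(bool)
dogs = pets['dogs_allowed'].astype(bool)

cats_or_dogs = cats | dogs    # at least one kind of pet allowed
cats_and_dogs = cats & dogs   # both kinds of pet allowed

# Same result as the sum-based version used in this notebook.
row_sum = pets.sum(axis=1)
print((cats_or_dogs == (row_sum > 0)).all(), (cats_and_dogs == (row_sum == 2)).all())
```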

### Total number of rooms (beds + baths)

In [37]:
rooms_features = ['bedrooms', 'bathrooms']
df[rooms_features].isnull().sum()

bedrooms     0
bathrooms    0
dtype: int64

In [38]:
df[rooms_features].dtypes

bedrooms       int64
bathrooms    float64
dtype: object

In [39]:
df[rooms_features].sum(axis=1)

0        4.5
1        3.0
2        2.0
3        2.0
4        5.0
        ... 
49347    3.0
49348    2.0
49349    2.0
49350    1.0
49351    3.0
Length: 48817, dtype: float64

### Ratio of beds to baths

In [40]:
bedrooms = df['bedrooms']
bathrooms = df['bathrooms']
(bedrooms == 0).sum(), (bathrooms == 0.0).sum()

(9317, 304)

In [41]:
bed_bath_ratio = bedrooms / bathrooms
bed_bath_ratio.isnull().sum()

151

In [42]:
bed_bath_ratio

0        2.0
1        2.0
2        1.0
3        1.0
4        4.0
        ... 
49347    2.0
49348    1.0
49349    1.0
49350    0.0
49351    2.0
Length: 48817, dtype: float64

In [43]:
bed_bath_ratio.sum()

inf

In [44]:
bed_bath_ratio.fillna(0.0).value_counts()

1.000000    19014
2.000000    12240
0.000000     9317
3.000000     3603
1.500000     2772
4.000000      366
1.333333      343
0.500000      206
0.666667      188
inf           153
2.500000      129
1.200000      128
0.800000       83
2.666667       61
1.600000       38
1.666667       35
1.250000       31
1.142857       27
0.857143       24
0.333333       14
0.750000       12
5.000000       10
0.888889        7
3.333333        5
1.428571        3
0.400000        2
2.400000        1
0.222222        1
6.000000        1
0.200000        1
2.333333        1
0.571429        1
dtype: int64

In [45]:
(bed_bath_ratio == np.inf).sum()

153

In [0]:
bed_bath_ratio.replace(np.inf, 0.0, inplace=True)

In [47]:
bed_bath_ratio.sum()

61628.513492063496

In [48]:
bed_bath_ratio.mean()

1.266356665681656

In [49]:
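# NaN never compares equal to anything, including itself, so this is always 0; use isnull() to count NaNs
(bed_bath_ratio == np.nan).sum()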

0

In [50]:
bed_bath_ratio.isnull().sum()

151

In [51]:
(df['bedrooms'] / df['bathrooms']).fillna(0.0).replace(np.inf, 0.0).value_counts()

1.000000    19014
2.000000    12240
0.000000     9470
3.000000     3603
1.500000     2772
4.000000      366
1.333333      343
0.500000      206
0.666667      188
2.500000      129
1.200000      128
0.800000       83
2.666667       61
1.600000       38
1.666667       35
1.250000       31
1.142857       27
0.857143       24
0.333333       14
0.750000       12
5.000000       10
0.888889        7
3.333333        5
1.428571        3
0.400000        2
2.400000        1
0.222222        1
6.000000        1
2.333333        1
0.571429        1
0.200000        1
dtype: int64
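
The counts above fit together: 304 listings have zero bathrooms, and dividing by zero produces either NaN (0 / 0) or inf (n / 0), which accounts for the 151 NaN and 153 inf values (151 + 153 = 304). The sketch below reproduces both cases on made-up numbers and cleans them in one chain.

```python
import numpy as np
import pandas as pd

# Toy values chosen to hit every case: normal, 0/0, n/0, and 0/n.
beds = pd.Series([2, 0, 3, 0])
baths = pd.Series([1.0, 0.0, 0.0, 1.0])

ratio = beds / baths

n_nan = ratio.isna().sum()        # rows where 0 / 0 produced NaN
n_inf = np.isinf(ratio).sum()     # rows where n / 0 produced inf

# Comparing against np.nan with == never matches, so this is always 0.
n_eq_nan = (ratio == np.nan).sum()

# Treat both kinds of undefined ratio as 0.0, as the notebook does.
cleaned = ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)
print(cleaned.tolist())
```

Mapping undefined ratios to 0.0 makes them indistinguishable from studios with a bathroom (0 beds / 1 bath), which is a modeling choice rather than a requirement.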

### What's the neighborhood, based on address or latitude & longitude?
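
This idea is not implemented in this notebook. One rough possibility, sketched here as an assumption rather than a tested approach, is to snap latitude and longitude to a coarse grid and treat each grid cell as a pseudo-neighborhood. The 0.01 degree cell size (roughly a kilometre north to south) and the `grid_cell` column name are arbitrary choices for illustration.

```python
import pandas as pd

# A few coordinates taken from the listings shown earlier in this notebook.
coords = pd.DataFrame({
    'latitude': [40.7145, 40.7947, 40.7066, 40.7128],
    'longitude': [-73.9425, -73.9667, -74.0101, -74.0059],
})

# Scale by 100 and round so each integer pair identifies a 0.01 degree cell.
lat_bin = (coords['latitude'] * 100).round().astype(int)
lon_bin = (coords['longitude'] * 100).round().astype(int)

# Hypothetical feature: a string label per grid cell.
coords['grid_cell'] = lat_bin.astype(str) + '_' + lon_bin.astype(str)
print(coords['grid_cell'].tolist())
```

The label is categorical, so it would need encoding (for example one-hot) before being passed to a linear model.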

### **Wrangle function**

In [0]:
def wrangle(dataframe):
  df = dataframe.copy()

  # fill nulls with empty strings and strip leading and trailing whitespace in feature 'description'
  df['description'] = df['description'].fillna('').str.strip()

  # add feature: 'has_description'
  df['has_description'] = df['description'] != ''

  # add feature: 'description_length'
  df['description_length'] = df['description'].str.len()

  # add feature: 'total_perks'
  perk_features = ['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman', 'dishwasher',
                  'no_fee', 'laundry_in_building', 'fitness_center', 'pre-war', 'laundry_in_unit',
                  'roof_deck', 'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
                  'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
                  'wheelchair_access', 'common_outdoor_space'
                  ]
  df['total_perks'] = df[perk_features].sum(axis=1)

  # add feature: 'cats_or_dogs_allowed'
  pets_features = ['cats_allowed', 'dogs_allowed']
  df['cats_or_dogs_allowed'] = df[pets_features].sum(axis=1) > 0

  # add feature: 'cats_and_dogs_allowed'
  df['cats_and_dogs_allowed'] = df[pets_features].sum(axis=1) == 2

  # add feature: 'total_rooms'
  rooms_features = ['bedrooms', 'bathrooms']
  df['total_rooms'] = df[rooms_features].sum(axis=1)

  # add feature: 'bed_bath_ratio'
  df['bed_bath_ratio'] = (df['bedrooms'] / df['bathrooms']).fillna(0.0).replace(np.inf, 0.0)

  return df

In [53]:
df_train, df_test = map(wrangle, [df_train, df_test])
df_train.shape, df_test.shape

((31844, 41), (16973, 41))

## Fit a linear regression model with at least two features.


In [54]:
df_train.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space',
       'has_description', 'description_length', 'total_perks',
       'cats_or_dogs_allowed', 'cats_and_dogs_allowed', 'total_rooms',
       'bed_bath_ratio'],
      dtype='object')

In [0]:
target = 'price'
features = ['bathrooms', 'bedrooms', 'latitude', 'longitude', 'interest_level', 'has_description', 
            'description_length', 'total_perks', 'cats_or_dogs_allowed', 'cats_and_dogs_allowed',
            'total_rooms', 'bed_bath_ratio'
           ]

X_train = df_train[features]
X_test = df_test[features]

y_train = df_train[target]
y_test = df_test[target]

In [56]:
X_train

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,interest_level,has_description,description_length,total_perks,cats_or_dogs_allowed,cats_and_dogs_allowed,total_rooms,bed_bath_ratio
2,1.0,1,40.7388,-74.0018,high,True,690,3,False,False,2.0,1.0
3,1.0,1,40.7539,-73.9677,low,True,491,2,False,False,2.0,1.0
4,1.0,4,40.8241,-73.9493,low,True,478,1,False,False,5.0,4.0
5,2.0,4,40.7429,-74.0028,medium,False,0,0,False,False,6.0,2.0
6,1.0,2,40.8012,-73.9660,low,True,578,3,True,True,3.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...
49346,1.0,1,40.7296,-73.9869,medium,True,394,5,True,False,2.0,1.0
49348,1.0,1,40.7102,-74.0163,low,True,1125,9,True,True,2.0,1.0
49349,1.0,1,40.7601,-73.9900,low,True,670,5,True,True,2.0,1.0
49350,1.0,0,40.7066,-74.0101,low,True,734,5,True,True,1.0,0.0


In [57]:
X_train.dtypes

bathrooms                float64
bedrooms                   int64
latitude                 float64
longitude                float64
interest_level            object
has_description             bool
description_length         int64
total_perks                int64
cats_or_dogs_allowed        bool
cats_and_dogs_allowed       bool
total_rooms              float64
bed_bath_ratio           float64
dtype: object

In [58]:
X_train['interest_level'].value_counts()

low       22053
medium     7381
high       2410
Name: interest_level, dtype: int64

In [59]:
X_train['interest_level'].isnull().sum()

0

In [0]:
from category_encoders import OrdinalEncoder

encoder = OrdinalEncoder()

X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

In [61]:
X_train_encoded.dtypes

bathrooms                float64
bedrooms                   int64
latitude                 float64
longitude                float64
interest_level             int64
has_description             bool
description_length         int64
total_perks                int64
cats_or_dogs_allowed        bool
cats_and_dogs_allowed       bool
total_rooms              float64
bed_bath_ratio           float64
dtype: object
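
The encoder turns `interest_level` into integers, but without an explicit mapping those integers are not guaranteed to follow the natural order low < medium < high. For a linear model, which fits one slope to the column, the order matters. A minimal sketch of an explicit mapping with plain pandas, on made-up values:

```python
import pandas as pd

# Toy column with the three levels seen in the data.
interest = pd.Series(['high', 'low', 'medium', 'low'])

# Explicit order chosen by hand so that larger numbers mean more interest.
order = {'low': 1, 'medium': 2, 'high': 3}
interest_encoded = interest.map(order)

print(interest_encoded.tolist())
```

Any value missing from `order` would become NaN under `map`, so a null check after encoding is a cheap safeguard.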

In [62]:
from sklearn.linear_model import LinearRegression

model = LinearRegression(n_jobs=-1)

model.fit(X_train_encoded, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)

## Get the model's coefficients and intercept.


In [63]:
print(f'Intercept: {model.intercept_}')
coefficients = pd.Series(data=model.coef_, index=features)
print(coefficients.to_string())

Intercept: -1170514.9809037691
bathrooms                 1014.206125
bedrooms                  -239.260640
latitude                  1895.069951
longitude               -14796.041656
interest_level            -137.942036
has_description           -494.382581
description_length           0.135316
total_perks                 48.727852
cats_or_dogs_allowed      -110.156182
cats_and_dogs_allowed       82.685499
total_rooms                774.945486
bed_bath_ratio            -118.410873
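
A prediction from this model is the intercept plus the sum of each coefficient times its feature value. The intercept is the prediction when every feature is 0, including latitude 0 and longitude 0, which is far from New York; that is why it is so large in magnitude. The sketch below checks the formula on made-up toy data (the `X`, `y`, and `model_toy` names are illustrative only).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two features and an exactly linear target.
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0], [5.0, 5.0]])
y = 500.0 + 1000.0 * X[:, 0] + 200.0 * X[:, 1]

model_toy = LinearRegression().fit(X, y)

# Rebuild the predictions by hand from the fitted parameters.
manual = model_toy.intercept_ + X @ model_toy.coef_

print(np.allclose(manual, model_toy.predict(X)))
```

One caution when reading the coefficients above: `total_rooms` is the exact sum of `bedrooms` and `bathrooms`, so those three columns are perfectly collinear and their individual coefficients are not uniquely determined, even though the predictions are.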


## Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.


In [64]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_train_encoded)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred))
mae_train = mean_absolute_error(y_train, y_pred)
r2_train = r2_score(y_train, y_pred)

y_pred = model.predict(X_test_encoded)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
mae_test = mean_absolute_error(y_test, y_pred)
r2_test = r2_score(y_test, y_pred)

data = {
    'RMSE' : [rmse_train, rmse_test],
    'MAE' : [mae_train, mae_test],
    'R2' : [r2_train, r2_test]
}

pd.DataFrame(data=data, index=['Train', 'Test'],)

Unnamed: 0,RMSE,MAE,R2
Train,1127.581565,721.363429,0.590522
Test,1110.502704,724.099146,0.603215
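
To make the three metrics concrete, here is a sketch that computes them by hand with NumPy on four made-up values. MAE and RMSE are in the units of the target (dollars here), while $R^2$ is unitless and compares the model's squared error with that of always predicting the mean.

```python
import numpy as np

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_hat - y_true

mae = np.mean(np.abs(errors))                    # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # root mean squared error

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mae, rmse, r2)
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, which is consistent with the table above.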


In [65]:
features

['bathrooms',
 'bedrooms',
 'latitude',
 'longitude',
 'interest_level',
 'has_description',
 'description_length',
 'total_perks',
 'cats_or_dogs_allowed',
 'cats_and_dogs_allowed',
 'total_rooms',
 'bed_bath_ratio']

## What's the best test MAE you can get?

In [0]:
df2 = df.copy()

### feature engineering

In [67]:
df2.dtypes

bathrooms                      float64
bedrooms                         int64
created                 datetime64[ns]
description                     object
display_address                 object
latitude                       float64
longitude                      float64
price                            int64
street_address                  object
interest_level                  object
elevator                         int64
cats_allowed                     int64
hardwood_floors                  int64
dogs_allowed                     int64
doorman                          int64
dishwasher                       int64
no_fee                           int64
laundry_in_building              int64
fitness_center                   int64
pre-war                          int64
laundry_in_unit                  int64
roof_deck                        int64
outdoor_space                    int64
dining_room                      int64
high_speed_internet              int64
balcony                  

In [68]:
df2.isnull().sum()

bathrooms                  0
bedrooms                   0
created                    0
description             1425
display_address          133
latitude                   0
longitude                  0
price                      0
street_address            10
interest_level             0
elevator                   0
cats_allowed               0
hardwood_floors            0
dogs_allowed               0
doorman                    0
dishwasher                 0
no_fee                     0
laundry_in_building        0
fitness_center             0
pre-war                    0
laundry_in_unit            0
roof_deck                  0
outdoor_space              0
dining_room                0
high_speed_internet        0
balcony                    0
swimming_pool              0
new_construction           0
terrace                    0
exclusive                  0
loft                       0
garden_patio               0
wheelchair_access          0
common_outdoor_space       0
dtype: int64

#### addresses

In [0]:
!pip install arcgis
from arcgis.gis import GIS
from arcgis.geocoding import reverse_geocode
gis = GIS()

In [70]:
address_features = ['display_address', 'street_address', 'longitude', 'latitude']
address = df2[address_features].copy()
address

Unnamed: 0,display_address,street_address,longitude,latitude
0,Metropolitan Avenue,792 Metropolitan Avenue,-73.9425,40.7145
1,Columbus Avenue,808 Columbus Avenue,-73.9667,40.7947
2,W 13 Street,241 W 13 Street,-74.0018,40.7388
3,East 49th Street,333 East 49th Street,-73.9677,40.7539
4,West 143rd Street,500 West 143rd Street,-73.9493,40.8241
...,...,...,...,...
49347,E 30 St,230 E 30 St,-73.9790,40.7426
49348,Rector Pl,225 Rector Place,-74.0163,40.7102
49349,West 45th Street,341 West 45th Street,-73.9900,40.7601
49350,Wall Street,37 Wall Street,-74.0101,40.7066


In [0]:
address['display_address'] = address['display_address'].str.lstrip().str.rstrip()
address['street_address'] = address['street_address'].str.lstrip().str.rstrip()

In [72]:
address.isnull().sum()

display_address    133
street_address      10
longitude            0
latitude             0
dtype: int64

In [73]:
cond_null_sa = address['street_address'].isnull()
address[cond_null_sa]

Unnamed: 0,display_address,street_address,longitude,latitude
18037,,,-74.0059,40.7128
18096,,,-74.0059,40.7128
24543,,,-74.0059,40.7128
26196,,,-74.0045,40.7411
34233,,,-74.0059,40.7128
42898,,,-73.9543,40.7447
42905,,,-73.9948,40.7589
43308,,,-74.0059,40.7128
46454,A FABULOUS 4BR IN GRAMERCY!! PERFECT APARTME...,,-74.0059,40.7128
47761,,,-73.9219,40.85


In [0]:
def long_lat_to_address(row):
  results = reverse_geocode([row['longitude'], row['latitude']])
  return results['address']['Address']

In [75]:
new_street_addresses = pd.DataFrame(address[address['street_address'].isnull()].apply(long_lat_to_address, axis=1), columns=['street_address'])
new_street_addresses

Unnamed: 0,street_address
18037,254 Broadway
18096,254 Broadway
24543,254 Broadway
26196,56 9th Ave
34233,254 Broadway
42898,546 47th Rd
42905,470 W 42nd St
43308,254 Broadway
46454,254 Broadway
47761,1655 Undercliff Ave


In [76]:
address.update(new_street_addresses)
address[cond_null_sa]

Unnamed: 0,display_address,street_address,longitude,latitude
18037,,254 Broadway,-74.0059,40.7128
18096,,254 Broadway,-74.0059,40.7128
24543,,254 Broadway,-74.0059,40.7128
26196,,56 9th Ave,-74.0045,40.7411
34233,,254 Broadway,-74.0059,40.7128
42898,,546 47th Rd,-73.9543,40.7447
42905,,470 W 42nd St,-73.9948,40.7589
43308,,254 Broadway,-74.0059,40.7128
46454,A FABULOUS 4BR IN GRAMERCY!! PERFECT APARTME...,254 Broadway,-74.0059,40.7128
47761,,1655 Undercliff Ave,-73.9219,40.85


In [77]:
condition_null_da = address['display_address'].isnull()
address[condition_null_da]

Unnamed: 0,display_address,street_address,longitude,latitude
244,,80 Madison Avenue,-73.9859,40.7443
1178,,602 W 146 st,-73.9271,40.8390
2308,,527 W 48th St #1RW,-73.9939,40.7642
2911,,5313 6th Ave,-74.0100,40.6416
3401,,3333 Broadway,-73.9569,40.8196
...,...,...,...,...
47761,,1655 Undercliff Ave,-73.9219,40.8500
48030,,185 E 3rd St,-73.9837,40.7231
48370,,328 E 14th St,-73.9839,40.7316
49079,,1073 1st avenue,-73.9627,40.7583


In [78]:
address[condition_null_da]['street_address'].str.extract('[0-9]+[0-9a-zA-Z]+ (.*)')

Unnamed: 0,0
244,Madison Avenue
1178,W 146 st
2308,W 48th St #1RW
2911,6th Ave
3401,Broadway
...,...
47761,Undercliff Ave
48030,E 3rd St
48370,E 14th St
49079,1st avenue


In [79]:
fixed_display_address = address[condition_null_da]['street_address'].str.extract('[0-9]+[0-9a-zA-Z]+ (.*)')
fixed_display_address.columns = ['display_address']
fixed_display_address

Unnamed: 0,display_address
244,Madison Avenue
1178,W 146 st
2308,W 48th St #1RW
2911,6th Ave
3401,Broadway
...,...
47761,Undercliff Ave
48030,E 3rd St
48370,E 14th St
49079,1st avenue


In [80]:
address.update(fixed_display_address)
address.isnull().sum()

display_address    3
street_address     0
longitude          0
latitude           0
dtype: int64
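
Three display addresses are still null because the pattern does not match every street address. One possible cause, shown here on made-up strings, is that `[0-9]+[0-9a-zA-Z]+` needs at least two characters before the space, so a single-digit house number or an address with no number fails to match. The looser pattern and the fallback to the full street address are assumptions for illustration, not what the notebook ran.

```python
import pandas as pd

# Made-up street addresses: one typical, one single-digit number, one with no number.
streets = pd.Series(['80 Madison Avenue', '5 Main St', 'Broadway'])

# Pattern used in this notebook.
original_pattern = r'[0-9]+[0-9a-zA-Z]+ (.*)'
extracted = streets.str.extract(original_pattern, expand=False)

# Hypothetical looser pattern: a leading number, optional suffix, then the street name.
loose_pattern = r'^\d+\S*\s+(.*)'
fallback = streets.str.extract(loose_pattern, expand=False).fillna(streets)

print(extracted.tolist())
print(fallback.tolist())
```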

#### **wrangle function**

In [0]:
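import importlib.util

# install arcgis only if the package cannot be found
# (sys.modules lists only modules already imported in this session)
if importlib.util.find_spec('arcgis') is None:
  !pip install arcgis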

from arcgis.gis import GIS
from arcgis.geocoding import reverse_geocode
gis=GIS()

# function to reverse geocode latitude and longitude into a street address
# parameter 'row' = a pandas dataframe row with columns names 'latitude' and 'longitude'
# https://developers.arcgis.com/python/guide/reverse-geocoding/
def reverse_geocode_row(row):
  location = [row['longitude'], row['latitude']]
  results = reverse_geocode(location)
  return results['address']['Address']

def wrangle2(dataframe):
  df = dataframe.copy()

  # fill nulls with empty strings and strip leading and trailing whitespace in feature 'description'
  df['description'] = df['description'].fillna('').str.strip()

  # add feature: 'has_description'
  df['has_description'] = df['description'] != ''

  # add feature: 'description_length'
  df['description_length'] = df['description'].str.len()

  # add feature: 'total_perks'
  perk_features = ['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman', 'dishwasher',
                  'no_fee', 'laundry_in_building', 'fitness_center', 'pre-war', 'laundry_in_unit',
                  'roof_deck', 'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
                  'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
                  'wheelchair_access', 'common_outdoor_space'
                  ]
  df['total_perks'] = df[perk_features].sum(axis=1)

  # add feature: 'cats_or_dogs_allowed'
  pets_features = ['cats_allowed', 'dogs_allowed']
  df['cats_or_dogs_allowed'] = df[pets_features].sum(axis=1) > 0

  # add feature: 'cats_and_dogs_allowed'
  df['cats_and_dogs_allowed'] = df[pets_features].sum(axis=1) == 2

  # add feature: 'total_rooms'
  rooms_features = ['bedrooms', 'bathrooms']
  df['total_rooms'] = df[rooms_features].sum(axis=1)

  # add feature: 'bed_bath_ratio'
  df['bed_bath_ratio'] = (df['bedrooms'] / df['bathrooms']).fillna(0.0).replace(np.inf, 0.0)

  # strip leading and trailing whitespace in features 'display_address' and 'street_address'
  df['display_address'] = df['display_address'].str.strip()
  df['street_address'] = df['street_address'].str.strip()

  # fill in nulls in feature 'street_address'
  new_street_addresses = pd.DataFrame(df[df['street_address'].isnull()].apply(reverse_geocode_row, axis=1), columns=['street_address'])
  df.update(new_street_addresses)
  assert df['street_address'].isnull().sum() == 0

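  # fill nulls in feature 'display_address'
  cond_null_da = df['display_address'].isnull()
  new_display_addresses = df[cond_null_da]['street_address'].str.extract('[0-9]+[0-9a-zA-Z]+ (.*)')
  new_display_addresses.columns = ['display_address']
  df.update(new_display_addresses)
  # the pattern does not match every street address (the exploration above left 3 nulls),
  # so fall back to the full street address for any rows still missing (an assumption)
  df['display_address'] = df['display_address'].fillna(df['street_address'])
  assert df['display_address'].isnull().sum() == 0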

  return df

### train / test split

In [82]:
# note: these calls use wrangle, not wrangle2, so the address cleanup above is not applied here
df_train2 = wrangle(df2[df2['created'].dt.month != 6])
df_test2 = wrangle(df2[df2['created'].dt.month == 6])
df_train2.shape, df_test2.shape

((31844, 41), (16973, 41))

### define target and features

In [0]:
target = 'price'
features = df_train2.columns.drop(['created', target])

X_train2 = df_train2[features]
X_test2 = df_test2[features]

y_train2 = df_train2[target]
y_test2 = df_test2[target]

### encode, fit model, predict

In [89]:
encoder = OrdinalEncoder()

X_train2_encoded = encoder.fit_transform(X_train2)
X_test2_encoded = encoder.transform(X_test2)

model = LinearRegression(n_jobs=-1)

model.fit(X_train2_encoded, y_train2)

y_pred2 = model.predict(X_test2_encoded)

mae2 = mean_absolute_error(y_test2, y_pred2)

print(f'Test2 MAE: {mae2:.2f} dollars')

Test2 MAE: 695.88 dollars


### another quick try, no feature engineering

In [90]:
df3 = df.copy()

df_train3 = df3[df3['created'].dt.month != 6]
df_test3 = df3[df3['created'].dt.month == 6]

target = 'price'
features = df_train3.columns.drop([target, 'created'])

X_train3 = df_train3[features]
X_test3 = df_test3[features]

y_train3 = df_train3[target]
y_test3 = df_test3[target]

encoder = OrdinalEncoder()

X_train3_encoded = encoder.fit_transform(X_train3)
X_test3_encoded = encoder.transform(X_test3)

model = LinearRegression(n_jobs=-1)

model.fit(X_train3_encoded, y_train3)
y_pred3 = model.predict(X_test3_encoded)

mae3 = mean_absolute_error(y_test3, y_pred3)

print(f'Test3 MAE: {mae3:.2f} dollars')

Test3 MAE: 701.48 dollars
