<a href="https://colab.research.google.com/github/anitashar/DS-Unit-2-Linear-Models/blob/master/ANITA_SHARMA_Copy_of_LS_DS_212_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
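
For reference, the three metrics named in the checklist are conventionally defined as follows. The notation is chosen here and is not taken from the assignment: $y_i$ is the observed rent, $\hat{y}_i$ the model's prediction, $\bar{y}$ the mean observed rent, and $n$ the number of listings.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

MAE and RMSE are in the same units as the target (dollars of rent), while $R^2$ is unitless.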


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
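
A few of these ideas can be sketched in pandas as shown below. This is a minimal illustration on a hypothetical three-row sample, not the assignment's solution: the sample rows, the name `listings`, and the new column names are all assumptions, and only two of the amenity columns are summed to keep the example short.

```python
import pandas as pd

# Hypothetical three-row sample using column names from the renthop data.
listings = pd.DataFrame({
    'description': ['Sunny two-bedroom near the park', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [1, 0, 0],
})

# Treat missing or whitespace-only descriptions as empty.
description = listings['description'].fillna('').str.strip()
listings['has_description'] = (description != '').astype(int)
listings['description_length'] = description.str.len()

# Total perks: sum of the 0/1 amenity columns (only two are used here).
perk_columns = ['elevator', 'doorman']
listings['perk_count'] = listings[perk_columns].sum(axis=1)

# Pet policies: logical "or" and "and" of the two 0/1 flags.
cats = listings['cats_allowed'] == 1
dogs = listings['dogs_allowed'] == 1
listings['cats_or_dogs'] = (cats | dogs).astype(int)
listings['cats_and_dogs'] = (cats & dogs).astype(int)

print(listings[['has_description', 'description_length', 'perk_count',
                'cats_or_dogs', 'cats_and_dogs']])
```

On the full dataset, `perk_columns` would presumably list every 0/1 amenity column rather than just these two.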

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
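
The percentile filter above can be read as "keep the rows whose value lies between a lower and an upper percentile". A minimal sketch of that idea on made-up prices follows; the helper name `within_percentiles` and the sample series are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def within_percentiles(values, lower, upper):
    """Return a boolean mask for values inside the [lower, upper] percentiles."""
    low, high = np.percentile(values, [lower, upper])
    return (values >= low) & (values <= high)

# Hypothetical prices: the integers 1 through 1000.
prices = pd.Series(np.arange(1, 1001), name='price')

# Drop the most extreme 1% (0.5% from each tail), as done for price above.
mask = within_percentiles(prices, 0.5, 99.5)
trimmed = prices[mask]
print(len(prices), len(trimmed))
```

Masks for several columns can be combined with `&`, which is what the cell above does for price, latitude, and longitude.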

In [3]:
#  check the data
print(df.shape)
df.head()

(48817, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
# find missing values
df.isnull().sum()




bathrooms                  0
bedrooms                   0
created                    0
description             1425
display_address          133
latitude                   0
longitude                  0
price                      0
street_address            10
interest_level             0
elevator                   0
cats_allowed               0
hardwood_floors            0
dogs_allowed               0
doorman                    0
dishwasher                 0
no_fee                     0
laundry_in_building        0
fitness_center             0
pre-war                    0
laundry_in_unit            0
roof_deck                  0
outdoor_space              0
dining_room                0
high_speed_internet        0
balcony                    0
swimming_pool              0
new_construction           0
terrace                    0
exclusive                  0
loft                       0
garden_patio               0
wheelchair_access          0
common_outdoor_space       0
dtype: int64

# **Feature engineering**

In [5]:
#  check the datatype
df['created'].dtypes

dtype('O')

In [6]:
# convert the date column from object type to datetime type
df['created']=pd.to_datetime(df['created'])

# check the date type again for this column
df['created'].dtypes

dtype('<M8[ns]')

In [0]:
# create a month column
df['month']=df['created'].dt.month

In [8]:
# get the unique values for the month column
# pd.unique(df['month'])
df['month'].value_counts()

6    16973
4    16217
5    15627
Name: month, dtype: int64

In [9]:
# see both the columns
df[['created','month']]

Unnamed: 0,created,month
0,2016-06-24 07:54:24,6
1,2016-06-12 12:19:27,6
2,2016-04-17 03:26:41,4
3,2016-04-18 02:22:02,4
4,2016-04-28 01:32:41,4
...,...,...
49347,2016-06-02 05:41:05,6
49348,2016-04-04 18:22:34,4
49349,2016-04-16 02:13:40,4
49350,2016-04-08 02:13:33,4


In [10]:
#check years in the dataset
pd.unique(df['created'].dt.year)

array([2016])

In [11]:
# check datatypes of bedrooms
print(df['bedrooms'].dtypes)
# convert int to float & check again datatypes of bedrooms
df['bedrooms'] = df['bedrooms'].astype(float)
df['bedrooms'].dtypes

int64


dtype('float64')

In [12]:
# datatypes of bathrooms
df['bathrooms'].dtypes

dtype('float64')

In [13]:
#  create new column adding bedrooms & bathrooms
df['rooms'] = df['bathrooms'] + df['bedrooms']
df['rooms']


0        4.5
1        3.0
2        2.0
3        2.0
4        5.0
        ... 
49347    3.0
49348    2.0
49349    2.0
49350    1.0
49351    3.0
Name: rooms, Length: 48817, dtype: float64

In [14]:
# check that the rooms column was added to the df
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space', 'month',
       'rooms'],
      dtype='object')

In [15]:
df['bathrooms'].describe()

count    48817.000000
mean         1.201794
std          0.470711
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         10.000000
Name: bathrooms, dtype: float64

In [16]:
# ratio of bed to bath
df['ratio']= df['bedrooms']/df['bathrooms']

# check the column
df['ratio'].sort_values(ascending=True)

42673    0.0
26729    0.0
39392    0.0
10388    0.0
10389    0.0
        ... 
46569    NaN
46740    NaN
46878    NaN
47125    NaN
48834    NaN
Name: ratio, Length: 48817, dtype: float64

In [0]:
# Listings with zero bathrooms produce NaN (0/0) or inf (n/0); set both to 0.
# Assigning the result back avoids chained in-place replacement on a column.
df['ratio'] = df['ratio'].replace(np.inf, np.nan).fillna(0)
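
The NaN and inf values come from dividing by a bathroom count of zero. The sketch below reproduces this on four hypothetical listings; the frame `rooms_sample` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: a normal unit, a studio (0 beds), and two with 0 baths.
rooms_sample = pd.DataFrame({
    'bedrooms': [2.0, 0.0, 1.0, 0.0],
    'bathrooms': [1.0, 1.0, 0.0, 0.0],
})

# Float division by zero does not raise in pandas: n/0 gives inf, 0/0 gives NaN.
ratio = rooms_sample['bedrooms'] / rooms_sample['bathrooms']
print(ratio.tolist())  # [2.0, 0.0, inf, nan]

# Map infinities to NaN first, then fill every NaN with 0.
cleaned = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned.tolist())  # [2.0, 0.0, 0.0, 0.0]
```

One consequence worth keeping in mind: after filling with 0, listings with zero bathrooms are indistinguishable from studios, whose ratio is genuinely 0.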

In [18]:
df['ratio'].value_counts()

1.000000    19014
2.000000    12240
0.000000     9470
3.000000     3603
1.500000     2772
4.000000      366
1.333333      343
0.500000      206
0.666667      188
2.500000      129
1.200000      128
0.800000       83
2.666667       61
1.600000       38
1.666667       35
1.250000       31
1.142857       27
0.857143       24
0.333333       14
0.750000       12
5.000000       10
0.888889        7
3.333333        5
1.428571        3
0.400000        2
2.400000        1
0.222222        1
6.000000        1
2.333333        1
0.571429        1
0.200000        1
Name: ratio, dtype: int64

In [0]:
# # drop na values
# df = df.dropna(axis=0, subset=['ratio'])
# df.isnull().sum() 

In [20]:
# check the ratio column
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space', 'month',
       'rooms', 'ratio'],
      dtype='object')

In [0]:
# create test(June 2016) & train data(April & May 2016)
train= df.query('month <= 5')
test = df.query('month == 6')
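
Because every listing is from 2016, splitting on the month number works here. An alternative that does not depend on that is to compare the timestamps against a cutoff date. The following is a sketch on four hypothetical rows; the names `listings`, `train_sample`, and `test_sample` and the prices are assumptions for illustration.

```python
import pandas as pd

# Hypothetical listings spanning the three months in the data.
listings = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-17 03:26:41', '2016-05-30 23:59:59',
        '2016-06-01 00:00:00', '2016-06-24 07:54:24',
    ]),
    'price': [2850, 3100, 2999, 3000],
})

# Everything before June 1 trains the model; June onward is held out.
cutoff = pd.Timestamp('2016-06-01')
train_sample = listings[listings['created'] < cutoff]
test_sample = listings[listings['created'] >= cutoff]

# The two sets should not overlap in time.
assert train_sample['created'].max() < test_sample['created'].min()
print(len(train_sample), len(test_sample))
```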

### **Fit a linear regression model with at least two features.**

In [0]:
# 1. Import the appropriate estimator class from Scikit-Learn
from sklearn.linear_model import LinearRegression

In [0]:
# 2. Instantiate this class
model = LinearRegression()

In [24]:
# 3. Arrange the X feature matrices & y target vectors
features = ['rooms', 'ratio']
target = 'price'

X_train = train[features]
X_test = test[features]
y_train = train[target]
y_test = test[target]

print(f'Linear regression, features: {features}')
print(X_train.shape, X_test.shape)


Linear regression, features: ['rooms', 'ratio']
(31844, 2) (16973, 2)


In [25]:
X_train['ratio'].sort_values(ascending=True)

31484    0.0
6187     0.0
33254    0.0
41104    0.0
33253    0.0
        ... 
29780    5.0
8000     5.0
428      5.0
45164    5.0
34324    6.0
Name: ratio, Length: 31844, dtype: float64

In [26]:
df['bathrooms'].value_counts()

1.0     39152
2.0      7619
3.0       680
1.5       645
0.0       304
2.5       256
4.0        93
3.5        55
4.5         8
5.0         4
10.0        1
Name: bathrooms, dtype: int64

In [27]:
df['bedrooms'].value_counts()

1.0    15651
2.0    14569
0.0     9317
3.0     7188
4.0     1825
5.0      221
6.0       43
8.0        2
7.0        1
Name: bedrooms, dtype: int64

In [28]:
# Import the error metric
from sklearn.metrics import mean_absolute_error
# 4. Fit the model
model.fit(X_train, y_train)
# Mean absolute error on the training data
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train MAE: ${mae:.2f}')

Train MAE: $826.02


In [29]:
# 5. Apply the model to new data
# Mean absolute error on the test data
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test MAE: ${mae:.2f}')

Test MAE: $832.25
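
An MAE figure is easier to judge next to a naive baseline, such as always predicting the mean of the training target. The sketch below shows the comparison on synthetic data, not on the renthop listings: the generated features, the noise level, and the names `toy_model`, `X_fit`, and `X_new` are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic data (an assumption for illustration, not the renthop listings):
# a noisy linear relationship between two features and a price-like target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 2))
y = 1000 + 500 * X[:, 0] - 200 * X[:, 1] + rng.normal(0, 100, size=200)

# Hold out the last 50 rows to stand in for unseen data.
X_fit, X_new = X[:150], X[150:]
y_fit, y_new = y[:150], y[150:]

# Baseline: always predict the mean of the target seen during fitting.
baseline_pred = np.full(len(y_new), y_fit.mean())
baseline_mae = mean_absolute_error(y_new, baseline_pred)

# Linear regression fitted on the same rows.
toy_model = LinearRegression().fit(X_fit, y_fit)
model_mae = mean_absolute_error(y_new, toy_model.predict(X_new))

print(f'Baseline MAE: {baseline_mae:.2f}')
print(f'Model MAE:    {model_mae:.2f}')
```

The same comparison could be run on the notebook's data by building the baseline from `y_train.mean()` and scoring it against `y_test`.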


### **Get the model's coefficients and intercept.**

In [30]:
model.intercept_,model.coef_

(1460.4820854964014, array([1130.13528575, -776.35640613]))

In [33]:
# equation for the model, where x1 = rooms and x2 = ratio

beta0 =model.intercept_
beta1,beta2 = model.coef_
print(f'y = {beta0} + {beta1}x1 + {beta2}x2')

y = 1460.4820854964014 + 1130.1352857542017x1 + -776.3564061252885x2


In [34]:
# model coefficients & intercept

print('Intercept', model.intercept_)
coefficients = pd.Series(model.coef_ , features)
print(coefficients.to_string())

Intercept 1460.4820854964014
rooms    1130.135286
ratio    -776.356406
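
To see what these numbers mean, a prediction can be computed by hand as the intercept plus the coefficient-weighted features. The sketch below uses the intercept and coefficients printed above, rounded to two decimals, and a hypothetical listing with 2 bedrooms and 1 bathroom.

```python
import numpy as np

# Values copied (rounded) from the fitted model's output above.
intercept = 1460.48
coefficients = np.array([1130.14, -776.36])  # order: rooms, ratio

# Hypothetical listing: 2 bedrooms + 1 bathroom -> rooms = 3, ratio = 2.
listing = np.array([3.0, 2.0])

# Linear model: intercept + sum of coefficient * feature value.
predicted_rent = intercept + coefficients @ listing
print(f'Predicted rent: ${predicted_rent:,.2f}')  # Predicted rent: $3,298.18
```

Read this way, each additional room adds roughly $1,130 to the predicted rent when the ratio is held fixed, and each unit increase in the bed-to-bath ratio subtracts roughly $776 when the room count is held fixed. Since both features are derived from the same two columns, they rarely change independently, so these readings should be treated with some caution.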


# **Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data**

In [43]:
# regression metrics for the training data
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


y_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_pred)  # mean squared error
rmse = np.sqrt(mse)  # root mean squared error
mae = mean_absolute_error(y_train, y_pred)  # mean absolute error
r2 = r2_score(y_train, y_pred)  # coefficient of determination

# print MSE, RMSE, MAE, R^2
print('Mean squared error:',mse)
print('root mean squared error',rmse)
print('mean absolute error:',mae)
print('R^2',r2)

  

Mean squared error: 1540592.0195382172
root mean squared error 1241.2058731484544
mean absolute error: 826.0239929058141
R^2 0.5038396074273039


In [45]:
# regression metrics for the test data

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  # mean squared error
rmse = np.sqrt(mse)  # root mean squared error
mae = mean_absolute_error(y_test, y_pred)  # mean absolute error
r2 = r2_score(y_test, y_pred)  # coefficient of determination

# print MSE, RMSE, MAE, R^2
print('Mean squared error:',mse)
print('root mean squared error',rmse)
print('mean absolute error:',mae)
print('R^2',r2)

  
  

Mean squared error: 1515639.4763700743
root mean squared error 1231.11310462121
mean absolute error: 832.2514945352539
R^2 0.5123458478077657
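
The same metrics can also be computed directly from their definitions, which makes it possible to reuse one function for both the train and test sets. This is a sketch, not the notebook's method: the helper name `regression_metrics` and the sample rents are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R^2 directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors ** 2))
    mae = np.mean(np.abs(errors))
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {'rmse': rmse, 'mae': mae, 'r2': r2}

# Hypothetical rents and predictions, not taken from the dataset.
actual = [2000, 3000, 4000, 5000]
predicted = [2500, 2500, 4500, 4500]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

In the notebook this could be called once as `regression_metrics(y_train, model.predict(X_train))` and once with the test arrays, replacing the two near-identical cells.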
