<a href="https://colab.research.google.com/github/cjakuc/DS-Unit-2-Linear-Models/blob/master/module2-regression-2/LS_DS_212_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [x] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [x] Engineer at least two new features. (See below for explanation & ideas.)
- [x] Fit a linear regression model with at least two features.
- [x] Get the model's coefficients and intercept.
- [x] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [x] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [x] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
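
A few of the ideas above can be sketched with pandas. This is a minimal sketch on a tiny made-up frame, not the assignment solution: the input column names mirror the renthop data, while the new column names (`has_description`, `description_length`, `perk_count`, `cats_or_dogs`, `cats_and_dogs`, `total_rooms`) and the subset of perk columns are illustrative assumptions.

```python
import pandas as pd

# Tiny made-up sample; column names mirror the renthop listing data
listings = pd.DataFrame({
    'description': ['Sunny one bedroom', None, '   '],
    'bedrooms': [1, 2, 0],
    'bathrooms': [1.0, 1.5, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'elevator': [1, 1, 0],
    'doorman': [0, 1, 0],
})

# Treat missing or whitespace-only descriptions as "no description"
text = listings['description'].fillna('').str.strip()
listings['has_description'] = (text != '').astype(int)
listings['description_length'] = text.str.len()

# Count the binary perk columns (only a hypothetical subset is listed here)
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']
listings['perk_count'] = listings[perk_cols].sum(axis=1)

# Pet policy: either pet allowed vs. both pets allowed
cats = listings['cats_allowed'] == 1
dogs = listings['dogs_allowed'] == 1
listings['cats_or_dogs'] = (cats | dogs).astype(int)
listings['cats_and_dogs'] = (cats & dogs).astype(int)

# Total number of rooms (beds + baths)
listings['total_rooms'] = listings['bedrooms'] + listings['bathrooms']
```

A beds-to-baths ratio would follow the same pattern, but it needs a decision about listings with zero bathrooms before dividing.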

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [x] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

## 1) Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

In [7]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
type(df['created'][0])

str

In [8]:
print(df['created'][0])

2016-06-24 07:54:24


In [0]:
# Create a new datetime column called 'date'
df['date'] = pd.to_datetime(df['created'])

In [11]:
df['date'].head()

0   2016-06-24 07:54:24
1   2016-06-12 12:19:27
2   2016-04-17 03:26:41
3   2016-04-18 02:22:02
4   2016-04-28 01:32:41
Name: date, dtype: datetime64[ns]

In [138]:
# Set train equal to dates from May & April of 2016
condition = ((df['date'].dt.year == 2016) & ((df['date'].dt.month == 4) | (df['date'].dt.month == 5)))
train = df[condition]
train.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,date,interest_low,interest_med,interest_high
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016-04-17 03:26:41,0,0,1
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016-04-18 02:22:02,1,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016-04-28 01:32:41,1,0,0
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016-04-19 04:24:47,0,1,0
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016-04-27 03:19:56,1,0,0


In [17]:
# Set test equal to dates from June 2016
condition = ((df['date'].dt.year == 2016) & (df['date'].dt.month == 6))
test = df[condition]
test.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,date
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016-06-24 07:54:24
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016-06-12 12:19:27
11,1.0,1,2016-06-03 03:21:22,Check out this one bedroom apartment in a grea...,W. 173rd Street,40.8448,-73.9396,1675,644 W. 173rd Street,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016-06-03 03:21:22
14,1.0,1,2016-06-01 03:11:01,Spacious 1-Bedroom to fit King-sized bed comfo...,East 56th St..,40.7584,-73.9648,3050,315 East 56th St..,low,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2016-06-01 03:11:01
24,2.0,4,2016-06-07 04:39:56,SPRAWLING 2 BEDROOM FOUND! ENJOY THE LUXURY OF...,W 18 St.,40.7391,-73.9936,7400,30 W 18 St.,medium,1,1,1,1,1,1,0,0,1,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,2016-06-07 04:39:56


## 2) Engineer at least two new features. (See below for explanation & ideas.)

In [18]:
# Create dummy variables for interest level == low
df['interest_low'] = df['interest_level']
df['interest_low'] = df['interest_low'].replace({'low':1,
                                                 'medium':0,
                                                 'high':0})
df['interest_low'].head()

0    0
1    1
2    0
3    1
4    1
Name: interest_low, dtype: int64

In [0]:
# Create dummy variables for interest level == medium
df['interest_med'] = df['interest_level']
df['interest_med'] = df['interest_med'].replace({'low':0,
                                                 'medium':1,
                                                 'high':0})

In [0]:
# Create dummy variables for interest level == high
df['interest_high'] = df['interest_level']
df['interest_high'] = df['interest_high'].replace({'low':0,
                                                 'medium':0,
                                                 'high':1})

In [0]:
# Redo the train/test split so that both sets include the new features
condition = ((df['date'].dt.year == 2016) & ((df['date'].dt.month == 4) | (df['date'].dt.month == 5)))
train = df[condition]
condition = ((df['date'].dt.year == 2016) & (df['date'].dt.month == 6))
test = df[condition]

## 3) Fit a linear regression model with at least two features.

In [0]:
# Import estimator class from sklearn
from sklearn.linear_model import LinearRegression

In [0]:
# Instantiate the class
lmodel = LinearRegression()

In [29]:
# Arrange X feature matrices and y target vectors
features = ['interest_low',
            'interest_med',
            'interest_high',
            'bedrooms',
            'bathrooms']
X_train = train[features]
X_test = test[features]
target = 'price'
y_train = train[target]
y_test = test[target]
print(f'Linear Regression, dependent on: {features}')

Linear Regression, dependent on: ['interest_low', 'interest_med', 'interest_high', 'bedrooms', 'bathrooms']


In [31]:
# Fit the model
from sklearn.metrics import mean_absolute_error
lmodel.fit(X_train,
           y_train)
y_pred_train = lmodel.predict(X_train)
mae = mean_absolute_error(y_train,
                          y_pred_train)
print(f'Train Error: ${mae:,.0f}')

Train Error: $792


In [32]:
# Apply the model to new data
y_pred_test = lmodel.predict(X_test)
mae = mean_absolute_error(y_test,
                          y_pred_test)
print(f'Test Error: ${mae:,.0f}')

Test Error: $793


## 4) Get the model's coefficients and intercept.

In [33]:
print(f"The model's coefficients are: {lmodel.coef_} and its intercept is {lmodel.intercept_}")

The model's coefficients are: [1.68036008e+14 1.68036008e+14 1.68036008e+14 4.19675781e+02
 1.98765234e+03] and its intercept is -168036007536780.94


In [36]:
# Easier-to-read version
print("Intercept",
      lmodel.intercept_)
coefficients = pd.Series(lmodel.coef_,
                        features)
print(coefficients.to_string())

Intercept -168036007536780.94
interest_low     1.680360e+14
interest_med     1.680360e+14
interest_high    1.680360e+14
bedrooms         4.196758e+02
bathrooms        1.987652e+03
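
The three interest coefficients near 1.68e14 and the matching negative intercept are consistent with perfect collinearity: `interest_low + interest_med + interest_high` equals 1 for every row, which duplicates the intercept, so those coefficients are not uniquely determined. A common remedy is to drop one dummy and treat it as the baseline. The sketch below uses made-up data and NumPy least squares instead of the notebook's model, purely to illustrate the rank deficiency.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Made-up categorical variable with three levels, one-hot encoded
level = rng.integers(0, 3, size=n)
dummies = np.eye(3)[level]                      # columns: low, medium, high
bedrooms = rng.integers(0, 5, size=n).astype(float)

intercept = np.ones((n, 1))
# Design matrix with all three dummies vs. with 'low' dropped as the baseline
X_all = np.hstack([intercept, dummies, bedrooms[:, None]])
X_drop = np.hstack([intercept, dummies[:, 1:], bedrooms[:, None]])

# With all three dummies the columns are linearly dependent
rank_all = np.linalg.matrix_rank(X_all)     # one less than the number of columns
rank_drop = np.linalg.matrix_rank(X_drop)   # equal to the number of columns

# Made-up prices: baseline 2000, +300 for medium, +800 for high, +500 per bedroom
y = 2000 + dummies @ np.array([0.0, 300.0, 800.0]) + 500 * bedrooms

# After dropping one dummy, the coefficients are uniquely recoverable
coef, *_ = np.linalg.lstsq(X_drop, y, rcond=None)
print(rank_all, X_all.shape[1], rank_drop, X_drop.shape[1])
print(coef)
```

With one dummy dropped, each remaining interest coefficient reads as a difference from the baseline level.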


## 5) Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.

In [39]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mse_train = mean_squared_error(y_train,
                         y_pred_train)
rmse_train = np.sqrt(mse_train)
mae_train = mean_absolute_error(y_train,
                          y_pred_train)
r2_train = r2_score(y_train,
              y_pred_train)
print('Train RMSE: ', rmse_train)
print('Train MAE: ', mae_train)
print('Train R^2: ', r2_train)

mse_test = mean_squared_error(y_test,
                         y_pred_test)
rmse_test = np.sqrt(mse_test)
mae_test = mean_absolute_error(y_test,
                          y_pred_test)
r2_test = r2_score(y_test,
              y_pred_test)
print('Test RMSE: ', rmse_test)
print('Test MAE: ', mae_test)
print('Test R^2: ', r2_test)

Train RMSE:  1194.0145613830978
Train MAE:  792.1054348935435
Train R^2:  0.5408509447417822
Test RMSE:  1176.63327444387
Test MAE:  792.6162710628646
Test R^2:  0.5545507114611612
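
For reference, the three metrics can be written out from their definitions. This is a minimal NumPy sketch on made-up numbers; the helper name `regression_metrics` is hypothetical, and its results should agree with `mean_squared_error`, `mean_absolute_error`, and `r2_score`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))          # root of the mean squared error
    mae = np.mean(np.abs(residuals))                 # mean absolute error
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2

# Made-up prices and predictions
rmse, mae, r2 = regression_metrics([2000, 3000, 4000], [2100, 2900, 4300])
print(rmse, mae, r2)
```

Because RMSE squares the residuals before averaging, large misses weigh more heavily, which is why RMSE is larger than MAE in the output above.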


## 6) What's the best test MAE you can get? Share your score and features used with your cohort on Slack!

In [0]:
# Make a new model to try and improve MAE
lmodel1 = LinearRegression()

In [41]:
# Add lat and long
features = ['interest_low',
            'interest_med',
            'interest_high',
            'bedrooms',
            'bathrooms',
            'latitude',
            'longitude']
X_train = train[features]
X_test = test[features]
target = 'price'
y_train = train[target]
y_test = test[target]
print(f'Linear Regression, dependent on: {features}')

Linear Regression, dependent on: ['interest_low', 'interest_med', 'interest_high', 'bedrooms', 'bathrooms', 'latitude', 'longitude']


In [45]:
# Fit the model and find the MAE
lmodel1.fit(X_train,
           y_train)
y_pred_train = lmodel1.predict(X_train)
y_pred_test = lmodel1.predict(X_test)
mae = mean_absolute_error(y_test,
                          y_pred_test)
print(f'Test MAE: ${mae:,.0f}')

Test MAE: $717


## Plot the residuals to see how they are distributed

In [0]:
error = (y_test - lmodel1.predict(X_test))

In [64]:
import plotly.express as px
fig = px.scatter(x=y_test,y=error)
# Add a horizontal line at residuals = 0
fig.update_layout(shapes=[
    dict(
      type= 'line',
      yref= 'y', y0= 0, y1= 0,
      xref= 'x', x0= 0, x1= 16000
    )
])
# Residuals increase as price increases
# Perhaps could be helped by log transforming prices?

## Try log transforming the target variable price

In [0]:
# Fit the model and predict
from sklearn.compose import TransformedTargetRegressor
regr_trans = TransformedTargetRegressor(regressor=LinearRegression(),
                                        func=np.log1p,
                                        inverse_func=np.expm1)
regr_trans.fit(X_train, y_train)
y_pred_test = regr_trans.predict(X_test)

In [71]:
# Check out the MAE
mae = mean_absolute_error(y_test,
                          y_pred_test)
mae

677.4536281241336
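
`TransformedTargetRegressor` with `func=np.log1p` and `inverse_func=np.expm1` fits the regressor on `log1p(y)` and maps its predictions back with `expm1`. Here is a minimal sketch of the same idea done by hand, on made-up data and with NumPy least squares standing in for `LinearRegression`. The data is deliberately generated so that price grows multiplicatively, which is an assumption that favors the log model.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Made-up data: price grows multiplicatively with the number of bedrooms
bedrooms = rng.integers(0, 5, size=n).astype(float)
price = 1500 * np.exp(0.4 * bedrooms + rng.normal(0, 0.05, size=n))

X = np.column_stack([np.ones(n), bedrooms])   # intercept + one feature

# Fit on the transformed target, then invert the transform on the predictions
coef_log, *_ = np.linalg.lstsq(X, np.log1p(price), rcond=None)
pred_log_model = np.expm1(X @ coef_log)

# Plain fit on the raw target for comparison
coef_raw, *_ = np.linalg.lstsq(X, price, rcond=None)
pred_raw_model = X @ coef_raw

# Both errors are measured in dollars, on the original price scale
mae_log = np.mean(np.abs(price - pred_log_model))
mae_raw = np.mean(np.abs(price - pred_raw_model))
print(mae_log, mae_raw)
```

The key point is that the error is still measured on the original price scale, so the two MAE values are directly comparable.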

## Plot the residuals again to see how they've changed

In [0]:
error2 = (y_test - regr_trans.predict(X_test))

In [78]:
fig = px.scatter(x=y_test,y=error2)
# Add a horizontal line at residuals = 0
fig.update_layout(shapes=[
    dict(
      type= 'line',
      yref= 'y', y0= 0, y1= 0,
      xref= 'x', x0= 0, x1= 16000
    )
])

In [0]:
# Residuals still grow with price, but less severely, and the residuals
# for lower prices are tighter around zero

## Try using a stepwise regression that minimizes MAE

In [0]:
# Heavily edited from a method that prioritized p-values:
# https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn

import numpy as np

X_train = train.select_dtypes(include='number')
X_test = test.select_dtypes(include='number')


def stepwise_selection(X_train,
                       X_test,
                       y_train,
                       y_test,
                       initial_list=[],
                       verbose=True):
    """ Perform a forward feature selection
    Arguments:
        X_train - pandas.DataFrame of train data with candidate features
        X_test - pandas.DataFrame of test data with candidate features
        y_train - list-like with the train target
        y_test - list-like with the test target
        initial_list - list of features to start with (column names of X_train)
        verbose - whether to print the sequence of inclusions
    Returns: list of selected features
    Include a feature if its inclusion decreases the test MAE
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    while True:
        changed = False
        # forward step: candidates are the unused columns, excluding the target
        excluded = [col for col in X_train.columns
                    if col not in included and col != 'price']
        new_mae = pd.Series(index=excluded, dtype=float)
        old_model = LinearRegression()
        old_model.fit(X_train[included], y_train)
        original_mae = mean_absolute_error(y_test, old_model.predict(X_test[included]))
        for new_column in excluded:
            model = LinearRegression()
            model.fit(X_train[included+[new_column]], y_train)
            new_mae[new_column] = mean_absolute_error(y_test, model.predict(X_test[included+[new_column]]))
        best_mae = new_mae.min()
        if best_mae < original_mae:
            best_feature = new_mae.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print(f'Add {best_feature} with MAE {best_mae}')

        # The backward (p-value based) step of the original method was removed
        if not changed:
            break
    return included
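
The function above scores each candidate feature on the test data. The sketch below is a variant, not the function above: it runs the same greedy forward search but scores candidates on a separate validation split carved out of the training rows, so the test set stays untouched for the final evaluation. It uses made-up data and NumPy least squares in place of `LinearRegression`; the helper names `fit_predict` and `forward_select` are hypothetical.

```python
import numpy as np

def fit_predict(X_fit, y_fit, X_new):
    """Least-squares fit with an intercept, then predict on X_new."""
    A_fit = np.column_stack([np.ones(len(X_fit)), X_fit])
    A_new = np.column_stack([np.ones(len(X_new)), X_new])
    coef, *_ = np.linalg.lstsq(A_fit, y_fit, rcond=None)
    return A_new @ coef

def forward_select(X_fit, y_fit, X_val, y_val):
    """Greedily add the column that most reduces validation MAE."""
    selected = []
    remaining = list(range(X_fit.shape[1]))
    # Baseline: an intercept-only model predicts the mean of the fitting target
    best_mae = np.mean(np.abs(y_val - y_fit.mean()))
    while remaining:
        scores = {}
        for col in remaining:
            cols = selected + [col]
            pred = fit_predict(X_fit[:, cols], y_fit, X_val[:, cols])
            scores[col] = np.mean(np.abs(y_val - pred))
        best_col = min(scores, key=scores.get)
        if scores[best_col] >= best_mae:
            break  # no remaining candidate improves validation MAE
        best_mae = scores[best_col]
        selected.append(best_col)
        remaining.remove(best_col)
    return selected, best_mae

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
# Only columns 0 and 1 drive the made-up target; columns 2 and 3 are pure noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, size=400)

# Hold out the last quarter of the rows as the validation split
X_fit, X_val = X[:300], X[300:]
y_fit, y_val = y[:300], y[300:]

selected, val_mae = forward_select(X_fit, y_fit, X_val, y_val)
print(selected, val_mae)
```

With a time-based split like the one in this notebook, the validation rows could be the latest training dates rather than the last rows of an array.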

In [143]:
initial_list = []
initial_list.append('latitude')
result = stepwise_selection(X_train,
                            X_test,
                            y_train,
                            y_test,
                            initial_list=initial_list)

print('resulting features:')
print(result)

Add bathrooms with MAE 885.897294637873
Add bedrooms with MAE 820.0689171931438
Add longitude with MAE 744.9752658120601
Add doorman with MAE 715.5147917917018
Add interest_low with MAE 689.0601308151838
Add laundry_in_unit with MAE 686.5075053201265
Add high_speed_internet with MAE 683.0633546287492
Add elevator with MAE 680.7010669260205
Add hardwood_floors with MAE 679.3993159577394
Add dishwasher with MAE 678.2170691391451
Add interest_high with MAE 677.216638823318
Add laundry_in_building with MAE 676.7085965827969
Add fitness_center with MAE 676.2019568022401
Add roof_deck with MAE 675.6285796905796
Add terrace with MAE 675.1762496768362
Add dogs_allowed with MAE 674.8655916018316
Add new_construction with MAE 674.5852230572945
Add pre-war with MAE 674.3201336085016
Add wheelchair_access with MAE 674.1438959818391
Add swimming_pool with MAE 674.0226036758975
Add exclusive with MAE 673.927732053142
Add cats_allowed with MAE 673.8989244851414
resulting features:
['latitude', 'bathr

## Use the log transformed price with the stepwise predictors

In [0]:
regr_trans.fit(X_train[result], y_train)
y_pred_test = regr_trans.predict(X_test[result])

In [145]:
mae = mean_absolute_error(y_test,
                          y_pred_test)
mae

633.8919702807701

## Adjust the stepwise method so that it only adds features that increase the adjusted R^2

In [0]:
# Function for calculating adjusted R^2
def adj_r2_score(model, y_test, X_test):
  from sklearn import metrics

  yhat = model.predict(X_test)
  n = len(y_test)        # number of observations
  p = len(model.coef_)   # number of predictors
  r2 = metrics.r2_score(y_test, yhat)

  adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

  return adj
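
Adjusted $R^2$ discounts $R^2$ by the number of predictors $p$ relative to the number of observations $n$: $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$. Below is a small sketch of that arithmetic with made-up inputs; the helper name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 from plain R^2, n observations, and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R^2 and sample size, more predictors: the adjusted value drops
few = adjusted_r2(0.5, n=11, p=2)    # 1 - 0.5 * 10 / 8
many = adjusted_r2(0.5, n=11, p=5)   # 1 - 0.5 * 10 / 5
print(few, many)
```

When $n$ is much larger than $p$, as with tens of thousands of listings and a few dozen features, the adjustment is very small.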

In [0]:
X_train = train.select_dtypes(include='number')
X_test = test.select_dtypes(include='number')

def stepwise_selection(X_train,
                       X_test,
                       y_train,
                       y_test,
                       initial_list=[],
                       verbose=True):
    """ Perform a forward feature selection
    Arguments:
        X_train - pandas.DataFrame of train data with candidate features
        X_test - pandas.DataFrame of test data with candidate features
        y_train - train data with the target
        y_test - test data with the target
        initial_list - list of features to start with (column names of X_train)
        verbose - whether to print the sequence of inclusions
    Returns: list of selected features
    At each step, take the candidate with the lowest test MAE and include it
    only if it also increases the adjusted R^2
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    printed_original = False
    while True:
        changed = False

        # forward step: candidates are the unused columns, excluding the target
        excluded = [col for col in X_train.columns
                    if col not in included and col != 'price']
        if not excluded:
            break
        new_mae = pd.Series(index=excluded, dtype=float)
        new_aR2 = pd.Series(index=excluded, dtype=float)
        old_model = LinearRegression()
        old_model.fit(X_train[included],
                      y_train)
        original_mae = mean_absolute_error(y_test,
                                           old_model.predict(X_test[included]))
        original_aR2 = adj_r2_score(old_model,
                                    y_test,
                                    X_test[included])

        for new_column in excluded:
            model = LinearRegression()
            model.fit(X_train[included+[new_column]],
                      y_train)
            new_mae[new_column] = mean_absolute_error(y_test,
                                                      model.predict(X_test[included+[new_column]]))
            new_aR2[new_column] = adj_r2_score(model,
                                               y_test,
                                               X_test[included+[new_column]])

        best_mae_ind = new_mae.idxmin()
        if new_aR2[best_mae_ind] > original_aR2:
            best_feature = best_mae_ind
            included.append(best_feature)
            changed = True
            if verbose:
              if not printed_original:
                printed_original = True
                print(f'Original model w/ {initial_list} has MAE {original_mae} and adjusted R^2 {original_aR2}')
              print(f'Add {best_feature} with MAE {new_mae.min()} and adjusted R^2 {new_aR2[best_mae_ind]}')

        if not changed:
            break
    return included

In [217]:
initial_list = []
initial_list.append('latitude')
result = stepwise_selection(X_train,
                            X_test,
                            y_train,
                            y_test,
                            initial_list=initial_list)

print('resulting features:')
print(result)

Original model w/ ['latitude'] has MAE 1195.6199393006195 and adjusted R^2 0.0007851819624993261
Add bathrooms with MAE 885.897294637873 and adjusted R^2 0.4818688659796445
Add bedrooms with MAE 820.0689171931438 and adjusted R^2 0.5227308743036241
Add longitude with MAE 744.9752658120601 and adjusted R^2 0.5881933020979921
Add doorman with MAE 715.5147917917018 and adjusted R^2 0.6082167391616802
Add interest_low with MAE 689.0601308151838 and adjusted R^2 0.6333689586934176
Add laundry_in_unit with MAE 686.5075053201265 and adjusted R^2 0.6363657463964121
Add high_speed_internet with MAE 683.0633546287492 and adjusted R^2 0.6386645363265231
Add elevator with MAE 680.7010669260205 and adjusted R^2 0.6393209698870413
Add hardwood_floors with MAE 679.3993159577394 and adjusted R^2 0.6409769942189529
Add dishwasher with MAE 678.2170691391451 and adjusted R^2 0.6412163618177482
Add interest_high with MAE 677.216638823318 and adjusted R^2 0.6421135654679799
Add laundry_in_building with MAE

In [218]:
print(len(result))
print(len(X_train.columns))

21
32


## Fit a model with the features selected by the stepwise function to check that it prints the correct MAE and adjusted R^2

In [219]:
test_model = LinearRegression()
features = result
X_train = train[features]
X_test = test[features]
target = 'price'
y_train = train[target]
y_test = test[target]
print(f'Linear Regression, dependent on: {features}')

Linear Regression, dependent on: ['latitude', 'bathrooms', 'bedrooms', 'longitude', 'doorman', 'interest_low', 'laundry_in_unit', 'high_speed_internet', 'elevator', 'hardwood_floors', 'dishwasher', 'interest_high', 'laundry_in_building', 'fitness_center', 'roof_deck', 'terrace', 'dogs_allowed', 'new_construction', 'pre-war', 'wheelchair_access', 'swimming_pool']


In [221]:
test_model.fit(X_train,
           y_train)
y_pred_train = test_model.predict(X_train)
y_pred_test = test_model.predict(X_test)
mae = mean_absolute_error(y_test,
                          y_pred_test)
ar2 = adj_r2_score(test_model,
                   y_test,
                   X_test)
print(f'Test MAE: ${mae:f}')
print(f'Test Adjusted R^2: {ar2:f}')

Test MAE: $674.022604
Test Adjusted R^2: 0.645021
