# Regression Model Build and Evaluate

In [1]:
# imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
import os
import seaborn as sns

## Import data

In [2]:
# import all_data from part 3
path = '/Users/brigitteasullivan/My Drive/0.Bootcamp/1. Data Course/lighthouse-data-notes/Week_12/Project/w12-statistical-modelling-project/data/all_data.csv'
all_data = pd.read_csv(path)
all_data.head(3)

Unnamed: 0,station_id,name,free_bikes,empty_slots,has_ebikes,ebikes,slots,renting,returning,timestamp,station_location,outdoor_space_num,category_name,num_by_cat,num_parks
0,36c6491aa1b52e5ef7005f984738de27,Gare d'autocars de Montréal (Berri / Ontario),4,11,True,2,15,1,1,2023-08-31T14:04:02.318000Z,"45.516926210319546,-73.56425732374191",49,Campground,1,27
1,36c6491aa1b52e5ef7005f984738de27,Gare d'autocars de Montréal (Berri / Ontario),4,11,True,2,15,1,1,2023-08-31T14:04:02.318000Z,"45.516926210319546,-73.56425732374191",49,Dog Park,1,27
2,36c6491aa1b52e5ef7005f984738de27,Gare d'autocars de Montréal (Berri / Ontario),4,11,True,2,15,1,1,2023-08-31T14:04:02.318000Z,"45.516926210319546,-73.56425732374191",49,Farm,1,27


In [3]:
# import all_data_numeric from part 3
path = '/Users/brigitteasullivan/My Drive/0.Bootcamp/1. Data Course/lighthouse-data-notes/Week_12/Project/w12-statistical-modelling-project/data/all_data_numeric.csv'
all_data_numeric = pd.read_csv(path)
all_data_numeric.head(3)

Unnamed: 0,station_id,free_bikes,empty_slots,ebikes,slots,outdoor_space_num,num_parks
0,36c6491aa1b52e5ef7005f984738de27,4,11,2,15,49,27
1,8db822a266b5ccb3a1e323ddc8721d62,3,16,0,19,8,3
2,660275cd7d4368cc7590f1606c633bd6,10,15,8,25,9,7


## Build a regression model.

In [4]:
all_data_numeric.describe()

Unnamed: 0,free_bikes,empty_slots,ebikes,slots,outdoor_space_num,num_parks
count,768.0,768.0,768.0,768.0,768.0,768.0
mean,8.734375,12.132812,1.695312,21.923177,23.640625,9.791667
std,8.605796,7.39379,2.607457,7.108075,16.251479,6.811742
min,0.0,0.0,0.0,11.0,1.0,1.0
25%,2.0,6.0,0.0,18.75,9.0,4.0
50%,7.0,12.0,1.0,19.0,19.0,8.0
75%,13.0,17.0,2.0,23.0,38.0,14.0
max,60.0,56.0,15.0,81.0,51.0,34.0


### Approach

**Original approach:** 
I first attempted simple linear regression on all possible combinations (8) of
target variable:
- outdoor_space_num
- num_parks

with independent variable:
- free_bikes
- empty_slots
- slots
- ebikes

Across all the models I attempted, the highest adjusted R-squared value was 0.0055.

**Revised approach:**
I decided to restructure the data so that the number of outdoor spaces of each type is a separate column, which gives more candidate independent variables, and narrowed the dependent variable to the number of slots. (The compass content did not explicitly require free bikes as the dependent variable, and I felt that the number of slots represents the overall supply of and demand for bikes better than a point-in-time count of free bikes.)

1. Restructure Data
2. Address NaN values
3. Perform multiple linear regression with backward selection, using number of slots as the dependent variable.

#### 1. Restructure data

To keep the number of new columns manageable, I used only a sample of outdoor space types in the analysis. I chose it by finding the smallest number of categories that represents the largest reasonable proportion of the data.

In [50]:
# for each category, count the rows (one row per station and category found within 1 km of it)
all_data['category_name'].value_counts()

category_name
Park                           768
Playground                     572
Monument                       461
Farm                           409
Garden                         402
Dog Park                       380
Campground                     363
Hiking Trail                   333
Landmarks and Outdoors         316
Historic and Protected Site    261
Roof Deck                      198
Rock Climbing Spot             192
Plaza                          166
Other Great Outdoors           143
Urban Park                     115
Sculpture Garden                96
Tunnel                          84
Stable                          81
Bridge                          77
Structure                       67
Windmill                        61
Harbor or Marina                60
Picnic Area                     54
Neighborhood                    46
Hot Spring                      32
Beach                           28
Lake                            25
Bathing Area                    24
Founta

In [52]:
total_outdoor_spaces = all_data['category_name'].value_counts().sum()
total_outdoor_spaces

5941

In [57]:
# Find the % of data each category represents
percent = (all_data['category_name'].value_counts().sort_values(ascending = False) / total_outdoor_spaces) * 100
percent[:1].sum()


12.927116647029118

In [58]:
# the top 10 categories make up about 72% of the data... use these categories.
category_list = list(percent[:10].index)
category_list



['Park',
 'Playground',
 'Monument',
 'Farm',
 'Garden',
 'Dog Park',
 'Campground',
 'Hiking Trail',
 'Landmarks and Outdoors',
 'Historic and Protected Site']
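As a rough check on the choice of ten categories, the cumulative share can be computed directly. This sketch hard-codes the top-10 counts and the total of 5941 from the outputs above; the name `cumulative_percent` is introduced here for illustration.

```python
import pandas as pd

# Top-10 category counts and the overall total, copied from the outputs above.
top_counts = pd.Series({
    'Park': 768, 'Playground': 572, 'Monument': 461, 'Farm': 409,
    'Garden': 402, 'Dog Park': 380, 'Campground': 363, 'Hiking Trail': 333,
    'Landmarks and Outdoors': 316, 'Historic and Protected Site': 261,
})
total_outdoor_spaces = 5941

# Running share of all rows covered by the k most common categories.
cumulative_percent = top_counts.cumsum() / total_outdoor_spaces * 100
print(cumulative_percent.round(1))
```

The first entry reproduces the 12.9% shown for `percent[:1]`, and the last entry is the share covered by all ten categories.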

In [64]:
# take the 10 most common outdoor space types, and make a column with the number within 1 km for each station
columns = ['station_id', 'outdoor_space_num', 'category_name', 'num_by_cat']

all_data_test = all_data.loc[all_data['category_name'].isin(category_list), columns]
all_data_test

Unnamed: 0,station_id,outdoor_space_num,category_name,num_by_cat
0,36c6491aa1b52e5ef7005f984738de27,49,Campground,1
1,36c6491aa1b52e5ef7005f984738de27,49,Dog Park,1
2,36c6491aa1b52e5ef7005f984738de27,49,Farm,1
3,36c6491aa1b52e5ef7005f984738de27,49,Garden,3
5,36c6491aa1b52e5ef7005f984738de27,49,Monument,2
...,...,...,...,...
5931,e2cf66a0da3c867233306941853190b4,50,Historic and Protected Site,4
5932,e2cf66a0da3c867233306941853190b4,50,Landmarks and Outdoors,2
5933,e2cf66a0da3c867233306941853190b4,50,Monument,6
5935,e2cf66a0da3c867233306941853190b4,50,Park,13


In [65]:
# restructure / transpose / pivot the data so that each category name value becomes its own column
all_data_test = all_data_test.pivot(index = 'station_id', columns='category_name', values='num_by_cat')
all_data_test

category_name,Campground,Dog Park,Farm,Garden,Hiking Trail,Historic and Protected Site,Landmarks and Outdoors,Monument,Park,Playground
station_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
00c210cb99cf9d1b923c1548938aee56,3.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,12.0,5.0
00c84f03ca5970eaa144ed6867d1e2b9,2.0,,,,,,,,4.0,2.0
014e10dba2d92bd20c826b88864dc6b6,,,5.0,1.0,2.0,,,16.0,13.0,1.0
01f9b7e63833ad61e80a7963e2ad9b25,1.0,,,,,,,,3.0,
02044f52405851c50980c20964349a5d,2.0,,4.0,1.0,1.0,,1.0,,16.0,5.0
...,...,...,...,...,...,...,...,...,...,...
ff534fa61076a7bc3694dbc2ea26efaf,,,4.0,2.0,2.0,1.0,1.0,10.0,17.0,5.0
ff57fa074079c8cb3d2bfeffa3adc4bf,,3.0,1.0,1.0,,2.0,1.0,1.0,18.0,1.0
ff85c138d884da9a4540f32b12086338,1.0,6.0,5.0,2.0,1.0,1.0,,5.0,16.0,2.0
ffb74f094efc5f559700c6b2431da637,1.0,3.0,,3.0,3.0,,1.0,2.0,10.0,4.0


In [66]:
all_data_test = all_data_test.reset_index(drop=False)
all_data_test

category_name,station_id,Campground,Dog Park,Farm,Garden,Hiking Trail,Historic and Protected Site,Landmarks and Outdoors,Monument,Park,Playground
0,00c210cb99cf9d1b923c1548938aee56,3.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,12.0,5.0
1,00c84f03ca5970eaa144ed6867d1e2b9,2.0,,,,,,,,4.0,2.0
2,014e10dba2d92bd20c826b88864dc6b6,,,5.0,1.0,2.0,,,16.0,13.0,1.0
3,01f9b7e63833ad61e80a7963e2ad9b25,1.0,,,,,,,,3.0,
4,02044f52405851c50980c20964349a5d,2.0,,4.0,1.0,1.0,,1.0,,16.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...
763,ff534fa61076a7bc3694dbc2ea26efaf,,,4.0,2.0,2.0,1.0,1.0,10.0,17.0,5.0
764,ff57fa074079c8cb3d2bfeffa3adc4bf,,3.0,1.0,1.0,,2.0,1.0,1.0,18.0,1.0
765,ff85c138d884da9a4540f32b12086338,1.0,6.0,5.0,2.0,1.0,1.0,,5.0,16.0,2.0
766,ffb74f094efc5f559700c6b2431da637,1.0,3.0,,3.0,3.0,,1.0,2.0,10.0,4.0


In [67]:
all_data_test = all_data_test.reset_index(drop=True)
all_data_test

category_name,station_id,Campground,Dog Park,Farm,Garden,Hiking Trail,Historic and Protected Site,Landmarks and Outdoors,Monument,Park,Playground
0,00c210cb99cf9d1b923c1548938aee56,3.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,12.0,5.0
1,00c84f03ca5970eaa144ed6867d1e2b9,2.0,,,,,,,,4.0,2.0
2,014e10dba2d92bd20c826b88864dc6b6,,,5.0,1.0,2.0,,,16.0,13.0,1.0
3,01f9b7e63833ad61e80a7963e2ad9b25,1.0,,,,,,,,3.0,
4,02044f52405851c50980c20964349a5d,2.0,,4.0,1.0,1.0,,1.0,,16.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...
763,ff534fa61076a7bc3694dbc2ea26efaf,,,4.0,2.0,2.0,1.0,1.0,10.0,17.0,5.0
764,ff57fa074079c8cb3d2bfeffa3adc4bf,,3.0,1.0,1.0,,2.0,1.0,1.0,18.0,1.0
765,ff85c138d884da9a4540f32b12086338,1.0,6.0,5.0,2.0,1.0,1.0,,5.0,16.0,2.0
766,ffb74f094efc5f559700c6b2431da637,1.0,3.0,,3.0,3.0,,1.0,2.0,10.0,4.0


In [69]:
# just in case, add a total across the top 10 categories (sum skips NaNs)
all_data_2 = all_data_test.copy()
all_data_2['top_10_total'] =  all_data_2[category_list].sum(axis=1)
all_data_2

category_name,station_id,Campground,Dog Park,Farm,Garden,Hiking Trail,Historic and Protected Site,Landmarks and Outdoors,Monument,Park,Playground,top_10_total
0,00c210cb99cf9d1b923c1548938aee56,3.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,12.0,5.0,29.0
1,00c84f03ca5970eaa144ed6867d1e2b9,2.0,,,,,,,,4.0,2.0,8.0
2,014e10dba2d92bd20c826b88864dc6b6,,,5.0,1.0,2.0,,,16.0,13.0,1.0,38.0
3,01f9b7e63833ad61e80a7963e2ad9b25,1.0,,,,,,,,3.0,,4.0
4,02044f52405851c50980c20964349a5d,2.0,,4.0,1.0,1.0,,1.0,,16.0,5.0,30.0
...,...,...,...,...,...,...,...,...,...,...,...,...
763,ff534fa61076a7bc3694dbc2ea26efaf,,,4.0,2.0,2.0,1.0,1.0,10.0,17.0,5.0,42.0
764,ff57fa074079c8cb3d2bfeffa3adc4bf,,3.0,1.0,1.0,,2.0,1.0,1.0,18.0,1.0,28.0
765,ff85c138d884da9a4540f32b12086338,1.0,6.0,5.0,2.0,1.0,1.0,,5.0,16.0,2.0,39.0
766,ffb74f094efc5f559700c6b2431da637,1.0,3.0,,3.0,3.0,,1.0,2.0,10.0,4.0,27.0


In [70]:
# data validation/QA - there are many nulls in each column
all_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   station_id                   768 non-null    object 
 1   Campground                   363 non-null    float64
 2   Dog Park                     380 non-null    float64
 3   Farm                         409 non-null    float64
 4   Garden                       402 non-null    float64
 5   Hiking Trail                 333 non-null    float64
 6   Historic and Protected Site  261 non-null    float64
 7   Landmarks and Outdoors       316 non-null    float64
 8   Monument                     461 non-null    float64
 9   Park                         768 non-null    float64
 10  Playground                   572 non-null    float64
 11  top_10_total                 768 non-null    float64
dtypes: float64(11), object(1)
memory usage: 72.1+ KB


In [71]:
all_data_2.head(3)

category_name,station_id,Campground,Dog Park,Farm,Garden,Hiking Trail,Historic and Protected Site,Landmarks and Outdoors,Monument,Park,Playground,top_10_total
0,00c210cb99cf9d1b923c1548938aee56,3.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,12.0,5.0,29.0
1,00c84f03ca5970eaa144ed6867d1e2b9,2.0,,,,,,,,4.0,2.0,8.0
2,014e10dba2d92bd20c826b88864dc6b6,,,5.0,1.0,2.0,,,16.0,13.0,1.0,38.0


In [72]:
# merge the category counts back onto the bike station data based on station id
all_data_numeric_2 = pd.merge(all_data_numeric, all_data_2, how='inner', left_on = all_data_numeric['station_id'], right_on = all_data_2['station_id'])

In [73]:
all_data_numeric_2 = all_data_numeric_2.drop(columns=['key_0', 'station_id_y']).rename(columns={'station_id_x': 'station_id'})
                        

In [74]:
all_data_numeric_2

Unnamed: 0,station_id,free_bikes,empty_slots,ebikes,slots,outdoor_space_num,num_parks,Campground,Dog Park,Farm,Garden,Hiking Trail,Historic and Protected Site,Landmarks and Outdoors,Monument,Park,Playground,top_10_total
0,36c6491aa1b52e5ef7005f984738de27,4,11,2,15,49,27,1.0,1.0,1.0,3.0,,,,2.0,27.0,6.0,41.0
1,8db822a266b5ccb3a1e323ddc8721d62,3,16,0,19,8,3,,1.0,,,1.0,,,,3.0,3.0,8.0
2,660275cd7d4368cc7590f1606c633bd6,10,15,8,25,9,7,1.0,,,,,,,,7.0,1.0,9.0
3,fddada5adc997290212b3f540c017274,8,6,5,15,6,3,,2.0,,,,,,,3.0,1.0,6.0
4,83d02cd8a043b8305a4031063005d32e,12,3,1,15,8,4,,1.0,1.0,,,,,,4.0,2.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,7c116c44d814279d54202eeec81ddddb,5,12,0,17,48,10,1.0,1.0,8.0,3.0,1.0,,2.0,9.0,10.0,2.0,37.0
764,0bfb7384b255c89b94b68ee5f227a792,11,5,3,17,3,2,,,,,,,,,2.0,1.0,3.0
765,f05dc5b7b3635787d9fe3fd8565bee7a,2,21,1,23,39,11,2.0,2.0,5.0,4.0,2.0,,2.0,5.0,11.0,2.0,35.0
766,3e138c9acff07bff5f9e684c01bc564f,34,0,2,34,50,9,,,2.0,4.0,1.0,5.0,3.0,13.0,9.0,1.0,38.0
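Passing Series objects to `left_on` and `right_on` is what produces the extra `key_0`, `station_id_x` and `station_id_y` columns that then have to be dropped and renamed. A possible alternative, sketched here on small made-up frames (`left` and `right` are hypothetical stand-ins for `all_data_numeric` and `all_data_2`), is to merge on the shared column name:

```python
import pandas as pd

# Tiny made-up frames standing in for all_data_numeric and all_data_2.
left = pd.DataFrame({'station_id': ['a', 'b', 'c'], 'slots': [15, 19, 25]})
right = pd.DataFrame({'station_id': ['a', 'b', 'c'], 'Park': [27.0, 3.0, 7.0]})

# Merging on the column name keeps a single station_id column.
# validate raises an error if station_id is duplicated on either side.
merged = pd.merge(left, right, how='inner', on='station_id',
                  validate='one_to_one')
print(merged)
```

The `validate` argument is optional; it is included as a cheap guard against accidentally duplicating stations in the join.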


#### 2. Address missing values

In this case, a NaN represents an absence rather than a missing value. For example, a NaN in the Campground column means that there are no campgrounds within 1000 m of that station, so it is safe to replace the NaNs with 0 in the data.

In [38]:
model_data = all_data_numeric_2.fillna(0).drop(columns=['station_id', 'empty_slots', 'ebikes', 'free_bikes', 'top_10_total', 'outdoor_space_num', 'num_parks'])
model_data

Unnamed: 0,slots,Campground,Dog Park,Farm,Garden,Hiking Trail,Historic and Protected Site,Landmarks and Outdoors,Monument,Park,Playground
0,15,1.0,1.0,1.0,3.0,0.0,0.0,0.0,2.0,27.0,6.0
1,19,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,3.0,3.0
2,25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,1.0
3,15,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0
4,15,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...
763,17,1.0,1.0,8.0,3.0,1.0,0.0,2.0,9.0,10.0,2.0
764,17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
765,23,2.0,2.0,5.0,4.0,2.0,0.0,2.0,5.0,11.0,2.0
766,34,0.0,0.0,2.0,4.0,1.0,5.0,3.0,13.0,9.0,1.0
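Because the NaNs only arise from the reshaping step, another option is to fill them while pivoting. This is a sketch on a tiny made-up long-format frame (`long_df` is hypothetical), not the code used in this notebook:

```python
import pandas as pd

# Made-up long-format data: station 'b' has no Farm row at all.
long_df = pd.DataFrame({
    'station_id': ['a', 'a', 'b'],
    'category_name': ['Park', 'Farm', 'Park'],
    'num_by_cat': [12, 3, 4],
})

# pivot_table can fill absent station/category pairs with 0 directly.
wide = long_df.pivot_table(index='station_id', columns='category_name',
                           values='num_by_cat', aggfunc='sum', fill_value=0)
wide = wide.reset_index()
wide.columns.name = None  # drop the leftover 'category_name' axis label
print(wide)
```

This keeps the "absent means zero" decision next to the reshaping step, at the cost of hiding how many pairs were originally absent.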


## Provide model output and an interpretation of the results. 

### Model Output

In [39]:
#run full model
y = model_data['slots']
X = model_data.drop('slots', axis=1)
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.201
Model:                            OLS   Adj. R-squared:                  0.190
Method:                 Least Squares   F-statistic:                     19.00
Date:                Thu, 31 Aug 2023   Prob (F-statistic):           2.32e-31
Time:                        15:17:44   Log-Likelihood:                -2509.5
No. Observations:                 768   AIC:                             5041.
Df Residuals:                     757   BIC:                             5092.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

In [75]:
# refit after removing the independent variable with the highest p-value
y = model_data['slots']
X = model_data.drop(['slots', 'Campground'], axis=1)
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.201
Model:                            OLS   Adj. R-squared:                  0.191
Method:                 Least Squares   F-statistic:                     21.13
Date:                Thu, 31 Aug 2023   Prob (F-statistic):           4.90e-32
Time:                        16:31:55   Log-Likelihood:                -2509.5
No. Observations:                 768   AIC:                             5039.
Df Residuals:                     758   BIC:                             5085.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

In [76]:
# refit after removing the variable with the highest p-value in the latest iteration
y = model_data['slots']
X = model_data.drop(['slots', 'Campground', 'Hiking Trail'], axis=1)
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.200
Model:                            OLS   Adj. R-squared:                  0.192
Method:                 Least Squares   F-statistic:                     23.79
Date:                Thu, 31 Aug 2023   Prob (F-statistic):           1.03e-32
Time:                        16:32:13   Log-Likelihood:                -2509.6
No. Observations:                 768   AIC:                             5037.
Df Residuals:                     759   BIC:                             5079.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

In [78]:
# refit after removing the variable with the highest p-value in the latest iteration
y = model_data['slots']
X = model_data.drop(['slots', 'Campground', 'Hiking Trail','Landmarks and Outdoors' ], axis=1)
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.200
Model:                            OLS   Adj. R-squared:                  0.193
Method:                 Least Squares   F-statistic:                     27.19
Date:                Thu, 31 Aug 2023   Prob (F-statistic):           2.09e-33
Time:                        16:32:23   Log-Likelihood:                -2509.7
No. Observations:                 768   AIC:                             5035.
Df Residuals:                     760   BIC:                             5072.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

In [79]:
# refit after removing the variable with the highest p-value in the latest iteration
y = model_data['slots']
X = model_data.drop(['slots', 'Campground', 'Hiking Trail','Landmarks and Outdoors', 'Playground'], axis=1)
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.200
Model:                            OLS   Adj. R-squared:                  0.194
Method:                 Least Squares   F-statistic:                     31.70
Date:                Thu, 31 Aug 2023   Prob (F-statistic):           4.08e-34
Time:                        16:32:25   Log-Likelihood:                -2509.8
No. Observations:                 768   AIC:                             5034.
Df Residuals:                     761   BIC:                             5066.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

#### Model Output A:

In [44]:
# refit after removing the variable with the highest p-value in the latest iteration
y = model_data['slots']
X = model_data.drop(['slots', 'Campground', 'Hiking Trail','Landmarks and Outdoors', 'Playground','Dog Park'], axis=1)
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.199
Model:                            OLS   Adj. R-squared:                  0.194
Method:                 Least Squares   F-statistic:                     37.87
Date:                Thu, 31 Aug 2023   Prob (F-statistic):           9.66e-35
Time:                        15:26:59   Log-Likelihood:                -2510.3
No. Observations:                 768   AIC:                             5033.
Df Residuals:                     762   BIC:                             5060.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

#### Model Output B (final model):

In [80]:
# refit after removing the variable with the highest p-value in the latest iteration
y = model_data['slots']
X = model_data.drop(['slots', 'Campground', 'Hiking Trail','Landmarks and Outdoors', 'Playground','Dog Park', 'Garden'], axis=1)
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.196
Model:                            OLS   Adj. R-squared:                  0.192
Method:                 Least Squares   F-statistic:                     46.42
Date:                Thu, 31 Aug 2023   Prob (F-statistic):           6.18e-35
Time:                        16:32:45   Log-Likelihood:                -2511.8
No. Observations:                 768   AIC:                             5034.
Df Residuals:                     763   BIC:                             5057.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

### Model Interpretation:

I performed backward selection to remove the least significant variables one at a time.

**Note:**

* In the first iteration, several independent variables were not statistically significant, with high p-values:
    * Campground - 0.907
    * Hiking Trail - 0.701
    * Landmarks and Outdoors - 0.622
    * Playground - 0.614
* When I compared Model Output A and B, the adjusted R-squared value decreased, but all of Model B's variables were statistically significant. I concluded that Model Output B is still the preferred / 'best' model, since the decrease in adjusted R-squared is minimal (0.002) for a model where all variables are statistically significant.
    * Model A Adjusted R-Squared:  0.194
    * Model B Adjusted R-Squared:  0.192
* The adjusted R-squared value of 0.192 in the final model means that about 19.2% of the variance in the dependent variable can be explained by the independent variables.
* In Model Output A, Park is technically not statistically significant, with a p-value of 0.062 (very close to the 0.05 threshold). When Garden is removed to produce Model B, the p-value for Park drops and Park becomes statistically significant.
* Historic and Protected Site is the outdoor space category with the largest coefficient in the model, followed by Monument, so one additional space of these types is associated with the largest change in predicted slots. Both categories have a reported p-value of 0.000 (that is, below 0.0005), indicating high significance.
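As a sanity check on the comparison above, adjusted R-squared can be recomputed from the R-squared, observation count and model degrees of freedom printed in each summary, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The helper name `adjusted_r2` is introduced here for illustration, and the inputs are the rounded values from the summaries.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R-squared, No. Observations and Df Model as printed in the summaries above.
full_model = adjusted_r2(0.201, 768, 10)
model_a = adjusted_r2(0.199, 768, 5)
model_b = adjusted_r2(0.196, 768, 4)

print(round(full_model, 3), round(model_a, 3), round(model_b, 3))
```

The results agree with the reported values to three decimals. They also show why dropping weak predictors raised adjusted R-squared between the full model and Model A even though R-squared itself fell slightly.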

# Stretch

**How can you turn the regression model into a classification model?**


**Answer:**
In the city bikes extract, there is a feature indicating whether or not a particular station has e-bikes (presented as true or false).
I could use the e-bike status (T/F) as the dependent variable and see whether the number of outdoor spaces or the number of parks can predict whether a station will have e-bikes.
