# Part 4: Building a model



Build a regression model using Python’s statsmodels module that demonstrates a relationship between the number of bikes in a particular location and the characteristics of the POIs in that location.
Interpret results. Expand on the model output, and derive insights from your model.
Stretch: can you think of a way to turn the above regression problem into a classification one? Without coding, can you sketch out how you would cast the problem specifically, and lay out your approaches?
Complete the model_building.ipynb notebook to demonstrate how you executed the tasks above.

Build a regression model.

In [1]:
# imports
import json
import os
import matplotlib
import pandas as pd
import requests
from IPython.display import JSON

In [2]:
import statsmodels.api as sm

## Foursquare Data

In [3]:
# bring in data from part 3 
mtl_results_fsq = pd.read_csv('../data/mtl_results_fsq.csv')
mtl_results_fsq

Unnamed: 0,latitude,longitude,restaurant count,num stalls
0,45.6175,-73.606011,6,11
1,45.516926,-73.564257,50,14
2,45.541549,-73.565012,41,11
3,45.506176,-73.711186,9,19
4,45.512994,-73.682498,41,19
5,45.514734,-73.691449,42,15
6,45.522341,-73.721679,20,15
7,45.566869,-73.641017,10,15
8,45.548136,-73.62434,50,13
9,45.447916,-73.583819,6,19


In [4]:
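# predictor: number of restaurants near each station
X = mtl_results_fsq['restaurant count']
# response: number of bike stalls at each station (already a Series, so no pd.Series wrapper is needed)
y = mtl_results_fsq['num stalls']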

In [5]:
# check y and x values
X

0      6
1     50
2     41
3      9
4     41
5     42
6     20
7     10
8     50
9      6
10    15
11     4
12    23
13    34
14    15
15    27
16    11
17    48
18     7
19    49
20    50
21    50
22    16
23    13
24    50
25     6
26     3
27    19
28    50
29    13
30     0
31    28
32    35
33    30
34    48
35    21
36    10
37     4
38     0
39    19
40     2
41    50
42    18
43    13
44    13
45    43
46    18
47    50
48    50
49    13
Name: restaurant count, dtype: int64

In [6]:
y

0     11
1     14
2     11
3     19
4     19
5     15
6     15
7     15
8     13
9     19
10    23
11    23
12    29
13    15
14    18
15    10
16    29
17    22
18    23
19    27
20    32
21    29
22    15
23    18
24    15
25    19
26    21
27    24
28    17
29    13
30    15
31    22
32    13
33    21
34    19
35    33
36    21
37    22
38    19
39    19
40    14
41    20
42    22
43    25
44    23
45    15
46    12
47    24
48    22
49    21
Name: num stalls, dtype: int64

Provide model output and an interpretation of the results. 

In [7]:
# fit a regression to predict the number of stalls from the number of nearby restaurants
x = sm.add_constant(X)  # add an intercept column
lin_reg = sm.OLS(y, x)
print('x type:', type(x))

x type: <class 'pandas.core.frame.DataFrame'>


### Model Analysis - Foursquare:

* A very high p-value of 0.818 indicates that there is no statistically significant relationship between the number of restaurants and the number of stalls available, so restaurant count does not effectively predict stall count.
* A negative adjusted R-squared of -0.020 indicates that this model does not explain any of the variation in the data.
* Despite fixing the original bug that left only 8 data points in the previous analysis, it is clear that even with the additional data points (50 observations) there is no clear relationship between the restaurant count and the number of city bike stalls (see the scatter plots created in Joining Data).
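
A negative adjusted R-squared can look odd, so here is a small arithmetic check of how it follows from the reported R-squared. This is a sketch using the standard adjustment formula, with the rounded R-squared values from the statsmodels summaries (0.001 for Foursquare, 0.013 for Yelp), 50 observations, and 1 predictor. The function name `adjusted_r_squared` is made up for this illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R-squared values as reported (rounded) in the OLS summaries.
fsq_adj = adjusted_r_squared(0.001, n_obs=50, n_predictors=1)
yelp_adj = adjusted_r_squared(0.013, n_obs=50, n_predictors=1)

print(round(fsq_adj, 3), round(yelp_adj, 3))
```

Because R-squared is so close to zero, the penalty for spending one degree of freedom on the predictor pushes the adjusted value just below zero, which agrees with the -0.020 and -0.008 in the summaries.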


In [8]:
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             num stalls   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.020
Method:                 Least Squares   F-statistic:                   0.05346
Date:                Fri, 21 Apr 2023   Prob (F-statistic):              0.818
Time:                        16:08:50   Log-Likelihood:                -155.59
No. Observations:                  50   AIC:                             315.2
Df Residuals:                      48   BIC:                             319.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               19.2420      1.364  
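
The printed summary is cut off after the intercept row, so the slope is not visible. As a rough cross-check (not part of the original notebook run), the closed-form least-squares formulas for a single predictor can be applied to the 50 values printed in cells 5 and 6. The names `restaurant_count`, `num_stalls`, and `simple_ols` are made up for this sketch.

```python
# Cross-check of the Foursquare fit using only the standard library.
# Data copied from the X and y printouts above (50 stations).
restaurant_count = [6, 50, 41, 9, 41, 42, 20, 10, 50, 6,
                    15, 4, 23, 34, 15, 27, 11, 48, 7, 49,
                    50, 50, 16, 13, 50, 6, 3, 19, 50, 13,
                    0, 28, 35, 30, 48, 21, 10, 4, 0, 19,
                    2, 50, 18, 13, 13, 43, 18, 50, 50, 13]
num_stalls = [11, 14, 11, 19, 19, 15, 15, 15, 13, 19,
              23, 23, 29, 15, 18, 10, 29, 22, 23, 27,
              32, 29, 15, 18, 15, 19, 21, 24, 17, 13,
              15, 22, 13, 21, 19, 33, 21, 22, 19, 19,
              14, 20, 22, 25, 23, 15, 12, 24, 22, 21]


def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sums of squares and cross-products.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return intercept, slope, r_squared


intercept, slope, r_squared = simple_ols(restaurant_count, num_stalls)
print(f"intercept={intercept:.4f} slope={slope:.4f} R^2={r_squared:.3f}")
```

The intercept and R-squared agree with the `const` and `R-squared` values in the summary, and the slope works out to roughly 0.01 stalls per additional restaurant, which is practically flat.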

## Yelp Data

In [9]:
mtl_results_yelp = pd.read_csv('../data/mtl_results_yelp.csv')
mtl_results_yelp

Unnamed: 0,latitude,longitude,restaurant count,num stalls
0,45.6175,-73.606011,18,11
1,45.516926,-73.564257,602,14
2,45.541549,-73.565012,100,11
3,45.506176,-73.711186,29,19
4,45.512994,-73.682498,89,19
5,45.514734,-73.691449,89,15
6,45.522341,-73.721679,27,15
7,45.566869,-73.641017,31,15
8,45.548136,-73.62434,150,13
9,45.447916,-73.583819,31,19


In [10]:
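# predictor: number of restaurants near each station
X = mtl_results_yelp['restaurant count']
# response: number of bike stalls at each station (already a Series, so no pd.Series wrapper is needed)
y = mtl_results_yelp['num stalls']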

In [11]:
# check y and x values
X

0      18
1     602
2     100
3      29
4      89
5      89
6      27
7      31
8     150
9      31
10     23
11      8
12     32
13     58
14     29
15     50
16     18
17     82
18     12
19     84
20    711
21    360
22     28
23     27
24    705
25     14
26      6
27     17
28    108
29     20
30      4
31     33
32     70
33     41
34     61
35     40
36     35
37     16
38      0
39     29
40      4
41    452
42     52
43     22
44     39
45    141
46     23
47    200
48    386
49     36
Name: restaurant count, dtype: int64

In [12]:
y

0     11
1     14
2     11
3     19
4     19
5     15
6     15
7     15
8     13
9     19
10    23
11    23
12    29
13    15
14    18
15    10
16    29
17    22
18    23
19    27
20    32
21    29
22    15
23    18
24    15
25    19
26    21
27    24
28    17
29    13
30    15
31    22
32    13
33    21
34    19
35    33
36    21
37    22
38    19
39    19
40    14
41    20
42    22
43    25
44    23
45    15
46    12
47    24
48    22
49    21
Name: num stalls, dtype: int64

In [13]:
# fit a regression to predict the number of stalls from the number of nearby restaurants
x = sm.add_constant(X)  # add an intercept column
lin_reg = sm.OLS(y, x)
print('x type:', type(x))

x type: <class 'pandas.core.frame.DataFrame'>


In [14]:
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             num stalls   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                 -0.008
Method:                 Least Squares   F-statistic:                    0.6206
Date:                Fri, 21 Apr 2023   Prob (F-statistic):              0.435
Time:                        16:10:57   Log-Likelihood:                -155.29
No. Observations:                  50   AIC:                             314.6
Df Residuals:                      48   BIC:                             318.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               19.1242      0.914  

### Model Analysis - Yelp:

* A p-value of 0.435 indicates that there is no statistically significant relationship between the number of restaurants and the number of stalls available, so restaurant count does not effectively predict stall count.
* A negative adjusted R-squared of -0.008 indicates that this model does not explain any of the variation in the data.
* Despite fixing the original bug that left only 8 data points in the previous analysis, it is clear that even with the additional data points (50 observations) there is no clear relationship between the restaurant count and the number of city bike stalls (see the scatter plots created in Joining Data).

## Stretch

How can you turn the regression model into a classification model?
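
One possible way to cast this as a classification problem, offered as a sketch rather than a worked solution: replace the numeric target with a label such as "large station" (more stalls than the median) versus "small station", fit a classifier (for example a logistic regression) on the same POI features, and compare its accuracy against always predicting the majority class. The median cutoff and the variable names below are assumptions made for illustration; the stall counts are the 50 values printed earlier in this notebook.

```python
from statistics import median

# Stall counts copied from the y printout above (50 stations).
num_stalls = [11, 14, 11, 19, 19, 15, 15, 15, 13, 19,
              23, 23, 29, 15, 18, 10, 29, 22, 23, 27,
              32, 29, 15, 18, 15, 19, 21, 24, 17, 13,
              15, 22, 13, 21, 19, 33, 21, 22, 19, 19,
              14, 20, 22, 25, 23, 15, 12, 24, 22, 21]

# Assumed cutoff: a station is "large" (1) if it has more stalls than the median.
threshold = median(num_stalls)
labels = [1 if stalls > threshold else 0 for stalls in num_stalls]

# Majority-class baseline: the accuracy of always predicting the more common label.
n_large = sum(labels)
n_small = len(labels) - n_large
baseline_accuracy = max(n_large, n_small) / len(labels)

print(f"threshold={threshold} large={n_large} small={n_small} "
      f"baseline accuracy={baseline_accuracy:.2f}")
```

With this data the cutoff is 19 stalls and 23 of the 50 stations are labelled large, so any classifier would need to beat a baseline accuracy of 0.54. Given the weak regression results above, restaurant count alone may not be enough to do that, so additional POI features would probably be needed.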
