## Test a Logistic Regression Model

***

## Project Description

**Data preparation for this assignment:**

1) If your response variable is categorical with more than two categories, you will need to collapse it down to two categories, or subset your data to select observations from 2 categories.

2) If your response variable is quantitative, you will need to bin it into two categories.
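
Both preparation steps can be sketched on a toy frame. The values, the `demo_binary` column name, and the cutoff of 6 on `polityscore` are illustrative assumptions; they are not taken from the analysis below, which subsets `demoscorecat` to categories 0 and 3.

```python
import pandas as pd

# Toy data: values are made up for illustration.
toy = pd.DataFrame({
    "demoscorecat": [0, 1, 2, 3, 3, 0],
    "polityscore": [-7, -2, 4, 9, 10, -9],
})

# Option 1: subset to two categories, then recode them as 0/1.
two_cat = toy[toy["demoscorecat"].isin([0, 3])].copy()
two_cat["demo_binary"] = two_cat["demoscorecat"].map({0: 0, 3: 1})

# Option 2: bin a quantitative variable into two categories.
# The cutoff of 6 is an assumed threshold, chosen only for this example.
toy["demo_binary"] = (toy["polityscore"] >= 6).astype(int)

print(two_cat)
print(toy)
```

Taking a `.copy()` of the subset before adding a column avoids pandas' chained-assignment warning.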

**The assignment:**

1) Summarize what you found, making sure you discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (odds ratios, p-values, and 95% confidence intervals for the odds ratios) in your summary.

2) Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable.

3) Discuss whether or not there was evidence of confounding for the association between your primary explanatory variable and the response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables).
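
The hint about adding explanatory variables one at a time can be illustrated with a small simulation. This is only a sketch on synthetic data: the names `exposure`, `confounder`, and `outcome` are hypothetical, and the data are generated so that `confounder` drives both of the other variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: `confounder` drives both `exposure` and `outcome`,
# while `exposure` has no direct effect on `outcome`.
rng = np.random.default_rng(0)
n = 2000
confounder = rng.normal(size=n)
exposure = confounder + rng.normal(size=n)
prob = 1 / (1 + np.exp(-1.5 * confounder))
outcome = rng.binomial(1, prob)
sim = pd.DataFrame({"outcome": outcome, "exposure": exposure, "confounder": confounder})

# Fit the crude model first, then add the candidate confounder.
crude = smf.logit("outcome ~ exposure", data=sim).fit(disp=0)
adjusted = smf.logit("outcome ~ exposure + confounder", data=sim).fit(disp=0)

# A large shift in the exposure coefficient after adjustment suggests confounding.
crude_coef = crude.params["exposure"]
adjusted_coef = adjusted.params["exposure"]
print(f"crude: {crude_coef:.3f}, adjusted: {adjusted_coef:.3f}")
```

In a real analysis the same comparison would be repeated for each additional explanatory variable, watching how the coefficient and p-value of the primary explanatory variable change.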

## Data Dictionary

| Field                | Description                                                  |
|----------------------|--------------------------------------------------------------|
| incomeperperson      | 2010 Gross Domestic Product per capita in constant 2000 US$. |
| alcconsumption       | 2008 alcohol consumption per adult (age 15+), litres         |
| armedforcesrate      | Armed forces personnel (% of total labor force)              |
| breastcancerper100TH | 2002 breast cancer new cases per 100,000 female              |
| co2emissions         | 2006 cumulative CO2 emission (metric tons)                   |
| femaleemployrate     | 2007 female employees age 15+ (% of population)              |
| employrate           | 2007 total employees age 15+ (% of population)               |
| HIVrate              | 2009 estimated HIV Prevalence %                              |
| Internetuserate      | 2010 Internet users (per 100 people)                         |
| lifeexpectancy       | 2011 life expectancy at birth (years)                        |
| oilperperson         | 2010 oil Consumption per capita (tonnes per year and person) |
| polityscore          | 2009 Democracy score (Polity)                                |
| relectricperperson   | 2008 residential electricity consumption, per person (kWh)   |
| suicideper100TH      | 2005 Suicide, age adjusted, per 100 000                      |
| urbanrate            | 2008 urban population (% of total)                           |

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as sfa
from statsmodels.formula.api import ols, logit
# Import the class directly; a preceding bare "import datetime" would be shadowed by it.
from datetime import datetime, timedelta
import scipy.stats
import pandas_profiling
from pandas_profiling import ProfileReport


%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
# Set style and font scale together; a separate sns.set() call would reset the style.
sns.set(style='dark', font_scale=1.2)

plt.rc('axes', titlesize=9)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

import warnings
warnings.filterwarnings('ignore')

# Use Folium library to plot values on a map.
#import folium

# Use Feature-Engine library
#import feature_engine
#import feature_engine.missing_data_imputers as mdi
#from feature_engine.outlier_removers import Winsorizer
#from feature_engine import categorical_encoders as ce
#from feature_engine.discretisation import EqualWidthDiscretiser, EqualFrequencyDiscretiser, DecisionTreeDiscretiser
#from feature_engine.encoding import OrdinalEncoder

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
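# Note: option_context only takes effect inside a "with" block, so this line is a no-op as written.
pd.option_context('float_format','{:.2f}'.format)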

np.random.seed(0)
np.set_printoptions(suppress=True)

Autosaving every 60 seconds


In [2]:
df = pd.read_csv("gapminderfinal5.csv")

In [3]:
df

Unnamed: 0,incomeperperson,alcconsumption,armedforcesrate,breastcancerper100th,co2emissions,femaleemployrate,hivrate,internetuserate,lifeexpectancy,oilperperson,polityscore,relectricperperson,suicideper100th,employrate,urbanrate,demoscorecat,co2cat,incomecat,alccat,electricat
0,8740.97,0.03,0.57,27,76.0,25.60,1.94,4,49,1.48,0,1173.18,7,55.70,24.04,1,1,3,0,0.155844
1,1915.00,7.29,1.02,57,224.0,42.10,1.94,45,77,1.48,9,636.34,8,51.40,46.72,3,2,1,3,0.101449
2,2231.99,0.69,2.31,24,2932.0,31.70,0.10,12,73,0.42,2,590.51,5,50.50,65.22,2,3,2,0,0.101449
3,21943.34,10.17,1.44,37,5033.0,47.55,1.94,81,70,1.48,4,1173.18,5,58.64,88.92,2,4,4,4,0.155844
4,1381.00,5.57,1.46,23,248.0,69.40,2.00,10,51,1.48,-2,173.00,15,75.70,56.70,1,2,1,2,0.101449
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
208,722.81,3.91,1.09,16,1425.0,67.60,0.40,28,75,1.48,-7,302.73,12,71.00,27.84,0,3,1,1,0.101449
209,8740.97,6.69,5.94,37,14.0,11.30,1.94,36,73,1.48,4,1173.18,10,32.00,71.90,2,0,3,2,0.155844
210,610.36,0.20,2.32,35,235.0,20.30,1.94,12,65,1.48,-2,130.06,6,39.00,30.64,1,2,0,0,0.101449
211,432.23,3.56,0.34,13,132.0,53.50,13.50,10,49,1.48,7,168.62,12,61.00,35.42,3,2,0,1,0.101449


## Exploratory Data Analysis

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213 entries, 0 to 212
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   incomeperperson       213 non-null    float64
 1   alcconsumption        213 non-null    float64
 2   armedforcesrate       213 non-null    float64
 3   breastcancerper100th  213 non-null    int64  
 4   co2emissions          213 non-null    float64
 5   femaleemployrate      213 non-null    float64
 6   hivrate               213 non-null    float64
 7   internetuserate       213 non-null    int64  
 8   lifeexpectancy        213 non-null    int64  
 9   oilperperson          213 non-null    float64
 10  polityscore           213 non-null    int64  
 11  relectricperperson    213 non-null    float64
 12  suicideper100th       213 non-null    int64  
 13  employrate            213 non-null    float64
 14  urbanrate             213 non-null    float64
 15  demoscorecat          2

In [5]:
df.describe()

Unnamed: 0,incomeperperson,alcconsumption,armedforcesrate,breastcancerper100th,co2emissions,femaleemployrate,hivrate,internetuserate,lifeexpectancy,oilperperson,polityscore,relectricperperson,suicideper100th,employrate,urbanrate,demoscorecat,co2cat,incomecat,alccat,electricat
count,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0
mean,8740.966338,6.689484,1.443052,37.323944,5033.244131,47.549531,1.936854,35.685446,69.751174,1.481362,3.765258,1173.17939,9.685446,58.63662,56.76939,2.061033,2.0,2.0,1.99061,0.126761
std,13466.912542,4.589345,1.498692,20.443277,24936.503422,13.364005,3.632102,26.418255,9.241981,0.987116,5.487663,1341.777091,5.955782,9.61196,23.275759,1.009872,1.420869,1.420869,1.417514,0.076255
min,103.78,0.03,0.0,4.0,0.0,11.3,0.06,0.0,48.0,0.03,-10.0,0.0,0.0,32.0,10.4,0.0,0.0,0.0,0.0,0.0
25%,952.83,3.23,0.57,23.0,38.0,40.3,0.2,12.0,65.0,1.48,1.0,431.63,6.0,53.5,37.34,2.0,1.0,1.0,1.0,0.101449
50%,3665.35,6.69,1.21,35.0,235.0,47.55,1.2,36.0,72.0,1.48,4.0,1173.18,10.0,58.64,56.77,2.0,2.0,2.0,2.0,0.101449
75%,8740.97,9.5,1.44,44.0,2422.0,53.6,1.94,52.0,76.0,1.48,8.0,1173.18,12.0,63.7,73.5,3.0,3.0,3.0,3.0,0.155844
max,105147.44,23.01,10.64,101.0,334221.0,83.3,25.9,96.0,83.0,12.23,10.0,11154.76,36.0,83.2,100.0,3.0,4.0,4.0,4.0,0.291667


In [6]:
df.columns

Index(['incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate', 'demoscorecat', 'co2cat', 'incomecat', 'alccat', 'electricat'], dtype='object')

In [7]:
df[['demoscorecat', 'co2cat', 'incomecat', 'alccat', 'electricat']] = df[['demoscorecat', 'co2cat', 'incomecat', 'alccat', 'electricat']].astype('category')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213 entries, 0 to 212
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   incomeperperson       213 non-null    float64 
 1   alcconsumption        213 non-null    float64 
 2   armedforcesrate       213 non-null    float64 
 3   breastcancerper100th  213 non-null    int64   
 4   co2emissions          213 non-null    float64 
 5   femaleemployrate      213 non-null    float64 
 6   hivrate               213 non-null    float64 
 7   internetuserate       213 non-null    int64   
 8   lifeexpectancy        213 non-null    int64   
 9   oilperperson          213 non-null    float64 
 10  polityscore           213 non-null    int64   
 11  relectricperperson    213 non-null    float64 
 12  suicideper100th       213 non-null    int64   
 13  employrate            213 non-null    float64 
 14  urbanrate             213 non-null    float64 
 15  demosc

In [9]:
df.describe(include='all')

Unnamed: 0,incomeperperson,alcconsumption,armedforcesrate,breastcancerper100th,co2emissions,femaleemployrate,hivrate,internetuserate,lifeexpectancy,oilperperson,polityscore,relectricperperson,suicideper100th,employrate,urbanrate,demoscorecat,co2cat,incomecat,alccat,electricat
count,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0
unique,,,,,,,,,,,,,,,,4.0,5.0,5.0,5.0,5.0
top,,,,,,,,,,,,,,,,3.0,4.0,4.0,2.0,0.155844
freq,,,,,,,,,,,,,,,,90.0,43.0,43.0,45.0,77.0
mean,8740.966338,6.689484,1.443052,37.323944,5033.244131,47.549531,1.936854,35.685446,69.751174,1.481362,3.765258,1173.17939,9.685446,58.63662,56.76939,,,,,
std,13466.912542,4.589345,1.498692,20.443277,24936.503422,13.364005,3.632102,26.418255,9.241981,0.987116,5.487663,1341.777091,5.955782,9.61196,23.275759,,,,,
min,103.78,0.03,0.0,4.0,0.0,11.3,0.06,0.0,48.0,0.03,-10.0,0.0,0.0,32.0,10.4,,,,,
25%,952.83,3.23,0.57,23.0,38.0,40.3,0.2,12.0,65.0,1.48,1.0,431.63,6.0,53.5,37.34,,,,,
50%,3665.35,6.69,1.21,35.0,235.0,47.55,1.2,36.0,72.0,1.48,4.0,1173.18,10.0,58.64,56.77,,,,,
75%,8740.97,9.5,1.44,44.0,2422.0,53.6,1.94,52.0,76.0,1.48,8.0,1173.18,12.0,63.7,73.5,,,,,


In [10]:
df.demoscorecat.value_counts()

3    90
2    71
1    27
0    25
Name: demoscorecat, dtype: int64

In [11]:
df2 = df[df["demoscorecat"] != 1]

In [12]:
df2

Unnamed: 0,incomeperperson,alcconsumption,armedforcesrate,breastcancerper100th,co2emissions,femaleemployrate,hivrate,internetuserate,lifeexpectancy,oilperperson,polityscore,relectricperperson,suicideper100th,employrate,urbanrate,demoscorecat,co2cat,incomecat,alccat,electricat
1,1915.00,7.29,1.02,57,224.0,42.10,1.94,45,77,1.48,9,636.34,8,51.40,46.72,3,2,1,3,0.101449
2,2231.99,0.69,2.31,24,2932.0,31.70,0.10,12,73,0.42,2,590.51,5,50.50,65.22,2,3,2,0,0.101449
3,21943.34,10.17,1.44,37,5033.0,47.55,1.94,81,70,1.48,4,1173.18,5,58.64,88.92,2,4,4,4,0.155844
5,11894.46,8.17,1.44,37,16.0,47.55,1.94,81,70,1.48,4,1173.18,2,58.64,30.46,2,0,4,3,0.155844
6,10749.42,9.35,0.56,74,5872.0,45.90,0.50,36,76,0.64,8,768.43,8,58.40,92.00,3,4,3,3,0.101449
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
206,1543.96,0.96,1.44,24,3.0,47.55,1.94,8,71,1.48,4,1173.18,5,58.64,24.76,2,0,1,0,0.155844
208,722.81,3.91,1.09,16,1425.0,67.60,0.40,28,75,1.48,-7,302.73,12,71.00,27.84,0,3,1,1,0.101449
209,8740.97,6.69,5.94,37,14.0,11.30,1.94,36,73,1.48,4,1173.18,10,32.00,71.90,2,0,3,2,0.155844
211,432.23,3.56,0.34,13,132.0,53.50,13.50,10,49,1.48,7,168.62,12,61.00,35.42,3,2,0,1,0.101449


In [13]:
df3 = df2[df2["demoscorecat"] != 2]

In [14]:
df3

Unnamed: 0,incomeperperson,alcconsumption,armedforcesrate,breastcancerper100th,co2emissions,femaleemployrate,hivrate,internetuserate,lifeexpectancy,oilperperson,polityscore,relectricperperson,suicideper100th,employrate,urbanrate,demoscorecat,co2cat,incomecat,alccat,electricat
1,1915.00,7.29,1.02,57,224.0,42.1,1.94,45,77,1.48,9,636.34,8,51.4,46.72,3,2,1,3,0.101449
6,10749.42,9.35,0.56,74,5872.0,45.9,0.50,36,76,0.64,8,768.43,8,58.4,92.00,3,4,3,3,0.101449
9,25249.99,10.21,0.49,83,12970.0,54.6,0.10,76,82,1.91,10,2825.39,8,61.5,88.74,3,4,4,4,0.033333
10,26692.98,12.40,0.82,70,4466.0,49.7,0.30,73,81,1.55,10,2068.12,13,57.1,67.16,3,4,4,4,0.033333
11,2344.90,13.34,1.98,32,511.0,56.2,0.10,47,71,0.36,-7,921.56,1,60.9,51.92,0,2,2,4,0.101449
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203,37491.18,9.70,0.97,101,334221.0,56.0,0.60,74,79,2.74,10,4542.85,10,62.3,81.70,3,4,4,3,0.000000
204,9106.33,8.99,1.58,83,276.0,46.0,0.50,48,77,1.48,10,823.82,15,57.5,92.30,3,2,3,3,0.101449
205,952.83,3.61,0.71,17,1718.0,52.6,0.10,19,68,0.18,-9,261.43,5,57.5,36.82,0,3,1,1,0.101449
208,722.81,3.91,1.09,16,1425.0,67.6,0.40,28,75,1.48,-7,302.73,12,71.0,27.84,0,3,1,1,0.101449


In [15]:
df3.demoscorecat.value_counts()

3    90
0    25
2     0
1     0
Name: demoscorecat, dtype: int64
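
The zero counts for categories 1 and 2 appear because `demoscorecat` is a categorical column, and pandas keeps unused categories after filtering. A minimal sketch on toy data (the values are illustrative) shows one way to filter in a single step with `isin` and then drop the empty levels with `cat.remove_unused_categories()`:

```python
import pandas as pd

# Toy stand-in for the demoscorecat column (values are illustrative only).
demo = pd.DataFrame({"demoscorecat": pd.Categorical([0, 1, 2, 3, 3, 0, 2, 3])})

# Keep only the two categories of interest in a single step.
kept = demo[demo["demoscorecat"].isin([0, 3])].copy()

# Drop the now-empty categories so they no longer appear with a count of 0.
kept["demoscorecat"] = kept["demoscorecat"].cat.remove_unused_categories()

counts = kept["demoscorecat"].value_counts()
print(counts)
```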

In [16]:
# Reassign instead of modifying a filtered slice in place.
df3 = df3.reset_index(drop=True)
df3

Unnamed: 0,incomeperperson,alcconsumption,armedforcesrate,breastcancerper100th,co2emissions,femaleemployrate,hivrate,internetuserate,lifeexpectancy,oilperperson,polityscore,relectricperperson,suicideper100th,employrate,urbanrate,demoscorecat,co2cat,incomecat,alccat,electricat
0,1915.00,7.29,1.02,57,224.0,42.1,1.94,45,77,1.48,9,636.34,8,51.4,46.72,3,2,1,3,0.101449
1,10749.42,9.35,0.56,74,5872.0,45.9,0.50,36,76,0.64,8,768.43,8,58.4,92.00,3,4,3,3,0.101449
2,25249.99,10.21,0.49,83,12970.0,54.6,0.10,76,82,1.91,10,2825.39,8,61.5,88.74,3,4,4,4,0.033333
3,26692.98,12.40,0.82,70,4466.0,49.7,0.30,73,81,1.55,10,2068.12,13,57.1,67.16,3,4,4,4,0.033333
4,2344.90,13.34,1.98,32,511.0,56.2,0.10,47,71,0.36,-7,921.56,1,60.9,51.92,0,2,2,4,0.101449
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110,37491.18,9.70,0.97,101,334221.0,56.0,0.60,74,79,2.74,10,4542.85,10,62.3,81.70,3,4,4,3,0.000000
111,9106.33,8.99,1.58,83,276.0,46.0,0.50,48,77,1.48,10,823.82,15,57.5,92.30,3,2,3,3,0.101449
112,952.83,3.61,0.71,17,1718.0,52.6,0.10,19,68,0.18,-9,261.43,5,57.5,36.82,0,3,1,1,0.101449
113,722.81,3.91,1.09,16,1425.0,67.6,0.40,28,75,1.48,-7,302.73,12,71.0,27.84,0,3,1,1,0.101449


## Regression Analysis

In [17]:
df3.columns

Index(['incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate', 'demoscorecat', 'co2cat', 'incomecat', 'alccat', 'electricat'], dtype='object')

In [18]:
y = df3['demoscorecat']
X = df3[['incomeperperson','internetuserate','lifeexpectancy','relectricperperson','employrate','urbanrate']]

In [19]:
# Copy the subset so that later column assignments do not act on a view of df3.
sub = df3[['incomeperperson','internetuserate','lifeexpectancy','relectricperperson','employrate','urbanrate','demoscorecat']].copy()

In [20]:
sub.dtypes

incomeperperson        float64
internetuserate          int64
lifeexpectancy           int64
relectricperperson     float64
employrate             float64
urbanrate              float64
demoscorecat          category
dtype: object

In [21]:
sub["demoscorecat"] = sub["demoscorecat"].astype('int')

In [22]:
sub.demoscorecat.value_counts()

3    90
0    25
Name: demoscorecat, dtype: int64

In [23]:
# Recode category 3 as 1 so the response is binary (0/1); assign back rather than using inplace.
sub["demoscorecat"] = sub["demoscorecat"].replace({3: 1})

In [24]:
sub.demoscorecat.value_counts()

1    90
0    25
Name: demoscorecat, dtype: int64

In [25]:
sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115 entries, 0 to 114
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   incomeperperson     115 non-null    float64
 1   internetuserate     115 non-null    int64  
 2   lifeexpectancy      115 non-null    int64  
 3   relectricperperson  115 non-null    float64
 4   employrate          115 non-null    float64
 5   urbanrate           115 non-null    float64
 6   demoscorecat        115 non-null    int32  
dtypes: float64(4), int32(1), int64(2)
memory usage: 6.0 KB


In [26]:
#X = sm.add_constant(X)

In [27]:
model = logit(formula= 'demoscorecat ~ incomeperperson + internetuserate + lifeexpectancy + relectricperperson + employrate + urbanrate', data=sub).fit()

Optimization terminated successfully.
         Current function value: 0.452156
         Iterations 7


In [28]:
model.summary()

|                  |                  |                   |         |
|------------------|------------------|-------------------|---------|
| Dep. Variable:   | demoscorecat     | No. Observations: | 115     |
| Model:           | Logit            | Df Residuals:     | 108     |
| Method:          | MLE              | Df Model:         | 6       |
| Date:            | Tue, 23 Mar 2021 | Pseudo R-squ.:    | 0.1364  |
| Time:            | 14:50:42         | Log-Likelihood:   | -51.998 |
| converged:       | True             | LL-Null:          | -60.212 |
| Covariance Type: | nonrobust        | LLR p-value:      | 0.01163 |

|                    | coef     | std err  | z      | P>\|z\| | [0.025    | 0.975]    |
|--------------------|----------|----------|--------|---------|-----------|-----------|
| Intercept          | 5.0185   | 3.243    | 1.548  | 0.122   | -1.337    | 11.374    |
| incomeperperson    | 5.27e-05 | 5.55e-05 | 0.950  | 0.342   | -5.61e-05 | 0.000     |
| internetuserate    | 0.0362   | 0.020    | 1.806  | 0.071   | -0.003    | 0.076     |
| lifeexpectancy     | -0.0269  | 0.041    | -0.657 | 0.511   | -0.107    | 0.053     |
| relectricperperson | -0.0005  | 0.000    | -2.166 | 0.030   | -0.001    | -5.13e-05 |
| employrate         | -0.0313  | 0.030    | -1.041 | 0.298   | -0.090    | 0.028     |
| urbanrate          | -0.0159  | 0.017    | -0.953 | 0.340   | -0.049    | 0.017     |
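
The overall fit statistics in the summary can be reproduced by hand from the two log-likelihoods. The sketch below assumes the pseudo R-squared is McFadden's (1 minus the ratio of the model and null log-likelihoods) and uses the closed-form chi-square tail that holds only for an even number of degrees of freedom; the helper name `chi2_sf_even_df` is made up for this example.

```python
import math

# Log-likelihoods as printed in the summary above (rounded to three decimals).
ll_model = -51.998
ll_null = -60.212
df_model = 6

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic, chi-square distributed with df_model degrees of freedom.
lr_stat = 2 * (ll_model - ll_null)

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    half = x / 2
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(df // 2))

llr_pvalue = chi2_sf_even_df(lr_stat, df_model)
print(round(pseudo_r2, 4), round(lr_stat, 3), round(llr_pvalue, 5))
```

Because the inputs are rounded, the results agree with the reported `Pseudo R-squ.` and `LLR p-value` only approximately.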


In [29]:
model.pvalues

Intercept             0.121711
incomeperperson       0.342365
internetuserate       0.070971
lifeexpectancy        0.511064
relectricperperson    0.030324
employrate            0.297932
urbanrate             0.340397
dtype: float64

In [30]:
model.params

Intercept             5.018460
incomeperperson       0.000053
internetuserate       0.036229
lifeexpectancy       -0.026946
relectricperperson   -0.000539
employrate           -0.031263
urbanrate            -0.015891
dtype: float64

## Summary

In [31]:
result1 = pd.DataFrame(model.pvalues, columns=["P-value"])

In [32]:
result1

|                    | P-value  |
|--------------------|----------|
| Intercept          | 0.121711 |
| incomeperperson    | 0.342365 |
| internetuserate    | 0.070971 |
| lifeexpectancy     | 0.511064 |
| relectricperperson | 0.030324 |
| employrate         | 0.297932 |
| urbanrate          | 0.340397 |


In [33]:
result2 = pd.DataFrame(model.params, columns=["Coef"])

In [34]:
result2

|                    | Coef      |
|--------------------|-----------|
| Intercept          | 5.01846   |
| incomeperperson    | 5.3e-05   |
| internetuserate    | 0.036229  |
| lifeexpectancy     | -0.026946 |
| relectricperperson | -0.000539 |
| employrate         | -0.031263 |
| urbanrate          | -0.015891 |
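
The assignment asks for odds ratios with 95% confidence intervals, which are the exponentiated coefficients and bounds. Of the six predictors, only `relectricperperson` has a p-value below 0.05 (0.030); `internetuserate` is borderline (0.071). The sketch below converts those two rows by hand, using the coefficients from `model.params` and the bounds as printed in the summary, so the interval ends are rounded. The helper `to_odds_ratio` is a hypothetical name. With the fitted model in memory, `np.exp(model.params)` and `np.exp(model.conf_int())` should give the same quantities for every term.

```python
import math

# Coefficients from `model.params` and 95% bounds from the summary table
# (log-odds scale; the bounds are rounded as printed).
estimates = {
    "internetuserate": (0.036229, -0.003, 0.076),
    "relectricperperson": (-0.000539, -0.001, -5.13e-05),
}

def to_odds_ratio(coef, lower, upper):
    """Exponentiate a log-odds coefficient and its confidence bounds."""
    return math.exp(coef), math.exp(lower), math.exp(upper)

odds_ratios = {name: to_odds_ratio(*vals) for name, vals in estimates.items()}
for name, (or_, lo, hi) in odds_ratios.items():
    print(f"{name}: OR={or_:.4f}, 95% CI=({lo:.4f}, {hi:.4f})")
```

An odds ratio below 1 with an interval that excludes 1, as for `relectricperperson`, indicates that each additional kWh per person is associated with slightly lower odds of falling in the category coded 1, holding the other predictors fixed.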


#### Python code done by Dennis Lam