## Logistic Regression
This notebook uses data from the General Social Survey (GSS).
See https://gss.norc.org/ for more information and to download the data.
    
Based on an exercise in *Introduction to Python for Business and Social Sciences*. Frederick Kaefer and Paul Kaefer. In preparation. SAGE Publishing: 2020. All rights reserved.

In [1]:
# use pandas to hold the data and
# the statsmodels formula API (a third-party library) for modeling
import pandas as pd
import statsmodels.formula.api as smf

### What's missing here?

* Clean the data!
* Produce _dummy variable(s)_ (1 for Yes, 0 for No)
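
The two steps above could look something like the following minimal sketch. The raw column names (`HAPPY`, `SEX`, `MARITAL`), their response labels, and the toy rows are assumptions made up for illustration and are not the actual GSS coding; only the resulting dummy names mirror the columns used in this notebook.

```python
import pandas as pd

# Hypothetical raw survey responses (illustrative only, not real GSS data).
raw_df = pd.DataFrame({
    "HAPPY": ["Very happy", "Not too happy", None, "Pretty happy", "Not too happy"],
    "SEX": ["Male", "Female", "Female", "Male", "Female"],
    "MARITAL": ["Divorced", "Married", "Widowed", "Never married", "Widowed"],
})

# Clean: drop respondents whose outcome is missing.
clean_df = raw_df.dropna(subset=["HAPPY"]).copy()

# Binary dummies built from a single condition (1 for Yes, 0 for No).
clean_df["UNHAPPY"] = (clean_df["HAPPY"] == "Not too happy").astype(int)
clean_df["MALE"] = (clean_df["SEX"] == "Male").astype(int)

# One dummy per marital category; "Married" is dropped so that it
# acts as the reference level in a regression.
marital = pd.get_dummies(clean_df["MARITAL"]).astype(int)
marital = marital.drop(columns="Married")
marital.columns = [name.upper().replace(" ", "_") for name in marital.columns]

clean_df = pd.concat([clean_df[["UNHAPPY", "MALE"]], marital], axis=1)
print(clean_df)
```

Dropping one category per group avoids perfectly collinear dummies when the model also includes an intercept.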

In [2]:
model_df = pd.read_csv("data/GSS1993_HealthDummy.csv")

print("Observations: ", len(model_df))
model_df.head()

Observations:  1601


,UNHAPPY,AGE,HEALTHY,RINCOM91,EDUC,MALE,TRAUMA1,DIVORCED,WIDOWED,SEPARATED,NEVER_MARRIED,BLACK,OTHER_RACE
0,1,43.0,1,17.0,11.0,1,0,1,0,0,0,0,0
1,0,44.0,0,18.0,16.0,1,0,0,0,0,1,1,0
2,0,43.0,1,18.0,16.0,0,0,1,0,0,0,0,0
3,0,45.0,0,12.33,15.0,0,0,0,0,0,1,0,0
4,0,83.0,1,12.33,11.0,1,1,0,0,0,0,0,0


In [3]:
model_df.describe()

,UNHAPPY,AGE,HEALTHY,RINCOM91,EDUC,MALE,TRAUMA1,DIVORCED,WIDOWED,SEPARATED,NEVER_MARRIED,BLACK,OTHER_RACE
count,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0
mean,0.111181,46.01574,0.152405,12.329619,13.05634,0.427233,0.235478,0.143036,0.106808,0.027483,0.186758,0.110556,0.049969
std,0.314454,17.338264,0.359525,4.260623,3.045789,0.494831,0.42443,0.350219,0.308966,0.163537,0.389839,0.313679,0.217949
min,0.0,18.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,32.0,0.0,11.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,43.0,0.0,12.33,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,58.0,0.0,15.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,89.0,1.0,21.0,20.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [4]:
model_df['UNHAPPY'].value_counts()

# another way to count JUST THE 1s is:
#pd.Series(model_df['UNHAPPY'].sum(), index = ['UNHAPPY'])

0    1423
1     178
Name: UNHAPPY, dtype: int64

In [5]:
model_df.groupby(['UNHAPPY']).mean()

UNHAPPY,AGE,HEALTHY,RINCOM91,EDUC,MALE,TRAUMA1,DIVORCED,WIDOWED,SEPARATED,NEVER_MARRIED,BLACK,OTHER_RACE
0,45.671258,0.127196,12.387217,13.176493,0.432186,0.218552,0.134223,0.09487,0.021785,0.189037,0.108222,0.048489
1,48.769663,0.353933,11.869157,12.095787,0.38764,0.370787,0.213483,0.202247,0.073034,0.168539,0.129213,0.061798


In [6]:
# Fit a logistic regression that estimates the probability
# of a respondent being unhappy (UNHAPPY = 1).
# Note: .fit() returns the fitted results object, stored here as `model`.
model = smf.logit(formula = "UNHAPPY ~ WIDOWED + DIVORCED + SEPARATED " +
                "+ NEVER_MARRIED + AGE + HEALTHY + RINCOM91 + EDUC " +
                "+ MALE + TRAUMA1", data = model_df).fit()

print(model.summary())

Optimization terminated successfully.
         Current function value: 0.317927
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                UNHAPPY   No. Observations:                 1601
Model:                          Logit   Df Residuals:                     1590
Method:                           MLE   Df Model:                           10
Date:                Wed, 29 Jan 2020   Pseudo R-squ.:                 0.08897
Time:                        15:21:33   Log-Likelihood:                -509.00
converged:                       True   LL-Null:                       -558.71
Covariance Type:            nonrobust   LLR p-value:                 7.119e-17
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -2.0274      0.549     -3.696      0.000      -3.102      -0.952
WIDOWED           0.

### What's next?
* train on a portion of the data & use the rest for testing
* perform cross-validation
* remove variables whose coefficients are not statistically significant (based on the z-statistic and its p-value)
* try another type of model
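
The first two items could be sketched roughly as follows. To keep the example self-contained it uses a synthetic stand-in DataFrame (`demo_df`) with made-up coefficients rather than the GSS file; the 70/30 split, the 0.5 classification threshold, the five folds, and accuracy as the metric are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (illustrative only; coefficients are arbitrary).
rng = np.random.default_rng(0)
n = 1000
demo_df = pd.DataFrame({
    "HEALTHY": rng.integers(0, 2, size=n),
    "AGE": rng.integers(18, 90, size=n).astype(float),
})
log_odds = -2.0 + 1.2 * demo_df["HEALTHY"] + 0.01 * (demo_df["AGE"] - 45)
demo_df["UNHAPPY"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

formula = "UNHAPPY ~ HEALTHY + AGE"

# Hold-out split: fit on 70% of the rows, test on the remaining 30%.
train_df = demo_df.sample(frac=0.7, random_state=0)
test_df = demo_df.drop(train_df.index)

result = smf.logit(formula, data=train_df).fit(disp=False)
pred_class = (result.predict(test_df) >= 0.5).astype(int)
accuracy = (pred_class == test_df["UNHAPPY"]).mean()
print(f"Hold-out rows: {len(test_df)}, accuracy: {accuracy:.3f}")

# 5-fold cross-validation: each fold is held out exactly once.
shuffled = demo_df.sample(frac=1.0, random_state=1)
folds = np.array_split(shuffled.index.to_numpy(), 5)
cv_scores = []
for fold_idx in folds:
    fit_k = smf.logit(formula, data=shuffled.drop(fold_idx)).fit(disp=False)
    pred_k = (fit_k.predict(shuffled.loc[fold_idx]) >= 0.5).astype(int)
    cv_scores.append((pred_k == shuffled.loc[fold_idx, "UNHAPPY"]).mean())
print(f"Mean CV accuracy: {np.mean(cv_scores):.3f}")
```

Because only about 11% of respondents in this notebook's data are unhappy, plain accuracy can look good even for a model that always predicts 0, so a metric that accounts for class imbalance may be a better choice in practice.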