
# Peer-graded Assignment: Test a Logistic Regression Model


<div class="rc-AssignmentInstructions"><div class="introduction"><div class="rc-CML styled"><div><p> This week's assignment is to test a logistic regression model. </p><p><strong>Data preparation for this assignment:</strong></p><p>1) If your response variable is categorical with more than two categories, you will need to collapse it down to two categories, or subset your data to select observations from two categories.</p><p>2) If your response variable is quantitative, you will need to bin it into two categories.</p><p><strong>The assignment:</strong></p><p>Write a blog entry that summarizes in a few sentences 1) what you found, making sure you discuss the results for the associations between <strong>all</strong> of your explanatory variables and your response variable. Make sure to include statistical results (odds ratios, p-values, and 95% confidence intervals for the odds ratios) in your summary. 2) Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable. 3) Discuss whether or not there was evidence of confounding for the association between your primary explanatory variable and the response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables).  </p><p>

<strong>What to Submit: Write a blog entry and submit the URL for your blog. Your blog entry should include 1) the summary of your results that addresses parts 1-3 of the assignment, 2) the output from your logistic regression model.</strong></p><p>  Example of how to write logistic regression results:</p><p>After adjusting for potential confounding factors (list them), the odds of having nicotine dependence were more than two times higher for participants with major depression than for participants without major depression (OR=2.36, 95% CI=1.44-3.81, p=.0001). Age was also significantly associated with nicotine dependence, such that older participants were significantly less likely to have nicotine dependence (OR=0.81, 95% CI=0.40-0.93, p=.041).  </p></div></div></div><div class="rc-AssignmentInstructionSection"><div class="title-container bgcolor-primary-light"><span class="body-2-text">Review criteria</span></div><div class="instructions-content-container" id="instructions-container-1529142337748"><div class="rc-CML styled"><div><p> Your assessment will be based on the evidence you provide that you have completed all of the steps. When relevant, gradients in the scoring will be available to reward clarity (for example, you will get one point for submitting an inaccurate or incomplete description of your results, but two points if the description is accurate and complete). In all cases, consider that the peer assessing your work is likely not an expert in the field you are analyzing. You will be assessed equally on all parts of the assignment, and whether you post your program and output.  </p></div></div></div></div></div>
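
Data-preparation step 2 above (binning a quantitative response into two categories) can be illustrated with a median split. This is a plain-Python sketch and not the `create_binned_categories` helper used later in this notebook, whose implementation is not shown here; the function name `median_split` and the sample values are made up for the example.

```python
from statistics import median


def median_split(values):
    """Return 0/1 codes: 1 when a value lies above the median, otherwise 0."""
    cutoff = median(values)
    return [1 if value > cutoff else 0 for value in values]


# Hypothetical urban-rate percentages for six nations (illustrative only)
urban_rates = [24.5, 81.2, 46.7, 66.0, 92.3, 30.1]
urban_rate_binary = median_split(urban_rates)
print(urban_rate_binary)
```

A median split keeps the two groups roughly balanced, which helps when the sample is small, although any defensible cutoff would satisfy the assignment.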

In [67]:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import statsmodels.formula.api as smf
import inflect
import math

In [68]:

%run ../../../load_magic/storage2.py 2
%run ../data-visualization/load-gapminder_df.ipynb
%who

data_folder = ../../data/
saves_folder = ../../saves/
DBSCAN	 HTML	 IPython	 Image	 MinMaxScaler	 PCA	 Statements	 add_facet_grid	 add_joint_plot	 
angle_between	 anova_with_moderator	 attempt_to_pickle	 binary_brw	 binary_column_list	 binary_column_name	 category_column_list	 chi2_contingency	 chi_square_with_moderator	 
column_name	 column_value	 conf_int_df	 conjunctify_list	 create_binned_categories	 data_folder	 data_measures_dict	 data_procedures_dict	 data_provider_dict	 
download_url	 encoding	 euclidean	 explanation_str	 force_correct_scale	 formal_name_dict	 gapminder_df	 get_correct_scale	 get_min_max	 
inflect	 inflect_engine	 informal_column_name	 informal_name_dict	 informal_odds_name	 informal_significant_name	 less_more_str	 load_csv	 load_dataframes	 
load_object	 low_high_list	 lower_ci	 match_series	 math	 max_odds_column	 max_odds_ratio	 model_interpretation	 moderator_conclusion	 
np	 number_column_list	 obj_path	 odds_match_series	 odds_ratio	 original_columns	 os

In [69]:

gapminder_df = load_object('gapminder_df')
gapminder_df.dropna(subset=None, inplace=True)
for column_name in gapminder_df.columns:
    try:
        gapminder_df = create_binned_categories(df=gapminder_df, number_of_categories=2,
                                                column_name=column_name, prefix=column_name)
    except Exception as e:
        print('{}: {}'.format(column_name, e))
category_column_list = [column_name for column_name in gapminder_df.columns if column_name.endswith('_categories')]
for column_name in category_column_list:
    low_high_list = sorted(gapminder_df[column_name].unique().tolist())
    if len(low_high_list) == 2:
        binary_column_name = '_'.join(column_name.split('_')[:-1] + ['binary'])
        gapminder_df[binary_column_name] = gapminder_df[column_name].map(lambda x: low_high_list.index(x)).cat.codes
binary_column_list = [column_name for column_name in gapminder_df.columns if column_name.endswith('_binary')]

inflect_engine = inflect.engine()

# Find some formulas that don't blow up
patsy_formula_list = []
for dependent_variable in binary_column_list:
    independent_variables_list = [col_name for col_name in binary_column_list if col_name != dependent_variable]
    patsy_formula = dependent_variable + ' ~ ' + ' + '.join(independent_variables_list)
    
    # Logistic regression
    try:
        binary_brw = smf.logit(formula=patsy_formula, data=gapminder_df).fit()
        patsy_formula_list.append(patsy_formula)
    except Exception:
        # Skip formulas whose fit raises an error
        pass

country_name: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U11') dtype('<U11') dtype('<U11')
         Current function value: inf
         Iterations: 35
         Current function value: 0.331220
         Iterations: 35




Optimization terminated successfully.
         Current function value: 0.277487
         Iterations 23
         Current function value: inf
         Iterations: 35
Optimization terminated successfully.
         Current function value: 0.447065
         Iterations 7
         Current function value: 0.283118
         Iterations: 35
Optimization terminated successfully.
         Current function value: 0.391555
         Iterations 8
         Current function value: inf
         Iterations: 35
         Current function value: 0.012979
         Iterations: 35
         Current function value: inf
         Iterations: 35
         Current function value: inf
         Iterations: 35
         Current function value: 0.312532
         Iterations: 35
         Current function value: inf
         Iterations: 35
Optimization terminated successfully.
         Current function value: 0.372858
         Iterations 8
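
The assignment's hint about adding explanatory variables one at a time can be sketched as a sequence of nested patsy formulas. This is only a sketch: `nested_formulas` is a hypothetical helper, and the column names below are assumed to follow this notebook's `_binary` naming for the variables mentioned in its results. Each formula would be passed to `smf.logit` in turn; if the primary variable's coefficient loses significance when a covariate enters, that covariate is a candidate confounder.

```python
def nested_formulas(response, primary, covariates):
    """Build patsy formulas that add one covariate at a time after the primary term."""
    formulas = [response + ' ~ ' + primary]
    terms = [primary]
    for covariate in covariates:
        terms.append(covariate)
        formulas.append(response + ' ~ ' + ' + '.join(terms))
    return formulas


# Column names are assumptions based on the notebook's naming convention
formula_list = nested_formulas(
    'urban_rate_binary',
    'employment_rate_binary',
    ['armed_forces_rate_binary', 'oil_per_person_binary'],
)
for formula in formula_list:
    print(formula)
```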




In [70]:

# Find some formulas with at least one significant explanatory variable
significant_patsy_formula_list = []
for patsy_formula in patsy_formula_list:
    binary_brw = smf.logit(formula=patsy_formula, data=gapminder_df).fit()
    pvalues_series = binary_brw.pvalues
    match_series = (pvalues_series <= 0.05) & (pvalues_series.index != 'Intercept')
    significant_column_list = pvalues_series[match_series].index.tolist()
    if len(significant_column_list) > 0:
        # Note: column_name is not assigned in this cell; it carries over from an earlier cell
        patsy_formula = column_name + ' ~ ' + ' + '.join(significant_column_list)
        try:
            binary_brw = smf.logit(formula=patsy_formula, data=gapminder_df).fit()
            significant_patsy_formula_list.append(patsy_formula)
        except Exception:
            # Skip formulas whose fit raises an error
            pass

         Current function value: 0.331220
         Iterations: 35
Optimization terminated successfully.
         Current function value: 0.447065
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.539393
         Iterations 6
         Current function value: 0.283118
         Iterations: 35
Optimization terminated successfully.
         Current function value: 0.391555
         Iterations 8
         Current function value: 0.012979
         Iterations: 35
         Current function value: 0.312532
         Iterations: 35
Optimization terminated successfully.
         Current function value: 0.608424
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.372858
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.519606
         Iterations 6




In [75]:

for patsy_formula in significant_patsy_formula_list:
    binary_brw = smf.logit(formula=patsy_formula, data=gapminder_df).fit()
    print()
    #print(patsy_formula)
    #print(binary_brw.summary())

    # Odds ratios with 95% confidence intervals
    params_series = binary_brw.params
    conf_int_df = binary_brw.conf_int()
    conf_int_df['OR'] = params_series
    conf_int_df.columns = ['Lower CI', 'Upper CI', 'Odds Ratios']
    conf_int_df = np.exp(conf_int_df)

    # Get column with highest odds
    max_odds_ratio = conf_int_df['Odds Ratios'].max()
    match_series = (conf_int_df['Odds Ratios'] == max_odds_ratio)
    max_odds_column = conf_int_df[match_series].index.tolist()[0]
    
    # Get other explanation params
    match_series = (conf_int_df.index == max_odds_column)
    lower_ci = conf_int_df[match_series]['Lower CI'].squeeze()
    upper_ci = conf_int_df[match_series]['Upper CI'].squeeze()
    
    # Get p-value of highest odds column
    pvalues_series = binary_brw.pvalues
    match_series = (pvalues_series.index == max_odds_column)
    p_value = pvalues_series[match_series].squeeze()
    
    if (max_odds_ratio >= 2.0) and (p_value <= 0.05):
        
        # Get list of the other explanatory columns (significance is checked further down)
        match_series = (pvalues_series.index != 'Intercept') & (pvalues_series.index != max_odds_column)
        other_significants_list = pvalues_series[match_series].index.tolist()
        
        # Get informal name of response variable
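        # Take the response variable from the formula itself so the name always matches the fitted model
        response_column = patsy_formula.split('~')[0].strip()
        informal_column_name = informal_name_dict['_'.join(response_column.split('_')[:-1])]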
        
        # Get informal name of explanatory variable
        informal_odds_name = informal_name_dict['_'.join(max_odds_column.split('_')[:-1])]
        
        if len(other_significants_list) > 0:
            other_significants_list = [informal_name_dict['_'.join(c.split('_')[:-1])] for c in other_significants_list]
            explanation_str = ('After adjusting for potential confounding factors (' +
                               conjunctify_list(other_significants_list) +
                               '), t')
        else:
            explanation_str = ('T')
        explanation_str += ('he odds of having a high ' + informal_column_name +
                            ' were more than ' + inflect_engine.number_to_words(math.floor(max_odds_ratio)) +
                            ' times higher for nations with a high ' + informal_odds_name +
                            ' than for nations with a low ' + informal_odds_name +
                            ' (OR={:.2f}, 95% CI={:.2f}-{:.2f}, p={:.4f})'.format(max_odds_ratio, lower_ci,
                                                                                  upper_ci, p_value) +
                            '.')
        confounded_variables_list = []
        for significant_column, significant_pvalue in pvalues_series[match_series].items():
            informal_significant_name = informal_name_dict['_'.join(significant_column.split('_')[:-1])]
            odds_match_series = (conf_int_df.index == significant_column)
            odds_ratio = conf_int_df[odds_match_series]['Odds Ratios'].max()
            if odds_ratio < 1.0:
                less_more_str = 'less'
            elif odds_ratio > 1.0:
                less_more_str = 'more'
            else:
                less_more_str = 'equally'
            lower_ci = conf_int_df[odds_match_series]['Lower CI'].max()
            upper_ci = conf_int_df[odds_match_series]['Upper CI'].max()
            if significant_pvalue <= 0.05:
                explanation_str += (' ' + informal_significant_name[0].upper() + informal_significant_name[1:] +
                                    ' was also significantly associated with ' + informal_column_name +
                                    ', such that nations with a high ' + informal_significant_name +
                                    ' were significantly ' + less_more_str +
                                    ' likely to have a high ' + informal_column_name +
                                    ' (OR={:.2f}, 95% CI={:.2f}-{:.2f}, p={:.4f})'.format(odds_ratio, lower_ci,
                                                                                          upper_ci,
                                                                                          significant_pvalue) +
                                    '.')
            else:
                confounded_variables_list.append(significant_column)
                # Refit with this variable alone to see whether it is significant in isolation
                confounded_formula = patsy_formula.split('~')[0].strip() + ' ~ ' + significant_column
                confounded_brw = smf.logit(formula=confounded_formula, data=gapminder_df).fit()
                confounded_pvalues_series = confounded_brw.pvalues
                confounded_match_series = (confounded_pvalues_series <= 0.05) & (confounded_pvalues_series.index != 'Intercept')
                significant_confounded_list = confounded_pvalues_series[confounded_match_series].index.tolist()
                if len(significant_confounded_list) > 0:
                    explanation_str += (' ' + informal_significant_name[0].upper() + informal_significant_name[1:] +
                                        ' was not significantly associated with ' + informal_column_name +
                                        ', such that nations with a high ' + informal_significant_name +
                                        ' were not significantly ' + less_more_str +
                                        ' likely to have a high ' + informal_column_name +
                                        ' (OR={:.2f}, 95% CI={:.2f}-{:.2f}, p={:.4f})'.format(odds_ratio, lower_ci,
                                                                                              upper_ci,
                                                                                              significant_pvalue) +
                                        '. (There is evidence that this variable was confounded by the addition of ' +
                                        'other variables as it is significant in isolation.)')
        print()
        print(explanation_str)
        #print(patsy_formula)
        print()
        print(binary_brw.summary())
        print(conf_int_df)
        print()
        print()

Optimization terminated successfully.
         Current function value: 0.539393
         Iterations 6

Optimization terminated successfully.
         Current function value: 0.687197
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.687197
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.687197
         Iterations 4

After adjusting for potential confounding factors (armed forces rate, oil per person, and residential electricity), the odds of having a high urban rate were more than nine times higher for nations with a high employment rate than for nations with a low employment rate (OR=9.20, 95% CI=1.81-46.62, p=0.0074).
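
The three statistics in the sentence above are linked. The odds ratio is the exponentiated logit coefficient, the interval is the exponentiated coefficient plus or minus about 1.96 standard errors, and the p-value comes from the resulting z statistic. As a rough consistency check, the p-value can be recovered from the reported odds ratio and interval. This is a sketch that assumes a symmetric Wald interval on the log-odds scale; the helper name `wald_p_from_or_ci` is hypothetical.

```python
import math


def wald_p_from_or_ci(odds_ratio, lower_ci, upper_ci, z_crit=1.959964):
    """Recover a two-sided Wald p-value from an odds ratio and its 95% CI."""
    # Standard error of the log odds ratio, taken from the CI width on the log scale
    std_err = (math.log(upper_ci) - math.log(lower_ci)) / (2 * z_crit)
    z_stat = math.log(odds_ratio) / std_err
    # Two-sided tail probability of the standard normal distribution
    return math.erfc(abs(z_stat) / math.sqrt(2))


# Values copied from the reported result: OR=9.20, 95% CI=1.81-46.62
p_value = wald_p_from_or_ci(9.20, 1.81, 46.62)
print('p={:.4f}'.format(p_value))
```

The recovered value should land close to the reported p=0.0074; small differences are expected because the inputs are rounded to two decimals.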

                           Logit Regression Results                           
Dep. Variable:      urban_rate_binary   No. Observations:                   47
Model:                          Logit   Df Residuals:                       42
Method:                           MLE   Df