## Moderator
### Moderation (aka statistical interaction)
### Causation study
* Causation is very hard to establish with statistics alone.
* Lurking (hidden) variables might exist.
* The best approach is to run an experiment, but that is not always possible.
* Retrospective vs. prospective studies
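
To make the lurking-variable point concrete, here is a minimal sketch on simulated data. The variables `x`, `y`, and `z` are hypothetical and are not part of the NESARC dataset: `z` drives both `x` and `y`, while `x` has no direct effect on `y`. The raw correlation looks substantial, but it largely disappears once `z` is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical lurking variable z influences both x and y;
# x has no direct effect on y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

# Raw correlation between x and y, ignoring z
raw_corr = np.corrcoef(x, y)[0, 1]


def residualize(v, z):
    """Return v with its linear dependence on z removed."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)


# Partial correlation: correlate what is left of x and y after removing z
partial_corr = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

print(f"raw correlation:     {raw_corr:.3f}")
print(f"partial correlation: {partial_corr:.3f}")
```

In observational data the lurking variable is usually not measured, so this adjustment is not available. That is why experiments are preferred when they are feasible.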

In [1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 

In [2]:
data = pd.read_csv('data/nesarc.csv', low_memory=False)

In [3]:
data['S3AQ3B1'] = pd.to_numeric(data['S3AQ3B1'], errors='coerce')
data['S3AQ3C1'] = pd.to_numeric(data['S3AQ3C1'], errors='coerce')
data['CHECK321'] = pd.to_numeric(data['CHECK321'], errors='coerce')
data['S9Q1A'] = pd.to_numeric(data['S9Q1A'], errors='coerce')

#subset data to young adults age 18 to 25 who have smoked in the past 12 months
sub1=data[(data['AGE']>=18) & (data['AGE']<=25) & (data['CHECK321']==1)].copy()

# use all data
#sub1=data.copy()
#SETTING MISSING DATA
sub1['S3AQ3B1']=sub1['S3AQ3B1'].replace(9, np.nan)
sub1['S3AQ3C1']=sub1['S3AQ3C1'].replace(99, np.nan)
sub1['S9Q1A']=sub1['S9Q1A'].replace(9, np.nan)

#recoding number of days smoked in the past month
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub1['USFREQMO']= sub1['S3AQ3B1'].map(recode1)

# Creating a secondary variable multiplying the days smoked/month and the number of cig/per day
sub1['NUMCIGMO_EST']=sub1['USFREQMO'] * sub1['S3AQ3C1']

ct1 = sub1.groupby('NUMCIGMO_EST').size()

In [4]:
sub2 = sub1[['NUMCIGMO_EST', 'MAJORDEPLIFE', 'MAJORDEP12', 'ALCABDEP12DX']].dropna()
model1 = smf.ols(formula='NUMCIGMO_EST ~ C(MAJORDEPLIFE)', data=sub2)
results1 = model1.fit()
print (results1.summary())

print ('means for numcigmo_est by major depression status')
m1= sub2.groupby('MAJORDEPLIFE').mean()
print (m1)
print ('standard deviations for numcigmo_est by major depression status')
sd1 = sub2.groupby('MAJORDEPLIFE').std()
print (sd1)

                            OLS Regression Results                            
Dep. Variable:           NUMCIGMO_EST   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     3.550
Date:                Sat, 07 May 2022   Prob (F-statistic):             0.0597
Time:                        15:44:59   Log-Likelihood:                -11934.
No. Observations:                1697   AIC:                         2.387e+04
Df Residuals:                    1695   BIC:                         2.388e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept              312.8380 

The question of interest here is whether the number of cigarettes smoked per month depends on lifetime major depression. In the regression output above, the F-statistic is 3.55 and the p-value is 0.0597, so the association is not statistically significant at the 0.05 level.

In [5]:
for aldep in [0, 1]:
    print ("Alcohol abuse/dependence status", aldep)
    sub2 = sub1[['NUMCIGMO_EST', 'MAJORDEPLIFE', 'MAJORDEP12', 'ALCABDEP12DX']].dropna()
    sub2 = sub2[sub2.ALCABDEP12DX == aldep]
    model1 = smf.ols(formula='NUMCIGMO_EST ~ C(MAJORDEPLIFE)', data=sub2)
    results1 = model1.fit()
    print (results1.summary())

    print ('means for numcigmo_est by major depression status')
    m1= sub2.groupby('MAJORDEPLIFE').mean()
    print (m1)
    print ('standard deviations for numcigmo_est by major depression status')
    sd1 = sub2.groupby('MAJORDEPLIFE').std()
    print (sd1)

Alcohol abuse/dependence status 0
                            OLS Regression Results                            
Dep. Variable:           NUMCIGMO_EST   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     4.341
Date:                Sat, 07 May 2022   Prob (F-statistic):             0.0374
Time:                        15:44:59   Log-Likelihood:                -8134.9
No. Observations:                1168   AIC:                         1.627e+04
Df Residuals:                    1166   BIC:                         1.628e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------

Now I add a moderator, "Alcohol abuse". Interestingly, for alcohol abuse = 0 (no abuse), the number of cigarettes smoked is associated with lifetime depression: the F-statistic is 4.341 and the p-value is 0.0374. For alcohol abuse = 1 (abuse), the F-statistic is 0.03285 and the p-value is 0.856, which indicates no significant association between the number of cigarettes smoked and lifetime depression.

Conclusion: Alcohol abuse is a moderator of the association between the number of cigarettes smoked and lifetime depression.
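
Fitting the model separately within each alcohol group can be complemented by a single model with an interaction term, which tests directly whether the depression effect differs between the groups. The following is a minimal sketch on synthetic data, not the NESARC file. The `demo` frame, the effect size of 60, and the noise level are made up for illustration; only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 2000

# Synthetic stand-in for the notebook's variables (NOT the NESARC data)
demo = pd.DataFrame({
    "MAJORDEPLIFE": rng.integers(0, 2, size=n),
    "ALCABDEP12DX": rng.integers(0, 2, size=n),
})

# Assumed effect: depression raises smoking by 60 only when ALCABDEP12DX == 0
effect = 60 * demo["MAJORDEPLIFE"] * (1 - demo["ALCABDEP12DX"])
demo["NUMCIGMO_EST"] = 300 + effect + rng.normal(0, 100, size=n)

# In the formula, '*' expands to both main effects plus their interaction
model = smf.ols(
    "NUMCIGMO_EST ~ C(MAJORDEPLIFE) * C(ALCABDEP12DX)", data=demo
).fit()

# The interaction term is the only coefficient whose name contains ':'
term = [name for name in model.params.index if ":" in name][0]
print(term)
print("interaction coefficient:", model.params[term])
print("interaction p-value:    ", model.pvalues[term])
```

A small p-value on the interaction coefficient is the formal evidence of moderation. Comparing which subgroup p-value falls below 0.05 is only suggestive, because the subgroups differ in size and therefore in power.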