### Objective

I aimed to answer the following question:

- Is increased parental involvement correlated with stronger student performance?

### Data Preparation

I used data from the Parent and Family Involvement in Education Survey (2016). The data was collected a part of the National Household Education Survey Program, by the National Center for Education Statistics. The survey covers children from grades kindergarten to 12th grade and asks various questions about the child's performance at school, involvement of the parents, as well as other parent, child, and school characteristics. The survey is filled out by the parents (or guardians). The data is compiled in a csv file with 822 columns and 14075 entries. The data is nationally representative and uses a two-stage addressed-based sampling.

Initially, I prepared the data for analysis by dropping the parts related to home-schooled children, recoding the ordinal features when necessary to reflect continuity, and handling missing values. I used the following approach:
    1. When the number of missing values was large, I dropped the feature.
    2. When the number of missing values was small, I dropped these observations.
    3. For categorical features, I maintained the missing values as a seperate category.
This gave me a total of 13,095 observations.

I identified four groups of features defined as:
    1. School characteristics
    2. Parent characteristics
    3. Student characteristics
    4. Parental involvement
170 features in total, 42 of them continuous. 

I also identified 13 potential target variables.

In [1]:
%run data_prep.py

In [2]:
y_df, school_characteristics_df, parent_characteristics_df, student_characteristics_df, parental_involvement_df, X_cont_labels = select_feats()

In [3]:
y_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SEENJOY,13095.0,1.752195,0.700917,1.0,1.0,2.0,2.0,4.0
SEGRADES,13095.0,2.050477,1.31422,1.0,1.0,2.0,2.0,5.0
SEADPLCXX,13095.0,-0.012142,1.288855,-1.0,-1.0,-1.0,1.0,2.0
SEBEHAVX,13095.0,0.508286,2.10106,0.0,0.0,0.0,0.0,75.0
SESCHWRK,13095.0,0.607637,2.353676,0.0,0.0,0.0,0.0,97.0
SEGBEHAV,13095.0,1.20756,3.507458,0.0,0.0,0.0,1.0,99.0
SEGWORK,13095.0,1.190454,3.263526,0.0,0.0,0.0,1.0,99.0
SEABSNT,13095.0,4.228942,6.943639,0.0,1.0,3.0,5.0,364.0
SEREPEAT,13095.0,1.923559,0.265713,1.0,2.0,2.0,2.0,2.0
SESUSOUT,13095.0,1.935319,0.245972,1.0,2.0,2.0,2.0,2.0


In [4]:
school_characteristics_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SCPUBPRI,13095.0,3.756319,0.741296,1.0,4.0,4.0,4.0,4.0
DISTASSI,13095.0,0.926613,0.765875,-1.0,1.0,1.0,1.0,2.0
SCHRTSCHL,13095.0,1.606796,0.950766,-1.0,2.0,2.0,2.0,2.0
SNEIGHBRX,13095.0,1.814357,0.388833,1.0,2.0,2.0,2.0,2.0
SPUBCHOIX,13095.0,1.938297,0.770927,1.0,1.0,2.0,3.0,3.0
SCONSIDR,13095.0,1.694845,0.46049,1.0,1.0,2.0,2.0,2.0
SPERFORM,13095.0,-0.327606,1.038755,-1.0,-1.0,-1.0,1.0,2.0
S1STCHOI,13095.0,1.179,0.383367,1.0,1.0,1.0,1.0,2.0
SSAMSC,13095.0,1.027033,0.162186,1.0,1.0,1.0,1.0,2.0
SNETCRSX,13095.0,1.957694,0.201295,1.0,2.0,2.0,2.0,2.0


In [5]:
parent_characteristics_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
FHAMOUNT,13095.0,1.248110,0.851458,-1.0,1.0,1.0,1.0,3.0
CLIVYN,13095.0,1.808477,0.393515,1.0,2.0,2.0,2.0,2.0
CSPEAKX,13095.0,2.317831,0.949270,1.0,2.0,2.0,2.0,6.0
HHTOTALXX,13095.0,4.023673,1.217668,2.0,3.0,4.0,5.0,10.0
HHBROSX,13095.0,0.527606,0.709736,0.0,0.0,0.0,1.0,5.0
...,...,...,...,...,...,...,...,...
YRSADDR,13095.0,9.627186,8.231129,0.0,3.0,8.0,14.0,70.0
OWNRNTHB,13095.0,1.261168,0.480314,1.0,1.0,1.0,1.0,3.0
HVINTSPHO,13095.0,1.052692,0.223426,1.0,1.0,1.0,1.0,2.0
HVINTCOM,13095.0,1.071554,0.257758,1.0,1.0,1.0,1.0,2.0


In [6]:
student_characteristics_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
GRADE,13095.0,9.646354,3.835666,2.0,7.0,10.0,13.0,15.0
FHCAMT,13095.0,1.274609,0.754524,-1.0,1.0,1.0,2.0,3.0
HDHEALTH,13095.0,1.565178,0.76416,1.0,1.0,1.0,2.0,5.0
HDINTDIS,13095.0,1.982589,0.130803,1.0,2.0,2.0,2.0,2.0
HDSPEECHX,13095.0,1.937152,0.242699,1.0,2.0,2.0,2.0,2.0
HDDISTRBX,13095.0,1.970981,0.167865,1.0,2.0,2.0,2.0,2.0
HDDEAFIMX,13095.0,1.988163,0.108154,1.0,2.0,2.0,2.0,2.0
HDBLINDX,13095.0,1.986789,0.114182,1.0,2.0,2.0,2.0,2.0
HDORTHOX,13095.0,1.978007,0.146667,1.0,2.0,2.0,2.0,2.0
HDAUTISMX,13095.0,1.975181,0.155578,1.0,2.0,2.0,2.0,2.0


In [7]:
parental_involvement_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SEFUTUREX,13095.0,4.96617,1.211422,1.0,4.0,5.0,6.0,6.0
FSSPORTX,13095.0,1.193967,0.395419,1.0,1.0,1.0,1.0,2.0
FSVOL,13095.0,1.575181,0.494334,1.0,1.0,2.0,2.0,2.0
FSMTNG,13095.0,1.136082,0.342889,1.0,1.0,1.0,1.0,2.0
FSPTMTNG,13095.0,1.554334,0.497058,1.0,1.0,2.0,2.0,2.0
FSATCNFN,13095.0,1.253761,0.435179,1.0,1.0,1.0,2.0,2.0
FSFUNDRS,13095.0,1.385949,0.486837,1.0,1.0,1.0,2.0,2.0
FSCOMMTE,13095.0,1.868881,0.337543,1.0,2.0,2.0,2.0,2.0
FSCOUNSLR,13095.0,1.637572,0.48072,1.0,1.0,2.0,2.0,2.0
FSFREQ,13095.0,7.978007,9.032643,0.0,3.0,5.0,10.0,99.0


### Training

To start with, I will divide my data into a training and test sample.

In [8]:
y_df = y_df[['SEGRADES']].loc[(y_df.SEGRADES != 5)]
y_df.SEGRADES.value_counts()  

1    5984
2    3908
3    1329
4     306
Name: SEGRADES, dtype: int64

In [9]:
y_df.loc[y_df.SEGRADES == 2, 'SEGRADES'] = 1
y_df.loc[(y_df.SEGRADES == 3) | (y_df.SEGRADES == 4), 'SEGRADES'] = 0
y_df.SEGRADES.value_counts()

1    9892
0    1635
Name: SEGRADES, dtype: int64

In [10]:
X_df = school_characteristics_df.merge(parent_characteristics_df, left_index = True, right_index = True)
X_df = X_df.merge(student_characteristics_df, left_index = True, right_index = True)
X_df = X_df.merge(parental_involvement_df, left_index = True, right_index = True)
X_df = X_df.loc[y_df.index]

In [11]:
X_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11527 entries, 0 to 14073
Columns: 170 entries, SCPUBPRI to HDDEVIEPX
dtypes: int64(170)
memory usage: 15.0 MB


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from imblearn.over_sampling import SMOTE

Using TensorFlow backend.


In [49]:
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, \
                                                    stratify = y_df.SEGRADES, \
                                                    test_size = 0.2, \
                                                    random_state = 20720)

In [50]:
X_train_cont = X_train[X_cont_labels]
X_train_cat = X_train.drop(columns = X_train_cont.columns)

In [51]:
sclr = StandardScaler()
X_train_cont = sclr.fit_transform(X_train_cont)
X_train_cont = pd.DataFrame(sclr.fit_transform(X_train_cont), columns = X_cont_labels)

In [52]:
encdr = OneHotEncoder(drop = 'first', handle_unknown = 'ignore')
encdr.fit(X_train_cat)

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)

In [89]:
%run training_test.py

In [53]:
X_train_cat = encoder_transform(encdr, X_train_cat)

In [54]:
X_train = pd.concat((X_train_cont, X_train_cat), axis = 1)

In [68]:
sm = SMOTE()
X_train, y_train = sm.fit_resample(X_train, y_train.SEGRADES)

#### Logistic Regression

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

In [21]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9221 entries, 0 to 9220
Columns: 406 entries, FSFREQ to HDDEVIEPX_2
dtypes: float64(406)
memory usage: 28.6 MB


In [22]:
y_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9221 entries, 9167 to 2713
Data columns (total 1 columns):
SEGRADES    9221 non-null int64
dtypes: int64(1)
memory usage: 144.1 KB


In [69]:
lr_clf = LogisticRegression(random_state = 20720, penalty = 'l1', solver = 'liblinear', max_iter = 1000)
lr_clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=20720, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

In [88]:
y_train_hat = lr_clf.predict(X_train)
print(classification_report(y_train, y_train_hat))

              precision    recall  f1-score   support

           0       0.83      0.85      0.84      7913
           1       0.85      0.82      0.83      7913

    accuracy                           0.84     15826
   macro avg       0.84      0.84      0.84     15826
weighted avg       0.84      0.84      0.84     15826



In [90]:
X_train_upd = calculate_vif_(X_train, thresh=5.0)

  vif = 1. / (1. - r_squared_i)
  return 1 - self.ssr/self.centered_tss


dropping 'SCPUBPRI_1' at index: 42
dropping 'SCPUBPRI_4' at index: 44
dropping 'DISTASSI_-1' at index: 44
dropping 'DISTASSI_1' at index: 44
dropping 'SCHRTSCHL_-1' at index: 45
dropping 'SNEIGHBRX_1' at index: 47
dropping 'SPUBCHOIX_1' at index: 48
dropping 'SCONSIDR_1' at index: 50
dropping 'SCONSIDR_2' at index: 50
dropping 'SPERFORM_-1' at index: 50
dropping 'S1STCHOI_1' at index: 52
dropping 'SSAMSC_1' at index: 53
dropping 'SNETCRSX_1' at index: 54
dropping 'FSNOTESX_1' at index: 55
dropping 'FSMEMO_1' at index: 56
dropping 'FSPHONCHX_1' at index: 57
dropping 'HDSCHLX_-1' at index: 58
dropping 'HDSCHLX_1' at index: 58
dropping 'HDIEPX_-1' at index: 59
dropping 'HDIEPX_1' at index: 59
dropping 'HDIEPX_2' at index: 59
dropping 'CENGLPRG_-1' at index: 59
dropping 'CENGLPRG_1' at index: 59
dropping 'FHAMOUNT_-1' at index: 60
dropping 'FHAMOUNT_1' at index: 60
dropping 'CLIVYN_1' at index: 62
dropping 'CSPEAKX_1' at index: 63
dropping 'HHENGLISH_1' at index: 68
dropping 'P1REL_1' at i

In [91]:
X_train_upd.to_csv('X_train_upd.csv')

In [111]:
import statsmodels.api as sm

In [112]:
X_train_upd_0 = sm.add_constant(X_train_upd)
logit_mod = sm.Logit(y_train, X_train_upd_0)
logit_res = logit_mod.fit()
print(logit_res.summary())

  return ptp(axis=axis, out=out, **kwargs)


Optimization terminated successfully.
         Current function value: 0.387624
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:               SEGRADES   No. Observations:                15826
Model:                          Logit   Df Residuals:                    15658
Method:                           MLE   Df Model:                          167
Date:                Wed, 12 Feb 2020   Pseudo R-squ.:                  0.4408
Time:                        13:00:24   Log-Likelihood:                -6134.5
converged:                       True   LL-Null:                       -10970.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const            2.9280      0.285     10.284      0.000       2.370       3.486
FSFREQ           0.1114

In [114]:
logit_res_reg = logit_mod.fit_regularized(method = 'l1', alpha = 0.1, maxiter = 5000)
print(logit_res_reg.summary())

Optimization terminated successfully.    (Exit mode 0)
            Current function value: 0.3881068801500095
            Iterations: 1248
            Function evaluations: 1249
            Gradient evaluations: 1248


Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers


                           Logit Regression Results                           
Dep. Variable:               SEGRADES   No. Observations:                15826
Model:                          Logit   Df Residuals:                    15658
Method:                           MLE   Df Model:                          167
Date:                Wed, 12 Feb 2020   Pseudo R-squ.:                  0.4408
Time:                        13:12:29   Log-Likelihood:                -6134.6
converged:                       True   LL-Null:                       -10970.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const            2.8423      0.281     10.120      0.000       2.292       3.393
FSFREQ           0.1113      0.032      3.438      0.001       0.048       0.175
FSSPPERF        -0.2317      0.034     -6.81

In [96]:
y_train_hat = lr_clf.predict(X_train_upd)
print(classification_report(y_train, y_train_hat))

              precision    recall  f1-score   support

           0       0.82      0.84      0.83      7913
           1       0.84      0.82      0.83      7913

    accuracy                           0.83     15826
   macro avg       0.83      0.83      0.83     15826
weighted avg       0.83      0.83      0.83     15826



In [97]:
lr_clf = LogisticRegression(random_state = 20720, penalty = 'l1', solver = 'saga', max_iter = 5000)
lr_clf.fit(X_train_upd, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=5000,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=20720, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)

In [98]:
y_train_hat = lr_clf.predict(X_train_upd)
print(classification_report(y_train, y_train_hat))

              precision    recall  f1-score   support

           0       0.82      0.84      0.83      7913
           1       0.84      0.82      0.83      7913

    accuracy                           0.83     15826
   macro avg       0.83      0.83      0.83     15826
weighted avg       0.83      0.83      0.83     15826



In [104]:
print(X_train_upd.columns)

Index(['FSFREQ', 'FSSPPERF', 'FSSPHW', 'FSSPCOUR', 'FSSPROLE', 'FSSPCOLL',
       'FCSCHOOL', 'FCTEACHR', 'FCSTDS', 'FCORDER',
       ...
       'FOSPORT_2', 'FORESPON_2', 'FOHISTX_2', 'FOLIBRAYX_2', 'FOBOOKSTX_2',
       'FOCONCRTX_2', 'FOGROUPX_2', 'FOSPRTEVX_2', 'HDDEVIEPX_1',
       'HDDEVIEPX_2'],
      dtype='object', length=167)


In [109]:
print(list(zip(X_train_upd.columns, lr_clf.coef_[0])))

[('FSFREQ', 0.11023301030878188), ('FSSPPERF', -0.225831567132003), ('FSSPHW', 0.0409536027926186), ('FSSPCOUR', 0.018881287426289296), ('FSSPROLE', 0.08887187936432256), ('FSSPCOLL', -0.10134715279501479), ('FCSCHOOL', -0.33496360965324207), ('FCTEACHR', -0.3274727789740915), ('FCSTDS', -0.0549955640067768), ('FCORDER', 0.15333407979902922), ('FCSUPPRT', 0.1775180450273288), ('FHHOME', 0.18394800944330966), ('FHWKHRS', 0.18782394867525365), ('FHCHECKX', -0.01717570719927581), ('FHHELP', -0.46799501898903867), ('FODINNERX', -0.00552270111228819), ('HDHEALTH', -0.07501656888988859), ('HHBROSX', 0.01310227761173736), ('HHSISSX', -0.08062956564784726), ('HHMOM', -0.03925243440905829), ('HHDAD', 0.04965542157877613), ('HHAUNTSX', -0.0631383375540246), ('HHUNCLSX', -0.02525830406983699), ('HHGMASX', 0.09639140744285866), ('HHGPASX', -0.08232664483059438), ('HHCSNSX', 0.04689670928823984), ('HHPRTNRSX', 0.02103508987381549), ('HHORELSX', 0.029367413227377428), ('HHONRELSX', 0.014296133583125

In [76]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

In [71]:
rf_clf = RandomForestClassifier(random_state = 20720)
rf_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=20720,
                       verbose=0, warm_start=False)

In [72]:
y_train_hat = rf_clf.predict(X_train)
print(classification_report(y_train, y_train_hat))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7913
           1       1.00      1.00      1.00      7913

    accuracy                           1.00     15826
   macro avg       1.00      1.00      1.00     15826
weighted avg       1.00      1.00      1.00     15826



In [77]:
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 4), random_state = 20720)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=4,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                          

In [78]:
y_train_hat = ada_clf.predict(X_train)
print(classification_report(y_train, y_train_hat))

              precision    recall  f1-score   support

           0       0.99      0.97      0.98      7913
           1       0.97      0.99      0.98      7913

    accuracy                           0.98     15826
   macro avg       0.98      0.98      0.98     15826
weighted avg       0.98      0.98      0.98     15826



In [80]:
from sklearn.svm import LinearSVC

In [85]:
svc_clf = LinearSVC(fit_intercept = 'False', max_iter = 5000, random_state = 20720)
svc_clf.fit(X_train, y_train)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept='False',
          intercept_scaling=1, loss='squared_hinge', max_iter=5000,
          multi_class='ovr', penalty='l2', random_state=20720, tol=0.0001,
          verbose=0)

In [86]:
y_train_hat = svc_clf.predict(X_train)
print(classification_report(y_train, y_train_hat))

              precision    recall  f1-score   support

           0       0.83      0.85      0.84      7913
           1       0.85      0.82      0.83      7913

    accuracy                           0.84     15826
   macro avg       0.84      0.84      0.84     15826
weighted avg       0.84      0.84      0.84     15826



### Test

In [60]:
X_test_cont = X_test[X_cont_labels]
X_test_cat = X_test.drop(columns = X_cont_labels)

In [61]:
X_test_cont = sclr.transform(X_test_cont)
X_test_cont = pd.DataFrame(X_test_cont, columns = X_cont_labels)

In [62]:
X_test_cat = encoder_transform(encdr, X_test_cat)

In [63]:
X_test = pd.concat((X_test_cont, X_test_cat), axis = 1)

In [110]:
y_test_hat = lr_clf.predict(X_test)
print(classification_report(y_test, y_test_hat))

ValueError: X has 406 features per sample; expecting 167

In [74]:
y_test_hat = rf_clf.predict(X_test)
print(classification_report(y_test, y_test_hat))

              precision    recall  f1-score   support

           0       0.57      0.21      0.31       327
           1       0.88      0.97      0.93      1979

    accuracy                           0.87      2306
   macro avg       0.73      0.59      0.62      2306
weighted avg       0.84      0.87      0.84      2306



In [79]:
y_test_hat = ada_clf.predict(X_test)
print(classification_report(y_test, y_test_hat))

              precision    recall  f1-score   support

           0       0.21      0.37      0.27       327
           1       0.88      0.77      0.82      1979

    accuracy                           0.71      2306
   macro avg       0.54      0.57      0.54      2306
weighted avg       0.79      0.71      0.74      2306



In [87]:
y_test_hat = svc_clf.predict(X_test)
print(classification_report(y_test, y_test_hat))

              precision    recall  f1-score   support

           0       0.29      0.73      0.41       327
           1       0.94      0.70      0.80      1979

    accuracy                           0.71      2306
   macro avg       0.61      0.72      0.61      2306
weighted avg       0.85      0.71      0.75      2306

