### Objective

I aimed to answer the following question:

- Is increased parental involvement correlated with stronger student performance?

### Data Preparation

I used data from the Parent and Family Involvement in Education Survey (2016). The data was collected as part of the National Household Education Surveys Program by the National Center for Education Statistics. The survey covers children from kindergarten through 12th grade and asks about the child's performance at school, the parents' involvement, and other parent, child, and school characteristics. The survey is filled out by the parents (or guardians). The data is compiled in a CSV file with 822 columns and 14,075 entries. The data is nationally representative and was collected using two-stage address-based sampling.

I first prepared the data for analysis by dropping the records for home-schooled children, recoding ordinal features where necessary so that their codes reflect a continuous scale, and handling missing values. For missing values, I used the following approach:

1. When the number of missing values was large, I dropped the feature.
2. When the number of missing values was small, I dropped the affected observations.
3. For categorical features, I kept the missing values as a separate category.

This left a total of 13,095 observations.
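
The three missing-value rules above could be sketched roughly as follows. This is a minimal illustration on toy data, not the code in `data_prep.py`: the function name `handle_missing`, the 30% threshold for a "large" share of missing values, and the `"missing"` label are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def handle_missing(df, categorical_cols, drop_col_threshold=0.3, missing_label="missing"):
    """Apply the three missing-value rules to a copy of df.

    drop_col_threshold and missing_label are illustrative assumptions,
    not values taken from the original analysis.
    """
    out = df.copy()

    # Rule 3: for categorical features, keep missing values as their own category.
    for col in categorical_cols:
        out[col] = out[col].astype("object").where(out[col].notna(), missing_label)

    # Rule 1: drop features whose share of missing values is large.
    missing_share = out.isna().mean()
    out = out.drop(columns=missing_share[missing_share > drop_col_threshold].index)

    # Rule 2: drop the observations that still contain missing values.
    out = out.dropna(axis=0).reset_index(drop=True)
    return out


# Toy data (hypothetical column names), only to show the behaviour of each rule.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "few_missing": [1.0, 2.0, np.nan, 4.0, 5.0],
    "category": ["a", None, "b", "a", None],
})
cleaned = handle_missing(toy, categorical_cols=["category"])
print(cleaned)
```

Handling the categorical columns first means that their missing values are never counted when deciding which features or observations to drop.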

I identified four groups of features:

1. School characteristics
2. Parent characteristics
3. Student characteristics
4. Parental involvement

In total, there are 170 features, 42 of them continuous.
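
One possible way to organize such groups is a plain mapping from group name to column names, plus a separate list of the continuous features. The sketch below is only an illustration: it is not how `select_feats()` is implemented, the column lists are small samples taken from the summary tables in this notebook, the choice of which columns count as continuous is an assumption, and the toy values are made up.

```python
import pandas as pd

# A few columns per group, for illustration only (the full analysis has 170 features).
FEATURE_GROUPS = {
    "school_characteristics": ["SCPUBPRI", "DISTASSI"],
    "parent_characteristics": ["HHTOTALXX", "YRSADDR"],
    "student_characteristics": ["GRADE", "HDHEALTH"],
    "parental_involvement": ["FSVOL", "FSFREQ"],
}

# Assumed continuous features; the real list is returned as X_cont_labels.
CONTINUOUS = ["HHTOTALXX", "YRSADDR", "FSFREQ"]


def split_features(df, groups=FEATURE_GROUPS):
    """Return one DataFrame per feature group, keyed by group name."""
    return {name: df[cols] for name, cols in groups.items()}


# Two made-up rows, with columns in the same order as FEATURE_GROUPS.
toy = pd.DataFrame(
    [[4, 1, 4, 9, 10, 1, 2, 5], [1, 1, 3, 2, 7, 2, 1, 10]],
    columns=[col for cols in FEATURE_GROUPS.values() for col in cols],
)

frames = split_features(toy)
n_features = sum(frame.shape[1] for frame in frames.values())
n_continuous = sum(col in CONTINUOUS for frame in frames.values() for col in frame.columns)
print(n_features, n_continuous)
```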

I also identified 13 potential target variables.

In [49]:
%run data_prep.py

In [62]:
y_df, school_characteristics_df, parent_characteristics_df, student_characteristics_df, parental_involvement_df, X_cont_labels = select_feats()

In [63]:
y_df.describe().T

variable,count,mean,std,min,25%,50%,75%,max
SEENJOY,13095.0,1.752195,0.700917,1.0,1.0,2.0,2.0,4.0
SEGRADES,13095.0,2.050477,1.31422,1.0,1.0,2.0,2.0,5.0
SEADPLCXX,13095.0,-0.012142,1.288855,-1.0,-1.0,-1.0,1.0,2.0
SEBEHAVX,13095.0,0.508286,2.10106,0.0,0.0,0.0,0.0,75.0
SESCHWRK,13095.0,0.607637,2.353676,0.0,0.0,0.0,0.0,97.0
SEGBEHAV,13095.0,1.20756,3.507458,0.0,0.0,0.0,1.0,99.0
SEGWORK,13095.0,1.190454,3.263526,0.0,0.0,0.0,1.0,99.0
SEABSNT,13095.0,4.228942,6.943639,0.0,1.0,3.0,5.0,364.0
SEREPEAT,13095.0,1.923559,0.265713,1.0,2.0,2.0,2.0,2.0
SESUSOUT,13095.0,1.935319,0.245972,1.0,2.0,2.0,2.0,2.0


In [64]:
school_characteristics_df.describe().T

variable,count,mean,std,min,25%,50%,75%,max
SCPUBPRI,13095.0,3.756319,0.741296,1.0,4.0,4.0,4.0,4.0
DISTASSI,13095.0,0.926613,0.765875,-1.0,1.0,1.0,1.0,2.0
SCHRTSCHL,13095.0,1.606796,0.950766,-1.0,2.0,2.0,2.0,2.0
SNEIGHBRX,13095.0,1.814357,0.388833,1.0,2.0,2.0,2.0,2.0
SPUBCHOIX,13095.0,1.938297,0.770927,1.0,1.0,2.0,3.0,3.0
SCONSIDR,13095.0,1.694845,0.46049,1.0,1.0,2.0,2.0,2.0
SPERFORM,13095.0,-0.327606,1.038755,-1.0,-1.0,-1.0,1.0,2.0
S1STCHOI,13095.0,1.179,0.383367,1.0,1.0,1.0,1.0,2.0
SSAMSC,13095.0,1.027033,0.162186,1.0,1.0,1.0,1.0,2.0
SNETCRSX,13095.0,1.957694,0.201295,1.0,2.0,2.0,2.0,2.0


In [65]:
parent_characteristics_df.describe().T

variable,count,mean,std,min,25%,50%,75%,max
FHAMOUNT,13095.0,1.248110,0.851458,-1.0,1.0,1.0,1.0,3.0
CLIVYN,13095.0,1.808477,0.393515,1.0,2.0,2.0,2.0,2.0
CSPEAKX,13095.0,2.317831,0.949270,1.0,2.0,2.0,2.0,6.0
HHTOTALXX,13095.0,4.023673,1.217668,2.0,3.0,4.0,5.0,10.0
HHBROSX,13095.0,0.527606,0.709736,0.0,0.0,0.0,1.0,5.0
...,...,...,...,...,...,...,...,...
YRSADDR,13095.0,9.627186,8.231129,0.0,3.0,8.0,14.0,70.0
OWNRNTHB,13095.0,1.261168,0.480314,1.0,1.0,1.0,1.0,3.0
HVINTSPHO,13095.0,1.052692,0.223426,1.0,1.0,1.0,1.0,2.0
HVINTCOM,13095.0,1.071554,0.257758,1.0,1.0,1.0,1.0,2.0


In [66]:
student_characteristics_df.describe().T

variable,count,mean,std,min,25%,50%,75%,max
GRADE,13095.0,9.646354,3.835666,2.0,7.0,10.0,13.0,15.0
FHCAMT,13095.0,1.274609,0.754524,-1.0,1.0,1.0,2.0,3.0
HDHEALTH,13095.0,1.565178,0.76416,1.0,1.0,1.0,2.0,5.0
HDINTDIS,13095.0,1.982589,0.130803,1.0,2.0,2.0,2.0,2.0
HDSPEECHX,13095.0,1.937152,0.242699,1.0,2.0,2.0,2.0,2.0
HDDISTRBX,13095.0,1.970981,0.167865,1.0,2.0,2.0,2.0,2.0
HDDEAFIMX,13095.0,1.988163,0.108154,1.0,2.0,2.0,2.0,2.0
HDBLINDX,13095.0,1.986789,0.114182,1.0,2.0,2.0,2.0,2.0
HDORTHOX,13095.0,1.978007,0.146667,1.0,2.0,2.0,2.0,2.0
HDAUTISMX,13095.0,1.975181,0.155578,1.0,2.0,2.0,2.0,2.0


In [67]:
parental_involvement_df.describe().T

variable,count,mean,std,min,25%,50%,75%,max
SEFUTUREX,13095.0,4.96617,1.211422,1.0,4.0,5.0,6.0,6.0
FSSPORTX,13095.0,1.193967,0.395419,1.0,1.0,1.0,1.0,2.0
FSVOL,13095.0,1.575181,0.494334,1.0,1.0,2.0,2.0,2.0
FSMTNG,13095.0,1.136082,0.342889,1.0,1.0,1.0,1.0,2.0
FSPTMTNG,13095.0,1.554334,0.497058,1.0,1.0,2.0,2.0,2.0
FSATCNFN,13095.0,1.253761,0.435179,1.0,1.0,1.0,2.0,2.0
FSFUNDRS,13095.0,1.385949,0.486837,1.0,1.0,1.0,2.0,2.0
FSCOMMTE,13095.0,1.868881,0.337543,1.0,2.0,2.0,2.0,2.0
FSCOUNSLR,13095.0,1.637572,0.48072,1.0,1.0,2.0,2.0,2.0
FSFREQ,13095.0,7.978007,9.032643,0.0,3.0,5.0,10.0,99.0
