In [1]:
import pandas as pd

In [2]:
%load_ext watermark
%watermark

2018-11-27T13:33:00-07:00

CPython 3.6.5
IPython 7.1.1

compiler   : GCC 4.2.1 (Apple Inc. build 5666) (dot 3)
system     : Darwin
release    : 18.2.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit


In [3]:
df = pd.read_csv('../data_processing/cleaned-data.csv')

# Binary Features

There are thirteen binary features in the raw dataset. I will convert them from their descriptive labels to binary `(1, 0)` values so they can be used in descriptive analysis (sums, counts, looping) and machine learning. The columns that I will transform are the following:

column | description | current values | transformed values
-------|-------------|----------------|-------------------
school | the school attended by the student | MS and GP | 0 (MS) and 1 (GP)
sex | the student's gender | M and F | 0 (M) and 1 (F)
address | an urban or rural address | U and R | 0 (U) and 1 (R)
famsize | <=3 or >3 in the family | LE3 and GT3 | 0 (LE3) and 1 (GT3)
Pstatus | separated or together parents | A and T | 0 (A) and 1 (T)
schoolsup | learning after school hours | no and yes | 0 (no) and 1 (yes)
famsup | family support at home | no and yes | 0 (no) and 1 (yes)
paid | paid tutoring | no and yes | 0 (no) and 1 (yes)
activities | extracurricular activities | no and yes | 0 (no) and 1 (yes)
nursery | attended nursery school | no and yes | 0 (no) and 1 (yes)
higher | planning for higher education | no and yes | 0 (no) and 1 (yes)
internet | internet access at home | no and yes | 0 (no) and 1 (yes)
romantic | in a romantic relationship | no and yes | 0 (no) and 1 (yes)
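
The table above can be sketched as a set of label-to-integer mappings. This is a minimal sketch, not the notebook's actual transformation step: the names `BINARY_MAPS`, `YES_NO_COLUMNS`, and `encode_binary_features` are hypothetical, and the toy `sample` frame stands in for the real data.

```python
import pandas as pd

# Label-to-integer mappings copied from the table above.
BINARY_MAPS = {
    'school': {'MS': 0, 'GP': 1},
    'sex': {'M': 0, 'F': 1},
    'address': {'U': 0, 'R': 1},
    'famsize': {'LE3': 0, 'GT3': 1},
    'Pstatus': {'A': 0, 'T': 1},
}
YES_NO_COLUMNS = ['schoolsup', 'famsup', 'paid', 'activities',
                  'nursery', 'higher', 'internet', 'romantic']


def encode_binary_features(frame):
    """Return a copy of frame with the binary label columns mapped to 0/1."""
    encoded = frame.copy()
    for column, mapping in BINARY_MAPS.items():
        if column in encoded.columns:
            encoded[column] = encoded[column].map(mapping)
    for column in YES_NO_COLUMNS:
        if column in encoded.columns:
            encoded[column] = encoded[column].map({'no': 0, 'yes': 1})
    return encoded


# Toy data (hypothetical rows, not taken from the dataset).
sample = pd.DataFrame({'school': ['GP', 'MS'],
                       'sex': ['F', 'M'],
                       'internet': ['yes', 'no']})
encoded_sample = encode_binary_features(sample)
print(encoded_sample)
```

Any label that is missing from a mapping becomes `NaN` under `Series.map`, which makes unexpected values easy to spot after encoding.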

# Correlated Features

`G1` and `G2` are described as period one and period two grades. In my experience, these features would correspond to trimester one (G1) and trimester two (G2) grades, which lead up to the final grade (G3) for the year. Both are highly correlated with the response variable (G3) because they come from the same grading process and measure the same performance that we are trying to predict. However, a powerful binary feature can be engineered from these two features that gets at the source of improved learning performance, resilience, and grit: whether a student's grade improved from period one to period two.

I will engineer a new binary feature that is `0` for students whose grade dropped or stayed the same from period one to period two and `1` for students whose grade improved, and use this new feature to perform statistical analysis and machine learning.

    improved = 0, if G1 - G2 >= 0
    improved = 1, if G1 - G2 < 0
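
The rule above can be sketched as a single vectorized comparison. The toy `grades` frame below is made up for illustration; only the column names `G1` and `G2` and the feature name `improved` come from this notebook.

```python
import pandas as pd

# Toy grades (hypothetical values): improved, unchanged, and dropped.
grades = pd.DataFrame({'G1': [10, 12, 8], 'G2': [12, 12, 7]})

# improved = 1 when G1 - G2 < 0 (i.e. G2 > G1), otherwise 0.
grades['improved'] = (grades['G2'] > grades['G1']).astype(int)
print(grades)
```

A student whose grade stayed the same gets `0`, matching the `G1 - G2 >= 0` case.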

****

# Remaining non-Numeric Features

At this point in feature engineering we have four non-numeric features: the mother's job (Mjob), the father's job (Fjob), the reason for attending the school (reason), and the student's guardianship status (guardian).

The mother's and father's job columns each cover several broad categories ('teacher', 'health' care related, civil 'services', 'at_home', or 'other'), and the most common response for both parents was 'other'. There may be some value in testing whether 'at_home' is associated with a change in the response variable (G3). This could lead to a binary feature: a parent at home or both parents working. However, there may be some correlation between the parent job columns and the guardian column. If the guardian column is 'other', then the parents may not be a strong presence within the family unit.

The reason for choosing the school doesn't naturally lend itself to a binary descriptor. The response choices are close to 'home', school 'reputation', 'course' preference, or 'other'. The question that I would like to investigate is: does a thoughtful decision to attend the school (school 'reputation' or 'course' preference) lead to higher final grades? If we end up rejecting the null hypothesis and accept that thoughtfully deciding to attend the school results in significantly higher grades, we can transform this feature to a binary, decided or not.

That leaves the guardian status with its input values ('mother', 'father', or 'other'). The hypothesis here would be that either parent guardian, mother or father, results in the same final grade. If we can accept this as true, then this feature could also be a binary, parent guardian or not.
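
The three candidate binary encodings described above could be sketched as follows. This is only an illustration on toy rows: the new column names `parent_at_home`, `thoughtful_reason`, and `parent_guardian` are hypothetical and are not created anywhere else in this notebook.

```python
import pandas as pd

# Toy rows (hypothetical), using the category labels from the dataset.
toy = pd.DataFrame({
    'Mjob': ['at_home', 'teacher', 'other'],
    'Fjob': ['other', 'services', 'at_home'],
    'reason': ['home', 'reputation', 'course'],
    'guardian': ['mother', 'other', 'father'],
})

# 1 if either parent is at home, else 0.
toy['parent_at_home'] = ((toy['Mjob'] == 'at_home')
                         | (toy['Fjob'] == 'at_home')).astype(int)
# 1 if the school was chosen for its reputation or course offerings, else 0.
toy['thoughtful_reason'] = toy['reason'].isin(['reputation', 'course']).astype(int)
# 1 if the guardian is the mother or the father, else 0.
toy['parent_guardian'] = toy['guardian'].isin(['mother', 'father']).astype(int)
print(toy)
```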

# Confidence interval for mother or father staying home (not leaving home for work) resulting in higher final grades

We want to be **99%** confident that having a mother or father at home will increase final grades.

    null hypothesis: there is no difference between having either parent at home and having both parents working
    alternate hypothesis: there is a significant difference

In [4]:
import numpy as np

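# Students with at least one parent at home.
at_home = df[(df['Mjob']=='at_home') | (df['Fjob']=='at_home')]
# NOTE: (~a) | (~b) keeps every student with at least one parent NOT at home,
# so this group overlaps with at_home; the strict complement would be (~a) & (~b).
not_at_home = df[(~(df['Mjob']=='at_home')) | (~(df['Fjob']=='at_home'))]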
print('at home mean final grade:', np.mean(at_home.G3), 
      '\nnot at home mean final grade:', np.mean(not_at_home.G3))
print('at home final grade standard deviation:', np.std(at_home.G3), 
      '\nnot at home final grade standard deviation:', np.std(not_at_home.G3))

at home mean final grade: 10.770491803278688 
not at home mean final grade: 11.475783475783476
at home final grade standard deviation: 2.8307065828392624 
not at home final grade standard deviation: 3.2103572413261
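
When building a comparison group from a boolean filter, it helps to check how the negated mask is combined. The sketch below uses made-up rows (not the real data) to show how `(~a) | (~b)` differs from the strict complement `~(a | b)`, which is equivalent to `(~a) & (~b)`.

```python
import pandas as pd

# Toy rows (hypothetical): one parent at home, the other, neither, both.
toy = pd.DataFrame({
    'Mjob': ['at_home', 'teacher', 'other', 'at_home'],
    'Fjob': ['other', 'at_home', 'services', 'at_home'],
    'G3': [10, 12, 14, 9],
})
mother_home = toy['Mjob'] == 'at_home'
father_home = toy['Fjob'] == 'at_home'

either = toy[mother_home | father_home]          # at least one parent at home
neither = toy[~(mother_home | father_home)]      # strict complement of `either`
not_both = toy[(~mother_home) | (~father_home)]  # everyone except "both at home"

print(len(either), len(neither), len(not_both))
```

Only `neither` is disjoint from `either`; `not_both` still contains students who have one parent at home.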


# 99% confidence interval function

In [5]:
def run_99_percent_confidence_interval(filter_with, filter_without, target_variable):
    """Print a 99% z-interval for the difference in means of target_variable."""
    with_values = filter_with[target_variable]
    without_values = filter_without[target_variable]
    # Standard error of the difference of means (np.std defaults to ddof=0).
    std_mean_diff = np.sqrt(np.std(with_values) ** 2 / len(with_values)
                            + np.std(without_values) ** 2 / len(without_values))
    observed_mean_diff = np.mean(with_values) - np.mean(without_values)
    # 2.58 is the approximate two-sided 99% critical value of the standard normal.
    conf_int = std_mean_diff * 2.58

    print('We are 99% confident that the true difference of means lies between: {:.2f} and {:.2f}'.format(
        observed_mean_diff - conf_int, observed_mean_diff + conf_int))

In [6]:
run_99_percent_confidence_interval(at_home, not_at_home, 'G3')

We are 99% confident that the true difference of means lies between: -1.74 and 0.33
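
To show where the `2.58` multiplier comes from, here is a minimal standard-library sketch of the same interval calculation. The function name `mean_diff_interval` and the toy samples are hypothetical; `pstdev` is the population standard deviation, matching NumPy's default `ddof=0`.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def mean_diff_interval(a, b, confidence=0.99):
    """Return (low, high) of a z-interval for mean(a) - mean(b)."""
    # Two-sided critical value, e.g. the 0.995 quantile for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(pstdev(a) ** 2 / len(a) + pstdev(b) ** 2 / len(b))
    diff = mean(a) - mean(b)
    return diff - z * standard_error, diff + z * standard_error


z_99 = NormalDist().inv_cdf(0.995)  # roughly 2.58
low, high = mean_diff_interval([10, 12, 11, 13], [11, 12, 14, 15])
print(round(z_99, 2), low, high)
```

If the interval contains 0, the observed difference of means is not significant at the chosen confidence level.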


# Result

The resulting 99% confidence interval includes 0 and therefore we find no significant difference between having a parent at home and not having a parent at home.

****

# Confidence interval for thoughtful reason to attend school (course offerings or school reputation) resulting in higher final grades

We want to be **99%** confident that making a thoughtful decision to attend the school will increase final grades.

    null hypothesis: there is no difference between thoughtfully deciding or not
    alternate hypothesis: there is a significant difference

In [7]:
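# Students who chose the school for its reputation or course offerings.
thoughtful = df[(df['reason']=='reputation') | (df['reason']=='course')]
# NOTE: no row can have both reasons, so (~a) | (~b) is True for every row and
# this selects the whole dataset; the strict complement would be (~a) & (~b).
not_thoughtful = df[(~(df['reason']=='reputation')) | (~(df['reason']=='course'))]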
print('thoughtful mean final grade:', np.mean(thoughtful.G3), 
      '\nnot thoughtful mean final grade:', np.mean(not_thoughtful.G3))
print('thoughtful final grade standard deviation:', np.std(thoughtful.G3), 
      '\nnot thoughtful final grade standard deviation:', np.std(not_thoughtful.G3))

thoughtful mean final grade: 11.528888888888888 
not thoughtful mean final grade: 11.523809523809524
thoughtful final grade standard deviation: 3.2877295253865952 
not thoughtful final grade standard deviation: 3.2232730459867853


In [8]:
run_99_percent_confidence_interval(thoughtful, not_thoughtful, 'G3')

We are 99% confident that the true difference of means lies between: -0.71 and 0.72


# Result

The resulting 99% confidence interval includes 0 and therefore we find no significant difference between thoughtfully choosing the school and simply choosing the school because it is close by or for another reason.

****

# Confidence interval for mother guardianship resulting in higher final grades

We want to be **99%** confident that having a mother guardian will increase final grades.

    null hypothesis: there is no difference between mother and father guardians
    alternate hypothesis: there is a significant difference

In [9]:
mother = df[df['guardian']=='mother']
father = df[df['guardian']=='father']
print('mother mean final grade:', np.mean(mother.G3), 
      '\nfather mean final grade:', np.mean(father.G3))
print('mother final grade standard deviation:', np.std(mother.G3), 
      '\nfather final grade standard deviation:', np.std(father.G3))

mother mean final grade: 11.540322580645162 
father mean final grade: 11.731707317073171
mother final grade standard deviation: 3.313947051060419 
father final grade standard deviation: 3.064544610612524


In [10]:
run_99_percent_confidence_interval(mother, father, 'G3')

We are 99% confident that the true difference of means lies between: -1.22 and 0.84


## Additional look at the parents vs. other

In [11]:
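# Students whose guardian is the mother or the father.
parent = df[(df['guardian']=='mother') | (df['guardian']=='father')]
# NOTE: no row can be both 'mother' and 'father', so (~a) | (~b) is True for every
# row and this selects the whole dataset; the strict complement would be (~a) & (~b).
other = df[(~(df['guardian']=='mother')) | (~(df['guardian']=='father'))]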
print('parent mean final grade:', np.mean(parent.G3), 
      '\nother mean final grade:', np.mean(other.G3))
print('parent final grade standard deviation:', np.std(parent.G3), 
      '\nother final grade standard deviation:', np.std(other.G3))

parent mean final grade: 11.587878787878788 
other mean final grade: 11.523809523809524
parent final grade standard deviation: 3.2548106657924216 
other final grade standard deviation: 3.2232730459867853


In [12]:
run_99_percent_confidence_interval(parent, other, 'G3')

We are 99% confident that the true difference of means lies between: -0.57 and 0.70


# Result

The resulting 99% confidence interval includes 0 and therefore we find no significant difference between having a mother, father, or other guardian at home.

****

### Therefore, we will leave these features without any further engineering.