# Student Alcohol Consumption
## Eric Lin, Naveen Janarthanan, Estelle Jiang, Nuo Chen

In [15]:
# import data
from IPython.display import HTML

from statistical_analysis import df_nice, regression, regression_aft_removing, df
import plotly
import plotly.plotly as py
import plotly.tools as tls
import plotly.graph_objs as go
import pandas as pd

accuracy_df = pd.read_csv("accuracy_df.csv")

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to view imported code."></form>''')

## Project Description
> Abusive alcohol consumption by adolescents remains a persistent issue. Adolescents often start drinking at a very young age in response to physical, emotional, and lifestyle changes; puberty and learning to live independently often contribute to the onset of alcohol consumption. Because many adolescents are still maturing at this age, they tend to make poor decisions about anything they might consider _"risky"_ or _"cool"_, such as consuming large amounts of alcohol to get drunk. In fact, **51%** of junior and senior high school students have had at least one drink within the past year and **8 million students drink weekly**. **More than 3 million students drink alone**, **more than 4 million drink when they are upset**, and **less than 3 million drink because they are bored**. In addition, parents, friends, and alcoholic beverage advertisements influence students' attitudes about alcohol. Students' drinking habits are often heavily influenced by their surroundings and are likely shaped by local student culture and norms. As such, we wanted to examine this issue in further detail by **analyzing the variables that could potentially affect student alcohol consumption**, such as personal statistics, parent statistics, and education values, and to **produce a model that helps predict student drinking rates based on these features**.


## Data Set 

> We obtained this Student Alcohol Consumption dataset from Kaggle. The data was collected through a **survey of students aged 15-22** enrolled in math and/or Portuguese language courses during their secondary school education at Gabriel Pereira or Mousinho da Silveira. The dataset contains **more than thirty features about each student**, such as gender, age, and whether the student is in a romantic relationship ('Social Index' and 'Drinking Index' in the table below are variables we created, further explained in the 'New Variables' section). The website provided us with two datasets: one containing **students from a math course**, and the other containing **students from a Portuguese language course**. We will be using both of these datasets to **predict students' weekly alcohol consumption rates**. A sample of the dataset features can be seen below:

In [11]:
df.head()

Unnamed: 0,Course,Dalc,Fedu,Fjob,G1,G2,G3,Medu,Mjob,Pstatus,Walc,absences,activities,address,age,failures,famrel,famsize,famsup,freetime,goout,guardian,health,higher,internet,nursery,paid,reason,romantic,school,schoolsup,sex,studytime,traveltime
0,Math,1,4,teacher,5,6,6,4,at_home,A,1,6,no,U,18,0,4,GT3,no,3,4,mother,3,yes,no,yes,no,course,no,GP,yes,F,2,2
1,Math,1,1,other,5,5,6,1,at_home,T,1,4,no,U,17,0,5,GT3,yes,3,3,father,3,yes,yes,no,no,course,no,GP,no,F,2,1
2,Math,2,1,other,7,8,10,1,at_home,T,3,10,no,U,15,3,4,LE3,no,3,2,mother,3,yes,yes,yes,yes,other,no,GP,yes,F,2,1
3,Math,1,2,services,15,14,15,4,health,T,1,2,yes,U,15,0,3,GT3,yes,2,2,mother,5,yes,yes,yes,yes,home,yes,GP,no,F,3,1
4,Math,1,3,other,6,10,10,3,other,T,2,4,no,U,16,0,4,GT3,yes,3,2,father,5,yes,no,yes,yes,home,no,GP,no,F,2,1


- **school** - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
- **sex** - student's sex (binary: 'F' - female or 'M' - male)
- **age** - student's age (numeric: from 15 to 22)
- **address** - student's home address type (binary: 'U' - urban or 'R' - rural)
- **famsize** - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
- **Pstatus** - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
- **Medu** - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- **Fedu** - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- **Mjob** - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- **Fjob** - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- **reason** - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
- **guardian** - student's guardian (nominal: 'mother', 'father' or 'other')
- **traveltime** - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- **studytime** - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- **failures** - number of past class failures (numeric: n if 1<=n<3, else 4)
- **schoolsup** - extra educational support (binary: yes or no)
- **famsup** - family educational support (binary: yes or no)
- **paid** - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- **activities** - extra-curricular activities (binary: yes or no)
- **nursery** - attended nursery school (binary: yes or no)
- **higher** - wants to take higher education (binary: yes or no)
- **internet** - Internet access at home (binary: yes or no)
- **romantic** - with a romantic relationship (binary: yes or no)
- **famrel** - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- **freetime** - free time after school (numeric: from 1 - very low to 5 - very high)
- **goout** - going out with friends (numeric: from 1 - very low to 5 - very high)
- **Dalc** - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- **Walc** - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- **health** - current health status (numeric: from 1 - very bad to 5 - very good)
- **absences** - number of school absences (numeric: from 0 to 93)

These grades are related with the course subject, Math or Portuguese:
- **G1** - first period grade (numeric: from 0 to 20)
- **G2** - second period grade (numeric: from 0 to 20)
- **G3** - final grade (numeric: from 0 to 20, output target)

Descriptions from https://www.kaggle.com/uciml/student-alcohol-consumption#student-por.csv

## Data Preparation

### Data Cleaning
> Upon analyzing both datasets, we were glad to find that there was **no missing information**. As such, we did not need to worry about any significant manipulation of the data obtained. The only data manipulation we had to perform was **combining the 2 datasets**: the math course dataset and the Portuguese language course dataset. Because we were analyzing drinking rates based on students' personal features, and since we came to a consensus that the **type of course would _NOT_ have a significant impact on the amount a student drinks**, we combined the rows of both dataframes into one large dataframe (**1044 rows**) so that we have **more data to fit a stronger model for predicting students' weekly drinking rates**.
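
The row-wise combination described above can be sketched with `pandas.concat`. This is a minimal illustration on toy frames, not the project's actual loading code: the two small DataFrames stand in for the Kaggle files (assumed here to be `student-mat.csv` and `student-por.csv`), and the `Course` label column is an assumption based on the `Course` column visible in the sample table.

```python
import pandas as pd

# Toy stand-ins for the two course datasets (hypothetical values).
# In practice these would come from something like pd.read_csv("student-mat.csv").
math_df = pd.DataFrame({"sex": ["F", "M"], "Dalc": [1, 2], "Walc": [1, 3]})
por_df = pd.DataFrame({"sex": ["F", "F", "M"], "Dalc": [1, 1, 3], "Walc": [2, 1, 4]})

# Tag each row with its source course so the origin is not lost after stacking.
math_df["Course"] = "Math"
por_df["Course"] = "Portuguese"

# Stack the rows; ignore_index=True gives the result a fresh 0..n-1 index.
combined = pd.concat([math_df, por_df], ignore_index=True)

# Check for missing values, mirroring the "no missing information" observation.
has_missing = bool(combined.isna().any().any())
print(len(combined), has_missing)
```

Keeping the `Course` column makes it possible to check later whether the course really has no effect on drinking, rather than relying on the assumption alone.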

### New Variables
#### Drinking per Week Index (DWI)
> The **Drinking per Week Index (DWI) is an index that measures the amount of alcohol a student consumes throughout the entire week**. The dataset only reports separate weekday and weekend consumption levels, so we needed a way of measuring alcohol consumption over the entire week. The most direct approach is to combine the weekday and weekend indexes into an average. However, since the two indexes cover different numbers of days, we use a weighted average: each index is multiplied by the number of days it covers, and the sum is divided by the number of days in a week (as shown below). This **DWI ranges from one to five, where five represents high alcohol consumption and one represents low alcohol consumption**. Ultimately, the DWI lets us convert the two ordinal consumption ratings into a single continuous variable, which we can use to run a statistical regression model and make predictions about how much alcohol a student consumes in the entire week.


$$ 
Drinking~per~Week~Index = \bigg[ \dfrac{ \big(Weekday~Alc~Consumption \times 5\big) + \big(Weekend~Alc~Consumption \times 2\big) }{7}  \bigg]
$$
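
As a worked example, the sketch below computes the DWI following the prose definition (each index weighted by the number of days it covers, so five for weekdays and two for weekends). The function name is hypothetical, and the day weights are an assumption drawn from that description.

```python
def drinking_per_week_index(dalc: int, walc: int) -> float:
    """Day-weighted average of weekday (Dalc) and weekend (Walc) levels.

    Assumes 5 weekdays and 2 weekend days, per the prose definition.
    """
    return (5 * dalc + 2 * walc) / 7


# A student who drinks very little on weekdays (1) and moderately on weekends (3).
example = drinking_per_week_index(dalc=1, walc=3)
print(round(example, 3))

# The index stays within the original 1-5 scale at both extremes.
lowest = drinking_per_week_index(1, 1)
highest = drinking_per_week_index(5, 5)
```

Because the function only uses arithmetic, the same expression also works column-wise on a DataFrame, for example `(5 * df["Dalc"] + 2 * df["Walc"]) / 7`.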

#### Social Index
> A study done by healthtalk on drugs and alcohol suggests that **most high school and college students drink to socialize with others, to have fun and relax**. When junior/senior high school students and college students go out with their friends, it is usually to grab some drinks and attend a party. Students tend to drink at abnormally high rates during their first year of college, when they move out of the house. Students with access to the internet tend to use social media at high rates, so internet access is closely associated with how social a student is. Additionally, whether a student is in a relationship strongly affects how social they are, as relationships usually involve meeting your partner's friends, spending time with your partner often, and attending various events with your partner and your/their friends. Engaging in extra-curricular activities has a negative correlation with how social a student is, as a student involved in more extra-curricular activities will likely _NOT_ have time to socialize with others. As such, **we decided to create the Social Index feature, which indicates how social a given student is based on how often they go out with friends, whether they have access to the internet (1=yes, 0=no), whether they are in a romantic relationship (1=yes, 0=no), and whether they are doing any extra-curricular activities (1=yes, 0=no).** We chose the weights in the Social Index formula based on how each feature correlates with the DWI. Below is the formula to determine the Social Index:

$$ 
Social~Index = \big(0.25 \times Go~Out~w/~Friends \big) + \big(0.02 \times Internet \big) + \big(0.03 \times In~a~Romantic~Relationship \big) + \big(\text{-} 0.01 \times Extra~Curricular~Activities \big)
$$
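
The formula above can be sketched as a small function. The weights come directly from the formula; the function name and the `"yes"`/`"no"` encoding of the binary inputs (matching the raw dataset values) are assumptions for illustration.

```python
def social_index(goout: int, internet: str, romantic: str, activities: str) -> float:
    """Social Index from the go-out level (1-5) and three yes/no features."""

    def flag(value: str) -> int:
        # Map the dataset's "yes"/"no" strings to 1/0.
        return 1 if value == "yes" else 0

    return (
        0.25 * goout
        + 0.02 * flag(internet)
        + 0.03 * flag(romantic)
        - 0.01 * flag(activities)
    )


# Goes out often, has internet, not in a relationship, does activities.
example = social_index(goout=4, internet="yes", romantic="no", activities="yes")

# Extremes of the index given the 1-5 go-out scale.
lowest = social_index(1, "no", "no", "yes")
highest = social_index(5, "yes", "yes", "no")
print(round(example, 2), round(lowest, 2), round(highest, 2))
```

With these weights the index ranges from 0.24 to 1.30 and is dominated by the go-out level; the three binary features shift it by at most a few hundredths.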

## Exploratory Data Analysis and Statistical Analysis

### Distribution of Students' Social Index

![](img/plt_hist_social_index.png)

> The distribution of students' social index (illustrated above) lets us see how social most students in the dataset are. **A lower social index indicates fewer social interactions and thus a less social student. Conversely, a higher social index indicates many more social interactions, likely from going out often with friends, and thus a more social student.** The graph illustrates that most students are social to some extent, with very few being completely anti-social (only ~100 students have a social index less than 0.5). The remaining students are somewhat social (with an index of 0.5 - 1.0) or very social (with an index of 1.2 to 1.29), which makes sense since most students at this age tend to be very social in order to gain popularity. We believe that **this desire to be popular has a strong correlation with drinking**: many students regarded as "cool" partake in illegal activities such as underage drinking, so a student who wants to be seen as "cool" and be a "popular kid" may feel they have to engage in these activities as well.

### Age Distribution in our dataset

In [3]:
# Age Distribution - plotly
data = [go.Bar(
            x= df['age'].value_counts(),
            y= df['age'].value_counts().index,
            marker=dict(
                color='rgba(122, 120, 168, 0.8)',
            ),
            orientation = 'h'
)]

layout = go.Layout(
    title='Age Distribution',
    xaxis=dict(
        title='Occurrence of particular age'),
    yaxis=dict(
        title='Age')
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='horizontal-bar')





> Since our dataset is about the drinking behaviors of students, we were curious about its age distribution. We therefore counted the occurrences of each age in the dataset and plotted the results as a horizontal bar chart. We found that **the majority of the students are around 15 - 18 years old**, with a few notable outliers as old as 21-22. Because the students who took the survey were in secondary school, this age distribution is what we would expect.

### Alcohol Consumption Comparison

In [12]:

dr = df['Dalc'].value_counts().tolist()
wr = df['Walc'].value_counts().tolist()

# draw graph by using plotly
trace0 = go.Bar(
    x = df['Dalc'].value_counts().index.tolist(),
    y = dr,
    name='Weekday Alcohol Consumption',
    marker=dict(
        color='rgb(49,130,189)'
    )
)
trace1 = go.Bar(
    x = df['Walc'].value_counts().index,
    y = wr,
    name='Weekends Alcohol Consumption',
    marker=dict(
        color='rgb(204,204,204)',
    )
)

data = [trace0, trace1]
layout = go.Layout(
    xaxis=dict(tickangle=-45),
    barmode='group',   
    title='Weekday Alcohol Consumption VS Weekends Alcohol Consumption',
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='angled-text-bar')





> Our dataset records each student's alcohol consumption level for both weekdays and weekends, so we compared the number of students at each consumption level across the two. Based on the graph, most students drink little (levels 1-2) on both weekdays and weekends. However, more students drink heavily (level 5) on weekends than on weekdays.

### Correlation between the Social Index and the DWI (the 2 New Variables)

![](img/plt_scatter_social_dwi.png)

> The scatter plot above shows the relationship between the social index and the drinking per week index, both features that we created. Ultimately, **we wanted to see whether there was a correlation between the social index and the variable we are predicting in the statistical analysis: the DWI.** From the scatter plot, there seems to be a positive linear correlation, such that **as a student's social index increases, their weekly drinking consumption rate increases**. This is a simple way of checking that the feature we created has some relationship with our desired output before we fit regression models on the entire dataset with these new variables.

### Regression Summary

In [5]:
regression.summary()

0,1,2,3
Dep. Variable:,drinking,R-squared:,0.393
Model:,OLS,Adj. R-squared:,0.354
Method:,Least Squares,F-statistic:,10.07
Date:,"Tue, 12 Mar 2019",Prob (F-statistic):,5.39e-70
Time:,15:53:57,Log-Likelihood:,-1320.1
No. Observations:,1044,AIC:,2768.0
Df Residuals:,980,BIC:,3085.0
Df Model:,63,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0987,0.690,0.143,0.886,-1.256,1.453
school[T.MS],-0.0349,0.080,-0.437,0.663,-0.192,0.122
sex[T.M],0.5601,0.064,8.754,0.000,0.435,0.686
address[T.U],-0.1000,0.072,-1.387,0.166,-0.242,0.041
Pstatus[T.T],0.0563,0.094,0.597,0.551,-0.129,0.242
famsize[T.LE3],0.1804,0.065,2.776,0.006,0.053,0.308
Medu[T.1],-0.5001,0.316,-1.584,0.114,-1.120,0.120
Medu[T.2],-0.7287,0.317,-2.302,0.022,-1.350,-0.107
Medu[T.3],-0.5355,0.321,-1.670,0.095,-1.165,0.094

0,1,2,3
Omnibus:,19.689,Durbin-Watson:,1.915
Prob(Omnibus):,0.0,Jarque-Bera (JB):,20.322
Skew:,0.326,Prob(JB):,3.86e-05
Kurtosis:,3.205,Cond. No.,9680000000000000.0


> The regression summary above reports how each variable relates to the DWI (the dependent variable, labeled `drinking`). We will use this multiple linear regression to predict the DWI of students based on the dataset given.
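
The summary layout and the term names (`sex[T.M]`, `Medu[T.1]`) match what statsmodels produces for a formula-based OLS fit. The sketch below shows that kind of fit on a tiny toy dataset. It is an illustration under assumptions, not the code in `statistical_analysis`: the formula, the toy values, and the use of only two predictors are all hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (hypothetical): the target is built as an exact linear function
# so the recovered coefficients are easy to check by eye.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "goout": [1, 2, 3, 4, 5, 2],
})
toy["drinking"] = 1.0 + 0.5 * (toy["sex"] == "M") + 0.25 * toy["goout"]

# String columns are treated as categorical and dummy-coded against a
# reference level ("F" here), which yields a term named sex[T.M].
model = smf.ols("drinking ~ sex + goout", data=toy).fit()
params = model.params
print(params)
```

Numeric codes such as `Medu` would need to be wrapped as `C(Medu)` in the formula (or stored as strings) to be dummy-coded the way the summary shows.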

### Actual DWI vs. Predicted DWI (first analysis)

![](img/plt_actual_pred_dwi_before.png)

> After computing the predicted values of the DWI from the regression model, we plotted them against the actual values to see how well the model predicts the DWI. Ideally, **the data points should lie on the red line, as any point on the red line represents an exact match between the predicted and actual DWI**. However, it is quite clear that **most of the data points do _NOT_ lie on the red line. This is most likely due to the dataset having many features that do _NOT_ have a strong impact on DWI**, as indicated by their high p-values. As such, we decided that **there are clearly some features we should remove before re-running the multiple linear regression to generate a more accurate model**. However, before removing features based only on their p-values, we wanted to see how the continuous variables correlate with one another in this dataset. <br/>  (**See 'Challenges' section for why we did not use feature selection methods**)

### Correlation of Continuous Variables from the dataset

![](img/plt_heat_map.png)

> **The heatmap above captures the correlation between continuous variables**. We created this map to **observe the correlation between the drinking index and other variables**. It is worth noting that we did _NOT_ include any categorical variables, as we wanted to keep the different types of variables separated. Before creating the heatmap, **we set an arbitrary threshold of 0.1**. In other words, we would eliminate a variable from the model if we did not observe a meaningful correlation (a correlation below 0.1) with the drinking index, in order to reduce the risk of overfitting. Nonetheless, after drawing the heatmap, every variable had a correlation higher than 0.1, so **we kept all the continuous variables in the regression model**.
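
The threshold rule described above can be sketched with `DataFrame.corr()`. The toy values, the column choices, and the use of the absolute correlation are assumptions for illustration; they are not taken from the project data.

```python
import pandas as pd

# Toy continuous variables (hypothetical values).
toy = pd.DataFrame({
    "drinking": [1.0, 2.0, 3.0, 4.0],
    "goout": [1, 2, 3, 4],    # moves in lockstep with the target
    "health": [3, 1, 1, 3],   # constructed to be uncorrelated with the target
})

THRESHOLD = 0.1  # same arbitrary cut-off as in the text

# Pearson correlation of every variable with the target, excluding the target itself.
corr_with_target = toy.corr()["drinking"].drop("drinking")

# Keep variables whose absolute correlation reaches the threshold.
kept = corr_with_target[corr_with_target.abs() >= THRESHOLD].index.tolist()
print(kept)
```

Comparing absolute values means a strongly negative correlation also counts as a reason to keep a variable.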

### Regression Summary (after removing insignificant variables)

In [13]:
regression_aft_removing.summary()

0,1,2,3
Dep. Variable:,drinking,R-squared:,0.381
Model:,OLS,Adj. R-squared:,0.355
Method:,Least Squares,F-statistic:,14.67
Date:,"Tue, 12 Mar 2019",Prob (F-statistic):,1.41e-77
Time:,19:41:22,Log-Likelihood:,-1330.2
No. Observations:,1044,AIC:,2746.0
Df Residuals:,1001,BIC:,2959.0
Df Model:,42,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.4934,0.760,-0.649,0.516,-1.985,0.999
sex[T.M],0.5831,0.060,9.659,0.000,0.465,0.702
Medu[T.1],-0.5075,0.313,-1.623,0.105,-1.121,0.106
Medu[T.2],-0.7835,0.313,-2.504,0.012,-1.398,-0.169
Medu[T.3],-0.5892,0.316,-1.866,0.062,-1.209,0.030
Medu[T.4],-0.7153,0.319,-2.243,0.025,-1.341,-0.090
Fedu[T.1],0.9742,0.308,3.160,0.002,0.369,1.579
Fedu[T.2],0.9826,0.311,3.163,0.002,0.373,1.592
Fedu[T.3],0.9947,0.315,3.158,0.002,0.377,1.613

0,1,2,3
Omnibus:,21.226,Durbin-Watson:,1.897
Prob(Omnibus):,0.0,Jarque-Bera (JB):,21.968
Skew:,0.346,Prob(JB):,1.7e-05
Kurtosis:,3.163,Cond. No.,1870.0


> This new regression describes the **relationship between the DWI and the remaining 18 features in the dataset that seemed to have a strong impact on it**. Normally, adding features can only increase the R-squared value. Although our R-squared went down by 0.012 after removing features, we removed close to half the features in the initial dataset (15 features) yet still maintained a similar R-squared value. The adjusted R-squared, which penalizes extra terms, rose slightly (0.354 to 0.355), and both AIC (2768 to 2746) and BIC (3085 to 2959) decreased. **This indicates that removing those 15 variables gave us a more parsimonious model with essentially the same explanatory power.** We will use this new multiple linear regression model to predict the DWI.
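
The adjusted R-squared values in the two summaries can be reproduced from the reported R-squared, the number of observations, and the model degrees of freedom. The sketch below uses the standard adjustment formula; the function name is hypothetical, and the inputs are the figures printed in the two summary tables.

```python
def adjusted_r2(r2: float, n_obs: int, n_terms: int) -> float:
    """Adjusted R-squared for a model with an intercept and n_terms predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)


# Values reported in the two regression summaries (1044 observations each).
full_model = adjusted_r2(r2=0.393, n_obs=1044, n_terms=63)
reduced_model = adjusted_r2(r2=0.381, n_obs=1044, n_terms=42)

print(round(full_model, 3), round(reduced_model, 3))
```

The penalty grows with the number of terms, which is why the smaller model ends up with the marginally higher adjusted value despite its lower raw R-squared.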

### Actual DWI vs. Predicted DWI (after removing insignificant variables)

![](img/plt_actual_pred_dwi_aft.png)

> TEXT ABOUT THE ACTUAL VS. PREDICTED GRAPH HERE

![](img/plt_resid_actual_pred_dwi_aft.png)

> TEXT ABOUT THE RESIDUAL GRAPH

In [14]:
accuracy_df

Unnamed: 0,algorithms,weekday_drinking_accuracy,weekend_drinking_accuracy
0,Decision Tree,0.666667,0.542857
1,Random Forest,0.8,0.761905
2,SVM,0.628571,0.580952
3,Logistic Reg,0.66,0.47


## Machine Learning Approach
> We used machine learning approaches to predict the level of students' alcohol consumption on weekdays and on weekends using other features that reflect students' family, social, and academic environments. Since we are trying to predict categorical variables (the level of students' alcohol consumption from 1 to 5, with 1 being very low and 5 being very high), we decided to use a decision tree classifier, a random forest classifier, and a support vector classifier as our primary machine learning algorithms, with multi-class logistic regression as a benchmark for comparison. We believe that decision-tree-based algorithms like the decision tree classifier and random forest classifier are best suited for this type of classification problem, since a significant portion of our independent/explanatory variables are categorical in nature.

#### Decision Tree Learning 
> Decision tree learning is one of the most widely used predictive modeling approaches in data mining and machine learning, and a simple classification technique. In essence, a decision tree organizes a series of test questions and conditions in a tree structure. Starting from the root node, a test condition is applied that separates the data into two groups. From there, different test conditions are applied to each subset until the data is split into subsets small or pure enough to classify. One of the major advantages of a decision tree model is that it is intuitive and relatively straightforward to interpret, which we thought might help produce a more accurate model. In addition, nonlinear relationships between parameters do not affect tree performance. Nonetheless, without proper pruning and feature engineering, a tree tends to overfit the training data, which can drastically hurt its predictions on new data.

#### Random Forest Classifier 
> The main idea behind a random forest classifier is to combine many decision trees into a single model. The model relies on averaging: individually, the predictions made by decision trees may not be accurate, but combined together, the predictions will be closer to the mark on average.
The random forest algorithm pools these predictions and therefore draws on more information than a single tree classifier. Its main advantage is that the model is not swayed by a single anomalous tree or data point, so its predictions are more stable, which we hoped would produce a more accurate model.

### Data Preparation & Preprocessing
> To prepare our dataset for the machine learning algorithms, we first converted all the categorical independent variables into binary dummy variables using the 'get_dummies()' function (since all of the categorical features contain fewer than 20 unique values). In addition, we included normalization using 'MinMaxScaler' (from the 'sklearn.preprocessing' package) as part of the pipeline and used the 'train_test_split()' function from 'sklearn.model_selection' to split our data into training and testing sets.
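
The preprocessing steps above, followed by the four classifiers compared in this report, can be sketched as follows. This is a minimal illustration on synthetic data: the toy columns, the rule generating the target, the 80/20 split, and all hyperparameters are assumptions, so the printed accuracies say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data (hypothetical values).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "Mjob": rng.choice(["teacher", "health", "other"], size=n),
    "goout": rng.integers(1, 6, size=n),
    "age": rng.integers(15, 23, size=n),
})
# Hypothetical target rule, used only so the models have a pattern to learn.
toy["Dalc"] = np.where((toy["sex"] == "M") & (toy["goout"] >= 4), 2, 1)

# One-hot encode the categorical features; numeric columns pass through.
X = pd.get_dummies(toy.drop(columns="Dalc"), dtype=float)
y = toy["Dalc"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
    "Logistic Reg": LogisticRegression(max_iter=1000),
}

scores = {}
for name, clf in models.items():
    # Scaling lives inside the pipeline so it is fit on the training split only.
    pipe = Pipeline([("scale", MinMaxScaler()), ("clf", clf)])
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)  # accuracy on the held-out split

print(scores)
```

Putting `MinMaxScaler` inside the `Pipeline` keeps test-set information out of the scaling step, which matters when accuracies are compared across models.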

### Predict student alcohol consumption on weekdays
> The graphs below show the actual level of alcohol consumption versus the predicted level of alcohol consumption on weekdays. We can clearly see that most data points are clustered around the lower left corner and that our machine learning models performed relatively well at predicting lower levels of alcohol consumption. In particular, the random forest classifier yielded the best results, with most data points on the 45 degree line (actual = predicted). Furthermore, we found that a single decision tree by itself is relatively unstable and inaccurate. The logistic regression (benchmark), on the other hand, performed on par with the decision tree classifier and the support vector classifier.

Decision Tree Classification | Random Forest Classification | Support Vector Classification | Logistic Regression (Benchmark)
---- | ---- | ---- | ----
![Using decision tree to predict weekday alcohol consumption](img/decision_tree_weekday.png) | ![Using random forest to predict weekday alcohol consumption](img/random_forest_weekday.png) | ![Using SVC to predict weekday alcohol consumption](img/svc_weekday.png) | ![Using logistic regression to predict weekday alcohol consumption](img/logistic_regression_weekday.png)

### Predict student alcohol consumption on weekends
> The graphs below show the actual level of alcohol consumption versus the predicted level of alcohol consumption on weekends. Unlike the previous actual vs. predicted graphs for weekday alcohol consumption, the data points in the graphs below are spread more evenly along the diagonal line, which suggests that the overall levels of students' alcohol consumption are higher on weekends than on weekdays. In addition, we found that the random forest classifier consistently outperforms the benchmark and the other machine learning classifiers, while the decision tree classifier and the support vector classifier performed similarly, with almost the same number of data points on the 45 degree line (actual = predicted).

Decision Tree Classification | Random Forest Classification | Support Vector Classification | Logistic Regression (Benchmark)
---- | ---- | ---- | ----
![Using decision tree to predict weekend alcohol consumption](img/decision_tree_weekend.png) | ![Using random forest to predict weekend alcohol consumption](img/random_forest_weekend.png) | ![Using SVC to predict weekend alcohol consumption](img/svc_weekend.png) | ![Using logistic regression to predict weekend alcohol consumption](img/logistic_regression_weekend.png)

## Challenges
1. There is a big limitation to the data we gathered from Kaggle: the data came from a survey. As such, some students may not have been completely honest in their responses, and we have no way of cross-checking whether the survey information truly matches the students' real data.

2. We attempted to apply forward/backward selection methods on this dataset, but were unable to complete the calculations on our machines because of the number of features the selection methods had to evaluate. When we ran these feature selection methods on our machines, they ran indefinitely. As a result, we had to resort to simpler means of identifying and removing insignificant variables from the dataset, such as using the p-values and viewing the correlations between continuous variables.
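
For reference, forward selection can be sketched as a greedy loop that adds, at each step, the feature that most improves a score and stops when no candidate helps. The version below is an illustration on synthetic data using NumPy least squares and adjusted R-squared as the score. It is not the procedure attempted in this project; the function names, the scoring choice, and the data are all assumptions.

```python
import numpy as np


def adjusted_r2(X, y):
    """Fit OLS with an intercept and return the adjusted R-squared."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def forward_select(X, y, names):
    """Greedily add the column that most improves adjusted R-squared."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining:
        # Score every candidate when added to the current selection.
        score, j = max((adjusted_r2(X[:, selected + [c]], y), c) for c in remaining)
        if score <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return [names[j] for j in selected], best_score


# Synthetic data: only x0 and x2 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
names = ["x0", "x1", "x2", "x3"]

chosen, score = forward_select(X, y, names)
print(chosen, round(score, 3))
```

With p candidate columns, this loop fits at most p(p+1)/2 models, far fewer than the 2^p fits an exhaustive search over all subsets would need.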

Sources: 
- http://www.healthtalk.org/young-peoples-experiences/drugs-and-alcohol/alcohol-and-social-life
- https://www.kaggle.com/uciml/student-alcohol-consumption