# Predicting Student Alcohol Consumption

## Basic Info
Title: Predicting Student Alcohol Consumption

Group Members:

Elizabeth Armstrong  
elizabeth.armstrong@chemeng.utah.edu  
u0726588  

Nipun Gunawardena  
1.nipun@gmail.com  
u0624269  

Karen DeMille   
karen.demille@utah.edu   
u0787257    

## Background and Motivation
As recommended in class, our first goal was to find a dataset that was easily accessible but interesting. In the UCI Machine Learning Repository, we found a dataset describing high school student alcohol consumption. This dataset is interesting for several reasons. Underage drinking is a health problem, and understanding it better can lead to better treatment. Conversely, if the data show that drinking is not a large problem, rehabilitation efforts can be focused elsewhere. Additionally, it would be interesting to see how alcohol consumption correlates with socioeconomic and educational factors. Finally, it would be very informative to find another dataset for comparison. If student alcohol consumption correlates well with alcohol consumption in the general population, that relationship could help anticipate and prevent future health problems.

All the members of our group are engineers. While this topic doesn't really relate to any of our research topics, this project gives us the opportunity to work with human-oriented data, something that is sometimes lacking in our field. Ultimately, determining what factors lead to alcohol consumption can help policymakers come to better decisions.

## Project Objectives
This project will have two main objectives: general data exploration and prediction of alcohol consumption tendencies. General data exploration will let us find the factors that are important to alcohol consumption in students, along with any other interesting information. Developing a tool to predict alcohol consumption will serve as a proof of concept that institutions could use in the future. These objectives can be completed with the tools and data we currently have access to. However, if we can find or gather more alcohol consumption data, we would also like to see how the existing dataset matches the new one. This extra data could also be used to test our predictive tool. Since our dataset is already quite clean, we plan to spend most of our time on analysis, though this might change if we find a new dataset to work with.

## Data

We plan to use the Student Alcohol Consumption Data Set from the UCI Machine Learning Repository. The dataset webpage can be accessed at http://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION. Datasets for attributes of high school students in a math course and a Portuguese language course can be downloaded as CSV files, which are part of a compressed folder accessed through the Data Folder link or directly from http://archive.ics.uci.edu/ml/machine-learning-databases/00356/. This data has been downloaded and included in the same folder as the Jupyter notebook to allow for easy access and importing. The files are read below using pandas. The data attributes are separated by semicolons, making “;” the delimiter.

There are 33 total attributes recorded in the UCI datasets to choose from. These attributes include sex, age, home to school travel time, weekly study time, extra-curricular activities, workday alcohol consumption, and weekend alcohol consumption, to name a few. A full list and description of the attributes can be found on the website. With permission, we would also like to create a survey for University of Utah students that collects data on topics similar to the attributes in the UCI dataset. We would then hope to see where our class fits into the model we will be creating for predicting alcohol consumption based on other traits.

# Data Cleaning

Most of the data entries are strings, so they will have to be converted to integer values for processing. Many are binary ‘yes’ or ‘no’ answers, but some have more than two choices, such as the mother’s or father’s job, which can be ‘teacher’, ‘health’ (health care related), ‘services’ (civil services), ‘at_home’, or ‘other’.

There are 395 entries in the math course dataset and 649 entries in the Portuguese language dataset, giving a total of 1044 data entries. These datasets could be combined into one large set for analysis since both courses record the same attributes. Some students belong to both datasets, so when the datasets are combined we must find the duplicates to merge the two datasets properly.

In [1]:
import pandas as pd
import numpy as np

## Load Data
Load the data and print the number of data points in each file.

In [2]:
mathclass = pd.read_csv("student-mat.csv",delimiter=";") #math course csv
portclass = pd.read_csv("student-por.csv",delimiter=";") #portuguese language course csv

print('Math class data points: {}'.format(len(mathclass)))
print('Portuguese class data points: {}'.format(len(portclass)))
print('Total data points: {}'.format(len(mathclass)+len(portclass)))

Math class data points: 395
Portuguese class data points: 649
Total data points: 1044


## Define groups of columns

Groups of columns need to be defined for finding duplicates and merging datasets.

In [21]:
#define lists of column names
all_courseindependent = ['school','sex','age','address','famsize','Pstatus','Medu','Fedu','Mjob','Fjob','reason','guardian',
                         'traveltime','studytime','failures','schoolsup','famsup','activities','nursery','higher','internet',
                         'romantic','famrel','freetime','goout','Dalc','Walc','health']
all_ = ['school','sex','age','address','famsize','Pstatus','Medu','Fedu','Mjob','Fjob','reason','guardian','traveltime',
        'studytime','failures','schoolsup','famsup','paid','activities','nursery','higher','internet','romantic','famrel',
        'freetime','goout','Dalc','Walc','health','absences','G1','G2','G3']
student_merge = ['school','sex','age','address','famsize','Pstatus','Medu','Fedu','Mjob','Fjob','reason','nursery','internet']
math_columns = ['school','sex','age','address','famsize','Pstatus','Medu','Fedu','Mjob','Fjob','reason','guardian',
                'traveltime','studytime','failures','schoolsup','famsup','paid_math','activities','nursery','higher',
                'internet','romantic','famrel','freetime','goout','Dalc','Walc','health','absences_math','G1_math',
                'G2_math','G3_math','math']
port_columns = ['school','sex','age','address','famsize','Pstatus','Medu','Fedu','Mjob','Fjob','reason','guardian',
                'traveltime','studytime','failures','schoolsup','famsup','paid_port','activities','nursery','higher',
                'internet','romantic','famrel','freetime','goout','Dalc','Walc','health','absences_port','G1_port',
                'G2_port','G3_port','port']

## Choose columns to merge datasets on

The data source states that there are 382 students who belong to both classes (math and Portuguese). Included with the datasets was an example file showing how to find the students who are included in both datasets. The example file merged the two datasets on the following columns: school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, nursery, and internet.

In [5]:
bothclass = pd.concat([portclass, mathclass])
print('Number of students in both classes: {}'.format(bothclass.duplicated(keep='first',subset=student_merge).sum()))

Number of students in both classes: 382


Before going forward with the list of columns provided by the source, we will test whether the list includes enough columns to avoid incorrectly labeling data as a duplicate.

In [6]:
#determine whether there are duplicates in the individual datasets
print('Number of duplicates found in Portuguese class dataset: {}'. \
      format(portclass.duplicated(keep='first',subset=student_merge).sum()))
print('Number of duplicates found in math class dataset: {}'. \
      format(mathclass.duplicated(keep='first',subset=student_merge).sum()))

Number of duplicates found in Portuguese class dataset: 12
Number of duplicates found in math class dataset: 4


Since duplicates appear to be present in the individual datasets when the list of columns provided by the source was used to find duplicates, we will determine if they are actually duplicates by comparing all of the columns in each dataframe.

In [7]:
#determine whether there are duplicates in the individual datasets when all columns are compared
print('Number of duplicates found in Portuguese class dataset: {}'. \
      format(portclass.duplicated(keep='first',subset=all_).sum()))
print('Number of duplicates found in math class dataset: {}'. \
      format(mathclass.duplicated(keep='first',subset=all_).sum()))

Number of duplicates found in Portuguese class dataset: 0
Number of duplicates found in math class dataset: 0


When we search for duplicates using all of the columns in each dataframe, no duplicates are found. Thus, the list of columns provided by the source is not sufficient to identify a student uniquely, and merging on it could pair rows that belong to different students. To determine whether the set of all columns will find duplicates between the two datasets, we search for duplicates in the concatenated dataframe, considering all columns.

In [8]:
print('Number of students in both classes: {}'.format(bothclass.duplicated(keep='first',subset=all_).sum()))

Number of students in both classes: 0


Using all of the columns to find duplicates between the datasets resulted in 0 duplicates being found. This suggests that some columns depend on the class in which the student's data was acquired. From the column descriptions, it appears that the following columns are class-specific:

* paid: extra paid classes within the course subject
* absences: number of school absences
* G1: first period grade
* G2: second period grade
* G3: third period grade

The class-specific columns were removed from the list of columns to consider when finding duplicates.  
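As a minimal sketch of why the choice of `subset` matters, the toy frames below (made-up values, not taken from the UCI files) contain two students who appear in both courses. Only one of them happens to have the same grade in both, so comparing every column undercounts the shared students.

```python
import pandas as pd

# Toy stand-ins for the two course files; all values are invented for illustration.
port_toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [15, 16, 17],
    "G3": [12, 14, 10],   # class-specific final grade
})
math_toy = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "age": [15, 17, 18],
    "G3": [9, 10, 11],    # the same student can earn a different grade per course
})

both_toy = pd.concat([port_toy, math_toy], ignore_index=True)

# Comparing every column: a row is flagged only if the grade matches too.
n_all = both_toy.duplicated(keep="first").sum()

# Comparing only class-independent columns: the grade is ignored.
n_independent = both_toy.duplicated(subset=["school", "age"], keep="first").sum()

print(n_all, n_independent)
```

With these toy values, the all-column comparison flags one row while the class-independent comparison flags two.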

In [9]:
#determine whether there are duplicates in the individual datasets when class-specific columns are ignored
print('Number of duplicates found in Portuguese class dataset: {}'. \
      format(portclass.duplicated(keep='first',subset=all_courseindependent).sum()))
print('Number of duplicates found in math class dataset: {}'. \
      format(mathclass.duplicated(keep='first',subset=all_courseindependent).sum()))

Number of duplicates found in Portuguese class dataset: 2
Number of duplicates found in math class dataset: 0


In [10]:
print('Number of students in both classes: {}'.format(bothclass.duplicated(keep='first',subset=all_courseindependent).sum()))

Number of students in both classes: 322


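When all columns except for the class-specific columns were considered, only 2 duplicates were found in the Portuguese class dataset and no duplicates were found in the math class dataset. In the concatenated dataset, 322 duplicates were found: the 2 within the Portuguese class dataset plus 320 students who appear in both classes.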

To verify that all class-independent columns should be used to find duplicates between the two datasets, we added or omitted one column at a time from the list. Whenever one of the five class-specific columns was included in the list, the number of duplicates dropped significantly. On the other hand, when one class-independent column was omitted from the list, the number of duplicates did not change.
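The one-column-at-a-time check is not shown in the notebook. A minimal sketch of how it could be done follows; the helper names `duplicate_count` and `leave_one_out_counts` are hypothetical, and the toy data is invented.

```python
import pandas as pd

def duplicate_count(df, columns):
    """Count rows that repeat an earlier row on the given columns."""
    return int(df.duplicated(subset=list(columns), keep="first").sum())

def leave_one_out_counts(df, columns):
    """Map each column to the duplicate count obtained when it is omitted."""
    return {
        col: duplicate_count(df, [c for c in columns if c != col])
        for col in columns
    }

# Invented toy data: the first two rows agree on every column.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS", "MS"],
    "sex": ["F", "F", "M", "M"],
    "age": [15, 15, 17, 18],
})
cols = ["school", "sex", "age"]

baseline = duplicate_count(toy, cols)
counts = leave_one_out_counts(toy, cols)
print(baseline, counts)
```

A column whose omission leaves the count at the baseline adds no discriminating power on that data, while a column whose omission raises the count (here `age`) is needed to tell rows apart.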

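The class-independent column list is the best key available for matching students between the Portuguese and math class datasets, so we will remove the 2 rows in the Portuguese class dataset that repeat an earlier row on these columns. These rows differ in their class-specific values, so they are most likely false duplicates (different students with identical class-independent attributes), but keeping them would make the merge key ambiguous. Removing them ensures that each key matches at most one row per dataset.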

## Find Duplicates and Merge Datasets

To prepare to merge the mathclass and portclass dataframes, we copy each class-specific column into a new column whose name carries the course suffix (`_math` or `_port`), and we add a flag column (`math` or `port`) that marks which class each row came from.

In [26]:
mathclass['paid_math'] = mathclass['paid']
mathclass['absences_math'] = mathclass['absences']
mathclass['G1_math'] = mathclass['G1']
mathclass['G2_math'] = mathclass['G2']
mathclass['G3_math'] = mathclass['G3']
mathclass['math'] = 1

portclass['paid_port'] = portclass['paid']
portclass['absences_port'] = portclass['absences']
portclass['G1_port'] = portclass['G1']
portclass['G2_port'] = portclass['G2']
portclass['G3_port'] = portclass['G3']
portclass['port'] = 1

mathclass_merge = mathclass[math_columns]
portclass_merge = portclass[port_columns]

#remove duplicates from individual datasets based on class-independent column set
portclass_nodup = portclass_merge.drop_duplicates(subset=all_courseindependent,keep='first')
mathclass_nodup = mathclass_merge.drop_duplicates(subset=all_courseindependent,keep='first')
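As an aside, the manual column copies could in principle be replaced by the `suffixes` argument of `pd.merge`, which renames overlapping non-key columns automatically. The sketch below uses invented toy frames and a shortened, hypothetical key list; unlike the approach in this notebook, it does not create the `math` and `port` flag columns.

```python
import pandas as pd

# Shortened stand-in for the class-independent key list.
keys = ["school", "age"]

# Invented toy data; G3 is the only class-specific column here.
port_toy = pd.DataFrame({"school": ["GP", "MS"], "age": [15, 17], "G3": [12, 10]})
math_toy = pd.DataFrame({"school": ["GP", "GP"], "age": [15, 16], "G3": [9, 14]})

# Overlapping non-key columns receive the given suffixes (left, right).
merged_toy = pd.merge(port_toy, math_toy, how="outer", on=keys,
                      suffixes=("_port", "_math"))
print(merged_toy.columns.tolist())
```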

Merge the mathclass and portclass dataframes on the class-independent columns.

In [23]:
bothclass = pd.merge(portclass_nodup, mathclass_nodup, how='outer', on=all_courseindependent)
print(bothclass.columns.tolist())

bothclass

['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid_port', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences_port', 'G1_port', 'G2_port', 'G3_port', 'port', 'paid_math', 'absences_math', 'G1_math', 'G2_math', 'G3_math', 'math']


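|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | G1_port | G2_port | G3_port | port | paid_math | absences_math | G1_math | G2_math | G3_math | math |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18.0 | U | GT3 | A | 4.0 | 4.0 | at_home | teacher | ... | 0.0 | 11.0 | 11.0 | 1.0 | no | 6.0 | 5.0 | 6.0 | 6.0 | 1.0 |
| 1 | GP | F | 17.0 | U | GT3 | T | 1.0 | 1.0 | at_home | other | ... | 9.0 | 11.0 | 11.0 | 1.0 | no | 4.0 | 5.0 | 5.0 | 6.0 | 1.0 |
| 2 | GP | F | 15.0 | U | LE3 | T | 1.0 | 1.0 | at_home | other | ... | 12.0 | 13.0 | 12.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | GP | F | 15.0 | U | GT3 | T | 4.0 | 2.0 | health | services | ... | 14.0 | 14.0 | 14.0 | 1.0 | yes | 2.0 | 15.0 | 14.0 | 15.0 | 1.0 |
| 4 | GP | F | 16.0 | U | GT3 | T | 3.0 | 3.0 | other | other | ... | 11.0 | 13.0 | 13.0 | 1.0 | yes | 4.0 | 6.0 | 10.0 | 10.0 | 1.0 |
| 5 | GP | M | 16.0 | U | LE3 | T | 4.0 | 3.0 | services | other | ... | 12.0 | 12.0 | 13.0 | 1.0 | yes | 10.0 | 15.0 | 15.0 | 15.0 | 1.0 |
| 6 | GP | M | 16.0 | U | LE3 | T | 2.0 | 2.0 | other | other | ... | 13.0 | 12.0 | 13.0 | 1.0 | no | 0.0 | 12.0 | 12.0 | 11.0 | 1.0 |
| 7 | GP | F | 17.0 | U | GT3 | A | 4.0 | 4.0 | other | teacher | ... | 10.0 | 13.0 | 13.0 | 1.0 | no | 6.0 | 6.0 | 5.0 | 6.0 | 1.0 |
| 8 | GP | M | 15.0 | U | LE3 | A | 3.0 | 2.0 | services | other | ... | 15.0 | 16.0 | 17.0 | 1.0 | yes | 0.0 | 16.0 | 18.0 | 19.0 | 1.0 |
| 9 | GP | M | 15.0 | U | GT3 | T | 3.0 | 4.0 | other | other | ... | 12.0 | 12.0 | 13.0 | 1.0 | yes | 0.0 | 14.0 | 15.0 | 15.0 | 1.0 |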


To make sure that the merge was done properly, the numbers of students represented from each class are found.

In [34]:
#check that numbers work out

print('Total datasets from both classes: {}'.format(len(mathclass_nodup)+len(portclass_nodup)))
print('Length of dataset after duplicates removed: {}'.format(len(bothclass)))
print('Length of mathclass dataset: {}'.format(len(mathclass_nodup)))
print('Number of mathclass only in bothclass: {}'.format(len(bothclass[(bothclass['math']==1) & (bothclass['port']!=1)])))
print('Length of portclass dataset: {}'.format(len(portclass_nodup)))
print('Number of portclass only in bothclass: {}'.format(len(bothclass[(bothclass['math']!=1) & (bothclass['port']==1)])))
print('Number of mathclass AND portclass in bothclass: {}'.format(len(bothclass[(bothclass['math']==1) & (bothclass['port']==1)])))

Total datasets from both classes: 1042
Length of dataset after duplicates removed: 722
Length of mathclass dataset: 395
Number of mathclass only in bothclass: 75
Length of portclass dataset: 647
Number of portclass only in bothclass: 327
Number of mathclass AND portclass in bothclass: 320


After the within-dataset duplicates are dropped, the Portuguese class (647) and math class (395) datasets together contain 1042 data points. In the merged dataset, 320 students appear in both classes, 327 appear only in the Portuguese class, and 75 appear only in the math class, which matches the individual dataset lengths (320 + 327 = 647 and 320 + 75 = 395). The 320 shared students plus the 722 rows of the merged dataset equal the 1042 total data points.
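The same bookkeeping can also be cross-checked with the `indicator=True` option of `pd.merge`, which adds a `_merge` column labeling each row as `left_only`, `right_only`, or `both`. The sketch below uses invented toy frames and a shortened, hypothetical key list, and assumes the keys are unique within each frame (as they are here after dropping duplicates).

```python
import pandas as pd

# Shortened stand-in for the class-independent key list.
keys = ["school", "age"]

# Invented toy data with course-suffixed grade columns.
port_toy = pd.DataFrame({"school": ["GP", "MS", "MS"], "age": [15, 17, 18],
                         "G3_port": [12, 10, 11]})
math_toy = pd.DataFrame({"school": ["GP", "GP"], "age": [15, 16],
                         "G3_math": [9, 14]})

merged = pd.merge(port_toy, math_toy, how="outer", on=keys, indicator=True)
counts = merged["_merge"].value_counts()

n_both = int(counts["both"])
n_port_only = int(counts["left_only"])
n_math_only = int(counts["right_only"])

# Each shared student is counted once in the merged frame.
assert len(port_toy) + len(math_toy) - n_both == len(merged)
print(n_both, n_port_only, n_math_only)
```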

## Assign Integer Values to String Entries

In [35]:
bothclass['school_num'] = bothclass['school'].map({'GP':0, 'MS':1})
bothclass['sex_num'] = bothclass['sex'].map({'F':0, 'M':1})
bothclass['address_num'] = bothclass['address'].map({'U':0, 'R':1})
bothclass['famsize_num'] = bothclass['famsize'].map({'LE3':0, 'GT3':1})
bothclass['Pstatus_num'] = bothclass['Pstatus'].map({'T':0, 'A':1})
bothclass['Mjob_num'] = bothclass['Mjob'].map({'teacher':0, 'health':1, 'services':2, 'at_home':3, 'other':4})
bothclass['Fjob_num'] = bothclass['Fjob'].map({'teacher':0, 'health':1, 'services':2, 'at_home':3, 'other':4})
bothclass['reason_num'] = bothclass['reason'].map({'home':0, 'reputation':1, 'course':2, 'other':3})
bothclass['guardian_num'] = bothclass['guardian'].map({'mother':0, 'father':1, 'other':2})
bothclass['schoolsup_num'] = bothclass['schoolsup'].map({'yes':0, 'no':1})
bothclass['famsup_num'] = bothclass['famsup'].map({'yes':0, 'no':1})
bothclass['activities_num'] = bothclass['activities'].map({'yes':0, 'no':1})
bothclass['nursery_num'] = bothclass['nursery'].map({'yes':0, 'no':1})
bothclass['higher_num'] = bothclass['higher'].map({'yes':0, 'no':1})
bothclass['internet_num'] = bothclass['internet'].map({'yes':0, 'no':1})
bothclass['romantic_num'] = bothclass['romantic'].map({'yes':0, 'no':1})
bothclass['paid_port_num'] = bothclass['paid_port'].map({'yes':0, 'no':1})
bothclass['paid_math_num'] = bothclass['paid_math'].map({'yes':0, 'no':1})
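Note that integer codes such as 0 to 4 for `Mjob` imply an ordering that the job categories do not have, which some models may pick up on. One common alternative for multi-level columns, sketched here on invented toy data rather than on `bothclass`, is one-hot encoding with `pd.get_dummies`. Binary columns can keep their 0/1 mapping.

```python
import pandas as pd

# Invented toy rows using category labels from the dataset description.
toy = pd.DataFrame({
    "Mjob": ["teacher", "health", "other", "at_home"],
    "internet": ["yes", "no", "yes", "yes"],
})

# Binary column: a 0/1 code carries no spurious ordering.
toy["internet_num"] = toy["internet"].map({"yes": 0, "no": 1})

# .map leaves NaN for labels missing from the dictionary, so count them.
n_unmapped = int(toy["internet_num"].isna().sum())

# Multi-level nominal column: one indicator column per category.
mjob_dummies = pd.get_dummies(toy["Mjob"], prefix="Mjob")
encoded = pd.concat([toy, mjob_dummies], axis=1)
print(encoded.columns.tolist())
```

Whether one-hot encoding is preferable depends on the model used later; tree-based methods are often less sensitive to arbitrary integer codes than linear models.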

## Save data to new csv

In [36]:
bothclass.to_csv('CleanData.csv', sep=',', na_rep='NaN') #na_rep expects a string; missing values are written as NaN
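Because `to_csv` writes the row index as an unnamed first column by default, the file is easiest to read back with `index_col=0`. The sketch below illustrates the round trip on an invented toy frame, using an in-memory buffer in place of `CleanData.csv`.

```python
import io

import numpy as np
import pandas as pd

# Invented toy frame with one missing value, as produced by an outer merge.
toy = pd.DataFrame({"school": ["GP", "MS"], "G3_math": [10.0, np.nan]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, sep=",", na_rep="NaN")
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an "Unnamed: 0" column.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```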