# Initial Proposal
Initial proposal for CS5963 final project. See the [project page](http://datasciencecourse.net/2016/project/) for more information.

## Basic Info
Title: Predicting Student Alcohol Consumption

Group Members:

Elizabeth Armstrong  
elizabeth.armstrong@chemeng.utah.edu  
u0726588  

Nipun Gunawardena  
1.nipun@gmail.com  
u0624269  


## Background and Motivation
As recommended in class, our first goal was to find a dataset that was easily accessible but interesting. In the UCI Machine Learning Repository, we found a dataset describing alcohol consumption among high school students. This dataset is potentially interesting for several reasons. Underage drinking is a health problem, and understanding it better can lead to better treatment. Conversely, if the data show that drinking is not a large problem, rehabilitation efforts can be focused elsewhere. Additionally, it would be interesting to see how alcohol consumption correlates with socioeconomic and educational factors. Finally, it would be very informative to find another dataset to compare with. If student alcohol consumption correlates well with alcohol consumption in the general population, that relationship could help in anticipating and preventing future health problems.

All the members of our group are engineers. While this topic doesn't really relate to any of our research topics, this project gives us the opportunity to work with human-oriented data, something that is sometimes lacking in our field.

* Personal reasons?
* I like alcohol, but would like to know more about its effects on society.

## Project Objectives
This project has two main objectives: general data exploration and prediction of alcohol consumption tendencies. General data exploration will let us find the factors that matter most for student alcohol consumption, along with any other interesting information. Developing a tool to predict alcohol consumption tendencies will serve as a proof of concept that institutions could use in the future. These objectives can be completed with the tools and data we currently have access to. However, if we can find or gather more alcohol consumption data, we would also like to see how well the existing dataset matches the new one. This extra data could also be used to test our predictive tool. Since our dataset is already quite clean, we plan to spend most of our time on analysis, though this might change if we find a new dataset to work with.

## Data

We plan to use the Student Alcohol Consumption Data Set from the UCI Machine Learning Repository, which can be accessed at http://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION. It contains attributes of high school students in a math course and in a Portuguese language course as two CSV files, which are part of a compressed folder available through the Data Folder link or directly from http://archive.ics.uci.edu/ml/machine-learning-databases/00356/. The data has been downloaded and placed in the same folder as the Jupyter notebook for easy access and importing. The files are read and previewed below using pandas. The data attributes are separated by semicolons, making “;” the delimiter.

With permission, we would also like to survey University of Utah students on topics similar to the attributes in the UCI dataset. These attributes include sex, age, home-to-school travel time, weekly study time, extracurricular activities, workday alcohol consumption, and weekend alcohol consumption, to name a few. The UCI datasets record 33 attributes in total to choose from; a full list with descriptions can be found on the website. We would then hope to see where our class fits into the model we will create for predicting alcohol consumption from other traits.

In [2]:
import pandas as pd
# The UCI files separate fields with semicolons rather than commas.
mathclass = pd.read_csv("student-mat.csv", delimiter=";")  # math course CSV
portclass = pd.read_csv("student-por.csv", delimiter=";")  # Portuguese language course CSV

portclass.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13


In [2]:
mathclass.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [3]:
mathclass.describe()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
count,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0
mean,16.696203,2.749367,2.521519,1.448101,2.035443,0.334177,3.944304,3.235443,3.108861,1.481013,2.291139,3.55443,5.708861,10.908861,10.713924,10.41519
std,1.276043,1.094735,1.088201,0.697505,0.83924,0.743651,0.896659,0.998862,1.113278,0.890741,1.287897,1.390303,8.003096,3.319195,3.761505,4.581443
min,15.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,3.0,0.0,0.0
25%,16.0,2.0,2.0,1.0,1.0,0.0,4.0,3.0,2.0,1.0,1.0,3.0,0.0,8.0,9.0,8.0
50%,17.0,3.0,2.0,1.0,2.0,0.0,4.0,3.0,3.0,1.0,2.0,4.0,4.0,11.0,11.0,11.0
75%,18.0,4.0,3.0,2.0,2.0,0.0,5.0,4.0,4.0,2.0,3.0,5.0,8.0,13.0,13.0,14.0
max,22.0,4.0,4.0,4.0,4.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,75.0,19.0,19.0,20.0


In [4]:
portclass.describe()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
count,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0
mean,16.744222,2.514638,2.306626,1.568567,1.930663,0.22188,3.930663,3.180277,3.1849,1.502311,2.280431,3.53621,3.659476,11.399076,11.570108,11.906009
std,1.218138,1.134552,1.099931,0.74866,0.82951,0.593235,0.955717,1.051093,1.175766,0.924834,1.28438,1.446259,4.640759,2.745265,2.913639,3.230656
min,15.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,16.0,2.0,1.0,1.0,1.0,0.0,4.0,3.0,2.0,1.0,1.0,2.0,0.0,10.0,10.0,10.0
50%,17.0,2.0,2.0,1.0,2.0,0.0,4.0,3.0,3.0,1.0,2.0,4.0,2.0,11.0,11.0,12.0
75%,18.0,4.0,3.0,2.0,2.0,0.0,5.0,4.0,4.0,2.0,3.0,5.0,6.0,13.0,13.0,14.0
max,22.0,4.0,4.0,4.0,4.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,32.0,19.0,19.0,19.0


## Data Processing

The data description on the website states that the dataset has no missing values, so we do not expect to do substantial data cleanup. However, many of the attributes are stored as strings (the summary tables above cover only the 16 numeric columns), so they will have to be converted to numerical values for processing. Many are binary ‘yes’ or ‘no’ answers, but some have more than two choices, such as the mother’s or father’s job, which can be ‘teacher’, ‘health’ (health care related), ‘services’ (civil services), ‘at_home’, or ‘other’. Since there are 33 attributes, we may want to drop some before converting strings to numerical values, or we could convert every entry and start with a model that uses all attributes to predict alcohol consumption. Little data extraction will be needed, since the two CSV files are provided for download. The data will only have to be manipulated into a form suitable for building a prediction model.
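The string conversion described above could look roughly like the sketch below. It is only one possible approach, not a settled design: yes/no columns are mapped to 1/0, while multi-choice columns are one-hot encoded rather than given integer codes, since integer codes would imply an ordering among jobs. The toy rows, the `activities` column name, and the helper `encode_strings` are assumptions made for illustration.

```python
import pandas as pd

# Toy rows that mimic a few columns of the student data; values are illustrative only.
sample = pd.DataFrame({
    "activities": ["yes", "no", "no"],        # assumed name of a yes/no column
    "Mjob": ["at_home", "teacher", "other"],  # multi-choice column
    "age": [18, 17, 15],
})

def encode_strings(df, binary_cols, categorical_cols):
    """Return a copy with yes/no columns mapped to 1/0 and
    multi-choice columns replaced by one indicator column per choice."""
    out = df.copy()
    for col in binary_cols:
        out[col] = out[col].map({"yes": 1, "no": 0})
    # get_dummies creates columns such as Mjob_teacher, so no ordering is implied.
    return pd.get_dummies(out, columns=categorical_cols, dtype=int)

encoded = encode_strings(sample, binary_cols=["activities"], categorical_cols=["Mjob"])
print(encoded)
```

One-hot encoding increases the number of columns, which is worth keeping in mind when deciding whether to drop attributes first.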

There are 395 entries in the math course dataset and 649 entries in the Portuguese language dataset, for a total of 1044 entries. These datasets could be combined into one larger set for analysis, since the same attributes are recorded for each course.
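A minimal sketch of combining the two tables is shown below, using small stand-in frames instead of the real CSV data. The `course` column is a hypothetical label added here so that each row's origin is not lost after stacking; the frame names `math_demo` and `port_demo` are likewise illustrative.

```python
import pandas as pd

# Small stand-ins for the two course tables; the real ones come from the CSV files.
math_demo = pd.DataFrame({"age": [18, 17], "Dalc": [1, 2], "Walc": [1, 3]})
port_demo = pd.DataFrame({"age": [15, 16, 16], "Dalc": [1, 1, 2], "Walc": [2, 1, 4]})

# Stacking only makes sense if both tables expose the same attributes.
if list(math_demo.columns) != list(port_demo.columns):
    raise ValueError("the two course tables have different columns")

# "course" is a hypothetical label column recording where each row came from.
combined = pd.concat(
    [math_demo.assign(course="math"), port_demo.assign(course="portuguese")],
    ignore_index=True,
)
print(combined.shape)
```

If some students took both courses, they would appear twice in the combined table; whether that happens is something to check against the dataset description before treating all rows as independent.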

The student data rank alcohol consumption on weekdays and on weekends from 1 (very low) to 5 (very high). We will attempt to build an accurate model that predicts a student’s weekday and weekend consumption levels from that student’s other attributes. The derived quantities will therefore include the fitted parameters of whichever modeling methods give the highest prediction accuracy, along with the relationship between the 1-to-5 consumption rating and the numerical values assigned to the attributes. We also hope to apply the model to our class data to see where our student population fits relative to the student population in the dataset.
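Because the summary tables show a median weekday rating (`Dalc`) of 1 and a 75th percentile of 2, a model that always predicts the most common rating may already score well on accuracy. The sketch below is an assumption about how accuracy might be judged, not a chosen methodology: it computes exact-match accuracy and a majority-class baseline to compare any model against. The ratings are made up for illustration, and the helper names are hypothetical.

```python
import pandas as pd

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true 1-5 rating exactly."""
    y_true = pd.Series(list(y_true))
    y_pred = pd.Series(list(y_pred))
    return float((y_true == y_pred).mean())

def majority_baseline(y_train, n):
    """Predict the most frequent training rating for each of n students."""
    most_common = int(pd.Series(list(y_train)).mode().iloc[0])
    return [most_common] * n

# Illustrative ratings only, not values taken from the dataset.
train_dalc = [1, 1, 1, 2, 3, 1, 2, 1]
test_dalc = [1, 2, 1, 1, 5]

baseline_pred = majority_baseline(train_dalc, len(test_dalc))
baseline_acc = accuracy(test_dalc, baseline_pred)
print(baseline_acc)
```

A candidate model would only be worth keeping if its accuracy on held-out students clearly exceeds this baseline.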

Data processing will include replacing strings with numerical values, concatenating the datasets for the two student populations (math course and Portuguese language course), and applying various modeling and visualization techniques. The model with the highest accuracy can then be used to predict alcohol consumption levels for our class. Comparing those predictions with survey data from University of Utah students in the Intro to Data Science class will show how accurately the model extends to a population outside the dataset’s locations. The methods for modeling and visualization are discussed in the following sections.

## Exploratory Analysis

## Analysis Methodology

## Project Schedule