<a href="https://colab.research.google.com/github/ViktorSivek/Student_Alcohol_Consumption/blob/main/Plotly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Student Alcohol Consumption

## Context

The data were obtained in a survey of students taking a math course in secondary school. They contain a range of social, gender, and study-related information about the students.

## Content

Attributes for student-mat.csv (Math course) dataset:

1.   school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2.   sex - student's sex (binary: 'F' - female or 'M' - male)
3.   age - student's age (numeric: from 15 to 22)
4.   address - student's home address type (binary: 'U' - urban or 'R' - rural)
5.   famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6.   Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7.   Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8.   Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9.   Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10.   Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11.   reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12.   guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13.   traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14.   studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15.   failures - number of past class failures (numeric: from 0 to 3 in this file; the dataset documentation describes it as n if 1<=n<3, else 4)
16.   schoolsup - extra educational support (binary: yes or no)
17.   famsup - family educational support (binary: yes or no)
18.   paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19.   activities - extra-curricular activities (binary: yes or no)
20.   nursery - attended nursery school (binary: yes or no)
21.   higher - wants to take higher education (binary: yes or no)
22.   internet - Internet access at home (binary: yes or no)
23.   romantic - with a romantic relationship (binary: yes or no)
24.   famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25.   freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26.   goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27.   Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high) 
28.   Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29.   health - current health status (numeric: from 1 - very bad to 5 - very good)
30.   absences - number of school absences (numeric: from 0 to 93 according to the dataset documentation; the maximum in student-mat.csv is 75)


These grades are related to the Math course:

1.   G1 - first period grade (numeric: from 0 to 20)
2.   G2 - second period grade (numeric: from 0 to 20)
3.   G3 - final grade (numeric: from 0 to 20, output target)
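
The ordinal codes in the attribute list are easier to read in tables and charts once they are mapped back to their labels. The following is a minimal sketch, assuming the label texts from the list above; the names `EDUCATION_LABELS` and `add_education_labels` are hypothetical helpers, not part of the original notebook, and the sample values are the `Medu`/`Fedu` values of the first five rows printed by `data.head()` later in this notebook.

```python
import pandas as pd

# Labels copied from the attribute list (Medu and Fedu share the same coding).
EDUCATION_LABELS = {
    0: "none",
    1: "primary education (4th grade)",
    2: "5th to 9th grade",
    3: "secondary education",
    4: "higher education",
}


def add_education_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with readable label columns for Medu and Fedu."""
    out = df.copy()
    for col in ("Medu", "Fedu"):
        out[f"{col}_label"] = out[col].map(EDUCATION_LABELS)
    return out


# Medu/Fedu values of the first five rows shown by data.head()
sample = pd.DataFrame({"Medu": [4, 1, 1, 4, 3], "Fedu": [4, 1, 1, 2, 3]})
labeled = add_education_labels(sample)
print(labeled)
```

Codes that are missing from the dictionary would map to `NaN`, which makes unexpected values easy to spot.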


## Imports and data input

In [1]:
#@title
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
#@title
data_path = '/content/drive/MyDrive/Colab Notebooks/student-mat.csv'
data = pd.read_csv(data_path)

## Data cleaning

In [3]:
#@title
# Check for missing values
print("Missing values count:")
print(data.isnull().sum())

Missing values count:
school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64


In [4]:
#@title
# Check for duplicates
duplicates = data.duplicated()

# Print the number of duplicate rows
print("Number of duplicate rows:", duplicates.sum())

Number of duplicate rows: 0
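
Besides missing values and duplicates, it can be worth checking that numeric columns stay within the ranges documented in the attribute list. The sketch below is one possible way to do that; `EXPECTED_RANGES` and `out_of_range_counts` are hypothetical names introduced here, only a few columns are covered, and the sample rows reuse the first five rows printed by `data.head()` later in this notebook.

```python
import pandas as pd

# (min, max) ranges taken from the attribute list; extend as needed.
EXPECTED_RANGES = {
    "age": (15, 22),
    "Dalc": (1, 5),
    "Walc": (1, 5),
    "G3": (0, 20),
}


def out_of_range_counts(df, ranges=EXPECTED_RANGES):
    """Count values outside the documented inclusive range for each listed column."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            # between() is inclusive on both ends; NaN counts as out of range
            counts[col] = int((~df[col].between(low, high)).sum())
    return counts


sample = pd.DataFrame({
    "age": [18, 17, 15, 15, 16],
    "Dalc": [1, 1, 2, 1, 1],
    "Walc": [1, 1, 3, 1, 2],
    "G3": [6, 6, 10, 15, 10],
})
print(out_of_range_counts(sample))
```

On the full dataset the same call would be `out_of_range_counts(data)`.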


## Data description

In [5]:
#@title
data.head()

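| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |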


In [6]:
#@title
data.shape

(395, 33)

In [7]:
#@title
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [8]:
#@title
data.describe()

| | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 16.696203 | 2.749367 | 2.521519 | 1.448101 | 2.035443 | 0.334177 | 3.944304 | 3.235443 | 3.108861 | 1.481013 | 2.291139 | 3.55443 | 5.708861 | 10.908861 | 10.713924 | 10.41519 |
| std | 1.276043 | 1.094735 | 1.088201 | 0.697505 | 0.83924 | 0.743651 | 0.896659 | 0.998862 | 1.113278 | 0.890741 | 1.287897 | 1.390303 | 8.003096 | 3.319195 | 3.761505 | 4.581443 |
| min | 15.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 25% | 16.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 0.0 | 8.0 | 9.0 | 8.0 |
| 50% | 17.0 | 3.0 | 2.0 | 1.0 | 2.0 | 0.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 4.0 | 4.0 | 11.0 | 11.0 | 11.0 |
| 75% | 18.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 4.0 | 4.0 | 2.0 | 3.0 | 5.0 | 8.0 | 13.0 | 13.0 | 14.0 |
| max | 22.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 75.0 | 19.0 | 19.0 | 20.0 |


## Correlation

In [9]:
#@title
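# numeric_only=True skips text columns such as 'school'; pandas 2.0+ raises an error without it
correlations = data.corr(numeric_only=True)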
print(correlations)

                 age      Medu      Fedu  traveltime  studytime  failures  \
age         1.000000 -0.163658 -0.163438    0.070641  -0.004140  0.243665   
Medu       -0.163658  1.000000  0.623455   -0.171639   0.064944 -0.236680   
Fedu       -0.163438  0.623455  1.000000   -0.158194  -0.009175 -0.250408   
traveltime  0.070641 -0.171639 -0.158194    1.000000  -0.100909  0.092239   
studytime  -0.004140  0.064944 -0.009175   -0.100909   1.000000 -0.173563   
failures    0.243665 -0.236680 -0.250408    0.092239  -0.173563  1.000000   
famrel      0.053940 -0.003914 -0.001370   -0.016808   0.039731 -0.044337   
freetime    0.016434  0.030891 -0.012846   -0.017025  -0.143198  0.091987   
goout       0.126964  0.064094  0.043105    0.028540  -0.063904  0.124561   
Dalc        0.131125  0.019834  0.002386    0.138325  -0.196019  0.136047   
Walc        0.117276 -0.047123 -0.012631    0.134116  -0.253785  0.141962   
health     -0.062187 -0.046878  0.014742    0.007501  -0.075616  0.065827   

In [10]:
#@title
fig = px.imshow(correlations,
                x=correlations.columns,
                y=correlations.columns,
                title="Correlation Matrix Heatmap",
                labels=dict(color="Correlation"))
fig.show()
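
A full correlation matrix is hard to scan when the main interest is the final grade. As a rough sketch, the columns can be ranked by the absolute value of their correlation with `G3`. The function name `strongest_correlations` is a hypothetical helper, and the sample frame reuses a few columns from the first five rows shown by `data.head()`, so the resulting numbers only illustrate the mechanics and say nothing about the full dataset.

```python
import pandas as pd


def strongest_correlations(df, target="G3", n=5):
    """Return the n numeric columns most correlated with target, by absolute value."""
    # select_dtypes drops text columns so corr() works on any pandas version
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)


sample = pd.DataFrame({
    "school": ["GP"] * 5,  # text column, ignored by the helper
    "age": [18, 17, 15, 15, 16],
    "absences": [6, 4, 10, 2, 4],
    "G1": [5, 5, 7, 15, 6],
    "G2": [6, 5, 8, 14, 10],
    "G3": [6, 6, 10, 15, 10],
})
top = strongest_correlations(sample, target="G3", n=3)
print(top)
```

On the full dataset, `strongest_correlations(data)` would return signed correlations, so negative relationships remain visible even though the ranking uses magnitudes.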

## Analysis

### Pie chart: Percentage of students by sex
This pie chart shows the percentage of students by sex.


In [11]:
#@title
pie_chart = px.pie(data, names='sex', title='Percentage of Students by Sex')
pie_chart.show()

### Line plot: Number of students in each age group
This line plot shows the number of students in each age group. It helps to understand the age distribution of the dataset.

In [12]:
#@title
students_per_age = data["age"].value_counts().sort_index().reset_index()
students_per_age.columns = ["age", "count"]

fig = px.line(students_per_age, x="age", y="count", title="Number of Students in Each Age Group",
              labels={"age": "Age", "count": "Number of Students"})
fig.update_layout(xaxis_title="Age", yaxis_title="Number of Students")
fig.show()

### Bar plot: Number of students with different levels of mother's education
This bar plot shows the number of students with different levels of mother's education. It helps to understand the distribution of mother's education in the dataset.

In [13]:
#@title
students_per_mother_edu = data["Medu"].value_counts().sort_index().reset_index()
students_per_mother_edu.columns = ["Medu", "count"]

fig = px.bar(students_per_mother_edu, x="Medu", y="count", title="Number of Students by Mother's Education Level",
             labels={"Medu": "Mother's Education Level", "count": "Number of Students"})
fig.update_layout(xaxis_title="Mother's Education Level", yaxis_title="Number of Students")
fig.show()

### Pie chart: Proportion of students with different levels of father's education
This pie chart shows the proportion of students with different levels of father's education. It helps to understand the distribution of father's education in the dataset.

In [14]:
#@title
students_per_father_edu = data["Fedu"].value_counts().reset_index()
students_per_father_edu.columns = ["Fedu", "count"]

fig = px.pie(students_per_father_edu, values="count", names="Fedu", title="Proportion of Students by Father's Education Level")
fig.show()

### Box plot of final grades (G3) by mother's job (Mjob)
This box plot shows the distribution of students' final grades (G3) by their mother's job.

In [15]:
#@title
fig = px.box(data, x='Mjob', y='G3', points='all')
fig.update_layout(title='Final Grades (G3) by Mother\'s Job', xaxis_title='Mother\'s Job', yaxis_title='Final Grade (G3)')
fig.show()

### Bar plot of average study time (studytime) by age
This bar plot shows the average study time of students by age.

In [16]:
#@title
age_studytime = data.groupby('age')['studytime'].mean().reset_index()

fig = px.bar(age_studytime, x='age', y='studytime')
fig.update_layout(title='Average Study Time by Age', xaxis_title='Age', yaxis_title='Average Study Time')
fig.show()

### Scatter plot: Relationship between study time and final grade (G3)
This scatter plot shows the relationship between study time and the final grade (G3) of students. It can help determine if there is a correlation between study time and academic performance.

In [17]:
#@title
fig = px.scatter(data, x="studytime", y="G3", title="Study Time vs Final Grade (G3)",
                 labels={"studytime": "Study Time", "G3": "Final Grade"})
fig.update_layout(xaxis_title="Study Time", yaxis_title="Final Grade")
fig.show()
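
Because `studytime` takes only four ordinal values, many points in this scatter plot overlap, and a numeric summary can complement the picture. The sketch below computes a Spearman rank correlation as the Pearson correlation of average ranks (which avoids an extra dependency) together with per-level counts and means. The helper name `spearman_via_ranks` is hypothetical, and the `toy` values are made up purely for illustration; they are not taken from the dataset.

```python
import pandas as pd


def spearman_via_ranks(x: pd.Series, y: pd.Series) -> float:
    """Spearman correlation, computed as the Pearson correlation of average ranks."""
    return x.rank().corr(y.rank())


# Made-up toy values, NOT rows from student-mat.csv
toy = pd.DataFrame({
    "studytime": [1, 1, 2, 2, 3, 4],
    "G3": [8, 10, 10, 12, 13, 15],
})

rho = spearman_via_ranks(toy["studytime"], toy["G3"])
per_level = toy.groupby("studytime")["G3"].agg(["count", "mean"])

print(f"Spearman rho: {rho:.3f}")
print(per_level)
```

With the real data, `toy` would simply be replaced by `data`. The per-level counts matter because a mean based on only a handful of students is not very reliable.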

### Scatter plot: Age vs. Absences
This scatter plot shows the relationship between students' age and the number of absences. Each point represents a student, with the hover information showing the school, sex, and final grade.

In [18]:
#@title
fig = px.scatter(data, x="age", y="absences", title="Age vs. Absences",
                 labels={"age": "Age", "absences": "Number of Absences"},
                 hover_name="school",
                 hover_data=["sex", "G3"])
fig.update_layout(xaxis_title="Age", yaxis_title="Number of Absences")
fig.show()

### Violin plot: Distribution of absences by sex
This violin plot shows the distribution of absences by sex.

In [19]:
#@title
fig = px.violin(data, x="sex", y="absences", title="Distribution of Absences by Sex",
                labels={"sex": "Sex", "absences": "Absences"})
fig.update_layout(xaxis_title="Sex", yaxis_title="Absences")
fig.show()

### Scatter plot of final grades (G3) vs. absences with color based on alcohol consumption (Dalc + Walc)
This scatter plot shows the relationship between students' final grades (G3) and their number of absences, with points colored based on their alcohol consumption (Dalc + Walc).

In [20]:
#@title
data['total_alcohol'] = data['Dalc'] + data['Walc']

fig = px.scatter(data, x='absences', y='G3', color='total_alcohol')
fig.update_layout(title='Final Grades (G3) vs. Absences (colored by Alcohol Consumption)',
                  xaxis_title='Absences', yaxis_title='Final Grade (G3)')
fig.show()

### Heatmap: Correlation between alcohol consumption (Dalc and Walc) and academic performance (G1, G2, G3)
This heatmap displays the correlation between alcohol consumption (workday and weekend) and academic performance (G1, G2, G3).

In [21]:
#@title
corr_matrix = data[["Dalc", "Walc", "G1", "G2", "G3"]].corr()

fig = px.imshow(corr_matrix, title="Correlation Heatmap: Alcohol Consumption vs. Academic Performance",
                labels=dict(x="Variable", y="Variable", color="Correlation"))
fig.update_xaxes(tickvals=[0, 1, 2, 3, 4], ticktext=["Dalc", "Walc", "G1", "G2", "G3"])
fig.update_yaxes(tickvals=[0, 1, 2, 3, 4], ticktext=["Dalc", "Walc", "G1", "G2", "G3"])
fig.show()
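
The 5x5 matrix also contains alcohol-to-alcohol and grade-to-grade correlations, which can visually dominate the heatmap. A minimal sketch for isolating only the alcohol-versus-grade block is shown here; `alcohol_grade_block` is a hypothetical helper, and the sample reuses the five rows printed by `data.head()`, so the values are illustrative only.

```python
import pandas as pd

ALCOHOL = ["Dalc", "Walc"]
GRADES = ["G1", "G2", "G3"]


def alcohol_grade_block(df: pd.DataFrame) -> pd.DataFrame:
    """Correlations with alcohol columns as rows and grade columns as columns."""
    return df[ALCOHOL + GRADES].corr().loc[ALCOHOL, GRADES]


sample = pd.DataFrame({
    "Dalc": [1, 1, 2, 1, 1],
    "Walc": [1, 1, 3, 1, 2],
    "G1": [5, 5, 7, 15, 6],
    "G2": [6, 5, 8, 14, 10],
    "G3": [6, 6, 10, 15, 10],
})
block = alcohol_grade_block(sample)
print(block)
```

The resulting 2x3 frame could be passed to `px.imshow` in the same way as `corr_matrix` above.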

### Histogram of freetime
This histogram shows the distribution of students' free time after school.

In [22]:
#@title
fig = px.histogram(data, x='freetime', nbins=5)
fig.update_layout(title='Histogram of Free Time', xaxis_title='Free Time', yaxis_title='Count')
fig.show()

### Box plot: G3 distribution by sex
This box plot shows the distribution of final grades (G3) for male and female students. It provides a visual representation of the median, quartiles, and outliers for each group.

In [23]:
#@title
fig = px.box(data, x="sex", y="G3", title="G3 Distribution by Sex",
             labels={"sex": "Sex", "G3": "Final Grade (G3)"})
fig.update_layout(xaxis_title="Sex", yaxis_title="Final Grade (G3)")
fig.show()
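
The quantities a box plot encodes can also be computed directly, which is useful when exact numbers are needed. The sketch below derives quartiles and counts points outside the usual 1.5 x IQR fences for each group. `box_stats` is a hypothetical helper; the "F" values are the `G3` grades of the first five rows from `data.head()`, while the "M" values are made up for illustration. Plotly may use a slightly different quartile method, so the numbers are not guaranteed to match the figure exactly.

```python
import pandas as pd


def box_stats(df, group_col, value_col):
    """Quartiles and 1.5*IQR outlier counts of value_col for each group."""
    rows = []
    for name, values in df.groupby(group_col)[value_col]:
        q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = int(((values < lower) | (values > upper)).sum())
        rows.append({group_col: name, "q1": q1, "median": median,
                     "q3": q3, "outliers": outliers})
    return pd.DataFrame(rows).set_index(group_col)


toy = pd.DataFrame({
    "sex": ["F"] * 5 + ["M"] * 5,
    # F: G3 of the first five rows; M: made-up illustrative values
    "G3": [6, 6, 10, 15, 10, 0, 9, 10, 11, 12],
})
stats = box_stats(toy, "sex", "G3")
print(stats)
```

On the full dataset the call would be `box_stats(data, "sex", "G3")`.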

### Bar plot: Average final grade by reason for choosing a school
This bar plot shows the average final grade (G3) for students based on their reason for choosing a school. It helps to understand if there's any connection between the students' motivation and their academic performance.

In [24]:
#@title
mean_grade_by_reason = data.groupby("reason")["G3"].mean().reset_index()

fig = px.bar(mean_grade_by_reason, x="reason", y="G3", title="Average Final Grade by Reason for Choosing School",
             labels={"reason": "Reason for Choosing School", "G3": "Average Final Grade"})
fig.update_layout(xaxis_title="Reason for Choosing School", yaxis_title="Average Final Grade")
fig.show()

### Scatter plot: Relationship between G1 and G3 grades
This scatter plot shows the relationship between G1 and G3 grades.

In [25]:
#@title
# Scatter plot: relationship between G1 and G3 grades
scatter_plot = px.scatter(data, x='G1', y='G3', title='Relationship between G1 and G3 Grades')
scatter_plot.update_xaxes(title_text='G1 Grades')
scatter_plot.update_yaxes(title_text='G3 Grades')
scatter_plot.show()

### Box plot: Distribution of final grades (G3) by address type (Urban or Rural)
This box plot shows the distribution of final grades (G3) based on the address type (urban or rural). It can help explore whether there is a difference in academic performance between urban and rural students.



In [26]:
#@title
fig = px.box(data, x="address", y="G3", title="Final Grade (G3) Distribution by Address Type",
             labels={"address": "Address Type", "G3": "Final Grade"})
fig.update_layout(xaxis_title="Address Type", yaxis_title="Final Grade")
fig.show()
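
A box plot suggests whether two groups differ but does not quantify how surprising the difference is. One option, sketched here under stated assumptions, is a simple two-sided permutation test on the difference in mean `G3` using only NumPy. This is a swapped-in technique that does not appear in the original analysis; `permutation_test_mean_diff` is a hypothetical helper, the "urban" values are the `G3` grades of the first five rows from `data.head()`, and the "rural" values are made up.

```python
import numpy as np


def permutation_test_mean_diff(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for the difference in means of a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random (in place)
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # add-one correction keeps the estimated p-value strictly positive
    return observed, (hits + 1) / (n_permutations + 1)


urban = [6, 6, 10, 15, 10]   # G3 of the first five rows (all address 'U')
rural = [5, 8, 9, 11, 12]    # made-up illustrative values
observed, p_value = permutation_test_mean_diff(urban, rural)
print(f"difference in means: {observed:.2f}, p-value: {p_value:.3f}")
```

With the real data, the two inputs would be `data.loc[data["address"] == "U", "G3"]` and `data.loc[data["address"] == "R", "G3"]`.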

### Sunburst plot of final grades (G3) by school and sex
This sunburst plot shows the distribution of students' final grades (G3) by school and sex. The innermost circle represents the schools, the middle circle represents the sexes, and the outermost circle represents final grades (G3).

In [27]:
#@title
fig = px.sunburst(data, path=['school', 'sex', 'G3'])
fig.update_layout(title='Sunburst Plot of Final Grades (G3) by School and Sex')
fig.show()
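
With raw grades as the outermost ring, the sunburst can end up with up to 21 thin sectors per school and sex. A possible refinement, sketched below, is to bin `G3` into a few bands first and count students per band. The band edges and labels (`low`, `mid`, `high`) are arbitrary assumptions chosen for illustration, `grade_band_counts` is a hypothetical helper, and the sample rows are the first five rows from `data.head()`.

```python
import pandas as pd


def grade_band_counts(df, bins=(-1, 9, 13, 20), labels=("low", "mid", "high")):
    """Count students per school, sex and G3 band (band edges are illustrative)."""
    banded = df.assign(
        G3_band=pd.cut(df["G3"], bins=list(bins), labels=list(labels))
    )
    # observed=True keeps only band combinations that actually occur
    return banded.groupby(["school", "sex", "G3_band"], observed=True).size()


sample = pd.DataFrame({
    "school": ["GP"] * 5,
    "sex": ["F"] * 5,
    "G3": [6, 6, 10, 15, 10],
})
counts = grade_band_counts(sample)
print(counts)
```

A frame with a `G3_band` column built the same way could then be used with `px.sunburst(..., path=['school', 'sex', 'G3_band'])`.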

## Source

Author of analysis:
Viktor Sívek (sivv01)

Source:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

Fabio Pagnotta, Hossain Mohammad Amran.
Email: fabio.pagnotta@studenti.unicam.it, mohammadamra.hossain '@' studenti.unicam.it
University Of Camerino

https://www.kaggle.com/datasets/uciml/student-alcohol-consumption