# Analysis of Student Academic Performance

## 1. Introduction
When we talk about student academic performance, a good teacher or a motivated student is only a small part of what ultimately contributes to a student’s success. Many other factors are involved, including parental involvement, student assessment, and lesson delivery. In this project, we aim to identify the major factors that influence students' academic performance.
#### 1.1 Research Question
What are the major factors that influence students' academic performance?

## 2. Methods
* Data Preparation
* Data Merging
* Data Cleaning
* Exploratory Data Analysis
* Results

#### 2.1 Data Preparation
* For this research, we use two datasets downloaded in CSV format from the UCI Machine Learning Repository ([link](https://archive.ics.uci.edu/ml/datasets/Student+Performance)).
* These datasets describe the performance of high school students from two Portuguese schools in two distinct subjects: Mathematics (mat) and Portuguese language (por).
* Both datasets share the same attributes, which include student grades and demographic, social, and school-related features, described below:

|Attribute name| Description| Details|
|---------|-------------|-----------------|
|school | student's school	| binary: "GP" = Gabriel Pereira, "MS" = Mousinho da Silveira|
|sex | student's sex	| binary: "F" = female, "M" = male|
|age   |student's age	| numeric: from 15 to 22|
|address | student's home address type	| binary: "U" = urban, "R" = rural|
|famsize   |family size of student	| binary: "LE3" = less than or equal to 3, "GT3" = greater than 3|
|Pstatus | parent's cohabitation status	| binary: "T" = living together, "A" = apart|
|Medu   |mother's education	| numeric: 0 = none,  1 = primary education (4th grade), 2 = 5th to 9th grade, 3 = secondary education, 4 = higher education|
|Fedu | father's education	| numeric: 0 = none,  1 = primary education (4th grade), 2 = 5th to 9th grade, 3 = secondary education, 4 = higher education|
|Mjob   | mother's job	| nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home", "other"|
|Fjob | father's job	| nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home", "other"|
|reason   |reason for choosing this school	| nominal: close to "home", school "reputation", "course" preference, "other"|
|guardian | student's guardian	| nominal: "mother", "father", "other"|
|traveltime   |home to school travel time	| numeric: 1 = <15 min., 2 = 15 to 30 min., 3 = 30 min. to 1 hour, 4 = >1 hour|
|studytime | weekly study time	| numeric: 1 = <2 hours, 2 = 2 to 5 hours, 3 = 5 to 10 hours, 4 = >10 hours|
|failures   |number of past class failures	| numeric: n if 1<=n<3, else 4|
|schoolsup | extra educational support	| binary: yes or no|
|famsup   |family educational support	| binary: yes or no|
|paid |extra paid classes within the course subject (Math or Portuguese) | binary: yes or no|
|activities | extra-curricular activities	| binary: yes or no|
|nursery   |attended nursery school	| binary: yes or no|
|higher | wants to take higher education| binary: yes or no|
|internet   |Internet access at home	| binary: yes or no|
|romantic | in a romantic relationship	| binary: yes or no|
|famrel |quality of family relationships | numeric: from 1 - very bad to 5 - excellent|
|freetime | free time after school	| numeric: from 1 - very low to 5 - very high|
|goout   |going out with friends	| numeric: from 1 - very low to 5 - very high|
|Dalc | workday alcohol consumption	| numeric: from 1 - very low to 5 - very high|
|Walc   |weekend alcohol consumption | numeric: from 1 - very low to 5 - very high|
|health   |current health status | numeric: from 1 - very bad to 5 - very good|
|absences   |number of school absences	| numeric: from 0 to 93|
|G1 | first period grade	| numeric: from 0 to 20|
|G2 | second period grade	| numeric: from 0 to 20|
|G3   |final grade	| numeric: from 0 to 20|

These grades (G1, G2, G3) relate to the course subject, Math or Portuguese.

To prepare the data, let us import the needed libraries and load the two datasets into Python.

In [1]:
# Import needed libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

**Dataset 1 - Provides the performance of high school students from 2 Portuguese schools in Mathematics (mat)**

In [8]:
# Load dataset 1
mat_grades = pd.read_csv('student/student-mat.csv', sep = ";")
print(mat_grades.shape)
mat_grades.head()

(395, 33)


Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


**Dataset 2 - Provides the performance of high school students from 2 Portuguese schools in Portuguese language (por)**

In [9]:
# Load dataset 2
por_grades = pd.read_csv('student/student-por.csv', sep = ";")
print(por_grades.shape)
por_grades.head()

(649, 33)


Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13


#### 2.2 Data Merging
382 students appear in both datasets. Because students have no unique identifier (e.g. a student ID), they can be identified by matching the attributes that characterize each student. These attributes are then used to merge the two datasets.
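The matching step can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own merge: the key list `id_cols` is an assumption (13 attributes that should not change between the two subjects), and the helper `merge_on_identity`, the `_mat`/`_por` suffixes, and the toy rows are hypothetical.

```python
import pandas as pd

# Assumption: these 13 attributes describe the student rather than the course,
# so they should be identical for the same student in both files.
id_cols = ["school", "sex", "age", "address", "famsize", "Pstatus",
           "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"]


def merge_on_identity(mat, por, keys=id_cols):
    """Inner-join the two subject tables on the identifying attributes."""
    return pd.merge(mat, por, on=keys, how="inner", suffixes=("_mat", "_por"))


# Toy rows (made up for illustration): one student appears in both tables.
base = {"school": "GP", "sex": "F", "age": 15, "address": "U",
        "famsize": "GT3", "Pstatus": "T", "Medu": 4, "Fedu": 2,
        "Mjob": "health", "Fjob": "services", "reason": "home",
        "nursery": "yes", "internet": "yes"}
toy_mat = pd.DataFrame([{**base, "absences": 2, "G3": 15},
                        {**base, "age": 17, "absences": 4, "G3": 6}])
toy_por = pd.DataFrame([{**base, "absences": 0, "G3": 14},
                        {**base, "age": 18, "absences": 1, "G3": 11}])

toy_merged = merge_on_identity(toy_mat, toy_por)
print(toy_merged.shape)
```

Note that the matched student has different `absences` values in the two toy tables. Adding subject-specific columns such as `absences` or `paid` to the join keys would require them to agree as well, which can only reduce the number of matched rows.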

In [32]:
stu=pd.concat([mat_grades,por_grades])

In [33]:
stu.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [34]:
stu.shape

(1044, 33)

In [39]:
649*2-39

1259

In [29]:
# merge dataset 1 & 2
student_grades = pd.merge(mat_grades,
                          por_grades,
                          on = ["school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet",'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup',
                                'famsup', 'paid', 'activities', 'higher', 'romantic', 'famrel',
                                'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences'],
                          #suffixes = ("", ""),
                          how ='inner'
                         )
student_grades.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,Dalc,Walc,health,absences,G1_x,G2_x,G3_x,G1_y,G2_y,G3_y
0,GP,M,16,U,LE3,T,2,2,other,other,...,1,1,3,0,12,12,11,13,12,13
1,GP,M,15,U,GT3,A,2,2,other,other,...,1,1,3,0,14,16,16,14,14,15
2,GP,M,15,U,GT3,T,4,3,teacher,other,...,1,1,1,0,13,14,15,12,13,14
3,GP,M,15,U,GT3,T,4,4,health,health,...,1,1,5,0,12,15,15,11,12,12
4,GP,M,15,U,GT3,T,4,4,health,services,...,3,4,5,0,9,11,12,10,11,11


In [30]:
student_grades.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1_x           int64
G2_x           int64
G3_x           int64
G1_y           int64
G2_y           int64
G3_y           int64
dtype: object

In [31]:
student_grades.shape

(39, 36)

#### 2.3 Data Cleaning
* Renaming columns to more meaningful names
* Assigning the appropriate data types to each feature
* Dealing with missing & duplicate values
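The last two steps above could look roughly like the sketch below. It assumes that "appropriate data types" means casting the text columns to pandas' `category` dtype and that rows with missing values are simply dropped; both are assumptions, and the helper `clean_grades` and the toy frame are hypothetical.

```python
import pandas as pd


def clean_grades(df):
    """Return a cleaned copy of df.

    Drops exact duplicate rows and rows with missing values, then casts
    every non-numeric column to the 'category' dtype.
    """
    out = df.drop_duplicates().dropna().reset_index(drop=True)
    text_cols = out.select_dtypes(exclude="number").columns
    return out.astype({col: "category" for col in text_cols})


# Toy data (made up): row 1 duplicates row 0, row 3 has a missing school.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS", None],
    "sex": ["F", "F", "M", "M"],
    "age": [15, 15, 17, 16],
    "G3": [10, 10, 12, 9],
})
cleaned = clean_grades(toy)
print(cleaned.dtypes)
print(len(cleaned))
```

Whether to drop or impute missing values depends on how many there are; checking `df.isna().sum()` and `df.duplicated().sum()` first shows how much data each choice would affect.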

###### Rename Columns

In [None]:
new_col_names = {'famsize': 'fam_size',
                 'Pstatus': 'parent_living_status',
                 'Medu': 'mother_edu',
                 'Fedu': 'father_edu',
                 'Mjob': 'mother_job',
                 'Fjob': 'father_job',
                 'traveltime': 'sch_home_dist',
                 'studytime': 'study_time',
                 'schoolsup': 'sch_support',
                 'famsup': 'fam_support',
                 'paid': 'extra_lesson',
                 'higher': 'higher_edu',
                 'internet': 'internet_access',
                 'famrel': 'fam_relationship',
                 'goout': 'friend_hangout',
                 'Dalc': 'workday_alc_level',
                 'Walc': 'weekend_alc_level',
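                 # The merge suffixed the grade columns: _x = Mathematics, _y = Portuguese
                 'G1_x': 'first_grade_mat',
                 'G2_x': 'second_grade_mat',
                 'G3_x': 'final_grade_mat',
                 'G1_y': 'first_grade_por',
                 'G2_y': 'second_grade_por',
                 'G3_y': 'final_grade_por'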
                }
student_grades.rename(columns = new_col_names, inplace = True)
print(student_grades.head())
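By default, `DataFrame.rename` ignores mapping keys that do not match any column, so a key that no longer exists after a merge (for example because a suffix was appended) goes unnoticed. A small check, using a hypothetical helper `missing_rename_keys` and made-up toy data, can surface such keys before renaming:

```python
import pandas as pd


def missing_rename_keys(df, mapping):
    """Return the mapping keys that do not match any column of df."""
    return sorted(set(mapping) - set(df.columns))


# Toy frame whose grade columns carry merge suffixes.
toy = pd.DataFrame({"Medu": [4], "G1_x": [12], "G1_y": [13]})
toy_mapping = {"Medu": "mother_edu", "G1": "first_grade"}

unmatched = missing_rename_keys(toy, toy_mapping)
print(unmatched)  # ['G1']
```

Alternatively, passing `errors="raise"` to `rename` makes pandas raise a `KeyError` for unmatched keys instead of skipping them.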

In [6]:
student_data.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object