# Student exam performance EDA

<a id="Problems-and-Objectives"></a>
## 1. Problems and Objectives

Analyzing student exam scores helps us understand how the variables in the dataset relate to one another and which trends and patterns they show. These insights can later support predictive models aimed at improving student performance and academic success. Without a thorough examination of the data and suitable visualizations, it is hard to draw meaningful conclusions or to build on the results in later research.

Therefore, the objectives of this notebook are:
* To explore the relationships between different variables in the dataset
* To identify trends and patterns in the data
* To visualize the data using various plotting techniques

<a id="Importing-libraries"></a>
## 2. Importing libraries

In [None]:
import numpy as np
import pandas as pd
import missingno as msno
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

<a id="Dataset-overview"></a>
## 3. Dataset overview

In [None]:
exam_scores = pd.read_csv('/kaggle/input/students-exam-scores/Expanded_data_with_more_features.csv')

In [None]:
exam_scores.columns

In [None]:
exam_scores.sample(10)

#### Feature explanation
1. **Gender**: Gender of the student (male/female)
2. **EthnicGroup**: Ethnic group of the student (group A to E)
3. **ParentEduc**: Parent(s) education background (from some high school to master's degree)
4. **LunchType**: School lunch type (standard or free/reduced)
5. **TestPrep**: Test preparation course followed (completed or none)
6. **ParentMaritalStatus**: Parent(s) marital status (married/single/widowed/divorced)
7. **PracticeSport**: How often the student practices sport (never/sometimes/regularly)
8. **IsFirstChild**: Whether the student is the first child in the family (yes/no)
9. **NrSiblings**: Number of siblings the student has (0 to 7)
10. **TransportMeans**: Means of transport to school (schoolbus/private)
11. **WklyStudyHours**: Weekly self-study hours (less than 5 hrs; between 5 and 10 hrs; more than 10 hrs)
12. **MathScore**: Math test score (0-100)
13. **ReadingScore**: Reading test score (0-100)
14. **WritingScore**: Writing test score (0-100)

<a id="Data-Preprocessing"></a>
## 4. Data Preprocessing

In [None]:
# "Unnamed: 0" seems to be an extra index column, so we drop it
exam_scores = exam_scores.drop('Unnamed: 0', axis = 1)

In [None]:
exam_scores.info()

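The **Dtype** column shows one incorrect data type: **NrSiblings** is stored as float but should be **integer**. The next cell fills missing sibling counts with 0 before casting, which treats a missing value as having no siblings.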

In [None]:
exam_scores['NrSiblings'] = exam_scores['NrSiblings'].fillna(0).astype(int)
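
The float dtype most likely comes from missing values: a NumPy-backed integer column cannot hold NaN, so pandas stores it as float64. The sketch below uses a small made-up Series (not the exam data) to contrast the `fillna(0)` approach with pandas' nullable `Int64` dtype, which keeps the entry missing instead of turning it into a zero.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NrSiblings column (values are made up for illustration)
siblings = pd.Series([1.0, 3.0, np.nan, 0.0], name="NrSiblings")

# The NaN is what forces the float64 dtype
print(siblings.dtype)  # float64

# Approach used in this notebook: treat missing as 0, then cast to integer
filled = siblings.fillna(0).astype(int)

# Alternative: the nullable integer dtype keeps the missing entry as <NA>
nullable = siblings.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```

Which option fits better depends on whether a missing sibling count can reasonably be read as zero.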

In [None]:
msno.matrix(exam_scores)

There are missing values in **['EthnicGroup', 'ParentEduc', 'TestPrep','ParentMaritalStatus', 'PracticeSport',  'IsFirstChild','TransportMeans', 'WklyStudyHours']**
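
The matrix plot shows where values are missing; the counts can also be read as numbers. A minimal sketch on made-up rows (the column names follow the dataset, the values do not):

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few of the categorical columns
toy = pd.DataFrame({
    "Gender": ["female", "male", "male", "female"],
    "TestPrep": ["none", np.nan, "completed", np.nan],
    "TransportMeans": ["school_bus", "private", np.nan, "private"],
})

# Number of missing values per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)

# Share of missing values per column
missing_share = toy.isna().mean().round(2)

print(missing[missing > 0])
print(missing_share)
```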

#### Getting unique values of each category

In [None]:
from IPython.display import HTML

# Accepts a list of objects with a _repr_html_() method (here, DataFrames)
# and returns one HTML table that shows them side by side, one per cell
def multi_table(table_list):
    return HTML('<table><tr style="background-color:white;">' + ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) + '</tr></table>')

In [None]:
categorical_cols = ['Gender', 'EthnicGroup', 'ParentEduc', 'LunchType', 'TestPrep', 'ParentMaritalStatus', 'PracticeSport', 'IsFirstChild', 'TransportMeans', 'WklyStudyHours']

# Value counts of each categorical feature, keyed by column name
nunique_df = {var: pd.DataFrame(exam_scores[var].value_counts()) for var in categorical_cols}

multi_table([nunique_df[var] for var in categorical_cols])

In **ParentEduc**, I will merge "**some high school**" into "**high school**" to simplify the analysis. I will also drop all rows that have NaN values in the categorical columns.

In [None]:
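# .copy() makes filtered_data independent of exam_scores, so later column assignments are safe
filtered_data = exam_scores.dropna(how = 'any', subset=['EthnicGroup', 'ParentEduc', 'TestPrep','ParentMaritalStatus', 'PracticeSport', 'IsFirstChild','TransportMeans', 'WklyStudyHours']).copy()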

In [None]:
filtered_data['ParentEduc'] = filtered_data['ParentEduc'].replace("some high school", "high school")
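
A minimal sketch of the drop-then-merge step on made-up rows. It shows that the two high-school labels end up as one category and that the original frame is left untouched; the data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names follow the dataset
toy = pd.DataFrame({
    "ParentEduc": ["some high school", "high school", np.nan, "master's degree"],
    "MathScore": [55, 62, 70, 81],
})

# Drop rows without a ParentEduc value; .copy() keeps the result independent of toy
cleaned = toy.dropna(subset=["ParentEduc"]).copy()

# Merge the two high-school levels into one label
cleaned["ParentEduc"] = cleaned["ParentEduc"].replace("some high school", "high school")

print(cleaned["ParentEduc"].value_counts())
```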

In [None]:
filtered_data.isnull().sum()

In [None]:
filtered_data.shape

In [None]:
filtered_data.describe()

#### Overview
* After dropping rows with missing values, the data has 20,266 observations.
* The minimum math score is 0, the minimum reading score is 10 and the minimum writing score is 4, so some students scored very low on these tests.
* The 25th percentiles of the math, reading and writing scores are around 56-59, meaning 25% of the students scored below this level.
* The medians (50th percentiles) of the three scores are around 67-70, meaning half of the students scored below this level.
* The 75th percentiles of the three scores are around 78-80, meaning 25% of the students scored above this level.
* The maximum math, reading and writing scores are all 100, so some students scored very high on these tests.
* The average math score (MathScore) is around 67, while the average reading score (ReadingScore) and writing score (WritingScore) are around 69 and 68, respectively.
* The standard deviations of the three scores are similar (around 15), which indicates that the scores have similar variability.
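
The percentile statements above are read from `describe()`. As a small illustration of how those rows are computed, the sketch below takes quartiles of nine made-up scores (not the exam data) with `Series.quantile`.

```python
import pandas as pd

# Nine made-up scores, already sorted for readability
scores = pd.Series([40, 55, 60, 65, 70, 75, 80, 90, 100], name="MathScore")

# The 25%, 50% and 75% rows of describe() are these quantiles
quartiles = scores.quantile([0.25, 0.5, 0.75])

# Interquartile range: the spread of the middle half of the scores
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)
```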

<a id="Explodatory-Data-Analysis"></a>
## 5. Exploratory Data Analysis

In [None]:
def plot_bar(df,feature):
    '''Plot the count of each category of feature (uses match/case, so Python 3.10+ is required)'''
    match feature:
        case 'EthnicGroup':
            fig=px.histogram(data_frame=df,x=feature,
                             title='{} distribution'.format(feature),
                             width=600, height=500,
                             template='simple_white',
                             category_orders={
                                 feature: ["group A", "group B", "group C", "group D", "group E"]})
            colors = ['lightsalmon',] * 5
            colors[2] = 'RebeccaPurple'
            fig.update_traces(marker_color=colors, marker_line_color=None,
                              marker_line_width=2.5, opacity=None)
            fig.show()
        case 'ParentEduc':
            fig=px.histogram(data_frame=df,x=feature,
                             title='{} distribution'.format(feature),
                             width=600, height=500,
                             template='simple_white',
                             category_orders={
                                 feature: ["high school","associate's degree","some college","bachelor's degree","master's degree"]})
            colors = ['lightsalmon',] * 5
            colors[0] = 'RebeccaPurple'
            fig.update_traces(marker_color=colors, marker_line_color=None,
                              marker_line_width=2.5, opacity=None)
            fig.show()
        case 'WklyStudyHours':
            fig=px.histogram(data_frame=df,x=feature,
                             title='{} distribution'.format(feature),
                             width=600, height=500,
                             template='simple_white',
                             category_orders={
                                 feature: ['< 5','5 - 10','> 10']})
            colors = ['lightsalmon',] * 3
            colors[1] = 'RebeccaPurple'
            fig.update_traces(marker_color=colors, marker_line_color=None,
                              marker_line_width=2.5, opacity=None)
            fig.show()
        case _:
            fig=px.histogram(data_frame=df,x=feature,
                             title='{} distribution'.format(feature),
                             width=600, height=500,
                             template='simple_white')
            colors = ['lightsalmon',] * df[feature].nunique()
            colors[0] = 'RebeccaPurple'
            fig.update_traces(marker_color=colors, marker_line_color=None,
                              marker_line_width=2.5, opacity=None)
            fig.show()
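
`plot_bar` hard-codes which bar gets the highlight color (for example index 2 for EthnicGroup), which appears to mark the most frequent category. A minimal sketch of deriving that color list from the data instead; `highlight_colors` is a hypothetical helper that is not part of this notebook, and the rows are made up.

```python
import pandas as pd

def highlight_colors(series, order, base="lightsalmon", highlight="RebeccaPurple"):
    """Return one color per category in `order`, highlighting the most frequent one."""
    top = series.value_counts().idxmax()
    return [highlight if category == top else base for category in order]

# Made-up sample in which group C is the most frequent value
groups = pd.Series(["group C", "group A", "group C", "group B", "group C", "group D"])
order = ["group A", "group B", "group C", "group D", "group E"]

colors = highlight_colors(groups, order)
print(colors)
```

The resulting list has the same shape as the `colors` lists built in `plot_bar`, so it could be passed as `marker_color` in the same way.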

In [None]:
def plot_pie(df,feature):
    '''Plot a pie chart of the category shares of feature'''
    ftr_cnt = df[feature].value_counts()
    fig=px.pie(values=ftr_cnt,
              names=ftr_cnt.index,
              title='{} distribution'.format(feature),
              template='simple_white')
    fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=['RebeccaPurple','lightsalmon'],
        line=dict(color='#000000',
                  width=2)))
    fig.show()

In [None]:
def plot_box(df,feature):
    '''Plot box plots of the three exam scores, grouped by feature'''
    fig = px.box(df,
                 y=['MathScore','ReadingScore','WritingScore'],
                 color=feature,
                 template='simple_white',
                 title='{} and Exam scores'.format(feature),
                 height=500, width=1000)
    fig.show()

In [None]:
def plot_kde(df,feature):
    '''Plot KDE curves of the three exam scores, split by feature'''
    fig, ax=plt.subplots(ncols=3, figsize=(30,10))
    math=sns.kdeplot(data=df,x='MathScore',ax=ax[0],hue=feature)
    read=sns.kdeplot(data=df,x='ReadingScore',ax=ax[1],hue=feature)
    write=sns.kdeplot(data=df,x='WritingScore',ax=ax[2],hue=feature)
    fig.suptitle('{} and Exam scores'.format(feature))
    sns.move_legend(math,'upper left',bbox_to_anchor=(0,1))
    sns.move_legend(read,'upper left',bbox_to_anchor=(0,1))
    sns.move_legend(write,'upper left',bbox_to_anchor=(0,1))

<a id="Univarient-analysis"></a>
### Univariate analysis

In [None]:
plot_pie(filtered_data, 'Gender')

**Observations**: The genders are roughly equally distributed.

In [None]:
plot_bar(filtered_data, 'EthnicGroup')

**Observations**: Most of the students are in group C, followed by D, B, E and A.

In [None]:
plot_bar(filtered_data,'ParentEduc')

**Observations**: Most of the students' parents have a high school education, while very few have a master's degree.

In [None]:
plot_bar(filtered_data, 'LunchType')

**Observations**: Almost twice as many students have standard lunch as free/reduced lunch.

In [None]:
plot_bar(filtered_data,'PracticeSport')

**Observations**: Many students practice sports. Very few never do.

In [None]:
plot_pie(filtered_data,'IsFirstChild')

**Observations**: More than half of the students are first-born.

In [None]:
plot_bar(filtered_data,'WklyStudyHours')

**Observations**: Most students spend 5 to 10 hours a week studying.
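
The observations in this subsection are read off the charts. The same shares can be computed directly with `value_counts`; the sketch below does so on a made-up LunchType sample, so the numbers are illustrative and not results from the dataset.

```python
import pandas as pd

# Made-up sample of the LunchType column
lunch = pd.Series(
    ["standard", "free/reduced", "standard", "standard", "free/reduced", "standard"],
    name="LunchType",
)

# Absolute counts and relative shares per category
counts = lunch.value_counts()
shares = lunch.value_counts(normalize=True).round(3)

# How many standard-lunch students there are per free/reduced-lunch student
ratio = counts["standard"] / counts["free/reduced"]

print(shares)
print(ratio)
```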

<a id="Multivariate-analysis"></a>
### Multivariate analysis

In [None]:
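# Correlate numeric columns only; pandas 2.x raises an error on text columns otherwise
corr = filtered_data.corr(numeric_only=True)
# Boolean mask that hides the upper triangle (including the diagonal)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    fig, ax = plt.subplots(figsize=(8, 8))
    ax = sns.heatmap(corr,mask=mask,square=True,linewidths=.8,annot=True)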

**Observations**: MathScore, ReadingScore and WritingScore are positively correlated: students who score high on one test tend to score high on the others too.
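
For reference, a minimal sketch of the correlation step on a made-up frame that mixes a text column with score columns. `numeric_only=True` (available in recent pandas versions) restricts the Pearson correlation to the numeric columns; the values are invented.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset
toy = pd.DataFrame({
    "Gender": ["female", "male", "female", "male", "female"],
    "MathScore": [50, 60, 70, 80, 90],
    "ReadingScore": [55, 62, 71, 79, 93],
    "WritingScore": [53, 60, 73, 77, 95],
})

# Pearson correlation between the numeric columns; Gender is skipped
corr = toy.corr(numeric_only=True)

print(corr.round(2))
```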

#### Gender and Exam scores

In [None]:
plot_box(filtered_data,'Gender')

**Observations**: Males appear to do better in math than females, while females do better in reading and writing.
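
A box plot comparison like this one can be backed by a grouped summary table. The sketch below computes the mean and median per group on made-up scores; the same pattern would apply to any of the categorical features in this section.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset
toy = pd.DataFrame({
    "Gender": ["female", "male", "female", "male", "female", "male"],
    "MathScore": [60, 70, 64, 72, 62, 74],
    "ReadingScore": [75, 65, 73, 67, 77, 63],
})

# Mean and median of each score per gender
summary = toy.groupby("Gender")[["MathScore", "ReadingScore"]].agg(["mean", "median"])

print(summary)
```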

#### Ethnic Group and Exam scores

In [None]:
plot_box(filtered_data,'EthnicGroup')

**Observations**: There is a slight shift between the groups. Group E has the highest exam scores, followed by D, C, B and A. All five groups reached the maximum score in all 3 exams.

#### Parent's education and Exam scores

In [None]:
plot_kde(filtered_data, 'ParentEduc')

**Observations**: Students whose parents have a master's degree tend to score higher, but parental education does not change the exam scores much.

#### Lunch Type and Exam Scores

In [None]:
plot_box(filtered_data,'LunchType')

**Observations**: Students who have standard lunch tend to score higher than students with free/reduced lunch. The gap is large, so lunch type is strongly associated with exam scores.

#### Test preparation and Exam scores

In [None]:
plot_kde(filtered_data,'TestPrep')

**Observations**: Students who completed the test preparation course tend to score higher than students who did not.

#### Parent marital status and Exam scores

In [None]:
plot_box(filtered_data,'ParentMaritalStatus')

**Observations**: The groups differ only slightly. The plot shows that students with widowed parents score a little higher than the others.

#### Practice Sports and Exam Scores

In [None]:
plot_box(filtered_data,'PracticeSport')

**Observations**: Students who practice sports score higher than students who never do. However, the difference is small.
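
One way to put a number on how large such a gap is: express the difference between two group means in units of the overall standard deviation. This measure is chosen here for illustration and is not computed elsewhere in this notebook; the scores below are made up.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset
toy = pd.DataFrame({
    "PracticeSport": ["never", "regularly", "never", "regularly", "never", "regularly"],
    "MathScore": [60, 62, 64, 66, 68, 70],
})

# Mean score per group and the raw gap between the two groups
group_means = toy.groupby("PracticeSport")["MathScore"].mean()
gap = group_means["regularly"] - group_means["never"]

# Gap relative to the overall spread of the scores (sample standard deviation)
standardized_gap = gap / toy["MathScore"].std()

print(gap)
print(round(standardized_gap, 2))
```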

#### Weekly study hours and Exam scores

In [None]:
plot_box(filtered_data,'WklyStudyHours')

**Observations**: Not surprisingly, the more hours students study, the higher their scores. Students who study more than 10 hours a week have the highest scores, followed by those who study 5 to 10 hours and those who study less than 5 hours a week.
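
To check an ordering like this numerically, the categories need their natural order rather than alphabetical order. The sketch below uses an ordered `Categorical` with the same labels that `plot_bar` passes to `category_orders`; the scores are made up.

```python
import pandas as pd

# Made-up rows; the category labels match those used in plot_bar
toy = pd.DataFrame({
    "WklyStudyHours": ["> 10", "< 5", "5 - 10", "< 5", "> 10", "5 - 10"],
    "MathScore": [72, 60, 66, 62, 70, 68],
})

# Declare the natural order of the study-hour bands
order = ["< 5", "5 - 10", "> 10"]
toy["WklyStudyHours"] = pd.Categorical(toy["WklyStudyHours"], categories=order, ordered=True)

# Group means now come out in the declared order
means = toy.groupby("WklyStudyHours", observed=True)["MathScore"].mean()

print(means)
print(means.is_monotonic_increasing)
```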

<a id="Conclusion"></a>
## Conclusion

### Some main conclusions can be drawn:
#### 1. In terms of **Gender**:
* Males had higher scores in math, while females had higher scores in reading and writing.

#### 2. In terms of **Ethnicity**:
* The groups are not equally distributed.
* Group E has the highest scores in all 3 tests.

#### 3. In terms of **Parent's education**:
* Most parents have a high school education.
* Parents with a master's degree are the smallest group, but their children perform better on the exams.

#### 4. In terms of **Lunch type**:
* Almost twice as many students have standard lunch as free/reduced lunch.
* Students who have standard lunch score higher than students with free/reduced lunch.
* Lunch type is strongly associated with the exam scores.

#### 5. In terms of **Practice sport**:
* More than 3/4 of the students said that they practice sport sometimes or regularly.
* Those who practice sports tend to have higher scores than those who don't.
* The effect on the exam scores is small.

#### 6. In terms of **Weekly study hours**:
* The more the students study, the higher their scores.
* This effect should be taken into account.