# Student Dropout Prediction

### Problem:

School dropout and educational failure pose significant challenges to economic growth and societal well-being, directly impacting students, families, institutions, and the broader community. 

According to the **Education Data Initiative**, around **40% of undergraduate students** drop out before completing their degree. College dropouts face greater financial challenges, earning on average **35% less income** and experiencing a **20% higher unemployment rate** than their peers who graduate.

### Objective:

In this project, I aim to build a **classification model** that predicts student dropout and identifies which students need more support to stay enrolled.


# 1.0 - EDA - Exploratory Data Analysis

In [98]:
# Import main libraries
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from plotly.subplots import make_subplots 
import plotly.graph_objects as go
import plotly.io as pio

In [99]:
# Load dataset into a pandas Dataframe
df = pd.read_csv('../dataset/student_dropout.csv')

In [101]:
# Check for missing values
# Show all columns of the dataframe that contain missing values
df.columns[df.isnull().any()]

Index([], dtype='object')

In [102]:
# Check for duplicate rows
df.duplicated().sum()

np.int64(0)
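
The two checks above can be folded into a single report that also says how many values are missing per column. This is a minimal sketch: `quality_report` is a hypothetical helper and `toy` is made-up data standing in for `df`.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize missing values and duplicate rows (hypothetical helper)."""
    missing = frame.isnull().sum()  # missing-value count per column
    return {
        # Keep only the columns that actually have missing values
        "columns_with_missing": missing[missing > 0].to_dict(),
        # Rows that repeat an earlier row exactly
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Made-up rows: one missing age and one exact duplicate row
toy = pd.DataFrame({
    "age": [15, 16, 16, None],
    "school": ["GP", "MS", "MS", "GP"],
})
report = quality_report(toy)
print(report)
```

On the real dataset, `quality_report(df)` should report no columns with missing values and zero duplicate rows, matching the outputs shown above.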

In [103]:
# Drop duplicate rows
df.drop_duplicates(inplace=True)
# Change column names to lowercase
df.columns = df.columns.str.lower()
# Show renamed columns
df.columns


Index(['school', 'gender', 'age', 'address', 'family_size', 'parental_status',
       'mother_education', 'father_education', 'mother_job', 'father_job',
       'reason_for_choosing_school', 'guardian', 'travel_time', 'study_time',
       'number_of_failures', 'school_support', 'family_support',
       'extra_paid_class', 'extra_curricular_activities', 'attended_nursery',
       'wants_higher_education', 'internet_access', 'in_relationship',
       'family_relationship', 'free_time', 'going_out',
       'weekend_alcohol_consumption', 'weekday_alcohol_consumption',
       'health_status', 'number_of_absences', 'grade_1', 'grade_2',
       'final_grade', 'dropped_out'],
      dtype='object')

### 1.1 - Data Dictionary

- **age**: Student age in years

- **address**: Student address 

- **family_size**:  
   - **GT3**: Greater Than 3  
   - **LE3**: Less than or equal to 3

- **parental_status**:  
   - **T**: Together  
   - **A**: Apart

- **mother_education** & **father_education**:  
   - **1**: No formal education or very basic education (e.g., did not complete primary school).  
   - **2**: Primary education completed (equivalent to elementary school).  
   - **3**: Secondary education completed (high school diploma or equivalent).  
   - **4**: Higher education (college degree, university, or higher).

- **travel_time**: Time spent traveling from home to school

- **attended_nursery**: Whether the student attended nursery. A nursery is a facility or program that provides early childhood education and care, typically from infancy until around 3 or 4 years old. It is also known as daycare or pre-school in some countries.

- **free_time**: Number of days per week without any occupation.

- **number_of_absences**: Number of absences during the study period.
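
The coded values in the dictionary can be mapped to readable labels before plotting or reporting. This is a minimal sketch, assuming the columns contain exactly the codes listed above. The `*_LABELS` names and the `sample` frame are hypothetical and stand in for `df`.

```python
import pandas as pd

# Label maps taken from the data dictionary (names are illustrative).
# LE3 is read here as "less than or equal to 3".
FAMILY_SIZE_LABELS = {"GT3": "Greater than 3", "LE3": "Less than or equal to 3"}
PARENTAL_STATUS_LABELS = {"T": "Together", "A": "Apart"}
EDUCATION_LABELS = {
    1: "No formal or very basic education",
    2: "Primary education",
    3: "Secondary education",
    4: "Higher education",
}

# Made-up rows standing in for the real dataset
sample = pd.DataFrame({
    "family_size": ["GT3", "LE3", "GT3"],
    "parental_status": ["T", "A", "T"],
    "mother_education": [4, 1, 3],
})

# Replace each code with its label; the original frame is left untouched
decoded = sample.assign(
    family_size=sample["family_size"].map(FAMILY_SIZE_LABELS),
    parental_status=sample["parental_status"].map(PARENTAL_STATUS_LABELS),
    mother_education=sample["mother_education"].map(EDUCATION_LABELS),
)
print(decoded)
```

`Series.map` yields `NaN` for any code that is missing from the mapping, which makes unexpected values in the data easy to spot.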


# 2.0 - Generate Data Visualization

### 2.1 - Numerical Data

In [155]:
pio.renderers.default = "browser"

fig_numerical_data = make_subplots(rows=5, cols=1)
fig_numerical_data.add_trace(go.Box(x=df['age'], name='Age'), row=1, col=1)
fig_numerical_data.add_trace(go.Box(x=df['number_of_absences'], name='Number of Absences'), row=2, col=1)
fig_numerical_data.add_trace(go.Box(x=df['study_time'], name='Study Time'), row=3, col=1)
fig_numerical_data.add_trace(go.Box(x=df['travel_time'], name='Travel Time'), row=4, col=1)
fig_numerical_data.add_trace(go.Box(x=df['free_time'], name='Free Time'), row=5, col=1)

pio.show(fig_numerical_data)
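
Box plots conventionally mark points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically to list those points. This is a sketch under assumptions: `iqr_outliers` is a hypothetical helper, the `absences` values are made up, and pandas may compute quartiles slightly differently from Plotly, so the flagged points can differ at the margins.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up absence counts with one extreme value
absences = pd.Series([0, 2, 4, 4, 6, 8, 10, 75])
outliers = iqr_outliers(absences)
print(outliers.tolist())
```

On the real data this could be called as `iqr_outliers(df['number_of_absences'])` to see which records sit outside the whiskers.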


### 2.2 - Categorical Data

In [169]:
pio.renderers.default = "browser"

fig_categorical_data = make_subplots(rows=2, cols=5)
fig_categorical_data.add_trace(go.Bar(y=df['school'].value_counts(), x=df['school'].value_counts().index, name='School'), row=1, col=1)
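# Use the counted index for labels: value_counts() sorts by frequency, so hardcoded labels can mismatch the bars
fig_categorical_data.add_trace(go.Bar(y=df['gender'].value_counts(), x=df['gender'].value_counts().index, name='Gender'), row=1, col=2)
father_education_counts = df['father_education'].value_counts().sort_index()  # order bars by education code
fig_categorical_data.add_trace(go.Bar(y=father_education_counts, x=father_education_counts.index.astype(str), name='Father Education'), row=1, col=3)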
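
`value_counts()` orders categories by descending frequency, so hand-written axis labels line up with the bars only by coincidence. The sketch below uses made-up data to show the ordering and one way to derive labels from the counted index instead. The `codes` series and the `labels` mapping are assumptions that reuse the 1-4 education codes from the data dictionary.

```python
import pandas as pd

# Made-up education codes; code 3 is the most frequent value
codes = pd.Series([2, 2, 2, 4, 4, 1, 3, 3, 3, 3])

counts = codes.value_counts()      # sorted by frequency, not by code
print(counts.index.tolist())       # [3, 2, 4, 1]

# Sort by code so the bars follow the ordinal scale, then look up each label
ordered = counts.sort_index()
labels = {1: "None/basic", 2: "Primary", 3: "Secondary", 4: "Higher"}
x_labels = [labels[code] for code in ordered.index]
y_values = ordered.tolist()

print(x_labels)
print(y_values)
```

Passing `x_labels` and `y_values` to `go.Bar` keeps each label attached to its own count regardless of how the frequencies change.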



