
# **Student Dropout Analysis for School Education**

**Introduction:**
Access to quality education is a fundamental right, and governments worldwide strive to ensure every child's enrollment and completion of their schooling. However, dropout rates in schools remain a persistent challenge, often influenced by various social, economic, and demographic factors. In an effort to address this issue, the Government of Gujarat has recognized the need for a comprehensive analysis of dropout patterns at the school level. By understanding the underlying causes and identifying vulnerable groups, the government aims to formulate targeted interventions that can significantly reduce dropout rates.

This notebook presents an in-depth analysis of student dropout trends in school education, utilizing a dataset titled "Predict Students' Dropout and Academic Success - Investigating the Impact of Social and Economic Factors." The dataset, sourced from Kaggle and contributed by [thedevastator](https://www.kaggle.com/thedevastator), encompasses a wide range of attributes that shed light on the dynamics contributing to student dropout.

**Project Overview:**
The primary objective of this project is to conduct a comprehensive analysis of student dropout rates in school education, with a focus on the state of Gujarat, utilizing the available dataset titled "Predict Students' Dropout and Academic Success - Investigating the Impact of Social and Economic Factors." While the dataset does not include information on schools, areas, or castes, we can still extract valuable insights from the existing attributes.

The analysis aims to provide insights into the following key aspects:

**Demographic Analysis:** We will explore how demographic factors such as gender, age at enrollment, marital status, and nationality correlate with student dropout rates.

**Economic Factors:** Investigate the influence of economic factors, such as parental occupation, tuition fee payment status, and scholarship eligibility, on student dropout rates.

**Academic Performance:** Analyze how students' academic performance, represented by variables like curricular units and evaluations, impacts their likelihood of dropping out.

**Social and Special Needs:** Explore whether students with educational special needs or those facing unique challenges like displacement or debt are more susceptible to dropout.

**Macro-economic Factors:** Investigate how broader economic indicators like unemployment rate, inflation rate, and GDP growth relate to dropout rates, as these can indirectly affect education outcomes.

The expected outcome of this analysis is to provide valuable insights into the complex web of factors influencing student dropout. By identifying high-risk groups and understanding the nuanced factors contributing to dropout rates, the government can develop targeted interventions and policies to improve student retention and foster a conducive learning environment.

In the subsequent sections of this notebook, we will delve into data preprocessing, exploratory data analysis, and the development of predictive models to aid in the dropout analysis. While we may not have school-wise, area-wise, or caste-wise information, we will use the available attributes to contribute to the government's efforts in ensuring every child's right to education and reducing dropout rates where possible.

**About DataSet:**
This dataset provides a comprehensive view of students enrolled in various undergraduate degrees offered at a higher education institution. It includes demographic data, socio-economic factors, and academic performance information that can be used to analyze the possible predictors of student dropout and academic success. The dataset combines multiple disjoint databases containing information available at the time of enrollment, such as application mode, marital status, and course chosen. Additionally, the data can be used to estimate overall student performance at the end of each semester by assessing curricular units credited, enrolled, evaluated, and approved, as well as their respective grades. Finally, it includes the unemployment rate, inflation rate, and GDP of the region, which can help us understand how economic factors relate to student dropout or academic success. Together, these attributes can offer insight into what motivates students to stay in school or abandon their studies across a wide range of disciplines, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

**Columns:**

| **Column Name**                     | **Description**                                                                                         |
|---------------------------------|-----------------------------------------------------------------------------------------------------|
| Marital status                  | The marital status of the student. (Categorical)                                                     |
| Application mode                | The method of application used by the student. (Categorical)                                         |
| Application order               | The order in which the student applied. (Numerical)                                                  |
| Course                          | The course taken by the student. (Categorical)                                                       |
| Daytime/evening attendance      | Whether the student attends classes during the day or in the evening. (Categorical)                 |
| Previous qualification          | The qualification obtained by the student before enrolling in higher education. (Categorical)       |
| Nationality                     | The nationality of the student. (Categorical)                                                         |
| Mother's qualification          | The qualification of the student's mother. (Categorical)                                              |
| Father's qualification          | The qualification of the student's father. (Categorical)                                              |
| Mother's occupation             | The occupation of the student's mother. (Categorical)                                                 |
| Father's occupation             | The occupation of the student's father. (Categorical)                                                 |
| Displaced                       | Whether the student is a displaced person. (Categorical)                                             |
| Educational special needs       | Whether the student has any special educational needs. (Categorical)                                 |
| Debtor                          | Whether the student is a debtor. (Categorical)                                                       |
| Tuition fees up to date         | Whether the student's tuition fees are up to date. (Categorical)                                      |
| Gender                          | The gender of the student. (Categorical)                                                               |
| Scholarship holder              | Whether the student is a scholarship holder. (Categorical)                                            |
| Age at enrollment               | The age of the student at the time of enrollment. (Numerical)                                         |
| International                   | Whether the student is an international student. (Categorical)                                        |
| Curricular units 1st sem (credited) | The number of curricular units credited to the student in the first semester. (Numerical)       |
| Curricular units 1st sem (enrolled) | The number of curricular units the student enrolled in during the first semester. (Numerical)   |
| Curricular units 1st sem (evaluations) | The number of evaluations the student took in curricular units during the first semester. (Numerical) |
| Curricular units 1st sem (approved) | The number of curricular units the student passed (approved) in the first semester. (Numerical) |


<div style="color:white;display:fill;border-radius:8px;
            background-color:#03112A;font-size:150%;
            letter-spacing:1.0px">
    <p style="padding: 8px;color:white;"><b><b><span style='color:#FFFFFF'></span></b> 1. Importing Required Libraries </b></p>
</div>

In [None]:
import pandas as pd
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn import svm


from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import VotingClassifier


In [None]:
data = pd.read_csv("/kaggle/input/higher-education-predictors-of-student-retention/dataset.csv")
data.head()

In [None]:
data.info()

<div style="color:white;display:fill;border-radius:8px;
            background-color:#03112A;font-size:150%;
            letter-spacing:1.0px">
    <p style="padding: 8px;color:white;"><b><b><span style='color:#FFFFFF'></span></b> 2. Data Preprocessing </b></p>
</div>

In [None]:
data.rename(columns = {'Nacionality':'Nationality', 'Age at enrollment':'Age'}, inplace = True)

Let's check whether there are any null values in this dataset.

In [None]:
data.isnull().sum()/len(data)*100

No column contains null values, which is good news!
Two steps remain before moving on to the EDA part:
* Encoding the target column (it is the only non-numeric field in the dataset)
* Feature engineering (keeping only the relevant features to feed our model)

In [None]:
print(data["Target"].unique())

So there are 3 unique values in the target column, which we can replace with:
* Dropout - 0
* Enrolled - 1
* Graduate - 2

In [None]:
data['Target'] = data['Target'].map({
    'Dropout':0,
    'Enrolled':1,
    'Graduate':2
})

In [None]:
print(data["Target"].unique())

Because the target has only a few unique values, `map()` is enough here; with many categories, consider using scikit-learn's `LabelEncoder` instead.
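
As a minimal sketch of the `LabelEncoder` alternative mentioned above: the toy `target` series below is illustrative only, not the notebook's data. Note that `LabelEncoder` assigns codes in sorted order of the labels, which here happens to match the manual mapping (Dropout 0, Enrolled 1, Graduate 2).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the notebook's Target column (values are illustrative).
target = pd.Series(["Graduate", "Dropout", "Enrolled", "Dropout"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(target)

# LabelEncoder assigns integer codes in sorted order of the class labels.
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}
print(label_to_code)
print(encoded.tolist())
```

With `LabelEncoder`, `encoder.inverse_transform(...)` can later recover the original labels, which is convenient when reporting predictions.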

That completes the first part. Let's move on to the next one.
For the next part we have to:
1. Find how the features are correlated with the Target
2. Remove unwanted or irrelevant features from the data

In [None]:
data.corr()['Target']

In [None]:
plt.figure(figsize=(30, 30))
sns.heatmap(data.corr() , annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

To decide which columns to remove based on low correlation with the target variable ('Target'), you can set a correlation threshold and remove columns with correlations below that threshold. Here, I'll suggest removing columns with an absolute correlation coefficient less than 0.1 (you can adjust this threshold as needed). Here are some columns to consider removing:

* **Nationality:** Since its correlation is very close to zero **(-0.004740)**, it may not have a significant impact on the target variable.

* **Mother's qualification:** With a correlation of **-0.038346**, it appears to have a weak relationship with the target variable.

* **Father's qualification:** Similarly, with a correlation of **0.000329**, it seems to have little influence on the target variable.

* **Educational special needs:** This column has a low correlation of **-0.007353**, suggesting it may not strongly affect the target variable.

* **International:** With a correlation of **0.003934**, this column has minimal impact on the target variable.

* **Curricular units 1st sem (without evaluations):** It has a correlation of **-0.068702**, which is relatively low compared to other columns related to curricular units.

* **Unemployment rate:** This column's correlation of **0.008627** indicates a weak relationship with the target variable.

* **Inflation rate:** With a correlation of **-0.026874**, it has a relatively low impact on the target variable.

These columns have low absolute correlation values and may not provide significant predictive power for your target variable 'Target.' However, before removing them, consider the context of your analysis and whether these columns may have any theoretical significance or could be useful in a broader context. Additionally, it's a good practice to run your analysis both with and without these columns to see if they make a meaningful difference in your model's performance.
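
The threshold rule described above can be sketched as a small helper. This is only an illustration: `low_correlation_columns` is a hypothetical function name, and the `toy` DataFrame is made-up data rather than the notebook's dataset.

```python
import pandas as pd

def low_correlation_columns(df, target, threshold=0.1):
    """Return feature columns whose absolute Pearson correlation
    with `target` is below `threshold`."""
    # Correlation of every column with the target, excluding the target itself.
    corr = df.corr()[target].drop(target)
    return corr[corr.abs() < threshold].index.tolist()

# Made-up data: `signal` tracks the target, `noise` is uncorrelated with it.
toy = pd.DataFrame({
    "signal": [1, 2, 3, 4, 5, 6],
    "noise": [1, -1, -1, 1, 1, -1],
    "Target": [0, 0, 1, 1, 2, 2],
})

print(low_correlation_columns(toy, "Target"))
```

Keep in mind that Pearson correlation only measures linear association, and the category codes in this dataset are arbitrary integers, so a low value does not by itself prove that a column is uninformative.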

In [None]:
new_data = data.copy()
new_data = new_data.drop(columns=['Nationality',
                                  'Mother\'s qualification',
                                  'Father\'s qualification',
                                  'Educational special needs',
                                  'International',
                                  'Curricular units 1st sem (without evaluations)',
                                  'Unemployment rate',
                                  'Inflation rate'], axis=1)
new_data.info()
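
The earlier advice to compare results with and without the dropped columns could be sketched as follows. Everything here is an assumption for illustration: the data is synthetic (`make_classification`), the column names `f0`..`f5` are placeholders, and `compare_with_without` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the notebook's features.
# With shuffle=False the first 3 columns are informative and the rest are noise.
X_arr, y_toy = make_classification(n_samples=300, n_features=6, n_informative=3,
                                   n_redundant=0, n_classes=3, shuffle=False,
                                   random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

def compare_with_without(X, y, drop_cols, cv=5):
    """Mean cross-validated accuracy with all columns vs. without `drop_cols`."""
    model = DecisionTreeClassifier(random_state=0)
    full = cross_val_score(model, X, y, cv=cv).mean()
    reduced = cross_val_score(model, X.drop(columns=drop_cols), y, cv=cv).mean()
    return {"all_columns": full, "reduced": reduced}

scores = compare_with_without(X_toy, y_toy, drop_cols=["f3", "f4", "f5"])
print(scores)
```

In the notebook itself, the same idea would apply to `data` versus `new_data`; cross-validation is used here rather than a single split so that the comparison is less sensitive to one particular train/test partition.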

Let's move on to the EDA part.

<div style="color:white;display:fill;border-radius:8px;
            background-color:#03112A;font-size:150%;
            letter-spacing:1.0px">
    <p style="padding: 8px;color:white;"><b><b><span style='color:#FFFFFF'></span></b> 3. Exploratory Data Analysis </b></p>
</div>

Let's see how many dropouts, enrolled students, and graduates there are in the Target column.

In [None]:
new_data['Target'].value_counts()

In [None]:
# Map the numeric codes back to names so each slice is labelled by its own class,
# independent of the order in which value_counts() returns them.
target_names = {0: 'Dropout', 1: 'Enrolled', 2: 'Graduate'}
counts = new_data['Target'].value_counts()

df = pd.DataFrame({
    'Target': counts.index.map(target_names),
    'Count_T': counts.values
})

fig = px.pie(df,
             names='Target',
             values='Count_T',
             title='Number of dropouts, enrolled students, and graduates in the Target column')

fig.update_traces(hole=0.4, textinfo='value+label', pull=[0, 0.2, 0.1])
fig.show()

Let's plot the Top 10 Features with Highest Correlation to Target

In [None]:
# Drop the target's correlation with itself (always 1.0) before ranking features.
correlations = data.corr()['Target'].drop('Target')
top_10_features = correlations.abs().nlargest(10).index
top_10_corr_values = correlations[top_10_features]

plt.figure(figsize=(10, 11))
plt.bar(top_10_features, top_10_corr_values)
plt.xlabel('Features')
plt.ylabel('Correlation with Target')
plt.title('Top 10 Features with Highest Correlation to Target')
plt.xticks(rotation=45)
plt.show()

Distribution of age of students at the time of enrollment

In [None]:
px.histogram(new_data, x='Age', color_discrete_sequence=['lightblue'])

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Target', y='Age', data=new_data)
plt.xlabel('Target')
plt.ylabel('Age')
plt.title('Relationship between Age and Target')
plt.show()

In [None]:
X = new_data.drop('Target', axis=1)
y = new_data['Target']

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#03112A;font-size:150%;
            letter-spacing:1.0px">
    <p style="padding: 8px;color:white;"><b><b><span style='color:#FFFFFF'></span></b> 4. Building Models </b></p>
</div>

In [None]:
dtree = DecisionTreeClassifier(random_state=0)
rfc = RandomForestClassifier(random_state=2)
lr = LogisticRegression(random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
abc = AdaBoostClassifier(n_estimators=50,learning_rate=1, random_state=0)
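# 'gpu_hist' needs a CUDA-enabled XGBoost build; 'hist' is the CPU equivalent.
xbc = XGBClassifier(tree_method='hist')
# Import SVC directly: `svm.SVC(...)` would fail when this cell is re-run,
# because the name `svm` is rebound from the module to the estimator below.
from sklearn.svm import SVC
svm = SVC(kernel='linear', probability=True)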

In [None]:
dtree.fit(X_train,y_train)
rfc.fit(X_train,y_train)
lr.fit(X_train,y_train)
knn.fit(X_train,y_train)
abc.fit(X_train, y_train)
xbc.fit(X_train, y_train)
svm.fit(X_train, y_train)

In [None]:
y_pred = dtree.predict(X_test)
print("Accuracy :",round(accuracy_score(y_test,y_pred)*100,2),"%")

In [None]:
y_pred = rfc.predict(X_test)
print("Accuracy :",round(accuracy_score(y_test,y_pred)*100,2),"%")

In [None]:
y_pred = lr.predict(X_test)
print("Accuracy :",round(accuracy_score(y_test,y_pred)*100,2),"%")

In [None]:
y_pred = knn.predict(X_test)
print("Accuracy :",round(accuracy_score(y_test,y_pred)*100,2),"%")

In [None]:
y_pred = abc.predict(X_test)
print("Accuracy :",round(accuracy_score(y_test,y_pred)*100,2),"%")

In [None]:
y_pred = xbc.predict(X_test)
print("Accuracy :",round(accuracy_score(y_test,y_pred)*100,2),"%")

In [None]:
y_pred = svm.predict(X_test)
print("Accuracy :",round(accuracy_score(y_test,y_pred)*100,2),"%")

Let's try to improve accuracy using an ensemble voting classifier.

In [None]:
#param_grid = {
#    'bootstrap': [False,True],
#    'max_depth': [5,8,10, 20],
#    'max_features': [3, 4, 5, None],
#    'min_samples_split': [2, 10, 12],
#    'n_estimators': [100, 200, 300]
#}

#clf = GridSearchCV(estimator = rfc, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1)

#clf.fit(X_train,y_train)
#y_pred = clf.predict(X_test)
#print("Accuracy: ",accuracy_score(y_test,y_pred))
#print(clf.best_params_)
#print(clf.best_estimator_)

In [None]:
ens1 = VotingClassifier(estimators=[('rfc', rfc), ('lr', lr), ('abc',abc), ('xbc',xbc)], voting='soft')
ens1.fit(X_train, y_train)

y_pred = ens1.predict(X_test)
print("Accuracy :",round(accuracy_score(y_test,y_pred)*100,2),"%")

In [None]:
ens2 = VotingClassifier(estimators=[('rfc', rfc), ('lr', lr), ('abc',abc), ('xbc',xbc)], voting='hard')
ens2.fit(X_train, y_train)

y_pred = ens2.predict(X_test)
print("Accuracy :",round(accuracy_score(y_test,y_pred)*100,2),"%")
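
To illustrate the difference between the `voting='soft'` and `voting='hard'` ensembles above, here is a small NumPy sketch for a single sample. The probabilities are invented for illustration and do not come from the fitted models; the point is only that majority voting and probability averaging can disagree.

```python
import numpy as np

# Hypothetical predicted probabilities from three models for ONE sample,
# columns ordered as (Dropout, Enrolled, Graduate).
probas = np.array([
    [0.51, 0.49, 0.00],  # model A narrowly prefers Dropout
    [0.51, 0.49, 0.00],  # model B narrowly prefers Dropout
    [0.10, 0.90, 0.00],  # model C strongly prefers Enrolled
])

# Hard voting: each model casts one vote for its most likely class.
hard_votes = probas.argmax(axis=1)
hard_winner = int(np.bincount(hard_votes, minlength=probas.shape[1]).argmax())

# Soft voting: average the probabilities, then pick the most likely class.
soft_winner = int(probas.mean(axis=0).argmax())

print("hard voting picks class", hard_winner)
print("soft voting picks class", soft_winner)
```

Soft voting lets a confident model outweigh two lukewarm ones, which is why it requires every estimator to expose `predict_proba`.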

There is still work to be done; I am leaving it here for further development.