<a href="https://www.kaggle.com/code/yazanjian/students-performance-notebook?scriptVersionId=144872159" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Students Performance Study
In this notebook, we study the performance of a group of students based on the following [dataset](https://www.kaggle.com/datasets/spscientist/students-performance-in-exams).

The notebook is divided into sections as follows: 
1. Dataset Import: Load and read the dataset
2. Some Analysis: Study the dataset and the relations between the features.
3. Data Preprocessing: Turn the task into a classification problem by averaging the three exam scores and replacing the average with a binary label (Pass if the average is >= 65, No Pass if it is < 65). Then apply categorical encoding.
4. Model Training and Evaluation: Train and evaluate some ML models available in the sklearn library.


**Note**: This notebook uses a utility file developed by the same author, available [here](https://www.kaggle.com/code/yazanjian/utils) on Kaggle.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import utils

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA

from sklearn.preprocessing import OrdinalEncoder

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Dataset Import

In [None]:
# Read the dataframe
df = pd.read_csv('/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')
df.head()

# Some Analysis
In this section we are going to check the following:
1. The number of null values for each column
2. Visualize the frequency of each value per feature
3. Analyze any possible relation between the three exam scores
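
The first two checks can also be done with plain pandas. The following is a minimal sketch, assuming a small hypothetical frame (`toy`) in place of the real dataset; the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "gender": ["female", "male", "female", None],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
    "math score": [72, 69, np.nan, 47],
})

# 1. Number of null values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# 2. Frequency of each value in a categorical column
lunch_counts = toy["lunch"].value_counts()
print(lunch_counts)
```

On the real data, the same two calls would be applied to `df` instead of `toy`.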

In [None]:
# Print some details of the loaded df
print("The total number of records is: {} with {} features \n".format(df.shape[0], df.shape[1]))
print("Data description \n {} \n".format(df.describe()))
# df.info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()

In [None]:
# Visualize features
utils.visualize_features(df)

In [None]:
# Plot the correlation matrix between the three exams scores
correlation_columns = ['math score', 'reading score', 'writing score']
utils.plot_correlation_figure(df, correlation_columns)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a scatter plot
sns.scatterplot(x=df['reading score'], y=df['writing score'])

# Add labels and title
plt.xlabel('reading score')
plt.ylabel('writing score')
plt.title('Reading Score vs. Writing Score')

# Show the plot
plt.show()


### Analysis Discussion:
* As the previous cells show, we have **1000 records** with **no missing data**.
* We have **5 categorical features** and **3 numerical attributes** (exam results).
* The test preparation course and lunch features are imbalanced.
* Overall, there is a correlation between the three exams. Moreover, **there is a clear relationship between the writing exam score and the reading exam score**.
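
The correlation mentioned in the last point can be quantified directly with pandas. This is a minimal sketch on a hypothetical five-row frame (`toy_scores`, made-up values); it uses `DataFrame.corr`, which computes Pearson correlation by default, and is not necessarily what the author's `utils.plot_correlation_figure` does internally.

```python
import pandas as pd

# Hypothetical scores (made-up values), using the dataset's column names
toy_scores = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

# Pairwise Pearson correlation between the three exams
corr = toy_scores.corr()
print(corr.round(2))
```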

# Data Preprocessing 

In this section, we are going to do the following:
1. Convert the problem at hand into a classification problem by calculating the average of the three exams. After that, **all average results >= 65** will be **labeled as True** and **all results < 65** will be **labeled as False**.
2. Extract the target and features dfs.
3. Categorical Encoding.
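
To make the labeling rule in step 1 concrete, here is a small worked example on a hypothetical frame (`toy`, made-up scores). The name `PASS_THRESHOLD` and the intermediate `average` column are introduced only for illustration. Note that an average of exactly 65 counts as a pass.

```python
import pandas as pd

PASS_THRESHOLD = 65  # illustrative constant for the cut-off described above
score_cols = ["math score", "reading score", "writing score"]

# Hypothetical students (made-up scores), including two boundary cases
toy = pd.DataFrame(
    [[65, 65, 65], [64, 65, 65], [100, 50, 45], [30, 40, 50]],
    columns=score_cols,
)

# Row-wise mean of the three exams, then compare against the threshold
toy["average"] = toy[score_cols].mean(axis=1)
toy["Passed"] = toy["average"] >= PASS_THRESHOLD
print(toy)
```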

##### 1. Convert the problem into classification

In [None]:
df['Passed'] = df[['math score', 'reading score', 'writing score']].mean(axis=1) >= 65
df

In [None]:
# Drop the exam results and keep the Passed column
df.drop(columns=['math score', 'reading score', 'writing score'], inplace=True)
df

In [None]:
df[['Passed']].describe()

In [None]:
utils.visualize_single_feature_as_histogram(df, 'Passed')

##### 2. Extract the features and target dataframes

In [None]:
# Extract the target attribute and the features dataframe
target = df['Passed']
features = df.drop(columns=['Passed'])
features

##### 3. Categorical Encoding

In [None]:
features_encoded = pd.get_dummies(features)
features_encoded

# Model Training
In this section, we are going to train some sklearn classification models, namely Logistic Regression (LR), Support Vector Machine (SVM), and Decision Tree (DT).
Steps: 
1. Split the data into train and test splits with 80% and 20% respectively. 
2. Model training & Model evaluation
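
The two steps can be sketched as a single loop. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), `fit_and_score` is a hypothetical helper that is not part of the author's `utils` file, and `stratify=y_demo` is an addition that the cells in this notebook do not use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(model, X_tr, X_te, y_tr, y_te):
    """Hypothetical helper: fit a model and return (train accuracy, test accuracy)."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)


# Synthetic stand-in for the encoded features and the Passed target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Step 1: 80% / 20% split (stratified here so both splits keep the class ratio)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Step 2: train and evaluate each model the same way
models = {
    "LR": LogisticRegression(solver="liblinear", random_state=32),
    "DT": DecisionTreeClassifier(random_state=42),
}
results = {}
for name, model in models.items():
    results[name] = fit_and_score(model, X_tr, X_te, y_tr, y_te)
    print("{}: train = {:.3f}, test = {:.3f}".format(name, *results[name]))
```

Comparing the training and testing accuracy side by side makes it easier to spot a model that fits the training split much better than the test split.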

##### 1. Split the data into train and test splits with 80% and 20% respectively.


In [None]:
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(features_encoded, target, test_size=0.2, random_state=42)


In [None]:
# Shape of the dfs
print("Shape of X_train {}".format(X_train.shape))
print('Shape of X_test {}'.format(X_test.shape))
print('Shape of y_train {}'.format(y_train.shape))
print('Shape of y_test {}'.format(y_test.shape))

##### 2. Model training & Evaluation


**Logistic Regression**

In [None]:
#Create and fit the model on training data
LR_model = LogisticRegression(solver='liblinear', penalty='l2', random_state=32).fit(X_train, y_train)

print("The accuracy score for the training data = {}".format(LR_model.score(X_train, y_train)))
print("The accuracy score for the testing data = {}".format(LR_model.score(X_test, y_test)))

**Support Vector Machine**


In [None]:
SVC_model = SVC(kernel='poly', degree=4, random_state=42).fit(X_train, y_train)
print("The accuracy score for the training data = {}".format(SVC_model.score(X_train, y_train)))
print("The accuracy score for the testing data = {}".format(SVC_model.score(X_test, y_test)))

In [None]:
SVC_model = SVC(kernel='rbf', C=0.5, gamma='scale', random_state=42).fit(X_train, y_train)
print("The accuracy score for the training data = {}".format(SVC_model.score(X_train, y_train)))
print("The accuracy score for the testing data = {}".format(SVC_model.score(X_test, y_test)))

**Decision Tree**


In [None]:
DT_model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("The accuracy score for the training data = {}".format(DT_model.score(X_train, y_train)))
print("The accuracy score for the testing data = {}".format(DT_model.score(X_test, y_test)))

### Apply PCA before training

In [None]:
principal=PCA(n_components=5)
X_train_pca = principal.fit_transform(X_train)
X_test_pca = principal.transform(X_test)
X_train_pca.shape
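
One way to judge whether a given number of components is reasonable is to look at the cumulative explained variance. The following is a minimal sketch, assuming a random binary matrix (`X_bin`, an arbitrary 100 x 17 stand-in for one-hot features) rather than the real training data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 0/1 matrix as an arbitrary stand-in for one-hot encoded features
rng = np.random.default_rng(0)
X_bin = rng.integers(0, 2, size=(100, 17)).astype(float)

# Fit PCA and accumulate the share of variance captured by each component
pca_demo = PCA(n_components=5).fit(X_bin)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative.round(3))
```

On the notebook's data, the same attribute is available on `principal` after `fit_transform` has been called.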

**LR with PCA**

In [None]:
#Create and fit the model on training data
LR_model_pca = LogisticRegression(solver='liblinear', penalty='l2', random_state=32).fit(X_train_pca, y_train)

print("The accuracy score for the training data = {}".format(LR_model_pca.score(X_train_pca, y_train)))
print("The accuracy score for the testing data = {}".format(LR_model_pca.score(X_test_pca, y_test)))

**Support Vector Machine with PCA**


In [None]:
SVC_model_pca = SVC(kernel='rbf', C=0.5, gamma='scale', random_state=42).fit(X_train_pca, y_train)
print("The accuracy score for the training data = {}".format(SVC_model_pca.score(X_train_pca, y_train)))
print("The accuracy score for the testing data = {}".format(SVC_model_pca.score(X_test_pca, y_test)))

## Use Ordinal Encoding with PCA

1. Apply Ordinal Encoding

In [None]:
# Split the data into training and testing sets (70% train, 30% test)
X_train_oe, X_test_oe, y_train_oe, y_test_oe = train_test_split(features, target, test_size=0.3, random_state=42)

enc = OrdinalEncoder()
X_train_oe = enc.fit_transform(X_train_oe)
X_test_oe = enc.transform(X_test_oe)
X_train_oe.shape
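
To see what the encoder does, here is a minimal sketch on a hypothetical one-column frame (`toy_train`, made-up rows). With default settings, `OrdinalEncoder` sorts the categories of each column and replaces each value with its index in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-feature training frame (made-up rows)
toy_train = pd.DataFrame({"lunch": ["standard", "free/reduced", "standard"]})

enc_demo = OrdinalEncoder()
encoded = enc_demo.fit_transform(toy_train)

# Learned categories per column, in the order used for the integer codes
print(enc_demo.categories_)
# Encoded values: each category replaced by its index in categories_
print(encoded.ravel())
```

Unlike one-hot encoding, this keeps one column per feature, but it also imposes a numeric order on categories that may have no natural ordering.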

In [None]:
principal_oe=PCA(n_components=5)
X_train_oe_pca = principal_oe.fit_transform(X_train_oe)
X_test_oe_pca = principal_oe.transform(X_test_oe)
X_train_oe_pca.shape

In [None]:
#Create and fit the model on training data
LR_model_oe_pca = LogisticRegression(solver='liblinear', penalty='l1', random_state=32).fit(X_train_oe_pca, y_train_oe)

print("The accuracy score for the training data = {}".format(LR_model_oe_pca.score(X_train_oe_pca, y_train_oe)))
print("The accuracy score for the testing data = {}".format(LR_model_oe_pca.score(X_test_oe_pca, y_test_oe)))