# Student Performance (Multiple Linear Regression)

About Dataset

Description:

The Student Performance Dataset is designed to examine the factors that influence students' academic performance. It consists of 10,000 student records, each containing several predictors and a performance index.

Variables:

* Hours Studied: The total number of hours each student spent studying.
* Previous Scores: The scores the student obtained in previous tests.
* Extracurricular Activities: Whether the student participates in extracurricular activities (Yes or No).
* Sleep Hours: The average number of hours the student slept per day.
* Sample Question Papers Practiced: The number of sample question papers the student practiced.

Target Variable:

* Performance Index: A measure of each student's overall academic performance, rounded to the nearest integer. The index ranges from 10 to 100, with higher values indicating better performance.

The dataset aims to provide insight into the relationship between the predictor variables and the performance index. Researchers and data analysts can use it to explore the impact of study hours, previous scores, extracurricular activities, sleep hours, and sample question papers on student performance.

Note: This dataset is synthetic and was created for illustrative purposes. The relationships between the variables and the performance index may not reflect real-world scenarios.
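
For reference, a multiple linear regression on the five predictors listed above assumes a model of the following form. The symbols $x_1, \dots, x_5$, $\beta_j$, and $\varepsilon$ are notation introduced here for illustration; they do not appear in the dataset itself.

```latex
% x_1 = Hours Studied, x_2 = Previous Scores, x_3 = Extracurricular Activities (0/1),
% x_4 = Sleep Hours, x_5 = Sample Question Papers Practiced, y = Performance Index
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon,
\qquad
\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{5} \hat{\beta}_j x_j
```

The intercept $\hat{\beta}_0$ and the weights $\hat{\beta}_j$ are chosen to minimize the sum of squared differences between $y$ and $\hat{y}$ over the training records.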

In [None]:
! pip install kagglehub pandas numpy scikit-learn matplotlib seaborn --quiet

# Load dataset

In [None]:
import pandas as pd


df = pd.read_csv("Student_Performance.csv")
df.head()

# EDA
* Check for null values and, if any are present, fill them in.


In [None]:
df.isna().sum()
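
If the null check above reports any missing values, one possible way to fill them is sketched below. The fill strategy (median for numeric columns, most frequent value for text columns) is an assumption made for this example, and `toy` is a hypothetical stand-in frame; in the notebook the same steps would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the real notebook would use `df` instead.
toy = pd.DataFrame({
    "Hours Studied": [7.0, np.nan, 5.0, 3.0],
    "Extracurricular Activities": ["Yes", "No", None, "No"],
})

# Assumed strategy: median for numeric columns, most frequent value for the rest.
numeric_cols = toy.select_dtypes(include="number").columns
text_cols = toy.columns.difference(numeric_cols)

filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
for col in text_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

# Every count should now be zero.
print(filled.isna().sum())
```

When the data is later split into train, validation, and test sets, it is generally safer to compute the fill values from the training portion only.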

In [None]:
df.describe(include='all')

Convert the `Extracurricular Activities` column from Yes/No to binary values (1 for Yes, 0 for No).

In [None]:
# Encode "Yes"/"No" as 1/0: any value other than "no" (case-insensitive) becomes 1
df['Extracurricular Activities'] = (
    df['Extracurricular Activities'].str.lower().ne('no').astype(int)
)
df.head()

In [None]:
# Split the dataset into 3 categories:
# 1. train dataset
# 2. validation dataset
# 3. test dataset

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=42)
train, val = train_test_split(train, test_size=0.25, random_state=42) # 0.25 x 0.8 = 0.2

In [None]:
train.shape, val.shape, test.shape 
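
The two-step split above can be checked on a stand-in frame. This sketch assumes only the row count stated in the dataset description (10,000 records); `demo`, `train_part`, `val_part`, and `test_part` are names made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the same number of rows as the dataset description states.
demo = pd.DataFrame({"x": np.arange(10_000)})

# First hold out 20% for testing, then 25% of the remainder for validation.
train_part, test_part = train_test_split(demo, test_size=0.2, random_state=42)
train_part, val_part = train_test_split(train_part, test_size=0.25, random_state=42)

print(len(train_part), len(val_part), len(test_part))

# The three parts should not share any rows.
overlap = (set(train_part.index) & set(val_part.index)) | \
          (set(train_part.index) & set(test_part.index)) | \
          (set(val_part.index) & set(test_part.index))
print("overlapping rows:", len(overlap))
```

With 10,000 rows this gives 6,000 training, 2,000 validation, and 2,000 test rows, that is, a 60/20/20 split.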

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns   


# Define features and target variable
features = ['Hours Studied', 'Previous Scores', 'Extracurricular Activities', 'Sleep Hours', 'Sample Question Papers Practiced']
target = 'Performance Index'
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train) 

# Make predictions
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test) 

# Evaluate the model
val_mse = mean_squared_error(y_val, y_val_pred)
val_r2 = r2_score(y_val, y_val_pred)
print(f"Validation MSE: {val_mse}, R2: {val_r2}")

test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)
print(f"Test MSE: {test_mse}, R2: {test_r2}")

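# Plot actual vs. predicted values for the validation and test sets side by side
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.scatterplot(x=y_val, y=y_val_pred, color='blue')
# Dashed diagonal marks perfect predictions
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Performance Index')
plt.ylabel('Predicted Performance Index')
plt.title('Validation Set: Actual vs Predicted')

plt.subplot(1, 2, 2)
sns.scatterplot(x=y_test, y=y_test_pred, color='green')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('Actual Performance Index')
plt.ylabel('Predicted Performance Index')
plt.title('Test Set: Actual vs Predicted')

plt.tight_layout()
plt.show()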

In [None]:
model.coef_, model.intercept_
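
The raw `coef_` array follows the column order used when fitting, so pairing it with the column names makes it easier to read. The sketch below uses synthetic stand-in data with two hypothetical features (`a` and `b`) and known weights; in the notebook the same pattern would be `pd.Series(model.coef_, index=features)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, noise-free stand-in data: y = 3*a - 2*b + 5
X_demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y_demo = 3.0 * X_demo["a"] - 2.0 * X_demo["b"] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with the feature it belongs to.
coef_table = pd.Series(demo_model.coef_, index=X_demo.columns, name="coefficient")
print(coef_table)
print("intercept:", demo_model.intercept_)
```

Each coefficient is the change in the predicted target for a one-unit increase in that feature while the other features are held fixed.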

In [None]:
model.score(X_test, y_test)

In [None]:
model.score(X_val, y_val)
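
For a `LinearRegression` model, `score` returns the coefficient of determination R², the same quantity as `r2_score`. A small sketch on synthetic stand-in data (all names here are made up for the example) computes it by hand for comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

# Synthetic stand-in data with some noise.
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([1.5, -0.7]) + rng.normal(scale=0.5, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)
pred = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2, demo_model.score(X_demo, y_demo), r2_score(y_demo, pred))
```

All three printed values should agree up to floating-point error.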

In [None]:
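# Pass a DataFrame with the training column names so scikit-learn does not warn about missing feature names
model.predict(pd.DataFrame([[5, 80, 1, 7, 10]], columns=features))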

With standardization (feature scaling)

In [None]:
# Standardize the features (zero mean, unit variance)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training set only, then reuse its statistics for validation and test
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
model_n1 = LinearRegression()
model_n1.fit(X_train_scaled, y_train)

model_n1.score(X_test_scaled, y_test)
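
For ordinary least squares with an intercept, standardizing the features changes the coefficients but not the predictions, provided the scaler statistics from the training data are reused at prediction time. The sketch below illustrates this on synthetic stand-in data, using `make_pipeline` to keep the scaler and the model together; the data and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data with features on very different scales.
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 50.0, 0.1])
y_demo = X_demo @ np.array([2.0, 0.5, -4.0]) + rng.normal(scale=0.1, size=300)
X_fit, X_new = X_demo[:200], X_demo[200:]
y_fit = y_demo[:200]

plain = LinearRegression().fit(X_fit, y_fit)

# The pipeline fits the scaler on X_fit only and reuses those statistics in predict().
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_fit, y_fit)

# Largest difference between the two models' predictions on unseen rows.
max_gap = np.max(np.abs(plain.predict(X_new) - scaled.predict(X_new)))
print(max_gap)
```

The gap should be at the level of floating-point noise, so a scaled and an unscaled linear regression are expected to score essentially the same on the test set. Scaling matters more for regularized or gradient-based models.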

In [None]:
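def get_performance_index(hours_studied, previous_scores, extracurricular_activities, sleep_hours, sample_question_papers_practiced):
    # Build a one-row DataFrame with the same column names used for training
    input_data = pd.DataFrame(
        [[hours_studied, previous_scores, extracurricular_activities, sleep_hours, sample_question_papers_practiced]],
        columns=features,
    )
    # Reuse the already fitted scaler; refitting on a single row would turn every feature into 0
    input_data_scaled = scaler.transform(input_data)
    predicted_index = model_n1.predict(input_data_scaled)
    # Round to the nearest integer, matching how the Performance Index is reported
    return int(round(predicted_index[0]))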

In [None]:
get_performance_index(10, 80, 0, 10, 10)