## **Assignment 1 - Student Performance Prediction**
This notebook predicts each student's performance index from five factors: hours studied, previous scores, extracurricular activities, sleep hours, and sample question papers practiced. The dataset contains 10,000 student records, and the target variable is the performance index. We will go through the following tasks:

1.   Clean and process the data (handle duplicates and missing values, convert categorical columns to numerical, etc.).
2.   Split the dataset into a training set and a validation set.
3.   Train a linear regression model on the training set and evaluate its performance on the validation set.

### **Step 1: Data Cleaning and Processing**

#### **1.1: Load the Dataset**

We start by loading the dataset and displaying the first few rows to understand its structure and data types.

In [5]:
import pandas as pd

# load the dataset
df = pd.read_csv("/content/Student_Performance.csv")

print("First few rows of the dataset:")
print(df.head())

# Check the data types of the columns
print("\nData types of the columns:")
print(df.dtypes)


First few rows of the dataset:
   Hours Studied  Previous Scores Extracurricular Activities  Sleep Hours  \
0              7               99                        Yes            9   
1              4               82                         No            4   
2              8               51                        Yes            7   
3              5               52                        Yes            5   
4              7               75                         No            8   

   Sample Question Papers Practiced  Performance Index  
0                                 1               91.0  
1                                 2               65.0  
2                                 2               45.0  
3                                 2               36.0  
4                                 5               66.0  

Data types of the columns:
Hours Studied                         int64
Previous Scores                       int64
Extracurricular Activities           object
Slee

#### **1.2: Check for Duplicates**
In this step, we check for duplicate rows in the dataset. Duplicate records can distort the analysis, so it's important to remove them if found.

In [6]:
# Check for duplicate rows
duplicates = df.duplicated().sum()

# Remove duplicates if any
df.drop_duplicates(inplace=True)

print(f"Number of duplicates removed: {duplicates}")

Number of duplicates removed: 127
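
As a quick illustration of what is being counted, here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical, not taken from the dataset). By default `duplicated()` flags every repeat of an earlier row but not the first occurrence, so its sum equals the number of rows that `drop_duplicates()` removes.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical
toy = pd.DataFrame({
    "Hours Studied": [7, 4, 7, 7],
    "Previous Scores": [99, 82, 99, 99],
})

# keep="first" is the default, so the first occurrence is not flagged
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(n_duplicates)       # number of rows that repeat an earlier row
print(len(deduplicated))  # rows left after removal
```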


#### **1.3: Handle Missing Values**
Now we handle missing values. Any missing entries in numerical columns are imputed with that column's mean, which keeps every row and leaves the column means unchanged. As the output below shows, this dataset has no missing values, so the imputation changes nothing here.

In [7]:
# Check for missing values in each column
missing_values = df.isnull().sum()
print(f"Missing values in each column:\n{missing_values}")

# Fill missing values in numerical columns with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)

# Verify that no missing values remain
df.isnull().sum()


Missing values in each column:
Hours Studied                       0
Previous Scores                     0
Extracurricular Activities          0
Sleep Hours                         0
Sample Question Papers Practiced    0
Performance Index                   0
dtype: int64


Hours Studied                       0
Previous Scores                     0
Extracurricular Activities          0
Sleep Hours                         0
Sample Question Papers Practiced    0
Performance Index                   0
dtype: int64
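
Because the real dataset has nothing to impute, the effect of this step is easier to see on made-up data. The sketch below uses a hypothetical `toy` frame and the same `fillna(mean)` pattern. Note that the call only fills numeric columns, so a missing value in a text column such as "Extracurricular Activities" would be left as it is.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in numeric and text columns
toy = pd.DataFrame({
    "Sleep Hours": [8.0, np.nan, 6.0],
    "Performance Index": [90.0, 60.0, np.nan],
    "Extracurricular Activities": ["Yes", None, "No"],
})

means_before = toy.mean(numeric_only=True)
filled = toy.fillna(means_before)

print(filled)
# Mean imputation does not move the column means
print(np.allclose(filled.mean(numeric_only=True), means_before))
```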


#### **1.4: Convert Categorical Variables to Numerical**
We now convert the categorical column "Extracurricular Activities" from Yes/No to 1/0. This allows the model to interpret this data as numerical.

In [8]:
# Convert 'Extracurricular Activities' from Yes/No to 1/0
df['Extracurricular Activities'] = df['Extracurricular Activities'].apply(lambda x: 1 if x == 'Yes' else 0)

# Display the updated dataset to verify
df.head()


   Hours Studied  Previous Scores  Extracurricular Activities  Sleep Hours  \
0              7               99                           1            9
1              4               82                           0            4
2              8               51                           1            7
3              5               52                           1            5
4              7               75                           0            8

   Sample Question Papers Practiced  Performance Index
0                                 1               91.0
1                                 2               65.0
2                                 2               45.0
3                                 2               36.0
4                                 5               66.0
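
One property of the lambda above is that every value other than the exact string 'Yes' becomes 0, including misspellings or missing entries. As a hedged alternative, the sketch below (on a hypothetical `activities` series) uses an explicit mapping, which leaves unmapped values as NaN so they can be inspected before modelling.

```python
import pandas as pd

# Hypothetical toy column with one unexpected spelling
activities = pd.Series(["Yes", "No", "yes"], name="Extracurricular Activities")

# The lambda approach sends every value other than 'Yes' to 0
encoded_lambda = activities.apply(lambda x: 1 if x == "Yes" else 0)

# An explicit mapping leaves unmapped values as NaN
encoded_map = activities.map({"Yes": 1, "No": 0})
unexpected = activities[encoded_map.isnull()]

print(encoded_lambda.tolist())
print(unexpected.tolist())
```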


#### **1.5: Normalize and Standardize Numerical Features**
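Standardization and normalization put numerical features on comparable scales. This matters for regularized or distance-based models; for ordinary least squares it changes the size of the coefficients but not the predictions.

*   Standardization: rescales each column to a mean of 0 and a standard deviation of 1.
*   Normalization: rescales each column to the range [0, 1].

Both scalers below are fit on the whole of `df`, including the target column, and their results are stored in the separate frames `df_standardized` and `df_normalized`. The later steps use the unscaled `df`.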

In [10]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler_standard = StandardScaler()
df_standardized = pd.DataFrame(scaler_standard.fit_transform(df), columns=df.columns)

print("\nDataset after standardization:")
print(df_standardized.head())

# Normalization
scaler_normal = MinMaxScaler()
df_normalized = pd.DataFrame(scaler_normal.fit_transform(df), columns=df.columns)

print("\nDataset after normalization:")
print(df_normalized.head())



Dataset after standardization:
   Hours Studied  Previous Scores  Extracurricular Activities  Sleep Hours  \
0       0.775566         1.706168                    1.010078     1.454025   
1      -0.383205         0.724912                   -0.990022    -1.491315   
2       1.161822        -1.064438                    1.010078     0.275889   
3       0.003052        -1.006717                    1.010078    -0.902247   
4       0.775566         0.320865                   -0.990022     0.864957   

   Sample Question Papers Practiced  Performance Index  
0                         -1.249715           1.862979  
1                         -0.900925           0.509348  
2                         -0.900925          -0.531907  
3                         -0.900925          -1.000471  
4                          0.145444           0.561411  

Dataset after normalization:
   Hours Studied  Previous Scores  Extracurricular Activities  Sleep Hours  \
0          0.750         1.000000                
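
The two rescalings can be reproduced by hand, which makes the definitions concrete. This is a minimal sketch on a hypothetical single column `x`, not on the dataset. It relies on `StandardScaler` dividing by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy column with shape (n_samples, 1)
x = np.array([[1.0], [5.0], [7.0], [9.0]])

standardized = StandardScaler().fit_transform(x)
normalized = MinMaxScaler().fit_transform(x)

# The same results by hand
standardized_manual = (x - x.mean(axis=0)) / x.std(axis=0)
normalized_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(np.allclose(standardized, standardized_manual))
print(np.allclose(normalized, normalized_manual))
print(normalized.ravel())
```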

### **Step 2: Splitting the Dataset**
#### **2.1: Define Features and Target Variable**
Before splitting the data, we need to separate the independent variables (features) from the dependent variable (target).


*   Features: All columns except 'Performance Index'.
*   Target: The 'Performance Index' column.

In [11]:
# Define the features (independent variables) and target (dependent variable)
X = df.drop('Performance Index', axis=1)
y = df['Performance Index']

# Display the shapes of the features and target
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")


Features shape: (9873, 5)
Target shape: (9873,)


#### **2.2: Split the Data into Training and Validation Sets**
Next, we split the data into training and validation sets using an 80/20 ratio. The training set is used to fit the model, while the validation set helps evaluate its performance.

In [12]:
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets (80% training, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and validation sets
print(f"Training set size (X_train, y_train): {X_train.shape}, {y_train.shape}")
print(f"Validation set size (X_val, y_val): {X_val.shape}, {y_val.shape}")


Training set size (X_train, y_train): (7898, 5), (7898,)
Validation set size (X_val, y_val): (1975, 5), (1975,)
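
The sizes above follow from the row count: 20% of 9873 is 1974.6, which `train_test_split` rounds up to 1975 validation rows, leaving 7898 for training. The sketch below checks this on a placeholder array (`placeholder` is a stand-in, not the real data) and shows that a fixed `random_state` reproduces the same split.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 9873  # rows left after duplicate removal, from the shapes printed above
placeholder = np.arange(n_rows)  # stand-in for the real rows

train_part, val_part = train_test_split(placeholder, test_size=0.2, random_state=42)

# The validation size is rounded up; the training set gets the remainder
expected_val = math.ceil(0.2 * n_rows)
print(len(train_part), len(val_part), expected_val)

# The same random_state gives the same split on a second call
train_again, _ = train_test_split(placeholder, test_size=0.2, random_state=42)
print(np.array_equal(train_part, train_again))
```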


### **Step 3: Training the Linear Regression Model**
#### **3.1: Initialize the Linear Regression Model**
We now initialize the LinearRegression model from scikit-learn. This is the algorithm we will use to predict the students' performance index.

In [13]:
from sklearn.linear_model import LinearRegression

# Initialize the Linear Regression model
model = LinearRegression()


#### **3.2: Train the Model**
We fit the model on the training data. This means the model will learn the relationship between the features and the target variable based on the training set.

In [14]:
# Train the model using the training set
model.fit(X_train, y_train)

# Display the learned coefficients for each feature
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})
coefficients


                            Feature  Coefficient
0                     Hours Studied     2.851022
1                   Previous Scores     1.018430
2        Extracurricular Activities     0.573823
3                       Sleep Hours     0.472073
4  Sample Question Papers Practiced     0.188704
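
Each coefficient is the change in the predicted performance index for a one-unit increase in that feature with the others held fixed, and a prediction is the intercept plus the weighted sum of the features. The sketch below checks that relationship on synthetic data; the numbers are made up and unrelated to the coefficients shown above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: y = 3*x0 + 1*x1 + 5 plus a little noise
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + 5.0 + rng.normal(0, 0.1, size=200)

toy_model = LinearRegression().fit(X_toy, y_toy)

# predict() equals the intercept plus the dot product of features and coefficients
manual = X_toy @ toy_model.coef_ + toy_model.intercept_
print(np.allclose(manual, toy_model.predict(X_toy)))
print(toy_model.coef_.round(1), round(float(toy_model.intercept_), 1))
```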


#### **3.3: Predict on the Validation Set**
After training the model, we use it to make predictions on the validation set. These predictions will later be compared to the actual values to evaluate the model's performance.

In [15]:
# Make predictions on the validation set
y_pred = model.predict(X_val)

# Display the first few predictions
y_pred[:5]


array([46.48001281, 80.2853795 , 61.06518835, 22.706315  , 74.8368676 ])

#### **3.4: Calculate MSE (Mean Squared Error)**
MSE is the average of the squared differences between the predicted and actual target values. A lower MSE indicates better performance.

In [16]:
from sklearn.metrics import mean_squared_error

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_val, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")


Mean Squared Error (MSE): 4.31
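
Because MSE is in squared units, its square root (RMSE) is often easier to read: for the value reported above, the square root of 4.31 is about 2.08 index points. The sketch below computes MSE by hand on hypothetical values (`y_true` and `y_hat` are made up) and compares it with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([91.0, 65.0, 45.0, 36.0])
y_hat = np.array([90.0, 67.0, 44.0, 38.0])

# Mean of the squared errors, then its square root
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn, rmse)
```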


#### **3.5: Calculate R-squared (R²)**
R-squared (R²) measures the proportion of variance in the target variable that is explained by the model. A value close to 1 indicates a good fit.

In [17]:
from sklearn.metrics import r2_score

# Calculate R-squared (R²)
r2 = r2_score(y_val, y_pred)

print(f"R-squared (R²): {r2:.2f}")


R-squared (R²): 0.99
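
R² can be written as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. The sketch below computes it by hand on hypothetical values (`y_true` and `y_hat` are made up) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([91.0, 65.0, 45.0, 36.0])
y_hat = np.array([90.0, 67.0, 44.0, 38.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```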
