**Rules for the code:**

- Include all the code you used for your report in this file. The code for any section in the report should go under the same section in this file.
- Any missing code will result in -20% from its corresponding section in the report.
- Any irrelevant code will result in -20% from its corresponding section in the report.
- Make sure that you run your code before rendering, so all the necessary visual/numeric outputs are visible.
- Any code that is not properly run or throws errors will be considered missing/irrelevant.

## 3) Data

In [52]:
import pandas as pd

In [54]:
data = pd.read_csv('StudentPerformanceFactors.csv')
data

| | Hours_Studied | Attendance | Parental_Involvement | Access_to_Resources | Extracurricular_Activities | Sleep_Hours | Previous_Scores | Motivation_Level | Internet_Access | Tutoring_Sessions | Family_Income | Teacher_Quality | School_Type | Peer_Influence | Physical_Activity | Learning_Disabilities | Parental_Education_Level | Distance_from_Home | Gender | Exam_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 84 | Low | High | No | 7 | 73 | Low | Yes | 0 | Low | Medium | Public | Positive | 3 | No | High School | Near | Male | 67 |
| 1 | 19 | 64 | Low | Medium | No | 8 | 59 | Low | Yes | 2 | Medium | Medium | Public | Negative | 4 | No | College | Moderate | Female | 61 |
| 2 | 24 | 98 | Medium | Medium | Yes | 7 | 91 | Medium | Yes | 2 | Medium | Medium | Public | Neutral | 4 | No | Postgraduate | Near | Male | 74 |
| 3 | 29 | 89 | Low | Medium | Yes | 8 | 98 | Medium | Yes | 1 | Medium | Medium | Public | Negative | 4 | No | High School | Moderate | Male | 71 |
| 4 | 19 | 92 | Medium | Medium | Yes | 6 | 65 | Medium | Yes | 3 | Medium | High | Public | Neutral | 4 | No | College | Near | Female | 70 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6602 | 25 | 69 | High | Medium | No | 7 | 76 | Medium | Yes | 1 | High | Medium | Public | Positive | 2 | No | High School | Near | Female | 68 |
| 6603 | 23 | 76 | High | Medium | No | 8 | 81 | Medium | Yes | 3 | Low | High | Public | Positive | 2 | No | High School | Near | Female | 69 |
| 6604 | 20 | 90 | Medium | Low | Yes | 6 | 65 | Low | Yes | 3 | Low | Medium | Public | Negative | 2 | No | Postgraduate | Near | Female | 68 |
| 6605 | 10 | 86 | High | High | Yes | 6 | 91 | High | Yes | 2 | Low | Medium | Private | Positive | 3 | No | High School | Far | Female | 68 |


## 5) Data Cleaning 

### a) Cleaning 1
*By Amir*

In [58]:
# check for missing values
print(data.isnull().sum())

Hours_Studied                  0
Attendance                     0
Parental_Involvement           0
Access_to_Resources            0
Extracurricular_Activities     0
Sleep_Hours                    0
Previous_Scores                0
Motivation_Level               0
Internet_Access                0
Tutoring_Sessions              0
Family_Income                  0
Teacher_Quality               78
School_Type                    0
Peer_Influence                 0
Physical_Activity              0
Learning_Disabilities          0
Parental_Education_Level      90
Distance_from_Home            67
Gender                         0
Exam_Score                     0
dtype: int64
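
The counts above show which columns need imputation. As a minimal sketch, assuming a small made-up frame (`toy`) in place of the real data, the list of affected columns can be derived from the `isnull()` counts instead of being typed by hand, which avoids naming a column that has no missing values:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Teacher_Quality": ["High", None, "Medium", "Medium"],
    "Distance_from_Home": ["Near", "Near", None, "Far"],
    "Exam_Score": [67, 61, 74, 71],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Fill each affected column with its most frequent value (the mode).
# Assigning the result back avoids pandas' chained-assignment warning.
for col in columns_with_missing:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(columns_with_missing)
print(toy.isnull().sum().sum())
```

The same pattern should carry over to `data` unchanged, since it relies only on `isnull`, `fillna`, and `mode`.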


In [60]:
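# impute missing categorical data with the column mode
# (these are the three columns with missing values in the isnull() check above)
columns_with_missing = ['Teacher_Quality', 'Parental_Education_Level', 'Distance_from_Home']
for col in columns_with_missing:
    # assign the result back; chained fillna(..., inplace=True) triggers a pandas warning
    data[col] = data[col].fillna(data[col].mode()[0])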

# one-hot encoding to retain all levels for multi-class categorical cols
multi_class_categorical_columns = ['Parental_Involvement', 'Access_to_Resources', 
                                   'Family_Income', 'Teacher_Quality', 'School_Type']
data = pd.get_dummies(data, columns=multi_class_categorical_columns)

# correctly format numeric cols
numeric_columns = ['Hours_Studied', 'Attendance', 'Sleep_Hours', 'Previous_Scores', 
                   'Tutoring_Sessions', 'Physical_Activity', 'Exam_Score']
data[numeric_columns] = data[numeric_columns].apply(pd.to_numeric, errors='coerce')

# drop irrelevant cols
irrelevant_columns = ['Internet_Access', 'Extracurricular_Activities', 
                      'Learning_Disabilities', 'Parental_Education_Level', 
                      'Gender']
data = data.drop(columns=irrelevant_columns)

data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 24 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Hours_Studied                6607 non-null   int64 
 1   Attendance                   6607 non-null   int64 
 2   Sleep_Hours                  6607 non-null   int64 
 3   Previous_Scores              6607 non-null   int64 
 4   Motivation_Level             6607 non-null   object
 5   Tutoring_Sessions            6607 non-null   int64 
 6   Peer_Influence               6607 non-null   object
 7   Physical_Activity            6607 non-null   int64 
 8   Distance_from_Home           6607 non-null   object
 9   Exam_Score                   6607 non-null   int64 
 10  Parental_Involvement_High    6607 non-null   bool  
 11  Parental_Involvement_Low     6607 non-null   bool  
 12  Parental_Involvement_Medium  6607 non-null   bool  
 13  Access_to_Resources_High     6607


| | Hours_Studied | Attendance | Sleep_Hours | Previous_Scores | Motivation_Level | Tutoring_Sessions | Peer_Influence | Physical_Activity | Distance_from_Home | Exam_Score | ... | Access_to_Resources_Low | Access_to_Resources_Medium | Family_Income_High | Family_Income_Low | Family_Income_Medium | Teacher_Quality_High | Teacher_Quality_Low | Teacher_Quality_Medium | School_Type_Private | School_Type_Public |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 84 | 7 | 73 | Low | 0 | Positive | 3 | Near | 67 | ... | False | False | False | True | False | False | False | True | False | True |
| 1 | 19 | 64 | 8 | 59 | Low | 2 | Negative | 4 | Moderate | 61 | ... | False | True | False | False | True | False | False | True | False | True |
| 2 | 24 | 98 | 7 | 91 | Medium | 2 | Neutral | 4 | Near | 74 | ... | False | True | False | False | True | False | False | True | False | True |
| 3 | 29 | 89 | 8 | 98 | Medium | 1 | Negative | 4 | Moderate | 71 | ... | False | True | False | False | True | False | False | True | False | True |
| 4 | 19 | 92 | 6 | 65 | Medium | 3 | Neutral | 4 | Near | 70 | ... | False | True | False | False | True | True | False | False | False | True |
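
The cleaning step keeps one indicator column per level of each categorical variable. The sketch below, on a made-up single-column frame, contrasts that full encoding with a reference-level encoding using the `drop_first` option of `pd.get_dummies`. It is an illustration of the trade-off, not a change to the cleaning above: in the full encoding the indicators of one variable always sum to 1, so in a linear model with an intercept the individual coefficients are not uniquely determined, although the fitted values should not change.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"Family_Income": ["Low", "Medium", "High", "Medium"]})

# Full one-hot encoding: one indicator column per level.
full = pd.get_dummies(toy, columns=["Family_Income"])

# Reference-level encoding: drop the first level (alphabetically "High"),
# so the remaining columns are read relative to that baseline.
reduced = pd.get_dummies(toy, columns=["Family_Income"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding, the indicators of one variable sum to 1 in every row.
print(full.sum(axis=1).tolist())
```

Keeping all levels is convenient for reading the columns directly. Dropping one level per variable is the usual choice when the regression coefficients themselves are to be interpreted.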


### b) Cleaning 2
*By \<Name of team member>*

### c) Cleaning 3
*By \<Name of team member>*

### d) Cleaning 4
*By \<Name of team member>*

## 6) Data Analysis

### a) Analysis 1
*By Amir*

In [63]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# split into Individual Effort and External Factors
individual_effort_factors = ['Hours_Studied', 'Attendance', 'Sleep_Hours']
external_factors = ['Parental_Involvement_High', 'Parental_Involvement_Medium', 'Parental_Involvement_Low',
                    'Access_to_Resources_High', 'Access_to_Resources_Medium', 'Access_to_Resources_Low',
                    'Family_Income_High', 'Family_Income_Medium', 'Family_Income_Low',
                    'Teacher_Quality_High', 'Teacher_Quality_Medium', 'Teacher_Quality_Low',
                    'School_Type_Private', 'School_Type_Public']

target = 'Exam_Score'
# feature matrices for each factor group, and the shared target
X_individual = data[individual_effort_factors]
X_external = data[external_factors]
y = data[target]

# a single split call keeps the same rows in train/test for both feature sets
X_indiv_train, X_indiv_test, X_ext_train, X_ext_test, y_train, y_test = train_test_split(
    X_individual, X_external, y, test_size=0.2, random_state=42)

# train one linear regression model per factor group
model_individual = LinearRegression()
model_individual.fit(X_indiv_train, y_train)

model_external = LinearRegression()
model_external.fit(X_ext_train, y_train)

# predictions
y_pred_individual = model_individual.predict(X_indiv_test)
y_pred_external = model_external.predict(X_ext_test)

# calculate R-squared values
r2_individual = r2_score(y_test, y_pred_individual)
r2_external = r2_score(y_test, y_pred_external)

# display R-squared values
r2_individual, r2_external

(0.5884110279740731, 0.06074380079456099)
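
To make the two numbers above easier to read, here is a minimal sketch of what R-squared measures, computed by hand on made-up scores and predictions (`y_true` and `y_hat` are illustrative and not taken from the dataset). It uses the standard definition, 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the quantity `r2_score` is documented to return.

```python
import numpy as np

# Made-up exam scores and model predictions, for illustration only.
y_true = np.array([67.0, 61.0, 74.0, 71.0, 70.0])
y_hat = np.array([66.0, 63.0, 72.0, 70.0, 69.0])

# Variation the model leaves unexplained (squared prediction errors).
ss_res = np.sum((y_true - y_hat) ** 2)

# Total variation of the scores around their mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# Share of the total variation that the predictions account for.
r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual)
```

Read this way, a test-set value near 0.59 means the model accounts for roughly 59% of the variation in exam scores, and a value near 0.06 means about 6%.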

### b) Analysis 2
*By \<Name of team member>*

### c) Analysis 3
*By \<Name of team member>*

### d) Analysis 4
*By \<Name of team member>*