### The dataset `student-mat.csv`, taken from Kaggle (https://www.kaggle.com/datasets/ishandutta/student-performance-data-set), contains performance data for students in secondary education at two Portuguese schools. It comprises information about students' academic performance, as well as their demographic, social, and school-related characteristics, and was gathered through a combination of school records and surveys/questionnaires. You can find more detail about the dataset's attributes in the student.txt file.

### In this lab, our main objectives are:
### 1) Identify the attributes that most influence the students' final academic scores (G3).
### 2) Use these attributes to predict the student's final academic scores.

#### Step 1: Import all the necessary libraries. All your imports should be here and here only (2 marks)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

#### Step 2: Read the CSV file and display the first few rows (2 marks)

In [2]:
df = pd.read_csv("student-mat.csv", sep=';')

In [3]:
df.head()

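| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |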


#### Step 3: Get to know your dataset. Check the total number of rows and columns, the column names, datatypes, basic statistics, and missing values. How many of the columns are categorical? Write a short report of your observations at the end. (5 marks + 1 mark + 1 mark)

In [4]:
print(f"Total rows: {df.shape[0]}")
print(f"Total columns: {df.shape[1]}")

Total rows: 395
Total columns: 33


In [5]:
# Display the column names
print(f"Column names: {df.columns.tolist()}")

Column names: ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']


In [6]:
# Display datatypes
df.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

In [7]:
# Basic statistics
df.describe()

| | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 16.696203 | 2.749367 | 2.521519 | 1.448101 | 2.035443 | 0.334177 | 3.944304 | 3.235443 | 3.108861 | 1.481013 | 2.291139 | 3.55443 | 5.708861 | 10.908861 | 10.713924 | 10.41519 |
| std | 1.276043 | 1.094735 | 1.088201 | 0.697505 | 0.83924 | 0.743651 | 0.896659 | 0.998862 | 1.113278 | 0.890741 | 1.287897 | 1.390303 | 8.003096 | 3.319195 | 3.761505 | 4.581443 |
| min | 15.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 25% | 16.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 0.0 | 8.0 | 9.0 | 8.0 |
| 50% | 17.0 | 3.0 | 2.0 | 1.0 | 2.0 | 0.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 4.0 | 4.0 | 11.0 | 11.0 | 11.0 |
| 75% | 18.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 4.0 | 4.0 | 2.0 | 3.0 | 5.0 | 8.0 | 13.0 | 13.0 | 14.0 |
| max | 22.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 75.0 | 19.0 | 19.0 | 20.0 |


In [8]:
df.describe(include=['object'])

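| | school | sex | address | famsize | Pstatus | Mjob | Fjob | reason | guardian | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 |
| unique | 2 | 2 | 2 | 2 | 2 | 5 | 5 | 4 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| top | GP | F | U | GT3 | T | other | other | course | mother | no | yes | no | yes | yes | yes | yes | no |
| freq | 349 | 208 | 307 | 281 | 354 | 141 | 217 | 145 | 273 | 344 | 242 | 214 | 201 | 314 | 375 | 329 | 263 |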


In [9]:
# Check for missing values
print(df.isnull().sum())

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64


In [11]:
# Check for categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns
print(f"Categorical columns: {categorical_columns.tolist()}")
print(f"Number of categorical columns: {len(categorical_columns)}")

Categorical columns: ['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']
Number of categorical columns: 17


In [12]:
"""
The dataset consists of 395 rows and 33 columns: 16 numerical (int64) columns and 17 categorical (object) columns. No column has missing values, so no imputation is needed.
"""

'\nThe dataset consists of 395 rows and 33 columns: 16 numerical (int64) columns and 17 categorical (object) columns. No column has missing values, so no imputation is needed.\n'

#### Step 4: Let's go back to the code where we printed the basic statistics. What can you observe from that data? Do you notice anything wrong there? Does anything change in the categorical columns? (5 marks)

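Analyze here:
"absences" has a max value of 75, while its 75th percentile is only 8 and its mean is about 5.7. The distribution is strongly right-skewed, which is a sign of outliers.
"G2" and "G3" have a minimum of 0 although "G1" has a minimum of 3, so the zero grades are worth checking before modelling.
The other numerical columns stay within narrow ranges and show comparatively little skew.
Does anything change in the categorical columns? No. For those columns describe() reports only count, unique, top, and freq; all 17 have 395 entries and between 2 and 5 unique values.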

#### Step 5: Fix missing data if any. Explain your reasoning. (1 mark)

In [13]:
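# Step 3 found no missing values, so there is nothing to impute.
# Instead, cap the extreme `absences` values noted in Step 4 at the 75th percentile (8).
df['absences'] = df['absences'].clip(upper=8)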
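
Capping at 8 rewrites every value above the 75th percentile, which can touch up to a quarter of the rows. A gentler alternative is the common 1.5 × IQR rule; with the quartiles reported in Step 3 (0 and 8) it would place the cap at 20. The sketch below is not part of the original notebook: the helper name `iqr_upper_fence` and the sample values are made up for illustration.

```python
import pandas as pd


def iqr_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, a common threshold for flagging high outliers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return float(q3 + k * (q3 - q1))


# Made-up absence counts with one extreme value (not the real dataset)
absences = pd.Series([0, 0, 2, 4, 4, 6, 8, 8, 10, 75], dtype=float)

fence = iqr_upper_fence(absences)
capped = absences.clip(upper=fence)  # only values above the fence change

print(f"Upper fence: {fence}")
print(f"Values changed: {(capped != absences).sum()}")
```

Whether to cap at all, and where, is a modelling choice; comparing cross-validated error with and without the cap is one way to decide.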

#### Step 6: Now comes the main part. You want to find out how these attributes (columns) are related to the student's final grade (G3). Use appropriate methods to find the factors influencing the final grade. Write a report at the very end explaining how they correlate with and influence G3. (10 marks)
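
One possible starting point for this step is to rank the numerical columns by their Pearson correlation with G3. The sketch below is an assumption about how that could look, not the required solution: the helper `correlations_with_target` is hypothetical and the rows are made up so that the block runs on its own.

```python
import pandas as pd


def correlations_with_target(frame: pd.DataFrame, target: str = "G3") -> pd.Series:
    """Pearson correlation of every numeric column with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")  # object columns are skipped
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Made-up rows that only mimic the column layout of the dataset
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS", "GP", "MS", "GP"],
    "G2": [5, 8, 10, 12, 14, 16],
    "failures": [2, 1, 1, 0, 0, 0],
    "G3": [6, 8, 10, 13, 14, 17],
})

ranked = correlations_with_target(toy)
print(ranked)
```

On the real data the same call would be `correlations_with_target(df)`. Categorical columns are ignored here; they would need encoding first, or a comparison of mean G3 per category with `groupby`.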

#### Step 7: Create a Linear Regression model and a Random Forest Regressor model using scikit-learn. Also define a KFold with n_splits=10. (3 marks)
1. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
2. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
3. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

In [None]:
# Every other name used below is already imported in Step 1
from sklearn.linear_model import LinearRegression

# Initialize the models (fixed seed so the random forest is reproducible)
linear_model = LinearRegression()
random_forest_model = RandomForestRegressor(random_state=42)

# Define the KFold cross-validator; shuffling avoids folds that follow the file order
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Lists to collect the per-fold MSE of each model
linear_mse = []
random_forest_mse = []


#### Step 8: You have selected the attributes that influence G3 in Step 6. You will use them to predict G3; however, you cannot simply feed them to your regression model. You will need to transform your dataset into an appropriate format. Complete the following steps:
1. Define X and y  
2. Split them into train and test datasets in an 80:20 ratio
3. Scale the data using a standard scaler if necessary
4. Encode your categorical variables using a suitable encoding method (One-Hot Encoder or Label Encoder)
#### (4 marks)

In [17]:

# X holds the features and y the target (G3), both taken from the DataFrame `df`.
# Keep X as a DataFrame so the categorical columns can still be encoded by name.
X = df.drop(columns='G3')
y = df['G3']
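
The cell above covers only item 1 of the list. A minimal sketch of items 2 to 4 follows, assuming one-hot encoding is chosen and using scikit-learn's `ColumnTransformer` to scale and encode in one pass. The rows and the variable names (`toy`, `preprocess`, `X_train_ready`) are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; the real notebook would use the features chosen in Step 6
toy = pd.DataFrame({
    "school": ["GP", "MS"] * 5,
    "G2": [5, 8, 10, 12, 14, 16, 9, 11, 13, 7],
    "absences": [0, 2, 4, 6, 8, 1, 3, 5, 7, 0],
    "G3": [6, 8, 10, 13, 14, 17, 9, 11, 14, 7],
})

X = toy.drop(columns="G3")
y = toy["G3"]

# 80:20 split, done before fitting any transformer to avoid leaking test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

preprocess = ColumnTransformer(
    [
        ("scale", StandardScaler(), numeric_cols),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows only, then reuse the fitted transformer on the test rows
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)

print(X_train_ready.shape, X_test_ready.shape)
```

Scaling matters mainly for the linear model; tree-based models such as a random forest are largely insensitive to it.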

#### Step 9: Before we start training our model, we need to find the better of the two models we have defined. To do this, evaluate the linear regression model and the random forest regressor model using cross-validation. Perform cross-validation with an appropriate scoring criterion and select the best model. (Optional: You can define and add more models for evaluation.) (5 marks)

In [None]:
#### You should print the average score for each model
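
A minimal sketch of such a comparison, assuming mean squared error as the scoring criterion. The data comes from `make_regression` purely so the block runs on its own; in the notebook the prepared features and target from Step 8 would take its place.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared features and target
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}
kf = KFold(n_splits=10, shuffle=True, random_state=0)

average_mse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MSE is reported negated
    scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    average_mse[name] = -scores.mean()
    print(f"{name}: average MSE = {average_mse[name]:.3f}")

best_model_name = min(average_mse, key=average_mse.get)
print(f"Best model: {best_model_name}")
```

Other criteria such as `"r2"` or `"neg_mean_absolute_error"` can be passed to `scoring` in the same way.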

#### Step 10: Now use GridSearchCV to find the best combination of hyperparameters for your model. Print out the best estimator. (5 marks)
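
A minimal sketch of a grid search, assuming the random forest was the model selected in Step 9. The grid values are illustrative rather than tuned recommendations, the data is synthetic, and 5 folds are used instead of 10 only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the prepared training data
X, y = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

# A deliberately small grid; every combination is fitted once per fold
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

print("Best estimator:", search.best_estimator_)
print("Best parameters:", search.best_params_)
print(f"Best cross-validated MSE: {-search.best_score_:.3f}")
```

With the default `refit=True`, `search.best_estimator_` is already refitted on all the data passed to `fit`.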

#### Step 11: Train your model (1 mark)

#### Step 12: Predict the values for your test dataset and calculate the mean squared error. (2 marks)
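
Steps 11 and 12 can be sketched together: fit on the training split, predict on the held-out split, and compare the predictions with the true values. The data is synthetic and the hyperparameters are placeholders; in the notebook the estimator from Step 10 and the split from Step 8 would be used.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 11: train the model on the training split only
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Step 12: predict the unseen rows and measure the error
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {mse:.3f}")
```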

#### Step 13: Plot a horizontal bar chart using barh that shows each feature importance value next to its column name. (3 marks)
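
A minimal sketch of such a chart, assuming a fitted random forest, whose `feature_importances_` attribute holds one value per input column. The feature names are placeholder labels attached to synthetic data, so the importances shown say nothing about the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Placeholder labels for five synthetic features
feature_names = ["G1", "G2", "absences", "failures", "studytime"]
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its column name and sort so the largest bar ends up on top
importances = pd.Series(model.feature_importances_, index=feature_names).sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(importances.index, importances.values)
ax.set_xlabel("Feature importance")
ax.set_title("Random forest feature importances")
fig.tight_layout()
```

If one-hot encoding was applied in Step 8, the names must come from the encoded columns rather than from the original DataFrame, because the model sees one column per category.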