# 📊 ***Analysis of Factors Affecting Student Academic Performance***



## 📘 **Step 1: Data Loading and Initial Overview**



---

## 📘 **Domain: Education**

## 📘 **Objective:**



The main objective of this project is to analyze student performance data using Python. The project involves cleaning the dataset, exploring relationships between variables, creating visualizations, and drawing insights that explain how different factors affect students' academic performance.

## 📘 **Project Definition:**



 This project analyzes student performance data using Python to explore relationships between various factors, visualize patterns, and derive insights that explain their impact on academic performance.

## 📘 **Dataset Description:**



 The dataset contains student demographic, academic, and socio-economic information, including study habits, attendance, parental background, and grades, used to analyze factors affecting academic performance.

## 📘 **Dataset Source: Kaggle – Student Performance Data Set**



https://www.kaggle.com/datasets/larsen0966/student-performance-data-set

### 🔹 **The Scope of this Project Includes:**



- Loading and exploring the dataset using Python.



- Cleaning and preprocessing data (handling missing values, duplicates, and data types).



- Creating derived features for better analysis.



- Performing exploratory data analysis using statistical summaries and visualizations.



- Interpreting results and presenting key insights.

### 🔹 **Tools & Technologies Used:**



- Python



- Pandas



- NumPy



- Matplotlib



- Seaborn



- Scikit-learn



- Jupyter Notebook



- PyCharm

### 🔹 **Step 1: Data Loading and Initial Overview**

*Objective*: Set up the Python environment for data analysis.

```python
import pandas as pd

# Display settings
pd.set_option('display.max_columns', None)               # show all columns
pd.set_option('display.float_format', '{:.2f}'.format)   # format all decimals to 2 places

print("Libraries imported successfully!")
```

```
Libraries imported successfully!
```


### 🔹 **1.1 Data Source and Loading**



**Data Source:** Data-Science Sean - Student Performance Data Set



**Source Link:** https://www.kaggle.com/datasets/larsen0966/student-performance-data-set



**Dataset Description:**



This dataset contains records for 649 students, including:



- Student Demographics: Age, gender, and address type



- Family Background: Parental education levels, family size, and family relationship quality



- Academic Information: School type, study time, previous failures, and absences



- Social & Lifestyle Factors: Free time, internet access, health status, alcohol consumption



- Performance Indicators: Period grades (G1, G2) and final grade (G3)



**Loading Objectives:**



We will load the CSV file and perform initial inspection to understand:



- The total number of records (rows), representing individual students



- The total number of features (columns), representing student, family, and academic attributes



- Column names and their data types to ensure correct interpretation of variables



- The overall data structure and quality, including the presence of missing values or inconsistencies

*Objective*: Load the provided dataset into a Pandas DataFrame.

```python
# Load the dataset
df = pd.read_csv(r"C:\Users\anton\Downloads\archive\Student_Performance_Cleaned.csv", encoding='latin-1')

# Display dataset dimensions
print("Dataset loaded successfully!")
print("="*70)
print(f"Dataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
```

```
Dataset loaded successfully!
Dataset Shape: 649 rows × 33 columns
```
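The absolute Windows path above only works on the author's machine. As a minimal sketch, the loading step could be wrapped in a small helper that accepts any path or file-like object and fails early on an empty file. The helper name `load_students` and the in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Two rows copied from the df.head() preview, trimmed to a few columns.
SAMPLE_CSV = """school,sex,age,G1,G2,G3
GP,F,18,0,11,11
GP,F,17,9,11,11
"""


def load_students(source, **read_csv_kwargs):
    """Read a student CSV (path or file-like object) and reject empty files.

    Hypothetical helper; extra keyword arguments go straight to pd.read_csv.
    """
    frame = pd.read_csv(source, **read_csv_kwargs)
    if frame.empty:
        raise ValueError("CSV loaded but contains no rows")
    return frame


# In-memory demo so the sketch runs anywhere.
sample_df = load_students(io.StringIO(SAMPLE_CSV))
print(f"Dataset Shape: {sample_df.shape[0]:,} rows × {sample_df.shape[1]} columns")
```

With a real file, the call would look like `load_students(path_to_csv, encoding="latin-1")`, where `path_to_csv` is wherever the Kaggle download was saved.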


### 🔹 **1.2 Initial Data Overview**

**We will examine the first few rows of the dataset to understand:**

- The structure and format of the data

- Actual values in each column

- Data types (numbers, text, etc.)

- Any obvious data quality issues

`df.head()` displays the first 5 rows by default, giving a quick preview of the dataset. Passing an integer, as in `df.head(n)`, returns the first n rows instead.

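```python
# Display the first 10 rows of the dataset
print('First 10 rows of the dataset:')
df.head(10)
```

```
First 10 rows of the dataset:
```

|  | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 0 | yes | no | no | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | no | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | no | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |
| 5 | GP | M | 16 | U | LE3 | T | 4 | 3 | services | other | reputation | mother | 1 | 2 | 0 | no | yes | no | yes | yes | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 6 | 12 | 12 | 13 |
| 6 | GP | M | 16 | U | LE3 | T | 2 | 2 | other | other | home | mother | 1 | 2 | 0 | no | no | no | no | yes | yes | yes | no | 4 | 4 | 4 | 1 | 1 | 3 | 0 | 13 | 12 | 13 |
| 7 | GP | F | 17 | U | GT3 | A | 4 | 4 | other | teacher | home | mother | 2 | 2 | 0 | yes | yes | no | no | yes | yes | no | no | 4 | 1 | 4 | 1 | 1 | 1 | 2 | 10 | 13 | 13 |
| 8 | GP | M | 15 | U | LE3 | A | 3 | 2 | services | other | home | mother | 1 | 2 | 0 | no | yes | no | no | yes | yes | yes | no | 4 | 2 | 2 | 1 | 1 | 1 | 0 | 15 | 16 | 17 |
| 9 | GP | M | 15 | U | GT3 | T | 3 | 4 | other | other | home | mother | 1 | 2 | 0 | no | yes | no | yes | yes | yes | yes | no | 5 | 5 | 1 | 1 | 1 | 5 | 0 | 12 | 12 | 13 |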


### 🔹 **1.3 Last Rows of the Dataset**



**Checking the last few rows helps us:**



- Verify the entire file loaded completely



- Check if data patterns change at the end



- Ensure no corruption at the file's end

The `df.tail(n)` function displays the last n rows.

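```python
# Display the last 10 rows of the dataset
print("Last 10 rows of the dataset:")
df.tail(10)
```

```
Last 10 rows of the dataset:
```

|  | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 639 | MS | M | 19 | R | GT3 | T | 1 | 1 | other | services | ... | 4 | 3 | 2 | 1 | 3 | 5 | 0 | 5 | 8 | 0 |
| 640 | MS | M | 18 | R | GT3 | T | 4 | 2 | other | other | ... | 5 | 4 | 3 | 4 | 3 | 3 | 0 | 7 | 7 | 0 |
| 641 | MS | F | 18 | R | GT3 | T | 2 | 2 | at_home | other | ... | 5 | 3 | 3 | 1 | 3 | 4 | 0 | 14 | 17 | 15 |
| 642 | MS | F | 17 | U | GT3 | T | 4 | 3 | teacher | other | ... | 5 | 5 | 4 | 1 | 1 | 1 | 0 | 6 | 9 | 11 |
| 643 | MS | F | 18 | R | GT3 | T | 4 | 4 | teacher | at_home | ... | 4 | 4 | 3 | 2 | 2 | 5 | 4 | 7 | 9 | 10 |
| 644 | MS | F | 19 | R | GT3 | T | 2 | 3 | services | other | ... | 5 | 4 | 2 | 1 | 2 | 5 | 4 | 10 | 11 | 10 |
| 645 | MS | F | 18 | U | LE3 | T | 3 | 1 | teacher | services | ... | 4 | 3 | 4 | 1 | 1 | 1 | 4 | 15 | 15 | 16 |
| 646 | MS | F | 18 | U | GT3 | T | 1 | 1 | other | other | ... | 1 | 1 | 1 | 1 | 1 | 5 | 6 | 11 | 12 | 9 |
| 647 | MS | M | 17 | U | LE3 | T | 3 | 1 | services | services | ... | 2 | 4 | 5 | 3 | 4 | 2 | 6 | 10 | 10 | 10 |
| 648 | MS | M | 18 | R | LE3 | T | 3 | 2 | services | other | ... | 4 | 4 | 1 | 3 | 4 | 5 | 4 | 10 | 11 | 11 |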


### 🔹 **1.4 Dataset's Size and Structure**

**Why checking the number of rows and columns is important:**

- Helps you quickly understand dataset size

- Confirms whether the dataset loaded correctly

- Useful before and after data cleaning (to see if rows/columns changed)

`df.shape` returns the number of rows and columns in a DataFrame, helping us understand the dataset's size and structure.

```python
# Display the total number of rows and columns
df.shape
```

```
(649, 33)
```
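To illustrate the point about comparing shapes before and after cleaning, here is a minimal sketch on a toy frame. The values are made up for illustration, and `dropna()` stands in for whatever cleaning step is actually applied.

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row; values are illustrative only.
toy = pd.DataFrame({"age": [18, 17, np.nan], "G3": [11, 11, 10]})

rows_before, cols_before = toy.shape
cleaned = toy.dropna()  # stand-in for a cleaning step
rows_after, cols_after = cleaned.shape

print(f"Rows: {rows_before} -> {rows_after} (removed {rows_before - rows_after})")
print(f"Columns: {cols_before} -> {cols_after}")
```

Recording both shapes makes it obvious when a cleaning step silently drops more rows than intended.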

### 🔹 **1.5 Dataset Information and Data Types**



**Why the `columns` attribute is useful:**



- Helps verify column names after loading data



- Useful for feature selection



- Needed when renaming columns or checking spelling errors

`df.columns` returns the names of all columns in a DataFrame.

```python
df.columns
```

```
Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')
```
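Checking column names for spelling errors can be automated. The sketch below compares a frame's headers against an expected list after trimming whitespace. The `EXPECTED` subset and the misspelled toy headers are assumptions chosen for illustration.

```python
import pandas as pd

# A subset of the 33 column names shown above.
EXPECTED = ["school", "sex", "age", "G1", "G2", "G3"]

# Hypothetical frame whose headers carry stray whitespace and one typo.
toy = pd.DataFrame(columns=[" school", "sex ", "age", "G1", "G2", "G33"])
toy.columns = toy.columns.str.strip()

missing = [name for name in EXPECTED if name not in toy.columns]
unexpected = [name for name in toy.columns if name not in EXPECTED]

print("Missing:", missing)
print("Unexpected:", unexpected)
```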

**Why the `info()` method is important:**

- Confirms dataset loaded correctly

- Identifies missing values early

- Helps decide which columns need encoding or type conversion

`df.info()` gives a concise summary of the DataFrame.

```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      649 non-null    object
 1   sex         649 non-null    object
 2   age         649 non-null    int64 
 3   address     649 non-null    object
 4   famsize     649 non-null    object
 5   Pstatus     649 non-null    object
 6   Medu        649 non-null    int64 
 7   Fedu        649 non-null    int64 
 8   Mjob        649 non-null    object
 9   Fjob        649 non-null    object
 10  school reason      649 non-null    object
```
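`df.info()` prints its summary but does not return values that can be reused. As a sketch, the same information (missing counts per column, and which columns are numeric versus text) can be collected programmatically. The toy frame is illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age; values are illustrative only.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, np.nan],
    "G3": [11, 11, 10],
})

# Missing values per column, as a Series indexed by column name.
missing_per_column = toy.isna().sum()

# Split columns by type to see which ones would need encoding.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("Numeric:", numeric_cols)
print("Text:", text_cols)
```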

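### 🔹 **1.6 Statistical Summary of Numerical Features**

The `describe()` method provides key statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for all numerical columns. For reference, the dataset's columns are described below: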

**Student & School Information**

- **school:** School attended by the student (GP or MS)

- **sex:** Student's gender (F or M)

- **age:** Student's age in years

- **address:** Type of home address (urban or rural)

- **famsize:** Family size (small or large)

- **Pstatus:** Parents' cohabitation status (living together or apart)

- **guardian:** Student's legal guardian

**Parental Background**

- **Medu:** Mother's education level

- **Fedu:** Father's education level

- **Mjob:** Mother's job

- **Fjob:** Father's job

**Academic & School-Related Factors**

- **reason:** Reason for choosing the school

- **traveltime:** Travel time from home to school

- **studytime:** Weekly study time

- **failures:** Number of past class failures

- **schoolsup:** Extra educational support from school

- **famsup:** Family educational support

- **paid:** Extra paid classes within the course

- **activities:** Participation in extracurricular activities

- **nursery:** Attended nursery school

- **higher:** Intention to pursue higher education

- **internet:** Internet access at home

- **romantic:** In a romantic relationship

**Lifestyle & Health Factors**

- **famrel:** Quality of family relationships

- **freetime:** Free time after school

- **goout:** Going out with friends

- **Dalc:** Workday alcohol consumption

- **Walc:** Weekend alcohol consumption

- **health:** Current health status

- **absences:** Number of school absences

**Academic Performance (Original Grades)**

- **G1:** First period grade

- **G2:** Second period grade

- **G3:** Final period grade

**Derived / Engineered Features** (not among the 33 columns returned by `df.columns`)

- **avg_score:** Average of G1, G2, and G3 grades

- **total_score:** Total score obtained across G1, G2, and G3

- **result:** Final academic result based on G3 (Pass / Fail)

**Target Variable (for Analysis)**

- **G3 / result:** Represents the student's final academic performance and outcome
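The derived features described above could be computed roughly as follows. This is a sketch, not the notebook's own code: the pass mark of 10 is an assumption (the text does not state the threshold), and the three rows are rows 0, 1, and 639 from the previews above.

```python
import pandas as pd

PASS_MARK = 10  # assumed threshold; the notebook does not state one

# G1, G2, G3 for rows 0, 1 and 639 of the previews above.
grades = pd.DataFrame({"G1": [0, 9, 5], "G2": [11, 11, 8], "G3": [11, 11, 0]})

grade_cols = ["G1", "G2", "G3"]
grades["total_score"] = grades[grade_cols].sum(axis=1)
grades["avg_score"] = grades[grade_cols].mean(axis=1)
grades["result"] = (grades["G3"] >= PASS_MARK).map({True: "Pass", False: "Fail"})

print(grades)
```

The derived columns are computed from `grade_cols` only, so re-running the cell does not fold `total_score` back into the average.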

**Why `describe()` is important:**

- Gives a quick statistical overview

- Helps detect outliers

- Identifies unusual values

`df.describe()` generates summary statistics for the numerical columns in a DataFrame.

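```python
df.describe()
```

|  | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 | 649.0 |
| mean | 16.74 | 2.51 | 2.31 | 1.57 | 1.93 | 0.22 | 3.93 | 3.18 | 3.18 | 1.5 | 2.28 | 3.54 | 3.66 | 11.4 | 11.57 | 11.91 |
| std | 1.22 | 1.13 | 1.1 | 0.75 | 0.83 | 0.59 | 0.96 | 1.05 | 1.18 | 0.92 | 1.28 | 1.45 | 4.64 | 2.75 | 2.91 | 3.23 |
| min | 15.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 16.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 2.0 | 0.0 | 10.0 | 10.0 | 10.0 |
| 50% | 17.0 | 2.0 | 2.0 | 1.0 | 2.0 | 0.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 4.0 | 2.0 | 11.0 | 11.0 | 12.0 |
| 75% | 18.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 4.0 | 4.0 | 2.0 | 3.0 | 5.0 | 6.0 | 13.0 | 13.0 | 14.0 |
| max | 22.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 32.0 | 19.0 | 19.0 | 19.0 |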
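One common way to turn the quartiles from `describe()` into an outlier check is the 1.5 × IQR rule. The sketch below applies it to a small sample; this rule is an assumption for illustration, since the notebook does not say which outlier method it uses. The sample holds the `absences` values from the ten rows of `df.head(10)` plus the reported maximum of 32, so the resulting fences will differ from those of the full dataset.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# absences from df.head(10), plus the maximum (32) reported by describe().
absences = pd.Series([4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 32], name="absences")

low, high = iqr_bounds(absences)
outliers = absences[(absences < low) | (absences > high)]

print(f"Fences: [{low}, {high}]")
print("Outliers:", outliers.tolist())
```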


### 🔹 **1.7 Summary of Findings**



**Dataset Overview**



- **Total Rows:** 649 students



- **Total Columns:** 33 features (the original columns returned by `df.columns`; the derived features are not among them)



**Target Variable:**



- **G3:** Final period grade



- **result:** Pass / Fail classification based on final grade



**Student Profile**



- **Average Age:** Approximately 16–17 years



- **Gender Distribution:** Both male and female students represented



- **Study Time:** Most students fall within moderate weekly study time categories



**Academic Performance:**



- **Grade Range:** 0 to 19 observed for G1, G2, and G3 (grades are on a 0–20 scale)



- **Overall Outcome:** Majority of students score above the pass threshold



**Data Quality Issues Identified**



- **Missing Values:** No missing values detected in any column



- **Duplicate Records:** No duplicate rows found in the dataset
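The duplicate check behind this finding is not shown in this section. A minimal sketch of how such a count is typically obtained with pandas follows; the toy frame is illustrative and deliberately contains one repeated row.

```python
import pandas as pd

# Toy frame; the second row repeats the first on purpose.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [17, 17, 18],
    "G3": [11, 11, 10],
})

# duplicated() flags every repeat of an earlier row; the first copy is not flagged.
duplicate_count = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicate rows: {duplicate_count}")
print(f"Rows after removing duplicates: {len(deduplicated)}")
```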



**Data Type Issues:**



- **Verification:** Numeric and categorical columns checked



- **Correction:** Age column converted to integer format



**Text Inconsistencies:**



- **Resolution:** Categorical text fields standardized using lowercase and trimming
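Standardizing text fields with lowercase and trimming might look like the sketch below. The messy values are invented for illustration; the dataset's actual categories (for example `at_home`, `teacher`) appear in the previews above.

```python
import pandas as pd

# Hypothetical messy values; only the cleaned forms match the real dataset.
toy = pd.DataFrame({
    "Mjob": [" At_Home", "teacher ", "OTHER"],
    "age": [18, 17, 15],
})

# Apply strip + lowercase to every non-numeric column.
for col in toy.select_dtypes(exclude="number").columns:
    toy[col] = toy[col].str.strip().str.lower()

print(toy)
```

Note that lowercasing every text column would also change codes such as `GP` or `GT3`, so in practice the list of columns to normalize may need to be chosen explicitly.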



**Feature Engineering and Transformation**



**Derived Features:**



- **avg_score:** Average of G1, G2, and G3



- **total_score:** Total of G1, G2, and G3



- **result:** Pass / Fail outcome



**Feature Scaling:**



- **Normalization:** Min–Max scaling applied to selected numerical features



- **Standardization:** Applied to grade-related features
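The two scaling methods named above reduce to simple formulas: Min–Max scaling maps a column to the range 0 to 1 via `(x - min) / (max - min)`, and standardization rescales to mean 0 and unit variance via `(x - mean) / std`. The sketch below uses plain pandas arithmetic rather than scikit-learn's scalers, and the choice of which column gets which treatment is an assumption. The four sample values come from rows 0 to 3 of the `df.head(10)` preview.

```python
import pandas as pd

# absences and G3 from rows 0-3 of the df.head(10) preview.
sample = pd.DataFrame({"absences": [4, 2, 6, 0], "G3": [11, 11, 12, 14]})

# Min-Max scaling: (x - min) / (max - min)
col = sample["absences"]
sample["absences_minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization: (x - mean) / std, using the population std (ddof=0)
g3 = sample["G3"]
sample["G3_standardized"] = (g3 - g3.mean()) / g3.std(ddof=0)

print(sample)
```

Min–Max scaling divides by zero when a column is constant, so such columns need to be skipped or handled separately.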



**Categorical Encoding:**



- **Label Encoding:** Text-based categorical variables converted to numeric form
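As a sketch of label encoding, the example below uses pandas category codes, which assign one integer per distinct value in sorted order. This is a stand-in that assumes no particular encoder; the notebook does not show which one it uses. Keeping the code-to-label mapping makes the encoding reversible.

```python
import pandas as pd

# Values taken from the school and internet columns in the previews above.
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "internet": ["yes", "no", "yes"],
})

encoded = toy.copy()
mappings = {}  # column name -> {code: original label}

for col in ["school", "internet"]:
    as_category = encoded[col].astype("category")
    mappings[col] = dict(enumerate(as_category.cat.categories))
    encoded[col] = as_category.cat.codes

print(encoded)
print(mappings)
```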



**Key Academic Insights**



- **Grade Dependency:** Final grade (G3) strongly correlates with G1 and G2



- **Pass Rate:** Majority of students pass the final examination



- **Influencing Factors:** Study time, absences, and lifestyle factors affect performance



- **Analysis Readiness:** Dataset is suitable for EDA and predictive modeling
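The grade dependency noted above can be checked with a correlation matrix. The sketch below uses only the G1, G2, and G3 values from the ten rows of `df.head(10)`, so the coefficients it prints will differ from those of the full 649-row dataset. On the real DataFrame, the equivalent call would be `df[["G1", "G2", "G3"]].corr()`.

```python
import pandas as pd

# G1, G2, G3 from the ten rows shown by df.head(10).
grades = pd.DataFrame({
    "G1": [0, 9, 12, 14, 11, 12, 13, 10, 15, 12],
    "G2": [11, 11, 13, 14, 13, 12, 12, 13, 16, 12],
    "G3": [11, 11, 12, 14, 13, 13, 13, 13, 17, 13],
})

# Pearson correlation between every pair of grade columns.
corr = grades.corr()
print(corr.round(2))
```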



**Final Verdict**



- **Data Cleanliness:** No unresolved quality issues



- **Consistency:** All variables validated and standardized



- **Readiness:** Dataset is fully prepared for Exploratory Data Analysis