# Student Key Performance Indicators (EDA)

##### 1. Objective

The objective of this EDA project is to explore how student KPIs (math, reading, and writing scores) relate to gender, ethnicity, parental education, lunch type, and test preparation, and to understand how these factors affect student performance.

##### 2. Data Collection

Dataset Source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977

##### 2.1 Importing Required packages and Data 

In [184]:
# Packages : Pandas, Seaborn, NumPy, OS
import numpy as np
import pandas as pd
import seaborn as sns
import os
import matplotlib

from simple_colors import *

##### Importing Data and performing preliminary data check

In [185]:
# Pulling data
os.getcwd()
os.listdir()

path = "Data/stud.csv"
db = pd.read_csv(path)


In [186]:
# Printing the top 10 rows in order to get the gist of the data
# Overview of Data set
db.head(10)


Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
5,female,group B,associate's degree,standard,none,71,83,78
6,female,group B,some college,standard,completed,88,95,92
7,male,group B,some college,free/reduced,none,40,43,39
8,male,group D,high school,free/reduced,completed,64,64,67
9,female,group B,high school,free/reduced,none,38,60,50


##### Data Checks to perform

    Check Missing values
    Check Duplicates
    Check data type
    Check the number of unique values of each column
    Check statistics of data set
    Check the categories present in each categorical column
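
The checks listed above can also be gathered into a single per-column summary. The following is a minimal sketch under stated assumptions, not the notebook's own method: `summarize_checks` is a hypothetical helper, and the three-row `sample` frame is made-up data standing in for `db`.

```python
import pandas as pd


def summarize_checks(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),  # NaN is not counted as a unique value
    })


# Made-up sample standing in for the real dataset (assumption for illustration).
sample = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", None, "standard"],
    "math_score": [72, 47, 72],
})

summary = summarize_checks(sample)
duplicate_rows = int(sample.duplicated().sum())  # rows identical to an earlier row

print(summary)
print("Duplicate rows:", duplicate_rows)
```

On the real data, the same helper could be called as `summarize_checks(db)`; the descriptive statistics would still come from `db.describe()`.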

In [187]:
# Count fully duplicated rows and missing values per column
# No duplicates found in this dataset

print(green("No. of Duplicates\n\n",'bold'),db.duplicated().sum(),green("\n\n Null Values in Columns\n",'bold'),db.isna().sum())

No. of Duplicates

 0 

 Null Values in Columns
 gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64


##### Checking Unique Values


In [188]:
db.nunique()

gender                          2
race_ethnicity                  5
parental_level_of_education     6
lunch                           2
test_preparation_course         2
math_score                     81
reading_score                  72
writing_score                  77
dtype: int64

##### Checking Data Types

In [189]:
db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


##### Data Exploration

In [190]:
# Size of the data (rows x columns = total elements) and summary statistics

print(green("DB Size : ",'bold'),db.size,"",green("\n\n Description\n\n",'bold'),db.describe())

DB Size :  8000

 Description

        math_score  reading_score  writing_score
count  1000.00000    1000.000000    1000.000000
mean     66.08900      69.169000      68.054000
std      15.16308      14.600192      15.195657
min       0.00000      17.000000      10.000000
25%      57.00000      59.000000      57.750000
50%      66.00000      70.000000      69.000000
75%      77.00000      79.000000      79.000000
max     100.00000     100.000000     100.000000


In [191]:
# Overview of Data set
db.head(4)


Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44


In [192]:
"""
Displaying the unique values in each categorical column
"""


# Gender
print(green("Unique Gender Values :",'bold'),db['gender'].unique())

# Ethnicity
print(green("\nUnique race_ethnicity Values :",'bold'),db['race_ethnicity'].unique())

# Education
print(green("\nUnique Education Values :",'bold'),db['parental_level_of_education'].unique())

# Lunch
print(green("\nUnique Lunch Values :",'bold'),db['lunch'].unique())

# Preparation
print(green("\nUnique Preparation Values :",'bold'),db['test_preparation_course'].unique())

Unique Gender Values : ['female' 'male']

Unique race_ethnicity Values : ['group B' 'group C' 'group A' 'group D' 'group E']

Unique Education Values : ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']

Unique Lunch Values : ['standard' 'free/reduced']

Unique Preparation Values : ['none' 'completed']


##### Identifying Categorical and Numerical Columns

In [193]:
numerical_values = db.select_dtypes(include=['int64'])
categorical_values = db.select_dtypes(include=['object'])

print(green("Numerical values in Columns : \n",'bold'),numerical_values.columns,
            green("\nCategorical Values in columns :\n",'bold'),categorical_values.columns)

Numerical values in Columns : 
 Index(['math_score', 'reading_score', 'writing_score'], dtype='object') 
Categorical Values in columns :
 Index(['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch',
       'test_preparation_course'],
      dtype='object')


In [194]:
db.tail(2)

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
998,female,group D,some college,standard,completed,68,78,77
999,female,group D,some college,free/reduced,none,77,86,86


##### Calculating Required Metrics
1. Total score.
2. Overall average score.
3. Number of students who scored full marks in each subject.
4. Number of students who scored 35 or less in each subject.



In [195]:
db['total'] = db['math_score'] + db['reading_score'] + db['writing_score']
db['avg_score'] = db['total']/3


In [196]:
# Number of students who scored 100 in each subject
print(
green("No. of students with full marks in Maths :",'bold'),db[db.math_score== 100].shape[0],
green("\nNo. of students with full marks in Reading :",'bold'),db[db.reading_score== 100].shape[0],
green("\nNo. of students with full marks in Writing :",'bold'),db[db.writing_score== 100].shape[0]
    )

No. of students with full marks in Maths : 7 
No. of students with full marks in Reading : 17 
No. of students with full marks in Writing : 14


In [197]:
# Average score in each subject
print(
green("Average score in Maths :",'bold'),db.math_score.mean(),
green("\nAverage score in Reading :",'bold'),db.reading_score.mean(),
green("\nAverage score in Writing :",'bold'),db.writing_score.mean()
    )

Average score in Maths : 66.089 
Average score in Reading : 69.169 
Average score in Writing : 68.054


In [198]:
# Scored 35 or less
print(
green("No. of students who scored <=35 in Maths :",'bold'),db[db.math_score<= 35].shape[0],
green("\nNo. of students who scored <=35 in Reading :",'bold'),db[db.reading_score<= 35].shape[0],
green("\nNo. of students who scored <=35 in Writing :",'bold'),db[db.writing_score<= 35].shape[0],
"\nTotal students :",db.shape[0]
    )

No. of students who scored <=35 in Maths : 27 
No. of students who scored <=35 in Reading : 15 
No. of students who scored <=35 in Writing : 18 
Total students : 1000


##### Insights
1. Fewer students score full marks in maths (7) than in reading (17) or writing (14).
2. Maths has the lowest average score (66.09, versus 69.17 for reading and 68.05 for writing).
3. Maths has the most low scorers (27 students at 35 or below, versus 15 in reading and 18 in writing).
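
The three comparisons behind these insights can be put side by side in one table. The sketch below is only an illustration: the `sample` frame is made-up data standing in for `db`, the name `subject_summary` is hypothetical, and the low-score threshold of `<= 35` is assumed to match the comparison used in the cell above.

```python
import pandas as pd

# Made-up scores standing in for the real dataset (assumption for illustration).
sample = pd.DataFrame({
    "math_score": [100, 30, 65, 35],
    "reading_score": [100, 100, 70, 40],
    "writing_score": [95, 20, 68, 50],
})

score_columns = ["math_score", "reading_score", "writing_score"]
scores = sample[score_columns]

# One row per subject: mean score, count of full marks, count of low scores.
subject_summary = pd.DataFrame({
    "mean": scores.mean(),
    "full_marks": (scores == 100).sum(),
    "at_or_below_35": (scores <= 35).sum(),
})

print(subject_summary)
```

Replacing `sample` with `db` should yield one row per subject, which makes it easier to compare maths against reading and writing at a glance.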

#### EDA (Visualization)
Libraries : Matplotlib, seaborn
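
As a starting point for this kind of visualization, the sketch below draws a box plot of one score against one categorical factor with seaborn. It is a hedged illustration rather than the notebook's actual plot: the four-row `sample` frame is made-up data standing in for `db`, and the choice of `test_preparation_course` versus `math_score` is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the real dataset (assumption for illustration).
sample = pd.DataFrame({
    "test_preparation_course": ["none", "completed", "none", "completed"],
    "math_score": [47, 69, 72, 88],
})

# One box per category shows how the score distribution differs by factor.
ax = sns.boxplot(data=sample, x="test_preparation_course", y="math_score")
ax.set_title("Math score by test preparation course (sample data)")
plt.show()
```

The same call should work with `data=db` and any of the other categorical columns on the x-axis.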