In [7]:
import pandas as pd

In [8]:
df=pd.read_csv("StudentsPerformance.csv")

In [9]:
print(df.head())

   gender race/ethnicity parental level of education         lunch  \
0  female        group B           bachelor's degree      standard   
1  female        group C                some college      standard   
2  female        group B             master's degree      standard   
3    male        group A          associate's degree  free/reduced   
4    male        group C                some college      standard   

  test preparation course  math score  reading score  writing score  
0                    none          72             72             74  
1               completed          69             90             88  
2                    none          90             95             93  
3                    none          47             57             44  
4                    none          76             78             75  


In [8]:
print(df.tail())

     gender race/ethnicity parental level of education         lunch  \
995  female        group E             master's degree      standard   
996    male        group C                 high school  free/reduced   
997  female        group C                 high school  free/reduced   
998  female        group D                some college      standard   
999  female        group D                some college  free/reduced   

    test preparation course  math score  reading score  writing score  
995               completed          88             99             95  
996                    none          62             55             55  
997               completed          59             71             65  
998               completed          68             78             77  
999                    none          77             86             86  


In [12]:
print(df.shape)

(1000, 8)


In [13]:
print(df.columns)

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [15]:
print(df.describe())

       math score  reading score  writing score
count  1000.00000    1000.000000    1000.000000
mean     66.08900      69.169000      68.054000
std      15.16308      14.600192      15.195657
min       0.00000      17.000000      10.000000
25%      57.00000      59.000000      57.750000
50%      66.00000      70.000000      69.000000
75%      77.00000      79.000000      79.000000
max     100.00000     100.000000     100.000000


Day 1 – Dataset Understanding

Rows: 1000
Columns: 8

Each row represents one student.
The dataset contains student details and exam scores.

Columns:
- gender: student gender
- race/ethnicity: student group
- parental level of education: parents’ education
- lunch: lunch type
- test preparation course: test prep status
- math score: math marks
- reading score: reading marks
- writing score: writing marks

Observation:
- Scores are numeric.
- Other columns are categorical.
- Data looks clean and structured.
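
The numeric/categorical split noted above can also be checked in code. The following is a minimal sketch, assuming a small made-up frame named `sample` that mimics three of the dataset's columns; the values are illustrative and not read from the CSV.

```python
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration.
sample = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
})

# Columns with a numeric dtype (the score columns in the real dataset).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical text here.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values in each categorical column.
distinct_counts = sample[categorical_cols].nunique()

print(numeric_cols)
print(categorical_cols)
print(distinct_counts)
```

Running the same three statements on `df` should list the three score columns as numeric and the remaining five as categorical.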

In [16]:
df.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

In [18]:
df.duplicated().sum()

np.int64(0)

In [19]:
df.dtypes

gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object

Day 2 – Data Quality Check

Missing Values:
- No missing values found in the dataset.

Duplicates:
- Number of duplicate rows: 0

Data Types:
- Score columns are integers (int64).
- The other five columns have dtype object (text) and hold categorical values.

Conclusion:
- The dataset is clean and ready for analysis.
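
The three checks above can be bundled into one helper so they are easy to rerun after each cleaning step. This is only a sketch: `quality_report` is a hypothetical name introduced here, and the small `sample` frame is made up to show non-zero counts.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Collect the basic data-quality counts checked in this section."""
    return {
        "rows": len(frame),
        # Total number of missing cells across all columns.
        "missing_cells": int(frame.isnull().sum().sum()),
        # Rows that repeat an earlier row exactly.
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Hypothetical frame with one missing value and one repeated row.
sample = pd.DataFrame({
    "gender": ["female", "male", "male", None],
    "math score": [72, 47, 47, 90],
})

report = quality_report(sample)
print(report)
```

Calling `quality_report(df)` on the real dataset should report zero missing cells and zero duplicate rows, matching the outputs above.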

In [20]:
df=df.drop_duplicates()

In [21]:
df["gender"]=df["gender"].str.strip()
df["race/ethnicity"]=df["race/ethnicity"].str.strip()

In [23]:
df.duplicated().sum()
df.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

Day 3 – Data Cleaning

- Ran drop_duplicates(); no rows were removed because the dataset had no duplicates.
- Stripped leading and trailing spaces from the gender and race/ethnicity columns.
- Verified no missing or duplicate values remain.

Result:
- Dataset is clean and ready for analysis.
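
The whitespace cleanup can be applied to every text column in one loop instead of column by column. A minimal sketch, assuming a made-up `sample` frame with stray spaces; note that `.str.strip()` needs the parentheses, since without them the method object itself is assigned to the column.

```python
import pandas as pd

# Hypothetical frame with stray spaces; values are made up.
sample = pd.DataFrame({
    "gender": [" female", "male ", "female"],
    "race/ethnicity": ["group B ", " group A", "group B"],
    "math score": [72, 47, 90],
})

# Treat every non-numeric column as a text column.
text_cols = sample.select_dtypes(exclude="number").columns

for col in text_cols:
    # Calling .str.strip() returns a new Series with spaces removed.
    sample[col] = sample[col].str.strip()

print(sample["gender"].unique())
print(sample["race/ethnicity"].unique())
```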

Day 4 – Exploratory Data Analysis 

Business Questions:
1. What is the average score in each subject?
2. Which subject has the highest average score?
3. Is there a performance difference between male and female students?

In [24]:
df[["math score","reading score","writing score"]].mean()

math score       66.089
reading score    69.169
writing score    68.054
dtype: float64

In [27]:
df[["math score","reading score","writing score"]].mean().idxmax()

'reading score'

In [28]:
df.groupby("gender")[["math score","reading score","writing score"]].mean()

        math score  reading score  writing score
gender
female   63.633205      72.608108      72.467181
male     68.728216      65.473029      63.311203


Insights:
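- Reading has the highest overall average (69.17), ahead of writing (68.05) and math (66.09).
- Female students average about 7 points higher in reading (72.61 vs 65.47) and 9 points higher in writing (72.47 vs 63.31).
- Male students average about 5 points higher in math (68.73 vs 63.63).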
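
The size of the gender gap can be computed directly from the group means. The sketch below rebuilds the groupby result by hand, with the means copied from the output above and rounded to two decimals; on the real data the same subtraction could be applied to the groupby result itself.

```python
import pandas as pd

# Group means copied from the groupby output above, rounded to two decimals.
means = pd.DataFrame(
    {
        "math score": [63.63, 68.73],
        "reading score": [72.61, 65.47],
        "writing score": [72.47, 63.31],
    },
    index=pd.Index(["female", "male"], name="gender"),
)

# Positive values mean the female average is higher, negative the male average.
gap = (means.loc["female"] - means.loc["male"]).round(2)
print(gap)
```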
  

Day 5 – Grouping and Comparison

Business Questions:
1. How do students perform across different race/ethnicity groups?
2. Does parental education affect student performance?
3. Does lunch type or test preparation impact scores?
4. Are the scores of different subjects correlated?


In [11]:
print("Average scores by race/ethnicity:\n")
print(df.groupby("race/ethnicity")[["math score","reading score","writing score"]].mean())

Average scores by race/ethnicity:

                math score  reading score  writing score
race/ethnicity                                          
group A          61.629213      64.674157      62.674157
group B          63.452632      67.352632      65.600000
group C          64.463950      69.103448      67.827586
group D          67.362595      70.030534      70.145038
group E          73.821429      73.028571      71.407143


In [13]:
print("Average scores by parental education")
print(df.groupby("parental level of education")[["math score","reading score","writing score"]].mean())

Average scores by parental education
                             math score  reading score  writing score
parental level of education                                          
associate's degree            67.882883      70.927928      69.896396
bachelor's degree             69.389831      73.000000      73.381356
high school                   62.137755      64.704082      62.448980
master's degree               69.745763      75.372881      75.677966
some college                  67.128319      69.460177      68.840708
some high school              63.497207      66.938547      64.888268


In [14]:
print("Average scores by lunch type:")
print(df.groupby("lunch")[["math score","reading score","writing score"]].mean())

Average scores by lunch type:
              math score  reading score  writing score
lunch                                                 
free/reduced   58.921127      64.653521      63.022535
standard       70.034109      71.654264      70.823256


In [15]:
print("Average scores by test prep")
print(df.groupby("test preparation course")[["math score","reading score","writing score"]].mean())

Average scores by test prep
                         math score  reading score  writing score
test preparation course                                          
completed                 69.695531      73.893855      74.418994
none                      64.077882      66.534268      64.504673


In [16]:
df[["math score","reading score","writing score"]].corr()

               math score  reading score  writing score
math score       1.000000       0.817580       0.802642
reading score    0.817580       1.000000       0.954598
writing score    0.802642       0.954598       1.000000


Observations / Insights:

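- Group E has the highest average in all three subjects; group A has the lowest.
- Students whose parents hold a master's or bachelor's degree average higher than those whose parents have a high school education or less.
- Students with standard lunch average higher than those with free/reduced lunch (about 11 points in math).
- Students who completed the test preparation course score higher on average in every subject.
- Reading and writing scores are strongly correlated (about 0.95); math correlates with each at about 0.80 to 0.82.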
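
Group averages are easier to judge when the group sizes are shown next to them, because a mean over a small group is less stable. A minimal sketch, assuming a made-up `sample` frame; `agg` with a list of function names returns one column per statistic.

```python
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration.
sample = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "standard", "free/reduced"],
    "math score": [72, 69, 47, 90, 53],
})

# Mean and number of students per lunch type, side by side.
summary = sample.groupby("lunch")["math score"].agg(["mean", "count"])
print(summary)
```

The same pattern works for any of the grouping columns used above, for example "race/ethnicity" or "test preparation course".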

Day 6 – Create Average Score & Grade

Business Questions:
1. What is each student’s average score across subjects?
2. Can we assign grades based on average performance?

In [21]:
#calculating avg score for each student
df["average_score"]=df[["math score","reading score","writing score"]].mean(axis=1)
df

     gender  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score  average_score
0    female         group B            bachelor's degree      standard                     none          72             72             74      72.666667
1    female         group C                 some college      standard                completed          69             90             88      82.333333
2    female         group B              master's degree      standard                     none          90             95             93      92.666667
3      male         group A           associate's degree  free/reduced                     none          47             57             44      49.333333
4      male         group C                 some college      standard                     none          76             78             75      76.333333
...     ...             ...                          ...           ...                      ...         ...            ...            ...            ...
995  female         group E              master's degree      standard                completed          88             99             95      94.000000
996    male         group C                  high school  free/reduced                     none          62             55             55      57.333333
997  female         group C                  high school  free/reduced                completed          59             71             65      65.000000
998  female         group D                 some college      standard                completed          68             78             77      74.333333


In [23]:
def grade(avg):
    if avg>=90:
        return "A"
    elif avg>=75:
        return "B"
    elif avg>=60:
        return "C"    
    else:
        return "D"
df["grade"]=df["average_score"].apply(grade)        
df["grade"].value_counts()

grade
C    391
D    285
B    272
A     52
Name: count, dtype: int64

Observations / Insights :

Grade C is the most common (391 students), followed by D (285) and B (272).

Only 52 of the 1000 students achieve Grade A (average of 90 or above).

The average_score and grade columns will help in filtering and comparing student performance in the next steps.
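
The same grade boundaries can be expressed without a custom function by binning the averages. This is an alternative sketch using `pd.cut`, not the method used in the cell above; the `averages` values are made up to sit on and around each boundary. With `right=False` each bin includes its lower edge, which matches the `>=` comparisons in `grade`.

```python
import numpy as np
import pandas as pd

# Hypothetical averages chosen to sit on and around each grade boundary.
averages = pd.Series([59.9, 60.0, 74.9, 75.0, 90.0, 100.0])

# Bins are [-inf, 60), [60, 75), [75, 90), [90, inf).
grades = pd.cut(
    averages,
    bins=[-np.inf, 60, 75, 90, np.inf],
    labels=["D", "C", "B", "A"],
    right=False,
)

print(grades.tolist())
```

The result is a categorical column, so `value_counts()` and `groupby` work on it in the same way as on the string grades.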

Day 7 – Filtering & Top Performers

Business Questions:
1. Who are the top performers in each subject?
2. Are there students scoring top marks in all subjects?
3. Can we identify patterns like gender or parental education among top students?

In [24]:
#Top 5 students in maths
df.nlargest(5,"math score")[["math score","reading score","writing score"]]

     math score  reading score  writing score
149         100            100             93
451         100             92             97
458         100            100            100
623         100             96             86
625         100             97             99


In [25]:
#Top 5 students in reading
df.nlargest(5,"reading score")[["math score","reading score","writing score"]]

     math score  reading score  writing score
106          87            100            100
114          99            100            100
149         100            100             93
165          96            100            100
179          97            100            100


In [26]:
#Top 5 students in writing
df.nlargest(5,"writing score")[["math score","reading score","writing score"]]

     math score  reading score  writing score
106          87            100            100
114          99            100            100
165          96            100            100
179          97            100            100
377          85             95            100


In [28]:
#Students scoring >=90 in all 3 subjects 
top_students_all=df[(df["math score"]>=90)&(df["reading score"]>=90)&(df["writing score"]>=90)]
top_students_all[["gender","race/ethnicity","parental level of education","average_score","grade"]]

     gender  race/ethnicity  parental level of education  average_score  grade
2    female         group B              master's degree      92.666667      A
114  female         group E            bachelor's degree      99.666667      A
149    male         group E           associate's degree      97.666667      A
165  female         group C            bachelor's degree      98.666667      A
179  female         group D             some high school           99.0      A
263  female         group E                  high school           94.0      A
451  female         group E                 some college      96.333333      A
458  female         group E            bachelor's degree          100.0      A
474  female         group B           associate's degree      90.333333      A
546  female         group A             some high school      96.333333      A


Observations / Insights:

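Every student in the three top-5 lists scored 100 in that subject, so each list shows only the first five of the tied rows. The reading and writing lists overlap heavily (students 106, 114, 165 and 179 appear in both), while only student 149 appears in both the math and reading lists.

Few students score >= 90 in all three subjects; these are the overall top achievers, and all of those listed receive grade A.

Among the listed students scoring >= 90 in all three subjects, most are female, which is consistent with the higher female averages in reading and writing.

Parental education among the listed top students ranges from some high school to master's degree, so it does not by itself determine top performance.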
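
Two details of the filtering above can be made more robust. The sketch below, on a made-up `sample` frame, shows `nlargest` with `keep="all"`, which returns every row tied at the cutoff instead of only the first n, and a column-wise comparison with `ge(90).all(axis=1)`, which is equivalent to chaining the three `>=` conditions with `&`.

```python
import pandas as pd

score_cols = ["math score", "reading score", "writing score"]

# Hypothetical miniature frame; values are made up for illustration.
sample = pd.DataFrame({
    "math score": [100, 100, 87, 95, 60],
    "reading score": [100, 92, 100, 89, 70],
    "writing score": [93, 97, 100, 91, 65],
})

# Rows where every score column is at least 90.
all_high = sample[sample[score_cols].ge(90).all(axis=1)]

# keep="all" returns every row tied for the top math score, not just one.
top_math = sample.nlargest(1, "math score", keep="all")

print(all_high.index.tolist())
print(top_math.index.tolist())
```

With the default `keep="first"`, a request for the top five returns exactly five rows even when more students share the same score, which is why the lists above can omit students who also scored 100.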

In [29]:
df

     gender  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score  average_score  grade
0    female         group B            bachelor's degree      standard                     none          72             72             74      72.666667      C
1    female         group C                 some college      standard                completed          69             90             88      82.333333      B
2    female         group B              master's degree      standard                     none          90             95             93      92.666667      A
3      male         group A           associate's degree  free/reduced                     none          47             57             44      49.333333      D
4      male         group C                 some college      standard                     none          76             78             75      76.333333      B
...     ...             ...                          ...           ...                      ...         ...            ...            ...            ...    ...
995  female         group E              master's degree      standard                completed          88             99             95      94.000000      A
996    male         group C                  high school  free/reduced                     none          62             55             55      57.333333      D
997  female         group C                  high school  free/reduced                completed          59             71             65      65.000000      C
998  female         group D                 some college      standard                completed          68             78             77      74.333333      C

