# Data Dive in Python

### Chloe Israel

## Data Description
This data set, Student Performance Factors, was found on Kaggle (kaggle.com). It provides a comprehensive overview of various factors that affect student performance on exams. It has 6,607 rows and 20 variables: 10 character variables, 7 numerical variables, and 3 Boolean variables. Each variable represents a different factor that influences a student's success on exams, including but not limited to study habits, student demographics, and school quality. The students' exam scores are also provided as a metric of their performance. No changes were made to the data set, which is saved in the file 'StudentPerformanceFactors.csv'.

## Variables
**Name**                   | **Data Type** | **Description**
:--------------------------|:--------------|:----------------
Hours_Studied              | integer       | Number of hours the student spends studying per week.
Attendance                 | integer       | Percentage of classes the student attended.
Parental_Involvement       | string        | Level of parental involvement in the student’s education (Low/Medium/High).
Access_to_Resources        | string        | Availability of educational resources to the student (Low/Medium/High).
Extracurricular_Activities | Boolean       | Indicates whether the student participates in extracurriculars.
Sleep_Hours                | integer       | Average number of hours of sleep the student gets per night.
Previous_Scores            | integer       | Student’s scores from past exams.
Motivation_Level           | string        | Student’s motivation level (Low/Medium/High).
Internet_Access            | Boolean       | Indicates whether the student can access the internet.
Tutoring_Sessions          | integer       | Number of tutoring sessions the student attended per month.
Family_Income              | string        | Student’s family income level (Low/Medium/High).
Teacher_Quality            | string        | Quality of the teachers at the student’s school (Low/Medium/High).
School_Type                | string        | Type of school the student attends (Public/Private).
Peer_Influence             | string        | Influence of peers on the student’s academic performance (Positive/Negative/Neutral).
Physical_Activity          | integer       | Average number of hours of physical activity the student gets per week.
Learning_Disabilities      | Boolean       | Indicates whether the student has a learning disability.
Parental_Education_Level   | string        | Highest level of education of the student’s parents (High School/College/Postgraduate).
Distance_from_Home         | string        | Distance from the student’s home to their school (Near/Moderate/Far).
Gender                     | string        | Student’s gender (Male/Female).
Exam_Score                 | integer       | Student’s final exam score.

## Loading Required Packages

In [24]:
# Load the needed libraries
import pandas as pd
import numpy as np
# Used to display dataframe (Source Used: GeeksforGeeks.org)
from IPython.display import display

## Loading the Data

In [25]:
spf = pd.read_csv("StudentPerformanceFactors.csv")

#### Initial Inspection

In [26]:
# Display the dataframe dimensions
spf.shape

(6607, 20)

In [27]:
# Display the column names
spf.columns

Index(['Hours_Studied', 'Attendance', 'Parental_Involvement',
       'Access_to_Resources', 'Extracurricular_Activities', 'Sleep_Hours',
       'Previous_Scores', 'Motivation_Level', 'Internet_Access',
       'Tutoring_Sessions', 'Family_Income', 'Teacher_Quality', 'School_Type',
       'Peer_Influence', 'Physical_Activity', 'Learning_Disabilities',
       'Parental_Education_Level', 'Distance_from_Home', 'Gender',
       'Exam_Score'],
      dtype='object')

In [28]:
# Display the first 10 rows
spf.head(10)

**Index** | **Hours_Studied** | **Attendance** | **Parental_Involvement** | **Access_to_Resources** | **Extracurricular_Activities** | **Sleep_Hours** | **Previous_Scores** | **Motivation_Level** | **Internet_Access** | **Tutoring_Sessions** | **Family_Income** | **Teacher_Quality** | **School_Type** | **Peer_Influence** | **Physical_Activity** | **Learning_Disabilities** | **Parental_Education_Level** | **Distance_from_Home** | **Gender** | **Exam_Score**
:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--
0 | 23 | 84 | Low | High | No | 7 | 73 | Low | Yes | 0 | Low | Medium | Public | Positive | 3 | No | High School | Near | Male | 67
1 | 19 | 64 | Low | Medium | No | 8 | 59 | Low | Yes | 2 | Medium | Medium | Public | Negative | 4 | No | College | Moderate | Female | 61
2 | 24 | 98 | Medium | Medium | Yes | 7 | 91 | Medium | Yes | 2 | Medium | Medium | Public | Neutral | 4 | No | Postgraduate | Near | Male | 74
3 | 29 | 89 | Low | Medium | Yes | 8 | 98 | Medium | Yes | 1 | Medium | Medium | Public | Negative | 4 | No | High School | Moderate | Male | 71
4 | 19 | 92 | Medium | Medium | Yes | 6 | 65 | Medium | Yes | 3 | Medium | High | Public | Neutral | 4 | No | College | Near | Female | 70
5 | 19 | 88 | Medium | Medium | Yes | 8 | 89 | Medium | Yes | 3 | Medium | Medium | Public | Positive | 3 | No | Postgraduate | Near | Male | 71
6 | 29 | 84 | Medium | Low | Yes | 7 | 68 | Low | Yes | 1 | Low | Medium | Private | Neutral | 2 | No | High School | Moderate | Male | 67
7 | 25 | 78 | Low | High | Yes | 6 | 50 | Medium | Yes | 1 | High | High | Public | Negative | 2 | No | High School | Far | Male | 66
8 | 17 | 94 | Medium | High | No | 6 | 80 | High | Yes | 0 | Medium | Low | Private | Neutral | 1 | No | College | Near | Male | 69
9 | 23 | 98 | Medium | Medium | Yes | 8 | 71 | Medium | Yes | 0 | High | High | Public | Positive | 5 | No | High School | Moderate | Male | 72


## Data Preparation

In [29]:
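# Copy the original dataframe into df for modifications.
# .copy() creates an independent dataframe; plain assignment (df = spf) would only add a second name for spf.
df = spf.copy()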
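
A minimal sketch of why an explicit `.copy()` matters when a dataframe is meant to be modified independently of the original. The two-row frame and the names `original`, `alias`, and `independent` are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the loaded data
original = pd.DataFrame({"Internet_Access": ["Yes", "No"]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

# Modify one cell through each name
alias.loc[0, "Internet_Access"] = "changed via alias"
independent.loc[1, "Internet_Access"] = "changed via copy"

print(alias is original)                         # True
print(independent is original)                   # False
print(original["Internet_Access"].tolist())      # ['changed via alias', 'No']
print(independent["Internet_Access"].tolist())   # ['Yes', 'changed via copy']
```

Only the change made through the alias shows up in `original`; the change made through the copy does not.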

Currently, the 3 Boolean variables "Extracurricular_Activities", "Internet_Access", and "Learning_Disabilities" hold the 
values Yes/No instead of the standard Boolean values True/False. To make it easier to work with these variables later and 
to keep with standard Boolean conventions, change these variables to True/False.

In [30]:
# For each of the "Extracurricular_Activities", "Internet_Access", and "Learning_Disabilities" columns,
# map "Yes" to True and "No" to False. Any other value would become NaN.
boolCols = ["Extracurricular_Activities", "Internet_Access", "Learning_Disabilities"]
for col in boolCols:
    df[col] = df[col].map({"Yes": True, "No": False})
        
# Check the changes
df[["Extracurricular_Activities", "Internet_Access", "Learning_Disabilities"]].head()

**Index** | **Extracurricular_Activities** | **Internet_Access** | **Learning_Disabilities**
:--|:--|:--|:--
0 | False | True | False
1 | False | True | False
2 | True | True | False
3 | True | True | False
4 | True | True | False


The student's test scores are provided by two variables, "Previous_Scores" and "Exam_Score". To have an easy metric to
compare to other variables, create a new variable "Average_Score" that takes the average of the two scores.

In [31]:
# Create the Average_Score column in the df dataframe by averaging the "Previous_Scores" and "Exam_Score" columns.
# Floor division (//) rounds any half point down, so the result stays an integer.
df["Average_Score"] = (df["Previous_Scores"] + df["Exam_Score"]) // 2

# Check the new column
df["Average_Score"].head()

0    70
1    60
2    82
3    84
4    67
Name: Average_Score, dtype: int64
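
As a small illustration of how floor division affects the new column, the sketch below recomputes the first five averages from the `head()` output shown earlier, once with `//` and once as a true mean. The names `scores`, `floor_avg`, and `true_avg` are introduced here for illustration only.

```python
import pandas as pd

# First five rows of the two score columns, copied from the head() output above
scores = pd.DataFrame({
    "Previous_Scores": [73, 59, 91, 98, 65],
    "Exam_Score": [67, 61, 74, 71, 70],
})

# Floor division, as in the notebook cell: half points are dropped, dtype stays integer
floor_avg = (scores["Previous_Scores"] + scores["Exam_Score"]) // 2

# Row-wise mean: half points are kept, dtype becomes float
true_avg = scores[["Previous_Scores", "Exam_Score"]].mean(axis=1)

print(floor_avg.tolist())  # [70, 60, 82, 84, 67]
print(true_avg.tolist())   # [70.0, 60.0, 82.5, 84.5, 67.5]
```

The two versions differ by at most half a point per student, so the choice mainly affects readability rather than the comparisons that follow.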

## Data Moves

The following data moves will explore insights based on the following questions:
1. What is the relationship between the percentage of classes a student attends and a student’s exam scores, and does it differ based on the quality of the teachers at the school?
2. How do a student’s physical activity levels and hours of sleep per night relate to a student’s exam scores?
3. Is there an association between the presence of a learning disability in a student and the student’s level of motivation?

#### Question 1

In [32]:
# Determine the average attendance for a student
attendanceAvg = np.mean(df["Attendance"])
print("The average percentage of classes students attend is {0:.0f}%\n".format(attendanceAvg)) 

# Using the above value, create two data frames, one holding students with attendance at or above average and one holding
# those below the average.
satAtt = df[df["Attendance"] >= attendanceAvg]
unsatAtt = df[df["Attendance"] < attendanceAvg]

# Check the new dataframes hold the correct values
print("satAtt dataframe:")
print(satAtt["Attendance"].head())
print()      
print("unsatAtt dataframe:")
print(unsatAtt["Attendance"].head())

The average percentage of classes students attend is 80%

satAtt dataframe:
0    84
2    98
3    89
4    92
5    88
Name: Attendance, dtype: int64

unsatAtt dataframe:
1     64
7     78
14    78
15    68
16    60
Name: Attendance, dtype: int64


In [33]:
# Determine the average score of students who have satisfactory attendance versus those who don't
print("Students with attendance at or above {0:.0f}% have an average exam score of {1:.2f}".format(attendanceAvg,
                                                                                         satAtt["Average_Score"].mean()))
print("Students with attendance below {0:.0f}% have an average exam score of {1:.2f}".format(attendanceAvg,
                                                                                         unsatAtt["Average_Score"].mean()))

Students with attendance at or above 80% have an average exam score of 71.71
Students with attendance below 80% have an average exam score of 70.07


This shows that, on average, students who have satisfactory attendance have higher exam scores than those who do not. How does this relationship change when examining it in relation to the quality of teachers at the school?

In [34]:
# Group the satAtt dataframe by "Teacher_Quality" and use the agg() function to display the
# average score and average attendance for each group
# Source Used: LearnPython.com
satAttGrp = satAtt.groupby("Teacher_Quality").agg(
    Average_Score = ("Average_Score", "mean"),
    Avg_Attendance = ("Attendance", "mean")
    )

# Round the numerical values for readability
satAttGrp["Average_Score"] = satAttGrp["Average_Score"].round(2)
satAttGrp["Avg_Attendance"] = satAttGrp["Avg_Attendance"].round(2)

# Repeat the above code for the unsatAtt dataframe
unsatAttGrp = unsatAtt.groupby("Teacher_Quality").agg(
    Average_Score = ("Average_Score", "mean"),
    Avg_Attendance = ("Attendance", "mean")
    )

unsatAttGrp["Average_Score"] = unsatAttGrp["Average_Score"].round(2)
unsatAttGrp["Avg_Attendance"] = unsatAttGrp["Avg_Attendance"].round(2)

# Display both dataframes, sorting from highest to lowest average score
print("Satisfactory Attendance\n")
display(satAttGrp.sort_values(by = "Average_Score", ascending = False)) 
print("Unsatisfactory Attendance\n")
display(unsatAttGrp.sort_values(by = "Average_Score", ascending = False))

Satisfactory Attendance



**Teacher_Quality** | **Average_Score** | **Avg_Attendance**
:--|:--|:--
Low | 72.27 | 90.09
High | 71.77 | 89.56
Medium | 71.58 | 89.88


Unsatisfactory Attendance



**Teacher_Quality** | **Average_Score** | **Avg_Attendance**
:--|:--|:--
High | 70.68 | 69.8
Low | 70.04 | 69.63
Medium | 69.76 | 69.9


The data above reinforces that students with higher attendance have higher exam scores. Extending this relationship to teacher quality shows that, among students who attend more classes, those with low-quality teachers have the highest average scores. In contrast, among students who attend fewer classes, those with high-quality teachers have the highest average scores. In both attendance groups, students with medium-quality teachers have the lowest average scores, even though in the unsatisfactory group they have the highest average attendance. This is interesting, as the expected result was that, for both attendance groups, the average score would increase as teacher quality improved. One interpretation is that students who attend class more learn more, resulting in better exam scores, while students who attend class less learn less, and their exam scores suffer as a result. For students who attend class less, a higher-quality teacher appears to benefit exam scores more than it does for students who attend regularly. The differences between teacher-quality groups are under one point in both attendance groups, so it is reasonable to conclude that teacher quality has only a slight impact on a student's exam scores, with that impact becoming somewhat greater for students who attend class less.
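
The attendance and teacher-quality comparison can also be expressed as a single grouped table. The sketch below is one possible way to do that, not the notebook's method: it uses the ten rows from the `head(10)` output, a fixed threshold of 80 (the rounded mean printed earlier, used here as an assumption), and a hypothetical `Attendance_Group` column. A `Students` count is included because averages from very small groups are easy to over-interpret.

```python
import numpy as np
import pandas as pd

# Ten rows taken from the head(10) output; Average_Score is floor((Previous + Exam) / 2)
sample = pd.DataFrame({
    "Attendance": [84, 64, 98, 89, 92, 88, 84, 78, 94, 98],
    "Teacher_Quality": ["Medium", "Medium", "Medium", "Medium", "High",
                        "Medium", "Medium", "High", "Low", "High"],
    "Average_Score": [70, 60, 82, 84, 67, 80, 67, 58, 74, 71],
})

threshold = 80  # assumption: stands in for the computed attendance mean

# Label each student instead of splitting into two dataframes
sample["Attendance_Group"] = np.where(
    sample["Attendance"] >= threshold, "Satisfactory", "Unsatisfactory"
)

# One groupby over both keys, with the group size alongside the means
summary = (
    sample.groupby(["Attendance_Group", "Teacher_Quality"])
    .agg(
        Average_Score=("Average_Score", "mean"),
        Avg_Attendance=("Attendance", "mean"),
        Students=("Attendance", "size"),
    )
    .round(2)
)
print(summary)
```

With only ten rows the numbers themselves are not meaningful; the point is the shape of the result, where every attendance and teacher-quality combination appears in one table with its group size.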

#### Question 2

In [35]:
# Find the unique values for "Sleep_Hours" and "Physical_Activity" to determine if they can be easily grouped
print("Unique Sleep Hours:", df["Sleep_Hours"].unique())
print("Unique Physical Activity:", df["Physical_Activity"].unique())

Unique Sleep Hours: [ 7  8  6 10  9  5  4]
Unique Physical Activity: [3 4 2 1 5 0 6]


In [36]:
# Because both variables fall into easily divided groups, group the dataframe by "Sleep_Hours" and "Physical_Activity"
# Use the agg function to calculate the average score for each group and store it in the wellnessGrp dataframe
wellnessGrp = df.groupby(["Sleep_Hours", "Physical_Activity"]).agg({"Average_Score": "mean"}).round(2)

# Check the dataframe. Should be grouped by sleep hours, then physical activity  
wellnessGrp.head()

**Sleep_Hours** | **Physical_Activity** | **Average_Score**
:--|:--|:--
4 | 0 | 82.0
4 | 1 | 66.52
4 | 2 | 70.1
4 | 3 | 71.94
4 | 4 | 72.08


With the data correctly grouped, it is ready to be examined to see how sleep and physical activity relate to exam scores. The recommended amount of sleep for students (adolescents) is 8 hours. Using this benchmark, it will be best to examine students who get less than, the same, and more than the recommended amount. For each group, their physical activity will be examined as well.
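
One way to cover every student in the less-than, recommended, and more-than comparison is to bin `Sleep_Hours` with `pd.cut`. This is a sketch under assumptions, not the approach the notebook takes: the rows in `toy` are made up, and the `Sleep_Group` column and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the real analysis would use df
toy = pd.DataFrame({
    "Sleep_Hours": [4, 5, 6, 7, 8, 8, 9, 10],
    "Average_Score": [82, 66, 70, 72, 71, 69, 70, 68],
})

# Bins are right-inclusive: (3, 7] is below, (7, 8] is recommended, (8, 10] is above
toy["Sleep_Group"] = pd.cut(
    toy["Sleep_Hours"],
    bins=[3, 7, 8, 10],
    labels=["Below 8 hours", "8 hours", "Above 8 hours"],
)

# Mean score and number of students per sleep group
by_sleep = toy.groupby("Sleep_Group", observed=True)["Average_Score"].agg(["mean", "size"])
print(by_sleep)
```

Binning keeps the 5, 6, 7, and 9 hour students in the comparison, whereas selecting only 4, 8, and 10 hours compares the extremes against the benchmark.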

In [37]:
# Get each group based on Sleep_Hours
# 4 hours = less than, 8 hours = recommended, 10 hours = greater than
below = wellnessGrp.loc[4, ].round(2)
average = wellnessGrp.loc[8, ].round(2)
above = wellnessGrp.loc[10, ].round(2)

# Display each group, sorted by average score
# Then print the unweighted mean of that group's physical-activity averages
print("4 hours of sleep\n")
display(below.sort_values(by = "Average_Score", ascending = False))
print(below.mean().round(2))

print("\n8 hours of sleep\n")
display(average.sort_values(by = "Average_Score", ascending = False))
print(average.mean().round(2))

print("\n10 hours of sleep\n")
display(above.sort_values(by = "Average_Score", ascending = False))
print(above.mean().round(2))

4 hours of sleep



**Physical_Activity** | **Average_Score**
:--|:--
0 | 82.0
5 | 75.38
4 | 72.08
3 | 71.94
2 | 70.1
1 | 66.52
6 | 62.0


Average_Score    71.43
dtype: float64

8 hours of sleep



**Physical_Activity** | **Average_Score**
:--|:--
0 | 74.07
4 | 71.29
3 | 71.11
1 | 70.9
2 | 70.15
6 | 69.71
5 | 69.66


Average_Score    70.98
dtype: float64

10 hours of sleep



**Physical_Activity** | **Average_Score**
:--|:--
6 | 74.5
1 | 72.75
2 | 71.51
4 | 70.51
3 | 70.19
5 | 68.41
0 | 67.5


Average_Score    70.77
dtype: float64


The data above suggests that sleep is related to exam scores, although the initial results are unexpected. Students who get less than the recommended amount of sleep have a higher average score than those who get the recommended amount. However, a closer look at the values shows that the 4-hour group has a much wider range than the other two. Its highest value of 82 is a major outlier and has likely skewed the group's average. As demonstrated below, excluding the outlier drops the average from 71.43 to 69.67, below that of the 8-hour group. It is also interesting that those who get more than 8 hours of sleep have slightly lower scores, which indicates that sleeping too much may be slightly detrimental.

Looking at physical activity, among students who get the recommended amount of sleep or less, those with the highest scores get no physical activity, followed by those with a moderate-to-high amount (4 to 5 hours). Among students who get more than the recommended amount, those with the highest scores get the most physical activity, followed by those with a low amount. It is reasonable to conclude that sleep has some impact on exam scores, with those getting the recommended amount scoring higher than those who get more or less, although the gaps between groups are small (roughly 1.3 points at most). It cannot be concluded that physical activity has a definitive relation to exam scores, as no consistent relationship was observed.

In [38]:
# The average score of students who get 4 hours of sleep, excluding the outlier of 82
# (the group with 0 hours of physical activity)
belowNoOutlier = wellnessGrp.loc[4].drop(0).mean()
print(belowNoOutlier.round(2))

Average_Score    69.67
dtype: float64
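
The averages printed for each sleep group are means of group means, so a physical-activity group with very few students counts as much as a large one. The sketch below illustrates that effect with made-up numbers; the rows in `toy` are hypothetical, and the output above does not show how many students fall in each notebook group, so checking the group sizes (for example with `"size"` in `agg`) would be needed to confirm whether this applies to the 82.0 value.

```python
import pandas as pd

# Hypothetical students who all sleep the same number of hours
toy = pd.DataFrame({
    "Physical_Activity": [0, 1, 1, 1, 2, 2, 2, 2],
    "Average_Score": [82, 66, 68, 64, 70, 72, 69, 69],
})

# Mean score and number of students per physical-activity level
groups = toy.groupby("Physical_Activity")["Average_Score"].agg(["mean", "size"])

# Unweighted: every group counts equally, regardless of size
mean_of_group_means = groups["mean"].mean()

# Weighted by group size: equivalent to averaging over students directly
weighted_mean = (groups["mean"] * groups["size"]).sum() / groups["size"].sum()
student_level_mean = toy["Average_Score"].mean()

print(groups)
print(round(mean_of_group_means, 2))  # 72.67
print(weighted_mean)                  # 70.0
print(student_level_mean)             # 70.0
```

In this toy example a single student scoring 82 raises the unweighted figure by more than two points, while the student-level mean is unaffected by how the groups are formed.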


#### Question 3

In [39]:
# Filter df to only include the columns Motivation_Level and Learning_Disabilities
# Store it in a temporary dataframe
tempDf = df[["Motivation_Level", "Learning_Disabilities"]]

# Use the temporary data frame and filter it based on whether Learning_Disabilities is True or False
# Store the results into the corresponding dataframe
hasDisability = tempDf[tempDf["Learning_Disabilities"]]
noDisability = tempDf[tempDf["Learning_Disabilities"] == False]

# Check the contents of each dataframe
display(hasDisability.head())
display(noDisability.head())

**Index** | **Motivation_Level** | **Learning_Disabilities**
:--|:--|:--
26 | Medium | True
30 | Medium | True
41 | Low | True
43 | Medium | True
54 | Low | True


**Index** | **Motivation_Level** | **Learning_Disabilities**
:--|:--|:--
0 | Low | False
1 | Low | False
2 | Medium | False
3 | Medium | False
4 | Medium | False


With this data, finding the motivation level each student reports is the best way to investigate question 3. Because more students do not have a learning disability, comparing the percentage of students at each motivation level, rather than raw counts, ensures the two groups can be compared fairly.

In [40]:
# Calculate the percentage of students who reported either Low, Medium, or High motivation for each dataframe
# Round the percentages for readability
hasDisabilityMotivation = (hasDisability["Motivation_Level"].value_counts() / hasDisability["Motivation_Level"].count()) * 100
hasDisabilityMotivation = hasDisabilityMotivation.round(2)

noDisabilityMotivation = (noDisability["Motivation_Level"].value_counts() / noDisability["Motivation_Level"].count()) * 100
noDisabilityMotivation = noDisabilityMotivation.round(2)

# Display the results as a dataframe, sorted in descending order by percentage (held in the "count" column)
print("Motivation count for students with disabilities\n")
display(hasDisabilityMotivation.to_frame().sort_values(by = "count", ascending = False))

print("Motivation count for students without disabilities\n")
display(noDisabilityMotivation.to_frame().sort_values(by = "count", ascending = False))

Motivation count for students with disabilities



**Motivation_Level** | **count**
:--|:--
Medium | 47.91
Low | 30.79
High | 21.29


Motivation count for students without disabilities



**Motivation_Level** | **count**
:--|:--
Medium | 51.05
Low | 29.14
High | 19.81


The data above shows that students with and without learning disabilities report very similar motivation levels. Students with disabilities report slightly more low motivation (30.79% versus 29.14%) and slightly more high motivation (21.29% versus 19.81%), while students without disabilities more often report a medium amount of motivation (51.05% versus 47.91%). This could indicate that the presence of a learning disability makes those students have more extreme feelings regarding doing well in school. They could be discouraged and feel less motivated, or be encouraged and feel more motivated to do well despite their learning disability. Those without disabilities may not feel less or more motivated because that extra factor is not affecting their learning. Since the differences are only a few percentage points, it is reasonable to conclude that the presence of a learning disability has at most a slight impact on a student's motivation level, with the data indicating it could cause students to feel more extreme about doing well than those without one.
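
For reference, the same kind of percentage comparison can be produced in one step with `pd.crosstab` and `normalize="index"`, which turns each row into shares that sum to 1. The sketch below uses made-up rows and is only an alternative formulation, not the calculation used in the notebook.

```python
import pandas as pd

# Hypothetical students; the real analysis would use df
toy = pd.DataFrame({
    "Learning_Disabilities": [True, True, True, True,
                              False, False, False, False, False, False],
    "Motivation_Level": ["Medium", "Medium", "Low", "High",
                         "Medium", "Medium", "Medium", "Low", "Low", "High"],
})

# Rows: disability status; columns: motivation level; values: percent of each row
pct = (
    pd.crosstab(
        toy["Learning_Disabilities"],
        toy["Motivation_Level"],
        normalize="index",
    )
    * 100
).round(2)
print(pct)
```

Placing both groups in one table puts the percentages for each motivation level side by side, which makes small differences easier to see.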