# **WA2 — Studying Studying: Is There A Recipe For High Marks in School?**

## **0.a Imports for Required Libraries**

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy

## **0.b Dataset Onboarding & Basic Renaming for Better Readability + Referenceability**

In [47]:
df1 = pd.read_csv("SAP-4000.csv")
df1 = df1.rename(columns={"Gender":"Sex", 
                          "HoursStudied/Week":"HoursPerWeekStudy", 
                          "Attendance(%)":"Attendance", 
                          "Exam_Score":"Score", 
                          "Parent Education":"ParentEdu"})
# Drop every row that has a missing value in any column, then display the result
df1 = df1.dropna()
df1

Unnamed: 0,Sex,HoursPerWeekStudy,Tutoring,Region,Attendance,ParentEdu,Score
0,Male,5.5,No,Urban,72.7,Tertiary,43.5
1,Female,6.8,No,Urban,62.0,Primary,51.7
2,Female,9.7,No,Rural,95.0,Secondary,70.1
5,Female,7.9,No,Urban,73.7,Tertiary,58.8
6,Female,7.6,No,Urban,79.5,Secondary,64.8
...,...,...,...,...,...,...,...
3995,Male,11.3,Yes,Urban,79.5,Secondary,93.5
3996,Male,3.7,Yes,Urban,50.7,Tertiary,53.8
3997,Female,0.0,No,Rural,72.7,Tertiary,25.4
3998,Male,4.0,No,Urban,62.2,Tertiary,40.3


## **1. Introduction, About the Dataset, Intended Audience**

Academic examinations are used by almost all educational programs across the world and come in many configurations. They range from a quick daily quiz on the previous lecture's material to a cumulative, marathon Scantron exam covering a semester's worth of content. Almost every test a student takes in their lifetime will differ from all the ones before it.

However, for almost all tests, there is a general strategy for success that seems to be replicable for nearly everyone. If one were to ask other students how they succeed on tests, the advice could be summed up in these three points:

1) Learn new material. 

2) Review/relearn past material as you progress into newer material.

3) Practice answering questions that emulate those from the test. 

This makes sense: if we took a panel of last year's top scorers on the AP Calculus BC exam, we would expect that these students knew the material well, reviewed older material, and practiced multiple-choice (MCQ) and free-response (FRQ) questions from past years.

The Evaluación de Bachillerato para el Acceso a la Universidad (EBAU), better known as the Selectividad, is Spain's college-entrance exam. It also verifies completion of secondary school and is taken by students after finishing the equivalent of high school. A mark on the Selectividad can make or break a student's future in higher education, so doing well is vital.

However, the general strategy outlined above can easily be disrupted. Here are some examples:

1) John is a high school senior in Calculus. He transferred from a country that did not require any math above the equivalent of Algebra I, so Algebra I is the highest level of math he knows. John cannot even complete step 1) because he lacks the mathematical foundation needed to learn the new material.

2) Ben is a classmate of John and also a senior, but he has attended school in the area since 1st grade. He has all the Precalculus prerequisites to learn new material, but the course materials and resources are online and Ben does not have a computer at home. His ability to review on his own time, as step 2) requires, is limited to when he is at school, the library, or a coffee shop.

3) Peter is also a classmate of John and Ben, a senior who has attended school in the area since 1st grade. While Peter is the most prepared of the three, he is unable to practice test questions, as step 3) requires, because this year's Calculus exam has been majorly revamped. The questions are now different enough that those from past tests are inadequate practice.

As such, it becomes apparent that being able to predict test scores from demographics could allow intervention during the learning process to improve scores, rather than finding out afterward that students were struggling and receiving disappointing scores.

This dataset was obtained from Kaggle and contains 4,000 anonymized datapoints with fields describing each student's demographic information (Gender, HoursStudied/Week, Tutoring, Urban/Rural, Attendance Rate, Highest Level of Parent Education) as well as the student's mark on the Selectividad. While the dataset is rich, about 422 datapoints were dropped because they were missing data in some fields. That is a considerable portion of the original data, so as a disclaimer: the trimming may have affected the findings.
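One way to gauge how much the dropped rows matter is to count missing values per column and compare the scores of dropped and kept rows before calling `dropna()`. The sketch below is only an illustration: the `raw` frame is a made-up stand-in for the CSV, and its values are not taken from `SAP-4000.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw CSV (hypothetical values, NOT from SAP-4000.csv).
raw = pd.DataFrame({
    "Sex": ["Male", "Female", None, "Female"],
    "HoursPerWeekStudy": [5.5, np.nan, 9.7, 7.9],
    "Score": [43.5, 51.7, 70.1, 58.8],
})

# How many values are missing in each column?
missing_per_column = raw.isna().sum()

# Split rows into "has at least one missing field" and "complete".
incomplete_mask = raw.isna().any(axis=1)
cleaned = raw.dropna()  # same call the notebook uses

rows_dropped = int(incomplete_mask.sum())
share_dropped = rows_dropped / len(raw)

# If dropped rows score very differently from kept rows, the trimming may bias results.
mean_score_dropped = raw.loc[incomplete_mask, "Score"].mean()
mean_score_kept = cleaned["Score"].mean()

print(missing_per_column)
print(f"dropped {rows_dropped} of {len(raw)} rows ({share_dropped:.1%})")
print(f"mean Score, dropped rows: {mean_score_dropped:.2f}; kept rows: {mean_score_kept:.2f}")
```

On the real data, the same steps would run on the frame returned by `pd.read_csv` before the `dropna()` call.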

This project, a literate programming document, along with its methods and findings, is targeted toward parents, educators, and governments. It should serve to inform and guide future actions, policies, and regulations in the common interest of a smarter and more capable next generation of minds.

## **2. Placeholder**

## **3. EDA**

This is a fun section because key insights about the data, which can answer a lot of questions, can be revealed quickly and easily.

First, I simply want to look at the distribution of Rural vs. Urban students. The counts below show that Urban students outnumber Rural ones. There are multiple possible explanations for this:

1) More adults live in cities than in rural areas. Assuming the average number of children per couple is the same in both regions, there would be more Urban students, who then outnumber their Rural counterparts in this dataset.

2) Urban students are more likely to reach the point of taking the Selectividad than Rural ones. This explanation concerns the percentage of Urban students who make it versus the percentage of Rural students who make it, rather than raw counts.

3) Urban couples have more children on average than Rural couples.

There are likely more reasons, and the disparity between the two is likely a combination of all of the relevant factors. 

In [19]:
df1.groupby("Region").agg("count").reset_index()[["Region", "Score"]]

Unnamed: 0,Region,Score
0,Rural,1577
1,Urban,2423
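The raw counts above are easier to compare as shares of the total. A minimal sketch, using only the two counts from the table (on the full frame, `df1.Region.value_counts(normalize=True)` should give the same proportions directly):

```python
import pandas as pd

# Counts copied from the table above.
region_counts = pd.Series({"Rural": 1577, "Urban": 2423}, name="students")

# Convert counts to shares of the total.
region_share = region_counts / region_counts.sum()

print(region_share.round(3))
```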


Next, we'll run some summary statistics on Score, broken down by Region, ParentEdu, and Tutoring.

In [69]:
# Score summary for Urban students WITHOUT tutoring, by parent education (Primary/Secondary/Tertiary)

urbanPrimaryNotutScore = df1[(df1.Region=="Urban") & (df1.ParentEdu=="Primary") & (df1.Tutoring=="No")].Score.describe().to_dict()
urbanSecondaryNotutScore = df1[(df1.Region=="Urban") & (df1.ParentEdu=="Secondary") & (df1.Tutoring=="No")].Score.describe().to_dict()
urbanTertiaryNotutScore = df1[(df1.Region=="Urban") & (df1.ParentEdu=="Tertiary") & (df1.Tutoring=="No")].Score.describe().to_dict()

print(f"urbanPrimaryNotutScore {urbanPrimaryNotutScore} \nurbanSecondaryNotutScore {urbanSecondaryNotutScore} \nurbanTertiaryNotutScore {urbanTertiaryNotutScore}")

urbanPrimaryNotutScore {'count': 332.0, 'mean': 66.70481927710843, 'std': 15.27203200620862, 'min': 20.0, '25%': 55.7, '50%': 67.35, '75%': 78.45, 'max': 100.0} 
urbanSecondaryNotutScore {'count': 681.0, 'mean': 68.31424375917769, 'std': 16.00829802447762, 'min': 21.6, '25%': 57.2, '50%': 68.6, '75%': 80.4, 'max': 100.0} 
urbanTertiaryNotutScore {'count': 499.0, 'mean': 70.30981963927856, 'std': 15.893585751481439, 'min': 27.9, '25%': 59.4, '50%': 70.8, '75%': 82.05000000000001, 'max': 100.0}
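The per-group filters above could also be expressed as a single `groupby`, which returns every Region, ParentEdu, and Tutoring combination in one table. This is a sketch of an alternative, not the method used in this notebook, and the `toy` frame contains made-up rows that only borrow the notebook's column names.

```python
import pandas as pd

# Toy rows using the notebook's column names (values are made up for illustration).
toy = pd.DataFrame({
    "Region":    ["Urban", "Urban", "Urban", "Urban", "Rural", "Rural"],
    "ParentEdu": ["Primary", "Primary", "Tertiary", "Tertiary", "Primary", "Primary"],
    "Tutoring":  ["No", "No", "Yes", "Yes", "No", "Yes"],
    "Score":     [60.0, 70.0, 80.0, 90.0, 50.0, 65.0],
})

# One row per (Region, ParentEdu, Tutoring) combination that appears in the data.
summary = (
    toy.groupby(["Region", "ParentEdu", "Tutoring"])["Score"]
       .agg(["count", "mean", "median"])
)

print(summary)
```

Replacing `toy` with `df1` would produce the same kind of table for the real data, including the Rural groups.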


In [72]:
# Score summary for Urban students WITH tutoring, by parent education (Primary/Secondary/Tertiary)

urbanPrimaryYestutScore = df1[(df1.Region=="Urban") & (df1.ParentEdu=="Primary") & (df1.Tutoring=="Yes")].Score.describe().to_dict()
urbanSecondaryYestutScore = df1[(df1.Region=="Urban") & (df1.ParentEdu=="Secondary") & (df1.Tutoring=="Yes")].Score.describe().to_dict()
urbanTertiaryYestutScore = df1[(df1.Region=="Urban") & (df1.ParentEdu=="Tertiary") & (df1.Tutoring=="Yes")].Score.describe().to_dict()

print(f"urbanPrimaryYestutScore {urbanPrimaryYestutScore} \nurbanSecondaryYestutScore {urbanSecondaryYestutScore} \nurbanTertiaryYestutScore {urbanTertiaryYestutScore}")

urbanPrimaryYestutScore {'count': 142.0, 'mean': 79.46478873239437, 'std': 15.162796001370028, 'min': 41.1, '25%': 70.175, '50%': 80.55000000000001, '75%': 90.75, 'max': 100.0} 
urbanSecondaryYestutScore {'count': 306.0, 'mean': 82.11307189542484, 'std': 14.229434710200659, 'min': 36.1, '25%': 72.42500000000001, '50%': 82.55000000000001, '75%': 94.875, 'max': 100.0} 
urbanTertiaryYestutScore {'count': 214.0, 'mean': 83.68691588785046, 'std': 14.227428935178978, 'min': 37.3, '25%': 74.7, '50%': 84.55, '75%': 97.725, 'max': 100.0}
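Comparing the two sets of outputs, Urban students with tutoring have higher mean scores at every level of parent education. The sketch below quantifies that gap using only the count, mean, and standard deviation printed above (rounded to two decimals). It uses Welch's t statistic as one reasonable choice for comparing two groups with unequal sizes; the helper `welch_t` is written here for illustration and is not part of any library.

```python
import math

# (count, mean, std) copied from the describe() outputs above, rounded to 2 decimals.
NO_TUTOR = {
    "Primary": (332, 66.70, 15.27),
    "Secondary": (681, 68.31, 16.01),
    "Tertiary": (499, 70.31, 15.89),
}
TUTOR = {
    "Primary": (142, 79.46, 15.16),
    "Secondary": (306, 82.11, 14.23),
    "Tertiary": (214, 83.69, 14.23),
}


def welch_t(n_a, mean_a, std_a, n_b, mean_b, std_b):
    """Welch's t statistic from summary statistics (unequal variances allowed)."""
    standard_error = math.sqrt(std_a**2 / n_a + std_b**2 / n_b)
    return (mean_a - mean_b) / standard_error


gaps = {}
for level, (n0, m0, s0) in NO_TUTOR.items():
    n1, m1, s1 = TUTOR[level]
    # Store (difference in means, t statistic) for tutored minus non-tutored.
    gaps[level] = (m1 - m0, welch_t(n1, m1, s1, n0, m0, s0))
    print(f"{level}: gap = {gaps[level][0]:.2f} points, Welch t = {gaps[level][1]:.2f}")
```

The gap is roughly 13 to 14 points at each level of parent education. This is a descriptive comparison only: it does not show that tutoring causes the higher scores, since tutored and non-tutored students may differ in other ways.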


## **4. Modeling**

Coming soon.

## **5. Findings & Conclusion**

## **Z. Bibliography**

**Mateo, J. (2025). Student Academic Performance 4000 [Dataset]. In *Kaggle*. Retrieved June 16, 2025, from https://www.kaggle.com/datasets/firedmosquito831/student-academic-performance-simulation-4000**

## **ZZ. Peer Feedback**

From Aditi Dixit

The chosen dataset of test scores in Spain and a variety of demographic information is interesting and I’m curious to see what you will find with further analysis. The code works so far, as long as the csv file is downloaded previously onto the computer of whoever is running the program. 
One area of improvement might be being clearer about who your audience is; for example, whether it is parents trying to figure out how they can help their kids' test scores, or teachers, or even students. 
Another area could be working on what the audience should do with the information. For example, if you find that students from a certain area have better scores, are you going to encourage people to move there or to look into the school and see what they may be doing differently? 
I think your planned outline structure is good, and covers everything you will need from the Intro, to EDA, to methodology, etc. I would also recommend having the argument clearly stated before the data analysis as well, so it can be in the Introduction section, and then this is proven by the data through your analysis.