# 2. Data Understanding



## 2.1 Collect initial data

For this question we used only the 2022 Student questionnaire data, downloaded from the following link: https://webfs.oecd.org/pisa2022/index.html

The dataset was originally in a .sas7bdat format and was converted to a .csv file.

In [1]:
"""
import pandas as pd

data = pd.read_sas(
    "../../../databases/2022/CY08MSP_STU_QQQ.sas7bdat", format="sas7bdat"
)
data.to_csv("../../../databases/2022/student2022.csv", index=False)

"""


**Note:** These files are not included in the project folder, so they must be downloaded manually and placed in their respective folders.

In [None]:
import pandas as pd

student = pd.read_csv("../../../databases/2022/STU_QQQ_SAS/student2022.csv")

For testing purposes, we included a small sample of the data (1% of the original data) in the project folder:

In [None]:
# import pandas as pd

# student = pd.read_csv("../../../databases/student2022_sample.csv")
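
The notebook does not show how the 1% sample was drawn. A minimal sketch, assuming a simple random sample with a fixed seed; the toy frame, the seed value and the commented-out output path are illustrative assumptions, not necessarily the procedure that was actually used:

```python
import pandas as pd

# Toy stand-in for the full student table (the real file is not bundled).
full = pd.DataFrame({"CNT": ["PRT", "ESP"] * 500, "ST001D01T": [9, 10] * 500})

# Draw 1% of the rows; random_state makes the draw reproducible.
sample = full.sample(frac=0.01, random_state=42)

print(len(full), len(sample))
# sample.to_csv("../../../databases/student2022_sample.csv", index=False)
```

A purely random draw may leave small countries with very few rows, so a sample stratified by "CNT" could be preferable for some tests.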

## 2.2 Describe data

The original dataset has 1278 features.

In [None]:
student.head(5)

The dataset is composed of 1260 numeric columns and only 18 categorical columns.

In [None]:
import pandas as pd
from tabulate import tabulate

categorical_columns = student.select_dtypes(include=["object", "category"]).columns
numeric_columns = student.select_dtypes(include=["int64", "float64"]).columns

column_types_df = pd.DataFrame(
    {
        "Column type": ["Numeric", "Categorical"],
        "Number of columns": [len(numeric_columns), len(categorical_columns) ],
        "Column names": [
            ", ".join(numeric_columns),
            ", ".join(categorical_columns),
        ],
    }
)

print(
    tabulate(
        column_types_df,
        headers="keys",
        tablefmt="pretty",
        showindex=False,
        colalign=("left", "left", "left"),
    )
)

In [None]:
student.describe()

We can observe that there are a total of 613744 students, of whom about 10% are repeating.

In [None]:
print(f"Total number of students: {len(student)}\n")

not_repeating_students = student[student["REPEAT"] == 0]
print(f"Total number of non repeating students: {len(not_repeating_students)}")
student_grades = not_repeating_students["ST001D01T"].value_counts().reset_index()
student_grades.columns = ["Grade", "Count"]
print(student_grades)
print("\n")
repeating_students = student[student["REPEAT"] == 1]
print(f"Total number of repeating students: {len(repeating_students)}")
repeating_students_grades = repeating_students["ST001D01T"].value_counts().reset_index()
repeating_students_grades.columns = ["Grade", "Repeating"]
print(repeating_students_grades)
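
The share of repeating students can also be read directly from the "REPEAT" column. The sketch below uses a toy column (the values are made up) and keeps missing answers visible, since rows where "REPEAT" is missing fall into neither of the two filtered subsets above:

```python
import pandas as pd

# Toy data: REPEAT is 1 for repeating students, 0 otherwise, NaN when missing.
toy = pd.DataFrame({"REPEAT": [0, 0, 0, 1, 0, 0, 0, 0, 0, None]})

# normalize=True returns proportions; dropna=False keeps missing answers visible.
repeat_share = toy["REPEAT"].value_counts(normalize=True, dropna=False).mul(100)
print(repeat_share.round(1))
```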

From the previous tables we can conclude that the students are spread across different grades, from 7th to 13th. We can separate them by filtering on the corresponding code, "ST001D01T".

In [None]:
def filter_by_grade(dataframe, grade):
    return dataframe[dataframe["ST001D01T"] == grade]

In [None]:
grade_7_repeating = filter_by_grade(repeating_students, 7)
grade_8_repeating = filter_by_grade(repeating_students, 8)
grade_9_repeating = filter_by_grade(repeating_students, 9)
grade_10_repeating = filter_by_grade(repeating_students, 10)
grade_11_repeating = filter_by_grade(repeating_students, 11)
grade_12_repeating = filter_by_grade(repeating_students, 12)
grade_13_repeating = filter_by_grade(repeating_students, 13)

grade_7_not_repeating = filter_by_grade(not_repeating_students, 7)
grade_8_not_repeating = filter_by_grade(not_repeating_students, 8)
grade_9_not_repeating = filter_by_grade(not_repeating_students, 9)
grade_10_not_repeating = filter_by_grade(not_repeating_students, 10)
grade_11_not_repeating = filter_by_grade(not_repeating_students, 11)
grade_12_not_repeating = filter_by_grade(not_repeating_students, 12)
grade_13_not_repeating = filter_by_grade(not_repeating_students, 13)
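
As an alternative to fourteen separate variables, the same counts can be held in a single table. This is only a sketch on toy data; `grade_counts` and `grade_percent` are hypothetical names that do not appear elsewhere in the notebook:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "ST001D01T": [9, 9, 10, 10, 10, 11],
        "REPEAT": [1, 0, 0, 0, 1, 0],
    }
)

# Rows are grades, columns are REPEAT values, cells are student counts.
grade_counts = pd.crosstab(toy["ST001D01T"], toy["REPEAT"])
print(grade_counts)

# Column-wise percentages: the share of each grade within each group.
grade_percent = grade_counts.div(grade_counts.sum(axis=0), axis=1).mul(100)
print(grade_percent.round(1))
```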

## 2.3 Explore data

The target variable in this analysis is the mathematics score achieved by each student.

This score is calculated as the average of the values across all "Plausible Value" mathematics columns, which are represented in the dataset as PV1MATH to PV10MATH. Each of these features is one plausible value, that is, one of ten estimates of the student's performance. Averaging them provides a single, more stable measure of the student's mathematics score.

Note: The Math result has a scale from 0 to 1000. The typical range goes from 400 to 600; 700 is considered a high score and anything above 800 is considered a very high score.

In [None]:
math_columns = [f"PV{i}MATH" for i in range(1, 11)]
student["Avg Math Result"] = student[math_columns].mean(axis=1)
student = student.drop(columns=math_columns)
student["Avg Math Result"].describe()
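
To make the averaging step concrete, here is a small worked example with two hypothetical students. The numbers are invented, and the "PV spread" column is an extra illustration that is not part of the notebook:

```python
import pandas as pd

math_columns = [f"PV{i}MATH" for i in range(1, 11)]

# Two hypothetical students with ten plausible values each.
toy = pd.DataFrame(
    [[480 + i for i in range(10)], [600.0] * 10],
    columns=math_columns,
)

# Row-wise mean over the ten plausible values (NaN entries are skipped by default).
toy["Avg Math Result"] = toy[math_columns].mean(axis=1)

# Row-wise standard deviation: how much the ten estimates disagree per student.
toy["PV spread"] = toy[math_columns].std(axis=1)

print(toy[["Avg Math Result", "PV spread"]])
```

The spread is discarded once the plausible values are averaged and dropped, so it is worth inspecting beforehand if the uncertainty of the score matters.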

If we distribute the students by grade, we can see that the majority of repeating students are in the 9th grade and most of the non-repeating students are in the 10th grade, which makes sense because they are all 15 years old.

However, there are still some students in more advanced grades (both repeating and non-repeating) that deserve a closer look.

In [None]:
import matplotlib.pyplot as plt

grades = [7, 8, 9, 10, 11, 12, 13]

repeating_counts = [
    len(grade_7_repeating),
    len(grade_8_repeating),
    len(grade_9_repeating),
    len(grade_10_repeating),
    len(grade_11_repeating),
    len(grade_12_repeating),
    len(grade_13_repeating),
]

not_repeating_counts = [
    len(grade_7_not_repeating),
    len(grade_8_not_repeating),
    len(grade_9_not_repeating),
    len(grade_10_not_repeating),
    len(grade_11_not_repeating),
    len(grade_12_not_repeating),
    len(grade_13_not_repeating),
]

total_repeating = sum(repeating_counts)
total_not_repeating = sum(not_repeating_counts)

repeating_percent = [
    r / total_repeating * 100 if total_repeating > 0 else 0 for r in repeating_counts
]
not_repeating_percent = [
    nr / total_not_repeating * 100 if total_not_repeating > 0 else 0
    for nr in not_repeating_counts
]

fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

axes[0].bar(grades, repeating_percent)
axes[0].set_title("Distribution of Repeating Students by Grade")
axes[0].set_xlabel("Grade")
axes[0].set_ylabel("Percentage (%)")
axes[0].set_xticks(grades)
axes[0].set_ylim(0, max(max(repeating_percent), max(not_repeating_percent)) + 5)
axes[0].grid(axis="y", linestyle="--", alpha=0.6)

axes[1].bar(grades, not_repeating_percent, color="mediumseagreen")
axes[1].set_title("Distribution of Non-repeating Students by Grade")
axes[1].set_xlabel("Grade")
axes[1].set_xticks(grades)
axes[1].grid(axis="y", linestyle="--", alpha=0.6)

plt.tight_layout()
plt.show()

If we break down the more advanced grades (11th, 12th and 13th) by country, we can clearly observe that the majority of these students are English.

This can be explained by the way the English school system works. See: https://b28mathstutor.co.uk/how-the-english-school-system-works/#:~:text=Unlike%20in%20some%20countries%2C%20students,1%2C%20also%20known%20as%20Infants

This difference may lead to a disproportionate representation of these students, so we should treat them as an exception in the next phase.

In [None]:
import matplotlib.pyplot as plt

grade_11 = filter_by_grade(student, 11)
grade_12 = filter_by_grade(student, 12)
grade_13 = filter_by_grade(student, 13)

grade_11_counts = grade_11["CNT"].value_counts(normalize=True).mul(100).head(10)
grade_12_counts = grade_12["CNT"].value_counts(normalize=True).mul(100).head(10)
grade_13_counts = grade_13["CNT"].value_counts(normalize=True).mul(100).head(10)

fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharey=True)

axes[0].bar(grade_11_counts.index, grade_11_counts.values, color="cornflowerblue")
axes[0].set_title("Country Distribution - 11th Grade")
axes[0].set_xlabel("Country")
axes[0].set_ylabel("Students (%)")
axes[0].tick_params(axis="x", rotation=45)
axes[0].set_ylim(
    0, max(grade_11_counts.max(), grade_12_counts.max(), grade_13_counts.max()) + 5
)

axes[1].bar(grade_12_counts.index, grade_12_counts.values, color="mediumseagreen")
axes[1].set_title("Country Distribution - 12th Grade")
axes[1].set_xlabel("Country")
axes[1].tick_params(axis="x", rotation=45)

axes[2].bar(grade_13_counts.index, grade_13_counts.values, color="tomato")
axes[2].set_title("Country Distribution - 13th Grade")
axes[2].set_xlabel("Country")
axes[2].tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()
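
One possible way to carry this exception into the next phase is a flag column. The sketch below is an assumption about how that could look: `advanced_grade` is a hypothetical column name, and the country codes and grades are toy values. Using `isin` with the explicit grade list avoids accidentally flagging any special codes stored in "ST001D01T":

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "CNT": ["GBR", "GBR", "PRT", "ESP", "GBR"],
        "ST001D01T": [11, 12, 10, 9, 11],
    }
)

# Hypothetical flag marking students in the 11th, 12th or 13th grade.
toy["advanced_grade"] = toy["ST001D01T"].isin([11, 12, 13])

# Share of flagged students within each country, in percent.
advanced_share = toy.groupby("CNT")["advanced_grade"].mean().mul(100)
print(advanced_share)
```

This view complements the plots above: they show each country's share within a grade, while this shows the share of advanced-grade students within each country.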

It is also essential to examine the correlation between all numeric dataset features and the target variable (Avg Math Result). This helps identify which features are strongly associated with students' performance and can be considered for feature selection in the next step.

We computed the absolute value of this correlation and displayed the top 20 features in a table.

In [None]:
correl = (
    student.corr(numeric_only=True)["Avg Math Result"]
    .abs()
    .sort_values(ascending=False)
)

In [None]:
top_corr = correl.drop("Avg Math Result").head(20)

top_corr_df = top_corr.reset_index()
top_corr_df.columns = ["Feature", "Correlation with Math Result"]
display(top_corr_df)
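
Because the ranking above uses absolute values, the direction of each association is lost. A minimal sketch of ranking by strength while keeping the sign, on toy data (`feature_pos` and `feature_neg` are hypothetical columns):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Avg Math Result": [400.0, 450.0, 500.0, 550.0, 600.0],
        "feature_pos": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with the score
        "feature_neg": [5.0, 4.0, 3.0, 2.0, 1.5],  # falls as the score rises
    }
)

signed = toy.corr(numeric_only=True)["Avg Math Result"].drop("Avg Math Result")

# Order by absolute strength, but report the signed coefficient.
ranked = signed.reindex(signed.abs().sort_values(ascending=False).index)
print(ranked)
```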

The top features obtained are plausible values, which can be averaged, similar to the approach we used to calculate the Math Result.

Some of these features represent subscales of mathematics, and can be removed from the dataset in the next phase, as they are already captured by the aggregated Math Result score.
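
The averaging of other plausible values could be done generically by grouping columns on their name pattern. This is a sketch under the assumption that plausible-value columns follow the `PV<number><DOMAIN>` naming seen in PV1MATH to PV10MATH; the toy columns and values are illustrative. It covers only the averaging, not the removal of the mathematics subscales:

```python
import re

import pandas as pd

# Toy frame with two plausible values for two domains plus one other feature.
toy = pd.DataFrame(
    {
        "PV1READ": [500.0, 420.0],
        "PV2READ": [510.0, 430.0],
        "PV1SCIE": [480.0, 400.0],
        "PV2SCIE": [490.0, 410.0],
        "ESCS": [0.3, -0.5],
    }
)

pv_pattern = re.compile(r"^PV\d+([A-Z]+)$")

# Map each domain suffix (e.g. "READ") to its plausible-value columns.
domains = {}
for col in toy.columns:
    match = pv_pattern.match(col)
    if match:
        domains.setdefault(match.group(1), []).append(col)

# Replace each group of plausible values with its row-wise average.
for domain, cols in domains.items():
    toy[f"Avg {domain} Result"] = toy[cols].mean(axis=1)
    toy = toy.drop(columns=cols)

print(toy)
```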

### Categorical values analysis


Some of the categorical variables are country-specific. Since this study focuses on identifying global trends, we chose not to include these variables in our analysis.

Country-specific codes:

- ST250D06JA
- ST250D07JA
- ST251D08JA
- ST251D09JA
- ST330D10WA 
- PROGN

Additionally, the codes "CNT", "NatCen", "STRATUM" and "SUBNATION" are all related to the student's country/region. To reduce the dimensionality of the dataset, we decided to retain only "CNT", as it effectively aggregates the information from the others.

"COBN_S", "COBN_M" and "COBN_F" represent the country of birth of the student, mother and father, respectively. These were excluded to avoid increasing the dimensionality with features that are strongly correlated with "CNT".

"OCOD1", "OCOD2" and "OCOD3" represent the occupation of the student, mother and father. While potentially insightful, occupational data can be highly country-dependent due to cultural and economic differences. For this reason, we chose not to include them in the current analysis.

Finally, "VER_DAT" was removed, as it only contains the questionnaire date, which is not relevant to our study.
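
The exclusions described above can be collected into a single list and applied in the next phase. A minimal sketch: the column names come from this section, while the toy frame and the names `columns_to_drop` and `reduced` are illustrative assumptions:

```python
import pandas as pd

columns_to_drop = [
    # Country-specific codes
    "ST250D06JA", "ST250D07JA", "ST251D08JA", "ST251D09JA", "ST330D10WA", "PROGN",
    # Region codes already summarised by "CNT"
    "NatCen", "STRATUM", "SUBNATION",
    # Countries of birth
    "COBN_S", "COBN_M", "COBN_F",
    # Occupation codes
    "OCOD1", "OCOD2", "OCOD3",
    # Questionnaire date
    "VER_DAT",
]

# Toy frame containing only a few of the listed columns.
toy = pd.DataFrame(
    {"CNT": ["PRT"], "PROGN": ["x"], "VER_DAT": ["01JAN23"], "ESCS": [0.1]}
)

# errors="ignore" skips names that are absent from the frame.
reduced = toy.drop(columns=columns_to_drop, errors="ignore")
print(list(reduced.columns))
```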


"CNT" (Country) is the only remaining categorical variable, but it contains a large number of distinct values, and it needs to be grouped into fewer categories to avoid high dimensionality in future approaches such as One-Hot Encoding.

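The grouping strategy for "CNT" is not decided at this stage. One possible approach, shown only as an assumption, is to keep the most frequent countries and collapse the rest into a single category before encoding. The cut-off `top_n`, the column `CNT_grouped` and the toy data are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({"CNT": ["ESP", "ESP", "ESP", "PRT", "PRT", "ALB", "MLT"]})

top_n = 2  # hypothetical cut-off
top_countries = toy["CNT"].value_counts().nlargest(top_n).index

# Keep the most frequent countries and collapse the rest into "Other".
toy["CNT_grouped"] = toy["CNT"].where(toy["CNT"].isin(top_countries), "Other")

# One-hot encode the reduced set of categories.
encoded = pd.get_dummies(toy["CNT_grouped"], prefix="CNT")
print(encoded.columns.tolist())
```

Grouping by geographic or economic region would be another option, but it requires an external mapping that is not part of this dataset description.
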
## 2.4 Verify data quality

In this step we started by checking for missing values in the dataset.

We decided that variables with more than 70% missing data could lead to biased results, so they should be removed from the dataset in the next phase.

In [None]:
print("\n--- Missing Values ---")
missing = student.isnull().mean().sort_values(ascending=False)
print(missing[missing > 0.7])
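
The removal itself is left for the next phase. A minimal sketch of how the 70% rule could be applied, using a toy frame with made-up columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
        "partly_missing": [1.0, np.nan, 3.0, 4.0, 5.0],  # 20% missing
        "complete": [1, 2, 3, 4, 5],
    }
)

threshold = 0.7
missing_share = toy.isnull().mean()

# Keep only columns whose share of missing values does not exceed the threshold.
kept = toy.loc[:, missing_share <= threshold]
print(list(kept.columns))
```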

We also ran a sanity check for duplicated rows or columns and found none in this dataset.

In [None]:
print("\n--- Duplicated Rows ---")
duplicated_rows = student.duplicated().sum()
print(f"Duplicated rows: {duplicated_rows}")

print("\n--- Duplicated Columns ---")
duplicated_columns = student.T.duplicated().sum()
print(f"Duplicated columns: {duplicated_columns}")
