### Import pandas library

This line imports the pandas library so we can use it in our project.  

In [39]:
import pandas as pd



Here, we open the cleaned data files for Math and Portuguese.  
We use `pd.read_csv()` to read the files and store the data in `math_df` and `por_df`.  
Then we print the shape of each dataset.  
The shape shows how many rows and columns each file has, so we know how big the data is.

In [69]:

math_df = pd.read_csv('./data/student_math_clean_new.csv')
por_df = pd.read_csv('./data/student_portuguese_clean_new.csv')

print("Math data size:", math_df.shape)
print("Portuguese data size:", por_df.shape)


Math data size: (395, 34)
Portuguese data size: (649, 34)
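
As a small illustration of what `.shape` returns, the sketch below builds a tiny made-up table and unpacks the result into rows and columns. The `toy` table and its values are assumptions for the example only; they are not part of the real student data.

```python
import pandas as pd

# Toy table for illustration only; not the real student data
toy = pd.DataFrame({
    "student_id": [1, 2, 3],
    "age": [18, 17, 15],
})

# .shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Reading the real output the same way, the Math file has 395 rows and 34 columns, and the Portuguese file has 649 rows and 34 columns.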


I made a small function called `preview_dataset()` to help me look at my data. It prints the name of the dataset, its shape (rows and columns), and the first two rows. This lets me see what is inside each dataset without writing the same code many times.

In [62]:
def preview_dataset(df, name):
    print(f"Preview of {name} dataset:")
    print(df.shape)
    print(df.head(2))
    print("-" * 40)

# Then call it like this:
preview_dataset(math_df, "Math")
preview_dataset(por_df, "Portuguese")


Preview of Math dataset:
(395, 34)
   student_id school sex  age address_type     family_size    parent_status  \
0           1     GP   F   18        Urban  Greater than 3            Apart   
1           2     GP   F   17        Urban  Greater than 3  Living together   

                mother_education               father_education mother_job  \
0               higher education               higher education    at_home   
1  primary education (4th grade)  primary education (4th grade)    at_home   

   ... family_relationship free_time social weekday_alcohol weekend_alcohol  \
0  ...                   4         3      4               1               1   
1  ...                   5         3      3               1               1   

   health absences grade_1 grade_2 final_grade  
0       3        6       5       6           6  
1       3        4       5       5           6  

[2 rows x 34 columns]
----------------------------------------
Preview of Portuguese dataset:
(649, 34)
  

The `check_data()` function quickly shows how many rows and columns the data has, counts the missing values in each column, and reports what type of data is in each column.
I use it to understand the data better before working with it.
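
The cell that defines `check_data()` is not shown in this notebook, so the version below is only a sketch of what it might look like. The header, "Shape:", and "Missing values" lines follow the printed output; the "Data types:" label and the `toy` table are assumptions.

```python
import pandas as pd

def check_data(df, name):
    """Print the shape, missing-value counts, and column types of df."""
    print(f"--- Data info for {name} ---")
    print("Shape:", df.shape)
    print("Missing values in each column:")
    print(df.isnull().sum())   # number of missing cells per column
    print("Data types:")       # label is an assumption, not from the notebook
    print(df.dtypes)

# Toy table with one missing age, for illustration only
toy = pd.DataFrame({"student_id": [1, 2], "age": [18, None]})
check_data(toy, "Toy Dataset")
```

On the toy table, the `age` column reports one missing value, while every column in the real Math data reports zero.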

In [63]:
check_data(math_df, "Math Dataset")
check_data(por_df, "Portuguese Dataset")


--- Data info for Math Dataset ---
Shape: (395, 34)
Missing values in each column:
student_id               0
school                   0
sex                      0
age                      0
address_type             0
family_size              0
parent_status            0
mother_education         0
father_education         0
mother_job               0
father_job               0
school_choice_reason     0
guardian                 0
travel_time              0
study_time               0
class_failures           0
school_support           0
family_support           0
extra_paid_classes       0
activities               0
nursery_school           0
higher_ed                0
internet_access          0
romantic_relationship    0
family_relationship      0
free_time                0
social                   0
weekday_alcohol          0
weekend_alcohol          0
health                   0
absences                 0
grade_1                  0
grade_2                  0
final_grade              0

In [46]:
print(math_df.columns)



Index(['student_id', 'school', 'sex', 'age', 'address_type', 'family_size',
       'parent_status', 'mother_education', 'father_education', 'mother_job',
       'father_job', 'school_choice_reason', 'guardian', 'travel_time',
       'study_time', 'class_failures', 'school_support', 'family_support',
       'extra_paid_classes', 'activities', 'nursery_school', 'higher_ed',
       'internet_access', 'romantic_relationship', 'family_relationship',
       'free_time', 'social', 'weekday_alcohol', 'weekend_alcohol', 'health',
       'absences', 'grade_1', 'grade_2', 'final_grade'],
      dtype='object')


In [58]:
# Select student info columns 
student_info_math = math_df[['student_id', 'school', 'sex', 'age', 'address_type', 'family_size', 'parent_status']]

# Select grades columns
grades_math = math_df[['student_id', 'grade_1', 'grade_2', 'final_grade']]


In [59]:
print(por_df.columns)


Index(['student_id', 'school', 'sex', 'age', 'address_type', 'family_size',
       'parent_status', 'mother_education', 'father_education', 'mother_job',
       'father_job', 'school_choice_reason', 'guardian', 'travel_time',
       'study_time', 'class_failures', 'school_support', 'family_support',
       'extra_paid_classes', 'activities', 'nursery_school', 'higher_ed',
       'internet_access', 'romantic_relationship', 'family_relationship',
       'free_time', 'social', 'weekday_alcohol', 'weekend_alcohol', 'health',
       'absences', 'grade_1', 'grade_2', 'final_grade'],
      dtype='object')


Here I remove the grade columns, so we keep only the other information about each student, such as school, age, and address (this is for the student table).
After that, I keep only the grades and the student ID (this is for the grades table).

In [65]:
# Student info columns 
student_info_por = por_df.drop(columns=['grade_1', 'grade_2', 'final_grade'])

# Grades columns
grades_por = por_df[['student_id', 'grade_1', 'grade_2', 'final_grade']]
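
To make the difference between the two approaches concrete, here is a minimal sketch on a made-up table. The `toy`, `info`, and `grades` names and their values are assumptions for illustration. `drop(columns=...)` removes the listed columns and keeps everything else, while selecting with a list keeps only the listed columns.

```python
import pandas as pd

# Toy table for illustration only
toy = pd.DataFrame({
    "student_id": [1, 2],
    "school": ["GP", "GP"],
    "grade_1": [5, 5],
    "final_grade": [6, 6],
})

# drop(columns=...) returns a new table without the listed columns
info = toy.drop(columns=["grade_1", "final_grade"])

# Selecting with a list keeps only the listed columns
grades = toy[["student_id", "grade_1", "final_grade"]]

print(list(info.columns))
print(list(grades.columns))
```

Neither operation changes `toy` itself; both return a new table.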



Here I separate the columns into two tables. One table will have the student information (like age, school, and address_type), and the other one will have the grades. I also add a 'subject' column to say if the data is from Math or Portuguese.


I picked a few columns that describe each student. These will go into my students table. I did this for both math and Portuguese datasets.

In [71]:
# Student info (no grades)
student_columns = ['student_id', 'school', 'sex', 'age', 'address_type']

# Grade info (keep only grades + student_id)
grade_columns = ['student_id', 'grade_1', 'grade_2', 'final_grade']


In this step, I made a new table for grades. I picked the grade columns and added a subject column to know if it’s math or Portuguese.



In [70]:

grades_math = math_df[grade_columns].copy()
grades_math['subject'] = 'math'

grades_por = por_df[grade_columns].copy()
grades_por['subject'] = 'portuguese'
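
The `.copy()` call matters here. A minimal sketch on made-up data (the `toy` and `toy_grades` names are assumptions) shows that adding a column to the copy leaves the original table untouched, and that assigning one string fills every row with that value.

```python
import pandas as pd

# Toy table for illustration only
toy = pd.DataFrame({"student_id": [1, 2], "final_grade": [6, 6]})

# .copy() gives an independent table, so adding a column
# does not affect the table it was selected from
toy_grades = toy[["student_id", "final_grade"]].copy()
toy_grades["subject"] = "math"  # the same value is repeated on every row

print(toy_grades)
print("subject" in toy.columns)
```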


In [67]:
students_math = math_df[student_columns].copy()
students_por = por_df[student_columns].copy()


I combined the Math and Portuguese versions of the student table and of the grades table. For the students table, I used `drop_duplicates()` so that each identical student row appears only once.

In [68]:
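# Combine student info and grades from both subjects.
# pd.concat() needs a list of DataFrames, and drop_duplicates() must be called with ().
students = pd.concat([students_math, students_por]).drop_duplicates()
grades = pd.concat([grades_math, grades_por])


Note: the first attempt passed `students_math + students_por` to `pd.concat()`. The `+` adds the two tables element-wise and produces a single DataFrame rather than a list, which raised:

TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"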
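
Since `pd.concat()` expects a list of tables, a minimal sketch on made-up data shows the intended pattern. The `students_a`, `students_b`, `stacked`, and `unique_students` names and their values are assumptions for illustration. Whether the same `student_id` refers to the same student in both files is not established in this notebook, so the sketch only removes rows that are identical in every column.

```python
import pandas as pd

# Toy tables for illustration only; one row (2, "GP") appears in both
students_a = pd.DataFrame({"student_id": [1, 2], "school": ["GP", "GP"]})
students_b = pd.DataFrame({"student_id": [2, 3], "school": ["GP", "MS"]})

# pd.concat takes a list of tables and stacks them row-wise
stacked = pd.concat([students_a, students_b], ignore_index=True)

# drop_duplicates() removes rows that match an earlier row in every column
unique_students = stacked.drop_duplicates().reset_index(drop=True)

print(len(stacked), len(unique_students))
```

Passing `ignore_index=True` renumbers the rows of the stacked table, which avoids repeated index labels coming from the two source tables.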