# Data Exploration of the PISA Report: Relationships Between Student Performance, Socioeconomic Status and Internet Access

## by Alfredo Yigal Núñez Varillas

___________________

### Brief introduction to the PISA report

A good education is one of the most important factors in allowing people to fully develop their skills, provide a better quality of life for themselves and their families, and contribute to the economic development of their community. Governments worldwide are well aware of this, which is why more and more states invest in measuring the educational level of their population, the results of their educational efforts and the ways in which those results can be improved. One of the reports that supports this goal is the PISA report.

PISA, short for Programme for International Student Assessment, is an OECD program that measures the performance of 15-year-old students in three competencies that matter for their development in the adult world: reading, mathematics and science. The assessment is conducted every three years and gives governments comparable data, so that they can better understand the performance of their students and improve their educational policies. The program started in 2000 with 32 participating countries. Since then, it has grown to involve 65 economies worldwide in the 2012 report.

The main objective of the PISA report is not to test students' knowledge, but to measure the aptitudes and abilities they need to develop well in the adult world. The study focuses on 15-year-old students because this is the age at which compulsory education typically ends or is close to ending.


### Introduction to the PISA dataset

In this project, we will explore the data collected for the 2012 PISA report. The database, which can be obtained [here](https://www.oecd.org/pisa/pisaproducts/pisa2012database-downloadabledata.htm), has information on 485,490 15-year-old students from 65 countries worldwide. One of the most remarkable aspects of the PISA report is the amount of information collected. It includes not only basic data such as country, age and score, but also rich and varied data such as the number of books in the student's home, the educational level of their parents and their access to the Internet.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [2]:
# Read pisa2012.csv
# pisa_df = pd.read_csv('pisa2012.csv', encoding='windows-1252')
# print(pisa_df.shape)
# pisa_df.sample(3)

The dataset `pisa2012.csv` is large: 485,490 rows by 636 columns. Because of its size, the work was split into two notebooks: one for basic data wrangling, called "PISA Data Wrangling", and another for the data exploration (the current notebook).
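As a rough illustration of why a separate wrangling step helps with a file this wide, the sketch below reads only a chosen subset of columns through the `usecols` parameter of `pd.read_csv`. The in-memory CSV and its column names (`country`, `escs`, `math_score`, `extra_1`, `extra_2`) are hypothetical stand-ins, not the real headers of `pisa2012.csv`, and this is not necessarily how the wrangling notebook does it.

```python
import io

import pandas as pd

# Toy stand-in for a wide CSV; the column names and values are hypothetical.
raw_csv = io.StringIO(
    "country,escs,math_score,extra_1,extra_2\n"
    "Peru,-1.2,368.1,a,b\n"
    "Spain,0.1,484.3,c,d\n"
)

# Only the listed columns are kept; the others are dropped while parsing.
wanted_cols = ["country", "escs", "math_score"]
subset_df = pd.read_csv(raw_csv, usecols=wanted_cols)

print(subset_df.shape)
```

Selecting columns at read time keeps memory use down, which matters more as the number of unused columns grows.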

In [3]:
# Import col types
col_dtype_series = pd.read_pickle("clean_pisa_cols.pkl")
col_dtype_dict = col_dtype_series.to_dict()

# read PISA data.
pisa_df = pd.read_csv('clean_pisa_data.csv', dtype = col_dtype_dict) 
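
The wrangling notebook is not shown here, so the following is only a guess at how a dtype file such as `clean_pisa_cols.pkl` could be produced and then reused as in the cell above. The toy values and the file names `toy_data.csv` and `toy_cols.pkl` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with made-up values; only the column names echo the real data.
toy_df = pd.DataFrame({
    "student_id": [1, 2, 3],
    "gender": pd.Categorical(["Male", "Female", "Female"]),
    "ESCS": [-0.71, -0.42, -0.35],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_data.csv")
    pkl_path = os.path.join(tmp_dir, "toy_cols.pkl")

    # Wrangling side: save the data as CSV and the dtypes as a pickled Series.
    toy_df.to_csv(csv_path, index=False)
    toy_df.dtypes.to_pickle(pkl_path)

    # Exploration side: restore the dtypes while reading the CSV back.
    dtype_dict = pd.read_pickle(pkl_path).to_dict()
    restored_df = pd.read_csv(csv_path, dtype=dtype_dict)

print(restored_df.dtypes)
```

A CSV file stores no type information, so without the dtype dictionary the `gender` column would come back as plain text rather than as a categorical.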

In [4]:
print(pisa_df.shape)
print(pisa_df.columns)
pisa_df.sample(3)

(485490, 14)
Index(['student_id', 'gender', 'country', 'ESCS', 'science_score',
       'reading_score', 'math_score', 'immig_status', 'late_to_school',
       'skip_whole_school_day', 'skip_class_within_school', 'internet_home',
       'global_region', 'comb_score'],
      dtype='object')


| | student_id | gender | country | ESCS | science_score | reading_score | math_score | immig_status | late_to_school | skip_whole_school_day | skip_class_within_school | internet_home | global_region | comb_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21656 | 5414 | Male | Argentina | -0.71 | 268.3141 | 236.5555 | 312.5175 | Native | Three or Four Times | One or Two Times | One or Two Times | NaN | Latin America | 817.3871 |
| 424418 | 1643 | Male | Singapore | -0.42 | 408.2805 | 416.111 | 456.1536 | Native | One or Two Times | Never | Never | Yes, and I use it | Asia | 1280.5451 |
| 352120 | 2612 | Female | Montenegro | -0.35 | 474.2074 | 491.8408 | 461.2946 | First-Generation | One or Two Times | Never | Never | NaN | Europe | 1427.3428 |


### What is the structure of your dataset?

The DataFrame has 485,490 rows and 14 columns. The data range from personal and family information about each student to the scores they obtained and their access to the internet at home.
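One structural detail can be checked directly from the three sample rows printed above: `comb_score` appears to be the sum of the three subject scores. The small sketch below uses only those three rows, so it suggests, but does not prove, that the relation holds for the whole dataset.

```python
import numpy as np
import pandas as pd

# Score columns of the three sample rows shown in the output above.
sample_df = pd.DataFrame({
    "science_score": [268.3141, 408.2805, 474.2074],
    "reading_score": [236.5555, 416.1110, 491.8408],
    "math_score": [312.5175, 456.1536, 461.2946],
    "comb_score": [817.3871, 1280.5451, 1427.3428],
})

score_cols = ["science_score", "reading_score", "math_score"]
recomputed = sample_df[score_cols].sum(axis=1)

# Compare with a tolerance, since the scores are floating-point values.
matches = np.isclose(recomputed, sample_df["comb_score"], atol=1e-3)
print(matches.all())
```

Running the same comparison on the full `pisa_df` would confirm whether the relation holds for every row.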


### What is/are the main feature(s) of interest in your dataset?

During the data wrangling stage, I went through the original dataset column by column and noted the following points of interest:

**Main points of interest:**

- Do students who arrive late, skip a class or miss a whole school day perform worse in the test?
- Do results differ between students who have access to the internet and those who do not?
- Is there a relationship between students' economic, social and cultural status (`ESCS`) and their performance in the test?
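
As a sketch of how these three questions might be approached with pandas, the example below builds a small synthetic frame and then compares group means and computes a correlation. The frame reuses the column names `late_to_school`, `internet_home`, `ESCS` and `comb_score`, but the values are random numbers, not PISA results. The `"No"` label and the number of rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic stand-in data: the values are random, not real PISA results.
toy_df = pd.DataFrame({
    "late_to_school": rng.choice(["Never", "One or Two Times"], size=n_rows),
    "internet_home": rng.choice(["Yes, and I use it", "No"], size=n_rows),
    "ESCS": rng.normal(loc=0.0, scale=1.0, size=n_rows),
    "comb_score": rng.normal(loc=1400.0, scale=250.0, size=n_rows),
})

# Questions 1 and 2: compare the mean combined score across categories.
mean_by_lateness = toy_df.groupby("late_to_school")["comb_score"].mean()
mean_by_internet = toy_df.groupby("internet_home")["comb_score"].mean()

# Question 3: Pearson correlation between socioeconomic status and score.
escs_score_corr = toy_df["ESCS"].corr(toy_df["comb_score"])

print(mean_by_lateness)
print(mean_by_internet)
print(round(escs_score_corr, 3))
```

On the real data, the same calls would run on `pisa_df`. Because the toy columns are generated independently, no meaningful pattern should be read into the printed numbers.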

**Other points of interest:**

- What are the differences between the results of boys and girls?
- Are there statistically significant differences in results between immigrant and native students?
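
For the group comparisons listed here, one possible approach (an assumption, not a method confirmed by this notebook) is to compute Welch's t statistic together with an effect size. With hundreds of thousands of rows, even very small differences can come out as statistically significant, so the size of the difference matters too. The scores below are synthetic, and the helper names `welch_t` and `cohens_d` are introduced only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic combined scores for two groups; not real PISA results.
native_scores = rng.normal(loc=1420.0, scale=250.0, size=500)
immigrant_scores = rng.normal(loc=1390.0, scale=260.0, size=120)


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(var_a / a.size + var_b / b.size)


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = (
        (a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)
    ) / (a.size + b.size - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


t_stat = welch_t(native_scores, immigrant_scores)
d_value = cohens_d(native_scores, immigrant_scores)
print(round(t_stat, 2), round(d_value, 2))
```

This sketch treats every student as an independent, equally weighted observation. A rigorous analysis of PISA data would also need to account for the survey's sampling weights, which are ignored here.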

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Essentially all the columns of `clean_pisa_data.csv` will be used in the analysis. The points of interest had to be defined from the beginning, so that only the relevant columns were kept out of the hundreds available in the `pisa2012.csv` file.