# Data Understanding

(This document addresses research question 2.)

- What pedagogical strategies developed by teachers seem to promote better reading performance (PISA 2018 - teacher context data)?
- What pedagogical strategies, according to the students' perspective, seem to promote better reading performance (PISA 2018 - student context data)?
- Do teachers and students have the same perceptions?

For this question we use two datasets from the PISA database (https://www.oecd.org/en/data/datasets/pisa-2022-database.html):
- Students database 2018 (PISA)
- Teachers database 2018 (PISA)

## **2.1 Collect Initial Data**

The source datasets are distributed in .sas7bdat format; we converted them to .csv with the following commands:

In [None]:
"""
import pandas as pd

# Convert the student questionnaire file from SAS to CSV
data = pd.read_sas(
    "../../../databases/2018/cy07_msu_stu_qqq.sas7bdat", format="sas7bdat"
)

data.to_csv("../../../databases/2018/student2018.csv", index=False)

# Convert the teacher questionnaire file from SAS to CSV
data = pd.read_sas(
    "../../../databases/2018/cy07_msu_tch_qqq.sas7bdat", format="sas7bdat"
)

data.to_csv("../../../databases/2018/teacher2018.csv", index=False)
"""

**Note: The paths may differ on your machine. These files are not included in the project, so you will have to download them and run these commands yourself.**

## **2.2 Describe Data**

#### Student Data

In [22]:
import pandas as pd

student = pd.read_csv('../../../databases/2018/student2018.csv', nrows=1000)
#student = pd.read_csv('../../../databases/2018/student2018.csv')
#student_data_structure = pd.read_csv('../../../databases/2018/student_data_structure_2018.csv')

In [23]:
student.head()

| | CNTRYID | CNT | CNTSCHID | CNTSTUID | CYC | NatCen | STRATUM | SUBNATIO | OECD | ADMINMODE | ... | PV4RTML | PV5RTML | PV6RTML | PV7RTML | PV8RTML | PV9RTML | PV10RTML | SENWT | VER_DAT | i |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | b'ALB' | 800115.0 | 800001.0 | b'07MS' | b'000800' | b'ALB0107' | b'0080000' | 0.0 | 2.0 | ... | 325.281 | 370.041 | 358.524 | 345.833 | 380.064 | 357.376 | 385.496 | 0.48064 | b' 09MAY19:11:20:53' | 31.0 |
| 1 | 8.0 | b'ALB' | 800300.0 | 800002.0 | b'07MS' | b'000800' | b'ALB0105' | b'0080000' | 0.0 | 2.0 | ... | 337.259 | 294.53 | 325.444 | 367.058 | 333.356 | 367.616 | 334.448 | 1.30666 | b' 09MAY19:11:20:54' | 31.0 |
| 2 | 8.0 | b'ALB' | 800088.0 | 800003.0 | b'07MS' | b'000800' | b'ALB0101' | b'0080000' | 0.0 | 2.0 | ... | 297.929 | 269.282 | 293.719 | 314.027 | 295.519 | 283.143 | 315.992 | 0.67391 | b' 09MAY19:11:20:54' | 31.0 |
| 3 | 8.0 | b'ALB' | 800014.0 | 800004.0 | b'07MS' | b'000800' | b'ALB0109' | b'0080000' | 0.0 | 2.0 | ... | 349.369 | 333.416 | 320.41 | 388.597 | 324.419 | 372.543 | 355.213 | 0.6825 | b' 09MAY19:11:20:53' | 31.0 |
| 4 | 8.0 | b'ALB' | 800294.0 | 800005.0 | b'07MS' | b'000800' | b'ALB0203' | b'0080000' | 0.0 | 2.0 | ... | 461.508 | 464.534 | 461.681 | 455.574 | 430.815 | 476.752 | 482.148 | 0.63579 | b' 09MAY19:11:20:53' | 31.0 |
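
The `b'ALB'` and `b'07MS'` values in the preview above look like byte strings that were written to the CSV as literal text, which is what happens when `pd.read_sas` is called without an `encoding` argument. The sketch below shows one possible way to turn them back into plain strings after loading. The helper names `decode_bytes_literal` and `clean_text_columns` are hypothetical, the UTF-8 encoding is an assumption, and the sample frame is a tiny stand-in built from values shown in the preview.

```python
import ast

import pandas as pd


def decode_bytes_literal(value, encoding="utf-8"):
    """Turn text such as "b'ALB'" into 'ALB'; return other values unchanged."""
    if isinstance(value, str) and value.startswith(("b'", 'b"')):
        try:
            parsed = ast.literal_eval(value)  # safely evaluates the bytes literal
        except (ValueError, SyntaxError):
            return value
        if isinstance(parsed, bytes):
            # Assumption: the text is UTF-8 compatible; padding spaces are dropped.
            return parsed.decode(encoding).strip()
    return value


def clean_text_columns(df):
    """Apply decode_bytes_literal to every non-numeric column of a copy of df."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].map(decode_bytes_literal)
    return out


# Stand-in sample using values from the preview (not the full dataset).
sample = pd.DataFrame(
    {
        "CNT": ["b'ALB'", "b'ALB'"],
        "VER_DAT": ["b' 09MAY19:11:20:53'", "b' 09MAY19:11:20:54'"],
        "OECD": [0.0, 0.0],
    }
)
cleaned = clean_text_columns(sample)
print(cleaned)
```

An alternative, if the conversion is rerun, would be to pass an `encoding` value to `pd.read_sas` so that the text columns are decoded before they are written to CSV. Which encoding suits the PISA files is not confirmed here.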


In [24]:
student.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1119 columns):
 #     Column        Non-Null Count  Dtype  
---    ------        --------------  -----  
 0     CNTRYID       1000 non-null   float64
 1     CNT           1000 non-null   object 
 2     CNTSCHID      1000 non-null   float64
 3     CNTSTUID      1000 non-null   float64
 4     CYC           1000 non-null   object 
 5     NatCen        1000 non-null   object 
 6     STRATUM       1000 non-null   object 
 7     SUBNATIO      1000 non-null   object 
 8     OECD          1000 non-null   float64
 9     ADMINMODE     1000 non-null   float64
 10    LANGTEST_QQQ  995 non-null    float64
 11    LANGTEST_COG  1000 non-null   float64
 12    LANGTEST_PAQ  0 non-null      float64
 13    BOOKID        1000 non-null   float64
 14    ST001D01T     1000 non-null   float64
 15    ST003D02T     1000 non-null   float64
 16    ST003D03T     1000 non-null   float64
 17    ST004D01T     1000 non-null   
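
With 1119 columns, the `info()` listing is hard to scan for gaps such as `LANGTEST_PAQ`, which has no non-null values in this sample. A compact missingness table can help. The following is a minimal sketch, assuming a hypothetical helper `missing_summary`; the toy frame reuses three column names from the listing with made-up values.

```python
import pandas as pd


def missing_summary(df):
    """Return non-null counts and missing percentages, most incomplete columns first."""
    summary = pd.DataFrame(
        {
            "non_null": df.notna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(1),
        }
    )
    return summary.sort_values("missing_pct", ascending=False)


# Toy data for illustration only; the values are not taken from PISA.
nan = float("nan")
toy = pd.DataFrame(
    {
        "CNTRYID": [8.0, 8.0, 8.0, 8.0],
        "LANGTEST_QQQ": [1.0, nan, 1.0, 1.0],
        "LANGTEST_PAQ": [nan, nan, nan, nan],
    }
)
summary = missing_summary(toy)
empty_columns = summary.index[summary["non_null"] == 0].tolist()
print(summary)
print("Columns with no data:", empty_columns)
```

Applied to `student`, the same call would list the columns that are entirely empty in the loaded rows. Because only the first 1000 rows are read above, such a result would describe that sample and not the full file.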

#### Teacher Data

In [26]:
#teacher = pd.read_csv('../../../databases/2018/teacher2018.csv', nrows=1000)
teacher = pd.read_csv('../../../databases/2018/teacher2018.csv')
teacher_data_structure = pd.read_csv('../../../databases/2018/teacher_data_structure_2018.csv', delimiter=';')

In [27]:
teacher.head()

| | CNTRYID | CNT | CNTSCHID | CNTTCHID | TEACHERID | CYC | NatCen | Region | STRATUM | SUBNATIO | ... | FEEDBACK | ADAPTINSTR | FEEDBINSTR | TCATTIMM | GCTRAIN | TCMCEG | GCSELF | W_SCHGRNRABWT | W_FSTUWT_SCH_SUM | VER_DAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | b'ALB' | 800057.0 | 800001.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0203' | b'0080000' | ... | 0.6122 | | | 1.2332 | 1.2735 | -0.1151 | 1.9399 | 3.45567 | 93.30319 | b' 09MAY19:11:21:10' |
| 1 | 8.0 | b'ALB' | 800121.0 | 800002.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0107' | b'0080000' | ... | -1.2284 | | | -0.8666 | 1.8008 | -0.1151 | -0.2634 | 6.07334 | 12.46632 | b' 09MAY19:11:21:10' |
| 2 | 8.0 | b'ALB' | 800140.0 | 800003.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0101' | b'0080000' | ... | 0.2741 | | | 0.4987 | 1.8008 | 0.5387 | -0.7038 | 3.76885 | 99.20152 | b' 09MAY19:11:21:10' |
| 3 | 8.0 | b'ALB' | 800149.0 | 800004.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0211' | b'0080000' | ... | -0.2922 | | | 1.2332 | 0.2917 | -0.6041 | -1.0056 | 6.62948 | 19.88845 | b' 09MAY19:11:21:10' |
| 4 | 8.0 | b'ALB' | 800095.0 | 800005.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0204' | b'0080000' | ... | 0.1202 | | | 0.4987 | 1.0055 | -1.9057 | 1.9399 | 3.35644 | 104.76897 | b' 09MAY19:11:21:10' |


In [28]:
teacher.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107367 entries, 0 to 107366
Data columns (total 350 columns):
 #    Column            Non-Null Count   Dtype  
---   ------            --------------   -----  
 0    CNTRYID           107367 non-null  float64
 1    CNT               107367 non-null  object 
 2    CNTSCHID          107367 non-null  float64
 3    CNTTCHID          107367 non-null  float64
 4    TEACHERID         91190 non-null   float64
 5    CYC               107367 non-null  object 
 6    NatCen            107367 non-null  object 
 7    Region            107367 non-null  float64
 8    STRATUM           107367 non-null  object 
 9    SUBNATIO          107367 non-null  object 
 10   OECD              107367 non-null  float64
 11   ADMINMODE         107367 non-null  float64
 12   LANGTEST          93006 non-null   float64
 13   TC001Q01NA        90090 non-null   float64
 14   TC002Q01NA        90326 non-null   float64
 15   TC005Q01NA        90408 non-null   float64
 16   

In [29]:
teacher_data_structure.head()

| | CY07_MSU_TCH_QQQ | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Teacher questionnaire data file | | | | | | | | | |
| 1 | NAME | VARLABEL | TYPE | FORMAT | VARNUM | MINMAX | VAL | LABEL | COUNT | PERCENT |
| 2 | CNTRYID | Country Identifier | NUM | 3.0 | 1 | 8-840 | | | | |
| 3 | | | | | | | 8 | Albania | 3375 | 314 |
| 4 | | | | | | | 31 | Baku (Azerbaijan) | 4077 | 380 |


In [30]:
teacher_data_structure.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4854 entries, 0 to 4853
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   CY07_MSU_TCH_QQQ  352 non-null    object
 1   Unnamed: 1        351 non-null    object
 2   Unnamed: 2        351 non-null    object
 3   Unnamed: 3        351 non-null    object
 4   Unnamed: 4        351 non-null    object
 5   Unnamed: 5        344 non-null    object
 6   Unnamed: 6        4503 non-null   object
 7   Unnamed: 7        4503 non-null   object
 8   Unnamed: 8        4503 non-null   object
 9   Unnamed: 9        4503 non-null   object
dtypes: object(10)
memory usage: 379.3+ KB
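
The structure file appears to be a codebook: the row holding `NAME`, `VARLABEL`, ... carries the real column names, rows with a `NAME` describe a variable, and rows without one list the coded values of the variable above them. The sketch below shows one way this layout could be turned into lookup dictionaries. The function `parse_codebook` is hypothetical, and it assumes the layout seen in the first five rows holds for the whole file, which has not been verified here. The toy frame reproduces those five rows.

```python
import pandas as pd


def parse_codebook(raw):
    """Split a codebook frame into variable labels and per-variable value labels."""
    # Row 1 holds the real column names; the data rows start at row 2.
    book = raw.iloc[2:].copy()
    book.columns = raw.iloc[1].tolist()
    book = book.reset_index(drop=True)

    # Rows that have a NAME describe one variable each.
    variables = book.dropna(subset=["NAME"]).set_index("NAME")["VARLABEL"].to_dict()

    # Rows without a NAME belong to the closest variable above them.
    book["NAME"] = book["NAME"].ffill()
    coded = book.dropna(subset=["VAL"])
    value_labels = {
        name: dict(zip(group["VAL"], group["LABEL"]))
        for name, group in coded.groupby("NAME", sort=False)
    }
    return variables, value_labels


# Toy frame mirroring the first five rows of teacher_data_structure.
columns = ["CY07_MSU_TCH_QQQ"] + [f"Unnamed: {i}" for i in range(1, 10)]
rows = [
    ["Teacher questionnaire data file"] + [None] * 9,
    ["NAME", "VARLABEL", "TYPE", "FORMAT", "VARNUM", "MINMAX",
     "VAL", "LABEL", "COUNT", "PERCENT"],
    ["CNTRYID", "Country Identifier", "NUM", "3.0", "1", "8-840",
     None, None, None, None],
    [None] * 6 + ["8", "Albania", "3375", "314"],
    [None] * 6 + ["31", "Baku (Azerbaijan)", "4077", "380"],
]
raw = pd.DataFrame(rows, columns=columns, dtype=object)

variables, value_labels = parse_codebook(raw)
print(variables)
print(value_labels)
```

With such dictionaries, a column code like `TC001Q01NA` could be looked up by name instead of searching the 4854-row structure file by hand.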


### Verifying similarities between databases

In [17]:
# Get column names
student_keys = set(student.columns)
teacher_keys = set(teacher.columns)

# Compare keys
common_keys = student_keys.intersection(teacher_keys)
unique_keys_student = student_keys.difference(teacher_keys)
unique_keys_teacher = teacher_keys.difference(student_keys)

print("Common keys:", common_keys)
#print("Unique keys in student:", unique_keys_student)
#print("Unique keys in teacher:", unique_keys_teacher)


Common keys: {'STRATUM', 'ADMINMODE', 'CNTSCHID', 'SUBNATIO', 'CYC', 'CNT', 'VER_DAT', 'OECD', 'NatCen', 'CNTRYID'}
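
The common keys include `CNTRYID` and `CNTSCHID` but no student or teacher identifier, so the two files can only be linked at the school level. One possible approach, sketched below under that assumption, is to average teacher variables per school and attach the result to each student row. The function `attach_school_teacher_means` and the `_SCH_MEAN` suffix are hypothetical, the toy values are made up, and survey weights are ignored for brevity.

```python
import pandas as pd


def attach_school_teacher_means(student_df, teacher_df, teacher_cols,
                                keys=("CNTRYID", "CNTSCHID")):
    """Average teacher columns per school and left-join them onto student rows."""
    keys = list(keys)
    school_means = (
        teacher_df.groupby(keys, as_index=False)[teacher_cols]
        .mean()
        .rename(columns={col: f"{col}_SCH_MEAN" for col in teacher_cols})
    )
    # validate guards against duplicated school rows on the teacher side.
    return student_df.merge(school_means, on=keys, how="left",
                            validate="many_to_one")


# Toy frames for illustration; only the column names come from the datasets.
students = pd.DataFrame(
    {
        "CNTRYID": [8.0, 8.0, 8.0],
        "CNTSCHID": [800115.0, 800115.0, 800300.0],
        "CNTSTUID": [800001.0, 800002.0, 800003.0],
    }
)
teachers = pd.DataFrame(
    {
        "CNTRYID": [8.0, 8.0, 8.0],
        "CNTSCHID": [800115.0, 800115.0, 800057.0],
        "FEEDBACK": [0.5, -0.5, 1.0],
    }
)
merged = attach_school_teacher_means(students, teachers, ["FEEDBACK"])
print(merged)
```

Students whose school has no teacher record keep a missing value in the new column, which makes the coverage of the teacher questionnaire visible after the join.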
