# Data Understanding of the Teacher Dataset

(This document relates to question number 2)

- What pedagogical strategies developed by teachers seem to promote better reading performance (PISA 2018 - teacher context data)?
- What pedagogical strategies, according to the students' perspective, seem to promote better reading performance (PISA 2018 - student context data)?
- Do teachers and students have the same perceptions?







OECD (2019), PISA 2018 Results (Volume I): What Students Know and Can Do, PISA, OECD Publishing, Paris, https://doi.org/10.1787/5f07c754-en.

Executive Summary

Reading proficiency is essential for a wide variety of human activities – from following instructions in a manual; to finding out the who, what, when, where and why of an event; to communicating with others for a specific purpose or transaction. PISA recognises that evolving technologies have changed the ways people read and exchange information, whether at home, at school or in the workplace. Digitalisation has resulted in the emergence and availability of new forms of text, ranging from the concise (text messages; annotated search-engine results) to the lengthy (tabbed, multipage websites; newly accessible archival material scanned from microfiches). In response, education systems are increasingly incorporating digital (reading) literacy into their programmes of instruction.

Reading was the main subject assessed in PISA 2018. The PISA 2018 reading assessment, which was delivered on computer in most of the 79 countries and economies that participated, included new text and assessment formats made possible through digital delivery. The test aimed to assess reading literacy in the digital environment while retaining the ability to measure trends in reading literacy over the past two decades. PISA 2018 defined reading literacy as understanding, using, evaluating, reflecting on and engaging with texts in order to achieve one’s goals, to develop one’s knowledge and potential, and to participate in society.


https://www.oecd.org/en/publications/pisa-2018-results-volume-i_5f07c754-en.html

What students know and can do: main findings
In reading

- Beijing, Shanghai, Jiangsu and Zhejiang (China) and Singapore scored significantly higher in reading than all other countries/economies that participated in PISA 2018. Estonia, Canada, Finland and Ireland were the highest-performing OECD countries in reading.
- Some 77 % of students, on average across OECD countries, attained at least Level 2 proficiency in reading. At a minimum, these students are able to identify the main idea in a text of moderate length, find information based on explicit, though sometimes complex, criteria, and reflect on the purpose and form of texts when explicitly directed to do so. Over 85 % of students in Beijing, Shanghai, Jiangsu and Zhejiang (China), Canada, Estonia, Finland, Hong Kong (China), Ireland, Macao (China), Poland and Singapore performed at this level or above.
- Around 8.7 % of students, on average across OECD countries, were top performers in reading, meaning that they attained Level 5 or 6 in the PISA reading test. At these levels, students are able to comprehend lengthy texts, deal with concepts that are abstract or counterintuitive, and establish distinctions between fact and opinion, based on implicit cues pertaining to the content or source of the information. In 20 education systems, including those of 15 OECD countries, over 10 % of 15-year-old students were top performers.



## **2.1 Collect Initial Data**

For this question we use two datasets (https://www.oecd.org/en/data/datasets/pisa-2022-database.html):
- Students database 2018 (PISA)
- Teachers database 2018 (PISA) (this file focuses on the teacher database; the student database is loaded only for comparison)

The source datasets are in .sas7bdat format, which we converted to .csv with the following code:

In [98]:
"""
data = pd.read_sas(
    "../../../databases/2018/cy07_msu_stu_qqq.sas7bdat", format="sas7bdat"
)

data.to_csv("../../../databases/2018/student2018.csv", index=False)

data = pd.read_sas(
    "../../../databases/2018/cy07_msu_tch_qqq.sas7bdat", format="sas7bdat"
)

data.to_csv("../../../databases/2018/teacher2018.csv", index=False)
"""


In [100]:
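import pandas as pd
#teacher = pd.read_csv('../../../databases/2018/teacher2018.csv', nrows=1000)  # smaller sample for quick tests
teacher = pd.read_csv('../../../databases/2018/teacher2018.csv')
student = pd.read_csv('../../../databases/2018/student2018.csv')  # used by the comparison and data-quality cells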

**Note:** These files are not included in the project folder, so they must be downloaded manually and placed in their respective folders.

## **2.2 Describe Data**

#### Teacher Data

In [83]:
teacher.head()

| | CNTRYID | CNT | CNTSCHID | CNTTCHID | TEACHERID | CYC | NatCen | Region | STRATUM | SUBNATIO | ... | FEEDBACK | ADAPTINSTR | FEEDBINSTR | TCATTIMM | GCTRAIN | TCMCEG | GCSELF | W_SCHGRNRABWT | W_FSTUWT_SCH_SUM | VER_DAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | b'ALB' | 800057.0 | 800001.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0203' | b'0080000' | ... | 0.6122 | NaN | NaN | 1.2332 | 1.2735 | -0.1151 | 1.9399 | 3.45567 | 93.30319 | b' 09MAY19:11:21:10' |
| 1 | 8.0 | b'ALB' | 800121.0 | 800002.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0107' | b'0080000' | ... | -1.2284 | NaN | NaN | -0.8666 | 1.8008 | -0.1151 | -0.2634 | 6.07334 | 12.46632 | b' 09MAY19:11:21:10' |
| 2 | 8.0 | b'ALB' | 800140.0 | 800003.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0101' | b'0080000' | ... | 0.2741 | NaN | NaN | 0.4987 | 1.8008 | 0.5387 | -0.7038 | 3.76885 | 99.20152 | b' 09MAY19:11:21:10' |
| 3 | 8.0 | b'ALB' | 800149.0 | 800004.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0211' | b'0080000' | ... | -0.2922 | NaN | NaN | 1.2332 | 0.2917 | -0.6041 | -1.0056 | 6.62948 | 19.88845 | b' 09MAY19:11:21:10' |
| 4 | 8.0 | b'ALB' | 800095.0 | 800005.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0204' | b'0080000' | ... | 0.1202 | NaN | NaN | 0.4987 | 1.0055 | -1.9057 | 1.9399 | 3.35644 | 104.76897 | b' 09MAY19:11:21:10' |
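
The `b'ALB'` style values in the preview look like Python byte-string literals that were written to the CSV as text. This usually happens when `pd.read_sas` is called without its `encoding` argument, so passing an encoding at conversion time would likely avoid it. For an already converted CSV, a minimal cleanup sketch could look like the following. The helper name `decode_byte_literals` and the toy `demo` frame are illustrative assumptions, not part of the project.

```python
import ast

import pandas as pd


def decode_byte_literals(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where strings such as "b'ALB'" become plain text ("ALB")."""

    def _clean(value):
        # Only touch strings that look like a bytes literal; leave everything else alone.
        if isinstance(value, str) and value.startswith(("b'", 'b"')):
            try:
                return ast.literal_eval(value).decode("utf-8").strip()
            except (ValueError, SyntaxError, UnicodeDecodeError):
                return value
        return value

    out = df.copy()
    for col in out.columns:
        # Numeric columns cannot contain byte literals, so skip them.
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].map(_clean)
    return out


# Toy frame mimicking a few columns of the teacher preview (hypothetical sample).
demo = pd.DataFrame(
    {
        "CNTRYID": [8.0, 8.0],
        "CNT": ["b'ALB'", "b'ALB'"],
        "VER_DAT": ["b' 09MAY19:11:21:10'", "b' 09MAY19:11:21:10'"],
    }
)
cleaned = decode_byte_literals(demo)
print(cleaned)
```

On the real data this would be called as `teacher = decode_byte_literals(teacher)`. Stripping whitespace is a design choice made here because `VER_DAT` carries a leading space; drop the `.strip()` if padding is meaningful.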


In [84]:
teacher.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107367 entries, 0 to 107366
Data columns (total 350 columns):
 #    Column            Non-Null Count   Dtype  
---   ------            --------------   -----  
 0    CNTRYID           107367 non-null  float64
 1    CNT               107367 non-null  object 
 2    CNTSCHID          107367 non-null  float64
 3    CNTTCHID          107367 non-null  float64
 4    TEACHERID         91190 non-null   float64
 5    CYC               107367 non-null  object 
 6    NatCen            107367 non-null  object 
 7    Region            107367 non-null  float64
 8    STRATUM           107367 non-null  object 
 9    SUBNATIO          107367 non-null  object 
 10   OECD              107367 non-null  float64
 11   ADMINMODE         107367 non-null  float64
 12   LANGTEST          93006 non-null   float64
 13   TC001Q01NA        90090 non-null   float64
 14   TC002Q01NA        90326 non-null   float64
 15   TC005Q01NA        90408 non-null   float64
 16   

In [85]:
data_map = PISADataMap('../../../databases/2018/teacher_data_structure_2018.csv')

for column in data_map.map_enum:
    print(f"\033[1m{column.name}\033[0m: {column.value}")

[1mCNTRYID[0m: Country Identifier
[1mCNT[0m: Country code 3-character
[1mCNTSCHID[0m: Intl. School ID
[1mCNTTCHID[0m: Intl. Teacher ID
[1mTEACHERID[0m: Teacher identification code
[1mCYC[0m: PISA Assessment Cycle (2 digits + 2 character Assessment type - MS/FT)
[1mNatCen[0m: National Centre 6-digit Code
[1mRegion[0m: Region
[1mSTRATUM[0m: Stratum ID 7-character (cnt + region ID + original stratum ID)
[1mSUBNATIO[0m: Adjudicated sub-region code 7-digit code (3-digit country code + region ID + stratum ID)
[1mOECD[0m: OECD country
[1mADMINMODE[0m: Mode of Respondent
[1mLANGTEST[0m: Language of Questionnaire/Assessment
[1mTC001Q01NA[0m: Are you female or male?
[1mTC002Q01NA[0m: How old are you?
[1mTC005Q01NA[0m: What is your current employment status as a teacher? My employment status at this school
[1mTC007Q01NA[0m: How many years of work experience do you have? Year(s) working as a teacher at this school
[1mTC007Q02NA[0m: How many years of work experie



The dataset is composed of 343 numeric columns and only 7 categorical (object-typed) columns.

In [86]:
import pandas as pd
from tabulate import tabulate

categorical_columns = teacher.select_dtypes(include=["object", "category"]).columns
numeric_columns = teacher.select_dtypes(include=["int64", "float64"]).columns

column_types_df = pd.DataFrame(
    {
        "Column type": ["Numeric", "Categorical"],
        "Number of columns": [len(numeric_columns), len(categorical_columns) ],
        "Column names": [
            ", ".join(numeric_columns),
            ", ".join(categorical_columns),
        ],
    }
)

print(
    tabulate(
        column_types_df,
        headers="keys",
        tablefmt="pretty",
        showindex=False,
        colalign=("left", "left", "left"),
    )
)


In [87]:
teacher.describe()

| | CNTRYID | CNTSCHID | CNTTCHID | TEACHERID | Region | OECD | ADMINMODE | LANGTEST | TC001Q01NA | TC002Q01NA | ... | TCDIRINS | FEEDBACK | ADAPTINSTR | FEEDBINSTR | TCATTIMM | GCTRAIN | TCMCEG | GCSELF | W_SCHGRNRABWT | W_FSTUWT_SCH_SUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 107367.0 | 107367.0 | 107367.0 | 91190.0 | 107367.0 | 107367.0 | 107367.0 | 93006.0 | 90090.0 | 90326.0 | ... | 27293.0 | 62184.0 | 27312.0 | 27309.0 | 48908.0 | 58111.0 | 52577.0 | 52481.0 | 107367.0 | 107367.0 |
| mean | 482.540324 | 48255510.0 | 48258680.0 | 4.694188 | 48256.795766 | 0.44482 | 2.0 | 260.17015 | 1.368432 | 43.240374 | ... | 0.148607 | 0.19494 | 0.073665 | 0.169752 | -0.157354 | 0.221334 | 0.045425 | 0.150496 | 18.838765 | 1822.10621 |
| std | 271.54789 | 27156440.0 | 27156530.0 | 0.460753 | 27156.8113 | 0.496948 | 0.0 | 134.130385 | 0.482382 | 10.079831 | ... | 1.045973 | 1.010828 | 1.032449 | 0.992757 | 1.094769 | 1.080615 | 0.944777 | 1.020194 | 43.119851 | 4225.350641 |
| min | 8.0 | 800002.0 | 800001.0 | 4.0 | 800.0 | 0.0 | 2.0 | 140.0 | 1.0 | 20.0 | ... | -3.9355 | -3.2129 | -4.3098 | -3.6121 | -4.2802 | -1.3413 | -2.716 | -2.4972 | 0.86354 | 1.0 |
| 25% | 214.0 | 21400190.0 | 21402100.0 | 4.0 | 21400.0 | 0.0 | 2.0 | 156.0 | 1.0 | 35.0 | ... | -0.5757 | -0.4809 | -0.6257 | -0.4962 | -0.8666 | -0.4411 | -0.5511 | -0.2634 | 1.66143 | 143.52434 |
| 50% | 591.0 | 59100050.0 | 59100720.0 | 5.0 | 59100.0 | 0.0 | 2.0 | 232.0 | 1.0 | 43.0 | ... | 0.1679 | 0.1974 | -0.0248 | 0.0151 | -0.1598 | 0.1823 | -0.1151 | -0.2634 | 4.987 | 441.80496 |
| 75% | 724.0 | 72400680.0 | 72413470.0 | 5.0 | 72413.0 | 1.0 | 2.0 | 313.0 | 2.0 | 51.0 | ... | 1.3596 | 0.8993 | 0.7196 | 0.9909 | 0.6596 | 1.0759 | 1.125 | 1.038 | 18.91576 | 1990.8191 |
| max | 840.0 | 84000180.0 | 84003780.0 | 5.0 | 84000.0 | 1.0 | 2.0 | 803.0 | 2.0 | 70.0 | ... | 1.3656 | 1.8542 | 2.4533 | 1.6101 | 1.3026 | 1.8008 | 1.125 | 1.9436 | 1294.02003 | 49343.58053 |


### Verifying similarities between databases

In [88]:
# Get column names
student_keys = set(student.columns)
teacher_keys = set(teacher.columns)

# Compare keys
common_keys = student_keys.intersection(teacher_keys)
unique_keys_student = student_keys.difference(teacher_keys)
unique_keys_teacher = teacher_keys.difference(student_keys)

print("Common keys:", common_keys)
#print("Unique keys in student:", unique_keys_student)
#print("Unique keys in teacher:", unique_keys_teacher)


Common keys: {'CNTRYID', 'SUBNATIO', 'OECD', 'VER_DAT', 'STRATUM', 'CNTSCHID', 'NatCen', 'CYC', 'CNT', 'ADMINMODE'}
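
The common keys contain no teacher or student identifier, so the two files can only be related through shared context such as `CNT` and `CNTSCHID`. One possible way to compare them, sketched here under that assumption, is to average a teacher index per school and attach it to each student of that school. The toy values, the choice of `FEEDBACK` and `PV3READ`, and the name `FEEDBACK_school_mean` are illustrative only, and the sketch ignores the survey weights.

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical values, real column names).
teacher_demo = pd.DataFrame(
    {
        "CNT": ["ALB", "ALB", "ALB"],
        "CNTSCHID": [800057.0, 800057.0, 800121.0],
        "FEEDBACK": [0.6122, -1.2284, 0.2741],
    }
)
student_demo = pd.DataFrame(
    {
        "CNT": ["ALB", "ALB", "ALB"],
        "CNTSCHID": [800057.0, 800121.0, 800999.0],
        "PV3READ": [410.0, 455.0, 390.0],
    }
)

# 1) Collapse teachers to one row per school (unweighted mean of the index).
school_teacher = (
    teacher_demo.groupby(["CNT", "CNTSCHID"], as_index=False)["FEEDBACK"]
    .mean()
    .rename(columns={"FEEDBACK": "FEEDBACK_school_mean"})
)

# 2) Attach the school-level value to every student of that school.
#    validate="many_to_one" raises if a school appears twice on the right side.
linked = student_demo.merge(
    school_teacher, on=["CNT", "CNTSCHID"], how="left", validate="many_to_one"
)
print(linked)
```

Students whose school has no teacher record keep a missing value in `FEEDBACK_school_mean` because of the left join, which makes the coverage of the teacher questionnaire easy to check afterwards.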


## **2.4 Verify Data Quality**


#### Student Data

In [89]:
# Overview of missing values
missing_summary = pd.DataFrame({
    'Missing Values': student.isnull().sum(),
    'Percentage Missing': (student.isnull().mean() * 100).round(2)
})

# Sort by the number of missing values
missing_summary = missing_summary[missing_summary['Missing Values'] > 0].sort_values(by='Missing Values', ascending=False)

print("Missing values summary:\n")
print(missing_summary)

# Identify rows with missing values
rows_with_missing = student[student.isnull().any(axis=1)]
print(f"\nNumber of rows with missing values: {len(rows_with_missing)}")

# Show a sample of rows with missing values
if len(rows_with_missing) > 0:
    print("\nSample of rows with missing values:")
    print(rows_with_missing.head())

Missing values summary:

            Missing Values  Percentage Missing
EC162Q08HA          577987               94.44
EC162Q06HA          576013               94.12
EC162Q07HA          575891               94.10
EC162Q04HA          575680               94.06
EC162Q05HA          575435               94.02
...                    ...                 ...
PV7READ               5377                0.88
PV6READ               5377                0.88
PV3READ               5377                0.88
GRADE                 3019                0.49
ST004D01T                2                0.00

[1007 rows x 2 columns]

Number of rows with missing values: 612004

Sample of rows with missing values:
   CNTRYID     CNT  CNTSCHID  CNTSTUID      CYC     NatCen     STRATUM  \
0      8.0  b'ALB'  800115.0  800001.0  b'07MS'  b'000800'  b'ALB0107'   
1      8.0  b'ALB'  800300.0  800002.0  b'07MS'  b'000800'  b'ALB0105'   
2      8.0  b'ALB'  800088.0  800003.0  b'07MS'  b'000800'  b'ALB0101'  

#### Teacher Data

In [90]:
# Overview of missing values
missing_summary = pd.DataFrame({
    'Missing Values': teacher.isnull().sum(),
    'Percentage Missing': (teacher.isnull().mean() * 100).round(2)
})

# Sort by the number of missing values
missing_summary = missing_summary[missing_summary['Missing Values'] > 0].sort_values(by='Missing Values', ascending=False)

print("Missing values summary:\n")
print(missing_summary)

# Identify rows with missing values
rows_with_missing = teacher[teacher.isnull().any(axis=1)]
print(f"\nNumber of rows with missing values: {len(rows_with_missing)}")

# Show a sample of rows with missing values
if len(rows_with_missing) > 0:
    print("\nSample of rows with missing values:")
    print(rows_with_missing.head())

Missing values summary:

            Missing Values  Percentage Missing
TC203Q01HA           81989               76.36
TC203Q02HA           81968               76.34
TC203Q03HA           81884               76.27
TC204Q01HA           81690               76.08
TC204Q02HA           81644               76.04
...                    ...                 ...
STTMG5               16754               15.60
STTMG6               16754               15.60
STTMG7               16754               15.60
TEACHERID            16177               15.07
LANGTEST             14361               13.38

[324 rows x 2 columns]

Number of rows with missing values: 107367

Sample of rows with missing values:
   CNTRYID     CNT  CNTSCHID  CNTTCHID  TEACHERID      CYC     NatCen  Region  \
0      8.0  b'ALB'  800057.0  800001.0        5.0  b'07MS'  b'000800'   800.0   
1      8.0  b'ALB'  800121.0  800002.0        5.0  b'07MS'  b'000800'   800.0   
2      8.0  b'ALB'  800140.0  800003.0        5.0  b
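
Every one of the 107367 teacher rows has at least one missing value, so dropping incomplete rows would leave nothing, and a column-wise decision seems more practical. The student and teacher checks above share the same logic, which could be factored into a helper along the following lines. This is a sketch: the function names, the toy `demo` frame and the 50 % threshold are assumptions made for illustration, not choices taken from the project.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst columns first."""
    summary = pd.DataFrame(
        {
            "Missing Values": df.isnull().sum(),
            "Percentage Missing": (df.isnull().mean() * 100).round(2),
        }
    )
    summary = summary[summary["Missing Values"] > 0]
    return summary.sort_values(by="Missing Values", ascending=False)


def columns_above_threshold(df: pd.DataFrame, max_missing_pct: float = 50.0) -> list:
    """Names of columns whose missing share exceeds max_missing_pct (assumed cutoff)."""
    summary = summarize_missing(df)
    return summary.index[summary["Percentage Missing"] > max_missing_pct].tolist()


# Toy frame with real column names but hypothetical values.
demo = pd.DataFrame(
    {
        "CNTRYID": [8.0, 8.0, 8.0, 8.0],
        "TEACHERID": [5.0, None, 5.0, 4.0],
        "TC203Q01HA": [None, None, None, 1.0],
    }
)
print(summarize_missing(demo))
print(columns_above_threshold(demo))
```

With the real frames the calls would be `summarize_missing(teacher)` and `summarize_missing(student)`. The threshold should be chosen per analysis, since some PISA questionnaire items are only administered in a subset of countries.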