# Data Understanding of the Teacher Dataset

(This document relates to question number 2)

- What pedagogical strategies developed by teachers seem to promote better reading performance (PISA 2018 - teacher context data)?
- What pedagogical strategies, according to the students' perspective, seem to promote better reading performance (PISA 2018 - student context data)?
- Do teachers and students have the same perceptions?

## **2.1 Collect Initial Data**

For this question we use two datasets (https://www.oecd.org/en/data/datasets/pisa-2022-database.html):
- Students database 2018 (PISA)
- Teachers database 2018 (PISA) (this notebook uses only the teacher database)

The source datasets are in .sas7bdat format, which we converted to .csv with the following code:

In [2]:
import pandas as pd

"""
data = pd.read_sas(
    "../../../databases/2018/cy07_msu_stu_qqq.sas7bdat", format="sas7bdat"
)

data.to_csv("../../../databases/2018/student2018.csv", index=False)

data = pd.read_sas(
    "../../../databases/2018/cy07_msu_tch_qqq.sas7bdat", format="sas7bdat"
)

data.to_csv("../../../databases/2018/teacher2018.csv", index=False)
"""


In [3]:
teacher = pd.read_csv('../../../databases/2018/teacher2018.csv', nrows=1000)
#teacher = pd.read_csv('../../../databases/2018/teacher2018.csv')

**Note:** These files are not included in the project folder, so they must be downloaded manually and placed in their respective folders.

## **2.2 Describe Data**

#### Teacher Data

The original dataset has 350 features.

In [4]:
teacher.head()

|   | CNTRYID | CNT | CNTSCHID | CNTTCHID | TEACHERID | CYC | NatCen | Region | STRATUM | SUBNATIO | ... | FEEDBACK | ADAPTINSTR | FEEDBINSTR | TCATTIMM | GCTRAIN | TCMCEG | GCSELF | W_SCHGRNRABWT | W_FSTUWT_SCH_SUM | VER_DAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | b'ALB' | 800057.0 | 800001.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0203' | b'0080000' | ... | 0.6122 | NaN | NaN | 1.2332 | 1.2735 | -0.1151 | 1.9399 | 3.45567 | 93.30319 | b' 09MAY19:11:21:10' |
| 1 | 8.0 | b'ALB' | 800121.0 | 800002.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0107' | b'0080000' | ... | -1.2284 | NaN | NaN | -0.8666 | 1.8008 | -0.1151 | -0.2634 | 6.07334 | 12.46632 | b' 09MAY19:11:21:10' |
| 2 | 8.0 | b'ALB' | 800140.0 | 800003.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0101' | b'0080000' | ... | 0.2741 | NaN | NaN | 0.4987 | 1.8008 | 0.5387 | -0.7038 | 3.76885 | 99.20152 | b' 09MAY19:11:21:10' |
| 3 | 8.0 | b'ALB' | 800149.0 | 800004.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0211' | b'0080000' | ... | -0.2922 | NaN | NaN | 1.2332 | 0.2917 | -0.6041 | -1.0056 | 6.62948 | 19.88845 | b' 09MAY19:11:21:10' |
| 4 | 8.0 | b'ALB' | 800095.0 | 800005.0 | 5.0 | b'07MS' | b'000800' | 800.0 | b'ALB0204' | b'0080000' | ... | 0.1202 | NaN | NaN | 0.4987 | 1.0055 | -1.9057 | 1.9399 | 3.35644 | 104.76897 | b' 09MAY19:11:21:10' |
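
Text values in the preview above appear as `b'ALB'`, `b'07MS'`, and so on. This is most likely because `pd.read_sas` was called without an `encoding`, so text columns came back as bytes and their printed form was written to the .csv. The sketch below shows one possible way to clean such values after loading. The helper name `decode_bytes_repr` is hypothetical, and the UTF-8 encoding and the stripping of surrounding whitespace are assumptions, not something the original notebook does.

```python
import ast


def decode_bytes_repr(value, encoding="utf-8"):
    """Turn a string such as "b'ALB'" back into 'ALB'.

    Hypothetical helper: values that are neither bytes nor a bytes
    literal written as text are returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding).strip()
    if isinstance(value, str) and value.startswith(("b'", 'b"')):
        try:
            parsed = ast.literal_eval(value)  # safely parse the literal
        except (ValueError, SyntaxError):
            return value  # not a valid literal, leave as-is
        if isinstance(parsed, bytes):
            # Assumption: surrounding whitespace is not meaningful
            return parsed.decode(encoding).strip()
    return value


samples = ["b'ALB'", "b' 09MAY19:11:21:10'", 8.0, "plain"]
cleaned = [decode_bytes_repr(v) for v in samples]
print(cleaned)
```

On the loaded data, the same helper could be applied to each categorical column, for example with `teacher[col].map(decode_bytes_repr)`. An alternative worth considering is passing `encoding` to `pd.read_sas` at conversion time, so that the .csv contains plain text from the start.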


In [84]:
teacher.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107367 entries, 0 to 107366
Data columns (total 350 columns):
 #    Column            Non-Null Count   Dtype  
---   ------            --------------   -----  
 0    CNTRYID           107367 non-null  float64
 1    CNT               107367 non-null  object 
 2    CNTSCHID          107367 non-null  float64
 3    CNTTCHID          107367 non-null  float64
 4    TEACHERID         91190 non-null   float64
 5    CYC               107367 non-null  object 
 6    NatCen            107367 non-null  object 
 7    Region            107367 non-null  float64
 8    STRATUM           107367 non-null  object 
 9    SUBNATIO          107367 non-null  object 
 10   OECD              107367 non-null  float64
 11   ADMINMODE         107367 non-null  float64
 12   LANGTEST          93006 non-null   float64
 13   TC001Q01NA        90090 non-null   float64
 14   TC002Q01NA        90326 non-null   float64
 15   TC005Q01NA        90408 non-null   float64
 16   

In [5]:
import sys
import os

# Add the src directory to the Python path
sys.path.append(os.path.abspath('../../../src'))

from pisadatamap.pisadatamap import PISADataMap

data_map = PISADataMap('../../../databases/2018/teacher_data_structure_2018.csv')

for column in data_map.map_enum:
    print(f"\033[1m{column.name}\033[0m: {column.value}")

CNTRYID: Country Identifier
CNT: Country code 3-character
CNTSCHID: Intl. School ID
CNTTCHID: Intl. Teacher ID
TEACHERID: Teacher identification code
CYC: PISA Assessment Cycle (2 digits + 2 character Assessment type - MS/FT)
NatCen: National Centre 6-digit Code
Region: Region
STRATUM: Stratum ID 7-character (cnt + region ID + original stratum ID)
SUBNATIO: Adjudicated sub-region code 7-digit code (3-digit country code + region ID + stratum ID)
OECD: OECD country
ADMINMODE: Mode of Respondent
LANGTEST: Language of Questionnaire/Assessment
TC001Q01NA: Are you female or male?
TC002Q01NA: How old are you?
TC005Q01NA: What is your current employment status as a teacher? My employment status at this school
TC007Q01NA: How many years of work experience do you have? Year(s) working as a teacher at this school
TC007Q02NA: How many years of work experie



The dataset is composed of 343 numeric columns and only 7 categorical columns.

In [6]:
import pandas as pd
from tabulate import tabulate

categorical_columns = teacher.select_dtypes(include=["object", "category"]).columns
numeric_columns = teacher.select_dtypes(include=["int64", "float64"]).columns

column_types_df = pd.DataFrame(
    {
        "Column type": ["Numeric", "Categorical"],
        "Number of columns": [len(numeric_columns), len(categorical_columns) ],
        "Column names": [
            ", ".join(numeric_columns),
            ", ".join(categorical_columns),
        ],
    }
)

print(
    tabulate(
        column_types_df,
        headers="keys",
        tablefmt="pretty",
        showindex=False,
        colalign=("left", "left", "left"),
    )
)


In [7]:
teacher.describe()

|   | CNTRYID | CNTSCHID | CNTTCHID | TEACHERID | Region | OECD | ADMINMODE | LANGTEST | TC001Q01NA | TC002Q01NA | ... | TCDIRINS | FEEDBACK | ADAPTINSTR | FEEDBINSTR | TCATTIMM | GCTRAIN | TCMCEG | GCSELF | W_SCHGRNRABWT | W_FSTUWT_SCH_SUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 972.0 | 1000.0 | 1000.0 | 1000.0 | 972.0 | 962.0 | 963.0 | ... | 146.0 | 803.0 | 146.0 | 146.0 | 802.0 | 768.0 | 804.0 | 803.0 | 1000.0 | 1000.0 |
| mean | 8.0 | 800171.222 | 800513.832 | 4.829218 | 800.0 | 0.0 | 2.0 | 140.0 | 1.284823 | 42.091381 | ... | 0.692434 | 0.463599 | 0.595611 | 0.47565 | 0.690982 | 0.746902 | 0.073679 | 0.874636 | 3.992718 | 104.02442 |
| std | 0.0 | 93.680269 | 295.948883 | 0.376512 | 0.0 | 0.0 | 0.0 | 0.0 | 0.451565 | 10.608315 | ... | 0.902041 | 0.913956 | 0.922965 | 0.790719 | 0.765817 | 0.935982 | 1.091519 | 1.02482 | 2.895212 | 64.165826 |
| min | 8.0 | 800002.0 | 800001.0 | 4.0 | 800.0 | 0.0 | 2.0 | 140.0 | 1.0 | 20.0 | ... | -3.9262 | -2.9984 | -1.6706 | -1.3671 | -1.694 | -1.3375 | -2.716 | -1.6674 | 1.0 | 1.11679 |
| 25% | 8.0 | 800092.0 | 800257.75 | 5.0 | 800.0 | 0.0 | 2.0 | 140.0 | 1.0 | 34.0 | ... | 0.3918 | -0.2844 | 0.0995 | -0.1045 | 0.2587 | 0.1156 | -0.6266 | -0.2634 | 1.422867 | 53.478533 |
| 50% | 8.0 | 800172.0 | 800513.5 | 5.0 | 800.0 | 0.0 | 2.0 | 140.0 | 1.0 | 41.0 | ... | 1.0468 | 0.5297 | 0.4804 | 0.50275 | 1.2332 | 0.7189 | 0.4203 | 1.1563 | 3.43985 | 96.20303 |
| 75% | 8.0 | 800252.0 | 800770.25 | 5.0 | 800.0 | 0.0 | 2.0 | 140.0 | 2.0 | 50.0 | ... | 1.3596 | 1.09365 | 1.109 | 0.9909 | 1.2332 | 1.8008 | 1.125 | 1.9399 | 6.07334 | 146.875957 |
| max | 8.0 | 800336.0 | 801026.0 | 5.0 | 800.0 | 0.0 | 2.0 | 140.0 | 2.0 | 69.0 | ... | 1.3596 | 1.8277 | 2.4533 | 1.5975 | 1.2972 | 1.8008 | 1.125 | 1.9399 | 15.0754 | 361.8096 |


## **2.4 Verify Data Quality**


In [89]:
print("\n--- Missing Values ---")
missing = teacher.isnull().mean().sort_values(ascending=False)
print(missing[missing > 0.7])

Summary of missing values:

            Missing Values  Percentage Missing
EC162Q08HA          577987               94.44
EC162Q06HA          576013               94.12
EC162Q07HA          575891               94.10
EC162Q04HA          575680               94.06
EC162Q05HA          575435               94.02
...                    ...                 ...
PV7READ               5377                0.88
PV6READ               5377                0.88
PV3READ               5377                0.88
GRADE                 3019                0.49
ST004D01T                2                0.00

[1007 rows x 2 columns]
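
The cell above prints only the fraction of missing values for columns above 0.7, whereas the output lists a count and a percentage per column. A table with the two columns shown (`Missing Values`, `Percentage Missing`) could be produced along the following lines. This is a minimal sketch: the function name `missing_summary` and the toy DataFrame are hypothetical and stand in for the real data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, highest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame(
        {
            "Missing Values": counts,
            "Percentage Missing": (counts / len(df) * 100).round(2),
        }
    )
    # Keep only columns that actually have missing values
    summary = summary[summary["Missing Values"] > 0]
    return summary.sort_values("Percentage Missing", ascending=False)


# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame(
    {
        "a": [1.0, None, None, None],
        "b": [1.0, 2.0, None, 4.0],
        "c": [1, 2, 3, 4],
    }
)

summary = missing_summary(toy)
print(summary)

# Number of rows that contain at least one missing value
rows_with_missing = int(toy.isnull().any(axis=1).sum())
print(rows_with_missing)
```

Reporting both the count and the percentage makes it easier to choose a drop threshold, such as the 0.7 used in the cell above, while still seeing how many observations each column retains.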

Number of rows with missing values: 612004

Example of rows with missing values:
   CNTRYID     CNT  CNTSCHID  CNTSTUID      CYC     NatCen     STRATUM  \
0      8.0  b'ALB'  800115.0  800001.0  b'07MS'  b'000800'  b'ALB0107'   
1      8.0  b'ALB'  800300.0  800002.0  b'07MS'  b'000800'  b'ALB0105'   
2      8.0  b'ALB'  800088.0  800003.0  b'07MS'  b'000800'  b'ALB0101'  