# GATHER

In [2]:
import numpy as np
import pandas as pd
import zipfile
import dill

In [None]:
#uncomment one of the lines below to save or load the environment

#dill.dump_session('notebook_session.db') #save
#dill.load_session('notebook_session.db') #load

In [307]:
zf = zipfile.ZipFile('./data/pisa2012.zip')
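#dtype = 'unicode' reads every column as a string, so numeric columns need casting later
#note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; use on_bad_lines = 'skip' there
pisa2012 = pd.read_csv(zf.open('pisa2012.csv'), encoding = 'latin-1', sep = ',',
                   error_bad_lines = False, index_col = False, dtype = 'unicode')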

> For our data wrangling process, I aim to focus on how both environmental factors and demographic factors affect a student's academic performance, future optimism and self-esteem. 

- Environmental factors will consist of ***teacher helpfulness***, ***sense of belonging***, and ***acculturation.***

- Demographic factors will consist of ***gender*** and ***parent background***.

- Academic performance will consist of math, reading, and science ***scores*** and ***work ethic***.
- ***Future optimism*** and ***self-esteem*** will both be standalone.

I'm essentially exploring the idea of, "If a student isn't getting any help and feels like an outsider, then that'll lower their commitment for school and hurt their self-esteem, in turn lowering their academic performance." On the flip side, a student who ***is*** getting help and feels comfortable in the school environment would have greater optimism in committing to school, thus raising their academic performance.

At the same time, I want to see how parent background, country, and gender can drive these variables against/in favor of the student.

After looking through the dataset, I have isolated these variables.

### Student Information
- **Gender, Scores + Parent Background**
    - **ST04Q01** - Gender
    - **PV1MATH** -	Plausible value 1 in mathematics
    - **PV1READ** - Plausible value 1 in reading
    - **PV1SCIE** -	Plausible value 1 in science
    - **OCOD1** - ISCO-08 Occupation code - Mother
    - **OCOD2** - ISCO-08 Occupation code - Father
    - **ST22Q01** -	Acculturation - Mother Immigrant (Filter)
    
- **STIDSTD** - student ID
- **CNT** - 3-character country code


### Teacher Helpfulness/Involvement
    
- **Teacher Support**
    - **ST83Q01** -	Teacher Support - Lets Us Know We Have to Work Hard
    - **ST83Q02** -	Teacher Support - Provides Extra Help When Needed
    - **ST83Q03** -	Teacher Support - Helps Students with Learning
    - **ST83Q04** -	Teacher Support - Gives Opportunity to Express Opinions
    
- **Student-Teacher Relation**
    - **ST86Q01** -	Student-Teacher Relation - Get Along with Teachers
    - **ST86Q02** -	Student-Teacher Relation - Teachers Are Interested
    - **ST86Q03** -	Student-Teacher Relation - Teachers Listen to Students
    - **ST86Q04** -	Student-Teacher Relation - Teachers Help Students
    - **ST86Q05** -	Student-Teacher Relation - Teachers Treat Students Fair
    
- **Teacher-Directed Instruction**
    - **ST79Q01** -	Teacher-Directed Instruction - Sets Clear Goals
    - **ST79Q02** -	Teacher-Directed Instruction - Encourages Thinking and Reasoning
    - **ST79Q06** -	Teacher-Directed Instruction - Checks Understanding
    - **ST79Q08** -	Teacher-Directed Instruction - Summarizes Previous Lessons
    - **ST79Q15** -	Teacher-Directed Instruction - Informs about Learning Goals
    
### Sense of Belonging

- **Sense of Belonging**
    - **ST87Q01** - Sense of Belonging - Feel Like Outsider
    - **ST87Q02** -	Sense of Belonging - Make Friends Easily
    - **ST87Q03** -	Sense of Belonging - Belong at School
    - **ST87Q04** -	Sense of Belonging - Feel Awkward at School
    - **ST87Q05** -	Sense of Belonging - Liked by Other Students
    - **ST87Q06** -	Sense of Belonging - Feel Lonely at School
    - **ST87Q07** -	Sense of Belonging - Feel Happy at School
    - **ST87Q08** -	Sense of Belonging - Things Are Ideal at School
    - **ST87Q09** -	Sense of Belonging - Satisfied at School
    - **BELONG**  - Sense of Belonging to School
    
### Acculturation

> The reason I picked only the host culture acculturations is because I want to see how well a student is ***fitting in*** to the culture of the school they're currently attending.

- **Acculturation**
    - **ST23Q01** -	Acculturation - Enjoy Host Culture Friends
    - **ST23Q03** -	Acculturation - Enjoy Host Culture Celebrations
    - **ST23Q05** -	Acculturation - Spend Time with Host Culture Friends
    - **ST23Q07** -	Acculturation - Participate in Host Culture Celebrations
    - **ST24Q01** -	Acculturation - Perceived Host-Heritage Cultural Differences - Values
    - **ST24Q02** -	Acculturation - Perceived Host-Heritage Cultural Differences - Mother Treatment
    - **ST24Q03** -	Acculturation - Perceived Host-Heritage Cultural Differences - Teacher Treatment


### Work Ethic

> I folded math work ethic into overall work ethic.

- **Math Work Ethic**
    - **ST46Q01** -	Math Work Ethic - Homework Completed in Time
    - **ST46Q02** -	Math Work Ethic - Work Hard on Homework
    - **ST46Q03** -	Math Work Ethic - Prepared for Exams
    - **ST46Q04** -	Math Work Ethic - Study Hard for Quizzes
    - **ST46Q05** -	Math Work Ethic - Study Until I Understand Everything
    - **ST46Q06** -	Math Work Ethic - Pay Attention in Classes
    - **ST46Q07** -	Math Work Ethic - Listen in Classes
    - **ST46Q08** -	Math Work Ethic - Avoid Distractions When Studying
    - **ST46Q09** -	Math Work Ethic - Keep Work Organized

- **Perseverance**
    - **ST93Q01** -	Perseverance - Give up easily
    - **ST93Q03** -	Perseverance - Put off difficult problems
    - **ST93Q04** -	Perseverance - Remain interested
    - **ST93Q06** -	Perseverance - Continue to perfection
    - **ST93Q07** -	Perseverance - Exceed expectations
    
- **Truancy**
    - **ST08Q01** - Truancy - Late for School
    - **ST09Q01** - Truancy - Skip whole school day
    - **ST115Q01** - Truancy - Skip classes within school day
    
- **Out-of-School Study Time**
    - **ST57Q01** -	Out-of-School Study Time - Homework
    - **ST57Q02** -	Out-of-School Study Time - Guided Homework
    - **ST57Q03** -	Out-of-School Study Time - Personal Tutor
    - **ST57Q04** -	Out-of-School Study Time - Commercial Company
    - **ST57Q05** -	Out-of-School Study Time - With Parent
    - **ST57Q06** -	Out-of-School Study Time - Computer


### Future Optimism

- **Attitudes toward School**
    - **ST88Q01** -	Attitude towards School - Does Little to Prepare Me for Life
    - **ST88Q02** -	Attitude towards School - Waste of Time
    - **ST88Q03** -	Attitude towards School - Gave Me Confidence
    - **ST88Q04** -	Attitude towards School - Useful for Job
    - **ST89Q02** -	Attitude toward School - Helps to Get a Job
    - **ST89Q03** -	Attitude toward School - Prepare for College
    - **ST89Q04** -	Attitude toward School - Enjoy Good Grades
    - **ST89Q05** -	Attitude toward School - Trying Hard is Important
    
- **Instrumental Motivation**
    - **ST29Q07** -	Instrumental Motivation - Important for Future Study
    - **ST29Q08** -	Instrumental Motivation - Helps to Get a Job

### Self-Esteem
- **Perceived Control**
    - **ST91Q01** -	Perceived Control - Can Succeed with Enough Effort
    - **ST91Q02** -	Perceived Control - My Choice Whether I Will Be Good
    - **ST91Q03** -	Perceived Control - Problems Prevent from Putting Effort into School
    - **ST91Q04** -	Perceived Control - Different Teachers Would Make Me Try Harder
    - **ST91Q05** -	Perceived Control - Could Perform Well if I Wanted
    - **ST91Q06** -	Perceived Control - Perform Poor Regardless    
    
- **Math Self-Concept** 
    - **ST42Q02** -	Math Self-Concept - Not Good at Maths
    - **ST42Q04** -	Math Self-Concept - Get Good Grades
    - **ST42Q06** -	Math Self-Concept - Learn Quickly
    - **ST42Q07** -	Math Self-Concept - One of Best Subjects
    - **ST42Q09** -	Math Self-Concept - Understand Difficult Work
    
After isolating the above variables, I intend to have these 14 columns in the final clean CSV:

- student ID
- country
- gender
- overall math score
- overall reading score
- overall science score
- teacher helpfulness/involvement
- sense of belonging
- acculturation
- work ethic
- future optimism
- self-esteem
- mother work background
- father work background

Each composite variable from `teacher helpfulness` to `self-esteem` will be quantified as an average score over its survey items, then normalized so that all six share the same scale.
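
The scoring idea can be sketched on a toy frame. This is a minimal sketch, assuming each agree/disagree answer is mapped to an ordinal rank, the ranks are averaged per student, and the average is min-max scaled to [0, 1]. The `LIKERT` mapping, the `score_group` helper, and the toy columns `Q1`/`Q2` are illustrative assumptions, not the cleaning code used later in this notebook.

```python
import pandas as pd

# Assumed ordinal coding for the agree/disagree scale (illustrative only).
LIKERT = {'Strongly disagree': 1, 'Disagree': 2, 'Agree': 3, 'Strongly agree': 4}

def score_group(df, columns, mapping):
    """Map answers to ranks, average per row (skipping NaN), scale to [0, 1]."""
    ranks = df[columns].apply(lambda col: col.map(mapping))
    mean_rank = ranks.mean(axis=1)  # rows with no answers at all stay NaN
    lo, hi = min(mapping.values()), max(mapping.values())
    return (mean_rank - lo) / (hi - lo)

# Hypothetical survey answers for three students.
toy = pd.DataFrame({
    'Q1': ['Strongly agree', 'Disagree', None],
    'Q2': ['Agree', 'Strongly disagree', None],
})
toy['score'] = score_group(toy, ['Q1', 'Q2'], LIKERT)
print(toy)
```

Negatively worded items (for example "Feel Like Outsider") would likely need a reversed mapping before averaging, so that a higher score always points in the same direction.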

In [359]:
#create categories for each variable
student_info = ['STIDSTD', 'CNT', 'ST04Q01', 'PV1MATH', 'PV1READ', 'PV1SCIE', 'OCOD1', 'OCOD2', 'ST22Q01']

In [360]:
#teacher helpfulness/involvement
#each ST83 item is listed once; repeating them would double-count 4 columns (18 instead of 14)
teacher_support = ['ST83Q01', 'ST83Q02', 'ST83Q03', 'ST83Q04']
teacher_relation = ['ST86Q01', 'ST86Q02', 'ST86Q03', 'ST86Q04', 'ST86Q05']
teacher_direct = ['ST79Q01', 'ST79Q02', 'ST79Q06', 'ST79Q08', 'ST79Q15']

teacher_helpfulness = np.concatenate([teacher_support, teacher_relation, teacher_direct]).ravel()

#num of columns
len(pisa2012[teacher_helpfulness].columns)

14

In [362]:
#sense of belonging
belonging = ['ST87Q01', 'ST87Q02', 'ST87Q03', 'ST87Q04', 'ST87Q05', 'ST87Q06', 'ST87Q07', 'ST87Q08', 'ST87Q09'] 

len(pisa2012[belonging].columns)

9

In [357]:
#acculturation
acculturation = ['ST23Q01', 'ST23Q03', 'ST23Q05', 'ST23Q07', 'ST24Q01', 'ST24Q02', 'ST24Q03']

len(pisa2012[acculturation].columns)

7

In [343]:
#work ethic
math_ethic = ['ST46Q01', 'ST46Q02', 'ST46Q03', 'ST46Q04', 'ST46Q05', 'ST46Q06', 'ST46Q07', 'ST46Q08', 'ST46Q09']
perseverance = ['ST93Q01', 'ST93Q03', 'ST93Q04', 'ST93Q06', 'ST93Q07']
truancy = ['ST08Q01', 'ST09Q01', 'ST115Q01']

work_ethic = np.concatenate([math_ethic, perseverance, truancy]).ravel()
pisa2012[work_ethic].columns.shape[0]

17

In [344]:
#future optimism
attitude_school =['ST88Q01', 'ST88Q02', 'ST88Q03', 'ST88Q04', 'ST89Q02', 'ST89Q03', 'ST89Q04', 'ST89Q05']
motivation = ['ST29Q07', 'ST29Q08']

future_optimism = np.concatenate([attitude_school, motivation]).ravel()
pisa2012[future_optimism].columns.shape[0]

10

In [361]:
#self esteem
perceived_control = ['ST91Q01', 'ST91Q02', 'ST91Q03', 'ST91Q04', 'ST91Q05', 'ST91Q06']
math_self_concept = ['ST42Q02', 'ST42Q04', 'ST42Q06', 'ST42Q07', 'ST42Q09']

self_esteem = np.concatenate([perceived_control, math_self_concept]).ravel()
pisa2012[self_esteem].columns.shape[0]

11

### Observations
1. `teacher_helpfulness` has a total of 14 distinct columns (18 when the four `ST83` items are listed twice).
2. `belonging` has a total of 9 columns.
3. `acculturation` has a total of 7 columns.
4. `work_ethic` has a total of 17 columns.
5. `future_optimism` has a total of 10 columns.
6. `self_esteem` has a total of 11 columns.
7. `student_info` has a total of 9 columns.

# ASSESS

### Overall

In [363]:
pisa2012.shape

(485490, 636)

In [347]:
#number of rows with duplicate data
pisa2012.duplicated().sum()

0

In [318]:
#largest number of missing values found in any single column
pisa2012.isna().sum().max()

472209

In [319]:
print('The sparsest column in this PISA dataset has {} non-missing values.'.format(pisa2012.shape[0] - pisa2012.isna().sum().max()))

The sparsest column in this PISA dataset has 13281 non-missing values.
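
Note that `isna().sum().max()` counts missing values per column and keeps the largest count; counting rows with missing data is a different computation. A small sketch on a toy frame (the data and variable names are illustrative) shows how the two differ:

```python
import numpy as np
import pandas as pd

# Toy frame: column 'b' is the sparsest, but missing values are spread over rows.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 3.0, 4.0],
    'c': [1.0, 2.0, np.nan, 4.0],
})

max_missing_in_a_column = toy.isna().sum().max()      # per-column count, then max
rows_with_any_missing = toy.isna().any(axis=1).sum()  # per-row check
complete_rows = len(toy.dropna())                     # rows with no NaN at all

print(max_missing_in_a_column, rows_with_any_missing, complete_rows)
```

On the full dataset, `pisa2012.isna().any(axis=1).sum()` and `len(pisa2012.dropna())` would give the row-level counts; they have not been run here, so no figures are quoted.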


> Comb through each variable category one at a time.

### Student Info

In [364]:
df_student_info = pisa2012[student_info]

In [366]:
df_student_info.head()

Unnamed: 0,STIDSTD,CNT,ST04Q01,PV1MATH,PV1READ,PV1SCIE,OCOD1,OCOD2,ST22Q01
0,1,Albania,Female,406.8469,249.5762,341.7009,Building architects,Primary school teachers,
1,2,Albania,Female,486.1427,406.2936,548.9929,"Tailors, dressmakers, furriers and hatters",Building construction labourers,
2,3,Albania,Female,533.2684,401.21,499.6643,Housewife,Bricklayers and related workers,
3,4,Albania,Female,412.2215,547.363,438.6796,Housewife,"Cleaners and helpers in offices, hotels and ot...",
4,5,Albania,Female,381.9209,311.7707,361.5628,Housewife,Economists,


In [367]:
df_student_info.describe()

Unnamed: 0,STIDSTD,CNT,ST04Q01,PV1MATH,PV1READ,PV1SCIE,OCOD1,OCOD2,ST22Q01
count,485490,485490,485490,485490.0,485490.0,485490.0,483887,482936,40721
unique,33806,68,2,7063.0,11556.0,6153.0,588,588,2
top,205,Mexico,Female,444.158,510.1099,472.5289,Housewife,Missing,Yes
freq,68,33806,245064,359.0,355.0,753.0,74358,36559,31692


In [323]:
#first 3 results of most common father occupations
pisa2012[student_info].OCOD2.value_counts()[:3]

Missing                                                         36559
Vague(a good job, a quiet job, a well paid job, an office jo    14716
Heavy truck and lorry drivers                                   11816
Name: OCOD2, dtype: int64

- It appears that Mexico is the most common country, with a frequency of 33806 out of 485490 total students.
- It appears that there are more female than male students.
- It appears that the most common occupation for a mother is a housewife.
- Besides missing and vague information, it appears that the most common occupation for a father is a heavy truck/lorry driver.
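- Only 40721 of the 485490 students have a value for the mother-immigrant filter question (`ST22Q01`); among those, most (31692) answered "Yes." This does not mean that most students' mothers are immigrants.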


- It appears that the most common math score is 444.158, with a frequency of 359 out of 485490 total scores. 
- It appears that the most common reading score is 510.1099, with a frequency of 355 out of 485490 total scores.
- It appears that the most common science score is 472.5289, with a frequency of 753 out of 485490 total scores.

In [368]:
df_student_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 485490 entries, 0 to 485489
Data columns (total 9 columns):
STIDSTD    485490 non-null object
CNT        485490 non-null object
ST04Q01    485490 non-null object
PV1MATH    485490 non-null object
PV1READ    485490 non-null object
PV1SCIE    485490 non-null object
OCOD1      483887 non-null object
OCOD2      482936 non-null object
ST22Q01    40721 non-null object
dtypes: object(9)
memory usage: 33.3+ MB


**Note:** math, reading, and science scores are recorded as strings.

**Student ID**

In [369]:
#count rows whose student ID repeats one seen in an earlier row
df_student_info.STIDSTD.duplicated().sum()

451684

There are 451684 rows whose student ID repeats one seen earlier, but that does not mean the same student was recorded more than once. The repeated IDs sampled below mostly belong to different countries, and the other fields in each record differ widely.

In [370]:
df_student_info[df_student_info['STIDSTD'].duplicated()].sample(10)

Unnamed: 0,STIDSTD,CNT,ST04Q01,PV1MATH,PV1READ,PV1SCIE,OCOD1,OCOD2,ST22Q01
181698,6024,Finland,Female,631.4146,654.5949,632.0776,Beauticians and related workers,Motor vehicle mechanics and repairers,Yes
227438,428,Ireland,Female,487.8564,519.8799,539.668,Food and beverage tasters and graders,Building construction labourers,
223419,2031,Indonesia,Male,488.0122,435.6785,417.5121,Housewife,Stall and market salespersons,
368740,145,New Zealand,Male,648.5512,577.302,617.1578,Receptionists (general),Machinery mechanics and repairers,
240719,130,Italy,Female,582.1857,620.2807,644.2932,Veterinarians,Veterinarians,Yes
55576,311,Brazil,Female,373.5083,435.4448,424.972,"Sewing, embroidery and related workers",Bricklayers and related workers,
97733,1720,Switzerland,Male,535.6052,428.1402,537.8963,Nursing associate professionals,Managing directors and chief executives,
144720,6619,Denmark,Female,483.9617,556.1799,528.1984,Shop sales assistants,"Car, taxi and van drivers",
144667,6566,Denmark,Female,416.272,478.1786,397.8366,"Butchers, fishmongers and related food preparers",Personal care workers in health services not e...,
429018,697,Serbia,Female,599.4781,441.9581,523.4427,Shop salespersons,Bus and tram drivers,Yes


Let's pick 1 duplicated student ID and examine it.

In [371]:
df_student_info[df_student_info['STIDSTD'] == '03425'].sample(10)

Unnamed: 0,STIDSTD,CNT,ST04Q01,PV1MATH,PV1READ,PV1SCIE,OCOD1,OCOD2,ST22Q01
131198,3425,Czech Republic,Female,374.8325,417.9699,407.2548,Waiters,"Car, taxi and van drivers",
451755,3425,Chinese Taipei,Male,619.4189,564.4709,549.6457,"Sewing, embroidery and related workers",Process control technicians not elsewhere clas...,
464407,3425,Tunisia,Male,523.7654,476.8183,539.295,Housewife,House builders,
319127,3425,Mexico,Female,345.4666,238.4558,353.8232,Housewife,"Vague(a good job, a quiet job, a well paid job...",
275087,3425,Jordan,Male,364.239,344.9785,363.0548,Housewife,"Social beneficiary (unemployed, retired, sickn...",
58690,3425,Brazil,Male,480.768,431.5886,483.7188,Domestic housekeepers,Bricklayers and related workers,
205200,3425,Greece,Male,376.8578,346.422,372.1932,Hairdressers,Building construction labourers,
299610,3425,Lithuania,Male,641.2292,610.1818,628.9071,"Sewing, embroidery and related workers",Heavy truck and lorry drivers,
192541,3425,United Kingdom,Female,489.57,516.7027,484.6512,Construction managers,Construction managers,
382346,3425,Poland,Male,534.5926,504.4052,561.768,Mixed crop and animal producers,Mixed crop and animal producers,


This evidence, coupled with the fact that there are ***no*** duplicated rows in the entire PISA dataset, strongly suggests that repeated student IDs do not indicate repeated students.

The most likely explanation is that student IDs are assigned separately within each country (and possibly across ***different schools***). The earlier `describe()` output is consistent with this: the most frequent ID appears 68 times, matching the 68 country labels, and the 33806 unique IDs match the 33806 records from Mexico, the largest sample. Records that share an ID differ in gender, country, and scores.
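
If IDs are only unique within a country, the pair of country and student ID should identify a record. A minimal sketch of that check on toy data (the rows below are made up for illustration; the column names follow the dataset):

```python
import pandas as pd

# Toy records: the same ID string appears in several countries.
toy = pd.DataFrame({
    'CNT': ['Mexico', 'Brazil', 'Brazil', 'Poland'],
    'STIDSTD': ['03425', '03425', '00311', '03425'],
})

repeated_ids = toy['STIDSTD'].duplicated().sum()
repeated_pairs = toy.duplicated(subset=['CNT', 'STIDSTD']).sum()

print(repeated_ids, repeated_pairs)
```

Running `pisa2012.duplicated(subset=['CNT', 'STIDSTD']).sum()` on the full frame would test the same assumption; a result of 0 would support treating the pair as a key.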

**Country**

In [372]:
#number of entries with missing countries
sum(df_student_info.CNT.isnull())

0

In [373]:
df_student_info.CNT.nunique()

68

> Our dataset includes data from 65 countries, but our DataFrame shows that there are 68 countries. What are the 3 additional rogue countries then?

In [374]:
#Massachusetts, Florida, and Connecticut recorded as separate countries.
df_student_info.CNT.value_counts()[-5:]

Florida (USA)               1896
Perm(Russian Federation)    1761
Massachusetts (USA)         1723
Connecticut (USA)           1697
Liechtenstein                293
Name: CNT, dtype: int64
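
One possible way to handle the state-level labels is to fold them into their parent country. The sketch below is only an illustration on a toy Series: the `STATE_TO_COUNTRY` mapping is an assumption, and whether the state samples should be merged, kept separate, or dropped is a judgment call this notebook has not yet made.

```python
import pandas as pd

# Assumed mapping from state-level labels to the parent country label.
STATE_TO_COUNTRY = {
    'Florida (USA)': 'United States of America',
    'Massachusetts (USA)': 'United States of America',
    'Connecticut (USA)': 'United States of America',
}

toy = pd.Series(
    ['Florida (USA)', 'Massachusetts (USA)', 'Connecticut (USA)',
     'United States of America', 'Liechtenstein'],
    name='CNT',
)
merged = toy.replace(STATE_TO_COUNTRY)  # labels not in the mapping are unchanged

print(merged.value_counts())
```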

**Gender**

In [375]:
#number of entries with no gender entry
sum(df_student_info.ST04Q01.isnull())

0

In [376]:
#check male to female ratio
gender_counts = df_student_info.ST04Q01.value_counts()
gender_counts

Female    245064
Male      240426
Name: ST04Q01, dtype: int64

In [377]:
#index by label rather than position; positional access on a labelled Series is deprecated in newer pandas
total_students = gender_counts.sum()
print('Female students make up {:0.2f}% of the student population.'.format((gender_counts['Female'] / total_students) * 100))
print('Male students make up {:0.2f}% of the student population.'.format((gender_counts['Male'] / total_students) * 100))

Female students make up 50.48% of the student population.
Male students make up 49.52% of the student population.


> Male and female students are about **evenly split**, with female students outnumbering male students by only 0.96 percentage points.

**Math, Reading, and Science Score**

> All scores are recorded as strings. To appropriately assess the numbers quantitatively, I will temporarily change them to floats.

In [378]:
#number of entries with no entry for math, science, and reading scores
print(sum(df_student_info.PV1MATH.isnull()))
print(sum(df_student_info.PV1READ.isnull()))
print(sum(df_student_info.PV1SCIE.isnull()))

0
0
0


In [379]:
#look at 5-Number Summary for all scores
df_student_info[['PV1MATH', 'PV1READ', 'PV1SCIE']].astype(float).describe()

Unnamed: 0,PV1MATH,PV1READ,PV1SCIE
count,485490.0,485490.0,485490.0
mean,469.621653,472.00464,475.769824
std,103.265391,102.505523,101.464426
min,19.7928,0.0834,2.6483
25%,395.3186,403.6007,404.4573
50%,466.2019,475.455,475.6994
75%,541.0578,544.5025,547.7807
max,962.2293,904.8026,903.3383


- On average, students scored slightly higher in science (mean 475.77) than in reading (472.00) or math (469.62).
- The 25th, 50th, and 75th percentiles are similar across math, reading, and science.
- The highest recorded score is 962.2293, in math.
- The lowest recorded score is 0.0834, in reading.
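
The temporary `astype(float)` cast works here because every score string is a valid number. A slightly more defensive sketch, assuming some strings might not parse, uses `pd.to_numeric` with `errors='coerce'`; the `'n/a'` entry below is a hypothetical bad value, not something observed in the dataset.

```python
import pandas as pd

# Two real-looking score strings plus one hypothetical unparseable entry.
toy = pd.DataFrame({'PV1MATH': ['406.8469', '486.1427', 'n/a']})

# Unparseable strings become NaN instead of raising an error.
toy['PV1MATH'] = pd.to_numeric(toy['PV1MATH'], errors='coerce')

print(toy['PV1MATH'].dtype)
print(toy['PV1MATH'].isna().sum())
```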

> Initially, I was going to handle missing and duplicate data among the upcoming **environmental + student effect** variables. However, I concluded that it's unnecessary to drop rows with missing or duplicate data, because we have already established that 1) there are no duplicate records of the same student, and 2) NaN values are to be expected, especially in columns that are not required information. As long as the non-null counts for student IDs, scores, and gender (among other necessary fields) equal the total number of students, we have nothing to worry about.

Given that, the rest of the data assessment focuses only on 1) isolating variables, 2) finding unique values, and 3) finding a way to normalize them quantitatively.

In [380]:
#function to isolate unique values among multiple columns in a df
def comb_cols(df, columns):
    return (pd.unique(df[columns].values.ravel()))

### Teacher Helpfulness/Involvement

- **Teacher Helpfulness** measures how much time a teacher dedicates to helping a student improve their academic performance.


In [381]:
#unique values across all teacher support categories
pd.unique(pisa2012[teacher_helpfulness].values.ravel())

array(['Strongly disagree', 'Agree', 'Disagree', 'Never or Hardly Ever',
       'Most Lessons', 'Every Lesson', nan, 'Strongly agree',
       'Some Lessons'], dtype=object)

> It appears some categories under `teacher_helpfulness` consist of different values. Let's check them more in-depth:

In [382]:
#list all unique values per column under teacher_helpfulness
print('Teacher support:', comb_cols(pisa2012, teacher_support))
print('Teacher relation:', comb_cols(pisa2012, teacher_relation))
print('Teacher direction:', comb_cols(pisa2012, teacher_direct))

Teacher support: ['Strongly disagree' 'Agree' nan 'Strongly agree' 'Disagree']
Teacher relation: ['Agree' 'Strongly disagree' 'Disagree' nan 'Strongly agree']
Teacher direction: ['Never or Hardly Ever' 'Most Lessons' 'Every Lesson' nan 'Some Lessons']


- The values in `teacher_support` and `teacher_relation` are exactly the same: **"Strongly disagree," "Disagree," "Agree," "Strongly agree," and "NaN."** These two groups can therefore share one quantification scheme.

- However, the values under `teacher_direct` are ***"Never or Hardly Ever," "Some Lessons," "Most Lessons," "Every Lesson," and "NaN,"*** so this group needs its own mapping before it can be combined with the other two.

### Sense of Belonging

- **Sense of Belonging** measures how comfortable a student feels in their immediate social environment at school.

In [383]:
#list all unique values per column
print('Sense of Belonging:', comb_cols(pisa2012, belonging))

Sense of Belonging: ['Agree' 'Strongly disagree' 'Strongly agree' nan 'Disagree']


- `belonging` has the same column values as `teacher_support` and `teacher_relation`.

### Acculturation
- **Acculturation** measures how well a student is assimilating into their host culture.

In [384]:
print('Acculturation:', comb_cols(pisa2012, acculturation))

Acculturation: [nan 'Agree' 'Disagree' 'Strongly disagree' 'Strongly agree']


- `acculturation` has the same column values as `belonging`, `teacher_support`, and `teacher_relation`.

### Work Ethic

- **Math Work Ethic** measures a student's focus and dedication to learning math.
- **Perseverance** measures a student's ability to not give up and do better on their work.
- **Truancy** measures a student's attendance.

In [348]:
#comb unique values through all columns
print(comb_cols(pisa2012, work_ethic))

['Agree' 'Very much like me' 'Somewhat like me' 'None  ' '1'
 'Strongly agree' 'Disagree' 'Not at all like me' 'Mostly like me'
 'One or two times  ' 'Not much like me' nan '2' 'Three or four times  '
 '3' 'Strongly disagree' 'Five or more times  ' '4']


In [352]:
#comb unique values individually
print('Math Ethic:', comb_cols(pisa2012, math_ethic))
print('Perseverance:', comb_cols(pisa2012, perseverance))
print('Truancy:', comb_cols(pisa2012, truancy))

Math Ethic: ['Agree' 'Strongly agree' 'Disagree' nan 'Strongly disagree']
Perseverance: ['Very much like me' 'Somewhat like me' 'Not at all like me'
 'Mostly like me' 'Not much like me' nan]
Truancy: ['None  ' '1' 'One or two times  ' '2' 'Three or four times  ' nan '3'
 'Five or more times  ' '4']


- `math_ethic` has the same column values as `acculturation`, `belonging`, `teacher_support`, and `teacher_relation`.
- `perseverance` and `truancy` have no common values with each other or any other previous variable.

### Future Optimism
- **Attitude towards School** measures a student's current perception of how useful school is in preparing them for college or a job.
- **Instrumental Motivation** measures a student's current perception of how helpful school is in future career/study prospects.

In [388]:
#comb unique values through all columns
comb_cols(pisa2012, future_optimism)

array(['Agree', 'Strongly agree', 'Strongly disagree', nan, 'Disagree'],
      dtype=object)

> Both column groups under `future_optimism` appear to share the same values. To double-check, let's look at them individually:

In [389]:
print('Attitudes toward School: ', comb_cols(pisa2012, attitude_school))
print('Motivation: ', comb_cols(pisa2012, motivation))

Attitudes toward School:  ['Agree' 'Strongly agree' 'Strongly disagree' nan 'Disagree']
Motivation:  ['Agree' 'Strongly agree' 'Disagree' nan 'Strongly disagree']


- Both `attitude_school` and `motivation` share the same values as `math_ethic`, `acculturation`, `belonging`, `teacher_support`, and `teacher_relation`.

### Self-Esteem
- **Perceived Control** measures if a student feels in control of their success.
- **Math Self-Concept** measures how a student feels about themselves in relation to math.

In [390]:
#comb unique values through all columns
comb_cols(pisa2012, self_esteem)

array(['Strongly agree', 'Strongly disagree', 'Disagree', 'Agree', nan],
      dtype=object)

In [391]:
#comb unique values individually
print('Perceived Control: ', comb_cols(pisa2012, perceived_control))
print('Math Self-Concept: ', comb_cols(pisa2012, math_self_concept))

Perceived Control:  ['Strongly agree' 'Strongly disagree' 'Disagree' nan 'Agree']
Math Self-Concept:  ['Disagree' 'Agree' nan 'Strongly agree' 'Strongly disagree']




- Values among `teacher_support`, `teacher_relation`, `belonging`, `acculturation`, `math_ethic`, `attitude_school`, `motivation`, `perceived_control`, and `math_self_concept` are **"Strongly disagree," "Disagree," "Agree," "Strongly agree," and "NaN."**

- Values under `teacher_direct` are **"Every Lesson," "Most Lessons," "Some Lessons," "Never or Hardly Ever," and "NaN."**

- Values under `perseverance` are **"Very much like me," "Mostly like me," "Somewhat like me," "Not much like me," "Not at all like me," and "NaN."**

- Values under `truancy` mix two codings: the text answers **"None," "One or two times," "Three or four times," and "Five or more times"** (stored with trailing spaces, e.g. `'None  '`) and the numeric strings **"1," "2," "3," and "4,"** plus **"NaN."**
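
The trailing spaces visible in the raw `truancy` output (for example `'None  '`) would make otherwise identical answers compare as different strings. A minimal sketch of stripping them, using a toy Series with values copied from the output above:

```python
import pandas as pd

# Raw values as they appear in the truancy columns, trailing spaces included.
toy = pd.Series(['None  ', 'One or two times  ', '1', 'Five or more times  ', None])

stripped = toy.str.strip()  # missing entries stay missing

print(sorted(stripped.dropna().unique()))
```

How the text answers and the numeric strings `'1'` to `'4'` relate to each other is not established here, so any mapping between the two codings would need to be checked against the PISA codebook first.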

## Quality
- Erroneous data types for math, reading, and science scores.
- Florida (USA), Massachusetts (USA), and Connecticut (USA) considered as separate countries from United States of America.
- Ordinal nature of categorical variables from **teacher support** to **perceived control** means they can be quantified.
- Inconsistent and unnormalized values among these variable groups:
    - `teacher_support`, `teacher_relation`, `belonging`, `acculturation`, `math_ethic`, `attitude_school`, `motivation`, `perceived_control`, `math_self_concept`
    - `teacher_direct`
    - `perseverance`
    - `truancy` (text values also carry trailing whitespace, e.g. `'None  '`)
- Non-descriptive column names for all variable categories.


## Tidiness
- Abundance of unnecessary variables (636 columns in total) besides the ones isolated in the pre-assessment stage.

# CLEAN