# Data Wrangling

In this notebook, we clean the data, fix missing values, and add new columns with meaningful values.

## Loading modules

In [1]:
import pandas as pd
import numpy as np

In [2]:
from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

## Loading the data

We acquired the test score data for the [California Assessment of Student Performance and Progress (CAASPP)](https://caaspp.cde.ca.gov/). The data are available for the years 2015 through 2018.

* [CAASPP test scores](https://caaspp.cde.ca.gov/sb2018/ResearchFileList) 

Additional datasets are obtained from the following sites:
* [Civil Rights Data Collection](https://ocrdata.ed.gov/): Teacher demographics
* [Zillow research data](https://www.zillow.com/research/data/): House prices based on zipcodes
* [GreatSchools API](https://www.greatschools.org/api/docs/technical-overview/): School profile, school reviews, school census data, nearby schools

We first load the 2018 test data.

In [3]:
pd.set_option('display.max_columns', None)

In [4]:
df = pd.read_csv("../Data/sb_ca2018_all_csv_v3/sb_ca2018_all.csv")

In [5]:
df.shape

(3269730, 32)

In [6]:
df.head()

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Subgroup ID,Test Type,Total Tested At Entity Level,Total Tested with Scores,Grade,Test Id,CAASPP Reported Enrollment,Students Tested,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores,Area 1 Percentage Above Standard,Area 1 Percentage Near Standard,Area 1 Percentage Below Standard,Area 2 Percentage Above Standard,Area 2 Percentage Near Standard,Area 2 Percentage Below Standard,Area 3 Percentage Above Standard,Area 3 Percentage Near Standard,Area 3 Percentage Below Standard,Area 4 Percentage Above Standard,Area 4 Percentage Near Standard,Area 4 Percentage Below Standard
0,0,0,0,,2018,1,B,3180554,3177403,3,1,445017,434454,2424.0,26.13,22.09,48.22,23.49,28.29,434193,25.32,44.02,30.66,23.84,43.3,32.85,20.89,61.25,17.86,27.6,47.71,24.68
1,0,0,0,,2018,1,B,3187375,3184687,3,2,445018,436464,2430.9,21.07,27.82,48.89,23.56,27.55,436215,33.59,33.2,33.21,26.72,42.3,30.98,28.8,46.31,24.89,0.0,0.0,0.0
2,0,0,0,,2018,1,B,3187375,3184687,4,2,463838,455589,2467.7,18.46,24.45,42.92,30.81,26.27,455315,29.03,31.02,39.95,21.65,44.73,33.62,24.03,43.78,32.2,0.0,0.0,0.0
3,0,0,0,,2018,1,B,3180554,3177403,4,1,463838,453771,2463.7,26.31,22.36,48.67,19.25,32.08,453491,24.77,46.77,28.46,24.18,44.21,31.61,19.3,63.22,17.48,25.86,48.83,25.31
4,0,0,0,,2018,1,B,3180554,3177403,5,1,469247,459433,2496.3,21.8,27.63,49.43,19.99,30.58,459208,24.22,45.02,30.76,29.2,41.42,29.38,16.52,59.74,23.73,28.29,44.32,27.39


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3269730 entries, 0 to 3269729
Data columns (total 32 columns):
County Code                          int64
District Code                        int64
School Code                          int64
Filler                               float64
Test Year                            int64
Subgroup ID                          int64
Test Type                            object
Total Tested At Entity Level         object
Total Tested with Scores             object
Grade                                int64
Test Id                              int64
CAASPP Reported Enrollment           object
Students Tested                      object
Mean Scale Score                     object
Percentage Standard Exceeded         object
Percentage Standard Met              object
Percentage Standard Met and Above    object
Percentage Standard Nearly Met       object
Percentage Standard Not Met          object
Students with Scores                 object
Area 1 Percen
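
The `info()` output above shows that most count and percentage columns were read as `object` rather than as numbers. The sketch below shows one possible way to convert such columns with `pd.to_numeric`. It runs on a toy DataFrame, and the `*` placeholder for suppressed values is an assumption about the raw file, not something confirmed by this notebook.

```python
import pandas as pd

# Toy stand-in for a few CAASPP columns. The "*" placeholder is an assumption
# about how non-numeric (suppressed) values might appear in the raw file.
sample = pd.DataFrame({
    "Students Tested": ["38", "*", "141"],
    "Mean Scale Score": ["2495.1", "*", "2486.6"],
    "Grade": [3, 3, 4],
})

numeric_cols = ["Students Tested", "Mean Scale Score"]

# errors="coerce" turns every value that cannot be parsed into NaN
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(sample.dtypes)
print(sample.isna().sum())
```

After the conversion, the coerced entries show up as missing values, which can then be counted, dropped, or imputed explicitly.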

The entity file lists the County, District, and School names and codes for all entities as they existed in the selected administration year. This file must be merged with the test data file to attach the entity names to the corresponding score data.

In [8]:
import chardet

#find the file encoding type
with open("../Data/sb_ca2018_all_csv_v3/sb_ca2018entities.csv", 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large
    
entities = pd.read_csv("../Data/sb_ca2018_all_csv_v3/sb_ca2018entities.csv", encoding=result['encoding'])

In [9]:
entities.shape

(11333, 10)

In [20]:
entities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11333 entries, 0 to 11332
Data columns (total 10 columns):
County Code      11333 non-null int64
District Code    11333 non-null int64
School Code      11333 non-null int64
Filler           0 non-null float64
Test Year        11333 non-null int64
Type Id          11333 non-null int64
County Name      11333 non-null object
District Name    11274 non-null object
School Name      10251 non-null object
Zip Code         11333 non-null object
dtypes: float64(1), int64(5), object(4)
memory usage: 885.5+ KB


In [10]:
entities.head()

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Type Id,County Name,District Name,School Name,Zip Code
0,35,67520,6035109,,2018,7,San Benito,Panoche Elementary,Panoche Elementary,95043.0
1,35,67538,0,,2018,6,San Benito,San Benito High,,
2,35,67538,3530029,,2018,7,San Benito,San Benito High,San Andreas Continuation High,95023.0
3,35,67538,3537008,,2018,7,San Benito,San Benito High,San Benito High,95023.0
4,35,67553,0,,2018,6,San Benito,Southside Elementary,,


In [11]:
entities.loc[(entities['District Name'] == 'Irvine Unified'), :]

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Type Id,County Name,District Name,School Name,Zip Code
7499,30,73650,0,,2018,6,Orange,Irvine Unified,,
7500,30,73650,127472,,2018,7,Orange,Irvine Unified,Jeffrey Trail Middle,92620.0
7501,30,73650,129155,,2018,7,Orange,Irvine Unified,Cypress Village Elementary,92620.0
7502,30,73650,129296,,2018,7,Orange,Irvine Unified,Portola Springs Elementary,92618.0
7503,30,73650,133389,,2018,7,Orange,Irvine Unified,Beacon Park,92618.0
7504,30,73650,135137,,2018,7,Orange,Irvine Unified,Eastwood Elementary,92620.0
7505,30,73650,3030129,,2018,7,Orange,Irvine Unified,Creekside High,92606.0
7506,30,73650,3030152,,2018,7,Orange,Irvine Unified,Irvine High,92604.0
7507,30,73650,3030285,,2018,7,Orange,Irvine Unified,Woodbridge High,92604.0
7508,30,73650,3030467,,2018,7,Orange,Irvine Unified,Alternative Education-San Joaquin High,92606.0


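Each `Subgroup ID` identifies a student group, as listed below. With these IDs we can break the results down by student characteristics such as gender, English-language fluency, economic status, and ethnicity.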

In [12]:
subgroup = pd.read_csv("../Data/Subgroups.txt", header=None)
subgroup.shape

(47, 4)

In [13]:
subgroup.head()

Unnamed: 0,0,1,2,3
0,1,1,"""All Students""","""All Students"""
1,3,3,"""Male""","""Gender"""
2,4,4,"""Female""","""Gender"""
3,6,6,"""Fluent English proficient and English only""","""English-Language Fluency"""
4,7,7,"""Initial fluent English proficient (IFEP)""","""English-Language Fluency"""


In [14]:
# Drop column 0, which duplicates column 1. axis=1 selects columns (axis=0 selects rows);
# inplace=True modifies subgroup directly instead of returning a copy.
subgroup.drop(0, axis=1, inplace=True)
subgroup.columns = ['Subgroup ID', 'Student Groups', 'Category']
# Note: sort_values returns a sorted copy. To keep a sorted table, assign it back:
# subgroup = subgroup.sort_values("Category")
subgroup.head(47)

Unnamed: 0,Subgroup ID,Student Groups,Category
0,1,"""All Students""","""All Students"""
1,3,"""Male""","""Gender"""
2,4,"""Female""","""Gender"""
3,6,"""Fluent English proficient and English only""","""English-Language Fluency"""
4,7,"""Initial fluent English proficient (IFEP)""","""English-Language Fluency"""
5,8,"""Reclassified fluent English proficient (RFEP)""","""English-Language Fluency"""
6,28,"""Migrant education""","""Migrant"""
7,31,"""Economically disadvantaged""","""Economic Status"""
8,74,"""Black or African American""","""Ethnicity"""
9,75,"""American Indian or Alaska Native""","""Ethnicity"""


Each `Test Id` value has the following meaning.

In [30]:
tests_id = pd.read_csv("../Data/Tests.txt", header=None)
tests_id.head()

Unnamed: 0,0,1,2
0,Test ID,Test ID Num,Test Name
1,1,1,SB - English Language Arts/Literacy
2,2,2,SB - Mathematics
3,3,3,CAA - English Language Arts/Literacy
4,4,4,CAA - Mathematics
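
In the output above, the header line of `Tests.txt` ends up as data row 0 because the file was read with `header=None`. The sketch below shows how the same content could be read with the first line as column names and turned into an ID-to-name lookup. The inline text is reconstructed from the output above and stands in for the real file, and `tests_demo` and `test_names` are hypothetical names.

```python
import io
import pandas as pd

# Inline stand-in for Tests.txt, reconstructed from the output above
tests_txt = io.StringIO(
    "Test ID,Test ID Num,Test Name\n"
    "1,1,SB - English Language Arts/Literacy\n"
    "2,2,SB - Mathematics\n"
    "3,3,CAA - English Language Arts/Literacy\n"
    "4,4,CAA - Mathematics\n"
)

# header=0 (the default) uses the first line as column names
tests_demo = pd.read_csv(tests_txt, header=0)

# Map the numeric id to a readable test name
test_names = tests_demo.set_index("Test ID Num")["Test Name"].to_dict()
print(test_names[2])
```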


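For example, to select the rows for the **Irvine Unified** district, the **Asian** ethnicity subgroup, and the **3rd** grade, we filter on `District Code`, `Subgroup ID`, and `Grade` as follows. Each entity appears twice in the result, once per `Test Id`, and the rows with `School Code` 0 appear to be district-level aggregates.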

In [15]:
df.loc[
    (df['District Code'] == 73650) & (df['Subgroup ID'] == 76) & (df['Grade'] == 3),
    ['Percentage Standard Exceeded', 'Percentage Standard Met',
     'Percentage Standard Nearly Met', 'Percentage Standard Not Met', 'School Code']
]

Unnamed: 0,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Nearly Met,Percentage Standard Not Met,School Code
1602961,58.03,29.65,8.98,3.34,0
1602962,60.20,21.14,11.81,6.85,0
1603803,74.12,18.82,2.35,4.71,129155
1603804,75.00,16.67,3.57,4.76,129155
1604137,61.97,29.58,7.04,1.41,129296
1604138,60.56,21.13,11.27,7.04,129296
1604465,50.00,32.81,12.50,4.69,133389
1604466,52.38,22.22,12.70,12.70,133389
1604845,44.74,36.84,13.16,5.26,135137
1604846,50.00,33.33,8.33,8.33,135137


In [16]:
# Retrieve the District Code for a District Name from the entities DataFrame

school_code_dict = {}

def make_code_dict(code, name):
    # Add the code -> name pair unless the code is already present
    school_code_dict.setdefault(code, name)

name = 'Irvine Unified'
# Select the column as a Series so that .iloc[0] returns the first matching code as a scalar
code = entities.loc[entities['District Name'] == name, 'District Code'].iloc[0]

# TODO: accept a Series of names and look up all of their codes at once
make_code_dict(code, name)

print(school_code_dict)

{73650: 'Irvine Unified'}
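
The cell above registers one district at a time. A possible way to build the mapping for every district in a single step is sketched below, using a toy slice of the entities table. The names `entities_demo` and `district_code_dict` are hypothetical.

```python
import pandas as pd

# Toy slice of the entities table (rows modeled on the outputs above)
entities_demo = pd.DataFrame({
    "District Code": [67538, 67538, 73650, 73650],
    "School Code": [0, 3537008, 0, 135137],
    "District Name": [
        "San Benito High", "San Benito High",
        "Irvine Unified", "Irvine Unified",
    ],
})

# Keep one row per district, then turn code -> name into a dictionary
district_code_dict = (
    entities_demo.drop_duplicates("District Code")
    .set_index("District Code")["District Name"]
    .to_dict()
)
print(district_code_dict)
```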


In [17]:
entities.loc[(entities['School Name'] == 'Eastwood Elementary'), :]

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Type Id,County Name,District Name,School Name,Zip Code
5605,19,64840,6020903,,2018,7,Los Angeles,Norwalk-La Mirada Unified,Eastwood Elementary,90638
7421,30,66746,6030761,,2018,7,Orange,Westminster,Eastwood Elementary,92683
7504,30,73650,135137,,2018,7,Orange,Irvine Unified,Eastwood Elementary,92620


In [24]:
entities.loc[(entities['School Name'] == 'Eastwood Elementary') & 
             (entities['District Name'] == 'Irvine Unified') & 
             (entities['County Name'] == 'Orange') &
             (entities['Zip Code'] == '92620'), :]

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Type Id,County Name,District Name,School Name,Zip Code
7504,30,73650,135137,,2018,7,Orange,Irvine Unified,Eastwood Elementary,92620


In [25]:
# School Name alone is not unique, so this lookup returns three School Codes
code = entities.loc[(entities['School Name'] == 'Eastwood Elementary'), ['School Code']]

#.iloc[0][0]
#name = 'Irvine Unified'

#make_code_dict(code, name)
code


Unnamed: 0,School Code
5605,6020903
7421,6030761
7504,135137
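
Because a school name can match several schools, a lookup is safer when it also filters on the district and fails loudly if the match is not unique. The helper below is a hypothetical sketch (`find_school_code` is not part of the notebook), run on a toy table modeled on the rows shown above.

```python
import pandas as pd

# Toy table with the three schools that share one name
entities_demo = pd.DataFrame({
    "District Code": [64840, 66746, 73650],
    "School Code": [6020903, 6030761, 135137],
    "District Name": ["Norwalk-La Mirada Unified", "Westminster", "Irvine Unified"],
    "School Name": ["Eastwood Elementary"] * 3,
})

def find_school_code(entities, school_name, district_name):
    """Return the single School Code matching both names, or raise otherwise."""
    match = entities.loc[
        (entities["School Name"] == school_name)
        & (entities["District Name"] == district_name),
        "School Code",
    ]
    if len(match) != 1:
        raise ValueError(f"expected exactly one match, found {len(match)}")
    return int(match.iloc[0])

print(find_school_code(entities_demo, "Eastwood Elementary", "Irvine Unified"))
```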


In [26]:
# Eastwood Elementary, Test Id = 2 (SB - Mathematics)
df.iloc[1604845]

County Code                              30
District Code                         73650
School Code                          135137
Filler                                  NaN
Test Year                              2018
Subgroup ID                              76
Test Type                                 B
Total Tested At Entity Level            141
Total Tested with Scores                141
Grade                                     3
Test Id                                   2
CAASPP Reported Enrollment               39
Students Tested                          38
Mean Scale Score                     2495.1
Percentage Standard Exceeded          44.74
Percentage Standard Met               36.84
Percentage Standard Met and Above     81.58
Percentage Standard Nearly Met        13.16
Percentage Standard Not Met            5.26
Students with Scores                     38
Area 1 Percentage Above Standard      57.89
Area 1 Percentage Near Standard       34.21
Area 1 Percentage Below Standard

In [27]:
# Eastwood Elementary, Test Id = 1 (SB - English Language Arts/Literacy)
# SB = Smarter Balanced Summative Assessments; CAA = California Alternate Assessments
# TODO: load Tests.txt to decode Test Id
df.iloc[1604846]

County Code                              30
District Code                         73650
School Code                          135137
Filler                                  NaN
Test Year                              2018
Subgroup ID                              76
Test Type                                 B
Total Tested At Entity Level            118
Total Tested with Scores                118
Grade                                     3
Test Id                                   1
CAASPP Reported Enrollment               39
Students Tested                          36
Mean Scale Score                     2486.6
Percentage Standard Exceeded          50.00
Percentage Standard Met               33.33
Percentage Standard Met and Above     83.33
Percentage Standard Nearly Met         8.33
Percentage Standard Not Met            8.33
Students with Scores                     36
Area 1 Percentage Above Standard      47.22
Area 1 Percentage Near Standard       41.67
Area 1 Percentage Below Standard

* Selecting a single item: `df.loc[index, col_name]` is label-based, so `index` must be a label that exists in the DataFrame's index; it does not have to be a **string**.

* Join the test data with the entities DataFrame to obtain the name of each school.
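
The join mentioned in the last note could look like the sketch below. It assumes that the combination of `County Code`, `District Code`, and `School Code` identifies an entity, which matches the columns the two files share but is not verified here. The DataFrames are toy data with hypothetical names.

```python
import pandas as pd

# Toy score rows and toy entity rows sharing the three code columns
scores_demo = pd.DataFrame({
    "County Code": [30, 30],
    "District Code": [73650, 73650],
    "School Code": [135137, 129155],
    "Grade": [3, 3],
})
entities_demo = pd.DataFrame({
    "County Code": [30, 30],
    "District Code": [73650, 73650],
    "School Code": [135137, 129155],
    "School Name": ["Eastwood Elementary", "Cypress Village Elementary"],
})

keys = ["County Code", "District Code", "School Code"]

# how="left" keeps every score row; validate raises if an entity key is duplicated
merged = scores_demo.merge(
    entities_demo, on=keys, how="left", validate="many_to_one"
)
print(merged[["School Code", "School Name"]])
```

A left join keeps score rows even when no entity matches, so unmatched rows can be found afterwards by checking for missing `School Name` values.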