# Data Wrangling

In this notebook, we clean the data, handle missing values, and add new columns with meaningful values.

## Loading modules

In [1]:
import pandas as pd
import numpy as np

In [2]:
from IPython.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

## Loading the data

We acquired the test score data from the [California Assessment of Student Performance and Progress (CAASPP)](https://caaspp.cde.ca.gov/). Data are available for 2015 through 2018.

* [CAASPP test scores](https://caaspp.cde.ca.gov/sb2018/ResearchFileList) 

Additional datasets come from the following sites:
* [Civil Rights Data Collection](https://ocrdata.ed.gov/): Teacher demographics
* [Zillow research data](https://www.zillow.com/research/data/): House prices by ZIP code
* [GreatSchools API](https://www.greatschools.org/api/docs/technical-overview/): School profile, school reviews, school census data, nearby schools

We first load the 2018 test data.

In [3]:
pd.set_option('display.max_columns', None)

In [4]:
df = pd.read_csv("../Data/sb_ca2018_all_csv_v3/sb_ca2018_all.csv")

In [5]:
df.shape

(3269730, 32)

In [6]:
df.head()

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Subgroup ID,Test Type,Total Tested At Entity Level,Total Tested with Scores,Grade,Test Id,CAASPP Reported Enrollment,Students Tested,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores,Area 1 Percentage Above Standard,Area 1 Percentage Near Standard,Area 1 Percentage Below Standard,Area 2 Percentage Above Standard,Area 2 Percentage Near Standard,Area 2 Percentage Below Standard,Area 3 Percentage Above Standard,Area 3 Percentage Near Standard,Area 3 Percentage Below Standard,Area 4 Percentage Above Standard,Area 4 Percentage Near Standard,Area 4 Percentage Below Standard
0,0,0,0,,2018,1,B,3180554,3177403,3,1,445017,434454,2424.0,26.13,22.09,48.22,23.49,28.29,434193,25.32,44.02,30.66,23.84,43.3,32.85,20.89,61.25,17.86,27.6,47.71,24.68
1,0,0,0,,2018,1,B,3187375,3184687,3,2,445018,436464,2430.9,21.07,27.82,48.89,23.56,27.55,436215,33.59,33.2,33.21,26.72,42.3,30.98,28.8,46.31,24.89,0.0,0.0,0.0
2,0,0,0,,2018,1,B,3187375,3184687,4,2,463838,455589,2467.7,18.46,24.45,42.92,30.81,26.27,455315,29.03,31.02,39.95,21.65,44.73,33.62,24.03,43.78,32.2,0.0,0.0,0.0
3,0,0,0,,2018,1,B,3180554,3177403,4,1,463838,453771,2463.7,26.31,22.36,48.67,19.25,32.08,453491,24.77,46.77,28.46,24.18,44.21,31.61,19.3,63.22,17.48,25.86,48.83,25.31
4,0,0,0,,2018,1,B,3180554,3177403,5,1,469247,459433,2496.3,21.8,27.63,49.43,19.99,30.58,459208,24.22,45.02,30.76,29.2,41.42,29.38,16.52,59.74,23.73,28.29,44.32,27.39


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3269730 entries, 0 to 3269729
Data columns (total 32 columns):
County Code                          int64
District Code                        int64
School Code                          int64
Filler                               float64
Test Year                            int64
Subgroup ID                          int64
Test Type                            object
Total Tested At Entity Level         object
Total Tested with Scores             object
Grade                                int64
Test Id                              int64
CAASPP Reported Enrollment           object
Students Tested                      object
Mean Scale Score                     object
Percentage Standard Exceeded         object
Percentage Standard Met              object
Percentage Standard Met and Above    object
Percentage Standard Nearly Met       object
Percentage Standard Not Met          object
Students with Scores                 object
Area 1 Percen

The following entity file lists the county, district, and school names and codes for all entities as they existed in the selected administration year. It must be merged with the test data file to attach these entity names to the corresponding score data.

In [8]:
import chardet

# Detect the file encoding before loading, since the file may not be UTF-8
with open("../Data/sb_ca2018_all_csv_v3/sb_ca2018entities.csv", 'rb') as f:
    result = chardet.detect(f.read())  # for a large file, detecting on f.readline() is cheaper but less reliable
    
entities = pd.read_csv("../Data/sb_ca2018_all_csv_v3/sb_ca2018entities.csv", encoding=result['encoding'])

In [9]:
entities.shape

(11333, 10)

In [10]:
entities.head()

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Type Id,County Name,District Name,School Name,Zip Code
0,35,67520,6035109,,2018,7,San Benito,Panoche Elementary,Panoche Elementary,95043.0
1,35,67538,0,,2018,6,San Benito,San Benito High,,
2,35,67538,3530029,,2018,7,San Benito,San Benito High,San Andreas Continuation High,95023.0
3,35,67538,3537008,,2018,7,San Benito,San Benito High,San Benito High,95023.0
4,35,67553,0,,2018,6,San Benito,Southside Elementary,,


In [11]:
entities.loc[(entities['School Name'] == 'Panoche Elementary'), :]

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Type Id,County Name,District Name,School Name,Zip Code
0,35,67520,6035109,,2018,7,San Benito,Panoche Elementary,Panoche Elementary,95043


In [12]:
df_irvine = entities.loc[(entities['District Name'] == 'Irvine Unified'), :]

In [13]:
df_irvine

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Type Id,County Name,District Name,School Name,Zip Code
7499,30,73650,0,,2018,6,Orange,Irvine Unified,,
7500,30,73650,127472,,2018,7,Orange,Irvine Unified,Jeffrey Trail Middle,92620.0
7501,30,73650,129155,,2018,7,Orange,Irvine Unified,Cypress Village Elementary,92620.0
7502,30,73650,129296,,2018,7,Orange,Irvine Unified,Portola Springs Elementary,92618.0
7503,30,73650,133389,,2018,7,Orange,Irvine Unified,Beacon Park,92618.0
7504,30,73650,135137,,2018,7,Orange,Irvine Unified,Eastwood Elementary,92620.0
7505,30,73650,3030129,,2018,7,Orange,Irvine Unified,Creekside High,92606.0
7506,30,73650,3030152,,2018,7,Orange,Irvine Unified,Irvine High,92604.0
7507,30,73650,3030285,,2018,7,Orange,Irvine Unified,Woodbridge High,92604.0
7508,30,73650,3030467,,2018,7,Orange,Irvine Unified,Alternative Education-San Joaquin High,92606.0


In [14]:
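# Note: .iloc(0) calls the indexer instead of indexing with it, so this returns the indexer object, not a row
df_irvine[['District Code']].iloc(0)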

<pandas.core.indexing._iLocIndexer at 0x11d205048>

In [15]:
entities.loc[(entities['District Name'] == 'Irvine Unified'), ['District Code']].iloc[0][0]

73650
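
The last two cells differ mainly in parentheses versus square brackets: `.iloc(0)` returns the indexer object, while `.iloc[0]` selects the first row. The following is a minimal sketch on a toy frame, assuming a few rows copied from the `entities` output above (`toy`, `mask`, `code`, and `codes` are illustrative names, not part of the notebook). It selects the column by name, so a single positional lookup is enough.

```python
import pandas as pd

# Toy stand-in for a few rows of `entities` (values copied from the output above).
toy = pd.DataFrame(
    {
        "District Code": [73650, 73650, 67538],
        "District Name": ["Irvine Unified", "Irvine Unified", "San Benito High"],
        "School Code": [0, 127472, 0],
    }
)

mask = toy["District Name"] == "Irvine Unified"

# Passing a single column label returns a Series; square brackets on .iloc
# then pick the first matching value by position.
code = toy.loc[mask, "District Code"].iloc[0]

# All rows of one district share the same code, so the unique values
# give a quick sanity check that the name maps to exactly one code.
codes = toy.loc[mask, "District Code"].unique()

print(code)
print(codes)
```

Checking `len(codes) == 1` before using `code` is one way to guard against a district name that maps to several codes.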

Each `Subgroup ID` identifies a student group (for example, by gender, English-language fluency, or ethnicity), as listed below. This lets us compare results across student groups.

In [16]:
subgroup = pd.read_csv("../Data/Subgroups.txt", header=None)
subgroup.shape

(47, 4)

In [17]:
subgroup.head()

Unnamed: 0,0,1,2,3
0,1,1,"""All Students""","""All Students"""
1,3,3,"""Male""","""Gender"""
2,4,4,"""Female""","""Gender"""
3,6,6,"""Fluent English proficient and English only""","""English-Language Fluency"""
4,7,7,"""Initial fluent English proficient (IFEP)""","""English-Language Fluency"""


In [18]:
# Drop the first column (redundant with the second); axis=0 is the index, axis=1 is columns, and inplace=True modifies subgroup directly
subgroup.drop(0, axis=1, inplace=True)
subgroup.columns = ['Subgroup ID', 'Student Groups', 'Category']
# Equivalent selection of all three columns: subgroup[['Subgroup ID', 'Student Groups', 'Category']]
# sort_values returns a new DataFrame; the result is not assigned here, so subgroup keeps its original order
subgroup.sort_values("Category")
subgroup.head(47)

Unnamed: 0,Subgroup ID,Student Groups,Category
0,1,"""All Students""","""All Students"""
1,3,"""Male""","""Gender"""
2,4,"""Female""","""Gender"""
3,6,"""Fluent English proficient and English only""","""English-Language Fluency"""
4,7,"""Initial fluent English proficient (IFEP)""","""English-Language Fluency"""
5,8,"""Reclassified fluent English proficient (RFEP)""","""English-Language Fluency"""
6,28,"""Migrant education""","""Migrant"""
7,31,"""Economically disadvantaged""","""Economic Status"""
8,74,"""Black or African American""","""Ethnicity"""
9,75,"""American Indian or Alaska Native""","""Ethnicity"""


For example, to select the rows where the district is **Irvine Unified**, the ethnicity is **Asian**, and the grade is **3rd**, we filter on the district code, `Subgroup ID` 76, and `Grade` 3 as follows.

In [19]:
df.loc[(df['District Code'] == 73650) & (df['Subgroup ID'] == 76) & (df['Grade'] == 3), :]

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Subgroup ID,Test Type,Total Tested At Entity Level,Total Tested with Scores,Grade,Test Id,CAASPP Reported Enrollment,Students Tested,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores,Area 1 Percentage Above Standard,Area 1 Percentage Near Standard,Area 1 Percentage Below Standard,Area 2 Percentage Above Standard,Area 2 Percentage Near Standard,Area 2 Percentage Below Standard,Area 3 Percentage Above Standard,Area 3 Percentage Near Standard,Area 3 Percentage Below Standard,Area 4 Percentage Above Standard,Area 4 Percentage Near Standard,Area 4 Percentage Below Standard
1602961,30,73650,0,,2018,76,B,8918,8918,3,2,1272,1258,2512.4,58.03,29.65,87.68,8.98,3.34,1258,73.85,21.54,4.61,62.32,32.03,5.64,66.38,28.30,5.33,0.00,0.00,0.00
1602962,30,73650,0,,2018,76,B,8666,8666,3,1,1272,1211,2499.2,60.20,21.14,81.34,11.81,6.85,1211,55.08,36.09,8.84,54.38,35.62,10.00,41.29,53.76,4.95,56.81,35.84,7.35
1603803,30,73650,129155,,2018,76,B,312,312,3,2,86,85,2531.3,74.12,18.82,92.94,2.35,4.71,85,85.88,9.41,4.71,71.76,23.53,4.71,82.35,14.12,3.53,0.00,0.00,0.00
1603804,30,73650,129155,,2018,76,B,301,301,3,1,86,84,2521.6,75.00,16.67,91.67,3.57,4.76,84,54.76,39.29,5.95,76.19,16.67,7.14,53.57,42.86,3.57,75.00,20.24,4.76
1604137,30,73650,129296,,2018,76,B,282,282,3,2,71,71,2517.5,61.97,29.58,91.55,7.04,1.41,71,77.46,19.72,2.82,60.56,38.03,1.41,70.42,28.17,1.41,0.00,0.00,0.00
1604138,30,73650,129296,,2018,76,B,281,281,3,1,71,71,2504.8,60.56,21.13,81.69,11.27,7.04,71,54.93,40.85,4.23,49.30,38.03,12.68,43.66,50.70,5.63,61.97,30.99,7.04
1604465,30,73650,133389,,2018,76,B,267,267,3,2,64,64,2496.2,50.00,32.81,82.81,12.50,4.69,64,68.75,25.00,6.25,53.13,39.06,7.81,54.69,39.06,6.25,0.00,0.00,0.00
1604466,30,73650,133389,,2018,76,B,259,259,3,1,64,63,2478.6,52.38,22.22,74.60,12.70,12.70,63,47.62,38.10,14.29,42.86,41.27,15.87,31.75,58.73,9.52,42.86,44.44,12.70
1604845,30,73650,135137,,2018,76,B,141,141,3,2,39,38,2495.1,44.74,36.84,81.58,13.16,5.26,38,57.89,34.21,7.89,65.79,23.68,10.53,52.63,44.74,2.63,0.00,0.00,0.00
1604846,30,73650,135137,,2018,76,B,118,118,3,1,39,36,2486.6,50.00,33.33,83.33,8.33,8.33,36,47.22,41.67,11.11,52.78,36.11,11.11,22.22,75.00,2.78,52.78,33.33,13.89


In [20]:
df.iloc[1614596]

County Code                               30
District Code                          73650
School Code                          6120141
Filler                                   NaN
Test Year                               2018
Subgroup ID                               76
Test Type                                  B
Total Tested At Entity Level             182
Total Tested with Scores                 182
Grade                                      3
Test Id                                    2
CAASPP Reported Enrollment                53
Students Tested                           52
Mean Scale Score                      2497.8
Percentage Standard Exceeded           51.92
Percentage Standard Met                23.08
Percentage Standard Met and Above      75.00
Percentage Standard Nearly Met         21.15
Percentage Standard Not Met             3.85
Students with Scores                      52
Area 1 Percentage Above Standard       63.46
Area 1 Percentage Near Standard        32.69
Area 1 Per

* Selecting a single item: `df.loc[index, col_name]` selects by label. The index does not need to be a string; with the default `RangeIndex`, the labels are integers.

* Join the two DataFrames (test data and entities) to obtain the school name for each row.
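
Both notes can be sketched on toy data. The example below is an assumption about how the join could look, not the notebook's final method: it merges on the three code columns shared by the test data and the entity file, then reads one value by label. The frames `scores` and `names` are hypothetical stand-ins, with values copied from the outputs above.

```python
import pandas as pd

# Toy stand-in for `df`: three school-level rows from the Irvine Unified output.
scores = pd.DataFrame(
    {
        "County Code": [30, 30, 30],
        "District Code": [73650, 73650, 73650],
        "School Code": [129155, 129296, 133389],
        "Mean Scale Score": [2531.3, 2517.5, 2496.2],
    }
)

# Toy stand-in for `entities`: the matching rows with school names.
names = pd.DataFrame(
    {
        "County Code": [30, 30, 30],
        "District Code": [73650, 73650, 73650],
        "School Code": [129155, 129296, 133389],
        "School Name": [
            "Cypress Village Elementary",
            "Portola Springs Elementary",
            "Beacon Park",
        ],
    }
)

keys = ["County Code", "District Code", "School Code"]

# A left join keeps every score row even when no entity matches;
# validate="many_to_one" raises if a key appears more than once in `names`.
merged = scores.merge(names, on=keys, how="left", validate="many_to_one")

# .loc selects by label; with the default RangeIndex the row labels are integers.
first_name = merged.loc[0, "School Name"]

print(merged[["School Code", "School Name", "Mean Scale Score"]])
print(first_name)
```

A left join is used here so that score rows without a matching entity would show up as `NaN` names rather than disappear silently.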