# Data Import and Cleaning 

## Part 2 of 2: SAT 2019 Dataset for California

In [1]:
import pandas as pd

## Data Import & Cleaning

Import the datasets that you selected for this project and go through the following steps at a minimum. You are welcome to do further cleaning as you feel necessary:
1. Display the data: print the first 5 rows of each dataframe to your Jupyter notebook.
2. Check for missing values.
3. Check for any obvious issues with the observations (keep in mind the minimum & maximum possible values for each test/subtest).
4. Fix any errors you identified in steps 2-3.
5. Display the data types of each feature.
6. Fix any incorrect data types found in step 5.
    - Fix any individual values preventing other columns from being the appropriate type.
    - If your dataset has a column of percents (ex. '50%', '30.5%', etc.), use the function you wrote in Part 1 (coding challenges, number 3) to convert this to floats! *Hint*: use `.map()` or `.apply()`.
7. Rename Columns.
    - Column names should be all lowercase.
    - Column names should not contain spaces (underscores will suffice--this allows for using the `df.column_name` method to access columns in addition to `df['column_name']`).
    - Column names should be unique and informative.
8. Drop unnecessary rows (if needed).
9. Merge dataframes that can be merged.
10. Perform any additional cleaning that you feel is necessary.
11. Save your cleaned and merged dataframes as csv files.
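
Step 6 mentions converting percent strings to floats with `.map()`. The Part 1 function is not shown in this notebook, so the sketch below uses a hypothetical stand-in named `percent_to_float` and a toy column; it assumes the desired result is the number before the percent sign (e.g. `'30.5%'` becomes `30.5`).

```python
import pandas as pd

def percent_to_float(value):
    """Hypothetical helper: convert a string such as '30.5%' to the float 30.5."""
    return float(str(value).strip().rstrip('%'))

# Toy column of percent strings (not taken from the SAT file)
demo = pd.DataFrame({'participation': ['50%', '30.5%', '100%']})

# .map() applies the helper to every element of the column
demo['participation'] = demo['participation'].map(percent_to_float)
print(demo['participation'].tolist())
```

The SAT percentage columns in this dataset do not carry a `%` sign, so this step turns out not to be needed below.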

## Loading the data 

### Reading in the .csv files

In [2]:
sat = pd.read_csv('../data/sat_2019_ca.csv')

In [3]:
sat.head()

Unnamed: 0,CDS,CCode,CDCode,SCode,RType,SName,DName,CName,Enroll12,NumTSTTakr12,...,NumERWBenchmark11,PctERWBenchmark11,NumMathBenchmark11,PctMathBenchmark11,TotNumBothBenchmark12,PctBothBenchmark12,TotNumBothBenchmark11,PctBothBenchmark11,Year,Unnamed: 25
0,6615981000000.0,6.0,661598.0,630046.0,S,Colusa Alternative Home,Colusa Unified,Colusa,18.0,0.0,...,,,,,,,,,2018-19,
1,6616061000000.0,6.0,661606.0,634758.0,S,Maxwell Sr High,Maxwell Unified,Colusa,29.0,10.0,...,*,*,*,*,*,*,*,*,2018-19,
2,19647330000000.0,19.0,1964733.0,1930924.0,S,Belmont Senior High,Los Angeles Unified,Los Angeles,206.0,102.0,...,42,24.14,12,6.90,14,13.73,11,6.32,2018-19,
3,19647330000000.0,19.0,1964733.0,1931476.0,S,Canoga Park Senior High,Los Angeles Unified,Los Angeles,227.0,113.0,...,97,35.27,37,13.45,18,15.93,35,12.73,2018-19,
4,19647330000000.0,19.0,1964733.0,1931856.0,S,Whitman Continuation,Los Angeles Unified,Los Angeles,18.0,14.0,...,*,*,*,*,*,*,*,*,2018-19,


### Shape of the data 

The dataset has the following shape, given as (rows, columns):

In [4]:
sat.shape

(2580, 26)

###  Removing the extra column at the end

In [5]:
sat.drop('Unnamed: 25', axis=1, inplace=True)

In [6]:
sat.head(3)

Unnamed: 0,CDS,CCode,CDCode,SCode,RType,SName,DName,CName,Enroll12,NumTSTTakr12,...,NumTSTTakr11,NumERWBenchmark11,PctERWBenchmark11,NumMathBenchmark11,PctMathBenchmark11,TotNumBothBenchmark12,PctBothBenchmark12,TotNumBothBenchmark11,PctBothBenchmark11,Year
0,6615981000000.0,6.0,661598.0,630046.0,S,Colusa Alternative Home,Colusa Unified,Colusa,18.0,0.0,...,0.0,,,,,,,,,2018-19
1,6616061000000.0,6.0,661606.0,634758.0,S,Maxwell Sr High,Maxwell Unified,Colusa,29.0,10.0,...,6.0,*,*,*,*,*,*,*,*,2018-19
2,19647330000000.0,19.0,1964733.0,1930924.0,S,Belmont Senior High,Los Angeles Unified,Los Angeles,206.0,102.0,...,174.0,42,24.14,12,6.90,14,13.73,11,6.32,2018-19


The column has been removed.
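
If several such columns appear, they can be dropped by name pattern rather than one at a time. This is a minimal sketch on a toy CSV; it assumes the extra column comes from a trailing comma on each line, which is one common cause but has not been confirmed for the SAT file.

```python
import io
import pandas as pd

# Toy CSV whose lines end with a trailing comma; pandas reads this as an
# extra, empty column named 'Unnamed: 2'
raw = io.StringIO("CDS,Year,\n1,2018-19,\n2,2018-19,\n")
demo = pd.read_csv(raw)

# Keep every column whose name does not start with 'Unnamed'
demo = demo.loc[:, ~demo.columns.str.startswith('Unnamed')]
print(list(demo.columns))
```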

### Standardising the format of the column names

The column names will be changed to lowercase, and with underscores as word separators.

In [7]:
# Changing column names to lowercase
sat.rename(str.lower, axis="columns", inplace=True)

In [8]:
sat.columns

Index(['cds', 'ccode', 'cdcode', 'scode', 'rtype', 'sname', 'dname', 'cname',
       'enroll12', 'numtsttakr12', 'numerwbenchmark12', 'pcterwbenchmark12',
       'nummathbenchmark12', 'pctmathbenchmark12', 'enroll11', 'numtsttakr11',
       'numerwbenchmark11', 'pcterwbenchmark11', 'nummathbenchmark11',
       'pctmathbenchmark11', 'totnumbothbenchmark12', 'pctbothbenchmark12',
       'totnumbothbenchmark11', 'pctbothbenchmark11', 'year'],
      dtype='object')

In [9]:
# Renaming the column names with underscores as word separators
sat.rename(columns = {'cds':'cds_code', 
                      'ccode':'c_code',
                      'cdcode':'cd_code',
                      'scode':'s_code',
                      'rtype':'r_type',
                      'sname':'s_name',
                      'dname':'d_name',
                      'cname':'c_name',
                      'enroll12':'enroll_12',
                      'numtsttakr12':'num_tst_takr_12',
                      'numerwbenchmark12':'num_erw_benchmark_12',
                      'pcterwbenchmark12':'pct_erw_benchmark_12',
                      'nummathbenchmark12':'num_math_benchmark_12',
                      'pctmathbenchmark12':'pct_math_benchmark_12',
                      'enroll11':'enroll_11',
                      'numtsttakr11':'num_tst_takr_11',
                      'numerwbenchmark11':'num_erw_benchmark_11',
                      'pcterwbenchmark11':'pct_erw_benchmark_11',
                      'nummathbenchmark11':'num_math_benchmark_11',
                      'pctmathbenchmark11':'pct_math_benchmark_11',
                      'totnumbothbenchmark12':'tot_num_both_benchmark_12',
                      'pctbothbenchmark12':'pct_both_benchmark_12',
                      'totnumbothbenchmark11':'tot_num_both_benchmark_11',
                      'pctbothbenchmark11':'pct_both_benchmark_11'
                     }, inplace=True)

In [10]:
sat.columns

Index(['cds_code', 'c_code', 'cd_code', 's_code', 'r_type', 's_name', 'd_name',
       'c_name', 'enroll_12', 'num_tst_takr_12', 'num_erw_benchmark_12',
       'pct_erw_benchmark_12', 'num_math_benchmark_12',
       'pct_math_benchmark_12', 'enroll_11', 'num_tst_takr_11',
       'num_erw_benchmark_11', 'pct_erw_benchmark_11', 'num_math_benchmark_11',
       'pct_math_benchmark_11', 'tot_num_both_benchmark_12',
       'pct_both_benchmark_12', 'tot_num_both_benchmark_11',
       'pct_both_benchmark_11', 'year'],
      dtype='object')

The columns have been renamed.
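
The same underscore-separated names could also be derived with a regular expression instead of a hand-written dictionary. The sketch below is an alternative, not the method used in this notebook; `to_snake_case` is a hypothetical helper, and it must be applied to the original mixed-case names (before lowercasing), since it relies on the capital letters to find word boundaries.

```python
import re

# Word boundaries: lower->Upper, the end of an acronym (e.g. 'TST|Takr'),
# and letter->digit (e.g. 'Enroll|12')
_BOUNDARY = re.compile(
    r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?<=[A-Za-z])(?=\d)'
)

def to_snake_case(name):
    """Insert underscores at word boundaries, then lowercase the result."""
    return _BOUNDARY.sub('_', name).lower()

original = ['CCode', 'SName', 'Enroll12', 'NumTSTTakr12', 'PctERWBenchmark11']
renamed = [to_snake_case(name) for name in original]
print(renamed)
```

Note that this would turn `CDS` into `cds` rather than `cds_code`, so that one column would still need a manual rename.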

### Checking for null values

In [11]:
sat.isna().sum()

cds_code                       1
c_code                         1
cd_code                        1
s_code                         1
r_type                         1
s_name                       598
d_name                        59
c_name                         1
enroll_12                      1
num_tst_takr_12                1
num_erw_benchmark_12         276
pct_erw_benchmark_12         276
num_math_benchmark_12        276
pct_math_benchmark_12        276
enroll_11                      1
num_tst_takr_11                1
num_erw_benchmark_11         311
pct_erw_benchmark_11         311
num_math_benchmark_11        311
pct_math_benchmark_11        311
tot_num_both_benchmark_12    276
pct_both_benchmark_12        276
tot_num_both_benchmark_11    311
pct_both_benchmark_11        311
year                           1
dtype: int64

Let's look closer into where these null values appear.

In [12]:
# Fetch the rows of the 'sat' DataFrame where there is a null value under the 'cds_code' column.
sat[sat['cds_code'].isna()]

Unnamed: 0,cds_code,c_code,cd_code,s_code,r_type,s_name,d_name,c_name,enroll_12,num_tst_takr_12,...,num_tst_takr_11,num_erw_benchmark_11,pct_erw_benchmark_11,num_math_benchmark_11,pct_math_benchmark_11,tot_num_both_benchmark_12,pct_both_benchmark_12,tot_num_both_benchmark_11,pct_both_benchmark_11,year
2579,,,,,,,,,,,...,,,,,,,,,,


This is the last row again, as seen with the ACT dataset.

### Dropping the last row

It seems like the last row is entirely made up of null values. We will drop the row. 

In [13]:
sat.drop(2579, inplace=True)
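
An alternative that does not depend on knowing the row label is `dropna(how='all')`, which removes only rows where every column is null. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'cds_code': [1.0, 2.0, np.nan],
    's_name': ['School A', None, None],
    'enroll_12': [18.0, 29.0, np.nan],
})

# how='all' drops a row only when every column in it is null,
# so the second row (missing just 's_name') is kept
cleaned = demo.dropna(how='all')
print(len(cleaned))
```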

In [14]:
sat.isna().sum()

cds_code                       0
c_code                         0
cd_code                        0
s_code                         0
r_type                         0
s_name                       597
d_name                        58
c_name                         0
enroll_12                      0
num_tst_takr_12                0
num_erw_benchmark_12         275
pct_erw_benchmark_12         275
num_math_benchmark_12        275
pct_math_benchmark_12        275
enroll_11                      0
num_tst_takr_11                0
num_erw_benchmark_11         310
pct_erw_benchmark_11         310
num_math_benchmark_11        310
pct_math_benchmark_11        310
tot_num_both_benchmark_12    275
pct_both_benchmark_12        275
tot_num_both_benchmark_11    310
pct_both_benchmark_11        310
year                           0
dtype: int64

Dropping the empty row reduced every null count by one, which clears the null values in all the columns that were missing only that single entry.

### Checking the `'s_code'` column

In the `'act'` DataFrame, the `'s_code'` (School Code) column had null values for the rows with `'r_type'` (Record Type) of 'D' (District). Here, there are no null values in this column. Let's take a quick peek. 

In [15]:
# Fetch the first 5 rows of the 'sat' DataFrame where the value of 'r_type' is 'D'.
sat[sat['r_type'] == 'D'].head()

Unnamed: 0,cds_code,c_code,cd_code,s_code,r_type,s_name,d_name,c_name,enroll_12,num_tst_takr_12,...,num_tst_takr_11,num_erw_benchmark_11,pct_erw_benchmark_11,num_math_benchmark_11,pct_math_benchmark_11,tot_num_both_benchmark_12,pct_both_benchmark_12,tot_num_both_benchmark_11,pct_both_benchmark_11,year
2037,1611760000000.0,1.0,161176.0,0.0,D,,Fremont Unified,Alameda,2537.0,845.0,...,1396.0,1365,97.78,1321,94.63,678,80.24,1312,93.98,2018-19
2038,1612750000000.0,1.0,161275.0,0.0,D,,Piedmont City Unified,Alameda,231.0,78.0,...,97.0,97,100.0,96,98.97,61,78.21,96,98.97,2018-19
2039,1612910000000.0,1.0,161291.0,0.0,D,,San Leandro Unified,Alameda,754.0,193.0,...,458.0,239,52.18,140,30.57,77,39.9,122,26.64,2018-19
2040,10621660000000.0,10.0,1062166.0,0.0,D,,Fresno Unified,Fresno,4593.0,1048.0,...,3017.0,1508,49.98,723,23.96,323,30.82,681,22.57,2018-19
2041,10751270000000.0,10.0,1075127.0,0.0,D,,Mendota Unified,Fresno,234.0,69.0,...,76.0,43,56.58,24,31.58,9,13.04,21,27.63,2018-19


In these district-level rows, `'s_code'` is already filled with the float `0.0` rather than left null.

### Replacing the null values in the `'s_name'` and `'d_name'` columns 

The observations made for the `'act'` DataFrame apply here as well (e.g. entries with a null School Name are district- and county-level entries). Hence, we will simply fill the null values with the string 'None'.

In [16]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
sat['s_name'] = sat['s_name'].fillna('None')
sat['d_name'] = sat['d_name'].fillna('None')

### Replacing the null values in the rest of the columns

In [17]:
sat.isna().sum()

cds_code                       0
c_code                         0
cd_code                        0
s_code                         0
r_type                         0
s_name                         0
d_name                         0
c_name                         0
enroll_12                      0
num_tst_takr_12                0
num_erw_benchmark_12         275
pct_erw_benchmark_12         275
num_math_benchmark_12        275
pct_math_benchmark_12        275
enroll_11                      0
num_tst_takr_11                0
num_erw_benchmark_11         310
pct_erw_benchmark_11         310
num_math_benchmark_11        310
pct_math_benchmark_11        310
tot_num_both_benchmark_12    275
pct_both_benchmark_12        275
tot_num_both_benchmark_11    310
pct_both_benchmark_11        310
year                           0
dtype: int64

As with the `'act'` data, these remaining columns are for aspects of the test itself. We will check these columns in the same manner as previously, where we make sure all the entries with null values are only for schools with zero test takers. The columns ending with '12' (Grade 12) will be matched with `'num_tst_takr_12'`, and likewise with `'num_tst_takr_11'`.

We will use the checker function used during the ACT data cleaning to ascertain this.

In [18]:
def na_has_no_test_takers(col_names, test_takers_col_name, df):
    for col_name in col_names:
        print(col_name + ':', df[df[col_name].isna()].equals(df[df[test_takers_col_name] == 0.0]))
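
The checker above compares two filtered DataFrames. An equivalent idea is to compare the two boolean masks directly, which avoids building the filtered frames. The sketch below is a variant with a hypothetical name, `na_matches_zero_takers`, run on toy data rather than on `sat`.

```python
import numpy as np
import pandas as pd

def na_matches_zero_takers(col_names, test_takers_col_name, df):
    """Return {column: bool}; True when nulls occur exactly where test takers == 0."""
    zero_takers = df[test_takers_col_name] == 0
    return {col: bool((df[col].isna() == zero_takers).all()) for col in col_names}

demo = pd.DataFrame({
    'num_tst_takr_12': [0.0, 10.0, 0.0],
    'num_erw_benchmark_12': [np.nan, '*', np.nan],      # nulls line up with zero takers
    'pct_erw_benchmark_12': [np.nan, np.nan, np.nan],   # one null too many
})

result = na_matches_zero_takers(
    ['num_erw_benchmark_12', 'pct_erw_benchmark_12'], 'num_tst_takr_12', demo
)
print(result)
```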

In [19]:
sat_12_list = ['num_erw_benchmark_12',
               'pct_erw_benchmark_12',
               'num_math_benchmark_12',
               'pct_math_benchmark_12',
               'tot_num_both_benchmark_12',
               'pct_both_benchmark_12']

na_has_no_test_takers(sat_12_list, 'num_tst_takr_12', sat)

num_erw_benchmark_12: True
pct_erw_benchmark_12: True
num_math_benchmark_12: True
pct_math_benchmark_12: True
tot_num_both_benchmark_12: True
pct_both_benchmark_12: True


In [20]:
sat_11_list = ['num_erw_benchmark_11',
               'pct_erw_benchmark_11',
               'num_math_benchmark_11',
               'pct_math_benchmark_11',
               'tot_num_both_benchmark_11',
               'pct_both_benchmark_11']

na_has_no_test_takers(sat_11_list, 'num_tst_takr_11', sat)

num_erw_benchmark_11: True
pct_erw_benchmark_11: True
num_math_benchmark_11: True
pct_math_benchmark_11: True
tot_num_both_benchmark_11: True
pct_both_benchmark_11: True


The data passed the tests. We will fill the null values with the string 'None'.

In [21]:
sat_11_12_list = ['num_erw_benchmark_12',
                  'pct_erw_benchmark_12', 
                  'num_math_benchmark_12',
                  'pct_math_benchmark_12',
                  'tot_num_both_benchmark_12',
                  'pct_both_benchmark_12',
                  'num_erw_benchmark_11',
                  'pct_erw_benchmark_11', 
                  'num_math_benchmark_11',
                  'pct_math_benchmark_11',
                  'tot_num_both_benchmark_11',
                  'pct_both_benchmark_11']

# Fill all listed columns at once and assign the result back
sat[sat_11_12_list] = sat[sat_11_12_list].fillna('None')

In [22]:
# Gets the total number of null values in the entire DataFrame
sat.isna().sum().sum()

0

We have removed all the null values from the `'sat'` DataFrame.

### Preparing the columns for typecasting 

In [23]:
sat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2579 entries, 0 to 2578
Data columns (total 25 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   cds_code                   2579 non-null   float64
 1   c_code                     2579 non-null   float64
 2   cd_code                    2579 non-null   float64
 3   s_code                     2579 non-null   float64
 4   r_type                     2579 non-null   object 
 5   s_name                     2579 non-null   object 
 6   d_name                     2579 non-null   object 
 7   c_name                     2579 non-null   object 
 8   enroll_12                  2579 non-null   float64
 9   num_tst_takr_12            2579 non-null   float64
 10  num_erw_benchmark_12       2579 non-null   object 
 11  pct_erw_benchmark_12       2579 non-null   object 
 12  num_math_benchmark_12      2579 non-null   object 
 13  pct_math_benchmark_12      2579 non-null   object 

As with the ACT dataset, several columns that should be numeric are stored with the object datatype.

### Dropping the rows with no test-takers

Dropping these rows loses some information, such as the enrollment numbers of schools without test-takers. We therefore save the DataFrame at this stage first.

In [24]:
sat.to_csv('../data/sat_2019_ca_before_drop.csv', index=False)

In [25]:
# Reassign the 'sat' DataFrame to one including only rows where 'num_tst_takr_12' is not the float '0.0'
sat = sat[sat['num_tst_takr_12'] != 0.0]

# Reassign the 'sat' DataFrame to one including only rows where 'num_tst_takr_11' is not the float '0.0'
sat = sat[sat['num_tst_takr_11'] != 0.0]
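
Applied one after the other, the two filters keep a row only if it has test takers in both grades. The sketch below expresses the same condition as a single combined mask on toy data; the `.copy()` is an addition not present in the notebook, intended to make the result an independent DataFrame.

```python
import pandas as pd

demo = pd.DataFrame({
    'num_tst_takr_12': [0.0, 10.0, 102.0, 14.0],
    'num_tst_takr_11': [0.0, 6.0, 174.0, 0.0],
})

# A row is kept only when both grades have a non-zero number of test takers
has_takers = (demo['num_tst_takr_12'] != 0) & (demo['num_tst_takr_11'] != 0)
filtered = demo[has_takers].copy()
print(filtered.index.tolist())
```

The last toy row has Grade 12 test takers but none in Grade 11, and it is dropped as well.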

In [26]:
len(sat)

2143

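We have removed the rows with zero test takers in either grade, leaving 2143 rows out of the 2579 that remained after the empty last row was dropped (2580 in the raw file). Because the 'None' placeholders appeared only in rows with zero test takers, they have been cleared as well.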

### Dropping the rows with asterisks for both Grade 11 and 12

A number of rows still contain asterisks. The data source did not document the asterisks in the SAT dataset, but we can assume they mean the same as, or something similar to, the ones in the ACT dataset.

The rows with no values for both the Grade 11 and Grade 12 students (only asterisks) will be dropped. 

In [27]:
# Get the number of rows of the 'sat' DataFrame where both 'num_erw_benchmark_11' 
# and 'num_erw_benchmark_12' are the '*' symbol.
len(sat[(sat['num_erw_benchmark_11'] == '*') & (sat['num_erw_benchmark_12'] == '*')])

350

There are 350 such rows. 

In [28]:
# Reassign the 'sat' DataFrame to one excluding all rows where both 'num_erw_benchmark_11' 
# and 'num_erw_benchmark_12' are the '*' symbol. 
sat = sat[~((sat['num_erw_benchmark_11'] == '*') & (sat['num_erw_benchmark_12'] == '*'))]

In [29]:
# Get the number of rows of the 'sat' DataFrame where both 'num_erw_benchmark_11' 
# and 'num_erw_benchmark_12' are the '*' symbol.
len(sat[(sat['num_erw_benchmark_11'] == '*') & (sat['num_erw_benchmark_12'] == '*')])

0

The removal was successful. 

### Removing the Grade 11 columns

Here, we look at the DataFrame again:

In [30]:
sat[sat['num_erw_benchmark_12'] == '*'].head()

Unnamed: 0,cds_code,c_code,cd_code,s_code,r_type,s_name,d_name,c_name,enroll_12,num_tst_takr_12,...,num_tst_takr_11,num_erw_benchmark_11,pct_erw_benchmark_11,num_math_benchmark_11,pct_math_benchmark_11,tot_num_both_benchmark_12,pct_both_benchmark_12,tot_num_both_benchmark_11,pct_both_benchmark_11,year
14,19647330000000.0,19.0,1964733.0,1930387.0,S,Central High,Los Angeles Unified,Los Angeles,104.0,4.0,...,18.0,2,11.11,1,5.56,*,*,1,5.56,2018-19
57,30768930000000.0,30.0,3076893.0,130765.0,S,Magnolia Science Academy Santa Ana,SBE - Magnolia Science Academy Santa Ana,Orange,34.0,14.0,...,27.0,17,62.96,12,44.44,*,*,11,40.74,2018-19
100,19647330000000.0,19.0,1964733.0,129536.0,S,Boyle Heights STEM High,Los Angeles Unified,Los Angeles,42.0,14.0,...,50.0,20,40.0,7,14.0,*,*,7,14.0,2018-19
137,19647330000000.0,19.0,1964733.0,1930429.0,S,View Park Continuation,Los Angeles Unified,Los Angeles,39.0,3.0,...,19.0,4,21.05,0,0.0,*,*,0,0.0,2018-19
161,33103300000000.0,33.0,3310330.0,134320.0,S,Riverside County Education Academy - Indio,Riverside County Office of Education,Riverside,20.0,1.0,...,19.0,1,5.26,0,0.0,*,*,0,0.0,2018-19


If we treat any entry other than an asterisk as a value:

* Some rows may have values for the Grade 11 students, but not the Grade 12 students.
* Some rows may have values for the Grade 12 students, but not the Grade 11 students.
* Some rows may have values for both grades.
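
The three cases can be counted with boolean masks. This is a sketch on toy data, assuming (as the notebook does) that `'num_erw_benchmark_11'` and `'num_erw_benchmark_12'` are representative of the other columns for the same grade.

```python
import pandas as pd

demo = pd.DataFrame({
    'num_erw_benchmark_11': ['42', '*', '17', '20'],
    'num_erw_benchmark_12': ['31', '54', '*', '*'],
})

# True where the grade has a value, i.e. the entry is not an asterisk
has_11 = demo['num_erw_benchmark_11'] != '*'
has_12 = demo['num_erw_benchmark_12'] != '*'

only_11 = int((has_11 & ~has_12).sum())
only_12 = int((~has_11 & has_12).sum())
both = int((has_11 & has_12).sum())
print(only_11, only_12, both)
```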

The ACT dataset is for Grade 12 students. Hence, for a fair comparison and in accordance with our problem statement, the SAT dataset should be reduced to only Grade 12 students as well. 

However, we should first check whether a large number of schools administer the SAT only at Grade 11, because reducing the dataset would then delete a large number of entries. Currently we have the following number of rows:

In [31]:
# Get the number of rows in the SAT DataFrame. 
len(sat)

1793

How many indicate schools or entities that only administer the SAT at Grade 11?

In [32]:
# Get the number of rows in the SAT DataFrame where 'num_erw_benchmark_11' has a value
# (is not the '*' symbol) but 'num_erw_benchmark_12' is the '*' symbol.
len(sat[(sat['num_erw_benchmark_11'] != '*') & (sat['num_erw_benchmark_12'] == '*')])

77

Only 77 of the 1793 rows have values for Grade 11 alone, so the bulk of the data has values for Grade 12. Thus, we will simplify the data as planned. Before dropping the columns, we will save the current DataFrame to a CSV file.

In [33]:
sat.to_csv('../data/sat_2019_ca_before_2nd_drop.csv', index=False)

The operation to remove the columns:

In [34]:
sat_11_full_list = ['enroll_11',
                    'num_tst_takr_11',
                    'num_erw_benchmark_11',
                    'pct_erw_benchmark_11',
                    'num_math_benchmark_11',
                    'pct_math_benchmark_11',
                    'tot_num_both_benchmark_11',
                    'pct_both_benchmark_11']

# Drop all Grade 11 columns in one call
sat = sat.drop(columns=sat_11_full_list)

In [35]:
sat.columns

Index(['cds_code', 'c_code', 'cd_code', 's_code', 'r_type', 's_name', 'd_name',
       'c_name', 'enroll_12', 'num_tst_takr_12', 'num_erw_benchmark_12',
       'pct_erw_benchmark_12', 'num_math_benchmark_12',
       'pct_math_benchmark_12', 'tot_num_both_benchmark_12',
       'pct_both_benchmark_12', 'year'],
      dtype='object')

The columns are removed.

### Dropping the rows with asterisks for Grade 12

In [36]:
# Get the number of rows in the SAT DataFrame where 'num_erw_benchmark_12' is the '*' symbol.
len(sat[sat['num_erw_benchmark_12'] == '*'])

77

There are 77 rows, and these correspond to the rows we identified earlier that only had values in Grade 11 columns.

In [37]:
# Reassign the 'sat' DataFrame to one including only rows where 'num_erw_benchmark_12' is not the '*' symbol.
sat = sat[sat['num_erw_benchmark_12'] != '*']

In [38]:
# Gets the total number of asterisks in the 'sat' DataFrame
sat.eq('*').sum().sum()

0

There are no more asterisks in the DataFrame.

In [39]:
# Get the number of rows in the SAT DataFrame. 
len(sat)

1716

There are now 1716 rows. Previously, it was 1793. 

### Typecasting columns to numerical form

The columns of interest are of an object datatype. They will be converted to numeric form. 

We will use the converter function used for the ACT dataset to do this.

In [40]:
def numeric_converter(col_names, df):
    # Convert each listed column of 'df' to a numeric dtype
    for col_name in col_names:
        df[col_name] = pd.to_numeric(df[col_name])

In [41]:
numeric_converter(sat_12_list, sat)
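
The conversion above works because every asterisk and 'None' placeholder was removed beforehand. As an alternative that is not used in this notebook, `pd.to_numeric` accepts `errors='coerce'`, which turns unparseable entries into `NaN` instead of raising an error. A minimal sketch on a toy Series:

```python
import pandas as pd

demo = pd.Series(['31', '*', '24.14', 'None'])

# errors='coerce' replaces anything that cannot be parsed with NaN
converted = pd.to_numeric(demo, errors='coerce')
print(converted.isna().tolist())
```

The trade-off is that coercion silently hides unexpected strings, whereas the approach taken here forces each placeholder to be inspected and handled explicitly.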

In [42]:
sat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1716 entries, 2 to 2578
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   cds_code                   1716 non-null   float64
 1   c_code                     1716 non-null   float64
 2   cd_code                    1716 non-null   float64
 3   s_code                     1716 non-null   float64
 4   r_type                     1716 non-null   object 
 5   s_name                     1716 non-null   object 
 6   d_name                     1716 non-null   object 
 7   c_name                     1716 non-null   object 
 8   enroll_12                  1716 non-null   float64
 9   num_tst_takr_12            1716 non-null   float64
 10  num_erw_benchmark_12       1716 non-null   int64  
 11  pct_erw_benchmark_12       1716 non-null   float64
 12  num_math_benchmark_12      1716 non-null   int64  
 13  pct_math_benchmark_12      1716 non-null   float64

The typecasting was successful. 

### Checking the numerical columns for anomalies

We will now check each column to see if it is reasonable. 

The first 4 columns are just codes, but they are stored as floats. We would likely prefer them in integer format.

In [43]:
sat.head(3)

Unnamed: 0,cds_code,c_code,cd_code,s_code,r_type,s_name,d_name,c_name,enroll_12,num_tst_takr_12,num_erw_benchmark_12,pct_erw_benchmark_12,num_math_benchmark_12,pct_math_benchmark_12,tot_num_both_benchmark_12,pct_both_benchmark_12,year
2,19647330000000.0,19.0,1964733.0,1930924.0,S,Belmont Senior High,Los Angeles Unified,Los Angeles,206.0,102.0,31,30.39,14,13.73,14,13.73,2018-19
3,19647330000000.0,19.0,1964733.0,1931476.0,S,Canoga Park Senior High,Los Angeles Unified,Los Angeles,227.0,113.0,54,47.79,18,15.93,18,15.93,2018-19
5,19647340000000.0,19.0,1964733.0,6061451.0,S,Foshay Learning Center,Los Angeles Unified,Los Angeles,166.0,106.0,68,64.15,36,33.96,36,33.96,2018-19


In [44]:
# Typecasting the columns to integer format 
sat[['cds_code', 'c_code', 'cd_code', 's_code']] = sat[['cds_code', 'c_code', 'cd_code', 's_code']].astype('int64')

In [45]:
sat.head(3)

Unnamed: 0,cds_code,c_code,cd_code,s_code,r_type,s_name,d_name,c_name,enroll_12,num_tst_takr_12,num_erw_benchmark_12,pct_erw_benchmark_12,num_math_benchmark_12,pct_math_benchmark_12,tot_num_both_benchmark_12,pct_both_benchmark_12,year
2,19647331930924,19,1964733,1930924,S,Belmont Senior High,Los Angeles Unified,Los Angeles,206.0,102.0,31,30.39,14,13.73,14,13.73,2018-19
3,19647331931476,19,1964733,1931476,S,Canoga Park Senior High,Los Angeles Unified,Los Angeles,227.0,113.0,54,47.79,18,15.93,18,15.93,2018-19
5,19647336061451,19,1964733,6061451,S,Foshay Learning Center,Los Angeles Unified,Los Angeles,166.0,106.0,68,64.15,36,33.96,36,33.96,2018-19


The format is now more readable. 
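
Since these columns are identifiers rather than quantities, they could also be kept as zero-padded strings. The sketch below assumes that CDS codes are meant to be 14 characters wide, so that single-digit county codes keep a leading zero; that width and the `cds_code_str` column name are assumptions, and the values are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({'cds_code': [6615981000000.0, 19647331930924.0]})

# Cast to int first to drop the '.0', then left-pad with zeros to 14 characters
demo['cds_code_str'] = demo['cds_code'].astype('int64').astype(str).str.zfill(14)
print(demo['cds_code_str'].tolist())
```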

### Retaining only school-level data

The data also contains district-, county- and state-level records. Such aggregated records would skew the summary statistics and any further statistical analysis unless they are excluded, so they will be removed. First, the DataFrame will be saved to back up the cleaning done up to this stage.

In [46]:
sat.to_csv('../data/sat_2019_ca_before_3rd_drop.csv', index=False)

Dropping the aggregated rows:

In [47]:
# Reassign the 'sat' DataFrame to one including only rows where 'r_type' is 'S'
sat = sat[sat['r_type'] == 'S']

In [48]:
# Get the number of rows in the SAT DataFrame. 
len(sat)

1255

There are now 1255 rows. Previously, it was 1716. 

We will print some summary statistics for the next few columns. The summary statistics will skip the text-based columns.

In [49]:
sat.describe()

Unnamed: 0,cds_code,c_code,cd_code,s_code,enroll_12,num_tst_takr_12,num_erw_benchmark_12,pct_erw_benchmark_12,num_math_benchmark_12,pct_math_benchmark_12,tot_num_both_benchmark_12,pct_both_benchmark_12
count,1255.0,1255.0,1255.0,1255.0,1255.0,1255.0,1255.0,1255.0,1255.0,1255.0,1255.0,1255.0
mean,28481840000000.0,27.818327,2848184.0,2171559.0,330.360956,133.785657,91.677291,66.853482,64.377689,44.550629,60.92749,42.11506
std,13628070000000.0,13.599201,1362807.0,1783738.0,211.794142,106.209369,76.83303,21.319224,63.656393,23.665942,61.444756,23.659765
min,1100170000000.0,1.0,110017.0,100065.0,25.0,15.0,1.0,1.28,0.0,0.0,0.0,0.0
25%,19647330000000.0,19.0,1964733.0,127476.5,128.5,54.0,31.0,51.57,16.0,25.0,15.0,22.625
50%,30665140000000.0,30.0,3066514.0,1936947.0,322.0,108.0,71.0,70.16,43.0,42.53,40.0,39.82
75%,37683380000000.0,37.0,3768338.0,3634038.0,490.0,183.0,133.5,85.0,92.5,63.345,87.0,61.0
max,58727700000000.0,58.0,5872769.0,6120893.0,1135.0,932.0,475.0,100.0,390.0,100.0,385.0,100.0


Given the information we have found in the Background and Outside Research sections, the summary statistics of the data look reasonably correct. The maximums, minimums and averages are within their logical ranges/bounds. 
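
Beyond reading the summary table, the bounds can be checked programmatically. The sketch below uses three rows copied from the cleaned output above and tests that percentages lie between 0 and 100, that benchmark counts do not exceed the number of test takers, and that each percentage matches count divided by test takers to within rounding. The 0.01 tolerance is an assumption based on the two-decimal rounding seen in the data.

```python
import numpy as np
import pandas as pd

# Three rows copied from the cleaned output above
demo = pd.DataFrame({
    'num_tst_takr_12': [102.0, 113.0, 106.0],
    'num_erw_benchmark_12': [31, 54, 68],
    'pct_erw_benchmark_12': [30.39, 47.79, 64.15],
})

# Percentages must lie within [0, 100]
pct_in_range = bool(demo['pct_erw_benchmark_12'].between(0, 100).all())

# The number of students meeting a benchmark cannot exceed the number of test takers
count_within_takers = bool((demo['num_erw_benchmark_12'] <= demo['num_tst_takr_12']).all())

# Each percentage should equal count / test takers * 100, up to rounding
recomputed = demo['num_erw_benchmark_12'] / demo['num_tst_takr_12'] * 100
pct_consistent = bool(np.isclose(recomputed, demo['pct_erw_benchmark_12'], atol=0.01).all())

print(pct_in_range, count_within_takers, pct_consistent)
```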

### Saving the data

In [50]:
sat.to_csv('../data/sat_2019_ca_cleaned.csv', index=False)

This is the end of the data cleaning portion for the SAT dataset. The rest of the data analysis will be continued in the main notebook.