# Practice Project 2.1 - Preparing School Data

## Business Understanding
A school district wants to predict the per-pupil cost of a school from high-level summary data about the school. Comparing a school’s actual cost with the model’s prediction gives the district an estimate of how well that school is managing its costs. You’ve been asked to prepare the data for modelling.

## Data Understanding
You’ve been given four CSV files that contain data for two different school districts. You can find these files at the bottom of the page.

- DistrictA_Attendance - This file contains average daily attendance, percent attendance, and pupil-teacher ratio data for the 25 schools in district A.

- DistrictA_Finance - This file contains average monthly teacher salary and per pupil cost data for the 25 schools in district A.

- DistrictB_Attendance - This file contains average daily attendance, percent attendance, and pupil-teacher ratio data for the 21 schools in district B.

- DistrictB_Finance - This file contains average monthly teacher salary and per pupil cost data for the 21 schools in district B.

In [None]:
# load module
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [11, 7]

In [None]:
# load attendance and finance data
a_attendance = pd.read_csv('districta-attendance.csv')
b_attendance = pd.read_csv('districtb-attendance.csv')
a_finance = pd.read_csv('districta-finance.csv')
b_finance = pd.read_csv('districtb-finance.csv')

# Remove empty columns
attendance_columns = ['School', 'Average daily Attendance', 'Percent Attendance', 'Pupil/Teacher ratio']
a_attendance = a_attendance[attendance_columns]


## Data Preparation


### Step 1: Combine the data
First you’ll need to combine the data from the various files into one sheet, with one row per school. To do this, you’ll use the skills you learned in the Formatting Data and Blending Data lessons.
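
As a rough illustration of the reshaping and blending this step involves, the sketch below pivots a long-format finance table into one row per school and merges it with an attendance table. The column names `School`, `Metric`, and `Value` match the finance files used in this notebook, but the schools and numbers are made up, and `validate="one_to_one"` is an optional safety check added here rather than part of the project.

```python
import pandas as pd

# Toy attendance table: one row per school (values are made up).
attendance = pd.DataFrame({
    "School": ["School 1", "School 2", "School 3"],
    "Percent Attendance": [95.1, 93.4, 96.0],
})

# Toy finance table in long format: one row per (school, metric) pair.
finance_long = pd.DataFrame({
    "School": ["School 1", "School 1", "School 2", "School 2", "School 3", "School 3"],
    "Metric": ["Average Monthly Teacher Salary", "Per-Pupil Cost"] * 3,
    "Value": [3000.0, 80.0, 3200.0, 85.0, 3100.0, 78.0],
})

# Reshape long -> wide so each metric becomes its own column.
finance_wide = (
    finance_long
    .pivot(index="School", columns="Metric", values="Value")
    .reset_index()
)
finance_wide.columns.name = None  # drop the leftover "Metric" axis label

# validate="one_to_one" raises a MergeError if a school appears
# more than once on either side of the join.
combined = pd.merge(
    attendance, finance_wide, on="School", how="inner", validate="one_to_one"
)
print(combined)
```

An inner join silently drops schools that appear in only one file, so comparing `len(combined)` with the expected number of schools is a cheap sanity check.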

In [None]:
# Pivot finance data from long (School, Metric, Value) to wide format
a_finance_pivot = a_finance.pivot(index='School', columns='Metric', values='Value').reset_index()
b_finance_pivot = b_finance.pivot(index='School', columns='Metric', values='Value').reset_index()

# Merge district data
district_a = pd.merge(a_attendance, a_finance_pivot, on='School')
district_b = pd.merge(b_attendance, b_finance_pivot, on='School')

# join districts data and remove duplicates
districts_data = pd.concat([district_a, district_b], ignore_index=True).drop_duplicates(ignore_index=True)

### Step 2: Clean the Data


Next you’ll clean the data, which includes addressing duplicate data, missing data, and any other data issues. To do this, you’ll use the skills you learned in the Data Issues lesson.
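
Before dropping anything, it can help to count how many rows are affected. The following is a minimal sketch on a made-up table (the schools and values are hypothetical) that counts exact duplicate rows and missing values, then removes both.

```python
import numpy as np
import pandas as pd

# Made-up data with one exact duplicate row and one missing value.
toy = pd.DataFrame({
    "School": ["School 1", "School 2", "School 2", "School 3"],
    "Percent Attendance": [95.1, 93.4, 93.4, np.nan],
    "Per-Pupil Cost": [80.0, 85.0, 85.0, 78.0],
})

# Number of rows that repeat an earlier row exactly
# (the first occurrence is not counted).
n_duplicates = int(toy.duplicated().sum())

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Drop exact duplicates first, then any row with a missing value.
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print("duplicate rows:", n_duplicates)
print(missing_per_column)
print(cleaned)
```

Dropping rows is only one option for missing data; whether it is appropriate depends on how many rows would be lost.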

In [None]:
# check non-null counts per column
# (info() prints its report itself and returns None, so it is not wrapped in print)
print('Non Null Data in Raw School Data\n')
districts_data.info()

# drop rows with missing values
districts_data_dropna = districts_data.dropna()
print('\nClean Data')
districts_data_dropna.info()


### Step 3: Identify and Deal with Outliers


Lastly, you’ll look for outliers and determine the best way to address them. To do this, you’ll use the skills you learned in the Data Issues lesson.
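
One common way to flag outliers numerically is the 1.5 × IQR rule, which is also the default whisker convention in matplotlib boxplots. The sketch below applies it to a single made-up series of per-pupil costs. It is offered as an illustration of the rule, not as the method this project requires.

```python
import pandas as pd

# Made-up per-pupil costs with one unusually high value.
costs = pd.Series(
    [78.0, 80.0, 82.0, 85.0, 79.0, 81.0, 150.0], name="Per-Pupil Cost"
)

# Interquartile range: the spread of the middle 50% of the data.
q1 = costs.quantile(0.25)
q3 = costs.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (costs < lower) | (costs > upper)

print(f"bounds: [{lower}, {upper}]")
print(costs[is_outlier])
```

Flagging a value is separate from deciding what to do with it; an outlier may be a data-entry error or a genuinely unusual school.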

In [None]:
# Target variable: 'Per-Pupil Cost'
# Predictor variables:
# ['Average daily Attendance', 'Percent Attendance', 'Pupil/Teacher ratio', 'Average Monthly Teacher Salary']

print('Boxplots All Variables\n')

# every column except 'School' (the predictors plus the target)
column_names = list(districts_data_dropna.columns)[1:]

districts_data_predict_variables = districts_data_dropna[column_names]

# one figure per variable; create the figure first so no empty figure is left at the end
for column_name in column_names:
    plt.figure()
    districts_data_predict_variables.boxplot(column=column_name)


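### Remove outliers by z-score

A row is kept only when the absolute z-score of every variable is below a threshold of 3 (`threshold = 3`); rows above it, found with a test such as `np.where(z > threshold)`, are treated as outliers.

The calculation relies on `from scipy import stats` and `import numpy as np`, both already imported at the top of the notebook.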
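
To make the threshold rule concrete, here is a minimal sketch on made-up data: 20 hypothetical schools, one of which has an extreme per-pupil cost. The table is converted with `to_numpy()` so that `stats.zscore` returns a plain array. Note that `stats.zscore` uses the population standard deviation by default, so in a sample of n rows no absolute z-score can exceed √(n − 1), and a threshold of 3 cannot flag anything when there are 10 rows or fewer.

```python
import numpy as np
import pandas as pd
from scipy import stats

# 20 made-up schools: 19 ordinary rows plus one extreme per-pupil cost.
toy = pd.DataFrame({
    "Percent Attendance": [94.0, 95.0, 96.0, 95.0] * 5,
    "Per-Pupil Cost": [78.0, 80.0, 82.0, 80.0] * 4 + [78.0, 80.0, 82.0, 200.0],
})

threshold = 3

# Column-wise z-scores: (value - column mean) / column standard deviation.
z = np.abs(stats.zscore(toy.to_numpy(), axis=0))

# Keep a row only if every variable is within the threshold.
keep = (z < threshold).all(axis=1)

outliers = toy[~keep]
filtered = toy[keep]

print(outliers)
print(len(filtered), "rows kept")
```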

In [None]:
z_score_abs = np.abs(stats.zscore(districts_data_predict_variables))
idx = (z_score_abs < 3).all(axis=1)  # True for rows where every |z| is below 3

# outlier rows
outlier_school_data = districts_data_predict_variables[~idx]
print('Outliers School Data')
print(outlier_school_data)

# data with outlier rows removed
no_outlier_school_data = districts_data_predict_variables[idx]
print('\nFiltered Outliers School Data')
no_outlier_school_data.info()

In [None]:
districts_data_predict_variables.info()