# The Tampa Bay Times and school performance

**Story:** [The story](http://www.tampabay.com/projects/2015/investigations/pinellas-failure-factories/), and [a critique](https://rogueedu.blogspot.com/2015/08/fcat-reading-scores-only-two-of-five.html)

**Author:** A million people, but I believe Nathaniel Lash did the data analysis

**Topics:** Linear Regression, Residuals

**Datasets**

* **0066897-gr04_rsch_2014.xls:** 4th grader pass rates for standardized tests, from Florida Dept of Education
* **FRL-1314-School-Web-Survey-3-Final.xls:** Free and reduced price lunch data, from Florida Dept of Education
* **MembershipSchoolRaceGender1415.xls:** School population by race and gender, from Florida Dept of Education

# What's the story?

We're trying to see what kind of effect race and poverty have on school test score data. The published story doesn't include a regression, but the reporters used one behind the scenes for research.

## Imports

You'll want pandas and seaborn. You'll also want to set pandas to display a lot of columns and rows at a time.
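A minimal sketch of the display settings. The limit of 100 is an arbitrary choice, not something the exercise requires. seaborn is conventionally imported alongside this as `sns`.

```python
import pandas as pd

# Show up to 100 columns and rows before pandas starts truncating output.
# The number 100 is an arbitrary pick; use whatever suits your screen.
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
```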

# Reading in our test scores data

While we have a lot of options for what tests we can use, let's stick with reading scores.

* **Tip:** There's a lot of junk at the top of the file, so you'll want to skip a few of those rows.
* **Tip:** Ouch, even if we skip rows there are still a couple bad ones - the "Number of possible points" and "STATE TOTALS" rows. Get rid of them, too. You can drop them after you've read the data in, if you'd like.
* **Tip:** Sometimes the school number starts with `0`, but pandas will drop it if it thinks the column is an integer. Since this is an Excel file, tell `pd.read_excel` to read the column in as a string instead, using `dtype`.
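Here is a rough sketch of those three tips. The real spreadsheet isn't available here, so it reads a tiny made-up CSV from memory with `pd.read_csv`; `pd.read_excel` accepts the same `skiprows` and `dtype` arguments. The column names, the number of junk rows, and the school rows are all invented for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the spreadsheet: two junk lines, a header, then data.
# Everything in here is made up; check the real file for its actual layout.
raw = io.StringIO(
    "Florida Department of Education\n"
    "Grade 4 Reading\n"
    "District Number,School Number,School Name,Pct Passing\n"
    ",,Number of possible points,\n"
    "00,0000,STATE TOTALS,61\n"
    "16,0021,EXAMPLE ELEM SCHOOL,45\n"
    "16,0151,SAMPLE ELEM SCHOOL,72\n"
)

scores = pd.read_csv(
    raw,
    skiprows=2,  # skip the junk above the header
    dtype={"District Number": str, "School Number": str},  # keep leading zeros
)

# Drop the two rows that are not real schools.
bad_rows = ["Number of possible points", "STATE TOTALS"]
scores = scores[~scores["School Name"].isin(bad_rows)].reset_index(drop=True)
print(scores)
```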

### Getting an average

Try to get the median of the `Percentage Passing (Achievement Levels 3 and Above)` column. Oof, it doesn't work! Take a look at your data and see if there's something that needs to be done with `na_values`.
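A small sketch of why the median fails and how `na_values` helps. The `*` placeholder is an assumption for illustration; look at your own data to see what the file actually uses for missing scores. As before, an in-memory CSV stands in for the Excel file, and `pd.read_excel` takes `na_values` the same way.

```python
import io
import pandas as pd

# Toy data where one score is a text placeholder ("*" is hypothetical).
raw = "School Name,Pct Passing\nA,45\nB,*\nC,72\nD,60\n"

# Without na_values the placeholder forces the whole column to be text.
broken = pd.read_csv(io.StringIO(raw))
print(broken["Pct Passing"].dtype)

# With na_values the placeholder becomes NaN and the column is numeric.
fixed = pd.read_csv(io.StringIO(raw), na_values=["*"])
print(fixed["Pct Passing"].median())
```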

### Confirm that you have 2207 rows and 17 columns, and that the first school is CHARLES W. DUVAL ELEM SCHOOL

# Read in lunch data

We'll be using free lunch as a proxy for poverty.

* **Tip:** You'll need to specify the sheet you're interested in
* **Tip:** Again, the top of the file is kind of messy
* **Tip:** It might be easiest to just specify the names for the columns yourself

## Calculating a column

Let's add in a new column that is the percent of students who are eligible for free or reduced-price lunch.

* Free, reduced price, provision 2, and CEP direct cert are all kinds of reduced lunch.
* Total members is the total number of students at the school.
* **Tip:** If you get an error, read your error message. Check the datatype of your columns, and take a look at your dataset. Maybe you need to add an `na_values` to your `read_excel` to deal with something in there?
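One way the calculation might look, assuming you named the columns yourself when reading the file. The column names and numbers below are hypothetical, and the columns need to be numeric for this to work.

```python
import pandas as pd

# Toy lunch data; column names are ones we made up when reading the file.
lunch = pd.DataFrame({
    "free": [120, 300],
    "reduced": [30, 20],
    "provision_2": [0, 0],
    "cep_direct_cert": [0, 80],
    "total_members": [500, 400],
})

# Add up every kind of free/reduced lunch, then divide by enrollment.
lunch_cols = ["free", "reduced", "provision_2", "cep_direct_cert"]
eligible = lunch[lunch_cols].sum(axis=1)
lunch["pct_free_or_reduced"] = eligible / lunch["total_members"]
print(lunch)
```

Note that this gives a 0-1 fraction, not a 0-100 percent.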

## Fixing district and school numbers

Even if you specify `dtype` when you're reading in this data, it still drops the leading `0`s that you see in Excel for the district and school numbers. Use `.str.pad` to add them back in.

* **Tip:** School numbers should be 4 characters long, district number should be 2 characters.
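A quick sketch of the padding, using made-up `district` and `school` column names. `.str.pad` pads on the left by default, which is what we want here.

```python
import pandas as pd

# Toy data with the leading zeros already lost (column names are hypothetical).
lunch = pd.DataFrame({"district": ["1", "16"], "school": ["21", "151"]})

# Pad on the left with zeros: 2 characters for district, 4 for school.
lunch["district"] = lunch["district"].str.pad(2, fillchar="0")
lunch["school"] = lunch["school"].str.pad(4, fillchar="0")
print(lunch)
```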

### Confirm you have 3987 rows and 10 columns

# Read in race data

* **Tip:** Beware! The file uses a space `' '` instead of an empty string `''` when having missing data, so you might want to let `pd.read_excel` know about that special kind of missing data.

## These columns are stupid

If you look at the column names with `df.columns`, you'll see that they have extra spaces after them. This is terrible!

You can use something like `race.columns = race.columns.str.strip()` to fix that, then columns will behave properly.

## Cleaning up race counts

When a school has no students of a certain race, it just doesn't put anything in the column. This means a lot of `NaN` values that should be zeros! Fill in those `NaN` values with 0.
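A minimal sketch on toy data. The column names are invented, and it only fills the count columns so that text columns are left alone.

```python
import numpy as np
import pandas as pd

# Toy race data: a missing count really means zero students.
race = pd.DataFrame({
    "School": ["ALPHA ELEM", "ALPHA ELEM"],
    "Grade": ["4", "TOTAL"],
    "White": [10, 50],
    "Black": [np.nan, 5],
})

# Fill only the count columns with 0.
count_cols = ["White", "Black"]
race[count_cols] = race[count_cols].fillna(0)
print(race)
```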

## Finding the totals

One row for each school is the `TOTAL` row, which adds up all the other rows and provides an aggregate. Instead of adding them up ourselves, let's try to use this row.

First, try to filter to only look at the total row for each school.

It doesn't list the school's name!

There are a lot of ways to fix this, but my favorite is to replace all of the instances of `"SCHOOL TOTALS"` with `NaN`, then have pandas copy down the value from above it. You can use this code:

```python
# Needs `import numpy as np`. Blank out the label, then copy the school name down from the row above.
race.School = race.School.replace("SCHOOL TOTALS", np.nan).ffill()
```

Now let's try again to see the school totals.

### Create a new dataframe that is only the 'TOTAL' rows, and confirm it is 3992 rows and 15 columns
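A sketch of the whole step on toy data: fix the school names, then keep only the total rows. The `Grade` column holding the `TOTAL` marker is an assumption; check which column marks the total rows in your data.

```python
import numpy as np
import pandas as pd

# Toy race data; "Grade" as the column holding 'TOTAL' is hypothetical.
race = pd.DataFrame({
    "School": ["ALPHA ELEM", "ALPHA ELEM", "SCHOOL TOTALS",
               "BETA ELEM", "SCHOOL TOTALS"],
    "Grade": ["3", "4", "TOTAL", "4", "TOTAL"],
    "Black": [10, 12, 22, 5, 5],
})

# Blank out the label, then copy each school's name down from above.
race["School"] = race["School"].replace("SCHOOL TOTALS", np.nan).ffill()

# Keep just one row per school: the totals.
totals = race[race["Grade"] == "TOTAL"].copy()
print(totals)
```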

### Adding in percentages

Create a new column called `pct_black` that is the percentage of black students.

* **Tip:** If this isn't working, think about how you fixed a similar problem with lunch data up above

Typically you'd take a larger view of race issues, but in this case we're just trying to reproduce what was done by others.

### Confirm that your dataframe has 3992 rows and 16 columns

# Merging our datasets

Let's take a look at the first couple rows of our three datasets:

* Our reading score data
* Our free lunch data
* Our race data

## Doing our merging

We need to merge them, but **school numbers repeat in different districts.** You'll need to join on district AND school number to successfully perform each merge.
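To see why both keys matter, here is a toy sketch with two districts that share a school number. The column names and values are made up.

```python
import pandas as pd

# Two different schools that happen to share school number 0021.
scores = pd.DataFrame({
    "district": ["01", "02"],
    "school": ["0021", "0021"],
    "pct_passing": [45, 72],
})
lunch = pd.DataFrame({
    "district": ["01", "02"],
    "school": ["0021", "0021"],
    "pct_free_or_reduced": [0.8, 0.3],
})

# Joining on both keys gives one row per real school.
merged = scores.merge(lunch, on=["district", "school"])

# Joining on school number alone pairs every 0021 with every other 0021.
wrong = scores.merge(lunch, on="school")

print(len(merged), len(wrong))
```

If your real schools live in differently named columns in each dataframe, `left_on` and `right_on` take lists the same way `on` does.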

### Confirm that you have around 2189 schools and 43 columns

If you have a lot more, it's probably because you merged on your original race dataframe instead of just the totals.

# Cleaning up our columns

We're interested in:

* District number
* School number
* Percent passing
* Percent free or reduced lunch
* Percent Black

Let's just select only those columns.

While you're at it, you should probably rename `Percentage Passing (Achievement Levels 3 and Above)` to `pct_passing` because it's so so long.
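A sketch of selecting and renaming in one go. Apart from the long passing-rate column, the column names in this toy dataframe are assumptions.

```python
import pandas as pd

# Toy merged data with one column we don't care about.
long_name = "Percentage Passing (Achievement Levels 3 and Above)"
merged = pd.DataFrame({
    "district": ["01"],
    "school": ["0021"],
    long_name: [45.0],
    "pct_free_or_reduced": [0.8],
    "pct_black": [0.25],
    "some_other_column": [1],
})

# Keep only the columns of interest, then shorten the long name.
keep = ["district", "school", long_name, "pct_free_or_reduced", "pct_black"]
df = merged[keep].rename(columns={long_name: "pct_passing"})
print(df.columns.tolist())
```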

### Converting to percentages

It's really easy to get mixed up later if we don't have our percentage columns as actual percents. Multiply any percentages that go 0-1 by 100 to turn them into 0-100 instead.

* **Tip:** Make sure your numbers are 0-100 after you multiply!
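A minimal sketch of the conversion with a sanity check afterwards. The column names are assumptions, and the "maximum is at most 1" test is just a rough heuristic for spotting 0-1 columns, so eyeball your data too.

```python
import pandas as pd

# Toy data: passing rate is already 0-100, the other two are 0-1.
df = pd.DataFrame({
    "pct_passing": [45.0, 72.0],
    "pct_free_or_reduced": [0.8, 0.3],
    "pct_black": [0.25, 0.05],
})

pct_cols = ["pct_passing", "pct_free_or_reduced", "pct_black"]
for col in pct_cols:
    # Heuristic: if nothing is above 1, assume the column is a 0-1 fraction.
    if df[col].max() <= 1:
        df[col] = df[col] * 100

# Everything should now sit between 0 and 100.
print(df[pct_cols].describe().loc[["min", "max"]])
```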

# Graphing our data

Use seaborn's `regplot` to plot the relationship between free/reduced lunch and percent passing, and the same with percent black.

* **Tip:** You can use `scatter_kws={'alpha':0.3}` to see things a bit more nicely
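A sketch of the two plots side by side, on invented numbers. The column names are assumptions, and putting both plots in one figure is just a layout choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data, made up purely to have something to plot.
df = pd.DataFrame({
    "pct_passing": [80.0, 72.0, 55.0, 60.0, 35.0, 31.0],
    "pct_free_or_reduced": [20.0, 35.0, 50.0, 60.0, 80.0, 95.0],
    "pct_black": [5.0, 10.0, 30.0, 20.0, 60.0, 70.0],
})

# One regplot per predictor, sharing the y axis so they are comparable.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, col in zip(axes, ["pct_free_or_reduced", "pct_black"]):
    sns.regplot(data=df, x=col, y="pct_passing",
                scatter_kws={"alpha": 0.3}, ax=ax)
```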

# Linear regression

Now let's be a little more exact: run a linear regression that takes into account both percent black and percent free or reduced.

* **Tip:** Use `.dropna()` to remove missing data
* **Tip:** Remember to use `sm.add_constant`!

## Describe the relationship coefficient using "real" words

For example, "For every X change, we get Y change"

# Overperformers and underperformers

The point of the regression is to predict the percent passing, right? We can use `result.predict()` to get the predicted passing rate for each school. Try to run it below: 

Now, let's **save that value into a new column**. We can call it `predicted_passing`.

### Confirm that Charles W. Duval had a predicted passing rate of 32.

## Now let's find the difference between the predicted passing rate and the actual passing rate

If we're being stats-y, this is called **the residual**. Save it into a new column called.... `residual`.

* **Tip:** Think real hard about which direction you should be subtracting in.

### Find the 10 schools that did much worse than predicted

* PRINCETON HOUSE CHARTER should be the worst, with PEPIN ACADEMIES below that

### Find the top 10 schools that did better than predicted

* PARKWAY MIDDLE SCHOOL should be the best, and PATHWAYS should be second
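Sorting by the residual gets you both lists. This toy sketch assumes the residual is actual minus predicted, so the most negative values are the biggest underperformers. The school names and numbers are invented.

```python
import pandas as pd

# Toy residuals (actual minus predicted).
df = pd.DataFrame({
    "school_name": ["SCHOOL A", "SCHOOL B", "SCHOOL C", "SCHOOL D"],
    "residual": [-20.5, 3.0, 15.2, -1.1],
})

# Most negative first: did much worse than predicted.
worst = df.sort_values("residual").head(10)

# Most positive first: did much better than predicted.
best = df.sort_values("residual", ascending=False).head(10)

print(worst)
print(best)
```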

# What problems might our analysis have?

We brought in two things we thought would do a good job covering socioeconomics and demographic patterns. What else might we be missing?

* **Tip:** Pay attention to the names of the schools