# Data Representativeness

### Introduction

In this section, we'll see various techniques for making inferences from data.  But first, we'll need to make sure that our data somehow *represents* the world that we are trying to describe.  

In this lesson, we'll work with SAT data from NYC high schools. Like almost all data we'll work with, it is incomplete, so we'll need to look at how its limitations change what we can and cannot conclude.

### Loading our data

In [3]:
import pandas as pd
url = 'https://raw.githubusercontent.com/analytics-engineering-jigsaw/data-visualization/main/2-storytelling/1-what-to-focus-on/sat_scores.csv'
df = pd.read_csv(url, index_col = 0)

In [4]:
df[:2]

,dbn,name,num_test_takers,reading_avg,math_avg,writing_score,boro,total_students,graduation_rate,attendance_rate,college_career_rate
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29.0,355.0,404.0,363.0,M,171,0.66,0.87,0.36
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91.0,383.0,423.0,366.0,M,465,0.9,0.93,0.7


Looking at the data, we can see columns for the borough (`boro`) and for each school's average reading, math, and writing scores.

In [5]:
df['boro'].unique()

array(['M', 'X', 'K', 'Q', 'R'], dtype=object)

We can replace the one-letter borough codes with the [full borough names](https://en.wikipedia.org/wiki/Boroughs_of_New_York_City).

In [6]:
# Map each one-letter borough code to its full name
keys = ['M', 'X', 'K', 'Q', 'R']
values = ['Manhattan', 'Bronx', 'Brooklyn', 'Queens', 'Staten Island']
dictionary = dict(zip(keys, values))
boro_updated = df['boro'].replace(dictionary)
# assign returns a new DataFrame and leaves df unchanged
df_updated = df.assign(boro = boro_updated)

### Drawing Conclusions

There are different things we may want to say about the data.  For example, let's see how SAT scores rank by borough.

In [7]:
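# Average every numeric column within each borough
avg_by_boro = df_updated.groupby('boro').mean(numeric_only = True)
# Rank the boroughs by their average math score, highest first
avg_by_boro[['math_avg', 'reading_avg']].sort_values('math_avg', ascending = False)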

boro,math_avg,reading_avg
Staten Island,472.5,457.5
Queens,450.65,423.916667
Manhattan,442.886076,426.696203
Brooklyn,404.030612,391.255102
Bronx,394.0,384.2375


So is this data accurate?  And what conclusions can we draw from it?

Can we say that students from Staten Island tend to be better at math and reading?  Or better at math and reading skills assessed by the SAT?

To get a better idea of what we can and cannot say, let's check the data.  The first check is to ask whether the results seem right.  Here, the ranking above may be surprising.

Another way is to check the underlying data.  What are some ways you could check the data above to see whether it really allows us to draw conclusions about SAT performance?
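
One possible check, sketched here on a small made-up table rather than the real dataset, is to look at how many schools stand behind each borough average.  The column names mirror `df_updated`, but the rows and values below are hypothetical.

```python
import pandas as pd

# Hypothetical rows that mimic two columns of df_updated (not real data).
toy = pd.DataFrame({
    'boro': ['Staten Island', 'Bronx', 'Bronx', 'Bronx', 'Queens', 'Queens'],
    'math_avg': [470.0, 380.0, 400.0, 390.0, 450.0, 440.0],
})

# Report the average next to the number of schools it is based on.
summary = toy.groupby('boro')['math_avg'].agg(mean_math='mean', n_schools='count')
print(summary.sort_values('mean_math', ascending=False))
```

An average built from a single school tells us much less about a borough than one built from dozens, so the count is worth reading alongside the mean.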

### The problem with missing data

Before moving on, let's be explicit about the problem with missing data.  Missing data is ok if our data is still *representative* of the underlying population we are studying (here, NYC schools across the boroughs).  For example, one way we may achieve this is by taking a random sample of schools from the different boroughs and assessing the SAT performance of their students.  In that case, we would expect the data we capture to look like the population in general.
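
To see why random sampling helps, here is a minimal simulation.  The population below is made up (400 hypothetical schools with scores drawn from a normal distribution), not taken from our dataset.  The point is only that a random sample's average tends to land close to the population's average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: math averages for 400 made-up schools.
population = rng.normal(loc=430, scale=60, size=400)

# Draw 100 schools at random, without replacement.
sample = rng.choice(population, size=100, replace=False)

print("population mean:", round(population.mean(), 1))
print("sample mean:    ", round(sample.mean(), 1))
```

Even though three quarters of the schools are missing from the sample, the two means should be close, because nothing about a school's score affected whether it was chosen.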

**But normally**, when data is missing from our dataset, it's because that data is hard to capture, or wasn't reported, or was reported incorrectly.  And often the records that are hard to capture look very different from the records that are easy to capture.  So this can *bias* our dataset.

So one thing to be worried about with our above data is *reporting bias*.  The better the results, the more likely a school (or anyone else) is to report them.  And this means that the missing schools may be performing worse than the reporting ones.
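
Here is a small simulation of reporting bias.  Both the scores and the reporting rule are invented for illustration: we assume that a school's chance of reporting rises with its score.  Under that assumption, the average among reporting schools overstates the average across all schools.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: math averages for 400 made-up schools.
scores = rng.normal(loc=430, scale=60, size=400)

# Assumed reporting rule (invented for this sketch): the probability of
# reporting follows a logistic curve that rises with the score.
p_report = 1 / (1 + np.exp(-(scores - 430) / 30))
reported = rng.random(400) < p_report

print("all schools:      ", round(scores.mean(), 1))
print("reporting schools:", round(scores[reported].mean(), 1))
print("missing schools:  ", round(scores[~reported].mean(), 1))
```

In practice we only ever see the middle number, which is why it helps to ask who is missing and why before trusting an average.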

### Your turn

Ok, so is there missing data in our dataset?  Spend the next fifteen minutes exploring the dataset to see if and how we may be missing data.  Also make a general assessment of how representative our dataset is of the underlying population.  Does our dataset allow us to draw certain conclusions about SAT performance or schools?

> Don't be so skeptical; this dataset does offer value.

Ok, we'll let you explore it for representativeness.

In [10]:
df_updated[:3]

,dbn,name,num_test_takers,reading_avg,math_avg,writing_score,boro,total_students,graduation_rate,attendance_rate,college_career_rate
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29.0,355.0,404.0,363.0,Manhattan,171,0.66,0.87,0.36
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91.0,383.0,423.0,366.0,Manhattan,465,0.9,0.93,0.7
2,01M450,EAST SIDE COMMUNITY SCHOOL,70.0,377.0,402.0,370.0,Manhattan,683,0.92,0.94,0.77
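
If you're not sure where to start, here is one possible set of checks, sketched on a tiny made-up table so that it runs on its own.  The column names match `df_updated`, but the rows are hypothetical, and the participation ratio assumes that `total_students` is a sensible denominator, which the dataset does not confirm.

```python
import pandas as pd

# Hypothetical rows that mimic the columns of df_updated (not real data).
toy = pd.DataFrame({
    'boro': ['Manhattan', 'Manhattan', 'Bronx', 'Bronx'],
    'num_test_takers': [29.0, 91.0, None, 40.0],
    'math_avg': [404.0, 423.0, None, 390.0],
    'total_students': [171, 465, 300, 400],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()

# What share of schools in each borough has no math average?
missing_share_by_boro = toy['math_avg'].isna().groupby(toy['boro']).mean()

# Rough participation ratio: test takers relative to enrollment.
# Assumption: total_students is a reasonable denominator for this ratio.
participation = toy['num_test_takers'] / toy['total_students']

print(missing_per_column)
print(missing_share_by_boro)
print(participation.round(2))
```

Swapping `toy` for `df_updated` runs the same checks on the real data.  If missing values cluster in one borough, or if only a small share of students sat the test, that limits what the borough averages can tell us.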


### Summary

In this lesson, we talked about considering the representativeness of our data.  That is, if we're going to make inferences from our data -- does our data reflect the real world?

We discussed the problem of missing data -- that it can bias our dataset.  This occurs when the data that is missing is different from the data that is present.  In our SAT dataset, one way this can happen is through reporting bias, where schools with better results are more likely to report them.