Statistics Libraries
====================

This notebook runs some examples of different statistics
functions on our data set. The data we are looking at
is the combined school demographics data and NYS
math test scores which are given each year in grades
3-8. The basic hypothesis we are attempting to test is
that the percentage of Black students in a school
has no effect on the average test score for the school.
This is a complex topic and this notebook does not
claim to adequately explore the topic. It is meant to
give us a first look at how we might explore this question.

In addition to `pandas` there are two "main" libraries
for doing mathematical and statistical operations:
`scipy` and `numpy`. In addition to these libraries, this
notebook uses a package called [Pingouin](https://pingouin-stats.org/)
which "wraps" these other packages and works nicely with
pandas `DataFrames`.

To run these examples you need all 3 libraries installed (along with their dependencies). You can do this with `pip` or `conda`. Here are the `pip` commands:

```
pip install numpy
pip install scipy
pip install pingouin
```
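
Before running the cells below, it can help to confirm that the packages are importable in the current environment. The following is a minimal sketch using only the standard library. The `required` list is an assumption based on the imports used in this notebook.

```python
import importlib.util

# Import names used in this notebook (assumed to match the pip package names)
required = ["pandas", "numpy", "scipy", "pingouin"]

# find_spec returns None when a top-level package cannot be located
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```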


In [2]:
# load pandas and the dataset
import pandas as pd


# this dataframe has the school demographic data
# and the NYS math test scores for grades 3-8 combined
df = pd.read_csv("demo-math.csv")
# the first column is not needed -- it's the combined dbn and school name
df.drop(df.columns[0], axis=1, inplace=True)
df.columns

Index(['dbn', 'school_name', 'year', 'total_enrollment',
       'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3',
       'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9',
       'grade_10', 'grade_11', 'grade_12', 'female', 'female_1', 'male',
       'male_1', 'asian', 'asian_1', 'black', 'black_1', 'hispanic',
       'hispanic_1', 'multi_racial', 'multi_racial_1', 'native_american',
       'native_american_1', 'white', 'white_1', 'missing_race_ethnicity_data',
       'missing_race_ethnicity_data_1', 'students_with_disabilities',
       'students_with_disabilities_1', 'english_language_learners',
       'english_language_learners_1', 'poverty', 'poverty_1',
       'economic_need_index', 'district', 'boro', 'grade', 'category',
       'number_tested_math', 'mean_scale_score_math', 'level1_n_math',
       'level1_math', 'level2_n_math', 'level2_math', 'level3_n_math',
       'level3_math', 'level4_n_math', 'level4_math', 'level3_4_n_math',
       'l

In [2]:
import numpy as np
from scipy.stats import pearsonr

numpy_R = np.corrcoef(df["black_1"], df["mean_scale_score"])

scipy_R = pearsonr(df["black_1"], df["mean_scale_score"])
print("numpy R:", numpy_R)
print("scipy_R:", scipy_R)

numpy R: [[ 1.         -0.47137621]
 [-0.47137621  1.        ]]
scipy_R: (-0.47137621471716734, 0.0)


Numpy and Scipy Pearson Coefficients
--------------------------------------------------------

The code above uses two different libraries to calculate the
correlation between `black_1` (the percentage of Black students in a school)
and `mean_scale_score` (the average mean test score for the school in a given year --
for all grades and all demographic categories).
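
To make the quantity concrete, here is a sketch of the Pearson coefficient computed from its textbook definition (covariance divided by the product of the standard deviations). The helper `pearson_r` and the sample values are made up for illustration and are not taken from the data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# made-up values: a percentage and a score for five hypothetical schools
pct = [10.0, 20.0, 30.0, 40.0, 50.0]
score = [310.0, 305.0, 300.0, 302.0, 290.0]

r = pearson_r(pct, score)
print("r =", round(r, 4))
```

A negative value means that higher percentages tend to go with lower scores in the sample; it says nothing by itself about cause.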

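The `numpy` function [`corrcoef`](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html)
returns a correlation matrix rather than a single number. For two input `Series` the matrix is 2×2:
the diagonal entries are 1 (each variable's correlation with itself) and the two off-diagonal
entries both hold R. We see that `numpy` returns a value of `-0.47137621` for R.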
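
If only the single coefficient is needed, it can be read from an off-diagonal entry of the matrix. The sketch below uses made-up arrays rather than the notebook's data.

```python
import numpy as np

# made-up values, not the school data
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([310.0, 305.0, 300.0, 302.0, 290.0])

r_matrix = np.corrcoef(x, y)
print(r_matrix.shape)

# the matrix is symmetric, so [0, 1] and [1, 0] hold the same value
r = r_matrix[0, 1]
print("r =", r)
```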

The [`pearsonr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html?highlight=pearson) function from the `scipy.stats`
library runs the same statistical operation as `numpy`, but
it returns both the R coefficient and the p-value for the test. We see that for our data
R=-0.47137621471716734 and the p-value is reported as `0.0`: the true value is too small
to represent as a floating-point number, so p < .01.
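
A rough way to see why the p-value prints as zero is to convert R into a t statistic and look at how far out in the tail it falls. The sketch below uses a normal approximation to the tail probability, which is an assumption made for simplicity and is not how `scipy` computes its p-value. The values of `r` and `n` are copied from the outputs in this notebook.

```python
import math
import sys

r = -0.47137621471716734  # coefficient reported above
n = 9379                  # sample size reported by pingouin in this notebook

# t statistic for testing whether the correlation is zero
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# normal approximation to the two-sided p-value: P(|Z| > |t|)
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

print("t statistic:", round(t_stat, 1))
print("approximate p-value:", p_approx)
print("smallest positive normal float:", sys.float_info.min)
```

With a t statistic this far from zero, the tail probability is smaller than any positive double-precision number, so it underflows to `0.0`.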


Pingouin
-------------
The next example uses the `pingouin` library
to find the correlation on the same two `Series`.
In this example, we use the [`corr`](https://pingouin-stats.org/generated/pingouin.corr.html#pingouin.corr)
function, which **returns** a pandas `DataFrame`. Pingouin computes the Pearson correlation coefficient
as well as several related statistics and returns all of the results in a single row.


In [4]:
import pingouin as pg

penguin_R = pg.corr(df["black_1"], df["mean_scale_score"])
print("With pg.corr we get a dataframe which can display this table of results:")

# show the dataframe
display(penguin_R)

# we can also get single values from the df...
# since there is only one row, we always want the zero-eth row
print(f"""
The columns in the dataframe results: {", ".join(penguin_R.columns)}

• n={penguin_R["n"][0]}: the size of our data
• r={penguin_R["r"][0]}: the R coefficient (matches all of our other results, differing precision)
• p-val={penguin_R["p-val"][0]}: the p-value of the test
• CI95%={penguin_R["CI95%"][0]}: confidence intervals
• BF10={penguin_R["BF10"][0]}: Bayes Factor of the alternative hypothesis (only for Pearson correlation)
• power={penguin_R["power"][0]}: the power of the test
""")

With pg.corr we get a dataframe which can display this table of results:


| | n | r | CI95% | p-val | BF10 | power |
|---|---|---|---|---|---|---|
| pearson | 9379 | -0.471376 | [-0.49, -0.46] | 0.0 | inf | 1.0 |



The columns in the dataframe results: n, r, CI95%, p-val, BF10, power

• n=9379: the size of our data
• r=-0.47137621471716734: the R coefficient (matches all of our other results, differing precision)
• p-val=0.0: the p-value of the test
• CI95%=[-0.49 -0.46]: confidence intervals
• BF10=inf: Bayes Factor of the alternative hypothesis (only for Pearson correlation)
• power=1.0: the power of the test
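
The 95% confidence interval can be reproduced by hand with the Fisher z-transformation, a standard method for intervals around a Pearson coefficient. That `pingouin` uses exactly this procedure is an assumption here; the sketch only shows that the method recovers the interval printed above from the reported `r` and `n`.

```python
import math
from statistics import NormalDist

r = -0.47137621471716734  # coefficient reported above
n = 9379                  # sample size reported above

# Fisher transformation: z is approximately normal with standard error 1/sqrt(n - 3)
z = math.atanh(r)
se = 1 / math.sqrt(n - 3)

# two-sided 95% critical value of the standard normal (about 1.96)
z_crit = NormalDist().inv_cdf(0.975)

# build the interval on the z scale, then transform back to the r scale
lower = math.tanh(z - z_crit * se)
upper = math.tanh(z + z_crit * se)

print(f"95% CI: [{lower:.2f}, {upper:.2f}]")  # prints 95% CI: [-0.49, -0.46]
```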



In [5]:
# now let's run several correlations and put them in a single df
pct_black = pg.corr(df["black_1"], df["mean_scale_score"])
pct_ell = pg.corr(df["english_language_learners_1"], df["mean_scale_score"])
pct_swd = pg.corr(df["students_with_disabilities_1"], df["mean_scale_score"])
pct_pov = pg.corr(df["poverty_1"], df["mean_scale_score"])
combo = pd.concat([pct_black, pct_ell, pct_swd, pct_pov])
combo["category"] = ["% Black", "% ENL", "% SWD", "% poverty"]


combo[["category", "n", "r", "p-val"]].sort_values(by="r")


| | category | n | r | p-val |
|---|---|---|---|---|
| pearson | % poverty | 9379 | -0.656381 | 0.0 |
| pearson | % SWD | 9379 | -0.539883 | 0.0 |
| pearson | % Black | 9379 | -0.471376 | 0.0 |
| pearson | % ENL | 9379 | -0.19013 | 4.624005e-77 |


In [6]:
# run analysis of covariance (ancova) analyzing percent black and percent poverty

pg.ancova(data=df, dv='mean_scale_score', covar='poverty_1', between='black_1')

| | Source | SS | DF | F | p-unc | np2 |
|---|---|---|---|---|---|---|
| 0 | black_1 | 1019744.0 | 718 | 7.491231 | 0.0 | 0.383161 |
| 1 | poverty_1 | 659140.2 | 1 | 3476.675117 | 0.0 | 0.286484 |
| 2 | Residual | 1641653.0 | 8659 | | | |
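
The `F` and `np2` columns can be checked directly from the sums of squares (`SS`) and degrees of freedom (`DF`) in the table. The sketch below uses the usual definitions: F is the effect mean square divided by the residual mean square, and partial eta-squared is SS_effect / (SS_effect + SS_residual). The numbers are copied from the output above, so small differences from rounding are expected.

```python
# values copied from the ANCOVA table above (rounded as displayed)
ss = {"black_1": 1019744.0, "poverty_1": 659140.2, "Residual": 1641653.0}
dof = {"black_1": 718, "poverty_1": 1, "Residual": 8659}

# residual mean square: unexplained variation per residual degree of freedom
ms_resid = ss["Residual"] / dof["Residual"]

results = {}
for source in ("black_1", "poverty_1"):
    # F: effect mean square relative to residual mean square
    f_stat = (ss[source] / dof[source]) / ms_resid
    # partial eta-squared: share of (effect + residual) variation due to the effect
    np2 = ss[source] / (ss[source] + ss["Residual"])
    results[source] = (f_stat, np2)
    print(f"{source}: F={f_stat:.3f}, np2={np2:.3f}")
```

Note that `black_1` has 718 degrees of freedom because `between` treats each distinct percentage as its own group.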
