# Looking at Descriptive Statistics and Probabilities

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('https://corgis-edu.github.io/corgis/datasets/csv/classics/classics.csv')

In [None]:
df.describe()

In [None]:
df['metrics.statistics.words'].describe()

In [None]:
df['metrics.statistics.words'].plot(kind='hist',bins=100)

In [None]:
df.sort_values(by='metrics.statistics.words',
               ignore_index=True).loc[0:950,'metrics.statistics.words'].plot(kind='hist',bins=100)

In [None]:
df.sort_values(by='metrics.statistics.words',
               ignore_index=True).loc[0:950,'metrics.statistics.words'].plot(kind='box')

In [None]:
df['metrics.statistics.words'].count()

In [None]:
df['metrics.statistics.words'].mean()

In [None]:
df2 = df.sort_values(by='metrics.statistics.words').copy()

In [None]:
df2['metrics.statistics.words']

In [None]:
df2 = df.sort_values(by='metrics.statistics.words',ignore_index=True).copy()

In [None]:
df2['metrics.statistics.words']

In [None]:
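# Trimmed mean: average only the middle 80% of the sorted rows (df2 is sorted with a fresh 0..n-1 index)
df2.loc[int(0.1*len(df2)):int(0.9*len(df2)),'metrics.statistics.words'].mean()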
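The cell above computes a trimmed mean: the data is sorted and the extreme values at both ends are dropped before averaging. A minimal sketch of the same idea on made-up toy data is below. The helper name `trimmed_mean` and the toy values are assumptions for illustration, and the helper drops exactly the same number of values from each end, so its slice boundaries differ slightly from the `.loc` call above.

```python
import numpy as np

def trimmed_mean(values, fraction=0.1):
    """Mean after dropping `fraction` of the values from each end (hypothetical helper)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(fraction * len(ordered))  # number of values to drop at each end
    return ordered[k:len(ordered) - k].mean()

# Toy data (made up): a single extreme value pulls the plain mean far upward.
toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
plain = np.mean(toy)
trimmed = trimmed_mean(toy, 0.1)
print('Plain mean:  ', plain)
print('Trimmed mean:', trimmed)
```

The trimmed mean stays close to the bulk of the data, which is why it is useful for long-tailed quantities such as word counts.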

In [None]:
df['metrics.statistics.words'].median()

In [None]:
df['metrics.statistics.words'].std()

In [None]:
df['metrics.statistics.words'].quantile()

The default for `quantile()` is `q=0.5`, the 50% quantile, which is the median.

In [None]:
df['metrics.statistics.words'].quantile(0.25)

In [None]:
df['metrics.statistics.words'].quantile(0.75)

In [None]:
df['metrics.statistics.words'].plot(kind='box')

Here the box spans Q1 to Q3. Each whisker extends to the most extreme data point that lies within 1.5 * IQR of the box, where IQR = Q3 - Q1. Data beyond the whiskers is shown as separate dots.
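To make the box plot rule concrete, here is a small sketch that computes the quartiles, the IQR, the whisker ends, and the outliers by hand. The toy array is made up for illustration, and the variable names (`lower_fence`, `upper_fence`, and so on) are assumptions rather than anything the plotting code exposes.

```python
import numpy as np

# Toy data (made up), with one obvious outlier.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as separate dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print('Q1, Q3, IQR:', q1, q3, iqr)
print('Whiskers:   ', whisker_low, whisker_high)
print('Outliers:   ', outliers)
```

Note that the whiskers stop at actual data values, not at the fences themselves.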

Let's look at sentiment too!

In [None]:
df['metrics.sentiments.polarity'].plot(kind='box')

In [None]:
df[df['metrics.sentiments.polarity']<0]

In [None]:
df['metrics.sentiments.polarity'].plot(kind='hist')

In [None]:
df['metrics.sentiments.polarity'].plot(kind='hist', bins=500)

In [None]:
df['metrics.sentiments.polarity'].plot(kind='hist', bins=5000)

In [None]:
print('Mean: ',df['metrics.sentiments.polarity'].mean())
print('Median: ',df['metrics.sentiments.polarity'].median())
print('Mode: ',df['metrics.sentiments.polarity'].mode())
df['metrics.sentiments.polarity'].plot(kind='hist',bins=25);

In [None]:
print('Mean: ',df['bibliography.author.birth'].mean())
print('Median: ',df['bibliography.author.birth'].median())
print('Mode: ',df['bibliography.author.birth'].mode())
df['bibliography.author.birth'].plot(kind='hist',bins=25);

In [None]:
df2 = df.loc[df['bibliography.author.birth']>1800]
print('Mean: ',df2['bibliography.author.birth'].mean())
print('Median: ',df2['bibliography.author.birth'].median())
print('Mode: ',df2['bibliography.author.birth'].mode())
df2['bibliography.author.birth'].plot(kind='hist',bins=25);

## Looking at Distributions with Random Numbers

In [None]:
import numpy as np
import matplotlib.pyplot as plt

Let's simulate flipping a coin:

In [None]:
coin = ['heads','tails']

In [None]:
np.random.choice(coin,
                 p=[0.5, 0.5])

In [None]:
def flipcoin():
    return np.random.choice(coin)

In [None]:
flipcoin()

Can we tell if an actual coin we're flipping is biased?

We could do a lot of hypothetical coin flips with our code to see what the distribution of outcomes is like.

If we want to flip a coin n times:

In [None]:
def flips(n):
    results = []  # named differently from the function to avoid confusion
    for i in range(n):
        results.append(flipcoin())
    return results

In [None]:
print(flips(20))

Computers are convenient because we can get them to do our boring repetitive work thousands and thousands (and millions!) of times.

We can also make a function to count the number of times 'heads' came up in our trial:

In [None]:
def countheads(a):
    headcount = 0
    for i in a:
        if i == 'heads':
            headcount += 1
    return headcount

In [None]:
countheads(flips(500))

With these pieces we can look at the distribution of head counts: take a given number of samples, each consisting of a given number of coin flips, and count the heads in each sample.

In [None]:
def coinsamples(numsamples, numflips):
    samples = []
    for i in range(numsamples):
        samples.append(countheads(flips(numflips)))
    return samples

In [None]:
coinsamples(10, 20)
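The loops above are easy to read but slow for large numbers of samples. As a sketch of an alternative, the same counts can be produced with array operations. The function name `coinsamples_fast`, the `probheads` parameter, and the seed are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily, for reproducibility

def coinsamples_fast(numsamples, numflips, probheads=0.5):
    """Heads counted in each of `numsamples` samples of `numflips` flips (hypothetical helper)."""
    # Each row is one sample; an entry is True (heads) with probability probheads.
    heads = rng.random((numsamples, numflips)) < probheads
    return heads.sum(axis=1)

counts = coinsamples_fast(10, 20)
print(counts)
```

This draws all the flips in one call and sums each row, so it scales to millions of flips without a Python-level loop.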

In [None]:
numflips = 300
numsamples = 100
fliparray = coinsamples(numsamples, numflips)
plt.hist(fliparray)
plt.xlim(0,numflips)

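This looks like a Gaussian (normal) distribution! Strictly speaking, the number of heads follows a binomial distribution, which a Gaussian approximates well when the number of flips is large.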
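One way to check the shape of the histogram is to compare the simulated head counts with what the binomial distribution predicts: a mean of n*p and a standard deviation of sqrt(n*p*(1-p)). The sketch below uses NumPy's binomial sampler instead of the flip-by-flip loop; the seed and the sample size are arbitrary choices.

```python
import numpy as np

numflips, probheads = 300, 0.5

# Theoretical mean and standard deviation of the number of heads
expected_mean = numflips * probheads                            # 150
expected_std = np.sqrt(numflips * probheads * (1 - probheads))  # about 8.66

# Simulate many samples of 300 flips each
rng = np.random.default_rng(seed=1)  # arbitrary seed
counts = rng.binomial(numflips, probheads, size=100_000)

print('Mean: theory', expected_mean, 'simulation', counts.mean())
print('Std:  theory', expected_std, 'simulation', counts.std())
```

With this many samples, the simulated values should land very close to the theoretical ones.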

Final step: put this functionality into a Python function:

In [None]:
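def makenorm(n, flips):
    """Plot the number of heads in each of n samples of `flips` fair coin flips."""
    xnums = []
    for i in range(n):
        xnums.append(0)
        for j in range(flips):
            if np.random.choice([1, 0]):  # 1 = heads, 0 = tails
                xnums[i] += 1
    # One bin per possible head count, 0 through flips, centered on the integers
    plt.hist(xnums, bins=np.arange(-0.5, flips + 1.5, 1), rwidth=0.8)
    plt.xlim(-0.5, 0.5 + flips)
    plt.show()

makenorm(100, 40)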

What if the coin is biased, so that heads only comes up 30% of the time?  Could we tell from the distribution?

In [None]:
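def makenorm(n, flips, probheads):
    """Plot the number of heads in each of n samples of `flips` flips of a biased coin."""
    xnums = []
    for i in range(n):
        xnums.append(0)
        for j in range(flips):
            # 1 = heads with probability probheads, 0 = tails otherwise
            if np.random.choice([1, 0], p=[probheads, 1 - probheads]):
                xnums[i] += 1
    # One bin per possible head count, 0 through flips, centered on the integers
    plt.hist(xnums, bins=np.arange(-0.5, flips + 1.5, 1), rwidth=0.8)
    plt.xlim(-0.5, 0.5 + flips)
    plt.show()

makenorm(100, 40, 0.3)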
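To put a number on "could we tell?", one option is to ask how often a fair coin would produce a result as low as what the biased coin typically gives. The sketch below sums the binomial probabilities directly. The helper name `binom_cdf` is an assumption, and using the expected count of the 30% coin (12 heads out of 40) as the threshold is an illustrative choice.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term (hypothetical helper)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

numflips = 40
typical_biased = round(0.3 * numflips)  # 12 heads: expected count for a 30% coin

# Chance that a fair coin gives 12 or fewer heads in 40 flips
p_fair = binom_cdf(typical_biased, numflips, 0.5)
print('P(fair coin gives <=', typical_biased, 'heads):', p_fair)
```

For 40 flips this probability comes out below 1%, so a typical sample from the 30% coin would already look very unusual for a fair coin.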

**Racial sampling**

[The following is based on an example from Berkeley's Data8 JupyterBook, [linked here](https://inferentialthinking.com/chapters/11/1/Assessing_a_Model.html)]

Calculating random numbers from a distribution can be more than just a mathematical exercise. Statistics and coding can be used to explore inequality.

**Amendment VI of the United States Constitution**

"In all criminal prosecutions, the accused shall enjoy the right to a speedy and public trial, by an impartial jury of the State and district wherein the crime shall have been committed."


**The Supreme Court case of Robert Swain**

Robert Swain was a Black man convicted in Talladega County, Alabama, in 1962. His case was appealed all the way up to the U.S. Supreme Court based on the claim that Black people were systematically excluded from juries in Talladega County. (This case also involved issues related to peremptory challenges; I recommend reading the link above and its associated references if you are interested.)

**A few details**
* In Talladega County, 26% of men were Black.
* Only 8 men among the 100-member jury panel in Robert Swain's case were Black.
* Robert Swain also pointed out that this county's jury panels over the past 10 years had contained only a small percentage of Black panelists.
* The U.S. Supreme Court wrote that “the overall percentage disparity has been small,” and Robert Swain was later sentenced to life in prison.

**Our question**
* If panel members were selected at random from a population that is 26% Black, is it reasonable to expect a jury panel in this county to have only 8% Black membership?

**The model**
* A model is a hypothesis about the world
* Here: the panel was selected at random
* Under this model, having only 8 Black members on the 100-member panel was just due to chance

We can assess this model with code:

* Simulate data based on the model
* Show what the data would be like if panel members were selected at random
* Compare the simulated data with the real data
* If they're not consistent, reject the model

In [None]:
elems = ['b','w']
elems_perc = [0.26, 0.74]

In [None]:
np.random.choice(elems, p=elems_perc)

In [None]:
panelsize = 100
panel = []
for i in range(panelsize):
    race = np.random.choice(elems, p=elems_perc)
    panel.append(race)
print('Number of Black members on the panel: ', panel.count('b'))

In [None]:
numpanels = 10000
panelsize = 100
numblackmembers = []

for i in range(numpanels):
    panel = []
    for j in range(panelsize):  # separate loop variable so the outer i is not shadowed
        race = np.random.choice(elems, p=elems_perc)
        panel.append(race)
    numblackmembers.append(panel.count('b'))

# One bin per count, centered on the integers 6 through 45
plt.hist(numblackmembers, bins=np.arange(5.5, 46.5, 1))
plt.xlim(5.5, 46.5)

In [None]:
# Same histogram, zoomed in on the y-axis so that bins holding only a few panels are visible
plt.hist(numblackmembers, bins=np.arange(5.5, 46.5, 1))
plt.xlim(5.5, 46.5)
plt.ylim(0,5)
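The last step of assessing the model is comparing the simulated panels with the observed one. A minimal sketch of that comparison is below: it counts how many simulated panels have 8 or fewer Black members, and also computes the exact probability of that outcome under the random-selection model. It uses NumPy's binomial sampler in place of the member-by-member loop; the seed and the variable names are assumptions for illustration.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(seed=2)  # arbitrary seed
panelsize, prob_black, observed = 100, 0.26, 8

# One binomial draw per simulated panel replaces the loop over panel members.
simulated = rng.binomial(panelsize, prob_black, size=10_000)
frac_as_extreme = np.mean(simulated <= observed)

# Exact probability of 8 or fewer Black panelists if selection were random
exact_tail = sum(
    comb(panelsize, k) * prob_black**k * (1 - prob_black)**(panelsize - k)
    for k in range(observed + 1)
)

print('Fraction of simulated panels with <= 8 Black members:', frac_as_extreme)
print('Exact probability under the model:', exact_tail)
```

The exact probability is far smaller than 1 in 10,000, so the observed panel is not consistent with the random-selection model, and by the procedure above the model is rejected.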