# Week 4: Statistical Inference with Python

In week 4, we've covered:
* **Probability Distributions**:
    * Binomial Distribution
    * Beta Distribution
* **Inference**:
    * Estimating a batting average from data
* **Variance**:
    * Measuring uncertainty
    * Empirical Bayes estimation

The best way to consolidate this knowledge is to practice.<br>Please complete the parts marked with <span style="color:green">**# TODO**</span>.

[Google](https://www.google.com) and the [Python Documentation](https://docs.python.org/3/contents.html) are good resources if you have any Python questions.

Upload the **Week4_Statistical_Inference_With_Python_Homework.ipynb** notebook to your Google Drive and open it with Google Colab.

## Probability Distributions  

A probability distribution is a function that assigns a probability to every event that might occur in an experiment. One of the simplest is the binomial distribution, which describes the number of successes in *n* independent trials, where each trial either succeeds with probability *p* or fails with probability 1 - *p*. (A 6-sided die, by contrast, has 6 possible outcomes per roll, each side having some probability of landing up and all 6 probabilities summing to 1; that is a categorical distribution, not a binomial one.) An RV example of a quantity that follows the binomial distribution is the number of people who convert. We can use the conversion rate, a value between 0 and 1, to describe the probability that any individual will convert. Note that each individual event is always either 0 (didn't convert) or 1 (converted), but the conversion rate can take any real value between 0 and 1.

For more details of the math of the binomial distribution, watch this Khan Academy video: https://www.khanacademy.org/math/statistics-probability/random-variables-stats-library/binomial-random-variables/v/binomial-distribution  
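To make the formula behind the binomial distribution concrete, here is a minimal sketch that computes the probability of seeing exactly *k* conversions among 10 customers with a conversion rate of 0.3. The helper name `binomial_pmf` is made up for this illustration and is not part of numpy.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The count with the highest probability
most_likely = max(range(n + 1), key=lambda k: pmf[k])

print("P(exactly 3 conversions):", round(pmf[3], 4))
print("most likely count:", most_likely)
print("sum of all probabilities:", round(sum(pmf), 6))
```

Even though 3 is the single most likely count, it occurs only about a quarter of the time, so seeing 2 or 4 conversions in a run of 10 is not surprising.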

Let's start by simulating some conversion events using Python.

In [27]:
import numpy as np

number_of_events = 1
conversion_rate = 0.3
number_of_trials = 10

Above, we have set up some parameters to simulate 10 customers, each with a probability of 0.3 of converting. Before running the code below, predict how many `1`s you expect to see. After you've written your prediction, run the cell several times and record the output in comments in the cell below.

In [28]:
# TODO: write your prediction here: 

sample = np.random.binomial(n = number_of_events, p = conversion_rate, size = number_of_trials)
print(sample)
print("total conversions: ", sample.sum())

[0 1 1 1 0 0 1 0 0 1]
total conversions:  5


In [29]:
# TODO: Write down what you observed. How did your observations compare to your prediction?

# I would expect to see 3 conversions

# what I actually saw:
# 1,4,2,3,3,4,2


As the number of trials increases, we would expect the actual proportion of 1s to approach the "true" conversion rate,  which in this case is 0.3 by our design. This phenomenon is called the "Law of Large Numbers." In the code block below, create a short experiment that demonstrates this phenomenon.

Note that if you have a numpy array of 0s and 1s named `sample`, you can get the proportion of 1s by using `sample.mean()`.

In [30]:
# TODO: Write a short experiment that demonstrates the Law of Large Numbers.

def print_experiment(trials):
    # print the number of trials and the observed proportion of conversions
    print(trials, np.random.binomial(n = number_of_events, p = conversion_rate, size = trials).mean())

# let's run our experiment a few times
for _ in range(5):
    for trials in (10, 100, 1000, 10000):
        print_experiment(trials)



10 0.3
100 0.3
1000 0.282
10000 0.2955
10 0.1
100 0.29
1000 0.318
10000 0.3092
10 0.4
100 0.3
1000 0.305
10000 0.3031
10 0.4
100 0.24
1000 0.309
10000 0.3028
10 0.4
100 0.31
1000 0.308
10000 0.2986
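Another way to see the Law of Large Numbers is to follow a single long run of trials and track the running proportion of 1s. The sketch below is one possible approach, not the required solution; it assumes numpy's `default_rng` generator with an arbitrarily chosen seed so the run is reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
conversion_rate = 0.3
flips = rng.binomial(n=1, p=conversion_rate, size=100_000)

# Proportion of 1s observed after each successive trial
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    estimate = running_mean[n - 1]
    error = abs(estimate - conversion_rate)
    print(f"{n:>7} trials: proportion = {estimate:.4f}, error = {error:.4f}")
```

The error does not shrink on every step, but it tends to get smaller as the number of trials grows.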


## Inference  

In the section above, we were told the true conversion rate by some all-knowing oracle, and then we explored how conversions might play out over a set of *n* trials. Typically, we are tasked with the reverse problem. That is, we are given access to some observed data and asked to infer what the true conversion rate is. This process is called "statistical inference" and it is a core concept in all of data science and machine learning. We will spend some time over the next weeks developing your Python skills to explore statistical inference.  

We will draw on the work of David Robinson, a well-known data science communicator, and read the first four articles of this series: http://varianceexplained.org/r/simulation-bayes-baseball/  

The coding language of the series is R, a statistical programming language, but you will be asked to recreate and explore some of the ideas here in Python. Let's start by understanding a second probability distribution, the beta distribution. This probability distribution is closely related to the binomial distribution we explored above. In short, it can be used to describe how plausible each possible value of the true conversion rate is, given some data. Read more about it in the first article of the series: http://varianceexplained.org/statistics/beta_distribution_and_baseball/
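As a quick reference, the mean and variance of a Beta distribution have closed forms: the mean is $\alpha / (\alpha + \beta)$ and the variance is $\alpha\beta / ((\alpha + \beta)^2 (\alpha + \beta + 1))$. The sketch below evaluates them for the article's $Beta(81, 219)$ and compares against a large random sample; the seed and sample size are arbitrary choices for illustration.

```python
import numpy as np

alpha, beta = 81, 219

# Closed-form mean and variance of a Beta(alpha, beta) distribution
mean = alpha / (alpha + beta)
variance = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
print("mean:", mean)
print("standard deviation:", round(np.sqrt(variance), 4))

# Compare against a large random sample
rng = np.random.default_rng(seed=0)
draws = rng.beta(alpha, beta, size=100_000)
print("sample mean:", round(draws.mean(), 4))
print("sample standard deviation:", round(draws.std(), 4))
```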

### Exercises  

1. In the code block below, write a short simulation that samples 25 batting averages from a Beta Distribution with parameters $\alpha = 81$ and $\beta = 219$. You will likely find this numpy method helpful: https://numpy.org/doc/stable/reference/random/generated/numpy.random.beta.html

In [31]:
# TODO
samples = np.random.beta(a=81, b=219, size=25)
print(samples)

[0.32344403 0.3038042  0.24905091 0.27493149 0.30088414 0.30275412
 0.30057067 0.28688761 0.28023192 0.26423462 0.25655339 0.27394064
 0.29488724 0.25160745 0.28198156 0.31278011 0.24682137 0.28737536
 0.29369698 0.26134498 0.25655318 0.29034012 0.23031611 0.28210651
 0.24048829]


2. What are the lowest and highest batting averages from your experiment above? What is the median batting average from your experiment? How does the median batting average from your experiment relate to the parameters alpha=81 and beta=219? (Leave a comment)

In [32]:
# TODO
print("min:", samples.min())
print("max:", samples.max())
print("median: ", np.median(samples))

# The mean of Beta(81, 219) is alpha / (alpha + beta).
# Student should have some text comparing this value to the median.
print(81 / (219 + 81))

min: 0.23031610782840148
max: 0.3234440293767052
median:  0.2819815596226825
0.27


3. Sample a single batting average from your experiment at random, and then use that batting average to simulate 300 at bats. Give a statistical summary of the 300 at bats.

    Hint: When you have a single probability of success (p) and a sequence of 300 Bernoulli(p) trials, what distribution should you sample from?

    Hint: What is a statistical summary? What 2 values can I usually use to describe a probability distribution?

In [33]:
# TODO
sampled_avg = np.random.choice(samples)
at_bats = np.random.binomial(1, sampled_avg, 300)
#print(at_bats)
print(sampled_avg)
print(at_bats.mean())
print(at_bats.var())

0.3127801066601673
0.31666666666666665
0.21638888888888885
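For a sequence of 0/1 outcomes with success probability $p$, the theoretical mean is $p$ and the theoretical variance is $p(1 - p)$. The sketch below compares these to the observed values for one simulated season; the batting average of 0.27 and the seed are assumptions chosen for illustration, not values from the exercise.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed
p = 0.27  # assumed batting average, for illustration only
at_bats = rng.binomial(n=1, p=p, size=300)

hits = at_bats.sum()
observed_mean = at_bats.mean()
observed_var = at_bats.var()

print("hits:", hits)
print("observed mean:", observed_mean, "theoretical mean:", p)
print("observed variance:", observed_var, "theoretical variance:", p * (1 - p))

# For 0/1 data the (population) sample variance is fully determined by the sample mean
print(np.isclose(observed_var, observed_mean * (1 - observed_mean)))
```

This is why the output above shows a variance of about 0.2164 alongside a mean of about 0.3167: 0.3167 × (1 - 0.3167) ≈ 0.2164.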


4. As in the article, let's start with the $Beta(81, 219)$ distribution as the starting point of our estimate for a player's batting average at the start of a season. Then we will simulate a 300 at-bat season where a player gets 100 hits.

    We will use the 300 new data points to *update* our initial best guess based on the prior information $Beta(81, 219)$.

    Finally, we will analyze our updated estimate for the batting average of the simulated player at the end of the season. Fill in the code below to complete the above.

In [34]:
# constants, the initial parameters of the beta distribution
ALPHA0 = 81
BETA0 = 219  

def update_beta(alpha0, beta0, hits, at_bats):  
    '''
    Parameters:
        alpha0, int: the initial number of successes
        beta0, int: the initial number of failures
        hits, int: the number of hits
        at_bats, int: the number of at bats
        
    Return:
        tuple with two members alpha and beta, each representing the updated successes and failures respectively.
    '''
    # TODO FILL IN THE CODE
    alpha = alpha0 + hits
    beta = beta0 + at_bats - hits
    return (alpha, beta)

# unit test. If you've written your function correctly, the following code should give the expected outputs
hits = 100
at_bats = 300
alpha, beta = update_beta(alpha0 = ALPHA0, beta0 = BETA0, hits = hits, at_bats = at_bats)
print(alpha) # should equal 181
print(beta) # should equal 419

181
419


Using code and text, give a brief statistical analysis of the updated probability distribution. If the manager asked you to give an estimate for the player's batting average at the end of the season, what answer would you give?

In [35]:
# TODO: Code
updated_distribution = np.random.beta(a = alpha, b = beta, size = 1000)
print("min: ", updated_distribution.min())
print("max: ", updated_distribution.max())
print("quartiles: ", np.quantile(updated_distribution, [0.25, 0.5, 0.75]))

min:  0.2422689157565732
max:  0.36226251335235093
quartiles:  [0.28727887 0.30004447 0.31266528]


TODO: Text - Great answer describes a likely range of values around the mean
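One possible way to back up the written answer with numbers is to report the mean of the updated distribution together with an interval that contains most of its probability. The sketch below uses the updated parameters 181 and 419 from the unit test; the 95% level, the seed, and the sample size are choices made for this illustration.

```python
import numpy as np

alpha, beta = 181, 419  # updated parameters from the unit test

# Closed-form mean and standard deviation of Beta(alpha, beta)
posterior_mean = alpha / (alpha + beta)
posterior_sd = np.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))

# Approximate a central 95% interval from a large random sample
rng = np.random.default_rng(seed=0)
draws = rng.beta(alpha, beta, size=100_000)
lower, upper = np.quantile(draws, [0.025, 0.975])

print(f"mean: {posterior_mean:.4f}")
print(f"standard deviation: {posterior_sd:.4f}")
print(f"approximate 95% interval: ({lower:.3f}, {upper:.3f})")
```

Note that the mean of about 0.302 sits between the prior mean (81/300 = 0.27) and the season's raw average (100/300 ≈ 0.333).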

## Variance  

So far, we've done some work to understand probability distributions, but we haven't taken advantage of the most important part: variance. Variance is a measure of uncertainty; it's a way to quantify everything you might *not* know about your estimate. Read the second article to begin to understand how variance can be used to leverage uncertainty: http://varianceexplained.org/r/empirical_bayes_baseball/

In [36]:
# TODO: load in "Batting.csv", "Pitching.csv", and "People.csv" into Pandas dataframes called batting, pitching, and master respectively

import pandas as pd  

batting = pd.read_csv('Batting.csv')
pitching = pd.read_csv('Pitching.csv')
master = pd.read_csv('People.csv')

### Exercises  

5. Use pandas methods to recreate the `career` data set from the article. You may find the pandas docs useful: https://pandas.pydata.org/pandas-docs/version/0.25.3/#  

    Focus on the logical steps taken in the article text and R code and create the appropriate pandas syntax.

In [37]:
# TODO: recreate career dataframe
pitchers = pitching['playerID'].tolist()
batting = batting[batting['AB'] > 0]  # keep only rows with at least one at bat
batting = batting[~batting['playerID'].isin(pitchers)]  # filter out pitchers
batting_sum = batting.groupby(['playerID']).agg({'H': 'sum', 'AB': 'sum'})  # total hits and at bats for each player
batting_sum['Avg'] = batting_sum['H'] / batting_sum['AB']  # career batting average
career = pd.merge(batting_sum, master, how='inner', on='playerID')[["nameGiven", "nameLast", "H", "AB", "Avg"]]
career.head()

Unnamed: 0,nameGiven,nameLast,H,AB,Avg
0,Henry Louis,Aaron,3771,12364,0.304998
1,Tommie Lee,Aaron,216,944,0.228814
2,Fausto Andres,Abad,2,21,0.095238
3,John W.,Abadie,11,49,0.22449
4,Edward James,Abbaticchio,772,3044,0.253614
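The same logical steps (drop rows with no at bats, remove pitchers, total each player's hits and at bats, compute the average) can also be written as one method chain that leaves the original dataframe unmodified. The sketch below runs on tiny made-up tables rather than the real CSV files; the player IDs and numbers are invented for illustration.

```python
import pandas as pd

# Tiny made-up tables standing in for Batting.csv and Pitching.csv
batting = pd.DataFrame({
    "playerID": ["a01", "a01", "b02", "c03", "c03"],
    "H":  [10, 20, 0, 5, 7],
    "AB": [40, 60, 0, 30, 20],
})
pitching = pd.DataFrame({"playerID": ["c03"]})

career = (
    batting[batting["AB"] > 0]                                    # drop rows with no at bats
    .loc[lambda df: ~df["playerID"].isin(pitching["playerID"])]   # anti-join: drop pitchers
    .groupby("playerID", as_index=False)[["H", "AB"]].sum()       # career totals per player
    .assign(Avg=lambda df: df["H"] / df["AB"])                    # career batting average
)
print(career)
```

Only player `a01` survives here: `b02` never had an at bat and `c03` appears in the pitching table.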


6. Display the 5 highest and 5 lowest batting averages from the career dataset.

In [38]:
# TODO
career.sort_values('Avg', ascending=False).head(5)

Unnamed: 0,nameGiven,nameLast,H,AB,Avg
5388,Louis William,Lutz,1,1,1.0
5219,Charles William,Lindstrom,1,1,1.0
723,Stefan Alexander,Biras,2,2,1.0
3289,Roy William,Gleason,1,1,1.0
7110,Clarence William,Pickup,1,1,1.0


In [39]:
career.sort_values('Avg', ascending=True).head(5)

Unnamed: 0,nameGiven,nameLast,H,AB,Avg
1532,Travis Adrian,Chapman,0,1,0.0
6115,Michael Ryan,Miller,0,1,0.0
3269,Tinsley Rucker,Ginn,0,1,0.0
6124,Rodney Carter,Miller,0,1,0.0
6791,Brian Leonard,Ostrosser,0,5,0.0


### Note: Later in the article, he recommends filtering down to batters with at least 500 at bats. Do this now to roughly match the alpha and beta values in the article.

In [40]:
# TODO
career = career[career["AB"] >= 500].copy()  # .copy() avoids a SettingWithCopyWarning when adding columns later

7. Calculate the mean and variance of the empirical batting averages in the dataset

In [41]:
# TODO  
empirical_mean = career['Avg'].to_numpy().mean()
empirical_variance = career['Avg'].to_numpy().var()

print(empirical_mean, empirical_variance)

0.2580094444017952 0.000627165172306701


8. Write functions to calculate alpha and beta using the method of moments and your estimated mean and variance above. https://stats.stackexchange.com/questions/12232/calculating-the-parameters-of-a-beta-distribution-using-the-mean-and-variance  

In [42]:
# TODO  
def moments(mu, sigma2):
    alpha = mu**2 * ((1 - mu) / sigma2 - (1 / mu))
    beta = alpha * (1 / mu - 1)
    return (alpha, beta)
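A simple way to sanity-check a method-of-moments function is a round trip: start from a Beta distribution with known parameters, compute its exact mean and variance, and confirm that the function recovers the original parameters. The sketch below repeats the `moments` function from the cell above so that it runs on its own, and uses $Beta(81, 219)$ as the known distribution.

```python
def moments(mu, sigma2):
    """Method-of-moments estimates of Beta parameters from a mean and a variance."""
    alpha = mu**2 * ((1 - mu) / sigma2 - (1 / mu))
    beta = alpha * (1 / mu - 1)
    return (alpha, beta)

# Exact mean and variance of a known Beta(81, 219)
a, b = 81, 219
mu = a / (a + b)
sigma2 = a * b / ((a + b) ** 2 * (a + b + 1))

# The method of moments should recover (approximately) 81 and 219
alpha_hat, beta_hat = moments(mu, sigma2)
print(alpha_hat, beta_hat)
```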

9. Use your function to calculate an $\alpha_0$ and $\beta_0$ for the career data set, and then calculate an empirical Bayes estimate of each player's batting average. Display the top 5 and bottom 5 estimated batting averages.

In [43]:
# TODO
alpha0, beta0 = moments(empirical_mean, empirical_variance)
print(alpha0, beta0)
# empirical Bayes estimate: (hits + alpha0) / (at bats + alpha0 + beta0)
career['eb'] = (career['H'] + alpha0) / (career['AB'] + alpha0 + beta0)
career.sort_values('eb', ascending=False).head(5)

78.49871612173963 225.74873615173013


Unnamed: 0,nameGiven,nameLast,H,AB,Avg,eb
4154,Rogers,Hornsby,2930,8173,0.358497,0.354891
4345,Joseph Walker,Jackson,1772,4981,0.355752,0.350125
2227,Edward James,Delahanty,2597,7510,0.345806,0.342387
3659,William Robert,Hamilton,2164,6283,0.344421,0.34043
3860,Harry Edwin,Heilmann,2660,7787,0.341595,0.338452


In [44]:
career.sort_values('eb', ascending=True).head(5)

Unnamed: 0,nameGiven,nameLast,H,AB,Avg,eb
641,William Aloysius,Bergen,516,3028,0.17041,0.178408
6822,Raymond Francis,Oyler,221,1265,0.174704,0.190855
9355,John Christopher,Vukovich,90,559,0.161002,0.195192
2561,Henry Propert,Easterday,203,1129,0.179805,0.196406
3853,Austin Charles,Hedges,343,1812,0.189294,0.199173
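To see why the empirical Bayes estimates differ from the raw averages, it helps to compare two players with the same raw average but very different numbers of at bats. The sketch below uses rounded versions of the $\alpha_0$ and $\beta_0$ printed above; the two players and the helper name `eb_estimate` are hypothetical.

```python
alpha0, beta0 = 78.5, 225.7  # rounded prior parameters from the output above

def eb_estimate(hits, at_bats):
    """Empirical Bayes estimate: mean of Beta(alpha0 + hits, beta0 + at_bats - hits)."""
    return (hits + alpha0) / (at_bats + alpha0 + beta0)

# Two hypothetical players with the same raw average (0.400)
for hits, at_bats in [(4, 10), (400, 1000)]:
    raw = hits / at_bats
    print(f"{hits}/{at_bats}: raw = {raw:.3f}, empirical Bayes = {eb_estimate(hits, at_bats):.3f}")
```

The player with only 10 at bats is pulled most of the way back toward the overall mean of about 0.258, while the player with 1000 at bats keeps an estimate much closer to the raw average. The same effect explains why every `eb` value in the tables above lies between the player's `Avg` and the overall mean.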


## Submission

Download the completed **Week4_Statistical_Inference_With_Python_Homework.ipynb** notebook from Google Colab and commit it to the personal GitHub repo you shared with the faculty.