# Python tutorial for statistics

This workshop covers fundamental Python commands for people in the liberal arts & sciences who either know R or have no coding experience. It focuses on tools that are useful for analytics rather than on program setup or higher-level coding. For those who have used R, it highlights the changes in syntax and the similarities between basic functions. For those who have not, it starts from "hello world" and walks through every step, so it is easy to follow even if you installed Anaconda 5 minutes ago.

Thanks to Tyler Richards, who made most of this workshop in Fall 2018. Sections based on his work are marked (TR).

## hello world

In [2]:
# printing out strings 

In [3]:
# creating a list 
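
A minimal sketch of what these first two cells might contain. The variable names `greeting` and `numbers` are placeholders chosen for illustration, not part of the original workshop.

```python
# Printing a string (the classic first program).
greeting = "hello world"
print(greeting)

# Creating a list: an ordered, mutable collection.
numbers = [1, 2, 3, 4]
numbers.append(5)      # lists grow in place
print(numbers)         # [1, 2, 3, 4, 5]

# Note for R users: Python indexes from 0, not 1.
first = numbers[0]     # first element
last = numbers[-1]     # negative indices count from the end
print(first, last)     # 1 5
```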

In [4]:
import numpy as np

# creating an array -- numpy intro 

# adding arrays

# multiplying arrays (broadcasting)

# multiplying arrays (traditional)
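
One way the stubs in this cell could be filled in. It assumes "broadcasting" means element-wise multiplication (including stretching a scalar or a column across an array) and "traditional" means the dot product from linear algebra. The arrays `a` and `b` are made up for illustration.

```python
import numpy as np

# Creating arrays from Python lists.
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Adding arrays works element by element.
added = a + b                  # [11, 22, 33]

# Multiplying with broadcasting: the scalar 2 is stretched to match a.
scaled = a * 2                 # [2, 4, 6]
elementwise = a * b            # [10, 40, 90]

# Broadcasting a (3, 1) column against a (3,) row gives a 3x3 grid.
grid = a.reshape(3, 1) * b

# "Traditional" multiplication: the dot product, 1*10 + 2*20 + 3*30.
dot = a @ b                    # 140
```

R users may notice that `a * b` behaves like R's `*`, while `a @ b` plays the role of `%*%`.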

## theoretical distributions (TR)
Theoretical statistical distributions rest on the idea that real-world data tend to follow a small number of common patterns. Once we can roughly match real-world data to one of these distributions, we can use its known properties to analyze the data and make predictions.

In [None]:
from scipy.stats import norm, poisson, bernoulli, binom

#### normal distribution
The normal distribution is the most widely used distribution in statistics and appears in many natural phenomena. It describes data that cluster around a mean and deviate symmetrically above and below it. It is also commonly called the bell curve.

In [None]:
# rvs: random variates 
# mean, standard deviation, size
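
A short sketch of drawing random variates with `norm.rvs`. The mean of 0, standard deviation of 1, and sample size are assumed values, and `random_state` is set only so the example is reproducible.

```python
from scipy.stats import norm

# loc is the mean, scale is the standard deviation, size is the number of draws.
sample = norm.rvs(loc=0, scale=1, size=10_000, random_state=42)

print(sample.mean())   # close to 0
print(sample.std())    # close to 1
```

In R terms, this is roughly what `rnorm(10000, mean = 0, sd = 1)` does.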

In [182]:
import seaborn as sns

#### poisson distribution
A Poisson variable counts the number of events that occur in a fixed interval of time.  
Example: the number of soldiers killed by horse kicks each year in the Prussian army.  
http://blog.minitab.com/blog/quality-data-analysis-and-statistics/no-horsing-around-with-the-poisson-distribution-troops


In [None]:
# using np.random now 

In [None]:
# using .rvs (mu, size)
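
Both approaches can be sketched side by side. The rate of 0.7 events per interval is an assumed value for illustration. The NumPy half uses the newer `default_rng` generator rather than the legacy `np.random.poisson` function; both take the same `lam` and `size` arguments.

```python
import numpy as np
from scipy.stats import poisson

rate = 0.7   # assumed average number of events per interval

# NumPy: lam is the average rate, size is the number of draws.
rng = np.random.default_rng(0)
np_sample = rng.poisson(lam=rate, size=10_000)

# SciPy: mu is the average rate.
scipy_sample = poisson.rvs(mu=rate, size=10_000, random_state=0)

print(np_sample.mean(), scipy_sample.mean())   # both close to 0.7
```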

#### bernoulli and binomial distributions
The Bernoulli distribution is a special case of the binomial distribution (a binomial with a single trial), so we'll start there and move on to the binomial afterwards.  
A Bernoulli variable has a probability (p) of being a success (a value of 1) in one try, and is 0 otherwise. If p is 0.5, it is just like flipping a coin!

In [None]:
# bernoulli rvs (p, size)

In [None]:
# binomial with np.random (n, p, size)
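
A sketch of both draws, assuming a fair coin (p = 0.5) and 10 flips per binomial experiment. Note that NumPy's binomial takes the number of trials `n` before the probability `p`.

```python
import numpy as np
from scipy.stats import bernoulli

p = 0.5   # assumed probability of success (a fair coin)

# Bernoulli: each draw is a single trial, so every value is 0 or 1.
flips = bernoulli.rvs(p, size=1_000, random_state=1)

# Binomial: each draw counts the successes in n independent Bernoulli trials.
rng = np.random.default_rng(1)
n = 10
heads_per_experiment = rng.binomial(n=n, p=p, size=1_000)

print(flips.mean())                  # close to p
print(heads_per_experiment.mean())   # close to n * p = 5
```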

## central limit theorem

If you repeatedly draw sufficiently large random samples from a distribution of almost any shape (it needs a finite variance), the averages (and sums) of those samples will be approximately normally distributed.

In [432]:
import numpy as np
import pandas as pd

In [None]:

# number_tries = n

# averages_list = []

# for i in range(0, number_tries):

#     averages_list.append(random dist mean)
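
The commented outline above could be filled in along these lines. This sketch assumes each sample is 100 fair coin flips (values of 0 or 1); the original workshop may have sampled from a different distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
number_tries = 5_000   # how many sample averages to collect
sample_size = 100      # assumed number of coin flips per sample

averages_list = []
for i in range(number_tries):
    # One sample of fair coin flips, a decidedly non-normal distribution.
    flips = rng.binomial(n=1, p=0.5, size=sample_size)
    averages_list.append(flips.mean())

averages = np.array(averages_list)
print(averages.mean())   # close to 0.5
print(averages.std())    # close to sqrt(0.25 / 100) = 0.05
```

Plotting a histogram of `averages` should show the familiar bell shape, even though each individual flip is only ever 0 or 1.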

The simulation behaves as the CLT predicts. When can we not rely on the CLT?
1. If the sampling is not random.
2. If the values are not independent. In this case, if one die roll depended on another die roll, we wouldn't get a normal distribution.

### Who cares if we can find a normal distribution?

In [None]:
# making a normal distribution object

# norm(mean of averages, sd of averages)

#random = norm object (size)
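
A sketch of building a normal distribution object. The mean of 0.5 and standard deviation of 0.05 are assumed stand-ins for the mean and standard deviation of the simulated averages.

```python
from scipy.stats import norm

mean_of_averages = 0.5   # assumed; in practice, the mean of the simulated averages
sd_of_averages = 0.05    # assumed; in practice, their standard deviation

# Calling norm(...) returns a "frozen" distribution that remembers its parameters.
clt_dist = norm(loc=mean_of_averages, scale=sd_of_averages)

# The frozen object has the same methods as norm, without repeating loc and scale.
draws = clt_dist.rvs(size=1_000, random_state=3)
print(draws.mean())   # close to 0.5
```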

What is the probability that we get an average under 0.55?

In [None]:
# cdf method on object

What about over 0.55?
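
Both questions can be answered from the distribution object. This sketch again assumes a mean of 0.5 and a standard deviation of 0.05, so 0.55 sits exactly one standard deviation above the mean.

```python
from scipy.stats import norm

clt_dist = norm(loc=0.5, scale=0.05)   # assumed mean and standard deviation

# cdf(x) is the probability of a value at or below x.
p_under = clt_dist.cdf(0.55)

# sf(x), the survival function, is 1 - cdf(x): the probability of a value above x.
p_over = clt_dist.sf(0.55)

print(round(p_under, 4))   # 0.8413
print(round(p_over, 4))    # 0.1587
```

R users can think of `cdf` as `pnorm`, and of `sf` as `pnorm(..., lower.tail = FALSE)`.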

## A Quick Intro to A/B testing

If you're interested in a more thorough run-through of A/B testing, see the practical data science workshop on our GitHub ( https://github.com/dsiufl/Python-Workshops ). 

One major use of the work above is A/B testing, a method from causal inference. Given two options, A/B testing tries to determine which one performs better.  
For example, given the two website layouts below, how would you test which one is superior?

![image.png](attachment:image.png)

A/B testing has a few steps:  
1. Randomly separate the samples into two groups
2. Create a null hypothesis
3. Measure some outcome from the two groups (in this case, how long the user stayed on the page)
4. Perform a t-test on the two groups to see whether their means differ  

In our case, we'll generate our own data to see how to interpret an A/B test after randomization.  
  
Our null hypothesis is a way of saying 'let's not believe anything until we have evidence for it.' In this case, we assume that the layouts are exactly the same, and that people spend the same amount of time on each.  
Now let's generate random data for the two layouts.

In [None]:
# make another list from norm object
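
A sketch of generating time-on-page data under the null hypothesis, where both layouts draw from the same normal distribution. The mean of 60 seconds, standard deviation of 10 seconds, and group size of 500 are invented for illustration.

```python
from scipy.stats import norm

# Assumed time on page, in seconds, if the layouts are truly identical.
time_on_page = norm(loc=60, scale=10)

# Different seeds give two independent groups from the same distribution.
layout_a = time_on_page.rvs(size=500, random_state=10)
layout_b = time_on_page.rvs(size=500, random_state=11)

# The group means differ slightly, purely through sampling noise.
print(layout_a.mean(), layout_b.mean())
```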

After we generate these data, we can run a t-test, which gives us a p-value. A p-value is the probability of finding a result at least as extreme as the one we observed if the null hypothesis were true. A small p-value gives us stronger evidence to reject the null hypothesis. A large p-value means we lack the evidence to reject it, which is not the same as showing that the null hypothesis is true.

In [1]:
from scipy.stats import ttest_ind
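
A sketch of running and reading the test. Here the data are generated with a real difference between the layouts (assumed means of 60 and 66 seconds), so the test should detect it. The 0.05 cutoff is a common convention, not something fixed by the workshop.

```python
from scipy.stats import norm, ttest_ind

# Assumed data: layout B truly keeps users about 6 seconds longer.
layout_a = norm(loc=60, scale=10).rvs(size=500, random_state=4)
layout_b = norm(loc=66, scale=10).rvs(size=500, random_state=5)

# Two-sample t-test comparing the means of two independent groups.
result = ttest_ind(layout_a, layout_b)
print(result.statistic, result.pvalue)

alpha = 0.05   # conventional significance threshold (an assumption)
if result.pvalue < alpha:
    print("Reject the null hypothesis: the layouts appear to differ.")
else:
    print("Not enough evidence to reject the null hypothesis.")
```

The R equivalent is `t.test(layout_a, layout_b, var.equal = TRUE)`, since `ttest_ind` assumes equal variances by default.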