## Statistics != Data Science

    Statistics is a tool that data scientists use.
    
    We are working with experimental probability (estimated by simulation), not theoretical probability.
    
    You HAVE to be able to imagine the dataset you need to solve the problem
    before you figure out how to code it in Python.


In [501]:
# interactive matplotlib plots
%matplotlib notebook
import numpy as np
import pandas as pd

#import viz # curriculum example visualizations

np.random.seed(123)

## Generating Random Numbers with Numpy


The numpy.random module provides a number of functions for generating random numbers.

    np.random.choice: selects random elements from a list or array

    np.random.uniform: generates floats between a given lower bound (inclusive) and upper bound (exclusive)

    np.random.random: generates floats between 0 (inclusive) and 1 (exclusive)

    np.random.randn: generates numbers from the standard normal distribution

    np.random.normal: generates numbers from a normal distribution with a specified mean and standard deviation

In [53]:
# choice(array, size=None, replace=True, p=None)
np.random.choice([1, 2, 3, 4], (3, 10))

array([[3, 2, 3, 3, 1, 3, 3, 2, 4, 3],
       [4, 2, 3, 2, 1, 2, 3, 4, 2, 1],
       [3, 1, 4, 2, 4, 3, 2, 1, 1, 1]])

In [54]:
# uniform(low, high, size): 4x4 array of floats between 10 and 20
np.random.uniform(10, 20, (4, 4))

array([[17.37995406, 11.8249173 , 11.75451756, 15.31551374],
       [15.31827587, 16.34400959, 18.49431794, 17.24455325],
       [16.11023511, 17.22443383, 13.22958914, 13.61788656],
       [12.28263231, 12.93714046, 16.30976124, 10.9210494 ]])

In [55]:
# random(size): 4x4 array of floats between 0 and 1
np.random.random((4,4))

array([[0.43370117, 0.43086276, 0.4936851 , 0.42583029],
       [0.31226122, 0.42635131, 0.89338916, 0.94416002],
       [0.50183668, 0.62395295, 0.1156184 , 0.31728548],
       [0.41482621, 0.86630916, 0.25045537, 0.48303426]])

In [56]:
# numbers from standard normal dist
np.random.randn(10,3)

array([[ 0.01353855,  0.33737412, -0.92662298],
       [ 0.27574741,  0.37085233,  1.17430715],
       [-2.03229248,  0.58250829, -0.7211758 ],
       [-1.7024236 ,  0.76764147,  0.37067905],
       [ 2.07545935,  0.12775769,  1.79988945],
       [ 1.23836635,  0.91375695,  0.46626803],
       [ 0.12503966, -0.19765092, -0.72803492],
       [-0.61031342, -0.87304897,  0.15646903],
       [ 0.28655207,  0.8677178 ,  1.26751791],
       [ 2.10783998, -0.23608945, -0.23208183]])

In [57]:
# normal(mean, std, size)
np.random.normal(10, 3, (2, 2))

array([[ 6.76558177, 13.48955361],
       [11.56747022, 15.53309564]])

### Example Problems

#### Carnival Dice Rolls
    You are at a carnival and come across a person in a booth offering you a game of "chance" (as people in booths at carnivals tend to do).

    You pay 5 dollars and roll 3 dice. If the sum of the dice rolls is greater than 12, you get 15 dollars. If it's less than or equal to 12, you get nothing.

    Assuming the dice are fair, should you play this game? How would this change if the winning condition was a sum greater than or equal to 12?

#### The point is to write code that simulates this experiment

In [58]:
n_trials = nrows = 4
n_dice_rolled = ncols = 3
# equivalent: np.random.choice([1, 2, 3, 4, 5, 6], n_trials * n_dice_rolled).reshape(nrows, ncols)
rolls = np.random.choice([1, 2, 3, 4, 5, 6], (n_trials, n_dice_rolled))
rolls


array([[4, 5, 6],
       [5, 2, 5],
       [5, 4, 5],
       [4, 1, 6]])

In [59]:
# default is to sum the whole thing
# axis=0 or axis=1
rolls.sum(axis=1)

array([15, 12, 14, 11])

In [60]:
# creates bool for win or loss
rolls.sum(axis=1) > 12

array([ True, False,  True, False])

In [61]:
# No .astype(int) is needed: .mean() counts True as 1 and False as 0
# probability of winning = fraction of trials that are True
p_win = (rolls.sum(axis=1) > 12).mean()

In [62]:
amount_we_win = 15
cost = 5

expected_winnings = p_win * amount_we_win

profit = expected_winnings - cost
profit

2.5
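Four trials are too few to settle whether the game is worth playing. The following is a minimal sketch, assuming 100,000 simulated games (an arbitrary choice) and cross-checking against an exhaustive enumeration of all 216 outcomes. `expected_profit` is a helper introduced here for illustration.

```python
import itertools
import numpy as np

np.random.seed(123)

def expected_profit(p_win, payout=15, cost=5):
    """Average profit per game for a given win probability (illustrative helper)."""
    return p_win * payout - cost

# simulate many games of 3 dice each
n_trials = 100_000  # assumed sample size
sums = np.random.choice([1, 2, 3, 4, 5, 6], (n_trials, 3)).sum(axis=1)
sim_gt = (sums > 12).mean()   # win condition: sum greater than 12
sim_ge = (sums >= 12).mean()  # win condition: sum greater than or equal to 12

# exact probabilities from all 6**3 = 216 equally likely outcomes
all_sums = np.array([sum(r) for r in itertools.product(range(1, 7), repeat=3)])
exact_gt = (all_sums > 12).mean()   # 56 / 216
exact_ge = (all_sums >= 12).mean()  # 81 / 216

print("sum >  12:", sim_gt, exact_gt, expected_profit(exact_gt))
print("sum >= 12:", sim_ge, exact_ge, expected_profit(exact_ge))
```

Under these assumptions the exact win probability is 56/216 (about 0.26) for a sum greater than 12, an expected loss of about 1.11 dollars per game. With a sum of at least 12 it is 81/216 = 0.375, an expected gain of about 0.63 dollars per game.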

### No Rest or Relaxation


In [63]:
p_nap = .3
naps = np.random.random((10, 2)) <= p_nap
naps

array([[False,  True],
       [ True, False],
       [ True, False],
       [False, False],
       [ True, False],
       [ True, False],
       [False,  True],
       [ True, False],
       [False, False],
       [False, False]])

In [64]:
naps.sum(axis=1)

array([1, 1, 1, 0, 1, 1, 1, 1, 0, 0])

In [65]:
(naps.sum(axis=1) >= 1)

array([ True,  True,  True, False,  True,  True,  True,  True, False,
       False])

In [66]:
# probability of at least one nap in a weekend
(naps.sum(axis=1) >= 1).mean()

0.7

In [67]:
# probability of zero naps taken in a weekend
(naps.sum(axis=1) == 0).mean()

0.3
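With only 10 simulated weekends the estimates above are noisy. Here is a sketch of the same simulation at a larger scale, assuming (as the code above suggests) an independent 0.3 chance of a nap on each of two weekend days. The sample size of 100,000 is an arbitrary choice.

```python
import numpy as np

np.random.seed(123)

p_nap = 0.3
n_weekends = 100_000  # assumed sample size

# each row is a weekend, each column a day; True means a nap was taken
naps = np.random.random((n_weekends, 2)) <= p_nap
naps_per_weekend = naps.sum(axis=1)

sim_at_least_one = (naps_per_weekend >= 1).mean()
sim_none = (naps_per_weekend == 0).mean()

# theoretical values under the independence assumption
exact_none = (1 - p_nap) ** 2        # 0.7 * 0.7
exact_at_least_one = 1 - exact_none

print(sim_at_least_one, exact_at_least_one)
print(sim_none, exact_none)
```

If the independence assumption holds, the simulated values should settle near 0.51 and 0.49 rather than the 0.7 and 0.3 seen with 10 trials.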

### One With Dataframes

    What is the probability of getting at least one 3 in 3 dice rolls?

    To simulate this, we'll use a similar strategy to how we modeled the dice rolls in the previous example, but this time, we'll store the results in a pandas dataframe so that we can apply a lambda function that will check to see if one of the rolls was a 3.

In [68]:
n_trials = 5
n_dice_rolled = 3

rolls = np.random.choice([1, 2, 3, 4, 5, 6], (n_trials, n_dice_rolled))
rolls

array([[1, 2, 4],
       [2, 1, 6],
       [2, 3, 6],
       [3, 3, 6],
       [2, 4, 3]])

In [69]:
df = pd.DataFrame(rolls)
df

   0  1  2
0  1  2  4
1  2  1  6
2  2  3  6
3  3  3  6
4  2  4  3


In [70]:
# We need to apply a function row by row which returns
# True or False for 3 in the row
# each row is a pandas Series; row.values is its underlying NumPy array
df.apply(lambda row: 3 in row.values, axis=1)

0    False
1    False
2     True
3     True
4     True
dtype: bool

In [71]:
# converts above to probability
df.apply(lambda row: 3 in row.values, axis=1).mean()

0.6

#### Explanation of lambda parts above

In [72]:
s = pd.Series([3, 2, 2])
s

0    3
1    2
2    2
dtype: int64

In [73]:
# converts this to a Numpy array
s.values

array([3, 2, 2])

In [74]:
3 in s.values

True
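The row-wise `apply` works, but the same check can be done directly on the NumPy array. This is a sketch of that alternative, not the approach used above. It assumes 100,000 trials (arbitrary) and compares the estimate with the theoretical value 1 - (5/6)^3.

```python
import numpy as np

np.random.seed(123)

n_trials = 100_000  # assumed sample size
n_dice_rolled = 3

rolls = np.random.choice([1, 2, 3, 4, 5, 6], (n_trials, n_dice_rolled))

# (rolls == 3) is a boolean array; .any(axis=1) is True for rows containing a 3
has_three = (rolls == 3).any(axis=1)
sim_p_three = has_three.mean()

# theoretical: 1 - P(no 3 on any of the three dice)
exact_p_three = 1 - (5 / 6) ** n_dice_rolled

print(sim_p_three, exact_p_three)
```

The theoretical value is about 0.42, so the 0.6 from five trials above mostly reflects the small sample.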

## Exercises

1. How likely is it that you roll doubles when rolling two dice?



In [76]:
n_trials = nrows = 10
n_dice = ncols = 2

rolls = np.random.choice([1, 2, 3, 4, 5, 6], n_trials * n_dice).reshape(nrows, ncols)
rolls

array([[6, 6],
       [1, 1],
       [4, 2],
       [1, 3],
       [2, 6],
       [4, 1],
       [5, 1],
       [5, 2],
       [4, 6],
       [2, 1]])

In [91]:
# function to create a bool array that is True wherever doubles were rolled
def doubles_trial(rolls):
    return np.array([roll[0] == roll[1] for roll in rolls])

doubles_bool = doubles_trial(rolls)
doubles_bool

array([ True,  True, False, False, False, False, False, False, False,
       False])

In [92]:
doubles_rate = doubles_bool.mean()
doubles_rate

0.2
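Ten trials give a rough estimate at best. This sketch repeats the experiment with a column comparison instead of a Python loop, assuming 100,000 trials (arbitrary), and compares the result with the theoretical 6/36 = 1/6.

```python
import numpy as np

np.random.seed(123)

n_trials = 100_000  # assumed sample size
rolls = np.random.choice([1, 2, 3, 4, 5, 6], (n_trials, 2))

# compare the first die with the second die in every row at once
doubles = rolls[:, 0] == rolls[:, 1]
sim_doubles_rate = doubles.mean()

# 6 of the 36 equally likely outcomes are doubles
exact_doubles_rate = 6 / 36

print(sim_doubles_rate, exact_doubles_rate)
```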

2. If you flip 8 coins, what is the probability of getting exactly 3 heads? What is the probability of getting more than 3 heads?

 

In [505]:
n_trials = nrows = 10000
n_coins = ncols = 8

flips = np.random.choice([0, 1], (n_trials, n_coins))
flips

array([[0, 1, 0, ..., 0, 0, 1],
       [1, 0, 1, ..., 1, 0, 1],
       [0, 1, 1, ..., 0, 1, 1],
       ...,
       [1, 1, 0, ..., 0, 1, 0],
       [1, 0, 0, ..., 1, 1, 0],
       [1, 1, 1, ..., 1, 1, 1]])

In [506]:
# create a bool DataFrame: True marks a heads flip
heads_bool = flips == 1
df_heads = pd.DataFrame(heads_bool)
df_heads

       0      1      2      3      4      5      6      7
0  False   True  False  False  False  False  False   True
1   True  False   True   True  False   True  False   True
2  False   True   True  False  False  False   True   True
3   True  False   True  False  False  False  False   True
4   True   True  False  False   True  False  False   True
5  False   True  False   True   True   True  False  False
6  False  False   True   True  False  False   True  False
7   True  False  False   True  False   True  False   True
8  False  False  False  False   True  False  False   True
9   True   True   True  False  False  False  False   True


In [507]:
# probability of getting exactly 3 heads tossing 8 coins
exactly_three = ((df_heads.sum(axis=1)) == 3).mean()
exactly_three

0.2242

In [508]:
# probability of getting more than 3 heads tossing 8 coins
more_than_three = ((df_heads.sum(axis=1)) > 3).mean()
more_than_three

0.6314
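The simulated values can be checked against the binomial formula. This is a small sketch using `math.comb` (Python 3.8+) and assuming fair, independent coins.

```python
from math import comb

n_coins = 8
n_outcomes = 2 ** n_coins  # 256 equally likely head/tail sequences

# P(exactly 3 heads) = C(8, 3) / 2**8
exact_exactly_three = comb(n_coins, 3) / n_outcomes

# P(more than 3 heads) = sum of C(8, k) / 2**8 for k = 4..8
exact_more_than_three = sum(comb(n_coins, k) for k in range(4, n_coins + 1)) / n_outcomes

print(exact_exactly_three)    # 0.21875
print(exact_more_than_three)  # 0.63671875
```

Both simulated results above (0.2242 and 0.6314) fall within about 0.006 of these values.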

3. There are approximately 3 web development cohorts for every 1 data science cohort at Codeup. Assuming that Codeup randomly selects an alum to put on each billboard, what are the odds that the two billboards I drive past both have data science students on them?

    plan:
    
        -generate a random pandas dataframe from a list of 1 (True data sci alum on billboard) 
            and 0 (False data sci alum on billboard)

        -convert integers to bools

        -sum rows and compare to 2

        -take the mean to find probability

In [509]:
n_sims = 100000
n_billboards = 2

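# NOTE: with no p= argument, choice picks 1 and 0 with equal probability,
# so the 3:1 cohort ratio is not modeled here
drive_by = np.random.choice([1, 0], (n_sims, n_billboards))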
drive_by

array([[0, 0],
       [0, 1],
       [1, 0],
       ...,
       [1, 1],
       [1, 1],
       [1, 0]])

In [510]:
# convert array to DataFrame
ds_df = pd.DataFrame(drive_by)
ds_df

   0  1
0  0  0
1  0  1
2  1  0
3  1  0
4  0  1
5  1  1
6  1  1
7  0  0
8  0  0
9  0  1


In [513]:
# find rows that sum to 2
ds_df.sum(axis=1) == 2

0        False
1        False
2        False
3        False
4        False
5         True
6         True
7        False
8        False
9        False
10       False
11        True
12       False
13       False
14        True
15        True
16       False
17       False
18       False
19       False
20       False
21       False
22        True
23       False
24       False
25        True
26       False
27        True
28       False
29       False
         ...  
99970     True
99971    False
99972    False
99973    False
99974    False
99975    False
99976    False
99977     True
99978    False
99979    False
99980    False
99981    False
99982    False
99983     True
99984    False
99985    False
99986     True
99987    False
99988    False
99989    False
99990    False
99991     True
99992    False
99993    False
99994    False
99995    False
99996     True
99997     True
99998     True
99999    False
Length: 100000, dtype: bool

In [514]:
# fraction of the n_sims simulations in which both billboards show data science alums
prob_ds_df = (ds_df.sum(axis=1) == 2).mean()
prob_ds_df

0.24926
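The result of about 0.25 corresponds to a 50/50 chance per billboard. Below is a sketch of the same simulation with the 3:1 cohort ratio encoded through the `p` argument of `np.random.choice`. It assumes the ratio translates into a 0.25 chance that any one billboard shows a data science alum.

```python
import numpy as np

np.random.seed(123)

p_ds = 0.25  # assumption: 1 data science cohort for every 3 web development cohorts
n_sims = 100_000
n_billboards = 2

# 1 = data science alum on the billboard, 0 = web development alum
drive_by = np.random.choice([1, 0], (n_sims, n_billboards), p=[p_ds, 1 - p_ds])

# both billboards show data science alums when the row sums to 2
sim_both_ds = (drive_by.sum(axis=1) == n_billboards).mean()

# theoretical value for two independent billboards
exact_both_ds = p_ds ** n_billboards  # 0.0625

print(sim_both_ds, exact_both_ds)
```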

4. Codeup students buy, on average, 3 poptart packages (± 1.5) a day from the snack vending machine. If on Monday the machine is restocked with 17 poptart packages, how likely is it that I will be able to buy some poptarts on Friday afternoon?

Plan:
    
    sims = rows = n
    days = cols = 5

    np.random.normal(3, 1.5, (sims, days))

    P(sold out by Friday) = P(sold_tarts >= 17)

    P(I can buy poptarts) = 1 - P(sold out by Friday)

In [517]:
sims = rows = 10000
days = cols = 5
# packages sold each day for five days, rounded to whole packages
pop_sales = np.round(np.random.normal(3, 1.5, (sims, days)))

# put it all together: probability that all 17 packages are sold by Friday
pop_prob = (
 (pd.DataFrame(pop_sales)
  .sum(axis=1) >= 17)
  .mean()
)
pop_prob

0.3271

In [None]:
# for a pandas Series
# s.where(cond, other) keeps values where cond is True and uses `other` where cond is False

In [None]:
# Look into this for later
# np.where(cond, x, y) picks x where cond is True and y where cond is False
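A tiny illustration of the two `where` variants noted above. The Series values and the "big"/"small" labels are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, 10])  # made-up example values

# Series.where keeps values where the condition is True
# and substitutes the second argument where it is False
kept = s.where(s > 3, 0)

# np.where picks the first value where the condition is True,
# otherwise the second value
labels = np.where(s > 3, "big", "small")

print(kept.tolist())    # [0, 5, 10]
print(labels.tolist())  # ['small', 'big', 'big']
```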

In [256]:
# Breakdown

In [518]:
# convert the array into a pandas DataFrame
pop_sales = pd.DataFrame(pop_sales)
pop_sales

     0    1    2    3    4
0  3.0  2.0  1.0  3.0  1.0
1  5.0  2.0  4.0  3.0  6.0
2  3.0  2.0  3.0  1.0  4.0
3  4.0  4.0  2.0  1.0  2.0
4  1.0  5.0  2.0  5.0  5.0
5 -0.0  3.0  1.0  4.0  3.0
6  1.0  3.0  3.0 -0.0  5.0
7  1.0  2.0  2.0  4.0  4.0
8  3.0  5.0  4.0  3.0  1.0
9  2.0  4.0  3.0  1.0  4.0


In [169]:
# True marks a week in which all 17 packages were sold (nothing left for Zach)
pop_sales.sum(axis=1) >= 17

array([False,  True, False, False, False, False,  True, False, False,
       False, False, False,  True,  True,  True,  True, False,  True,
       False, False, False, False, False, False, False, False, False,
        True, False, False,  True, False, False,  True,  True, False,
        True,  True, False,  True, False,  True,  True, False,  True,
        True, False, False, False,  True,  True, False,  True, False,
       False, False, False,  True, False,  True, False,  True, False,
        True, False,  True, False, False,  True, False, False, False,
       False,  True, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False])

In [519]:
# probability the machine is sold out by Friday;
# the chance Zach can still buy a poptart is 1 - 0.3271 = 0.6729
(pop_sales.sum(axis=1) >= 17).mean()

0.3271

5. Compare Heights

    Men have an average height of 178 cm and standard deviation of 8cm.
    
    Women have a mean of 170, sd = 6cm.
    
    If a man and woman are chosen at random, P(woman taller than man)?
    
    Plan:
    
    people = cols = 2
    sims = rows = n
    
    use np.random.normal(mean, std, size) in a function
    
    generate random woman height from mean 170 and std 6
    
    generate random man height from mean 178 and std 8
    
    

In [207]:
# function to find random man and woman height
def rand_height(mean, std, sims):
    return np.random.normal(mean, std, sims)

def f_beats_m(sims):
    return (rand_height(170, 6, sims) > rand_height(178, 8, sims)).mean()

f_beats_m(1000000)

0.212304
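The simulated value can be checked analytically. This sketch assumes the two heights are independent and normally distributed. Their difference (woman minus man) is then normal with mean 170 - 178 and standard deviation sqrt(6^2 + 8^2), and the normal CDF can be written with `math.erf`. `normal_cdf` is a helper defined here for illustration.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mean_f, std_f = 170, 6
mean_m, std_m = 178, 8

# difference D = woman - man is normal under the independence assumption
mean_diff = mean_f - mean_m             # -8
std_diff = sqrt(std_f ** 2 + std_m ** 2)  # 10.0

# P(D > 0) = 1 - Phi((0 - mean_diff) / std_diff) = 1 - Phi(0.8)
p_f_taller = 1 - normal_cdf((0 - mean_diff) / std_diff)

print(round(p_f_taller, 4))
```

The result is close to 0.212, in line with the simulated 0.212304.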

In [205]:
# steps to create the second function
# call random woman and man heights 
# p_fem_beats_male = (rand_height(170, 6, 1000) > rand_height(178, 8, 1000)).mean()
# p_fem_beats_male

6. When installing anaconda on a student's computer, there's a 1 in 250 chance that the download is corrupted and the installation fails. What are the odds that after having 50 students download anaconda, no one has an installation issue? 100 students?

Plan:

    generate a random integer from 1 to 250 using np.random.randint(1, 251, sims) (the upper bound is exclusive)

    create a bool where 1 == True which indicates an installation issue

    check for a 1 in 50

    check for a 1 in 100

In [524]:

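# Zach's answer: simulate n groups of 50 students; a download is corrupted
# when its random draw falls below p_corrupt
p_corrupt = 1/250
n = 10000
# 1 - P(at least one corrupted download in a group) = P(no one has an issue)
1 - ((np.random.random((n, 50)) < p_corrupt).sum(axis=1) > 0).mean()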

0.8177

In [None]:
# Below is mine, and it was totally wrong.

In [247]:
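# generate an array from np.random.randint(); its size is the number of students
# NOTE: the mean() here is the fraction of students whose draw equals 1,
# not the probability that a whole group installs without an issue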

sims = 50
prob_of_issue = (np.random.randint(1, 251, (sims)) == 1).mean()
prob_of_issue

0.04

What is the probability that we observe an installation issue within the first 150 students that download anaconda?

In [251]:
sims = 150
prob_of_issue = (np.random.randint(1, 251, (sims)) == 1).mean()
prob_of_issue

0.006666666666666667

How likely is it that 450 students all download anaconda without an issue?

In [252]:
sims = 450
prob_of_issue = (np.random.randint(1, 251, (sims)) == 1).mean()
prob_of_issue

0.011111111111111112
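The three cells above report the fraction of individual draws equal to 1 rather than a probability about a whole group of students. The sketch below applies the group-level approach from the 50-student answer to 50, 100, 150 and 450 students and compares each estimate with the theoretical (249/250)^n. `p_no_issue` is a helper name introduced here, and 10,000 simulations is an arbitrary choice.

```python
import numpy as np

np.random.seed(123)

p_corrupt = 1 / 250
n_sims = 10_000  # assumed number of simulated groups

def p_no_issue(n_students):
    """Simulated probability that none of n_students has a corrupted download."""
    # each row is one group of students; True marks a corrupted download
    corrupted = np.random.random((n_sims, n_students)) < p_corrupt
    return (corrupted.sum(axis=1) == 0).mean()

for n_students in (50, 100, 150, 450):
    simulated = p_no_issue(n_students)
    exact = (1 - p_corrupt) ** n_students
    print(n_students,
          "no issue:", simulated, round(exact, 4),
          "at least one issue:", round(1 - simulated, 4), round(1 - exact, 4))
```

The "at least one issue" column answers the 150-student question, and the "no issue" column answers the 50, 100 and 450-student questions.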

7. There's a 70% chance on any given day that there will be at least one food truck at Travis Park. However, you haven't seen a food truck there in 3 days. How unlikely is this?

Plan:

sims = rows = 10
days = cols = 3
P(no food truck for 1 day) = .3
P(no food truck for 3 days in a row) = .027

In [456]:
# create data set

ndays = ncols = 3
sims = nrows = 10000

data = np.random.random((sims, ndays))
data

array([[0.06057774, 0.37024424, 0.87511035],
       [0.246096  , 0.08719785, 0.62287342],
       [0.80524174, 0.19654933, 0.98069304],
       ...,
       [0.59919814, 0.70570969, 0.85072467],
       [0.70494551, 0.71846437, 0.57927791],
       [0.99389087, 0.53038173, 0.56241645]])

In [464]:
p_no_truck = .3
truck = data > p_no_truck
truck

array([[False,  True,  True],
       [False, False,  True],
       [ True, False,  True],
       ...,
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

In [467]:
# the probability of no food truck for 3 days in a row
prob_no_truck = (truck.sum(axis=1) == 0).mean()
prob_no_truck

0.0294

How likely is it that a food truck will show up sometime this week?



In [468]:
sims = rows = 10000
ndays = cols = 7

data = np.random.random((sims, ndays))
data

array([[0.06656096, 0.06403803, 0.48380594, ..., 0.90727134, 0.6734589 ,
        0.66520871],
       [0.06736501, 0.14697547, 0.71935343, ..., 0.39344575, 0.38236498,
        0.91593727],
       [0.01042886, 0.14139416, 0.32475743, ..., 0.1569147 , 0.20207263,
        0.8466322 ],
       ...,
       [0.67441744, 0.06571309, 0.57403344, ..., 0.94121842, 0.08481748,
        0.13670107],
       [0.79025064, 0.21403208, 0.18184221, ..., 0.47364641, 0.27057171,
        0.71752207],
       [0.5484722 , 0.55364459, 0.61210809, ..., 0.80910569, 0.50840023,
        0.22385916]])

In [470]:
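# NOTE: `truck` still holds the 3-day simulation from above (it was not rebuilt
# from the new 7-day `data`), so this value is 1 - P(no truck in 3 days),
# not a 7-day probability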
prob_truck = (truck.sum(axis=1) >= 1).mean()
prob_truck

0.9706
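The value 0.9706 equals 1 - 0.0294, the complement of the 3-day result, because `truck` was not recomputed from the 7-column `data`. Here is a sketch of the 7-day version with the boolean array rebuilt, assuming independent days and the same 0.3 chance of no truck on any day.

```python
import numpy as np

np.random.seed(123)

p_no_truck = 0.3
n_sims = 10_000
n_days = 7

data = np.random.random((n_sims, n_days))

# rebuild the boolean array from the 7-day data: True means a truck showed up
truck = data > p_no_truck

# at least one truck during the week
sim_truck_this_week = (truck.sum(axis=1) >= 1).mean()

# theoretical: 1 - P(no truck on all 7 days)
exact_truck_this_week = 1 - p_no_truck ** n_days

print(sim_truck_this_week, exact_truck_this_week)
```

Under these assumptions the theoretical value is 1 - 0.3^7, roughly 0.9998.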

8. If 23 people are in the same room, what are the odds that two of them share a birthday? What if it's 20 people? 40?

Plan:

sims = rows, people = cols

check for a matching number in the generated array

365 days in a year   np.random.randint(1, 366)



In [544]:
# generate random integers from 1 to 365 to represent birthdays
# each row is one simulated room; each column is one person in that room
# a shared birthday shows up as a repeated number within a row
sims = rows = 1000
people = 23
birthdays = np.random.randint(1, 366, (sims, people))
birthdays = pd.DataFrame(birthdays)
birthdays

     0    1    2    3    4    5    6    7    8    9  ...   13   14   15   16   17   18   19   20   21   22
0   45    3  315  156  242  361  128  209  106  142  ...   60  303   45  187   21   81  218  272  141  237
1  256  244  335  341    3  202  344  190  142  194  ...  362  187  262  340  196  102    4   40  274  224
2   79   46  265  165  198  276  220  358  253  134  ...  338  219  340   85  274  348  280    8  239  304
3  231  212  164   97   23  141  232  152  195  116  ...   90  178   49  323  102  145  341  165  218  288
4  301  175  237  190   74  254  326   38   35  276  ...  164  322   22   80  209   32   27  188   69  252
5  218    6  232  143  281  169   45  293  308   84  ...   29  161   95  345  196   90  214   40  247   76
6  138  313   19  361  309  123   63  156   45  335  ...  179  163  141   84  214  208  309  360  248  134
7   33   56  138   52   15   64  121  344  351  122  ...  165  332  273  154  125  117  217  153   94  334
8  220  241  326   88  283  308  336  147  363  148  ...  103  153   16  309   78  167  321   13   73  141
9  194  108  101  337   17   13  192  267  161   25  ...  178   13  199   51  116  336  319   64   11  141


In [545]:
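# Wow! nunique does all the work: a row with fewer unique birthdays than
# people means at least two people in that room share a birthday
(birthdays.nunique(axis=1) < people).mean()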

0.536
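The question also asks about 20 and 40 people. This sketch wraps the simulation in a function and compares it with the standard closed-form product for the birthday problem. It assumes 365 equally likely birthdays and 10,000 simulated rooms (arbitrary). `p_shared_sim` and `p_shared_exact` are names introduced here, and the simulation detects repeats by sorting each row instead of using `nunique`.

```python
import numpy as np

np.random.seed(123)

n_sims = 10_000  # assumed number of simulated rooms

def p_shared_sim(n_people):
    """Simulated probability that at least two of n_people share a birthday."""
    birthdays = np.random.randint(1, 366, (n_sims, n_people))
    # after sorting each row, a shared birthday appears as two equal neighbors
    sorted_days = np.sort(birthdays, axis=1)
    has_match = (np.diff(sorted_days, axis=1) == 0).any(axis=1)
    return has_match.mean()

def p_shared_exact(n_people):
    """1 - P(all birthdays distinct), assuming 365 equally likely days."""
    p_all_distinct = 1.0
    for i in range(n_people):
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct

for n_people in (20, 23, 40):
    print(n_people, p_shared_sim(n_people), round(p_shared_exact(n_people), 4))
```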

In [None]:
# Code I was playing around with

In [534]:
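n_sims = 10_000
n_people = 23
# one simulated room: 23 birthdays drawn with replacement from 365 days
row = np.random.choice(range(365), n_people)

In [535]:
# number of unique birthdays in each simulated room (each row of the DataFrame)
n_unique_bdays = np.array([np.unique(r).size for r in birthdays.values])

In [538]:
# np.unique(row).size is the number of distinct birthdays in the room
# if the comparison is True, at least two people have a shared birthday
np.unique(row).size < n_people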

False