<a href="https://colab.research.google.com/github/dannyNiming/Danny-Wang/blob/main/Sims_1_stats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running simulations with Python

When we simulate events, business decisions, and so on, we almost always follow up with a statistical analysis of the results.

## Warmup Exercise

- set the seed to 765 with numpy
- create a 5x3 numpy array of random numbers between 0 and 1. Hint: `np.random.rand()`
- multiply the array (elementwise) by 4
- make the array 1-dimensional. Hint: `flatten()` or `reshape()`
- what is the max value?
- what is the index of that largest value? Hint: `argmax()`




In [7]:
import numpy as np

# set the seed so the random numbers are reproducible
np.random.seed(765)
# 5x3 array of random numbers drawn uniformly from [0, 1)
two = np.random.rand(5, 3)
print(two)
# multiply every element by 4
three = two * 4
print(three)
# make the scaled array 1-dimensional
# (flatten() returns a new array, so the result must be assigned)
four = three.flatten()
print(four)
# largest value (shown to 8 decimals) and its index in the flattened array
print(f"{four.max():.8f}")
print(four.argmax())


[[0.47622943 0.46378993 0.29035724]
 [0.94356321 0.25553297 0.73366691]
 [0.70950592 0.89212214 0.20222305]
 [0.79012439 0.82681928 0.68601438]
 [0.1526213  0.70779554 0.05825119]]
[[1.90491773 1.85515973 1.16142896]
 [3.77425284 1.02213187 2.93466764]
 [2.83802369 3.56848857 0.80889221]
 [3.16049755 3.30727714 2.74405751]
 [0.6104852  2.83118215 0.23300477]]
[1.90491773 1.85515973 1.16142896 3.77425284 1.02213187 2.93466764
 2.83802369 3.56848857 0.80889221 3.16049755 3.30727714 2.74405751
 0.6104852  2.83118215 0.23300477]
3.77425284
3


# Basic Stats in Python

![](https://www.statology.org/wp-content/uploads/2018/10/normal_dist.png)

In [None]:
# Very useful package for many math/science/engineering tasks
# import scipy to start
import scipy

In [None]:
# if you are coding locally and don't have scipy yet:
#! pip install scipy



In [None]:
# the module we will use today:
from scipy import stats

#https://docs.scipy.org/doc/scipy/reference/stats.html

## One Sample T-test



In [None]:
# draw a 1-D sample from a standard normal distribution (mean 0, std 1)
# size = 15
x = np.random.normal(size=15)

In [None]:
x

array([ 0.36497273,  0.41097184,  0.03739921,  0.2502273 ,  0.06795932,
       -1.24925051, -0.24171349, -1.20130371, -0.92771451, -1.10461306,
       -2.23939628, -0.20392185, -0.05706672,  1.2340581 ,  0.91260418])

In [None]:
x.mean()

-0.2631191635307657

In [None]:
# ttest from scipy
stats.ttest_1samp(x, 0)


Ttest_1sampResult(statistic=-1.1057515211658153, pvalue=0.2874742212123508)
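
To make the numbers above less of a black box, here is a minimal sketch of how the one-sample t statistic and its two-sided p-value can be computed by hand and compared with `stats.ttest_1samp`. The seeded generator, the `sample` array, and the names `mu0`, `t_manual`, and `p_manual` are assumptions made for this illustration; they are not part of the original notebook, so the values will differ from the output shown above.

```python
import numpy as np
from scipy import stats

# Assumption: a fresh, seeded sample rather than the notebook's unseeded `x`
rng = np.random.default_rng(0)
sample = rng.normal(size=15)
mu0 = 0  # hypothesized population mean

n = sample.size
std_err = sample.std(ddof=1) / np.sqrt(n)            # standard error of the mean
t_manual = (sample.mean() - mu0) / std_err           # t statistic
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-sided p-value

res = stats.ttest_1samp(sample, mu0)
print(t_manual, res.statistic)
print(p_manual, res.pvalue)
```

Both printed pairs should agree up to floating-point rounding. Note `ddof=1`: the test uses the sample standard deviation, not the population one.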

In [None]:
# try this again - larger sample
# size = 100
x = np.random.normal(size=100)

In [None]:
stats.ttest_1samp(x, 0)

Ttest_1sampResult(statistic=0.04749486840050701, pvalue=0.9622145007396632)

In [None]:
x.mean()

0.004923130546908293

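> Makes sense, right? Both p-values are well above 0.05, so in neither case do we reject the null hypothesis that the mean is 0, which is what we expect because the data really were drawn from a distribution with mean 0. With the larger sample, the sample mean (0.0049) lands much closer to 0 because the standard error shrinks as the sample size grows. Note that when the null hypothesis is true, the p-value itself does not systematically get larger with sample size; it varies from sample to sample.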
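
Since this notebook is about simulations, one way to check how the p-value behaves at different sample sizes is to repeat the experiment many times. The sketch below assumes a seeded generator and 2000 repetitions per sample size (both arbitrary choices for illustration) and records how often a test at the 5% level rejects a null hypothesis that is actually true.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # assumed seed, only for reproducibility
n_sims = 2000                     # assumed number of repeated experiments

reject_rate = {}
for n in (15, 100):
    # each row is one simulated experiment of n draws from N(0, 1)
    samples = rng.normal(loc=0, scale=1, size=(n_sims, n))
    pvals = stats.ttest_1samp(samples, 0, axis=1).pvalue
    # share of experiments that would (wrongly) reject the true null at 5%
    reject_rate[n] = np.mean(pvals < 0.05)
    print(n, reject_rate[n], np.median(pvals))
```

If the test is behaving as designed, the rejection rate should sit near 0.05 for both sample sizes and the median p-value near 0.5, because p-values are roughly uniform on [0, 1] when the null hypothesis holds.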

In [None]:
# another sample, but this time shift the mean to 2 (std=1, size=50)
z = np.random.normal(loc=2, scale=1, size=50)
z.mean()

1.8447668796007555

In [None]:
# save the result to a variable named `result`
result = stats.ttest_1samp(z, 0)

In [None]:
result

Ttest_1sampResult(statistic=13.533460579802602, pvalue=3.565480921237748e-18)

In [None]:
# type
type(result)

scipy.stats.stats.Ttest_1sampResult

In [None]:
# parse to list
list(result)

[13.533460579802602, 3.565480921237748e-18]
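
Besides converting the result to a list, the fields can be read by name or unpacked like a tuple. The sketch below uses a hypothetical seeded stand-in (`z_demo`) for the notebook's `z`, so the numbers will not match the output above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)                 # assumed seed for illustration
z_demo = rng.normal(loc=2, scale=1, size=50)   # hypothetical stand-in for `z`

result_demo = stats.ttest_1samp(z_demo, 0)

# access the fields by name
t_stat = result_demo.statistic
p_value = result_demo.pvalue

# or unpack the result like a tuple
t_stat2, p_value2 = result_demo
print(t_stat, p_value)
```

Named access tends to read more clearly in longer analyses, since it does not depend on remembering the order of the fields.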

## Quick Exercise:

In [None]:
# create an array with mean 85 and standard deviation of 3
# test against a population mean of 91
# draw 50 samples

In [None]:
#grades = np.random.normal(loc=?, scale=?, size=?)
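
One possible way to fill in the template, offered as a sketch rather than the only answer. The seeded generator and the name `res_grades` are assumptions added here; the exercise itself does not specify a seed, so your numbers will differ if you use `np.random.normal` directly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)   # assumed seed; the exercise does not specify one
grades = rng.normal(loc=85, scale=3, size=50)   # mean 85, std 3, 50 draws

# test the sample against a hypothesized population mean of 91
res_grades = stats.ttest_1samp(grades, 91)
print(grades.mean(), res_grades.statistic, res_grades.pvalue)
```

Because the sample is drawn around 85 and tested against 91, the statistic should come out negative and the p-value very small.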

## Two Sample t-test

In [None]:
# let's create two normal samples: mean 100 / std 15 and mean 115 / std 15, size=100 each
x = np.random.normal(100, 15, 100)
y = np.random.normal(115, 15, 100)

In [None]:
x.mean()

99.74053504594147

In [None]:
y.mean()

113.7493900561022

In [None]:
stats.ttest_ind(x, y)

Ttest_indResult(statistic=-6.393018292291385, pvalue=1.143437649108597e-09)
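
By default `stats.ttest_ind` assumes the two groups share the same variance. When that is in doubt, passing `equal_var=False` runs Welch's t-test instead. The sketch below compares the two on hypothetical seeded samples `a` and `b` (assumptions for illustration, not the notebook's `x` and `y`).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)   # assumed seed
a = rng.normal(100, 15, 100)     # hypothetical group A
b = rng.normal(115, 15, 100)     # hypothetical group B

pooled = stats.ttest_ind(a, b)                   # default: assumes equal variances
welch = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test
print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

With equal group sizes the two statistics coincide and only the degrees of freedom (and therefore the p-values) differ slightly. The choice matters more when group sizes and variances are both unequal.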

## Chi-square

In [None]:
## test for independence
# 4 sets of dice rolls, summarized: one row per face (1-6), one column per set

a1 = [6, 4, 5, 10]
a2 = [8, 5, 3, 3]
a3 = [5, 4, 8, 4]
a4 = [4, 11, 7, 13]
a5 = [5, 8, 7, 6]
a6 = [7, 3, 5, 9]
dice = np.array([a1, a2, a3, a4, a5, a6])

# total number of rolls in each set (column sums)
dice.sum(axis=0)

array([35, 35, 35, 45])

In [None]:
dice

array([[ 6,  4,  5, 10],
       [ 8,  5,  3,  3],
       [ 5,  4,  8,  4],
       [ 4, 11,  7, 13],
       [ 5,  8,  7,  6],
       [ 7,  3,  5,  9]])

In [None]:
stats.chi2_contingency(dice)

(16.490612061288754,
 0.35021521809742745,
 15,
 array([[ 5.83333333,  5.83333333,  5.83333333,  7.5       ],
        [ 4.43333333,  4.43333333,  4.43333333,  5.7       ],
        [ 4.9       ,  4.9       ,  4.9       ,  6.3       ],
        [ 8.16666667,  8.16666667,  8.16666667, 10.5       ],
        [ 6.06666667,  6.06666667,  6.06666667,  7.8       ],
        [ 5.6       ,  5.6       ,  5.6       ,  7.2       ]]))

In [None]:
# another way to unpack the results
stat, p, dof, exp = stats.chi2_contingency(dice)
p

0.35021521809742745
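
To show where the four returned values come from, here is a sketch that rebuilds them by hand for the same `dice` table: expected counts from the row and column totals, the chi-square statistic, the degrees of freedom, and the p-value. The names `expected`, `chi2_manual`, `dof_manual`, and `p_manual` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

dice = np.array([[6, 4, 5, 10],
                 [8, 5, 3, 3],
                 [5, 4, 8, 4],
                 [4, 11, 7, 13],
                 [5, 8, 7, 6],
                 [7, 3, 5, 9]])

row_totals = dice.sum(axis=1, keepdims=True)   # shape (6, 1)
col_totals = dice.sum(axis=0, keepdims=True)   # shape (1, 4)
# expected count per cell under independence: row total * column total / grand total
expected = row_totals * col_totals / dice.sum()

chi2_manual = ((dice - expected) ** 2 / expected).sum()
dof_manual = (dice.shape[0] - 1) * (dice.shape[1] - 1)   # (rows - 1) * (cols - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)        # upper-tail probability
print(chi2_manual, p_manual, dof_manual)
```

These values should match the output of `stats.chi2_contingency(dice)` above. The match is exact here because SciPy applies its continuity correction only when there is a single degree of freedom.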

## Quick Exercise:

The operations manager of a company that manufactures tires wants to determine whether there are any differences in the quality of work among the three daily shifts. She randomly selects 496 tires and carefully inspects them. Each tire is either classified as perfect, satisfactory, or defective, and the shift that produced it is also recorded. The two categorical variables of interest are the shift and condition of the tire produced. The data can be summarized by the accompanying two-way table. Does the data provide sufficient evidence at the 5% significance level to infer that there are differences in quality among the three shifts?



| Shift           | Perfect | Satisfactory | Defective |
|-----------------|---------|--------------|-----------|
| Morning Shift   | 106     | 124          | 1         |
| Afternoon Shift | 67      | 85           | 1         |
| Night Shift     | 37      | 72           | 3         |

Source: https://online.stat.psu.edu/stat500/lesson/8/8.1
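
A possible starting point for this exercise, as a sketch only: it enters the table above as a NumPy array and runs the same `chi2_contingency` call used earlier. The variable names are assumptions, and the interpretation is left to you.

```python
import numpy as np
from scipy import stats

# rows: morning, afternoon, night; columns: perfect, satisfactory, defective
tires = np.array([[106, 124, 1],
                  [67, 85, 1],
                  [37, 72, 3]])

stat_t, p_t, dof_t, exp_t = stats.chi2_contingency(tires)
print(stat_t, p_t, dof_t)
print("reject at the 5% level:", p_t < 0.05)

# check the usual rule of thumb that expected counts should be at least 5
print("cells with expected count below 5:", int((exp_t < 5).sum()))
```

The last line is worth a look before drawing a conclusion: all three expected counts in the Defective column fall below 5, so the chi-square approximation may be unreliable for this table.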