# Statistics in Python
**SciPy (pronounced “Sigh Pie”)** is open-source software for mathematics, science, and engineering.

- [SciPy - Documentation](https://docs.scipy.org/doc/scipy/reference/index.html)
- [SciPy - Wiki](https://en.wikipedia.org/wiki/SciPy)

**We will use the [stats](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.a) package and a normally distributed variable for demonstration purposes.** The same approach carries over to other distributions with minimal changes.

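**The following parameters are used frequently:**

- x = numeric argument (a value of the variable)
- p = probability argument (named q in SciPy's ppf)
- loc = mean (for the normal distribution)
- scale = standard deviation (for the normal distribution)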
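
As a small sketch of how `loc` and `scale` enter the calls, the snippet below compares three equivalent ways of evaluating a normal CDF. The values 90, 100, and 20 are the illustrative ones used in the cells that follow; the variable names are only for this example.

```python
import numpy as np
from scipy.stats import norm

x, mean, sd = 90, 100, 20            # illustrative values, matching the cells below

# 1. Pass loc and scale on every call
direct = norm.cdf(x, loc=mean, scale=sd)

# 2. Standardize by hand and use the standard normal (loc=0, scale=1)
z = (x - mean) / sd
standardized = norm.cdf(z)

# 3. "Freeze" the distribution once, then call methods without loc/scale
dist = norm(loc=mean, scale=sd)
frozen = dist.cdf(x)

print(direct, standardized, frozen)  # all three agree
```

Freezing the distribution is convenient when the same mean and standard deviation are reused across several calls.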

In [1]:
from scipy.stats import norm    # Importing normal distribution class from stats module
import numpy as np              # Importing numpy for basic math operations

### 1. Probability Density Function
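**Returns the probability density at X, i.e. the height of the density curve there. For a continuous distribution this is not the probability of observing exactly X (which is zero).**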

- **R** : dnorm()
- **Python** : norm.pdf(x, loc, scale)

In [2]:
norm.pdf(90,loc=100,scale=20)

0.017603266338214976
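
To make the meaning of the density value concrete, here is a minimal sketch that recomputes it from the textbook normal density formula and shows that a density can exceed 1. The helper `normal_density` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from scipy.stats import norm

def normal_density(x, mean, sd):
    """Normal density written out from the closed-form formula."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

manual = normal_density(90, 100, 20)
library = norm.pdf(90, loc=100, scale=20)
print(manual, library)               # the two values agree

# A narrow distribution has a tall peak: the density is above 1 here,
# which would be impossible for a probability
narrow_peak = norm.pdf(0, loc=0, scale=0.1)
print(narrow_peak)
```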

### 2. Cumulative Distribution Function (left tail probability)
**Returns probability of finding a number less than or equal to X in your distribution**

- **R** : pnorm()
- **Python** : norm.cdf(x, loc, scale)

In [3]:
norm.cdf(90,loc=100,scale=20)

0.3085375387259869
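
The CDF is the area under the density curve to the left of X. The sketch below checks this numerically with `scipy.integrate.quad`. The lower integration limit of ten standard deviations below the mean is an assumption made for this example, chosen because the area beyond it is negligible.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mean, sd, x = 100, 20, 90
lower = mean - 10 * sd               # assumed cut-off; the tail beyond it is negligible

# Integrate the density from the cut-off up to x
area, abs_err = quad(norm.pdf, lower, x, args=(mean, sd))

print(area)                          # numerical area under the density
print(norm.cdf(x, loc=mean, scale=sd))
```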

### 3. Survival Function (right tail probability)
**Returns probability of finding a number greater than X in your distribution**

- **R** : 1 - pnorm()
- **Python** : norm.sf(x, loc, scale)

In [4]:
norm.sf(90,loc=100,scale=20)

0.6914624612740131
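
The survival function is the complement of the CDF, so `norm.sf` and `1 - norm.cdf` agree for ordinary values. As a hedged illustration of why a dedicated function is still useful, the sketch below also evaluates both far in the right tail, where subtracting from 1 loses all precision in floating point.

```python
import numpy as np
from scipy.stats import norm

# For ordinary values the two forms agree
right_tail = norm.sf(90, loc=100, scale=20)
complement = 1 - norm.cdf(90, loc=100, scale=20)
print(right_tail, complement)

# Far in the tail, the CDF rounds to exactly 1.0, so the subtraction gives 0.0,
# while sf still returns a tiny positive probability
far_sf = norm.sf(10)
far_complement = 1 - norm.cdf(10)
print(far_sf, far_complement)
```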

### 4. Percent Point Function
**Returns the value associated with the probability q (aka the q-th quantile) in your distribution**

- **R** : qnorm()
- **Python** : norm.ppf(q, loc, scale)



In [5]:
norm.ppf(0.30853754,loc=100,scale=20)

90.00000007237368
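
The percent point function is the inverse of the CDF. The result above is slightly off from 90 only because the probability was typed in rounded to eight decimals. A minimal round-trip sketch, using the same illustrative values:

```python
import numpy as np
from scipy.stats import norm

# Feed the unrounded CDF value back into ppf to recover x
p = norm.cdf(90, loc=100, scale=20)
x_back = norm.ppf(p, loc=100, scale=20)
print(x_back)

# On the standard normal, ppf gives the familiar critical values
z_975 = norm.ppf(0.975)              # about 1.96
print(z_975)
```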

### Function Mapping [R => Python]

- **dnorm   => norm.pdf()**

- **pnorm   => norm.cdf()**

- **1-pnorm => norm.sf()**

- **qnorm   => norm.ppf()**
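
The same method names apply to other SciPy distributions. As an illustrative sketch, the exponential distribution is used below; the choice of distribution and the rate of 2 are assumptions for this example. R parameterizes it by `rate`, whereas SciPy's `expon` uses `scale = 1/rate`.

```python
import numpy as np
from scipy.stats import expon

rate = 2.0                            # assumed rate, for illustration only
scale = 1 / rate                      # SciPy uses scale = 1/rate

density = expon.pdf(1.0, scale=scale)     # R: dexp(1, rate=2)
left_tail = expon.cdf(1.0, scale=scale)   # R: pexp(1, rate=2)
right_tail = expon.sf(1.0, scale=scale)   # R: 1 - pexp(1, rate=2)
quantile = expon.ppf(left_tail, scale=scale)  # R: qexp(p, rate=2)

print(density, left_tail, right_tail, quantile)
```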

# Statistical questions

**Q1.** A teller at a bank counter serves the customers standing in a queue one by one. Suppose that the service time X_i for customer i has mean E(X_i)=2 (minutes) and Var(X_i)=1. We assume that service times for different customers are independent. Let Y be the total time the teller spends serving 50 customers. Find P(90<Y<110). (E.g. - 0.99)

**Ans in R**

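- By the central limit theorem, Y is approximately normal with mean 50×2 = 100 and variance 50×1 = 50.
- pnorm(110,100,sqrt(50)) - pnorm(90,100,sqrt(50))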

**Ans in Python**


In [6]:
sd = np.sqrt(50)
print(norm.cdf(x=110,loc=100,scale=sd) - norm.cdf(x=90,loc=100,scale=sd))

0.8427007929497148
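
The normal approximation can be sanity-checked by simulation. The question does not say how individual service times are distributed, so the sketch below assumes a gamma distribution with shape 4 and scale 0.5, which has mean 2 and variance 1. That choice, the seed, and the number of simulated days are assumptions made only for this example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable

# Assumed service-time distribution: gamma(shape=4, scale=0.5) -> mean 2, variance 1
n_sims, n_customers = 50_000, 50
service_times = rng.gamma(shape=4, scale=0.5, size=(n_sims, n_customers))
totals = service_times.sum(axis=1)    # one total Y per simulated run

simulated = np.mean((totals > 90) & (totals < 110))

# Normal approximation from the central limit theorem
sd = np.sqrt(50)
approx = norm.cdf(110, loc=100, scale=sd) - norm.cdf(90, loc=100, scale=sd)

print(simulated, approx)              # the two should be close
```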


**Q2**. A random sample of 100 items is taken, producing a sample mean of 49. The population SD is 4.49. Construct a 90% confidence interval to estimate the population mean. (E.g. - 45.33,53.45)

**Ans in R**

- qnorm(.05,49,4.49/sqrt(100)) and qnorm(.95,49,4.49/sqrt(100))


**Ans in Python**


In [7]:
sd = 4.49/np.sqrt(100)
print(norm.ppf(q=0.05,loc=49,scale=sd))
print(norm.ppf(q=0.95,loc=49,scale=sd))

48.261460721498786
49.738539278501214
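
The two `ppf` calls can also be written as a single call to the distribution's `interval` method, which returns both endpoints of the central interval. A minimal sketch with the same numbers:

```python
import numpy as np
from scipy.stats import norm

se = 4.49 / np.sqrt(100)              # standard error of the sample mean

# Central 90% interval around the sample mean
lower, upper = norm.interval(0.90, loc=49, scale=se)
print(lower, upper)
```

The confidence level is passed positionally here because its keyword name has changed between SciPy versions.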


**Q3**. A random sample of 35 items is taken, producing a sample mean of 2.364 with a sample variance of 0.81. Assume x is normally distributed and construct a 90% confidence interval for the population mean.

**Ans in R**

- qnorm(.05,2.364,0.9/sqrt(35))
- qnorm(.95,2.364,0.9/sqrt(35))

**Ans in Python**


In [8]:
sd = 0.9/np.sqrt(35)
print(norm.ppf(q=0.05,loc=2.364,scale=sd))
print(norm.ppf(q=0.95,loc=2.364,scale=sd))

2.1137720925797394
2.6142279074202603
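
The answer above uses the normal distribution. Because the standard deviation here is estimated from the sample, a Student's t interval with n-1 degrees of freedom is a common alternative. The sketch below is that alternative, not the method used in the answer above, and is shown only for comparison.

```python
import numpy as np
from scipy.stats import norm, t

n, mean, sample_sd = 35, 2.364, 0.9   # sample_sd = sqrt(0.81)
se = sample_sd / np.sqrt(n)

# Normal-based interval, as in the answer above
z_lower = norm.ppf(0.05, loc=mean, scale=se)
z_upper = norm.ppf(0.95, loc=mean, scale=se)

# t-based interval with n-1 degrees of freedom (slightly wider)
t_lower = t.ppf(0.05, df=n - 1, loc=mean, scale=se)
t_upper = t.ppf(0.95, df=n - 1, loc=mean, scale=se)

print(z_lower, z_upper)
print(t_lower, t_upper)
```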


**Q4**. Suppose a car manufacturer claims a model gets 25 mpg. A consumer group asks 40 owners of this model to calculate their mpg and the mean value was 22 with a standard deviation of 1.5.

Give the z-score for this observation. Is the claim true? (Give your answer as "z-score,Yes/No". e.g. -1.99,Yes)


**Ans in R**

- (22-25)/(1.5/sqrt(40))
- qnorm(pnorm(22,25,1.5/sqrt(40)))
- No

**Ans in Python**


In [9]:
sd = 1.5/np.sqrt(40)
p = norm.cdf(x=22,loc=25,scale=sd)  ## Left-tail probability of a sample mean of 22 under the claim
norm.ppf(p)                         ## Converting it back to a z-score (p-th quantile value in a standard normal distribution)

-12.649110640673518
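
The z-score can also be computed directly from its definition, without the round trip through `cdf` and `ppf`. In the sketch below, the 0.05 significance level and the two-tailed comparison are assumptions for illustration, since the question does not state them.

```python
import numpy as np
from scipy.stats import norm

claimed_mean, sample_mean, sample_sd, n = 25, 22, 1.5, 40

# z = (sample mean - claimed mean) / standard error
z = (sample_mean - claimed_mean) / (sample_sd / np.sqrt(n))
print(z)

# Assumed: two-tailed test at the 0.05 level
critical = norm.ppf(0.975)            # about 1.96
claim_rejected = abs(z) > critical
print(claim_rejected)                 # True means the data contradict the claim
```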

**Q5**. Suppose the mean weight of King Penguins found in an Antarctic colony last year was 15.4 kg. In a sample of 35 penguins taken at the same time this year in the same colony, the mean penguin weight is 14.6 kg. Assume the population standard deviation is 2.5 kg.

What is the p-value for the given observation? At 0.05 significance level, can we reject the null hypothesis that the mean penguin weight does not differ from last year?

**Ans in R**

- pnorm(14.6,15.4,2.5/sqrt(35))
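- No: the one-tailed probability of about 0.029 is above the 0.025 limit on each side of a two-tailed test (equivalently, the two-tailed p-value of about 0.058 is above 0.05)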

**Ans in Python**


In [10]:
sd = 2.5/np.sqrt(35)
norm.cdf(14.6,loc=15.4,scale=sd)  ## Finding the one-tailed (left-tail) probability

0.029169259343448172
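
Since the null hypothesis is that the mean weight does not differ in either direction, the test is two-tailed. A short sketch of the two-tailed p-value, obtained by doubling the tail probability of the observed z-score:

```python
import numpy as np
from scipy.stats import norm

se = 2.5 / np.sqrt(35)                # standard error of the sample mean
z = (14.6 - 15.4) / se

p_one_tailed = norm.cdf(z)            # same value as the cell above
p_two_tailed = 2 * norm.sf(abs(z))    # both tails

reject_null = p_two_tailed < 0.05
print(p_one_tailed, p_two_tailed, reject_null)
```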

**Q6**. A student, to test his luck, went to an examination unprepared.

It was a multiple-choice examination with two choices for each question. There are 50 questions, of which at least 20 must be answered correctly to pass the test. What is the probability that he clears the exam?
If each question has 4 choices instead of two, what is the probability that he clears the exam?

**Note: This is a binomial distribution**

**Ans in R**

- Case1: 1-pbinom(19,50,0.5)
- Case2: 1-pbinom(19,50,0.25)


**Ans in Python**


In [11]:
from scipy.stats import binom

print("Case 1")
## Case 1 - 2 options (aka 50% probability of getting a correct answer)

print(1 - binom.cdf(k=19,n=50,p=0.5))  ## 1 - probability of at most 19 correct answers
print(binom.sf(k=19,n=50,p=0.5))       ## Probability of more than 19 (i.e. at least 20) correct answers

## Case 2 - 4 options (aka 25% probability of getting a correct answer)

print("\nCase 2")
print(1 - binom.cdf(k=19,n=50,p=0.25))  ## 1 - probability of at most 19 correct answers
print(binom.sf(k=19,n=50,p=0.25))       ## Probability of more than 19 (i.e. at least 20) correct answers


Case 1
0.9405397737202819
0.9405397737202819

Case 2
0.013917608678660653
0.01391760867866067
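
For a discrete distribution, the same tail probability can be built up by summing the probability mass function over every passing score. The sketch below does this for both cases as a cross-check of the `sf` results.

```python
import numpy as np
from scipy.stats import binom

passing_scores = np.arange(20, 51)    # 20, 21, ..., 50 correct answers

# Sum P(X = k) over all passing scores
case1 = binom.pmf(passing_scores, n=50, p=0.5).sum()
case2 = binom.pmf(passing_scores, n=50, p=0.25).sum()

print(case1, binom.sf(19, n=50, p=0.5))
print(case2, binom.sf(19, n=50, p=0.25))
```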
