# Inferential statistics

- Inferential statistics is used to draw conclusions (find facts) about a population from sample data.
- Inferential statistics: The process of estimating population parameters from samples.
- Parameter: A characteristic of the population (mean, SD, etc.).
- Sampling error: The amount of error in the estimate of a population parameter that is based on a sample statistic.

- Instructor says that:
- When we import a .csv file into a pandas DataFrame,
- we first try to plot its probability distribution (using the probability density function).
- Then we try to find the parameters of the distribution (mean, median, mode, variance, SD).


- Usually the population is too large to run statistics on all of it, so we have to draw a sample from the population.
- This sample should be a good representative of the population.
- The purpose of a sample is estimation.
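- A minimal sketch of sampling and sampling error, assuming a synthetic population generated with NumPy. The population values, the sample size of 40 and the seed are made up for illustration and do not come from any dataset in these notes:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population of 100,000 values (assumption, not real data).
population = rng.normal(loc=30.0, scale=12.5, size=100_000)

# Draw a simple random sample without replacement.
sample = rng.choice(population, size=40, replace=False)

# Parameter (computed on the population) vs. statistic (computed on the sample).
population_mean = population.mean()
sample_mean = sample.mean()

# Sampling error: how far the sample estimate is from the true parameter.
sampling_error = sample_mean - population_mean

print(f"population mean = {population_mean:.3f}")
print(f"sample mean     = {sample_mean:.3f}")
print(f"sampling error  = {sampling_error:.3f}")
```

- In practice the population mean is unknown, which is why the sample mean is used as its estimate.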



- If you are fetching only one variable from the population, it is called univariate sampling.
- If you are fetching two variables, it is called bivariate sampling.


# Central limit theorem
- See slides
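- As a rough illustration only (the slides remain the reference), the theorem can be checked by simulation. The sketch below assumes an exponential population with mean 2.0; the sample size of 50 and the 10,000 repetitions are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# A clearly non-normal (right-skewed) population: exponential with mean 2.0.
population_mean = 2.0
population_sd = 2.0   # for an exponential distribution the SD equals the mean

n = 50                # size of each sample (assumption)
num_samples = 10_000  # number of repeated samples (assumption)

# Draw many samples and record the mean of each one.
samples = rng.exponential(scale=population_mean, size=(num_samples, n))
sample_means = samples.mean(axis=1)

# The sample means should centre on the population mean,
# with a spread close to sigma / sqrt(n).
print("mean of sample means:", sample_means.mean())
print("SD of sample means:  ", sample_means.std(ddof=1))
print("sigma / sqrt(n):     ", population_sd / np.sqrt(n))
```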

# Hypothesis Testing
- It is an example of inferential statistics.
- A hypothesis is a claim.
- Hypothesis testing uses data to either reject or retain a null hypothesis.
- Hypothesis testing consists of two complementary statements:
    1. Null Hypothesis (H<sub>0</sub>) - an existing belief.
    2. Alternate Hypothesis (H<sub>A</sub>) - what we intend to establish with new evidence.

- Types of hypothesis testing: parametric and non-parametric
    1. Parametric tests: They make use of population parameters such as mean, standard deviation, etc. Examples: Z-test, T-test, ANOVA, etc.
    2. Non-parametric tests: They make use of the data distribution to comment on the claim. Example: Chi-Square, etc.
- A few examples of claims that can be framed as hypotheses are as follows:
    1. Children who drink the health drink Complan are likely to grow taller.
    2. Women use camera phones more than men (Freier, 2016).
    3. Vegetarians miss fewer flights (Siegel, 2016).
    4. Smokers are better salespeople.

## The steps for hypothesis tests are as follows:

1. Define the null and alternative hypotheses. A hypothesis is described using a population parameter.
2. Identify the test statistic to be used for testing the validity of the null hypothesis.
3. Decide the criterion for rejecting or retaining the null hypothesis. This is called the significance level (alpha).
4. Calculate the p-value (probability value), which is the conditional probability of observing a test statistic at least as extreme as the calculated one, given that the null hypothesis is true.
5. Decide whether to reject or retain the null hypothesis.
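
- The steps above can be sketched as follows. All numbers (hypothesized mean 50, population SD 10, sample mean 47, n = 25) are hypothetical, and a left-tailed Z-test is assumed purely for illustration:

```python
import math
from scipy import stats

# Step 1: hypotheses about a population mean (hypothetical values).
#   H0: mu >= 50    HA: mu < 50   (left-tailed test)
mu_0 = 50.0
sigma = 10.0        # population SD, assumed known

# Hypothetical sample summary.
sample_mean = 47.0
n = 25

# Step 2: test statistic (Z, because sigma is assumed known).
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))

# Step 3: significance level.
alpha = 0.05

# Step 4: p-value = P(Z <= z | H0 is true) for a left-tailed test.
p_value = stats.norm.cdf(z)

# Step 5: decision.
decision = "reject H0" if p_value < alpha else "retain H0"
print(f"z = {z:.4f}, p-value = {p_value:.4f}, decision: {decision}")
```

- With these made-up numbers, z = -1.5 and the p-value is about 0.0668, which is above 0.05, so H0 is retained.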

### Types of probability:
- Marginal probability: P(A)
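- Joint probability: P(A ∩ B). (P(A ∪ B), the probability that A or B occurs, is the probability of a union rather than a joint probability.)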
- Conditional Probability: P(A|B)
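- A small worked sketch of the three types, using a hypothetical table of 100 customers (the counts are invented for illustration). Here A = "uses a camera phone" and B = "is a woman":

```python
from fractions import Fraction

# Hypothetical counts of 100 customers by gender and phone type.
counts = {
    ("woman", "camera_phone"): 30,
    ("woman", "other_phone"): 20,
    ("man", "camera_phone"): 15,
    ("man", "other_phone"): 35,
}
total = sum(counts.values())

# Marginal probability P(A): customer uses a camera phone.
p_a = Fraction(sum(v for (g, p), v in counts.items() if p == "camera_phone"), total)

# Marginal probability P(B): customer is a woman.
p_b = Fraction(sum(v for (g, p), v in counts.items() if g == "woman"), total)

# Joint probability P(A ∩ B): camera phone AND woman.
p_a_and_b = Fraction(counts[("woman", "camera_phone")], total)

# Conditional probability P(A|B) = P(A ∩ B) / P(B).
p_a_given_b = p_a_and_b / p_b

print(p_a, p_b, p_a_and_b, p_a_given_b)  # prints: 9/20 1/2 3/10 3/5
```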

## For the following examples, we will use the following notations:
1. μ - Population mean
2. σ - Population standard deviation
3. X - sample mean
4. S - Sample Standard deviation
5. n - sample size

- 1, 2, and 5 are given in the claim or problem statement.
- 3 and 4 have to be calculated from the sample.

## Z-test (Parametric HT)
- Z-test is used when:
    1. We need to test the value of the **population mean**, given that the population variance is known. The **hypothesized mean, alpha and population standard deviation are given**.
    2. The population follows a **normal distribution** and the population variance is known.
    3. If the sample size is large (n > 30) and the population variance is known, the assumption of a normal distribution can be relaxed.
- Z-statistic is calculated as $Z = \frac{X - \mu}{\sigma / \sqrt{n}}$

## Important Note on p-value
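- The p-value is the conditional probability of observing a Z-statistic at least as extreme as the calculated one, given that the null hypothesis is true.
- For a left-tailed test, it is the cumulative probability of the Z-statistic.
- That is, the cdf evaluated at the Z-statistic is p.
- If the calculated p-value is greater than the given alpha value, the null hypothesis is retained (there is not enough evidence to reject it).
- Otherwise, it is rejected.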
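- The cdf gives the p-value only for a left-tailed alternative. A hedged sketch of how the p-value changes with the direction of the alternative hypothesis follows; `p_value_from_z` is a hypothetical helper written for these notes, not a SciPy function:

```python
from scipy import stats

def p_value_from_z(z, tail="left"):
    """Return the p-value of a Z-statistic for the given alternative."""
    if tail == "left":        # HA: mu < mu_0
        return stats.norm.cdf(z)
    if tail == "right":       # HA: mu > mu_0
        return stats.norm.sf(z)           # sf(z) = 1 - cdf(z)
    if tail == "two-sided":   # HA: mu != mu_0
        return 2 * stats.norm.sf(abs(z))
    raise ValueError(f"unknown tail: {tail!r}")

z = -1.4925  # example Z-statistic value

p_left = p_value_from_z(z, "left")
p_right = p_value_from_z(z, "right")
p_two_sided = p_value_from_z(z, "two-sided")

print(f"left-tailed:  {p_left:.4f}")
print(f"right-tailed: {p_right:.4f}")
print(f"two-sided:    {p_two_sided:.4f}")
```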

## Problem statement on Z-test


- Read the first 5 entries and check the values present in the dataframe.


- Conducting the Z-test for the above hypothesis:


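```python
import math
from scipy import stats

def z_test(pop_mean, pop_std, sample):
    # Z = (sample mean - population mean) / standard error of the mean
    z_score = (sample.mean() - pop_mean) / (pop_std / math.sqrt(len(sample)))
    # norm.cdf gives the left-tail probability of the Z-score
    return z_score, stats.norm.cdf(z_score)

z_test(30, 12.5, passport_df.processing_time)
```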

- You will receive two values as output, e.g. (-1.4925, 0.0677).
- The first value is the Z-statistic (Z-score) and the second is the corresponding p-value.
- Since the p-value of 0.0677 (6.77%) is greater than the alpha of 5%, there is not enough evidence to reject the null hypothesis.
- Hence the null hypothesis is retained.

## T-test
- It is used when the population mean is given in the claim but the population SD is unknown.
- Before calculating the t-value, the population SD is estimated by the sample standard deviation S:
- $S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - X)^2}$, where $X_i$ are the sample observations.
- Formula for T-test:
- $t = \frac{X - \mu}{S / \sqrt{n}}$
- The Z-test relies on the known population SD, while the T-test relies on the SD estimated from the sample.
- There are the following types of T-tests:
    1. One Sample (Parametric HT)
        - One sample is given, and estimation is done on that one sample.
    2. Two Sample
        - Two samples are given, and they are independent of each other.
    3. Paired Sample
        - Two samples are given, but unlike the two-sample case, the first and second variables are dependent.
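- To make the t-statistic formula concrete, here is a minimal sketch that computes it by hand and cross-checks it against `scipy.stats.ttest_1samp`. The sample values and the hypothesized mean of 500 are made up for illustration, and a two-sided alternative is assumed:

```python
import math
import numpy as np
from scipy import stats

# Hypothetical sample (made-up values, for illustration only).
sample = np.array([480.0, 510.0, 455.0, 530.0, 470.0, 495.0, 440.0, 505.0])
mu_0 = 500.0  # hypothesized population mean

n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)   # sample SD with n - 1 in the denominator

# t = (X - mu) / (S / sqrt(n))
t_manual = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's one-sample t-test.
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```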


## Problem statement for One sample T-test:


- Read data and print first 5 entries.  

- Conducting the test:


```python
from scipy import stats
# One-sample t-test against the hypothesized mean of 500
stats.ttest_1samp(bollywood_movies_df.production_cost, 500)
# Output: Ttest_1sampResult(statistic=-2.2845, pvalue=0.02786)
```
- t-statistic value = -2.2845, and p-value = 0.02786
- The p-value is less than alpha = 0.05.
- So we reject the null hypothesis.

## Two-Sample T-test
- It is used to test the difference between two population means when the population SDs are not given.
- The parameters are estimated from the samples.
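- A hedged sketch of how the statistic is built from sample estimates, assuming two synthetic independent groups and equal population variances (the pooled-variance form, which is what `scipy.stats.ttest_ind` uses by default). The data, group sizes and seed are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

# Hypothetical, synthetic samples from two independent groups.
group_a = rng.normal(loc=8.0, scale=2.0, size=60)
group_b = rng.normal(loc=6.0, scale=2.0, size=50)

n_a, n_b = len(group_a), len(group_b)
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)

# Pooled variance estimated from the two samples.
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)

# t = difference of sample means / standard error of the difference.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_a + n_b - 2)

# SciPy equivalent (equal variances assumed by default).
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)

# Welch's variant, which does not assume equal variances.
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.6f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.6f}")
print(f"welch:  t = {t_welch:.4f}, p = {p_welch:.6f}")
```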

- Reading the data from the sheet `healthdrink_yes` and displaying the first five records:
```python
import pandas as pd

# first parameter is filename, second parameter is sheet name
healthdrink_yes_df = pd.read_excel('healthdrink.xlsx', 'healthdrink_yes')
healthdrink_yes_df.head(5)
```

- Reading from the second sheet, `healthdrink_no`:
```python
# first parameter is filename, second parameter is sheet name
healthdrink_no_df = pd.read_excel('healthdrink.xlsx', 'healthdrink_no')
healthdrink_no_df.head(5)
```

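- Conducting the test:
```python
# Independent two-sample t-test on the height increase of the two groups
stats.ttest_ind(healthdrink_yes_df['height_increase'], healthdrink_no_df['height_increase'])
# Output: Ttest_indResult(statistic=8.1316, pvalue=0.00)
```
- The p-value is close to 0. At a 5% significance level, the null hypothesis that both groups have the same mean height increase would be rejected.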

## Paired-Sample T-test:
- Used for before-and-after statistical estimation.
- Used for comparing two different interventions applied on the same sample.
- The parameters are calculated from the samples.
- E.g.:
1. Employee performance after a training program.
2. Diagnosis after treatment.
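
- Because both measurements come from the same subjects, a paired test is equivalent to a one-sample t-test on the per-subject differences against 0. A minimal sketch with made-up before/after scores for 8 hypothetical employees:

```python
import numpy as np
from scipy import stats

# Hypothetical scores of the same 8 employees before and after training.
before = np.array([62.0, 70.0, 55.0, 68.0, 74.0, 60.0, 66.0, 71.0])
after = np.array([65.0, 74.0, 54.0, 72.0, 79.0, 63.0, 65.0, 76.0])

# Paired t-test on the two dependent samples.
t_paired, p_paired = stats.ttest_rel(before, after)

# Equivalent formulation: one-sample t-test of the differences against 0.
differences = before - after
t_diff, p_diff = stats.ttest_1samp(differences, 0.0)

print(f"paired:      t = {t_paired:.4f}, p = {p_paired:.4f}")
print(f"differences: t = {t_diff:.4f}, p = {p_diff:.4f}")
```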

## Example Problem statement for Paired-Sample T-test:
- Given:
- Level of confidence = 95%
- Therefore, alpha = 100% - level of confidence = 5%
- Notice that alcohol consumption before and after breakup is given, so it is bivariate sampling.
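- The alpha arithmetic above can be sketched as follows, together with the two-sided critical values it implies. The number of pairs, n = 20, is a hypothetical value chosen only to show the t critical value; it is not taken from the dataset:

```python
from scipy import stats

confidence_level = 0.95
alpha = 1 - confidence_level   # 0.05

# Two-sided critical value of the standard normal distribution.
z_critical = stats.norm.ppf(1 - alpha / 2)

# Two-sided critical value of the t distribution with n - 1 degrees of freedom.
n = 20  # hypothetical number of pairs (assumption)
t_critical = stats.t.ppf(1 - alpha / 2, df=n - 1)

print(f"alpha = {alpha:.2f}")
print(f"z critical = {z_critical:.3f}")
print(f"t critical (df={n - 1}) = {t_critical:.3f}")
```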

- Distribution plots of alcohol consumption separately for before and after breakups:
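```python
import matplotlib.pyplot as plt
import seaborn as sn
# distplot is deprecated in recent seaborn releases; histplot(..., kde=True) is the suggested replacement
sn.distplot(breakups_df['Before_Breakup'], label='Before_Breakup')
sn.distplot(breakups_df['After_Breakup'], label='After_Breakup')
plt.legend()
```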

- Conducting the test:
```python
# Paired t-test on before/after measurements of the same individuals
stats.ttest_rel(breakups_df['Before_Breakup'], breakups_df['After_Breakup'])
# Output: Ttest_relResult(statistic=-0.5375, pvalue=0.5971)
```
- The p-value is 0.597, which is greater than alpha = 0.05, so we fail to reject the null hypothesis.
- There is not enough evidence that the mean alcohol consumption before and after breakup differs.
- We conclude that there is no significant change in alcohol consumption pattern before and after breakup.