# Statistical Tests in Python

## Introduction

A very large number of statistical tests exist. 
In this Jupyter notebook, we restrict ourselves to the following four:


    - Paired T-test
    - Two-tailed P test (through Pearson correlation)
    - F-Test (through ANOVA)
    - Wilcoxon signed-rank test


The overview at https://cyfar.org/types-statistical-tests explains why I selected these specific tests: each one is matched to the type of data we have.

Each test is framed in terms of two competing hypotheses:

    - Null Hypothesis: there is no effect, difference or relationship
    - Alternate Hypothesis: there is an effect, difference or relationship


-------------------------------------------------------------------------------------------------------------------------------

## Business Objective
    Statistical analysis of our data lets us draw quantitative inferences and conclusions, and gain insight from them.
    
-------------------------------------------------------------------------------------------------------------------------------

In [1]:
#importing libraries
import os
import pandas as pd
import numpy as np
import seaborn as sns
import xlrd
import scipy
import matplotlib.pyplot as plt
import time
import statistics
from scipy.stats import wilcoxon
from scipy import stats
import statsmodels.api as sn
from statsmodels.formula.api import ols

In [2]:
# Change the working directory to the folder that holds the Excel sheet
os.chdir('C:/Users/Viswa/Desktop/R Project')

# Name of the Excel sheet
excel_file = 'Project_Part_2.xlsx'

# Read the Excel sheet into the DataFrame comp
comp = pd.read_excel(excel_file)

## T-Statistic (from the Paired T-Test)
    - The Paired T-Test checks whether the mean difference between two related (paired) variables is zero.

### About T:
    - In statistics, the 'T'-statistic is the ratio of the departure of the estimated value of a parameter from its hypothesized value to its standard error.
    
### Hypotheses:
   
   #### Default Alternate Hypothesis:
       The alternative hypothesis assumes that the true mean (μ) of the paired differences is not zero.
   #### Default Null Hypothesis:
       The null hypothesis assumes that the true mean (μ) of the paired differences is equal to zero.
    
### Values of 'T' and their meanings:
The larger the absolute value of T, the stronger the evidence that the means of the two paired variables differ. A T close to zero indicates that the difference between the means is not significant. The sign of T only shows which of the two variables has the larger mean.
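
As a rough sketch of how this statistic comes about, the cell below computes T by hand and compares it with `stats.ttest_rel`. The arrays `before` and `after` are hypothetical values made up for illustration; they are not taken from the Excel sheet.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (illustration only, not the notebook's data).
before = np.array([12.1, 9.8, 11.4, 10.2, 13.5, 9.1, 10.9, 12.7])
after = np.array([11.3, 9.9, 10.1, 9.4, 12.8, 8.7, 10.2, 11.9])

# The paired T-test works on the row-by-row differences.
diff = before - after
n = diff.size

# T = mean difference / standard error of the mean difference.
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

# Two-tailed p-value from Student's t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's paired test should agree with the manual computation.
t_scipy, p_scipy = stats.ttest_rel(before, after)

print(f"manual: T = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy : T = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

Swapping the two arguments flips the sign of T but leaves the two-tailed p-value unchanged.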

In [3]:
stats.ttest_rel(comp['profits_billion'],comp['sales_billion'])

Ttest_relResult(statistic=-5.265275529919137, pvalue=5.426975561487554e-07)

In [4]:
stats.ttest_rel(comp['assets_billion'],comp['sales_billion'])

Ttest_relResult(statistic=3.137694597667933, pvalue=0.0020942019593703965)

In [5]:
stats.ttest_rel(comp['market_value_billion'],comp['sales_billion'])

Ttest_relResult(statistic=-0.6781622755623272, pvalue=0.4988382091859547)

In [6]:
stats.ttest_rel(comp['profits_billion'],comp['assets_billion'])

Ttest_relResult(statistic=-3.632425424497226, pvalue=0.0003987318757533985)

In [7]:
stats.ttest_rel(comp['market_value_billion'],comp['assets_billion'])

Ttest_relResult(statistic=-3.3111473046824846, pvalue=0.0011942338606216614)

In [8]:
stats.ttest_rel(comp['market_value_billion'],comp['profits_billion'])

Ttest_relResult(statistic=6.213676558129145, pvalue=6.094543457412977e-09)

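According to the data that we have:

Degrees of Freedom - 134 (a paired test on n rows has n - 1 degrees of freedom; the one-predictor regression tables later in this notebook report 133 residual degrees of freedom, which implies n = 135)

Significance Level - 5%

Looking these values up in a T-distribution table gives a critical value of about 1.98 for the two-tailed T-test.

Thus the critical values of T are -1.98 and +1.98, and the null hypothesis is rejected when T falls outside this range.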
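
Instead of a printed table, the critical value can also be looked up with SciPy. This is only a sketch: it assumes 135 paired rows, hence `df = 134`, and the helper `is_significant` is a hypothetical name introduced here. Using 133 degrees of freedom gives practically the same threshold.

```python
from scipy import stats

alpha = 0.05   # significance level used in this notebook
df = 134       # assumption: 135 paired rows, so n - 1 = 134

# Two-tailed test: alpha / 2 of the probability sits in each tail.
t_crit = stats.t.ppf(1 - alpha / 2, df)
print(f"reject the null hypothesis when |T| > {t_crit:.3f}")


def is_significant(t_stat, t_crit=t_crit):
    """Return True when T falls outside [-t_crit, +t_crit]."""
    return abs(t_stat) > t_crit


# Two of the statistics reported by the cells above.
print("market value vs sales  :", is_significant(-0.6781622755623272))
print("market value vs profits:", is_significant(6.213676558129145))
```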

### Conclusion for T-Statistic
    
    For the comparison between market value and sales (T = -0.68, p = 0.50), |T| is below the critical value, so we fail to reject the default null hypothesis. In other words, there is no significant evidence that the difference between the means is non-zero.
    
    For all other comparisons, including market value and profits (T = 6.21), |T| exceeds the critical value, so the null hypothesis is rejected and the difference between the means is significant.

## 2-Tailed P-Statistic (from Pearson's Correlation)
    - Tests for the strength of the association between two continuous variables

### About p:
    - 'p' is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. It is compared against a chosen level of significance.
    
### Hypotheses:
   
   #### Alternate Hypothesis: 
       The compared variables have a significant linear relationship.
   #### Null Hypothesis:
       The compared variables have no significant linear relationship.
    
### Values of 'p' and their meanings:
The most commonly used significance threshold is 0.05. If the p-value is less than 0.05, there is strong evidence against the null hypothesis. As a result, we would reject the null hypothesis in favor of the alternative hypothesis.
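
To make the link between the correlation coefficient and its p-value concrete, here is a small sketch. The arrays `x` and `y` are randomly generated, hypothetical data. The manual p-value uses the standard result that r * sqrt((n - 2) / (1 - r^2)) follows a t distribution with n - 2 degrees of freedom under the null hypothesis.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y depends on x plus noise (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

# Pearson correlation coefficient and its two-tailed p-value.
r, p = stats.pearsonr(x, y)

# The same p-value derived by hand from r.
n = x.size
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

alpha = 0.05
print(f"r = {r:.3f}, p = {p:.3g}, manual p = {p_manual:.3g}")
print("reject H0" if p < alpha else "fail to reject H0")
```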

In [9]:
r,p=scipy.stats.pearsonr(comp['profits_billion'],comp['sales_billion'])
print('p = '+str(p) +' For profits and sales.')

p = 5.0039335971871726e-14 For profits and sales.


In [10]:
r,p=scipy.stats.pearsonr(comp['assets_billion'],comp['sales_billion'])
print('p = '+str(p) +' For assets and sales.')

p = 7.969918017223284e-07 For assets and sales.


In [11]:
r,p=scipy.stats.pearsonr(comp['market_value_billion'],comp['sales_billion'])
print('p = '+str(p) +' For market_value and sales.')

p = 2.2940713321851338e-23 For market_value and sales.


In [12]:
r,p=scipy.stats.pearsonr(comp['profits_billion'],comp['assets_billion'])
print('p = '+str(p) +' For profits and assets.')

p = 7.022370745232362e-67 For profits and assets.


In [13]:
r,p=scipy.stats.pearsonr(comp['market_value_billion'],comp['assets_billion'])
print('p = '+str(p) +' For market value and assets.')

p = 1.795677382452173e-31 For market value and assets.


In [14]:
r,p=scipy.stats.pearsonr(comp['market_value_billion'],comp['profits_billion'])
print('p = '+str(p) +' For market value and profits.')

p = 4.1385597517240784e-57 For market value and profits.


### Conclusion for P-Statistic

Since all p-values are below 0.05, the Null Hypothesis is rejected for every pair.

Thus all variables have a significant linear relationship with one another.

## F-Statistic (from ANOVA)
    - ANOVA gives us the F-statistic, along with the sums of squares, the degrees of freedom and a p-value.

### About F
     The F-statistic is simply a ratio of two variances: here, the variance explained by the model divided by the residual (unexplained) variance. Variances are a measure of dispersion, or how far the data are scattered from the mean. Larger values represent greater dispersion.

In [15]:
mod=ols('sales_billion ~ profits_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                        sum_sq     df          F        PR(>F)
profits_billion  106782.547336    1.0  71.062504  5.003934e-14
Residual         199853.340812  133.0        NaN           NaN
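
The F value in this table can be reproduced from the two `sum_sq` entries and their degrees of freedom. The sketch below copies the numbers from the output above and uses `stats.f.sf` for the upper-tail probability; the variable names are mine.

```python
from scipy import stats

# Values copied from the table for 'sales_billion ~ profits_billion'.
ss_model, df_model = 106782.547336, 1
ss_resid, df_resid = 199853.340812, 133

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid

# F is the ratio of explained to unexplained variance.
f_stat = ms_model / ms_resid

# PR(>F) is the upper-tail probability of the F distribution.
p_value = stats.f.sf(f_stat, df_model, df_resid)

print(f"F = {f_stat:.6f}, PR(>F) = {p_value:.6e}")
```

Note that PR(>F) here equals the Pearson p-value printed earlier for profits and sales, because a one-predictor regression and a correlation test ask the same question.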


In [16]:
mod=ols('sales_billion ~ assets_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                       sum_sq     df          F        PR(>F)
assets_billion   51502.797609    1.0  26.848231  7.969918e-07
Residual        255133.090539  133.0        NaN           NaN


In [17]:
mod=ols('sales_billion ~ market_value_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                             sum_sq     df           F        PR(>F)
market_value_billion  161562.732476    1.0  148.117295  2.294071e-23
Residual              145073.155672  133.0         NaN           NaN


In [18]:
mod=ols('profits_billion ~ market_value_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                           sum_sq     df           F        PR(>F)
market_value_billion  3152.533056    1.0  768.106273  4.138560e-57
Residual               545.870944  133.0         NaN           NaN


In [19]:
mod=ols('profits_billion ~ assets_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                     sum_sq     df            F        PR(>F)
assets_billion  3309.069615    1.0  1130.406858  7.022371e-67
Residual         389.334385  133.0          NaN           NaN


In [20]:
mod=ols('assets_billion ~ market_value_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                            sum_sq     df           F        PR(>F)
market_value_billion  1.418042e+07    1.0  238.670441  1.795677e-31
Residual              7.902092e+06  133.0         NaN           NaN


### Conclusion for F statistic
    
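   #### Strength of Fit (Based on F)
   
    A large F means that the variance explained by the model is large relative to the residual variance; it does not mean that the data are more dispersed. The model of profits on assets has the largest F (about 1130), so it is the strongest linear relationship of the six, whereas the model of sales on assets has the smallest F (about 27) and is the weakest, although it is still significant since PR(>F) is far below 0.05.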
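
As a cross-check on this reading of F, the share of variance explained (R^2) can be computed from the sums of squares printed in the six tables above. This is a sketch: the tuples are copied by hand from those outputs, and the short model labels are mine.

```python
# (model, reported F, sum_sq of the predictor, sum_sq of the residual),
# copied from the six ANOVA tables printed above.
tables = [
    ("sales ~ profits", 71.062504, 106782.547336, 199853.340812),
    ("sales ~ assets", 26.848231, 51502.797609, 255133.090539),
    ("sales ~ market_value", 148.117295, 161562.732476, 145073.155672),
    ("profits ~ market_value", 768.106273, 3152.533056, 545.870944),
    ("profits ~ assets", 1130.406858, 3309.069615, 389.334385),
    ("assets ~ market_value", 238.670441, 1.418042e07, 7.902092e06),
]

# With a single predictor, R^2 = SS_model / (SS_model + SS_residual).
ranked = sorted(
    (
        (name, f_stat, ss_model / (ss_model + ss_resid))
        for name, f_stat, ss_model, ss_resid in tables
    ),
    key=lambda row: row[1],
    reverse=True,
)

# Print from the strongest to the weakest linear fit.
for name, f_stat, r_squared in ranked:
    print(f"{name:<24} F = {f_stat:9.2f}   R^2 = {r_squared:.3f}")
```

Because every model has the same degrees of freedom, ranking by F and ranking by R^2 give the same order.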

## Wilcoxon signed-rank test
    - Tests for the difference between two related variables, taking into account the magnitude and direction of the differences. When a single sample is passed, as in the cells below, its values are treated as the differences and tested against zero.

### About Wilcoxon statistic:
    - If the alternative is "two-sided", the statistic is the sum of the ranks of the differences above or below zero, whichever is smaller. Otherwise it is the sum of the ranks of the differences above zero.
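
The definition above can be sketched by hand and checked against `stats.wilcoxon`. The array `d` holds hypothetical differences, chosen without zeros or tied magnitudes so that the ranking is unambiguous.

```python
import numpy as np
from scipy import stats

# Hypothetical differences (illustration only): no zeros, no tied magnitudes.
d = np.array([1.2, -0.4, 2.3, 0.7, -1.5, 3.1, 0.9, -0.2, 1.8, 2.6])

# Rank the absolute differences, then split the ranks by the sign of d.
ranks = stats.rankdata(np.abs(d))
w_plus = ranks[d > 0].sum()    # sum of ranks of differences above zero
w_minus = ranks[d < 0].sum()   # sum of ranks of differences below zero

# Two-sided statistic: the smaller of the two rank sums.
w_manual = min(w_plus, w_minus)
w_scipy, p_scipy = stats.wilcoxon(d)
print(f"W+ = {w_plus}, W- = {w_minus}, manual W = {w_manual}, scipy W = {w_scipy}")

# When every value is positive there are no negative ranks, so the statistic is 0.
w_all_positive, _ = stats.wilcoxon(np.abs(d))
print(f"statistic for an all-positive sample: {w_all_positive}")
```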
    
### Hypotheses:
   
   #### Alternate Hypothesis: 
       The default alternative is two-sided: the distribution of the differences is not symmetric about zero.
       For two related paired samples, this means that they do not come from the same distribution.
   #### Null Hypothesis:
       The distribution of the differences is symmetric about zero.
       For two related paired samples, this means that they come from the same distribution.
    
### Values of 'p' and their meanings:
The most commonly used significance threshold is 0.05. If the p-value is less than 0.05, there is strong evidence against the null hypothesis. As a result, we would reject the null hypothesis in favor of the alternative hypothesis.

In [21]:
w, p = wilcoxon(comp['market_value_billion'])
w,p

(0.0, 6.703835947022963e-24)

Null Hypothesis rejected (p < 0.05). The statistic is 0, so all values lie on one side of zero: the distribution is not two-sided.

In [22]:
w, p = wilcoxon(comp['profits_billion'])
w,p

(438.0, 4.2413058183328205e-19)

Null Hypothesis rejected (p < 0.05). The statistic is non-zero, so there are values both above and below zero: the distribution is two-sided.

In [23]:
w, p = wilcoxon(comp['assets_billion'])
w,p

(0.0, 6.709229901634943e-24)

Null Hypothesis rejected (p < 0.05). The statistic is 0, so all values lie on one side of zero: the distribution is not two-sided.

In [24]:
w, p = wilcoxon(comp['sales_billion'])
w,p

(0.0, 6.705702613281484e-24)

Null Hypothesis rejected (p < 0.05). The statistic is 0, so all values lie on one side of zero: the distribution is not two-sided.

### Conclusion for Wilcoxon signed-rank test
 

For profits, we see that the distribution is two-sided, i.e., losses are also present, whereas the other variables have no values on the negative side of zero.
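
The cells above pass a single column, which tests that column against zero. To compare two related columns instead, both can be passed to `wilcoxon`. The sketch below uses randomly generated, hypothetical arrays `x` and `y` as stand-ins for two columns, and shows that the paired call matches a call on the differences.

```python
import numpy as np
from scipy import stats

# Hypothetical paired columns (illustration only, not the notebook's data).
rng = np.random.default_rng(42)
x = rng.normal(loc=10.0, scale=2.0, size=30)
y = x - rng.normal(loc=0.8, scale=1.0, size=30)

# Passing two samples tests the paired differences x - y.
w_pair, p_pair = stats.wilcoxon(x, y)

# Passing the differences directly gives the same test.
w_diff, p_diff = stats.wilcoxon(x - y)

print(f"paired call : W = {w_pair}, p = {p_pair:.4g}")
print(f"differences : W = {w_diff}, p = {p_diff:.4g}")
```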

# Overall Conclusion

    We have thus run our data through a number of statistical tests, through which we were able to gain knowledge and insight about it.