# Application of ANOVA
ANOVA stands for analysis of variance, a statistical method in which the variation in a set of observations is divided into distinct components.

## Business Objective

-------------------------------------------------------------------------------------------------------------------------------

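The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. In this notebook the same decomposition of variance is applied to simple linear regressions between company financials (sales, profits, assets and market value), to see how much of the variation in one measure is explained by another.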
 
-------------------------------------------------------------------------------------------------------------------------------


## Initialization

In [1]:
#importing libraries
import os
import pandas as pd
import numpy as np
import seaborn as sns
import xlrd
import matplotlib.pyplot as plt
import time
import statistics
import scipy
from scipy import stats
import statsmodels.api as sn
from statsmodels.formula.api import ols

In [2]:
# Change the working directory to the folder that contains the Excel workbook
os.chdir('C:/Users/Viswa/Desktop/R Project')

# Name of the Excel workbook
excel_file = 'Project_Part_2.xlsx'

# Read the workbook into the DataFrame `comp`
comp = pd.read_excel(excel_file)
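Every model in this notebook refers to columns of `comp` by name, so a missing or misspelled column only surfaces when `ols` runs. A minimal sketch of an up-front check follows. The helper name `missing_columns` is hypothetical, and the required names are simply the four columns that appear in the model formulas.

```python
# Columns referenced by the model formulas in this notebook.
REQUIRED_COLUMNS = [
    "sales_billion",
    "profits_billion",
    "assets_billion",
    "market_value_billion",
]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    present = set(available)
    return [name for name in required if name not in present]


# Hypothetical usage with the DataFrame loaded above:
#     missing = missing_columns(comp.columns)
#     if missing:
#         raise KeyError(f"Missing columns in workbook: {missing}")
```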

## ANOVA Tables

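 Each ANOVA table has the following columns:
 
     sum_sq ~ Sum of Squares
         (Variation attributed to the predictor, or left unexplained in the Residual row)
 
     df     ~ Degrees of Freedom
         (Number of independent pieces of information behind each sum of squares)
    
     F      ~ F-Statistic
         (Explained mean square divided by residual mean square; larger values mean the predictor accounts for more variation relative to noise)
     
     PR(>F) ~ p-value associated with the F-Statistic
         (Probability of an F at least this large if the predictor had no effect)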
      

The F-statistic is simply a ratio of two variances. 
Variances are a measure of dispersion, or how far the data are scattered from the mean. Larger values represent greater dispersion.
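To make the "ratio of two variances" concrete, the sketch below recomputes F from the sums of squares and degrees of freedom of the `sales_billion ~ profits_billion` table printed in this notebook. The function name `f_statistic` is introduced here for illustration only.

```python
def f_statistic(ss_model, df_model, ss_resid, df_resid):
    """F = (explained mean square) / (residual mean square)."""
    ms_model = ss_model / df_model   # variance attributed to the predictor
    ms_resid = ss_resid / df_resid   # variance left unexplained
    return ms_model / ms_resid


# Values copied from the sales_billion ~ profits_billion ANOVA table.
f_value = f_statistic(106782.547336, 1.0, 199853.340812, 133.0)
print(round(f_value, 4))  # should be close to the reported F of 71.062504
```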

In [3]:
mod=ols('sales_billion ~ profits_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                        sum_sq     df          F        PR(>F)
profits_billion  106782.547336    1.0  71.062504  5.003934e-14
Residual         199853.340812  133.0        NaN           NaN


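The residual sum of squares (199,853) is almost twice the sum of squares explained by profits (106,783), so the model accounts for only about 35% of the variation in sales. The relationship is still statistically significant (p ≈ 5.0e-14).

The lower the residual sum of squares relative to the total, the better the fit.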
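A residual sum of squares is easier to judge as a share of the total than as a raw number. The sketch below computes the explained share for the `sales_billion ~ profits_billion` table; for a model with a single predictor this equals R-squared. The function name `explained_share` is an assumption made for this example.

```python
def explained_share(ss_model, ss_resid):
    """Fraction of total variation attributed to the predictor.

    For a one-predictor model the two sums of squares add up to the
    total sum of squares, so this ratio is the model's R-squared.
    """
    return ss_model / (ss_model + ss_resid)


# Values copied from the sales_billion ~ profits_billion ANOVA table.
share = explained_share(106782.547336, 199853.340812)
print(f"{share:.1%}")  # roughly 35% of the variation in sales is explained
```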

In [4]:
mod=ols('sales_billion ~ assets_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                       sum_sq     df          F        PR(>F)
assets_billion   51502.797609    1.0  26.848231  7.969918e-07
Residual        255133.090539  133.0        NaN           NaN


The residual sum of squares (255,133) is about five times the sum of squares explained by assets (51,503), so the model accounts for only about 17% of the variation in sales.

The lower the residual sum of squares relative to the total, the better the fit.
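The `PR(>F)` column can be cross-checked from F and the two degrees of freedom. The sketch below uses the upper tail of the F distribution from `scipy.stats` with the values of the `sales_billion ~ assets_billion` table. The wrapper name `p_value_from_f` is hypothetical.

```python
from scipy import stats


def p_value_from_f(f_value, df_model, df_resid):
    """Upper-tail probability of the F distribution: P(F' > f_value)."""
    return stats.f.sf(f_value, df_model, df_resid)


# Values copied from the sales_billion ~ assets_billion ANOVA table.
p = p_value_from_f(26.848231, 1, 133)
print(f"{p:.3e}")  # should be close to the reported 7.969918e-07
```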

In [5]:
mod=ols('sales_billion ~ market_value_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                             sum_sq     df           F        PR(>F)
market_value_billion  161562.732476    1.0  148.117295  2.294071e-23
Residual              145073.155672  133.0         NaN           NaN


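The residual sum of squares (145,073) is slightly smaller than the sum of squares explained by market value (161,563), so the model accounts for about 53% of the variation in sales.

The lower the residual sum of squares relative to the total, the better the fit.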

In [6]:
mod=ols('profits_billion ~ market_value_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                           sum_sq     df           F        PR(>F)
market_value_billion  3152.533056    1.0  768.106273  4.138560e-57
Residual               545.870944  133.0         NaN           NaN


The residual sum of squares (546) is much lower than the sum of squares explained by market value (3,153), so the model accounts for about 85% of the variation in profits.

The lower the residual sum of squares relative to the total, the better the fit.

In [7]:
mod=ols('profits_billion ~ assets_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                     sum_sq     df            F        PR(>F)
assets_billion  3309.069615    1.0  1130.406858  7.022371e-67
Residual         389.334385  133.0          NaN           NaN


The residual sum of squares (389) is much lower than the sum of squares explained by assets (3,309), so the model accounts for about 89% of the variation in profits.

The lower the residual sum of squares relative to the total, the better the fit.

In [8]:
mod=ols('assets_billion ~ market_value_billion',data=comp).fit()
anova_table=sn.stats.anova_lm(mod,typ=2)
print(anova_table)

                            sum_sq     df           F        PR(>F)
market_value_billion  1.418042e+07    1.0  238.670441  1.795677e-31
Residual              7.902092e+06  133.0         NaN           NaN


The residual sum of squares (7.90e+06) is a little over half the sum of squares explained by market value (1.42e+07), so the model accounts for about 64% of the variation in assets.

The lower the residual sum of squares relative to the total, the better the fit.

## Conclusion
    
#### Strength of Association (Based on F)

The relationship between profits and assets is the strongest of the six (F ≈ 1130), whereas the relationship between sales and assets is the weakest (F ≈ 27). All six relationships are statistically significant.

#### Predictability of Variation (Based on Residual)

Assets explain about 89% of the variation in profits, leaving little residual variation, whereas assets explain only about 17% of the variation in sales, leaving most of it unexplained.
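The comparison across models can be laid out side by side. The sketch below ranks the six models by F and shows the explained share of each, using only the sums of squares and F values printed in the ANOVA tables of this notebook. The variable names are chosen for this example.

```python
# (response, predictor, SS explained, SS residual, F) as printed in the tables.
reported = [
    ("sales_billion", "profits_billion", 106782.547336, 199853.340812, 71.062504),
    ("sales_billion", "assets_billion", 51502.797609, 255133.090539, 26.848231),
    ("sales_billion", "market_value_billion", 161562.732476, 145073.155672, 148.117295),
    ("profits_billion", "market_value_billion", 3152.533056, 545.870944, 768.106273),
    ("profits_billion", "assets_billion", 3309.069615, 389.334385, 1130.406858),
    ("assets_billion", "market_value_billion", 1.418042e07, 7.902092e06, 238.670441),
]

# Build (formula, F, explained share) rows and sort by F, largest first.
rows = [
    (f"{y} ~ {x}", f_val, ss_m / (ss_m + ss_r))
    for y, x, ss_m, ss_r, f_val in reported
]
ranked = sorted(rows, key=lambda row: row[1], reverse=True)

for formula, f_val, share in ranked:
    print(f"{formula:<40} F={f_val:10.2f}  explained={share:6.1%}")
```

Because every model here has one predictor and the same residual degrees of freedom, ranking by F and ranking by explained share give the same order.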