## Overview
This notebook is part of a data analysis mini-project that aims to answer the following question:
Is a city's population correlated with the performance of its teams in the 4 American national leagues (for the year 2018):
* NHL: Hockey league
* NBA: Basketball league (named `nbl` in the code)
* MLB: Baseball league
* NFL: Football (American football) league

This notebook puts everything together to answer our initial question using statistical tests.

## Load the data and compute the statistical tests

In [1]:
import pandas as pd
import numpy as np
from scipy import stats 
from scipy.stats import ttest_ind
from scipy.stats import ttest_rel

from cleaning_data import final_nfl_path, final_mlb_path, final_nbl_path, final_nhl_path

In [2]:
# first let's load the data frames in question: 

nbl = pd.read_csv(final_nbl_path)
nfl = pd.read_csv(final_nfl_path)
mlb = pd.read_csv(final_mlb_path)
nhl = pd.read_csv(final_nhl_path)
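# keep only the areas that have both a basketball and a hockey team
merge = pd.merge(nbl, nhl, how='inner', on='area')
print(merge)

# paired t-test on the two win/loss ratios; index 1 of the result is the p-value
ttest_rel(merge['win_loss_ratio_x'], merge['win_loss_ratio_y'])[1]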


                      area  win_loss_ratio_x       pop_x  win_loss_ratio_y  \
0                   Boston          0.670732   4794447.0          0.714286   
1                  Chicago          0.329268   9512999.0          0.458333   
2        Dallas–Fort Worth          0.292683   7233323.0          0.567568   
3                   Denver          0.560976   2853077.0          0.589041   
4                  Detroit          0.475610   4297617.0          0.434783   
5              Los Angeles          0.469512  13310447.0          0.622895   
6    Miami–Fort Lauderdale          0.536585   6066387.0          0.594595   
7   Minneapolis–Saint Paul          0.573171   3551036.0          0.633803   
8            New York City          0.347561  20153634.0          0.518201   
9             Philadelphia          0.634146   6070500.0          0.617647   
10                 Phoenix          0.256098   4661537.0          0.414286   
11  San Francisco Bay Area          0.707317   6657982.0        

0.022297049643438753
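
To make explicit what `ttest_rel` computes in the cell above, here is a minimal sketch showing that a paired t-test is equivalent to a one-sample t-test on the per-area differences. The ratios below are made-up illustration values, not the notebook's data.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Hypothetical win/loss ratios for five areas (illustration only)
ratio_a = np.array([0.67, 0.33, 0.29, 0.56, 0.48])
ratio_b = np.array([0.71, 0.46, 0.57, 0.59, 0.43])

# Paired t-test: the two samples are matched area by area
paired = ttest_rel(ratio_a, ratio_b)

# Same test written as a one-sample t-test of the differences against 0
one_sample = ttest_1samp(ratio_a - ratio_b, 0.0)

print(paired.pvalue, one_sample.pvalue)
```

Because the test works on matched differences, the inner merge on `area` matters: only areas present in both leagues contribute a pair.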

## Reaching the final step 
One of the main goals of this analysis was to test the following hypothesis:

Given that an area has teams in two different sports, those teams perform equally well within their respective sports (this is the null hypothesis). I explored this with a series of paired t-tests between all pairs of sports, averaging the values where a sport has multiple teams in one region. Are there any pairs of sports for which we can reject the null hypothesis?
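
The averaging step for areas with several teams in one sport happens in the cleaning stage, which is not shown here. A minimal sketch of how it could be done, assuming a per-team table with `area` and `win_loss_ratio` columns (the `team` column and all values are hypothetical):

```python
import pandas as pd

# Hypothetical per-team table: two teams share the "New York City" area
teams = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C"],
    "area": ["New York City", "New York City", "Boston"],
    "win_loss_ratio": [0.40, 0.60, 0.70],
})

# One row per area: average the ratio over all teams in that area
per_area = teams.groupby("area", as_index=False)["win_loss_ratio"].mean()
print(per_area)
```

With one row per area in every league, the inner merge on `area` yields exactly one pair of ratios per area, which is what a paired test expects.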

In [3]:
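# map each league label to its data frame
dic = {"NFL": nfl, "nbl": nbl, "NHL": nhl, "MLB": mlb}
p_values = []
for key1, value1 in dic.items():
    lst = []
    for key2, value2 in dic.items():
        if key1 != key2:
            # keep only the areas that have a team in both leagues
            merge = pd.merge(value1, value2, how='inner', on='area')
            # paired t-test; index 1 of the result is the p-value
            lst.append(ttest_rel(merge['win_loss_ratio_x'], merge['win_loss_ratio_y'])[1])
        else:
            # a league is not compared with itself
            lst.append(np.nan)

    p_values.append(lst)

# symmetric matrix of p-values, one row and one column per league
p_values = pd.DataFrame(p_values, columns=dic.keys(), index=dic.keys())
print(p_values)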


          NFL       nbl       NHL       MLB
NFL       NaN  0.941792  0.030883  0.802069
nbl  0.941792       NaN  0.022297  0.950540
NHL  0.030883  0.022297       NaN  0.000708
MLB  0.802069  0.950540  0.000708       NaN


## Conclusion
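The `p_values` table can finally help us answer the question. Assuming a 5% significance level (95% confidence level):
* For the same area, NFL and NHL teams do not perform the same in terms of win/loss ratio: the p-value (0.0309) is below 0.05, so the null hypothesis is rejected. The same holds for the NBA and NHL pair (0.0223) and the NHL and MLB pair (0.0007).
* For the same area, there is no evidence that NBA and MLB teams perform differently: the p-value (0.9505) is far above 0.05, so the result is not statistically significant. We fail to reject the null hypothesis, which is not the same as proving that the teams perform equally.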
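
As a quick check of this reading, the sketch below lists the pairs whose p-value falls under the threshold. The p-values are copied from the table printed earlier; the threshold `alpha = 0.05` is an assumption that matches the 95% level used in this conclusion.

```python
import numpy as np
import pandas as pd

leagues = ["NFL", "nbl", "NHL", "MLB"]
# p-values copied from the printed p_values table
p_values = pd.DataFrame(
    [[np.nan, 0.941792, 0.030883, 0.802069],
     [0.941792, np.nan, 0.022297, 0.950540],
     [0.030883, 0.022297, np.nan, 0.000708],
     [0.802069, 0.950540, 0.000708, np.nan]],
    index=leagues, columns=leagues,
)

alpha = 0.05  # assumed significance level

# Visit each unordered pair once (upper triangle of the symmetric matrix)
rejected = [
    (a, b, p_values.loc[a, b])
    for i, a in enumerate(leagues)
    for b in leagues[i + 1:]
    if p_values.loc[a, b] < alpha
]
for a, b, p in rejected:
    print(f"{a} vs {b}: p = {p:.6f} -> reject the null hypothesis")
```

Every rejected pair involves the NHL, while the three pairs among NFL, NBA and MLB all have p-values above 0.8.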