# Team sports analysis assignment

## Description
In this assignment you must read in a file of metropolitan regions and associated sports teams from [assets/wikipedia_data.html](assets/wikipedia_data.html) and answer some questions about each metropolitan region. Each of these regions may have one or more teams from the "Big 4": NFL (football, in [assets/nfl.csv](assets/nfl.csv)), MLB (baseball, in [assets/mlb.csv](assets/mlb.csv)), NBA (basketball, in [assets/nba.csv](assets/nba.csv)), or NHL (hockey, in [assets/nhl.csv](assets/nhl.csv)). Please keep in mind that all questions are from the perspective of the metropolitan region, and that this file is the "source of authority" for the location of a given sports team. Thus teams which are commonly known by a different area (e.g. "Oakland Raiders") need to be mapped into the metropolitan region given (e.g. San Francisco Bay Area). This will require some human data understanding outside of the data you've been given (e.g. you will have to hand-code some names, and might need to google to find out where teams are)!
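
As a minimal sketch of the hand-coding step described above, a dictionary can map team names to metropolitan regions, and any team the dictionary misses shows up as `NaN`. The three team names and the variable names (`team_to_region`, `unmapped`) are chosen for illustration only and are not part of the assignment files.

```python
import pandas as pd

# Hypothetical mini-example: a hand-coded mapping that is deliberately incomplete.
team_to_region = {
    "Oakland Raiders": "San Francisco Bay Area",
    "San Francisco 49ers": "San Francisco Bay Area",
}

teams = pd.DataFrame(
    {"team": ["Oakland Raiders", "San Francisco 49ers", "Green Bay Packers"]}
)

# Series.map returns NaN for any team that is missing from the dictionary.
teams["Metropolitan area"] = teams["team"].map(team_to_region)

# List the teams that still need a hand-coded region.
unmapped = teams.loc[teams["Metropolitan area"].isna(), "team"].tolist()
print(unmapped)
```

Checking for unmapped teams before merging makes it easier to spot a misspelled or missing dictionary key, since such rows would otherwise silently drop out of a later merge.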

For each sport I would like you to answer the question: **what is the win/loss ratio's correlation with the population of the city it is in?** Win/Loss ratio refers to the number of wins over the number of wins plus the number of losses. Remember that to calculate the correlation with [`pearsonr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html), you are going to send in two ordered lists of values: the populations from the wikipedia_data.html file and the win/loss ratios for a given sport, in the same order. Average the win/loss ratios for those cities which have multiple teams of a single sport. Each sport is worth an equal amount in this assignment (20%\*4=80% of the grade for this assignment). You should only use data **from year 2018** for your analysis -- this is important!
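
The calculation described above can be sketched on a toy dataset. All numbers, region labels, and variable names below are made up for illustration; the real values come from the assignment files.

```python
import pandas as pd
from scipy import stats

# Made-up records: region "A" has two teams in the same sport.
teams = pd.DataFrame({
    "region": ["A", "A", "B", "C", "D"],
    "W": [50, 30, 45, 20, 60],
    "L": [32, 52, 37, 62, 22],
})
# Made-up populations, indexed by region.
population = pd.Series(
    {"A": 20_000_000, "B": 5_000_000, "C": 2_000_000, "D": 9_000_000}
)

# Win/loss ratio = wins / (wins + losses), computed per team.
teams["ratio"] = teams["W"] / (teams["W"] + teams["L"])

# Average the ratios of teams that share a region.
ratio_by_region = teams.groupby("region")["ratio"].mean()

# Put both inputs in the same region order before calling pearsonr.
regions = ratio_by_region.index
r, p = stats.pearsonr(population.loc[regions].astype(float), ratio_by_region)
print(f"r = {r:.3f}, p = {p:.3f}")
```

Indexing `population` by the regions of `ratio_by_region` is one way to guarantee that the two lists passed to `pearsonr` are in the same order.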

**Note: This assignment comprises five questions. The first four questions create dataframes for each of the four sports, containing the win/loss ratios of all teams. Question five tests a hypothesis related to the performance of different sports teams in the same areas.**

## Question 1
For this question, calculate the win/loss ratio's correlation with the population of the city it is in for the **NHL** using **2018** data.

### My answer:

Below is code which answers Question 1. Data was imported from both the Wikipedia HTML file and the existing CSV files. The data was cleaned, for example by removing unnecessary columns and normalizing team names. Data was compiled into data frames, so that calculations (such as win/loss ratio) could be carried out. Correlation was calculated using Pearson's r.

In [40]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import re
import warnings

# Dictionary mapping Teams to Cities
teams_cities_nhl= {
    'New York Rangers': 'New York City',
    'New York Islanders': 'New York City',
    'New Jersey Devils': 'New York City',
    'Los Angeles Kings': 'Los Angeles',
    'Anaheim Ducks': 'Los Angeles',
    'San Jose Sharks': 'San Francisco Bay Area',
    'Chicago Blackhawks': 'Chicago',
    'Dallas Stars': 'Dallas–Fort Worth',
    'Washington Capitals': 'Washington, D.C.',
    'Philadelphia Flyers': 'Philadelphia',
    'Boston Bruins': 'Boston',
    'Minnesota Wild': 'Minneapolis–Saint Paul',
    'Colorado Avalanche': 'Denver',
    'Florida Panthers': 'Miami–Fort Lauderdale',
    'Arizona Coyotes': 'Phoenix',
    'Detroit Red Wings': 'Detroit',
    'Toronto Maple Leafs': 'Toronto',
    'Tampa Bay Lightning': 'Tampa Bay Area',
    'Pittsburgh Penguins': 'Pittsburgh',
    'Seattle Kraken': 'Seattle',
    'St. Louis Blues': 'St. Louis',
    'Buffalo Sabres': 'Buffalo',
    'Montreal Canadiens': 'Montreal',
    'Vancouver Canucks': 'Vancouver',
    'Columbus Blue Jackets': 'Columbus',
    'Calgary Flames': 'Calgary',
    'Ottawa Senators': 'Ottawa',
    'Edmonton Oilers': 'Edmonton',
    'Winnipeg Jets': 'Winnipeg',
    'Vegas Golden Knights': 'Las Vegas',
    'Carolina Hurricanes': 'Raleigh',
    'Nashville Predators': 'Nashville'
}

# Function to load cities data from HTML file
def cities():
    cities_var = pd.read_html("assets/wikipedia_data.html")[1]
    cities_var = cities_var.iloc[:-1, [0, 3, 5, 6, 7, 8]]
    return cities_var

# Function to extract specific columns from cities data
def cities_col(col):
    citiesmlb = cities()[['Metropolitan area', 'Population (2016 est.)[8]', col]]
    return citiesmlb

# Function to load data from a CSV file for the year 2018
def data_2018(path):
    warnings.filterwarnings("ignore", category=UserWarning)
    df = pd.read_csv(path)
    
    # Check for variations in the column name for Win/Loss ratio
    ratio1 = any(df.columns.isin(['W/L%']))
    ratio2 = any(df.columns.isin(['W-L%']))
    
    # Define columns to be extracted based on the available Win/Loss ratio column
    data_columns = ['year', 'team', 'W', 'L']
    if ratio1:
        data_columns.extend(['W/L%'])
    elif ratio2:
        data_columns.extend(['W-L%'])
    
    # Filter data for the year 2018
    df_2018 = df[df['year'] == 2018].loc[:, data_columns].dropna()
    
    # Data cleaning and formatting
    df_2018['year'] = df_2018['year'].astype(int)
    df_2018['team'] = (df_2018['team'].apply(lambda x: re.sub(r'\*', '', x))
                                        .apply(lambda x: re.sub(r"\([0-9]+\)$", '', x))
                                        .apply(lambda x: re.sub(r"\xa0$", '', x))
                                        .apply(lambda x: re.sub(r'\+$', '', x)))
    
    # Remove rows containing division information
    df_2018 = df_2018[~df_2018['team'].str.contains(r'[A-Za-z]+\sDivision', regex=True)].reset_index(drop=True)
    
    # Clean team names
    df_2018['team'] = [name.replace('\xa0', '') for name in df_2018['team'].values]
    
    warnings.resetwarnings()
    return df_2018

# Function to load NHL data for the year 2018
def nhl_2018():
    warnings.filterwarnings("ignore", category=UserWarning)
    nhl_2018 = data_2018("assets/nhl.csv")
    warnings.resetwarnings()
    return nhl_2018

# Function to extract specific columns from cities data related to NHL
def cities_nhl():
    warnings.filterwarnings("ignore", category=UserWarning)
    citiesNHL = cities()[['Metropolitan area', 'Population (2016 est.)[8]', 'NHL']]
    # Filter rows with valid NHL data
    citiesNHL = (citiesNHL.where((citiesNHL['NHL'] != '—') &
                                 (~citiesNHL['NHL'].str.contains(r'^\[.*\]$', regex=True)))
                      .dropna()
                      .reset_index(drop=True))
    warnings.resetwarnings()
    return citiesNHL

# Function to calculate mean Win/Loss for a group
def calc_mean_WL(group):
    group['W'] = group['W'].astype(int)
    group['L'] = group['L'].astype(int)
    w_avg = np.nanmean(group["W"])
    l_avg = np.nanmean(group["L"])
    # Update 'W' and 'L' columns with mean values
    group['W'] = np.abs(w_avg)
    group['L'] = np.abs(l_avg)
    return group

# Function to merge data from NHL and cities datasets
def merged(path, col, dictionary):
    sport = data_2018(path)
    cities_var = cities_col(col)
    
    # Map team names to Metropolitan areas
    sport['Metropolitan area'] = sport['team'].map(dictionary)
    
    # Check for variations in the column name for Win/Loss ratio
    ratio1 = any(sport.columns.isin(['W/L%']))
    ratio2 = any(sport.columns.isin(['W-L%']))
    
    # Define columns to be merged based on the available Win/Loss ratio column
    merge_columns = ['team', 'W', 'L', 'Metropolitan area']
    if ratio1:
        merge_columns.extend(['W/L%'])
    elif ratio2:
        merge_columns.extend(['W-L%'])

    # Merge datasets
    merged_df = pd.merge(cities_var[["Metropolitan area", "Population (2016 est.)[8]"]],
                         sport[merge_columns],
                         on='Metropolitan area', how='left')

    # Drop rows with NaN values in the 'team' column
    merged_df = merged_df.dropna(subset=['team']).reset_index(drop=True)

    return merged_df

# Function to calculate Win/Loss ratio for a group
def calc_ratio(group):
    # Win/Loss ratio calculation
    group['W'] = group['W'].astype(float)
    group['L'] = group['L'].astype(float)
    ratio = group['W'] / (group['W'] + group['L'])
    # Add 'Ratio' column to the group
    group['Ratio'] = np.abs(ratio)
    return group

# Function to calculate mean Win/Loss and Win/Loss ratio for each Metropolitan area
def grouped_nhl():
    df = merged('assets/nhl.csv', 'NHL', teams_cities_nhl)
    
    # Calculate mean Win/Loss for each group
    df = df.groupby('Metropolitan area', group_keys=False).apply(calc_mean_WL)
    
    # Calculate Win/Loss ratio for each group
    df = df.groupby('Metropolitan area', group_keys=False).apply(calc_ratio)

    # Group by Metropolitan area and aggregate data
    grouped_df = df.groupby('Metropolitan area', as_index=False).agg({
        'Population (2016 est.)[8]': 'first',  # Keep the first value of 'Population'
        'W': 'first',
        'L': 'first',
        'Ratio': 'first',
        'team': ' & '.join  # Concatenate the team names using ' & ' as separator
    })
    return grouped_df

# Function to calculate Pearson correlation coefficient and p-value
def nhl_correlation(): 
    df = grouped_nhl()
    
    population_by_region = df['Population (2016 est.)[8]'].values.astype(float)
    win_loss_by_region = df['Ratio'].values
    
    # Calculate Pearson correlation coefficient and p-value
    coEff, pval = stats.pearsonr(population_by_region, win_loss_by_region)
    
    print (f"Correlation coefficient: {coEff:.3f}\nP-value: {pval:.3f}")

In [41]:
nhl_correlation()

Correlation coefficient: 0.012
P-value: 0.950


A correlation coefficient measures the strength and direction of a linear relationship between two variables. In this case, the two variables are the population of a region and the win/loss ratio for the NHL in that region.

A coefficient of 0.012 is very close to 0. This indicates that there is likely no correlation between win/loss ratio and the population of a city for the NHL. 

A p-value is the probability, assuming the null hypothesis (that there is no correlation) is true, of observing a correlation at least as strong as the one measured. Conventionally, the p-value needs to be under 0.05 to reject the null hypothesis in favour of the alternative hypothesis (that there is a correlation between the variables).

In this case, the p-value of 0.950 is far above 0.05, so we fail to reject the null hypothesis: the data provide no evidence of a correlation between win/loss ratio and the population of a city for the NHL.
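
One detail of the Question 1 code may be worth checking. `calc_mean_WL` averages wins and losses per region before `calc_ratio` divides, which gives a ratio of means, whereas the assignment asks for the ratios themselves to be averaged. The two agree when every team in a region has the same W + L, but they can differ otherwise (in the NHL, overtime losses are recorded separately, so W + L can vary between teams). A small sketch with made-up numbers shows the difference:

```python
import pandas as pd

# Two hypothetical teams in one region with different numbers of decided games.
region = pd.DataFrame({"W": [50, 30], "L": [20, 45]})

# Option 1: compute each team's ratio, then average the ratios.
mean_of_ratios = (region["W"] / (region["W"] + region["L"])).mean()

# Option 2: average W and L first, then compute one ratio from the means.
ratio_of_means = region["W"].mean() / (region["W"].mean() + region["L"].mean())

print(f"mean of ratios: {mean_of_ratios:.4f}")
print(f"ratio of means: {ratio_of_means:.4f}")
```

With these toy numbers the two approaches differ in the third decimal place, so the choice is unlikely to change the conclusion here, but it could shift the reported coefficient slightly.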

## Question 2

For this question, calculate the win/loss ratio's correlation with the population of the city it is in for the **NBA** using **2018** data.

### My Answer

Below is code which answers Question 2. For this answer, some of the functions declared in the previous answer were reused, such as ones that cleaned and compiled the data. Once again, correlation was calculated using Pearson's r.

In [27]:
# Dictionary mapping Teams to Cities
teams_cities_nba = {
    'Toronto Raptors': 'Toronto',
    'Boston Celtics': 'Boston',
    'Philadelphia 76ers': 'Philadelphia',
    'Cleveland Cavaliers': 'Cleveland',
    'Indiana Pacers': 'Indianapolis',
    'Miami Heat': 'Miami–Fort Lauderdale',
    'Milwaukee Bucks': 'Milwaukee',
    'Washington Wizards': 'Washington, D.C.',
    'Detroit Pistons': 'Detroit',
    'Charlotte Hornets': 'Charlotte',
    'New York Knicks': 'New York City',
    'Brooklyn Nets': 'New York City',
    'Chicago Bulls': 'Chicago',
    'Orlando Magic': 'Orlando',
    'Atlanta Hawks': 'Atlanta',
    'Houston Rockets': 'Houston',
    'Golden State Warriors': 'San Francisco Bay Area',
    'Portland Trail Blazers': 'Portland',
    'Oklahoma City Thunder': 'Oklahoma City',
    'Utah Jazz': 'Salt Lake City',
    'New Orleans Pelicans': 'New Orleans',
    'San Antonio Spurs': 'San Antonio',
    'Minnesota Timberwolves': 'Minneapolis–Saint Paul',
    'Denver Nuggets': 'Denver',
    'Los Angeles Clippers': 'Los Angeles',
    'Los Angeles Lakers': 'Los Angeles',
    'Sacramento Kings': 'Sacramento',
    'Dallas Mavericks': 'Dallas–Fort Worth',
    'Memphis Grizzlies': 'Memphis',
    'Phoenix Suns': 'Phoenix'
}

# Function to calculate mean Win/Loss ratio for a group
def calc_mean_ratio(group):
    group['W/L%'] = group['W/L%'].astype(float)
    w_avg = np.nanmean(group['W/L%'])
    # Update 'W/L%' column with mean value
    group['W/L%'] = np.abs(w_avg)
    return group

# Function to group NBA data by Metropolitan area and calculate mean Win/Loss and Win/Loss ratio
def grouped_nba():
    df = merged('assets/nba.csv', 'NBA', teams_cities_nba)
    
    # Calculate mean Win/Loss for each group
    df = df.groupby('Metropolitan area', group_keys=False).apply(calc_mean_WL)
    
    # Calculate mean Win/Loss ratio for each group
    df = df.groupby('Metropolitan area', group_keys=False).apply(calc_mean_ratio)
    
    # Group by Metropolitan area and aggregate data
    grouped_df = df.groupby('Metropolitan area', as_index=False).agg({
        'Population (2016 est.)[8]': 'first',  # Keep the first value of 'Population'
        'W': 'first',
        'L': 'first',
        'W/L%': 'first',
        'team': ' & '.join  # Concatenate the team names using ' & ' as separator
    })
    return grouped_df

# Function to calculate Pearson correlation coefficient for NBA data
def nba_correlation():
    df = grouped_nba() 
    
    # Extract population and Win/Loss ratio data for correlation calculation
    population_by_region = df['Population (2016 est.)[8]'].values.astype(float)
    win_loss_by_region = df['W/L%'].values
    
    # Calculate Pearson correlation coefficient
    coEff, pval = stats.pearsonr(population_by_region, win_loss_by_region)
    
    print (f"Correlation coefficient: {coEff:.3f}\nP-value: {pval:.3f}")

In [28]:
nba_correlation()

Correlation coefficient: -0.176
P-value: 0.369


A correlation coefficient of -0.176 indicates at most a very weak negative correlation between population by region and win/loss ratio by region for the NBA.
However, once again the p-value (0.369) is above 0.05, so we fail to reject the null hypothesis: there is no evidence of a correlation between the variables.

## Question 3
For this question, calculate the win/loss ratio's correlation with the population of the city it is in for the **MLB** using **2018** data.

### My Answer:

In [33]:
# Dictionary mapping Teams to Cities
teams_cities_mlb = {
    'Boston Red Sox': 'Boston',
    'New York Yankees': 'New York City',
    'Tampa Bay Rays': 'Tampa Bay Area',
    'Toronto Blue Jays': 'Toronto',
    'Baltimore Orioles': 'Baltimore',
    'Cleveland Indians': 'Cleveland',
    'Minnesota Twins': 'Minneapolis–Saint Paul',
    'Detroit Tigers': 'Detroit',
    'Chicago White Sox': 'Chicago',
    'Kansas City Royals': 'Kansas City',
    'Houston Astros': 'Houston',
    'Oakland Athletics': 'San Francisco Bay Area',
    'Seattle Mariners': 'Seattle',
    'Los Angeles Angels': 'Los Angeles',
    'Texas Rangers': 'Dallas–Fort Worth',
    'Atlanta Braves': 'Atlanta',
    'Washington Nationals': 'Washington, D.C.',
    'Philadelphia Phillies': 'Philadelphia',
    'New York Mets': 'New York City',
    'Miami Marlins': 'Miami–Fort Lauderdale',
    'Milwaukee Brewers': 'Milwaukee',
    'Chicago Cubs': 'Chicago',
    'St. Louis Cardinals': 'St. Louis',
    'Pittsburgh Pirates': 'Pittsburgh',
    'Cincinnati Reds': 'Cincinnati',
    'Los Angeles Dodgers': 'Los Angeles',
    'Colorado Rockies': 'Denver',
    'Arizona Diamondbacks': 'Phoenix',
    'San Francisco Giants': 'San Francisco Bay Area',
    'San Diego Padres': 'San Diego'
}

# Function to calculate mean Win/Loss ratio for a group in MLB data
def calc_mlb_ratio(group):
    group['W-L%'] = group['W-L%'].astype(float)
    w_avg = np.nanmean(group['W-L%'])
    # Update 'W-L%' column with mean value
    group['W-L%'] = np.abs(w_avg)
    return group

# Function to group MLB data by Metropolitan area and calculate mean Win/Loss and Win/Loss ratio
def grouped_mlb():
    df = merged('assets/mlb.csv', 'MLB', teams_cities_mlb)
    
    # Calculate mean Win/Loss ratio for each group
    df = df.groupby('Metropolitan area', group_keys=False).apply(calc_mlb_ratio)
    
    # Calculate mean Win/Loss for each group
    df = df.groupby('Metropolitan area', group_keys=False).apply(calc_mean_WL)
    
    # Group by Metropolitan area and aggregate data
    grouped_df = df.groupby('Metropolitan area', as_index=False).agg({
        'Population (2016 est.)[8]': 'first',  # Keep the first value of 'Population'
        'W': 'first',
        'L': 'first',
        'W-L%': 'first',
        'team': ' & '.join  # Concatenate the team names using ' & ' as separator
    })
    return grouped_df

# Function to calculate Pearson correlation coefficient for MLB data
def mlb_correlation(): 
    df = grouped_mlb() 
    
    # Extract population and Win/Loss ratio data for correlation calculation
    population_by_region = df['Population (2016 est.)[8]'].values.astype(float)
    win_loss_by_region = df['W-L%'].values
    
    # Calculate Pearson correlation coefficient
    coEff, pval = stats.pearsonr(population_by_region, win_loss_by_region)
    
    print (f"Correlation coefficient: {coEff:.3f}\nP-value: {pval:.3f}")

In [34]:
mlb_correlation()

Correlation coefficient: 0.150
P-value: 0.464


Again, a correlation coefficient of 0.150 indicates at most a very weak positive correlation between population by region and win/loss ratio by region for MLB. However, the p-value (0.464) is above 0.05, so we fail to reject the null hypothesis: there is no evidence of a correlation.

## Question 4
For this question, calculate the win/loss ratio's correlation with the population of the city it is in for the **NFL** using **2018** data.

### My answer:

In [37]:
teams_cities_nfl = {
    'New England Patriots': 'Boston',
    'Miami Dolphins': 'Miami–Fort Lauderdale',
    'Buffalo Bills': 'Buffalo',
    'New York Jets': 'New York City',
    'Baltimore Ravens': 'Baltimore',
    'Pittsburgh Steelers': 'Pittsburgh',
    'Cleveland Browns': 'Cleveland',
    'Cincinnati Bengals': 'Cincinnati',
    'Houston Texans': 'Houston',
    'Indianapolis Colts': 'Indianapolis',
    'Tennessee Titans': 'Nashville',
    'Jacksonville Jaguars': 'Jacksonville',
    'Kansas City Chiefs': 'Kansas City',
    'Los Angeles Chargers': 'Los Angeles',
    'Denver Broncos': 'Denver',
    'Oakland Raiders': 'San Francisco Bay Area',
    'Dallas Cowboys': 'Dallas–Fort Worth',
    'Philadelphia Eagles': 'Philadelphia',
    'Washington Redskins': 'Washington, D.C.',
    'New York Giants': 'New York City',
    'Chicago Bears': 'Chicago',
    'Minnesota Vikings': 'Minneapolis–Saint Paul',
    'Green Bay Packers': 'Green Bay',
    'Detroit Lions': 'Detroit',
    'New Orleans Saints': 'New Orleans',
    'Carolina Panthers': 'Charlotte',
    'Atlanta Falcons': 'Atlanta',
    'Tampa Bay Buccaneers': 'Tampa Bay Area',
    'Los Angeles Rams': 'Los Angeles',
    'Seattle Seahawks': 'Seattle',
    'San Francisco 49ers': 'San Francisco Bay Area',
    'Arizona Cardinals': 'Phoenix'}


# Function to group NFL data by Metropolitan area and calculate mean Win/Loss and Win/Loss ratio
def nfl_grouped():
    df = merged('assets/nfl.csv', 'NFL', teams_cities_nfl)
    
    # Calculate mean Win/Loss ratio for each group
    df = df.groupby('Metropolitan area', group_keys=False).apply(calc_mlb_ratio)
    
    # Calculate mean Win/Loss for each group
    df = df.groupby('Metropolitan area', group_keys=False).apply(calc_mean_WL)
    
    # Group by Metropolitan area and aggregate data
    grouped_df = df.groupby('Metropolitan area', as_index=False).agg({
        'Population (2016 est.)[8]': 'first',  # Keep the first value of 'Population'
        'W': 'first',
        'L': 'first',
        'W-L%': 'first',
        'team': ' & '.join  # Concatenate the team names using ' & ' as separator
    })
    return grouped_df

# Function to calculate Pearson correlation coefficient for NFL data
def nfl_correlation(): 
    df = nfl_grouped()
    
    # Extract population and Win/Loss ratio data for correlation calculation
    population_by_region = df['Population (2016 est.)[8]'].values.astype(float)
    win_loss_by_region = df['W-L%'].values
    
    # Calculate Pearson correlation coefficient
    coEff, pval = stats.pearsonr(population_by_region, win_loss_by_region)
    
    print (f"Correlation coefficient: {coEff:.3f}\nP-value: {pval:.3f}")


In [39]:
nfl_correlation()

Correlation coefficient: 0.004
P-value: 0.982


A correlation coefficient of 0.004 is very close to 0, which suggests no linear correlation between population by region and win/loss ratio by region for the NFL. The p-value (0.982) is above 0.05, so we fail to reject the null hypothesis: there is no evidence of a correlation.

## Question 5
In this question I would like you to explore the hypothesis that:

 **Given that an area has two sports teams in different sports, those teams will perform the same within their respective sports**. 

How I would like to see this explored is with a series of paired t-tests (so use [`ttest_rel`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html)) between all pairs of sports. Are there any sports where we can reject the null hypothesis? Again, average values where a sport has multiple teams in one region. Remember, you will only be including, for each sport, cities which have teams engaged in that sport, drop others as appropriate. This question is worth 20% of the grade for this assignment.
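
A paired t-test needs one value per sport for each region, so only regions that have teams in both sports can be compared. As a minimal sketch with made-up ratios (the region labels and column names are illustrative only), an inner merge keeps just the shared regions before `ttest_rel` is called:

```python
import pandas as pd
from scipy.stats import ttest_rel

# Made-up win/loss ratios per region for two hypothetical sports.
sport_a = pd.DataFrame(
    {"region": ["A", "B", "C", "D"], "ratio": [0.60, 0.45, 0.52, 0.48]}
)
sport_b = pd.DataFrame(
    {"region": ["B", "C", "D", "E"], "ratio": [0.50, 0.40, 0.55, 0.61]}
)

# An inner merge drops regions that lack a team in either sport (A and E here).
paired = pd.merge(sport_a, sport_b, on="region", how="inner", suffixes=("_a", "_b"))

# Paired t-test on the two aligned columns; the null hypothesis is that the
# mean difference between the paired ratios is zero.
t_stat, p_value = ttest_rel(paired["ratio_a"], paired["ratio_b"])
print(paired)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A small p-value would be evidence against the hypothesis that the two sports' teams perform the same in shared regions; a large one means the hypothesis cannot be rejected.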

### My Answer:

The code below creates dataframes of the win/loss ratios for all four sports. It then performs a paired t-test for each pair of sports.

In [62]:
from scipy.stats import ttest_rel

# Create an empty dictionary to store paired t-test results
paired_t_test_results = {}

# Extract relevant columns from grouped datasets for different sports
nhl = grouped_nhl()[['Metropolitan area', 'team', 'Ratio']]
nhl.rename(columns={'Ratio': 'W-L%'}, inplace=True)

nba = grouped_nba()[['Metropolitan area', 'team', 'W/L%']]
nba.rename(columns={'W/L%': 'W-L%'}, inplace=True)

mlb = grouped_mlb()[['Metropolitan area', 'team', 'W-L%']]
nfl = nfl_grouped()[['Metropolitan area', 'team', 'W-L%']]


# Function to perform a paired t-test between two dataframes
def paired_t_test(df1, df2):
    t_statistic, p_value = ttest_rel(df1, df2)
    return p_value

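# Function to get the name of a dataframe from module-level (global) variables.
# Note: locals() inside this function would only contain the 'dataframe' parameter.
def get_df_name(dataframe):
    return [name for name, obj in globals().items() if obj is dataframe][0]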

# Function to compare sports team performance using paired t-tests
def sports_team_performance():
    sports_dfs = [nfl, nba, nhl, mlb]
    sports = ['NFL', 'NBA', 'NHL', 'MLB']
    p_values = pd.DataFrame({k: np.nan for k in sports}, index=sports)
    
    # Loop through pairs of sports dataframes for comparison
    for i in range(len(sports_dfs)):
        for j in range(i + 1, len(sports_dfs)):
            sport1_name = sports[i]
            sport2_name = sports[j]
            sport1_df = sports_dfs[i]
            sport2_df = sports_dfs[j]
            
            # Merge dataframes on Metropolitan area
            merged_df = pd.merge(sport1_df[['Metropolitan area', 'team', 'W-L%']],
                                 sport2_df[['Metropolitan area', 'team', 'W-L%']],
                                 on=['Metropolitan area'],
                                 how='inner')
            
            # Rename columns for clarity
            merged_df.rename(columns={'W-L%_x': sport1_name, 'W-L%_y': sport2_name}, inplace=True)
            
            # Conduct paired t-test and store p-value in the results dataframe
            p_value = paired_t_test(merged_df[sport1_name], merged_df[sport2_name])
            p_values.loc[sport1_name, sport2_name] = p_value
            p_values.loc[sport2_name, sport1_name] = p_value

    # Perform assertions for expected p-values
    assert abs(p_values.loc["NBA", "NHL"] - 0.02) <= 1e-2, "The NBA-NHL p-value should be around 0.02"
    assert abs(p_values.loc["MLB", "NFL"] - 0.80) <= 1e-2, "The MLB-NFL p-value should be around 0.80"
    
    return p_values

In [63]:
sports_team_performance()

|     | NFL | NBA | NHL | MLB |
|-----|-----|-----|-----|-----|
| NFL | NaN | 0.937509 | 0.030392 | 0.803459 |
| NBA | 0.937509 | NaN | 0.022405 | 0.949566 |
| NHL | 0.030392 | 0.022405 | NaN | 0.000703 |
| MLB | 0.803459 | 0.949566 | 0.000703 | NaN |


Here was the hypothesis: **given that an area has two sports teams in different sports, those teams will perform the same within their respective sports**

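The dataframe above shows the p-value for each pair of sports. The null hypothesis of each paired t-test is that teams from the two sports in the same area perform the same, so a p-value less than 0.05 allows us to reject the hypothesis for that pair. From the table above, the hypothesis can be rejected for the NHL-NFL pair (p ≈ 0.030), the NHL-NBA pair (p ≈ 0.022), and the MLB-NHL pair (p ≈ 0.0007). Therefore:

1. If a metropolitan area has a Hockey team and a Football team, there is evidence that those teams perform differently.
2. If a metropolitan area has a Hockey team and a Basketball team, there is evidence that those teams perform differently.
3. If a metropolitan area has a Baseball team and a Hockey team, there is evidence that those teams perform differently.

For the remaining pairs (NFL-NBA, NFL-MLB, and NBA-MLB) the p-values are well above 0.05, so we fail to reject the hypothesis that those teams perform the same.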