# Description
Taking a file of metropolitan regions and associated sports teams from [assets/wikipedia_data.html](assets/wikipedia_data.html), I answer some questions about each metropolitan region. Each of these regions may have one or more teams from the "Big 4": NFL (football, in [assets/nfl.csv](assets/nfl.csv)), MLB (baseball, in [assets/mlb.csv](assets/mlb.csv)), NBA (basketball, in [assets/nba.csv](assets/nba.csv)), or NHL (hockey, in [assets/nhl.csv](assets/nhl.csv)).

### For each sport I answer the question: 
<ul>
<li>How does a team's win percentage correlate with the population of the metropolitan region it is in?</li>
<li>The win/loss ratio is the number of wins divided by the number of wins plus the number of losses.</li>
<li>Where a region has multiple teams in a single sport, I average their win/loss ratios.</li>
</ul>
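
As a minimal sketch of the two definitions above, the snippet below computes wins / (wins + losses) per team and then averages the ratios within a region. The records and the names `records` and `mean_ratio` are made up for illustration and are not taken from the data files.

```python
from collections import defaultdict

# Hypothetical (region, wins, losses) records; not taken from the CSV files
records = [
    ("Region A", 50, 30),
    ("Region A", 30, 50),
    ("Region B", 60, 20),
]

# Win/loss ratio per team: wins / (wins + losses)
ratios_by_region = defaultdict(list)
for region, wins, losses in records:
    ratios_by_region[region].append(wins / (wins + losses))

# Average the ratios for regions with more than one team in the sport
mean_ratio = {region: sum(r) / len(r) for region, r in ratios_by_region.items()}
print(mean_ratio)
```

Region A has two teams, so its value is the mean of the two ratios; Region B has a single team and keeps that team's ratio.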

In [9]:
# First we import the necessary libraries
import pandas as pd # Pandas for dataframes
import numpy as np # For core usage 
import scipy.stats as stats # For Pearson R Coefficient & T Test
import re # For Regular Expressions

## First: Create Functions
I modularize the code into functions so that the repetitive downstream steps can reuse them.

In [10]:
def metro_func():
    """
    Returns: nomen (dict), cities (pd.DataFrame)
    """
    # Read datasets
    cities=pd.read_html("assets/wikipedia_data.html")[1]

    # Defining nomenclature
    nomen={"New York City":"New York","New Jersey":"New York","Tampa Bay Area":"Tampa Bay","Miami–Fort Lauderdale":"Miami","Dallas-Fort Worth":"Dallas","Texas":"Dallas","Minneapolis–Saint Paul":"Minneapolis","San Francisco Bay Area":"San Francisco","Washington, D.C.":"DC","Washington Capitals":"DC Capitals","St.  Louis":"St. Louis","Utah":"Salt Lake City"}

    # Cleaning cities dataframe
    cities.rename({'Population (2016 est.)[8]':'Population','Metropolitan area':'Metro'}, axis=1, inplace=True)
    cities = cities.replace(nomen)
    cities = cities.replace(r"(\[.+)",'',regex=True) # strip bracketed footnote markers and anything after them
    cities = cities.iloc[:-1,[0,3,5,6,7,8]]
    cities['Population'] = cities['Population'].astype('int')

    return nomen, cities
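
The footnote-stripping pattern can be tried on its own with the standard `re` module. This is only a sketch: the sample cell values below are invented to resemble table cells with trailing bracketed notes, and the real cells may look different.

```python
import re

# Invented sample cells; the real table values may differ
samples = ["Rangers[note 1]", "12345[8]", "Lakers"]

# Remove the first "[" and everything after it, mirroring the cleaning step
cleaned = [re.sub(r"\[.+", "", s) for s in samples]
print(cleaned)
```

Cells without a bracket pass through unchanged, which is why the same replacement can be applied to the whole dataframe.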

In [22]:
def league_func(): 
    """
    Description: imports the league data, cleans it with regexes, normalizes whitespace, and filters for the 2018 season; merges metros with
    leagues into individual dataframes, averaging multiple teams in one metro into a single win/loss ratio

    Returns: cities, nhl_metro_mean, nba_metro_mean, mlb_metro_mean, nfl_metro_mean (pd.DataFrame)
    """
    # Import data
    mlb_df=pd.read_csv("assets/mlb.csv")
    nhl_df=pd.read_csv("assets/nhl.csv")
    nba_df=pd.read_csv("assets/nba.csv")
    nfl_df=pd.read_csv("assets/nfl.csv")

    # Generate cities dataframe : cities
    nomen, cities = metro_func()

    # Cleaning nhl dataframe & isolating 2018 year
    nhl_df = nhl_df.replace(r"(\*$)",'',regex=True).replace(nomen)
    nhl_df = nhl_df[nhl_df['year'] == 2018]
    fltr = nhl_df['team'].str.contains('(.*)(Division$)')
    nhl_filtered = nhl_df[~fltr]
    nhl=nhl_filtered.iloc[:,0:4].astype({'GP':'int','W':'int','L':'int'})

    # Cleaning nba dataframe & isolating 2018 year
    nba_df = nba_df.replace(r"(\(.+)",'',regex=True).replace(r"(\*)",'',regex=True).replace(nomen)
    nba_df = nba_df[nba_df['year'] == 2018]
    fltr = nba_df['team'].str.contains('(.*)(Division$)')
    nba_filtered = nba_df[~fltr]
    nba=nba_filtered.astype({'W':'int','L':'int','W/L%':'float'})

    # Cleaning mlb dataframe & isolating 2018 year
    mlb_df = mlb_df.replace(r"(\(.+)",'',regex=True).replace(r"(\*)",'',regex=True).replace(nomen)
    mlb_df.rename({'W-L%':'W/L%'}, axis=1, inplace=True)
    mlb_df = mlb_df[mlb_df['year'] == 2018]
    fltr = mlb_df['team'].str.contains('(.*)(Division$)')
    mlb_filtered = mlb_df[~fltr]
    mlb=mlb_filtered.astype({'W':'int','L':'int','W/L%':'float'})

    # Cleaning nfl dataframe & isolating 2018 year
    nfl_df = nfl_df.replace(r"(\+)",'',regex=True).replace(r"(\*)",'',regex=True).replace(nomen)
    nfl_df.rename({'W-L%':'W/L%'}, axis=1, inplace=True)
    nfl_df = nfl_df[nfl_df['year'] == 2018]
    fltr = nfl_df['team'].str.contains('AFC|NFC', regex=True)
    nfl_filtered = nfl_df[~fltr]
    nfl=nfl_filtered.astype({'W/L%':'float','W':'float','L':'float'})

    # Strip Metro area from team name & set team as index

    metro_list = list(cities['Metro']) + ['Florida','New Jersey','Colorado','Dallas','Vegas','Minnesota','Anaheim','San Jose','Arizona','Carolina','Golden State','Indiana','Brooklyn','Washington','Utah','Oakland','Texas','New England','Tennessee']

    for c in metro_list:
        nhl['team'] = nhl['team'].str.replace(f'{c} ', '')
        nba['team'] = nba['team'].str.replace(f'{c} ', '')
        mlb['team'] = mlb['team'].str.replace(f'{c} ', '')
        nfl['team'] = nfl['team'].str.replace(f'{c} ', '') 

    # Normalize whitespace: split() also splits on hidden Unicode spaces (e.g. non-breaking spaces), then rejoin with single spaces
    nhl['team'] = nhl['team'].str.split().str.join(' ')
    nba['team'] = nba['team'].str.split().str.join(' ')
    mlb['team'] = mlb['team'].str.split().str.join(' ')
    nfl['team'] = nfl['team'].str.split().str.join(' ')

    # Explode the metro teams and create metro league dataframes
    def split_to_list(text):
        """take a string of team names run together and return a list, splitting where a lowercase letter is followed by an uppercase letter"""
        return re.findall(r'[\w\s]+?(?:[a-z])(?=[A-Z]|$)', text)
    for league in ['NFL','MLB','NBA','NHL']:
        cities[league]=cities[league].astype(str)
        cities[league]=cities[league].apply(split_to_list)

    nhl_metro = cities.explode('NHL').dropna()
    nhl_metro = nhl_metro[['Metro','Population','NHL']]

    nba_metro = cities.explode('NBA').dropna()
    nba_metro = nba_metro[['Metro','Population','NBA']]
    nba_metro['Population'] = nba_metro['Population'].astype(float)

    mlb_metro = cities.explode('MLB').dropna()
    mlb_metro = mlb_metro[['Metro','Population','MLB']]
    mlb_metro['Population'] = mlb_metro['Population'].astype(float)

    nfl_metro = cities.explode('NFL').dropna()
    nfl_metro = nfl_metro[['Metro','Population','NFL']]
    nfl_metro['Population'] = nfl_metro['Population'].astype(float)

    # Merging the data into league data frames
    nhl_df = pd.merge(nhl_metro,nhl,left_on='NHL',right_on='team', how='left')
    nhl_df['NHL']=nhl_df['W']/(nhl_df['W']+nhl_df['L'])
    nhl_metro_mean = nhl_df[['Metro','Population','NHL']].groupby('Metro').mean()

    nba_df = pd.merge(nba_metro,nba,left_on='NBA',right_on='team', how='left')
    nba_df['NBA']=nba_df['W']/(nba_df['W']+nba_df['L'])
    nba_metro_mean = nba_df[['Metro','Population','NBA']].groupby('Metro').mean()

    mlb_df = pd.merge(mlb_metro,mlb,left_on='MLB',right_on='team', how='left')
    mlb_df['MLB']=mlb_df['W']/(mlb_df['W']+mlb_df['L'])
    mlb_metro_mean = mlb_df[['Metro','Population','MLB']].groupby('Metro').mean() 

    nfl_df = pd.merge(nfl_metro,nfl,left_on='NFL',right_on='team', how='left')
    nfl_df['NFL']=nfl_df['W']/(nfl_df['W']+nfl_df['L'])
    nfl_metro_mean = nfl_df[['Metro','Population','NFL']].groupby('Metro').mean()

    return cities, nhl_metro_mean, nba_metro_mean, mlb_metro_mean, nfl_metro_mean
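
The name-splitting regex inside `league_func` is easier to follow in isolation. The sketch below copies the same pattern into a standalone helper; the helper name `split_fused_names` and the sample strings are illustrative assumptions, not values confirmed from the Wikipedia table.

```python
import re

def split_fused_names(text):
    """Return the names in text, ending each one at a lowercase letter
    that is followed by an uppercase letter or by the end of the string."""
    return re.findall(r"[\w\s]+?(?:[a-z])(?=[A-Z]|$)", text)

# Hypothetical cell with three names run together
print(split_fused_names("RangersIslandersDevils"))
# A two-word name stays whole because the space is not a split point
print(split_fused_names("Red Sox"))
# An empty cell yields an empty list
print(split_fused_names(""))
# A name ending in a capital letter does not match at all
print(split_fused_names("Toronto FC"))
```

One limitation of this pattern is visible in the last call: a name whose final character is not a lowercase letter is dropped rather than returned, so it would need separate handling if such names appeared in the data.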

In [35]:
def metro_statistics():
    """
    Descriptions: Consolidates leagues metro means into a single dataframe

    Returns: metro_stats
    """
    # Group and merge league win/loss ratios by metro area
    cities, nhl_metro_mean, nba_metro_mean, mlb_metro_mean, nfl_metro_mean = league_func()

    # Drop Population columns
    nhl_metro_mean = nhl_metro_mean.drop(['Population'], axis=1)
    nba_metro_mean = nba_metro_mean.drop(['Population'], axis=1)
    mlb_metro_mean = mlb_metro_mean.drop(['Population'], axis=1)
    nfl_metro_mean = nfl_metro_mean.drop(['Population'], axis=1)
    
    # Merging into a single data frame
    metro_sports = (
        cities[['Metro','Population']]
        .join(nfl_metro_mean, on='Metro')
        .join(nba_metro_mean, on='Metro')
        .join(nhl_metro_mean, on='Metro')
        .join(mlb_metro_mean, on='Metro')
        .sort_values(by='Population', ascending=False)
    )
    metro_stats = metro_sports.set_index('Metro').drop(['Population'],axis=1)

    return metro_stats

## NHL Correlation
For this question, I calculate the win percentage's correlation with the population of the city it is in for the **NHL** using **2018** data. The win percentage is calculated as wins / (wins + losses).
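
For reference, the Pearson coefficient that `stats.pearsonr` reports can be written out by hand. The sketch below uses only the standard library and invented numbers; `pearson_r` is an illustrative helper, not part of the notebook.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: summed co-deviations of xs and ys divided by
    the product of the square roots of their summed squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

# Invented populations and win ratios for four regions
population = [1.0e6, 2.0e6, 4.0e6, 8.0e6]
win_ratio = [0.40, 0.55, 0.45, 0.60]
print(round(pearson_r(population, win_ratio), 3))
```

A value near 0 means population says little about win ratio, while values near 1 or -1 indicate a strong linear relationship.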

In [23]:
def nhl_correlation(): 

    # Group and merge league win/loss ratios by metro area
    cities, nhl_metro_mean, nba_metro_mean, mlb_metro_mean, nfl_metro_mean = league_func()

    population_by_region = [] # metro area populations, in the order of cities['Metro']
    win_loss_by_region = [] # NHL win/loss ratios from nhl_metro_mean, in the same order
    for m in cities['Metro']:
        if m in nhl_metro_mean.index:
            population_by_region.append(nhl_metro_mean.loc[m,'Population'])
            win_loss_by_region.append(nhl_metro_mean.loc[m,'NHL'])
        else:
            continue

    
    return stats.pearsonr(population_by_region, win_loss_by_region)[0]

In [24]:
nhl_correlation()

0.012486162921209909

## NBA Correlation
For this question, I calculate the win percentage's correlation with the population of the city it is in for the **NBA** using **2018** data.

In [26]:
def nba_correlation():

    # Group and merge league win/loss ratios by metro area
    cities, nhl_metro_mean, nba_metro_mean, mlb_metro_mean, nfl_metro_mean = league_func()

    population_by_region = [] # metro area populations, in the order of cities['Metro']
    win_loss_by_region = [] # NBA win/loss ratios from nba_metro_mean, in the same order
    for m in cities['Metro']:
        if m in nba_metro_mean.index:
            population_by_region.append(nba_metro_mean.loc[m,'Population'])
            win_loss_by_region.append(nba_metro_mean.loc[m,'NBA'])
        else:
            continue

    return stats.pearsonr(population_by_region, win_loss_by_region)[0]

In [27]:
nba_correlation()

-0.17657160252844614

## MLB Correlation
For this question, I calculate the win percentage's correlation with the population of the city it is in for the **MLB** using **2018** data.

In [28]:
def mlb_correlation(): 
    
    # Group and merge league win/loss ratios by metro area
    cities, nhl_metro_mean, nba_metro_mean, mlb_metro_mean, nfl_metro_mean = league_func()

    population_by_region = [] # metro area populations, in the order of cities['Metro']
    win_loss_by_region = [] # MLB win/loss ratios from mlb_metro_mean, in the same order
    for m in cities['Metro']:
        if m in mlb_metro_mean.index:
            population_by_region.append(mlb_metro_mean.loc[m,'Population'])
            win_loss_by_region.append(mlb_metro_mean.loc[m,'MLB'])
        else:
            continue

    return stats.pearsonr(population_by_region, win_loss_by_region)[0]

In [29]:
mlb_correlation()

0.15027698302669307

## NFL Correlation
For this question, I calculate the win percentage's correlation with the population of the city it is in for the **NFL** using **2018** data.

In [30]:
def nfl_correlation(): 
    
    # Group and merge league win/loss ratios by metro area
    cities, nhl_metro_mean, nba_metro_mean, mlb_metro_mean, nfl_metro_mean = league_func()

    population_by_region = [] # metro area populations, in the order of cities['Metro']
    win_loss_by_region = [] # NFL win/loss ratios from nfl_metro_mean, in the same order
    for m in cities['Metro']:
        if m in nfl_metro_mean.index:
            population_by_region.append(nfl_metro_mean.loc[m,'Population'])
            win_loss_by_region.append(nfl_metro_mean.loc[m,'NFL'])
        else:
            continue


    return stats.pearsonr(population_by_region, win_loss_by_region)[0]

In [31]:
nfl_correlation()

0.004922112149349428

## Hypothesis Test
Here I explore the hypothesis that **given that an area has two sports teams in different sports, those teams will perform the same within their respective sports**. I test this with a series of paired t-tests (using [`ttest_rel`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html)) between all pairs of sports. <p>Are there any sports where we can reject the null hypothesis?</p> As before, I average values where a sport has multiple teams in one region. For each pair of sports, I include only the cities that have teams in both sports and drop the others.
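
To make the paired test concrete, here is a hand-rolled sketch of the t statistic behind a paired t-test, skipping regions that lack a team in either sport. The helper `paired_t_statistic` and the numbers are assumptions for illustration; `ttest_rel` additionally converts this statistic into a p-value using a t distribution with n - 1 degrees of freedom, which is not reproduced here.

```python
import math

def paired_t_statistic(a, b):
    """t statistic for paired samples, skipping pairs where either value is None."""
    diffs = [x - y for x, y in zip(a, b) if x is not None and y is not None]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

# Invented per-region win ratios; None marks a region without a team in that sport
sport_a = [0.5, 0.6, None, 0.4, 0.7]
sport_b = [0.4, 0.6, 0.5, 0.5, 0.5]
print(round(paired_t_statistic(sport_a, sport_b), 4))
```

Only four of the five regions contribute here, because the third region has no team in the first sport.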

In [36]:
def sports_team_performance():

    # Consolidates leagues metro means into a single dataframe
    metro_stats = metro_statistics()

    # Note: p_values is a full dataframe, so df.loc["NFL","NBA"] should be the same as df.loc["NBA","NFL"] and
    # df.loc["NFL","NFL"] should return np.nan
    sports = ['NFL', 'NBA', 'NHL', 'MLB']
    p_values = pd.DataFrame({k:np.nan for k in sports}, index=sports)
    for key in p_values.keys():
        for k in p_values.keys():
            p_values.at[key,k] = stats.ttest_rel(metro_stats[key],metro_stats[k], nan_policy='omit')[1]

    return p_values

In [37]:
sports_team_performance()

| | NFL | NBA | NHL | MLB |
|---|---|---|---|---|
| NFL | NaN | 0.941792 | 0.030883 | 0.802069 |
| NBA | 0.941792 | NaN | 0.022297 | 0.95054 |
| NHL | 0.030883 | 0.022297 | NaN | 0.000708 |
| MLB | 0.802069 | 0.95054 | 0.000708 | NaN |
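
One way to read the table is to list the pairs whose p-value falls below a chosen significance level. The sketch below assumes a level of 0.05, which the notebook itself does not state, and copies the p-values from the output above.

```python
# p-values copied from the output above (one entry per pair, since the matrix is symmetric)
p_values = {
    ("NFL", "NBA"): 0.941792,
    ("NFL", "NHL"): 0.030883,
    ("NFL", "MLB"): 0.802069,
    ("NBA", "NHL"): 0.022297,
    ("NBA", "MLB"): 0.95054,
    ("NHL", "MLB"): 0.000708,
}

ALPHA = 0.05  # assumed significance level; the notebook does not state one

# Pairs where the null hypothesis of equal performance would be rejected
rejected = [pair for pair, p in p_values.items() if p < ALPHA]
print(rejected)
```

Under that assumption, the three pairs that fall below the threshold all involve the NHL, while the NFL, NBA, and MLB pairs do not.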
