## Overview
This notebook is part of a data analysis mini-project that aims to answer the following questions:

* How does a team's win/loss ratio correlate with the population of the city it plays in?
* Do these correlations differ between the four major leagues?

The analysis covers the four major American leagues for the year 2018:
* NHL: Hockey league
* NBA: Basketball league
* MLB: Baseball league
* NFL: Football (American football) league

This notebook considers the National Hockey League (NHL).

## Preliminary work: Imports and Loading the data

In [1]:
import pandas as pd
import numpy as np
import re
import os
import cleaning_data as c
year = 2018
from cleaning_data import cities_original_path, cities_clean_path, nhl_path, final_nhl_path 

In [2]:
# cities_original_path = "utility_files/wiki_data.html"
# cities_clean_path = "utility_files/cities.csv" 
# nhl_path = "utility_files/nhl.csv"
# final_nhl_path = "nhl_final.csv"

In [3]:
# load the clean cities dataset
if not os.path.exists(cities_clean_path):
    c.clean_city_data(cities_original_path, cities_clean_path)


In [4]:
# the original data frame
nhl_df_org = pd.read_csv(nhl_path)
cities = pd.read_csv(cities_clean_path)

In [5]:
print(cities['nhl'])

# as we can see, the team names require additional preprocessing to remove unnecessary characters
# cities_nhl is a dataframe that contains only the information related to the NHL and its cities
nhl_cols = ['area', 'pop', 'nhl']
cities_nhl = cities.copy().loc[:, nhl_cols]

0     RangersIslandersDevils[note 3]
1                         KingsDucks
2                     Sharks[note 7]
3                         Blackhawks
4                              Stars
5                           Capitals
6                    Flyers[note 13]
7                             Bruins
8                      Wild[note 16]
9                 Avalanche[note 18]
10                          Panthers
11                           Coyotes
12                         Red Wings
13                       Maple Leafs
14                                 —
15                         [note 25]
16                         Lightning
17                 Penguins[note 28]
18                         [note 32]
19                         [note 34]
20                                 —
21                         [note 39]
22                    Blues[note 43]
23                                 —
24                                 —
25                                 —
26                         Predators
2

## Cleaning the team names 
The cities dataframe lists the NHL teams associated with each area. The names often include filler characters (footnote markers such as `[note 3]` and placeholder dashes) that must be removed before further manipulation.
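
As a minimal sketch of this clean-up, the snippet below applies the same idea to a few sample values copied from the `nhl` column. The helper name `strip_footnotes_and_symbols` is hypothetical and is not part of the notebook; it only illustrates the approach.

```python
import re

def strip_footnotes_and_symbols(name: str) -> str:
    """Hypothetical helper: drop '[note N]' markers and non-alphanumeric filler."""
    # remove bracketed footnote markers such as "[note 7]"
    no_notes = re.sub(r"\[.*?\]", "", name)
    # keep only letters, digits and whitespace (this drops the dash placeholder)
    letters_only = re.sub(r"[^A-Za-z\d\s]", "", no_notes)
    return letters_only.strip().lower()

# "\u2014" is the dash used in the table for areas without a team
samples = ["Sharks[note 7]", "Red Wings", "\u2014", "[note 25]"]
cleaned = [strip_footnotes_and_symbols(s) for s in samples]
print(cleaned)
```

Areas without a team end up as empty strings, which is what makes the later filter on `''` possible.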

In [6]:
def clean_team_name(name):
    # remove anything written between brackets [], e.g. "[note 3]"
    name_1 = re.sub(r'\[.*\]', "", name)
    # convert to lower case and strip leading/trailing whitespace
    return name_1.lower().strip()

cities_nhl['nhl'] = cities_nhl['nhl'].apply(clean_team_name)
# remove characters outside the Latin-1 range (e.g. the dash used as a "no team" placeholder)
cities_nhl['nhl'] = cities_nhl['nhl'].apply(lambda x: re.sub("[^\x00-\xFF]", "", x)) 
# final cleaning step: keep only letters, digits and whitespace
cities_nhl['nhl'] = cities_nhl['nhl'].apply(lambda x: re.sub(r"[^A-Za-z\d\s]", "", x))

# at this point cities with no nhl team are assigned the empty string in the "nhl" column
# keep the cities with nhl teams
cities_nhl = cities_nhl[cities_nhl['nhl'] != ''] 
print(cities_nhl)
# reset the index to a numerical series from 0 to the size of the dataframe
cities_nhl = cities_nhl.reset_index(drop=True)

# after indexing

print("after indexing")
print("##" * 40)

print(cities_nhl)

                      area       pop                     nhl
0            New York City  20153634  rangersislandersdevils
1              Los Angeles  13310447              kingsducks
2   San Francisco Bay Area   6657982                  sharks
3                  Chicago   9512999              blackhawks
4        Dallas–Fort Worth   7233323                   stars
5         Washington, D.C.   6131977                capitals
6             Philadelphia   6070500                  flyers
7                   Boston   4794447                  bruins
8   Minneapolis–Saint Paul   3551036                    wild
9                   Denver   2853077               avalanche
10   Miami–Fort Lauderdale   6066387                panthers
11                 Phoenix   4661537                 coyotes
12                 Detroit   4297617               red wings
13                 Toronto   5928040             maple leafs
16          Tampa Bay Area   3032171               lightning
17              Pittsbur

In [7]:
# in order to map each team with its area, a new column should be added 
# that groups both the area/city name as well as the team's name

def area_team(row):
    area_no_space = re.sub(r"\s", "", row['area']).strip().lower()
    team_no_space = re.sub(r"\s", "", row['nhl']).strip().lower()
    return area_no_space + team_no_space

cities_nhl['area_team'] = cities_nhl.apply(area_team, axis=1)
print(cities_nhl.loc[:, ["area",  "nhl", "area_team"]])

                      area                     nhl  \
0            New York City  rangersislandersdevils   
1              Los Angeles              kingsducks   
2   San Francisco Bay Area                  sharks   
3                  Chicago              blackhawks   
4        Dallas–Fort Worth                   stars   
5         Washington, D.C.                capitals   
6             Philadelphia                  flyers   
7                   Boston                  bruins   
8   Minneapolis–Saint Paul                    wild   
9                   Denver               avalanche   
10   Miami–Fort Lauderdale                panthers   
11                 Phoenix                 coyotes   
12                 Detroit               red wings   
13                 Toronto             maple leafs   
14          Tampa Bay Area               lightning   
15              Pittsburgh                penguins   
16               St. Louis                   blues   
17               Nashville  

## Clean the NHL data
We will now consider the NHL teams' data (in this dataframe, the data is provided independently of the areas).

In [8]:
# it is time to consider the nhl DataFrame
nhl_org = pd.read_csv(nhl_path)
# work on an explicit copy so that later column assignments do not touch nhl_org
nhl = nhl_org[nhl_org['year'] == year].copy()
print(nhl.columns)

Index(['team', 'GP', 'W', 'L', 'OL', 'PTS', 'PTS%', 'GF', 'GA', 'SRS', 'SOS',
       'RPt%', 'ROW', 'year', 'League'],
      dtype='object')


In [9]:
# among the columns we are only interested in the team, win (W) and loss (L) columns
cols = ["team", "W", "L"]
nhl = nhl.loc[:, cols]
print(nhl['team'])


0          Atlantic Division
1       Tampa Bay Lightning*
2             Boston Bruins*
3       Toronto Maple Leafs*
4           Florida Panthers
5          Detroit Red Wings
6         Montreal Canadiens
7            Ottawa Senators
8             Buffalo Sabres
9      Metropolitan Division
10      Washington Capitals*
11      Pittsburgh Penguins*
12      Philadelphia Flyers*
13    Columbus Blue Jackets*
14        New Jersey Devils*
15       Carolina Hurricanes
16        New York Islanders
17          New York Rangers
18          Central Division
19      Nashville Predators*
20            Winnipeg Jets*
21           Minnesota Wild*
22       Colorado Avalanche*
23           St. Louis Blues
24              Dallas Stars
25        Chicago Blackhawks
26          Pacific Division
27     Vegas Golden Knights*
28            Anaheim Ducks*
29          San Jose Sharks*
30        Los Angeles Kings*
31            Calgary Flames
32           Edmonton Oilers
33         Vancouver Canucks
34           A

In [10]:
# at first glance we can detect at least 2 main issues with the team column:
# 1. the need for reformatting the names
# 2. removing the rows declaring the teams' divisions

def clean_team_name_nhl(name):
    # keep only letters, digits and whitespace (drops characters such as "*" and ".")
    return re.sub(r"[^A-Za-z\d\s]", "", name).strip().lower()

# addressing problem 1
nhl['team'] = nhl['team'].apply(clean_team_name_nhl)

# addressing problem 2
nhl = nhl[~nhl['team'].str.contains("division")]

# setting a custom index
nhl_custom_index = pd.Index(range(len(nhl)))

nhl = nhl.set_index(nhl_custom_index)

print(nhl)

                     team   W   L
0     tampa bay lightning  54  23
1           boston bruins  50  20
2     toronto maple leafs  49  26
3        florida panthers  44  30
4       detroit red wings  30  39
5      montreal canadiens  29  40
6         ottawa senators  28  43
7          buffalo sabres  25  45
8     washington capitals  49  26
9     pittsburgh penguins  47  29
10    philadelphia flyers  42  26
11  columbus blue jackets  45  30
12      new jersey devils  44  29
13    carolina hurricanes  36  35
14     new york islanders  35  37
15       new york rangers  34  39
16    nashville predators  53  18
17          winnipeg jets  52  20
18         minnesota wild  45  26
19     colorado avalanche  43  30
20         st louis blues  44  32
21           dallas stars  42  32
22     chicago blackhawks  33  39
23   vegas golden knights  51  24
24          anaheim ducks  44  25
25        san jose sharks  45  27
26      los angeles kings  45  29
27         calgary flames  37  35
28        edmo

In [11]:
# time to add the area_team name column to the nhl DataFrame
nhl['area_team'] = nhl['team'].apply(lambda x: re.sub(r"\s", "", x).strip().lower())
print(nhl)

                     team   W   L            area_team
0     tampa bay lightning  54  23    tampabaylightning
1           boston bruins  50  20         bostonbruins
2     toronto maple leafs  49  26    torontomapleleafs
3        florida panthers  44  30      floridapanthers
4       detroit red wings  30  39      detroitredwings
5      montreal canadiens  29  40    montrealcanadiens
6         ottawa senators  28  43       ottawasenators
7          buffalo sabres  25  45        buffalosabres
8     washington capitals  49  26   washingtoncapitals
9     pittsburgh penguins  47  29   pittsburghpenguins
10    philadelphia flyers  42  26   philadelphiaflyers
11  columbus blue jackets  45  30  columbusbluejackets
12      new jersey devils  44  29      newjerseydevils
13    carolina hurricanes  36  35   carolinahurricanes
14     new york islanders  35  37     newyorkislanders
15       new york rangers  34  39       newyorkrangers
16    nashville predators  53  18   nashvillepredators
17        

## Map Areas and Teams
The objective is to map every team in the nhl dataframe to an area/city in the cities_nhl dataframe.  

The first step is to merge the two dataframes on the area_team column. Because this column combines a team's name with its area, the merge should associate most NHL teams with their respective areas.  

The remaining teams are processed separately.
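
The merge step can be sketched on toy data as follows. The two small frames are illustrative stand-ins for `nhl` and `cities_nhl`, and the `indicator=True` option is an addition that the notebook itself does not use: it labels each row with the side it came from, which makes unmatched teams easy to list.

```python
import pandas as pd

# toy stand-ins for cities_nhl and nhl (illustrative subset only)
areas = pd.DataFrame({"area": ["Boston", "Denver"],
                      "area_team": ["bostonbruins", "denveravalanche"]})
teams = pd.DataFrame({"team": ["boston bruins", "colorado avalanche"],
                      "area_team": ["bostonbruins", "coloradoavalanche"]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = pd.merge(teams, areas, how="left", on="area_team", indicator=True)

# teams whose key found no counterpart in the areas table
unmatched = merged.loc[merged["_merge"] == "left_only", "team"].tolist()
print(unmatched)
```

Filtering on `_merge` gives the same rows as checking `area` for missing values, but it does not depend on which column happens to be empty.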

In [12]:
# having the area_team column  in common between the two DataFrames we can merge them

merge_areas = pd.merge(cities_nhl, nhl, how ='left',on=['area_team'])
print(merge_areas.loc[:, ["area", "team"]])

                      area                   team
0            New York City                    NaN
1              Los Angeles                    NaN
2   San Francisco Bay Area                    NaN
3                  Chicago     chicago blackhawks
4        Dallas–Fort Worth                    NaN
5         Washington, D.C.                    NaN
6             Philadelphia    philadelphia flyers
7                   Boston          boston bruins
8   Minneapolis–Saint Paul                    NaN
9                   Denver                    NaN
10   Miami–Fort Lauderdale                    NaN
11                 Phoenix                    NaN
12                 Detroit      detroit red wings
13                 Toronto    toronto maple leafs
14          Tampa Bay Area                    NaN
15              Pittsburgh    pittsburgh penguins
16               St. Louis                    NaN
17               Nashville    nashville predators
18                 Buffalo         buffalo sabres


In [13]:
merge_teams = pd.merge(nhl, cities_nhl, how='left', on=['area_team'])
print(merge_teams.loc[:, ["team", "area","area_team"]])

                     team          area            area_team
0     tampa bay lightning           NaN    tampabaylightning
1           boston bruins        Boston         bostonbruins
2     toronto maple leafs       Toronto    torontomapleleafs
3        florida panthers           NaN      floridapanthers
4       detroit red wings       Detroit      detroitredwings
5      montreal canadiens      Montreal    montrealcanadiens
6         ottawa senators        Ottawa       ottawasenators
7          buffalo sabres       Buffalo        buffalosabres
8     washington capitals           NaN   washingtoncapitals
9     pittsburgh penguins    Pittsburgh   pittsburghpenguins
10    philadelphia flyers  Philadelphia   philadelphiaflyers
11  columbus blue jackets      Columbus  columbusbluejackets
12      new jersey devils           NaN      newjerseydevils
13    carolina hurricanes           NaN   carolinahurricanes
14     new york islanders           NaN     newyorkislanders
15       new york ranger

### Teams without a match in area_team column
Because team names appear in different forms in the two datasets, some teams do not match their cities in the cities dataset. However, the association can easily be inferred by manually inspecting the area_team column.

The two versions of this column generally share an inner word or syllable.
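
Since the two versions of a key usually share a substring, a fuzzy string match can suggest candidates before the manual check. The sketch below uses `difflib.get_close_matches` from the standard library; this is not what the notebook does, and the cutoff value is an assumption, so the output should be treated as a hint to review rather than as a final mapping.

```python
from difflib import get_close_matches

# keys taken from the two area_team columns shown in this notebook
team_keys = ["tampabaylightning", "floridapanthers", "dallasstars"]
area_key = "tampabayarealightning"

# returns up to n candidates whose similarity ratio is at least `cutoff`
suggestions = get_close_matches(area_key, team_keys, n=1, cutoff=0.6)
print(suggestions)
```

Pairs that share only the team nickname, such as a Denver key and a Colorado key, may score lower and still need to be matched by hand.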

In [14]:
teams_no_clear_area = merge_teams[merge_teams['area'].isna()]
print(teams_no_clear_area)

                    team   W   L           area_team area  pop  nhl
0    tampa bay lightning  54  23   tampabaylightning  NaN  NaN  NaN
3       florida panthers  44  30     floridapanthers  NaN  NaN  NaN
8    washington capitals  49  26  washingtoncapitals  NaN  NaN  NaN
12     new jersey devils  44  29     newjerseydevils  NaN  NaN  NaN
13   carolina hurricanes  36  35  carolinahurricanes  NaN  NaN  NaN
14    new york islanders  35  37    newyorkislanders  NaN  NaN  NaN
15      new york rangers  34  39      newyorkrangers  NaN  NaN  NaN
18        minnesota wild  45  26       minnesotawild  NaN  NaN  NaN
19    colorado avalanche  43  30   coloradoavalanche  NaN  NaN  NaN
20        st louis blues  44  32        stlouisblues  NaN  NaN  NaN
21          dallas stars  42  32         dallasstars  NaN  NaN  NaN
23  vegas golden knights  51  24  vegasgoldenknights  NaN  NaN  NaN
24         anaheim ducks  44  25        anaheimducks  NaN  NaN  NaN
25       san jose sharks  45  27       sanjosesh

In [15]:
areas_no_clear_team = merge_areas[merge_areas["team"].isna()]

In [16]:
# the teams left without a clear area name are to be processed manually
# first, let's consider a possible mapping between the column [area_team] in the nhl DF
# and the column [area_team] in the cities_nhl DF

area_team_no_match_nhl_DF = teams_no_clear_area.set_index(pd.Index(range(len(teams_no_clear_area))))['area_team']
area_team_no_match_cities_nhl_DF = areas_no_clear_team.set_index(pd.Index(range(len(areas_no_clear_team))))['area_team']

no_match = pd.DataFrame({"team_no_area": area_team_no_match_nhl_DF, "area_no_name": area_team_no_match_cities_nhl_DF})
print(no_match)

          team_no_area                       area_no_name
0    tampabaylightning  newyorkcityrangersislandersdevils
1      floridapanthers               losangeleskingsducks
2   washingtoncapitals          sanfranciscobayareasharks
3      newjerseydevils              dallas–fortworthstars
4   carolinahurricanes            washington,d.c.capitals
5     newyorkislanders          minneapolis–saintpaulwild
6       newyorkrangers                    denveravalanche
7        minnesotawild       miami–fortlauderdalepanthers
8    coloradoavalanche                     phoenixcoyotes
9         stlouisblues              tampabayarealightning
10         dallasstars                      st.louisblues
11  vegasgoldenknights              lasvegasgoldenknights
12        anaheimducks                  raleighhurricanes
13       sanjosesharks                                NaN
14     losangeleskings                                NaN
15      arizonacoyotes                                NaN


In [17]:
# the last dataframe made it easy to see the mapping between the two columns
# the following dictionary reflects the mapping

mapping = {
    "newyorkcityrangersislandersdevils": "newyorkislanders",
    "losangeleskingsducks": "losangeleskings",
    "dallas–fortworthstars": "dallasstars",
    "washington,d.c.capitals": "washingtoncapitals",
    "minneapolis–saintpaulwild": "minnesotawild",
    "denveravalanche": "coloradoavalanche",
    "miami–fortlauderdalepanthers": "floridapanthers",
    "tampabayarealightning": "tampabaylightning",
    "st.louisblues": "stlouisblues",
    "lasvegasgoldenknights": "vegasgoldenknights",
    "phoenixcoyotes": "arizonacoyotes",
    "raleighhurricanes": "carolinahurricanes",
    "sanfranciscobayareasharks": "sanjosesharks",
}

# the next step is to map the old area_team names in the cities_nhl DF to their respective mapped value  

cities_nhl['area_team'] = cities_nhl['area_team'].apply(lambda x: mapping[x].strip() if x in mapping else x)

merge_teams = pd.merge(nhl, cities_nhl, how='left', on=['area_team'])
print(merge_teams.loc[: , ['area', 'team', 'area_team']])

                      area                   team            area_team
0           Tampa Bay Area    tampa bay lightning    tampabaylightning
1                   Boston          boston bruins         bostonbruins
2                  Toronto    toronto maple leafs    torontomapleleafs
3    Miami–Fort Lauderdale       florida panthers      floridapanthers
4                  Detroit      detroit red wings      detroitredwings
5                 Montreal     montreal canadiens    montrealcanadiens
6                   Ottawa        ottawa senators       ottawasenators
7                  Buffalo         buffalo sabres        buffalosabres
8         Washington, D.C.    washington capitals   washingtoncapitals
9               Pittsburgh    pittsburgh penguins   pittsburghpenguins
10            Philadelphia    philadelphia flyers   philadelphiaflyers
11                Columbus  columbus blue jackets  columbusbluejackets
12                     NaN      new jersey devils      newjerseydevils
13    

### Teams with no area indication in their names
These teams are the trickiest: associating them with their region requires a human understanding of the data as well as some research outside the notebook.

In [18]:
# as expected, there are 3 teams with no associated area; the next step is to associate each of these teams
# with one of the areas provided. This task requires more human understanding of the data and a little bit of research

print("LEFT OUT TEAMS")
print(merge_teams[merge_teams['area'].isna()]['team'])
print()
print("AREAS:")
print(cities_nhl['area'])

LEFT OUT TEAMS
12    new jersey devils
15     new york rangers
24        anaheim ducks
Name: team, dtype: object

AREAS:
0              New York City
1                Los Angeles
2     San Francisco Bay Area
3                    Chicago
4          Dallas–Fort Worth
5           Washington, D.C.
6               Philadelphia
7                     Boston
8     Minneapolis–Saint Paul
9                     Denver
10     Miami–Fort Lauderdale
11                   Phoenix
12                   Detroit
13                   Toronto
14            Tampa Bay Area
15                Pittsburgh
16                 St. Louis
17                 Nashville
18                   Buffalo
19                  Montreal
20                 Vancouver
21                  Columbus
22                   Calgary
23                    Ottawa
24                  Edmonton
25                  Winnipeg
26                 Las Vegas
27                   Raleigh
Name: area, dtype: object


#### Research Results
According to [this article](https://en.wikipedia.org/wiki/New_Jersey_Devils), the ***new jersey devils*** can be assigned to the ***New York City*** area. The ***new york rangers*** are assigned to the same area. The ***anaheim ducks*** belong to the ***Los Angeles*** area according to the following links:
* [link1](https://en.wikipedia.org/wiki/Anaheim_Ducks#:~:text=Anaheim%20Ducks%20The%20Anaheim%20Ducks%20are%20a%20professional,and%20play%20their%20home%20games%20at%20Honda%20Center.)
* [link2](https://en.wikipedia.org/wiki/Anaheim,_California)
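
The manual assignment can also be written without a row-wise `apply`. The following is a minimal sketch on a toy frame that mimics `merge_teams`; the names `df` and `manual_area` are hypothetical, and the frame holds only three of the teams.

```python
import pandas as pd

# toy frame mimicking merge_teams after the second merge
df = pd.DataFrame({
    "team": ["boston bruins", "new jersey devils", "anaheim ducks"],
    "area": ["Boston", None, None],
})
manual_area = {"new jersey devils": "New York City", "anaheim ducks": "Los Angeles"}

# fill only the missing areas; teams absent from the dict keep their current value
df["area"] = df["area"].fillna(df["team"].map(manual_area))
print(df["area"].tolist())
```

Because `fillna` only touches missing entries, areas that were already matched by the merge are left unchanged.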

In [19]:
team_area = {"new jersey devils": "New York City", "new york rangers": "New York City", "anaheim ducks": "Los Angeles"}

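def set_areas(row):
    # only 'area' is filled here; 'pop' stays NaN for these three teams,
    # and the groupby mean in the final step skips NaN values
    if row['team'] in team_area:
        row['area'] = team_area[row['team']]
    return row
merge_teams = merge_teams.apply(set_areas, axis=1)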

print(merge_teams.loc[: , ['area', 'team', 'area_team']])

                      area                   team            area_team
0           Tampa Bay Area    tampa bay lightning    tampabaylightning
1                   Boston          boston bruins         bostonbruins
2                  Toronto    toronto maple leafs    torontomapleleafs
3    Miami–Fort Lauderdale       florida panthers      floridapanthers
4                  Detroit      detroit red wings      detroitredwings
5                 Montreal     montreal canadiens    montrealcanadiens
6                   Ottawa        ottawa senators       ottawasenators
7                  Buffalo         buffalo sabres        buffalosabres
8         Washington, D.C.    washington capitals   washingtoncapitals
9               Pittsburgh    pittsburgh penguins   pittsburghpenguins
10            Philadelphia    philadelphia flyers   philadelphiaflyers
11                Columbus  columbus blue jackets  columbusbluejackets
12           New York City      new jersey devils      newjerseydevils
13    

## Final Step
Now that the dataframe is cleaned, merged and filled with the necessary values, we can proceed to the final step: computing the win/loss ratio for each team (computed here as W / (W + L)), grouping the teams by area, and measuring the correlation between each area's mean ratio and its population.
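
The computation can be sketched on a small subset. The W, L and population values below are copied from the tables above, while the frame `toy` and the column name `win_share` are hypothetical. One population is deliberately left missing to show that the group mean ignores it.

```python
import pandas as pd

# toy records; W, L and pop values are copied from the outputs above
toy = pd.DataFrame({
    "area": ["Los Angeles", "Los Angeles", "Boston", "Buffalo"],
    "W": [45, 44, 50, 25],
    "L": [29, 25, 20, 45],
    "pop": [13310447, None, 4794447, 1132804],
})
# share of wins among decided games: W / (W + L)
toy["win_share"] = toy["W"] / (toy["W"] + toy["L"])

# mean() skips NaN by default, so an area with one known population keeps it
per_area = toy.groupby("area").agg({"win_share": "mean", "pop": "mean"})

# Series.corr computes the Pearson correlation by default
r = per_area["win_share"].corr(per_area["pop"])
print(per_area)
print(round(r, 3))
```

With only three areas the coefficient is not meaningful; the sketch is meant to show the mechanics, not to reproduce the result of the full dataset.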

In [20]:
final_df = merge_teams.loc[:, ['area', 'team', 'W', 'L', 'pop']]
final_df['win_loss_ratio'] = final_df['W'].astype(float) / (final_df['W'].astype(float) + final_df['L'].astype(float))
final_df = final_df.loc[:, ['area', 'win_loss_ratio', 'pop']]
# print(final_df)

final_df = final_df.set_index('area').astype(float).groupby('area').agg({"win_loss_ratio":'mean', 'pop':'mean'})
print(final_df)

                        win_loss_ratio         pop
area                                              
Boston                        0.714286   4794447.0
Buffalo                       0.357143   1132804.0
Calgary                       0.513889   1392609.0
Chicago                       0.458333   9512999.0
Columbus                      0.600000   2041520.0
Dallas–Fort Worth             0.567568   7233323.0
Denver                        0.589041   2853077.0
Detroit                       0.434783   4297617.0
Edmonton                      0.473684   1321426.0
Las Vegas                     0.680000   2155664.0
Los Angeles                   0.622895  13310447.0
Miami–Fort Lauderdale         0.594595   6066387.0
Minneapolis–Saint Paul        0.633803   3551036.0
Montreal                      0.420290   4098927.0
Nashville                     0.746479   1865298.0
New York City                 0.518201  20153634.0
Ottawa                        0.394366   1323783.0
Philadelphia                  0

In [21]:
print(final_df.corr())

                win_loss_ratio       pop
win_loss_ratio        1.000000  0.012486
pop                   0.012486  1.000000


In [22]:
final_df.to_csv(final_nhl_path)