# NWSL Exploratory Data Analysis

## Overview
The National Women's Soccer League (NWSL) is the premier professional women's soccer league in the United States. In this repository, I will be scraping player and team data from the NWSL website (www.nwslsoccer.com) and performing exploratory data analysis on the collected data.

This project can be found on my GitHub at https://github.com/ck-duong/nwsl.
Please view this notebook at https://nbviewer.jupyter.org/github/ck-duong/nwsl/blob/master/nwsl_eda.ipynb for best visualization.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
#necessary imports to run the code
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import os

In [3]:
#imports for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
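#note: plotly.plotly was moved to the separate chart_studio package in plotly 4,
#so this import only works with plotly versions before 4
import plotly.plotly as py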
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly
import cufflinks as cf

#for offline plotting
plotly.offline.init_notebook_mode(connected=True)
cf.set_config_file(offline=True, world_readable=True, theme='ggplot')

In [4]:
#imports from module
from module.functions import combination, calculate_stats

## Scraping
In the subdirectory "scripts", there are three Python files written to scrape data from the official NWSL website: statscrape.py, teamscrape.py, and standingscrape.py. 

The statscrape.py file scrapes player data from the Stats page of the website for each player in the league from 2016 through 2019 (each of the years the league has made player/team stats publicly available) and compiles them into csv files by year, entitled "nwsl{}.csv" for each year.

The teamscrape.py file scrapes player data from the Team pages of the website for each team for each year the team has existed and compiles them into csv files by year, entitled "position{}.csv" for each year.

The standingscrape.py file scrapes team data from the Standings pages of the website for each team for each year the league has provided public stats (2016 - 2019) and compiles them into csv files by year, entitled "standings{}.csv" for each year. Additionally, this file formats the dataframes into more readable and usable data by separating the scraped Home and Away game data into separate columns based on location (Home or Away) and game result (Win, Loss, Tie). These standings come from the regular season of NWSL play and do not account for playoff/post-season games.
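
The Home/Away split described above might look like the following sketch. It assumes, hypothetically, that each scraped record arrives as a single string in wins-losses-ties order such as `"7-2-1"`; the actual format and column names used by standingscrape.py may differ, and `split_record` is an illustrative helper, not a function from the repository.

```python
import pandas as pd

def split_record(df, col, prefix):
    """Split a 'W-L-T' string column into three integer columns.

    `col` and `prefix` are hypothetical names used only for illustration.
    """
    parts = df[col].str.split('-', expand=True).astype(int)
    parts.columns = [f'{prefix} Wins', f'{prefix} Losses', f'{prefix} Ties']
    # drop the combined string column and append the three new ones
    return pd.concat([df.drop(columns=col), parts], axis=1)

# made-up example rows, not real standings
raw = pd.DataFrame({'Team': ['AAA', 'BBB'],
                    'Home': ['7-2-1', '3-4-3'],
                    'Away': ['5-3-2', '2-6-2']})
clean = split_record(split_record(raw, 'Home', 'Home'), 'Away', 'Away')
```

Keeping one column per location and result means later cells can sum or compare, say, home wins across seasons without parsing strings again.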

In the following cell of code, I run these three files to create the csvs I will be working with in the rest of this notebook. Currently they are commented out since they only need to be run once to collect our data. However, I will note that the 2019 NWSL season is currently taking place, meaning that rerunning these files will get us the most up-to-date data.

For this analysis, I will only be looking at the statistics for the 2019 season up to June 11th, although my code will be able to work with future data as well since it will all be formatted in the same way. It is also worth noting that as of late May/early June 2019, many NWSL teams are missing players who also play for their national teams (e.g., Sam Kerr of the Chicago Red Stars, who also plays for the Australian Women's National Team) due to the Women's World Cup occurring in summer 2019.

In the future, I will be scraping further player data from the Player page on the NWSL site, which also includes players' personal information such as Country, Height, etc. rather than just NWSL stats.

In [5]:
#py files to run to scrape the data from the NWSL page. Only need to run once.
#uncomment the following lines the first time you're running through this notebook
#may take 1-4 minutes to run. Please only run once so as not to flood the NWSL site's servers

#!python ./scripts/statscrape.py
#!python ./scripts/teamscrape.py
#!python ./scripts/standingscrape.py
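
For readers who do not open the scripts, here is a minimal sketch of the general requests/BeautifulSoup pattern for turning an HTML stats table into a dataframe. It is not the code in statscrape.py: the markup below is a made-up stand-in, `table_to_frame` is a hypothetical helper, and the real pages may be structured differently.

```python
import pandas as pd
from bs4 import BeautifulSoup

def table_to_frame(html):
    """Parse the first <table> in an HTML string into a DataFrame."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)

# Stand-in markup. In a real scraper the HTML would come from something like
# requests.get(url).text; the actual URLs are not reproduced here.
sample_html = """
<table>
  <tr><th>Player Name</th><th>Team</th><th>Goals</th></tr>
  <tr><td>Player A</td><td>AAA</td><td>3</td></tr>
  <tr><td>Player B</td><td>BBB</td><td>1</td></tr>
</table>
"""
stats = table_to_frame(sample_html)
stats['Goals'] = pd.to_numeric(stats['Goals'])  # scraped cells start as strings
```

Parsing from a string keeps the example offline, which also fits the request above to avoid hitting the site more than necessary.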

## Cleaning and Pre-Analysis
This section of the notebook includes reading in the raw data and applying some basic cleaning to make the data easier to work with.

Some cleaning/pre-analysis strategies I used were:

- Combining the nwsl{}.csv and position{}.csv files into full{}.csv files so I could access both player stats and position.
- Creating "Goals per Game", "Assists per Game", "Shots per Game", "Prop SoG" (proportion of shots that were on goal), and "Shots per Goal" (despite the name, goals per shot on goal) columns for each player in the dataset
- Combining all full{}.csv dataframes into a larger dataframe with data from all years named `nwsl`
- Combining all standings{}.csv dataframes into a larger dataframe with data from all years named `standings`
- Getting the ranks for each team throughout the years and combining them into one dataframe
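
The derived columns in the list above can be sketched as follows. This is a hypothetical stand-in for `calculate_stats`, whose source is not shown in this notebook: the formulas are inferred from the values displayed in the Missingness section (for example, "Shots per Goal" matches Goals divided by Shots on Goal), and filling undefined ratios with 0 is an assumption based on the zeros shown there.

```python
import numpy as np
import pandas as pd

def add_rate_stats(df):
    """Add per-game and proportion columns to a player-stats dataframe in place.

    Hypothetical helper; the formulas are inferred from displayed output.
    """
    # replace zero denominators with NaN so that 0/0 does not raise or give inf
    games = df['Games Played'].replace(0, np.nan)
    shots = df['Shots'].replace(0, np.nan)
    sog = df['Shots on Goal'].replace(0, np.nan)
    df['Goals per Game'] = (df['Goals'] / games).fillna(0)
    df['Assists per Game'] = (df['Assists'] / games).fillna(0)
    df['Shots per Game'] = (df['Shots'] / games).fillna(0)
    df['Prop SoG'] = (df['Shots on Goal'] / shots).fillna(0)
    # despite its name, this column holds goals per shot on goal
    df['Shots per Goal'] = (df['Goals'] / sog).fillna(0)
    return df

# Sam Kerr's 2016 line, taken from the tables in the Missingness section
kerr_2016 = pd.DataFrame({'Games Played': [9.0], 'Goals': [5.0], 'Assists': [0.0],
                          'Shots': [20.0], 'Shots on Goal': [12.0]})
add_rate_stats(kerr_2016)
```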

In [6]:
#run to join all of the nwsl/position csvs
combination(2016, 2019)
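
The source of `combination` is not shown here, but the join it performs can be pictured as a left merge of one season's stats with that season's positions. The sketch below uses made-up rows and assumes the join keys are `Player Name` and `Team`; the real function may use different keys.

```python
import pandas as pd

# made-up rows; column names follow the ones used in this notebook
stats = pd.DataFrame({'Player Name': ['Player A', 'Player B'],
                      'Team': ['AAA', 'AAA'],
                      'Goals': [3, 1]})
positions = pd.DataFrame({'Player Name': ['Player A'],
                          'Team': ['AAA'],
                          'Position': ['Attacker']})

# a left join keeps every player who has stats, even without a roster entry
full = stats.merge(positions, on=['Player Name', 'Team'], how='left')
```

With a left join, a player who is absent from the roster file keeps their stats and gets a missing `Position`, which is the pattern examined in the Missingness section.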

In [7]:
#getting all of the full.csv files in the subdirectory
file_path = os.path.join('data', 'full')
csvs = os.listdir(path = file_path)
files = []
#for loop to get all the full.csv paths
for file in csvs:
    fp = os.path.join(file_path, file)
    files.append(fp)
#for organization purposes later
files.sort()

In [8]:
#reading all the files from the subdirectory
nwsl_2016 = pd.read_csv(files[0])

nwsl_2017 = pd.read_csv(files[1])

nwsl_2018 = pd.read_csv(files[2])

nwsl_2019 = pd.read_csv(files[3])

#compiling all the dataframes into a list for later
all_nwsl = [nwsl_2016, nwsl_2017, nwsl_2018, nwsl_2019]

In [9]:
#apply the imported calculate_stats function to all dataframes in the list
for each in all_nwsl:
    calculate_stats(each)

In [10]:
nwsl = pd.DataFrame(columns = nwsl_2019.columns)
for each in all_nwsl:
    nwsl = pd.concat([nwsl, each])
#nwsl is the combined data for all years

#save as csv for easy access later
nwsl.to_csv(os.path.join('data', 'final', 'nwsl_final.csv'), index = False)

In [11]:
#getting all of the standings csv files in the subdirectory
standings_path = os.path.join('data', 'standings')
standings_csvs = os.listdir(path = standings_path)
standings_files = []
#for loop to get all the standings csv paths
for file in standings_csvs:
    fp = os.path.join(standings_path, file)
    standings_files.append(fp)
#for organization purposes later
standings_files.sort()

In [12]:
#reading all the standings files from the subdirectory
#the "Season" column in each file lets us combine the full dataframes
#and still be able to differentiate between seasons
standings_2016 = pd.read_csv(standings_files[0])

standings_2017 = pd.read_csv(standings_files[1])

standings_2018 = pd.read_csv(standings_files[2])

standings_2019 = pd.read_csv(standings_files[3])

all_standings = [standings_2016, standings_2017, standings_2018, standings_2019]

In [13]:
standings = pd.DataFrame()
for each in all_standings:
    standings = pd.concat([standings, each])
#standings is the combined standings data for all years

#save as csv for easy access later
standings.to_csv(os.path.join('data', 'final', 'standings_final.csv'), index = False)

In [14]:
def get_ranks():
    """
    Creates a dataframe of team rankings by season/year

    Takes no arguments; reads two variables defined in earlier cells:
    standings - combined standings dataframe for all years
    all_standings - a list of per-season standings dataframes (2016-2019, in order)

    :returns:
    rankings - dataframe with team names as columns, years as index,
    and ranking of the team that year as the value (NaN if the team
    has no row for that season)
    """
    teams = standings['Team'].sort_values().unique()
    #one row per season; the index is set once instead of on every loop iteration
    ranks = pd.DataFrame(index = np.arange(2016, 2020))
    for team in teams:
        team_rank = []
        for year in all_standings:
            rank = year.loc[year['Team'] == team]['Rank'].values
            if len(rank):
                team_rank.append(rank[0])
            else:
                team_rank.append(np.nan)
        ranks[team] = team_rank
    return ranks
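
As an aside, the same seasons-by-teams table can be built with a single pivot when the data is in long format with `Season`, `Team`, and `Rank` columns, as the combined `standings` dataframe appears to be. The sketch below uses made-up rows rather than the real standings.

```python
import pandas as pd

# made-up standings in long format: one row per team per season
toy = pd.DataFrame({'Season': [2016, 2016, 2017, 2017, 2017],
                    'Team': ['AAA', 'BBB', 'AAA', 'BBB', 'CCC'],
                    'Rank': [1, 2, 2, 3, 1]})

# seasons become the index and teams the columns;
# team-season pairs with no row come out as NaN
toy_ranks = toy.pivot(index='Season', columns='Team', values='Rank')
```

The pivot relies on each team appearing at most once per season; duplicate team-season rows would raise an error.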

## Missingness

In [15]:
nwsl.loc[nwsl['Player Name'] == 'Sam Kerr']

| | Player Name | Team | Games Played | Games Started | Minutes Played | Goals | Assists | Shots | Shots on Goal | Fouls Committed | ... | Penalty Kick Goals | Yellow Cards | Red Cards | Season | Position | Goals per Game | Assists per Game | Shots per Game | Prop SoG | Shots per Goal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 18 | Sam Kerr | NJ | 9.0 | 6.0 | 616.0 | 5.0 | 0.0 | 20.0 | 12.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 2016.0 | NaN | 0.555556 | 0.0 | 2.222222 | 0.6 | 0.416667 |
| 0 | Sam Kerr | NJ | 22.0 | 21.0 | 1918.0 | 17.0 | 4.0 | 91.0 | 54.0 | 10.0 | ... | 0.0 | 2.0 | 0.0 | 2017.0 | NaN | 0.772727 | 0.181818 | 4.136364 | 0.593407 | 0.314815 |
| 0 | Sam Kerr | CHI | 19.0 | 19.0 | 1704.0 | 16.0 | 4.0 | 88.0 | 47.0 | 9.0 | ... | 0.0 | 2.0 | 0.0 | 2018.0 | NaN | 0.842105 | 0.210526 | 4.631579 | 0.534091 | 0.340426 |
| 0 | Sam Kerr | CHI | 6.0 | 6.0 | 540.0 | 6.0 | 1.0 | 21.0 | 14.0 | 2.0 | ... | 0.0 | 1.0 | 0.0 | 2019.0 | Attacker | 1.0 | 0.166667 | 3.5 | 0.666667 | 0.428571 |


In [16]:
nwsl.loc[nwsl['Player Name'] == 'Christen Press']

| | Player Name | Team | Games Played | Games Started | Minutes Played | Goals | Assists | Shots | Shots on Goal | Fouls Committed | ... | Penalty Kick Goals | Yellow Cards | Red Cards | Season | Position | Goals per Game | Assists per Game | Shots per Game | Prop SoG | Shots per Goal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | Christen Press | CHI | 14.0 | 14.0 | 1260.0 | 8.0 | 0.0 | 56.0 | 32.0 | 16.0 | ... | 1.0 | 0.0 | 0.0 | 2016.0 | Attacker | 0.571429 | 0.0 | 4.0 | 0.571429 | 0.25 |
| 3 | Christen Press | CHI | 23.0 | 22.0 | 1997.0 | 11.0 | 4.0 | 84.0 | 49.0 | 24.0 | ... | 4.0 | 2.0 | 0.0 | 2017.0 | Attacker | 0.478261 | 0.173913 | 3.652174 | 0.583333 | 0.22449 |
| 46 | Christen Press | UTA | 11.0 | 11.0 | 975.0 | 2.0 | 2.0 | 34.0 | 17.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 2018.0 | Attacker | 0.181818 | 0.181818 | 3.090909 | 0.5 | 0.117647 |
| 42 | Christen Press | UTA | 2.0 | 2.0 | 180.0 | 1.0 | 1.0 | 7.0 | 3.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 2019.0 | Attacker | 0.5 | 0.5 | 3.5 | 0.428571 | 0.333333 |


In [17]:
nwsl.loc[nwsl['Player Name'] == 'Ali Krieger']

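| | Player Name | Team | Games Played | Games Started | Minutes Played | Goals | Assists | Shots | Shots on Goal | Fouls Committed | ... | Penalty Kick Goals | Yellow Cards | Red Cards | Season | Position | Goals per Game | Assists per Game | Shots per Game | Prop SoG | Shots per Goal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 76 | Ali Krieger | WAS | 15.0 | 14.0 | 1267.0 | 1.0 | 0.0 | 8.0 | 4.0 | 9.0 | ... | 0.0 | 2.0 | 0.0 | 2016.0 | NaN | 0.066667 | 0.0 | 0.533333 | 0.5 | 0.25 |
| 116 | Ali Krieger | ORL | 24.0 | 24.0 | 2160.0 | 0.0 | 2.0 | 3.0 | 0.0 | 13.0 | ... | 0.0 | 2.0 | 0.0 | 2017.0 | Defender | 0.0 | 0.083333 | 0.125 | 0.0 | 0.0 |
| 89 | Ali Krieger | ORL | 19.0 | 19.0 | 1675.0 | 0.0 | 2.0 | 7.0 | 1.0 | 12.0 | ... | 0.0 | 1.0 | 0.0 | 2018.0 | Defender | 0.0 | 0.105263 | 0.368421 | 0.142857 | 0.0 |
| 86 | Ali Krieger | ORL | 4.0 | 4.0 | 360.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 2019.0 | Defender | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 |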


In [18]:
nwsl.loc[nwsl['Player Name'] == 'Hayley Raso']

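| | Player Name | Team | Games Played | Games Started | Minutes Played | Goals | Assists | Shots | Shots on Goal | Fouls Committed | ... | Penalty Kick Goals | Yellow Cards | Red Cards | Season | Position | Goals per Game | Assists per Game | Shots per Game | Prop SoG | Shots per Goal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 131 | Hayley Raso | POR | 20.0 | 6.0 | 751.0 | 0.0 | 2.0 | 9.0 | 3.0 | 18.0 | ... | 0.0 | 3.0 | 0.0 | 2016.0 | NaN | 0.0 | 0.1 | 0.45 | 0.333333 | 0.0 |
| 12 | Hayley Raso | POR | 22.0 | 20.0 | 1784.0 | 6.0 | 3.0 | 16.0 | 9.0 | 38.0 | ... | 0.0 | 5.0 | 0.0 | 2017.0 | NaN | 0.272727 | 0.136364 | 0.727273 | 0.5625 | 0.666667 |
| 43 | Hayley Raso | POR | 12.0 | 9.0 | 774.0 | 2.0 | 2.0 | 16.0 | 5.0 | 17.0 | ... | 0.0 | 0.0 | 0.0 | 2018.0 | NaN | 0.166667 | 0.166667 | 1.333333 | 0.3125 | 0.4 |
| 120 | Hayley Raso | POR | 1.0 | 0.0 | 45.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 2019.0 | Attacker | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |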


As I was exploring the NWSL data, I noticed some missingness, specifically in the 'Position' column of `nwsl` data. When I looked over the position{}.csvs, I noticed that they were much smaller than the nwsl{}.csvs (general player stats), and the Position missingness became more evident after joining the dataframes. The position2019.csv was the only one that was not missing any data.

On further inspection of the NWSL site, I noticed that teams' older rosters did not have all of their players for that year. For example, in 2016, the Chicago Red Stars (CHI) only listed 5 players on their roster, resulting in only 5 player positions being scraped in position2016.csv for CHI, even though they had a full team of 20 players as shown in the nwsl_2016 dataframe.

To explore this missingness, I first focused on a specific player, Sam Kerr. For the 2016 and 2017 seasons, Kerr played for Sky Blue FC (NJ). She was traded to CHI in 2018 prior to the season start and played for CHI for the 2018 and 2019 seasons. However, only the 2019 dataset had her position as an Attacker listed, as shown above. It seemed that her previous team, NJ, had removed her from their roster and her current team, CHI, had not added her until 2019, so I wanted to check if this was true for all players.

Next, I looked at Christen Press, Ali Krieger, and Hayley Raso to further explore the missingness in this dataset. Press and Krieger are well-known USWNT players who were traded in the 2016-2019 timeframe, while Raso is a prominent AUSWNT player who has consistently played for the Portland Thorns (POR) since her acquisition from the Washington Spirit (WAS) in 2016.

For Press, though she was traded in 2018 from CHI to the Utah Royals (UTA), a position was listed for every year she played in the NWSL. When looking at the NWSL site, I noticed that if a team did not exist during a certain year (e.g., UTA in 2016), changing the roster URL as I did in teamscrape.py would redirect to the 2019 roster, which had all of the current players listed. This explains why Press did not have any missingness in Position.

With Krieger, Position was missing only in 2016, with her previous team, the Washington Spirit (WAS). Though she was traded in 2017 to the Orlando Pride (ORL), there is no missingness during her time with ORL, which differs from the trend we saw with Kerr.

For Raso, there was missing Position data for all years prior to 2019, even though Raso played for POR from 2016 onwards.

In the future, to confirm missingness type, I will perform a permutation test to see if the missingness of Position is Missing At Random (MAR) dependent on Team/Year, Not Missing At Random (NMAR) or Missing Completely At Random (MCAR).
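
A permutation test of this kind could be sketched as follows. This is one possible design, not the analysis this notebook commits to: it uses total variation distance (TVD) between the team distributions of rows with and without a Position as the test statistic, and the helper names, the number of permutations, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd

def tvd(groups, is_missing):
    """Total variation distance between the group distribution of rows
    with a missing value and that of rows without one."""
    present = groups[~is_missing].value_counts(normalize=True)
    missing = groups[is_missing].value_counts(normalize=True)
    return present.subtract(missing, fill_value=0).abs().sum() / 2

def missingness_permutation_test(df, group_col, missing_col, n_perm=1000, seed=0):
    """Return the observed TVD and the share of shuffles at least as extreme."""
    rng = np.random.default_rng(seed)
    groups = df[group_col].reset_index(drop=True)
    is_missing = df[missing_col].isnull().to_numpy()
    observed = tvd(groups, is_missing)
    # shuffling the missingness labels breaks any link with the group column
    simulated = np.array([tvd(groups, rng.permutation(is_missing))
                          for _ in range(n_perm)])
    p_value = (simulated >= observed).mean()
    return observed, p_value

# made-up data in which Position is missing for one team only
toy = pd.DataFrame({'Team': ['AAA'] * 20 + ['BBB'] * 20,
                    'Position': [np.nan] * 20 + ['Defender'] * 20})
observed, p_value = missingness_permutation_test(toy, 'Team', 'Position', n_perm=500)
```

A small p-value would suggest that missingness depends on the tested column, while a large one is consistent with no dependence on that column. Whether missingness depends on the unobserved Position values themselves (NMAR) cannot be settled from the observed data alone.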

In [19]:
null_cols = nwsl.columns[nwsl.isnull().any()]
data_nulls = nwsl[nwsl.isnull().any(axis=1)] #get the rows that have nulls
data_nulls.head()

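| | Player Name | Team | Games Played | Games Started | Minutes Played | Goals | Assists | Shots | Shots on Goal | Fouls Committed | ... | Penalty Kick Goals | Yellow Cards | Red Cards | Season | Position | Goals per Game | Assists per Game | Shots per Game | Prop SoG | Shots per Goal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | Nadia Nadim | POR | 20.0 | 20.0 | 1649.0 | 9.0 | 3.0 | 32.0 | 15.0 | 36.0 | ... | 4.0 | 4.0 | 0.0 | 2016.0 | NaN | 0.45 | 0.15 | 1.6 | 0.46875 | 0.6 |
| 6 | Manon Melis | SEA | 16.0 | 15.0 | 1051.0 | 7.0 | 1.0 | 29.0 | 21.0 | 7.0 | ... | 0.0 | 0.0 | 0.0 | 2016.0 | NaN | 0.4375 | 0.0625 | 1.8125 | 0.724138 | 0.333333 |
| 7 | Sofia Huerta | CHI | 20.0 | 20.0 | 1742.0 | 7.0 | 2.0 | 39.0 | 16.0 | 21.0 | ... | 0.0 | 2.0 | 0.0 | 2016.0 | NaN | 0.35 | 0.1 | 1.95 | 0.410256 | 0.4375 |
| 8 | Kim Little | SEA | 20.0 | 20.0 | 1795.0 | 6.0 | 2.0 | 27.0 | 16.0 | 17.0 | ... | 3.0 | 0.0 | 0.0 | 2016.0 | NaN | 0.3 | 0.1 | 1.35 | 0.592593 | 0.375 |
| 12 | Carli Lloyd | HOU | 7.0 | 7.0 | 553.0 | 5.0 | 3.0 | 26.0 | 12.0 | 4.0 | ... | 1.0 | 0.0 | 0.0 | 2016.0 | NaN | 0.714286 | 0.428571 | 3.714286 | 0.461538 | 0.416667 |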
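
Beyond inspecting individual rows, the share of missing positions per season and team gives a quick overview of where the gaps sit. A minimal sketch on made-up rows, assuming the same `Season`, `Team`, and `Position` column names as `nwsl`:

```python
import numpy as np
import pandas as pd

# made-up rows, not real player data
toy = pd.DataFrame({'Season': [2016, 2016, 2019, 2019],
                    'Team': ['AAA', 'BBB', 'AAA', 'BBB'],
                    'Position': [np.nan, 'Defender', 'Attacker', 'Defender']})

# the mean of a boolean mask is the proportion of missing values in each group
missing_rate = (toy['Position'].isnull()
                .groupby([toy['Season'], toy['Team']])
                .mean()
                .unstack('Team'))
```

Applied to `nwsl`, a table like this would show at a glance which team-seasons are missing positions.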


## Visualizations

Note: After the 2016 season, the franchise rights to the Western New York Flash (WNY) were sold to the owners of North Carolina FC and the team was rebranded as the North Carolina Courage (NC). Many players from the 2016 WNY team went on to play for NC from 2017 onwards. For the purposes of this EDA, I did not combine the 2016 WNY data with the NC data.

In [20]:
#this cell makes a bar graph of goals by team by season using the nwsl dataframe
#this cell can be modified to plot any stat in the nwsl dataframe (such as
#assists, minutes played, etc.) with any aggregation (sum, mean, etc.)
goals = nwsl.pivot_table(index = 'Team', columns = 'Season',
                         values = 'Goals', aggfunc = 'sum')
goals.iplot(kind = 'bar', title = 'Total Goals by Team by Season (Year)')

In [21]:
#this cell creates a line graph of the rankings of each team
rankings = get_ranks()
layout = go.Layout(
    yaxis=dict(
        autorange = 'reversed'
    ),
    title = 'Overall Team Ranking by Year'
)

fig = rankings.iplot(layout = layout, asFigure=True)
#iplot renders the figure inline and returns None, so there is no URL to keep
iplot(fig)

In [22]:
#this cell plots a bar graph of goal difference by team by season using
#the standings dataframe. This has less flexibility than the nwsl dataframe
#but is simpler in structure
seasoned = standings.pivot_table(index = 'Team',
                                 columns = 'Season',
                                 values = 'Goal Difference')
seasoned.iplot(kind = 'bar', title = 'Goal Difference by Team by Season (Year)')

In my visualizations of NWSL team standings and statistics, I began to notice ranking trends, the most intuitive being that teams with a higher goal difference (more goals for than against) usually ended the season with a higher rank. This seems obvious, since the team that scores the most goals and concedes the fewest is the "best team," but goal difference was not the only indicator of performance.

In 2016, CHI ended the season in 3rd place out of 10 with a goal difference of 4. Although WNY and the Seattle Reign (SEA) both had a higher goal difference than CHI, at 14 and 8 respectively, both teams were ranked lower than CHI (in 4th and 5th place respectively). This occurs again in 2017, where CHI once again ranks higher than SEA despite having a lower goal difference.

Similarly, ORL's highest rank (3rd) occurred in 2017, when their goal difference was the highest, while their lower ranks (9th in 2016 and 2019 and 7th in 2018) occurred when their goal difference was low (below zero, meaning they had more goals against them than goals for).
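
One way to put a number on how closely rank tracks goal difference is a Spearman rank correlation per season. The sketch below uses made-up standings with the same `Season`, `Rank`, and `Goal Difference` column names. It computes Spearman's coefficient as the Pearson correlation of the ranked values, which avoids any dependency beyond pandas.

```python
import pandas as pd

# made-up standings, not real NWSL data
toy = pd.DataFrame({'Season': [2016] * 4 + [2017] * 4,
                    'Team': list('ABCD') * 2,
                    'Rank': [1, 2, 3, 4, 1, 2, 3, 4],
                    'Goal Difference': [10, 5, -5, -10, 8, -2, 3, -9]})

def spearman(group):
    """Spearman correlation = Pearson correlation of the ranked values."""
    return group['Goal Difference'].rank().corr(group['Rank'].rank())

rho = toy.groupby('Season')[['Goal Difference', 'Rank']].apply(spearman)
```

Because rank 1 is best, a coefficient near -1 means higher goal difference goes with a better finish; values closer to 0 point to seasons where the ordering diverges, as in the CHI and SEA example above.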

Furthermore, when looking at standings data, I noticed that some teams tended to be more stable than others in their seasonal ranking. For example, in the 2016-2019 seasons, POR has consistently been ranked highly, specifically in 2nd for 2017-2019 and 1st in 2016. Meanwhile, WAS ended the 2016 season in 2nd place but dropped to 10th in 2017, finished 8th in 2018, and then shot up to 1st in the first half of the 2019 season. This is visible on the line graph of standings and also through the goal difference comparison throughout the seasons.
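
The stability comparison above can be summarized with a few per-team statistics. The sketch below uses only the POR and WAS ranks quoted in the preceding paragraph (2019 being a partial season); which summary statistics to use is my own illustrative choice.

```python
import pandas as pd

# ranks quoted in the paragraph above; 2019 is a partial season
ranks = pd.DataFrame({'POR': [1, 2, 2, 2], 'WAS': [2, 10, 8, 1]},
                     index=[2016, 2017, 2018, 2019])

# lower spread means a more stable team from season to season
stability = pd.DataFrame({'mean rank': ranks.mean(),
                          'rank std': ranks.std(),
                          'rank range': ranks.max() - ranks.min()})
```

The same three lines would work on the full output of `get_ranks()`, since it has the same seasons-by-teams layout.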

This trend may be due to factors not shown, such as player trades, coaching changes, and player injuries. In the future, I plan on scraping more extensive data from the NWSL site, including factors like these, and performing more detailed analysis and visualizations.