# NWSL Exploratory Data Analysis

## Introduction
The National Women's Soccer League (NWSL) is the premier professional women's soccer league in the United States. In this repository, I scrape player and team data from the NWSL website (www.nwslsoccer.com) and perform exploratory data analysis on the collected data.

In [1]:
#necessary imports to run the code
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import os

In [3]:
#imports for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.plotly as py
import plotly.graph_objs as go
import plotly
import cufflinks as cf

#for offline plotting
plotly.offline.init_notebook_mode(connected=True)
cf.set_config_file(offline=True, world_readable=True, theme='ggplot')

In [4]:
%load_ext autoreload
%autoreload 2

## Scraping
In the subdirectory "scraping", there are three Python files written to scrape data from the official NWSL website: statscrape.py, teamscrape.py, and standingscrape.py. 

The statscrape.py file scrapes player data from the Stats page of the website for each player in the league from 2016 through 2019 (the years for which the league has made player/team stats publicly available) and compiles them into csv files by year, named "nwsl{}.csv" for each year.

The teamscrape.py file scrapes player data from the Team pages of the website for each team for each year the team has existed and compiles them into csv files by year, entitled "position{}.csv" for each year.

The standingscrape.py file scrapes team data from the Standings pages of the website for each team for each year the league has provided public stats (2016 - 2019) and compiles them into csv files by year, named "standings{}.csv" for each year. This file also reshapes the dataframes into a more readable and usable form by separating the scraped Home and Away game data into separate columns based on location (Home or Away) and game result (Win, Loss, Tie).
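
The Home/Away split described above could look roughly like the following. This is a minimal sketch, not the code in standingscrape.py: it assumes the scraped Home and Away columns hold hyphen-separated win-loss-tie strings such as "5-3-2", and the helper `split_record`, the output column names, and the toy values are all hypothetical.

```python
import pandas as pd

def split_record(df, col):
    """Split a 'W-L-T' string column into three integer columns.

    Assumes (hypothetically) that df[col] holds strings like '5-3-2'.
    """
    parts = df[col].str.split('-', expand=True).astype(int)
    parts.columns = ['{} Win'.format(col),
                     '{} Loss'.format(col),
                     '{} Tie'.format(col)]
    # drop the raw string column and append the three numeric ones
    return pd.concat([df.drop(col, axis=1), parts], axis=1)

# toy standings table with made-up values
standings = pd.DataFrame({
    'Team': ['Team A', 'Team B'],
    'Home': ['5-3-2', '4-4-2'],
    'Away': ['3-5-2', '6-2-2'],
})

for location in ['Home', 'Away']:
    standings = split_record(standings, location)

print(standings)
```

Keeping one column per location and result makes it straightforward to sum or compare home and away performance later.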

In the following cell, I run these three files to create the csvs I will be working with in the rest of this notebook. The commands are currently commented out since they only need to be run once to collect the data. Note, however, that the 2019 NWSL season is currently taking place, so rerunning these files will fetch the most up-to-date data.

For this analysis, I will only be looking at the April and May statistics for the 2019 season, although my code will work with future data as well since it will all be formatted in the same way. It is also worth noting that as of June 2019, many NWSL teams are missing players who also play for their national teams (such as the USWNT, CANWNT, etc.) due to the Women's World Cup occurring this summer.

In [127]:
#py files to run to scrape the data from the NWSL page. Only need to run once.
#!python ./scraping/statscrape.py
#!python ./scraping/teamscrape.py
#!python ./scraping/standingscrape.py

## Cleaning and Pre-Analysis
This section of the notebook includes reading in the raw data and applying some basic cleaning to make the data easier to work with.

Some cleaning/pre-analysis strategies I used were:

- Combining the nwsl{}.csv and position{}.csv files into full{}.csv files so I could access both player stats and position.
- Adding "Season" columns to each full{}.csv so that I could concat them all into a large dataframe to work with while still retaining a distinction between yearly data.
- Creating "Goals per Game", "Assists per Game", "Shots per Game", "Shots per Goal" (goals divided by shots on goal), and "Prop SoG" (the proportion of shots that are on goal) for each player in the dataset
- Combining all full{}.csv dataframes into a larger dataframe with data from all years
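
The merge, Season tagging, and concatenation steps listed above can be sketched on toy data as follows. The rows are made up for illustration; only the column names "Player Name", "Player", "Position", and "Season" come from this notebook.

```python
import pandas as pd

# toy stand-ins for one year's nwsl{}.csv and position{}.csv (made-up rows)
stats = pd.DataFrame({
    'Player Name': ['Player A', 'Player B'],
    'Goals': [3, 0],
})
positions = pd.DataFrame({
    'Player': ['Player A'],
    'Position': ['Forward'],
})

# left merge keeps every player with stats, even without a known position
full_2018 = stats.merge(positions, left_on='Player Name',
                        right_on='Player', how='left').drop('Player', axis=1)
full_2018['Season'] = 2018

# a second season, tagged the same way
full_2019 = full_2018.copy()
full_2019['Season'] = 2019

# stack the seasons; the Season column keeps the years distinguishable
combined = pd.concat([full_2018, full_2019], ignore_index=True)
print(combined)
```

Because the merge is a left join, a player missing from the position file still appears in the result, with a null Position.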

In [16]:
def combination(start_year, end_year):
    """
    Combines the nwsl.csv and position.csv csvs for each
    year in the given range and writes the result to full{year}.csv
    
    :parameters:
    start_year - integer indicating start year of data
    end_year - integer indicating end year of data
    """
    for i in range(start_year, end_year + 1):
        nwsl_file = 'nwsl{}.csv'.format(i)
        position_file = 'position{}.csv'.format(i)

        nwsl = pd.read_csv(os.path.join('data', 'nwsl', nwsl_file))
        position = pd.read_csv(os.path.join('data', 'position', position_file))
        df = nwsl.merge(position, left_on='Player Name',
                            right_on='Player', how = 'left').drop('Player', axis = 1)

        name = 'full{}.csv'.format(i)
        path = os.path.join('data', 'full', name)

        df.to_csv(path, index=False)

In [17]:
#run to join all of the nwsl/position csvs
combination(2016, 2019)

In [18]:
#getting all of the full.csv files in the subdirectory
file_path = os.path.join('data', 'full')
csvs = os.listdir(path = file_path)
files = []
#for loop to get all the full.csv paths
for file in csvs:
    fp = os.path.join(file_path, file)
    files.append(fp)
#for organization purposes later
files.sort()
files
#use nwsl.files, full for with prediction later

['data/full/full2016.csv',
 'data/full/full2017.csv',
 'data/full/full2018.csv',
 'data/full/full2019.csv']

In [44]:
#reading all the files from the subdirectory
#adding a "season" column so we can combine the full dataframes
#and still be able to differentiate between seasons
nwsl_2016 = pd.read_csv(files[0])
nwsl_2016['Season'] = 2016

nwsl_2017 = pd.read_csv(files[1])
nwsl_2017['Season'] = 2017

nwsl_2018 = pd.read_csv(files[2])
nwsl_2018['Season'] = 2018

nwsl_2019 = pd.read_csv(files[3])
nwsl_2019['Season'] = 2019

all_nwsl = [nwsl_2016, nwsl_2017, nwsl_2018, nwsl_2019]

In [47]:
def calculate_stats(df):
    """
    Calculates Goals per Game, Assists per Game, Shots per Game, 
    Proportion of Shots on Goal per Goal, and Proportion of Shots on
    Goal, for each player in the dataset. Creates columns for these 
    values in each dataframe.
    
    :parameters:
    df - dataframe like nwsl.csv/full.csv with necessary columns
    """
    #calculating stats, self explanatory column names
    df['Goals per Game'] = df['Goals']/df['Games Played']
    df['Assists per Game'] = df['Assists']/df['Games Played']
    df['Shots per Game'] = df['Shots']/df['Games Played']
    df['Prop SoG'] = df['Shots on Goal']/df['Shots']
    #despite its name, 'Shots per Goal' holds goals per shot on goal
    df['Shots per Goal'] = df['Goals']/df['Shots on Goal']
    
    int_cols = df.columns[2:].tolist()
    int_cols.remove('Position')
    for each in int_cols:
        df[each] = df[each].astype(float)
    
    #May create a classifier for Position later, leaving nulls in this column
    nonPos = df.loc[:, ~df.columns.isin(['Position'])].columns.tolist()
    df[nonPos] = df[nonPos].fillna(0)
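
The final `fillna(0)` in the function above matters because ratio columns produce nulls whenever a denominator is zero. The sketch below, on made-up rows, shows how pandas handles the two zero-denominator cases. The third row (a goal with zero shots) should not occur in consistent data and is included only to illustrate that `fillna` does not replace infinities; the `replace` step is a suggestion, not part of the original function.

```python
import numpy as np
import pandas as pd

# made-up rows: one regular player, one with no shots, one inconsistent row
toy = pd.DataFrame({
    'Goals': [2, 0, 1],
    'Shots': [8, 0, 0],
})

ratio = toy['Goals'] / toy['Shots']
# 2/8 -> 0.25, 0/0 -> NaN, 1/0 -> inf

# fillna only touches NaN, so the inf from 1/0 survives
filled = ratio.fillna(0)

# replacing inf with NaN first zeroes out both cases
cleaned = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(filled.tolist(), cleaned.tolist())
```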

In [48]:
#apply above function to all dataframes in the list
for each in all_nwsl:
    calculate_stats(each)

In [56]:
#nwsl is the combined data for all years
nwsl = pd.concat(all_nwsl)

## Missingness

Notes on missingness in the position data:

- Some teams don't post their full older rosters; for example, CRS 2016 only had 5 players.
- Some players are temps/hires and not fully contracted, or were traded. If a player traded teams, their old team discards their data.
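
One way to check the position missingness noted above is to compute the share of players without a Position for each team and season. This is a minimal sketch on made-up rows; the team names and values are hypothetical, and on the real data `toy` would be replaced by the combined `nwsl` dataframe.

```python
import pandas as pd

# made-up rows standing in for the combined nwsl dataframe
toy = pd.DataFrame({
    'Player Name': ['A', 'B', 'C', 'D'],
    'Team': ['Team X', 'Team X', 'Team Y', 'Team Y'],
    'Season': [2016, 2016, 2016, 2019],
    'Position': [None, 'Defender', None, 'Forward'],
})

# share of players without a Position, per team and season
missing_share = (toy['Position'].isna()
                 .groupby([toy['Team'], toy['Season']])
                 .mean()
                 .rename('Share Missing Position'))
print(missing_share)
```

A high share concentrated in older seasons would be consistent with teams not posting their full older rosters.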

## Visualizations

In [130]:
drop_cols = ['Games Played', 'Games Started', 'Minutes Played']
totals = nwsl.drop(drop_cols, axis = 1).groupby(['Team', 'Season']).sum()
tester = totals[['Goals']].unstack()
tester.columns = tester.columns.get_level_values(1)
tester.iplot(kind = 'bar', title = 'Total Goals by Team by Year')
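
The reshaping in the cell above (group by Team and Season, sum, then unstack Season into columns) can be seen more easily on toy data. The rows below are made up, and the commented-out `plot` call is a suggested static alternative to `iplot`, not part of the original notebook.

```python
import pandas as pd

# made-up player rows
toy = pd.DataFrame({
    'Team': ['Team X', 'Team X', 'Team Y', 'Team Y', 'Team X'],
    'Season': [2018, 2018, 2018, 2019, 2019],
    'Goals': [3, 2, 4, 1, 5],
})

# sum goals per (Team, Season), then pivot Season into the columns
goals_by_team = toy.groupby(['Team', 'Season'])['Goals'].sum().unstack()
print(goals_by_team)

# static alternative to iplot, using pandas' matplotlib backend:
# goals_by_team.plot(kind='bar', title='Total Goals by Team by Year')
```

Each row of the result is a team and each column a season, which is the shape a grouped bar chart expects.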

## Analysis

In [None]:
#get the positions of each player, predict what position they play based on goals/assists/etc.