# NWSL Exploratory Data Analysis

## Introduction
The National Women's Soccer League (NWSL) is the premier professional women's soccer league in the United States. In this repository, I scrape player and team data from the NWSL website (www.nwslsoccer.com) and perform exploratory data analysis on the collected data.

In [1]:
#necessary imports to run the code
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

In [2]:
%load_ext autoreload
%autoreload 2

## Scraping
In the subdirectory "scraping", there are two Python files written to scrape data from the official NWSL website: statscrape.py and teamscrape.py. 

The statscrape.py file scrapes player data from the Stats page of the website for each player in the league from 2016 through 2019 and compiles it into csv files by year, named "nwsl{}.csv" for each year.

The teamscrape.py file scrapes player data from the Team pages of the website for each team for each year the team has existed and compiles them into csv files by year, entitled "position{}.csv" for each year.

In the following cell of code, I run these two files to create the csvs I will be working with in the rest of this notebook. They are currently commented out, since they only need to be run once to collect the data. Note, however, that the 2019 NWSL season is still in progress, so rerunning these files will fetch the most up-to-date data.

For this analysis, I will only be looking at the April and May statistics for the 2019 season, although my code will work with future data as well, since it will all be formatted in the same way. It is also worth noting that as of June 2019, many NWSL teams are missing players who also play for their national teams (such as the USWNT, CANWNT, etc.) because the Women's World Cup is taking place this summer.

In [3]:
#py files to run to scrape the data from the NWSL page. Only need to run once.
#!python ./scraping/statscrape.py
#!python ./scraping/teamscrape.py

## Cleaning and Pre-Analysis
#TODO: GOAL/ASSISTS PER GAME PERCENTAGES/RATES

In [4]:
def combination(start_year, end_year):
    """
    Combines the nwsl{}.csv and position{}.csv csvs for each
    year in the given range and writes the joined result to
    data/full/full{}.csv.
    
    :parameters:
    start_year - integer indicating start year of data
    end_year - integer indicating end year of data
    """
    #make sure the output directory exists before writing to it
    os.makedirs(os.path.join('data', 'full'), exist_ok=True)

    for i in range(start_year, end_year + 1):
        nwsl_file = 'nwsl{}.csv'.format(i)
        position_file = 'position{}.csv'.format(i)

        nwsl = pd.read_csv(os.path.join('data', 'nwsl', nwsl_file))
        position = pd.read_csv(os.path.join('data', 'position', position_file))
        #left join keeps every player with stats, even without a position
        df = nwsl.merge(position, left_on='Player Name',
                        right_on='Player', how='left').drop('Player', axis=1)

        name = 'full{}.csv'.format(i)
        path = os.path.join('data', 'full', name)

        df.to_csv(path, index=False)

In [5]:
combination(2016, 2019) #run to join all of the nwsl/position csvs
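
Because the join above is a left merge on player name, any player in the stats file without a matching row in the position file ends up with a null Position. As a minimal sketch, assuming made-up rows in place of the real csvs, pandas' `indicator=True` option makes those unmatched players easy to list:

```python
import pandas as pd

# Toy stand-ins for one season's nwsl{}.csv and position{}.csv (made-up rows).
nwsl = pd.DataFrame({
    'Player Name': ['Player A', 'Player B', 'Player C'],
    'Team': ['X', 'X', 'Y'],
    'Goals': [3, 0, 1],
})
position = pd.DataFrame({
    'Player': ['Player A', 'Player C'],
    'Position': ['Forward', 'Defender'],
})

# Same left join as combination(), plus indicator=True to flag unmatched rows.
merged = nwsl.merge(position, left_on='Player Name', right_on='Player',
                    how='left', indicator=True).drop('Player', axis=1)

# Rows marked 'left_only' had no entry in the position data.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Player Name'].tolist()
print(unmatched)  # ['Player B']
```

The `_merge` column is only a diagnostic here and would be dropped before writing the joined csv.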

In [6]:
#getting all of the full.csv files in the subdirectory
full_path = os.path.join('data', 'full')
csvs = os.listdir(path = full_path)
files = []
#for loop to get all the full.csv paths
for file in csvs:
    fp = os.path.join(full_path, file)
    files.append(fp)
#for organization purposes later
files.sort()
files
#use nwsl.files, full for with prediction later

['data/full/full2016.csv',
 'data/full/full2017.csv',
 'data/full/full2018.csv',
 'data/full/full2019.csv']

In [7]:
nwsl_2016 = pd.read_csv(files[0])
nwsl_2017 = pd.read_csv(files[1])
nwsl_2018 = pd.read_csv(files[2])
nwsl_2019 = pd.read_csv(files[3]) #Training
all_nwsl = [nwsl_2016, nwsl_2017, nwsl_2018, nwsl_2019]
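
Indexing `files` by position works for the four seasons collected so far. As a hedged sketch of how future seasons could be picked up automatically, the hypothetical helper `load_seasons` below (not part of this repository) keys each dataframe by the year found in its file name. It is demonstrated on throwaway files so that it runs on its own:

```python
import os
import re
import tempfile

import pandas as pd


def load_seasons(paths):
    """Hypothetical helper: map each season year to its dataframe."""
    seasons = {}
    for fp in sorted(paths):
        match = re.search(r'full(\d{4})\.csv$', os.path.basename(fp))
        if match is None:
            continue  # skip anything that is not a full{year}.csv file
        seasons[int(match.group(1))] = pd.read_csv(fp)
    return seasons


# Demonstration on temporary files with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for year in (2018, 2019):
        fp = os.path.join(tmp, 'full{}.csv'.format(year))
        pd.DataFrame({'Player Name': ['Player A'],
                      'Goals': [year - 2017]}).to_csv(fp, index=False)
        demo_paths.append(fp)
    demo_paths.append(os.path.join(tmp, 'notes.txt'))  # ignored by the helper
    seasons = load_seasons(demo_paths)

print(sorted(seasons))  # [2018, 2019]
```

With the real data, `load_seasons(files)` would return a dictionary such as `seasons[2019]` in place of the separate `nwsl_2019` variable.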

In [8]:
def calculate_stats(df):
    """
    Calculates Goals per Game, Assists per Game, Shots per Game, 
    Proportion of Shots on Goal, Goals per Shot on Goal, 
    and Proportion of Successful Penalty Kicks, for each player 
    in the dataset. Adds these columns to the dataframe in place.
    
    :parameters:
    df - dataframe like nwsl.csv/full.csv with necessary columns
    """
    #calculating stats, self explanatory column names
    df['Goals per Game'] = df['Goals']/df['Games Played']
    df['Assists per Game'] = df['Assists']/df['Games Played']
    df['Shots per Game'] = df['Shots']/df['Games Played']
    df['Prop SoG'] = df['Shots on Goal']/df['Shots']
    #share of shots on goal that became goals
    df['Goals per SoG'] = df['Goals']/df['Shots on Goal']
    df['Prop Penalty'] = df['Penalty Kick Goals']/df['Penalty Kicks Attempted']
    
    #every column after the first two (except Position) is numeric
    numeric_cols = df.columns[2:].tolist()
    numeric_cols.remove('Position')
    for each in numeric_cols:
        df[each] = df[each].astype(float)
    #a nonzero count divided by zero gives inf, which fillna would not catch
    df[numeric_cols] = df[numeric_cols].replace([np.inf, -np.inf], np.nan)
    
    #May create a classifier for Position later, leaving nulls in this column
    nonPos = df.loc[:, ~df.columns.isin(['Position'])].columns.tolist()
    df[nonPos] = df[nonPos].fillna(0)

In [9]:
#apply above function to all dataframes in the list
for each in all_nwsl:
    calculate_stats(each)
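
The per-game and proportion columns divide by counts that can be zero, for example for a player with no games played or no shots taken. As a small illustration on made-up numbers, this is what pandas returns in those cases, and why `fillna(0)` on its own only covers the 0/0 case:

```python
import numpy as np
import pandas as pd

# Made-up rows: a regular player, one with no games, and an inconsistent row.
demo = pd.DataFrame({
    'Goals': [2, 0, 1],
    'Games Played': [4, 0, 0],
})

rate = demo['Goals'] / demo['Games Played']
# 2/4 -> 0.5, 0/0 -> NaN, 1/0 -> inf

filled = rate.fillna(0)  # replaces NaN only, inf is left as it is
cleaned = rate.replace([np.inf, -np.inf], np.nan).fillna(0)

print(filled.tolist())   # [0.5, 0.0, inf]
print(cleaned.tolist())  # [0.5, 0.0, 0.0]
```

Whether a zero is the right fill value for these rates is a modeling choice; it treats players with no attempts the same as players who never succeeded.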

## Missingness

Note on missing position data: some teams do not post their full rosters for older seasons. For example, CRS 2016 only had 5 players listed.

Note on missing player positions: some players are temporary signings and not fully contracted, and others were traded. If a player changed teams, their old team discards their data.
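
One way to check how uneven the missing positions are across teams is to compute the share of null Position values per team. This is a minimal sketch on made-up rows; with the real data, `demo` would be one of the season dataframes such as `nwsl_2016`:

```python
import pandas as pd

# Made-up rows standing in for one season's joined data.
demo = pd.DataFrame({
    'Player Name': ['Player A', 'Player B', 'Player C', 'Player D'],
    'Team': ['X', 'X', 'Y', 'Y'],
    'Position': ['Forward', None, None, None],
})

# Mean of a boolean mask is the fraction of True values in each group.
missing_share = demo['Position'].isna().groupby(demo['Team']).mean()
print(missing_share)
```

A team with a share close to 1 would point to an incomplete roster page rather than to individual players.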

## Visualizations

## Analysis

In [10]:
#get the positions of each player, predict what position they play based on goals/assists/etc.

## Prediction Model

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [None]:
#TODO: CREATE FEATURES FOR PREDICTION

In [12]:
#NOTE: nwsl_2019 has positions for all players due to its role as
#most recent data on the league

y = nwsl_2019['Position']
features = nwsl_2019.drop(['Position', 'Team', 'Player Name'], axis=1)
#put in a column transformer and onehotencode by team

#split first so the scaler is fit on the training rows only;
#no random_state is set, so the accuracies vary from run to run
X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.33)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

neighbors = KNeighborsClassifier()
neighbors.fit(X_train, y_train)
neigh_pred = neighbors.predict(X_test)
neigh_acc = accuracy_score(y_true=y_test, y_pred=neigh_pred)

bayesian = GaussianNB()
bayesian.fit(X_train, y_train)
bay_pred = bayesian.predict(X_test)
bay_acc = accuracy_score(y_true=y_test, y_pred=bay_pred)

forest = RandomForestClassifier(n_estimators=100, max_depth=3,
                                min_samples_split=20, min_samples_leaf=10)
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)
forest_acc = accuracy_score(y_true=y_test, y_pred=forest_pred)
#each value is an accuracy score on the held-out third of the data
values = [('kNN', neigh_acc), ('bayes', bay_acc), ('forest', forest_acc)]
values

[('kNN', 0.4084507042253521),
 ('bayes', 0.5492957746478874),
 ('forest', 0.4788732394366197)]
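
The comment in the cell above mentions putting the features in a column transformer and one-hot encoding by team. A hedged sketch of what that could look like follows, using synthetic random data and only two numeric columns as stand-ins, so the accuracy it produces says nothing about the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (random values, not real NWSL statistics).
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    'Team': rng.choice(['X', 'Y', 'Z'], size=n),
    'Goals per Game': rng.random(n),
    'Shots per Game': rng.random(n) * 3,
    'Position': rng.choice(['Forward', 'Midfielder', 'Defender'], size=n),
})

numeric_cols = ['Goals per Game', 'Shots per Game']
preprocess = ColumnTransformer([
    # one-hot encode the team, ignoring teams unseen during fitting
    ('team', OneHotEncoder(handle_unknown='ignore'), ['Team']),
    # scale the numeric statistics
    ('numeric', StandardScaler(), numeric_cols),
])
model = Pipeline([('preprocess', preprocess),
                  ('knn', KNeighborsClassifier())])

X_train, X_test, y_train, y_test = train_test_split(
    demo.drop('Position', axis=1), demo['Position'],
    test_size=0.33, random_state=0)

# The pipeline fits the encoder and scaler on the training rows only.
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Wrapping the preprocessing in a `Pipeline` keeps the scaling and encoding inside the fit step, so the same object could later be passed to cross-validation without leaking test rows into the scaler.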