<a id='top'></a>

# Capology Player Web Scraping
##### Notebook to scrape raw data from [Capology](https://www.capology.com/) using [Selenium](https://www.selenium.dev/) and [Beautiful Soup](https://pypi.org/project/beautifulsoup4/)

### By [Edd Webster](https://www.twitter.com/eddwebster)
Notebook first written: 01/08/2021<br>
Notebook last updated: 06/08/2021

![title](../../img/logos/capology-logo.jpeg)

Click [here](#section3) to jump straight to the Data Scraping section and skip the [Notebook Dependencies](#section1) and [Project Brief](#section2) sections. Or click [here](#section5) to jump straight to the Summary.

___

<a id='sectionintro'></a>

## <a id='import_libraries'>Introduction</a>
This notebook scrapes player salary data from [Capology](https://www.capology.com/), using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames, and [Selenium](https://www.selenium.dev/) and [Beautiful Soup](https://pypi.org/project/beautifulsoup4/) for web scraping.

For more information about this notebook and the author, I'm available through all the following channels:
*    [eddwebster.com](https://www.eddwebster.com/);
*    edd.j.webster@gmail.com;
*    [@eddwebster](https://www.twitter.com/eddwebster);
*    [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);
*    [github/eddwebster](https://github.com/eddwebster/);
*    [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster);
*    [kaggle.com/eddwebster](https://www.kaggle.com/eddwebster); and
*    [hackerrank.com/eddwebster](https://www.hackerrank.com/eddwebster).

![title](../../img/fifa21eddwebsterbanner.png)

The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/football_analytics) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/A%29%20Web%20Scraping/FBref%20Web%20Scraping%20and%20Parsing.ipynb).

___

<a id='sectioncontents'></a>

## <a id='notebook_contents'>Notebook Contents</a>
1.    [Notebook Dependencies](#section1)<br>
2.    [Project Brief](#section2)<br>
3.    [Data Scraping](#section3)<br>
4.    [Data Unification](#section4)<br>
5.    [Summary](#section5)<br>
6.    [Next Steps](#section6)<br>
7.    [Bibliography](#section7)<br>

___

<a id='section1'></a>

## <a id='#section1'>1. Notebook Dependencies</a>

This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
*    [`Jupyter notebooks`](https://jupyter.org/) for the notebook environment in which this project is presented;
*    [`NumPy`](http://www.numpy.org/) for multidimensional array computing;
*    [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation;
*    `tqdm` for a clean progress bar;
*    `requests` for executing HTTP requests;
*    [`Selenium`](https://www.selenium.dev/) for driving the browser that loads each page;
*    [`Beautifulsoup`](https://pypi.org/project/beautifulsoup4/) for web scraping; and
*    [`matplotlib`](https://matplotlib.org/contents.html?v=20200411155018) for data visualisations.

All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/).

### Import Libraries and Modules

In [None]:
# Python ≥3.6 is required (f-strings are used throughout this notebook)
import platform
import sys, getopt
assert sys.version_info >= (3, 6)
import csv

# Import Dependencies
%matplotlib inline

# Math Operations
import numpy as np
from math import pi

# Datetime
import datetime
from datetime import date
import time

# Data Preprocessing
import pandas as pd
import os
import re
import random
import glob
from io import BytesIO
from pathlib import Path

# Working with JSON
import json
from pandas import json_normalize   # top-level import, available from pandas 1.0

# Web Scraping
from selenium import webdriver
from bs4 import BeautifulSoup
import requests


# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno

# Progress Bar
from tqdm import tqdm

# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML

# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

print('Setup Complete')

In [None]:
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))

### Defined Variables and Lists

##### Date 

In [6]:
# Define today's date as a DDMMYYYY string, used to date-stamp saved files
todays_date = datetime.datetime.now().strftime('%d%m%Y')

In [7]:
# Define variables and lists

## Define season
season = '2020'    # '2020' for the 20/21 season

# Create 'Full Season' and 'Short Season' strings

## Full season, e.g. '2020/2021'
full_season_string = f'{int(season)}/{int(season) + 1}'

## Short season, e.g. '2021' (last two digits of each year)
short_season_string = str(int(season))[-2:] + str(int(season) + 1)[-2:]
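The season strings above can also be produced by one small helper, which makes the formats used in this notebook explicit. This is only a sketch: the function name `season_labels` and its dictionary keys are illustrative assumptions and do not appear elsewhere in the notebook.

```python
def season_labels(start_year):
    """Return the season label formats for a season starting in `start_year`.

    `season_labels` and its dictionary keys are hypothetical names used
    only for illustration.
    """
    start = int(start_year)
    end = start + 1
    return {
        'full': f'{start}/{end}',                       # e.g. '2020/2021'
        'short': f'{str(start)[-2:]}{str(end)[-2:]}',   # e.g. '2021'
        'hyphenated': f'{start}-{end}',                 # e.g. '2020-2021', as passed to the scrapers
    }


labels = season_labels('2020')
print(labels['full'], labels['short'], labels['hyphenated'])
```

Keeping all three formats in one place avoids the start year and the labels drifting apart when a new season is added.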

In [8]:
todays_date

'04082021'

##### Scraping Variables

In [9]:
options = webdriver.ChromeOptions()

In [10]:
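# Run Chrome headless; '--no-sandbox' and '--disable-dev-shm-usage' are commonly needed in containerised environments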
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

##### Teams and Leagues

In [11]:
# Premier League

## 2013-2014 PL


## 2014-2015 PL


## 2015-2016 PL


## 2016-2017 PL
lst_teams_pl_1617 = ['arsenal', 'bournemouth', 'burnley', 'chelsea', 'crystal-palace', 'everton',
             'hull-city', 'leicester', 'liverpool', 'manchester-city', 'manchester-united',
             'middlesbrough', 'southampton', 'stoke-city', 'sunderland', 'swansea', 'tottenham',
             'watford', 'west-bromwich', 'west-ham']

## 2017-2018 PL
lst_teams_pl_1718 = ['arsenal', 'bournemouth', 'brighton', 'burnley', 'chelsea', 'crystal-palace', 'everton',
             'huddersfield', 'leicester', 'liverpool', 'manchester-city', 'manchester-united',
             'newcastle', 'southampton', 'stoke-city', 'swansea', 'tottenham',
             'watford', 'west-bromwich', 'west-ham']

## 2018-2019 PL
lst_teams_pl_1819 = ['arsenal', 'bournemouth', 'brighton', 'burnley', 'cardiff', 'chelsea',
             'crystal-palace', 'everton', 'fulham', 'huddersfield', 'leicester',
             'liverpool', 'manchester-city', 'manchester-united', 'newcastle',
             'southampton', 'tottenham', 'watford', 'west-ham', 'wolverhampton']

## 2019-2020 PL
lst_teams_pl_1920 = ['arsenal', 'aston-villa', 'bournemouth', 'brighton', 'burnley', 'chelsea',
             'crystal-palace', 'everton', 'leicester',
             'liverpool', 'manchester-city', 'manchester-united', 'newcastle',
             'norwich', 'sheffield-united', 'southampton', 'tottenham', 'watford',
             'west-ham', 'wolverhampton']

## 2020-2021 PL
lst_teams_pl_2021 = ['arsenal', 'aston-villa', 'brighton', 'burnley', 'chelsea',
             'crystal-palace', 'everton', 'fulham', 'leeds', 'leicester',
             'liverpool', 'manchester-city', 'manchester-united', 'newcastle',
             'sheffield-united', 'southampton', 'tottenham', 'west-bromwich',
             'west-ham', 'wolverhampton']

In [12]:
# Serie A

## 2013-2014 Serie A
#lst_teams_sa_1314 = ['']

## 2014-2015 Serie A

## 2015-2016 Serie A
lst_teams_sa_1516 = ['ac-milan', 'atalanta', 'bologna', 'carpi', 'chievo-verona', 'empoli', 'fiorentina', 'frosinone',
                     'genoa', 'hellas-verona', 'inter-milan', 'juventus', 'lazio', 'napoli', 'palermo', 'roma',
                     'sampdoria', 'sassuolo', 'torino', 'udinese']

## 2016-2017 Serie A
lst_teams_sa_1617 = ['ac-milan', 'atalanta', 'bologna', 'cagliari', 'chievo-verona', 'crotone', 'empoli', 'fiorentina',
                     'genoa', 'inter-milan', 'juventus', 'lazio', 'napoli', 'palermo', 'pescara', 'roma',
                     'sampdoria', 'sassuolo', 'torino', 'udinese']

## 2017-2018 Serie A
lst_teams_sa_1718 = ['ac-milan', 'atalanta', 'benevento', 'bologna', 'cagliari', 'chievo-verona', 'crotone', 'fiorentina',
                     'genoa', 'hellas-verona', 'inter-milan', 'juventus', 'lazio', 'napoli', 'roma',
                     'sampdoria', 'sassuolo', 'spal', 'torino', 'udinese']

## 2018-2019 Serie A
lst_teams_sa_1819 = ['ac-milan', 'atalanta', 'bologna', 'cagliari', 'chievo-verona', 'empoli', 'fiorentina',
                     'frosinone', 'genoa', 'inter-milan', 'juventus', 'lazio', 'napoli', 'parma', 'roma',
                     'sampdoria', 'sassuolo', 'spal', 'torino', 'udinese']

## 2019-2020 Serie A
lst_teams_sa_1920 = ['ac-milan', 'atalanta', 'bologna', 'brescia', 'cagliari', 'fiorentina',
                     'genoa', 'hellas-verona', 'inter-milan', 'juventus', 'lazio', 'lecce', 'napoli', 'parma', 'roma',
                     'sampdoria', 'sassuolo', 'spal', 'torino', 'udinese']

## 2020-2021 Serie A
lst_teams_sa_2021 = ['ac-milan', 'atalanta', 'benevento', 'bologna', 'cagliari', 'crotone', 'fiorentina',
                     'genoa', 'hellas-verona', 'inter-milan', 'juventus', 'lazio', 'napoli', 'parma', 'roma',
                     'sampdoria', 'sassuolo', 'spezia', 'torino', 'udinese']

In [13]:
# La Liga

## 2013-2014 La Liga


## 2014-2015 La Liga


## 2015-2016 La Liga


## 2016-2017 La Liga
lst_teams_ll_1617 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'celta-vigo', 'deportivo', 'eibar', 'espanyol',
                     'granada', 'las-palmas', 'malaga', 'osasuna', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'sporting-gijon', 'valencia', 'villarreal']

## 2017-2018 La Liga
lst_teams_ll_1718 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'celta-vigo', 'deportivo', 'eibar', 'espanyol',
                     'getafe', 'girona', 'las-palmas', 'levante', 'malaga', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'valencia', 'villarreal']

## 2018-2019 La Liga
lst_teams_ll_1819 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'celta-vigo', 'eibar', 'espanyol',
                     'getafe', 'girona', 'huesca', 'leganes', 'levante', 'rayo-vallecano', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'valencia', 'valladolid']

## 2019-2020 La Liga
lst_teams_ll_1920 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'celta-vigo', 'eibar', 'espanyol',
                     'getafe', 'granada', 'leganes', 'levante', 'mallorca', 'osasuna', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'valencia', 'valladolid']

## 2020-2021 La Liga
lst_teams_ll_2021 = ['alaves', 'athletic-club', 'atletico-madrid', 'barcelona', 'cadiz', 'celta-vigo', 'eibar',
                     'elche', 'getafe', 'granada', 'huesca', 'levante', 'osasuna', 'real-betis', 'real-madrid',
                     'real-sociedad', 'sevilla', 'valencia', 'valladolid']

In [14]:
# Bundesliga

## 2013-2014 Bundesliga


## 2014-2015 Bundesliga


## 2015-2016 Bundesliga


## 2016-2017 Bundesliga
lst_teams_b_1617 = ['augsburg', 'bayer-leverkusen', 'bayern-munich', 'borussia-dortmund', 'darmstadt',
                    'eintracht-frankfurt', 'freiburg', 'hamburg', 'hertha-berlin', 'hoffenheim',
                    'ingolstadt', 'koln', 'leipzig', 'mainz', 'monchengladbach', 'schalke-04', 'werder-bremen', 
                    'wolfsburg']

## 2017-2018 Bundesliga
lst_teams_b_1718 = ['augsburg', 'bayer-leverkusen', 'bayern-munich', 'borussia-dortmund', 'eintracht-frankfurt',
                    'freiburg', 'hamburg', 'hannover', 'hertha-berlin', 'hoffenheim', 'koln',
                    'leipzig', 'mainz', 'monchengladbach', 'schalke-04', 'stuttgart', 'werder-bremen', 
                    'wolfsburg']

## 2018-2019 Bundesliga
lst_teams_b_1819 = ['augsburg', 'bayer-leverkusen', 'bayern-munich', 'borussia-dortmund', 'dusseldorf',
                    'eintracht-frankfurt', 'freiburg', 'hannover', 'hertha-berlin', 'hoffenheim',
                    'leipzig', 'mainz', 'monchengladbach', 'nurnberg', 'schalke-04', 'stuttgart', 'werder-bremen', 
                    'wolfsburg']

## 2019-2020 Bundesliga
lst_teams_b_1920 = ['augsburg', 'bayer-leverkusen', 'bayern-munich', 'borussia-dortmund', 'dusseldorf',
                    'eintracht-frankfurt', 'freiburg', 'hertha-berlin', 'hoffenheim', 'koln',
                    'leipzig', 'mainz', 'monchengladbach', 'paderborn', 'schalke-04', 'union-berlin', 'werder-bremen', 
                    'wolfsburg']

## 2020-2021 Bundesliga
lst_teams_b_2021 = ['arminia-bielefeld', 'augsburg', 'bayer-leverkusen', 'bayern-munich', 'borussia-dortmund',
                    'eintracht-frankfurt', 'freiburg', 'hertha-berlin', 'hoffenheim', 'leipzig', 'mainz', 'monchengladbach',
                    'schalke-04', 'stuttgart', 'union-berlin', 'werder-bremen', 'wolfsburg']

In [15]:
# Ligue 1

## 2013-2014 


## 2014-2015


## 2015-2016


## 2016-2017 
lst_teams_l1_1617 = ['angers', 'bastia', 'bordeaux', 'caen', 'dijon', 'guingamp', 'lille', 'lorient',
                     'lyon', 'marseille', 'metz', 'monaco', 'montpellier', 'nancy', 'nantes', 'nice', 'psg', 'rennes', 
                     'st-etienne', 'toulouse'
                    ]

## 2017-2018 
lst_teams_l1_1718 = ['amiens', 'angers', 'bordeaux', 'caen', 'dijon', 'guingamp', 'lille', 'lyon', 'marseille',
                     'metz', 'monaco', 'montpellier', 'nantes', 'nice', 'psg', 'rennes', 
                     'st-etienne', 'strasbourg', 'toulouse', 'troyes'
                    ]

## 2018-2019 
lst_teams_l1_1819 = ['amiens', 'angers', 'bordeaux', 'caen', 'dijon', 'guingamp', 'lille', 'lyon', 'marseille',
                     'monaco', 'montpellier', 'nantes', 'nice', 'nimes', 'psg', 'reims', 'rennes', 
                     'st-etienne', 'strasbourg', 'toulouse'
                    ]

## 2019-2020
lst_teams_l1_1920 = ['amiens', 'angers', 'bordeaux', 'brest', 'dijon', 'lille', 'lyon', 'marseille',
                     'metz', 'monaco', 'montpellier', 'nantes', 'nice', 'nimes', 'psg', 'reims', 'rennes', 
                     'st-etienne', 'strasbourg', 'toulouse'
                    ]

## 2020-2021 
lst_teams_l1_2021 = ['angers', 'bordeaux', 'brest', 'dijon', 'lens', 'lille', 'lorient', 'lyon', 'marseille',
                     'metz', 'monaco', 'montpellier', 'nantes', 'nice', 'nimes', 'psg', 'reims', 'rennes', 
                     'st-etienne', 'strasbourg'
                    ]

In [16]:
# MLS

## 2013


## 2014


## 2015
lst_teams_mls_15 = ['chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-dallas', 'houston-dynamo', 'la-galaxy', 
                    'montreal-impact', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]


## 2016
lst_teams_mls_16 = ['chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-dallas', 'houston-dynamo', 'la-galaxy', 
                    'montreal-impact', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

## 2017
lst_teams_mls_17 = ['atlanta-united', 'chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-dallas', 'houston-dynamo', 'la-galaxy', 
                    'minnesota-united', 'montreal-impact', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

## 2018
lst_teams_mls_18 = ['atlanta-united', 'chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-dallas', 'houston-dynamo', 'la-fc', 'la-galaxy', 
                    'minnesota-united', 'montreal-impact', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

## 2019 
lst_teams_mls_19 = ['atlanta-united', 'chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-cincinnati', 'fc-dallas', 'houston-dynamo', 'inter-miami', 'la-fc', 'la-galaxy', 
                    'minnesota-united', 'montreal-impact', 'nashville-sc', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

## 2020
lst_teams_mls_20 = ['atlanta-united', 'chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-cincinnati', 'fc-dallas', 'houston-dynamo', 'inter-miami', 'la-fc', 'la-galaxy', 
                    'minnesota-united', 'montreal-impact', 'nashville-sc', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

## 2021 
lst_teams_mls_21 = ['atlanta-united', 'austin', 'chicago-fire', 'colorado-rapids', 'columbus-crew', 'dc-united',
                    'fc-cincinnati', 'fc-dallas', 'houston-dynamo', 'inter-miami', 'la-fc', 'la-galaxy', 
                    'minnesota-united', 'montreal-impact', 'nashville-sc', 'ne-revolution', 'nyc-fc', 
                    'ny-red-bulls', 'orlando-city', 'philadelphia-union', 'portland-timbers', 'real-salt-lake',
                    'san-jose-earthquakes', 'seattle-sounders', 'sporting-kc', 'toronto-fc', 'vancouver-whitecaps'
                   ]

In [17]:
lst_seasons = ['2016-2017', '2017-2018', '2018-2019', '2019-2020', '2020-2021']
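Because the team lists are typed by hand, a quick sanity check can catch a duplicated slug or a list that is shorter than the league size before any scraping starts. The sketch below is an assumption about how such a check might look: `check_team_list` is a hypothetical helper, and the expected size has to be supplied by the caller.

```python
from collections import Counter


def check_team_list(teams, expected_size):
    """Return a list of human-readable problems found in a team list.

    Hypothetical helper: `expected_size` must be supplied by the caller,
    for example 20 for a 20-club league season.
    """
    problems = []
    duplicates = sorted(slug for slug, n in Counter(teams).items() if n > 1)
    if duplicates:
        problems.append(f'duplicate slugs: {duplicates}')
    if len(teams) != expected_size:
        problems.append(f'expected {expected_size} teams, found {len(teams)}')
    return problems


# Illustrative data only, not one of the lists defined above
sample = ['arsenal', 'chelsea', 'chelsea']
print(check_team_list(sample, expected_size=20))
```

An empty list means no problems were found, so the check can be run over every `lst_teams_*` list in a loop and only the failures printed.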

### Defined Filepaths

In [18]:
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..', )
data_dir = os.path.join(base_dir, 'data')
data_dir_sb = os.path.join(base_dir, 'data', 'sb')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
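The scrapers in this notebook save one CSV per team into a `data/<competition>/<season>/` folder, and writing a CSV fails if that folder does not exist yet. A minimal sketch of building the path and creating the folder in one step is shown here; `team_csv_path` is a hypothetical helper, and the example runs against a temporary directory rather than the real data folder.

```python
import os
import tempfile


def team_csv_path(root, comp, season, team):
    """Build the CSV path for one team and create its folder if needed.

    Hypothetical helper; the folder layout mirrors the
    <root>/<competition>/<season>/ pattern used by the scrapers.
    """
    folder = os.path.join(root, comp, season)
    os.makedirs(folder, exist_ok=True)   # no error if the folder already exists
    return os.path.join(folder, f'{team}_{comp}_{season}.csv')


# Demonstrate against a throwaway directory rather than the real data folder
with tempfile.TemporaryDirectory() as tmp:
    path = team_csv_path(tmp, 'premier-league', '2019-2020', 'arsenal')
    print(os.path.basename(path))
    print(os.path.isdir(os.path.dirname(path)))
```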

### Define Scrapers
Two scrapers are defined:
1. Previous seasons (`scrape_capology_season_prev`)
2. Current season (`scrape_capology_season_current`), which needs separate handling because the webpage structure is slightly different

#### Previous season scraper

In [None]:
# Define function for scraping a previous season of Capology data
def scrape_capology_season_prev(lst_teams, season, comp):
    
    ### Print statement
    print(f'Scraping for {comp} for the {season} season has now started...')
    
    ## Create empty list for DataFrame
    dfs_players = []
    
    for team in lst_teams:
        if not os.path.exists(f'./data/{comp}/{season}/{team}_{comp}_{season}.csv'):
            url = f'https://www.capology.com/club/{team}/salaries/{season}/'
            print(f'Scraping {team} for the {season} season')
            wd = webdriver.Chrome(options=options)   # assumes chromedriver is on the PATH
            wd.get(url)
            time.sleep(5)           # wait for the page to finish loading; shorter waits proved unreliable
            html = wd.page_source   # read the source only after the wait, so that the loaded tables are included
            wd.quit()               # close the browser so that Chrome processes do not accumulate
            df = pd.read_html(html, header=0)[1]

            ### Data Engineering
            df = df.iloc[1: , :]
            df = df.rename(columns=df.iloc[0])
            df = df[:-1]
            df = df.iloc[1: , :]
            df = df.reset_index()
            df = df.drop(['index', 'Rank'], axis=1)
            df['Team'] = team
            df['Team'] = df['Team'].str.replace('-', ' ').str.title().str.replace('Fc', 'FC').str.replace('Ac', 'AC')
            df['League'] = comp
            df['League'] = df['League'].str.replace('-', ' ').str.title()
            df['Season'] = season
            print(f'Saving DataFrame of {team} for the {season} season')

            ### Save to CSV
            df.to_csv(f'./data/{comp}/{season}/{team}_{comp}_{season}.csv')

            ### Append to joint DataFrame
            dfs_players.append(df)
        else:
            df = pd.read_csv(f'./data/{comp}/{season}/{team}_{comp}_{season}.csv', index_col=None, header=0)
            print(f'{team} already scraped and saved for the {season} season')

            ### Append to joint DataFrame
            dfs_players.append(df)
        
    ## Concatenate DataFrames to one DataFrame
    df_players_all = pd.concat(dfs_players)

    ## Engineer unified data
    df_players_all['Team'] = df_players_all['Team'].str.replace('-', ' ').str.title().str.replace('Fc', 'FC').str.replace('Ac', 'AC')
    df_players_all['League'] = df_players_all['League'].str.replace('-', ' ').str.title()

    ## Save to CSV
    df_players_all.to_csv(f'./data/{comp}/{season}/all_{comp}_{season}.csv')
    
    ### Print statement
    print(f'Scraping for {comp} for the {season} season is now complete')
    
    ## Return unified season dataset
    return df_players_all
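The fixed `time.sleep` in the scraper above is a trade-off: too short and the table is missing, too long and every team costs extra seconds. One alternative is to poll until a condition holds, which is the idea behind Selenium's own `WebDriverWait`. The sketch below shows the polling pattern in plain Python so that it runs without a browser; `wait_until` and `table_is_present` are hypothetical names, and the "page" is simulated.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. Hypothetical helper, not part of the notebook.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(interval)


# Stand-in for a page whose table only appears on the third check
checks = {'count': 0}


def table_is_present():
    checks['count'] += 1
    return checks['count'] >= 3


wait_until(table_is_present, timeout=5.0, interval=0.01)
print(checks['count'])
```

In a real scrape the condition would inspect the browser, for example by checking whether the expected table can be found in the page, rather than counting calls.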

#### Current season scraper

In [None]:
# Define function for scraping the current season of Capology data
def scrape_capology_season_current(lst_teams, season, comp):
    
    ### Print statement
    print(f'Scraping for {comp} for the {season} season has now started...')
    
    ## Create empty list for DataFrame
    dfs_players = []
    
    ## 
    for team in lst_teams:
        if not os.path.exists(f'./data/{comp}/{season}/{team}_{comp}_{season}_last_updated_{todays_date}.csv'):
            url = f'https://www.capology.com/club/{team}/salaries/{season}/'
            print(f'Scraping {team} for the {season} season')
            wd = webdriver.Chrome(options=options)   # assumes chromedriver is on the PATH
            wd.get(url)
            time.sleep(4)           # wait for the page to finish loading; too short a wait breaks the scrape
            html = wd.page_source   # read the source only after the wait, so that the loaded tables are included
            wd.quit()               # close the browser so that Chrome processes do not accumulate
            df = pd.read_html(html, header=0)[1]

            ### Data Engineering
            df = df.iloc[1: , :]
            df = df.rename(columns=df.iloc[0])
            df = df[:-1]
            df = df.iloc[1: , :]
            df = df.reset_index()
            
            ### Drop the first 3 columns of the DataFrame (including the old index added by reset_index)
            df = df.iloc[: , 3:]
            
            ### Create new columns
            df['Team'] = team
            df['Team'] = df['Team'].str.replace('-', ' ').str.title().str.replace('Fc', 'FC').str.replace('Ac', 'AC')
            df['League'] = comp
            df['League'] = df['League'].str.replace('-', ' ').str.title()
            df['Season'] = season
            print(f'Saving DataFrame of {team} for the {season} season')

            ### Save to CSV
            df.to_csv(f'./data/{comp}/{season}/{team}_{comp}_{season}_last_updated_{todays_date}.csv')

            ### Append to joint DataFrame
            dfs_players.append(df)
        else:
            df = pd.read_csv(f'./data/{comp}/{season}/{team}_{comp}_{season}_last_updated_{todays_date}.csv', index_col=None, header=0)
            print(f'{team} already scraped and saved for the {season} season')

            ### Append to joint DataFrame
            dfs_players.append(df)
        
    ## Concatenate DataFrames to one DataFrame
    df_players_all = pd.concat(dfs_players)

    ## Engineer unified data
    df_players_all['Team'] = df_players_all['Team'].str.replace('-', ' ').str.title().str.replace('Fc', 'FC').str.replace('Ac', 'AC')
    df_players_all['League'] = df_players_all['League'].str.replace('-', ' ').str.title()
    df_players_all = df_players_all.drop(df.columns[1], axis=1)

    ## Save to CSV
    df_players_all.to_csv(f'./data/{comp}/{season}/all_{comp}_{season}_last_updated_{todays_date}.csv')
    
    ### Print statement
    print(f'Scraping for {comp} for the {season} season is now complete')
    
    ## Return unified season dataset
    return df_players_all
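Both scrapers follow the same "scrape once, then reuse the saved file" pattern, and the current-season version adds the date to the filename so that a new scrape happens on a later day. The sketch below isolates that pattern using only the standard library. It is an illustration under stated assumptions: `load_or_fetch` and `fake_scrape` are hypothetical, the row is made up, and the real functions use pandas and Selenium instead.

```python
import csv
import os
import tempfile


def load_or_fetch(path, fetch_rows):
    """Return rows from `path` if it exists, else fetch and save them.

    `fetch_rows` is a hypothetical zero-argument callable standing in
    for the Selenium scrape; it must return a non-empty list of
    dictionaries that all share the same keys.
    """
    if os.path.exists(path):
        with open(path, newline='', encoding='utf-8') as f:
            return list(csv.DictReader(f)), 'cache'
    rows = fetch_rows()
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows, 'scraped'


def fake_scrape():
    # Made-up row for illustration only, not real Capology data
    return [{'Player': 'Example Player', 'Team': 'Example FC'}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example_last_updated_04082021.csv')
    print(load_or_fetch(path, fake_scrape)[1])   # first call writes the file
    print(load_or_fetch(path, fake_scrape)[1])   # second call reads it back
```

Because the date is part of the filename, a file saved on one day is simply not found on the next, which is what triggers a fresh scrape for the current season.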

### Create Directory Structure

In [19]:
"""
# Update this for this scraping environment

# make the directory structure
for folder in ['combined', 'competitions', 'events', 'tactics', 'lineups']:
    path = os.path.join(data_dir_sb, 'raw', folder)
    if not os.path.exists(path):
        os.mkdir(path)
"""


### Notebook Settings

In [20]:
# Display all columns of displayed pandas DataFrames
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None

<a id='section2'></a>

## <a id='#section2'>2. Project Brief</a>
This Jupyter notebook is part of a series of notebooks that scrape, parse, engineer, unify, and then model football data, culminating in an Expected Transfer (xTransfer) player performance vs. valuation model. This model aims to identify under- and over-performing players based on their on-the-pitch output against transfer fee and wages.

This particular notebook is one of several web scraping notebooks. It takes player salary data from [Capology](https://www.capology.com/), scrapes it using [Selenium](https://www.selenium.dev/) and [Beautifulsoup](https://pypi.org/project/beautifulsoup4/), and manipulates it as DataFrames using [pandas](http://pandas.pydata.org/).

This notebook, along with the other notebooks in this project workflow, is shown in the following diagram:

![roadmap](../../img/football_analytics_data_roadmap.png)

Links to these notebooks in the [`football_analytics`](https://github.com/eddwebster/football_analytics) GitHub repository can be found at the following:
*    [Webscraping](https://github.com/eddwebster/football_analytics/tree/master/notebooks/1_data_scraping)
     +    [FBref Player Stats Webscraping]()
     +    [TransferMarkt Player Bio and Status Webscraping]()
     +    [TransferMarkt Player Recorded Transfer Fees Webscraping]()
     +    [Capology Player Salary Webscraping]()
     +    [FBref Team Stats Webscraping]()
*    [Data Parsing](https://github.com/eddwebster/football_analytics/tree/master/notebooks/2_data_parsing)
     +    [ELO Team Ratings Data Parsing]()
*    [Data Engineering](https://github.com/eddwebster/football_analytics/tree/master/notebooks/3_data_engineering)
     +    [FBref Player Stats Data Engineering]()
     +    [TransferMarkt Player Bio and Status Data Engineering]()
     +    [TransferMarkt Player Recorded Transfer Fees Data Engineering]()
     +    [Capology Player Salary Data Engineering]()
     +    [FBref Team Stats Data Engineering]()
     +    [ELO Team Ratings Data Parsing]()
     +    [TransferMarkt Team Recorded Transfer Fee Data Engineering]() (aggregated from [TransferMarkt Player Recorded Transfer Fees notebook]())
     +    [Capology Team Salary Data Engineering]() (aggregated from [Capology Player Salary notebook]())
*    [Joining of Datasets]()
     +    [Player Golden ID of Football Datasets]()
     +    [Team Golden ID of Football Datasets]()
*    [Production Datasets]()
     +    [Player Performance/Market Value Dataset]()
     +    [Team Performance/Market Value Dataset]()
*    [Modeling]()
     +    [Expected Transfer (xTransfer) Model]()

---

<a id='section3'></a>

## <a id='#section3'>3. Data Scraping</a>

### <a id='#section3.1'>3.1. Introduction</a>
Two scrapers are used:
1. Previous seasons (`scrape_capology_season_prev`)
2. Current season (`scrape_capology_season_current`), which needs separate handling because the webpage structure is slightly different
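Choosing between the two scrapers can be expressed as a small dispatch step. The sketch below assumes that the "current" season is the most recent one being scraped (2020-2021 in this notebook). `pick_scraper` is a hypothetical helper, and the two stand-in functions only report which scraper would have been called, so the example runs without Selenium.

```python
def pick_scraper(season, current_season, scrape_prev, scrape_current):
    """Return the scraper function appropriate for `season`.

    Hypothetical helper: the two scrapers are passed in as arguments so
    that the sketch runs without Selenium.
    """
    return scrape_current if season == current_season else scrape_prev


# Stand-ins that only report which scraper would have been called
def fake_prev(lst_teams, season, comp):
    return f'prev:{comp}:{season}'


def fake_current(lst_teams, season, comp):
    return f'current:{comp}:{season}'


for season in ['2019-2020', '2020-2021']:
    scraper = pick_scraper(season, '2020-2021', fake_prev, fake_current)
    print(scraper(['arsenal'], season, 'premier-league'))
```

Passing the real `scrape_capology_season_prev` and `scrape_capology_season_current` in place of the stand-ins would let one loop cover every season of a league.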

### <a id='#section3.2'>3.2. Scrape data by League and Season</a>

#### <a id='#section3.2.1'>3.2.1. Premier League</a>

In [1]:
# Scrape the 2020-2021 Premier League season: pass 1) the list of teams, 2) the season, and 3) the competition
df_players_all = scrape_capology_season_current(lst_teams_pl_2021, '2020-2021', 'premier-league')

## Display DataFrame
df_players_all.head()


In [None]:
# Scrape the 2019-2020 Premier League season: pass 1) the list of teams, 2) the season, and 3) the competition
df_players_all = scrape_capology_season_prev(lst_teams_pl_1920, '2019-2020', 'premier-league')

## Display DataFrame
df_players_all.head()

In [None]:
# Scrape the 2018-2019 Premier League season: pass 1) the list of teams, 2) the season, and 3) the competition
df_players_all = scrape_capology_season_prev(lst_teams_pl_1819, '2018-2019', 'premier-league')

## Display DataFrame
df_players_all.head()

In [None]:
# Scrape the 2017-2018 Premier League season: pass 1) the list of teams, 2) the season, and 3) the competition
df_players_all = scrape_capology_season_prev(lst_teams_pl_1718, '2017-2018', 'premier-league')

## Display DataFrame
df_players_all.head()

In [None]:
# Scrape the 2016-2017 Premier League season: pass 1) the list of teams, 2) the season, and 3) the competition
df_players_all = scrape_capology_season_prev(lst_teams_pl_1617, '2016-2017', 'premier-league')

## Display DataFrame
df_players_all.head()

#### <a id='#section3.2.2'>3.2.2. Serie A</a>

In [None]:
# Scrape the 2020-2021 Serie A season: pass 1) the list of teams, 2) the season, and 3) the competition
df_players_all = scrape_capology_season_current(lst_teams_sa_2021, '2020-2021', 'serie-a')

## Display DataFrame
df_players_all.head()

In [None]:
# Scrape the 2019-2020 Serie A season: pass 1) the list of teams, 2) the season, and 3) the competition
df_players_all = scrape_capology_season_prev(lst_teams_sa_1920, '2019-2020', 'serie-a')

## Display DataFrame
df_players_all.head()

In [None]:
# Scrape the 2018-2019 Serie A season: pass 1) the list of teams, 2) the season, and 3) the competition
df_players_all = scrape_capology_season_prev(lst_teams_sa_1819, '2018-2019', 'serie-a')

## Display DataFrame
df_players_all.head()

In [None]:
# Scrape the 2017-2018 Serie A season: pass 1) the list of teams, 2) the season, and 3) the competition
df_players_all = scrape_capology_season_prev(lst_teams_sa_1718, '2017-2018', 'serie-a')

## Display DataFrame
df_players_all.head()

In [None]:
# Scrape the 2016-2017 Serie A season: pass 1) the list of teams, 2) the season, and 3) the competition
df_players_all = scrape_capology_season_prev(lst_teams_sa_1617, '2016-2017', 'serie-a')

## Display DataFrame
df_players_all.head()

#### <a id='#section3.2.1'>3.2.1. La Liga</a>

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_current(lst_teams_ll_2021, '2020-2021', 'la-liga')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_ll_1920, '2019-2020', 'la-liga')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_ll_1819, '2018-2019', 'la-liga')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_ll_1718, '2017-2018', 'la-liga')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_ll_1617, '2016-2017', 'la-liga')

## Display DataFrame
df_players_all.head()

#### <a id='#section3.2.1'>3.2.1. Bundesliga</a>

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_current(lst_teams_b_2021, '2020-2021', 'bundesliga')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_b_1920, '2019-2020', 'bundesliga')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_b_1819, '2018-2019', 'bundesliga')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_b_1718, '2017-2018', 'bundesliga')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_b_1617, '2016-2017', 'bundesliga')

## Display DataFrame
df_players_all.head()

#### <a id='#section3.2.1'>3.2.1. Ligue 1</a>

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_current(lst_teams_l1_2021, '2020-2021', 'ligue-1')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_l1_1920, '2019-2020', 'ligue-1')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_l1_1819, '2018-2019', 'ligue-1')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_l1_1718, '2017-2018', 'ligue-1')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_l1_1617, '2016-2017', 'ligue-1')

## Display DataFrame
df_players_all.head()

#### <a id='#section3.2.1'>3.2.1. MLS</a>

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_current(lst_teams_mls_21, '2021', 'mls')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_mls_20, '2020', 'mls')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_mls_19, '2019', 'mls')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_mls_18, '2018', 'mls')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_mls_17, '2017', 'mls')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_mls_16, '2016', 'mls')

## Display DataFrame
df_players_all.head()

In [None]:
# Create DataFrame using 'scrape_capology_season' function, include - 1) List of teams (e.g. lst_teams_pl_2021), 2) Season (e.g. 2020-2021), 3) Competition (e.g. premier-league)
df_players_all = scrape_capology_season_prev(lst_teams_mls_15, '2015', 'mls')

## Display DataFrame
df_players_all.head()

#### <a id='#section3.2.1'>3.2.1. Championship</a>

---

<a id='section4'></a>

## <a id='#section4'>4. Data Unification</a>
Unify the scraped and landed datasets for each team, league, and season into a single DataFrame, using `glob` to locate the saved CSV files.

In [84]:
# Show files in directory
all_files = glob.glob(os.path.join('./data/*/*/all_*.csv'))
all_files

['./data/serie-a/2018-2019/all_serie-a_2018-2019.csv',
 './data/serie-a/2019-2020/all_serie-a_2019-2020.csv',
 './data/serie-a/2020-2021/all_serie-a_2020-2021_last_updated_02082021.csv',
 './data/serie-a/2016-2017/all_serie-a_2016-2017.csv',
 './data/serie-a/2017-2018/all_serie-a_2017-2018.csv',
 './data/premier-league/2018-2019/all_premier-league_2018-2019.csv',
 './data/premier-league/2019-2020/all_premier-league_2019-2020.csv',
 './data/premier-league/2020-2021/all_premier-league_2020-2021_last_updated_02082021.csv',
 './data/premier-league/2016-2017/all_premier-league_2016-2017.csv',
 './data/premier-league/2017-2018/all_premier-league_2017-2018.csv',
 './data/ligue-1/2018-2019/all_ligue-1_2018-2019.csv',
 './data/ligue-1/2019-2020/all_ligue-1_2019-2020.csv',
 './data/ligue-1/2020-2021/all_ligue-1_2020-2021_last_updated_02082021.csv',
 './data/ligue-1/2016-2017/all_ligue-1_2016-2017.csv',
 './data/ligue-1/2017-2018/all_ligue-1_2017-2018.csv',
 './data/mls/2015/all_mls_2015.csv',
 '

In [85]:
lst_all_teams = []    # pd.concat takes a list of DataFrames as an argument

for filename in all_files:
    df_temp = pd.read_csv(filename, index_col=None, header=0)
    lst_all_teams.append(df_temp)

df_players_all = pd.concat(lst_all_teams, axis=0, ignore_index=True)
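The glob-and-concatenate pattern can be tried on throwaway files before pointing it at the real `./data` folder. The sketch below is illustrative only: `unify_csvs`, the temporary directory, and the placeholder player rows are all assumptions, not part of the scraped data. It writes the CSVs with `index=False`, which should avoid the `Unnamed: 0` columns that appear in the unified output when a saved index is read back in.

```python
import glob
import os
import tempfile

import pandas as pd


def unify_csvs(pattern):
    """Read every CSV matching `pattern` and stack the rows into one DataFrame."""
    frames = [pd.read_csv(path) for path in sorted(glob.glob(pattern))]
    return pd.concat(frames, axis=0, ignore_index=True)


# Build a throwaway tree that mirrors ./data/<league>/<season>/all_*.csv
tmp_dir = tempfile.mkdtemp()
for league, season, player in [('serie-a', '2018-2019', 'Player A'),
                               ('la-liga', '2017-2018', 'Player B')]:
    folder = os.path.join(tmp_dir, league, season)
    os.makedirs(folder)
    df_one = pd.DataFrame({'Player': [player], 'League': [league], 'Season': [season]})
    # index=False stops the row index being written out as an extra column
    df_one.to_csv(os.path.join(folder, f'all_{league}_{season}.csv'), index=False)

df_demo = unify_csvs(os.path.join(tmp_dir, '*', '*', 'all_*.csv'))
print(df_demo)
```

Sorting the matched paths makes the row order reproducible, since `glob.glob` does not guarantee an ordering.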

In [86]:
## Engineer unified data
df_players_all['Team'] = df_players_all['Team'].str.replace('-', ' ').str.title().str.replace('Fc', 'FC')
df_players_all['League'] = df_players_all['League'].str.replace('-', ' ').str.title()
df_players_all

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Player,Weekly GrossBase Salary(IN EUR),Annual GrossBase Salary(IN EUR),"Adj. GrossBase Salary(2021, IN EUR)",Pos.,Age,Country,Team,League,Season,Status,Expiration,Length,EstimatedGross Total(IN EUR),Weekly GrossBase Salary(IN GBP),Annual GrossBase Salary(IN GBP),"Adj. GrossBase Salary(2021, IN GBP)",EstimatedGross Total(IN GBP),Weekly GrossBase Salary(IN USD),Annual GrossBase Salary(IN USD),"Adj. GrossBase Salary(2021, IN USD)",RosterStatus,EstimatedGross Total(IN USD)
0,0,0.0,Gonzalo Higuaín,"€ 338,327","€ 17,593,000","€ 17,568,773",F,30,Argentina,Ac Milan,Serie A,2018-2019,,,,,,,,,,,,,
1,1,1.0,Gianluigi Donnarumma,"€ 213,673","€ 11,111,000","€ 11,095,699",K,19,Italy,Ac Milan,Serie A,2018-2019,,,,,,,,,,,,,
2,2,2.0,Lucas Biglia,"€ 124,635","€ 6,481,000","€ 6,472,075",M,32,Argentina,Ac Milan,Serie A,2018-2019,,,,,,,,,,,,,
3,3,3.0,Alessio Romagnoli,"€ 124,635","€ 6,481,000","€ 6,472,075",D,23,Italy,Ac Milan,Serie A,2018-2019,,,,,,,,,,,,,
4,4,4.0,Tiemoué Bakayoko,"€ 124,635","€ 6,481,000","€ 6,472,075",M,23,France,Ac Milan,Serie A,2018-2019,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21384,35,35.0,Pedro Martínez,€ 0,€ 0,€ 0,M,21,Spain,Villarreal,La Liga,2017-2018,,,,,,,,,,,,,
21385,36,36.0,Chuca,€ 0,€ 0,€ 0,M,20,Spain,Villarreal,La Liga,2017-2018,,,,,,,,,,,,,
21386,37,37.0,Cédric Bakambu,€ 0,€ 0,€ 0,F,26,Democratic Republic of Congo,Villarreal,La Liga,2017-2018,,,,,,,,,,,,,
21387,38,38.0,Bruno Soriano,€ 0,€ 0,€ 0,M,33,Spain,Villarreal,La Liga,2017-2018,,,,,,,,,,,,,


---

<a id='section5'></a>

## <a id='#section5'>5. Summary</a>
This notebook scrapes player statistics data from [Capology](https://www.capology.com/), using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames, and [Selenium](https://www.selenium.dev/) and [Beautifulsoup](https://pypi.org/project/beautifulsoup4/) for web scraping.

___

<a id='section6'></a>

## <a id='#section6'>6. Next Steps</a>
This raw data is now ready to be engineered in a separate notebook. The Data Engineering subfolder of the GitHub repository can be found [here](https://github.com/eddwebster/football_analytics/tree/master/notebooks/B\)%20Data%20Engineering), and a static version of the record linkage notebook, in which the FBref data is joined to TransferMarkt data, can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/B%29%20Data%20Engineering/Record%20Linkage%20of%20FBref%20and%20TransferMarkt%20Datasets.ipynb).

___

<a id='section7'></a>

## <a id='#section7'>7. References</a>

#### Data and Web Scraping
*    
*    
*    

---

***Visit my website [EddWebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.***

[Back to the top](#top)

## Engineer Unified Dataset
To add
- Original value columns

In [88]:
# Clean the twelve salary columns: strip currency symbols and thousands separators, then cast to float
lst_salary_cols = ['Weekly GrossBase Salary(IN EUR)',
                   'Annual GrossBase Salary(IN EUR)',
                   'EstimatedGross Total(IN EUR)',
                   'Adj. GrossBase Salary(2021, IN EUR)',
                   'Weekly GrossBase Salary(IN USD)',
                   'Annual GrossBase Salary(IN USD)',
                   'EstimatedGross Total(IN USD)',
                   'Adj. GrossBase Salary(2021, IN USD)',
                   'Weekly GrossBase Salary(IN GBP)',
                   'Annual GrossBase Salary(IN GBP)',
                   'EstimatedGross Total(IN GBP)',
                   'Adj. GrossBase Salary(2021, IN GBP)'
                  ]

## Apply the same cleaning steps to each column
for col in lst_salary_cols:
    df_players_all[col] = (df_players_all[col]
                               .replace('None', np.nan)
                               .astype(str)
                               .str.replace('£', '', regex=False)
                               .str.replace('€', '', regex=False)
                               .str.replace('$', '', regex=False)    # regex=False: '$' would otherwise be read as an end-of-string anchor
                               .str.replace(',', '', regex=False)
                               .str.extract(r'(\d+)', expand=False)
                          ).astype(float)
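The cleaning chain applied to the salary columns can be checked on a handful of sample strings. This is a minimal sketch, not the notebook's own code: `clean_currency` is a hypothetical helper, and it uses a single character-class regex in place of the separate `str.replace` calls. Inside the brackets `$` is a literal dollar sign, not an anchor.

```python
import numpy as np
import pandas as pd


def clean_currency(series):
    """Strip currency symbols, spaces and thousands separators, returning floats."""
    return (series
            .replace('None', np.nan)
            .astype(str)
            .str.replace(r'[£€$,\s]', '', regex=True)
            .str.extract(r'(\d+)', expand=False)
            .astype(float))


# Sample strings in the format seen in the scraped tables, plus two missing values
raw = pd.Series(['€ 338,327', '£ 12,919', '$ 0', 'None', np.nan])
cleaned = clean_currency(raw)
print(cleaned.tolist())
```

Missing values contain no digits, so `str.extract` returns null for them and they survive the final cast to float as `NaN`.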

  


In [89]:
# Currency Convert
from currency_converter import CurrencyConverter
import math

In [90]:
# Flag the currency in which each player's salary was scraped
df_players_all['Currency'] = np.where(df_players_all['Annual GrossBase Salary(IN EUR)'].notnull(), 'EUR',
                                      np.where(df_players_all['Annual GrossBase Salary(IN GBP)'].notnull(), 'GBP',
                                               np.where(df_players_all['Annual GrossBase Salary(IN USD)'].notnull(), 'USD', 'n/a')
                                              )
                                     )
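The currency flag checks each currency's annual salary column in turn and takes the first one that is populated. The same first-match logic can be sketched with `np.select`, which avoids the nesting. The toy DataFrame below is made up for illustration; only the column names come from the scraped data.

```python
import numpy as np
import pandas as pd

# Toy frame: the first three rows each have a salary in one currency, the last has none
df_demo = pd.DataFrame({
    'Annual GrossBase Salary(IN EUR)': [17593000.0, np.nan, np.nan, np.nan],
    'Annual GrossBase Salary(IN GBP)': [np.nan, 671795.0, np.nan, np.nan],
    'Annual GrossBase Salary(IN USD)': [np.nan, np.nan, 500000.0, np.nan],
})

lst_currencies = ['EUR', 'GBP', 'USD']

# One boolean condition per currency; np.select picks the first condition that is True
conditions = [df_demo[f'Annual GrossBase Salary(IN {code})'].notnull()
              for code in lst_currencies]
df_demo['Currency'] = np.select(conditions, lst_currencies, default='n/a')
print(df_demo['Currency'].tolist())
```

Generating each condition from its currency code keeps the column name and the label in step, which is where a copy-and-paste slip is easiest to make in the nested form.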

In [91]:
# Get EUR to GBP exchange rate

## Get latest currency rates
c = CurrencyConverter()

##  Get conversion rate from EUR to GBP
rate_eur_gbp = (c.convert(1, 'EUR', 'GBP'))
rate_eur_gbp

##  Get conversion rate from USD to GBP
rate_usd_gbp = (c.convert(1, 'USD', 'GBP'))
rate_usd_gbp

##  GBP to GBP needs no conversion, so the rate is 1
rate_gbp_gbp = 1
rate_gbp_gbp

1

In [92]:
df_players_all['Exchange Rate'] = np.where(df_players_all['Currency'] == 'EUR', rate_eur_gbp,
                                           np.where(df_players_all['Currency'] == 'USD', rate_usd_gbp,
                                                     np.where(df_players_all['Currency'] == 'GBP', 1, np.nan)
                                                   )
                                          )

df_players_all['Exchange Rate'] = df_players_all['Exchange Rate'].replace('None', np.nan).astype(float)
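The exchange rate lookup is a mapping from currency code to rate, so it can also be sketched with a dictionary and `Series.map`. The rates below are hypothetical placeholders for illustration, not values returned by `CurrencyConverter`.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed rates to GBP, for illustration only (not live market data)
dict_rates_to_gbp = {'EUR': 0.85, 'USD': 0.72, 'GBP': 1.0}

df_demo = pd.DataFrame({'Currency': ['EUR', 'USD', 'GBP', 'n/a'],
                        'Salary': [1000.0, 1000.0, 1000.0, np.nan]})

# Codes missing from the dictionary (here 'n/a') map to NaN
df_demo['Exchange Rate'] = df_demo['Currency'].map(dict_rates_to_gbp)

# Convert to GBP; a NaN rate or salary gives a NaN result
df_demo['Salary GBP'] = df_demo['Salary'] * df_demo['Exchange Rate']
print(df_demo)
```

Because unmatched codes become `NaN` on their own, no explicit fallback branch is needed.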

In [93]:
# Coalesce the per-currency columns into single columns and convert them to GBP

## Coalesce the four salary columns

### Estimated gross total
df_players_all['Estimated Gross Total Original'] = (df_players_all['EstimatedGross Total(IN GBP)']
                                                        .combine_first(df_players_all['EstimatedGross Total(IN EUR)'])
                                                        .combine_first(df_players_all['EstimatedGross Total(IN USD)'])
                                                        .replace('None', np.nan)
                                                        .astype(float)
                                                   )

df_players_all['Estimated Gross Total GBP'] = (df_players_all['Estimated Gross Total Original'] * df_players_all['Exchange Rate'])

df_players_all['Estimated Gross Total GBP'] = (df_players_all['Estimated Gross Total GBP']
                                                      .fillna(-1)
                                                      .astype(int)
                                                      .astype(str)
                                                      .replace('-1', np.nan)
                                                 )

### Weekly gross base salary
df_players_all['Weekly Gross Base Salary Original'] = (df_players_all['Weekly GrossBase Salary(IN GBP)']
                                                           .combine_first(df_players_all['Weekly GrossBase Salary(IN EUR)'])
                                                           .combine_first(df_players_all['Weekly GrossBase Salary(IN USD)'])
                                                           .replace('None', np.nan)
                                                           .astype(float)
                                                      )

df_players_all['Weekly Gross Base Salary GBP'] = df_players_all['Weekly Gross Base Salary Original'] * df_players_all['Exchange Rate']

df_players_all['Weekly Gross Base Salary GBP'] = (df_players_all['Weekly Gross Base Salary GBP']
                                                      .fillna(-1)
                                                      .astype(int)
                                                      .astype(str)
                                                      .replace('-1', np.nan)
                                                 )

### Annual gross base salary
df_players_all['Annual Gross Base Salary Original'] = (df_players_all['Annual GrossBase Salary(IN GBP)']
                                                           .combine_first(df_players_all['Annual GrossBase Salary(IN EUR)'])
                                                           .combine_first(df_players_all['Annual GrossBase Salary(IN USD)'])
                                                           .replace('None', np.nan)
                                                           .astype(float)
                                                      )

df_players_all['Annual Gross Base Salary GBP'] = df_players_all['Annual Gross Base Salary Original'] * df_players_all['Exchange Rate']

df_players_all['Annual Gross Base Salary GBP'] = (df_players_all['Annual Gross Base Salary GBP']
                                                      .fillna(-1)
                                                      .astype(int)
                                                      .astype(str)
                                                      .replace('-1', np.nan)
                                                 )

### Adjusted gross base salary for the current season
df_players_all['Adj. Gross Base Salary for Current Season Original'] = (df_players_all['Adj. GrossBase Salary(2021, IN GBP)']
                                                                            .combine_first(df_players_all['Adj. GrossBase Salary(2021, IN EUR)'])
                                                                            .combine_first(df_players_all['Adj. GrossBase Salary(2021, IN USD)'])
                                                                            .replace('None', np.nan)
                                                                            .astype(float)
                                                                       )

df_players_all['Adj. Gross Base Salary for Current Season GBP'] = df_players_all['Adj. Gross Base Salary for Current Season Original'] * df_players_all['Exchange Rate']

df_players_all['Adj. Gross Base Salary for Current Season GBP'] = (df_players_all['Adj. Gross Base Salary for Current Season GBP']
                                                                        .fillna(-1)
                                                                        .astype(int)
                                                                        .astype(str)
                                                                        .replace('-1', np.nan)
                                                                   )


## Coalesce the two status columns

### 'Status' and 'RosterStatus' hold the same information under different column names
df_players_all['Status'] = (df_players_all['Status']
                                .combine_first(df_players_all['RosterStatus'])
                                .replace('None', np.nan)
                                .astype(str)
                           )
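The coalescing pattern used in this cell, where the first non-null value across several columns is kept, can be checked on a toy DataFrame. The column names and values below are made up for illustration. The sketch compares chained `combine_first` calls with a row-wise back-fill, which should give the same result.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Weekly (GBP)': [100.0, np.nan, np.nan, np.nan],
    'Weekly (EUR)': [np.nan, 200.0, np.nan, np.nan],
    'Weekly (USD)': [np.nan, np.nan, 300.0, np.nan],
})

# combine_first keeps the caller's value and falls back to the argument where the caller is null
chained = (df_demo['Weekly (GBP)']
           .combine_first(df_demo['Weekly (EUR)'])
           .combine_first(df_demo['Weekly (USD)']))

# Alternative: back-fill across the columns, then take the first column
coalesced = df_demo.bfill(axis=1).iloc[:, 0]

print(chained.tolist())
print(coalesced.tolist())
```

Passing the same column as both caller and argument leaves the values unchanged, so each `combine_first` call needs to name a different source column.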



In [94]:
df_players_all = df_players_all[~df_players_all['Pos.'].isin(['No data available in table'])]

In [95]:
df_players_all['Pos.'].unique()

array(['F', 'K', 'M', 'D', 'GK', 'CF', 'CB', 'AM', 'LW', 'CM', 'DM', 'RW',
       'LB', 'RB', 'SS', 'LM', 'RM'], dtype=object)

In [96]:
## Map Positions

### Define dictionary of grouped positions
dict_positions_grouped = {'K': 'Goalkeeper',
                          'D': 'Defender',
                          'M': 'Midfielder',
                          'F': 'Forward',
                          'GK': 'Goalkeeper',
                          'LB': 'Defender',
                          'RB': 'Defender',
                          'CB': 'Defender',
                          'DM': 'Midfielder',
                          'LM': 'Midfielder',
                          'CM': 'Midfielder',
                          'RM': 'Midfielder',
                          'AM': 'Midfielder',
                          'LW': 'Forward',
                          'RW': 'Forward',
                          'SS': 'Forward',
                          'CF': 'Forward'
                         }

### Map grouped positions to DataFrame
df_players_all['Pos.'] = df_players_all['Pos.'].map(dict_positions_grouped)
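`Series.map` with a dictionary returns `NaN` for any key missing from the dictionary, so a position code that appears in a later scrape would be silently blanked. A minimal sketch of a check for unmapped codes follows; the shortened dictionary and the made-up code `'XX'` are for illustration only.

```python
import pandas as pd

# Shortened version of the grouping dictionary, for illustration
dict_demo = {'K': 'Goalkeeper', 'GK': 'Goalkeeper', 'CB': 'Defender',
             'CM': 'Midfielder', 'CF': 'Forward'}

positions = pd.Series(['K', 'CB', 'CF', 'XX'])    # 'XX' is a made-up, unmapped code
mapped = positions.map(dict_demo)

# List the original codes that did not find a match in the dictionary
unmapped = positions[mapped.isnull()].unique().tolist()
print(mapped.tolist())
print(unmapped)
```

Running a check like this before overwriting `Pos.` shows which codes, if any, need adding to the dictionary.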

In [97]:
## Separate Goalkeeper and Outfielders
df_players_all['Outfielder Goalkeeper'] = np.where(df_players_all['Pos.'] == 'Goalkeeper', 'Goalkeeper', 'Outfielder')
df_players_all.loc[df_players_all['Pos.'].isnull(), 'Outfielder Goalkeeper'] = np.nan    # keep missing positions as NaN rather than the string 'nan'


## Define columns
cols = ['Player',
        'Season',
        'League',
        'Team',
        'Pos.',
        'Outfielder Goalkeeper',
        'Age',
        'Country',   
        'Weekly Gross Base Salary GBP',
        'Annual Gross Base Salary GBP',
        'Adj. Gross Base Salary for Current Season GBP',
        'Estimated Gross Total GBP',
        'Status',
        'Expiration',
        'Length'
       ]

## Select columns of interest
df_players_all_select = df_players_all[cols]

## Sort by 'League', 'Season', 'Team' and 'Player', ascending
df_players_all_select = df_players_all_select.sort_values(['League', 'Season', 'Team', 'Player'], ascending=[True, True, True, True])

## Drop index
df_players_all_select = df_players_all_select.reset_index(drop=True)

## Rename columns
df_players_all_select = (df_players_all_select
                             .rename(columns={'Player': 'player',
                                              'Season': 'season',
                                              'League': 'league',
                                              'Team': 'team',
                                              'Pos.': 'position',
                                              'Outfielder Goalkeeper': 'outfielder_goalkeeper',
                                              'Age': 'age',
                                              'Country': 'country',
                                              'Weekly Gross Base Salary GBP': 'weekly_gross_base_salary_gbp',
                                              'Annual Gross Base Salary GBP': 'annual_gross_base_salary_gbp',
                                              'Adj. Gross Base Salary for Current Season GBP': 'adj_current_gross_base_salary_gbp',
                                              'Estimated Gross Total GBP': 'estimated_gross_total_gbp',
                                              'Status': 'current_contract_status',
                                              'Expiration': 'current_contract_expiration',
                                              'Length': 'current_contract_length',
                                             }
                                    )
                        )

## Display DataFrame
df_players_all_select.head(10)

Unnamed: 0,player,season,league,team,position,outfielder_goalkeeper,age,country,weekly_gross_base_salary_gbp,annual_gross_base_salary_gbp,adj_current_gross_base_salary_gbp,estimated_gross_total_gbp,current_contract_status,current_contract_expiration,current_contract_length
0,Albian Ajeti,2016-2017,Bundesliga,Augsburg,Forward,Outfielder,19,Switzerland,0,0,0,,,,
1,Alexander Esswein,2016-2017,Bundesliga,Augsburg,Forward,Outfielder,26,Germany,12919,671795,696824,,,,
2,Alfred Finnbogason,2016-2017,Bundesliga,Augsburg,Forward,Outfielder,27,Iceland,0,0,0,,,,
3,Andreas Luthe,2016-2017,Bundesliga,Augsburg,Goalkeeper,Goalkeeper,29,Germany,5939,308881,320389,,,,
4,Caiuby,2016-2017,Bundesliga,Augsburg,Forward,Outfielder,27,Brazil,12919,671795,696824,,,,
5,Christoph Janker,2016-2017,Bundesliga,Augsburg,Defender,Outfielder,31,Germany,7325,380924,395116,,,,
6,Daniel Baier,2016-2017,Bundesliga,Augsburg,Midfielder,Outfielder,32,Germany,16573,861807,893916,,,,
7,Daniel Opare,2016-2017,Bundesliga,Augsburg,Defender,Outfielder,25,Ghana,0,0,0,,,,
8,Dominik Kohr,2016-2017,Bundesliga,Augsburg,Midfielder,Outfielder,22,Germany,12919,671795,696824,,,,
9,Dong-won Ji,2016-2017,Bundesliga,Augsburg,Forward,Outfielder,25,South Korea,17248,896927,930345,,,,


In [98]:
# Still to engineer
# - 'current_contract_status', 'current_contract_expiration', and 'current_contract_length' are blank unless it's a 2021
#   row. The 2021 data can be joined back onto the previous years. However, more data may need to be scraped to
#   get contract information for players no longer in the same league (e.g. relegation, move abroad)
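
The join described in the comments above could be sketched as follows. This is a minimal illustration on toy data, not the notebook's implemented step: the season label `'2020-2021'`, the example values, and the names `df`, `df_contracts`, and `df_joined` are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for df_players_all_select (illustrative values, not scraped data)
df = pd.DataFrame({
    'player': ['Player A', 'Player A', 'Player B'],
    'season': ['2019-2020', '2020-2021', '2020-2021'],
    'current_contract_status': [None, 'Signed', 'Signed'],
    'current_contract_expiration': [None, '2023', '2022'],
    'current_contract_length': [None, 3, 2],
})

contract_cols = ['current_contract_status',
                 'current_contract_expiration',
                 'current_contract_length']
latest_season = '2020-2021'  # assumption: the season whose rows carry contract data

# One row of contract details per player, taken from the latest season
df_contracts = (df.loc[df['season'] == latest_season, ['player'] + contract_cols]
                  .drop_duplicates(subset='player'))

# Drop the mostly-empty contract columns, then join the latest values back on
df_joined = (df.drop(columns=contract_cols)
               .merge(df_contracts, on='player', how='left'))
```

Joining on the player name alone is fragile where two players share a name, so a real version would probably need an additional key. Players with no latest-season row keep missing contract values after the left join.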

In [99]:
df_players_all.loc[df_players_all['Player'] == 'Albian Ajeti']

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Player,Weekly GrossBase Salary(IN EUR),Annual GrossBase Salary(IN EUR),"Adj. GrossBase Salary(2021, IN EUR)",Pos.,Age,Country,Team,League,Season,Status,Expiration,Length,EstimatedGross Total(IN EUR),Weekly GrossBase Salary(IN GBP),Annual GrossBase Salary(IN GBP),"Adj. GrossBase Salary(2021, IN GBP)",EstimatedGross Total(IN GBP),Weekly GrossBase Salary(IN USD),Annual GrossBase Salary(IN USD),"Adj. GrossBase Salary(2021, IN USD)",RosterStatus,EstimatedGross Total(IN USD),Currency,Exchange Rate,Estimated Gross Total Original,Estimated Gross Total GBP,Weekly Gross Base Salary Original,Weekly Gross Base Salary GBP,Annual Gross Base Salary Original,Annual Gross Base Salary GBP,Adj. Gross Base Salary for Current Season Original,Adj. Gross Base Salary for Current Season GBP,Outfielder Goalkeeper
5317,15,15.0,Albian Ajeti,,,,Forward,22,Switzerland,West Ham,Premier League,2019-2020,,,,,50000.0,2600000.0,2600000.0,,,,,,,GBP,1.0,,,50000.0,50000,2600000.0,2600000,2600000.0,2600000,Outfielder
17092,35,35.0,Albian Ajeti,0.0,0.0,0.0,Forward,19,Switzerland,Augsburg,Bundesliga,2016-2017,,,,,,,,,,,,,,EUR,0.90053,,,0.0,0,0.0,0,0.0,0,Outfielder


In [100]:
df_players_all_select.loc[df_players_all_select['player'] == 'Albian Ajeti']

Unnamed: 0,player,season,league,team,position,outfielder_goalkeeper,age,country,weekly_gross_base_salary_gbp,annual_gross_base_salary_gbp,adj_current_gross_base_salary_gbp,estimated_gross_total_gbp,current_contract_status,current_contract_expiration,current_contract_length
0,Albian Ajeti,2016-2017,Bundesliga,Augsburg,Forward,Outfielder,19,Switzerland,0,0,0,,,,
16789,Albian Ajeti,2019-2020,Premier League,West Ham,Forward,Outfielder,22,Switzerland,50000,2600000,2600000,,,,


## Create Wide Dataset
One row per player.
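
One possible way to reshape the long data (one row per player per season) into a wide layout is a pivot with one salary column per season. The sketch below uses toy data. The choice of `annual_gross_base_salary_gbp` as the value column, the `max` aggregation for players who appear more than once in a season, and the names `df_long` and `df_wide` are assumptions for illustration only.

```python
import pandas as pd

# Toy long-format data: one row per player per season (illustrative values)
df_long = pd.DataFrame({
    'player': ['Player A', 'Player A', 'Player B'],
    'season': ['2016-2017', '2019-2020', '2016-2017'],
    'annual_gross_base_salary_gbp': [100, 200, 300],
})

# Pivot to one row per player, with one salary column per season.
# aggfunc='max' is an assumption for players listed twice in the same season.
df_wide = (df_long
           .pivot_table(index='player',
                        columns='season',
                        values='annual_gross_base_salary_gbp',
                        aggfunc='max')
           .add_prefix('annual_gross_base_salary_gbp_')
           .reset_index())

# Remove the leftover 'season' label from the column index
df_wide.columns.name = None
```

Seasons in which a player has no row come out as `NaN` in the wide table, which keeps "no data" distinct from a recorded salary of 0.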

## Filter Players in 'Big 5' European Leagues
Create separate DataFrame

In [107]:
df_players_all_select['league'].unique()

array(['Bundesliga', 'La Liga', 'Ligue 1', 'Mls', 'Premier League',
       'Serie A'], dtype=object)

In [108]:
# Filter players in the Big 5 European Leagues

## Define list of Big 5 leagues (the variable name says 'countries', but the values are league names)
lst_big5_countries = ['Bundesliga', 'Ligue 1', 'Premier League', 'Serie A', 'La Liga']

## Keep only rows whose league is in the Big 5 list
df_players_big5_select = df_players_all_select[df_players_all_select['league'].isin(lst_big5_countries)]

In [109]:
df_players_big5_select.shape

(16741, 15)
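
A quick sanity check on a filter like this is to look at which leagues and how many rows it removed. The sketch below shows the idea on toy data. The names `df_all`, `df_big5`, `leagues_dropped`, and `rows_dropped` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df_players_all_select (illustrative values)
df_all = pd.DataFrame({
    'player': ['Player A', 'Player B', 'Player C'],
    'league': ['Bundesliga', 'Mls', 'Serie A'],
})

lst_big5 = ['Bundesliga', 'Ligue 1', 'Premier League', 'Serie A', 'La Liga']

# Boolean mask of rows that belong to a Big 5 league
mask = df_all['league'].isin(lst_big5)
df_big5 = df_all[mask].copy()

# Which leagues, and how many rows, did the filter remove?
leagues_dropped = sorted(df_all.loc[~mask, 'league'].unique())
rows_dropped = len(df_all) - len(df_big5)
```

For the data above, only `'Mls'` would be expected in `leagues_dropped`. Any other value would suggest a spelling mismatch between the list and the `league` column.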

## Export Dataset

In [110]:
# Export DataFrames

## All teams
df_players_all_select.to_csv(data_dir + f'/capology_all_1617_2021_last_updated_{todays_date}.csv', index=False, header=True)
df_players_all_select.to_csv('../Football/data/export/' + 'capology_big5_mls_teams_latest.csv', index=False, header=True)

## Big 5 European teams
df_players_big5_select.to_csv(data_dir + f'/capology_big5_1617_2021_last_updated_{todays_date}.csv', index=False, header=True)
df_players_big5_select.to_csv('../Football/data/export/' + 'capology_big5_teams_latest.csv', index=False, header=True)