# Capstone Project: English Premier League Predictions

## Executive Summary

Football is the most [popular sport](http://www.biggestglobalsports.com/) in the world, and it is especially dominant in Europe, South America and Africa. Most of the top footballers come from Europe and South America, and all past World Cup winners have come from these two continents. Brazil has won the most titles, with five. The other winners are Germany, Italy, Argentina, France, Spain, England and Uruguay.

Each country has its own football leagues. A country can have more than one league, and each league usually has 20 teams. The leagues within a country are connected through promotion and relegation. The English Premier League is an example: it has 20 teams every season, and when the season ends the bottom three teams are relegated to the league below, the English Football League. In turn, three teams from the English Football League are promoted to the English Premier League. The same applies between the English Football League and the many leagues below it.

The [English Premier League](https://www.premierleague.com/home) is one of the top leagues in the world. It was originally founded as Football League Division One in 1888, and its clubs broke away from the Football League in 1992 to form the Premier League. Notable teams in the league are Manchester United, Arsenal, Chelsea, Manchester City and Liverpool.

## Problem Statement 

We will predict the results of football matches in the English Premier League. Because football matches are full of surprises, we will use data from past matches and from the FIFA game series. With this data and analysis, the classification model can assist potential stakeholders such as football pundits, betting websites, football fans and shareholders of football clubs.

FIFA is a football video game series that includes stats for every team and player. With this data and analysis, we can better understand the relationship between the match stats and the result. As football is closely associated with betting, we will also compare the predictions with odds from [Singapore Pools](https://online.singaporepools.com/en/sports/competition/36/football/england/english-premier).

There will be a total of seven notebooks:

1) Part 1A - Data Acquisition (This notebook)

2) Part 1B - FIFA & Season Fixtures Clean Up

3) Part 2 - Data Preparation, EDA & Feature Engineering

4) Part 3 - Modeling for Result

5) Part 4A - Modeling for Home Total Goals

6) Part 4B - Modeling for Away Total Goals

7) Part 5 - Predictions & Conclusion


# Part 1A - Data Acquisition 

Two types of data will be used: match fixtures for past seasons and the current season, and FIFA player stats for each season. We will look at seasons from 2017/2018 to the current one, 2020/2021.

The match fixtures data will be scraped from [FBREF](https://fbref.com/en/). For the FIFA players stats, the data will be taken from [Kaggle](https://www.kaggle.com/stefanoleone992/fifa-21-complete-player-dataset?select=players_21.csv). 


Match reports for each season will be scraped into their own folder. A function will then combine all the CSV files in that folder into one.

A separate dataset will be scraped for model predictions. It will be scraped from the current season.
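
The per-season folder layout described above can be prepared before any scraping starts. The following is a minimal sketch, assuming the folder names used by the scraping cells later in this notebook sit under a `data` directory. The helper name `make_season_folders` is hypothetical and is not part of the original notebook.

```python
import os
import tempfile

# Season folder names as used by the scraping cells in this notebook.
SEASON_FOLDERS = ["epl2018", "epl2019", "epl2020", "epl2021", "epl2021_prediction"]

def make_season_folders(base_dir, folders=SEASON_FOLDERS):
    """Create one sub-folder per season under base_dir (hypothetical helper)."""
    created = []
    for folder in folders:
        path = os.path.join(base_dir, folder)
        # exist_ok=True makes the call safe to repeat on later runs.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created

# Demonstrated in a throwaway directory; in the notebook base_dir would be "data".
with tempfile.TemporaryDirectory() as tmp:
    paths = make_season_folders(tmp)
    all_exist = all(os.path.isdir(p) for p in paths)
```

Creating the folders up front means a scraping run cannot fail halfway through because a target folder is missing.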

### Contents:
 - [Functions for Web Scraping](#-Functions-for-Web-Scraping)
 - [Web Scraping from FBREF](#-Web-Scraping-from-FBREF)


In [1]:
#Importing of Modules
import pandas as pd

from bs4 import BeautifulSoup
import requests

import os
import time
import csv
import glob

In [2]:
#Display of rows and columns
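# Use the full option names; the short forms can be ambiguous in newer pandas versions
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)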

## Functions for Web Scraping 

In [3]:
def scoresfixtures(link,ids):
    '''
    Description: This function finds all the games played in one season and collects their links in one list.
    
    Inputs:
        - link: The link of the main page that lists all the games of the desired season.
        - ids: The ID of the championship table
        
    Outputs:
        - A list with the links of all matches of the season
    
    
    '''
    req = requests.get(link)
    # Stop early on a failed request instead of hitting an undefined `content` below
    req.raise_for_status()
    content = req.content

    soup = BeautifulSoup(content, 'html.parser')
    tb = soup.find(id=ids)

    s1= []
    s2= []
    for i in tb.find_all("a"):
            s1.append(str(i))
            s2.append(str(i.get_text('href')))


    # Calling DataFrame constructor after zipping 
    # both lists, with columns specified 
    di = pd.DataFrame(list(zip(s1, s2)), 
                   columns =['Codes', 'ID']) 

    s4=[]
    for i in di["Codes"]:
        i = i.replace('<a href="','')
        i = i.replace('</a>','')
        s4.append(str(i))


    s5 = []

    for i in di['Codes']:
        if "matches" in i:
            s5.append(str(i))
        else:
            s5.append(0)

    s6 = []
    for i in di["Codes"]:
        if '<a href="/en/squads/' in i:
            i = i.replace('<a href="/en/squads/','')
            i = i[0:8]
            s6.append(str(i))
        else:
            s6.append(0)        

    # Calling DataFrame constructor after zipping 
    # both lists, with columns specified 
    da = pd.DataFrame(list(zip(s1, s2,s4,s5,s6)), 
                   columns =['CODES', 'ID','URL_FINAL','GAMES_2020',"TEAM_CODE"])        

    s9 = []
    for i in da["URL_FINAL"]:
        if 'Match Report' in i:
            s9.append(str(i))
        else:
            pass
    return s9
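
The string replacements in `scoresfixtures` appear to leave the tag's trailing text (such as `">Match Report`) attached to each returned link. As an alternative sketch, the `href` attribute can be read directly. The example below uses Python's standard-library `html.parser` on an assumed HTML fragment, and the real FBREF markup may differ. With BeautifulSoup, the equivalent lookup on a tag would be `a.get('href')`. The class name `MatchReportLinks` and the sample links are hypothetical.

```python
from html.parser import HTMLParser

class MatchReportLinks(HTMLParser):
    """Collect the href of every <a> whose visible text is 'Match Report'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == "Match Report":
                self.links.append(self._href)
            self._href = None

# Assumed fragment of a fixtures table; not copied from the live site.
sample_html = '''
<table id="sched">
  <tr><td><a href="/en/squads/aaaa1111/Team-A-Stats">Team A</a></td>
      <td><a href="/en/matches/abcd1234/Team-A-Team-B">Match Report</a></td></tr>
  <tr><td><a href="/en/squads/bbbb2222/Team-B-Stats">Team B</a></td>
      <td><a href="/en/matches/efgh5678/Team-B-Team-C">Match Report</a></td></tr>
</table>
'''

parser = MatchReportLinks()
parser.feed(sample_html)
match_links = parser.links
```

Each entry in `match_links` is then a clean path that can be appended to `https://fbref.com` without any leftover markup.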

In [1]:
def match_report(url,folder_name):
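    '''
    Description: This function goes to the URL of the match and processes all the data into one single DataFrame.
    
    Input:
        - url: URL path of the match report page (appended to https://fbref.com)
        - folder_name: Name of the folder under data/ where the csv is saved
        
    Output:
        - DataFrame will be saved as csv
    
    '''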
    #make the request
    pg = 'https://fbref.com'
    url_pg = pg+ url
    req = requests.get(url_pg)
    # Fail fast on a bad response rather than using an undefined `content`
    req.raise_for_status()
    content = req.content
    #accessing data from site
    soup = BeautifulSoup(content, 'html.parser')

    table_general = soup.find_all(class_ = "table_container")
    #Some match reports have less data, so they contain fewer tables
    if len(table_general) <= 2:
        table_team_1 = table_general[0]
        table_team_2 = table_general[1]
        table_team_5 = soup.find(class_='venuetime')
    elif len(table_general) > 4:
        table_team_1 = table_general[0]
        table_team_2 = table_general[7]
        table_team_3 = table_general[5]
        table_team_4 = table_general[12]
        table_team_5 = soup.find(class_='venuetime')
    else:
        table_team_1 = table_general[0]
        table_team_2 = table_general[2]
        table_team_5 = soup.find(class_='venuetime')

    #collecting data
    table_match = soup.find_all('div',class_ = "scorebox_meta")
    oi_match = str(table_match)

    #treating data
    toby = oi_match.split('<small>')
    match = str(toby[2])
    match = match.split('</small>')
    match = str(match[0])
    stadium = toby[4]
    stadium = stadium.split('</small>')
    stadium = str(stadium[0])    
    
    #collecting data
    date = table_team_5.get('data-venue-date')        

    #treating data
    name = str(soup.title)
    name = name.replace(" ","_")
    name = name.replace("<title>","")
    name = name.replace(".","")
    name_final = name.split("Report")[0]


    #treating data
    name_final = name_final.split("_Match")
    name_final = name_final[0]    

    # STR transform and reading tables
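    if len(table_general) <= 2:
        table_str_1 = str(table_team_1)
        table_str_2 = str(table_team_2)
        # Read the two tables here too; df_1 and df_2 are used below in every case
        df_1 = pd.read_html(table_str_1)[0]
        df_2 = pd.read_html(table_str_2)[0]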
    elif len(table_general) > 4:
        table_str_1 = str(table_team_1)
        table_str_2 = str(table_team_2)
        table_str_3 = str(table_team_3)
        table_str_4 = str(table_team_4)
        df_1 = pd.read_html(table_str_1)[0]
        df_2 = pd.read_html(table_str_2)[0]
        df_3 = pd.read_html(table_str_3)[0]
        df_4 = pd.read_html(table_str_4)[0]
    else:
        table_str_1 = str(table_team_1)
        table_str_2 = str(table_team_2)
        df_1 = pd.read_html(table_str_1)[0]
        df_2 = pd.read_html(table_str_2)[0]

    #treating data
    # Named `teams` rather than `time`, so the imported time module is not shadowed
    teams = str(name_final)
    teams = teams.replace("_"," ")
    teams = teams.split(" vs ")
    time_1 = str(teams[0])
    time_2 = str(teams[1])

    #DataFrame transforming
    df_1 = pd.DataFrame(df_1)
    df_1.columns = df_1.columns.droplevel()
    df_1['Home'] = str(time_1)
    df_1['Away'] = str(time_2)
    df_1['Match'] = str(name_final)
    df_1['Date'] = str(date)
    df_1['Stadium'] = str(stadium)
    df_1['Attendance'] = match
    if len(table_general) > 4:
        df_3.columns = df_3.columns.droplevel()
        df_3.drop(columns=['Player', '#', 'Nation', 'Pos', 'Age', 'Min', 'CrdY', 'CrdR'],inplace=True)
        df_5 = pd.merge(df_1,df_3,left_index=True,right_index=True)
    else:
        df_5 = df_1

    df_2 = pd.DataFrame(df_2)
    df_2.columns = df_2.columns.droplevel()
    df_2['Home'] = str(time_2)
    df_2['Away'] = str(time_1)
    df_2['Match'] = str(name_final)
    df_2['Date'] = str(date)
    df_2['Stadium'] = str(stadium)
    df_2['Attendance'] = match
    if len(table_general) > 4:
        df_4.columns = df_4.columns.droplevel()
        df_4.drop(columns=['Player', '#', 'Nation', 'Pos', 'Age', 'Min', 'CrdY', 'CrdR'],inplace=True)
        df_6 = pd.merge(df_2,df_4,left_index=True,right_index=True)
    else:
        df_6 = df_2    
    
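    #APPENDING Dataframes
    # DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
    df_8 = pd.concat([df_5, df_6])
    # Make sure the target folder exists before writing
    os.makedirs(f'data/{folder_name}', exist_ok=True)
    df_8.to_csv(f'data/{folder_name}/{name_final}.csv',index=False)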
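
The title handling in `match_report` can be checked in isolation. The following is a minimal sketch, assuming a page title of the form `Home vs. Away Match Report ...`. The sample title is made up for illustration, and `split_match_title` is a hypothetical helper that mirrors the string steps used in the function above.

```python
def split_match_title(title_tag):
    """Return (file_stem, home, away) from the string form of a <title> tag."""
    name = title_tag.replace(" ", "_")
    name = name.replace("<title>", "")
    name = name.replace(".", "")              # "vs." becomes "vs"
    stem = name.split("Report")[0]            # keep the text before "Report"
    stem = stem.split("_Match")[0]            # drop the trailing "_Match_"
    home, away = stem.replace("_", " ").split(" vs ")
    return stem, home, away

# Assumed example title; real titles may carry different trailing text.
sample_title = "<title>Team A vs. Team B Match Report - Saturday August 12, 2017 | Example</title>"
stem, home, away = split_match_title(sample_title)
```

Here `stem` is the value used as the CSV file name, while `home` and `away` fill the `Home` and `Away` columns. A team name that itself contains `Report` or ` vs ` would break this parsing, so the approach relies on the title format staying stable.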

In [3]:
def combine_csv(direct,name):
    '''
    Description: This function combines all csv files in a folder into one
    
    Input:
        - direct - folder location
        - name - new csv file name
        
    Output:
        - All csv will be combined into a new csv
    
    '''
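    # Build paths from `direct` instead of calling os.chdir, so later relative paths keep working
    extension = 'csv'
    all_filenames = [f for f in glob.glob(os.path.join(direct, '*.{}'.format(extension)))
                     if os.path.basename(f) != name]  # skip a previously combined file
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
    combined_csv.to_csv(os.path.join(direct, name),index=False,encoding ='utf-8-sig')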

## Web Scraping from FBREF

### Season 2017/2018

In [None]:
#start = time.time()

#for url in scoresfixtures("https://fbref.com/en/comps/9/1631/schedule/2017-2018-Premier-League-Scores-and-Fixtures",
#                          "div_sched_ks_1631_1"):
#       match_report(url,'epl2018')

#print('Duration: {} seconds'.format(time.time() - start))

In [None]:
#Combine the scraped match reports into one csv
#combine_csv("data/epl2018",'epl2018.csv')

### Season 2018/2019

In [6]:
#start = time.time()

#for url in scoresfixtures("https://fbref.com/en/comps/9/1889/schedule/2018-2019-Premier-League-Scores-and-Fixtures",
#                          "div_sched_ks_1889_1"):
#       match_report(url,'epl2019')

#print('Duration: {} seconds'.format(time.time() - start))


In [7]:
#Combine the scraped match reports into one csv
#combine_csv("data/epl2019",'epl2019.csv')

### Season 2019/2020

In [19]:
#start = time.time()

#for url in scoresfixtures("https://fbref.com/en/comps/9/3232/schedule/2019-2020-Premier-League-Scores-and-Fixtures",
#                          "div_sched_ks_3232_1"):
#       match_report(url,'epl2020')

#print('Duration: {} seconds'.format(time.time() - start))

In [20]:
#Combine the scraped match reports into one csv
#combine_csv("data/epl2020",'epl2020.csv')

### Season 2020/2021

In [16]:
#Web scrape for EPL 2020/2021 fixtures
#start = time.time()

#for url in scoresfixtures("https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures",
#                          "div_sched_ks_10728_1"):
#       match_report(url,'epl2021')

#print('Duration: {} seconds'.format(time.time() - start))

Duration: 98.67593288421631 seconds


In [18]:
#combine_csv("data/epl2021",'epl2021.csv')

### Prediction Set

This is also the Season 2020/2021 dataset, but it will be re-scraped from time to time so that the predictions use up-to-date matches.

In [8]:
#Web scrape for EPL 2020/2021 fixtures
#start = time.time()

#for url in scoresfixtures("https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures",
#                          "div_sched_ks_10728_1"):
#       match_report(url,'epl2021_prediction')

#print('Duration: {} seconds'.format(time.time() - start))

Duration: 524.2856922149658 seconds


In [9]:
#combine_csv("data/epl2021_prediction",'epl2021_predict.csv')