#  The last 100 years of football, visualized

This notebook describes the code behind the project FootB, which web-scrapes the final tables of Serie A (Italy), the Premier League (England), Ligue 1 (France) and La Liga (Spain) and looks at their overall statistics.

## The data

The data come from the English version of Wikipedia. The workflow looks something like this:
1. Load the webpage
2. Find the table we are interested in
3. Transform the data into a homogeneous, usable format

Easy! Unfortunately, there are some complications: leagues change name over the years (the Premier League was known as the Football League First Division before 1992); final tables can contain extra columns or have different column names depending on the year and/or the league; and some table entries can be of different types.
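To make the column-name and entry-type complications concrete, here is a minimal sketch of one way to handle them with a lookup table. The names `COLUMN_ALIASES`, `normalize_columns` and `to_int` are hypothetical helpers introduced for illustration; the aliases themselves are the ones renamed in the code further down.

```python
# Hypothetical lookup table: alternative header -> standard header
COLUMN_ALIASES = {
    "Club": "Team",
    "Played": "Pld",
    "Points": "Pts",
    "Wins": "W",
    "Draws": "D",
    "Losses": "L",
    "Goals for": "GF",
    "Goals against": "GA",
    "F": "GF",
    "A": "GA",
}


def normalize_columns(columns):
    """Replace known alternative headers with the standard ones."""
    return [COLUMN_ALIASES.get(name.strip(), name.strip()) for name in columns]


def to_int(value):
    """Convert a table entry to int, or return None if it is not a whole number."""
    # Tables may use the Unicode minus sign (U+2212) for negative goal differences
    text = str(value).strip().replace("\u2212", "-")
    try:
        return int(text)
    except ValueError:
        return None


headers = normalize_columns(["Pos", "Club", "Played", "Wins", "Draws", "Losses", "F", "A", "Points"])
print(headers)
```

A dictionary keeps every alias in one place, so supporting a new header spelling only requires one extra entry.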

A good start is to look at the data. Here are the first five entries of the table for Serie A (1985-86):

<img src="Serie_A.png" alt="Drawing" style="width: 700px;"/>


The columns we are interested in are: the team name (Team), games played (Pld), won (W), drawn (D), lost (L), goals for (GF), goals against (GA), goal difference (GD) and points (Pts). Not all leagues used the same scoring system, so below we will redefine Pts using the current system: 3 points for a win, 1 point for a draw, 0 points for a loss.
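As a minimal sketch of that redefinition, assuming the table has already been loaded into a pandas DataFrame with numeric `W` and `D` columns (the rows below are made up for illustration and are not real results):

```python
import pandas as pd

# Made-up rows for illustration only; these are not real league results
table = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C"],
    "W": [20, 15, 8],
    "D": [6, 9, 10],
    "L": [4, 6, 12],
})

# Current scoring system: 3 points per win, 1 per draw, 0 per loss
table["Pts"] = 3 * table["W"] + table["D"]
print(table)
```

Recomputing Pts from W and D, rather than trusting the scraped column, makes seasons played under different scoring rules directly comparable.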

## The code
First, let's load some modules:

In [7]:
from bs4 import BeautifulSoup
import urllib2
import pandas as pd
import numpy as np
import httplib2
import re

We want to write a function that takes a web address as a string and returns a pandas DataFrame containing the final table.

In [5]:
def FootB(webpage):
    # codecodecode
    return dataframe

Most of the Wikipedia pages we are dealing with contain more than one table:

In [9]:
header = {'User-Agent': 'Mozilla/5.0'} # Needed to prevent 403 error on Wikipedia

wpage = 'https://en.wikipedia.org/wiki/1985-86_Serie_A'
req = urllib2.Request(wpage, headers = header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, 'html5lib')
all_tables = soup.find_all('table')

Here `all_tables` is a BeautifulSoup `ResultSet`, a list-like object holding every table on our Wikipedia page.
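To see what such a result looks like without fetching a page, here is a small sketch on an inline HTML snippet. The snippet is made up for illustration, and it uses Python's built-in `html.parser` backend instead of `html5lib` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables, standing in for a real Wikipedia page
html = """
<table><tr><th>Year</th><th>Champion</th></tr></table>
<table><tr><th>Pos</th><th>Team</th><th>Pld</th><th>W</th><th>D</th><th>L</th></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
all_tables = soup.find_all("table")

# The result behaves like a list: it has a length and can be iterated over
header_texts = [
    [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
    for table in all_tables
]
print(len(all_tables))
print(header_texts)
```

Only the second table has `W` and `D` among its headers, which is the property used to pick the right table.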

We can loop through the tables and break the loop when we find a table containing `W` and `D` in the first row:

In [17]:
theone = [u'W', u'D']
for table in all_tables:
    all_rows = table.find_all('tr')
    header_cells = all_rows[0].find_all('th')
    lencols = len(header_cells)
    # Extract the column names from the header row
    cols = np.array([th.get_text().encode('ascii', 'ignore') for th in header_cells])

    if theone[0] in cols and theone[1] in cols:
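        # Strip the leftover "v t e" navigation-link text from the Team header.
        # str.strip() treats its argument as a set of characters, not a literal substring.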
        if len(cols[1]) > 10:
            cols[1] = cols[1].strip('\n\n\nv\nt\ne\n\n\n\n') 
        # Rename columns which do not match our standard format
        cols[cols == 'Club'] = 'Team'
        cols[cols == 'Played'] = 'Pld'
        cols[cols == 'Points'] = 'Pts'
        cols[cols == 'Wins'] = 'W'
        cols[cols == 'Draws'] = 'D'
        cols[cols == 'Losses'] = 'L'
        cols[cols == 'Goals for'] = 'GF'
        cols[cols == 'Goals against'] = 'GA'
        cols[cols == 'F'] = 'GF'
        cols[cols == 'A'] = 'GA'
        # Keep only rows with more than 5 child nodes; shorter rows
        # (e.g. notes or legends) carry no team information
        rows = [r for r in all_rows if len(r) > 5]
        lenrows = len(rows)
        break