## Premier League

**Introduction:**

The Premier League is a professional association football league in England and the highest level of the English football league system. Contested by 20 clubs, it operates on a system of promotion and relegation with the English Football League (EFL). Seasons usually run from August to May, with each team playing 38 matches: two against each other team, one at home and one away. Most games are played on weekend afternoons, with occasional weekday evening fixtures (Wikipedia, 2024).
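
The 38-match figure follows from the home-and-away format described above. A minimal sketch of the arithmetic (the variable names are illustrative and not part of the original notebook):

```python
clubs = 20

# Each club meets every other club twice: once at home, once away.
opponents = clubs - 1
matches_per_team = 2 * opponents

# Every match involves two clubs, so halve the per-club total.
total_matches = clubs * matches_per_team // 2

print(matches_per_team, total_matches)  # 38 380
```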

## Web Scraping From Wikipedia

### Information on Premier League 2024/2025 Teams

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
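
The later cells write CSV files into a `Premier League Output` folder, and `DataFrame.to_csv` does not create missing folders. A minimal sketch, assuming the folder should sit in the current working directory (the `output_dir` name is illustrative and not part of the original notebook):

```python
from pathlib import Path

# Assumption: the CSV files are written relative to the current working directory.
output_dir = Path("Premier League Output")

# Create the folder if it is missing; do nothing if it already exists.
output_dir.mkdir(parents=True, exist_ok=True)
```

Running this once before the first `to_csv` call should be enough.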

In [2]:
url_pl = "https://en.wikipedia.org/wiki/Premier_League"
page_pl = requests.get(url_pl)
page_pl.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(page_pl.content, "html.parser")  # name the parser explicitly

In [3]:
# getting the column name
pl_2425_col = soup.find_all("table", class_="wikitable sortable")[5].find_all("th")
col_clean = []
for col in pl_2425_col:
    col_clean.append(col.text.strip())

In [4]:
# getting the row data
pl_2425_row = soup.find_all("table", class_="wikitable sortable")[5].find_all("td")
row_clean = []
row_dirty = []
for row in pl_2425_row:
    # keep the raw text: a cell containing "\n" is treated as the last cell of its row
    row_data = row.text
    if "\n" not in row_data:
        row_dirty.append(row_data)
    else:
        # last cell of the row: store it, then start collecting a new row
        row_dirty.append(row_data.strip("\n"))
        row_clean.append(row_dirty)
        row_dirty = []

In [5]:
# getting some information on premier league 2024/2025 teams
df_pl_2425 = pd.DataFrame(columns=col_clean, data=row_clean)
df_pl_2425.to_csv("Premier League Output/df_pl_2425.csv", index=False)
df_pl_2425.head(3)

|  | 2024–25Club | 2023–24Position | First season intop division | First season inPremier League | Seasonsin topdivision | Seasonsin PremierLeague | First season ofcurrent spell intop division | No. of seasonsof current spellin Premier League | Topdivisiontitles | Mostrecent topdivision title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Arsenal[v 1][v 2] | 2nd | 1904–05 | 1992–93 | 108 | 33 | 1919–20 (99 seasons[v 3]) | 33 | 13 | 2003–04 |
| 1 | Aston Villa[v 1][v 4] | 4th | 1888–89 | 1992–93 | 111 | 30 | 2019–20 (6 seasons) | 6 | 7 | 1980–81 |
| 2 | Bournemouth | 12th | 2015–16 | 2015–16 | 8 | 8 | 2022–23 (3 seasons) | 3 | 0 | – |
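
The scraped values still carry Wikipedia footnote markers such as `[v 1]`. One way to remove them, sketched here with a hypothetical helper `strip_footnotes` (not part of the original notebook), is a regular expression that drops any bracketed marker:

```python
import re

# Matches bracketed footnote markers such as "[v 1]", "[20]" or "[a]".
FOOTNOTE = re.compile(r"\[[^\]]*\]")


def strip_footnotes(text):
    """Remove bracketed footnote markers and surrounding whitespace."""
    return FOOTNOTE.sub("", text).strip()


print(strip_footnotes("Arsenal[v 1][v 2]"))          # Arsenal
print(strip_footnotes("1919–20 (99 seasons[v 3])"))  # 1919–20 (99 seasons)
```

With pandas, the helper could be applied to a column via `Series.map`, for example `df_pl_2425["2024–25Club"].map(strip_footnotes)`.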


### Information on Premier League Name Evolutions

In [6]:
# getting the column name
pl_names_col = soup.find_all("table", class_="wikitable")[10].find_all("th")
name_col_clean = []
for col in pl_names_col:
    name_col_clean.append(col.text.strip())

In [7]:
# getting the dirty row data
pl_names_row = soup.find_all("table", class_="wikitable")[10].find_all("td")
name_row_dirty = []
for row in pl_names_row:
    data = row.text.strip()
    name_row_dirty.append(data)

# the 2007–2016 row has no sponsor cell of its own (the "Barclays" cell appears
# to span two table rows), so insert the missing value at index -4
name_row_dirty.insert(-4, "Barclays")

In [8]:
# getting the clean row data
name_row_clean = []
name_dirty = []
for name in name_row_dirty:
    name_dirty.append(name)
    if len(name_dirty) == 3:
        name_row_clean.append(name_dirty)
        name_dirty = []
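
The append-and-reset loop above recurs in later cells with different row widths. As a sketch only, the pattern could be factored into a small helper; `chunk_rows` is a hypothetical name, not something the original notebook defines:

```python
def chunk_rows(cells, width):
    """Split a flat list of cell values into rows of `width` items."""
    # A trailing group with fewer than `width` items is dropped,
    # which matches the behaviour of the append-and-reset loops.
    usable = len(cells) - len(cells) % width
    return [cells[i:i + width] for i in range(0, usable, width)]


cells = ["1992–1993", "No sponsor", "FA Premier League",
         "2016–present", "No sponsor", "Premier League"]
print(chunk_rows(cells, 3))
```

Under that assumption, `name_row_clean` would be `chunk_rows(name_row_dirty, 3)`.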

In [9]:
# getting some information on premier league names evolution
pl_name = pd.DataFrame(columns=name_col_clean, data=name_row_clean)
pl_name.to_csv("Premier League Output/pl_name.csv", index=False)
pl_name

|  | Period | Sponsor | Brand |
| --- | --- | --- | --- |
| 0 | 1992–1993 | No sponsor | FA Premier League |
| 1 | 1993–2001 | Carling | FA Carling Premiership[20] |
| 2 | 2001–2004 | Barclaycard | FA Barclaycard Premiership[20] |
| 3 | 2004–2007 | Barclays | FA Barclays Premiership |
| 4 | 2007–2016 | Barclays | Barclays Premier League[20][133] |
| 5 | 2016–present | No sponsor | Premier League |


### Information on Premier League 2024/2025 Managers

In [10]:
# getting the managers information columns (the first five <th> cells)
pl_managers_col = soup.find("table", class_="wikitable sortable plainrowheaders").find_all("th")
managers_col_clean = [col.text.strip() for col in pl_managers_col[:5]]

In [11]:
# getting the managers name for row (the <th> cells after the five column headers)
pl_managers_name = soup.find("table", class_="wikitable sortable plainrowheaders").find_all("th")
managers_name = [col.text.strip() for col in pl_managers_name[5:]]

In [12]:
# getting the managers data for row (four <td> cells per manager)
pl_managers_row = soup.find("table", class_="wikitable sortable plainrowheaders").find_all("td")
managers_row_clean = []
managers_data = []
for data in pl_managers_row:
    managers_data.append(data.text.strip())
    if len(managers_data) == 4:
        managers_row_clean.append(managers_data)
        managers_data = []

In [13]:
# getting premier league managers 2024/2025 data
df_managers = pd.DataFrame(columns=managers_col_clean[1:], data=managers_row_clean)
df_managers.insert(0, managers_col_clean[0], managers_name)
df_managers.to_csv("Premier League Output/df_managers.csv", index=False)
df_managers.head()

|  | Manager | Nationality | Club | Appointed | Time as manager |
| --- | --- | --- | --- | --- | --- |
| 0 | Pep Guardiola | Spain | Manchester City | 1 July 2016 | 8 years, 130 days |
| 1 | Thomas Frank | Denmark | Brentford | 16 October 2018 | 6 years, 23 days |
| 2 | Mikel Arteta | Spain | Arsenal | 20 December 2019 | 4 years, 324 days |
| 3 | Marco Silva | Portugal | Fulham | 1 July 2021 | 3 years, 130 days |
| 4 | Eddie Howe | England | Newcastle United | 8 November 2021 | 3 years, 0 days |
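
The "Time as manager" values are fixed at scrape time, whereas the "Appointed" dates are not. As a sketch, the tenure can be recomputed from the appointment date for any reference date. The `tenure` helper is hypothetical, and the scrape date of 8 November 2024 is an assumption inferred from Eddie Howe's "3 years, 0 days":

```python
from datetime import date, datetime


def tenure(appointed, as_of):
    """Return (years, days) between an appointment date string and `as_of`."""
    # %B expects English month names under the default locale.
    start = datetime.strptime(appointed, "%d %B %Y").date()
    years = as_of.year - start.year
    # Limitation: replace() raises ValueError for a 29 February start
    # when the target year is not a leap year.
    anniversary = start.replace(year=start.year + years)
    if anniversary > as_of:
        years -= 1
        anniversary = start.replace(year=start.year + years)
    return years, (as_of - anniversary).days


# Assumption: the table was scraped on 8 November 2024.
scrape_date = date(2024, 11, 8)
print(tenure("1 July 2016", scrape_date))      # (8, 130)
print(tenure("8 November 2021", scrape_date))  # (3, 0)
```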


### Information on Top Transfers Involving Premier League Clubs

In [14]:
# getting columns about top transfer in premier league
top_transfers_pl_col = soup.find_all("table", class_="wikitable")[-1].find_all("th")
transfers_col_clean = []
for col in top_transfers_pl_col:
    transfers_col_clean.append(col.text.strip())

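# rename the second-to-last header and add a separate "Transfer to" column
transfers_col_clean[-2] = "Transfer from"
transfers_col_clean.insert(-1, "Transfer to")
# drop the first header; its short numeric cells are skipped when the rows are built
del transfers_col_clean[0]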

In [15]:
# getting rows data about top transfer in premier league
top_transfers_pl_row = soup.find_all("table", class_="wikitable")[-1].find_all("td")
transfer_row_clean = []
transfer_row_data = []
for row in top_transfers_pl_row:
    text = row.text.strip()
    # skip the one- or two-digit numeric cells
    if len(text) <= 2 and text.isdigit():
        continue
    transfer_row_data.append(text)
    if len(transfer_row_data) == 6:
        transfer_row_clean.append(transfer_row_data)
        transfer_row_data = []

In [16]:
# getting the top transfers that involved premier league teams
df_transfers = pd.DataFrame(columns=transfers_col_clean, data=transfer_row_clean)
df_transfers.to_csv("Premier League Output/df_transfers.csv", index=False)
df_transfers

|  | Player | Fee (£ million) | Year | Transfer from | Transfer to | Reference(s) |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Philippe Coutinho (BRA) | £105[a] | 2018 | Liverpool | Barcelona | [281] |
| 1 | Moisés Caicedo (ECU) | £100[b] | 2023 | Brighton & Hove Albion | Chelsea | [268] |
| 2 | Declan Rice (ENG) | £100[c] | 2023 | West Ham United | Arsenal | [282] |
| 3 | Jack Grealish (ENG) | £100 | 2021 | Aston Villa | Manchester City | [270] |
| 4 | Eden Hazard (BEL) | £89[d] | 2019 | Chelsea | Real Madrid | [283] |
| 5 | Harry Kane (ENG) | £86.4 | 2023 | Tottenham Hotspur | Bayern Munich | [284] |
| 6 | Gareth Bale (WAL) | £86 | 2013 | Tottenham Hotspur | Real Madrid | [285][286] |
| 7 | Cristiano Ronaldo (POR) | £80 | 2009 | Manchester United | Real Madrid | [287][288] |
| 8 | Harry Maguire (ENG) | £80 | 2019 | Leicester City | Manchester United | [277][278] |
| 9 | Romelu Lukaku (BEL) | £75 | 2017 | Everton | Manchester United | [289][290][291] |
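
The fee column is scraped as text such as `£105[a]`, which cannot be sorted or summed numerically. A minimal sketch of a conversion, assuming every fee has the form "£" + number + optional footnote markers (the `parse_fee` helper is hypothetical, not part of the original notebook):

```python
import re


def parse_fee(fee):
    """Convert a fee string such as "£105[a]" to a float in £ million."""
    without_notes = re.sub(r"\[[^\]]*\]", "", fee)  # drop footnote markers
    return float(without_notes.replace("£", "").strip())


print(parse_fee("£105[a]"))  # 105.0
print(parse_fee("£86.4"))    # 86.4
```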


### Information on the Premier League Champions

In [17]:
url = "https://en.wikipedia.org/wiki/List_of_English_football_champions"
page = requests.get(url)
page.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(page.content, "html.parser")  # name the parser explicitly

In [18]:
# getting the football league champions (1888 - 1892)
champ_fl_col = soup.find_all("table", class_="wikitable sortable")[0].find_all("th")
champ_fl_col_clean = []
for col in champ_fl_col:
    champ_fl_col_clean.append(col.text.strip())

In [19]:
champ_fl_row = soup.find_all("table", class_="wikitable sortable")[0].find_all("td")
champ_fl_row_clean = []
champ_fl_row_data = []
for data in champ_fl_row:
    champ_fl_row_data.append(data.text.strip())
    if len(champ_fl_row_data) == 5:
        champ_fl_row_clean.append(champ_fl_row_data)
        champ_fl_row_data = []

In [20]:
df_fl_champ = pd.DataFrame(columns=champ_fl_col_clean, data=champ_fl_row_clean)
df_fl_champ

|  | Season | Champions (number of titles) | Runners-up | Third place | Winning manager |
| --- | --- | --- | --- | --- | --- |
| 0 | 1888–89 | Preston North End[a][b] (1) | Aston Villa | Wolverhampton Wanderers | William Sudell (secretary manager) |
| 1 | 1889–90 | Preston North End (2) | Everton | Blackburn Rovers | William Sudell (secretary manager) |
| 2 | 1890–91 | Everton (1) | Preston North End | Notts County | Dick Molyneux (secretary manager) |
| 3 | 1891–92 | Sunderland (1) | Preston North End | Bolton Wanderers | Tom Watson |


In [21]:
# getting the football league first division champions (1892 - 1992)
champ_flfd_col = soup.find_all("table", class_="wikitable sortable")[1].find_all("th")
champ_flfd_col_clean = []
for col in champ_flfd_col:
    champ_flfd_col_clean.append(col.text.strip())

In [22]:
champ_flfd_row = soup.find_all("table", class_="wikitable sortable")[1].find_all("td")
champ_flfd_row_clean = []
champ_flfd_row_data = []
# cells from the wartime suspension rows do not follow the five-cell layout
skip_cells = {
    "1915–16 to 1918–19",
    "League suspended due to the First World War",
    "1939–40 to 1945–46",
    "League suspended due to the Second World War",
}
for data in champ_flfd_row:
    text = data.text.strip()
    if text in skip_cells:
        continue
    champ_flfd_row_data.append(text)
    if len(champ_flfd_row_data) == 5:
        champ_flfd_row_clean.append(champ_flfd_row_data)
        champ_flfd_row_data = []

In [23]:
df_flfd_champ = pd.DataFrame(columns=champ_flfd_col_clean, data=champ_flfd_row_clean)
df_flfd_champ.head()

|  | Season | Champions (number of titles) | Runners-up | Third place | Winning manager |
| --- | --- | --- | --- | --- | --- |
| 0 | 1892–93 | Sunderland (2) | Preston North End | Everton | Tom Watson |
| 1 | 1893–94 | Aston Villa (1) | Sunderland | Derby County | George Ramsay |
| 2 | 1894–95 | Sunderland (3) | Everton | Aston Villa | Tom Watson |
| 3 | 1895–96 | Aston Villa (2) | Derby County | Everton | George Ramsay |
| 4 | 1896–97 | Aston Villa[b] (3) | Sheffield United | Derby County | George Ramsay |


In [24]:
# getting the premier league champions (1992 - present)
champ_pl_col = soup.find_all("table", class_="wikitable sortable")[2].find_all("th")
champ_pl_col_clean = []
for col in champ_pl_col:
    champ_pl_col_clean.append(col.text.strip())

In [25]:
champ_pl_row = soup.find_all("table", class_="wikitable sortable")[2].find_all("td")
champ_pl_row_clean = []
champ_pl_row_data = []
for data in champ_pl_row:
    champ_pl_row_data.append(data.text.strip())
    if len(champ_pl_row_data) == 5:
        champ_pl_row_clean.append(champ_pl_row_data)
        champ_pl_row_data = []

In [26]:
df_pl_champ = pd.DataFrame(columns=champ_pl_col_clean, data=champ_pl_row_clean)
df_pl_champ.head()

|  | Season | Champions (number of titles) | Runners-up | Third place | Winning manager |
| --- | --- | --- | --- | --- | --- |
| 0 | 1992–93 | Manchester United (8) | Aston Villa | Norwich City | Alex Ferguson |
| 1 | 1993–94 | Manchester United[b] (9) | Blackburn Rovers | Newcastle United | Alex Ferguson |
| 2 | 1994–95 | Blackburn Rovers (3) | Manchester United | Nottingham Forest | Kenny Dalglish |
| 3 | 1995–96 | Manchester United[b] (10) | Newcastle United | Liverpool | Alex Ferguson |
| 4 | 1996–97 | Manchester United (11) | Newcastle United | Arsenal | Alex Ferguson |
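
The "Champions (number of titles)" column packs the club name, optional footnote markers and the running title count into one string. A sketch of how the parts could be separated, assuming every cell ends with a count in parentheses (the `split_champion` helper is hypothetical, not part of the original notebook):

```python
import re

# Club name, optional footnote markers, then the title count in parentheses.
CHAMPION = re.compile(r"^(?P<club>.+?)(?:\[[^\]]*\])*\s*\((?P<titles>\d+)\)$")


def split_champion(cell):
    """Split "Manchester United[b] (10)" into ("Manchester United", 10)."""
    match = CHAMPION.match(cell)
    if match is None:
        raise ValueError(f"unexpected champion cell: {cell!r}")
    return match.group("club").strip(), int(match.group("titles"))


print(split_champion("Manchester United[b] (10)"))  # ('Manchester United', 10)
print(split_champion("Blackburn Rovers (3)"))       # ('Blackburn Rovers', 3)
```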


In [27]:
# concatenate the champions of all three eras (Football League, First Division, Premier League)
df_champions = pd.concat([df_fl_champ, df_flfd_champ, df_pl_champ])
df_champions.to_csv("Premier League Output/df_champions.csv", index=False)
df_champions

|  | Season | Champions (number of titles) | Runners-up | Third place | Winning manager |
| --- | --- | --- | --- | --- | --- |
| 0 | 1888–89 | Preston North End[a][b] (1) | Aston Villa | Wolverhampton Wanderers | William Sudell (secretary manager) |
| 1 | 1889–90 | Preston North End (2) | Everton | Blackburn Rovers | William Sudell (secretary manager) |
| 2 | 1890–91 | Everton (1) | Preston North End | Notts County | Dick Molyneux (secretary manager) |
| 3 | 1891–92 | Sunderland (1) | Preston North End | Bolton Wanderers | Tom Watson |
| 0 | 1892–93 | Sunderland (2) | Preston North End | Everton | Tom Watson |
| ... | ... | ... | ... | ... | ... |
| 27 | 2019–20 | Liverpool (19) | Manchester City | Manchester United | Jürgen Klopp |
| 28 | 2020–21 | Manchester City[f] (7) | Manchester United | Liverpool | Pep Guardiola |
| 29 | 2021–22 | Manchester City (8) | Liverpool | Chelsea | Pep Guardiola |
| 30 | 2022–23 | Manchester City[m] (9) | Arsenal | Manchester United | Pep Guardiola |


### Information on EFL Championship Teams

In [28]:
url = "https://en.wikipedia.org/wiki/EFL_Championship"
page = requests.get(url)
page.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(page.content, "html.parser")  # name the parser explicitly

In [29]:
# getting EFL teams information
efl_col = soup.find("table", class_="wikitable sortable").find_all("th")
efl_col_clean = []
for col in efl_col:
    efl_col_clean.append(col.text.strip())

In [30]:
efl_data = soup.find("table", class_="wikitable sortable").find_all("td")
efl_row = []
for data in efl_data:
    efl_row.append(data.text.strip())

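# the "Sheffield" location cell appears to be shared by two consecutive teams,
# so the second team's row is one cell short; copy the value four positions ahead
sheffield_idx = efl_row.index("Sheffield")
efl_row.insert(sheffield_idx + 4, efl_row[sheffield_idx])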

efl_row_clean = []
efl_row_data = []
for data in efl_row:
    efl_row_data.append(data)
    if len(efl_row_data) == 4:
        efl_row_clean.append(efl_row_data)
        efl_row_data = []
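
The manual "Sheffield" insertion works for one known shared cell. A more general approach is to carry any cell with a `rowspan` down into the rows it covers. The sketch below is an assumption-laden illustration, not the notebook's method: it takes rows as lists of `(text, rowspan)` pairs, which with BeautifulSoup could be built from `cell.get_text(strip=True)` and `int(cell.get("rowspan", 1))`. The `expand_rowspans` helper and the placeholder stadium and capacity values are hypothetical.

```python
def expand_rowspans(rows):
    """Fill in cells that span several rows so every row has full width."""
    pending = {}  # column index -> (text, number of rows still to fill)
    grid = []
    for row in rows:
        cells = list(row)
        out = []
        col = 0
        while cells or col in pending:
            if col in pending:
                # This column is covered by a cell from an earlier row.
                text, remaining = pending.pop(col)
                if remaining > 1:
                    pending[col] = (text, remaining - 1)
            else:
                text, span = cells.pop(0)
                if span > 1:
                    pending[col] = (text, span - 1)
            out.append(text)
            col += 1
        grid.append(out)
    return grid


# Placeholder stadium and capacity values, for illustration only.
rows = [
    [("Sheffield United", 1), ("Sheffield", 2), ("stadium A", 1), ("capacity A", 1)],
    [("Sheffield Wednesday", 1), ("stadium B", 1), ("capacity B", 1)],
]
for filled in expand_rowspans(rows):
    print(filled)
```

The second row comes back with "Sheffield" restored in its second position, which is what the manual insert achieves for this one table.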

In [31]:
efl_teams = pd.DataFrame(columns=efl_col_clean, data=efl_row_clean)
efl_teams.to_csv("Premier League Output/efl_teams.csv", index=False)
efl_teams.head()

|  | Team | Location | Stadium | Capacity |
| --- | --- | --- | --- | --- |
| 0 | Blackburn Rovers | Blackburn | Ewood Park | 31367 |
| 1 | Bristol City | Bristol | Ashton Gate Stadium | 27000 |
| 2 | Burnley | Burnley | Turf Moor | 21944 |
| 3 | Cardiff City | Cardiff | Cardiff City Stadium | 33280 |
| 4 | Coventry City | Coventry | Coventry Building Society Arena | 32609 |
