# **Web Scraping with Beautiful Soup**

## **FifaIndex.com screenshot from 10/11 season**
---

![Website Screenshot](https://i.ibb.co/C9LtmwF/fifaindex.jpg)

## **Custom function for scraping data with Beautiful Soup**
---

![Website HTML Screenshot](https://i.ibb.co/sWshT3J/fifaindex-html-page.jpg)

In [1]:
import requests
from bs4 import BeautifulSoup  # install first if needed: pip install beautifulsoup4


def ratings_export(link):
    """Extract Attack, Midfield, Defence & Overall team ratings from fifaindex.com
    and return a dictionary with team names as keys and a list of ratings as values."""

    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')

    # rating badges of class group r2, in page order, split into groups of 4
    data1 = [element.text for element in soup.find_all(class_="badge badge-dark rating r2")]
    new_data1 = [data1[j:j+4] for j in range(0, len(data1), 4)]

    # rating badges of class group r3, handled the same way
    data2 = [element.text for element in soup.find_all(class_="badge badge-dark rating r3")]
    new_data2 = [data2[i:i+4] for i in range(0, len(data2), 4)]

    # combine both lists of rating groups into a single list
    # (groups are built per CSS class, so a group can end up with fewer than 4 ratings)
    ranking_data = new_data1 + new_data2

    # team names from the HTML page; keep only the odd-indexed entries, the rest are not required
    team_data = [element.text for element in soup.find_all(class_="link-team")]
    team_data = [v for i, v in enumerate(team_data) if i % 2 != 0]

    # pair team names with their ratings (zip stops at the shorter of the two lists)
    finaldict = dict(zip(team_data, ranking_data))

    return finaldict
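
The grouping step inside `ratings_export` can be looked at in isolation. The following is a minimal sketch, using a hypothetical helper `chunk` and made-up rating strings (not scraped data), of how slicing a flat list into groups of four behaves when the list length is not a multiple of four. The last group comes out short, which is the kind of incomplete entry worth checking for after scraping.

```python
def chunk(items, size=4):
    """Split a flat list into consecutive groups of `size` (hypothetical helper)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Made-up rating strings for illustration only, not real FifaIndex values.
flat_ratings = ['85', '84', '83', '84', '82', '81', '80', '81', '80']
groups = chunk(flat_ratings)

# The ninth value has no partners, so the final group holds a single rating.
incomplete = [g for g in groups if len(g) != 4]

print(groups)
print(incomplete)
```

Because `zip` pairs names and groups purely by position, a short group still gets assigned to a team, so a length check on the result is a reasonable follow-up.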

## **Data extraction from multiple seasons**
---

In [2]:
seasons = ['10/11', '11/12', '12/13', '13/14', '14/15', '15/16', '16/17', '17/18', '18/19', '19/20']
# FifaIndex links, one per season (same order as the seasons list)
season_links = ['https://www.fifaindex.com/teams/fifa11_7/?league=13&order=desc',
                'https://www.fifaindex.com/teams/fifa12_9/?league=13&order=desc',
                'https://www.fifaindex.com/teams/fifa13_10/?league=13&order=desc',
                'https://www.fifaindex.com/teams/fifa14_13/?league=13&order=desc',
                'https://www.fifaindex.com/teams/fifa15_14/?league=13&order=desc',
                'https://www.fifaindex.com/teams/fifa16_73/?league=13&order=desc',
                'https://www.fifaindex.com/teams/fifa17_173/?league=13&order=desc',
                'https://www.fifaindex.com/teams/fifa18_278/?league=13&order=desc',
                'https://www.fifaindex.com/teams/fifa19_353/?league=13&order=desc',
                'https://www.fifaindex.com/teams/fifa20_419/?league=13&order=desc',]

In [3]:
team_ratings_per_season = {}

for season, link in zip(seasons, season_links):
    team_ratings_per_season[season] = ratings_export(link)  # nested dictionary: season -> team -> ratings
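
The pairing of season labels with links relies on both lists having the same length and order. A minimal sketch of that idea follows, assuming a hypothetical helper `build_season_table` and a stand-in `fake_fetch` function in place of `ratings_export`, so it runs without network access. The URLs and ratings are placeholders.

```python
def build_season_table(season_labels, links, fetch):
    """Map each season label to the result of fetch(link) (hypothetical helper)."""
    if len(season_labels) != len(links):
        # Fail early instead of letting zip silently drop the extra entries.
        raise ValueError("season_labels and links must have the same length")
    return {season: fetch(link) for season, link in zip(season_labels, links)}


def fake_fetch(link):
    """Stand-in for ratings_export; returns placeholder data, no HTTP request."""
    return {"Team A": ["80", "80", "80", "80"], "source": link}


demo_seasons = ["10/11", "11/12"]
demo_links = ["https://example.com/first", "https://example.com/second"]  # placeholder URLs

table = build_season_table(demo_seasons, demo_links, fake_fetch)
print(table["11/12"]["source"])
```

Passing the fetch function as an argument keeps the pairing logic testable on its own; in the notebook, `ratings_export` would take the place of `fake_fetch`.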

In [4]:
for k in sorted(team_ratings_per_season):
    print(k, ":")
    # check if all teams have 4 unique ratings
    print("Does each team have 4 unique ratings?", all(len(val)==4 for val in team_ratings_per_season[k].values()))
    for key, value in team_ratings_per_season[k].items():
        if len(value) < 4:
            print(key, ":", value)
    print("No of teams:", len(team_ratings_per_season[k]))
    print('')

10/11 :
Does each team have 4 unique ratings? False
Everton : ['80']
No of teams: 20

11/12 :
Does each team have 4 unique ratings? True
No of teams: 19

12/13 :
Does each team have 4 unique ratings? False
Arsenal : ['80', '80', '81']
No of teams: 19

13/14 :
Does each team have 4 unique ratings? False
Tottenham Hotspur : ['80']
No of teams: 20

14/15 :
Does each team have 4 unique ratings? False
Arsenal : ['80', '80']
No of teams: 20

15/16 :
Does each team have 4 unique ratings? True
No of teams: 20

16/17 :
Does each team have 4 unique ratings? False
Leicester City : ['81']
No of teams: 20

17/18 :
Does each team have 4 unique ratings? False
Everton : ['80']
No of teams: 20

18/19 :
Does each team have 4 unique ratings? False
Everton : ['80', '80']
No of teams: 20

19/20 :
Does each team have 4 unique ratings? False
Leicester City : ['82', '80']
No of teams: 20
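
The same check can return its findings instead of printing them, which makes the follow-up fixes easier to drive from code. This is a minimal sketch with a hypothetical helper `incomplete_teams`; the Everton entry mirrors the 10/11 output above, while "Team A" and "Team B" and their ratings are made up for illustration.

```python
def incomplete_teams(ratings_by_season, expected=4):
    """Return {season: {team: ratings}} for teams without `expected` ratings."""
    problems = {}
    for season, teams in ratings_by_season.items():
        short = {team: vals for team, vals in teams.items() if len(vals) != expected}
        if short:
            problems[season] = short
    return problems


sample = {
    "10/11": {"Everton": ["80"], "Team A": ["85", "84", "83", "84"]},
    "15/16": {"Team B": ["78", "77", "76", "77"]},  # complete, so not reported
}

report = incomplete_teams(sample)
print(report)
```

Seasons with fewer than 20 teams, such as 11/12 and 12/13 in the output above, are not caught by this length check and need a separate count comparison.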



## **Manually adding missing data & changing names to match Kaggle dataset**
---

In [5]:
# manually setting incomplete or missing entries so that every team has 4 ratings
team_ratings_per_season['10/11']["Everton"] = ['80','79','78','78']
team_ratings_per_season['11/12']["Arsenal"] = ['84','82','80','81']
team_ratings_per_season['12/13']["Southampton"] = ['73','74','71','73']
team_ratings_per_season['12/13']["Arsenal"] = ['80','79','80','80']
team_ratings_per_season['13/14']["Tottenham Hotspur"] = ['82','79','78','81']
team_ratings_per_season['14/15']["Arsenal"] = ['81','80','80','80']
team_ratings_per_season['16/17']["Leicester City"] = ['79','78','76','78']
team_ratings_per_season['17/18']["Everton"] = ['77','80','79','79']
team_ratings_per_season['18/19']["Everton"] = ['80','79','78','79']
team_ratings_per_season['19/20']["Leicester City"] = ['77','78','79','78']

# renaming 2 teams to match the premier_df dataset used in a different Jupyter Notebook
team_ratings_per_season['17/18']["Brighton and Hove Albion"] = team_ratings_per_season['17/18'].pop('Brighton & Hove Albion')
team_ratings_per_season['18/19']["Brighton and Hove Albion"] = team_ratings_per_season['18/19'].pop('Brighton & Hove Albion')
team_ratings_per_season['19/20']["Brighton and Hove Albion"] = team_ratings_per_season['19/20'].pop('Brighton & Hove Albion')
team_ratings_per_season['15/16']["AFC Bournemouth"] = team_ratings_per_season['15/16'].pop('Bournemouth')
team_ratings_per_season['16/17']["AFC Bournemouth"] = team_ratings_per_season['16/17'].pop('Bournemouth')
team_ratings_per_season['17/18']["AFC Bournemouth"] = team_ratings_per_season['17/18'].pop('Bournemouth')
team_ratings_per_season['18/19']["AFC Bournemouth"] = team_ratings_per_season['18/19'].pop('Bournemouth')
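
The repeated `pop` lines can also be expressed as one mapping applied to every season. The sketch below assumes a hypothetical helper `rename_teams`; the two name pairs come from the cell above, and the ratings in `sample` are placeholders rather than scraped values.

```python
def rename_teams(ratings_by_season, renames):
    """Rename team keys in place wherever the old name is present."""
    for teams in ratings_by_season.values():
        for old, new in renames.items():
            if old in teams:  # seasons without the old name are left untouched
                teams[new] = teams.pop(old)
    return ratings_by_season


renames = {
    "Brighton & Hove Albion": "Brighton and Hove Albion",
    "Bournemouth": "AFC Bournemouth",
}

# Placeholder ratings for illustration only.
sample = {
    "15/16": {"Bournemouth": ["70", "70", "70", "70"]},
    "17/18": {
        "Brighton & Hove Albion": ["72", "72", "72", "72"],
        "Bournemouth": ["74", "74", "74", "74"],
    },
    "10/11": {"Everton": ["80", "79", "78", "78"]},
}

rename_teams(sample, renames)
print(sorted(sample["17/18"]))
```

The membership test means a season that lacks the old name is skipped rather than raising `KeyError`, which the explicit `pop` calls would do.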

In [6]:
for k in sorted(team_ratings_per_season):
    print(k, ":")
    # check if all teams have 4 unique ratings
    print("Does each team have 4 unique ratings?", all(len(val)==4 for val in team_ratings_per_season[k].values()))
    print("No of teams:", len(team_ratings_per_season[k]))
    print('')

10/11 :
Does each team have 4 unique ratings? True
No of teams: 20

11/12 :
Does each team have 4 unique ratings? True
No of teams: 20

12/13 :
Does each team have 4 unique ratings? True
No of teams: 20

13/14 :
Does each team have 4 unique ratings? True
No of teams: 20

14/15 :
Does each team have 4 unique ratings? True
No of teams: 20

15/16 :
Does each team have 4 unique ratings? True
No of teams: 20

16/17 :
Does each team have 4 unique ratings? True
No of teams: 20

17/18 :
Does each team have 4 unique ratings? True
No of teams: 20

18/19 :
Does each team have 4 unique ratings? True
No of teams: 20

19/20 :
Does each team have 4 unique ratings? True
No of teams: 20



## **Storing the final dictionary to be accessed in other Jupyter Notebooks**
---

In [7]:
%store team_ratings_per_season

Stored 'team_ratings_per_season' (dict)
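
In the notebook that consumes the data, `%store -r team_ratings_per_season` restores the stored variable. If the dictionary also needs to be read outside IPython, a plain JSON file is one possible alternative. The sketch below is an assumption rather than part of the original workflow: the file name `team_ratings.json` is hypothetical, and a temporary directory is used so the example leaves no files behind in the working folder.

```python
import json
import os
import tempfile

# One entry taken from the manual-fix cell, used here as sample data.
sample = {"10/11": {"Everton": ["80", "79", "78", "78"]}}

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "team_ratings.json")

# Write the nested dictionary as JSON.
with open(path, "w", encoding="utf-8") as fh:
    json.dump(sample, fh, indent=2)

# Read it back, as another notebook or script would.
with open(path, encoding="utf-8") as fh:
    restored = json.load(fh)

print(restored == sample)
```

JSON keeps the nested dict-of-lists structure intact because all keys and values here are strings.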
