# FBref Player Report Scraper

*This is a scraper for extracting the data from FBref's player scouting reports. The goal is to scrape the three columns of the full scouting report: Stats (the column with the names of each metric), Per 90 Values, and Percentiles.*

In [1]:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

In [2]:
url = 'https://fbref.com/en/players/507c7bdf/scout/365_m1/Bruno-Fernandes-Scouting-Report'
page = requests.get(url)
page

<Response [200]>

In [3]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


In [4]:
soup = BeautifulSoup(page.text,'lxml')

In [5]:
for table in soup.find_all('table'):
  print(table.get('class'))

['stats_table', 'sortable', 'suppress_partial', 'suppress_share', 'suppress_link']
['stats_table', 'sortable', 'suppress_partial', 'suppress_share', 'suppress_link']
['stats_table', 'sortable', 'min_width', 'suppress_glossary', 'suppress_partial', 'suppress_share', 'suppress_link']
['stats_table', 'sortable', 'min_width', 'suppress_glossary', 'suppress_partial', 'suppress_share', 'suppress_link']
['stats_table', 'sortable', 'suppress_partial', 'suppress_share', 'suppress_link']
['stats_table', 'sortable', 'suppress_partial', 'suppress_share', 'suppress_link']
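
The class lists above are nearly identical from one table to the next, so the `id` attribute is the more useful thing to print when picking a table. A minimal sketch on toy markup (the two ids are the ones discussed in the next cell; the HTML string itself is a stand-in for the real page):

```python
from bs4 import BeautifulSoup

# Toy stand-in for the page: two tables that share classes but differ by id
html = """
<table class="stats_table sortable" id="scout_summary_AM"></table>
<table class="stats_table sortable" id="scout_full_AM"></table>
"""
soup = BeautifulSoup(html, "html.parser")

# .get("id") returns None for tables without an id, so filter those out
table_ids = [t.get("id") for t in soup.find_all("table")]
full_ids = [i for i in table_ids if i and i.startswith("scout_full_")]

print(table_ids)
print(full_ids)
```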


In [6]:
table = soup.find('table', id="scout_full_AM")

About the table id:
- "scout_full_XX" is the full scouting report; the summary report uses "scout_summary_XX" instead
- "XX" is the position group: "GK" for goalkeepers, "CB" for centre-backs, "FB" for full-backs, "MF" for midfielders, "AM" for attacking midfielders/wingers, "FW" for centre forwards/strikers
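
As a small illustration of the naming scheme above, the id can be built from the position group instead of being hard-coded. The helper name `scout_table_id` and its `full` parameter are made up for this sketch; only the "scout_full_XX" / "scout_summary_XX" pattern and the position codes come from the notes above.

```python
# Position groups listed in the notes above
POSITION_GROUPS = {"GK", "CB", "FB", "MF", "AM", "FW"}

def scout_table_id(position, full=True):
    """Build the table id for a position group (hypothetical helper)."""
    position = position.upper()
    if position not in POSITION_GROUPS:
        raise ValueError(f"unknown position group: {position!r}")
    prefix = "scout_full_" if full else "scout_summary_"
    return prefix + position

print(scout_table_id("AM"))
print(scout_table_id("gk", full=False))
```

With this in place, the lookup would read `soup.find('table', id=scout_table_id('AM'))`.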

In [7]:
headers = table.find('thead')

In [8]:
headers

<thead> <tr class="over_header"> <th aria-label="" class="over_header center group_start" colspan="3" data-stat="header_standard">Standard Stats</th> </tr> <tr> <th aria-label="Statistic" class="poptip center" data-over-header="Standard Stats" data-stat="statistic" data-tip="Hover over stat name for glossary definition" scope="col">Statistic</th> <th aria-label="Per 90" class="poptip center" data-over-header="Standard Stats" data-stat="per90" data-tip="Stats are listed in values per 90 minutes played or as percentages" scope="col">Per 90</th> <th aria-label="Percentile" class="poptip center" data-over-header="Standard Stats" data-stat="percentile" scope="col">Percentile</th> </tr> </thead>
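
The column labels can be read straight from this `thead`. A minimal sketch on a trimmed copy of the markup above, assuming the three real column headers are the ones carrying `scope="col"` (the over-header row does not have it):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the header markup shown above
html = """
<table><thead>
<tr class="over_header"><th colspan="3" data-stat="header_standard">Standard Stats</th></tr>
<tr>
<th data-stat="statistic" scope="col">Statistic</th>
<th data-stat="per90" scope="col">Per 90</th>
<th data-stat="percentile" scope="col">Percentile</th>
</tr>
</thead></table>
"""
thead = BeautifulSoup(html, "html.parser").find("thead")

# Keep only the cells marked as column headers, skipping the over-header
column_names = [th.get_text(strip=True) for th in thead.find_all("th", scope="col")]
print(column_names)
```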

In [9]:
data_rows = table.find('tbody')

In [10]:
#data_rows.find_all('th',{'class':"right poptip endpoint endpoint"})

In [11]:
data_rows.find('th',{'class':"right poptip endpoint endpoint"}).text.strip()

'Goals'

In [12]:
data_rows.find_all('td',{'class':"right"})

[<td class="right" csk="0.24" data-stat="per90">0.24</td>,
 <td class="right" csk="0.30" data-stat="per90">0.30</td>,
 <td class="right" csk="0.54" data-stat="per90">0.54</td>,
 <td class="right" csk="0.16" data-stat="per90">0.16</td>,
 <td class="right" csk="0.08" data-stat="per90">0.08</td>,
 <td class="right" csk="0.10" data-stat="per90">0.10</td>,
 <td class="right" csk="0.22" data-stat="per90">0.22</td>,
 <td class="right iz" csk="0.00" data-stat="per90">0.00</td>,
 <td class="right" csk="0.29" data-stat="per90">0.29</td>,
 <td class="right" csk="0.21" data-stat="per90">0.21</td>,
 <td class="right" csk="0.48" data-stat="per90">0.48</td>,
 <td class="right" csk="0.69" data-stat="per90">0.69</td>,
 <td class="right iz" data-stat="per90" style="padding:1px;"></td>,
 <td class="right" csk="2.19" data-stat="per90">2.19</td>,
 <td class="right" csk="7.88" data-stat="per90">7.88</td>,
 <td class="right" csk="5.91" data-stat="per90">5.91</td>,
 <td class="right" csk="0.24" data-stat="per

In [13]:
data_rows.find('td',{'class':"right"}).text.strip()

'0.24'

In [14]:
#data_rows.find_all('td',{'class':"left endpoint endpoint tooltip"})

In [15]:
data_rows.find('td',{'class':"left endpoint endpoint tooltip"}).text.strip()

'49'

We have to do some cleaning: if we test with, for instance, "Progressive Passes Rec" (per 90 = 5.91, percentile = 18), the lookup by index works for the stat name and the percentile, but not for the per 90 value.

In [16]:
stats_column = data_rows.find_all('th',{'class':"right poptip endpoint endpoint"})

In [17]:
percentile_values = data_rows.find_all('td',{'class':"left endpoint endpoint tooltip"})

In [18]:
stats_column[14].text

'Progressive Passes Rec'

In [19]:
percentile_values[14].text.strip()

'18'

In [20]:
per_90 = data_rows.find_all('td',{'class':"right"})

In [21]:
per_90[14].text.strip()

'7.88'

7.88 is the per 90 value of Progressive Passes, the row just before Progressive Passes Rec.

In [22]:
per_90[15].text.strip()

'5.91'

Here's why: there are some empty rows, e.g.:

In [23]:
per_90[12].text.strip()

''

Here's the issue: looking at the HTML, some cells have class "right iz" instead of "right". Some of those contain actual per 90 values (equal to 0), while others are empty cells that don't belong to a data row at all.
- td class="right" csk="0.06" data-stat="per90">0.06<
- td class="right iz" csk="0.00" data-stat="per90">0.00<
- td class="right iz" csk="0.00" data-stat="per90">0.00<
- **td class="right iz" data-stat="per90" style="padding:1px;"><**
- td class="right" csk="0.73" data-stat="per90">0.73<
- td class="right" csk="0.73" data-stat="per90">0.73<
- td class="right" csk="0.19" data-stat="per90">0.19<  
 
We have to remove only the cells that are empty (note: zero isn't empty, it's still a value).
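
Another way to avoid the misalignment, sketched here as an alternative rather than what this notebook does, is to walk the table row by row and skip rows whose per 90 cell is empty. The markup below is a toy stand-in; it assumes the body cells carry the same `data-stat` values ("statistic", "per90", "percentile") as the header cells.

```python
from bs4 import BeautifulSoup

# Toy rows: two real ones, one with a zero value, and one empty spacer row
html = """
<table><tbody>
<tr><th data-stat="statistic">Goals</th><td class="right" data-stat="per90">0.24</td><td data-stat="percentile"><div>49</div></td></tr>
<tr><th data-stat="statistic">Own Goals</th><td class="right iz" data-stat="per90">0.00</td><td data-stat="percentile"><div>50</div></td></tr>
<tr><th data-stat="statistic"></th><td class="right iz" data-stat="per90"></td><td data-stat="percentile"></td></tr>
<tr><th data-stat="statistic">Aerials Won</th><td class="right" data-stat="per90">0.56</td><td data-stat="percentile"><div>54</div></td></tr>
</tbody></table>
"""
tbody = BeautifulSoup(html, "html.parser").find("tbody")

rows = []
for tr in tbody.find_all("tr"):
    # Map each cell's data-stat to its stripped text
    cells = {c.get("data-stat"): c.get_text(strip=True) for c in tr.find_all(["th", "td"])}
    # "" is empty, but "0.00" is a value and is kept
    if not cells.get("per90"):
        continue
    rows.append((cells.get("statistic"), cells["per90"], cells.get("percentile")))

print(rows)
```

Because the three values are read from the same `tr`, they cannot drift out of step with each other.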

In [24]:
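#note: this is a reference to the same object, not a copy, so the cleaning below also changes data_rows
data2 = data_rows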

In [25]:
#remove every element that has no text at all (cells showing 0.00 are kept, since "0.00" is text)
for x in data2.find_all():
    if len(x.get_text(strip=True)) == 0:
        x.extract()
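
One thing to keep in mind: assigning a tag to a second name does not copy it, so extracting elements through either name edits the same parsed tree. If the original rows need to stay untouched, `copy.copy()` on a BeautifulSoup tag gives an independent copy. A minimal sketch on toy markup (the variable names are illustrative):

```python
import copy
from bs4 import BeautifulSoup

html = "<table><tbody><tr><td>1</td><td></td><td>0</td></tr></tbody></table>"
tbody = BeautifulSoup(html, "html.parser").find("tbody")

# An independent copy, detached from the original tree
cleaned = copy.copy(tbody)

# Drop empty cells from the copy only
for cell in cleaned.find_all("td"):
    if not cell.get_text(strip=True):
        cell.extract()

original_count = len(tbody.find_all("td"))
cleaned_count = len(cleaned.find_all("td"))
print(original_count, cleaned_count)
```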

In [26]:
per_90_values = data2.find_all('td',{'class':"right"})

In [27]:
per_90_values[14].text.strip()

'5.91'

In [28]:
stats_column2 = data2.find_all('th',{'class':"right poptip endpoint endpoint"})

In [29]:
stats_column2[14].text.strip()

'Progressive Passes Rec'

In [30]:
percentiles = data2.find_all('td',{'class':"left endpoint endpoint tooltip"})

In [31]:
percentiles[14].text.strip()

'18'

Note: we could dig up more stuff, like the "high/average/low" values for each metric, but I'll stick to the three main columns for now...  
I'm keeping my first attempt at creating a dataframe (df) below for the sake of documentation, but it led to a dead end...

In [32]:
df = pd.DataFrame({'Statistic':stats_column2, "Per 90": per_90_values, "Percentile":percentiles})

In [33]:
df

Unnamed: 0,Statistic,Per 90,Percentile
0,[Goals],[0.24],"[[49], , ]"
1,[Assists],[0.30],"[[81], , ]"
2,[Goals + Assists],[0.54],"[[72], , ]"
3,[Non-Penalty Goals],[0.16],"[[33], , ]"
4,[Penalty Kicks Made],[0.08],"[[89], , ]"
...,...,...,...
130,[Own Goals],[0.00],"[[50], , ]"
131,[Ball Recoveries],[6.09],"[[93], , ]"
132,[Aerials Won],[0.56],"[[54], , ]"
133,[Aerials Lost],[0.84],"[[61], , ]"


In [34]:
#note: this is relevant because it's what you'd get in the csv file if you saved that dataframe in that format...
df.astype(str)

Unnamed: 0,Statistic,Per 90,Percentile
0,"<th class=""right poptip endpoint endpoint"" dat...","<td class=""right"" csk=""0.24"" data-stat=""per90""...","<td class=""left endpoint endpoint tooltip"" csk..."
1,"<th class=""right poptip endpoint endpoint"" dat...","<td class=""right"" csk=""0.30"" data-stat=""per90""...","<td class=""left endpoint endpoint tooltip"" csk..."
2,"<th class=""right poptip endpoint endpoint"" dat...","<td class=""right"" csk=""0.54"" data-stat=""per90""...","<td class=""left endpoint endpoint tooltip"" csk..."
3,"<th class=""right poptip endpoint endpoint"" dat...","<td class=""right"" csk=""0.16"" data-stat=""per90""...","<td class=""left endpoint endpoint tooltip"" csk..."
4,"<th class=""right poptip endpoint endpoint"" dat...","<td class=""right"" csk=""0.08"" data-stat=""per90""...","<td class=""left endpoint endpoint tooltip"" csk..."
...,...,...,...
130,"<th class=""right poptip endpoint endpoint"" dat...","<td class=""right iz"" csk=""0.00"" data-stat=""per...","<td class=""left endpoint endpoint tooltip"" csk..."
131,"<th class=""right poptip endpoint endpoint"" dat...","<td class=""right"" csk=""6.09"" data-stat=""per90""...","<td class=""left endpoint endpoint tooltip"" csk..."
132,"<th class=""right poptip endpoint endpoint"" dat...","<td class=""right"" csk=""0.56"" data-stat=""per90""...","<td class=""left endpoint endpoint tooltip"" csk..."
133,"<th class=""right poptip endpoint endpoint"" dat...","<td class=""right"" csk=""0.84"" data-stat=""per90""...","<td class=""left endpoint endpoint tooltip"" csk..."


**The solution is just to 1) create one dataframe per column, and then 2) combine them into a final dataframe with the three columns**

In [35]:
stats = pd.DataFrame(stats_column2)

In [36]:
stats

Unnamed: 0,0
0,Goals
1,Assists
2,Goals + Assists
3,Non-Penalty Goals
4,Penalty Kicks Made
...,...
130,Own Goals
131,Ball Recoveries
132,Aerials Won
133,Aerials Lost


In [37]:
values_per90 = pd.DataFrame(per_90_values)

In [38]:
values_per90

Unnamed: 0,0
0,0.24
1,0.30
2,0.54
3,0.16
4,0.08
...,...
130,0.00
131,6.09
132,0.56
133,0.84


*percentile = pd.DataFrame(percentiles)* doesn't work. Instead, we iterate through *percentiles*, build a list from the text of each element, and then create a dataframe from that list.
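
A plausible explanation, offered as an assumption rather than something verified against the page: iterating over a BeautifulSoup tag yields its direct children, so a cell with several children looks like a small sequence rather than a single value. The sketch below uses toy markup; the inner `<div>` and the trailing spaces are guesses at what a percentile cell contains.

```python
from bs4 import BeautifulSoup

# Toy percentile cell: a nested element followed by stray whitespace (assumed structure)
html = '<table><tr><td class="left endpoint endpoint tooltip"><div>49</div>  </td></tr></table>'
cell = BeautifulSoup(html, "html.parser").find("td")

# Iterating a Tag walks its direct children, not its text
children = list(cell)

# get_text(strip=True) collapses the whole cell into one clean string
text = cell.get_text(strip=True)

print(len(children))
print(text)
```

Extracting the text first, as the loop below does, sidesteps the problem whatever the exact cell structure is.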

In [39]:
#After a lot of trial and error, this is how I first managed to select only the text part of each element
#of the percentiles set -> iterate with a for loop
#strip() is necessary because, for some reason, the text comes with extra whitespace after the numerical
#(percentile) value we're looking for...
for each in percentiles:
    print(each.text.strip())

49
81
72
33
89
88
26
57
71
54
99
97
16
99
18
49
73
62
44
21
23
15
85
89
88
71
54
24
27
29
93
95
30
95
99
83
86
56
94
95
49
99
98
47
81
99
95
97
96
96
62
99
95
93
89
81
98
95
83
78
89
71
84
97
93
5
5
98
98
82
30
59
54
52
72
85
83
55
21
48
44
89
83
95
77
78
96
98
54
4
73
87
68
73
85
97
8
95
99
98
89
81
42
95
11
14
67
88
43
75
17
35
16
12
32
91
90
90
18
26
57
53
40
12
45
83
73
83
54
11
50
93
54
61
67


In [40]:
#but this is the loop we're gonna use, to create a list with all this data
#creating new (empty) list
percentile_list = []
#we get the (stripped) text value for each line of the percentiles set, and add it all into the new list
for each in percentiles:
    percentile_list.append(each.text.strip())

In [41]:
percentile_list

['49',
 '81',
 '72',
 '33',
 '89',
 '88',
 '26',
 '57',
 '71',
 '54',
 '99',
 '97',
 '16',
 '99',
 '18',
 '49',
 '73',
 '62',
 '44',
 '21',
 '23',
 '15',
 '85',
 '89',
 '88',
 '71',
 '54',
 '24',
 '27',
 '29',
 '93',
 '95',
 '30',
 '95',
 '99',
 '83',
 '86',
 '56',
 '94',
 '95',
 '49',
 '99',
 '98',
 '47',
 '81',
 '99',
 '95',
 '97',
 '96',
 '96',
 '62',
 '99',
 '95',
 '93',
 '89',
 '81',
 '98',
 '95',
 '83',
 '78',
 '89',
 '71',
 '84',
 '97',
 '93',
 '5',
 '5',
 '98',
 '98',
 '82',
 '30',
 '59',
 '54',
 '52',
 '72',
 '85',
 '83',
 '55',
 '21',
 '48',
 '44',
 '89',
 '83',
 '95',
 '77',
 '78',
 '96',
 '98',
 '54',
 '4',
 '73',
 '87',
 '68',
 '73',
 '85',
 '97',
 '8',
 '95',
 '99',
 '98',
 '89',
 '81',
 '42',
 '95',
 '11',
 '14',
 '67',
 '88',
 '43',
 '75',
 '17',
 '35',
 '16',
 '12',
 '32',
 '91',
 '90',
 '90',
 '18',
 '26',
 '57',
 '53',
 '40',
 '12',
 '45',
 '83',
 '73',
 '83',
 '54',
 '11',
 '50',
 '93',
 '54',
 '61',
 '67']

In [42]:
#turning the list into a dataframe
percentileS = pd.DataFrame(percentile_list)

In [43]:
percentileS

Unnamed: 0,0
0,49
1,81
2,72
3,33
4,89
...,...
130,50
131,93
132,54
133,61


In [44]:
#columns of the three dataframes aren't named so we add column names here
stats.columns = ['Statistic']
values_per90.columns = ['Per 90']
percentileS.columns = ['Percentile']
#putting all three together into a single final dataframe
scouting_report = pd.concat([stats,values_per90,percentileS],axis=1,join='inner')

In [45]:
scouting_report

Unnamed: 0,Statistic,Per 90,Percentile
0,Goals,0.24,49
1,Assists,0.30,81
2,Goals + Assists,0.54,72
3,Non-Penalty Goals,0.16,33
4,Penalty Kicks Made,0.08,89
...,...,...,...
130,Own Goals,0.00,50
131,Ball Recoveries,6.09,93
132,Aerials Won,0.56,54
133,Aerials Lost,0.84,61
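
The same table could also be assembled in one step from plain lists of strings, which avoids the per-column dataframes. This is a sketch of an alternative, not the approach used above. The three short lists stand in for something like `[t.get_text(strip=True) for t in stats_column2]` and its equivalents for the other two columns, and the name `report` is illustrative.

```python
import pandas as pd

# Stand-ins for the text extracted from each column (values taken from the output above)
statistic = ["Goals", "Assists", "Own Goals"]
per_90 = ["0.24", "0.30", "0.00"]
percentile = ["49", "81", "50"]

# A dict of equal-length lists gives one named column per key
report = pd.DataFrame({"Statistic": statistic, "Per 90": per_90, "Percentile": percentile})

# The scraped values are strings; convert them so they can be sorted and compared as numbers
report["Per 90"] = pd.to_numeric(report["Per 90"], errors="coerce")
report["Percentile"] = pd.to_numeric(report["Percentile"], errors="coerce")

print(report)
print(report.dtypes)
```

With a dict of lists, pandas raises a `ValueError` if the lists differ in length, which doubles as a check that the three columns are still aligned.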


In [46]:
#save to csv
scouting_report.to_csv('Bruno Fernandes FBRef Report.csv',index=False)