# FBRef Player Report Scraper

*This is a scraper for extracting the data from FBRef's player scouting reports, using the example of Eugénie Le Sommer. The goal is to scrape the three columns of the full scouting report: Stats (the column with the names of each metric), Per 90 Values, and Percentiles.*

In [1]:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

In [2]:
url = 'https://fbref.com/en/players/1139a223/scout/365_f1/Eugenie-Le-Sommer-Scouting-Report'
page = requests.get(url)
page

<Response [200]>
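
The `<Response [200]>` above only confirms that this one request succeeded. A slightly more defensive version could look like the sketch below. The function name `fetch_page`, the timeout value and the User-Agent string are all hypothetical choices, not something FBref documents or requires.

```python
import requests


def fetch_page(url, timeout=10.0):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Placeholder User-Agent: an assumption for illustration only.
    headers = {"User-Agent": "scouting-report-notebook (hypothetical)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing later.
    response.raise_for_status()
    return response.text
```

Calling `fetch_page(url)` with the URL from the cell above would return the same text as `page.text`, or raise if the server answered with an error status.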

In [3]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


In [4]:
soup = BeautifulSoup(page.text,'lxml')

In [5]:
soup

<!DOCTYPE html>
<html class="no-js" data-root="/home/fb/deploy/www/base" data-version="klecko-" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport"/>
<link href="https://cdn.ssref.net/req/202309071" rel="dns-prefetch"/>
<!-- Quantcast Choice. Consent Manager Tag v2.0 (for TCF 2.0) -->
<script async="true" type="text/javascript">
    (function() {
	var host = window.location.hostname;
	var element = document.createElement('script');
	var firstScript = document.getElementsByTagName('script')[0];
	var url = 'https://cmp.quantcast.com'
	    .concat('/choice/', 'XwNYEpNeFfhfr', '/', host, 
		    '/choice.js?tag_version=V2');
	var uspTries = 0;
	var uspTriesLimit = 3;
	element.async = true;
	element.type = 'text/javascript';
	element.src = url;
	
	firstScript.parentNode.insertBefore(element, firstScript);
	
	function makeStub() {
	    var TCF_LOCATOR_NAME = '_

In [6]:
for table in soup.find_all('table'):
  print(table.get('class'))

['stats_table', 'sortable', 'suppress_partial', 'suppress_share', 'suppress_link']
['stats_table', 'sortable', 'min_width', 'suppress_glossary', 'suppress_partial', 'suppress_share', 'suppress_link']
['stats_table', 'sortable', 'suppress_partial', 'suppress_share', 'suppress_link']
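
The class lists are nearly identical across the three tables, so they are not much help for picking one. Printing the `id` of each table is usually more telling. A minimal sketch on a made-up page fragment (the middle table having no id is an assumption for the example), parsed with `html.parser` so it does not depend on lxml:

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the real page.
sample_html = """
<table class="stats_table sortable" id="scout_summary_FW"></table>
<table class="stats_table sortable min_width"></table>
<table class="stats_table sortable" id="scout_full_FW"></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
# .get("id") returns None when a table has no id attribute.
table_ids = [t.get("id") for t in sample_soup.find_all("table")]
print(table_ids)
```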


In [7]:
table = soup.find('table', id="scout_full_FW")

In [8]:
table

<table class="stats_table sortable suppress_partial suppress_share suppress_link" data-cols-to-freeze=",1" id="scout_full_FW"> <caption><img alt="FBref.com Logo" onerror="this.src='https://cdn.ssref.net/req/202102032/logos/fb-logo.png'; this.onerror = null;" src="https://cdn.ssref.net/req/202102032/logos/fb-logo.svg" style="width: 3.35em;"/> Eugénie Le Sommer <span style="text-overflow:ellipsis;">Complete Scouting Report</span> Table</caption> <colgroup><col/><col/><col/></colgroup> <thead> <tr class="over_header"> <th aria-label="" class="over_header center group_start" colspan="3" data-stat="header_standard">Standard Stats</th> </tr> <tr> <th aria-label="Statistic" class="poptip center" data-over-header="Standard Stats" data-stat="statistic" data-tip="Hover over stat name for glossary definition" scope="col">Statistic</th> <th aria-label="Per 90" class="poptip center" data-over-header="Standard Stats" data-stat="per90" data-tip="Stats are listed in values per 90 minutes played or as 

Note: here the id of the full scouting report table is "scout_full_FW" (for centre-forwards, etc.); the summary table uses "scout_summary_XX" instead. The suffix depends on the player's position: "scout_full_CB" for a centre-back, "scout_full_FB" for a full-back, and so on, with GK for goalkeepers, MF for midfielders, and AM for attacking midfielders/wingers.
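
To reuse the notebook for other players, the id can be built from the position code. A small sketch, where `scout_table_id` and `POSITION_CODES` are hypothetical names and the set of codes is limited to the ones listed in the note above:

```python
# Position codes taken from the note above; FBref may use others.
POSITION_CODES = {"FW", "AM", "MF", "FB", "CB", "GK"}


def scout_table_id(position, full=True):
    """Build the table id for a position code, e.g. 'FW' -> 'scout_full_FW'."""
    code = position.upper()
    if code not in POSITION_CODES:
        raise ValueError(f"unknown position code: {position!r}")
    prefix = "scout_full" if full else "scout_summary"
    return f"{prefix}_{code}"
```

The result can then be passed to `soup.find('table', id=...)` in place of the hard-coded string.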

In [9]:
headers = table.find('thead')

In [10]:
headers

<thead> <tr class="over_header"> <th aria-label="" class="over_header center group_start" colspan="3" data-stat="header_standard">Standard Stats</th> </tr> <tr> <th aria-label="Statistic" class="poptip center" data-over-header="Standard Stats" data-stat="statistic" data-tip="Hover over stat name for glossary definition" scope="col">Statistic</th> <th aria-label="Per 90" class="poptip center" data-over-header="Standard Stats" data-stat="per90" data-tip="Stats are listed in values per 90 minutes played or as percentages" scope="col">Per 90</th> <th aria-label="Percentile" class="poptip center" data-over-header="Standard Stats" data-stat="percentile" scope="col">Percentile</th> </tr> </thead>
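
The column names can be read from this header instead of being typed by hand later. A minimal sketch on a trimmed copy of the header above (most attributes removed), relying on the `scope="col"` attribute to skip the "Standard Stats" over-header:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <thead> shown above.
thead_html = """
<thead>
<tr class="over_header"><th colspan="3" data-stat="header_standard">Standard Stats</th></tr>
<tr>
<th data-stat="statistic" scope="col">Statistic</th>
<th data-stat="per90" scope="col">Per 90</th>
<th data-stat="percentile" scope="col">Percentile</th>
</tr>
</thead>
"""

thead = BeautifulSoup(thead_html, "html.parser")
# Only the real column headers carry scope="col".
column_names = [th.get_text(strip=True) for th in thead.find_all("th", scope="col")]
print(column_names)
```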

In [11]:
data_rows = table.find('tbody')

In [12]:
data_rows

<tbody> <tr><th class="right poptip endpoint endpoint" data-endpoint="/en/ajax/glossary.cgi?html=1&amp;stat=goals" data-stat="statistic" data-tip="Goals scored or allowed" scope="row" style="color: #000">Goals</th><td class="right" csk="0.45" data-stat="per90">0.45</td><td class="left endpoint endpoint tooltip" csk="41" data-endpoint="/en/ajax/distribution.cgi?html=1&amp;person_id=1139a223&amp;name_display=Eugénie Le Sommer&amp;pool=365_f1&amp;pos=FW&amp;stat=goals&amp;pos_title=Forwards" data-stat="percentile"><div align="right" style="display: inline-block; width: 1.75em;">41</div> <div style="width: min(30vw,190px); display: inline-block;"> <div style="display: inline-block; width: 41%; background-color: rgb(175 152 152);"> </div> </div> </td></tr> <tr><th class="right poptip endpoint endpoint" data-endpoint="/en/ajax/glossary.cgi?html=1&amp;stat=assists" data-stat="statistic" data-tip="Assists" scope="row" style="color: #000">Assists</th><td class="right" csk="0.39" data-stat="per9

In [13]:
data_rows.find_all('th',{'class':"right poptip endpoint endpoint"})

[<th class="right poptip endpoint endpoint" data-endpoint="/en/ajax/glossary.cgi?html=1&amp;stat=goals" data-stat="statistic" data-tip="Goals scored or allowed" scope="row" style="color: #000">Goals</th>,
 <th class="right poptip endpoint endpoint" data-endpoint="/en/ajax/glossary.cgi?html=1&amp;stat=assists" data-stat="statistic" data-tip="Assists" scope="row" style="color: #000">Assists</th>,
 <th class="right poptip endpoint endpoint" data-endpoint="/en/ajax/glossary.cgi?html=1&amp;stat=goals_assists" data-stat="statistic" data-tip="Goals and Assists" scope="row" style="color: #000">Goals + Assists</th>,
 <th class="right poptip endpoint endpoint" data-endpoint="/en/ajax/glossary.cgi?html=1&amp;stat=goals_pens" data-stat="statistic" data-tip="Non-Penalty Goals" scope="row" style="color: #000">Non-Penalty Goals</th>,
 <th class="right poptip endpoint endpoint" data-endpoint="/en/ajax/glossary.cgi?html=1&amp;stat=pens_made" data-stat="statistic" data-tip="Penalty Kicks Made" scope="ro

In [14]:
data_rows.find('th',{'class':"right poptip endpoint endpoint"}).text.strip()

'Goals'

In [15]:
data_rows.find_all('td',{'class':"right"})

[<td class="right" csk="0.45" data-stat="per90">0.45</td>,
 <td class="right" csk="0.39" data-stat="per90">0.39</td>,
 <td class="right" csk="0.84" data-stat="per90">0.84</td>,
 <td class="right" csk="0.45" data-stat="per90">0.45</td>,
 <td class="right iz" csk="0.00" data-stat="per90">0.00</td>,
 <td class="right iz" csk="0.00" data-stat="per90">0.00</td>,
 <td class="right" csk="0.06" data-stat="per90">0.06</td>,
 <td class="right iz" csk="0.00" data-stat="per90">0.00</td>,
 <td class="right" csk="0.73" data-stat="per90">0.73</td>,
 <td class="right" csk="0.73" data-stat="per90">0.73</td>,
 <td class="right" csk="0.36" data-stat="per90">0.36</td>,
 <td class="right" csk="1.09" data-stat="per90">1.09</td>,
 <td class="right iz" data-stat="per90" style="padding:1px;"></td>,
 <td class="right" csk="1.69" data-stat="per90">1.69</td>,
 <td class="right" csk="4.09" data-stat="per90">4.09</td>,
 <td class="right" csk="8.75" data-stat="per90">8.75</td>,
 <td class="right" csk="0.45" data-sta

In [16]:
data_rows.find('td',{'class':"right"}).text.strip()

'0.45'

In [17]:
data_rows.find_all('td',{'class':"left endpoint endpoint tooltip"})

[<td class="left endpoint endpoint tooltip" csk="41" data-endpoint="/en/ajax/distribution.cgi?html=1&amp;person_id=1139a223&amp;name_display=Eugénie Le Sommer&amp;pool=365_f1&amp;pos=FW&amp;stat=goals&amp;pos_title=Forwards" data-stat="percentile"><div align="right" style="display: inline-block; width: 1.75em;">41</div> <div style="width: min(30vw,190px); display: inline-block;"> <div style="display: inline-block; width: 41%; background-color: rgb(175 152 152);"> </div> </div> </td>,
 <td class="left endpoint endpoint tooltip" csk="99" data-endpoint="/en/ajax/distribution.cgi?html=1&amp;person_id=1139a223&amp;name_display=Eugénie Le Sommer&amp;pool=365_f1&amp;pos=FW&amp;stat=assists&amp;pos_title=Forwards" data-stat="percentile"><div align="right" style="display: inline-block; width: 1.75em;">99</div> <div style="width: min(30vw,190px); display: inline-block;"> <div style="display: inline-block; width: 99%; background-color: rgb(52 175 52);"> </div> </div> </td>,
 <td class="left endpo

In [18]:
data_rows.find('td',{'class':"left endpoint endpoint tooltip"}).text.strip()

'41'

We have to do some cleaning: if we test with, for instance, "Progressive Passes Rec" (per 90 = 8.75, percentile = 79), index 14 gives the right stat name and percentile, but not the right per 90 value.

In [19]:
stats_column = data_rows.find_all('th',{'class':"right poptip endpoint endpoint"})

In [20]:
percentile_values = data_rows.find_all('td',{'class':"left endpoint endpoint tooltip"})

In [21]:
stats_column[14].text

'Progressive Passes Rec'

In [22]:
percentile_values[14].text.strip()

'79'

In [23]:
per_90 = data_rows.find_all('td',{'class':"right"})

In [24]:
per_90[14].text.strip()

'4.09'

4.09 is the per 90 value of Progressive Passes, the row just before Progressive Passes Rec.

In [25]:
per_90[15].text.strip()

'8.75'

Here's why: the list contains some empty cells, e.g.:

In [26]:
per_90[12].text.strip()

''

Here's the issue: looking at the HTML, some cells have the class "right iz" instead of "right", and a search for the class "right" matches both. Some of the "right iz" cells contain actual per 90 values (equal to 0), but others are empty spacer cells, not data.  
- td class="right" csk="0.06" data-stat="per90">0.06<
- td class="right iz" csk="0.00" data-stat="per90">0.00<
- td class="right iz" csk="0.00" data-stat="per90">0.00<
- **td class="right iz" data-stat="per90" style="padding:1px;"><**
- td class="right" csk="0.73" data-stat="per90">0.73<
- td class="right" csk="0.73" data-stat="per90">0.73<
- td class="right" csk="0.19" data-stat="per90">0.19<  
 
We have to remove only the cells that are empty (note: zero isn't empty, it's still a value).
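
An alternative to deleting nodes, sketched on a simplified three-row sample (attributes trimmed, and the layout of the spacer row is an assumption): select the per 90 cells by their `data-stat` attribute and keep only those with text, so that 0.00 survives and the spacer cell does not.

```python
from bs4 import BeautifulSoup

# Simplified sample: two data rows and one empty spacer row.
rows_html = """
<tbody>
<tr><th data-stat="statistic">Goals</th><td class="right" csk="0.45" data-stat="per90">0.45</td></tr>
<tr><th data-stat="statistic"></th><td class="right iz" data-stat="per90" style="padding:1px;"></td></tr>
<tr><th data-stat="statistic">Penalty Kicks Made</th><td class="right iz" csk="0.00" data-stat="per90">0.00</td></tr>
</tbody>
"""

cells = BeautifulSoup(rows_html, "html.parser").find_all("td", attrs={"data-stat": "per90"})
# Keep a cell only if it has text; '0.00' is a non-empty string, so it stays.
per90_clean = [c.get_text(strip=True) for c in cells if c.get_text(strip=True)]
print(per90_clean)
```

This leaves the parsed tree untouched, which can help if the same rows are needed again later.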

In [27]:
data2 = data_rows
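
Worth noting: `data2 = data_rows` binds a second name to the same object, so anything removed through `data2` is also gone from `data_rows`. If the untouched rows are still needed afterwards, `copy.copy()` on a BeautifulSoup tag gives an independent copy. A minimal sketch on a made-up two-cell fragment:

```python
import copy

from bs4 import BeautifulSoup

original = BeautifulSoup(
    "<tbody><tr><td>0.45</td><td></td></tr></tbody>", "html.parser"
).tbody

alias = original                   # same object under a second name
independent = copy.copy(original)  # separate copy of the tag and its children

# Remove empty cells from the copy only.
for cell in independent.find_all("td"):
    if not cell.get_text(strip=True):
        cell.extract()

print(len(original.find_all("td")), len(independent.find_all("td")))
```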

In [28]:
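# remove every element that has no text (the empty spacer cells, but also the empty bar-chart divs)
for x in data2.find_all():
    if len(x.get_text(strip=True)) == 0:
        x.extract()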

print(data2)

<tbody> <tr><th class="right poptip endpoint endpoint" data-endpoint="/en/ajax/glossary.cgi?html=1&amp;stat=goals" data-stat="statistic" data-tip="Goals scored or allowed" scope="row" style="color: #000">Goals</th><td class="right" csk="0.45" data-stat="per90">0.45</td><td class="left endpoint endpoint tooltip" csk="41" data-endpoint="/en/ajax/distribution.cgi?html=1&amp;person_id=1139a223&amp;name_display=Eugénie Le Sommer&amp;pool=365_f1&amp;pos=FW&amp;stat=goals&amp;pos_title=Forwards" data-stat="percentile"><div align="right" style="display: inline-block; width: 1.75em;">41</div>  </td></tr> <tr><th class="right poptip endpoint endpoint" data-endpoint="/en/ajax/glossary.cgi?html=1&amp;stat=assists" data-stat="statistic" data-tip="Assists" scope="row" style="color: #000">Assists</th><td class="right" csk="0.39" data-stat="per90">0.39</td><td class="left endpoint endpoint tooltip" csk="99" data-endpoint="/en/ajax/distribution.cgi?html=1&amp;person_id=1139a223&amp;name_display=Eugénie

In [29]:
per_90_values = data2.find_all('td',{'class':"right"})

In [30]:
per_90_values[14].text.strip()

'8.75'

In [31]:
stats_column2 = data2.find_all('th',{'class':"right poptip endpoint endpoint"})

In [32]:
stats_column2[14].text.strip()

'Progressive Passes Rec'

In [33]:
percentiles = data2.find_all('td',{'class':"left endpoint endpoint tooltip"})

In [34]:
percentiles[14].text.strip()

'79'
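
Checking index 14 by hand works for one row. A quick length check covers all of them before the columns are combined. In this sketch `zip_aligned` is a hypothetical helper, and the sample values are the first two rows of the report:

```python
def zip_aligned(*columns):
    """Zip columns into rows, refusing to do so if their lengths differ."""
    lengths = {len(col) for col in columns}
    if len(lengths) != 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return list(zip(*columns))


aligned_rows = zip_aligned(["Goals", "Assists"], ["0.45", "0.39"], ["41", "99"])
print(aligned_rows)
```

With the uncleaned per 90 list, which still contains the empty cells, the same call would raise instead of silently shifting values.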

Note: we could extract more, like the "high/average/low" values for each metric, but I'll stick to the three main columns for now.  
I'm keeping my first attempt at creating a dataframe (df) below for the sake of documentation, even though it led to a dead end.

In [35]:
df = pd.DataFrame()

In [36]:
# DataFrame.append was deprecated and then removed in pandas 2.0, so the dataframe is built directly
df = pd.DataFrame({'Statistic':stats_column2, "Per 90": per_90_values, "Percentile":percentiles})


In [37]:
df

|     | Statistic | Per 90 | Percentile |
|-----|-----------|--------|------------|
| 0 | [Goals] | [0.45] | [[41], , ] |
| 1 | [Assists] | [0.39] | [[99], , ] |
| 2 | [Goals + Assists] | [0.84] | [[66], , ] |
| 3 | [Non-Penalty Goals] | [0.45] | [[45], , ] |
| 4 | [Penalty Kicks Made] | [0.00] | [[28], , ] |
| ... | ... | ... | ... |
| 130 | [Own Goals] | [0.00] | [[50], , ] |
| 131 | [Ball Recoveries] | [4.28] | [[90], , ] |
| 132 | [Aerials won] | [0.97] | [[45], , ] |
| 133 | [Aerials lost] | [0.58] | [[97], , ] |


In [38]:
#note: this is relevant because it's what you'd get in the CSV file if you saved that dataframe as is
df.astype(str)

|     | Statistic | Per 90 | Percentile |
|-----|-----------|--------|------------|
| 0 | `<th class="right poptip endpoint endpoint" dat...` | `<td class="right" csk="0.45" data-stat="per90"...` | `<td class="left endpoint endpoint tooltip" csk...` |
| 1 | `<th class="right poptip endpoint endpoint" dat...` | `<td class="right" csk="0.39" data-stat="per90"...` | `<td class="left endpoint endpoint tooltip" csk...` |
| 2 | `<th class="right poptip endpoint endpoint" dat...` | `<td class="right" csk="0.84" data-stat="per90"...` | `<td class="left endpoint endpoint tooltip" csk...` |
| 3 | `<th class="right poptip endpoint endpoint" dat...` | `<td class="right" csk="0.45" data-stat="per90"...` | `<td class="left endpoint endpoint tooltip" csk...` |
| 4 | `<th class="right poptip endpoint endpoint" dat...` | `<td class="right iz" csk="0.00" data-stat="per...` | `<td class="left endpoint endpoint tooltip" csk...` |
| ... | ... | ... | ... |
| 130 | `<th class="right poptip endpoint endpoint" dat...` | `<td class="right iz" csk="0.00" data-stat="per...` | `<td class="left endpoint endpoint tooltip" csk...` |
| 131 | `<th class="right poptip endpoint endpoint" dat...` | `<td class="right" csk="4.28" data-stat="per90"...` | `<td class="left endpoint endpoint tooltip" csk...` |
| 132 | `<th class="right poptip endpoint endpoint" dat...` | `<td class="right" csk="0.97" data-stat="per90"...` | `<td class="left endpoint endpoint tooltip" csk...` |
| 133 | `<th class="right poptip endpoint endpoint" dat...` | `<td class="right" csk="0.58" data-stat="per90"...` | `<td class="left endpoint endpoint tooltip" csk...` |
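
The output above shows that the dataframe is holding whole HTML tags rather than plain strings. One way around that, sketched here on a single simplified row (attributes trimmed, `text_df` is a hypothetical name), is to pull the text out of every cell before handing the lists to pandas:

```python
import pandas as pd
from bs4 import BeautifulSoup

# One simplified row; the real cells carry more attributes.
row_html = (
    '<tr><th data-stat="statistic">Goals</th>'
    '<td data-stat="per90">0.45</td>'
    '<td data-stat="percentile"><div>41</div> </td></tr>'
)
row = BeautifulSoup(row_html, "html.parser")


def cell_texts(tag_name, data_stat):
    """Return the stripped text of every matching cell in the sample row."""
    found = row.find_all(tag_name, attrs={"data-stat": data_stat})
    return [cell.get_text(strip=True) for cell in found]


text_df = pd.DataFrame({
    "Statistic": cell_texts("th", "statistic"),
    "Per 90": cell_texts("td", "per90"),
    "Percentile": cell_texts("td", "percentile"),
})
print(text_df)
```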


**The solution is just to 1) create one dataframe per column, and then 2) combine them into a final dataframe with the three columns**

In [39]:
stats = pd.DataFrame(stats_column2)

In [40]:
stats

|     | 0 |
|-----|---|
| 0 | Goals |
| 1 | Assists |
| 2 | Goals + Assists |
| 3 | Non-Penalty Goals |
| 4 | Penalty Kicks Made |
| ... | ... |
| 130 | Own Goals |
| 131 | Ball Recoveries |
| 132 | Aerials won |
| 133 | Aerials lost |


In [41]:
values_per90 = pd.DataFrame(per_90_values)

In [42]:
values_per90

|     | 0 |
|-----|---|
| 0 | 0.45 |
| 1 | 0.39 |
| 2 | 0.84 |
| 3 | 0.45 |
| 4 | 0.00 |
| ... | ... |
| 130 | 0.00 |
| 131 | 4.28 |
| 132 | 0.97 |
| 133 | 0.58 |


*percentile = pd.DataFrame(percentiles)* doesn't work: we have to iterate through *percentiles*, build a list from the text of each cell, and then create a dataframe from that list.
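
In the percentile cells shown earlier, the number appears twice: as the visible text inside the first `div`, and in the `csk` attribute of the cell. Whether `csk` always mirrors the text is an assumption based on those few rows only. A minimal sketch reading both from one cleaned cell:

```python
from bs4 import BeautifulSoup

# A percentile cell as it looks after the cleanup step (endpoint attribute trimmed).
cell_html = (
    '<td class="left endpoint endpoint tooltip" csk="41" data-stat="percentile">'
    '<div align="right" style="display: inline-block; width: 1.75em;">41</div>  </td>'
)
cell = BeautifulSoup(cell_html, "html.parser").td

from_text = cell.get_text(strip=True)  # visible text, surrounding spaces removed
from_attr = cell.get("csk")            # attribute value, None if the attribute is missing
print(from_text, from_attr)
```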

In [43]:
#After a lot of trial and error, this is how I first managed to select only the text part of each cell
#in percentiles: iterate over them with a for loop.
#strip() is necessary because the text of each cell still has two spaces "  " after the numerical
#value (the percentile) we're looking for
for each in percentiles:
    print(each.text.strip())

41
99
66
45
28
24
45
52
93
97
99
99
48
99
79
41
72
69
38
24
28
57
83
28
24
93
97
99
1
1
99
99
93
99
97
99
97
97
99
99
78
97
86
90
99
99
99
99
31
99
86
99
99
97
97
99
69
17
93
76
99
99
93
47
99
38
28
97
99
99
1
62
76
14
97
99
93
26
66
21
36
76
90
45
97
45
86
93
34
1
72
38
76
90
86
21
50
99
38
45
59
99
83
99
48
45
24
52
45
93
69
62
48
79
48
83
69
90
79
45
52
50
97
38
59
93
90
90
24
55
50
90
45
97
97


In [44]:
#this is the loop we'll actually use, to build a list with all this data
#create a new (empty) list
percentile_list = []
#get the stripped text of each cell in percentiles and append it to the new list
for each in percentiles:
    percentile_list.append(each.text.strip())

In [45]:
percentile_list

['41',
 '99',
 '66',
 '45',
 '28',
 '24',
 '45',
 '52',
 '93',
 '97',
 '99',
 '99',
 '48',
 '99',
 '79',
 '41',
 '72',
 '69',
 '38',
 '24',
 '28',
 '57',
 '83',
 '28',
 '24',
 '93',
 '97',
 '99',
 '1',
 '1',
 '99',
 '99',
 '93',
 '99',
 '97',
 '99',
 '97',
 '97',
 '99',
 '99',
 '78',
 '97',
 '86',
 '90',
 '99',
 '99',
 '99',
 '99',
 '31',
 '99',
 '86',
 '99',
 '99',
 '97',
 '97',
 '99',
 '69',
 '17',
 '93',
 '76',
 '99',
 '99',
 '93',
 '47',
 '99',
 '38',
 '28',
 '97',
 '99',
 '99',
 '1',
 '62',
 '76',
 '14',
 '97',
 '99',
 '93',
 '26',
 '66',
 '21',
 '36',
 '76',
 '90',
 '45',
 '97',
 '45',
 '86',
 '93',
 '34',
 '1',
 '72',
 '38',
 '76',
 '90',
 '86',
 '21',
 '50',
 '99',
 '38',
 '45',
 '59',
 '99',
 '83',
 '99',
 '48',
 '45',
 '24',
 '52',
 '45',
 '93',
 '69',
 '62',
 '48',
 '79',
 '48',
 '83',
 '69',
 '90',
 '79',
 '45',
 '52',
 '50',
 '97',
 '38',
 '59',
 '93',
 '90',
 '90',
 '24',
 '55',
 '50',
 '90',
 '45',
 '97',
 '97']

In [46]:
#turning the list into a dataframe
percentileS = pd.DataFrame(percentile_list)

In [47]:
percentileS

|     | 0 |
|-----|---|
| 0 | 41 |
| 1 | 99 |
| 2 | 66 |
| 3 | 45 |
| 4 | 28 |
| ... | ... |
| 130 | 50 |
| 131 | 90 |
| 132 | 45 |
| 133 | 97 |


In [48]:
#columns of the three dataframes aren't named so we add column names here
stats.columns = ['Statistic']
values_per90.columns = ['Per 90']
percentileS.columns = ['Percentile']
#putting all three together into a single final dataframe
scouting_report = pd.concat([stats,values_per90,percentileS],axis=1,join='inner')

In [49]:
scouting_report

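|     | Statistic | Per 90 | Percentile |
|-----|-----------|--------|------------|
| 0 | Goals | 0.45 | 41 |
| 1 | Assists | 0.39 | 99 |
| 2 | Goals + Assists | 0.84 | 66 |
| 3 | Non-Penalty Goals | 0.45 | 45 |
| 4 | Penalty Kicks Made | 0.00 | 28 |
| ... | ... | ... | ... |
| 130 | Own Goals | 0.00 | 50 |
| 131 | Ball Recoveries | 4.28 | 90 |
| 132 | Aerials won | 0.97 | 45 |
| 133 | Aerials lost | 0.58 | 97 |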
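
The values in this dataframe come from text and tags, so they are not numbers yet. If they are going to be sorted or plotted, converting them first may help. A sketch on a two-row stand-in (`report` is a hypothetical name): stripping a trailing "%" is an assumption, based on the header tooltip saying some stats are listed as percentages.

```python
import pandas as pd

# Two-row stand-in for the final dataframe, with values as strings.
report = pd.DataFrame({
    "Statistic": ["Goals", "Assists"],
    "Per 90": ["0.45", "0.39"],
    "Percentile": ["41", "99"],
})

# Assumption: percentage-style values, if present, end in '%'.
# errors="coerce" turns anything that still isn't numeric into NaN.
report["Per 90"] = pd.to_numeric(report["Per 90"].str.rstrip("%"), errors="coerce")
report["Percentile"] = pd.to_numeric(report["Percentile"], errors="coerce")
print(report.dtypes)
```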


In [50]:
#save to csv
scouting_report.to_csv('Eugénie Le Sommer FBRef Report.csv',index=False)
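
For reuse on other players, the steps above can be folded into one row-by-row pass, which avoids the alignment problem because each value is read from the same `tr` as its stat name. This is a sketch under assumptions, not the notebook's method: `parse_report_rows` and `rows_to_csv` are hypothetical helpers, cells are selected by the `data-stat` attributes seen in the HTML above, the spacer row layout in the sample is assumed, and the CSV is written with the standard `csv` module instead of pandas.

```python
import csv
import io

from bs4 import BeautifulSoup

FIELDS = ("Statistic", "Per 90", "Percentile")


def parse_report_rows(tbody_html):
    """Return one dict per data row, skipping rows with a missing or empty cell."""
    body = BeautifulSoup(tbody_html, "html.parser")
    rows = []
    for tr in body.find_all("tr"):
        cells = (
            tr.find("th", attrs={"data-stat": "statistic"}),
            tr.find("td", attrs={"data-stat": "per90"}),
            tr.find("td", attrs={"data-stat": "percentile"}),
        )
        if any(cell is None for cell in cells):
            continue
        values = [cell.get_text(strip=True) for cell in cells]
        if not all(values):
            continue  # spacer row: at least one cell has no text
        rows.append(dict(zip(FIELDS, values)))
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Simplified sample: two data rows around one empty spacer row.
sample_tbody = (
    '<tbody>'
    '<tr><th data-stat="statistic">Goals</th><td data-stat="per90">0.45</td>'
    '<td data-stat="percentile"><div>41</div> <div><div> </div></div> </td></tr>'
    '<tr><th data-stat="statistic"></th><td data-stat="per90"></td>'
    '<td data-stat="percentile"></td></tr>'
    '<tr><th data-stat="statistic">Assists</th><td data-stat="per90">0.39</td>'
    '<td data-stat="percentile"><div>99</div> <div><div> </div></div> </td></tr>'
    '</tbody>'
)
sample_rows = parse_report_rows(sample_tbody)
sample_csv = rows_to_csv(sample_rows)
print(sample_csv)
```

On the real page, the input would be `str(table.find('tbody'))`, and the returned text could be written to a file opened with `encoding='utf-8'`.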