# Scraping Part 1: Getting Player URLs

## Context

The goal of the FUT-Stats project is to run analyses to better understand the FIFA Ultimate Team market and economy, but that requires data. There is no official data source from EA for FUT, but the player-run website FUTBin has become the de facto source for player and pricing data from the game.

However, there is also no official FUTBin-sanctioned way to acquire their data in a form that we can easily work with, so we'll have to scrape it from the website. For the most part, the plan is to use Scrapy to do this, fetching data from each player into a database we can then work from.

The first step is to page through the player list and collect the URL of every player card. We could do this in Scrapy along with all of the other steps, but it is a good opportunity to show how more basic web scraping works with the requests and Beautiful Soup libraries.

The goal for this notebook is to end up with a saved list of every player URL on FUTBin.

## Analyzing the website

Before we write even a single line of code, we need to have a look at the website and see what information we want and where it's located. Let's load up futbin.com.  
  
  
  
### Homepage
![](img/futbin_homepage.png)  


### Player directory
![](img/futbin_playerdirectory.png)  


### Player page
![](img/futbin_player.png) 


## Making a plan

Essentially, what we want to do is open the player directory, page through the players so we collect every single URL, and then collect as much information as we can about each player from their individual page. We could do this a few different ways. One of them is to use a spider-like approach, where our crawler goes through each page and "clicks" on each player, collecting data along the way.

Another is to first collect every player URL and then run another scraper over that list to collect player information.

Both approaches have their advantages. The first is harder to build but easier to run, since it does everything in a single step. The second is easier to build and more robust against being blocked by the server, since it can be batched more easily.

Since we want to show the code directly in the notebook, we're going to use the second approach. We can write it with the simpler syntax of requests and bs4, as opposed to Scrapy, which takes a more complex object-oriented approach to building web scrapers.
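
To make the batching argument concrete, here is a minimal sketch of how a saved URL list could be split into fixed-size chunks for a later scraping run. The `batched` helper and the example URLs are hypothetical and only illustrate the idea; they are not part of the notebook's code.

```python
from typing import Iterator, List


def batched(urls: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield urls[start:start + batch_size]


# Hypothetical URLs, used only to illustrate the idea.
example_urls = [f"https://futbin.com/20/player/{i}/example" for i in range(1, 8)]
batches = list(batched(example_urls, batch_size=3))
print([len(batch) for batch in batches])
```

Each batch can then be scraped in its own session, with a longer pause between batches, so a block from the server costs one batch rather than the whole run.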

## Importing the relevant libraries

In [5]:
import pandas as pd

import requests
import bs4
import json

The first thing we need to do is make sure the page we want to scrape is actually accessible from our code. The requests library works quite simply: an HTTP request is made to the URL you supply and the response is saved as an object.

In [6]:
players_url = 'https://www.futbin.com/players'
res = requests.get(players_url)

res

<Response [200]>

A 200 response means OK, so the script was able to fetch the webpage without further issues. Sometimes, you may need to specify custom headers and cookies to avoid being blocked by the target server. 
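
As a rough sketch of what custom headers look like in practice, the snippet below attaches them to a `requests.Session` and prepares a request without sending it, so nothing touches the network. The header values are assumptions for illustration; in a real run you would copy them from your own browser's developer tools.

```python
import requests

# Assumed, illustrative header values; copy real ones from your own browser.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
    "Accept-Language": "en-US,en;q=0.5",
}

session = requests.Session()
session.headers.update(headers)

# Build the request without sending it, so we can inspect what would be sent.
request = requests.Request("GET", "https://www.futbin.com/players", params={"page": 2})
prepared = session.prepare_request(request)

print(prepared.url)
print(prepared.headers["User-Agent"])
```

Calling `session.get(...)` with the same URL would send these headers on every request made through that session.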

The response object exposes the returned HTML through two attributes: .text holds it as a decoded string, while .content holds the raw bytes.

In [7]:
#As we can see, quite a lot of HTML was returned - this is the code that makes up the page you see in your browser.
len(res.text)

277313

## Inspecting HTML to find the player URLs
![](img/futbin_table_inspect.png) 


## Finding elements with selectors

In [34]:
player_urls = []

In [35]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

results = soup.find_all('a', class_='player_name_players_table')

for result in results:
    player_urls.append(f"https://futbin.com{result['href']}")

In [38]:
player_urls

['https://futbin.com/20/player/44079/lionel-messi',
 'https://futbin.com/20/player/44085/virgil-van-dijk',
 'https://futbin.com/20/player/44119/cristiano-ronaldo',
 'https://futbin.com/20/player/45418/diego-maradona',
 'https://futbin.com/20/player/45488/pele',
 'https://futbin.com/20/player/48190/kevin-de-bruyne',
 'https://futbin.com/20/player/48196/virgil-van-dijk',
 'https://futbin.com/20/player/48243/lionel-messi',
 'https://futbin.com/20/player/48290/robert-lewandowski',
 'https://futbin.com/20/player/48350/cristiano-ronaldo',
 'https://futbin.com/20/player/1/pele',
 'https://futbin.com/20/player/44081/kylian-mbappe',
 'https://futbin.com/20/player/44089/kevin-de-bruyne',
 'https://futbin.com/20/player/48193/sadio-mane',
 'https://futbin.com/20/player/48395/kylian-mbappe',
 'https://futbin.com/20/player/48403/neymar-jr',
 'https://futbin.com/20/player/2/diego-maradona',
 'https://futbin.com/20/player/44080/sadio-mane',
 'https://futbin.com/20/player/44082/alisson',
 'https://futb
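
The same lookup can be tried offline on a small HTML fragment, which makes it easier to see what `find_all` matches. The fragment below is made up to mimic the players table; only the class name `player_name_players_table` comes from the real page. It uses Python's built-in `html.parser` so it runs without lxml, and `urljoin` as one possible way to build the absolute URL.

```python
import bs4
from urllib.parse import urljoin

# A tiny, made-up HTML fragment that mimics the structure of the players table.
sample_html = """
<table>
  <tr><td><a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a></td></tr>
  <tr><td><a class="player_name_players_table" href="/20/player/44085/virgil-van-dijk">Virgil van Dijk</a></td></tr>
  <tr><td><a class="other_link" href="/20/clubs">Clubs</a></td></tr>
</table>
"""

sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Only anchors carrying the player-name class are kept; the "Clubs" link is ignored.
links = sample_soup.find_all("a", class_="player_name_players_table")
sample_urls = [urljoin("https://futbin.com", link["href"]) for link in links]
print(sample_urls)
```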

## Finding the "next page" button
![](img/futbin_next_inspect.png) 

In [45]:
next_button = soup.find_all('a', attrs={'aria-label': 'Next'})[0]
next_url = f"https://futbin.com{next_button['href']}"
print(next_url)

https://futbin.com/players?page=2


### What if it's the last page?

In [54]:
last_page_res = requests.get('https://www.futbin.com/20/players?page=754')
last_page_soup = bs4.BeautifulSoup(last_page_res.text, 'lxml')
last_page_soup.find_all('a', attrs={'aria-label': 'Next'})

[]
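
Since the search comes back empty on the last page, the "is there a next page?" check can be wrapped in a small helper that returns `None` when there is nothing left. This is a sketch under assumptions: `find_next_url` is a hypothetical name, and the two HTML fragments are invented stand-ins for a middle page and the last page.

```python
from typing import Optional
from urllib.parse import urljoin

import bs4


def find_next_url(html: str, base_url: str = "https://futbin.com") -> Optional[str]:
    """Return the absolute URL of the next page, or None if there is no Next link."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    next_button = soup.find("a", attrs={"aria-label": "Next"})
    if next_button is None or not next_button.get("href"):
        return None
    return urljoin(base_url, next_button["href"])


# Invented fragments: one page with a Next link, one without.
middle_page = '<ul><li><a aria-label="Next" href="/players?page=2">Next</a></li></ul>'
last_page = '<ul><li><a aria-label="Previous" href="/players?page=753">Previous</a></li></ul>'

print(find_next_url(middle_page))
print(find_next_url(last_page))
```

Using `find` instead of `find_all(...)[0]` avoids an `IndexError` when the button is missing.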

## Adding cookies and headers

## Putting it all together

In [80]:
import time
import random

def scrape_player_urls(url, player_urls, verbose=True, sleep_min=500, sleep_max=1000):
    
    cookies = {
        '__cfduid': 'da7fb59c676afaf8e9241118df829015a1591911739',
        'PHPSESSID': '9m1l5qrjvooidbuu4hl39qkppq',
        'theme_player': 'true',
        'comments': 'true',
        'platform': 'ps4',
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en,en-US;q=0.7,pt-BR;q=0.3',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0',
        'TE': 'Trailers',
    }

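    # Pause for a random interval before each request to go easy on the server.
    # sleep_min and sleep_max are in hundredths of a second, so the defaults wait 5 to 10 seconds.
    sleep_time = random.randrange(sleep_min, sleep_max)
    time.sleep(sleep_time/100)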
    
    base_url = 'https://futbin.com'

    if verbose:
        print(f'Scraping {url}...')

    res = requests.get(url, headers=headers, cookies=cookies)
    if res.status_code != 200:
        print(f'Error code {res.status_code} when requesting {url}. Waiting one minute to retry.')
        time.sleep(60)
        res = requests.get(url, headers=headers, cookies=cookies)
        if res.status_code != 200:
            print('Request failed for a second time. Aborting and returning early.')
            return player_urls

    soup = bs4.BeautifulSoup(res.text, 'lxml')
    players = soup.find_all('a', class_='player_name_players_table')
    for player in players:
        player_urls.append(base_url + player['href'])

    next_button_search = soup.find_all('a', attrs={'aria-label': 'Next'})
    if next_button_search:
        next_button = next_button_search[0]
        next_url = base_url + next_button['href'] 
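        # Pass the caller's settings along so they also apply to the following pages.
        return scrape_player_urls(next_url, player_urls, verbose, sleep_min, sleep_max)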
    else:
        return player_urls
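
The function above calls itself once per page. CPython's default recursion limit is 1000 frames, so a directory of several hundred pages gets fairly close to it. One alternative is a plain loop. The sketch below shows only the control flow: `crawl_iteratively` and the `parse_page` callback are hypothetical names, and the three-page `fake_site` dictionary stands in for the real requests and parsing.

```python
from typing import Callable, List, Optional, Tuple

# parse_page takes a URL and returns (links found on that page, next page URL or None).
PageParser = Callable[[str], Tuple[List[str], Optional[str]]]


def crawl_iteratively(start_url: str, parse_page: PageParser, max_pages: int = 1000) -> List[str]:
    """Follow next-page links in a loop until none is left or max_pages is reached."""
    collected: List[str] = []
    url: Optional[str] = start_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        links, url = parse_page(url)
        collected.extend(links)
        pages_seen += 1
    return collected


# Fake three-page site standing in for the real player directory.
fake_site = {
    "page1": (["/player/1", "/player/2"], "page2"),
    "page2": (["/player/3"], "page3"),
    "page3": (["/player/4"], None),
}

all_links = crawl_iteratively("page1", lambda url: fake_site[url])
print(all_links)
```

In a real run, `parse_page` would do the request, the sleep, and the Beautiful Soup parsing, and `max_pages` acts as a safety stop if the next-page link ever loops back.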

In [85]:
player_urls = []
first_page = 'https://www.futbin.com/players?page=579'

player_urls = scrape_player_urls(first_page, player_urls)

Scraping https://www.futbin.com/players?page=579...
Scraping https://futbin.com/players?page=580...
Scraping https://futbin.com/players?page=581...
Scraping https://futbin.com/players?page=582...
Scraping https://futbin.com/players?page=583...
Scraping https://futbin.com/players?page=584...
Scraping https://futbin.com/players?page=585...
Scraping https://futbin.com/players?page=586...
Scraping https://futbin.com/players?page=587...
Scraping https://futbin.com/players?page=588...
Scraping https://futbin.com/players?page=589...
Scraping https://futbin.com/players?page=590...
Scraping https://futbin.com/players?page=591...
Scraping https://futbin.com/players?page=592...
Scraping https://futbin.com/players?page=593...
Scraping https://futbin.com/players?page=594...
Scraping https://futbin.com/players?page=595...
Scraping https://futbin.com/players?page=596...
Scraping https://futbin.com/players?page=597...
Scraping https://futbin.com/players?page=598...
Scraping https://futbin.com/players?

Scraping https://futbin.com/players?page=750...
Scraping https://futbin.com/players?page=751...
Scraping https://futbin.com/players?page=752...
Scraping https://futbin.com/players?page=753...
Scraping https://futbin.com/players?page=754...


## Saving player URLs

In [87]:
pd.Series(player_urls).to_csv('player_urls_xxx.csv', header=False, index=False)

5262
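
Before a long second-stage run, it can be worth checking that the saved list reloads cleanly and contains no repeated URLs. The sketch below uses only the standard library and a temporary file. The `save_urls` and `load_urls` helpers and the example file name are hypothetical. It writes one URL per line, which should match the layout the pandas call above produces for plain URLs.

```python
import tempfile
from pathlib import Path
from typing import List


def save_urls(urls: List[str], path: Path) -> int:
    """Write unique URLs, one per line, keeping first-seen order. Return how many were written."""
    unique_urls = list(dict.fromkeys(urls))
    path.write_text("\n".join(unique_urls) + "\n", encoding="utf-8")
    return len(unique_urls)


def load_urls(path: Path) -> List[str]:
    """Read the URLs back, skipping empty lines."""
    return [line for line in path.read_text(encoding="utf-8").splitlines() if line]


sample = [
    "https://futbin.com/20/player/44079/lionel-messi",
    "https://futbin.com/20/player/44085/virgil-van-dijk",
    "https://futbin.com/20/player/44079/lionel-messi",  # duplicate on purpose
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "player_urls_example.csv"
    saved_count = save_urls(sample, csv_path)
    reloaded = load_urls(csv_path)

print(saved_count, reloaded == sample[:2])
```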

## Side note: reading HTML tables with pandas

In [89]:
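from io import StringIO

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string.
html_table = pd.read_html(StringIO(res.text))[0]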
html_table.head()

|  | Name | RAT | POS | VER | PS | SKI | WF | WR | PAC | SHO | PAS | DRI | DEF | PHY | Unnamed: 14 | Unnamed: 15 | BS | IGS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Vegard Storsve | 48 | GK | Normal | 200 | 1 | 2 | M \ M | 50 | 50 | 48 | 51 | 30 | 49 | 185cm \| 6'1" High & Average+ (65kg) | 6 | 278 | 538 |
| 1 | Huanhuan Shan | 48 | ST | Normal | 200 | 2 | 3 | M \ L | 58 | 43 | 35 | 49 | 30 | 48 | 185cm \| 6'1" High & Average (75kg) | 15 | 263 | 1181 |
| 2 | Lewis Collins | 48 | CM | Normal | 200 | 2 | 3 | M \ L | 70 | 40 | 46 | 52 | 38 | 38 | 178cm \| 5'10" Lean (67kg) | 5 | 284 | 1308 |
| 3 | Guobo Liu | 48 | CM | Normal | 200 | 2 | 2 | L \ L | 63 | 38 | 45 | 48 | 41 | 53 | 189cm \| 6'2" High & Average (75kg) | 10 | 288 | 1372 |
| 4 | Peng Wang | 48 | CAM | Normal | 1K | 2 | 3 | M \ M | 60 | 34 | 51 | 48 | 36 | 46 | 175cm \| 5'9" Average (70kg) | 9 | 275 | 1321 |


## Side note: making things faster with Scrapy