# AutoBet Rating Scraper

This notebook scrapes the FIFA Index (https://www.fifaindex.com/) for up-to-date player and club ratings. The index exists to keep the player and club abilities in the game FIFA '17 in line with the performance of their real-life counterparts. Although it is unclear how these values are determined, they may be useful features for the AutoBet classifier.

In [1]:
from lxml import html
from tqdm import tqdm
import pandas as pd
import requests

## Player information
#### *Hyperlink extraction*

First, we scrape the website for the hyperlinks to every player in its database. These links are collected in the list *links* and written to a CSV file as a backup.

In [86]:
# Set base webpage
base = 'https://www.fifaindex.com/players/'

# Create html tree
page = requests.get(base)
tree = html.fromstring(page.content)

# Get player hyperlinks
links = list(set([link for link in tree.xpath('//*[@id="no-more-tables"]/table/tbody/tr/td/a[@title]/@href') if link.startswith('/player/')]))

# Repeat for the remaining pages (587 pages in total, so the range must end at 588)
for i in range(2, 588):
    base = base[:34] + str(i)+'/'
    page = requests.get(base)
    tree = html.fromstring(page.content)
    
    player_links = list(set([link for link in tree.xpath('//*[@id="no-more-tables"]/table/tbody/tr/td/a[@title]/@href') if link.startswith('/player/')]))
    
    for link in player_links:
        links.append(link)
    
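# Write the links under an explicit column name so they can be read back reliably
pd.Series(links, name='link').to_csv('../data/player_hyperlinks.csv', index=False, header=True)

In [None]:
# Reload the saved links; list(DataFrame) would return the column names, not the values
links = pd.read_csv('../data/player_hyperlinks.csv')['link'].tolist()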
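
The loop above rebuilds each page URL by slicing `base` at a fixed index, and it removes duplicate links only within a single page. A minimal sketch of an alternative, assuming the `/players/<page>/` URL pattern used in the cell above; `page_urls` and `unique_in_order` are hypothetical helper names, not part of the original notebook:

```python
def page_urls(base, last_page):
    """Return the listing URLs for pages 1..last_page.

    Page 1 is the bare base URL; later pages append '<n>/'.
    """
    return [base] + [f"{base}{n}/" for n in range(2, last_page + 1)]


def unique_in_order(items):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


urls = page_urls('https://www.fifaindex.com/players/', 3)
print(urls)
print(unique_in_order(['/player/1/', '/player/2/', '/player/1/']))
```

Building the URLs up front avoids the magic number in `base[:34]`, and de-duplicating the combined list catches any link that appears on more than one page.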

#### *Feature scraping*

Now that we have all the hyperlinks, we can extract the features from each player page. These are stored in a pandas DataFrame with one row per player and one column per feature. On my connection, the download takes about 1.5 hours.

In [None]:
page = requests.get('https://www.fifaindex.com' + links[0])
tree = html.fromstring(page.content)

# Get feature names
features = []
for element in tree.find_class('pull-right'):
    try:
        features.append(element.getparent().text_content()[:-3])
    except Exception:  # skip elements whose parent text cannot be read
        continue
feature = features[16:]
feature.insert(0, 'Overall_2')
feature.insert(0, 'Overall_1')
feature.append('Name')

# Scrape player features and create DataFrame
data = []
for hyperlink in tqdm(links):    
    page = requests.get('https://www.fifaindex.com' + hyperlink)
    tree = html.fromstring(page.content)
    features = [int(element.text_content()) for element in tree.find_class('label rating')]
    features.append(tree.find_class('panel-title')[0].text_content()[:-6])
    data.append(features)

df = pd.DataFrame(data, columns=feature)
df.to_csv('../data/player_features.csv', index=False, encoding='utf8')
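
Because the scraping loop runs for a long time, a single failed request would abort it and lose everything collected so far. A minimal sketch of a retry wrapper, assuming the caller passes in the function that performs the request (for example `requests.get`); `fetch_with_retry` and its parameters are hypothetical names, not part of the original notebook:

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on any exception.

    Waits delay, 2*delay, 4*delay, ... seconds between attempts and
    re-raises the last error once all attempts are used up.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay * 2 ** attempt)


# Demonstration with a stand-in fetch function that fails twice.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('temporary failure')
    return 'ok: ' + url


result = fetch_with_retry(flaky_fetch, '/player/1/', retries=3, delay=0)
print(result, len(calls))
```

In the scraping loop, this would stand in for the direct request, for example `fetch_with_retry(requests.get, 'https://www.fifaindex.com' + hyperlink)`.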

## Club information

We repeat the same procedure for clubs.

#### *Hyperlink extraction*

In [25]:
# Set base webpage
base = 'https://www.fifaindex.com/teams/'

# Create html tree
page = requests.get(base)
tree = html.fromstring(page.content)

# Get team hyperlinks
links = list(set([link for link in tree.xpath('//*[@id="no-more-tables"]/table/tbody/tr/td/a[@title]/@href') if link.startswith('/team/')]))

# Repeat for the remaining pages (pages 2 to 22)
for i in range(2, 23):
    base = base[:32] + str(i)+'/'
    page = requests.get(base)
    tree = html.fromstring(page.content)
    
    team_links = list(set([link for link in tree.xpath('//*[@id="no-more-tables"]/table/tbody/tr/td/a[@title]/@href') if link.startswith('/team/')]))
    
    for link in team_links:
        links.append(link)
    
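# Write the links under an explicit column name, next to the other data files
pd.Series(links, name='link').to_csv('../data/team_hyperlinks.csv', index=False, header=True)

In [None]:
# Reload the saved links; list(DataFrame) would return the column names, not the values
links = pd.read_csv('../data/team_hyperlinks.csv')['link'].tolist()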

#### *Feature scraping*

In [119]:
page = requests.get('https://www.fifaindex.com' + links[0])
tree = html.fromstring(page.content)

# Get feature names
features = []
for element in tree.find_class('pull-right'):
    try:
        features.append(element.getparent().text_content()[:-3])
    except Exception:  # skip elements whose parent text cannot be read
        continue
feature = features[2:14]
feature.append('Club')

# Scrape club features and create DataFrame
data = []
for hyperlink in links:
    page = requests.get('https://www.fifaindex.com' + hyperlink)
    tree = html.fromstring(page.content)
    features = [e.text_content() for e in tree.find_class('pull-right')][2:14]    
    features.append(tree.find_class('team normal')[0].items()[1][1])
    data.append(features)
    
df = pd.DataFrame(data, columns=feature)
df.to_csv('../data/team_features.csv', index=False, encoding='utf8')

## Save to SQL

All DataFrames are stored in a SQLite database for later use.

In [9]:
import sqlite3

# The feature files were written as UTF-8, so read them back with the same encoding
tf = pd.read_csv('../data/team_features.csv', encoding='utf8')
tl = pd.read_csv('../data/team_hyperlinks.csv', encoding='utf8')
pf = pd.read_csv('../data/player_features.csv', encoding='utf8')
pl = pd.read_csv('../data/player_hyperlinks.csv', encoding='utf8')

db = sqlite3.connect('../data/features.sqlite')

# Write each DataFrame to its own table (raises if the table already exists)
tf.to_sql('team_features', con=db, index=False)
tl.to_sql('team_links', con=db, index=False)
pf.to_sql('player_features', con=db, index=False)
pl.to_sql('player_links', con=db, index=False)

db.commit()
db.close()
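
To check that the tables were written, they can be read back with a plain SQL query. A minimal sketch using an in-memory database and a made-up two-row `team_features` table; the `Attack` column and all values here are placeholders for illustration, not scraped data:

```python
import sqlite3

# In-memory database so the example leaves no file behind.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE team_features (Club TEXT, Attack INTEGER)')
db.executemany(
    'INSERT INTO team_features VALUES (?, ?)',
    [('Club A', 80), ('Club B', 75)],  # placeholder rows
)
db.commit()

# Read the rows back, highest value first.
rows = db.execute(
    'SELECT Club, Attack FROM team_features ORDER BY Attack DESC'
).fetchall()
print(rows)
db.close()
```

Against the real file, the same query pattern would apply after connecting to `'../data/features.sqlite'` instead of `':memory:'`.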

