# FIFA/Coca-Cola World Ranking
Parsing the FIFA World Ranking from fifa.com.

## First part. 
Get data from 2007 

Columns:
- **id** — counrty id
- **country_full** — country full name
- **country_abrv** — country abbreviation
- **rank** — current country rank
- **total_points** — current total points
- **previous_points** — total points in last rating
- **rank_change** — how rank has changed since the last publication
- **confederation** — FIFA confederations
- **rank_date** — date of rating calculation

In [2]:
import datetime
import requests as r
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

In [3]:
date_id = 'id1'  # first date 31 12 1992
fifa_url = 'https://www.fifa.com/fifa-world-ranking/ranking-table/men/rank'

### Collection of dates when ratings were created

In [4]:
page_source = r.get(f'{fifa_url}/{date_id}/')
page_source.encoding = 'utf8'
soup = BeautifulSoup(page_source.content, 'html.parser')

In [5]:
date_ids_soup = soup.find('ul', {'class': 'fi-ranking-schedule__nav'}).find_all('li')
date_ids_df = pd.DataFrame(columns=['date', 'date_id'])
for date_data in date_ids_soup:
    date = pd.to_datetime(
        date_data.text.strip(), 
        format='%d %B %Y'
    )
    
    date_ids_df = date_ids_df.append(
        {
            'date': date, 
            'date_id': date_data['data-value']
        },ignore_index=True
    )
    
print(f'First date: {date_ids_df.date.min()}\n'
      f'Last date: {date_ids_df.date.max()}')

First date: 1992-12-31 00:00:00
Last date: 2019-10-24 00:00:00


### Parsing data on the ratings of teams since 2007.
And saving them to `.csv` file

In [6]:
fifa_ranking = pd.DataFrame(columns=[
    'id', 'rank', 'country_full', 'country_abrv', 
    'total_points', 'previous_points', 'rank_change', 
    'confederation', 'rank_date'
])

start_time = datetime.datetime.now()
print("Start parsing.. ", datetime.datetime.now()-start_time)

for i, (date, date_id) in enumerate(date_ids_df.values, start=1):
    try:
        page_source = r.get(f'{fifa_url}/{date_id}/')
    except Exception as e:
        print(f'Parsing error. Last "date_id" - {date_id}\n', e)
        break
        
    page_source.encoding = 'utf8'
    soup = BeautifulSoup(page_source.content, 'html.parser')
    teams_data = soup.find('tbody').find_all('tr')
    
    for team_data in teams_data:
        fifa_ranking = fifa_ranking.append({
            'id': int(team_data['data-team-id']), 
            'country_full': team_data.find('span', {'class': 'fi-t__nText'}).text, 
            'country_abrv': team_data.find('span', {'class': 'fi-t__nTri'}).text,
            'rank': int(team_data.find('td', {'class': 'fi-table__rank'}).text), 
            'total_points': int(team_data.find('td', {'class': 'fi-table__points'}).text),
            'previous_points': int(soup.find('td', {'class': 'fi-table__prevpoints'}).text.replace('', '0')),
            'rank_change': int(soup.find('td', {'class': 'fi-table__rankingmovement'}).text.replace('-', '0')),
            'confederation': team_data.find('td', {'class': 'fi-table__confederation'}).text.strip('#'),
            'rank_date': date
        }, ignore_index=True)
        
    if i % 25 == 0:
        print(f'Complite {i}/{date_ids_df.shape[0]} dates')
    
else:
    print(f'Parsing complite. Time {datetime.datetime.now()-start_time}')
    fifa_ranking.to_csv(
        f'fifa_ranking-{str(date_ids_df.date.max())[:10]}.csv',  #  cut last date to format "XXXX-XX-XX"
        index=False, 
        encoding='utf-8'
    )
    print('Dataframe saved in currently folder')
    
fifa_ranking.head()  

Start parsing..  0:00:00


Complite 25/299 dates


Complite 50/299 dates


Complite 75/299 dates


Complite 100/299 dates


Complite 125/299 dates


Complite 150/299 dates


Complite 175/299 dates


Complite 200/299 dates


Complite 225/299 dates


Complite 250/299 dates


Complite 275/299 dates


Parsing complite. Time 0:58:57.890855


Dataframe saved in currently folder


Unnamed: 0,id,rank,country_full,country_abrv,total_points,previous_points,rank_change,confederation,rank_date
0,43935,1,Belgium,BEL,1755,10705020,0,UEFA,2019-10-24
1,43946,2,France,FRA,1726,10705020,0,UEFA,2019-10-24
2,43924,3,Brazil,BRA,1715,10705020,0,CONMEBOL,2019-10-24
3,43942,4,England,ENG,1651,10705020,0,UEFA,2019-10-24
4,43930,5,Uruguay,URU,1642,10705020,0,CONMEBOL,2019-10-24


## Second path.
Grouping by country and generating some features:

- **delta_points** — delta between `total_points` and `previous_points`
- **points_mean_(alltime/4year/1year)** — average points for: all time, last 4 year, last year
- **delta_points_mean_(alltime/4year/1year)** — average delta points for: all time, last 4 year, last year
- **delta_points_sum_(alltime/4year/1year)** — amount of points delta for: all time, last 4 year, last year
- **rank_change_mean_(alltime/4year/1year)** — average rank change for: all time, last 4 year, last year
- **rank_change_sum_(alltime/4year/1year)** — sum of rank changes for: all time, last 4 year, last year
- **rank_mean_(alltime/4year/1year)** — average rank for: all time, last 4 year, last year
- **best_rank_last_4y** — best rank in the last 4 years
- **worst_rank_last_4y** — worst rank in the last 4 years
- **delta_ranks_last_4y** — delta between best rank and worst rank in the last 4 years

### Group and standardize data

In [29]:
fifa_ranking.replace({'country_abrv': 'LIB'}, 'LBN', inplace=True)  # Lebanon has two abbreviations
fifa_ranking_plus = fifa_ranking[fifa_ranking['rank_date'] == fifa_ranking.rank_date.max()][[
    'id', 'rank', 'country_full', 'country_abrv',  
    'total_points', 'previous_points', 'rank_change', 
    'rank_date'
]]
fifa_ranking_plus.rename(columns={'rank_date': 'last_update'}, inplace=True)

### Create meta features

In [30]:
# Create features
def get_meta_features(full_dataframe, first_date, feature_prefix: str):
    meta_features = full_dataframe[full_dataframe['rank_date'] >= first_date].groupby('id').agg({
        'total_points': 'mean', 
        'delta_points': ['mean', 'sum'], 
        'rank_change': ['mean', 'sum'],
        'rank': 'mean'
    }).reset_index()
    
    # conversion multiindex columns to index and rename
    meta_features.columns = meta_features.columns.tolist()
    meta_features.rename(columns={
        ('id', ''): 'id',
        ('total_points', 'mean'): f'points_mean_{feature_prefix}',
        ('delta_points', 'mean'): f'delta_points_mean_{feature_prefix}',
        ('delta_points', 'sum'): f'delta_points_sum_{feature_prefix}',
        ('rank_change', 'mean'): f'rank_change_mean_{feature_prefix}',
        ('rank_change', 'sum'): f'rank_change_sum_{feature_prefix}',
        ('rank', 'mean'): f'rank_mean_{feature_prefix}',
    }, inplace=True)
    
    # data simplification
    meta_features[f'points_mean_{feature_prefix}'] = meta_features[f'points_mean_{feature_prefix}'].astype('int')
    meta_features[f'delta_points_sum_{feature_prefix}'] = meta_features[f'delta_points_sum_{feature_prefix}'].astype('int')
    meta_features[f'rank_mean_{feature_prefix}'] = meta_features[f'rank_mean_{feature_prefix}'].astype('int')
    meta_features[f'delta_points_mean_{feature_prefix}'] = round(meta_features[f'delta_points_mean_{feature_prefix}'], 2)
    meta_features[f'rank_change_mean_{feature_prefix}'] = round(meta_features[f'rank_change_mean_{feature_prefix}'], 2)

    return meta_features

In [31]:
threshold_alltime = fifa_ranking.rank_date.min()  # first date in dataframe
threshold_4years = datetime.datetime(
    fifa_ranking.rank_date.max().year-4,  # last year - 4
    fifa_ranking.rank_date.max().month,
    1
)
threshold_1year = datetime.datetime(
    fifa_ranking.rank_date.max().year-1,  # last year - 1
    fifa_ranking.rank_date.max().month,
    1
)

thresholds = {
    'alltime': threshold_alltime,
    '4years': threshold_4years,
    '1year': threshold_1year
}

# add delta between total_points and previous_points
fifa_ranking['delta_points'] = (fifa_ranking.total_points - fifa_ranking.previous_points).astype('int')

for prefix, date in thresholds.items():
    fifa_ranking_plus = fifa_ranking_plus.merge(
        get_meta_features(fifa_ranking, date, prefix), 
        on='id'
    )
fifa_ranking_plus.info();

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 0 to 209
Data columns (total 26 columns):
id                           210 non-null int64
rank                         210 non-null int64
country_full                 210 non-null object
country_abrv                 210 non-null object
total_points                 210 non-null int64
previous_points              210 non-null int64
rank_change                  210 non-null int64
last_update                  210 non-null datetime64[ns]
points_mean_alltime          210 non-null int32
delta_points_mean_alltime    210 non-null float64
delta_points_sum_alltime     210 non-null int32
rank_change_mean_alltime     210 non-null float64
rank_change_sum_alltime      210 non-null int64
rank_mean_alltime            210 non-null int32
points_mean_4years           210 non-null int32
delta_points_mean_4years     210 non-null float64
delta_points_sum_4years      210 non-null int32
rank_change_mean_4years      210 non-null float64
rank_change_

In [32]:
def create_rank_changing_features(full_dataframe, first_date, feature_prefix: str):
    rank_features = full_dataframe[
        full_dataframe['rank_date'] >= first_date
    ].groupby('id').agg({
        'rank': ['min', 'max', lambda x: max(x)- min(x)]
    }).reset_index()
    
    rank_features.columns = rank_features.columns.tolist()
    rank_features.rename(columns={
        ('id', ''): 'id',
        ('rank', 'min'): f'best_rank_{feature_prefix}',
        ('rank', 'max'): f'worst_rank_{feature_prefix}',
        ('rank', '<lambda>'): f'delta_ranks_{feature_prefix}'
    }, inplace=True)
    
    return rank_features

In [33]:
fifa_ranking_plus = fifa_ranking_plus.merge(
        create_rank_changing_features(fifa_ranking, threshold_4years, '4year'), 
        on='id'
    )

fifa_ranking_plus.head().T

Unnamed: 0,0,1,2,3,4
id,43935,43946,43924,43942,43930
rank,1,2,3,4,5
country_full,Belgium,France,Brazil,England,Uruguay
country_abrv,BEL,FRA,BRA,ENG,URU
total_points,1755,1726,1715,1651,1642
previous_points,10705020,10705020,10705020,10705020,10705020
rank_change,0,0,0,0,0
last_update,2019-10-24 00:00:00,2019-10-24 00:00:00,2019-10-24 00:00:00,2019-10-24 00:00:00,2019-10-24 00:00:00
points_mean_alltime,697,833,969,825,755
delta_points_mean_alltime,-5.71082e+06,-5.71068e+06,-5.71055e+06,-5.71069e+06,-5.71076e+06


### Saving new dataframe to .csv file

In [34]:
fifa_ranking_plus.to_csv(
    f'fifa_ranking_plus-{str(fifa_ranking_plus.last_update.max())[:10]}.csv',  #  cut last date to format "XXXX-XX-XX"
    index=False, 
    encoding='utf-8'
)
'Dataframe saved in currently folder'

'Dataframe saved in currently folder'