In [54]:
import pandas as pd

In [55]:
df = pd.read_csv('./initial_ds/ratings.csv')

# Data clean up

## Removing data for irrelevant leagues

In [56]:
df['division'].unique()

array(['Segunda Division ', 'Copa del Rey ', 'Champions League ',
       'Primera Division ', 'Europa League ', 'Champions League q. ',
       'Supercopa ', 'Intertoto Cup ', 'Supercup ', 'Europa League q. ',
       'Primera Division rel. ', 'Europa Conf. League ',
       'Europa Conf. League q. '], dtype=object)

In [57]:
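# Rename only within the 'division' column so that other columns are untouched
df['division'] = df['division'].replace({
    'Segunda Division ': 'segunda',
    'Primera Division ': 'primera'
})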

df = df[df['division'].isin(['segunda', 'primera'])]
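The raw division labels carry a trailing space, so an exact-match rename silently misses any label whose spacing differs. A minimal sketch of a whitespace-tolerant alternative, using a toy Series rather than the real data; the name `LEAGUE_NAMES` is hypothetical and not part of the notebook:

```python
import pandas as pd

# Toy sample of raw labels, copied from the unique() output above.
raw = pd.Series([
    'Segunda Division ',
    'Primera Division ',
    'Copa del Rey ',
    'Primera Division rel. ',
])

# Hypothetical lookup table: stripped label -> short league name.
LEAGUE_NAMES = {
    'Segunda Division': 'segunda',
    'Primera Division': 'primera',
}

# Strip surrounding whitespace first, then map; labels that are not
# in the table become missing values and can be dropped.
clean = raw.str.strip().map(LEAGUE_NAMES)
kept = clean.dropna().tolist()
```

Because `map` yields missing values for anything outside the table, the rename and the league filter collapse into one step. Note that the relegation play-off label is excluded here, as it is in the cell above.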

## Processing the match dates

The date column must be converted to pandas' datetime64 dtype.

In [58]:
df['date'] = pd.to_datetime(df['date'])
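If some date strings might be malformed, a more defensive variant can be sketched as follows. The ISO `%Y-%m-%d` format and the toy values are assumptions for illustration; the actual format used in `ratings.csv` is not shown in this notebook.

```python
import pandas as pd

# Toy data; the date format is an assumption, not taken from ratings.csv.
sample = pd.DataFrame({'date': ['2021-08-13', '2021-08-14', 'not a date']})

# errors='coerce' turns unparseable values into NaT instead of raising,
# so they can be counted and inspected afterwards.
sample['date'] = pd.to_datetime(
    sample['date'], format='%Y-%m-%d', errors='coerce'
)

n_bad = int(sample['date'].isna().sum())
```

Checking `n_bad` after the conversion shows how many rows failed to parse before they cause trouble in later date comparisons.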

## Dropping redundant data

The two unused columns must be dropped.

In [59]:
df = df.drop(['unused_1', 'unused_2'], axis='columns')

The probability columns are derived solely from the difference between the two teams' ratings, so they are not very accurate and can be dropped.

In [60]:
df = df.drop(['prob_h', 'prob_d', 'prob_a'], axis='columns')
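To illustrate what a probability based only on a rating difference looks like, here is the standard Elo expected-score formula. This is an assumption made for illustration: the notebook does not state how `prob_h`, `prob_d` and `prob_a` were actually computed, and the function name and the `scale` parameter are hypothetical.

```python
def elo_expected_score(home_rating, away_rating, scale=400.0):
    """Standard Elo expected score for the home side.

    Depends only on the rating difference; form, injuries or home
    advantage are ignored unless they are folded into the ratings.
    """
    diff = home_rating - away_rating
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


equal_teams = elo_expected_score(1500, 1500)
stronger_home = elo_expected_score(1600, 1500)
```

Any two fixtures with the same rating gap receive the same value under such a formula, which is why estimates of this kind tend to be coarse.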

Some matches are scheduled for a later date and have not been played yet; these are removed from the dataset.

In [61]:
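# read_csv loads empty fields as NaN by default, so test for missing
# values as well as for empty strings.
df = df[
    (df['date'] <= pd.to_datetime('today')) &
    df['result'].notna() &
    (df['result'] != '')]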
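One caveat when filtering on an empty result: with default settings, `pd.read_csv` loads empty fields as NaN rather than as empty strings, and NaN compares as "not equal" to `''`. The toy CSV below (hypothetical teams and score format) sketches the difference between the two tests:

```python
import io

import pandas as pd

# Toy CSV: the second match has no result yet.
csv_text = "home,away,result\nA,B,2-1\nC,D,\n"
toy = pd.read_csv(io.StringIO(csv_text))

# The empty field was loaded as a missing value.
n_missing = int(toy['result'].isna().sum())

# Comparing against '' keeps the unplayed match ...
kept_by_string_test = toy[toy['result'] != '']

# ... whereas notna() removes it.
kept_by_notna = toy[toy['result'].notna()]
```

If the file had been read with `keep_default_na=False`, the empty field would stay an empty string and the string comparison alone would be enough.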

The result of the match is already available in other datasets and is redundant, so it must be dropped as well.

In [62]:
df = df.drop('result', axis='columns')

## Converting the ratings to integers

The rating columns must all be converted to pandas' nullable Int64 dtype.

In [64]:
rating_cols = [
    'home_pre_rating', 
    'home_rating_delta',
    'home_post_rating',
    'away_pre_rating',
    'away_rating_delta',
    'away_post_rating'
]

for col in rating_cols:
    df[col] = df[col].astype('Int64')
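The capitalised `'Int64'` is pandas' nullable integer dtype, which differs from NumPy's `'int64'` in that it can hold missing values. A small sketch on toy ratings (the values are invented for illustration):

```python
import pandas as pd

# Toy ratings with one missing value.
ratings = pd.Series([1500.0, 1487.0, None])

# The nullable dtype stores the gap as <NA>.
as_nullable = ratings.astype('Int64')

# NumPy's int64 cannot represent missing values, so the cast fails.
try:
    ratings.astype('int64')
    plain_int_ok = True
except ValueError:
    plain_int_ok = False

# Floats with a fractional part are rejected by astype('Int64');
# rounding first makes the conversion explicit.
rounded = pd.Series([1500.4, None]).round().astype('Int64')
```

If the rating columns in the real data contain fractional values, the loop above would raise, and a `.round()` before the cast is one way to handle that.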

## Making team names consistent

In [None]:
df[df['division'] == 'primera']['home'].value_counts()

Valencia CF            449
Athletic Bilbao        449
FC Barcelona           448
Real Madrid            448
Sevilla FC             430
Espanyol Barcelona     430
Villarreal CF          410
Atlético Madrid        410
Real Sociedad          391
Real Betis             372
CA Osasuna             353
Getafe CF              335
Celta Vigo             335
Málaga CF              323
Deportivo La Coruña    323
RCD Mallorca           315
Levante UD             266
Real Valladolid        258
Racing Santander       228
Real Zaragoza          228
Rayo Vallecano         220
CD Alavés              209
Granada CF             171
SD Eibar               133
UD Almería             126
Sporting Gijón         114
UD Las Palmas           95
Elche CF                88
Recreativo Huelva       76
CD Numancia             76
CD Leganés              76
Cádiz CF                69
Girona FC               50
Real Murcia             38
Albacete                38
CD Tenerife             38
Real Oviedo             38
S