In [58]:
import pandas as pd

In [59]:
df = pd.read_csv("./initial_ds/ratings.csv")

# Data cleanup

## Removing data for irrelevant leagues

In [60]:
df['division'].unique()

array(['Segunda Division ', 'Copa del Rey ', 'Champions League ',
       'Primera Division ', 'Europa League ', 'Champions League q. ',
       'Supercopa ', 'Intertoto Cup ', 'Supercup ', 'Europa League q. ',
       'Primera Division rel. ', 'Europa Conf. League ',
       'Europa Conf. League q. '], dtype=object)

In [61]:
df = df[df['division'] == 'Primera Division ']
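The labels returned by `unique()` all carry a trailing space, so the comparison only matches when that space is included. As a minimal sketch on a made-up frame (the `toy` data and the `league` name are illustrative assumptions, not part of the dataset), the labels can be stripped before comparing:

```python
import pandas as pd

# Toy frame; the labels mimic the trailing space seen in the real data.
toy = pd.DataFrame({
    "division": ["Primera Division ", "Copa del Rey ", "Primera Division "],
    "home_pre_rating": [1500, 1480, 1510],
})

# Strip surrounding whitespace, then compare against the clean label.
is_league = toy["division"].str.strip() == "Primera Division"

# .copy() returns an independent frame, so later column assignments
# act on it rather than on a view of the original.
league = toy[is_league].copy()
```

Stripping first keeps the filter working even if the padding in the source file changes.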

## Processing the match dates

The date column is read as plain strings, so it must be converted to the pandas datetime64 dtype before it can be compared against other dates.

In [62]:
df['date'] = pd.to_datetime(df['date'])
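If the file may contain malformed dates, the conversion can be made stricter. The sketch below assumes an ISO-style `YYYY-MM-DD` layout, which is an assumption about the file rather than something confirmed here, and uses made-up values:

```python
import pandas as pd

# Made-up sample; the last entry is deliberately malformed.
raw_dates = pd.Series(["2021-08-14", "2021-08-15", "not a date"])

# An explicit format avoids ambiguous day/month guessing;
# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Count the entries that failed to parse so they can be inspected.
n_bad = int(parsed.isna().sum())
```

Checking `n_bad` before continuing makes silent parsing failures visible.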

## Dropping redundant data

After filtering, the division column contains only the value 'Primera Division ', so it can be dropped.

In [63]:
df = df.drop('division', axis='columns')

The two unused columns must be dropped.

In [64]:
df = df.drop(['unused_1', 'unused_2'], axis='columns')

The probability columns are derived solely from the difference between the two teams' ratings; hence they are not very accurate and may be dropped.

In [65]:
df = df.drop(['prob_h', 'prob_d', 'prob_a'], axis='columns')

Some matches are scheduled for a later date and have not been played yet; they are removed from the dataset.

In [66]:
df = df[
    (df['date'] <= pd.to_datetime('today')) &
    (df['result'] != '')]
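One caveat worth checking: with default settings, `read_csv` loads empty fields as NaN rather than as empty strings, and `NaN != ''` evaluates to True, so a string comparison alone may not catch missing results. A small sketch with made-up rows (the `2-1` score format and the column values are hypothetical):

```python
import io
import pandas as pd

# Made-up CSV: one played match, one past match with no result,
# and one match scheduled far in the future.
csv_text = "date,result\n2020-01-05,2-1\n2020-01-12,\n2100-01-01,\n"
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Empty 'result' fields arrive as NaN, so test for them explicitly
# with notna() in addition to the date cutoff.
played = toy[
    (toy["date"] <= pd.Timestamp.today())
    & toy["result"].notna()
]
```

Here only the first row survives; the second is dropped for its missing result and the third for its future date.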

The match result is already available in other datasets, so it is redundant here and is dropped as well.

In [67]:
df = df.drop('result', axis='columns')

## Converting the ratings to integers

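The rating columns must all be converted to Int64, the pandas nullable integer dtype, so that missing ratings are kept as missing values instead of forcing the column to float.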

In [68]:
rating_cols = [
    'home_pre_rating',
    'home_rating_delta',
    'home_post_rating',
    'away_pre_rating',
    'away_rating_delta',
    'away_post_rating',
]

# Cast each rating column to the nullable integer dtype.
for col in rating_cols:
    df[col] = df[col].astype('Int64')
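To illustrate why the capitalised `Int64` is used rather than plain `int64`, here is a minimal sketch on made-up ratings (the values are invented; only the column names come from the dataset):

```python
import pandas as pd

# Made-up ratings with gaps; None becomes NaN, so both columns start as float64.
toy = pd.DataFrame({
    "home_pre_rating": [1500.0, None, 1475.0],
    "home_rating_delta": [12.0, -7.0, None],
})
rating_cols = ["home_pre_rating", "home_rating_delta"]

# The nullable Int64 dtype stores whole numbers and keeps gaps as <NA>;
# casting to plain int64 would raise on the missing values.
converted = toy[rating_cols].astype("Int64")
```

The whole list of columns can be cast in one call, which is equivalent to the loop when every column takes the same dtype.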

# Saving the cleaned-up dataset

In [69]:
df.to_csv("./processed_ds/ratings.csv", index=False)
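CSV does not store dtypes, so the datetime and Int64 conversions have to be requested again when the file is read back. A sketch of such a round trip, using an in-memory buffer and a made-up two-row frame in place of the real file:

```python
import io
import pandas as pd

# Made-up stand-in for the cleaned frame.
cleaned = pd.DataFrame({
    "date": pd.to_datetime(["2021-08-14", "2021-08-15"]),
    "home_pre_rating": pd.array([1500, None], dtype="Int64"),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Re-declare the dtypes on load; otherwise 'date' comes back as text
# and a rating column with gaps comes back as float.
restored = pd.read_csv(
    buffer,
    parse_dates=["date"],
    dtype={"home_pre_rating": "Int64"},
)
```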