# Data Prep

Loads the output of a [transfermarkt-scraper](https://github.com/dcaribou/transfermarkt-scraper) run and applies a series of transformations to produce a validated file that is friendlier for performing analysis. Some of these transformations are:

* Creating handy ID columns
* Renaming fields to comply with the naming convention
* Parsing raw values into their own columns

## Load
Input to the data prep process is expected to be the output of the [transfermarkt-scraper](https://github.com/dcaribou/transfermarkt-scraper), i.e., a JSON Lines file with one line per player.

In [1]:
import pandas as pd

raw_file = '../data/tfmkt__2019-08-03__GB1_ES1.json'

raw = pd.read_json(
  raw_file,
  lines=True,
  convert_dates=True,
  orient='records'  # with lines=True, each line is one JSON record
)
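
To make the expected input shape concrete, here is a minimal sketch that reads a two-player JSON Lines sample from memory. The field names (`name`, `stats`, `date`, `goals`) and values are assumptions for illustration only; the real scraper output has its own schema.

```python
import io
import pandas as pd

# Hypothetical sample: one JSON document per line, one line per player.
# The field names here are assumptions, not the scraper's actual schema.
sample = io.StringIO(
    '{"name": "adam-masina", "stats": [{"date": "2018-08-11", "goals": "0"}]}\n'
    '{"name": "diogo-jota", "stats": [{"date": "2018-08-11", "goals": "1"}, '
    '{"date": "2018-08-18", "goals": "0"}]}\n'
)

# lines=True tells pandas to parse each line as a separate record
sample_raw = pd.read_json(sample, lines=True)
print(sample_raw.shape)
```

Each player ends up as one row, with the nested list of appearances kept in a single column.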

## Prep
The prep phase applies a series of transformations to the raw data frame loaded above.

In [2]:
from prep_lib import (
  flatten,
  renames,
  improve_columns,
  add_new_columns,
  filter_appearances,
  validate
)

### Flatten
First, we need to explode the data frame to have one row per player appearance, rather than one row per player.

In [3]:
raw_flat = flatten(raw, ['stats'])
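
The internals of `flatten` are not shown here. As a rough idea of what such a step could look like, the sketch below uses `DataFrame.explode` and `pd.json_normalize`. The function `flatten_sketch` and the sample columns are hypothetical, and the sketch assumes every list element is a dict and no list is empty.

```python
import pandas as pd

def flatten_sketch(df, list_columns):
    """Hypothetical stand-in for prep_lib.flatten: one row per list element."""
    out = df
    for col in list_columns:
        # explode() repeats the other columns once per element of the list
        out = out.explode(col).reset_index(drop=True)
        # each element is assumed to be a dict; expand its keys into columns
        expanded = pd.json_normalize(out[col].tolist())
        out = pd.concat([out.drop(columns=col), expanded], axis=1)
    return out

# Hypothetical sample with one row per player
players = pd.DataFrame({
    "name": ["adam-masina", "diogo-jota"],
    "stats": [
        [{"date": "2018-08-11", "goals": "0"}],
        [{"date": "2018-08-11", "goals": "1"}, {"date": "2018-08-18", "goals": "0"}],
    ],
})

appearances = flatten_sketch(players, ["stats"])
print(appearances)
```

Two players with one and two appearances respectively become three appearance rows.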

### Rename
Rename the input columns to make them consistent with the naming convention.

In [4]:
with_renamed_columns = renames(raw_flat)
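
A renaming step like this usually boils down to a lookup table passed to `DataFrame.rename`. The mapping below is purely hypothetical, since the actual raw and target column names live in `prep_lib`.

```python
import pandas as pd

# Hypothetical mapping from raw column names to convention-compliant names
COLUMN_RENAMES = {
    "name": "player_name",
    "yellow": "yellow_cards",
    "red": "red_cards",
}

def renames_sketch(df):
    """Hypothetical stand-in for prep_lib.renames."""
    # columns missing from the mapping are left untouched
    return df.rename(columns=COLUMN_RENAMES)

renamed = renames_sketch(
    pd.DataFrame({"name": ["adam-masina"], "yellow": [1], "goals": [0]})
)
print(list(renamed.columns))
```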

### Update
- [x] Convert `goals`, `assists`, `own_goals` and `date` to the appropriate types
- [x] yellow_cards / red_cards (no need for second_yellows)
- [ ] club name formatting: fc-watford -> FC Watford
- [ ] player name formatting: adam-masina -> Adam Masina
- [ ] Position: use longer names instead of the cryptic 'LB', etc. (use 'filter by position' [here](https://www.transfermarkt.co.uk/diogo-jota/leistungsdatendetails/spieler/340950/saison/2020/verein/0/liga/0/wettbewerb/GB1/pos/0/trainer_id/0/plus/1) to get the mappings)

In [5]:
with_improved_columns = improve_columns(with_renamed_columns)
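
The type conversions and the player name formatting from the list above could be sketched as follows. This is not the `improve_columns` implementation: `improve_columns_sketch` and the `player_name` column are hypothetical, and treating missing or unparseable counts as 0 is an assumption of the sketch.

```python
import pandas as pd

def improve_columns_sketch(df):
    """Hypothetical stand-in for part of prep_lib.improve_columns."""
    out = df.copy()
    for col in ["goals", "assists", "own_goals"]:
        # assumption: missing or unparseable counts are treated as 0
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0).astype(int)
    out["date"] = pd.to_datetime(out["date"])
    # adam-masina -> Adam Masina
    out["player_name"] = (
        out["player_name"].str.replace("-", " ", regex=False).str.title()
    )
    return out

sample = pd.DataFrame({
    "player_name": ["adam-masina"],
    "date": ["2018-08-11"],
    "goals": ["1"],
    "assists": [""],
    "own_goals": [None],
})
improved = improve_columns_sketch(sample)
print(improved.dtypes)
```

Club names need more care than `str.title()` offers, because abbreviations such as "FC" should stay fully upper case.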

### Create
- [x] Add surrogate keys `game_id`, `player_id`, `appearance_id`, `home_club_id`, `away_club_id`
- [x] Split `result` into `home_club_goals` and `away_club_goals`
- [x] Approximate appearance `season`

In [6]:

with_new_columns = add_new_columns(with_improved_columns)
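
Two of the derived columns above lend themselves to a short illustration. The sketch below assumes that `result` is formatted as `home:away` (for example `2:1`) and that a season starts in July, so a match played in March 2019 belongs to season 2018. Both assumptions, and the function name `add_new_columns_sketch`, are illustrative rather than taken from `prep_lib`.

```python
import pandas as pd

def add_new_columns_sketch(df):
    """Hypothetical stand-in for part of prep_lib.add_new_columns."""
    out = df.copy()
    # assumption: result looks like "2:1" (home goals first)
    goals = out["result"].str.split(":", expand=True).astype(int)
    out["home_club_goals"] = goals[0]
    out["away_club_goals"] = goals[1]
    # assumption: matches from July onwards belong to the season starting that year
    year = out["date"].dt.year
    out["season"] = year.where(out["date"].dt.month >= 7, year - 1)
    return out

sample = pd.DataFrame({
    "result": ["2:1", "0:0"],
    "date": pd.to_datetime(["2018-08-11", "2019-03-02"]),
})
with_season = add_new_columns_sketch(sample)
print(with_season)
```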

### Filter
* Only season 2018 is complete in the current file, so we remove the rest
  - [ ] Rather than hardcoding the filter, the whole script should be parameterized for a specific season
* To reduce the scope of this version of the data prep script, select only appearances from domestic competitions


In [7]:
with_filtered_appearances = filter_appearances(with_new_columns)
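
One possible way to address the parameterization item above is to pass the season and the competition codes as arguments. The sketch below is an illustration under assumptions: the function name, the `competition` column and the use of `GB1` / `ES1` as domestic competition codes (taken from the input file name) are not confirmed by `prep_lib`.

```python
import pandas as pd

def filter_appearances_sketch(df, season=2018, domestic_competitions=("GB1", "ES1")):
    """Hypothetical, parameterized variant of prep_lib.filter_appearances."""
    in_season = df["season"] == season
    is_domestic = df["competition"].isin(domestic_competitions)
    return df[in_season & is_domestic]

sample = pd.DataFrame({
    "season": [2018, 2018, 2019],
    "competition": ["GB1", "CL", "ES1"],  # "CL" is a made-up non-domestic code
    "player_id": [1, 1, 2],
})
filtered = filter_appearances_sketch(sample)
print(filtered)
```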

## Validate
Validate that the output dataframe contains consistent data. Two types of checks are performed.

### Value checks
- [x] Fields `red_cards`, `yellow_cards`, `own_goals`, `assists`, `goals` and `minutes_played` contain values within an expected range
- [x] Rows are unique on `player_id` + `date`
- [ ] `position` field is one of the long-form player positions from Transfermarkt

### Completeness checks
- [x] Number of teams per domestic competition must be exactly 20
- [ ] Each club must play 38 games per season in the domestic competition
- [ ] In each match, both clubs should have at least 11 appearances


In [8]:
validate(with_filtered_appearances)

AssertionError: 
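
A bare `AssertionError` does not say which check failed. As a sketch of an alternative, the checks could collect descriptive messages and report them together. The function below is hypothetical, covers only a few of the checks listed above, and uses made-up value ranges and a made-up `club_id` column.

```python
import pandas as pd

def validate_sketch(df, expected_clubs=20):
    """Hypothetical checker that returns a list of failure messages."""
    failures = []
    # value checks: the ranges are assumptions of this sketch
    ranges = {"goals": (0, 10), "minutes_played": (0, 120)}
    for col, (low, high) in ranges.items():
        if not df[col].between(low, high).all():
            failures.append(f"{col} outside [{low}, {high}]")
    # uniqueness check
    if df.duplicated(subset=["player_id", "date"]).any():
        failures.append("rows are not unique on player_id + date")
    # completeness check: number of clubs per competition
    clubs = df.groupby("competition")["club_id"].nunique()
    for competition, count in clubs.items():
        if count != expected_clubs:
            failures.append(
                f"{competition} has {count} clubs, expected {expected_clubs}"
            )
    return failures

sample = pd.DataFrame({
    "player_id": [1, 1, 2],
    "date": pd.to_datetime(["2018-08-11", "2018-08-11", "2018-08-11"]),
    "competition": ["GB1", "GB1", "GB1"],
    "club_id": [10, 10, 11],
    "goals": [0, 0, 12],
    "minutes_played": [90, 90, 45],
})
problems = validate_sketch(sample, expected_clubs=2)
print(problems)
```

Returning all failures at once makes it easier to see whether a run is broken in one place or several.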

## Save

In [None]:
with_filtered_appearances.to_csv(
  '../data/tfmkt__2019-08-03__GB1_ES1__prep.csv',
  index=False
)