### Very Humble Beginnings

In this third notebook, I plan to get into a recurrent neural network model: the LSTM. I want to try predicting a player's coordinates over the next n seconds by looking at the last n seconds.

However, I want to do some preprocessing and feature engineering before I start building the model.
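To make the "last n seconds in, next n seconds out" idea concrete, here is a minimal sketch of how a per-second track could be sliced into windows. The function name `make_windows`, the parameters `n_in` and `n_out`, and the toy track are my own placeholders for illustration, not something from the dataset or the final model.

```python
import numpy as np

def make_windows(coords, n_in, n_out):
    """Slice a (T, 2) array of per-second (x, y) positions into
    (past, future) pairs: n_in seconds of input, n_out seconds of target."""
    coords = np.asarray(coords, dtype=float)
    X, y = [], []
    # each window needs n_in rows of history followed by n_out rows of target
    for start in range(len(coords) - n_in - n_out + 1):
        X.append(coords[start:start + n_in])
        y.append(coords[start + n_in:start + n_in + n_out])
    return np.array(X), np.array(y)

# toy track: 10 seconds of made-up positions
track = np.column_stack([np.arange(10), np.arange(10) * 2])
X, y = make_windows(track, n_in=3, n_out=3)
print(X.shape, y.shape)  # (5, 3, 2) (5, 3, 2)
```

The windows only make sense if the rows are equally spaced in time, which is exactly why the preprocessing comes first.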

In [16]:
# usual suspects!
import pandas as pd
import numpy as np
from numpy import array

# file format
import json

# dictionary
from collections import defaultdict

# preprocessing
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

# models
from keras.models import Sequential
from keras.models import Model

# layers (imported from keras.layers directly; the merge/core/embeddings
# submodule paths are not available in newer Keras releases)
from keras.layers import Input
from keras.layers import Concatenate
from keras.layers import Bidirectional
from keras.layers import Activation, Dropout, Dense
from keras.layers import Flatten, LSTM
from keras.layers import GlobalMaxPooling1D
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed

# in case we need these
from sklearn.model_selection import train_test_split
import re

# visualizations
import matplotlib.pyplot as plt

In [17]:
# pandas setup
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 4000)
pd.set_option('display.precision', 3)

Let's bring in the full events dataset.

In [18]:
# read in dataset
events = pd.read_csv("../../csv_files/AI_in_Soccer/02-eventsAll.csv", index_col=0)

I like to work with pipelines. In my opinion, they are easy to trace and they make the project shorter by saving a lot of processing time.

A pipeline can also give the project a sense of completeness. It feels something like building a Lego structure with a proper staircase to the second floor: nobody has to limit their imagination by wondering how the little Lego man gets to the second floor of the building we just put together (I love [Lego](https://www.lego.com/en-us)!).

Again, just four months into serious Python here... These may not have the ultimate efficiency, but they get the job done!

Now, let's build the pipeline piece by piece.

### Pipeline
I need some selections. 

First, I select a match. I am not going with `gameweek` here, because a gameweek may contain multiple matches of the same team if an earlier match was postponed. Also, this selection applies at the development level only. At the user level, the match data would already be loaded.

Second, I select a player. For this to work at the production level, I must construct a dictionary with the players' names as keys and the player ids as values, because nobody will remember a player by his player id in this dataset. The players' names (possibly a drop-down in the final web application) can be listed based on the teams in the match.

Third, I sort and filter. Why sort? Because the sequence of events is important for this prediction model. An LSTM works on sequential data, so I pay special attention to picking the train and test portions in sequential order.

Why filter? Because I need the time and the coordinate features for the LSTM. The `event` column is for the final pipeline, the great combo of models: "The League of LSTM & XGBoost", PG-13, coming soon... And `matchPeriodNum` is for the train/test split.

Also, the momentum of every match is different from the previous and next matches. It is due to differences such as:
 - Importance of the match
 - Toughness of the opponent
 - Stamina level of the players
 - Home vs away match factor
 - The weather effect 

To honor these differences, I stick with one match. How? Well, I train and validate on the first period and test on the second period; hence the fourth step: split.
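Since the order of the rows matters, the train/validation split inside the first period should not shuffle anything. A minimal sketch of a chronological split, assuming the rows are already sorted by time; `chrono_split` and `val_fraction` are hypothetical names for illustration:

```python
import numpy as np

def chrono_split(frames, val_fraction=0.2):
    """Split a time-ordered sequence into train/validation without shuffling.
    The last `val_fraction` of the rows becomes the validation part."""
    n_val = int(round(len(frames) * val_fraction))
    cut = len(frames) - n_val
    return frames[:cut], frames[cut:]

first_period = np.arange(100)  # stand-in for 100 one-second rows of period 1
train, val = chrono_split(first_period, val_fraction=0.2)
print(len(train), len(val))  # 80 20
```

Every validation row comes after every training row, so the model never peeks at the future of the half it is trained on.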


Let's define the name: id dictionary and then the process.

In [54]:
# load the players data
with open('../../Data/AI_in_Soccer/players.json') as json_data:
    players = json.load(json_data)

# scan "players" and develop the name:id dictionary
player_names = {}
for player in players:
    player_names[player['shortName']] = int(player['wyId'])
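One thing to watch with a name:id dictionary: if two players share the same `shortName`, the later one silently overwrites the earlier one. A small sketch of how such collisions could be spotted; the records below are made up for illustration and only reuse the `shortName` and `wyId` keys from the file.

```python
from collections import defaultdict

# hypothetical stand-in for the parsed players.json records
sample_players = [
    {"shortName": "A. Player", "wyId": 101},
    {"shortName": "B. Player", "wyId": 202},
    {"shortName": "B. Player", "wyId": 303},
]

# collect every id seen for each short name
ids_by_name = defaultdict(list)
for p in sample_players:
    ids_by_name[p["shortName"]].append(int(p["wyId"]))

# names that map to more than one id would be ambiguous in a drop-down
ambiguous = {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
print(ambiguous)  # {'B. Player': [202, 303]}
```

If the real file has such names, restricting the lookup to the two teams in the selected match would probably be enough to resolve them.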

In [86]:
# now the initializer
def dataset_init(df, matchid, player_name, period):
    
    # 1 - Pick a match
    match_df = df[df['matchId'] == matchid]
    
    # 2 - Pick a player
    player_id = player_names[player_name]
    player_df = match_df[match_df['playerId'] == player_id]
    
    # 3 - Sort the dataframe (without modifying a slice in place) and filter
    player_df = (player_df
                 .sort_values(by=['gameweek', 'matchId', 'matchPeriod', 'eventSec'])
                 .reset_index(drop=True))
    df = player_df[['matchPeriodNum', 'event', 'eventSec', 'x_start', 'y_start']]
    
    # 4 - Split the dataframe; return a copy so later edits do not hit a slice
    train_test_df = df[df.matchPeriodNum == period].copy()
    
    return train_test_df

Now, the deep-prep:

In [137]:
# A - ultimate dataset prep tool.
def deep_prep(df):
    
    # work on a copy so the caller's dataframe (possibly a slice) is not modified
    df = df.copy()
    
    # 1) round the seconds
    df['eventSecRound'] = df['eventSec'].round(0)
    
    # 2) drop duplicate seconds (low possibility, but a life saver in later stages)
    df = df.drop_duplicates(subset='eventSecRound', keep="first")

    # 3) calculate and set the deltas
    df['sec_delta'] = df.eventSecRound.diff().round(0).shift(-1)
    df['x_delta'] = df.x_start.diff().shift(-1)
    df['y_delta'] = df.y_start.diff().shift(-1)

    # 4) calculate the unit delta
    df['x_unit'] = df.x_delta / df.sec_delta
    df['y_unit'] = df.y_delta / df.sec_delta
    
    # 5) add a temporary label 
    df['temp_label'] = 1
    
    # 6) sanity check #1
    print(df.shape[0])
    
    # 7) add event seconds to get the dataset to 1 second intervals
    seconds = pd.DataFrame({'eventSecRound': np.arange(df.iloc[0]['eventSecRound'],
                                                       df.iloc[-1]['eventSecRound'] + 1, 1)})
    df = (df.merge(seconds, how='right', on='eventSecRound')
            .sort_values('eventSecRound')
            .reset_index(drop=True))
    
    # 8) fill the temp label NaNs with 0 for later use in the process
    df['temp_label'] = df['temp_label'].fillna(0)

    # 9) sanity check #2
    print(df.temp_label.sum())
    
    # 10) fill NaN's in both fields
    value_copy(df, 'x_unit')
    value_copy(df, 'y_unit')
    
    # 11) initiate the new ball coordinate fields
    df['x_ball'] = 0.0
    df['y_ball'] = 0.0
    
    # 12) add the observations between gaps by differences equally seperated by unit factor
    temp_coord(df, 'x_ball', 'x_start', 'x_unit')
    temp_coord(df, 'y_ball', 'y_start', 'y_unit')
    print(df)
    # 13) create new fields for final coordinates
    df['x_ball_plus'] = df['x_ball'] + df['x_unit']
    df['x_ball_minus'] = df['x_ball'] - df['x_unit']
    df['y_ball_plus'] = df['y_ball'] + df['y_unit']
    df['y_ball_minus'] = df['y_ball'] - df['y_unit']
    
    # 14a) generate x coordinates: keep recorded events, sample the filler rows
    df['x_player'] = 0.0
    df.loc[0, 'x_player'] = df.loc[0, 'x_start']
    for i in range(1, len(df['temp_label'])):
        if df.loc[i, 'temp_label'] == 0:
            df.loc[i, 'x_player'] = np.random.uniform(df.loc[i, 'x_ball_plus'], df.loc[i, 'x_ball_minus'])
        else:
            df.loc[i, 'x_player'] = df.loc[i, 'x_start']

    # 14b) generate y coordinates: same logic as 14a
    df['y_player'] = 0.0
    df.loc[0, 'y_player'] = df.loc[0, 'y_start']
    for i in range(1, len(df['temp_label'])):
        if df.loc[i, 'temp_label'] == 0:
            df.loc[i, 'y_player'] = np.random.uniform(df.loc[i, 'y_ball_plus'], df.loc[i, 'y_ball_minus'])
        else:
            df.loc[i, 'y_player'] = df.loc[i, 'y_start']
            
    # and the big boy!        
    return df[['event', 'x_player', 'y_player']]


# B - fill NaN's with the appropriate value
def value_copy(df, column):
    # mark missing values with a sentinel, then copy the previous row's value forward
    df[column] = np.where(~df[column].isnull(), df[column], 100000)
    for i in range(1, len(df[column])):
        if df.loc[i, column] == 100000:
            df.loc[i, column] = df.loc[i-1, column]
        
# C - add temp dummy coordinates
def temp_coord(df, temp, static, unit):
    df[temp] = 0.0
    df.loc[0, temp] = df.loc[0, static]
    for i in range(1, len(df.temp_label)):
        if df.temp_label[i] == 1:
            # real event: take the recorded coordinate
            df.loc[i, temp] = df.loc[i, static]
        else:
            # filler row: step one unit from the previous row
            df.loc[i, temp] = df.loc[i-1, temp] + df.loc[i, unit]

The ultimate goal is to create regular timesteps at one-second intervals: a sequential repopulation of observations. The original dataset does not have regular intervals. It has events, and the events are spaced apart from each other in irregular timesteps. I need equally spaced sequences. Since these sequences will be used to place the player on the pitch, I need randomness in the coordinate data. Let's go over the steps in the process item by item:

    1) The first step is rounding the seconds, since I am setting the interval to one second in this version.
    2) Rarely, the rounded seconds end up as duplicates. When that happens, the process crashes. To avoid that, I drop the duplicates and keep the first occurrence.
    3) Then I calculate the differences between the rounded seconds and between the respective coordinates. I shift them up by one, so I can use them to create that many rows for the equally spaced time sequence.
    4) Get the unit change in coordinates given the number of seconds between 2 events.
    5) Create a temporary label to use later ("temp_label"). The timing of this step is important, because I will zero the ones related to dummy coordinates for selective purposes. It will be clear later.
    6) A check of the number of rows prior to the sequential repopulation of rows.
    7) This is where I repopulate the dataframe at one-second intervals.
    8) Once the repopulation is completed, the temp_label values in the new rows will be NaN's (or zeros). If they are not zeros, I make sure they are.
    9) Check the previously populated temp_label sum (where temp_label == 1) and compare it to the initial dataset's row count. They need to be equal to each other. Again, this is for a selective purpose which I will explain in later stages.
    10) Next, I make sure that the x and y unit changes are identical along the observations between events. As I added sequential one-second intervals between the rounded event seconds, I need to perform a similar activity for the coordinates. Since we are looking at a player's events data with given coordinates at each event, we can safely assume that the player moves between those coordinates during the time between the same two events. Based on that assumption, the process knows how many units it needs to add to the previous coordinate to create the next one in the sequence. I know I sound like a time travel problem, but stay with me!
    11) Create the ball coordinate columns and ...
    12) Item 10 leads to the calculation of the ball coordinates! They are linear. The process looks at the temp_label and checks if it is "0". If it is "0", the row is one of the filler rows, and the process uses the unit changes to assign new coordinates to it. If it is "1", it leaves the row alone. What do I mean by that? It creates linear ball movement. The ball will travel in a straight line (with the exception of "bend it like Beckham" cases) between 2 events. There it is: item 10 (unit changes) is added to the "x_start" and "y_start" in each interval between events to create the movement of the ball in sequential order. Now I have a fake ball that is moving around to direct the player around the field! However, it is just a fake ball... with a purpose (fake ball paradox).
    13) A soccer player is not likely to move on the pitch in straight lines all the time. He may do that in the case of a sprint (without the ball), but that is about it. There will be some sort of randomness involved in his activities. That randomness is partially directed by the ball (the fake ball paradox!). In other words, the ball we created is only a representation of the mass/momentum of the real ball, and I use it to randomize the player's movements. How? Uniform distribution! The uniform distribution needs borders. In this item, I create those borders. Imagine a line of boxes sequentially ordered along the line between 2 events (fake ball!). The number of boxes is equal to the number of rows created in item 12. And finally...
    14) I pick random points from each uniform distribution and create the player movement by connecting those coordinates. Yes, the randomness may produce differences larger than one unit between consecutive seconds, but know this: soccer players are fast! They can move about 3 yards in a given second. Well, it is very unlikely to encounter a case where their artificial movement is not realistic. But to ease the pain I can add some distance limit between these coordinates if it makes you happy... in version 2.0! For real, I will look at it at a later stage. Time is running out.
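To see items 7 through 12 in action, here is a toy version with two made-up events that are four seconds apart. It is only a sketch: I swap in `np.interp` for the row-by-row loop, which gives the same straight-line fake ball as long as the movement between events is linear. The frame `events_toy` and its numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# two made-up events for one player, 4 seconds apart
events_toy = pd.DataFrame({
    "eventSecRound": [10.0, 14.0],
    "x_start": [50.0, 58.0],
    "y_start": [20.0, 12.0],
})

# item 7: one row per second between the first and the last event
grid = pd.DataFrame({"eventSecRound": np.arange(10.0, 15.0, 1.0)})
dense = grid.merge(events_toy, on="eventSecRound", how="left")

# items 5 and 8: 1 where a real event exists, 0 for the filler rows
dense["temp_label"] = dense["x_start"].notna().astype(int)

# items 10 to 12: straight-line fake ball between consecutive events
dense["x_ball"] = np.interp(dense["eventSecRound"],
                            events_toy["eventSecRound"], events_toy["x_start"])
dense["y_ball"] = np.interp(dense["eventSecRound"],
                            events_toy["eventSecRound"], events_toy["y_start"])

print(dense[["eventSecRound", "temp_label", "x_ball", "y_ball"]])
```

The ball moves 2 units per second in x (50, 52, 54, 56, 58) and -2 units per second in y (20, 18, 16, 14, 12), and only the first and last rows carry a temp_label of 1.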

That is about it!
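As for the distance limit I promised for version 2.0 in item 14, one possible shape for it is sketched below. This is not part of the current process; `jitter_with_cap`, `max_step`, and the fixed `seed` are my own assumptions. The idea: sample each filler row uniformly around the fake ball, then clip the sample so it stays within `max_step` (in the same units as the coordinates) of the previous row.

```python
import numpy as np

def jitter_with_cap(ball, unit, is_event, max_step, seed=0):
    """Sample a 1-D player track around the fake ball line.

    Filler rows draw uniformly from [ball - |unit|, ball + |unit|] and are
    then clipped to stay within max_step of the previous row. Event rows
    keep the recorded coordinate and are not clipped.
    """
    rng = np.random.default_rng(seed)
    ball = np.asarray(ball, dtype=float)
    half_width = np.abs(np.asarray(unit, dtype=float))
    player = ball.copy()
    for i in range(1, len(ball)):
        if is_event[i]:
            continue  # recorded event coordinate stays untouched
        draw = rng.uniform(ball[i] - half_width[i], ball[i] + half_width[i])
        player[i] = np.clip(draw, player[i - 1] - max_step, player[i - 1] + max_step)
    return player

# toy input: the straight-line ball from 50 to 58 with a unit change of 2
x_player = jitter_with_cap(ball=[50, 52, 54, 56, 58], unit=[2, 2, 2, 2, 2],
                           is_event=[1, 0, 0, 0, 1], max_step=3.0)
print(x_player)
```

Because event rows are left alone, the jump into a recorded event can still be larger than `max_step`; whether that needs smoothing too is an open design question.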

B and C are the helper functions, and they are pretty self-explanatory.
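For the record, the fill-forward helper (B) could probably be replaced by a pandas built-in. A minimal sketch on a made-up column, assuming a plain forward fill is all that is needed:

```python
import numpy as np
import pandas as pd

# made-up unit changes: known on event rows, missing on the filler rows
toy = pd.DataFrame({"x_unit": [2.0, np.nan, np.nan, -1.0, np.nan]})

# copy the last known value forward into every missing row
toy["x_unit"] = toy["x_unit"].ffill()
print(toy["x_unit"].tolist())  # [2.0, 2.0, 2.0, -1.0, -1.0]
```

One difference to keep in mind: if the very first row were missing, `ffill` would leave it as NaN, while the loop version would leave the 100000 placeholder in place.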

Ready!

I go ahead and start the fun. I want to pick an Arsenal match (who else!) and then get the midfielder named "G. Xhaka". Then, I initialize. Finally, I prepare the training and test sets and save them to csv.

In [138]:
# initialize
train_df = dataset_init(events, 2500040, "G. Xhaka", 1)
test_df = dataset_init(events, 2500040, "G. Xhaka", 2)


**Quick comment:** I decided to use a midfielder as my test subject. The good news is that midfield players are involved in a lot of events given their role and high ball skill level. The bad news is that they are everywhere on the football pitch in a given game! I mean, they help the defense, they support the attack, they sprint forward, they run back, they move box to box. All that midfielder activity leads to a bigger challenge for the prediction model. No problem, it would not be fun if it was easy.

In [139]:
# and deep-prep
train_df = deep_prep(train_df)
test_df = deep_prep(test_df)

87
87.0


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


      matchPeriodNum       event  eventSec  x_start  y_start  eventSecRound  \
0                1.0  short_pass    23.933     55.0     18.0           24.0   
1                NaN         NaN       NaN      NaN      NaN           25.0   
2                NaN         NaN       NaN      NaN      NaN           26.0   
3                NaN         NaN       NaN      NaN      NaN           27.0   
4                NaN         NaN       NaN      NaN      NaN           28.0   
5                NaN         NaN       NaN      NaN      NaN           29.0   
6                NaN         NaN       NaN      NaN      NaN           30.0   
7                NaN         NaN       NaN      NaN      NaN           31.0   
8                NaN         NaN       NaN      NaN      NaN           32.0   
9                NaN         NaN       NaN      NaN      NaN           33.0   
10               NaN         NaN       NaN      NaN      NaN           34.0   
11               1.0  short_pass    34.904     72.0 



62
62.0
      matchPeriodNum       event  eventSec  x_start  y_start  eventSecRound  \
0                2.0  short_pass    28.830     45.0     58.0           29.0   
1                NaN         NaN       NaN      NaN      NaN           30.0   
2                NaN         NaN       NaN      NaN      NaN           31.0   
3                NaN         NaN       NaN      NaN      NaN           32.0   
4                NaN         NaN       NaN      NaN      NaN           33.0   
5                NaN         NaN       NaN      NaN      NaN           34.0   
6                NaN         NaN       NaN      NaN      NaN           35.0   
7                NaN         NaN       NaN      NaN      NaN           36.0   
8                NaN         NaN       NaN      NaN      NaN           37.0   
9                NaN         NaN       NaN      NaN      NaN           38.0   
10               NaN         NaN       NaN      NaN      NaN           39.0   
11               NaN         NaN       NaN  



In [119]:
# check
print(train_df.shape)
print(test_df.shape)

(2791, 3)
(3043, 3)


In [120]:
# save as csv
train_df.to_csv("../../csv_files/AI_in_Soccer/LSTM_train.csv")
test_df.to_csv("../../csv_files/AI_in_Soccer/LSTM_test.csv")