### Let's Go!

Hi there! Let's talk football (or soccer).
I am about to take you through a journey of loading, editing, and compiling data into a master dataset.

Let's start by loading the libraries!

In [1]:
# for reading json format files
import json

# for data munging
import pandas as pd
import numpy as np

# for dictionary formation
from collections import defaultdict

# for pickling
import pickle

For me, it is very important to stay organized in a project; it saves a lot of time. I also like to check the working directory to make sure I always point to the correct path.

In [2]:
# check directory
!pwd

/Users/atahankocak/ds/Projects/A.I-in-Soccer/notebook
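
`!pwd` relies on a Unix shell. A minimal, portable sketch of the same check with the standard library could look like the following; `DATA_DIR` is a hypothetical name for the relative data folder used later in this notebook.

```python
from pathlib import Path

# current working directory, the same information `!pwd` prints
cwd = Path.cwd()
print(cwd)

# hypothetical name: resolve the relative data folder against the cwd
DATA_DIR = (cwd / ".." / "data").resolve()
print(DATA_DIR, "exists:", DATA_DIR.exists())
```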


I want to see more data than pandas displays by default. I am more comfortable, and I learn better, when I can see the data.

In [3]:
# pandas setup
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)
pd.set_option('display.precision', 3)

Good to go!

The next part investigates the data in its original habitat. The data is in document (JSON) format. Therefore, I use a NoSQL database, MongoDB, via PyMongo. Let's take a look at what we are dealing with.

In [4]:
# load PyMongo Library and prettyprint
from pymongo import MongoClient
from pprint import pprint

First, I start the mongo daemon (`mongod`) in the artificial pandas' backed terminal. Then I import the documents into the related databases. Finally, I create clients for the databases for further investigation.

The structure of the documents is straightforward. Each country has its own seasonal data. There are 2 international competitions as well. These international competitions can share player data among each other and with other countries. That may create a problem for my work. Therefore, I plan to exclude the international competitions from this analysis for two reasons:
- Players' assigned roles can be switched in international competitions. It can result in confusion when predicting the event for a given player's role.
- Not all the league players can attend the competitions. That can cause heavy bias towards the better players and shape the analysis towards their data structure. Also, there will be players from other leagues (the ones we do not have data on). I want to keep things focused and simple initially.

In [5]:
# events database
!mongoimport --db events --collection england --jsonArray --file ../data/events/events_england.json
!mongoimport --db events --collection germany --jsonArray --file ../data/events/events_germany.json
!mongoimport --db events --collection spain --jsonArray --file ../data/events/events_spain.json
!mongoimport --db events --collection italy --jsonArray --file ../data/events/events_italy.json
!mongoimport --db events --collection france --jsonArray --file ../data/events/events_france.json

2020-04-02T13:42:07.118-0400	connected to: mongodb://localhost/
2020-04-02T13:42:10.119-0400	[##......................] events.england	20.1MB/180MB (11.1%)
2020-04-02T13:42:13.120-0400	[#######.................] events.england	55.5MB/180MB (30.8%)
2020-04-02T13:42:16.124-0400	[############............] events.england	91.3MB/180MB (50.7%)
2020-04-02T13:42:19.120-0400	[################........] events.england	126MB/180MB (70.1%)
2020-04-02T13:42:22.120-0400	[#####################...] events.england	161MB/180MB (89.1%)
2020-04-02T13:42:23.769-0400	[########################] events.england	180MB/180MB (100.0%)
2020-04-02T13:42:23.769-0400	643150 document(s) imported successfully. 0 document(s) failed to import.
2020-04-02T13:42:23.920-0400	connected to: mongodb://localhost/
2020-04-02T13:42:26.921-0400	[#####...................] events.germany	33.3MB/146MB (22.8%)
2020-04-02T13:42:29.921-0400	[############............] events.germany	73.3MB/146MB (50.2%)
2020-04-02T13:42:32.921-0400	[#####

I do the same with matches and players databases...

In [6]:
# matches database
!mongoimport --db matches --collection england --jsonArray --file ../data/matches/matches_england.json
!mongoimport --db matches --collection germany --jsonArray --file ../data/matches/matches_germany.json
!mongoimport --db matches --collection spain --jsonArray --file ../data/matches/matches_spain.json
!mongoimport --db matches --collection italy --jsonArray --file ../data/matches/matches_italy.json
!mongoimport --db matches --collection france --jsonArray --file ../data/matches/matches_france.json

2020-04-02T13:43:20.340-0400	connected to: mongodb://localhost/
2020-04-02T13:43:20.418-0400	380 document(s) imported successfully. 0 document(s) failed to import.
2020-04-02T13:43:20.553-0400	connected to: mongodb://localhost/
2020-04-02T13:43:20.608-0400	306 document(s) imported successfully. 0 document(s) failed to import.
2020-04-02T13:43:20.739-0400	connected to: mongodb://localhost/
2020-04-02T13:43:20.820-0400	380 document(s) imported successfully. 0 document(s) failed to import.
2020-04-02T13:43:20.954-0400	connected to: mongodb://localhost/
2020-04-02T13:43:21.044-0400	380 document(s) imported successfully. 0 document(s) failed to import.
2020-04-02T13:43:21.174-0400	connected to: mongodb://localhost/
2020-04-02T13:43:21.247-0400	380 document(s) imported successfully. 0 document(s) failed to import.


In [7]:
# players database
!mongoimport --db players --collection players --jsonArray --file ../data/players.json

2020-04-02T13:43:21.387-0400	connected to: mongodb://localhost/
2020-04-02T13:43:21.474-0400	3603 document(s) imported successfully. 0 document(s) failed to import.


In [8]:
# teams database
!mongoimport --db teams --collection teams --jsonArray --file ../data/teams.json

2020-04-02T13:43:21.611-0400	connected to: mongodb://localhost/
2020-04-02T13:43:21.621-0400	142 document(s) imported successfully. 0 document(s) failed to import.


In [9]:
# create the client and the database handles
client = MongoClient()
events_db = client.events
matches_db = client.matches
players_db = client.players
teams_db = client.teams

In [21]:
# check the collections
print("Events database collections: ", events_db.list_collection_names())
print("Matches database collections: ", matches_db.list_collection_names())
print("Players database collections: ", players_db.list_collection_names())
print("Teams database collections: ", teams_db.list_collection_names())

Events database collections:  ['france', 'germany', 'italy', 'spain', 'england']
Matches database collections:  ['germany', 'france', 'england', 'italy', 'spain']
Players database collections:  ['players']
Teams database collections:  ['teams']


Let's check one event in **database**:"events_db" and **collection**:"england".

In [11]:
# events
list(events_db.england.find().limit(1))

[{'_id': ObjectId('5e62cd8d9bc590f7b0586f84'),
  'eventId': 8,
  'subEventName': 'Simple pass',
  'tags': [{'id': 1801}],
  'playerId': 25413,
  'positions': [{'y': 49, 'x': 49}, {'y': 78, 'x': 31}],
  'matchId': 2499719,
  'eventName': 'Pass',
  'teamId': 1609,
  'matchPeriod': '1H',
  'eventSec': 2.7586489999999912,
  'subEventId': 85,
  'id': 177959171}]

This dataset is formed by the events performed by football players at given times and locations. The following features got my attention:
- `subEventName` / `subEventId`: I can use these to create my targets.
- `eventName` / `eventId`: The generalized version of `subEventName`. It will be helpful in grouping the targets.
- `playerId`: This is an important feature for the first wing of the project, the player location prediction, because my location prediction pipeline requires a single player to process (for now).
- `positions` / `eventSec` / `matchPeriod`: These are essential features.
    - `positions`: The location of the players on the football pitch as (x, y) coordinates. This feature plays a central role in our models.
    - `eventSec`: The time in the game when the ball meets the player and the player performs the event. It is a very important feature, especially in the first wing of the project (player location prediction - feature engineering).
    - `matchPeriod`: Another time-related feature. This is important because the player positions reset between halves, so I need to separate the datasets into halves to avoid a player coordinate shift between halves. No worries if this does not make sense now. It will in the future notebooks.
- `teamId`: This feature can be useful in team-level analysis and player selection.
- `tags`: Outcome of the event (accurate, not accurate, missed ball, ...). I may or may not need this feature. I will keep it for now. But it seems like a great feature for another classifier to predict the outcomes of xxx under xxx conditions.
- `matchId`: Good for match identification and match selection for the player location prediction wing of the project.
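
To make the `tags` field more concrete, here is a minimal sketch that flattens the tag ids of the sample event shown above. The helpers `tag_ids` and `tag_labels` and the partial `TAG_NAMES` lookup are hypothetical names introduced for illustration; the id-to-label mapping is an assumption and should be checked against the data provider's tag table.

```python
# sample event copied from the output above (ObjectId omitted)
event = {
    'eventId': 8, 'subEventName': 'Simple pass', 'tags': [{'id': 1801}],
    'playerId': 25413, 'positions': [{'y': 49, 'x': 49}, {'y': 78, 'x': 31}],
    'matchId': 2499719, 'eventName': 'Pass', 'teamId': 1609,
    'matchPeriod': '1H', 'eventSec': 2.7586489999999912,
    'subEventId': 85, 'id': 177959171,
}

# ASSUMPTION: partial, illustrative lookup; verify against the official tag table
TAG_NAMES = {1801: 'accurate', 1802: 'not accurate'}

def tag_ids(event):
    """Return the flat list of tag ids attached to an event."""
    return [tag['id'] for tag in event.get('tags', [])]

def tag_labels(event, lookup=TAG_NAMES):
    """Translate tag ids to labels, marking ids missing from the lookup."""
    return [lookup.get(tag_id, 'unknown') for tag_id in tag_ids(event)]

print(tag_ids(event))
print(tag_labels(event))
```

Keeping the ids as a flat list makes it easy to turn them into indicator columns later, if the outcome classifier idea is pursued.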

Let's check a document from matches in england.

In [12]:
#matches
list(matches_db.england.find().limit(1))

[{'_id': ObjectId('5e62d2e8fa808e1cbaaaa127'),
  'status': 'Played',
  'roundId': 4165368,
  'gameweek': 0,
  'teamsData': {'9598': {'scoreET': 0,
    'coachId': 122788,
    'side': 'away',
    'teamId': 9598,
    'score': 2,
    'scoreP': 0,
    'hasFormation': 1,
    'formation': {'bench': [{'playerId': 69964,
       'assists': '0',
       'goals': 'null',
       'ownGoals': '0',
       'redCards': '0',
       'yellowCards': '0'},
      {'playerId': 69353,
       'assists': '0',
       'goals': 'null',
       'ownGoals': '0',
       'redCards': '0',
       'yellowCards': '0'},
      {'playerId': 212604,
       'assists': '0',
       'goals': 'null',
       'ownGoals': '0',
       'redCards': '0',
       'yellowCards': '0'},
      {'playerId': 69400,
       'assists': '0',
       'goals': 'null',
       'ownGoals': '0',
       'redCards': '0',
       'yellowCards': '0'},
      {'playerId': 230626,
       'assists': '0',
       'goals': 'null',
       'ownGoals': '0',
       'redCards'

I see a very detailed match info collection. It stores match date, time, winner, score, players' major activities (such as goals), extra & regular time squads, referee info, substitutions, and competition id. The two features I may extract and use are time-related fields:
  - `dateutc` (match start date and time)
  - `gameweek` (week of the competition).
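
As a rough sketch of how these two fields could be turned into features, the snippet below parses `dateutc` with the standard library. The trimmed `match` dictionary uses illustrative values in the same shape as the match documents, and `match_time_features` is a hypothetical helper name.

```python
from datetime import datetime

# trimmed match document with illustrative values
match = {'wyId': 2500089, 'gameweek': 38, 'dateutc': '2018-05-13 14:00:00'}

def match_time_features(match):
    """Extract the time-related features from a single match document."""
    start = datetime.strptime(match['dateutc'], '%Y-%m-%d %H:%M:%S')
    return {
        'gameweek': match['gameweek'],
        'month': start.month,
        'hour': start.hour,
    }

features = match_time_features(match)
print(features)
```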

And players...

In [13]:
list(players_db.players.find().limit(1))

[{'_id': ObjectId('5e8224b42794caff6dd51401'),
  'passportArea': {'name': 'Senegal',
   'id': '686',
   'alpha3code': 'SEN',
   'alpha2code': 'SN'},
  'weight': 82,
  'firstName': 'Alfred John Momar',
  'middleName': '',
  'lastName': "N'Diaye",
  'currentTeamId': 683,
  'birthDate': '1990-03-06',
  'height': 187,
  'role': {'code2': 'MD', 'code3': 'MID', 'name': 'Midfielder'},
  'birthArea': {'name': 'France',
   'id': '250',
   'alpha3code': 'FRA',
   'alpha2code': 'FR'},
  'wyId': 32793,
  'foot': 'right',
  'shortName': "A. N'Diaye",
  'currentNationalTeamId': 19314}]

I will consider the following fields:
- `weight`, `height`, `birthDate`, and `foot`: In my experience as a recovering actuary and a passionate data scientist, I have found that bio-specs are very effective in defining observations. I have seen their incredibly detailed impact on results such as the calculation of a 50 billion dollar liability. As for `birthDate`, I will extract the age from it.
- `role`: My analysis and predictions are based on coordinates and the results will be spatial. Therefore, this feature can help my model through its high correlation (domain knowledge) with players' locations on the pitch.
- `shortName`: Player's name. It helps to identify the player to use domain knowledge for better analysis.
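
A minimal sketch of the two derived fields, using the sample player document above. The reference date `REFERENCE_DATE` and the helper `age_in_years` are assumptions made for illustration; any fixed date close to the season would serve the same purpose.

```python
from datetime import date

# fields of interest from the sample player document above
player = {
    'birthDate': '1990-03-06',
    'role': {'code2': 'MD', 'code3': 'MID', 'name': 'Midfielder'},
    'shortName': "A. N'Diaye",
    'height': 187, 'weight': 82, 'foot': 'right',
}

# ASSUMPTION: ages are measured against one fixed reference date
REFERENCE_DATE = date(2018, 1, 1)

def age_in_years(birth_date, reference=REFERENCE_DATE):
    """Age as a fractional number of years (365.2425 days per year)."""
    born = date.fromisoformat(birth_date)
    return (reference - born).days / 365.2425

age = age_in_years(player['birthDate'])
role = player['role']['code3']  # keep only the short role code
print(round(age, 3), role)
```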

In [14]:
list(teams_db.teams.find().limit(1))

[{'_id': ObjectId('5e8247eb4ceffe42a3d0c39e'),
  'city': 'Vigo',
  'name': 'Celta de Vigo',
  'wyId': 692,
  'officialName': 'Real Club Celta de Vigo',
  'area': {'name': 'Spain',
   'id': '724',
   'alpha3code': 'ESP',
   'alpha2code': 'ES'},
  'type': 'club'}]

I can extract the team `name` to support my domain knowledge.

One other feature I noticed in all the datasets, with the exception of the "events" dataset, is `wyId`. It is the primary key of those datasets. It matches the two foreign keys in the "events" dataset: `teamId` and `playerId`. I can use them to merge the team name (`name`) and `shortName` from teams and players respectively.
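
The key relationship can be sketched with plain dictionaries before any dataframe is involved. The `toy_*` lists below are trimmed, illustrative records, and `enriched` is a hypothetical name; the point is only that `teamId` and `playerId` are looked up against `wyId`.

```python
# trimmed, illustrative records
toy_teams = [{'wyId': 692, 'name': 'Celta de Vigo'}, {'wyId': 1609, 'name': 'Arsenal'}]
toy_players = [{'wyId': 25413, 'shortName': 'A. Lacazette'}]
toy_events = [{'id': 177959171, 'teamId': 1609, 'playerId': 25413}]

# index the lookup tables by their primary key, `wyId`
team_names = {team['wyId']: team['name'] for team in toy_teams}
player_names = {player['wyId']: player['shortName'] for player in toy_players}

# resolve the foreign keys of each event; unknown ids become None
enriched = [
    {**event,
     'teamName': team_names.get(event['teamId']),
     'shortName': player_names.get(event['playerId'])}
    for event in toy_events
]
print(enriched[0])
```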

Now I have a list of potential features I can use in my analysis and predictions. Next, I create the dataframes for events, teams, players, and matches. 

But first, I would like to separate the `positions` values. I can split them into `x_start` & `y_start` coordinates for the start of the event and `x_end` & `y_end` for the end of the event.

Also, there is a nested dataset issue. In the case of events and matches, the documents form a nested dictionary with the nation as the key for each league. Each league consists of around 500K to 700K event records. I tackle this problem by loading the json files for each nation into a dictionary and then separating the nations into lists. Finally, I unite the lists and convert them to a pandas dataframe.

That leads to a data loss problem: the keys (nations) dissolve as a result of the separation. To keep the nations, I add a `league` feature and copy the nation to the observation level (value).
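
On a toy scale, the flattening idea looks like the sketch below. `toy_events` and `flat_events` are hypothetical names with made-up records; the nation key is copied into each record as `league` before the per-nation lists are united.

```python
# toy nested structure: nation -> list of event records (made-up values)
toy_events = {
    'england': [{'id': 1, 'eventName': 'Pass'}, {'id': 2, 'eventName': 'Foul'}],
    'spain': [{'id': 3, 'eventName': 'Shot'}],
}

# copy the key into every record, then unite everything into one flat list
flat_events = [
    {**event, 'league': nation}
    for nation, records in toy_events.items()
    for event in records
]

for event in flat_events:
    print(event)
```

This version builds new dictionaries instead of modifying the loaded records in place, which is a design choice of the sketch and costs extra memory on the full dataset.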

In [22]:
# form a list of nations for looping
nations = events_db.list_collection_names()
nations

['france', 'germany', 'italy', 'spain', 'england']

In [23]:
# load events data
events = {}
for nation in nations:
    with open('../data/events/events_%s.json' %nation) as json_data:
        events[nation] = json.load(json_data)

# load matches data
matches = {}
for nation in nations:
    with open('../data/matches/matches_%s.json' %nation) as json_data:
        matches[nation] = json.load(json_data)
        
# load the players data
with open('../data/players.json') as json_data:
    players = json.load(json_data)

# load teams data
with open('../data/teams.json') as json_data:
    teams = json.load(json_data)

In addition to the separation of the (x, y) coordinates, I replace the missing end (x, y) coordinates with the start coordinates of the same event. As I understand it, the (x, y) coordinates in the dataset are also the actual ball coordinates: when the player meets the ball, there is an event. In events such as a foul or an offside, the ball does not travel between two points; the match is paused at the event's start location. Therefore, I assign the beginning coordinates as the ending coordinates in such cases.

In [24]:
# separate x and y coordinates
for nation in nations:
    for event in events[nation]:
        # add competition
        event['competition'] = nation

        # separate the start coordinates
        event['x_start'] = event['positions'][0]['x']
        event['y_start'] = event['positions'][0]['y']

        # some events do not have ending coordinates, e.g. "Foul";
        # fall back to the start coordinates in that case
        if len(event['positions']) == 2:
            event['x_end'] = event['positions'][1]['x']
            event['y_end'] = event['positions'][1]['y']
        else:
            event['x_end'] = event['x_start']
            event['y_end'] = event['y_start']
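
The coordinate logic of the cell above can also be packaged as a small, testable helper. `split_positions` is a hypothetical name, and the second call uses a made-up single-point event to show the fallback.

```python
def split_positions(positions):
    """Return (x_start, y_start, x_end, y_end) for one event.

    When no end point is recorded, the start point is reused as the end point.
    """
    start = positions[0]
    end = positions[1] if len(positions) > 1 else start
    return start['x'], start['y'], end['x'], end['y']

# event with both points (values from the sample pass shown earlier)
print(split_positions([{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]))

# made-up event with a start point only, e.g. a foul
print(split_positions([{'y': 40, 'x': 60}]))
```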

In [25]:
# add league as a field in events
for nation in nations:
    for event in events[nation]:
        # add league
        event['league'] = nation

In [26]:
# events
# separation
engEvents, spaEvents, itaEvents, gerEvents, fraEvents = (
    events['england'], events['spain'], events['italy'], events['germany'], events['france'])

# unity
allEvents = engEvents + spaEvents + itaEvents + gerEvents + fraEvents

# matches
# separation
engMatches, spaMatches, itaMatches, gerMatches, fraMatches = (
    matches['england'], matches['spain'], matches['italy'], matches['germany'], matches['france'])

# unity
allMatches = engMatches + spaMatches + itaMatches + gerMatches + fraMatches

In [27]:
# sanity check
print("The record counts for dictionary and list of events match?",
      len(allEvents) == sum(list(len(events[nation]) for nation in nations)))

print("The record counts for dictionary and list of matches match?",
      len(allMatches) == sum(list(len(matches[nation]) for nation in nations)))

The record counts for dictionary and list of events match? True
The record counts for dictionary and list of matches match? True


In [28]:
# create the dataframes
events_df = pd.DataFrame.from_dict(allEvents, orient='columns')
players_df = pd.DataFrame.from_dict(players, orient='columns')
teams_df = pd.DataFrame.from_dict(teams, orient='columns')
matches_df = pd.DataFrame.from_dict(allMatches, orient='columns')

In [29]:
events_df.head(3)

Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,subEventId,id,competition,x_start,y_start,x_end,y_end,league
0,8,Simple pass,[{'id': 1801}],25413,"[{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]",2499719,Pass,1609,1H,2.759,85,177959171,england,49,49,31,78,england
1,8,High pass,[{'id': 1801}],370224,"[{'y': 78, 'x': 31}, {'y': 75, 'x': 51}]",2499719,Pass,1609,1H,4.947,83,177959172,england,31,78,51,75,england
2,8,Head pass,[{'id': 1801}],3319,"[{'y': 75, 'x': 51}, {'y': 71, 'x': 35}]",2499719,Pass,1609,1H,6.542,82,177959173,england,51,75,35,71,england


In [30]:
matches_df.head(3)

Unnamed: 0,status,roundId,gameweek,teamsData,seasonId,dateutc,winner,venue,wyId,label,date,referees,duration,competitionId
0,Played,4405654,38,"{'1646': {'scoreET': 0, 'coachId': 8880, 'side...",181150,2018-05-13 14:00:00,1659,Turf Moor,2500089,"Burnley - AFC Bournemouth, 1 - 2","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 385705, 'role': 'referee'}, {'r...",Regular,364
1,Played,4405654,38,"{'1628': {'scoreET': 0, 'coachId': 8357, 'side...",181150,2018-05-13 14:00:00,1628,Selhurst Park,2500090,"Crystal Palace - West Bromwich Albion, 2 - 0","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 381851, 'role': 'referee'}, {'r...",Regular,364
2,Played,4405654,38,"{'1609': {'scoreET': 0, 'coachId': 7845, 'side...",181150,2018-05-13 14:00:00,1609,The John Smith's Stadium,2500091,"Huddersfield Town - Arsenal, 0 - 1","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 384965, 'role': 'referee'}, {'r...",Regular,364


In [31]:
teams_df.head(3)

Unnamed: 0,city,name,wyId,officialName,area,type
0,Newcastle upon Tyne,Newcastle United,1613,Newcastle United FC,"{'name': 'England', 'id': '0', 'alpha3code': '...",club
1,Vigo,Celta de Vigo,692,Real Club Celta de Vigo,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club
2,Barcelona,Espanyol,691,Reial Club Deportiu Espanyol,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club


In [32]:
players_df.head(3)

Unnamed: 0,passportArea,weight,firstName,middleName,lastName,currentTeamId,birthDate,height,role,birthArea,wyId,foot,shortName,currentNationalTeamId
0,"{'name': 'Turkey', 'id': '792', 'alpha3code': ...",78,Harun,,Tekin,4502,1989-06-17,187,"{'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk...","{'name': 'Turkey', 'id': '792', 'alpha3code': ...",32777,right,H. Tekin,4687.0
1,"{'name': 'Senegal', 'id': '686', 'alpha3code':...",73,Malang,,Sarr,3775,1999-01-23,182,"{'code2': 'DF', 'code3': 'DEF', 'name': 'Defen...","{'name': 'France', 'id': '250', 'alpha3code': ...",393228,left,M. Sarr,4423.0
2,"{'name': 'France', 'id': '250', 'alpha3code': ...",72,Over,,Mandanda,3772,1998-10-26,176,"{'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk...","{'name': 'France', 'id': '250', 'alpha3code': ...",393230,,O. Mandanda,


All good!
It is time to focus on setting the features and merging them into events_df.

Now let's create the `age` feature and extract the `role` in players_df. Also, let's get the `month` and `hour` from matches_df.datetime (built from `dateutc`).

In [33]:
# set player's role
players_df['role'] = [role.get('code3') for role in players_df.role]

# set player's age (in years) as of a fixed reference date
players_df['birthDate'] = pd.to_datetime(players_df.birthDate, format="%Y-%m-%d")
players_df['stableYear'] = "2018-01-01"
players_df['stableYear'] = pd.to_datetime(players_df.stableYear, format="%Y-%m-%d")
players_df['age'] = players_df['stableYear'] - players_df['birthDate']
# np.timedelta64(1, 'Y') may be rejected by recent pandas versions;
# 365.2425 days is the average year length NumPy uses for that unit
players_df['age'] = players_df['age'].dt.days / 365.2425

# set month and hour of the match
matches_df['datetime'] = pd.to_datetime(matches_df.dateutc, format="%Y-%m-%d %H:%M:%S")
matches_df['month'] = matches_df.datetime.dt.month
matches_df['hour'] = matches_df.datetime.dt.hour

# matches dataset: split labels such as "Burnley - AFC Bournemouth, 1 - 2"
# split on ' - ' (with spaces) so hyphenated team names stay intact
matches_df['label'] = matches_df['label'].str.strip()
matches_df['matchScore'] = matches_df.label.str.split(',').str[-1].str.strip()
matches_df['matchScoreHome'] = matches_df.matchScore.str.split(' - ').str.get(0)
matches_df['matchScoreAway'] = matches_df.matchScore.str.split(' - ').str.get(-1)
matches_df['teams'] = matches_df.label.str.split(',').str[0].str.strip()
matches_df['teamHome'] = matches_df.teams.str.split(' - ').str.get(0)
matches_df['teamAway'] = matches_df.teams.str.split(' - ').str.get(-1)
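
The label parsing above relies on string splits. As an alternative sketch, a regular expression can pull the teams and the scores in one pass and return the scores as integers. `LABEL_RE` and `parse_label` are hypothetical names, and the pattern assumes labels shaped like `Home - Away, H - A`, as in the samples shown earlier.

```python
import re

# ASSUMPTION: labels look like "Home - Away, H - A"
LABEL_RE = re.compile(
    r'^(?P<home>.+?) - (?P<away>.+), (?P<home_goals>\d+) - (?P<away_goals>\d+)$'
)

def parse_label(label):
    """Split a match label into teams and integer scores; None if it does not fit."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        return None
    return {
        'teamHome': match['home'],
        'teamAway': match['away'],
        'matchScoreHome': int(match['home_goals']),
        'matchScoreAway': int(match['away_goals']),
    }

print(parse_label('Burnley - AFC Bournemouth, 1 - 2'))
print(parse_label('not a label'))
```

Because the separator in the pattern includes the surrounding spaces, a hyphen inside a team name does not split the name.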


In [34]:
players_df.head(2)

Unnamed: 0,passportArea,weight,firstName,middleName,lastName,currentTeamId,birthDate,height,role,birthArea,wyId,foot,shortName,currentNationalTeamId,stableYear,age
0,"{'name': 'Turkey', 'id': '792', 'alpha3code': ...",78,Harun,,Tekin,4502,1989-06-17,187,GKP,"{'name': 'Turkey', 'id': '792', 'alpha3code': ...",32777,right,H. Tekin,4687,2018-01-01,28.543
1,"{'name': 'Senegal', 'id': '686', 'alpha3code':...",73,Malang,,Sarr,3775,1999-01-23,182,DEF,"{'name': 'France', 'id': '250', 'alpha3code': ...",393228,left,M. Sarr,4423,2018-01-01,18.941


In [35]:
matches_df.head(2)

Unnamed: 0,status,roundId,gameweek,teamsData,seasonId,dateutc,winner,venue,wyId,label,date,referees,duration,competitionId,datetime,month,hour,matchScore,matchScoreHome,matchScoreAway,teams,teamHome,teamAway
0,Played,4405654,38,"{'1646': {'scoreET': 0, 'coachId': 8880, 'side...",181150,2018-05-13 14:00:00,1659,Turf Moor,2500089,"Burnley - AFC Bournemouth, 1 - 2","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 385705, 'role': 'referee'}, {'r...",Regular,364,2018-05-13 14:00:00,5,14,1 - 2,1,2,Burnley - AFC Bournemouth,Burnley,AFC Bournemouth
1,Played,4405654,38,"{'1628': {'scoreET': 0, 'coachId': 8357, 'side...",181150,2018-05-13 14:00:00,1628,Selhurst Park,2500090,"Crystal Palace - West Bromwich Albion, 2 - 0","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 381851, 'role': 'referee'}, {'r...",Regular,364,2018-05-13 14:00:00,5,14,2 - 0,2,0,Crystal Palace - West Bromwich Albion,Crystal Palace,West Bromwich Albion


And now, on with the merges...

In [36]:
# merge matches into events
events_df = pd.merge(events_df, matches_df[['wyId', 'gameweek', 'datetime', 'month', 'hour', 'teamHome']],
                     how='left', left_on='matchId', right_on='wyId')

In [39]:
events_df.head(2)

Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,subEventId,id,competition,x_start,y_start,x_end,y_end,league,wyId,gameweek,datetime,month,hour,teamHome
0,8,Simple pass,[{'id': 1801}],25413,"[{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]",2499719,Pass,1609,1H,2.759,85,177959171,england,49,49,31,78,england,2499719,1,2017-08-11 18:45:00,8,18,Arsenal
1,8,High pass,[{'id': 1801}],370224,"[{'y': 78, 'x': 31}, {'y': 75, 'x': 51}]",2499719,Pass,1609,1H,4.947,83,177959172,england,31,78,51,75,england,2499719,1,2017-08-11 18:45:00,8,18,Arsenal


In [40]:
# del `wyId` for next merge
del events_df['wyId']

In [41]:
# merge teams into events
events_df = pd.merge(events_df, teams_df[['wyId', 'name']],
                     how='left', left_on='teamId', right_on='wyId')

In [42]:
# del `wyId` for next merge
del events_df['wyId']

In [43]:
# merge players into events
events_df = pd.merge(events_df, players_df[['wyId', 'shortName', 'height', 'weight', 'age', 'foot', 'role']],
                     how='left', left_on='playerId', right_on='wyId')

In [44]:
# rename the name column to avoid confusion
events_df.rename(columns={'name': 'teamName'}, inplace=True)
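
Left merges like the ones above can silently duplicate rows when the lookup key is not unique, or leave missing values when an id has no match. A hedged sketch of how this could be checked on toy frames, using the `validate` and `indicator` parameters of `pd.merge`; the `toy_*` frames and the unmatched `playerId` of 0 are made up for illustration.

```python
import pandas as pd

# toy frames; playerId 0 is a made-up id with no matching player
toy_events = pd.DataFrame({'id': [1, 2, 3], 'playerId': [25413, 370224, 0]})
toy_players = pd.DataFrame({'wyId': [25413, 370224],
                            'shortName': ['A. Lacazette', 'R. Holding']})

merged = pd.merge(
    toy_events, toy_players,
    how='left', left_on='playerId', right_on='wyId',
    validate='many_to_one',  # raises if `wyId` is duplicated in toy_players
    indicator=True,          # adds a `_merge` column with the match status
)

# a left merge on a unique key must preserve the number of events
print(len(merged) == len(toy_events))

# events whose player could not be found
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['id', 'playerId']])
```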

One last filtering step and we are ready for the csv conversion into the next stage!

In [46]:
events_df = events_df[['id', 'matchId', 'gameweek', 'month', 'hour', 'teamHome',
                       'teamId', 'teamName', 'league',
                       'playerId', 'shortName', 'role', 'age', 'height', 'weight', 'foot', 
                       'eventName', 'subEventName', 
                       'matchPeriod', 'eventSec',
                       'x_start', 'y_start', 'x_end', 'y_end']]

In [47]:
events_df.head(2)

Unnamed: 0,id,matchId,gameweek,month,hour,teamHome,teamId,teamName,league,playerId,shortName,role,age,height,weight,foot,eventName,subEventName,matchPeriod,eventSec,x_start,y_start,x_end,y_end
0,177959171,2499719,1,8,18,Arsenal,1609,Arsenal,england,25413,A. Lacazette,FWD,26.599,175.0,73.0,right,Pass,Simple pass,1H,2.759,49,49,31,78
1,177959172,2499719,1,8,18,Arsenal,1609,Arsenal,england,370224,R. Holding,DEF,22.284,189.0,75.0,right,Pass,High pass,1H,4.947,31,78,51,75


It seems fine. Let's save it and finalize this section.

In [48]:
# save main df
events_df.to_csv('../../csv_files/AI_in_Soccer/events.csv')

"events_df" is our main dataset from now on.

In [49]:
# save other datasets as well
players_df.to_csv('players.csv')
teams_df.to_csv('teams.csv')
matches_df.to_csv('matches.csv')
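
One detail to keep in mind for the next notebook: by default, `to_csv` also writes the row index as an unnamed first column, which `read_csv` then labels `Unnamed: 0`. The sketch below shows the difference on a toy frame, writing to an in-memory buffer instead of a file; `toy_df` and the buffer names are hypothetical.

```python
import io
import pandas as pd

toy_df = pd.DataFrame({'id': [177959171, 177959172], 'eventName': ['Pass', 'Pass']})

# default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False keeps only the real columns
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

Whether to drop the index is a judgment call; passing `index_col=0` to `read_csv` when loading is another way to handle it.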

In [50]:
# to save the RAM!
del events_df
del players_df
del matches_df
del teams_df

Next stop: EDA and more Feature Engineering!