How do red cards impact a soccer game? We're going to look at four seasons' worth of Bundesliga data, from 2018-19 through 2021-22.

In [1]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# let's take a look at the first season of data
test_season = pd.read_csv("data/D1_18-19.csv")
print(test_season.columns) 

Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
       'HTAG', 'HTR', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY',
       'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH',
       'IWD', 'IWA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'VCH', 'VCD',
       'VCA', 'Bb1X2', 'BbMxH', 'BbAvH', 'BbMxD', 'BbAvD', 'BbMxA', 'BbAvA',
       'BbOU', 'BbMx>2.5', 'BbAv>2.5', 'BbMx<2.5', 'BbAv<2.5', 'BbAH', 'BbAHh',
       'BbMxAHH', 'BbAvAHH', 'BbMxAHA', 'BbAvAHA', 'PSCH', 'PSCD', 'PSCA'],
      dtype='object')


There are about 60 columns. We don't need all of them. We will also need to rename the columns we do use and create a new one. <br>
Here is what we need: 
  - Season (a new column we create to identify the season: 18-19, 19-20, 20-21, 21-22)
  - Date
  - HomeTeam
  - AwayTeam
  - FTHG (Full time home goals) --> rename home_goals
  - FTAG (Full time away goals) --> rename away_goals
  - FTR (Full time result - H=Home win, A=Away Win, D=Draw) --> rename result
  - HY (Home yellow cards) --> rename home_yellows
  - AY (Away yellow cards) --> rename away_yellows
  - HR (Home red cards) --> rename home_reds
  - AR (Away red cards) --> rename away_reds



Here's the plan of attack:
1. Define a function to do the work of dropping the columns we don't need, adding the season column, and renaming the columns we keep.
2. Pass the four CSVs through the function to create four dataframes, one per season.
3. Combine the four dataframes into one larger dataframe with all the info in it.
4. Double-check the combined dataframe.

In [3]:
# 1 - define the function - convert_season
# It should take in two arguments: a path to a csv, and a season name
# It will keep only the columns in new_cols and rename the ones listed in new_col_names:
new_cols = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HY','AY', 'HR', 'AR']
new_col_names = {
  "FTHG": "home_goals",
  "FTAG": "away_goals",
  "FTR": "result",
  "HY": "home_yellows",
  "AY": "away_yellows",
  "HR": "home_reds",
  "AR": "away_reds"
}
# function should return a pandas dataframe
def convert_season(csv, season):
  df = pd.read_csv(csv, usecols=new_cols)
  df = df.rename(columns=new_col_names)
  df["season"] = season
  return df  

In [4]:
#test the first season
season_18_19_df = convert_season("data/D1_18-19.csv", "2018-2019")
print(season_18_19_df.head())

         Date            HomeTeam       AwayTeam  home_goals  away_goals  \
0  24/08/2018       Bayern Munich     Hoffenheim           3           1   
1  25/08/2018  Fortuna Dusseldorf       Augsburg           1           2   
2  25/08/2018            Freiburg  Ein Frankfurt           0           2   
3  25/08/2018              Hertha       Nurnberg           1           0   
4  25/08/2018          M'gladbach     Leverkusen           2           0   

  result  home_yellows  away_yellows  home_reds  away_reds     season  
0      H             1             4          0          0  2018-2019  
1      A             1             0          0          0  2018-2019  
2      A             1             2          0          0  2018-2019  
3      H             2             2          0          0  2018-2019  
4      H             1             2          0          0  2018-2019  
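
With one season converting cleanly, steps 2 and 3 of the plan could look roughly like the sketch below. The other three CSV paths aren't shown here, so instead of guessing file names it uses two tiny placeholder frames with dummy values in place of real `convert_season` output. `combine_seasons` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

def combine_seasons(frames):
    """Stack per-season dataframes into one (hypothetical helper)."""
    # ignore_index=True renumbers the rows 0..n-1 instead of
    # repeating each season's own 0-based index
    return pd.concat(frames, ignore_index=True)

# Placeholder frames standing in for the output of convert_season();
# the values are dummy data for illustration only.
season_a = pd.DataFrame(
    {"HomeTeam": ["Team A"], "home_reds": [0], "season": ["2018-2019"]}
)
season_b = pd.DataFrame(
    {"HomeTeam": ["Team B"], "home_reds": [1], "season": ["2019-2020"]}
)

all_seasons = combine_seasons([season_a, season_b])

# Step 4: quick sanity checks on the combined frame
print(all_seasons.shape)
print(all_seasons["season"].value_counts())
```

With the real data, the list passed in would hold the four dataframes returned by `convert_season`, and the `value_counts()` call is a quick way to confirm every season contributed the expected number of matches.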


In [5]:
# Date comes in as object (a string); should go back and convert it to a datetime
print(season_18_19_df.dtypes)

Date            object
HomeTeam        object
AwayTeam        object
home_goals       int64
away_goals       int64
result          object
home_yellows     int64
away_yellows     int64
home_reds        int64
away_reds        int64
season          object
dtype: object
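
One way the `Date` conversion might be done is with `pd.to_datetime` and an explicit format. This is a minimal sketch that assumes every file uses the day/month/four-digit-year layout seen in the rows above. If another season's CSV uses a different layout, the call would raise an error and the format string would need adjusting.

```python
import pandas as pd

# Two dates copied from the head() output above
dates = pd.Series(["24/08/2018", "25/08/2018"], name="Date")

# Spell out day/month/year so a date like 01/09/2018 is read as
# 1 September rather than 9 January
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

print(parsed.dtype)
print(parsed.dt.month.tolist())
```

Inside `convert_season`, the same call could be applied to `df["Date"]` before returning, so every season comes back with a proper datetime column.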
