How do red cards impact a soccer game? This notebook looks at twelve seasons of Bundesliga data, from 2010/11 through 2021/22.

In [29]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [30]:
# let's take a look at one season of data first
test_season = pd.read_csv("data/D1_18-19.csv")
print(test_season.columns) 

Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
       'HTAG', 'HTR', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY',
       'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH',
       'IWD', 'IWA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'VCH', 'VCD',
       'VCA', 'Bb1X2', 'BbMxH', 'BbAvH', 'BbMxD', 'BbAvD', 'BbMxA', 'BbAvA',
       'BbOU', 'BbMx>2.5', 'BbAv>2.5', 'BbMx<2.5', 'BbAv<2.5', 'BbAH', 'BbAHh',
       'BbMxAHH', 'BbAvAHH', 'BbMxAHA', 'BbAvAHA', 'PSCH', 'PSCD', 'PSCA'],
      dtype='object')


There are about 60 columns. We don't need all of them. We will also need to rename the columns we keep and create a new one for the season. <br>
Here is what we need: 
  - Season (need to create to identify the season, e.g. 2018/2019)
  - Date
  - HomeTeam
  - AwayTeam
  - FTHG (Full time home goals) --> rename home_goals
  - FTAG (Full time away goals) --> rename away_goals
  - FTR (Full time result - H=Home win, A=Away Win, D=Draw) --> rename result
  - HS (Home shots) --> rename home_shots
  - AS (Away shots) --> rename away_shots
  - HST (Home shots on target) --> rename home_shots_target
  - AST (Away shots on target) --> rename away_shots_target
  - HY (Home yellow cards) --> rename home_yellows
  - AY (Away yellow cards) --> rename away_yellows
  - HR (Home red cards) --> rename home_reds
  - AR (Away red cards) --> rename away_reds



Here's the plan of attack
1. Define a function to do the work of removing the columns we don't need, adding the columns we do, and renaming the columns. 
2. Pass each season's CSV through the function to create one dataframe per season.
3. Combine the dataframes into one larger dataframe with all the info in it.
4. Double-check the dataframe.
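
The plan above can be sketched as a small function. This is only a minimal sketch under stated assumptions, not necessarily how the dataframe ends up being built: `convert_season` follows the name used in the plan, while the `KEEP` and `RENAME` constants and the two-row sample CSV are made up for illustration. The shot columns are left out to keep it short.

```python
import io

import pandas as pd

# Assumed subset of columns to keep, and their new names (hypothetical constants).
KEEP = ["Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR", "HY", "AY", "HR", "AR"]
RENAME = {
    "FTHG": "home_goals",
    "FTAG": "away_goals",
    "FTR": "result",
    "HY": "home_yellows",
    "AY": "away_yellows",
    "HR": "home_reds",
    "AR": "away_reds",
}


def convert_season(source, season_name):
    """Read one season's CSV, keep the needed columns, rename them, tag the season."""
    season_df = pd.read_csv(source, usecols=KEEP)
    season_df = season_df.rename(columns=RENAME)
    season_df["season"] = season_name
    return season_df


# Two made-up matches stand in for a real season file.
sample_csv = io.StringIO(
    "Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HY,AY,HR,AR\n"
    "D1,01/01/21,Team A,Team B,2,1,H,1,3,0,0\n"
    "D1,02/01/21,Team C,Team D,0,0,D,2,2,1,0\n"
)
demo = convert_season(sample_csv, "demo season")
print(demo)
```

Combining the seasons would then be a single `pd.concat` over the dataframes the function returns.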

In [31]:
# Steps 1-3 in one cell: rather than a separate convert_season function,
# loop over the season files, keep only the needed columns, and tag each row with its season.
# Columns of the combined dataframe (the CSV columns plus the new "season" column):
cols = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HY','AY', 'HR', 'AR', "season"]
csv_cols = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HY','AY', 'HR', 'AR']
new_col_names = {
  "FTHG": "home_goals",
  "FTAG": "away_goals",
  "FTR": "result",
  "HS": "home_shots",
  "AS": "away_shots",
  "HST": "home_shots_target",
  "AST": "away_shots_target",
  "HY": "home_yellows",
  "AY": "away_yellows",
  "HR": "home_reds",
  "AR": "away_reds"
}
data_source = [
  {"source": "data/D1_10-11.csv", "name":"2010/2011"},
  {"source": "data/D1_11-12.csv", "name":"2011/2012"},
  {"source": "data/D1_12-13.csv", "name":"2012/2013"},
  {"source": "data/D1_13-14.csv", "name":"2013/2014"},
  {"source": "data/D1_14-15.csv", "name":"2014/2015"},
  {"source": "data/D1_15-16.csv", "name":"2015/2016"},
  {"source": "data/D1_16-17.csv", "name":"2016/2017"},
  {"source": "data/D1_17-18.csv", "name":"2017/2018"},
  {"source": "data/D1_18-19.csv", "name":"2018/2019"},
  {"source": "data/D1_19-20.csv", "name":"2019/2020"},
  {"source": "data/D1_20-21.csv", "name":"2020/2021"},
  {"source": "data/D1_21-22.csv", "name":"2021/2022"},
]

#initialize empty dataframe
df = pd.DataFrame(columns=cols)

for season in data_source:
  # append this season's rows to the combined dataframe
  df = pd.concat([df, pd.read_csv(season["source"], usecols=csv_cols)])
  # the new rows have no season yet (NaN), so fill them with this season's name
  df["season"] = df["season"].fillna(season["name"])

print(df.columns)
print(df.shape)
# df = df.rename(columns=new_col_names)

Index(['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS',
       'HST', 'AST', 'HY', 'AY', 'HR', 'AR', 'season'],
      dtype='object')
(3672, 15)


In [32]:
print(df.head(5))
print(df.tail(5))

       Date       HomeTeam        AwayTeam FTHG FTAG FTR  HS  AS HST AST HY  \
0  20/08/10  Bayern Munich       Wolfsburg    2    1   H  17  11   5   5  1   
1  21/08/10        FC Koln  Kaiserslautern    1    3   A  10  17   4   6  1   
2  21/08/10       Freiburg        St Pauli    1    3   A   9  17   3   7  0   
3  21/08/10        Hamburg      Schalke 04    2    1   H  18  13   6   4  2   
4  21/08/10       Hannover   Ein Frankfurt    2    1   H  13  17   7   3  0   

  AY HR AR     season  
0  3  0  0  2010/2011  
1  2  1  0  2010/2011  
2  0  0  0  2010/2011  
3  0  0  1  2010/2011  
4  1  0  0  2010/2011  
           Date      HomeTeam       AwayTeam FTHG FTAG FTR  HS  AS HST AST HY  \
301  14/05/2022         Mainz  Ein Frankfurt    2    2   D  13  13   4   4  2   
302  14/05/2022    M'gladbach     Hoffenheim    5    1   H  19  10   9   2  1   
303  14/05/2022     Stuttgart        FC Koln    2    1   H  24  16  12   3  1   
304  14/05/2022  Union Berlin         Bochum    3    2   

In [33]:
#rename cols
df = df.rename(columns=new_col_names)
print(df.columns)

Index(['Date', 'HomeTeam', 'AwayTeam', 'home_goals', 'away_goals', 'result',
       'home_shots', 'away_shots', 'home_shots_target', 'away_shots_target',
       'home_yellows', 'away_yellows', 'home_reds', 'away_reds', 'season'],
      dtype='object')


In [34]:
# convert the Date strings to datetimes (the source data puts the day first)
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
# convert the count columns to int
col_types = {
  "home_goals": "int64",
  "away_goals": "int64",
  "home_shots": "int64",
  "away_shots": "int64",
  "home_shots_target": "int64",
  "away_shots_target": "int64",
  "home_yellows": "int64",
  "away_yellows": "int64",
  "home_reds": "int64",
  "away_reds": "int64",
}
df = df.astype(col_types)

print(df.dtypes)

Date                 datetime64[ns]
HomeTeam                     object
AwayTeam                     object
home_goals                    int64
away_goals                    int64
result                       object
home_shots                    int64
away_shots                    int64
home_shots_target             int64
away_shots_target             int64
home_yellows                  int64
away_yellows                  int64
home_reds                     int64
away_reds                     int64
season                       object
dtype: object
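
The printed head and tail show two date formats in the raw files: `20/08/10` in the oldest season and `14/05/2022` in the newest. One way to make the parsing explicit is to try each format in turn. This is a hedged sketch, not what the cell above does; `parse_mixed_dates` is a hypothetical helper, and it assumes every date is day-first with either a 2-digit or a 4-digit year.

```python
import pandas as pd


def parse_mixed_dates(dates):
    """Parse day-first dates written with either a 2-digit or a 4-digit year."""
    # Each pass turns strings that do not match its format into NaT.
    two_digit = pd.to_datetime(dates, format="%d/%m/%y", errors="coerce")
    four_digit = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")
    # Prefer the 4-digit parse and fall back to the 2-digit one.
    return four_digit.fillna(two_digit)


raw = pd.Series(["20/08/10", "14/05/2022"])
parsed = parse_mixed_dates(raw)
print(parsed)
```

Any value that matches neither format stays `NaT`, so it would show up in a null check instead of being silently misread.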




We now have a single df with all of the needed data in it. Check for nulls before analysis.

In [35]:
print(df.isna().sum())

Date                 0
HomeTeam             0
AwayTeam             0
home_goals           0
away_goals           0
result               0
home_shots           0
away_shots           0
home_shots_target    0
away_shots_target    0
home_yellows         0
away_yellows         0
home_reds            0
away_reds            0
season               0
dtype: int64


The data is cleaned and formatted and ready to go. Time for analysis.
## What is the goal of the analysis?
The goal of the analysis is to determine what effect red cards have on the outcome of matches by comparing winning % without red cards vs. winning % with red cards.
## Plan of attack: 
1. Home team winning percentage and away team winning percentage grouped by season.
2. Break down those percentages by red cards vs. no red cards.
3. Visualize the findings.
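
Steps 1 and 2 could look roughly like the sketch below. It uses a made-up six-match dataframe (`toy`) with the renamed columns, so the numbers mean nothing; `pd.crosstab` with `normalize="index"` is one option for getting row-wise shares, not necessarily the approach taken later. Only home red cards are split out here, and the `home_red` label column is a hypothetical name.

```python
import pandas as pd

# Made-up matches, only to show the shape of the calculation.
toy = pd.DataFrame({
    "season": ["S1", "S1", "S1", "S1", "S2", "S2"],
    "result": ["H", "D", "A", "H", "A", "H"],
    "home_reds": [0, 1, 0, 0, 1, 0],
    "away_reds": [0, 0, 0, 1, 0, 0],
})

# Step 1: share of home wins (H), draws (D) and away wins (A) per season, in percent.
by_season = pd.crosstab(toy["season"], toy["result"], normalize="index") * 100

# Step 2: the same split, by whether the home team received a red card.
toy["home_red"] = (toy["home_reds"] > 0).map({True: "home red", False: "no home red"})
by_home_red = pd.crosstab(toy["home_red"], toy["result"], normalize="index") * 100

print(by_season.round(1))
print(by_home_red.round(1))
```

Each row of the two tables sums to 100, which makes them easy to plot as stacked bars in step 3.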

### Define winning percentage:
In soccer, draws are very common, so a single winning percentage isn't enough. We use three percentages instead: win rate, draw rate, and loss rate.
> Example: if a team has played 10 matches and won 5, drew 2, and lost 3, then win_rate = 50.0%, draw_rate = 20.0%, loss_rate = 30.0%
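
The example above works out as follows in code. `result_rates` is a hypothetical helper written only to illustrate the definition.

```python
def result_rates(wins, draws, losses):
    """Return (win_rate, draw_rate, loss_rate) as percentages of matches played."""
    played = wins + draws + losses
    return tuple(round(100 * n / played, 1) for n in (wins, draws, losses))


win_rate, draw_rate, loss_rate = result_rates(5, 2, 3)
print(f"win {win_rate}%, draw {draw_rate}%, loss {loss_rate}%")
# prints: win 50.0%, draw 20.0%, loss 30.0%
```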

In [36]:
# calculations for rates
# win_rate = total # of wins / total # of matches played
# in the result column 'H' = home win, "A" = away win, "D" = draw

total_matches = len(df)
print(f"The total number of matches played is {total_matches:,}")

num_home_wins = len(df[df["result"] == "H"])
print(f"The home team won {num_home_wins} times")

num_draws = len(df[df["result"] == "D"])
print(f"There were {num_draws} draws")

num_away_wins = total_matches - num_home_wins - num_draws
print(f"The away team won {num_away_wins} times")

home_win_rate = (num_home_wins / total_matches) * 100
draw_rate = (num_draws / total_matches) *100
away_win_rate = (num_away_wins / total_matches) * 100
print(f"The overall home winning rate is {home_win_rate:.1f}%")
print(f"The overall draw rate is {draw_rate:.1f}%")
print(f"The overall away winning rate is {away_win_rate:.1f}%")

The total number of matches played is 3,672
The home team won 1657 times
There were 889 draws
The away team won 1126 times
The overall home winning rate is 45.1%
The overall draw rate is 24.2%
The overall away winning rate is 30.7%
