<b>Data mining Project - 2021/22</b><br/>
<span>
<b>Authors:</b> Mariagiovanna Rotundo (560765), Nunzio Lopardo (600005) and Renato Eschini (203021)<br/>
<b>Group:</b>3<br/>
<b>Release date:</b> 26/12/2021
</span>

# Data understanding

In this notebook, we try to understand the meaning of the data and the domains of the attributes in the dataset. We also check whether the data contain any of the kinds of errors covered in the theoretical part of the course.

**Importing libraries**

In [None]:
import math
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

import collections
from scipy.stats import pearsonr
import pandas as pd
import os
from datetime import date
import datetime

import seaborn as sns
import re

In [None]:
from ipynb.fs.full.functions_understanding import *

**Loading the datasets**

In [None]:
# load of the data
DATASET_DIR = "dataset" + os.path.sep
df_tennis = pd.read_csv(DATASET_DIR + 'tennis_matches.csv', sep=',', index_col=0) 

#index_col=False tells pandas not to use the first column as the index
df_male = pd.read_csv(DATASET_DIR + 'male_players.csv', sep=',', index_col=False)
df_female = pd.read_csv(DATASET_DIR + 'female_players.csv', sep=',', index_col=False) 

**Print some records of the datasets**

A first look at the structure and content of the three datasets.

In [None]:
df_tennis.head()

In [None]:
df_male.head()

In [None]:
df_female.head()

**Table obtained from the following analysis about columns of tennis dataset**

A table that summarizes the attribute types in the tennis dataset. Each winner attribute has a loser counterpart that is not reported in the table.

|  Categorical  |   Ordinal   |      Numerical     | Ratio-Scaled |
|:-------------:|:-----------:|:------------------:|:------------:|
|   tourney_id  |  match_num  |      draw_size     |   winner_ht  |
|  tourney_name | winner_rank |       minutes      |  winner_age  |
|    surface    |             |     winner_ace     |              |
| tourney_level |             |      winner_df     |              |
|   winner_id   |             |     winner_svpt    |              |
|   winner_ioc  |             |    winner_1stln    |              |
|  winner_hand  |             |    winner_1stwon   |              |
|  winner_entry |             |    winner_2stwon   |              |
|    best_of    |             |       w_svgms      |              |
|               |             | winner_rank_points |              |
|               |             |      w_bdsaved     |              |
|               |             |      w_bdfaced     |              |

## 1. Missing values: Null

A first check on the datasets concerns missing values (those not marked with default values such as "unknown" or similar). The check is done on all three datasets, attribute by attribute, also looking at the number of missing values.

**male dataset**

In [None]:
#info about data that we have for male
df_male.info()

In [None]:
#number of null in the columns
df_male.isnull().sum(axis = 0)

In [None]:
df_male.isnull().sum(axis = 0).plot(kind='bar', ylabel="number of nulls")

The male dataframe is composed of 2 columns: name and surname. It has 55208 entries; the name column contains 177 nulls and the surname column contains 43 nulls.

**female dataset**

In [None]:
#info about data that we have for female
df_female.info()

In [None]:
#number of null in the columns
df_female.isnull().sum(axis = 0)

In [None]:
df_female.isnull().sum(axis = 0).plot(kind='bar', ylabel="number of nulls")

The female dataframe is also composed of 2 columns: name and surname. It has 46172 entries; there are 1667 null values in the name column and none in the surname column.

**tennis dataset**

In [None]:
df_tennis.info()

In [None]:
#we see if the attributes have some null values
df_tennis.isnull().any()

In [None]:
#since all the attributes have missing values, we count them
df_tennis.isnull().sum(axis = 0)

The tennis dataframe is composed of 49 columns and 186128 rows.
Some attributes have very few null values (such as 26 or 27), while others have more than 50% null values (such as 103818 or 160301 nulls out of 186128).
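The share of nulls per column makes the "more than 50%" observation explicit. The following is a minimal sketch on a toy frame; the values in `toy` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for df_tennis (hypothetical values).
toy = pd.DataFrame({
    "minutes": [90.0, None, None, None],
    "surface": ["Hard", "Clay", None, "Grass"],
    "best_of": [3, 3, 5, 3],
})

# isnull() yields booleans, so the column mean is the fraction of nulls.
null_fraction = toy.isnull().mean()

# Columns where more than half of the values are missing.
mostly_null = null_fraction[null_fraction > 0.5].index.tolist()
print(mostly_null)
```

Applied to `df_tennis`, the same two lines would list the attributes that are mostly empty, which are candidates for removal rather than imputation.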

In [None]:
df_tennis.isnull().sum(axis = 0).hist(bins="sturges", grid=False)

## 2. Duplicate data

The second check concerns duplicate data, starting with duplicate rows. We also check whether the same person (same name and surname) appears in both the male and female dataframes.

**male and female dataset**

In [None]:
#we see if there are duplicates in the dataset male and female
df_male.duplicated(keep='first').sum()

In [None]:
df_female.duplicated(keep='first').sum()

Both the male and female datasets have rows with the same name and surname. These rows can be duplicates (they correspond to the same person) or homonyms. In the latter case we cannot distinguish the matches of one player from those of the other. In the next steps, when looking at the data of the tennis dataframe, we treat them as duplicates and not as homonyms.

In [None]:
#we remove (only) the duplicated rows
df_male_no_dup = df_male.drop_duplicates()
df_female_no_dup = df_female.drop_duplicates()

**Looking for people in both datasets**

In [None]:
#see if a name can be both male and female and manage them
df_players = pd.concat([df_male_no_dup, df_female_no_dup])
df_players[df_players.duplicated(keep='first')==True]

In [None]:
df_players.duplicated(keep='first').sum()

There are 74 rows that appear in both the male and the female dataset.

In [None]:
df_players[df_players.duplicated(subset=['name', 'surname'], keep='first')==True]

**tennis dataset**

In [None]:
#check if there are duplicated rows in the dataset
df_tennis.duplicated(keep='first').sum()

In the tennis dataframe there are 309 duplicate rows.

So there are duplicates in all three dataframes.

## 3. Errors in male and female

Names and surnames are analyzed to find characters that cannot appear in them (such as "?" or digits) and missing values (including ones marked by default values).<br>
Only letters and the symbols "'", ".", "-" are considered valid, to allow names like "O'Connors", "Jr." or others with the same pattern.<br>
We also check that the pandas library splits the values into the correct columns, instead of reading the values of 2 columns as a single value of one column.
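The character check can also be expressed as a single full-string match. This is a minimal sketch using only the standard library; the helper name `is_valid_name` is hypothetical, and the allowed set is the one used for surnames (letters, space, apostrophe, dot, hyphen).

```python
import re

# Letters, spaces, apostrophes, dots and hyphens only.
VALID_NAME = re.compile(r"[a-zA-Z '.-]+")

def is_valid_name(value):
    """Return True when value is a non-empty string made only of allowed characters."""
    return isinstance(value, str) and VALID_NAME.fullmatch(value) is not None

print(is_valid_name("O'Connors"), is_valid_name("J?hn"))
```

Like the count-based comparison used in the cells of this section, this treats nulls as invalid, because a null is not a string.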

**male dataset**

In [None]:
#see if all the names and surnames are valid
df_male[df_male['surname'].str.count("[a-zA-Z '.-]")!=df_male['surname'].str.len()]

In [None]:
df_male[df_male['name'].str.count("[a-zA-Z ',.-]")!=df_male['name'].str.len()]

In [None]:
df_check_name = df_male[~df_male['name'].isna()]
df_check_name[df_check_name['name'].str.contains(',')]

In [None]:
df_check_surname = df_male[~df_male['surname'].isna()]
df_check_surname[df_check_surname['surname'].str.contains(',')]

In [None]:
#unknown names
print("Unknown name: ", df_male[df_male['name'].str.lower()=='unknown'].shape[0])
#unknown surnames
print("Unknown surname: ", df_male[df_male['surname'].str.lower()=='unknown'].shape[0])

In the male dataset there are 99 invalid values in the surname column and 179 in the name column (nulls are also counted). Furthermore, there are 50 missing values marked as "unknown" in the name column and 6 in the surname column. There is also one row (row 40071) that was split incorrectly by pandas: name and surname are both in the name column and surname is NaN.

**female**

In [None]:
df_female[df_female['surname'].str.count("[a-zA-Z '.-]")!=df_female['surname'].str.len()]

In [None]:
df_female[df_female['name'].str.count("[a-zA-Z ',.-]")!=df_female['name'].str.len()]

In [None]:
df_check_name = df_female[~df_female['name'].isna()]
df_check_name[df_check_name['name'].str.contains(',')]

In [None]:
df_check_surname = df_female[~df_female['surname'].isna()]
df_check_surname[df_check_surname['surname'].str.contains(',')]

In [None]:
#unknown names
print("Unknown name: ", df_female[df_female['name'].str.lower()=='unknown'].shape[0])
#unknown surnames
print("Unknown surname: ", df_female[df_female['surname'].str.lower()=='unknown'].shape[0])

In the female dataset there are 2 invalid values in the surname column and 1667 in the name column (nulls are also counted). No missing values are marked as "unknown" and no rows were split incorrectly by pandas.

## 4. Analysis of tennis dataset

**Column names of the tennis dataset**

In [None]:
df_tennis.columns

**Column types**

In [None]:
df_tennis.dtypes.value_counts()

In [None]:
#Values in the columns with type object
for column in df_tennis.select_dtypes(include=['object']).columns:
    print("Distinct Values in "+str(column)+": \n", df_tennis[column].unique(), "\n")

#### tourney_id

For this attribute we check that the first 4 characters are a valid year for the tourney (not in the future and not before 1874, the year in which tennis was invented), as specified in the documentation. We also check that each tourney id is associated with more than one match, and we count the number of distinct tourney ids.
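The year check on a single id can be sketched as follows. The helper name `has_valid_year` and the optional `today` parameter are hypothetical and exist only to make the example reproducible; the lower bound 1874 is the one stated above.

```python
from datetime import date

FIRST_VALID_YEAR = 1874  # lower bound used in this notebook

def has_valid_year(tourney_id, today=None):
    """Check that the first 4 characters of tourney_id are a plausible year."""
    today = today or date.today()
    prefix = str(tourney_id)[:4]
    # Require exactly four ASCII digits before converting to int.
    if len(prefix) < 4 or not (prefix.isascii() and prefix.isdigit()):
        return False
    return FIRST_VALID_YEAR <= int(prefix) <= today.year

print(has_valid_year("2019-M021", today=date(2021, 12, 26)))
```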

In [None]:
#check that, for non-null values, the first 4 characters are the year
#count rows whose first 4 characters are not digits
df_tennis[df_tennis['tourney_id'].str[:4].str.isnumeric()==False].shape[0] 

In [None]:
#check if some years are in the future (and so are invalid) (nulls are ignored)
df_tennis[pd.to_numeric(df_tennis['tourney_id'].str[:4]).fillna(0).astype('int') > date.today().year].shape[0] 

In [None]:
#check if there are invalid years because they are too far in the past (nulls are ignored)
df_tennis[pd.to_numeric(df_tennis['tourney_id'].str[:4]).fillna(date.today().year).astype('int') < 1874 ].shape[0] 

The first 4 characters of tourney_id are always a valid year when tourney_id is not null.

In [None]:
#check that each tourney has more than one match (every id appears more than once)
df_tennis[df_tennis['tourney_id'].duplicated(keep=False)==False].shape[0] 

Every tourney_id appears more than once.

In [None]:
#check how many distinct tourney are present
df_tennis["tourney_id"].value_counts().count()

There are 4853 distinct tourneys in the dataset.

#### tourney_name

In [None]:
#check that the same tourney_id always has the same name
len(df_tennis.groupby(['tourney_id','tourney_name']).size())-len(df_tennis.groupby(['tourney_id']).size())
#df_tennis[df_tennis['tourney_id']=='2019-M021'] #this is an example of a tourney id with more than one name

In [None]:
len(df_tennis.groupby(['tourney_name','tourney_id']).size())-len(df_tennis.groupby(['tourney_name']).size())

A given tourney_id may have more than one tourney_name, so these names should be handled, for example by correcting the rows that contain errors. Likewise, a tourney_name can have more than one tourney_id.

In [None]:
df_tennis["tourney_name"].value_counts().count() #NaN values are not counted

In [None]:
dict_tourney_id = df_tennis.groupby('tourney_id')['tourney_name'].unique().apply(list).to_dict()
for key, value in dict_tourney_id.items():
    if len(value)>1:
        print(key, value)

Note that the names associated with the same tourney id are very similar, for example 'US Open' and 'Us Open', or 'Rome Masters' and 'Rome'.
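Part of these variants differ only in letter case or spacing, and a simple normalization removes them. The sketch below uses hypothetical helpers `normalize_name` and `distinct_names`; it is one possible cleaning step, not the one adopted later in the project.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace so that case-only variants compare equal."""
    return " ".join(name.casefold().split())

def distinct_names(names):
    """Return the set of names that remain different after normalization."""
    return {normalize_name(n) for n in names}

print(distinct_names(["US Open", "Us Open"]))
print(distinct_names(["Rome Masters", "Rome"]))
```

Case-only variants collapse to one name, while pairs such as 'Rome Masters' and 'Rome' remain distinct and would need a different rule, for example keeping the most frequent name per tourney id.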

#### surface

In [None]:
df_tennis.groupby('surface')['tourney_id'].nunique()

The domain is composed of 4 different values, and some of them appear more frequently than others.

#### tourney_level

The levels for male and female tourneys are specified and analyzed. The domain is also analyzed to see whether it contains codes beyond the ones in the documentation.

In [None]:
#codes from documentation
levels_man = ['G', 'M', 'A', 'C', 'S', 'F', 'D']
levels_woman = levels_man + ['P', 'PM', 'I', 'T1']
levels_woman_man = ['E','J','T']
all_levels = levels_man + levels_woman + levels_woman_man

In [None]:
codes = df_tennis[(~df_tennis['tourney_level'].isin(all_levels)) & (~df_tennis['tourney_level'].isna())]['tourney_level'].unique()
codes

There are 2 more codes than the ones explicitly indicated by the documentation: 'O', the code for the Olympic Games (male and female), and 'W', a code associated with women's tourneys. The numeric codes are associated with both male and female tourneys.

In [None]:
#list updated with new values
levels_woman = levels_woman + ['W']
male_female_codes = codes.tolist()
male_female_codes.remove('W')
levels_woman_man = levels_woman_man + male_female_codes
all_levels = levels_man + levels_woman + levels_woman_man

In [None]:
#check what are other codes that can appear (for women)
df_other_levels = df_tennis[~df_tennis['tourney_level'].isin(all_levels)]
#get codes about the prize money
df_other_levels[df_other_levels['tourney_level'].str.isnumeric()==True]['tourney_level'].unique()

In [None]:
#get the other codes not cited in the document and that are not prize
df_other_levels[df_other_levels['tourney_level'].str.isnumeric()==False]['tourney_level'].unique()

In [None]:
#check if there are at least one row for each cited code
list(set(all_levels) - set(df_tennis['tourney_level'].unique()))

For these codes there are no rows in the dataset

In [None]:
#get the occurrences of each level
df_tennis["tourney_level"].value_counts()

Check the consistency between the sex of each player and the level code of the tourney in which they play.

In [None]:
# check sex by names
df_male['combined'] = df_male['name'].astype(str) + ' ' + df_male['surname'].astype(str)
df_female['combined'] = df_female['name'].astype(str) + ' ' + df_female['surname'].astype(str)

# we transform into dictionaries to optimize search performance, putting name and surname as keys
dict_male = df_male['combined'].to_dict();
dict_male_rev = {value:key for key, value in dict_male.items()}

dict_female = df_female['combined'].to_dict();
dict_female_rev = {value:key for key, value in dict_female.items()}

df_tennis_level_tmp = df_tennis.copy() # avoid overwrite original dataset
# apply CheckSex as lambda function to all rows, add new columns with sex
df_tennis_level_tmp['w_sex'] = df_tennis_level_tmp['winner_name'].apply(lambda x: CheckSex(x, dict_male_rev, dict_female_rev))
df_tennis_level_tmp['l_sex'] = df_tennis_level_tmp['loser_name'].apply(lambda x: CheckSex(x, dict_male_rev, dict_female_rev))

no_error = True
for row in df_tennis_level_tmp.itertuples():
    level = row.tourney_level
    # if there is a nan in level, skip to next...
    if str(level)=='nan':
        continue
    w_sex = row.w_sex
    l_sex = row.l_sex 
    # check sex...
    if w_sex == 'm' or l_sex == 'm':
        # search for the level in the respective set for man and woman/man
        if level not in levels_man and level not in levels_woman_man:  
            no_error = False
            print('level error: w_sex:' + str(w_sex) + ' - l_sex:' + str(l_sex) + ' - tourney_id:' + str(row.tourney_id) + ' - ' + str(row.winner_name) + ' vs ' + str(row.loser_name) + ' - level:' + str(level))
    elif w_sex == 'f' or l_sex == 'f':
        # search for the level in the respective set for woman and woman/man
        if level not in levels_woman and level not in  levels_woman_man:
            no_error = False
            print('level error: w_sex:' + str(w_sex) + ' - l_sex:' + str(l_sex) + ' - tourney_id:' + str(row.tourney_id) + ' - ' + str(row.winner_name) + ' vs ' + str(row.loser_name) + ' - level:' + str(level))
            
if no_error:
    print("All levels are correct")

In [None]:
df_tennis["tourney_level"].value_counts().plot(kind='bar', ylabel="number of matches")
df_tennis["tourney_level"].unique()

The plot shows that the values of tourney_level are unbalanced.

#### winner_name and loser_name

Check whether there are invalid characters in the names of winners and losers, as done for the male and female datasets. An external source is also used to compare names.

In [None]:
#check that names are valid
df_tennis[df_tennis['winner_name'].str.count("[a-zA-Z ',.-]")!=df_tennis['winner_name'].str.len()]['winner_name']

In [None]:
df_tennis[df_tennis['loser_name'].str.count("[a-zA-Z ',.-]")!=df_tennis['loser_name'].str.len()]['loser_name']

Run the next three cells to prepare the checks on male and female names.

In [None]:
df_names = pd.read_csv(DATASET_DIR + 'names.csv', sep=',', index_col=False)
df_names.head()

In [None]:
male_names = df_names[df_names['Gender'] == 'MALE']['Name']
male_names = dict.fromkeys(male_names, None)

In [None]:
female_names = df_names[df_names['Gender'] == 'FEMALE']['Name']
female_names = dict.fromkeys(female_names, None)

*male name*

In [None]:
invalid_names = []
for name in df_male['name'].dropna().tolist():
    if (name in female_names):
        invalid_names.append(name)
print(dict.fromkeys(invalid_names, None).keys())

*female names*

In [None]:
invalid_names = []
for name in df_female['name'].dropna().tolist():
    if (name in male_names):
        invalid_names.append(name)
print(dict.fromkeys(invalid_names, None).keys())

There are invalid characters in the names of some winners and some losers

#### winner_id and loser_id

Check that each id is associated with only one player and that each player has only one id.
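The two directions of this check can be sketched with plain dictionaries. The helpers used in the cells of this section (`get_w_id_names`, `get_w_name_ids` and their loser counterparts) come from `functions_understanding` and are not shown here, so the function `one_to_one_violations` and the sample pairs below are hypothetical.

```python
from collections import defaultdict

def one_to_one_violations(pairs):
    """Given (player_id, name) pairs, return ids with several names and names with several ids."""
    names_by_id = defaultdict(set)
    ids_by_name = defaultdict(set)
    for player_id, name in pairs:
        names_by_id[player_id].add(name)
        ids_by_name[name].add(player_id)
    ids_with_many_names = {i: n for i, n in names_by_id.items() if len(n) > 1}
    names_with_many_ids = {n: i for n, i in ids_by_name.items() if len(i) > 1}
    return ids_with_many_names, names_with_many_ids

# Made-up pairs: id 1 has two spellings, and one name is shared by ids 2 and 3.
sample = [(1, "A B"), (1, "A B."), (2, "C D"), (3, "C D"), (4, "E F")]
many_names, many_ids = one_to_one_violations(sample)
print(many_names, many_ids)
```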

In [None]:
df_tennis['winner_id'].value_counts()

In [None]:
df_tennis['loser_id'].value_counts()

In [None]:
find_match_sameWL(df_tennis)

There are five records in which winner and loser are the same player so they have the same ids (and same information about the player). 

In [None]:
df_mul_names = df_tennis[df_tennis['winner_id'].isin(get_w_id_names(df_tennis))][['winner_name','winner_id']].sort_values(by=['winner_name','winner_id'])
df_mul_names = df_mul_names.value_counts().reset_index()
df_mul_names.columns = ['winner_name', 'id', 'count']
df_mul_names.sort_values(by='id')

In [None]:
df_mul_names = df_tennis[df_tennis['loser_id'].isin(get_l_id_names(df_tennis))][['loser_name','loser_id']].sort_values(by=['loser_name','loser_id'])
df_mul_names = df_mul_names.value_counts().reset_index()
df_mul_names.columns = ['loser_name', 'id', 'count']
df_mul_names.sort_values(by='id')

In [None]:
df_tennis[df_tennis['winner_name'].isin(get_w_name_ids(df_tennis))][['winner_name','winner_id','winner_ioc','winner_hand']].drop_duplicates().sort_values(by='winner_name')

In [None]:
df_tennis[df_tennis['loser_name'].isin(get_l_name_ids(df_tennis))][['loser_name','loser_id','loser_ioc','loser_hand']].drop_duplicates().sort_values(by='loser_name')

There are tennis players with more than one id, both among winners (9) and losers (19), and there are ids associated with more than one player.

#### winner_hand and loser_hand

Check that each player has only one of L and R, and that values other than "U", "L" and "R" do not appear.
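The per-player consistency check amounts to collecting the set of hands seen for each player and flagging those that contain both L and R. A minimal sketch, with a hypothetical helper `conflicting_hands` and made-up records:

```python
from collections import defaultdict

def conflicting_hands(records):
    """Return the players for which both 'L' and 'R' are recorded.

    records: iterable of (player_id, hand); hand may be 'L', 'R', 'U' or missing.
    """
    hands = defaultdict(set)
    for player_id, hand in records:
        if isinstance(hand, str):  # skip nulls
            hands[player_id].add(hand.upper())
    return sorted(p for p, h in hands.items() if {"L", "R"} <= h)

# Made-up records: only player 2 has both hands recorded.
sample = [(1, "R"), (1, "U"), (2, "L"), (2, "R"), (3, None), (3, "l")]
print(conflicting_hands(sample))
```

Unknown ('U') and missing values are not treated as conflicts, which matches the rule applied in the cells of this section.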

In [None]:
#check that no invalid hand values are present (nulls are treated as 'U')
hand = ['R','L','U']
df_tennis[~df_tennis['winner_hand'].fillna('U').str.upper().isin(hand)].shape[0]

In [None]:
df_tennis[~df_tennis['loser_hand'].fillna('U').str.upper().isin(hand)].shape[0]

In [None]:
winner_hand_dict = df_tennis.groupby(['winner_id','winner_name'])['winner_hand'].unique().apply(list).to_dict()
loser_hand_dict = df_tennis.groupby(['loser_id','loser_name'])['loser_hand'].unique().apply(list).to_dict()

In [None]:
for key, value in winner_hand_dict.items():
    if key in loser_hand_dict.keys() and value[0] not in loser_hand_dict[key]:
        loser_hand_dict[key].append(value[0])

In [None]:
loser_hand_dict

In [None]:
for key, value in loser_hand_dict.items():
    if len(loser_hand_dict[key])>1 and ('U' not in loser_hand_dict[key] and np.nan not in loser_hand_dict[key]):
        print(key)
        print(loser_hand_dict[key])

There are no invalid entries for the hand of winner or loser. Furthermore, no player has different hands (both L and R).

In [None]:
index = df_tennis[~df_tennis['winner_hand'].isna()]['winner_hand'].unique()
pd.DataFrame({'winner': df_tennis['winner_hand'].value_counts(), 'loser': df_tennis['loser_hand'].value_counts()}, index=index).plot.bar(color=["#66ff66","#6666ff"])

#### winner_ioc and loser_ioc, International Olympic Code validity check

An external source is used to check if all the IOC codes in the dataset are valid

In [None]:
df_countrycode = pd.read_csv(DATASET_DIR + 'country-codes_csv.csv', sep=',', index_col=False) 

##### Wrong codes winner_ioc

In [None]:
w_check_cc = pd.Series(~df_tennis.winner_ioc.isin(df_countrycode.IOC).values, df_tennis.winner_ioc.values)
w_check = w_check_cc[w_check_cc].index
w_check.value_counts()

##### Wrong codes loser_ioc

In [None]:
l_check_cc = pd.Series(~df_tennis.loser_ioc.isin(df_countrycode.IOC).values, df_tennis.loser_ioc.values)
l_check = l_check_cc[l_check_cc].index
l_check.value_counts()

These codes are not IOC codes. We can check whether the incorrect codes are actually ISO codes used by mistake.

In [None]:
#for each invalid winner code, check whether it is a valid ISO 3166-1 alpha-3 code
for c in w_check.unique():
    exist = (df_countrycode["ISO3166-1-Alpha-3"] == c).any()
    print(str(c) + " " + str(exist))

In [None]:
#for each invalid loser code, check whether it is a valid ISO 3166-1 alpha-3 code
for c in l_check.unique():
    exist = (df_countrycode["ISO3166-1-Alpha-3"] == c).any()
    print(str(c) + " " + str(exist))

#### round

Looking at the meaning of the codes that appear, all the non-null values in this column are correct.

In [None]:
df_tennis.groupby('round')['tourney_id'].nunique()

#### best_of

We check that the values in the best_of column are 3, 5 or null. Then, since this column is related to the score column, the two are analyzed together to see whether some best_of values are wrong given the score, or vice versa.

In [None]:
#check if there are values different from 3 or 5
df_tennis['best_of'].value_counts(dropna = False)

There are no values other than 3 and 5, apart from some null values.

In [None]:
df_tennis['best_of'].value_counts().plot(kind='bar')

The values of this column are imbalanced: there are many rows with best_of equal to 3 and few rows with best_of equal to 5.

#### score

We check that scores are valid according to the rules of tennis (summarized below): https://www.wikihow.it/Tenere-il-Punteggio-a-Tennis

If the **match** is best of 3, a player must win 2 sets to win the match. If it is best of 5, the player must win 3 sets.<br>
Every set is composed of **games**. The set is won by the player who wins 6 games with an advantage of at least 2 games (for example 6-4, 6-3, ..., but not 6-5).<br>
At 6-5, the leading player wins the set by winning the following game (7-5).<br>
At 6-6 the **Tie-Break** is played. The tie-break is won by the first player to reach 7 points with an advantage of at least 2 (so, for example, 7-5, 7-4, ...). If both players reach 6 points, the winner is the first to gain a 2-point advantage over the opponent (for example 8-6, 9-7, 10-8, ...).

Best of 3 means that there are at most 3 sets in a match, while best of 5 means that there can be at most 5 sets.
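These rules can be written as small predicates. This is a minimal sketch and not the `validity_match` helper imported from `functions_understanding`, whose implementation is not shown here. The names `is_valid_set`, `is_valid_tiebreak` and `sets_needed` are hypothetical, and the sketch assumes that every set uses a tie-break at 6-6, so long advantage sets would be rejected.

```python
def is_valid_set(winner_games, loser_games):
    """True when the games tally is a legal completed set (tie-break format)."""
    w, l = winner_games, loser_games
    if w == 6:
        return 0 <= l <= 4      # 6-0 ... 6-4
    if w == 7:
        return l in (5, 6)      # 7-5, or 7-6 after a tie-break
    return False

def is_valid_tiebreak(winner_points, loser_points):
    """True when the points tally is a legal completed tie-break (first to 7, win by 2)."""
    w, l = winner_points, loser_points
    if l < 0:
        return False
    if w == 7:
        return l <= 5           # 7-0 ... 7-5
    return w > 7 and w - l == 2  # 8-6, 9-7, 10-8, ...

def sets_needed(best_of):
    """Sets a player must win: 2 for best of 3, 3 for best of 5."""
    return best_of // 2 + 1

print(is_valid_set(6, 5), is_valid_set(7, 5), is_valid_tiebreak(8, 6), sets_needed(5))
```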

In [None]:
#check that all the scores of the match are valid. (we do not consider nulls)
df_tennis_score = df_tennis[~df_tennis['score'].isna()]

**Walkover** ("WO" or "w/o") - Unopposed victory. A walkover is awarded when the opponent fails to start the match for any reason, such as injury.<br>
**Retirement** ("ret") - Player's withdrawal during a match, causing the player to forfeit the tournament. Usually this happens due to injury.<br>
**Default** ("def") - Disqualification of a player in a match by the chair umpire after the player has received four code violation warnings, generally for their conduct on court. A default can occur with fewer than four code violation warnings if the code violation is judged severe enough to warrant it. A double default occurs when both players are disqualified. Defaults also occur when a player misses a match with no valid excuse. Defaults are considered losses.<br>
**Bye** ("bye") - Automatic advancement of a player to the next round of a tournament without facing an opponent. Byes are often awarded in the first round to the top-seeded players in a tournament.<br>

These are the reasons why a match may not be played or may be interrupted.
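Recognizing these markers in a score string can be sketched as a lookup on the normalized token. The notebook relies on the helpers `Walkover`, `Retirement`, `Default` and `Bye` from `functions_understanding`, which are not shown here; the function `classify_token` and the exact spellings in `SPECIAL_TOKENS` are assumptions made for illustration.

```python
# Assumed spellings of the special markers (case-insensitive).
SPECIAL_TOKENS = {
    "wo": "walkover",
    "w/o": "walkover",
    "ret": "retirement",
    "def": "default",
    "bye": "bye",
}

def classify_token(token):
    """Map a score token to a special outcome, or None for an ordinary set score."""
    return SPECIAL_TOKENS.get(token.strip().strip(".").casefold())

for token in ["W/O", "RET", "Def.", "6-4"]:
    print(token, classify_token(token))
```

A score made of a single token that maps to "retirement" corresponds to the case counted as a wrongly recorded walkover in the cells of this section.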

In [None]:
#error: fewer than 2 sets without a valid reason
count_less_2 = 0
#error: too many sets in a match
too_many = 0

#number of walkover
walkover = 0
#errors using RET instead of WO
wrong_walkover = 0
#number of defaults
default = 0
#number of byes
bye = 0

for match in df_tennis[~df_tennis['score'].isna()]['score']:
    sets = match.split( )
    if len(sets)==1 and Walkover(sets[0]):
        walkover+=1
        continue
    if len(sets)==1 and Retirement(sets[0]):
        wrong_walkover+=1
        continue
    if len(sets)==1 and Default(sets[0]):
        default+=1
        continue
    if len(sets)==1 and Bye(sets[0]):
        bye+=1
        continue
    if len(sets)<2:
        count_less_2+=1
        continue
    #maximum number of tokens for best of 5 is 6: 5 sets + RET or DEF
    if len(sets)>6:
        too_many+=1
        continue

In [None]:
print('walkover:', walkover)
print('wrong_walkover:', wrong_walkover)
print('default:', default)
print('bye:', bye)
print('errors: less than 2 sets:', count_less_2)
print('errors: too many sets:', too_many)

In [None]:
index = ["Walkover", "Default", "Bye", "Ret as walkover", "Errors"]
values = [walkover, default, bye, wrong_walkover, count_less_2+too_many]
plt.bar(index,values, width=0.5, color="#66ff66")

Some matches are not played or are interrupted because of a walkover, default, bye or retirement. Sometimes retirement is used incorrectly at the start of the match: a retirement before the match starts is a walkover. This happens 8 times. There are also invalid matches, with too few sets and no interruption of the match.

In [None]:
#best of 5: 3, 4 or 5 sets; best of 3: 2 or 3 sets (with points)
best_5 = 0
best_3 = 0

valid_change_best_of = 0
invalid_matches = 0


#check of the best of 3
for match in df_tennis_score[df_tennis_score['best_of']==3]['score']:
    sets = match.split( )
    #maximum number of tokens for best of 3 is 4: 3 sets + RET or DEF
    if len(sets)>4 or (len(sets)==4 and not Retirement(sets[3]) and not Default(sets[3])):
        best_5+=1
        #print(sets)
        if validity_match(sets, 5) == True:
            valid_change_best_of +=1
        else:
            invalid_matches+=1
    elif len(sets)>=2:
        if validity_match(sets, 3) == False:
            invalid_matches+=1
        
        
#check of the best of 5
for match in df_tennis_score[df_tennis_score['best_of']==5]['score']:
    sets = match.split( )
    if len(sets)==2 and not Retirement(sets[1]) and not Default(sets[1]):
        best_3+=1
        if validity_match(sets, 3) == True:
            valid_change_best_of +=1
        else:
            invalid_matches+=1
    elif len(sets)>=2:
         if validity_match(sets, 5) == False:
            invalid_matches+=1

    

In [None]:
print('errors: best of 5 classified as best of 3:', best_5)
print('errors: best of 3 classified as best of 5:', best_3)
print('valid change of best of', valid_change_best_of)
print('Invalid matches', invalid_matches)

In [None]:
fig = plt.figure(figsize=(20, 5)) 
fig_dims = (1, 3)

plt.subplot2grid(fig_dims, (0, 0))
plt.bar(["best 5 as 3", "best 3 as 5"], [best_5, best_3], width=0.5, color="#66ff66")

plt.subplot2grid(fig_dims, (0, 1))
plt.bar(["Total changes best_of", "Valid changes best_of"], [best_5+best_3, valid_change_best_of], width=0.5, color="#6666ff")

plt.subplot2grid(fig_dims, (0, 2))
plt.bar(["invalid for wrong best_of", "invalid matches"], [valid_change_best_of, invalid_matches], width=0.5, color="#ff6666")

Some games of some matches have invalid scores. Some matches classified as best of 3 have more than 3 sets and can be reclassified as best of 5; the same holds, in reverse, for best of 5. There are 5122 matches that are not valid because the results of their sets are impossible under the rules of tennis.

The first bar chart shows that most of the classification errors are best of 5 matches classified as best of 3. The second chart shows that this classification can be corrected for most of these matches.

The third chart shows that the errors due to a wrong classification are a very small part of the total errors. So most of the errors in the scores are not due to a wrong classification, but to invalid recorded results.

In [None]:
count_5=0
for index, row in  df_tennis_score[df_tennis_score['best_of']==3].iterrows():
    sets = row['score'].split( )
    if validity_match(sets,5) and is_best_of_5(sets):
        count_5+=1

In [None]:
count_3=0
for index, row in  df_tennis_score[df_tennis_score['best_of']==5].iterrows():
    sets = row['score'].split( )
    if validity_match(sets,3) and is_best_of_3(sets):
        count_3+=1

In [None]:
print('best of 5 classified as best of 3 looking at scores:', count_5)
print('best of 3 classified as best of 5 looking at scores:', count_3)

Looking at the scores, some best of 3 matches can be classified as best of 5 (1) and vice versa (97).

#### match_num

Check if the same match_num appears more than once in the same tourney and in this case how many times it appears. 
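The idea behind this check is to count how often each (tourney_id, match_num) pair occurs. A minimal sketch with the standard library; the helper `repeated_match_nums` and the sample rows are made up for illustration.

```python
from collections import Counter

def repeated_match_nums(rows, min_count=2):
    """rows: iterable of (tourney_id, match_num). Return the pairs seen at least min_count times."""
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n >= min_count}

# Made-up rows: match 100 of tourney "T1" appears three times.
sample = [("T1", 100), ("T1", 100), ("T1", 100), ("T1", 101), ("T2", 100)]
print(repeated_match_nums(sample))
print(repeated_match_nums(sample, min_count=3))
```

The `min_count` threshold plays the same role as the comparison on `size` in the pandas version: raising it restricts the output to the most repeated match numbers.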

In [None]:
df_tennis_matchgroups = df_tennis.drop_duplicates().groupby(['tourney_id', 'match_num']).size().reset_index(name='size')
print(df_tennis_matchgroups[df_tennis_matchgroups['size']>2])
#number of (tourney_id, match_num) pairs that appear more than twice
df_tennis_matchgroups[df_tennis_matchgroups['size']>2].shape[0]

We can see that the same match_num can appear several times in the same tourney, in some cases 3 or 4 times. Below is an example where a match_num appears 3 times in the same tourney.

In [None]:
df_tennis[(df_tennis['tourney_id']=='2016-520') & (df_tennis['match_num']==100)][['tourney_id', 'tourney_name', 'match_num', 'winner_name', 'loser_name','tourney_level']]

#### draw_size

In [None]:
#count the rows with an invalid number (less than 2, including negative values)
df_tennis[df_tennis['draw_size'] < 2].shape[0]

All draw sizes are valid because every tourney has more than 1 player (only non-null values are considered).

#### tourney_date

In [None]:
#the dates are stored as floats, so they need to be converted to datetime objects
df_tennis['tourney_date'].isnull().sum()

In [None]:
#check whether there are dates later than today
df_tennis['tourney_date'] = pd.to_datetime(df_tennis['tourney_date'], format='%Y%m%d')
invalid_data = 0
today = pd.to_datetime(datetime.date.today())
#the loop variable is not called `date`, to avoid shadowing `datetime.date` imported above
for tourney_date in df_tennis['tourney_date']:
    if tourney_date > today:
        invalid_data +=1
print(invalid_data)

There are no invalid dates associated with the tourneys.

#### winner_ht and loser_ht

Check that all the heights of players (winner and loser) are valid (for example a height of 2 cm is not valid).

In [None]:
df_tennis['winner_ht'].max()

In [None]:
df_tennis['winner_ht'].min()

There are invalid heights, such as the one of 2 cm.

In [None]:
# print all possible winner heights in ascending order
df_tennis['winner_ht'].value_counts().sort_index()

In [None]:
df_tennis['winner_ht'].value_counts().sort_index().plot.bar(
    figsize=(10, 4), 
    title="winner_ht distribution",
    xlabel="winner_ht", 
    ylabel="Frequency")

Plot of all the heights in the dataset.

In [None]:
# let's try to find outliers with boxplot visualization
df_tennis.plot.box(y="winner_ht", vert=False, grid=True, figsize=(10, 4));

Search for outliers. We can see that there are only a few outliers.
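A box plot drawn with matplotlib's default settings places the whiskers at 1.5 times the interquartile range, and the same rule can be applied directly to list the outlying values. The sketch below uses only the standard library; the helper `iqr_outliers` and the sample heights are made up, and the quartiles are computed with linear interpolation (`method="inclusive"`).

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up heights in cm, including one implausible value.
heights = [2, 170, 175, 178, 180, 183, 185, 188, 190, 193]
print(iqr_outliers(heights))
```

A rule of this kind flags values that are far from the bulk of the distribution; a fixed plausibility range for human height is a complementary check that does not depend on the data.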

The same analysis done for the winner is repeated for the loser.

In [None]:
df_tennis['loser_ht'].max()

In [None]:
df_tennis['loser_ht'].min()

In [None]:
# print all possible loser heights in ascending order
df_tennis['loser_ht'].value_counts().sort_index()

In [None]:
df_tennis['loser_ht'].value_counts().sort_index().plot.bar(
    figsize=(10, 4), 
    title="loser_ht distribution",
    xlabel="loser_ht", 
    ylabel="Frequency")

In [None]:
# let's try to find outliers with boxplot visualization
df_tennis.plot.box(y="loser_ht", vert=False, grid=True, figsize=(10, 4));

Check if the players have different heights, i.e. if they have grown over time

In [None]:
# get all the player ids from winners and losers, without duplicates
players_ids = list(set(df_tennis['winner_id'].dropna().unique().tolist()) | set(df_tennis['loser_id'].dropna().unique().tolist()))

count = 0

results = []

# find players with different ht
for player_id in players_ids:
    w_ht_players = df_tennis[df_tennis['winner_id']==player_id]['winner_ht'].dropna().unique().tolist()
    l_ht_players = df_tennis[df_tennis['loser_id']==player_id]['loser_ht'].dropna().unique().tolist()
    ht_players = list(set(w_ht_players) | set(l_ht_players))
    diff = len(ht_players)
    if diff > 1:
        results.append(player_id)
        count = count + 1

# print results        
print("found " + str(count) + " players with different heights, ids: " + str(results))
print("")
for player_id in results:
    w_ht_players = df_tennis[df_tennis['winner_id']==player_id]
    l_ht_players = df_tennis[df_tennis['loser_id']==player_id]    
    result = pd.concat([w_ht_players, l_ht_players])
    result.sort_values(by=['tourney_date'], inplace=True)
            
    printhead = True
    last_ht = 0
    for index, row in result.iterrows():             
        if printhead:
            if row["winner_id"] == player_id:
                name = row["winner_name"]
            else:
                name = row["loser_name"]
            print(" --------------- PLAYER " + str(player_id) + " " + name + " --------------- ")
            printhead = False
        age = None
        match_id = row["tourney_id"]
        match_name = row["tourney_name"]
        if row["winner_id"] == player_id:
            ht = row["winner_ht"]
            age = ConvertAge(row['winner_age'])
        else:
            ht = row["loser_ht"]
            age = ConvertAge(row['loser_age'])        
        
        if last_ht != ht:
            print(str(match_id)  + "\t-\t" + match_name  + "\t-\t" + str(ht)  + " - " + str(row["tourney_date"])  + " - " + age)
            last_ht = ht        
    print(" -----------------------------------------------------  ")   

Only one player grew over time (David Goffin)
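The same search can be sketched without an explicit loop over the players, by stacking the winner and loser columns into one long table and counting the distinct heights per player. The toy frame `matches` and its values are made up for illustration:

```python
import pandas as pd

# Toy matches (made-up ids and heights); player 2 appears with two heights
matches = pd.DataFrame({
    "winner_id": [1, 2, 1, 3],
    "winner_ht": [185.0, 180.0, 185.0, None],
    "loser_id":  [2, 3, 3, 2],
    "loser_ht":  [180.0, 190.0, 190.0, 183.0],
})

# Stack winner and loser columns into one long (player_id, ht) table
winners = matches[["winner_id", "winner_ht"]].set_axis(["player_id", "ht"], axis=1)
losers = matches[["loser_id", "loser_ht"]].set_axis(["player_id", "ht"], axis=1)
long_table = pd.concat([winners, losers], ignore_index=True).dropna()

# Count the distinct heights per player and keep those with more than one
n_heights = long_table.groupby("player_id")["ht"].nunique()
changed = n_heights[n_heights > 1].index.tolist()
print(changed)
```

On a large dataset this is usually much faster than filtering the whole frame once per player.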

#### winner_age and loser_age


In [None]:
print(str(df_tennis['winner_age'].max()) + " converted-> " + ConvertAge(df_tennis['winner_age'].max()))

In [None]:
print(str(df_tennis['winner_age'].min()) + " converted-> " + ConvertAge(df_tennis['winner_age'].min()))

In [None]:
# let's try to find outliers in age with boxplot visualization
df_tennis.boxplot(vert=False, column=['winner_age'], return_type='axes',figsize=(10, 4))

In [None]:
sns.histplot(data=df_tennis['winner_age'], bins="sturges", binrange=(10,50), color="lightgreen", kde=True, kde_kws={'clip':(10,50)}).lines[0].set_color('blue')

In [None]:
#loser age

In [None]:
print(str(df_tennis['loser_age'].max()) + " -> " + ConvertAge(df_tennis['loser_age'].max()))
print(str(df_tennis['loser_age'].min()) + " -> " + ConvertAge(df_tennis['loser_age'].min()))

In [None]:
# prints a table sorted by loser age, with tournament name, tournament date and loser name
df_tennis_tmp = df_tennis.copy() # avoid overwrite original dataset
df_tennis_tmp['loser_age'] = df_tennis_tmp['loser_age'].dropna().apply(lambda x: ConvertAge(x))
df_tennis_tmp['tourney_date'] = pd.to_datetime(df_tennis_tmp['tourney_date'], format='%Y%m%d')
df_tennis_tmp[['tourney_name', 'tourney_date', 'loser_name', 'loser_age']].dropna().sort_values(by='loser_age')

In [None]:
df_tennis.boxplot(vert=False, column=['loser_age'], return_type='axes',figsize=(10, 4))

In [None]:
#df_tennis['winner_age'].hist(bins=50, grid=False, range=(10, 50))
#df_tennis['winner_age'].plot(kind='kde', xlim=[10,50])

fig = plt.figure(figsize=(16, 5)) 
fig_dims = (1, 2)

plt.subplot2grid(fig_dims, (0, 0))
sns.histplot(data=df_tennis['loser_age'], bins="sturges", binrange=(10,50), color="lightgreen", kde=True, kde_kws={'clip':(10,50)}).lines[0].set_color('blue')

plt.subplot2grid(fig_dims, (0, 1))
sns.kdeplot(data=df_tennis['loser_age'], color="green", clip=(10,50))
sns.kdeplot(data=df_tennis['winner_age'], color="blue", clip=(10,50))

In [None]:
ConvertTime(59.4)

#### minutes

In [None]:
df_tennis.loc[df_tennis['minutes'] < 0, 'minutes'].count()

In [None]:
df_tennis.loc[df_tennis['minutes'] == 0, 'minutes'].count()

In [None]:
df_tennis['minutes'].max()

In [None]:
df_tennis[df_tennis['best_of']==3]['minutes'].mean()

On average, best-of-3 tennis matches last about 90 minutes, while best-of-5 matches last 2 hours and 45 minutes (=165 minutes)
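To compare both match formats with these reference values in one step, the mean can be computed per `best_of` group. A minimal sketch on made-up durations (the frame `toy` stands in for `df_tennis`):

```python
import pandas as pd

# Toy data (made-up durations) standing in for df_tennis
toy = pd.DataFrame({
    "best_of": [3, 3, 3, 5, 5],
    "minutes": [80.0, 95.0, None, 150.0, 180.0],
})

# Mean duration per match format; missing durations are skipped by default
mean_by_format = toy.groupby("best_of")["minutes"].mean()
print(mean_by_format)
```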

In [None]:
print(str(df_tennis['minutes'].mean()) + " converted-> " + ConvertTime(df_tennis['minutes'].mean()))

In [None]:
print(str(df_tennis['minutes'].max()) + " converted-> " + ConvertTime(df_tennis['minutes'].max()))

In [None]:
print(str(df_tennis['minutes'].min()) + " converted-> " + ConvertTime(df_tennis['minutes'].min()))

In [None]:
# let's try to find outliers in minutes with boxplot visualization
df_tennis.boxplot(vert=False, column=['minutes'], return_type='axes',figsize=(10, 4))

In [None]:
sns.histplot(data=df_tennis['minutes'], bins="doane", binrange=(0,300), color="lightgreen", kde=True, kde_kws={'clip':(10,300)}).lines[0].set_color('blue')

In [None]:
df_tennis[df_tennis['minutes']> 665].shape[0]

The longest tennis match in history lasted 11 hours and 5 minutes (= 665 minutes) for best of 5, and 6 hours and 31 minutes for best of 3.

There are 128 entries with a match duration equal to 0. A tennis match lasts on average 40 minutes; the mean in our dataset is 97.67 minutes.

In [None]:
df_tennis_min_filtered = df_tennis[df_tennis['minutes']<= 0]
df_tennis_min_checked = df_tennis_min_filtered.apply(lambda x: IsMatchWithZeroIncorrect(x['score']), axis=1)
df_tennis_min_zero_res = df_tennis_min_filtered[df_tennis_min_checked]
df_tennis_min_zero_res[['tourney_id', 'score', 'minutes']]

A duration of 0 could be considered a default value
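If 0 is read as a default for "duration not recorded", one option is to turn it into a missing value before computing statistics. This is only a sketch of that assumption on made-up durations, not a decision taken in the notebook:

```python
import numpy as np
import pandas as pd

minutes = pd.Series([0.0, 0.0, 75.0, 120.0, 105.0])  # made-up durations

# Treat 0 as "not recorded" rather than as a real duration
minutes_clean = minutes.replace(0, np.nan)

# The zeros pull the mean down; NaN values are ignored by mean()
print(minutes.mean(), minutes_clean.mean())
```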

#### w_ace, w_df and w_svpt

In [None]:
negative_w_ace = df_tennis.loc[df_tennis['w_ace'] < 0].shape[0]
print(negative_w_ace)

In [None]:
#check if there are more aces than serve points
df_tennis.loc[df_tennis['w_svpt'] < df_tennis['w_ace']].shape[0]

In [None]:
df_tennis.plot.scatter('w_svpt', 'w_ace')
plt.show()

In [None]:
#w_df: winner's number of double faults
negative_w_df = df_tennis.loc[df_tennis['w_df'] < 0].shape[0]
print(negative_w_df)

In [None]:
df_tennis.plot.scatter('w_svpt', 'w_df')
plt.show()

In [None]:
#check if there are more double faults than service performed
df_tennis.loc[df_tennis['w_svpt'] < df_tennis['w_df']].shape[0]

In [None]:
#w_svpt: winner's number of serve points
negative_w_svpt = df_tennis.loc[df_tennis['w_svpt'] < 0].shape[0]
print(negative_w_svpt)

In [None]:
#check for outliers for w_svpt
df_tennis.boxplot(vert=False, column=['w_svpt'], return_type='axes',figsize=(10, 3))
plt.show()

There are outliers in the serve points, with values that go over twenty-five hundred.
*It can be theoretically possible.*

In [None]:
df_tennis.loc[df_tennis['w_svpt'] >300][['tourney_id', 'tourney_name', 'best_of', 'score','winner_name','w_svpt','w_ace','loser_name','l_svpt', 'l_ace']]

#### l_ace, l_df and l_svpt

In [None]:
negative_l_ace = df_tennis.loc[df_tennis['l_ace'] < 0].shape[0]
print(negative_l_ace)

In [None]:
#check if there are more aces than serve points
df_tennis.loc[df_tennis['l_svpt'] < df_tennis['l_ace']].shape[0]

In [None]:
df_tennis.plot.scatter('l_svpt', 'l_ace')
plt.show()

In [None]:
#l_df: loser's number of double faults
negative_l_df = df_tennis.loc[df_tennis['l_df'] < 0].shape[0]
print(negative_l_df)

In [None]:
df_tennis.plot.scatter('l_svpt', 'l_df')
plt.show()

In [None]:
#check if there are more double faults than service performed
df_tennis.loc[df_tennis['l_svpt'] < df_tennis['l_df']].shape[0]

In [None]:
#l_svpt: loser's number of serve points
negative_l_svpt = df_tennis.loc[df_tennis['l_svpt'] < 0].shape[0]
print(negative_l_svpt)

In [None]:
#check for outliers for l_svpt
df_tennis.boxplot(vert=False, column=['l_svpt'], return_type='axes',figsize=(10, 3))
plt.show()

#### w_1stIn

In [None]:
#check to find negative values
df_tennis.loc[df_tennis['w_1stIn'] < 0].shape[0]

#### l_1stIn

In [None]:
#check on loser’s number of first serves made to find negative values
df_tennis.loc[df_tennis['l_1stIn'] < 0].shape[0]

#### w_1stWon and w_2ndWon

In [None]:
#check to find negative values
df_tennis.loc[df_tennis['w_1stWon'] < 0].shape[0]

In [None]:
#check if there are more first-serve points won (w_1stWon) than first serves made (w_1stIn)
df_tennis.loc[df_tennis['w_1stIn'] < df_tennis['w_1stWon']].shape[0]

In [None]:
#check to find negative values
df_tennis.loc[df_tennis['w_2ndWon'] < 0].shape[0]

In [None]:
#check that the number of serve point is not smaller than the number of first serves 
df_tennis[df_tennis['w_svpt'] < df_tennis['w_1stIn']].shape[0]

In [None]:
#check that the number of serve point is not smaller than won serve points (first and second serve)
df_tennis[df_tennis['w_svpt'] < df_tennis['w_1stWon']+df_tennis['w_2ndWon']].shape[0]

#### l_1stWon and l_2ndWon


In [None]:
df_tennis.loc[df_tennis['l_1stWon'] < 0].shape[0]

In [None]:
df_tennis.loc[df_tennis['l_1stIn'] < df_tennis['l_1stWon']].shape[0]

In [None]:
df_tennis.loc[df_tennis['l_2ndWon'] < 0].shape[0]

In [None]:
df_tennis[df_tennis['l_svpt'] < df_tennis['l_1stIn']].shape[0]

In [None]:
df_tennis[df_tennis['l_svpt'] < df_tennis['l_1stWon']+df_tennis['l_2ndWon']].shape[0]
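The pairwise checks in these cells can also be collected in one helper that works for both the winner and the loser columns. A minimal sketch on made-up data; `serve_violations` and the toy frame are hypothetical names, not part of the notebook or of `functions_understanding`:

```python
import pandas as pd

# Toy serve statistics (made-up); the second row violates several constraints
toy = pd.DataFrame({
    "w_svpt":   [60, 40],
    "w_1stIn":  [35, 45],
    "w_1stWon": [25, 50],
    "w_2ndWon": [12, 5],
    "w_ace":    [5, 3],
    "w_df":     [2, 1],
})

def serve_violations(df, prefix):
    """Count the rows breaking each ordering constraint for one player role."""
    def col(name):
        return df[f"{prefix}_{name}"]

    checks = {
        "1stIn > svpt": col("1stIn") > col("svpt"),
        "1stWon > 1stIn": col("1stWon") > col("1stIn"),
        "1stWon + 2ndWon > svpt": col("1stWon") + col("2ndWon") > col("svpt"),
        "ace > svpt": col("ace") > col("svpt"),
        "df > svpt": col("df") > col("svpt"),
    }
    # Rows with missing values compare as False, so they are not counted
    return {name: int(mask.sum()) for name, mask in checks.items()}

violations = serve_violations(toy, "w")
print(violations)
```

Calling the helper with `"l"` instead of `"w"` would run the same checks on the loser columns, which avoids copying each cell by hand.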

#### w_SvGms,  w_bpSaved and w_bpFaced

In [None]:
#Checks on w_SvGms
df_tennis.loc[df_tennis['w_SvGms'] < 0].shape[0]

In [None]:
import re
#w_SvGms: winner’s number of serve games
# each player serves about half of the games, so w_SvGms should lie
# within 2 games of half the total number of games found in the score
re_score1 = r"\d[\d-][^()]"  # a set score such as "6-4"; tie-break points in parentheses are skipped
re_score2 = r"\d"
invalid_SvGms = []
for i, score, sv_gms in zip(df_tennis.index, df_tennis['score'], df_tennis['w_SvGms']):
    # skip the rows where the score or the number of serve games is missing
    if pd.isna(score) or pd.isna(sv_gms):
        continue
#     if (Retirement(score) | Walkover(score) | Default(score) | Bye(score)):
#         continue
    int_scores = re.findall(re_score1, str(score))
    int_scores = re.findall(re_score2, str(int_scores))
    # skip the scores with more than 6 digits (longer than three ordinary sets)
    if(len(int_scores) > 6):
        continue
    score_sum = np.sum(list(map(lambda x: int(x), int_scores)))
    # the value is plausible only if it lies between both bounds
    if (((np.floor(score_sum/2) - 2) <= sv_gms) and
        ((np.ceil(score_sum/2) + 2) >= sv_gms)):
        continue
    invalid_SvGms.append(i)
#     print('+++ INVALID SCORE +++')
#     print("SCORE -> " + str(score))
#     print("SERVED GAMES -> " + str(sv_gms))
#     print(np.floor(score_sum/2))
#     print(np.ceil(score_sum/2))
#     print('+++++++')
print(invalid_SvGms)
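The check in this cell appears to rest on the idea that each player serves roughly half of the games of a match, with a tolerance of 2 games. A minimal sketch of that idea on plain score strings; `total_games` and `svgms_is_plausible` are hypothetical helpers, and the score format (`6-4 7-6(5)`) is assumed from the regular expressions used above:

```python
import re

def total_games(score):
    """Sum the games of every set in a score string such as '6-4 7-6(5)'."""
    without_tiebreaks = re.sub(r"\(\d+\)", "", score)     # drop '(5)' tie-break points
    sets = re.findall(r"(\d+)-(\d+)", without_tiebreaks)  # [('6', '4'), ('7', '6')]
    return sum(int(a) + int(b) for a, b in sets)

def svgms_is_plausible(score, sv_gms, tolerance=2):
    """Each player serves about half of the games, give or take `tolerance`."""
    games = total_games(score)
    lower = games // 2 - tolerance        # floor(games / 2) - tolerance
    upper = -(-games // 2) + tolerance    # ceil(games / 2) + tolerance
    return lower <= sv_gms <= upper

print(total_games("6-4 7-6(5)"))
print(svgms_is_plausible("6-4 7-6(5)", 12))
print(svgms_is_plausible("6-4 7-6(5)", 20))
```

Capturing whole numbers on both sides of the hyphen, instead of single digits, also handles sets with 10 or more games.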

In [None]:
df_tennis['w_bpSaved'].min() < 0

In [None]:
df_tennis['w_bpFaced'].min() < 0

In [None]:
#check that the number of break points saved is never bigger than the number of break points faced
df_tennis[df_tennis['w_bpSaved']>df_tennis['w_bpFaced']].shape[0]

In [None]:
#check for outliers for w_bpFaced
df_tennis.boxplot(vert=False, column=['w_bpFaced'], return_type='axes',figsize=(10, 3))
plt.show()

In [None]:
df_tennis[df_tennis['w_bpFaced']>75][['tourney_id', 'tourney_name', 'best_of', 'score','winner_name','w_bpFaced','loser_name','l_bpFaced']]

In [None]:
#check for outliers for w_bpSaved
df_tennis.boxplot(vert=False, column=['w_bpSaved'], return_type='axes',figsize=(10, 3))
plt.show()

In [None]:
df_tennis[df_tennis['w_bpSaved']>75][['tourney_id', 'tourney_name', 'best_of', 'score','winner_name','w_bpFaced','w_bpSaved','loser_name','l_bpFaced','l_bpSaved']]

#### l_SvGms,  l_bpSaved and l_bpFaced

In [None]:
df_tennis.loc[df_tennis['l_SvGms'] < 0].shape[0]

In [None]:
#l_SvGms: loser’s number of serve games
# each player serves about half of the games, so l_SvGms should lie
# within 2 games of half the total number of games found in the score
re_score1 = r"\d[\d-][^()]"  # a set score such as "6-4"; tie-break points in parentheses are skipped
re_score2 = r"\d"
invalid_SvGms = []
for i, score, sv_gms in zip(df_tennis.index, df_tennis['score'], df_tennis['l_SvGms']):
    # skip the rows where the score or the number of serve games is missing
    if pd.isna(score) or pd.isna(sv_gms):
        continue
#     if (Retirement(score) | Walkover(score) | Default(score) | Bye(score)):
#         continue
    int_scores = re.findall(re_score1, str(score))
    int_scores = re.findall(re_score2, str(int_scores))
    # skip the scores with more than 6 digits (longer than three ordinary sets)
    if(len(int_scores) > 6):
        continue
    score_sum = np.sum(list(map(lambda x: int(x), int_scores)))
    # the value is plausible only if it lies between both bounds
    if (((np.floor(score_sum/2) - 2) <= sv_gms) and
        ((np.ceil(score_sum/2) + 2) >= sv_gms)):
        continue
    invalid_SvGms.append(i)
#     print('+++ INVALID SCORE +++')
#     print("SCORE -> " + str(score))
#     print("SERVED GAMES -> " + str(sv_gms))
#     print(np.floor(score_sum/2))
#     print(np.ceil(score_sum/2))
#     print('+++++++')
print(invalid_SvGms)

In [None]:
#check for outliers for l_bpFaced
df_tennis.boxplot(vert=False, column=['l_bpFaced'], return_type='axes',figsize=(10, 3))
plt.show()

In [None]:
df_tennis[df_tennis['l_bpFaced']>75][['tourney_id', 'tourney_name', 'best_of', 'score','winner_name','w_bpFaced','loser_name','l_bpFaced']]

In [None]:
df_tennis['l_bpSaved'].min() < 0

In [None]:
df_tennis['l_bpFaced'].min() < 0

In [None]:
#check that the number of break points saved is never bigger than the number of break points faced
df_tennis[df_tennis['l_bpSaved']>df_tennis['l_bpFaced']].shape[0]

In [None]:
#check for outliers for l_bpSaved
df_tennis.boxplot(vert=False, column=['l_bpSaved'], return_type='axes',figsize=(10, 3))
plt.show()

In [None]:
df_tennis[df_tennis['l_bpSaved']>75][['tourney_id', 'tourney_name', 'best_of', 'score','winner_name','w_bpFaced','w_bpSaved','loser_name','l_bpFaced','l_bpSaved']]

#### winner_rank and loser_rank

In [None]:
df_tennis['winner_rank'].max()

In [None]:
df_tennis['winner_rank'].min() >= 1 #there cannot be rank smaller than 1

In [None]:
df_tennis['winner_rank'].min()

In [None]:
df_tennis.boxplot(vert=False, column=['winner_rank'], return_type='axes',figsize=(10, 4))

In [None]:
df_tennis['loser_rank'].max()

In [None]:
df_tennis['loser_rank'].min() >= 1 #there cannot be rank smaller than 1

In [None]:
df_tennis['loser_rank'].min()

In [None]:
df_tennis.boxplot(vert=False, column=['loser_rank'], return_type='axes',figsize=(10, 4))

#### winner_rank_points and loser_rank_points

In [None]:
df_tennis['winner_rank_points'].max()<21750 #21750 is the maximum a player can reach

In [None]:
df_tennis['winner_rank_points'].max()

In [None]:
df_tennis['winner_rank_points'].min()

In [None]:
df_tennis[df_tennis['winner_rank_points'] == df_tennis['winner_rank_points'].min()].shape[0] 

In [None]:
df_tennis.boxplot(vert=False, column=['winner_rank_points'], return_type='axes',figsize=(10, 4))

In [None]:
df_tennis.plot.scatter('winner_rank', 'winner_rank_points')
plt.show()

In [None]:
df_tennis['loser_rank_points'].max()<21750 #21750 is the maximum a player can reach

In [None]:
df_tennis['loser_rank_points'].max()

In [None]:
df_tennis['loser_rank_points'].min()

In [None]:
df_tennis[df_tennis['loser_rank_points'] == df_tennis['loser_rank_points'].min()].shape[0] 

In [None]:
df_tennis.boxplot(vert=False, column=['loser_rank_points'], return_type='axes',figsize=(10, 4))

In [None]:
df_tennis.plot.scatter('loser_rank', 'loser_rank_points')
plt.show()

**tourney_spectators**

In [None]:
df_tennis['tourney_spectators'].min() < 0

In [None]:
df_tennis['tourney_spectators'].max()

In [None]:
df_tennis['tourney_spectators'].min()

**tourney_revenue**

In [None]:
df_tennis['tourney_revenue'].min() < 0

In [None]:
df_tennis['tourney_revenue'].max()

## Correlation

In [None]:
df_numeric = df_tennis[['draw_size', 'minutes','w_ace','w_df','w_svpt','w_1stIn', 'w_1stWon', 'w_2ndWon', 
                        'w_SvGms', 'winner_rank', 'winner_rank_points', 'w_bpSaved',  'w_bpFaced','l_ace',
                        'l_df','l_svpt','l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'loser_rank', 
                        'loser_rank_points', 'l_bpSaved',  'l_bpFaced', 'tourney_spectators', 'tourney_revenue',
                       'winner_ht','winner_age','loser_ht','loser_age']]
df_numeric

In [None]:
plt.figure(figsize = (30,20))
sns.heatmap(df_numeric.corr(), annot=True)

In [None]:
treshold = 0.8
correlation = df_numeric.corr()
correlation_filtered = correlation[correlation>treshold]
correlation_filtered

In [None]:
correlation_filtered = correlation[correlation_filtered.sum()>1]
correlation_filtered = correlation_filtered[correlation_filtered>treshold]
correlation_filtered = correlation_filtered.dropna(axis=1,how='all')
correlation_filtered

**w_svpt, w_1stIn, w_1stWon, w_SvGms** are all correlated. The more service games (w_SvGms) a player plays, the more serve points (w_svpt) they play: a player who serves in a game serves every point of that game, so the two numbers increase together. The more a player serves, the more first serves (w_1stIn) they make. As the number of first serves increases, so does the chance of winning some of them, so the more first serves are made, the more first serves won (w_1stWon) we expect. The same reasoning holds for the loser. Furthermore, winner and loser alternate serving, so if the number of serves of the winner increases (for example because more games are played), the number of serves of the loser increases too, and these attributes grow together.

**w_bpSaved and w_bpFaced** are correlated because the more break points a player faces, the more break points they can save.
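A filtered correlation matrix can be hard to read, because every pair appears twice and the diagonal is always 1. One way to list each strongly correlated pair once is to keep only the strict upper triangle. A minimal sketch on made-up data (the frame `toy` and the column names `a`, `b`, `c` are illustrative); the 0.8 threshold is the one used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy data (made-up): b is nearly a multiple of a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()
threshold = 0.8

# Keep only the strict upper triangle: each pair once, no diagonal
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# abs() keeps both strong positive and strong negative correlations
strong_pairs = pairs[pairs.abs() > threshold]
print(strong_pairs)
```

Using the absolute value covers the positive and the negative thresholds in a single pass.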

In [None]:
correlation_filtered = correlation[correlation<-treshold]
correlation_filtered = correlation_filtered[correlation_filtered<-treshold]
correlation_filtered.dropna(axis=1,how='all',inplace=True)
correlation_filtered

There are no strong negative correlations to consider.