# MILESTONE 1

**Assia Ouanaya, Alexandre Variengien, Félicie Giraud-Sauveur  
Data visualization project: ATP Tennis [Data Set](https://www.kaggle.com/datasets/edoardoba/atp-tennis-data)**

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Problem statement

Since its creation in the nineteenth century, tennis has kept growing in popularity to become one of the most played and most widely covered sports. Names such as Rafael Nadal or Novak Djokovic are hard to miss, as are the Grand Slam tournaments they compete in, from Roland Garros to Wimbledon.
Nonetheless, the usual picture of tennis shared by the media tends to focus on a tiny subset of both the events and the players.
With this visualization, we want to give a holistic image of the community of professional tennis players, its global structure.  

First, we want to visualize the pattern of matchmaking through tournaments and through the seasons. All the tournaments can be seen as a graph whose nodes are the players and the edges are the tennis matches. We want to visualize the structure of this graph.  
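
As a minimal sketch of this graph view, the snippet below builds a weighted directed graph from (winner, loser) pairs using only the standard library. The four pairs are the first rows of the dataset preview; the helper names `build_match_graph` and `matches_played` are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# (winner, loser) pairs taken from the first rows of the dataset.
matches = [
    ("Kwon S.W.", "Nishioka Y."),
    ("Monteiro T.", "Altmaier D."),
    ("Djere L.", "Carballes Baena R."),
    ("Johnson S.", "Vukic A."),
]


def build_match_graph(matches):
    """Return {winner: {loser: number_of_wins}}, a weighted directed graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for winner, loser in matches:
        graph[winner][loser] += 1
        graph[loser]  # touch the key so players who only lost still become nodes
    return graph


def matches_played(graph):
    """Return the number of matches per player (in-degree plus out-degree)."""
    played = defaultdict(int)
    for winner, losers in graph.items():
        for loser, count in losers.items():
            played[winner] += count
            played[loser] += count
    return dict(played)


graph = build_match_graph(matches)
print(len(graph), "players")
print(matches_played(graph))
```

Storing the edge as winner to loser keeps the match outcome in the structure itself, and the per-player match count is the kind of quantity that could later drive node size.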

In addition to the competitions, we want to focus on individual players: besides the well-known superstars, what does the typical career of a professional tennis player look like, and what are the player's characteristics? We want to share the stories of those whose names will never appear in international headlines, even though they make up the largest part of the community.  
Questions such as the role of right- or left-handedness are commonly debated. Through data visualization, we aim to shed new light on these topics.  

This visualization is intended for the general public, with no particular interest in tennis. Our goal is to show the fascinating patterns that emerge in the data visualization from this field and to share the statistical stories that never appear in newspapers.

## Dataset

We chose to work with the ATP Tennis with Betting Odds dataset found on [Kaggle](https://www.kaggle.com/datasets/edoardoba/atp-tennis-data/metadata). 

Description of the dataset:
- ATP: Tournament number
- Location: Venue of tournament
- Tournament: Name of tournament
- Date: Date of match
- Series: Name of ATP tennis series
- Court: Type of court
- Surface: Type of surface
- Round: Round of match
- Best of: Maximum number of sets playable in match
- Winner: Match winner
- Loser: Match loser
- WRank: ATP Entry ranking of the match winner as of the start of the tournament
- LRank: ATP Entry ranking of the match loser as of the start of the tournament
- WPts: ATP Entry points of the match winner as of the start of the tournament
- LPts: ATP Entry points of the match loser as of the start of the tournament
- W1: Number of games won in 1st set by match winner
- L1: Number of games won in 1st set by match loser
- W2: Number of games won in 2nd set by match winner
- L2: Number of games won in 2nd set by match loser
- W3: Number of games won in 3rd set by match winner
- L3: Number of games won in 3rd set by match loser
- W4: Number of games won in 4th set by match winner
- L4: Number of games won in 4th set by match loser
- W5: Number of games won in 5th set by match winner
- L5: Number of games won in 5th set by match loser
- Wsets: Number of sets won by match winner
- Lsets: Number of sets won by match loser
- Comment: Comment on the match (Completed, won through retirement of loser, or via Walkover)
- B365W: Bet365 odds of match winner
- B365L: Bet365 odds of match loser
- PSW: Pinnacle Sports odds of match winner
- PSL: Pinnacle Sports odds of match loser
- MaxW: Maximum odds of match winner
- MaxL: Maximum odds of match loser
- AvgW: Average odds of match winner
- AvgL: Average odds of match loser
- EXW: Expekt odds of match winner
- EXL: Expekt odds of match loser
- LBW: Ladbrokes odds of match winner
- LBL: Ladbrokes odds of match loser
- SJW: Stan James odds of match winner
- SJL: Stan James odds of match loser
- UBW: Unibet odds of match winner
- UBL: Unibet odds of match loser
- pl1_flag: Winner's nationality
- pl1_year_pro: Winner's starting year as a pro
- pl1_weight: Winner's weight
- pl1_height: Winner's height
- pl1_hand: Winner's playing hand
- pl2_flag: Loser's nationality
- pl2_year_pro: Loser's starting year as a pro
- pl2_weight: Loser's weight
- pl2_height: Loser's height
- pl2_hand: Loser's playing hand

In [8]:
df = pd.read_csv("tennis_data.csv")
print(df.shape)
df

(36120, 54)




Unnamed: 0,ATP,Location,Tournament,Date,Series,Court,Surface,Round,Best of,Winner,...,pl1_flag,pl1_year_pro,pl1_weight,pl1_height,pl1_hand,pl2_flag,pl2_year_pro,pl2_weight,pl2_height,pl2_hand
0,1,Adelaide,Adelaide International 1,2022-01-03,ATP250,Outdoor,Hard,1st Round,3,Kwon S.W.,...,KOR,2015.0,72.0,180.0,Right-Handed,JPN,2014.0,64.0,170.0,Left-Handed
1,1,Adelaide,Adelaide International 1,2022-01-03,ATP250,Outdoor,Hard,1st Round,3,Monteiro T.,...,BRA,2011.0,78.0,183.0,Left-Handed,GER,2014.0,80.0,188.0,Right-Handed
2,1,Adelaide,Adelaide International 1,2022-01-03,ATP250,Outdoor,Hard,1st Round,3,Djere L.,...,SRB,2013.0,80.0,185.0,Right-Handed,ESP,2011.0,76.0,180.0,Right-Handed
3,1,Adelaide,Adelaide International 1,2022-01-03,ATP250,Outdoor,Hard,1st Round,3,Johnson S.,...,USA,2012.0,86.0,188.0,Right-Handed,AUS,2018.0,85.0,188.0,Right-Handed
4,1,Adelaide,Adelaide International 1,2022-01-04,ATP250,Outdoor,Hard,1st Round,3,Moutet C.,...,FRA,2016.0,71.0,175.0,Left-Handed,DEN,2020.0,77.0,188.0,Right-Handed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36115,65,Shanghai,Masters Cup,2008-11-14,Masters Cup,Indoor,Hard,Round Robin,3,Simon G.,...,FRA,2002.0,70.0,183.0,Right-Handed,CZE,1996.0,76.0,185.0,Right-Handed
36116,65,Shanghai,Masters Cup,2008-11-14,Masters Cup,Indoor,Hard,Round Robin,3,Murray A.,...,GBR,2005.0,84.0,191.0,Right-Handed,SUI,1998.0,85.0,185.0,Right-Handed
36117,65,Shanghai,Masters Cup,2008-11-15,Masters Cup,Indoor,Hard,Semifinals,3,Djokovic N.,...,SRB,2003.0,77.0,188.0,Right-Handed,FRA,2002.0,70.0,183.0,Right-Handed
36118,65,Shanghai,Masters Cup,2008-11-15,Masters Cup,Indoor,Hard,Semifinals,3,Davydenko N.,...,RUS,1999.0,70.0,178.0,Right-Handed,GBR,2005.0,84.0,191.0,Right-Handed


## Exploratory Data Analysis

In [9]:
df.columns

Index(['ATP', 'Location', 'Tournament', 'Date', 'Series', 'Court', 'Surface',
       'Round', 'Best of', 'Winner', 'Loser', 'WRank', 'LRank', 'WPts', 'LPts',
       'W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets',
       'Lsets', 'Comment', 'B365W', 'B365L', 'PSW', 'PSL', 'MaxW', 'MaxL',
       'AvgW', 'AvgL', 'EXW', 'EXL', 'LBW', 'LBL', 'SJW', 'SJL', 'UBW', 'UBL',
       'pl1_flag', 'pl1_year_pro', 'pl1_weight', 'pl1_height', 'pl1_hand',
       'pl2_flag', 'pl2_year_pro', 'pl2_weight', 'pl2_height', 'pl2_hand'],
      dtype='object')

As we are not primarily interested in the betting odds, we drop these columns before any further analysis, keeping only the Bet365 odds (B365W and B365L). The betting odds columns are the following: 
- B365W: Bet365 odds of match winner
- B365L: Bet365 odds of match loser
- PSW: Pinnacle Sports odds of match winner
- PSL: Pinnacle Sports odds of match loser
- MaxW: Maximum odds of match winner
- MaxL: Maximum odds of match loser
- AvgW: Average odds of match winner
- AvgL: Average odds of match loser
- EXW: Expekt odds of match winner
- EXL: Expekt odds of match loser
- LBW: Ladbrokes odds of match winner
- LBL: Ladbrokes odds of match loser
- SJW: Stan James odds of match winner
- SJL: Stan James odds of match loser
- UBW: Unibet odds of match winner
- UBL: Unibet odds of match loser

In [11]:
df[['B365W', "B365L", "PSW", "PSL", "MaxW", "MaxL", "AvgW", "AvgL", "EXW", "EXL", "LBW", "LBL", "SJW", "SJL", "UBW", "UBL"]].describe()

Unnamed: 0,B365W,B365L,PSW,PSL,MaxW,MaxL,AvgW,AvgL,EXL,LBW,LBL,SJW,SJL,UBW,UBL
count,35914.0,35937.0,33173.0,33173.0,29706.0,29706.0,29706.0,29706.0,28717.0,28131.0,28142.0,15572.0,15579.0,5309.0,5309.0
mean,1.839653,3.685053,1.927658,4.117603,2.000127,7.28134,1.8418,3.514055,3.323456,1.810226,3.451461,1.796538,3.557943,1.819319,3.567555
std,1.189168,3.884496,1.328985,5.316389,1.543608,347.601852,1.08103,3.191139,2.542908,1.031691,3.075889,1.004273,3.27251,1.038893,3.412837
min,0.971,0.967,0.974,1.01,1.01,1.01,1.01,1.01,1.0,1.0,1.0,1.0,1.01,1.01,1.02
25%,1.22,1.72,1.273,1.79,1.3,1.83,1.25,1.73,1.73,1.25,1.73,1.22,1.73,1.23,1.75
50%,1.5,2.5,1.56,2.62,1.6,2.73,1.52,2.51,2.5,1.5,2.5,1.5,2.63,1.5,2.52
75%,2.1,4.0,2.15,4.19,2.23,4.39,2.08,3.87,3.8,2.0,4.0,2.0,4.0,2.03,4.0
max,34.0,101.0,46.0,121.0,76.0,42586.0,23.45,36.44,40.0,26.0,51.0,19.0,81.0,18.0,50.0


In [12]:
df.drop(inplace=True, columns=["PSW", "PSL", "MaxW", "MaxL", "AvgW", "AvgL", "EXW", "EXL", "LBW", "LBL", "SJW", "SJL", "UBW", "UBL"])
df.shape

(36120, 40)

In [13]:
#df.drop(inplace=True, columns=['W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets', 'Lsets'])
#df.shape

Let's check the number of NA values in the dataset's columns and handle them.

In [14]:
df.isna().sum()

ATP                 0
Location            0
Tournament          0
Date                0
Series              0
Court               0
Surface             0
Round               0
Best of             0
Winner              0
Loser               0
WRank              13
LRank              80
WPts               11
LPts               79
W1                232
L1                229
W2                571
L2                571
W3              19076
L3              19076
W4              32677
L4              32677
W5              34825
L5              34825
Wsets             231
Lsets             234
Comment             0
B365W             206
B365L             183
pl1_flag           96
pl1_year_pro       96
pl1_weight         96
pl1_height         96
pl1_hand           96
pl2_flag          649
pl2_year_pro      649
pl2_weight        649
pl2_height        649
pl2_hand          649
dtype: int64

In [15]:
df.dropna(axis=0, subset=['pl1_flag', 'pl2_flag', 'WRank', 'LRank', 'WPts', 'LPts'], inplace=True)
df.shape

(35377, 40)

In [16]:
df.dropna(axis=0, subset=['pl1_flag', 'pl2_flag'], inplace=True)
df.isna().sum()

ATP                 0
Location            0
Tournament          0
Date                0
Series              0
Court               0
Surface             0
Round               0
Best of             0
Winner              0
Loser               0
WRank               0
LRank               0
WPts                0
LPts                0
W1                231
L1                228
W2                567
L2                567
W3              18599
L3              18599
W4              31978
L4              31978
W5              34095
L5              34095
Wsets             230
Lsets             233
Comment             0
B365W             193
B365L             171
pl1_flag            0
pl1_year_pro        0
pl1_weight          0
pl1_height          0
pl1_hand            0
pl2_flag            0
pl2_year_pro        0
pl2_weight          0
pl2_height          0
pl2_hand            0
dtype: int64

Now let's investigate our numerical features.

In [17]:
df.describe()

Unnamed: 0,ATP,Best of,WRank,LRank,WPts,LPts,W1,L1,W2,L2,...,Wsets,Lsets,B365W,B365L,pl1_year_pro,pl1_weight,pl1_height,pl2_year_pro,pl2_weight,pl2_height
count,35377.0,35377.0,35377.0,35377.0,35377.0,35377.0,35146.0,35149.0,34810.0,34810.0,...,35147.0,35144.0,35184.0,35206.0,35377.0,35377.0,35377.0,35377.0,35377.0,35377.0
mean,31.720723,3.395455,56.766713,83.366622,1993.12853,1159.382367,5.80305,4.106262,5.783108,3.944441,...,2.155831,0.418621,1.849004,3.621398,2005.005371,80.878367,187.003873,2005.077451,79.958137,186.340871
std,18.022301,0.796582,69.786739,96.106565,2420.985833,1307.405901,1.231894,1.829837,1.249189,1.860474,...,0.464083,0.563051,1.195452,3.759611,5.177058,7.891915,12.092681,11.856877,7.518406,19.810576
min,1.0,3.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.971,0.967,1989.0,7.0,10.0,0.0,7.0,10.0
25%,18.0,3.0,16.0,34.0,706.0,562.0,6.0,3.0,6.0,3.0,...,2.0,0.0,1.25,1.66,2001.0,75.0,183.0,2001.0,75.0,183.0
50%,31.0,3.0,40.0,62.0,1090.0,815.0,6.0,4.0,6.0,4.0,...,2.0,0.0,1.5,2.5,2004.0,80.0,185.0,2005.0,80.0,185.0
75%,48.0,3.0,75.0,100.0,2075.0,1240.0,6.0,6.0,6.0,6.0,...,2.0,1.0,2.1,4.0,2008.0,85.0,191.0,2008.0,84.0,191.0
max,67.0,5.0,1890.0,1890.0,16950.0,16950.0,7.0,7.0,7.0,7.0,...,3.0,2.0,34.0,67.0,2021.0,108.0,1883.0,2021.0,108.0,1883.0
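
The summary above contains values that look like data-entry errors rather than real measurements: a minimum height of 10, a maximum height of 1883, a minimum weight of 7 and a minimum `pl2_year_pro` of 0. A minimal sketch of how such rows could be flagged is given below. The bounds, the assumed units (cm and kg) and the helper name `implausible_mask` are assumptions made for the illustration, not values taken from the dataset documentation.

```python
import pandas as pd

# Toy frame reusing the extreme values visible in the summary (10, 1883, 7, 0).
players = pd.DataFrame({
    "pl2_height": [180.0, 10.0, 1883.0, 188.0],
    "pl2_weight": [72.0, 7.0, 80.0, 86.0],
    "pl2_year_pro": [2015.0, 2011.0, 0.0, 2012.0],
})

# Hypothetical plausibility bounds; adjust them after inspecting the data.
BOUNDS = {
    "pl2_height": (150, 230),      # assumed to be centimetres
    "pl2_weight": (50, 130),       # assumed to be kilograms
    "pl2_year_pro": (1970, 2022),  # calendar year
}


def implausible_mask(frame, bounds):
    """Return a boolean Series that is True for rows with any out-of-range value."""
    mask = pd.Series(False, index=frame.index)
    for column, (low, high) in bounds.items():
        # between() is inclusive and returns False for NaN, so NaN is flagged too.
        mask |= ~frame[column].between(low, high)
    return mask


suspicious = implausible_mask(players, BOUNDS)
print(players[suspicious])
```

Flagging instead of dropping keeps the decision open: the affected rows could be removed, or the values replaced by missing values so that the match itself is kept.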


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35377 entries, 0 to 36119
Data columns (total 40 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ATP           35377 non-null  int64  
 1   Location      35377 non-null  object 
 2   Tournament    35377 non-null  object 
 3   Date          35377 non-null  object 
 4   Series        35377 non-null  object 
 5   Court         35377 non-null  object 
 6   Surface       35377 non-null  object 
 7   Round         35377 non-null  object 
 8   Best of       35377 non-null  int64  
 9   Winner        35377 non-null  object 
 10  Loser         35377 non-null  object 
 11  WRank         35377 non-null  float64
 12  LRank         35377 non-null  float64
 13  WPts          35377 non-null  float64
 14  LPts          35377 non-null  float64
 15  W1            35146 non-null  float64
 16  L1            35149 non-null  float64
 17  W2            34810 non-null  float64
 18  L2            34810 non-nu

In [23]:
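# Select the rows whose Comment is 'Sched'; df.loc['Sched'] looks up an index label and raises a KeyError
df.loc[df['Comment'] == 'Sched']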

In [24]:
# The 26 numeric columns need at least 26 axes, so use a 7x4 grid instead of 3x4
df.plot.box(subplots=True, figsize=(20, 30), layout=(7, 4))
plt.tight_layout()
plt.show()

We notice there are a lot of outliers in the following columns:
- WRank: ATP Entry ranking of the match winner as of the start of the tournament
- LRank: ATP Entry ranking of the match loser as of the start of the tournament
- WPts: ATP Entry points of the match winner as of the start of the tournament
- LPts: ATP Entry points of the match loser as of the start of the tournament
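
To put a number on "a lot of outliers", one option is the 1.5 × IQR rule that box plots conventionally use for their whiskers. The sketch below applies it to a toy series; the helper name `iqr_outliers` is hypothetical, and the values are illustrative except for 1890, which is the maximum rank reported in the summary statistics.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]


# Toy ranks, not real data, ending with the maximum rank seen in the dataset.
ranks = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 1890], name="WRank")
print(iqr_outliers(ranks).tolist())
```

On the real data this would be called as `iqr_outliers(df['WRank'])`. Note that a very high rank is a legitimate value for a low-ranked player, so these points are unusual rather than wrong.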

We also see that Best of (the maximum number of sets per match) takes only two values.

In [135]:
df['Best of'].unique()

array([3, 5])

Let's check if there is any correlation between the columns.

In [136]:
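# numeric_only=True is needed with pandas >= 2.0, which no longer drops non-numeric columns silently
df.corr(numeric_only=True)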

Unnamed: 0,ATP,Best of,WRank,LRank,WPts,LPts,pl1_year_pro,pl1_weight,pl1_height,pl2_year_pro,pl2_weight,pl2_height
ATP,1.0,-0.040426,-0.038902,-0.039708,0.031047,0.060345,-0.030682,0.024094,0.023009,-0.010975,0.028496,0.011222
Best of,-0.040426,1.0,-0.073942,0.007127,0.132261,0.003425,-0.008905,0.025494,0.009138,-0.002944,-0.00222,-0.007428
WRank,-0.038902,-0.073942,1.0,0.110108,-0.418702,-0.124845,0.104485,-0.085206,-0.037404,0.012252,-0.027012,-0.0058
LRank,-0.039708,0.007127,0.110108,1.0,-0.13114,-0.392384,-0.000968,-0.023531,-0.016406,0.031685,-0.050017,-0.008298
WPts,0.031047,0.132261,-0.418702,-0.13114,1.0,0.242517,-0.123227,0.096934,0.035083,-0.005357,0.048077,0.013632
LPts,0.060345,0.003425,-0.124845,-0.392384,0.242517,1.0,0.013176,0.043907,0.026795,-0.015393,0.114412,0.029406
pl1_year_pro,-0.030682,-0.008905,0.104485,-0.000968,-0.123227,0.013176,1.0,0.032769,0.108927,0.188979,-0.03156,0.011203
pl1_weight,0.024094,0.025494,-0.085206,-0.023531,0.096934,0.043907,0.032769,1.0,0.474027,-0.012917,0.016343,-0.001605
pl1_height,0.023009,0.009138,-0.037404,-0.016406,0.035083,0.026795,0.108927,0.474027,1.0,0.01221,0.006732,-0.000831
pl2_year_pro,-0.010975,-0.002944,0.012252,0.031685,-0.005357,-0.015393,0.188979,-0.012917,0.01221,1.0,0.022188,0.026952


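There seems to be no strong correlation between the columns. The largest coefficients are moderate: the winner's weight and height (0.47), and ranking versus points (-0.42 for winners, -0.39 for losers).

Let's see if any column has a zero standard deviation.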

In [137]:
df.std(numeric_only=True)

ATP               18.022301
Best of            0.796582
WRank             69.786739
LRank             96.106565
WPts            2420.985833
LPts            1307.405901
pl1_year_pro       5.177058
pl1_weight         7.891915
pl1_height        12.092681
pl2_year_pro      11.856877
pl2_weight         7.518406
pl2_height        19.810576
dtype: float64

All columns have a non-zero standard deviation, so none of them is constant and all remain potentially useful for our analysis.

What are the non-numerical features of the dataset?

In [138]:
print(df.select_dtypes(exclude=np.number).head(4).columns)
df.select_dtypes(exclude=np.number).head(4)

Index(['Location', 'Tournament', 'Date', 'Series', 'Court', 'Surface', 'Round',
       'Winner', 'Loser', 'Comment', 'pl1_flag', 'pl1_hand', 'pl2_flag',
       'pl2_hand'],
      dtype='object')


Unnamed: 0,Location,Tournament,Date,Series,Court,Surface,Round,Winner,Loser,Comment,pl1_flag,pl1_hand,pl2_flag,pl2_hand
0,Adelaide,Adelaide International 1,2022-01-03,ATP250,Outdoor,Hard,1st Round,Kwon S.W.,Nishioka Y.,Completed,KOR,Right-Handed,JPN,Left-Handed
1,Adelaide,Adelaide International 1,2022-01-03,ATP250,Outdoor,Hard,1st Round,Monteiro T.,Altmaier D.,Completed,BRA,Left-Handed,GER,Right-Handed
2,Adelaide,Adelaide International 1,2022-01-03,ATP250,Outdoor,Hard,1st Round,Djere L.,Carballes Baena R.,Completed,SRB,Right-Handed,ESP,Right-Handed
3,Adelaide,Adelaide International 1,2022-01-03,ATP250,Outdoor,Hard,1st Round,Johnson S.,Vukic A.,Completed,USA,Right-Handed,AUS,Right-Handed


What are the different locations we have?

In [139]:
df.Location.unique()

array(['Adelaide', 'Melbourne', 'Sydney', 'Cordoba', 'Montpellier',
       'Pune', 'Buenos Aires', 'Dallas', 'Rotterdam', 'Delray Beach',
       'Doha', 'Marseille', 'Rio de Janeiro', 'Acapulco', 'Dubai ',
       'Santiago', 'Antalya', 'Singapore', 'Miami', 'Cagliari',
       'Marbella', 'Monte Carlo', 'Barcelona', 'Belgrade', 'Estoril',
       'Munich', 'Madrid', 'Rome', 'Geneva', 'Lyon', 'Parma', 'Paris',
       'Stuttgart', 'Halle', 'Queens Club', 'Eastbourne', 'Mallorca',
       'London', 'Bastad', 'Hamburg', 'Newport', 'Gstaad', 'Los Cabos',
       'Umag', 'Atlanta', 'Kitzbuhel', 'Washington', 'Toronto',
       'Cincinnati', 'Winston-Salem', 'New York', 'Metz', 'Nur-Sultan',
       'San Diego', 'Sofia', 'Indian Wells', 'Antwerp', 'Moscow',
       'St. Petersburg', 'Vienna', 'Stockholm', 'Turin', 'Auckland',
       'Cologne', 'Sardinia', 'Brisbane', 'Sao Paulo', 'Houston',
       'Marrakech', 'Budapest', "'s-Hertogenbosch", 'Montreal', 'Chengdu',
       'Zhuhai', 'Beijing', 'Tokyo'
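
The array above contains `'Dubai '` with a trailing space, which would be counted as a separate location if a clean `'Dubai'` spelling also occurs. A minimal sketch of the usual fix with `Series.str.strip` follows. The toy values `'Dubai'` and `' Miami'` are hypothetical and only there to show the effect.

```python
import pandas as pd

# Toy column reproducing the trailing-space entry seen in the output above.
locations = pd.Series(["Dubai ", "Doha", "Dubai", " Miami"], name="Location")

# Remove leading and trailing whitespace from every entry.
cleaned = locations.str.strip()

print(locations.nunique(), "distinct values before cleaning")
print(cleaned.nunique(), "distinct values after cleaning")
print(cleaned.unique())
```

On the real data this would be `df['Location'] = df['Location'].str.strip()`, and the same call could be applied to the other text columns.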

In [140]:
pd.DataFrame(df.groupby('Location').count().sort_values(by='ATP',ascending=False)['ATP'])

Unnamed: 0_level_0,ATP
Location,Unnamed: 1_level_1
Paris,2428
Melbourne,1999
New York,1886
London,1809
Indian Wells,1234
...,...
Cagliari,26
Zhuhai,26
Singapore,24
Amersfoort,23


What are the different tennis tournaments?

In [141]:
df.Tournament.unique()

array(['Adelaide International 1', 'Melbourne Summer Set',
       'Adelaide International 2', 'Sydney Tennis Classic',
       'Australian Open', 'Cordoba Open', 'Open Sud de France',
       'Maharashtra Open', 'Argentina Open', 'Dallas Open',
       'ABN AMRO World Tennis Tournament', 'Delray Beach Open',
       'Qatar Exxon Mobil Open', 'Open 13', 'Rio Open',
       'Abierto Mexicano', 'Dubai Tennis Championships', 'Chile Open',
       'Antalya Open', 'Great Ocean Road Open', 'Murray River Open',
       'Singapore Open', 'Miami Open', 'Sardegna Open',
       'AnyTech365 Andalucia Open', 'Monte Carlo Masters',
       'Barcelona Open', 'Serbia Open', 'Millennium Estoril Open',
       'BMW Open', 'Mutua Madrid Open', "Internazionali BNL d'Italia",
       'Geneva Open', 'Lyon Open', 'Belgrade Open', 'Emilia-Romagna Open',
       'French Open', 'Mercedes Cup', 'Halle Open',
       "Queen's Club Championships", 'Viking International',
       'Mallorca Championships', 'Wimbledon', 'Nordea Op

In [142]:
pd.DataFrame(df.groupby('Tournament').count().sort_values(by='ATP',ascending=False)['ATP'])

Unnamed: 0_level_0,ATP
Tournament,Unnamed: 1_level_1
Australian Open,1871
French Open,1754
US Open,1753
Wimbledon,1629
BNP Paribas Open,1139
...,...
Belgrade Open,25
Mifel Open,25
Singapore Open,24
Countrywide Classic,24


Who are the 5 players who won the most matches?

In [143]:
pd.DataFrame(df.groupby('Winner').agg('count').sort_values(by='ATP', ascending=False).head(5)['ATP'])

Unnamed: 0_level_0,ATP
Winner,Unnamed: 1_level_1
Djokovic N.,824
Nadal R.,757
Federer R.,673
Murray A.,564
Ferrer D.,498


Who are the 5 players who lost the most matches?

In [144]:
pd.DataFrame(df.groupby('Loser').agg('count').sort_values(by='ATP', ascending=False).head(5)['ATP'])

Unnamed: 0_level_0,ATP
Loser,Unnamed: 1_level_1
Seppi A.,316
Simon G.,315
Fognini F.,314
Verdasco F.,311
Lopez F.,308


Which type of court is more popular?

In [145]:
pd.DataFrame(df.groupby('Court').count()['ATP'])

Unnamed: 0_level_0,ATP
Court,Unnamed: 1_level_1
Indoor,6405
Outdoor,28972


In [146]:
pd.DataFrame(df.groupby('Surface').count()['ATP'])

Unnamed: 0_level_0,ATP
Surface,Unnamed: 1_level_1
Carpet,226
Clay,10969
Grass,3945
Hard,20237


What are the different comments?

In [147]:
df.Comment.unique()

array(['Completed', 'Retired', 'Walkover', 'Awarded', 'Rrtired',
       'Disqualified', 'Sched'], dtype=object)

We see that there is a typo ('Rrtired'); let's fix it.

In [149]:
# Restrict the replacement to the Comment column instead of scanning the whole DataFrame
df['Comment'] = df['Comment'].replace('Rrtired', 'Retired')

In [150]:
df.Comment.unique()

array(['Completed', 'Retired', 'Walkover', 'Awarded', 'Disqualified',
       'Sched'], dtype=object)

How are the matches distributed across these statuses?

In [151]:
pd.DataFrame(df.groupby('Comment').count()['ATP'])

Unnamed: 0_level_0,ATP
Comment,Unnamed: 1_level_1
Awarded,2
Completed,34025
Disqualified,2
Retired,1126
Sched,1
Walkover,221


Which nationalities win the most matches?

In [152]:
pd.DataFrame(df.groupby('pl1_flag').count().sort_values(by='ATP', ascending=False)['ATP']).head(10)

Unnamed: 0_level_0,ATP
pl1_flag,Unnamed: 1_level_1
ESP,4724
FRA,3787
USA,3008
GER,2154
ARG,2031
SRB,1685
RUS,1648
ITA,1559
CRO,1204
SUI,1196


Which nationalities lose the most matches?

In [153]:
pd.DataFrame(df.groupby('pl2_flag').count().sort_values(by='ATP', ascending=False)['ATP']).head(10)

Unnamed: 0_level_0,ATP
pl2_flag,Unnamed: 1_level_1
ESP,3907
FRA,3747
USA,3203
GER,2502
ARG,2099
ITA,1871
RUS,1602
AUS,1365
SRB,1039
CRO,996


Right-handed vs. left-handed?

In [154]:
pd.DataFrame(df.groupby('pl1_hand').count()['ATP'])

Unnamed: 0_level_0,ATP
pl1_hand,Unnamed: 1_level_1
Left-Handed,4705
Right-Handed,30672


In [155]:
pd.DataFrame(df.groupby('pl2_hand').count()['ATP'])

Unnamed: 0_level_0,ATP
pl2_hand,Unnamed: 1_level_1
Left-Handed,5019
Right-Handed,30358
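
The two tables can be combined into a win rate per playing hand. The short worked example below uses the counts printed above; the helper name `win_rate` is hypothetical.

```python
# Match counts copied from the pl1_hand (winners) and pl2_hand (losers) tables.
wins = {"Left-Handed": 4705, "Right-Handed": 30672}
losses = {"Left-Handed": 5019, "Right-Handed": 30358}


def win_rate(hand):
    """Share of matches won among all match appearances by players of that hand."""
    return wins[hand] / (wins[hand] + losses[hand])


for hand in wins:
    print(f"{hand}: {win_rate(hand):.3f}")
```

This is a rate per match appearance, not per player: a player with many matches weighs more than one with few, so the small gap between the two hands should be read with that in mind.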


TO DO AFTER EDA:
- link player names with player images  
- link flag names with flag images
- check whether non-numerical features need to be converted to dummy variables
- normalize ?
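
For the last two items, a minimal sketch of what dummy encoding and normalization could look like with pandas is given below. The toy frame and the choice of min-max scaling are assumptions made for the illustration; whether either step is needed depends on the final visualizations.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
sample = pd.DataFrame({
    "Surface": ["Hard", "Clay", "Grass", "Hard"],
    "Court": ["Outdoor", "Outdoor", "Outdoor", "Indoor"],
    "WRank": [16.0, 40.0, 75.0, 1.0],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(sample, columns=["Surface", "Court"])

# Min-max scale WRank to [0, 1]; z-scoring would be an equally valid choice.
rank = encoded["WRank"]
encoded["WRank"] = (rank - rank.min()) / (rank.max() - rank.min())

print(encoded.columns.tolist())
print(encoded["WRank"].tolist())
```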

## Related work/Our ideas

This kind of data is usually presented in a classical way, as tables detailing player statistics according to different criteria, as can be seen on these two reference sites: http://www.tennisabstract.com/cgi-bin/leaders.cgi?f=B0s00o1 or https://www.atptour.com/en/stats.

Our idea is original in that it focuses much more on the links between players and between countries. The way of approaching the data is therefore different. Our ambitions for the moment are as follows:

- ***Tab 1*: Network between players:**
    - Build a network that connects players who have played a match against each other. Interactivity: nodes can be dragged with the mouse to move the network.
    - A node corresponds to a player (with the player's picture inside the node). The bigger the node, the more matches the player has played. 
    - Directed edges, to show who the winner is.
    - Slider to choose the year.
    - Button to apply filters according to different variables: Location, Tournament, Series, Court, Surface, Players ranking range.
    - On hover, a summary of information is displayed, with the possibility to click for an extended view. This extended view is a whole page about the player showing: his picture, the number of ATP matches played by series per year, the number of wins and losses by year/tournament/type of court/type of surface, his nationality, his weight, the year he turned pro and his playing hand. There is also a world map that can be rotated, on which the tournaments he took part in are displayed as points, by year (inspiration: https://observablehq.com/@d3/world-tour or https://observablehq.com/@d3/versor-dragging), as well as a list of his main opponents. 
    - It is also possible to click on the link between two players: a box is displayed with the characteristics of their matches: the players' photos, location, tournament, date, series, court, surface, round, winner, rank and points of the players at the beginning of the tournament, number of games won by each player for each set, number of sets won by each player, comment, odds.
    - Inspiration: http://www.claudiobellei.com/2017/02/04/viznetworks/


- ***Tab 2*: Connection map between countries:**
    - World map with the countries colored according to, for example, the number of victories of their players.
    - Links between the countries whose players face each other in a match.
    - Slider to choose the year.
    - Button to apply filters according to different variables: Location, Tournament, Series, Court, Surface, Players ranking range.
    - Possibility to click on a country: a box appears with the list of the country's players and, for each one, the number of matches and victories for this year.
    - You can also click on the links between countries: a box with the characteristics of the matches is displayed: location, tournament, date, series, court, surface, round, winner, rank and points of the players at the beginning of the tournament, number of games won by each player for each set, number of sets won by each player, comment, odds.
    - Inspiration: https://d3-graph-gallery.com/graph/connectionmap_multi.html  
    


- ***Tab 3*: Bar charts with bars that move over time:**
    - Two bar charts side by side, with the bars moving as the years go by.
    - A first bar chart to show the evolution of the country ranking in terms of number of wins. 
    - A second bar chart to show the evolution of the player ranking in terms of number of wins.
    - Button to apply filters according to different variables: Series, Court, Surface.
    - Inspiration: https://observablehq.com/@d3/bar-chart-race  
    
    


- ***(Side ideas for the site)***:
    - Make the loading symbol in the form of a spinning tennis ball.
    - Make a mascot in the form of a tennis racket to explain the site's features at the beginning.
    - Make symbols for the selection of variables, a small cup for the winner, flags for the nationalities, etc...  
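
The data behind Tab 2 and Tab 3 could be derived from the match table with two small aggregations, sketched below with the standard library only. The rows, the player names and the helper names `country_links` and `cumulative_wins` are hypothetical placeholders, not values from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (year, winner, winner_flag, loser, loser_flag).
rows = [
    (2020, "Player A", "ESP", "Player B", "FRA"),
    (2020, "Player C", "FRA", "Player A", "ESP"),
    (2021, "Player A", "ESP", "Player C", "FRA"),
    (2021, "Player A", "ESP", "Player D", "USA"),
]


def country_links(rows, year):
    """Tab 2: count matches between each unordered pair of countries in one year."""
    links = Counter()
    for match_year, _, winner_flag, _, loser_flag in rows:
        # Matches between compatriots are skipped: they have no link to draw.
        if match_year == year and winner_flag != loser_flag:
            links[tuple(sorted((winner_flag, loser_flag)))] += 1
    return links


def cumulative_wins(rows):
    """Tab 3: cumulative wins per player at the end of each year."""
    winners_by_year = defaultdict(list)
    for match_year, winner, *_ in rows:
        winners_by_year[match_year].append(winner)
    totals = Counter()
    frames = {}
    for match_year in sorted(winners_by_year):
        totals.update(winners_by_year[match_year])
        frames[match_year] = dict(totals)  # snapshot used as one animation frame
    return frames


print(country_links(rows, 2021))
print(cumulative_wins(rows))
```

Each yearly snapshot returned by `cumulative_wins` corresponds to one frame of the moving bar chart; replacing the winner name by the winner flag in the same function would give the country version.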


*NB: The dataset comes from Kaggle (https://www.kaggle.com/datasets/edoardoba/atp-tennis-data), where the only analysis of this data presented so far is a "Djokovic vs Nadal quick analysis".*

In [48]:
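# Count the distinct players appearing as winner or loser (Series.append was removed in pandas 2.0)
winners = pd.Series(df['Winner'].unique())
losers = pd.Series(df['Loser'].unique())
pd.concat([winners, losers]).nunique()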

697