# What am I Investigating?
Tennis is a racquet sport that can be played individually against a single opponent (singles) or between two teams of two players each (doubles). Players use a strung racquet to strike a hollow rubber ball, covered with felt, over a net and into the opponent's court. The objective of the game is to play the ball in such a way that the opponent cannot play a valid return. A player who fails to return the ball loses the point, and the opponent wins it.  
Tennis is played on different surfaces, including grass, clay, hard courts, and even indoor carpet. Each surface affects the ball's speed and bounce differently, leading to varied styles of play.  
*I'm interested in finding out whether there is a relationship between the surface a match was played on and its duration*.

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from scipy.stats import skew, kurtosis

pd.set_option('display.max_columns', 50)

# Loading Data

In [2]:
folder_path = '/content/drive/MyDrive/Tennis_Analysis/full_matches_data'

# List all files in the directory with a .csv extension
all_files = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f)) and f.endswith('.csv')]

# Use a list comprehension to read each file into a dataframe and then concatenate them all
combined_df = pd.concat([pd.read_csv(os.path.join(folder_path, f)) for f in all_files], ignore_index=True)
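A variation on the loading step above, as a minimal sketch: sorting the file list makes the row order reproducible, and tagging each row with its file of origin helps when tracing odd values back to a source. The helper name `load_csv_folder`, the `source_file` column, and the toy files are all hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder_path):
    """Read every .csv file in folder_path into one DataFrame (hypothetical helper)."""
    # Sorting gives the same row order on every run, unlike a raw directory listing.
    csv_paths = sorted(Path(folder_path).glob("*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"No .csv files found in {folder_path}")
    # source_file is an illustrative extra column recording where each row came from.
    frames = [pd.read_csv(p).assign(source_file=p.name) for p in csv_paths]
    return pd.concat(frames, ignore_index=True)


# Toy demonstration with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"surface": ["Hard", "Clay"], "minutes": [75, 190]}).to_csv(
        Path(tmp) / "a.csv", index=False
    )
    pd.DataFrame({"surface": ["Grass"], "minutes": [137]}).to_csv(
        Path(tmp) / "b.csv", index=False
    )
    demo_df = load_csv_folder(tmp)

print(demo_df)
```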

In [3]:
combined_df.sample(10)

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
101861,2022-2807,Gijon,Hard,32,A,20221010,289,104918,,WC,Andy Murray,R,190.0,GBR,35.4,106398,,,Pedro Cachin,R,185.0,ARG,27.4,2-6 7-5 7-6(3),3,R16,167.0,5.0,2.0,90.0,65.0,45.0,14.0,16.0,2.0,6.0,13.0,0.0,118.0,89.0,56.0,14.0,16.0,6.0,9.0,48.0,975.0,61.0,813.0
100236,2022-0403,Miami Masters,Hard,128,M,20220321,176,111456,,,Mackenzie Mcdonald,R,178.0,USA,26.9,136440,,,Dominik Koepfer,L,180.0,GER,27.8,6-7(8) 6-4 6-4,3,R128,161.0,3.0,0.0,107.0,74.0,43.0,16.0,16.0,8.0,15.0,0.0,11.0,113.0,71.0,37.0,16.0,16.0,7.0,16.0,54.0,981.0,64.0,899.0
48583,2004-505,Vina del Mar,Clay,32,A,20040209,17,103454,1.0,,Nicolas Massu,R,183.0,CHI,24.3,103248,,,Harel Levy,R,185.0,ISR,25.5,6-2 6-4,3,R16,75.0,5.0,1.0,49.0,26.0,20.0,17.0,9.0,1.0,1.0,3.0,0.0,47.0,32.0,22.0,5.0,9.0,0.0,3.0,14.0,1564.0,106.0,373.0
22551,1996-747,Beijing,Carpet,32,A,19961007,7,102563,,,Thomas Johansson,R,180.0,SWE,21.5,101318,,,Javier Frana,L,185.0,ARG,29.7,6-3 7-5,3,R32,70.0,3.0,4.0,58.0,28.0,25.0,18.0,11.0,0.0,1.0,8.0,0.0,67.0,43.0,30.0,9.0,10.0,5.0,8.0,80.0,623.0,84.0,583.0
23540,1997-401,Philadelphia,Hard,32,A,19970224,10,101647,,,Byron Black,R,175.0,ZIM,27.3,101735,,Q,Richard Fromberg,R,196.0,AUS,26.8,6-4 6-3,3,R32,67.0,1.0,0.0,49.0,24.0,20.0,16.0,9.0,0.0,0.0,9.0,2.0,55.0,34.0,28.0,4.0,10.0,2.0,5.0,55.0,766.0,85.0,575.0
62977,2009-314,Gstaad,Clay,32,A,20090726,2,105064,,Q,Thomaz Bellucci,L,188.0,BRA,21.5,103967,,LL,Michael Lammer,R,185.0,SUI,27.3,6-7(5) 7-6(5) 6-4,3,R32,190.0,2.0,6.0,125.0,72.0,49.0,25.0,17.0,10.0,15.0,3.0,2.0,127.0,78.0,47.0,27.0,17.0,9.0,15.0,119.0,571.0,188.0,348.0
74343,2012-560,US Open,Hard,128,G,20120827,120,104925,2.0,,Novak Djokovic,R,188.0,SRB,25.2,104527,18.0,,Stan Wawrinka,R,183.0,SUI,27.4,6-4 6-1 3-1 RET,5,R16,93.0,4.0,0.0,68.0,45.0,30.0,13.0,11.0,4.0,6.0,5.0,3.0,57.0,23.0,15.0,12.0,10.0,1.0,5.0,2.0,11270.0,19.0,1730.0
93899,2019-540,Wimbledon,Grass,128,G,20190701,102,104919,,,Leonardo Mayer,R,188.0,ARG,32.1,105208,,,Ernests Gulbis,R,190.0,LAT,30.8,6-1 7-6(12) 6-2,5,R128,137.0,2.0,5.0,90.0,51.0,40.0,26.0,15.0,2.0,3.0,2.0,6.0,108.0,58.0,40.0,19.0,13.0,12.0,17.0,59.0,875.0,92.0,617.0
76896,2013-314,Gstaad,Clay,28,A,20130722,20,104527,2.0,,Stan Wawrinka,R,183.0,SUI,28.3,104593,,,Daniel Gimeno Traver,R,185.0,ESP,27.9,7-5 7-6(4),3,R16,95.0,12.0,2.0,68.0,48.0,38.0,12.0,12.0,2.0,4.0,4.0,8.0,101.0,59.0,39.0,21.0,12.0,9.0,12.0,10.0,2915.0,62.0,772.0
77038,2013-421,Canada Masters,Hard,56,M,20130805,34,105577,,WC,Vasek Pospisil,R,193.0,CAN,23.1,103285,,,Radek Stepanek,R,185.0,CZE,34.6,6-2 6-4,3,R32,75.0,7.0,5.0,59.0,31.0,26.0,10.0,9.0,5.0,7.0,1.0,2.0,50.0,32.0,18.0,7.0,9.0,1.0,6.0,71.0,696.0,51.0,900.0


# Understanding the Big Picture

Getting the number of rows and columns in the dataset (pre-cleaning):

In [4]:
number_of_rows_pre_cleaning = combined_df.shape[0]
number_of_columns_pre_cleaning = combined_df.shape[1]

print(f"The dataset has {number_of_rows_pre_cleaning} rows and {number_of_columns_pre_cleaning} columns.")

The dataset has 104682 rows and 49 columns.


Examining the data type of each column:

In [9]:
combined_df.dtypes

tourney_id             object
tourney_name           object
surface                object
draw_size               int64
tourney_level          object
tourney_date            int64
match_num               int64
winner_id               int64
winner_seed           float64
winner_entry           object
winner_name            object
winner_hand            object
winner_ht             float64
winner_ioc             object
winner_age            float64
loser_id                int64
loser_seed            float64
loser_entry            object
loser_name             object
loser_hand             object
loser_ht              float64
loser_ioc              object
loser_age             float64
score                  object
best_of                 int64
round                  object
minutes               float64
w_ace                 float64
w_df                  float64
w_svpt                float64
w_1stIn               float64
w_1stWon              float64
w_2ndWon              float64
w_SvGms   

Counting the number of NA's in each column:

In [5]:
combined_df.isna().sum()

tourney_id                0
tourney_name              0
surface                   0
draw_size                 0
tourney_level             0
tourney_date              0
match_num                 0
winner_id                 0
winner_seed           62282
winner_entry          91873
winner_name               0
winner_hand               9
winner_ht              2454
winner_ioc                0
winner_age                5
loser_id                  0
loser_seed            81382
loser_entry           83599
loser_name                0
loser_hand               42
loser_ht               4855
loser_ioc                 0
loser_age                18
score                     0
best_of                   0
round                     0
minutes               13036
w_ace                 10207
w_df                  10207
w_svpt                10207
w_1stIn               10207
w_1stWon              10207
w_2ndWon              10207
w_SvGms               10206
w_bpSaved             10207
w_bpFaced           
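Raw NA counts are easier to compare when shown next to the share of rows they represent. The sketch below does this on a small made-up frame (`toy_df` and its values are illustrative only); the same two expressions could be applied to `combined_df`.

```python
import numpy as np
import pandas as pd

# Made-up data: the column names mirror the dataset, the values do not.
toy_df = pd.DataFrame({
    "surface": ["Hard", "Clay", "Grass", "Hard"],
    "minutes": [75.0, np.nan, 137.0, np.nan],
    "w_ace": [5.0, 2.0, np.nan, 7.0],
})

missing_summary = pd.DataFrame({
    "n_missing": toy_df.isna().sum(),                       # count of NA's per column
    "pct_missing": (toy_df.isna().mean() * 100).round(1),   # share of rows that are NA
}).sort_values("pct_missing", ascending=False)

print(missing_summary)
```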

Checking if there are any duplicate rows:

In [6]:
number_of_duplicated_rows = combined_df.duplicated().sum()
print(f"There are {number_of_duplicated_rows} duplicated rows in the dataset.")

There are 0 duplicated rows in the dataset.


# Data Cleaning

Deleting matches (rows) whose statistics are missing:

In [None]:
match_stats_columns_names = ["w_ace", "w_df", "w_svpt", "w_1stIn", "w_1stWon", "w_2ndWon", "w_SvGms", "w_bpSaved", "w_bpFaced", "l_ace", "l_df", "l_svpt", "l_1stIn", "l_1stWon", "l_2ndWon", "l_SvGms", "l_bpSaved", "l_bpFaced"]

combined_df.dropna(subset=match_stats_columns_names, inplace=True)

combined_df[match_stats_columns_names].isna().sum() # If the code above worked, the columns in the list above should have 0 NA's.
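One thing worth keeping in mind about `dropna(subset=...)`: it only looks at the listed columns, so a row that is missing a value in some other column (for example `minutes`) would be kept. A small sketch on made-up data, with `toy_df` and `stats_cols` as illustrative names:

```python
import numpy as np
import pandas as pd

# Made-up rows: row 1 lacks minutes, rows 2 and 3 lack one serve statistic each.
toy_df = pd.DataFrame({
    "minutes": [75.0, np.nan, 137.0, 95.0],
    "w_ace": [5.0, 2.0, np.nan, 12.0],
    "l_ace": [3.0, 3.0, 2.0, np.nan],
})

stats_cols = ["w_ace", "l_ace"]

# A row is dropped if ANY of the subset columns is NA; other columns are ignored.
cleaned_df = toy_df.dropna(subset=stats_cols)

print(cleaned_df)
print(cleaned_df.isna().sum())
```

If the later analysis relies on `minutes`, it may be worth checking that column separately after this step.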

Deleting unnecessary columns:

In [None]:
names_of_columns_to_drop = ["winner_seed", "winner_entry", "loser_seed", "loser_entry", "match_num"]

combined_df.drop(columns=names_of_columns_to_drop, inplace=True)

combined_df.columns

Converting the tourney_date column from `YYYYMMDD` integers to pandas datetime values (displayed as `YYYY-MM-DD`):

In [None]:
combined_df['tourney_date'] = pd.to_datetime(combined_df['tourney_date'], format='%Y%m%d')
combined_df['tourney_date'].head()
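To illustrate what the conversion does, here is a minimal sketch on three dates taken from the sample rows shown earlier. Casting to string first is my own precaution so the format string is matched against text; once the column is a datetime, the `.dt` accessor exposes parts such as the year.

```python
import pandas as pd

# Three tourney_date values as they appear in the raw data (YYYYMMDD integers).
raw_dates = pd.Series([20221010, 20040209, 19961007], name="tourney_date")

# Parse the 8-digit values into proper datetimes.
parsed_dates = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")

print(parsed_dates)
print(parsed_dates.dt.year.tolist())
```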

Creating new columns:

In [None]:
# "total_points" - total points played in the match (sum of the serve points both the winner and the loser had).
combined_df["total_points"] = combined_df["w_svpt"] + combined_df["l_svpt"]

# "w_1st_percentage" - percentage of first serve in the winner
combined_df.sample(7)
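The cell above mentions a `w_1st_percentage` column but does not compute it. One possible definition, shown here only as an assumption, is the share of the winner's serve points on which the first serve landed in (`w_1stIn / w_svpt`). The sketch uses the `w_svpt` and `w_1stIn` values from two of the sample rows shown earlier:

```python
import pandas as pd

# w_svpt and w_1stIn values copied from two sample rows above.
toy_df = pd.DataFrame({"w_svpt": [90.0, 49.0], "w_1stIn": [65.0, 26.0]})

# Assumed definition: first serves in, divided by serve points played, as a percentage.
toy_df["w_1st_percentage"] = (toy_df["w_1stIn"] / toy_df["w_svpt"] * 100).round(1)

print(toy_df)
```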

Getting the number of rows and columns in the dataset (post-cleaning):

In [None]:
number_of_rows_post_cleaning = combined_df.shape[0]
number_of_columns_post_cleaning = combined_df.shape[1]

print(f"The dataset has {number_of_rows_post_cleaning} rows and {number_of_columns_post_cleaning} columns.")
print(f"{number_of_rows_pre_cleaning-number_of_rows_post_cleaning} rows were deleted and the number of columns decreased by {number_of_columns_pre_cleaning-number_of_columns_post_cleaning} (net of newly added columns).")

Creating a subset used in the multivariate analysis:

In [None]:
col_names_multivariate_df = ["w_ace", "w_df", "w_svpt", "w_1stIn", "w_1stWon", "w_2ndWon", "w_SvGms", "w_bpSaved", "w_bpFaced", "l_ace", "l_df", "l_svpt", "l_1stIn", "l_1stWon", "l_2ndWon", "l_SvGms", "l_bpSaved", "l_bpFaced", "winner_ht", "winner_age", "loser_ht", "loser_age", "minutes", "total_points"]

multivariate_analysis_df = combined_df[col_names_multivariate_df]

multivariate_analysis_df.head()

# Univariate Analysis

## Categorical Variables

### Surface Type

Counting the number of matches played on each surface:

In [None]:
combined_df["surface"].value_counts()

Proportion of matches played on each surface:

In [None]:
combined_df["surface"].value_counts(normalize=True)

Plotting the number of matches played on each surface type:

In [None]:
plt.figure(figsize=(12,6))
surfaces_countplot = sns.countplot(data=combined_df, x="surface", order=combined_df.surface.value_counts().index)
sns.despine(left=True, bottom=True)
surfaces_countplot.bar_label(surfaces_countplot.containers[0], label_type='edge')
surfaces_countplot.set_xlabel("Type of Surface")
surfaces_countplot.set_ylabel("Number of Matches Played")
surfaces_countplot.set_title("Number of Matches Played Between 1991 and 2023* on each surface")
plt.show()

### Best of 5 VS Best of 3:

Counting the number of best of 5 and best of 3 matches played:

In [None]:
combined_df["best_of"].value_counts()

Proportion of each match format (best of 3 vs. best of 5):

In [None]:
combined_df["best_of"].value_counts(normalize=True)

Plotting the number of best of 5 and best of 3 matches played:

In [None]:
plt.figure(figsize=(12,6))
best_of_countplot = sns.countplot(data=combined_df, x="best_of", order=combined_df.best_of.value_counts().index)
sns.despine(left=True, bottom=True)
best_of_countplot.bar_label(best_of_countplot.containers[0], label_type='edge')
best_of_countplot.set_xlabel("Best of How Many Sets")
best_of_countplot.set_ylabel("Number of Matches")
best_of_countplot.set_title("Number of Matches Played Between 1991 and 2023")
plt.show()

### Number of Matches by Tournament Type:

Counting the number of matches in each tournament type:

In [None]:
combined_df["tourney_level"].value_counts()

Proportion of tournament type:

In [None]:
combined_df["tourney_level"].value_counts(normalize=True)

Plotting:

In [None]:
plt.figure(figsize=(12,6))
tourney_level_countplot = sns.countplot(data=combined_df, x="tourney_level", order=combined_df["tourney_level"].value_counts().index)
sns.despine(left=True, bottom=True)
tourney_level_countplot.bar_label(tourney_level_countplot.containers[0], label_type='edge')
tourney_level_countplot.set_xlabel("Tourney Type")
tourney_level_countplot.set_ylabel("Number of Matches")
tourney_level_countplot.set_title("Number of Matches Played Between 1991 and 2023 in each tourney type")
plt.show()

## Numerical Variables

### Analyzing the total_points variable:

Descriptive statistics:

In [None]:
round(combined_df.total_points.describe(), 3)

Skewness and Kurtosis:

In [None]:
total_points_skewness = combined_df.total_points.skew()
total_points_kurtosis = combined_df.total_points.kurtosis()

print(f"Skewness: {round(total_points_skewness, 3)}")
print(f"Kurtosis: {round(total_points_kurtosis, 3)}")
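As a reading aid for these two numbers: pandas' `Series.kurtosis()` uses Fisher's definition, so a normal distribution scores about 0 rather than 3, and a positive skewness indicates a longer right tail. The sketch below checks this on simulated data (the samples and the seed are arbitrary choices of mine, not part of the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the printed values are repeatable

symmetric = pd.Series(rng.normal(size=100_000))          # bell-shaped reference
right_skewed = pd.Series(rng.exponential(size=100_000))  # long right tail

for name, sample in [("normal", symmetric), ("exponential", right_skewed)]:
    print(f"{name}: skewness={sample.skew():.2f}, excess kurtosis={sample.kurtosis():.2f}")
```

The normal sample should land near 0 on both measures, while the exponential sample should show clearly positive skewness and kurtosis.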

Plotting:

In [None]:
plt.figure(figsize=(12, 6))
total_point_histogram = sns.histplot(data=combined_df, x="total_points")
sns.despine(left=True, bottom=True)
total_point_histogram.set_xlabel("Total Number of Points Played in a Match")
total_point_histogram.set_ylabel("Number of Matches")
plt.show()

### Analyzing the winner_age variable:

Descriptive statistics:

In [None]:
round(combined_df["winner_age"].describe(), 3)

Skewness and Kurtosis:

In [None]:
winner_age_skewness = combined_df["winner_age"].skew()
winner_age_kurtosis = combined_df["winner_age"].kurtosis()

print(f"Skewness: {round(winner_age_skewness, 3)}")
print(f"Kurtosis: {round(winner_age_kurtosis, 3)}")

Plotting:

In [None]:
plt.figure(figsize=(12, 6))
winner_age_histogram = sns.histplot(data=combined_df, x="winner_age")
sns.despine(left=True, bottom=True)
winner_age_histogram.set_xlabel("Winner's Age")
winner_age_histogram.set_ylabel("Number of Matches")
winner_age_histogram.set_title("Winner Age Distribution")
plt.show()

### Analyzing the loser_age variable:

Descriptive statistics:

In [None]:
combined_df["loser_age"].describe()

Skewness and Kurtosis:

In [None]:
loser_age_skewness = combined_df["loser_age"].skew()
loser_age_kurtosis = combined_df["loser_age"].kurtosis()

print(f"Skewness: {round(loser_age_skewness, 3)}")
print(f"Kurtosis: {round(loser_age_kurtosis, 3)}")

Plotting:

In [None]:
plt.figure(figsize=(12, 6))
loser_age_histogram = sns.histplot(data=combined_df, x="loser_age")
sns.despine(left=True, bottom=True)
loser_age_histogram.set_xlabel("Loser's Age")
loser_age_histogram.set_ylabel("Number of Matches")
loser_age_histogram.set_title("Loser Age Distribution")
plt.show()

# Bivariate Analysis

## Finding trends in the numeric variables:

* The goal of the pairplot visualization is to examine the relationships between all the numeric variables in the dataset. A correlation matrix is often used for this, but I prefer to start with a pairplot because it can also reveal non-linear relationships between variables.

Creating a pairplot visualization:

In [None]:
pairplot = sns.pairplot(multivariate_analysis_df, corner=True)
plt.show()

* I chose to create the correlation matrix in order to check the magnitude of the linear correlation between variables.  

Creating a correlation matrix:

In [None]:
corr_matrix = round(multivariate_analysis_df.corr(),2)

mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(20,12))
sns.heatmap(corr_matrix, annot=True, mask=mask, square=True, linewidths=2)
plt.show()
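To make the role of `mask` concrete: `np.triu` marks the upper triangle, including the diagonal, so the heatmap only draws each pair of variables once. A minimal sketch on a made-up three-column frame (`toy_df` is illustrative), using `DataFrame.mask` to show which cells survive:

```python
import numpy as np
import pandas as pd

# Made-up numeric data, just to get a 3x3 correlation matrix.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.1, 5.9, 8.2],
    "c": [4.0, 1.0, 3.0, 2.0],
})

corr_matrix = toy_df.corr()

# True on and above the diagonal; these cells are hidden in the heatmap.
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Replace the masked cells with NaN to see what remains: the strict lower triangle.
lower_triangle = corr_matrix.mask(mask)

print(mask)
print(lower_triangle.round(2))
```

Passing `k=1` to `np.triu` would keep the diagonal visible instead, if the row of 1.0 values is wanted in the plot.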

**Conclusions drawn from the pairplot visualization and correlation matrix:**  
* After examining the pairplot visualization, the only relationships that appear between the variables are linear.
* The correlation matrix shows strong correlations between some variables, and all of them make sense. For example, the correlation between the length of the match in minutes and the total number of points played (*r=0.92*) is easy to explain: the more points the players play, the longer the match lasts.
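A numeric complement to eyeballing the pairplot, offered as a sketch rather than something the original analysis did: comparing Pearson's r with a rank-based (Spearman) correlation. When a relationship is monotonic but curved, the rank correlation stays high while Pearson's r drops. The data below is simulated, and computing Spearman as Pearson on ranks is a choice I made to avoid extra dependencies.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
x = rng.uniform(0, 5, size=1_000)

toy_df = pd.DataFrame({
    "x": x,
    "linear": 2 * x + rng.normal(scale=0.5, size=x.size),  # straight line plus noise
    "exponential": np.exp(2 * x),                           # monotonic but strongly curved
})

pearson = toy_df.corr()          # default method is Pearson
spearman = toy_df.rank().corr()  # Pearson on ranks is the Spearman correlation

comparison = pd.DataFrame({
    "pearson_with_x": pearson["x"],
    "spearman_with_x": spearman["x"],
}).round(3)

print(comparison)
```

A large gap between the two columns for a pair of variables would hint at a non-linear relationship worth a closer look.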

## Comparing the average number of total points played on each surface

Calculating each surface's average:

In [None]:
total_points_mean_by_surface_df = round(combined_df.groupby(["surface"], as_index=False)[["total_points"]].mean(), 2)
total_points_mean_by_surface_df.sort_values(by="total_points", ascending=False, inplace=True, ignore_index=True)
total_points_mean_by_surface_df
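Because match lengths tend to have a long right tail, the mean alone can be pulled up by a few marathon matches. A possible extension, sketched on made-up numbers (`toy_df` does not come from the dataset), is to report the count and median next to the mean for each surface:

```python
import pandas as pd

# Made-up total_points values, two matches per surface.
toy_df = pd.DataFrame({
    "surface": ["Hard", "Hard", "Clay", "Clay", "Grass", "Grass"],
    "total_points": [118, 108, 252, 150, 198, 97],
})

summary = (
    toy_df.groupby("surface")["total_points"]
    .agg(["count", "mean", "median"])      # several statistics in one pass
    .sort_values("mean", ascending=False)  # same ordering idea as the cell above
)

print(summary)
```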

Plotting:

In [None]:
plt.figure(figsize=(16,9))
points_by_surface_plot = sns.boxplot(data=combined_df, x="surface", y="total_points", order=total_points_mean_by_surface_df["surface"])
sns.despine(left=True, bottom=True)
points_by_surface_plot.set_xlabel("Surface Type")
points_by_surface_plot.set_ylabel("Number of Points Played In A Match")
points_by_surface_plot.set_title("Total Points On Each Surface")
plt.show()

## Comparing the average match length for each tournament level

In [None]:
average_duration_by_tourney_level_df = round(combined_df.groupby(["tourney_level"], as_index=False)[["minutes", "total_points"]].mean(), 3)
average_duration_by_tourney_level_df.sort_values(by="minutes", ascending=False, inplace=True, ignore_index=True)
average_duration_by_tourney_level_df

In [None]:
plt.figure(figsize=(16,9))
duration_by_tourney_level_plot = sns.boxplot(data=combined_df, x="tourney_level", y="minutes", order=average_duration_by_tourney_level_df["tourney_level"])
sns.despine(left=True, bottom=True)
duration_by_tourney_level_plot.set_xlabel("Tourney Level")
duration_by_tourney_level_plot.set_ylabel("Match Duration, in Minutes")
duration_by_tourney_level_plot.set_title("Match Duration in Minutes by Tourney Level")
plt.show()

## Serve Analysis: