# What am I Investigating?
Tennis, a globally renowned sport, is a perfect blend of athleticism, strategy, and skill. Spanning from community courts to grand slam arenas, the game attracts millions of enthusiasts worldwide, both as participants and spectators. While the sport's beauty lies in its intricacies, one element stands out as a critical determinant of a player's success: the serve.  

Serving in tennis is equivalent to a pitcher's throw in baseball or a quarterback's pass in football. It sets the tone for the entire point, giving the server an initial advantage. Within this domain, the first serve is particularly crucial. A powerful and accurate first serve can drastically diminish the opponent's chances of returning the ball effectively, thereby providing the server with a strategic upper hand.  

However, it's not just about getting the ball into play. The quality of the first serve, measured by both its accuracy (percentage of successful first serves) and its effectiveness (percentage of points won on a successful first serve), can be game-changing. A higher first serve percentage indicates consistent performance, while a higher winning percentage on the first serve points reflects the serve's potency.  

This study seeks to delve deeper into these metrics to uncover patterns and insights. Specifically, I aim to answer:  
1. Is there a difference between winners' and losers' first serve percentages?
2. Does a difference exist between winners' and losers' percentages of winning first serve points?

By analyzing these questions, I hope to shed light on the pivotal role the first serve plays in a tennis match's outcome and potentially provide players and coaches with valuable insights to refine their strategies.  

---

# Data Source:
The data comes from Kaggle and is called ['Huge Tennis Database'](https://www.kaggle.com/datasets/guillemservera/tennis). This dataset is a comprehensive collection of ATP tennis rankings, match results, and player statistics. It is derived from the original database created and maintained by Jeff Sackmann, which can be found in the [following GitHub repository](https://github.com/JeffSackmann/tennis_atp).

# Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from scipy.stats import skew, kurtosis

pd.set_option('display.max_columns', 50)

# Loading Data

Loading the datasets:

In [None]:
folder_path = '/content/drive/MyDrive/Tennis_Analysis/full_matches_data'

# List all files in the directory with a .csv extension
all_files = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f)) and f.endswith('.csv')]

# Use a list comprehension to read each file into a dataframe and then concatenate them all
combined_df = pd.concat([pd.read_csv(os.path.join(folder_path, f)) for f in all_files], ignore_index=True)

*On Kaggle there are two methods for importing this data: using SQLite or using CSV files. I chose to download the CSV files covering all the ATP matches played between 1968 and 2023. Because each year's matches are stored in a separate CSV file, I had to concatenate them into a single dataframe called `combined_df`.*
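
The concatenation step can also be sketched with `pathlib`, which keeps the file order deterministic. This is only an illustration: the temporary folder, the file names and the `source_file` column are hypothetical and are not part of the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the real data folder: two tiny yearly files.
tmp_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"tourney_id": ["1991-001"], "w_svpt": [60]}).to_csv(
    tmp_dir / "matches_1991.csv", index=False
)
pd.DataFrame({"tourney_id": ["1992-001"], "w_svpt": [72]}).to_csv(
    tmp_dir / "matches_1992.csv", index=False
)

# sorted() makes the concatenation order independent of the file system.
csv_files = sorted(tmp_dir.glob("*.csv"))

# Tag every row with the file it came from (illustrative extra column).
frames = [pd.read_csv(f).assign(source_file=f.name) for f in csv_files]
demo_df = pd.concat(frames, ignore_index=True)
print(demo_df)
```

Keeping the source file name on each row makes it easier to trace a suspicious record back to its yearly file.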

Taking a glance at the dataframe:

In [None]:
combined_df.head(10)

# Understanding the Big Picture

Getting the number of rows and columns in the dataset (pre-cleaning):

In [None]:
number_of_rows_pre_cleaning = combined_df.shape[0]
number_of_columns_pre_cleaning = combined_df.shape[1]

print(f"The dataset has {number_of_rows_pre_cleaning} rows and {number_of_columns_pre_cleaning} columns.")

---

Examining the data type of all the columns in the dataset:

In [None]:
combined_df.dtypes

*After examining the data types of all the columns, I can see a problem with some of them. Although some of the variables are discrete counts, pandas stored them as `float64`, most likely because those columns contain missing values.  
In the data cleaning section I'll convert those columns to a smaller integer data type in order to reduce memory usage.*
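
A small sketch of why count columns can end up as `float64`: a single missing value is enough to promote an integer column to float. The toy series below is made up for illustration; the nullable `Int64` dtype is shown as one possible alternative, not as the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Whole-number counts with one missing value.
counts = pd.Series([3, 7, np.nan])
print(counts.dtype)  # float64, because NaN is a floating point value

# pandas' nullable integer dtype keeps whole numbers and the missing value.
nullable_counts = counts.astype("Int64")
print(nullable_counts.dtype)  # Int64
```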



---



Getting the descriptive statistics of the numeric variables:

In [None]:
combined_df.describe()



---



Counting the number of NA's in each column:

In [None]:
combined_df.isna().sum()

*While examining the number of NA's in each column, my main focus is on the variables describing the match stats, because I intend to analyze them.*



---



Checking if there are any duplicate rows:

In [None]:
number_of_duplicated_rows = combined_df.duplicated().sum()
print(f"There are {number_of_duplicated_rows} duplicated rows in the dataset.")

# Data Cleaning

## Deleting Unnecessary Columns

In [None]:
names_of_columns_to_drop = ["winner_seed", "winner_entry", "loser_seed", "loser_entry", "match_num", "draw_size", "winner_hand", "winner_ioc", "loser_hand", "loser_ioc"]

combined_df.drop(columns=names_of_columns_to_drop, inplace=True)

combined_df.columns

*Removing unnecessary columns. Though analyzing some of them could potentially yield interesting results, I chose to remove them and explore other variables in the data.*

## Data Formatting

Converting the `tourney_date` column from `YYYYMMDD` values to pandas datetime values (displayed as `YYYY-MM-DD`):

In [None]:
combined_df["tourney_date"] = pd.to_datetime(combined_df['tourney_date'], format='%Y%m%d')
combined_df["tourney_date"].head()
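
As a hedged illustration of the parsing step, the sketch below uses made-up date strings, one of which is deliberately invalid. Passing `errors="coerce"` is an optional choice (not used in the cell above) that turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy YYYYMMDD values; the last one has month 13 and cannot be parsed.
raw_dates = pd.Series(["19910107", "20230828", "20231340"])

# errors="coerce" replaces invalid dates with NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%Y%m%d", errors="coerce")
print(parsed)
```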

## Feature Engineering

Splitting the `tourney_date` column into year, month and day:

In [None]:
# Year column
combined_df["Year"] = combined_df["tourney_date"].dt.strftime('%Y')

# Month column
combined_df["Month"] = combined_df["tourney_date"].dt.strftime('%m')

# Day column
combined_df["Day"] = combined_df["tourney_date"].dt.strftime('%d')

# Taking a glance at the new columns
combined_df[["Year", "Month", "Day"]].head(10)
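
Note that `.dt.strftime` produces text, so the new columns hold strings such as `"1991"`. A minimal sketch of the difference, on two made-up dates, in case numeric values are preferred for sorting or arithmetic:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["1991-01-07", "2023-08-28"]))

# strftime returns strings ...
year_as_text = dates.dt.strftime("%Y")
# ... while the .dt.year accessor returns integers.
year_as_int = dates.dt.year

print(year_as_text.tolist())
print(year_as_int.tolist())
```

Strings work well as categorical axis labels in plots, while integers are easier to filter by range.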

Creating new columns:

In [None]:
# "total_points" - total points played in the match (sum of the serve points both the winner and the loser had).
combined_df["total_points"] = combined_df["w_svpt"] + combined_df["l_svpt"]

# "w_1st_serve_in_percentage" - the winners' percentage of first serve in
combined_df["w_1st_serve_in_percentage"] = combined_df["w_1stIn"]/combined_df["w_svpt"]
combined_df["w_1st_serve_in_percentage"] = round(combined_df["w_1st_serve_in_percentage"]*100, 3)

# "l_1st_serve_in_percentage" - the losers' percentage of first serve in
combined_df["l_1st_serve_in_percentage"] = combined_df["l_1stIn"]/combined_df["l_svpt"]
combined_df["l_1st_serve_in_percentage"] = round(combined_df["l_1st_serve_in_percentage"]*100, 3)

# "w_1st_serve_winning_percentage" - the winners' percentage of points won when the first serve was in
combined_df["w_1st_serve_winning_percentage"] = combined_df["w_1stWon"]/combined_df["w_1stIn"]
combined_df["w_1st_serve_winning_percentage"] = round(combined_df["w_1st_serve_winning_percentage"]*100, 3)

# "l_1st_serve_winning_percentage" - the losers' percentage of points won when the first serve was in
combined_df["l_1st_serve_winning_percentage"] = combined_df["l_1stWon"]/combined_df["l_1stIn"]
combined_df["l_1st_serve_winning_percentage"] = round(combined_df["l_1st_serve_winning_percentage"]*100, 3)

combined_df[["total_points", "w_svpt", "w_1stIn", "w_1st_serve_in_percentage", "l_svpt", "l_1stIn", "l_1st_serve_in_percentage", "w_1st_serve_winning_percentage", "l_1st_serve_winning_percentage"]].sample(10)
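
The four percentage columns share the same arithmetic, so it could be wrapped in a small helper. The sketch below is one possible way to do it. The `percentage` function and the toy dataframe are hypothetical, and treating a zero denominator as missing is an assumption rather than something the original cells do explicitly.

```python
import numpy as np
import pandas as pd


def percentage(numerator: pd.Series, denominator: pd.Series, decimals: int = 3) -> pd.Series:
    """Return numerator / denominator * 100, with NaN where the denominator is 0."""
    safe_denominator = denominator.replace(0, np.nan)
    return (numerator / safe_denominator * 100).round(decimals)


# Toy data: the second match has no recorded serve points.
demo = pd.DataFrame({"w_1stIn": [30, 0, 45], "w_svpt": [50, 0, 60]})
demo["w_1st_serve_in_percentage"] = percentage(demo["w_1stIn"], demo["w_svpt"])
print(demo)
```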

## Handling Missing Values

In the "Understanding The Big Picture" section I counted the number of [NA's in each column](https://colab.research.google.com/drive/1YiNo6ZBjWNuTTT72mS8qQZ_2W1ZzAFnL#scrollTo=INYVRYc3RsLx&line=1&uniqifier=1). As can be seen, we don't have the stats for 10,207 matches - 9.75% of the matches in the initial dataframe (we had information on 104,682 matches at the beginning).  
Before I decide how to handle those missing values, I want to understand them better.
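
The share of matches without stats can be computed directly rather than by hand. A minimal sketch on a made-up four-row dataframe (the column subset and values are illustrative only):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "w_ace": [5, np.nan, 3, 7],
    "l_ace": [2, np.nan, np.nan, 1],
    "surface": ["Hard", "Clay", "Grass", "Hard"],
})
stats_cols = ["w_ace", "l_ace"]

# True for every row that lacks at least one of the stats columns.
missing_mask = demo[stats_cols].isna().any(axis=1)

# The mean of a boolean mask is the proportion of True values.
missing_share = missing_mask.mean() * 100
print(f"{missing_mask.sum()} of {len(demo)} rows ({missing_share:.2f}%) lack at least one stat.")
```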

---



Filtering the data, keeping only the rows with NA's in the stats columns.

In [None]:
match_stats_columns_names = ["w_ace", "w_df", "w_svpt", "w_1stIn", "w_1stWon", "w_2ndWon", "w_SvGms", "w_bpSaved", "w_bpFaced", "l_ace", "l_df", "l_svpt", "l_1stIn", "l_1stWon", "l_2ndWon", "l_SvGms", "l_bpSaved", "l_bpFaced", "total_points", "w_1st_serve_in_percentage", "l_1st_serve_in_percentage", "w_1st_serve_winning_percentage", "l_1st_serve_winning_percentage"]

matches_without_stats_df = combined_df[combined_df[match_stats_columns_names].isna().any(axis=1)]

matches_without_stats_df.head(10)

[*The Data Preprocessing*](https://colab.research.google.com/drive/1YiNo6ZBjWNuTTT72mS8qQZ_2W1ZzAFnL#scrollTo=MGAwWswN9ClG) *section is dedicated to creating the subset dataframes used in the analysis sections. I nevertheless created the dataframe in the cell above, which includes only the matches whose stats are unknown, here, in order to thoroughly examine the missing values in the original dataframe.*

Aggregating the filtered data, counting the rows with missing values by year and tourney level:

In [None]:
number_of_NA_by_year_and_tourney_level_df = matches_without_stats_df.groupby(by=["Year", "tourney_level"], as_index=False).size()
number_of_NA_by_year_and_tourney_level_df.rename(columns={"size": "number_of_NA"}, inplace=True)
number_of_NA_by_year_and_tourney_level_df

Plotting the number of NA's by year:

In [None]:
# Creating the plot
plt.figure(figsize=(16, 9))
NA_by_year_barplot = sns.barplot(data=number_of_NA_by_year_and_tourney_level_df, x="Year", y="number_of_NA", errorbar=None, estimator="sum", color="slategray")

# Removing the frame
sns.despine(left=True, bottom=True)

# Adding Bar's Labels
NA_by_year_barplot.bar_label(NA_by_year_barplot.containers[0], label_type='edge')

# Rotating tick labels of the X axis
plt.xticks(rotation=45)

# Titles
NA_by_year_barplot.set_xlabel("Year")
NA_by_year_barplot.set_ylabel("Number of Matches Without Stats")
NA_by_year_barplot.set_title("Number of Matches Played Between 1991 and 2023 Without Stats")

# Showing the plot
plt.show()

Plotting the number of NA's by tourney level:

In [None]:
# Creating the plot
plt.figure(figsize=(16, 9))
NA_by_tourney_level_barplot = sns.barplot(data=number_of_NA_by_year_and_tourney_level_df, x="tourney_level", y="number_of_NA", errorbar=None, estimator="sum", color="slategray")

# Removing the frame
sns.despine(left=True, bottom=True)

# Adding Bar's Labels
NA_by_tourney_level_barplot.bar_label(NA_by_tourney_level_barplot.containers[0], label_type='edge')

# Rotating tick labels of the X axis
plt.xticks(rotation=45)

# Titles
NA_by_tourney_level_barplot.set_xlabel("Tournament Type")
NA_by_tourney_level_barplot.set_ylabel("Number of Matches Without Stats")
NA_by_tourney_level_barplot.set_title("Number of Matches Played By Each Tourney Type Without Stats")

# Showing the plot
plt.show()

Plotting the number of NA's by both year and tourney level:

In [None]:
# Creating the plot
plt.figure(figsize=(16, 9))
NA_by_year_and_tourney_level_barplot = sns.barplot(data=number_of_NA_by_year_and_tourney_level_df, x="Year", y="number_of_NA", hue="tourney_level", estimator="sum")

# Removing the frame
sns.despine(left=True, bottom=True)

# Rotating tick labels of the X axis
plt.xticks(rotation=45)

# Titles
NA_by_year_and_tourney_level_barplot.set_xlabel("Year")
NA_by_year_and_tourney_level_barplot.set_ylabel("Number of Matches Without Stats")
NA_by_year_and_tourney_level_barplot.set_title("Number of Matches Played Between 1991 and 2023 Without Stats, By Tournament Type")

# Showing the plot
plt.show()

*From this plot it's easy to see that most of the NA's come from matches played during Davis Cup events, mainly until 2016. In this analysis I'm not going to track down why this happened, but I do believe it's an interesting topic to investigate in the future.*

Counting the number of matches per event type:

In [None]:
combined_df["tourney_level"].value_counts()

Deleting the matches (rows) whose statistics are unknown:

In [None]:
combined_df.dropna(subset=match_stats_columns_names, inplace=True)

combined_df[match_stats_columns_names].isna().sum() # If the code above worked, the columns in the list above should have 0 NA's.

Deleting matches (rows) played in the Davis Cup, the Tour Finals and other season-ending events:

In [None]:
combined_df.drop(combined_df[(combined_df["tourney_level"] == "D") | (combined_df["tourney_level"] == "F")].index, inplace=True)

*I chose to delete rows based on two conditions: if they had NA values in the stats columns, or if the match was played in the Davis Cup, the Tour Finals or another season-ending event.*  
*I did so because of the nature of those events: each type of event has unique features that distinguish its matches from a typical tennis match. While I'm not 100% sure that excluding those events from my analysis is the right thing to do, I think it's the better choice.*  
*From the plot that shows the number of missing values by year and event type, it seems that there are a lot of missing values from "A" events, which stands for other tour-level events (such as ATP 500 and ATP 250 tournaments). Because of the importance of such events, and because the category does not single out one specific type of event, I decided not to remove them.*
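
The same level-based filter can be expressed with `isin`, which scales more cleanly if further levels need to be excluded later. A small sketch on toy data (the `minutes` values are made up):

```python
import pandas as pd

demo = pd.DataFrame({
    "tourney_level": ["G", "D", "A", "F", "M"],
    "minutes": [180, 150, 95, 110, 120],
})

# Levels to exclude: "D" (Davis Cup) and "F" (Tour Finals / season-ending events).
excluded_levels = ["D", "F"]

# Keep only the rows whose level is NOT in the exclusion list.
filtered = demo[~demo["tourney_level"].isin(excluded_levels)].reset_index(drop=True)
print(filtered)
```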

## Data Type Conversion

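Downcasting the match stats columns from `float64` to the smallest integer type that can hold their values (the percentage columns contain fractional values, so they stay `float64`):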

In [None]:
for col in match_stats_columns_names:
    combined_df[col] = pd.to_numeric(combined_df[col], downcast='integer')

combined_df.dtypes
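
To illustrate how `pd.to_numeric(..., downcast='integer')` behaves, here is a minimal sketch on two made-up series: one with whole-number floats and one with fractional values.

```python
import pandas as pd

whole = pd.Series([12.0, 85.0, 140.0])        # count-like values
fractional = pd.Series([61.538, 70.0, 58.333])  # percentage-like values

# Whole-number floats are converted to a small integer dtype.
whole_down = pd.to_numeric(whole, downcast="integer")

# Values with a fractional part cannot be represented as integers,
# so the series is returned with its float dtype unchanged.
frac_down = pd.to_numeric(fractional, downcast="integer")

print(whole_down.dtype)
print(frac_down.dtype)
```

This is why checking `dtypes` after the loop is worthwhile: only the columns that hold whole numbers actually change type.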

Checking whether the descriptive statistics of the variables changed after the data type conversion:

In [None]:
combined_df[match_stats_columns_names].describe()

## Summarizing Changes In The Data's Shape

Getting the number of rows and columns in the dataset (post-cleaning):

In [None]:
number_of_new_columns_added = 8 # Note: this is not computed automatically from the DF. Update it if columns are added or removed in Feature Engineering.
number_of_rows_post_cleaning = combined_df.shape[0]
number_of_columns_post_cleaning = combined_df.shape[1] - number_of_new_columns_added

print(f"The dataset has {number_of_rows_post_cleaning} rows and {number_of_columns_post_cleaning+number_of_new_columns_added} columns.")
print(f"{number_of_rows_pre_cleaning-number_of_rows_post_cleaning} rows were deleted and {number_of_columns_pre_cleaning-number_of_columns_post_cleaning} columns were removed. {number_of_new_columns_added} new columns were created based on existing variables.")
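
The number of added and removed columns could also be derived from the column names instead of a hard-coded constant. This is a sketch under the assumption that the original column list is saved before cleaning; the toy dataframe and its values are hypothetical.

```python
import pandas as pd

# Column names as they were before any cleaning (saved up front).
original_columns = ["tourney_date", "w_svpt", "l_svpt", "winner_seed"]
demo = pd.DataFrame([[19910107, 60, 55, 1.0]], columns=original_columns)

# Simulated cleaning: drop one column, engineer one new column.
demo = demo.drop(columns=["winner_seed"])
demo["total_points"] = demo["w_svpt"] + demo["l_svpt"]

# Compare the current columns with the saved list.
added = [c for c in demo.columns if c not in original_columns]
removed = [c for c in original_columns if c not in demo.columns]
print(f"{len(removed)} columns were removed and {len(added)} new columns were created.")
```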

# Data Preprocessing

Creating a subset used in the multivariate analysis:

In [None]:
col_names_multivariate_df = ["w_ace", "w_df", "w_svpt", "w_1stIn", "w_1stWon", "w_2ndWon", "w_SvGms", "w_bpSaved", "w_bpFaced", "l_ace", "l_df", "l_svpt", "l_1stIn", "l_1stWon", "l_2ndWon", "l_SvGms", "l_bpSaved", "l_bpFaced", "winner_ht", "winner_age", "loser_ht", "loser_age", "minutes", "total_points", "w_1st_serve_in_percentage", "l_1st_serve_in_percentage", "w_1st_serve_winning_percentage", "l_1st_serve_winning_percentage"]

multivariate_analysis_df = combined_df[col_names_multivariate_df]

multivariate_analysis_df.head()

Creating a new subset for the 1st serve in percentage by surface analysis:

In [None]:
first_serve_in_melted_df = pd.melt(combined_df, id_vars =["winner_id","winner_name", "loser_id", "loser_name", "Year", "surface", "tourney_level"], value_vars=["w_1st_serve_in_percentage", "l_1st_serve_in_percentage"])
first_serve_in_melted_df.rename(columns={"variable": "winner_or_loser", "value": "first_serve_percentage"}, inplace=True)

first_serve_in_melted_df.sort_values(["winner_name", "loser_name"]).tail(10)
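
To make the wide-to-long reshaping concrete, here is a minimal sketch on a single made-up match. `pd.melt` turns the two percentage columns into two rows, one for the winner and one for the loser. The player names and the final `map` step (shortening the labels to `winner`/`loser`) are illustrative additions.

```python
import pandas as pd

wide = pd.DataFrame({
    "winner_name": ["Player A"],
    "loser_name": ["Player B"],
    "surface": ["Clay"],
    "w_1st_serve_in_percentage": [64.0],
    "l_1st_serve_in_percentage": [58.0],
})

# var_name / value_name set the new column names directly,
# which avoids a separate rename step.
long = pd.melt(
    wide,
    id_vars=["winner_name", "loser_name", "surface"],
    value_vars=["w_1st_serve_in_percentage", "l_1st_serve_in_percentage"],
    var_name="winner_or_loser",
    value_name="first_serve_percentage",
)

# Optional: replace the original column names with shorter labels.
long["winner_or_loser"] = long["winner_or_loser"].map({
    "w_1st_serve_in_percentage": "winner",
    "l_1st_serve_in_percentage": "loser",
})
print(long)
```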

Creating a new subset: wins on 1st serve percentage, by surface and event type:

In [None]:
wins_on_first_serve_melted_df = pd.melt(combined_df, id_vars =["winner_id","winner_name", "loser_id", "loser_name", "Year", "surface", "tourney_level"], value_vars=["w_1st_serve_winning_percentage", "l_1st_serve_winning_percentage"])
wins_on_first_serve_melted_df.rename(columns={"variable": "winner_or_loser", "value": "first_serve_wins_percentage"}, inplace=True)

wins_on_first_serve_melted_df.sort_values(["winner_name", "loser_name"]).tail(10)

Exporting all the dataframes to CSV files:

In [None]:
folder_path_to_export = '/content/drive/MyDrive/Tennis_Analysis/processed_datasets'

# Exporting the combined_df
combined_df.to_csv(path_or_buf=folder_path_to_export+"/combined_and_cleaned_df.csv")

# Exporting the first_serve_in_melted_df
first_serve_in_melted_df.to_csv(path_or_buf=folder_path_to_export+"/first_serve_in_melted_df.csv")

# Exporting the wins_on_first_serve_melted_df
wins_on_first_serve_melted_df.to_csv(path_or_buf=folder_path_to_export+"/wins_on_first_serve_melted_df.csv")

# Univariate Analysis

## Categorical Variables

### Surface Type

Counting the number of matches played on each surface:

In [None]:
combined_df["surface"].value_counts()

Proportion of matches played on each surface:

In [None]:
combined_df["surface"].value_counts(normalize=True)

Plotting the number of matches played on each surface type:

In [None]:
# Creating the plot
plt.figure(figsize=(12,6))
surfaces_countplot = sns.countplot(data=combined_df, x="surface", order=combined_df.surface.value_counts().index)

# Removing the frame
sns.despine(left=True, bottom=True)

# Adding Bar's Labels
surfaces_countplot.bar_label(surfaces_countplot.containers[0], label_type='edge')

# Titles
surfaces_countplot.set_xlabel("Type of Surface")
surfaces_countplot.set_ylabel("Number of Matches Played")
surfaces_countplot.set_title("Number of Matches Played Between 1991 and 2023* on each surface")

# Showing the plot
plt.show()

### Number of Matches by Tournament Type:

Counting the number of matches in each tournament type:

In [None]:
combined_df["tourney_level"].value_counts()

Proportion of tournament type:

In [None]:
combined_df["tourney_level"].value_counts(normalize=True)

Plotting:

In [None]:
# Creating the plot
plt.figure(figsize=(12,6))
tourney_level_countplot = sns.countplot(data=combined_df, x="tourney_level", order=combined_df["tourney_level"].value_counts().index)

# Removing the frame
sns.despine(left=True, bottom=True)

# Adding Bar's Labels
tourney_level_countplot.bar_label(tourney_level_countplot.containers[0], label_type='edge')

# Titles
tourney_level_countplot.set_xlabel("Tourney Type")
tourney_level_countplot.set_ylabel("Number of Matches")
tourney_level_countplot.set_title("Number of Matches Played Between 1991 and 2023 in each tourney type")

# Showing the plot
plt.show()

## Numerical Variables

### Analyzing the total_points variable:

Descriptive statistics:

In [None]:
round(combined_df["total_points"].describe(), 2)

Skewness and Kurtosis:

In [None]:
total_points_skewness = combined_df.total_points.skew()
total_points_kurtosis = combined_df.total_points.kurtosis()

print(f"Skewness: {round(total_points_skewness, 3)}")
print(f"Kurtosis: {round(total_points_kurtosis, 3)}")
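
When reading these numbers, it helps to know which estimator is being used. pandas' `.skew()` and `.kurtosis()` return bias-corrected estimates, and the kurtosis is the excess kurtosis (0 for a normal distribution). The sketch below, on made-up values, shows that SciPy's functions give the same results once `bias=False` is passed; by default SciPy uses the biased estimators.

```python
import pandas as pd
from scipy.stats import kurtosis, skew

values = pd.Series([95, 110, 120, 130, 150, 180, 260])

# pandas: bias-corrected skewness and excess kurtosis.
pandas_skew = values.skew()
pandas_kurt = values.kurtosis()

# SciPy equivalents: bias=False applies the same correction,
# fisher=True reports excess kurtosis.
scipy_skew = skew(values.to_numpy(), bias=False)
scipy_kurt = kurtosis(values.to_numpy(), fisher=True, bias=False)

print(round(pandas_skew, 3), round(float(scipy_skew), 3))
print(round(pandas_kurt, 3), round(float(scipy_kurt), 3))
```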

Plotting:

In [None]:
# Creating the plot
plt.figure(figsize=(12, 6))
total_point_histogram = sns.histplot(data=combined_df, x="total_points")

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
total_point_histogram.set_xlabel("Total Number of Points Played in a Match")
total_point_histogram.set_ylabel("Number of Matches")

# Showing the plot
plt.show()

### Analyzing the w_1st_serve_in_percentage variable:

Descriptive statistics:

In [None]:
round(combined_df["w_1st_serve_in_percentage"].describe(), 2)

Skewness and Kurtosis:

In [None]:
w_1st_serve_in_percentage_skewness = combined_df["w_1st_serve_in_percentage"].skew()
w_1st_serve_in_percentage_kurtosis = combined_df["w_1st_serve_in_percentage"].kurtosis()

print(f"Skewness: {round(w_1st_serve_in_percentage_skewness, 3)}")
print(f"Kurtosis: {round(w_1st_serve_in_percentage_kurtosis, 3)}")

Plotting:

In [None]:
# Creating the plot
plt.figure(figsize=(12, 6))
w_1st_serve_in_percentage_histogram = sns.histplot(data=combined_df, x="w_1st_serve_in_percentage")

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
w_1st_serve_in_percentage_histogram.set_xlabel("Winner's First Serve In Percentage")
w_1st_serve_in_percentage_histogram.set_ylabel("Number of Players")
w_1st_serve_in_percentage_histogram.set_title("Winner's First Serve In Percentage Distribution")

# Showing the plot
plt.show()

### Analyzing the l_1st_serve_in_percentage variable:

Descriptive statistics:

In [None]:
round(combined_df["l_1st_serve_in_percentage"].describe(), 2)

Skewness and Kurtosis:

In [None]:
l_1st_serve_in_percentage_skewness = combined_df["l_1st_serve_in_percentage"].skew()
l_1st_serve_in_percentage_kurtosis = combined_df["l_1st_serve_in_percentage"].kurtosis()

print(f"Skewness: {round(l_1st_serve_in_percentage_skewness, 3)}")
print(f"Kurtosis: {round(l_1st_serve_in_percentage_kurtosis, 3)}")

Plotting:

In [None]:
# Creating the plot
plt.figure(figsize=(12, 6))
l_1st_serve_in_percentage_histogram = sns.histplot(data=combined_df, x="l_1st_serve_in_percentage")

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
l_1st_serve_in_percentage_histogram.set_xlabel("Loser's First Serve In Percentage")
l_1st_serve_in_percentage_histogram.set_ylabel("Number of Players")
l_1st_serve_in_percentage_histogram.set_title("Loser's First Serve In Percentage Distribution")

# Showing the plot
plt.show()

### Analyzing the w_1st_serve_winning_percentage variable:

Descriptive statistics:

In [None]:
round(combined_df["w_1st_serve_winning_percentage"].describe(), 2)

Skewness and Kurtosis:

In [None]:
w_1st_serve_winning_percentage_skewness = combined_df["w_1st_serve_winning_percentage"].skew()
w_1st_serve_winning_percentage_kurtosis = combined_df["w_1st_serve_winning_percentage"].kurtosis()

print(f"Skewness: {round(w_1st_serve_winning_percentage_skewness, 2)}")
print(f"Kurtosis: {round(w_1st_serve_winning_percentage_kurtosis, 2)}")

Plotting:

In [None]:
# Creating the plot
plt.figure(figsize=(12, 6))
w_1st_serve_winning_percentage_histogram = sns.histplot(data=combined_df, x="w_1st_serve_winning_percentage")

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
w_1st_serve_winning_percentage_histogram.set_xlabel("Winning On 1st Serve Percentage")
w_1st_serve_winning_percentage_histogram.set_ylabel("Number of Players")
w_1st_serve_winning_percentage_histogram.set_title("Winners First Serve Winning Percentage Distribution")

# Showing the plot
plt.show()

### Analyzing the l_1st_serve_winning_percentage variable:

Descriptive statistics:

In [None]:
round(combined_df["l_1st_serve_winning_percentage"].describe(), 2)

Skewness and Kurtosis:

In [None]:
l_1st_serve_winning_percentage_skewness = combined_df["l_1st_serve_winning_percentage"].skew()
l_1st_serve_winning_percentage_kurtosis = combined_df["l_1st_serve_winning_percentage"].kurtosis()

print(f"Skewness: {round(l_1st_serve_winning_percentage_skewness, 2)}")
print(f"Kurtosis: {round(l_1st_serve_winning_percentage_kurtosis, 2)}")

Plotting:

In [None]:
# Creating the plot
plt.figure(figsize=(12, 6))
l_1st_serve_winning_percentage_histogram = sns.histplot(data=combined_df, x="l_1st_serve_winning_percentage")

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
l_1st_serve_winning_percentage_histogram.set_xlabel("Winning On 1st Serve Percentage")
l_1st_serve_winning_percentage_histogram.set_ylabel("Number of Players")
l_1st_serve_winning_percentage_histogram.set_title("Losers First Serve Winning Percentage Distribution")

# Showing the plot
plt.show()

# Bivariate Analysis

## Finding trends in the numeric variables:

* The goal of the pairplot visualization is to examine the relationships between all the numeric variables in the dataset. Although a correlation matrix is often used for this, I prefer this visualization because it can also reveal non-linear relationships between variables.

Creating a pairplot visualization:

In [None]:
pairplot = sns.pairplot(multivariate_analysis_df, corner=True)
plt.show()

* I chose to create the correlation matrix in order to check the strength of the linear correlation between variables.  

Creating a correlation matrix:

In [None]:
corr_matrix = round(multivariate_analysis_df.corr(),2)

mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(20,12))
sns.heatmap(corr_matrix, annot=True, mask=mask, square=True, linewidths=2)
plt.show()

**Conclusions drawn from the pairplot visualization and correlation matrix:**  
* After examining the pairplot visualization, it seems that the only relationships present are linear.
* From examining the correlation matrix, it seems that there is a strong correlation between some variables. All of the strong correlations make sense. The correlation between the length of the match in minutes and the total number of points played in the match (*r=0.92*), for example, is easy to explain: if the match is longer, the players have more time to play more points.
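
With this many variables, the strongest pairs can be hard to spot in a heatmap. One possible way to list them is sketched below on synthetic data. The generated values and the 0.7 cut-off are assumptions made for the example, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: total_points depends on minutes, winner_age is independent.
rng = np.random.default_rng(0)
minutes = rng.normal(100, 30, 500)
demo = pd.DataFrame({
    "minutes": minutes,
    "total_points": minutes * 1.5 + rng.normal(0, 5, 500),
    "winner_age": rng.normal(26, 4, 500),
})

corr = demo.corr()

# Keep only the cells above the diagonal so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per variable pair, sorted by absolute correlation.
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)

# Illustrative threshold for a "strong" linear correlation.
strong_pairs = pairs[pairs.abs() >= 0.7]
print(strong_pairs)
```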

## Comparing the average number of total points played on each surface

Calculating the average total number of points played on each surface:

In [None]:
total_points_mean_by_surface_df = round(combined_df.groupby(["surface"], as_index=False)[["total_points"]].mean(), 2)
total_points_mean_by_surface_df.sort_values(by="total_points", ascending=False, inplace=True, ignore_index=True)
total_points_mean_by_surface_df

Plotting the distribution of the total number of points played on each surface:

In [None]:
# Creating the plot
plt.figure(figsize=(16,9))
points_by_surface_plot = sns.boxplot(combined_df, x="surface", y="total_points", order=total_points_mean_by_surface_df["surface"])

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
points_by_surface_plot.set_xlabel("Surface Type")
points_by_surface_plot.set_ylabel("Number of Points Played In A Match")
points_by_surface_plot.set_title("Total Points On Each Surface")

# Showing the plot
plt.show()

## Comparing the average match length for each event

Calculating the average total number of points played & the average length of matches by each tourney level:

In [None]:
average_duration_by_tourney_level_df = round(combined_df.groupby(["tourney_level"], as_index=False)[["minutes", "total_points"]].mean(), 2)
average_duration_by_tourney_level_df.sort_values(by="minutes", ascending=False, inplace=True, ignore_index=True)
average_duration_by_tourney_level_df

Plotting the distribution of match length by tourney level:

In [None]:
# Creating the plot
plt.figure(figsize=(16,9))
duration_by_tourney_level_plot = sns.boxplot(combined_df, x="tourney_level", y="minutes", order=average_duration_by_tourney_level_df["tourney_level"])

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
duration_by_tourney_level_plot.set_xlabel("Tourney Level")
duration_by_tourney_level_plot.set_ylabel("Match Duration, in Minutes")
duration_by_tourney_level_plot.set_title("Match Duration in Minutes by Tourney Level")

# Showing the plot
plt.show()

## First Serve Percentage Difference Between Winners and Losers:

Calculating the average winners' and losers' first serve percentage:

In [None]:
round(first_serve_in_melted_df.groupby(["winner_or_loser"], as_index=False).agg({"first_serve_percentage": ["mean", "median"]}), 2)
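
Because each row of the cleaned data holds the winner and the loser of the same match, the comparison can also be made within matches. The sketch below is a descriptive illustration on four made-up matches; it is not a statistical test.

```python
import pandas as pd

demo = pd.DataFrame({
    "w_1st_serve_in_percentage": [64.0, 58.0, 70.0, 61.0],
    "l_1st_serve_in_percentage": [60.0, 62.0, 65.0, 61.0],
})

# Per-match difference: positive means the winner landed more first serves.
diff = demo["w_1st_serve_in_percentage"] - demo["l_1st_serve_in_percentage"]

mean_diff = diff.mean()
winner_higher_share = (diff > 0).mean() * 100

print(f"Mean difference (winner - loser): {mean_diff:.2f} percentage points")
print(f"Winner had the higher first serve percentage in {winner_higher_share:.1f}% of matches")
```

A within-match difference controls for match-level conditions such as surface and event, which a comparison of two overall averages does not.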

Plotting the winners' and losers' first serve percentage distribution:

In [None]:
# Creating the plot
plt.figure(figsize=(16,9))
first_serve_percentage_winners_vs_losers_plot = sns.boxplot(data=combined_df[["w_1st_serve_in_percentage", "l_1st_serve_in_percentage"]])

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
first_serve_percentage_winners_vs_losers_plot.set_xlabel("Winner Or Loser")
first_serve_percentage_winners_vs_losers_plot.set_ylabel("1st Serve Percentage")
first_serve_percentage_winners_vs_losers_plot.set_title("First Serve Percentage Difference Between Winners and Losers")

# Showing the plot
plt.show()

## Comparing the 1st serve percentages per surface

Calculating the average and median winners' and losers' first serve percentage, per surface type:

In [None]:
first_serve_by_surface_aggregated_df = round(first_serve_in_melted_df.groupby(["surface", "winner_or_loser"], as_index=False).agg({"first_serve_percentage": ["mean", "median"]}), 2)
# The aggregated columns form a MultiIndex, so each sort key is given as a full tuple
first_serve_by_surface_aggregated_df.sort_values(by=[("surface", ""), ("first_serve_percentage", "median")], ascending=[True, False], inplace=True, ignore_index=True)
first_serve_by_surface_aggregated_df
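
Aggregating with a list of functions produces two-level column names. An alternative is pandas' named aggregation, which yields flat column names that are simpler to sort and plot. A minimal sketch on toy data (the values and the `mean_pct`/`median_pct` names are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "surface": ["Clay", "Clay", "Hard", "Hard"],
    "winner_or_loser": ["winner", "loser", "winner", "loser"],
    "first_serve_percentage": [64.0, 58.0, 62.0, 60.0],
})

summary = (
    demo.groupby(["surface", "winner_or_loser"], as_index=False)
    # Named aggregation: new_column=(source_column, function)
    .agg(
        mean_pct=("first_serve_percentage", "mean"),
        median_pct=("first_serve_percentage", "median"),
    )
    .sort_values(["surface", "median_pct"], ascending=[True, False], ignore_index=True)
)
print(summary)
```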

Plotting the winners' and losers' first serve percentage per surface type distribution:

In [None]:
# Creating the plot
plt.figure(figsize=(16,9))
first_serve_percentage_by_surface_plot = sns.boxplot(data=first_serve_in_melted_df, x="surface", y="first_serve_percentage", hue="winner_or_loser")

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
first_serve_percentage_by_surface_plot.set_xlabel("Surface Type")
first_serve_percentage_by_surface_plot.set_ylabel("1st Serve Percentage")
first_serve_percentage_by_surface_plot.set_title("First Serve Percentage Difference Between Winners and Losers, By Surface")

# Showing the plot
plt.show()

## Comparing the 1st serve percentages per event type

Calculating the average and median winners' and losers' first serve percentage, per tourney type:

In [None]:
first_serve_by_tourney_level_aggregated_df = round(first_serve_in_melted_df.groupby(["tourney_level", "winner_or_loser"], as_index=False).agg({"first_serve_percentage": ["mean", "median"]}), 2)
# The aggregated columns form a MultiIndex, so each sort key is given as a full tuple
first_serve_by_tourney_level_aggregated_df.sort_values(by=[("tourney_level", ""), ("first_serve_percentage", "median")], ascending=[True, False], inplace=True, ignore_index=True)
first_serve_by_tourney_level_aggregated_df

Plotting the winners' and losers' first serve percentage per tourney type distribution:

In [None]:
# Creating the plot
plt.figure(figsize=(16,9))
first_serve_percentage_by_event_type_plot = sns.boxplot(data=first_serve_in_melted_df, x="tourney_level", y="first_serve_percentage", hue="winner_or_loser")

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
first_serve_percentage_by_event_type_plot.set_xlabel("Tourney Type")
first_serve_percentage_by_event_type_plot.set_ylabel("1st Serve Percentage")
first_serve_percentage_by_event_type_plot.set_title("First Serve Percentage Difference Between Winners and Losers, By Tourney Type")

# Showing the plot
plt.show()

## Comparing wins on 1st serve between winners and losers

Calculating the average winners' and losers' wins on first serve percentage:

In [None]:
round(combined_df[["w_1st_serve_winning_percentage", "l_1st_serve_winning_percentage"]].mean(), 2)

Plotting the winners' and losers' wins on first serve percentage distribution:

In [None]:
# Creating the plot
plt.figure(figsize=(16,9))
winning_first_serve_percentage_winners_vs_losers_plot = sns.boxplot(data=combined_df[["w_1st_serve_winning_percentage", "l_1st_serve_winning_percentage"]])

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
winning_first_serve_percentage_winners_vs_losers_plot.set_xlabel("Winner Or Loser")
winning_first_serve_percentage_winners_vs_losers_plot.set_ylabel("Points Won on 1st Serve (%)")
winning_first_serve_percentage_winners_vs_losers_plot.set_title("First Serve Winning Percentage Difference Between Winners and Losers")

# Showing the plot
plt.show()

## Comparing the winning on 1st serve percentages per surface

Calculating the average and median winners' and losers' wins on first serve percentage, per surface type:

In [None]:
winning_first_serve_by_surface_aggregated_df = round(wins_on_first_serve_melted_df.groupby(["surface", "winner_or_loser"], as_index=False).agg({"first_serve_wins_percentage": ["mean", "median"]}), 2)
# The aggregated columns form a MultiIndex, so each sort key is given as a full tuple
winning_first_serve_by_surface_aggregated_df.sort_values(by=[("surface", ""), ("first_serve_wins_percentage", "median")], ascending=[True, False], inplace=True, ignore_index=True)
winning_first_serve_by_surface_aggregated_df

Plotting the winners' and losers' wins on first serve percentage per surface type distribution:

In [None]:
# Creating the plot
plt.figure(figsize=(16,9))
first_serve_percentage_by_surface_plot = sns.boxplot(data=wins_on_first_serve_melted_df, x="surface", y="first_serve_wins_percentage", hue="winner_or_loser")

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
first_serve_percentage_by_surface_plot.set_xlabel("Surface Type")
first_serve_percentage_by_surface_plot.set_ylabel("Points Won on 1st Serve (%)")
first_serve_percentage_by_surface_plot.set_title("First Serve Winning Percentage Difference Between Winners and Losers, By Surface")

# Showing the plot
plt.show()

## Comparing the winning on 1st serve percentages per event type

Calculating the average and median winners' and losers' winning on first serve percentage, per tourney level:

In [None]:
winning_first_serve_by_tourney_level_aggregated_df = round(wins_on_first_serve_melted_df.groupby(["tourney_level", "winner_or_loser"], as_index=False).agg({"first_serve_wins_percentage": ["mean", "median"]}), 2)
# The aggregated columns form a MultiIndex, so each sort key is given as a full tuple
winning_first_serve_by_tourney_level_aggregated_df.sort_values(by=[("tourney_level", ""), ("first_serve_wins_percentage", "median")], ascending=[True, False], inplace=True, ignore_index=True)
winning_first_serve_by_tourney_level_aggregated_df

Plotting the winners' and losers' wins on first serve percentage per tourney type distribution:

In [None]:
# Creating the plot
plt.figure(figsize=(16,9))
first_serve_percentage_by_event_type_plot = sns.boxplot(data=wins_on_first_serve_melted_df, x="tourney_level", y="first_serve_wins_percentage", hue="winner_or_loser")

# Removing the frame
sns.despine(left=True, bottom=True)

# Titles
first_serve_percentage_by_event_type_plot.set_xlabel("Tourney Type")
first_serve_percentage_by_event_type_plot.set_ylabel("Points Won on 1st Serve (%)")
first_serve_percentage_by_event_type_plot.set_title("First Serve Winning Percentage Difference Between Winners and Losers, By Tourney Type")

# Showing the plot
plt.show()