# What am I Investigating?
Tennis is a racquet sport played either individually against a single opponent (singles) or between two teams of two players each (doubles). Players use a strung racquet to strike a hollow rubber ball, covered with felt, over a net and into the opponent's court. The objective is to play the ball in such a way that the opponent cannot make a valid return. A player who fails to return the ball loses the point, and the opponent wins it.  
Tennis is played on different surfaces, including grass, clay, hard courts, and even indoor carpet. Each surface affects the ball's speed and bounce differently, leading to varied styles of play.  
*I'm interested in finding out whether there is a relationship between the surface a match is played on and its duration.*

# Importing Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

pd.set_option('display.max_columns', 50)

# Loading Data

In [2]:
folder_path = '/content/drive/MyDrive/Tennis_Analysis/full_matches_data'

# List all files in the directory with a .csv extension
all_files = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f)) and f.endswith('.csv')]

# Use a list comprehension to read each file into a dataframe and then concatenate them all
combined_df = pd.concat([pd.read_csv(os.path.join(folder_path, f)) for f in all_files], ignore_index=True)
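The loading step above could also be wrapped in a small helper. This is only a sketch: `load_csv_folder` is a hypothetical name, and the temporary files are made up so that the example runs without access to the original Drive folder. It sorts the file names so the row order is reproducible, and fails loudly if the folder contains no CSV files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Read every .csv file in `folder` into one DataFrame (hypothetical helper)."""
    csv_paths = sorted(Path(folder).glob("*.csv"))  # sorted for a reproducible row order
    if not csv_paths:
        raise FileNotFoundError(f"No .csv files found in {folder}")
    frames = [pd.read_csv(path) for path in csv_paths]
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "matches_a.csv").write_text("surface,minutes\nHard,90\n")
    Path(tmp, "matches_b.csv").write_text("surface,minutes\nClay,120\nGrass,75\n")
    Path(tmp, "notes.txt").write_text("not a csv")  # ignored by the *.csv pattern
    demo_df = load_csv_folder(tmp)

print(demo_df)
```

With the real data, the call would be `load_csv_folder(folder_path)`.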

In [3]:
combined_df.sample(10)

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
65977,2010-580,Australian Open,Hard,128,G,20100118,27,104386,,,Janko Tipsarevic,R,180.0,SRB,25.5,105992,,WC,Ryan Harrison,R,183.0,USA,17.7,6-2 6-4 7-6(3),5,R128,166.0,10.0,5.0,122.0,75.0,49.0,28.0,15.0,9.0,11.0,5.0,9.0,121.0,65.0,41.0,25.0,15.0,14.0,18.0,36.0,1060.0,341.0,119.0
42187,2002-506,Buenos Aires,Clay,32,A,20020218,13,103900,,,David Nalbandian,R,180.0,ARG,20.1,103656,,,Albert Montanes,R,175.0,ESP,21.2,6-2 6-4,3,R32,85.0,3.0,0.0,59.0,37.0,26.0,14.0,9.0,5.0,6.0,5.0,1.0,69.0,36.0,22.0,14.0,9.0,3.0,7.0,48.0,780.0,58.0,624.0
6866,1992-326,Brisbane,Hard,32,A,19920928,3,101398,,Q,Neil Borwick,L,193.0,AUS,25.0,102304,,WC,Grant Doyle,R,180.0,AUS,18.7,7-6(5) 2-6 6-4,3,R32,122.0,2.0,2.0,89.0,56.0,39.0,17.0,15.0,1.0,5.0,2.0,3.0,93.0,53.0,42.0,18.0,15.0,5.0,8.0,230.0,123.0,312.0,70.0
16593,1995-336,Hong Kong,Hard,32,A,19950417,2,102491,,Q,Alex Radulescu,R,185.0,GER,20.3,102088,,,Joern Renzenbrink,R,196.0,GER,22.7,6-1 6-3,3,R32,57.0,7.0,2.0,51.0,31.0,24.0,11.0,8.0,2.0,2.0,2.0,3.0,42.0,25.0,17.0,5.0,8.0,1.0,5.0,243.0,155.0,92.0,507.0
37842,2001-404,Indian Wells Masters,Hard,64,M,20010312,60,103720,6.0,,Lleyton Hewitt,R,180.0,AUS,20.0,102765,,,Nicolas Escude,R,185.0,FRA,24.9,6-1 6-3,3,QF,75.0,4.0,4.0,45.0,24.0,18.0,9.0,8.0,0.0,2.0,3.0,2.0,55.0,29.0,15.0,8.0,8.0,3.0,9.0,8.0,2385.0,36.0,945.0
58432,2007-520,Roland Garros,Clay,128,G,20070528,24,103900,15.0,,David Nalbandian,R,180.0,ARG,25.4,102703,,,Hyung Taik Lee,R,180.0,KOR,31.3,6-2 6-1 3-6 6-3,5,R128,138.0,5.0,4.0,111.0,71.0,50.0,23.0,16.0,9.0,11.0,3.0,3.0,110.0,70.0,43.0,14.0,17.0,7.0,14.0,18.0,1415.0,42.0,766.0
5227,1992-414,Hamburg Masters,Clay,56,M,19920504,7,101772,,,Andrei Cherkasov,R,180.0,RUS,21.8,101685,,WC,Markus Naewie,R,190.0,GER,22.3,6-4 6-3,3,R64,105.0,0.0,0.0,55.0,40.0,29.0,11.0,9.0,5.0,5.0,3.0,1.0,70.0,44.0,32.0,10.0,10.0,5.0,8.0,30.0,1048.0,77.0,480.0
39112,2001-533,Costa Do Sauipe,Hard,32,A,20010910,9,101897,5.0,,Fernando Meligeni,L,180.0,BRA,30.4,102239,,Q,Francisco Costa,R,180.0,BRA,28.2,6-1 6-2,3,R32,60.0,6.0,1.0,38.0,25.0,22.0,10.0,8.0,0.0,0.0,0.0,1.0,46.0,27.0,14.0,6.0,7.0,3.0,7.0,80.0,522.0,246.0,143.0
27622,1998-425,Barcelona,Clay,56,A,19980413,43,103103,,,Dominik Hrbaty,R,183.0,SVK,20.2,101320,14.0,,Magnus Gustafsson,R,185.0,SWE,31.2,5-7 6-4 6-3,3,R16,123.0,8.0,4.0,100.0,55.0,35.0,24.0,16.0,5.0,10.0,2.0,3.0,103.0,67.0,40.0,14.0,15.0,4.0,10.0,58.0,804.0,32.0,1187.0
40858,2002-360,Casablanca,Clay,32,A,20020408,20,102369,5.0,,Julien Boutter,R,190.0,FRA,28.0,101885,,,Wayne Arthurs,L,190.0,AUS,31.0,6-4 3-6 6-3,3,R16,124.0,3.0,11.0,99.0,58.0,39.0,20.0,14.0,8.0,11.0,9.0,7.0,89.0,55.0,38.0,15.0,14.0,3.0,7.0,61.0,645.0,97.0,408.0
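To make the research question concrete, here is a rough sketch of the kind of comparison it calls for: grouping matches by `surface` and summarizing `minutes`. The values are copied from the ten sampled rows above purely as an illustration. Ten rows, mixing best-of-3 and best-of-5 matches, are far too few to say anything about the actual relationship.

```python
import pandas as pd

# surface and minutes copied from the ten sampled rows shown above
sample_rows = pd.DataFrame({
    "surface": ["Hard", "Clay", "Hard", "Hard", "Hard",
                "Clay", "Clay", "Hard", "Clay", "Clay"],
    "minutes": [166.0, 85.0, 122.0, 57.0, 75.0,
                138.0, 105.0, 60.0, 123.0, 124.0],
})

# Number of matches and median duration per surface
duration_by_surface = (
    sample_rows.groupby("surface")["minutes"]
    .agg(["count", "median"])
)
print(duration_by_surface)
```

On the full dataset the same `groupby` would run on `combined_df`, ideally after handling the missing `minutes` values.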


# Understanding the Big Picture

Getting the number of rows and columns in the dataset (pre-cleaning):

In [4]:
number_of_rows = combined_df.shape[0]
number_of_columns = combined_df.shape[1]

print(f"The dataset has {number_of_rows} rows and {number_of_columns} columns.")

The dataset has 104682 rows and 49 columns.


Counting the number of missing values (NaN) in each column:

In [7]:
combined_df.isna().sum()

tourney_id                0
tourney_name              0
surface                   0
draw_size                 0
tourney_level             0
tourney_date              0
match_num                 0
winner_id                 0
winner_seed           52779
winner_entry          81836
winner_name               0
winner_hand               5
winner_ht               335
winner_ioc                0
winner_age                0
loser_id                  0
loser_seed            71636
loser_entry           73656
loser_name                0
loser_hand               24
loser_ht               1217
loser_ioc                 0
loser_age                15
score                     0
best_of                   0
round                     0
minutes                2894
w_ace                     0
w_df                      0
w_svpt                    0
w_1stIn                   0
w_1stWon                  0
w_2ndWon                  0
w_SvGms                   0
w_bpSaved                 0
w_bpFaced           
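Raw counts are easier to judge next to percentages. The following is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions, not part of the dataset) that reports both and sorts the columns by how much is missing.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "surface": ["Hard", "Clay", "Grass", "Hard"],
    "minutes": [90.0, np.nan, 75.0, np.nan],
    "winner_ht": [180.0, 185.0, np.nan, 190.0],
})

is_missing = toy_df.isna()
missing_summary = pd.DataFrame({
    "missing_count": is_missing.sum(),
    # mean of a boolean column is the share of True values
    "missing_pct": (is_missing.mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```

Replacing `toy_df` with `combined_df` would give the same summary for the real data.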

Checking if there are any duplicate rows:

In [None]:
number_of_duplicated_rows = combined_df.duplicated().sum()
print(f"There are {number_of_duplicated_rows} duplicated rows in the dataset.")

There are 0 duplicated rows in the dataset.
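`duplicated()` with no arguments only flags rows that are identical in every column. A stricter check, sketched below on made-up data, looks for repeated match identifiers instead. It assumes that `tourney_id` together with `match_num` identifies a match, which is an assumption and not something verified here.

```python
import pandas as pd

# Made-up rows: the first two share an identifier but differ in `minutes`
toy_df = pd.DataFrame({
    "tourney_id": ["2010-580", "2010-580", "2002-506"],
    "match_num": [27, 27, 13],
    "minutes": [166.0, 165.0, 85.0],
})

# Rows identical in every column
full_row_duplicates = int(toy_df.duplicated().sum())

# Rows that repeat an identifier seen earlier (assumed key columns)
key_columns = ["tourney_id", "match_num"]
key_duplicates = int(toy_df.duplicated(subset=key_columns).sum())

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"Duplicates on {key_columns}: {key_duplicates}")
```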


# Data Cleaning

Deleting matches whose statistics are missing:

In [8]:
# Creating a list of the names of the columns that hold the match statistics
match_data_columns_names = ["w_ace", "w_df", "w_svpt", "w_1stIn", "w_1stWon", "w_2ndWon", "w_SvGms", "w_bpSaved", "w_bpFaced", "l_ace", "l_df", "l_svpt", "l_1stIn", "l_1stWon", "l_2ndWon", "l_SvGms", "l_bpSaved", "l_bpFaced"]

# Dropping every row that has a missing value in any of the columns listed above
combined_df.dropna(subset=match_data_columns_names, inplace=True)

# Checking if it worked
combined_df[match_data_columns_names].isna().sum()

w_ace        0
w_df         0
w_svpt       0
w_1stIn      0
w_1stWon     0
w_2ndWon     0
w_SvGms      0
w_bpSaved    0
w_bpFaced    0
l_ace        0
l_df         0
l_svpt       0
l_1stIn      0
l_1stWon     0
l_2ndWon     0
l_SvGms      0
l_bpSaved    0
l_bpFaced    0
dtype: int64
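It can help to record how many rows such a step removes. Below is a minimal sketch on made-up data that uses the non-inplace form of `dropna` so the before and after sizes can be compared. Note that missing values outside the `subset` columns (here `minutes`) do not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "w_ace": [10.0, np.nan, 3.0, 7.0],
    "l_ace": [5.0, 2.0, np.nan, 4.0],
    "minutes": [166.0, 85.0, 122.0, np.nan],
})
stat_columns = ["w_ace", "l_ace"]

rows_before = len(toy_df)
cleaned_df = toy_df.dropna(subset=stat_columns)  # returns a new DataFrame
rows_removed = rows_before - len(cleaned_df)

print(f"Removed {rows_removed} of {rows_before} rows.")
```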

Deleting unnecessary columns:

In [None]:
names_of_columns_to_drop = ["winner_seed", "winner_entry", "loser_seed", "loser_entry", "match_num"]

# Reusing the list defined above instead of repeating the column names
combined_df.drop(columns=names_of_columns_to_drop, inplace=True)

combined_df.columns

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'winner_id', 'winner_name', 'winner_hand', 'winner_ht',
       'winner_ioc', 'winner_age', 'loser_id', 'loser_name', 'loser_hand',
       'loser_ht', 'loser_ioc', 'loser_age', 'score', 'best_of', 'round',
       'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon',
       'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt',
       'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced',
       'winner_rank', 'winner_rank_points', 'loser_rank', 'loser_rank_points'],
      dtype='object')

Converting the tourney_date column from values in `YYYYMMDD` format to pandas datetime values (displayed as `YYYY-MM-DD`):

In [None]:
combined_df['tourney_date'] = pd.to_datetime(combined_df['tourney_date'], format='%Y%m%d')
combined_df.head()
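As a small illustration of the conversion, the sketch below parses three `tourney_date` values taken from the sampled rows earlier. Casting to string first is a defensive choice of this sketch, not something the notebook does. Once parsed, the `.dt` accessor exposes parts such as the year.

```python
import pandas as pd

# tourney_date values copied from the sampled rows shown earlier
raw_dates = pd.Series([20100118, 20020218, 19920928])

# Cast to string so the format is matched character by character
parsed_dates = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")

years = parsed_dates.dt.year.tolist()
iso_dates = parsed_dates.dt.strftime("%Y-%m-%d").tolist()

print(years)
print(iso_dates)
```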

Creating new columns:

In [None]:
# Creating a column called "total_points": the total number of points played in the match
combined_df["total_points"] = combined_df["w_svpt"] + combined_df["l_svpt"]

combined_df.sample(7)
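The sketch below reproduces the calculation for two of the sampled rows shown earlier, plus one made-up row with a missing value. The made-up row shows that the sum becomes NaN when either input is missing, which is one reason to drop incomplete statistics before deriving new columns.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    # first two rows: w_svpt / l_svpt from the sampled rows shown earlier
    # third row: made up, with a missing value
    "w_svpt": [122.0, 59.0, 80.0],
    "l_svpt": [121.0, 69.0, np.nan],
})

# assign() returns a new DataFrame with the extra column
toy_df = toy_df.assign(total_points=toy_df["w_svpt"] + toy_df["l_svpt"])

print(toy_df)
```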

# Univariate Analysis

## Categorical Variables

Counting the number of matches played on each surface type (and plotting it):

In [None]:
combined_df.surface.value_counts()

Hard      47729
Clay      31085
Grass      9783
Carpet     5878
Name: surface, dtype: int64

In [None]:
plt.figure(figsize=(12,6))
surfaces_countplot = sns.countplot(data=combined_df, x="surface", order=combined_df.surface.value_counts().index)
sns.despine(left=True, bottom=True)
surfaces_countplot.bar_label(surfaces_countplot.containers[0], label_type='edge')
surfaces_countplot.set_xlabel("Type of Surface")
surfaces_countplot.set_ylabel("Number of Matches Played")
surfaces_countplot.set_title("Number of Matches Played Between 1991 and 2023* on each surface")
plt.show()

Counting the number of best-of-3 and best-of-5 matches (and plotting it):

In [None]:
combined_df.best_of.value_counts()

In [None]:
plt.figure(figsize=(12,6))
best_of_countplot = sns.countplot(data=combined_df, x="best_of", order=combined_df.best_of.value_counts().index)
sns.despine(left=True, bottom=True)
best_of_countplot.bar_label(best_of_countplot.containers[0], label_type='edge')
best_of_countplot.set_xlabel("Best of How Many Sets")
best_of_countplot.set_ylabel("Number of Matches")
best_of_countplot.set_title("Number of Matches Played Between 1991 and 2023")
plt.show()
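Both count plots above follow the same recipe, so the steps could be collected in one function. The sketch below is one possible way to do that, not the notebook's own method: `labeled_count_bar` is a hypothetical helper, it uses plain matplotlib bars rather than `sns.countplot`, and the data is made up. Because the bars and their labels come from the same call, a plot cannot pick up the labels of a different plot.

```python
import matplotlib.pyplot as plt
import pandas as pd


def labeled_count_bar(df, column, xlabel, ylabel, title):
    """Bar chart of value counts with a label on each bar (hypothetical helper)."""
    counts = df[column].value_counts()  # sorted from most to least frequent
    fig, ax = plt.subplots(figsize=(12, 6))
    bars = ax.bar(counts.index.astype(str), counts.to_numpy())
    labels = ax.bar_label(bars, label_type="edge")  # labels tied to these bars
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    return ax, counts, labels


# Made-up data for illustration only
toy_df = pd.DataFrame({"surface": ["Hard", "Hard", "Hard", "Clay", "Clay", "Grass"]})
ax, counts, labels = labeled_count_bar(
    toy_df, "surface", "Type of Surface", "Number of Matches Played", "Toy example"
)
```

In the notebook, the same function could be called once for `surface` and once for `best_of`, followed by `plt.show()`.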

In [None]:
combined_df.winner_hand.value_counts()

In [None]:
combined_df.loser_hand.value_counts()

## Numerical Variables

Analyzing the total_points variable:

In [None]:
combined_df.total_points.describe()

In [None]:
plt.figure(figsize=(12, 6))
total_point_histogram = sns.histplot(data=combined_df, x="total_points")
sns.despine(left=True, bottom=True)
total_point_histogram.set_xlabel("Total Number of Points Played in a Match")
total_point_histogram.set_ylabel("Number of Matches")
plt.show()