In [116]:
# Visual environment settings (just stretches the notebook to the screen width)
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

In [117]:
import pandas as pd

In [118]:
df_matches_2018 = pd.read_csv("ASM_PZ2_podaci_2021/atp_matches_2018.csv")
df_matches_2019 = pd.read_csv("ASM_PZ2_podaci_2021/atp_matches_2019.csv")
df_matches_2020 = pd.read_csv("ASM_PZ2_podaci_2021/atp_matches_2020.csv")
df_players = pd.read_csv("ASM_PZ2_podaci_2021/atp_players.csv", header=None, names=['player_id', 'first_name', 'last_name', 'hand', 'birth_date', 'country_code'])
df_rankings_10s = pd.read_csv("ASM_PZ2_podaci_2021/atp_rankings_10s.csv")
df_rankings_current = pd.read_csv("ASM_PZ2_podaci_2021/atp_rankings_current.csv", header=None, names=['ranking_date', 'rank', 'player', 'points'])

df_matches_2018.name = "matches 2018"
df_matches_2019.name = "matches 2019"
df_matches_2020.name = "matches 2020"
df_players.name = "players"
df_rankings_10s.name = "rankings 10s"
df_rankings_current.name = "rankings current"
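
Two of the files above have no header row, so `header=None` and `names=` are what give them column labels. A minimal sketch on a hypothetical two-row sample (the inline CSV text and the `df_sample` name are illustrative, not part of the dataset). It also shows `DataFrame.attrs`, pandas' experimental metadata dictionary, as one possible alternative to setting a plain `.name` attribute, which is an ordinary Python attribute and is not carried over to new frames returned by pandas operations.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as atp_players.csv (no header row).
sample_csv = io.StringIO(
    "100001,Gardnar,Mulloy,R,19131122,USA\n"
    "100002,Pancho,Segura,R,19210620,ECU\n"
)
player_columns = ['player_id', 'first_name', 'last_name',
                  'hand', 'birth_date', 'country_code']

# header=None: the first line is data; names: the labels to use instead.
df_sample = pd.read_csv(sample_csv, header=None, names=player_columns)

# attrs is a dictionary for dataset-level metadata (experimental in pandas).
df_sample.attrs['name'] = "players sample"

print(df_sample)
print(df_sample.attrs['name'])
```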

## ATP Tennis Rankings, Results, and Stats

This contains my master ATP player file, historical rankings, results, and match stats.

The player file columns are player_id, first_name, last_name, hand, birth_date, country_code.

The columns for the ranking files are ranking_date, ranking, player_id, ranking_points (where available).

ATP rankings are mostly complete from 1985 to the present. 1982 is missing, and rankings from 1973-1984 are only intermittent.

Results and stats: There are up to three files per season: One for tour-level main draw matches (e.g. 'atp_matches_2014.csv'), one for tour-level qualifying and challenger main-draw matches, and one for futures matches.

Most of the columns in the results files are self-explanatory. I've also included a matches_data_dictionary.txt file to spell things out a bit more.

To make the results files easier for more people to use, I've included a fair bit of redundancy with the biographical and ranking files: each row contains several columns of biographical information, along with ranking and ranking points, for both players. Ranking data, as well as age, are as of tourney_date, which is almost always the Monday at or near the beginning of the event.

MatchStats are included where I have them. In general, that means 1991-present for tour-level matches, 2008-present for challengers, and 2011-present for tour-level qualifying. The MatchStats columns should be self-explanatory, but they might not be what you're used to seeing; it's all integer totals (e.g. 1st serves in, not 1st serve percentage), from which traditional percentages can be calculated.
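
As a rough illustration of how the integer totals turn into the usual percentages, here is a sketch with made-up numbers for two matches. The column names follow the dataset; the second-serve denominator `w_svpt - w_1stIn` (which includes double faults) is an assumption about the usual convention, not something the text above states.

```python
import pandas as pd

# Made-up serve totals for two matches, using the dataset's column names.
stats = pd.DataFrame({
    'w_svpt':   [80, 60],   # serve points played
    'w_1stIn':  [48, 45],   # first serves in
    'w_1stWon': [36, 30],   # points won behind the first serve
    'w_2ndWon': [16, 9],    # points won behind the second serve
})

# Share of serve points where the first serve landed in.
first_serve_pct = stats['w_1stIn'] / stats['w_svpt']

# Share of first-serve points won.
first_serve_won_pct = stats['w_1stWon'] / stats['w_1stIn']

# Assumed convention: second-serve points = all serve points minus first serves in.
second_serve_points = stats['w_svpt'] - stats['w_1stIn']
second_serve_won_pct = stats['w_2ndWon'] / second_serve_points

print(pd.DataFrame({
    '1st_in': first_serve_pct,
    '1st_won': first_serve_won_pct,
    '2nd_won': second_serve_won_pct,
}))
```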

There are some tour-level matches with missing stats. Some are missing because ATP doesn't have them. Others I've deleted because they didn't pass some sanity check (loser won 60% of points, or match time was under 20 minutes, etc). Also, Davis Cup matches are included in the tour-level files, but there are no stats for Davis Cup matches until the last few seasons.

# Doubles

I've added tour-level doubles back to 2000. Filenames follow the convention atp_matches_doubles_yyyy.csv. I may eventually be able to add tour-level doubles from before 2000, as well as lower-level doubles for some years. Most of the columns are the same, though in a different order.

# Attention

Please read, understand, and abide by the license below. It seems like a reasonable thing to ask, given the hundreds of hours I've put into amassing and maintaining this dataset. Unfortunately, a few bad apples have violated the license, and when people do that, it makes me considerably less motivated to continue updating.

Also, if you're using this for academic/research purposes (great!), take a minute and cite it properly. It's not that hard, it helps others find a useful resource, and let's face it, you should be doing it anyway.

# License

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Dataset" property="dct:title" rel="dct:type">Tennis databases, files, and algorithms</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://www.tennisabstract.com/" property="cc:attributionName" rel="cc:attributionURL">Jeff Sackmann / Tennis Abstract</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.<br />Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/JeffSackmann" rel="dct:source">https://github.com/JeffSackmann</a>.

In other words: Attribution is required. Non-commercial use only. 


* Many of the columns in the 'matches' files are self-explanatory, or are very similar to previous columns.

tourney_id
- a unique identifier for each tournament, such as 2020-888. The exact formats are borrowed from several different sources, so while the first four characters are always the year, the rest of the ID doesn't follow a predictable structure.

tourney_name
surface
draw_size
- number of players in the draw, often rounded up to the nearest power of 2. (For instance, a tournament with 28 players may be shown as 32.)

tourney_level
- For men: 'G' = Grand Slams, 'M' = Masters 1000s, 'A' = other tour-level events, 'C' = Challengers, 'S' = Satellites/ITFs, 'F' = Tour finals and other season-ending events, and 'D' = Davis Cup 
- For women, there are several additional tourney_level codes, including 'P' = Premier, 'PM' = Premier Mandatory, and 'I' = International. The various levels of ITFs are given by the prize money (in thousands), such as '15' = ITF $15,000. Other codes, such as 'T1' for Tier I (and so on) are used for older WTA tournament designations. 'D' is used for Federation/Fed/Billie Jean King Cup, and also for Wightman Cup and Bonne Bell Cup.

- Others, eventually for both genders: 'E' = exhibition (events not sanctioned by the tour, though the definitions can be ambiguous), 'J' = juniors, and 'T' = team tennis, which does not yet appear anywhere in the dataset but will at some point.

tourney_date
- eight digits, YYYYMMDD, usually the Monday of the tournament week.

match_num
- a match-specific identifier. Often starting from 1, sometimes counting down from 300, and sometimes arbitrary. 

winner_id
- the player_id used in this repo for the winner of the match

winner_seed
winner_entry
- 'WC' = wild card, 'Q' = qualifier, 'LL' = lucky loser, 'PR' = protected ranking, 'ITF' = ITF entry, and there are a few others that are occasionally used.

winner_name
winner_hand
winner_ht
- height in centimeters, where available

winner_ioc
- three-character country code

winner_age
- age, in years, as of the tourney_date

loser_id
loser_seed
loser_entry
loser_name
loser_hand
loser_ht
loser_ioc
loser_age
score
best_of
- '3' or '5', indicating the number of sets for this match

round
minutes
- match length, where available

w_ace
- winner's number of aces
w_df
- winner's number of double faults
w_svpt
- winner's number of serve points
w_1stIn
- winner's number of first serves made
w_1stWon
- winner's number of first-serve points won
w_2ndWon
- winner's number of second-serve points won
w_SvGms
- winner's number of serve games
w_bpSaved
- winner's number of break points saved
w_bpFaced
- winner's number of break points faced

l_ace
l_df
l_svpt
l_1stIn
l_1stWon
l_2ndWon
l_SvGms
l_bpSaved
l_bpFaced

winner_rank
- winner's ATP or WTA rank, as of the tourney_date, or the most recent ranking date before the tourney_date
winner_rank_points
- number of ranking points, where available
loser_rank
loser_rank_points
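
The `tourney_date` field described above is an eight-digit integer, so it has to be parsed before it can be used as a date. A small sketch, using three `tourney_date` values that appear in the tables further down; the variable names are illustrative.

```python
import pandas as pd

# Three tourney_date values taken from the match tables in this notebook.
dates = pd.Series([20180101, 20181112, 20200306], name='tourney_date')

# Convert to strings first so the explicit YYYYMMDD format can be applied.
parsed = pd.to_datetime(dates.astype(str), format='%Y%m%d')

# Regular tour events usually start on a Monday; Davis Cup ties may not.
weekday = parsed.dt.day_name()

print(pd.DataFrame({'tourney_date': dates, 'parsed': parsed, 'weekday': weekday}))
```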

* _doubles_ files notes

The matches_doubles files have similar columns, though not all in the same order.

The identifying information for each player refers to 'winner1', 'winner2', 'loser1', and 'loser2'. The labels 1 and 2 are not assigned for any particular reason.

In general, the tournament IDs for doubles results are the same as for singles results (so, for instance, you can see which players entered both draws at the same event), though this is not guaranteed for every single tournament, since some of the data came from different sources.

The stats columns ('w_ace' etc) are per *team*, not per player. That's a function of how tennis stats are typically recorded, not a decision on my part.

In [119]:
df_matches_2018.name = "matches_2018"
df_matches_2018

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,2018-M020,Brisbane,Hard,32,A,20180101,271,105992,,,...,47.0,33.0,19.0,14.0,1.0,4.0,47.0,1010.0,52.0,909.0
1,2018-M020,Brisbane,Hard,32,A,20180101,272,111577,,,...,41.0,25.0,7.0,9.0,7.0,11.0,54.0,890.0,94.0,593.0
2,2018-M020,Brisbane,Hard,32,A,20180101,273,104797,,,...,53.0,37.0,29.0,15.0,10.0,16.0,63.0,809.0,30.0,1391.0
3,2018-M020,Brisbane,Hard,32,A,20180101,275,200282,,WC,...,43.0,33.0,17.0,11.0,4.0,6.0,208.0,245.0,44.0,1055.0
4,2018-M020,Brisbane,Hard,32,A,20180101,276,111581,,Q,...,35.0,28.0,5.0,9.0,0.0,2.0,175.0,299.0,68.0,755.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2884,2018-0605,Tour Finals,Hard,8,F,20181112,299,104925,1.0,,...,33.0,23.0,7.0,8.0,5.0,9.0,1.0,8045.0,6.0,4310.0
2885,2018-0605,Tour Finals,Hard,8,F,20181112,300,100644,3.0,,...,34.0,25.0,8.0,10.0,2.0,6.0,5.0,5085.0,1.0,8045.0
2886,2018-M-DC-2018-WG-M-FRA-CRO-01,Davis Cup WG F: FRA vs CRO,Clay,4,D,20181123,1,106432,,,...,64.0,43.0,18.0,15.0,9.0,13.0,12.0,2480.0,40.0,1050.0
2887,2018-M-DC-2018-WG-M-FRA-CRO-01,Davis Cup WG F: FRA vs CRO,Clay,4,D,20181123,2,105227,,,...,57.0,39.0,18.0,15.0,1.0,4.0,7.0,4250.0,259.0,200.0


In [120]:
df_matches_2019.name = "matches_2019"
df_matches_2019

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,2019-M020,Brisbane,Hard,32,A,20181231,300,105453,2.0,,...,54.0,34.0,20.0,14.0,10.0,15.0,9.0,3590.0,16.0,1977.0
1,2019-M020,Brisbane,Hard,32,A,20181231,299,106421,4.0,,...,52.0,36.0,7.0,10.0,10.0,13.0,16.0,1977.0,239.0,200.0
2,2019-M020,Brisbane,Hard,32,A,20181231,298,105453,2.0,,...,27.0,15.0,6.0,8.0,1.0,5.0,9.0,3590.0,40.0,1050.0
3,2019-M020,Brisbane,Hard,32,A,20181231,297,104542,,PR,...,60.0,38.0,9.0,11.0,4.0,6.0,239.0,200.0,31.0,1298.0
4,2019-M020,Brisbane,Hard,32,A,20181231,296,106421,4.0,,...,56.0,46.0,19.0,15.0,2.0,4.0,16.0,1977.0,18.0,1855.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2776,2019-M-DC-2019-QLS-M-SUI-RUS-01,Davis Cup QLS R1: SUI vs RUS,Hard,4,D,20190201,2,111575,,,...,47.0,32.0,14.0,10.0,5.0,8.0,11.0,2880.0,362.0,56.0
2777,2019-M-DC-2019-QLS-M-SUI-RUS-01,Davis Cup QLS R1: SUI vs RUS,Hard,4,D,20190201,4,111575,,,...,94.0,65.0,27.0,17.0,15.0,17.0,11.0,2880.0,142.0,389.0
2778,2019-M-DC-2019-QLS-M-SWE-COL-01,Davis Cup QLS R1: SWE vs COL,Clay,4,D,20190201,1,105053,,,...,33.0,20.0,7.0,9.0,1.0,5.0,251.0,190.0,116.0,485.0
2779,2019-M-DC-2019-QLS-M-SWE-COL-01,Davis Cup QLS R1: SWE vs COL,Clay,4,D,20190201,2,123755,,,...,31.0,14.0,9.0,7.0,2.0,6.0,228.0,224.0,194.0,267.0


In [121]:
df_matches_2020.name = "matches_2020"
df_matches_2020

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,2020-8888,Atp Cup,Hard,24,A,20200106,300,104925,,,...,51.0,39.0,6.0,10.0,6.0,8.0,2.0,9055.0,1.0,9985.0
1,2020-8888,Atp Cup,Hard,24,A,20200106,299,105138,,,...,35.0,21.0,6.0,9.0,5.0,10.0,10.0,2335.0,34.0,1251.0
2,2020-8888,Atp Cup,Hard,24,A,20200106,298,104925,,,...,57.0,35.0,25.0,14.0,6.0,11.0,2.0,9055.0,5.0,5705.0
3,2020-8888,Atp Cup,Hard,24,A,20200106,297,105583,,,...,54.0,39.0,14.0,12.0,0.0,1.0,34.0,1251.0,17.0,1840.0
4,2020-8888,Atp Cup,Hard,24,A,20200106,296,104745,,,...,55.0,37.0,10.0,14.0,1.0,5.0,1.0,9985.0,18.0,1775.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1442,2020-M-DC-2020-WG2-PO-POL-HKG-01,Davis Cup WG2 PO: POL vs HKG,Hard,4,D,20200306,2,105668,,,...,,,,,,,461.0,68.0,960.0,11.0
1443,2020-M-DC-2020-WG2-PO-POL-HKG-01,Davis Cup WG2 PO: POL vs HKG,Hard,4,D,20200306,4,209874,,,...,,,,,,,,,,
1444,2020-M-DC-2020-WG2-PO-SYR-ZIM-01,Davis Cup WG2 PO: SYR vs ZIM,Hard,4,D,20200306,1,208518,,,...,,,,,,,,,813.0,18.0
1445,2020-M-DC-2020-WG2-PO-SYR-ZIM-01,Davis Cup WG2 PO: SYR vs ZIM,Hard,4,D,20200306,2,111761,,,...,,,,,,,430.0,79.0,,


In [122]:
df_players.name = "players"
df_players

Unnamed: 0,player_id,first_name,last_name,hand,birth_date,country_code
0,100001,Gardnar,Mulloy,R,19131122.0,USA
1,100002,Pancho,Segura,R,19210620.0,ECU
2,100003,Frank,Sedgman,R,19271002.0,AUS
3,100004,Giuseppe,Merlo,R,19271011.0,ITA
4,100005,Richard Pancho,Gonzales,R,19280509.0,USA
...,...,...,...,...,...,...
54970,209936,Lorenzo,Claverie,U,,VEN
54971,209937,Giorgio,Tabacco,U,,ITA
54972,209938,Constantinos,Koshis,U,,CYP
54973,209939,Lluc,Mir Anglada,U,,ESP


In [123]:
df_rankings_10s.name = "rankings_10s"
df_rankings_10s

Unnamed: 0,ranking_date,rank,player,points
0,20100104,1,103819,10550.0
1,20100104,2,104745,9205.0
2,20100104,3,104925,8310.0
3,20100104,4,104918,7030.0
4,20100104,5,105223,6785.0
...,...,...,...,...
916291,20191230,1922,134833,1.0
916292,20191230,1922,144856,1.0
916293,20191230,1922,202326,1.0
916294,20191230,1926,207307,1.0


In [124]:
df_rankings_current.name = "rankings_current"
df_rankings_current

Unnamed: 0,ranking_date,rank,player,points
0,20200106,1,104745,9985
1,20200106,2,104925,9055
2,20200106,3,103819,6590
3,20200106,4,106233,5825
4,20200106,5,106421,5705
...,...,...,...,...
51152,20201221,896,209857,15
51153,20201221,897,103565,15
51154,20201221,898,125800,15
51155,20201221,899,132575,15


### Print info.

In [125]:
df_list = [df_matches_2018, df_matches_2019, df_matches_2020, df_players, df_rankings_10s, df_rankings_current]

for df in df_list:
    print(df.name)
    print("\nall fields")
    print(df.columns)
    print("\nfields with NA")
    print(df.columns[df.isna().any()].tolist())
    print("\n")

matches_2018

all fields
Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'loser_id', 'loser_seed', 'loser_entry', 'loser_name', 'loser_hand',
       'loser_ht', 'loser_ioc', 'loser_age', 'score', 'best_of', 'round',
       'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon',
       'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt',
       'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced',
       'winner_rank', 'winner_rank_points', 'loser_rank', 'loser_rank_points'],
      dtype='object')

fields with NA
['winner_seed', 'winner_entry', 'winner_hand', 'winner_ht', 'loser_seed', 'loser_entry', 'loser_hand', 'loser_ht', 'loser_age', 'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_

### Clean matches dataframes.

In [126]:
columns = ['draw_size', 'tourney_level', 'score', 'best_of', 'round', 'minutes',
           'winner_seed', 'winner_entry', 'winner_hand', 'winner_ht', 'winner_age', 'winner_rank_points',
           'loser_seed', 'loser_entry', 'loser_hand', 'loser_ht', 'loser_age', 'loser_rank_points',
           'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 
           'l_ace', 'l_df', 'l_svpt', 'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced']
# errors='ignore' so that drop may be repeated
df_matches_2018.drop(columns, axis=1, errors='ignore', inplace=True)
df_matches_2019.drop(columns, axis=1, errors='ignore', inplace=True)
df_matches_2020.drop(columns, axis=1, errors='ignore', inplace=True)
df_matches_2018.fillna(value='Unknown', axis=0, inplace=True)
df_matches_2019.fillna(value='Unknown', axis=0, inplace=True)
df_matches_2020.fillna(value='Unknown', axis=0, inplace=True)
df_matches_2018

Unnamed: 0,tourney_id,tourney_name,surface,tourney_date,match_num,winner_id,winner_name,winner_ioc,loser_id,loser_name,loser_ioc,winner_rank,loser_rank
0,2018-M020,Brisbane,Hard,20180101,271,105992,Ryan Harrison,USA,104919,Leonardo Mayer,ARG,47,52
1,2018-M020,Brisbane,Hard,20180101,272,111577,Jared Donaldson,USA,111442,Jordan Thompson,AUS,54,94
2,2018-M020,Brisbane,Hard,20180101,273,104797,Denis Istomin,UZB,106000,Damir Dzumhur,BIH,63,30
3,2018-M020,Brisbane,Hard,20180101,275,200282,Alex De Minaur,AUS,105449,Steve Johnson,USA,208,44
4,2018-M020,Brisbane,Hard,20180101,276,111581,Michael Mmoh,USA,105643,Federico Delbonis,ARG,175,68
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2884,2018-0605,Tour Finals,Hard,20181112,299,104925,Novak Djokovic,SRB,104731,Kevin Anderson,RSA,1,6
2885,2018-0605,Tour Finals,Hard,20181112,300,100644,Alexander Zverev,GER,104925,Novak Djokovic,SRB,5,1
2886,2018-M-DC-2018-WG-M-FRA-CRO-01,Davis Cup WG F: FRA vs CRO,Clay,20181123,1,106432,Borna Coric,CRO,104871,Jeremy Chardy,FRA,12,40
2887,2018-M-DC-2018-WG-M-FRA-CRO-01,Davis Cup WG F: FRA vs CRO,Clay,20181123,2,105227,Marin Cilic,CRO,104542,Jo-Wilfried Tsonga,FRA,7,259


### Concatenate matches dataframes into df_matches; add id column.

In [127]:
df_matches = pd.concat([df_matches_2018, df_matches_2019, df_matches_2020], ignore_index=True)
df_matches.name = "matches_all"
df_matches_2018['id'] = df_matches_2018.index
df_matches_2019['id'] = df_matches_2019.index
df_matches_2020['id'] = df_matches_2020.index
df_matches['id'] = df_matches.index
df_matches

Unnamed: 0,tourney_id,tourney_name,surface,tourney_date,match_num,winner_id,winner_name,winner_ioc,loser_id,loser_name,loser_ioc,winner_rank,loser_rank,id
0,2018-M020,Brisbane,Hard,20180101,271,105992,Ryan Harrison,USA,104919,Leonardo Mayer,ARG,47,52,0
1,2018-M020,Brisbane,Hard,20180101,272,111577,Jared Donaldson,USA,111442,Jordan Thompson,AUS,54,94,1
2,2018-M020,Brisbane,Hard,20180101,273,104797,Denis Istomin,UZB,106000,Damir Dzumhur,BIH,63,30,2
3,2018-M020,Brisbane,Hard,20180101,275,200282,Alex De Minaur,AUS,105449,Steve Johnson,USA,208,44,3
4,2018-M020,Brisbane,Hard,20180101,276,111581,Michael Mmoh,USA,105643,Federico Delbonis,ARG,175,68,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7112,2020-M-DC-2020-WG2-PO-POL-HKG-01,Davis Cup WG2 PO: POL vs HKG,Hard,20200306,2,105668,Jerzy Janowicz,POL,106388,Pak Long Yeung,HKG,461,960,7112
7113,2020-M-DC-2020-WG2-PO-POL-HKG-01,Davis Cup WG2 PO: POL vs HKG,Hard,20200306,4,209874,Maks Kasnikowski,POL,207852,Wai Yu Kai,HKG,Unknown,Unknown,7113
7114,2020-M-DC-2020-WG2-PO-SYR-ZIM-01,Davis Cup WG2 PO: SYR vs ZIM,Hard,20200306,1,208518,Hazem Naw,UNK,200250,Mehluli Don Ayanda Sibanda,ZIM,Unknown,813,7114
7115,2020-M-DC-2020-WG2-PO-SYR-ZIM-01,Davis Cup WG2 PO: SYR vs ZIM,Hard,20200306,2,111761,Benjamin Lock,ZIM,200181,Amer Naow,SYR,430,Unknown,7115
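
Because missing ranks were filled with the string 'Unknown', the rank columns now mix numbers and text. If numeric ranks are needed later, one possible way to recover them is `pd.to_numeric` with `errors='coerce'`. The sketch below uses the four rank pairs from the last rows of the table above; the `ranks` and `numeric_ranks` names are illustrative.

```python
import pandas as pd

# Rank values copied from the last four rows of df_matches shown above.
ranks = pd.DataFrame({
    'winner_rank': [461.0, 'Unknown', 'Unknown', 430.0],
    'loser_rank':  [960.0, 'Unknown', 813.0, 'Unknown'],
})

# errors='coerce' turns anything that is not a number (here 'Unknown') into NaN.
numeric_ranks = ranks.apply(pd.to_numeric, errors='coerce')

print(numeric_ranks)
print(numeric_ranks['winner_rank'].mean())  # mean over the known ranks only
```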


### Drop pre-2018 rows and NaN from df_rankings_10s (df_rankings_current has no NaN).

In [128]:
df_rankings_10s.drop(df_rankings_10s[df_rankings_10s['ranking_date'] < 20180101].index, inplace=True)
print(df_rankings_10s.isnull().sum(axis=0))
df_rankings_10s.dropna(inplace=True)
df_rankings_10s

ranking_date      0
rank              0
player            0
points          170
dtype: int64


Unnamed: 0,ranking_date,rank,player,points
764191,20180101,1,104745,10645.0
764192,20180101,2,103819,9605.0
764193,20180101,3,105777,5150.0
764194,20180101,4,100644,4610.0
764195,20180101,5,106233,4015.0
...,...,...,...,...
916291,20191230,1922,134833,1.0
916292,20191230,1922,144856,1.0
916293,20191230,1922,202326,1.0
916294,20191230,1926,207307,1.0
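
The same date filter and NaN check can also be written with a boolean mask, which leaves the original frame untouched. A sketch on a hypothetical four-row ranking table (all values below are made up for illustration):

```python
import pandas as pd

# Hypothetical ranking rows; one is before 2018 and one has no points.
rankings = pd.DataFrame({
    'ranking_date': [20171225, 20180101, 20180101, 20180108],
    'rank':         [1, 1, 2, 1],
    'player':       [104745, 104745, 103819, 104745],
    'points':       [10645.0, 10645.0, None, 10600.0],
})

# Keep only rows dated 2018-01-01 or later.
recent = rankings[rankings['ranking_date'] >= 20180101]

# Count missing values per column before deciding what to drop.
missing_per_column = recent.isna().sum()

# Drop rows with any missing value and renumber the index.
cleaned = recent.dropna().reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```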


### Concatenate df_rankings_10s and df_rankings_current into df_rankings.

In [129]:
df_rankings = pd.concat([df_rankings_10s, df_rankings_current], ignore_index=True)
df_rankings.name = "rankings"
df_rankings['id'] = df_rankings.index
df_rankings

Unnamed: 0,ranking_date,rank,player,points,id
0,20180101,1,104745,10645.0,0
1,20180101,2,103819,9605.0,1
2,20180101,3,105777,5150.0,2
3,20180101,4,100644,4610.0,3
4,20180101,5,106233,4015.0,4
...,...,...,...,...,...
203087,20201221,896,209857,15.0,203087
203088,20201221,897,103565,15.0,203088
203089,20201221,898,125800,15.0,203089
203090,20201221,899,132575,15.0,203090


### Drop players not appearing in df_matches and unnecessary columns from df_players.

In [130]:
df_players.rename(columns={"player_id": "id"}, inplace=True)

# def unique_players_in_matches(df):
#     winners_unique = set(df['winner_id'].unique())
#     losers_unique = set(df['loser_id'].unique())
#     total_unique = winners_unique.union(losers_unique)
#     return total_unique
    
winners_unique = set(df_matches['winner_id'].unique())
losers_unique = set(df_matches['loser_id'].unique())
unique_players_in_matches = winners_unique.union(losers_unique)
df_players.drop(df_players[~df_players['id'].isin(unique_players_in_matches)].index, inplace=True)

df_players.drop(['hand', 'birth_date'], axis=1, inplace=True, errors='ignore')
df_players

Unnamed: 0,id,first_name,last_name,country_code
643,100644,Alexander,Zverev,GER
3332,103333,Ivo,Karlovic,CRO
3498,103499,Aqeel,Khan,PAK
3528,103529,Aisam Ul Haq,Qureshi,PAK
3564,103565,Stephane,Robert,FRA
...,...,...,...,...
54904,209870,Gunawan,Trismuwantara,INA
54905,209871,Amr Elsayed Abdou Ahmed,Mohamed,EGY
54906,209872,Thehan Sanjaya,Wijemanne,SRI
54907,209873,Eric Jr.,Olivarez,PHI
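
To make the filtering step concrete, here is a toy version with two matches and four players. The player IDs are real IDs from the tables above, but the match rows themselves are made up.

```python
import pandas as pd

# Made-up matches between three players.
matches = pd.DataFrame({
    'winner_id': [100644, 104925],
    'loser_id':  [104925, 106233],
})

# Four players; 100001 never appears in the matches above.
players = pd.DataFrame({
    'id':        [100001, 100644, 104925, 106233],
    'last_name': ['Mulloy', 'Zverev', 'Djokovic', 'Thiem'],
})

# Everyone who appears as a winner or a loser at least once.
active_ids = set(matches['winner_id']) | set(matches['loser_id'])

# Keep only those players; isin builds the boolean mask.
active_players = players[players['id'].isin(active_ids)]

print(active_players)
```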


### Print info about dataframes.

In [131]:
df_list = [df_matches_2018, df_matches_2019, df_matches_2020, df_matches, df_players, df_rankings]

for df in df_list:
    print("DATAFRAME: " + df.name + " " + str(df.shape) + "\n")
    print("NA FIELDS")
    print(df.isna().sum())
    print("\nUNIQUE FIELDS")
    for column in df:
        print(column + "\t" + str(df[column].is_unique))
    print("\n\n")

DATAFRAME: matches_2018 (2889, 14)

NA FIELDS
tourney_id      0
tourney_name    0
surface         0
tourney_date    0
match_num       0
winner_id       0
winner_name     0
winner_ioc      0
loser_id        0
loser_name      0
loser_ioc       0
winner_rank     0
loser_rank      0
id              0
dtype: int64

UNIQUE FIELDS
tourney_id	False
tourney_name	False
surface	False
tourney_date	False
match_num	False
winner_id	False
winner_name	False
winner_ioc	False
loser_id	False
loser_name	False
loser_ioc	False
winner_rank	False
loser_rank	False
id	True



DATAFRAME: matches_2019 (2781, 14)

NA FIELDS
tourney_id      0
tourney_name    0
surface         0
tourney_date    0
match_num       0
winner_id       0
winner_name     0
winner_ioc      0
loser_id        0
loser_name      0
loser_ioc       0
winner_rank     0
loser_rank      0
id              0
dtype: int64

UNIQUE FIELDS
tourney_id	False
tourney_name	False
surface	False
tourney_date	False
match_num	False
winner_id	False
winner_name	False

### pickle.dump() cleaned dataframes into files.

In [132]:
import pickle

with open("ASM_PZ2_podaci_2021/matches_2018_cleaned", 'wb') as file:
    pickle.dump(df_matches_2018, file)
with open("ASM_PZ2_podaci_2021/matches_2019_cleaned", 'wb') as file:
    pickle.dump(df_matches_2019, file)
with open("ASM_PZ2_podaci_2021/matches_2020_cleaned", 'wb') as file:
    pickle.dump(df_matches_2020, file)
with open("ASM_PZ2_podaci_2021/matches_cleaned", 'wb') as file:
    pickle.dump(df_matches, file)
with open("ASM_PZ2_podaci_2021/players_cleaned", 'wb') as file:
    pickle.dump(df_players, file)
with open("ASM_PZ2_podaci_2021/rankings_cleaned", 'wb') as file:
    pickle.dump(df_rankings, file)
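
pandas also offers `DataFrame.to_pickle` and `pd.read_pickle`, which wrap the same round trip in one call each. A sketch using a temporary directory and a hypothetical file name, so nothing in `ASM_PZ2_podaci_2021/` is touched. As with any pickle, such files should only be loaded from trusted sources.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for one of the cleaned frames.
df_demo = pd.DataFrame({'id': [0, 1], 'winner_id': [105992, 111577]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this demonstration.
    path = os.path.join(tmp_dir, "matches_demo.pkl")
    df_demo.to_pickle(path)             # write
    df_restored = pd.read_pickle(path)  # read back

print(df_restored.equals(df_demo))
```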

# Modelling the network

### pickle.load() cleaned dataframes from files.

In [133]:
import networkx as nx

import pickle

with open("ASM_PZ2_podaci_2021/matches_2018_cleaned", 'rb') as file:
    df_matches_2018 = pickle.load(file)
    df_matches_2018.name = "matches_2018"
with open("ASM_PZ2_podaci_2021/matches_2019_cleaned", 'rb') as file:
    df_matches_2019 = pickle.load(file)
    df_matches_2019.name = "matches_2019"
with open("ASM_PZ2_podaci_2021/matches_2020_cleaned", 'rb') as file:
    df_matches_2020 = pickle.load(file)
    df_matches_2020.name = "matches_2020"
with open("ASM_PZ2_podaci_2021/matches_cleaned", 'rb') as file:
    df_matches = pickle.load(file)
    df_matches.name = "matches_all"
with open("ASM_PZ2_podaci_2021/players_cleaned", 'rb') as file:
    df_players = pickle.load(file)
    df_players.name = "players"
with open("ASM_PZ2_podaci_2021/rankings_cleaned", 'rb') as file:
    df_rankings = pickle.load(file)
    df_rankings.name = "rankings"

### Create graphs for players and matches.

In [134]:
G_2018 = nx.Graph(name="2018")
G_2019 = nx.Graph(name="2019")
G_2020 = nx.Graph(name="2020")
G_all = nx.Graph(name="all years")

matches_graphs = [G_2018, G_2019, G_2020, G_all]
matches_dataframes = [df_matches_2018, df_matches_2019, df_matches_2020, df_matches]
for i in range(4):
    G = matches_graphs[i]
    df = matches_dataframes[i]
    
    # add nodes
    player_nodes = df_players[(df_players['id'].isin(df['winner_id'])) | (df_players['id'].isin(df['loser_id']))]
    for _, id, first_name, last_name, _ in player_nodes.itertuples():
        G.add_node(id, label=(first_name + " " + last_name))
        
    # add edges
    df_reduced = df[['winner_id', 'loser_id']]
    for _, winner_id, loser_id in df_reduced.itertuples():
        if (winner_id, loser_id) in G.edges:
            G.edges[winner_id, loser_id]['weight'] += 1
        else:
            G.add_edge(winner_id, loser_id, weight=1)
        
# Pajek has support for diacritics, whereas GML does not! (The files below are written as GEXF.)
nx.write_gexf(G_2018, "models/matches_2018.gexf")
nx.write_gexf(G_2019, "models/matches_2019.gexf")
nx.write_gexf(G_2020, "models/matches_2020.gexf")
nx.write_gexf(G_all, "models/matches_all.gexf")
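
A toy version of the edge-building loop shows what the resulting graph encodes. The four matches below are made up; the point is that the graph is undirected, so A beating B and B beating A land on the same edge, and the edge weight counts how often the pair met.

```python
import networkx as nx
import pandas as pd

# Made-up matches: players 1 and 2 meet twice, with one win each.
toy = pd.DataFrame({
    'winner_id': [1, 2, 1, 3],
    'loser_id':  [2, 1, 3, 4],
})

G_toy = nx.Graph()
for winner_id, loser_id in zip(toy['winner_id'], toy['loser_id']):
    if G_toy.has_edge(winner_id, loser_id):
        # Same pair again (in either order): bump the match counter.
        G_toy[winner_id][loser_id]['weight'] += 1
    else:
        G_toy.add_edge(winner_id, loser_id, weight=1)

print(G_toy.edges(data=True))
print(G_toy.degree(1))                   # distinct opponents of player 1
print(G_toy.degree(1, weight='weight'))  # matches played by player 1
```

The plain degree is the number of distinct opponents, while the weighted degree is the number of matches played.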

### Basic graph info. [Q1 - average opponents per player]

In [135]:
# Note: nx.info() was removed in NetworkX 3.0, so this cell needs NetworkX < 3.0.
print(nx.info(G_2018))
print(nx.info(G_2019))
print(nx.info(G_2020))
print(nx.info(G_all))

Name: 2018
Type: Graph
Number of nodes: 419
Number of edges: 2489
Average degree:  11.8807
Name: 2019
Type: Graph
Number of nodes: 364
Number of edges: 2378
Average degree:  13.0659
Name: 2020
Type: Graph
Number of nodes: 345
Number of edges: 1325
Average degree:   7.6812
Name: all years
Type: Graph
Number of nodes: 581
Number of edges: 5330
Average degree:  18.3477
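
The average degree reported above is simply `2 * edges / nodes`, since every edge contributes to the degree of two players. The sketch below recomputes it from the node and edge counts printed in the output. On NetworkX 3.x, where `nx.info` is no longer available, the same inputs can be read from `G.number_of_nodes()` and `G.number_of_edges()`.

```python
# (nodes, edges) pairs copied from the output above.
graph_sizes = {
    "2018": (419, 2489),
    "2019": (364, 2378),
    "2020": (345, 1325),
    "all years": (581, 5330),
}

# Each edge adds 1 to the degree of both endpoints, hence the factor 2.
average_degree = {
    name: 2 * edges / nodes
    for name, (nodes, edges) in graph_sizes.items()
}

for name, value in average_degree.items():
    print(f"{name}: {value:.4f}")
```

For Q1, this means a player in the 2018 graph faced roughly 12 distinct opponents on average.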


### Centralities. [Q2 - players with highest number of opponents played]

In [145]:
def calculate_centralities(G):
    DC_dict = nx.degree_centrality(G)
    CC_dict = nx.closeness_centrality(G)
    BC_dict = nx.betweenness_centrality(G)
    EVC_dict = nx.eigenvector_centrality(G)
    
    df_dc = pd.DataFrame.from_dict(DC_dict, orient='index', columns=['DC'])
    df_cc = pd.DataFrame.from_dict(CC_dict, orient='index', columns=['CC'])
    df_bc = pd.DataFrame.from_dict(BC_dict, orient='index', columns=['BC'])
    df_evc = pd.DataFrame.from_dict(EVC_dict, orient='index', columns=['EVC'])
    df = pd.concat([df_dc, df_cc, df_bc, df_evc], axis=1)
    return df

for G in matches_graphs:
    print(G.name)
    name_dict = nx.get_node_attributes(G, 'label')
    df_names = pd.DataFrame.from_dict(name_dict, orient='index', columns=['name'])
    df_centralities = calculate_centralities(G)
    df_centralities = pd.concat([df_names, df_centralities], axis=1)
    df_centralities.sort_values('DC', ascending=False, inplace=True)
    print(df_centralities)
    print()

2018
                             name        DC        CC        BC           EVC
104926              Fabio Fognini  0.145933  0.388683  0.033726  1.438065e-01
106233              Dominic Thiem  0.131579  0.387737  0.014091  1.469095e-01
126774         Stefanos Tsitsipas  0.126794  0.377184  0.018005  1.359105e-01
100644           Alexander Zverev  0.124402  0.378078  0.010155  1.463716e-01
104755            Richard Gasquet  0.122010  0.376293  0.012518  1.376286e-01
...                           ...       ...       ...       ...           ...
111515       Thai Son Kwiatkowski  0.002392  0.225563  0.000000  5.068646e-04
111574                 Jc Aragone  0.002392  0.232303  0.000000  6.272305e-04
111578              Stefan Kozlov  0.002392  0.257656  0.000000  2.747894e-03
111790             Brayden Schnur  0.002392  0.246687  0.000000  1.633772e-03
208029  Holger Vitus Nodskov Rune  0.002392  0.004306  0.000000  8.413057e-12

[419 rows x 5 columns]

2019
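
To answer Q2 in terms of actual opponent counts, the degree centrality can be converted back: NetworkX normalizes the degree by `n - 1`, where `n` is the number of nodes. A sketch using three DC values from the 2018 output above (419 nodes); rounding is needed because the printed values are truncated to six decimals.

```python
# Number of nodes in the 2018 graph, from the basic graph info.
n_nodes = 419

# Degree centralities copied from the 2018 output.
degree_centrality = {
    "Fabio Fognini": 0.145933,
    "Dominic Thiem": 0.131579,
    "Holger Vitus Nodskov Rune": 0.002392,
}

# degree = DC * (n - 1), i.e. the number of distinct opponents.
opponents = {
    name: round(dc * (n_nodes - 1))
    for name, dc in degree_centrality.items()
}

for name, count in opponents.items():
    print(f"{name}: {count} distinct opponents")
```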
                             

### [Q3 - players by number of unique and total number of tournaments they participated in]

In [144]:
print("Total number of unique tourney IDs: ")
print(df_matches['tourney_id'].nunique())
print()

for df_m in matches_dataframes:
    print(df_m.name)
    df_wm = df_m.rename(columns={'winner_id': 'player_id'})[['tourney_id', 'player_id']]
    df_lm = df_m.rename(columns={'loser_id': 'player_id'})[['tourney_id', 'player_id']]
    df_pm = pd.concat([df_wm, df_lm], axis=0)
    df_pm = df_pm.join(df_players.set_index('id'), on='player_id')[['tourney_id', 'player_id', 'first_name', 'last_name']]
    df_player_tourneys = df_pm.groupby(['player_id', 'first_name', 'last_name'], as_index=True)['tourney_id'].agg(['count', 'nunique']).reset_index()
    print(df_player_tourneys[['player_id', 'first_name', 'last_name', 'nunique']].sort_values('nunique', ascending=False))
    print(df_player_tourneys[['player_id', 'first_name', 'last_name', 'count']].sort_values('count', ascending=False))
    print()

Number of unique tourney IDs: 
328

matches_2018
     player_id            first_name  last_name  nunique
188     106000                 Damir    Dzumhur       32
83      105077                Albert      Ramos       30
61      104898                 Robin      Haase       30
71      104999                Mischa     Zverev       30
93      105173                Adrian  Mannarino       30
..         ...                   ...        ...      ...
262     111153           Christopher    Eubanks        1
263     111167                   Ugo    Nastasi        1
264     111190               Zhizhen      Zhang        1
265     111192                 Lucas    Miedler        1
418     208029  Holger Vitus Nodskov       Rune        1

[419 rows x 4 columns]
     player_id            first_name  last_name  count
0       100644             Alexander     Zverev     77
224     106233               Dominic      Thiem     75
330     126774              Stefanos  Tsitsipas     73
66      104926         
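
A toy example may help when reading the two tables above: `nunique` counts distinct tournaments, whereas `count` counts rows, i.e. matches played (one row per player per match). The three matches below are made up for illustration.

```python
import pandas as pd

# Made-up matches: two in tournament A, one in tournament B.
toy_matches = pd.DataFrame({
    'tourney_id': ['2018-A', '2018-A', '2018-B'],
    'winner_id':  [1, 1, 2],
    'loser_id':   [2, 3, 1],
})

# One row per (tournament, player) appearance, for winners and losers alike.
winners = toy_matches.rename(columns={'winner_id': 'player_id'})[['tourney_id', 'player_id']]
losers = toy_matches.rename(columns={'loser_id': 'player_id'})[['tourney_id', 'player_id']]
appearances = pd.concat([winners, losers], ignore_index=True)

# count = matches played, nunique = distinct tournaments entered.
summary = appearances.groupby('player_id')['tourney_id'].agg(['count', 'nunique'])

print(summary)
```

Player 1 plays three matches across two tournaments, so `count` is 3 and `nunique` is 2.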

### [Q4 - "representatives", i.e. players with most contacts - biggest degree]

In [138]:
edges = sorted(G_all.edges(data=True), key=lambda t: t[2].get('weight', 1))
print(edges)

[(100644, 105341, {'weight': 1}), (100644, 104312, {'weight': 1}), (100644, 111456, {'weight': 1}), (100644, 105992, {'weight': 1}), (100644, 105223, {'weight': 1}), (100644, 105311, {'weight': 1}), (100644, 104180, {'weight': 1}), (100644, 104755, {'weight': 1}), (100644, 105870, {'weight': 1}), (100644, 105539, {'weight': 1}), (100644, 104919, {'weight': 1}), (100644, 105575, {'weight': 1}), (100644, 106000, {'weight': 1}), (100644, 105902, {'weight': 1}), (100644, 104999, {'weight': 1}), (100644, 105614, {'weight': 1}), (100644, 105166, {'weight': 1}), (100644, 103917, {'weight': 1}), (100644, 105657, {'weight': 1}), (100644, 105683, {'weight': 1}), (100644, 106109, {'weight': 1}), (100644, 105373, {'weight': 1}), (100644, 144719, {'weight': 1}), (100644, 104797, {'weight': 1}), (100644, 106426, {'weight': 1}), (100644, 106228, {'weight': 1}), (100644, 128034, {'weight': 1}), (100644, 106198, {'weight': 1}), (100644, 144707, {'weight': 1}), (100644, 104460, {'weight': 1}), (100644, 
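
Printing every edge gives a very long list. If only the most frequent pairings are of interest, one option is to take the top few edges by weight. A sketch with `heapq.nlargest` on a hypothetical edge list in the same `(u, v, data)` shape as `G_all.edges(data=True)`; the weights below are made up.

```python
import heapq

# Hypothetical edges in the (u, v, data) shape returned by G.edges(data=True).
toy_edges = [
    (100644, 105341, {'weight': 1}),
    (104745, 104925, {'weight': 5}),
    (104925, 106233, {'weight': 3}),
]

# Pick the two heaviest edges without sorting the whole list.
top_two = heapq.nlargest(2, toy_edges, key=lambda edge: edge[2].get('weight', 1))

for u, v, data in top_two:
    print(u, v, data['weight'])
```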