<img src='john_isner.png' width='464'/>

# John Isner: Pride of America or One-Trick Pony?

There's nothing that we American tennis fans love more than cheering on USA's finest during the US Open tournament. But while Serena Williams always seems to find her way to the top of the women's bracket, the talent on the men's side of the game rarely makes the headlines.

The highest-ranked American men's singles player on the ATP tour right now is a fella named John Isner, currently ranked 10th in the world. John, or "Big John" as he is affectionately known in the tennis community, is a monster on the tennis court. Standing at 6'10", John is best known for his incredibly powerful serve. He has recorded serve speeds of up to 149.9 miles per hour, which visibly intimidates opposing players, even at the professional level.

https://youtu.be/cgdTzXL86XM <--- Have a look at John smacking a few at this character

Even with John's success cracking into the elite top 10, many consider him to be a one-dimensional player, with his massive serve being his only true weapon. His most obvious weaknesses on the court are his movement and his backhand. John's matches tend to be boring and predictable: he wins his service games and loses his opponent's service games, without many long exchanges mixed in. Certain friends of mine claim that he would be nothing without his serve, and I myself have to admit to changing the channel during one (or a few) of Big John's matches, saying, "This goofy bastard will never win a major." But how is it that this allegedly one-dimensional player remains at the top end of PROFESSIONAL tennis and hasn't been "figured out" by these other top players?

This question has prompted me to dig a little deeper into John Isner's game, and more specifically his serve, to try and figure out what he's really made of.

The dataset used in this analysis was put together by Jeff Sackmann / Tennis Abstract and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The data includes ATP match-level data dating back to 1968. The full dataset can be found on GitHub at https://github.com/JeffSackmann/tennis_atp.

## Question #3: Cutting Samson's Hair

This cell just shows the importing of several different libraries used during the remainder of this investigation.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor

%matplotlib inline

The datasets were separated by year, so we had to import each file separately and then merge them together into one big dataframe.

In [2]:
directory = "C://Users//Cheney//Desktop//tennis_atp-master//atp_matches_"
isner_id = 104545

years = list(map(str,(range(2007,2018))))
sets = []

for year in years:
    path = directory + year + ".csv"
    temp_df = pd.read_csv(path)
    sets.append(temp_df)

master_df = pd.concat(sets)
master_df.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,2007-500,Halle,Grass,32,A,20070611,1,102318,,LL,...,1.0,3.0,2.0,61.0,38.0,28.0,13.0,11.0,1.0,3.0
1,2007-500,Halle,Grass,32,A,20070611,2,103813,,,...,0.0,2.0,2.0,44.0,30.0,17.0,5.0,8.0,1.0,5.0
2,2007-500,Halle,Grass,32,A,20070611,3,103794,,,...,1.0,5.0,6.0,95.0,44.0,35.0,19.0,11.0,12.0,14.0
3,2007-500,Halle,Grass,32,A,20070611,4,102967,,,...,4.0,8.0,7.0,86.0,47.0,38.0,17.0,15.0,2.0,6.0
4,2007-500,Halle,Grass,32,A,20070611,5,104607,4.0,,...,3.0,9.0,4.0,80.0,43.0,33.0,15.0,11.0,6.0,8.0
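One detail of the merge is worth a quick illustration: by default `pd.concat` keeps each file's own row index, so the combined frame ends up with repeated index labels. The sketch below uses two made-up toy frames (`toy_2007` and `toy_2008` are illustrative names, not part of the dataset) to show what `ignore_index=True` changes.

```python
import pandas as pd

# Two tiny stand-ins for per-year match files (values are made up)
toy_2007 = pd.DataFrame({'winner_id': [1, 2], 'loser_id': [3, 4]})
toy_2008 = pd.DataFrame({'winner_id': [5, 6], 'loser_id': [7, 8]})

# Default behaviour: each frame keeps its own 0..n-1 index, so labels repeat
stacked = pd.concat([toy_2007, toy_2008])
print(list(stacked.index))        # [0, 1, 0, 1]

# ignore_index=True renumbers the rows 0..N-1 across all frames
renumbered = pd.concat([toy_2007, toy_2008], ignore_index=True)
print(list(renumbered.index))     # [0, 1, 2, 3]
```

Repeated labels are harmless for the positional slicing and key-based joins used later, but they explain why the row labels in some later outputs are not in increasing order.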


We wanted to start by getting a feel for how John Isner's career win percentage compares to that of his competition. We loop through every player who has won a match in the 2007-2017 data and build a dictionary of their wins, losses, and win percentage. We then compile these into a new dataframe, only including players who have logged more than 200 matches.

In [3]:
all_player_win_percentages = []
for player in master_df['winner_id'].unique():
    player_win_percentages = {}
    
    temp_view = master_df[master_df['winner_id'] == player]
    temp_view2 = master_df[master_df['loser_id'] == player]
    
    if (len(temp_view) + len(temp_view2)) > 200:
        player_win_percentages['player_id'] = player
        player_win_percentages['name'] = temp_view['winner_name'].iloc[0]
        player_win_percentages['wins'] = len(temp_view)
        player_win_percentages['losses'] = len(temp_view2)
        player_win_percentages['total_matches'] = len(temp_view) + len(temp_view2)
        player_win_percentages['win_pct'] = len(temp_view)/(len(temp_view) + len(temp_view2))
        
        if temp_view['winner_ioc'].iloc[0] == 'USA':
            player_win_percentages['usa?'] = 1
        else:
            player_win_percentages['usa?'] = 0
            
        all_player_win_percentages.append(player_win_percentages)
        
    else:
        continue
        
z = pd.DataFrame(all_player_win_percentages).sort_values(by=['win_pct'],ascending=False).reset_index(drop=True)

As you can see below, John's career win percentage falls around a respectable 61.9%.

In [4]:
z

Unnamed: 0,losses,name,player_id,total_matches,usa?,win_pct,wins
0,132,Novak Djokovic,104925,873,0,0.848797,741
1,128,Roger Federer,103819,785,0,0.836943,657
2,137,Rafael Nadal,104745,834,0,0.835731,697
3,153,Andy Murray,104918,763,0,0.799476,610
4,99,Andy Roddick,104053,357,1,0.722689,258
5,147,Juan Martin Del Potro,105223,523,0,0.718929,376
6,236,David Ferrer,103970,816,0,0.710784,580
7,91,Robin Soderling,104417,302,0,0.698675,211
8,195,Jo Wilfried Tsonga,104542,625,0,0.688000,430
9,138,Milos Raonic,105683,429,0,0.678322,291
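For readers who prefer a vectorized version of the same bookkeeping, here is a minimal sketch on a made-up match log (the player ids and the `toy_matches` name are illustrative). It counts wins and losses with `value_counts` instead of looping over players; it is an alternative, not the code used for the table above.

```python
import pandas as pd

# Toy match log (player ids are made up): one row per match
toy_matches = pd.DataFrame({
    'winner_id': [1, 1, 2, 1, 3],
    'loser_id':  [2, 3, 3, 2, 1],
})

# Count appearances on each side of the result
wins = toy_matches['winner_id'].value_counts()
losses = toy_matches['loser_id'].value_counts()

# Align on player id; a player missing from one side gets 0
record = pd.DataFrame({'wins': wins, 'losses': losses}).fillna(0).astype(int)
record['total_matches'] = record['wins'] + record['losses']
record['win_pct'] = record['wins'] / record['total_matches']

print(record.sort_values('win_pct', ascending=False))
```

A minimum-match filter like the 200-match cutoff above would then be a single boolean mask on `total_matches`.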


#### Now, here's where we start the fun stuff. We wanted to find out how the rest of Isner's game compares to his competition, or stated another way, we wanted to find out how much of his on-court success can be attributed to his serve alone. We attack this question by focusing on his career win percentage. Can we predict the change in win percentage if we neutralized his serve?

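We create one dataframe that holds the match-level context that is unrelated to either player's stats (tournament level, surface, date, and best-of format), plus the outcome. The first half of the rows is recorded from the winner's point of view ('w') and the second half from the loser's ('l'), so the model gets to see both outcomes.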

In [5]:
# First half of the matches is viewed from the winner's side ('w'),
# second half from the loser's side ('l'), so both outcomes are represented.
n_matches = len(master_df)
half = n_matches // 2   # 16547 for this dataset

wins_losses = ['w'] * half + ['l'] * (n_matches - half)
player_list = list(master_df['winner_id'].iloc[:half]) + list(master_df['loser_id'].iloc[half:])
opponent_list = list(master_df['loser_id'].iloc[:half]) + list(master_df['winner_id'].iloc[half:])

# .copy() gives us an independent frame, which avoids pandas' SettingWithCopyWarning
master_df_results = master_df[['tourney_level','surface','tourney_date','best_of']].copy()
master_df_results['outcome'] = wins_losses
master_df_results['player'] = player_list
master_df_results['opponent'] = opponent_list

master_df_results.head()


Unnamed: 0,tourney_level,surface,tourney_date,best_of,outcome,player,opponent
0,A,Grass,20070611,3,w,102318,103694
1,A,Grass,20070611,3,w,103813,104019
2,A,Grass,20070611,3,w,103794,104559
3,A,Grass,20070611,3,w,102967,103900
4,A,Grass,20070611,3,w,104607,103017
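The reason for flipping half of the rows is that the raw data always lists the winner first; if every row were kept that way, the target column would contain nothing but wins and a classifier would have nothing to learn. A minimal sketch of the same idea on a made-up four-match frame (all ids are illustrative):

```python
import pandas as pd

# Toy matches: the raw data always stores winner and loser in fixed columns
toy = pd.DataFrame({'winner_id': [10, 11, 12, 13], 'loser_id': [20, 21, 22, 23]})
half = len(toy) // 2

# First half: the "player" is the winner, so the outcome is 'w'
first = (toy.iloc[:half]
            .rename(columns={'winner_id': 'player', 'loser_id': 'opponent'})
            .assign(outcome='w'))

# Second half: the "player" is the loser, so the outcome is 'l'
second = (toy.iloc[half:]
             .rename(columns={'loser_id': 'player', 'winner_id': 'opponent'})
             .assign(outcome='l'))

perspective = pd.concat([first, second], ignore_index=True)[['outcome', 'player', 'opponent']]
print(perspective)
```

Splitting by position is the simplest option; assigning the perspective at random per match would be another reasonable choice.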


We then created a second dataframe with attributes of each individual player that has played in one of these logged matches. Most of the attributes are averages of the player's match statistics, while some are categorical attributes such as handedness and nationality.

In [6]:
winners = master_df[['winner_id', 'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'w_ace', 'w_df', 
                     'w_svpt', 'w_1stIn', 'w_1stWon',
    'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced','l_bpFaced']].copy()

losers = master_df[['loser_id', 'loser_name', 'loser_hand', 'loser_ht', 'loser_ioc', 'l_ace', 'l_df', 'l_svpt', 
                    'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved','l_bpFaced','w_bpFaced']].copy()

winners['breaks'] = winners['l_bpFaced'] - losers['l_bpSaved']
 
losers['breaks'] = losers['w_bpFaced'] - winners['w_bpSaved']

columns = ['player_id', 'player_name', 'player_hand', 'player_ht', 'player_ioc', 'aces', 'double_faults', 
           'service_pts', '1st_serve_in', '1st_won', '2nd_won', 'service_gms', 'bp_saved', 'bp_faced', 'break_chances', 'breaks']

winners.columns = columns
losers.columns = columns

df = pd.concat([winners,losers]).reset_index(drop=True)

player_ids = list(df.player_id)
players = pd.unique(player_ids)

player_stats=[]

for player in players:
    df_view = df[df['player_id']==player].reset_index(drop=True)
    individual_player_stats = {}
    
    individual_player_stats['player_id'] = player
    individual_player_stats['player_name'] = df_view.player_name[0]
    individual_player_stats['player_hand'] = df_view.player_hand[0]
    
    
    if df_view.player_ioc[0] == 'USA':
        individual_player_stats['usa?'] = 1
    else:
        individual_player_stats['usa?'] = 0
    
    # Filling NaNs with the column mean and then averaging gives the same value
    # as a NaN-skipping mean, so we compute that directly for every numeric stat.
    stat_columns = ['player_ht', 'aces', 'double_faults', 'service_pts', '1st_serve_in',
                    '1st_won', '2nd_won', 'service_gms', 'bp_saved', 'bp_faced',
                    'break_chances', 'breaks']
    for stat in stat_columns:
        individual_player_stats[stat] = df_view[stat].mean()
    
    individual_player_stats['recorded_matches'] = len(df_view)
    

    player_stats.append(individual_player_stats)
    
all_player_statistics = pd.DataFrame(player_stats)
all_player_statistics['player_hand'] = all_player_statistics['player_hand'].replace('U','R')
all_player_statistics['player_ht'] = all_player_statistics['player_ht'].fillna(all_player_statistics['player_ht'].mean())


player_statistics = all_player_statistics[all_player_statistics.recorded_matches > 17].reset_index(drop=True)
player_statistics.head()

Unnamed: 0,1st_serve_in,1st_won,2nd_won,aces,bp_faced,bp_saved,break_chances,breaks,double_faults,player_hand,player_ht,player_id,player_name,recorded_matches,service_gms,service_pts,usa?
0,49.794118,34.117647,12.5,4.470588,6.911765,3.735294,5.441176,2.411765,2.176471,R,183.0,102318,Andrei Pavel,40,12.058824,77.0,0
1,52.442308,35.641827,13.365385,3.21875,7.122596,4.213942,7.009615,2.848558,2.120192,L,185.0,103813,Jarkko Nieminen,447,12.300481,79.286058,0
2,42.669516,31.675214,16.111111,7.467236,6.105413,3.592593,5.464387,2.05698,2.894587,R,178.0,103794,Benjamin Becker,358,11.863248,75.239316,0
3,44.156627,31.885542,16.433735,5.993976,6.879518,4.210843,5.921687,2.343373,2.409639,R,188.0,102967,Marc Gicquel,166,11.933735,77.26506,0
4,43.777626,33.959072,16.875853,7.757162,5.039563,3.24693,7.0,2.92633,2.364256,R,196.0,104607,Tomas Berdych,771,12.020464,75.504775,0
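The two ideas in this cell are the definition of a break (a break point the opponent faced but did not save) and the per-player averaging. A compact sketch of both on made-up numbers, using `groupby` rather than an explicit loop (the `toy_stats` frame and the `opp_bp_saved` column name are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy per-match rows, one per player per match (values are made up)
toy_stats = pd.DataFrame({
    'player_id':     [1, 1, 1, 2, 2],
    'aces':          [10.0, np.nan, 14.0, 3.0, 5.0],
    'break_chances': [4.0, 6.0, 5.0, 8.0, 2.0],   # break points the opponent faced
    'opp_bp_saved':  [3.0, 4.0, 2.0, 5.0, 2.0],   # break points the opponent saved
})

# A break of serve is a break point the opponent faced but did not save
toy_stats['breaks'] = toy_stats['break_chances'] - toy_stats['opp_bp_saved']

# groupby().mean() skips NaN values, so missing stats do not need filling first
per_player = toy_stats.groupby('player_id')[['aces', 'breaks']].mean()
per_player['recorded_matches'] = toy_stats.groupby('player_id').size()
print(per_player)
```

Player 1's missing ace count is simply left out of the average, which is the same result the fill-then-average approach produces.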


Using a couple of left joins, we combine our matchup dataframe and our player stats dataframe.

In [7]:
x = master_df_results.join(player_statistics.set_index('player_id'), on='player')
full_dataframe = x.join(player_statistics.set_index('player_id'), on='opponent',rsuffix='_opponent')
full_dataframe.head()

Unnamed: 0,tourney_level,surface,tourney_date,best_of,outcome,player,opponent,1st_serve_in,1st_won,2nd_won,...,break_chances_opponent,breaks_opponent,double_faults_opponent,player_hand_opponent,player_ht_opponent,player_name_opponent,recorded_matches_opponent,service_gms_opponent,service_pts_opponent,usa?_opponent
0,A,Grass,20070611,3,w,102318,103694,49.794118,34.117647,12.5,...,6.078818,2.477833,2.359606,R,168.0,Olivier Rochus,215.0,11.650246,76.433498,0.0
1,A,Grass,20070611,3,w,103813,104019,52.442308,35.641827,13.365385,...,5.98913,2.228261,2.065217,R,193.0,Kristof Vliegen,100.0,12.021739,77.73913,0.0
2,A,Grass,20070611,3,w,103794,104559,42.669516,31.675214,16.111111,...,6.446429,2.478571,4.042857,R,188.0,Teymuraz Gabashvili,293.0,12.071429,80.110714,0.0
3,A,Grass,20070611,3,w,102967,103900,44.156627,31.885542,16.433735,...,7.84322,3.377119,3.389831,R,180.0,David Nalbandian,254.0,11.991525,76.529661,0.0
4,A,Grass,20070611,3,w,104607,103017,43.777626,33.959072,16.875853,...,5.990099,2.366337,3.019802,R,183.0,Nicolas Kiefer,107.0,11.633663,73.346535,0.0
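A small sketch of what these joins do when one side has no stats row, which happens here for any player with 17 or fewer recorded matches. The frames and ids below are made up for illustration.

```python
import pandas as pd

# Toy matchup table and toy player stats (ids and values are made up)
toy_matchups = pd.DataFrame({'player': [1, 2], 'opponent': [2, 99]})
toy_player_stats = pd.DataFrame({'player_id': [1, 2], 'aces': [12.0, 4.0]}).set_index('player_id')

# First join attaches the player's own stats; DataFrame.join is a left join by default
joined = toy_matchups.join(toy_player_stats, on='player')

# Second join attaches the opponent's stats; rsuffix keeps the column names distinct
joined = joined.join(toy_player_stats, on='opponent', rsuffix='_opponent')
print(joined)

# Opponent 99 has no stats row, so its columns are NaN and dropna() removes the match
print(len(joined.dropna()))   # 1
```

This is also why integer columns such as `recorded_matches_opponent` show up as floats after the join: a column that may contain NaN is stored as floating point.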


Next, we prep our data for further analysis. Here we convert our categorical variables into quantitative variables through a process called label encoding.

In [8]:
full_dataframe['tourney_date'] = pd.to_datetime(full_dataframe['tourney_date'])
full_dataframe['tourney_date'] = full_dataframe['tourney_date'].apply(lambda x: x.toordinal())

full_dataframe['tourney_level'] = full_dataframe['tourney_level'].astype('category')
full_dataframe['tourney_level'] = full_dataframe['tourney_level'].cat.codes

full_dataframe['surface'] = full_dataframe['surface'].astype('category')
full_dataframe['surface'] = full_dataframe['surface'].cat.codes

full_dataframe['outcome'] = full_dataframe['outcome'].replace('w',1)
full_dataframe['outcome'] = full_dataframe['outcome'].replace('l',0)

full_dataframe['player_hand'] = full_dataframe['player_hand'].replace('R',1)
full_dataframe['player_hand'] = full_dataframe['player_hand'].replace('L',0)

full_dataframe['player_hand_opponent'] = full_dataframe['player_hand_opponent'].replace('R',1)
full_dataframe['player_hand_opponent'] = full_dataframe['player_hand_opponent'].replace('L',0)

full_dataframe = full_dataframe.dropna()
full_dataframe.head()

Unnamed: 0,tourney_level,surface,tourney_date,best_of,outcome,player,opponent,1st_serve_in,1st_won,2nd_won,...,break_chances_opponent,breaks_opponent,double_faults_opponent,player_hand_opponent,player_ht_opponent,player_name_opponent,recorded_matches_opponent,service_gms_opponent,service_pts_opponent,usa?_opponent
0,0,2,719163,3,1,102318,103694,49.794118,34.117647,12.5,...,6.078818,2.477833,2.359606,1.0,168.0,Olivier Rochus,215.0,11.650246,76.433498,0.0
1,0,2,719163,3,1,103813,104019,52.442308,35.641827,13.365385,...,5.98913,2.228261,2.065217,1.0,193.0,Kristof Vliegen,100.0,12.021739,77.73913,0.0
2,0,2,719163,3,1,103794,104559,42.669516,31.675214,16.111111,...,6.446429,2.478571,4.042857,1.0,188.0,Teymuraz Gabashvili,293.0,12.071429,80.110714,0.0
3,0,2,719163,3,1,102967,103900,44.156627,31.885542,16.433735,...,7.84322,3.377119,3.389831,1.0,180.0,David Nalbandian,254.0,11.991525,76.529661,0.0
4,0,2,719163,3,1,104607,103017,43.777626,33.959072,16.875853,...,5.990099,2.366337,3.019802,1.0,183.0,Nicolas Kiefer,107.0,11.633663,73.346535,0.0
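To see how `astype('category')` followed by `.cat.codes` picks its numbers, here is a minimal sketch on a made-up list of surfaces (the exact set of surface names is an assumption for illustration). String categories are sorted alphabetically and numbered from 0.

```python
import pandas as pd

# Made-up sample of surface labels
surfaces = pd.Series(['Grass', 'Hard', 'Clay', 'Hard', 'Carpet'])
as_category = surfaces.astype('category')

# Categories are sorted alphabetically, then numbered from 0
print(list(as_category.cat.categories))   # ['Carpet', 'Clay', 'Grass', 'Hard']
print(list(as_category.cat.codes))        # [2, 3, 1, 3, 0]

# Keeping the mapping around makes the encoded column readable later
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label[2])                   # Grass
```

The codes are arbitrary labels, not a ranking, so a code of 3 does not mean "more" of anything than a code of 2.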


After reformatting our data, we start setting up our predictive model. We begin by importing a few modules from Scikit-learn, which has a lot of great tools for machine learning. The algorithm we'll be using in this case is called Random Forest Classifier.

In [9]:
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

We split our dataframe into our features and target, with the target being the outcome of the match. Then we split both into training and test datasets. We then fit a decision tree based on the training set.

In [10]:
X = full_dataframe.drop(['player','opponent','outcome','player_name','player_name_opponent','tourney_date','player_hand'], axis=1)
y = full_dataframe['outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

my_tree = tree.DecisionTreeClassifier()
my_tree = my_tree.fit(X_train,y_train)

The decision tree allows us to take a look at the feature importances; in other words, how influential each feature is in predicting the outcome of a match. Features that don't have much of an impact on the predictive power of the model can add noise. We use these feature importances to select the cocktail of features that results in the most accurate predictive model.

In [11]:
importances = list(my_tree.feature_importances_)
features = list(X.columns)

sample = pd.DataFrame()
sample['features'] = features
sample['importances'] = importances
sample = sample.sort_values(by=['importances'],ascending=False).reset_index(drop=True)
sample

Unnamed: 0,features,importances
0,recorded_matches_opponent,0.08296
1,bp_faced,0.076912
2,recorded_matches,0.053207
3,break_chances,0.04964
4,2nd_won_opponent,0.046373
5,player_ht_opponent,0.045617
6,surface,0.043846
7,tourney_level,0.039403
8,break_chances_opponent,0.038534
9,bp_faced_opponent,0.035804


Using our selected group of features, we deploy the Random Forest Classifier model by fitting it to our training set and then using the model to generate a set of predicted outcomes. Our model correctly predicted the match outcome approximately 80% of the time.

In [12]:
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.804154708214


#### Now that we have our model set up to predict match outcomes, we turn our attention back to Big John. We wanted to find out how effective a player he would be without his dominant serve. To find out, we replace all of John's serving-related stats with the competition averages and then simulate all of his matches using our new predictive model.

Below we calculate the average for each serving-related feature in our dataset. 

In [13]:
serve_stats = ('1st_serve_in','1st_won','2nd_won','aces','bp_faced','bp_saved','double_faults','service_gms','service_pts')

avg_serve_stats = {}
for stat_line in serve_stats:
    avg_serve_stats[stat_line] = player_statistics[stat_line].mean()

In [14]:
avg_serve_stats = pd.Series(avg_serve_stats)
avg_serve_stats

1st_serve_in     48.988325
1st_won          34.100649
2nd_won          15.648051
aces              5.536924
bp_faced          7.417186
bp_saved          4.395114
double_faults     3.058153
service_gms      12.235931
service_pts      80.470136
dtype: float64

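Here we see John's actual average serving-related stats, which are noticeably better than the competition averages shown above: about 17.6 aces per match against an average of 5.5, and only 4.2 break points faced against an average of 7.4.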

In [15]:
player_statistics[player_statistics.player_id == isner_id]

Unnamed: 0,1st_serve_in,1st_won,2nd_won,aces,bp_faced,bp_saved,break_chances,breaks,double_faults,player_hand,player_ht,player_id,player_name,recorded_matches,service_gms,service_pts,usa?
40,60.527629,47.324421,15.365419,17.616756,4.180036,2.966132,4.812834,1.493761,2.27451,R,206.0,104545,John Isner,580,13.914439,87.946524,1


We create a new dataframe for the augmented data and replace the serving-related stats below.

In [16]:
augmented_player_statistics = player_statistics.copy()

# Look up Isner's row by player id instead of a hard-coded index label,
# and reuse the averages computed above instead of retyping them
isner_row = augmented_player_statistics.player_id == isner_id
for stat_line in serve_stats:
    augmented_player_statistics.loc[isner_row, stat_line] = avg_serve_stats[stat_line]

augmented_player_statistics[augmented_player_statistics.player_id == isner_id]

Unnamed: 0,1st_serve_in,1st_won,2nd_won,aces,bp_faced,bp_saved,break_chances,breaks,double_faults,player_hand,player_ht,player_id,player_name,recorded_matches,service_gms,service_pts,usa?
40,48.988325,34.100649,15.648051,5.536924,7.417186,4.395114,4.812834,1.493761,3.058153,R,206.0,104545,John Isner,580,12.235931,80.470136,1


Using the same techniques as earlier, we transform the categorical features into quantitative features so that they can be run through our model.

In [17]:
temp_variable = master_df_results.join(augmented_player_statistics.set_index('player_id'), on='player')
augmented_full_dataframe = temp_variable.join(augmented_player_statistics.set_index('player_id'), on='opponent',rsuffix='_opponent')

augmented_full_dataframe['tourney_date'] = pd.to_datetime(augmented_full_dataframe['tourney_date'])
augmented_full_dataframe['tourney_date'] = augmented_full_dataframe['tourney_date'].apply(lambda x: x.toordinal())

augmented_full_dataframe['tourney_level'] = augmented_full_dataframe['tourney_level'].astype('category')
augmented_full_dataframe['tourney_level'] = augmented_full_dataframe['tourney_level'].cat.codes

augmented_full_dataframe['surface'] = augmented_full_dataframe['surface'].astype('category')
augmented_full_dataframe['surface'] = augmented_full_dataframe['surface'].cat.codes

augmented_full_dataframe['outcome'] = augmented_full_dataframe['outcome'].replace('w',1)
augmented_full_dataframe['outcome'] = augmented_full_dataframe['outcome'].replace('l',0)

augmented_full_dataframe['player_hand'] = augmented_full_dataframe['player_hand'].replace('R',1)
augmented_full_dataframe['player_hand'] = augmented_full_dataframe['player_hand'].replace('L',0)

augmented_full_dataframe['player_hand_opponent'] = augmented_full_dataframe['player_hand_opponent'].replace('R',1)
augmented_full_dataframe['player_hand_opponent'] = augmented_full_dataframe['player_hand_opponent'].replace('L',0)

augmented_full_dataframe = augmented_full_dataframe.dropna()
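A side note on the `tourney_date` conversion: every encoded row in the outputs shows the same value, 719163, which is the ordinal of 1970-01-01. That happens because the dates are stored as integers like 20070611, and `pd.to_datetime` reads bare integers as nanoseconds since the epoch. The column is dropped before modeling, so the results are not affected, but here is a minimal sketch of the difference an explicit format makes (the two sample dates are illustrative):

```python
import pandas as pd

# Dates stored as integers in YYYYMMDD form
raw = pd.Series([20070611, 20170828])

# Without a format, integers are read as nanoseconds since 1970-01-01
naive = pd.to_datetime(raw)
print(naive.dt.year.tolist())                            # [1970, 1970]
print(naive.apply(lambda d: d.toordinal()).tolist())     # [719163, 719163]

# With an explicit format (on the string form) the real dates come back
parsed = pd.to_datetime(raw.astype(str), format='%Y%m%d')
print(parsed.dt.year.tolist())                           # [2007, 2017]
print(parsed.apply(lambda d: d.toordinal()).tolist())
```

If the date were ever used as a feature, the second form is the one that would carry real information.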

Then, we pull out all of Isner's matches from the augmented dataset.

In [18]:
isner_matches1 = augmented_full_dataframe[augmented_full_dataframe.player == isner_id]
isner_matches2 = augmented_full_dataframe[augmented_full_dataframe.opponent == isner_id]
isner_matches = pd.concat([isner_matches1,isner_matches2])
isner_matches

Unnamed: 0,tourney_level,surface,tourney_date,best_of,outcome,player,opponent,1st_serve_in,1st_won,2nd_won,...,break_chances_opponent,breaks_opponent,double_faults_opponent,player_hand_opponent,player_ht_opponent,player_name_opponent,recorded_matches_opponent,service_gms_opponent,service_pts_opponent,usa?_opponent
96,4,3,719163,5,1,104545,103813,48.988325,34.100649,15.648051,...,7.009615,2.848558,2.120192,0.0,185.0,Jarkko Nieminen,447.0,12.300481,79.286058,0.0
158,4,3,719163,5,1,104545,103573,48.988325,34.100649,15.648051,...,5.692308,2.128205,1.717949,1.0,180.0,Rik De Voest,61.0,12.307692,82.435897,0.0
756,0,3,719163,3,1,104545,102450,48.988325,34.100649,15.648051,...,7.312500,2.937500,4.250000,1.0,185.0,Tim Henman,19.0,15.437500,100.312500,0.0
772,0,3,719163,3,1,104545,103794,48.988325,34.100649,15.648051,...,5.464387,2.056980,2.894587,1.0,178.0,Benjamin Becker,358.0,11.863248,75.239316,0.0
782,0,3,719163,3,1,104545,104639,48.988325,34.100649,15.648051,...,7.376344,2.881720,2.795699,0.0,180.0,Wayne Odesnik,95.0,12.946237,84.311828,1.0
787,0,3,719163,3,1,104545,103163,48.988325,34.100649,15.648051,...,7.075988,2.762918,3.623100,1.0,188.0,Tommy Haas,340.0,12.735562,80.194529,0.0
789,0,3,719163,3,1,104545,104792,48.988325,34.100649,15.648051,...,7.764378,3.148423,3.393321,1.0,193.0,Gael Monfils,555.0,12.300557,79.543599,0.0
1029,0,3,719163,3,1,104545,103794,48.988325,34.100649,15.648051,...,5.464387,2.056980,2.894587,1.0,178.0,Benjamin Becker,358.0,11.863248,75.239316,0.0
184,0,3,719163,3,1,104545,104268,48.988325,34.100649,15.648051,...,6.643478,2.691304,2.739130,0.0,185.0,Alejandro Falla,249.0,12.004348,78.582609,0.0
1019,0,3,719163,3,1,104545,103722,48.988325,34.100649,15.648051,...,6.222222,2.361111,1.953704,1.0,180.0,Florent Serra,217.0,12.328704,81.754630,0.0


#### Just like before, we use the same features and ask our model to predict the outcomes of Isner's matches using the augmented stats. As you can see below, after the model simulated the matches, Isner's actual win percentage of 61.9% dropped to a predicted 47.2%, a fall of nearly 15 percentage points. While this is a fairly sizeable drop, I was honestly surprised that the impact was not greater.

In [19]:
X_set = isner_matches.drop(['player','opponent','outcome','player_name','player_name_opponent','tourney_date','player_hand'], 
                           axis=1)

predicted_outcomes = clf.predict(X_set)

new_df =isner_matches[['player','opponent']].reset_index(drop=True)
new_df['outcome'] = predicted_outcomes

total_matches = len(new_df)

a = new_df[new_df.player == isner_id]
count_1 = len(a[a.outcome == 1])

b = new_df[new_df.opponent == isner_id]
count_2 = len(b[b.outcome == 0])

total_wins = count_1 + count_2

total_win_pct = total_wins/total_matches

print('Total Wins: %s' %total_wins)
print('Total Win Percentage: %s' %total_win_pct)

Total Wins: 267
Total Win Percentage: 0.4717314487632509
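The counting logic above has one wrinkle: a predicted 1 means the `player` column wins, so a match where our man sits in the `opponent` column counts as his win only when the prediction is 0. A minimal sketch on made-up predictions (the id 7 and the `toy_pred` frame are hypothetical):

```python
import pandas as pd

target_id = 7   # hypothetical player id

# Toy predictions: outcome 1 means 'player' wins, 0 means 'opponent' wins
toy_pred = pd.DataFrame({
    'player':   [7, 7, 3, 5, 7],
    'opponent': [3, 5, 7, 7, 9],
    'outcome':  [1, 0, 0, 1, 1],
})

# A win for the target is either (listed as player, outcome 1)
# or (listed as opponent, outcome 0)
won_as_player = (toy_pred['player'] == target_id) & (toy_pred['outcome'] == 1)
won_as_opponent = (toy_pred['opponent'] == target_id) & (toy_pred['outcome'] == 0)

wins = int((won_as_player | won_as_opponent).sum())
win_pct = wins / len(toy_pred)
print(wins, win_pct)   # 3 0.6
```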


#### To put things in perspective, John Isner's predicted win percentage (with a neutralized serve) still lands him among the top five active American players in men's tennis. 

In [20]:
usa_win_pcts = z[z['usa?']==1].reset_index(drop=True)

adjusted_usa_win_pcts = usa_win_pcts.copy()
adjusted_usa_win_pcts.loc[adjusted_usa_win_pcts.player_id == isner_id, 'win_pct'] = total_win_pct
adjusted_usa_win_pcts = adjusted_usa_win_pcts.drop([0,2,4]).sort_values(by='win_pct',ascending=False).reset_index(drop=True)
adjusted_usa_win_pcts

Unnamed: 0,losses,name,player_id,total_matches,usa?,win_pct,wins
0,103,Jack Sock,106058,257,1,0.599222,154
1,246,Sam Querrey,105023,560,1,0.560714,314
2,117,Steve Johnson,105449,241,1,0.514523,124
3,221,John Isner,104545,580,1,0.471731,359
4,165,Donald Young,105385,288,1,0.427083,123
5,129,Ryan Harrison,105992,223,1,0.421525,94


### So, to all the haters: 

### Big John's game is certainly not without flaw, but he's an incredible player and one that we American tennis fans should be proud to call "ours". He's currently #10 in the world, #1 in our hearts, and he's definitely better than you.