Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because for the rest of the week we'll be making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# importing all the csv files

year_2002 = pd.read_csv('atp_matches_2002.csv')
year_2003 = pd.read_csv('atp_matches_2003.csv')
year_2004 = pd.read_csv('atp_matches_2004.csv')
year_2005 = pd.read_csv('atp_matches_2005.csv')
year_2006 = pd.read_csv('atp_matches_2006.csv')
year_2007 = pd.read_csv('atp_matches_2007.csv')
year_2008 = pd.read_csv('atp_matches_2008.csv')
year_2009 = pd.read_csv('atp_matches_2009.csv')
year_2010 = pd.read_csv('atp_matches_2010.csv')
year_2011 = pd.read_csv('atp_matches_2011.csv')
year_2012 = pd.read_csv('atp_matches_2012.csv')
year_2013 = pd.read_csv('atp_matches_2013.csv')
year_2014 = pd.read_csv('atp_matches_2014.csv')
year_2015 = pd.read_csv('atp_matches_2015.csv')
year_2016 = pd.read_csv('atp_matches_2016.csv')
year_2017 = pd.read_csv('atp_matches_2017.csv')
# year_2018 = pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2018.csv')
# year_2019 = pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2019.csv')
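
The `read_csv` calls above all follow one filename pattern, so they could be collapsed into a loop. This is a minimal sketch, assuming the files sit in a single directory and keep the `atp_matches_<year>.csv` naming; `load_atp_years` and its `data_dir` parameter are hypothetical names, not part of the notebook.

```python
from pathlib import Path

import pandas as pd


def load_atp_years(data_dir, first_year=2002, last_year=2017):
    """Read atp_matches_<year>.csv for each year into a dict keyed by year.

    Hypothetical helper: assumes the files live in data_dir and follow the
    naming pattern used in the cell above.
    """
    frames = {}
    for year in range(first_year, last_year + 1):
        path = Path(data_dir) / f"atp_matches_{year}.csv"
        frames[year] = pd.read_csv(path)
    return frames
```

With this, `years = load_atp_years('.')` would give `years[2002]` in place of `year_2002`, and so on for the other years.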

In [3]:
year_2002.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,2002-499,Delray Beach,Hard,32,A,20020304,1,104053,1.0,,...,5.0,0.0,2.0,55.0,33.0,15.0,11.0,9.0,1.0,6.0
1,2002-499,Delray Beach,Hard,32,A,20020304,2,102703,,,...,3.0,1.0,0.0,58.0,36.0,19.0,9.0,9.0,5.0,10.0
2,2002-499,Delray Beach,Hard,32,A,20020304,3,103566,,,...,0.0,0.0,0.0,14.0,7.0,3.0,2.0,3.0,1.0,3.0
3,2002-499,Delray Beach,Hard,32,A,20020304,4,103182,8.0,,...,11.0,7.0,3.0,68.0,35.0,28.0,11.0,12.0,3.0,7.0
4,2002-499,Delray Beach,Hard,32,A,20020304,5,103188,,,...,14.0,3.0,0.0,97.0,53.0,28.0,22.0,13.0,8.0,14.0


In [4]:
year_2017.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,2017-M020,Brisbane,Hard,32,A,20170102,300,105777,7.0,,...,7.0,4.0,0.0,69.0,49.0,36.0,9.0,12.0,2.0,5.0
1,2017-M020,Brisbane,Hard,32,A,20170102,299,105777,7.0,,...,0.0,4.0,3.0,61.0,28.0,24.0,16.0,10.0,2.0,4.0
2,2017-M020,Brisbane,Hard,32,A,20170102,298,105453,3.0,,...,5.0,9.0,2.0,61.0,37.0,27.0,10.0,10.0,0.0,2.0
3,2017-M020,Brisbane,Hard,32,A,20170102,297,105683,1.0,,...,7.0,4.0,0.0,84.0,61.0,39.0,14.0,14.0,2.0,4.0
4,2017-M020,Brisbane,Hard,32,A,20170102,296,105777,7.0,,...,14.0,6.0,5.0,82.0,37.0,29.0,24.0,14.0,4.0,7.0


In [5]:
pd.set_option('display.max_columns', 55)

In [6]:
# function that keeps only the matches Rafael Nadal played, as winner or loser

def nadal(data):
    '''Returns the rows of data where Rafael Nadal is the winner or the loser'''
    
    # boolean mask for matches involving Nadal on either side
    is_nadal = (data['winner_name'] == 'Rafael Nadal') | (data['loser_name'] == 'Rafael Nadal')
    
    # copy the filtered rows so later column assignments don't raise SettingWithCopyWarning
    return data[is_nadal].copy()

# subset of all the data to just Rafael Nadal
y_2002 = nadal(year_2002)
y_2003 = nadal(year_2003)
y_2004 = nadal(year_2004)
y_2005 = nadal(year_2005)
y_2006 = nadal(year_2006)
y_2007 = nadal(year_2007)
y_2008 = nadal(year_2008)
y_2009 = nadal(year_2009)
y_2010 = nadal(year_2010)
y_2011 = nadal(year_2011)
y_2012 = nadal(year_2012)
y_2013 = nadal(year_2013)
y_2014 = nadal(year_2014)
y_2015 = nadal(year_2015)
y_2016 = nadal(year_2016)
y_2017 = nadal(year_2017)
# y_2018 = nadal(year_2018)
# y_2019 = nadal(year_2019)

# concatenate all the datasets into one dataframe

r_nadal = pd.concat([y_2002, y_2003, y_2004, y_2005, y_2006, y_2007, y_2008, 
                    y_2009, y_2010, y_2011, y_2012, y_2013, y_2014, y_2015, 
                    y_2016, y_2017])
print(r_nadal.shape)
r_nadal.head()

(1001, 49)


Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,winner_rank,winner_rank_points,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,loser_rank,loser_rank_points,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
606,2002-573,Mallorca,Clay,32,A,20020429,6,104745,,WC,Rafael Nadal,L,185.0,ESP,15.904175,762.0,14.0,102887,,,Ramon Delgado,R,185.0,PAR,25.453799,81.0,490.0,6-4 6-4,3,R32,83.0,1.0,1.0,66.0,59.0,36.0,3.0,10.0,6.0,9.0,2.0,1.0,59.0,41.0,23.0,5.0,10.0,2.0,7.0
619,2002-573,Mallorca,Clay,32,A,20020429,19,103694,,,Olivier Rochus,R,168.0,BEL,21.275838,70.0,571.0,104745,,WC,Rafael Nadal,L,185.0,ESP,15.904175,762.0,14.0,6-2 6-2,3,R16,62.0,1.0,0.0,48.0,34.0,23.0,12.0,8.0,0.0,0.0,0.0,0.0,59.0,49.0,28.0,2.0,8.0,5.0,9.0
188,2003-414,Hamburg Masters,Clay,64,M,20030512,31,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,103908,,,Paul Henri Mathieu,R,185.0,FRA,21.327858,44.0,855.0,7-5 6-4,3,R64,102.0,2.0,1.0,81.0,66.0,44.0,6.0,11.0,8.0,10.0,2.0,1.0,57.0,38.0,26.0,9.0,11.0,3.0,7.0
205,2003-414,Hamburg Masters,Clay,64,M,20030512,48,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,102845,2.0,,Carlos Moya,R,190.0,ESP,26.704997,4.0,2985.0,7-5 6-4,3,R32,89.0,1.0,0.0,63.0,53.0,30.0,7.0,11.0,2.0,5.0,3.0,0.0,64.0,36.0,25.0,11.0,11.0,4.0,9.0
213,2003-414,Hamburg Masters,Clay,64,M,20030512,56,103292,,,Gaston Gaudio,R,175.0,ARG,24.421629,29.0,1080.0,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,6-2 6-2,3,R16,59.0,3.0,1.0,40.0,25.0,22.0,10.0,8.0,0.0,0.0,1.0,2.0,48.0,35.0,18.0,7.0,8.0,2.0,6.0


In [7]:
r_nadal.tail()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,winner_rank,winner_rank_points,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,loser_rank,loser_rank_points,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
244,2017-580,Australian Open,Hard,128,G,20170116,205,104745,9.0,,Rafael Nadal,L,185.0,ESP,30.622861,9.0,3195.0,100644,24.0,,Alexander Zverev,R,,GER,19.742642,24.0,1655.0,4-6 6-3 6-7(5) 6-3 6-2,5,R32,245.0,11.0,4.0,150.0,113.0,82.0,23.0,24.0,5.0,7.0,19.0,11.0,172.0,108.0,77.0,27.0,24.0,11.0,16.0
255,2017-580,Australian Open,Hard,128,G,20170116,216,104745,9.0,,Rafael Nadal,L,185.0,ESP,30.622861,9.0,3195.0,104792,6.0,,Gael Monfils,R,193.0,FRA,30.376454,6.0,3625.0,6-3 6-3 4-6 6-4,5,R16,175.0,2.0,1.0,101.0,74.0,53.0,17.0,19.0,3.0,6.0,15.0,10.0,130.0,76.0,50.0,24.0,19.0,11.0,17.0
261,2017-580,Australian Open,Hard,128,G,20170116,222,104745,9.0,,Rafael Nadal,L,185.0,ESP,30.622861,9.0,3195.0,105683,3.0,,Milos Raonic,R,196.0,CAN,26.056126,3.0,5290.0,6-4 7-6(7) 6-4,5,QF,164.0,4.0,4.0,105.0,76.0,63.0,15.0,16.0,4.0,4.0,14.0,2.0,107.0,71.0,48.0,20.0,16.0,1.0,3.0
264,2017-580,Australian Open,Hard,128,G,20170116,225,104745,9.0,,Rafael Nadal,L,185.0,ESP,30.622861,9.0,3195.0,105777,15.0,,Grigor Dimitrov,R,188.0,BUL,25.672827,15.0,2135.0,6-3 5-7 7-6(5) 6-7(4) 6-4,5,SF,296.0,8.0,3.0,184.0,135.0,93.0,27.0,28.0,12.0,16.0,20.0,5.0,181.0,124.0,87.0,28.0,27.0,8.0,13.0
265,2017-580,Australian Open,Hard,128,G,20170116,226,103819,17.0,,Roger Federer,R,185.0,SUI,35.441478,17.0,1980.0,104745,9.0,,Rafael Nadal,L,185.0,ESP,30.622861,9.0,3195.0,6-4 3-6 6-1 3-6 6-3,5,F,217.0,20.0,3.0,138.0,85.0,65.0,26.0,22.0,13.0,17.0,4.0,3.0,151.0,110.0,69.0,23.0,22.0,14.0,20.0


In [8]:
r_nadal.columns

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'winner_rank', 'winner_rank_points', 'loser_id', 'loser_seed',
       'loser_entry', 'loser_name', 'loser_hand', 'loser_ht', 'loser_ioc',
       'loser_age', 'loser_rank', 'loser_rank_points', 'score', 'best_of',
       'round', 'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon',
       'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df',
       'l_svpt', 'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved',
       'l_bpFaced'],
      dtype='object')

In [9]:
year_2003 = nadal(year_2003)

In [10]:
year_2003

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,winner_rank,winner_rank_points,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,loser_rank,loser_rank_points,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
188,2003-414,Hamburg Masters,Clay,64,M,20030512,31,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,103908,,,Paul Henri Mathieu,R,185.0,FRA,21.327858,44.0,855.0,7-5 6-4,3,R64,102.0,2.0,1.0,81.0,66.0,44.0,6.0,11.0,8.0,10.0,2.0,1.0,57.0,38.0,26.0,9.0,11.0,3.0,7.0
205,2003-414,Hamburg Masters,Clay,64,M,20030512,48,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,102845,2.0,,Carlos Moya,R,190.0,ESP,26.704997,4.0,2985.0,7-5 6-4,3,R32,89.0,1.0,0.0,63.0,53.0,30.0,7.0,11.0,2.0,5.0,3.0,0.0,64.0,36.0,25.0,11.0,11.0,4.0,9.0
213,2003-414,Hamburg Masters,Clay,64,M,20030512,56,103292,,,Gaston Gaudio,R,175.0,ARG,24.421629,29.0,1080.0,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,6-2 6-2,3,R16,59.0,3.0,1.0,40.0,25.0,22.0,10.0,8.0,0.0,0.0,1.0,2.0,48.0,35.0,18.0,7.0,8.0,2.0,6.0
306,2003-410,Monte Carlo Masters,Clay,64,M,20030414,23,104745,,Q,Rafael Nadal,L,185.0,ESP,16.862423,109.0,337.0,102344,,,Karol Kucera,R,188.0,SVK,29.111567,49.0,768.0,6-1 6-2,3,R64,63.0,0.0,2.0,47.0,31.0,25.0,10.0,8.0,3.0,3.0,3.0,8.0,43.0,23.0,13.0,6.0,7.0,2.0,6.0
327,2003-410,Monte Carlo Masters,Clay,64,M,20030414,44,104745,,Q,Rafael Nadal,L,185.0,ESP,16.862423,109.0,337.0,102610,4.0,,Albert Costa,R,180.0,ESP,27.802875,7.0,2235.0,7-5 6-3,3,R32,120.0,1.0,1.0,96.0,70.0,40.0,14.0,11.0,14.0,17.0,1.0,3.0,67.0,34.0,19.0,14.0,10.0,7.0,12.0
337,2003-410,Monte Carlo Masters,Clay,64,M,20030414,54,103909,,,Guillermo Coria,R,175.0,ARG,21.24846,26.0,1165.0,104745,,Q,Rafael Nadal,L,185.0,ESP,16.862423,109.0,337.0,7-6(3) 6-2,3,R16,94.0,0.0,0.0,66.0,37.0,23.0,15.0,10.0,5.0,8.0,0.0,2.0,56.0,34.0,17.0,10.0,10.0,3.0,8.0
431,2003-560,US Open,Hard,128,G,20030825,54,104745,,,Rafael Nadal,L,185.0,ESP,17.226557,45.0,786.0,102950,,,Fernando Vicente,R,180.0,ESP,26.464066,61.0,616.0,6-4 6-3 6-3,5,R128,133.0,2.0,4.0,82.0,46.0,37.0,23.0,14.0,3.0,3.0,9.0,8.0,111.0,65.0,44.0,19.0,14.0,14.0,18.0
468,2003-560,US Open,Hard,128,G,20030825,91,101962,22.0,,Younes El Aynaoui,R,193.0,MAR,31.950719,21.0,1260.0,104745,,,Rafael Nadal,L,185.0,ESP,17.226557,45.0,786.0,7-6(6) 6-3 7-6(6),5,R64,163.0,15.0,2.0,112.0,76.0,58.0,21.0,17.0,3.0,5.0,3.0,6.0,109.0,83.0,59.0,13.0,16.0,9.0,12.0
579,2003-425,Barcelona,Clay,56,A,20030421,13,104745,,WC,Rafael Nadal,L,185.0,ESP,16.881588,96.0,426.0,102548,,WC,Juan Antonio Marin,R,175.0,CRC,28.136893,137.0,275.0,6-0 RET,3,R64,42.0,0.0,0.0,17.0,10.0,7.0,5.0,3.0,0.0,0.0,0.0,0.0,19.0,12.0,2.0,3.0,3.0,0.0,3.0
599,2003-425,Barcelona,Clay,56,A,20030421,33,102374,8.0,,Alex Corretja,R,180.0,ESP,29.026694,17.0,1415.0,104745,,WC,Rafael Nadal,L,185.0,ESP,16.881588,96.0,426.0,3-6 6-2 6-1,3,R32,127.0,3.0,2.0,83.0,51.0,30.0,17.0,12.0,9.0,13.0,3.0,1.0,78.0,57.0,31.0,6.0,12.0,7.0,14.0


In [11]:
r_nadal['won'] = r_nadal['winner_name'] == 'Rafael Nadal'

In [12]:
r_nadal.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,winner_rank,winner_rank_points,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,loser_rank,loser_rank_points,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,won
606,2002-573,Mallorca,Clay,32,A,20020429,6,104745,,WC,Rafael Nadal,L,185.0,ESP,15.904175,762.0,14.0,102887,,,Ramon Delgado,R,185.0,PAR,25.453799,81.0,490.0,6-4 6-4,3,R32,83.0,1.0,1.0,66.0,59.0,36.0,3.0,10.0,6.0,9.0,2.0,1.0,59.0,41.0,23.0,5.0,10.0,2.0,7.0,True
619,2002-573,Mallorca,Clay,32,A,20020429,19,103694,,,Olivier Rochus,R,168.0,BEL,21.275838,70.0,571.0,104745,,WC,Rafael Nadal,L,185.0,ESP,15.904175,762.0,14.0,6-2 6-2,3,R16,62.0,1.0,0.0,48.0,34.0,23.0,12.0,8.0,0.0,0.0,0.0,0.0,59.0,49.0,28.0,2.0,8.0,5.0,9.0,False
188,2003-414,Hamburg Masters,Clay,64,M,20030512,31,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,103908,,,Paul Henri Mathieu,R,185.0,FRA,21.327858,44.0,855.0,7-5 6-4,3,R64,102.0,2.0,1.0,81.0,66.0,44.0,6.0,11.0,8.0,10.0,2.0,1.0,57.0,38.0,26.0,9.0,11.0,3.0,7.0,True
205,2003-414,Hamburg Masters,Clay,64,M,20030512,48,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,102845,2.0,,Carlos Moya,R,190.0,ESP,26.704997,4.0,2985.0,7-5 6-4,3,R32,89.0,1.0,0.0,63.0,53.0,30.0,7.0,11.0,2.0,5.0,3.0,0.0,64.0,36.0,25.0,11.0,11.0,4.0,9.0,True
213,2003-414,Hamburg Masters,Clay,64,M,20030512,56,103292,,,Gaston Gaudio,R,175.0,ARG,24.421629,29.0,1080.0,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,6-2 6-2,3,R16,59.0,3.0,1.0,40.0,25.0,22.0,10.0,8.0,0.0,0.0,1.0,2.0,48.0,35.0,18.0,7.0,8.0,2.0,6.0,False


In [13]:
r_nadal = r_nadal.reset_index(drop=True)

In [14]:
r_nadal.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,winner_rank,winner_rank_points,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,loser_rank,loser_rank_points,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,won
0,2002-573,Mallorca,Clay,32,A,20020429,6,104745,,WC,Rafael Nadal,L,185.0,ESP,15.904175,762.0,14.0,102887,,,Ramon Delgado,R,185.0,PAR,25.453799,81.0,490.0,6-4 6-4,3,R32,83.0,1.0,1.0,66.0,59.0,36.0,3.0,10.0,6.0,9.0,2.0,1.0,59.0,41.0,23.0,5.0,10.0,2.0,7.0,True
1,2002-573,Mallorca,Clay,32,A,20020429,19,103694,,,Olivier Rochus,R,168.0,BEL,21.275838,70.0,571.0,104745,,WC,Rafael Nadal,L,185.0,ESP,15.904175,762.0,14.0,6-2 6-2,3,R16,62.0,1.0,0.0,48.0,34.0,23.0,12.0,8.0,0.0,0.0,0.0,0.0,59.0,49.0,28.0,2.0,8.0,5.0,9.0,False
2,2003-414,Hamburg Masters,Clay,64,M,20030512,31,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,103908,,,Paul Henri Mathieu,R,185.0,FRA,21.327858,44.0,855.0,7-5 6-4,3,R64,102.0,2.0,1.0,81.0,66.0,44.0,6.0,11.0,8.0,10.0,2.0,1.0,57.0,38.0,26.0,9.0,11.0,3.0,7.0,True
3,2003-414,Hamburg Masters,Clay,64,M,20030512,48,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,102845,2.0,,Carlos Moya,R,190.0,ESP,26.704997,4.0,2985.0,7-5 6-4,3,R32,89.0,1.0,0.0,63.0,53.0,30.0,7.0,11.0,2.0,5.0,3.0,0.0,64.0,36.0,25.0,11.0,11.0,4.0,9.0,True
4,2003-414,Hamburg Masters,Clay,64,M,20030512,56,103292,,,Gaston Gaudio,R,175.0,ARG,24.421629,29.0,1080.0,104745,,Q,Rafael Nadal,L,185.0,ESP,16.939083,87.0,486.0,6-2 6-2,3,R16,59.0,3.0,1.0,40.0,25.0,22.0,10.0,8.0,0.0,0.0,1.0,2.0,48.0,35.0,18.0,7.0,8.0,2.0,6.0,False


In [15]:
r_nadal.tail()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,winner_rank,winner_rank_points,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,loser_rank,loser_rank_points,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,won
996,2017-580,Australian Open,Hard,128,G,20170116,205,104745,9.0,,Rafael Nadal,L,185.0,ESP,30.622861,9.0,3195.0,100644,24.0,,Alexander Zverev,R,,GER,19.742642,24.0,1655.0,4-6 6-3 6-7(5) 6-3 6-2,5,R32,245.0,11.0,4.0,150.0,113.0,82.0,23.0,24.0,5.0,7.0,19.0,11.0,172.0,108.0,77.0,27.0,24.0,11.0,16.0,True
997,2017-580,Australian Open,Hard,128,G,20170116,216,104745,9.0,,Rafael Nadal,L,185.0,ESP,30.622861,9.0,3195.0,104792,6.0,,Gael Monfils,R,193.0,FRA,30.376454,6.0,3625.0,6-3 6-3 4-6 6-4,5,R16,175.0,2.0,1.0,101.0,74.0,53.0,17.0,19.0,3.0,6.0,15.0,10.0,130.0,76.0,50.0,24.0,19.0,11.0,17.0,True
998,2017-580,Australian Open,Hard,128,G,20170116,222,104745,9.0,,Rafael Nadal,L,185.0,ESP,30.622861,9.0,3195.0,105683,3.0,,Milos Raonic,R,196.0,CAN,26.056126,3.0,5290.0,6-4 7-6(7) 6-4,5,QF,164.0,4.0,4.0,105.0,76.0,63.0,15.0,16.0,4.0,4.0,14.0,2.0,107.0,71.0,48.0,20.0,16.0,1.0,3.0,True
999,2017-580,Australian Open,Hard,128,G,20170116,225,104745,9.0,,Rafael Nadal,L,185.0,ESP,30.622861,9.0,3195.0,105777,15.0,,Grigor Dimitrov,R,188.0,BUL,25.672827,15.0,2135.0,6-3 5-7 7-6(5) 6-7(4) 6-4,5,SF,296.0,8.0,3.0,184.0,135.0,93.0,27.0,28.0,12.0,16.0,20.0,5.0,181.0,124.0,87.0,28.0,27.0,8.0,13.0,True
1000,2017-580,Australian Open,Hard,128,G,20170116,226,103819,17.0,,Roger Federer,R,185.0,SUI,35.441478,17.0,1980.0,104745,9.0,,Rafael Nadal,L,185.0,ESP,30.622861,9.0,3195.0,6-4 3-6 6-1 3-6 6-3,5,F,217.0,20.0,3.0,138.0,85.0,65.0,26.0,22.0,13.0,17.0,4.0,3.0,151.0,110.0,69.0,23.0,22.0,14.0,20.0,False


In [16]:
def into_int(value):
    '''Converts a win flag to the integer 1 (win) or 0 (loss)'''
    return int(bool(value))

In [17]:
r_nadal['won'] = r_nadal['won'].apply(into_int)

In [18]:
r_nadal.dtypes

tourney_id             object
tourney_name           object
surface                object
draw_size               int64
tourney_level          object
tourney_date            int64
match_num               int64
winner_id               int64
winner_seed           float64
winner_entry           object
winner_name            object
winner_hand            object
winner_ht             float64
winner_ioc             object
winner_age            float64
winner_rank           float64
winner_rank_points    float64
loser_id                int64
loser_seed            float64
loser_entry            object
loser_name             object
loser_hand             object
loser_ht              float64
loser_ioc              object
loser_age             float64
loser_rank            float64
loser_rank_points     float64
score                  object
best_of                 int64
round                  object
minutes               float64
w_ace                 float64
w_df                  float64
w_svpt    

In [19]:
r_nadal['tourney_date'] = r_nadal['tourney_date'].astype(str)

In [20]:
r_nadal['tourney_date'] = pd.to_datetime(r_nadal['tourney_date'], format='%Y%m%d')

In [21]:
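# split the data into train and test by date:
# matches before 2015 go to train, matches from 2015 onward go to test

cutoff = pd.Timestamp('2015-01-01')
train = r_nadal[r_nadal['tourney_date'] < cutoff]
test = r_nadal[r_nadal['tourney_date'] >= cutoff]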

In [22]:
train.shape, test.shape

((855, 50), (146, 50))

In [23]:
!pip install pandas-profiling



In [24]:
from pandas_profiling import ProfileReport

# keep the report object; to_notebook_iframe() only displays it and returns None
profile = ProfileReport(r_nadal, minimal=True)
profile.to_notebook_iframe()

In [26]:
target = 'won'

# Get a dataframe with all train columns except the target
train_features = train.drop(columns=[target])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality <= 50].index.tolist()

# Combine the lists 
features = numeric_features + categorical_features
print(features)

['draw_size', 'match_num', 'winner_id', 'winner_seed', 'winner_ht', 'winner_age', 'winner_rank', 'winner_rank_points', 'loser_id', 'loser_seed', 'loser_ht', 'loser_age', 'loser_rank', 'loser_rank_points', 'best_of', 'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt', 'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced', 'surface', 'tourney_level', 'winner_entry', 'winner_hand', 'winner_ioc', 'loser_entry', 'loser_hand', 'loser_ioc', 'round']


In [27]:
# Arrange data into X features matrix and y target vector 
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

# shapes of the data sets
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((855, 43), (855,), (146, 43), (146,))
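
The assignment asks whether the model beats the baseline. For a classification target such as `won`, one common baseline is the accuracy of always predicting the majority class. This is a minimal sketch on made-up labels; `majority_baseline_accuracy` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd


def majority_baseline_accuracy(y_train, y_eval):
    """Accuracy from always predicting the most frequent class in y_train."""
    majority_class = y_train.mode()[0]
    return (y_eval == majority_class).mean()


# Made-up labels for illustration only; not the notebook's data.
toy_train = pd.Series([1, 1, 1, 0])
toy_test = pd.Series([1, 0, 1, 1, 0])
print(majority_baseline_accuracy(toy_train, toy_test))
```

In the notebook this would be called as `majority_baseline_accuracy(y_train, y_test)`, and the result compared against the model's test score.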

In [28]:
X_train.isnull().sum()

draw_size               0
match_num               0
winner_id               0
winner_seed           117
winner_ht               2
winner_age              0
winner_rank             4
winner_rank_points      4
loser_id                0
loser_seed            477
loser_ht               11
loser_age               0
loser_rank              4
loser_rank_points       4
best_of                 0
minutes                31
w_ace                  32
w_df                   32
w_svpt                 32
w_1stIn                32
w_1stWon               32
w_2ndWon               32
w_SvGms                32
w_bpSaved              32
w_bpFaced              32
l_ace                  32
l_df                   32
l_svpt                 32
l_1stIn                32
l_1stWon               32
l_2ndWon               32
l_SvGms                32
l_bpSaved              32
l_bpFaced              32
surface                 0
tourney_level           0
winner_entry          838
winner_hand             0
winner_ioc  

In [29]:
# correlations are only defined for the numeric columns
X_train[numeric_features].corr()

Unnamed: 0,draw_size,match_num,winner_id,winner_seed,winner_ht,winner_age,winner_rank,winner_rank_points,loser_id,loser_seed,loser_ht,loser_age,loser_rank,loser_rank_points,best_of,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
draw_size,1.0,0.849695,0.100742,0.036605,0.00902,0.018109,-0.110286,0.128959,0.042211,0.388713,0.021662,0.025483,0.040776,0.000453,0.718,0.403681,0.173964,0.092582,0.392,0.392659,0.446502,0.347014,0.479144,0.092175,0.103782,0.218069,0.193427,0.461596,0.409837,0.372241,0.331276,0.48843,0.238306,0.353248
match_num,0.849695,1.0,0.092878,0.050208,0.021984,0.051237,-0.14518,0.098816,0.082639,0.325879,0.028351,0.006261,-0.14397,0.178377,0.598974,0.443566,0.169349,0.050143,0.415639,0.424689,0.450781,0.319987,0.466981,0.155449,0.17431,0.203458,0.131607,0.446082,0.404637,0.374018,0.328521,0.476882,0.219318,0.317677
winner_id,0.100742,0.092878,1.0,-0.325569,0.238778,-0.35001,-0.153605,0.31646,-0.178464,0.15222,0.028194,0.438916,0.049843,-0.063637,0.068333,-0.006793,-0.202705,-0.206391,-0.064732,0.011966,0.011379,-0.154791,-0.040658,-0.099515,-0.115819,0.162208,0.051844,-0.018799,-0.09881,-0.075471,0.040869,-0.029765,-0.02662,0.007651
winner_seed,0.036605,0.050208,-0.325569,1.0,0.059604,-0.098536,0.57322,-0.499074,-0.063057,-0.106218,-0.045413,-0.167809,-0.007286,0.150933,-0.074791,0.002063,0.127334,0.080868,0.04154,0.000751,-0.009763,0.067365,0.016457,-0.007748,0.024223,-0.109015,0.016953,0.005217,0.04399,0.038813,-0.018874,0.007434,0.009576,-0.020675
winner_ht,0.00902,0.021984,0.238778,0.059604,1.0,-0.079271,0.118402,0.002946,0.035836,-0.13323,-0.005656,0.019186,-0.138188,0.203271,0.002008,0.037443,0.234529,0.05239,0.035544,0.036639,0.072749,0.006297,0.03441,0.001788,-0.025821,0.033685,0.011096,0.006424,0.01165,0.039625,0.038705,0.032545,-0.041839,-0.070731
winner_age,0.018109,0.051237,-0.35001,-0.098536,-0.079271,1.0,-0.153452,0.463402,0.565816,-0.196215,0.078104,0.068845,-0.163533,0.321183,-0.016027,0.064696,0.185516,0.069898,0.041052,-0.002069,0.024669,0.091886,0.035981,0.037213,0.034051,-0.03888,-0.095749,0.024517,0.066538,0.061057,-0.008734,0.038637,-0.038733,-0.046019
winner_rank,-0.110286,-0.14518,-0.153605,0.57322,0.118402,-0.153452,1.0,-0.356185,-0.090593,-0.16131,-0.019492,-0.120991,0.02891,0.028855,-0.067641,-0.044067,0.135337,0.090356,0.023824,0.015526,0.007143,0.009086,-0.006261,0.07789,0.074705,-0.057734,-0.020952,-0.040815,-0.022396,-0.007765,-0.024791,-0.0183,-0.056803,-0.077395
winner_rank_points,0.128959,0.098816,0.31646,-0.499074,0.002946,0.463402,-0.356185,1.0,0.258343,0.140978,0.092062,0.299616,-0.020538,-0.007697,0.085672,0.062747,-0.127819,-0.106821,-0.032884,-0.009352,0.003377,-0.032426,-0.005728,-0.060605,-0.066417,0.126444,-0.005633,0.025164,-0.003421,-0.000345,0.006511,0.016273,-0.01257,0.02905
loser_id,0.042211,0.082639,-0.178464,-0.063057,0.035836,0.565816,-0.090593,0.258343,1.0,-0.163439,0.215289,-0.664484,-0.074954,0.247802,0.04434,0.126624,0.174637,0.079515,0.128824,0.089299,0.090515,0.136779,0.108531,0.096329,0.107723,0.009403,-0.017566,0.08023,0.109235,0.106282,0.050274,0.104888,-0.013383,-0.016556
loser_seed,0.388713,0.325879,0.15222,-0.106218,-0.13323,-0.196215,-0.16131,0.140978,-0.163439,1.0,-0.030472,0.081089,0.714488,-0.632741,0.272127,0.020894,-0.164093,-0.030947,0.004567,0.04239,0.07523,-0.038356,0.064348,-0.093895,-0.102062,0.161412,0.155909,0.106439,0.01768,0.018757,0.139452,0.071641,0.082425,0.128366


In [30]:
%%time
# WARNING: the %%time command sometimes has quirks/bugs

from scipy.stats import randint, uniform
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier()
)

param_distributions = {   
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestclassifier__n_estimators': randint(100, 500),
    'randomforestclassifier__min_samples_leaf': randint(1, 20),
    'randomforestclassifier__max_depth': [5, 10, 15, 20, None], 
    'randomforestclassifier__max_features': uniform(0, 1),
}

search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=100, 
    cv=3, 
    scoring='accuracy', 
    verbose=10, 
    random_state=42, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(X_train, y_train);

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    2.9s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    6.7s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    8.1s
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:    9.7s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:   10.6s
[Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:   12.5s
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:   15.0s
[Parallel(n_jobs=-1)]: Done  97 tasks      | elapsed:   17.5s
[Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed:   20.1s
[Parallel(n_jobs=-1)]: Done 129 tasks      | elapsed:   24.9s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   27.9s
[Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed:   31.4s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   

Wall time: 59 s


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=None,
                                                             drop_invariant=False,
                                                             handle_missing='value',
                                                             handle_unknown='value',
                                                             mapping=None,
                                                             return_df=True,
                                                             verbose=0)),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=False,
                                                            copy=True,
                                                            fill_value=None,


In [31]:
search.score(X_train, y_train)

1.0

In [32]:
search.score(X_test, y_test)

1.0
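
A score of 1.0 on both train and test usually points to leakage, not a perfect model. Here `winner_id` is in the feature list and `won` was derived from `winner_name`, so the id alone may determine the target. This is a minimal sketch of one way to check for such columns, using a made-up frame; `features_that_determine_target` is a hypothetical helper.

```python
import pandas as pd


def features_that_determine_target(df, features, target):
    """Return features whose every distinct value maps to a single target value.

    Such a feature predicts the target perfectly on this data by lookup,
    which is a common symptom of leakage.
    """
    leaky = []
    for col in features:
        # Count the distinct target values seen within each value of the feature.
        targets_per_value = df.groupby(col)[target].nunique()
        if (targets_per_value <= 1).all():
            leaky.append(col)
    return leaky


# Made-up rows for illustration; 104745 is the id shown for Rafael Nadal above.
toy = pd.DataFrame({
    'winner_id': [104745, 104745, 103694, 103292],
    'draw_size': [32, 64, 32, 64],
    'won': [1, 1, 0, 0],
})
leaky_columns = features_that_determine_target(toy, ['winner_id', 'draw_size'], 'won')
print(leaky_columns)
```

The check is most meaningful for id-like and low-cardinality columns. A continuous column whose values are nearly all unique will pass trivially, so flagged columns still need a manual look.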

In [33]:
search.best_params_

{'randomforestclassifier__max_depth': 10,
 'randomforestclassifier__max_features': 0.18182496720710062,
 'randomforestclassifier__min_samples_leaf': 1,
 'randomforestclassifier__n_estimators': 413,
 'simpleimputer__strategy': 'median'}
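
The assignment also asks to try xgboost. This is a minimal sketch on synthetic data, assuming the `xgboost` package is installed. If it is not, the sketch falls back to scikit-learn's `GradientBoostingClassifier`, which is a different implementation of gradient boosting. The hyperparameter values are illustrative, not tuned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

try:
    # Assumes the xgboost package is installed (pip install xgboost).
    from xgboost import XGBClassifier as BoostingClassifier
except ImportError:
    # Stand-in so the sketch still runs without xgboost; not the same library.
    from sklearn.ensemble import GradientBoostingClassifier as BoostingClassifier

# Synthetic data for illustration only: the label depends on the first column.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

model = BoostingClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
)
model.fit(X_tr, y_tr)
val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

In the notebook, the classifier could take the place of `RandomForestClassifier()` in `make_pipeline`, after the `OrdinalEncoder`. Recent xgboost releases expect numeric class labels (0 and 1 here), not strings.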

In [35]:
y_2004.shape

(48, 49)

In [36]:
y_2005.shape

(89, 49)

In [37]:
X_train.head()

Unnamed: 0,draw_size,match_num,winner_id,winner_seed,winner_ht,winner_age,winner_rank,winner_rank_points,loser_id,loser_seed,loser_ht,loser_age,loser_rank,loser_rank_points,best_of,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,surface,tourney_level,winner_entry,winner_hand,winner_ioc,loser_entry,loser_hand,loser_ioc,round
0,32,6,104745,,185.0,15.904175,762.0,14.0,102887,,185.0,25.453799,81.0,490.0,3,83.0,1.0,1.0,66.0,59.0,36.0,3.0,10.0,6.0,9.0,2.0,1.0,59.0,41.0,23.0,5.0,10.0,2.0,7.0,Clay,A,WC,L,ESP,,R,PAR,R32
1,32,19,103694,,168.0,21.275838,70.0,571.0,104745,,185.0,15.904175,762.0,14.0,3,62.0,1.0,0.0,48.0,34.0,23.0,12.0,8.0,0.0,0.0,0.0,0.0,59.0,49.0,28.0,2.0,8.0,5.0,9.0,Clay,A,,R,BEL,WC,L,ESP,R16
2,64,31,104745,,185.0,16.939083,87.0,486.0,103908,,185.0,21.327858,44.0,855.0,3,102.0,2.0,1.0,81.0,66.0,44.0,6.0,11.0,8.0,10.0,2.0,1.0,57.0,38.0,26.0,9.0,11.0,3.0,7.0,Clay,M,Q,L,ESP,,R,FRA,R64
3,64,48,104745,,185.0,16.939083,87.0,486.0,102845,2.0,190.0,26.704997,4.0,2985.0,3,89.0,1.0,0.0,63.0,53.0,30.0,7.0,11.0,2.0,5.0,3.0,0.0,64.0,36.0,25.0,11.0,11.0,4.0,9.0,Clay,M,Q,L,ESP,,R,ESP,R32
4,64,56,103292,,175.0,24.421629,29.0,1080.0,104745,,185.0,16.939083,87.0,486.0,3,59.0,3.0,1.0,40.0,25.0,22.0,10.0,8.0,0.0,0.0,1.0,2.0,48.0,35.0,18.0,7.0,8.0,2.0,6.0,Clay,M,,R,ARG,Q,L,ESP,R16
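
The remaining assignment task is to get the model's permutation importances. This is a minimal sketch on synthetic data using `permutation_importance` from `sklearn.inspection`; this is a swapped-in tool, while the Kaggle reading above uses a different library. The column names `signal` and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 'signal' drives the label, 'noise' does not.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'signal': rng.normal(size=500),
    'noise': rng.normal(size=500),
})
y = (X['signal'] > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each validation column several times and record the drop in accuracy.
result = permutation_importance(
    model, X_val, y_val, scoring='accuracy', n_repeats=10, random_state=0
)
importances = pd.Series(result.importances_mean, index=X_val.columns)
print(importances.sort_values(ascending=False))
```

For the notebook's model, the same call could take `search.best_estimator_` with `X_test` and `y_test`, since the fitted pipeline accepts the raw feature columns.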
