# 2021: Week 6 - Comparing Prize Money for Professional Golfers

February 10, 2021

Challenge By: Jenny Martin

__What's one of the benefits of preparing your own data?
Being able to start your analysis sooner!__

I sometimes find that opening Tableau Desktop to explore my data gets a little distracting, because I start trying to visualise it before I've decided on the story. Starting my analysis of the dataset in Tableau Prep helps me, personally, to stay more focused! It's clear where the outliers are, what the distribution of the dataset is and therefore what the story should be.

For this week's challenge we're looking at a dataset that was used in December 2020 for Sports Viz Sunday (thanks to Kate Brown for sharing!). This dataset comes from the PGA and LPGA 2019 Golf tours and lists the total prize money for the top 100 players on each tour. For those of us who aren't too familiar with golf, the PGA is the men's tour, whilst the LPGA is the women's tour.

## Input
We have one input this week:

<img src='https://1.bp.blogspot.com/-n1nJAhjFwFE/YB1DtLB2OrI/AAAAAAAAAtc/okyuUbZ672006nCq_cenaAu_9SWa1HlBgCLcBGAsYHQ/w400-h223/2021W06%2BInput.png'>

Official Money

## Requirements
- [Input the data](https://drive.google.com/file/d/13Vx4CVwEwoRDXQdjctqef9ooa1ojtkds/view?usp=sharing)
- Answer these questions:
    - What's the Total Prize Money earned by players for each tour?
    - How many players are in this dataset for each tour?
    - How many events in total did players participate in for each tour?
    - How much do players win per event? What's the average of this for each tour?
    - How do players rank by prize money for each tour? What about overall? What is the average difference between where they are ranked within their tour compared to the overall rankings where both tours are combined?
        - Here we would like the difference to be positive, as you would expect combining the tours to increase a player's rank number (move them further down the list)
- Combine the answers to these questions into one dataset
- Pivot the data so that we have a column for each tour, with each row representing an answer to the above questions
- Clean up the Measure field and create a new column showing the difference between the tours for each measure
    - We're calculating LPGA minus PGA, so in most instances this number will be negative
- [Output the data](https://drive.google.com/file/d/1EXw1fweOb47LiQt4RtUq6ERRMO68D6FD/view?usp=sharing)
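
The ranking requirement above is the least obvious one, so here is a minimal sketch of it on toy data. The four players and their prize money are made up for illustration; only the column names `MONEY` and `TOUR` follow the input file. It assumes rank 1 is the highest earner and that the difference is taken as overall rank minus tour rank.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the challenge file).
toy = pd.DataFrame({
    "PLAYER NAME": ["A", "B", "C", "D"],
    "MONEY": [900, 300, 400, 100],
    "TOUR": ["PGA", "PGA", "LPGA", "LPGA"],
})

# Rank 1 = highest earner, first within each tour, then across both tours.
toy["TOUR RANK"] = toy.groupby("TOUR")["MONEY"].rank(ascending=False)
toy["OVERALL RANK"] = toy["MONEY"].rank(ascending=False)

# Overall minus tour rank: adding players from the other tour can only
# push a player further down the list, so this is zero or positive.
toy["RANK DIFFERENCE"] = toy["OVERALL RANK"] - toy["TOUR RANK"]

# Average difference per tour.
avg_rank_difference = toy.groupby("TOUR")["RANK DIFFERENCE"].mean()
print(avg_rank_difference)
```

Subtracting in the other direction (tour rank minus overall rank) would give the same magnitudes with a negative sign, which is why the order of subtraction matters for this requirement.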

## Output

<img src='https://1.bp.blogspot.com/-iKTMfxcBhx8/YCPot6fySAI/AAAAAAAAAvE/KqpS4RH8QQo_0HJMnXXwFfjLycQn_CQPwCLcBGAsYHQ/w400-h131/2021W06%2BOutput.png'>

- 4 fields
    - Measure
    - PGA
    - LPGA
    - Difference between tours
- 5 rows (6 including headers)

The full output can be downloaded [here](https://drive.google.com/file/d/1EXw1fweOb47LiQt4RtUq6ERRMO68D6FD/view?usp=sharing).

In [78]:
import pandas as pd

In [79]:
input = 'PGALPGAMoney2019.xlsx'
excel_sheet = pd.ExcelFile(input).sheet_names
print(excel_sheet)

['OfficialMoney', 'Sources']


In [80]:
df1 = pd.read_excel(input, sheet_name='OfficialMoney')
print(df1.head(5))
print(df1.info())

       PLAYER NAME    MONEY  EVENTS TOUR
0    Brooks Koepka  9684006      21  PGA
1     Rory McIlroy  7785286      19  PGA
2      Matt Kuchar  6294690      22  PGA
3  Patrick Cantlay  6121488      21  PGA
4    Gary Woodland  5690965      24  PGA
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   PLAYER NAME  200 non-null    object
 1   MONEY        200 non-null    int64 
 2   EVENTS       200 non-null    int64 
 3   TOUR         200 non-null    object
dtypes: int64(2), object(2)
memory usage: 6.4+ KB
None


In [81]:
df2 = pd.read_excel(input, sheet_name='Sources', header=None)
df2.columns = ['TOUR', 'SOURCE']
print(df2.head(5))
print(df2.info())

   TOUR                                             SOURCE
0  LPGA  https://www.lpga.com/statistics/money/official...
1   PGA  https://www.pgatour.com/content/pgatour/stats/...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   TOUR    2 non-null      object
 1   SOURCE  2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None


In [82]:
# What's the Total Prize Money earned by players for each tour?
total_prize_money = df1.groupby(['TOUR']).agg(total_prize_money=('MONEY','sum')).reset_index()
print(total_prize_money)

# How many players are in this dataset for each tour?
players_each_tour = df1.groupby('TOUR').agg(number_of_players=('PLAYER NAME','count')).reset_index()
print("=====================================")
print(players_each_tour)

# How many events in total did players participate in for each tour?
events_played = df1.groupby(['TOUR']).agg(number_of_events=('EVENTS','sum')).reset_index()
print("=====================================")
print(events_played)

# How much do players win per event? What's the average of this for each tour?
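# Prize money per event for each player
df1['MONEY_EARN_EACH_EVENT'] = df1['MONEY'] / df1['EVENTS']
# Round every numeric column to whole numbers before averaging
df1 = df1.round(0)
# Mean of the per-player values for each tour (not total money / total events)
avg_money_per_event = df1.groupby(['TOUR']).agg(avg_money_per_event=('MONEY_EARN_EACH_EVENT','mean')).reset_index().round(0)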
print("=====================================")
print(avg_money_per_event)

# How do players rank by prize money for each tour? What about overall? 
# What is the average difference between where they are ranked within their tour 
# compared to the overall rankings where both tours are combined?

# Rank 1 = highest earner, within each tour and across both tours
df1['EARNINGS RANK BY TOUR'] = df1.groupby('TOUR')['MONEY'].rank(ascending=False)
df1['EARNINGS RANK OVERALL'] = df1['MONEY'].rank(ascending=False)
# Overall rank minus tour rank, so the difference is positive
df1['DIFFERENCE IN EARNINGS RANK'] = df1['EARNINGS RANK OVERALL'] - df1['EARNINGS RANK BY TOUR']

diff_earnings_rank = df1.groupby('TOUR').agg(avg_different_in_ranking=('DIFFERENCE IN EARNINGS RANK', 'mean')).reset_index()
print("=====================================")
print(diff_earnings_rank)

   TOUR  total_prize_money
0  LPGA           58410411
1   PGA          256726356
   TOUR  number_of_players
0  LPGA                100
1   PGA                100
   TOUR  number_of_events
0  LPGA              2266
1   PGA              2282
   TOUR  avg_money_per_event
0  LPGA              25525.0
1   PGA             120282.0
   TOUR  avg_different_in_ranking
0  LPGA                     96.13
1   PGA                      3.87
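
The first four questions group by the same key, so they could also be answered in a single `groupby` with named aggregations. The sketch below uses made-up rows with the same column names as the `OfficialMoney` sheet; `MONEY PER EVENT` and `summary` are names chosen here for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy rows with the same columns as the OfficialMoney sheet (values invented).
toy = pd.DataFrame({
    "PLAYER NAME": ["A", "B", "C", "D"],
    "MONEY": [1000, 600, 300, 100],
    "EVENTS": [10, 20, 10, 5],
    "TOUR": ["PGA", "PGA", "LPGA", "LPGA"],
})

# Per-player prize money per event.
toy["MONEY PER EVENT"] = toy["MONEY"] / toy["EVENTS"]

# One pass over the groups, one named output column per question.
summary = toy.groupby("TOUR").agg(
    total_prize_money=("MONEY", "sum"),
    number_of_players=("PLAYER NAME", "count"),
    number_of_events=("EVENTS", "sum"),
    avg_money_per_event=("MONEY PER EVENT", "mean"),
).reset_index()
print(summary)
```

Note that `avg_money_per_event` here is the mean of each player's own money-per-event figure. For the toy PGA rows that is (100 + 30) / 2 = 65, whereas total money divided by total events would be 1600 / 30, roughly 53.3, so the two readings of "average per event" can differ.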


In [83]:
# Combine the answers to these questions into one dataset
combine_df = pd.merge(left=total_prize_money, right=players_each_tour, on='TOUR')
combine_df = pd.merge(left=combine_df, right=events_played, on='TOUR')
combine_df = pd.merge(left=combine_df, right=avg_money_per_event, on='TOUR')
combine_df = pd.merge(left=combine_df, right=diff_earnings_rank, on='TOUR')

print(combine_df.head(5))
print(combine_df.info())

   TOUR  total_prize_money  number_of_players  number_of_events  \
0  LPGA           58410411                100              2266   
1   PGA          256726356                100              2282   

   avg_money_per_event  avg_different_in_ranking  
0              25525.0                     96.13  
1             120282.0                      3.87  
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   TOUR                      2 non-null      object 
 1   total_prize_money         2 non-null      int64  
 2   number_of_players         2 non-null      int64  
 3   number_of_events          2 non-null      int64  
 4   avg_money_per_event       2 non-null      float64
 5   avg_different_in_ranking  2 non-null      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 112.0+ bytes
None
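
The chain of `pd.merge` calls above repeats the same pattern, so one possible alternative is to fold a list of frames with `functools.reduce`. This is only a sketch: the three small frames stand in for the per-question results and their values are made up.

```python
from functools import reduce

import pandas as pd

# Stand-ins for the per-question result frames (values invented).
frames = [
    pd.DataFrame({"TOUR": ["LPGA", "PGA"], "total_prize_money": [400, 1600]}),
    pd.DataFrame({"TOUR": ["LPGA", "PGA"], "number_of_players": [2, 2]}),
    pd.DataFrame({"TOUR": ["LPGA", "PGA"], "number_of_events": [15, 30]}),
]

# Merge the frames pairwise, left to right, on the shared TOUR key.
combined = reduce(lambda left, right: pd.merge(left, right, on="TOUR"), frames)
print(combined)
```

With only five frames the explicit chain is perfectly readable; the fold mainly helps if more measures are added later.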


In [84]:
# Pivot the data so that we have a column for each tour, with each row representing an answer to the above questions
pivoted = combine_df.melt(
    id_vars='TOUR',
    value_vars=['total_prize_money', 'number_of_players', 'number_of_events',
                'avg_money_per_event', 'avg_different_in_ranking'],
    var_name='Measure',
    value_name='pivot_value',
)
print(pivoted)

pivoted = pivoted.pivot(index='Measure', columns='TOUR', values='pivot_value')
print("==========================================================================")
print(pivoted)
print(pivoted.info())


   TOUR                   Measure   pivot_value
0  LPGA         total_prize_money  5.841041e+07
1   PGA         total_prize_money  2.567264e+08
2  LPGA         number_of_players  1.000000e+02
3   PGA         number_of_players  1.000000e+02
4  LPGA          number_of_events  2.266000e+03
5   PGA          number_of_events  2.282000e+03
6  LPGA       avg_money_per_event  2.552500e+04
7   PGA       avg_money_per_event  1.202820e+05
8  LPGA  avg_different_in_ranking  9.613000e+01
9   PGA  avg_different_in_ranking  3.870000e+00
TOUR                             LPGA           PGA
Measure                                            
avg_different_in_ranking        96.13  3.870000e+00
avg_money_per_event          25525.00  1.202820e+05
number_of_events              2266.00  2.282000e+03
number_of_players              100.00  1.000000e+02
total_prize_money         58410411.00  2.567264e+08
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, avg_different_in_ranking to total_prize_money
Data c
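
Because the combined frame has exactly one row per tour, the melt-then-pivot step could also be sketched as a transpose. The frame below is a made-up stand-in for `combine_df`, and `reshaped` is a name chosen here for illustration.

```python
import pandas as pd

# Stand-in for the combined frame: one row per tour (values invented).
combined = pd.DataFrame({
    "TOUR": ["LPGA", "PGA"],
    "total_prize_money": [400, 1600],
    "number_of_players": [2, 2],
})

# Move TOUR into the index, then transpose so that each measure
# becomes a row and each tour becomes a column.
reshaped = combined.set_index("TOUR").T
reshaped.index.name = "Measure"
reshaped.columns.name = None
print(reshaped)
```

Unlike `pivot`, the transpose keeps the measures in their original column order rather than sorting them alphabetically.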

In [85]:
# Clean up the Measure field and create a new column showing the difference between the tours for each measure
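# Work on a copy so that `pivoted` itself is left unchanged
output = pivoted.copy()
# LPGA minus PGA, so most differences are negative
output['Difference between tours'] = output['LPGA'] - output['PGA']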
print(output.head(5))
print(output.info())

TOUR                             LPGA           PGA  Difference between tours
Measure                                                                      
avg_different_in_ranking        96.13  3.870000e+00              9.226000e+01
avg_money_per_event          25525.00  1.202820e+05             -9.475700e+04
number_of_events              2266.00  2.282000e+03             -1.600000e+01
number_of_players              100.00  1.000000e+02              0.000000e+00
total_prize_money         58410411.00  2.567264e+08             -1.983159e+08
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, avg_different_in_ranking to total_prize_money
Data columns (total 3 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   LPGA                      5 non-null      float64
 1   PGA                       5 non-null      float64
 2   Difference between tours  5 non-null      float64
dtypes: float64(3)
memory usage: 160.0+ bytes
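
At this point `Measure` is still the index and its values are the raw column names, whereas the required output has four fields with `Measure` as a regular column. A minimal sketch of the remaining clean-up follows. The stand-in frame uses made-up values, and the display labels and the file name `output.csv` are assumptions, since the exact wording expected in the official output is not shown here.

```python
import pandas as pd

# Stand-in for the pivoted frame (values invented).
pivoted = pd.DataFrame(
    {"LPGA": [400.0, 2.0], "PGA": [1600.0, 2.0]},
    index=pd.Index(["total_prize_money", "number_of_players"], name="Measure"),
)

# Hypothetical display labels for the Measure field.
labels = {
    "total_prize_money": "Total Prize Money",
    "number_of_players": "Number of Players",
}

# Relabel the measures and turn the index into a regular column.
output = pivoted.rename(index=labels).reset_index()

# LPGA minus PGA, as the requirements ask.
output["Difference between tours"] = output["LPGA"] - output["PGA"]

# Put the four fields in the order listed in the Output section.
output = output[["Measure", "PGA", "LPGA", "Difference between tours"]]
print(output)

# Writing the result out (file name is an assumption):
# output.to_csv("output.csv", index=False)
```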