# PGA Tour Web Scraping
### *Data is obtained using PGATour.com's GraphQL API*


My first attempt at scraping pgatour.com used traditional HTML parsing with BeautifulSoup, and it hit a roadblock: the site serves its data through GraphQL, a query language for APIs, so parsing the page HTML does not yield the statistics. I therefore learned how to write GraphQL queries and now request the specific data I need directly from the API that pgatour.com itself uses. This approach retrieves the data reliably and lets me move on to data analysis and exploration.


### Tasks
This Jupyter Notebook retrieves PGA Tour data through the GraphQL API and prepares it for analysis. The code performs the following tasks:

1. Imports the necessary packages and sets up the API endpoint.
2. Sends a GraphQL introspection query to retrieve the schema.
3. Saves the GraphQL schema as a JSON file.
4. Defines functions to retrieve player statistics, player information, and tournament winners.
5. Retrieves player statistics and information for multiple years.
6. Merges the data into a single DataFrame.
7. Performs data cleaning and type conversion.
8. Saves the final DataFrame as a CSV file.

### Code Explanation

The code is divided into several sections:

1. Importing Packages: Importing the required packages, including `requests`, `json`, and `pandas`.

2. API Configuration: Defining the API endpoint URL and the `X_API_KEY` token.

3. GraphQL Introspection Query: Defining the GraphQL introspection query to retrieve the schema.

4. Retrieving the Schema: Sending the introspection query to the server and saving the schema as a JSON file.

5. Function Definitions: Defining helper functions to retrieve player statistics, player information, and tournament winners.

6. Data Retrieval and Processing: Looping through the desired years and retrieving player statistics, player information, and tournament winners. Merging the data into a single DataFrame.

7. Data Cleaning and Type Conversion: Performing data cleaning and converting columns to their appropriate data types.

8. Saving the Data: Saving the final DataFrame as a CSV file.




In [10]:
##### Import all required packages #####
import requests
#import graphql
import json
from numpy import nan as NaN  # NumPy 2.0 removed the `NaN` alias; `nan` works on all versions
import pandas as pd


In [2]:
# The request headers must include a token ('x-api-key') that appears to be constant.
# It was found by inspecting the site's requests in the browser devtools.
X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"

url = "https://orchestrator.pgatour.com/graphql"

# GraphQL introspection query
introspection_query = """
    query IntrospectionQuery {
        __schema {
            queryType {
                name
            }
            mutationType {
                name
            }
            subscriptionType {
                name
            }
            types {
                ...FullType
            }
            directives {
                name
                description
                locations
                args {
                    ...InputValue
                }
            }
        }
    }

    fragment FullType on __Type {
        kind
        name
        description
        fields(includeDeprecated: true) {
            name
            description
            args {
                ...InputValue
            }
            type {
                ...TypeRef
            }
            isDeprecated
            deprecationReason
        }
        inputFields {
            ...InputValue
        }
        interfaces {
            ...TypeRef
        }
        enumValues(includeDeprecated: true) {
            name
            description
            isDeprecated
            deprecationReason
        }
        possibleTypes {
            ...TypeRef
        }
    }

    fragment InputValue on __InputValue {
        name
        description
        type {
            ...TypeRef
        }
        defaultValue
    }

    fragment TypeRef on __Type {
        kind
        name
        ofType {
            kind
            name
            ofType {
                kind
                name
                ofType {
                    kind
                    name
                    ofType {
                        kind
                        name
                        ofType {
                            kind
                            name
                            ofType {
                                kind
                                name
                            }
                        }
                    }
                }
            }
        }
    }
"""

# Send the introspection query to the server
response = requests.post(
    url, json={"query": introspection_query}, headers={"x-api-key": X_API_KEY}
)

# Parse the response (the GraphQL schema) so it can be saved to disk
json_schema = response.json()

with open("graphql_schema.json", "w") as f:
    # Use the `json.dump()` method to write the data to the file
    json.dump(json_schema, f, indent=4)
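
The saved schema is large, so it helps to list only the fields available on the root query type before writing queries such as `StatDetails`. The sketch below shows one way to do that. The function name `list_query_fields` and the tiny `example_schema` are hypothetical stand-ins, not real API output; the only assumption about the real file is that it has the standard `data.__schema` introspection layout produced by the query above.

```python
def list_query_fields(schema_json):
    """Return the sorted field names of the root query type in an introspection result."""
    schema = schema_json["data"]["__schema"]
    query_type_name = schema["queryType"]["name"]
    for type_def in schema["types"]:
        if type_def["name"] == query_type_name:
            # "fields" is null for scalar and enum types, hence the fallback to []
            return sorted(field["name"] for field in (type_def["fields"] or []))
    return []


# Hand-written miniature schema for illustration only (not real API output)
example_schema = {
    "data": {
        "__schema": {
            "queryType": {"name": "Query"},
            "types": [
                {
                    "name": "Query",
                    "fields": [{"name": "statDetails"}, {"name": "playerDirectory"}],
                },
                {"name": "TourCode", "fields": None},
            ],
        }
    }
}

query_fields = list_query_fields(example_schema)
print(query_fields)
```

Against the real file, the same function could be called on the dictionary returned by `json.load` for `graphql_schema.json`.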

In [3]:
stats = {
    "120": "ScoringAvg",
    "156": "BirdieAvg",
    "101": "DrivingDistance",
    "129": "TotalDriving",
    "103": "GIR%",
    "130": "Scrambling",
    "02675": "SG_Total",
    "02568": "SG_ApproachGreen",
    "02564": "SG_Putting",
    "02567": "SG_OffTee",
    "02569": "SG_AroundGreen",
    "02674": "SG_TeeToGreen",
    "02394": "FedExCupPoints",
    "138": "Top10Finishes",
    "186": "WorldRank",
    "109": "Money"
}

In [4]:
def get_df(YEAR, STAT_ID, DESCR):
    X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"
    req = {
        "operationName": "StatDetails",
        "variables": {
            "tourCode": "R",
            "statId": STAT_ID,
            "year": YEAR,
            "eventQuery": None,
        },
        "query": """query StatDetails($tourCode: TourCode!, $statId: String!, $year: Int, $eventQuery: StatDetailEventQuery) {
  statDetails(
    tourCode: $tourCode
    statId: $statId
    year: $year
    eventQuery: $eventQuery
  ) {
    tourCode
    year
    displaySeason
    statId
    statType
    tournamentPills {
      tournamentId
      displayName
    }
    yearPills {
      year
      displaySeason
    }
    statTitle
    statDescription
    tourAvg
    lastProcessed
    statHeaders
    statCategories {
      category
      displayName
      subCategories {
        displayName
        stats {
          statId
          statTitle
        }
      }
    }
    rows {
      ... on StatDetailsPlayer {
        __typename
        playerId
        playerName
        country
        countryFlag
        rank
        rankDiff
        rankChangeTendency
        stats {
          statName
          statValue
          color
        }
      }
      ... on StatDetailTourAvg {
        __typename
        displayName
        value
      }
    }
    sponsorLogo
    }
    } 
    """,
    }

    # post the request
    page = requests.post(
        url,
        json=req,
        headers={"x-api-key": X_API_KEY},
    )
    # get the data
    data = page.json()["data"]["statDetails"]["rows"]

    # keep only player rows: the response also contains a row with
    # "__typename": "StatDetailTourAvg" (the tour average), which is dropped
    data = filter(lambda item: item.get("__typename") == "StatDetailsPlayer", data)

    # shape each row like the table shown on the web page
    table = map(
        lambda item: {
            # "RANK": item["rank"],
            "PLAYER": item["playerName"],
            DESCR: item["stats"][0]["statValue"],
        },
        data,
    )

    # convert to a DataFrame
    s = pd.DataFrame(table)

    return s

def get_players():
    url = "https://orchestrator.pgatour.com/graphql"

    X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"
    req = {
    "operationName": "PlayerDirectory",
    "variables": {
        "tourCode": "R"
    },
    "query": """query PlayerDirectory($tourCode: TourCode!, $active: Boolean) {
      playerDirectory(tourCode: $tourCode, active: $active) {
        tourCode
        players {
          id
          isActive
          firstName
          lastName
          shortName
          displayName
          alphaSort
          country
          countryFlag
          headshot
          playerBio {
            id
            age
            education
            turnedPro
          }
        }
      }
    }
    """,
    }

    # post the request
    page = requests.post(
        url,
        json=req,
        headers={"x-api-key": X_API_KEY},
    )
    # get the list of players
    data = page.json()["data"]["playerDirectory"]["players"]

    # keep only the display name and country code of each player
    table = map(
        lambda item: {
            "PLAYER": item["displayName"],
            "Country": item["countryFlag"],
        },
        data,
    )

    # convert to a DataFrame
    s = pd.DataFrame(table)

    return s

def get_wins(YEAR):
    X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"
    req = {
    "operationName": "TourCupSplit",
    "variables": {
        "id": "fedex",
        "tourCode": "R",
        "year": YEAR
    },
    "query": """query TourCupSplit($tourCode: TourCode!, $id: String, $year: Int, $eventQuery: StatDetailEventQuery) {
      tourCupSplit(tourCode: $tourCode, id: $id, year: $year, eventQuery: $eventQuery) {
        id
        title
        projectedTitle
        projectedLive
        season
        description
        detailCopy
        logo
        options
        fixedHeaders
        columnHeaders
        rankingsHeader
        message
        projectedPlayers {
          ...Player
          ...InfoRow
        }
        officialPlayers {
          ...Player
          ...InfoRow
        }
        tournamentPills {
          tournamentId
          displayName
        }
        yearPills {
          year
          displaySeason
        }
        winner {
          id
          rank
          firstName
          lastName
          displayName
          shortName
          countryFlag
          country
          earnings
          totals {
            label
            value
          }
        }
      }
    }

    fragment Player on TourCupCombinedPlayer {
      __typename
      id
      firstName
      lastName
      displayName
      shortName
      countryFlag
      country
      rankingData {
        projected
        official
        event
        movement
        movementAmount
        logo
        logoDark
      }
      pointData {
        projected
        official
        event
        movement
        movementAmount
        logo
        logoDark
      }
      projectedSort
      officialSort
      thisWeekRank
      previousWeekRank
      columnData
    }

    fragment InfoRow on TourCupCombinedInfo {
      logo
      logoDark
      text
      sortValue
    }
    """,
    }

    # post the request
    page = requests.post(
        url,
        json=req,
        headers={"x-api-key": X_API_KEY},
    )
    # get the official standings
    data = page.json()["data"]["tourCupSplit"]["officialPlayers"]

    # keep the player name and the wins value (third entry of columnData)
    table = map(
        lambda item: {
            "PLAYER": item["displayName"],
            "#Wins": item["columnData"][2],
        },
        data,
    )

    # convert to a DataFrame
    s = pd.DataFrame(table)

    return s
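
The three functions above repeat the same request logic and index into the response without checking that the call succeeded. As a possible refactoring, the sketch below wraps the POST in one helper that raises on HTTP errors and on a GraphQL `errors` entry. The names `run_graphql`, `FakeResponse`, and `fake_post` are hypothetical, and the demonstration uses a stand-in for `requests.post` with a placeholder URL, so no network call is made.

```python
def run_graphql(url, api_key, payload, post=None, timeout=30):
    """POST a GraphQL payload and return its "data" section, raising on failure."""
    if post is None:
        import requests  # imported lazily so the helper can be exercised offline

        post = requests.post
    response = post(url, json=payload, headers={"x-api-key": api_key}, timeout=timeout)
    response.raise_for_status()  # HTTP-level failures (4xx/5xx)
    body = response.json()
    if body.get("errors"):  # GraphQL-level failures are reported in the JSON body
        messages = "; ".join(err.get("message", "unknown error") for err in body["errors"])
        raise RuntimeError(f"GraphQL error: {messages}")
    return body["data"]


# Offline demonstration with a stand-in for requests.post
class FakeResponse:
    def __init__(self, body):
        self._body = body

    def raise_for_status(self):
        pass

    def json(self):
        return self._body


def fake_post(url, json, headers, timeout):
    return FakeResponse({"data": {"statDetails": {"rows": []}}})


data = run_graphql(
    "https://example.invalid/graphql",  # placeholder URL, not the real endpoint
    "hypothetical-key",
    {"query": "{ __typename }"},
    post=fake_post,
)
print(data)
```

With such a helper, the data line in `get_df` could become `run_graphql(url, X_API_KEY, req)["statDetails"]["rows"]`, and a failed request would raise a readable error instead of a `KeyError` or `TypeError`.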

In [5]:
years = [2017, 2018, 2019, 2020, 2021, 2022]
dfs = []

players = get_players()
for year in years:
    df = pd.DataFrame()
    for key in stats:
        if len(df) == 0:
            df = get_df(year, key, stats[key])
        else:
            curr = get_df(year, key, stats[key])
            df = pd.merge(df, curr, on="PLAYER")
    df['Year'] = year  # Add a 'Year' column with the current year
    wins = get_wins(year)

    df = pd.merge(df, players, on="PLAYER")
    df = pd.merge(df, wins, on="PLAYER")

    dfs.append(df)

combined_df = pd.concat(dfs, ignore_index=True)

combined_df.head()

Unnamed: 0,PLAYER,ScoringAvg,BirdieAvg,DrivingDistance,TotalDriving,GIR%,Scrambling,SG_Total,SG_ApproachGreen,SG_Putting,SG_OffTee,SG_AroundGreen,SG_TeeToGreen,FedExCupPoints,Top10Finishes,WorldRank,Money,Year,Country,#Wins
0,Jordan Spieth,68.846,4.49,295.6,176,70.01,61.76,1.924,0.896,0.278,0.321,0.429,1.646,2671,12,10.54,"$9,433,033",2017,USA,3
1,Rickie Fowler,69.083,4.28,300.3,93,67.03,63.86,1.987,0.408,0.852,0.378,0.349,1.134,1832,10,6.59,"$6,083,197",2017,USA,1
2,Justin Thomas,69.359,4.48,309.7,170,67.16,60.39,1.811,0.738,0.332,0.452,0.289,1.479,2689,12,8.46,"$9,921,560",2017,USA,5
3,Marc Leishman,69.468,3.81,298.6,179,67.04,61.42,1.422,0.41,0.356,0.319,0.337,1.066,1324,7,5.14,"$5,866,390",2017,AUS,2
4,Paul Casey,69.469,3.87,297.5,107,70.06,64.12,1.543,0.939,0.133,0.349,0.122,1.41,1135,9,5.42,"$3,906,974",2017,ENG,-
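
One consequence of the loop above is worth spelling out: `pd.merge` defaults to `how="inner"`, so a player who is missing from any one stat table, from the player directory, or from the wins table is dropped for that year. The sketch below reproduces this on made-up frames and compares it with an outer join. The toy values and the use of `functools.reduce` are illustrative assumptions, not part of the notebook.

```python
from functools import reduce

import pandas as pd

# Toy frames standing in for three get_df() results (values are made up)
frames = [
    pd.DataFrame({"PLAYER": ["A", "B", "C"], "ScoringAvg": ["69.1", "69.5", "70.2"]}),
    pd.DataFrame({"PLAYER": ["A", "B"], "BirdieAvg": ["4.1", "3.9"]}),
    pd.DataFrame({"PLAYER": ["A", "C"], "DrivingDistance": ["301.2", "295.0"]}),
]

# Inner joins keep only players present in every frame
inner = reduce(lambda left, right: pd.merge(left, right, on="PLAYER", how="inner"), frames)

# Outer joins keep every player and leave NaN where a stat is missing
outer = reduce(lambda left, right: pd.merge(left, right, on="PLAYER", how="outer"), frames)

print(list(inner["PLAYER"]))
print(len(outer))
```

Comparing the row counts of the two results is a quick way to see how many players the inner joins discard. Switching to outer joins would keep them, at the cost of handling missing values before the `astype` conversions.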


In [6]:
combined_df["#Wins"] = combined_df["#Wins"].replace("-",0)
combined_df.head()

Unnamed: 0,PLAYER,ScoringAvg,BirdieAvg,DrivingDistance,TotalDriving,GIR%,Scrambling,SG_Total,SG_ApproachGreen,SG_Putting,SG_OffTee,SG_AroundGreen,SG_TeeToGreen,FedExCupPoints,Top10Finishes,WorldRank,Money,Year,Country,#Wins
0,Jordan Spieth,68.846,4.49,295.6,176,70.01,61.76,1.924,0.896,0.278,0.321,0.429,1.646,2671,12,10.54,"$9,433,033",2017,USA,3
1,Rickie Fowler,69.083,4.28,300.3,93,67.03,63.86,1.987,0.408,0.852,0.378,0.349,1.134,1832,10,6.59,"$6,083,197",2017,USA,1
2,Justin Thomas,69.359,4.48,309.7,170,67.16,60.39,1.811,0.738,0.332,0.452,0.289,1.479,2689,12,8.46,"$9,921,560",2017,USA,5
3,Marc Leishman,69.468,3.81,298.6,179,67.04,61.42,1.422,0.41,0.356,0.319,0.337,1.066,1324,7,5.14,"$5,866,390",2017,AUS,2
4,Paul Casey,69.469,3.87,297.5,107,70.06,64.12,1.543,0.939,0.133,0.349,0.122,1.41,1135,9,5.42,"$3,906,974",2017,ENG,0


In [7]:
combined_df['PLAYER'] = combined_df['PLAYER'].astype(object)
combined_df['ScoringAvg'] = combined_df['ScoringAvg'].astype(float)
combined_df['BirdieAvg'] = combined_df['BirdieAvg'].astype(float)
combined_df['DrivingDistance'] = combined_df['DrivingDistance'].astype(float)
combined_df['TotalDriving'] = combined_df['TotalDriving'].astype(float)
combined_df['GIR%'] = combined_df['GIR%'].astype(float)
combined_df['Scrambling'] = combined_df['Scrambling'].astype(float)
combined_df['SG_Total'] = combined_df['SG_Total'].astype(float)
combined_df['SG_ApproachGreen'] = combined_df['SG_ApproachGreen'].astype(float)
combined_df['SG_Putting'] = combined_df['SG_Putting'].astype(float)
combined_df['SG_OffTee'] = combined_df['SG_OffTee'].astype(float)
combined_df['SG_AroundGreen'] = combined_df['SG_AroundGreen'].astype(float)
combined_df['SG_TeeToGreen'] = combined_df['SG_TeeToGreen'].astype(float)
combined_df['FedExCupPoints'] = combined_df['FedExCupPoints'].str.replace(',', '',regex=True).astype(int)
combined_df['Top10Finishes'] = combined_df['Top10Finishes'].astype(int)
combined_df['WorldRank'] = combined_df['WorldRank'].astype(float)
combined_df['Money'] = combined_df['Money'].str.replace('[$,]', '',regex=True).astype(int)
combined_df['#Wins'] = combined_df['#Wins'].astype(int)

combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 845 entries, 0 to 844
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   PLAYER            845 non-null    object 
 1   ScoringAvg        845 non-null    float64
 2   BirdieAvg         845 non-null    float64
 3   DrivingDistance   845 non-null    float64
 4   TotalDriving      845 non-null    float64
 5   GIR%              845 non-null    float64
 6   Scrambling        845 non-null    float64
 7   SG_Total          845 non-null    float64
 8   SG_ApproachGreen  845 non-null    float64
 9   SG_Putting        845 non-null    float64
 10  SG_OffTee         845 non-null    float64
 11  SG_AroundGreen    845 non-null    float64
 12  SG_TeeToGreen     845 non-null    float64
 13  FedExCupPoints    845 non-null    int64  
 14  Top10Finishes     845 non-null    int64  
 15  WorldRank         845 non-null    float64
 16  Money             845 non-null    int64  
 1
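
The column-by-column conversions in this cell could also be written as one reusable helper. The following is a minimal sketch under stated assumptions: `to_number` is a hypothetical name, the sample strings imitate the raw API values (the comma in the points column is assumed from the cleaning code), and `pd.to_numeric` picks the integer or float dtype rather than an explicit `astype`.

```python
import pandas as pd


def to_number(series):
    """Strip "$" and "," from strings, treat a lone "-" as 0, and convert to numbers."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    # Series.replace matches whole values, so negative numbers such as "-0.5" are kept
    cleaned = cleaned.replace("-", "0")
    return pd.to_numeric(cleaned)


# Sample strings imitating raw values before cleaning
raw = pd.DataFrame(
    {
        "Money": ["$9,433,033", "$3,906,974"],
        "FedExCupPoints": ["2,671", "1,135"],
        "#Wins": ["3", "-"],
        "ScoringAvg": ["68.846", "69.469"],
    }
)

clean = raw.apply(to_number)
print(clean.dtypes)
```

Applied to the real data, the helper would be used only on the numeric columns, for example `combined_df[numeric_cols].apply(to_number)`, where `numeric_cols` is a list that excludes `PLAYER` and `Country`.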

In [8]:
order = ['PLAYER','Country','Year','ScoringAvg','BirdieAvg','DrivingDistance','TotalDriving',
         'GIR%','Scrambling','SG_Total','SG_ApproachGreen','SG_Putting','SG_OffTee','SG_AroundGreen',
         'SG_TeeToGreen','Top10Finishes','#Wins','FedExCupPoints','WorldRank','Money']
combined_df = combined_df[order]
combined_df.head()

Unnamed: 0,PLAYER,Country,Year,ScoringAvg,BirdieAvg,DrivingDistance,TotalDriving,GIR%,Scrambling,SG_Total,SG_ApproachGreen,SG_Putting,SG_OffTee,SG_AroundGreen,SG_TeeToGreen,Top10Finishes,#Wins,FedExCupPoints,WorldRank,Money
0,Jordan Spieth,USA,2017,68.846,4.49,295.6,176.0,70.01,61.76,1.924,0.896,0.278,0.321,0.429,1.646,12,3,2671,10.54,9433033
1,Rickie Fowler,USA,2017,69.083,4.28,300.3,93.0,67.03,63.86,1.987,0.408,0.852,0.378,0.349,1.134,10,1,1832,6.59,6083197
2,Justin Thomas,USA,2017,69.359,4.48,309.7,170.0,67.16,60.39,1.811,0.738,0.332,0.452,0.289,1.479,12,5,2689,8.46,9921560
3,Marc Leishman,AUS,2017,69.468,3.81,298.6,179.0,67.04,61.42,1.422,0.41,0.356,0.319,0.337,1.066,7,2,1324,5.14,5866390
4,Paul Casey,ENG,2017,69.469,3.87,297.5,107.0,70.06,64.12,1.543,0.939,0.133,0.349,0.122,1.41,9,0,1135,5.42,3906974


In [9]:
path = 'pga_full.csv'

combined_df.to_csv(path,index=False)