# PGAtour.com Web Scraper

This notebook presents the web scraper I wrote to collect PGA Tour player statistics for 2010-2017 from PGAtour.com.

In [40]:
#Imports
import requests # Request module
import pandas as pd # Data Wrangling
import numpy as np # Data Wrangling
from bs4 import BeautifulSoup #Web scraping module

PGA Tour statistical data is spread across separate pages of the pgatour.com/stats website.

My approach to scraping this data from these separate web pages was to:
<ol>
<li>Create a dataframe for each statistic page. Each dataframe includes the players and their stats.</li>
<li>Keep only the columns that I need from that page.</li>
<li>Merge the per-page dataframes on player name.</li>
<li>Repeat steps 1-3 for each year from 2010 to 2017.</li>
</ol>

The implementation of this strategy consists of the following steps:
<ol>
<li>Pull the column headers from a stats page.</li>
<li>Pull the players from that stats page.</li>
<li>Pull the statistics from that page.</li>
<li>Create a dictionary that stores each player's data for that stats page.</li>
<li>Use the functions from steps 1-4 to create a pandas dataframe for that particular statistic.</li>
<li>Loop through the years 2010-2017 and combine the results into a single dataframe.</li>
<li>Store the results in a sqlite3 database for future use.</li>
</ol>

### 1. Pull column headers from page

In [10]:
def get_headers(soup):
    '''This function gets the column names to use for the data frame.'''
    headers = []
    
    #Get rounds header
    rounds = soup.find_all(class_="rounds hidden-small hidden-medium")[0].get_text()
    headers.append(rounds)
    
    #Get other headers
    stat_headers = soup.find_all(class_="col-stat hidden-small hidden-medium")
    for header in stat_headers:
        headers.append(header.get_text())
    
    return headers
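
The `class_` lookups in `get_headers` pass several class names as one string. As a minimal sketch of how BeautifulSoup treats such a string, the example below runs the same kind of search on a small made-up HTML fragment. The markup is an assumption for illustration only and is not copied from pgatour.com.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption, not the real pgatour.com page).
html = """
<table>
  <tr>
    <th class="rounds hidden-small hidden-medium">ROUNDS</th>
    <th class="col-stat hidden-small hidden-medium">AVG</th>
    <th class="col-stat hidden-small hidden-medium">TOTAL</th>
  </tr>
  <tr>
    <td class="hidden-small hidden-medium">94</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# A multi-class string matches the exact value of the class attribute.
exact = soup.find_all(class_="col-stat hidden-small hidden-medium")
header_texts = [tag.get_text() for tag in exact]

# The same classes in a different order do not match.
reordered = soup.find_all(class_="hidden-small col-stat hidden-medium")

# A single class name matches every tag that carries that class.
single = soup.find_all(class_="hidden-small")

print(header_texts, len(reordered), len(single))
```

Because the match is on the exact attribute string, the scraper depends on the site keeping both the class names and their order unchanged.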

### 2. Pull players from page

In [11]:
#Get Players
def get_players(soup):
    '''This function takes the beautiful soup created and uses it to gather player names from the specified stats page.'''
    
    player_list = []
    
    #Get player as html tags
    players = soup.select('td a')[1:] #Skip the first link because the first line of every table is not useful.
    #Loop through list
    for player in players:
        player_list.append(player.get_text())
    
    return player_list
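
To illustrate the `'td a'` selector used in `get_players`, here is a small sketch on made-up HTML. The fragment, the link targets, and the player names are placeholders, not real page content.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption, not the real pgatour.com page).
html = """
<table>
  <tr><td><a href="#">Not a player</a></td></tr>
  <tr><td><a href="/p1">Player A</a></td></tr>
  <tr><td><a href="/p2">Player B</a></td></tr>
</table>
<p><a href="/other">Link outside the table</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# 'td a' is a CSS descendant selector: every <a> nested inside a <td>.
links = soup.select("td a")

# Drop the first match, mirroring the [1:] slice in get_players.
names = [link.get_text() for link in links[1:]]

print(names)
```

The link inside the `<p>` tag is ignored because it does not sit inside a table cell.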

### 3. Pull statistics from page

In [12]:
##Get Stats
def get_stats(soup, categories):
    '''This function takes the soup created before and the number of stat columns (categories) per player, and returns one list of stats per player.'''
    
    #Finds all tags with class specified and puts into a list
    stats = soup.find_all(class_="hidden-small hidden-medium")
    
    #Initialize stats list
    stat_list = []
    
    #Loop through 
    for i in range(0, len(stats)-categories+1, categories):
        temp_list = []
        for j in range(categories):
            temp_list.append(stats[i + j].get_text())
        stat_list.append(temp_list)
            
    return stat_list
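
The grouping step in `get_stats` can be looked at in isolation. The sketch below applies the same index arithmetic to a plain list of strings. `chunk_rows` is a hypothetical helper name used only for this illustration.

```python
def chunk_rows(cells, categories):
    """Group a flat list into consecutive rows of `categories` items.

    Hypothetical helper mirroring the loop in get_stats. A trailing
    group with fewer than `categories` items is dropped.
    """
    rows = []
    for start in range(0, len(cells) - categories + 1, categories):
        rows.append(cells[start:start + categories])
    return rows


# Six cells with three categories per player give two complete rows.
full = chunk_rows(["94", "70.9", "6674", "70", "70.4", "4933"], 3)

# Five cells with two categories: the leftover fifth cell is discarded.
partial = chunk_rows(["a", "b", "c", "d", "e"], 2)

print(full)
print(partial)
```

If `categories` does not match the number of stat cells per player on a page, the rows shift out of alignment without raising an error, so the value has to be checked against each stats page.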

### 4. Create data dictionary for page

In [13]:
def stats_dict(players, stats):
        '''This function takes two lists, players and stats, 
        and creates a dictionary with the player being the key 
        and the stats as the values (as a list)'''
    
        #initialize player dictionary
        player_dict = {}
    
        #Loop through player list
        for i, player in enumerate(players):
            player_dict[player] = stats[i]
    
        return player_dict
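
As an alternative sketch, the same dictionary can be built with `zip`. `stats_dict_zip` is a hypothetical name, and the function is not a drop-in replacement: when the two lists differ in length, `zip` stops at the shorter one, whereas the indexed loop above raises an `IndexError` if `stats` is shorter than `players`.

```python
def stats_dict_zip(players, stats):
    """Pair each player with the stats row at the same position.

    Hypothetical alternative to stats_dict. zip() stops at the shorter
    input, so a length mismatch is truncated rather than reported.
    """
    return dict(zip(players, stats))


paired = stats_dict_zip(["Player A", "Player B"], [["94", "70.9"], ["70", "70.4"]])

# Three players but only two stats rows: the third player is left out.
truncated = stats_dict_zip(["Player A", "Player B", "Player C"], [["94"], ["70"]])

print(paired)
print(truncated)
```

In both versions a repeated player name overwrites the earlier entry, because dictionary keys are unique.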

### 5. Use functions 1-4 to create dataframe for statistic. "make_dataframe"

In [14]:
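##Mega function
def make_dataframe(url, categories):
    '''This function downloads a stats page and returns a dataframe
    with one row per player and one column per statistic.'''
        
    ##Create soup object from url.
    response = requests.get(url)
    response.raise_for_status() #Stop early on HTTP errors (e.g. 404)
    text = response.text
    soup = BeautifulSoup(text, 'lxml')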
    
    #1. Get Headers
    headers = get_headers(soup)
    
    #2. Get Players
    players = get_players(soup)
    
    #3. Get Stats
    stats = get_stats(soup, categories)
    
    #4. Make stats dictionary.
    stats_dictionary = stats_dict(players, stats)
    
    #Make dataframe
    frame = pd.DataFrame(stats_dictionary, index = headers).T
    
    #Reset index
    frame = frame.reset_index()
    
    #For each Dataframe, change index column to 'NAME'
    frame = frame.rename(index = str, columns = {'index': 'NAME'})
    return frame
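
The dataframe construction at the end of `make_dataframe` can be tried without any network access. This sketch uses a small hand-written dictionary with placeholder player names and values to show why the transpose and the rename are needed.

```python
import pandas as pd

# Placeholder data standing in for the output of get_headers and stats_dict.
headers = ["ROUNDS", "AVG"]
stats_dictionary = {
    "Player A": ["94", "70.995"],
    "Player B": ["70", "70.468"],
}

# Dictionary keys become columns and headers become the row index...
wide = pd.DataFrame(stats_dictionary, index=headers)

# ...so transposing gives one row per player and one column per header.
frame = wide.T

# reset_index moves the player names into a column called 'index',
# which is then renamed to 'NAME'.
frame = frame.reset_index().rename(columns={"index": "NAME"})

print(frame)
```

All values remain strings at this point, since they come straight from the page text.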

### 6. Loop through years 2010-2017 to build one combined dataframe
All of the data cleaning and preprocessing happens in the next couple of code blocks.

In [15]:
years = [str(i) for i in range(2010, 2018)]

In [34]:
for year in years:
    print(year)
    #Fedex cup points
    fcp = make_dataframe("https://www.pgatour.com/stats/stat.02671.{}.html".format(year), 6)[['NAME', 'POINTS']]
    #Top 10's and wins
    top10 = make_dataframe("https://www.pgatour.com/stats/stat.138.{}.html".format(year), 5)[['NAME', 'TOP 10', '1ST']]

    #Scoring statistics, keep rounds from this page as it most accurately reflects total rounds player completed in season.
    scoring = make_dataframe("https://www.pgatour.com/stats/stat.120.{}.html".format(year), 5)[['NAME', 'ROUNDS', 'AVG']]
    scoring = scoring.rename(columns={'AVG':'SCORING'})

    #Driving Distance
    drivedistance = make_dataframe("https://www.pgatour.com/stats/stat.101.{}.html".format(year), 4)[['NAME', 'AVG.']]
    #Rename Columns
    drivedistance = drivedistance.rename(columns = {'AVG.':'DRIVE_DISTANCE'})

    #Driving Accuracy
    driveacc = make_dataframe("https://www.pgatour.com/stats/stat.102.{}.html".format(year), 4)[['NAME', '%']]
    #Change column name from % to FWY_%
    driveacc = driveacc.rename(columns = {'%': "FWY_%"})

    #Greens in Regulation.
    gir = make_dataframe("https://www.pgatour.com/stats/stat.103.{}.html".format(year), 5)[['NAME', '%']]
    #Change column name from % to GIR_%
    gir = gir.rename(columns = {'%': "GIR_%"})

    #Strokes gained putting
    sg_putting = make_dataframe("https://www.pgatour.com/stats/stat.02564.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    #Change name of average column
    sg_putting = sg_putting.rename(columns = {'AVERAGE': 'SG_P'})

    #Strokes gained tee to green
    sg_teetogreen = make_dataframe("https://www.pgatour.com/stats/stat.02674.{}.html".format(year), 6)[['NAME', 'AVERAGE']]
    #Change name of average column
    sg_teetogreen = sg_teetogreen.rename(columns = {'AVERAGE' : 'SG_TTG'})

    #sg total
    sg_total = make_dataframe("https://www.pgatour.com/stats/stat.02675.{}.html".format(year), 6)[['NAME', 'AVERAGE']]
    sg_total = sg_total.rename(columns = {'AVERAGE':'SG_T'})
    
    #Get Dataframes into list.
    data_frames = [drivedistance, driveacc, gir, sg_putting, sg_teetogreen, sg_total]
    
    #Merge all Dataframes together (inner join on player name)
    df_one = scoring
    for df in data_frames:
        df_one = pd.merge(df_one, df, on='NAME')

    #Merge FedEx Cup points
    df_one = pd.merge(df_one, fcp, how='outer', on='NAME')
    #Merge top 10's
    df_one = pd.merge(df_one, top10, how='outer', on='NAME')
    
    #Only keep players whose scoring average isn't null.
    #.copy() avoids a SettingWithCopyWarning when the Year column is added.
    df_one = df_one.loc[df_one['SCORING'].notnull()].copy()
    
    #Add year column
    df_one['Year'] = year
    
    #Concat dataframe to overall dataframe
    if year == '2010':
        #First year: start the overall dataframe
        df_total = df_one
    else:
        df_total = pd.concat([df_total, df_one], axis=0)
    

2010
2011
2012
2013
2014
2015
2016
2017
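
The loop above mixes inner and outer merges and then filters on `SCORING`. The sketch below reproduces that sequence on three tiny hand-made frames. The player names and values are placeholders, not scraped data.

```python
import pandas as pd

# Placeholder frames standing in for three of the scraped stats pages.
scoring = pd.DataFrame({"NAME": ["Player A", "Player B"],
                        "SCORING": ["70.1", "71.2"]})
distance = pd.DataFrame({"NAME": ["Player A", "Player B", "Player C"],
                         "DRIVE_DISTANCE": ["298.9", "277.4", "301.0"]})
points = pd.DataFrame({"NAME": ["Player A", "Player D"],
                       "POINTS": ["559", "640"]})

# Default (inner) merge: only players present in both frames survive.
merged = pd.merge(scoring, distance, on="NAME")

# Outer merge: keeps every player from either side, filling gaps with NaN.
merged = pd.merge(merged, points, how="outer", on="NAME")

# Player D has points but no scoring average, so the filter drops that row.
merged = merged.loc[merged["SCORING"].notnull()].copy()
merged["Year"] = "2010"

print(merged)
```

Player B stays in the result with a missing `POINTS` value. This mirrors the missing `1ST` entries for some players in the `df_total.head()` output.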


In [37]:
df_total.head()

| | NAME | ROUNDS | SCORING | DRIVE_DISTANCE | FWY_% | GIR_% | SG_P | SG_TTG | SG_T | POINTS | TOP 10 | 1ST | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Baddeley | 94 | 70.995 | 298.9 | 56.65 | 64.6 | 0.509 | -0.294 | 0.208 | 559 | 2 | | 2010 |
| 1 | Adam Scott | 70 | 70.468 | 294.4 | 62.93 | 69.61 | -0.746 | 1.609 | 0.862 | 640 | 4 | 1.0 | 2010 |
| 2 | Alex Cejka | 81 | 71.219 | 277.4 | 70.31 | 66.6 | -0.466 | 0.396 | -0.073 | 489 | 4 | | 2010 |
| 3 | Alex Prugh | 88 | 70.878 | 295.7 | 58.4 | 68.6 | 0.202 | -0.112 | 0.092 | 526 | 4 | | 2010 |
| 4 | Andres Romero | 73 | 70.986 | 296.0 | 55.05 | 65.07 | 0.254 | -0.118 | 0.136 | 853 | 2 | | 2010 |


Now save the dataframe to a sqlite3 database.

In [38]:
#Load sqlite package
import sqlite3 as db
#Create connection object for the database file. A new file will be created if it does not exist.
conn = db.connect('pgatour_raw.db')

#Create cursor to perform actions on db.
c = conn.cursor()

In [39]:
df_total.to_sql("pgatour_stats_raw", conn, if_exists='replace')


In [41]:
conn.close()
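
To check that a table written this way can be read back, the following sketch does a round trip through an in-memory SQLite database. The one-row dataframe holds placeholder values, and `index=False` is a choice made for this sketch; the notebook's own call keeps the index.

```python
import sqlite3

import pandas as pd

# Placeholder row standing in for df_total.
df = pd.DataFrame({"NAME": ["Player A"], "SCORING": ["70.995"], "Year": ["2010"]})

# ":memory:" keeps the database in RAM, so no file is written.
conn = sqlite3.connect(":memory:")
try:
    # index=False leaves the dataframe index out of the table.
    df.to_sql("pgatour_stats_raw", conn, if_exists="replace", index=False)
    loaded = pd.read_sql_query("SELECT * FROM pgatour_stats_raw", conn)
finally:
    # Close the connection even if the write or read fails.
    conn.close()

print(loaded)
```

The values come back as text because they were stored as strings, which is one of the cleaning steps left for later.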

# Conclusion
This notebook walked through how I implemented a web scraper that collects player statistics for 2010-2017 from pgatour.com. Additional data cleaning steps are needed to prepare this data for analysis, but they are out of scope for this notebook. Applications of data cleaning in Python can be seen in the follow-up notebooks in this repository.