# Scraping Cricket Statistics

I had gone through several tutorials on using Beautiful Soup to scrape data from websites, and I was on the lookout for a webpage to practice on. Back home in India, cricket has a HUGE fan following, so a website containing cricket statistics seemed like a good choice for scraping. In this post, I'm going to scrape data from the website HOWSTAT.

Let's import some packages to help us scrape the data. In brief, 'requests' downloads the webpage as HTML, 're' is Python's regular expressions package, 'BeautifulSoup' extracts data from HTML tags, and 'pandas' stores the scraped data.

In [1]:
import requests, re
import pandas as pd
from bs4 import BeautifulSoup

To avoid overburdening the website with HTTP requests for downloading the webpage, let's write a function that we call just once to download the HTML and save it as a text file.

In [2]:
def download_page():
    ## Download the page using the URL and save as a text file
    page_html = requests.get('http://www.howstat.com.au/cricket/Statistics/Players/PlayerCountryList.asp?Country=IND')
    with open('f_html.txt', 'w') as f_html:  # 'with' closes the file once the write is done
        f_html.write(page_html.text)

In [4]:
# Call the function to download the page here
download_page()
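
One way to make the "download just once" idea hold across notebook re-runs is to skip the request whenever the saved file already exists. The sketch below is an illustration, not part of the original notebook: `load_or_download` and its `fetch` parameter are hypothetical names, and `fetch` stands for any zero-argument callable that returns the page text.

```python
import io
import os


def load_or_download(path, fetch):
    """Return the text cached at `path`, calling `fetch()` only when the file is missing.

    `fetch` is a hypothetical zero-argument callable that returns the page HTML
    as text, for example: lambda: requests.get(url).text
    """
    if not os.path.exists(path):
        text = fetch()  # The only place a network request can happen
        with io.open(path, 'w', encoding='utf-8') as f:
            f.write(text)
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()
```

With the page above, this could be called as `load_or_download('f_html.txt', lambda: requests.get(url).text)`, where `url` holds the HOWSTAT address. Re-running the cell then reads from disk instead of hitting the site again.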

In [5]:
## Read in the text file and soupify; 'with' closes the file automatically
with open('f_html.txt', 'r') as f_open:
    soup = BeautifulSoup(f_open.read(), 'lxml')

In [7]:
## Selects the parent tag 'tr' containing children 'td' with col names (index 21 found by trial and error)
col_tags = soup.find_all('tr')[21].contents
col_list = [] # List to store the raw names
for each in col_tags:
    if 'td' in str(each): # keeps the td tags and skips the other children of the row
        col_list.append(BeautifulSoup(str(each), 'lxml').getText())

In [8]:
# re.search('(\w\s*)+', name).group().strip() keeps only the run of words and strips the surrounding whitespace
col_names = [re.search(r'(\w\s*)+', name).group().strip() for name in col_list]
col_names = [str(name) for name in col_names] # Converting unicode names to string
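
To see what the pattern `(\w\s*)+` does, here is a small sketch run on made-up cell texts (the sample strings and the `clean_header` name are hypothetical, not taken from the page). It also guards against a cell with no word characters, where `re.search` returns `None` and calling `.group()` directly would raise an `AttributeError`.

```python
import re


def clean_header(raw):
    """Return the first run of words in `raw`, or '' when it has no word characters."""
    match = re.search(r'(\w\s*)+', raw)
    return match.group().strip() if match else ''


# Hypothetical raw cell texts, padded with whitespace the way scraped HTML often is
samples = ['\n  Name \n', '\n Bat Avg\n', '   ']
cleaned = [clean_header(s) for s in samples]
print(cleaned)  # ['Name', 'Bat Avg', '']
```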

In [9]:
## Now we have to rename columns according to Test, ODI and T20 in that order
## First join the words with an '_' and then prefix each group of five stat columns
col_names = [name.replace(' ', '_') for name in col_names]
for prefix, start in [('Test_', 1), ('ODI_', 6), ('T20_', 11)]:
    col_names[start:start + 5] = [prefix + name for name in col_names[start:start + 5]]
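
The renaming matters because the same five stat names repeat for Test, ODI and T20, and duplicate column labels make a DataFrame awkward to index. A quick check along these lines can confirm that the prefixes made every name unique. This is only a sketch: `duplicated` is a hypothetical helper, and the `raw` list is reconstructed from the final column names rather than copied from the page.

```python
from collections import Counter


def duplicated(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    seen = []
    for name in names:
        if counts[name] > 1 and name not in seen:
            seen.append(name)
    return seen


# Illustrative header before prefixing: the five stat names repeat three times
raw = ['Name'] + ['M', 'Runs', 'Bat_Avg', 'Wkts', 'Bowl_Avg'] * 3
# The same header after prefixing each group of five
renamed = ['Name'] + [p + n for p in ('Test_', 'ODI_', 'T20_') for n in raw[1:6]]

print(duplicated(raw))      # ['M', 'Runs', 'Bat_Avg', 'Wkts', 'Bowl_Avg']
print(duplicated(renamed))  # []
```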

In [10]:
## Let's search for player stats and store them as individual lists within a list
## By looking at the table on the website & by trial and error, I see that player indexes
## start at 22 and end at 364
rows = soup.find_all('tr') # Find the rows once instead of on every iteration
players = [] # Master list with all players
for row in rows[22:365]: # Looping through each player's tag collection
    player_tags = row.find_all('td') # Each player is contained within 'tr' and stats within 'td'
    temp_list = [str(tag.getText().strip()) for tag in player_tags] # Player's name first, then the stats
    # Convert the stats to float, replacing missing values with NaN
    temp_list[1:] = [float(value) if value != '' else float('NaN') for value in temp_list[1:]]
    players.append(temp_list) # Appending each player to the master list
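
The per-cell conversion can also be pulled out into a small helper, which makes the "empty cell becomes NaN" rule easy to try on its own. The names `to_float` and `parse_row` are hypothetical, and the sample row is modelled on the first player with two cells left empty. Note that `float()` raises a `ValueError` for any other non-numeric text, so this sketch assumes missing values are always empty strings.

```python
import math


def to_float(text):
    """Convert a scraped cell to float, mapping an empty cell to NaN."""
    text = text.strip()
    return float(text) if text else float('nan')


def parse_row(cells):
    """Keep the first cell (player name) as text and convert the rest to floats."""
    return [cells[0].strip()] + [to_float(c) for c in cells[1:]]


# Sample cell texts: a name, three stats, and two empty cells
row = parse_row(['Aaron, Varun R*', '9', '35', '3.89', '', ''])
print(row[:4])             # ['Aaron, Varun R*', 9.0, 35.0, 3.89]
print(math.isnan(row[4]))  # True
```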

In [11]:
stats = pd.DataFrame(players, columns = col_names)

In [14]:
stats.head()

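| | Name | Test_M | Test_Runs | Test_Bat_Avg | Test_Wkts | Test_Bowl_Avg | ODI_M | ODI_Runs | ODI_Bat_Avg | ODI_Wkts | ODI_Bowl_Avg | T20_M | T20_Runs | T20_Bat_Avg | T20_Wkts | T20_Bowl_Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron, Varun R* | 9 | 35 | 3.89 | 18 | 52.61 | 9.0 | 8.0 | 8.0 | 11.0 | 38.09 | NaN | NaN | NaN | NaN | NaN |
| 1 | Abid Ali, Syed | 29 | 1018 | 20.36 | 47 | 42.13 | 5.0 | 93.0 | 31.0 | 7.0 | 26.71 | NaN | NaN | NaN | NaN | NaN |
| 2 | Adhikari, Hemu R | 21 | 872 | 31.14 | 3 | 27.33 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Agarkar, Ajit B | 26 | 571 | 16.79 | 58 | 47.33 | 191.0 | 1269.0 | 14.59 | 288.0 | 27.85 | 4.0 | 15.0 | 7.5 | 3.0 | 28.33 |
| 4 | Amar Singh, Ladha | 7 | 292 | 22.46 | 28 | 30.64 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |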
