This notebook illustrates how to scrape the leaderboard of the highest run scorers of the 2023 season from the IPL statistics webpage. It is intended for educational purposes only.

In [1]:
# !pip3 install requests
# !pip3 install pandas

In [14]:
import requests
import os
import pandas as pd
import json

#### URL to retrieve the data from.
To scrape data from a web page, the URL that actually serves the data needs to be located. Requesting the URL of the webpage itself does not return the data, because the page loads it through a separate API call. The URL of that API call can be found by inspecting the webpage in the browser. Opening that URL in the browser returns the raw data, and reading through it gives enough understanding of its structure to extract the desired fields.

In [3]:
api_url = "https://ipl-stats-sports-mechanic.s3.ap-south-1.amazonaws.com/ipl/feeds/stats/107-toprunsscorers.js?callback=ontoprunsscorers&_=1688441636396"
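The API URL can be taken apart to see what it asks for. The following is a minimal sketch using only the standard library; the variable names `feed_path` and `callback_name` are illustrative. The `callback` parameter names the JavaScript function that wraps the returned data. The `_` parameter is assumed here to be a cache-busting value, which the source page does not confirm.

```python
from urllib.parse import parse_qs, urlsplit

# same URL as in the cell above, split across lines for readability
api_url = (
    "https://ipl-stats-sports-mechanic.s3.ap-south-1.amazonaws.com"
    "/ipl/feeds/stats/107-toprunsscorers.js"
    "?callback=ontoprunsscorers&_=1688441636396"
)

parts = urlsplit(api_url)
query = parse_qs(parts.query)  # maps each parameter name to a list of values

feed_path = parts.path                # the feed file that holds the leaderboard
callback_name = query["callback"][0]  # the function name that wraps the data

print(feed_path)
print(callback_name)
```

Knowing the callback name in advance explains why the response text begins with that name followed by an opening parenthesis.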

In [4]:
raw_scraped_text = requests.get(url=api_url, timeout=30).text

In [5]:
"API response starts with: " + raw_scraped_text[0:20]

'API response starts with: ontoprunsscorers({"t'

In [6]:
"API response ends with: " + raw_scraped_text[-20:]

'API response ends with: layerID":"5432"}]});'

In [7]:
len(raw_scraped_text)

138925

In most cases, an API response is a well-formed JSON or XML document. Here, as the inspection above shows, the JSON data is wrapped in a JavaScript function call, ontoprunsscorers(...);, so the raw response cannot be converted directly into a JSON object. However, each player entry in the leaderboard data is itself a JSON-formatted dictionary enclosed in curly braces. We exploit this here.
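A response that starts with `ontoprunsscorers(` and ends with `);` has the shape of a JSONP response: a JSON document passed as the argument of a callback. As an alternative to extracting records one by one, the wrapper could be stripped and the remainder parsed in a single step. The sketch below assumes the text between the parentheses is valid JSON, which was not verified against the live feed. The helper `unwrap_jsonp` and the key `"items"` in the sample are hypothetical; the real top-level key is not shown in this notebook.

```python
import json
import re


def unwrap_jsonp(text):
    """Parse the JSON payload found inside a callback(...) wrapper."""
    # callback name, "(", payload, ")", optional ";" and surrounding whitespace
    match = re.fullmatch(r"\s*[\w.$]+\s*\((.*)\)\s*;?\s*", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("text does not look like a JSONP response")
    return json.loads(match.group(1))


# tiny stand-in for the real response; "items" is a made-up key
sample = 'ontoprunsscorers({"items":[{"StrikerName":"A","TotalRuns":"10"}]});'
payload = unwrap_jsonp(sample)

# the records are the single list stored under the top-level key
sample_records = next(iter(payload.values()))
print(len(sample_records), "record(s) parsed")
```

If this works on the real response, `unwrap_jsonp(raw_scraped_text)` would yield the whole leaderboard at once; the brace-scanning approach used in this notebook remains useful when the payload is not valid JSON as a whole.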

#### Extract data.

In [8]:
# locate the opening and closing braces of the first player record
# start searching at index 18 to skip the outer opening brace at index 17
start_pos = raw_scraped_text.find("{", 18)
end_pos = raw_scraped_text.find("}")
start_pos, end_pos

(36, 877)

In [9]:
records = []
while True:
    # parse the current record, both braces included
    records.append(json.loads(raw_scraped_text[start_pos : end_pos + 1]))
    # find the opening brace of the next record; stop when there is none
    start_pos = raw_scraped_text.find("{", end_pos + 1)
    if start_pos == -1:
        break
    # find the closing brace of that record (records hold no nested braces)
    end_pos = raw_scraped_text.find("}", start_pos)

print(len(records), " records found.")

166  records found.
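Scanning for the next closing brace works as long as no field value contains a curly brace. A more defensive variant, sketched below, lets the standard library find the end of each record: `json.JSONDecoder.raw_decode` parses one JSON value starting at a given index and returns the value together with the index where it stopped. The generator name `iter_objects` and the sample text are hypothetical.

```python
import json


def iter_objects(text, start=0):
    """Yield every JSON object found in text at or after index start."""
    decoder = json.JSONDecoder()
    pos = text.find("{", start)
    while pos != -1:
        # raw_decode returns the parsed value and the index just past it
        obj, end = decoder.raw_decode(text, pos)
        yield obj
        pos = text.find("{", end)


# made-up response; note the "}" inside the first string value
sample = 'cb({"items":[{"a":"x}y"},{"a":"2"}]});'

# start one character after the outer opening brace to skip the wrapper object
sample_records = list(iter_objects(sample, start=sample.find("{") + 1))
print(len(sample_records), "records found.")
```

With the notebook's data, the equivalent call would be `list(iter_objects(raw_scraped_text, start=18))`, under the same assumption that each player entry is a valid JSON object.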


In [10]:
records[0]

{'StrikerName': 'Shubman Gill',
 'PlayerId': '62',
 'Matches': '17',
 'PlayerDOB': '0000-00-00',
 'RightHandedBat': 'true',
 'Nationality': 'Indian',
 'TCompetitionID': '107',
 'TStrikerID': '2019-100mb00000000062-8fa4884a17f311',
 'TTeamID': '35',
 'TeamCode': 'GT',
 'TeamName': 'Gujarat Titans',
 'CompetitionID': '107',
 'TeamID': '35',
 'StrikerID': '2019-100mb00000000062-8fa4884a17f311',
 'Innings': '17',
 'Extras': '33',
 'TotalRuns': '890',
 'Balls': '564',
 'Dotballs': '150',
 'StrikeRate': '157.80',
 'DBPercent': '26.59',
 'DBFreq': '3.76',
 'BdryFreq': '4.77',
 'BdryPercent': '60.44',
 'RPSS': '2.14',
 'ScoringBalls': '414',
 'Ones': '244',
 'Twos': '48',
 'Threes': '4',
 'Fours': '85',
 'Sixes': '33',
 'Outs': '15',
 'NotOuts': '2',
 'BattingAveragesss': '445.00',
 'FiftyPlusRuns': '4',
 'Centuries': '3',
 'DoubleCenturies': '0',
 'HighestScore': '129',
 'BattingAverage': '59.33',
 'Catches': '6',
 'Stumpings': '0',
 'ClientPlayerID': '3761'}

#### Create a dataframe.

In [11]:
highest_scorers_df = pd.DataFrame(records)
highest_scorers_df.head(2)

Unnamed: 0,StrikerName,PlayerId,Matches,PlayerDOB,RightHandedBat,Nationality,TCompetitionID,TStrikerID,TTeamID,TeamCode,...,NotOuts,BattingAveragesss,FiftyPlusRuns,Centuries,DoubleCenturies,HighestScore,BattingAverage,Catches,Stumpings,ClientPlayerID
0,Shubman Gill,62,17,0000-00-00,True,Indian,107,2019-100mb00000000062-8fa4884a17f311,35,GT,...,2,445.0,4,3,0,129,59.33,6,0,3761
1,Faf Du Plessis,94,14,0000-00-00,True,Overseas,107,2019-100mb00000000094-1e8fa14543ee11,19,RCB,...,1,730.0,8,0,0,84,56.15,3,0,24


In [12]:
# retain only the desired fields, here, related to batting
cols_to_retain = [
    "StrikerName",
    "PlayerId",
    "Matches",
    "Nationality",
    "TeamName",
    "Innings",
    "TotalRuns",
    "Balls",
    "Dotballs",
    "StrikeRate",
    "DBPercent",
    "BdryFreq",
    "BdryPercent",
    "Ones",
    "Twos",
    "Threes",
    "Fours",
    "Sixes",
    "Outs",
    "NotOuts",
    "BattingAveragesss",
    "FiftyPlusRuns",
    "Centuries",
    "DoubleCenturies",
    "HighestScore",
    "BattingAverage",
    "Catches",
    "Stumpings",
]

highest_scorers_df = highest_scorers_df[cols_to_retain]
highest_scorers_df.head(2)

Unnamed: 0,StrikerName,PlayerId,Matches,Nationality,TeamName,Innings,TotalRuns,Balls,Dotballs,StrikeRate,...,Outs,NotOuts,BattingAveragesss,FiftyPlusRuns,Centuries,DoubleCenturies,HighestScore,BattingAverage,Catches,Stumpings
0,Shubman Gill,62,17,Indian,Gujarat Titans,17,890,564,150,157.8,...,15,2,445.0,4,3,0,129,59.33,6,0
1,Faf Du Plessis,94,14,Overseas,Royal Challengers Bangalore,14,730,475,133,153.68,...,13,1,730.0,8,0,0,84,56.15,3,0
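Every value in the scraped records is a string (for example `'TotalRuns': '890'`), so the dataframe columns hold text rather than numbers. Before sorting or aggregating, the numeric columns would need converting. The sketch below shows one way to do this with `pd.to_numeric` on a two-row stand-in built from values shown above; the choice of columns and the name `numeric_cols` are illustrative.

```python
import pandas as pd

# two-row stand-in for highest_scorers_df, with string values as scraped
df = pd.DataFrame(
    [
        {"StrikerName": "Shubman Gill", "TotalRuns": "890", "StrikeRate": "157.80"},
        {"StrikerName": "Faf Du Plessis", "TotalRuns": "730", "StrikeRate": "153.68"},
    ]
)

numeric_cols = ["TotalRuns", "StrikeRate"]
# errors="coerce" turns unparseable entries into NaN instead of raising
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
print(df["TotalRuns"].sum())
```

The same pattern could be applied to `highest_scorers_df` with a longer column list; `StrikerName`, `Nationality` and `TeamName` would stay as text.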


In [13]:
len(highest_scorers_df)

166

#### Store the data in the desired dir.

In [15]:
# the data is written with to_csv, so use a matching .csv extension
filename = "ipl_highest_scorers_2023.csv"
# create the target directory if it does not exist yet
relative_dir_path = "../data"
os.makedirs(relative_dir_path, exist_ok=True)
# store the data
highest_scorers_df.to_csv(os.path.join(relative_dir_path, filename))
print("Data successfully saved.")

Data successfully saved.
