# I. Web Scraping

I have scraped IPL 2021 data from https://www.indiatvnews.com using Python.

The Indian Premier League (IPL) is a professional Twenty20 cricket league in India, contested every year by eight teams representing eight different cities. The league was founded by the Board of Control for Cricket in India (BCCI) in 2008.

The dataset is available at https://www.kaggle.com/malaydhami/ipl-2021-dataset

In [1]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup 

In [2]:
url = "https://www.indiatvnews.com/iframe/ipl-2021/schedule"
request = requests.get(url)
soup = BeautifulSoup(request.content, "html.parser")

In [3]:
urls = set()

# Each match on the schedule page sits in a "match-box" div that links to the match page.
for box in soup.find_all("div",class_="match-box"):
    urls.add(box.a["href"])

print(len(urls))

50


## File1: Matches.csv

   
This dataset has details of all the matches played in IPL 2021.

The columns of the dataset are described below:

1) match_no: Match Number

2) date: Date of match

3) team1: Name of the first team

4) team2: Name of the second team

5) venue: Venue of match

6) toss: Team who won the toss

7) decision: Decision of the toss winner (bat or field)

8) winner: Team who won the match

9) target: Whether the target was chased or defended

10) mom: Man of the match
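The toss, decision, winner and target columns are all derived by splitting two sentences from the match page. The sketch below isolates that string handling so it can be tried without any network access. The function names and the sample sentences are assumptions made for illustration; the wording on the real page may differ. As in the notebook, any result that does not end in "wickets" is labelled "defended", which would mislabel a tie or an abandoned match.

```python
def parse_toss(toss):
    """Split a toss sentence into (toss winner, decision)."""
    toss_winner = toss.split(" won ")[0]   # text before " won "
    decision = toss.split(" to ")[-1]      # text after the last " to "
    return toss_winner, decision


def parse_result(result):
    """Split a result sentence into (winner, 'chased' or 'defended')."""
    winner = result.split(" won ")[0]
    last_word = result.split(" ")[-1].strip(".")
    # A margin given in wickets means the side batting second won.
    target = "chased" if last_word == "wickets" else "defended"
    return winner, target


# Hypothetical sentences in the shape the scraper expects.
print(parse_toss("Bangalore won the toss and elected to field"))
print(parse_result("Bangalore won by 2 wickets."))
print(parse_result("Kolkata won by 10 runs."))
```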

In [4]:
match_rows = []
session = requests.Session()  # reuse one session for every request

for url in urls:
    
    sub_src = session.get(url)
    sub_soup = BeautifulSoup(sub_src.content, "html.parser")
    
    # The scorecard is embedded in an iframe, so its source is fetched separately.
    iframe_src = sub_soup.find("div", class_="madbodynew").iframe["src"]
    match = BeautifulSoup(session.get(iframe_src).text,"html.parser")
    
    date=match.find_all("div",class_="mch_hdr-lft")[1].find_all("span")[1].text.strip()

    team1, team2 = [t.text for t in match.find_all("span",class_="mch_tm-nme")]

    # Items 1 to 4 of the details list hold venue, toss, result and man of the match.
    details = [detail.text for detail in match.find_all("span",class_="lst_txt-b")[1:5]]
    venue, toss, result, mom = details
    venue = venue.split(",")[0]
    toss_winner = toss.split(" won ")[0]
    toss_decision = toss.split(" to ")[-1]
    winner = result.split(" won ")[0]
    target = result.split(" ")[-1].strip(".")
    
    # A result ending in "wickets" means the side batting second won.
    if target=="wickets":
        target="chased"
    else:
        target="defended"
    match_rows.append({"date":date,"team1":team1,"team2":team2,"venue":venue,"toss":toss_winner,"decision":toss_decision,"winner":winner,"target":target,"mom":mom,"url":url})

# Build the DataFrame once from the collected rows; DataFrame.append was removed in pandas 2.0.
matches = pd.DataFrame(match_rows, columns=["date","team1","team2","venue","toss","decision","winner","target","mom","url"])

print("Completed!")

Completed!


In [5]:
# infer_datetime_format is deprecated in pandas 2.0; format inference is now the default.
matches["date"] = pd.to_datetime(matches["date"])
matches = matches.sort_values(by="date").reset_index(drop=True)
matches.insert(0, "match_no", matches.index+1)
match_urls = matches["url"]
matches.drop("url",axis=1,inplace=True)
matches.head(10)

Unnamed: 0,match_no,date,team1,team2,venue,toss,decision,winner,target,mom
0,1,2021-04-09,Mumbai,Bangalore,MA Chidambaram Stadium,Bangalore,field,Bangalore,chased,Harshal Patel
1,2,2021-04-10,Chennai,Delhi,Wankhede Stadium,Delhi,field,Delhi,chased,Shikhar Dhawan
2,3,2021-04-11,Hyderabad,Kolkata,MA Chidambaram Stadium,Hyderabad,field,Kolkata,defended,Nitish Rana
3,4,2021-04-12,Rajasthan,Punjab,Wankhede Stadium,Rajasthan,field,Punjab,defended,Sanju Samson
4,5,2021-04-13,Kolkata,Mumbai,MA Chidambaram Stadium,Kolkata,field,Mumbai,defended,Rahul Chahar
5,6,2021-04-14,Hyderabad,Bangalore,MA Chidambaram Stadium,Hyderabad,field,Bangalore,defended,Glenn Maxwell
6,7,2021-04-15,Rajasthan,Delhi,Wankhede Stadium,Rajasthan,field,Rajasthan,chased,Jaydev Unadkat
7,8,2021-04-16,Punjab,Chennai,Wankhede Stadium,Chennai,field,Chennai,chased,Deepak Chahar
8,9,2021-04-17,Mumbai,Hyderabad,MA Chidambaram Stadium,Mumbai,bat,Mumbai,defended,Kieron Pollard
9,10,2021-04-18,Bangalore,Kolkata,MA Chidambaram Stadium,Bangalore,bat,Bangalore,defended,AB de Villiers


## File2: Ball by Ball.csv

This dataset contains details of each ball bowled during IPL 2021.

The columns of the dataset are described below:

1) match_no: Match number

2) inning: Inning number

3) over: Current over number

4) ball: Current ball of the over

5) batsman: Batsman on the strike

6) bowler: Bowler name

7) is_wicket: Wicket or not

8) dismissal_type: Type of dismissal, if a wicket fell

9) fielder: Fielder involved in the dismissal (caught, run out or stumped)

10) batsman_run: Runs scored by the batsman off this delivery

11) extra_run: Extra runs conceded on this delivery

12) total_run: Total runs off the ball (batsman_run + extra_run)

13) extra_type: Type of extra, if any (NB, wide or lb)
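The run columns come from a single commentary line per ball. The sketch below mirrors the notebook's branching as a standalone function that can be tested on sample strings. The function name `parse_delivery` and the sample lines are assumptions for illustration; the commentary wording on the real site may differ.

```python
def parse_delivery(detail):
    """Parse '<bowler> to <batsman>, <outcome>' into the run columns."""
    players, outcome = detail.split(", ", maxsplit=1)
    bowler, batsman = players.split(" to ", maxsplit=1)
    first, _, rest = outcome.partition(" ")

    batsman_run, extra_run, extra_type = 0, 0, None
    if first == "Four":
        batsman_run = 4
    elif first == "Six":
        batsman_run = 6
    elif first == "no":            # "no run"
        batsman_run = 0
    elif rest == "no ball":        # one penalty run, the rest go to the batsman
        extra_run, extra_type = 1, "NB"
        batsman_run = int(first) - 1
    elif rest == "wide":
        extra_run, extra_type = int(first), "wide"
    elif rest == "leg bye":
        extra_run, extra_type = int(first), "lb"
    else:                          # e.g. "2 runs"
        batsman_run = int(first)

    return {
        "bowler": bowler,
        "batsman": batsman,
        "batsman_run": batsman_run,
        "extra_run": extra_run,
        "total_run": batsman_run + extra_run,
        "extra_type": extra_type,
    }


# Hypothetical commentary lines.
print(parse_delivery("Mohammed Siraj to Rohit Sharma, 2 runs"))
print(parse_delivery("KA Jamieson to CA Lynn, 1 wide"))
```

Converting the run counts to `int` at parse time keeps `total_run` a plain sum and avoids mixing strings and integers in the same column.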

In [6]:
ball_columns = ["match_no","inning","over","ball","batsman","bowler","is_wicket","dismissal_type","fielder","batsman_run","extra_run","total_run","extra_type"]
ball_rows = []

for match_no, link in enumerate(match_urls, start=1):
    
    response = requests.get(link)
    match=BeautifulSoup(response.content,"html.parser")
    
    # The commentary page uses the scorecard iframe URL with "overview" replaced by "commentary".
    commentary_url = match.find("iframe")["src"].replace("overview","commentary")
    match_details=BeautifulSoup(requests.get(commentary_url).text,"html.parser")
    
    for inning in range(1,3):
        for over_details in match_details.find_all("div",class_="mch_cmt-wrp")[inning-1].find_all("div",class_="lst_cir-scl"):
            
            over = over_details.find("div",class_="ful_scr-bwl")
            delivery = over_details.find("div",class_="lst_cir-itm")
            detail = over_details.find("div",class_="ful_scr-ttl")
            wicket = over_details.find("div",class_="ful_scr-txt")
            # Defaults for a delivery with no wicket and no extras.
            player = np.nan
            kind = np.nan
            extra_type= np.nan
            batsman_run=0
            extra_run=0
            is_wicket=0
            
            # Only entries that carry an over.ball label describe a delivery.
            if over:
                over = over.text
                delivery = delivery.text
                detail = detail.text
                wicket = wicket.text

                # Dismissal type and, where relevant, the fielder involved.
                if wicket:
                    wicket_type = wicket.split(" ")[2]
                    s=wicket
                    if wicket_type=="runout":
                        kind="runout"
                        player=s[s.find("(")+1:s.find(")")]
                    elif wicket_type=="c":
                        kind="caught"
                        if wicket.split(" ")[4]=="b":
                            player=s[15:s.find(" batsmen")]
                        else:
                            player=s[11:s.find(" b")]
                    elif wicket_type=="lbw":
                        kind="lbw"
                    elif wicket_type=="b":
                        kind="bowled"
                    elif wicket_type=="st":
                        kind="stumped"
                        player=s[12:s.find(" b")]

                # Commentary line has the form "<bowler> to <batsman>, <outcome>".
                bowler, batsman = detail.split(",")[0].split(" to ")
                bat_run=detail.split(", ")[1].split(" ",maxsplit=1)
                
                if bat_run[0]=="Four":
                    batsman_run=4
                elif bat_run[0]=="Six":
                    batsman_run=6
                elif bat_run[0]=="no":
                    batsman_run=0
                elif bat_run[1]=="no ball":
                    # One penalty run for the no ball; the rest are credited to the batsman.
                    extra_run = 1
                    extra_type = "NB"
                    batsman_run = int(bat_run[0])-1
                elif bat_run[1]=="wide":
                    extra_run = int(bat_run[0])
                    extra_type = "wide"
                elif bat_run[1]=="leg bye":
                    extra_run = int(bat_run[0])
                    extra_type = "lb"
                else:
                    batsman_run = int(bat_run[0])

                if delivery=="w":
                    is_wicket=1
                over_no,ball = over.split(".")
                ball_rows.append({"match_no":match_no,"inning":inning,"over":over_no,"ball":ball,"batsman":batsman,"bowler":bowler,"is_wicket":is_wicket,"dismissal_type":kind,"fielder":player,"batsman_run":batsman_run,"extra_run":extra_run,"total_run":batsman_run+extra_run,"extra_type":extra_type})

# Build the DataFrame once from the collected rows; DataFrame.append was removed in pandas 2.0.
ball_by_ball = pd.DataFrame(ball_rows, columns=ball_columns)

print("Completed!")

Completed!


In [7]:
data_type = {"match_no":int,"inning":int,"over":int,"ball":int,"batsman":object,"bowler":object,"is_wicket":int,"dismissal_type":object,"fielder":object,"batsman_run":int,"extra_run":int,"total_run":int,"extra_type":object}
ball_by_ball = ball_by_ball.astype(data_type)
ball_by_ball = ball_by_ball.sort_values(by=["match_no","inning","over","ball"]).reset_index(drop=True)
ball_by_ball.head(10)

Unnamed: 0,match_no,inning,over,ball,batsman,bowler,is_wicket,dismissal_type,fielder,batsman_run,extra_run,total_run,extra_type
0,1,1,0,1,Rohit Sharma,Mohammed Siraj,0,,,2,0,2,
1,1,1,0,2,Rohit Sharma,Mohammed Siraj,0,,,0,0,0,
2,1,1,0,3,Rohit Sharma,Mohammed Siraj,0,,,0,0,0,
3,1,1,0,4,Rohit Sharma,Mohammed Siraj,0,,,2,0,2,
4,1,1,0,5,Rohit Sharma,Mohammed Siraj,0,,,0,0,0,
5,1,1,0,6,Rohit Sharma,Mohammed Siraj,0,,,1,0,1,
6,1,1,1,1,Rohit Sharma,KA Jamieson,0,,,1,0,1,
7,1,1,1,2,CA Lynn,KA Jamieson,0,,,0,0,0,
8,1,1,1,3,CA Lynn,KA Jamieson,0,,,0,0,0,
9,1,1,1,4,CA Lynn,KA Jamieson,0,,,0,0,0,
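
The over and ball numbers are split out of text, so they start life as strings. Casting them to integers before sorting matters, because strings sort character by character. A small illustration with made-up over numbers:

```python
# Over numbers as they might come out of text parsing (strings, not ints).
overs_as_text = ["0", "1", "2", "10", "11", "19"]

# Lexicographic order compares character by character, so "10" lands before "2".
text_order = sorted(overs_as_text)
print(text_order)

# Casting first gives the numeric order a ball-by-ball table needs.
numeric_order = sorted(int(o) for o in overs_as_text)
print(numeric_order)
```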


In [10]:
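# Save both DataFrames as CSV files.
# Raw strings (r"...") keep the backslashes in the Windows paths from being read as escape sequences.

matches.to_csv(r"E:\Project\Webscrapping\Dataset\Matches.csv", index=False)
ball_by_ball.to_csv(r"E:\Project\Webscrapping\Dataset\Ball by Ball.csv", index=False)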
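
Windows paths written as ordinary string literals are fragile, because sequences such as `\t` or `\n` are interpreted as escapes. A minimal sketch of two safer spellings, raw strings and `pathlib`, follows. The `E:\temp\new.csv` path is made up purely to show the problem, and `PureWindowsPath` is used so the snippet behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# In a normal literal, "\t" becomes a tab and "\n" a newline.
plain = "E:\temp\new.csv"
# A raw literal keeps every backslash as written.
raw = r"E:\temp\new.csv"
print(len(plain), len(raw))

# pathlib joins path components without hand-written separators.
dataset_dir = PureWindowsPath(r"E:\Project\Webscrapping\Dataset")
matches_path = dataset_dir / "Matches.csv"
print(matches_path)
```

A path object like `matches_path` can be passed to `DataFrame.to_csv` in place of a string.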