# Web Scraping BWF Data

Inspired by Loh Kean Yew's historic achievement of becoming the first Singaporean to win a BWF World Championships title, at the 2021 edition held in Huelva, Spain, I wanted to do some simple analysis of his meteoric rise to the top of the badminton world.

To start off, the official BWF website offers match data that could be useful. Since I had just learnt about web scraping, I decided to write code to scrape Loh Kean Yew's match data from the BWF website.

## Using Selenium

At the start, I was looking for a tool to perform the web scraping, as the BWF website seemed quite complex. Unlike the static HTML sites I practised on in tutorials, the BWF site appears to render much of its content with JavaScript. I have limited knowledge of web development, HTML and JavaScript, but a few Google searches led me to Selenium.

Selenium is a powerful tool for controlling web browsers programmatically and automating browser actions. It can perform various operations, such as clicks, on elements of a webpage. This is required because the BWF website uses buttons and tabs that change the information on the page dynamically without changing the URL. Again, I am no expert on web development, so I need to do more work to understand this better.
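To illustrate why a plain HTML parser is not enough for this kind of site, here is a minimal sketch using only the standard library. The markup in `SHELL_HTML` is hypothetical and is not copied from the BWF site; it only assumes, as the XPaths used later suggest, that the page content is mounted into an element with `id="app"` by a script. The class name `VisibleTextCollector` is my own.

```python
from html.parser import HTMLParser

# Hypothetical markup, NOT copied from the BWF site: a JavaScript-driven
# page often ships an almost empty shell like this one.
SHELL_HTML = """
<html>
  <body>
    <div id="app"></div>
    <script src="/js/app.js"></script>
  </body>
</html>
"""


class VisibleTextCollector(HTMLParser):
    """Collect non-blank text that appears outside <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Ignore whitespace between tags and anything inside a script
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())


collector = VisibleTextCollector()
collector.feed(SHELL_HTML)
print(collector.chunks)  # [] : nothing to scrape until the script has run
```

A browser driven by Selenium runs the script first, so the page source it hands back already contains the rendered content.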

In [1]:
from selenium import webdriver

from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from requests import get

import time

## Navigating the Webpage Using Selenium

I mapped out the workflow of clicks required to navigate the webpage to the data I want. Because the website displays the data dynamically and by year, I used loops to repeat the process of selecting a year and displaying its data.

In [2]:
browser = webdriver.Chrome()
url = 'https://bwfbadminton.com/player/76115/loh-kean-yew'
browser.get(url)

time.sleep(3)

#path to click on 'TOURNAMENTS' tab
path = '//*[@id="app"]/div/div/div[2]/div/ul/li[2]/a'
tournament_tab = browser.find_element_by_xpath(path)

#command to click
tournament_tab.click()

#wait for it to load after click
time.sleep(3)

#path to click on drop-down arrow button to select 'Year'
path = '//*[@id="app"]/div[1]/div/section/div/div/div/div[2]/div/div/div/div[1]/div[1]/div[2]/div/i'
year_tab = browser.find_element_by_xpath(path)

#command to click
year_tab.click()

#wait for it to load after click
time.sleep(3)

#scrape the main tournament page in order to get the number of tournament years available
main_source = browser.page_source
main_soup = BeautifulSoup(main_source, 'html.parser')

#wait for it to scrape
time.sleep(3)

#path to click on Year 2022 to close the dropdown bar
path = '//*[@id="list-item-22-0"]/div/div'
year_tab = browser.find_element_by_xpath(path)

#command to click
year_tab.click()

#wait for it to load after click
time.sleep(3)

#get the number of tournament years available
year_containers = main_soup.find_all('div', class_='v-list-item__title')


html_source = []
no_of_tourney = []

#for loop to repeat the scraping for each year available
for i in range(len(year_containers)-1):
    
    #scroll back to the top of the page after scraping each year; otherwise there is an issue accessing the element
    browser.execute_script("window.scrollTo(0, 0)")
    
    #path to click on drop-down arrow button to select 'Year'
    path = '//*[@id="app"]/div[1]/div/section/div/div/div/div[2]/div/div/div/div[1]/div[1]/div[2]/div/i'
    year_tab = browser.find_element_by_xpath(path)

    #command to click
    year_tab.click()

    #wait for it to load after click
    time.sleep(3)

    #path to click to choose the year
    x = i+1
    
    path = '//*[@id="list-item-22-' + str(x) +'"]/div/div'
    twoone_tab = browser.find_element_by_xpath(path)

    #command to click
    twoone_tab.click()

    #wait for it to load after click
    time.sleep(3)

    #scrape the tournaments on the page after selection of year
    overall_source = browser.page_source
    overall_soup = BeautifulSoup(overall_source, 'html.parser')
    
    #identify the drop down button for 'View Match Breakdown'
    #If there are 10 drop down buttons, this means there are 10 tournaments in that year
    page_containers = overall_soup.find_all('i', class_='fas fa-fw fa-chevron-down')
    print(len(page_containers))
    
    #collate a list of number of tournaments in each year
    no_of_tourney.append(len(page_containers))

    #for loop to scrape the matches data tournament by tournament in that year
    for j in range(len(page_containers)):
        #path to click on arrow button for 'VIEW MATCH BREAKDOWN'
        x = 4+j
        path = '//*[@id="app"]/div/div/div[' + str(x) + ']/div[1]/div/div/div/div[4]/a/h2/i'
        view_match1 = browser.find_element_by_xpath(path)

        time.sleep(3)

        #command to click
        view_match1.click()

        #wait for it to load after click
        time.sleep(3)

        #scrape
        source = browser.page_source
        html_source.append(source)

        time.sleep(3)

browser.quit()

10
4
17
10
4
8
16
9
8
1
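
The fixed `time.sleep(3)` calls above wait the full three seconds even when the page is ready sooner, and may still be too short on a slow connection. As an alternative, here is a minimal sketch of a polling helper that returns as soon as a condition holds. The name `wait_until` and its parameters are my own and not part of Selenium; Selenium also ships its own `WebDriverWait` class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Demo without a browser: the condition becomes true on the third call.
calls = []


def ready_on_third_call():
    calls.append(1)
    return len(calls) >= 3


wait_until(ready_on_third_call, timeout=5, interval=0.01)
print(len(calls))  # 3
```

In the scraping loop, the condition could be a lambda that looks up the element to click, for example `lambda: browser.find_elements_by_xpath(path)` with the same Selenium version as the code above, since that call returns an empty list while the element is not there yet.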


In [3]:
html_soup = [BeautifulSoup(source, 'html.parser') for source in html_source]
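
Each entry of `html_soup` is one page snapshot, saved once per tournament per year, so the extraction step has to work out which year and which tournament a given snapshot belongs to. The sketch below shows one way to make that mapping explicit, using the tournament counts printed by the scraping cell. The names `tourneys_per_year` and `snapshot_index` are my own, for illustration only.

```python
# Tournament counts per year, copied from the output of the scraping cell.
tourneys_per_year = [10, 4, 17, 10, 4, 8, 16, 9, 8, 1]

# One (year_position, tournament_position) pair per saved snapshot,
# in the same order the snapshots were appended to html_source.
snapshot_index = [
    (year_pos, tourney_pos)
    for year_pos, n_tourneys in enumerate(tourneys_per_year)
    for tourney_pos in range(n_tourneys)
]

print(len(snapshot_index))   # 87
print(snapshot_index[9])     # (0, 9)
print(snapshot_index[10])    # (1, 0)
```

With the counts above, snapshot 9 is the last tournament of the first year scraped, and snapshot 10 is the first tournament of the next year.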

## Extracting the Data

By running through the various tags, I was able to extract the data that I think will be useful for analysis later.

In [4]:
player1 = []
player1_set1 = []
player1_set2 = []
player1_set3 = []
player2 = []
player2_set1 = []
player2_set2 = []
player2_set3 = []
rounds = []
match_time = []
tournament_name = []
year = []

count = 0 
index = 0   

for soup in html_soup:
    
    #each soup is a snapshot of the full page for one year. if there are 10 tournaments in year 2021, 
    #html_soup[0] to html_soup[9] will all list the same 10 tournament profiles
    
    tourney_containers = soup.find_all('div', class_='profile-tmt-detail')
    year_containers = soup.find_all('div',class_='v-select__selection v-select__selection--comma')
    
    #each soup was saved right after opening one tournament's match breakdown
    #find all the matches shown in that snapshot (soup)
    match_containers = soup.find_all('li', class_='result-match-single-card')
    
    if count == no_of_tourney[index]:    #number of tournament per year
        count = 0 
        index += 1
    
    #each container represent one match in that particular tournament (soup)
    for container in match_containers:

        tournament_name.append((tourney_containers[count].h2.span.a.text))   #count will only increase after inner loops done
        year.append(year_containers[0].text)
        
        #find first player names across all the rounds
        player1_info = container.find('div', class_='team-details-wrap-card')
        player1_name = player1_info.a.text.strip()
        player1.append(player1_name)
        
        #find first player score
        player1_score = player1_info.find_all('div', class_='score')
        scores1 = player1_score[0].find_all('span')
        
        #check the number of scores, as there could be a different number of sets played
        #for example a walkover/bye with no set played, only two sets, or up to three sets
        if len(scores1) == 3:
            
            scores1_1 = scores1[0].text
            player1_set1.append(scores1_1)

            scores1_2 = scores1[1].text
            player1_set2.append(scores1_2)
            
            scores1_3 = scores1[2].text
            player1_set3.append(scores1_3)
            
        elif len(scores1) == 2:
            
            scores1_1 = scores1[0].text
            player1_set1.append(scores1_1)

            scores1_2 = scores1[1].text
            player1_set2.append(scores1_2)
            
            scores1_3 = None
            player1_set3.append(scores1_3)
            
        elif len(scores1) == 1:

            scores1_1 = scores1[0].text
            player1_set1.append(scores1_1)
            
            scores1_2 = None
            player1_set2.append(scores1_2)
            
            scores1_3 = None
            player1_set3.append(scores1_3)
            
        else:
            scores1_1 = None
            player1_set1.append(scores1_1)
            scores1_2 = None
            player1_set2.append(scores1_2)
            scores1_3 = None
            player1_set3.append(scores1_3)

        #find second player names across all the rounds
        player2_info = container.find('div', class_='team-details-wrap-card margin-top')
        #there are matches where it is walkover/bye, and the opponent name is not listed
        if player2_info is not None:
            player2_name = player2_info.a.text.strip()
            player2.append(player2_name)
        else:
            player2_name = 'Walkover/Bye'
            player2.append(player2_name)

        #find second player score
        #to also account for matches where it is walkover/bye, and opponent name and scores are not listed
        if player2_info is not None:
            player2_score = player2_info.find_all('div', class_='score')
            scores2 = player2_score[0].find_all('span')
        else:
            scores2 = []
        if len(scores2) == 3:
            
            scores2_1 = scores2[0].text
            player2_set1.append(scores2_1)

            scores2_2 = scores2[1].text
            player2_set2.append(scores2_2)
            
            scores2_3 = scores2[2].text
            player2_set3.append(scores2_3)
            
        elif len(scores2) == 2:
            
            scores2_1 = scores2[0].text
            player2_set1.append(scores2_1)

            scores2_2 = scores2[1].text
            player2_set2.append(scores2_2)
            
            scores2_3 = None
            player2_set3.append(scores2_3)
            
        elif len(scores2) == 1:

            scores2_1 = scores2[0].text
            player2_set1.append(scores2_1)
            
            scores2_2 = None
            player2_set2.append(scores2_2)
            
            scores2_3 = None
            player2_set3.append(scores2_3)
            
        else:
            scores2_1 = None
            player2_set1.append(scores2_1)
            scores2_2 = None
            player2_set2.append(scores2_2)
            scores2_3 = None
            player2_set3.append(scores2_3)
            
        #find all round number
        round_number = container.find('span', class_='round-oop').text.strip()
        rounds.append(round_number)

        #find all match time
        m_time = container.find('div', class_='time').text.strip()
        match_time.append(m_time)
        
    count += 1   #only increase count after all the matches in the tournament
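
The four-way `if`/`elif` branches above are repeated for both players. As a possible simplification, here is a minimal sketch of a helper that pads a list of set scores to exactly three entries. The name `pad_scores` is my own. Note one difference in behaviour: if more than three scores were found, this helper keeps the first three, whereas the branches above would record `None` for all sets.

```python
def pad_scores(scores, n_sets=3):
    """Return exactly n_sets entries, filling missing sets with None."""
    scores = list(scores)[:n_sets]
    return scores + [None] * (n_sets - len(scores))


print(pad_scores(["21", "17"]))        # ['21', '17', None]
print(pad_scores([]))                  # [None, None, None]
print(pad_scores(["17", "21", "21"]))  # ['17', '21', '21']
```

Inside the loop it could be used as `set1, set2, set3 = pad_scores([span.text for span in scores1])`, followed by one `append` per list.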

In [5]:
bwf_lky = pd.DataFrame({'year':year,
                        'tournament':tournament_name,
                        'round':rounds,
                        'match_time': match_time,
                        'player1': player1,
                        'player2': player2,
                        'player1_set1': player1_set1,
                        'player2_set1': player2_set1,
                        'player1_set2': player1_set2,
                        'player2_set2': player2_set2,
                        'player1_set3': player1_set3,
                        'player2_set3': player2_set3,})

In [6]:
bwf_lky.sample(10)

| | year | tournament | round | match_time | player1 | player2 | player1_set1 | player2_set1 | player1_set2 | player2_set2 | player1_set3 | player2_set3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48 | 2019 | IDBI Federal Life Insurance Hyderabad Open 2019 | R16 | 0:47 | PARUPALLI Kashyap | LOH Kean Yew | 21 | 17 | 15 | 21 | 19.0 | 21.0 |
| 133 | 2017 | CELCOM AXIATA Malaysia International Series 2017 | Final | 0:33 | CHEAM June Wei | LOH Kean Yew | 19 | 21 | 14 | 21 | | |
| 8 | 2021 | SimInvest Indonesia Open 2021(New dates) | QF | 0:25 | LOH Kean Yew | Hans-Kristian Solberg VITTINGHUS | 21 | 9 | 21 | 4 | | |
| 177 | 2015 | SCG Badminton Asia Junior Championships 2015 | | 0:35 | LOH Kean Yew | Satheishtharan RAMACHANDRAN | 19 | 21 | 12 | 21 | | |
| 2 | 2021 | TotalEnergies BWF World Championships 2021 | R16 | 0:30 | Kantaphon WANGCHAROEN | LOH Kean Yew | 4 | 21 | 7 | 21 | | |
| 135 | 2017 | ROBOT Badminton Asia Mixed Team Championships ... | | 0:34 | Sameer VERMA | LOH Kean Yew | 21 | 9 | 21 | 16 | | |
| 173 | 2015 | Victor Indonesia International Challenge 2015 | R16 | 0:33 | Sony Dwi KUNCORO | LOH Kean Yew | 21 | 11 | 21 | 14 | | |
| 25 | 2021 | YONEX Dutch Open 2021 | Final | 0:36 | Lakshya SEN | LOH Kean Yew | 12 | 21 | 16 | 21 | | |
| 209 | 2014 | Ciputra Hanoi - Yonex Sunrise Vietnam Internat... | R64 | 1:03 | Febriyan IRVANNALDY | LOH Kean Yew | 17 | 21 | 21 | 17 | 21.0 | 13.0 |
| 131 | 2017 | CELCOM AXIATA Malaysia International Series 2017 | QF | 0:28 | LOH Kean Yew | Henrikho KHO WIBOWO | 21 | 18 | 21 | 14 | | |


In [7]:
bwf_lky.to_csv(r'C:\Users\Andy\Desktop\Learning\Dataquest\Project_WebScrape\bwf_lky.csv')