### Dependencies:
Uncomment the code below and run the cell to install all dependencies. Creating a new virtual environment first is recommended; this notebook was developed with conda and Python 3.11.


In [2]:
# !pip install pypyodbc
# !pip install polars
# !pip install bs4
# !pip install selenium


In [3]:
import pypyodbc as odbc

## Selenium
Selenium is a popular open-source framework used for automating web browsers. It provides a set of tools and libraries that allow developers to interact with web elements, simulate user actions, and perform automated testing of web applications.
One of the key features of Selenium is its ability to locate and interact with web elements on a page using various methods such as XPath, CSS selectors, and element attributes. This enables you to perform actions like clicking buttons, entering text, selecting dropdown options, and verifying the presence of specific elements.

In addition to automated testing, Selenium can be used for web scraping, data extraction, and web application monitoring. It can retrieve data from websites and repeat tasks at regular intervals to monitor the behavior and performance of web applications.
Its flexibility, cross-browser compatibility, and extensive community support make it a popular choice among developers and testers.

## Part 1: Get all Events
The first page of the UFC stats site lists the event names for all UFC events, including UFC Fight Nights.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import bs4

# PATH = r'C:\Program Files (x86)\chromedriver.exe'

driver = webdriver.Chrome()
driver.get('http://ufcstats.com/statistics/events/completed?page=all')

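delay = 5  # seconds to wait for the page header before giving up

try:
    # Wait until the page header is present, then parse the rendered HTML
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'b-head')))
    print("Page is ready!")
    soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
    # Each event name is an anchor carrying both of these classes
    page_links = soup.find_all('a', class_='b-link b-link_style_black')
    print("Continue to next cells")
except TimeoutException:
    print("Loading took too much time... Rerun the script.")
finally:
    # Close the browser whether or not the page loaded
    driver.quit()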



Page is ready!
Continue to next cells


In [4]:
# print(page_links)

In [5]:
# print(soup.prettify())

In [6]:
dates_source = soup.find_all('span', class_='b-statistics__date')
event_dates = [date.get_text(strip=True) for date in dates_source]
# Skip the first date so that the remaining dates pair up with page_links
event_dates = event_dates[1:]
event_names = [link.get_text(strip=True) for link in page_links]

In [8]:
# event_names

In [9]:
import polars as pl
event_names_to_links = {name: link['href'] for name, link in zip(event_names, page_links)}
# event_names_to_links

In [10]:
event_date_link = {'name' : event_names, 'date' : event_dates}

In [11]:
df = pl.DataFrame(event_date_link)


## Part 2: Clicking the events link and obtaining fights within the shows

In [25]:
# df.write_csv('event_dates.csv')

First, I navigate to an event page by clicking its link with Selenium's .click() method.

Clicking an event link gives access to every fight within that event.

In [16]:
# Wait for the events page to load before continuing.
# If it takes longer than `delay` seconds, a message is printed and the browser is closed,
# which avoids scraping a page that has not loaded properly.
driver = webdriver.Chrome()
driver.get('http://ufcstats.com/statistics/events/completed?page=all')
try:
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'b-head')))
    print("Page is ready!")
    # Click the first (most recent) event link by its visible text
    driver.find_element(By.LINK_TEXT, page_links[0].get_text(strip=True)).click()
    # Wait for the event title so the page source belongs to the event page
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'b-content__title-highlight')))
    # Get page source of new page
    soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
except TimeoutException:
    print("Page could not load... Rerun the cell.")
finally:
    driver.quit()

Page is ready!


#### Below is the source code of the UFC stats event page:

In [18]:
# print(soup.prettify())

To find the number of fights per event, I count the 'win' text elements in the table; there is exactly one per fight.
In rare cases a fight ends in a draw or a no contest ('nc'); each of these produces two text elements per fight, one for each fighter.

The number of fights is needed to slice the long list of elements gathered from the page into one chunk per fight.
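
The counting rule above can be sketched on a plain list of strings. This is a minimal illustration rather than the notebook's implementation: `count_fights` and the placeholder fighter names are hypothetical, and the sample list is far shorter than a real event page.

```python
def count_fights(texts):
    """Count fights from the outcome markers in a flat list of cell texts."""
    wins = sum(1 for text in texts if text == 'win')
    # Draws and no contests appear twice per fight, once for each fighter
    paired = sum(1 for text in texts if text in ('draw', 'nc'))
    return wins + paired // 2


# One decided fight followed by one draw (placeholder names and values)
sample = ['win', 'Fighter A', 'Fighter B', '5:00',
          'draw', 'draw', 'Fighter C', 'Fighter D', '5:00']
print(count_fights(sample))  # 2
```

Counting pairs with integer division keeps the result an integer, which avoids converting a float before using it as a loop bound.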

In [96]:
def find_event_name(soup):
    """Find the name of the event.
    Args:
        soup (bs4.BeautifulSoup): The soup object of the page.
    """
    event_name = soup.find('span', class_='b-content__title-highlight').get_text(strip=True)

    return event_name


def find_number_of_fights(soup):
    """Find the number of fights on a page.
    This is done by counting outcome markers: one 'win' per fight,
    or two 'draw' / 'nc' markers per fight.
    Args:
        soup (bs4.BeautifulSoup): The soup object of the page.
    """
    num_fights = 0
    all_elements = soup.find_all('p', class_='b-fight-details__table-text')

    for item in all_elements:
        text = item.get_text(strip=True)
        if text == 'win':
            num_fights += 1
        elif text in ('draw', 'nc'):
            # Each draw or no contest is listed twice, so count half per marker
            num_fights += 0.5

    return num_fights


def total_number_of_elements(soup):
    """Find the total number of elements on a page.
    Args:
        soup (bs4.BeautifulSoup): The soup object of the page.
    """
    all_elements = soup.find_all('p', class_='b-fight-details__table-text')

    return len(all_elements)


def find_fight_details(soup):
    """Split the table elements of an event page into one list per fight.
    Args:
        soup (bs4.BeautifulSoup): The soup object of the page.
    """
    all_elements = soup.find_all('p', class_='b-fight-details__table-text')

    # Draw and nc markers come in pairs; drop the second marker of every pair
    # so that each fight contributes the same number of elements
    elements = []
    for item in all_elements:
        text = item.get_text(strip=True)
        if text in ('draw', 'nc') and elements and elements[-1].get_text(strip=True) == text:
            continue
        elements.append(item)

    num_fights = int(find_number_of_fights(soup))
    if num_fights == 0:
        return []

    # Every fight now has the same number of elements, so slice in equal chunks
    per_fight = len(elements) // num_fights
    fight_details = []
    for i in range(num_fights):
        fight_details.append(elements[i * per_fight:(i + 1) * per_fight])

    return fight_details


def get_fight_details(fight_details):
    """Get the text of every element of each fight.
    Args:
        fight_details (list): The output of find_fight_details().
    """
    return [[item.get_text(strip=True) for item in fight] for fight in fight_details]
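
The slicing idea behind `find_fight_details` can be tried without a browser on a plain list of strings. The sketch below is an illustration under stated assumptions: `split_fights`, its `fields_per_fight` parameter, and the four-field sample are hypothetical, whereas a real event page has many more fields per fight.

```python
def split_fights(texts, fields_per_fight):
    """Collapse paired draw/nc markers, then cut the list into equal chunks."""
    collapsed = []
    for text in texts:
        # Skip the second marker of a draw/nc pair
        if text in ('draw', 'nc') and collapsed and collapsed[-1] == text:
            continue
        collapsed.append(text)
    return [collapsed[i:i + fields_per_fight]
            for i in range(0, len(collapsed), fields_per_fight)]


# One decided fight followed by one draw (placeholder names and values)
sample = ['win', 'A', 'B', '5:00', 'draw', 'draw', 'C', 'D', '5:00']
fights = split_fights(sample, fields_per_fight=4)
print(fights)  # [['win', 'A', 'B', '5:00'], ['draw', 'C', 'D', '5:00']]
```

Collapsing every pair before slicing means the chunk size stays constant even when an event has more than one draw or no contest.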

## Example using the functions
Below is a step-by-step use of the functions above, showing how the data is extracted.

In [97]:
find_event_name(soup)

'UFC 286: Edwards vs. Usman 3'

In [98]:
find_number_of_fights(soup)

15.0

In [99]:
total_number_of_elements(soup)

241

In [105]:
fight_details = find_fight_details(soup)
extracted_fight_details = get_fight_details(fight_details)

In [106]:
extracted_fight_details

[['win',
  'Leon Edwards',
  'Kamaru Usman',
  '0',
  '0',
  '120',
  '87',
  '0',
  '4',
  '0',
  '0',
  'Welterweight',
  'M-DEC',
  '',
  '5',
  '5:00'],
 ['win',
  'Justin Gaethje',
  'Rafael Fiziev',
  '0',
  '0',
  '103',
  '97',
  '1',
  '0',
  '0',
  '0',
  'Lightweight',
  'M-DEC',
  '',
  '3',
  '5:00'],
 ['win',
  'Gunnar Nelson',
  'Bryan Barberena',
  '0',
  '0',
  '10',
  '7',
  '1',
  '0',
  '1',
  '0',
  'Welterweight',
  'SUB',
  'Armbar',
  '1',
  '4:51'],
 ['win',
  'Jennifer Maia',
  "Casey O'Neill",
  '0',
  '0',
  '145',
  '137',
  '0',
  '0',
  '0',
  '0',
  "Women's Flyweight",
  'U-DEC',
  '',
  '3',
  '5:00'],
 ['win',
  'Marvin Vettori',
  'Roman Dolidze',
  '0',
  '0',
  '106',
  '71',
  '0',
  '0',
  '0',
  '0',
  'Middleweight',
  'U-DEC',
  '',
  '3',
  '5:00'],
 ['win',
  'Jack Shore',
  'Makwan Amirkhani',
  '0',
  '0',
  '28',
  '10',
  '1',
  '1',
  '1',
  '0',
  'Featherweight',
  'SUB',
  'Rear Naked Choke',
  '2',
  '4:27'],
 ['win',
  'Chris Dunca

In [32]:
def get_data(details):
    """Format the details of each fight into one record per fighter.
    Args:
        details (list[list[str]]): The output of get_fight_details().
    """
    fighter_1 = []
    fighter_2 = []

    for detail in details:
        # The first element is the outcome marker: 'win', 'draw' or 'nc'
        outcome = detail[0]

        # Index from the end of the list: a draw or nc can leave two outcome
        # markers at the start instead of one, which would shift positive indices
        fighter_1.append({
            'matchup' : f"{detail[-15]} vs {detail[-14]}",
            'fighter' : detail[-15],
            'knockdowns' : detail[-13],
            'successful_sig_strikes' : detail[-11],
            'successful_takedowns' : detail[-9],
            'submission_attempts' : detail[-7],
            'weightclass' : detail[-5],
            'method' : detail[-4],
            'method_notes' : detail[-3],
            'round' : detail[-2],
            'time' : detail[-1],
            # The first-listed fighter is the winner unless the fight was a draw or nc
            'w/l/d' : 'win' if outcome == 'win' else outcome
        })

        fighter_2.append({
            'matchup' : f"{detail[-15]} vs {detail[-14]}",
            'fighter' : detail[-14],
            'knockdowns' : detail[-12],
            'successful_sig_strikes' : detail[-10],
            'successful_takedowns' : detail[-8],
            'submission_attempts' : detail[-6],
            'weightclass' : detail[-5],
            'method' : detail[-4],
            'method_notes' : detail[-3],
            'round' : detail[-2],
            'time' : detail[-1],
            'w/l/d' : 'loss' if outcome == 'win' else outcome
        })

    return fighter_1, fighter_2

# Formats the data into a dictionary
data = get_data(extracted_fight_details)

In [33]:
data[0] # Fighter 1
data[1] # Fighter 2
data[0][0] # First fight of Fighter 1..etc
data[1][0] # First fight of Fighter 2..etc

{'matchup': 'Sean Brady vs Gilbert Burns',
 'fighter': 'Gilbert Burns',
 'knockdowns': '0',
 'successful_sig_strikes': '47',
 'successful_takedowns': '1',
 'submission_attempts': '0',
 'weightclass': 'Welterweight',
 'method': 'U-DEC',
 'method_notes': '',
 'round': '5',
 'time': '5:00',
 'w/l/d': 'win'}
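
The negative indexing used in `get_data` can be illustrated on two short lists. This is a small sketch with made-up data: `names` is a hypothetical helper, and the lists hold fewer fields than a real fight row. It shows why counting from the end of the list gives the same fields whether a row starts with one outcome marker or two.

```python
# A decided fight starts with one marker; a draw may start with two
one_marker = ['win', 'Fighter A', 'Fighter B', '0', '0', '3', '5:00']
two_markers = ['draw', 'draw', 'Fighter C', 'Fighter D', '0', '0', '3', '5:00']


def names(detail):
    """Return both fighter names, counted from the end of the list."""
    return detail[-6], detail[-5]


print(names(one_marker))   # ('Fighter A', 'Fighter B')
print(names(two_markers))  # ('Fighter C', 'Fighter D')

# Positive indices would disagree between the two layouts
print(one_marker[1], two_markers[1])  # Fighter A draw
```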

## Testing on Draws/NCs

In [132]:
draw_url = 'http://ufcstats.com/event-details/e4bb7e483c4ad318'

driver = webdriver.Chrome()
driver.get(draw_url)

try:
    myElem = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'b-head')))
    print("Page is ready!")
    soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
except TimeoutException:
    print("Page could not load... Rerun the cell.")
    driver.quit()
    
else:
    draw_fight_details = find_fight_details(soup)
    # draw_extracted_fight_details = get_fight_details(draw_fight_details)

Page is ready!


In [133]:
total_number_of_elements(soup)

241

In [134]:
draw_fight_details = find_fight_details(soup)
test_details = get_fight_details(draw_fight_details)
test_data = get_data(test_details)

In [141]:
test_data[0][-4] # Fighter 1

{'matchup': 'Jake Hadley vs Malcolm Gordon',
 'fighter': 'Jake Hadley',
 'knockdowns': '1',
 'successful_sig_strikes': '10',
 'successful_takedowns': '0',
 'submission_attempts': '0',
 'weightclass': 'Flyweight',
 'method': 'KO/TKO',
 'method_notes': 'Punch',
 'round': '1',
 'time': '1:01',
 'w/l/d': 'win'}

#### All required functions have been made for extracting the data.

Now all that is left for this part is to repeat the process for multiple events by clicking through their links.
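
One way to make such a loop less dependent on the browser's state is to visit each event URL directly instead of clicking a link and going back. This is a different technique from the click-based approach in this notebook. The sketch below assumes a name-to-URL mapping like `event_names_to_links` from Part 1; `scrape_events`, `fetch_html`, `parse`, and the example.com URLs are hypothetical, and the fetch function is a stand-in so the loop can be tried without a browser.

```python
def scrape_events(event_links, fetch_html, parse):
    """Visit each event URL and collect parsed results, recording failures."""
    archive, failed = [], []
    for name, url in event_links.items():
        try:
            archive.append([name, parse(fetch_html(url))])
        except Exception as error:
            # Keep going if one page fails, and remember which one it was
            failed.append((name, str(error)))
    return archive, failed


def fake_fetch(url):
    """Stand-in for a real page load; fails for one URL on purpose."""
    if url.endswith('bad'):
        raise RuntimeError('page did not load')
    return '<html>' + url + '</html>'


links = {'Event A': 'http://example.com/a', 'Event B': 'http://example.com/bad'}
archive, failed = scrape_events(links, fake_fetch, str.upper)
print(archive)  # [['Event A', '<HTML>HTTP://EXAMPLE.COM/A</HTML>']]
print(failed)   # [('Event B', 'page did not load')]
```

Returning the failures alongside the results makes it possible to retry only the events that did not load.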

In [107]:
def get_data_from_event(number_of_events, delay=5):
    """Scrape the fight overviews of the most recent events.

    Args:
        number_of_events (int): number of events to scrape, starting from the most recent
        delay (int): seconds to wait for each page to load
    """
    archive = []
    driver = webdriver.Chrome()
    driver.get('http://ufcstats.com/statistics/events/completed?page=all')
    try:
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'b-head')))
        print("Page is ready!")
        print("Executing scraping of data...")
        for number in range(number_of_events):
            link_text = page_links[number].get_text(strip=True)

            # Wait until the event link is clickable (also after navigating back), then open it
            WebDriverWait(driver, delay).until(EC.element_to_be_clickable((By.LINK_TEXT, link_text))).click()
            # Wait for the event title so the page source belongs to the event page
            WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'b-content__title-highlight')))
            soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')

            fight_details = find_fight_details(soup)
            extracted_fight_details = get_fight_details(fight_details)

            data = get_data(extracted_fight_details)
            archive.append([event_names[number], data])

            # Return to the list of events
            driver.back()

        print("Data successfully scraped.")
    except TimeoutException:
        print("Page could not load... Rerun the cell.")
    finally:
        driver.quit()
    return archive

# Get data from the most recent 5 events
five_recent_events = get_data_from_event(5)
    
    
    

NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=128.0.6613.121)


In [37]:
def final_format(data):
    """Add the event name to every fighter record.
    Args:
        data (list): The data from get_data_from_event().
    """
    for event_name, fights in data:
        for fighter_records in fights:
            for record in fighter_records:
                record['event'] = event_name

    return data

test = final_format(five_recent_events)

def overview_df(data):
    """Build one DataFrame for the first-listed fighters and one for their opponents.
    Args:
        data (list): The data from get_data_from_event() or final_format().
    """
    winners = []
    losers = []

    # Each entry is [event_name, (fighter_1_records, fighter_2_records)]
    for _, (fighter_1, fighter_2) in final_format(data):
        winners.extend(fighter_1)
        losers.extend(fighter_2)

    W_df = pl.DataFrame(winners)
    L_df = pl.DataFrame(losers)

    return W_df, L_df

winners, losers = overview_df(test)

final = pl.concat([winners, losers])


In [38]:
final

matchup,fighter,knockdowns,successful_sig_strikes,successful_takedowns,submission_attempts,weightclass,method,method_notes,round,time,w/l/d,event
str,str,str,str,str,str,str,str,str,str,str,str,str
"""Sean Brady vs Gilbert Burns""","""Sean Brady""","""0""","""130""","""7""","""0""","""Welterweight""","""U-DEC""","""""","""5""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Natalia Silva vs Jessica Andra…","""Natalia Silva""","""0""","""117""","""0""","""0""","""Women's Flyweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Steve Garcia vs Kyle Nelson""","""Steve Garcia""","""0""","""22""","""0""","""0""","""Featherweight""","""KO/TKO""","""Elbows""","""1""","""3:59""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Cody Durden vs Matt Schnell""","""Cody Durden""","""0""","""34""","""0""","""1""","""Bantamweight""","""SUB""","""Guillotine Choke""","""2""","""0:29""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Yanal Ashmouz vs Trevor Peek""","""Yanal Ashmouz""","""0""","""42""","""9""","""0""","""Lightweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…"
…,…,…,…,…,…,…,…,…,…,…,…,…
"""Shamil Gaziev vs Don'Tale Maye…","""Don'Tale Mayes""","""0""","""27""","""0""","""0""","""Heavyweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…"
"""Guram Kutateladze vs Jordan Vu…","""Jordan Vucenic""","""1""","""34""","""0""","""0""","""Lightweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…"
"""Sam Hughes vs Viktoriia Dudako…","""Viktoriia Dudakova""","""0""","""61""","""3""","""0""","""Women's Strawweight""","""S-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…"
"""Jai Herbert vs Rolando Bedoya""","""Rolando Bedoya""","""0""","""67""","""0""","""0""","""Lightweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…"


In [39]:
alt = winners.join(losers, on='matchup', how='inner', suffix='_2')
alt


matchup,fighter,knockdowns,successful_sig_strikes,successful_takedowns,submission_attempts,weightclass,method,method_notes,round,time,w/l/d,event,fighter_2,knockdowns_2,successful_sig_strikes_2,successful_takedowns_2,submission_attempts_2,weightclass_2,method_2,method_notes_2,round_2,time_2,w/l/d_2,event_2
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""Sean Brady vs Gilbert Burns""","""Sean Brady""","""0""","""130""","""7""","""0""","""Welterweight""","""U-DEC""","""""","""5""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…","""Gilbert Burns""","""0""","""47""","""1""","""0""","""Welterweight""","""U-DEC""","""""","""5""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Natalia Silva vs Jessica Andra…","""Natalia Silva""","""0""","""117""","""0""","""0""","""Women's Flyweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…","""Jessica Andrade""","""0""","""50""","""0""","""0""","""Women's Flyweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Steve Garcia vs Kyle Nelson""","""Steve Garcia""","""0""","""22""","""0""","""0""","""Featherweight""","""KO/TKO""","""Elbows""","""1""","""3:59""","""win""","""UFC Fight Night: Burns vs. Bra…","""Kyle Nelson""","""0""","""1""","""0""","""0""","""Featherweight""","""KO/TKO""","""Elbows""","""1""","""3:59""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Cody Durden vs Matt Schnell""","""Cody Durden""","""0""","""34""","""0""","""1""","""Bantamweight""","""SUB""","""Guillotine Choke""","""2""","""0:29""","""win""","""UFC Fight Night: Burns vs. Bra…","""Matt Schnell""","""0""","""40""","""0""","""0""","""Bantamweight""","""SUB""","""Guillotine Choke""","""2""","""0:29""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Yanal Ashmouz vs Trevor Peek""","""Yanal Ashmouz""","""0""","""42""","""9""","""0""","""Lightweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…","""Trevor Peek""","""0""","""55""","""0""","""0""","""Lightweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…"
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Shamil Gaziev vs Don'Tale Maye…","""Shamil Gaziev""","""0""","""31""","""2""","""0""","""Heavyweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…","""Don'Tale Mayes""","""0""","""27""","""0""","""0""","""Heavyweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…"
"""Guram Kutateladze vs Jordan Vu…","""Guram Kutateladze""","""0""","""30""","""1""","""0""","""Lightweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…","""Jordan Vucenic""","""1""","""34""","""0""","""0""","""Lightweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…"
"""Sam Hughes vs Viktoriia Dudako…","""Sam Hughes""","""0""","""97""","""0""","""0""","""Women's Strawweight""","""S-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…","""Viktoriia Dudakova""","""0""","""61""","""3""","""0""","""Women's Strawweight""","""S-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…"
"""Jai Herbert vs Rolando Bedoya""","""Jai Herbert""","""1""","""82""","""2""","""0""","""Lightweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…","""Rolando Bedoya""","""0""","""67""","""0""","""0""","""Lightweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Sandhagen vs.…"


In [40]:
alt_selected = alt.select(['event', 'weightclass','matchup', 'fighter'
                           ,'knockdowns', 'successful_sig_strikes', 
                           'successful_takedowns', 'submission_attempts', 
                           'w/l/d', 'fighter_2', 'knockdowns_2',
                           'successful_sig_strikes_2','successful_takedowns_2',
                           'submission_attempts_2', 'w/l/d_2', 'method', 'method_notes', 
                           'round', 'time'])

alt_selected_casted = alt.select(pl.col('event').cast(str), pl.col('weightclass').cast(str),
                                 pl.col('matchup').cast(str), pl.col('fighter').cast(str), 
                                 pl.col('knockdowns').cast(int), pl.col('successful_sig_strikes').cast(int),
                                 pl.col('successful_takedowns').cast(int), pl.col('submission_attempts').cast(int),
                                 pl.col('w/l/d').cast(str), pl.col('fighter_2').cast(str), pl.col('knockdowns_2').cast(int),
                                 pl.col('successful_sig_strikes_2').cast(int), pl.col('successful_takedowns_2').cast(int),
                                 pl.col('submission_attempts_2').cast(int), pl.col('w/l/d_2').cast(str),
                                 pl.col('method').cast(str), pl.col('method_notes').cast(str), pl.col('round').cast(int),
                                 pl.col('time').cast(str))
alt_selected_casted

event,weightclass,matchup,fighter,knockdowns,successful_sig_strikes,successful_takedowns,submission_attempts,w/l/d,fighter_2,knockdowns_2,successful_sig_strikes_2,successful_takedowns_2,submission_attempts_2,w/l/d_2,method,method_notes,round,time
str,str,str,str,i64,i64,i64,i64,str,str,i64,i64,i64,i64,str,str,str,i64,str
"""UFC Fight Night: Burns vs. Bra…","""Welterweight""","""Sean Brady vs Gilbert Burns""","""Sean Brady""",0,130,7,0,"""win""","""Gilbert Burns""",0,47,1,0,"""win""","""U-DEC""","""""",5,"""5:00"""
"""UFC Fight Night: Burns vs. Bra…","""Women's Flyweight""","""Natalia Silva vs Jessica Andra…","""Natalia Silva""",0,117,0,0,"""win""","""Jessica Andrade""",0,50,0,0,"""win""","""U-DEC""","""""",3,"""5:00"""
"""UFC Fight Night: Burns vs. Bra…","""Featherweight""","""Steve Garcia vs Kyle Nelson""","""Steve Garcia""",0,22,0,0,"""win""","""Kyle Nelson""",0,1,0,0,"""win""","""KO/TKO""","""Elbows""",1,"""3:59"""
"""UFC Fight Night: Burns vs. Bra…","""Bantamweight""","""Cody Durden vs Matt Schnell""","""Cody Durden""",0,34,0,1,"""win""","""Matt Schnell""",0,40,0,0,"""win""","""SUB""","""Guillotine Choke""",2,"""0:29"""
"""UFC Fight Night: Burns vs. Bra…","""Lightweight""","""Yanal Ashmouz vs Trevor Peek""","""Yanal Ashmouz""",0,42,9,0,"""win""","""Trevor Peek""",0,55,0,0,"""win""","""U-DEC""","""""",3,"""5:00"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""UFC Fight Night: Sandhagen vs.…","""Heavyweight""","""Shamil Gaziev vs Don'Tale Maye…","""Shamil Gaziev""",0,31,2,0,"""win""","""Don'Tale Mayes""",0,27,0,0,"""win""","""U-DEC""","""""",3,"""5:00"""
"""UFC Fight Night: Sandhagen vs.…","""Lightweight""","""Guram Kutateladze vs Jordan Vu…","""Guram Kutateladze""",0,30,1,0,"""win""","""Jordan Vucenic""",1,34,0,0,"""win""","""U-DEC""","""""",3,"""5:00"""
"""UFC Fight Night: Sandhagen vs.…","""Women's Strawweight""","""Sam Hughes vs Viktoriia Dudako…","""Sam Hughes""",0,97,0,0,"""win""","""Viktoriia Dudakova""",0,61,3,0,"""win""","""S-DEC""","""""",3,"""5:00"""
"""UFC Fight Night: Sandhagen vs.…","""Lightweight""","""Jai Herbert vs Rolando Bedoya""","""Jai Herbert""",1,82,2,0,"""win""","""Rolando Bedoya""",0,67,0,0,"""win""","""U-DEC""","""""",3,"""5:00"""


## Scraping all events
Currently, this section writes a CSV file at the end. A future idea is to compare the scraped data against a SQL database and insert only what is new.

Keep in mind that this covers only the fight overviews, not the per-round statistics.
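
The idea of updating a database only with new rows could look roughly like the sketch below. It is an assumption-laden illustration: it uses Python's built-in `sqlite3` as a stand-in for whichever SQL database is eventually chosen, and the `fights` table, its columns, and `add_new_rows` are hypothetical. A uniqueness constraint together with `INSERT OR IGNORE` lets repeated scrapes skip rows that are already stored.

```python
import sqlite3

# In-memory database used only for this illustration
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE fights (
        event TEXT, matchup TEXT, fighter TEXT, outcome TEXT,
        UNIQUE (event, matchup, fighter)
    )
""")


def row_count(conn):
    """Return the number of rows currently stored."""
    return conn.execute("SELECT COUNT(*) FROM fights").fetchone()[0]


def add_new_rows(conn, rows):
    """Insert rows, skipping ones already stored, and return how many were new."""
    before = row_count(conn)
    conn.executemany("INSERT OR IGNORE INTO fights VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return row_count(conn) - before


first_scrape = [('Event A', 'X vs Y', 'X', 'win'),
                ('Event A', 'X vs Y', 'Y', 'loss')]
second_scrape = first_scrape + [('Event B', 'X vs Z', 'X', 'win')]

first_added = add_new_rows(conn, first_scrape)
second_added = add_new_rows(conn, second_scrape)
print(first_added, second_added)  # 2 1
```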

In [108]:
num_of_events = len(page_links)

In [116]:
all_events_overview = get_data_from_event(num_of_events)

Page is ready!
Executing scraping of data...
Data successfully scraped.


In [None]:
# yaml files?

In [117]:
winners, losers = overview_df(all_events_overview)

In [118]:
final = pl.concat([winners, losers])

In [119]:
final.write_csv('overview_data_concat.csv')

In [120]:
final

matchup,fighter,knockdowns,successful_sig_strikes,successful_takedowns,submission_attempts,weightclass,method,method_notes,round,time,w/l/d,event
str,str,str,str,str,str,str,str,str,str,str,str,str
"""Sean Brady vs Gilbert Burns""","""Sean Brady""","""0""","""130""","""7""","""0""","""Welterweight""","""U-DEC""","""""","""5""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Natalia Silva vs Jessica Andra…","""Natalia Silva""","""0""","""117""","""0""","""0""","""Women's Flyweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Steve Garcia vs Kyle Nelson""","""Steve Garcia""","""0""","""22""","""0""","""0""","""Featherweight""","""KO/TKO""","""Elbows""","""1""","""3:59""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Cody Durden vs Matt Schnell""","""Cody Durden""","""0""","""34""","""0""","""1""","""Bantamweight""","""SUB""","""Guillotine Choke""","""2""","""0:29""","""win""","""UFC Fight Night: Burns vs. Bra…"
"""Yanal Ashmouz vs Trevor Peek""","""Yanal Ashmouz""","""0""","""42""","""9""","""0""","""Lightweight""","""U-DEC""","""""","""3""","""5:00""","""win""","""UFC Fight Night: Burns vs. Bra…"
…,…,…,…,…,…,…,…,…,…,…,…,…
"""Orlando Wiet vs Robert Lucarel…","""Robert Lucarelli""","""0""","""2""","""1""","""1""","""Open Weight""","""KO/TKO""","""""","""1""","""2:50""","""win""","""UFC 2: No Way Out"""
"""Frank Hamaker vs Thaddeus Lust…","""Thaddeus Luster""","""0""","""0""","""0""","""0""","""Open Weight""","""SUB""","""Keylock""","""1""","""4:52""","""win""","""UFC 2: No Way Out"""
"""Johnny Rhodes vs David Levicki""","""David Levicki""","""0""","""4""","""0""","""0""","""Open Weight""","""KO/TKO""","""Punches""","""1""","""12:13""","""win""","""UFC 2: No Way Out"""
"""Patrick Smith vs Ray Wizard""","""Ray Wizard""","""0""","""1""","""0""","""0""","""Open Weight""","""SUB""","""Guillotine Choke""","""1""","""0:58""","""win""","""UFC 2: No Way Out"""


In [124]:
alt = winners.join(losers, on='matchup', how='inner', suffix='_2')



In [131]:
# Count how many first-listed fighters have each outcome
alt['w/l/d'].value_counts()

In [122]:

alt_selected = alt.select(['event', 'weightclass','matchup', 'fighter'
                           ,'knockdowns', 'successful_sig_strikes', 
                           'successful_takedowns', 'submission_attempts', 
                           'w/l/d', 'fighter_2', 'knockdowns_2',
                           'successful_sig_strikes_2','successful_takedowns_2',
                           'submission_attempts_2', 'w/l/d_2', 'method', 'method_notes', 
                           'round', 'time'])

alt_selected

# alt_selected_casted = alt.select(pl.col('event').cast(str), pl.col('weightclass').cast(str),
#                                  pl.col('matchup').cast(str), pl.col('fighter').cast(str), 
#                                  pl.col('knockdowns').cast(int), pl.col('successful_sig_strikes').cast(int),
#                                  pl.col('successful_takedowns').cast(int), pl.col('submission_attempts').cast(int),
#                                  pl.col('w/l/d').cast(str), pl.col('fighter_2').cast(str), pl.col('knockdowns_2').cast(int),
#                                  pl.col('successful_sig_strikes_2').cast(int), pl.col('successful_takedowns_2').cast(int),
#                                  pl.col('submission_attempts_2').cast(int), pl.col('w/l/d_2').cast(str),
#                                  pl.col('method').cast(str), pl.col('method_notes').cast(str), pl.col('round').cast(int),
#                                  pl.col('time').cast(str))
# alt_selected_casted

event,weightclass,matchup,fighter,knockdowns,successful_sig_strikes,successful_takedowns,submission_attempts,w/l/d,fighter_2,knockdowns_2,successful_sig_strikes_2,successful_takedowns_2,submission_attempts_2,w/l/d_2,method,method_notes,round,time
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""UFC Fight Night: Burns vs. Bra…","""Welterweight""","""Sean Brady vs Gilbert Burns""","""Sean Brady""","""0""","""130""","""7""","""0""","""win""","""Gilbert Burns""","""0""","""47""","""1""","""0""","""win""","""U-DEC""","""""","""5""","""5:00"""
"""UFC Fight Night: Burns vs. Bra…","""Women's Flyweight""","""Natalia Silva vs Jessica Andra…","""Natalia Silva""","""0""","""117""","""0""","""0""","""win""","""Jessica Andrade""","""0""","""50""","""0""","""0""","""win""","""U-DEC""","""""","""3""","""5:00"""
"""UFC Fight Night: Burns vs. Bra…","""Featherweight""","""Steve Garcia vs Kyle Nelson""","""Steve Garcia""","""0""","""22""","""0""","""0""","""win""","""Kyle Nelson""","""0""","""1""","""0""","""0""","""win""","""KO/TKO""","""Elbows""","""1""","""3:59"""
"""UFC Fight Night: Burns vs. Bra…","""Bantamweight""","""Cody Durden vs Matt Schnell""","""Cody Durden""","""0""","""34""","""0""","""1""","""win""","""Matt Schnell""","""0""","""40""","""0""","""0""","""win""","""SUB""","""Guillotine Choke""","""2""","""0:29"""
"""UFC Fight Night: Burns vs. Bra…","""Lightweight""","""Yanal Ashmouz vs Trevor Peek""","""Yanal Ashmouz""","""0""","""42""","""9""","""0""","""win""","""Trevor Peek""","""0""","""55""","""0""","""0""","""win""","""U-DEC""","""""","""3""","""5:00"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""UFC 2: No Way Out""","""Open Weight""","""Orlando Wiet vs Robert Lucarel…","""Orlando Wiet""","""0""","""8""","""0""","""0""","""win""","""Robert Lucarelli""","""0""","""2""","""1""","""1""","""win""","""KO/TKO""","""""","""1""","""2:50"""
"""UFC 2: No Way Out""","""Open Weight""","""Frank Hamaker vs Thaddeus Lust…","""Frank Hamaker""","""0""","""2""","""1""","""3""","""win""","""Thaddeus Luster""","""0""","""0""","""0""","""0""","""win""","""SUB""","""Keylock""","""1""","""4:52"""
"""UFC 2: No Way Out""","""Open Weight""","""Johnny Rhodes vs David Levicki""","""Johnny Rhodes""","""0""","""11""","""1""","""0""","""win""","""David Levicki""","""0""","""4""","""0""","""0""","""win""","""KO/TKO""","""Punches""","""1""","""12:13"""
"""UFC 2: No Way Out""","""Open Weight""","""Patrick Smith vs Ray Wizard""","""Patrick Smith""","""0""","""1""","""0""","""1""","""win""","""Ray Wizard""","""0""","""1""","""0""","""0""","""win""","""SUB""","""Guillotine Choke""","""1""","""0:58"""


In [123]:
alt_selected.write_csv('overview_data_joined.csv')

## Part 3: Individual Fighter Profiles


## Polars - a more efficient alternative to the widely used pandas
Polars works much like pandas, so most general procedures carry over, which makes it easy to pick up.

It is used throughout to build DataFrames for fast processing and easy access to newly obtained data.

In [102]:
import polars as pl

events_dates = pl.DataFrame({'Date': event_dates , 'Event': event_names})

print(events_dates)

shape: (703, 2)
┌────────────────────┬─────────────────────────────────┐
│ Date               ┆ Event                           │
│ ---                ┆ ---                             │
│ str                ┆ str                             │
╞════════════════════╪═════════════════════════════════╡
│ September 07, 2024 ┆ UFC Fight Night: Burns vs. Bra… │
│ August 24, 2024    ┆ UFC Fight Night: Cannonier vs.… │
│ August 17, 2024    ┆ UFC 305: Du Plessis vs. Adesan… │
│ August 10, 2024    ┆ UFC Fight Night: Tybura vs. Sp… │
│ August 03, 2024    ┆ UFC Fight Night: Sandhagen vs.… │
│ …                  ┆ …                               │
│ July 14, 1995      ┆ UFC 6: Clash of the Titans      │
│ April 07, 1995     ┆ UFC 5: The Return of the Beast  │
│ December 16, 1994  ┆ UFC 4: Revenge of the Warriors  │
│ September 09, 1994 ┆ UFC 3: The American Dream       │
│ March 11, 1994     ┆ UFC 2: No Way Out               │
└────────────────────┴─────────────────────────────────┘


In [103]:
events_dates.write_csv('events_dates.csv')

The above completes the first table: events data.
The next step is to follow the link for each of these events and collect data on every fight. This will include:
- the two fighters
- the winner
- the method of victory
- various other fight stats

After this, a profile should be gathered for every fighter. It may be more efficient to store only data from 2013 or more recent.
Keeping UFC fight statistics all the way back to UFC 1 would be impractical when trying to predict upcoming fights.

However, for the sake of practising SQL and learning to structure a database, it may be worth collecting all of the data.
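
If only recent events were kept, the cutoff could be applied roughly as in the sketch below. It uses only the standard library, the date format is assumed from the `Date` column printed above, and `parse_event_date`, `events_since` and the 2013 cutoff value are hypothetical names chosen for illustration.

```python
from datetime import date, datetime

# Sample rows taken from the events table above: (Date, Event).
events = [
    ("September 07, 2024", "UFC Fight Night: Burns vs. Brady"),
    ("July 14, 1995", "UFC 6: Clash of the Titans"),
    ("March 11, 1994", "UFC 2: No Way Out"),
]


def parse_event_date(text: str) -> date:
    """Parse a date such as 'September 07, 2024' (format assumed from the table)."""
    return datetime.strptime(text, "%B %d, %Y").date()


def events_since(rows, cutoff: date):
    """Keep only the rows whose event date falls on or after the cutoff."""
    return [(d, name) for d, name in rows if parse_event_date(d) >= cutoff]


CUTOFF = date(2013, 1, 1)  # hypothetical cutoff, matching the 2013 idea above
recent = events_since(events, CUTOFF)
print(recent)
```

Parsing the strings into real dates first avoids comparing month names alphabetically, which would give the wrong order.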

## Connecting to SQL:
Below, I connect to the database, which runs on my desktop PC.

Feel free to set this up yourself. I am using Microsoft SQL Server Management Studio (SSMS).

In [53]:
import pypyodbc as odbc
DRIVER_NAME = 'SQL Server'

# Laptop Server:
# SERVER_NAME = r'LAPTOP-79UCG6D3\SQLEXPRESS'

# Desktop Server (raw string so the backslash is not treated as an escape sequence):
SERVER_NAME = r'DESKTOP-Q3IFECL\SQLEXPRESS01'

DATABASE_NAME = 'UFC-STATS'

connection_string = f"""
    DRIVER={{{DRIVER_NAME}}};
    SERVER={SERVER_NAME};
    DATABASE={DATABASE_NAME};
    Trusted_Connection=yes;

"""

con = odbc.connect(connection_string)

print(con)

<pypyodbc.Connection object at 0x0000021A9C960610>
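
As a small aside, the connection string can also be assembled by a helper, which makes it easier to switch between the laptop and desktop servers. This is only a sketch: `build_connection_string` is a hypothetical name, and it produces a single-line string with the same four keys used above.

```python
def build_connection_string(driver: str, server: str, database: str) -> str:
    """Assemble a trusted-connection ODBC string; the driver name is wrapped in braces."""
    parts = [
        f"DRIVER={{{driver}}}",  # triple braces render as literal {driver name}
        f"SERVER={server}",
        f"DATABASE={database}",
        "Trusted_Connection=yes",
    ]
    return ";".join(parts) + ";"


# Raw string so the backslash in the server name is kept as-is.
conn_str = build_connection_string(
    "SQL Server", r"DESKTOP-Q3IFECL\SQLEXPRESS01", "UFC-STATS"
)
print(conn_str)
```

The resulting string can be passed to `odbc.connect` in the same way as `connection_string` above.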


Below is the SQL code, run from Python, that migrates the data to the SQL server. As mentioned before, this uses SQL Server Management Studio (SSMS).
The code begins by creating a table called UFC_Events, which lists all UFC events detailed on the UFC stats page online.

In [122]:
# transfer data to SQL Server
# Convert the DataFrame to a CSV file
# df.write_csv('events.csv')
cursor = con.cursor()
cursor.execute('''
    CREATE TABLE UFC_Events (
        Date VARCHAR(255),
        Event VARCHAR(255)
    )
''')
cursor.commit()

cursor.execute('''
    BULK INSERT UFC_Events
    FROM 'C:\\Users\\User\\PycharmProjects\\UFCDatabase\\DataCollection\\data\\events_dates.csv'
    WITH (
        FIELDTERMINATOR = '",',
        ROWTERMINATOR = '\n',
        FIRSTROW = 2
    )
''')
cursor.commit()
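
The unusual `FIELDTERMINATOR = '",'` is easier to follow with the raw CSV line in view. The sketch below uses Python's standard `csv` module as a stand-in for the Polars writer, on the assumption that both quote only the fields that contain the delimiter. It shows that a date such as `September 07, 2024` is written inside quotes, so splitting on `",` separates the two columns but leaves a leading quote on the date.

```python
import csv
import io

# Write one header row and one data row the way a typical CSV writer would.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Date", "Event"])
writer.writerow(["September 07, 2024", "UFC Fight Night: Burns vs. Brady"])

data_line = buffer.getvalue().splitlines()[1]
print(data_line)

# Splitting on '",' mimics the field terminator used in the BULK INSERT above.
date_field, event_field = data_line.split('",', 1)
print(repr(date_field), repr(event_field))

# The leading quote is still attached to the date and needs stripping afterwards.
clean_date = date_field.lstrip('"')
print(clean_date)
```

Under that assumption, the `Date` values loaded into `UFC_Events` would each start with a stray `"`, which may be worth removing in a later cleaning step.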

In [57]:
# Now to BULK INSERT the overview_data_concat.csv file (final DataFrame)
cursor = con.cursor()
# Creating the table in SQL Server

# matchup,fighter,knockdowns,successful_sig_strikes,successful_takedowns,submission_attempts,weightclass,method,method_notes,round,time,w/l/d,event
cursor.execute('''
               CREATE TABLE UFC_Fights_Overview (
                     matchup VARCHAR(255),
                     fighter VARCHAR(255),
                     knockdowns VARCHAR(255),
                     successful_sig_strikes VARCHAR(255),
                     successful_takedowns VARCHAR(255),
                     submission_attempts VARCHAR(255),
                     weightclass VARCHAR(255),
                     method VARCHAR(255),
                     method_notes VARCHAR(255),
                     round VARCHAR(255),
                     time VARCHAR(255),
                     w_l_d VARCHAR(255),
                     event VARCHAR(255)
    )
''')

# cursor.execute('''
#     CREATE TABLE UFC_Fights_Overview (
#         matchup VARCHAR(255),
#         fighter VARCHAR(255),
#         knockdowns INT,
#         successful_sig_strikes INT,
#         successful_takedowns INT,
#         submission_attempts INT,
#         weightclass VARCHAR(255),
#         method VARCHAR(255),
#         method_notes VARCHAR(255),
#         round INT,
#         time VARCHAR(255),
#         w_l_d VARCHAR(255),
#         event VARCHAR(255)
#     )
# ''')
# cursor.commit()

# BULK INSERT the overview_data_concat.csv file
cursor.execute('''
    BULK INSERT UFC_Fights_Overview
    FROM 'C:\\Users\\User\\PycharmProjects\\UFCDatabase\\DataCollection\\data\\overview_data_concat.csv'
    WITH(
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n',
        FIRSTROW = 2
    )
    '''
)

cursor.commit()

con.close()


In [135]:
con.close()

In [58]:
# reopens the connection
con = odbc.connect(connection_string)
cursor = con.cursor()
cursor.execute('''
    SELECT *
    FROM UFC_Fights_Overview
''')

for row in cursor.fetchall():
    print(row)

con.close()



('Sean Brady vs Gilbert Burns', 'Sean Brady', '0', '130', '7', '0', 'Welterweight', 'U-DEC', '""', '5', '5:00', 'win', 'UFC Fight Night: Burns vs. Brady')
('Natalia Silva vs Jessica Andrade', 'Natalia Silva', '0', '117', '0', '0', "Women's Flyweight", 'U-DEC', '""', '3', '5:00', 'win', 'UFC Fight Night: Burns vs. Brady')
('Steve Garcia vs Kyle Nelson', 'Steve Garcia', '0', '22', '0', '0', 'Featherweight', 'KO/TKO', 'Elbows', '1', '3:59', 'win', 'UFC Fight Night: Burns vs. Brady')
('Cody Durden vs Matt Schnell', 'Cody Durden', '0', '34', '0', '1', 'Bantamweight', 'SUB', 'Guillotine Choke', '2', '0:29', 'win', 'UFC Fight Night: Burns vs. Brady')
('Yanal Ashmouz vs Trevor Peek', 'Yanal Ashmouz', '0', '42', '9', '0', 'Lightweight', 'U-DEC', '""', '3', '5:00', 'win', 'UFC Fight Night: Burns vs. Brady')
('Chris Padilla vs Rongzhu', 'Chris Padilla', '0', '68', '0', '0', 'Lightweight', 'KO/TKO', '""', '2', '4:14', 'win', 'UFC Fight Night: Burns vs. Brady')
('Isaac Dulgarian vs Brendon Marotte'
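
In the rows printed above, empty `method_notes` values come back as the literal two-character string `""` rather than as an empty value. A minimal sketch of tidying fetched rows in Python is shown below. `normalise_row` is a hypothetical helper, and mapping the placeholder to `None` is a choice made here for illustration.

```python
EMPTY_PLACEHOLDER = '""'  # literal text seen in the fetched rows above


def normalise_row(row):
    """Replace the literal '""' placeholder with None, leaving other values untouched."""
    return tuple(None if value == EMPTY_PLACEHOLDER else value for value in row)


# First row from the query output above.
sample = ('Sean Brady vs Gilbert Burns', 'Sean Brady', '0', '130', '7', '0',
          'Welterweight', 'U-DEC', '""', '5', '5:00', 'win',
          'UFC Fight Night: Burns vs. Brady')

cleaned = normalise_row(sample)
print(cleaned)
```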

## Data Cleaning
Despite all efforts to extract the data without errors, there are still cases on the UFC source page that cause data to be extracted incorrectly.

Where a fight ends in a draw or no contest (nc), the elements list contains an extra element, which causes the extraction functions to use the wrong indexes.

Draws and ncs cause the matchup column to appear as 'draw vs ...' or 'nc vs ...'. Any subsequent fights are also affected and appear as 'win vs ...'.

The knockdowns column will also contain win, draw or nc, and every statistic is shifted one place to the right (its index is increased by one).

Two functions will be required: the first to fix the draws and ncs, and the second to fix the subsequent fights.
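
Before fixing anything, it helps to be able to flag the affected rows. The sketch below checks for the two symptoms described above using plain dictionaries. `is_misaligned` is a hypothetical helper, and the second and third sample rows are made up for illustration.

```python
RESULT_MARKERS = ("win", "draw", "nc")
SHIFTED_PREFIXES = tuple(f"{marker} vs" for marker in RESULT_MARKERS)


def is_misaligned(row: dict) -> bool:
    """Return True when a row shows either symptom of the index shift."""
    matchup_shifted = row["matchup"].startswith(SHIFTED_PREFIXES)
    knockdowns_shifted = row["knockdowns"] in RESULT_MARKERS
    return matchup_shifted or knockdowns_shifted


rows = [
    {"matchup": "Sean Brady vs Gilbert Burns", "knockdowns": "0"},  # correct row
    {"matchup": "draw vs Fighter A", "knockdowns": "draw"},         # hypothetical draw
    {"matchup": "win vs Fighter B", "knockdowns": "win"},           # hypothetical follow-on fight
]

flags = [is_misaligned(row) for row in rows]
print(flags)
```

Checking both columns gives a cross-check: a row flagged by only one of the two conditions may point to a different extraction problem.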

In [4]:
import polars as pl

dta = pl.read_csv('data\\overview_data_concat.csv', ignore_errors=True)

In [8]:
## fixing draws and ncs
## Firstly we need to find where there is 'draw vs' or 'nc vs' in the matchup column
## Then we need to find the corresponding fighter in the fighter column

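def find_draws_ncs(data):
    """Find the draws and no contests in the matchup column.

    Args:
        data (pl.DataFrame): The fight overview DataFrame.

    Returns:
        tuple[pl.DataFrame, pl.DataFrame]: The rows whose matchup contains
        'draw vs' and the rows whose matchup contains 'nc vs'.
    """
    draws = data.filter(pl.col('matchup').str.contains('draw vs'))
    ncs = data.filter(pl.col('matchup').str.contains('nc vs'))
    return draws, ncs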

find_draws_ncs(dta)

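# Work in progress. Polars string methods live under the `.str` namespace
# (`.str.contains`, not `.str_contains`), and columns are added with `with_columns`.
# Assumed intent: add a boolean column flagging the affected rows.
# def fix_draws_ncs(data):
#     """Flag the draws and no contests in
#     the matchup column.
#     Args:
#         data (pl.DataFrame): The DataFrame.
#     """
#     draws, ncs = find_draws_ncs(data)
#     draws = draws.with_columns(pl.col('fighter').str.contains('draw').alias('draw'))
#     ncs = ncs.with_columns(pl.col('fighter').str.contains('nc').alias('nc'))
#     return draws, ncs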
   

AttributeError: 'Expr' object has no attribute 'str_contains'