<a href="https://colab.research.google.com/github/ayanga1998/UFC_Dashboard/blob/main/Data_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# UFC Dataaset Collection
---

For this project, we will be developing an interactive dashboard where users can analyze their favorite fighters throughout the history of the Ultimate Fighting Championship. 

This notebook represents the first step in what will be an exciting journey. We begin this project by first scraping data from the UFC Stats website. This website contains data regarding nearly every fight since the inception of the UFC in the mid 1990's. 

In [1]:
import requests
from bs4 import BeautifulSoup  
import pandas as pd

Using beautiful soup (BS4) as our html parser, we will first extract the link to each recorded UFC event in the database

In [None]:
page_url = "http://ufcstats.com/statistics/events/completed?page=all"
rows = []
event_links = []
reviews= None
page = requests.get(page_url)

# Get links for each fight week
if page.status_code == 200:
    soup = BeautifulSoup(page.content, 'html.parser')
    divs = soup.select("i a")

    for row in range(len(divs)):
        event_links.append(divs[row].attrs['href'])

# Store only events that have already ocurred
event_links = event_links[1:]

As of today, this corresponds to over 590 total events!

In [None]:
print(f'Total number of events available: {len(event_links)}')
print(len(set(event_links))==len(event_links))

Total number of events available: 591
True


Next, we sort through each event, and collect the links for each specific fight (Main Card, Prelims, and Early Prelims). 

In [None]:
# Store the link for each fight for every event

fight_stats_links = []
text = 'fight-details'

for link in event_links:
    event_page = requests.get(link)

    if event_page.status_code == 200:
        event_soup = BeautifulSoup(event_page.content, 'html.parser')
        fight_links = event_soup.select("p a")

        for row in range(0,len(fight_links)):
            if 'fight-details' in fight_links[row].attrs['href']:
                fight_stats_links.append(fight_links[row].attrs['href'])

Based on the structure of the html code, there are instances where duplicate fight links were collected (i.e. the event in which a fight ends in a draw for example). 

In [None]:
print(f'Number of fight links stored: {len(fight_stats_links)}')
print(f'Number of unique fight links stored: {len(set(fight_stats_links))}')
print(f'Number of duplicate links: {len(fight_stats_links)-len(set(fight_stats_links))}')

Number of fight links stored: 6578
Number of unique fight links stored: 6460
Number of duplicate links: 118


The code below is used to filter out duplicate links before we begin scraping fight data

In [None]:
# Remove duplicate links from the list

fight_stats_links_unique = []
seen = set()

for x in fight_stats_links:
    if x not in seen:
        fight_stats_links_unique.append(x)
        seen.add(x)

In [None]:
len(fight_stats_links_unique)

6460

**Key**

kd: knockdowns \
ss: significant strikes \
td: takedowns \
rev: ? \


In [None]:
columns = ['red_fighter', 'blue_fighter', 'red_kd', 'blue_kd', 'red_ss', 'blue_ss', 'red_ss_pct', 'blue_ss_pct',
           'red_ts', 'blue_ts', 'red_td', 'blue_td', 'red_td_pct', 'blue_td_pct', 'red_sub_att', 'blue_sub_att', 'red_rev',
           'blue_rev', 'red_ctrl_time', 'blue_ctrl_time', 'red_head', 'blue_head', 'red_body', 'blue_body', 'red_leg', 'blue_leg', 
           'red_dist', 'blue_dist', 'red_clinch', 'blue_clinch', 'red_grnd', 'blue_grnd']

data = {col:[] for col in columns}
data['result'] = []
data['method'] = []
data['round'] = []
data['time'] = []
data['event_title'] = []
data['weight_class'] = []

for fight_link in fight_stats_links_unique:

    fight_page = requests.get(fight_link)

    if fight_page.status_code == 200:
        fight_soup = BeautifulSoup(fight_page.content, 'html.parser')
        
        upper_table = fight_soup.select("p i")

        # Collect fight statistics
        fight_stats = fight_soup.select("tr")

        if len(fight_stats) > 0:
            first_box = fight_stats[1].get_text().split(sep='\n\n')
        else:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: Fight data unavailable')
            continue

        event_name = fight_soup.select('div h2')[0]
        data['event_title'].append(event_name)

        # Store the weight class
        weight = fight_soup.find_all('tr',{'class':'b-fight-details__fight-title'})[0].text
        data['weight_class'].append(weight)

        # Store number of rounds in fight
        num_rounds = upper_table[3].get_text()
        data['round'].append(num_rounds)


        # Store the time expended in the last round of the fight
        time = upper_table[5].get_text()
        data['time'].append(time)

        # Store the method of victory
        method = upper_table[2].get_text()
        data['method'].append(method)
        
        # Store the winner of each fight
        divs = fight_soup.find_all('div',{'class':'b-fight-details__person'})

        if 'W' in divs[0].i.get_text():
            data['result'].append(divs[0].a.get_text())
        elif 'W' in divs[1].i.get_text():
            data['result'].append(divs[1].a.get_text())
        elif 'NC' in divs[0].i.get_text():
            data['result'].append('NC')
        elif 'D' in divs[0].i.get_text():
            data['result'].append('D')
        else:
            data['result'].append('UNK')


        if '5' in num_rounds:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: 5 round fight')
            second_box = fight_stats[9].get_text().split(sep='\n\n')
        elif '4' in num_rounds:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: 4 round fight (early finish)')
            second_box = fight_stats[8].get_text().split(sep='\n\n')
        elif '3' in num_rounds:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: 3 round fight')
            second_box = fight_stats[7].get_text().split(sep='\n\n')
        elif '2' in num_rounds:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: 2 round fight (early finish)')
            second_box = fight_stats[6].get_text().split(sep='\n\n')
        elif '1' in num_rounds:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: 1 round fight (early finish)')
            second_box = fight_stats[5].get_text().split(sep='\n\n')

        data_row = [element.replace('\n', '').strip() for element in first_box if element != '']

        data_row2 = [item.replace('\n','').strip() for item in second_box if item != '']
        data_row2 = data_row2[6:]

        data_rows = data_row + data_row2

        for idx, item in enumerate(data_rows):
            data[columns[idx]].append(item)
        

In [None]:
fight_link = fight_stats_links_unique[1]
fight_page = requests.get(fight_link)

if fight_page.status_code == 200:
    fight_soup = BeautifulSoup(fight_page.content, 'html.parser')
    method = fight_soup.select('div h2')[0]
    print(method.a.text)



  UFC Fight Night: Hermansson vs. Strickland

   


In [None]:
# -------- Test code ----------
fight_link = fight_stats_links_unique[100]
fight_page = requests.get(fight_link)

if fight_page.status_code == 200:
    fight_soup = BeautifulSoup(fight_page.content, 'html.parser')
    method = fight_soup.find_all('i',{'class':'b-fight-details__fight-title'})[0].text
    #method = fight_soup.select('div div div div')[0]
    print(method)



      Lightweight Bout
    


In [None]:
df = pd.DataFrame(data)

df.head()

Unnamed: 0,red_fighter,blue_fighter,red_kd,blue_kd,red_ss,blue_ss,red_ss_pct,blue_ss_pct,red_ts,blue_ts,red_td,blue_td,red_td_pct,blue_td_pct,red_sub_att,blue_sub_att,red_rev,blue_rev,red_ctrl_time,blue_ctrl_time,red_head,blue_head,red_body,blue_body,red_leg,blue_leg,red_dist,blue_dist,red_clinch,blue_clinch,red_grnd,blue_grnd,result,method,round,time,event_title,weight_class
0,Jack Hermansson,Sean Strickland,0,0,137 of 353,153 of 330,38%,46%,137 of 353,161 of 338,0 of 8,0 of 0,0%,---,0,0,0,0,0:31,0:00,22 of 194,125 of 286,64 of 105,24 of 40,51 of 54,4 of 4,134 of 350,151 of 328,3 of 3,2 of 2,0 of 0,0 of 0,Sean Strickland,Decision - Split,\n\n Round:\n \n 5\n,\n\n Time:\n \n 5:00\n\...,"[\n, [\n\n UFC Fight Night: Hermansson vs. St...",\n \n Middleweight Bout\n
1,Punahele Soriano,Nick Maximov,0,0,45 of 63,29 of 45,71%,64%,74 of 93,60 of 82,0 of 0,11 of 16,---,68%,0,1,0,1,1:37,8:45,28 of 46,19 of 33,17 of 17,6 of 8,0 of 0,4 of 4,20 of 38,15 of 31,2 of 2,10 of 10,23 of 23,4 of 4,Nick Maximov,Decision - Split,\n\n Round:\n \n 3\n,\n\n Time:\n \n 5:00\n\...,"[\n, [\n\n UFC Fight Night: Hermansson vs. St...",\n \n Middleweight Bout\n
2,Shavkat Rakhmonov,Carlston Harris,1,0,13 of 28,10 of 27,46%,37%,16 of 31,15 of 35,1 of 3,0 of 0,33%,---,0,0,0,0,0:52,0:44,10 of 25,3 of 13,3 of 3,6 of 10,0 of 0,1 of 4,9 of 20,8 of 25,0 of 0,2 of 2,4 of 8,0 of 0,Shavkat Rakhmonov,KO/TKO,\n\n Round:\n \n 1\n,\n\n Time:\n \n 4:10\n\...,"[\n, [\n\n UFC Fight Night: Hermansson vs. St...",\n\n Welterweight Bout\n
3,Sam Alvey,Brendan Allen,0,1,24 of 57,36 of 54,42%,66%,24 of 57,36 of 54,0 of 0,0 of 1,---,0%,0,1,0,0,0:07,0:44,20 of 52,15 of 32,2 of 3,14 of 15,2 of 2,7 of 7,24 of 56,32 of 47,0 of 1,2 of 3,0 of 0,2 of 4,Brendan Allen,Submission,\n\n Round:\n \n 2\n,\n\n Time:\n \n 2:10\n\...,"[\n, [\n\n UFC Fight Night: Hermansson vs. St...",\n \n Light Heavyweight Bout\n
4,Tresean Gore,Bryan Battle,0,0,57 of 95,112 of 193,60%,58%,86 of 126,119 of 203,2 of 3,1 of 8,66%,12%,1,0,0,0,1:16,3:20,31 of 67,49 of 117,15 of 17,46 of 59,11 of 11,17 of 17,42 of 79,105 of 185,11 of 12,7 of 8,4 of 4,0 of 0,Bryan Battle,Decision - Unanimous,\n\n Round:\n \n 3\n,\n\n Time:\n \n 5:00\n\...,"[\n, [\n\n UFC Fight Night: Hermansson vs. St...",\n \n Middleweight Bout\n


In [None]:
df.to_csv('full_ufc_dataset.csv')

# Collecting Event Data
---
We would like to collect the date for each event in the UFC as it may be beneficial for analytics purposes

In [7]:
page_url = "https://en.wikipedia.org/wiki/List_of_UFC_events"
page = requests.get(page_url)

# Get links for each fight week
if page.status_code == 200:
    soup = BeautifulSoup(page.content, 'html.parser')

    event_table = soup.find('table',{'id':'Past_events'})
    body = event_table.find('tbody')
    rows = body.find_all('tr')

data = {'event_title':[], 'date':[]}

for row in rows[1:]:
    row_data = row.find_all('td')

    data['event_title'].append(row_data[1].text.replace('\n',''))
    data['date'].append(row_data[2].text.replace('\n',''))

In [8]:
event_df = pd.DataFrame(data)

In [9]:
time = {'Jan':'1', 'Feb':'2', 'Mar':'3', 'Apr':'4', 'May':'5', 'Jun':'6', 'Jul':'7',
        'Aug':'8', 'Sep':'9', 'Oct':'10', 'Nov':'11', 'Dec':'12'}

def replace_date(text):
    for x, y in time.items():
        text = text.replace(x,y)
    
    text = text.replace(',','')
    text = text.replace(' ', '-')
    return text

In [10]:
event_df['date'] = event_df.date.apply(lambda x: replace_date(x))

In [12]:
event_df.to_csv('event_data.csv')

# Sustaining the database
---
It takes several hours to scrape the data from the UFC Stats website. It becomes unnecessary and painful having to run the entire code each time I want to update the dataset. To reduce suffering, let's add some code to scrape only the latest event. We can do this every week in order to maintain the dataset quality