<a href="https://colab.research.google.com/github/ayanga1998/UFC_Dashboard/blob/main/Data_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# UFC Dataset Collection

In this project, we develop an interactive dashboard that lets users analyze their favorite fighters across the history of the Ultimate Fighting Championship (UFC).

This notebook is the first step in what will be an exciting journey. We begin by scraping data from the UFC Stats website, which contains data on nearly every fight since the UFC's inception in the early 1990s.

In [1]:
import requests
from bs4 import BeautifulSoup  
import pandas as pd

Using Beautiful Soup (bs4) as our HTML parser, we first extract the link to each recorded UFC event in the database.

In [2]:
page_url = "http://ufcstats.com/statistics/events/completed?page=all"
event_links = []
page = requests.get(page_url)

# Get the link for each event
if page.status_code == 200:
    soup = BeautifulSoup(page.content, 'html.parser')
    divs = soup.select("i a")

    for div in divs:
        event_links.append(div.attrs['href'])

# Store only events that have already occurred
# (the first link is skipped on the assumption that it is the upcoming event)
event_links = event_links[1:]
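
The request above has no timeout, and a non-200 response leaves `event_links` empty without any message. A minimal sketch of a retry wrapper follows. `fetch_with_retries` and its parameters are hypothetical names that are not part of the original notebook, and the wrapper accepts any `get` callable with the same call shape as `requests.get`.

```python
from time import sleep


def fetch_with_retries(get, url, retries=3, delay=1.0, timeout=10):
    """Call get(url, timeout=...) until it returns status 200 or retries run out.

    `get` is assumed to behave like requests.get: it returns an object with a
    `status_code` attribute and may raise an exception on network errors.
    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=timeout)
            if response.status_code == 200:
                return response
            print(f'Attempt {attempt}: got status {response.status_code} for {url}')
        except Exception as exc:  # e.g. connection errors or timeouts
            print(f'Attempt {attempt}: request failed for {url}: {exc}')
        if attempt < retries:
            sleep(delay)  # pause before trying again
    return None
```

In this notebook it could be called as `page = fetch_with_retries(requests.get, page_url)`, followed by a check that `page` is not `None`.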

In [3]:
event_links

['http://ufcstats.com/event-details/335ad945324c3a2e',
 'http://ufcstats.com/event-details/f5585e675af7afd4',
 'http://ufcstats.com/event-details/2a470ad41c22c25a',
 'http://ufcstats.com/event-details/ef927e4fe2117ab8',
 'http://ufcstats.com/event-details/509697e08673d2e5',
 'http://ufcstats.com/event-details/c95fbc085e17a532',
 'http://ufcstats.com/event-details/b5abaa65f87938eb',
 'http://ufcstats.com/event-details/48e093ea1f43a053',
 'http://ufcstats.com/event-details/3974fa35c917af1d',
 'http://ufcstats.com/event-details/8a9c6c4301f6d088',
 'http://ufcstats.com/event-details/8f4616698508f24d',
 'http://ufcstats.com/event-details/d247691a6c0e9034',
 'http://ufcstats.com/event-details/e15d0a2519d6a0b5',
 'http://ufcstats.com/event-details/0eec866a077889f0',
 'http://ufcstats.com/event-details/4c27ca8c8481f79a',
 'http://ufcstats.com/event-details/b9532d815060de7f',
 'http://ufcstats.com/event-details/0db9d2486d564a3c',
 'http://ufcstats.com/event-details/2f13e4020cea5b38',
 'http://u

As of today, this corresponds to 590 total events!

In [4]:
print(f'Total number of events available: {len(event_links)}')
print(len(set(event_links))==len(event_links))

Total number of events available: 590
True


Next, we loop through each event and collect the link to each individual fight (main card, prelims, and early prelims).

In [5]:
# Store the link for each fight for every event

fight_stats_links = []
text = 'fight-details'

for link in event_links:
    event_page = requests.get(link)

    if event_page.status_code == 200:
        event_soup = BeautifulSoup(event_page.content, 'html.parser')
        fight_links = event_soup.select("p a")

        # Keep only anchors whose URL points to a fight-details page
        for fight_link in fight_links:
            if text in fight_link.attrs['href']:
                fight_stats_links.append(fight_link.attrs['href'])
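
The substring check above can also be pushed into the CSS selector itself. The sketch below runs on a small made-up HTML fragment (the markup and IDs are illustrative, not copied from UFC Stats) and uses the `href*=` attribute selector that Beautiful Soup's `select` supports.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating an event page; the real markup may differ.
sample_html = """
<p><a href="http://ufcstats.com/fight-details/aaa111">Fight 1</a></p>
<p><a href="http://ufcstats.com/fighter-details/bbb222">Fighter</a></p>
<p><a href="http://ufcstats.com/fight-details/ccc333">Fight 2</a></p>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# 'a[href*="fight-details"]' matches anchors whose href contains that substring
sample_fight_links = [a['href'] for a in soup.select('p a[href*="fight-details"]')]
print(sample_fight_links)
```

Only the two `fight-details` anchors are kept; the `fighter-details` link is skipped because it does not contain the substring.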

Because of the structure of the HTML, some fight links are collected more than once (for example, when a fight ends in a draw).

In [6]:
print(f'Number of fight links stored: {len(fight_stats_links)}')
print(f'Number of unique fight links stored: {len(set(fight_stats_links))}')
print(f'Number of duplicate links: {len(fight_stats_links)-len(set(fight_stats_links))}')

Number of fight links stored: 6565
Number of unique fight links stored: 6447
Number of duplicate links: 118


The code below filters out duplicate links, preserving their original order, before we begin scraping fight data.

In [7]:
# Remove duplicate links from the list

fight_stats_links_unique = []
seen = set()

for x in fight_stats_links:
    if x not in seen:
        fight_stats_links_unique.append(x)
        seen.add(x)

In [8]:
len(fight_stats_links_unique)

6447
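
For reference, the same order-preserving de-duplication can be written in one line with `dict.fromkeys`. This is an alternative sketch on a made-up list, not the code the notebook ran.

```python
# Hypothetical sample list; in the notebook this would be fight_stats_links.
links = ['a', 'b', 'a', 'c', 'b']

# dict keys are unique and keep insertion order (Python 3.7+)
unique_links = list(dict.fromkeys(links))
print(unique_links)  # ['a', 'b', 'c']
```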

In [28]:
columns = ['red_fighter', 'blue_fighter', 'red_KD', 'blue_KD', 'red_Sig_Strikes', 'blue_Sig_Strikes', 'red_Sig_Strikes_Pct', 'blue_Sig_Strikes_Pct',
           'red_Total_Strikes', 'blue_Total_Strikes', 'red_TD', 'blue_TD', 'red_TD_pct', 'blue_TD_pct', 'red_Sub_Att', 'blue_Sub_Att', 'red_Rev',
           'blue_Rev', 'red_Ctrl_Time', 'blue_Ctrl_Time', 'red_head', 'blue_head', 'red_body', 'blue_body', 'red_leg', 'blue_leg', 
           'red_distance', 'blue_distance', 'red_clinch', 'blue_clinch', 'red_ground', 'blue_ground']

data = {col:[] for col in columns}
data['result'] = []
data['method'] = []
data['round'] = []
data['time'] = []

# Sample run: scrape only the last 20 links, which are the earliest fights in the list
for fight_link in fight_stats_links_unique[-20:]:

    fight_page = requests.get(fight_link)

    if fight_page.status_code == 200:
        fight_soup = BeautifulSoup(fight_page.content, 'html.parser')
        
        upper_table = fight_soup.select("p i")

        # Fighter blocks (name and W/L/D/NC status)
        divs = fight_soup.find_all('div',{'class':'b-fight-details__person'})

        # Collect fight statistics
        fight_stats = fight_soup.select("tr")

        # Skip fights without statistics before appending anything,
        # so that every list in `data` stays the same length
        if len(fight_stats) > 1:
            first_box = fight_stats[1].get_text().split(sep='\n\n')
        else:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: Fight data unavailable')
            continue

        # Store number of rounds in fight
        num_rounds = upper_table[3].get_text()
        data['round'].append(num_rounds)

        # Store the time elapsed in the last round of the fight
        time = upper_table[5].get_text()
        data['time'].append(time)

        # Store the method of victory
        method = upper_table[2].get_text()
        data['method'].append(method)

        if 'W' in divs[0].i.get_text():
            data['result'].append(divs[0].a.get_text())
        elif 'W' in divs[1].i.get_text():
            data['result'].append(divs[1].a.get_text())
        elif 'NC' in divs[0].i.get_text():
            data['result'].append('NC')
        elif 'D' in divs[0].i.get_text():
            data['result'].append('D')
        else:
            data['result'].append('UNK')


        if '5' in num_rounds:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: 5 round fight')
            second_box = fight_stats[9].get_text().split(sep='\n\n')
        elif '4' in num_rounds:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: 4 round fight (early finish)')
            second_box = fight_stats[8].get_text().split(sep='\n\n')
        elif '3' in num_rounds:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: 3 round fight')
            second_box = fight_stats[7].get_text().split(sep='\n\n')
        elif '2' in num_rounds:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: 2 round fight (early finish)')
            second_box = fight_stats[6].get_text().split(sep='\n\n')
        elif '1' in num_rounds:
            print(f'{divs[1].a.get_text()} v. {divs[0].a.get_text()}: 1 round fight (early finish)')
            second_box = fight_stats[5].get_text().split(sep='\n\n')

        data_row = [element.replace('\n', '').strip() for element in first_box if element != '']

        data_row2 = [item.replace('\n','').strip() for item in second_box if item != '']
        data_row2 = data_row2[6:]

        data_rows = data_row + data_row2

        for idx, item in enumerate(data_rows):
            data[columns[idx]].append(item)
        

Felix Lee Mitchell  v. Ken Shamrock : 1 round fight (early finish)
Kimo Leopoldo  v. Royce Gracie : 1 round fight (early finish)
Roland Payne  v. Harold Howard : 1 round fight (early finish)
Christophe Leninger  v. Ken Shamrock : 1 round fight (early finish)
Emmanuel Yarborough  v. Keith Hackney : 1 round fight (early finish)
Patrick Smith  v. Royce Gracie : 1 round fight (early finish)
Remco Pardoel  v. Royce Gracie : 1 round fight (early finish)
Johnny Rhodes  v. Patrick Smith : 1 round fight (early finish)
Jason DeLucia  v. Royce Gracie : 1 round fight (early finish)
Orlando Wiet  v. Remco Pardoel : 1 round fight (early finish)
Fred Ettish  v. Johnny Rhodes : 1 round fight (early finish)
Scott Morris  v. Patrick Smith : 1 round fight (early finish)
Minoki Ichihara  v. Royce Gracie : 1 round fight (early finish)
Scott Baker  v. Jason DeLucia : 1 round fight (early finish)
Alberta Cerra Leon  v. Remco Pardoel : 1 round fight (early finish)
Robert Lucarelli  v. Orlando Wiet : 1 round f
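
The five round branches in the loop differ only in the row index they read: 9, 8, 7, 6, and 5 for 5, 4, 3, 2, and 1 rounds, that is, the round number plus 4. A possible sketch of that mapping as a helper follows. `second_box_index` is a hypothetical name, and the offset of 4 is inferred from the indices used in the loop rather than from the page markup.

```python
import re


def second_box_index(round_text):
    """Return the index into `fight_stats` for the second statistics row.

    Mirrors the if/elif chain: 1 round -> 5, 2 -> 6, ..., 5 -> 9.
    Like the chain, the highest digit between 1 and 5 found in the text wins.
    Returns None when no such digit is present.
    """
    digits = re.findall(r'[1-5]', round_text)
    if not digits:
        return None
    return max(int(d) for d in digits) + 4


print(second_box_index('\n\n Round:\n \n 1\n'))  # 5
```

Returning `None` for unexpected text makes the missing-round case explicit, whereas the chain leaves `second_box` unset when no branch matches.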

In [27]:
fight_link = fight_stats_links_unique[1]
fight_page = requests.get(fight_link)

if fight_page.status_code == 200:
    fight_soup = BeautifulSoup(fight_page.content, 'html.parser')
    # Index 5 of the "p i" selection holds the Time field, not the method
    time_text = fight_soup.select('p i')[5].get_text()
    print(time_text)




          Time:
        
        5:00
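
The raw text printed here still carries its label and the surrounding whitespace. One way to keep only the value is sketched below. `strip_label` is a hypothetical helper, and it assumes every field has the `Label: value` shape seen in this output.

```python
def strip_label(raw_text):
    """Collapse whitespace and drop everything up to the first colon."""
    collapsed = ' '.join(raw_text.split())      # e.g. 'Time: 5:00'
    label, sep, value = collapsed.partition(':')
    # If there is no colon, return the collapsed text unchanged
    return value.strip() if sep else collapsed


raw = '\n\n          Time:\n        \n        5:00\n\n      '
print(strip_label(raw))  # 5:00
```

The same helper would turn a raw `Round:` field into just the round number as a string.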

      


In [29]:
df = pd.DataFrame(data)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 36 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   red_fighter           20 non-null     object
 1   blue_fighter          20 non-null     object
 2   red_KD                20 non-null     object
 3   blue_KD               20 non-null     object
 4   red_Sig_Strikes       20 non-null     object
 5   blue_Sig_Strikes      20 non-null     object
 6   red_Sig_Strikes_Pct   20 non-null     object
 7   blue_Sig_Strikes_Pct  20 non-null     object
 8   red_Total_Strikes     20 non-null     object
 9   blue_Total_Strikes    20 non-null     object
 10  red_TD                20 non-null     object
 11  blue_TD               20 non-null     object
 12  red_TD_pct            20 non-null     object
 13  blue_TD_pct           20 non-null     object
 14  red_Sub_Att           20 non-null     object
 15  blue_Sub_Att          20 non-null     obje

In [30]:
df

Unnamed: 0,red_fighter,blue_fighter,red_KD,blue_KD,red_Sig_Strikes,blue_Sig_Strikes,red_Sig_Strikes_Pct,blue_Sig_Strikes_Pct,red_Total_Strikes,blue_Total_Strikes,red_TD,blue_TD,red_TD_pct,blue_TD_pct,red_Sub_Att,blue_Sub_Att,red_Rev,blue_Rev,red_Ctrl_Time,blue_Ctrl_Time,red_head,blue_head,red_body,blue_body,red_leg,blue_leg,red_distance,blue_distance,red_clinch,blue_clinch,red_ground,blue_ground,result,method,round,time
0,Ken Shamrock,Felix Lee Mitchell,0,0,4 of 4,3 of 3,100%,100%,7 of 7,21 of 22,1 of 2,0 of 0,50%,---,1,0,0,0,--,--,1 of 1,0 of 0,2 of 2,1 of 1,1 of 1,2 of 2,0 of 0,0 of 0,3 of 3,3 of 3,1 of 1,0 of 0,Ken Shamrock,Submission,\n\n Round:\n \n 1\n,\n\n Time:\n \n 4:34\n\...
1,Royce Gracie,Kimo Leopoldo,0,0,2 of 6,6 of 9,33%,66%,17 of 21,6 of 10,0 of 2,1 of 1,0%,100%,1,0,1,1,--,--,1 of 1,3 of 6,0 of 2,2 of 2,1 of 3,1 of 1,0 of 0,0 of 1,2 of 6,3 of 3,0 of 0,3 of 5,Royce Gracie,Submission,\n\n Round:\n \n 1\n,\n\n Time:\n \n 4:40\n\...
2,Harold Howard,Roland Payne,1,0,9 of 12,3 of 4,75%,75%,12 of 15,3 of 4,0 of 1,1 of 2,0%,50%,0,0,0,0,--,--,9 of 12,0 of 0,0 of 0,2 of 2,0 of 0,1 of 2,1 of 4,2 of 2,4 of 4,1 of 2,4 of 4,0 of 0,Harold Howard,KO/TKO,\n\n Round:\n \n 1\n,\n\n Time:\n \n 0:46\n\...
3,Ken Shamrock,Christophe Leninger,0,0,8 of 9,1 of 2,88%,50%,16 of 20,21 of 23,1 of 1,0 of 0,100%,---,0,0,0,0,--,--,5 of 6,1 of 2,3 of 3,0 of 0,0 of 0,0 of 0,0 of 1,1 of 2,0 of 0,0 of 0,8 of 8,0 of 0,Ken Shamrock,KO/TKO,\n\n Round:\n \n 1\n,\n\n Time:\n \n 4:49\n\...
4,Keith Hackney,Emmanuel Yarborough,1,0,34 of 50,1 of 5,68%,20%,36 of 52,4 of 9,0 of 0,0 of 1,---,0%,0,0,0,0,--,--,29 of 45,1 of 5,0 of 0,0 of 0,5 of 5,0 of 0,6 of 8,0 of 2,2 of 3,0 of 0,26 of 39,1 of 3,Keith Hackney,KO/TKO,\n\n Round:\n \n 1\n,\n\n Time:\n \n 1:59\n\...
5,Royce Gracie,Patrick Smith,0,0,4 of 4,1 of 2,100%,50%,11 of 11,2 of 3,1 of 2,0 of 0,50%,---,0,0,0,0,--,--,3 of 3,0 of 0,0 of 0,1 of 2,1 of 1,0 of 0,0 of 0,0 of 1,1 of 1,1 of 1,3 of 3,0 of 0,Royce Gracie,KO/TKO,\n\n Round:\n \n 1\n,\n\n Time:\n \n 1:17\n\...
6,Royce Gracie,Remco Pardoel,0,0,0 of 0,0 of 0,---,---,0 of 0,0 of 0,1 of 2,0 of 0,50%,---,1,0,0,0,--,--,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,Royce Gracie,Submission,\n\n Round:\n \n 1\n,\n\n Time:\n \n 1:31\n\...
7,Patrick Smith,Johnny Rhodes,0,0,5 of 12,4 of 9,41%,44%,5 of 12,4 of 9,0 of 0,0 of 0,---,---,1,0,0,0,--,--,1 of 4,2 of 5,2 of 2,0 of 0,2 of 6,2 of 4,3 of 10,4 of 9,2 of 2,0 of 0,0 of 0,0 of 0,Patrick Smith,Submission,\n\n Round:\n \n 1\n,\n\n Time:\n \n 1:07\n\...
8,Royce Gracie,Jason DeLucia,0,0,0 of 0,0 of 0,---,---,0 of 0,0 of 0,0 of 0,1 of 1,---,100%,1,0,1,0,--,--,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,0 of 0,Royce Gracie,Submission,\n\n Round:\n \n 1\n,\n\n Time:\n \n 1:07\n\...
9,Remco Pardoel,Orlando Wiet,0,0,7 of 7,1 of 2,100%,50%,7 of 7,5 of 7,1 of 1,0 of 0,100%,---,0,0,0,0,--,--,7 of 7,0 of 1,0 of 0,0 of 0,0 of 0,1 of 1,0 of 0,1 of 2,0 of 0,0 of 0,7 of 7,0 of 0,Remco Pardoel,KO/TKO,\n\n Round:\n \n 1\n,\n\n Time:\n \n 1:29\n\...
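
Every column in this table is stored as text: counts appear as `landed of attempted`, and missing percentages appear as `---`. A hedged sketch of converters follows. `parse_landed_attempted` and `parse_percentage` are hypothetical helper names, and they assume only the formats visible in this sample.

```python
def parse_landed_attempted(text):
    """Convert '34 of 50' into (34, 50)."""
    landed, attempted = text.split(' of ')
    return int(landed), int(attempted)


def parse_percentage(text):
    """Convert '68%' into 0.68; return None for placeholders such as '---'."""
    if not text.endswith('%'):
        return None
    return int(text[:-1]) / 100


print(parse_landed_attempted('34 of 50'))  # (34, 50)
print(parse_percentage('68%'))             # 0.68
print(parse_percentage('---'))             # None
```

With pandas, such helpers could be applied column by column, for example `df['red_Sig_Strikes'].map(parse_landed_attempted)`.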


In [31]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv('sample_ufc_dataset.csv', index=False)