## Webscraping
Using a few regex patterns in Python, I was able to find both tables. Each one is stored inside a parent div with the class name "table__inner", and from there we can extract the information from both tables.

Table headers are neatly stored inside the \<th> tags.

In [136]:
import re
import pandas as pd
import numpy as np

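# Read the saved page and grab each stats table together with its wrapper div.
# The with-block closes the file once it has been read.
with open("./Denver Nuggets vs. Los Angeles Lakers - Oct 25, 2023 - Game recap Proballers.htm") as webpage:
    tables = re.findall(r'(<div class=\"table__inner\">[\s\S]*?<\/table>)', webpage.read())

# One table per team, in page order
team_1 = tables[0]
team_2 = tables[1]

# Find every <th> element, then strip the tags so only the header text remains
team_headers = re.findall(r'(<th class=\".*\">.*<\/th>)', team_1)
team_headers = [re.sub(r'(<th class=\".*\">|<\/th>)', '', header) for header in team_headers]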


Since both tables have the exact same column names, the headers only need to be extracted once.
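To make the two regex steps easier to follow, here is a minimal sketch run against a tiny, made-up HTML string instead of the saved page. The markup and the class names `left` and `right` are assumptions for illustration only. It also uses a capture group to pull the header text out in one pass, which is a small variation on the find-then-strip approach above.

```python
import re

# Hypothetical, heavily simplified stand-in for the saved page
html = """
<div class="table__inner">
<table>
<tr>
<th class="left">PLAYER</th>
<th class="right">Pts</th>
</tr>
</table>
</div>
<div class="table__inner">
<table>
<tr>
<th class="left">PLAYER</th>
<th class="right">Pts</th>
</tr>
</table>
</div>
"""

# The non-greedy [\s\S]*? stops at the first </table>,
# so each table is matched separately
tables = re.findall(r'(<div class="table__inner">[\s\S]*?</table>)', html)

# The capture group returns only the text between the tags
headers = re.findall(r'<th class="[^"]*">(.*?)</th>', tables[0])

print(len(tables))  # 2
print(headers)      # ['PLAYER', 'Pts']
```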

Now we can get to extracting all the data, which is stored inside the \<td> tags. Some data, such as player names, is nested in an anchor tag, which makes it slightly more challenging to extract.

Another regex pattern removes that anchor tag, but the name is left with a lot of whitespace around it. So the best I can do is strip all the whitespace, which leaves the names in camel-case style (first and last name joined with no space between them). Thankfully that is still readable.

In [137]:
# GET DATA FROM TEAM 1 TABLE
team_1_data_raw = re.findall(r'(<td class=\".*\">[\s\S]*?<\/td>)', team_1)
team_1_data_raw = [re.sub(r'(<td class=\".*\">|<\/td>|<a [\s\S]*?>\n|\n|<\/a>|<div [\s\S]*?>[\s\S]*?>[\s\S]*?<\/div>)','', data) for data in team_1_data_raw]

# GET DATA FROM TEAM 2 TABLE
team_2_data_raw = re.findall(r'(<td class=\".*\">[\s\S]*?<\/td>)', team_2)
team_2_data_raw = [re.sub(r'(<td class=\".*\">|<\/td>|<a [\s\S]*?>\n|\n|<\/a>|<div [\s\S]*?>[\s\S]*?>[\s\S]*?<\/div>)','', data) for data in team_2_data_raw]
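Since both teams go through exactly the same two patterns, one option is to compile them once and wrap them in a small helper. This is only a sketch: `extract_cells` is a hypothetical name, and the sample markup is made up to mimic a name cell and a number cell.

```python
import re

# The same two patterns used for both teams, compiled once
TD_PATTERN = re.compile(r'(<td class=\".*\">[\s\S]*?<\/td>)')
MARKUP_PATTERN = re.compile(
    r'(<td class=\".*\">|<\/td>|<a [\s\S]*?>\n|\n|<\/a>|<div [\s\S]*?>[\s\S]*?>[\s\S]*?<\/div>)'
)

def extract_cells(table_html):
    """Return the text of every <td> in table_html with the markup stripped."""
    return [MARKUP_PATTERN.sub('', cell) for cell in TD_PATTERN.findall(table_html)]

# Hypothetical snippet: a name nested in an anchor tag, then a plain number
sample = '''<tr>
<td class="left">
<a href="#player">
   First
   Last
</a>
</td>
<td class="right">12</td>
</tr>'''

cells = extract_cells(sample)
print(cells)  # the name cell still carries its leading and inner whitespace
```

With a helper like this, `team_1_data_raw = extract_cells(team_1)` and `team_2_data_raw = extract_cells(team_2)` would replace the four lines above.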

In [138]:
# CREATE OBJECT TO INSERT INTO DATAFRAME
# One empty list per header; repeated header names collapse into a single key
team_1_data = {header: [] for header in team_headers}
team_2_data = {header: [] for header in team_headers}


In [139]:
# ITERATE THROUGH TEAM 1 AND TEAM 2 DATA AND INSERT INTO OBJECT
# Every row has len(team_headers) cells, so cell i belongs to column i % n_cols
n_cols = len(team_headers)
for i, cell in enumerate(team_1_data_raw):
    team_1_data[team_headers[i % n_cols]].append(cell)
for i, cell in enumerate(team_2_data_raw):
    team_2_data[team_headers[i % n_cols]].append(cell)


In [140]:
# Remove all whitespace from player names (first and last name end up joined)
team_1_data['PLAYER'] = [re.sub(r'(\s)', '', player) for player in team_1_data['PLAYER']]
team_2_data['PLAYER'] = [re.sub(r'(\s)', '', player) for player in team_2_data['PLAYER']]
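If keeping a space between first and last name is preferable, one alternative is to collapse each run of whitespace into a single space instead of deleting it. The sketch below uses a made-up raw cell value and plain string methods rather than regex.

```python
# Hypothetical raw cell, as left behind once the tags are stripped
raw_name = "   First      Last  "

joined = "".join(raw_name.split())   # what removing all whitespace gives
spaced = " ".join(raw_name.split())  # collapse whitespace runs to one space

print(joined)  # FirstLast
print(spaced)  # First Last
```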

## Duplicate columns

The website I am currently scraping displays three columns twice for readability.

- Pts
- Ast
- Reb

This is a problem, since these three lists are now twice as long as every other list.

All the other columns have length 13, while these three have length 26. Each of them needs to be cut back to 13 before the data goes into a pandas dataframe.

Because the raw data is read row by row, from left to right, each of these lists holds every player's value twice in a row (`[a, a, b, b, ...]`). Messing with my iteration doesn't seem wise when Python's slicing can drop the repeats directly. Taking every second element keeps exactly one value per player, whereas keeping only the first half of the list would keep doubled values for the first players and lose the rest.

In [141]:
# Keep every second element of the Pts, Reb and Ast lists (one value per row)
for key in ('Pts', 'Reb', 'Ast'):
    team_1_data[key] = team_1_data[key][::2]
    team_2_data[key] = team_2_data[key][::2]
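As a sanity check on how the duplicated values end up laid out, here is a small sketch with made-up headers and values (nothing here comes from the scraped page). It fills a dict row by row when one header appears twice, then compares two ways of halving the resulting list.

```python
# Made-up headers and rows; "Pts" appears twice per row
headers = ["PLAYER", "Pts", "MIN", "Pts"]
flat_cells = ["A", "10", "30", "10",
              "B", "7", "25", "7",
              "C", "3", "12", "3"]

columns = {header: [] for header in headers}  # duplicate keys collapse into one
for i, cell in enumerate(flat_cells):
    columns[headers[i % len(headers)]].append(cell)

print(columns["Pts"])  # ['10', '10', '7', '7', '3', '3']

first_half = columns["Pts"][:len(columns["Pts"]) // 2]
every_other = columns["Pts"][::2]

print(first_half)   # ['10', '10', '7']
print(every_other)  # ['10', '7', '3']
```

Only the `[::2]` slice lines up with the three players in this toy example; the first-half slice repeats player A and drops player C.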





## Finally...

We should be all set to insert our values into a pandas dataframe.

However, duplicate values come with duplicate column names: the header list still contains Pts, Reb and Ast twice, as the following printout shows.

In [148]:
# Print the length of each column's list, in header order
for header in team_headers:
    print("HEADER: ", header, " ", len(team_1_data[header]))
    print("\n")


HEADER:  PLAYER   13


HEADER:  Pts   13


HEADER:  Reb   13


HEADER:  Ast   13


HEADER:  MIN   13


HEADER:  2M-2A   13


HEADER:  3M-3A   13


HEADER:  FG%   13


HEADER:  1M-1A   13


HEADER:  1%   13


HEADER:  Or   13


HEADER:  Dr   13


HEADER:  Reb   13


HEADER:  Ast   13


HEADER:  To   13


HEADER:  Stl   13


HEADER:  Blk   13


HEADER:  Fo   13


HEADER:  Pts   13


HEADER:  +/-   13


HEADER:  Eff   13
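Rather than scanning the printout by eye, the repeated headers can also be found programmatically. This is a minimal sketch using `collections.Counter`, with the header list copied from the output above.

```python
from collections import Counter

# Header list as shown in the printout above
team_headers = ["PLAYER", "Pts", "Reb", "Ast", "MIN", "2M-2A", "3M-3A", "FG%",
                "1M-1A", "1%", "Or", "Dr", "Reb", "Ast", "To", "Stl", "Blk",
                "Fo", "Pts", "+/-", "Eff"]

counts = Counter(team_headers)
repeated = [header for header, n in counts.items() if n > 1]

print(repeated)  # ['Pts', 'Reb', 'Ast']
```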




## Using pandas

Duplicate columns aren't scary, though: pandas has easy built-in ways to determine whether a column is a duplicate and to remove it.

Thank you pandas!

In [152]:
# Insert into dataframe
team_1_panda = pd.DataFrame(team_1_data)
team_2_panda = pd.DataFrame(team_2_data)

# Drop any repeated column names, keeping the first occurrence.
# This is a safeguard: the dict keys are already unique, so nothing is removed here.
team_1_panda = team_1_panda.loc[:, ~team_1_panda.columns.duplicated()]
team_2_panda = team_2_panda.loc[:, ~team_2_panda.columns.duplicated()]
print(team_1_panda)
print(team_2_panda)
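For comparison, here is a sketch of a different route, not the one taken in this notebook: split the flat list of cells into rows first and hand pandas the full header list. A dataframe built that way really does contain the repeated column names, so `columns.duplicated()` does the removal on its own. The headers and values are made up for illustration.

```python
import pandas as pd

# Made-up headers and cells; "Pts" appears twice per row
headers = ["PLAYER", "Pts", "MIN", "Pts"]
flat_cells = ["A", "10", "30", "10",
              "B", "7", "25", "7"]

# Split the flat list into one list per table row
n_cols = len(headers)
rows = [flat_cells[i:i + n_cols] for i in range(0, len(flat_cells), n_cols)]

df = pd.DataFrame(rows, columns=headers)   # duplicate column names are allowed
df = df.loc[:, ~df.columns.duplicated()]   # keep the first occurrence of each name

print(list(df.columns))     # ['PLAYER', 'Pts', 'MIN']
print(df["Pts"].tolist())   # ['10', '7']
```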



In [None]:
team_1_panda['Pts']

## Exporting

Now, with our cleaned-up dataframes, we can export each team to a CSV file.

In [None]:
# Export to csv (index=False leaves out the row index column)
team_1_panda.to_csv('team_1.csv', index=False)
team_2_panda.to_csv('team_2.csv', index=False)