## Webscraping
Okie dokie, so by using some regex patterns in Python I was able to find both tables. Each one is stored inside a parent div with the class name "table__inner". From there we can extract the information from both tables.

Table headers are neatly stored inside of the \<th> tag.

In [59]:
import re
import pandas as pd
import numpy as np
import urllib.request

url = "https://www.proballers.com/basketball/game/757056/denver-nuggets-los-angeles-lakers-2023-10-25"

fp = urllib.request.urlopen(url)
mybytes = fp.read()

webpage = mybytes.decode("utf8")
fp.close()

tables =  re.findall(r'(<div class=\"table__inner\">[\s\S]*?<\/table>)', webpage)

team_1 = tables[0]
team_2 = tables[1]

team_headers = re.findall(r'(<th class=\".*\">.*<\/th>)', team_1)
team_headers = [re.sub(r'(<th class=\".*\">|<\/th>)', '', header) for header in team_headers]


Since both tables have the exact same column names, the headers only need to be extracted once.
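
As a side note, the header pattern can be tried out on a small piece of markup before running it on the full page. The snippet below is only a sketch: `sample_table` and its class names are made up for illustration and are not taken from the real page. It uses a capture group with non-greedy matching, so the tag-stripping `re.sub` step is not needed and several headers on one line are still separated.

```python
import re

# Toy markup standing in for one scraped table; the class names are invented.
sample_table = (
    '<table><tr>'
    '<th class="left">PLAYER</th><th class="right">Pts</th>'
    '<th class="right">Reb</th><th class="right">Pts</th>'
    '</tr></table>'
)

# Capture only the text between the opening and closing <th> tags.
# \b keeps <thead> from matching, [^>]* cannot run past the end of the
# opening tag, and .*? stops at the first </th>.
headers = re.findall(r'<th\b[^>]*>(.*?)</th>', sample_table, flags=re.S)

print(headers)  # ['PLAYER', 'Pts', 'Reb', 'Pts']
```

Note that the repeated `Pts` header is kept here, which matters later when the cells are grouped by header.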

Now we can get to extracting all the data, which is stored inside the \<td> tags. Some values, such as player names, are nested in an anchor tag, which makes them slightly more challenging to extract.

However, another regex pattern can remove that anchor tag, although the name is then left padded with a lot of whitespace. So the best I can do is strip out all the whitespace, which leaves the names in a camel-case style (for example NikolaJokic). Thankfully that is still readable.

In [43]:
# GET DATA FROM TEAM 1 TABLE
team_1_data_raw = re.findall(r'(<td class=\".*\">[\s\S]*?<\/td>)', team_1)
team_1_data_raw = [re.sub(r'(<td class=\".*\">|<\/td>|<a [\s\S]*?>\n|\n|<\/a>|<div [\s\S]*?>[\s\S]*?>[\s\S]*?<\/div>)','', data) for data in team_1_data_raw]
print(team_1_data_raw)
# GET DATA FROM TEAM 2 TABLE
team_2_data_raw = re.findall(r'(<td class=\".*\">[\s\S]*?<\/td>)', team_2)
team_2_data_raw = [re.sub(r'(<td class=\".*\">|<\/td>|<a [\s\S]*?>\n|\n|<\/a>|<div [\s\S]*?>[\s\S]*?>[\s\S]*?<\/div>)','', data) for data in team_2_data_raw]

['                                                    Nikola Jokic                                            ', '29', '13', '11', '36', '9-17', '3-5', '54.5%', '2-4', '50.0%', '3', '10', '13', '11', '2', '1', '1', '2', '29', '15', '41', '                                                    Jamal Murray                                            ', '21', '2', '6', '34', '5-8', '3-5', '61.5%', '2-2', '100.0%', '0', '2', '2', '6', '1', '0', '1', '3', '21', '3', '24', '                                                    Kentavious Caldwell-Pope                                            ', '20', '2', '1', '36', '6-9', '2-3', '66.7%', '2-2', '100.0%', '1', '1', '2', '1', '3', '3', '1', '5', '20', '10', '20', '                                                    Aaron Gordon                                            ', '15', '7', '5', '35', '6-9', '1-2', '63.6%', '0-0', '-', '2', '5', '7', '5', '0', '2', '1', '0', '15', '6', '26', '                                                    Michael 

In [44]:
# CREATE OBJECT TO INSERT INTO DATAFRAME
# A header that appears twice in team_headers still produces only one key
team_1_data = {header: [] for header in team_headers}
team_2_data = {header: [] for header in team_headers}


In [45]:
# ITERATE THROUGH TEAM 1 AND TEAM 2 DATA AND INSERT INTO OBJECT
# Cell i belongs to column i % n_cols. Each team is handled on its own,
# so the two tables do not need to contain the same number of cells.
n_cols = len(team_headers)
for raw, data in ((team_1_data_raw, team_1_data), (team_2_data_raw, team_2_data)):
    for i, value in enumerate(raw):
        data[team_headers[i % n_cols]].append(value)


{'PLAYER': ['                                                    Nikola Jokic                                            ', '                                                    Jamal Murray                                            ', '                                                    Kentavious Caldwell-Pope                                            ', '                                                    Aaron Gordon                                            ', '                                                    Michael Porter                                            ', '                                                    Reggie Jackson                                            ', '                                                    Christian Braun                                            ', '                                                    Zeke Nnaji                                            ', '                                                    Peyton Watson         

In [46]:
# Remove whitespace from player names
team_1_data['PLAYER'] = [re.sub(r'(\s)', '', player) for player in team_1_data['PLAYER']]
team_2_data['PLAYER'] = [re.sub(r'(\s)', '', player) for player in team_2_data['PLAYER']]        
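
If keeping the space between first and last name is preferable, one possible alternative (not what this notebook does) is to trim and collapse the whitespace instead of deleting it. The sketch below also decodes HTML entities with the standard library's `html.unescape`, since a name such as D'Angelo Russell arrives from the page as `D&#039;Angelo Russell`. The `raw_names` list is a shortened stand-in for the scraped cells.

```python
import html

# Two raw cells in the shape the scrape returns; the padding is shortened here.
raw_names = ["        Nikola Jokic      ", "     D&#039;Angelo Russell    "]

# html.unescape turns entities such as &#039; back into characters, and
# " ".join(name.split()) trims the ends and collapses runs of whitespace.
clean_names = [" ".join(html.unescape(name).split()) for name in raw_names]

print(clean_names)  # ['Nikola Jokic', "D'Angelo Russell"]
```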

## Duplicate columns

The website I am currently scraping displays three columns twice for readability.

- Pts
- Ast
- Reb

This is a problem, since these three lists are now twice as long as every other list.

All the other columns have length 13, while these three have length 26. Initially I thought this meant I should cut these columns right down the middle and drop one half. However, after looking at the object I was building, I saw that the duplicate values sit right next to each other.

So I then realized we simply need to take every other value. This is a very simple task in Python or R: we can do all of it by slicing our lists.

In [47]:
# Take every other element of the three duplicate columns
# Pts
team_1_data['Pts'] = team_1_data['Pts'][::2]
team_2_data['Pts'] = team_2_data['Pts'][::2]
# Reb
team_1_data['Reb'] = team_1_data['Reb'][::2]
team_2_data['Reb'] = team_2_data['Reb'][::2]
# Ast
team_1_data['Ast'] = team_1_data['Ast'][::2]
team_2_data['Ast'] = team_2_data['Ast'][::2]
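
To see why the duplicates end up side by side rather than in two halves, here is a minimal sketch with made-up data: a four-column header row in which `Pts` appears twice, and a flat list of cells for two players. The names `headers`, `cells` and `columns` are only for this illustration.

```python
# Hypothetical header row with Pts repeated, as on the scraped page.
headers = ["PLAYER", "Pts", "MIN", "Pts"]
# Flat cell list for two players, read row by row.
cells = ["A", "29", "36", "29", "B", "21", "34", "21"]

columns = {header: [] for header in headers}  # the repeated header gives one key
for i, cell in enumerate(cells):
    columns[headers[i % len(headers)]].append(cell)

print(columns["Pts"])       # ['29', '29', '21', '21']
print(columns["Pts"][::2])  # ['29', '21']
```

Both copies of a row's value are appended to the same list before the next row starts, so taking every second element recovers one value per player.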



## Finally...

We should be all set to insert our values into a pandas dataframe.

However, duplicate values tend to come with duplicate columns, so let's tidy up and make sure none remain.

### Using pandas

Duplicate columns aren't scary though, since pandas has easy built-in ways to check whether a column is a duplicate and to remove it.

Thank you pandas!
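
As a minimal sketch of what that check does, the toy frame below (made-up values, not scraped data) is built from rows plus an explicit column list so that the repeated `Pts` label survives. A frame built from a dictionary, as in this notebook, cannot hold the same key twice, so there the check most likely acts as a safeguard rather than removing anything.

```python
import pandas as pd

# Toy frame with a repeated column label; values are invented.
toy = pd.DataFrame([["A", 29, 36, 29], ["B", 21, 34, 21]],
                   columns=["PLAYER", "Pts", "MIN", "Pts"])

mask = toy.columns.duplicated()  # True for every repeat after the first
deduped = toy.loc[:, ~mask]      # keep only the first occurrence of each label

print(mask.tolist())             # [False, False, False, True]
print(deduped.columns.tolist())  # ['PLAYER', 'Pts', 'MIN']
```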

In [49]:
# Insert into dataframe
team_1_panda = pd.DataFrame(team_1_data)
team_2_panda = pd.DataFrame(team_2_data)

# Remove duplicate keys from team_1_panda and team_2_panda
# (.copy() so the Team column below is added to an independent frame)
team_1_panda = team_1_panda.loc[:,~team_1_panda.columns.duplicated()].copy()
team_2_panda = team_2_panda.loc[:,~team_2_panda.columns.duplicated()].copy()

# Add team name to each player
# The last PLAYER cell of each table is the team name (index 12 in this game)
team_1_panda['Team'] = team_1_data['PLAYER'][-1]
team_2_panda['Team'] = team_2_data['PLAYER'][-1]
print(team_1_panda.head(10))


                    PLAYER Pts Reb Ast MIN 2M-2A 3M-3A     FG% 1M-1A      1%  \
0              NikolaJokic  29  13  11  36  9-17   3-5   54.5%   2-4   50.0%   
1              JamalMurray  21   2   6  34   5-8   3-5   61.5%   2-2  100.0%   
2  KentaviousCaldwell-Pope  20   2   1  36   6-9   2-3   66.7%   2-2  100.0%   
3              AaronGordon  15   7   5  35   6-9   1-2   63.6%   0-0       -   
4            MichaelPorter  12  12   2  30   3-4   2-9   38.5%   0-0       -   
5            ReggieJackson   8   3   1  24   1-3   2-5   37.5%   0-0       -   
6           ChristianBraun   5   3   2  19   2-4   0-1   40.0%   1-2   50.0%   
7                ZekeNnaji   4   0   1  12   1-2   0-1   33.3%   2-2  100.0%   
8             PeytonWatson   3   0   0  11   0-0   1-3   33.3%   0-0       -   
9             JalenPickett   2   0   0   1   1-1   0-0  100.0%   0-0       -   

  Or  Dr To Stl Blk Fo +/- Eff           Team  
0  3  10  2   1   1  2  15  41  DenverNuggets  
1  0   2  1   0   1  3 

In [50]:
print(team_2_panda.head(10))

                 PLAYER Pts Reb Ast MIN 2M-2A 3M-3A    FG% 1M-1A      1% Or  \
0           LeBronJames  21   8   5  29  9-12   1-4  62.5%   0-1    0.0%  1   
1         TaureanPrince  18   3   1  30   2-2   4-6  75.0%   2-2  100.0%  1   
2          AnthonyDavis  17   8   4  34  5-15   1-2  35.3%   4-4  100.0%  1   
3          AustinReaves  14   8   4  31   3-9   1-2  36.4%   5-7   71.4%  4   
4  D&#039;AngeloRussell  11   4   7  36   2-7   2-5  33.3%   1-2   50.0%  0   
5        CameronReddish   7   4   0  18   1-2   1-2  50.0%   2-2  100.0%  2   
6         ChristianWood   7   4   0  15   3-3   0-1  75.0%   1-2   50.0%  1   
7           GabeVincent   6   1   2  22   3-4   0-4  37.5%   0-0       -  1   
8          RuiHachimura   6   3   0  15   3-7   0-3  30.0%   0-0       -  2   
9           JaxsonHayes   0   1   0   7   0-0   0-0      -   0-0       -  0   

  Dr To Stl Blk Fo  +/- Eff              Team  
0  7  0   1   0  1    7  28  LosAngelesLakers  
1  2  1   0   1  0  -14  20  LosAn

Above, I not only removed duplicate values but also added a team column for every player. This may seem redundant, since each data frame covers a single team.

That being said, the data extracted for both teams is the record of **one** game. So I figured it would be simpler to indicate each player's team in a separate column, merge the tables together, and then export the data frame as one CSV that records the whole game.

Now, if someone wants to see the performance of an entire game across both teams, they can do so with the CSV. If they would rather compare the teams separately, they can split the table imported from the CSV into two using the team column.

In [51]:
# Merge the two dataframes
game_data = pd.concat([team_1_panda, team_2_panda], axis=0)
game_data

Unnamed: 0,PLAYER,Pts,Reb,Ast,MIN,2M-2A,3M-3A,FG%,1M-1A,1%,Or,Dr,To,Stl,Blk,Fo,+/-,Eff,Team
0,NikolaJokic,29,13,11,36,9-17,3-5,54.5%,2-4,50.0%,3,10,2,1,1,2,15,41,DenverNuggets
1,JamalMurray,21,2,6,34,5-8,3-5,61.5%,2-2,100.0%,0,2,1,0,1,3,3,24,DenverNuggets
2,KentaviousCaldwell-Pope,20,2,1,36,6-9,2-3,66.7%,2-2,100.0%,1,1,3,3,1,5,10,20,DenverNuggets
3,AaronGordon,15,7,5,35,6-9,1-2,63.6%,0-0,-,2,5,0,2,1,0,6,26,DenverNuggets
4,MichaelPorter,12,12,2,30,3-4,2-9,38.5%,0-0,-,2,10,0,2,0,1,12,20,DenverNuggets
5,ReggieJackson,8,3,1,24,1-3,2-5,37.5%,0-0,-,0,3,2,1,0,0,11,6,DenverNuggets
6,ChristianBraun,5,3,2,19,2-4,0-1,40.0%,1-2,50.0%,1,2,1,0,1,1,5,6,DenverNuggets
7,ZekeNnaji,4,0,1,12,1-2,0-1,33.3%,2-2,100.0%,0,0,1,0,0,2,-3,2,DenverNuggets
8,PeytonWatson,3,0,0,11,0-0,1-3,33.3%,0-0,-,0,0,1,0,1,1,1,1,DenverNuggets
9,JalenPickett,2,0,0,1,1-1,0-0,100.0%,0-0,-,0,0,0,0,0,0,0,2,DenverNuggets


## Verifying

How can we check that the dataframe we've made here is correct?

Well, I would say "with another regex", but the player names I extracted have been modified to remove all whitespace. So in the meantime I'll just look at the webpage and compare it to my dataframe.
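
Besides eyeballing the page, the table can also be checked against itself. The sketch below is an assumption-laden example rather than part of the original workflow: it assumes `2M-2A`, `3M-3A` and `1M-1A` hold made-attempted counts for two-pointers, three-pointers and free throws, which the rows shown above are consistent with. Under that assumption, points should equal 2 × two-pointers + 3 × three-pointers + free throws, and `Reb` should equal `Or` + `Dr`. The `sample` frame and the `made` helper are hypothetical names used only here.

```python
import pandas as pd

# Three rows copied from the Denver output above (values are strings, as scraped).
sample = pd.DataFrame({
    "PLAYER": ["NikolaJokic", "JamalMurray", "AaronGordon"],
    "Pts": ["29", "21", "15"], "Reb": ["13", "2", "7"],
    "2M-2A": ["9-17", "5-8", "6-9"], "3M-3A": ["3-5", "3-5", "1-2"],
    "1M-1A": ["2-4", "2-2", "0-0"], "Or": ["3", "0", "2"], "Dr": ["10", "2", "5"],
})

def made(column):
    """Return the 'made' part of a 'made-attempted' column as integers."""
    return column.str.split("-").str[0].astype(int)

# Points implied by the shooting columns.
expected_pts = 2 * made(sample["2M-2A"]) + 3 * made(sample["3M-3A"]) + made(sample["1M-1A"])
pts_ok = sample["Pts"].astype(int) == expected_pts
# Total rebounds should equal offensive plus defensive rebounds.
reb_ok = sample["Reb"].astype(int) == sample["Or"].astype(int) + sample["Dr"].astype(int)

# Players whose row fails either check; an empty list means all rows agree.
print(sample.loc[~(pts_ok & reb_ok), "PLAYER"].tolist())  # []
```

A row that shows up in this list would point to a cell that landed in the wrong column during extraction.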

## Exporting

Now, with our cleaned-up data, we can export our dataframe to CSV.

In [61]:
# Export to csv
import os

file_name = url.split('/')[-1] + '.csv'
os.makedirs("./datasets", exist_ok=True)  # to_csv does not create missing folders
game_data.to_csv("./datasets/"+file_name, index=False)
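
Splitting the exported game back into one table per team, as described earlier, could look like the sketch below. To keep it self-contained it reads from an in-memory string instead of the real file; `csv_text` is a stand-in holding only a few columns and rows from the output above.

```python
import io
import pandas as pd

# Stand-in for the exported file: a few columns and rows from the output above.
csv_text = """PLAYER,Pts,Team
NikolaJokic,29,DenverNuggets
JamalMurray,21,DenverNuggets
LeBronJames,21,LosAngelesLakers
"""

game = pd.read_csv(io.StringIO(csv_text))

# One dataframe per team, keyed by the value in the Team column.
per_team = {team: rows.reset_index(drop=True) for team, rows in game.groupby("Team")}

print(sorted(per_team))                              # ['DenverNuggets', 'LosAngelesLakers']
print(per_team["DenverNuggets"]["PLAYER"].tolist())  # ['NikolaJokic', 'JamalMurray']
```

With the real file, the `io.StringIO(csv_text)` argument would be replaced by the path written in the export cell.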