I wanted to write my own custom web scraper for a site that contains data but does not present it in a clean data table format. That meant I would have to parse the site's HTML, identify the data I wanted, and write the Python code to capture it.

I chose a site that publishes NFL power rankings for all 32 teams for each of the 18 weeks of the 2022 season (the season named in the article URLs). The goal was to finish with a 32x19 data table where the first column holds the 32 team names and the other 18 columns hold each team's power ranking for each week of the season. This notebook walks through how I did it.

Here is the URL for the website I used: https://www.nfl.com/news/nfl-power-rankings-week-1-eagles-lions-rising-heading-into-2022-season-cowboys-s

You will notice that while the data is organized on this website, it is not in a format where I can just extract a data table and be done.

In [1]:
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import lxml.html

### get team names

In [2]:
#Let's extract the data for week 1
url = "https://www.nfl.com/news/nfl-power-rankings-week-1-eagles-lions-rising-heading-into-2022-season-cowboys-s"
response = requests.get(url)
content = response.content

In [3]:
#look at response code (200 is good)
print(response.status_code)

200


In [4]:
#parse html using lxml so we can start processing it
html = lxml.html.fromstring(content)

In [5]:
#now let's select every parent div element that we know stores team name data
team_names = html.xpath("//div[contains(@class, 'nfl-o-ranked-item__title')]")

In [6]:
#print how many items were on the page that we just extracted
print(len(team_names))

32


In [7]:
#Now, select a specific team and return its data
#I will select the Bills, the 1st team that appears on the list
team_name = team_names[0]

In [8]:
#Now let's extract the data from the HTML by using the XPath
team = team_name.xpath('.//a/text()')[0]

In [9]:
#verify that the Xpath got what we want
print("Team =", team)

Team = Buffalo Bills
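
To see what the two XPath queries above do without requesting the live page, here is a minimal sketch run against a hand-written HTML fragment. The fragment is an assumption: it only imitates the class name used above with a link inside each title div, and is not the real page markup.

```python
import lxml.html

# Hypothetical fragment that mimics the structure the XPath relies on:
# a div whose class contains 'nfl-o-ranked-item__title', wrapping a link.
sample_html = """
<div>
  <div class="nfl-o-ranked-item__title"><a href="/teams/a">Buffalo Bills</a></div>
  <div class="nfl-o-ranked-item__title"><a href="/teams/b">Los Angeles Rams</a></div>
</div>
"""

sample_tree = lxml.html.fromstring(sample_html)

# Absolute query: every matching div anywhere in the document
title_divs = sample_tree.xpath("//div[contains(@class, 'nfl-o-ranked-item__title')]")

# Relative query (note the leading dot): link text inside one specific div
sample_names = [div.xpath(".//a/text()")[0] for div in title_divs]
print(sample_names)
```

The leading dot in `.//a/text()` matters: without it, the query would search the whole document again instead of only the div being inspected.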


In [10]:
#now let's create an empty dataframe with column names
data = pd.DataFrame(columns=['team'])

In [11]:
#now we will loop through the extracted data, find the team names in the HTML, and populate the dataframe
#DataFrame.append was removed in pandas 2.0, so each row is added with pd.concat instead
for team_name in team_names:
    team = team_name.xpath('.//a/text()')[0]
    data = pd.concat([data, pd.DataFrame([{'team': team}])], ignore_index=True)

In [12]:
#let's take a look at what we got
data

Unnamed: 0,team
0,Buffalo Bills
1,Los Angeles Rams
2,San Francisco 49ers
3,Tampa Bay Buccaneers
4,Cincinnati Bengals
5,Green Bay Packers
6,Kansas City Chiefs
7,Denver Broncos
8,Los Angeles Chargers
9,Baltimore Ravens
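
Adding rows one at a time inside a loop works for 32 teams, but a common alternative is to collect the strings in a list first and build the dataframe in a single call. The sketch below uses a short stand-in list (`scraped_names` and `names_df` are illustrative names, not part of the notebook) so it runs on its own.

```python
import pandas as pd

# Stand-in for the strings pulled out of the page; in the notebook this
# would be [t.xpath('.//a/text()')[0] for t in team_names].
scraped_names = ["Buffalo Bills", "Los Angeles Rams", "San Francisco 49ers"]

# Build the whole column in one call instead of growing the dataframe row by row
names_df = pd.DataFrame({"team": scraped_names})
print(names_df)
```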


Looks good. Now let's grab the corresponding power rankings.

### get power ranking

In [13]:
#now let's select every parent div element that we know stores team rank data
team_ranks = html.xpath("//div[contains(@class, 'nfl-o-ranked-item__label--second')]")

In [14]:
#print how many items were on the page that we just extracted
print(len(team_ranks))

32


In [15]:
#Now, select a specific team and return its data
#I will select the Bills, the 1st team that appears on the list
team_rank = team_ranks[0]

In [16]:
#Now let's extract the data from the HTML by using the XPath
rank = team_rank.xpath('text()')[0]

In [17]:
print("Rank =", rank)

Rank = 1


In [18]:
#now let's create an empty dataframe with column names, then loop through the items and add each one to the dataframe
data1 = pd.DataFrame(columns=['w1_pr'])

In [19]:
#loop through items and populate dataframe
#DataFrame.append was removed in pandas 2.0, so each row is added with pd.concat instead
for team_rank in team_ranks:
    rank = team_rank.xpath('text()')[0]
    data1 = pd.concat([data1, pd.DataFrame([{'w1_pr': rank}])], ignore_index=True)

In [20]:
#let's take a look at what we got
data1

Unnamed: 0,w1_pr
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
9,10


In [21]:
#if we put the two columns together, we have our teams, and their week 1 power ranking
w1_df = pd.concat([data,data1], axis=1)
w1_df

Unnamed: 0,team,w1_pr
0,Buffalo Bills,1
1,Los Angeles Rams,2
2,San Francisco 49ers,3
3,Tampa Bay Buccaneers,4
4,Cincinnati Bengals,5
5,Green Bay Packers,6
6,Kansas City Chiefs,7
7,Denver Broncos,8
8,Los Angeles Chargers,9
9,Baltimore Ravens,10


Looks good. Normally I would now automate this process for the remaining weeks, but the week 1 URL does not follow the same format as the URLs for weeks 2-18, which makes it difficult to loop through all of them. Since weeks 2-18 do share a format, I will extract week 2 manually, then automate weeks 3-18.
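
One way to work around the odd week 1 URL is to treat it as a special case inside a small helper, so a single loop could still cover every week. This is only a sketch: `week_url`, `BASE`, and `WEEK1_SLUG` are names introduced here for illustration, and the URL pattern is the one used in this notebook.

```python
# Week 1 has its own slug, so it is spelled out; weeks 2-18 follow a pattern.
BASE = "https://www.nfl.com/news/"
WEEK1_SLUG = "nfl-power-rankings-week-1-eagles-lions-rising-heading-into-2022-season-cowboys-s"


def week_url(week: int) -> str:
    """Return the article URL for a given week (1-18)."""
    if not 1 <= week <= 18:
        raise ValueError(f"week must be between 1 and 18, got {week}")
    if week == 1:
        return BASE + WEEK1_SLUG
    return f"{BASE}nfl-power-rankings-week-{week}-2022-nfl-season"


# Map each week number to its URL
urls = {week: week_url(week) for week in range(1, 19)}
print(urls[2])
```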

### get week 2 data, then automate for remaining weeks

In [22]:
#extract the team names data for week 2
url = "https://www.nfl.com/news/nfl-power-rankings-week-2-2022-nfl-season"
response = requests.get(url)
content = response.content

In [23]:
#look at response code (200 is good)
print(response.status_code)

200


In [24]:
#parse html using lxml so we can start processing it
html = lxml.html.fromstring(content)

In [25]:
#now let's select every parent div element that we know stores team name data
team_names = html.xpath("//div[contains(@class, 'nfl-o-ranked-item__title')]")

In [26]:
#print how many items were on the page that we just extracted
print(len(team_names))

32


In [27]:
#Now, select a specific team and return its data
#I will select the Bills, the 1st team that appears on the list
team_name = team_names[0]

In [28]:
#Now let's extract the data from this team by using the XPaths that we set up earlier
team = team_name.xpath('.//a/text()')[0]

In [29]:
print("Team =", team)

Team = Buffalo Bills


In [30]:
#now let's create an empty dataframe with column names, then loop through the items and add each one to the dataframe
w2_df = pd.DataFrame(columns=['team'])

In [31]:
#loop through items and populate dataframe
#DataFrame.append was removed in pandas 2.0, so each row is added with pd.concat instead
for team_name in team_names:
    team = team_name.xpath('.//a/text()')[0]
    w2_df = pd.concat([w2_df, pd.DataFrame([{'team': team}])], ignore_index=True)

In [32]:
#week 2 team names
w2_df

Unnamed: 0,team
0,Buffalo Bills
1,Kansas City Chiefs
2,Tampa Bay Buccaneers
3,Los Angeles Rams
4,Los Angeles Chargers
5,Baltimore Ravens
6,Minnesota Vikings
7,Green Bay Packers
8,Cincinnati Bengals
9,Philadelphia Eagles


In [33]:
#normally we would extract the rank data as well, but the team names are extracted in rank order,
#so for this project we can simply number them 1-32. Then we can merge the dataframes on the team column

In [34]:
#populate w2_pr column with numbers 1-32
w2_df['w2_pr'] = np.arange(1,33)
w2_df

Unnamed: 0,team,w2_pr
0,Buffalo Bills,1
1,Kansas City Chiefs,2
2,Tampa Bay Buccaneers,3
3,Los Angeles Rams,4
4,Los Angeles Chargers,5
5,Baltimore Ravens,6
6,Minnesota Vikings,7
7,Green Bay Packers,8
8,Cincinnati Bengals,9
9,Philadelphia Eagles,10
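
Numbering the rows 1-32 relies on the page listing teams in rank order. Week 1, where both the names and the rank labels were scraped, offers a way to check that assumption. The sketch below shows the idea on a three-row toy table (`toy_w1` is made up for illustration); note that scraped ranks arrive as strings and need converting first.

```python
import numpy as np
import pandas as pd

# Toy version of w1_df: ranks scraped from the page arrive as strings
toy_w1 = pd.DataFrame({
    "team": ["Buffalo Bills", "Los Angeles Rams", "San Francisco 49ers"],
    "w1_pr": ["1", "2", "3"],
})

scraped = toy_w1["w1_pr"].astype(int).to_numpy()
positional = np.arange(1, len(toy_w1) + 1)

# True only if every scraped rank equals its 1-based position on the page
ranks_match_order = bool((scraped == positional).all())
print(ranks_match_order)
```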


In [35]:
#now we want to merge week 1 power rankings with week 2 power rankings
df = pd.merge(w1_df, w2_df, on='team')
df

Unnamed: 0,team,w1_pr,w2_pr
0,Buffalo Bills,1,1
1,Los Angeles Rams,2,4
2,San Francisco 49ers,3,11
3,Tampa Bay Buccaneers,4,3
4,Cincinnati Bengals,5,9
5,Green Bay Packers,6,8
6,Kansas City Chiefs,7,2
7,Denver Broncos,8,16
8,Los Angeles Chargers,9,5
9,Baltimore Ravens,10,6
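
By default `pd.merge` performs an inner join, so a team whose name is spelled differently in two weeks would silently drop out of the result. A hedged sketch of how one might guard against that, using toy data with a deliberately mismatched name (the mismatch is hypothetical, not something observed on the site):

```python
import pandas as pd

toy_a = pd.DataFrame({"team": ["Buffalo Bills", "Los Angeles Rams"], "w1_pr": [1, 2]})
# Hypothetical mismatch: one name is spelled differently in the second week
toy_b = pd.DataFrame({"team": ["Buffalo Bills", "LA Rams"], "w2_pr": [1, 4]})

# An outer merge with indicator=True keeps unmatched rows and labels their origin;
# validate raises MergeError if a team appears more than once on either side.
checked = pd.merge(toy_a, toy_b, on="team", how="outer",
                   indicator=True, validate="one_to_one")
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["team", "_merge"]])
```

If `unmatched` is empty for the real data, the inner merge used in the notebook loses nothing.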


### automate extraction for weeks 3-18

In [36]:
#automate to get weeks 3-18
#now we want to automate the process and put the extracted data into a dataframe
#this cell loops through weeks 3-18 (1 at a time), then selects every parent div that we know stores team name data
team_names = []
for start in range(3,19,1):
    url = f"https://www.nfl.com/news/nfl-power-rankings-week-{start}-2022-nfl-season"
    response = requests.get(url)
    content = response.content
    html = lxml.html.fromstring(content)
    team_names += html.xpath("//div[contains(@class, 'nfl-o-ranked-item__title')]")

In [37]:
#check how many team names we got (32*16=512)
print(len(team_names))

512
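
The total of 512 confirms the overall count, but it would not reveal which week went wrong if one request failed or one page returned fewer than 32 teams. The following is a sketch of a per-week check, not the method used above: `fetch_page` and `scrape_weeks` are hypothetical helpers, and the 30-second timeout is an arbitrary choice.

```python
import lxml.html
import requests

TITLE_XPATH = "//div[contains(@class, 'nfl-o-ranked-item__title')]"


def fetch_page(url: str) -> bytes:
    """Download one page, failing loudly on timeouts or non-2xx responses."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def scrape_weeks(weeks, url_for_week, fetch=fetch_page, expected=32):
    """Return {week: [team names]}, checking each page yields `expected` names."""
    results = {}
    for week in weeks:
        tree = lxml.html.fromstring(fetch(url_for_week(week)))
        names = [str(div.xpath(".//a/text()")[0]) for div in tree.xpath(TITLE_XPATH)]
        if len(names) != expected:
            raise ValueError(f"week {week}: found {len(names)} teams, expected {expected}")
        results[week] = names
    return results
```

It could be called as `scrape_weeks(range(3, 19), lambda w: f"https://www.nfl.com/news/nfl-power-rankings-week-{w}-2022-nfl-season")`. Passing `fetch` as a parameter also makes the parsing logic testable without network access.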


In [38]:
#now let's create an empty dataframe with column names, auto_df stands for automated dataframe
auto_df = pd.DataFrame(columns=['team'])

In [39]:
#loop through items and populate dataframe
#DataFrame.append was removed in pandas 2.0, so each row is added with pd.concat instead
for team_name in team_names:
    team = team_name.xpath('.//a/text()')[0]
    auto_df = pd.concat([auto_df, pd.DataFrame([{'team': team}])], ignore_index=True)

In [40]:
#here are our team names in order from weeks 3-18
auto_df

Unnamed: 0,team
0,Buffalo Bills
1,Kansas City Chiefs
2,Philadelphia Eagles
3,Tampa Bay Buccaneers
4,Green Bay Packers
...,...
507,Denver Broncos
508,Chicago Bears
509,Arizona Cardinals
510,Houston Texans


In [41]:
#since the team names are in order (based on power ranking), we can just populate the rankings column with a numpy array rather
#than extract rank data from weeks 3-18
#populate rank column with repeated sequence (1-32)
x = np.arange(1,33)
auto_df['rank'] = np.tile(x,16)

In [42]:
#now here's our df
auto_df

Unnamed: 0,team,rank
0,Buffalo Bills,1
1,Kansas City Chiefs,2
2,Philadelphia Eagles,3
3,Tampa Bay Buccaneers,4
4,Green Bay Packers,5
...,...,...
507,Denver Broncos,28
508,Chicago Bears,29
509,Arizona Cardinals,30
510,Houston Texans,31
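
To make the `np.tile` step concrete, here is a scaled-down sketch with 4 teams and 3 weeks instead of 32 and 16. It also shows `np.repeat`, which is not used in the notebook but would be one way to label each row with its week.

```python
import numpy as np

teams_per_week = 4          # 32 in the notebook; smaller here for readability
weeks = np.arange(3, 6)     # weeks 3, 4, 5 (the notebook covers 3 through 18)

# tile repeats the whole 1..n sequence once per week: 1 2 3 4 1 2 3 4 ...
ranks = np.tile(np.arange(1, teams_per_week + 1), len(weeks))

# repeat stretches each week label across that week's rows: 3 3 3 3 4 4 4 4 ...
week_labels = np.repeat(weeks, teams_per_week)

print(ranks)
print(week_labels)
```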


In [43]:
#loop through auto_df and slice it into weekly data
#merge each weekly df with df that contains first two weeks
#x holds the slice boundaries; it starts at -96 so that x[i] is the first row of week i (x[3]=0, x[4]=32, ..., x[19]=512)
#this lets the loop index run over weeks 3-18 directly, instead of needing a second counter from 1-16 for the slicing
x = np.arange(-96,544,32)
for i in range(3,19,1):
    weekly_df = auto_df[x[i]:x[i+1]]
    df = pd.merge(df, weekly_df, on='team')
    df = df.rename(columns={"rank": f"w{i}_pr"})

In [44]:
#here is our dataframe
df

Unnamed: 0,team,w1_pr,w2_pr,w3_pr,w4_pr,w5_pr,w6_pr,w7_pr,w8_pr,w9_pr,w10_pr,w11_pr,w12_pr,w13_pr,w14_pr,w15_pr,w16_pr,w17_pr,w18_pr
0,Buffalo Bills,1,1,1,2,2,1,1,1,1,2,5,4,5,4,4,3,3,2
1,Los Angeles Rams,2,4,6,6,11,15,12,11,15,18,26,28,30,29,26,27,25,26
2,San Francisco 49ers,3,11,9,11,5,4,11,12,5,6,7,5,3,7,2,2,1,1
3,Tampa Bay Buccaneers,4,3,4,8,8,6,9,18,20,16,12,11,18,16,21,23,20,15
4,Cincinnati Bengals,5,9,12,10,6,9,5,5,10,11,11,10,8,3,3,4,4,3
5,Green Bay Packers,6,8,5,4,4,8,15,19,19,22,16,19,21,21,22,17,12,9
6,Kansas City Chiefs,7,2,2,5,3,3,3,3,3,3,2,1,2,5,5,5,5,4
7,Denver Broncos,8,16,17,14,16,23,24,27,26,26,30,30,31,30,30,28,32,28
8,Los Angeles Chargers,9,5,7,15,14,12,10,16,16,15,18,17,12,18,9,9,8,7
9,Baltimore Ravens,10,6,10,7,7,5,8,8,6,5,6,8,11,11,8,12,10,12
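
An alternative to slicing and merging week by week, assuming each row of the long table also carries a week label, is to reshape it in one step with `DataFrame.pivot`. The sketch below uses a toy table of 3 teams over 2 weeks (`long_df` and `wide_df` are illustrative names) rather than the scraped data.

```python
import numpy as np
import pandas as pd

# Toy long table: 3 teams listed in rank order for weeks 3 and 4
long_df = pd.DataFrame({
    "team": ["Bills", "Chiefs", "Eagles", "Chiefs", "Bills", "Eagles"],
    "rank": np.tile(np.arange(1, 4), 2),
    "week": np.repeat([3, 4], 3),
})

# One row per team, one column per week
wide_df = long_df.pivot(index="team", columns="week", values="rank")
wide_df.columns = [f"w{week}_pr" for week in wide_df.columns]
wide_df = wide_df.reset_index()
print(wide_df)
```

The pivot also sorts the teams alphabetically by the index, and it raises an error if a team appears twice in the same week, which acts as a basic sanity check on the scrape.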


In [45]:
#sort alphabetically by 'team' to get our finished dataframe
df = df.sort_values('team')
df

Unnamed: 0,team,w1_pr,w2_pr,w3_pr,w4_pr,w5_pr,w6_pr,w7_pr,w8_pr,w9_pr,w10_pr,w11_pr,w12_pr,w13_pr,w14_pr,w15_pr,w16_pr,w17_pr,w18_pr
15,Arizona Cardinals,16,22,14,20,17,19,26,20,22,23,19,22,22,23,29,30,30,30
30,Atlanta Falcons,31,30,29,25,22,22,18,22,18,17,22,20,25,25,25,26,28,27
9,Baltimore Ravens,10,6,10,7,7,5,8,8,6,5,6,8,11,11,8,12,10,12
0,Buffalo Bills,1,1,1,2,2,1,1,1,1,2,5,4,5,4,4,3,3,2
22,Carolina Panthers,23,29,31,29,31,32,32,30,29,32,31,31,28,27,20,24,19,18
31,Chicago Bears,32,25,30,27,30,29,29,23,24,21,20,23,29,28,28,29,29,29
4,Cincinnati Bengals,5,9,12,10,6,9,5,5,10,11,11,10,8,3,3,4,4,3
21,Cleveland Browns,22,20,22,17,21,20,25,25,21,19,24,24,19,20,23,20,26,24
14,Dallas Cowboys,15,27,21,13,13,10,6,4,4,4,8,3,4,2,6,6,6,6
7,Denver Broncos,8,16,17,14,16,23,24,27,26,26,30,30,31,30,30,28,32,28


In [46]:
#save data table to a CSV file (which can be opened in Excel)
df.to_csv('2022_nfl_weekly_power_rankings.csv', index=False)

Now we have the weekly NFL power rankings from the 2022 season in a tidy data table that we can analyze.