# Exploring all the PlayOffs series in NBA history

In this series of notebooks I use a table from <a href="https://www.basketball-reference.com/playoffs/series.html">BasketballReference</a> to analyze every PlayOffs series in the history of the NBA. In this first notebook I prepare the data as three CSV tables: one for the NBA PlayOffs series, one for the ABA and one for the BAA.

I am going to perform this first task using **requests**, **scrapy** and **pandas**.

In [1]:
import requests
from scrapy import Selector

In [2]:
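page = requests.get("https://www.basketball-reference.com/playoffs/series.html")
page.raise_for_status()  # stop early if the request failed
sel = Selector(text=page.text)  # parse the HTML body returned by requests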

The table is enclosed in `<table>` tags with id="playoffs_series", so the XPath for this table is '//table[@id="playoffs_series"]'.
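
As a rough illustration of how the `[@id=...]` predicate picks one table out of several, here is a minimal offline sketch. It uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) instead of scrapy, and the HTML snippet is a made-up stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Tiny, well-formed stand-in for the real page (hypothetical content).
html = """
<html><body>
  <table id="other"><tr><td>ignore me</td></tr></table>
  <table id="playoffs_series"><tr><td>Finals</td></tr></table>
</body></html>
""".strip()

root = ET.fromstring(html)

# The [@id='...'] predicate keeps only tables whose id attribute matches.
matches = root.findall(".//table[@id='playoffs_series']")
first_cell = matches[0].find(".//td").text

print(len(matches), first_cell)
```

Without the square brackets the expression is not a valid predicate, which is why they matter in the XPath string.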

In [3]:
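from io import StringIO

import pandas as pd

xpath = '//table[@id="playoffs_series"]'
# read_html returns a list of DataFrames; the HTML string is wrapped in
# StringIO because recent pandas versions deprecate passing literal HTML
df = pd.read_html(StringIO(sel.xpath(xpath).get()))[0]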

df.head()

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Winner,Winner,Unnamed: 7_level_0,Loser,Loser,Unnamed: 10_level_0,Unnamed: 11_level_0,Unnamed: 12_level_0
Unnamed: 0_level_1,Yr,Lg,Series,Unnamed: 3_level_1,Unnamed: 4_level_1,Team,W,Unnamed: 7_level_1,Team,W,Unnamed: 10_level_1,Favorite,Underdog
0,2020,NBA,Eastern Conf First Round,"Aug 18 - Aug 29, 2020",,Milwaukee Bucks (1),4,,Orlando Magic (8),1,,MIL (-7500),ORL (+3250)
1,2020,NBA,Eastern Conf First Round,"Aug 17 - Aug 23, 2020",,Toronto Raptors (2),4,,Brooklyn Nets (7),0,,TOR (-2200),BRK (+1315)
2,2020,NBA,Eastern Conf First Round,"Aug 17 - Aug 23, 2020",,Boston Celtics (3),4,,Philadelphia 76ers (6),0,,BOS (-450),PHI (+360)
3,2020,NBA,Eastern Conf First Round,"Aug 18 - Aug 24, 2020",,Miami Heat (5),4,,Indiana Pacers (4),0,,MIA (-320),IND (+260)
4,2020,NBA,Western Conf First Round,"Aug 18 - Aug 29, 2020",,Los Angeles Lakers (1),4,,Portland Trail Blazers (8),1,,LAL (-550),POR (+425)


The column names are a mess: the table has a two-level header, and most of the labels are auto-generated "Unnamed" placeholders. Next I give each column a meaningful name. Note that three columns are empty (those at zero-based positions 4, 7 and 10); their names start with "rm" so they are easy to remove before any other wrangling with the data.

In [4]:
df.columns = ["Year",      # Year of the PlayOffs series, integer
              "League",    # League of the PlayOffs, string
              "Series",    # Series name, string
              "Date",      # Date of start and end of the series
              "rm1",       # to be removed
              "Team_win",  # Team who won the series
              "Wins_win",  # Games won by the team who won the series
              "rm2",       # to be removed
              "Team_lose", # Team who lost the series
              "Wins_lose", # Games won by the team who lost the series
              "rm3",       # to be removed
              "Fav",       # Favourite team for winning the series
              "Underdog"]  # Non-favourite team for winning the series

# Drop the empty placeholder columns
df = df.drop(columns=[col for col in df.columns if col.startswith("rm")])
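
Before dropping the placeholder columns, it may be worth confirming that they really are empty. The following is a minimal sketch on a hypothetical two-row frame (the `toy` DataFrame and its reduced column set are assumptions for illustration, not the scraped table).

```python
import pandas as pd

# Hypothetical two-row stand-in for the renamed table.
toy = pd.DataFrame({
    "Year": [2020, 2020],
    "rm1": [None, None],
    "Team_win": ["Milwaukee Bucks (1)", "Toronto Raptors (2)"],
    "rm2": [None, None],
})

placeholder_cols = [col for col in toy.columns if col.startswith("rm")]

# Every placeholder column should contain nothing but missing values.
all_empty = bool(toy[placeholder_cols].isna().all().all())

cleaned = toy.drop(columns=placeholder_cols)
print(all_empty, list(cleaned.columns))
```

If `all_empty` came out as `False` on the real data, one of the "rm" columns would hold information and should be given a proper name instead.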

Also, some rows are not part of the data: they only exist to format the table nicely on the BasketballReference website. These rows are easy to recognize because the value in their "Year" column is not an integer, whereas every real PlayOffs series has the year in which it took place.

This block of code also converts the columns **Year**, **Wins_win** and **Wins_lose** to ```int```.

In [5]:
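# Keep only the rows whose Year is made of digits, then cast the numeric columns
is_series_row = df.Year.apply(lambda x: str(x).isdigit())
df = df[is_series_row].astype({"Year": int, "Wins_win": int, "Wins_lose": int})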
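
The same filtering idea can be sketched with `pd.to_numeric`, which is an alternative to the `isdigit` check used in this notebook, not a replacement for it. The `toy` frame below is a hypothetical stand-in with one repeated header row and one blank separator row.

```python
import pandas as pd

# Hypothetical stand-in: a repeated header row and a blank separator row
# mixed in with two real series.
toy = pd.DataFrame({
    "Year": ["2020", "Yr", None, "2019"],
    "Wins_win": ["4", "W", None, "4"],
    "Wins_lose": ["1", "W", None, "2"],
})

# Non-numeric years become NaN, which makes the filler rows easy to drop.
year = pd.to_numeric(toy["Year"], errors="coerce")
clean = toy[year.notna()].astype({"Year": int, "Wins_win": int, "Wins_lose": int})

print(clean["Year"].tolist(), clean["Wins_lose"].tolist())
```

Either approach relies on the same observation: only real series rows carry a numeric year.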

The records in the table obtained from BasketballReference cover NBA, ABA and BAA PlayOffs series. Next, I split this initial table into three separate dataframes, one for each league.

In [6]:
df.League.value_counts()

NBA    807
ABA     63
BAA     19
Name: League, dtype: int64

In [7]:
# ABA PlayOffs series
aba = df[df["League"] == "ABA"]
print("Years of ABA series:")
print(aba.Year.value_counts().sort_index(), end = "\n\n")

# BAA PlayOffs series
baa = df[df["League"] == "BAA"]
print("Years of BAA series:")
print(baa.Year.value_counts().sort_index())

# NBA PlayOffs series (the League column is constant here, so it is dropped)
nba = df[df["League"] == "NBA"].drop(columns="League")

Years of ABA series:
1968    7
1969    7
1970    7
1971    8
1972    7
1973    7
1974    8
1975    8
1976    4
Name: Year, dtype: int64

Years of BAA series:
1947    5
1948    7
1949    7
Name: Year, dtype: int64
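
The per-league split can also be expressed with `groupby`, which avoids writing one filter per league. This is only a sketch on a hypothetical four-row frame; unlike the cell above, it drops the `League` column from every group, not just from the NBA one.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned table.
toy = pd.DataFrame({
    "Year": [2020, 1976, 1949, 2019],
    "League": ["NBA", "ABA", "BAA", "NBA"],
})

# One DataFrame per league, keyed by the league code.
by_league = {
    league: group.drop(columns="League")
    for league, group in toy.groupby("League")
}

print(sorted(by_league), len(by_league["NBA"]))
```

A dictionary keyed by league scales to any number of leagues, while three named variables are easier to read when the set of leagues is fixed, as it is here.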


Finally, I save these three dataframes as CSV files for use in future notebooks.

In [8]:
aba.to_csv("aba.csv", index = False)
baa.to_csv("baa.csv", index = False)
nba.to_csv("nba.csv", index = False)
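
To see what `index=False` does, and to check that a saved table can be read back unchanged, here is a small round-trip sketch. It writes to an in-memory buffer instead of a file, and the two-row `toy` frame is an assumption built from the first rows shown earlier.

```python
import io

import pandas as pd

# Two rows taken from the preview above, reduced to two columns.
toy = pd.DataFrame({
    "Year": [2020, 2020],
    "Team_win": ["Milwaukee Bucks (1)", "Toronto Raptors (2)"],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
header_line = buffer.getvalue().splitlines()[0]

# Read the same text back and compare it with the original.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(header_line)
print(restored["Year"].tolist() == toy["Year"].tolist())
```

With `index=False` the header holds only the real column names, so later notebooks do not have to deal with a spurious unnamed index column.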