# Intro

NOTE BEFORE USING THIS NOTEBOOK: If you know of a CapFriendly API, please DM me and I will delete this notebook, since retrieving data through an API is far more polite. Using a scraper is a last resort.

By going through this notebook, we will get the players' stats and salaries.

First, we set all the needed filters and conditions on players on the site. Then we use the resulting URL to get the data.

As an example, I will scrape all players over the last 5 seasons.

Since I need the data season by season, I start from the most recent season and work back to the earliest one I need.

The URL used below downloads data for the 2020-2021 season with all parameters and columns specified.

Let's do some coding now.

## Scraping the main table

In [25]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import logging

In [26]:
# the URL produced by the filters and columns chosen on the site
url_v2 = "https://www.capfriendly.com/browse/active/2021?stats-season=2021&display=birthday,country,weightkg,heightcm,draft,slide-candidate,signing-status,expiry-year,performance-bonus,caphit-percent,aav,length,minors-salary,base-salary,skater-individual-advanced-stats,skater-on-ice-advanced-stats,goalie-advanced-stats,type,signing-age,signing-date,arbitration,extension&limits=gp-5-90"

req = requests.get(url_v2)
soup = BeautifulSoup(req.content, "html.parser")  # parse the page's HTML; naming the parser explicitly avoids a bs4 warning
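The long URL above packs the season, the displayed columns and the games-played filter into query parameters. As a minimal sketch, the same kind of URL can be assembled from its parts. The helper name `build_browse_url` and the shortened column list are assumptions made for illustration; they are not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.capfriendly.com/browse/active"


def build_browse_url(year, display_columns, limits="gp-5-90"):
    """Assemble a browse URL from a season year, column names and a filter.

    Hypothetical helper: it only mirrors the structure of the URL used
    in this notebook.
    """
    query = urlencode(
        {
            "stats-season": year,
            "display": ",".join(display_columns),
            "limits": limits,
        },
        safe=",",  # keep the commas between column names unescaped
    )
    return f"{BASE_URL}/{year}?{query}"


# A shortened column list, just to show the shape of the result.
example_url = build_browse_url(2021, ["birthday", "country", "aav"])
print(example_url)
```

Building the URL this way makes it easier to change the season or the column list without editing one very long string by hand.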

In [27]:
df = pd.read_html(url_v2, header=0, index_col=0, na_values=["-"])[0]  # first table on the page; "-" cells become NaN

In [28]:
df.shape

(50, 65)

As we see above, only 50 players were retrieved.

How do we parse the other pages when the URL matches more than 50 players?

### Scraping multiple pages of the main table

In [32]:
info_about_lists = soup.find_all("a", {"class": "whi pagin_r"})  # found via devtools: the links that switch between pages of data

In [34]:
print(info_about_lists)  # all links to other pages of data

[<a class="whi pagin_r" data-val="2" href="/browse/?p=2">2</a>, <a class="whi pagin_r" data-val="3" href="/browse/?p=3">3</a>, <a class="whi pagin_r" data-val="18" href="/browse/?p=18">Last</a>]


In [35]:
last_list_num = int(info_about_lists[-1]["data-val"])  # the last link's data-val holds the last page number, so we know how many pages there are

In [36]:
print(last_list_num)  # check that 18 is the last page number we got

18
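Taking `info_about_lists[-1]` assumes that at least one pagination link exists, so it would raise an `IndexError` on a result that fits on a single page. The sketch below is one possible way to guard against that. The function name `last_page_number` and the fallback value of 1 are assumptions, and the sample HTML is copied from the output printed above.

```python
from bs4 import BeautifulSoup


def last_page_number(html):
    """Return the highest page number found in the pagination links.

    Falls back to 1 when no links are present (assumed to mean that
    all results fit on a single page).
    """
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", {"class": "whi pagin_r"})
    page_numbers = [
        int(link["data-val"])
        for link in links
        if link.get("data-val", "").isdigit()
    ]
    return max(page_numbers, default=1)


# Pagination markup as printed earlier in this notebook.
sample_html = (
    '<a class="whi pagin_r" data-val="2" href="/browse/?p=2">2</a>'
    '<a class="whi pagin_r" data-val="3" href="/browse/?p=3">3</a>'
    '<a class="whi pagin_r" data-val="18" href="/browse/?p=18">Last</a>'
)

print(last_page_number(sample_html))             # 18
print(last_page_number("<p>no pagination</p>"))  # 1
```

Using `max` instead of the last element also avoids relying on the order in which the links appear on the page.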


Now we can use a for loop to parse the data on all pages.

### Final Block of Code

In [39]:
req = requests.get(url_v2)
soup = BeautifulSoup(req.content, "html.parser")  # parse the page's HTML

info_about_lists = soup.find_all("a", {"class": "whi pagin_r"})  # found via devtools: the links that switch between pages of data
last_list_num = int(info_about_lists[-1]["data-val"])  # the last link's data-val holds the last page number

pages_dfs = []

for page_num in range(1, last_list_num + 1):

    print(f"Start scraping page {page_num}")

    time.sleep(1)  # pause between requests so we do not hammer the site

    url = url_v2 + f"&pg={page_num}"  # request a specific page by adding the pg parameter to the URL
    df = pd.read_html(url, header=0, index_col=0, na_values=["-"])[0]

    df = df.reset_index()  # to have the player name as a separate column

    print(df.shape[0], f"rows were retrieved from page number {page_num}")

    pages_dfs.append(df)


result_df = pd.concat(pages_dfs)

Start scraping page 1
50 rows were retrieved from page number 1
Start scraping page 2
50 rows were retrieved from page number 2
Start scraping page 3
50 rows were retrieved from page number 3
Start scraping page 4
50 rows were retrieved from page number 4
Start scraping page 5
50 rows were retrieved from page number 5
Start scraping page 6
50 rows were retrieved from page number 6
Start scraping page 7
50 rows were retrieved from page number 7
Start scraping page 8
50 rows were retrieved from page number 8
Start scraping page 9
50 rows were retrieved from page number 9
Start scraping page 10
50 rows were retrieved from page number 10
Start scraping page 11
50 rows were retrieved from page number 11
Start scraping page 12
50 rows were retrieved from page number 12
Start scraping page 13
50 rows were retrieved from page number 13
Start scraping page 14
50 rows were retrieved from page number 14
Start scraping page 15
50 rows were retrieved from page number 15
Start scraping page 16
50 ro
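Because each page's dataframe gets a fresh index from `reset_index()`, concatenating the pages repeats the same index labels for every page. The sketch below shows the effect on two tiny stand-in frames and one possible remedy, `ignore_index=True`. The column names and values are made up for illustration and are not taken from the scraped table.

```python
import pandas as pd

# Two tiny stand-ins for per-page dataframes (hypothetical data).
page_1 = pd.DataFrame({"PLAYER": ["A", "B"], "GP": [56, 50]})
page_2 = pd.DataFrame({"PLAYER": ["C", "D"], "GP": [48, 30]})

# Plain concatenation keeps each page's own index, so labels repeat.
stacked = pd.concat([page_1, page_2])
print(stacked.index.tolist())  # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 to n - 1.
clean = pd.concat([page_1, page_2], ignore_index=True)
print(clean.index.tolist())  # [0, 1, 2, 3]
```

A unique index matters mostly for label-based lookups such as `.loc[0]`, which would otherwise return one row per page.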

Now let's extend the code to download cap hit data for multiple seasons into one dataframe.

## Scraping cap hit data season by season

In [41]:
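logging.basicConfig(level=logging.INFO)  # the default WARNING level would hide the logging.info messages below

seasons_dfs = []  # each season is stored as a separate dataframe first

for year in range(2016, 2022):  # years 2016 through 2021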

    url_start = f"https://www.capfriendly.com/browse/active/{year}?stats-season={year}&display=birthday,country,weightkg,heightcm,draft,slide-candidate,signing-status,expiry-year,performance-bonus,caphit-percent,aav,length,minors-salary,base-salary,skater-individual-advanced-stats,skater-on-ice-advanced-stats,goalie-advanced-stats,type,signing-age,signing-date,arbitration,extension&limits=gp-5-90"

    req = requests.get(url_start)

    soup = BeautifulSoup(req.content, "html.parser")

    info_about_lists = soup.find_all("a", {"class": "whi pagin_r"})
    last_list_num = int(info_about_lists[-1]["data-val"])  # the last link's data-val holds the last page number

    time.sleep(3)  # let's scrape politely

    pages_dfs = []

    for page_num in range(1, last_list_num + 1):

        logging.info(f"Start season {year} scraping page {page_num}")

        time.sleep(2)  # pause between requests so we do not hammer the site

        url = url_start + f"&pg={page_num}"
        df = pd.read_html(url, header=0, index_col=0, na_values=["-"])[0]

        pages_dfs.append(df)

    one_season_df = pd.concat(pages_dfs)

    one_season_df = one_season_df.reset_index()

    one_season_df['season_start_year'] = year

    seasons_dfs.append(one_season_df)

total_df = pd.concat(seasons_dfs)
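The intro describes the URL containing 2021 as the 2020-2021 season, which suggests that the year in the URL is the year in which the season ends. If that reading is right, the `season_start_year` column above actually holds the end year. The sketch below derives an unambiguous label under that assumption; the helper name `season_label` is hypothetical.

```python
def season_label(url_year):
    """Turn the year used in the URL into a 'start-end' season label.

    Assumes the URL year is the season's end year, as the intro's
    2020-2021 example suggests.
    """
    return f"{url_year - 1}-{url_year}"


# Labels for the same years the loop above iterates over.
labels = {year: season_label(year) for year in range(2016, 2022)}
print(labels[2021])  # 2020-2021
```

Such a label could be stored next to the numeric year so that later analysis does not depend on remembering which convention the column follows.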

