# Notebook 1 - Data Collection

##### Alessandro DeChellis


## Introduction

The goal of this report is to analyze how NHL team payrolls affect regular season and playoff success. To do this, we first need two tables: one for player salaries and one for team rosters by year. 

This will be done through web scraping from www.capfriendly.com and www.hockey-reference.com. 

*** <b> DISCLAIMER: DO NOT RUN THIS NOTEBOOK, IT IS FOR READING AND CLARIFICATION PURPOSES ONLY </b> ***

We first import our libraries

In [2]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# For scraping
import requests
from bs4 import BeautifulSoup

# For adding delays so that we don't spam requests
import time
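
The `time` import is there so that requests can be spaced out. As a minimal sketch of that idea (the helper name `fetch_all`, its `get` parameter and the default delay are assumptions for illustration, not part of the original notebook), a loop can pause between requests like this:

```python
import time
from typing import Callable, Iterable, List


def fetch_all(urls: Iterable[str],
              get: Callable[[str], str],
              delay_seconds: float = 2.0) -> List[str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first one
            time.sleep(delay_seconds)
        pages.append(get(url))
    return pages


# Offline demonstration with a stand-in for a real HTTP call
def fake_get(url: str) -> str:
    return f"<html>{url}</html>"


demo_pages = fetch_all(["page-1", "page-2"], fake_get, delay_seconds=0.01)
print(len(demo_pages))
```

With `requests`, the `get` argument could be something like `lambda url: requests.get(url).text`; the right delay depends on the site and is left as a parameter here.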

We start by setting up four empty lists to hold the scraped data

### Salaries

In [51]:
# Set up blank lists
year = []
name = []
position = []
cap_hit = []

#### Looping Through Seasons

We then loop over the result pages for one season. We reran this cell by hand once per season (the cell below shows 2021), as we did not want to overwhelm the website. 

In [48]:
# Loop through the pages
for i in range(1,33):
    page = i
    # Get the webpage
    response = requests.get(f'https://www.capfriendly.com/browse/active/2021?age-calculation-date=october1&hide=team,clauses,age,handed,expiry-status,salary,skater-stats,goalie-stats&pg={page}')
    # Transform the response
    soup = BeautifulSoup(response.content)
    #Find the table we are looking for
    table = soup.find(id='brwt')
    # Find all entries in the table (100 entries)
    team_position_salary = table.find_all('td', class_='center')
    # Divide this into sets of 2 (position and cap hit) and make it a list
    players = [team_position_salary[x:x+2] for x in range(0, len(team_position_salary), 2)]
    # Find the player names in the row and make it a list
    player_name =  table.find_all('td', class_='left')

    
    #Loop through the list
    for n in range(0,50):
        # separate the first element of the list and make it the variable position_
        position_ = players[n][0].text.strip()
        # Separate the second element and make it cap_hit_
        cap_hit_ = players[n][1].text.strip()
        # Find the player and make it a variable
        name_ = player_name[n].text.strip()
    
        # Append these variables to the lists with the year
        name.append(name_)
        position.append(position_)
        cap_hit.append(cap_hit_)
        year.append(2021)
    


IndexError: list index out of range

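The error above occurs because the last page has fewer than 50 entries, so the inner loop runs past the end of the list. The rows that do exist on that page are appended before the error is raised. This process was repeated for every year from 2007 to 2021. 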
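
The fixed `range(0, 50)` in the inner loop is what raises the `IndexError` on a short page. One way around it, sketched here on plain lists of cell text rather than on live pages, is to pair up whatever rows exist with `zip`. The function name `pair_rows` and the sample values are made up for illustration:

```python
from typing import List, Sequence, Tuple


def pair_rows(names: Sequence[str],
              centered_cells: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Return (name, position, cap_hit) tuples for however many rows exist."""
    # Each player contributes two centered cells: position, then cap hit
    positions = centered_cells[0::2]
    cap_hits = centered_cells[1::2]
    # zip stops at the shortest input, so a short last page is not a problem
    return list(zip(names, positions, cap_hits))


# A pretend "last page" with only two players instead of 50 (placeholder data)
last_page_names = ["Player A", "Player B"]
last_page_cells = ["C", "$1000000", "LW", "$750000"]

rows = pair_rows(last_page_names, last_page_cells)
print(rows)
```

In the scraping loop, the inputs would be the `.text.strip()` values of the cells found with `find_all`, assuming each row has one name cell and two centered cells as in the cell above.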

#### Creating the dataframe

We now want to combine these lists into a dataframe and save it as a CSV file

In [49]:
# We pass in a dictionary; the keys are the column names we want, and the values are the variables we've created
salaries = pd.DataFrame({'year': year, 'name': name, 'position': position, 'cap_hit': cap_hit})


In [None]:
# Check that this worked
salaries.head()

In [273]:
## Convert this to a csv
salaries.to_csv('data/salaries.csv', index=False)

In [116]:
## Double check that this worked
salaries.head()

| Unnamed: 0 | year | name | position | cap_hit |
|---|---|---|---|---|
| 0 | 2021 | 1. Connor McDavid | C | $12500000 |
| 1 | 2021 | 2. Artemi Panarin | LW | $11642857 |
| 2 | 2021 | 3. Auston Matthews | C | $11640250 |
| 3 | 2021 | 4. Erik Karlsson | RD | $11500000 |
| 4 | 2021 | 5. John Tavares | C | $11000000 |


In [None]:
# Remove the $ 

salaries['cap_hit'] = salaries['cap_hit'].str.replace('$', '', regex=False)

In [285]:
# Change the cap_hit type to int
salaries['cap_hit'] = salaries['cap_hit'].astype(int)
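
A note on the `$` removal: whether `Series.str.replace` reads `"$"` as a literal character or as a regular expression depends on its `regex` argument, and passing `regex=False` makes the literal reading explicit. The difference matters because `$` is an end-of-string anchor in a regular expression. The standard-library sketch below shows both readings on one of the values from the table above:

```python
import re

raw = "$12500000"

# As a regular expression, "$" is an end-of-string anchor, so nothing is removed
as_regex = re.sub("$", "", raw)

# As a literal string, "$" is the dollar character itself
as_literal = raw.replace("$", "")

print(as_regex)
print(as_literal)
print(int(as_literal))
```

Only the literal version leaves a string that `int` can convert.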

In [288]:
salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18042 entries, 0 to 18041
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   year      18042 non-null  int64 
 1   name      18042 non-null  object
 2   position  18042 non-null  object
 3   cap_hit   18042 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 563.9+ KB


### Rosters

We first set up our team list

In [46]:
# Team List
teams = ['CAR', 'NYR', 'NYI', 'NJD', 'CBJ', 'PHI', 'PIT', 'FLA', 'TOR', 'TBL', 'DET', 'BOS', 'BUF', 'MTL', 'OTT',
        'MIN', 'NSH', 'STL', 'COL', 'DAL', 'CHI','EDM', 'CGY', 'LAK', 'SJS', 'VAN', 'ANA', 'WSH']
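
Because each code is inserted directly into a URL, a stray space or a lowercase letter would silently produce a different address. A quick sanity check, assuming every code is meant to be exactly three uppercase letters (the helper name `find_bad_codes` and the sample input are made up for illustration), might look like this:

```python
from typing import Iterable, List


def find_bad_codes(codes: Iterable[str]) -> List[str]:
    """Return codes that are not exactly three uppercase ASCII letters."""
    return [c for c in codes
            if not (len(c) == 3 and c.isascii() and c.isalpha() and c.isupper())]


# Made-up input with two deliberately malformed entries
sample = ["CAR", " NYR", "nyi", "TOR"]
print(find_bad_codes(sample))
```

Running `find_bad_codes(teams)` before the scraping loop should return an empty list if every entry is well formed.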

In [32]:
# Empty Lists
name_1 = []
team_1 = []
season_1 = []

#### Looping Through Teams and Seasons

We set up a nested loop to run through teams and seasons to get rosters from www.hockey-reference.com

In [36]:
# Set up the range of years
years = [2007,2008,2009,2010, 2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021]

# Set up the outer loop over teams, with an inner loop over seasons
for t in teams:
    for n in years:
        # Get the response
        response = requests.get(f'https://www.hockey-reference.com/teams/{t}/{n}.html')
        # Convert with BeautifulSoup
        soup = BeautifulSoup(response.content)
        # Find the table with the rosters
        rost = soup.find('div', class_='table_wrapper', id='all_roster')
        # Find the smaller table
        roster = rost.find('table', id='roster')
        # find the entries in the table
        clean = roster.find('tbody')
        # find all entries into a list
        clean_roster = clean.find_all('td', class_='left')
        # step through the cells three at a time, keeping the first two of each group
        players = [clean_roster[x:x+2] for x in range(0, len(clean_roster), 3)]
        
        # Loop through the list
        for i in range(0, len(players)):
            # find the player name
            player = players[i][0].text.strip()
            #Append the name
            name_1.append(player)
            # append the team name (t)
            team_1.append(t)
            # Append the season (n)
            season_1.append(n)
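
The slicing step in the loop above is easier to follow on a toy list. This sketch assumes each roster row contributes three left-aligned cells with the player name first, which is what the step of 3 suggests; the cell values are placeholders:

```python
# Made-up stand-ins for the text of the left-aligned cells, three per player
cells = ["Player A", "x1", "x2",
         "Player B", "y1", "y2"]

# Same slicing as the roster loop: start every 3 cells, keep 2 of them
groups = [cells[x:x + 2] for x in range(0, len(cells), 3)]
print(groups)

# Only the first item of each group is used, so this is equivalent
names = cells[::3]
print(names)
```

Since only the name is appended, the simpler `[::3]` slice gives the same result under that assumption.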

   

#### Create the dataframe

In [107]:
# Create a dataframe with the lists. 
players = pd.DataFrame({'year': season_1, 'name': name_1, 'team': team_1})

In [108]:
# Check that this worked
players.head()

| Unnamed: 0 | year | name | team |
|---|---|---|---|
| 0 | 2007 | Craig Adams | CAR |
| 1 | 2007 | Kevyn Adams | CAR |
| 2 | 2007 | Keith Aucoin | CAR |
| 3 | 2007 | Anton Babchuk | CAR |
| 4 | 2007 | Ryan Bayda | CAR |


Save the dataframe to a CSV file

In [109]:
# Save to csv
players.to_csv('data/players.csv', index=False)

## Conclusion

In this notebook, we saw that player salaries and team rosters can be obtained through web scraping. The code is rough, as the scraping involved a lot of manual work: the data was in a very rough form and could not be collected with a single, self-contained script. 

At the end of the day, we have what we need and will begin to clean the data in the next notebook: <b> Book2 - Data Cleaning </b>