# Data Extraction

### By: Calvin Chen and Matt Hashimoto

Hey everyone, and welcome to Week 2 of the `Balling with Data` project! We're excited to get started with the project, so let's get underway! First, a table of contents about what we'll be covering in this notebook today.

In [2]:
# Standard imports
# If any of these don't work, try doing `pip install _____`, or try looking up the error message.
import numpy as np
import pandas as pd
import json
import time
import os.path
from os import path
import math
import datetime
import unidecode  # third-party (pip install unidecode); used by convert_nba_ncaa_name below
import requests
from bs4 import BeautifulSoup

# Table of Contents
* [Introduction to web-scraping](#section1)
* [What is `sportsreference`?](#section2)
* [Let's get our data!](#section3)
    * [Potentially Useful Classes](#section3a)
    * [Important Things to Know](#section3b)
    * [Sandbox Area](#section3c)

<a id='section1'></a>
# Introduction to Web-Scraping!

Now that we've discussed the project objectives and the kind of data we plan on getting, we can look into different methods of extracting this data from the internet. There are a couple of ways we could go about doing this:

1. Web-scraping
2. API endpoint/Package

The main difference between these two methods is how much someone has prepared the data for us beforehand. In many starter data science projects, you can find the data you need from free online sources/APIs, which makes it easier to get started. On some occasions, however, you won't be able to find a reliable database/data source that has all the components of the data you're looking for. When this happens, you need to be able to find the data yourself. **How would we go about doing that? Let's try web-scraping for [Stephen Curry college stats](https://www.sports-reference.com/cbb/players/stephen-curry-1.html).**

In [5]:
steph_url = 'https://www.sports-reference.com/cbb/players/stephen-curry-1.html'
req = requests.get(steph_url) # This will make a request to steph_url for us!

In [6]:
# Now, we sift through the request's content with a html parser.
soup = BeautifulSoup(req.content, 'html.parser')
soup.prettify()



In [67]:
# Now, we can use the .find method for BeautifulSoup objects to find the data we need from Steph Curry's stats.
# We've done the following below for you because we won't be going too in-depth into this for the project, but
# it's nice/important to know how to do.
table = soup.find('table', {'id': 'players_per_game'})
stats = table.find_all('td')  # find_all is the current name for the older findAll alias
row_stats = [stats[i:i+28] for i in range(0, len(stats), 28)]
last_year = ['Steph Curry'] + [stat.get_text() for stat in row_stats[-2]] # Second-to-last element in row_stats should be the latest yearly averages for the player (right before career stats)
last_year = np.reshape(np.array(last_year), (-1, 29))
last_year


array([['Steph Curry', 'Davidson', 'Southern', '34', '34', '33.7', '9.2',
        '20.2', '.454', '5.4', '10.3', '.519', '3.8', '9.9', '.387',
        '6.5', '7.4', '.876', '0.6', '3.8', '4.4', '5.6', '2.5', '0.2',
        '3.7', '2.4', '28.6', '', '-3.33']], dtype='<U11')

Now, we've gotten Steph Curry's final year stats at Davidson, but what do the different stats mean? Let's find their headers so we put some sense to these numbers.

In [68]:
# Find column headers
cols = table.find_all('th')[1:29] # Column headers (skipping the first header cell)
col_headers = ['Name'] + [col.get_text() for col in cols]
col_headers

['Name',
 'School',
 'Conf',
 'G',
 'GS',
 'MP',
 'FG',
 'FGA',
 'FG%',
 '2P',
 '2PA',
 '2P%',
 '3P',
 '3PA',
 '3P%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'DRB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS',
 '\xa0',
 'SOS']

In [69]:
# Now, let's make a pandas dataframe from this data.
curry = pd.DataFrame(data=last_year, columns=np.array(col_headers))
curry

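| | Name | School | Conf | G | GS | MP | FG | FGA | FG% | 2P | ... | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | | SOS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Steph Curry | Davidson | Southern | 34 | 34 | 33.7 | 9.2 | 20.2 | 0.454 | 5.4 | ... | 3.8 | 4.4 | 5.6 | 2.5 | 0.2 | 3.7 | 2.4 | 28.6 | | -3.33 |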


Congrats! You've successfully scraped together a dataframe of Steph Curry's basketball stats from his final year at Davidson College. The same logic applies to any other NCAA player's page, and it may still be quite useful when we come across **international players**. Unfortunately, for the scope of this project, we won't get into analyzing international players' stats, but you can imagine it'd be a similar process to how we analyzed Steph Curry above.
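To make the "same logic for other players" idea concrete, here's a minimal sketch that wraps the scraping steps in a reusable function. It runs on a tiny made-up HTML table (`SAMPLE_HTML`) so it works offline; the function name `table_to_records` and the `thead`/`tbody` layout are our assumptions, and a real page may need different selectors.

```python
from bs4 import BeautifulSoup

# Made-up miniature of a stats table so the example runs without a network call.
# The first body row is a placeholder; the second reuses values shown above.
SAMPLE_HTML = """
<table id="players_per_game">
  <thead>
    <tr><th>Season</th><th>School</th><th>G</th><th>PTS</th></tr>
  </thead>
  <tbody>
    <tr><th>2000-01</th><td>Example State</td><td>30</td><td>10.0</td></tr>
    <tr><th>2008-09</th><td>Davidson</td><td>34</td><td>28.6</td></tr>
  </tbody>
</table>
"""


def table_to_records(html, table_id):
    """Return one dict per body row of the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        raise ValueError(f"no table with id {table_id!r}")
    # Header text comes from the <th> cells in the table head.
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    records = []
    for row in table.find("tbody").find_all("tr"):
        # Body rows mix a leading <th> (the season) with <td> stat cells.
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(SAMPLE_HTML, "players_per_game")
print(records[-1])
```

Pairing each header with its cell via `zip` avoids hard-coding the number of columns, so the function keeps working if a table has more or fewer stats.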

Now, let's get into a free sports API that'll abstract all this scraping away for all the different types of websites we might encounter, and allow us to access all the different player data in a friendly format. **Let's get into what `sportsreference` can do for us!**

<a id='section2'></a>
# What is `sportsreference`?

Now that we've seen how web-scraping works fundamentally, let's work with an API that will abstract that all away for us and give us the ability to easily query for different players' stats we're interested in!

**Let's visit the [sportsreference documentation](https://sportsreference.readthedocs.io/en/stable/).**

Read through the documentation and get a handle on how the API is structured. Afterwards, we'll get into a couple of different exercises, and then leave the rest for you guys to handle!

In [72]:
# Modules from sportsreference.ncaab for college basketball
from sportsreference.ncaab.boxscore import Boxscore as NCAAB_Boxscore
from sportsreference.ncaab.conferences import Conferences as NCAAB_Conferences
from sportsreference.ncaab.rankings import Rankings as NCAAB_Rankings
from sportsreference.ncaab.roster import Player as NCAAB_Player
from sportsreference.ncaab.roster import Roster as NCAAB_Roster
from sportsreference.ncaab.schedule import Schedule as NCAAB_Schedule
from sportsreference.ncaab.teams import Teams as NCAAB_Teams

# Modules from sportsreference.nba for NBA basketball
from sportsreference.nba.boxscore import Boxscore as NBA_Boxscore
from sportsreference.nba.roster import Player as NBA_Player
from sportsreference.nba.roster import Roster as NBA_Roster
from sportsreference.nba.schedule import Schedule as NBA_Schedule
from sportsreference.nba.teams import Teams as NBA_Teams

**Exercise 1:** Find all the different teams' abbreviations in the NBA in 2011.

In [150]:
# TODO

**Exercise 2:** Find all the players that played for the Golden State Warriors in the past 3 years.

In [151]:
# TODO

**Exercise 3:** Find all the players that played for Cal Basketball and UCLA from 2015 to 2018.

In [152]:
# TODO

<a id='section3'></a>
# Let's Get Our Data!

Now that you've been able to tinker around with a little bit of the package, try to figure out how you might be able to get the data we need for the project! We've provided the classes below to help with what we're trying to find, but tinker around and see what kind of things you come across!

To reiterate our project objective, and in turn, what we need from our data, we want to:

**Predict the 2019-2020 NBA Rookie statlines and compare those to their current statlines, given the past 10 years' worth of NBA rookie + NCAA basketball data.**

<a id='section3a'></a>
## Potentially Useful Classes

In [155]:
# This class wraps around the NBA_Player class, and adds a 'first_year' property that returns the year the given
# player first played in the NBA, as a datetime.date object (January 1st of that year).
class New_NBA_Player(NBA_Player):
    def __init__(self, player_id):
        NBA_Player.__init__(self, player_id)
        self._first_year = datetime.datetime.strptime(self._season[0][:4], '%Y').date()
    
    @property
    def first_year(self):
        return self._first_year
    

In [156]:
# This class wraps around the NCAAB_Player class, and adds a 'last_year' property that returns the year the
# respective NCAAB player last played in the NCAA, as a datetime.date object (January 1st of that year).
class New_NCAAB_Player(NCAAB_Player):
    def __init__(self, player_id):
        NCAAB_Player.__init__(self, player_id)
        self._last_year = datetime.datetime.strptime(self._most_recent_season[:4], '%Y').date()
    
    @property
    def last_year(self):
        return self._last_year

<a id='section3b'></a>
## Important Things To Know

1. The last digit of a `player_id` indicates which instance of that name the player is. For example, `stephen-curry-2` would be the second player named 'Stephen Curry'. This can get incredibly annoying when translating player data from the NBA to the NCAA, as there are many more players (and more possible name collisions) in the NCAA than in the NBA.

2. Date objects are comparable. Think about what this implies for the new classes above. (Hint: Can a player play in NCAA basketball after playing in the NBA?)

In [166]:
year_2009 = datetime.datetime.strptime('2009', '%Y').date()
year_2009

datetime.date(2009, 1, 1)

In [167]:
year_2008 = datetime.datetime.strptime('2008', '%Y').date()
year_2008

datetime.date(2008, 1, 1)

In [168]:
year_2009 > year_2008

True
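Following the hint, here's a minimal sketch of how that comparison could filter out wrong name matches. The helper names `year_to_date` and `plausible_college_match` are hypothetical, and using `<=` rather than `<` is our assumption to stay tolerant of how season strings get truncated to a year.

```python
import datetime


def year_to_date(year_text):
    """Parse a 4-digit year string the same way the cells above do."""
    return datetime.datetime.strptime(year_text, "%Y").date()


def plausible_college_match(ncaa_last_year, nba_first_year):
    """A college career should not end after the NBA career begins."""
    return ncaa_last_year <= nba_first_year


nba_first = year_to_date("2009")
print(plausible_college_match(year_to_date("2008"), nba_first))  # True
print(plausible_college_match(year_to_date("2012"), nba_first))  # False
```

With the classes above, the two arguments would be a `last_year` and a `first_year`; a `False` result suggests the NCAA id belongs to a different player with the same name.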

3. It may be easier to start from the NBA players and look up their respective NCAA stats than the other way around (there are fewer NBA players than NCAA players, so potentially fewer queries are needed to find all the data).

4. Take a look at what happens when you try to query into the NCAAB_Player class with an invalid `player_id` and see how you can use this to your advantage!

In [173]:
New_NCAAB_Player('lebron-james-1') # LeBron never went to college.

TypeError: 'NoneType' object is not callable

5. There will be many different cases where this function won't work, and it's up to you to decide what to do about them (e.g. international players didn't play in the NCAA, and players aren't necessarily guaranteed to be the first instance of their name). Feel free to ask us how to deal with these cases, but we mention this to highlight how what you choose to do here can alter how your project fundamentally behaves later on. This doesn't mean any one way is necessarily right (we haven't gone through all the different combinations), but it gives you free rein to take this project into your own hands and determine **what you want your data to be like, and where to get the data from.**
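One way to act on points 4 and 5 is to catch the failure and record it instead of letting it stop the loop. The sketch below is only an illustration: `collect_with_failures` and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for a real constructor such as `New_NCAAB_Player` so the example runs offline.

```python
def collect_with_failures(player_ids, lookup):
    """Call lookup(player_id) for each id, separating successes from failures."""
    found, failed = {}, []
    for player_id in player_ids:
        try:
            found[player_id] = lookup(player_id)
        except Exception as exc:  # broad on purpose: we only know one failure type so far
            failed.append((player_id, type(exc).__name__))
    return found, failed


# Hypothetical stand-in for a real lookup; it fails the way the cell above did.
def fake_lookup(player_id):
    known = {"stephen-curry-1": "Davidson"}
    if player_id not in known:
        raise TypeError("'NoneType' object is not callable")
    return known[player_id]


found, failed = collect_with_failures(["stephen-curry-1", "lebron-james-1"], fake_lookup)
print(found)   # {'stephen-curry-1': 'Davidson'}
print(failed)  # [('lebron-james-1', 'TypeError')]
```

Keeping the failed ids (and why they failed) in their own list lets you look through them afterwards and decide case by case what to do.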

<a id='section3c'></a>
## Sandbox Area

Here's where you'll be extracting all the data you might need for the project. Feel free to tinker around however you please and ask us any questions you might have about anything-- we're more than happy to help you out!

In [174]:
# We provided this helper for convenience, and to try and reduce the number of problem cases you might come across.
def convert_nba_ncaa_name(name):
    """
    Converts the format of the NBA player_id to the NCAA player_id.
    """
    return unidecode.unidecode(name.lower().replace(" ", "-") + "-1")

In [175]:
convert_nba_ncaa_name("Stephen Curry")

'stephen-curry-1'
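Since the `-1` suffix isn't always the right player, one option is to generate a few candidate ids and try them in order. Here's a minimal sketch; `strip_accents` and `candidate_ncaa_ids` are hypothetical names, and we swap in the standard library's `unicodedata` as a rough substitute for `unidecode` (it only removes accents, and doesn't handle punctuation such as apostrophes or periods in names).

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFKD decomposition (rough stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def candidate_ncaa_ids(name, max_suffix=3):
    """Return ids like 'first-last-1', 'first-last-2', ... for one player name."""
    base = strip_accents(name).lower().replace(" ", "-")
    return [f"{base}-{n}" for n in range(1, max_suffix + 1)]


print(candidate_ncaa_ids("Stephen Curry"))
# ['stephen-curry-1', 'stephen-curry-2', 'stephen-curry-3']
```

How many suffixes to try (`max_suffix`) is a judgment call: more candidates mean more queries, but fewer players missed because of name collisions.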

In [187]:
def get_nba_ncaa_10_years():
    """
    Getting the college basketball data for all NBA Players in the past 10 years. Separate all the different 
    failed cases for this function into a different output, so we can analyze them afterwards and see what 
    kinds of data we're missing out on.
    
    Expected output type: Pandas dataframe
    """
    # TODO: replace the Ellipsis placeholders below with your own code.
    data, failed = ..., ...
    return data, failed

In [185]:
data, failed = get_nba_ncaa_10_years()
data.to_csv('all_player_data.csv')  # to_csv returns None when given a path, so there's nothing to assign

AttributeError: 'ellipsis' object has no attribute 'to_csv'
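Once `data` is a real dataframe, one detail to watch when saving it: by default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when you reload the file. Here's a minimal sketch of the difference, using an in-memory buffer and a tiny made-up dataframe instead of the real file.

```python
import io

import pandas as pd

# Tiny stand-in dataframe; the real one would come from get_nba_ncaa_10_years().
df = pd.DataFrame({"Name": ["Steph Curry"], "School": ["Davidson"], "PTS": [28.6]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)     # ['Unnamed: 0', 'Name', 'School', 'PTS']

# index=False leaves the index out, so the columns round-trip unchanged.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['Name', 'School', 'PTS']
```

Whether you keep the index is up to you; if it's just the default 0, 1, 2, ... counter, dropping it keeps the saved file a little cleaner.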

**Congrats! You've gotten all your data!** This is definitely not an easy task, so congratulate yourself on figuring out how `sportsreference` works and getting all the data we need for the project! Next week, we'll get into analyzing the different features of the data, and doing some [data analysis](https://en.wikipedia.org/wiki/Data_analysis) and [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering) to determine which features will be best to use for our project. Stay tuned for more :D