## Team: The Untouchables
#### Members: Gerdin Ventura Croussett, Frank Choukouali Noumbissie, Armando Taveras
#### Introduction / Motivation: Predict whether a college basketball player will be drafted into the NBA based on their college stats. (incomplete intro)

### Step 1 - Importing Dependencies
#### The first step is to import the libraries that will let us obtain, modify, and visualize the data efficiently. We will use pandas and numpy to clean and modify the data, seaborn to create data visualizations, and requests to send GET requests to the server the data is stored on. We also use json, re (regular expressions), and Beautiful Soup to scrape and parse the data into a neat, readable table format.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import requests
import json
import re
from bs4 import BeautifulSoup

### Step 2 - Generating URL Scrape Info
#### Now that we have all our dependencies, the next step is to figure out the endpoint (URL) we will request the data from. The website <a href="https://www.basketball-reference.com" target="_blank">basketball-reference</a> has an abundance of basketball data, including NBA Drafts, Season Leaders, Team Stats, Individual Player Stats, and more! We will use this website as our data source because it is updated regularly and offers a wide variety of information to draw insights from.

#### Since the objective of this project is to predict whether a college basketball player will be drafted into the NBA, we will be looking at the draft data, which can be found <a href="https://www.basketball-reference.com/draft/" target="_blank">here</a>. Note: the draft data is split by year and goes back to 1947. College basketball has changed rapidly over the years, with players improving at an impressive rate, so we will only focus on the drafts from 2000 through 2021. This removes some noise from the data set, since player statistics from earlier eras tended to be lower and are harder to compare with today's.

#### Now that we know what our data is and why we are selecting it, let's find the URL of the site we will scrape. Start off by selecting different draft years to see how the URL changes. What do you notice? There is a base URL of 'https://www.basketball-reference.com/draft/', followed by `NBA_{year}.html`, where {year} is the draft year we selected. For example, for the 2021 draft, the URL is `https://www.basketball-reference.com/draft/NBA_2021.html`. Finally, let's create a list of URLs, one for each year's draft that we want to scrape.

In [2]:
# Years we'll be scraping
years_list = []

# Populate the list with the years
for year in range(2000, 2022):
    years_list.append("NBA_" + str(year))
    
# Create a list of all the url's we will scrape
draft_url_list = []
for year in years_list:
    URL = "https://www.basketball-reference.com/draft/" + year + ".html"
    draft_url_list.append(URL)

# Print the results
draft_url_list

['https://www.basketball-reference.com/draft/NBA_2000.html',
 'https://www.basketball-reference.com/draft/NBA_2001.html',
 'https://www.basketball-reference.com/draft/NBA_2002.html',
 'https://www.basketball-reference.com/draft/NBA_2003.html',
 'https://www.basketball-reference.com/draft/NBA_2004.html',
 'https://www.basketball-reference.com/draft/NBA_2005.html',
 'https://www.basketball-reference.com/draft/NBA_2006.html',
 'https://www.basketball-reference.com/draft/NBA_2007.html',
 'https://www.basketball-reference.com/draft/NBA_2008.html',
 'https://www.basketball-reference.com/draft/NBA_2009.html',
 'https://www.basketball-reference.com/draft/NBA_2010.html',
 'https://www.basketball-reference.com/draft/NBA_2011.html',
 'https://www.basketball-reference.com/draft/NBA_2012.html',
 'https://www.basketball-reference.com/draft/NBA_2013.html',
 'https://www.basketball-reference.com/draft/NBA_2014.html',
 'https://www.basketball-reference.com/draft/NBA_2015.html',
 'https://www.basketball

#### Great! We now have a list of URLs to scrape our data from. Click on one of the links and look at the information provided. The table at each URL includes the player's name, the college they attended (if any), and their NBA stats. Now, let's create a helper function that uses requests, Beautiful Soup, and pandas to request the data from a given URL and return it as a neat table. Not every drafted player attended college; some played overseas, for example. We must therefore remove the entries for players who did not play college basketball. We must also replace special characters in the player names, since some names contain characters that would break the URLs we build later.

In [3]:
# Helper function that scrapes draft data
def scrape_draft_data(url_to_req):
    # Send a GET request and save the response object
    r = requests.get(url=url_to_req)

    # Raw HTML of the page (bytes, not JSON)
    data = r.content

    # Parse the HTML and collect every <table> element
    soup = BeautifulSoup(data, "html.parser")
    tb = soup.find_all("table")

    # read_html returns a list of tables, so index 0 gives us the draft table
    table = pd.read_html(str(tb[0]), header=1)[0]

    # Keep only players who attended college, and drop the extra non-player rows
    table = table[table['College'].notna()]
    table = table[table['Rk'].notna()]

    # Replace special characters that break the URL (overseas players)
    table = table.replace({'Player': {'č': 'c', 'Í': 'I', 'š': 's'}}, regex=True)

    return table

#### Now that our helper function is created, let's loop through our list of draft URLs and call our scrape_draft_data function, which returns a table for each one. We store each table in a list, then concatenate all the tables in the list into one table with all the data.

In [4]:
tables_list = []
for url in draft_url_list:
    table = scrape_draft_data(url)        # Scrape the url and get the table
    tables_list.append(table)             # Add the table to list of tables so we can merge later
joined_tables = pd.concat(tables_list)    # Merge all our tables together
joined_tables                             # Display the table

Unnamed: 0,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP.1,PTS.1,TRB.1,AST.1,WS,WS/48,BPM,VORP
0,1,1,NJN,Kenyon Martin,Cincinnati,15,757,23134,9325,5159,...,.234,.629,30.6,12.3,6.8,1.9,48.0,.100,0.1,12.1
1,2,2,VAN,Stromile Swift,LSU,9,547,10804,4582,2535,...,.074,.699,19.8,8.4,4.6,0.5,21.3,.095,-1.6,1.1
3,4,4,CHI,Marcus Fizer,Iowa State,6,289,6032,2782,1340,...,.191,.691,20.9,9.6,4.6,1.2,2.7,.022,-3.7,-2.6
4,5,5,ORL,Mike Miller,Florida,17,1032,27812,10973,4376,...,.407,.769,26.9,10.6,4.2,2.6,60.7,.105,0.8,19.8
5,6,6,ATL,DerMarr Johnson,Cincinnati,7,344,5930,2121,769,...,.336,.789,17.2,6.2,2.2,0.9,6.4,.052,-1.6,0.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,55,55,OKC,Aaron Wiggins,Maryland,1,50,1209,416,178,...,.304,.729,24.2,8.3,3.6,1.4,1.2,.048,-4.3,-0.7
57,56,56,CHO,Scottie Lewis,Florida,1,2,7,1,0,...,,.500,3.5,0.5,0.0,0.5,0.0,.164,6.0,0.0
58,57,57,CHO,Balsa Koprivica,Florida State,,,,,,...,,,,,,,,,,
59,58,58,NYK,Jericho Sims,Texas,1,41,555,90,169,...,,.414,13.5,2.2,4.1,0.5,1.5,.128,-1.7,0.0


#### Awesome! We have successfully scraped all the data from the 2000-2021 NBA Drafts. A total of 1,037 college basketball players were drafted across those 22 drafts, which works out to about 47 per year on average. Earlier we mentioned that the stats displayed in this table are the players' NBA stats. Since college basketball and the NBA are different leagues, and performing consistently well is significantly harder in the NBA, NBA stats are not a fair basis for predicting who gets drafted. Instead, we must find the college stats of the drafted players and use those to create and train our model.
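
#### A per-year count like the one above can be computed directly if each year's table is tagged with its draft year before concatenating. The sketch below is only an illustration: it uses made-up stand-in tables, and `tables_by_year` and the `Draft_Year` column are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for two scraped draft tables (made-up rows)
tables_by_year = {
    2000: pd.DataFrame({"Pk": [1, 2], "Player": ["Player A", "Player B"]}),
    2001: pd.DataFrame({"Pk": [1], "Player": ["Player C"]}),
}

# Tag every row with the draft year it came from
tagged = [table.assign(Draft_Year=year) for year, table in tables_by_year.items()]

# ignore_index=True gives each row a unique index instead of repeating 0, 1, 2, ...
combined = pd.concat(tagged, ignore_index=True)

# Number of drafted college players per year, and the average across years
players_per_year = combined.groupby("Draft_Year").size()
average_per_year = players_per_year.mean()
```

#### Keeping the year on each row also makes it possible to study how draft patterns change over time, which the concatenated table alone cannot show.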

#### To get the college stats for a drafted player, we must scrape that data from another source. The source we will use is <a href="https://www.sports-reference.com/" target="_blank">sports-reference</a>, which is another site from the same company as our original source. The college basketball player stats endpoint is https://www.sports-reference.com/cbb/players/ followed by the player's name separated by hyphens (-), the number 1 (after the player's name), and the `.html` extension. For example, the URL for the college basketball player (that was drafted to the NBA) Scottie Lewis is https://www.sports-reference.com/cbb/players/scottie-lewis-1.html

**Note: The player's name in the URL must be typed in all lowercase letters or else the link will not work!**

In [15]:
# Create list of URL's for college stat scraper
stats_url_list = []
for (index, row) in joined_tables.iterrows():
    player_name = row["Player"].replace(' ', '-').lower()                                # Replace the space with - for the link we need to scrape
    overall_pick = row["Pk"]                                                             # Get what place they were drafted
    url = "https://www.sports-reference.com/cbb/players/" + player_name + "-1.html"      # new url with college stats
    stats_url_list.append((url, overall_pick))                                           # Add a tuple (url, pick) to our list
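
#### The name-to-URL rule described above can also be written as a small reusable function. This is a minimal sketch using only the standard library; `player_slug` and `player_stats_url` are hypothetical helpers, and the handling of accents and punctuation (stripping them) is an assumption about how the site builds its URLs, not something confirmed by the source.

```python
import re
import unicodedata

BASE_URL = "https://www.sports-reference.com/cbb/players/"

def player_slug(name):
    """Turn a player name into a lowercase, hyphen-separated slug."""
    # Split accented characters into a base letter plus a combining mark,
    # then drop everything that is not plain ASCII (the marks)
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    # Assumption: punctuation such as apostrophes and periods is dropped
    cleaned = re.sub(r"[^a-z0-9\s-]", "", ascii_name.lower())
    # Collapse runs of spaces or hyphens into a single hyphen
    return re.sub(r"[\s-]+", "-", cleaned.strip())

def player_stats_url(name, suffix=1):
    """Build the college stats URL; suffix=1 matches the pattern used above."""
    return f"{BASE_URL}{player_slug(name)}-{suffix}.html"

example_url = player_stats_url("Scottie Lewis")
```

#### A general accent-stripping step like this covers more names than a fixed list of character replacements, but the resulting URLs should still be spot-checked, since players who share a name may not use the `-1` suffix.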

#### Similarly, we'll create a helper function to scrape a player's college stats. Because the two data sources are laid out differently, we can't reuse our previous function, but keeping a separate helper makes the code more readable and organized.

In [16]:
def scrape_player_stats(url_to_req):
    # Send a GET request and save the response object
    r = requests.get(url=url_to_req)

    # Raw HTML of the page (bytes, not JSON)
    data = r.content

    # Parse the HTML and collect every <table> element
    soup = BeautifulSoup(data, "html.parser")
    tb = soup.find_all("table")

    # Skip the link when the page has no tables (no college stats)
    if not tb:
        return pd.DataFrame()

    # read_html returns a list with one DataFrame per table it finds
    table = pd.read_html(str(tb), header=0)

    if len(table) == 0:
        return pd.DataFrame()

    # The first table on the page is the one we want
    table = table[0]

    # Only keep the career averages; copy so callers can add columns safely
    table = table[table['Season'] == "Career"].copy()

    return table
    
# Test function on one player
t = scrape_player_stats("https://www.sports-reference.com/cbb/players/jalen-suggs-1.html")
t

Unnamed: 0,Season,School,Conf,G,GS,MP,FG,FGA,FG%,2P,2PA,2P%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Unnamed: 27,SOS
1,Career,Gonzaga,,30,30,28.9,5.2,10.3,0.503,4.0,6.8,0.588,1.2,3.5,0.337,2.9,3.8,0.754,0.6,4.7,5.3,4.5,1.9,0.3,2.9,2.6,14.4,,5.92


#### Now that our helper function is created, let's loop through our list of college stats URLs and call our scrape_player_stats function, which returns a table for each player. We attach the player's overall draft pick to each table, store the tables in a list, then concatenate them into one table with all the data.

In [17]:
college_stats_list = []

for (url, overall_pick) in stats_url_list:
    res = scrape_player_stats(url)
    res["Overall_Pick"] = overall_pick
    college_stats_list.append(res)

pd.set_option('display.max_columns', None)        # shows all cols 
college_stats_table = pd.concat(college_stats_list)  
college_stats_table

Unnamed: 0,Season,School,Conf,G,GS,MP,FG,FGA,FG%,2P,2PA,2P%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Unnamed: 27,SOS,Overall_Pick,Unnamed: 25,Unnamed: 19
4,Career,Cincinnati,,116,97,25.0,4.4,7.5,0.586,4.4,7.5,0.59,0.0,0.1,0.222,2.2,3.7,0.581,,,7.5,1.2,1.1,2.5,1.6,2.9,11.0,,6.71,1,,
2,Career,LSU,,50,40,26.6,5.1,9.0,0.56,4.9,8.4,0.585,0.2,0.7,0.242,3.1,5.1,0.613,,,7.0,0.7,1.3,2.6,2.5,2.6,13.4,,5.92,2,,
3,Career,Iowa State,,97,91,31.1,7.1,13.9,0.511,6.9,13.2,0.524,0.2,0.7,0.292,4.4,6.3,0.702,,,7.4,1.0,0.9,0.9,2.3,2.6,18.9,,6.3,4,,
2,Career,Florida,,65,56,26.7,4.5,9.2,0.483,3.3,5.8,0.565,1.2,3.5,0.345,3.1,4.4,0.718,,,6.0,2.3,1.2,0.3,2.2,1.9,13.3,,7.1,5,,
1,Career,Cincinnati,,32,32,27.5,4.4,9.2,0.478,2.8,4.8,0.575,1.6,4.4,0.371,2.2,3.0,0.737,,,3.8,1.4,1.0,0.9,1.4,2.0,12.6,,8.5,6,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3,Career,Maryland,,96,50,28.2,3.9,9.6,0.407,2.1,4.6,0.456,1.8,5.0,0.361,1.4,1.8,0.769,1.1,3.6,4.6,1.6,0.9,0.4,1.4,1.8,11.0,,10.86,55,,
2,Career,Florida,,51,31,27.6,2.7,6.2,0.443,2.0,4.1,0.493,0.7,2.1,0.343,2.0,2.7,0.759,0.9,2.5,3.4,1.1,1.4,1.1,1.4,2.3,8.2,,9.46,56,,
2,Career,Florida State,,51,20,14.6,2.7,4.3,0.632,2.7,4.3,0.63,0.0,0.0,1.0,1.3,1.9,0.677,1.6,2.3,3.9,0.5,0.3,0.8,1.0,2.0,6.8,,8.4,57,,
4,Career,Texas,,119,77,20.5,2.7,4.3,0.639,2.7,4.3,0.64,0.0,0.0,0.0,1.2,2.2,0.524,1.8,3.6,5.4,0.4,0.4,0.8,1.2,2.2,6.6,,10.27,58,,
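
#### The loop above sends roughly a thousand requests back to back, and a single request that raises an error would stop it partway through. Below is a minimal sketch of a more defensive loop, assuming a pause between requests is acceptable. `scrape_all`, `delay_seconds`, and `sleep_fn` are hypothetical names, and the 3-second default is an arbitrary choice rather than a documented limit.

```python
import time

def scrape_all(urls, scrape_fn, delay_seconds=3.0, sleep_fn=time.sleep):
    """Call scrape_fn on each URL, pausing between calls and recording failures."""
    results = []   # (url, scraped value) pairs for calls that succeeded
    failures = []  # (url, exception) pairs for calls that raised
    for i, url in enumerate(urls):
        if i > 0:
            sleep_fn(delay_seconds)  # pause before every request except the first
        try:
            results.append((url, scrape_fn(url)))
        except Exception as exc:
            failures.append((url, exc))
    return results, failures

# Demo with a stand-in scraper so nothing is sent over the network
def fake_scrape(url):
    if "missing" in url:
        raise ValueError("no page for " + url)
    return url.upper()

demo_results, demo_failures = scrape_all(
    ["player-a", "missing-player", "player-b"],
    fake_scrape,
    delay_seconds=0,
)
```

#### In the notebook, the same function could be called with the URLs from `stats_url_list` and `scrape_player_stats` as `scrape_fn`; the `failures` list then shows which players need a second look.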


In [None]:
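# Open questions to explore next:
# - Which school produces the most NBA draft picks?
# - Which stats (FG%, 3P%, FT%, AST, STL, BLK, PTS) improve a player's chances the most?
#   Check which variable correlates most strongly with draft position and
#   weight that one more heavily in the predictive algorithm.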
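
#### Both questions above can be explored with a few lines of pandas. The sketch below runs on a made-up toy table shaped like `college_stats_table`; the values are invented for illustration, and using Pearson correlation with `Overall_Pick` is one possible way to compare variables, not a method taken from the notebook.

```python
import pandas as pd

# Hypothetical toy rows shaped like college_stats_table (numbers are made up)
toy = pd.DataFrame({
    "School": ["A", "A", "B", "C"],
    "PTS": [20.0, 15.0, 10.0, 5.0],
    "AST": [1.0, 4.0, 2.0, 3.0],
    "Overall_Pick": [1, 10, 30, 55],
})

# Which school produced the most drafted players?
school_counts = toy["School"].value_counts()

# Correlation of every numeric stat with draft position
numeric = toy.select_dtypes(include="number")
corr_with_pick = numeric.corr()["Overall_Pick"].drop("Overall_Pick")

# Rank stats by the strength of the relationship, ignoring its sign
ranked = corr_with_pick.abs().sort_values(ascending=False)
```

#### A negative correlation here means higher values of the stat go with earlier (lower-numbered) picks, so the sign matters when interpreting the ranking.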