## Clean and Prepare College Data for College Stat Regression Analysis


The contract value for active players has already been extracted from the NBA API and merged with aggregated NBA stats. For players who were drafted from a college or university, we will get their college statistics and draft information for use in a regression analysis with contract value as the target variable. Unfortunately, the free, publicly available NBA API does not include college stats. It does, however, include draft history (player name, season, round number and pick, team, and the organization the player was drafted from) for all NBA teams. This draft history will be merged with the previously aggregated contract value dataset. From the merged data, we'll create a list of players whose college basketball stats we will scrape from https://www.sports-reference.com/cbb/.

In [1]:
# IMPORT LIBRARIES
import pandas as pd
import numpy as np
import os
import re
import time
import pickle
import requests
#import matplotlib as plt
import seaborn as sns
import statsmodels as sm
import sys
from nba_api.stats.static import teams
from nba_api.stats.static import players
from nba_api.stats.endpoints import drafthistory
import urllib.request as urllib2
import string
from bs4 import BeautifulSoup as bsoup





Before running this notebook, download the associated files locally and define the directory path to where they are stored.
Files to download:

- analysis_df.csv
- all_urls.csv
- draft_df.csv
- Season Totals folder


In [2]:
# DEFINE DIRECTORY PATH TO WHERE NOTEBOOK FILES ARE STORED AND CHANGE DIRECTORY TO PATH 

#path = r"C:\Users\ahigh\OneDrive\Desktop\team24_milestone1\Draft files"

#os.chdir(path)
os.getcwd()

'c:\\Users\\toobr\\OneDrive\\Documents\\mads\\test\\src'

To get draft history from the NBA API, we first need the team IDs. We'll extract the team IDs from the NBA API, use them to pull the draft history for each team, and combine the results into a single dataframe. I'll create a function to accomplish this.

In [3]:
# GET NBA TEAMS - NEED TEAM ID TO  PULL IN DRAFT HISTORY
nba_teams = teams.get_teams()

# GET SIZE OF NBA TEAMS FROM API
print (sys.getsizeof(nba_teams))



312


In [4]:
len(nba_teams)

30

In [5]:

def get_drafthistory(start,stop):
    # FOR EACH TEAM, EXTRACT THE TEAM ID AND PUT IN LIST 
    team_ids = []
    for team in nba_teams:
        team_id = team['id']
        team_ids.append(team_id)
    
    # FOR EACH TEAM ID, CREATE A DF OF DRAFT HISTORY AND APPEND TOGETHER (THIS PRODUCES A DRAFT HISTORY DF FOR EVERY PLAYER ON EACH TEAM WHEN APPLICABLE)

    draft_dfs = []
    for team_id in team_ids[start:stop]:
        a = drafthistory.DraftHistory(team_id_nullable=team_id)
        df = a.get_data_frames()[0]
        draft_dfs.append(df)

    # CONCATENATE ALL TEAMS DRAFT HISTORY DF INTO ONE
        
    draft_df = pd.concat(draft_dfs)
    #draft_df.info(verbose=True)

    return draft_df

In [6]:
# DUE TO API LIMITATIONS, WE'LL NEED TO RUN THE FUNCTION INCREMENTALLY WITH A DELAY BETWEEN EACH RUN TO AVOID SENDING TOO MANY REQUESTS AT ONCE


draft_df0 = get_drafthistory(0,15)
time.sleep(90)
draft_df1 = get_drafthistory(15,31)

# JOINING DFS FROM EACH PULL
draft_df= pd.concat([draft_df0,draft_df1])


In [7]:
# PREVIEW DATAFRAME
draft_df.head(20)

Unnamed: 0,PERSON_ID,PLAYER_NAME,SEASON,ROUND_NUMBER,ROUND_PICK,OVERALL_PICK,DRAFT_TYPE,TEAM_ID,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,ORGANIZATION,ORGANIZATION_TYPE,PLAYER_PROFILE_FLAG
0,1641723,Kobe Bufkin,2023,1,15,15,Draft,1610612737,Atlanta,Hawks,ATL,Michigan,College/University,1
1,1641754,Seth Lundy,2023,2,16,46,Draft,1610612737,Atlanta,Hawks,ATL,Penn State,College/University,1
2,1631100,AJ Griffin,2022,1,16,16,Draft,1610612737,Atlanta,Hawks,ATL,Duke,College/University,1
3,1631157,Ryan Rollins,2022,2,14,44,Draft,1610612737,Atlanta,Hawks,ATL,Toledo,College/University,1
4,1630552,Jalen Johnson,2021,1,20,20,Draft,1610612737,Atlanta,Hawks,ATL,Duke,College/University,1
5,1630536,Sharife Cooper,2021,2,18,48,Draft,1610612737,Atlanta,Hawks,ATL,Auburn,College/University,1
6,1630168,Onyeka Okongwu,2020,1,6,6,Draft,1610612737,Atlanta,Hawks,ATL,Southern California,College/University,1
7,1630219,Skylar Mays,2020,2,20,50,Draft,1610612737,Atlanta,Hawks,ATL,Louisiana State,College/University,1
8,1629637,Jaxson Hayes,2019,1,8,8,Draft,1610612737,Atlanta,Hawks,ATL,Texas,College/University,1
9,1629629,Cam Reddish,2019,1,10,10,Draft,1610612737,Atlanta,Hawks,ATL,Duke,College/University,1


In [8]:
# GET SIZE OF DRAFT_DF

print (sys.getsizeof(draft_df))

4644280


The next few cells modify the original name values to match the name Sports Reference uses in each player's URL. Modifications include:

- use of an abbreviated name in place of the full name
- suffix reconciliation
- players who share a name with another player; they require a different number in their URL
- manually replace some values
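
As a rough illustration of the first two kinds of modification, the sketch below turns a lowercase name value into a URL-style slug. The helper `to_slug`, its `keep_suffix` flag, and the `SUFFIXES` set are hypothetical names introduced for this example and are not part of the notebook. Sports Reference appears to keep the suffix for some players and drop it for others, so a manual override step is still assumed.

```python
import re

# Hypothetical helper (not the notebook's own function): build a
# "first-last" or "first-last-suffix" slug from a "first-last suffix" value.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "v"}

def to_slug(name: str, keep_suffix: bool = True) -> str:
    # Drop periods and apostrophes, e.g. "p.j." -> "pj"
    cleaned = re.sub(r"[.']", "", name.lower().strip())
    parts = cleaned.split()
    if len(parts) == 2 and parts[1] in SUFFIXES:
        # Either hyphenate the suffix or drop it entirely
        return f"{parts[0]}-{parts[1]}" if keep_suffix else parts[0]
    return "-".join(parts)

print(to_slug("tim-hardaway jr."))                    # tim-hardaway-jr
print(to_slug("larry-nance jr.", keep_suffix=False))  # larry-nance
print(to_slug("p.j.-washington"))                     # pj-washington
```

Whether a given player needs `keep_suffix=True` or `False` cannot be inferred from the name alone, which is why checking candidates against the scraped URL list remains useful.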


In [9]:
# MERGE DRAFT DF WITH AGGREGATE DF TO GET PLAYERS AND CURRENT CONTRACT VALUE

#agg_df = pd.read_csv(path + r"\analysis_df.csv")
agg_df = pd.read_csv("../data/analysis_df.csv")
agg_df.head(20)

def merge_agg():
    df = pd.merge(agg_df, draft_df, how='left', left_on='PLAYER_ID', right_on='PERSON_ID')

    # WE ONLY WANT TO KEEP PLAYERS WHO WERE DRAFTED FROM A COLLEGE/UNIVERSITY 

    df1 = df[df['ORGANIZATION_TYPE']=='College/University']

    # SINCE THE AGGREGATE FILE WAS CREATED BY JOINING ON NAMES, ONE NAME ALSO MATCHES A PLAYER WHO IS NOT ACTIVE - HE'LL BE DROPPED
    df1 = df1[df1['PLAYER_ID']!=76526]

    # REMOVE NBA STATS COLUMNS 
    df1 = df1[['Current_Contract','First_Name','Last_Name','TEAM_CITY','TEAM_NAME','TEAM_ABBREVIATION','ORGANIZATION','SEASON','ROUND_NUMBER','ROUND_PICK','OVERALL_PICK']]

    # WE NEED TO ISOLATE PLAYER NAMES TO SCRAPE COLLEGE STATS
    # MOST OF THE URLS FROM SPORTS REFERENCE ARE IN THE FORM "first_name-last_name", BUT THERE ARE EXCEPTIONS THAT NEED TO BE MANUALLY MODIFIED
    # SOME EXAMPLES INCLUDE NAMES WITH SUFFIXES AND INSTANCES WHERE MORE THAN ONE PLAYER HAS THE SAME NAME - WE WILL MODIFY NAMES AFTER EXTRACTING THEM INTO A DICTIONARY TO MAINTAIN
    # THEIR ORIGINAL NAME VALUES

    df1['Player'] = df1['First_Name'] +'-'+ df1['Last_Name']
    df1 = df1[~df1['Player'].isin(['usman-garuba','shaedon-sharpe'])] #THESE PLAYERS HAD NO COLLEGE RECORDS, WE'LL NEED TO REMOVE THEM SO IT DOESN'T CREATE A BAD HTML

    college_players = list(df1['Player'].drop_duplicates())
    college_dict = {}    
    
    return df1, college_players

agg_df1 = merge_agg()[0]
agg_df1.head(20)

Unnamed: 0,Current_Contract,First_Name,Last_Name,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,ORGANIZATION,SEASON,ROUND_NUMBER,ROUND_PICK,OVERALL_PICK,Player
0,51915615.0,stephen,curry,Golden State,Warriors,GSW,Davidson,2009,1.0,7.0,7.0,stephen-curry
1,47649433.0,kevin,durant,Seattle,SuperSonics,SEA,Texas,2007,1.0,2.0,2.0,kevin-durant
3,47607350.0,joel,embiid,Philadelphia,76ers,PHI,Kansas,2014,1.0,3.0,3.0,joel-embiid
5,46741590.0,bradley,beal,Washington,Wizards,WAS,Florida,2012,1.0,3.0,3.0,bradley-beal
7,45640084.0,damian,lillard,Portland,Trail Blazers,POR,Weber State,2012,1.0,6.0,6.0,damian-lillard
8,45640084.0,kawhi,leonard,Indiana,Pacers,IND,San Diego State,2011,1.0,15.0,15.0,kawhi-leonard
9,45640084.0,paul,george,Indiana,Pacers,IND,Fresno State,2010,1.0,10.0,10.0,paul-george
10,45183960.0,jimmy,butler,Chicago,Bulls,CHI,Marquette,2011,1.0,30.0,30.0,jimmy-butler
11,43219440.0,klay,thompson,Golden State,Warriors,GSW,Washington State,2011,1.0,11.0,11.0,klay-thompson
14,40600080.0,anthony,davis,New Orleans,Hornets,NOH,Kentucky,2012,1.0,1.0,1.0,anthony-davis


In [10]:
# GET LIST OF COLLEGE PLAYERS FROM FUNCTION
college_players = merge_agg()[1]


In [11]:
# TO CHECK THAT THE NAME VALUES ARE CORRECT BEFORE SCRAPING THE "PLAYERS TOTALS" TABLES, WE'LL FIRST SCRAPE ALL OF THE URLS AND DETERMINE
# WHICH ONES NEED TO BE MODIFIED 

urls = []
def get_playerurls(start,finish):
    for letter in list(string.ascii_lowercase)[start:finish]:
        #print (letter)
        html_page = urllib2.urlopen(r"https://www.sports-reference.com/cbb/players/{}-index.html".format(letter))
        soup = bsoup(html_page, "html.parser")
        for link in soup.find_all('a'):

            player_url = link.get('href')

            # SKIP ANCHORS WITHOUT AN HREF (link.get RETURNS None FOR THOSE)
            if player_url and '/cbb/players/' in player_url:
                name_value = player_url[13:-7]
                urls.append(name_value)
    time.sleep(90)
    return None


Because scraping the website for every player's unique URL can take a while (5-6 minutes), I've created an alternate cell that reads them in from a local file.
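
The read-from-file shortcut can be generalized into a small cache-or-build pattern. The following is a minimal sketch under stated assumptions: `load_or_build` and its `build_fn` argument are hypothetical names, and the only detail taken from the notebook is the `player_url` column name.

```python
import csv
from pathlib import Path

def load_or_build(cache_path, build_fn):
    """Return cached URL slugs if the CSV exists; otherwise build and cache them.

    Hypothetical helper: `build_fn` stands in for the slow scraping step.
    """
    path = Path(cache_path)
    if path.exists():
        with path.open(newline="", encoding="utf-8") as f:
            return [row["player_url"] for row in csv.DictReader(f)]
    values = build_fn()
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["player_url"])          # header row
        writer.writerows([v] for v in values)    # one slug per row
    return values
```

With this in place, the scrape only runs when the cache file is missing, so re-running the notebook does not send new requests to the server.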

In [None]:
# BECAUSE THE SERVER ONLY ALLOWS 20 REQUESTS PER MINUTE, WE NEED TO DO THIS IN INCREMENTS OF 15 WITH A 90 SECOND DELAY BETWEEN RUNS - ONCE THIS FINISHES, WE'LL HAVE A COMPLETE LIST OF PLAYER URLS FROM
# SPORTS REFERENCE. WE DON'T NEED EVERY PLAYER, JUST THE ONES FROM OUR SOURCE FILE WHO WERE DRAFTED FROM COLLEGE - THIS TAKES ABOUT 5-6 MINUTES TO RUN

# get_playerurls(0,15)
# time.sleep(90)
# get_playerurls(15,27)

# url_df = pd.DataFrame({'player_url':urls})
# url_df

#url_df.to_csv('all_urls.csv')

In [12]:
#url_df = pd.read_csv(path + r"\all_urls.csv", usecols=['player_url'])
url_df = pd.read_csv("../data/all_urls.csv", usecols=['player_url'])
urls1 = list(url_df['player_url'])

urls = urls1.copy()

In [13]:
# NOW, WE NEED TO COMPARE OUR NAME VALUES FROM THE AGGREGATED DF TO THE LINKS SCRAPED FROM SPORTS REFERENCE 
# FOR THE NAME VALUES THAT ALREADY APPEAR IN THE SCRAPED LIST, WE'LL ADD TO A DICTIONARY. FOR THE ONES NOT IN THE SCRAPED LIST
# WE'LL ATTEMPT TO MODIFY THEM TO FIT THE URL NAMING CONVENTION (THEY WON'T ALL MATCH AND SOME WILL HAVE TO BE HARDCODED)
new_name = {}

def fix_names():
    for player in college_players:
        #print(player)
        if player in urls:
            new_name[player] = player
        else:
            p = player.split(' ')
            if len(p) == 2 and (p[1] in ['jr.','sr.']):
                name = p[0] + p[1][0:-1]
            elif len(p) == 2 and (p[1] in['ii','iii','v']):
                name = p[0] + p[1]
            else:
                name = player.replace("'","")
            new_name[player] = name
            
    return None
            


In [14]:
# EXECUTE FUNCTION
fix_names()

In [15]:
# MAKE MANUAL NAME REPLACEMENTS 

def manual_replace():
    new_name['tim-hardaway jr.'] = 'tim-hardaway-jr'
    new_name['ja-morant'] ='temetrius-morant'
    new_name['bam-adebayo'] = 'edrice-adebayo'
    new_name['p.j.-washington'] = 'pj-washington'
    new_name['larry-nance jr.'] = 'larry-nance'
    new_name['nic-claxton'] = 'nicolas-claxton'
    new_name['jabari-smith jr.'] = 'jabari-smith'
    new_name['patty-mills'] = 'patrick-mills'
    new_name['obi-toppin'] = 'obadiah-toppin'
    new_name['otto-porter jr.'] = 'otto-porter'
    new_name['shake-milton'] = 'malik-milton'
    new_name['dereck-lively ii'] = 'dereck-lively-ii'
    new_name['troy-brown jr.'] = 'troy-brown'
    new_name['bones-hyland'] = 'nahshon-hyland'
    new_name['nick-smith jr.'] = 'nick-smith-jr'
    new_name['cam-thomas'] = 'cameron-thomas'
    new_name['kelly-oubre jr.'] = 'kelly-oubre'
    new_name['mo-bamba'] = 'mohamed-bamba'
    new_name['lonnie-walker iv'] = 'lonnie-walker'
    new_name['svi-mykhailiuk'] = 'sviatoslav-mykhailiuk'
    new_name['xavier-tillman sr.'] = 'xavier-tillman'
    new_name['e.j.-liddell'] = 'ej-liddell'
    new_name['hunter-tyson'] = 'tyson-hunter'
    new_name['andre-jackson jr.'] = 'andre-jackson'
    new_name['p.j.-tucker'] = 'pj-tucker'
    
    
    return None

manual_replace()

The next part of the notebook iterates through the values of the new name dictionary to scrape the totals table for each player. The totals table contains college stats for each of the player's college seasons and a row for the total across seasons. This takes a long time to run because it has to be run incrementally for all 340 players. For convenience and ease of reproducibility, the HTML files are provided in the Season Totals folder so the user does not need to regenerate them. The functions below are the ones that originally generated the HTML files.
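
Running incrementally amounts to splitting the player list into fixed-size batches and pausing between them. The sketch below shows one way to express that; `batches` and `run_in_batches` are hypothetical helpers, and the 75-second default simply mirrors the delay used later in this notebook.

```python
import time

def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def run_in_batches(items, size, fetch, delay_seconds=75, sleep=time.sleep):
    """Call `fetch` on each batch, pausing between batches but not after the last."""
    chunks = list(batches(items, size))
    for i, chunk in enumerate(chunks):
        fetch(chunk)
        if i < len(chunks) - 1:
            sleep(delay_seconds)

# Example with a stand-in fetch function and no real waiting
run_in_batches(list(range(7)), 3, print, sleep=lambda s: None)
```

Passing `sleep` as a parameter keeps the pacing logic testable without actually waiting, and generating the batches from the list length avoids hand-maintained index ranges.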

In [16]:
#  CHANGE DIRECTORY TO ACCESS "TOTALS" FOLDER - THIS IS WHERE I'LL WRITE THE HTML FILES TO 

#os.getcwd()  
#os.chdir(r"C:\Users\ahigh\OneDrive\Desktop\team24_milestone1")

# SET DEFAULT URL 
url_start = "https://www.sports-reference.com/cbb/players/{}-1.html"  

college_players1 = list(new_name.values())
totals_dfs = []

def get_total_html(start,finish):
    for player in college_players1[start:finish]:
        if player == 'kyle-anderson':
            #WHEN THERE IS MORE THAN 1 PLAYER WITH THE SAME NAME THE URL HAS A DIFFERENT SUFFIX
            url = "https://www.sports-reference.com/cbb/players/{}-3.html".format(player)
        elif player in ('nick-richards','josh-green','lonnie-walker'):
            url = "https://www.sports-reference.com/cbb/players/{}-2.html".format(player)
        elif player =='troy-brown':
            url = "https://www.sports-reference.com/cbb/players/{}-5.html".format(player)
        elif player =='andre-jackson':
            url = "https://www.sports-reference.com/cbb/players/{}-8.html".format(player)
        else:
            url = url_start.format(player)
            print(url)
        data = requests.get(url)
        with open("totals1/{}.html".format(player), "w+" ) as f:
            f.write(data.text)

    return None 
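
The if/elif chain above could also be written as a lookup table, which may be easier to extend when another duplicate name turns up. This is only a sketch: `URL_SUFFIX`, `BASE_URL`, and `player_url` are names introduced here, and the numbers are copied from the branches in `get_total_html`.

```python
# Disambiguation numbers copied from the branches in get_total_html;
# any player not listed is assumed to use "-1".
URL_SUFFIX = {
    "kyle-anderson": 3,
    "nick-richards": 2,
    "josh-green": 2,
    "lonnie-walker": 2,
    "troy-brown": 5,
    "andre-jackson": 8,
}
BASE_URL = "https://www.sports-reference.com/cbb/players/{}-{}.html"

def player_url(slug: str) -> str:
    """Build the player page URL, defaulting to suffix 1."""
    return BASE_URL.format(slug, URL_SUFFIX.get(slug, 1))

print(player_url("stephen-curry"))
print(player_url("kyle-anderson"))
```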
    



In [17]:
#AGAIN, THIS SHOULD NOT BE RUN BUT FOR REFERENCE THIS IS HOW WE WOULD RUN THE PREVIOUSLY CREATED FUNCTION IN INCREMENTS OF 15 FOR ALL PLAYERS

def all_htmls():
    vals = [(0,105,15),(105,210,15),(210,300,15),(300,len(new_name),15)]
    for val in vals:
        for i in range (val[0],val[1],val[2]):
            j = i + 15  # END OF THE 15-PLAYER BATCH (SLICING PAST THE END OF THE LIST IS SAFE)
            #print(i,j)
            get_total_html(i,j)
            time.sleep(75)

    return None

#all_htmls()

In [19]:
# FROM EACH PLAYER'S HTML FILE (PROVIDED), THIS FUNCTION CREATES A DF OF THEIR COLLEGE STATS, INPUTS THE PLAYER'S NAME AND CONCATENATES THE DFS INTO ONE


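from io import StringIO

def get_total_tables(start,finish):
    for player in college_players1[start:finish]:
        #ofile = open(path + r"\Season Totals\{}.html".format(player), encoding='latin-1')
        # USE A CONTEXT MANAGER SO EACH FILE IS CLOSED AFTER PARSING
        with open("../data/{}.html".format(player), encoding='latin-1') as ofile:
            soup = bsoup(ofile, "html.parser")
        table = soup.find(id='players_totals')
        # WRAP THE HTML IN StringIO - PASSING A LITERAL HTML STRING TO read_html IS DEPRECATED IN RECENT PANDAS
        total_df = pd.read_html(StringIO(str(table)))[0]
        total_df['Player'] = player
        totals_dfs.append(total_df)

    return None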
            

In [21]:
# EXECUTE FUNCTION AND CONCATENATE DFS
get_total_tables(0,len(college_players1))
college_stats = pd.concat(totals_dfs)



In [22]:
# MAKE SURE ALL PLAYERS TABLES ARE INCLUDED, SHOULD HAVE 340 UNIQUE PLAYERS
college_stats['Player'].nunique()

340

In [23]:
# THE TOTALS TABLE INCLUDES A CAREER ROW, WHICH IS THE SUM OF THE PLAYER'S STATS ACROSS ALL COLLEGE SEASONS, BUT ITS FORMAT DIFFERS FROM THE SEASON-LEVEL ROWS (SPECIFICALLY, ITS CONFERENCE AND CLASS VALUES ARE MISSING),
# SO WE'LL REMOVE THOSE ROWS SO EACH PLAYER HAS ONE ROW FOR EACH OF THEIR COLLEGE SEASONS AND RECALCULATE THE SUM
college_stats.head()
college_stats.columns

Index(['Season', 'School', 'Conf', 'Class', 'G', 'GS', 'MP', 'FG', 'FGA',
       'FG%', '2P', '2PA', '2P%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Player'],
      dtype='object')

In [24]:
college_stats=college_stats[college_stats['Season']!='Career']
college_stats.head()

Unnamed: 0,Season,School,Conf,Class,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player
0,2006-07,Davidson,Southern,FR,34,33.0,1049.0,242,523,0.463,...,32.0,125.0,157,95,62.0,6.0,95.0,87.0,730,stephen-curry
1,2007-08,Davidson,Southern,SO,36,36.0,1193.0,317,656,0.483,...,28.0,137.0,165,104,73.0,14.0,93.0,85.0,931,stephen-curry
2,2008-09,Davidson,Southern,JR,34,34.0,1145.0,312,687,0.454,...,21.0,130.0,151,189,86.0,8.0,126.0,81.0,974,stephen-curry
0,2006-07,Texas,Big 12,FR,35,35.0,1255.0,306,647,0.473,...,106.0,284.0,390,46,66.0,67.0,99.0,71.0,903,kevin-durant
0,2013-14,Kansas,Big 12,FR,28,20.0,647.0,107,171,0.626,...,65.0,162.0,227,38,25.0,72.0,66.0,94.0,313,joel-embiid


In [25]:
# AGGREGATE BY PLAYER AND ENGINEER SOME ADDITIONAL VARIABLES FROM DATA

by_player = college_stats[['Player','G','GS','MP','FG','FGA','2P','2PA','3P','3PA','FT','FTA','ORB','DRB','TRB','AST','STL','BLK','TOV','PF','PTS']].groupby(['Player']).sum().reset_index()
season_count = college_stats[['Player','Season']].groupby(['Player']).nunique().reset_index().rename(columns={'Season':'Number of College Seasons'}) # NUMBER OF COLLEGE SEASONS
team_count = college_stats[['Player','School']].groupby(['Player']).nunique().reset_index().rename(columns={'School':'Number of College Teams'}) # DIFFERENT SCHOOLS
season_count

Unnamed: 0,Player,Number of College Seasons
0,aaron-gordon,1
1,aaron-holiday,3
2,aaron-nesmith,2
3,aaron-wiggins,3
4,aj-griffin,1
...,...,...
335,zach-collins,1
336,zach-lavine,1
337,zeke-nnaji,1
338,ziaire-williams,1


In [26]:
# MERGE BY PLAYER DFS TOGETHER

by_player1 = pd.merge(by_player,season_count,how='left', on='Player')
by_player1 = pd.merge(by_player1,team_count,how='left', on='Player')

by_player1.head()

Unnamed: 0,Player,G,GS,MP,FG,FGA,2P,2PA,3P,3PA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Number of College Seasons,Number of College Teams
0,aaron-gordon,38,38.0,1187.0,189,382,173.0,337.0,16.0,45.0,...,201.0,303,75,34.0,39.0,55.0,90.0,470,1,1
1,aaron-holiday,101,65.0,3209.0,476,1058,296.0,631.0,180.0,427.0,...,272.0,320,477,126.0,22.0,304.0,271.0,1443,3,1
2,aaron-nesmith,46,33.0,1428.0,219,496,100.0,206.0,119.0,290.0,...,200.0,245,58,43.0,30.0,75.0,137.0,675,2,1
3,aaron-wiggins,96,50.0,2707.0,375,922,203.0,445.0,172.0,477.0,...,341.0,445,150,87.0,36.0,138.0,169.0,1052,3,1
4,aj-griffin,39,25.0,935.0,146,296,75.0,137.0,71.0,159.0,...,123.0,153,38,20.0,22.0,25.0,42.0,405,1,1


We want to include college stats and draft history variables in the regression analysis. To do so, we'll merge our by-player college stats df with our previously aggregated df, which includes draft history and contract value. Recall that the Player field in the college stats df is the player's name as listed in Sports Reference URLs and not necessarily how it appears in our original df, so we'll need to convert those values back before merging.
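
Converting names back relies on each URL name mapping to exactly one original name; if two originals shared a URL name, a left merge on it would duplicate rows. The sketch below inverts such a mapping and fails loudly on a collision. `invert_unique` is a hypothetical helper, and the sample dictionary reuses two entries of the kind stored in `new_name`.

```python
def invert_unique(mapping):
    """Invert original_name -> url_name into url_name -> original_name.

    Raises ValueError if two original names share the same url_name.
    """
    inverse = {}
    for original, url_name in mapping.items():
        if url_name in inverse:
            raise ValueError(
                f"{url_name!r} maps back to both {inverse[url_name]!r} and {original!r}"
            )
        inverse[url_name] = original
    return inverse

sample = {"ja-morant": "temetrius-morant", "stephen-curry": "stephen-curry"}
print(invert_unique(sample)["temetrius-morant"])  # ja-morant
```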

In [27]:
# CREATE DF FROM DICTIONARY VALUES

original_names_df = pd.DataFrame({'original_name':new_name.keys(),
                                  'new_name': new_name.values()})

# BRING IN ORIGINAL NAME VALUES TO BY PLAYER COLLEGE STATS
by_player2 = pd.merge(by_player1,original_names_df, how='left', left_on='Player', right_on='new_name')

# THEN WE CAN MERGE WITH THE AGGREGATED FILE AND CLEAN FOR REGRESSION

reg_df = pd.merge(by_player2, agg_df1, how='left', left_on='original_name', right_on= 'Player').drop_duplicates()

In [None]:
reg_df

In [28]:
# DROP COLUMNS THAT WE DON'T NEED
reg_df1 = reg_df.drop(['Player_x','First_Name','Last_Name','new_name','Player_y'],axis=1).set_index('original_name')
reg_df1

# WRITE TO CSV TO STORE AND IMPORT IN REGRESSION NOTEBOOK

reg_df1.to_csv('../data/reg_df.csv')

This concludes the data extraction, cleaning, and pre-processing to prepare for the regression analysis.