# Data Collection and Cleaning - part 2
### 3-point Shooting Data
Following the same collection and cleaning process used for the team advanced statistics data, each team's 3-point percentage is collected from [basketball-reference](https://www.basketball-reference.com) for each season from 2000 to 2023 and loaded into a Pandas DataFrame. Each DataFrame is then passed through a function that cleans it and prepares it to be merged with the dataset from part 1, which completes the final dataset.

In [1]:
import pandas as pd

In [2]:
def clean_df(df):
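    '''
    Input: raw Pandas DataFrame with each team's 3-point percentage for a given year
    Output: cleaned Pandas DataFrame

    Tasks:
    - Keeps only the team name and 3-point percentage columns
    - Removes the league average row (last row) so only team rows remain
    - Removes the asterisk from playoff team names
    '''
    # .copy() makes the result independent of the input, so the column assignment
    # below does not trigger pandas' SettingWithCopyWarning
    df = df[['Team', '3P%']].copy()
    df = df.iloc[:-1]  # removes league average row
    df['Team'] = df['Team'].str.replace('*', '', regex=False)  # removes asterisk from team names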
    
    return df
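
The cleaning steps can be checked on a small table before running them on scraped data. The sketch below restates `clean_df` so that it runs on its own; the explicit `.copy()` and the team names and percentages in `raw` are assumptions made up for illustration, not values from basketball-reference.

```python
import pandas as pd

def clean_df(df):
    """Same steps as the cell above, restated so this sketch is self-contained."""
    # keep two columns, drop the last (league average) row, and work on a copy
    df = df[['Team', '3P%']].iloc[:-1].copy()
    # strip the playoff asterisk; regex=False treats '*' as a literal character
    df['Team'] = df['Team'].str.replace('*', '', regex=False)
    return df

# Made-up rows for illustration only; not real basketball-reference data
raw = pd.DataFrame({
    'Rk': [1, 2, None],
    'Team': ['Example Team A*', 'Example Team B', 'League Average'],
    '3P%': [0.380, 0.350, 0.365],
})

cleaned = clean_df(raw)
print(cleaned)
```

The result should contain two rows, with the asterisk gone from the first team name and the league average row dropped. Note that the function relies on the league average being the last row of the table.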

In [3]:
years = range(2000, 2024)  # seasons 2000 through 2023
with pd.ExcelWriter('3pt_data.xlsx') as writer:
    for year in years:
        # collect scraped data
        # note: the '#advanced-team' URL fragment is not sent to the server;
        # the table is selected by its position ([4]) in the list that read_html returns
        df = pd.read_html(f'https://www.basketball-reference.com/leagues/NBA_{year}.html#advanced-team')[4]
        df = clean_df(df)

        # insert new column that labels the respective year for each row of the dataframe
        # Team name and year will be used as the merge point for the dataframes
        df['Year'] = year

        # writes the dataframe of each year to its own sheet in the Excel file
        df.to_excel(writer, sheet_name=f'{year} 3pt Stats', index=False)
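
Because the scraping cell picks the table by position, it can help to check which of the parsed tables actually contain the columns that `clean_df` expects. The sketch below is one way to do that; `find_candidate_tables` is a hypothetical helper, and the `tables` list is a made-up stand-in for the list of DataFrames that `pd.read_html` returns. Several tables on a real page may share these columns, so a match here is a candidate to inspect, not proof that it is the table used above.

```python
import pandas as pd

def find_candidate_tables(tables, required=('Team', '3P%')):
    """Return the positions of every DataFrame whose columns include all of `required`."""
    return [i for i, t in enumerate(tables) if set(required).issubset(t.columns)]

# Stand-in for the list returned by pd.read_html; all values are made up
tables = [
    pd.DataFrame({'Team': ['A'], 'W': [50]}),
    pd.DataFrame({'Team': ['A'], '3P%': [0.36]}),
    pd.DataFrame({'Player': ['X'], 'PTS': [20.1]}),
]

candidates = find_candidate_tables(tables)
print(candidates)  # → [1]
```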

In [4]:
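f = '3pt_data.xlsx'
# sheet_name=None loads every sheet and returns a dict of {sheet name: DataFrame}
sheets = pd.read_excel(f, sheet_name=None, index_col=None)

In [5]:
# concatenate all of the data from each sheet of the Excel file into a single dataframe
# (each sheet keeps its own 0-based row index, so index values repeat across years)
cdf = pd.concat(sheets.values())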
cdf

,Team,3P%,Year
0,Sacramento Kings,0.322,2000
1,Detroit Pistons,0.359,2000
2,Dallas Mavericks,0.391,2000
3,Indiana Pacers,0.392,2000
4,Milwaukee Bucks,0.369,2000
...,...,...,...
25,Orlando Magic,0.346,2023
26,Charlotte Hornets,0.330,2023
27,Houston Rockets,0.327,2023
28,Detroit Pistons,0.351,2023
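
The index in the output above restarts for each season because `pd.concat` keeps each sheet's own row index by default. The sketch below shows the difference that `ignore_index=True` makes; the sheet contents are made-up placeholders, not real data.

```python
import pandas as pd

# Stand-in for the dict returned by pd.read_excel(..., sheet_name=None); values are made up
sheets = {
    '2000 3pt Stats': pd.DataFrame({'Team': ['A', 'B'], '3P%': [0.35, 0.36], 'Year': 2000}),
    '2001 3pt Stats': pd.DataFrame({'Team': ['A', 'B'], '3P%': [0.34, 0.37], 'Year': 2001}),
}

kept = pd.concat(sheets.values())                       # each sheet keeps its own index
reset = pd.concat(sheets.values(), ignore_index=True)   # one continuous index

print(list(kept.index))   # → [0, 1, 0, 1]
print(list(reset.index))  # → [0, 1, 2, 3]
```

Either form works for the later merge, since the merge matches on column values and not on the index.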


In [6]:
# export concatenated dataframe to a new, separate Excel file
cdf.to_excel('merged_3pt.xlsx', sheet_name='Data', index=False)

In [8]:
mdf = pd.read_excel('merged.xlsx', sheet_name='Data') # main dataframe (dataset from part 1)

In [9]:
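# merge both dataframes to create the final dataset
# team name and year are named explicitly as the merge keys; this is an inner join,
# so rows without a match in both dataframes are dropped
mdf = mdf.merge(cdf, on=['Team', 'Year'])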
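
An inner merge silently drops any team and year combination that appears in only one of the two dataframes, so it can be worth checking for unmatched rows. The sketch below is one way to do that with the `validate` and `indicator` parameters of `DataFrame.merge`; the `main` and `shooting` frames and their values are made-up stand-ins for `mdf` and `cdf`.

```python
import pandas as pd

# Made-up stand-ins for the main dataset and the 3-point dataset
main = pd.DataFrame({
    'Team': ['A', 'B', 'C'],
    'Year': [2000, 2000, 2000],
    'W': [50, 41, 30],
})
shooting = pd.DataFrame({
    'Team': ['A', 'B'],
    'Year': [2000, 2000],
    '3P%': [0.36, 0.35],
})

# how='left' keeps every row of `main`; validate raises if a (Team, Year) pair is duplicated;
# indicator=True adds a '_merge' column that records where each row came from
checked = main.merge(shooting, on=['Team', 'Year'], how='left',
                     validate='one_to_one', indicator=True)

unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['Team', 'Year']])
```

In this toy example, team `C` has no 3-point row, so it shows up in `unmatched`, whereas an inner merge would have dropped it without notice.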

In [10]:
mdf

,Rk,Team,Age,W,L,Win%,PW,PL,MOV,SOS,...,eFG%.1,TOV%.1,DRB%,FT/FGA.1,Arena,Attend.,Attend./G,Season Result,Year,3P%
0,1,Los Angeles Lakers,29.2,67,15,0.817,64,18,8.55,-0.14,...,0.443,13.4,73.1,0.222,STAPLES Center,771420.0,18815.0,Champion,2000,0.329
1,2,Portland Trail Blazers,29.6,59,23,0.720,59,23,6.40,-0.04,...,0.461,13.8,72.4,0.217,Rose Garden Arena,835078.0,20368.0,Playoffs,2000,0.361
2,3,San Antonio Spurs,30.9,53,29,0.646,58,24,5.94,-0.02,...,0.451,13.5,73.0,0.188,Alamodome,884450.0,21694.0,Playoffs,2000,0.374
3,4,Phoenix Suns,28.6,53,29,0.646,56,26,5.22,0.02,...,0.454,15.7,70.5,0.245,America West Arena,773115.0,18856.0,Playoffs,2000,0.368
4,5,Utah Jazz,31.5,55,27,0.671,54,28,4.46,0.05,...,0.477,15.0,73.2,0.256,Delta Center,801268.0,19543.0,Playoffs,2000,0.385
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
710,26,Portland Trail Blazers,25.1,33,49,0.402,31,51,-4.01,0.05,...,0.563,12.1,74.9,0.217,Moda Center,767374.0,18716.0,Did not qualify,2023,0.365
711,27,Charlotte Hornets,25.3,27,55,0.329,26,56,-6.24,0.35,...,0.544,12.5,75.5,0.211,Spectrum Center,702052.0,17123.0,Did not qualify,2023,0.330
712,28,Houston Rockets,22.1,22,60,0.268,23,59,-7.85,0.24,...,0.564,11.8,75.8,0.218,Toyota Center,668865.0,16314.0,Did not qualify,2023,0.327
713,29,Detroit Pistons,24.1,17,65,0.207,22,60,-8.22,0.49,...,0.557,11.9,74.0,0.231,Little Caesars Arena,759715.0,18596.0,Did not qualify,2023,0.351


In [11]:
# export final dataset to an Excel file
mdf.to_excel('finalData.xlsx', sheet_name='Data', index=False)