# NBA Four Factors

### Thesis

Dean Oliver has posited a way of looking at what drives wins in a game, known as the Four Factors. The view is that four factors drive wins for a team, in order of importance:

1 Effective FG%: eFG% adjusts FG% to credit the extra point a made three-pointer is worth. The formula is (FGM + 0.5 * 3PM) / FGA, where 3PM is three-point field goals made. The more efficiently you shoot, the better your chances of winning. Oliver estimated this factor's weighting at about 40%.

2 Turnover %: TOV% is the rate at which the ball is turned over per possession used. It is calculated as TOV / (FGA + 0.44 * FTA + TOV). Its estimated weighting is 25%.

3 Offensive Rebounds: Offensive rebounds give you additional scoring opportunities. ORB% is calculated as ORB / (ORB + opposition DRB). Its estimated weighting is 20%.

4 FT Rate: Finally, free throw rate measures how many free throws a team makes relative to its field goal attempts. It is calculated as FT / FGA, where FT is free throws made, and its estimated weighting is 15%.
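As a minimal sketch, the four formulas above can be written as one function over a team's box-score totals. The function name `four_factors`, its argument names, and the example totals are made up for illustration; the dictionary keys follow the column names used later in this notebook.

```python
def four_factors(fg, fg3, fga, ft, fta, tov, orb, opp_drb):
    """Return the four factors for one team from its box-score totals."""
    return {
        'eFG%': (fg + 0.5 * fg3) / fga,          # effective field goal percentage
        'TOV%': tov / (fga + 0.44 * fta + tov),  # turnover rate
        'ORB%': orb / (orb + opp_drb),           # offensive rebound rate
        'FTr': ft / fga,                         # free throws made per shot attempt
    }

# Made-up totals, for illustration only
example = four_factors(fg=40, fg3=10, fga=80, ft=15, fta=20,
                       tov=12, orb=10, opp_drb=30)
print(example)
```

The opponent's defensive rebounds are an input because ORB% depends on both teams' totals, so a per-team calculation needs the other side of the box score as well.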

### What are we testing for?

We will explore several things in this project. First, how much do these four factors vary from game to game, and are there trends in the mix over the past several years of the NBA? With the mix of shots shifting towards three-pointers, it would seem that eFG% has grown in importance as a factor. Also, what is the standard deviation of each factor? It would seem that hustle factors such as turnover % and offensive rebounds could have a wider standard deviation than the other factors.

One thing to take into consideration is what counts as a team. Given free agency and trades, the composition of a team varies from year to year as well as within a season. So identifying the composition of a team, based on how the total 240 minutes played per game are distributed across players, is important for determining how that specific team's four factors vary.
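One possible way to describe a team's composition is each player's share of the 240 minutes. The sketch below assumes minutes arrive as `MM:SS` strings, as in the box score scraped later in this notebook; the helper `minutes_played` and the column names `minutes` and `share_of_240` are hypothetical.

```python
import pandas as pd

def minutes_played(mp):
    """Convert an 'MM:SS' string to minutes as a float; anything else becomes 0.0."""
    if not isinstance(mp, str) or ':' not in mp:
        return 0.0
    minutes, seconds = mp.split(':')
    return int(minutes) + int(seconds) / 60

# Three rows taken from the sample box score in this notebook
lineup = pd.DataFrame({
    'Player': ['Nicolas Batum', 'Kemba Walker', 'Jeremy Lamb'],
    'MP': ['29:59', '26:46', '25:05'],
})
lineup['minutes'] = lineup['MP'].apply(minutes_played)
lineup['share_of_240'] = lineup['minutes'] / 240  # assumes a regulation game
print(lineup)
```

Overtime games total more than 240 minutes (each five-minute period adds 25), so dividing by the game's actual minute total may be the safer choice there.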

Next, we will look at the impact of travel. Travel can take a lot out of players, so how does it affect turnovers and offensive rebounds, which are largely "hustle" factors? We will look at the impact on the first, second, third, and fourth games of a road trip. It will be interesting to see if the "team" changes as the length of a road trip increases, or if the hustle variables exhibit more variance on the road than they normally do.
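To compare the first, second, third, and fourth games of a road trip, each game needs a position within its trip. A minimal sketch, assuming a per-team schedule already sorted by date with a boolean `is_away` column (both the column and the function name `road_trip_game_number` are hypothetical, and the schedule is made up):

```python
import pandas as pd

def road_trip_game_number(is_away):
    """Number consecutive away games 1, 2, 3, ...; home games get 0."""
    is_away = is_away.astype(bool)
    # Every home game starts a new group, so a run of away games shares one id
    trip_id = (~is_away).cumsum()
    return is_away.astype(int).groupby(trip_id).cumsum()

# Made-up schedule: home, two away, home, three away
schedule = pd.DataFrame({'is_away': [False, True, True, False, True, True, True]})
schedule['road_game_no'] = road_trip_game_number(schedule['is_away'])
print(schedule)
```

Grouping the four factors by `road_game_no` would then give one distribution per position in the trip.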

There are some bonus elements that could be included: how does shot selection vary as time on the road increases? Do teams tend to shoot more 3's on the road than at home, or increase the number of 3's as the length of the road trip increases? How does weather play a factor? When flights are delayed, is hustle affected?

## First step is to build a web scraper that gets the data from the website into a manageable format, testing out a sample into a pandas dataframe

Goal for today is to get my web scrape working and get through a good % of the total download I am looking to get through. I believe I have identified all the key factors I am looking to download and last night I got my first dataframe in pandas from an initial one game scrape of data. 

In [168]:
import requests
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
import time
import pandas as pd

In [169]:
year = 2018
month = 10
day = 19
team = 'ORL'

In [170]:
# zero-pad month and day so single-digit dates still produce a valid box score URL
web_template = f'https://www.basketball-reference.com/boxscores/{year}{month:02d}{day:02d}0{team}.html'

In [171]:
web_names = requests.get(web_template)

In [172]:
soup = BeautifulSoup(web_names.text, 'html.parser')
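Since the goal is to download many box scores, the request and parse steps above could be wrapped in one helper that checks for HTTP errors and pauses between requests. This is only a sketch: `fetch_soup`, its `get` parameter, and the three-second default pause are assumptions for illustration, not anything the site prescribes.

```python
import time

import requests
from bs4 import BeautifulSoup

def fetch_soup(url, get=requests.get, pause_seconds=3.0):
    """Download one page, fail loudly on an HTTP error, and return parsed HTML."""
    response = get(url)
    response.raise_for_status()  # raises for 4xx/5xx responses
    time.sleep(pause_seconds)    # be polite between consecutive requests
    return BeautifulSoup(response.text, 'html.parser')

# Usage (not executed here): soup = fetch_soup(web_template)
```

Passing the `get` callable in makes it possible to try the function with a stand-in response instead of hitting the website on every test run.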

In [173]:
headers_four_factors = [th.getText() for th in soup.findAll('tr', limit=2)[1].findAll('th')]

In [135]:
rows = soup.findAll('tr')[2:]  # skip the first two <tr> elements, which hold the column headers
# collect the text of every <td> cell, one list per row
player_stats1 = [[td.getText() for td in row.findAll('td')] for row in rows]

In [136]:
player_names1 = [[td.getText() for td in rows[i].findAll('th')] for i in range(len(rows))]

In [137]:
stats = pd.DataFrame(player_stats1, columns = headers_four_factors[1:]) 

In [138]:
player = pd.DataFrame(player_names1)
player = player[0][:len(stats)]  # first <th> of each row is the player name; match the length of stats

In [139]:
stats['Player'] = player

In [140]:
stats[:3]

Unnamed: 0,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,+/-,Player
0,29:59,6,8,0.75,3,4,0.75,0,0,,...,6,6,5,2,0,2,0,15,32,Nicolas Batum
1,26:46,8,16,0.5,5,10,0.5,5,5,1.0,...,2,2,5,1,0,3,0,26,34,Kemba Walker
2,25:05,2,7,0.286,0,2,0.0,4,4,1.0,...,6,7,1,1,1,0,0,8,19,Jeremy Lamb


In [141]:
rows2 = soup.findAll(class_='scorebox')

In [142]:
overall_teams = [strong.getText() for strong in rows2[0].findAll('strong')]
overall_teams = [items.strip('\n') for items in overall_teams]

In [143]:
overall_score = [scores.getText() for scores in rows2[0].findAll(class_='scores')]
overall_score = [items.strip('\n') for items in overall_score]

In [177]:
def date_adjustment():
    overall_date = [dates.getText() for dates in rows2[0].findAll(class_='scorebox_meta')]
    overall_date = [items.strip('\n') for items in overall_date]
    overall_date_2 = [items.split(',') for items in overall_date]
    output_list = []
    output_list.append(overall_date_2[0][0])
    output_list.append(overall_date_2[0][1])
    output_list.append(overall_date_2[0][2][:5].strip(' '))
    return output_list

In [178]:
date_adjustment()

['7:00 PM', ' October 19', '2018']

In [179]:
date_list = [' '.join(date_adjustment())] * 2

In [180]:
date_list

['7:00 PM  October 19 2018', '7:00 PM  October 19 2018']

In [181]:
teams_scores = pd.DataFrame(overall_teams, columns=['Team_Name'])

In [182]:
teams_scores['Score'] = overall_score
teams_scores['Date'] = date_list

In [183]:
teams_scores

Unnamed: 0,Team_Name,Score,Date
0,Charlotte Hornets,120,7:00 PM October 19 2018
1,Orlando Magic,88,7:00 PM October 19 2018


# I need to embed all these scattered assignments into a function, like I did for date, so that I can run for each sheet
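A possible shape for that function, following the same steps as the cells above (team names from `strong` tags, scores from the `scores` class, date from `scorebox_meta`). The name `parse_scorebox` is hypothetical, and the sample HTML is a stripped-down stand-in that only mimics the parts of the page this notebook relies on.

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_scorebox(soup):
    """Return one row per team with name, score, and game date."""
    box = soup.find(class_='scorebox')
    teams = [s.get_text().strip() for s in box.find_all('strong')]
    scores = [s.get_text().strip() for s in box.find_all(class_='scores')]
    meta = box.find(class_='scorebox_meta').get_text().strip()
    # meta begins like '7:00 PM, October 19, 2018...': keep time, day, and 4-digit year
    time_part, day_part, rest = [part.strip() for part in meta.split(',')[:3]]
    date = f'{time_part} {day_part} {rest[:4]}'
    return pd.DataFrame({'Team_Name': teams, 'Score': scores, 'Date': date})

# Simplified stand-in for the real page, for illustration only
sample_html = (
    '<div class="scorebox">'
    '<div><strong>\nCharlotte Hornets\n</strong><div class="scores">120</div></div>'
    '<div><strong>\nOrlando Magic\n</strong><div class="scores">88</div></div>'
    '<div class="scorebox_meta"><div>7:00 PM, October 19, 2018</div>'
    '<div>Amway Center, Orlando, Florida</div></div>'
    '</div>'
)
teams_scores_sample = parse_scorebox(BeautifulSoup(sample_html, 'html.parser'))
print(teams_scores_sample)
```

Taking `soup` as an argument, rather than reading a global like `rows2`, lets the same function run once per downloaded page.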

### Then, I can pass the data from each into a PostgreSQL database
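A rough sketch of the write step using `DataFrame.to_sql`. To keep the example runnable without a database server, it uses an in-memory SQLite connection as a stand-in; for PostgreSQL, the assumption is that a SQLAlchemy engine would be passed in place of `conn`. The table name `teams_scores` is hypothetical.

```python
import sqlite3

import pandas as pd

# Same values as the teams_scores frame shown above
teams_scores = pd.DataFrame({
    'Team_Name': ['Charlotte Hornets', 'Orlando Magic'],
    'Score': [120, 88],
    'Date': ['7:00 PM October 19 2018'] * 2,
})

conn = sqlite3.connect(':memory:')  # stand-in for a PostgreSQL connection
# append so each scraped game adds rows instead of replacing the table
teams_scores.to_sql('teams_scores', conn, if_exists='append', index=False)
stored = pd.read_sql('SELECT Team_Name, Score FROM teams_scores', conn)
conn.close()
print(stored)
```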

In [151]:
stats[:3]

Unnamed: 0,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,+/-,Player
0,29:59,6,8,0.75,3,4,0.75,0,0,,...,6,6,5,2,0,2,0,15,32,Nicolas Batum
1,26:46,8,16,0.5,5,10,0.5,5,5,1.0,...,2,2,5,1,0,3,0,26,34,Kemba Walker
2,25:05,2,7,0.286,0,2,0.0,4,4,1.0,...,6,7,1,1,1,0,0,8,19,Jeremy Lamb


In [204]:
four_factors_dataframe = pd.DataFrame(columns=['Player','eFG%','TOV%','ORB%','FTr'])

In [205]:
four_factors_dataframe = stats[stats.MP!=None]

In [206]:
four_factors_dataframe

Unnamed: 0,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,+/-,Player
0,29:59,6,8,.750,3,4,.750,0,0,,...,6,6,5,2,0,2,0,15,+32,Nicolas Batum
1,26:46,8,16,.500,5,10,.500,5,5,1.000,...,2,2,5,1,0,3,0,26,+34,Kemba Walker
2,25:05,2,7,.286,0,2,.000,4,4,1.000,...,6,7,1,1,1,0,0,8,+19,Jeremy Lamb
3,22:45,3,8,.375,0,0,,2,2,1.000,...,5,8,2,1,0,0,2,8,+17,Cody Zeller
4,20:29,3,7,.429,2,6,.333,0,0,,...,3,4,0,0,0,1,0,8,+13,Marvin Williams
5,,0,0,,0,,,,,,...,,,,,,,,,,Reserves
6,23:37,4,12,.333,2,6,.333,1,1,1.000,...,1,1,2,0,0,1,3,11,+5,Malik Monk
7,22:29,5,8,.625,0,2,.000,2,4,.500,...,7,9,5,0,2,1,1,12,+24,Michael Kidd-Gilchrist
8,17:18,2,5,.400,1,1,1.000,1,2,.500,...,2,5,2,1,0,2,4,6,+15,Willy Hernangómez
9,16:12,0,5,.000,0,1,.000,0,0,,...,3,3,6,0,0,2,2,0,+3,Tony Parker
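Note that the `Reserves` separator row is still present above: comparing a pandas column to `None` with `!=` does not filter anything. One alternative, sketched here on a small made-up frame, is to keep only rows whose `MP` value looks like `MM:SS`. This assumes separator and totals rows never have a minutes value in that form.

```python
import pandas as pd

# Small stand-in for the scraped stats frame
stats_sample = pd.DataFrame({
    'Player': ['Nicolas Batum', 'Reserves', 'Malik Monk', 'Team Totals'],
    'MP': ['29:59', None, '23:37', '240'],
})

# Element-wise != None is True for every row, so nothing is dropped
compared_to_none = stats_sample[stats_sample['MP'] != None]

# Keep only rows where MP is a minutes:seconds string; missing values count as False
is_player_row = stats_sample['MP'].str.match(r'^\d+:\d{2}$', na=False)
players_only = stats_sample[is_player_row]
print(players_only)
```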


In [195]:
stats['FGA']

0        8
1       16
2        7
3        8
4        7
5        0
6       12
7        8
8        5
9        5
10       8
11       5
12       2
13       1
14      92
15       0
16       0
17    .938
18    .656
19    .286
20    .375
21    .571
22       0
23    .417
24    .625
25    .500
26    .000
27    .938
28    .500
29    .500
      ... 
36      11
37       5
38       3
39       0
40       7
41       9
42      10
43       4
44      11
45       4
46       5
47       1
48      94
49       0
50       0
51    .400
52    .222
53    .545
54    .200
55    .333
56       0
57    .500
58    .389
59    .450
60    .000
61    .364
62    .750
63    .000
64    .900
65    .415
Name: FGA, Length: 66, dtype: object
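The decimals mixed into `FGA` appear to come from the other tables on the page: `soup.findAll('tr')` collects rows from every table, so rows from the advanced box score land under the basic headers. For example, .938 matches Batum's eFG% of (6 + 0.5 * 3) / 8. A sketch of scoping the parse to one table follows; the table ids and the simplified one-row header are assumptions, so check them against the actual page source.

```python
import pandas as pd
from bs4 import BeautifulSoup

def table_to_frame(soup, table_id):
    """Parse a single table, using its first row as the header."""
    table = soup.find('table', id=table_id)
    rows = table.find_all('tr')
    header = [th.get_text() for th in rows[0].find_all('th')]
    records = [[cell.get_text() for cell in row.find_all(['th', 'td'])]
               for row in rows[1:]]
    return pd.DataFrame(records, columns=header)

# Simplified stand-in page with two tables; the ids are assumed, not verified
sample_page = """
<table id="box-CHO-game-basic">
  <tr><th>Starters</th><th>MP</th><th>FGA</th></tr>
  <tr><th>Nicolas Batum</th><td>29:59</td><td>8</td></tr>
</table>
<table id="box-CHO-game-advanced">
  <tr><th>Starters</th><th>MP</th><th>eFG%</th></tr>
  <tr><th>Nicolas Batum</th><td>29:59</td><td>.938</td></tr>
</table>
"""
page_soup = BeautifulSoup(sample_page, 'html.parser')
basic = table_to_frame(page_soup, 'box-CHO-game-basic')
print(basic)
```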

In [158]:
four_factors_dataframe['Player'] = stats['Player']

In [163]:
four_factors_dataframe['eFG%'] = stats['FG'] * stats['3P']

TypeError: can't multiply sequence by non-int of type 'str'
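The error arises because every scraped value is a string (the column dtype is `object`), so the columns need converting before any arithmetic. The multiplication also is not the eFG% formula from the thesis section. A minimal sketch on three rows copied from the box score above, with `raw` as a hypothetical stand-in for `stats`:

```python
import pandas as pd

# Values copied from the first three rows of the scraped box score, as strings
raw = pd.DataFrame({
    'Player': ['Nicolas Batum', 'Kemba Walker', 'Jeremy Lamb'],
    'FG': ['6', '8', '2'],
    '3P': ['3', '5', '0'],
    'FGA': ['8', '16', '7'],
})

numeric_cols = ['FG', '3P', 'FGA']
# errors='coerce' turns blanks and other non-numeric text into NaN instead of raising
raw[numeric_cols] = raw[numeric_cols].apply(pd.to_numeric, errors='coerce')

# eFG% = (FGM + 0.5 * 3PM) / FGA
raw['eFG%'] = (raw['FG'] + 0.5 * raw['3P']) / raw['FGA']
print(raw)
```

Players with zero attempts would produce a division by zero here, which pandas reports as `NaN` or `inf` rather than an exception, so those rows may need separate handling.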

In [164]:
four_factors_dataframe

Unnamed: 0,Player,eFG%,TOV%,ORB%,FTr
0,Nicolas Batum,6,,,
1,Kemba Walker,8,,,
2,Jeremy Lamb,2,,,
3,Cody Zeller,3,,,
4,Marvin Williams,3,,,
5,Reserves,0,,,
6,Malik Monk,4,,,
7,Michael Kidd-Gilchrist,5,,,
8,Willy Hernangómez,2,,,
9,Tony Parker,0,,,
