# KenPom Scraper

This example scrapes the ratings table from the [KenPom](https://kenpom.com/index.php) home page.

For more on `cloudscraper`, the scraping package used here to get past the Cloudflare anti-bot page, see [its PyPI page](https://pypi.org/project/cloudscraper/).


#### NOTE: Please be mindful when making requests. In a new session, run the `browser.get` cell only once so that you make a single request. Everything else can run in other cells, since Jupyter Notebook keeps your variables in the kernel.
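
One way to follow the note above across kernel restarts is to save the downloaded page to disk and reuse it. The sketch below is only an illustration: the helper `get_html_cached`, its `fetch` argument, and the file name `kenpom_index.html` are hypothetical names, not part of the original notebook.

```python
from pathlib import Path

def get_html_cached(fetch, cache_path="kenpom_index.html"):
    """Return the page HTML, calling fetch() only when no cached copy exists.

    `fetch` is any zero-argument callable that returns the HTML as a string
    (hypothetical helper; the file name is an assumption as well).
    """
    path = Path(cache_path)
    if path.exists():
        # Reuse the saved copy instead of making another request.
        return path.read_text(encoding="utf-8")
    html = fetch()
    path.write_text(html, encoding="utf-8")
    return html

# Possible usage with the scraper created in this notebook:
# html = get_html_cached(lambda: browser.get('https://kenpom.com/index.php').text)
# soup = BeautifulSoup(html, 'html.parser')
```

Delete the cached file when you want fresh ratings; the next call will then make exactly one new request.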

In [1]:
import cloudscraper
from bs4 import BeautifulSoup
import pandas as pd
from io import StringIO
import re

In [2]:
browser = cloudscraper.create_scraper()
response = browser.get('https://kenpom.com/index.php')

In [3]:
soup = BeautifulSoup(response.content, 'html.parser')

In [4]:
df = pd.read_html(StringIO(str(soup.find_all('table')[0])))[0]

df.columns = ['Rk', 'Team', 'Conference','W-L','NetRtg',
              'ORtg','ORtg_rk','DRtg','DRtg_rk','AdjT',
              'AdjT_rk','Luck','Luck_rk','OPP_NetRtg',
              'OPP_NetRtg_rk','OPP_ORtg','OPP_ORtg_rk',
              'OPP_DRtg','OPP_DRtg_rk','NCSOS',
              'NCSOS_rk'
             ]

In [5]:
df = df[df['Team'] != 'Team'].dropna().reset_index(drop=True)
df['Seed'] = df['Team'].apply(lambda x: re.search(r'\d+', x).group() if re.search(r'\d+', x) else None)
df['Team'] = df['Team'].apply(lambda x: re.sub(r'\d+', '', x)).str.strip()
df['NetRtg'] = df['NetRtg'].apply(lambda x: re.sub(r'[^-\d.]+', '', x)).astype(float)
df['OPP_NetRtg'] = df['OPP_NetRtg'].apply(lambda x: re.sub(r'[^-\d.]+', '', x)).astype(float)
df['AdjT'] = df['AdjT'].astype(float)
df[['ORtg', 'DRtg']] = df[['ORtg', 'DRtg']].astype(float)
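
The regular expressions in the cleaning cell can be tried on plain strings first. This is a minimal sketch, assuming the scraped cells look like `'Houston 1'` (team name followed by a tournament seed) and `'+35.41'` (a signed rating); the sample values and the helper names `split_team_and_seed` and `parse_rating` are made up for illustration.

```python
import re

def split_team_and_seed(raw):
    """Split a cell such as 'Houston 1' into (team, seed); seed is None if absent."""
    match = re.search(r'\d+', raw)
    seed = match.group() if match else None
    team = re.sub(r'\d+', '', raw).strip()
    return team, seed

def parse_rating(raw):
    """Drop everything except digits, '-' and '.', then convert to float."""
    return float(re.sub(r'[^-\d.]+', '', raw))

print(split_team_and_seed('Houston 1'))  # hypothetical sample cell
print(split_team_and_seed('Kansas'))     # no seed present
print(parse_rating('+35.41'))            # leading '+' is stripped
```

Note that the same digit pattern would also strip digits that belong to a team name, so it is worth checking the resulting `Team` column by eye.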

Below is example code for computing the projected score, the resulting margin, and the win probability for a game.

In [7]:
import scipy.stats as stats

def compute_score(team1, team2, df):
    """Print a projected neutral-court score and team1's win probability."""
    # Look up each team's row by name
    team1 = df[df['Team'] == team1]
    team2 = df[df['Team'] == team2]
    # Average Efficiency is 106.7 and tempo is 67.2
    approx_tempo = int((team1['AdjT'].iloc[0] / 67.2) * (team2['AdjT'].iloc[0] / 67.2) * 67.2)
    # Each offense is scaled by the opposing defense relative to the average efficiency
    team1_adj_off = (team1['ORtg'].iloc[0] * team2['DRtg'].iloc[0]) / 106.7
    team2_adj_off = (team2['ORtg'].iloc[0] * team1['DRtg'].iloc[0]) / 106.7
    
    # Efficiency is per 100 possessions, so convert to points over the projected tempo
    team1_score = team1_adj_off / 100 * approx_tempo
    team2_score = team2_adj_off / 100 * approx_tempo
    
    margin = team1_score - team2_score
    # wp is the probability that the margin falls below 0, i.e. that team1 loses
    wp = stats.norm.cdf(x=0, loc=margin, scale=11)
    
    print(f"""    Here is the Predicted Score Between these teams on a neutral court according to KenPom:\n 
    {team1['Team'].iloc[0]}: {round(team1_score, 0)} | {team2['Team'].iloc[0]}: {round(team2_score, 0)} | Probability {team1['Team'].iloc[0]} Wins: {round(1 - wp, 3)}""")
    
    

In [9]:
compute_score('Kansas', "St. John's", df) # > 75% should be a guarantee in theory

    Here is the Predicted Score Between these teams on a neutral court according to KenPom:
 
    Kansas: 68.0 | St. John's: 70.0 | Probability Kansas Wins: 0.405
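
The same calculation can be written as a standalone function that takes the six ratings directly, which makes the formula easier to inspect without scraping. This is a sketch under stated assumptions: it reuses the averages (106.7 and 67.2) and the standard deviation of 11 from the cell above, replaces `scipy.stats.norm.cdf` with the equivalent `math.erf` expression, and does not truncate the tempo with `int()`, so its numbers can differ slightly from `compute_score`. The function name `project_game` and the ratings in the usage line are made up.

```python
import math

AVG_EFFICIENCY = 106.7  # average efficiency used in the cell above
AVG_TEMPO = 67.2        # average tempo used in the cell above

def project_game(ortg1, drtg1, tempo1, ortg2, drtg2, tempo2, sd=11.0):
    """Return (score1, score2, probability that team 1 wins) on a neutral court."""
    # Projected tempo: product of each team's tempo relative to average, times average.
    tempo = (tempo1 / AVG_TEMPO) * (tempo2 / AVG_TEMPO) * AVG_TEMPO
    # Each offense is scaled by the opposing defense relative to average efficiency.
    score1 = (ortg1 * drtg2 / AVG_EFFICIENCY) / 100 * tempo
    score2 = (ortg2 * drtg1 / AVG_EFFICIENCY) / 100 * tempo
    margin = score1 - score2
    # Normal CDF of margin / sd: the probability that the final margin is above 0.
    win_prob = 0.5 * (1 + math.erf(margin / (sd * math.sqrt(2))))
    return score1, score2, win_prob

# Hypothetical ratings, not taken from the scraped table
s1, s2, p = project_game(120.0, 95.0, 68.0, 115.0, 98.0, 66.0)
print(round(s1, 1), round(s2, 1), round(p, 3))
```

Two evenly matched teams give a margin of 0 and a win probability of 0.5, and swapping the two teams gives complementary probabilities.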
