# 2016 NCAA Bracket Generator

### Based on statistics gathered via FiveThirtyEight:
#### http://fivethirtyeight.com/features/the-best-mens-college-basketball-teams-just-arent-very-good-this-year/

The goal is to scrape the six pages listed below with Beautiful Soup 4, collect the statistics each one publishes, and build a Monte Carlo simulation that estimates likely tournament outcomes.

Pages we're using for statistics:

- 2016 Pomeroy College Basketball rankings: http://kenpom.com/
- Sports Reference 2015-2016 school ratings: http://www.sports-reference.com/cbb/seasons/2016-ratings.html
- Rankings via Dolphinsim: http://www.dolphinsim.com/ratings/ncaa_mbb/index_pred.html
- Jeff Sagarin's College Basketball rankings: http://sagarin.com/sports/cbsend.htm
- Sonny Moore's Computer Power ratings: http://sonnymoorepowerratings.com/m-basket.htm
- The 2016 Bracket Matrix: http://www.bracketmatrix.com/
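
As a rough idea of where this is headed, here is a minimal sketch of the Monte Carlo step. Everything in it is an assumption made for illustration: the logistic `win_probability` model, the `scale` value, the function names, and the toy ratings are not taken from any of the sites above. It only shows the mechanics of playing out a bracket many times and counting champions.

```python
import random


def win_probability(rating_a, rating_b, scale=10.0):
    """Assumed model: logistic curve on the rating difference."""
    return 1.0 / (1.0 + 10 ** (-(rating_a - rating_b) / scale))


def simulate_bracket(teams, ratings, rng):
    """Play one single-elimination bracket and return the champion.

    `teams` is in bracket order; its length must be a power of two.
    """
    field = list(teams)
    if not field or len(field) & (len(field) - 1):
        raise ValueError("number of teams must be a power of two")
    while len(field) > 1:
        next_round = []
        # Adjacent teams in the list play each other
        for a, b in zip(field[0::2], field[1::2]):
            p = win_probability(ratings[a], ratings[b])
            next_round.append(a if rng.random() < p else b)
        field = next_round
    return field[0]


def title_odds(teams, ratings, n_sims=10000, seed=0):
    """Estimate each team's title chances as a share of simulated wins."""
    rng = random.Random(seed)
    counts = {team: 0 for team in teams}
    for _ in range(n_sims):
        counts[simulate_bracket(teams, ratings, rng)] += 1
    return {team: wins / n_sims for team, wins in counts.items()}


# Toy four-team field with made-up ratings
toy_teams = ["A", "B", "C", "D"]
toy_ratings = {"A": 90.0, "B": 80.0, "C": 70.0, "D": 60.0}
toy_odds = title_odds(toy_teams, toy_ratings, n_sims=2000, seed=1)
```

In the real notebook the ratings would come from the scraped tables once they are cleaned, and the probability model would need to be chosen and calibrated rather than assumed.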

In [30]:
import requests
import bs4
import pandas as pd
import numpy as np
import re

%matplotlib inline

## Grab page data using requests, parse with Beautiful Soup

In [2]:
kp_raw = requests.get("http://kenpom.com")
kp = kp_raw.text
sr_raw = requests.get("http://www.sports-reference.com/cbb/seasons/2016-ratings.html")
sr = sr_raw.text
dp_raw = requests.get("http://www.dolphinsim.com/ratings/ncaa_mbb/index_pred.html")
dp = dp_raw.text
js_raw = requests.get("http://sagarin.com/sports/cbsend.htm")
js = js_raw.text
sm_raw = requests.get("http://sonnymoorepowerratings.com/m-basket.htm")
sm = sm_raw.text
bm_raw = requests.get("http://www.bracketmatrix.com/")
bm = bm_raw.text
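
The six requests above follow the same pattern, so they could be folded into a loop. The sketch below is one possible shape, not what the notebook actually runs: `URLS` and `fetch_all` are hypothetical names, and the `get` argument is meant to be `requests.get`. It adds a timeout and calls `raise_for_status()` so that a failed download raises instead of silently handing an error page to the parser.

```python
# Short keys mirror the variable names used in the cell above
URLS = {
    "kp": "http://kenpom.com",
    "sr": "http://www.sports-reference.com/cbb/seasons/2016-ratings.html",
    "dp": "http://www.dolphinsim.com/ratings/ncaa_mbb/index_pred.html",
    "js": "http://sagarin.com/sports/cbsend.htm",
    "sm": "http://sonnymoorepowerratings.com/m-basket.htm",
    "bm": "http://www.bracketmatrix.com/",
}


def fetch_all(urls, get, timeout=30):
    """Download every URL and return {key: page text}.

    `get` is expected to behave like requests.get.
    """
    pages = {}
    for key, url in urls.items():
        response = get(url, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx responses
        pages[key] = response.text
    return pages


# In the notebook, with `requests` already imported:
# pages = fetch_all(URLS, requests.get)
# kp = pages["kp"]
```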

### Ken Pomeroy

In [3]:
## NB: Use html5lib parser rather than lxml
kp_data = bs4.BeautifulSoup(kp, "html5lib")

In [4]:
table = kp_data.find_all('table')[0]
rows = table.find_all('tr')

kpd = []

for tr in rows:
    cols = tr.find_all('td')
    #rank,team,conf,wl,pyth,adjo,adjd,adjt,luck,pyth_s,oppo,oppd,pyth_n = [ c.text for c in cols ]
    kpd.append([ c.text for c in cols ])

In [5]:
kp_df = pd.DataFrame(kpd)

In [6]:
kp_df.shape

(369, 21)

### Sports Reference

In [7]:
## NB: Use html5lib parser rather than lxml
sr_data = bs4.BeautifulSoup(sr, "html5lib")

In [8]:
table = sr_data.find_all('table')[0]
rows = table.find_all('tr')

srd = []

for tr in rows:
    cols = tr.find_all('td')
    srd.append([ c.text for c in cols ])

In [9]:
sr_df = pd.DataFrame(srd)

In [10]:
sr_df.shape

(387, 15)
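
The KenPom and Sports Reference cells repeat the same table-walking loop, so a small helper could cover both. This is a sketch under assumptions: `table_to_rows` is a hypothetical name, and skipping rows that contain no `<td>` cells is a choice made here (such rows would otherwise become empty rows in the DataFrame), not something the cells above do.

```python
import bs4


def table_to_rows(html, index=0, parser="html5lib"):
    """Return the <td> text of each row in the index-th <table> of `html`."""
    soup = bs4.BeautifulSoup(html, parser)
    table = soup.find_all("table")[index]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # rows with no <td> cells (e.g. header rows) are skipped
            rows.append(cells)
    return rows


# In the notebook this might be used as:
# sr_df = pd.DataFrame(table_to_rows(sr))
```

Because empty rows are dropped, the resulting shapes would likely differ from the ones printed above.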

### Dolphinsim

In [40]:
## NB: Use html5lib parser rather than lxml
dp_data = bs4.BeautifulSoup(dp, "html5lib")

In [41]:
## Dolphinsim doesn't use an HTML table; the ratings sit in a <code> block
table = dp_data.find_all('code')

In [42]:
dpd = []
for row in table[0].text.split('\n'):
    dpd.append(re.split(r'\s{2,}', row))

In [43]:
dp_df = pd.DataFrame(dpd)

In [44]:
dp_df.shape

(390, 22)
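
Splitting on runs of two or more spaces keeps multi-word team names together, but a line that starts with spaces produces an empty first field, and blank lines become near-empty rows. The variant below is a sketch that strips each line and skips blanks first. `split_fixed_width` is a hypothetical name, and the sample text is made up for illustration rather than copied from the site.

```python
import re


def split_fixed_width(text):
    """Split space-aligned plain text into rows of fields."""
    rows = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Two or more whitespace characters separate columns
        rows.append(re.split(r"\s{2,}", line))
    return rows


# Made-up sample in the same general layout
sample = "  1  Team A  10.5\n\n  2  Team B  9.25\n"
sample_rows = split_fixed_width(sample)
```

With the page above, the call would be `split_fixed_width(table[0].text)`.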

### Jeff Sagarin

In [45]:
## NB: Use html5lib parser rather than lxml
js_data = bs4.BeautifulSoup(js, "html5lib")

In [46]:
table = js_data.find_all('pre')

In [49]:
## If I thought Dolphinsim was bad... Sagarin takes the cake.
## I don't know what to do with this :(

table[1]

<pre><font color="#000000">____________________________________________________________________________________________________________________________________\n<b><font color="#000000">College Basketball 2015-2016          Div I games only    through games of 2016 March 11 Friday                                     </font>\n                                <font color="#9900ff">RATING</font>    W   L  SCHEDL(RANK) VS top 25 | VS top 50 |<font color="#0000ff">  PREDICTOR   </font>|<font color="#ff0000"> GOLDEN_MEAN  |<font color="#4cc417">  RECENT    </font></font></b><font color="#ff0000">\n                <b><font color="#000000">HOME ADVANTAGE=[<font color="#9900ff">  3.31</font>]                                               [<font color="#0000ff">  3.31</font>]       [<font color="#ff0000">  3.31</font>]       [<font color="#4cc417">  3.31</font>]</font></b><font color="#000000">\n<font color="#000000">   1  Kansas                  =</font><font color="#9900ff">  93.36</font><font 
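
One possible way forward, offered as a sketch rather than a tested solution: `get_text()` drops the `<font>` and `<b>` tags, leaving plain lines such as `   1  Kansas                  =  93.36` followed by further columns. A regular expression can then pick out the rank, team name, and rating. The pattern and the `parse_sagarin` name are assumptions based only on the fragment shown above; the remaining columns are left unparsed because their layout is not visible here.

```python
import re

# rank, team name, "=", rating -- inferred from the one visible data row
SAGARIN_ROW = re.compile(r"^\s*(\d+)\s+(.+?)\s*=\s*(\d+\.\d+)")


def parse_sagarin(text):
    """Return (rank, team, rating) tuples for lines that look like team rows."""
    rows = []
    for line in text.split("\n"):
        match = SAGARIN_ROW.match(line)
        if match:
            rank, team, rating = match.groups()
            rows.append((int(rank), team, float(rating)))
    return rows


# Lines taken from the output above, with the markup removed
sample = (
    "                HOME ADVANTAGE=[  3.31]\n"
    "   1  Kansas                  =  93.36\n"
)
parsed = parse_sagarin(sample)
```

In the notebook this might be called as `parse_sagarin(table[1].get_text())`. Header lines do not start with a rank number, so the pattern ignores them.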

### Sonny Moore

In [50]:
## NB: Use html5lib parser rather than lxml
sm_data = bs4.BeautifulSoup(sm, "html5lib")

In [53]:
## Looks like Sonny prefers plain text tables too
table = sm_data.find_all('b')

In [54]:
smd = []
for row in table[0].text.split('\n'):
    smd.append(re.split(r'\s{2,}', row))

In [55]:
sm_df = pd.DataFrame(smd)

### Bracket Matrix

In [58]:
## NB: Use html5lib parser rather than lxml
bm_data = bs4.BeautifulSoup(bm, "html5lib")

In [62]:
table = bm_data.find_all('table')[0]
rows = table.find_all('tr')

bmd = []

for tr in rows:
    cols = tr.find_all('td')
    bmd.append([ c.text for c in cols ])

In [63]:
bm_df = pd.DataFrame(bmd)

In [67]:
bm_df.shape

(102, 104)

## Data cleaning