## Horse Betting Algorithm -- Past Race Web Scraper 


For a class project we are developing an algorithm for fundamental handicapping and bet optimization. This notebook contains the code for scraping past race data for horses. Each horse has its own webpage holding all of that horse's past race results, and the main function aggregates the results from every page into a single DataFrame.

We initially used a remote web driver with Selenium to navigate the main site and isolate race information, but the process was far too slow. Instead we found the base URL, to which we can simply append each horse ID. The final segment of the code parallelizes the GET requests to speed up the process (the run time still depends on the internet connection). There is surely a more robust implementation of the pooling functionality, but it works as is.

We are specifically training a model on data from the Hong Kong Jockey Club (HKJC). Check out the inspiration for the larger project [here](https://www.bloomberg.com/news/features/2018-05-03/the-gambler-who-cracked-the-horse-racing-code).

In [6]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [8]:
first = pd.read_csv('race-result-horse 14-17 copy.csv')

In [9]:
first.head()

Unnamed: 0,finishing_position,horse_number,horse_name,horse_id,jockey,trainer,actual_weight,declared_horse_weight,draw,length_behind_winner,running_position_1,running_position_2,running_position_3,running_position_4,finish_time,win_odds,running_position_5,running_position_6,race_id
0,1,1.0,DOUBLE DRAGON,K019,B Prebble,D Cruz,133,1032,1,-,1.0,2.0,2.0,1.0,1.22.33,3.8,,,2014-001
1,2,2.0,PLAIN BLUE BANNER,S070,D Whyte,D E Ferraris,133,1075,13,2,8.0,9.0,9.0,2.0,1.22.65,8.0,,,2014-001
2,3,10.0,GOLDWEAVER,P072,Y T Cheng,Y S Tsui,121,1065,3,2,2.0,1.0,1.0,3.0,1.22.66,5.7,,,2014-001
3,4,3.0,SUPREME PROFIT,P230,J Moreira,C S Shum,132,1222,2,2,6.0,4.0,5.0,4.0,1.22.66,6.1,,,2014-001
4,5,7.0,THE ONLY KID,H173,Z Purton,K W Lui,125,1136,9,4-1/4,9.0,10.0,10.0,5.0,1.23.02,6.1,,,2014-001


In [18]:
# found this code on stack overflow -- functions for getting past the request challenge
# that hkjc serves when it blocks automated requests (sorry hkjc...)

from math import cos, pi, floor

def parse_challenge(page):
    """
    Parse a challenge given by mmi and mavat's web servers, forcing us to solve
    some math stuff and send the result as a header to actually get the page.
    This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb
    """
    top = page.split('<script>')[1].split('\n')
    challenge = top[1].split(';')[0].split('=')[1]
    challenge_id = top[2].split(';')[0].split('=')[1]
    return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)}


def get_challenge_answer(challenge):
    """
    Solve the math part of the challenge and get the result
    """
    arr = list(challenge)
    last_digit = int(arr[-1])
    arr.sort()
    min_digit = int(arr[0])
    subvar1 = (2 * int(arr[2])) + int(arr[1])
    subvar2 = str(2 * int(arr[2])) + arr[1]
    power = ((int(arr[0]) * 1) + 2) ** int(arr[1])
    x = (int(challenge) * 3 + subvar1)
    y = cos(pi * subvar1)
    answer = x * y
    answer -= power
    answer += (min_digit - last_digit)
    answer = str(int(floor(answer))) + subvar2
    return answer


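def url_parser(URL):
    """
    Fetch URL and return the page text, answering the X-AA-Challenge
    when the server responds with one instead of the page.
    """
    s = requests.Session()
    r = s.get(URL)

    if 'X-AA-Challenge' in r.text:
        challenge = parse_challenge(r.text)
        # resend the request with the solved challenge in the headers
        r = s.get(URL, headers={
            'X-AA-Challenge': challenge['challenge'],
            'X-AA-Challenge-ID': challenge['challenge_id'],
            'X-AA-Challenge-Result': challenge['challenge_result']
        })

        # request the page once more with the cookies from that response
        yum = r.cookies
        r = s.get(URL, cookies=yum)

    return r.text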
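
In `get_challenge_answer`, `subvar1` is always an integer, so `cos(pi * subvar1)` is mathematically either +1 or -1. The sketch below is one possible integer-only variant of the same arithmetic that takes the sign from the parity of `subvar1` instead of calling `cos`. It is not part of the Stack Overflow code: `challenge_answer_int` is a hypothetical name, and it assumes the challenge is a string of at least three digits.

```python
def challenge_answer_int(challenge):
    """Integer-only variant of the challenge arithmetic (hypothetical helper)."""
    digits = sorted(challenge)          # digit characters in ascending order
    last_digit = int(challenge[-1])     # last digit before sorting
    min_digit = int(digits[0])
    subvar1 = 2 * int(digits[2]) + int(digits[1])
    subvar2 = str(2 * int(digits[2])) + digits[1]
    power = (int(digits[0]) + 2) ** int(digits[1])
    x = int(challenge) * 3 + subvar1
    # cos(pi * n) is +1 for even n and -1 for odd n
    sign = 1 if subvar1 % 2 == 0 else -1
    answer = x * sign - power + (min_digit - last_digit)
    return str(answer) + subvar2


print(challenge_answer_int('123456'))  # 37036262
```

Staying in integers avoids any dependence on how `cos` rounds near a multiple of pi and on float precision for large challenge values.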

In [11]:
def horse_info_scraper(url_list):
    """
    Takes a list of (url, page_html) tuples, where each url is the base hkjc
    link with a horse ID appended and page_html is requests.get(url).text.
    Returns one df holding the past race results of every horse.
    """
    frames = []

    for link, html in url_list:
        result = BeautifulSoup(html, 'lxml')
        tables = result.find_all('table')

        # the race results sit in the 7th <table> on the page (index 6)
        main_data = tables[6]

        cols = None
        res = []
        for tr in main_data.find_all('tr'):
            td = tr.find_all('td')
            row = [i.text.replace('\n', "").replace('\r', "").rstrip() for i in td]

            if cols is None:
                # first row is the header; drop its last cell
                cols = row[:-1]
            else:
                # drop the last two cells so the row lines up with the header
                res.append(row[:-2])

        horse_df = pd.DataFrame(data=res, columns=cols)
        horse_df.dropna(inplace=True)
        # the horse ID is the last four characters of the url
        horse_df['horse_id'] = link[-4:]
        frames.append(horse_df)

    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    return pd.concat(frames)

In [19]:
# build one link per unique horse ID

base_url = 'http://www.hkjc.com/english/racing/horse.asp?horseno='
ids = first['horse_id'].unique()
full_links = [base_url + horse_id for horse_id in ids]
full_links[:5]

['http://www.hkjc.com/english/racing/horse.asp?horseno=K019',
 'http://www.hkjc.com/english/racing/horse.asp?horseno=S070',
 'http://www.hkjc.com/english/racing/horse.asp?horseno=P072',
 'http://www.hkjc.com/english/racing/horse.asp?horseno=P230',
 'http://www.hkjc.com/english/racing/horse.asp?horseno=H173']
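
The scraper recovers each horse ID with `link[-4:]`, which assumes every ID is exactly four characters long. As a hedged sketch of an alternative, the standard library's `urllib.parse` can build the query string and read the ID back out of it regardless of length. `BASE_URL`, `build_horse_url`, and `horse_id_from_url` are hypothetical names introduced here for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# same page as base_url above, without the query string (assumed constant)
BASE_URL = 'http://www.hkjc.com/english/racing/horse.asp'


def build_horse_url(horse_id):
    """Return the horse page url for one horse ID."""
    return BASE_URL + '?' + urlencode({'horseno': horse_id})


def horse_id_from_url(url):
    """Read the horseno query parameter back out of a horse page url."""
    return parse_qs(urlsplit(url).query)['horseno'][0]


url = build_horse_url('K019')
print(url)                     # http://www.hkjc.com/english/racing/horse.asp?horseno=K019
print(horse_id_from_url(url))  # K019
```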

### Updates:

This section separates fetching the URLs from putting the information into a DataFrame. The GET requests are now pooled, running in parallel across however many cores the machine has. Fetching is the time-intensive part of the overall code; once you have a list of (url, requests.get(url).text) tuples, building the DataFrame should be quick.

In [13]:
import multiprocessing as mp
print("Number of processors: ", mp.cpu_count())

Number of processors:  4


In [14]:
def return_read_links(url):
    """Worker for the parallel requests -- returns the tuple (url, page text)."""
    return (url, url_parser(url))

In [15]:
# pool.map blocks until every request has returned; the with-block then cleans up the workers
with mp.Pool(mp.cpu_count()) as pool:
    test_res = pool.map(return_read_links, full_links[:5])
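
Since the workers spend most of their time waiting on HTTP responses rather than computing, a thread pool is one possible alternative to a process pool. The sketch below uses `concurrent.futures.ThreadPoolExecutor` and collects failed URLs instead of letting one bad request abort the whole batch. `fetch_all`, its `fetch` argument, and the `max_workers=8` default are assumptions made for illustration, not part of the original code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=8):
    """
    Call fetch(url) for every url on a thread pool (hypothetical helper).
    Returns (pages, failures): pages is a list of (url, text) tuples and
    failures is a list of (url, exception) tuples.
    """
    pages, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        # results arrive in completion order, not in input order
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                pages.append((url, future.result()))
            except Exception as exc:
                failures.append((url, exc))
    return pages, failures
```

With the functions defined in this notebook, a call might look like `pages, failures = fetch_all(full_links, url_parser)`, after which `pages` has the same (url, text) shape that `horse_info_scraper` expects.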

In [16]:
horse_info_scraper(test_res)

Unnamed: 0,RaceIndex,Pla.,Date,RC/Track/Course,Dist.,G,RaceClass,Dr,Rtg.,Trainer,Jockey,LBW,Win Odds,Act.Wt.,RunningPosition,Finish Time,Declar.Horse Wt.,Gear,horse_id
1,773,12,12/07/2015,"ST / Turf / ""B+2""",1400,GF,4,8,40,D Cruz,K C Leung,7-1/4,29,113,1 1 1 12,1.22.92,1022,B,K019
3,690,03,14/06/2015,"ST / Turf / ""B+2""",1400,GF,5,11,39,D Cruz,B Prebble,HD,8.2,132,1 2 1 3,1.22.66,1009,B,K019
5,657,12,31/05/2015,"ST / Turf / ""A""",1400,GF,4,1,41,D Cruz,C Y Lui,8-1/4,23,111,4 3 3 12,1.23.55,1012,B,K019
7,619,03,16/05/2015,"ST / Turf / ""C""",1400,G,4,12,42,D Cruz,C Y Ho,2-1/4,28,118,3 2 2 3,1.23.40,1022,B,K019
9,601,12,09/05/2015,"ST / Turf / ""B+2""",1400,GF,4,6,44,D Cruz,W M Lai,11,31,117,2 2 2 12,1.24.00,1018,B,K019
11,529,03,12/04/2015,"ST / Turf / ""C""",1400,G,4,7,44,D Cruz,D Lane,1-3/4,37,120,1 1 1 3,1.22.84,1028,B,K019
13,255,10,20/12/2014,"ST / Turf / ""A+3""",1400,GF,4,6,46,D Cruz,C Y Lui,9-3/4,33,111,3 1 1 10,1.23.45,1045,B,K019
15,178,04,23/11/2014,"ST / Turf / ""B+2""",1200,GF,4,8,47,D Cruz,K C Leung,3-1/2,41,118,1 1 4,1.09.57,1032,B,K019
17,141,04,09/11/2014,"ST / Turf / ""C+3""",1200,G,4,5,47,D Cruz,M Chadwick,3-3/4,14,120,3 3 4,1.09.98,1030,B,K019
19,042,07,27/09/2014,"ST / Turf / ""C""",1400,GF,4,12,47,D Cruz,N Callan,4-1/4,7.4,122,1 2 2 7,1.22.93,1024,B,K019
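
Every scraped column comes back as a string, so numeric fields need converting before they can feed a model. As a minimal sketch, the function below turns a `Finish Time` value such as `1.22.92` into seconds, assuming the format is always minutes, seconds, and two-digit hundredths separated by periods, as in the rows above. `finish_time_to_seconds` is a hypothetical helper name.

```python
def finish_time_to_seconds(finish_time):
    """Convert a 'm.ss.hh' finish time string to seconds (assumed format)."""
    minutes, seconds, hundredths = (int(part) for part in finish_time.split('.'))
    return minutes * 60 + seconds + hundredths / 100


print(round(finish_time_to_seconds('1.22.92'), 2))  # 82.92
```

Applied to the DataFrame, this could be written as `df['Finish Time'].map(finish_time_to_seconds)`, provided every row follows the assumed format.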
