In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pdfplumber

# Analysis of Results of the 2019 ARA National Championships Presented by AMSOIL #
*Darren McGlinchey*

The [American Rally Association (ARA)](https://www.americanrallyassociation.org/) is the top stage rally competition in the United States. I've been following the success of the [Subaru Motorsports USA](https://www.subaru.com/motorsports) team via the popular YouTube show "Launch Control". The team has dominated the rally scene in the US for the last decade, particularly with the driver/co-driver pair David Higgins & Craig Drew. I was curious whether Subaru's dominance in the top Open 4-wheel-drive (O4WD) class trickled down. How popular are Subarus throughout the classes in which they can compete? How successful are they in their classes?

I chose to start answering these questions with the 2019 season, for two reasons. First, it is recent and wasn't affected by the COVID pandemic like the 2020 season. Second, it was the last season in which Higgins & Drew drove for Subaru Motorsports USA.

## Getting & Cleaning the Data ##

In order to answer the above questions I'm looking for the following data:
* National results for each race in the 2019 season (I'll ignore regional results for now)
* Primarily interested in overall results rather than individual stage times
* Should contain overall position, driver/co-driver, car, class, total time, and position in class.

I was unable to find a compiled, publicly available dataset containing the results of the 2019 season. Some, but not all, of the 2019 results can be found on the [ARA](https://www.americanrallyassociation.org/) website, but not in an easily scrapable format. Compiling the results by hand would not be a huge undertaking given the number of races (9) and the modest number of competitors per race (14-49), but thankfully there was another way.

The website [ewrc-results.com](https://www.ewrc-results.com/) is run by rally fans who are compiling a public database of world rally events and results. They appear to have complete listings and results for each race in the season in a more scraping-friendly format. I'll start there.

Let's start by getting the results for each event individually. There were 9 events in the [2019 ARA season](https://www.ewrc-results.com/season/2019/994-ara/):
1. [Sno*Drift Rally](https://www.ewrc-results.com/final/56054-snodrift-rally-2019/)
2. [Rally in the 100 Acre Wood](https://www.ewrc-results.com/final/56055-rally-in-the-100-acre-wood-2019/)
3. [DirtFish Olympus Rally](https://www.ewrc-results.com/final/56056-dirtfish-olympus-rally-2019/)
4. [Oregon Trail Rally](https://www.ewrc-results.com/final/56057-oregon-trail-rally-2019/)


In [48]:
# Set up the necessary URLs
ara_2019_season_url = 'https://www.ewrc-results.com/season/2019/994-ara/'
ara_2019_rally_urls = ['https://www.ewrc-results.com/final/56054-snodrift-rally-2019/',
                       'https://www.ewrc-results.com/final/56055-rally-in-the-100-acre-wood-2019/',
                       'https://www.ewrc-results.com/final/56056-dirtfish-olympus-rally-2019/',
                       'https://www.ewrc-results.com/final/56057-oregon-trail-rally-2019/']

Let's start by seeing what we can get from the season overview page:

In [10]:
season_tables = pd.read_html(ara_2019_season_url)
print('season tables: {}'.format(len(season_tables)))
for i, table in enumerate(season_tables):
    print(' table {}, shape: {}'.format(i, table.shape))

season tables: 22
 table 0, shape: (3, 7)
 table 1, shape: (3, 7)
 table 2, shape: (3, 7)
 table 3, shape: (3, 7)
 table 4, shape: (3, 7)
 table 5, shape: (3, 7)
 table 6, shape: (3, 7)
 table 7, shape: (3, 7)
 table 8, shape: (3, 7)
 table 9, shape: (3, 7)
 table 10, shape: (3, 7)
 table 11, shape: (3, 7)
 table 12, shape: (3, 7)
 table 13, shape: (3, 7)
 table 14, shape: (3, 7)
 table 15, shape: (3, 7)
 table 16, shape: (3, 7)
 table 17, shape: (3, 7)
 table 18, shape: (53, 13)
 table 19, shape: (5, 4)
 table 20, shape: (8, 4)
 table 21, shape: (6, 4)


In [11]:
season_tables[0]

Unnamed: 0,0,1,2,3,4,5,6
0,1.0,,Fetela Piotr - Jozwiak Dominik,,Ford Fiesta Proto,Fetela Rally Team,2:20:04.5
1,2.0,,Steely Cameron - Osborn Preston,,Subaru Impreza WRX STi,O.D.D. Racing,2:24:01.5
2,3.0,,Nease Travis - James Matthew,,Subaru WRX STI,Hi Camp Racing,2:31:51.9


In [12]:
season_tables[1]

Unnamed: 0,0,1,2,3,4,5,6
0,1.0,,Fetela Piotr - Jozwiak Dominik,,Ford Fiesta Proto,Fetela Rally Team,2:20:04.5
1,2.0,,Steely Cameron - Osborn Preston,,Subaru Impreza WRX STi,O.D.D. Racing,2:24:01.5
2,3.0,,Nease Travis - James Matthew,,Subaru WRX STI,Hi Camp Racing,2:31:51.9


In [20]:
season_tables[2]
type(season_tables[2])

pandas.core.frame.DataFrame

Based on the website, I would have expected 9 tables (1 for each race) with the top three finishers, followed by a few tables of additional statistics. Those results are certainly in there, but let's clean them up. It looks like each results table got duplicated, and there are some `NaN` columns where the flags/logos go.

In [47]:
season_dfs = []
for i in range(0, 18, 2):    
    season_tables[i].columns = ['pos','A','driverco','B','car','team','time']
    driver = season_tables[i].driverco.apply(lambda row: row.split(' - ')[0])
    codriver = season_tables[i].driverco.apply(lambda row: row.split(' - ')[1])
    df = season_tables[i].assign(driver=driver.values)
    df = df.assign(codriver=codriver.values)
    df = df.drop(['A','B','driverco'], axis=1)
    season_dfs.append(df)
print('N races: {}'.format(len(season_dfs)))
season_dfs[0]    

N races: 9


Unnamed: 0,pos,car,team,time,driver,codriver
0,1.0,Ford Fiesta Proto,Fetela Rally Team,2:20:04.5,Fetela Piotr,Jozwiak Dominik
1,2.0,Subaru Impreza WRX STi,O.D.D. Racing,2:24:01.5,Steely Cameron,Osborn Preston
2,3.0,Subaru WRX STI,Hi Camp Racing,2:31:51.9,Nease Travis,James Matthew


That looks better. 
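
As an aside, the cell above renames the columns of the parsed tables in place. Here is a minimal sketch of the same cleanup written as a function that works on a copy and splits the crew with pandas' vectorized string methods. The helper name `tidy_podium` and the inline sample frame are my own stand-ins for illustration (the sample rows are copied from the output above), and the sketch assumes every `driverco` value contains exactly one `' - '` separator.

```python
import pandas as pd

def tidy_podium(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of one 3x7 podium table (hypothetical helper)."""
    df = raw.copy()  # leave the parsed table untouched
    df.columns = ['pos', 'A', 'driverco', 'B', 'car', 'team', 'time']
    # Split "Driver - Codriver" once; expand=True yields one column per part
    crew = df['driverco'].str.split(' - ', n=1, expand=True)
    df['driver'] = crew[0]
    df['codriver'] = crew[1]
    # 'A' and 'B' are the empty flag/logo columns
    return df.drop(columns=['A', 'B', 'driverco'])

# Stand-in for season_tables[0], so the sketch runs without a network call
sample = pd.DataFrame([
    [1.0, None, 'Fetela Piotr - Jozwiak Dominik', None,
     'Ford Fiesta Proto', 'Fetela Rally Team', '2:20:04.5'],
    [2.0, None, 'Steely Cameron - Osborn Preston', None,
     'Subaru Impreza WRX STi', 'O.D.D. Racing', '2:24:01.5'],
    [3.0, None, 'Nease Travis - James Matthew', None,
     'Subaru WRX STI', 'Hi Camp Racing', '2:31:51.9'],
])

podium = tidy_podium(sample)
print(podium)
```

With the real data this would be called as `[tidy_podium(season_tables[i]) for i in range(0, 18, 2)]`, which skips the duplicated tables in the same way as the loop above.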

Now let's get tables of full results for each of the races. Before grabbing everything, let's take a quick look at the results from the first race of the season.

In [59]:
test_tabs = pd.read_html(ara_2019_rally_urls[0])
print(len(test_tabs))
for i, table in enumerate(test_tabs):
    print('table {} shape: {}'.format(i, table.shape))
print('table[0] row 0:\n{}'.format(test_tabs[0].iloc[0]))
print('table[1] row 0:\n{}'.format(test_tabs[1].iloc[0]))

2
table 0 shape: (16, 9)
table 1 shape: (2, 7)
table[0] row 0:
0                                               1
1                                             #94
2                                             NaN
3                  Fetela Piotr - Jozwiak Dominik
4    Ford Fiesta Proto [FRT 94] Fetela Rally Team
5                                            O4WD
6                                   2:20:04.50:10
7                                             NaN
8                                            80.8
Name: 0, dtype: object
table[1] row 0:
0                                             SS8
1                                             #97
2                                             NaN
3                     McKenna Barry - Jordan Leon
4    Ford Fiesta R5 [JEA-4540]McKenna Motorsports
5                                            O4WD
6                                      Mechanical
Name: 0, dtype: object


That makes sense. We get 2 tables: the first should be the full results, and the second details the retirements. Let's merge those together into a common data frame with only the columns we care about.

In [70]:
test_tabs[0].columns = ['pos','carnum','A','driverco','carteam','class','jumbledtime','tdiff','unknown']
test_tabs[1].columns = ['stage','carnum','A','driverco','carteam','class','reasonexit']

In [71]:
res_df = test_tabs[0].drop(['A','tdiff','unknown'], axis=1)
# Split "Driver - Codriver" into two separate columns
drivers = res_df.driverco.apply(lambda row: row.split(' - ')[0])
codrivers = res_df.driverco.apply(lambda row: row.split(' - ')[1])
res_df = res_df.assign(driver=drivers.values)
res_df = res_df.assign(codriver=codrivers.values)
# Keep the total time up to its first decimal digit; the rest is extra text fused on by the scrape
times = res_df.jumbledtime.apply(lambda row: row[:row.find('.')+2])
res_df = res_df.assign(time=times.values)
res_df = res_df.drop(['driverco','jumbledtime'], axis=1)
res_df

Unnamed: 0,pos,carnum,carteam,class,driver,codriver,time
0,1.0,#94,Ford Fiesta Proto [FRT 94] Fetela Rally Team,O4WD,Fetela Piotr,Jozwiak Dominik,2:20:04.5
1,2.0,#824,Subaru Impreza WRX STi [410 0598] O.D.D. Racing,L4WD,Steely Cameron,Osborn Preston,2:24:01.5
2,3.0,#81,Subaru WRX STIHi Camp Racing,L4WD,Nease Travis,James Matthew,2:31:51.9
3,4.0,#845,Honda CivicTeam Punishment Racing,O2WD,MacDonald Shawn,Cannis Jonathan,2:41:17.9
4,5.0,#777,Subaru ImprezaHeavy Metal,NA4WD,Kramer Jonathan,Smith Jason,2:43:54.5
5,6.0,#123,Subaru Impreza RSMBP Motorsports,NA4WD,Engle Michael Jr,Engle Lauren,2:45:32.9
6,7.0,#386,Mitsubishi Lancer [CMZ-734] Morris Motorsports,O2WD,Morris Bradley,Nagy Douglas,2:45:52.4
7,8.0,#98,Subaru Impreza WRX STi RABMG Racing,L4WD,Bardha Ele,Roshea Corrina,2:53:51.8
8,9.0,#815,Subaru Impreza WRX STiToasted Racing,L4WD,Whitebread Zachary,Carr Cameron,2:56:05.6
9,10.0,#686,Subaru Impreza RS [ABM-4363] Leadfoot Locher R...,NA4WD,Locher Jordan,Addison Thomas,3:00:42.2
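
The slice used for `time` above relies on each total time having exactly one decimal digit. As a hedged alternative, here is a minimal sketch that pulls the leading `H:MM:SS.s` portion out with a regular expression and converts it to a `timedelta`, which would make later arithmetic (gaps, averages) easier. The helper name `parse_total_time` and the pattern are my own assumptions. I'm also assuming that the characters fused onto the end (for example the `0:10` in `2:20:04.50:10`) are not part of the total time.

```python
import re
from datetime import timedelta

# Optional hours, then minutes, then seconds with exactly one decimal digit.
# Anything after that single decimal digit is ignored.
TIME_RE = re.compile(r'^(?:(\d+):)?(\d{1,2}):(\d{2}\.\d)')

def parse_total_time(jumbled):
    """Return the leading total time as a timedelta, or None if absent."""
    match = TIME_RE.match(jumbled)
    if match is None:
        return None  # e.g. a retirement reason such as 'Mechanical'
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2))
    seconds = float(match.group(3))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

winner = parse_total_time('2:20:04.50:10')
runner_up = parse_total_time('2:24:01.5')
print(winner, runner_up, runner_up - winner)
```

Because the results are `timedelta` objects, the gap between two finishers is a plain subtraction, and non-time strings come back as `None` rather than raising.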


In [76]:
ret_df = test_tabs[1].drop(['stage','A','reasonexit'], axis=1)
# Retirements have no finishing position or time, so mark them with sentinel values
ret_df = ret_df.assign(pos=-1.0, time='DNF')
ret_df

Unnamed: 0,carnum,driverco,carteam,class,pos,time
0,#97,McKenna Barry - Jordan Leon,Ford Fiesta R5 [JEA-4540]McKenna Motorsports,O4WD,-1.0,DNF
1,#248,Banes Scott - Arpke Brian,Subaru Impreza RSBanes Racing Team,O4WD,-1.0,DNF
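
Note that the retirements table still holds the crew in a single `driverco` column, while the results table has separate `driver` and `codriver` columns. Here is a minimal sketch of a shared helper that gives both frames the same columns before they are combined. The name `split_crew` and the small inline frames are assumptions for illustration (the values are copied from the outputs above), and the sketch again assumes one `' - '` separator per crew.

```python
import pandas as pd

def split_crew(df, col='driverco'):
    """Replace a 'Driver - Codriver' column with driver/codriver columns."""
    crew = df[col].str.split(' - ', n=1, expand=True)
    return df.assign(driver=crew[0], codriver=crew[1]).drop(columns=[col])

# Stand-ins for the finishers and retirements tables
finishers = pd.DataFrame({
    'pos': [1.0, 2.0],
    'carnum': ['#94', '#824'],
    'driverco': ['Fetela Piotr - Jozwiak Dominik',
                 'Steely Cameron - Osborn Preston'],
    'class': ['O4WD', 'L4WD'],
    'time': ['2:20:04.5', '2:24:01.5'],
})
retirements = pd.DataFrame({
    'carnum': ['#97', '#248'],
    'driverco': ['McKenna Barry - Jordan Leon',
                 'Banes Scott - Arpke Brian'],
    'class': ['O4WD', 'O4WD'],
}).assign(pos=-1.0, time='DNF')

# Both frames now share the same columns, so no leftover 'driverco' remains
combined = pd.concat([split_crew(finishers), split_crew(retirements)],
                     ignore_index=True)
print(combined)
```

Using `ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating the index values of the two inputs.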


In [78]:
# Stack finishers and retirements into one frame
res_df = pd.concat([res_df, ret_df])
res_df

Unnamed: 0,pos,carnum,carteam,class,driver,codriver,time,driverco
0,1.0,#94,Ford Fiesta Proto [FRT 94] Fetela Rally Team,O4WD,Fetela Piotr,Jozwiak Dominik,2:20:04.5,
1,2.0,#824,Subaru Impreza WRX STi [410 0598] O.D.D. Racing,L4WD,Steely Cameron,Osborn Preston,2:24:01.5,
2,3.0,#81,Subaru WRX STIHi Camp Racing,L4WD,Nease Travis,James Matthew,2:31:51.9,
3,4.0,#845,Honda CivicTeam Punishment Racing,O2WD,MacDonald Shawn,Cannis Jonathan,2:41:17.9,
4,5.0,#777,Subaru ImprezaHeavy Metal,NA4WD,Kramer Jonathan,Smith Jason,2:43:54.5,
5,6.0,#123,Subaru Impreza RSMBP Motorsports,NA4WD,Engle Michael Jr,Engle Lauren,2:45:32.9,
6,7.0,#386,Mitsubishi Lancer [CMZ-734] Morris Motorsports,O2WD,Morris Bradley,Nagy Douglas,2:45:52.4,
7,8.0,#98,Subaru Impreza WRX STi RABMG Racing,L4WD,Bardha Ele,Roshea Corrina,2:53:51.8,
8,9.0,#815,Subaru Impreza WRX STiToasted Racing,L4WD,Whitebread Zachary,Carr Cameron,2:56:05.6,
9,10.0,#686,Subaru Impreza RS [ABM-4363] Leadfoot Locher R...,NA4WD,Locher Jordan,Addison Thomas,3:00:42.2,
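
The `carteam` values fuse the car model and the team name, sometimes with no separator at all (e.g. `Subaru WRX STIHi Camp Racing`), so splitting them cleanly is not straightforward. Since the questions here are about Subaru specifically, one option is to keep only the manufacturer. The sketch below takes the first whitespace-separated word as the make. That is my own simplifying assumption, it would break for multi-word makes, and the sample values are copied from the output above.

```python
import pandas as pd

# A few fused car/team strings taken from the results shown above
carteam = pd.Series([
    'Ford Fiesta Proto [FRT 94] Fetela Rally Team',
    'Subaru WRX STIHi Camp Racing',
    'Honda CivicTeam Punishment Racing',
    'Subaru ImprezaHeavy Metal',
])

# Assumption: the manufacturer is the first whitespace-separated word
make = carteam.str.split(n=1).str[0]
is_subaru = make.eq('Subaru')

print(make.value_counts())
print('Subaru share: {:.0%}'.format(is_subaru.mean()))
```

Applied to the full results, a boolean column like `is_subaru` could then be grouped by `class` to count Subaru entries per class.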
