# Columns ['Id', 'Age', 'Sex', 'AvgTime','NumRaces','YrsSinceLast','Y1']

Y1: Target label for 2016 participation. 1 if the runner raced in 2016, 0 otherwise.

Id: Runner id

Age: Age in 2016, min-max normalized using the oldest (92) and youngest (11) ages in the historical data.

Sex: 1 (M) or 0 (F)

AvgTime: Average of normalized times of previous races. AvgTime is in range [0,1]

NumRaces: Number of races run before 2016. Normalized [0,1]

YrsSinceLast: Number of years since the runner's last race before 2016. Normalized [0,1]
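
As a minimal sketch of what "average of normalized times" means for AvgTime, the snippet below min-max scales each time and then averages. The constants 7942 and 30282 are the ones used in the feature-building cell of this notebook; the function names and the example times are hypothetical.

```python
# Scaling constants taken from the feature-building cell (same units as 'Time').
MIN_TIME = 7942.0
MAX_TIME = 30282.0

def normalize_time(t):
    """Min-max scale one finishing time with the constants above."""
    return (t - MIN_TIME) / (MAX_TIME - MIN_TIME)

def avg_normalized_time(times):
    """Average of the normalized times of a runner's previous races."""
    return sum(normalize_time(t) for t in times) / len(times)

# Hypothetical finishing times for one runner.
example_times = [12000.0, 13500.0, 15000.0]
avg_time = avg_normalized_time(example_times)

# The scaling is linear, so this equals the normalized mean time.
mean_time = sum(example_times) / len(example_times)
assert abs(avg_time - normalize_time(mean_time)) < 1e-12
```

Because the two forms agree, the feature can be computed from a runner's summed time as long as the minimum is subtracted after dividing by the number of races.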

In [1]:
from collections import Counter
import pandas as pd
import numpy as np
import pickle

In [2]:
# Pickle files must be opened in binary mode
with open('data/out.pkl', 'rb') as f:
    data = pickle.load(f)

In [3]:
data['Id'].size

34527

Initially there are 34527 rows of data.

In [4]:
# Drop runners whose only recorded race is 2016: they have no history to build features from
rid_2016 = data[data['Year']==2016]['Id'].values.tolist()
for rid in rid_2016:
    runner_rows = data[data['Id']==rid]
    if runner_rows['Year'].size == 1:
        data = data.drop(runner_rows.index[0])

In [5]:
data['Id'].size

32256

After removing runners who only ran in 2016, 32256 rows remain.
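
The row-by-row drop can also be written as a single boolean mask. This is a sketch on a hypothetical miniature table (`races`), assuming one row per runner and year as in `data`; it is not the code that produced the counts above.

```python
import pandas as pd

# Hypothetical miniature of the race table: one row per (runner, year).
races = pd.DataFrame({
    'Id':   [1, 1, 2, 3, 3, 4],
    'Year': [2014, 2016, 2016, 2012, 2015, 2016],
})

# Number of rows each runner has, aligned with the original rows.
races_per_runner = races.groupby('Id')['Year'].transform('count')

# A row is dropped when it is a 2016 race and the runner has no other race.
only_2016 = (races['Year'] == 2016) & (races_per_runner == 1)
filtered = races[~only_2016]
```

Here runners 2 and 4 are removed, while runner 1, who has a 2014 race as well, keeps both rows.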

## Sort data by runner, year
data = data.sort_values(by=['Id', 'Year'])

## Number of unique IDs to predict for 2016

In [6]:
unique_ids = list(set(data['Id'].values.tolist()))

In [7]:
d = {'Id' : unique_ids}
XY1 = pd.DataFrame(d, columns=['Id', 'Age', 'Sex', 'AvgTime','NumRaces','YrsSinceLast','Y1'])

## Set Y1 Table Data for 2016 Participation

In [9]:
XY1['Y1']=0
XY1['NumRaces']=0.
XY1['YrsSinceLast']=15.
XY1['Age']=0.

indices = XY1.index.tolist()
for i in indices:
    rid = XY1.at[i, 'Id']
    runner = data[data['Id']==rid]
    last_race_year = runner['Year'].max()

    # Age in 2016, extrapolated from the age recorded at the most recent race
    last_age = runner[runner['Year']==last_race_year]['Age'].values[0]
    age = 2016-last_race_year + last_age

    sex = runner['Sex'].values[0]

    # Features are built only from races before 2016
    prevRaces = runner[runner['Year']<2016]
    ysl = 2016-prevRaces['Year'].max()
    numRaces = prevRaces['Year'].size
    # Average of normalized times: subtract the minimum from the mean time,
    # not from the summed time, so the result stays in [0,1]
    avgTime = (prevRaces['Time'].sum()/float(numRaces)-7942.)/(30282.-7942.)

    # Y1 is 1 when the runner's most recent race is 2016
    if last_race_year==2016:
        XY1.at[i, 'Y1'] = 1

    # .at replaces set_value, which has been removed from pandas
    XY1.at[i, 'Age'] = age
    XY1.at[i, 'Sex'] = sex
    XY1.at[i, 'YrsSinceLast'] = ysl
    XY1.at[i, 'AvgTime'] = avgTime
    XY1.at[i, 'NumRaces'] = numRaces
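
For larger tables, the per-runner loop above could be expressed with `groupby`. The sketch below runs on a hypothetical five-row table (`races`) with the same column names as `data`; `MeanTime` is left as a raw mean and would still need the time scaling applied.

```python
import pandas as pd

# Hypothetical miniature of the race table.
races = pd.DataFrame({
    'Id':   [1, 1, 1, 2, 2],
    'Year': [2010, 2014, 2016, 2012, 2015],
    'Age':  [30, 34, 36, 50, 53],
    'Sex':  [1, 1, 1, 0, 0],
    'Time': [14000., 15000., 14500., 18000., 19000.],
})

# History features come only from races before 2016.
prev = races[races['Year'] < 2016]
by_runner = prev.groupby('Id')
features = pd.DataFrame({
    'NumRaces': by_runner['Year'].size(),
    'YrsSinceLast': 2016 - by_runner['Year'].max(),
    'MeanTime': by_runner['Time'].mean(),
})

# Most recent row per runner, including 2016 if present.
last = races.sort_values('Year').groupby('Id').last()
features['Age'] = 2016 - last['Year'] + last['Age']   # age in 2016
features['Sex'] = last['Sex']
features['Y1'] = (last['Year'] == 2016).astype(int)   # ran in 2016?
```

All columns share the runner `Id` as index, so the assignments align per runner without an explicit loop.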

## Normalization

In [11]:
maxAge = XY1['Age'].max() #92.0
minAge = XY1['Age'].min() #11.0

In [12]:
print(maxAge)
print(minAge)

92.0
11.0


In [13]:
indices = XY1.index.tolist()
for i in indices:
    person = XY1.loc[i]
    age = person['Age']
    ysl = person['YrsSinceLast']
    numRaces = person['NumRaces']

    XY1.at[i, 'Age'] = (age-minAge)*1./(maxAge-minAge)
    # 2016-2003 is 13, and the smallest possible gap is 1 year
    XY1.at[i, 'YrsSinceLast'] = (ysl-1)/13.
    # 2003-2015 spans 13 years, but 2013 is subtracted, so at most 12 races;
    # every runner has at least 1 previous race
    XY1.at[i, 'NumRaces'] = (numRaces-1)/12.
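
The same scaling can be applied to whole columns at once. This is a sketch on a hypothetical three-row frame (`xy`) that reuses the divisors from the cell above; the values are made up to hit the extremes.

```python
import pandas as pd

# Hypothetical un-normalized features covering the extreme cases.
xy = pd.DataFrame({
    'Age': [11., 40., 92.],
    'YrsSinceLast': [1., 5., 13.],
    'NumRaces': [1., 4., 12.],
})

# Min-max scale Age with the observed bounds.
min_age, max_age = xy['Age'].min(), xy['Age'].max()
xy['Age'] = (xy['Age'] - min_age) / (max_age - min_age)

# Shift by the minimum of 1 and divide by the notebook's constants.
xy['YrsSinceLast'] = (xy['YrsSinceLast'] - 1) / 13.
xy['NumRaces'] = (xy['NumRaces'] - 1) / 12.
```

With these divisors a 13-year gap maps to 12/13 and 12 previous races map to 11/12, so both columns stay slightly below 1 while Age spans the full [0,1] range.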

In [14]:
XY1[XY1['Y1']==0]['Id'].size

24433

In [15]:
XY1[XY1['Y1']==1]['Id'].size

733

In [25]:
XY1[XY1['Y1']==0]['Id'].size + XY1[XY1['Y1']==1]['Id'].size

25166

Of the previous participants, 733 ran again in 2016 and 24433 did not return. This gives 25166 rows of training data for 2016.
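
The counts above imply a strongly imbalanced target. A quick check of the positive rate, using only the numbers reported in this notebook (the variable names are illustrative):

```python
# Counts reported by the cells above.
returned = 733
not_returned = 24433

total = returned + not_returned
positive_rate = returned / float(total)

print('total runners: %d' % total)
print('share returning in 2016: %.4f' % positive_rate)
```

Roughly 3% of runners are positives, which is worth keeping in mind when choosing an evaluation metric for any model trained on this table.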

## Store Values

In [26]:
# Pickle files must be written in binary mode
with open('data/xy1.pkl', 'wb') as f:
    pickle.dump(XY1, f)

XY1.to_csv('data/xy1.csv')
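
To confirm that both files can be read back, a round trip along these lines may help. It is a sketch that writes a hypothetical two-row table to a temporary directory rather than to `data/`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for XY1.
table = pd.DataFrame({'Id': [1, 2], 'Y1': [1, 0]})

tmp_dir = tempfile.mkdtemp()
pkl_path = os.path.join(tmp_dir, 'xy1.pkl')
csv_path = os.path.join(tmp_dir, 'xy1.csv')

# Write both formats; pickle needs binary mode.
with open(pkl_path, 'wb') as f:
    pickle.dump(table, f)
table.to_csv(csv_path)

# Read them back; to_csv wrote the index as the first column.
with open(pkl_path, 'rb') as f:
    from_pickle = pickle.load(f)
from_csv = pd.read_csv(csv_path, index_col=0)
```

Passing `index_col=0` keeps the saved index from reappearing as an extra unnamed column when the CSV is loaded.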