### Exploring Horse Race Data Kaggle 2020
The 2020 horse racing data is read from ./data/raw_data/horses_2020.csv into a DataFrame named horse_racing_data. Calling the isnull() method followed by the sum() method gives the number of missing values in each column; this summary is stored in the variable missing_data_summary. Filtering the summary for values greater than 0 leaves only the columns that have missing data, and the result is stored in the variable columns_with_missing_data. Finally, the code prints each of these columns with its missing-value count, or a message saying that no columns have missing data.
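The counting and filtering steps can be tried without the CSV file. The following is a minimal sketch on a made-up three-row frame; the column names are borrowed from the dataset for illustration only and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two columns contain one missing value each, one has none
toy = pd.DataFrame({
    "saddle": [1.0, np.nan, 3.0],
    "jockeyName": ["A", "B", None],
    "age": [4.0, 5.0, 6.0],
})

# isnull() marks missing cells as True; sum() counts them per column
missing_counts = toy.isnull().sum()

# Boolean filtering keeps only the columns with at least one missing value
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing.to_string())
```

The fully populated age column drops out of the filtered summary, which is the same behavior the notebook cell relies on.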

In [1]:
import pandas as pd

# Define the path to the CSV file
file_path = './data/raw_data/horses_2020.csv'

# Load the data into a pandas DataFrame
horse_racing_data = pd.read_csv(file_path)

# Summarize the missing data in the DataFrame
missing_data_summary = horse_racing_data.isnull().sum()

# Filter columns that have missing data
columns_with_missing_data = missing_data_summary[missing_data_summary > 0]

# Check if there are any columns with missing data
if not columns_with_missing_data.empty:
    # Create a DataFrame to display columns with missing data and their respective counts
    missing_data_df = pd.DataFrame({
        'Column': columns_with_missing_data.index,
        'Missing Values': columns_with_missing_data.values
    })

    # Print the summary of missing data
    print("Columns with missing data and their respective counts:")
    print(f"{'Column':<20} {'Missing Values':<15}")
    
    # Iterate over the DataFrame rows and print each column and its missing value count
    for _, row in missing_data_df.iterrows():
        print(f"{row['Column']:<20} {row['Missing Values']:<15}")
else:
    # Indicate that there are no columns with missing data
    print("No columns have missing data.")


Columns with missing data and their respective counts:
Column               Missing Values 
saddle               6              
trainerName          21             
jockeyName           3              
positionL            21227          
dist                 35952          
overWeight           144613         
outHandicap          148159         
headGear             93730          
RPR                  72446          
TR                   101769         
OR                   69967          
mother               1              
gfather              36             
price                54502          




### Cleaning Data
This script processes horse racing data from a CSV file. It starts by defining the paths for the raw data, the cleaned data, and the directory where split data files will be saved. The script reads the data into a DataFrame with specified data types, then cleans it by selecting the relevant columns and dropping rows that are missing price, saddle, trainerName, or jockeyName. It converts fractional price strings such as 14/1 to floats, renames the columns for readability, and calculates an implied win probability for each runner as 1 / (1 + price). The cleaned data is saved to a new CSV file and is also split by race ID, with each race saved to a separate file. The main function coordinates these steps: it checks that the raw data file exists, processes the data, saves the cleaned version, and splits the data by race ID. If the raw data file isn't found, it prints an error message and stops.
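As a worked example of the price conversion, the sketch below turns a few fractional odds strings into implied win probabilities with the same 1 / (1 + price) formula. The helper name fractional_odds_to_probability is hypothetical and the function handles only strings of the form a/b.

```python
def fractional_odds_to_probability(price: str) -> float:
    """Hypothetical helper: implied win probability from fractional odds like '14/1'."""
    numerator, denominator = price.split("/")
    odds = float(numerator) / float(denominator)  # e.g. '14/1' -> 14.0
    return 1.0 / (1.0 + odds)                     # same formula as the script

for price in ("14/1", "7/1", "5/2"):
    print(price, round(fractional_odds_to_probability(price), 6))
```

A price of 14/1 gives 1/15, about 0.066667. Probabilities implied by bookmaker prices generally sum to more than 1 within a race, which is why they are normalized before being used as probabilities.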

In [2]:
import os
import pandas as pd

# File paths
RAW_DATA_PATH = './data/raw_data/horses_2020.csv'
CLEANED_DATA_PATH = './data/cleaned_data/horses_2020_cleaned.csv'
SPLIT_DATA_DIR = './data/split_races'

def read_data(file_path):
    """
    Reads CSV data into a DataFrame with specified data types.

    Args:
    file_path (str): The path to the CSV file.

    Returns:
    pd.DataFrame: The loaded data.
    """
    dtype = {
        'rid': str,
        'horseName': str,
        'age': float,
        'saddle': float, 
        'isFav': bool,
        'trainerName': str,
        'jockeyName': str,
        'runners': float,
        'margin': str,
        'res_win': bool,
        'res_place': bool,
        'price': str,
    }
    return pd.read_csv(file_path, dtype=dtype)

def clean_data(df):
    """
    Cleans the horse racing data and performs necessary transformations.

    Args:
    df (pd.DataFrame): The raw data.

    Returns:
    pd.DataFrame: The cleaned data.
    """
    relevant_columns = [
        'rid', 'horseName', 'age', 'saddle', 'isFav', 'trainerName', 
        'jockeyName', 'runners', 'margin', 'res_win', 'res_place', 'price'
    ]
    # Copy so the column assignment below does not write to a slice of the original frame
    df_cleaned = df[relevant_columns].dropna(subset=['price', 'saddle', 'trainerName', 'jockeyName']).copy()
    df_cleaned['price_float'] = df_cleaned['price'].apply(convert_price_to_float)
    df_cleaned = df_cleaned.rename(columns={
        'rid': 'Race ID',
        'horseName': 'Horse',
        'age': 'Age',
        'saddle': 'Saddle',
        'isFav': 'Favorite',
        'trainerName': 'Trainer',
        'jockeyName': 'Jockey',
        'runners': 'Runners',
        'margin': 'Margin',
        'res_win': 'Won or Not',
        'res_place': 'Top 3',
        'price': 'Price'
    })
    df_cleaned['Win Probability'] = 1 / (1 + df_cleaned['price_float'])
    return df_cleaned.drop(columns=['price_float'])

def convert_price_to_float(price):
    """
    Converts price string to float. Handles fractional prices.

    Args:
    price (str): The price as a string.

    Returns:
    float: The price as a float.
    """
    if '/' in price:
        numerator, denominator = price.split('/')
        return float(numerator) / float(denominator)
    return float(price)

def save_cleaned_data(df, file_path):
    """
    Saves the cleaned DataFrame to a CSV file.

    Args:
    df (pd.DataFrame): The cleaned data.
    file_path (str): The path to save the CSV file.
    """
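    # Create the target directory if it does not exist yet
    os.makedirs(os.path.dirname(file_path) or '.', exist_ok=True)
    df.to_csv(file_path, index=False)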
    print(f"Cleaned data saved to: {file_path}")

def split_and_save_by_race(df, output_dir):
    """
    Splits DataFrame by Race ID and saves each race's data to a separate CSV file.

    Args:
    df (pd.DataFrame): The cleaned data.
    output_dir (str): The directory to save the split files.
    """
    os.makedirs(output_dir, exist_ok=True)
    for race_id, race_data in df.groupby('Race ID'):
        output_file_path = os.path.join(output_dir, f'race_{race_id}.csv')
        race_data.to_csv(output_file_path, index=False)
        print(f"Race {race_id} data saved to: {output_file_path}")
    print("All race events have been split into separate CSV files.")

def main():
    """
    Main function to process horse racing data.
    """
    if not os.path.exists(RAW_DATA_PATH):
        print(f"File not found: {RAW_DATA_PATH}")
        return

    horse_racing_data = read_data(RAW_DATA_PATH)
    horse_racing_cleaned = clean_data(horse_racing_data)
    
    print("Cleaned Horse Racing Data:")
    print(horse_racing_cleaned.head())
    
    save_cleaned_data(horse_racing_cleaned, CLEANED_DATA_PATH)
    split_and_save_by_race(horse_racing_cleaned, SPLIT_DATA_DIR)

if __name__ == "__main__":
    main()

Cleaned Horse Racing Data:
      Race ID            Horse  Age  Saddle  Favorite         Trainer  \
54502  405502      Motakhayyel  4.0    15.0     False  Richard Hannon   
54503  405502     Jack's Point  4.0    19.0     False    William Muir   
54504  405502       Mutamaasik  4.0     8.0     False    Roger Varian   
54505  405502  Cliffs Of Capri  6.0    22.0     False   Jamie Osborne   
54506  405502           Shelir  4.0    18.0     False   David O'Meara   

                Jockey  Runners              Margin  Won or Not  Top 3 Price  \
54502      Jim Crowley     23.0  1.2895897325049972        True   True  14/1   
54503     Martin Dwyer     23.0  1.2895897325049972       False   True  66/1   
54504     Dane O'Neill     23.0  1.2895897325049972       False   True   7/1   
54505  Dougie Costello     23.0  1.2895897325049972       False   True  20/1   
54506   Daniel Tudhope     23.0  1.2895897325049972       False  False  22/1   

       Win Probability  
54502         0.066667  
545


### Time-Rank Duality in Horse Racing
This Python script processes the data for a single race. It loads the race from a CSV file and normalizes the implied win probabilities so that they sum to 1. After setting a random seed for reproducibility, it generates random initial guesses and normalizes them as well. A residual sum of squares (RSS) function measures the difference between the predicted and actual probabilities, and scipy's minimize function (Powell method, with each guess bounded between 0 and 1) optimizes the guesses against it. The script then calculates the inverse cumulative distribution function (CDF) values of the normalized probabilities and, from them, sigma and the mu values. Finally, it compiles all the results into a pandas DataFrame and displays them.
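Two properties of these steps are easy to check on a small case. The sketch below uses four made-up implied probabilities (not taken from the dataset). It shows that the RSS is unchanged when the guesses are rescaled, and that the mu values map back to the normalized probabilities through the normal CDF. The names rss and MU0 are illustrative stand-ins for rss_function and INITIAL_MU.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical implied probabilities for a four-horse race; they sum to 1.15
implied = np.array([0.40, 0.30, 0.25, 0.20])
p = implied / implied.sum()  # normalize so the probabilities sum to 1

def rss(guess, target):
    """Residual sum of squares between the normalized guess and the target."""
    return np.sum((guess / guess.sum() - target) ** 2)

# Any positive multiple of p normalizes back to p, so the RSS minimum is not unique
print(rss(p, p), rss(5.0 * p, p))

# Round trip through the inverse CDF: mu_i = MU0 - sigma * ppf(p_i)
MU0 = 1.5                          # plays the role of INITIAL_MU
z = norm.ppf(p)                    # inverse CDF values
n = len(p)
sigma = (n - n**2 / 2) / z.sum()   # same expression as in calculate_mu_sigma
mu = MU0 - sigma * z
recovered = norm.cdf((MU0 - mu) / sigma)
print(np.allclose(recovered, p))
```

Because the RSS depends only on the normalized guesses, the optimizer can stop at any rescaled copy of the target, so the raw optimized guesses need not sum to 1 until they are normalized again. The round trip holds for any nonzero sigma, since (MU0 - mu) / sigma is the inverse CDF value itself. Note that the sigma expression has a zero numerator when n is 2, so this sketch assumes a race with more than two runners.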

In [3]:
import pandas as pd
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Constants
FILE_PATH = './data/split_races/race_410637.csv'
RANDOM_SEED = 42
INITIAL_MU = 1.5

def load_race_data(file_path):
    """Loads the race data from a CSV file."""
    return pd.read_csv(file_path)

def normalize_probabilities(probabilities):
    """Normalizes an array of probabilities so that they sum to 1."""
    return probabilities / np.sum(probabilities)

def rss_function(initial_guess, win_probabilities):
    """Calculates the Residual Sum of Squares (RSS) between predicted and actual probabilities."""
    if np.sum(initial_guess) == 0:
        return np.inf
    predicted_probabilities = normalize_probabilities(initial_guess)
    squared_differences = (predicted_probabilities - win_probabilities) ** 2
    return np.sum(squared_differences)

def optimize_initial_guess(initial_guess, win_probabilities):
    """Optimizes the initial guess to minimize the RSS function."""
    bounds = [(0, 1) for _ in range(len(initial_guess))]
    result = minimize(
        rss_function, 
        initial_guess, 
        args=(win_probabilities,), 
        method='Powell', 
        bounds=bounds
    )
    return result.x

def calculate_mu_sigma(normalized_win_probabilities):
    """Calculates the mu and sigma values for the probabilities."""
    inverse_cdf_values = norm.ppf(normalized_win_probabilities)
    sigma = (len(normalized_win_probabilities) - len(normalized_win_probabilities)**2 / 2) / np.sum(inverse_cdf_values)
    mu_values = INITIAL_MU - sigma * inverse_cdf_values
    return mu_values, sigma

def main():
    """Main function to process and analyze horse racing data."""
    # Load and normalize win probabilities
    race_data = load_race_data(FILE_PATH)
    win_probabilities = race_data['Win Probability'].astype(float).values
    normalized_win_probabilities = normalize_probabilities(win_probabilities)
    
    print(f"Sum of Normalized Win Probabilities: {np.sum(normalized_win_probabilities):.10f}\n")
    
    # Generate initial guesses and normalize them
    np.random.seed(RANDOM_SEED)
    initial_guess = np.random.rand(len(win_probabilities))
    predicted_probabilities = normalize_probabilities(initial_guess)
    
    print(f"Sum of Initial Guesses: {np.sum(initial_guess):.10f}\n")
    print(f"RSS Function Value: {rss_function(initial_guess, win_probabilities):.10f}\n")
    
    # Perform optimization
    optimized_initial_guess = optimize_initial_guess(initial_guess, normalized_win_probabilities)
    normalized_optimized_initial_guess = normalize_probabilities(optimized_initial_guess)
    
    print(f"Sum of Optimized Initial Guesses: {np.sum(optimized_initial_guess):.10f}\n")
    print(f"Sum of Normalized Optimized Initial Guesses: {np.sum(normalized_optimized_initial_guess):.10f}\n")
    
    # Calculate final RSS
    final_rss = rss_function(normalized_optimized_initial_guess, normalized_win_probabilities)
    print(f"RSS Function Value After Normalization: {final_rss:.10f}\n")
    
    # Calculate mu and sigma values
    inverse_cdf_values = norm.ppf(normalized_win_probabilities)
    mu_values, sigma = calculate_mu_sigma(normalized_win_probabilities)
    probabilities = norm.cdf((INITIAL_MU - mu_values) / sigma)
    
    # Compile results into a DataFrame
    results_df = pd.DataFrame({
        'Horse': race_data['Horse'],
        'Win Probabilities': win_probabilities,
        'Normalized Win Probabilities': normalized_win_probabilities,
        'Initial Guesses': initial_guess,
        'Predicted Probabilities': predicted_probabilities,
        'Residuals': predicted_probabilities - win_probabilities,
        'Optimized Initial Guesses': optimized_initial_guess,
        'Normalized Optimized Initial Guesses': normalized_optimized_initial_guess,
        'Inverse of CDF Values': inverse_cdf_values,
        'Sigma': [sigma] * len(win_probabilities),
        'Mu Values': mu_values,
        'Verifying Probabilities': probabilities
    })
    
    # Display results
    pd.options.display.float_format = '{:.10f}'.format
    pd.set_option('display.colheader_justify', 'center')
    print(results_df.to_string(index=False), "\n")

if __name__ == "__main__":
    main()


Sum of Normalized Win Probabilities: 1.0000000000

Sum of Initial Guesses: 7.6018729345

RSS Function Value: 0.1159718888

Sum of Optimized Initial Guesses: 5.5926580177

Sum of Normalized Optimized Initial Guesses: 1.0000000000

RSS Function Value After Normalization: 0.0000000000

     Horse        Win Probabilities  Normalized Win Probabilities  Initial Guesses  Predicted Probabilities    Residuals   Optimized Initial Guesses  Normalized Optimized Initial Guesses  Inverse of CDF Values     Sigma       Mu Values   Verifying Probabilities
 Plenty Of Butty    0.1538461538             0.1191953039           0.3745401188         0.0492694527       -0.1045767012        0.6666301776                     0.1191973790                 -1.1790190059      4.1213865518  6.3591930754       0.1191953039      
   Alfie's Angel    0.0243902439             0.0188968165           0.9507143064         0.1250631673        0.1006729234        0.1056719787                     0.0188947685                 -