# Step 1.3: F1 Data Preprocessing and Cleaning

This notebook covers the tasks for preprocessing and cleaning the raw data collected in the previous steps. The goal is to create a clean, structured, and consistent dataset ready for analysis and model training.

**Tasks:**
1.  **Load Raw Data:** Ingest all datasets from the `f1_data` directory.
2.  **Standard Text Cleaning:** Apply procedures like lowercasing, removing HTML tags, and stripping special characters from text fields.
3.  **Handle Missing Data:** Inspect each dataset for null values and apply an appropriate strategy (e.g., filling or dropping).
4.  **Standardize Schema:** Ensure data types are correct (e.g., dates, numbers) across all datasets.
5.  **Save Cleaned Data:** Store the processed dataframes in a versioned directory (`cleaned_data_v1.0`). Parquet was the intended format, but the cells below write CSV to avoid an extra dependency.

### 1. Setup and Configuration

Import necessary libraries and define the file paths for our raw data and the output directory for the cleaned data.

In [1]:
import os
import pandas as pd
import re
from bs4 import BeautifulSoup
import glob

# --- Configuration ---
# Input paths based on the project structure
SCRAPED_DATA_PATH = os.path.join('f1_data', 'scraped_data')
KAGGLE_DATA_PATH = os.path.join('f1_data', 'kaggle_data')

# Output path for the cleaned, versioned dataset
CLEANED_DATA_PATH = os.path.join('f1_data', 'cleaned_data_v1.0')

# Create the output directory if it doesn't exist
os.makedirs(CLEANED_DATA_PATH, exist_ok=True)

print(f"Raw Scraped Data Path: {SCRAPED_DATA_PATH}")
print(f"Raw Kaggle Data Path: {KAGGLE_DATA_PATH}")
print(f"Cleaned Data Output Path: {CLEANED_DATA_PATH}")

Raw Scraped Data Path: f1_data\scraped_data
Raw Kaggle Data Path: f1_data\kaggle_data
Cleaned Data Output Path: f1_data\cleaned_data_v1.0


### 2. Text Cleaning Utility Function

This function is our general-purpose tool for cleaning text data. It removes HTML tags, converts text to lowercase, strips special characters, and collapses extra whitespace. Non-string input (such as `NaN`) is returned as an empty string.

In [None]:
def clean_text(text):
    """
    Applies standard text cleaning procedures.
    - Removes HTML tags.
    - Converts text to lowercase.
    - Removes special characters, keeping alphanumeric and basic punctuation.
    """
    if not isinstance(text, str):
        return ""
    
    # Remove HTML tags using BeautifulSoup
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters (keeping letters, numbers, and basic punctuation)
    text = re.sub(r'[^\w\s.,!?-]', '', text)
    
    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text
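
To see what the character filter does in practice, here is a minimal sketch that reproduces only the lowercase, regex, and whitespace steps of `clean_text` with the standard library. The BeautifulSoup step is left out, and `strip_special_chars` is a hypothetical name used only for this illustration. Note that apostrophes are dropped, while accented letters survive because `\w` matches Unicode word characters in Python 3.

```python
import re


def strip_special_chars(text: str) -> str:
    """Lowercase, drop characters outside the allow-list, collapse whitespace."""
    text = text.lower()
    # Same pattern as clean_text: keep word chars, whitespace, and . , ! ? -
    text = re.sub(r'[^\w\s.,!?-]', '', text)
    return re.sub(r'\s+', ' ', text).strip()


# Made-up sample strings, not taken from the datasets
samples = [
    "Verstappen's  DOUBLE win!",   # -> 'verstappens double win!'
    "Pit-stop @ lap #12: 2.1s",    # -> 'pit-stop lap 12 2.1s'
    "Pérez   P2, Norris P3?",      # -> 'pérez p2, norris p3?'
]
for s in samples:
    print(repr(strip_special_chars(s)))
```

If possessives or contractions matter for downstream analysis, adding `'` to the allow-list would be one option to consider.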

### 3. Process Scraped News Articles

Load `f1_news_articles.csv`, apply the text cleaning function to the `title` and `summary` columns, report missing values, and drop duplicate articles by `link`.

In [7]:
def process_scraped_news():
    """
    Loads, cleans, and saves the scraped news articles.
    """
    print("\n--- Processing Scraped News Articles ---")
    news_file = os.path.join(SCRAPED_DATA_PATH, 'f1_news_articles.csv')

    try:
        news_df = pd.read_csv(news_file)
        print("Original news data:")
        print(news_df.head())
        print(f"\nShape: {news_df.shape}")
        print(f"\nMissing values:\n{news_df.isnull().sum()}")

        # Clean text columns
        news_df['title_cleaned'] = news_df['title'].apply(clean_text)
        news_df['summary_cleaned'] = news_df['summary'].apply(clean_text)

        # clean_text already returns '' for missing values; this fillna is a defensive no-op
        news_df['summary_cleaned'] = news_df['summary_cleaned'].fillna('')
        
        # Drop original text columns and duplicates
        cleaned_news_df = news_df[['source', 'title_cleaned', 'summary_cleaned', 'link']].copy()
        cleaned_news_df.drop_duplicates(subset=['link'], inplace=True)
        cleaned_news_df.rename(columns={'title_cleaned': 'title', 'summary_cleaned': 'summary'}, inplace=True)
        
        print("\nCleaned news data:")
        print(cleaned_news_df.head())
        print(f"\nShape after cleaning: {cleaned_news_df.shape}")

        # Save to CSV instead of Parquet to avoid dependency issues
        output_file = os.path.join(CLEANED_DATA_PATH, 'cleaned_news_articles.csv')
        cleaned_news_df.to_csv(output_file, index=False)
        print(f"\nSuccessfully cleaned and saved news articles to {output_file}")

    except FileNotFoundError:
        print(f"Error: News articles file not found at {news_file}")
    except Exception as e:
        print(f"An error occurred: {e}")


In [8]:
process_scraped_news()


--- Processing Scraped News Articles ---
Original news data:
                source                                              title  \
0  Motorsport Magazine  Verstappen nears historic F1 comeback - US GP ...   
1  Motorsport Magazine  Verstappen's saving F1 from papaya coporate pr...   
2  Motorsport Magazine  Apple's F1 gamble gets mixed response - What y...   
3  Motorsport Magazine  Mark Hughes: How Verstappen crushed McLaren af...   
4  Motorsport Magazine  When is the next F1 race? Full calendar for 20...   

                                             summary  \
0  Verstappen's double win in Austin moved him cl...   
1  Could F1 2025 be about to wake from its slumbe...   
2  Cadillac's quiet confidence, Apple's broadcast...   
3  Behind Max Verstappen’s perfect 2025 United St...   
4  Full F1 schedule for the year, including the n...   

                                                link  
0  https://www.motorsportmagazine.com/articles/si...  
1  https://www.motorsportm
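
As a small illustration of the missing-value and de-duplication steps in `process_scraped_news`, the sketch below uses a made-up three-row frame (all values are hypothetical). It mirrors the `isinstance` guard in `clean_text` with a lambda rather than calling the function itself, so it runs with pandas alone.

```python
import pandas as pd

# Hypothetical sample rows; the real columns come from f1_news_articles.csv
news = pd.DataFrame({
    "source": ["Site A", "Site A", "Site B"],
    "title": ["first story", "first story (repost)", "second story"],
    "summary": ["some text", None, "more text"],
    "link": [
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ],
})

# Mirror clean_text's guard: anything that is not a str becomes ""
news["summary"] = news["summary"].apply(lambda v: v if isinstance(v, str) else "")

# drop_duplicates keeps the first row seen for each link by default
deduped = news.drop_duplicates(subset=["link"]).reset_index(drop=True)
print(deduped[["title", "summary", "link"]])
```

Because the guard already maps missing values to an empty string, a later `fillna('')` on the cleaned column has nothing left to fill.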

### 4. Process Scraped Autosport Race Results

Here, we'll find all the individual race result CSVs, combine them into a single DataFrame, standardize the data, and handle any inconsistencies.

In [10]:
def process_scraped_results():
    """
    Loads, cleans, and saves scraped race results.
    """
    print("\n--- Processing Scraped Race Results ---")
    results_files = glob.glob(os.path.join(SCRAPED_DATA_PATH, 'autosport_f1_results_*.csv'))

    if results_files:
        all_results_df = pd.concat([pd.read_csv(f) for f in results_files], ignore_index=True)
        
        print("Original combined results data:")
        print(all_results_df.head())
        print(f"\nShape: {all_results_df.shape}")
        print(f"\nMissing values:\n{all_results_df.isnull().sum()}")

        # Basic cleaning and standardization
        all_results_df['Driver'] = all_results_df['Driver'].apply(clean_text)
        all_results_df['Team'] = all_results_df['Team'].apply(clean_text)
        
        # Convert 'Pos' to numeric, coercing errors (like 'NC') to NaN
        all_results_df['Pos'] = pd.to_numeric(all_results_df['Pos'], errors='coerce')
        
        # Fill NaN in 'Time' with a placeholder
        all_results_df['Time'] = all_results_df['Time'].fillna('Not Classified')
        
        # Drop duplicates if any
        all_results_df.drop_duplicates(inplace=True)

        print("\nCleaned race results data:")
        print(all_results_df.head())
        print(f"\nShape after cleaning: {all_results_df.shape}")

        # Save to CSV
        output_file = os.path.join(CLEANED_DATA_PATH, 'cleaned_race_results.csv')
        all_results_df.to_csv(output_file, index=False)
        print(f"\nSuccessfully cleaned and saved race results to {output_file}")
    else:
        print("No scraped race result files found to process.")

In [11]:
process_scraped_results()


--- Processing Scraped Race Results ---
No scraped race result files found to process.
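
Since no result files were found in this run, the standardization steps never executed. The sketch below shows, on made-up values, how `pd.to_numeric(..., errors='coerce')` and the `'Not Classified'` placeholder would behave. The optional cast to the nullable `Int64` dtype is an addition for illustration and is not part of the function above.

```python
import pandas as pd

# Hypothetical finishing positions, including non-numeric classifications
pos_raw = pd.Series(["1", "2", "NC", "DSQ", "5"])

# Non-numeric entries become NaN instead of raising an error
pos = pd.to_numeric(pos_raw, errors="coerce")

# Optional: a nullable integer dtype keeps positions as integers despite gaps
pos_int = pos.astype("Int64")

# Missing race times receive the same placeholder used in the notebook
time_raw = pd.Series(["1:32:05.1", None])
time_filled = time_raw.fillna("Not Classified")

print(pos_int.tolist())
print(time_filled.tolist())
```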


### 5. Process Kaggle Datasets

Now we'll process each of the Kaggle datasets. Since they are more structured, the focus will be on data type consistency and handling any missing values.

In [18]:
def process_kaggle_datasets():
    """
    Wrapper function to process all Kaggle datasets.
    """
    def process_file(file_path, output_name, text_column=None, date_column=None):
        """A helper function to load, clean, and save a Kaggle CSV."""
        print(f"\n--- Processing {output_name} ---")
        try:
            df = pd.read_csv(file_path, encoding='utf-8', on_bad_lines='warn')
            print(f"Original shape: {df.shape}")
            
            # Clean text column if specified
            if text_column and text_column in df.columns:
                df[text_column] = df[text_column].apply(clean_text)
            
            # Convert date column if specified
            if date_column and date_column in df.columns:
                df[date_column] = pd.to_datetime(df[date_column], errors='coerce')
                
            # Drop rows whose specified columns are missing after conversion
            # (in practice, dates that failed to parse and became NaT)
            required = [c for c in (text_column, date_column) if c and c in df.columns]
            df.dropna(subset=required, inplace=True)
            
            print(f"Shape after cleaning: {df.shape}")
            
            # Save to CSV
            output_file = os.path.join(CLEANED_DATA_PATH, f'cleaned_{output_name}.csv')
            df.to_csv(output_file, index=False)
            print(f"Successfully saved cleaned data to {output_file}")
            return df
        except FileNotFoundError:
            print(f"Error: File not found at {file_path}")
            return None
        except Exception as e:
            print(f"An error occurred while processing {file_path}: {e}")
            return None

    # Process each Kaggle dataset
    process_file(
        os.path.join(KAGGLE_DATA_PATH, 'trending_tweets', 'F1_tweets.csv'), 
        'tweets', 
        text_column='text', 
        date_column='date'
    )

    process_file(
        os.path.join(KAGGLE_DATA_PATH, 'reddit_comments', 'kaggle_RC_2019-05.csv'),
        'reddit_comments',
        text_column='body'
    )

    process_file(
        os.path.join(KAGGLE_DATA_PATH, 'fan_ratings', 'aggregated_kaggle.csv'),
        'fan_ratings',
        date_column='Y'
    )

    history_files = {
        'constructors_performance': 'Constructor_Performance.csv',
        'constructor_rankings': 'Constructor_Rankings.csv',
        'drivers_details': 'Driver_Details.csv',
        'driver_rankings': 'Driver_Rankings.csv',
        'lap_times': 'Lap_Timings.csv',
        'pit_stop_records': 'Pit_Stop_Records.csv',
        'qualifying_results': 'Qualifying_Results.csv',
        'race_results': 'Race_Results.csv',
        'race_schedule': 'Race_Schedule.csv',
        'race_status': 'Race_Status.csv',
        'seasonal_summary': 'Season_Summaries.csv',
        'sprint_results': 'Sprint_Race_Results.csv',
        'team_details': 'Team_Details.csv',
        'track_information': 'Track_Information.csv'

    }

    for name, filename in history_files.items():
        process_file(
            os.path.join(KAGGLE_DATA_PATH, 'championship_history', filename),
            name
        )
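
The helper reads each file with `on_bad_lines='warn'`. As a rough sketch of what that option does, the snippet below parses a small in-memory CSV (made-up content) in which one row has an extra field. With the default parser, that row is reported and skipped rather than aborting the load.

```python
import io

import pandas as pd

# Hypothetical CSV text; the third line has three fields instead of two
raw = "driver,points\nverstappen,25\nnorris,18,extra\nleclerc,15\n"

# The malformed row is skipped with a warning; well-formed rows are kept
df = pd.read_csv(io.StringIO(raw), on_bad_lines="warn")

print(df.shape)
print(df["driver"].tolist())
```

Skipped rows never reach the "Original shape" printout, so the raw row count of a file can be higher than the reported shape.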


In [19]:
process_kaggle_datasets()


--- Processing tweets ---


Original shape: (632388, 13)
Shape after cleaning: (632384, 13)
Successfully saved cleaned data to f1_data\cleaned_data_v1.0\cleaned_tweets.csv

--- Processing reddit_comments ---
Original shape: (1000000, 4)
Shape after cleaning: (1000000, 4)
Successfully saved cleaned data to f1_data\cleaned_data_v1.0\cleaned_reddit_comments.csv

--- Processing fan_ratings ---
Original shape: (202, 7)
Shape after cleaning: (202, 7)
Successfully saved cleaned data to f1_data\cleaned_data_v1.0\cleaned_fan_ratings.csv

--- Processing constructors_performance ---
Original shape: (12505, 5)
Shape after cleaning: (12505, 5)
Successfully saved cleaned data to f1_data\cleaned_data_v1.0\cleaned_constructors_performance.csv

--- Processing constructor_rankings ---
Original shape: (13271, 7)
Shape after cleaning: (13271, 7)
Successfully saved cleaned data to f1_data\cleaned_data_v1.0\cleaned_constructor_rankings.csv

--- Processing drivers_details ---
Original shape: (859, 9)
Shape after cleaning: (859, 9)
Succ
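
The date handling in `process_file` can be sketched on made-up values as follows. Unparseable entries become `NaT` under `errors='coerce'` and are then removed by `dropna`, which is consistent with the small row loss reported for the tweets file. The second half rests on an assumption: if a column such as `Y` holds integer years, `pd.to_datetime` without a format would read them as nanoseconds since 1970-01-01, so an explicit `format='%Y'` is likely what is wanted. The actual contents of `Y` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical date strings, one invalid and one missing
dates = pd.Series(["2021-03-28 15:00:00", "not a date", None])
parsed = pd.to_datetime(dates, errors="coerce")  # invalid/missing -> NaT
kept = parsed.dropna()

# Assumption: a year column stored as plain integers
years = pd.Series([2019, 2020])
as_epoch_ns = pd.to_datetime(years)                        # read as ns since epoch
as_years = pd.to_datetime(years.astype(str), format="%Y")  # read as calendar years

print(len(kept))
print(as_epoch_ns.dt.year.tolist())
print(as_years.dt.year.tolist())
```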

### 6. Conclusion

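The scraped news articles and the Kaggle datasets have been cleaned and saved to the `f1_data/cleaned_data_v1.0` directory as CSV files (not Parquet, to avoid an extra dependency). No scraped Autosport result files were found in this run, so `cleaned_race_results.csv` was not produced. With that caveat, the preprocessing and cleaning phase is complete and the cleaned data is ready for exploratory data analysis (EDA) and model building.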
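
The task list names Parquet as the target format, while the cells in this notebook write CSV. One possible way to reconcile the two is sketched below: try Parquet first and fall back to CSV when no Parquet engine (such as `pyarrow`) is installed. `save_frame` is a hypothetical helper, not part of the notebook, and the sample frame is made up.

```python
import os
import tempfile

import pandas as pd


def save_frame(df: pd.DataFrame, path_stem: str) -> str:
    """Write df as Parquet if an engine is available, otherwise as CSV."""
    try:
        out = f"{path_stem}.parquet"
        df.to_parquet(out, index=False)  # raises ImportError without an engine
    except ImportError:
        out = f"{path_stem}.csv"
        df.to_csv(out, index=False)
    return out


# Made-up frame written to a temporary directory for demonstration
demo = pd.DataFrame({"driver": ["verstappen", "norris"], "points": [25, 18]})
with tempfile.TemporaryDirectory() as tmp:
    written = save_frame(demo, os.path.join(tmp, "cleaned_demo"))
    exists = os.path.exists(written)
    print(os.path.basename(written))
```

Parquet preserves column dtypes such as parsed datetimes, which a CSV round trip would turn back into strings.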