DATA PREPROCESSING FOR ML

Author: Zeeshan

Purpose: Prepare the cleaned data for a machine learning model

Preprocessing tasks:
1. Convert data types
2. Handle missing values
3. Extract date features
4. Normalize text fields

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

# Load my cleaned data from previous step
data = pd.read_csv('data/processed/cleaned_master_data.csv')
print(f"📥 Loaded {len(data):,} records")
print(f"   Columns: {data.columns.tolist()}")

📥 Loaded 29,959 records
   Columns: ['contract_id', 'pub_date', 'contract_amount', 'bidder_count', 'dept_name', 'proc_method', 'data_source']


TYPE CONVERSION STRATEGY:

Why? ML models need proper data types:
- Dates as datetime → Extract month, year, day
- Amounts as numbers → Calculate statistics
- Categories as text → For grouping analysis


In [2]:
print("🔄 Converting data types...")

# Convert publication date to datetime
# Using UTC to handle timezone issues I found in EDA
data['pub_date'] = pd.to_datetime(data['pub_date'],
                                   errors='coerce',
                                   utc=True)
# Remove timezone for easier handling
data['pub_date'] = data['pub_date'].dt.tz_localize(None)

# Ensure amounts are numeric
data['contract_amount'] = pd.to_numeric(data['contract_amount'],
                                        errors='coerce')

# Ensure bidder count is numeric
data['bidder_count'] = pd.to_numeric(data['bidder_count'],
                                      errors='coerce')

# Check results
print(f"✓ Date conversion: {data['pub_date'].notna().sum()} valid dates")
print(f"✓ Amount conversion: {data['contract_amount'].notna().sum()} valid amounts")
print(f"✓ Bidder conversion: {data['bidder_count'].notna().sum()} valid counts")

🔄 Converting data types...
✓ Date conversion: 26205 valid dates
✓ Amount conversion: 29959 valid amounts
✓ Bidder conversion: 29929 valid counts
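
To illustrate why the valid-date count can come out lower than the row count, here is a minimal sketch on made-up values (none of them are taken from the real dataset). It assumes the same calls as the cell above: errors='coerce' turns unparseable entries into NaT/NaN instead of raising, and utc=True puts different offsets on one clock before the timezone is dropped.

```python
import pandas as pd

# Illustrative values only, not from the real dataset
raw_dates = pd.Series([
    '2022-03-15 10:00:00+05:00',   # local time with a +05:00 offset
    '2022-03-15 05:00:00+00:00',   # the same instant expressed in UTC
    'not a date',                  # unparseable -> NaT
    None,                          # missing -> NaT
])

# Parse to timezone-aware UTC timestamps, coercing bad values to NaT
parsed = pd.to_datetime(raw_dates, errors='coerce', utc=True)

# Drop the timezone, keeping the UTC wall-clock time
naive = parsed.dt.tz_localize(None)

# Same idea for numbers: bad strings become NaN
amounts = pd.to_numeric(pd.Series(['1500.50', 'N/A', '200']), errors='coerce')

print(naive)
print(f"Valid dates: {naive.notna().sum()} of {len(naive)}")
print(f"Valid amounts: {amounts.notna().sum()} of {len(amounts)}")
```

Rows that come out as NaT here are exactly the kind that the missing-value step has to deal with afterwards.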


MISSING VALUE STRATEGY (My Rationale):

bidder_count: Fill with 1 (single bidder - worst case assumption)
proc_method: Fill with 'Unknown' (category for missing)
pub_date: Remove rows (can't do time analysis without dates)

Why these choices?
- Assuming single bidder is conservative (flags more for review)
- Unknown category preserves records without making false assumptions
- Date is critical for fraud detection - can't fill with fake dates

In [4]:
print("🔧 Handling missing values...")

# Track missing before
print("📊 Missing values BEFORE handling:")
print(data.isnull().sum())
print()

# Fix 1: Fill bidder count by reassignment rather than inplace=True
bidder_missing = data['bidder_count'].isna().sum()
data = data.copy()  # Work on an explicit copy to avoid SettingWithCopyWarning
data['bidder_count'] = data['bidder_count'].fillna(1)  # Single bidder is the conservative assumption
print(f"✓ Filled {bidder_missing} missing bidder counts with 1")

# Fix 2: Fill procurement method
proc_missing = data['proc_method'].isna().sum()
data['proc_method'] = data['proc_method'].fillna('Unknown')  # Modern way
print(f"✓ Filled {proc_missing} missing procurement methods with 'Unknown'")

# Fix 3: Handle missing dates (REMOVE these rows)
date_missing = data['pub_date'].isna().sum()
if date_missing > 0:
    print(f"\n⚠️  Found {date_missing} rows with missing dates")
    print("   Removing these rows (dates are critical for fraud analysis)")
    # Keep only rows with valid dates; .copy() avoids SettingWithCopyWarning
    # when new columns are added to the filtered frame later
    data = data[data['pub_date'].notna()].copy()
    print(f"✓ Removed {date_missing} rows with missing dates")

# Show current missing values
print("\n📊 Missing values AFTER handling:")
remaining_missing = data.isnull().sum()
print(remaining_missing)

# Summary
print(f"\n✅ Data ready for feature engineering!")
print(f"   Final record count: {len(data):,}")
print(f"   Columns with no missing values: {(remaining_missing == 0).sum()}/{len(remaining_missing)}")

🔧 Handling missing values...
📊 Missing values BEFORE handling:
contract_id           0
pub_date           3754
contract_amount       0
bidder_count          0
dept_name             0
proc_method           0
data_source           0
dtype: int64

✓ Filled 0 missing bidder counts with 1
✓ Filled 0 missing procurement methods with 'Unknown'

⚠️  Found 3754 rows with missing dates
   Removing these rows (dates are critical for fraud analysis)
✓ Removed 3754 rows with missing dates

📊 Missing values AFTER handling:
contract_id        0
pub_date           0
contract_amount    0
bidder_count       0
dept_name          0
proc_method        0
data_source        0
dtype: int64

✅ Data ready for feature engineering!
   Final record count: 26,205
   Columns with no missing values: 7/7
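
One design option that the notebook above does not use: record which rows were imputed before filling them, so a later model can tell a genuine single-bidder contract from an assumed one. The sketch below runs on a made-up three-row frame, and the column name bidder_count_imputed is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (illustrative values only)
toy = pd.DataFrame({
    'bidder_count': [3, np.nan, 1],
    'proc_method': ['Open', None, 'Direct'],
    'pub_date': pd.to_datetime(['2022-01-10', '2022-02-01', None]),
})

# Hypothetical indicator column: remember which counts were missing
toy['bidder_count_imputed'] = toy['bidder_count'].isna()

# Same strategy as above: fill counts with 1, methods with 'Unknown'
toy['bidder_count'] = toy['bidder_count'].fillna(1)
toy['proc_method'] = toy['proc_method'].fillna('Unknown')

# Drop rows without a date; .copy() keeps the result independent of the original
toy = toy.dropna(subset=['pub_date']).copy()

print(toy)
```

The indicator costs one extra boolean column and keeps the "worst case" assumption visible instead of blending it into the real data.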


DATE FEATURE EXTRACTION (My Design):

From publication date, I'm extracting:
- year: To see trends over time
- month: To detect year-end rush (Dec/March)
- day_of_week: To catch tenders published on weekends (suspicious)
- quarter: For quarterly analysis

Why? Timing is critical in procurement fraud detection

In [5]:
print("📅 Extracting date features...")

# Extract year
data['tender_year'] = data['pub_date'].dt.year

# Extract month (1-12)
data['tender_month'] = data['pub_date'].dt.month

# Extract day of week (0=Monday, 6=Sunday)
data['day_of_week'] = data['pub_date'].dt.dayofweek

# Calculate quarter (Q1-Q4)
data['quarter'] = data['pub_date'].dt.quarter

# Create readable month names for later visualization
month_names = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
               7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
data['month_name'] = data['tender_month'].map(month_names)

print("✓ Created date-based features:")
print(f"   - tender_year (range: {data['tender_year'].min()}-{data['tender_year'].max()})")
print(f"   - tender_month (1-12)")
print(f"   - day_of_week (0-6)")
print(f"   - quarter (1-4)")

📅 Extracting date features...
✓ Created date-based features:
   - tender_year (range: 2022-2022)
   - tender_month (1-12)
   - day_of_week (0-6)
   - quarter (1-4)
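
The extracted features can be turned into explicit flags for the two timing patterns mentioned above. This is a minimal sketch on made-up dates; the column names is_weekend and is_year_end_month are hypothetical, and it assumes a Saturday/Sunday weekend and treats December and March as the year-end months.

```python
import pandas as pd

# Illustrative dates only: a Saturday in March, a Friday in December, a Wednesday in July
toy = pd.DataFrame({
    'pub_date': pd.to_datetime(['2022-03-26', '2022-12-30', '2022-07-13'])
})

toy['tender_month'] = toy['pub_date'].dt.month
toy['day_of_week'] = toy['pub_date'].dt.dayofweek  # 0=Monday, 6=Sunday

# Hypothetical flag: published on Saturday (5) or Sunday (6)
toy['is_weekend'] = toy['day_of_week'] >= 5

# Hypothetical flag: published in a year-end month (assumed to be March or December)
toy['is_year_end_month'] = toy['tender_month'].isin([3, 12])

print(toy)
```

If the weekend falls on different days in the jurisdiction being analysed, only the threshold in is_weekend needs to change.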


In [6]:
# Save preprocessed dataset
data.to_csv('data/processed/preprocessed_data.csv', index=False)

print(f"\n✅ PREPROCESSING COMPLETE")
print(f"   Original columns: 7")
print(f"   Final columns: {len(data.columns)}")
print(f"   Records ready for feature engineering: {len(data):,}")
print(f"\n💾 Saved to: data/processed/preprocessed_data.csv")

# Display summary
data.info()


✅ PREPROCESSING COMPLETE
   Original columns: 7
   Final columns: 12
   Records ready for feature engineering: 26,205

💾 Saved to: data/processed/preprocessed_data.csv
<class 'pandas.core.frame.DataFrame'>
Index: 26205 entries, 0 to 26204
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   contract_id      26205 non-null  object        
 1   pub_date         26205 non-null  datetime64[ns]
 2   contract_amount  26205 non-null  float64       
 3   bidder_count     26205 non-null  float64       
 4   dept_name        26205 non-null  object        
 5   proc_method      26205 non-null  object        
 6   data_source      26205 non-null  object        
 7   tender_year      26205 non-null  int32         
 8   tender_month     26205 non-null  int32         
 9   day_of_week      26205 non-null  int32         
 10  quarter          26205 non-null  int32         
 11  month_name       26205 non-nul