# MOT Test Data Generation

This notebook generates synthetic MOT test data for analysis. The assignment requires:
  - Tabular data with at least 500 rows
  - Non-tabular data generated with AI
  - Explanation of generation and field distributions

**Author:** Gabriel Guimaraes

**Date:** 6.11.25

**Module:** Practical Data Analytics Level 5

## 1. Import Required Libraries

We'll use:
  - **pandas**: Data manipulation and `DataFrames`
  - **numpy**: Numerical operations and random number generation
  - **faker**: Generate realistic fake data (names, addresses, dates, etc.)
  - **sqlite3/sqlalchemy**: Database creation and management
  - **datetime**: Date handling

In [2]:
import pandas as pd
import numpy as np
from faker import Faker
from datetime import datetime, timedelta
import random
import sqlite3
from sqlalchemy import create_engine

fake = Faker('en_GB')
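np.random.seed(16)  # For reproducibility (NumPy draws)
random.seed(16)     # Seed the standard-library generator used by random.choice/randint
Faker.seed(16)      # Seed Faker's shared generator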

print("Libraries imported successfully!")

Libraries imported successfully!
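`np.random.seed` only affects NumPy's global generator; draws made through the standard `random` module need their own seed, and Faker has a separate class-level `Faker.seed()`. One way to confirm that a run is repeatable is to reseed and compare draws. The sketch below does this for `random` and NumPy only; `seed_all` and `sample_once` are hypothetical helper names, and the short lists are placeholders rather than the notebook's parameters.

```python
import random
import numpy as np

SEED = 16  # same value the notebook passes to np.random.seed


def seed_all(seed: int) -> None:
    """Seed every random source used for generation (hypothetical helper)."""
    random.seed(seed)       # standard-library generator
    np.random.seed(seed)    # NumPy global generator
    # In the notebook, Faker.seed(seed) would be called here as well.


def sample_once() -> tuple:
    """Draw one value from each generator, mimicking the notebook's mixed usage."""
    make = random.choice(['Ford', 'Vauxhall', 'Volkswagen'])
    fuel = np.random.choice(['Petrol', 'Diesel'], p=[0.6, 0.4])
    return make, str(fuel)


seed_all(SEED)
first = sample_once()
seed_all(SEED)
second = sample_once()

print(first == second)  # identical draws after reseeding
```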


## 2. Define Data Generation Parameters

Setting up the scale of our synthetic dataset:
  - **500 vehicles** (minimum requirement)
  - **1000 MOT tests** (vehicles can have multiple tests)
  - **20 test centres** across UK regions
  - **50 examiners** conducting tests

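We also define the category lists that later cells sample from: vehicle makes and models, fuel types, UK regions, and defect categories and severities. The sampling weights themselves are set inside the generation functions.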

In [3]:
# Define Data Generation Parameters
NUM_VEHICLES = 500
NUM_TESTS = 1000
NUM_TEST_CENTRES = 20
NUM_EXAMINERS = 50

# Vehicle makes and models 
VEHICLE_MAKES = ['Ford', 'Vauxhall', 'Volkswagen', 'BMW', 'Mercedes', 'Audi', 'Toyota', 'Nissan']
VEHICLE_MODELS = {
  'Ford': ['Fiesta', 'Focus', 'Mondeo', 'Kuga', 'Transit'],
  'Vauxhall': ['Corsa', 'Astra', 'Insignia', 'Mokka'],
  'Volkswagen': ['Golf', 'Polo', 'Passat', 'Tiguan'],
  'BMW': ['3 Series', '5 Series', 'X3', '1 Series'],
  'Mercedes': ['C-Class', 'E-Class', 'A-Class', 'GLC'],
  'Audi': ['A3', 'A4', 'Q3', 'Q5'],
  'Toyota': ['Yaris', 'Corolla', 'RAV4', 'Aygo'],
  'Nissan': ['Qashqai', 'Juke', 'Micra', 'Leaf']
}

FUEL_TYPES = ['Petrol', 'Diesel', 'Electric', 'Hybrid']
UK_REGIONS = ['London', 'South East', 'North West', 'Midlands', 'Scotland', 'Wales', 'South West', 'North East', 'Yorkshire']
DEFECT_CATEGORIES = ['Brakes', 'Tyres', 'Lights', 'Suspension', 'Steering', 'Exhaust', 'Body', 'Registration Plates', 'Seatbelts']
DEFECT_SEVERITIES = ['DANGEROUS', 'MAJOR', 'MINOR', 'ADVISORY']

print("Data generation parameters defined.")


Data generation parameters defined.
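Because `VEHICLE_MODELS` is looked up by make during generation, a make without a model list would raise a `KeyError` or leave `random.choice` with an empty list. A minimal sanity check, assuming the two structures are defined exactly as in the cell above (`check_make_model_config` is a hypothetical helper, not part of the notebook), might look like this:

```python
VEHICLE_MAKES = ['Ford', 'Vauxhall', 'Volkswagen', 'BMW', 'Mercedes', 'Audi', 'Toyota', 'Nissan']
VEHICLE_MODELS = {
    'Ford': ['Fiesta', 'Focus', 'Mondeo', 'Kuga', 'Transit'],
    'Vauxhall': ['Corsa', 'Astra', 'Insignia', 'Mokka'],
    'Volkswagen': ['Golf', 'Polo', 'Passat', 'Tiguan'],
    'BMW': ['3 Series', '5 Series', 'X3', '1 Series'],
    'Mercedes': ['C-Class', 'E-Class', 'A-Class', 'GLC'],
    'Audi': ['A3', 'A4', 'Q3', 'Q5'],
    'Toyota': ['Yaris', 'Corolla', 'RAV4', 'Aygo'],
    'Nissan': ['Qashqai', 'Juke', 'Micra', 'Leaf'],
}


def check_make_model_config(makes, models_by_make):
    """Return a list of problems; an empty list means the config is consistent."""
    problems = []
    # Every make must have at least one model to choose from
    for make in makes:
        if not models_by_make.get(make):
            problems.append(f"no models listed for make {make!r}")
    # Every key in the model mapping should be a known make
    for make in models_by_make:
        if make not in makes:
            problems.append(f"models listed for unknown make {make!r}")
    return problems


problems = check_make_model_config(VEHICLE_MAKES, VEHICLE_MODELS)
print(problems or "Make/model configuration is consistent.")
```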


## 3. Generate Vehicle Data 

Creating the **vehicles** table with 500 records. Each vehicle has:
  - Unique vehicle_id and registration plate
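  - Make and model drawn uniformly at random from the lists defined above
  - Fuel type (weighted: Petrol 61%, Diesel 31%, Hybrid 6%, Electric 2%, based on 2023 data)
  - Registration date between 15 and 2 years before the run date (roughly 2010-2023)
  - Vehicle class (set to class 4, standard cars, for every vehicle; subtypes A-J are not generated)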

This represents the foundational data for our MOT analysis.

In [4]:
def generate_vehicles(num_vehicles) -> pd.DataFrame:
    """Generate synthetic vehicle data."""
    vehicles = []

    for i in range(num_vehicles):
        make = random.choice(VEHICLE_MAKES)
        model = random.choice(VEHICLE_MODELS[make])

        registration_date = fake.date_between(start_date='-15y', end_date='-2y')
        registration = fake.license_plate()

        # Data: https://assets.publishing.service.gov.uk/media/66bdf9923cc0741b923146e1/ntsq09029.ods
        fuel_type = np.random.choice(FUEL_TYPES, p=[0.61, 0.31, 0.02, 0.06])  # Weights follow FUEL_TYPES order; based on 2023 data
        vehicle_class = '4'  # Most vehicles are class 4 (standard cars)

        vehicle = {
            'vehicle_id': i + 1,
            'registration': registration,
            'make': make,
            'model': model,
            'fuel_type': fuel_type,
            'registration_date': registration_date,
            'vehicle_class': vehicle_class 
        }

        vehicles.append(vehicle)


    return pd.DataFrame(vehicles)

vehicles_df = generate_vehicles(NUM_VEHICLES)
print(f"Generated {len(vehicles_df)} vehicles.")
print("\nVehicle Make Distribution:")
print(vehicles_df['make'].value_counts())
vehicles_df.head()
        

Generated 500 vehicles.

Vehicle Make Distribution:
make
Volkswagen    69
Mercedes      68
Vauxhall      65
Audi          63
BMW           63
Toyota        63
Nissan        55
Ford          54
Name: count, dtype: int64


|   | vehicle_id | registration | make | model | fuel_type | registration_date | vehicle_class |
|---|---|---|---|---|---|---|---|
| 0 | 1 | HJ53 UJL | Audi | Q5 | Petrol | 2017-10-08 | 4 |
| 1 | 2 | GM27 AWO | Mercedes | A-Class | Petrol | 2023-05-17 | 4 |
| 2 | 3 | LQ84XFE | Nissan | Leaf | Petrol | 2020-01-30 | 4 |
| 3 | 4 | FD36UAD | Ford | Kuga | Petrol | 2012-09-22 | 4 |
| 4 | 5 | OY12 YCR | BMW | 1 Series | Petrol | 2020-12-11 | 4 |
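The fuel-type weights passed to `np.random.choice` in `generate_vehicles` can be checked against the shares actually observed in a sample. The sketch below is a standalone illustration, not the notebook's own code: it uses NumPy's `default_rng` generator and draws all 500 values in one call, so the specific values will differ from those in `vehicles_df`.

```python
import numpy as np

FUEL_TYPES = ['Petrol', 'Diesel', 'Electric', 'Hybrid']
FUEL_WEIGHTS = [0.61, 0.31, 0.02, 0.06]  # same order as FUEL_TYPES, as in generate_vehicles

rng = np.random.default_rng(16)  # independent generator; seed chosen to mirror the notebook
sample = rng.choice(FUEL_TYPES, size=500, p=FUEL_WEIGHTS)

# Count each fuel type and convert the counts to shares of the sample
values, counts = np.unique(sample, return_counts=True)
shares = {str(v): int(c) / len(sample) for v, c in zip(values, counts)}

for fuel, weight in zip(FUEL_TYPES, FUEL_WEIGHTS):
    print(f"{fuel:<9} target={weight:.2f} observed={shares.get(fuel, 0.0):.3f}")
```

With only 500 rows, the rarer categories (Electric at 2%) can deviate noticeably from their target in relative terms, which is worth bearing in mind when describing the field distributions.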


## 4. Generate Test Centre Data

Creating 20 MOT test centres distributed across UK regions. Each centre has:
  - Unique centre_id
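  - Name of the form "<city> Test Centre", using a Faker-generated city
  - Location (the leading part of a Faker-generated address)
  - Region for geographic analysis, chosen uniformly at random from `UK_REGIONS`
  - UK postcode, generated independently of the location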

This allows us to analyse regional variations in MOT outcomes.

In [8]:
def generate_test_centres(num_centres) -> pd.DataFrame:
    """Generate synthetic test centre data."""
    centres = []

    for i in range(num_centres):
        name = f"{fake.city()} Test Centre"
        # Remove postcode from location to avoid redundancy
        location = fake.address().rsplit(' ', 2)[0].replace('\n', ', ')
        region = random.choice(UK_REGIONS)
        postcode = fake.postcode()
        centre = {
            'centre_id': i + 1,
            'name': name,
            'location': location,
            'region': region,
            'postcode': postcode
        }
        centres.append(centre)

    return pd.DataFrame(centres)

test_centres_df = generate_test_centres(NUM_TEST_CENTRES)
print(f"Generated {len(test_centres_df)} test centres.")
print("\nTest Centre Region Distribution:")
print(test_centres_df['region'].value_counts())
test_centres_df.head()


Generated 20 test centres.

Test Centre Region Distribution:
region
Wales         4
North East    3
South West    3
Yorkshire     2
South East    2
Midlands      2
London        2
North West    1
Scotland      1
Name: count, dtype: int64


|   | centre_id | name | location | region | postcode |
|---|---|---|---|---|---|
| 0 | 1 | Wrightstad Test Centre | Flat 32B, Mohammed | Yorkshire | B3 7ZJ |
| 1 | 2 | New Bethanyberg Test Centre | 996 Howard | South East | WA2 0PX |
| 2 | 3 | Andreaside Test Centre | Studio 17c, Mills | Yorkshire | RG05 5RT |
| 3 | 4 | Stuartberg Test Centre | Flat 4, Lynne | South East | B68 8RR |
| 4 | 5 | Lake Rebeccabury Test Centre | 308 Maurice summit, North | Midlands | HD64 4HL |
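The `location` values above look truncated (for example "Flat 4, Lynne"). A likely cause is that `rsplit(' ', 2)` splits on the last two spaces rather than on the last line, so it can cut through the street name or city. The sketch below contrasts that with splitting on newlines. It assumes the address is newline-separated with the postcode on the final line, which is inferred from the outputs rather than verified against Faker; the sample address is made up and `split_address` is a hypothetical helper.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split a newline-separated address into (location, postcode).

    Assumes the final non-empty line is the postcode.
    """
    lines = [line.strip() for line in address.split('\n') if line.strip()]
    *location_lines, postcode = lines
    return ', '.join(location_lines), postcode


# Hypothetical address in the multi-line layout the notebook's code appears to expect
sample = "Flat 4\nLynne summit\nStuartberg\nB68 8RR"

# The notebook's approach: drop everything after the third-from-last space
truncated = sample.rsplit(' ', 2)[0].replace('\n', ', ')

# Line-based alternative: keep every line except the postcode
location, postcode = split_address(sample)

print(truncated)  # Flat 4, Lynne
print(location)   # Flat 4, Lynne summit, Stuartberg
print(postcode)   # B68 8RR
```

Taking the postcode from the same address also keeps the two fields consistent, whereas a separate `fake.postcode()` call produces a postcode unrelated to the location.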


## 5. Generate Examiner Data

Creating 50 MOT examiners who conduct the tests. Each examiner has:
  - Unique examiner_id
  - Realistic name
  - Years of experience (2-25 years)
  - Assigned to a specific test centre

This data could be used for analysing examiner consistency and experience effects on test outcomes.

In [9]:
def generate_examiners(num_examiners, test_centres_df) -> pd.DataFrame:
    """Generate synthetic examiner data."""
    examiners = []
    centre_ids = test_centres_df['centre_id'].tolist()  # Build the list once, outside the loop

    for i in range(num_examiners):
        name = fake.name()
        years_experience = random.randint(2, 25)
        centre_id = random.choice(centre_ids)

        examiner = {
            'examiner_id': i + 1,
            'name': name,
            'years_experience': years_experience,
            'centre_id': centre_id
        }
        examiners.append(examiner)

    return pd.DataFrame(examiners)

examiners_df = generate_examiners(NUM_EXAMINERS, test_centres_df)
print(f"Generated {len(examiners_df)} examiners.")
print("\nExaminer Experience Distribution:")
print(examiners_df['years_experience'].describe())
examiners_df.head()

Generated 50 examiners.

Examiner Experience Distribution:
count    50.000000
mean     14.200000
std       5.872801
min       2.000000
25%       9.250000
50%      14.500000
75%      18.000000
max      25.000000
Name: years_experience, dtype: float64


|   | examiner_id | name | years_experience | centre_id |
|---|---|---|---|---|
| 0 | 1 | Gregory Brown | 16 | 16 |
| 1 | 2 | Hollie Webster | 19 | 5 |
| 2 | 3 | Lydia Burns | 18 | 1 |
| 3 | 4 | Harry Thomas | 21 | 8 |
| 4 | 5 | Linda Brown | 11 | 12 |
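Assigning each examiner to a centre with an independent uniform draw does not guarantee that every centre is staffed. With 50 examiners and 20 centres, a given centre is missed with probability (19/20)^50, about 0.077, so between one and two centres are expected to have no examiner. If that matters for later joins, one option is to cover every centre first and distribute the remaining examiners at random. The sketch below is an illustration under that assumption; `assign_centres` is a hypothetical helper and is not what the notebook does.

```python
import random


def assign_centres(num_examiners, centre_ids, rng):
    """Assign a centre to each examiner so that every centre gets at least one.

    Assumes num_examiners >= len(centre_ids).
    """
    if num_examiners < len(centre_ids):
        raise ValueError("need at least one examiner per centre")
    # First pass: one examiner per centre, which guarantees coverage
    assignments = list(centre_ids)
    # Remaining examiners: uniform random choice over all centres
    extra = num_examiners - len(centre_ids)
    assignments += [rng.choice(centre_ids) for _ in range(extra)]
    # Shuffle so examiner order does not mirror centre order
    rng.shuffle(assignments)
    return assignments


rng = random.Random(16)            # local generator, independent of global state
centre_ids = list(range(1, 21))    # 20 centres, ids 1..20 as in the notebook
assignments = assign_centres(50, centre_ids, rng)

unstaffed = set(centre_ids) - set(assignments)
print(len(assignments), len(unstaffed))
```

The returned list can be used in place of the per-examiner `random.choice` call, indexing it by examiner position.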
