# Crime Data Analysis

This notebook performs analysis on crime data across different regions. It consolidates CSV files from multiple folders, computes summary statistics, and provides breakdowns by crime type and jurisdiction. 

**Objectives:**
- Load and combine multiple CSV files into a single DataFrame.
- Identify the earliest and latest crime data.
- Calculate the total number of crimes.
- Provide breakdowns by crime type and jurisdiction.

## Import Libraries

We use Python libraries such as `pandas` for data manipulation, `os` for file system operations, and `tqdm` for a progress bar to improve user experience during file loading.




In [14]:
import os
import pandas as pd
from tqdm import tqdm # for the progress bar
import pickle # to store the DataFrame as a binary object

## Set Project Paths

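Here, we set the paths to the data directories. The `start_folder` is the path where the CSV files are stored, and `output_folder` is where the processed data is written. The code sets both paths relative to the project directory for consistency; in a notebook, where `__file__` is not defined, this falls back to the current working directory.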


In [15]:
# start_folder = '/mnt/raw_data'  # Replace with your folder path
# output_folder = '/mnt/processed_data'  # Replace with your folder path

# Resolve the project directory: the script's own folder when run as a script,
# otherwise (e.g. in a notebook, where __file__ is undefined) the working directory
project_dir = (os.path.dirname(os.path.abspath(__file__))
               if '__file__' in globals() else os.getcwd())

# my data is in a raw_data directory within my project
start_folder = os.path.join(project_dir, 'raw_data')

# my data will be output to the processed_data directory within my project
output_folder = os.path.join(project_dir, 'processed_data')

## Define Functions to Load Data

This function loads all CSV files from the specified `raw_data` directory and its subdirectories into a single DataFrame. The function also uses a progress bar to track the loading process.

- **Function:** `load_csv_files_to_dataframe`
- **Input:** `start_folder` (directory containing CSV files)
- **Output:** Combined DataFrame with all crime data


In [16]:
def load_csv_files_to_dataframe(start_folder):
    # First, collect the paths of all CSV files under start_folder
    csv_files = []
    for root, _, files in os.walk(start_folder):
        for file in files:
            if file.endswith('.csv'):
                csv_files.append(os.path.join(root, file))

    # Display progress bar as files are loaded
    print(f"Found {len(csv_files)} CSV files. Loading...")
    frames = []
    for file_path in tqdm(csv_files, desc="Loading files"):
        try:
            frames.append(pd.read_csv(file_path))
        except Exception as e:
            print(f"Error reading {file_path}: {e}")

    # Concatenate once at the end; concatenating inside the loop
    # re-copies the accumulated data on every iteration
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

## Load and Combine Data

This code block uses the `load_csv_files_to_dataframe` function to load all CSV files into a single DataFrame called `combined_df`. A progress bar shows the loading status.


In [17]:
# Usage
combined_df = load_csv_files_to_dataframe(start_folder)

Found 1570 CSV files. Loading...


Loading files: 100%|██████████| 1570/1570 [14:53<00:00,  1.76it/s]


## Data Preprocessing

In this section, we ensure that the `Month` column is in a standard datetime format to enable date-based calculations. This preprocessing step is necessary to compute date-related statistics.


In [18]:
# Ensure 'Month' column is in datetime format
combined_df['Month'] = pd.to_datetime(combined_df['Month'], format='%Y-%m')
combined_df.head()

Unnamed: 0,Crime ID,Month,Reported by,Falls within,Longitude,Latitude,Location,LSOA code,LSOA name,Crime type,Last outcome category,Context
0,,2021-10-01,Avon and Somerset Constabulary,Avon and Somerset Constabulary,-2.49487,51.422276,On or near Conference/Exhibition Centre,E01014399,Bath and North East Somerset 001A,Anti-social behaviour,,
1,,2021-10-01,Avon and Somerset Constabulary,Avon and Somerset Constabulary,-2.511761,51.409966,On or near Caernarvon Close,E01014399,Bath and North East Somerset 001A,Anti-social behaviour,,
2,2ef684ec509091d7b95b33fad587b0556e809a8d9bce9b...,2021-10-01,Avon and Somerset Constabulary,Avon and Somerset Constabulary,-2.495055,51.422132,On or near Cross Street,E01014399,Bath and North East Somerset 001A,Burglary,Status update unavailable,
3,a5838c48aa3cd51c2458a9dc56d3395f84a040ff5f227c...,2021-10-01,Avon and Somerset Constabulary,Avon and Somerset Constabulary,-2.509126,51.416137,On or near St Francis Road,E01014399,Bath and North East Somerset 001A,Burglary,Status update unavailable,
4,b6e9c1573eccc1eee7e6a3152a8477a0b44a66329c1344...,2021-10-01,Avon and Somerset Constabulary,Avon and Somerset Constabulary,-2.515072,51.419357,On or near Stockwood Hill,E01014399,Bath and North East Somerset 001A,Drugs,Unable to prosecute suspect,
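
As a small illustration of the conversion above, the sketch below parses a few made-up `YYYY-MM` strings (the `sample` DataFrame is hypothetical, not taken from the dataset). It shows that each value becomes a timestamp on the first day of its month, which is why `2021-10` appears as `2021-10-01` in the output.

```python
import pandas as pd

# Made-up sample in the same 'YYYY-MM' shape as the Month column
sample = pd.DataFrame({"Month": ["2021-10", "2021-11", "2024-09"]})

# Parsing with an explicit format yields the first day of each month
sample["Month"] = pd.to_datetime(sample["Month"], format="%Y-%m")

# Once parsed, min/max and other date-based operations work directly
first_month = sample["Month"].min()
last_month = sample["Month"].max()

print(sample["Month"].dt.strftime("%Y-%m-%d").tolist())
print(first_month.strftime("%Y-%m"), "to", last_month.strftime("%Y-%m"))
```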


## Save Combined Data to File

Next, we save the combined DataFrame to a pickle file in the `processed_data` directory, so the processed data can be reloaded later without re-reading every CSV file.


In [30]:
# Define the output file path, creating the output folder if it does not exist
os.makedirs(output_folder, exist_ok=True)
output_file = os.path.join(output_folder, 'combined_data.pkl')

# Save the DataFrame to a pickle file
combined_df.to_pickle(output_file)
print("Data saved")

# Load the DataFrame from the pickle file
# combined_df = pd.read_pickle(output_file)

Data saved
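
A quick way to gain confidence in the saved file is a round-trip check. The sketch below is a minimal illustration on a tiny made-up DataFrame written to a temporary directory (`demo_df`, `demo_path` and `restored_df` are hypothetical names); the same comparison could be run against the real file if memory allows. Note that pickle files should only be loaded from sources you trust.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up frame standing in for combined_df
demo_df = pd.DataFrame({
    "Month": pd.to_datetime(["2021-10", "2021-11"], format="%Y-%m"),
    "Crime type": ["Burglary", "Drugs"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "combined_data.pkl")
    demo_df.to_pickle(demo_path)             # write the binary file
    restored_df = pd.read_pickle(demo_path)  # read it back

# equals() compares values and dtypes, so the datetime column must survive intact
round_trip_ok = restored_df.equals(demo_df)
print(round_trip_ok)
```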


## Calculate Total Crime Count

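Here, we calculate the total number of crimes recorded in the dataset, which gives an overview of the data size. The same cell also computes the earliest and latest months and the breakdowns by crime type and jurisdiction that are displayed in the following sections.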


In [23]:
# Ensure 'Month' column is in datetime format
combined_df['Month'] = pd.to_datetime(combined_df['Month'], format='%Y-%m')

# Calculate the earliest and latest Month
earliest_month = combined_df['Month'].min()
latest_month = combined_df['Month'].max()

# Calculate the total count of crimes
total_crimes = len(combined_df)

# Breakdown of crimes by Crime type
crime_type_breakdown = combined_df['Crime type'].value_counts()

# Breakdown of crimes by "Falls within"
falls_within_breakdown = combined_df['Falls within'].value_counts()
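
Raw counts can be easier to compare alongside percentage shares. The following is a minimal sketch on a made-up Series (`demo` and `summary` are hypothetical names) using `value_counts(normalize=True)`; applying the same idea to `combined_df['Crime type']` is an assumption about how one might extend the analysis, not part of the original notebook.

```python
import pandas as pd

# Made-up values standing in for the 'Crime type' column
demo = pd.Series(["Burglary", "Drugs", "Burglary", "Robbery"], name="Crime type")

counts = demo.value_counts()                                  # absolute counts
shares = demo.value_counts(normalize=True).mul(100).round(1)  # percentage of all rows

# Both Series share the same index, so they align when combined
summary = pd.DataFrame({"count": counts, "share_pct": shares})
print(summary)
```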

## Summary of Results

- **Earliest Month:** Displayed as `yyyy-mm`.
- **Latest Month:** Displayed as `yyyy-mm`.
- **Total Crimes:** Displayed with thousands separators.
- **Breakdowns:** Summary tables showing the number of crimes by type and jurisdiction.

This section summarises the main findings from our dataset.


In [24]:
# Display the results with commas in numerical values
print(f"Earliest Month: {earliest_month.strftime('%Y-%m')}")
print(f"Latest Month: {latest_month.strftime('%Y-%m')}")
print(f"Total Crimes: {total_crimes:,}")

Earliest Month: 2021-10
Latest Month: 2024-09
Total Crimes: 18,261,334


## Breakdown of Crimes by Type

This section provides a summary of crimes grouped by crime type. It shows the number of occurrences for each crime type, formatted with thousands separators for readability.


In [25]:
print("\nBreakdown of Crimes by Crime Type:")
print(crime_type_breakdown.apply(lambda x: f"{x:,}"))



Breakdown of Crimes by Crime Type:
Crime type
Violence and sexual offences    6,338,324
Anti-social behaviour           2,933,160
Public order                    1,445,829
Criminal damage and arson       1,440,935
Other theft                     1,376,051
Shoplifting                     1,138,045
Vehicle crime                   1,096,588
Burglary                          755,241
Drugs                             498,951
Other crime                       334,821
Theft from the person             334,314
Robbery                           214,538
Bicycle theft                     199,207
Possession of weapons             155,330
Name: count, dtype: object
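
The `dtype: object` in the output above reflects that formatting with thousands separators turns each count into a string. The sketch below, using two counts copied from the output above in a hypothetical `counts` Series, illustrates keeping the numeric Series for calculations and formatting only a copy for display.

```python
import pandas as pd

# Two counts copied from the breakdown above, for illustration only
counts = pd.Series(
    [6338324, 755241],
    index=["Violence and sexual offences", "Burglary"],
    name="count",
)

# Formatting produces strings, so do it on a copy used only for printing
formatted = counts.map("{:,}".format)
print(formatted)

# The original Series stays numeric and can still be summed
subtotal = counts.sum()
print(f"{subtotal:,}")
```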


## Breakdown of Crimes by Jurisdiction ("Falls within")

We also calculate the number of crimes grouped by the jurisdiction they fall within, represented by the `"Falls within"` column. This breakdown helps us understand crime distribution across different regions.


In [26]:
print("\nBreakdown of Crimes by 'Falls within':")
print(falls_within_breakdown.apply(lambda x: f"{x:,}"))


Breakdown of Crimes by 'Falls within':
Falls within
Metropolitan Police Service           3,357,701
West Midlands Police                  1,110,599
West Yorkshire Police                   992,871
Thames Valley Police                    618,286
Kent Police                             586,313
Lancashire Constabulary                 564,982
Northumbria Police                      557,351
Hampshire Constabulary                  551,983
South Yorkshire Police                  546,229
Essex Police                            526,488
Avon and Somerset Constabulary          518,539
Merseyside Police                       514,143
Sussex Police                           463,046
South Wales Police                      399,965
Devon & Cornwall Police                 364,797
Nottinghamshire Police                  364,396
West Mercia Police                      338,262
Derbyshire Constabulary                 334,054
Leicestershire Police                   332,740
Staffordshire Police               

## Known Issues
Before using the data, it’s important to review the [changelog](https://data.police.uk/changelog/) on DATA.POLICE.UK, which lists known issues. Understanding these issues is crucial for assessing the reliability and limitations of the data, as certain discrepancies or data gaps could impact the analysis and results.

As of 13th November 2024, the issues include:
* Court outcomes from June 2019 onwards are currently unavailable. We are working with the MoJ to provide this data over the coming months.
* Avon and Somerset Constabulary: Due to a change in IT systems there are known issues with crime and outcome data since October 2015 as latitude and longitude information is missing from approximately 2000 crimes each month. The force will be working to rectify this and provide the missing data over the coming months.
* British Transport Police: ASB data has not been provided for period April 2016 onwards. The force will rectify this issue and provide the missing data over the coming months.
* Devon and Cornwall: Due to a range of issues, including the implementation of a new record management system in November 2022, outcomes data is unreliable. Work is ongoing to address the issues.
* Greater Manchester Police: Due to a change in IT systems no crime, outcome or stop and search data is available from July 2019 onwards. The force are working to rectify this issue and provide the missing data over the coming months.
* Gwent Police: The force are currently rectifying the issues with the Stop and Search Data and are looking to provide the data once the issues are rectified.
* Humberside Police: The force are implementing a new crime recording system so are currently unable to provide stop and search data. Data will be provided when this work is complete.
* Wiltshire Police: Due to Stop and Search data forms being completed manually there is a data lag of up to 6 weeks with these forms being processed. The force aims to rectify this as soon as possible and will provide the data over the coming months.
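
Some of these issues, such as missing latitude and longitude values, can be checked directly in the data. The sketch below is one possible check, shown on a small made-up DataFrame that reuses the dataset's column names (`demo` and `missing_by_force_month` are hypothetical names); it counts rows with a missing coordinate per force and month. Running it on `combined_df` instead of `demo` is an assumption about how the check would be applied.

```python
import pandas as pd

# Made-up rows using the dataset's column names
demo = pd.DataFrame({
    "Month": pd.to_datetime(["2021-10", "2021-10", "2021-11"], format="%Y-%m"),
    "Falls within": ["Avon and Somerset Constabulary"] * 3,
    "Longitude": [-2.49487, None, None],
    "Latitude": [51.422276, None, 51.4],
})

# A row counts as missing if either coordinate is absent
missing_coords = demo["Longitude"].isna() | demo["Latitude"].isna()

# Count affected rows per force and month
missing_by_force_month = (
    demo[missing_coords]
    .groupby(["Falls within", "Month"])
    .size()
    .rename("rows_missing_coords")
)
print(missing_by_force_month)
```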