In [1]:
import pandas as pd
import matplotlib.pyplot as plt

Load the processed dataset CSV (saved in the same folder as this notebook) into a pandas DataFrame. This stage assumes you have already run the first notebook and categorised the charges in the Google Sheet, using an array formula that checks the Charge (s) column against each list of charges and writes either 'Contains' or 'Doesn't Contain' into the matching category column. E.g. if you have a tab of Violent Charges that includes the charge 'Battery', then for any row where the array formula finds the word 'Battery' in the Charge (s) column, it enters 'Contains' in the Violent Charges column.
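The Google Sheet formula itself is not shown in this notebook, so the following is only a rough Python sketch of the matching rule described above. The keyword list, the sample rows, and the `flag` helper are all made up for illustration, and the case-insensitive match is an assumption.

```python
# Hypothetical keyword list and sample rows, for illustration only.
violent_keywords = ["Battery", "Assault"]

charges = [
    "2 x Battery",
    "1 x Absence Without Leave",
]

def flag(charge_text, keywords):
    """Return 'Contains' if any keyword appears in the charge text.

    Matching is case-insensitive here; that is an assumption, not
    something the original sheet formula is known to do.
    """
    text = charge_text.lower()
    if any(keyword.lower() in text for keyword in keywords):
        return "Contains"
    return "Doesn't Contain"

flags = [flag(c, violent_keywords) for c in charges]
print(flags)
```

Plain substring matching like this will also flag longer words that happen to include a keyword, so the keyword lists are worth reviewing by hand.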

In [2]:
# read_csv already returns a DataFrame, so no separate conversion is needed
df = pd.read_csv('processed_courtmartial_2010-23.csv')

Display the DataFrame

In [3]:
df

Unnamed: 0,Reference number,Rank,Service,Unit,Trial Court,Sentencing Date,Year of Sentencing,Charge (s),Act charged under,Finding,...,G - Alcohol,H - Child sex offence charge,I - Firearms charges,J - Damage to private property,K - Vehicle Negligence,L - Harassment,M - Manslaughter,N - Perverting Course of Justice,O - Resisting Arrest,P - Use of Internet
0,,Signaller,Army,,Colchester,07-Jan-10,2010,"2 x Battery, 1 x Assault Occasioning Actual Bo...",,"1 x Not Guilty, 2 x Guilty",...,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain
1,,Colonel,Army,,Sennelager,08-Jan-10,2010,"5 x Obtaining a money transfer by deception, 6...",,"11 x Not Guilty, 2 x Guilty",...,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain
2,,Guardsman,Army,,Colchester,13-Jan-10,2010,1 x Absence Without Leave,,Guilty,...,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain
3,,Signaller,Army,,Colchester,13-Jan-10,2010,1 x Absence Without Leave,,Guilty,...,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain
4,,Guardsman,Army,,Colchester,21-Jan-10,2010,"1 x Desertion, 1 x Absence Without Leave",,"Not Guilty, Guilty",...,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6066,6106,Lance Corporal,Army,212 Yorkshire Field Hospital,Catterick,14-Dec-23,2023,Ch 1: Sexual assault Ch 2-4: Disgraceful condu...,,Ch 1-3: Not Guilty\n Ch 4 & 5: Guilty\n Ch 6: ...,...,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain
6067,6245,Ex-Chief Petty Officer,Royal Navy,Formerly of HMS Drake,Bulford,14-Dec-23,2023,Ch 1a: Sexual assault\n Ch 1b: Alternative cha...,,"Ch 1a, 2a, 5a: Not Guilty - no evidence offere...",...,Contains,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain
6068,6317,Sergeant,Army,26 Regt RA,Bulford,15-Dec-23,2023,Ch 1: Threatening with an offensive weapon.,,Guilty,...,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain
6069,6352,Flight Lieutenant,Royal Air Force,RAF Brize Norton,Bulford,15-Dec-23,2023,Ch 1 & 2: Contravention of standing orders,,Guilty,...,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain,Doesn't Contain


Now we isolate the charge columns. This requires that the charge columns are the last columns in your dataset. In this code the first charge column is the violent charges column, named 'A - Violent charges'; if yours has a different name, change the 'start_col' variable to match it exactly, including capitalisation. The code cell below counts the number of 'Contains' entries for each charge category.
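Because the slice depends on the column name and column order being exactly right, a quick check before counting may save some confusion. This is a sketch on a toy DataFrame; the `toy`, `flag_columns`, and `unexpected` names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "Rank": ["Signaller", "Colonel"],
    "A - Violent charges": ["Contains", "Doesn't Contain"],
    "B - Sex offence charges": ["Doesn't Contain", "Doesn't Contain"],
})

start_col = "A - Violent charges"
assert start_col in toy.columns, f"{start_col!r} not found; check spelling and capitalisation"

# Label-based slice: every column from start_col to the last column, inclusive.
flag_columns = toy.loc[:, start_col:]

# Collect any values other than the two expected flags (blank cells are ignored).
expected = {"Contains", "Doesn't Contain"}
unexpected = {}
for col in flag_columns.columns:
    extras = sorted(set(flag_columns[col].dropna()) - expected)
    if extras:
        unexpected[col] = extras

print(list(flag_columns.columns), unexpected)
```

An empty `unexpected` dictionary suggests that only charge columns were picked up by the slice.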

In [4]:
start_col = 'A - Violent charges'
charges_columns = df.loc[:, start_col:]

# Dictionary to store counts
contains_counts = {}
doesnt_contain_counts = {}

# Loop through each charge column (from start_col onwards) and count the values
for column in charges_columns.columns:
    value_counts = charges_columns[column].value_counts().to_dict()
    contains_counts[column] = value_counts.get('Contains', 0)
    doesnt_contain_counts[column] = value_counts.get("Doesn't Contain", 0)


print("Contains Counts:", contains_counts)

Contains Counts: {'A - Violent charges': 2928, 'B - Sex offence charges': 802, 'C - Theft charges': 747, 'D - Military misdemeanours': 2244, 'E - Identity-based discrimination': 56, 'F - Drugs': 233, 'G - Alcohol': 308, 'H - Child sex offence charge': 147, 'I - Firearms charges': 66, 'J - Damage to private property': 146, 'K - Vehicle Negligence': 45, 'L - Harassment': 202, 'M - Manslaughter': 14, 'N - Perverting Course of Justice': 70, 'O - Resisting Arrest': 37, 'P - Use of Internet': 31}
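The same counts can probably be obtained without an explicit loop. The sketch below uses a small made-up frame (`toy_flags`) rather than the real data, and is offered as an alternative, not as the notebook's own method.

```python
import pandas as pd

# Made-up flag columns for illustration.
toy_flags = pd.DataFrame({
    "A - Violent charges": ["Contains", "Doesn't Contain", "Contains"],
    "F - Drugs": ["Doesn't Contain", "Doesn't Contain", "Contains"],
})

# eq() gives a boolean frame; summing each column counts the True values.
contains_counts_alt = toy_flags.eq("Contains").sum().to_dict()
print(contains_counts_alt)
```

On the real data, `charges_columns.eq('Contains').sum()` should give the same figures as the loop above.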


The code below calculates the percentage of all court martial cases that contain each type of charge. E.g. about 48% of all court martial cases over the whole period contained violent charges.

In [5]:
contains_percentage = {}

for column in charges_columns.columns:
    total_entries = charges_columns[column].size
    value_counts = charges_columns[column].value_counts().to_dict()
    contains_count = value_counts.get('Contains', 0)
    
    # Calculate percentage of "Contains" for each column
    contains_percentage[column] = (contains_count / total_entries) * 100

print("Contains Percentages:", contains_percentage)

Contains Percentages: {'A - Violent charges': 48.229286773183986, 'B - Sex offence charges': 13.210344259594795, 'C - Theft charges': 12.304397957502882, 'D - Military misdemeanours': 36.96260912535003, 'E - Identity-based discrimination': 0.9224180530390381, 'F - Drugs': 3.8379179706802833, 'G - Alcohol': 5.073299291714709, 'H - Child sex offence charge': 2.421347389227475, 'I - Firearms charges': 1.0871355625102948, 'J - Damage to private property': 2.404875638280349, 'K - Vehicle Negligence': 0.7412287926206556, 'L - Harassment': 3.327293691319387, 'M - Manslaughter': 0.23060451325975953, 'N - Perverting Course of Justice': 1.1530225662987976, 'O - Resisting Arrest': 0.6094547850436501, 'P - Use of Internet': 0.510624279360896}
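As a worked check of the first figure: the row total is not printed anywhere in the notebook, so the 6071 used below is inferred from the printed count and percentage and should be treated as an assumption.

```python
violent_count = 2928  # printed 'Contains' count for 'A - Violent charges'
total_rows = 6071     # inferred from the printed figures, not stated in the notebook

# Same calculation as the loop: count divided by total entries, times 100.
violent_pct = violent_count / total_rows * 100
print(round(violent_pct, 2))
```

Two things are worth keeping in mind when reading these numbers. The loop divides by `.size`, which counts every row, including any with a blank flag. And because a single case can contain several types of charge, the percentages across categories do not sum to 100.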


The below code creates a dictionary for the percentage of court martial cases which contain each of the charges, grouped by year

In [6]:
# Initialize a dictionary to store the percentages
yearly_contains_percentage = {}

# Group by 'Year of Sentencing'
grouped = df.groupby('Year of Sentencing')

for year, group in grouped:
    yearly_contains_percentage[year] = {}
    for column in charges_columns:
        total_entries = group[column].size
        contains_count = group[column].value_counts().get('Contains', 0)
        
        # Calculate and store the percentage
        yearly_contains_percentage[year][column] = (contains_count / total_entries) * 100

print(yearly_contains_percentage)


{2010: {'A - Violent charges': 34.97942386831276, 'B - Sex offence charges': 5.967078189300412, 'C - Theft charges': 11.728395061728394, 'D - Military misdemeanours': 52.05761316872428, 'E - Identity-based discrimination': 0.205761316872428, 'F - Drugs': 1.8518518518518516, 'G - Alcohol': 3.0864197530864197, 'H - Child sex offence charge': 1.8518518518518516, 'I - Firearms charges': 0.411522633744856, 'J - Damage to private property': 2.674897119341564, 'K - Vehicle Negligence': 0.6172839506172839, 'L - Harassment': 2.05761316872428, 'M - Manslaughter': 0.0, 'N - Perverting Course of Justice': 0.823045267489712, 'O - Resisting Arrest': 1.440329218106996, 'P - Use of Internet': 0.0}, 2011: {'A - Violent charges': 33.43701399688958, 'B - Sex offence charges': 5.287713841368585, 'C - Theft charges': 15.085536547433904, 'D - Military misdemeanours': 48.52255054432349, 'E - Identity-based discrimination': 0.7776049766718507, 'F - Drugs': 2.488335925349922, 'G - Alcohol': 3.265940902021773, 
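A more compact way to get the same yearly percentages is to turn the flags into booleans and take the mean within each year. This is a sketch on made-up data; `toy`, `flag_cols`, and `yearly_pct` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Year of Sentencing": [2010, 2010, 2011, 2011],
    "A - Violent charges": ["Contains", "Doesn't Contain", "Contains", "Contains"],
    "F - Drugs": ["Doesn't Contain", "Doesn't Contain", "Contains", "Doesn't Contain"],
})
flag_cols = ["A - Violent charges", "F - Drugs"]

# The mean of a boolean column is the share of True values,
# so multiplying by 100 gives the percentage of 'Contains' per year.
yearly_pct = (
    toy[flag_cols]
    .eq("Contains")
    .groupby(toy["Year of Sentencing"])
    .mean()
    * 100
)
print(yearly_pct)
```

The result is already a DataFrame with one row per year, so it would not need converting from a dictionary afterwards.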

Create a DataFrame from the yearly percentages dictionary, then move the year of sentencing out of the index into its own 'Year of Sentencing' column

In [9]:
yearly_percentage_df = pd.DataFrame.from_dict(yearly_contains_percentage, orient='index')

yearly_percentage_df.reset_index(inplace=True)
yearly_percentage_df.rename(columns={'index': 'Year of Sentencing'}, inplace=True)

yearly_percentage_df


Unnamed: 0,Year of Sentencing,A - Violent charges,B - Sex offence charges,C - Theft charges,D - Military misdemeanours,E - Identity-based discrimination,F - Drugs,G - Alcohol,H - Child sex offence charge,I - Firearms charges,J - Damage to private property,K - Vehicle Negligence,L - Harassment,M - Manslaughter,N - Perverting Course of Justice,O - Resisting Arrest,P - Use of Internet
0,2010,34.979424,5.967078,11.728395,52.057613,0.205761,1.851852,3.08642,1.851852,0.411523,2.674897,0.617284,2.057613,0.0,0.823045,1.440329,0.0
1,2011,33.437014,5.287714,15.085537,48.522551,0.777605,2.488336,3.265941,1.399689,0.622084,2.643857,1.399689,1.55521,0.311042,1.088647,0.622084,0.0
2,2012,45.544554,7.425743,13.366337,39.108911,1.732673,3.465347,3.465347,2.227723,2.475248,1.980198,1.485149,3.465347,0.49505,0.247525,0.49505,0.0
3,2013,48.703704,8.333333,10.37037,38.518519,0.740741,3.518519,3.518519,1.666667,1.666667,2.037037,1.111111,2.962963,0.185185,1.111111,0.555556,1.481481
4,2014,48.44358,9.143969,14.396887,37.743191,0.389105,3.891051,3.501946,2.140078,1.361868,2.33463,0.77821,2.918288,0.0,0.389105,0.77821,0.583658
5,2015,52.10643,9.977827,15.742794,31.263858,0.0,3.104213,6.208426,3.769401,0.886918,2.660754,0.886918,2.217295,0.221729,0.886918,0.665188,0.443459
6,2016,54.816514,12.614679,9.862385,31.880734,0.229358,3.440367,3.899083,2.293578,0.917431,2.293578,0.458716,3.211009,0.0,1.834862,0.688073,0.688073
7,2017,50.0,20.25,9.5,36.0,1.0,4.0,5.0,4.5,1.0,1.25,0.5,3.0,0.75,0.5,0.5,0.75
8,2018,49.033816,13.768116,11.111111,36.47343,0.483092,6.763285,8.454106,2.898551,0.241546,3.623188,0.724638,3.864734,0.483092,2.898551,0.483092,0.241546
9,2019,49.2,11.4,14.2,30.4,1.6,5.8,7.8,1.4,1.6,2.2,0.6,3.8,0.6,3.4,0.6,0.6
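To see what `orient='index'` does with a nested dictionary, here is a small sketch using two categories, with values rounded from the 2010 and 2011 rows above. Naming the index before resetting it is an alternative to renaming the column afterwards; `nested` and `wide` are illustrative names.

```python
import pandas as pd

# Outer keys become rows, inner keys become columns.
nested = {
    2010: {"A - Violent charges": 34.98, "F - Drugs": 1.85},
    2011: {"A - Violent charges": 33.44, "F - Drugs": 2.49},
}

wide = pd.DataFrame.from_dict(nested, orient="index")
wide.index.name = "Year of Sentencing"  # name the index before resetting it
wide = wide.reset_index()               # the year becomes an ordinary column
print(wide)
```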


Remove the 'A - ', 'B - ', etc. prefixes from the charge column names. This is for readability rather than any strict need.

In [10]:
yearly_percentage_df.columns = [col.split('- ')[-1].strip() for col in yearly_percentage_df.columns]

yearly_percentage_df

Unnamed: 0,Year of Sentencing,Violent charges,Sex offence charges,Theft charges,Military misdemeanours,Identity-based discrimination,Drugs,Alcohol,Child sex offence charge,Firearms charges,Damage to private property,Vehicle Negligence,Harassment,Manslaughter,Perverting Course of Justice,Resisting Arrest,Use of Internet
0,2010,34.979424,5.967078,11.728395,52.057613,0.205761,1.851852,3.08642,1.851852,0.411523,2.674897,0.617284,2.057613,0.0,0.823045,1.440329,0.0
1,2011,33.437014,5.287714,15.085537,48.522551,0.777605,2.488336,3.265941,1.399689,0.622084,2.643857,1.399689,1.55521,0.311042,1.088647,0.622084,0.0
2,2012,45.544554,7.425743,13.366337,39.108911,1.732673,3.465347,3.465347,2.227723,2.475248,1.980198,1.485149,3.465347,0.49505,0.247525,0.49505,0.0
3,2013,48.703704,8.333333,10.37037,38.518519,0.740741,3.518519,3.518519,1.666667,1.666667,2.037037,1.111111,2.962963,0.185185,1.111111,0.555556,1.481481
4,2014,48.44358,9.143969,14.396887,37.743191,0.389105,3.891051,3.501946,2.140078,1.361868,2.33463,0.77821,2.918288,0.0,0.389105,0.77821,0.583658
5,2015,52.10643,9.977827,15.742794,31.263858,0.0,3.104213,6.208426,3.769401,0.886918,2.660754,0.886918,2.217295,0.221729,0.886918,0.665188,0.443459
6,2016,54.816514,12.614679,9.862385,31.880734,0.229358,3.440367,3.899083,2.293578,0.917431,2.293578,0.458716,3.211009,0.0,1.834862,0.688073,0.688073
7,2017,50.0,20.25,9.5,36.0,1.0,4.0,5.0,4.5,1.0,1.25,0.5,3.0,0.75,0.5,0.5,0.75
8,2018,49.033816,13.768116,11.111111,36.47343,0.483092,6.763285,8.454106,2.898551,0.241546,3.623188,0.724638,3.864734,0.483092,2.898551,0.483092,0.241546
9,2019,49.2,11.4,14.2,30.4,1.6,5.8,7.8,1.4,1.6,2.2,0.6,3.8,0.6,3.4,0.6,0.6
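Splitting on '- ' works for the current column names, but it keeps only the text after the last hyphen followed by a space. A slightly more defensive variant, sketched below, splits once on ' - '. The `strip_prefix` helper and the 'Q - Drink - driving' column are hypothetical and only there to show the difference.

```python
columns = [
    "Year of Sentencing",
    "A - Violent charges",
    "E - Identity-based discrimination",
    "Q - Drink - driving",  # hypothetical name containing a second ' - '
]

def strip_prefix(name):
    """Drop a leading 'X - ' label; names without one are returned unchanged."""
    return name.split(" - ", 1)[-1].strip()

cleaned = [strip_prefix(c) for c in columns]
print(cleaned)
```

With the original `split('- ')[-1]`, the hypothetical last column would be reduced to 'driving'.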


We are nearly ready to create a CSV that can be fed into a D3 JavaScript visualisation. First we need to "melt" the dataset from wide to long form, with one row per charge category per year. The melted DataFrame lists every year's percentage for one charge before moving on to the next charge. Because each charge is visualised as a distinct category against the others, having the rows grouped charge by charge makes life a lot easier.

In [47]:
yearly_percentage_df = yearly_percentage_df.rename(columns={'Year of Sentencing': 'year'})

melted_df = pd.melt(yearly_percentage_df,
                    id_vars=['year'],
                    var_name='category',
                    value_name='value')

melted_df


Unnamed: 0,year,category,value
0,2010,Violent charges,34.979424
1,2011,Violent charges,33.437014
2,2012,Violent charges,45.544554
3,2013,Violent charges,48.703704
4,2014,Violent charges,48.443580
...,...,...,...
219,2019,Use of Internet,0.600000
220,2020,Use of Internet,0.000000
221,2021,Use of Internet,1.063830
222,2022,Use of Internet,0.940439
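A small sketch of what `pd.melt` does, using two years and two categories with values rounded from the table above. The `wide` and `long_df` names are illustrative.

```python
import pandas as pd

wide = pd.DataFrame({
    "year": [2010, 2011],
    "Violent charges": [34.98, 33.44],
    "Drugs": [1.85, 2.49],
})

# Each non-id column is stacked in turn: all years for the first
# category, then all years for the next.
long_df = pd.melt(wide, id_vars=["year"], var_name="category", value_name="value")
print(long_df)

# One row per (year, category) pair.
assert len(long_df) == len(wide) * (wide.shape[1] - 1)
```

By the same reasoning, if all 14 years from 2010 to 2023 are present, the 16 charge categories should give 224 rows in the real melted DataFrame.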


In [49]:
melted_df.to_csv('yearly_charges_percentages_long_format_cleaned2.csv', index=False)

This exports the DataFrame to a CSV, which is now ready to use in a visualisation.
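Before handing the file to the visualisation, it may be worth reading it back to confirm that the column names and values survived the export. The sketch below uses an in-memory buffer and made-up rows so that it runs without touching the real file; with the real data you would pass the CSV filename to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Made-up long-format rows for illustration.
long_df = pd.DataFrame({
    "year": [2010, 2011],
    "category": ["Violent charges", "Violent charges"],
    "value": [34.98, 33.44],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
long_df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
print(reloaded.shape)
```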