<a href="https://colab.research.google.com/github/conceptbin/workshops/blob/main/DAPy02_Grouping_Police_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Grouping
This notebook demonstrates the power of grouping for exploring a dataset.

In this case we look at a download of records of street-level crime from the Metropolitan Police, available at [data.police.uk](https://data.police.uk/).

In [None]:
import pandas as pd
import seaborn as sns

# Load and prepare data
Load police dataset: Street-level crime, Metropolitan Police, August 2023.

In [None]:
# File path
file = r'https://github.com/conceptbin/DA_Notebooks/raw/master/pandas-intro/data/2023-08-metropolitan-street.csv'
# Create dataframe (df)
df = pd.read_csv(file)

Display the top and tail of the dataframe:

In [None]:
df

Show the `info()` for the dataframe:

In [None]:
df.info()

## Add a column
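Create a `Local_Authority` column from the `LSOA name` column, to more easily identify crime data by borough. Each LSOA name is a local authority name followed by a space and a four-character code, so removing the last five characters leaves the local authority name.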

In [None]:
# Create a new column by dropping the trailing space and four-character LSOA code:
df['Local_Authority'] = df['LSOA name'].str.slice(0, -5)
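To see what the slice does, here is a minimal sketch on a few made-up LSOA names (the values are illustrative, not taken from the dataset). It also shows `str.rsplit()` as one possible alternative that would not depend on the code being exactly four characters long.

```python
import pandas as pd

# Hypothetical LSOA names in the "<local authority> <4-character code>" format
lsoa = pd.Series(['Westminster 013B', 'Barking and Dagenham 001A', 'Newham 037E'])

# Drop the last five characters (one space plus the four-character code)
by_slice = lsoa.str.slice(0, -5)

# Alternative: split once on the last space and keep the left-hand part
by_split = lsoa.str.rsplit(pat=' ', n=1).str[0]

print(by_slice.tolist())
print(by_split.tolist())
```

Both approaches give the same result here; the slice is shorter, while the split makes fewer assumptions about the length of the code.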

Select only London boroughs (excluding non-London rows from the dataset).

In [None]:
# List of London boroughs:
LB_list = ['Barking and Dagenham', 'Barnet', 'Bexley', 'Brent', 'Bromley', 'Camden',
           'City of London', 'Croydon', 'Ealing', 'Enfield', 'Greenwich', 'Hackney',
           'Hammersmith and Fulham', 'Haringey', 'Harrow', 'Havering', 'Hillingdon',
           'Hounslow', 'Islington', 'Kensington and Chelsea', 'Kingston upon Thames',
           'Lambeth', 'Lewisham', 'Merton', 'Newham', 'Redbridge', 'Richmond upon Thames',
           'Southwark', 'Sutton', 'Tower Hamlets', 'Waltham Forest', 'Wandsworth',
           'Westminster']
# Filter the dataframe to include only names in the list:
df = df[df['Local_Authority'].isin(LB_list)]
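The filter above relies on a boolean mask. As a small sketch on a made-up dataframe (the rows and the two-name list are illustrative only), `isin()` returns `True` for exact matches and `False` for everything else, including missing values:

```python
import pandas as pd

# Toy data: one non-London authority and one row with no location
toy = pd.DataFrame({
    'Local_Authority': ['Camden', 'Epping Forest', 'Croydon', None],
    'Crime type': ['Burglary', 'Robbery', 'Drugs', 'Other theft'],
})

london = ['Camden', 'Croydon']  # shortened list, for illustration

# True where the value is in the list; missing values give False
mask = toy['Local_Authority'].isin(london)
filtered = toy[mask]

print(mask.tolist())
print(len(filtered))
```

So any rows in the real data without an LSOA name would also be dropped by this step.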

The dataframe is now slightly shorter, because rows from outside the London boroughs have been removed.

In [None]:
# Print the length of the dataframe (number of rows)
len(df)

# Group and aggregate
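This code does two things: we group the rows by local authority using `groupby()`, and then we count the incidents in each group by passing `'count'` to the `agg()` method. Note that `'count'` counts the non-missing values of `Crime type` in each group. Finally, `reset_index()` turns the result back into an ordinary dataframe.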

In [None]:
df_grouped = df.groupby(['Local_Authority'])['Crime type'].agg('count').reset_index()
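The following sketch, on a made-up dataframe, illustrates the difference between counting non-missing values with `agg('count')` and counting rows with `size()`. The column name `Incidents` is an assumption introduced here for illustration.

```python
import pandas as pd

# Toy data with one missing crime type
toy = pd.DataFrame({
    'Local_Authority': ['Camden', 'Camden', 'Croydon', 'Croydon', 'Croydon'],
    'Crime type': ['Burglary', 'Drugs', 'Burglary', None, 'Robbery'],
})

# 'count' ignores missing values in the selected column
counted = toy.groupby('Local_Authority')['Crime type'].agg('count').reset_index()

# size() counts every row in the group, missing values included
sized = toy.groupby('Local_Authority').size().reset_index(name='Incidents')

print(counted)
print(sized)
```

If the counted column has no missing values, the two approaches should agree.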

In [None]:
# Show the new grouped dataframe
df_grouped

Sort by count of "Crime type" in descending order.

In [None]:
df_grouped.sort_values(by=['Crime type'], ascending=False, inplace=True)
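As an alternative to `inplace=True`, `sort_values()` can return a new dataframe that is assigned to a name. This is a sketch on made-up counts; the name `ranked` is hypothetical.

```python
import pandas as pd

# Toy grouped result with made-up counts
toy = pd.DataFrame({
    'Local_Authority': ['Camden', 'Croydon', 'Newham'],
    'Crime type': [10, 30, 20],
})

# Sort into a new dataframe and renumber the rows from 0
ranked = toy.sort_values(by='Crime type', ascending=False).reset_index(drop=True)

print(ranked)
```

The original `toy` dataframe keeps its order, which can make it easier to re-run cells in any sequence.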

Bar plot comparing total count of crimes by borough.

In [None]:
sns.catplot(data=df_grouped, kind='bar', y='Local_Authority', x='Crime type', color='blue')

Compare two boroughs by crime types.

In [None]:
# Group by local authority and crime type, counting unique Crime IDs so that duplicated records are not counted twice
df_group2 = df.groupby(['Local_Authority', 'Crime type'])['Crime ID'].agg('nunique').reset_index()
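One thing to keep in mind with `'nunique'` is that it skips missing values as well as duplicates. The sketch below uses made-up rows, and it assumes that some records might have no Crime ID; whether that applies to a given crime type in the real data is worth checking.

```python
import pandas as pd

# Toy data: one duplicated ID and one row with no ID at all
toy = pd.DataFrame({
    'Local_Authority': ['Newham'] * 4,
    'Crime type': ['Burglary', 'Burglary', 'Burglary', 'Anti-social behaviour'],
    'Crime ID': ['a1', 'a1', 'b2', None],
})

keys = ['Local_Authority', 'Crime type']

# Unique, non-missing IDs per group
unique_ids = toy.groupby(keys)['Crime ID'].agg('nunique').reset_index()

# Plain row counts per group, for comparison
rows = toy.groupby(keys).size().reset_index(name='Rows')

print(unique_ids)
print(rows)
```

A group whose rows all lack an ID shows a unique count of 0 even though it contains records, so comparing against the row counts is a quick sanity check.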

In [None]:
df_group2

Select rows by name from the `Local_Authority` column using `str.contains()`. The pattern is treated as a regular expression, so `|` means "or".

In [None]:
# Make a separate dataframe for plotting, containing only selected Local Authority keywords:
selected_LA = df_group2[df_group2['Local_Authority'].str.contains('Westminster|Newham')]
# Plot the selected_LA dataframe:
sns.catplot(data=selected_LA, kind='bar', y='Local_Authority', x='Crime ID', hue='Crime type')
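Because `str.contains()` matches substrings, a keyword can select more names than intended. A small sketch on a few borough names shows the difference from `isin()`, which requires an exact match:

```python
import pandas as pd

names = pd.Series(['Westminster', 'Newham', 'Kingston upon Thames', 'Richmond upon Thames'])

# Substring (regular expression) match: 'Thames' hits two boroughs
substring = names[names.str.contains('Thames|Newham')]

# Exact match against a list of full names
exact = names[names.isin(['Westminster', 'Newham'])]

print(substring.tolist())
print(exact.tolist())
```

For full borough names such as Westminster and Newham, both methods select the same rows; `str.contains()` is more useful when searching by keyword.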

Refine the selection, focusing on particular crime types:

In [None]:
# Refine selection by crime type
selected_LA = selected_LA[selected_LA['Crime type'].str.contains('theft|robbery|shoplifting', case=False)]
# Plot the selected_LA dataframe:
sns.catplot(data=selected_LA, kind='bar', y='Local_Authority', x='Crime ID', hue='Crime type')
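The `case=False` argument matters here because crime type labels may start with a capital letter. This sketch uses a hand-typed sample of labels (assumed to resemble those in the dataset) to compare case-sensitive and case-insensitive matching:

```python
import pandas as pd

# Hand-typed sample of crime type labels, for illustration
crime_types = pd.Series([
    'Bicycle theft', 'Burglary', 'Other theft', 'Robbery',
    'Shoplifting', 'Theft from the person', 'Vehicle crime',
])

pattern = 'theft|robbery|shoplifting'

# Default matching is case-sensitive, so 'Robbery' and 'Shoplifting' are missed
sensitive = crime_types[crime_types.str.contains(pattern)]

# With case=False, capitalised labels match as well
insensitive = crime_types[crime_types.str.contains(pattern, case=False)]

print(sensitive.tolist())
print(insensitive.tolist())
```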