# Emergency - 911 Calls

Montgomery County, PA
Data Science Bootcamp Project

Background:

   Montgomery County is the third-most populous county in the Commonwealth of Pennsylvania and is locally referred to as Montco. The county is very diverse, ranging from farms and open land in Upper Hanover to densely populated rowhouse streets in Cheltenham.

   911 is the emergency telephone number used in the United States. The National 911 Program, created by Congress in 2004 as the 911 Implementation and Coordination Office (ICO), is housed within the National Highway Traffic Safety Administration at the U.S. Department of Transportation and is a joint program with the National Telecommunications and Information Administration in the Department of Commerce.

Data Description: 
 This dataset contains emergency calls from Montgomery County, PA.
 It includes 663,522 call records from 2015 to 2020 and 9 features.
    Link : https://www.kaggle.com/mchirico/montcoalert

Feature Columns:
    lat: String variable, Latitude
    lng: String variable, Longitude
    desc: String variable, Description of the Emergency Call
    zip: String variable, ZIP Code
    title: String variable, Title of Emergency
    timeStamp: String variable, Date and time of the call, YYYY-MM-DD HH:MM:SS
    twp: String variable, Township
    addr: String variable, General Address
    e: String variable, Dummy variable, Index column (always 1)

## Imports

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv('C:/Users/BUDUR/Desktop/Data scinsce Bootcamp/project/archive/911.csv')

In [None]:
df.info()

In [None]:
print(df.columns.values)

[Latitude, Longitude, Description, ZIP Code, Title of Emergency, Date and time, Township, General Address, Dummy variable]

In [None]:
print('Rows     :',df.shape[0])
print('Columns  :',df.shape[1])

Dropping column e

In [None]:
df = df.drop('e',axis=1)

In [None]:
print(df.columns.values)

Missing values

In [None]:
print('Missing values:',df.isnull().values.sum())
df.isnull().sum()

In [None]:
df['zip'].isnull().sum()/df.shape[0]

In [None]:
df['twp'].isnull().sum()/df.shape[0]

About 12% of the values in the zip column are NaN.
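
The share of missing values can also be computed for every column in one step. The snippet below is a minimal sketch on a toy frame; the name `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 911 data; values are illustrative only.
toy = pd.DataFrame({
    'zip': [19401.0, np.nan, 19464.0, np.nan],
    'twp': ['NORRISTOWN', 'POTTSTOWN', None, 'ABINGTON'],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values.
missing_share = toy.isnull().mean()
print(missing_share)
```

On the real data, the same expression would be `df.isnull().mean()`.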

In [None]:
df[df['twp'].isnull()]

Records with no township are mostly dead ends. Let's skip them.
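
The notebook does not show the skipping step itself. One possible way to do it, sketched on a toy frame (the name `toy_clean` and the sample rows are assumptions for illustration), is to drop rows whose township is missing:

```python
import pandas as pd

# Toy rows; the second one has no township.
toy = pd.DataFrame({
    'twp': ['NORRISTOWN', None, 'ABINGTON'],
    'title': ['EMS: FALL VICTIM', 'Fire: FIRE ALARM', 'Traffic: VEHICLE ACCIDENT -'],
})

# Keep only rows where the township is present, then renumber the index.
toy_clean = toy.dropna(subset=['twp']).reset_index(drop=True)
print(len(toy), '->', len(toy_clean))
```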

In [None]:
# Five most frequent zip codes, as a one-column frame labelled 'Top 5'
df_zip = df['zip'].value_counts().head(5).to_frame('Top 5')
df_zip.style.background_gradient(cmap='Blues')

These are the top 5 zip codes for 911 calls

In [None]:
df['title'].nunique()

There are 148 unique emergency titles.

### Reason feature

In the titles column there are "Reasons/Departments" specified before the title code. These are EMS, Fire, and Traffic.

In [None]:
df['reason'] = df['title'].apply(lambda title: title.split(':')[0])

### Title_code feature
Using the same method from above, we are going to create a column with just the title code.

In [None]:
# strip() removes the space that follows the colon in the raw title
df['title_code'] = df['title'].apply(lambda title: title.split(':')[1].strip())
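
To make the split concrete, here is a small sketch on three sample titles. The sample strings are illustrative, and the vectorized `str.split` is an alternative to the `apply` calls above, not the method used in this notebook.

```python
import pandas as pd

titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'Traffic: VEHICLE ACCIDENT -',
    'Fire: GAS-ODOR/LEAK',
])

# Split once on the first colon: part 0 is the reason, part 1 the title code.
parts = titles.str.split(':', n=1, expand=True)
reason = parts[0].str.strip()
title_code = parts[1].str.strip()
print(pd.DataFrame({'reason': reason, 'title_code': title_code}))
```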

# Exploratory Data Analysis (EDA)

What is the most common Reason for a 911 call?


In [None]:
df['reason'].value_counts()

  - The number one reason for 911 calls is Emergency Medical Services (EMS).
  - Almost half of all calls are for EMS.

In [None]:
fig, axes = plt.subplots(1,2, figsize=(15, 5))

sns.countplot(x='reason', data=df, order=df['reason'].value_counts().index, ax=axes[0])
axes[0].set_title('Common Reasons for 911 Calls', size=15)
axes[0].set(xlabel='Reason', ylabel='Count')


df['reason'].value_counts().plot.pie(ax=axes[1])



sns.despine(bottom=False, left=True)

The bar chart shows the top 10 emergency call titles across all categories.

In [None]:
fig, axes = plt.subplots(figsize=(10, 5))
# Restrict the plot to the 10 most frequent titles
top_titles = df['title'].value_counts().index[:10]
sns.countplot(y='title', data=df, order=top_titles, ax=axes)
sns.despine(bottom=False, left=True)
axes.set_title('Overall 911 Emergency Calls', size=15)
axes.set(xlabel='Number of 911 Calls', ylabel='')
plt.tight_layout()

   - Vehicle accidents are the number one reason people call 911.
   - Disabled vehicle and fire alarm are in second and third place.

### Traffic 911 Emergency Calls

- The most common emergency titles are vehicle accident, disabled vehicle and road obstruction.

In [None]:
df[df['reason']=='Traffic'].groupby('title_code').count()['lat'].sort_values(ascending=True).plot(kind='barh', figsize=(10, 5))
plt.xlabel('Number of 911 Calls')
plt.ylabel('')
plt.title('Traffic 911 Emergency Calls', fontsize=15)

### Fire 911 Emergency Calls

- The most common emergency titles are fire alarm, vehicle accident and fire investigation.

In [None]:
df[df['reason']=='Fire'].groupby('title_code').count()['lat'].sort_values(ascending=True).tail(10).plot(kind='barh', figsize=(10, 5))
plt.xlabel('Number of 911 Calls')
plt.ylabel('')
plt.title('Fire 911 Emergency Calls', fontsize=15)

### EMS 911 Emergency Calls

- The most common emergency titles are fall victim, respiratory emergency and cardiac emergency.

In [None]:
df[df['reason']=='EMS'].groupby('title_code').count()['lat'].sort_values(ascending=True).tail(10).plot(kind='barh', figsize=(10, 5))
plt.xlabel('Number of 911 Calls')
plt.ylabel('')
plt.title('EMS 911 Emergency Calls', fontsize=15)

## Feature Engineering


Convert the timeStamp column from strings to datetime objects, then create 3 new columns called Hour, Month, and Day of Week.

In [None]:
df['timeStamp'] = pd.to_datetime(df['timeStamp'])

# The .dt accessor reads datetime fields for the whole column at once
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['Day of Week'] = df['timeStamp'].dt.dayofweek
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}

df['Day of Week'] = df['Day of Week'].map(dmap)
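
As a quick check of what these columns contain, the sketch below runs the same conversion on two sample timestamps. The timestamps and the name `features` are made up for illustration.

```python
import pandas as pd

# Two sample strings in the YYYY-MM-DD HH:MM:SS format of the timeStamp column.
stamps = pd.to_datetime(pd.Series(['2015-12-10 17:10:52', '2016-01-03 08:05:00']))

dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
features = pd.DataFrame({
    'Hour': stamps.dt.hour,
    'Month': stamps.dt.month,
    'Day of Week': stamps.dt.dayofweek.map(dmap),  # 0 is Monday in pandas
})
print(features)
```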

### Weekly and monthly calls

- Friday appears to be the day with the most calls during the week.
- For the monthly calls, the first half of the year appears to have more calls.

In [None]:
fig, axes = plt.subplots(1,2, figsize=(15,5))

sns.countplot(x='Day of Week', data=df, palette='viridis', ax=axes[0])
axes[0].set_title('Weekly Calls', size=15)

sns.countplot(x='Month', data=df, hue='reason', palette='viridis', ax=axes[1])
axes[1].set_title('Monthly Calls', size=15)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)

sns.despine(bottom=False, left=True)

### Date feature

Create a new column called 'Date' that contains the date from the timeStamp column.


In [None]:
df['Date'] = df['timeStamp'].dt.date

Now group by the Date column with the count() aggregate and plot the daily number of 911 calls for each reason.
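
The per-reason daily counts can also be built in one table before plotting. This is a minimal sketch on toy rows; the name `daily` and the sample data are assumptions, and `size()` is used here in place of `count()['lat']` because it counts rows directly.

```python
import pandas as pd

toy = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-01-01', '2016-01-02']).date,
    'reason': ['EMS', 'EMS', 'Fire', 'Traffic'],
})

# One row per date, one column per reason; missing combinations become 0.
daily = toy.groupby(['Date', 'reason']).size().unstack(fill_value=0)
print(daily)
```

Calling `daily.plot(figsize=(15, 5))` on such a table would draw one line per reason on shared axes.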

#### Traffic

In [None]:
df[df['reason']=='Traffic'].groupby('Date').count()['lat'].plot(figsize=(15,5), color='darkblue')
plt.title('Traffic', fontsize=15)
sns.despine(bottom=False, left=True)
plt.tight_layout()

#### Fire

In [None]:
df[df['reason']=='Fire'].groupby('Date').count()['lat'].plot(figsize=(15,5), color='darkred')
plt.title('Fire', fontsize=15)
sns.despine(bottom=False, left=True)
plt.tight_layout()

#### EMS

In [None]:
df[df['reason']=='EMS'].groupby('Date').count()['lat'].plot(figsize=(15,5), color='darkgreen')
plt.title('EMS', fontsize=15)
sns.despine(bottom=False, left=True)
plt.tight_layout()

### Heatmap

- In the heatmap we can see that there are more calls between 14:00 and 17:00.
- Friday and Wednesday have more calls.
- Calls appear to drop on Sunday.

In [None]:
dayHour = df.groupby(by=['Day of Week', 'Hour']).count()['reason'].unstack()

plt.figure(figsize=(12,6))
sns.heatmap(dayHour, linewidths=0.05)
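
Because the day labels are strings, `groupby` sorts them alphabetically, so the heatmap rows do not follow the calendar week. A possible fix, sketched on toy data (the names `day_order` and `day_hour` are illustrative), is to reindex the table before plotting:

```python
import pandas as pd

toy = pd.DataFrame({
    'Day of Week': ['Mon', 'Mon', 'Fri', 'Sun', 'Fri', 'Fri'],
    'Hour': [14, 14, 17, 9, 17, 9],
    'reason': ['EMS', 'Fire', 'EMS', 'Traffic', 'EMS', 'Fire'],
})

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Count calls per (day, hour), pivot hours into columns, then put days in week order.
day_hour = (
    toy.groupby(['Day of Week', 'Hour']).size()
       .unstack(fill_value=0)
       .reindex(day_order, fill_value=0)
)
print(day_hour)
```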

# Modeling - 911 Call Type Prediction

## Can we predict the reason for the next call?



In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import re 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import tensorflow as tf

ModuleNotFoundError: No module named 'tensorflow'

In [None]:
data = pd.read_csv('C:/Users/BUDUR/Desktop/Data scinsce Bootcamp/project/archive/911.csv', nrows=50000)

In [None]:
data

In [None]:
def get_sequences(texts, vocab_length=10000):
    tokenizer = Tokenizer(num_words=vocab_length)
    tokenizer.fit_on_texts(texts)
    
    sequences = tokenizer.texts_to_sequences(texts)
    
    max_seq_length = np.max([len(sequence) for sequence in sequences])
    
    sequences = pad_sequences(sequences, maxlen=max_seq_length, padding='post')
    
    return sequences
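
To illustrate what `get_sequences` produces, the sketch below mimics the idea without TensorFlow: each word gets an integer id by frequency rank, and shorter sequences are padded with zeros at the end. The helper `simple_sequences` is hypothetical and only an approximation; the Keras `Tokenizer` also filters punctuation, which this version does not.

```python
from collections import Counter
import numpy as np

def simple_sequences(texts):
    """Rough stand-in for tokenizing and post-padding; not the Keras implementation."""
    tokenized = [text.lower().split() for text in texts]
    counts = Counter(word for words in tokenized for word in words)
    # Ids start at 1 so that 0 can be reserved for padding.
    word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}
    max_len = max(len(words) for words in tokenized)
    padded = np.zeros((len(texts), max_len), dtype=int)
    for row, words in enumerate(tokenized):
        padded[row, :len(words)] = [word_index[w] for w in words]
    return padded, word_index

seqs, index = simple_sequences(['MAIN ST & ELM ST', 'ELM ST'])
print(seqs)
```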

In [None]:
def onehot_encode(df, columns, prefixes):
    df = df.copy()
    
    for column, prefix in zip(columns, prefixes):
        dummies = pd.get_dummies(df[column], prefix=prefix)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
        
    return df
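
A short usage example may help here. The function is repeated so the snippet runs on its own, and the toy frame is made up for illustration.

```python
import pandas as pd

def onehot_encode(df, columns, prefixes):
    df = df.copy()
    for column, prefix in zip(columns, prefixes):
        dummies = pd.get_dummies(df[column], prefix=prefix)
        df = pd.concat([df, dummies], axis=1).drop(column, axis=1)
    return df

toy = pd.DataFrame({
    'lat': [40.1, 40.2, 40.3],
    'twp': ['ABINGTON', 'NORRISTOWN', 'ABINGTON'],
})

# The twp column is replaced by one indicator column per township.
encoded = onehot_encode(toy, columns=['twp'], prefixes=['t'])
print(encoded.columns.tolist())
```

Because the function works on a copy, the original `toy` frame still has its `twp` column afterwards.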

In [None]:
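def preprocess_inputs(df):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()
    
    # Create label column and drop the title column
    df['type'] = df['title'].apply(lambda x: re.search(r'^\w+', x).group(0))
    df = df.drop('title', axis=1)
    
    # Drop the constant dummy column e
    df = df.drop('e', axis=1)
    
    # Replace the timeStamp string with numeric features so that X can be scaled
    # (same fields as in the feature engineering step: hour, month, day of week)
    df['timeStamp'] = pd.to_datetime(df['timeStamp'])
    df['hour'] = df['timeStamp'].dt.hour
    df['month'] = df['timeStamp'].dt.month
    df['dayofweek'] = df['timeStamp'].dt.dayofweek
    df = df.drop('timeStamp', axis=1)
    
    # Get sequences for desc and addr columns (and drop original columns)
    vocab_length = 10000
    desc_sequences = get_sequences(df['desc'], vocab_length=vocab_length)
    addr_sequences = get_sequences(df['addr'], vocab_length=vocab_length)
    df = df.drop(['desc', 'addr'], axis=1)
    
    # One-hot encode remaining categorical columns (zip and twp)
    df = onehot_encode(df, columns=['zip', 'twp'], prefixes=['z', 't'])
    
    # Split df into X and y 
    y = df['type'].copy()
    X = df.drop('type', axis=1).copy()
    
    # Map labels to integers
    label_mapping = {'EMS': 0, 'Traffic': 1, 'Fire': 2}
    y = y.map(label_mapping)
    
    # Scale X with a standard scaler
    scaler = StandardScaler()
    X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
    
    return X, desc_sequences, addr_sequences, y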

In [None]:
X, desc_sequences, addr_sequences, y = preprocess_inputs(data)

In [None]:
desc_sequences.shape

In [None]:
addr_sequences.shape

In [None]:
y.value_counts()

In [None]:
X_train, X_test, desc_train, desc_test, addr_train, addr_test, y_train, y_test = \
    train_test_split(X, desc_sequences, addr_sequences, y, train_size=0.7, random_state=123)

## Modeling

In [None]:
desc_train

In [None]:
X_inputs = tf.keras.Input(shape=(X_train.shape[1],))
desc_inputs = tf.keras.Input(shape=(desc_train.shape[1],))
addr_inputs = tf.keras.Input(shape=(addr_train.shape[1],))

# X_inputs
X_dense1 = tf.keras.layers.Dense(128, activation='relu')(X_inputs)
X_dense2 = tf.keras.layers.Dense(128, activation='relu')(X_dense1)

# desc_inputs
desc_embedding = tf.keras.layers.Embedding(
    input_dim=10000,
    output_dim=64,
    input_length=desc_train.shape[1]
)(desc_inputs)
desc_flatten = tf.keras.layers.Flatten()(desc_embedding)

# addr_inputs
addr_embedding = tf.keras.layers.Embedding(
    input_dim=10000,
    output_dim=64,
    input_length=addr_train.shape[1]
)(addr_inputs)
addr_flatten = tf.keras.layers.Flatten()(addr_embedding)

# Concatenate results
concat = tf.keras.layers.concatenate([X_dense2, desc_flatten, addr_flatten])

# Make predictions
outputs = tf.keras.layers.Dense(3, activation='softmax')(concat)


model = tf.keras.Model(inputs=[X_inputs, desc_inputs, addr_inputs], outputs=outputs)

model.summary()  # prints the summary itself; wrapping it in print() would also print None
tf.keras.utils.plot_model(model)

## Training

In [None]:
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)


history = model.fit(
    [X_train, desc_train, addr_train],
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=20,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau()
    ]
)
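
For readers unfamiliar with the loss used above, the following NumPy sketch shows how a softmax output and integer labels combine into a sparse categorical cross-entropy value. The helper functions and the sample numbers are illustrative assumptions, not the Keras internals.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs):
    # Pick the predicted probability of the true class for each row.
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).mean()

# Two made-up samples with 3 classes (0 = EMS, 1 = Traffic, 2 = Fire).
logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.2, 0.2]])
labels = np.array([0, 2])

probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
print(probs.round(3), loss)
```

The labels are plain integers rather than one-hot vectors, which is why the "sparse" variant of the loss fits the `label_mapping` used in preprocessing.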

## Results

In [None]:
results = model.evaluate([X_test, desc_test, addr_test], y_test, verbose=0)

In [None]:
print("Model loss: {:.5f}".format(results[0]))
print("Model accuracy: {:.2f}%".format(results[1] * 100))