## 911 Call Data Analysis Project

In this project, we will conduct an analysis of the 911 call dataset available on Kaggle. The dataset includes the following fields:

- lat: Latitude (string)
- lng: Longitude (string)
- desc: Description of the emergency call (string)
- zip: Zip code (string)
- title: Title of the call (string)
- timeStamp: Timestamp in the format YYYY-MM-DD HH:MM (string)
- twp: Township (string)
- addr: Address (string)
- e: Dummy variable (always set to 1)

To begin, we will import the necessary libraries for data analysis and visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')

plt.rcParams['figure.figsize'] = (6, 4)

In [None]:
#Reading the data
df = pd.read_csv('911.csv')

In [None]:
df.info()

In [None]:
#Checking the head of the dataframe
df.head()
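
The field list describes `zip` as a string, but `read_csv` infers column types from the file contents, so a column of digits is typically loaded as a number. Below is a minimal sketch, using made-up CSV text rather than the real `911.csv`, of how the optional `dtype` argument keeps such a column as text. The rows and the `csv_text`, `default`, and `as_text` names are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up CSV text standing in for 911.csv (hypothetical rows)
csv_text = "zip,twp\n19401,NORRISTOWN\n,LOWER MERION\n08502,EXAMPLE TWP\n"

default = pd.read_csv(io.StringIO(csv_text))                       # types inferred
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'zip': str})   # zip kept as text

print(default['zip'].tolist())   # numeric values, leading zero lost
print(as_text['zip'].tolist())   # text values, leading zero kept
```

Whether this matters depends on how the zip codes are used later; for simple counting, either representation works.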

### Preliminary Analysis

Let's identify the top 5 zip codes with the highest number of 911 calls.

In [None]:
df['zip'].value_counts().head(5)

The top 5 townships by number of calls are as follows:

In [None]:
df['twp'].value_counts().head(5)

With more than 600k entries in the dataset, how many unique call titles are there?

In [None]:
df['title'].nunique()
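
To make the two calls used in this section concrete, here is a small sketch on a toy Series. The titles are invented for illustration and are not rows taken from the dataset.

```python
import pandas as pd

# Made-up titles, for illustration only
titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'Traffic: VEHICLE ACCIDENT -',
    'EMS: BACK PAINS/INJURY',
    'Fire: GAS-ODOR/LEAK',
    'Traffic: VEHICLE ACCIDENT -',
    'EMS: BACK PAINS/INJURY',
])

counts = titles.value_counts()   # frequency of each distinct value, most common first
n_unique = titles.nunique()      # number of distinct non-null values

print(counts.head(2))
print(n_unique)
```

Both methods skip missing values by default, so a column with gaps can report fewer entries than the DataFrame has rows.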

### Feature Engineering for Analysis

To enhance our analysis, we can derive new features from the existing columns in the dataset.

- The title column contains a 'subcategory' or 'reason for call' indicated by the text preceding the colon.

- The timestamp column can be broken down further into Year, Month, and Day of the Week.

We'll begin by creating a 'Reason' feature for each call.

In [None]:
df['Reason'] = df['title'].apply(lambda x: x.split(':')[0])

In [None]:
df.tail()
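
The same extraction can also be written with the vectorized `.str` accessor instead of `apply`. A minimal sketch on a hypothetical three-row frame (the `sample` name and its titles are made up for this example):

```python
import pandas as pd

# Hypothetical titles following the 'Reason: detail' pattern
sample = pd.DataFrame({'title': ['EMS: BACK PAINS/INJURY',
                                 'Fire: GAS-ODOR/LEAK',
                                 'Traffic: VEHICLE ACCIDENT -']})

# Split each title on ':' and keep the first piece
sample['Reason'] = sample['title'].str.split(':').str[0]

print(sample['Reason'].tolist())
```

One practical difference is that the `.str` version passes missing titles through as missing values, whereas the lambda would raise an error on a non-string entry.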

Next, let's identify the most frequent reason for 911 calls in our dataset.

In [None]:
df['Reason'].value_counts()

In [None]:
# Pass the column by name; recent seaborn versions expect keyword arguments here
sns.countplot(x='Reason', data=df)

Let's examine the data type of the timestamp column to better understand how we can work with the time information it contains.

In [None]:
type(df['timeStamp'][0])

Since the timestamp column is currently stored as strings, converting it to a pandas datetime type makes it easy to extract the hour, month, and day of the week, and simplifies the time-based analysis that follows.

In [None]:
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
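
As a quick check of what the conversion does, the sketch below parses two made-up strings in the format listed above, plus one deliberately invalid entry. Using `errors='coerce'` is an optional choice shown here for illustration; it is not something the original analysis relies on.

```python
import pandas as pd

# Made-up values: two valid timestamps and one invalid string
raw = pd.Series(['2015-12-10 17:40', '2015-12-10 18:05', 'not a date'])
print(type(raw[0]))                       # plain Python str before conversion

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, errors='coerce')

print(parsed.dtype)
print(type(parsed[0]))                    # pandas Timestamp after conversion
print(parsed.isna().tolist())
```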

For a single timestamp value, we can read its components as attributes:

In [None]:
time = df['timeStamp'].iloc[0]

print('Hour:',time.hour)
print('Month:',time.month)
print('Day of Week:',time.dayofweek)

Let's generate new features based on the information mentioned above.

In [None]:
# The .dt accessor exposes the same attributes for the whole column at once
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['Day of Week'] = df['timeStamp'].dt.dayofweek

In [None]:
df.head(3)

The Day of Week is stored as an integer (0 for Monday through 6 for Sunday), which is not self-explanatory. We can create a mapping that converts it to a short day name from Mon to Sun.

In [None]:
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}

In [None]:
df['Day of Week'] = df['Day of Week'].map(dmap)

df.tail(3)
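
One side effect of mapping to plain strings is that sorting and grouping then follow alphabetical order (Fri, Mon, Sat, ...). A possible refinement, sketched here on three made-up timestamps, is to store the names as an ordered categorical so that Mon to Sun order is preserved. The `ts` and `day` names are hypothetical.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2015-12-07 08:00',    # a Monday
                               '2015-12-12 23:30',    # a Saturday
                               '2015-12-13 10:15']))  # a Sunday

dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
day = ts.dt.dayofweek.map(dmap)

# Ordered categorical: comparisons and sorting follow Mon..Sun, not A..Z
day = pd.Categorical(day, categories=list(dmap.values()), ordered=True)

print(list(day))
print(day.min(), day.max())
```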

Let's combine the newly created features to examine how call reasons are distributed across the days of the week.

In [None]:
# Create the countplot
sns.countplot(x='Day of Week', hue='Reason', data=df)

# Adjust the legend
plt.legend(bbox_to_anchor=(1.25, 1))

# Show the plot
plt.show()

In [None]:
byMonth = df.groupby(by='Month').count()

In [None]:
# 'e' is already a single column, so no y argument is needed
byMonth['e'].plot.line()
plt.title('Calls per Month')
plt.ylabel('Number of Calls')
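
A note on why the `e` column is plotted: `count()` reports the number of non-null values in each column, so columns with missing data give smaller numbers than the true row count. Since `e` is always set to 1, it has no gaps. The sketch below illustrates the difference on a made-up frame, and shows `size()` as an alternative that counts rows directly.

```python
import numpy as np
import pandas as pd

# Made-up rows; two of the five zip values are missing
sample = pd.DataFrame({
    'Month': [1, 1, 2, 2, 2],
    'zip':   [19401.0, np.nan, 19401.0, 19446.0, np.nan],
    'e':     [1, 1, 1, 1, 1],
})

by_month = sample.groupby('Month').count()   # non-null count per column
calls = sample.groupby('Month').size()       # rows per group, ignores NaN entirely

print(by_month)
print(calls)
```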

Using Seaborn, let's fit a linear model to the number of calls per month to see whether the call volume shows a trend across the year.

In [None]:
byMonth.reset_index(inplace=True)

In [None]:
sns.lmplot(x='Month',y='e',data=byMonth)
plt.ylabel('Number of Calls')
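
The plot shows the fitted line but not its numbers. If we want the slope and the strength of the linear relationship, one option is NumPy, as sketched below. The monthly counts here are invented for the example and are not the values from this dataset.

```python
import numpy as np

months = np.arange(1, 13)
# Hypothetical monthly call counts, invented for this sketch
calls = np.array([130, 118, 112, 114, 113, 119,
                  121, 98, 96, 101, 94, 85])

slope, intercept = np.polyfit(months, calls, deg=1)  # least-squares straight line
r = np.corrcoef(months, calls)[0, 1]                 # Pearson correlation coefficient

print(f'slope = {slope:.2f} calls per month, r = {r:.2f}')
```

With only twelve points, such a fit is best read as a rough indication of direction rather than as evidence of a strong relationship.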

The plot suggests that there are fewer emergency calls during the holiday season.

Next, let's extract the date from the timestamp to analyze the behavior in greater detail.

In [None]:
df['Date'] = df['timeStamp'].dt.date

In [None]:
df.head(2)

Grouping and plotting the data:

In [None]:
# Group by 'Date' and count occurrences of 'e'
grouped_data = df.groupby('Date')['e'].count()

# Plot the line chart
grouped_data.plot.line()

# Optionally remove the legend and tighten the layout
plt.legend().remove()
plt.tight_layout()

# Show the plot
plt.show()
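
Grouping by date only produces rows for dates that actually occur in the data. If days with no calls should appear as zeros, resampling on a datetime index is one alternative. A minimal sketch on three made-up timestamps (the `calls` Series of ones mirrors the role of the `e` column and is an assumption of this example):

```python
import pandas as pd

# Made-up timestamps: two calls on Dec 10, none on Dec 11, one on Dec 12
stamps = pd.to_datetime(['2015-12-10 17:40',
                         '2015-12-10 18:05',
                         '2015-12-12 09:15'])
calls = pd.Series(1, index=stamps)

by_date = calls.groupby(calls.index.date).count()   # only dates present in the data
daily = calls.resample('D').count()                 # every calendar day in the range

print(by_date.tolist())
print(daily.tolist())
```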

We can also check the plot for each reason separately:

In [None]:
df[df['Reason'] == 'Traffic'].groupby('Date').count().plot.line(y='e')
plt.title('Traffic')
plt.legend().remove()
plt.tight_layout()

In [None]:
df[df['Reason'] == 'Fire'].groupby('Date').count().plot.line(y='e')
plt.title('Fire')
plt.legend().remove()
plt.tight_layout()

In [None]:
df[df['Reason'] == 'EMS'].groupby('Date').count().plot.line(y='e')
plt.title('EMS')
plt.legend().remove()
plt.tight_layout()
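
The three cells above repeat the same filter-and-group step. As a possible alternative, the counts for all reasons can be computed in one pass by grouping on both columns and unstacking. The sketch below uses a made-up four-row frame to show the resulting table.

```python
import pandas as pd
from datetime import date

# Made-up rows for illustration
sample = pd.DataFrame({
    'Date':   [date(2015, 12, 10), date(2015, 12, 10),
               date(2015, 12, 10), date(2015, 12, 11)],
    'Reason': ['EMS', 'EMS', 'Fire', 'Traffic'],
})

# One row per date, one column per reason; missing combinations become 0
per_reason = sample.groupby(['Date', 'Reason']).size().unstack(fill_value=0)

print(per_reason)
```

Calling `per_reason.plot(subplots=True)` on such a table would then draw one panel per reason.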

Let's build a table of call counts by day of the week and hour, which we can then visualize as a heatmap.

In [None]:
day_hour = df.pivot_table(values='lat',index='Day of Week',columns='Hour',aggfunc='count')

day_hour
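
Because the day names are strings, the pivot table sorts its rows alphabetically. If calendar order is preferred for the heatmap, the rows can be reindexed, as in this sketch on a made-up frame. Filling empty cells with 0 is a choice made here for illustration.

```python
import pandas as pd

# Made-up rows for illustration
sample = pd.DataFrame({
    'Day of Week': ['Mon', 'Mon', 'Fri', 'Sun', 'Fri', 'Mon'],
    'Hour':        [8, 17, 17, 2, 17, 8],
    'lat':         [40.1, 40.2, 40.3, 40.1, 40.2, 40.3],
})

table = sample.pivot_table(values='lat', index='Day of Week',
                           columns='Hour', aggfunc='count')
alphabetical = list(table.index)          # row order straight from the pivot

# Reorder rows Mon..Sun; days and hours without calls become 0
order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
table = table.reindex(order).fillna(0).astype(int)

print(alphabetical)
print(table)
```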

Now, let's generate a heatmap using this new DataFrame.

In [None]:
sns.heatmap(day_hour)

plt.tight_layout()

It appears that the majority of calls occur towards the end of office hours on weekdays.

This concludes the exploratory analysis project.