# Session 1 - Data Analysis with Python

## Exploratory Data Analysis
### Business/Domain Understanding
* Emergency Responder calls from https://data.cincinnati-oh.gov/

#### Start with why
* We're looking at response times in different areas to determine which factors might impact response time. We don't have a hypothesis yet. That's to be developed during this EDA process.

#### Then what
* Patterns may indicate time periods or areas where additional EMTs are needed.
* We'll explore the data with basic statistical and visualization methods to determine whether any patterns exist.

In [5]:
#import libraries
import pandas as pd
import numpy as np
import scipy.stats as stats

In [4]:
#import the csv into a new pandas dataframe object
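
A minimal sketch of the import step. The real file name is not given here, so an in-memory CSV stands in for it; every column name except `INCIDENT_TYPE_DESC` and all values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV file (columns and values are assumptions)
sample_csv = io.StringIO(
    "INCIDENT_TYPE_DESC,NEIGHBORHOOD,LATITUDE,LONGITUDE\n"
    "MEDICAL,CUF,39.1060,-84.5099\n"
    "FIRE,CLIFTON,39.32011,-84.48937\n"
    "MEDICAL,CUF,39.2993,-84.4432\n"
)

# With a file on disk, pass its path to pd.read_csv instead of the buffer
df = pd.read_csv(sample_csv)
print(df.shape)
```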

### Data Understanding
#### What data do we have? What type? How complete/clean is it?

In [6]:
#look at a sample of the dataframe

In [7]:
#examine the datatypes of the dataframe
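
The two inspection steps above might look like the following sketch. The small dataframe and its column names (other than `INCIDENT_TYPE_DESC`) are hypothetical stand-ins for the imported data.

```python
import pandas as pd

# Hypothetical sample standing in for the imported dataframe
df = pd.DataFrame({
    "INCIDENT_TYPE_DESC": ["MEDICAL", "FIRE", "MEDICAL"],
    "LATITUDE": [39.1060, 39.32011, 39.2993],
    "DISPATCH_TIME": ["01/05/2018 01:15:00 PM",
                      "01/05/2018 12:03:30 AM",
                      "01/06/2018 09:00:00 AM"],
})

sample = df.head(2)   # first two rows; df.sample(2) would pick random rows
types = df.dtypes     # one data type per column

print(sample)
print(types)          # dates stored as text are not datetimes until converted
```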

## Data Types

### Series
* A series is a one-dimensional structure like an associative array. E.g.

Series

| index | value     |
|-------|-----------|
| 0     | -84.5099  |
| 1     | -84.48937 |
| 2     | -84.4432  |

* You can assign an index to a series. Pandas uses sequential indexes by default
* Operations on numpy arrays also work on Pandas Series
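
As a short illustration of the points above, this sketch builds the example series, assigns a custom index (the labels `a`, `b`, `c` are arbitrary), and applies NumPy operations to it.

```python
import numpy as np
import pandas as pd

values = [-84.5099, -84.48937, -84.4432]

# Default index: sequential integers starting at 0
longitude = pd.Series(values)

# Custom index: any labels of matching length (these labels are arbitrary)
labeled = pd.Series(values, index=["a", "b", "c"])

# NumPy functions and arithmetic apply element-wise and return a Series
absolute = np.abs(longitude)
shifted = longitude + 1.0

print(labeled["b"])
print(absolute)
```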

### DataFrame
* Extends a one-dimensional series to multiple dimensions. E.g.

| index | longitude | latitude | agency |
|-------|-----------|----------|--------|
| 0     | -84.5099  | 39.1060  | CF     |
| 1     | -84.48937 | 39.32011 | CFD    |
| 2     | -84.4432  | 39.2993  | CF     |

#### DataFrame indexes
* A DataFrame has a row index, which is a sequential integer by default, and a column index, which holds the column names.

In [8]:
#Selecting elements
#dataframe_name['column_name'][row_index]
#use df_name.loc[row_label] to select all values of an observation (with the default index, the label is the row number)
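
A sketch of these selection patterns, using the small example table from the DataFrame section rather than the real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "longitude": [-84.5099, -84.48937, -84.4432],
    "latitude": [39.1060, 39.32011, 39.2993],
    "agency": ["CF", "CFD", "CF"],
})

print(df.index)     # row index
print(df.columns)   # column index

# Column first, then row label
one_value = df["latitude"][1]

# All values of the observation labeled 1
row = df.loc[1]

# Row label and column name in a single .loc call
same_value = df.loc[1, "latitude"]
```

Note that `.loc` selects by label; it matches the row number here only because the default index is sequential.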

In [None]:
#What are the unique values in INCIDENT_TYPE_DESC?
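
One way to answer this, sketched on hypothetical incident types (the values below are made up, not taken from the dataset):

```python
import pandas as pd

# Hypothetical values; the real column has its own categories
df = pd.DataFrame({"INCIDENT_TYPE_DESC": ["MEDICAL", "FIRE", "MEDICAL", None]})

unique_types = df["INCIDENT_TYPE_DESC"].unique()      # includes the missing value
n_types = df["INCIDENT_TYPE_DESC"].nunique()          # ignores missing values
counts = df["INCIDENT_TYPE_DESC"].value_counts()      # frequency of each type

print(unique_types)
print(counts)
```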

In [9]:
#Check for completeness
#Consider which values are missing and how those will impact the analysis
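
A possible completeness check, shown on a hypothetical sample. The column names are assumptions.

```python
import pandas as pd

# Hypothetical sample with some missing entries
df = pd.DataFrame({
    "NEIGHBORHOOD": ["CUF", None, "CLIFTON", "CUF"],
    "ARRIVAL_TIME": ["01/05/2018 01:21:30 PM", None, None,
                     "01/06/2018 09:00:00 AM"],
})

missing_per_column = df.isna().sum()    # count of missing values per column
missing_share = df.isna().mean()        # fraction of missing values per column

print(missing_per_column)
print(missing_share)
```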

In [10]:
#check for duplicate entries
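
A sketch of a duplicate check. `EVENT_NUMBER` is a hypothetical identifier column used only for illustration.

```python
import pandas as pd

# Hypothetical sample in which one row is repeated exactly
df = pd.DataFrame({
    "EVENT_NUMBER": ["A1", "A2", "A2", "A3"],
    "NEIGHBORHOOD": ["CUF", "CLIFTON", "CLIFTON", "CUF"],
})

duplicate_mask = df.duplicated()            # True for each repeat of an earlier row
n_duplicates = int(duplicate_mask.sum())
deduplicated = df.drop_duplicates()         # keeps the first occurrence

print(n_duplicates)
```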

In [11]:
# Filter out only those observations that contain crucial missing values
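
One way to do this is `dropna` with a `subset`, so that gaps in less important columns do not cost us observations. Which columns count as crucial is an assumption here, as are the column names.

```python
import pandas as pd

# Hypothetical sample; column names are assumptions
df = pd.DataFrame({
    "DISPATCH_TIME": ["01/05/2018 01:15:00 PM", "01/05/2018 02:00:00 PM",
                      None, "01/06/2018 09:00:00 AM"],
    "ARRIVAL_TIME": ["01/05/2018 01:21:30 PM", None,
                     "01/05/2018 03:10:00 PM", "01/06/2018 09:07:00 AM"],
    "NEIGHBORHOOD": ["CUF", "CLIFTON", "CUF", None],
})

crucial = ["DISPATCH_TIME", "ARRIVAL_TIME"]

# Rows missing at least one crucial value (useful to inspect before dropping)
incomplete = df[df[crucial].isna().any(axis=1)]

# Rows with all crucial values present; a missing NEIGHBORHOOD is tolerated
complete = df.dropna(subset=crucial)
```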

In [12]:
#What if we want to convert the data type? 
#The date is in 12-hour format with AM/PM: %m/%d/%Y %I:%M:%S %p
#We want it in 24 hour ISO format: %Y-%m-%d %H:%M:%S
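
A sketch of the conversion on two hypothetical timestamps. It assumes the source uses a 12-hour clock with an AM/PM marker, so the hour is parsed with `%I`; when parsing, `%p` only affects the hour if `%I` is used.

```python
import pandas as pd

# Hypothetical raw timestamps in 12-hour format
raw = pd.Series(["01/05/2018 01:15:00 PM", "01/05/2018 12:03:30 AM"])

# Parse the text into datetime values
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

# Render as 24-hour ISO-style text (only needed for display or export)
iso_text = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")

print(iso_text)
```

For analysis, the parsed datetime values are usually more useful than the formatted text, since they support arithmetic and comparison.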

In [13]:
#Derived variables
#We have dispatch and arrival times, but what if we want the time to arrival?

In [14]:
#We can calculate the time to arrival as a Timedelta object, but for statistical analysis, a plain number will be better
## What if we want the average response time, or the distribution of response time? 
## We'll want a consistent unit of measurement.
## Try converting it to minutes:
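
A minimal sketch of the derived variable, assuming the dispatch and arrival times have already been parsed into datetimes. The timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical, already-parsed dispatch and arrival times
dispatch = pd.to_datetime(pd.Series(["2018-01-05 13:15:00", "2018-01-05 00:03:30"]))
arrival = pd.to_datetime(pd.Series(["2018-01-05 13:21:30", "2018-01-05 00:10:30"]))

# Subtracting datetimes gives Timedelta values
time_to_arrival = arrival - dispatch

# Convert to a single numeric unit: minutes
tta_minutes = time_to_arrival.dt.total_seconds() / 60

print(tta_minutes)
```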

### Selecting rows with booleans
What if we only want a subset of the data? We can use booleans to select only matching values and put them into a new dataframe.

In [17]:
#select only observations where the neighborhood is CUF

In [18]:
#use multiple values as boolean filters
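
Boolean selection can be sketched as follows. The column name `NEIGHBORHOOD` and all values other than CUF are assumptions.

```python
import pandas as pd

# Hypothetical sample
df = pd.DataFrame({
    "NEIGHBORHOOD": ["CUF", "CLIFTON", "CUF", "AVONDALE"],
    "INCIDENT_TYPE_DESC": ["MEDICAL", "MEDICAL", "FIRE", "FIRE"],
})

# A comparison yields a boolean Series; indexing with it keeps the True rows
cuf = df[df["NEIGHBORHOOD"] == "CUF"]

# Combine conditions with & (and) or | (or); each condition needs parentheses
cuf_medical = df[(df["NEIGHBORHOOD"] == "CUF") &
                 (df["INCIDENT_TYPE_DESC"] == "MEDICAL")]

# Match any of several values in one column
two_areas = df[df["NEIGHBORHOOD"].isin(["CUF", "CLIFTON"])]
```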

## Data Preparation
* Build the dataset(s) that will be used in our analysis

In [19]:
## add the derived variables to the main dataframe

In [20]:
#check on the datatypes of our dataframe

In [21]:
# use the describe() method to get a quick overview of the data

In [22]:
#what's the average time to arrival?
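
A sketch of the overview and the average, using a hypothetical `TTA_MINUTES` column that contains one implausibly large value and one missing value.

```python
import pandas as pd

# Hypothetical times to arrival in minutes
df = pd.DataFrame({"TTA_MINUTES": [6.5, 7.0, 4.0, 1440.0, None]})

summary = df["TTA_MINUTES"].describe()   # count, mean, std, min, quartiles, max
mean_tta = df["TTA_MINUTES"].mean()      # missing values are skipped
median_tta = df["TTA_MINUTES"].median()

print(summary)
print(mean_tta, median_tta)
```

A mean far above the median, as in this sample, is a hint that a few extreme values are pulling the average up.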

In [23]:
#if anything is off, use a boolean filter to see only the values that might be causing it

In [24]:
# Incorrect data often skews the averages and other stats. Where values are obviously errors, let's drop them.
# We also have many cases of NaN, so we'll remove those as well
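
A sketch of this cleaning step. The plausibility limit of 120 minutes is a hypothetical cutoff chosen for illustration; a real threshold should come from domain knowledge.

```python
import pandas as pd

# Hypothetical times to arrival: one negative, one implausibly long, one missing
df = pd.DataFrame({"TTA_MINUTES": [6.5, 7.0, -3.0, 1440.0, None]})

MAX_PLAUSIBLE_MINUTES = 120  # assumption, not a value from the source

# Remove missing values first, then keep only plausible times
cleaned = df.dropna(subset=["TTA_MINUTES"])
cleaned = cleaned[(cleaned["TTA_MINUTES"] > 0) &
                  (cleaned["TTA_MINUTES"] <= MAX_PLAUSIBLE_MINUTES)]

removed = len(df) - len(cleaned)
print(removed, cleaned["TTA_MINUTES"].mean())
```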

In [25]:
#take a look at the cleaned dataset

In [26]:
# are there other illogical (or even impossible) observations? 

In [27]:
# remove any likely errors, review the number of observations removed, and look at the measures of center again

## Exploratory Visualization

### Looking for patterns and outliers
* Matplotlib is simple and easy to get started with
* Not as adept at presentation and storytelling as Tableau
* Designed to offer a similar experience to MATLAB
* Pyplot is part of matplotlib and we'll be using it to create basic graphics

In [28]:
import matplotlib.pyplot as plt
%matplotlib inline
#matplotlib assumes these are the y values and the corresponding x values are 0,1,2,3
#create a basic plot

In [29]:
#enter x data, y data, then options. One option for a scatter plot is "o"
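
The two basic calls can be sketched with made-up numbers. In a notebook with `%matplotlib inline`, each figure is displayed automatically.

```python
import matplotlib.pyplot as plt

# Only y values are given, so matplotlib uses x = 0, 1, 2, 3
plt.figure()
line_plot = plt.plot([1, 4, 9, 16])

# x values, y values, then a format string; "o" draws circle markers
plt.figure()
point_plot = plt.plot([1, 2, 3, 4], [1, 4, 9, 16], "o")
```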

In [30]:
#plot time to arrival (TTA) as points

In [31]:
#plot time to close (TTC) as points

## Observations & Questions
* If there is an obvious pattern around a specific time, is it a natural occurrence or system-generated?
* Are there any obvious outliers that skew our means?
* Is there a threshold above which we can declare an item an outlier? 

In [32]:
#plot TTA as a histogram
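
A histogram sketch on hypothetical TTA values in minutes. One extreme value is included to show how it stretches the axis.

```python
import matplotlib.pyplot as plt

# Hypothetical times to arrival in minutes
tta_minutes = [4.0, 5.0, 6.0, 6.5, 7.0, 8.0, 9.0, 240.0]

plt.figure()
# plt.hist returns the count per bin, the bin edges, and the drawn bars
counts, bin_edges, patches = plt.hist(tta_minutes, bins=10)
plt.xlabel("Time to arrival (minutes)")
plt.ylabel("Number of calls")
```

With a single extreme value, nearly all observations land in the first bin, which makes the shape of the typical response times hard to read.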

In [33]:
#remove outliers
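
One common convention for an outlier threshold is the 1.5 × IQR rule. It is used here as an illustrative choice on hypothetical values, not as the method this analysis necessarily applies.

```python
import pandas as pd

# Hypothetical times to arrival in minutes
tta = pd.Series([4.0, 5.0, 6.0, 6.5, 7.0, 8.0, 9.0, 240.0])

q1 = tta.quantile(0.25)
q3 = tta.quantile(0.75)
iqr = q3 - q1

# Values above this fence are treated as high outliers
upper_fence = q3 + 1.5 * iqr

tta_clean = tta[tta <= upper_fence]
print(len(tta) - len(tta_clean), "observation(s) removed")
```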

In [34]:
#plot cleaned TTA values as points

In [35]:
# what about outliers with unrealistically low values?

## Examining Outliers
* Most of these fall under a few specific incident types, with many fire station walk-ins and detail/special assignments.
* We'd likely need a subject matter expert to help determine which of these contain errors

## Finding Relationships

* We've seen some patterns, but which variables, if any, are related?
* Correlation and covariance are two ways to measure relationships between ratio and interval variables
* Next we'll look at how visualizations can help us spot patterns and relationships in categorical variables
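
As a sketch of the correlation and covariance step, the following uses hypothetical time-to-arrival (TTA) and time-to-close (TTC) values in minutes.

```python
import pandas as pd

# Hypothetical numeric variables in minutes
df = pd.DataFrame({
    "TTA": [4.0, 6.0, 8.0, 10.0],
    "TTC": [30.0, 45.0, 40.0, 70.0],
})

corr = df.corr()   # Pearson correlation matrix, values between -1 and 1
cov = df.cov()     # sample covariance matrix, in the product of the variables' units

print(corr)
print(cov)
```

Correlation is unit-free and therefore easier to compare across pairs of variables, while covariance depends on the scale of each variable.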

## Summing Up
* Data has been imported, transformed, and cleaned enough to begin a deeper exploration.
* Basic plots show major patterns and outliers, often indicating where cleaning efforts should be focused
* Further exploration will be needed to find actionable information