**Explain in your own words: How is explanatory data analysis different from exploratory data analysis?**

Explanatory data analysis and exploratory data analysis (the latter is what the abbreviation EDA usually refers to) serve different purposes despite their similar names, and they are applied at different stages of the data analysis process.

**Exploratory Data Analysis**

Purpose: The primary goal of exploratory data analysis is to explore and understand the data. It's about uncovering patterns, spotting anomalies, forming tentative hypotheses, and checking assumptions through summary statistics and graphical representations.

Techniques: Includes plotting histograms, scatter plots, box plots, and more to understand the distribution, variance, and data structure. It often involves calculating summary statistics and using data visualization tools.

When it's used: At the beginning of data analysis workflows. Before any formal modeling or hypothesis testing, exploratory analysis helps to get a sense of the data's structure and content.

Outcome: The outcome of exploratory analysis is a better understanding of the data's characteristics and the formulation of hypotheses for further testing and modeling.
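
As a small illustration of the exploratory techniques described above, the sketch below computes summary statistics for a made-up sample. The column name `value` and all numbers are assumptions invented for this example; they are not taken from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy sample; the values are invented for illustration
sample = pd.DataFrame({"value": [2, 3, 3, 4, 4, 4, 5, 5, 6, 40]})

# Summary statistics: count, mean, std, min, quartiles, max
summary = sample["value"].describe()
print(summary)

# A large gap between the median and the maximum hints at an outlier
# that would be worth inspecting before any formal modeling
median = sample["value"].median()
maximum = sample["value"].max()
print(f"median={median}, max={maximum}")
```

A quick look like this is typical of the exploratory stage: the output is meant for the analyst, not for an audience.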


**Explanatory Data Analysis**

Purpose: The purpose of explanatory data analysis is to explain the findings from the data to a target audience. This involves communicating the results of the data analysis, including the confirmation or rejection of hypotheses.

Techniques: Uses similar data visualization tools as exploratory analysis but focuses on conveying the results effectively to an audience. This might involve creating more polished and interpretable plots, charts, or dashboards that highlight the key findings.

When it's used: After conducting exploratory analysis and any subsequent statistical modeling or hypothesis testing. It's the final step where the insights and results are compiled and presented.

Outcome: The outcome is a clear and concise presentation of findings, often tailored to a specific audience, that explains what was learned from the data and how it answers the research questions or business objectives.

In summary, while exploratory data analysis is about discovering what the data can tell us, explanatory data analysis is about communicating those findings effectively to others. Exploratory analysis is a more internal, investigator-driven process, while explanatory analysis is external-facing, focusing on storytelling and insight communication.

## Part 2: Interactive visualizations with Bokeh ##

In [68]:
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
output_notebook()

In [69]:
# prepare some data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

# create a new plot with a title and axis labels
p = figure(title="Simple line example", x_axis_label="x", y_axis_label="y")

# add a line renderer with legend and line thickness
p.line(x, y, legend_label="Temp.", line_width=2)

# show the results
show(p)

In [70]:
from bokeh.layouts import layout
from bokeh.models import Div, RangeSlider, Spinner
from bokeh.plotting import figure, show

# prepare some data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [4, 5, 5, 7, 2, 6, 4, 9, 1, 3]

# create plot with circle glyphs
p = figure(x_range=(1, 9), width=500, height=250)
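# scatter() draws circle markers by default; circle() with a size argument is deprecated in recent Bokeh releases
points = p.scatter(x=x, y=y, size=30, fill_color="#21a7df")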

# set up textarea (div)
div = Div(
    text="""
          <p>Select the circle's size using this control element:</p>
          """,
    width=200,
    height=30,
)

# set up spinner
spinner = Spinner(
    title="Circle size",
    low=0,
    high=60,
    step=5,
    value=points.glyph.size,
    width=200,
)
spinner.js_link("value", points.glyph, "size")

# set up RangeSlider
range_slider = RangeSlider(
    title="Adjust x-axis range",
    start=0,
    end=10,
    step=1,
    value=(p.x_range.start, p.x_range.end),
)
range_slider.js_link("value", p.x_range, "start", attr_selector=0)
range_slider.js_link("value", p.x_range, "end", attr_selector=1)

# create layout (stored under a new name so the imported layout() function is not shadowed)
page_layout = layout(
    [
        [div, spinner],
        [range_slider],
        [p],
    ],
)

# show result
show(page_layout)

In [71]:
from bokeh.palettes import Spectral5
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg import autompg as df
from bokeh.transform import factor_cmap

df.cyl = df.cyl.astype(str)
group = df.groupby('cyl')

cyl_cmap = factor_cmap('cyl', palette=Spectral5, factors=sorted(df.cyl.unique()))

p = figure(height=350, x_range=group, title="MPG by # Cylinders",
           toolbar_location=None, tools="")

p.vbar(x='cyl', top='mpg_mean', width=1, source=group,
       line_color=cyl_cmap, fill_color=cyl_cmap)

p.y_range.start = 0
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Number of cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None

show(p)

In [72]:
from bokeh.models import FactorRange
from bokeh.palettes import TolPRGn4
from bokeh.plotting import figure, show

quarters = ("Q1", "Q2", "Q3", "Q4")

months = (
    ("Q1", "jan"), ("Q1", "feb"), ("Q1", "mar"),
    ("Q2", "apr"), ("Q2", "may"), ("Q2", "jun"),
    ("Q3", "jul"), ("Q3", "aug"), ("Q3", "sep"),
    ("Q4", "oct"), ("Q4", "nov"), ("Q4", "dec"),
)

fill_color, line_color = TolPRGn4[2:]

p = figure(x_range=FactorRange(*months), height=500, tools="",
           background_fill_color="#fafafa", toolbar_location=None)

monthly = [10, 13, 16, 9, 10, 8, 12, 13, 14, 14, 12, 16]
p.vbar(x=months, top=monthly, width=0.8,
       fill_color=fill_color, fill_alpha=0.8, line_color=line_color, line_width=1.2)

quarterly = [13, 9, 13, 14]
p.line(x=quarters, y=quarterly, color=line_color, line_width=3)
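# scatter() draws circle markers by default; circle() with a size argument is deprecated in recent Bokeh releases
p.scatter(x=quarters, y=quarterly, size=10,
          line_color=line_color, fill_color="white", line_width=3)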

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xgrid.grid_line_color = None

show(p)

In [73]:
import urllib.request
import numpy as np
import pandas as pd
np.random.seed(42)

In [74]:
df = pd.read_csv('data/Police_Department_Incident_Reports__Historical_2003_to_May_2018_20240127.csv')
df.head(5)

Unnamed: 0,PdId,IncidntNum,Incident Code,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,...,Fix It Zones as of 2017-11-06 2 2,DELETE - HSOC Zones 2 2,Fix It Zones as of 2018-02-07 2 2,"CBD, BID and GBD Boundaries as of 2017 2 2","Areas of Vulnerability, 2016 2 2",Central Market/Tenderloin Boundary 2 2,Central Market/Tenderloin Boundary Polygon - Updated 2 2,HSOC Zones as of 2018-06-05 2 2,OWED Public Spaces 2 2,Neighborhoods 2
0,4133422003074,41334220,3074,ROBBERY,"ROBBERY, BODILY FORCE",Monday,11/22/2004,17:50,INGLESIDE,NONE,...,,,,,,,,,,
1,5118535807021,51185358,7021,VEHICLE THEFT,STOLEN AUTOMOBILE,Tuesday,10/18/2005,20:00,PARK,NONE,...,,,,,,,,,,
2,4018830907021,40188309,7021,VEHICLE THEFT,STOLEN AUTOMOBILE,Sunday,02/15/2004,02:00,SOUTHERN,NONE,...,,,,,,,,,,
3,11014543126030,110145431,26030,ARSON,ARSON,Friday,02/18/2011,05:27,INGLESIDE,NONE,...,,,,,1.0,,,,,94.0
4,10108108004134,101081080,4134,ASSAULT,BATTERY,Sunday,11/21/2010,17:00,SOUTHERN,NONE,...,,,,,2.0,,,,,32.0


In [75]:
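# Parse dates with an explicit month/day/year format (matches values such as 11/22/2004)
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
start_date = '2010-01-01'
end_date = '2017-12-31'
# Keep incidents from 2010 through 2017 (both bounds inclusive)
df_filtered = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]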
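
The date filter above can be tried out on a toy frame. The sketch below is only an illustration under assumed data: the frame `toy` and its four dates are invented. It uses `Series.between`, which is inclusive on both ends by default, as an alternative spelling of the `>=` / `<=` mask, and it takes an explicit `.copy()` so that later column assignments act on an independent frame.

```python
import pandas as pd

# Hypothetical toy frame; the dates are invented for illustration
toy = pd.DataFrame({"Date": ["12/31/2009", "01/01/2010", "06/15/2014", "01/01/2018"]})
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y")

# between() is inclusive on both ends by default, like the >= / <= mask
mask = toy["Date"].between("2010-01-01", "2017-12-31")

# .copy() makes the result an independent frame, so adding columns to it is unambiguous
toy_filtered = toy[mask].copy()
toy_filtered["Year"] = toy_filtered["Date"].dt.year
print(toy_filtered)
```

Only the two dates inside the 2010 to 2017 window survive the filter.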

In [76]:
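# A sorted list (rather than a set) keeps the iteration order, and therefore the colors and legend order, stable between runs
focuscrimes = sorted(['WEAPON LAWS', 'PROSTITUTION', 'DRIVING UNDER THE INFLUENCE', 'ROBBERY', 'BURGLARY', 'ASSAULT', 'DRUNKENNESS', 'DRUG/NARCOTIC', 'TRESPASS', 'LARCENY/THEFT', 'VANDALISM', 'VEHICLE THEFT', 'STOLEN PROPERTY', 'DISORDERLY CONDUCT'])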

In [77]:
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
df_filtered = df_filtered.copy()
parsed_time = pd.to_datetime(df_filtered['Time'], format='%H:%M')
df_filtered['Time'] = parsed_time.dt.time
df_filtered['Hour'] = parsed_time.dt.hour
# Group by 'Category' and 'Hour', count the occurrences, and unstack
grouped_counts = df_filtered.groupby(['Category', 'Hour']).size().unstack(fill_value=0)
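
To see what the `groupby(...).size().unstack(fill_value=0)` step produces, here is a minimal sketch on invented data. The five toy rows are assumptions made up for this example; only the column names `Category` and `Hour` follow the notebook.

```python
import pandas as pd

# Hypothetical toy incidents; the rows are invented for illustration
toy = pd.DataFrame({
    "Category": ["ASSAULT", "ASSAULT", "ROBBERY", "ASSAULT", "ROBBERY"],
    "Hour": [0, 0, 0, 1, 2],
})

# size() counts rows per (Category, Hour) pair; unstack() pivots Hour into columns,
# and fill_value=0 replaces pairs that never occur with a count of zero
counts = toy.groupby(["Category", "Hour"]).size().unstack(fill_value=0)
print(counts)
```

The result has one row per category and one column per hour, which is the shape the normalization step expects.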


In [78]:
# Normalize each category's hourly counts so that every row (category) sums to 1
normalized_counts_corrected = grouped_counts.div(grouped_counts.sum(axis=1), axis=0)
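
The row-wise normalization can be checked by hand on a tiny table. The sketch below uses invented counts (the numbers are assumptions for illustration) and shows that dividing each row by its own total turns counts into shares that sum to 1 per category.

```python
import pandas as pd

# Hypothetical counts per category (rows) and hour (columns); numbers are invented
counts = pd.DataFrame({0: [2, 1], 1: [1, 0], 2: [1, 3]}, index=["ASSAULT", "ROBBERY"])

# Total incidents per category: 4 for ASSAULT and 4 for ROBBERY
row_totals = counts.sum(axis=1)

# axis=0 aligns the totals with the row index, so each row is divided by its own total
shares = counts.div(row_totals, axis=0)
print(shares)
```

Normalizing per category makes the hourly profiles of rare and frequent crimes comparable on the same axis.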

In [79]:
normalized_counts_transposed = normalized_counts_corrected.T
normalized_counts_transposed.head(5)

Hour,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,...,"SEX OFFENSES, NON FORCIBLE",STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0.068071,0.055468,0.148026,0.054767,0.040191,0.052282,0.121869,0.035064,0.080276,0.192735,...,0.172414,0.044247,0.043548,0.064145,0.0,0.027969,0.054945,0.035913,0.040232,0.054413
1,0.062684,0.049745,0.003289,0.034483,0.027653,0.038354,0.114539,0.020654,0.077235,0.003706,...,0.0,0.033786,0.029032,0.030768,0.0,0.021191,0.038576,0.024113,0.027015,0.039486
2,0.065622,0.044837,0.003289,0.040568,0.031432,0.032569,0.098656,0.016746,0.07014,0.001483,...,0.0,0.029686,0.027419,0.027061,0.142857,0.025391,0.035994,0.018234,0.021969,0.032891
3,0.067581,0.023267,0.003289,0.020284,0.032765,0.018642,0.047954,0.012489,0.027367,0.005189,...,0.0,0.023325,0.016129,0.018596,0.0,0.021382,0.026022,0.011841,0.017463,0.022737
4,0.060725,0.014025,0.003289,0.018256,0.029379,0.014999,0.01741,0.009279,0.014393,0.005189,...,0.0,0.020356,0.012903,0.012654,0.0,0.015559,0.017797,0.010011,0.013758,0.016662


**Bokeh**

In [80]:
from bokeh.models import ColumnDataSource, FactorRange, LegendItem
from bokeh.palettes import Category20
from bokeh.plotting import figure
from bokeh.models import Legend

In [81]:
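# Hours 0-23 as strings, used as categorical factors on the x-axis
hours = [str(h) for h in range(24)]
print(hours)
# Convert the DataFrame to a Bokeh ColumnDataSource; the integer 'Hour' index is
# cast to strings so its values match the categorical factors above
source = ColumnDataSource(normalized_counts_transposed.rename(index=str))

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23']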


In [82]:
# Create an Empty Figure
p = figure(x_range=FactorRange(factors=hours), title="Crime Incidents by Hour", x_axis_label='Hour of the Day',
           y_axis_label='Share of incidents', width=800)

In [83]:
bar = {}  # Dictionary to store references to the bar renderers
palette = Category20[len(focuscrimes)]  # One color per focus crime
for indx, crime in enumerate(focuscrimes):
    # Every series starts muted; clicking its legend entry toggles it
    bar[crime] = p.vbar(x='Hour', top=crime, source=source, legend_label=crime,
                        muted_alpha=0.2, muted=True, color=palette[indx], width=0.7)
p.legend.click_policy = "mute"
show(p)

In [84]:
p = figure(x_range=FactorRange(factors=hours), title="Crime Incidents by Hour",
           x_axis_label='Hour of the Day', y_axis_label='Share of incidents', width=800)
bar = {}
legend_items = []

for indx, crime in enumerate(focuscrimes):
    color = Category20[len(focuscrimes)][indx]
    renderer = p.vbar(x='Hour', top=crime, source=source, muted_alpha=0.2,
                      muted=True, color=color, width=0.7)
    bar[crime] = renderer
    legend_items.append(LegendItem(label=crime, renderers=[renderer]))

# Create a Legend explicitly and add legend items
legend = Legend(items=legend_items)
legend.click_policy = "mute" 
p.add_layout(legend, 'left')
show(p)

## Part 3: Narrative Dataviz ##

I'm intrigued by "Gapminder Human Development Trends" because of its innovative way of presenting complex global trends in a digestible format. The interactive elements, combined with animated transitions and the ability to explore time-series data, make it a compelling tool for storytelling and data exploration. This approach enhances understanding by letting users engage directly with the data.