# HW6 - Part 2

Task: Explore algorithms using interactive Bokeh visualizations.

Notebook Sections (click to navigate):
- [MC1](#Mini-Challenge-1)
- [MC2](#Mini-Challenge-2)
- [MC3](#Mini-Challenge-3)
- [Grand Challenge](#Grand-Challenge)

### Introduction

The setting for this VAST challenge is a major city. Reports of an illness spreading among residents, sometimes resulting in death, have increased. Our goal is to assess the situation so that officials can respond appropriately.

Each mini-challenge section below has a description of the machine learning algorithms we plan to use as well as the visualizations we think will be necessary.

To run all parts of the notebook, you will need to download the dataset from [this link](https://drive.google.com/drive/folders/0B5fJzDDT_kSnakNYeUhQQjVweWs?usp=sharing). It includes the additional datasets not provided by the VAST challenge. You must save the "Datasets" folder at the same level as this notebook.

In [2]:
from bokeh.io import output_notebook, show
output_notebook()

# Mini Challenge 1

[Back to Top](#HW6---Part-2)

This task is to characterize the spread of the epidemic. We need to identify where the outbreak started on the map (1.1) and present a hypothesis on how the infection is being transmitted (1.2).

## Datasets

The data spans April 30, 2011 to May 9, 2011.

| Type | Fields | Datapoints |
|:-|:-|-:|
| Text, Date, GPS, Numeric | ID, Created_at, Location, Content | 1023077 |
| Image | Various labels | 1 |
| Text, Numeric | Zone_name, Population_Density, Daytime_Population | 13 |
| Text, Date, Numeric | Date, Weather, Average_Wind_Speed, Wind_Direction | 25 | 

In [78]:
import pandas as pd
import datetime as dt

# Load the microblog data
microblog_df = pd.read_csv('Datasets/MC_1/Microblogs.csv', encoding='ISO-8859-1')

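# Convert location to separate latitude and longitude float columns.
# Note: the first value in Location is the latitude and the second the longitude,
# so the 'long' and 'lat' column names below are swapped; the plotting code relies on that.
coords = microblog_df['Location'].str.split(' ', n=1, expand=True)
microblog_df['long'] = coords[0].astype(float)
microblog_df['lat'] = coords[1].astype(float)*-1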

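# Convert created_at to datetime objects.
dates = []
e_dates = []
for t in microblog_df['Created_at']:
    try:
        date = dt.datetime.strptime(t, '%m/%d/%Y %H:%M')
    except ValueError:
        # Fall back to the date alone when the time part is missing or malformed
        t = t.split()[0]
        date = dt.datetime.strptime(t, '%m/%d/%Y')
    # Seconds since the epoch (local time); strftime('%s') is not portable across platforms
    e_dates.append(int(date.timestamp()))
    dates.append(date)
    
microblog_df['Created_at'] = dates
microblog_df['epoch'] = e_dates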
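Each `Location` value, such as `42.22921 93.25617`, appears to hold a latitude followed by an unsigned longitude, and the notebook negates the second number so that points fall in the western hemisphere. A minimal standalone sketch of that conversion follows; the helper name `parse_location` is hypothetical and not part of the notebook.

```python
def parse_location(location):
    """Split a 'lat lon' string and return (latitude, longitude).

    Assumes the second number is an unsigned western longitude,
    so it is negated, mirroring the `*-1` in the notebook.
    """
    lat_text, lon_text = location.split(' ', 1)
    return float(lat_text), -float(lon_text)


# Example value taken from the microblog data shown in this notebook
lat, lon = parse_location('42.22921 93.25617')
```

Note that the notebook stores the first value in a column named `long` and the second in `lat`; the helper above returns them under their geographic names instead.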

Build the visualization for microblogs.

This should behave like the year bubbles example. 

Plot all the tweets over the map. Let the user select a time and hover over a point to see its content.

In [79]:
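# Make HOURLY time increments
start = microblog_df['Created_at'].min()
end = microblog_df['Created_at'].max()
# Count hours over the full time span; .days alone drops any partial final day
num_hrs = int((end - start).total_seconds() // 3600) + 1
times = [start+dt.timedelta(hours=x) for x in range(0,num_hrs)]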
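As a rough cross-check on the hourly increments, posts can also be grouped by truncating each timestamp to the start of its hour. The sketch below uses only the standard library. `hour_key` and `bucket_by_hour` are hypothetical helper names, the sample timestamps are made up, and the key format is the `'%B_%d_%Y_%H'` pattern this notebook uses for its per-hour sources.

```python
import datetime as dt
from collections import defaultdict


def hour_key(ts):
    """Label the hour containing `ts` (month name assumes an English locale)."""
    return ts.replace(minute=0, second=0, microsecond=0).strftime('%B_%d_%Y_%H')


def bucket_by_hour(timestamps):
    """Map each hour label to the positions of the timestamps that fall in it."""
    buckets = defaultdict(list)
    for position, ts in enumerate(timestamps):
        buckets[hour_key(ts)].append(position)
    return dict(buckets)


# Made-up sample timestamps, for illustration only
sample = [
    dt.datetime(2011, 4, 30, 0, 37),
    dt.datetime(2011, 4, 30, 0, 57),
    dt.datetime(2011, 4, 30, 1, 5),
]
buckets = bucket_by_hour(sample)
```

Bucketing this way places each post in exactly one hour, which makes it easy to compare per-hour counts against the sources built from `times`.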

In [114]:
from bokeh.models import ColumnDataSource

sources = {}

ONE_HR = dt.timedelta(hours=1)

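for time in times:
    # Keep only posts inside the one-hour window starting at `time`;
    # comparing the difference to ONE_HR alone would also match every earlier post
    in_window = (microblog_df.Created_at >= time) & (microblog_df.Created_at < time + ONE_HR)
    hour_df = microblog_df[in_window]
    key = time.strftime('%B_%d_%Y_%H')
    print(key)
    sources[key] = ColumnDataSource(hour_df)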

April_30_2011_00
April_30_2011_01
April_30_2011_02
April_30_2011_03
April_30_2011_04
April_30_2011_05
April_30_2011_06
April_30_2011_07
April_30_2011_08
April_30_2011_09
April_30_2011_10
April_30_2011_11
April_30_2011_12
April_30_2011_13
April_30_2011_14
April_30_2011_15
April_30_2011_16
April_30_2011_17
April_30_2011_18
April_30_2011_19
April_30_2011_20
April_30_2011_21
April_30_2011_22
April_30_2011_23
May_01_2011_00
May_01_2011_01
May_01_2011_02
May_01_2011_03
May_01_2011_04
May_01_2011_05
May_01_2011_06
May_01_2011_07
May_01_2011_08
May_01_2011_09
May_01_2011_10
May_01_2011_11
May_01_2011_12
May_01_2011_13
May_01_2011_14
May_01_2011_15
May_01_2011_16
May_01_2011_17
May_01_2011_18
May_01_2011_19
May_01_2011_20
May_01_2011_21
May_01_2011_22
May_01_2011_23
May_02_2011_00
May_02_2011_01
May_02_2011_02
May_02_2011_03
May_02_2011_04
May_02_2011_05
May_02_2011_06
May_02_2011_07
May_02_2011_08
May_02_2011_09
May_02_2011_10
May_02_2011_11
May_02_2011_12
May_02_2011_13
May_02_2011_14
May_02_

In [119]:
microblog_df[(microblog_df.Created_at-times[0])<=ONE_HR]

| Unnamed: 0 | ID | Created_at | Location | text | long | lat | epoch |
|-:|-:|:-|:-|:-|-:|-:|-:|
| 227 | 47 | 2011-04-30 00:37:00 | 42.22921 93.25617 | Celebrating Italy With a Wink At Chinese Fare | 42.22921 | -93.25617 | 1304138220 |
| 350 | 66 | 2011-04-30 00:57:00 | 42.2466 93.35123 | #iphone4 is definitely a pretty good hand warmer. | 42.24660 | -93.35123 | 1304139420 |
| 660 | 124 | 2011-04-30 00:08:00 | 42.18056 93.30609 | Good talk by Matthew Fabb - Flash Player 10.1 ... | 42.18056 | -93.30609 | 1304136480 |
| 666 | 124 | 2011-04-30 00:09:00 | 42.21653 93.35762 | POR MAS DIFICIL QUE SE NOS PRESENTE UNA SITUAC... | 42.21653 | -93.35762 | 1304136540 |
| 1376 | 241 | 2011-04-30 00:18:00 | 42.22213 93.37839 | The Chevy Volt 'cause old hippies still rememb... | 42.22213 | -93.37839 | 1304137080 |
| 2127 | 383 | 2011-04-30 00:27:00 | 42.264 93.48903 | 'L' is my everything ') | 42.26400 | -93.48903 | 1304137620 |
| 2499 | 451 | 2011-04-30 00:06:00 | 42.24778 93.4359 | Mau nangis njir denger lagu 'indonesia menangi' | 42.24778 | -93.43590 | 1304136360 |
| 2504 | 451 | 2011-04-30 00:06:00 | 42.25574 93.233 | Lama lama aneh jg ya perasaan indonesia gempa... | 42.25574 | -93.23300 | 1304136360 |
| 4388 | 794 | 2011-04-30 00:54:00 | 42.23334 93.33166 | I wanna study.. :-( But where is my mind.. | 42.23334 | -93.33166 | 1304139240 |
| 6036 | 1099 | 2011-04-30 00:26:00 | 42.23894 93.35043 | Interessant pleidooi tussen de regels door voo... | 42.23894 | -93.35043 | 1304137560 |
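One caveat about the check above: the comparison `(Created_at - time) <= ONE_HR` is also satisfied by every post made before `time`, because those differences are negative. For `times[0]` this makes no difference, since nothing precedes the first timestamp. The sketch below contrasts the one-sided comparison with an explicit one-hour window; it uses made-up timestamps and hypothetical helper names (`one_sided`, `windowed`).

```python
import datetime as dt

ONE_HR = dt.timedelta(hours=1)


def one_sided(posts, start):
    """Posts whose offset from `start` is at most one hour (includes earlier posts)."""
    return [p for p in posts if (p - start) <= ONE_HR]


def windowed(posts, start):
    """Posts inside the half-open window [start, start + 1 hour)."""
    return [p for p in posts if start <= p < start + ONE_HR]


# Made-up timestamps: one before, one inside, and one after the 01:00 window
posts = [
    dt.datetime(2011, 4, 30, 0, 37),
    dt.datetime(2011, 4, 30, 1, 15),
    dt.datetime(2011, 4, 30, 2, 5),
]
start = dt.datetime(2011, 4, 30, 1, 0)

cumulative = one_sided(posts, start)
hourly = windowed(posts, start)
```

Here `cumulative` keeps the 00:37 post as well as the 01:15 one, while `hourly` keeps only the 01:15 post. Which behavior is wanted depends on whether the map should show posts up to a given hour or only those within it.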


Build the plot.

In [127]:
from bokeh.plotting import figure
from bokeh.models import HoverTool
from bokeh.io import push_notebook, show

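def update_map(time):
    key = time.strftime('%B_%d_%Y_%H')
    # Update the renderer's data in place; replacing data_source itself
    # is discouraged in Bokeh
    m.data_source.data = dict(sources[key].data)
    push_notebook(handle=h)

hover = HoverTool(tooltips=[("Text","@text")])
    
# Load and plot the provided map.
map_img = 'Datasets/MC_1/Vastopolis_Map.png'
x_range = (-93.5673,-93.1923)
y_range = (42.1609,42.3017)

p = figure(title='Microblog Map', x_range=x_range, y_range=y_range, tools=[hover])
p.image_url(url=[map_img],
            x=x_range[0],y=y_range[1],
            w=x_range[1]-x_range[0],h=y_range[1]-y_range[0],alpha=0.5)

# Give the renderer its own source so updates do not overwrite an entry in `sources`
plot_source = ColumnDataSource(data=dict(sources['April_30_2011_11'].data))

# The 'lat' column holds the (negated) longitude values and 'long' the latitudes,
# so x="lat" and y="long" place the points correctly on the map
m = p.circle(x="lat",y="long",
         source=plot_source,
         color='red',
         alpha=0.5)

h = show(p, notebook_handle=True)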

In [128]:
from ipywidgets import interact

interact(update_map, time=times)

<function __main__.update_map>

In [111]:
m.data_source=sources["April_30_2011_03"]
m.glyph.fill_color="white"
push_notebook(handle=h)

In [125]:
dir(m)

['__cached_all__overridden_defaults__',
 '__cached_all__properties__',
 '__cached_all__properties_with_refs__',
 '__class__',
 '__container_props__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__overridden_defaults__',
 '__properties__',
 '__properties_with_refs__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__view_model__',
 '__weakref__',
 '_attach_document',
 '_callbacks',
 '_check_bad_column_name',
 '_check_cdsview_source',
 '_check_missing_glyph',
 '_check_no_source_for_glyph',
 '_clone',
 '_detach_document',
 '_document',
 '_event_callbacks',
 '_id',
 '_overridden_defaults',
 '_property_values',
 '_repr_html_',
 '_repr_pretty',
 '_to_json_like',
 '_trigger_event',
 '_unstable_default_values',
 '_unstable_themed_values',
 '_update_event_callbacks',


## Mini Challenge 3.1



In [None]:
# Load news reports
rows = []

for i in range(1,15):
    fn = '{0:0>5}.txt'.format(i)
    with open('Datasets/MC_3/{}'.format(fn),'rb') as f:
        # [:-2] cuts off the trailing line ending (two bytes, presumably \r\n)
        rows.append({
            'Headline':str(f.readline()[:-2],'utf-8'),
            'PubDate':str(f.readline()[:-2],'utf-8'),
            'Content':str(f.readline()[:-2],'ISO-8859-1')
        })

# Build the frame once; DataFrame.append was removed in pandas 2.0
news_df = pd.DataFrame(rows, columns=['Headline','PubDate','Content'])
news_df.head()

Extract topics from the news documents.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Extract the raw text
data_samples = list(news_df.Content)

print('Constructing the term-frequency matrix...')
tf_vectorizer = CountVectorizer(stop_words='english')

def get_top_words(model, feature_names, n_top_words=5):
    top_words = {}
    for topic_idx, topic in enumerate(model.components_):
        top_words[topic_idx] = [feature_names[i]
                                for i in topic.argsort()[:-n_top_words-1:-1]]
    return top_words

def get_model(K=100,docs=data_samples):
    tf = tf_vectorizer.fit_transform(docs)

    print("Building the LDA model...")
    lda = LatentDirichletAllocation(n_components=K,
                                    learning_method='online',
                                    max_iter=5)
    print('Extracting topics...')
    lda.fit(tf)
    print('Done.')
    return lda,tf

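def update(K=100,docs='All'):
    # 'All' is a placeholder meaning every news report
    doc_list = data_samples if docs == 'All' else docs
    model,tf = get_model(K,doc_list)
    # Feature names come from the fitted vectorizer, not the term-frequency matrix
    # (get_feature_names_out needs scikit-learn >= 1.0; older releases use get_feature_names)
    top_words = get_top_words(model, tf_vectorizer.get_feature_names_out())
    return top_words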
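The slice `topic.argsort()[:-n_top_words-1:-1]` in `get_top_words` picks the indices of the `n_top_words` largest weights, largest first. A plain-Python sketch of the same selection is shown below. The vocabulary and weights are made up for illustration, and `top_words_for_topic` is a hypothetical helper, not part of the notebook or of scikit-learn.

```python
def top_words_for_topic(weights, vocabulary, n_top_words=5):
    """Return the words with the largest weights, in descending order of weight."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n_top_words]]


# Made-up vocabulary and topic weights, for illustration only
vocabulary = ['flu', 'river', 'truck', 'fever', 'bridge']
weights = [0.9, 0.2, 0.1, 0.7, 0.4]

top = top_words_for_topic(weights, vocabulary, n_top_words=3)
```

With distinct weights this matches the NumPy slice; when weights tie, the two approaches may order the tied words differently.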

Visualizations of the topic model.