# HW6 - Part 2

Task: Explore algorithms using interactive Bokeh visualizations.

Notebook Sections (click to navigate):
- [MC1](#Mini-Challenge-1)
  - [TFIDF](#TFIDF)
- [MC2](#Mini-Challenge-2)
- [MC3](#Mini-Challenge-3)
  - [Topic Modeling](#Topic-Modeling)
- [Grand Challenge](#Grand-Challenge)

### Introduction

The setting for this VAST challenge is a major city. There have been increased reports of an illness spreading among residents, sometimes resulting in death. Our goal is to provide an assessment of the situation so that officials can take the appropriate response.

Each mini-challenge section below has a description of the machine learning algorithms we plan to use as well as the visualizations we think will be necessary.

To run all parts of the notebook, you will need to download the dataset from [this link](https://drive.google.com/drive/folders/0B5fJzDDT_kSnakNYeUhQQjVweWs?usp=sharing). It includes the additional datasets not provided by the VAST challenge. You must save the "Datasets" folder at the same level as this notebook.

In [136]:
from bokeh.io import output_notebook, show
output_notebook()

# Mini Challenge 1

[Back to Top](#HW6---Part-2)

This task is to characterize the spread of the epidemic. We need to identify where the outbreak started on the map (1.1) and present a hypothesis on how the infection is being transmitted (1.2).

## Datasets

The data spans April 30, 2011 to May 20, 2011.

| Type | Fields | Datapoints |
|:-|:-|-:|
| Text, Date, GPS, Numeric | ID, Created_at, Location, Content | 1023077 |
| Image | Various labels | 1 |
| Text, Numeric | Zone_name, Population_Density, Daytime_Population | 13 |
| Text, Date, Numeric | Date, Weather, Average_Wind_Speed, Wind_Direction | 25 | 

In [8]:
import pandas as pd
import datetime as dt

# Load the microblog data
microblog_df = pd.read_csv('Datasets/MC_1/Microblogs.csv', encoding='ISO-8859-1')

First, we need to change the location from a string to useful floats.

In [None]:
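# Convert location to separate float columns.
# NOTE: the names are swapped relative to their contents: 'long' holds the latitude (~42)
# and 'lat' holds the negated longitude (~-93). Later cells rely on these names.
coords = microblog_df['Location'].str.split(' ', n=1, expand=True)
microblog_df['long'] = coords[0].astype(float)
microblog_df['lat'] = coords[1].astype(float)*-1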
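
As a quick check of the conversion, the sketch below parses a single `Location` string with plain Python. The helper name `parse_location` is hypothetical. It assumes, based on the sample rows and the map bounds used later, that the first number is the latitude and the second is a west longitude stored without its sign.

```python
def parse_location(location):
    """Split a 'lat lon' string into (latitude, longitude) floats."""
    lat_text, lon_text = location.split(' ', 1)
    # Assumption: longitudes are west of Greenwich but stored unsigned, so negate them.
    return float(lat_text), -float(lon_text)

print(parse_location('42.23746 93.32407'))  # (42.23746, -93.32407)
```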

Then, the Created_at column needs to be converted into Python datetime objects so we can run comparisons and other date operations on it.

In [None]:
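# Convert Created_at to datetime objects.
dates = []
e_dates = []
for t in microblog_df['Created_at']:
    try:
        date = dt.datetime.strptime(t, '%m/%d/%Y %H:%M')
    except ValueError:
        # Fall back to the date alone when the time part is missing or malformed
        t = t.split()[0]
        date = dt.datetime.strptime(t, '%m/%d/%Y')
    # Seconds since the epoch (local time); strftime('%s') is not portable across platforms
    e_dates.append(str(int(date.timestamp())))
    dates.append(date)

microblog_df['Created_at'] = dates
microblog_df['epoch'] = e_dates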
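
The fallback logic can be isolated in a small helper, which makes it easy to test on individual strings. This is only a sketch: `parse_created_at` is a hypothetical name, and it assumes the two formats handled in the cell above are the only ones present in the data.

```python
import datetime as dt

def parse_created_at(text):
    """Parse 'M/D/YYYY H:MM'; fall back to the date alone if the time is missing or malformed."""
    try:
        return dt.datetime.strptime(text, '%m/%d/%Y %H:%M')
    except ValueError:
        # Keep only the date part, mirroring the fallback in the cell above.
        return dt.datetime.strptime(text.split()[0], '%m/%d/%Y')

print(parse_created_at('5/18/2011 13:23'))
print(parse_created_at('5/18/2011'))
```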

Finally, we use the health-related keywords to score each microblog by the number of keywords it contains.

In [9]:
# Load the health-related keywords (one per line) into a set for fast lookups
with open('Datasets/keywords.txt','r') as f:
    health_related = {line.strip() for line in f if line.strip()}

# Score microblogs based on how many health-related keywords they contain
def get_health_score(text):
    score = 0
    for word in text.split():
        if word.lower() in health_related:
            score += 1
    return score

# Add scores column to existing microblog dataframe
microblog_df['health_score'] = microblog_df.text.apply(get_health_score)
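
One limitation of splitting on whitespace is that a token such as "sick." or "Fever," will not match its keyword. The sketch below shows one possible variant that strips surrounding punctuation before the lookup. The function name and the three keywords are hypothetical and are not taken from `keywords.txt`.

```python
import string

# Hypothetical keywords for illustration; the real list comes from Datasets/keywords.txt.
example_keywords = {'fever', 'sick', 'cough'}

def health_score_stripped(text, keywords=example_keywords):
    """Count keyword hits after lowercasing and stripping surrounding punctuation."""
    score = 0
    for word in text.split():
        if word.lower().strip(string.punctuation) in keywords:
            score += 1
    return score

print(health_score_stripped('Being SICK sucks. Fever, cough... ugh'))  # 3
```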

We keep only the microblogs that contain at least one health-related keyword. Over 90k remain.

In [17]:
sick_df = microblog_df[microblog_df.health_score>0].copy()
sick_df.sample(3)

Unnamed: 0,ID,Created_at,Location,text,long,lat,epoch,health_score
614597,108371,2011-05-03 16:44:00,42.23746 93.32407,Man the sun is starting to kick in at Texans p...,42.23746,-93.32407,1304455440,1
702628,123877,2011-05-20 13:23:00,42.22655 93.32686,being sick sucks. Natalie has caught a case of...,42.22655,-93.32686,1305912180,1
547235,96614,2011-05-19 19:38:00,42.24146 93.26428,had a terrible day and has been a horrible nig...,42.24146,-93.26428,1305848280,2


Now we start building the visualization for the sick-related microblogs by making a plot of the number of health-related microblogs over time.

In [251]:
# Make HOURLY time increments
ONE_HR = dt.timedelta(hours=1)
start = sick_df['Created_at'].min()
end = sick_df['Created_at'].max()
# Count whole hours from total seconds so a trailing partial day is not dropped
num_hrs = int((end - start).total_seconds() // 3600) + 1
times = [start+dt.timedelta(hours=x) for x in range(0,num_hrs)]

# Count the microblogs that fall inside each window [time, time + 1 hour).
sick_y = []
all_y = []
for time in times:
    in_sick = (sick_df.Created_at >= time) & (sick_df.Created_at < time + ONE_HR)
    sick_y.append(int(in_sick.sum()))
    in_all = (microblog_df.Created_at >= time) & (microblog_df.Created_at < time + ONE_HR)
    all_y.append(int(in_all.sum()))
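
Hourly counts can also be computed in a single pass by flooring each timestamp to the start of its hour and counting the results. The sketch below uses only the standard library, and `hourly_counts` is a hypothetical helper name. Note that it bins on clock hours rather than on hours measured from the first timestamp.

```python
import datetime as dt
from collections import Counter

def hourly_counts(timestamps):
    """Return {hour_start: number of timestamps in [hour_start, hour_start + 1h)}."""
    return Counter(t.replace(minute=0, second=0, microsecond=0) for t in timestamps)

stamps = [dt.datetime(2011, 5, 18, 9, 5),
          dt.datetime(2011, 5, 18, 9, 59),
          dt.datetime(2011, 5, 18, 10, 0)]
counts = hourly_counts(stamps)
print(counts[dt.datetime(2011, 5, 18, 9)])   # 2
print(counts[dt.datetime(2011, 5, 18, 10)])  # 1
```

Because the result is a `Counter`, looking up an hour with no posts returns 0.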

In [252]:
from bokeh.plotting import figure
from bokeh.models import HoverTool,DatetimeTickFormatter,ColumnDataSource
from bokeh.io import push_notebook, show
from ipywidgets import interact

def update_time_plot(only_sick_blogs=False):
    if only_sick_blogs:
        s.data_source.data['y'] = sick_y
    else:
        s.data_source.data['y'] = all_y
    push_notebook()

hover = HoverTool(tooltips=[("Date","@time")])
source = ColumnDataSource({'x':times,
                           'y':sick_y,
                           'time':[t.strftime('%B_%d_%Y_%H') for t in times]})

p = figure(title='Health-Related Microblogs over Time',
           x_axis_label='Date',
           y_axis_label='Number of Blogs',
           tools=[hover])
p.xaxis.formatter=DatetimeTickFormatter(
        hours=["%d %B %Y"],
        days=["%d %B %Y"],
        months=["%d %B %Y"],
        years=["%d %B %Y"],)
s = p.line(x="x",y="y",
           source=source,
           alpha=1,
           line_width=2)
h = show(p, notebook_handle=True)
interact(update_time_plot, only_sick_blogs=False)


By showing only the blog posts related to being sick, we can see a sharp increase in people talking about health starting on May 18, 2011. To find out more, we'll piece together the microblogs from the five days leading up to May 18 and calculate TFIDF to see what some possible symptoms might be.

In [292]:
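from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def tfidf_top_words(tfidf,feature_names,n_top_words=20):
    # Names of the n_top_words highest-scoring terms, best first
    top_words = [feature_names[i] for i in tfidf.argsort()[:-n_top_words-1:-1]]
    return top_words

def tfidf_top_probs(tfidf,feature_names,n_top_words=20):
    # Scores of the same terms, in the same order
    top_idx = tfidf.argsort()[:-n_top_words-1:-1]
    return [tfidf[i] for i in top_idx]

# Use a separate name so the nltk stopwords module is not shadowed
stop_words = set(stopwords.words('english'))
# Combine the health-related microblogs (this uses all of sick_df;
# filter on Created_at first to restrict the date window)
sick_microblogs = ' '.join([blog for blog in sick_df.text.values])
sick_microblogs = word_tokenize(sick_microblogs)
# Remove stopwords (compare in lowercase so capitalized stopwords are caught too)
sick_words = [word for word in sick_microblogs if word.lower() not in stop_words]
doc = ' '.join(sick_words)
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform([doc])
# get_feature_names_out() requires scikit-learn >= 1.0 (older releases: get_feature_names())
feature_names = list(tfidf_vectorizer.get_feature_names_out())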

In [308]:
from math import pi

n_top_words = 20
tfidf_tw = tfidf_top_words(tfidf.toarray()[0],feature_names,n_top_words)
tfidf_tp = tfidf_top_probs(tfidf.toarray()[0],feature_names,n_top_words)

# Define the hover tool before the figure so the plot does not reuse the previous cell's tooltips
hover = HoverTool(tooltips=[("Word","@word"),("TFIDF","@y")])

p = figure(title='Health-Related TFIDF (Top Words)',
           x_axis_label='Word',
           x_range=tfidf_tw,
           y_axis_label='TFIDF',
           tools=[hover])

# Label orientation is given in radians
p.xaxis.major_label_orientation = pi/4
x = [0.5 + v for v in range(n_top_words)]
y = tfidf_tp
words = tfidf_tw

source = ColumnDataSource({'x':x,'y':y,'word':words})

s = p.vbar(x='x',top='y',source=source,width=0.5)

def update_tfidf(n_top_words=20):
    tfidf_tw = tfidf_top_words(tfidf.toarray()[0],feature_names,n_top_words)
    tfidf_tp = tfidf_top_probs(tfidf.toarray()[0],feature_names,n_top_words)
    p.x_range.factors = tfidf_tw
    # Assign every column in one step so their lengths always match
    s.data_source.data = {'x': [0.5 + v for v in range(n_top_words)],
                          'y': tfidf_tp,
                          'word': tfidf_tw}
    push_notebook()

show(p,notebook_handle=True)
interact(update_tfidf,n_top_words=(0,40))


## TFIDF

This algorithm looks well suited to helping us solve the VAST problem. As we use it here, on one combined document, it differs from the topic modeling we plan to use in MC3 in that it doesn't track which documents the frequent terms come from, and it doesn't attempt to divide the terms into topics. Those capabilities are necessary for the news articles in MC3, but they aren't needed here, which makes TFIDF a simple fit for figuring out which symptoms are related to the disease.

TFIDF alone isn't enough to help us identify the source of the outbreak. However, using the health-related keywords to find out when people started talking about the disease, we should be able to backtrack and figure out where those people were when they contracted the disease.
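
To make the weighting concrete, here is a small standard-library sketch of TFIDF. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The function name and the toy tokens are hypothetical. It also shows that when everything is combined into one document, the IDF is 1 for every term, so the ranking reduces to raw term counts.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """TF-IDF with smoothed IDF and L2 normalization for a list of token lists."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weights.append({t: v / norm for t, v in raw.items()})
    return weights

# Single combined document: every IDF is 1, so terms are ranked by raw count.
single = tfidf_weights([['fever', 'fever', 'chills', 'fever', 'chills', 'cough']])[0]
print(sorted(single, key=single.get, reverse=True))  # ['fever', 'chills', 'cough']

# Two documents: 'the' appears in both, so it is down-weighted relative to 'fever'.
first = tfidf_weights([['fever', 'the'], ['the', 'cat']])[0]
print(first['fever'] > first['the'])  # True
```

If the IDF term is meant to matter, one option would be to split the microblogs into several documents (for example, one per day) before fitting.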

We can also plot tweets on top of the map to see where they happened.

In [128]:
# Hourly steps covering May 17 and May 18
start = dt.datetime.strptime('May_17_2011_00','%B_%d_%Y_%H')
end = dt.datetime.strptime('May_19_2011_00','%B_%d_%Y_%H')
num_hrs = (end - start).days*24
times = [start+dt.timedelta(hours=x) for x in range(0,num_hrs)]

# Build a source for every hour
sources = {}
ONE_HR = dt.timedelta(hours=1)
for time in times:
    # Keep only the microblogs posted in [time, time + 1 hour)
    in_hour = (microblog_df.Created_at >= time) & (microblog_df.Created_at < time + ONE_HR)
    hour_df = microblog_df[in_hour]
    # Sample at most 1,000 microblogs per hour to keep the visualization responsive
    hour_df = hour_df.sample(min(len(hour_df), 1000))
    key = time.strftime('%B_%d_%Y_%H')
    sources[key] = ColumnDataSource(hour_df)

In [250]:
def update_map(time):
    key = time.strftime('%B_%d_%Y_%H')
    # Swap in all columns at once so their lengths stay consistent
    map_source.data = dict(sources[key].data)
    push_notebook()

hover = HoverTool(tooltips=[("Text","@text")])

# Load and plot the provided map.
map_img = 'Datasets/MC_1/Vastopolis_Map.png'
x_range = (-93.5673,-93.1923)
y_range = (42.1609,42.3017)

p = figure(title='Microblog Map', x_range=x_range, y_range=y_range, tools=[hover])
p.image_url(url=[map_img],
            x=x_range[0],y=y_range[1],
            w=x_range[1]-x_range[0],h=y_range[1]-y_range[0],alpha=0.5)

# Plot from a separate copy so updates do not overwrite the stored hourly sources
map_source = ColumnDataSource(dict(sources['May_17_2011_02'].data))
# The 'lat' column holds the negated longitude (x) and 'long' holds the latitude (y)
m = p.circle(x="lat",y="long",
         source=map_source,
         color='red',
         alpha=0.5)

h = show(p, notebook_handle=True)
interact(update_map,time=times)


# Mini Challenge 2

[Back to Top](#HW6---Part-2)

For Mini-Challenge 2, we are tasked with identifying notable events in a shipping company’s security network based on three days of security summaries. Each day's data includes firewall logs, IDS logs, or both, and there is also a Nessus scan log. Tackling this dataset will require a significant amount of preprocessing. Starting from the raw logs, we will most likely combine the firewall and IDS logs (both are timestamped), treat each entry field as a feature, and possibly categorize the display messages.

## Datasets

The datasets for this challenge span 4/30/2011 - 5/9/2011 and are:

| Type | Fields | Datapoints |
|------|--------|------------|
| File describing computer network architecture | N/A | N/A |
| Security policy rules | N/A | N/A |
| Firewall log | N/A | N/A |
| IDS log | N/A | N/A |
| Syslogs for all hosts on network | N/A | N/A |
| Nessus Network Vulnerability Scan Report | N/A | N/A |

## Mini Challenge 2.1

Since we are looking for significant events across all of the datasets, we need to detect outliers in each one, and we can use a single algorithm for this purpose. Our choice for this approach is __KMeans Clustering__ with an adjustable number of clusters (k). Iterating the algorithm over a range of k values, starting at k = 1, should allow us to pinpoint significant events: standard activity should fall in large numbers into its own clusters, while small outlier clusters should contain data that deviates from standard practice.
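
A minimal, standard-library sketch of this idea is shown below. The feature vectors are made up for illustration and the function names are hypothetical. The deterministic farthest-point initialization is a choice made here to keep the example reproducible. On the real logs, a library implementation such as scikit-learn's `KMeans` would likely be the more practical option.

```python
import math

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def small_cluster_points(points, k, max_size):
    """Indices of points whose cluster has at most max_size members."""
    labels = kmeans(points, k)
    sizes = {lab: labels.count(lab) for lab in set(labels)}
    return [i for i, lab in enumerate(labels) if sizes[lab] <= max_size]

# Hypothetical 2-D features: 30 "routine" entries plus 2 entries far from the rest.
points = [(10 + i % 5, 2 + i % 3) for i in range(30)] + [(500, 90), (510, 95)]
print(small_cluster_points(points, k=2, max_size=3))  # [30, 31]
```

Repeating the call for several values of `k` and watching which points keep landing in small clusters is one way to apply the "iterate over k" step described above.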

## Mini Challenge 2.2

For each outlier detected by our clustering analysis, we can examine data points contained within the cluster and search for the earliest time stamp. 
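
Once log entries carry a cluster label, finding the earliest timestamp per outlier cluster is a single pass over the flagged entries. The sketch below assumes the entries are available as `(cluster label, timestamp)` pairs. The labels, timestamps, and function name are all hypothetical.

```python
import datetime as dt

def earliest_per_cluster(events):
    """Map each cluster label to the earliest timestamp among its events."""
    earliest = {}
    for label, timestamp in events:
        if label not in earliest or timestamp < earliest[label]:
            earliest[label] = timestamp
    return earliest

# Hypothetical (cluster label, timestamp) pairs for log entries flagged as outliers.
flagged = [
    ('cluster_3', dt.datetime(2011, 5, 1, 11, 40)),
    ('cluster_3', dt.datetime(2011, 5, 1, 9, 15)),
    ('cluster_7', dt.datetime(2011, 5, 2, 23, 5)),
]
for label, first_seen in sorted(earliest_per_cluster(flagged).items()):
    print(label, first_seen)
```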

## Mini Challenge 2.3

Presumably a specific network vulnerability was exploited. Once we identify it, we can recommend how to fix it.

# Mini Challenge 3

[Back to Top](#HW6---Part-2)

The task for this challenge is to investigate terrorist activity in the region. We need to identify details of imminent terrorist threats and provide officials with the supporting evidentiary documents.

## Dataset:

This dataset spans April 27, 2011 to May 19, 2011 and contains news reports.

| Type | Fields | Datapoints |
|:-|:-|-:|
| Text, Date | Headline, PubDate, Content | 4474 |

## Mini Challenge 3.1

As in Mini Challenge 1, we need to pull the important terms out of a large body of text. Here we'll use __Latent Dirichlet Allocation__ (LDA) to extract topics from the news stories. This algorithm ranks how likely each word is to belong to each topic. Our goal is to support user interaction: users can change the number of topics and explore the most likely words for each one (something that LDA allows for). We can greatly reduce the number of documents officials need to sort through by having them manually select the topic(s) of interest and providing only the relevant documents.
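
The document-filtering step could look something like the sketch below. It assumes we already have a document-topic weight matrix with one row per document, such as the one scikit-learn's `LatentDirichletAllocation.transform` returns. The weights, the 0.3 default threshold, and the function name are hypothetical.

```python
def docs_for_topics(doc_topic, selected_topics, threshold=0.3):
    """Indices of documents whose combined weight on the selected topics meets the threshold."""
    return [i for i, weights in enumerate(doc_topic)
            if sum(weights[t] for t in selected_topics) >= threshold]

# Hypothetical document-topic weights; each row sums to 1.
doc_topic = [
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.05, 0.35, 0.60],
]
print(docs_for_topics(doc_topic, selected_topics=[1]))                 # [1, 2]
print(docs_for_topics(doc_topic, selected_topics=[0], threshold=0.5))  # [0]
```

A lower threshold returns more documents for officials to review, and a higher one returns only those dominated by the selected topics.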

In [30]:
# Load news reports
# Only files 00001.txt through 00014.txt are read here; widen the range to load more
rows = []
for i in range(1,15):
    fn = '{0:0>5}.txt'.format(i)
    with open('Datasets/MC_3/{}'.format(fn),'rb') as f:
        # rstrip removes the trailing line-ending bytes (\r\n or \n)
        rows.append({
            'Headline':str(f.readline().rstrip(b'\r\n'),'utf-8'),
            'PubDate':str(f.readline().rstrip(b'\r\n'),'utf-8'),
            'Content':str(f.readline().rstrip(b'\r\n'),'ISO-8859-1')
        })

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
news_df = pd.DataFrame(rows, columns=['Headline','PubDate','Content'])
news_df.head()

Unnamed: 0,Headline,PubDate,Content
0,Boatmen's Share Price Jumps On News of Nations...,"May 12, 2011",Boatmen's Bancshares Inc.'s stock price surged...
1,Suburbia State Court Declares Hasidic School D...,"May 09, 2011","ALBANY, N.Y. -- A controversial Suburbia publi..."
2,Television Notes,"May 11, 2011","Networks don't show much Tipper, but make up f..."
3,Codi Unveils Initiative To Clean Up Environment,"May 10, 2011","KALAMAZOO, Mich. -- President Codi unveiled a ..."
4,Television Espanola in Talks For Grupo Televis...,"May 18, 2011",Eastside -- Two of the world's largest produce...


Extract topics from the news documents.

In [203]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Extract the raw text
data_samples = list(news_df.Content)

tf_vectorizer = CountVectorizer(stop_words='english')

def get_top_words(model, feature_names, n_top_words=5):
    top_words = {}
    for topic_idx, topic in enumerate(model.components_):
        top_words[topic_idx] = [feature_names[i]
                                for i in topic.argsort()[:-n_top_words-1:-1]]
    return top_words

def get_top_probs(model, feature_names, n_top_words=5):
    top_probs = {}
    for topic_idx, topic in enumerate(model.components_):
        top_idx = topic.argsort()[:-n_top_words-1:-1]
        top_probs[topic_idx] = [topic[i] for i in top_idx]
    return top_probs

def get_model(K=15,docs=data_samples):
    tf = tf_vectorizer.fit_transform(docs)

    print("Building the LDA model...")
    lda = LatentDirichletAllocation(n_components=K,
                                    learning_method='online',
                                    max_iter=5)
    print('Extracting topics...')
    lda.fit(tf)
    print('Done.')
    # get_feature_names_out() requires scikit-learn >= 1.0 (older releases: get_feature_names())
    return lda,tf,list(tf_vectorizer.get_feature_names_out())

def update(K=15,docs=data_samples):
    # Refit the model and return the top words for each topic
    model,tf,feature_names = get_model(K,docs)
    return get_top_words(model, feature_names)

Visualizations of the topic model.

In [224]:
n_topics = 5
n_top_words = 5
model,tf,fn = get_model(K=n_topics)
tw = get_top_words(model,fn,n_top_words)
tp = get_top_probs(model,fn,n_top_words)

Building the LDA model...
Extracting topics...
Done.


In [225]:
sick_tp = {}
for topic in tp:
    sick_tp[topic] = [0]*n_top_words
    for i in range(n_top_words):
        word = tw[topic][i]
        if word in health_related:
            print(word)
            sick_tp[topic][i] = tp[topic][i]

In [248]:
from math import pi

# Define the hover tool first so the figure does not pick up the map's tooltips
hover = HoverTool(tooltips=[("Word","@word"),("Likelihood","@y")])

p = figure(title='Topic Top Words',
           x_axis_label='Word',
           x_range=tw[0],
           y_axis_label='Likelihood',
           tools=[hover])

# Label orientation is given in radians
p.xaxis.major_label_orientation = pi/4
x = [0.5 + v for v in range(n_top_words)]
y = tp[0]
# Named sick_top so it does not overwrite the sick_y list used by the time plot
sick_top = sick_tp[0]
words = tw[0]

source = ColumnDataSource({'x':x,'y':y,'word':words})
sick_source = ColumnDataSource({'x':x,'y':sick_top,'word':words})

s = p.vbar(x='x',top='y',source=source,width=0.5)
# Overlay orange bars on the words that are health-related
q = p.vbar(x='x',top='y',source=sick_source,width=0.5,fill_color='orange',line_color='orange')

def update_topic(topic):
    # The widget passes labels such as "Topic 3"; extract the topic index
    idx = int(topic.split()[1])
    p.x_range.factors = tw[idx]
    s.data_source.data['y'] = tp[idx]
    s.data_source.data['word'] = tw[idx]
    q.data_source.data['y'] = sick_tp[idx]
    q.data_source.data['word'] = tw[idx]
    push_notebook()

show(p,notebook_handle=True)
interact(update_topic,topic=["Topic {}".format(t) for t in tw.keys()])


## Topic Modeling

Using the graph above, it's easy to see which topics contain health-related words, since those words are highlighted in orange. Unfortunately, this graph is slow to update, so we may need to apply additional reduction techniques to speed it up. Unlike the TFIDF used in MC1, topic modeling will let us see which topics make up each document. That way, once we use the interactive graph to select a topic, we can trace i

# Grand Challenge

[Back to Top](#HW6---Part-2)

We expect to find a connection between the location of the epidemic outbreak and the shipping logs, along with evidence in the news of a terrorist organization’s motivations and planning. Until we have analyzed the datasets, it’s hard to say anything more about the grand challenge.