# HW5 - Part 1

VAST challenge: [2011](http://hcil2.cs.umd.edu/newvarepository/VAST%20Challenge%202011/challenges/Grand%20Challenge%202011/)

Notebook Sections (click to navigate):
- [MC1](#Mini-Challenge-1)
- [MC2](#Mini-Challenge-2)
- [MC3](#Mini-Challenge-3)
- [Grand Challenge](#Grand-Challenge)

The setting for this VAST challenge is a major city. There have been increased reports of an illness spreading among residents, sometimes resulting in death. Our goal is to provide an assessment of the situation so that officials can respond appropriately.

Each mini-challenge section below has a description of the machine learning algorithms we plan to use as well as the visualizations we think will be necessary.

In [86]:
import numpy as np
import pandas as pd

from bokeh.plotting import figure, show, output_notebook

output_notebook()

# Mini Challenge 1

[Back to Top](#HW5---Part-1)

This task is to characterize the spread of the epidemic. We need to identify where the outbreak started on the map (1.1) and present a hypothesis on how the infection is being transmitted (1.2).

## Datasets

The data spans April 30, 2011 to May 9, 2011.

| Dataset | Type | Fields | Datapoints |
|:-|:-|:-|-:|
| Microblogs.csv | Text, Date, GPS, Numeric | ID, Created_at, Location, Content | 1023077 |
| Vastopolis_Map.png | Image | Various labels | 1 |
| Population.csv | Text, Numeric | Zone_name, Population_Density, Daytime_Population | 13 |
| Weather.csv | Text, Date, Numeric | Date, Weather, Average_Wind_Speed, Wind_Direction | 25 |

## Mini Challenge 1.1

We use a list of [health-related keywords](https://figshare.com/articles/List_of_Health_Keywords/1084358) to find microblogs that are more likely to be relevant to the infection.

In [None]:
import re
from datetime import datetime

# Load the health-related keywords (one per line) into a set for fast lookup
with open('Datasets/keywords.txt', 'r') as f:
    health_related = {line.strip().lower() for line in f if line.strip()}

# Load the microblog data
microblog_df = pd.read_csv('Datasets/MC_1/Microblogs.csv', encoding='ISO-8859-1')

We rank relevance with a simple score: the number of health-related keywords found in a microblog's content.

In [85]:
# Score microblogs based on how many health-related keywords they contain
def get_health_score(text):
    score = 0
    for word in text.split():
        if word.lower() in health_related:
            score += 1
    return score
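
The scorer above splits on whitespace only, so a keyword followed by punctuation (for example "fever!") is not counted. Below is a minimal sketch of a punctuation-tolerant variant. The helper name `score_text` and the three-word keyword set are assumptions made for illustration; the notebook's real keywords come from `Datasets/keywords.txt`.

```python
import re

# Hypothetical keyword set for illustration only.
example_keywords = {"fever", "cough", "chills"}

def score_text(text, keywords):
    """Count tokens in `text` that appear in `keywords`, ignoring case and punctuation."""
    # Lowercase first, then keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token in keywords)

print(score_text("Fever, chills... and a bad COUGH!", example_keywords))  # prints 3
```

Whether the extra matches are worth the slower tokenization on about a million microblogs would need to be checked on the real data.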

The cell below generates several new columns in the microblog dataframe to store numeric representations of string values.

In [75]:
# Add scores column to existing microblog dataframe
microblog_df['health_score'] = microblog_df.text.apply(get_health_score)

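# Split Location into two float columns.
# NOTE: the values suggest Location is stored as "latitude longitude"
# (about 42.2 and 93.4), so these two names appear to be swapped.
coords = microblog_df['Location'].str.split(' ', n=1, expand=True)
microblog_df['longitude'] = coords[0].astype(float)
microblog_df['latitude'] = coords[1].astype(float)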

# Convert Created_at to datetime objects and Unix epoch seconds.
dates = []
e_dates = []
for t in microblog_df['Created_at']:
    try:
        date = datetime.strptime(t, '%m/%d/%Y %H:%M')
    except ValueError:
        # Fall back to the date alone when the time part does not parse
        date = datetime.strptime(t.split()[0], '%m/%d/%Y')
    # timestamp() is portable, unlike strftime('%s'), and yields a number
    e_dates.append(int(date.timestamp()))
    dates.append(date)

microblog_df['Created_at'] = dates
microblog_df['epoch'] = e_dates

Once we have extracted the microblogs that most likely relate to the infection, we can look at the dates and locations to see when and where the earliest mentions of infection occurred. We'll use __KMeans Clustering__ to group them based on date, location, and health-related score. This should also allow us to outline a rough idea of "ground zero" on the provided map.

In [59]:
from sklearn.cluster import KMeans

X = microblog_df[['epoch','latitude','longitude','health_score']]
kmeans = KMeans(n_clusters=10).fit(X)
kmeans.cluster_centers_

array([[  1.30572444e+09,   9.33770918e+01,   4.22279936e+01,
          2.17269920e-01],
       [  1.30477390e+09,   9.33752503e+01,   4.22300083e+01,
          3.55703826e-02],
       [  1.30442104e+09,   9.33761736e+01,   4.22301872e+01,
          3.68473268e-02],
       [  1.30533609e+09,   9.33793767e+01,   4.22311052e+01,
          3.60504979e-02],
       [  1.30513720e+09,   9.33764007e+01,   4.22300617e+01,
          3.57212915e-02],
       [  1.30553720e+09,   9.33762145e+01,   4.22302262e+01,
          3.64969793e-02],
       [  1.30459738e+09,   9.33756974e+01,   4.22301432e+01,
          3.59499431e-02],
       [  1.30587652e+09,   9.33826000e+01,   4.22265082e+01,
          4.60335382e-01],
       [  1.30423620e+09,   9.33765490e+01,   4.22301115e+01,
          3.54763173e-02],
       [  1.30495086e+09,   9.33773561e+01,   4.22301170e+01,
          3.70049679e-02]])
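
In the centers above, the epoch column is on the order of 1e9 while the coordinates differ only in the second or third decimal place, so Euclidean distance is dominated by time and the clusters are essentially time slices. One common remedy is to standardize each feature before clustering. The sketch below shows this on synthetic data; the column names mirror the notebook, but the values, `n_clusters=2`, and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for microblog_df: two spatial groups, timestamps spread over several days.
n = 200
demo_df = pd.DataFrame({
    'epoch': rng.integers(1_304_000_000, 1_305_000_000, size=2 * n),
    'latitude': np.concatenate([rng.normal(42.20, 0.005, n), rng.normal(42.28, 0.005, n)]),
    'longitude': np.concatenate([rng.normal(-93.50, 0.005, n), rng.normal(-93.25, 0.005, n)]),
    'health_score': rng.integers(0, 3, size=2 * n),
})

features = ['epoch', 'latitude', 'longitude', 'health_score']

# Rescale every column to zero mean and unit variance so no single feature dominates.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(demo_df[features])

kmeans_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Map the centers back to the original units for interpretation.
centers = pd.DataFrame(scaler.inverse_transform(kmeans_scaled.cluster_centers_),
                       columns=features)
print(centers)
```

Standardizing gives every feature equal weight, which may not be what we want either; weighting location more heavily than health score is a design choice to revisit once the real clusters are plotted.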

The next step is to visualize cluster centers by overlaying them onto the map to hopefully identify where and when the outbreak started.

## Mini Challenge 1.2

To determine how the infection is being transmitted, we'll use __Linear Regression__ to look for trends in the data. Specifically, we'll use our timeline of infection-mentioning microblogs to see whether the spread of mentions correlates with the timeline of any other property. For example, if the disease is airborne, mentions may follow wind-direction trends.
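
One way to make this comparison concrete is to track how the centroid of health-related microblogs moves from day to day and fit a straight line to it; the fitted drift could then be compared with the prevailing wind direction in the weather data. The sketch below runs on synthetic data. The helper name `daily_centroid_trend` and the `min_score` threshold are assumptions, and it expects the `Created_at`, `latitude`, `longitude`, and `health_score` columns built earlier.

```python
import numpy as np
import pandas as pd

def daily_centroid_trend(df, min_score=1):
    """Fit a straight line to the daily mean position of health-related posts.

    Returns (lon_slope, lat_slope) in degrees per day.
    """
    flagged = df[df['health_score'] >= min_score]
    daily = (flagged
             .groupby(flagged['Created_at'].dt.floor('D'))[['longitude', 'latitude']]
             .mean())
    # Days elapsed since the first day with a flagged post.
    days = (daily.index - daily.index[0]).days.to_numpy()
    # A degree-1 polyfit is an ordinary least-squares linear regression.
    lon_slope = np.polyfit(days, daily['longitude'].to_numpy(), 1)[0]
    lat_slope = np.polyfit(days, daily['latitude'].to_numpy(), 1)[0]
    return lon_slope, lat_slope

# Synthetic example: flagged posts drift 0.01 degrees east per day for five days.
rng = np.random.default_rng(0)
frames = []
for day in range(5):
    minutes = pd.to_timedelta(rng.integers(0, 24 * 60, size=100), unit='min')
    frames.append(pd.DataFrame({
        'Created_at': pd.Timestamp('2011-05-01') + pd.Timedelta(days=day) + minutes,
        'longitude': rng.normal(-93.40 + 0.01 * day, 0.002, size=100),
        'latitude': rng.normal(42.23, 0.002, size=100),
        'health_score': 1,
    }))
demo_df = pd.concat(frames, ignore_index=True)

lon_slope, lat_slope = daily_centroid_trend(demo_df)
print(f'longitude drift: {lon_slope:.4f} deg/day, latitude drift: {lat_slope:.4f} deg/day')
```

A steady eastward or westward drift that matches the wind would support the airborne hypothesis; no drift, or spread along another feature, would point elsewhere.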

In [None]:
population_df = pd.read_csv('Datasets/MC_1/Population.csv')
weather_df = pd.read_csv('Datasets/MC_1/Weather.csv')

In [20]:
# Load and plot the provided map.
map_img = 'Datasets/MC_1/Vastopolis_Map.png'
x_range = (-93.5673,-93.1923)
y_range = (42.1609,42.3017)
p = figure(x_range=x_range, y_range=y_range)
p.image_url(url=[map_img],
            x=x_range[0],y=y_range[1],
            w=x_range[1]-x_range[0],h=y_range[1]-y_range[0])
show(p)

# Mini Challenge 2

[Back to Top](#HW5---Part-1)

For Mini-Challenge 2, we are tasked with identifying notable events in a shipping company’s security network from three days of security summaries. Each day's data contains firewall logs, IDS logs, or both, and there is also a Nessus scan log. The raw logs need significant preprocessing before analysis: we will most likely combine the firewall and IDS logs (both are timestamped), treat each entry field as a feature, and possibly categorize the display messages.
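
The combining step described above could look roughly like the sketch below. The column names (`timestamp`, `source_ip`, `message`) and all values are hypothetical, since we have not yet inspected the real log fields.

```python
import pandas as pd

# Hypothetical log excerpts; the real field names and values may differ.
firewall_df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2011-05-01 08:00:01', '2011-05-01 08:00:05']),
    'source_ip': ['192.168.1.10', '10.0.0.7'],
    'message': ['Built', 'Deny'],
})
ids_df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2011-05-01 08:00:03']),
    'source_ip': ['10.0.0.7'],
    'message': ['Portscan'],
})

# Tag each row with its origin, stack the two logs, and order them by time.
combined = (pd.concat([firewall_df.assign(log='firewall'), ids_df.assign(log='ids')],
                      ignore_index=True)
            .sort_values('timestamp')
            .reset_index(drop=True))

# One-hot encode the categorical columns so each message type becomes a numeric feature.
features = pd.get_dummies(combined[['log', 'message']])
print(combined)
print(features.columns.tolist())
```

One-hot encoding keeps the messages usable by distance-based algorithms, at the cost of one column per distinct message.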

## Datasets

The datasets for this challenge span 4/30/2011 - 5/9/2011 and are:

| Type | Fields | Datapoints |
|------|--------|------------|
| File describing computer network architecture | N/A | N/A |
| Security policy rules | N/A | N/A |
| Firewall log | N/A | N/A |
| IDS log | N/A | N/A |
| Syslogs for all hosts on network | N/A | N/A |
| Nessus Network Vulnerability Scan Report | N/A | N/A |

## Mini Challenge 2.1

Since we are looking for significant events across all the data sets, we can treat the problem as outlier detection and apply a single algorithm to every data set we examine. We plan to use __KMeans Clustering__ with an adjustable number of clusters (k). Iterating the algorithm over a range of k, with a minimum cluster size of 1, should allow us to pinpoint significant events: standard procedures should fall in large numbers within their own clusters, while outlier clusters should contain data that deviates from standard practice.
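
A minimal sketch of the idea on synthetic data: run KMeans for several values of k and report the rows that land in very small clusters. The helper name `small_cluster_members`, the 5% size threshold, and the two-dimensional toy features are assumptions for illustration, not properties of the real logs.

```python
import numpy as np
from sklearn.cluster import KMeans

def small_cluster_members(X, k, max_fraction=0.05, random_state=0):
    """Return indices of rows in clusters that hold at most `max_fraction` of the data."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    counts = np.bincount(labels, minlength=k)
    small = np.flatnonzero(counts <= max_fraction * len(X))
    return np.flatnonzero(np.isin(labels, small))

# Synthetic example: 300 routine events in two dense groups plus 3 unusual ones.
rng = np.random.default_rng(0)
routine = np.vstack([rng.normal(0, 0.5, size=(150, 2)),
                     rng.normal(10, 0.5, size=(150, 2))])
unusual = rng.normal([5, 30], 0.2, size=(3, 2))
X = np.vstack([routine, unusual])

# Rows that keep showing up in small clusters as k grows are outlier candidates.
for k in range(2, 5):
    print(k, small_cluster_members(X, k))
```

With too few clusters the unusual rows are absorbed into a large cluster, which is why the sketch sweeps k rather than fixing it.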

## Mini Challenge 2.2

For each outlier cluster found by our clustering analysis, we can examine the data points it contains and search for the earliest timestamp.
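
Once rows carry a cluster label, finding the earliest timestamp is a single group-by. The sketch below uses a hypothetical `events` frame; the `cluster` and `is_outlier` columns are assumed outputs of the clustering step.

```python
import pandas as pd

# Hypothetical clustered log entries: `cluster` is a KMeans label and
# `is_outlier` marks rows that belong to a small (outlier) cluster.
events = pd.DataFrame({
    'timestamp': pd.to_datetime(['2011-05-01 09:15', '2011-05-01 08:02',
                                 '2011-05-02 14:30', '2011-05-02 13:45',
                                 '2011-05-01 10:00']),
    'cluster': [3, 3, 7, 7, 0],
    'is_outlier': [True, True, True, True, False],
})

# Earliest entry in each outlier cluster, ordered by when it first appeared.
first_seen = (events[events['is_outlier']]
              .groupby('cluster')['timestamp']
              .min()
              .sort_values())
print(first_seen)
```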

## Mini Challenge 2.3

Presumably a specific network vulnerability was exploited; once we identify it, we can recommend a fix.

# Mini Challenge 3

[Back to Top](#HW5---Part-1)

The task for this challenge is to investigate terrorist activity in the region. We need to identify details of imminent terrorist threats and provide officials with the supporting evidentiary documents.

## Dataset

This dataset spans April 27, 2011 to May 19, 2011 and contains news reports.

| Type | Fields | Datapoints |
|:-|:-|-:|
| Text, Date | Headline, PubDate, Content | 4474 |

## Mini Challenge 3.1

As in Mini Challenge 1.1, we need to narrow a large body of text down to the relevant items. Here we'll use __Latent Dirichlet Allocation__ (LDA) to extract topics from the news stories. LDA estimates, for each topic, how likely each word is to appear in it. Our goal is to support user interaction: officials can change the number of topics and explore the most likely words for each topic. Having them manually select the topic(s) of interest and showing only the relevant documents should greatly reduce the number of documents they need to sort through.

In [60]:
# Load news reports
rows = []
for i in range(1, 15):
    fn = '{0:0>5}.txt'.format(i)
    with open('Datasets/MC_3/{}'.format(fn), 'rb') as f:
        # [:-2] cuts off the trailing line ending (assumed to be \r\n)
        rows.append({
            'Headline': str(f.readline()[:-2], 'utf-8'),
            'PubDate': str(f.readline()[:-2], 'utf-8'),
            'Content': str(f.readline()[:-2], 'ISO-8859-1')
        })

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
news_df = pd.DataFrame(rows, columns=['Headline', 'PubDate', 'Content'])

news_df.head()

| | Headline | PubDate | Content |
|-:|:-|:-|:-|
| 0 | Boatmen's Share Price Jumps On News of Nations... | May 12, 2011 | Boatmen's Bancshares Inc.'s stock price surged... |
| 1 | Suburbia State Court Declares Hasidic School D... | May 09, 2011 | ALBANY, N.Y. -- A controversial Suburbia publi... |
| 2 | Television Notes | May 11, 2011 | Networks don't show much Tipper, but make up f... |
| 3 | Codi Unveils Initiative To Clean Up Environment | May 10, 2011 | KALAMAZOO, Mich. -- President Codi unveiled a ... |
| 4 | Television Espanola in Talks For Grupo Televis... | May 18, 2011 | Eastside -- Two of the world's largest produce... |


Build the LDA model.

In [73]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def get_top_words(model, feature_names, n_top_words):
    top_words = {}
    for topic_idx, topic in enumerate(model.components_):
        top_words[topic_idx] = [feature_names[i]
                                for i in topic.argsort()[:-n_top_words-1:-1]]
    return top_words

K = 50 # number of topics

# Extract the raw text
data_samples = list(news_df.Content)

print('Constructing the term-frequency matrix...')
tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

print("Building the LDA model...")
lda = LatentDirichletAllocation(n_components=K,
                                learning_method='online',
                                max_iter=5)
print('Extracting topics...')
lda.fit(tf)
print('Done.')
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names_out().tolist()
top_words = get_top_words(lda, tf_feature_names, 6)

Constructing the term-frequency matrix...
Building the LDA model...
Extracting topics...
Done.


In [74]:
top_words

{0: ['headed', 'award', '64', 'compliance', 'championships', 'ridicule'],
 1: ['shares', 'investors', 'cents', 'increases', 'active', '25'],
 2: ['mr', 'howard', 'takes', 'approximately', 'statistics', 'cyr'],
 3: ['racket', 'cents', 'lutes', 'shares', 'tennis', 'ms'],
 4: ['unit', 'fleming', 'offer', 'director', 'bobbles', 'mr'],
 5: ['prior', 'jokes', 'spectator', 'flushing', 'lots', 'reach'],
 6: ['workers', 'executive', 'hoopla', 'previewing', 'pricey', 'change'],
 7: ['regulator', 'marketed', 'classics', 'believes', 'billion', 'surged'],
 8: ['district', 'school', 'state', 'children', 'religious', 'suburbia'],
 9: ['fleming', 'jardine', 'management', 'imro', 'said', 'fund'],
 10: ['racket', 'ms', 'lutes', 'tennis', 'says', 'like'],
 11: ['clark', 'tan', 'dial', 'hopkins', 'harnisch', '000'],
 12: ['spain', 'television', 'digital', 'company', 'tv', 'latin'],
 13: ['needs', 'issued', 'jahnke', 'focus', 'players', 'boatmen'],
 14: ['make', 'lutes', 'rackets', 'racket', 'ripstick', 'b

The next step is to build visualizations of the topics. Officials should be able to change the number of topics, generate and skim the top words, pick out the most interesting topics, and "zoom in" on the documents that contain them.
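
The "zoom in" step can be sketched with the document-topic weights that scikit-learn's LDA returns from `fit_transform`. The four-sentence corpus, the helper name `documents_for_topic`, and the 0.5 weight threshold below are assumptions for illustration; in the notebook the input would be `news_df.Content`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus standing in for news_df.Content.
docs = [
    'tennis racket players match tennis court',
    'shares investors stock price shares market',
    'tennis match players racket tournament',
    'stock market investors price shares fund',
]

vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(docs)

lda_demo = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is one document's topic mixture; the weights in a row sum to 1.
doc_topics = lda_demo.fit_transform(counts)

def documents_for_topic(doc_topics, topic_idx, min_weight=0.5):
    """Return indices of documents whose weight on `topic_idx` is at least `min_weight`."""
    return [i for i, weights in enumerate(doc_topics) if weights[topic_idx] >= min_weight]

# List the documents an official would see after selecting each topic.
for topic_idx in range(2):
    print(topic_idx, documents_for_topic(doc_topics, topic_idx))
```

The threshold controls how many documents officials see per topic, so it is a natural candidate for an interactive control alongside the number of topics.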

# Grand Challenge

We expect to find a connection between the location of the epidemic outbreak and the shipping logs, as well as evidence in the news of a terrorist organization's motivations and planning. Until we have analyzed the data sets, it is hard to say more about the Grand Challenge.