# Issue Trend Analysis and Issue Tracking
##### Copyright (C) CS474 Team #4 

----------------------------------------------------

### Imports & Statics

In [1]:
import csv, pickle
import utils
from utils import Corpus, Issue, Extractor, IssueModel, EventModel
import os, sys
import warnings

warnings.filterwarnings('ignore')

years = [2015, 2016, 2017]
num_issues = 50
num_events = 50
num_keywords = 10

----------------------------------------------------

## Part 0: Preprocessing

### Clean articles: lemmatize, remove stopwords (Already Done)
**_Caution!_ This overwrites the `dumps/cleaned.bin` file.** (Involves multiprocessing.)

In [2]:
# utils.clean_articles()

### Detect Entities: IBM Watson NLU (Already Done)
**_Caution!_ This overwrites the `dumps/cleaned_watson.bin` file.** (Involves multiprocessing and Watson API calls.)

In [3]:
# utils.build_watson()

## Part 1: Issue Trend Analysis

#### Initialize Corpuses

In [4]:
corpus = {}
for year in years:
    corpus[year] = Corpus(year=year-2015)

#### Build Corpuses: Load cleaned articles, build phrasers, dictionary, and BOWs

In [5]:
for year in years:
    print("Corpus "+str(year)+":")
    corpus[year].build_corpus()
    print("Corpus "+str(year)+" Done\n")


Corpus 2015:
building corpus...
collecting articles...
building phrasers...
bigram train finished! 7.40 seconds
trigram train finished! 12.16 seconds
building dictionary...
building bag of words...


100%|██████████| 7155/7155 [00:03<00:00, 2038.94it/s]


Corpus 2015 Done

Corpus 2016:
building corpus...
collecting articles...
building phrasers...
bigram train finished! 7.98 seconds
trigram train finished! 12.99 seconds
building dictionary...
building bag of words...


100%|██████████| 7480/7480 [00:03<00:00, 1881.52it/s]


Corpus 2016 Done

Corpus 2017:
building corpus...
collecting articles...
building phrasers...
bigram train finished! 8.76 seconds
trigram train finished! 14.80 seconds
building dictionary...
building bag of words...


100%|██████████| 9117/9117 [00:04<00:00, 2159.72it/s]


Corpus 2017 Done
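
The build step above logs three stages: phrasers, dictionary, and bag of words. The internals of `Corpus.build_corpus()` are not shown in this notebook, so the following is only a minimal plain-Python sketch of the general idea, run on made-up toy articles. The helper names (`find_bigrams`, `merge_bigrams`, `build_dictionary`, `doc2bow`) and the raw-count `min_count` rule are assumptions for illustration; real phrase detectors usually score pairs statistically rather than by raw count.

```python
from collections import Counter

def find_bigrams(docs, min_count=2):
    """Collect adjacent token pairs that occur at least `min_count` times."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    return {pair for pair, n in pair_counts.items() if n >= min_count}

def merge_bigrams(doc, bigrams):
    """Join every adjacent pair found in `bigrams` into one 'a_b' token."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out

def build_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Return one document as a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy, made-up articles (already tokenized and cleaned)
docs = [
    ["white", "house", "statement"],
    ["white", "house", "envoy"],
    ["envoy", "visit"],
]
bigrams = find_bigrams(docs)
phrased = [merge_bigrams(doc, bigrams) for doc in docs]
token2id = build_dictionary(phrased)
bows = [doc2bow(doc, token2id) for doc in phrased]
print(phrased)
print(bows)
```

Joining phrase parts with an underscore mirrors the shape of keywords such as `White_House` that appear in the results later in this notebook.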



#### Extract Keywords from each Article using tf-idf

In [6]:
for year in years:
    print("Corpus "+str(year)+":")
    corpus[year].build_tfidf()
    corpus[year].extractor = Extractor(corpus[year])
    corpus[year].extractor.extract(k=num_keywords)
    print("Corpus "+str(year)+" Done\n")

Corpus 2015:
building tf-idf model...
tfidf finished! 0.07 seconds
extracting keywords...


100%|██████████| 7155/7155 [00:04<00:00, 1682.06it/s]


Corpus 2015 Done

Corpus 2016:
building tf-idf model...
tfidf finished! 0.07 seconds
extracting keywords...


100%|██████████| 7480/7480 [00:04<00:00, 1680.44it/s]


Corpus 2016 Done

Corpus 2017:
building tf-idf model...
tfidf finished! 0.08 seconds
extracting keywords...


100%|██████████| 9117/9117 [00:05<00:00, 1712.40it/s]

Corpus 2017 Done
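
As a rough illustration of tf-idf keyword extraction, the sketch below scores each term in an article by its count times the log of the inverse document frequency and keeps the top `k`. This is one common weighting, not necessarily the one `build_tfidf()` or `Extractor.extract()` applies; the function name `tfidf_keywords` and the toy articles are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(docs, k=10):
    """Return, for each document, its top-k (term, score) pairs by tf-idf."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: c * math.log2(n_docs / df[t]) for t, c in tf.items()}
        # Highest score first; ties are broken alphabetically for determinism
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        keywords.append(ranked[:k])
    return keywords

# Toy, made-up articles
docs = [
    ["missile", "launch", "missile", "korea"],
    ["election", "candidate", "korea"],
    ["trial", "court", "korea"],
]
keywords = tfidf_keywords(docs, k=2)
for doc_keywords in keywords:
    print(", ".join(term for term, _ in doc_keywords))
```

A term that occurs in every article (here `korea`) gets a score of zero, which is why it never surfaces as a keyword in this toy run.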






#### Save Corpuses

In [7]:
utils.save(corpus, filename='corpus_ready.bin')

saving corpus_ready.bin...
corpus_ready.bin saved


#### Load corpus

In [2]:
corpus = utils.load(filename='corpus_ready.bin')

loading corpus_ready.bin...
corpus_ready.bin loaded
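
The save and load checkpoints let later cells resume without rebuilding the corpus. `utils.save` and `utils.load` are not shown here; a minimal sketch of such helpers, assuming they pickle objects into a dump directory (the `directory` parameter and its use of a temporary folder below are assumptions), might look like this:

```python
import os
import pickle
import tempfile

def save(obj, filename, directory):
    """Pickle `obj` to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, filename), "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load(filename, directory):
    """Read back an object previously written by `save`."""
    with open(os.path.join(directory, filename), "rb") as f:
        return pickle.load(f)

# Round-trip a small stand-in object through a temporary directory
data = {2015: ["article a", "article b"], 2016: ["article c"]}
with tempfile.TemporaryDirectory() as tmp:
    save(data, "corpus_demo.bin", directory=tmp)
    restored = load("corpus_demo.bin", directory=tmp)
print(restored == data)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.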


#### Build LDA model, cluster articles into issues

In [4]:
for year in years:
    print("Corpus "+str(year)+":")
    corpus[year].build_lda(num_topics=num_issues)
    corpus[year].issue_model = IssueModel(corpus=corpus[year], model=corpus[year].lda)
    corpus[year].issue_model.build_issues()
    print("Corpus "+str(year)+" Done\n")

Corpus 2015:
building LDA model...
LDA finished! 8.86 seconds
building issues...


100%|██████████| 7155/7155 [00:06<00:00, 1190.21it/s]


extracting keywords...


100%|██████████| 50/50 [00:00<00:00, 3302.34it/s]


Corpus 2015 Done

Corpus 2016:
building LDA model...
LDA finished! 9.58 seconds
building issues...


100%|██████████| 7480/7480 [00:06<00:00, 1188.70it/s]


extracting keywords...


100%|██████████| 50/50 [00:00<00:00, 2738.94it/s]


Corpus 2016 Done

Corpus 2017:
building LDA model...
LDA finished! 9.99 seconds
building issues...


100%|██████████| 9117/9117 [00:06<00:00, 1316.36it/s]


extracting keywords...


100%|██████████| 50/50 [00:00<00:00, 2003.18it/s]

Corpus 2017 Done
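
One plausible way to turn LDA output into issues is to assign each article to its dominant topic and then rank topics by accumulated weight. The sketch below assumes exactly that; how `IssueModel.build_issues()` actually assigns and scores articles is not shown in this notebook, so `build_issues`, its `threshold` parameter, and the sum-of-probabilities score are all hypothetical.

```python
from collections import defaultdict

def build_issues(doc_topics, threshold=0.0):
    """Group documents by dominant topic and rank topics by summed probability.

    `doc_topics[i]` is a list of (topic_id, probability) pairs for document i.
    Documents whose best probability is below `threshold` are left unassigned.
    """
    issues = defaultdict(list)
    scores = defaultdict(float)
    for doc_id, topics in enumerate(doc_topics):
        if not topics:
            continue
        topic_id, prob = max(topics, key=lambda tp: tp[1])
        if prob < threshold:
            continue
        issues[topic_id].append(doc_id)
        scores[topic_id] += prob
    # Highest-scoring topic first, as (topic_id, score) pairs
    sorted_issues = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(issues), sorted_issues

# Toy, made-up topic distributions for four articles
doc_topics = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.6), (2, 0.4)],
    [(1, 0.9), (2, 0.1)],
    [(2, 0.55), (0, 0.45)],
]
issues, sorted_issues = build_issues(doc_topics)
strict_issues, _ = build_issues(doc_topics, threshold=0.6)
print(issues)
print([topic_id for topic_id, _ in sorted_issues])
```

With a cutoff such as `threshold=0.6`, weakly assigned articles drop out, which keeps each group more focused at the cost of coverage.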






#### Save Corpuses

In [4]:
utils.save(corpus, filename='corpus_done.bin')

saving corpus_done.bin...
corpus_done.bin saved


#### Init Issues (for Part 2)

In [5]:
issues = []
for year in years:
    issue_model = corpus[year].issue_model
    # sorted_issues is ranked by score, so entry 0 is the year's top issue
    top_issue_id = issue_model.sorted_issues[0][0]
    issues.append(Issue(articles=issue_model.issues[top_issue_id],
                        keywords=issue_model.keywords[top_issue_id]))

#### Save Issues (for Part 2)

In [6]:
utils.save(issues, filename='issues_init.bin')

saving issues_init.bin...
issues_init.bin saved


----------------------------------------------------

### Show Results

#### Load corpus

In [17]:
corpus = utils.load(filename='corpus_done.bin')

loading corpus_done.bin...
corpus_done.bin loaded


#### Select year to show

In [19]:
show_year = 2017

#### Show Top Trending Issues

In [20]:
corpus[show_year].issue_model.show_top_issues()

ID:  24 Score: 1907.37 N:  744 Keywords:  Trump, N._Korea, S._Korea, Tillerson, NK, Mattis, THAAD, White_House, dialogue, envoy
ID:  28 Score: 1502.28 N:  560 Keywords:  N._Korea, NK, S._Korea, missile_launch, ICBM, DPRK, sanction, missile_test, Hwasong-, missile
ID:  22 Score: 948.06 N:  295 Keywords:  Ahn, Hong, People_Party, Yoo, candidate, Bareun_Party, primary, Election, Hwang, presidential_race
ID:   9 Score: 880.82 N:  385 Keywords:  Choi, Park_lawyer, trial, prosecution, Court, Woo, Park_aide, Park, arrest_warrant, arrest
ID:  19 Score: 789.48 N:  126 Keywords:  S._Korea, NK, aid, inter-Korean_exchange, Malaysia, N._Korean, assistance, humanitarian_assistance, complex, UN
ID:  38 Score: 646.02 N:  115 Keywords:  comfort_woman, statue, Dokdo, girl_statue, woman_statue, claim_Dokdo, sex_slave, sex_slavery, islet, sexual_slavery
ID:  23 Score: 615.94 N:   89 Keywords:  Incheon, victim, accomplice, murder, suspect, man, Police, jail_term, get_jail, bomb
ID:  21 Score: 612.78 N:  11

#### Show Articles from Top Issues 

In [5]:
corpus[show_year].issue_model.show_issues(k=5)

ID:   5 Score: 1109.61 N:  539 Keywords:  Saenuri, bill, impeachment, Moon, Ahn, Chung, Minjoo, Minjoo_Party, constituency, Opposition_party
	 0 	 Former leader quits opposition party
	 1 	 [Newsmaker] Park’s lame duck deadline looms
	 2 	 Constitutional reform debate resurfaces
	 3 	 General elections mired in uncertainty without constituencies
	 4 	 Park asks political parties to embrace reform
ID:  14 Score: 946.36 N:  275 Keywords:  N._Korea, sanction, N_K, S._Korea, human_right, Obama, US, NK, resolution, White_House
	 0 	 Quake in North Korea suspected to be 'explosion': report
	 1 	 North Korea announces successful test of hydrogen bomb
	 2 	 China party paper urges North Korea to change 'nuclear path'
	 3 	 Park, Obama talk over N.K. nuke test
	 4 	 Park, Obama agree to closely work together to adopt strong U.N. sanctions against North
ID:   1 Score: 767.23 N:  187 Keywords:  Trump, comfort_woman, Japan, foundation, S._Korea, sex_slavery, Iran, victim, sexual_slavery, sex_slave

#### View LDA Model

In [26]:
# Note: newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [24]:
pyLDAvis.gensim.prepare(corpus[show_year].lda, corpus[show_year].get_bows(), corpus[show_year].dict)

In [27]:
# issues_lda_data = pyLDAvis.gensim.prepare(corpus[show_year].lda, corpus[show_year].get_bows(), corpus[show_year].dict)
# utils.save(issues_lda_data, 'issues_lda_data.bin')

saving issues_lda_data.bin...
issues_lda_data.bin saved


----------------------------------------------------

## Part 2: Issue Tracking

### Imports & Statics

In [6]:
import csv, pickle
import utils
from utils import Corpus, Issue, Extractor, IssueModel, EventModel
import os, sys
import warnings

warnings.filterwarnings('ignore')

years = [2015, 2016, 2017]
num_issues = 50
num_events = 50
num_keywords = 10

#### Load Issues

In [2]:
issues = utils.load('issues_init.bin')

loading issues_init.bin...
issues_init.bin loaded


#### Build Issues

In [3]:
for i, issue in enumerate(issues):
    print("Issue "+str(i+1)+":")
    issue.build_issue()
    print("Issue "+str(i+1)+" Done\n")

Issue 1:
building issue...
building phrasers...
bigram train finished! 0.35 seconds
trigram train finished! 0.49 seconds
building dictionary...
building bag of words...


100%|██████████| 351/351 [00:00<00:00, 2260.23it/s]


Issue 1 Done

Issue 2:
building issue...
building phrasers...
bigram train finished! 0.82 seconds
trigram train finished! 1.22 seconds
building dictionary...
building bag of words...


100%|██████████| 539/539 [00:00<00:00, 1619.94it/s]


Issue 2 Done

Issue 3:
building issue...
building phrasers...
bigram train finished! 0.96 seconds
trigram train finished! 1.47 seconds
building dictionary...
building bag of words...


100%|██████████| 744/744 [00:00<00:00, 1841.75it/s]

Issue 3 Done






#### Extract keywords from each Article using tf-idf

In [4]:
for i, issue in enumerate(issues):
    print("Issue "+str(i+1)+":")
    issue.build_tfidf()
    issue.extractor = Extractor(issue)
    issue.extractor.extract(k=num_keywords)
    print("Issue "+str(i+1)+" Done\n")

Issue 1:
building tf-idf model...
tfidf finished! 0.01 seconds
extracting keywords...


100%|██████████| 351/351 [00:00<00:00, 1679.65it/s]


Issue 1 Done

Issue 2:
building tf-idf model...
tfidf finished! 0.01 seconds
extracting keywords...


100%|██████████| 539/539 [00:00<00:00, 1355.12it/s]


Issue 2 Done

Issue 3:
building tf-idf model...
tfidf finished! 0.01 seconds
extracting keywords...


100%|██████████| 744/744 [00:00<00:00, 1474.12it/s]

Issue 3 Done






#### Save Issues

In [6]:
utils.save(issues, filename='issues_ready.bin')

saving issues_ready.bin...
issues_ready.bin saved


#### Load Issues

In [94]:
issues = utils.load(filename='issues_ready.bin')

loading issues_ready.bin...
issues_ready.bin loaded


#### Build LDA model, cluster articles into events

In [5]:
for i, issue in enumerate(issues):
    print("Issue "+str(i+1)+":")
    issue.build_lda(num_topics=num_events)
    issue.event_model = EventModel(issue=issue, model=issue.lda)
    issue.event_model.build_events(threshold=0.5)
    print("Issue "+str(i+1)+" Done\n")

Issue 1:
building LDA model...
LDA finished! 0.66 seconds
building events...


100%|██████████| 351/351 [00:00<00:00, 1262.13it/s]


extracting keywords...


100%|██████████| 50/50 [00:00<00:00, 12995.92it/s]


Issue 1 Done

Issue 2:
building LDA model...
LDA finished! 1.18 seconds
building events...


100%|██████████| 539/539 [00:00<00:00, 1056.62it/s]


extracting keywords...


100%|██████████| 50/50 [00:00<00:00, 10030.38it/s]


Issue 2 Done

Issue 3:
building LDA model...
LDA finished! 1.49 seconds
building events...


100%|██████████| 744/744 [00:00<00:00, 1027.08it/s]


extracting keywords...


100%|██████████| 50/50 [00:00<00:00, 2281.67it/s]

Issue 3 Done






#### Divide events into sets of independent events

In [10]:
# One grouping threshold per issue, in the same order as `issues`
threshold = [0.15, 0.15, 0.15]
for i, issue in enumerate(issues):
    print("Issue "+str(i+1)+":")
    issue.event_model.build_independents(threshold=threshold[i])
    issue.event_model.filter_events(k=5)
    issue.event_model.build_event_times()
    issue.event_model.build_sorted_independents()
    issue.event_model.build_event_details()
    print("Issue "+str(i+1)+" Done\n")

Issue 1:
building event times...


100%|██████████| 5/5 [00:00<00:00, 1834.62it/s]


building event details...


100%|██████████| 5/5 [00:00<00:00, 1398.29it/s]


Issue 1 Done

Issue 2:
building event times...


100%|██████████| 5/5 [00:00<00:00, 1419.49it/s]


building event details...


100%|██████████| 5/5 [00:00<00:00, 1171.00it/s]


Issue 2 Done

Issue 3:
building event times...


100%|██████████| 5/5 [00:00<00:00, 1185.77it/s]


building event details...


100%|██████████| 5/5 [00:00<00:00, 1232.46it/s]

Issue 3 Done






#### Show Independent Event Groups

In [11]:
for i, issue in enumerate(issues):
    print("Issue "+str(i+1)+":")
    print(issue.event_model.sorted_independents)
    print("Issue "+str(i+1)+" Done\n")

Issue 1:
[[3, 11], [10, 6], [43]]
Issue 1 Done

Issue 2:
[[39], [15], [18], [25], [28]]
Issue 2 Done

Issue 3:
[[38], [29, 40, 7], [21]]
Issue 3 Done
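
The printed lists show events grouped into sets, with some sets holding several related events. The grouping rule inside `build_independents()` is not shown, so the sketch below is only a guess at one workable approach: link two events when the Jaccard similarity of their keyword sets reaches a threshold, take connected components as groups, and order each group by event time. The function names, the similarity measure, and the toy data are all assumptions.

```python
from datetime import date

def jaccard(a, b):
    """Jaccard similarity of two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_independents(event_keywords, threshold=0.15):
    """Group event ids into connected components of the similarity graph."""
    ids = list(event_keywords)
    parent = {e: e for e in ids}

    def find(e):
        # Union-find root lookup with path halving
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(event_keywords[a], event_keywords[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for e in ids:
        groups.setdefault(find(e), []).append(e)
    return list(groups.values())

def sort_by_time(groups, event_times):
    """Order the events inside each group chronologically."""
    return [sorted(group, key=lambda e: event_times[e]) for group in groups]

# Toy, made-up events (ids, keywords, and dates are illustrative only)
event_keywords = {
    1: ["missile", "launch", "sanction"],
    2: ["sanction", "launch", "un"],
    3: ["election", "candidate"],
    4: ["un", "sanction", "resolution"],
}
event_times = {
    1: date(2017, 5, 6),
    2: date(2017, 3, 1),
    3: date(2017, 4, 9),
    4: date(2017, 7, 19),
}
groups = build_independents(event_keywords, threshold=0.15)
chains = sort_by_time(groups, event_times)
print(chains)
```

A lower threshold merges more events into the same chain, while a higher one leaves more events standing alone.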



#### Save Issues

In [21]:
utils.save(issues, 'issues_done.bin')

saving issues_done.bin...
issues_done.bin saved


----------------------------------------------------

### Show Results

#### Load Issues

In [2]:
issues = utils.load('issues_done.bin')

loading issues_done.bin...
issues_done.bin loaded


#### Select Issue

In [12]:
i = 2  # index into issues: 0 = 2015, 1 = 2016, 2 = 2017 (top issue of each year)

#### Show Issue Keywords

In [13]:
print(', '.join([keyword[0] for keyword in issues[i].keywords]))

Trump, N._Korea, S._Korea, Tillerson, NK, Mattis, THAAD, White_House, dialogue, envoy


#### Show Top Events  

In [14]:
issues[i].event_model.show_top_events()

ID:  29 Score:  90.39 N:   49 Keywords:  Mattis, S._Korea, Pence, Vice_President, Guam, State_Department, Yun, security_chief, future_Seoul-Washington, cast_bright
ID:  21 Score:  86.39 N:   54 Keywords:  Yun, S._Korea, dialogue_table, NK_nuke, Dae_spokesman, bilateral_summit, UN_sanction, Chung, beneficial, mutually
ID:  40 Score:  73.43 N:   46 Keywords:  Mattis, foreign_minister, Kang, missile_launch, objective, S._Korea, pay_system, question, respond, Yun
ID:  38 Score:  70.61 N:   36 Keywords:  Mattis, Tillerson, military_option, NK, White_House, McMaster, diplomatic_effort, question_sincerity, hurt_Chinese, fruitless
ID:   7 Score:  70.19 N:   43 Keywords:  Tillerson, S._Korea, envoy, Kang, spokesperson, White_House, welcome, denuclearize, right_time, professor
ID:  12 Score:  55.72 N:   32 Keywords:  adviser, S._Korea, source, Korea-US_alliance, room, Moon, by-election, kilometer, Korea-US_FTA, initiative
ID:   1 Score:  51.28 N:   32 Keywords:  S._Korea, Republic_of_Korea, THAA

#### Show Articles from Top Events 

In [15]:
issues[i].event_model.show_events(k=5)

ID:  29 Score:  90.39 N:   49 Keywords:  Mattis, S._Korea, Pence, Vice_President, Guam, State_Department, Yun, security_chief, future_Seoul-Washington, cast_bright
	 0 	 S. Korea's top security official to visit US
	 1 	 Kerry warns of 'forceful ways' against N. Korea
	 2 	 Seoul welcomes US secretary nominee's suggestion of continued sanctions on N. Korea
	 3 	 Korea-Japan ties going beyond 'difficult history' also good for US: Victor Cha
	 4 	 Korea seeks high-level talks with US in Feb.: source
ID:  21 Score:  86.39 N:   54 Keywords:  Yun, S._Korea, dialogue_table, NK_nuke, Dae_spokesman, bilateral_summit, UN_sanction, Chung, beneficial, mutually
	 0 	 Controversy brews over lawmakers’ China visit amid THAAD row
	 1 	 New type of cyberattacks to rise in 2017: report
	 2 	 Resumption of Kaesong complex to be hotly debated amid sanctions regime
	 3 	 S. Korea, Japan, China to discuss N. Korea's cyber threats in Tokyo this week
	 4 	 Top diplomats of S. Korea, US to discuss NK nukes in

#### Show Issue Summary

In [16]:
issues[i].event_model.show_issue_summary(num_entities=10)

Issue Keywords: 
	 Trump, N._Korea, S._Korea, Tillerson, NK, Mattis, THAAD, White_House, dialogue, envoy

Events: 
	 38
	 29 -> 40 -> 7
	 21


Event ID:  38
Event Keywords: 
	 Mattis, Tillerson, military_option, NK, White_House, McMaster, diplomatic_effort, question_sincerity, hurt_Chinese, fruitless
Time: 2017-07-19
Entities: 
	North Korea, LOCATION
	US, LOCATION
	Pyongyang, LOCATION
	WASHINGTON, LOCATION
	China, LOCATION
	Donald Trump, PERSON
	President, JOBTITLE
	United States, LOCATION
	Rex Tillerson, PERSON
	South Korea, LOCATION

Event ID:  29
Event Keywords: 
	 Mattis, S._Korea, Pence, Vice_President, Guam, State_Department, Yun, security_chief, future_Seoul-Washington, cast_bright
Time: 2017-05-06
Entities: 
	North Korea, LOCATION
	South Korea, LOCATION
	US, LOCATION
	Seoul, LOCATION
	Donald Trump, PERSON
	Pyongyang, LOCATION
	United States, LOCATION
	China, LOCATION
	Rex Tillerson, PERSON
	Washington, LOCATION

Event ID:  40
Event Keywords: 
	 Mattis, foreign_minister, Kang, m

#### View LDA Model

In [22]:
# Note: newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [23]:
pyLDAvis.gensim.prepare(issues[i].lda, issues[i].get_bows(), issues[i].dict)

In [54]:
# events_lda_data = pyLDAvis.gensim.prepare(issues[i].lda, issues[i].get_bows(), issues[i].dict)
# utils.save(events_lda_data, 'events_lda_data.bin')

saving events_lda_data.bin...
events_lda_data.bin saved
