This notebook is some exercise on Gensim. There are 7 parts where the content of each part is:
1. Text file collection
2. Text pre-processing
3. Gensim: dictionary and corpus
4. Gensim: topic modeling
5. pyLDAvis: topic model visualization
6. Gensim: topic coherence
7. Gensim: text summarization



# Basic Settings

In [1]:
# basic settings
from gensim import corpora

#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import os
print os.getcwd()


Using TensorFlow backend.


/Users/ardellelee/PycharmProjects


## 1. Collect original text file to be processed

### 1a - Import from pdf file

In [52]:
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams



def pdfTotxt(filepath,outpath):
    try:
        fp = file(filepath, 'rb')
        outfp=file(outpath,'w')
        # create a resource manager object
        rsrcmgr = PDFResourceManager(caching = False)
        # create a device object
        laparams = LAParams()
        device = TextConverter(rsrcmgr, outfp, codec='utf-8', laparams=laparams,imagewriter=None)
        #create an interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp, pagenos = set(),maxpages=0, password='',caching=False, check_extractable=True):
            page.rotate = page.rotate % 360
            interpreter.process_page(page)
        fp.close()
        device.close()
        outfp.flush()
        outfp.close()
    except Exception, e:
         print "Exception:%s",e

pdfTotxt('Natural_language_processing.pdf','nlp1.txt')

# import text
document = open("nlp1.txt").read()

### 1b - directly input a string

In [54]:
document = "Thomas A. Anderson is a man living two lives. By day he is an " + \
    "average computer programmer and by night a hacker known as " + \
    "Neo. Neo has always questioned his reality, but the truth is " + \
    "far beyond his imagination. Neo finds himself targeted by the " + \
    "police when he is contacted by Morpheus, a legendary computer " + \
    "hacker branded a terrorist by the government. Morpheus awakens " + \
    "Neo to the real world, a ravaged wasteland where most of " + \
    "humanity have been captured by a race of machines that live " + \
    "off of the humans' body heat and electrochemical energy and " + \
    "who imprison their minds within an artificial reality known as " + \
    "the Matrix. As a rebel against the machines, Neo must return to " + \
    "the Matrix and confront the agents: super-powerful computer " + \
    "programs devoted to snuffing out Neo and the entire human " + \
    "rebellion. "

## 2. Text Pre-processing

### 2a - Use string methods

This method does NOT work when it comes to create dictionary from tokens. Since creating dictionary needs array, not a single string.

In [41]:
## create dictionary and corpus, and save
import string
from gensim.parsing.preprocessing import STOPWORDS

# remove stopwords
texts = [text for text in document.lower().split() if text not in STOPWORDS]
# remove numbers
texts = [text for text in texts if not text.isdigit()]
# remove punctuations
tokens = [text.translate(None, string.punctuation) for text in texts]
# remove words that appear only once
tokens = [text for text in texts if len(text)>1]

print(len(tokens))
#print tokens

#dictionary0 = corpora.Dictionary(tokens)

2063


### 2b - Use string methods, but line by line

Texts are processed line by line to avoid problems in 2a.

In [57]:
# pre-processing
from gensim.parsing.preprocessing import STOPWORDS
import string
from pprint import pprint

tokens=[]
for line in open('nlp1.txt'):
    texts = [text for text in line.lower().split() if text not in STOPWORDS] 
    texts = [text.translate(None, string.punctuation) for text in texts]    
    texts = [text for text in texts if not text.isdigit()]
    texts = [text for text in texts if len(text)>1]
    if (texts!=[]):
        tokens.append(texts)
    
print(len(tokens))
#pprint(tokens)

340


### 2c - Use NLTK methods
NLTK provides functions to do tokenization. Can just try and see if any difference with results in 2b.

In [96]:
## Another way to clean the text: use nltk to tokenize, lemmatize texts

from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()


tokens=[]
for line in open('nlp1.txt'):
    texts = tokenizer.tokenize(document.lower()) # split into words
    texts = [text for text in texts if text not in STOPWORDS]
    texts = [text for text in texts if not text.isdigit()]
    texts = [text for text in texts if len(text)>1]
    if (texts!=[]):
        tokens.append(texts)

#tokens =[[lemmatizer.lemmatize(text) for text in texts] for token in tokens]

print len(tokens)
#pprint(tokens)


463
[['natural',
  'language',
  'processing',
  'wikipedia',
  'free',
  'encyclopedia',
  'natural',
  'language',
  'processing',
  'nlp',
  'field',
  'science',
  'artificial',
  'intelligence',
  'computational',
  'linguistics',
  'concerned',
  'interactions',
  'computers',
  'human',
  'natural',
  'languages',
  'particular',
  'concerned',
  'programming',
  'computers',
  'fruitfully',
  'process',
  'large',
  'natural',
  'language',
  'corpora',
  'challenges',
  'natural',
  'language',
  'processing',
  'frequently',
  'involve',
  'natural',
  'language',
  'understanding',
  'natural',
  'language',
  'generation',
  'frequently',
  'formal',
  'machine',
  'readable',
  'logical',
  'forms',
  'connecting',
  'language',
  'machine',
  'perception',
  'dialog',
  'systems',
  'combination',
  'thereof',
  'contents',
  'history',
  'statistical',
  'natural',
  'language',
  'processing',
  'major',
  'evaluations',
  'tasks',
  'syntax',
  'semantics',
  'discours

  'especially',
  'useful',
  'identifying',
  'trends',
  'public',
  'opinion',
  'social',
  'media',
  'purpose',
  'marketing',
  'topic',
  'segmentation',
  'recognition',
  'word',
  'sense',
  'disambiguation',
  'given',
  'chunk',
  'text',
  'separate',
  'segments',
  'devoted',
  'topic',
  'identify',
  'topic',
  'segment',
  'words',
  'meaning',
  'select',
  'meaning',
  'makes',
  'sense',
  'context',
  'problem',
  'typically',
  'given',
  'list',
  'words',
  'associated',
  'word',
  'senses',
  'dictionary',
  'online',
  'resource',
  'wordnet',
  'discourse',
  'automatic',
  'summarization',
  'coreference',
  'resolution',
  'produce',
  'readable',
  'summary',
  'chunk',
  'text',
  'provide',
  'summaries',
  'text',
  'known',
  'type',
  'articles',
  'financial',
  'section',
  'newspaper',
  'given',
  'sentence',
  'larger',
  'chunk',
  'text',
  'determine',
  'words',
  'mentions',
  'refer',
  'objects',
  'entities',
  'anaphora',
  'resolutio

  'noun',
  'verb',
  'adjective',
  'different',
  'parts',
  'speech',
  'languages',
  'ambiguity',
  'languages',
  'little',
  'inflectional',
  'morphology',
  'english',
  'particularly',
  'prone',
  'ambiguity',
  'chinese',
  'prone',
  'ambiguity',
  'tonal',
  'language',
  'verbalization',
  'inflection',
  'readily',
  'conveyed',
  'entities',
  'employed',
  'orthography',
  'convey',
  'intended',
  'meaning',
  'parsing',
  'stochastic',
  'grammar',
  'determine',
  'parse',
  'tree',
  'grammatical',
  'analysis',
  'given',
  'sentence',
  'grammar',
  'natural',
  'languages',
  'ambiguous',
  'typical',
  'sentences',
  'multiple',
  'possible',
  'analyses',
  'fact',
  'surprisingly',
  'typical',
  'sentence',
  'thousands',
  'potential',
  'parses',
  'completely',
  'nonsensical',
  'human',
  'sentence',
  'breaking',
  'known',
  'sentence',
  'boundary',
  'disambiguation',
  'given',
  'chunk',
  'text',
  'sentence',
  'boundaries',
  'sentence',
  'bo

  'robust',
  'natural',
  'language',
  'variation',
  'machine',
  'learning',
  'paradigm',
  'calls',
  'instead',
  'statistical',
  'inference',
  'automatically',
  'learn',
  'rules',
  'analysis',
  'large',
  'corpora',
  'typical',
  'real',
  'world',
  'examples',
  'corpus',
  'plural',
  'corpora',
  'set',
  'documents',
  'possibly',
  'human',
  'annotations',
  'different',
  'classes',
  'machine',
  'learning',
  'algorithms',
  'applied',
  'nlp',
  'tasks',
  'algorithms',
  'input',
  'large',
  'set',
  'features',
  'generated',
  'input',
  'data',
  'earliest',
  'algorithms',
  'decision',
  'trees',
  'produced',
  'systems',
  'hard',
  'rules',
  'similar',
  'systems',
  'hand',
  'written',
  'rules',
  'common',
  'increasingly',
  'research',
  'focused',
  'statistical',
  'models',
  'soft',
  'probabilistic',
  'decisions',
  'based',
  'attaching',
  'real',
  'valued',
  'weights',
  'input',
  'feature',
  'models',
  'advantage',
  'express',


  'produces',
  'accurate',
  'results',
  'given',
  'input',
  'data',
  'enormous',
  'non',
  'annotated',
  'data',
  'available',
  'including',
  'things',
  'entire',
  'content',
  'world',
  'wide',
  'web',
  'inferior',
  'results',
  'recent',
  'years',
  'flurry',
  'results',
  'showing',
  'deep',
  'learning',
  'techniques',
  'achieving',
  'state',
  'art',
  'results',
  'natural',
  'language',
  'tasks',
  'example',
  'language',
  'modeling',
  'parsing',
  'statistical',
  'natural',
  'language',
  'processing',
  'called',
  'statistical',
  'revolution',
  'late',
  '1980s',
  'mid',
  '1990s',
  'natural',
  'language',
  'processing',
  'research',
  'relied',
  'heavily',
  'machine',
  'learning',
  'language',
  'processing',
  'tasks',
  'typically',
  'involved',
  'direct',
  'hand',
  'coding',
  'rules',
  'general',
  'robust',
  'natural',
  'language',
  'variation',
  'machine',
  'learning',
  'paradigm',
  'calls',
  'instead',
  'statistic

  'prone',
  'time',
  'consuming',
  'systems',
  'based',
  'automatically',
  'learning',
  'rules',
  'accurate',
  'simply',
  'supplying',
  'input',
  'data',
  'systems',
  'based',
  'hand',
  'written',
  'rules',
  'accurate',
  'increasing',
  'complexity',
  'rules',
  'difficult',
  'task',
  'particular',
  'limit',
  'complexity',
  'systems',
  'based',
  'hand',
  'crafted',
  'rules',
  'systems',
  'unmanageable',
  'creating',
  'data',
  'input',
  'machine',
  'learning',
  'systems',
  'simply',
  'requires',
  'corresponding',
  'increase',
  'number',
  'man',
  'hours',
  'worked',
  'generally',
  'significant',
  'increases',
  'complexity',
  'annotation',
  'process',
  'major',
  'evaluations',
  'tasks',
  'following',
  'list',
  'commonly',
  'researched',
  'tasks',
  'nlp',
  'note',
  'tasks',
  'direct',
  'real',
  'world',
  'applications',
  'commonly',
  'serve',
  'subtasks',
  'aid',
  'solving',
  'larger',
  'tasks',
  'nlp',
  'tasks',
  

  'german',
  'capitalizes',
  'nouns',
  'regardless',
  'refer',
  'names',
  'french',
  'spanish',
  'capitalize',
  'names',
  'serve',
  'adjectives',
  'natural',
  'language',
  'generation',
  'natural',
  'language',
  'understanding',
  'convert',
  'information',
  'databases',
  'semantic',
  'intents',
  'readable',
  'human',
  'language',
  'convert',
  'chunks',
  'text',
  'formal',
  'representations',
  'order',
  'logic',
  'structures',
  'easier',
  'programs',
  'manipulate',
  'natural',
  'language',
  'understanding',
  'involves',
  'identification',
  'intended',
  'semantic',
  'multiple',
  'possible',
  'semantics',
  'derived',
  'natural',
  'language',
  'expression',
  'usually',
  'takes',
  'form',
  'organized',
  'notations',
  'natural',
  'languages',
  'concepts',
  'introduction',
  'creation',
  'language',
  'metamodel',
  'ontology',
  'efficient',
  'empirical',
  'solutions',
  'explicit',
  'formalization',
  'natural',
  'languages',
 

  'case',
  'corpus',
  'linguistics',
  'creation',
  'use',
  'corpora',
  'real',
  'world',
  'data',
  'fundamental',
  'machine',
  'learning',
  'algorithms',
  'nlp',
  'addition',
  'theoretical',
  'underpinnings',
  'chomskyan',
  'linguistics',
  'called',
  'poverty',
  'stimulus',
  'argument',
  'entail',
  'general',
  'learning',
  'algorithms',
  'typically',
  'machine',
  'learning',
  'successful',
  'language',
  'processing',
  'result',
  'chomskyan',
  'paradigm',
  'discouraged',
  'application',
  'models',
  'language',
  'processing',
  'goldberg',
  'yoav',
  'https',
  'www',
  'jair',
  'org',
  'media',
  'live',
  'jair',
  'pdf',
  'primer',
  'neural',
  'network',
  'models',
  'natural',
  'language',
  'processing',
  'journal',
  'artificial',
  'intelligence',
  'research',
  '345\xe2',
  'ian',
  'goodfellow',
  'yoshua',
  'bengio',
  'aaron',
  'courville',
  'http',
  'www',
  'deeplearningbook',
  'org',
  'deep',
  'learning',
  'mit',
  '

  'location',
  'organization',
  'note',
  'capitalization',
  'aid',
  'recognizing',
  'named',
  'entities',
  'languages',
  'english',
  'information',
  'aid',
  'determining',
  'type',
  'named',
  'entity',
  'case',
  'inaccurate',
  'insufficient',
  'example',
  'word',
  'sentence',
  'capitalized',
  'named',
  'entities',
  'span',
  'words',
  'capitalized',
  'furthermore',
  'languages',
  'non',
  'western',
  'scripts',
  'chinese',
  'arabic',
  'capitalization',
  'languages',
  'capitalization',
  'consistently',
  'use',
  'distinguish',
  'names',
  'example',
  'german',
  'capitalizes',
  'nouns',
  'regardless',
  'refer',
  'names',
  'french',
  'spanish',
  'capitalize',
  'names',
  'serve',
  'adjectives',
  'natural',
  'language',
  'generation',
  'natural',
  'language',
  'understanding',
  'convert',
  'information',
  'databases',
  'semantic',
  'intents',
  'readable',
  'human',
  'language',
  'convert',
  'chunks',
  'text',
  'formal',
  '

  'yes',
  'vs',
  'objective',
  'true',
  'false',
  'expected',
  'construction',
  'basis',
  'semantics',
  'formalization',
  'optical',
  'character',
  'recognition',
  'ocr',
  'given',
  'image',
  'representing',
  'printed',
  'text',
  'determine',
  'corresponding',
  'text',
  'question',
  'answering',
  'given',
  'human',
  'language',
  'question',
  'determine',
  'answer',
  'typical',
  'questions',
  'specific',
  'right',
  'answer',
  'capital',
  'canada',
  'open',
  'ended',
  'questions',
  'considered',
  'meaning',
  'life',
  'recent',
  'works',
  'looked',
  'complex',
  'questions',
  'recognizing',
  'textual',
  'entailment',
  'given',
  'text',
  'fragments',
  'determine',
  'true',
  'entails',
  'entails',
  'negation',
  'allows',
  'true',
  'false',
  'given',
  'chunk',
  'text',
  'identify',
  'relationships',
  'named',
  'entities',
  'married',
  'relationship',
  'extraction',
  'sentiment',
  'analysis',
  'extract',
  'subjective',


  'lex',
  'microsoft',
  'cognitive',
  'services',
  'facebook',
  'deeptext',
  'friendlydata',
  'lexalytics',
  'aylien',
  'automated',
  'insights',
  'indico',
  'meaningcloud',
  'rosette',
  'wsc',
  'iminds',
  'automated',
  'essay',
  'scoring',
  'biomedical',
  'text',
  'mining',
  'compound',
  'term',
  'processing',
  'computational',
  'linguistics',
  'assisted',
  'reviewing',
  'controlled',
  'natural',
  'language',
  'deep',
  'learning',
  'deep',
  'linguistic',
  'processing',
  'foreign',
  'language',
  'reading',
  'aid',
  'foreign',
  'language',
  'writing',
  'aid',
  'information',
  'extraction',
  'information',
  'retrieval',
  'language',
  'technology',
  'latent',
  'dirichlet',
  'allocation',
  'lda',
  'latent',
  'semantic',
  'indexing',
  'list',
  'natural',
  'language',
  'processing',
  'toolkits',
  'native',
  'language',
  'identification',
  'natural',
  'language',
  'programming',
  'references',
  'natural',
  'language',
  's

  'api',
  'amazon',
  'lex',
  'microsoft',
  'cognitive',
  'services',
  'facebook',
  'deeptext',
  'friendlydata',
  'lexalytics',
  'aylien',
  'automated',
  'insights',
  'indico',
  'meaningcloud',
  'rosette',
  'wsc',
  'iminds',
  'automated',
  'essay',
  'scoring',
  'biomedical',
  'text',
  'mining',
  'compound',
  'term',
  'processing',
  'computational',
  'linguistics',
  'assisted',
  'reviewing',
  'controlled',
  'natural',
  'language',
  'deep',
  'learning',
  'deep',
  'linguistic',
  'processing',
  'foreign',
  'language',
  'reading',
  'aid',
  'foreign',
  'language',
  'writing',
  'aid',
  'information',
  'extraction',
  'information',
  'retrieval',
  'language',
  'technology',
  'latent',
  'dirichlet',
  'allocation',
  'lda',
  'latent',
  'semantic',
  'indexing',
  'list',
  'natural',
  'language',
  'processing',
  'toolkits',
  'native',
  'language',
  'identification',
  'natural',
  'language',
  'programming',
  'references',
  'natural

  'learning',
  'successful',
  'language',
  'processing',
  'result',
  'chomskyan',
  'paradigm',
  'discouraged',
  'application',
  'models',
  'language',
  'processing',
  'goldberg',
  'yoav',
  'https',
  'www',
  'jair',
  'org',
  'media',
  'live',
  'jair',
  'pdf',
  'primer',
  'neural',
  'network',
  'models',
  'natural',
  'language',
  'processing',
  'journal',
  'artificial',
  'intelligence',
  'research',
  '345\xe2',
  'ian',
  'goodfellow',
  'yoshua',
  'bengio',
  'aaron',
  'courville',
  'http',
  'www',
  'deeplearningbook',
  'org',
  'deep',
  'learning',
  'mit',
  'press',
  'rafal',
  'jozefowicz',
  'oriol',
  'vinyals',
  'mike',
  'schuster',
  'noam',
  'shazeer',
  'yonghui',
  'wu',
  'https',
  'arxiv',
  'org',
  'abs',
  'exploring',
  'limits',
  'language',
  'modeling',
  'kook',
  'choe',
  'eugene',
  'charniak',
  'emnlp',
  'http',
  'www',
  'aclweb',
  'org',
  'website',
  'old_anthology',
  'd16',
  'd16',
  'pdf',
  'parsing',
  

  'chomskyan',
  'linguistics',
  'called',
  'poverty',
  'stimulus',
  'argument',
  'entail',
  'general',
  'learning',
  'algorithms',
  'typically',
  'machine',
  'learning',
  'successful',
  'language',
  'processing',
  'result',
  'chomskyan',
  'paradigm',
  'discouraged',
  'application',
  'models',
  'language',
  'processing',
  'goldberg',
  'yoav',
  'https',
  'www',
  'jair',
  'org',
  'media',
  'live',
  'jair',
  'pdf',
  'primer',
  'neural',
  'network',
  'models',
  'natural',
  'language',
  'processing',
  'journal',
  'artificial',
  'intelligence',
  'research',
  '345\xe2',
  'ian',
  'goodfellow',
  'yoshua',
  'bengio',
  'aaron',
  'courville',
  'http',
  'www',
  'deeplearningbook',
  'org',
  'deep',
  'learning',
  'mit',
  'press',
  'rafal',
  'jozefowicz',
  'oriol',
  'vinyals',
  'mike',
  'schuster',
  'noam',
  'shazeer',
  'yonghui',
  'wu',
  'https',
  'arxiv',
  'org',
  'abs',
  'exploring',
  'limits',
  'language',
  'modeling',
  '

  'www',
  'jair',
  'org',
  'media',
  'live',
  'jair',
  'pdf',
  'primer',
  'neural',
  'network',
  'models',
  'natural',
  'language',
  'processing',
  'journal',
  'artificial',
  'intelligence',
  'research',
  '345\xe2',
  'ian',
  'goodfellow',
  'yoshua',
  'bengio',
  'aaron',
  'courville',
  'http',
  'www',
  'deeplearningbook',
  'org',
  'deep',
  'learning',
  'mit',
  'press',
  'rafal',
  'jozefowicz',
  'oriol',
  'vinyals',
  'mike',
  'schuster',
  'noam',
  'shazeer',
  'yonghui',
  'wu',
  'https',
  'arxiv',
  'org',
  'abs',
  'exploring',
  'limits',
  'language',
  'modeling',
  'kook',
  'choe',
  'eugene',
  'charniak',
  'emnlp',
  'http',
  'www',
  'aclweb',
  'org',
  'website',
  'old_anthology',
  'd16',
  'd16',
  'pdf',
  'parsing',
  'language',
  'modeling',
  'vinyals',
  'oriol',
  'et',
  'al',
  'nips2015',
  'https',
  'papers',
  'nips',
  'cc',
  'paper',
  'grammar',
  'foreign',
  'language',
  'pdf',
  'mark',
  'johnson',
  'stati

  'human',
  'knowledge',
  'structures',
  'kishorjit',
  'vidya',
  'raj',
  'rk',
  'nirmal',
  'sivaji',
  'manipuri',
  'morpheme',
  'identification',
  'http',
  'aclweb',
  'org',
  'anthology',
  'w12',
  'w12',
  'pdf',
  'proceedings',
  '3rd',
  'workshop',
  'south',
  'southeast',
  'asian',
  'natural',
  'language',
  'processing',
  'sanlp',
  'pages',
  '95\xe2',
  'coling',
  'mumbai',
  'december',
  'yucong',
  'duan',
  'christophe',
  'cruz',
  'formalizing',
  'semantic',
  'natural',
  'language',
  'conceptualization',
  'existence',
  'http',
  'www',
  'ijimt',
  'org',
  'abstract',
  'e00187',
  'htm',
  'international',
  'journal',
  'innovation',
  'management',
  'technology',
  'pp',
  'versatile',
  'question',
  'answering',
  'systems',
  'seeing',
  'synthesis',
  'https',
  'www',
  'academia',
  'edu',
  'versatile',
  '_question_answering_systems_seeing_in_synthesis',
  'mittal',
  'et',
  'al',
  'ijiids',
  'pascal',
  'recognizing',
  'textu

  'org',
  'deep',
  'learning',
  'mit',
  'press',
  'rafal',
  'jozefowicz',
  'oriol',
  'vinyals',
  'mike',
  'schuster',
  'noam',
  'shazeer',
  'yonghui',
  'wu',
  'https',
  'arxiv',
  'org',
  'abs',
  'exploring',
  'limits',
  'language',
  'modeling',
  'kook',
  'choe',
  'eugene',
  'charniak',
  'emnlp',
  'http',
  'www',
  'aclweb',
  'org',
  'website',
  'old_anthology',
  'd16',
  'd16',
  'pdf',
  'parsing',
  'language',
  'modeling',
  'vinyals',
  'oriol',
  'et',
  'al',
  'nips2015',
  'https',
  'papers',
  'nips',
  'cc',
  'paper',
  'grammar',
  'foreign',
  'language',
  'pdf',
  'mark',
  'johnson',
  'statistical',
  'revolution',
  'changes',
  'computational',
  'linguistics',
  'http',
  'www',
  'aclweb',
  'anthology',
  'w09',
  'proceedings',
  'eacl',
  'workshop',
  'interaction',
  'linguistics',
  'computational',
  'linguistics',
  'philip',
  'resnik',
  'revolutions',
  'http',
  'languagelog',
  'ldc',
  'upenn',
  'edu',
  'nll',
  'l

  'kongthon',
  'chatchawal',
  'sangkeettrakarn',
  'sarawoot',
  'kongyoung',
  'choochart',
  'haruechaiyasak',
  'published',
  'acm',
  'article',
  'bibliometrics',
  'data',
  'bibliometrics',
  'published',
  'proceeding',
  'medes',
  'proceedings',
  'international',
  'conference',
  'management',
  'emergent',
  'digital',
  'ecosystems',
  'acm',
  'new',
  'york',
  'ny',
  'usa',
  'isbn',
  'doi',
  'https',
  'dx',
  'doi',
  'org',
  '2f1643823',
  'hutchins',
  'history',
  'machine',
  'translation',
  'nutshell',
  'http',
  'ahs',
  'annaisd',
  'org',
  'common',
  'pa',
  'ges',
  'galleryphoto',
  'aspx',
  'photoid',
  'width',
  'height',
  'chomskyan',
  'linguistics',
  'encourages',
  'investigation',
  'corner',
  'cases',
  'stress',
  'limits',
  'theoretical',
  'models',
  'comparable',
  'pathological',
  'phenomena',
  'mathematics',
  'typically',
  'created',
  'thought',
  'experiments',
  'systematic',
  'investigation',
  'typical',
  'phenomen

  'automated',
  'essay',
  'scoring',
  'biomedical',
  'text',
  'mining',
  'compound',
  'term',
  'processing',
  'computational',
  'linguistics',
  'assisted',
  'reviewing',
  'controlled',
  'natural',
  'language',
  'deep',
  'learning',
  'deep',
  'linguistic',
  'processing',
  'foreign',
  'language',
  'reading',
  'aid',
  'foreign',
  'language',
  'writing',
  'aid',
  'information',
  'extraction',
  'information',
  'retrieval',
  'language',
  'technology',
  'latent',
  'dirichlet',
  'allocation',
  'lda',
  'latent',
  'semantic',
  'indexing',
  'list',
  'natural',
  'language',
  'processing',
  'toolkits',
  'native',
  'language',
  'identification',
  'natural',
  'language',
  'programming',
  'references',
  'natural',
  'language',
  'search',
  'query',
  'expansion',
  'reification',
  'linguistics',
  'semantic',
  'folding',
  'speech',
  'processing',
  'spoken',
  'dialogue',
  'text',
  'proofing',
  'text',
  'simplification',
  'thought',
  'v

  'automated',
  'essay',
  'scoring',
  'biomedical',
  'text',
  'mining',
  'compound',
  'term',
  'processing',
  'computational',
  'linguistics',
  'assisted',
  'reviewing',
  'controlled',
  'natural',
  'language',
  'deep',
  'learning',
  'deep',
  'linguistic',
  'processing',
  'foreign',
  'language',
  'reading',
  'aid',
  'foreign',
  'language',
  'writing',
  'aid',
  'information',
  'extraction',
  'information',
  'retrieval',
  'language',
  'technology',
  'latent',
  'dirichlet',
  'allocation',
  'lda',
  'latent',
  'semantic',
  'indexing',
  'list',
  'natural',
  'language',
  'processing',
  'toolkits',
  'native',
  'language',
  'identification',
  'natural',
  'language',
  'programming',
  'references',
  'natural',
  'language',
  'search',
  'query',
  'expansion',
  'reification',
  'linguistics',
  'semantic',
  'folding',
  'speech',
  'processing',
  'spoken',
  'dialogue',
  'text',
  'proofing',
  'text',
  'simplification',
  'thought',
  'v

  'query',
  'expansion',
  'reification',
  'linguistics',
  'semantic',
  'folding',
  'speech',
  'processing',
  'spoken',
  'dialogue',
  'text',
  'proofing',
  'text',
  'simplification',
  'thought',
  'vector',
  'truecasing',
  'question',
  'answering',
  'word2vec',
  'implementing',
  'online',
  'help',
  'desk',
  'based',
  'conversational',
  'agent',
  'http',
  'ahs',
  'annaisd',
  'org',
  'common',
  'pages',
  'galleryphoto',
  'aspx',
  'photoid',
  'width',
  'height',
  'authors',
  'alisa',
  'kongthon',
  'chatchawal',
  'sangkeettrakarn',
  'sarawoot',
  'kongyoung',
  'choochart',
  'haruechaiyasak',
  'published',
  'acm',
  'article',
  'bibliometrics',
  'data',
  'bibliometrics',
  'published',
  'proceeding',
  'medes',
  'proceedings',
  'international',
  'conference',
  'management',
  'emergent',
  'digital',
  'ecosystems',
  'acm',
  'new',
  'york',
  'ny',
  'usa',
  'isbn',
  'doi',
  'https',
  'dx',
  'doi',
  'org',
  '2f1643823',
  'hutch

  'given',
  'sound',
  'clip',
  'person',
  'people',
  'speaking',
  'separate',
  'words',
  'subtask',
  'speech',
  'recognition',
  'typically',
  'grouped',
  'speech',
  'segmentation',
  'text',
  'speech',
  'natural',
  'language',
  'processing',
  'apis',
  'ibm',
  'watson',
  'google',
  'cloud',
  'natural',
  'language',
  'api',
  'amazon',
  'lex',
  'microsoft',
  'cognitive',
  'services',
  'facebook',
  'deeptext',
  'friendlydata',
  'lexalytics',
  'aylien',
  'automated',
  'insights',
  'indico',
  'meaningcloud',
  'rosette',
  'wsc',
  'iminds',
  'automated',
  'essay',
  'scoring',
  'biomedical',
  'text',
  'mining',
  'compound',
  'term',
  'processing',
  'computational',
  'linguistics',
  'assisted',
  'reviewing',
  'controlled',
  'natural',
  'language',
  'deep',
  'learning',
  'deep',
  'linguistic',
  'processing',
  'foreign',
  'language',
  'reading',
  'aid',
  'foreign',
  'language',
  'writing',
  'aid',
  'information',
  'extractio

  'relationship',
  'identified',
  'fact',
  'door',
  'referred',
  'door',
  'john',
  'house',
  'structure',
  'referred',
  'rubric',
  'includes',
  'number',
  'related',
  'tasks',
  'task',
  'identifying',
  'discourse',
  'structure',
  'connected',
  'text',
  'nature',
  'discourse',
  'relationships',
  'sentences',
  'elaboration',
  'explanation',
  'contrast',
  'possible',
  'task',
  'recognizing',
  'classifying',
  'speech',
  'acts',
  'chunk',
  'text',
  'yes',
  'question',
  'content',
  'question',
  'statement',
  'assertion',
  'discourse',
  'analysis',
  'speech',
  'speech',
  'recognition',
  'given',
  'sound',
  'clip',
  'person',
  'people',
  'speaking',
  'determine',
  'textual',
  'representation',
  'speech',
  'opposite',
  'text',
  'speech',
  'extremely',
  'difficult',
  'problems',
  'colloquially',
  'termed',
  'ai',
  'complete',
  'natural',
  'speech',
  'hardly',
  'pauses',
  'successive',
  'words',
  'speech',
  'segmentation',


  'meaning',
  'makes',
  'sense',
  'context',
  'problem',
  'typically',
  'given',
  'list',
  'words',
  'associated',
  'word',
  'senses',
  'dictionary',
  'online',
  'resource',
  'wordnet',
  'discourse',
  'automatic',
  'summarization',
  'coreference',
  'resolution',
  'produce',
  'readable',
  'summary',
  'chunk',
  'text',
  'provide',
  'summaries',
  'text',
  'known',
  'type',
  'articles',
  'financial',
  'section',
  'newspaper',
  'given',
  'sentence',
  'larger',
  'chunk',
  'text',
  'determine',
  'words',
  'mentions',
  'refer',
  'objects',
  'entities',
  'anaphora',
  'resolution',
  'specific',
  'example',
  'task',
  'specifically',
  'concerned',
  'matching',
  'pronouns',
  'nouns',
  'names',
  'refer',
  'general',
  'task',
  'coreference',
  'resolution',
  'includes',
  'identifying',
  'called',
  'bridging',
  'relationships',
  'involving',
  'referring',
  'expressions',
  'example',
  'sentence',
  'entered',
  'john',
  'house',
  '

  'analysis',
  'given',
  'sentence',
  'grammar',
  'natural',
  'languages',
  'ambiguous',
  'typical',
  'sentences',
  'multiple',
  'possible',
  'analyses',
  'fact',
  'surprisingly',
  'typical',
  'sentence',
  'thousands',
  'potential',
  'parses',
  'completely',
  'nonsensical',
  'human',
  'sentence',
  'breaking',
  'known',
  'sentence',
  'boundary',
  'disambiguation',
  'given',
  'chunk',
  'text',
  'sentence',
  'boundaries',
  'sentence',
  'boundaries',
  'marked',
  'periods',
  'punctuation',
  'marks',
  'characters',
  'serve',
  'purposes',
  'marking',
  'abbreviations',
  'stemming',
  'word',
  'segmentation',
  'separate',
  'chunk',
  'continuous',
  'text',
  'separate',
  'words',
  'language',
  'like',
  'english',
  'fairly',
  'trivial',
  'words',
  'usually',
  'separated',
  'spaces',
  'written',
  'languages',
  'like',
  'chinese',
  'japanese',
  'thai',
  'mark',
  'word',
  'boundaries',
  'fashion',
  'languages',
  'text',
  'segmen

  'parts',
  'speech',
  'languages',
  'ambiguity',
  'languages',
  'little',
  'inflectional',
  'morphology',
  'english',
  'particularly',
  'prone',
  'ambiguity',
  'chinese',
  'prone',
  'ambiguity',
  'tonal',
  'language',
  'verbalization',
  'inflection',
  'readily',
  'conveyed',
  'entities',
  'employed',
  'orthography',
  'convey',
  'intended',
  'meaning',
  'parsing',
  'stochastic',
  'grammar',
  'determine',
  'parse',
  'tree',
  'grammatical',
  'analysis',
  'given',
  'sentence',
  'grammar',
  'natural',
  'languages',
  'ambiguous',
  'typical',
  'sentences',
  'multiple',
  'possible',
  'analyses',
  'fact',
  'surprisingly',
  'typical',
  'sentence',
  'thousands',
  'potential',
  'parses',
  'completely',
  'nonsensical',
  'human',
  'sentence',
  'breaking',
  'known',
  'sentence',
  'boundary',
  'disambiguation',
  'given',
  'chunk',
  'text',
  'sentence',
  'boundaries',
  'sentence',
  'boundaries',
  'marked',
  'periods',
  'punctuation

  'languages',
  'like',
  'chinese',
  'japanese',
  'thai',
  'mark',
  'word',
  'boundaries',
  'fashion',
  'languages',
  'text',
  'segmentation',
  'significant',
  'task',
  'requiring',
  'knowledge',
  'vocabulary',
  'morphology',
  'words',
  'language',
  'terminology',
  'extraction',
  'goal',
  'terminology',
  'extraction',
  'automatically',
  'extract',
  'relevant',
  'terms',
  'given',
  'corpus',
  'semantics',
  'lexical',
  'semantics',
  'machine',
  'translation',
  'computational',
  'meaning',
  'individual',
  'words',
  'context',
  'named',
  'entity',
  'recognition',
  'ner',
  'automatically',
  'translate',
  'text',
  'human',
  'language',
  'difficult',
  'problems',
  'member',
  'class',
  'problems',
  'colloquially',
  'termed',
  'ai',
  'complete',
  'requiring',
  'different',
  'types',
  'knowledge',
  'humans',
  'possess',
  'grammar',
  'semantics',
  'facts',
  'real',
  'world',
  'order',
  'solve',
  'properly',
  'given',
  'stre

  'conveyed',
  'entities',
  'employed',
  'orthography',
  'convey',
  'intended',
  'meaning',
  'parsing',
  'stochastic',
  'grammar',
  'determine',
  'parse',
  'tree',
  'grammatical',
  'analysis',
  'given',
  'sentence',
  'grammar',
  'natural',
  'languages',
  'ambiguous',
  'typical',
  'sentences',
  'multiple',
  'possible',
  'analyses',
  'fact',
  'surprisingly',
  'typical',
  'sentence',
  'thousands',
  'potential',
  'parses',
  'completely',
  'nonsensical',
  'human',
  'sentence',
  'breaking',
  'known',
  'sentence',
  'boundary',
  'disambiguation',
  'given',
  'chunk',
  'text',
  'sentence',
  'boundaries',
  'sentence',
  'boundaries',
  'marked',
  'periods',
  'punctuation',
  'marks',
  'characters',
  'serve',
  'purposes',
  'marking',
  'abbreviations',
  'stemming',
  'word',
  'segmentation',
  'separate',
  'chunk',
  'continuous',
  'text',
  'separate',
  'words',
  'language',
  'like',
  'english',
  'fairly',
  'trivial',
  'words',
  'us

  'learning',
  'procedures',
  'use',
  'statistical',
  'inference',
  'algorithms',
  'produce',
  'models',
  'robust',
  'unfamiliar',
  'input',
  'containing',
  'words',
  'structures',
  'seen',
  'erroneous',
  'input',
  'misspelled',
  'words',
  'words',
  'accidentally',
  'omitted',
  'generally',
  'handling',
  'input',
  'gracefully',
  'hand',
  'written',
  'rules\xe2',
  'generally',
  'creating',
  'systems',
  'hand',
  'written',
  'rules',
  'soft',
  'decisions\xe2',
  'extremely',
  'difficult',
  'error',
  'prone',
  'time',
  'consuming',
  'systems',
  'based',
  'automatically',
  'learning',
  'rules',
  'accurate',
  'simply',
  'supplying',
  'input',
  'data',
  'systems',
  'based',
  'hand',
  'written',
  'rules',
  'accurate',
  'increasing',
  'complexity',
  'rules',
  'difficult',
  'task',
  'particular',
  'limit',
  'complexity',
  'systems',
  'based',
  'hand',
  'crafted',
  'rules',
  'systems',
  'unmanageable',
  'creating',
  'data',

  'words',
  'languages',
  'turkish',
  'meitei',
  'highly',
  'agglutinated',
  'indian',
  'language',
  'approach',
  'possible',
  'dictionary',
  'entry',
  'thousands',
  'possible',
  'word',
  'forms',
  'given',
  'sentence',
  'determine',
  'speech',
  'word',
  'words',
  'especially',
  'common',
  'ones',
  'serve',
  'multiple',
  'parts',
  'speech',
  'example',
  'book',
  'noun',
  'book',
  'table',
  'verb',
  'book',
  'flight',
  'set',
  'noun',
  'verb',
  'adjective',
  'different',
  'parts',
  'speech',
  'languages',
  'ambiguity',
  'languages',
  'little',
  'inflectional',
  'morphology',
  'english',
  'particularly',
  'prone',
  'ambiguity',
  'chinese',
  'prone',
  'ambiguity',
  'tonal',
  'language',
  'verbalization',
  'inflection',
  'readily',
  'conveyed',
  'entities',
  'employed',
  'orthography',
  'convey',
  'intended',
  'meaning',
  'parsing',
  'stochastic',
  'grammar',
  'determine',
  'parse',
  'tree',
  'grammatical',
  'analy

  'ambiguity',
  'languages',
  'little',
  'inflectional',
  'morphology',
  'english',
  'particularly',
  'prone',
  'ambiguity',
  'chinese',
  'prone',
  'ambiguity',
  'tonal',
  'language',
  'verbalization',
  'inflection',
  'readily',
  'conveyed',
  'entities',
  'employed',
  'orthography',
  'convey',
  'intended',
  'meaning',
  'parsing',
  'stochastic',
  'grammar',
  'determine',
  'parse',
  'tree',
  'grammatical',
  'analysis',
  'given',
  'sentence',
  'grammar',
  'natural',
  'languages',
  'ambiguous',
  'typical',
  'sentences',
  'multiple',
  'possible',
  'analyses',
  'fact',
  'surprisingly',
  'typical',
  'sentence',
  'thousands',
  'potential',
  'parses',
  'completely',
  'nonsensical',
  'human',
  'sentence',
  'breaking',
  'known',
  'sentence',
  'boundary',
  'disambiguation',
  'given',
  'chunk',
  'text',
  'sentence',
  'boundaries',
  'sentence',
  'boundaries',
  'marked',
  'periods',
  'punctuation',
  'marks',
  'characters',
  'serve

  'task',
  'requiring',
  'knowledge',
  'vocabulary',
  'morphology',
  'words',
  'language',
  'terminology',
  'extraction',
  'goal',
  'terminology',
  'extraction',
  'automatically',
  'extract',
  'relevant',
  'terms',
  'given',
  'corpus',
  'semantics',
  'lexical',
  'semantics',
  'machine',
  'translation',
  'computational',
  'meaning',
  'individual',
  'words',
  'context',
  'named',
  'entity',
  'recognition',
  'ner',
  'automatically',
  'translate',
  'text',
  'human',
  'language',
  'difficult',
  'problems',
  'member',
  'class',
  'problems',
  'colloquially',
  'termed',
  'ai',
  'complete',
  'requiring',
  'different',
  'types',
  'knowledge',
  'humans',
  'possess',
  'grammar',
  'semantics',
  'facts',
  'real',
  'world',
  'order',
  'solve',
  'properly',
  'given',
  'stream',
  'text',
  'determine',
  'items',
  'text',
  'map',
  'proper',
  'names',
  'people',
  'places',
  'type',
  'person',
  'location',
  'organization',
  'note',


  'text',
  'separate',
  'segments',
  'devoted',
  'topic',
  'identify',
  'topic',
  'segment',
  'words',
  'meaning',
  'select',
  'meaning',
  'makes',
  'sense',
  'context',
  'problem',
  'typically',
  'given',
  'list',
  'words',
  'associated',
  'word',
  'senses',
  'dictionary',
  'online',
  'resource',
  'wordnet',
  'discourse',
  'automatic',
  'summarization',
  'coreference',
  'resolution',
  'produce',
  'readable',
  'summary',
  'chunk',
  'text',
  'provide',
  'summaries',
  'text',
  'known',
  'type',
  'articles',
  'financial',
  'section',
  'newspaper',
  'given',
  'sentence',
  'larger',
  'chunk',
  'text',
  'determine',
  'words',
  'mentions',
  'refer',
  'objects',
  'entities',
  'anaphora',
  'resolution',
  'specific',
  'example',
  'task',
  'specifically',
  'concerned',
  'matching',
  'pronouns',
  'nouns',
  'names',
  'refer',
  'general',
  'task',
  'coreference',
  'resolution',
  'includes',
  'identifying',
  'called',
  'bridg

  'door',
  'referring',
  'expression',
  'bridging',
  'relationship',
  'identified',
  'fact',
  'door',
  'referred',
  'door',
  'john',
  'house',
  'structure',
  'referred',
  'rubric',
  'includes',
  'number',
  'related',
  'tasks',
  'task',
  'identifying',
  'discourse',
  'structure',
  'connected',
  'text',
  'nature',
  'discourse',
  'relationships',
  'sentences',
  'elaboration',
  'explanation',
  'contrast',
  'possible',
  'task',
  'recognizing',
  'classifying',
  'speech',
  'acts',
  'chunk',
  'text',
  'yes',
  'question',
  'content',
  'question',
  'statement',
  'assertion',
  'discourse',
  'analysis',
  'speech',
  'speech',
  'recognition',
  'given',
  'sound',
  'clip',
  'person',
  'people',
  'speaking',
  'determine',
  'textual',
  'representation',
  'speech',
  'opposite',
  'text',
  'speech',
  'extremely',
  'difficult',
  'problems',
  'colloquially',
  'termed',
  'ai',
  'complete',
  'natural',
  'speech',
  'hardly',
  'pauses',
  

  'language',
  'search',
  'query',
  'expansion',
  'reification',
  'linguistics',
  'semantic',
  'folding',
  'speech',
  'processing',
  'spoken',
  'dialogue',
  'text',
  'proofing',
  'text',
  'simplification',
  'thought',
  'vector',
  'truecasing',
  'question',
  'answering',
  'word2vec',
  'implementing',
  'online',
  'help',
  'desk',
  'based',
  'conversational',
  'agent',
  'http',
  'ahs',
  'annaisd',
  'org',
  'common',
  'pages',
  'galleryphoto',
  'aspx',
  'photoid',
  'width',
  'height',
  'authors',
  'alisa',
  'kongthon',
  'chatchawal',
  'sangkeettrakarn',
  'sarawoot',
  'kongyoung',
  'choochart',
  'haruechaiyasak',
  'published',
  'acm',
  'article',
  'bibliometrics',
  'data',
  'bibliometrics',
  'published',
  'proceeding',
  'medes',
  'proceedings',
  'international',
  'conference',
  'management',
  'emergent',
  'digital',
  'ecosystems',
  'acm',
  'new',
  'york',
  'ny',
  'usa',
  'isbn',
  'doi',
  'https',
  'dx',
  'doi',
  'org

  'truecasing',
  'question',
  'answering',
  'word2vec',
  'implementing',
  'online',
  'help',
  'desk',
  'based',
  'conversational',
  'agent',
  'http',
  'ahs',
  'annaisd',
  'org',
  'common',
  'pages',
  'galleryphoto',
  'aspx',
  'photoid',
  'width',
  'height',
  'authors',
  'alisa',
  'kongthon',
  'chatchawal',
  'sangkeettrakarn',
  'sarawoot',
  'kongyoung',
  'choochart',
  'haruechaiyasak',
  'published',
  'acm',
  'article',
  'bibliometrics',
  'data',
  'bibliometrics',
  'published',
  'proceeding',
  'medes',
  'proceedings',
  'international',
  'conference',
  'management',
  'emergent',
  'digital',
  'ecosystems',
  'acm',
  'new',
  'york',
  'ny',
  'usa',
  'isbn',
  'doi',
  'https',
  'dx',
  'doi',
  'org',
  '2f1643823',
  'hutchins',
  'history',
  'machine',
  'translation',
  'nutshell',
  'http',
  'ahs',
  'annaisd',
  'org',
  'common',
  'pa',
  'ges',
  'galleryphoto',
  'aspx',
  'photoid',
  'width',
  'height',
  'chomskyan',
  'lingu

  'dx',
  'doi',
  'org',
  '2f1643823',
  'hutchins',
  'history',
  'machine',
  'translation',
  'nutshell',
  'http',
  'ahs',
  'annaisd',
  'org',
  'common',
  'pa',
  'ges',
  'galleryphoto',
  'aspx',
  'photoid',
  'width',
  'height',
  'chomskyan',
  'linguistics',
  'encourages',
  'investigation',
  'corner',
  'cases',
  'stress',
  'limits',
  'theoretical',
  'models',
  'comparable',
  'pathological',
  'phenomena',
  'mathematics',
  'typically',
  'created',
  'thought',
  'experiments',
  'systematic',
  'investigation',
  'typical',
  'phenomena',
  'occur',
  'real',
  'world',
  'data',
  'case',
  'corpus',
  'linguistics',
  'creation',
  'use',
  'corpora',
  'real',
  'world',
  'data',
  'fundamental',
  'machine',
  'learning',
  'algorithms',
  'nlp',
  'addition',
  'theoretical',
  'underpinnings',
  'chomskyan',
  'linguistics',
  'called',
  'poverty',
  'stimulus',
  'argument',
  'entail',
  'general',
  'learning',
  'algorithms',
  'typically',
  

  'data',
  'program',
  'understanding',
  'natural',
  'language',
  'http',
  'hci',
  'stanford',
  'edu',
  'winograd',
  'shrdlu',
  'roger',
  'schank',
  'robert',
  'abelson',
  'scripts',
  'plans',
  'goals',
  'understanding',
  'inquiry',
  'human',
  'knowledge',
  'structures',
  'kishorjit',
  'vidya',
  'raj',
  'rk',
  'nirmal',
  'sivaji',
  'manipuri',
  'morpheme',
  'identification',
  'http',
  'aclweb',
  'org',
  'anthology',
  'w12',
  'w12',
  'pdf',
  'proceedings',
  '3rd',
  'workshop',
  'south',
  'southeast',
  'asian',
  'natural',
  'language',
  'processing',
  'sanlp',
  'pages',
  '95\xe2',
  'coling',
  'mumbai',
  'december',
  'yucong',
  'duan',
  'christophe',
  'cruz',
  'formalizing',
  'semantic',
  'natural',
  'language',
  'conceptualization',
  'existence',
  'http',
  'www',
  'ijimt',
  'org',
  'abstract',
  'e00187',
  'htm',
  'international',
  'journal',
  'innovation',
  'management',
  'technology',
  'pp',
  'versatile',
  'qu

  'roger',
  'schank',
  'robert',
  'abelson',
  'scripts',
  'plans',
  'goals',
  'understanding',
  'inquiry',
  'human',
  'knowledge',
  'structures',
  'kishorjit',
  'vidya',
  'raj',
  'rk',
  'nirmal',
  'sivaji',
  'manipuri',
  'morpheme',
  'identification',
  'http',
  'aclweb',
  'org',
  'anthology',
  'w12',
  'w12',
  'pdf',
  'proceedings',
  '3rd',
  'workshop',
  'south',
  'southeast',
  'asian',
  'natural',
  'language',
  'processing',
  'sanlp',
  'pages',
  '95\xe2',
  'coling',
  'mumbai',
  'december',
  'yucong',
  'duan',
  'christophe',
  'cruz',
  'formalizing',
  'semantic',
  'natural',
  'language',
  'conceptualization',
  'existence',
  'http',
  'www',
  'ijimt',
  'org',
  'abstract',
  'e00187',
  'htm',
  'international',
  'journal',
  'innovation',
  'management',
  'technology',
  'pp',
  'versatile',
  'question',
  'answering',
  'systems',
  'seeing',
  'synthesis',
  'https',
  'www',
  'academia',
  'edu',
  'versatile',
  '_question_an

  'log',
  'february',
  'winograd',
  'terry',
  'procedures',
  'representation',
  'data',
  'program',
  'understanding',
  'natural',
  'language',
  'http',
  'hci',
  'stanford',
  'edu',
  'winograd',
  'shrdlu',
  'roger',
  'schank',
  'robert',
  'abelson',
  'scripts',
  'plans',
  'goals',
  'understanding',
  'inquiry',
  'human',
  'knowledge',
  'structures',
  'kishorjit',
  'vidya',
  'raj',
  'rk',
  'nirmal',
  'sivaji',
  'manipuri',
  'morpheme',
  'identification',
  'http',
  'aclweb',
  'org',
  'anthology',
  'w12',
  'w12',
  'pdf',
  'proceedings',
  '3rd',
  'workshop',
  'south',
  'southeast',
  'asian',
  'natural',
  'language',
  'processing',
  'sanlp',
  'pages',
  '95\xe2',
  'coling',
  'mumbai',
  'december',
  'yucong',
  'duan',
  'christophe',
  'cruz',
  'formalizing',
  'semantic',
  'natural',
  'language',
  'conceptualization',
  'existence',
  'http',
  'www',
  'ijimt',
  'org',
  'abstract',
  'e00187',
  'htm',
  'international',
  'jo

  'language',
  'identification',
  'natural',
  'language',
  'programming',
  'references',
  'natural',
  'language',
  'search',
  'query',
  'expansion',
  'reification',
  'linguistics',
  'semantic',
  'folding',
  'speech',
  'processing',
  'spoken',
  'dialogue',
  'text',
  'proofing',
  'text',
  'simplification',
  'thought',
  'vector',
  'truecasing',
  'question',
  'answering',
  'word2vec',
  'implementing',
  'online',
  'help',
  'desk',
  'based',
  'conversational',
  'agent',
  'http',
  'ahs',
  'annaisd',
  'org',
  'common',
  'pages',
  'galleryphoto',
  'aspx',
  'photoid',
  'width',
  'height',
  'authors',
  'alisa',
  'kongthon',
  'chatchawal',
  'sangkeettrakarn',
  'sarawoot',
  'kongyoung',
  'choochart',
  'haruechaiyasak',
  'published',
  'acm',
  'article',
  'bibliometrics',
  'data',
  'bibliometrics',
  'published',
  'proceeding',
  'medes',
  'proceedings',
  'international',
  'conference',
  'management',
  'emergent',
  'digital',
  'ecos

  'connected',
  'text',
  'nature',
  'discourse',
  'relationships',
  'sentences',
  'elaboration',
  'explanation',
  'contrast',
  'possible',
  'task',
  'recognizing',
  'classifying',
  'speech',
  'acts',
  'chunk',
  'text',
  'yes',
  'question',
  'content',
  'question',
  'statement',
  'assertion',
  'discourse',
  'analysis',
  'speech',
  'speech',
  'recognition',
  'given',
  'sound',
  'clip',
  'person',
  'people',
  'speaking',
  'determine',
  'textual',
  'representation',
  'speech',
  'opposite',
  'text',
  'speech',
  'extremely',
  'difficult',
  'problems',
  'colloquially',
  'termed',
  'ai',
  'complete',
  'natural',
  'speech',
  'hardly',
  'pauses',
  'successive',
  'words',
  'speech',
  'segmentation',
  'necessary',
  'subtask',
  'speech',
  'recognition',
  'note',
  'spoken',
  'languages',
  'sounds',
  'representing',
  'successive',
  'letters',
  'blend',
  'process',
  'termed',
  'coarticulation',
  'conversion',
  'analog',
  'signal'

  'structures',
  'easier',
  'programs',
  'manipulate',
  'natural',
  'language',
  'understanding',
  'involves',
  'identification',
  'intended',
  'semantic',
  'multiple',
  'possible',
  'semantics',
  'derived',
  'natural',
  'language',
  'expression',
  'usually',
  'takes',
  'form',
  'organized',
  'notations',
  'natural',
  'languages',
  'concepts',
  'introduction',
  'creation',
  'language',
  'metamodel',
  'ontology',
  'efficient',
  'empirical',
  'solutions',
  'explicit',
  'formalization',
  'natural',
  'languages',
  'semantics',
  'confusions',
  'implicit',
  'assumptions',
  'closed',
  'world',
  'assumption',
  'cwa',
  'vs',
  'open',
  'world',
  'assumption',
  'subjective',
  'yes',
  'vs',
  'objective',
  'true',
  'false',
  'expected',
  'construction',
  'basis',
  'semantics',
  'formalization',
  'optical',
  'character',
  'recognition',
  'ocr',
  'given',
  'image',
  'representing',
  'printed',
  'text',
  'determine',
  'correspondin

  'rules',
  'soft',
  'decisions\xe2',
  'extremely',
  'difficult',
  'error',
  'prone',
  'time',
  'consuming',
  'systems',
  'based',
  'automatically',
  'learning',
  'rules',
  'accurate',
  'simply',
  'supplying',
  'input',
  'data',
  'systems',
  'based',
  'hand',
  'written',
  'rules',
  'accurate',
  'increasing',
  'complexity',
  'rules',
  'difficult',
  'task',
  'particular',
  'limit',
  'complexity',
  'systems',
  'based',
  'hand',
  'crafted',
  'rules',
  'systems',
  'unmanageable',
  'creating',
  'data',
  'input',
  'machine',
  'learning',
  'systems',
  'simply',
  'requires',
  'corresponding',
  'increase',
  'number',
  'man',
  'hours',
  'worked',
  'generally',
  'significant',
  'increases',
  'complexity',
  'annotation',
  'process',
  'major',
  'evaluations',
  'tasks',
  'following',
  'list',
  'commonly',
  'researched',
  'tasks',
  'nlp',
  'note',
  'tasks',
  'direct',
  'real',
  'world',
  'applications',
  'commonly',
  'serve',


  'specific',
  'example',
  'task',
  'specifically',
  'concerned',
  'matching',
  'pronouns',
  'nouns',
  'names',
  'refer',
  'general',
  'task',
  'coreference',
  'resolution',
  'includes',
  'identifying',
  'called',
  'bridging',
  'relationships',
  'involving',
  'referring',
  'expressions',
  'example',
  'sentence',
  'entered',
  'john',
  'house',
  'door',
  'door',
  'referring',
  'expression',
  'bridging',
  'relationship',
  'identified',
  'fact',
  'door',
  'referred',
  'door',
  'john',
  'house',
  'structure',
  'referred',
  'rubric',
  'includes',
  'number',
  'related',
  'tasks',
  'task',
  'identifying',
  'discourse',
  'structure',
  'connected',
  'text',
  'nature',
  'discourse',
  'relationships',
  'sentences',
  'elaboration',
  'explanation',
  'contrast',
  'possible',
  'task',
  'recognizing',
  'classifying',
  'speech',
  'acts',
  'chunk',
  'text',
  'yes',
  'question',
  'content',
  'question',
  'statement',
  'assertion',
  

  'semi',
  'supervised',
  'learning',
  'algorithms',
  'algorithms',
  'able',
  'learn',
  'data',
  'hand',
  'annotated',
  'desired',
  'answers',
  'combination',
  'annotated',
  'non',
  'annotated',
  'data',
  'generally',
  'task',
  'difficult',
  'supervised',
  'learning',
  'typically',
  'produces',
  'accurate',
  'results',
  'given',
  'input',
  'data',
  'enormous',
  'non',
  'annotated',
  'data',
  'available',
  'including',
  'things',
  'entire',
  'content',
  'world',
  'wide',
  'web',
  'inferior',
  'results',
  'recent',
  'years',
  'flurry',
  'results',
  'showing',
  'deep',
  'learning',
  'techniques',
  'achieving',
  'state',
  'art',
  'results',
  'natural',
  'language',
  'tasks',
  'example',
  'language',
  'modeling',
  'parsing',
  'statistical',
  'natural',
  'language',
  'processing',
  'called',
  'statistical',
  'revolution',
  'late',
  '1980s',
  'mid',
  '1990s',
  'natural',
  'language',
  'processing',
  'research',
  'rel

  'robust',
  'natural',
  'language',
  'variation',
  'machine',
  'learning',
  'paradigm',
  'calls',
  'instead',
  'statistical',
  'inference',
  'automatically',
  'learn',
  'rules',
  'analysis',
  'large',
  'corpora',
  'typical',
  'real',
  'world',
  'examples',
  'corpus',
  'plural',
  'corpora',
  'set',
  'documents',
  'possibly',
  'human',
  'annotations',
  'different',
  'classes',
  'machine',
  'learning',
  'algorithms',
  'applied',
  'nlp',
  'tasks',
  'algorithms',
  'input',
  'large',
  'set',
  'features',
  'generated',
  'input',
  'data',
  'earliest',
  'algorithms',
  'decision',
  'trees',
  'produced',
  'systems',
  'hard',
  'rules',
  'similar',
  'systems',
  'hand',
  'written',
  'rules',
  'common',
  'increasingly',
  'research',
  'focused',
  'statistical',
  'models',
  'soft',
  'probabilistic',
  'decisions',
  'based',
  'attaching',
  'real',
  'valued',
  'weights',
  'input',
  'feature',
  'models',
  'advantage',
  'express',


  '1980s',
  'mid',
  '1990s',
  'natural',
  'language',
  'processing',
  'research',
  'relied',
  'heavily',
  'machine',
  'learning',
  'language',
  'processing',
  'tasks',
  'typically',
  'involved',
  'direct',
  'hand',
  'coding',
  'rules',
  'general',
  'robust',
  'natural',
  'language',
  'variation',
  'machine',
  'learning',
  'paradigm',
  'calls',
  'instead',
  'statistical',
  'inference',
  'automatically',
  'learn',
  'rules',
  'analysis',
  'large',
  'corpora',
  'typical',
  'real',
  'world',
  'examples',
  'corpus',
  'plural',
  'corpora',
  'set',
  'documents',
  'possibly',
  'human',
  'annotations',
  'different',
  'classes',
  'machine',
  'learning',
  'algorithms',
  'applied',
  'nlp',
  'tasks',
  'algorithms',
  'input',
  'large',
  'set',
  'features',
  'generated',
  'input',
  'data',
  'earliest',
  'algorithms',
  'decision',
  'trees',
  'produced',
  'systems',
  'hard',
  'rules',
  'similar',
  'systems',
  'hand',
  'written'

  'deep',
  'learning',
  'techniques',
  'achieving',
  'state',
  'art',
  'results',
  'natural',
  'language',
  'tasks',
  'example',
  'language',
  'modeling',
  'parsing',
  'statistical',
  'natural',
  'language',
  'processing',
  'called',
  'statistical',
  'revolution',
  'late',
  '1980s',
  'mid',
  '1990s',
  'natural',
  'language',
  'processing',
  'research',
  'relied',
  'heavily',
  'machine',
  'learning',
  'language',
  'processing',
  'tasks',
  'typically',
  'involved',
  'direct',
  'hand',
  'coding',
  'rules',
  'general',
  'robust',
  'natural',
  'language',
  'variation',
  'machine',
  'learning',
  'paradigm',
  'calls',
  'instead',
  'statistical',
  'inference',
  'automatically',
  'learn',
  'rules',
  'analysis',
  'large',
  'corpora',
  'typical',
  'real',
  'world',
  'examples',
  'corpus',
  'plural',
  'corpora',
  'set',
  'documents',
  'possibly',
  'human',
  'annotations',
  'different',
  'classes',
  'machine',
  'learning',
 

  'rules',
  'learning',
  'procedures',
  'machine',
  'learning',
  'automatically',
  'focus',
  'common',
  'cases',
  'writing',
  'rules',
  'hand',
  'obvious',
  'effort',
  'directed',
  'automatic',
  'learning',
  'procedures',
  'use',
  'statistical',
  'inference',
  'algorithms',
  'produce',
  'models',
  'robust',
  'unfamiliar',
  'input',
  'containing',
  'words',
  'structures',
  'seen',
  'erroneous',
  'input',
  'misspelled',
  'words',
  'words',
  'accidentally',
  'omitted',
  'generally',
  'handling',
  'input',
  'gracefully',
  'hand',
  'written',
  'rules\xe2',
  'generally',
  'creating',
  'systems',
  'hand',
  'written',
  'rules',
  'soft',
  'decisions\xe2',
  'extremely',
  'difficult',
  'error',
  'prone',
  'time',
  'consuming',
  'systems',
  'based',
  'automatically',
  'learning',
  'rules',
  'accurate',
  'simply',
  'supplying',
  'input',
  'data',
  'systems',
  'based',
  'hand',
  'written',
  'rules',
  'accurate',
  'increasing'

  'analysis',
  'large',
  'corpora',
  'typical',
  'real',
  'world',
  'examples',
  'corpus',
  'plural',
  'corpora',
  'set',
  'documents',
  'possibly',
  'human',
  'annotations',
  'different',
  'classes',
  'machine',
  'learning',
  'algorithms',
  'applied',
  'nlp',
  'tasks',
  'algorithms',
  'input',
  'large',
  'set',
  'features',
  'generated',
  'input',
  'data',
  'earliest',
  'algorithms',
  'decision',
  'trees',
  'produced',
  'systems',
  'hard',
  'rules',
  'similar',
  'systems',
  'hand',
  'written',
  'rules',
  'common',
  'increasingly',
  'research',
  'focused',
  'statistical',
  'models',
  'soft',
  'probabilistic',
  'decisions',
  'based',
  'attaching',
  'real',
  'valued',
  'weights',
  'input',
  'feature',
  'models',
  'advantage',
  'express',
  'relative',
  'certainty',
  'different',
  'possible',
  'answers',
  'producing',
  'reliable',
  'results',
  'model',
  'included',
  'component',
  'larger',
  'systems',
  'based',
  '

  'results',
  'model',
  'included',
  'component',
  'larger',
  'systems',
  'based',
  'machine',
  'learning',
  'algorithms',
  'advantages',
  'hand',
  'produced',
  'rules',
  'learning',
  'procedures',
  'machine',
  'learning',
  'automatically',
  'focus',
  'common',
  'cases',
  'writing',
  'rules',
  'hand',
  'obvious',
  'effort',
  'directed',
  'automatic',
  'learning',
  'procedures',
  'use',
  'statistical',
  'inference',
  'algorithms',
  'produce',
  'models',
  'robust',
  'unfamiliar',
  'input',
  'containing',
  'words',
  'structures',
  'seen',
  'erroneous',
  'input',
  'misspelled',
  'words',
  'words',
  'accidentally',
  'omitted',
  'generally',
  'handling',
  'input',
  'gracefully',
  'hand',
  'written',
  'rules\xe2',
  'generally',
  'creating',
  'systems',
  'hand',
  'written',
  'rules',
  'soft',
  'decisions\xe2',
  'extremely',
  'difficult',
  'error',
  'prone',
  'time',
  'consuming',
  'systems',
  'based',
  'automatically',
 

  'decisions',
  'based',
  'attaching',
  'real',
  'valued',
  'weights',
  'input',
  'feature',
  'models',
  'advantage',
  'express',
  'relative',
  'certainty',
  'different',
  'possible',
  'answers',
  'producing',
  'reliable',
  'results',
  'model',
  'included',
  'component',
  'larger',
  'systems',
  'based',
  'machine',
  'learning',
  'algorithms',
  'advantages',
  'hand',
  'produced',
  'rules',
  'learning',
  'procedures',
  'machine',
  'learning',
  'automatically',
  'focus',
  'common',
  'cases',
  'writing',
  'rules',
  'hand',
  'obvious',
  'effort',
  'directed',
  'automatic',
  'learning',
  'procedures',
  'use',
  'statistical',
  'inference',
  'algorithms',
  'produce',
  'models',
  'robust',
  'unfamiliar',
  'input',
  'containing',
  'words',
  'structures',
  'seen',
  'erroneous',
  'input',
  'misspelled',
  'words',
  'words',
  'accidentally',
  'omitted',
  'generally',
  'handling',
  'input',
  'gracefully',
  'hand',
  'written',
 

  'revolution',
  'late',
  '1980s',
  'mid',
  '1990s',
  'natural',
  'language',
  'processing',
  'research',
  'relied',
  'heavily',
  'machine',
  'learning',
  'language',
  'processing',
  'tasks',
  'typically',
  'involved',
  'direct',
  'hand',
  'coding',
  'rules',
  'general',
  'robust',
  'natural',
  'language',
  'variation',
  'machine',
  'learning',
  'paradigm',
  'calls',
  'instead',
  'statistical',
  'inference',
  'automatically',
  'learn',
  'rules',
  'analysis',
  'large',
  'corpora',
  'typical',
  'real',
  'world',
  'examples',
  'corpus',
  'plural',
  'corpora',
  'set',
  'documents',
  'possibly',
  'human',
  'annotations',
  'different',
  'classes',
  'machine',
  'learning',
  'algorithms',
  'applied',
  'nlp',
  'tasks',
  'algorithms',
  'input',
  'large',
  'set',
  'features',
  'generated',
  'input',
  'data',
  'earliest',
  'algorithms',
  'decision',
  'trees',
  'produced',
  'systems',
  'hard',
  'rules',
  'similar',
  'syste

  'corpus',
  'linguistics',
  'underlies',
  'machine',
  'learning',
  'approach',
  'language',
  'processing',
  'earliest',
  'machine',
  'learning',
  'algorithms',
  'decision',
  'trees',
  'produced',
  'systems',
  'hard',
  'rules',
  'similar',
  'existing',
  'hand',
  'written',
  'rules',
  'speech',
  'tagging',
  'introduced',
  'use',
  'hidden',
  'markov',
  'models',
  'nlp',
  'increasingly',
  'research',
  'focused',
  'statistical',
  'models',
  'soft',
  'probabilistic',
  'decisions',
  'based',
  'attaching',
  'real',
  'valued',
  'weights',
  'features',
  'making',
  'input',
  'data',
  'cache',
  'language',
  'models',
  'speech',
  'recognition',
  'systems',
  'rely',
  'examples',
  'statistical',
  'models',
  'models',
  'generally',
  'robust',
  'given',
  'unfamiliar',
  'input',
  'especially',
  'input',
  'contains',
  'errors',
  'common',
  'real',
  'world',
  'data',
  'produce',
  'reliable',
  'results',
  'integrated',
  'larger',


  'data',
  'generally',
  'task',
  'difficult',
  'supervised',
  'learning',
  'typically',
  'produces',
  'accurate',
  'results',
  'given',
  'input',
  'data',
  'enormous',
  'non',
  'annotated',
  'data',
  'available',
  'including',
  'things',
  'entire',
  'content',
  'world',
  'wide',
  'web',
  'inferior',
  'results',
  'recent',
  'years',
  'flurry',
  'results',
  'showing',
  'deep',
  'learning',
  'techniques',
  'achieving',
  'state',
  'art',
  'results',
  'natural',
  'language',
  'tasks',
  'example',
  'language',
  'modeling',
  'parsing',
  'statistical',
  'natural',
  'language',
  'processing',
  'called',
  'statistical',
  'revolution',
  'late',
  '1980s',
  'mid',
  '1990s',
  'natural',
  'language',
  'processing',
  'research',
  'relied',
  'heavily',
  'machine',
  'learning',
  'language',
  'processing',
  'tasks',
  'typically',
  'involved',
  'direct',
  'hand',
  'coding',
  'rules',
  'general',
  'robust',
  'natural',
  'language

  'flurry',
  'results',
  'showing',
  'deep',
  'learning',
  'techniques',
  'achieving',
  'state',
  'art',
  'results',
  'natural',
  'language',
  'tasks',
  'example',
  'language',
  'modeling',
  'parsing',
  'statistical',
  'natural',
  'language',
  'processing',
  'called',
  'statistical',
  'revolution',
  'late',
  '1980s',
  'mid',
  '1990s',
  'natural',
  'language',
  'processing',
  'research',
  'relied',
  'heavily',
  'machine',
  'learning',
  'language',
  'processing',
  'tasks',
  'typically',
  'involved',
  'direct',
  'hand',
  'coding',
  'rules',
  'general',
  'robust',
  'natural',
  'language',
  'variation',
  'machine',
  'learning',
  'paradigm',
  'calls',
  'instead',
  'statistical',
  'inference',
  'automatically',
  'learn',
  'rules',
  'analysis',
  'large',
  'corpora',
  'typical',
  'real',
  'world',
  'examples',
  'corpus',
  'plural',
  'corpora',
  'set',
  'documents',
  'possibly',
  'human',
  'annotations',
  'different',
  '

  'relative',
  'certainty',
  'different',
  'possible',
  'answers',
  'producing',
  'reliable',
  'results',
  'model',
  'included',
  'component',
  'larger',
  'systems',
  'based',
  'machine',
  'learning',
  'algorithms',
  'advantages',
  'hand',
  'produced',
  'rules',
  'learning',
  'procedures',
  'machine',
  'learning',
  'automatically',
  'focus',
  'common',
  'cases',
  'writing',
  'rules',
  'hand',
  'obvious',
  'effort',
  'directed',
  'automatic',
  'learning',
  'procedures',
  'use',
  'statistical',
  'inference',
  'algorithms',
  'produce',
  'models',
  'robust',
  'unfamiliar',
  'input',
  'containing',
  'words',
  'structures',
  'seen',
  'erroneous',
  'input',
  'misspelled',
  'words',
  'words',
  'accidentally',
  'omitted',
  'generally',
  'handling',
  'input',
  'gracefully',
  'hand',
  'written',
  'rules\xe2',
  'generally',
  'creating',
  'systems',
  'hand',
  'written',
  'rules',
  'soft',
  'decisions\xe2',
  'extremely',
  'dif

  'hand',
  'written',
  'rules',
  'soft',
  'decisions\xe2',
  'extremely',
  'difficult',
  'error',
  'prone',
  'time',
  'consuming',
  'systems',
  'based',
  'automatically',
  'learning',
  'rules',
  'accurate',
  'simply',
  'supplying',
  'input',
  'data',
  'systems',
  'based',
  'hand',
  'written',
  'rules',
  'accurate',
  'increasing',
  'complexity',
  'rules',
  'difficult',
  'task',
  'particular',
  'limit',
  'complexity',
  'systems',
  'based',
  'hand',
  'crafted',
  'rules',
  'systems',
  'unmanageable',
  'creating',
  'data',
  'input',
  'machine',
  'learning',
  'systems',
  'simply',
  'requires',
  'corresponding',
  'increase',
  'number',
  'man',
  'hours',
  'worked',
  'generally',
  'significant',
  'increases',
  'complexity',
  'annotation',
  'process',
  'major',
  'evaluations',
  'tasks',
  'following',
  'list',
  'commonly',
  'researched',
  'tasks',
  'nlp',
  'note',
  'tasks',
  'direct',
  'real',
  'world',
  'applications',
  

  'example',
  'language',
  'modeling',
  'parsing',
  'statistical',
  'natural',
  'language',
  'processing',
  'called',
  'statistical',
  'revolution',
  'late',
  '1980s',
  'mid',
  '1990s',
  'natural',
  'language',
  'processing',
  'research',
  'relied',
  'heavily',
  'machine',
  'learning',
  'language',
  'processing',
  'tasks',
  'typically',
  'involved',
  'direct',
  'hand',
  'coding',
  'rules',
  'general',
  'robust',
  'natural',
  'language',
  'variation',
  'machine',
  'learning',
  'paradigm',
  'calls',
  'instead',
  'statistical',
  'inference',
  'automatically',
  'learn',
  'rules',
  'analysis',
  'large',
  'corpora',
  'typical',
  'real',
  'world',
  'examples',
  'corpus',
  'plural',
  'corpora',
  'set',
  'documents',
  'possibly',
  'human',
  'annotations',
  'different',
  'classes',
  'machine',
  'learning',
  'algorithms',
  'applied',
  'nlp',
  'tasks',
  'algorithms',
  'input',
  'large',
  'set',
  'features',
  'generated',
  

  'examples',
  'statistical',
  'models',
  'models',
  'generally',
  'robust',
  'given',
  'unfamiliar',
  'input',
  'especially',
  'input',
  'contains',
  'errors',
  'common',
  'real',
  'world',
  'data',
  'produce',
  'reliable',
  'results',
  'integrated',
  'larger',
  'comprising',
  'multiple',
  'subtasks',
  'notable',
  'early',
  'successes',
  'occurred',
  'field',
  'machine',
  'translation',
  'especially',
  'work',
  'ibm',
  'research',
  'successively',
  'complicated',
  'statistical',
  'models',
  'developed',
  'systems',
  'able',
  'advantage',
  'existing',
  'multilingual',
  'textual',
  'corpora',
  'produced',
  'parliament',
  'canada',
  'european',
  'union',
  'result',
  'laws',
  'calling',
  'translation',
  'governmental',
  'proceedings',
  'official',
  'languages',
  'corresponding',
  'systems',
  'government',
  'systems',
  'depended',
  'corpora',
  'specifically',
  'developed',
  'tasks',
  'implemented',
  'systems',
  'contin

  'pdf',
  'mark',
  'johnson',
  'statistical',
  'revolution',
  'changes',
  'computational',
  'linguistics',
  'http',
  'www',
  'aclweb',
  'anthology',
  'w09',
  'proceedings',
  'eacl',
  'workshop',
  'interaction',
  'linguistics',
  'computational',
  'linguistics',
  'philip',
  'resnik',
  'revolutions',
  'http',
  'languagelog',
  'ldc',
  'upenn',
  'edu',
  'nll',
  'language',
  'log',
  'february',
  'winograd',
  'terry',
  'procedures',
  'representation',
  'data',
  'program',
  'understanding',
  'natural',
  'language',
  'http',
  'hci',
  'stanford',
  'edu',
  'winograd',
  'shrdlu',
  'roger',
  'schank',
  'robert',
  'abelson',
  'scripts',
  'plans',
  'goals',
  'understanding',
  'inquiry',
  'human',
  'knowledge',
  'structures',
  'kishorjit',
  'vidya',
  'raj',
  'rk',
  'nirmal',
  'sivaji',
  'manipuri',
  'morpheme',
  'identification',
  'http',
  'aclweb',
  'org',
  'anthology',
  'w12',
  'w12',
  'pdf',
  'proceedings',
  '3rd',
  'works

KeyboardInterrupt: 

## 3. Create dictionary and corpus

In [None]:
from gensim import corpora

dictionary = corpora.Dictionary(tokens)
dictionary.save('myDict.dict')
#print len(dictionary)

In [6]:
corpus = [dictionary.doc2bow(token) for token in tokens]
corpora.MmCorpus.serialize('nlp1Corpus.mm', corpus)

## 4. Train models with different transformations

### 4.1 TF-IDF

In [68]:
from gensim import corpora, models

## tf-idf model
tfidf = models.TfidfModel(corpus)
tfidf.save('nlp1_tfidf.model')

##transform
corpus_tfidf = tfidf[corpus]

for vec in corpus_tfidf:
    print(vec)

[(0, 0.6850351176651314), (1, 0.5593597196830157), (2, 0.4667371761084153)]
[(3, 0.5773502691896257), (4, 0.5773502691896257), (5, 0.5773502691896257)]
[(0, 0.2946362866607063), (1, 0.2405828058519198), (2, 0.20074548715663215), (6, 0.3731450657015619), (7, 0.5434492111187911), (8, 0.6167952142927139)]
[(9, 0.45056360343428703), (10, 0.4797396660428377), (11, 0.4094422527747688), (12, 0.5208610167023559), (13, 0.3576354778373191)]
[(1, 0.20963023321081714), (14, 0.47353086793404225), (15, 0.3009469707478206), (16, 0.47353086793404225), (17, 0.3580236214837794), (18, 0.5374404216362902)]
[(10, 0.37968127750049036), (16, 0.41222602636342337), (19, 0.35659041070804204), (20, 0.4678616420188047), (21, 0.41222602636342337), (22, 0.41222602636342337)]
[(0, 0.2591110997871113), (1, 0.42315002079798864), (2, 0.3530819950521141), (23, 0.5424263526062352), (24, 0.44019238170242886), (25, 0.3756899276334588)]
[(1, 0.4381126530404304), (2, 0.3655670140376491), (26, 0.5616066093325625), (27, 0.4557

### 4.2 LSI

In [65]:
lsi= models.LsiModel(corpus=corpus, id2word=dictionary, num_topics=20)
lsi.save('nlp1_lsi.model')

## transform
#corpus_lsi = lsi[corpus]
corpus_lsi = lsi[corpus_tfidf]

lsi.print_topics(20)

[(0,
  u'0.733*"language" + 0.561*"natural" + 0.239*"processing" + 0.085*"understanding" + 0.079*"statistical" + 0.072*"models" + 0.071*"learning" + 0.065*"systems" + 0.064*"machine" + 0.048*"words"'),
 (1,
  u'-0.498*"systems" + -0.370*"rules" + -0.236*"input" + -0.210*"handwritten" + -0.206*"text" + -0.203*"given" + -0.197*"data" + -0.185*"however" + -0.166*"learning" + -0.166*"based"'),
 (2,
  u'0.511*"text" + 0.392*"given" + -0.279*"systems" + 0.244*"words" + 0.233*"determine" + -0.215*"rules" + 0.200*"sentence" + 0.196*"chunk" + 0.151*"speech" + 0.127*"separate"'),
 (3,
  u'-0.551*"learning" + -0.404*"machine" + -0.344*"algorithms" + 0.252*"systems" + 0.239*"rules" + -0.150*"research" + 0.144*"handwritten" + 0.127*"natural" + -0.125*"translation" + -0.124*"nlp"'),
 (4,
  u'0.376*"text" + -0.374*"input" + -0.349*"data" + -0.335*"words" + -0.278*"speech" + 0.167*"machine" + 0.157*"rules" + 0.136*"systems" + 0.122*"learning" + -0.113*"models"'),
 (5,
  u'0.329*"words" + -0.329*"data"

### 4.3 LDA

In [11]:
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=20, eval_every=None)
lda.save('nlp1_lda.model')

## transform
corpus_lda = lda[corpus]

lda.print_topics(20)


## check coherence
'''
average topic coherence = (sum of topic coherences of all topics)/ (number of topics)
top_topics = lda.top_topics(corpus)
print top_topics[1]
avg_topic_coherence = sum([t[1] for t in top_topics]) / 10 
print('average topic coherence: %.4f.' % avg_topic_coherence)
'''

[(0,
  u'0.089*"language" + 0.056*"natural" + 0.031*"processing" + 0.012*"learning" + 0.012*"statistical" + 0.010*"machine" + 0.007*"languages" + 0.007*"introduction" + 0.007*"ambiguity" + 0.007*"answering"'),
 (1,
  u'0.017*"data" + 0.011*"words" + 0.011*"learning" + 0.010*"language" + 0.007*"linguistics" + 0.007*"english" + 0.007*"different" + 0.007*"written" + 0.007*"realworld" + 0.007*"separate"'),
 (2,
  u'0.018*"text" + 0.018*"discourse" + 0.011*"names" + 0.011*"relationships" + 0.011*"major" + 0.011*"tasks" + 0.008*"task" + 0.008*"eg" + 0.008*"input" + 0.008*"identifying"'),
 (3,
  u'0.023*"text" + 0.017*"given" + 0.017*"determine" + 0.010*"larger" + 0.008*"learning" + 0.007*"speech" + 0.007*"press" + 0.007*"research" + 0.007*"see" + 0.007*"people"'),
 (4,
  u'0.028*"systems" + 0.019*"rules" + 0.015*"tasks" + 0.015*"however" + 0.013*"machine" + 0.011*"language" + 0.011*"models" + 0.011*"based" + 0.011*"handwritten" + 0.011*"input"'),
 (5,
  u'0.020*"information" + 0.020*"systems

### 4.4 HDP

In [70]:
hdp = models.HdpModel(corpus, id2word=dictionary, T=20) # T here indicates that HDP should find no more than 20 topics
hdp.save('nlp1_hdp.model')

## transform
corpus_hdp = hdp[corpus]
hdp.print_topics()

[(0,
  u'0.007*case + 0.006*corpora + 0.006*notations + 0.005*optical + 0.005*attaching + 0.005*closedworld + 0.005*free + 0.005*test + 0.004*morphological + 0.004*chinese'),
 (1,
  u'0.008*marking + 0.006*rte7 + 0.006*particular + 0.005*emotion + 0.005*machinery + 0.005*ny + 0.005*segmentation + 0.005*signal + 0.004*jabberwacky + 0.004*capital'),
 (2,
  u'0.007*vinyals + 0.006*flurry + 0.006*nonprofit + 0.005*hidden + 0.005*august + 0.005*existing + 0.005*number + 0.005*identified + 0.004*georgetown + 0.004*expansion'),
 (3,
  u'0.006*encourages + 0.006*philip + 0.005*humans + 0.005*plans + 0.005*yonghui + 0.005*resource + 0.005*included + 0.005*real + 0.004*annotation + 0.004*number'),
 (4,
  u'0.008*stochastic + 0.006*formalization14 + 0.005*convenience + 0.005*raj + 0.005*human + 0.005*examples + 0.005*soft + 0.004*increasingly + 0.004*mathematics + 0.004*verb'),
 (5,
  u'0.006*innovation + 0.005*robert + 0.005*accidentally + 0.005*pp + 0.005*why + 0.005*all + 0.005*false16 + 0.005

## 5. Model visualization
### 5.1 Visualize LDA model

In [14]:
import pyLDAvis.gensim
import pyLDAvis

vis_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis_data)

### 5.2 visualize HDP model

In [71]:
vis_data2 = pyLDAvis.gensim.prepare(hdp, corpus, dictionary)
pyLDAvis.display(vis_data2)


## 6. Topic coherence

In [92]:
## calculate coherence value
from gensim.models import coherencemodel

## get the topics
topics_lsi = lsi.show_topics(formatted=False)
topics_lda = lda.show_topics(formatted=False)
topics_hdp = hdp.show_topics(formatted=False)

topics_lsi = [[word for word, prob in topic] for topicid, topic in topics_lsi]
topics_hdp = [[word for word, prob in topic] for topicid, topic in topics_hdp]
topics_lda = [[word for word, prob in topic] for topicid, topic in topics_lda]

In [105]:
## options for 'coherence' parameter: u_mass, c_v, c_uci, c_npmi
cm_hdp = coherencemodel.CoherenceModel(topics=topics_hdp, texts=tokens, corpus=corpus, 
                                       dictionary=dictionary, window_size=10, coherence='u_mass')
print 'coherence_hdp:', cm_hdp.get_coherence()

cm_lsi = coherencemodel.CoherenceModel(topics=topics_lsi, texts=tokens, corpus=corpus, 
                                       dictionary=dictionary, window_size=10, coherence='u_mass')
print 'coherence_lsi:', cm_lsi.get_coherence()

cm_lda = coherencemodel.CoherenceModel(topics=topics_lda, texts=tokens, corpus=corpus, 
                                       dictionary=dictionary, window_size=10, coherence='u_mass')
print 'coherence_lda:', cm_lda.get_coherence()


coherence_hdp: -21.812804479
coherence_lsi: -14.3808312729
coherence_lda: -15.2652151782


## 7. Summarization

In [82]:
## summarization
from gensim.summarization import summarize
from gensim.summarization import keywords

document = open("nlp1.txt").read()
pprint (summarize(document, split=True, ratio=0.01))
print (keywords(document, ratio=0.01))

['Formerly, many language-processing tasks typically involved the direct hand coding of rules,[11][12] which is',
 'Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects',
 'http://www.aclweb.org/website/old_anthology/D/D16/D16-1257.pdf Parsing as Language Modeling',
 'Retrieved from "https://en.wikipedia.org/w/index.php?title=Natural_language_processing&oldid=795399343"']
languages
natural language processing
http
https
process
words
word
text
nature
