# Text Analysis
Let's perform some basic text analysis tasks such as accessing data, tokenizing a corpus, and computing token frequencies, on a Wikipedia article and on the NLTK Reuters corpus.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy as sp
import requests
import json
import string

import html2text

from sklearn.metrics import accuracy_score
from sklearn.utils import check_random_state

from helper import *

We'll use the [Wikipedia article on Text Mining](https://en.wikipedia.org/wiki/Text_mining) as a sample text.

In [2]:
h = html2text.HTML2Text()
h.ignore_links = True

resp = requests.get('https://en.wikipedia.org/wiki/Text_mining')
text = h.handle(resp.text)

print(text)

# Text mining

From Wikipedia, the free encyclopedia

Jump to: navigation, search

**Text mining**, also referred to as _text data mining_, roughly equivalent to **text analytics**, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarizatio

## Tokenize

- Tokenize the text string `text` with `get_words()`.
  We clean up the list of tokens by removing all punctuation tokens
  and keeping only tokens with one or more alphanumeric characters.

In [3]:
syllabus_words = get_words(text)
print(syllabus_words[:5], '...', syllabus_words[-5:])

['text', 'mining', 'from', 'wikipedia', 'the'] ... ['mediawiki', 'static', 'images', 'poweredby_mediawiki_88x31', 'png']
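The helper `get_words()` lives in `helper.py` and is not shown here. As a rough sketch of the cleanup described above, the following hypothetical `get_words_sketch()` lowercases the text and keeps only tokens that contain at least one alphanumeric character. The regular-expression tokenizer is an assumption; the real helper may use a different tokenizer.

```python
import re


def get_words_sketch(text):
    """Hypothetical stand-in for get_words(): lowercase, split, and filter."""
    # \w+ matches runs of letters, digits, and underscores, so a token made
    # only of punctuation such as "," or "**" never appears in the result.
    tokens = re.findall(r"\w+", text.lower())
    # Drop tokens made only of underscores: they have no alphanumeric character.
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


sample = "Text mining, also called *text data mining*, is fun!"
print(get_words_sketch(sample))
```

Because `\w` also matches underscores, a token such as `poweredby_mediawiki_88x31` stays in one piece under this sketch.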


## Lexical Diversity

- Function `count()` computes the number of distinct tokens, the total number of words, and the lexical diversity (here, total words divided by distinct tokens).

In [4]:
num_tokens, num_words, lex_div = count(syllabus_words)
print("Article has {0} tokens and {1} words for a lexical diversity of {2:4.3f}"
      "".format(num_tokens, num_words, lex_div))

Article has 1483 tokens and 4178 words for a lexical diversity of 2.817
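The printed numbers are consistent with "tokens" meaning distinct tokens, "words" meaning all tokens, and lexical diversity being their ratio (4178 / 1483 ≈ 2.817). A minimal sketch under that assumption is shown below; `count_sketch()` is a hypothetical name, not the helper itself.

```python
def count_sketch(words):
    """Return (distinct tokens, total words, total / distinct)."""
    if not words:
        # Avoid dividing by zero on an empty token list.
        return 0, 0, 0.0
    num_tokens = len(set(words))  # each distinct token counted once
    num_words = len(words)        # every occurrence counted
    lex_div = num_words / num_tokens
    return num_tokens, num_words, lex_div


print(count_sketch(["text", "mining", "text", "analysis"]))
```

Under this definition a larger value means more repetition: each distinct token is used that many times on average. The type-token ratio used elsewhere is the inverse (distinct divided by total).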


## Most common occurrences

- Function `get_most_common()` computes the most commonly occurring terms and their counts.

In [5]:
syllabus_most_common = get_most_common(syllabus_words, 10)

print('{0:12s}: {1}'.format('Term', 'Count'))
print(20*'-')

for token, freq in syllabus_most_common:
    print('{0:12s}: {1:4d}'.format(token, freq))

Term        : Count
--------------------
the         :  136
of          :  128
and         :  113
text        :  102
in          :   84
mining      :   76
a           :   75
to          :   74
analysis    :   45
for         :   42
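One straightforward way to get such a frequency table is `collections.Counter` from the standard library. The sketch below assumes the helper returns a list of `(token, count)` pairs, which matches how the loop above unpacks its result; `get_most_common_sketch()` is a hypothetical name.

```python
from collections import Counter


def get_most_common_sketch(words, ntop):
    """Return the ntop most frequent tokens as (token, count) pairs."""
    # Counter tallies every token; most_common sorts by descending count.
    return Counter(words).most_common(ntop)


words = "the cat and the hat and the bat".split()
for token, freq in get_most_common_sketch(words, 2):
    print('{0:12s}: {1:4d}'.format(token, freq))
```

As in the table above, the top of such a list is usually dominated by function words like "the" and "of", which is the motivation for removing stop words later on.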


## Hapax

- Function `find_hapaxes()` finds all hapaxes in a list of words.
- A hapax is a word that occurs only once. Since we are looking at Wikipedia, we expect different languages to appear.

In [6]:
syllabus_hapaxes = find_hapaxes(syllabus_words)
print(sorted(syllabus_hapaxes)[-10:])

['yet', 'you', 'čeština', 'русский', 'українська', 'العربية', 'فارسی', 'ไทย', '中文', '日本語']
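Finding hapaxes reduces to counting tokens and keeping those with a count of exactly one. The following sketch shows one way to do that; `find_hapaxes_sketch()` is a hypothetical name and the real helper may differ.

```python
from collections import Counter


def find_hapaxes_sketch(words):
    """Return every token that occurs exactly once in the list."""
    counts = Counter(words)
    return [word for word, count in counts.items() if count == 1]


words = ["text", "mining", "text", "русский", "中文", "mining", "yet"]
print(sorted(find_hapaxes_sketch(words)))
```

Python sorts strings by Unicode code point, which is why the non-Latin language names end up at the tail of the sorted list.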


## NLTK corpus

Let's take a look at the NLTK Reuters corpus. See the [NLTK docs](http://www.nltk.org/book/ch02.html#reuters-corpus) for more information.

In [7]:
from nltk.corpus import reuters

## Lexical diversity in corpus

- Function `count_corpus()` computes the number of distinct tokens, the total number of words, and the lexical diversity of a corpus.
- It uses the `words()` method of the `reuters` object, whose output includes non-alphanumeric tokens such as punctuation.

In [8]:
num_words, num_tokens, lex_div = count_corpus(reuters)
print("The Reuters corpus has {0} tokens and {1} words for a lexical diversity of {2:4.3f}"
      "".format(num_tokens, num_words, lex_div))

The Reuters corpus has 41600 tokens and 1720901 words for a lexical diversity of 41.368


## Long words

- Function `get_long_words()` searches for all words in a corpus that are longer than a given number of characters.
- We pass in a length of 20 to find all words longer than 20 characters.

In [9]:
long_words = get_long_words(reuters, length=20)
print(long_words)

['discontinuedoperations', 'Beteiligungsgesellschaft', 'Gloeielampenfabrieken', '..........................................', 'Warenhandelsgesellschaft']
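The search itself is a simple length filter over the distinct words. The sketch below works on any iterable of strings; `get_long_words_sketch()` is a hypothetical name, and it sorts its result for reproducibility, which the helper above apparently does not.

```python
def get_long_words_sketch(words, length=20):
    """Return the distinct words that are strictly longer than `length`."""
    # A set comprehension removes duplicates before sorting.
    return sorted({word for word in words if len(word) > length})


toy_words = [
    "said", "Warenhandelsgesellschaft", "oil",
    "Warenhandelsgesellschaft", "......................", "bank",
]
print(get_long_words_sketch(toy_words, length=20))
```

With the real corpus, the iterable would presumably be `reuters.words()`. Since that list includes non-alphanumeric tokens, long runs of dots pass the filter too, as in the output above.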


# Text Classification

Next, we can perform text classification tasks by using the NLTK and scikit-learn libraries. We will continue to examine the Reuters corpus.

## Categories

Before delving into text data mining, let's first explore the Reuters data set.

- Function `get_categories()`, given an NLTK corpus object, returns its categories.

In [10]:
all_categories = get_categories(reuters)
print(all_categories)

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']


## fileids

- Function `get_fileids()` finds all `fileids` of a corpus, in our case Reuters.

In [11]:
fileids = get_fileids(reuters)
print(fileids[:5], '...', fileids[-5:])

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833'] ... ['training/999', 'training/9992', 'training/9993', 'training/9994', 'training/9995']


## List of categories

The corpus contains 10,788 documents (`fileids`) which have been classified into 90 topics (`categories`). We will use those 10,788 documents to train machine learning models and try to predict which topic each document belongs to.

Our next function will find categories for each element of `fileids`.

Note that categories in the Reuters corpus overlap with each other, because a news story often covers multiple topics. If a document has more than one category, we will return **only the first category** (in alphabetical order).

Since we are using only one category for each `fileid`, the result from `get_categories_from_fileids()` should have the same length as the `fileids` list.

In [12]:
categories = get_categories_from_fileids(reuters, fileids)
print(categories[:5], '...', categories[-5:])

['trade', 'grain', 'crude', 'corn', 'palm-oil'] ... ['interest', 'earn', 'earn', 'earn', 'earn']
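The "first category in alphabetical order" rule can be sketched as follows. To keep the example self-contained, the category lookup is passed in as a callable and a small dictionary stands in for the corpus; `first_categories_sketch()` and the toy data are hypothetical.

```python
def first_categories_sketch(categories_of, fileids):
    """Return one category per fileid: the alphabetically first one."""
    # sorted() guarantees alphabetical order whatever order the lookup returns.
    return [sorted(categories_of(fileid))[0] for fileid in fileids]


# Toy stand-in for the corpus: each document may carry several categories.
toy_categories = {
    "test/1": ["wheat", "grain"],
    "test/2": ["crude"],
    "training/3": ["trade", "earn", "yen"],
}
toy_fileids = list(toy_categories)
labels = first_categories_sketch(toy_categories.get, toy_fileids)
print(labels)
```

With NLTK, the lookup would presumably be `reuters.categories`, which accepts a file id. The result always has one entry per file id, so it lines up with `fileids`.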


## Training and test sets

The Reuters data set has already been grouped into a training set and a test set. See `fileids`.

To create a training set, we iterate through `fileids` and collect every entry that starts with "train". `X_train` is a list of **raw data** strings, which we obtain by using the `raw()` method, and `y_train` is the list of matching categories. We repeat the process for all `fileids` that start with "test". In the end, `train_test_split()` returns a 4-tuple of `X_train`, `X_test`, `y_train`, and `y_test`, each of which is a list of strings. Note that this helper comes from `helper.py` and is not scikit-learn's function of the same name.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(reuters, fileids, categories)
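The prefix-based split described above can be sketched like this. The function name `split_by_prefix_sketch()`, the `raw_of` callable, and the toy documents are all hypothetical; with NLTK the callable would presumably be `reuters.raw`.

```python
def split_by_prefix_sketch(raw_of, fileids, categories):
    """Split documents and labels by the 'train' / 'test' prefix of each fileid."""
    X_train, X_test, y_train, y_test = [], [], [], []
    for fileid, category in zip(fileids, categories):
        if fileid.startswith("train"):
            X_train.append(raw_of(fileid))
            y_train.append(category)
        elif fileid.startswith("test"):
            X_test.append(raw_of(fileid))
            y_test.append(category)
    return X_train, X_test, y_train, y_test


# Toy stand-in for the corpus: fileid -> raw text.
toy_raw = {
    "training/1": "wheat exports rose",
    "test/2": "opec cut output",
    "training/3": "net profit up",
}
toy_ids = list(toy_raw)
toy_labels = ["grain", "crude", "earn"]
split = split_by_prefix_sketch(toy_raw.get, toy_ids, toy_labels)
print(split)
```

Because the split is fixed by the file ids, it is deterministic and needs no random seed.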

## SVC (no pipeline, no stop words)

We use `CountVectorizer` to create a document-term matrix, and apply the `LinearSVC` algorithm to predict which topic each news document belongs to.

In [14]:
cv1, svc1, y_pred1 = cv_svc(X_train, y_train, X_test, random_state=check_random_state(0))
score1 = accuracy_score(y_pred1, y_test)
print("SVC prediction accuracy = {0:3.1f}%".format(100.0 * score1))

SVC prediction accuracy = 88.3%


## SVC (Pipeline, no stop words)

- Now we'll build a pipeline using `CountVectorizer` and `LinearSVC`.

In [15]:
clf2, y_pred2 = cv_svc_pipe(X_train, y_train, X_test, random_state=check_random_state(0))
score2 = accuracy_score(y_pred2, y_test)
print("SVC prediction accuracy = {0:3.1f}%".format(100.0 * score2))

SVC prediction accuracy = 88.3%
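The helper `cv_svc_pipe()` is not shown, so here is a minimal sketch of what such a pipeline could look like with scikit-learn. The function name, the step names `'cv'` and `'svc'`, the `stop_words` parameter, and the toy documents are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def cv_svc_pipe_sketch(X_train, y_train, X_test, random_state=0, stop_words=None):
    """Fit a CountVectorizer -> LinearSVC pipeline and predict on X_test."""
    clf = Pipeline([
        # Step 1: turn raw strings into a sparse document-term count matrix.
        ("cv", CountVectorizer(stop_words=stop_words)),
        # Step 2: train a linear support vector classifier on those counts.
        ("svc", LinearSVC(random_state=random_state)),
    ])
    clf.fit(X_train, y_train)
    return clf, clf.predict(X_test)


# Toy documents standing in for the Reuters training and test sets.
train_docs = [
    "wheat corn grain harvest",
    "corn grain export tonnes",
    "crude oil barrel opec",
    "oil crude prices bpd",
]
train_labels = ["grain", "grain", "crude", "crude"]
test_docs = ["grain harvest tonnes", "opec crude oil"]

clf, y_pred = cv_svc_pipe_sketch(train_docs, train_labels, test_docs)
print(list(y_pred))
```

A pipeline fits the same vectorizer and classifier as the two-step version, so identical accuracy is the expected outcome; the benefit is that both steps are fitted and applied through a single object. Passing `stop_words='english'` to the sketch would drop common English function words before counting.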


## SVC (Pipeline and stop words)

- Finally, we'll add English stop words.

In [16]:
clf3, y_pred3 = cv_svc_pipe_sw(X_train, y_train, X_test, random_state=check_random_state(0))
score3 = accuracy_score(y_pred3, y_test)
print("SVC prediction accuracy = {0:3.1f}%".format(100.0 * score3))

SVC prediction accuracy = 87.9%


## Pipeline of TF-IDF and SVM with stop words

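- Now we can build a pipeline by using `TfidfVectorizer` and `LinearSVC`. We'll also use English stop words here.
- Remember that TF-IDF weights a term by how often it appears in a document, discounted by how common the term is across the corpus, so it reflects how important a word is to a document.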

In [17]:
clf4, y_pred4 = tfidf_svc(X_train, y_train, X_test, random_state=check_random_state(0))
score4 = accuracy_score(y_pred4, y_test)
print("SVC prediction accuracy = {0:5.1f}%".format(100.0 * score4))

SVC prediction accuracy =  89.8%
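To make the TF-IDF weighting concrete, here is a pure-Python sketch on three toy documents. It follows the smoothed formula that scikit-learn documents for its default settings, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document; the whitespace tokenizer and the function name `tfidf_sketch()` are simplifying assumptions.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


docs = ["oil prices rise", "oil output falls", "wheat prices rise"]
rows = tfidf_sketch(docs)
print(rows[1])
```

In the second document, "output" appears in only one document while "oil" appears in two, so "output" receives the larger weight even though both occur once.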


# Text Mining

We will continue to use the Reuters corpus, this time to perform text mining tasks, such as n-grams, stemming, and clustering.

## n-grams

- We use unigrams, bigrams, and trigrams.
- We build a pipeline using `TfidfVectorizer` and `LinearSVC`.
- We use English stop words.
- We impose a minimum document frequency: a term must be present in at least three documents to be kept.
- We set the maximum document frequency to 75%, so that any term occurring in more than 75% of all documents is ignored.

In [18]:
clf1, y_pred1 = ngram(X_train, y_train, X_test, random_state=check_random_state(0))
score1 = accuracy_score(y_pred1, y_test)
print("SVC prediction accuracy = {0:5.1f}%".format(100.0 * score1))

SVC prediction accuracy =  90.2%
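The settings listed above interact in a way that is easy to see on a toy corpus. The sketch below generates all 1- to 3-grams per document, counts document frequencies, and keeps only the terms inside the two thresholds. It is a pure-Python illustration with hypothetical function names, not the vectorizer's actual implementation, and it skips stop-word removal.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def select_vocabulary(docs, n_max=3, min_df=3, max_df=0.75):
    """Keep n-grams found in at least min_df documents and at most max_df of them."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # set() ensures a term counts once per document.
        df.update(set(ngrams(doc.lower().split(), n_max)))
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count <= max_df * n_docs
    )


docs = [
    "oil prices rise",
    "oil prices fall",
    "oil prices steady",
    "oil output steady",
]
print(select_vocabulary(docs))
```

Here "oil" appears in all four documents and is dropped by the 75% ceiling, while "prices" and the bigram "oil prices" appear in exactly three documents and survive both thresholds.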


## Stemming

- We use the `tokenize` method in the following code cell to incorporate the Porter Stemmer into the classification pipeline.
- We use unigrams, bigrams, and trigrams.
- We build a pipeline by using `TfidfVectorizer` and `LinearSVC`.
- We use English stop words.
- We impose a minimum document frequency: a term must be present in at least two documents to be kept.
- We set the maximum document frequency to 50%, so that any term occurring in more than 50% of all documents is ignored.

In [19]:
clf2, y_pred2 = stem(X_train, y_train, X_test, random_state=check_random_state(0))
score2 = accuracy_score(y_pred2, y_test)
print("SVC prediction accuracy = {0:5.1f}%".format(100.0 * score2))

SVC prediction accuracy =  89.2%
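The idea of a stemming tokenizer can be sketched without NLTK: split the text into tokens, then map each token through a stemming function. In the sketch below, `naive_stem()` is a deliberately crude suffix stripper used only as a stand-in; it is not the Porter algorithm, and both function names are hypothetical.

```python
import re


def naive_stem(token):
    """Crude suffix stripper for illustration only; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stemming_tokenizer(text, stem=naive_stem):
    """Tokenize the text, then reduce every token with the given stemmer."""
    return [stem(token) for token in re.findall(r"\w+", text.lower())]


print(stemming_tokenizer("Mining mined mines"))
```

All three word forms collapse to the same stem, which is how stemming shrinks the vocabulary. In the notebook's setting, the `stem` argument would presumably be `nltk.stem.PorterStemmer().stem`, and a callable like this can be handed to `TfidfVectorizer` through its `tokenizer` parameter.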


## Clustering Analysis

- We build a pipeline by using `TfidfVectorizer` and `KMeans`.
- We use only unigrams.
- We use English stop words.
- We set the number of clusters to the value of `true_k`.

In [20]:
clf3 = cluster(X_train, X_test, true_k=len(reuters.categories()), random_state=check_random_state(0))

- Function `get_top_tokens()` identifies the most frequently used words in a cluster.

The `cluster` parameter specifies the cluster label. Since we are not using training labels, the cluster labels returned by `KMeans` will be integers.

If we set `top_tokens=3`, the function returns a list of the top 3 tokens in the specified cluster; similarly, if we set `top_tokens=5`, it returns a list of the top 5 tokens.
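One plausible way to find the top tokens of a cluster is to rank the vocabulary by the weights in that cluster's centroid. The sketch below does this on plain lists; `top_tokens_sketch()` and the toy centroid are hypothetical. With scikit-learn, the centroid would come from a row of the fitted model's `cluster_centers_` and the vocabulary from the fitted vectorizer.

```python
def top_tokens_sketch(centroid, feature_names, top_tokens=3):
    """Return the feature names with the largest centroid weights."""
    # Rank feature indices by centroid weight, largest first.
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [feature_names[i] for i in order[:top_tokens]]


# Toy centroid over a four-term vocabulary.
feature_names = ["bank", "oil", "tea", "opec"]
centroid = [0.1, 0.7, 0.0, 0.4]
print(top_tokens_sketch(centroid, feature_names, top_tokens=2))
```

Since the centroid is an average of TF-IDF vectors, the ranking reflects weighted importance within the cluster rather than raw word counts.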

In [21]:
km = clf3.named_steps['km']
tf = clf3.named_steps['tf']

for i in range(len(reuters.categories())):
    print("Cluster {0}: {1}".format(i, ' '.join(get_top_tokens(km, tf, i, 5))))

Cluster 0: units housing starts pct seasonally
Cluster 1: vs cts shr net qtr
Cluster 2: yeutter trade said clayton representative
Cluster 3: steel nippon said yen year
Cluster 4: shares offer said common stock
Cluster 5: pct december billion january money
Cluster 6: vs mln net cts shr
Cluster 7: opec oil prices saudi bpd
Cluster 8: usair twa piedmont application airlines
Cluster 9: cts quarterly qtly div pay
Cluster 10: oil said herrington tax energy
Cluster 11: loss vs mln 000 cts
Cluster 12: wheat tonnes 000 barley export
Cluster 13: deficit mln account billion trade
Cluster 14: canadian canada rapeseed said crushers
Cluster 15: futures bp exchange oil said
Cluster 16: stg bills mln money drain
Cluster 17: cocoa buffer icco delegates stock
Cluster 18: 50 crude bbl wti raises
Cluster 19: taiwan billion trade foreign dlrs
Cluster 20: yen dealers dollars bank japan
Cluster 21: corn tonnes usda acres bushels
Cluster 22: oper vs 000 net cts
Cluster 23: stake shares pct group investment
Cl