# Text Conditioning and Machine Learning

This notebook lays out the process of text analysis and machine learning of scraped Twitter data to identify tweets relevant to humanitarian crises. First labled twitter data collected from previous natural disaters will be cleaned and vectorized. A XXX model will be trained and tweaked over the data. This model will be saved, then applied over tweets from contemporary humanitarian crises to identify emerging disaster hotspots.

## Python Packages Used
This notebook was set up in an environment running Python 3.8 with the following packages:
pandas, tensorflow, keras, scikit-learn, nltk, gensim

In [1]:
import os 
import pandas as pd
import collections as col
import pprint
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from operator import itemgetter
import nltk
import re
import seaborn as sns
import matplotlib.pyplot as plt
import string
import twokenize
from nltk.stem.porter import PorterStemmer

## Preprocessing Script

The following script was adapted from the "preprocessing.py" script avlible from [The CrisisNLP
/deep-learning-for-big-crisis-data GitHub repository](https://github.com/CrisisNLP/deep-learning-for-big-crisis-data).

The researchers in their paper mentioned getting rid of the urls, digits, and usersnames improved their nlp results.

In [15]:
################################################################
''' Preporcessing steps: 
1. lowercasing 
2. Digit -> DDD 
3. URLs -> httpAddress 
4. @username -> userID 
5. Remove special characters, keep ; . ! ? 
6. normalize elongation 
7. tokenization using tweetNLP
output is ~/Dropbox (QCRI)/AIDR-DA-ALT-SC/data/labeled datasets/prccd_data/{filename}_AIDR_prccd.csv
'''
#################################################################

#=================
#==> Libraries <==
#=================
import re, os
import string 
import sys
import twokenize
import csv
from collections import defaultdict
from os.path import basename
import ntpath
import codecs
import unicodedata

def process(lst):
    prccd_item_list=[]
    for tweet in lst:


#         # Normalizing utf8 formatting
#         tweet = tweet.decode("unicode-escape").encode("utf8").decode("utf8")
#         #tweet = tweet.encode("utf-8")
#         tweet = tweet.encode("ascii","ignore")
#         tweet = tweet.strip(' \t\n\r')

        # 1. Lowercasing
        tweet = tweet.lower()
        #print "[lowercase]", tweet

        # Word-Level
        tweet = re.sub(' +',' ',tweet) # replace multiple spaces with a single space

        # 2. Normalizing digits
        tweet_words = tweet.strip('\r').split(' ')
        for word in [word for word in tweet_words if word.isdigit()]:
            tweet = tweet.replace(word, "D" * len(word))
#         print( "[digits]", tweet)

        # 3. Normalizing URLs
        tweet_words = tweet.strip('\r').split(' ')
        for word in [word for word in tweet_words if '/' in word or '.' in word and  len(word) > 3]:
            tweet = tweet.replace(word, "")
#         print( "[URLs]", tweet)

        #4. Normalizing username

        tweet_words = tweet.strip('\r').split(' ')
        try:
            for word in [word for word in tweet_words if word[0] == '@' and len(word) > 1]:
                tweet = tweet.replace(word, "")
#         print( "[usrename]", tweet)
        except:
            tweet = tweet


        # 5. Removing special Characters
        punc = '@$%^&*()_+-={}[]:"|\'\~`<>/,'
        trans = str.maketrans(punc, ' '*len(punc))
        tweet = tweet.translate(trans)
        #print( "[punc]", tweet)

        # 6. Normalizing +2 elongated char
        tweet = re.sub(r"(.)\1\1+",r'\1\1', tweet)
        #print ("[elong]", tweet)

        # 7. tokenization using tweetNLP
        tweet = ' '.join(twokenize.simpleTokenize(tweet))
        #print( "[token]", tweet )

        #8. fix \n char
        tweet = tweet.replace('\n', ' ')

        prccd_item_list.append(tweet.strip())
#         print ("[processed]", tweet.replace('\n', ' '))
        
    return prccd_item_list

In [3]:
#nltk.download('all')

In [4]:
code_dir = os.getcwd()
parent_dir = os.path.dirname(code_dir)
print(parent_dir)

/Volumes/Elements/DataScience/dsa/capstone


## [Crisis Benchmark data for training Models](https://crisisnlp.qcri.org/crisis_datasets_benchmarks.html)

<p>The crisis benchmark dataset consists data from several different data sources such as CrisisLex (<a href="http://crisislex.org/data-collections.html#CrisisLexT26" target="_blank">CrisisLex26</a>, <a href="http://crisislex.org/data-collections.html#CrisisLexT6" target="_blank">CrisisLex6</a>), <a href="https://crisisnlp.qcri.org/lrec2016/lrec2016.html" target="_blank">CrisisNLP</a>, <a href="http://mimran.me/papers/imran_shady_carlos_fernando_patrick_practical_2013.pdf" target="_blank">SWDM2013</a>, <a href="http://mimran.me/papers/imran_shady_carlos_fernando_patrick_iscram2013.pdf" target="_blank">ISCRAM13</a>, Disaster Response Data (DRD), <a href="https://data.world/crowdflower/disasters-on-social-media" target="_blank">Disasters on Social Media (DSM)</a>, <a href="https://crisisnlp.qcri.org/crisismmd" target="_blank">CrisisMMD</a> and data from <a href="http://aidr.qcri.org/" target="_blank">AIDR</a>. 
	  The purpose of this work was to map the class label, remove duplicates and provide a benchmark results for the community. </p>

The authors have their model and data availible on github at <a href="https://github.com/firojalam/crisis_datasets_benchmarks">https://github.com/firojalam/crisis_datasets_benchmarks</a>    </p>

#### Data Availible from: https://crisisnlp.qcri.org/data/crisis_datasets_benchmarks/crisis_datasets_benchmarks_v1.0.tar.gz
<h4><strong>References</strong></h4>
<ol>
<li><a href="http://sites.google.com/site/firojalam/">Firoj Alam</a>, <a href="https://hsajjad.github.io/">Hassan Sajjad</a>, <a href="http://mimran.me/">Muhammad Imran</a> and <a href="https://sites.google.com/site/ferdaofli/">Ferda Ofli</a>, <a href="https://arxiv.org/abs/2004.06774" target="_blank"><strong>CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing,</strong></a> In ICWSM, 2021. [<a href="crisis_dataset_bib1.html">Bibtex</a>]
        </li>
<!-- <li><a href="http://sites.google.com/site/firojalam/">Firoj Alam</a>, <a href="https://hsajjad.github.io/">Hassan Sajjad</a>, <a href="http://mimran.me/">Muhammad Imran</a> and <a href="https://sites.google.com/site/ferdaofli/">Ferda Ofli</a>, <a href="https://arxiv.org/abs/2004.06774" target="_blank"><strong>Standardizing and Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing,</strong></a> In arxiv, 2020. [<a href="crisis_dataset_bib.html">Bibtex</a>]</li>-->
        <li>Firoj Alam, Ferda Ofli and Muhammad Imran. CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA.</li>
        <li>Muhammad Imran, Prasenjit Mitra, and Carlos Castillo: Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), pp. 1638-1643. May 2016, Portorož, Slovenia.</li>
        <li>A. Olteanu, S. Vieweg, C. Castillo. 2015. What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proceedings of the ACM 2015 Conference on Computer Supported Cooperative Work and Social Computing (CSCW '15). ACM, Vancouver, BC, Canada.</li>
        <li>A. Olteanu, C. Castillo, F. Diaz, S. Vieweg. 2014. CrisisLex: A Lexicon for Collecting and Filtering Microblogged Communications in Crises. In Proceedings of the AAAI Conference on Weblogs and Social Media (ICWSM'14). AAAI Press, Ann Arbor, MI, USA.</li>
        <li>Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier. Extracting Information Nuggets from Disaster-Related Messages in Social Media. In Proceedings of the 10th International Conference on Information Systems for Crisis Response and Management (ISCRAM), May 2013, Baden-Baden, Germany.</li>
        <li>Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier. Practical Extraction of Disaster-Relevant Information from Social Media. In Social Web for Disaster Management (SWDM'13) - Co-located with WWW, May 2013, Rio de Janeiro, Brazil.</li>
        <li>https://appen.com/datasets/combined- disaster-response-data/</li>
        <li>https://data.world/crowdflower/disasters- on-social-media</li>

### Pull text into notebook and establish variables

In [5]:
# Set up folders
labled_data_folder  =  os.path.join(parent_dir,"Data/crisis_datasets_benchmarks/all_data_en")
initial_filtering_folder = os.path.join(parent_dir,"Data/crisis_datasets_benchmarks/initial_filtering")

In [29]:
self_pull_folder = os.path.join(parent_dir,"Data/scraped")

In [30]:
geotweets = pd.read_csv(os.path.join(self_pull_folder,"tweetsid.csv"))
geotweets.head()

Unnamed: 0,TweetId,Text
0,1.362915e+18,@ErinBrockovich We haven't had running water i...
1,1.362915e+18,@ErinBrockovich We haven't had running water i...
2,1.362915e+18,"NHS College Storm still battling in Bemidji, t..."
3,1.362915e+18,I hope one of the last two challenges that hav...
4,1.362915e+18,@atliberalandold Once after an ice storm in Ka...


In [6]:
# Establish dataframes
# filtered  = pd.read_table(os.path.join
#                        (initial_filtering_folder,
#                                     "crisis_consolidated_informativeness_filtered_lang.tsv"))
# english = filtered[filtered["lang"] == 'en']

train =  pd.read_table(os.path.join
                       (labled_data_folder,
                                    "crisis_consolidated_informativeness_filtered_lang_en_train.tsv"))
test  =  pd.read_table(os.path.join
                       (labled_data_folder,
                                    "crisis_consolidated_informativeness_filtered_lang_en_train.tsv"))
dev =  pd.read_table(os.path.join
                     (labled_data_folder,
                                  "crisis_consolidated_informativeness_filtered_lang_en_train.tsv"))

In [7]:
df_list = [train, test, dev]

#### Learning how to do some text stuff 

In [16]:
preproccessed_tweets_train = process(train['text'])
preproccessed_tweets_test = process(test['text'])
preproccessed_tweets_dec = process(dev['text'])

In [31]:
preproccessed_tweets_geotweets = process(geotweets ['Text'])

In [18]:
test['processed2'] = preproccessed_tweets_train
train['processed2'] = preproccessed_tweets_test
dev['processed2'] = preproccessed_tweets_dec


In [32]:
geotweets['processed2'] = preproccessed_tweets_geotweets

In [11]:
test.head()

Unnamed: 0,id,event,source,text,lang,lang_conf,class_label,processed
0,530,disaster_events,drd-figureeight-multimedia,"Organization that are working in Haiti, I do n...",en,1.0,informative,organization that are working in haiti i do no...
1,913070034204884992,hurricane_maria,crisismmd,Maria now a hurricane again!! Strong storm sur...,en,,informative,maria now a hurricane again !! strong storm su...
2,540027128478044160,2014_philippines_typhoon,crisisnlp-cf,RT @ANCALERTS: Fallen tree branches scattered ...,en,0.950529,informative,rt usrId fallen tree branches scattered in sor...
3,17711,disaster_events,drd-figureeight-multimedia,The Government did not request international a...,en,1.0,informative,the government did not request international h...
4,778251007,disaster_events,dsm-cf,Remove http://t.co/77b2rNRTt7 Browser Hijacker...,en,1.0,not_informative,remove httpAddress browser hijacker how httpAd...


In [33]:
geotweets.head()

Unnamed: 0,TweetId,Text,processed2
0,1.362915e+18,@ErinBrockovich We haven't had running water i...,we haven t had running water in lake charles l...
1,1.362915e+18,@ErinBrockovich We haven't had running water i...,we haven t had running water in lake charles l...
2,1.362915e+18,"NHS College Storm still battling in Bemidji, t...",nhs college storm still battling in bemidji tr...
3,1.362915e+18,I hope one of the last two challenges that hav...,i hope one of the last two challenges that hav...
4,1.362915e+18,@atliberalandold Once after an ice storm in Ka...,atliberalandold once after an ice storm in kan...


In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer()

train_counts = cv.fit_transform(train['processed'])
test_data = cv.transform(test['processed'])

nb = MultinomialNB()

clf = nb.fit(train_counts, train['class_label'])
predicted = clf.predict(test_data)


print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test_data, test['class_label'])))

NB prediction accuracy =  85.6%


In [35]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer()

train_counts = cv.fit_transform(train['processed'])
test_data = cv.transform(geotweets['processed2'])

nb = MultinomialNB()

clf = nb.fit(train_counts, train['class_label'])
predicted_geo = clf.predict(test_data)


# print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test_data, test['class_label'])))

In [50]:
geotweets['predict']=predicted_geo
print(geotweets.iloc[244]['Text'] )
print (geotweets.iloc[244]['processed2'])
print (geotweets.iloc[244]['predict'])

The Federal Energy Regulatory Commission urged Texas to winterize its power plants with insulation and heat pipes in 2011.  The state deliberately cut its power grid off from the rest of the country precisely to escape federal regulation and did not act. https://t.co/J36cPXcT5j
the federal energy regulatory commission urged texas to winterize its power plants with insulation and heat pipes in the state deliberately cut its power grid off from the rest of the country precisely to escape federal regulation and did not
informative


In [13]:
# Read about Pipelines here:
# http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
from sklearn.pipeline import Pipeline

tools = [('cv', CountVectorizer()), ('nb', MultinomialNB())]
clf = Pipeline(tools)

clf = clf.fit(train['processed'], train['class_label'])
predicted = clf.predict(test['processed'])

print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test['processed'], test['class_label'])))

NB prediction accuracy =  85.6%


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tools = [('tf', TfidfVectorizer()), ('nb', MultinomialNB())]
clf = Pipeline(tools)

# set_params() of TfidfVectorizer below, sets the parameters of the estimator. The method works on simple estimators as 
# well as on nested objects (such as pipelines). The pipelines have parameters of the form <component>__<parameter> 
# so that it’s possible to update each component of a nested object.
clf.set_params(tf__stop_words = 'english')

clf = clf.fit(train['processed'], train['class_label'])
predicted = clf.predict(test['processed'])

print("NB (TF-IDF with Stop Words) prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test['processed'], test['class_label'])))

NB (TF-IDF with Stop Words) prediction accuracy =  84.9%


In [19]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer()

train_counts = cv.fit_transform(train['processed2'])
test_data = cv.transform(test['processed2'])

nb = MultinomialNB()

clf = nb.fit(train_counts, train['class_label'])
predicted = clf.predict(test_data)


print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test_data, test['class_label'])))

NB prediction accuracy =  86.2%


In [20]:
# Read about Pipelines here:
# http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
from sklearn.pipeline import Pipeline

tools = [('cv', CountVectorizer()), ('nb', MultinomialNB())]
clf = Pipeline(tools)

clf = clf.fit(train['processed2'], train['class_label'])
predicted = clf.predict(test['processed2'])

print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test['processed2'], test['class_label'])))

NB prediction accuracy =  86.2%


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

tools = [('tf', TfidfVectorizer()), ('nb', MultinomialNB())]
clf = Pipeline(tools)

# set_params() of TfidfVectorizer below, sets the parameters of the estimator. The method works on simple estimators as 
# well as on nested objects (such as pipelines). The pipelines have parameters of the form <component>__<parameter> 
# so that it’s possible to update each component of a nested object.
clf.set_params(tf__stop_words = 'english')

clf = clf.fit(train['processed2'], train['class_label'])
predicted = clf.predict(test['processed2'])

print("NB (TF-IDF with Stop Words) prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test['processed2'], test['class_label'])))

NB (TF-IDF with Stop Words) prediction accuracy =  85.5%


In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer

clf = Pipeline([('vect', CountVectorizer(stop_words = 'english')),
                ('tfidf', TfidfTransformer()),
                ('lr', LogisticRegression())])


clf = clf.fit(train['processed2'], train['class_label'])
predicted = clf.predict(test['processed2'])
predicted_geo_lab = clf.predict(geotweets['processed2'])
geotweets['predict']=predicted_geo_lab

print("LR (TF-IDF with Stop Words) prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test['processed2'], test['class_label'])))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LR (TF-IDF with Stop Words) prediction accuracy =  88.5%


In [53]:
import random
#Generate 10 random numbers between 0 and 30
randomlist = random.sample(range(0, len(geotweets)), 10)
print(randomlist)
for z in randomlist:
    print("Predicted Label {0}".format(geotweets.iloc[z]['predict']))
    print("initial tweet {0}".format(geotweets.iloc[z]['Text']))
    print("Processed tweet {0}".format(geotweets.iloc[z]['processed2']))

[2059, 1990, 1701, 1589, 1009, 1773, 1219, 297, 1209, 1324]
Predicted Label informative
initial tweet WATCH: These camels watched as snow fell in the mountains near Tabuk, Saudi Arabia on Thursday.
https://t.co/mvRdOJZCZm https://t.co/IkvQox3D4L
Processed tweet watch these camels watched as snow fell in the mountains near tabuk saudi arabia on
Predicted Label not_informative
initial tweet Icepocalypse, Austin, TX.. a pictorial:
1. Terracotta worked minimally, better if setup as a true radiator in a small room.
2. Non-potable melted snow/ice for dishes, rinsing, flushing.
3. Sanitizer in the bathroom because of the boil notice.
4. OMG oatmeal was a life saver! https://t.co/Xrfe7FCLkp
Processed tweet icepocalypse austin a terracotta worked minimally better if setup as a true radiator in a small non potable melted for dishes rinsing sanitizer in the bathroom because of the boil omg oatmeal was a life saver !
Predicted Label not_informative
initial tweet @MileHigh_Nick @MileHighChubb @Jaco

In [25]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

tools = [('cv', CountVectorizer()), ('nb', MultinomialNB())]
pclf = Pipeline(tools)


# Lowercase and restrict ourselves to about half the available features
pclf.set_params(cv__stop_words = 'english', \
                cv__ngram_range=(1,2), \
                cv__lowercase=True)

pclf.fit(train['processed2'], train['class_label'])
y_pred = pclf.predict(test['processed2'])
print(metrics.classification_report(test['class_label'], y_pred, target_names = ['informative','not']))

              precision    recall  f1-score   support

 informative       0.95      0.98      0.96     65612
         not       0.96      0.92      0.94     43829

    accuracy                           0.95    109441
   macro avg       0.96      0.95      0.95    109441
weighted avg       0.95      0.95      0.95    109441



In [26]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import scale, LabelBinarizer
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Random seed for numpy
np.random.seed(18937)

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation

# Build a mode that is composed of this list of layers
model = Sequential(
    [
          # This specifies a single neuron, and the input is 2 numbers.
    Dense(1, input_dim=2),  # a dense layer, every neuron is connected to all points from the lower layer (input)
    Activation('linear'),   # Specify the type of decision surface, i.e., simple linear regression
    Dense(1),               # another dense layer, input_dim is inferred from the previous layer's output
    Activation('sigmoid')   # Specify the type of decision surface, i.e., simple logistic regression
    ]
)
model.summary()

In [None]:
n = 9
message = english.iloc[n]['text']
message_procc =  preproccessed_tweets [n]
print(message)
print(message_procc)

In [None]:
# Accumulate counts of tokens, using string functionality


# Used to print sequences in a nice manner

pp = pprint.PrettyPrinter(indent=2, depth=2, width=80,compact=True)

# Tokenize the message and create a counter for frequency of each word in message.
# Browse for split() python or go to this link http://www.pythonforbeginners.com/dictionary/python-split to see what split() does
words = message.split()
word_count = col.Counter(words)

# Setting the limit to 40 for the number of tokens to display 
counts_to_display = 40

# Display results. 
print('Total number of tokens = {0}'.format(len(word_count)))
print(30*'-')
print('Top {} tokens by frequency:'.format(counts_to_display))
print(30*'-')
pp.pprint(word_count.most_common(counts_to_display))

In [None]:
# Accumulate counts of tokens, using string functionality


# Used to print sequences in a nice manner

pp = pprint.PrettyPrinter(indent=2, depth=2, width=80,compact=True)

# Tokenize the message and create a counter for frequency of each word in message.
# Browse for split() python or go to this link http://www.pythonforbeginners.com/dictionary/python-split to see what split() does
words = message_procc.split()
word_count = col.Counter(words)

# Setting the limit to 40 for the number of tokens to display 
counts_to_display = 40

# Display results. 
print('Total number of tokens = {0}'.format(len(word_count)))
print(30*'-')
print('Top {} tokens by frequency:'.format(counts_to_display))
print(30*'-')
pp.pprint(word_count.most_common(counts_to_display))

In [None]:
words = message.lower().split()
word_count = col.Counter(words)

# Setting the limit to 40 for the number of tokens to display 
counts_to_display = 40

# Display results. 
print('Total number of tokens = {0}'.format(len(word_count)))
print(30*'-')
print('Top {} tokens by frequency:'.format(counts_to_display))
print(30*'-')
pp.pprint(word_count.most_common(counts_to_display))

In [None]:
twks= twokenize.tokenizeRawTweetText(message)
word_count = col.Counter(twks)
# Setting the limit to 40 for the number of tokens to display 
counts_to_display = 40

# Display results. 
print('Total number of tokens = {0}'.format(len(word_count)))
print(30*'-')
print('Top {} tokens by frequency:'.format(counts_to_display))
print(30*'-')
pp.pprint(word_count.most_common(counts_to_display))

In [None]:
# In the below print statememt, {0:12s} means, print argument 1 with 12 spaces allocated for it. 
# You can see that for two of the results, 12 spaces is not adequate and these results are misaligned.
print('{0:12s}: {1}'.format('Term', 'Frequency'))
print(20*'-')

total_word_count = sum(word_count.values())
for count in word_count.most_common(counts_to_display):
    pp.pprint('{0:12s}: {1:4.3f}'.format(count[0], count[1]/total_word_count))

In [None]:
cv = CountVectorizer(analyzer='word', lowercase=True)\
cv.fit(english['text'])

In [None]:
# We can now process documents.

# We need an iteratable to apply cv.transform()
msg = []
msg.append(message)

# Transforming a single message is easier to comprehend. By default, scikit learn uses sparse matrices for text processing
# It returns a Document Term Matrix (dtm)
dtm = cv.transform(msg)

# In sparse format number of tokens indicate size of dataset vocabulary. 
# So there is 1 document and 130107 featues in the dtm.
print('Number of Samples = {0}'.format(dtm.shape[0]))
print('Number of Tokens = {0}'.format(dtm.shape[1]))
print(80*'-')


# You can't explore the document-term matrix when it is in sparse form. We can convert from sparse to dense form to explore 
# the document-term matrix. The range given below is chosen randomly. 
# Each word is a feature. Below zeros indicate the words/features in columns 1000 to 1100, those words do not appear in 
# the input message. Thats why we have zeros for those cells
print(dtm.todense()[:,1000:1100])
print(80*'-')


# We can also print only nonzero DTM matrix elements. 
print('Cells from Document-Term Matrix[i, j] and c (Count)')
print(80*'-')



# Find non-zero elements. scipy.sparse.find() returns the indices and values of the nonzero elements of a matrix.
# i,j contains the row and column indices where non zero matrix entries are present while V has the entry's value.
i, j, V = sp.find(dtm)
dtm_list = list(zip(i, j, V))
pp.pprint(dtm_list)

In [None]:
print(cv.vocabulary_["monsoon"])

In [None]:

dtm_list[0:5]

In [None]:
max(dtm_list, key=itemgetter(2))

In [None]:
# Explore the terms in the vocabulary
terms = cv.vocabulary_

# Look for a single term confuse
search_word = 'blizzard'
print("Chosen Word ({0}): Column = {1}".format(search_word, terms[search_word]))

# Find the maximum value in dtm_list in 3rd column which will be 114455
max_key = max(dtm_list, key=itemgetter(2))[1]

# Find the minimum value in dtm_list in 3rd column which will be 2336
min_key = min(dtm_list, key=itemgetter(2))[1]

# In the below two lines, terms.keys() will return all keys - i.e. the column names(words).
# For loop iterates over all this words to see get the column name which matches the column index we have in max_key and min_key
x_max = [key for key in terms.keys() if terms[key] == max_key]
x_min = [key for key in terms.keys() if terms[key] == min_key]

# the for loop above returned a list as output. So x_max is a list with a column name as same with x_min
print("Max Word ({0}): Column = {1}".format(x_max[0], max_key))
print("Min Word ({0}): Column = {1}".format(x_min[0], min_key))

In [None]:
# Tokenize a text document
# word_tokenize() is tokenizing the message and each word is being converted to lowercase. 
# So words has the vocabulary of message
words = [word.lower() for word in nltk.word_tokenize(message)]
top_display=25

# Count number of occurances for each token
counts = nltk.FreqDist(words)
pp.pprint(counts.most_common(top_display))

In [None]:
# Tokenize a text document
# word_tokenize() is tokenizing the message and each word is being converted to lowercase. 
# So words has the vocabulary of message
words = [word.lower() for word in nltk.word_tokenize(message_procc)]
top_display=25

# Count number of occurances for each token
counts = nltk.FreqDist(words)
pp.pprint(counts.most_common(top_display))

In [None]:
# Specify a Regular Expression to parse a text document

pattern = re.compile(r'[^\w\s]')
words = [word.lower() for word in nltk.word_tokenize(re.sub(pattern, ' ', message_procc))]

# Count token occurances
counts = nltk.FreqDist(words)
pp.pprint(counts.most_common(top_display))

In [None]:
num_words = len(words)
num_tokens = len(counts)
lexdiv  =  num_words / num_tokens
print("Message has %i tokens and %i words for a lexical diversity of %0.3f" % (num_tokens, num_words, lexdiv))

In [None]:
# Display number of unique tokens (or bins)
print('Number of unique bins(tokens) = {0}\n'.format(counts.B()))
print('Number of sample outcomes = {0}\n'.format(counts.N()))
print('Maximum occuring token = {0}\n'.format(counts.max()))

print('{0:12s}: {1}'.format('Term', 'Count'))
print(25*'-')

for token, freq in counts.most_common(top_display):
    print('{0:12s}:  {1:4.3f}'.format(token, freq))

In [None]:
# Hapaxes
pp.pprint(counts.hapaxes()[:10])

In [None]:
# Number of elements to display
top_display=10
counts.tabulate(top_display)

In [None]:
fig, axs = plt.subplots(figsize=(10,6))
sns.set(style="white", font_scale=1.5)
sns.despine(offset=5)#, trim=True)
counts.plot(top_display, cumulative=True)
axs.set_title('Term Count')
plt.show()

In [None]:
# Sample sentance to tokenize
my_text = message_procc

cv1 = CountVectorizer(lowercase=True)
cv2 = CountVectorizer(stop_words = 'english', lowercase=True)

tk_func1 = cv1.build_analyzer()
tk_func2 = cv2.build_analyzer()

pp = pprint.PrettyPrinter(indent=2, depth=1, width=80, compact=True)

print('Tokenization:')
pp.pprint(tk_func1(my_text))

print()

print('Tokenization (with Stop words):')
pp.pprint(tk_func2(my_text))

In [None]:
tk_func1

In [None]:
new_text = message_procc
stemmer = PorterStemmer()
tokens = nltk.word_tokenize(new_text)
tokens = [token for token in tokens if token not in string.punctuation]

for w in tokens:
    print(stemmer.stem(w))

In [None]:
english['text']0