## DTM building in Py

Class,

We've seen R's text-analytics functionality under the bag-of-words (BOW) approach extensively. Time to quickly replicate the essentials of that work in Python as well.

The plan is to load a corpus, clean it up, apply stopwords, stem if required, and finally build a DTM (document-term matrix) out of it. See below.

In [1]:
## setup chunk
import time   # to time 'em opns
t0 = time.time()    # start timer
import numpy as np
import pandas as pd
import nltk
from bs4 import BeautifulSoup
import re
import os
import codecs
from sklearn import feature_extraction
# import mpld3  # conda install -c conda-forge mpld3 
t1 = time.time()

time_taken = round(t1-t0, 3)   # underscore, not dot: 'time.taken' would set an attribute on the time module
print(time_taken)
print("\n")    # print newline

72.529




### Data Loading and Basic Cleaning

The document 'synopses_list_wiki.txt' contains movie synopses (plot summaries) scraped from Wikipedia.

Load this document into Python by customizing the path in the code below. We'll use only the first 100 summaries for demo purposes here in class. Feel free to try the same on the full dataset or on other, larger datasets. Do time your ops.

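Special mention goes to the 'encoding="utf-8"' bit in open(). Without it, open() falls back to the platform's default encoding (often cp1252 on Windows) and may raise a 'charmap' decode error, as some characters in the text are non-ASCII.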
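As a small, self-contained illustration of the encoding point above, the sketch below writes a made-up sentence with non-ASCII characters to a temporary file and reads it back. The sample text and the temp file are assumptions for demo purposes only, not part of the course data.

```python
import os
import tempfile

# Hypothetical sample text containing non-ASCII characters
sample = "Amélie visits a naïve café owner’s shop."

# Write the text to a temporary UTF-8 file so the example is self-contained
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

# Reading with an explicit encoding works regardless of the platform default
with open(path, encoding="utf-8") as f:
    text = f.read()

# Decoding the same bytes with a codec that cannot represent them fails
with open(path, "rb") as f:
    raw = f.read()
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

os.remove(path)   # clean up the temporary file
print(text == sample, ascii_ok)
```

Being explicit about the encoding keeps the notebook portable across Windows, Mac and Linux machines in class.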

Note also the use of bs4 with Python's built-in *html.parser* to get rid of HTML junk.

In [3]:
# note the utf-8 encoding here below. Else a charmap error comes.
synopses_wiki = open(r"C:\Users\31172\TABA\Session 1\synopses_list_wiki.txt", encoding="utf-8").read().split('\n BREAKS HERE')
synopses_wiki = synopses_wiki[:100]   # only the first 100 summaries taken for analysis

# use bs4 to cleanup html junk & append into clean_text
t0 = time.time()
synopses_clean_wiki = []
for text in synopses_wiki:
    text = BeautifulSoup(text, 'html.parser').getText()
    #strips html formatting and converts to unicode
    synopses_clean_wiki.append(text)
t1 = time.time()
print(round(t1-t0, 3))    # 0.036 secs

synopses_wiki = synopses_clean_wiki

0.024


### Tokenization and Stemming

Below we use NLTK's sentence and word tokenizers and import its Snowball stemmer. Note the use of a list comprehension to bundle both tokenizers into one line of efficient code.
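For anyone new to nested list comprehensions, here is a minimal sketch of the pattern. To keep it dependency-free, a regex split and str.split() stand in for NLTK's sentence and word tokenizers; that substitution is an assumption for illustration, and the real tokenizers handle punctuation far better.

```python
import re

text = "The ship sinks. Jack dies!"

# Stand-in sentence splitter (assumption): break after ., ! or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

# Nested comprehension: outer loop over sentences, inner loop over words
tokens = [word for sent in sentences for word in sent.split()]

# The same thing written as explicit loops
tokens_loop = []
for sent in sentences:
    for word in sent.split():
        tokens_loop.append(word)

print(tokens)
```

Read the comprehension left to right in the same order as the explicit loops: first the sentence loop, then the word loop.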

Note also the use of regex from *re* to detect and drop any tokens that contain no alphabetic characters (numbers, raw punctuation) from the corpus.
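A quick sketch of what that regex filter keeps and drops, on a made-up token list (the tokens are hypothetical, chosen only to show the edge cases):

```python
import re

# Hypothetical tokens, roughly what a word tokenizer might emit
tokens = ["Titanic", "1912", ",", "'s", "R2-D2", "...", "iceberg"]

# Keep a token if it contains at least one ASCII letter anywhere
kept = [t for t in tokens if re.search("[a-zA-Z]", t)]

print(kept)
```

Note that re.search() only checks for the presence of a letter. Tokens such as "'s" and "R2-D2" survive intact; nothing is stripped from inside a token.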

Find below two straightforward user-defined funcs: one to tokenize and stem, the other to tokenize only.

We will apply these funcs on each doc in the corpus subsequently.

In [4]:
# load nltk's English stopwords as variable called 'stopwords' (needs a one-time nltk.download('stopwords'))
stopwords = nltk.corpus.stopwords.words('english')

# load nltk's SnowballStemmer as variable 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

## here I define a tokenizer and stemmer which returns the list of stems in the text that it is passed
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation) using regex
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [5]:
# Use above funcs to iterate over the list of synopses to create two vocabularies: one stemmed and one only tokenized. 
totalvocab_stemmed = []
totalvocab_tokenized = []
t0 = time.time()
for i in synopses_wiki:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)
    
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)
t1 = time.time()
print(round(t1-t0, 3))    # 10.716 secs

7.667


### Building the DTM for TFIDF weighting

While lists are nice and all, nothing beats the tabular structure of a DataFrame for putting the mind at ease and conjuring up analysis visualizations. Hence, step 1 below is building and populating a pandas DataFrame that maps each stem (the index) back to the token it came from.

Next, we import *TfidfVectorizer* to build the DTM under the TFIDF weighting scheme.

Finally, we inspect the DTM object, its dimensions, the main column names etc.
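To make the TFIDF weighting concrete, here is a hand-rolled sketch on a toy corpus. It assumes raw term counts, the smoothed idf formula ln((1 + n) / (1 + df)) + 1 and L2 row normalization, which is how scikit-learn documents its default settings. The three toy docs and the helper name tfidf_row are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each doc is already tokenized and stemmed
docs = [
    ["ship", "sink", "ship"],
    ["ship", "love"],
    ["war", "love", "war"],
]
n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})

# Document frequency: in how many docs does each term appear?
df = Counter(t for d in docs for t in set(d))

# Smoothed idf: rarer terms get larger weights
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return one L2-normalized tf-idf row for a tokenized doc."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

dtm = [tfidf_row(d) for d in docs]   # rows = docs, columns = vocab terms

print(vocab)
for row in dtm:
    print([round(v, 3) for v in row])
```

Each row of dtm is one document and each column one term, which is the same layout as the matrix TfidfVectorizer returns (there in sparse form).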

In [6]:
## create a pandas DataFrame with the stemmed vocabulary as the index and the tokenized words as the column
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)

## Tf-idf and document similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
# defining params for the tfidf vectorizer here
tfidf_vectorizer = TfidfVectorizer(max_df=0.8,         # drop terms present in more than 80% of docs
                                   max_features=200000,
                                   min_df=0.2,         # drop terms present in fewer than 20% of docs
                                   stop_words='english',
                                   use_idf=True,
                                   tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))  # unigrams, bigrams and trigrams

# note magic cmd %time
%time tfidf_matrix = tfidf_vectorizer.fit_transform(synopses_wiki)    # 6.05 secs

print(tfidf_matrix.shape)    # dimns of the tfidf matrix



Wall time: 6.81 s
(100, 218)
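The vocabulary ends up this small largely because of the max_df and min_df cut-offs. A minimal sketch of that document-frequency filter follows, on a made-up five-doc corpus. The threshold values here are illustrative (the cell above uses 0.2 and 0.8), and this is a simplified stand-in, not scikit-learn's actual code.

```python
from collections import Counter

# Toy corpus (hypothetical): each doc reduced to its set of distinct terms
docs = [
    {"the", "ship", "sink"},
    {"the", "ship", "love"},
    {"the", "war", "love"},
    {"the", "war", "hero"},
    {"the", "ship", "war"},
]
n_docs = len(docs)

# Count the number of docs each term appears in
df = Counter(t for d in docs for t in d)

# Illustrative thresholds, expressed as proportions of docs
min_df, max_df = 0.4, 0.8

# Keep terms that are neither too rare nor too common
kept = sorted(t for t, c in df.items() if min_df <= c / n_docs <= max_df)

print(kept)
```

Here "the" is dropped for appearing in every doc, while "sink" and "hero" are dropped for appearing in only one.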


In [13]:
terms = tfidf_vectorizer.get_feature_names_out().tolist()   # get_feature_names() was removed in scikit-learn 1.2
terms[:20]

['accept',
 'agre',
 'allow',
 'alon',
 'american',
 'ani',
 'anoth',
 'apart',
 'appear',
 'approach',
 'arm',
 'armi',
 'arrang',
 'arriv',
 'ask',
 'attack',
 'attempt',
 'away',
 'becaus',
 'becom']

In [17]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns = terms)
tfidf_df.head(10)    # show the first 10 docs

| | accept | agre | allow | alon | american | ani | anoth | apart | appear | approach | ... | way | wife | wit | woman | work | world | wound | year | york | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.196816 | 0.050128 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.100256 | 0.140758 | 0.000000 |
| 1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.183194 | 0.000000 | 0.000000 | 0.070243 | ... | 0.000000 | 0.059369 | 0.000000 | 0.000000 | 0.108227 | 0.064789 | 0.000000 | 0.105546 | 0.000000 | 0.000000 |
| 2 | 0.000000 | 0.058951 | 0.142565 | 0.000000 | 0.079837 | 0.075890 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.078467 | 0.000000 | 0.000000 | 0.180935 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.101813 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.066710 | ... | 0.052051 | 0.056382 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.050118 | 0.211096 | 0.000000 |
| 4 | 0.000000 | 0.060724 | 0.073426 | 0.164476 | 0.082238 | 0.078172 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.068313 | 0.000000 | 0.000000 | 0.062266 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 5 | 0.000000 | 0.054107 | 0.065425 | 0.000000 | 0.109916 | 0.069654 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.066427 | 0.000000 | 0.066427 | 0.000000 | 0.000000 | 0.000000 | 0.058439 |
| 6 | 0.051589 | 0.039418 | 0.000000 | 0.053384 | 0.080076 | 0.050745 | 0.091224 | 0.000000 | 0.099864 | 0.000000 | ... | 0.122816 | 0.000000 | 0.000000 | 0.000000 | 0.040420 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.042575 |
| 7 | 0.000000 | 0.000000 | 0.090045 | 0.100852 | 0.151277 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.099121 | ... | 0.000000 | 0.167550 | 0.000000 | 0.000000 | 0.076360 | 0.182847 | 0.000000 | 0.223403 | 0.313655 | 0.000000 |
| 8 | 0.000000 | 0.056360 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.071393 | 0.000000 | ... | 0.117069 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.103790 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 9 | 0.000000 | 0.000000 | 0.000000 | 0.121230 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.119149 | 0.329691 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.125678 | 0.193365 |


OK, that's it for now. Back to the slides.

Sudhir