In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp

sns.set_style("darkgrid")
%matplotlib inline

corpus: collection of documents
1. tweet
2. review
3. resume
4. book
5. article
6. sentence?
...

In [2]:
#Documents usually represented as strings
#string: a sequence (list) of unicode characters
doc = "D.S. is fun!\nIt's  true."
print(doc)
print('|'.join(doc))

D.S. is fun!
It's  true.
D|.|S|.| |i|s| |f|u|n|!|
|I|t|'|s| | |t|r|u|e|.


# Regular Expressions

1. Search patterns over text
2. Useful for finding/replacing/grouping
3. Can use the python re library

Short list of special characters

- . : any single character except newline (r'.', 'x')
- ^ : beginning of string (r'^D', 'D.S.')
- \$ : end of string (r'fun!\\$', 'DS is fun!')
- \* : match 0 or more repetitions (r'x*','')
- \+ : match 1 or more repetitions (r'x+', 'xx')
- ? : match 0 or 1 repetitions (r'x?', '')

Short list of special characters (cont.)
- [] : a set of characters (^ as first element = not)
- \s : whitespace character (Ex: [ \t\n\r\f\v])
- \S : non-whitespace character (Ex: [^ \t\n\r\f\v])
- \w : word character (Ex: [a-zA-Z0-9_])
- \W : non-word character
- \b : boundary between \w and \W

and many more!
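
A few of the special characters above in action. This is a minimal sketch; the sample strings are made up for illustration.

```python
import re

text = "D.S. is fun!"  # illustrative sample string

# ^ anchors the match at the beginning of the string
starts_with_d = re.search(r'^D', text) is not None

# $ anchors the match at the end of the string
ends_with_fun = re.search(r'fun!$', text) is not None

# + needs at least one repetition; * and ? also accept zero
plus_matches = re.findall(r'x+', 'x xx xxx')
star_on_empty = re.fullmatch(r'x*', '') is not None
optional_u = re.findall(r'colou?r', 'color colour')

# [^...] negates a set: runs of characters that are neither vowels nor whitespace
non_vowels = re.findall(r'[^aeiou\s]+', 'is fun')

print(starts_with_d, ends_with_fun, star_on_empty)
print(plus_matches, optional_u, non_vowels)
```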

In [3]:
import re
# Find all of the whitespaces in doc
re.findall(r'\s+',doc)

[' ', ' ', '\n', '  ']

In [4]:
# split on whitespace
re.split(r'\s+', doc)

['D.S.', 'is', 'fun!', "It's", 'true.']

In [5]:
# find tokens of length 2+ word characters
re.findall(r'\b\w\w+\b', doc)

['is', 'fun', 'It', 'true']

In [6]:
re.findall(r'\b\w\w+\b', re.sub(r'D\.S\.','Data Science',doc))

['Data', 'Science', 'is', 'fun', 'It', 'true']

In [40]:
import spacy
nlp = spacy.load("en_core_web_sm")
parsed = nlp(doc)  # doc must still be the raw string from In [2], not a spaCy Doc

In [10]:
parsed

D.S. is fun!
It's  true.

In [11]:
'|'.join([token.text for token in parsed])

"D.S.|is|fun|!|\n|It|'s| |true|."

doc 1: "The cat in the hat."  

doc 2: "The quick brown cat jumped over the lazy cat."

1. terms: distinct values in our vocabulary ('brown','cat',...)
2. vocabulary: set of terms that can be in a document
3. tokens: strings that make up a document ('the','cat',...)
4. tokenization: transform document into tokens

# Tokenization

common tokenization method: whitespace

doc 1: "The","cat","in","the","hat."

doc 2: "The","quick","brown","cat","jumped","over","the","lazy","cat."

Additional transformations depend on problem:
1. lowercase
2. remove stopwords
3. stemming: reduce token to stem (eg: "tokenization"->"tokeniz")
4. lemmatization: common form (eg: "tokenization"->"tokenize")
5. start and end tags \\<START\>, \\<END\>
6. remove special characters
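
A rough sketch of how a few of these transformations (lowercasing, removing special characters, removing stopwords) might be chained by hand. The `normalize` helper and the tiny `STOPWORDS` set are assumptions made for illustration, not a standard list.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {'the', 'in', 'over', 'a', 'is'}

def normalize(doc):
    """Lowercase, strip special characters, split on whitespace, drop stopwords."""
    doc = doc.lower()
    doc = re.sub(r'[^\w\s]', '', doc)  # drop anything that is not a word char or whitespace
    tokens = doc.split()               # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(normalize("The cat in the hat."))
print(normalize("The quick brown cat jumped over the lazy cat."))
```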

# Doc Representation

Bag of Words (BOW) representation:
split the document into tokens and ignore their order; the context around each word is lost!

doc 1: "cat","hat","in","the","the"

doc 2: 'brown','cat','cat','jumped','lazy','over','quick','the','the'

In [12]:
sorted(re.findall(r'\b\w\w+\b', re.sub(r'D\.S\.','Data Science',doc).lower()))

['data', 'fun', 'is', 'it', 'science', 'true']

Stopwords: terms that have high DF and aren't informative

ex: 'a', 'about','above',...

often removed prior to analysis

n-grams:
    
create new terms as combinations of n tokens

vocabulary increases quickly,

ex: Bigrams

doc 1: "\\<start\>_the","the_cat","cat_in","in_the","the_hat","hat_\\<end\>"
    
doc 2: "\\<start\>_the",'the_quick','quick_brown',...
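
One way bigrams with start and end tags could be generated from a token list. The `bigrams` helper is a hypothetical name used only for this sketch.

```python
def bigrams(tokens, start='<start>', end='<end>'):
    """Join each pair of neighbouring tokens, padding with start/end tags."""
    padded = [start] + list(tokens) + [end]
    return [f'{a}_{b}' for a, b in zip(padded, padded[1:])]

doc_1 = ['the', 'cat', 'in', 'the', 'hat']
print(bigrams(doc_1))
```

A document with n tokens yields n + 1 bigrams once the tags are added, which is part of why the vocabulary grows so fast.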

Term Frequency: number of occurrences of a term in a document

Document Frequency: number of documents a term occurs in

Unigram term frequency (TF):

doc 1: "cat":1,"hat":1,"in":1,"the":2  
doc 2: "cat":2,"brown":1,...

First need to tokenize to generate vocabulary
Unigram document frequency (DF):

cat:2
hat:1
the:2
...
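
The counts above can be reproduced with `collections.Counter`. This is a small sketch that assumes the two example documents have already been lowercased and tokenized.

```python
from collections import Counter

docs_tokens = [
    ['the', 'cat', 'in', 'the', 'hat'],
    ['the', 'quick', 'brown', 'cat', 'jumped', 'over', 'the', 'lazy', 'cat'],
]

# Term frequency: one Counter per document
tf = [Counter(tokens) for tokens in docs_tokens]

# Document frequency: each term counts at most once per document
df = Counter(term for tokens in docs_tokens for term in set(tokens))

print(tf[0]['the'], tf[1]['cat'])
print(df['cat'], df['hat'], df['the'])
```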

# CountVectorizer

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
docs = ['The cat in the hat.','The quick brown cat jumps over the lazy cat']
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(docs)

In [15]:
cv.vocabulary_

{'cat': 1, 'hat': 2, 'quick': 5, 'brown': 0, 'jumps': 3, 'lazy': 4}

In [16]:
X.todense()

matrix([[0, 1, 1, 0, 0, 0],
        [1, 2, 0, 1, 1, 1]])

In [17]:
docs = ['D.S. is fun!','It is true.']
cv = CountVectorizer()
X = cv.fit_transform(docs)

In [18]:
# what vocabulary was learned
cv.vocabulary_

{'is': 1, 'fun': 0, 'it': 2, 'true': 3}

In [19]:
cv = CountVectorizer(lowercase=True,
                     min_df=1,
                     max_df=1.0,
                     token_pattern='\\b\\S\\S+\\b',
                     ngram_range=(1,2),
                     stop_words='english')
X = cv.fit_transform(docs)
cv.vocabulary_

{'d.s': 0, 'fun': 2, 'd.s fun': 1, 'true': 3}

In [20]:
sorted([(y,x) for x,y in cv.vocabulary_.items()])

[(0, 'd.s'), (1, 'd.s fun'), (2, 'fun'), (3, 'true')]

In [21]:
X.todense()

matrix([[1, 1, 1, 0],
        [0, 0, 0, 1]])

# Tf-Idf

What if some terms are still uninformative?  
Can we downweight terms that are in many documents?  
Term Frequency - Inverse Document Frequency (TfIdf)

In [51]:
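from sklearn.feature_extraction.text import TfidfVectorizer
# reuse the two cat documents from the CountVectorizer example
docs = ['The cat in the hat.','The quick brown cat jumps over the lazy cat']
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)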

In [52]:
X

<2x9 sparse matrix of type '<class 'numpy.float64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [53]:
X.todense()

matrix([[0.        , 0.33425073, 0.46977774, 0.46977774, 0.        ,
         0.        , 0.        , 0.        , 0.66850146],
        [0.33241213, 0.47302794, 0.        , 0.        , 0.33241213,
         0.33241213, 0.33241213, 0.33241213, 0.47302794]])

In [54]:
tfidf.get_feature_names_out().tolist()  # get_feature_names() was removed in newer scikit-learn

['brown', 'cat', 'hat', 'in', 'jumps', 'lazy', 'over', 'quick', 'the']

In [22]:
docs = ['First sentence is a test.','Second sentence is also a test.']
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(token_pattern='\\b\\S\\S+\\b',
                        stop_words=['is'],
                        norm=None,  # no normalization
                        smooth_idf=False
                       )
X = tfidf.fit_transform(docs)
sorted([(y,x) for x,y in tfidf.vocabulary_.items()])

[(0, 'also'), (1, 'first'), (2, 'second'), (3, 'sentence'), (4, 'test')]

In [23]:
X.todense()

matrix([[0.        , 1.69314718, 0.        , 1.        , 1.        ],
        [1.69314718, 0.        , 1.69314718, 1.        , 1.        ]])
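
The values in this matrix can be checked by hand. With smoothing and normalization turned off, scikit-learn computes idf(t) = ln(n / df(t)) + 1 and multiplies it by the raw term count. The sketch below hard-codes the document frequencies read off the two example sentences; the variable names are for illustration only.

```python
import math

n_docs = 2
# document frequency of each term across the two sentences
df = {'also': 1, 'first': 1, 'second': 1, 'sentence': 2, 'test': 2}

# unsmoothed idf: ln(n / df) + 1
idf = {term: math.log(n_docs / count) + 1 for term, count in df.items()}

# first sentence: 'first', 'sentence' and 'test' each occur once
doc_1_tf = {'first': 1, 'sentence': 1, 'test': 1}
doc_1_tfidf = {term: count * idf[term] for term, count in doc_1_tf.items()}

print(doc_1_tfidf)
```

Terms that appear in every document get an idf of 1, while terms that appear in only one of the two documents get ln(2) + 1.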

# Part of Speech Tagging

In [24]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
print(f"{'text':7s} {'lemma':7s} {'pos':5s} {'is_stop'}")
print('-'*30)
for token in doc:
    print(f'{token.text:7s} {token.lemma_:7s} {token.pos_:5s} {token.is_stop}')

text    lemma   pos   is_stop
------------------------------
Apple   Apple   PROPN False
is      be      AUX   True
looking look    VERB  False
at      at      ADP   True
buying  buy     VERB  False
U.K.    U.K.    PROPN False
startup startup NOUN  False
for     for     ADP   True
$       $       SYM   False
1       1       NUM   False
billion billion NUM   False


In [27]:
from spacy import displacy
displacy.render(doc, style="dep")

In [28]:
[(ent.text,ent.label_) for ent in doc.ents]

[('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]

In [32]:
doc = nlp('Baseball is played on a diamond.')
doc[0].text

'Baseball'

In [30]:
doc[0].vector.shape

(96,)

In [31]:
list(doc[0].vector[:3])

[0.024354577, 2.3031363, 0.78963846]

In [45]:
# Use nlp.pipe to transform multiple docs at once
docs = list(nlp.pipe(['Baseball is played on a diamond.',
                      'Hockey is played on ice.',
                      'Diamonds are clear as ice.']))

In [55]:
import warnings
warnings.simplefilter(action='ignore') # to remove warning
# using average of token vectors for each document.
np.array([[docs[i].similarity(docs[j]) for j in range(3)]
 for i in range(3)])

array([[1.        , 0.76558144, 0.54887063],
       [0.76558144, 1.        , 0.64784397],
       [0.54887063, 0.64784397, 1.        ]])
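
As noted in the cell above, each document is represented by the average of its token vectors. spaCy's default similarity is, to the best of my knowledge, the cosine similarity between those averages. The sketch below shows that calculation on made-up 3-dimensional vectors; it does not use spaCy and the numbers are not real embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up token vectors for two short documents (one row per token)
doc_a_tokens = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
doc_b_tokens = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])

# Document vector = average of its token vectors
doc_a = doc_a_tokens.mean(axis=0)
doc_b = doc_b_tokens.mean(axis=0)

print(cosine_similarity(doc_a, doc_a))
print(cosine_similarity(doc_a, doc_b))
```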

# Topic Modeling using Latent Dirichlet Allocation

## What is topic modeling? 

- **topic**: a collection of related words
- a document can be composed of several topics

### Given a collection of documents, we can ask:

- What words make up each topic?

- What topics make up each document?

### What if we knew everything about our corpus?

In [9]:
vocab = ['baseball','cat','dog','pet','played','tennis']

V = len(vocab) # size of vocabulary

In [11]:
K = 2 # number of topics
# the probability of each term given topic 1
topic_1 = [.33,   0,   0,   0, .33, .33]
# the probability of each term given topic 2
topic_2 = [  0, .25, .25, .25, .25,   0]

In [12]:
# per topic word distributions
phi = [topic_1, topic_2]

In [13]:
print(np.array(phi).shape) # K x V (number of topics x size of vocabulary)
print(phi)

(2, 6)
[[0.33, 0, 0, 0, 0.33, 0.33], [0, 0.25, 0.25, 0.25, 0.25, 0]]


### If we had some documents, what topics make up each document?

In [14]:
corpus = ['the dog and cat played tennis',
          'tennis and baseball are sports',
          'a dog or a cat can be a pet']

# recall
vocab = ['baseball','cat','dog','pet','played','tennis']

phi = [[.33,   0,   0,   0, .33, .33],
       [  0, .25, .25, .25, .25,   0]]

In [16]:
# per document topic distributions
theta = [[.50, .50],
         [.99, .01],
         [.01, .99]]

In [17]:
print(np.array(theta).shape) # M x K (number of documents x number of topics)

(3, 2)
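
With both `theta` and `phi` in hand, the probability of each term in each document follows by mixing the topics: p(term | doc) = sum over topics of theta[doc, topic] * phi[topic, term]. A minimal sketch using the toy values above (the name `word_probs` is introduced here for illustration):

```python
import numpy as np

vocab = ['baseball', 'cat', 'dog', 'pet', 'played', 'tennis']

phi = np.array([[.33,   0,   0,   0, .33, .33],
                [  0, .25, .25, .25, .25,   0]])   # K x V

theta = np.array([[.50, .50],
                  [.99, .01],
                  [.01, .99]])                      # M x K

# (M x K) @ (K x V) -> M x V: per document term probabilities
word_probs = theta @ phi

for m, row in enumerate(word_probs):
    print(m, dict(zip(vocab, row.round(4))))
```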


### We can even generate a document

In [18]:
np.random.seed(123) # for demo purposes

N = 6 # number of tokens in document

In [19]:
new_theta = [.6,.4] # draw a topic distribution (theta die)

In [20]:
new_doc = []
for i in range(N):
    z = np.argmax(np.random.multinomial(1, new_theta)) # get a topic
    
    idx = np.argmax(np.random.multinomial(1,phi[z]))   
    x = vocab[idx]                                     # get a term
    
    new_doc.append(x)                                  # add to document

In [21]:
' '.join(new_doc)

'pet dog played tennis pet played'

### NOTE: But usually, we don't know the theta or phi!  
### We need to learn these from a set of documents (corpus)!

### Uses for $\phi$ (phi), the per topic word distributions:

- inferring labels for topics
- word clouds

### Uses for $\theta$ (theta), the per document topic weights:

- dimensionality reduction
- clustering
- similarity

### How do we learn phi ($\phi$) and theta ($\theta$)?

### Latent Dirichlet Allocation (LDA)

 - generative statistical model
 - *Blei, D., Ng, A., Jordan, M. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (Jan 2003)*
 

### Dirichlet Distribution

- Conjugate prior to the Multinomial Distribution
- Multinomial is like a "die"
- Dirichlet is like a "die factory"
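
To make the "die factory" picture concrete, the sketch below draws a few six-sided dice from two Dirichlet distributions with NumPy. The parameter values (0.2 and 10.0) are arbitrary choices for illustration: small values tend to give dice that favor a few faces, large values tend to give nearly fair dice.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for repeatability

# Each row is one "die": six face probabilities that sum to 1
lopsided_dice = rng.dirichlet([0.2] * 6, size=3)
even_dice = rng.dirichlet([10.0] * 6, size=3)

print(lopsided_dice.round(2))
print(even_dice.round(2))
```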

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png" style="width: 30%">

```
K     # number of topics
phi   # per topic word distributions
beta  # parameters for word distribution die factory, length = V
```

```
M     # number of documents
N     # number of words/tokens in each document
theta # per document topic distributions
alpha # parameters for topic die factory, length = K
```

```
z     # topic indexes
```

```
Dirichlet   # dirichlet distribution (aka die factory)
```

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png" style="width: 30%">

```
phi = []  # word distribution die, 1 per topic

# pseudocode to generate topic word distributions
for k in range(K):
    phi.append(Dirichlet(beta,V).get_die())  # generate word distribution die
```

```
corpus = []

# pseudocode to generate corpus
for m in range(M):
    document_m = []
    
    theta_m = Dirichlet(alpha,K).get_die()   # generate a topic die
    
    for n in range(N):
        z_mn = theta_m.get_topic()     # roll topic die
        w_mn = phi[z_mn].get_word()    # roll word distribution die
        
        document_m.append(w_mn)
    
    corpus.append(document_m)
```
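
The pseudocode above could be made runnable with NumPy along the following lines. This is a sketch, not the reference implementation: the values of `alpha`, `beta`, `M` and `N` are assumptions picked for the demo, and every document is given the same length.

```python
import numpy as np

rng = np.random.default_rng(123)  # seeded for repeatability

vocab = ['baseball', 'cat', 'dog', 'pet', 'played', 'tennis']
V = len(vocab)
K, M, N = 2, 3, 6            # topics, documents, tokens per document (assumed)
alpha = np.full(K, 0.5)      # topic die factory parameters (assumed)
beta = np.full(V, 0.5)       # word die factory parameters (assumed)

# one word distribution die per topic, shape K x V
phi = rng.dirichlet(beta, size=K)

corpus = []
for m in range(M):
    theta_m = rng.dirichlet(alpha)         # generate a topic die for this document
    document_m = []
    for n in range(N):
        z_mn = rng.choice(K, p=theta_m)    # roll the topic die
        w_mn = rng.choice(V, p=phi[z_mn])  # roll that topic's word die
        document_m.append(vocab[w_mn])
    corpus.append(document_m)

for document in corpus:
    print(' '.join(document))
```

Learning LDA runs this story in reverse: given only `corpus`, estimate the `phi` and `theta` values that most plausibly produced it.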

## Review

### Things we know: 

 - M : the number of documents
 - N : the length of each document
 

### Things we choose:

 - K : the number of topics
 - V : our vocabulary

### Things we want to learn: 

 - $\theta$'s (theta's) : the per document topic weights
 - $\phi$'s (phi's) : the per topic word weights

#### Note:

We may want to infer $\alpha$ and $\beta$ as well

## Example using sklearn

In [58]:
import warnings # to deal with deprecation warnings

from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups()
X = newsgroups.data
len(X)

11314

In [59]:
# example document
X[4].replace('\n',' ')[:200]

'From: jcm@head-cfa.harvard.edu (Jonathan McDowell) Subject: Re: Shuttle Launch Question Organization: Smithsonian Astrophysical Observatory, Cambridge, MA,  USA Distribution: sci Lines: 23  From artic'

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=50, stop_words='english')

In [61]:
# transform our documents (this might take a moment)
X_tfidf = tfidf.fit_transform(X)
X_tfidf.shape

(11314, 4175)

In [69]:
# this is our vocabulary (the column names of our dataset)
feature_names = tfidf.get_feature_names_out().tolist()  # get_feature_names() was removed in newer scikit-learn
print(feature_names[:10])
print(feature_names[-10:])

['00', '000', '01', '02', '03', '04', '05', '06', '07', '08']
['ysu', 'za', 'zealand', 'zero', 'zeus', 'zip', 'zone', 'zoo', 'zuma', 'zx']


In [64]:
from sklearn.decomposition import LatentDirichletAllocation

warnings.simplefilter(action='ignore', category=DeprecationWarning) # to remove warning

In [70]:
# create model with 20 topics
lda = LatentDirichletAllocation(n_components=20,  # the number of topics
                                n_jobs=-1,        # use all cpus
                                random_state=123) # for reproducibility

In [71]:
# learn phi and theta (lda.components_ and X_lda)
# this will take a while!
X_lda = lda.fit_transform(X_tfidf)

In [72]:
X_lda[100] # lda representation of document_100

array([0.00684573, 0.00684574, 0.00684573, 0.00684573, 0.00684573,
       0.00684573, 0.00684573, 0.00684573, 0.00684573, 0.00684573,
       0.00684573, 0.00684573, 0.00684573, 0.00684573, 0.00684573,
       0.51803048, 0.263115  , 0.00684573, 0.1024771 , 0.00684573])

In [74]:
np.argsort(X_lda[100])[::-1][:3] # the top topics of document_100
# argsort returns indexes ordered by ascending value; reverse them and keep the first (largest) three

array([15, 16, 18])

In [75]:
# a utility function to print out the most likely terms for each topic
# taken from https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic {:#2d}: ".format(topic_idx)
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)

In [76]:
print_top_words(lda,feature_names,5)

Topic  0: msg unc umich food duke
Topic  1: edu com writes article subject
Topic  2: uni informatik upenn sas tu
Topic  3: uga ai portal georgia covington
Topic  4: nasa gov space digex access
Topic  5: stratus sw cdt tavares rocket
Topic  6: sandvik kent uiuc newton cso
Topic  7: window windows card mouse color
Topic  8: uci rpi dartmouth orion oac
Topic  9: team game hockey players games
Topic 10: cleveland cwru freenet ins reserve
Topic 11: caltech keith sgi morality wpi
Topic 12: scsi bus ide isa mit
Topic 13: valley arbor ann hal verse
Topic 14: gtoal toal hp hewlett packard
Topic 15: windows dos edu file thanks
Topic 16: sale 00 edu cmu andrew
Topic 17: bmp jupiter south geometry tar
Topic 18: henry drive toronto simms ohio
Topic 19: gatech prism nyx georgia du


## Example using gensim (data loss here)

from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups()
corpus_fname = '../../../scikit_learn_data/20news-bydate_py3.data.txt'
with open(corpus_fname,'w') as f:
    for doc in newsgroups.data[:1000]:
        f.write(doc.replace('\n',' ') + '\n')

### What words make up each topic?

### What topics make up each document?