In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt

In [2]:

%matplotlib inline
np.set_printoptions(suppress=True)

# Get data

In [3]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

In [4]:
type(newsgroups_train)

sklearn.utils.Bunch

In [5]:
newsgroups_train.target_names

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

In [6]:
print(newsgroups_train.data[0],newsgroups_train.filenames[0])

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych /home/quantran/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38816


In [7]:
newsgroups_train.target[0]

1

In [8]:
# the data has only 4 categories, but we want to see how the model splits it when we ask for a different # of topics
num_of_topics = 6
num_top_words=8

# Stop words

Removing stop words is going out of style. It is only suitable when you have a very simple model and want it to focus on the more important words, at the cost of discarding whatever information the stop words carry.

Neural networks, on the other hand, can handle stop words, and you should keep them so that a language model can capture all the information it needs.
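
To make the idea concrete, here is a minimal sketch of stop-word removal before counting. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are made up for this illustration; `CountVectorizer(stop_words='english')` relies on a much longer built-in list.

```python
# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "a", "of", "to", "and", "in"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

sentence = "The orbit of the probe is close to the moon"
kept = remove_stop_words(sentence)
print(kept)
```

Only the content-bearing words survive, which is what a simple count-based model benefits from.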

# Stemming and lemmatization

Both involve generating the ROOT form of the word

- Lemmatization: applies language rules. The resulting tokens are actual words, e.g. foot and footing stay 2 different words, but feet and foot become 1 word.
- Stemming: just chops off the ends of words. Faster, but the resulting tokens are not always real words. E.g. foot and footing become 1 word (foot), but foot and feet stay 2 different words (because of their different endings). Universe and university also collapse into 1 stem, which they should not.
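
The difference can be sketched with two toy functions. Both are assumptions for illustration: `toy_stem` is a naive suffix stripper (not Porter or any real stemming algorithm), and `LEMMA_TABLE` is a hand-written lookup with made-up entries standing in for real language rules.

```python
# Toy stemmer: strip the first matching suffix. Not a real stemming algorithm.
SUFFIXES = ("ing", "ity", "es", "s", "e")

def toy_stem(word):
    for suffix in SUFFIXES:
        # keep at least 3 characters so short words are not destroyed
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a hand-written lookup table (hypothetical entries).
LEMMA_TABLE = {"feet": "foot", "flies": "fly", "flying": "fly"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

words = ["foot", "feet", "footing", "universe", "university"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The stemmer merges foot/footing and universe/university but misses feet, while the lookup-based lemmatizer maps feet to foot and leaves the others as real words.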

In [9]:
import spacy

In [10]:
from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()

In [11]:
word_list= ["fly", "flies", "flying",
            "organize", "organizes", "organizing",
            "universe", "university",
           "foot","feet","footing"]

In [12]:
[lemmatizer.lookup(word) for word in word_list]
# Lemmatizer() was created without any lookup table, so lookup() returns every word unchanged

['fly',
 'flies',
 'flying',
 'organize',
 'organizes',
 'organizing',
 'universe',
 'university',
 'foot',
 'feet',
 'footing']

# Data processing - Bag of words

In [13]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [14]:
import nltk

In [15]:
vectorizer = CountVectorizer(stop_words='english')

In [16]:

vectors = vectorizer.fit_transform(newsgroups_train.data).todense() 
# (documents, vocab)
vectors.shape

(2034, 26576)

In [17]:
# on scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
vocab = np.array(vectorizer.get_feature_names())

In [18]:
vocab.shape

(26576,)

In [19]:
vocab[5000:5010]

array(['brow', 'brown', 'browning', 'browns', 'browse', 'browsing',
       'bruce', 'bruces', 'bruise', 'bruised'], dtype='<U80')
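
To see what the (documents, vocab) matrix holds, here is a small hand-rolled bag of words on three made-up documents. It is only a sketch: it splits on whitespace, whereas `CountVectorizer` also lowercases, tokenizes with a regex, and returns a sparse matrix.

```python
import numpy as np

docs = ["space probe moon", "moon moon orbit", "image file image"]

# Vocabulary: sorted unique tokens (CountVectorizer also orders its features alphabetically).
vocab_small = sorted({tok for doc in docs for tok in doc.split()})
index = {tok: j for j, tok in enumerate(vocab_small)}

# counts[i, j] = number of times vocab_small[j] occurs in docs[i]
counts = np.zeros((len(docs), len(vocab_small)), dtype=int)
for i, doc in enumerate(docs):
    for tok in doc.split():
        counts[i, index[tok]] += 1

print(vocab_small)
print(counts)
```

Each row is a document and each column a word, so word order inside a document is lost.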

# SVD

![](images/svd_fb.png)

SVD is an exact decomposition: the product of the 3 matrices reproduces the original matrix. The singular values are unique, and the singular vectors are unique up to sign when the singular values are distinct.

Columns of U (one value per doc) and rows of Vt (one value per vocab word) are **orthonormal**: the dot product of any two different vectors is 0 (they are perpendicular), and each vector is a unit vector (length = 1).

Note that a set of orthonormal vectors is **linearly independent**, meaning no vector in the set can be written as a combination of the other vectors in that set.
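
These properties are easy to check on a small random matrix. The sketch below uses `np.linalg.svd` and made-up variable names (`U_small`, `Vh_small`) so it does not touch the notebook's own variables.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U_small, s_small, Vh_small = np.linalg.svd(A, full_matrices=False)

# Columns of U: U^T U = I (each column has length 1, distinct columns are perpendicular)
cols_orthonormal = np.allclose(U_small.T @ U_small, np.eye(4))
# Rows of Vh: Vh Vh^T = I
rows_orthonormal = np.allclose(Vh_small @ Vh_small.T, np.eye(4))
# Orthonormal vectors are linearly independent, so U has full column rank.
rank = np.linalg.matrix_rank(U_small)
print(cols_orthonormal, rows_orthonormal, rank)
```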

In [20]:
doc_vocab = vectors

## Full SVD

For the full SVD, both U and V are square matrices. The extra columns in U (or V, depending on the shape of your matrix) complete an orthonormal basis, but they are zeroed out when multiplied by the extra rows (or columns) of zeros in S.

Those extra columns don't come from the original matrix.
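
A small sketch of why the extra vectors do not matter, using a made-up 3 x 5 matrix (wider than tall, like our doc vs vocab matrix). Here the extras are rows of `Vh_full`, and they meet the zero columns of the padded `S`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))          # 3 "documents", 5 "words"

U_full, s_full, Vh_full = np.linalg.svd(A, full_matrices=True)
print(U_full.shape, s_full.shape, Vh_full.shape)

# Pad the singular values into a 3x5 matrix: the last 2 columns are all zeros,
# so the last 2 rows of Vh_full contribute nothing to the product.
S = np.zeros((3, 5))
S[:3, :3] = np.diag(s_full)
reconstructed_full = U_full @ S @ Vh_full

# Dropping those 2 extra rows of Vh gives the same product (the reduced SVD).
reconstructed_reduced = U_full @ np.diag(s_full) @ Vh_full[:3]
print(np.allclose(reconstructed_full, reconstructed_reduced))
```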

In [21]:
%time U, s, Vh = linalg.svd(doc_vocab, full_matrices=True)

CPU times: user 2min 39s, sys: 2.32 s, total: 2min 41s
Wall time: 42.4 s


In [23]:
print(U.shape, s.shape, Vh.shape) # U and Vh are square; s holds only the singular values

(2034, 2034) (2034,) (26576, 26576)


![](images/Selection_033.png)

## Reduced SVD

In [25]:
%time U, s, Vh = linalg.svd(doc_vocab, full_matrices=False)

CPU times: user 36.3 s, sys: 1.13 s, total: 37.4 s
Wall time: 9.96 s


In [26]:
print(U.shape, s.shape, Vh.shape)

(2034, 2034) (2034,) (2034, 26576)


In [33]:
np.diag(s).shape # to be consistent with the graph above

(2034, 2034)

- U: docs vs "topics" (the # of topics is determined automatically in this case, aka unsupervised)
- s: singular values (the diagonal of S), giving the importance of each topic
- Vh: topics vs vocab

In [32]:
doc_vocab.shape

(2034, 26576)

In [27]:
s[:4] # singular values (importance of each topic), in DESCENDING ORDER

array([433.92698542, 291.51012741, 240.71137677, 220.00048043])

In [31]:
# np.diag builds a diagonal matrix from a vector; applied again, it extracts the diagonal back
np.diag(np.diag(s[:4]))

array([433.92698542, 291.51012741, 240.71137677, 220.00048043])

In [34]:
# check the 3 matrices are the decomposition of the main one
temp = U@ np.diag(s) @ Vh
np.allclose(temp,vectors)

True

In [39]:
# check that U and Vh are orthonormal

np.allclose(U@U.T, np.eye(U.shape[0]))
# each vector dotted with itself is 1 (unit vector),
# and dotted with any other vector is 0 (perpendicular)

True

In [45]:
np.allclose(Vh@Vh.T, np.eye(Vh.shape[0]))

True

In [46]:
Vh.T.shape,Vh.shape

((26576, 2034), (2034, 26576))

## Delve into 'topics'

Remember that Vh is topics vs vocab matrix.

In [53]:
Vh[:10] # vocab value of first 10 topics

array([[-0.00940972, -0.0114532 , -0.00002169, ..., -0.00000572,
        -0.00001144, -0.00109243],
       [-0.00356688, -0.01769167, -0.00003045, ..., -0.00000773,
        -0.00001546, -0.0018549 ],
       [ 0.00094971, -0.02282845, -0.00002339, ..., -0.0000122 ,
        -0.0000244 ,  0.00150538],
       ...,
       [-0.00218087, -0.04322025, -0.00012552, ...,  0.00003759,
         0.00007518,  0.00160907],
       [-0.00039196,  0.00494894,  0.00000309, ..., -0.00001321,
        -0.00002643, -0.00015038],
       [ 0.00306552, -0.01437264, -0.00000405, ..., -0.00003597,
        -0.00007193,  0.00056218]])

In [59]:

num_top_words=8

def show_topics(a):
    # get indices of 8 highest vocab value for each topic, and get the words associated with each index
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = [top_words(t) for t in a]
    return [' '.join(t) for t in topic_words]
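
The slice inside `top_words` is terse, so here is the same trick on a made-up weight vector. The names `weights`, `words` and `n_top` are only for this illustration.

```python
import numpy as np

weights = np.array([0.1, 0.7, -0.3, 0.5, 0.2])
words = np.array(["edu", "space", "god", "nasa", "image"])
n_top = 3

# argsort is ascending; the slice [:-n_top-1:-1] walks backwards from the end
# and stops after n_top items, giving the indices of the largest values first.
top_idx = np.argsort(weights)[:-n_top - 1:-1]
top = words[top_idx].tolist()
print(top)
```

Note that a word with a large negative weight (`god` here) is never selected, because the ranking is by signed value, not magnitude.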

In [62]:
for i,v in enumerate(show_topics(Vh[:10])): 
    # 8 highest vocab value (from Vh) for each of 10 topics
    print(f'Topic {i}: {v}')

Topic 0: critus ditto propagandist surname galacticentric kindergarten surreal imaginative
Topic 1: jpeg gif file color quality image jfif format
Topic 2: graphics edu pub mail 128 3d ray ftp
Topic 3: jesus god matthew people atheists atheism does graphics
Topic 4: image data processing analysis software available tools display
Topic 5: god atheists atheism religious believe religion argument true
Topic 6: space nasa lunar mars probe moon missions probes
Topic 7: image probe surface lunar mars probes moon orbit
Topic 8: argument fallacy conclusion example true ad argumentum premises
Topic 9: space larson image theory universe physical nasa material


Since this is unsupervised, the topics come without labels, but we can get an idea of what each topic is about from the vocabulary associated with it.

# Non-negative matrix factorization (NMF)

Similar to SVD, but instead of constraining our vectors to be orthogonal, we constrain them to be **non-negative**. (Note that SVD allows negative values in its matrices, and **it is not clear how to interpret them**.)

NMF is a factorization of a non-negative data set $V$:

$V = WH$

While **SVD is an exact factorization, NMF is NON-exact ($V \approx WH$) and has many possible solutions** (the variations come from the different constraints placed on NMF).
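
One simple way to compute such a factorization is the classic multiplicative update rule. The sketch below is an assumption-laden toy: `nmf_multiplicative` is a made-up helper, it runs on a random matrix, and it is not how `decomposition.NMF` is implemented (scikit-learn uses a different solver by default).

```python
import numpy as np

def nmf_multiplicative(V, d, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V (m x n) as W (m x d) @ H (d x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, d))
    H = rng.random((d, n))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
V_small = rng.random((20, 12))           # toy non-negative "doc x vocab" matrix
W_a, H_a = nmf_multiplicative(V_small, d=3, seed=0)
W_b, H_b = nmf_multiplicative(V_small, d=3, seed=42)

err_a = np.linalg.norm(V_small - W_a @ H_a)
err_b = np.linalg.norm(V_small - W_b @ H_b)
print(err_a, err_b)                      # non-zero: the factorization is approximate
print(np.allclose(W_a, W_b))             # different starts generally give different factors
```

The reconstruction error stays above zero, and two different random starts typically end at different `W` and `H`, which is the non-uniqueness mentioned above.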

![](images/nmf_doc.png)

In [64]:
m,n=doc_vocab.shape
d=5  # num topics
m,n

(2034, 26576)

In [65]:
clf = decomposition.NMF(n_components=d, random_state=1)

W1 = clf.fit_transform(doc_vocab) # doc vs topic
H1 = clf.components_ # topic vs vocab (or words)

In [68]:
W1.shape,H1.shape

((2034, 5), (5, 26576))

In [70]:
show_topics(H1)

['jpeg image gif file color images format quality',
 'edu graphics pub mail 128 ray ftp send',
 'space launch satellite nasa commercial satellites year market',
 'jesus god people matthew atheists does atheism said',
 'image data available software processing ftp edu analysis']

# TF-IDF and using NMF on TF-IDF

TF = (# occurrences of term t in a document) / (# of words in that document)

IDF = log(# of documents / # of documents containing term t)

TF-IDF = TF * IDF, which gives a matrix of docs vs vocab

=> a word is important to a document if it appears several times in that document but rarely in other documents


This is a better version of bag-of-words. But like BOW, it doesn't take word order into account.
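
The formulas above can be applied directly to a small count matrix. This is a sketch on made-up counts; `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import numpy as np

# Toy count matrix: 3 documents x 4 terms (rows = docs, columns = terms)
terms = ["space", "moon", "image", "the"]
counts = np.array([[2, 1, 0, 3],
                   [0, 0, 4, 2],
                   [1, 0, 0, 1]], dtype=float)

tf = counts / counts.sum(axis=1, keepdims=True)   # term count / words in that document
df = (counts > 0).sum(axis=0)                     # number of documents containing each term
idf = np.log(counts.shape[0] / df)                # log(N / df)
tfidf = tf * idf                                  # idf broadcasts across rows

print(np.round(tfidf, 3))
```

The term "the" appears in every document, so its IDF is log(1) = 0 and its whole column is zeroed out, while "image" (frequent in one document only) gets the largest weight.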

In [71]:
vectorizer_tfidf = TfidfVectorizer(stop_words='english')
vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data) # (documents, vocab)

In [75]:
vectors_tfidf.shape

(2034, 26576)

In [73]:
W1 = clf.fit_transform(vectors_tfidf)
H1 = clf.components_ # topic vs vocab

In [74]:
show_topics(H1)

['people don think just like objective say morality',
 'graphics thanks files image file program windows know',
 'space nasa launch shuttle orbit moon lunar earth',
 'ico bobbe tek beauchaine bronx manhattan sank queens',
 'god jesus bible believe christian atheism does belief']

# Truncated SVD

The full SVD produces too many topics. In NMF you can set the # of topics, so it only computes the subset of topics you are interested in.

Truncated SVD tries to achieve the same thing: we are only interested in the vectors corresponding to the largest singular values (from the diagonal matrix), so we throw away the smallest singular values, which effectively throws away the matching columns and rows of the other 2 matrices.
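
A minimal sketch of the truncation step on a made-up matrix. For simplicity it computes the whole SVD first and then slices, which is exactly the cost a real truncated solver tries to avoid; the variable names are only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))
U_r, s_r, Vh_r = np.linalg.svd(A, full_matrices=False)

k = 2  # number of "topics" to keep
# Keep the k largest singular values and the matching columns of U / rows of Vh.
U_k, s_k, Vh_k = U_r[:, :k], s_r[:k], Vh_r[:k]
A_k = U_k @ np.diag(s_k) @ Vh_k           # rank-k approximation of A

error = np.linalg.norm(A - A_k)           # Frobenius norm of what was thrown away
expected = np.sqrt(np.sum(s_r[k:] ** 2))  # energy in the discarded singular values
print(U_k.shape, s_k.shape, Vh_k.shape, np.isclose(error, expected))
```

The approximation error is determined entirely by the discarded singular values, which is why dropping the smallest ones loses the least.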


# Cons of using classical decomposition

- Matrices are sometimes too big
- Input data are often inaccurate/missing => limit the precision and quality of the matrices
- Expensive computation


# Randomized SVD

In [76]:
from sklearn import decomposition

In [77]:

%time u, s, v = decomposition.randomized_svd(doc_vocab, 10)

CPU times: user 5.63 s, sys: 586 ms, total: 6.22 s
Wall time: 1.66 s


In [78]:
u.shape,s.shape,v.shape

((2034, 10), (10,), (10, 26576))

In [79]:
show_topics(v[:10])

['jpeg image edu file graphics images gif data',
 'jpeg gif file color quality image jfif format',
 'space jesus launch god people satellite matthew atheists',
 'jesus god matthew people atheists atheism does graphics',
 'image data processing analysis software available tools display',
 'jesus matthew prophecy messiah psalm isaiah david said',
 'launch commercial satellite market image services satellites launches',
 'data available nasa ftp grass anonymous contact gov',
 'argument fallacy conclusion example true ad argumentum premises',
 'probe data surface moon mars probes lunar launch']

For more on randomized SVD, check out my PyBay 2017 talk: https://www.youtube.com/watch?v=7i6kBz1kZ-A&list=PLtmWHNX-gukLQlMvtRJ19s7-8MrnRV6h6&index=7

For significantly more on randomized SVD, check out the Computational Linear Algebra course: https://github.com/fastai/numerical-linear-algebra

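Note: Randomized SVD can come close to the error of the classical SVD for the top components, with a much faster runtime (1.66 s vs 9.96 s wall time for the reduced SVD above).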
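
For intuition, here is a rough sketch of the random-projection idea behind randomized SVD. `randomized_svd_sketch` is a made-up helper, not scikit-learn's implementation (which adds refinements such as power iterations), and the test matrix is built to be exactly rank 5, a best case where the result matches the exact SVD.

```python
import numpy as np

def randomized_svd_sketch(A, k, n_oversamples=10, seed=0):
    """Rough rank-k SVD via a random projection (no power iterations)."""
    rng = np.random.default_rng(seed)
    # 1. Sample the column space of A with a random Gaussian test matrix.
    omega = rng.normal(size=(A.shape[1], k + n_oversamples))
    Q, _ = np.linalg.qr(A @ omega)           # orthonormal basis for the sampled range
    # 2. Project A onto that basis and take the SVD of the small matrix.
    B = Q.T @ A
    U_b, s, Vh = np.linalg.svd(B, full_matrices=False)
    # 3. Map the left singular vectors back to the original space.
    return (Q @ U_b)[:, :k], s[:k], Vh[:k]

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 300))   # exactly rank 5

u_rand, s_rand, vh_rand = randomized_svd_sketch(A, k=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
print(np.allclose(s_rand, s_exact))
```

The expensive SVD only ever runs on the small projected matrix `B` (15 x 300 here instead of 200 x 300), which is where the speedup comes from. On real data whose singular values decay slowly, the result is an approximation rather than an exact match.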