## The problem

Topic modeling is a fun way to start our study of NLP. We will use two popular **matrix decomposition techniques**.

We start with a **term-document matrix**:

![[Pasted image 20220607090648.png | 750]]

We can decompose this into one tall thin matrix times one wide short matrix (possibly with a diagonal matrix in between).
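
As a rough sketch of what such a decomposition looks like, the snippet below factors a small made-up term-document matrix with NumPy's SVD. The counts in `A`, the choice of rows as terms and columns as documents, and the cutoff `k = 2` are all assumptions for illustration; they are not taken from the newsgroups data used later.

```python
import numpy as np

# Made-up term-document counts: 5 terms (rows) x 4 documents (columns).
A = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 4, 3],
    [0, 0, 2, 3],
], dtype=float)

# Reduced SVD: A = U @ diag(s) @ Vt, with the singular values s on the diagonal.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_full = U @ np.diag(s) @ Vt

# Keep only the k largest singular values to get a tall thin matrix (5 x k)
# and a wide short matrix (k x 4).
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_k = U_k @ np.diag(s_k) @ Vt_k

print(U_k.shape, s_k.shape, Vt_k.shape)
print(np.allclose(A, A_full))  # the full product recovers A up to rounding
print(np.round(A_k, 1))        # the rank-k product only approximates A
```

Keeping all singular values reproduces the matrix, while keeping only the first `k` gives a lower-rank approximation built from two much smaller factors.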

### Motivation

Consider the most extreme case

## Getting started

In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt

%matplotlib inline
np.set_printoptions(suppress=True)


In [None]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)
newsgroups_train.filenames.shape, newsgroups_train.target.shape

In [None]:
print("\n".join(newsgroups_train.data[:3]))

In [None]:
np.array(newsgroups_train.target_names)[newsgroups_train.target[:3]]

In [None]:
newsgroups_train.target[:10]

In [None]:
num_topics, num_top_words = 6, 8

## Stop words, stemming, lemmatization

### Stop words

From Intro to Information Retrieval:

Some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.

The general trend in IR systems over time has been from standard use of quite large stop lists (200-300 terms) to very small stop lists (7-12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists.
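
To make the idea concrete, here is a minimal sketch of stop-word removal using a very small stop list. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names chosen for this example, and the whitespace tokenization is a simplifying assumption.

```python
# Hypothetical hand-picked stop list, for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

tokens = remove_stop_words("The orbit of the moon is stable")
print(tokens)  # ['orbit', 'moon', 'stable']
```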

### NLTK

In [None]:
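# The stop list lives in sklearn.feature_extraction.text;
# the old sklearn.feature_extraction.stop_words module is gone in recent scikit-learn releases.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# ENGLISH_STOP_WORDS is a frozenset, so sort it to get a stable preview
sorted(ENGLISH_STOP_WORDS)[:20]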

### Stemming and lemmatization

From Intro to Information Retrieval:

Are the words below the same?

organize, organizes, and organizing

democracy, democratic, and democratization

Stemming and lemmatization both reduce a word to a root form.

Lemmatization uses rules about the language. The resulting tokens are all actual words.

"Stemming is the poor-man’s lemmatization." (Noah Smith, 2011) Stemming is a crude heuristic that chops the ends off of words. The resulting tokens may not be actual words, but stemming is faster.

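To see why chopping the ends off words can produce non-words, consider this toy suffix stripper. It is not the Porter algorithm and not any library's implementation; the `SUFFIXES` tuple and the `crude_stem` function are hypothetical, made up for this example.

```python
# Hypothetical suffix list, checked longest first; for illustration only.
SUFFIXES = ("ization", "izing", "izes", "ize", "ing", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

organize_stems = [crude_stem(w) for w in ["organize", "organizes", "organizing"]]
pony_stems = [crude_stem(w) for w in ["pony", "ponies"]]

print(organize_stems)  # ['organ', 'organ', 'organ']
print(pony_stems)      # ['pony', 'poni']
```

The three forms of "organize" collapse to one token, which is the goal, but "ponies" becomes "poni", which is not a word. A lemmatizer would instead need knowledge of the language to map "ponies" back to "pony".
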
In [None]:
import nltk
nltk.download('wordnet')

In [None]:
from nltk import stem

In [None]:
wnl = stem.WordNetLemmatizer()
porter = stem.porter.PorterStemmer()

In [None]:
word_list = ['feet', 'foot', 'foots', 'footing']

In [None]:
[wnl.lemmatize(word) for word in word_list]

In [None]:
[porter.stem(word) for word in word_list]

Your turn! Now, try lemmatizing and stemming the following collections of words:

- fly, flies, flying
- organize, organizes, organizing
- universe, university


Stemming and lemmatization are language dependent. Languages with more complex morphologies may show bigger benefits. For example, Sanskrit has a very large number of verb forms.

### spaCy

Stemming and lemmatization are implementation dependent.

spaCy is a modern, fast NLP library. It is opinionated: it typically offers one highly optimized way to do something, whereas NLTK offers a wide variety of approaches that are usually less optimized.

You will need to install it.

If you use conda:

conda install -c conda-forge spacy

If you use pip:

pip install -U spacy

You will then need to download the English model:

python -m spacy download en_core_web_sm