<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Word-Counts-with-Bag-of-Words" data-toc-modified-id="Word-Counts-with-Bag-of-Words-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Word Counts with Bag-of-Words</a></span></li><li><span><a href="#Simple-Text-Preprocessing" data-toc-modified-id="Simple-Text-Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Simple Text Preprocessing</a></span></li><li><span><a href="#Introduction-to-Gensim" data-toc-modified-id="Introduction-to-Gensim-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Introduction to Gensim</a></span><ul class="toc-item"><li><span><a href="#Creating-a-Gensim-Dictionary" data-toc-modified-id="Creating-a-Gensim-Dictionary-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Creating a Gensim Dictionary</a></span></li></ul></li><li><span><a href="#Tf-idf-with-Gensim" data-toc-modified-id="Tf-idf-with-Gensim-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Tf-idf with Gensim</a></span></li></ul></div>

## Word Counts with Bag-of-Words

```
Counter()
counter.most_common()
```

In [27]:
string = "I love cats and my CAT 喵喵 is chased by my lovely dog and another cat."

In [21]:
from nltk import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
from matplotlib import pyplot as plt

In [28]:
tokens = word_tokenize(string.lower())
print(tokens)

['i', 'love', 'cats', 'and', 'my', 'cat', '喵喵', 'is', 'chased', 'by', 'my', 'lovely', 'dog', 'and', 'another', 'cat', '.']


In [29]:
tokens = [token for token in tokens if token.isalpha()]
print(tokens)

['i', 'love', 'cats', 'and', 'my', 'cat', '喵喵', 'is', 'chased', 'by', 'my', 'lovely', 'dog', 'and', 'another', 'cat']


In [30]:
english_stops = set(stopwords.words('english'))   # build the stop word set once
no_stop = [t for t in tokens if t not in english_stops]
print(no_stop)

['love', 'cats', 'cat', '喵喵', 'chased', 'lovely', 'dog', 'another', 'cat']


In [31]:
count = Counter(no_stop)

In [32]:
count.most_common(3)

[('cat', 2), ('cats', 1), ('dog', 1)]
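
The same counting steps can be tried without NLTK. The following is only a minimal sketch: it assumes a simple regex tokenizer and a tiny hand-picked stop word set (`STOP_WORDS` and `bag_of_words` are hypothetical names, not part of NLTK), so its tokens will not always match those of `word_tokenize`.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word set (a stand-in for NLTK's list).
STOP_WORDS = {"i", "and", "my", "is", "by"}

def bag_of_words(text):
    """Lowercase, keep alphabetic tokens, drop stop words, count the rest."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of Unicode letters
    return Counter(t for t in tokens if t not in STOP_WORDS)

counts = bag_of_words("I love cats and my CAT is chased by my lovely dog and another cat.")
print(counts.most_common(1))  # [('cat', 2)]
```

Note that 'cat' and 'cats' are still counted separately here; merging them is what lemmatization in the next section is for.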

## Simple Text Preprocessing

![preprocessing](img/preprocessing.png)

https://blog.csdn.net/lt326030434/article/details/85240591

In [2]:
from nltk.stem import WordNetLemmatizer

In [33]:
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmatized)

['i', 'love', 'cat', 'and', 'my', 'cat', '喵喵', 'is', 'chased', 'by', 'my', 'lovely', 'dog', 'and', 'another', 'cat']


## Introduction to Gensim

**Popular open-source NLP library**

Uses top academic models to perform complex tasks:

- Building document or word vectors

- Performing topic identification and document comparison

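> Gensim is an open-source third-party Python toolkit that learns latent topic-vector representations of text from raw, **unstructured** text in an **unsupervised** way. It supports several models and algorithms, including TF-IDF, LSA, LDA, and word2vec.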

Bag-of-words features are sparse: most entries are zero, and only a few are non-zero.
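
To make the idea of sparsity concrete, here is a small sketch with made-up ids and counts. It stores only the non-zero entries as `(token_id, count)` pairs and expands them into a dense vector for comparison; `to_dense` is a hypothetical helper, not a Gensim function.

```python
def to_dense(bow, vocab_size):
    """Expand sparse (token_id, count) pairs into a dense count vector."""
    dense = [0] * vocab_size
    for token_id, count in bow:
        dense[token_id] = count
    return dense

# Made-up document that uses 3 of 10 vocabulary entries.
sparse_doc = [(1, 1), (4, 2), (7, 1)]
dense_doc = to_dense(sparse_doc, vocab_size=10)
print(dense_doc)  # [0, 1, 0, 0, 2, 0, 0, 1, 0, 0]
```

With a realistic vocabulary of tens of thousands of tokens, the sparse form is far smaller than the dense one.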

**Word vectors** are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between words in a corpus.

LDA (latent Dirichlet allocation) is one of the topic models Gensim supports.

### Creating a Gensim Dictionary

In [34]:
from gensim.corpora.dictionary import Dictionary

In [43]:
my_documents = ['The movie was about a spaceship and aliens.',
                'I really liked the movie!',
                'Awesome action scenes, but boring characters.',
                'The movie was awful! I hate alien films.',
                'Space is cool! I liked the movie.',
                'More space films, please!']

tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]

In [125]:
dictionary = Dictionary(tokenized_docs)
print(dictionary)

Dictionary(29 unique tokens: ['.', 'liked', 'but', 'awesome', 'boring']...)


In [54]:
# select the id for 'films'
film_id = dictionary.token2id.get('films')

print(film_id)

# use token_id with the dictionary to print the word
print(dictionary.get(film_id))

dictionary.token2id

22
films


{'!': 9,
 ',': 13,
 '.': 0,
 'a': 1,
 'about': 2,
 'action': 14,
 'alien': 20,
 'aliens': 3,
 'and': 4,
 'awesome': 15,
 'awful': 21,
 'boring': 16,
 'but': 17,
 'characters': 18,
 'cool': 24,
 'films': 22,
 'hate': 23,
 'i': 10,
 'is': 25,
 'liked': 11,
 'more': 27,
 'movie': 5,
 'please': 28,
 'really': 12,
 'scenes': 19,
 'space': 26,
 'spaceship': 6,
 'the': 7,
 'was': 8}

In [44]:
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
# each tuple is (token_id, frequency of that token in the document)
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
 [(0, 1),
  (5, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1)],
 [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)],
 [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)]]
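
As a rough mental model of the two steps above, the pure-Python sketch below builds a token-to-id mapping and converts a document into `(token_id, count)` pairs. `build_token2id` and `simple_doc2bow` are hypothetical helpers written for illustration; they are not Gensim's implementation, and Gensim may assign ids in a different order.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Give each new token the next free integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def simple_doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens of one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["the", "movie", "was", "the", "best"], ["more", "space", "films"]]
token2id = build_token2id(docs)
print(simple_doc2bow(docs[0], token2id))  # [(0, 2), (1, 1), (2, 1), (3, 1)]
```

Tokens missing from the mapping are silently skipped in this sketch.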

`collections.defaultdict`

[Common itertools methods](https://www.jianshu.com/p/52992ca06ada)

In [57]:
import itertools
from collections import defaultdict

In [104]:
# Save the fifth document: doc
doc = corpus[4]
print(doc)
print("\n")

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
print("\n")
    
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

[(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)]


. 1
movie 1
the 1
! 1
i 1


. 4
movie 4
the 4
! 4
i 3
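
The call to `itertools.chain.from_iterable(corpus)` in the cell above flattens the list of documents into a single stream of `(word_id, word_count)` pairs. A small sketch on made-up data (`toy_corpus` is a hypothetical name) shows the effect:

```python
import itertools

# Two small made-up bag-of-words documents.
toy_corpus = [[(0, 1), (5, 2)], [(5, 1), (7, 1)]]

# chain.from_iterable lazily yields the pairs of each document in turn.
flat = list(itertools.chain.from_iterable(toy_corpus))
print(flat)  # [(0, 1), (5, 2), (5, 1), (7, 1)]
```

Because the pairs arrive one document after another, a single loop can add up the counts for every id.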


**The following two short code snippets help explain the chain of defaultdict operations above:**

In [107]:
dic = {"a": 1, "b": 2, "c": 3, "a": 1, "c": 2}
sec = defaultdict(int)
for name, value in dic.items():
    print(name)                 # duplicate keys are not printed here: a dict keeps one entry per key (the last value wins)
#     print(sec[name])
    sec[name] += value

sec

a
c
b


defaultdict(int, {'a': 1, 'b': 2, 'c': 2})

In [101]:
dic = [(1, 1), (1, 2), (0, 3), (0, 1), (2, 1)]
sec = defaultdict(int)
for name, value in dic:         # repeated keys in a list of pairs are all visited and summed
    sec[name] += value

sec

defaultdict(int, {0: 4, 1: 3, 2: 1})
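
As an alternative sketch, the same aggregation can be written with `Counter`, which also treats missing keys as zero and offers `most_common()` for sorting. The variable names here are illustrative only.

```python
from collections import Counter

pairs = [(1, 1), (1, 2), (0, 3), (0, 1), (2, 1)]

totals = Counter()
for key, value in pairs:
    totals[key] += value      # missing keys start at 0, as with defaultdict(int)

print(totals.most_common())   # [(0, 4), (1, 3), (2, 1)]
```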

## Tf-idf with Gensim

- Tf-idf (term frequency-inverse document frequency) helps determine the most important words in each document.

- Each corpus may have shared words beyond just stopwords, and these shared words should be down-weighted in importance.

$$
w_{i,j} = tf_{i,j} \times \log{\left(\frac{N}{df_{i}}\right)}
$$

$w_{i,j}=$ tf-idf weight for token i in document j

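$tf_{i,j}=$ number of occurrences of token i in document j (this count is usually normalized, typically by dividing it by the total number of words in the document, so that it does not favor long documents)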

$df_{i}=$ number of documents that contain token i

$N=$ total number of documents

In [108]:
from gensim.models.tfidfmodel import TfidfModel

In [112]:
# pass bag-of-words model to initialize the tf-idf model
tfidf = TfidfModel(corpus)

In [122]:
corpus[2]

[(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)]

In [123]:
# show token_id and token_weights
tfidf[corpus[2]]

[(0, 0.08926151827345048),
 (13, 0.2418550916450883),
 (14, 0.3944486650167261),
 (15, 0.3944486650167261),
 (16, 0.3944486650167261),
 (17, 0.3944486650167261),
 (18, 0.3944486650167261),
 (19, 0.3944486650167261)]

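The weights above can help determine good topics and keywords for a corpus with shared vocabulary: tokens that occur in only one document (ids 14 to 19) get the highest weight, while the period (id 0), which occurs in four of the six documents, gets the lowest.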
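
The weights printed for `corpus[2]` can be approximately reproduced by hand. The sketch below assumes what Gensim's `TfidfModel` does with default settings, as far as I understand them: raw counts as tf, a base-2 logarithm for the idf term, and L2 normalization of each document vector. `tfidf_weights` is a hypothetical helper, and the corpus is copied from the `doc2bow` output above.

```python
import math
from collections import Counter

# Bag-of-words corpus copied from the doc2bow output: (token_id, count) pairs.
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
    [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
    [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
    [(0, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1), (20, 1), (21, 1), (22, 1), (23, 1)],
    [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)],
    [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)],
]

def tfidf_weights(doc, corpus):
    """tf * log2(N / df) per token, then L2-normalize the document vector.

    Assumes every token id in `doc` occurs somewhere in `corpus`.
    """
    n_docs = len(corpus)
    # Each id appears at most once per document, so this counts documents (df).
    df = Counter(token_id for d in corpus for token_id, _ in d)
    raw = [(token_id, tf * math.log2(n_docs / df[token_id])) for token_id, tf in doc]
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(token_id, w / norm) for token_id, w in raw] if norm else raw

weights = tfidf_weights(corpus[2], corpus)
print([(token_id, round(w, 4)) for token_id, w in weights])
```

Here $N=6$, the period (id 0) has $df=4$, the comma (id 13) has $df=2$, and ids 14 to 19 have $df=1$, which is why the last six weights are equal.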