# Natural Language Processing
## Topic Identification

Natural Language Processing (NLP) is the field concerned with processing human language, usually in the form of text data. One NLP application is topic identification, a technique for discovering the topics that a set of text documents is about.
This notebook shows how to identify topics from text.

### Importing the required libraries and modules

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
nltk.download('wordnet')  # only needed the first time; fetches the WordNet data used by the lemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [3]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [5]:
# Gensim imports (word_tokenize is already imported above)
import gensim
import string
from gensim import corpora
from gensim.corpora.dictionary import Dictionary



## Bag-of-words
Bag-of-words is a simple method for identifying the topics in a document. It counts how often each term occurs and assumes that the more frequent a term is, the more important it is to the document.
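As a minimal sketch of that assumption, the snippet below counts words in a made-up sentence using only the standard library. The sentence and the `toy_` names are illustrative, and the whitespace split stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical toy sentence; real text needs a proper tokenizer.
sample = "the cat sat on the mat and the cat slept"

# Naive tokenization: lowercase, then split on whitespace.
toy_tokens = sample.lower().split()

# A bag of words maps each token to its number of occurrences.
toy_bag = Counter(toy_tokens)

# The most frequent tokens are treated as the most important ones.
print(toy_bag.most_common(2))  # [('the', 3), ('cat', 2)]
```

Here the top-ranked token is a stop word, which hints at why raw counts alone can mislead.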

In [6]:
text1  =  "Avengers: Infinity War was a 2018 American superhero film based on the Marvel Comics superhero team the Avengers. It is the 19th film in the Marvel Cinematic Universe (MCU). The running time of the movie was 149 minutes and the box office collection was around 2 billion dollars. (Source: Wikipedia)"
print(text1)

Avengers: Infinity War was a 2018 American superhero film based on the Marvel Comics superhero team the Avengers. It is the 19th film in the Marvel Cinematic Universe (MCU). The running time of the movie was 149 minutes and the box office collection was around 2 billion dollars. (Source: Wikipedia)


In [8]:
# word_tokenize relies on the Punkt tokenizer models
# (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [9]:
# Split the text into word and punctuation tokens, then lowercase them
tokens = word_tokenize(text1)
lowercase_tokens = [t.lower() for t in tokens]
print(lowercase_tokens)

['avengers', ':', 'infinity', 'war', 'was', 'a', '2018', 'american', 'superhero', 'film', 'based', 'on', 'the', 'marvel', 'comics', 'superhero', 'team', 'the', 'avengers', '.', 'it', 'is', 'the', '19th', 'film', 'in', 'the', 'marvel', 'cinematic', 'universe', '(', 'mcu', ')', '.', 'the', 'running', 'time', 'of', 'the', 'movie', 'was', '149', 'minutes', 'and', 'the', 'box', 'office', 'collection', 'was', 'around', '2', 'billion', 'dollars', '.', '(', 'source', ':', 'wikipedia', ')']


In [10]:
# Count every token and show the 10 most frequent
bagofwords_1 = Counter(lowercase_tokens)
print(bagofwords_1.most_common(10))

[('the', 7), ('was', 3), ('.', 3), ('avengers', 2), (':', 2), ('superhero', 2), ('film', 2), ('marvel', 2), ('(', 2), (')', 2)]


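### Text Preprocessing

The most common tokens above are stop words and punctuation, which say little about the topic. Preprocessing keeps only alphabetic tokens, removes English stop words, and lemmatizes what remains so that inflected forms of a word are counted together.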

In [11]:
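# Keep purely alphabetic tokens (drops punctuation and tokens such as '2018' or '19th')
alphabets = [t for t in lowercase_tokens if t.isalpha()]

# Remove English stop words; a set makes the membership test fast
words = set(stopwords.words("english"))
stopwords_revoked = [t for t in alphabets if t not in words]

print(stopwords_revoked)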

['avengers', 'infinity', 'war', 'american', 'superhero', 'film', 'based', 'marvel', 'comics', 'superhero', 'team', 'avengers', 'film', 'marvel', 'cinematic', 'universe', 'mcu', 'running', 'time', 'movie', 'minutes', 'box', 'office', 'collection', 'around', 'billion', 'dollars', 'source', 'wikipedia']


In [12]:
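lemmatizer = WordNetLemmatizer()

# Reduce each token to its lemma (treated as a noun by default, e.g. 'avengers' -> 'avenger')
lem_tokens = [lemmatizer.lemmatize(t) for t in stopwords_revoked]

# Bag of words over the cleaned tokens
bags_words = Counter(lem_tokens)
print(bags_words.most_common(6))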

[('avenger', 2), ('superhero', 2), ('film', 2), ('marvel', 2), ('infinity', 1), ('war', 1)]
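
The steps above (lowercase, keep alphabetic tokens, drop stop words, lemmatize, count) can be collected into one helper. This is a sketch and not part of the original notebook: the name `top_terms` and its parameters are hypothetical, and the stop-word collection and normalizer are passed in so the function itself does not depend on NLTK.

```python
from collections import Counter

def top_terms(tokens, stop_words, normalize=lambda t: t, n=5):
    """Return the n most common terms after basic preprocessing.

    Hypothetical helper: `stop_words` is any container of words to skip,
    and `normalize` is any callable mapping a token to its base form.
    """
    kept = [t.lower() for t in tokens]
    kept = [t for t in kept if t.isalpha()]          # drop punctuation and numbers
    kept = [t for t in kept if t not in stop_words]  # drop stop words
    kept = [normalize(t) for t in kept]              # e.g. a lemmatizer
    return Counter(kept).most_common(n)

# Toy input with a tiny hand-written stop-word set (an assumption for the demo).
demo_tokens = ["The", "film", "was", "a", "superhero", "film", ".", "2018"]
demo_stop_words = {"the", "was", "a"}
print(top_terms(demo_tokens, demo_stop_words, n=2))  # [('film', 2), ('superhero', 1)]
```

With the notebook's own objects, a call such as `top_terms(tokens, set(stopwords.words("english")), lemmatizer.lemmatize, n=6)` should follow the same sequence of steps as the cells above.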
