# Topic modeling
Let's perform LDA (Latent Dirichlet Allocation) on a collection of news articles.

In [1]:
import urllib.request
import os

def download_file(url, local_file, force=False):
    """
    Helper function to download a file and store it locally
    """
    if not os.path.exists(local_file) or force:
        print('Downloading')
        # newline='' writes the decoded text as-is, so existing \r\n line
        # endings are not expanded to \r\r\n on Windows
        with urllib.request.urlopen(url) as opener, \
             open(local_file, mode='w', encoding='utf-8', newline='') as outfile:
            outfile.write(opener.read().decode('utf-8'))
        print('done')
    else:
        print('Already downloaded')

In [2]:
review_file = 'news_en_train_small.csv'
download_file('http://www.esuli.it/demo/data/news_en_train_small.csv', review_file)

Downloading
done


In [3]:
import csv

x_train = list()
with open(review_file, encoding='utf-8', newline='') as infile:
    reader = csv.reader(infile)
    for row in reader:
        # Skip blank rows, which would otherwise raise IndexError on row[0]
        if row:
            x_train.append(row[0])

In [4]:
x_train[0][:200]

"Nigerian women's bobsled team make Winter Olympic history\r\r\nUpdated 1752 GMT (0152 HKT) November 17, 2017\r\r\nOlympic flame arrives in Seoul\r\r\nOlympic flame arrives in Seoul\r\r\nNigeria women's bobsled te"

We tokenize the text, remove stopwords and words that are too frequent or too rare, and count term frequencies.

Each document is converted from a string into a bag of words: the terms it contains, each associated with its frequency in the document.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

en_stopwords = stopwords.words('english')

tf_vectorizer = CountVectorizer(stop_words=en_stopwords, max_df=0.5, min_df=5,
                                max_features=1000, ngram_range=(1, 2))
tf = tf_vectorizer.fit_transform(x_train)

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

n_components = 8

lda = LatentDirichletAllocation(n_components=n_components, max_iter=10,
                                learning_method='batch',
                                n_jobs=-1, verbose=1)
lda.fit(tf)


In [None]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = f'Topic {topic_idx}: '
        message += ', '.join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

In [None]:
n_top_words = 20
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)
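
The slice `[:-n_top_words - 1:-1]` used in `print_top_words` can be hard to read at first. A small sketch of what it does, using a made-up weight vector (`toy_topic`) and made-up feature names (`toy_names`): `argsort` returns indices in ascending order of weight, and the slice walks backwards from the end, keeping the last `n` of them.

```python
import numpy as np

# Hypothetical topic weights over five features, invented for illustration
toy_topic = np.array([0.1, 0.7, 0.2, 0.9, 0.4])
toy_names = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
n = 3

# argsort gives indices from lowest to highest weight;
# [:-n - 1:-1] reads that array from the end, stopping after n items
top_idx = toy_topic.argsort()[:-n - 1:-1]
top_names = [toy_names[i] for i in top_idx]

print(top_idx)
print(top_names)
```

The result lists the `n` highest-weighted features, from the highest weight down.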

## Projecting documents into topics
The `transform` method of the LDA model converts the tf vectors of documents into topic vectors.
The weights in a document's topic vector indicate how much of the document's content can be attributed to each topic.
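
As a minimal sketch of what a topic vector looks like, the toy example below fits a two-topic model on a small hand-made count matrix. The matrix `toy_counts`, the names `toy_lda` and `toy_topics`, and the `random_state` value are assumptions made for illustration. In scikit-learn the returned weights are normalized, so each row can be read as a distribution over topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hand-made term counts: 4 documents x 4 terms (invented for illustration).
# The first two documents mostly use terms 0-1, the last two terms 2-3.
toy_counts = np.array([
    [4, 3, 0, 0],
    [5, 2, 0, 1],
    [0, 0, 3, 4],
    [0, 1, 4, 3],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_topics = toy_lda.fit_transform(toy_counts)

print(toy_topics.shape)        # one row per document, one column per topic
print(toy_topics.sum(axis=1))  # each row sums to 1 (up to rounding)
```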

In [None]:
doc_topics = lda.transform(tf)
doc_topics

In [None]:
doc_topics.shape

The topic with the highest weight is the one the document can be assigned to.

In [None]:
import numpy as np

np.argmax(doc_topics, axis=1)
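
Once every document has a dominant topic, it can be useful to see how the documents are spread over the topics and how strong each assignment is. The sketch below uses a made-up topic matrix (`toy_doc_topics`) in place of the real `doc_topics`; the names `assigned`, `confidence`, and `docs_per_topic` are illustrative.

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents x 3 topics
toy_doc_topics = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
])

# Index of the highest-weighted topic for each document
assigned = np.argmax(toy_doc_topics, axis=1)

# Weight of that topic: a low value suggests the document mixes several topics
confidence = toy_doc_topics.max(axis=1)

# Number of documents assigned to each topic
docs_per_topic = np.bincount(assigned, minlength=toy_doc_topics.shape[1])

print(assigned)
print(confidence)
print(docs_per_topic)
```

The same three lines should work unchanged on `doc_topics`, since it has the same layout (documents as rows, topics as columns).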