## Day26 - NLP

**Latent Semantic Analysis** (LSA) is based on the well known Singular Value Decomposition (SVD). It is possible to truncate the tf-idf matrix in order to drastically reduce the dimension of the problem. This new representation highlights the "latent sentiment" of these topics.

So, the LSA tells you which dimensions are relevant to the semantic of the documents. In fact the "low variance" topics may represents just noise. 

In [44]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sb

from nltk import word_tokenize, SnowballStemmer
from nltk.corpus import stopwords
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from collections import Counter

In [None]:
df = pd.read_csv('../dataset/abcnews-date-text.csv')

reindexed_data = df['headline_text']
reindexed_data.index = df['publish_date']

In [32]:
def preprocess_data(data):
    """
    :param data: series of news headlines
    :return: preprocessed series of news headlines
    """

    stemmer = SnowballStemmer(language="english")
    
    prep_data = []
    
    for document in data:
        
        # lower case normalization
        prep_document = document.lower()
        
        # remove numbers and hashtags
        prep_document = re.sub(r"[0-9]","",prep_document)
        prep_document = re.sub(r"#", "", prep_document)
        
        # tokenization
        tokens = word_tokenize(prep_document)
        
        stemmed_tokens = [stemmer.stem(t) for t in tokens if t not in stopwords.words('english')]
        
        prep_data.append(' '.join(stemmed_tokens))
        
    return prep_data

In [33]:
%%time
preprocessed_docs = preprocess_data(reindexed_data.sample(100000))

Wall time: 2min 59s


In [62]:
## tf - idf
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_docs)

In [63]:
lsa_model = TruncatedSVD(n_components=7)
lsa_topic_matrix = lsa_model.fit_transform(X)

In [68]:
# https://www.kaggle.com/rcushen/topic-modelling-with-lsa-and-lda
    
def get_keys(topic_matrix):
    '''
    returns an integer list of predicted topic 
    categories for a given topic matrix
    '''
    keys = topic_matrix.argmax(axis=1).tolist()
    return keys

def keys_to_counts(keys):
    '''
    returns a tuple of topic categories and their 
    accompanying magnitudes for a given list of keys
    '''
    count_pairs = Counter(keys).items()
    categories = [pair[0] for pair in count_pairs]
    return categories

def get_top_n_words(n, keys, document_term_matrix, count_vectorizer):
    '''
    returns a list of n_topic strings, where each string contains the n most common 
    words in a predicted category, in order
    '''
    top_word_indices = []
    for topic in range(7):
        temp_vector_sum = 0
        for i in range(len(keys)):
            if keys[i] == topic:
                temp_vector_sum += document_term_matrix[i]
        temp_vector_sum = temp_vector_sum.toarray()
        top_n_word_indices = np.flip(np.argsort(temp_vector_sum)[0][-n:],0)
        top_word_indices.append(top_n_word_indices)   
    top_words = []
    for topic in top_word_indices:
        topic_words = []
        for index in topic:
            temp_word_vector = np.zeros((1,document_term_matrix.shape[1]))
            temp_word_vector[:,index] = 1
            the_word = count_vectorizer.inverse_transform(temp_word_vector)[0][0]
            topic_words.append(the_word.encode('ascii').decode('utf-8'))
        top_words.append(" ".join(topic_words))         
    return top_words

lsa_keys = get_keys(lsa_topic_matrix)
lsa_categories = keys_to_counts(lsa_keys)

In [69]:
top_n_words_lsa = get_top_n_words(10, lsa_keys, X, vectorizer)

for i in range(len(top_n_words_lsa)):
    print("Topic {}: ".format(i+1), top_n_words_lsa[i])

Topic 1:  man charg say plan council govt court fire call back
Topic 2:  torchin drugsrisk furri fuselag fusion fuss fussi futcher futil futsal
Topic 3:  polic investig probe search offic hunt death drug miss arrest
Topic 4:  market nation abc rural news busi weather countri hour sport
Topic 5:  new zealand law year case appoint york get plan open
Topic 6:  australia kill crash car die day world australian year cup
Topic 7:  interview report extend win smith nrl michael john us david
