## Day25 - NLP

When we normalize text with stemming or lemmatization and measure word importance with TF-IDF (term frequency-inverse document frequency), some side problems remain. For example, two documents that talk about the same thing in different words end up far apart in the final vector space. The opposite can also happen: documents that share many words but mean different things end up close together. For this reason it is important to also have **topic scores**, so that we can cluster and search information by meaning.
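The problem described above can be reproduced on a few toy sentences. The sketch below is only an illustration: the sentences are made up, and the weighting is a simplified TF-IDF variant (relative term frequency times `log(N / df) + 1`), not the exact formula of any particular library.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (simplified TF-IDF)."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

# hypothetical toy corpus
docs = [
    "the film was great".split(),
    "a superb movie".split(),
    "the film was boring".split(),
]
vecs = tfidf_vectors(docs)

same_meaning = cosine(vecs[0], vecs[1])      # similar meaning, no shared words
opposite_meaning = cosine(vecs[0], vecs[2])  # opposite meaning, three shared words
print(same_meaning, opposite_meaning)
```

The two sentences with nearly the same meaning share no tokens, so their similarity is exactly zero, while the two sentences with opposite sentiment look fairly similar because they share three of their four words.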

**Latent semantic analysis** (LSA) is an algorithm that analyzes the relationships between words in order to group them into topics. Since the number of topics is much smaller than the number of words, it is commonly used to reduce the dimension of the initial matrix. A slightly different algorithm is **Linear Discriminant Analysis** (LDA), which breaks a document down into a single topic.
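Before moving on to LDA, the LSA idea can be sketched with a truncated SVD. LSA is commonly implemented this way, but the term-document counts below are hypothetical and chosen only to make the two topics obvious.

```python
import numpy as np

# Hypothetical term-document counts: rows are terms, columns are documents.
# Documents 0-1 are about pets, documents 2-3 are about finance.
terms = ["cat", "kitten", "pet", "stock", "market", "trade"]
counts = np.array([
    [2, 1, 0, 0],  # cat
    [1, 2, 0, 0],  # kitten
    [1, 1, 0, 0],  # pet
    [0, 0, 3, 1],  # stock
    [0, 0, 1, 3],  # market
    [0, 0, 1, 1],  # trade
], dtype=float)

# SVD factors the matrix into term-topic, topic-strength and topic-document parts.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Keep only the k strongest topics: 6 word dimensions become 2 topic dimensions.
k = 2
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T  # shape: (n_documents, k)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

raw_similarity = cosine(counts[:, 0], counts[:, 1])      # documents 0 and 1 in word space
topic_similarity = cosine(doc_topics[0], doc_topics[1])  # documents 0 and 1 in topic space
cross_topic = cosine(doc_topics[0], doc_topics[2])       # a pet document vs a finance document
print(raw_similarity, topic_similarity, cross_topic)
```

In this toy setup the two pet documents use different word mixes, yet they land on the same point in topic space, while a pet document and a finance document stay orthogonal.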

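LDA is one of the fastest algorithms for dimension reduction. However, it is a **supervised algorithm**, so it requires labeled examples. The cells below apply a simplified version of it to a labeled SMS spam dataset: each message is scored by projecting its TF-IDF vector onto the line between the ham centroid and the spam centroid.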

In [3]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
from nlpia.data.loaders import get_data

sms = get_data('sms-spam')
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms = pd.DataFrame(sms.values, columns=sms.columns, index=index)
sms['spam'] = sms.spam.astype(int)

In [4]:
sms.head()

| | spam | text |
|---|---|---|
| sms0 | 0 | Go until jurong point, crazy.. Available only ... |
| sms1 | 0 | Ok lar... Joking wif u oni... |
| sms2! | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| sms3 | 0 | U dun say so early hor... U c already then say... |
| sms4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize

tfidf_model = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf_model.fit_transform(raw_documents=sms.text).toarray()

In [18]:
mask = sms.spam.astype(bool).values
# average position of the elements belonging to SPAM
spam_centroid = tfidf_docs[mask].mean(axis=0)
# average position of the elements belonging to HAM
ham_centroid = tfidf_docs[~mask].mean(axis=0)

# projection of each message onto the line from the ham centroid to the spam centroid (not normalized)
spamminess_score = tfidf_docs.dot(spam_centroid - ham_centroid)
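The centroid projection used in the cell above can be checked by hand on a tiny example. This is a minimal sketch with hypothetical 2-D points standing in for the TF-IDF rows; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Hypothetical 2-D feature vectors standing in for TF-IDF rows
X = np.array([
    [0.0, 1.0],
    [0.2, 0.8],
    [1.0, 0.0],
    [0.9, 0.1],
])
labels = np.array([0, 0, 1, 1])  # 1 marks the positive class (spam in the notebook)
mask = labels.astype(bool)

pos_centroid = X[mask].mean(axis=0)   # average position of the positive class
neg_centroid = X[~mask].mean(axis=0)  # average position of the negative class

# The direction pointing from the negative centroid to the positive centroid
direction = pos_centroid - neg_centroid

# A dot product with that direction gives one score per row:
# the further a point lies toward the positive centroid, the larger its score.
scores = X.dot(direction)
print(scores)
```

Because the direction vector is not scaled to unit length and no offset is subtracted, the scores are only meaningful relative to each other, which is why the notebook rescales them before applying a threshold.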

In [19]:
from sklearn.preprocessing import MinMaxScaler
# rescale the scores to the [0, 1] range
sms['lda_score'] = MinMaxScaler().fit_transform(spamminess_score.reshape(-1,1))
# 1 if the score is > 0.5, 0 otherwise
sms['lda_predict'] = (sms.lda_score > .5).astype(int)

# accuracy: fraction of messages whose prediction matches the label
(1. - (sms.spam - sms.lda_predict).abs().sum() / len(sms)).round(3)

0.977
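The scaling, thresholding and accuracy steps can be traced on a handful of numbers. The sketch below uses plain Python with hypothetical scores and labels; it mirrors what `MinMaxScaler` and the accuracy expression do, but it is not the notebook's code.

```python
# Hypothetical raw projection scores and true labels
scores = [-0.85, -0.51, 0.85, 0.68, -0.10]
labels = [0, 0, 1, 1, 1]

# Min-max scaling maps the smallest score to 0 and the largest to 1
lo, hi = min(scores), max(scores)
scaled = [(s - lo) / (hi - lo) for s in scores]

# Predict the positive class when the scaled score is above 0.5
predictions = [int(s > 0.5) for s in scaled]

# Accuracy: share of predictions that match the label.
# This equals 1 - sum(|label - prediction|) / n for 0/1 labels.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

Here the last example falls just below the threshold and is misclassified, giving an accuracy of 0.8. Note that in the notebook the accuracy is measured on the same messages that were used to compute the centroids, so it is a training accuracy rather than a held-out estimate.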

In [20]:
sms['spam lda_predict lda_score'.split()]

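| | spam | lda_predict | lda_score |
|---|---|---|---|
| sms0 | 0 | 0 | 0.227478 |
| sms1 | 0 | 0 | 0.177888 |
| sms2! | 1 | 1 | 0.718785 |
| sms3 | 0 | 0 | 0.184565 |
| sms4 | 0 | 0 | 0.286944 |
| ... | ... | ... | ... |
| sms4832! | 1 | 1 | 0.850649 |
| sms4833 | 0 | 0 | 0.292753 |
| sms4834 | 0 | 0 | 0.269454 |
| sms4835 | 0 | 0 | 0.331306 |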
