# Sentiment classification of sentences

## Introduction

In this notebook, the sentiment of sentences is classified with Linear Discriminant Analysis (LDA). The classifier is trained either directly on TF-IDF vectors or on topic vectors created with Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDiA).

The dataset is obtained from here: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

## Dataset

Source of the data (2015): Dimitrios Kotzias dkotzias '@' ics.uci.edu

Dataset information:
- Contains sentences labelled with positive or negative sentiment.
- Format: sentence, tab, score; the score is either 1 (for positive) or 0 (for negative).
- The attributes are text sentences extracted from reviews of products, movies, and restaurants.
- The sentences come from three different websites: imdb.com, amazon.com, and yelp.com.

For each website, there are 500 positive and 500 negative sentences. They were selected randomly from larger datasets of reviews, and only sentences with a clearly positive or negative connotation were chosen, so that no neutral sentences would be included.

Relevant paper: 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015

## Loading the data

In [1]:
# Load libraries
import numpy as np
import pandas as pd
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.casual import casual_tokenize
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from sklearn.model_selection import train_test_split

In [12]:
with open('amazon_cells_labelled.txt') as f:
    content =f.read() 

In [13]:
score=[]; text=[]; index=[]
for i, sentence in enumerate(content.splitlines()):
    line=sentence.split('\t')
    text.append(line[0])
    score.append(int(line[1]=='1'))
    index.append('sent{}'.format(i))

In [14]:
with open('yelp_labelled.txt') as f:
    content =f.read() 

In [15]:
prev_length=len(text)
for i, sentence in enumerate(content.splitlines()):
    line=sentence.split('\t')
    text.append(line[0])
    score.append(int(line[1]=='1'))
    index.append('sent{}'.format(i+prev_length))

In [16]:
with open('imdb_labelled.txt') as f:
    content =f.read() 

In [17]:
prev_length=len(text)
j=0
for i, sentence in enumerate(content.splitlines()):
    line=sentence.split('\t')        
    if len(line)==2:
        text.append(line[0])
        score.append(int(line[1]=='1'))
        index.append('sent{}'.format(j+prev_length))
        j+=1
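
The three loading cells above repeat the same parsing steps, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: `parse_labelled_lines` is a hypothetical name, and it assumes that each valid line has exactly one tab separating the sentence from a `0`/`1` label. Lines that do not match are skipped, as in the IMDb cell.

```python
def parse_labelled_lines(content, start=0):
    """Parse 'sentence<TAB>label' lines into parallel lists.

    start is the running sentence number used for the 'sentN' index labels.
    (Hypothetical helper, mirroring the loading loops above.)
    """
    text, score, index = [], [], []
    for raw in content.splitlines():
        parts = raw.split('\t')
        if len(parts) != 2:
            # Skip malformed lines, as the IMDb loop does
            continue
        text.append(parts[0])
        score.append(int(parts[1] == '1'))
        index.append('sent{}'.format(start + len(index)))
    return text, score, index


sample = "Good case, Excellent value.\t1\nbroken line without label\nExceptionally bad!\t0\n"
text, score, index = parse_labelled_lines(sample, start=10)
print(text, score, index)
```

Calling the helper once per file and passing the current number of collected sentences as `start` would keep the `sentN` labels consecutive across the three sources.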

## Exploring the data

In [18]:
df=pd.DataFrame({'score':score,'text':text},index=index)
df.head()

Unnamed: 0,score,text
sent0,0,So there is no way for me to plug it in here i...
sent1,1,"Good case, Excellent value."
sent2,1,Great for the jawbone.
sent3,0,Tied to charger for conversations lasting more...
sent4,1,The mic is great.


In [19]:
df.tail()

Unnamed: 0,score,text
sent2995,0,I just got bored watching Jessice Lange take h...
sent2996,0,"Unfortunately, any virtue in this film's produ..."
sent2997,0,"In a word, it is embarrassing."
sent2998,0,Exceptionally bad!
sent2999,0,All in all its an insult to one's intelligence...


In [20]:
# Number of document rows
len(df)

3000

In [21]:
# Number of document rows that are positive, and the share of classes
df.score.sum(), round(df.score.sum()/len(df),2)

(1500, 0.5)

Thus the dataset is balanced: 50% of the sentences are positive and 50% are negative.

## Creating TF-IDF vectors

In [22]:
# Let's do tokenization and TF-IDF vector transformation on all sentences
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
tfidf_model=TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs=tfidf_model.fit_transform(raw_documents=df.text).toarray()
# Center each TF-IDF column on zero by subtracting its mean over all documents
tfidf_docs=tfidf_docs-tfidf_docs.mean(axis=0)
# rows: number of documents, columns: number of terms
tfidf_docs.shape

(3000, 5399)
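
To make the two steps in the cell above easier to follow, here is a small sketch on a toy corpus. The three sentences are invented for illustration, and the default scikit-learn tokenizer is used instead of `casual_tokenize` so that the example does not depend on NLTK.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus, not taken from the dataset
toy_docs = ["great phone great price", "bad battery", "great battery life"]

toy_model = TfidfVectorizer()
toy_tfidf = toy_model.fit_transform(toy_docs).toarray()

# Subtract the column means so that every term has zero mean over the documents
toy_centered = toy_tfidf - toy_tfidf.mean(axis=0)

print(toy_tfidf.shape)                 # (documents, distinct terms)
print(sorted(toy_model.vocabulary_))   # the terms behind the columns
print(np.allclose(toy_centered.mean(axis=0), 0.0))
```

After centering, the matrix is dense and contains negative values, which is acceptable for PCA and LDA but not for LDiA, which expects non-negative counts.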

## Sentiment classification  

The idea is to first try LDA classification directly on the TF-IDF vectors, and to compare that with cases where LDA is trained on topic vectors. In the latter case the topic vectors are created either with Latent Semantic Analysis (LSA, implemented here with PCA) or with Latent Dirichlet Allocation (LDiA).

In [23]:
# Import the libraries
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import casual_tokenize

### a) LDA classification with TF-IDF vectors (no PCA)

In [24]:
X_train,X_test,y_train,y_test=train_test_split(tfidf_docs,df.score.values,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)



(1.0, 0.682)

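Training set accuracy is perfect, but test set accuracy is much lower (0.682), which indicates overfitting: there are 5399 TF-IDF features but only about 2000 training sentences. It is thus better to use some method that reduces the number of dimensions. Let's try PCA.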
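
The effect of having more features than training rows can be reproduced on random data. The sketch below is an illustration under stated assumptions, not the notebook's method: it uses a minimum-norm least-squares fit as a stand-in for LDA, and both the features and the labels are pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 40, 200, 100   # more features than training rows

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
# Labels are pure noise: there is nothing real to learn
y_train = rng.choice([-1.0, 1.0], size=n_train)
y_test = rng.choice([-1.0, 1.0], size=n_test)

# Minimum-norm least-squares fit of a linear decision function
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_acc = np.mean(np.sign(X_train @ w) == y_train)
test_acc = np.mean(np.sign(X_test @ w) == y_test)
print(train_acc, test_acc)
```

The linear model reproduces the random training labels exactly, while its accuracy on the unseen rows should stay near chance level. A perfect training score with many more features than rows is therefore not evidence of a good classifier.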

### b) LDA classification and LSA with 256 PCA topic vectors

In [25]:
# Let's try PCA from scikit-learn, transforming 5399 dimension TF-IDF vectors into 256-D topic vectors
pca=PCA(n_components=256)
pca=pca.fit(tfidf_docs)
pca256_topic_vectors=pca.transform(tfidf_docs)
columns256=['topic{}'.format(i) for i in range(pca.n_components)]
pca256_topic_vectors=pd.DataFrame(pca256_topic_vectors,columns=columns256,index=index)
pca256_topic_vectors.round(3).head(6)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic246,topic247,topic248,topic249,topic250,topic251,topic252,topic253,topic254,topic255
sent0,-0.061,-0.099,0.013,-0.039,0.032,-0.079,-0.049,-0.004,0.033,-0.073,...,0.04,-0.009,0.015,-0.007,0.002,0.036,-0.024,0.015,-0.002,-0.015
sent1,0.01,0.102,0.016,0.075,0.091,0.224,-0.135,0.129,0.068,0.004,...,0.028,-0.004,-0.017,0.053,-0.078,-0.028,-0.02,0.077,0.047,0.005
sent2,0.063,0.169,0.139,0.192,-0.113,-0.166,0.022,0.087,0.081,-0.022,...,-0.008,-0.016,-0.014,-0.004,-0.005,-0.0,0.01,0.003,0.01,0.012
sent3,0.246,-0.027,-0.043,-0.097,-0.044,-0.005,-0.048,-0.01,0.013,0.007,...,0.028,-0.006,-0.017,-0.036,-0.007,-0.024,0.011,-0.006,-0.008,-0.016
sent4,0.053,0.215,0.252,0.145,-0.007,-0.149,0.092,0.02,0.177,-0.059,...,0.006,-0.021,-0.012,-0.016,-0.033,-0.012,-0.024,-0.017,0.028,0.022
sent5,-0.062,-0.134,-0.064,-0.018,-0.051,-0.083,-0.064,-0.022,0.103,0.026,...,-0.011,-0.011,-0.009,-0.028,-0.006,0.027,0.01,-0.02,0.03,-0.008


In [26]:
X_train,X_test,y_train,y_test=train_test_split(pca256_topic_vectors,df.score,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['pca256_sentiment']=lda.predict(pca256_topic_vectors)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)

(0.844, 0.776)

### c) LDA classification and LSA with 384 PCA topic vectors

In [27]:
# Let's try PCA from scikit-learn, transforming 5399 dimension TF-IDF vectors into 384-D topic vectors
pca=PCA(n_components=384)
pca=pca.fit(tfidf_docs)
pca384_topic_vectors=pca.transform(tfidf_docs)
columns384=['topic{}'.format(i) for i in range(pca.n_components)]
pca384_topic_vectors=pd.DataFrame(pca384_topic_vectors,columns=columns384,index=index)
pca384_topic_vectors.round(3).head(6)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic374,topic375,topic376,topic377,topic378,topic379,topic380,topic381,topic382,topic383
sent0,-0.061,-0.099,0.013,-0.039,0.032,-0.079,-0.049,-0.004,0.033,-0.073,...,0.017,0.015,-0.009,-0.003,0.041,-0.001,0.006,-0.039,0.019,-0.002
sent1,0.01,0.102,0.016,0.075,0.091,0.224,-0.135,0.129,0.068,0.004,...,-0.003,-0.003,0.002,-0.007,-0.001,-0.001,-0.023,0.008,0.004,0.005
sent2,0.063,0.169,0.139,0.192,-0.113,-0.166,0.022,0.087,0.081,-0.022,...,0.022,-0.036,-0.001,0.05,0.009,-0.023,-0.047,0.036,-0.085,-0.007
sent3,0.246,-0.027,-0.043,-0.097,-0.044,-0.005,-0.048,-0.01,0.013,0.007,...,-0.009,0.022,-0.005,-0.001,-0.017,0.024,-0.007,-0.031,-0.016,-0.025
sent4,0.053,0.215,0.252,0.145,-0.007,-0.149,0.092,0.02,0.177,-0.059,...,0.04,-0.01,-0.009,-0.021,-0.003,-0.033,0.052,0.009,0.001,0.014
sent5,-0.062,-0.134,-0.064,-0.018,-0.051,-0.083,-0.064,-0.022,0.103,0.026,...,0.014,0.004,-0.038,-0.005,0.008,0.002,-0.02,-0.004,0.007,0.014


In [28]:
X_train,X_test,y_train,y_test=train_test_split(pca384_topic_vectors,df.score,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['pca384_sentiment']=lda.predict(pca384_topic_vectors)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)

(0.88, 0.8)

### d) LDA classification and LSA with 312 PCA topic vectors

In [29]:
# Let's try PCA from scikit-learn, transforming 5399 dimension TF-IDF vectors into 312-D topic vectors
pca=PCA(n_components=312)
pca=pca.fit(tfidf_docs)
pca312_topic_vectors=pca.transform(tfidf_docs)
columns312=['topic{}'.format(i) for i in range(pca.n_components)]
pca312_topic_vectors=pd.DataFrame(pca312_topic_vectors,columns=columns312,index=index)
pca312_topic_vectors.round(3).head(6)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic302,topic303,topic304,topic305,topic306,topic307,topic308,topic309,topic310,topic311
sent0,-0.061,-0.099,0.013,-0.039,0.032,-0.079,-0.049,-0.004,0.033,-0.073,...,0.035,0.04,0.005,0.001,0.003,-0.023,-0.026,0.023,-0.031,0.067
sent1,0.01,0.102,0.016,0.075,0.091,0.224,-0.135,0.129,0.068,0.004,...,0.031,-0.031,0.006,-0.001,-0.001,0.003,-0.018,-0.013,-0.028,0.007
sent2,0.063,0.169,0.139,0.192,-0.113,-0.166,0.022,0.087,0.081,-0.022,...,0.008,-0.015,-0.0,0.001,0.01,0.027,0.0,0.009,0.013,0.005
sent3,0.246,-0.027,-0.043,-0.097,-0.044,-0.005,-0.048,-0.01,0.013,0.007,...,0.015,-0.007,0.034,-0.06,-0.019,0.029,0.014,0.015,0.027,-0.004
sent4,0.053,0.215,0.252,0.145,-0.007,-0.149,0.092,0.02,0.177,-0.059,...,0.004,0.016,0.035,-0.001,0.003,-0.025,0.032,-0.043,-0.024,-0.031
sent5,-0.062,-0.134,-0.064,-0.018,-0.051,-0.083,-0.064,-0.022,0.103,0.026,...,-0.017,0.04,-0.034,-0.022,0.004,0.019,0.015,-0.012,0.001,0.045


In [30]:
X_train,X_test,y_train,y_test=train_test_split(pca312_topic_vectors,df.score,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['pca312_sentiment']=lda.predict(pca312_topic_vectors)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)

(0.858, 0.789)
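
Sections b) to d) differ only in the number of PCA components, so the comparison can also be written as a loop. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` in place of the TF-IDF matrix, small component counts that fit that data, and a scikit-learn pipeline. The pipeline fits PCA on the training rows only, which is a deliberate difference from the cells above, where PCA is fitted on all sentences before the split.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the centered TF-IDF matrix and the sentiment labels
X, y = make_classification(n_samples=300, n_features=60, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=256242)

scores = {}
for n_components in (5, 20, 40):
    # PCA is fitted on the training rows, then LDA on the reduced vectors
    model = make_pipeline(PCA(n_components=n_components), LDA())
    model.fit(X_train, y_train)
    scores[n_components] = (round(model.score(X_train, y_train), 3),
                            round(model.score(X_test, y_test), 3))

for n_components, (train_acc, test_acc) in scores.items():
    print(n_components, train_acc, test_acc)
```

Collecting the train and test accuracies in one dictionary makes it easier to see where adding components stops helping the test score.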

The test accuracy changes only a little between 256 and 384 components (0.776 to 0.8), so the number of components seems reasonably well tuned. Let's still try LDA with LDiA topic vectors.

### e) LDA classification with 312 LDiA topic vectors

In [31]:
# LDiA works with raw BOW count vectors rather than normalized TF-IDF vectors
np.random.seed(42)
counter=CountVectorizer(tokenizer=casual_tokenize)
bow_docs=pd.DataFrame(counter.fit_transform(raw_documents=df.text).toarray(),index=index)
column_nums,terms=zip(*sorted(zip(counter.vocabulary_.values(),counter.vocabulary_.keys())))
bow_docs.columns=terms
bow_docs.head()

Unnamed: 0,!,"""",#,$,%,&,',(,(;,),...,yun,z,z500a,zero,zillion,zombie,zombie-students,zombiez,,
sent0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sent1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sent2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sent3,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sent4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
ldia312=LDiA(n_components=312,learning_method='batch')
ldia312=ldia312.fit(bow_docs)
ldia312.components_.shape

(312, 5399)

In [33]:
ldia312_topic_vectors=ldia312.transform(bow_docs)
columns312=['topic{}'.format(i) for i in range(ldia312.n_components)]
ldia312_topic_vectors=pd.DataFrame(ldia312_topic_vectors,index=index,columns=columns312)
ldia312_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic302,topic303,topic304,topic305,topic306,topic307,topic308,topic309,topic310,topic311
sent0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent1,0.0,0.0,0.0,0.45,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
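
The many `0.0` entries in the output above are probably a rounding effect: each LDiA topic vector is a probability distribution over the topics, and with hundreds of topics and short sentences most weights are very small. The sketch below illustrates this on an invented toy corpus, with the default scikit-learn tokenizer and an arbitrarily chosen 50 topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus, not taken from the dataset
toy_docs = ["great phone great price", "bad battery", "great battery life",
            "bad food bad service"]

# Raw term counts, as LDiA expects
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_ldia = LDiA(n_components=50, learning_method='batch', random_state=0)
toy_ldia = toy_ldia.fit(toy_counts)
toy_topics = toy_ldia.transform(toy_counts)

print(toy_topics.shape)                    # (documents, topics)
print(toy_topics.sum(axis=1))              # each row sums to 1
print((toy_topics.round(2) == 0).mean())   # share of entries shown as 0.0
```

Passing `random_state` directly to the estimator, as done here, is one way to make a single LDiA fit repeatable without relying on the global NumPy seed.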


Let's use the LDiA topic vectors (created from BOW vectors) to train an LDA model (a simple binary classifier), in the same way as was done with the PCA topic vectors.

In [34]:
X_train,X_test,y_train,y_test=train_test_split(ldia312_topic_vectors,df.score,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['ldia312_sentiment']=lda.predict(ldia312_topic_vectors)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)



(0.739, 0.624)

### f) LDA classification with 256 LDiA topic vectors

In [35]:
ldia256=LDiA(n_components=256,learning_method='batch')
ldia256=ldia256.fit(bow_docs)
# Let's compute 256-D topic vectors for all the sentences
ldia256_topic_vectors=ldia256.transform(bow_docs)
columns256=['topic{}'.format(i) for i in range(ldia256.n_components)]
ldia256_topic_vectors=pd.DataFrame(ldia256_topic_vectors,index=index,columns=columns256)
ldia256_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic246,topic247,topic248,topic249,topic250,topic251,topic252,topic253,topic254,topic255
sent0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent1,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.78,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [36]:
# LDA model (classifier training)
X_train,X_test,y_train,y_test=train_test_split(ldia256_topic_vectors,df.score,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['ldia256_sentiment']=lda.predict(ldia256_topic_vectors)
# Let's look at accuracy for train and test set
round(float(lda.score(X_train,y_train)),3),  round(float(lda.score(X_test,y_test)),3)



(0.75, 0.64)

### g) LDA classification with 384 LDiA topic vectors

In [37]:
ldia384=LDiA(n_components=384,learning_method='batch')
ldia384=ldia384.fit(bow_docs)
# Let's compute 384-D topic vectors for all the sentences
ldia384_topic_vectors=ldia384.transform(bow_docs)
columns384=['topic{}'.format(i) for i in range(ldia384.n_components)]
ldia384_topic_vectors=pd.DataFrame(ldia384_topic_vectors,index=index,columns=columns384)
ldia384_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic374,topic375,topic376,topic377,topic378,topic379,topic380,topic381,topic382,topic383
sent0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [38]:
X_train,X_test,y_train,y_test=train_test_split(ldia384_topic_vectors,df.score,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['ldia384_sentiment']=lda.predict(ldia384_topic_vectors)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)



(0.778, 0.625)

### h) LDA classification with 512 LDiA topic vectors

In [39]:
ldia512=LDiA(n_components=512,learning_method='batch')
ldia512=ldia512.fit(bow_docs)
# Let's compute 512-D topic vectors for all the sentences
ldia512_topic_vectors=ldia512.transform(bow_docs)
columns512=['topic{}'.format(i) for i in range(ldia512.n_components)]
ldia512_topic_vectors=pd.DataFrame(ldia512_topic_vectors,index=index,columns=columns512)
ldia512_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic502,topic503,topic504,topic505,topic506,topic507,topic508,topic509,topic510,topic511
sent0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [40]:
X_train,X_test,y_train,y_test=train_test_split(ldia512_topic_vectors,df.score,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['ldia512_sentiment']=lda.predict(ldia512_topic_vectors)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)



(0.805, 0.665)

### i) LDA classification with 1024 LDiA topic vectors

In [41]:
ldia1024=LDiA(n_components=1024,learning_method='batch')
ldia1024=ldia1024.fit(bow_docs)
# Let's compute 1024-D topic vectors for all the sentences
ldia1024_topic_vectors=ldia1024.transform(bow_docs)
columns1024=['topic{}'.format(i) for i in range(ldia1024.n_components)]
ldia1024_topic_vectors=pd.DataFrame(ldia1024_topic_vectors,index=index,columns=columns1024)
ldia1024_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic1014,topic1015,topic1016,topic1017,topic1018,topic1019,topic1020,topic1021,topic1022,topic1023
sent0,0.0,0.0,0.0,0.0,0.71,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
X_train,X_test,y_train,y_test=train_test_split(ldia1024_topic_vectors,df.score,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['ldia1024_sentiment']=lda.predict(ldia1024_topic_vectors)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)



(0.839, 0.724)

### j) LDA classification with 2048 LDiA topic vectors

In [43]:
ldia2048=LDiA(n_components=2048,learning_method='batch')
ldia2048=ldia2048.fit(bow_docs)
# Let's compute 2048-D topic vectors for all the sentences
ldia2048_topic_vectors=ldia2048.transform(bow_docs)
columns2048=['topic{}'.format(i) for i in range(ldia2048.n_components)]
ldia2048_topic_vectors=pd.DataFrame(ldia2048_topic_vectors,index=index,columns=columns2048)
ldia2048_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic2038,topic2039,topic2040,topic2041,topic2042,topic2043,topic2044,topic2045,topic2046,topic2047
sent0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0
sent1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
X_train,X_test,y_train,y_test=train_test_split(ldia2048_topic_vectors,df.score,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['ldia2048_sentiment']=lda.predict(ldia2048_topic_vectors)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)



(0.68, 0.649)

## Conclusions

In this notebook, sentiment classification was performed with ten methods:
- a) LDA classification with TF-IDF vectors: test accuracy 0.682
- b) LDA classification with 256 topic vectors created by LSA (PCA): test accuracy 0.776
- c) LDA classification with 384 topic vectors created by LSA (PCA): test accuracy 0.8   <- Best result with LSA (PCA)
- d) LDA classification with 312 topic vectors created by LSA (PCA): test accuracy 0.789
- e) LDA classification with 312 topic vectors created by LDiA: test accuracy 0.624
- f) LDA classification with 256 topic vectors created by LDiA: test accuracy 0.64
- g) LDA classification with 384 topic vectors created by LDiA: test accuracy 0.625
- h) LDA classification with 512 topic vectors created by LDiA: test accuracy 0.665
- i) LDA classification with 1024 topic vectors created by LDiA: test accuracy 0.724  <- Best result with LDiA
- j) LDA classification with 2048 topic vectors created by LDiA: test accuracy 0.649
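
The comparison in the list can be checked with a few lines of code. The sketch below simply copies the test accuracies for b) to j) into a dictionary and picks the best configuration per topic-vector method; the variable names are chosen here for illustration.

```python
# Test accuracies copied from the list above, keyed by (method, number of topic vectors)
test_accuracy = {
    ('LSA (PCA)', 256): 0.776,
    ('LSA (PCA)', 384): 0.8,
    ('LSA (PCA)', 312): 0.789,
    ('LDiA', 312): 0.624,
    ('LDiA', 256): 0.64,
    ('LDiA', 384): 0.625,
    ('LDiA', 512): 0.665,
    ('LDiA', 1024): 0.724,
    ('LDiA', 2048): 0.649,
}

best = {}
for (method, n_topics), acc in test_accuracy.items():
    # Keep the highest-scoring configuration seen so far for each method
    if method not in best or acc > best[method][1]:
        best[method] = (n_topics, acc)

for method, (n_topics, acc) in best.items():
    print(method, n_topics, acc)
```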

The results show that some kind of dimension reduction is needed, since a) overfitted the training set (training accuracy 1.0, test accuracy 0.682). The best test accuracy, 0.8, was obtained with c), where LSA (Latent Semantic Analysis with PCA) was used to create 384 topic vectors and LDA classification was then performed on them.

When LDiA (Latent Dirichlet Allocation) was used instead of PCA to create the topic vectors, a larger number of topics was required. The best result with LDiA, 0.724, was obtained with 1024 topic vectors, which is still below every LSA (PCA) result.