# Finding Themes in Indonesian Twitter with LSA and LDA
### Brian Friederich
### 1 August 2018 

## I. Definition

### Project Overview

In this section, provide a high-level overview of the project in layman’s terms. Questions to ask yourself when writing this section:

Has an overview of the project been provided, such as the problem domain, project origin, and related datasets or input data?
Has enough background information been given so that an uninformed reader would understand the problem domain and following problem statement?

### Problem Statement

In this section, you will want to clearly define the problem that you are trying to solve, including the strategy (outline of tasks) you will use to achieve the desired solution. You should also thoroughly discuss what the intended solution will be for this problem. Questions to ask yourself when writing this section:

Is the problem statement clearly defined? Will the reader understand what you are expecting to solve?
Have you thoroughly discussed how you will attempt to solve the problem?
Is an anticipated solution clearly defined? Will the reader understand what results you are looking for?

### Metrics

In this section, you will need to clearly define the metrics or calculations you will use to measure performance of a model or result in your project. These calculations and metrics should be justified based on the characteristics of the problem and problem domain. Questions to ask yourself when writing this section:

Are the metrics you’ve chosen to measure the performance of your models clearly discussed and defined?
Have you provided reasonable justification for the metrics chosen based on the problem and solution?

## II. Analysis

### Data Exploration

In this section, you will be expected to analyze the data you are using for the problem. This data can be in the form of a dataset (or datasets), input data (or input files), or even an environment. The type of data should be thoroughly described and, if possible, basic statistics and information should be presented (such as discussion of input features or defining characteristics of the input or environment). Any abnormalities or interesting qualities of the data that may need to be addressed should be identified (such as features that need to be transformed or the possibility of outliers). Questions to ask yourself when writing this section:

If a dataset is present for this problem, have you thoroughly discussed certain features about the dataset? Has a data sample been provided to the reader?
If a dataset is present for this problem, are statistics about the dataset calculated and reported? Have any relevant results from this calculation been discussed?
If a dataset is not present for this problem, has discussion been made about the input space or input data for your problem?
Are there any abnormalities or characteristics about the input space or dataset that need to be addressed? (categorical variables, missing values, outliers, etc.)

In [52]:
# Load and visualize dataset
import pandas as pd
data = pd.read_csv('tweets.csv')
print(data.head())

# Find dataset size (use `data`; `tweets` is not defined until the next cell)
print("\n{} tweets in dataset".format(len(data)))

                                           isi_tweet  sentimen
0  tidak setuju jokowi jadi cawapres capres jokow...         1
1  capres jokowi wacapres abraham samad gubernur ...         1
2  capres prabowo dan cawapres jokowi dan gubdki ...         1
3  jadi skenarionya gini 2014 biar prabowo jadi p...         1
4  sby mantan tni dan calon presiden prabowo subi...         1

1846 tweets in dataset


In [53]:
# Add an explicit index column for later use with gensim
data_text = data[['isi_tweet']].copy()  # copy to avoid pandas' SettingWithCopyWarning
data_text['index'] = data_text.index
tweets = data_text
print(tweets.head())

                                           isi_tweet  index
0  tidak setuju jokowi jadi cawapres capres jokow...      0
1  capres jokowi wacapres abraham samad gubernur ...      1
2  capres prabowo dan cawapres jokowi dan gubdki ...      2
3  jadi skenarionya gini 2014 biar prabowo jadi p...      3
4  sby mantan tni dan calon presiden prabowo subi...      4
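
Basic statistics such as tweet length and label balance can be summarized with a few lines of standard-library Python. The following is a minimal sketch, not part of the original notebook: the rows are made-up stand-ins for the `isi_tweet` and `sentimen` columns, the label values are hypothetical, and `summarize` is an illustrative helper name.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical stand-in rows mirroring the (isi_tweet, sentimen) columns.
rows = [
    ("capres jokowi harga mati", 1),
    ("prabowo calon presiden", 1),
    ("tidak setuju dengan capres itu", -1),
    ("jk cawapres", 0),
]

def summarize(rows):
    """Return the tweet count, token-length statistics, and label counts."""
    lengths = [len(text.split()) for text, _ in rows]
    return {
        "n_tweets": len(rows),
        "mean_tokens": mean(lengths),
        "median_tokens": median(lengths),
        "min_tokens": min(lengths),
        "max_tokens": max(lengths),
        "label_counts": dict(Counter(label for _, label in rows)),
    }

summary = summarize(rows)
for name, value in summary.items():
    print("{}: {}".format(name, value))
```

On the real data, the same function could be fed with `zip(data['isi_tweet'], data['sentimen'])`.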


### Exploratory Visualization

In this section, you will need to provide some form of visualization that summarizes or extracts a relevant characteristic or feature about the data. The visualization should adequately support the data being used. Discuss why this visualization was chosen and how it is relevant. Questions to ask yourself when writing this section:

Have you visualized a relevant characteristic or feature about the dataset or input data?
Is the visualization thoroughly analyzed and discussed?
If a plot is provided, are the axes, title, and data clearly defined?

### Algorithms and Techniques

In this section, you will need to discuss the algorithms and techniques you intend to use for solving the problem. You should justify the use of each one based on the characteristics of the problem and the problem domain. Questions to ask yourself when writing this section:

Are the algorithms you will use, including any default variables/parameters in the project, clearly defined?
Are the techniques to be used thoroughly discussed and justified?
Is it made clear how the input data or datasets will be handled by the algorithms and techniques chosen?

In [55]:
#import packages used
import csv
import re
import gensim
from gensim import corpora, models
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
import numpy as np
import Sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# set random seed for reproducible results
np.random.seed(45)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/brianfrieerich/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Benchmark

In this section, you will need to provide a clearly defined benchmark result or threshold for comparing across performances obtained by your solution. The reasoning behind the benchmark (in the case where it is not an established result) should be discussed. Questions to ask yourself when writing this section:

Has some result or value been provided that acts as a benchmark for measuring performance?
Is it clear how this result or value was obtained (whether by data or by hypothesis)?

## III. Methodology

### Data Preprocessing

In this section, all of your preprocessing steps will need to be clearly documented, if any were necessary. From the previous section, any of the abnormalities or characteristics that you identified about the dataset will be addressed and corrected here. Questions to ask yourself when writing this section:

If the algorithms chosen require preprocessing steps like feature selection or feature transformations, have they been properly documented?
Based on the Data Exploration section, if there were abnormalities or characteristics that needed to be addressed, have they been properly corrected?
If no preprocessing is needed, has it been made clear why?

In [56]:
# Create Indonesian stemming function and test on conjugated sentences
factory = StemmerFactory()
indoStemmer = factory.create_stemmer()

def indoStem(text):
    stemmed = indoStemmer.stem(text)
    return stemmed

sentences = ["Mereka meniru-nirukannya", "Saya pembantu", "Dia dipanggil oleh wanita tercantik"]

for sentence in sentences:
    print("-----\nOriginal: {}".format(sentence))
    output = indoStem(sentence)
    print("\nStemmed: {}".format(output))

-----
Original: Mereka meniru-nirukannya

Stemmed: mereka tiru
-----
Original: Saya pembantu

Stemmed: saya bantu
-----
Original: Dia dipanggil oleh wanita tercantik

Stemmed: dia panggil oleh wanita cantik


In [57]:
# Read the Indonesian stopword CSV and flatten its rows into a set for fast lookups
with open('stopwords.csv', 'r') as f:
    stopwords_list = list(csv.reader(f))
flat_stoplist = {item for sublist in stopwords_list for item in sublist}

# Preprocess a tweet: replace non-alphabetic characters with spaces, stem the words,
# and drop stopwords
def preprocess(text):
    text = re.sub("[^a-zA-Z]+", " ", text)
    stemmed = indoStem(text)
    # split() with no argument also discards empty strings left by extra whitespace
    return [word for word in stemmed.split() if word not in flat_stoplist]
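
The cleaning and filtering steps can be tried in isolation. The following is a minimal sketch rather than the notebook's actual pipeline: `clean_and_filter` is a hypothetical helper, the Sastrawi stemmer is replaced by plain lowercasing so the example runs without extra packages, and the stopword set is made up.

```python
import re

def clean_and_filter(text, stopwords, stem=str.lower):
    """Replace non-letters with spaces, apply a stemmer stand-in, drop stopwords."""
    text = re.sub("[^a-zA-Z]+", " ", text)
    # split() without arguments ignores leading, trailing, and repeated spaces,
    # so punctuation at the end of a tweet does not produce an empty token.
    return [word for word in stem(text).split() if word not in stopwords]

# Hypothetical stopword set containing only "dan" ("and").
tokens = clean_and_filter("Capres 2014: Jokowi dan JK!", {"dan"})
print(tokens)  # ['capres', 'jokowi', 'jk']
```

The digits and punctuation disappear in the regex step, and the stopword is removed afterwards, which mirrors the order of operations used in `preprocess`.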

In [34]:
# Map preprocessing to isi_tweet column and print first 5 instances to check
processed_tweets = list(map(preprocess, tweets['isi_tweet']))
processed_tweets[:5]

[['tuju', 'jokowi', 'cawapres', 'capres', 'jokowi', 'harga', 'mati'],
 ['capres',
  'jokowi',
  'wacapres',
  'abraham',
  'samad',
  'gubernur',
  'ahok',
  'koruptor',
  'abissss'],
 ['capres',
  'prabowo',
  'cawapres',
  'jokowi',
  'gubdki',
  'ahok',
  'mantap',
  'presiden',
  'sby',
  'bubar',
  'fpi'],
 ['skenario',
  'gin',
  'biar',
  'prabowo',
  'presiden',
  'jokowi',
  'tetepgubernur',
  'jakarta',
  'hasil',
  'nunggu',
  'gantiin',
  'prabowo'],
 ['sby',
  'mantan',
  'tni',
  'calon',
  'presiden',
  'prabowo',
  'subianto',
  'mantan',
  'kopassus',
  'anggoto',
  'tni',
  'disiplin',
  'smw',
  'presiden']]

In [35]:
# Map each unique token in the processed tweets to an integer id
dictionary = gensim.corpora.Dictionary(processed_tweets)

In [36]:
# Drop tokens that appear in fewer than 15 tweets or in more than 10% of all tweets
dictionary.filter_extremes(no_below = 15, no_above = 0.1)
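
What `filter_extremes` does with these two thresholds can be sketched in plain Python. This is an illustration rather than gensim's code: `filter_extremes_sketch` and the toy documents are hypothetical, and gensim's additional `keep_n` cap is ignored. With 1846 tweets and `no_above = 0.1`, the upper bound is 184.6, so a token is dropped once it occurs in 185 or more tweets.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above):
    """Return tokens found in at least `no_below` documents and in at most
    a `no_above` fraction of all documents."""
    n_docs = len(docs)
    # Document frequency: count each token once per document.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok for tok, df in doc_freq.items()
            if no_below <= df <= no_above * n_docs}

# Toy corpus: "jk" is in every document, "pan" and "survey" in only one each.
toy_docs = [["jk", "hatta"], ["jk", "pan"], ["jk", "hatta"], ["jk", "survey"]]
kept = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.5)
print(kept)  # {'hatta'}
```

Very common tokens carry little information for separating topics, and very rare ones give the models too few observations to place them reliably, which is the usual motivation for filtering at both ends.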

In [37]:
# Convert each tweet to a bag-of-words vector of (token_id, count) pairs
bow_corpus = [dictionary.doc2bow(twt) for twt in processed_tweets]
bow_corpus[15]

[(2, 1), (3, 1), (20, 2), (22, 1)]

In [38]:
# Show the word behind each token id in tweet 15
bow_tweet_15 = bow_corpus[15]
for token_id, count in bow_tweet_15:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 2 ("ahok") appears 1 time.
Word 3 ("gubernur") appears 1 time.
Word 20 ("ya") appears 2 time.
Word 22 ("indonesia") appears 1 time.


In [39]:
from pprint import pprint

# Reweight the bag-of-words counts with tf-idf and preview the first document
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5104769343386324), (1, 0.8598914463513587)]
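
The two weights above are consistent with a unit-length vector (0.5105² + 0.8599² ≈ 1), which matches gensim's default of L2-normalizing each tf-idf vector. A minimal sketch of that weighting follows, assuming the default scheme of raw term frequency times log2(N / df). `tfidf_vector` and the toy counts are hypothetical, and gensim additionally drops near-zero weights, which this sketch does not.

```python
import math

def tfidf_vector(bow, doc_freq, n_docs):
    """Weight (token_id, count) pairs by tf * log2(N / df), then scale to unit length."""
    weighted = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weighted]

# Hypothetical counts: token 0 occurs once, token 1 twice, in a corpus of 8 documents.
bow = [(0, 1), (1, 2)]
doc_freq = {0: 2, 1: 4}  # token 0 is the rarer of the two
vec = tfidf_vector(bow, doc_freq, n_docs=8)
print(vec)
```

In this toy case the rarer token 0 ends up with the same weight as token 1 even though it occurs half as often in the document, because its higher inverse document frequency compensates.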


### Implementation

In this section, you will need to clearly document how you implemented the metrics, algorithms, and techniques for the given data. It should be abundantly clear how the implementation was carried out, and any complications that occurred during this process should be discussed. Questions to ask yourself when writing this section:

Is it made clear how the algorithms and techniques were implemented with the given datasets or input data?
Were there any complications with the original metrics or techniques that required changing prior to acquiring a solution?
Was there any part of the coding process (e.g., writing complicated functions) that should be documented?

### Refinement

In this section, you will need to discuss the process of improvement you made upon the algorithms and techniques you used in your implementation. For example, adjusting parameters for certain models to acquire improved solutions would fall under the refinement category. Your initial and final solutions should be reported, as well as any significant intermediate results as necessary. Questions to ask yourself when writing this section:

Has an initial solution been found and clearly reported?
Is the process of improvement clearly documented, such as what techniques were used?
Are intermediate and final solutions clearly reported as the process is improved?

## IV. Results

### Model Evaluation and Validation

In this section, the final model and any supporting qualities should be evaluated in detail. It should be clear how the final model was derived and why this model was chosen. In addition, some type of analysis should be used to validate the robustness of this model and its solution, such as manipulating the input data or environment to see how the model’s solution is affected (this is called sensitivity analysis). Questions to ask yourself when writing this section:

Is the final model reasonable, and does it align with solution expectations? Are the final parameters of the model appropriate?
Has the final model been tested with various inputs to evaluate whether the model generalizes well to unseen data?
Is the model robust enough for the problem? Do small perturbations (changes) in training data or the input space greatly affect the results?
Can results found from the model be trusted?

In [40]:
num_topics = 3

In [41]:
# Fit LSA (gensim's LsiModel) on the tf-idf corpus and print the top terms of each topic
lsa_model = gensim.models.LsiModel(corpus_tfidf,
                                   num_topics = num_topics,
                                   id2word=dictionary)
for idx, topic in lsa_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.381*"jk" + 0.374*"hatta" + 0.259*"gubernur" + 0.225*"nyata" + 0.217*"tokoh" + 0.212*"cawapres" + 0.203*"sosok" + 0.200*"populer" + 0.200*"bincang" + 0.197*"ketua"


Topic: 1 
Words: -0.420*"hatta" + 0.319*"gubernur" + 0.297*"nyata" + 0.286*"sosok" + 0.283*"populer" + 0.282*"bincang" + 0.282*"tokoh" + -0.214*"buka" + -0.211*"pan" + -0.207*"ketua"


Topic: 2 
Words: -0.644*"jk" + 0.227*"hatta" + -0.215*"dahlan" + -0.205*"arb" + 0.188*"sosok" + 0.187*"gubernur" + 0.186*"populer" + 0.185*"bincang" + 0.169*"tokoh" + 0.168*"nyata"
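
LSA topics come from a truncated singular value decomposition of the term-document matrix, which is why the weights above can be negative: each topic is a direction in term space, and topics after the first must be orthogonal to the earlier ones. The following NumPy sketch illustrates the idea on a made-up matrix. It is not gensim's implementation, the values in `X` are hypothetical, and the final projection is only assumed to be analogous to the per-document scores a fitted model returns.

```python
import numpy as np

# Hypothetical tf-idf-style weights: rows are terms, columns are documents.
terms = ["jk", "hatta", "gubernur", "tokoh"]
X = np.array([
    [1.0, 0.9, 0.0, 0.0, 0.1],
    [0.8, 1.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.9, 0.8],
    [0.0, 0.1, 0.9, 1.0, 0.7],
])

def lsa(term_doc, k):
    """Return the k leading left singular vectors, singular values, and right vectors."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

U_k, s_k, Vt_k = lsa(X, k=2)

# Each column of U_k is a topic: one signed weight per term.
for topic_id in range(U_k.shape[1]):
    weights = " + ".join("{:.3f}*\"{}\"".format(w, t)
                         for w, t in zip(U_k[:, topic_id], terms))
    print("Topic {}: {}".format(topic_id, weights))

# Projecting document 0 onto the topic directions gives its signed topic scores.
doc_scores = U_k.T @ X[:, 0]
print(doc_scores)
```

Because singular vectors are only defined up to sign, the sign of a whole topic carries no meaning on its own; what matters is which terms share a sign and which oppose each other within a topic.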




In [42]:
# Fit LDA on the raw bag-of-words counts and print the top terms of each topic
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics = num_topics, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers = 4)
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.078*"dahlan" + 0.061*"arb" + 0.046*"pilih" + 0.038*"iskan" + 0.035*"rakyat" + 0.030*"wiranto" + 0.028*"dukung" + 0.027*"jk" + 0.024*"ya" + 0.020*"konvensi"


Topic: 1 
Words: 0.071*"jk" + 0.045*"gubernur" + 0.044*"calon" + 0.031*"tokoh" + 0.031*"nyata" + 0.030*"mahfud" + 0.028*"indonesia" + 0.027*"pdip" + 0.026*"sosok" + 0.025*"populer"


Topic: 2 
Words: 0.112*"hatta" + 0.052*"ketua" + 0.051*"cawapres" + 0.044*"pan" + 0.044*"buka" + 0.039*"evaluasi" + 0.039*"pencapresan" + 0.038*"radjasa" + 0.026*"survey" + 0.020*"tweet"




In [43]:
# Fit a second LDA model on the tf-idf weighted corpus for comparison
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf,
                                             num_topics = num_topics,
                                             id2word = dictionary,
                                             passes = 2,
                                             workers = 4)

for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.068*"jk" + 0.041*"dukung" + 0.038*"wiranto" + 0.032*"mahfud" + 0.032*"pdip" + 0.026*"mega" + 0.022*"md" + 0.022*"kalah" + 0.020*"ical" + 0.020*"menang"


Topic: 1 
Words: 0.046*"hatta" + 0.038*"calon" + 0.037*"indonesia" + 0.033*"pilih" + 0.026*"wapres" + 0.026*"jakarta" + 0.025*"ketua" + 0.023*"ya" + 0.022*"buka" + 0.021*"tweet"


Topic: 2 
Words: 0.051*"dahlan" + 0.043*"arb" + 0.033*"rakyat" + 0.030*"nyata" + 0.029*"iskan" + 0.028*"gubernur" + 0.026*"sby" + 0.026*"sosok" + 0.025*"tokoh" + 0.025*"maju"




In [44]:
# Pick one tweet and show its raw and preprocessed forms
document_num = 13
print(data.iloc[document_num, 0])

print(processed_tweets[document_num])

prabowo subianto vs joko widodo no prabowo subianto presiden jokowi wapres yes aminin yaaaaa allah
['prabowo', 'subianto', 'vs', 'joko', 'widodo', 'no', 'prabowo', 'subianto', 'presiden', 'jokowi', 'wapres', 'yes', 'aminin', 'yaaaaa', 'allah']


In [45]:
for index, score in sorted(lsa_model[corpus_tfidf[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lsa_model.print_topic(index, 10)))


Score: 0.037299727710397364	 
Topic: 0.381*"jk" + 0.374*"hatta" + 0.259*"gubernur" + 0.225*"nyata" + 0.217*"tokoh" + 0.212*"cawapres" + 0.203*"sosok" + 0.200*"populer" + 0.200*"bincang" + 0.197*"ketua"

Score: -0.0010628898567282214	 
Topic: -0.420*"hatta" + 0.319*"gubernur" + 0.297*"nyata" + 0.286*"sosok" + 0.283*"populer" + 0.282*"bincang" + 0.282*"tokoh" + -0.214*"buka" + -0.211*"pan" + -0.207*"ketua"

Score: -0.03145815454014468	 
Topic: -0.644*"jk" + 0.227*"hatta" + -0.215*"dahlan" + -0.205*"arb" + 0.188*"sosok" + 0.187*"gubernur" + 0.186*"populer" + 0.185*"bincang" + 0.169*"tokoh" + 0.168*"nyata"
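
LSA scores are signed projections rather than probabilities, so sorting by raw value places a strongly negative score last even when it has a large magnitude. One option, offered here as an assumption about what is useful rather than as the notebook's method, is to rank by absolute value. `rank_by_magnitude` is a hypothetical helper, and the scores are copied from the output above.

```python
def rank_by_magnitude(doc_topics):
    """Sort (topic_id, score) pairs by absolute score, largest first."""
    return sorted(doc_topics, key=lambda pair: abs(pair[1]), reverse=True)

# LSA scores for document 13, copied from the output above.
lsa_scores = [(0, 0.037299727710397364),
              (1, -0.0010628898567282214),
              (2, -0.03145815454014468)]

ranked = rank_by_magnitude(lsa_scores)
for topic_id, score in ranked:
    print("Topic {}: {:+.4f}".format(topic_id, score))
```

With this ordering, topic 2 moves ahead of topic 1 because its score is further from zero.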


In [46]:
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.8833066821098328	 
Topic: 0.112*"hatta" + 0.052*"ketua" + 0.051*"cawapres" + 0.044*"pan" + 0.044*"buka" + 0.039*"evaluasi" + 0.039*"pencapresan" + 0.038*"radjasa" + 0.026*"survey" + 0.020*"tweet"

Score: 0.06438399851322174	 
Topic: 0.071*"jk" + 0.045*"gubernur" + 0.044*"calon" + 0.031*"tokoh" + 0.031*"nyata" + 0.030*"mahfud" + 0.028*"indonesia" + 0.027*"pdip" + 0.026*"sosok" + 0.025*"populer"

Score: 0.052309323102235794	 
Topic: 0.078*"dahlan" + 0.061*"arb" + 0.046*"pilih" + 0.038*"iskan" + 0.035*"rakyat" + 0.030*"wiranto" + 0.028*"dukung" + 0.027*"jk" + 0.024*"ya" + 0.020*"konvensi"
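
Unlike the LSA scores, the LDA scores form a probability distribution over topics: the three values above sum to approximately 1. The sketch below picks the dominant topic from such a list. The topic ids are inferred by matching the printed topics to the earlier model output, `dominant_topic` is a hypothetical helper, and gensim may omit very low-probability topics, in which case the sum falls slightly below 1.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# LDA (bag-of-words) scores for document 13, copied from the output above.
doc_topics = [(0, 0.052309323102235794),
              (1, 0.06438399851322174),
              (2, 0.8833066821098328)]

best_topic, best_prob = dominant_topic(doc_topics)
total = sum(prob for _, prob in doc_topics)
print("Dominant topic: {} ({:.1%} of the tweet)".format(best_topic, best_prob))
print("Probabilities sum to {:.4f}".format(total))
```

A single topic holding most of the probability mass, as here, indicates that the model assigns the tweet almost entirely to one theme.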


In [47]:
for index, score in sorted(lda_model_tfidf[corpus_tfidf[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.4878930151462555	 
Topic: 0.046*"hatta" + 0.038*"calon" + 0.037*"indonesia" + 0.033*"pilih" + 0.026*"wapres" + 0.026*"jakarta" + 0.025*"ketua" + 0.023*"ya" + 0.022*"buka" + 0.021*"tweet"

Score: 0.4027964174747467	 
Topic: 0.051*"dahlan" + 0.043*"arb" + 0.033*"rakyat" + 0.030*"nyata" + 0.029*"iskan" + 0.028*"gubernur" + 0.026*"sby" + 0.026*"sosok" + 0.025*"tokoh" + 0.025*"maju"

Score: 0.10931061953306198	 
Topic: 0.068*"jk" + 0.041*"dukung" + 0.038*"wiranto" + 0.032*"mahfud" + 0.032*"pdip" + 0.026*"mega" + 0.022*"md" + 0.022*"kalah" + 0.020*"ical" + 0.020*"menang"


In [50]:
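# Score an unseen, made-up tweet against each model
fake_tweet = "Saya mendukung JK dan Kalla! PDI-P selamanya!"
bow_vector = dictionary.doc2bow(preprocess(fake_tweet))
# Note: lsa_model and lda_model_tfidf were trained on tf-idf weights, so tfidf[bow_vector]
# would be the consistent input for them; the outputs below come from the raw counts.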

print("----------------------------------------\nLSA Model:")
for index, score in sorted(lsa_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lsa_model.print_topic(index, 10)))

print("----------------------------------------\nLDA with BOW Model:")
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))

print("----------------------------------------\nLDA with Tf-idf Model:")
for index, score in sorted(lda_model_tfidf[bow_vector], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))

----------------------------------------
LSA Model:

Score: 0.49057098758876205	 
Topic: 0.381*"jk" + 0.374*"hatta" + 0.259*"gubernur" + 0.225*"nyata" + 0.217*"tokoh" + 0.212*"cawapres" + 0.203*"sosok" + 0.200*"populer" + 0.200*"bincang" + 0.197*"ketua"

Score: 0.10195250765494418	 
Topic: -0.420*"hatta" + 0.319*"gubernur" + 0.297*"nyata" + 0.286*"sosok" + 0.283*"populer" + 0.282*"bincang" + 0.282*"tokoh" + -0.214*"buka" + -0.211*"pan" + -0.207*"ketua"

Score: -0.7617690669928293	 
Topic: -0.644*"jk" + 0.227*"hatta" + -0.215*"dahlan" + -0.205*"arb" + 0.188*"sosok" + 0.187*"gubernur" + 0.186*"populer" + 0.185*"bincang" + 0.169*"tokoh" + 0.168*"nyata"
----------------------------------------
LDA with BOW Model:

Score: 0.8555165529251099	 
Topic: 0.071*"jk" + 0.045*"gubernur" + 0.044*"calon" + 0.031*"tokoh" + 0.031*"nyata" + 0.030*"mahfud" + 0.028*"indonesia" + 0.027*"pdip" + 0.026*"sosok" + 0.025*"populer"

Score: 0.0761055201292038	 
Topic: 0.078*"dahlan" + 0.061*"arb" + 0.046*"pilih" 

### Justification

In this section, your model’s final solution and its results should be compared to the benchmark you established earlier in the project using some type of statistical analysis. You should also justify whether these results and the solution are significant enough to have solved the problem posed in the project. Questions to ask yourself when writing this section:

Are the final results found stronger than the benchmark result reported earlier?
Have you thoroughly analyzed and discussed the final solution?
Is the final solution significant enough to have solved the problem?

## V. Conclusion

### Free-Form Visualization

In this section, you will need to provide some form of visualization that emphasizes an important quality about the project. It is much more free-form, but should reasonably support a significant result or characteristic about the problem that you want to discuss. Questions to ask yourself when writing this section:

Have you visualized a relevant or important quality about the problem, dataset, input data, or results?
Is the visualization thoroughly analyzed and discussed?
If a plot is provided, are the axes, title, and data clearly defined?

### Improvement

In this section, you will need to discuss how one aspect of the implementation you designed could be improved. As an example, consider ways your implementation can be made more general, and what would need to be modified. You do not need to make this improvement, but the potential solutions resulting from these changes should be considered and compared with your current solution. Questions to ask yourself when writing this section:

Are there further improvements that could be made on the algorithms or techniques you used in this project?
Were there algorithms or techniques you researched that you did not know how to implement, but would consider using if you knew how?
If you used your final solution as the new benchmark, do you think an even better solution exists?