## 1. Introduction

### 1.1 Natural Language Processing

Natural Language Processing is an area of computer science and information engineering within the field of Artificial Intelligence that is focused on the interaction between machines and humans using the natural language. The major tasks of NLP include:

- *Question Answering Systems (QAS)* - These systems automatically answer questions from users in a natural language.

- *Summarization* - The goal of this field is to create a logical and fluent summary from a collection of documents or emails. 

- *Machine Translation* - Being one of the significant areas of NLP, machine translation refers to the assignment of automatically converting a given text to other languages fluently while preserving the original meaning of the text.

- *Speech Recognition* - Speech recognition aims to enable machines to recognize and translate the spoken language into text.

- *Document classification* - This is one of the strong areas of NLP.  Document classification aims to distinguish the category of a document.



### 1.2 Topic Modelling


Topic modelling is the process of discovering the key themes that occur in a collection of documents. It is mostly used as a text-mining tool to find hidden semantic structures in a text body. One of the topic models vastly used is Latent Dirichlet Allocation. LDA assumes that documents are randomly generated from a set of topics which each of these topics being a word distribution. These topics are hidden distributions derived from the regulation through the Dirichlet process. The algorithm repeats this process a number of times for each document, adjusting the topics thoroughly to adapt them to reach a fixed convergence.


## 2. Review of Literature

### 2.1 Terminology
    
  In the paper, we use specific terms for the text collections. We refer to these such as "words", "corpora" and "topic".  It makes it easier for us to understand and make it much clearer when we are trying to identify topics after we precede latent variables. 
    
   We define the following terms:
    
   - *Word* - The basic element of discrete data. It is also called a "token". It can be formally defined as an element from a vocabulary that is indexed by {1,...,V}. Words are represented as unit-based vectors that have one componant equal to one and other components equal to zero. So, the *v*th word from a vocabulary is denoted as a V-vector w such as $w^v$ = 1 and $w^u$ = 0 for u$\neq$v.
   
   
   - *Vocabulary* - A vocabulary is defined as a collection of all unique words. Formally, it is denoted as a V-vector that is indexed by {1,...,V}. 
   
   
   - *Document* - A document denoted by w = ($w_1$,$w_2$,...,$w_N$) is a sequence of N words, where $w_n$ is the *n*th word in the sequence.
   
  
   - *Corpus* -  A corpus denoted by D = {$w_1$,$w_2$,...,$w_M$} is a collection of M documents. 
   
  
   - *Topic* - A topic is a probability distribution on vocabulary
    

**Topic Models**

   Topic models are used for identifying abstract "topics", or "latent" variables in a collection of documents. With the use    of topics, the goals differ from examining a large corpus to classifying new documents.

### 2.2 Brief History of Topic Modelling

 With the need for making documents more understandable, the methods for text mining were created. The basic concept of text mining is to extract meaningful numeric indices from a text. Term-frequency inverse-document-frequency (tf-idf) was the first method to be used. This method compares the frequency of a word in a document to its overall occurrence in all documents, which shows us the importance of that word in the document. But it does not provide a special insight. For example, it may show us the high frequency of two terms but we can't see the information about how frequently they are used together.  Tf-idf matrix is used by latent semantic indexing (LSI) which is where the foundation of a topic model came from. We can get the key correlation by applying the Singular Value Decomposition(SVD) to the matrix. Continuing from the example above, this will give us the information about how frequently they are used together. 
 
LSI was introduced in 1988 by Deerwester et al.  It was used for the progress of a topic model in 1990 as a source but it is not a probabilistic model, therefore it is not a genuine model. 
  
In 1999, Thomas Hofmann created probabilistic latent semantic analysis (PLSA or also known as PLSI). PLSI is a latent class model that is established from a mixture decomposition.  It chooses a random topic from the document's topic distribution and then it chooses a random word from the document's word distribution to generate a word. Basically, as a generative model, it analyses the co-occurrence data of words and documents. For example, if the algorithm chose the topic as vehicle, the generated words will be such as "car" or "door" and in that case, we can also see them having a high occurrence. The disadvantage is that this method leads to overfitting. 
   
Latent Dirichlet allocation (LDA) is an extension of PLSA. LDA is the most popular topic model that is being used as a base for other probabilistic models.  

### 2.3 Latent Dirichlet Allocation

#### 2.3.1 LDA Assumptions

LDA is the simplest topic model and has a couple of key assumptions: 

1. Distribution over a vocabulary is called a *topic*.

2. Each document is created by randomly choosing a topic over a topic mixture, and then randomly choosing a word from that topic. Each word in the document is generated by repeating this process.

3. Each document have a set of topics.

4. Documents are represented by by a probability mass function for topic selection.

5. Words are assumed to be unordered and they are also generated indepently of other words. In other way, it means we can take each document as "bag of words".

6. The number of topics is the only parameter that is defined prior.

We can observe only  the results for the given documents.  Hidden distributions of topics must be deduced.  The difference between LDA and other methods is that for LDA, Dirichlet distribution is the prior distribution for each topic. After randomly choosing topics, Bayesian Inference is used for iterating them which in every iteration the algorithm renews every topic in accordance to the relation between the topics and documents. This process proceeds until the user is satisfied.

#### 2.3.2 Bayesian Inference

Bayesian Inference is the process of adjusting the changes of an event which is the Bayesian interpretation of probability, in regards to the new data that is being collected. But it does not ignore the previous data. 
For example, you went to your hometown. I can make several assumptions about how you travelled:
- You traveled by car.
- You traveled by train.
- You walked all the way.

With the previos data or observation I can tell that that the probability of you travelling by foot is low because the data shows people do not usually travel like that. If I see a train ticket, the probability of you taking the train gets higher and other possibilities gets lower. If I see dirt on your shoes I can assume you walked all the way but I will be skeptical about it because of the previous data. So with the observation of the available data I have, I can improve the possibilities of my estimates. 

#### 2.3.3 Dirichlet Distribution

LDA assumes the following generative process for each document w in a corpus D:
1. Choose *N* ∼ Poisson(ξ).
2. Choose θ ∼ Dir(α).
3. For each of the *N* words $w_n$:
    - Choose a topic $z_n$ ∼ Multinomial(θ).
    - Choose a word $w_n$ from p($w_n$ |$z_n$,β), a multinomial probability conditioned on the topic $z_n$.

Dirichlet distribution is a multivariate generalization of the Beta distribution. Thus, we need to start with *Beta Distribution*.

**Beta Distribution:**

This is a distribution over [0,1] with parameters, α and β. These parameters control the shape of the distribution, and it's probability density function is given by

\begin{aligned}f(x;\alpha ,\beta )&={\frac {1}{\mathrm {B} (\alpha ,\beta )}}x^{\alpha -1}(1-x)^{\beta -1}\end{aligned}



- *B(α, β)* - Beta function serves to normalize the distribution.


**Multinomial Distribution:**

$x_i$ ∈ {0, . . . , n} , $θ_i$ $\geq$ 0

\begin{aligned} P(X_1=x_1,...,X_n=x_n)=\frac{N!}{\prod_{i=1}^{k}{x_i!}}\prod_{i=1}^{k}\theta_i^{x_i},N=\sum_{i=1}^kx_i, \sum_{i=1}^k\theta_i = 1\end{aligned} 

When k = 1, the multinomial distribution simplifies to a unigram language model with 1-of-V coding:

\begin{aligned}P(x|\theta)=\prod_{i=1}^k\theta_i^{x_i}, \sum_{i=1}^k\theta_i =1\end{aligned} 

In multimonial distribution: 

- $x_i$ indicates word *i* of the vocabulary observed.

- $θ_i$ = P($w_i$) indicates the probability that word *i* is seen.

**Dirichlet Distribution:**

Dirichlet (continuous) distribution with parameter α given by:

\begin{aligned}p(x|\alpha) = {\frac{\Gamma (\sum_{i=1}^{k}\alpha_{i})}{\prod_{i=1}^{k}\Gamma (\alpha_{i})}{\prod_{i=1}^{k}x_i^{\alpha_{i}-1}} }\end{aligned} 

- k is the dimensionality of the distribution.
- α is a k-vector with components $α_i$ > 0
- Γ(x) is the Gamma function

Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by:

\begin{aligned}p(\theta,z,w|\alpha,\beta)={{p(\theta|\alpha)}{\prod_{n=1}^{n}p(z_n|\theta)p(w_n|z_n,\beta)}}\end{aligned}

<img alt="File:Latent Dirichlet allocation.svg" src="http://www.wikizero.biz/index.php?q=aHR0cDovL3VwbG9hZC53aWtpbWVkaWEub3JnL3dpa2lwZWRpYS9jb21tb25zL3RodW1iL2QvZDMvTGF0ZW50X0RpcmljaGxldF9hbGxvY2F0aW9uLnN2Zy81OTNweC1MYXRlbnRfRGlyaWNobGV0X2FsbG9jYXRpb24uc3ZnLnBuZw" decoding="async" width="300" height="100" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Latent_Dirichlet_allocation.svg/890px-Latent_Dirichlet_allocation.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Latent_Dirichlet_allocation.svg/1186px-Latent_Dirichlet_allocation.svg.png 2x" data-file-width="593" data-file-height="311">

This is the *Plate diagram of the graphical model of LDA*. The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.

#### 2.3.4 LDA Implementations

LDA has various implementations but it all comes down to two standart formulations. The difference in these formulations lays within the changes of Dirichlet priors based on the iterations. 

Variational Bayes creates small sets that are optimized analytically with controllable local optimums using the unknows. These optimums are dependent of other parameters. Thus, to reach a convergence, sets are optimized in an iterative process. 

With Gibbs sampling, each measurement is examined over and over again which renews the latent distributions using Bayesin Inference. The posterior distribution will tend toward the values of the deduced parameters that are mostly likely to generate the observed data, but Gibbs sampling being a Monte Carlo method the rate of convergence is unbounded. So, there is a burn-in period which is the time between until the estimated distribution reaches a convergent behavior to represent the true distribution.

If we compare the two of them, Gibbs sampling is easier to use but the burn-in period makes it less desirable to use. In the end, both methods are used for approximating the hidden parameters through iteration.

#### 2.3.5 LDA Strengths and Weaknesses

Since every document is important for LDA, when we try to evaluate documents over a long span of time, it becomes hard to process. If we try to compare the documents that has a big time difference, we need to process each of them separately which in the end reduces the corpus. Consequently, we also can't observe the topic evaluation over the years. Topic evaluation being an importance of different fields, can only be observed by baseline LDA.

## 3. Experimental Validation

### 3.1 Training Overview
    
I will use Python programming language to process the data, which is widely used for statistical computing. 

Python Package Index is able to interact with most of the different languages. It provides a huge standard library which even includes high use programming tasks as scripts. The advantage of language is in the fast use of different codes which shortens the coding time. Python being an open source tool, it is also free to use and distribute.

This research consists of scraping websites, pre-processing of the data and model training. The first step is scraping the RSS feeds of Hürriyet and Cumhuriyet newspapers, but the format can not be processed directly by the algorithm. Thus, I need pre-processing steps to make the raw text data clean and useful. Pre-processing includes parsing the text data, tokenization, stopwords and removing. After the pre-processing step, a dictionary and a corpus is created. Then I apply the data to fit the LDA topic model. Finally, the fitted model will be used for clustering.

 -  *RSS* - RSS is an abbreviation of Rich Site Summary. It is used for publishing regularly updated information. An RSS document includes texts and metadata such as title, description and publishing date.
 
The main advantage of using RSS is that all of the content is in one centralized location which is time effective for users. Otherwise, you need to visit the websites one by one and try to find if there is any new content. 
Another advantage is that the links that are in the RSS feed have topics as subdirectories. For instance, if an article's topic is sports, you can see that in the link as *https://www..../sports/.../...*

### 3.2 Data Retrieving

In [1]:
import requests
import feedparser
from bs4 import BeautifulSoup
import csv

- [*requests*](https://pypi.org/project/requests/2.7.0/) - It allows you to send organic, grass-fed HTTP/1.1 requests, without doing any manual work. 
9
- [*feedparser*](https://pypi.org/project/feedparser/) - FeedParser is a library that parses ATOM and RSS feeds. 

- [*BeautifulSoup*](https://pypi.org/project/beautifulsoup4) - Beautiful Soup is used for extracting data from HTML and XML files.

Below, I extracted the articles, or 'data', from [Hürriyet](http://www.cumhuriyet.com.tr/rss/son_dakika.xml) and [Cumhuriyet](http://www.hurriyet.com.tr/rss/gundem) newspapers' RSS feeds using feedparser and BeautifulSoup libraries. First, I created an empty list and after I extracted the links, I appended them to the empy list.

In [2]:
cumhuriyet_list_sondakika=[]

feed = feedparser.parse('http://www.cumhuriyet.com.tr/rss/son_dakika.xml') 
   
for post in feed.entries:
    print(post.link)
    cumhuriyet_list_sondakika.append(post.link) 

http://www.cumhuriyet.com.tr/haber/diger_sporlar/1410718/Monaco_da_pole_pozisyonu_Hamilton_in.html
http://www.cumhuriyet.com.tr/haber/basketbol/1410716/ilk_yari_finalist_Anadolu_Efes.html
http://www.cumhuriyet.com.tr/haber/ekonomi/1410713/Yargitay_dan_kredi_masraflarina_iliskin_karar.html
http://www.cumhuriyet.com.tr/haber/futbol/1410705/TFF_baskanligina_5_aday.html
http://www.cumhuriyet.com.tr/haber/siyaset/1410698/Cumhurbaskani_Yardimcisi_Oktay__Gonullere_hileyle_girilmeyecegine_inandik.html
http://www.cumhuriyet.com.tr/haber/turkiye/1410666/_Kiyafeti_kirli__diye_minibuse_oturtulmamisti__Sevindiren_haber.html
http://www.cumhuriyet.com.tr/haber/basketbol/1410662/Gaziantep_Basketbol_seriyi_esitledi.html
http://www.cumhuriyet.com.tr/haber/turkiye/1410650/illegal_bahis_ve_kumar_oyununda_MASAK_milyonlarca_liraya_el_koydu.html
http://www.cumhuriyet.com.tr/haber/turkiye/1410631/Ciftciye_tecavuz_iddiasi..._Mahkemede_bunlari_soyledi.html
http://www.cumhuriyet.com.tr/haber/turkiye/1410629/Caya

Below, I used the *cumhuriyet_list_sondakika* list to extract the article paragraphs.

- The first *for loop* is for scraping every link one by one. 

- *request.get()* method is used for retrieving the data from the links. 

- BeautifulSoup creates a parse tree for parsed pages that. Using that parse tree we can find the elements of HTML such as *body*, *h1* and *p elements. 
The div* element here is a section which has an attribute of *news-body*.

- The *if* loop works to find each of the *p* elements inside the *news-body* one by one.  

- After this process, I created *clean_data.txt* and saved the data to csv.

    - clean_data.txt - It has all the article paragraphs.

Note that we have to use different codes for different websites when scraping. Because every site has a structure that is individual to them. HTML elements can be contained with different names. So you need to inspect the source of the page and try to find the elements you want for each site.

As an example, you can see the difference between the structures in Hürriyet and Cumhuriyet sites:
- In Cumhuriyet, *rhd-all-article-detail* is used as a *div* attribute.
- In Hürriyet, *news-body* is used as a *div* attribute. 

In [3]:
for link in cumhuriyet_list_sondakika: 
    source2 = requests.get(link).text
    soup2 = BeautifulSoup(source2, 'lxml') 
    paragraf_cumh = soup2.findAll('div', id = 'news-body')
    if len(paragraf_cumh) != 0:
        paragraf_cumh = paragraf_cumh[0] 
        icerik2 = str()
        for paragraf2_cumh in paragraf_cumh.find_all('p'):
            icerik2 = icerik2 + paragraf2_cumh.text
        icerik2.replace('\n', '')
        csv2 ='"' + icerik2 + '"\n'
        with open('clean_data.txt', 'a', encoding = 'utf-8') as f: 
            f.write(csv2)
            f.close()

In the codes below, I used the same procedure. 

In [30]:
hürriyet_link_list = []

feed = feedparser.parse('http://www.hurriyet.com.tr/rss/gundem') #extracting links
   
for post in feed.entries:
    print(post.link)
    hürriyet_link_list.append(post.link) #adding the links to this list

http://www.hurriyet.com.tr/gundem/cumhurbaskani-erdogandan-necip-fazil-kisakurek-paylasimi-41225776
http://www.hurriyet.com.tr/gundem/taksici-dehset-sacti-kanli-infaz-41225769
http://www.hurriyet.com.tr/gundem/kimyevi-madde-tasiyan-gemi-yandi-130-kisi-hastanelik-oldu-41225768
http://www.hurriyet.com.tr/gundem/pkknin-kirli-yuzu-kadin-teroristin-not-defterinde-41225764
http://www.hurriyet.com.tr/gundem/fuat-oktay-gece-gunduz-calismaya-devam-edecegiz-41225753
http://www.hurriyet.com.tr/gundem/rizede-iki-kamyonet-carpisti-1i-agir-3-yarali-41225730
http://www.hurriyet.com.tr/gundem/vanda-feci-olay-ekipler-lise-ogrencisi-icin-seferber-oldu-41225724
http://www.hurriyet.com.tr/gundem/4-yildir-restorasyonda-olan-sumela-ziyarete-acildi-41225718
http://www.hurriyet.com.tr/gundem/konyada-mhpli-belediye-baskanini-olduren-3-supheli-icin-flas-karar-41225717
http://www.hurriyet.com.tr/gundem/mersinde-ciftlik-evinde-yangin-41225716
http://www.hurriyet.com.tr/gundem/lise-ogrencisinin-cenazesini-imam-bab

In [5]:
for link in hürriyet_link_list: #scraping every link one by one in a for loop
    source = requests.get(link).text #requesting links
    soup = BeautifulSoup(source, 'lxml')
    paragraf = soup.findAll('div',attrs={'class':'rhd-all-article-detail'}) 
    if len(paragraf) != 0:
        paragraf = paragraf[0] 
        icerik = str()
        for paragraf2 in paragraf.find_all('p'): #finding body paragraphs of articles
            icerik = icerik + paragraf2.text
        icerik.replace('\n', '')
        csv ='"' + icerik + '"\n'
        with open('clean_data.txt', 'a', encoding = 'utf-8') as f: #saving links
            f.write(csv)
            f.close()      

### 3.3 Data Pre-processing

One of the most important steps in this research is to clean and process data so that it will give us well-defined results. 

The process I am doing is simplifying the data and eliminating the language dependent factors as much as I can. In text mining, articles are hard to process for computers because they are written in natural language for humans.

To simplify the articles I used these steps:
 
   - Tokenization: Tokenization is the process of taking a document as a string and cutting those strings into pieces. This process also includes removing the punctuation, making all the characters lower and sometimes getting rid of words with fewer characters because they are usually not meaningful for the data.

   - Removing stop words: Stop words are words which are filtered out before or after processing of natural language data (text). They have got no significant meanings, thus they need to be removed.
   
    Note that pre-processing usually have lemmatization and stemming steps, too. But I skipped those steps for two reasons. The first reason is Turkish language being hard to process because of the complex morphology and the way morphology interacts with syntax. The second reason is there are no well-defined open source Turkish NLP libraries I can benefit from.

In [6]:
import gensim
from gensim.models import CoherenceModel, LdaModel

- [*gensim*](https://pypi.org/project/gensim/) - Gensim library is an open source library used for text classifying. It is implemented in Python and Cython to analyse large Natural Language text collections.

     - [CoherenceModel](https://radimrehurek.com/gensim/models/coherencemodel.html): Calculate topic coherence for topic models.
     - [LdaModel](https://radimrehurek.com/gensim/models/ldamodel.html): his module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents. The model can also be updated with new documents for online training.

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

- [*NLTK (Natural Language Toolkit)*](https://www.nltk.org/) - NLTK is a platform developed by Steven Bird and Edward Loper in the Department of Computer and Information Science. It is used to create Python packages which are used for applying statistical NLP in human language texts. The end goal is to make machines understand human language and then, make their response more precisely. Tasks such as tokenization, parsing, classification, stemming and tagging are done by using NLTK text processing libraries. 

     - [*RegexpTokenizer*](https://www.nltk.org/_modules/nltk/tokenize/regexp.html) - A RegexpTokenizer splits a string into substrings using a regular expression. 

In [8]:
import re
import pprint

- [*re (regular expression)*](https://docs.python.org/3/howto/regex.html) - A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

- [*pprint*](https://docs.python.org/3/library/pprint.html) - The pprint module provides a capability to “pretty-print” arbitrary Python data structures to make it more eye-pleasing and readable. The formatter produces representations of data structures that can be parsed correctly by the interpreter. The output is kept on a single line, if possible, and indented when split across multiple lines.

In [9]:
import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

- [*pyLDAvis*](https://pypi.org/project/pyLDAvis/) - It is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

- [*matplotlib*](https://matplotlib.org/) - Matplotlib is a Python 2D plotting library which is used for plotting beautiful and attractive graphs. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

In [10]:
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

- [*warnings*](https://docs.python.org/3/library/warnings.html) - Warning messages are typically issued in situations where it is useful to alert the user of some condition in a program, where that condition (normally) doesn’t warrant raising an exception and terminating the program. 
    - *DepracationWarning* - Base category for warnings about deprecated features when those warnings are intended for other Python developers (ignored by default, unless triggered by code in _main_).

Here, I am opening the *txt* in text mode, that means, I can read and write strings which are encoded in a specific encoding. 

Using *with* means, the file is properly closed without me manually closing it. 

In [33]:
with open('C:/Users/Lenovo/OneDrive/Masaüstü/clean_data.txt', encoding = 'utf-8', errors='ignore') as infile:
    raw_txt = infile.readlines()

In [51]:
print(raw_txt[0])

"31 Mart Mahalli İdareler Seçimleri'nde Muğla'da AK Parti'den Mehmet Nil Hıdır, CHP'den Osman Gürün, DP'den Mehmet Kocadon ve bağımsız olarak da Behçet Saatcı yarışacak.HDP ise Muğla ve ilçelerinde yarıştan çekildi. HDP Muğla İl Eş Başkanı Yılmaz Yüksel, konuyla ilgili yaptığı açıklamada, Muğla'da Büyükşehir Belediye Başkan adayımız Mehmet Polat dahil, Bodrum, Dalaman, Köyceğiz, Marmaris, Milas, Menteşe, Ortaca ile Ula ilçelerindeki adaylarımızı çektik. Datça, Fethiye, Seydikemer, Yatağan ve Kavaklıdere'de zaten adayımız yoktu. Adaylarımızı geri çekmemizin nedeni AK Parti- MHP'nin oluşturduğu Cumhur İttifakı'na tepki göstermek. CHP'yi destekleyeceğimiz kesinleşmiş bir şey değil. Bunun kararını Genel Merkezimiz verecek"" dedi."""



**Tokenization**

We process the text with the following operations:

1. First, text is converted to lowercase. Thus, words with uppercase are not considered different words from their lowercase equivalents.
2. Replace special characters with their corresponding Turkish characters.
3. Remove non-Turkish-letter characters such as digits or punctuation marks.
4. Split text into a list of words
5. Remove Turkish stopwords (optional)
6. Return as a list or as one string (depends the parameter)

- *stopwords* - It refers to the most common words in a language which are the unmeaningful words. Stopwords differs from languages to languages. There is no universality of it.

There is a fixed list of common stop words in various languages in Python. I used the turkish stopwords but apart from that, I also added other words I deemed to be unfit for the data.

In [14]:
my_stopwords = ['acaba', 'altmış', 'altı', 'ama', 'ancak', 'arada', 'aslında', 'ayrıca', 'bana', 'bazı', 'belki', 'ben', 'benden', 'beni', 'benim', 'beri', 'beş', 'bile', 'bin', 'bir', 'birçok', 'biri', 'birkaç', 'birkez', 'birşey', 'birşeyi', 'biz', 'bize', 'bizden', 'bizi', 'bizim', 'böyle', 'böylece', 'bu', 'buna', 'bunda', 'bundan', 'bunlar', 'bunları', 'bunların', 'bunu', 'bunun', 'burada', 'çok', 'çünkü', 'da', 'daha', 'dahi', 'de', 'defa', 'değil', 'diğer', 'diye', 'doksan', 'dokuz', 'dolayı', 'dolayısıyla', 'dört', 'edecek', 'eden', 'ederek', 'edilecek', 'ediliyor', 'edilmesi', 'ediyor', 'eğer', 'elli', 'en', 'etmesi', 'etti', 'ettiği', 'ettiğini', 'gibi', 'göre', 'halen', 'hangi', 'hatta', 'hem', 'henüz', 'hep', 'hepsi', 'her', 'herhangi', 'herkesin', 'hiç', 'hiçbir', 'için', 'iki', 'ile', 'ilgili', 'ise', 'işte', 'itibaren', 'itibariyle', 'kadar', 'karşın', 'katrilyon', 'kendi', 'kendilerine', 'kendini', 'kendisi', 'kendisine', 'kendisini', 'kez', 'ki', 'kim', 'kimden', 'kime', 'kimi', 'kimse', 'kırk', 'milyar', 'milyon', 'mu', 'mü', 'mı', 'nasıl', 'ne', 'neden', 'nedenle', 'nerde', 'nerede', 'nereye', 'niye', 'niçin', 'o', 'olan', 'olarak', 'oldu', 'olduğu', 'olduğunu', 'olduklarını', 'olmadı', 'olmadığı', 'olmak', 'olması', 'olmayan', 'olmaz', 'olsa', 'olsun', 'olup', 'olur', 'olursa', 'oluyor', 'on', 'ona', 'ondan', 'onlar', 'onlardan', 'onları', 'onların', 'onu', 'onun', 'otuz', 'oysa', 'öyle', 'pek', 'rağmen', 'sadece', 'sanki', 'sekiz', 'seksen', 'sen', 'senden', 'seni', 'senin', 'siz', 'sizden', 'sizi', 'sizin', 'şey', 'şeyden', 'şeyi', 'şeyler', 'şöyle', 'şu', 'şuna', 'şunda', 'şundan', 'şunları', 'şunu', 'tarafından', 'trilyon', 'tüm', 'üç', 'üzere', 'var', 'vardı', 've', 'veya', 'ya', 'yani', 'yapacak', 'yapılan', 'yapılması', 'yapıyor', 'yapmak', 'yaptı', 'yaptığı', 'yaptığını', 'yaptıkları', 'yedi', 'yerine', 'yetmiş', 'yine', 'yirmi', 'yoksa', 'yüz', 'zaten'] 
print(my_stopwords, end =' ')

['acaba', 'altmış', 'altı', 'ama', 'ancak', 'arada', 'aslında', 'ayrıca', 'bana', 'bazı', 'belki', 'ben', 'benden', 'beni', 'benim', 'beri', 'beş', 'bile', 'bin', 'bir', 'birçok', 'biri', 'birkaç', 'birkez', 'birşey', 'birşeyi', 'biz', 'bize', 'bizden', 'bizi', 'bizim', 'böyle', 'böylece', 'bu', 'buna', 'bunda', 'bundan', 'bunlar', 'bunları', 'bunların', 'bunu', 'bunun', 'burada', 'çok', 'çünkü', 'da', 'daha', 'dahi', 'de', 'defa', 'değil', 'diğer', 'diye', 'doksan', 'dokuz', 'dolayı', 'dolayısıyla', 'dört', 'edecek', 'eden', 'ederek', 'edilecek', 'ediliyor', 'edilmesi', 'ediyor', 'eğer', 'elli', 'en', 'etmesi', 'etti', 'ettiği', 'ettiğini', 'gibi', 'göre', 'halen', 'hangi', 'hatta', 'hem', 'henüz', 'hep', 'hepsi', 'her', 'herhangi', 'herkesin', 'hiç', 'hiçbir', 'için', 'iki', 'ile', 'ilgili', 'ise', 'işte', 'itibaren', 'itibariyle', 'kadar', 'karşın', 'katrilyon', 'kendi', 'kendilerine', 'kendini', 'kendisi', 'kendisine', 'kendisini', 'kez', 'ki', 'kim', 'kimden', 'kime', 'kimi', 'kim

In the first 5 lines, I combined different regular expression patterns into pattern objects, which I can use for pattern matching. I used those pattern objects in the *clean_text* function. 

Inside the fuction, respectively, these lines:

- Lowercase the text and saves it again.
- Replace the symbols in replace_with_space symbols with the space in text.
- Delete symbols which are in remove_symbols from text.
- Delete the words less than 4 characters.
- Delete stopwords from text.

In [15]:
replace_with_space = re.compile('[/(){}\[\]\|@,;]')
remove_symbols1 = re.compile("[^0-9a-z_ğüşıöç .']")
stopwords = nltk.corpus.stopwords.words('turkish')
stopwords.extend(my_stopwords)
remove_3chars = re.compile(r'\b\w{1,3}\b')

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() 
    text = replace_with_space.sub(' ', text) 
    text = remove_symbols1.sub('', text) 
    text = remove_3chars.sub('', text)
    text = ' '.join([word for word in text.split() if word not in stopwords])
    return text

I used the function above to clean raw_txt:

In [16]:
cleaned_data = [clean_text(içerik) for içerik in raw_txt]
print(cleaned_data[0:5], end = ' ')

["mart mahalli idareler seçimleri' muğla' parti' mehmet hıdır ' osman gürün ' mehmet kocadon bağımsız behçet saatcı yarışacak. muğla ilçelerinde yarıştan çekildi. muğla başkanı yılmaz yüksel konuyla açıklamada muğla' büyükşehir belediye başkan adayımız mehmet polat dahil bodrum dalaman köyceğiz marmaris milas menteşe ortaca ilçelerindeki adaylarımızı çektik. datça fethiye seydikemer yatağan kavaklıdere' adayımız yoktu. adaylarımızı geri çekmemizin nedeni parti ' oluşturduğu cumhur ittifakı' tepki göstermek. ' destekleyeceğimiz kesinleşmiş değil. kararını genel merkezimiz verecek dedi.", "beyoğlu' kişinin öldüğü yangınla gözaltına alınan özkan ' yangını arkadaşlarını korkutmak çıkardığını belirterek tartıştığım ramazan demir aramızda husumet vardı. olay sabahı geldiğimde eşyalarımı dışarı attığını gördüm. sinirlendim. yattığı kanepeye gazete kağıtlarını yerleştirip yaktım. . pişmanım dediği öğrenildi.beyoğlu asmalı mescit' kişinin öldüğü kişinin yaralandığı yangınla hatay' yakalanan şüp

After the pre-process, data is ready for tokenization.
To separate a sentence into words without puctuation, I used RegexpTokenizer(r'\w+') as a tokenizer, since I still have the 'apostrophe' and the 'dot' in the *cleaned_data*.

I created an empty list *data_tokens*. After using the tokenizer, I appended the tokens I got to that list.

In [17]:
data_tokens = []
tokenizer = RegexpTokenizer(r'\w+')
for i in range (len(cleaned_data)-1):
    tokens = tokenizer.tokenize(cleaned_data[i])
    data_tokens.append(tokens)
print(data_tokens[0:5], end = ' ')

[['mart', 'mahalli', 'idareler', 'seçimleri', 'muğla', 'parti', 'mehmet', 'hıdır', 'osman', 'gürün', 'mehmet', 'kocadon', 'bağımsız', 'behçet', 'saatcı', 'yarışacak', 'muğla', 'ilçelerinde', 'yarıştan', 'çekildi', 'muğla', 'başkanı', 'yılmaz', 'yüksel', 'konuyla', 'açıklamada', 'muğla', 'büyükşehir', 'belediye', 'başkan', 'adayımız', 'mehmet', 'polat', 'dahil', 'bodrum', 'dalaman', 'köyceğiz', 'marmaris', 'milas', 'menteşe', 'ortaca', 'ilçelerindeki', 'adaylarımızı', 'çektik', 'datça', 'fethiye', 'seydikemer', 'yatağan', 'kavaklıdere', 'adayımız', 'yoktu', 'adaylarımızı', 'geri', 'çekmemizin', 'nedeni', 'parti', 'oluşturduğu', 'cumhur', 'ittifakı', 'tepki', 'göstermek', 'destekleyeceğimiz', 'kesinleşmiş', 'değil', 'kararını', 'genel', 'merkezimiz', 'verecek', 'dedi'], ['beyoğlu', 'kişinin', 'öldüğü', 'yangınla', 'gözaltına', 'alınan', 'özkan', 'yangını', 'arkadaşlarını', 'korkutmak', 'çıkardığını', 'belirterek', 'tartıştığım', 'ramazan', 'demir', 'aramızda', 'husumet', 'vardı', 'olay',

## Experimental Validation

### 4.1 Dictionary and Corpus

A dictionary and a corpus is needed for topic modelling. 

Thus, I created a dictionary from ‘data_tokens’ which contains the number of times a word appears in the training set.

*id2word* is the dictionary:

In [18]:
id2word  = gensim.corpora.Dictionary(data_tokens) 

In [19]:
count = 0
for k, v in id2word.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 adaylarımızı
1 adayımız
2 açıklamada
3 bağımsız
4 başkan
5 başkanı
6 behçet
7 belediye
8 bodrum
9 büyükşehir
10 cumhur


I don't want words that have high frequency, because I can't get useful information about the topic and as well as words that are too infrequent. Lastly, I will keep just 100,000 words. So I filter out tokens:

- less than 15 documents (absolute number) 
- more than 0.5 documents (this is for the total corpus size, not in the absolute number).
- keep only the first 100000 most frequent tokens.

In [20]:
id2word .filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

I created a corpus from data_tokens:

In [21]:
texts = data_tokens

After that I created a Term Document Frequency:

For each document we create a dictionary reporting how many words and how many times those words appear using *Gensim doc2bow*.

*doc2bow* creates a unique id for each word in the document. The produced corpus below above is a mapping of (word_id, word_frequency).

We can check it with a sample. In this case, it is the 658th line:

In [22]:
corpus = [id2word .doc2bow(doc) for doc in data_tokens]
print(corpus[658], end = ' ')

[(10, 1), (87, 1), (152, 2), (162, 1), (163, 1), (185, 2), (219, 1), (235, 3), (253, 1), (285, 1), (329, 1), (366, 2), (408, 2), (568, 1), (683, 1), (738, 1), (752, 1), (801, 1), (803, 1), (1004, 1), (1053, 1), (1097, 1), (1101, 1), (1117, 1), (1141, 1), (1142, 1), (1159, 1), (1176, 1), (1184, 1), (1243, 2), (1289, 1), (1358, 1), (1359, 1), (1361, 1), (1363, 1), (1365, 2), (1376, 1), (1518, 1), (1558, 1), (1578, 1), (1720, 1), (1781, 2), (1946, 1), (2016, 1)] 

For example, (10, 1) above implies, word id 10 occurs once in the first document. Likewise, word id 83 occurs once and so on.

This is used as the input by the LDA model.

This the human readable format of corpus (term-frequency) for the 658th line:

In [23]:
corpus_doc_658 = corpus[658]
for i in range(len(corpus_doc_658)- 37):
    print("Word {} (\"{}\") appears {} time.".format(corpus_doc_658[i][0], 
                                                     id2word [corpus_doc_658[i][0]], corpus_doc_658[i][1]))

Word 10 ("genel") appears 1 time.
Word 87 ("süre") appears 1 time.
Word 152 ("saatte") appears 2 time.
Word 162 ("yaşananlar") appears 1 time.
Word 163 ("yaşananlarson") appears 1 time.
Word 185 ("devam") appears 2 time.
Word 219 ("mücadele") appears 1 time.


### 4.2 LDA Topic Model

Now, I have every conditon fulfilled to train the LDA model. But dictionary and corpus is not the only necessity for LDA model. I also need to provide the number of topics as well.

- *alpha* is the hyperparameter that affect sparsity of the topics. According to the Gensim docs, both defaults to 1.0/num_topics prior.
- *chunksize* is the number of documents to be used in each training chunk. 
- *update_every* determines how often the model parameters should be updated
- *passes* is the total number of training passes.

In [24]:
lda_model = LdaModel(corpus=corpus,
                    id2word=id2word ,
                    num_topics=20, 
                    random_state=100,
                    update_every=1,
                    chunksize=100,
                    passes=10,
                    alpha='auto',
                    per_word_topics=True)

**Topics in LDA model**

The above LDA model is built with 20 different topics where each topic is a combination of keywords and each keyword contributes a certain weightage to the topic.

You can see the keywords for each topic and the weightage(importance) of each keyword using lda_model.print_topics() as shown next.

In [25]:
# Print the Keyword in the 10 topics
pprint.pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.042*"ceza" + 0.040*"sanık" + 0.033*"hapis" + 0.032*"suçundan" + '
  '0.032*"karar" + 0.030*"ağır" + 0.029*"mahkeme" + 0.026*"cezası" + '
  '0.019*"çelik" + 0.019*"tutuklu"'),
 (1,
  '0.089*"aykırı" + 0.073*"kadın" + 0.071*"görevlisi" + 0.054*"genç" + '
  '0.038*"ordu" + 0.033*"nedeni" + 0.028*"gaziantep" + 0.028*"organize" + '
  '0.027*"barış" + 0.022*"dönemin"'),
 (2,
  '0.086*"ilçe" + 0.074*"jandarma" + 0.069*"yavuz" + 0.033*"tespit" + '
  '0.028*"arama" + 0.028*"belirtildi" + 0.026*"sonuç" + 0.024*"kamu" + '
  '0.021*"ortadan" + 0.018*"sonucu"'),
 (3,
  '0.052*"yılmaz" + 0.048*"milletvekili" + 0.033*"bakanı" + 0.032*"idareler" + '
  '0.032*"mahalli" + 0.030*"soylu" + 0.027*"türkiye" + 0.024*"bakan" + '
  '0.023*"mehmet" + 0.022*"içişleri"'),
 (4,
  '0.062*"ekipleri" + 0.037*"polis" + 0.034*"müdürlüğü" + 0.031*"ekipler" + '
  '0.027*"itfaiye" + 0.026*"şube" + 0.025*"asayiş" + 0.020*"sürüyor" + '
  '0.020*"adliyeye" + 0.019*"sevk"'),
 (5,
  '0.051*"terör" + 0.039*"fetö" + 0.

This is how to interpret this:

Topic 0 is a represented as 0.042*"ceza" + 0.040*"sanık" + 0.033*"hapis" + 0.032*"suçundan" + 0.032*"karar" + 0.030*"ağır" + 0.028*"mahkeme" + 0.026*"cezası" + 0.020*"çelik" + 0.019*"tutuklu".

For topic 0, these 10 keywords are the ones that has got the heighest weight, respectively: ‘ceza’, ‘sanık’, ‘hapis’.. For example, the weight of ‘ceza’ on topic 0 is 0.042.

The more a keyword has weight, the more important that keyword is for the topic. 

Looking at the keywords, for example, I can assign topic 0 as *law*.

### 4.3 Model Perplexity and Coherence Score

Model perplexity and topic coherence provide a convenient measure to judge how good a given topic model is. 

Perplexity gives us of how good the model is. The lower it is, the better it is.

Coherence measure to be used. Fastest method is ‘u_mass’ but the ‘c_v’ usually gives the best result.

- for ‘u_mass’ corpus should be provided, if texts is provided, it will be converted to corpus using the dictionary. 
- for ‘c_v’, ‘c_uci’ and ‘c_npmi’ texts should be provided

In [26]:
print('\nPerplexity: ', lda_model.log_perplexity(corpus))


Perplexity:  -9.00519947891263


In [22]:
cm = CoherenceModel(model=lda_model, texts=data_tokens, dictionary=id2word, coherence='c_v')
coherence = cm.get_coherence()
print('\nCoherence Score: ', coherence)


Coherence Score:  0.472797646980033


As you can see coherence score is 0.47.

### 4.4 Visualization of the Topics and Keywords

LDA model is ready. Taking to the next step, I need to examine the produced topics and the associated keywords. 

For this, I am going to use pyLDAvis package’s interactive chart.

In [27]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


Below, you can see the image of the pyLDAvis' graph. 

In [69]:
%%html
<img src = "pyLDavis.png">

In the interactive graph:

Each circle on Intertopic Distance Map shows a topic. These circles are proportional to topic frequencies, so if a circle is more large, that topic is more prevalent.

If you have a big, non-overlapping circles scattered around the chart, that means you have a fairly good topic model.

If you end up with too many topics, you will have many overlaps with small sized bubbles being clustered in one region of the map.

If you select a circle or a *topic*, you will see it highlighted and when you look at the right-side, it will automatically give you the most relevant terms for that topic. Red bars indicate term-frequency within the selected topic. Gray bars indicate overall term-frequencies across the corpus. 

### **5. Conclusion**

With the evaluation of data mining tools, there will be surely different models for the fast process of huge collections of datasets. The possibilities of natural language process is getting much more bigger and there is so much to learn in this field. It is still improving and the world is starting to get more dependent on the technology. LDA model is one of the examples of it.

Latent Dirichlet allocation is trained on non-labeled documents. It is usually assessed in two ways. The first way to evaluate is by calculating performance on a secondary task, such as document classification or information retrieval, or by estimating the probability of unnoticed held-out documents given some training documents. The second way to *evaluate* is to follow your instinct. You can usually tell a story about the generated topics when you have a decent model.  

I have proposed a topic model-based approach to journalistic text analysis. On every update, I calculated the evolution of topics to detect newly emerged topics in the document collection. To do that, I applied the methodology to a collection of online articles and demonstrated the model’s strength. 
The results are good. The topics are diverse and scattered throughout the map. But I couldn't eliminate all the stopwords, so some of the salient terms does not give meaningful insights. 

Most of the data I collected are from political news and the keywords I got mostly belongs to the category of *politics*. So there is a relation between topics and documents. If you look at the results you can also see the corrolation using your instinct. Overall  

# Sources

- Natural Language Annotation for Machine Learning
- http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
- https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#9createbigramandtrigrammodels
- http://delivery.acm.org/10.1145/2140000/2133826/p77-blei.pdf?ip=94.54.17.207&id=2133826&acc=OPEN&key=4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E6D218144511F3437&__acm__=1558621368_230868b64b41596adbd5eba7f4c6ecbd