# Workshop Topic Modeling
## by Cornelius Puschmann, Tatjana Scheffler, Damian Trilling


# Preparation
We assume that you run Python 3 and have NLTK (Bird, Loper, & Klein, 2009) installed. If you use Anaconda, NLTK is already included. Otherwise, run
```
pip install nltk
```
or 
```
sudo pip install nltk
```
(or possibly `pip3` instead of `pip`) in your terminal to install it.


We also assume you have `gensim` and `pyLDAvis` installed. If not, install them with pip as well.

Bird, S., Loper, E., & Klein, E. (2009). *Natural language processing with Python*. Sebastopol, CA: O'Reilly.

Furthermore, some NLTK modules need additional data. Download it by executing the following cell (you only have to do this once):

In [None]:
import nltk
nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('maxent_treebank_pos_tagger')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Warming up

Think back to what you already know about Python. Use the cell below to do the following task:
- Create a list that contains strings with numbers inside, something like ["12", "42", "11"]
- Write a loop that converts the strings to integers, prints them, and adds them to a new list
- Modify your loop in such a way that it multiplies the numbers by two before adding them to the new list.
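
If you get stuck, here is one possible solution as a minimal sketch. The variable names are only illustrations, and there are many other valid ways to write it, so try it yourself first.

```python
# A list of strings that contain numbers (example values from the task above)
number_strings = ["12", "42", "11"]

doubled = []
for s in number_strings:
    n = int(s)              # convert the string to an integer
    print(n)                # print the converted number
    doubled.append(n * 2)   # multiply by two before adding it to the new list

print(doubled)
```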

# Let's get started!

## Import modules
Before we start, let's import some modules that we need today. It is good practice to do so at the beginning of a script, so we'll do it right now and not later when we need them. The benefit is that you immediately see if something goes wrong (for instance, because the module is not installed).

In [15]:
import csv
import re
from nltk.sentiment import vader
from nltk.corpus import stopwords
import nltk
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim



## Download the data
We will use a dataset by Schumacher et al. (2016). From the abstract:
> This paper presents EUSpeech, a new dataset of 18,403 speeches from EU leaders (i.e., heads of government in 10 member states, EU commissioners, party leaders in the European Parliament, and ECB and IMF leaders) from 2007 to 2015. These speeches vary in sentiment, topics and ideology, allowing for fine-grained, over-time comparison of representation in the EU. The member states we included are Czech Republic, France, Germany, Greece, Netherlands, Italy, Spain, United Kingdom, Poland and Portugal.

Schumacher, G., Schoonvelde, M., Dahiya, T., Traber, D., & de Vries, E. (2016). *EUSpeech: A new dataset of EU elite speeches*. [doi:10.7910/DVN/XPCVEI](http://dx.doi.org/10.7910/DVN/XPCVEI)

Download and unpack the following file:
```
speeches_csv.tar.gz
```

Inside the .tar.gz file, you will find a .zip file. Extract the whole folder it contains to your home directory.
The screenshot below shows what this looks like in Lubuntu (double-click on "speeches_csv.zip" in the left window, then the right window will open; click on "Extract").
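
If you prefer to unpack the archive from Python rather than with a file manager, the two steps (first the .tar.gz, then the .zip inside it) could be sketched as follows. The function name `unpack_nested_archive` is hypothetical, and the example call assumes that the download sits in your home directory.

```python
import tarfile
import zipfile
from pathlib import Path


def unpack_nested_archive(tar_path, target_dir):
    """Extract a .tar.gz file and then every .zip file that was inside it."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: unpack the outer .tar.gz archive and remember what it contained
    with tarfile.open(tar_path, "r:gz") as tar:
        member_names = tar.getnames()
        tar.extractall(target_dir)

    # Step 2: unpack only those .zip files that came out of the tarball
    for name in member_names:
        if name.lower().endswith(".zip"):
            with zipfile.ZipFile(target_dir / name) as zf:
                zf.extractall(target_dir)

    return target_dir


# Example call (assumption: the file was saved to your home directory):
# unpack_nested_archive(Path.home() / "speeches_csv.tar.gz", Path.home())
```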

In [None]:
from IPython.display import Image
Image("https://github.com/damian0604/bdaca/raw/master/ipynb/euspeech_download.png")

Let's have a look at the files we downloaded.

In [5]:
%ls ~/Cleaned_Speeches/

Speeches_ALDE_Cleaned.csv       Speeches_GR_Cleaned.csv
Speeches_CZ_Cleaned.csv         Speeches_IMF_Cleaned.csv
Speeches_DE_Cleaned.csv         Speeches_IT_Cleaned.csv
Speeches_ECB_Cleaned.csv        Speeches_NL_Cleaned.csv
Speeches_EC_Cleaned.csv         Speeches_PL_Cleaned.csv
Speeches_ECR_Cleaned.csv        Speeches_PO_Cleaned.csv
Speeches_EP_Cleaned.csv         Speeches_SP_Cleaned.csv
Speeches_EUCouncil_Cleaned.csv  Speeches_UK_Cleaned.csv
Speeches_FR_Cleaned.csv         Translated/


## Let's start!
Let's retrieve a list of all speeches from one of the files. Of course, we could also loop over all the files...

In [39]:
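import os

# expand "~" to your home directory, where the folder was extracted
with open(os.path.expanduser("~/Cleaned_Speeches/Speeches_NL_Cleaned.csv")) as fi:
    reader = csv.reader(fi)
    speeches = []
    for row in reader:
        if row[7] == 'en':   # only include English-language speeches; we might as well choose 'nl' or 'fr'
            speeches.append(row[5])   # column 5 contains the speech text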
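
Looping over all the files could look roughly like the sketch below. It reuses the column positions from the cell above (5 for the text, 7 for the language code). The function name `read_english_speeches`, the file name pattern, and the UTF-8 encoding are assumptions made for illustration.

```python
import csv
import glob
import os


def read_english_speeches(folder):
    """Collect the English-language speeches from every CSV file in `folder`."""
    collected = []
    pattern = os.path.join(os.path.expanduser(folder), "Speeches_*_Cleaned.csv")
    for filename in sorted(glob.glob(pattern)):
        with open(filename, newline="", encoding="utf-8") as fi:
            for row in csv.reader(fi):
                # skip rows that are too short to hold a language code
                if len(row) > 7 and row[7] == "en":
                    collected.append(row[5])
    return collected


# Example call (assumes the folder was extracted to your home directory):
# all_speeches = read_english_speeches("~/Cleaned_Speeches")
```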

In [40]:
len(speeches)

132

We'll clean up a bit.

In [41]:
speeches=[speech.replace('<p>',' ').replace('</p>',' ') for speech in speeches]   # remove the <p> and </p> tags
speeches=[" ".join(speech.split()) for speech in speeches]   # remove repeated whitespace by splitting the strings into words and joining these words again
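
The cell above only removes `<p>` and `</p>`. If the texts contained other tags as well, a regular expression could replace all of them at once. The following is a minimal sketch; the helper name `strip_tags` is hypothetical, and the pattern is deliberately simple (it would also cut ordinary text that happens to sit between a literal "<" and ">").

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")   # anything between "<" and ">"


def strip_tags(text):
    """Replace every HTML tag with a space and collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())


print(strip_tags("<p>Ladies and   gentlemen,</p><p>It is an honour</p>"))
```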

Let's look at the first speech to check everything's fine.

In [26]:
speeches[0]

"Ladies and gentlemen, It is an honour to be here today to introduce the theme of 'recession and recovery'. If you will permit, I would like to suggest that this afternoon we focus more on recovery than on recession. I think we know enough about the recession side of the story. It started with the fall of Lehman Brothers on 15 September 2008. I happened to be here, at the Blouin Creative Leadership Summit, only ten days later. Everyone was talking about the collapse of Lehman. They were shocked and alarmed. But even then we could hardly imagine that its impact would be so dramatic, so historic. As we now know, this event triggered a global financial and economic crisis. Governments were forced to give cash injections running into billions to prevent an economic and financial meltdown. When credit dried up and demand fell, businesses struggled to keep their heads above water, and many went under. Ordinary people's jobs, homes and pensions were at risk. The aftermath was high unemploymen

## Example: Simple LDA where each speech is one document, without major preprocessing

In [49]:
ldainput = []
for speech in speeches:
    # lowercase and split on whitespace only; punctuation stays attached to the words
    ldainput.append(speech.lower().split())

In [50]:
id2word = corpora.Dictionary(ldainput)
id2word.filter_extremes()   # with the defaults, drop tokens that occur in fewer than 5 documents or in more than 50% of all documents
mm = [id2word.doc2bow(text) for text in ldainput]
tfidf = models.TfidfModel(mm)
lda = models.ldamodel.LdaModel(corpus=tfidf[mm],id2word=id2word,num_topics=20)
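
To get a feeling for what the dictionary and the bag-of-words step do, here is a small standard-library sketch of the same idea. It is not gensim's implementation: the function names and the toy documents are made up for illustration, and the ids are simply assigned in alphabetical order.

```python
from collections import Counter


def build_vocabulary(documents, no_below=1, no_above=1.0):
    """Map each token to an integer id, keeping only tokens that occur in at
    least `no_below` documents and in at most a fraction `no_above` of them."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))   # count each token once per document
    n_docs = len(documents)
    kept = sorted(token for token, df in doc_freq.items()
                  if df >= no_below and df <= no_above * n_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bag_of_words(tokens, vocabulary):
    """Turn a list of tokens into sorted (token_id, count) pairs."""
    counts = Counter(t for t in tokens if t in vocabulary)
    return sorted((vocabulary[t], c) for t, c in counts.items())


docs = [["the", "crisis", "the", "crisis", "recovery"],
        ["the", "water", "delta"],
        ["the", "crisis", "water"]]
vocab = build_vocabulary(docs, no_below=2, no_above=0.9)
print(vocab)
print(to_bag_of_words(docs[0], vocab))
```

In this toy example, "the" is dropped because it occurs in every document, and "recovery" and "delta" are dropped because they occur in only one.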



In [51]:
for top in lda.print_topics(num_topics=20, num_words=5):   # the model has 20 topics
    print(top)

(0, '0.005*"indonesia" + 0.004*"indonesian" + 0.004*"india" + 0.004*"nuclear" + 0.002*"they’re"')
(1, '0.004*"." + 0.003*"her" + 0.003*"delta" + 0.003*"sustainable" + 0.002*"human"')
(2, '0.004*"vietnam" + 0.003*"internet" + 0.003*"nuclear" + 0.003*"delta" + 0.002*"production"')
(3, '0.004*"sector" + 0.003*"un" + 0.003*"finance" + 0.003*"terrorist" + 0.002*"private"')
(4, '0.002*"disaster." + 0.002*"site." + 0.002*"hit" + 0.002*"mh17" + 0.002*"conditions"')
(5, '0.004*"." + 0.003*"water" + 0.003*"malaysia" + 0.002*"european" + 0.002*"forum"')
(6, '0.003*"july" + 0.003*"–" + 0.003*"day" + 0.003*"full" + 0.002*"children"')
(7, '0.005*"." + 0.005*"indonesia" + 0.003*"indonesian" + 0.002*"sustainable" + 0.002*"georgia"')
(8, '0.004*"russia" + 0.004*"terrorist" + 0.004*"cyber" + 0.004*"head" + 0.003*"foreign"')
(9, '0.005*"indonesia" + 0.003*"indonesian" + 0.003*"china\'s" + 0.002*"indonesia." + 0.002*"challenges"')
(10, '0.003*"nuclear" + 0.003*"china" + 0.002*"site." + 0.002*"." + 0.002*"

In [52]:
vis_data = pyLDAvis.gensim.prepare(lda,mm,id2word)
pyLDAvis.display(vis_data)

# SCRATCH - MAYBE USE LATER: NLP
As a prerequisite for many techniques we want to use tomorrow, we want to clean up the text. Typical steps involve:
- converting to lowercase
- removing punctuation
- removing stopwords
- stemming
- parsing (= determining the grammatical function of words).

Of course, depending on the task at hand, we don't want to do all of them, and the order matters as well. If we want to parse a sentence, we had better still have a sentence (and not have removed stopwords and punctuation already).
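
The first three steps, applied in a sensible order, can be sketched with the standard library alone. The tiny stopword set and the function name `preprocess` are assumptions for illustration; NLTK's English stopword list is much longer. Note also that `string.punctuation` only covers ASCII punctuation, so typographic quotes would survive.

```python
import string

# Tiny illustrative stopword list (an assumption, not NLTK's list)
EXAMPLE_STOPWORDS = {"the", "as", "we", "was", "a", "of", "and"}


def preprocess(text, stopwords=EXAMPLE_STOPWORDS):
    """Lowercase, strip ASCII punctuation, then drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in stopwords]


print(preprocess("The crisis, as we know, was historic."))
```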

Below, you find some examples:

## Stopword removal

In [None]:
english_stopwords = set(stopwords.words('english'))   # load the list once instead of once per word
cleanedspeeches=[]
for speech in speeches:
    # lowercase and remove the most common punctuation marks
    speech=speech.lower().replace(".","").replace(",","").replace('"','').replace("'","").replace("?","")
    words=speech.split()
    words = [w for w in words if w not in english_stopwords]
    speechnew = " ".join(words)
    cleanedspeeches.append(speechnew)

In [None]:
cleanedspeeches

## Parsing and retaining only nouns and adjectives
Look at the NLTK documentation to find out what each tag means (e.g., 'NN' is a singular noun, 'NNP' a singular proper noun, and 'JJ' an adjective).

In [None]:
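speechesnounsadj=[]
for speech in speeches:
    tokens = nltk.word_tokenize(speech)   # split the speech into words and punctuation marks
    tagged = nltk.pos_tag(tokens)         # list of (word, tag) tuples
    cleanspeech = ""
    for element in tagged:
        # keep only singular nouns, singular proper nouns, and adjectives
        if element[1] in ('NN','NNP','JJ'):
            cleanspeech=cleanspeech+element[0]+" "
    speechesnounsadj.append(cleanspeech)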

In [None]:
speechesnounsadj