# ACA-Workshop, Day 1
## by Damian Trilling

This is a notebook with exercises for a two-day workshop on using Python in the social sciences. It assumes that you have a very basic understanding of Python (e.g., you know what a for-loop is). It introduces you to some basic techniques:
- sentiment analysis
- regular expressions (for, e.g., counting the occurrence of specific words)
- basic natural language processing

# Preparation
We assume that you have NLTK (Bird, Loper, & Klein, 2009) installed. If you use Anaconda, it is already installed. Otherwise, use
```
pip install nltk
```
or 
```
sudo pip install nltk
```
in your terminal to install it.
Furthermore, we have to download some data for some specific NLTK modules. Download them by executing the following cell (you only have to do this once):

Bird, S., Loper, E., & Klein, E. (2009). *Natural language processing with Python*. Sebastopol, CA: O'Reilly.

In [None]:
import nltk
nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('maxent_treebank_pos_tagger')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Warming up

Think back to what you already know about Python. Use the cell below to do the following tasks:
- Create a list that contains strings with numbers inside, something like ["12","42","11"]
- Write a loop that converts the strings to integers, prints them, and adds them to a new list
- Modify your loop in such a way that it multiplies the numbers by two before adding them to the new list.

# Let's get started!

## Import modules
Before we start, let's import some modules that we need today. It is good practice to do so at the beginning of a script, so we'll do it right now and not later when we need them. The benefit is that you immediately see if something goes wrong (for instance, because the module is not installed).

In [None]:
import csv
import re
from nltk.sentiment import vader
from nltk.corpus import stopwords
import nltk

## Download the data
We will use a dataset by Schumacher et al. (2016). From the abstract:
> This paper presents EUSpeech, a new dataset of 18,403 speeches from EU leaders (i.e., heads of government in 10 member states, EU commissioners, party leaders in the European Parliament, and ECB and IMF leaders) from 2007 to 2015. These speeches vary in sentiment, topics and ideology, allowing for fine-grained, over-time comparison of representation in the EU. The member states we included are Czech Republic, France, Germany, Greece, Netherlands, Italy, Spain, United Kingdom, Poland and Portugal.

Schumacher, G., Schoonvelde, M., Dahiya, T., Traber, D., & de Vries, E. (2016). *EUSpeech: a New Dataset of EU Elite Speeches*. [doi:10.7910/DVN/XPCVEI](http://dx.doi.org/10.7910/DVN/XPCVEI)

Download and unpack the following file:
```
speeches_csv.tar.gz
```

In the .tar.gz file, you will find a .zip file. Extract the whole folder to your home directory.
Below is a screenshot of what this looks like in Lubuntu (double-click on "speeches_csv.zip" in the left window, and the right window will open; then click on "Extract").

In [None]:
from IPython.display import Image
Image("https://github.com/damian0604/bdaca/raw/master/ipynb/euspeech_download.png")

Let's have a look at the files we downloaded. The following cell does this (assuming that you work on Linux or macOS *and* that you saved the files in the directory where you started your notebook server and where this notebook is located).

In [None]:
%ls Cleaned_Speeches/

## Get some idea about the data
Let us inspect the data, starting with only the first row:

In [None]:
with open("Cleaned_Speeches/Speeches_NL_Cleaned.csv") as fi:
    reader=csv.reader(fi)
    firstrow=next(reader)
    print("It looks like we have",len(firstrow),"columns.")
    print("\nThis is the content:\n")
    print(firstrow)
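
If you do not have the data files at hand yet, you can try the same reading pattern on a small in-memory table. The sketch below is only an illustration: the column names and the two rows are made up and do not reflect the real EUSpeech layout.

```python
import csv
import io

# Made-up stand-in for a CSV file; NOT the real EUSpeech columns.
fake_file = io.StringIO(
    'id,speaker,text\n'
    '1,Someone,"Hello, world"\n'
    '2,Someone else,Second row\n'
)

reader = csv.reader(fake_file)
firstrow = next(reader)      # retrieves only the first row
remaining = list(reader)     # the reader continues where next() stopped

print("It looks like we have", len(firstrow), "columns.")
print(firstrow)
print(len(remaining), "rows are left.")
```

Note that `next(reader)` consumes the row it returns: a loop over the same reader afterwards starts at the second row.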

As you can see, we can directly address a specific element from this row (we start counting at zero!). Which one might be most interesting for us? Just **play around** a bit! Note down (on a piece of paper or in a file) what the structure of the dataset looks like!

In [None]:
firstrow[0]

## Let's start!
Now that we know what the data look like, we can *loop* over all rows in the file in order to retrieve a list of all speeches:

In [None]:
with open("Cleaned_Speeches/Speeches_NL_Cleaned.csv") as fi:
    reader=csv.reader(fi)
    speeches=[]
    for row in reader:
        speeches.append(row[5])

In [None]:
len(speeches)

We'll clean up a bit. You don't know the technique used here yet (it's called a 'list comprehension'), and I can explain it to you later. It is basically a short way of writing a for-loop.

In [None]:
speeches=[speech.replace('<p>',' ').replace('</p>',' ') for speech in speeches]   #remove HTML tags
speeches=[" ".join(speech.split()) for speech in speeches]   # remove double spaces by splitting the strings into words and joining these words again
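
To see that a list comprehension really is just a compact for-loop, here is a minimal sketch that performs the same two clean-up steps both ways. The two example strings are made up and only stand in for the real speeches.

```python
# Made-up example strings, standing in for the real speeches.
raw = ["<p>Hello   world</p>", "<p>Second</p>  <p>speech</p>"]

# Version 1: an ordinary for-loop
cleaned_loop = []
for speech in raw:
    speech = speech.replace('<p>', ' ').replace('</p>', ' ')   # remove p-tags
    cleaned_loop.append(" ".join(speech.split()))              # collapse whitespace

# Version 2: the same steps written as a list comprehension
cleaned_comp = [" ".join(s.replace('<p>', ' ').replace('</p>', ' ').split()) for s in raw]

print(cleaned_loop)
print(cleaned_comp)
```

Both versions produce the same list; which one you prefer is mostly a matter of readability.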

Let's look at the first speech to check everything's fine.

In [None]:
speeches[0]

# Sentiment analysis
We will do our first analysis, using the algorithm by Hutto and Gilbert (2014). It is already implemented in NLTK, so we can run the analysis with just two lines of code! 
The only thing we have to care about is providing the input data and storing the output.

Hutto, C. J., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. *Eighth International AAAI Conference on Weblogs and Social Media.*

In [None]:
senti=vader.SentimentIntensityAnalyzer()

In [None]:
senti.polarity_scores(speeches[0])

In [None]:
senti.polarity_scores(speeches[1])

So, how could we apply this to the whole dataset? With a loop! I'll give you a basic example with a lot of possibilities for improvement:

In [None]:
with open("Cleaned_Speeches/Speeches_NL_Cleaned.csv") as fi,  open('myoutput.csv',mode='w') as fo:
    reader=csv.reader(fi)
    writer=csv.writer(fo)
    for row in reader:
        speech=row[5]
        sentiment = senti.polarity_scores(speech)
        writer.writerow([speech[:100],sentiment['pos']])

In [None]:
!head myoutput.csv

## It's your turn!
Your task: write better code that
- outputs more info
- preprocesses the string (remove p-tags, for example)

If you feel a bit more adventurous: 
- Add an if-statement to filter out the French speeches! Modify your script by including a structure like
```
if APPROPRIATECOLUMN=='en':
    DO SOMETHING
```

# Regular Expressions
There are a lot of online tutorials explaining regular expressions (and you can read up in my book or on the slides), so I won't go into detail here on how to construct one. But let's look at a prototypical use case: counting how often something is mentioned in texts. Let's start by examining one single speech:

In [None]:
speeches[0]

Then we can get a list with all substrings that match the regexp. And, as with any list, we can calculate its length!

In [None]:
re.findall(r"[Ee]conomy|[Ee]conomic",speeches[0])

In [None]:
len(re.findall(r"[Ee]conomy|[Ee]conomic",speeches[0]))
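
One thing to keep in mind: the pattern above also matches *inside* longer words such as "economics". Whether that is desired depends on your research question. The following sketch, using a made-up sentence, contrasts the pattern from above with a stricter variant that uses word boundaries (`\b`) and the `re.IGNORECASE` flag; treat it as one possible refinement, not as the only correct way.

```python
import re

# Made-up sentence for illustration
text = "The economy grows. Economic policy and economics matter; the Economy, too."

# The pattern used above: also matches the 'economic' inside 'economics'
matches = re.findall(r"[Ee]conomy|[Ee]conomic", text)
print(matches, len(matches))

# Stricter variant: \b requires a word boundary on both sides,
# and re.IGNORECASE replaces the [Ee] character classes.
strict = re.findall(r"\b(?:economy|economic)\b", text, flags=re.IGNORECASE)
print(strict, len(strict))
```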

## It's your turn!
Let's write a loop to count the number of references to the economy per speech and write it to a CSV file!

# NLP
As a prerequisite for many techniques we want to use tomorrow, we want to clean up the text. Typical steps involve:
- converting to lowercase
- removing punctuation
- removing stopwords
- stemming
- parsing (= determining the grammatical function of words).

Of course, depending on the task at hand, we don't want to do all of them, and the order matters as well. If we want to parse a sentence, well, we had better still have a sentence (and not already have removed stopwords and punctuation).

Below, you find some examples:

## Stopword removal

In [None]:
mystopwords = set(stopwords.words('english'))   # load the stopword list once instead of once per word
cleanedspeeches=[]
for speech in speeches:
    # convert to lowercase and remove some punctuation marks
    speech=speech.lower().replace(".","").replace(",","").replace('"','').replace("'","").replace("?","")
    words=speech.split()
    words = [w for w in words if w not in mystopwords]
    speechnew = " ".join(words)
    cleanedspeeches.append(speechnew)

In [None]:
cleanedspeeches
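
The chained `.replace()` calls above only handle a handful of characters. As an alternative sketch that uses only the standard library, `str.translate` can delete every character listed in `string.punctuation` in one pass. The tiny stopword set and the function name `clean` are hypothetical and only serve to keep the example runnable without NLTK data; in the notebook you would use `stopwords.words('english')` instead.

```python
import string

# Hypothetical mini stopword list; the notebook uses stopwords.words('english') instead.
toy_stopwords = {"the", "is", "a", "of", "and"}

# Translation table that deletes all ASCII punctuation characters
remove_punct = str.maketrans("", "", string.punctuation)

def clean(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    words = text.lower().translate(remove_punct).split()
    return " ".join(w for w in words if w not in toy_stopwords)

print(clean('The economy is, "of course", a priority!'))
```

Be aware that this also deletes apostrophes and hyphens, which may or may not be what you want.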

## Parsing and retaining only nouns and adjectives
Look at the NLTK documentation to find out what each code means (e.g., 'NN' is 'noun') 

In [None]:
speechesnounsadj=[]
for speech in speeches:
    tokens = nltk.word_tokenize(speech)
    tagged = nltk.pos_tag(tokens)
    cleanspeech = ""
    for element in tagged:
        if element[1] in ('NN','NNP','JJ'):
            cleanspeech=cleanspeech+element[0]+" "
    speechesnounsadj.append(cleanspeech)

In [None]:
speechesnounsadj
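
The filter above keeps only the exact tags 'NN', 'NNP', and 'JJ', which means that plural nouns (tagged 'NNS') are dropped. The sketch below illustrates the difference between matching exact tags and matching tag prefixes. The (word, tag) pairs were typed by hand in the format that `nltk.pos_tag()` returns; they are not actual tagger output.

```python
# Hand-written (word, tag) pairs in the format returned by nltk.pos_tag();
# typed by hand for illustration, not produced by the tagger.
tagged = [("The", "DT"), ("Dutch", "JJ"), ("economy", "NN"),
          ("creates", "VBZ"), ("new", "JJ"), ("jobs", "NNS")]

# Exact matching, as in the cell above
keep = ('NN', 'NNP', 'JJ')
exact = " ".join(word for word, tag in tagged if tag in keep)

# Prefix matching also keeps plural nouns (NNS, NNPS)
# and comparative/superlative adjectives (JJR, JJS).
prefix = " ".join(word for word, tag in tagged if tag.startswith(('NN', 'JJ')))

print(exact)
print(prefix)
```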