# Natural Language Processing (NLP) in Python
## An introduction to key concepts and techniques

In this notebook we are going to explore central concepts of NLP and their implementation in modern high-level Python libraries.
This is meant as a very general introduction that makes the field more approachable and provides some familiarity with its jargon. NLP offers many opportunities for social science research as a technique (or rather an array of different techniques), but its application there is still limited, though growing.

The research field of NLP itself has been turned upside down and has developed rapidly since the introduction of word embeddings around 2013 and the growth of deep learning (neural network models) in the past 3-4 years. Recurrent neural networks in particular, and their LSTM (long short-term memory) variant, shifted the research field.

![nlp problems](https://image.slidesharecdn.com/lang-detect-161011092815/95/nlp-project-full-cycle-16-638.jpg)


This workshop aims at presenting established techniques that I think are most useful in a social science research setting.

To be more specific, below we will explore:

- basic string manipulation
- tokens and tokenization + some preprocessing
- the Bag-of-Words model
- topic modeling (and its close relation to dimensionality reduction / unsupervised machine learning)
- entity extraction
- text classification

### Basic string manipulation and tokenization

In the following we will just use basic Python string manipulation.

You can do much, much more once you learn regular expressions (RegEx), but that would go too far here; you can learn some of it in the DC course.
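
To give a small taste of what regular expressions add, here is a minimal sketch using Python's built-in `re` module. The sample sentence is made up for illustration and is not part of the article used below.

```python
import re

# Hypothetical sample sentence (an assumption for illustration only)
sample = "The vote was 50 to 48 (a narrow margin), and the judge is 53-year-old."

# \w+ matches runs of letters, digits and underscores, so punctuation is
# dropped and hyphenated words are split into their parts
word_tokens = re.findall(r"\w+", sample.lower())
print(word_tokens)

# Keep only purely alphabetic tokens, which removes the numbers
alpha_tokens = [t for t in word_tokens if t.isalpha()]
print(alpha_tokens)
```

Compared with `split(' ')`, the pattern removes brackets and commas in one step, at the price of also splitting "53-year-old" into three tokens.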

Let's start with a recent news text from the Guardian.

## Installing some packages that we are going to use

In [None]:
# Topic modelling package
!pip install gensim

In [None]:
#import nltk
#nltk.download()

# then enter "d" for download and "popular" for the popular packages

In [None]:
text = """The US Senate has voted to confirm judge Brett Kavanaugh to the supreme court, handing Donald Trump a major victory and America a bench expected to tilt to the right for the next generation.
The president will hold a ceremony for Kavanaugh at the White House on Monday evening and he is expected to take his place on the court on Tuesday.
After a bitter fight on Capitol Hill dominated by partisan entrenchment and the allegations of sexual assault against Kavanaugh, the 53-year-old federal judge was sworn in by supreme court chief justice John Roberts on Saturday evening just a few hours after Republicans won the confirmation vote 50 to 48.
Furious protesters hammered on the huge front doors beneath the white columns of the majestic court building on Capitol Hill as Kavanaugh was being sworn in, following a day of demonstrations that saw many arrested but were more muted than days earlier, as it became clear the ultra-conservative’s confirmation was all but inevitable."""

In [None]:
# We can split the text-chunk into something like sentences.
split_text = text.split('.')
print(split_text)

In [None]:
# Print out the third sentence (index 2)
sentence_3 = split_text[2]
print(sentence_3)

In [None]:
# Let's create tokens
tokens_sentence_3 = [word for word in sentence_3.split(' ')]
print(tokens_sentence_3)

In [None]:
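# Let's lowercase all these tokens and clean up the \n (newline character)
# We also remove "(" and ")" and make sure that only words land in our list
tokens_sentence_3_lower = [word.lower().strip() for word in sentence_3.split(' ')]
print('### OUTPUT1 ###')
print(tokens_sentence_3_lower)
print('\n')

# Strip the parentheses first, then keep purely alphabetic tokens.
# Note: tokens with attached punctuation (e.g. "kavanaugh,") are dropped by isalpha()
tokens_sentence_3_lower = [word.replace('(','').replace(')','')
                           for word in tokens_sentence_3_lower]
tokens_sentence_3_lower = [word for word in tokens_sentence_3_lower if word.isalpha()]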

print('### OUTPUT2 ###')
print(tokens_sentence_3_lower)


In [None]:
# Removing stopwords

stopwords_en = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
                'ourselves', 'you', "you're", "you've", "you'll", 
                "you'd", 'your', 'yours', 'yourself', 'yourselves', 
                'he', 'him', 'his', 'himself', 'she', "she's", 'her', 
                'hers', 'herself', 'it', "it's", 'its', 'itself', 
                'they', 'them', 'their', 'theirs', 'themselves', 'what', 
                'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 
                'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 
                'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 
                'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 
                'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 
                'between', 'into', 'through', 'during', 'before', 'after', 'above', 
                'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 
                'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 
                'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 
                'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 
                'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 
                'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 
                'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', 
                "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', 
                "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 
                'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 
                'won', "won't", 'wouldn', "wouldn't"]

In [None]:
tokens_sentence_3_clean = [word for word in tokens_sentence_3_lower if word not in stopwords_en]
print(tokens_sentence_3_clean)

Introducing NLTK, which will make your life much easier

In [None]:
import nltk

In [None]:
# Tokenizing sentences
from nltk.tokenize import sent_tokenize

# Tokenizing words
from nltk.tokenize import word_tokenize

# Tokenizing Tweets!
from nltk.tokenize import TweetTokenizer

In [None]:
# Let's get our sentences.
# Note that the full-stops at the end of each sentence are still there
sentences = sent_tokenize(text)
print(sentences)

In [None]:
# Use word_tokenize to tokenize the third sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[2])

# Make a set of unique tokens in the entire text: unique_tokens
unique_tokens = set(word_tokenize(text))
print(unique_tokens)

Let's see how this works with tweets, using a well-known example.

In [None]:
tweets = ["On behalf of @FLOTUS Melania & myself, THANK YOU for today's update & GREAT WORK! #SouthernBaptist @SendRelief,… https://t.co/4yZCeXCt6n",
"I will be going to Texas and Louisiana tomorrow with First Lady. Great progress being made! Spending weekend working at White House.",
"Stock Market up 5 months in a row!",
"'President Donald J. Trump Proclaims September 3, 2017, as a National Day of Prayer' #HurricaneHarvey #PrayForTexas… https://t.co/tOMfFWwEsN",
"Texas is healing fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow!"]

In [None]:
# We can use the tweet tokenizer to parse these tweets:

tknzr = TweetTokenizer()
tweets_tokenized = [tknzr.tokenize(tweet) for tweet in tweets]
print(tweets_tokenized)

In [None]:
# Get out all hashtags using loops

hashtags = []

for tweet in tweets_tokenized:
    hashtags.extend([word for word in tweet if word.startswith('#')])
    
print(hashtags)

### Bag of words model

For a computer to work with text, we need to find a useful representation.
If you need to compare different texts, e.g. articles, you will probably go for keywords. These keywords may come from a keyword list with, for example, 200 different keywords.
In that case you could represent each document as a (sparse) vector with 1 for "keyword present" and 0 for "keyword absent".
We can also get a bit more sophisticated and count the number of times each word from our dictionary occurs.
For a corpus of documents, that gives us a document-term matrix.
![example](https://i.stack.imgur.com/C1UMs.png)
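
As a minimal sketch of this idea, the cell below builds both the binary and the count version of a document-term matrix for a tiny, made-up corpus. The documents and the names `docs`, `vocabulary`, `count_matrix` and `binary_matrix` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    ["court", "vote", "senate", "vote"],
    ["court", "judge"],
    ["senate", "protest", "protest", "protest"],
]

# The vocabulary fixes the column order of the matrix
vocabulary = sorted({token for doc in docs for token in doc})

# Count each document once, then read off one column per vocabulary term
doc_counts = [Counter(doc) for doc in docs]
count_matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]

# Binary variant: 1 for "keyword present", 0 for "keyword absent"
binary_matrix = [[1 if c > 0 else 0 for c in row] for row in count_matrix]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row is one document and each column one term, so most entries are 0 once the vocabulary grows; this is why such matrices are usually stored in a sparse format.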

Let's try creating a bag of words model from our initial example.

In [None]:
# We import the Counter class from Python's standard collections module

from collections import Counter

word_tokenized = word_tokenize(text)
bow = Counter(word_tokenized)
print(bow.most_common())

In [None]:
# Let's add some preprocessing

from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')

word_tokenized = word_tokenize(text)

# lowercasing
cleaned_word_tokenized = [word.lower().strip() for word in word_tokenized]
# keeping only alphabetic tokens (this drops punctuation, brackets and numbers)
cleaned_word_tokenized = [word for word in cleaned_word_tokenized if word.isalpha()]
# removing stopwords
cleaned_word_tokenized = [word for word in cleaned_word_tokenized if word not in english_stopwords]

bow = Counter(cleaned_word_tokenized)
print(bow.most_common())

One important part of text preprocessing is normalization. Here we can use stemmers and lemmatizers to merge plural forms and other inflected variants of a word. This can be extremely useful when working with languages that have a rich morphology, such as Russian or Turkish.

![example_stemm](https://image.slidesharecdn.com/lightweightnaturallanguageprocessingnlp-120314154200-phpapp01/95/lightweight-natural-language-processing-nlp-34-728.jpg?cb=1331814243)

In [None]:
# Let's import a lemmatizer from NLTK and try how it works
from nltk.stem import WordNetLemmatizer

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in cleaned_word_tokenized]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

So far you have learned some basic string manipulation and I have also introduced NLTK. If you want to learn more about traditional NLP, check out the [free online book on NLTK](https://www.nltk.org/book/). You will learn old-school NLP along with Python (and general programming foundations).

When it comes to comparing documents (which is often what we want), simple "keyword counts" may be too simplistic. We can do better with topic modeling. One excellent library for working with state-of-the-art topic models is Gensim.

![gensim](https://rare-technologies.com/wp-content/uploads/2017/01/atmodel_plot-855x645.png)

Let's try to work with a bigger dataset.

Gensim allows you to work with a large number of high-performance NLP models, including word embedding techniques. We will be using something more traditional: TF-IDF and LSI.

In [None]:
# We start by importing the data, ~1900 Abstracts/Titles from Scopus
import pandas as pd

abstracts = pd.read_csv('https://github.com/SDS-AAU/M2-2018/raw/master/input/abstracts.csv')

In [None]:
# Let's inspect the data
abstracts.head()

**Introducing Lambda Functions** Python allows you to write short functions in one line using the *lambda* keyword, followed by the argument(s) and a ":".
Below we transform the abstract column into a new one that we call tokenized, compressing our preprocessing pipeline into 3 lines.

We combine our lambda functions with the Pandas method "map", which applies the function to every row.
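
Here is a minimal sketch of the same pattern in plain Python, assuming a small made-up list of titles. The built-in `map` plays the role that the Pandas `map` method plays for a column; the names `lower_split`, `lower_split_lambda` and `titles` are hypothetical.

```python
# A regular named function ...
def lower_split(t):
    return t.lower().split()

# ... and the equivalent one-line lambda
lower_split_lambda = lambda t: t.lower().split()

# Hypothetical example data
titles = ["Firms Innovate", "Networks Matter"]

# map applies the function to every element, just like Series.map does per row
tokenized = list(map(lower_split_lambda, titles))
print(tokenized)
```

Lambdas are convenient for short, throwaway transformations; for anything longer, a named function is usually easier to read and test.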

In [None]:
# Tokenize each abstract
abstracts['tokenized'] = abstracts['Abstract'].map(lambda t: word_tokenize(t))

In [None]:
# lowercase, strip and keep only alphabetic tokens
abstracts['tokenized'] = abstracts['tokenized'].map(lambda t: [word.lower().strip() for word in t if word.isalpha()])

In [None]:
# lemmatize and remove stopwords
abstracts['tokenized'] = abstracts['tokenized'].map(lambda t: [wordnet_lemmatizer.lemmatize(word) for word in t if word not in stopwords_en])

Sure, one could do so much more to pre-process. We could try to identify bi-grams, remove prepositions, verbs etc. But already this brings us rather far.
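
To illustrate the bi-gram idea, here is a minimal sketch that counts adjacent token pairs with the standard library only. The token list is made up, and this simple counting is only one way to spot frequent pairs, not the approach used in the rest of the notebook.

```python
from collections import Counter

# Hypothetical, already preprocessed token list
tokens = ["open", "innovation", "drives", "open", "innovation", "policy"]

# Pair every token with its successor to get the bi-grams
bigrams = list(zip(tokens, tokens[1:]))

# Frequent pairs are candidates for being treated as a single token
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(2))
```

A pair such as ("open", "innovation") that shows up repeatedly could then be merged into one token, e.g. "open_innovation", before building the dictionary.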

Now we will dive into Gensim and further transform our abstracts using more advanced techniques.

In [None]:
# We start by importing and initializing a Gensim Dictionary. 
# The dictionary will be used to map between words and IDs

from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the tokenized abstracts: dictionary
dictionary = Dictionary(abstracts['tokenized'])

In [None]:
# And this is how you can map back and forth
# Select the id for "firm": firm_id
firm_id = dictionary.token2id.get("firm")

# Use firm_id with the dictionary to print the word
print(dictionary.get(firm_id))

In [None]:
# Create a Corpus: corpus
# We use a list comprehension to transform our abstracts into BoWs
corpus = [dictionary.doc2bow(abstract) for abstract in abstracts['tokenized']]

In [None]:
# Print the first 10 word ids with their frequency counts from the eleventh document (index 10)
print(corpus[10][:10])

# This is the same as what we did before when counting words with the Counter, just on a larger scale

In [None]:
# Sort the doc for frequency: bow_doc
bow_doc = sorted(corpus[10], key=lambda w: w[1], reverse=True)

# Print the top 10 words of the document alongside the count
for word_id, word_count in bow_doc[:10]:
    print(dictionary.get(word_id), word_count)

#### TF-IDF - Term Frequency - Inverse Document Frequency

A token is important for a document if it appears there very often.
A token becomes less important for comparison across a corpus if it appears all over the place in the corpus.

*Innovation* in a corpus of abstracts talking about innovation is not that important.


\begin{equation*}
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
\end{equation*}

- $w_{i,j}$ = the TF-IDF score for a term i in a document j
- $tf_{i,j}$ = number of occurrences of term i in document j
- $N$ = number of documents in the corpus
- $df_i$ = number of documents with term i
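
To make the formula concrete, here is a minimal hand-computed sketch on a made-up corpus of three tiny documents. It uses the natural logarithm and no normalization, which is an assumption for illustration; Gensim's `TfidfModel` applies its own default weighting and normalization, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: "innovation" appears in every document
docs = [
    ["innovation", "firm", "network"],
    ["innovation", "policy", "policy"],
    ["innovation", "firm", "growth"],
]

N = len(docs)  # number of documents in the corpus

# df_i: number of documents that contain term i (each document counts once)
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: tf * log(N / df)} for one tokenized document."""
    tf = Counter(doc)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

weights = [tfidf(doc) for doc in docs]
for w in weights:
    print(w)
```

Because "innovation" occurs in all three documents, $\log(N/df_i) = \log(1) = 0$ and its weight vanishes, while "policy", which occurs twice in a single document, gets the highest score.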


We will use TF-IDF to transform our corpus. However, first we need to fit the TF-IDF model.

In [None]:
# Import the TfidfModel from Gensim
from gensim.models.tfidfmodel import TfidfModel

# Create and fit a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[corpus[10]]

# Print the first five weights
print(tfidf_weights[:5])

In [None]:
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 10 weighted words
for term_id, weight in sorted_tfidf_weights[:10]:
    print(dictionary.get(term_id), weight)

In [None]:
# Now we can transform the whole corpus
tfidf_corpus = tfidf[corpus]

The transformed corpus is much more interesting in terms of analysis than the pure bag of words representation. In fact, you could transform it now into a matrix and perform clustering and other unsupervised machine learning.

![surprise](http://www.jaclynfriedman.com/wp-content/uploads/2018/06/giphy-23.gif)

**Surprise**: This is exactly what topic modelling is about! Algorithms like LSI are closely related to PCA, NMF and SVD.
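
To make the link to SVD tangible, here is a minimal sketch that applies a truncated SVD to a small, made-up document-term matrix with NumPy. It assumes documents in rows and terms in columns, which is a convention chosen for this illustration; the matrix and the names `doc_topic` and `topic_term` are hypothetical.

```python
import numpy as np

# Hypothetical document-term matrix: 4 documents (rows) x 5 terms (columns)
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# Full (thin) SVD: X = U @ diag(s) @ Vt, singular values sorted descending
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest components ("topics")
k = 2
doc_topic = U[:, :k] * s[:k]   # documents expressed in topic space
topic_term = Vt[:k, :]         # term loadings of each topic

# Low-rank approximation of the original matrix
X_approx = doc_topic @ topic_term
print(np.round(doc_topic, 2))
```

Each document is now described by `k` numbers instead of one count per term, which is the kind of compressed representation we cluster and visualize later on.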



In [None]:
# Just like before, we import the model
from gensim.models.lsimodel import LsiModel

# And we fit it on the tfidf_corpus, pointing to the dictionary as reference and setting the number of topics.
# In more serious settings one would pick between 300 and 400 topics
lsi = LsiModel(tfidf_corpus, id2word=dictionary, num_topics=100)

In [None]:
# Once the model is ready, we can inspect the topics
lsi.show_topics(num_topics=10)

In [None]:
# And just as before, we can use the trained model to transform the corpus
lsi_corpus = lsi[tfidf_corpus]

At this point, our corpus is a document-topic matrix in Gensim's corpus format. We can create a full (dense) matrix using the built-in MatrixSimilarity class (which is actually meant for similarity queries).
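
The similarity queries mentioned here are based on cosine similarity. As a minimal sketch of that measure, the cell below compares three made-up document vectors in a 3-topic space using the standard library only; the vectors and the helper `cosine_similarity` are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical document vectors (topic weights)
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]
doc_c = [0.0, 0.1, 0.9]

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Documents that load on the same topics end up with a similarity close to 1, while documents dominated by different topics end up close to 0.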

In [None]:
# Load the MatrixSimilarity
from gensim.similarities import MatrixSimilarity

# Create the document-topic-matrix
document_topic_matrix = MatrixSimilarity(lsi_corpus)
document_topic_matrix = document_topic_matrix.index

In [None]:
# Let's identify some clusters in our corpus

# We import KMeans from the scikit-learn library
from sklearn.cluster import KMeans

# Instantiate a model with 10 clusters
kmeans = KMeans(n_clusters=10)

# And fit it on our matrix
kmeans.fit(document_topic_matrix)

In [None]:
# Let's annotate our abstracts with the assigned cluster number
abstracts['cluster'] = kmeans.labels_

In [None]:
!pip install umap-learn

In [None]:
import umap

In [None]:
# We can try to visualize our documents using t-SNE, an approach for visualizing high-dimensional data

# Import the module first
from sklearn.manifold import TSNE

# And instantiate
tsne = TSNE()

# Let's try to boil down the 100 dimensions into 2
visualization =  tsne.fit_transform(document_topic_matrix)

In [None]:
# Import plotting library
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.figure(figsize=(10,10))
# Pass x and y as keyword arguments; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=visualization[:,0], y=visualization[:,1],
                data=abstracts, palette='RdBu',
                hue=abstracts.cluster,
                legend='full')

Now let's explore the different clusters. For that we will look at the titles. We could do it "manually", but why not use NLP for that, too?
We will preprocess the titles just as we did with the abstracts, and then apply TF-IDF to the pooled title tokens of each cluster to see which tokens are most important in which cluster.

In [None]:
# Preprocessing
abstracts['title_tok'] = abstracts['Title'].map(lambda t: word_tokenize(t))
abstracts['title_tok'] = abstracts['title_tok'].map(lambda t: [word.lower().strip() for word in t if word.isalpha()])
abstracts['title_tok'] = abstracts['title_tok'].map(lambda t: [wordnet_lemmatizer.lemmatize(word) for word in t if word not in stopwords_en])

In [None]:
# Collecting the title tokens of one cluster

Cluster = 2

cluster_titles = []
for x in abstracts[abstracts['cluster'] == Cluster]['title_tok']:
    cluster_titles.extend(x)

In [None]:
# Transform into TF-IDF format
titles_tfidf = tfidf[dictionary.doc2bow(cluster_titles)]

In [None]:
# Sort the weights from highest to lowest: sorted_tfidf_weights
titles_tfidf = sorted(titles_tfidf, key=lambda w: w[1], reverse=True)

# Print the top 20 weighted words
for term_id, weight in titles_tfidf[:20]:
    print(dictionary.get(term_id), weight)