# (Optional) Chapter V - Natural Language Processing

This notebook works alongside our explanation to illustrate the basics of natural language processing.

- Topic modelling
- Resources for further research

## Section A) Setup

We'll first import our required packages. For the basics of natural language processing we're going to be using nltk. You'll need to install this in your current environment using the terminal.

To make sure your environment is set up, open your terminal. The line should begin with (base).

- We have previously used 'conda create -n environment_name' to create an environment. Now use 'conda activate environment_name' to open your environment. Once you've run this, the line should start with (environment_name).

- Once in the environment we can install packages. We want to install them here rather than in the base so that they are only used for this project.

- To do this use 'conda install package_name'.
In this case 'conda install nltk'.

- Note that for smaller packages (e.g. pyldavis), we have to specify where the package can be found, in this case the conda-forge channel.


In [None]:
## Run these commands in order to install the needed packages

# conda activate your_environment_name
# conda install nltk
# conda install gensim
# conda install conda-forge::pyldavis

These import statements are different to what we have used so far. Instead of importing a whole library, 'from ... import ...' lets us import particular functions, and '.' lets us reach a subsection of a library (for example nltk.corpus).
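
As a small illustration of the difference, here is a sketch using only the standard library. The `collections` module is just a stand-in and is not one of the packages installed above; the word list is made up.

```python
# Importing the whole module: names are reached through the module prefix
import collections
counts_a = collections.Counter(["museum", "art", "museum"])

# Importing one name from the module: no prefix is needed afterwards
from collections import Counter
counts_b = Counter(["museum", "art", "museum"])

# Both routes reach the same class, so the results are equal
print(counts_a == counts_b)
```

Either style works; importing a single name simply keeps later lines shorter.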

In [None]:
# Set up imports
import pandas as pd, matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet') # Needed by the WordNetLemmatizer we use later in Section B

import gensim
from gensim.corpora import Dictionary
import pyLDAvis.gensim_models
import string # This provides us a list of punctuation later down the line

Let's now load in the dataframe and start cleaning. 
Note: this file is tab separated (see how this changes read_csv below)

In [None]:
df = pd.read_csv('../../data/sample_museum_tweets.tsv', sep='\t')

### A) Data cleaning and Tokenization

Let's check what our data looks like.

In [None]:
df.info()

In [None]:
df.head(10)

✏️ [Ex. 1]

Let's clean up the datatypes first.
- Rename 'original_text' to 'text'
- Make sure that the column's datatype is a string
- How many null values are there in 'text'?

The dataset we have has already been cleaned so there is not much to do! But you may want to be wary of duplicates in your own dataset. The pandas function drop_duplicates() will help. You can check the length of the df before and after to see how many it found. Check the documentation for more details.
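
A minimal sketch of that before-and-after check, assuming pandas is installed. The toy dataframe and its values are hypothetical and only stand in for your own data.

```python
import pandas as pd

# Toy data standing in for a tweets dataframe (hypothetical values)
toy = pd.DataFrame({"text": ["Visit us today", "New exhibition", "Visit us today"]})

before = len(toy)
# Keep the first occurrence of each repeated text and drop the rest
toy_deduped = toy.drop_duplicates(subset="text")
after = len(toy_deduped)

print(f"{before - after} duplicate row(s) removed")
```

Passing `subset` restricts the comparison to the named column; without it, rows only count as duplicates when every column matches.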

In [None]:
df = df.rename(columns={'original_text':'text'})

# Count missing values before converting, because astype(str) turns them into the string 'nan'
print(df['text'].isna().sum())
df['text'] = df['text'].astype(str)

# Note: it's always useful to check that there aren't a number of 'nan' strings in the column. As there would be more than one, they would show up near the top of the counts.
# df['text'].value_counts()

To make our dataset more homogeneous, for now we will focus on all of the tweets that are not replies. This is because replies tend to have different characteristics (shorter, more boring vocabulary), so we'll drop them for this exercise.

Here we can use loc, or simply a tilde (~) for 'not'.

When we do this we drop more than half the tweets. 

In [None]:
print(len(df))
df = df[~df["is_reply"]].copy() # .copy() avoids a SettingWithCopyWarning when we modify df later
print(len(df))

## Section B) Preprocessing Tweets
https://towardsdatascience.com/lda-topic-modeling-with-tweets-deff37c0e131

https://medium.com/swlh/quick-text-pre-processing-c444f0ed9dcc

Here we will: 
- Remove URLs (you could also remove mentions using what we learnt yesterday or the code below, but we will leave them for now)
- Remove stop words, punctuation, and lowercase all words
- Tokenize each tweet
- Remove any remaining special characters
- Lemmatize/Stem the tweets

To remove URLs, we will need to use regex with string operators. Here is a helpful guide:
https://blog.hpc.qmul.ac.uk/Beginners-Guide-to-Regular-Expressions/

And for a fuller description: https://regexone.com/

You can test your regex with this site: 
https://regex101.com/ 
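
To get a feel for the patterns before applying them to a whole column, here is a small sketch using Python's built-in `re` module on a single string. The example tweet is made up, and the mention pattern `@\w+` is one reasonable choice rather than the only one.

```python
import re

# Hypothetical example tweet, not taken from the dataset
tweet = "Thanks @visitor for coming! Details at https://example.org/events #museum"

# "http\S+" matches "http" followed by any run of non-whitespace characters
no_urls = re.sub(r"http\S+", "", tweet)

# "@\w+" matches "@" followed by letters, digits or underscores (a mention)
no_mentions = re.sub(r"@\w+", "", no_urls)

# Collapse the extra spaces left behind by the removals
cleaned = re.sub(r"\s+", " ", no_mentions).strip()
print(cleaned)
```

The same patterns can be passed to the pandas `.str.replace(pattern, "", regex=True)` method to clean an entire column at once.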

In [None]:
def remove_urls(df, text_column_name):
    df[text_column_name] = df[text_column_name].str.replace(r"http\S+", "", regex=True)
    return df

df = remove_urls(df, 'text')
df.head()

### Removing stopwords and tokenization

In [None]:
# Get a predefined list of stopwords
stop_words = stopwords.words('english')

# Add punctuation to our stopwords list
stop_words += list(string.punctuation) # This comes from Python itself and is a list of the ASCII punctuation characters, e.g. -./:;<=>?@

# Add integers
stop_words += ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [None]:
def tokenize_and_lowercase_new(text):
    tokens = word_tokenize(text)

    short_removed = []
    for token in tokens:
        token_lower = token.lower()
        # Keep the token only if it is not a stop word and is longer than three characters
        if token_lower not in stop_words and len(token_lower) > 3:
            short_removed.append(token_lower)
    return short_removed

df['text_tokens'] = df['text'].apply(tokenize_and_lowercase_new)
df['text_tokens'].head(20)

Great! Each row is now a list of the words that were in that tweet. It's a little bit cleaner than the method we used before, as we no longer have any punctuation or common words such as 'a' or 'the'. This is a much better start for analysis.

There is one last step: we still need to lemmatize or perform stemming. Lemmatizing is a little bit more sophisticated (remember our example of leaf and leaves), so we will lemmatize using nltk.

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_text(df_text):
    lemmatized_words =[]
    for w in df_text:
        lemmatized_words.append(lemmatizer.lemmatize(w))
    return lemmatized_words

df['text_lemmatized'] = df['text_tokens'].apply(lemmatize_text)
df['text_lemmatized'].head()

In [None]:
# You may want to save the dataframe for ease of access later.
df.to_pickle("../../data/lemmatized_museum_tweets.pkl")  

# to read it back to a dataframe we would use: 
# df = pd.read_pickle("../../data/lemmatized_museum_tweets.pkl") 

For evaluation of topic models, see:
https://www.youtube.com/watch?v=UkmIljRIG_M&t=179s

## Section C) Basic Analysis

In [None]:
df = pd.read_pickle("../../data/lemmatized_museum_tweets.pkl") 

We saw how to visualise the length of museum tweets yesterday, but using nltk, we can do additional checks on the most common words and how they change over time.

In [None]:
from nltk.probability import FreqDist

# Iterate through each tweet, then each token in each tweet, and store in one list
# Equivalent one-liner: flat_words = [item for sublist in df['text_lemmatized'] for item in sublist]

flat_words = []
for sublist in df['text_lemmatized']:
    for item in sublist:
        flat_words.append(item)

word_freq = FreqDist(flat_words)
word_freq.most_common(30)

It is currently a list of tuples, but we can easily convert it into a dictionary, which we could turn into a wordcloud.

In [None]:
top_30_words = dict(word_freq.most_common(30))
top_30_words
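
nltk's FreqDist builds on Python's collections.Counter, so the flatten-then-count pattern above can be tried out with the standard library alone. This is only a sketch: the token lists below are made up and stand in for df['text_lemmatized'].

```python
from collections import Counter

# Made-up token lists standing in for df['text_lemmatized']
tweets = [["museum", "exhibition", "open"], ["museum", "family", "open"], ["museum"]]

# Flatten the list of lists into one list of tokens
flat = [token for tweet in tweets for token in tweet]

counts = Counter(flat)
top_two = counts.most_common(2)   # a list of (word, count) tuples
top_two_dict = dict(top_two)      # the same information as a dictionary
print(top_two)
print(top_two_dict)
```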

In [None]:
raf_museums_tweets = df.loc[df['museum_account']=='rafmuseum']
horniman_tweets = df.loc[df['museum_account']=='hornimanmuseum']

# Equivalent one-liner: flat_words_raf = [item for sublist in raf_museums_tweets['text_lemmatized'] for item in sublist]
flat_words_raf = []
for sublist in raf_museums_tweets['text_lemmatized']:
    for item in sublist:
        flat_words_raf.append(item)

word_freq_raf = FreqDist(flat_words_raf)
top_10_words_raf = dict(word_freq_raf.most_common(10))
top_10_words_raf

In [None]:
import matplotlib.pyplot as plt

plt.bar(range(len(top_10_words_raf)), list(top_10_words_raf.values()), align='center')
plt.xticks(range(len(top_10_words_raf)), list(top_10_words_raf.keys()))

plt.show()


In [None]:
flat_words_horn = [item for sublist in horniman_tweets['text_lemmatized'] for item in sublist]
word_freq_horn = FreqDist(flat_words_horn)
top_10_words_horn = dict(word_freq_horn.most_common(10))

plt.bar(range(len(top_10_words_horn)), list(top_10_words_horn.values()), align='center')
plt.xticks(range(len(top_10_words_horn)), list(top_10_words_horn.keys()))

plt.show()

## Section D) Topic modeling

So what is our data about? One interesting thing we can try is topic modeling. 

For this notebook, we're going to be using LDA (Latent Dirichlet Allocation). This is a white box/transparent box model. With this approach you have to specify the number of clusters in advance, which can be limiting. However, it means that you can label your data manually to create a test dataset to check your parameters against, just as we discussed earlier this week.

In [None]:
# First we have to create a bag of words

#create dictionary
text_dict = gensim.corpora.Dictionary(df['text_lemmatized'])

#view integer mappings
text_dict.token2id

tweets_bow = []
for tweet in df['text_lemmatized']:
    tweets_bow.append(text_dict.doc2bow(tweet))
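
To see what a bag of words looks like, here is a simplified pure-Python sketch of the idea. It is not gensim's implementation: the helper `to_bow` and the toy documents are hypothetical, and gensim may assign different integer ids to the same words.

```python
from collections import Counter

# Made-up tokenised tweets (hypothetical, not from the dataset)
docs = [["museum", "open", "today"], ["museum", "museum", "tour"]]

# Step 1: give every distinct token an integer id (the "dictionary")
token2id = {}
for doc in docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

# Step 2: represent each document as sorted (token_id, count) pairs
def to_bow(doc):
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Word order is thrown away in this representation; only which words occur and how often is kept, which is what LDA works from.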

In [None]:
all_words = []
for tokens in df['text_lemmatized']:
    for word in tokens:
        all_words.append(word)


tweet_length_in_lemmas = []
for tokens in df['text_lemmatized']:
    tweet_length_in_lemmas.append(len(tokens))

vocab = sorted(list(set(all_words)))

print('{} words total, with a vocabulary size of {}'.format(len(all_words), len(vocab)))
print('Max tweet length is {}'.format(max(tweet_length_in_lemmas)))

In [None]:
from gensim.models.ldamodel import LdaModel

# When we train a model there are a number of parameters that we can add.

# k here represents the number of topics
k = 7
tweets_lda = LdaModel(tweets_bow,
                      num_topics = k,
                      id2word = text_dict,
                      random_state = 1,
                      passes=10)

tweets_lda.show_topics()
tweets_lda.save("lda_museum_tweets_7.model") # The seven here refers to the number of clusters we chose

In [None]:
pyLDAvis.enable_notebook()
lda_viz = pyLDAvis.gensim_models.prepare(tweets_lda, tweets_bow, dictionary=tweets_lda.id2word)
# pyLDAvis.show(lda_viz, local=False) # Outside a notebook: serves the visualisation in a browser and blocks until stopped
lda_viz

What do the topics look like? Do they make sense to you? We'll try to see if there are 

✏️ [Ex. 2]

Let's experiment with the model's parameters.
- Write a function that takes the text_dict, the number of clusters (k), and the number of passes, then trains and saves a model (don't forget to use f-strings to change the name of the file you save each time).
- Using your function, try changing the number of clusters to 4, 7, 8 and 10. Which one do you think worked best?
- What about the number of passes, how does this change your model? 
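
As a hint for the file naming only, here is a minimal sketch of an f-string that records the settings in the file name. The helper `model_filename` and its naming pattern are hypothetical; use whatever pattern suits you.

```python
def model_filename(k, passes):
    # Hypothetical helper: builds a file name that records both settings
    return f"lda_museum_tweets_k{k}_passes{passes}.model"

# One distinct file name per number of clusters
for k in [4, 7, 8, 10]:
    print(model_filename(k, passes=10))
```

Including both values in the name means a later run with different settings will not overwrite an earlier model.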

## Further resources (natural language processing and beyond...)

This has been a brief introduction to some basic NLP tasks, but it only scratches the surface. The links below point you in the direction of some helpful resources. They represent a wide range of approaches including neural networks, so do keep in mind what we talked about regarding black box models!

### 1) spaCy
For further exploration of syntax and named entity recognition, you can explore the spaCy library. They have visualisers to help you get to grips with what it does.
https://demos.explosion.ai/displacy-ent
https://demos.explosion.ai/displacy

I would encourage you to explore the different examples used in the "Guides" section of the spaCy documentation.
https://spacy.io/usage/linguistic-features
https://spacy.io/usage/rule-based-matching

It might be helpful for some of your projects!

### 2) BERTopic
Often the go-to tool these days for neural topic modeling. There is a very helpful quickstart guide!
https://maartengr.github.io/BERTopic/index.html

### 3) Transkribus
For those of you who have not processed your scans of text yet and need to do OCR (Optical Character Recognition), Transkribus is the go-to tool for transcribing texts (free for a while, then talk to your university/send them an email). It's a great team and run by academics for academics. It takes some work to figure out all of the functions and the interface, but it is well worth investing the time!

Different models can be used to work with either handwritten or printed/typed texts, as well as for a wide selection of languages and time periods.
https://www.transkribus.org/

### 4) Hugging Face
Most of the libraries mentioned here have a number of pretrained models available (at different sizes). For everything else, there's Hugging Face.
https://huggingface.co/

It's not ideal in many ways, but it is the largest repository of pretrained models and datasets currently out there, so have an explore! It will be especially helpful for those of you not working in English.
An alternative would be Kaggle (more for datasets).

### 5) Distant viewing lab
Fantastic resource for computer vision (step by step guide)
https://distantviewing.org/

But they also have a section on sentiment analysis using neural networks.
https://distantviewing.org/dvscripts/sentiment.html

### 6) GLAM workbench 
A series of amazing notebooks exploring GLAM data. 
https://glam-workbench.net/

I've added some helpful links based on some of your projects! However, feel free to have an explore of the whole site, there are many more notebooks you may find useful. 

Here they have a guide on how to run these notebooks:

https://glam-workbench.net/trove-newspapers/#data-and-images

#### Useful notebooks that have direct relevance to some of your projects include: 
Finding nouns, verbs, sentences, and cleaning your text:
https://glam-workbench.net/trove-books/recipe-generator/

Geomapping:
https://glam-workbench.net/trove-newspapers/Map-newspaper-results-by-place-of-publication/ OR 
https://glam-workbench.net/trove-newspapers/Map-newspaper-results-by-place-of-publication-over-time/

Topic modeling of parliamentary press releases:
https://glam-workbench.net/trove-journals/topic-modelling-parliament-press-releases/?h=topic

Topic modeling of books:
https://glam-workbench.net/trove-books/exploring-digitised-books-adel-rahmani/?h=topic

(Alas, it should be noted that the Trove notebooks aren't being updated due to issues with the National Library of Australia being weird about their API's terms and conditions, but you can still take and adapt them for yourself!)

### 7) Plotly
Not NLP related, but Plotly is a different and really nice plotting library where the visualisations are interactive. Worth an experiment:
https://plotly.com/python/

## Lastly if you come across any great resources yourself feel free to share! 