# A practical introduction to machine learning and natural language processing on papyrus data 
## By MSc. André Walsøe and Dr. Andrea Gasparini, University of Oslo

In this workshop we will cover all the steps from downloading and preparing a dataset, through filtering the data, to training a classification model.


# Session 1: Preparation and exploration of the dataset

1. Install Python libraries and download resources
2. Download dataset from GitHub repository

### 1.1 Download dataset and install libraries
The dataset was created by downloading and filtering data from papyri.info.
All the "raw" data can be found here:
https://github.com/papyri/idp.data/tree/master/APIS

In [None]:
## Downloading the dataset
!curl -L -o data.tsv https://raw.githubusercontent.com/auwalsoe/encode_nlp_workshop_2023/main/data/papyrus_data.tsv

## Installing libraries that are not provided in the default Colab kernel.
!pip install nltk
!pip install gradio

## Downloading NLTK resources for tokenization and stopword removal
import nltk
nltk.download('punkt')
## Assumption: newer NLTK releases load the tokenizer data from 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')
nltk.download('stopwords')

# Session 2: Data exploration and introduction to NLP techniques



## 2.1 Data exploration
Data exploration is important in NLP and machine learning because it helps to understand the characteristics of the data that will be used to train and test the model. This includes understanding the distribution of the data, identifying patterns and trends, and identifying any potential issues such as missing values or outliers. Additionally, data exploration can help to identify any biases or errors in the data that may negatively impact the performance of the model. By performing data exploration, practitioners can gain insights that can be used to improve the quality of the data and the performance of the model.
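As a minimal sketch of such checks, assuming a small made-up table rather than the workshop dataset (the `toy` DataFrame and all its values are hypothetical), pandas can count missing values, show the distribution of a label and measure text lengths:

```python
import pandas as pd

# Hypothetical toy data, NOT the workshop dataset
toy = pd.DataFrame({
    "provenance": ["Oxyrhynchus", "unknown", "Tebtunis", None, "Oxyrhynchus"],
    "translation": ["a receipt for grain", "", "a letter to the strategos", "a petition", None],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()

# Distribution of the label, including missing values
provenance_counts = toy["provenance"].value_counts(dropna=False)

# Number of whitespace-separated words per translation (missing text counts as 0 words)
translation_lengths = toy["translation"].fillna("").str.split().str.len()

print(missing_per_column)
print(provenance_counts)
print(translation_lengths.tolist())
```

Empty or missing translations and placeholder labels such as "unknown" are typical findings that motivate the filtering steps later in this session.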

What we will cover here:
1. Basic data exploration with pandas
2. Application of filtering techniques based on data exploration findings
3. Hands-on task: Data exploration and filtering 

### 2.1.2 Basic Data exploration and filtering with pandas
To load the dataset into our Google Colab notebook we use a library called pandas (https://pandas.pydata.org), an open-source data analysis and data manipulation tool that is widely used for all kinds of tasks involving data.

In [None]:
#Importing the pandas Library
import pandas as pd

# Loading the downloaded data into a pandas dataframe
#data = pd.read_csv('/content/data.tsv', delimiter = '\t')
data = pd.read_csv('data.tsv', delimiter = '\t')
data.head()

In [None]:
## Printing column names
for col in data.columns:
    print(col)

In [None]:
## Plot the frequencies of the values in a column
def plot_column_distribution(dataset, column_name, top_n):
    """Creates a bar plot of the top_n most frequent values in one column of the dataset.
    Input variables:
    dataset: pandas DataFrame
    column_name: string
    top_n: int"""
    
    dataset[column_name].value_counts()[:top_n].plot(kind="bar")
#data.category.value_counts()[:10].plot(kind='bar')

plot_column_distribution(data, "provenance", 10)

### 2.2.2 Application of filtering techniques based on data exploration findings

In [None]:
# Removing data entries with unknown provenance from the dataset.
# .copy() avoids pandas' SettingWithCopyWarning when the column is modified later.
data = data[data.provenance != 'unknown'].copy()

In [None]:
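# Assigning all provenances with a frequency below 100 to "Other"
threshold = 100
freq = data['provenance'].value_counts()
mappings = freq.index.to_series().mask(freq < threshold, 'Other').to_dict()

data['provenance'] = data['provenance'].map(mappings)

# Assumption: rows with a missing translation or provenance are dropped here,
# because the vectorizers used later cannot handle missing values
data = data.dropna(subset=['translation', 'provenance'])

translations = data['translation'].values
provenance = data['provenance'].values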


In [None]:
plot_column_distribution(data, "provenance", 10)


In [None]:
data['category'].value_counts()
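
The grouping of rare provenances used above can be sketched on a toy Series. The place names and the threshold of 2 are made up for illustration; the workshop code uses a threshold of 100 on the real data.

```python
import pandas as pd

# Hypothetical toy labels, NOT the workshop dataset
toy = pd.Series(["Oxyrhynchus"] * 3 + ["Tebtunis"] * 2 + ["Karanis"])

# How often does each label occur?
freq = toy.value_counts()

# Labels occurring fewer than `threshold` times are mapped to "Other",
# all remaining labels are mapped to themselves
threshold = 2
mapping = freq.index.to_series().mask(freq < threshold, "Other").to_dict()

grouped = toy.map(mapping)
print(mapping)
print(grouped.value_counts())
```

Grouping rare classes this way gives the classifier more examples per class, at the cost of no longer distinguishing the rare provenances from each other.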

### 2.1.3 Hands-on exercise: Data exploration and filtering
Explore the different columns of the dataset, for example with the plot_column_distribution function or data[COLUMN_NAME].value_counts(), and propose other filters that could be applied to improve the dataset.


In [None]:
### Write your solution or ideas below this line:


# Session 2.2: Introduction to NLP techniques
1. Lowercasing text
2. Tokenization
3. Stopword removal
4. Vectorization
5. Show how these techniques can be applied to papyrus data using NLTK
6. Hands-on exercise: Apply NLP techniques to papyrus dataset

### Example corpus
A corpus is a structured set of texts which can be used for statistical analysis or similar. In this exercise we will use a small example corpus consisting of parts of 3 papyri translations that were chosen randomly. In this section I will show how we can apply the pre-processing methods listed above to a corpus.

In [None]:
example_corpus = ["Agathon to Patron, greeting. Apollonios has applied to me about the village of Takona, that you should appoint him as guard. you would therefore do well to hand (the post?) over to him. good-bye", 
                  "Address Asklepiades to Marres, greeting. Menches having been appointed village scribe of Kerkeosiris on the understanding that he shall cultivate at his own expense ten arouras of the land in the neighborhood of the village which has been reported as unproductive at a rent of fifty artabas", 
                  "I therefore present to you this complaint in order that the accused may be summoned and compelled to refund me the damage; and if he refuses i beg you to forward a copy of the petition to the proper officials, so that I may have it placed on record and the king may incur no loss. Farewell."]
example_corpus

### 2.2.1 Lowercasing text
Here we show how to lowercase the text. This is an important step for a small dataset, because otherwise the algorithm treats a word written with a capital letter and the same word written in lowercase as 2 different words.
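A small sketch of the effect, using a made-up sentence and Python's built-in `Counter` (an illustration only, not part of the workshop pipeline):

```python
from collections import Counter

# Made-up example sentence
words = "The scribe wrote the letter and the receipt".split()

# Without lowercasing, "The" and "the" are counted as different words
cased_counts = Counter(words)

# After lowercasing they are counted as the same word
lowered_counts = Counter(w.lower() for w in words)

print(cased_counts["The"], cased_counts["the"])
print(lowered_counts["the"])
```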

In [None]:
example_corpus = [x.lower() for x in example_corpus]
example_corpus

### 2.2.2 Tokenization
Tokenization is a way of splitting a text into smaller units called tokens. In this example we will split the text into words, but one can also split it into sentences, characters or other types of units. The reason for doing this is that most NLP models and algorithms work on the token level. The tokens are also used to create the vocabulary, which is the set of unique tokens in the corpus.
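To make the idea concrete, here is a deliberately simplified tokenizer based on a regular expression. The function `simple_tokenize` is a hypothetical stand-in written for this illustration; it is not how NLTK tokenizes text.

```python
import re

def simple_tokenize(text):
    """Splits lowercased text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Agathon to Patron, greeting.")

# The vocabulary is the set of unique tokens
vocabulary = sorted(set(tokens))

print(tokens)
print(vocabulary)
```

A library tokenizer handles many more cases, for example contractions and abbreviations, so its output can differ from this sketch.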

In our example we use the word_tokenize function from the Natural Language Toolkit (NLTK) library: https://www.nltk.org/api/nltk.tokenize.html

In [None]:
from nltk.tokenize import word_tokenize
tokenized_example_corpus = [word_tokenize(x) for x in example_corpus]
tokenized_example_corpus

### 2.2.3 Stopword removal
After tokenizing the corpus, the next step is to remove the stopwords. Stopwords are words that are used so frequently in a language that they carry little meaning on their own. In English these are words like:
- i
- me
- my
- myself
- we
- our
- ours
- ourselves
- you
- your
- yours
- yourself
- yourselves
- he


The full list can be seen here: https://gist.github.com/sebleier/554280

Removing stopwords can improve the performance of text classification models in several ways:

- Dimensionality reduction: Stopwords can make up a large proportion of the words in a text, so removing them can reduce the dimensionality of the data and make the model more efficient.

- Improved feature selection: Stopwords do not carry much meaning, so removing them can help the model to focus on the most meaningful words or phrases in the text, which can improve feature selection.

- Better generalization: Removing stopwords can help the model to generalize better, as it will not be distracted by common words that may not be indicative of the class of the text.
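
The dimensionality argument can be illustrated with a tiny sketch. The set `SMALL_STOPWORDS` is a hand-picked, hypothetical subset chosen for this example; it is much shorter than the NLTK list.

```python
# Hand-picked illustrative subset, NOT the full NLTK stopword list
SMALL_STOPWORDS = {"i", "me", "my", "we", "you", "he", "to", "the", "of", "a", "that", "has", "about"}

tokens = "apollonios has applied to me about the village of takona".split()

# Keep only the tokens that are not stopwords
kept = [t for t in tokens if t not in SMALL_STOPWORDS]

# Share of the tokens that was removed
removed_fraction = (len(tokens) - len(kept)) / len(tokens)

print(kept)
print(removed_fraction)
```

In this made-up sentence more than half of the tokens are stopwords, and the remaining tokens are the ones that say something about the content.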

In [None]:
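from nltk.corpus import stopwords

## Building the stopword set once, so the list is not reloaded for every token
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(tokenized_text, stopword_set):
    """Returns the tokens of tokenized_text that are not in stopword_set."""
    clean_tokenized_text = [x for x in tokenized_text if x not in stopword_set]
    return clean_tokenized_text

tokenized_example_corpus_without_stopwords = [remove_stopwords(x, english_stopwords) for x in tokenized_example_corpus]
tokenized_example_corpus_without_stopwords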

### 2.2.4 Vectorization
Vectorization is the conversion of raw text into vectors of numbers that can be used as input to machine learning models. This is a key part of feature extraction. There are several ways of converting the data, but in this workshop we will cover only 2:
- Count
- Tf-idf

#### Count Vectorization
Count vectorization is a technique used in NLP to convert a collection of text documents into a numerical representation, such as a matrix or array. The technique involves tokenizing the text (i.e., breaking it up into individual words or phrases) and counting the occurrences of each token in each document. The resulting matrix or array can then be used as input to a machine learning algorithm for tasks such as text classification or clustering. The columns of the matrix represent the features (i.e., tokens) and the rows represent the samples (i.e., documents). Count vectorization is a simple and effective method for representing text data, but it can be high-dimensional and sparse.
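The idea can be sketched from scratch with plain Python. The three short documents are made up, and this sketch only splits on whitespace, so it is a simplification and not scikit-learn's implementation.

```python
from collections import Counter

# Made-up example documents
docs = ["the scribe wrote the letter", "the farmer paid the rent", "the scribe paid"]
tokenized = [d.split() for d in docs]

# One column per unique token, in alphabetical order
vocabulary = sorted({t for doc in tokenized for t in doc})

# One row per document: how often does each vocabulary token occur in it?
count_matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Most entries are 0 even in this tiny example, which is what is meant by the matrix being sparse.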

For more information about how it is implemented in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

CountVect = CountVectorizer(stop_words='english').fit(example_corpus)

example_corpus_count_vectorized = CountVect.transform(example_corpus)
print(CountVect.get_feature_names_out())
print(example_corpus_count_vectorized.toarray())


#### Tf-idf vectorization
Tf-idf stands for term frequency-inverse document frequency. It is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

How it is calculated:

![alt text](https://mungingdata.files.wordpress.com/2017/11/equation.png "Logo Title Text 1")

![alt text](https://mungingdata.files.wordpress.com/2017/11/tfidf.png "tfidf formula")
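
As a worked example, here is a from-scratch sketch of one common textbook form, assuming tf is the share of the document's tokens that equal the term and idf is the natural logarithm of (number of documents / number of documents containing the term). The documents and helper functions are made up for this illustration.

```python
import math

# Made-up, already tokenized example documents
docs = [
    ["the", "scribe", "wrote", "the", "letter"],
    ["the", "farmer", "paid", "the", "rent"],
    ["the", "scribe", "paid"],
]
n_docs = len(docs)

def tf(term, doc):
    """Term frequency: share of the tokens in doc that are equal to term."""
    return doc.count(term) / len(doc)

def idf(term):
    """Inverse document frequency: log of (number of documents / documents containing term)."""
    document_frequency = sum(1 for d in docs if term in d)
    return math.log(n_docs / document_frequency)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

for term in ["the", "scribe", "letter"]:
    print(term, round(tfidf(term, docs[0]), 4))
```

"the" occurs in every document and gets a weight of 0, while "letter" occurs in only one document and gets the highest weight of the three. scikit-learn's default settings differ in some details (a smoothed idf and length normalization of each document vector), so its numbers will not match this sketch exactly.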

For more information about scikit-learn's implementation:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

## Initializing the tf-idf vectorizer and fitting it to our example corpus. Here we also remove the stopwords.
TfidfVect = TfidfVectorizer(stop_words='english').fit(example_corpus)

## Transforming the dataset into a tfidf vectorized representation
example_corpus_tfidf_vectorized = TfidfVect.transform(example_corpus)
print(TfidfVect.get_feature_names_out())
print(example_corpus_tfidf_vectorized.toarray())



### 2.2.5 Hands-on exercise: Apply NLP techniques to papyrus dataset


# Session 3: Building a text classification model
1. Building a text classification model
    1. Choose what to predict and which variables to use
    2. Split data into training and test
    3. Transform/vectorize data
    4. Train a logistic regression model
    5. Test and evaluate metrics
    6. Deploy model with Gradio
2. Hands-on exercise: Build and train your own classification model (deploy it with Gradio if time allows)
3. Wrap-up

### 3.1 Building a text classification model


#### 3.1.1 Choose what to classify and data to train the model with
As we saw in 2.1.2 we have several different types of data columns:
- translation
- category
- author
- summary
- keywords
- originDate
- provenance

This means that there are many possible data inputs we can use to build a model.
Classification is about predicting which class an item belongs to, based on its features.
In a classification model we need input variables x and a label y that we want to predict.

In this example (since we are doing NLP), I will use translations as the input data. With the translations we could classify author, category, provenance or originDate. Here I will show how to build a model that tries to classify the provenance of a papyrus based on its translation.

In the hands-on session you can try to change what to classify and see how it works :)

#### 3.1.2 Split data into training and test
Splitting the data into a training set and a test set is an important step in the machine learning process, as it helps to ensure that the model is able to generalize well to new, unseen data.

The training set is used to train the model, while the test set is held back as a way to evaluate the model's performance. By using a separate test set, we can get an estimate of how well the model is likely to perform on new, unseen data, rather than just how well it performs on the data it was trained on.

![alt text](https://miro.medium.com/max/1400/1*Nv2NNALuokZEcV6hYEHdGA.png "Logo Title Text 1")

A common split is 60% training set, 20% validation set and 20% test set, but this depends on the size of the dataset. In this workshop we split the data into an 80% training set and a 20% test set. Validation sets are often used to tune the hyperparameters of the model, which is not covered in this workshop.
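What such a split does can be sketched in plain Python. The function `simple_train_test_split` is a hypothetical, simplified stand-in written for this illustration; it is not scikit-learn's implementation and will not produce the same shuffle.

```python
import random

def simple_train_test_split(inputs, labels, test_size=0.2, seed=42):
    """Shuffles the indices and holds back a fraction test_size as the test set."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [inputs[i] for i in train_idx]
    X_test = [inputs[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Made-up inputs and labels
texts = [f"translation {i}" for i in range(10)]
labels = [f"place {i % 2}" for i in range(10)]

X_train, X_test, y_train, y_test = simple_train_test_split(texts, labels)
print(len(X_train), len(X_test))
```

Shuffling before splitting matters when the dataset is sorted, for example by provenance, and the fixed seed makes the split reproducible.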

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(translations, provenance, test_size=0.2, random_state=42)

#### 3.1.3 Transform/vectorize data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
## Choose vectorization method:
vect_type = "tfidf"
if vect_type == "tfidf":
    ## Initializing a tf-idf vectorizer. Here we also specify that the vectorizer should lowercase the text and remove stopwords.
    tfidf_vect = TfidfVectorizer(lowercase=True, analyzer="word", stop_words="english")

    ## Fitting the vectorizer to our training set and transforming the training set to a tf-idf representation. I.e. the vectorizer vocabulary will be based on the training set only.
    X_train_vectorized = tfidf_vect.fit_transform(X_train)

    ## Transforming the test set with the vocabulary and weights learned from the training set
    X_test_vectorized = tfidf_vect.transform(X_test)

    vectorizer = tfidf_vect
else:
    ## Initializing a CountVectorizer. Here we also specify that the vectorizer should lowercase the text and remove stopwords.
    count_vect = CountVectorizer(lowercase=True, analyzer="word", stop_words="english")

    ## Fitting the vectorizer to our training set and transforming the training set to a count representation. I.e. the vectorizer vocabulary will be based on the training set only.
    X_train_vectorized = count_vect.fit_transform(X_train)

    ## Transforming the test set with the vocabulary learned from the training set
    X_test_vectorized = count_vect.transform(X_test)
    
    vectorizer = count_vect
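
Why the vocabulary is built from the training set only can be sketched in plain Python. The documents and the `transform` helper are made up for this illustration and only mimic the general idea of fitting and transforming.

```python
# Made-up training documents and one unseen test document
train_docs = ["grain receipt for the village", "letter to the village scribe"]
test_doc = "receipt from the temple scribe"

# "Fitting": the vocabulary comes from the training documents only
vocabulary = sorted({t for d in train_docs for t in d.split()})

def transform(doc):
    """Counts how often each vocabulary token occurs in doc; unknown tokens are ignored."""
    tokens = doc.split()
    return [tokens.count(term) for term in vocabulary]

test_vector = transform(test_doc)
unseen = [t for t in test_doc.split() if t not in vocabulary]

print(vocabulary)
print(test_vector)
print(unseen)
```

Words that only occur in the test document get no column, just as words in a new, unseen papyrus would not. Fitting on the test set as well would leak information about it into the model and make the evaluation too optimistic.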

#### 3.1.4 Train a logistic regression model
In this workshop we train and predict using a logistic regression model. This type of model is widely used and computationally cheap to train. It performs well on many classification tasks, but it is not state-of-the-art: more complex and modern algorithms, for example deep learning models based on transformer architectures, often perform better.

For a more in-depth (but practical) introduction to logistic regression one can read [here](https://towardsdatascience.com/the-perfect-recipe-for-classification-using-logistic-regression-f8648e267592)
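
As a rough sketch of what a trained binary logistic regression model computes at prediction time: a weighted sum of the features is passed through the sigmoid function to obtain a probability. The weights, bias and feature values below are hypothetical numbers chosen for illustration, not learned from the papyrus data. With more than two classes, as in our provenance task, a multi-class generalization of the same idea is used.

```python
import math

def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights for a vocabulary of 3 tokens
weights = [1.5, -2.0, 0.0]
bias = -0.25

# Feature vector of one document, e.g. its token counts
probability = predict_proba([1, 0, 2], weights, bias)
print(round(probability, 4))
```

Training the model means finding the weights and the bias that make these probabilities fit the labels of the training set as well as possible.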

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=200)
model.fit(X_train_vectorized,y_train)

#### 3.1.5 Test and evaluate metrics
To test how well our model performs, we run the classifier on the test set and compare its predictions with the true labels of the test set. Based on this comparison we can calculate several metrics that describe the performance of the model.
In this test we will use F1, accuracy, precision and recall.

To learn more about precision and recall: https://en.wikipedia.org/wiki/Precision_and_recall

![alt text](https://wikimedia.org/api/rest_v1/media/math/render/svg/f0453f2614cd29f5dd49c2c9a0ef807985128e9e "Logo Title Text 1") 

![alt text](https://wikimedia.org/api/rest_v1/media/math/render/svg/f5c869c51dba6f1df65a6e6630c516de161632d4 "Logo Title Text 1") 
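
To see how these metrics are computed, here is a from-scratch sketch for one class. The true and predicted labels are made up for this illustration.

```python
# Made-up true labels and predictions
y_true = ["Oxyrhynchus", "Tebtunis", "Oxyrhynchus", "Other", "Tebtunis", "Oxyrhynchus"]
y_pred = ["Oxyrhynchus", "Oxyrhynchus", "Oxyrhynchus", "Other", "Tebtunis", "Tebtunis"]

def precision_recall_f1(y_true, y_pred, label):
    """Computes precision, recall and F1 for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Accuracy: share of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

precision, recall, f1 = precision_recall_f1(y_true, y_pred, "Oxyrhynchus")
print(accuracy, precision, recall, f1)
```

For "Oxyrhynchus" there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3.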


In [None]:
## Predicting provenances for our test input.
y_pred = model.predict(X_test_vectorized)
y_pred

In [None]:
from sklearn.metrics import classification_report
# Comparing our predicted output with the true labeled provenances and calculating metrics for each provenance.
print(classification_report(y_test, y_pred))

#### 3.1.6 Deploy model with Gradio

In [None]:
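import gradio as gr
gradio_model = model
gradio_vectorizer = vectorizer

def deploy_my_model(text):
    """Vectorizes the input text and returns the predicted provenance."""
    text_vectorized = gradio_vectorizer.transform([text])
    return str(gradio_model.predict(text_vectorized)[0])

## Building a simple web interface with a text box as input and text as output
demo = gr.Interface(fn=deploy_my_model, inputs=gr.Textbox(lines=3, placeholder="Papyrus translation here"), outputs="text")
## share=True creates a temporary public link to the demo
demo.launch(share=True)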

### 3.4 Hands-on exercise: Reflect on possible applications for these tools in your field.
Today you have received a short introduction to a lot of subjects in machine learning and NLP. As the last part of this workshop I want us all to brainstorm on possible applications of these tools and techniques in your field.

Please write up your ideas and we will share them in the class.


# Thank you for participating in the workshop!
Any feedback or questions are very welcome at auwalsoe@gmail.com

If you enjoyed the workshop and want to save it for later, feel free to give it a star on GitHub!

All the best,
André Walsøe