# Simple template for multi-class text classification.
### By André Walsøe, Data scientist/Head Engineer Oslo University Library 2018
This is a template for multi-class text classification machine learning applications produced for the Research Bazaar, "Hands-on Workshop: Exploring Research Data with Artificial Intelligence and Design Thinking" at UiO, January 11th 2019.


In [None]:
!curl -L -o data.tsv https://raw.githubusercontent.com/auwalsoe/encode_nlp_workshop_2023/main/data/papyrus_data.tsv

Goals:
1. Understand how to perform the pre-processing needed for multi-class text classification.
2. Test the pre-processed data in a pre-made model training framework.
3. If time permits, iterate on the pre-processing methods to improve the model training results.
4. If time permits, experiment with the model parameters to further improve the results.

In [56]:
import pandas as pd
import numpy as np
# Loading the tsv file containing the corpus.
#corpus = pd.read_csv("data/corpus_for_resbas_281218.tsv", sep = '\t')
# The curl command above saves data.tsv in the current working directory (/content on Colab).
corpus = pd.read_csv('data.tsv', delimiter = '\t')

## Exploring: what columns does this tsv contain?

In [42]:
list(corpus)

['Unnamed: 0',
 'translation',
 'category',
 'author',
 'summary',
 'keywords',
 'originDate',
 'provenance',
 'num_words_in_translation']

For word clouds displaying the content, please take a look here: [session3_wordclouds](session3_wordclouds.ipynb)

In [43]:
corpus

Unnamed: 0.1,Unnamed: 0,translation,category,author,summary,keywords,originDate,provenance,num_words_in_translation
0,0,"kollesis obscures much of the text. 6 lines, o...",documentary text,unknown,unknown,papyri,Between 300 and 130 B.C.,unknown,24
1,3,"21 lines, on recto along the fibers; 2 lines (...",letter : conclusion : to maron,unknown,conclusion of a letter to maron about various ...,papyri,Late 2nd or 3rd century A.D.,unknown,24
2,1,3+ cols. (54+ lines) on recto along the fibers...,account : fragment,unknown,unknown,papyri,Late 2nd century B.C.,unknown,12
3,3,"10 lines, on verso across the fibers. on recto...",tax receipt,diodoros,"diodoros and partners, farmers of the crown ta...",papyri,181/182 or 213/214 A.D.,unknown,67
4,3,"49 lines, on verso across the fibers. recto em...",petition from harmiysis and the crown tenants ...,harmiysis,harmiysis and the crown tenants complain to kr...,papyri,Between 105 and 90 B.C.,unknown,240
5,0,"paste obscures part of the text. 1 line, on re...",documentary text,unknown,unknown,papyri,Between 300 and 130 B.C.,unknown,15
6,2,"13 lines, on recto along the fibers. verso emp...",tax account.,unknown,total taxes received in the egyptian months of...,papyri,2nd century A.D.,unknown,62
7,0,"ink is very faint. 2 lines, on recto along the...",documentary text,unknown,unknown,papyri,Between 300 and 130 B.C.,unknown,13
8,2,"10 lines, on recto along the fibers. below: le...",circular letter from horos : copy,horos,horos orders the topogrammateis and village sc...,papyri,29 Oct. or 8 Nov. 114 B.C.,unknown,94
9,1,"2 lines, on recto along the fibers. verso empt...",letter or petition to a strategos.,unknown,fragment with beginning address only.,papyri,1st century A.D.,unknown,19
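
Before choosing which column to use as labels, it can help to check how many documents each class has, since a class with a single document cannot appear in both the training and the test set. The following is a minimal sketch of that check; `toy_corpus` and its rows are made up for illustration and stand in for the real `corpus`.

```python
import pandas as pd

# Hypothetical stand-in for `corpus`; the real data comes from data.tsv.
toy_corpus = pd.DataFrame({
    "translation": ["text a", "text b", "text c", "text d", "text e"],
    "author": ["unknown", "unknown", "horos", "unknown", "diodoros"],
})

# Number of documents per label, most frequent first.
label_counts = toy_corpus["author"].value_counts()
print(label_counts)

# Labels that occur only once cannot be in both the training and the test set.
rare_labels = sorted(label_counts[label_counts < 2].index.tolist())
print(rare_labels)
```

On the real data, the same two calls can be applied to `corpus["author"]` or `corpus["category"]`.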


# Extracting the different columns

In [44]:
translations = corpus["translation"].values


categories = corpus["category"].values
authors = corpus["author"].values
summaries = corpus["summary"].values
keywords = corpus["keywords"].values
date = corpus["originDate"].values
provenance = corpus["provenance"].values


## Pre-processing
The data you choose needs pre-processing. Take a look at the data to determine which
techniques from session 1 (or other techniques you know) need to be applied, and then apply them.
The techniques from session 1 can be reviewed [here](session1_preprocessing_example.ipynb).
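
As a possible starting point, here is a minimal sketch of a cleaning function that uses only the standard library. The function name `preprocess` and the choice of steps (lowercasing, dropping digits and punctuation, collapsing whitespace) are assumptions for illustration, not the method from session 1. Note that the pattern keeps only the letters a-z, so accented or non-Latin characters would be dropped as well.

```python
import re

def preprocess(text):
    """Lowercase, drop digits and punctuation, and collapse whitespace."""
    text = str(text).lower()                # str() also copes with missing values
    text = re.sub(r"[^a-z\s]", " ", text)   # keep only the letters a-z and whitespace
    text = re.sub(r"\s+", " ", text)        # collapse runs of whitespace
    return text.strip()

example = "49 lines, on verso across the fibers. Recto empty."
print(preprocess(example))
```

It could be applied to the whole column with something like `translations = [preprocess(t) for t in translations]`.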



In [66]:
# Implement pre-processing here





X = translations  # What is your input data?
y = authors  # What are your labels?

## Before performing feature extraction, the data needs to be split into a training set and a test set.
The feature extraction should be fitted only to the training data, so that no information from the test set leaks into the features.
To read more about the module: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
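
To illustrate what fitting on the training data only means in practice, the sketch below fits a `CountVectorizer` on two made-up training documents and then transforms one made-up test document. Words that occur only in the test document are not in the vocabulary and are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, split by hand for clarity.
train_docs = ["tax receipt", "letter to maron"]
test_docs = ["petition about tax"]

toy_vectorizer = CountVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = toy_vectorizer.transform(test_docs)        # reuses it, learns nothing new

# "petition" and "about" were never seen during fitting, so only "tax" is counted.
print(sorted(toy_vectorizer.vocabulary_))
print(test_matrix.sum())
```

Both matrices have the same number of columns, which is what the models trained later rely on.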

In [67]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

## Feature Extraction
In this step, feature extraction is performed on the data. To keep things simple, two types of feature extraction have already been implemented: "count" and "tf-idf". Set the variable "feature_extraction_method" to choose which one to use (or try both!).

To read more about count vectorization: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

To read more about term frequency inverse document frequency vectorization:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
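
The difference between the two methods can be seen on a small made-up corpus. In this sketch, the word "tax" occurs in every document and "maron" in only one. The count vectorizer reports raw counts, while tf-idf gives the rarer word a higher weight. The documents and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "tax" occurs in every document, "maron" in one.
toy_docs = ["tax receipt tax", "tax account", "letter to maron about tax"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs).toarray()

# Column positions of the two words in each vocabulary.
tax_col_count = count_vec.vocabulary_["tax"]
tax_col_tfidf = tfidf_vec.vocabulary_["tax"]
maron_col_tfidf = tfidf_vec.vocabulary_["maron"]

print("count of 'tax' in the first document:", counts[0, tax_col_count])
print("tf-idf of 'tax' in the third document:", weights[2, tax_col_tfidf])
print("tf-idf of 'maron' in the third document:", weights[2, maron_col_tfidf])
```

Both words occur once in the third document, yet "maron" ends up with the larger tf-idf weight because it is specific to that document.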


In [68]:
feature_extraction_method = "count"  # Choose "count" or "tf-idf".


In [69]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

if feature_extraction_method == "count":
    vectorizer = CountVectorizer()
    X_train_vectorized = vectorizer.fit_transform(X_train)
    X_test_vectorized = vectorizer.transform(X_test)
else:
    vectorizer = TfidfVectorizer()
    X_train_vectorized = vectorizer.fit_transform(X_train)
    X_test_vectorized = vectorizer.transform(X_test)

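# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available from 1.0).
feature_names = vectorizer.get_feature_names_out()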

## X_vectorized = 
## Y_vectorized = 

In [70]:
## By running the code below, a selection of machine learning algorithms will be trained.

## Training - Logistic Regression

In [71]:
# Creating a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# Fitting the model to the training data, a process also known as training.
logreg.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
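
One alternative way to organise the same steps is to chain the vectorizer and the classifier in a scikit-learn pipeline, so that calling `fit` on raw text fits both on the training data only. The sketch below is not part of the original template: the toy documents, the use of `make_pipeline`, and `max_iter=1000` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data with two classes.
train_docs = ["tax receipt for grain", "tax account of the village",
              "letter to maron", "letter from horos to the scribes"]
train_labels = ["tax", "tax", "letter", "letter"]

# The pipeline vectorizes the text and then trains the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_docs, train_labels)

new_docs = ["receipt for the grain tax", "a letter to the scribes"]
pipeline_predictions = model.predict(new_docs)
print(pipeline_predictions)
```

With a pipeline, `model.predict` accepts raw text directly, so there is no separate vectorized test matrix to keep track of.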

## Prediction and Evaluation - Logistic Regression

In [72]:
predictions = logreg.predict(X_test_vectorized)

In [73]:
# Evaluating the predictions done on the test set.
from sklearn.metrics import classification_report
print(classification_report(y_test.astype('U'), predictions))

                                                                                                                            precision    recall  f1-score   support

                                                                                                      . . . alias mikalos        0.00      0.00      0.00         0
                                                                              @r, son of ns-#nsw. p3-whr, son of imn-htp.        0.00      0.00      0.00         0
                                                                                            @r, son of p3-ti-@r-sm3-t3.wy        0.00      0.00      0.00         1
                                                                                                                  [- - -]        0.00      0.00      0.00         3
                                                                        [- - -] nomissianus & marcus petronius servillius        1.00      1.00      1.00         1
               

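(Output truncated. The remaining lines are the tail of scikit-learn's UndefinedMetricWarning, raised because some labels have no predicted samples or no true samples in the test set.)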
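
Rows with 0.00 scores often come from labels that the model never predicted or that are missing from the test set. As a hedged sketch with made-up labels, `classification_report` accepts `zero_division=0` (scikit-learn 0.22 or newer) to report such undefined scores as 0.0 without a warning, and `output_dict=True` to return the numbers as a dictionary for closer inspection.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: "horos" is never predicted, so its precision is undefined.
y_true = ["unknown", "unknown", "horos", "diodoros"]
y_pred = ["unknown", "unknown", "unknown", "diodoros"]

# zero_division=0 reports undefined scores as 0.0 instead of warning about them.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

print(report["horos"])
print("accuracy:", report["accuracy"])
```

With many rare labels, the per-class rows are mostly noise. The accuracy and the weighted averages in the report give a more stable overall picture.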


## Training - Random Forest Classifier
Read more here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [53]:
from sklearn.ensemble import RandomForestClassifier
# Creating a random forest model

rf = RandomForestClassifier()

# Fitting the model to the training data, a process also known as training.
rf.fit(X_train_vectorized, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

## Prediction and Evaluation - Random Forest Classifier

In [54]:
rf_predictions = rf.predict(X_test_vectorized)

In [55]:
# Evaluating the predictions done on the test set.
from sklearn.metrics import classification_report
print(classification_report(y_test.astype('U'), rf_predictions))

                                                                    precision    recall  f1-score   support

                                                       Hermopolis        0.00      0.00      0.00         1
                                                            Syene        0.00      0.00      0.00         2
                                                 Alexandria, Egypt       0.75      0.25      0.38        12
                                               Antinoopolis, Egypt       0.00      0.00      0.00         2
                         Aphrodito modern Kom Ashkauh Upper Egypt        0.00      0.00      0.00         3
                                             Aphroditopolite, nome       0.00      0.00      0.00         1
                                  Apollonopolis, province of Egypt       0.00      0.00      0.00         5
                                                       Areos kome        0.00      0.00      0.00         1
                           

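(Output truncated. The remaining lines are the tail of scikit-learn's UndefinedMetricWarning, raised because some labels have no predicted samples or no true samples in the test set.)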
