<font face="Arial" size="6" color="white"> **ProfessionAI** |Master AI Development: A.I. Applicata per sviluppatori| </font></br>[[filippo gronchi](mailto:fgronchi&#64;gmail.com) @ babbage25]


# <a tag="title">Final Project - Language Identification Model for Museum Texts</a>

### <font face="Arial" size="5" color="white"><u>Table of contents</u>
</font><a class='anchor' name='toc'></a>
- [Preface - Delivery Notes](#0)
- [1. Purpose](#1)
- [2. Implementation steps](#2)
- [3. Code](#3)
    - [3.1 Dataset loading](#3.1)
    - [3.2 Dataset Preparation and Training/Test split](#3.2)
    - [3.3 Sentence cleaning and n-gram split](#3.3)
    - [3.4 Vectorization](#3.4)
- [4. Modelization](#4)
    - [4.1 Random Forest](#4.1)
    - [4.2 SVM linear kernel](#4.2)
    - [4.3 SVM rbf kernel](#4.3)
    - [4.4 Linear SVM](#4.4)  
    - [4.5 Multi Layer Perceptron](#4.5)
    - [4.6 Models Comparison](#4.6)
- [5. Data Augmentation](#5)
    - [5.1 Translation back and forth](#5.1)
- [6 Model optimization on the augmented dataset](#6)
  - [6.1 Dataset vectorization and Train/Test split](#6.1)
  - [6.2 Random Forest](#6.2)
    - [6.2.1 Default hyperparameters](#6.2.1)
    - [6.2.2 Optimized hyperparameters](#6.2.2)
  - [6.3 SVM linear kernel](#6.3)
    - [6.3.1 Default hyperparameters](#6.3.1)
    - [6.3.2 Optimized hyperparameters](#6.3.2)  
  - [6.4 SVM rbf kernel](#6.4)
    - [6.4.1 Default hyperparameters](#6.4.1)
    - [6.4.2 Optimized hyperparameters](#6.4.2)
- [7 Model selection and final operations](#7)
  - [7.1 Model training on the full dataset](#7.1)
  - [7.2 Full pipeline save (data cleaner + vectorizer + model)](#7.2)
  - [7.3 Full pipeline single test cases](#7.3)
  - [7.4 Streamlit Webapp](#7.4)

## <a name="0">Preface - Delivery Note</a>

The following python libraries have been used throughout this notebook:
  - pandas
  - numpy
  - os
  - time
  - string
  - re
  - ngrams (from nltk)
  - sklearn
  - GoogleTranslator (from deep_translator)

A data augmentation step is included to improve model performance. Since the augmentation procedure is not fully deterministic, a master copy of the augmented dataset has been saved in a shared folder, and the last part of the notebook is based on that master dataset. The code to generate a new dataset is still available and can be executed on demand (i.e. by answering y to the initial question).

[$\uparrow$ top](#toc)

## <a name="1">1. Purpose</a>

This project implements a language detector, 'MuseumLangID', to be used for museum labels. Three languages are covered in the provided dataset: Italian, English and German.

[$\uparrow$ top](#toc)

## <a name="2">2. Implementation Steps</a>

The project has been carried out by evaluating the performance of multiple combinations of vectorization algorithms and Machine Learning models. Data cleaning is performed beforehand: lower casing and removal of punctuation, numbers and extra blanks.

Vectorizers and Models tested:

-> Text Vectorization
  - Bag of Words: 3 grams
  - TfIdf: 3-5 grams

-> Machine Learning model
  - Random Forest
  - SVM linear kernel/Linear SVM
  - Multi Layer Perceptron

The model with the best performance on the same training/test dataset is selected.
The full pipeline is then packaged to be exported for external reuse.

Final testing is performed on new samples, one for each language. A simple Streamlit web app has been published to perform extra tests.



[$\uparrow$ top](#toc)

## <a name="3">3. Code</a>

###<a name="3.1">3.1 Dataset loading and preliminary settings</a>

In [63]:
import pandas as pd
import numpy as np
import time
import os

RANDOM_SEED=23

new_aug_ds = input("Do you want to generate a new augmented dataset? (execution time might be much longer) [y/n]: ")
while new_aug_ds not in ['y','n']:
  new_aug_ds = input("Please enter only [y/n]: ")


Do you want to generate a new augmented dataset? (execution time might be much longer) [y/n]: y


In [64]:
# Load project dataset
file = "https://raw.githubusercontent.com/Profession-AI/progetti-ml/refs/heads/main/Modello%20per%20l'identificazione%20della%20lingua%20dei%20testi%20di%20un%20museo/museo_descrizioni.csv"
df = pd.read_csv(file)
df.head(30)

| Unnamed: 0 | Testo | Codice Lingua |
|---|---|---|
| 0 | Statua in marmo di un imperatore romano del II... | it |
| 1 | Anfora greca con decorazioni a figure nere | it |
| 2 | Dipinto rinascimentale raffigurante la Madonna... | it |
| 3 | Elmo corinzio in bronzo del VI secolo a.C. | it |
| 4 | Manoscritto medievale con miniature dorate | it |
| 5 | Scultura lignea gotica di un santo | it |
| 6 | Spada vichinga con impugnatura decorata | it |
| 7 | Maschera funeraria egizia in oro | it |
| 8 | Tavoletta sumera con incisioni cuneiformi | it |
| 9 | Vaso cinese della dinastia Ming con smalti blu... | it |


In [65]:
print(f'Number of samples: {df.shape[0]}')

Number of samples: 294


In [66]:
print(f'Languages and distribution:')
df['Codice Lingua'].value_counts()

Languages and distribution:


| Codice Lingua | count |
|---|---|
| it | 98 |
| en | 98 |
| de | 98 |


The dataset is composed of:
- 294 samples
- 3 classes: it, en, de
- 98 samples per class: perfectly balanced

###<a name="3.2">3.2 Dataset Preparation and Training/Test split</a>

In [67]:
# Features and Target division
X = df.Testo
y = df['Codice Lingua']

# Encode the target
target_encoding_map = {'it':0, 'en':1, 'de':2}
y_enc = y.map(target_encoding_map)

# Train and Test set preparation (split with stratification to maintain classes balanced after the split)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_enc, test_size=0.2, stratify=y_enc,random_state=RANDOM_SEED)
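
The idea behind a stratified split can be illustrated with a small dependency-free sketch. The helper `stratified_split` below is hypothetical and deliberately simplified: it rounds the test size per class, whereas `train_test_split` fixes the overall test size first, so the exact per-class counts may differ by one from those obtained in the notebook.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx) keeping the same class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # random order within the class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Same class layout as the project dataset: 98 samples per language
labels = ["it"] * 98 + ["en"] * 98 + ["de"] * 98
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=23)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes the same share of its samples to the test part, so the balance of the full dataset is preserved on both sides of the split.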

###<a name="3.3">3.3 Sentence cleaning and n-gram split</a>

Sentence cleaning will consist of:
- lower casing
- punctuation removal
- numbers removal
- multiple blanks removal

Lemmatization and stopwords removal are language dependent: these steps cannot be performed beforehand in a language recognition pipeline.

The complete preprocessing phase will include sentence cleaning and n-gram division.

In [68]:
# Preprocessing pipeline
# Functions definition
import string
import re
from nltk import ngrams

punctuation = set(string.punctuation)

def data_cleaner(sentence):
  """
  To clean sentence by sentence:
    - lower casing
    - remove punctuation
    - remove numbers
    - remove multiple spaces
  """
  # lower casing
  sentence = sentence.lower()
  # removing punctuation
  for c in string.punctuation:
    sentence = sentence.replace(c, ' ')
  # remove numbers
  sentence = re.sub(r"\d+", "", sentence)
  # remove double spaces
  sentence = re.sub(r"\s+", " ", sentence)
  return sentence

def my_ngram_function(sentence, n):
  """
  To create ngram of sentence
  input:
    - sentence: string or collection of string
    - n: integer
  output:
    - Collection of ngrams
  """
  ngram_sentence = ngrams(sentence,n)
  return ["".join(ngr) for ngr in ngram_sentence]

def sentence_processing(sentence, n):
  """
  To preprocess the dataset:
    - Sentence cleaning
    - n-gram (3) split
  """
  series = sentence.apply(lambda x: data_cleaner(x))
  ngrams_sentence = [my_ngram_function(sentence,n) for sentence in series]
  return [" ".join(ngram) for ngram in ngrams_sentence]

In [69]:
# Full Train/Test dataset preprocessing, tri-gram
X_train_gram = X_train.copy()
X_test_gram = X_test.copy()

X_train_gram = sentence_processing(X_train_gram,3)
X_test_gram = sentence_processing(X_test_gram,3)
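
To make the effect of the preprocessing concrete, here is a minimal standalone sketch applied to one made-up label. The names `clean` and `char_ngrams` are illustrative re-statements of the cleaning and character n-gram steps, using plain string slicing instead of `nltk.ngrams`; the sample sentence is an assumption, not taken from the dataset.

```python
import re
import string

def clean(sentence):
    """Lower-case, replace punctuation with spaces, drop digits, collapse whitespace."""
    sentence = sentence.lower()
    for c in string.punctuation:
        sentence = sentence.replace(c, " ")
    sentence = re.sub(r"\d+", "", sentence)
    return re.sub(r"\s+", " ", sentence)

def char_ngrams(text, n):
    """Sliding window of n consecutive characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

cleaned = clean("Anfora greca, 2 anse")
trigrams = char_ngrams(cleaned, 3)
print(cleaned)        # anfora greca anse
print(trigrams[:6])   # ['anf', 'nfo', 'for', 'ora', 'ra ', 'a g']
```

Note that the windows cross word boundaries, so some trigrams contain a space.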

### <a name="3.4">3.4 Vectorization</a>

Two main vectorization approaches are considered for this use case and their performance will be evaluated:

- Bag-of-Words
- TfIdf

**BoW**
- Vectorization function and application to Train/Test dataset

In [70]:
from sklearn.feature_extraction.text import CountVectorizer
def bow_count(dataset, count_vectorizer):
  """
  Text vectorization
  input:
    - dataset: collection of sentences
    - count_vectorizer: (optional) fitted vectorizer; pass None to fit a new one
  output:
    - vectorized dataset, vectorizer
  """
  if count_vectorizer is None:
    count_vectorizer = CountVectorizer()
    X = count_vectorizer.fit_transform(dataset)
  else:
    X = count_vectorizer.transform(dataset)
  return X.toarray(), count_vectorizer

X_train_bow, cvectorizer = bow_count(X_train_gram, None)
X_test_bow, cvectorizer = bow_count(X_test_gram, cvectorizer)

# Train and Test shape
print(f'Train dataset shape: {X_train_bow.shape} - Test dataset shape: {X_test_bow.shape}')

Train dataset shape: (235, 1688) - Test dataset shape: (59, 1688)


In [71]:
# Feature scaling for SVM/MLP
# Using MaxAbsScaler to preserve sparsity
from sklearn.preprocessing import MaxAbsScaler
mas = MaxAbsScaler()
X_train_bow_sc = mas.fit_transform(X_train_bow)
X_test_bow_sc = mas.transform(X_test_bow)
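
As an illustration of what max-abs scaling does, the following dependency-free sketch divides each column by the largest absolute value seen in the training data. The helper names and the toy count matrices are assumptions made for the example; the notebook itself relies on scikit-learn's `MaxAbsScaler`.

```python
def fit_max_abs(rows):
    """Per-column maximum absolute value; all-zero columns get scale 1.0."""
    n_cols = len(rows[0])
    scales = []
    for j in range(n_cols):
        m = max(abs(row[j]) for row in rows)
        scales.append(m if m > 0 else 1.0)
    return scales

def apply_max_abs(rows, scales):
    """Divide every value by the scale of its column."""
    return [[value / s for value, s in zip(row, scales)] for row in rows]

train_counts = [[2, 0, 1], [4, 0, 0], [1, 0, 3]]   # toy n-gram counts
test_counts = [[8, 1, 0]]

scales = fit_max_abs(train_counts)                 # learned on the training set only
train_scaled = apply_max_abs(train_counts, scales)
test_scaled = apply_max_abs(test_counts, scales)
print(scales)
print(test_scaled)
```

Zeros stay zeros, which is why this scaling does not destroy sparsity, and test values can exceed 1 when a count is larger than anything seen during training.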

**TfIdf**
- TfIdf can perform the n-gram split internally. Typical n-gram sizes for language recognition are 3, 4 and 5.
- TfidfVectorizer applies L2 normalization to each row by default, so further feature scaling is not necessary.
- Vectorization of the Train/Test dataset

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfv = TfidfVectorizer(
    analyzer="char",
    ngram_range=(3,5))

# Simplified preprocessing: only sentence cleaning
X_train_clean = [data_cleaner(sen) for sen in X_train]
X_test_clean = [data_cleaner(sen) for sen in X_test]
X_train_tfidf = tfidfv.fit_transform(X_train_clean)
X_test_tfidf = tfidfv.transform(X_test_clean)

# Train and Test shape
print(f'Train dataset shape: {X_train_tfidf.shape} - Test dataset shape: {X_test_tfidf.shape}')

Train dataset shape: (235, 11242) - Test dataset shape: (59, 11242)
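
The sketch below shows, without any external library, how character n-gram tf-idf weights with L2-normalized rows can be computed. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which scikit-learn documents as its default; the function names and the three sample strings are illustrative, and details such as whitespace handling are simplified.

```python
import math
from collections import Counter

def char_ngrams(text, n_min, n_max):
    """All character n-grams with n_min <= n <= n_max."""
    return [text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def tfidf_l2(docs, n_min=3, n_max=5):
    """Smoothed tf-idf over character n-grams, each row scaled to unit L2 norm."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    n_docs = len(docs)
    df = Counter(term for c in counts for term in c)   # document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}
    rows = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["anfora greca", "greek amphora", "griechische amphore"]
rows = tfidf_l2(docs)
norms = [math.sqrt(sum(w * w for w in row.values())) for row in rows]
print(norms)
```

Every row ends up with norm 1, and an n-gram shared by several documents (such as `gre`) receives a lower weight than one that is unique to a single document (such as `anf`).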


[$\uparrow$ top](#toc)

##<a name="4">4 Modelization</a>

The full dataset is very small: only 98 samples per class, with a high number of features due to vectorization. Four different families of Machine Learning algorithms will be tested using typical/default hyperparameters:
- Random Forest
- SVM [kernel: rbf and linear]
- Linear SVM
- MLP

In [73]:
from sklearn.metrics import classification_report, confusion_matrix

###<a name="4.1">4.1 Random Forest</a>

In [74]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=RANDOM_SEED)

- BoW

In [75]:
rf.fit(X_train_bow, y_train)
print(classification_report(y_test, rf.predict(X_test_bow)))

              precision    recall  f1-score   support

           0       0.91      1.00      0.95        20
           1       1.00      0.89      0.94        19
           2       1.00      1.00      1.00        20

    accuracy                           0.97        59
   macro avg       0.97      0.96      0.97        59
weighted avg       0.97      0.97      0.97        59



In [76]:
confusion_matrix(y_test, rf.predict(X_test_bow))

array([[20,  0,  0],
       [ 2, 17,  0],
       [ 0,  0, 20]])

Two misclassifications out of 59: both are English sentences predicted as Italian.
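
Reading the matrix relies on scikit-learn's convention that rows are true classes and columns are predicted classes. As a small sketch under that convention, per-class precision and recall can be derived by hand from the matrix above; the helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    n = len(matrix)
    metrics = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        true_k = sum(matrix[k])                             # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / true_k if true_k else 0.0
        metrics.append((precision, recall))
    return metrics

cm = [[20, 0, 0],
      [2, 17, 0],
      [0, 0, 20]]
metrics = per_class_metrics(cm)
for label, (p, r) in zip(["it", "en", "de"], metrics):
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

Precision is computed along a column (everything predicted as that class) and recall along a row (everything that truly belongs to it).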

- TfIdf

In [77]:
rf.fit(X_train_tfidf, y_train)
print(classification_report(y_test, rf.predict(X_test_tfidf)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        19
           2       1.00      1.00      1.00        20

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59



In [78]:
confusion_matrix(y_test, rf.predict(X_test_tfidf))

array([[20,  0,  0],
       [ 0, 19,  0],
       [ 0,  0, 20]])

Random Forest: TfIdf produces best score with no misclassifications.

###<a name="4.2">4.2 SVM linear kernel</a>

In [79]:
from sklearn.svm import SVC
svc_l = SVC(kernel='linear',random_state=RANDOM_SEED)

- BoW

In [80]:
start = time.time()
svc_l.fit(X_train_bow_sc, y_train)
end = time.time()
print(f'SVM linear kernel BOW: Training time {end-start:.2f} sec')
print(classification_report(y_test, svc_l.predict(X_test_bow_sc)))

SVM linear kernel BOW: Training time 0.05 sec
              precision    recall  f1-score   support

           0       0.95      0.95      0.95        20
           1       0.95      0.95      0.95        19
           2       1.00      1.00      1.00        20

    accuracy                           0.97        59
   macro avg       0.97      0.97      0.97        59
weighted avg       0.97      0.97      0.97        59



In [81]:
confusion_matrix(y_test, svc_l.predict(X_test_bow_sc))

array([[19,  1,  0],
       [ 1, 18,  0],
       [ 0,  0, 20]])

- TfIdF

In [82]:
start = time.time()
svc_l.fit(X_train_tfidf, y_train)
end = time.time()
print(f'SVM linear kernel TfIdf: Training time {end-start:.2f} sec')
print(classification_report(y_test, svc_l.predict(X_test_tfidf)))

SVM linear kernel TfIdf: Training time 0.07 sec
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        19
           2       1.00      1.00      1.00        20

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59



In [83]:
confusion_matrix(y_test, svc_l.predict(X_test_tfidf))

array([[20,  0,  0],
       [ 0, 19,  0],
       [ 0,  0, 20]])

SVM linear kernel: TfIdf produces best score with no misclassifications.

###<a name="4.3">4.3 SVM rbf kernel</a>

In [84]:
from sklearn.svm import SVC
svc_r = SVC(kernel='rbf', random_state=RANDOM_SEED)

- BoW

In [85]:
svc_r.fit(X_train_bow_sc, y_train)
print(classification_report(y_test, svc_r.predict(X_test_bow_sc)))

              precision    recall  f1-score   support

           0       0.95      0.90      0.92        20
           1       0.90      0.95      0.92        19
           2       1.00      1.00      1.00        20

    accuracy                           0.95        59
   macro avg       0.95      0.95      0.95        59
weighted avg       0.95      0.95      0.95        59



In [86]:
confusion_matrix(y_test, svc_r.predict(X_test_bow_sc))

array([[18,  2,  0],
       [ 1, 18,  0],
       [ 0,  0, 20]])

- TfIdF

In [87]:
svc_r.fit(X_train_tfidf, y_train)
print(classification_report(y_test, svc_r.predict(X_test_tfidf)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        19
           2       1.00      1.00      1.00        20

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59



In [88]:
confusion_matrix(y_test, svc_r.predict(X_test_tfidf))

array([[20,  0,  0],
       [ 0, 19,  0],
       [ 0,  0, 20]])

SVM kernel rbf: TfIdf produces best score with no misclassifications.

###<a name="4.4">4.4 Linear SVM</a>

In [89]:
from sklearn.svm import LinearSVC
l_svc = LinearSVC(random_state=RANDOM_SEED)

- BoW

In [90]:
start = time.time()
l_svc.fit(X_train_bow_sc, y_train)
end = time.time()
print(f'Linear SVM BOW: Training time {end-start:.2f} sec')
print(classification_report(y_test, l_svc.predict(X_test_bow_sc)))

Linear SVM BOW: Training time 0.01 sec
              precision    recall  f1-score   support

           0       0.90      0.95      0.93        20
           1       0.94      0.89      0.92        19
           2       1.00      1.00      1.00        20

    accuracy                           0.95        59
   macro avg       0.95      0.95      0.95        59
weighted avg       0.95      0.95      0.95        59



In [91]:
confusion_matrix(y_test, l_svc.predict(X_test_bow_sc))

array([[19,  1,  0],
       [ 2, 17,  0],
       [ 0,  0, 20]])

- TfIdf

In [92]:
start = time.time()
l_svc.fit(X_train_tfidf, y_train)
end = time.time()
print(f'Linear SVM TfIdf: Training time {end-start:.2f} sec')
print(classification_report(y_test, l_svc.predict(X_test_tfidf)))

Linear SVM TfIdf: Training time 0.01 sec
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        19
           2       1.00      1.00      1.00        20

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59



In [93]:
confusion_matrix(y_test, l_svc.predict(X_test_tfidf))

array([[20,  0,  0],
       [ 0, 19,  0],
       [ 0,  0, 20]])

Linear SVM: TfIdf produces best score with no misclassifications.

###<a name="4.5">4.5 Multi Layer Perceptron</a>

In [94]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(activation='logistic', solver='adam', max_iter=1000, hidden_layer_sizes=(100,),tol=0.005, alpha=0.01, early_stopping=True, verbose=True, random_state=RANDOM_SEED)

- BoW

In [95]:
mlp.fit(X_train_bow_sc, y_train)
print(classification_report(y_test, mlp.predict(X_test_bow_sc)))

Iteration 1, loss = 1.13061546
Validation score: 0.333333
Iteration 2, loss = 1.07415054
Validation score: 0.333333
Iteration 3, loss = 1.03162965
Validation score: 0.875000
Iteration 4, loss = 0.99700753
Validation score: 0.791667
Iteration 5, loss = 0.96761837
Validation score: 0.666667
Iteration 6, loss = 0.94162893
Validation score: 0.666667
Iteration 7, loss = 0.91626909
Validation score: 0.666667
Iteration 8, loss = 0.88936041
Validation score: 0.750000
Iteration 9, loss = 0.86208552
Validation score: 0.875000
Iteration 10, loss = 0.83554320
Validation score: 0.875000
Iteration 11, loss = 0.81000317
Validation score: 0.875000
Iteration 12, loss = 0.78539460
Validation score: 0.916667
Iteration 13, loss = 0.76282081
Validation score: 0.958333
Iteration 14, loss = 0.74037697
Validation score: 0.958333
Iteration 15, loss = 0.71871906
Validation score: 0.958333
Iteration 16, loss = 0.69734600
Validation score: 0.916667
Iteration 17, loss = 0.67591217
Validation score: 0.916667
Iterat

In [96]:
confusion_matrix(y_test, mlp.predict(X_test_bow_sc))

array([[19,  1,  0],
       [ 0, 16,  3],
       [ 0,  0, 20]])

- TfIdf

In [97]:
mlp.fit(X_train_tfidf, y_train)
print(classification_report(y_test, mlp.predict(X_test_tfidf)))

Iteration 1, loss = 1.16406430
Validation score: 0.333333
Iteration 2, loss = 1.12199625
Validation score: 0.333333
Iteration 3, loss = 1.09177093
Validation score: 0.333333
Iteration 4, loss = 1.07009447
Validation score: 0.333333
Iteration 5, loss = 1.05073085
Validation score: 0.333333
Iteration 6, loss = 1.03517334
Validation score: 0.333333
Iteration 7, loss = 1.02241609
Validation score: 0.625000
Iteration 8, loss = 1.01075633
Validation score: 0.708333
Iteration 9, loss = 1.00118000
Validation score: 0.458333
Iteration 10, loss = 0.99295481
Validation score: 0.416667
Iteration 11, loss = 0.98408455
Validation score: 0.416667
Iteration 12, loss = 0.97537661
Validation score: 0.583333
Iteration 13, loss = 0.96731482
Validation score: 0.583333
Iteration 14, loss = 0.95983067
Validation score: 0.708333
Iteration 15, loss = 0.95319454
Validation score: 0.708333
Iteration 16, loss = 0.94656180
Validation score: 0.708333
Iteration 17, loss = 0.93961157
Validation score: 0.708333
Iterat

In [98]:
confusion_matrix(y_test, mlp.predict(X_test_tfidf))

array([[20,  0,  0],
       [ 0, 19,  0],
       [ 1, 12,  7]])

MLP does not perform well, and the TfIdf case performs very badly.

###<a name="4.6">4.6 Models Comparison</a>

From these tests, perfect performance (100% accuracy) is achieved using TfIdf in conjunction with Random Forest, SVM (linear and rbf kernel) and Linear SVM. Linear SVM, which uses the Liblinear algorithm, is faster than SVM with linear kernel while showing the same performance, but for a dataset of this size the difference is negligible.

[$\uparrow$ top](#toc)

##<a name="5">5 Data Augmentation</a>

Considering the small dataset, LinearSVM is the model of choice, for both performance and complexity. To improve model reliability, the initial dataset is augmented to generate a larger training set.

###<a name="5.1">5.1 Translation back and forth (if needed to generate the new dataset)</a>

In [99]:
if (new_aug_ds=='y'):
  !pip install deep_translator
  from deep_translator import GoogleTranslator

  def back_translate(text, source_lang='it', target_lang='de'):
    """
    Generate an equivalent sentence by translating it to an intermediate language and back to the original one
    Input
    - text: sentence to be 'augmented'
    - source_lang: language of text
    - target_lang: language to be used as intermediate
    Output
    - augmented text: equal to or slightly different from the original
    """
    try:
      # 1. To intermediate language
      translator_to = GoogleTranslator(source=source_lang, target=target_lang)
      intermediate_text = translator_to.translate(text)

      # 2. Back from intermediate language
      translator_back = GoogleTranslator(source=target_lang, target=source_lang)
      augmented_text = translator_back.translate(intermediate_text)
      return augmented_text
    except Exception:
      # Fall back to the original sentence if the translation service fails
      return text

  X_it = df[df['Codice Lingua']=='it']['Testo']
  X_en = df[df['Codice Lingua']=='en']['Testo']
  X_de = df[df['Codice Lingua']=='de']['Testo']

  print('\nExample of original sentence vs new generated one')
  for i in range(X_it.shape[0]):
    original = X_it.iloc[i]
    augmented = back_translate(original, 'it', 'fr')  # translate once and reuse the result
    if original != augmented:
      print(f'{i} - {original} - {augmented}\n')
      break

  print('Dataset augmentation')
  print("1. it - fr")
  X_it_1 = X_it.apply(lambda x: back_translate(x,'it','fr'))
  print("2. it - de")
  X_it_2 = X_it.apply(lambda x: back_translate(x,'it','de'))
  print("3. en - de")
  X_en_1 = X_en.apply(lambda x: back_translate(x,'en','de'))
  print("4. en - it")
  X_en_2 = X_en.apply(lambda x: back_translate(x,'en','it'))
  print("5. de - fr")
  X_de_1 = X_de.apply(lambda x: back_translate(x,'de','fr'))
  print("6. de - it")
  X_de_2 = X_de.apply(lambda x: back_translate(x,'de','it'))

  df_target_it = 98*['it']
  df_target_en = 98*['en']
  df_target_de = 98*['de']

  df_it = pd.DataFrame({'Testo':X_it.to_list(), 'Codice Lingua': df_target_it})
  df_it_1 = pd.DataFrame({'Testo':X_it_1.to_list(), 'Codice Lingua': df_target_it})
  df_it_2 = pd.DataFrame({'Testo':X_it_2.to_list(), 'Codice Lingua': df_target_it})

  df_en = pd.DataFrame({'Testo':X_en.to_list(), 'Codice Lingua': df_target_en})
  df_en_1 = pd.DataFrame({'Testo':X_en_1.to_list(), 'Codice Lingua': df_target_en})
  df_en_2 = pd.DataFrame({'Testo':X_en_2.to_list(), 'Codice Lingua': df_target_en})

  df_de = pd.DataFrame({'Testo':X_de.to_list(), 'Codice Lingua': df_target_de})
  df_de_1 = pd.DataFrame({'Testo':X_de_1.to_list(), 'Codice Lingua': df_target_de})
  df_de_2 = pd.DataFrame({'Testo':X_de_2.to_list(), 'Codice Lingua': df_target_de})

  df_aug = pd.concat([df_it,df_it_1,df_it_2, df_en, df_en_1, df_en_2, df_de, df_de_1, df_de_2], ignore_index=True)
  print(f'\nFully augmented dataset shape: {df_aug.shape}')
  # Remove duplicated rows
  df_aug_red = df_aug.drop_duplicates()
  print(f'Augmented dataset real shape (no duplicates): {df_aug_red.shape}')

  # Make a copy
  df_clean_aug = df_aug_red.copy()
  print('\nNew language distribution:')
  print(df_clean_aug['Codice Lingua'].value_counts())

else:
  PATH_AUGMENTED_DATA='https://raw.githubusercontent.com/f1li/PAI-project/refs/heads/main/augmented_dataset.csv'
  df_clean_aug = pd.read_csv(PATH_AUGMENTED_DATA)
  print('\nNew language distribution:')
  print(df_clean_aug['Codice Lingua'].value_counts())


Example of original sentence vs new generated one
1 - Anfora greca con decorazioni a figure nere - Anfora greca decorata a figure nere

Dataset augmentation
1. it - fr
2. it - de
3. en - de
4. en - it
5. de - fr
6. de - it

Fully augmented dataset shape: (882, 2)
Augmented dataset real shape (no duplicates): (581, 2)

New language distribution:
Codice Lingua
en    211
it    188
de    182
Name: count, dtype: int64
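
The drop from 882 to 581 rows comes from round trips that return the original sentence unchanged, which are then removed as duplicates. The following sketch reproduces that logic on two sentences; the dictionaries `via_fr` and `via_de` are hypothetical stand-ins for the translation service, and `augment` is an illustrative helper rather than the notebook's code.

```python
def augment(sentences, round_trips):
    """Original sentences plus one back-translated variant per round trip."""
    augmented = list(sentences)
    for round_trip in round_trips:
        augmented.extend(round_trip(s) for s in sentences)
    return augmented

# Hypothetical stand-ins for two translate-and-back round trips:
# a sentence missing from the mapping comes back unchanged.
via_fr = {"Anfora greca con decorazioni a figure nere":
          "Anfora greca decorata a figure nere"}
via_de = {}

sentences = ["Anfora greca con decorazioni a figure nere",
             "Maschera funeraria egizia in oro"]
augmented = augment(sentences, [lambda s: via_fr.get(s, s),
                                lambda s: via_de.get(s, s)])
unique = list(dict.fromkeys(augmented))  # order-preserving de-duplication
print(len(augmented), len(unique))       # 6 3
```

Only round trips that actually rephrase a sentence add new training samples, which also explains why the augmented classes are no longer perfectly balanced.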


[$\uparrow$ top](#toc)

##<a name="6">6 Model optimization on the augmented dataset</a>

###<a name="6.1">6.1 Dataset vectorization and Train/Test split</a>

In [100]:
df_clean_aug['Testo'] = df_clean_aug['Testo'].apply(lambda x: data_cleaner(x))

X_aug_red = df_clean_aug['Testo']
y_aug_red = df_clean_aug['Codice Lingua']

y_aug_red_enc = y_aug_red.map(target_encoding_map)

In [101]:
X_train, X_test, y_train, y_test = train_test_split(X_aug_red, y_aug_red_enc, test_size=0.2, stratify=y_aug_red_enc,random_state=RANDOM_SEED)

X_train_tfidf = tfidfv.fit_transform(X_train)
X_test_tfidf = tfidfv.transform(X_test)

# Train and Test shape
print(f'Train dataset shape: {X_train_tfidf.shape} - Test dataset shape: {X_test_tfidf.shape}')

Train dataset shape: (464, 13667) - Test dataset shape: (117, 13667)


###<a name="6.2">6.2 Random Forest</a>

In [102]:
rf = RandomForestClassifier(random_state=RANDOM_SEED)

####<a name="6.2.1">6.2.1 Default hyperparameters</a>

In [103]:
rf.fit(X_train_tfidf, y_train)
print(classification_report(y_test, rf.predict(X_test_tfidf)))
confusion_matrix(y_test, rf.predict(X_test_tfidf))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       0.95      1.00      0.98        42
           2       1.00      0.95      0.97        37

    accuracy                           0.98       117
   macro avg       0.98      0.98      0.98       117
weighted avg       0.98      0.98      0.98       117



array([[38,  0,  0],
       [ 0, 42,  0],
       [ 0,  2, 35]])

####<a name="6.2.2">6.2.2 Optimized hyperparameters</a>

In [104]:
from sklearn.model_selection import GridSearchCV

grid_search_rf = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth':[None, 5, 20, 30],
    'min_samples_leaf':[1, 5, 15, 20]
}

gs_rf = GridSearchCV(
    estimator=rf,
    param_grid = grid_search_rf,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

gs_rf.fit(X_train_tfidf,y_train)

Fitting 5 folds for each of 64 candidates, totalling 320 fits
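
The number of fits reported above follows directly from the grid: every combination of the three hyperparameters is one candidate, and each candidate is trained once per fold. A minimal sketch of that enumeration, using only the standard library and the same grid values:

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [None, 5, 20, 30],
    "min_samples_leaf": [1, 5, 15, 20],
}

names = list(param_grid)
# Cartesian product of all value lists: one dict per candidate configuration
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

n_folds = 5
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)   # 64 320
```

Because the count grows multiplicatively, adding one more value to each list would raise it from 64 to 125 candidates.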


In [105]:
gs_rf.best_params_

{'max_depth': None, 'min_samples_leaf': 1, 'n_estimators': 50}

In [106]:
rf_b = RandomForestClassifier(max_depth=gs_rf.best_params_['max_depth'], min_samples_leaf=gs_rf.best_params_['min_samples_leaf'], n_estimators=gs_rf.best_params_['n_estimators'], random_state=RANDOM_SEED)
rf_b.fit(X_train_tfidf, y_train)
print(classification_report(y_test, rf_b.predict(X_test_tfidf)))
confusion_matrix(y_test, rf_b.predict(X_test_tfidf))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       0.95      1.00      0.98        42
           2       1.00      0.95      0.97        37

    accuracy                           0.98       117
   macro avg       0.98      0.98      0.98       117
weighted avg       0.98      0.98      0.98       117



array([[38,  0,  0],
       [ 0, 42,  0],
       [ 0,  2, 35]])

###<a name="6.3">6.3 SVM linear kernel</a>

In [107]:
svc_l = SVC(kernel='linear',random_state=RANDOM_SEED)

####<a name="6.3.1">6.3.1 Default hyperparameters</a>




In [108]:
svc_l.fit(X_train_tfidf, y_train)
print(classification_report(y_test, svc_l.predict(X_test_tfidf)))
confusion_matrix(y_test, svc_l.predict(X_test_tfidf))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       0.98      0.98      0.98        42
           2       0.97      0.97      0.97        37

    accuracy                           0.98       117
   macro avg       0.98      0.98      0.98       117
weighted avg       0.98      0.98      0.98       117



array([[38,  0,  0],
       [ 0, 41,  1],
       [ 0,  1, 36]])

####<a name="6.3.2">6.3.2 Optimized hyperparameters</a>

In [109]:
grid_search_svm_l = {
    'C': np.logspace(-2, 4, 10)
}

gs_svm_l = GridSearchCV(
    estimator=svc_l,
    param_grid = grid_search_svm_l,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

gs_svm_l.fit(X_train_tfidf,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [110]:
gs_svm_l.best_params_

{'C': np.float64(1.0)}

In [111]:
svc_l_b = SVC(kernel='linear', C=gs_svm_l.best_params_['C'], random_state=RANDOM_SEED)
svc_l_b.fit(X_train_tfidf, y_train)
print(classification_report(y_test, svc_l_b.predict(X_test_tfidf)))
confusion_matrix(y_test, svc_l_b.predict(X_test_tfidf))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       0.98      0.98      0.98        42
           2       0.97      0.97      0.97        37

    accuracy                           0.98       117
   macro avg       0.98      0.98      0.98       117
weighted avg       0.98      0.98      0.98       117



array([[38,  0,  0],
       [ 0, 41,  1],
       [ 0,  1, 36]])

###<a name="6.4">6.4 SVM rbf kernel</a>

In [112]:
svc_r = SVC(kernel='rbf', random_state=RANDOM_SEED)

####<a name="6.4.1">6.4.1 Default hyperparameters</a>

In [113]:
svc_r.fit(X_train_tfidf, y_train)
print(classification_report(y_test, svc_r.predict(X_test_tfidf)))
confusion_matrix(y_test, svc_r.predict(X_test_tfidf))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       0.98      0.98      0.98        42
           2       0.97      0.97      0.97        37

    accuracy                           0.98       117
   macro avg       0.98      0.98      0.98       117
weighted avg       0.98      0.98      0.98       117



array([[38,  0,  0],
       [ 0, 41,  1],
       [ 0,  1, 36]])

####<a name="6.4.2">6.4.2 Optimized hyperparameters</a>

In [114]:
grid_search_svm_r = {
    'C': np.logspace(-2, 4, 8),
    'gamma': np.logspace(-9, 3, 10)
}

gs_svm_r = GridSearchCV(
    estimator=svc_r,
    param_grid = grid_search_svm_r,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

gs_svm_r.fit(X_train_tfidf,y_train)

Fitting 5 folds for each of 80 candidates, totalling 400 fits


In [115]:
gs_svm_r.best_params_

{'C': np.float64(3.727593720314938), 'gamma': np.float64(0.1)}
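
The unusual-looking best value of `C` is simply one of the grid points. The sketch below rebuilds the two grids in plain Python, assuming `np.logspace` defaults (base 10, endpoint included); the helper `logspace` is an illustrative stand-in for the NumPy function.

```python
def logspace(start, stop, num):
    """num values 10**e with e evenly spaced from start to stop (inclusive)."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

c_grid = logspace(-2, 4, 8)        # candidate values for C
gamma_grid = logspace(-9, 3, 10)   # candidate values for gamma

print(c_grid[3])       # exponent 4/7, about 3.7276
print(gamma_grid[6])   # exponent -1, i.e. 0.1
```

Both selected values lie in the interior of their grids, so the search range did not cut off the optimum at a boundary.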

In [116]:
svc_r_b = SVC(kernel='rbf', C=gs_svm_r.best_params_['C'], gamma=gs_svm_r.best_params_['gamma'], random_state=RANDOM_SEED)
svc_r_b.fit(X_train_tfidf, y_train)
print(classification_report(y_test, svc_r_b.predict(X_test_tfidf)))
confusion_matrix(y_test, svc_r_b.predict(X_test_tfidf))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       0.98      0.98      0.98        42
           2       0.97      0.97      0.97        37

    accuracy                           0.98       117
   macro avg       0.98      0.98      0.98       117
weighted avg       0.98      0.98      0.98       117



array([[38,  0,  0],
       [ 0, 41,  1],
       [ 0,  1, 36]])

##<a name="7">7 Model selection and final operations</a>

Random Forest, SVM linear kernel and SVM rbf kernel show the same performance on this augmented dataset, each with 2 misclassified sentences (EN <-> DE). Standard hyperparameter optimization does not improve overall accuracy. For the requirements of this particular use case all these models could fit. In general, SVM with linear kernel is the model to prefer because it can be faster and scale better to large datasets (possibly using the Liblinear version).

In [117]:
# Enable probability output
svc_l = SVC(kernel='linear', C=gs_svm_l.best_params_['C'], random_state=RANDOM_SEED, probability=True)

###<a name="7.1">7.1 Model training on the full dataset</a>


In [118]:
# New vectorization on the full dataset
X_tfidf = tfidfv.fit_transform(X_aug_red)
# New model training
svc_l.fit(X_tfidf, y_aug_red_enc)

###<a name="7.2">7.2 Full pipeline save (data cleaner + vectorizer + model)</a>

In [119]:
# Extension of data_cleaner to be able to process just one sentence in the full pipeline
def data_cleaner_list(sentence):
  """
  To clean sentence by sentence as iterable:
    - lower casing
    - remove punctuation
    - remove numbers
    - remove multiple spaces
  """
  # In case of single sentence, transform it in a list/iterable
  if isinstance(sentence, str):
        sentence = [sentence]
  out_sen = []
  for sen in sentence:
    # lower casing
    sen = sen.lower()
    # removing punctuation
    for c in string.punctuation:
      sen = sen.replace(c, ' ')
    # remove numbers
    sen = re.sub(r"\d+", "", sen)
    # remove double spaces
    sen = re.sub(r"\s+", " ", sen)
    out_sen.append(sen)
  return out_sen


In [120]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Full pipeline
pipeline = Pipeline([
    ('cleaner', FunctionTransformer(data_cleaner_list)), # custom function wrapped to comply with the scikit-learn transformer interface
    ('tfidf', tfidfv),
    ('svm', svc_l)
])

# Model saving for reuse
import joblib
joblib.dump(pipeline, 'museum_language_detector_pipeline.pkl')

['museum_language_detector_pipeline.pkl']

###<a name="7.3">7.3 Full pipeline single test cases</a>

In [121]:
def lang_det_result(text_in):
  """
  To print detected language and related probabilities
  -input: text_in
  """
  language_decoder = {0: 'italian', 1: 'english', 2: 'german'}
  print(f'Text: {text_in}')
  print(f'Language detected: {language_decoder[pipeline.predict(text_in)[0]]}\n')

  print(f'Single probabilities')
  for i, prob in enumerate(pipeline.predict_proba(text_in)[0]):
    print(f'{language_decoder[i]}: {prob:.5f}')


In [122]:
# italian
in_it = "Statua lignea secolo V con San Giuseppe mentre lavora!"
lang_det_result(in_it)

Text: Statua lignea secolo V con San Giuseppe mentre lavora!
Language detected: italian

Single probabilities
italian: 0.97254
english: 0.02059
german: 0.00687


In [123]:
# english
in_en="Gold sword from XV century found in central Europe"
lang_det_result(in_en)

Text: Gold sword from XV century found in central Europe
Language detected: english

Single probabilities
italian: 0.00819
english: 0.98417
german: 0.00764


In [124]:
# german
in_de="Kopf einer Göttin, Marmor, 2. Jahrhundert n. Chr., Rom"
lang_det_result(in_de)

Text: Kopf einer Göttin, Marmor, 2. Jahrhundert n. Chr., Rom
Language detected: german

Single probabilities
italian: 0.02365
english: 0.05433
german: 0.92202


###<a name="7.4">7.4 Streamlit Webapp</a>

The full pipeline designed here can be tested using the dedicated Streamlit webapp [here](https://museumlabeler.streamlit.app/).

[$\uparrow$ top](#toc)