<a href="https://colab.research.google.com/github/elliemci/chatbots/blob/main/language_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying the language of user messages with BART and XLM-RoBERTa

BART is a denoising autoencoder for pretraining sequence-to-sequence models: it is trained by corrupting the input text with noise, for example by masking some of the tokens, and learning to reconstruct the original text. It uses a Transformer-based neural machine translation architecture with a bidirectional encoder, like BERT, and a left-to-right decoder, like GPT; the encoder's attention mask is fully visible, as in BERT, and the decoder's attention mask is causal, as in GPT-2.
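
The difference between the two attention masks can be illustrated with a minimal sketch in plain Python. The helper names `full_mask` and `causal_mask` are hypothetical and chosen only for this illustration; they are not part of BART or of the `transformers` library. A value of 1 means "position i may attend to position j".

```python
def full_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # print both masks for a toy sequence of 4 tokens
    for name, mask in (("encoder (fully visible)", full_mask(4)),
                       ("decoder (causal)", causal_mask(4))):
        print(name)
        for row in mask:
            print(" ".join(str(value) for value in row))
```

The causal mask is lower-triangular, which is what makes the decoder generate left to right.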

The xlm-roberta-base-language-detection model is a fine-tuned version of XLM-RoBERTa, a multilingual version of RoBERTa (Robustly Optimized BERT Pretraining Approach), which improves on Bidirectional Encoder Representations from Transformers (BERT).

In [1]:
!pip install transformers



In [2]:
from transformers import pipeline

In [9]:
# messages whose language is to be classified
message1 = "Se você insiste em classificar. Meu comportamento de anti-musical."
message2 = "Quand il me prend dans ses bras."
message3 = "Al mal tiempo, buena cara."

In [10]:
# with no model specified, the zero-shot-classification pipeline defaults to
# facebook/bart-large-mnli, a BART model
langclass_pipeline_bart = pipeline("zero-shot-classification")

possible_languages = ["french", "spanish", "portuguese"]

print("Language detection with BART:\n")
for message in (message1, message2, message3):
    # run the pipeline once per message; labels and scores are returned
    # sorted by descending score, so index 0 is the most likely label
    result = langclass_pipeline_bart(message, possible_languages)
    message_language = result["labels"][0]
    probability = round(result["scores"][0], 2)
    print(f"The message \"{message}\" is in {message_language} with probability of {probability}")

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Language detection with BART:

The message "Se você insiste em classificar. Meu comportamento de anti-musical." is in portuguese with probability of 0.43
The message "Quand il me prend dans ses bras." is in french with probability of 0.87
The message "Al mal tiempo, buena cara." is in spanish with probability of 0.51

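The scores above are relative to the list of candidate labels. As far as I understand the zero-shot pipeline, each label is turned into a hypothesis sentence (the default template in `transformers` is, to my knowledge, `"This example is {}."`), the NLI model scores how strongly the message entails each hypothesis, and in the single-label setting a softmax over those entailment logits gives the final scores. The sketch below reproduces only that last normalization step in plain Python. The helper names and the logit values are hypothetical and are not model output.

```python
import math

# assumed default hypothesis template of the zero-shot pipeline
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_hypotheses(labels, template=HYPOTHESIS_TEMPLATE):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax over entailment logits, returned sorted by descending score."""
    peak = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in entailment_logits]
    total = sum(exps)
    ranked = sorted(zip(labels, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    return {"labels": [label for label, _ in ranked],
            "scores": [score for _, score in ranked]}


if __name__ == "__main__":
    labels = ["french", "spanish", "portuguese"]
    made_up_logits = [0.2, 0.9, 1.4]  # hypothetical values for illustration
    print(build_hypotheses(labels))
    print(rank_labels(labels, made_up_logits))
```

Because the scores sum to 1 across the candidates, adding or removing a label from `possible_languages` changes every score, not only the one for that label.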

In [11]:
# instantiate a text-classification pipeline with xlm-roberta-base-language-detection,
# a version of xlm-roberta-base fine-tuned on the Language Identification dataset
langclass_pipeline = pipeline("text-classification", model="papluca/xlm-roberta-base-language-detection")

# the model detects 20 languages; map their ISO 639-1 codes to language names
languages = {
    "ar": "arabic",
    "bg": "bulgarian",
    "de": "german",
    "el": "modern greek",
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "hi": "hindi",
    "it": "italian",
    "ja": "japanese",
    "nl": "dutch",
    "pl": "polish",
    "pt": "portuguese",
    "ru": "russian",
    "sw": "swahili",
    "th": "thai",
    "tr": "turkish",
    "ur": "urdu",
    "vi": "vietnamese",
    "zh": "chinese",
}

print("Language detection with XLM-RoBERTa:\n")
for message in (message1, message2, message3):
    # run the pipeline once per message and reuse the top prediction
    prediction = langclass_pipeline(message)[0]
    message_language = languages[prediction["label"]]
    probability = round(prediction["score"], 2)
    print(f"The message \"{message}\" is in {message_language} with probability of {probability}")

Language detection with XLM-RoBERTa:

The message "Se você insiste em classificar. Meu comportamento de anti-musical." is in portuguese with probability of 0.99
The message "Quand il me prend dans ses bras." is in french with probability of 0.71
The message "Al mal tiempo, buena cara." is in spanish with probability of 0.99

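In a chatbot it may be useful to treat low-confidence predictions as "unknown" rather than always trusting the top label. The sketch below shows one possible way to do that on output shaped like the text-classification results above (a list of dicts with `label` and `score`). The function name `describe_prediction` and the `min_score` threshold of 0.5 are assumptions made for this illustration, and the input is a hand-written stub, not model output.

```python
# subset of the language-code table used in this notebook
LANGUAGE_NAMES = {"pt": "portuguese", "fr": "french", "es": "spanish"}


def describe_prediction(predictions, names=LANGUAGE_NAMES, min_score=0.5):
    """Return (language_name, rounded_score) for the top prediction.

    The name is None when the top score is below min_score; an unknown
    language code is returned unchanged instead of raising KeyError.
    """
    top = max(predictions, key=lambda p: p["score"])
    score = round(top["score"], 2)
    if top["score"] < min_score:
        return None, score
    return names.get(top["label"], top["label"]), score


if __name__ == "__main__":
    stub = [{"label": "pt", "score": 0.994}]  # hand-written stand-in for pipeline output
    print(describe_prediction(stub))
```

With the real pipeline, the call would look like `describe_prediction(langclass_pipeline(message1), names=languages)`.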