##Language Detection Task



**The goal of this notebook is to implement a method that identifies the language a document is written in.**

**We fine-tune the pre-trained [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) model on the [Language Identification dataset](https://huggingface.co/datasets/papluca/language-identification), a corpus of texts in 20 languages in which each text is labeled with the language it is written in.**

**Notebook adapted from the [Hugging Face text classification guide](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb).**



##Setup

**Mount Google Drive**


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Folder_name = 'NLP_LI'  # change this to the Drive folder that contains the notebook
assert Folder_name, "[1] Enter the folder name"

import sys
# Use absolute paths (leading slash) so they do not depend on the current directory
sys.path.append('/content/drive/MyDrive/{}'.format(Folder_name))
%cd /content/drive/MyDrive/$Folder_name/


Mounted at /content/drive
/content/drive/MyDrive/NLP_LI


**Check GPU**

In [1]:
import torch

if torch.cuda.is_available():    
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

**Install**

In [None]:
! pip install datasets transformers
! apt install git-lfs

**Imports** 

In [3]:
import numpy as np
import pandas as pd
import random


**Push on hub**

This saves the model to your Hugging Face account (sign up [here](https://huggingface.co/join)). If you don't want to do so, set `push_hub = False`.

In [4]:
push_hub = True
if push_hub:
    from huggingface_hub import notebook_login
    notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


##Dataset

**Loading [Language Identification](https://huggingface.co/datasets/papluca/language-identification) Dataset**

Download the train, valid, and test data files from [here](https://huggingface.co/datasets/papluca/language-identification/tree/main) and place them in a `data` folder in the same directory as this notebook.
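Before loading, it can help to confirm that the three CSV files are where the loader expects them. The following is a minimal sketch and not part of the original notebook: the helper name `missing_data_files` is hypothetical, and it assumes the files are named `train.csv`, `valid.csv`, and `test.csv` inside `data/`, matching the paths passed to `load_dataset` in the next cell.

```python
from pathlib import Path

# File names assumed to match the paths used when loading the dataset
EXPECTED_FILES = ["train.csv", "valid.csv", "test.csv"]

def missing_data_files(data_dir="data"):
    """Return the expected CSV files that are not present in data_dir."""
    folder = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]

missing = missing_data_files("data")
if missing:
    print("Missing files in data/:", ", ".join(missing))
else:
    print("All data files found.")
```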

In [5]:
from datasets import load_dataset, load_metric
dataset = load_dataset('csv', data_files={'train': 'data/train.csv', 'valid': 'data/valid.csv', 'test': 'data/test.csv'})

Using custom data configuration default-6067271bfadd3684


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-6067271bfadd3684/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-6067271bfadd3684/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



**A quick look at the dataset**

In [6]:
import datasets
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct row indices without replacement
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    # Show class labels by name rather than by integer id
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [7]:
show_random_elements(dataset["train"])

Unnamed: 0,labels,text
0,tr,Yüksek sesle onun avantajını takip etti : ve ben neden bu kadar güveniyorum ?
1,ur,یہی وجہ ہے کہ میں نے کسی سے پہلے کسی شخص کے ساتھ بات شروع کر دی تھی ۔
2,es,"LO COMPRE PARA CORTAR UNOS ANGULOS DE ALUMINIO Y DEJA MUCHISIMA REBABA, HAY QUE REPASAR CON LA RADIAL PARA ELIMINAR ESTA REBABA. NO HACE LA FUNCION PARA LA QUE LO VENDEN"
3,ur,مجھے لگتا ہے کہ سات ستارے ہیں ۔
4,hi,सिविल कानूनी सहायता के लिए संसाधन में वृद ् धि होती है
5,fr,"Bonne surprise, je trouve ce tome 2 plus abouti que le premier, les personnages ont des reactions plus cohérente et moins exagérées que dans le premier ce qui parfois etait un peu agaçant ! Voila, je conseille cette lecture, vous ne perdrez pas votre temps ! Le seul défaut majeur est qu on en voudrait plus !!!!"
6,ru,"Несмотря на все совершенно убедительные аргументы , которые были предложены в предыдущих пунктах , было бы глупо и нечестно настаивать на том , чтобы это было"
7,th,แต่ ที่ไหน สัก แห่ง ใน ทาง ที่ เขา อาจจะ ดูดซึม บทเรียน ของ เร แกน ว่า ในขณะที่ ชาวอเมริกัน ชอบ รวบรวม ข้อเท็จจริง ข้อเท็จจริง ที่ มีอำนาจ ต้อง ยอมความ คำถาม ที่ สำคัญ มัน สูง เกิน จริง
8,it,La Russia avverte che non può firmare il Trattato delle Nazioni Unite sulle armi
9,ru,Предложенный кэсич пакет корпоративных пособий составляет жалкие 11. долларов .


**Encode the column "labels"**

In [8]:
dataset = dataset.class_encode_column("labels")
print(dataset["train"].features)




{'labels': ClassLabel(num_classes=20, names=['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'it', 'ja', 'nl', 'pl', 'pt', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'], id=None), 'text': Value(dtype='string', id=None)}
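`class_encode_column` replaces each language code with an integer id. As a minimal sketch (plain Python, not the `datasets` implementation), the mapping printed above behaves like a lookup over the alphabetically ordered `names` list. The helper names `encode` and `decode` are illustrative only.

```python
# Language codes copied from the ClassLabel printed above
names = ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'it', 'ja',
         'nl', 'pl', 'pt', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh']

# A class id is the position of the code in the names list
name_to_id = {name: idx for idx, name in enumerate(names)}

def encode(label):
    """Map a language code to its integer class id."""
    return name_to_id[label]

def decode(class_id):
    """Map an integer class id back to its language code."""
    return names[class_id]

print(encode('en'), decode(13))
```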


**Preprocessing Dataset**

In [9]:
from transformers import AutoTokenizer

# Use the pre-trained xlm-roberta-base checkpoint
model_checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

# Longest sequence (in tokens) passed to the model; longer texts are truncated
max_input_length = 512

def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=max_input_length)

encoded_dataset = dataset.map(preprocess_function, batched=True)
# Keep only the columns the model consumes and return them as PyTorch tensors
encoded_dataset = encoded_dataset.remove_columns(['text'])
encoded_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])


In [11]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 70000
    })
    valid: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 10000
    })
})
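`preprocess_function` truncates but does not pad, so the encoded examples have different lengths and need padding when batches are formed (my understanding is that `Trainer` takes care of this through the tokenizer it is given). The sketch below illustrates truncation and batch-level padding in plain Python. It is a simplification: `pad_batch` is a hypothetical helper, the token ids are made up, the slice-based truncation ignores special tokens, and the `pad_id=1` default is an assumption about the XLM-R padding token id rather than a value taken from this notebook.

```python
def pad_batch(sequences, max_length=512, pad_id=1):
    """Truncate each token-id list to max_length, then pad to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids: one short and one long sequence
batch = pad_batch([[0, 581, 2], [0, 581, 9, 12, 2]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to 512 tokens everywhere, keeps short texts cheap to process.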

##Loading Metrics



In [12]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair; the predicted class is the arg-max logit
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
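
To make the metric concrete: `compute_metrics` takes the highest-scoring class per row and reports the fraction of rows that match the reference labels. A minimal plain-Python sketch of that computation follows. The name `argmax_accuracy` and the toy logits are illustrative and are not part of the notebook.

```python
def argmax_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy example: three rows, two classes
toy_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [1, 0, 0]
print(argmax_accuracy(toy_logits, toy_labels))
```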



##Fine-tuning the model



In [13]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

labels = encoded_dataset['train'].features["labels"].names
num_labels = len(labels)

label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label


model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels, label2id=label2id, id2label=id2label)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.bias', 'roberta.pooler.dense.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense

In [14]:
print(id2label)
print(label2id)

{'0': 'ar', '1': 'bg', '2': 'de', '3': 'el', '4': 'en', '5': 'es', '6': 'fr', '7': 'hi', '8': 'it', '9': 'ja', '10': 'nl', '11': 'pl', '12': 'pt', '13': 'ru', '14': 'sw', '15': 'th', '16': 'tr', '17': 'ur', '18': 'vi', '19': 'zh'}
{'ar': '0', 'bg': '1', 'de': '2', 'el': '3', 'en': '4', 'es': '5', 'fr': '6', 'hi': '7', 'it': '8', 'ja': '9', 'nl': '10', 'pl': '11', 'pt': '12', 'ru': '13', 'sw': '14', 'th': '15', 'tr': '16', 'ur': '17', 'vi': '18', 'zh': '19'}


In [17]:
model_name = model_checkpoint.split("/")[-1]
train_batch_size = 2
eval_batch_size = 4

task = 'language-identification'
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=push_hub,
)



PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [18]:
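trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    # Per-epoch evaluation (and best-model selection) runs on the test split here;
    # the "valid" split is available if the test set should stay held out.
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)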


Cloning https://huggingface.co/dinalzein/xlm-roberta-base-finetuned-language-identification into local empty directory.


In [19]:
trainer.train()

***** Running training *****
  Num examples = 70000
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 35000


Epoch,Training Loss,Validation Loss,Accuracy
1,0.0169,0.034313,0.995


***** Running Evaluation *****
  Num examples = 10000
  Batch size = 4
Saving model checkpoint to xlm-roberta-base-finetuned-language-identification/checkpoint-35000
Configuration saved in xlm-roberta-base-finetuned-language-identification/checkpoint-35000/config.json
Model weights saved in xlm-roberta-base-finetuned-language-identification/checkpoint-35000/pytorch_model.bin
tokenizer config file saved in xlm-roberta-base-finetuned-language-identification/checkpoint-35000/tokenizer_config.json
Special tokens file saved in xlm-roberta-base-finetuned-language-identification/checkpoint-35000/special_tokens_map.json
tokenizer config file saved in xlm-roberta-base-finetuned-language-identification/tokenizer_config.json
Special tokens file saved in xlm-roberta-base-finetuned-language-identification/special_tokens_map.json
Adding files tracked by Git LFS: ['tokenizer.json']. This may take a bit of time if the files are large.


Training completed. Do not forget to share your model on huggingf

TrainOutput(global_step=35000, training_loss=0.0685649524637631, metrics={'train_runtime': 5588.5248, 'train_samples_per_second': 12.526, 'train_steps_per_second': 6.263, 'total_flos': 1728803877664512.0, 'train_loss': 0.0685649524637631, 'epoch': 1.0})
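
The step count and throughput in the log follow from the dataset size and the batch size. The sketch below reproduces that arithmetic approximately. The function name is hypothetical, and the single-device, no-accumulation defaults are read off the log above rather than from the `Trainer` internals.

```python
import math

def optimization_steps(num_examples, per_device_batch_size, num_epochs=1,
                       num_devices=1, grad_accum_steps=1):
    """Approximate number of optimizer updates for a training run."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch) * num_epochs

steps = optimization_steps(70000, 2)
print(steps)  # compare with "Total optimization steps" in the log

train_runtime = 5588.5248  # seconds, from TrainOutput
print(round(70000 / train_runtime, 3))  # samples per second
print(round(steps / train_runtime, 3))  # steps per second
```

With a batch size of 2, one epoch over 70,000 examples needs 35,000 updates, which is why a single epoch takes roughly an hour and a half in this run.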

In [20]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 10000
  Batch size = 4


{'epoch': 1.0,
 'eval_accuracy': 0.995,
 'eval_loss': 0.03431255370378494,
 'eval_runtime': 53.3966,
 'eval_samples_per_second': 187.278,
 'eval_steps_per_second': 46.819}

In [21]:
trainer.push_to_hub()

Saving model checkpoint to xlm-roberta-base-finetuned-language-identification
Configuration saved in xlm-roberta-base-finetuned-language-identification/config.json
Model weights saved in xlm-roberta-base-finetuned-language-identification/pytorch_model.bin
tokenizer config file saved in xlm-roberta-base-finetuned-language-identification/tokenizer_config.json
Special tokens file saved in xlm-roberta-base-finetuned-language-identification/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/dinalzein/xlm-roberta-base-finetuned-language-identification
   31c098c..2986e9f  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Text Classification', 'type': 'text-classification'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.995}]}
remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/dinalzein/xlm-roberta-base-finetuned-language-identification
   2986e9f..25cf53d  main -> main



'https://huggingface.co/dinalzein/xlm-roberta-base-finetuned-language-identification/commit/2986e9f8acca307005a3013e57257fcfa53eff23'

In [54]:
trainer.save_model()

Saving model checkpoint to xlm-roberta-base-finetuned-language-identification
Configuration saved in xlm-roberta-base-finetuned-language-identification/config.json
Model weights saved in xlm-roberta-base-finetuned-language-identification/pytorch_model.bin
tokenizer config file saved in xlm-roberta-base-finetuned-language-identification/tokenizer_config.json
Special tokens file saved in xlm-roberta-base-finetuned-language-identification/special_tokens_map.json

remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/dinalzein/xlm-roberta-base-finetuned-language-identification
   25cf53d..0dbf477  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Text Classification', 'type': 'text-classification'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.995}]}


**Evaluate our model on external text**

In [47]:
languages_map = {
    'ar': 'Arabic',
    'bg': 'Bulgarian',
    'de': 'German',
    'el': 'Modern Greek',
    'en': 'English',
    'es': 'Spanish',
    'fr': 'French',
    'hi': 'Hindi',
    'it': 'Italian',
    'ja': 'Japanese',
    'nl': 'Dutch',
    'pl': 'Polish',
    'ru': 'Russian',
    'sw': 'Swahili',
    'th': 'Thai',
    'tr': 'Turkish',
    'ur': 'Urdu',
    'vi': 'Vietnamese',
    'zh': 'Chinese',
    'pt': 'Portuguese'}
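
Because predictions are translated to display names through this dictionary, every label the model can predict needs an entry. A small sanity check is sketched here; the function name `unmapped_codes` and the demo data are illustrative, not part of the notebook.

```python
def unmapped_codes(codes, names_by_code):
    """Return the language codes that have no display name."""
    return sorted(set(codes) - set(names_by_code))

# Tiny demo with a deliberately incomplete mapping
demo_codes = ['ar', 'bg', 'de']
demo_map = {'ar': 'Arabic', 'de': 'German'}
print(unmapped_codes(demo_codes, demo_map))
```

In the notebook, `unmapped_codes(labels, languages_map)` should return an empty list.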

In [46]:
class Dataset(torch.utils.data.Dataset):
    """Wraps tokenizer output (and optional labels) so that trainer.predict can consume it."""

    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Compare against None explicitly so that an empty label list is not silently skipped
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

def identify_language(txt):
    # The tokenizer and the Trainer work on batches, so wrap the single text in a list
    txt = [txt]
    tokenized_txt = tokenizer(txt, truncation=True, max_length=max_input_length)
    txt_dataset = Dataset(tokenized_txt)
    raw_pred, _, _ = trainer.predict(txt_dataset)
    # Pick the highest-scoring class and map its id to a language name
    y_pred = np.argmax(raw_pred, axis=1)
    return languages_map[id2label[str(y_pred[0])]]
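
`identify_language` returns only the top label. If a confidence score is also wanted, the raw logits can be passed through a softmax. The following is a minimal sketch in plain Python with made-up logits; `top_k_languages` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_languages(logits, id_to_name, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id_to_name[i], probs[i]) for i in ranked[:k]]

# Made-up logits for a three-class toy problem
toy_names = {0: 'Arabic', 1: 'German', 2: 'English'}
for name, prob in top_k_languages([0.5, 1.0, 4.0], toy_names, k=2):
    print(f"{name}: {prob:.3f}")
```

In the notebook, `raw_pred[0].tolist()` could serve as the logits, with class ids mapped through `id2label` (whose keys are strings here) and then `languages_map`.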

In [62]:
print(identify_language("That is life"))


***** Running Prediction *****
  Num examples = 1
  Batch size = 4


English


In [63]:
print(identify_language("C'est La Vie"))


***** Running Prediction *****
  Num examples = 1
  Batch size = 4


French


In [50]:
print(identify_language("So ist das Leben"))

***** Running Prediction *****
  Num examples = 1
  Batch size = 4


German


In [64]:
print(identify_language("هذه هي الحياة"))

***** Running Prediction *****
  Num examples = 1
  Batch size = 4


Arabic
