##Language Detection Task



**The goal of this notebook is to implement a method to identify the language a document is written in.**

**We fine-tune the pre-trained `xlm-roberta-base` model on the [Language Identification dataset](https://huggingface.co/datasets/papluca/language-identification), a corpus of texts in 20 languages in which each text is labeled with the language it is written in.**

**Notebook adapted from the [Hugging Face text classification guide](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb)**



##Setup

**Mount Google Drive**


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Folder_name = 'NLP_LI'  # change this to the Drive folder that contains the notebook
assert Folder_name is not None, "[1] Enter the folder name"

import sys
# Make modules in the notebook folder importable (absolute path, so the leading slash is required)
sys.path.append('/content/drive/MyDrive/{}'.format(Folder_name))
%cd /content/drive/MyDrive/$Folder_name/


Mounted at /content/drive
/content/drive/MyDrive/NLP_LI


**Check GPU**

In [None]:
import torch

if torch.cuda.is_available():    
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

**Install**

In [None]:
! pip install datasets transformers
! apt install git-lfs

**Imports** 

In [None]:
import numpy as np
import pandas as pd
import random


##Dataset

**Loading [Language Identification](https://huggingface.co/datasets/papluca/language-identification) Dataset**

Download the train, valid, and test data files from [here](https://huggingface.co/datasets/papluca/language-identification/tree/main) and place them in a `data` folder in the same directory as this notebook.
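
Before loading, it can help to confirm that the three CSV files are where the loading cell expects them. This is a minimal sketch, assuming the file names `train.csv`, `valid.csv`, and `test.csv` used in the loading cell; the helper name `missing_data_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def missing_data_files(data_dir="data", names=("train.csv", "valid.csv", "test.csv")):
    """Return the expected data files that are not present in data_dir."""
    folder = Path(data_dir)
    return [name for name in names if not (folder / name).is_file()]

missing = missing_data_files()
if missing:
    print("Missing files in data/:", ", ".join(missing))
else:
    print("All data files found.")
```

If anything is reported as missing, re-check the folder name set in the Setup cell before moving on.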

In [None]:
from datasets import load_dataset, load_metric
dataset = load_dataset('csv', data_files={'train': 'data/train.csv', 'valid': 'data/valid.csv', 'test': 'data/test.csv'})

Using custom data configuration default-15f77cb63391a7e2


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-15f77cb63391a7e2/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-15f77cb63391a7e2/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



**Take a look at a few random examples from the dataset**

In [None]:
import datasets
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"])

| | labels | text |
|---|---|---|
| 0 | pt | A polícia aumenta o número de mortos em acidente de autocarro para 8 |
| 1 | tr | Ve bunun için kayıt olmasını istediler ve bana bunu yapmak ister misin diye sordu ve ben de emin ol dedim sadece sen bilirsin işte yardım etmeyi bilirsin ne kadar çok katılımcı varsa o kadar büyük |
| 2 | pl | Nerwy uspokojone. Na razie: Markets Jump on Relief over Spanish Bank Bailout |
| 3 | es | No es tan impermeable, ya que la parte superior me la he encontrado mojada cuando le ha caído agua de lluvia. Por el resto está bastante bien. |
| 4 | ar | انها مطاردة مفضلة من bulbuls , babblers , و minivets , فضلا عن اولئك الذين يفهمون هذه اللغة الطائر الطيور . |
| 5 | vi | Đó là sự thật bởi vì tôi nghĩ họ sẽ chuyển đến arizona khi họ già đi . |
| 6 | ru | Внутри него крошечная точка света танцевала неистово . |
| 7 | pt | Um esquiador a descer a colina nevada. |
| 8 | tr | Yönetim Kurulu , doğal kaynaklar için standartlara hitap edecek aktif bir projesi var . |
| 9 | it | Responsabilità personale molto? |


**Encode the column "labels"**

In [None]:
dataset = dataset.class_encode_column("labels")
print(dataset["train"].features)

{'labels': ClassLabel(num_classes=20, names=['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'it', 'ja', 'nl', 'pl', 'pt', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'], id=None), 'text': Value(dtype='string', id=None)}
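
The integer ids are simply positions in the `names` list printed above, which here is the alphabetically sorted list of language codes. A small plain-Python sketch of that mapping follows; the `encode` and `decode` helpers are hypothetical and only mirror what `ClassLabel` does for us.

```python
# Language codes copied from the ClassLabel output above.
names = ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'it', 'ja',
         'nl', 'pl', 'pt', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh']

def encode(code):
    """Map a language code to its integer class id."""
    return names.index(code)

def decode(class_id):
    """Map an integer class id back to its language code."""
    return names[class_id]

print(encode('en'), encode('fr'))  # 4 6
print(decode(2), decode(0))        # de ar
```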


**Preprocessing Dataset**

In [None]:
from transformers import AutoTokenizer

# Use the pre-trained xlm-roberta-base checkpoint

model_checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

max_input_length = 512

def preprocess_function(examples):
  return tokenizer(examples['text'], truncation=True, max_length=max_input_length)
  
encoded_dataset = dataset.map(preprocess_function, batched=True)
encoded_dataset = encoded_dataset.remove_columns(['text'])
encoded_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

In [None]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 70000
    })
    valid: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 10000
    })
})
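
Note that the examples are not padded at this stage, so sequences can have different lengths. As far as we know, when a tokenizer is passed to `Trainer`, each batch is padded to its longest sequence by default. The sketch below shows that idea in plain Python. It assumes a pad token id of 1, which we believe is what `xlm-roberta-base` uses (check `tokenizer.pad_token_id`), and the helper `pad_batch` is hypothetical.

```python
def pad_batch(batch_input_ids, pad_token_id=1):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids, only to show the shapes.
batch = pad_batch([[0, 11, 12, 2], [0, 21, 2]])
print(batch["input_ids"])       # [[0, 11, 12, 2], [0, 21, 2, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```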

##Loading Metrics



In [None]:
metric = load_metric("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
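
For reference, the accuracy computed here is the fraction of rows whose highest-scoring class matches the label. A plain-Python sketch with made-up logits follows; the function name `accuracy_from_logits` and the toy numbers are illustrative and not part of the notebook.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of examples whose arg-max class equals the reference label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Four toy examples over three classes; the last one is misclassified.
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
toy_labels = [1, 0, 2, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```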


##Fine-tuning the model



In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

labels = encoded_dataset['train'].features["labels"].names
num_labels = len(labels)

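# Build the label <-> id mappings stored in the model config
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label  # map each id to a single label, not the whole list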


model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels, label2id=label2id, id2label=id2label)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.out_p

In [None]:
model_name = model_checkpoint.split("/")[-1]
train_batch_size = 2
eval_batch_size = 4

task='language identification'
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
  )
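
As a quick sanity check on these settings, the number of optimizer updates can be estimated from the dataset size and batch size. This is a rough sketch that assumes a single device and the 70,000 training rows reported in the dataset summary; the helper `optimization_steps` is hypothetical. The result should match the "Total optimization steps" line in the training log.

```python
import math

def optimization_steps(num_examples, batch_size, epochs=1, grad_accum=1, num_devices=1):
    """Rough estimate of optimizer updates, keeping the last partial batch."""
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

print(optimization_steps(70000, 2))  # 35000
```

With a batch size of 2 the run needs many small updates; a larger batch size would reduce the step count proportionally if GPU memory allows.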



In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    # Note: the test split doubles as the evaluation set here; the "valid" split is left unused.
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


In [None]:
trainer.train()

***** Running training *****
  Num examples = 70000
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 35000


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.0029 | 0.040862 | 0.9955 |


***** Running Evaluation *****
  Num examples = 10000
  Batch size = 4
Saving model checkpoint to xlm-roberta-base-finetuned-language identification/checkpoint-35000
Configuration saved in xlm-roberta-base-finetuned-language identification/checkpoint-35000/config.json
Model weights saved in xlm-roberta-base-finetuned-language identification/checkpoint-35000/pytorch_model.bin
tokenizer config file saved in xlm-roberta-base-finetuned-language identification/checkpoint-35000/tokenizer_config.json
Special tokens file saved in xlm-roberta-base-finetuned-language identification/checkpoint-35000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from xlm-roberta-base-finetuned-language identification/checkpoint-35000 (score: 0.9955).


TrainOutput(global_step=35000, training_loss=0.06775543201054846, metrics={'train_runtime': 5552.5128, 'train_samples_per_second': 12.607, 'train_steps_per_second': 6.303, 'total_flos': 1728803877664512.0, 'train_loss': 0.06775543201054846, 'epoch': 1.0})
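
The throughput figures in `TrainOutput` follow directly from the runtime, the number of training examples, and the step count. A small arithmetic check, using only numbers copied from the output above:

```python
train_runtime = 5552.5128  # seconds, from TrainOutput above
num_examples = 70000       # training rows, one epoch
global_step = 35000        # optimizer updates

samples_per_second = num_examples / train_runtime
steps_per_second = global_step / train_runtime
print(round(samples_per_second, 3), round(steps_per_second, 3))  # 12.607 6.303
print(round(train_runtime / 60, 1), "minutes")
```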

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 10000
  Batch size = 4


{'epoch': 1.0,
 'eval_accuracy': 0.9955,
 'eval_loss': 0.0408623144030571,
 'eval_runtime': 50.607,
 'eval_samples_per_second': 197.601,
 'eval_steps_per_second': 49.4}
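
An overall accuracy of 0.9955 on 10,000 examples leaves only a handful of errors, so it can be informative to see which languages they fall in. The sketch below is one possible breakdown, assuming you have integer predictions and references (for instance from `trainer.predict`); the helper `per_class_accuracy` and the toy data are hypothetical.

```python
from collections import defaultdict

def per_class_accuracy(predictions, references, names):
    """Accuracy computed separately for each reference class."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, ref in zip(predictions, references):
        total[ref] += 1
        correct[ref] += int(pred == ref)
    return {names[c]: correct[c] / total[c] for c in sorted(total)}

# An accuracy of 0.9955 on 10000 examples corresponds to 45 errors.
print(round((1 - 0.9955) * 10000))  # 45

# Toy predictions, only to show the output format.
toy_names = ['de', 'en', 'fr']
print(per_class_accuracy([0, 1, 1, 2, 2, 2], [0, 1, 0, 2, 2, 1], toy_names))
# {'de': 0.5, 'en': 0.5, 'fr': 1.0}
```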

In [None]:
trainer.save_model()

**Evaluate our model on external text**

In [None]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

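def predict(txt):
  # Wrap the single string in a list so the tokenizer returns a batch of size 1
  txt = [txt]
  tokenized_txt = tokenizer(txt, truncation=True, max_length=max_input_length)
  txt_dataset = Dataset(tokenized_txt)
  # trainer.predict returns (predictions, label_ids, metrics); only the logits are needed
  raw_pred, _, _ = trainer.predict(txt_dataset)
  # Pick the highest-scoring class id for each input
  y_pred = np.argmax(raw_pred, axis=1)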
  return id2label[str(y_pred[0])]
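
`predict` returns only the top label. If a confidence score is also useful, the raw logits can be turned into probabilities with a softmax. A plain-Python sketch with made-up logits follows; `softmax` and `top_prediction` are hypothetical helpers, not part of the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits, names):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best], probs[best]

# Made-up logits for three classes, not real model output.
label, prob = top_prediction([0.2, 3.1, -1.0], ['de', 'en', 'fr'])
print(label, round(prob, 3))
```

Very short inputs may produce lower probabilities, so a score like this could be used to flag uncertain predictions.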

In [None]:
print(predict("That is life"))


***** Running Prediction *****
  Num examples = 1
  Batch size = 4


[4]
en


In [None]:
print(predict("C'est La Vie"))


***** Running Prediction *****
  Num examples = 1
  Batch size = 4


[6]
fr


In [None]:
print(predict("So ist das Leben"))

***** Running Prediction *****
  Num examples = 1
  Batch size = 4


[2]
de


In [None]:
print(predict("هذه هي الحياة"))

***** Running Prediction *****
  Num examples = 1
  Batch size = 4


[0]
ar
