<a href="https://colab.research.google.com/github/ffedox/pbr/blob/main/parallel_corpus_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting a domain-specific parallel corpus from Wikipedia

# 1. Setup

## 1.1 Enabling the GPU on Colab

Checking if a GPU is available and selecting the device (GPU or CPU) to run PyTorch computations on.

In [None]:
import torch

# If a GPU is available, tell PyTorch to use it;
# otherwise, fall back to the CPU.
if torch.cuda.is_available():
  device = torch.device("cuda")
  print('Found GPU:', torch.cuda.get_device_name(0))
else:
  device = torch.device("cpu")
  print('CPU will be used because no GPU is available.')

Found GPU: Tesla T4


## 1.2 Installing Wikipedia API

[Wikipedia-API](https://github.com/martin-majlis/Wikipedia-API) is a Python wrapper for Wikipedia's API. It supports extracting texts, sections, links, categories, translations, etc. from Wikipedia.

In [None]:
!pip install wikipedia-api --quiet

## 1.3 Installing Sentence Transformers



[Sentence Transformers](https://www.sbert.net/) is a Python framework for state-of-the-art sentence, text and image embeddings. 

In [None]:
!pip install sentence-transformers --quiet



## 1.4 Imports

In [None]:
from sentence_transformers import models, SentenceTransformer
from json import JSONDecodeError

import wikipediaapi

import pandas as pd
import numpy as np
import scipy

from nltk.tokenize import sent_tokenize
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# 2. Extracting an IT-EN comparable corpus from Categories

First, we will extract a comparable corpus by scraping `Category:` pages in Italian, since an Italian article is more likely to have an English counterpart than the other way around.

For each page linked in the IT Category: page, we will be retrieving the EN equivalent.

## 2.1 Extracting the IT articles from the Category

Defining a function `get_it_articles` that takes two lists and a category name as input, retrieves information about the category from the Italian Wikipedia using the `wikipediaapi` library, extracts the text of each article in the category and appends it to the `page_texts_it` list. It also appends the title of each article to the `page_titles` list. The category input is supposed to be a string representing a category in the Italian Wikipedia, formatted like `"Categoria:Survival_horror"`. 

In [None]:
page_titles = []
page_texts_it = []

def get_it_articles(category, page_titles, page_texts_it):

  wiki_wiki = wikipediaapi.Wikipedia('it')
  cat = wiki_wiki.page(category)   # category should be like .page("Categoria:Survival_horror")

  for p in cat.categorymembers.values():
    if p.namespace == wikipediaapi.Namespace.MAIN:
      # it is page => we can get text
      page_titles.append(p.title)
      page_texts_it.append(p.text)

Calling the `get_it_articles` function defined earlier, with the string `"Categoria:Videogiochi_strategici_in_tempo_reale"` as the category and the two lists `page_titles` and `page_texts_it` as arguments.

In [None]:
get_it_articles("Categoria:Videogiochi_strategici_in_tempo_reale", page_titles, page_texts_it)

## 2.2 Extracting the corresponding EN articles

Defining the function `get_langlinks`, which takes a page and the list `page_texts_en` as input. It looks up the key `'en'` in the page's `langlinks` dictionary to retrieve the English version of the page and appends its text to `page_texts_en`. If the key `'en'` is missing (`KeyError`), or if a `JSONDecodeError` occurs while retrieving the English page, it appends the placeholder string `'No match'` instead, so that `page_texts_en` stays aligned with `page_titles`.

In [None]:
page_texts_en = []

def get_langlinks(page, page_texts_en):

  try:
    # Look up the English counterpart and fetch its text
    page_en = page.langlinks['en']
    page_texts_en.append(page_en.text)

  except (KeyError, JSONDecodeError):
    # No English version, or the API response could not be decoded
    page_texts_en.append('No match')
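
The lookup at the heart of `get_langlinks` is a plain dictionary access, which can be tried out without network access. The sketch below is only an illustration under stated assumptions: the `SimpleNamespace` objects are stand-ins for Wikipedia-API page objects, the helper name `english_text_or_none` is hypothetical, and returning `None` instead of the `'No match'` placeholder is an alternative design, not what the notebook does. It also does not cover the `JSONDecodeError` case.

```python
from types import SimpleNamespace

def english_text_or_none(page):
    """Return the text of the English page linked from `page`, or None if there is no 'en' link."""
    page_en = page.langlinks.get('en')
    return page_en.text if page_en is not None else None

# Stand-ins for page objects (hypothetical data, no network access).
with_en = SimpleNamespace(langlinks={'en': SimpleNamespace(text='English text'),
                                     'de': SimpleNamespace(text='Deutscher Text')})
without_en = SimpleNamespace(langlinks={'de': SimpleNamespace(text='Deutscher Text')})

print(english_text_or_none(with_en))     # English text
print(english_text_or_none(without_en))  # None
```

Returning `None` would avoid any clash between the placeholder and real article text, at the cost of having to filter with `.notna()` later on.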

Defining the function `get_en_articles` that takes the two lists `page_titles` and `page_texts_en` as input. For each title in the `page_titles` list, it retrieves the corresponding page from the Italian Wikipedia using the `wikipediaapi` library. Then, it calls the `get_langlinks` function, passing the page and the `page_texts_en` list as arguments, to retrieve the English version of the page if it exists.

In [None]:
def get_en_articles(page_titles, page_texts_en):

  # Create the API client once and reuse it for every title
  wiki_wiki = wikipediaapi.Wikipedia('it')

  for title in page_titles:
    page = wiki_wiki.page(title)
    get_langlinks(page, page_texts_en)

Calling the `get_en_articles` function defined earlier, with the lists `page_titles` and `page_texts_en` as arguments.

In [None]:
get_en_articles(page_titles, page_texts_en)

## 2.3 Comparable IT-EN corpus

Creating a Pandas dataframe named `comparable_corpus_vg` by stacking the `page_texts_en` and `page_texts_it` lists as columns and naming the columns as 'en' and 'it' respectively. The resulting dataframe will have two columns, where each row contains the English and Italian version of a page.

In [None]:
comparable_corpus_vg = pd.DataFrame(np.column_stack([page_texts_en, page_texts_it]), 
                               columns=['en', 'it'])

Filtering the `comparable_corpus_vg` dataframe to keep only the rows whose 'en' value is not the placeholder string 'No match', i.e. the rows for which an English version of the page was found. An exact comparison is used rather than `.str.contains()`, so that an article whose text merely contains the phrase "No match" is not discarded. The `reset_index` method with `drop=True` then renumbers the rows from 0 and discards the old index.

In [None]:
comparable_corpus_vg = comparable_corpus_vg[comparable_corpus_vg['en'] != 'No match'].reset_index(drop=True)
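
When filtering out a placeholder such as 'No match', substring matching and exact comparison can give different results. The following sketch shows this on toy data; the three rows are made up for the example and are not taken from the corpus.

```python
import pandas as pd

# Toy data (hypothetical): the second 'en' text merely mentions the phrase,
# the third one is the actual placeholder.
toy = pd.DataFrame({
    'en': ['A real-time strategy game.', 'No match was found for the sequel.', 'No match'],
    'it': ['Un videogioco strategico.', 'Nessun seguito.', 'Testo senza equivalente.'],
})

# Substring test: drops every row whose text contains the phrase anywhere
kept_substring = toy[~toy['en'].str.contains('No match')].reset_index(drop=True)

# Exact test: drops only rows that consist of the placeholder itself
kept_exact = toy[toy['en'] != 'No match'].reset_index(drop=True)

print(len(kept_substring))  # 1
print(len(kept_exact))      # 2
```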

Displaying the `comparable_corpus_vg` dataframe.

In [None]:
comparable_corpus_vg

| | en | it |
|---|---|---|
| 0 | 0 A.D. is a free and open-source real-time str... | 0 A.D. è un videogioco di strategia in tempo r... |
| 1 | Abomination: The Nemesis Project, released in ... | Abomination è un videogioco strategico/gestion... |
| 2 | Act of War: Direct Action is a real-time strat... | Act of War: Direct Action è un videogioco stra... |
| 3 | Act of War: High Treason (abbreviated as AOW:H... | Act of War: High Treason è un'espansione del v... |
| 4 | Desert Rats vs. Afrika Korps, released as Afri... | Afrika Korps vs. Desert Rats (abbreviato in AK... |
| ... | ... | ... |
| 351 | WorldShift is a science fiction real-time stra... | WorldShift è un videogioco strategico in tempo... |
| 352 | X-COM: Apocalypse is a 1997 science fiction ta... | X-COM: Apocalypse è il terzo videogioco della ... |
| 353 | Z is a 1996 real-time strategy computer game b... | Z è un videogioco strategico in tempo reale sv... |
| 354 | Z: Steel Soldiers (originally released for Mic... | Z: Steel Soldiers è un videogioco strategico i... |


# 3. Extracting parallel sentences from the comparable corpus

For each Italian Wikipedia article, we obtained the corresponding article in English. Parallel sentences can therefore be mined within each article pair rather than across the whole corpus. This local approach has two advantages:

1.   Mining is faster, since each article usually has only a few hundred sentences.
2.   It seems reasonable to assume that a translation of a sentence is more likely to be found in the same article than anywhere else in Wikipedia.
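
The speed argument can be made concrete by counting sentence comparisons. The sketch below uses made-up sentence counts for three article pairs (assumptions for illustration, not measurements from this corpus) and compares mining within each pair against comparing every Italian sentence with every English sentence.

```python
# Hypothetical sentence counts per article pair (EN, IT); not taken from the corpus.
article_sizes = [(200, 150), (300, 120), (100, 80)]

# Local mining: each Italian sentence is compared only with the sentences
# of its own English counterpart article.
local_comparisons = sum(n_en * n_it for n_en, n_it in article_sizes)

# Global mining: each Italian sentence is compared with every English sentence.
total_en = sum(n_en for n_en, _ in article_sizes)
total_it = sum(n_it for _, n_it in article_sizes)
global_comparisons = total_en * total_it

print(local_comparisons)   # 74000
print(global_comparisons)  # 210000
```

The gap widens as more article pairs are added, because the global count grows with the product of the two totals.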

## 3.1 Loading the Sentence-Transformers model

Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity [[1]](https://arxiv.org/abs/1908.10084).

We will be using [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2). The following line of code creates a `SentenceTransformer` object named `model`. The `device` argument is set to the device selected in section 1.1, so the model runs on the GPU if one is available and on the CPU otherwise.

In [None]:
model = SentenceTransformer('distiluse-base-multilingual-cased-v2', device=str(device))

## 3.2 Using the model to create the embeddings, and computing sentence similarity

Defining a function `extract_parallel_sents` that takes three inputs: `comparable_corpus_vg`, `parallel_sentences_en`, and `parallel_sentences_it`. The function extracts parallel sentences from the input comparable corpus. 

Each article is first split into individual sentences with `sent_tokenize`. Then the sentence embeddings are calculated using `distiluse-base-multilingual-cased-v2`.

For each sentence in the Italian article, the cosine distance between its embedding and the embeddings of all sentences in the corresponding English article is calculated, and the closest English sentence is selected. If the cosine similarity of the pair (1 minus the cosine distance) is higher than 0.8 (the similarity threshold), the two sentences are considered parallel and added to the output lists `parallel_sentences_it` and `parallel_sentences_en`.
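
The relation between cosine distance and cosine similarity can be checked on toy vectors. In the sketch below, the 2-D arrays are hypothetical stand-ins for sentence embeddings; `cdist` with the `"cosine"` metric returns distances, and subtracting them from 1 gives the similarities that are compared against the 0.8 threshold.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy 2-D vectors standing in for sentence embeddings (hypothetical values).
query = np.array([[1.0, 0.0]])
candidates = np.array([
    [2.0, 0.0],   # same direction as the query
    [1.0, 1.0],   # 45 degrees away from the query
    [0.0, 3.0],   # orthogonal to the query
])

distances = cdist(query, candidates, "cosine")[0]
similarities = 1 - distances          # approximately 1.0, 0.707 and 0.0

best = int(np.argmin(distances))      # smallest distance = highest similarity
print(np.round(similarities, 3))
print(best, similarities[best] > 0.8)
```

Only the first candidate passes the threshold here: vector length does not matter for cosine similarity, only direction.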

In [None]:
parallel_sentences_en = []
parallel_sentences_it = []

def extract_parallel_sents(comparable_corpus_vg, parallel_sentences_en, parallel_sentences_it):

  closest_n = 1

  for en, it in zip(comparable_corpus_vg.en.values, comparable_corpus_vg.it.values): # Looping through the texts in the comparable corpus

    corpus = sent_tokenize(en)
    queries = sent_tokenize(it)

    corpus_embeddings = model.encode(corpus)
    query_embeddings = model.encode(queries)

    for query, query_embedding in zip(queries, query_embeddings):
      
      distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

      results = zip(range(len(distances)), distances)
      results = sorted(results, key=lambda x: x[1])

      for idx, distance in results[0:closest_n]:
        if 1-distance > 0.8: # Similarity threshold
          parallel_sentences_it.append(query)
          parallel_sentences_en.append(corpus[idx].strip())
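
As an alternative to the per-query loop and sort, the nearest neighbour of every query can be found in one call by computing the full similarity matrix. This is a sketch of that variant, not the method used above: the function name `best_matches` is hypothetical, the toy arrays stand in for the output of `model.encode(...)`, and empty inputs are not handled.

```python
import numpy as np
from scipy.spatial.distance import cdist

def best_matches(query_embeddings, corpus_embeddings, threshold=0.8):
    """Return (query_idx, corpus_idx, similarity) for every query whose nearest
    corpus vector has a cosine similarity above `threshold`."""
    similarities = 1 - cdist(query_embeddings, corpus_embeddings, "cosine")
    nearest = similarities.argmax(axis=1)                    # best corpus index per query
    scores = similarities[np.arange(len(nearest)), nearest]  # similarity of that best match
    return [(q, int(c), float(s))
            for q, (c, s) in enumerate(zip(nearest, scores)) if s > threshold]

# Toy embeddings (hypothetical values) in place of real sentence embeddings.
queries = np.array([[1.0, 0.1], [0.0, 1.0], [1.0, 1.0]])
corpus = np.array([[1.0, 0.0], [0.1, 1.0], [-1.0, 0.5]])

matches = best_matches(queries, corpus)
for q_idx, c_idx, score in matches:
    print(q_idx, c_idx, round(score, 3))
```

The third query is dropped because its best match stays below the threshold, which mirrors the `1-distance > 0.8` check in `extract_parallel_sents`.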

Calling the `extract_parallel_sents` function and passing the `comparable_corpus_vg`, `parallel_sentences_en`, and `parallel_sentences_it` variables as arguments.

In [None]:
extract_parallel_sents(comparable_corpus_vg, parallel_sentences_en, parallel_sentences_it)

## 3.3 Building the parallel corpus 

Creating a pandas dataframe called `parallel_corpus`, with two columns: "en" and "it". The values in the columns come from the lists `parallel_sentences_en` and `parallel_sentences_it` respectively, which are stacked horizontally with the `np.column_stack()` function and passed to the dataframe.

In [None]:
parallel_corpus = pd.DataFrame(np.column_stack([parallel_sentences_en, parallel_sentences_it]), 
                               columns=['en', 'it'])

Displaying the `parallel_corpus` dataframe.

In [None]:
parallel_corpus

| | en | it |
|---|---|---|
| 0 | 0 A.D. is a free and open-source real-time str... | 0 A.D. è un videogioco di strategia in tempo r... |
| 1 | Chris Charla of NextGen said, "As much as we l... | Chris Charla di NextGen ha dichiarato: "Per qu... |
| 2 | If you can find a few copies in the bargain bi... | Se riesci a trovare alcune copie nel cestino d... |
| 3 | Act of War: Direct Action is a real-time strat... | Act of War: Direct Action è un videogioco stra... |
| 4 | Age of Empires (AoE) is a real-time strategy v... | Age of Empires è un videogioco strategico in t... |
| ... | ... | ... |
| 1348 | A Next Generation critic commented that Z stan... | Un critico di Next Generation ha commentato ch... |
| 1349 | He said Z lacks the longevity of its nearest c... | Stando alla rivista, Z non ha la longevità dei... |
| 1350 | PC Zone magazine described Z as "a brilliant s... | La rivista PC Zone ha descritto Z come "uno st... |
| 1351 | "Reviewing the Saturn port, Sega Saturn Magazi... | "Durante la sua recensione della versione su S... |


## 3.4 Exporting to .XLSX

Exporting the extracted parallel sentences to an Excel file.

In [None]:
parallel_corpus.to_excel("parallel_corpus_vg_en_it.xlsx")  