<a href="https://colab.research.google.com/github/ffedox/pbr/blob/main/corpus_creation_with_comments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting a domain-specific parallel corpus from Wikipedia

# 1. Setup

## 1.1 Wikipedia API

[Wikipedia-API](https://github.com/martin-majlis/Wikipedia-API) is a Python wrapper for Wikipedias' API. It supports extracting texts, sections, links, categories, translations, etc from Wikipedia.

In [1]:
!pip install wikipedia-api --quiet

## 1.2 Sentence Transformers



[Sentence Transformers](https://www.sbert.net/) is a Python framework for state-of-the-art sentence, text and image embeddings. 

In [None]:
!pip install sentence-transformers --quiet

## 1.3 Imports

In [22]:
from sentence_transformers import models, SentenceTransformer
from json import JSONDecodeError

import wikipediaapi

import pandas as pd
import numpy as np
import scipy

from nltk.tokenize import sent_tokenize
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# 2. Extracting an IT-EN comparable corpus from Category: pages

First we will extract a comparable corpus by scraping Category: pages in Italian (it is more likely to find correspondences from IT to EN than the other way around).

For each page linked in the IT Category: page, we will be retrieving the EN equivalent.

*Here, we will limit the search to the summaries (first paragraph in each Wikipedia entry).* Maybe not?

## 2.1 Extracting the IT articles from the Category

In [6]:
page_titles = []
page_texts_it = []

def get_it_articles(category, page_titles, page_texts_it):

  wiki_wiki = wikipediaapi.Wikipedia('it')
  cat = wiki_wiki.page(category)   # category should be like .page("Categoria:Survival_horror")

  for p in cat.categorymembers.values():
    if p.namespace == wikipediaapi.Namespace.MAIN:
      # it is page => we can get text
      page_titles.append(p.title)
      page_texts_it.append(p.text)

In [9]:
get_it_articles('Categoria:Videogiochi_in_realtà_virtuale', page_titles, page_texts_it)

## 2.2 Extracting the corresponding EN articles

By leveraging the previously obtained titles, we can search for the equivalent pages on the EN Wikipedia.

In [13]:
page_texts_en = []

def get_langlinks(page, page_texts_en):

        langlinks = page.langlinks

        for k in sorted(langlinks.keys()):
            v = langlinks[k]

        try:
          page_en = page.langlinks['en']
          page_texts_en.append(page_en.text)

        except KeyError:
          page_texts_en.append(str('No match'))
          
        except JSONDecodeError:
           page_texts_en.append(str('No match'))

In [14]:
def get_en_articles(page_titles, page_texts_en):

  for title in page_titles:

    wiki_wiki = wikipediaapi.Wikipedia('it')
    page = wiki_wiki.page(str(title))
    get_langlinks(page, page_texts_en)

In [15]:
get_en_articles(page_titles, page_texts_en)

## 2.3 Comparable IT-EN corpus

We merge the IT and the EN pages to obtain a comparable corpus.

In [17]:
comparable_corpus_vg = pd.DataFrame(np.column_stack([page_texts_en, page_texts_it]), 
                               columns=['en', 'it'])

Now we only need to drop the rows where no English equivalent was found.

In [18]:
comparable_corpus_vg = comparable_corpus_vg[comparable_corpus_vg['en'].str.contains('No match')==False].reset_index(drop=True)

End result:

In [20]:
comparable_corpus_vg

Unnamed: 0,en,it
0,Astro Bot Rescue Mission is a 2018 platform vi...,Astro Bot Rescue Mission è un videogioco a pia...
1,Batman: Arkham VR is a virtual reality adventu...,Batman: Arkham VR è un videogioco di avventura...
2,Beat Saber is a virtual reality rhythm game de...,Beat Saber è un rhythm game in realtà virtuale...
3,Blood & Truth is a first-person shooter develo...,Blood & Truth è uno sparatutto in prima person...
4,Dreams is a game creation system video game de...,Dreams è un videogioco di tipo sandbox svilupp...
5,Farpoint is a virtual reality first-person sho...,Farpoint è un videogioco sparatutto in prima p...
6,Five Nights at Freddy's: Help Wanted is a 2019...,Five Nights at Freddy's: Help Wanted (abbrevia...
7,The Forest is a survival horror video game dev...,"The Forest è un videogioco in prima persona, d..."
8,Golem is a video game developed by Highwire Ga...,Golem è un videogioco sviluppato da Highware G...
9,Half-Life: Alyx is a 2020 virtual reality (VR)...,Half-Life: Alyx è un videogioco di genere spar...


# 3. Extracting parallel sentences from the comparable corpus

For each Wikipedia article, we obtained the link to the corresponding article in English. This can be used to mine sentences limited to the respective articles. This local approach has several advantages: 

1.   Mining is faster since each article usually has a few hundreds of sentences only.
2.   It seems reasonable to assume that a translation of a sentence is more
likely to be found in the same article than anywhere in the whole Wikipedia. 

## 3.1 Loading the Sentence-Transformers model

We will be using [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2://) to compute the sentence embeddings for evaluating textual similarity.

In [None]:
model = SentenceTransformer('distiluse-base-multilingual-cased')

## 3.2 Using the model to create the embeddings, and computing sentence similarity

In [23]:
parallel_sentences_en = []
parallel_sentences_it = []

def extract_parallel_sents(comparable_corpus_vg, parallel_sentences_en, parallel_sentences_it):

  closest_n = 1

  for en, it in zip(comparable_corpus_vg.en.values, comparable_corpus_vg.it.values): # Looping throuth the texts in the comparable corpus

    corpus = sent_tokenize(en)
    queries = sent_tokenize(it)

    corpus_embeddings = model.encode(corpus)
    query_embeddings = model.encode(queries)

    for query, query_embedding in zip(queries, query_embeddings):
      
      distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

      results = zip(range(len(distances)), distances)
      results = sorted(results, key=lambda x: x[1])

      for idx, distance in results[0:closest_n]:
        if 1-distance > 0.8: # Similarity threshold
          parallel_sentences_it.append(query)
          parallel_sentences_en.append(corpus[idx].strip())

In [24]:
extract_parallel_sents(comparable_corpus_vg, parallel_sentences_en, parallel_sentences_it)

## 3.3 Building the parallel corpus 

In [25]:
parallel_corpus = pd.DataFrame(np.column_stack([parallel_sentences_en, parallel_sentences_it]), 
                               columns=['en', 'it'])

In [26]:
parallel_corpus

Unnamed: 0,en,it
0,Astro Bot Rescue Mission is a 2018 platform vi...,Astro Bot Rescue Mission è un videogioco a pia...
1,It stars a cast of robot characters first intr...,I robottini presenti nel gioco sono stati intr...
2,Gameplay\nAstro Bot Rescue Mission is a 3D pla...,Modalità di gioco\nAstro Bot Rescue Mission è ...
3,"Astro is able to jump, hover, punch and charge...","Astro è in grado di saltare, caricare e colpir..."
4,There are 8 lost robots in each level and find...,Ci sono 8 robot smarriti in ogni livello e tro...
...,...,...
193,Gameplay\nStar Wars: Squadrons is a space comb...,Star Wars: Squadrons è un videogioco di combat...
194,"As players earn more experience, they can unlo...",È anche possibile ottenere esperienza in modo ...
195,"An updated arcade version, Tekken 7: Fated Ret...",Tekken 7 è uscito nelle sale giochi giapponesi...
196,Until Dawn: Rush of Blood is a rail shooter de...,Until Dawn: Rush of Blood è un videogioco di g...


In [27]:
parallel_corpus.to_excel("parallel_corpus_vg_en_it.xlsx")  