<a href="https://colab.research.google.com/github/ffedox/pbr/blob/main/corpus_creation_with_comments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting a domain-specific parallel corpus from Wikipedia

# 1. Setup

## 1.1 Wikipedia API

[Wikipedia-API](https://github.com/martin-majlis/Wikipedia-API) is a Python wrapper for Wikipedia's API. It supports extracting texts, sections, links, categories, translations, etc. from Wikipedia.

In [1]:
!pip install wikipedia-api --quiet

## 1.2 Sentence Transformers



[Sentence Transformers](https://www.sbert.net/) is a Python framework for state-of-the-art sentence, text and image embeddings. 

In [None]:
!pip install sentence-transformers --quiet

## 1.3 Imports

In [None]:
from sentence_transformers import models, SentenceTransformer
from json import JSONDecodeError

import wikipediaapi

import pandas as pd
import numpy as np
import scipy

from nltk.tokenize import sent_tokenize
import nltk

nltk.download('punkt')

# 2. Extracting an IT-EN comparable corpus from Category: pages

First, we extract a comparable corpus by scraping a Category: page on the Italian Wikipedia. We start from Italian because an IT article is more likely to have an EN counterpart than the other way around.

For each page linked from the IT Category: page, we retrieve the EN equivalent.

*Note: the code below retrieves the full text of each entry (`p.text`). To limit the search to the summaries (the introductory text of each Wikipedia entry), `p.summary` could be used instead.*
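
The choice between summaries and full articles can be isolated in one small helper. The sketch below is only an illustration and not part of the original pipeline: `page_content` and its `summaries_only` flag are hypothetical names, and the code assumes only that a page object exposes the `summary` and `text` attributes that Wikipedia-API pages provide. A stand-in object replaces a real page so the snippet runs without any API request.

```python
from types import SimpleNamespace

def page_content(page, summaries_only=False):
    """Return the summary or the full text of a page-like object.

    Hypothetical helper: `page` only needs `.summary` and `.text`
    attributes, as Wikipedia-API page objects have.
    """
    return page.summary if summaries_only else page.text

# Stand-in for a real page, so no API request is made here
fake_page = SimpleNamespace(
    summary="Short introduction.",
    text="Short introduction.\n\nTrama\nFull body of the article.",
)

print(page_content(fake_page, summaries_only=True))
print(page_content(fake_page))
```

Summaries are shorter and tend to be closer translations of each other, while full texts give more candidate sentences at the cost of more noise.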

## 2.1 Extracting the IT articles from the Category

In [6]:
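page_titles = []
page_texts_it = []

def get_it_articles(category, page_titles, page_texts_it):
  # Append the title and full text of every article in `category`
  # to the lists passed in (the lists are modified in place)

  # Note: newer Wikipedia-API releases may also require a user_agent argument
  wiki_wiki = wikipediaapi.Wikipedia('it')
  cat = wiki_wiki.page(category)   # e.g. "Categoria:Survival_horror"

  for p in cat.categorymembers.values():
    # keep only regular articles; skip subcategories, files, templates, etc.
    if p.namespace == wikipediaapi.Namespace.MAIN:
      page_titles.append(p.title)
      page_texts_it.append(p.text)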

In [9]:
get_it_articles('Categoria:Videogiochi_in_realtà_virtuale', page_titles, page_texts_it)

## 2.2 Extracting the corresponding EN articles

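Using the titles collected above, we reload each IT page and follow its interlanguage link (`langlinks`) to the equivalent page on the EN Wikipedia. Pages without an EN counterpart are recorded as `'No match'`, so the EN list stays aligned with the IT list.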

In [13]:
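page_texts_en = []

def get_langlinks(page, page_texts_en):
  # Append the text of the EN page linked from `page`,
  # or the placeholder 'No match' if it cannot be retrieved

  try:
    # langlinks maps language codes to the linked pages; the lookup
    # triggers an API request, so it belongs inside the try block
    page_en = page.langlinks['en']
    page_texts_en.append(page_en.text)

  except (KeyError, JSONDecodeError):
    # no EN equivalent, or the API response could not be parsed
    page_texts_en.append('No match')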

In [14]:
def get_en_articles(page_titles, page_texts_en):

  # create the client once instead of once per title
  wiki_wiki = wikipediaapi.Wikipedia('it')

  for title in page_titles:
    page = wiki_wiki.page(title)
    get_langlinks(page, page_texts_en)

In [15]:
get_en_articles(page_titles, page_texts_en)
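
Since one entry is appended per title (either the EN text or the `'No match'` placeholder), the three lists should stay index-aligned. A quick sanity check before building the DataFrame might look like the sketch below. `check_alignment` is a hypothetical helper, shown here on toy data rather than on the scraped lists.

```python
def check_alignment(titles, texts_it, texts_en, placeholder="No match"):
    """Verify the lists are index-aligned and return titles lacking an EN page."""
    if not (len(titles) == len(texts_it) == len(texts_en)):
        raise ValueError(
            f"Misaligned lists: {len(titles)} titles, "
            f"{len(texts_it)} IT texts, {len(texts_en)} EN texts"
        )
    # Titles whose EN counterpart was not found
    return [t for t, en in zip(titles, texts_en) if en == placeholder]

# Toy data standing in for the scraped lists
titles = ["Gioco A", "Gioco B", "Gioco C"]
texts_it = ["Testo A", "Testo B", "Testo C"]
texts_en = ["Text A", "No match", "Text C"]

missing = check_alignment(titles, texts_it, texts_en)
print(f"{len(missing)} of {len(titles)} pages lack an EN equivalent: {missing}")
```

On the real data, the call would be `check_alignment(page_titles, page_texts_it, page_texts_en)`.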

## 2.3 Comparable IT-EN corpus

We merge the IT and EN texts into a two-column DataFrame, one row per article, to obtain a comparable corpus.

In [17]:
# one row per article: EN text in 'en', IT text in 'it'
comparable_corpus_vg = pd.DataFrame({'en': page_texts_en, 'it': page_texts_it})

Now we only need to drop the rows where no English equivalent was found.

In [18]:
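# compare against the placeholder exactly, so that articles which merely
# contain the phrase 'No match' in their text are not dropped
comparable_corpus_vg = comparable_corpus_vg[comparable_corpus_vg['en'] != 'No match'].reset_index(drop=True)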
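
When filtering on the placeholder, a substring test and an exact comparison behave differently: a substring test also discards genuine articles whose text happens to include the phrase. The plain-Python sketch below illustrates this on invented toy rows; the row contents and the `NO_MATCH` constant are assumptions made for the example.

```python
NO_MATCH = "No match"  # placeholder used when no EN page was found

# Toy (en, it) rows standing in for the comparable corpus
rows = [
    ("A VR game released in 2016.", "Un videogioco VR del 2016."),
    (NO_MATCH, "Un videogioco senza pagina inglese."),
    ("Reviewers found No match for its physics engine.", "Testo italiano."),
]

# Substring test: also discards the third row, a genuine article
kept_substring = [row for row in rows if NO_MATCH not in row[0]]

# Exact test: discards only the placeholder row
kept_exact = [row for row in rows if row[0] != NO_MATCH]

print(len(kept_substring), len(kept_exact))  # prints: 1 2
```

In pandas, the exact test corresponds to comparing the `en` column with `!=`, while `str.contains` corresponds to the substring test.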

End result:

In [None]:
comparable_corpus_vg

# 3. Extracting parallel sentences from the comparable corpus