# Roots En Wikivoyage

https://huggingface.co/datasets/bigscience-data/roots_en_wikivoyage

## Imports

In [1]:
# !pip install faiss-cpu

In [2]:
# !pip install mistralai

In [3]:
# %pip install datasets

In [4]:
from dotenv import load_dotenv
import os

load_dotenv() # take environment variables from .env.

True

In [5]:
# import faiss
import ast

import pandas as pd
import requests
import numpy as np
from datasets import load_dataset
# from sentence_transformers import SentenceTransformer
from huggingface_hub import login
from mistralai import Mistral
import json



In [6]:
token = os.getenv("HUGGING_FACE")
api_key = os.getenv("MISTRAL_API_KEY")

In [7]:
login(token=token)

In [8]:
output_fpath = "output_data"
# Create the output directory; do nothing if it already exists.
os.makedirs(output_fpath, exist_ok=True)

## Loading dataset

In [9]:
dataset = load_dataset("bigscience-data/roots_en_wikivoyage", split="train")

In [10]:
type(dataset)

datasets.arrow_dataset.Dataset

## EDA

In [11]:
dataset

Dataset({
    features: ['meta', 'text'],
    num_rows: 24838
})

In [12]:
dataset[0]["text"][:100]

'This article describes routes that avoid a transit of the United States. Since the documentation req'

In [41]:
dataset[12]["text"][-1000:]

'ams Street. When you reach Wacker Drive, turn left around the Sears Tower and head into the main lobby. The Sears Tower likely needs no introduction, but the massive art installation inside is less well known. In the Heart of this Infinite Particle of Dust is the last of the day\'s sculptures, by Jacob Hashimoto. The cloud-like work is made up of 7,000 individual disks suspended from the ceiling, with varying lengths of string. Your tour ends here in the Sears Tower, and you might want to take this opportunity to ride the 20 miles per hour elevator to the Sears Tower Skydeck for some grand views of the journey you\'ve just taken. If you are thinking of dinner at this point, you are just a block away from the Quincy L station — you have your choice of neighborhoods to visit for dinner. Some especially good "ethnic" dining options not too far from the Loop are in Greektown a short walk across the river, Chinatown to the south (Red Line), and authentic Mexican cuisine in Pilsen (Pink Lin

In [13]:
all_meta = []
titles = []
languages = set()

In [14]:
ast.literal_eval(dataset[0]["meta"])

{'content_model': 'wikitext',
 'language': 'en',
 'title': 'Avoiding travel through the United States',
 'type': 'text'}

In [15]:
for data in dataset:
    all_meta.append(ast.literal_eval(data["meta"]))

In [16]:
?set

Init signature: set(self, /, *args, **kwargs)
Docstring:
set() -> new empty set object
set(iterable) -> new set object

Build an unordered collection of unique elements.
Type:           type
Subclasses:     LazySet, LazySet, LazySet, HTTPHeaderDictItemView

In [17]:
for meta in all_meta:
    titles.append(meta["title"])
    languages.add(meta["language"])

In [18]:
languages

{'en'}
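The metadata parsing and title collection above can also be done in a single pass. The sketch below is one possible way to do it, not the notebook's own code: `collect_titles` is a hypothetical helper, and the two sample records are stand-ins that mimic the dataset's row shape rather than rows loaded from the Hub.

```python
import ast


def collect_titles(records):
    """Parse each record's stringified `meta` dict and collect titles and languages."""
    titles, languages = [], set()
    for record in records:
        # `meta` is stored as the string form of a Python dict, so parse it safely.
        meta = ast.literal_eval(record["meta"])
        titles.append(meta["title"])
        languages.add(meta["language"])
    return titles, languages


# Stand-in records (assumed shape: a 'meta' string and a 'text' string).
sample_records = [
    {
        "meta": "{'content_model': 'wikitext', 'language': 'en', "
                "'title': 'Avoiding travel through the United States', 'type': 'text'}",
        "text": "...",
    },
    {
        "meta": "{'content_model': 'wikitext', 'language': 'en', "
                "'title': 'E1 Long Distance Path', 'type': 'text'}",
        "text": "...",
    },
]

sample_titles, sample_languages = collect_titles(sample_records)
```

Passing the loaded `dataset` instead of `sample_records` should work the same way, since iterating a `Dataset` yields one dict per row.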

In [19]:
len(titles)

24838

In [20]:
titles[:5]

['Avoiding travel through the United States',
 'E1 Long Distance Path',
 'Ferries in the Caspian Sea',
 'Longobard sites',
 'Caribbean Nicaragua']

In [21]:
sorted(titles)[:10]

["'s-Hertogenbosch",
 '100 Mile House',
 '20th-century South Africa',
 '88 Temple Pilgrimage',
 'A Coruña',
 'A Taste of Coastal Texas',
 'A seaside stroll in Helsinki',
 'Aa en Hunze',
 'Aachen',
 'Aalborg']

In [45]:
titles[4]

'Caribbean Nicaragua'

In [49]:
len(dataset[4]["text"])

4101

In [22]:
df = pd.DataFrame({"title": titles})

In [50]:
df[df["title"].str.contains("Moscow")]

| | id | title |
|---|---|---|
| 24 | 24 | Moscow/Central-South |
| 48 | 48 | Moscow/Outskirts |
| 313 | 313 | Moscow/Zelenograd and New Moscow |
| 2049 | 2049 | West Moscow Oblast |
| 3518 | 3518 | North Moscow Oblast |
| 7453 | 7453 | Moscow/Central-West |
| 7642 | 7642 | Moscow to Urumqi |
| 16875 | 16875 | Moscow (Idaho) |
| 18146 | 18146 | Moscow/Central-North |
| 19382 | 19382 | Moscow/Central-East |


In [52]:
len(dataset[23554]["text"])

67613

In [53]:
dataset[23554]["text"][:1100]

'For other places with the same name, see Moscow (disambiguation). Moscow is a huge city with several district articles that contain information about specific sights, restaurants, and accommodation. Since its founding in 1147, Moscow (Russian: Москва, Moskva) has been at the crossroads of history as the capital of empires and a frequent target for invaders. As the capital of the Russian Empire, the Soviet Union, and, today, the Russian Federation, it has played a central role in the development of the largest country in the world. For many, the sight of the Kremlin complex in the centre of the city is still loaded with symbolism and history. Today, Moscow is a thriving, exuberant capital city that overflows with life, culture and sometimes traffic. A sprawling metropolis, and among the largest cities on the European continent, Moscow is home to numerous museums, Soviet-era monoliths and post-Soviet kitsch, but continues to pave the way forward as Muscovites move into the 21st century.

In [55]:
dataset[23554]["text"][-1100:]

'eum, local history museum, art and history museum, etc. The monastery was founded in 1656 by Tzar Alexis II and Patriarch Nikon (his "cell", a three-storey house stands in the park outside the monastery walls) to resemble the original Jerusalem. 55.72836.81614 Savvino-Storozhevskiy monastery (Саввино-Сторожевский монастырь) (65 km (40 mi) W; Commuter trains from Belorussky station to Zvenigorod , several daily; travel time ~1 hr, 1.5 km (0.93 mi) west to monastery, which is on a nearby hill.). A beautiful monastery with interesting history, closely connected to Russian Tzars. 56.3537.53333315 Dmitrov (Дмитров) (65 km (40 mi) North from Moscow (trains from Savelovsky station, several daily, 11⁄2 hr)). A town, on Moscow Channel, with old churches, interesting sculptures in the streets and a number of museums  55.87858537.03591516 Snegiri (40 km (25 mi) NW from Moscow (Volokolamskoe hwy), trains from Rizhsky Station, several daily, travel time about an hour). - Settlement, that boasts a 

In [23]:
df.head()

| | title |
|---|---|
| 0 | Avoiding travel through the United States |
| 1 | E1 Long Distance Path |
| 2 | Ferries in the Caspian Sea |
| 3 | Longobard sites |
| 4 | Caribbean Nicaragua |


In [24]:
df = df.reset_index()

In [25]:
df.rename(columns={"index": "id"}, inplace=True)

In [26]:
df.to_csv(f"{output_fpath}/titles.csv", index=False)

### Dataset with page titles

`output_data/titles.csv` now holds one row per page, with a sequential `id` and the page `title`.

## Links to pages

We assume that a basic link to a Wikivoyage page looks like this:

```
https://en.wikivoyage.org/wiki/North_America
```
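Titles in this dataset are not always plain ASCII (for example `A Coruña` or `Moscow (Idaho)`), so replacing spaces alone may not give a clean URL. The sketch below is one way to build the link under the assumption stated above: spaces become underscores and the remaining characters are percent-encoded. `title_to_url` and `BASE_URL` are hypothetical names introduced here for illustration.

```python
from urllib.parse import quote

# Assumed base of every English Wikivoyage page link.
BASE_URL = "https://en.wikivoyage.org/wiki/"


def title_to_url(title: str) -> str:
    """Turn a page title into a URL: spaces become underscores, the rest is percent-encoded."""
    slug = title.replace(" ", "_")
    # Keep "/" unescaped so subpage titles such as "Moscow/Outskirts" stay readable.
    return BASE_URL + quote(slug, safe="/")


example_url = title_to_url("A Coruña")
```

Plain titles such as `Avoiding travel through the United States` come out the same as with a simple `replace(" ", "_")`.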

In [27]:
def check_url(url):
    try:
        response = requests.head(url, allow_redirects=True, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

In [28]:
titles[0].replace(" ", "_")

'Avoiding_travel_through_the_United_States'

In [29]:
title = titles[0].replace(" ", "_")

link_template = f"https://en.wikivoyage.org/wiki/{title}"
link_template

'https://en.wikivoyage.org/wiki/Avoiding_travel_through_the_United_States'

In [30]:
link_status = []

In [31]:
titles[:5]

['Avoiding travel through the United States',
 'E1 Long Distance Path',
 'Ferries in the Caspian Sea',
 'Longobard sites',
 'Caribbean Nicaragua']

In [32]:
for title in titles[:10]:
    # Single quotes inside the f-string keep this valid on Python versions before 3.12.
    url = f"https://en.wikivoyage.org/wiki/{title.replace(' ', '_')}"
    print(url)
    link_status.append(check_url(url))

https://en.wikivoyage.org/wiki/Avoiding_travel_through_the_United_States
https://en.wikivoyage.org/wiki/E1_Long_Distance_Path
https://en.wikivoyage.org/wiki/Ferries_in_the_Caspian_Sea
https://en.wikivoyage.org/wiki/Longobard_sites
https://en.wikivoyage.org/wiki/Caribbean_Nicaragua
https://en.wikivoyage.org/wiki/Home_exchange
https://en.wikivoyage.org/wiki/The_Most_Beautiful_Villages_of_France
https://en.wikivoyage.org/wiki/Driving_in_Denmark
https://en.wikivoyage.org/wiki/First_and_business_class_flights
https://en.wikivoyage.org/wiki/Attractions


In [33]:
url = "https://en.wikivoyage.org/wiki/Avoiding_travel_through_the_United_States"

In [34]:
response = requests.head(url, allow_redirects=True, timeout=5)

In [35]:
response

<Response [403]>

In [36]:
response = requests.get(url)

In [37]:
response

<Response [403]>

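The links open in a browser, but both `requests.head` and `requests.get` return 403 Forbidden. A likely cause is that Wikimedia sites reject requests that do not send a descriptive `User-Agent` header.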
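One way to test whether the 403 comes from the client's default identification is to repeat the check with a descriptive `User-Agent` header. The sketch below uses the standard library's `urllib.request` instead of `requests` so that it has no extra dependencies. The `USER_AGENT` string, `build_head_request`, and `check_url_with_agent` are hypothetical names, and it is an assumption, not a confirmed fix, that the header resolves the 403.

```python
import urllib.request

# Hypothetical identifier; replace with your own project name and contact details.
USER_AGENT = "wikivoyage-notebook/0.1 (contact: you@example.com)"


def build_head_request(url: str) -> urllib.request.Request:
    """Build a HEAD request that carries a descriptive User-Agent header."""
    return urllib.request.Request(
        url, headers={"User-Agent": USER_AGENT}, method="HEAD"
    )


def check_url_with_agent(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on HTTP or network errors."""
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib's HTTPError and URLError, as well as timeouts, are OSError subclasses.
        return False
```

With `requests`, the same header can be passed directly, for example `requests.head(url, headers={"User-Agent": USER_AGENT}, allow_redirects=True, timeout=5)`.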