## Retrieving Wikipedia data

## Defining the tokenization function

In [1]:
import os
import nltk
from nltk.tokenize import word_tokenize

# Make sure the required NLTK tokenizer resources are downloaded
nltk.download('punkt')
nltk.download('punkt_tab')

def nb_tokens(text):
    # Count NLTK word tokens (punctuation marks count as separate tokens);
    # a more sophisticated tokenizer could be swapped in here
    tokens = word_tokenize(text)
    return len(tokens)

[nltk_data] Downloading package punkt to /home/ongin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/ongin/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Creating a Wikipedia API instance

In [None]:
import wikipediaapi
wikipedia = wikipediaapi.Wikipedia(language='en', user_agent='KnowledgeGraph/0.0.2 (honglin@duck.com)')

topic = "Solar-cell_efficiency"

### Root page summary

In [3]:
import textwrap

page = wikipedia.page(topic)

if page.exists():
    print(f"Title: {page.title}")
    print(f"Summary: {textwrap.fill(page.summary, width=80)}\n")
    print(f"Number of tokens: {nb_tokens(page.summary)}")
else:
    print("Page not found!")

Title: Photovoltaics
Summary: Photovoltaics (PV) is the conversion of light into electricity using
semiconducting materials that exhibit the photovoltaic effect, a phenomenon
studied in physics, photochemistry, and electrochemistry. The photovoltaic
effect is commercially used for electricity generation and as photosensors. A
photovoltaic system employs solar modules, each comprising a number of solar
cells, which generate electrical power. PV installations may be ground-mounted,
rooftop-mounted, wall-mounted or floating. The mount may be fixed or use a solar
tracker to follow the sun across the sky. Photovoltaic technology helps to
mitigate climate change because it emits much less carbon dioxide than fossil
fuels. Solar PV has specific advantages as an energy source: once installed, its
operation does not generate any pollution or any greenhouse gas emissions; it
shows scalability in respect of power needs and silicon has large availability
in the Earth's crust, although other materi

### Collecting URLs and fetching content into documents

In [4]:
print(page.fullurl)

# Get all the links on the page
links = page.links
print(f"Number of links: {len(links)}")

def safe_file_name(s):
    # Replace spaces with underscores
    s = s.replace(' ', '_')
    # Keep only letters, digits, and a few characters that are safe in file names
    return ''.join(c for c in s if c.isalpha() or c.isdigit() or c in ['.', '_', '-'])

def file_exists_and_has_content(file_path):
    # True only if the file exists and is not empty
    return os.path.isfile(file_path) and os.path.getsize(file_path) > 0

# Directory to store the output file
output_dir = './documents/'
os.makedirs(output_dir, exist_ok=True)

def save_document(page):
    file_name = safe_file_name(page.title)
    file_path = os.path.join(output_dir, f"{file_name}.txt")
    
    # Skip pages that were already saved by an earlier run
    if file_exists_and_has_content(file_path):
        return

    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(f"Title: {page.title}\n\n")
        file.write(f"URL: {page.fullurl}\n\n")
        file.write(f"Content:\n{page.text}\n")

# Save topic document
save_document(page)

https://en.wikipedia.org/wiki/Photovoltaics
Number of links: 596
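
The file-name cleaning step is easier to judge with a few concrete inputs. The sketch below restates `safe_file_name` so that it runs on its own and applies it to some illustrative titles (the titles are assumptions chosen for the example, not entries from the fetched link list).

```python
def safe_file_name(s):
    # Same logic as the notebook cell: spaces become underscores, then
    # anything other than letters, digits, '.', '_' and '-' is dropped
    s = s.replace(' ', '_')
    return ''.join(c for c in s if c.isalpha() or c.isdigit() or c in ['.', '_', '-'])

# Illustrative titles only (assumed for this example)
examples = ["Solar-cell efficiency", "Watt (unit)", "AC/DC", "ACDC"]
names = {title: safe_file_name(title) for title in examples}

for title, name in names.items():
    print(f"{title!r} -> {name!r}")
```

Note that distinct titles can collapse to the same file name (here `AC/DC` and `ACDC`). Because `save_document` returns early when the target file already has content, the second such page would be skipped rather than saved.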


In [5]:

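from tqdm import tqdm
import time

# Fetch every page linked from the root page and save it to disk,
# pausing between requests to avoid hammering the Wikipedia API
for link in tqdm(sorted(links), desc="Fetching Wikipedia articles"):
    page = wikipedia.page(link)
    if page.exists() and page.fullurl:
        save_document(page)
        time.sleep(0.5)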

Fetching Wikipedia articles: 100%|██████████| 596/596 [12:25<00:00,  1.25s/it]
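
Once the documents are on disk, they can be read back by relying on the `Title:` / `URL:` / `Content:` layout that `save_document` writes. The following is a minimal sketch under that assumption. `load_document` is a hypothetical helper, not part of the notebook, and the sample file is made up so the example runs without the fetched data.

```python
import os
import tempfile

def load_document(file_path):
    # Hypothetical helper: parse the "Title / URL / Content" layout
    # produced by save_document
    with open(file_path, 'r', encoding='utf-8') as file:
        raw = file.read()
    # Everything before the first "Content:" line is the header
    header, _, content = raw.partition("Content:\n")
    fields = {}
    for line in header.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return {
        "title": fields.get("Title"),
        "url": fields.get("URL"),
        "text": content.rstrip("\n"),
    }

# Round trip on a made-up document written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "Example_page.txt")
    with open(sample_path, 'w', encoding='utf-8') as file:
        file.write("Title: Example page\n\n")
        file.write("URL: https://en.wikipedia.org/wiki/Example_page\n\n")
        file.write("Content:\nFirst paragraph.\nSecond paragraph.\n")
    doc = load_document(sample_path)

print(doc["title"])
print(doc["url"])
print(doc["text"])
```

The returned `text` field could then be passed to `nb_tokens` to measure the size of each saved article.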
