## Retrieving Wikipedia data and metadata

## Defining the tokenization function

In [19]:
import nltk
from nltk.tokenize import word_tokenize

# Ensure you have the necessary NLTK resource downloaded
nltk.download('punkt')
nltk.download('punkt_tab')

def nb_tokens(text):
    # word_tokenize counts punctuation marks as tokens; a different tokenizer can be substituted here
    tokens = word_tokenize(text)
    return len(tokens)

[nltk_data] Downloading package punkt to /home/ongin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/ongin/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
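
As the comment in `nb_tokens` suggests, the tokenizer can be swapped out. The following is a minimal sketch, assuming a simple regular-expression split is acceptable when the NLTK resources are not available. The name `nb_tokens_regex` is hypothetical and not part of the original notebook, and its counts will generally differ from those of `word_tokenize` (for example on decimal numbers and contractions).

```python
import re

# Hypothetical alternative to nb_tokens (an assumption, not the notebook's method):
# counts runs of word characters and individual punctuation marks,
# without downloading any NLTK data.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def nb_tokens_regex(text):
    # findall returns every non-overlapping match, i.e. one entry per token
    return len(_TOKEN_RE.findall(text))

print(nb_tokens_regex("A solar cell converts light into electricity."))
```

Such a counter is only a rough estimate; if token counts need to match a specific language model, that model's own tokenizer would be the better choice.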


## Creating a Wikipedia API instance

In [20]:
import wikipediaapi
wikipedia = wikipediaapi.Wikipedia(language='en', user_agent='KnowledgeGraph/0.0.2 (honglin@duck.com)')

topic = "Solar_cell"
filename = "Solar_cell"

### Root page summary

In [21]:
import textwrap

page = wikipedia.page(topic)

if page.exists():
    print(f"Title: {page.title}")
    print(f"Summary: {textwrap.fill(page.summary, width=80)}")
    print(f"Number of tokens: {nb_tokens(page.summary)}")
else:
    print("Page not found!")

Title: Solar cell
Summary: A solar cell, also known as a photovoltaic cell (PV cell), is an electronic
device that converts the energy of light directly into electricity by means of
the photovoltaic effect. It is a type of photoelectric cell, a device whose
electrical characteristics (such as current, voltage, or resistance) vary when
it is exposed to light. Individual solar cell devices are often the electrical
building blocks of photovoltaic modules, known colloquially as "solar panels".
Almost all commercial PV cells consist of crystalline silicon, with a market
share of 95%. Cadmium telluride thin-film solar cells account for the remainder.
The common single-junction silicon solar cell can produce a maximum open-circuit
voltage of approximately 0.5 to 0.6 volts. Photovoltaic cells may operate under
sunlight or artificial light. In addition to producing solar power, they can be
used as a photodetector (for example infrared detectors), to detect light or
other electromagnetic radiati

### URLs and Citations

In [22]:
print(page.fullurl)

# Get all the links on the page
links = page.links
print(f"Number of links: {len(links)}")

# Print the first 10 links with a summary
for i, (title, link) in enumerate(links.items()):
    if i >= 10:
        break
    print(f"Title: {title}")
    print(f"URL: {link.fullurl}")
    print(f"Summary: {textwrap.fill(link.summary, width=80)}")
    print(f"Number of tokens: {nb_tokens(link.summary)}")
    print()

https://en.wikipedia.org/wiki/Solar_cell
Number of links: 568
Title: 1973 oil crisis
URL: https://en.wikipedia.org/wiki/1973_oil_crisis
Summary: In October 1973, the Organization of Arab Petroleum Exporting Countries (OAPEC)
announced that it was implementing a total oil embargo against countries that
had supported Israel at any point during the 1973 Yom Kippur War, which began
after Egypt and Syria launched a large-scale surprise attack in an ultimately
unsuccessful attempt to recover the territories that they had lost to Israel
during the 1967 Six-Day War. In an effort that was led by Faisal of Saudi
Arabia, the initial countries that OAPEC targeted were Canada, Japan, the
Netherlands, the United Kingdom, and the United States. This list was later
expanded to include Portugal, Rhodesia, and South Africa. In March 1974, OAPEC
lifted the embargo, but the price of oil had risen by nearly 300%: from US$3 per
barrel ($19/m3) to nearly US$12 per barrel ($75/m3) globally. Prices in the
Unit

### Writing the citations page and collecting the URLs

In [23]:
from datetime import datetime
from tqdm import tqdm

# Get all the links on the page
links = page.links
maxl = min(len(page.links), 50) # Maximum number of links to retrieve.

# Prepare a file to store the outputs
fname = filename + "_citations.txt"
with open(fname, "w", encoding="utf-8") as file:
    # Write the citation header
    file.write(f"# Citations for {topic} in Wikipedia.\n")
    file.write("Root page: " + page.fullurl + "\n\n")
    counter = 0
    urls = []
    urls.append(page.fullurl)

    # Loop through the links and write the summary
    for link in tqdm(list(links)[:maxl], desc="Processing links"):
        page_detail = wikipedia.page(link)
        # Skip pages that do not exist (wikipediaapi reports this via exists())
        if not page_detail.exists():
            continue
        counter += 1  # Count only the pages that are actually written
        summary = page_detail.summary
        file.write(f"## {counter}. {link}\n")
        file.write(f"URL: {page_detail.fullurl}\n")
        file.write(f"Summary: {textwrap.fill(summary, width=80)}\n")
        file.write("\n")
        urls.append(page_detail.fullurl)

    # Write the footer with the date, time and metadata
    file.write(f"Total links processed: {counter}\n")
    file.write(f"Date and time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")

Processing links: 100%|██████████| 50/50 [01:26<00:00,  1.72s/it]
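
The layout of each entry in the citations file can be checked without any network access by building the entry as a string first. This is a minimal sketch under that assumption: `format_citation_entry` is a hypothetical helper that does not appear in the notebook, and the summary passed to it below is placeholder text, not a real Wikipedia summary.

```python
import textwrap

def format_citation_entry(counter, title, url, summary, width=80):
    """Return one citation entry in the same layout the loop above writes."""
    return (
        f"## {counter}. {title}\n"
        f"URL: {url}\n"
        f"Summary: {textwrap.fill(summary, width=width)}\n"
        "\n"
    )

# Placeholder summary for illustration only
entry = format_citation_entry(
    1, "Albedo", "https://en.wikipedia.org/wiki/Albedo", "Placeholder summary text."
)
print(entry)
```

In the loop, the four `file.write` calls could then be replaced by a single `file.write(format_citation_entry(counter, link, page_detail.fullurl, summary))`, which keeps the formatting in one place.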


In [24]:
# Write URLs to a file
ufname = filename + "_urls.txt"
with open(ufname, "w") as file:
    for url in urls:
        file.write(url + "\n")
# Read URLs from the file
with open(ufname, "r") as file:
    urls = [line.strip() for line in file]
print("Read URLs:")
for url in urls:
    print(url)

Read URLs:
https://en.wikipedia.org/wiki/Solar_cell
https://en.wikipedia.org/wiki/1973_oil_crisis
https://en.wikipedia.org/wiki/ARCO
https://en.wikipedia.org/wiki/Absorption_(electromagnetic_radiation)
https://en.wikipedia.org/wiki/Acrylate_polymer
https://en.wikipedia.org/wiki/Albedo
https://en.wikipedia.org/wiki/Albert_Einstein
https://en.wikipedia.org/wiki/Aleksandr_Stoletov
https://en.wikipedia.org/wiki/Alkaline_battery
https://en.wikipedia.org/wiki/Alternating_current
https://en.wikipedia.org/wiki/Aluminium%E2%80%93air_battery
https://en.wikipedia.org/wiki/American_Solar_Challenge
https://en.wikipedia.org/wiki/Amorphous_silicon
https://en.wikipedia.org/wiki/Ion
https://en.wikipedia.org/wiki/Anita_Ho-Baillie
https://en.wikipedia.org/wiki/Anode
https://en.wikipedia.org/wiki/Anomalous_photovoltaic_effect
https://en.wikipedia.org/wiki/Anti-reflective_coating
https://en.wikipedia.org/wiki/Antonio_Luque
https://en.wikipedia.org/wiki/ArXiv
https://en.wikipedia.org/wiki/Atmospheric_pressu