## Retrieving Wikipedia data and metadata

In [1]:
try:
    import wikipediaapi
except ImportError:
    print("Please install wikipedia-api using pip install wikipedia-api")

## Defining the tokenization function

In [2]:
import nltk
from nltk.tokenize import word_tokenize

# Ensure you have the necessary NLTK resource downloaded
nltk.download('punkt')

def nb_tokens(text):
    # word_tokenize counts punctuation marks as separate tokens;
    # swap in a different tokenizer here if another counting rule is needed
    tokens = word_tokenize(text)
    return len(tokens)

[nltk_data] Downloading package punkt to /home/ongin/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
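
The comment in `nb_tokens` notes that other tokenization strategies could be used. As a minimal sketch, assuming a rough count is good enough and no NLTK resources are available, a regular expression can split words from punctuation. The name `nb_tokens_regex` is hypothetical and not part of the original notebook, and its counts will not always match `word_tokenize`.

```python
import re

# Hypothetical alternative to nb_tokens (not in the original notebook).
# A token is either a run of word characters or a single punctuation mark.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def nb_tokens_regex(text):
    """Return a rough token count without downloading any NLTK data."""
    return len(_TOKEN_RE.findall(text))

print(nb_tokens_regex("Marketing is the act of satisfying customers."))  # 8
```

This variant is useful mainly as a fallback or a sanity check; contractions and abbreviations are split differently than in NLTK.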


## Creating a Wikipedia API instance

In [3]:
wikipedia = wikipediaapi.Wikipedia(language='en', user_agent='KnowledgeGraph/0.0.1 (honglin@duck.com)')

topic = "Marketing"
filename = "Marketing"
maxl = 140 # Maximum number of links to retrieve.

### Root page summary

In [4]:
import textwrap
# Recent NLTK releases load the punkt_tab resource in word_tokenize
nltk.download('punkt_tab')
page = wikipedia.page(topic)

if page.exists():
    print(f"Title: {page.title}")
    print(f"Summary: {textwrap.fill(page.summary, width=80)}")
    print(f"Number of tokens: {nb_tokens(page.summary)}")
else:
    print("Page not found!")

[nltk_data] Downloading package punkt_tab to /home/ongin/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Title: Marketing
Summary: Marketing is the act of satisfying and retaining customers. It is one of the
primary components of business management and commerce. Marketing is usually
conducted by the seller, typically a retailer or manufacturer. Products can be
marketed to other businesses (B2B) or directly to consumers (B2C). Sometimes
tasks are contracted to dedicated marketing firms, like a media, market
research, or advertising agency. Sometimes, a trade association or government
agency (such as the Agricultural Marketing Service) advertises on behalf of an
entire industry or locality, often a specific type of food (e.g. Got Milk?),
food from a specific area, or a city or region as a tourism destination. Market
orientations are philosophies concerning the factors that should go into market
planning. The marketing mix, which outlines the specifics of the product and how
it will be sold, including the channels that will be used to advertise the
product, is affected by the environment su

### URLs and Citations

In [5]:
print(page.fullurl)

# Get all the links on the page
links = page.links
print(f"Number of links: {len(links)}")

# Print the first 10 links with a summary
for i, (title, link) in enumerate(links.items()):
    if i >= 10:
        break
    print(f"Title: {title}")
    print(f"URL: {link.fullurl}")
    print(f"Summary: {textwrap.fill(link.summary, width=80)}")
    print(f"Number of tokens: {nb_tokens(link.summary)}")
    print()

https://en.wikipedia.org/wiki/Marketing
Number of links: 405
Title: 24-hour news cycle
URL: https://en.wikipedia.org/wiki/24-hour_news_cycle
Summary: The 24-hour news cycle (or 24/7 news cycle) is the 24-hour investigation and
reporting of news, concomitant with fast-paced lifestyles. The vast news
resources available in recent decades have increased competition for audience
and advertiser attention, prompting media providers to deliver the latest news
in the most compelling manner in order to remain ahead of competitors.
Television, radio, print, online and mobile app news media all have many
suppliers that want to be relevant to their audiences and deliver news first. A
complete news cycle consists of the media reporting on some event, followed by
the media reporting on public and other reactions to the earlier reports. The
advent of 24-hour cable and satellite television news channels and, in more
recent times, of news sources on the World Wide Web (including blogs),
considerably sh
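
The preview cell above stops after ten links by counting with `enumerate` and breaking. A minimal sketch of the same idea with `itertools.islice`, assuming `page.links` behaves like an ordinary dictionary of title to page object (as it is used above). The helper `preview_links` and the stand-in dictionary are hypothetical, so the example runs without network access.

```python
from itertools import islice

def preview_links(links, limit=10):
    """Return up to `limit` (title, page) pairs without scanning the rest."""
    return list(islice(links.items(), limit))

# Stand-in for page.links: a plain dict with placeholder values
sample_links = {f"Page {i}": f"page-object-{i}" for i in range(405)}

preview = preview_links(sample_links, limit=3)
print([title for title, _ in preview])  # ['Page 0', 'Page 1', 'Page 2']
```

Limiting the preview matters here because reading `summary` or `fullurl` on a linked page triggers a request for that page.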

### Writing the citations page and collecting the URLs

In [6]:
from datetime import datetime

# Get all the links on the page
links = page.links

# Prepare a file to store the outputs
fname = filename + "_citations.txt"
with open(fname, "w", encoding="utf-8") as file:  # summaries contain non-ASCII text
    # Write the citation header
    file.write(f"# Citations for {topic} in Wikipedia.\n")
    file.write("Root page: " + page.fullurl + "\n\n")
    counter = 0
    urls = []
    urls.append(page.fullurl)
    
    # Loop through the linked pages and write each summary
    for title, page_detail in links.items():
        # The wikipediaapi client object has no `exceptions` attribute;
        # check exists() to detect a missing page instead
        if not page_detail.exists():
            continue  # Skip if the page does not exist

        # Count only the pages that are actually written
        counter += 1
        file.write(f"## {counter}. {title}\n")
        file.write(f"URL: {page_detail.fullurl}\n")
        file.write(f"Summary: {textwrap.fill(page_detail.summary, width=80)}\n")
        file.write("\n")
        urls.append(page_detail.fullurl)

        # Limit the number of links to maxl
        if counter >= maxl:
            break
    
    # Write the footer with the date, time and metadata
    file.write(f"Total links processed: {counter}\n")
    file.write(f"Date and time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
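
Each entry in the citations file has the same shape: a numbered heading, a URL line, and a wrapped summary followed by a blank line. A minimal sketch, assuming it is convenient to build that text in one place so it can be checked without calling Wikipedia. The function `format_citation` is a hypothetical helper, not part of the original notebook.

```python
import textwrap

def format_citation(index, title, url, summary, width=80):
    """Build one citation entry in the layout used by the citations file."""
    return (
        f"## {index}. {title}\n"
        f"URL: {url}\n"
        f"Summary: {textwrap.fill(summary, width=width)}\n"
        "\n"
    )

entry = format_citation(
    1,
    "Advertising",
    "https://en.wikipedia.org/wiki/Advertising",
    "Placeholder summary text for illustration.",
)
print(entry)
```

Inside the loop, a single `file.write(format_citation(...))` call could then replace the four separate writes.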


In [7]:
# Write URLs to a file
ufname = filename + "_urls.txt"
with open(ufname, "w") as file:
    for url in urls:
        file.write(url + "\n")
# Read URLs from the file
with open(ufname, "r") as file:
    urls = [line.strip() for line in file]
print("Read URLs:")
for url in urls:
    print(url)

Read URLs:
https://en.wikipedia.org/wiki/Marketing
https://en.wikipedia.org/wiki/24-hour_news_cycle
https://en.wikipedia.org/wiki/Account-based_marketing
https://en.wikipedia.org/wiki/Activism
https://en.wikipedia.org/wiki/Adam_Smith
https://en.wikipedia.org/wiki/Adam_Smith_Institute
https://en.wikipedia.org/wiki/Advertising
https://en.wikipedia.org/wiki/Advertising_agency
https://en.wikipedia.org/wiki/Advertising_mail
https://en.wikipedia.org/wiki/Advertising_management
https://en.wikipedia.org/wiki/Advertising_slogan
https://en.wikipedia.org/wiki/Advocacy
https://en.wikipedia.org/wiki/Advocacy_group
https://en.wikipedia.org/wiki/Affinity_marketing
https://en.wikipedia.org/wiki/Agenda-setting_theory
https://en.wikipedia.org/wiki/Agile_marketing
https://en.wikipedia.org/wiki/Agricultural_Marketing_Service
https://en.wikipedia.org/wiki/Agricultural_marketing
https://en.wikipedia.org/wiki/Airborne_leaflet_propaganda
https://en.wikipedia.org/wiki/Alternative_facts
https://en.wikipedia.org