[![Open In Colab](https://github.com/MonashDataFluency/python-web-scraping/blob/master/images/colab-badge.svg?raw=1)](https://colab.research.google.com/github/MonashDataFluency/python-web-scraping/blob/master/notebooks/section-3-API-based-scraping.ipynb)

<img src="https://github.com/MonashDataFluency/python-web-scraping/blob/master/images/api.png?raw=1">

### A brief introduction to APIs
---

In this section, we look at an alternative to the pattern-based HTML scraping of the previous section. Some websites offer an API (Application Programming Interface): a high-level interface for retrieving data directly from their backend repositories or databases.

From Wikipedia,

> "*An API is typically defined as a set of specifications, such as Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.*"

APIs are typically exposed as URL endpoints to which we send requests. We adjust the request to describe what we want in the response body, and the response then carries a payload (the data), usually formatted as JSON, XML or HTML.

A popular web architecture style called `REST` (or representational state transfer) allows users to interact with web services via `GET` and `POST` calls (two most commonly used) which we briefly saw in the previous section.
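To make the idea of a URL endpoint concrete, the sketch below assembles a `GET` request URL for the MediaWiki endpoint listed in the References. The helper name `build_query_url` is illustrative, and the parameters shown (`action`, `format`, `titles`, `prop`) are just one plausible combination rather than the only way to query the API. No request is actually sent here.

```python
from urllib.parse import urlencode

# MediaWiki endpoint taken from the References list of this section
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_query_url(title):
    """Build a GET URL asking for basic page info as JSON (illustrative helper)."""
    params = {
        "action": "query",   # what we want the API to do
        "format": "json",    # desired response format
        "titles": title,     # the article we are interested in
        "prop": "info",      # which properties to return
    }
    return API_ENDPOINT + "?" + urlencode(params)

url = build_query_url("Walmart")
print(url)
```

Pasting such a URL into a browser, or fetching it from the terminal, is the first of the two approaches listed below; a wrapper library builds and sends requests like this on our behalf.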

For example, Twitter's REST API allows developers to access core Twitter data and the Search API provides methods for developers to interact with Twitter Search and trends data.

There are primarily two ways to use APIs :

- Through the command terminal using URL endpoints, or
- Through programming language specific *wrappers*

For example, `Tweepy` is a popular Python wrapper for the Twitter API, whereas `twurl` is a command line interface (CLI) tool; both can achieve the same outcomes.

Here we focus on the latter approach and will use a Python library (a wrapper) called `wptools` based around the original MediaWiki API.

One advantage of using official APIs is that they are usually compliant with the terms of service (ToS) of the service that researchers want to gather data from. However, third-party libraries or packages which claim to provide more throughput than the official APIs (higher rate limits, more requests per second) generally operate in a gray area, as they tend to violate the ToS. Always be sure to read their documentation thoroughly.

### Wikipedia API
---

Let's say we want to gather some additional data about the Fortune 500 companies. Since Wikipedia is a rich source of data, we decide to use the MediaWiki API to scrape it. A very good place to start is the **infoboxes** (as Wikipedia calls them) of the articles corresponding to each company on the list. They contain a wealth of metadata about the entity the article describes, which in our case is a company.

For example, consider the Wikipedia article for **Walmart** (https://en.wikipedia.org/wiki/Walmart), which includes the following infobox :

![An infobox](https://github.com/MonashDataFluency/python-web-scraping/blob/master/images/infobox.png?raw=1)

As we can see from above, the infoboxes could provide us with a lot of valuable information such as :

- Year of founding
- Industry
- Founder(s)
- Products
- Services
- Operating income
- Net income
- Total assets
- Total equity
- Number of employees etc

Although we expect this data to be fairly organized, it would require some post-processing which we will tackle in our next section. We pick a subset of our data and focus only on the top **20** of the Fortune 500 from the full list.

Let's begin by installing some of the libraries we will use for this exercise as follows,

In [1]:
# sudo apt install libcurl4-openssl-dev libssl-dev
!pip install wptools
!pip install wikipedia
!pip install wordcloud

Collecting wptools
  Downloading wptools-0.4.17-py2.py3-none-any.whl.metadata (14 kB)
Collecting html2text (from wptools)
  Downloading html2text-2024.2.26.tar.gz (56 kB)
  Preparing metadata (setup.py) ... done
Collecting pycurl (from wptools)
  Downloading pycurl-7.45.3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (4.3 kB)
Downloading wptools-0.4.17-py2.py3-none-any.whl (38 kB)
Downloading pycurl-7.45.3-cp310-cp310-manylinux_2_28_x86_64.whl (4.6 MB)
Building wheels for collected packages: html2text
  Building wheel for html2text (setup.py) ... done
  Created wheel for html2text: filename=html2text-2024.2.26-py3-none-any.whl size=33111 sha256=a1041a729c25f4e549f6395c720fc077c59cf7b23a6c465d4bfbdf9eb494b55a
  Stored

Importing the same,

In [2]:
import json
import wptools
import wikipedia
import pandas as pd

print('wptools version : {}'.format(wptools.__version__)) # checking the installed version

wptools version : 0.4.17


Now let's load the data which we scraped in the previous section as follows,

In [4]:
# If you don't have the file, you can use the code below to fetch it:
import urllib.request
url = 'https://raw.githubusercontent.com/MonashDataFluency/python-web-scraping/master/data/fortune_500_companies.csv'
urllib.request.urlretrieve(url, 'fortune_500_companies.csv')

('fortune_500_companies.csv', <http.client.HTTPMessage at 0x7b021bff1ba0>)

In [17]:
fname = 'fortune_500_companies.csv' # scraped data from previous section
df = pd.read_csv(fname)             # reading the csv file as a pandas df
df.head(15)                         # displaying the first 15 rows

Unnamed: 0,rank,company_name,company_website
0,1,Walmart,http://www.stock.walmart.com
1,2,Exxon Mobil,http://www.exxonmobil.com
2,3,Berkshire Hathaway,http://www.berkshirehathaway.com
3,4,Apple,http://www.apple.com
4,5,UnitedHealth Group,http://www.unitedhealthgroup.com
5,6,McKesson,http://www.mckesson.com
6,7,CVS Health,http://www.cvshealth.com
7,8,Amazon.com,http://www.amazon.com
8,9,AT&T,http://www.att.com
9,10,General Motors,http://www.gm.com


|    |   rank | company_name       | company_website                  |
|---:|-------:|:-------------------|:---------------------------------|
|  0 |      1 | Walmart            | http://www.stock.walmart.com     |
|  1 |      2 | Exxon Mobil        | http://www.exxonmobil.com        |
|  2 |      3 | Berkshire Hathaway | http://www.berkshirehathaway.com |
|  3 |      4 | Apple              | http://www.apple.com             |
|  4 |      5 | UnitedHealth Group | http://www.unitedhealthgroup.com |


Let's focus and select only the top 20 companies from the list as follows,

In [6]:
no_of_companies = 20                         # number of companies we are interested in
df_sub = df.iloc[:no_of_companies, :].copy() # only selecting the top 20 companies
companies = df_sub['company_name'].tolist()  # converting the column to a list

Taking a brief look at the same,

In [7]:
for i, j in enumerate(companies):   # looping through the list of 20 companies
    print('{}. {}'.format(i+1, j))  # printing out the same

1. Walmart
2. Exxon Mobil
3. Berkshire Hathaway
4. Apple
5. UnitedHealth Group
6. McKesson
7. CVS Health
8. Amazon.com
9. AT&T
10. General Motors
11. Ford Motor
12. AmerisourceBergen
13. Chevron
14. Cardinal Health
15. Costco
16. Verizon
17. Kroger
18. General Electric
19. Walgreens Boots Alliance
20. JPMorgan Chase


### Getting article names from wiki

As you might have guessed, one issue with matching the top 20 Fortune 500 companies to their Wikipedia article names is that the two will not always match character for character; there will be slight variations in the names.

To overcome this problem and ensure that every company name is paired with its corresponding Wikipedia article, we will use the `wikipedia` package to get suggested article titles for each company name.

In [8]:
wiki_search = [{company : wikipedia.search(company)} for company in companies]

Inspecting the same,

In [9]:
for idx, company in enumerate(wiki_search):
    for i, j in company.items():
        print('{}. {} :\n{}'.format(idx+1, i ,', '.join(j)))
        print('\n')

1. Walmart :
Walmart, Criticism of Walmart, History of Walmart, Walmart shooting, Walmarting, Walmart (disambiguation), List of assets owned by Walmart, List of Walmart brands, Walmart Canada, Walmart de México y Centroamérica


2. Exxon Mobil :
ExxonMobil, Mobil, History of ExxonMobil, Criticism of ExxonMobil, ExxonMobil climate change denial, ExxonMobil Nigeria, Exxon Valdez oil spill, Esso, ExxonMobil Building, Rex Tillerson


3. Berkshire Hathaway :
Berkshire Hathaway, Berkshire Hathaway Energy, List of assets owned by Berkshire Hathaway, Berkshire Hathaway Assurance, Warren Buffett, List of Berkshire Hathaway publications, Charlie Munger, Ajit Jain, Duracell, The World's Billionaires


4. Apple :
Apple, Apple Inc., Apple (disambiguation), Apple silicon, Apples to Apples, Apple Watch, Apple Network Server, IPhone, Custard apple, Apple TV


5. UnitedHealth Group :
UnitedHealth Group, Optum, Richard T. Burke, Change Healthcare, Pharmacy benefit management, Andrew Witty, Stephen J. He

Now let's get the most probable ones (the first suggestion) for each of the first 20 companies on the Fortune 500 list,

In [10]:
most_probable = [(company, wiki_search[i][company][0]) for i, company in enumerate(companies)]
companies = [x[1] for x in most_probable]

print(most_probable)

[('Walmart', 'Walmart'), ('Exxon Mobil', 'ExxonMobil'), ('Berkshire Hathaway', 'Berkshire Hathaway'), ('Apple', 'Apple'), ('UnitedHealth Group', 'UnitedHealth Group'), ('McKesson', 'McKesson Corporation'), ('CVS Health', 'CVS Health'), ('Amazon.com', 'Amazon (company)'), ('AT&T', 'AT&T'), ('General Motors', 'General Motors'), ('Ford Motor', 'Ford Motor Company'), ('AmerisourceBergen', 'Cencora'), ('Chevron', 'Chevron Corporation'), ('Cardinal Health', 'Cardinal Health'), ('Costco', 'Costco'), ('Verizon', 'Verizon'), ('Kroger', 'Kroger'), ('General Electric', 'General Electric'), ('Walgreens Boots Alliance', 'Walgreens Boots Alliance'), ('JPMorgan Chase', 'JPMorgan Chase')]
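Indexing with `[0]` assumes that every search returned at least one suggestion. A minimal sketch of a more defensive choice follows, assuming we are happy to fall back to the original company name when the list is empty. The helper `pick_article_title` and the sample dictionary (including the made-up company with no results) are illustrative and not part of the `wikipedia` package:

```python
def pick_article_title(company, suggestions):
    """Return the first suggestion, or the company name itself if there is none."""
    return suggestions[0] if suggestions else company

# Hand-written sample in the same shape as the search results above;
# 'Some Unknown Co' is a hypothetical name standing in for an empty search
sample_search = {
    'Exxon Mobil': ['ExxonMobil', 'Mobil', 'History of ExxonMobil'],
    'Some Unknown Co': [],
}

titles = [pick_article_title(name, hits) for name, hits in sample_search.items()]
print(titles)  # ['ExxonMobil', 'Some Unknown Co']
```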


Most of the suggested article titles make sense. However, **Apple** is ambiguous, as it can refer to the fruit as well as the company. The second suggestion returned by the search was **Apple Inc.**, so we manually replace the title with **Apple Inc.** as follows,

In [11]:
companies[companies.index('Apple')] = 'Apple Inc.' # replacing "Apple"
print(companies) # final list of wikipedia article titles

['Walmart', 'ExxonMobil', 'Berkshire Hathaway', 'Apple Inc.', 'UnitedHealth Group', 'McKesson Corporation', 'CVS Health', 'Amazon (company)', 'AT&T', 'General Motors', 'Ford Motor Company', 'Cencora', 'Chevron Corporation', 'Cardinal Health', 'Costco', 'Verizon', 'Kroger', 'General Electric', 'Walgreens Boots Alliance', 'JPMorgan Chase']
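If more than one title needed correcting, the same manual replacement could be expressed as a lookup table. This is only a sketch: the `overrides` dictionary is an assumption for illustration, and its single entry mirrors the **Apple** fix above.

```python
# Manual corrections: ambiguous search result -> intended article title
overrides = {'Apple': 'Apple Inc.'}

# A short excerpt of the titles picked from the search suggestions
titles = ['Walmart', 'ExxonMobil', 'Berkshire Hathaway', 'Apple']

# Keep each title unless an override exists for it
titles = [overrides.get(title, title) for title in titles]
print(titles)  # ['Walmart', 'ExxonMobil', 'Berkshire Hathaway', 'Apple Inc.']
```

Unlike `list.index`, this approach does not raise an error when a title to be replaced is absent from the list.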


### Retrieving the infoboxes

Now that we have mapped the names of the companies to their corresponding Wikipedia articles, let's retrieve the infobox data from those pages.

`wptools` provides easy-to-use methods that call the MediaWiki API on our behalf and fetch the Wikipedia data. Let's try retrieving data for **Walmart** as follows,

In [12]:
page = wptools.page('Walmart')
page.get_parse()    # parses the wikipedia article

en.wikipedia.org (parse) Walmart
en.wikipedia.org (imageinfo) File:Walmart 7.jpg
Walmart (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Walmart 7...
  infobox: <dict(28)> name, logo, image, image_caption, former_nam...
  iwlinks: <list(2)> https://commons.wikimedia.org/wiki/Category:W...
  pageid: 33589
  parsetree: <str(423921)> <root><template><title>Short descriptio...
  requests: <list(2)> parse, imageinfo
  title: Walmart
  wikibase: Q483551
  wikidata_url: https://www.wikidata.org/wiki/Q483551
  wikitext: <str(355676)> {{Short description|American multination...
}


<wptools.page.WPToolsPage at 0x7b021728d930>

As we can see from the output above, `wptools` successfully retrieved the Wikipedia and Wikidata information corresponding to the query **Walmart**. Now inspecting the fetched attributes,

In [13]:
page.data.keys()

dict_keys(['requests', 'iwlinks', 'pageid', 'wikitext', 'parsetree', 'infobox', 'title', 'wikibase', 'wikidata_url', 'image'])

The attribute **infobox** contains the data we require,

In [14]:
page.data['infobox']

{'name': 'Walmart Inc.',
 'logo': 'Walmart logo.svg',
 'image': 'Walmart 7.jpg',
 'image_caption': 'Walmart location in [[Onalaska, Wisconsin]]',
 'former_name': '{{ubli\n| Wal-Mart Discount City (1962–1969)\n| Wal-Mart, Inc. (1969–1970)\n| Wal-Mart Stores, Inc. (1970–2018)}}',
 'type': '[[Public company|Public]]',
 'ISIN': '{{ISIN|sl|=|n|pl|=|y|US9311421039}}',
 'industry': '[[Retail]]',
 'predecessor': "Walton's Five and Dime",
 'traded_as': '{{Unbulleted list|NYSE|WMT|[[DJIA]] component|[[S&P 100]] component|[[S&P 500]] component}} {{NYSE|WMT}}',
 'foundation': '{{Start date and age|1962|7|2}} , in [[Rogers, Arkansas]]',
 'founders': '[[Sam Walton]], [[Bud Walton]]',
 'location_city': '[[Bentonville, Arkansas]]',
 'location_country': 'United States<br/> {{Coord|36|21|56|N|94|13|03|W|region:US-AR_type:landmark|display|=|title,inline}}',
 'locations': '10,586 (2022)',
 'area_served': 'Worldwide',
 'key_people': '{{plainlist|\n* [[Greg Penner]] ([[chairman]])\n* [[Doug McMillon]] ([[Pr

Let's define a list of features that we want from the infoboxes as follows,

In [15]:
wiki_data = []
# attributes of interest contained within the wiki infoboxes
features = ['founder', 'location_country', 'revenue', 'operating_income', 'net_income', 'assets',
        'equity', 'type', 'industry', 'products', 'num_employees']

In [18]:
wiki_data

[{'founder': '',
  'location_country': 'United States<br/> {{Coord|36|21|56|N|94|13|03|W|region:US-AR_type:landmark|display|=|title,inline}}',
  'revenue': '{{nowrap| |increase| |US$|648.12 billion|link|=|yes| ([[Fiscal Year|FY]]2024)|ref| name=N|{{cite web|url=https://s201.q4cdn.com/262069030/files/doc_financials/2024/ar/2024-annual-report-pdf-final-final.pdf|publisher=Walmart|access-date=February 17, 2022|title=Walmart Annual Report 2023|archive-date=February 21, 2023|archive-url=https://web.archive.org/web/20230221154737/https://s201.q4cdn.com/262069030/files/doc_financials/2023/q4/Earnings-Release-(FY23-Q4)-(final).pdf|url-status=live}}|</ref>|}} {{increase}} {{US$|648.12 billion|link|=|yes}} ([[Fiscal Year|FY]]2024)',
  'operating_income': '{{Increase}} {{US$|27.0 billion}} (FY2024)',
  'net_income': '{{Increase}} {{US$|16.27 billion}} (FY2024)',
  'assets': '{{nowrap| |Increase| |US$|252.399 billion| (FY2024)|ref| name= N|}} {{Increase}} {{US$|252.399 billion}} (FY2024)',
  'equi

Now fetching the data for all the companies (this may take a while),

In [16]:
for company in companies:
    page = wptools.page(company) # create a page object
    try:
        page.get_parse() # call the API and parse the data
        if page.data['infobox'] is not None:
            # if infobox is present
            infobox = page.data['infobox']
            # get data for the features/attributes of interest ('' when missing)
            data = { feature : infobox.get(feature, '')
                         for feature in features }
        else:
            data = { feature : '' for feature in features }

        data['company_name'] = company
        wiki_data.append(data)

    except KeyError:
        pass

en.wikipedia.org (parse) Walmart
en.wikipedia.org (imageinfo) File:Walmart 7.jpg
Walmart (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Walmart 7...
  infobox: <dict(28)> name, logo, image, image_caption, former_nam...
  iwlinks: <list(2)> https://commons.wikimedia.org/wiki/Category:W...
  pageid: 33589
  parsetree: <str(423921)> <root><template><title>Short descriptio...
  requests: <list(2)> parse, imageinfo
  title: Walmart
  wikibase: Q483551
  wikidata_url: https://www.wikidata.org/wiki/Q483551
  wikitext: <str(355676)> {{Short description|American multination...
}
en.wikipedia.org (parse) ExxonMobil
en.wikipedia.org (imageinfo) File:Cube xom mine.png
ExxonMobil (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Cube xom ...
  infobox: <dict(26)> name, logo, image, image_caption, former_nam...
  iwlinks: <list(3)> https://commons.wikimedia.org/wiki/Category:E...
  pageid: 18848197
  parsetree: <str(108791)> <root><template><title>Short descr
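Note that the `wiki_data` output shown earlier has an empty `founder` entry for Walmart, while the Walmart infobox stores that information under the key `founders`. Infobox field names vary between articles, so one possible refinement (a sketch, not a feature of `wptools`) is to try several candidate keys per feature. The helper `pick_feature` and the alias list are assumptions made for illustration:

```python
def pick_feature(infobox, candidate_keys):
    """Return the first non-empty value among candidate_keys, else ''."""
    for key in candidate_keys:
        value = infobox.get(key)
        if value:
            return value
    return ''

# Two fields copied from the Walmart infobox shown above
sample_infobox = {
    'founders': '[[Sam Walton]], [[Bud Walton]]',
    'industry': '[[Retail]]',
}

founder = pick_feature(sample_infobox, ['founder', 'founders'])
missing = pick_feature(sample_infobox, ['num_employees'])
print(founder)
```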

Let's take a look at the first instance in `wiki_data` i.e. **Walmart**,

In [None]:
wiki_data[0]

{'founder': '[[Sam Walton]]',
 'location_country': 'U.S.',
 'revenue': '{{increase}} {{US$|523.964 billion|link|=|yes}} {{small|([[Fiscal Year|FY]] 2020)}}',
 'operating_income': '{{decrease}} {{US$|20.568 billion}} {{small|(FY 2020)}}',
 'net_income': '{{increase}} {{US$|14.881 billion}} {{small|(FY 2020)}}',
 'assets': '{{increase}} {{US$|236.495 billion}} {{small|(FY 2020)}}',
 'equity': '{{increase}} {{US$|74.669 billion}} {{small|(FY 2020)}}',
 'type': '[[Public company|Public]]',
 'industry': '[[Retail]]',
 'products': '{{hlist|Electronics|Movies and music|Home and furniture|Home improvement|Clothing|Footwear|Jewelry|Toys|Health and beauty|Pet supplies|Sporting goods and fitness|Auto|Photo finishing|Craft supplies|Party supplies|Grocery}}',
 'num_employees': '{{plainlist|\n* 2.2|nbsp|million, Worldwide (2018)|ref| name="xbrlus_1" |\n* 1.5|nbsp|million, U.S. (2017)|ref| name="Walmart"|{{cite web |url = http://corporate.walmart.com/our-story/locations/united-states |title = Walmart

So, we have successfully retrieved all the infobox data for the companies. Also we can notice that some additional wrangling and cleaning is required which we will perform in the next section.

Finally, let's export the scraped infoboxes as a single JSON file to a convenient location as follows,

In [None]:
with open('infoboxes.json', 'w') as file:
    json.dump(wiki_data, file)
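As a quick sanity check, the exported file can be read back and compared with the original list. The sketch below uses a one-item sample and a temporary directory (both assumptions, so that nothing in the working directory is overwritten). Passing `ensure_ascii=False` together with an explicit UTF-8 encoding is optional; it keeps non-ASCII characters from the infoboxes, such as the en dash in `1962–1969`, readable in the file.

```python
import json
import os
import tempfile

# One-item stand-in for wiki_data
sample = [{'company_name': 'Walmart', 'former_name': 'Wal-Mart Discount City (1962–1969)'}]

# Write to a temporary location instead of the working directory
path = os.path.join(tempfile.mkdtemp(), 'infoboxes.json')
with open(path, 'w', encoding='utf-8') as file:
    json.dump(sample, file, ensure_ascii=False, indent=2)

# Read the file back and confirm nothing was lost
with open(path, 'r', encoding='utf-8') as file:
    loaded = json.load(file)

print(loaded == sample)  # True
```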

### References

- https://phpenthusiast.com/blog/what-is-rest-api
- https://github.com/siznax/wptools/wiki/Data-captured
- https://en.wikipedia.org/w/api.php
- https://wikipedia.readthedocs.io/en/latest/code.html

### Direct Wikipedia scraping


In [21]:
import wptools
import pandas as pd

# Print wptools version
print('wptools version : {}'.format(wptools.__version__))

# Function to get Wikipedia page information using wptools
def get_wikipedia_data(person):
    try:
        # Fetch page data for the person
        page = wptools.page(person).get_parse()

        # Extract summary, fallback to wikitext if extext is missing
        summary = page.data.get('extext', page.data.get('wikitext', 'Summary not available'))
        url = page.data.get('url', f'https://en.wikipedia.org/wiki/{person.replace(" ", "_")}')

        # Return data as a dictionary
        return {
            'name': person,
            'summary': summary[:500] + '...',  # Limit summary length for readability
            'url': url
        }
    except Exception as e:
        print(f"Error fetching data for {person}: {e}")
        return None

# Option 1: User inputs names manually (Command Line Input)
def get_input_names():
    print("Enter names (separate by comma): ")
    input_names = input().split(',')
    return [name.strip() for name in input_names]  # Strip leading/trailing spaces

# Scrape data for the input names
crypto_people = get_input_names()
crypto_data = []
for person in crypto_people:
    data = get_wikipedia_data(person)
    if data:
        crypto_data.append(data)

# Convert to DataFrame
df_crypto = pd.DataFrame(crypto_data)

# Save the DataFrame to a JSON file for later use
df_crypto.to_json('crypto_people.json', orient='records', indent=4)

# Display the DataFrame
print(df_crypto)



wptools version : 0.4.17
Enter names (separate by comma): 
Vitalik Buterin


en.wikipedia.org (parse) Vitalik Buterin


              name                                            summary  \
0  Vitalik Buterin  {{Short description|Canadian programmer (born ...   

                                             url  
0  https://en.wikipedia.org/wiki/Vitalik_Buterin  


en.wikipedia.org (imageinfo) File:Vitalik Buterin TechCrunch Lond...
Vitalik Buterin (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Vitalik B...
  infobox: <dict(13)> name, native_name, native_name_lang, image, ...
  iwlinks: <list(1)> https://commons.wikimedia.org/wiki/Category:E...
  pageid: 41729424
  parsetree: <str(50693)> <root><template><title>Short description...
  requests: <list(2)> parse, imageinfo
  title: Vitalik Buterin
  wikibase: Q16197959
  wikidata_url: https://www.wikidata.org/wiki/Q16197959
  wikitext: <str(38934)> {{Short description|Canadian programmer (...
}
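The fallback URL in `get_wikipedia_data` is built by replacing spaces with underscores, which works for a title like `Vitalik Buterin`. For titles containing characters that carry meaning in a URL, such as the `&` in `AT&T`, percent-encoding may be safer. A minimal sketch using the standard library; the helper name `wikipedia_url` is illustrative:

```python
from urllib.parse import quote

def wikipedia_url(title):
    """Build an article URL: spaces become underscores, the rest is percent-encoded."""
    return 'https://en.wikipedia.org/wiki/' + quote(title.replace(' ', '_'))

print(wikipedia_url('Vitalik Buterin'))  # https://en.wikipedia.org/wiki/Vitalik_Buterin
print(wikipedia_url('AT&T'))             # https://en.wikipedia.org/wiki/AT%26T
```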


In [22]:
df_crypto

Unnamed: 0,name,summary,url
0,Vitalik Buterin,{{Short description|Canadian programmer (born ...,https://en.wikipedia.org/wiki/Vitalik_Buterin


In [47]:
import wptools
import wikipedia
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords
import pandas as pd

# Download stopwords
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [48]:
# Function to fetch Wikipedia data (full text)
def fetch_wikipedia_text(name):
    try:
        search_results = wikipedia.search(name)
        if len(search_results) == 0:
            print(f"No Wikipedia page found for {name}")
            return None

        correct_title = search_results[0]  # Take the first result
        page = wptools.page(correct_title).get_parse()

        if 'wikitext' in page.data:
            wikitext = page.data['wikitext']
            return wikitext
        else:
            print(f"No full text available for {name}")
            return None
    except Exception as e:
        print(f"Error fetching data for {name}: {e}")
        return None



In [69]:
import re

# Function to clean sentences
def clean_sentence(sentence):
    # Replace wiki links [[target]] or [[target|label]] with the link target
    cleaned = re.sub(r'\[\[([^|\]]+)(?:\|[^\]]+)?\]\]', r'\1', sentence)
    # Remove citation fields such as |first=..., |last=... and |url=...
    cleaned = re.sub(r'\|first=[^}]+|\|last=[^}]+|\|url=[^\s]+', '', cleaned)
    # Remove any remaining {{...}} templates
    cleaned = re.sub(r'\{\{[^}]+\}\}', '', cleaned)
    # Remove any leftover |token up to the next space
    cleaned = re.sub(r'\|[^ ]+\s*', '', cleaned)
    # Remove leading/trailing whitespace
    return cleaned.strip()

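# Example usage: parse "Name:/Sentence:/Keywords:" lines into a list of dicts,
# cleaning each sentence on the way (the helper name parse_entries is illustrative)
def parse_entries(lines):
    processed_data = []
    entry = {}
    for line in lines:
        if line.startswith("Name:"):
            if entry:  # If entry already has data, append it to the list
                processed_data.append(entry)
                entry = {}  # Reset entry for the next item
            entry['Name'] = line.replace("Name: ", "").strip()
        elif line.startswith("Sentence:"):
            raw_sentence = line.replace("Sentence: ", "").strip()
            entry['Sentence'] = clean_sentence(raw_sentence)  # Clean the sentence
        elif line.startswith("Keywords:"):
            entry['Keywords'] = line.replace("Keywords: ", "").strip()
    if entry:  # Append the last entry if it exists
        processed_data.append(entry)
    return processed_data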
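The link pattern in `clean_sentence` keeps the link target, so `[[Public company|Public]]` becomes `Public company`. If the label a reader actually sees is preferred, a variant along the following lines could be used instead. This is a sketch under that assumption, and `unwrap_wikilinks` is an illustrative name rather than part of the notebook:

```python
import re

def unwrap_wikilinks(text):
    """Replace [[target]] with target and [[target|label]] with label."""
    # The optional non-capturing group swallows "target|" when a label is present
    return re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]', r'\1', text)

# Values copied from the Walmart infobox shown earlier
print(unwrap_wikilinks('[[Public company|Public]]'))       # Public
print(unwrap_wikilinks('[[Sam Walton]], [[Bud Walton]]'))  # Sam Walton, Bud Walton
```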



In [70]:
# Function to extract keywords from a sentence
def extract_keywords(sentence):
    words = re.findall(r'\w+', sentence.lower())
    significant_words = [word for word in words if word not in STOPWORDS and len(word) > 2]
    keyword_counts = Counter(significant_words)
    keywords = [word for word, _ in keyword_counts.most_common(3)]
    return ', '.join(keywords) if keywords else 'N/A'
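The main script further down calls `split_text_into_sentences`, whose definition does not appear in the cells shown here. The notebook's actual implementation is unknown; the following is a minimal stand-in, assuming a plain regular-expression split on sentence-ending punctuation is good enough for exploration:

```python
import re

def split_text_into_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text)
    # Drop empty fragments and surrounding whitespace
    return [part.strip() for part in parts if part.strip()]

sentences = split_text_into_sentences(
    'Buterin was born in Russia. He moved to Canada!  Why?'
)
print(sentences)  # ['Buterin was born in Russia.', 'He moved to Canada!', 'Why?']
```

A split this simple also breaks on abbreviations such as `U.S.` and on raw wiki markup, which is one reason the sentences in the output further down still contain template and reference fragments.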


In [71]:
# Function to write output to .txt file
def write_to_txt(output_file, data):
    with open(output_file, 'w', encoding='utf-8') as f:
        for entry in data:
            f.write(f"Name: {entry['Name']}\n")
            f.write(f"Sentence: {entry['Sentence']}\n")
            f.write(f"Keywords: {entry['Keywords']}\n")
            f.write("\n")


In [72]:
# Main script
input_names = input("Enter names (separate by comma): ").split(',')
input_names = [name.strip() for name in input_names]

wiki_data = []  # one dict per sentence: Name, Sentence, Keywords

for name in input_names:
    print(f"Fetching text for {name}...")
    wikitext = fetch_wikipedia_text(name)
    if wikitext:
        sentences = split_text_into_sentences(wikitext)
        for sentence in sentences:
            keywords = extract_keywords(sentence)
            wiki_data.append({
                'Name': name,
                'Sentence': sentence,
                'Keywords': keywords
            })


Enter names (separate by comma): Vitalik Buterin
Fetching text for Vitalik Buterin...


en.wikipedia.org (parse) Vitalik Buterin
en.wikipedia.org (imageinfo) File:Vitalik Buterin TechCrunch Lond...
Vitalik Buterin (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Vitalik B...
  infobox: <dict(13)> name, native_name, native_name_lang, image, ...
  iwlinks: <list(1)> https://commons.wikimedia.org/wiki/Category:E...
  pageid: 41729424
  parsetree: <str(50693)> <root><template><title>Short description...
  requests: <list(2)> parse, imageinfo
  title: Vitalik Buterin
  wikibase: Q16197959
  wikidata_url: https://www.wikidata.org/wiki/Q16197959
  wikitext: <str(38934)> {{Short description|Canadian programmer (...
}


In [73]:
if len(wiki_data) > 0:
    output_file = 'scraped_wikipedia_info.txt'
    write_to_txt(output_file, wiki_data)
    print(f"Data written to {output_file}")

    # Read the .txt file into a DataFrame for inspection
    with open(output_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # Process the lines to extract Name, Sentence, and Keywords
    processed_data = []
    entry = {}

    for line in lines:
        if line.startswith("Name:"):
            if entry:  # If entry already has data, append it to the list
                processed_data.append(entry)
                entry = {}  # Reset entry for the next item
            entry['Name'] = line.replace("Name: ", "").strip()
        elif line.startswith("Sentence:"):
            entry['Sentence'] = line.replace("Sentence: ", "").strip()
        elif line.startswith("Keywords:"):
            entry['Keywords'] = line.replace("Keywords: ", "").strip()

    # Append the last entry if it exists
    if entry:
        processed_data.append(entry)

    # Convert processed data to DataFrame
    df = pd.DataFrame(processed_data)

    # Display the DataFrame
    print(df)


Data written to scraped_wikipedia_info.txt
               Name                                           Sentence  \
0   Vitalik Buterin  {{Short description|Canadian programmer (born ...   
1   Vitalik Buterin  What a beautiful messy complicated journey is ...   
2   Vitalik Buterin  Buterin became involved with [[cryptocurrency]...   
3   Vitalik Buterin  |title=The Man Behind Ethereum Is Worried Abou...   
4   Vitalik Buterin  There, he took advanced courses and was a rese...   
5   Vitalik Buterin  He returned to Toronto later that year and pub...   
6   Vitalik Buterin                                                -->   
7   Vitalik Buterin  But when he failed to gain agreement, he propo...   
8   Vitalik Buterin  Buterin announced Ethereum more publicly at th...   
9   Vitalik Buterin  Buterin delivered a 25-minute speech, describi...   
10  Vitalik Buterin  111 --> operating on a decentralized permissio...   
11  Vitalik Buterin  He's not too excited that the community assign..

In [74]:
df

Unnamed: 0,Name,Sentence,Keywords
0,Vitalik Buterin,{{Short description|Canadian programmer (born ...,"buterin, vitalik, dmitrievich"
1,Vitalik Buterin,What a beautiful messy complicated journey is ...,"human, date, beautiful"
2,Vitalik Buterin,Buterin became involved with [[cryptocurrency]...,"ref, url, https"
3,Vitalik Buterin,|title=The Man Behind Ethereum Is Worried Abou...,"date, url, buterin"
4,Vitalik Buterin,"There, he took advanced courses and was a rese...","archive, url, www"
5,Vitalik Buterin,He returned to Toronto later that year and pub...,"buterin, archive, url"
6,Vitalik Buterin,-->,"ethereum, archive, url"
7,Vitalik Buterin,"But when he failed to gain agreement, he propo...","new, ref, tapscott"
8,Vitalik Buterin,Buterin announced Ethereum more publicly at th...,"miami, buterin, announced"
9,Vitalik Buterin,"Buterin delivered a 25-minute speech, describi...","buterin, delivered, minute"
