# Wikipedia Data Retrieval
Because we will be testing vector search throughout these tutorials, we need data that an LLM was not originally trained on. We'll use **Wikipedia** for this: it is constantly updated with new information, and its content is available under a [Creative Commons license](https://en.wikipedia.org/wiki/Wikipedia:Copyrights). I will save the data as a static CSV in this repo, but you may eventually need to pull fresh data if the information I retrieve here gets baked into a newer LLM.

## Notebook Setup

To retrieve the data from Wikipedia, we're openly replicating functionality from [this notebook in the OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_Wikipedia_articles_for_search.ipynb). I tried several alternatives to avoid simply copying what OpenAI did, but most of them fell short of what I needed. With that said, here are the specialized libraries we will use in this notebook:

- `mwclient`: The "MW" stands for "MediaWiki", which (if you're not aware) is the open-source wiki software that Wikipedia runs on. You can consume its API directly (which is possible but a pain), and you'll find the documentation for it on the MediaWiki website. As the name implies, `mwclient` is the client we'll use to download the Wikipedia articles.
- `mwparserfromhell`: A companion library to `mwclient`, this one parses the raw wikitext of an article so that we can split it into its respective sections and strip out the markup.

Of course, these aren't the only libraries we'll be using, but you should already be familiar with the others.

In [1]:
# Importing the necessary Python libraries
import pandas as pd
import mwclient
import mwparserfromhell

## Retrieving Information from a Single Article
Just to ease ourselves into this process, let's demonstrate how to pull just a single article. At the time I am writing this, I am deep into the game **The Legend of Zelda: Tears of the Kingdom**, so we will be specifically grabbing the review ("Reception") information from that article. We'll execute all these cells in sequential order, but at a high level, here's what we're going to be doing:

1. Connecting the `mwclient` to the appropriate website (which in our case is the "vanilla" Wikipedia).
2. Retrieving the Wikipedia page by passing in the name of the article (in our case, "The Legend of Zelda: Tears of the Kingdom") into the `mwclient` instance.
3. Retrieving the full Wikipedia text from the body of the page.
4. Parsing the Wikipedia text with `mwparserfromhell`.
5. Retrieving the section headings of the article to see if the article contains the section we're looking for. (Note: This isn't a big deal for our Zelda example since we know it's there, but we're going to want to programmatically check for this in every other article later on.)
6. Retrieving the specific section we are looking for.
7. Removing any "code" characters from the section text.
8. Finally extracting the remaining text!

In [2]:
# Connecting the mwclient to Wikipedia
wikipedia_client = mwclient.Site('en.wikipedia.org')

In [3]:
# Getting the full text page from Wikipedia for Zelda
zelda_wiki_text = wikipedia_client.pages['The Legend of Zelda: Tears of the Kingdom'].text()

In [4]:
# Parsing the text with mwparserfromhell
parsed_wiki_text = mwparserfromhell.parse(zelda_wiki_text)

In [5]:
# Retrieving the section headings from the page
section_headings = [str(heading) for heading in parsed_wiki_text.filter_headings()]

In [6]:
# Checking to see if "Reception" is present in the Wikipedia article
if '==Reception==' in section_headings:

    # Retrieving all the sections in the parsed wiki text
    all_sections = parsed_wiki_text.get_sections(levels = [2])

    # Retrieving the appropriate section we need
    for section in all_sections:
        if str(section).startswith('==Reception=='):
            reception_section = section

    # Stripping out the code bits from the section text
    reception_text = reception_section.strip_code()
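
One caveat about the check above: `'==Reception==' in section_headings` only succeeds when the heading is written with no spaces inside the equal signs. Wikitext also accepts `== Reception ==`, so an exact string comparison can miss sections that are really there. Here is a minimal sketch of a more forgiving check. The helper names `heading_title` and `has_section` are hypothetical and are not part of `mwclient` or `mwparserfromhell`.

```python
def heading_title(raw_heading):
    """Return the bare title of a wikitext heading such as '== Reception =='."""
    # Drop outer whitespace, then the '=' markers, then any inner padding
    return raw_heading.strip().strip('=').strip()


def has_section(section_headings, title):
    """Check whether any heading in the list has exactly the given title."""
    return any(heading_title(heading) == title for heading in section_headings)


# Illustrative headings (made up for this example, not pulled from Wikipedia)
sample_headings = ['==Gameplay==', '== Reception ==', '===Sales===']

found_reception = has_section(sample_headings, 'Reception')
found_legacy = has_section(sample_headings, 'Legacy')
```

With a helper like this, the condition would read `if has_section(section_headings, 'Reception'):`, and the `startswith` test inside the loop would need the same normalization.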

## Getting Multiple Articles in a Single Category
Now that we've demonstrated how to get text from a single article, we can make use of Wikipedia's "Categories" functionality to grab multiple articles from a given category. In the example below, I want to grab all the review ("Reception") text from any video game that launched in 2023. We will be making use of this page: [Category:2023 video games](https://en.wikipedia.org/wiki/Category:2023_video_games). Note that this category contains ALL video games launching in 2023, which means that some games on the list (e.g. Starfield, Spider-Man 2) have no reviews yet, as those games had not launched at the point of me doing this work. This is why we wrote our `if` statement in the cell above: if a "Reception" section is not present in the Wikipedia article, we'll simply skip it and move on.

Recall from above that we simply passed in the name of the precise page associated with *The Legend of Zelda: Tears of the Kingdom* to obtain its text. To work through a category, we pass in the category name instead and iterate over the result to obtain each article. Everything that we are working with in these libraries has a corresponding data type, and Wikipedia pages are represented by this one: `mwclient.page.Page`. This is important to keep in mind because iterating over a category can also yield its subcategories, and we aren't interested in recursively gathering all of those. In the OpenAI Cookbook notebook linked above, they do have some code that demonstrates that recursion, so check it out over there if you want to learn more.
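
The type check in the loop below uses `type(game) == mwclient.page.Page` rather than `isinstance`. The sketch here illustrates why that distinction can matter, assuming that `mwclient`'s category objects are a subclass of `Page` (worth confirming against the version you have installed). The `Page` and `Category` classes and the member names are hypothetical stand-ins written for this example, not the real `mwclient` classes.

```python
# Hypothetical stand-ins for illustration only; NOT the real mwclient classes
class Page:
    def __init__(self, name):
        self.name = name


class Category(Page):
    """Stand-in for a subcategory, assumed here to inherit from Page."""


# A made-up category listing that mixes articles with one subcategory
members = [
    Page('Age of Wonders 4'),
    Category('Category:Some subcategory'),
    Page('Aquatico'),
]

# An exact type comparison keeps only plain pages
exact_matches = [m.name for m in members if type(m) == Page]

# isinstance also accepts subclasses, so the subcategory slips through
isinstance_matches = [m.name for m in members if isinstance(m, Page)]
```

Under that assumption, the exact comparison is what keeps subcategories out of the results.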

In [8]:
# Connecting the mwclient to Wikipedia
wikipedia_client = mwclient.Site('en.wikipedia.org')

In [9]:
# Collecting a list of everything under the 2023 video games Wikipedia category
all_games = wikipedia_client.pages['Category:2023 video games']

In [10]:
# Instantiating a list to hold the game reviews
vg_reviews = []

# Iterating over the full list of games
for game in all_games:

    # Checking to see if the instance is a Wikipedia page
    if type(game) == mwclient.page.Page:

        # Obtaining the name of the game as a string
        game_name = game.name

        # Getting the full text page from Wikipedia
        wiki_text = wikipedia_client.pages[game_name].text()

        # Parsing the text with mwparserfromhell
        parsed_wiki_text = mwparserfromhell.parse(wiki_text)

        # Retrieving the section headings from the page
        section_headings = [str(heading) for heading in parsed_wiki_text.filter_headings()]

        # Checking to see if "Reception" is present in the Wikipedia article
        if '==Reception==' in section_headings:

            # Retrieving all the sections in the parsed wiki text
            all_sections = parsed_wiki_text.get_sections(levels = [2])

            # Retrieving the appropriate section we need
            for section in all_sections:
                if str(section).startswith('==Reception=='):
                    reception_section = section

            # Stripping out the code bits from the section text
            review_text = reception_section.strip_code()

            # Appending the name of the game and review text to our collective list
            vg_reviews.append([game_name, review_text])

Now that we've retrieved all our video game reviews as a list of lists, we can easily convert this into a Pandas DataFrame that we can later save out as a CSV file!

In [11]:
# Establishing a Pandas DataFrame to store our final results
df_vg_reviews = pd.DataFrame(data = vg_reviews, columns = ['game_name', 'review_text'])

In [12]:
# Viewing the final results
df_vg_reviews.head()

| | game_name | review_text |
|---|---|---|
| 0 | Age of Wonders 4 | Reception\n\nAge of Wonders 4 received a posit... |
| 1 | Aliens: Dark Descent | Reception\n\nAliens: Dark Descent received "ge... |
| 2 | Aquatico | Reception\nThe game received mixed reviews upo... |
| 3 | Atelier Ryza 3: Alchemist of the End & the Sec... | Reception\n\nUpon release, Atelier Ryza 3 rece... |
| 4 | Bayonetta Origins: Cereza and the Lost Demon | Reception\n\nBayonetta Origins: Cereza and the... |
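
Notice in the preview above that every `review_text` value begins with the word "Reception", since the section heading survives `strip_code()`. If that prefix is unwanted downstream, one option is to trim it before saving. This is an optional sketch, and the helper name `drop_leading_heading` is hypothetical.

```python
def drop_leading_heading(section_text, title='Reception'):
    """Remove the heading line left at the top of a stripped section, if present."""
    first_line, _, rest = section_text.partition('\n')

    # Only trim when the first line is exactly the heading title
    if first_line.strip() == title:
        return rest.lstrip('\n')
    return section_text


# Shortened sample strings modeled on the preview above
cleaned_double_newline = drop_leading_heading('Reception\n\nAge of Wonders 4 received')
cleaned_single_newline = drop_leading_heading('Reception\nThe game received mixed reviews')
untouched = drop_leading_heading('No heading here')
```

It could be applied to the whole column with `df_vg_reviews['review_text'].apply(drop_leading_heading)`.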


In [13]:
# Saving the reviews to file
df_vg_reviews.to_csv('../data/2023-vg-reviews.csv', index = False)