## Load Lab Data

This notebook loads the Wikipedia articles used in the lab into a watsonx.data relational database table. We use the [wikipedia Python library](https://pypi.org/project/wikipedia/) to retrieve the articles, create a table in the database to store them, and then load them into that table.

#### Fetch Wikipedia articles

The wikipedia library supports both searching for articles and fetching a specific article. The cell below fetches each article by title or, when a page ID is given, by page ID.
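Since the fetch cell in this section covers only retrieval, the search step can be sketched as follows. This is a minimal sketch, assuming the `wikipedia` package is installed. `find_titles` and its `search_fn` parameter are hypothetical names introduced here so the helper can be exercised with a stub instead of a network call.

```python
def find_titles(query, limit=5, search_fn=None):
    """Return up to `limit` candidate article titles for `query`.

    `search_fn` is a hypothetical injection point for testing; by default
    the helper calls wikipedia.search, which returns a list of title strings.
    """
    if search_fn is None:
        # Imported lazily so the helper can be used with a stub offline.
        import wikipedia
        search_fn = wikipedia.search
    return list(search_fn(query, results=limit))
```

A title returned by `find_titles("Nobel Prize in Literature")` can then be passed to `wikipedia.page(...)`, as the fetch cell does.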

In [1]:
import wikipedia

# Articles to fetch, keyed by title. A value of None means "look up by title";
# an integer is the Wikipedia page ID to fetch directly.
articles = {
    'Nobel Prize in Literature': None,
    '2023 Nobel Prize in Literature': 72508137,
    '2024 Nobel Prize in Literature': 75098159
}
for k, v in articles.items():
    if v:
        article = wikipedia.page(pageid=v)
    else:
        article = wikipedia.page(k)
    # Replace the page ID (or None) with the article text.
    articles[k] = article.content
    print(f"Successfully fetched {k}")

print(f"Successfully fetched {len(articles)} articles")

Successfully fetched Nobel Prize in Literature
Successfully fetched 2023 Nobel Prize in Literature
Successfully fetched 2024 Nobel Prize in Literature
Successfully fetched 3 articles


## Load Wikipedia articles into watsonx.data

#### Connect to watsonx.data 

In [13]:
import sys
sys.path.append("../utils")
import pandas as pd
import wxd_utils

conf=wxd_utils.load_conf()
print(conf)

wxd_engine = wxd_utils.connect_wxd(conf)

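If this cell fails with `ModuleNotFoundError: No module named 'dotenv'`, install the `python-dotenv` package (`pip install python-dotenv`) and run the cell again.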

### Create a schema in the watsonx.data Hive bucket to store the Wikipedia data

In [None]:
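import sqlalchemy.exc  # provides the exception class caught below

try:
    # pd.read_sql is used here only to send the DDL statement through the engine.
    create_schema_result = pd.read_sql("""

    CREATE SCHEMA hive_data.watsonxai WITH ( location = 's3a://hive-bucket/watsonx_ai')

    """, wxd_engine)

except sqlalchemy.exc.SQLAlchemyError as e:
    print("Error creating schema:", str(e))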

### Create a table in the schema above to hold the Wikipedia data

In [None]:
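import sqlalchemy.exc  # provides the exception class caught below

try:
    # One row per article chunk: chunk number, chunk text, article title.
    create_table_result = pd.read_sql("""

        CREATE TABLE hive_data.watsonxai.wikipedia
        (
            "id" varchar,
            "text" varchar,
            "title" varchar
        )
        WITH (
            format = 'PARQUET'
        )

    """, wxd_engine)

except sqlalchemy.exc.SQLAlchemyError as e:
    print("Error creating table:", str(e))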

### Chunk and insert data

In [None]:
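# Chunk data: split each article into chunks of at most `chunk_size` words
def split_into_chunks(text, chunk_size):
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

split_articles = {}
for k, v in articles.items():
    split_articles[k] = split_into_chunks(v, 225)

# Insert data, one row per chunk, reusing a single connection
with wxd_engine.connect() as connection:
    for article_title, article_chunks in split_articles.items():
        # Double single quotes for SQL string literals and percent signs for the driver
        escaped_title = article_title.replace("'", "''").replace("%", "%%")

        for i, chunk in enumerate(article_chunks):
            escaped_chunk = chunk.replace("'", "''").replace("%", "%%")
            insert_stmt = f"insert into hive_data.watsonxai.wikipedia values ('{i+1}', '{escaped_chunk}', '{escaped_title}')"

            # A raw SQL string is accepted by SQLAlchemy 1.x;
            # on 2.x use connection.exec_driver_sql(insert_stmt) instead.
            connection.execute(insert_stmt)
            print(f"{article_title} {i+1}/{len(article_chunks)} INSERTED")

        print(f"{article_title} DONE")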
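The chunking and escaping steps above can be checked in isolation on a short string before running them against the database. The sketch below repeats `split_into_chunks` from the cell above so that it is self-contained. `escape_sql_literal` is a hypothetical helper name that wraps the same two `replace` calls used in the insert loop.

```python
def split_into_chunks(text, chunk_size):
    # Split on any whitespace, then regroup into chunks of `chunk_size` words.
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def escape_sql_literal(value):
    # Hypothetical helper: same escaping as the insert loop
    # (double single quotes, double percent signs).
    return value.replace("'", "''").replace("%", "%%")

sample = "It's a 100% reproducible example of word-based chunking here"
chunks = split_into_chunks(sample, 4)
# chunks == ["It's a 100% reproducible", "example of word-based chunking", "here"]

escaped = [escape_sql_literal(c) for c in chunks]
# escaped[0] == "It''s a 100%% reproducible"
```

Two properties are worth noting: the last chunk can be shorter than `chunk_size`, and `text.split()` collapses newlines and repeated spaces, so paragraph breaks in the article are not preserved in the stored chunks.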

In [None]:
# confirm data inserted

wiki_articles = pd.read_sql("select * from hive_data.watsonxai.wikipedia", wxd_engine)
wiki_articles
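Beyond eyeballing the table, the row count per title can be compared with the number of chunks the splitter should have produced. This is a minimal sketch using only the standard library. The function names are hypothetical, and it assumes each row is a dict with a `title` key (for example, `wiki_articles.to_dict("records")`) and that `articles` still maps titles to article text.

```python
import math
from collections import Counter

def expected_chunk_counts(articles, chunk_size=225):
    """Chunks per title that a word-based splitter of `chunk_size` produces."""
    return {title: math.ceil(len(text.split()) / chunk_size)
            for title, text in articles.items()}

def stored_chunk_counts(rows):
    """Count rows per title; `rows` is an iterable of dicts with a 'title' key."""
    return dict(Counter(row["title"] for row in rows))

def mismatched_titles(articles, rows, chunk_size=225):
    """Map each title whose stored row count differs to (expected, stored)."""
    expected = expected_chunk_counts(articles, chunk_size)
    stored = stored_chunk_counts(rows)
    return {title: (count, stored.get(title, 0))
            for title, count in expected.items()
            if stored.get(title, 0) != count}
```

Calling `mismatched_titles(articles, wiki_articles.to_dict("records"))` returns an empty dict when every article has the expected number of rows. A stored count higher than expected usually means the insert cell was run more than once, since it appends rows on each run.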