## **Install packages if not yet installed**

In [1]:
import sys

!{sys.executable} -m pip install beautifulsoup4 # BeautifulSoup (imported as bs4)
!{sys.executable} -m pip install opendatasets # OpenDatasets
!{sys.executable} -m pip install azure-storage-blob # Azure Blob Storage
!{sys.executable} -m pip install azure-cosmos # Azure Cosmos DB



## **Reading the dataset**

**1.** Create a file `kaggle.json` containing your Kaggle username and API key. It is used to authenticate the dataset download from Kaggle.

**2.** Download the dataset from [https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles](https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles "GeeksForGeeks Articles Dataset") using the `opendatasets` package. Step 1 is required so that the package picks up your username and API key automatically.

**3.** Read the downloaded dataset.

In [2]:
import json
import opendatasets as od
import pandas as pd

In [3]:
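# Creating kaggle.json file.
# The credentials are read from environment variables (the variable names are an assumption)
# so that the username and API key are never stored in the notebook itself.
import os

with open("kaggle.json", "w") as kaggleFile:
    json.dump({"username": os.environ["KAGGLE_USERNAME"], "key": os.environ["KAGGLE_KEY"]}, kaggleFile)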

In [4]:
# Downloading the dataset.
od.download("https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles")

Downloading geeksforgeeks-articles.zip to ./geeksforgeeks-articles



100%|██████████| 1.31M/1.31M [00:00<00:00, 7.33MB/s]


In [5]:
# Reading the dataset.
articles=pd.read_csv(r"geeksforgeeks-articles/articles.csv")
articles.head()

| | title | author_id | last_updated | link | category |
|---|---|---|---|---|---|
| 0 | 5 Best Practices For Writing SQL Joins | priyankab14 | 21 Feb, 2022 | https://www.geeksforgeeks.org/5-best-practices... | easy |
| 1 | Foundation CSS Dropdown Menu | ishankhandelwals | 20 Feb, 2022 | https://www.geeksforgeeks.org/foundation-css-d... | easy |
| 2 | Top 20 Excel Shortcuts That You Need To Know | priyankab14 | 17 Feb, 2022 | https://www.geeksforgeeks.org/top-20-excel-sho... | easy |
| 3 | Servlet – Fetching Result | nishatiwari1719 | 17 Feb, 2022 | https://www.geeksforgeeks.org/servlet-fetching... | easy |
| 4 | Suffix Sum Array | rohit768 | 21 Feb, 2022 | https://www.geeksforgeeks.org/suffix-sum-array/ | easy |


In [6]:
articles.shape

(34574, 5)

## **Dropping rows with null values**

In [7]:
articles=articles.dropna()

In [8]:
# Reset index.
articles=articles.reset_index(drop=True)
articles.head()

| | title | author_id | last_updated | link | category |
|---|---|---|---|---|---|
| 0 | 5 Best Practices For Writing SQL Joins | priyankab14 | 21 Feb, 2022 | https://www.geeksforgeeks.org/5-best-practices... | easy |
| 1 | Foundation CSS Dropdown Menu | ishankhandelwals | 20 Feb, 2022 | https://www.geeksforgeeks.org/foundation-css-d... | easy |
| 2 | Top 20 Excel Shortcuts That You Need To Know | priyankab14 | 17 Feb, 2022 | https://www.geeksforgeeks.org/top-20-excel-sho... | easy |
| 3 | Servlet – Fetching Result | nishatiwari1719 | 17 Feb, 2022 | https://www.geeksforgeeks.org/servlet-fetching... | easy |
| 4 | Suffix Sum Array | rohit768 | 21 Feb, 2022 | https://www.geeksforgeeks.org/suffix-sum-array/ | easy |


In [9]:
articles.shape

(34551, 5)
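
The row count falls from 34574 to 34551, so 23 rows contained at least one null value. As a minimal sketch on a made-up three-row frame (the `toy` data is an assumption, not part of the dataset), the null counts per column can be inspected before dropping rows, and the index can be rebuilt in a single call:

```python
import pandas as pd

# Hypothetical toy frame standing in for the articles dataset.
toy = pd.DataFrame({
    "title": ["A", None, "C"],
    "link": ["u1", "u2", None],
})

# Number of null values in each column.
null_counts = toy.isna().sum()

# Drop every row that has at least one null, then renumber the index from 0.
cleaned = toy.dropna().reset_index(drop=True)

print(null_counts)
print(cleaned)
```

Inspecting `null_counts` first shows which columns are responsible for the dropped rows.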

## **Create connection string to storage account**

Errors encountered while scraping will be saved as a JSON file in Azure Blob Storage.
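
For orientation, a storage connection string is a list of `Key=Value` fields separated by semicolons. The sketch below builds one and splits it back into its fields using only the standard library. The helper names `build_connection_string` and `parse_connection_string`, the account name, and the key are all made up for illustration.

```python
def build_connection_string(account_name: str, account_key: str) -> str:
    """Hypothetical helper: join the four fields used in this notebook."""
    fields = {
        "DefaultEndpointsProtocol": "https",
        "AccountName": account_name,
        "AccountKey": account_key,
        "EndpointSuffix": "core.windows.net",
    }
    return ";".join(f"{name}={value}" for name, value in fields.items())


def parse_connection_string(conn_str: str) -> dict:
    """Hypothetical helper: split on the first '=' only, because
    base64-encoded keys usually end with '=' padding."""
    return dict(field.split("=", 1) for field in conn_str.split(";") if field)


# Placeholder values, not real credentials.
example = build_connection_string("exampleaccount", "ZmFrZS1rZXk=")
fields = parse_connection_string(example)
print(fields["AccountName"])
```

Splitting on the first `=` only matters here: a naive `split("=")` would cut the padding off the account key.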

In [10]:
import os

accountName="shivmlstorage"
# Read the account key from an environment variable (the variable name is an assumption)
# instead of hard-coding the secret in the notebook.
accountKey=os.environ["AZURE_STORAGE_ACCOUNT_KEY"]
connectionString=f"DefaultEndpointsProtocol=https;AccountName={accountName};AccountKey={accountKey};EndpointSuffix=core.windows.net"

## **Values for connection to Azure Cosmos DB**

Each row of the dataset will be stored as an item in a container in Azure Cosmos DB.
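
Cosmos DB expects every item to carry a string `id`. As a rough sketch of the row-to-item mapping, assuming the row is already a plain dictionary (the helper name `row_to_item` is hypothetical), the conversion could look like this:

```python
import json


def row_to_item(row_index: int, row: dict) -> dict:
    """Hypothetical helper: copy the row and add its index as a string `id`."""
    item = dict(row)            # Copy so the original row is left untouched.
    item["id"] = str(row_index)
    return item


# Sample row taken from the dataset preview.
row = {
    "title": "Suffix Sum Array",
    "author_id": "rohit768",
    "last_updated": "21 Feb, 2022",
    "link": "https://www.geeksforgeeks.org/suffix-sum-array/",
    "category": "easy",
}

item = row_to_item(4, row)
payload = json.dumps(item)      # The item must be JSON-serializable.
print(payload)
```

Using the row index as `id` keeps the items traceable back to the dataframe.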

In [11]:
from azure.cosmos import CosmosClient, PartitionKey

In [12]:
import os

cosmosEndpoint="https://shivmlstorage.documents.azure.com:443/"
# Read the Cosmos DB key from an environment variable (the variable name is an assumption)
# instead of hard-coding the secret in the notebook.
cosmosKey=os.environ["COSMOS_KEY"]
cosmosDatabase="gfg-articles"
cosmosContainer="gfg-articles"

In [13]:
client=CosmosClient(url=cosmosEndpoint, credential=cosmosKey)
database=client.get_database_client(database=cosmosDatabase)
gfgArticlesContainer=database.get_container_client(cosmosContainer)

## **Scrape text from the URL to get article content**

**1.** Create a new column `text` to hold the text scraped with BeautifulSoup.

**2.** Define a function that scrapes the article text from the URL passed as a parameter.

**3.** In batches of 1024, use multithreading to call this function for each row and store each resulting row as an item in the Azure Cosmos DB container.
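
The batching pattern in the last step can be sketched without any network access. In this illustration `fake_scrape` is a hypothetical stand-in for the real scraping function, and the batch size and worker count are deliberately tiny (the notebook uses 1024 and 128):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_scrape(i, link):
    """Hypothetical stand-in for the scraper: returns the index and some text."""
    return i, f"text for {link}"


links = [f"https://example.com/{n}" for n in range(10)]
BATCH_SIZE = 4
results = {}
batch_sizes = []

for batch_start in range(0, len(links), BATCH_SIZE):
    # Exclusive end of the batch; the last batch may be shorter.
    batch_end = min(batch_start + BATCH_SIZE, len(links))
    batch_sizes.append(batch_end - batch_start)

    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(fake_scrape, i, links[i])
            for i in range(batch_start, batch_end)
        ]
        # Handle each result as soon as its thread finishes.
        for future in as_completed(futures):
            i, text = future.result()
            results[i] = text

print(batch_sizes)
```

Returning the row index together with the text lets the caller match results to rows even though threads finish in arbitrary order.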

In [14]:
from bs4 import BeautifulSoup
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

In [15]:
# Add a new column to hold the text scraped from the URLs.
articles["text"]=""
articles.head()

| | title | author_id | last_updated | link | category | text |
|---|---|---|---|---|---|---|
| 0 | 5 Best Practices For Writing SQL Joins | priyankab14 | 21 Feb, 2022 | https://www.geeksforgeeks.org/5-best-practices... | easy | |
| 1 | Foundation CSS Dropdown Menu | ishankhandelwals | 20 Feb, 2022 | https://www.geeksforgeeks.org/foundation-css-d... | easy | |
| 2 | Top 20 Excel Shortcuts That You Need To Know | priyankab14 | 17 Feb, 2022 | https://www.geeksforgeeks.org/top-20-excel-sho... | easy | |
| 3 | Servlet – Fetching Result | nishatiwari1719 | 17 Feb, 2022 | https://www.geeksforgeeks.org/servlet-fetching... | easy | |
| 4 | Suffix Sum Array | rohit768 | 21 Feb, 2022 | https://www.geeksforgeeks.org/suffix-sum-array/ | easy | |


In [16]:
# Dictionary to save the errors that occur while scraping text.
scrapTextErrors={}

In [17]:
# Set timeout.
TIMEOUT_SECS=60

In [18]:
# Define a function to scrape text.
def scrapText(i, link):
    try:
        # Pass a timeout so that a stalled request cannot block a worker thread indefinitely.
        page=requests.get(link, timeout=TIMEOUT_SECS).text
        parser=BeautifulSoup(page, "html.parser")

        # Get the inner HTML of the <div class="text"></div> tag, which holds the main content.
        # Walk down one level at a time (recursive=False) instead of searching the whole tree
        # recursively for this class name, to avoid maximum-recursion errors.
        parser=parser.find("html", recursive=False)
        parser=parser.find("body", recursive=False)
        parser=parser.find("div", id="main", recursive=False)
        parser=parser.find("div", id="home-page", recursive=False)
        parser=parser.find("div", class_="article-page_flex", recursive=False)
        parser=parser.find("div", class_="leftBar", recursive=False)
        parser=parser.find("div", class_="article--viewer", recursive=False)
        parser=parser.find("div", class_="article--viewer_content", recursive=False)
        parser=parser.find("div", class_="a-wrapper", recursive=False)
        parser=parser.find("article", recursive=False)

        text=[]
        for tag in parser.find("div", class_="text", recursive=False).contents:
            # Ignore all the <div> tags inside <div class="text"></div> as they do not have any
            # main content.
            if tag.name!="div":
                text.append(" ".join(tag.stripped_strings))
        # Return the main content.
        return i, "\n".join(text).strip("\n")

    except Exception as err:
        # Any failure, including a missing tag (find() returns None), is recorded
        # and the row gets an empty text.
        scrapTextErrors[i]={"link": link, "error": err}
        return i, ""
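
To see what the final extraction loop does in isolation, here is a small sketch on a hand-written HTML fragment. The markup is made up and far shallower than a real GeeksforGeeks page. It keeps the non-`div` children of `<div class="text">` and joins their stripped strings:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two paragraphs and one nested <div> that should be skipped.
html = (
    '<article><div class="text">'
    "<p>Hello <b>world</b></p>"
    '<div class="ad">skip me</div>'
    "<p>Second paragraph</p>"
    "</div></article>"
)

soup = BeautifulSoup(html, "html.parser")

# recursive=False looks only at direct children instead of the whole subtree.
article = soup.find("article", recursive=False)
text_div = article.find("div", class_="text", recursive=False)

parts = [
    " ".join(child.stripped_strings)
    for child in text_div.contents
    if child.name != "div"
]
content = "\n".join(parts)
print(content)
```

If any `find` call in the chain returns `None`, the next attribute access raises `AttributeError`, which is why the notebook wraps the whole chain in a `try` block.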

In [19]:
%%time
# Run the above function for all the links using multithreading, one batch at a time.
futureResultErrors=[]
batchesCount, BATCH_SIZE=0, 1024
# Print batch size
print(f"Batch size: {BATCH_SIZE}")

for batch_start in range(0, articles.shape[0], BATCH_SIZE):
    future_to_url={}
    batchesCount+=1 # Batch number of the current batch.
    countEmptyText=0 # Count of empty `text` in the current batch.
    batch_end=min(batch_start+BATCH_SIZE, articles.shape[0]) # Exclusive end of the current batch.

    with ThreadPoolExecutor(max_workers=128) as executor:
        for i in range(batch_start, batch_end):
            future_to_url[executor.submit(scrapText, i, articles.loc[i, "link"])]=i

        for future in as_completed(future_to_url):
            try:
                # Futures yielded by as_completed() are already finished, so this call does not block.
                i, text=future.result(timeout=TIMEOUT_SECS)
                articles.loc[i, "text"]=str(text)

                # Convert this article to a dictionary and add the `id` as a string.
                gfgArticle=articles.loc[i].to_dict()
                gfgArticle["id"]=str(i)
                # Create an item in Cosmos DB.
                gfgArticlesContainer.create_item(gfgArticle)

                # If `text` is empty, update count.
                if text=="":
                    countEmptyText+=1
            except Exception as err:
                futureResultErrors.append(err)

    # Print status.
    print(f"Batch #{batchesCount}: Extracted `text` for {(batch_end-batch_start)-countEmptyText} links")
    # Clear `text` for this batch to free memory. `.loc` slicing includes its end label, hence batch_end-1.
    articles.loc[batch_start:batch_end-1, "text"]=""

Batch size: 1024
Batch #1: Extracted `text` for 1023 links
Batch #2: Extracted `text` for 1024 links
Batch #3: Extracted `text` for 1024 links
Batch #4: Extracted `text` for 1024 links
Batch #5: Extracted `text` for 1023 links
Batch #6: Extracted `text` for 1023 links
Batch #7: Extracted `text` for 1024 links
Batch #8: Extracted `text` for 1023 links
Batch #9: Extracted `text` for 1022 links
Batch #10: Extracted `text` for 1022 links
Batch #11: Extracted `text` for 1024 links
Batch #12: Extracted `text` for 1024 links
Batch #13: Extracted `text` for 1023 links
Batch #14: Extracted `text` for 1024 links
Batch #15: Extracted `text` for 1024 links
Batch #16: Extracted `text` for 1022 links
Batch #17: Extracted `text` for 1024 links
Batch #18: Extracted `text` for 1023 links
Batch #19: Extracted `text` for 1023 links
Batch #20: Extracted `text` for 1024 links
Batch #21: Extracted `text` for 1021 links
Batch #22: Extracted `text` for 1023 links
Batch #23: Extracted `text` for 1023 links
Bat

## **Save errors to a Blob**

In [23]:
from azure.storage.blob import BlobClient

# Add the futureResultErrors to the scrapTextErrors and save to a blob.
scrapTextErrors["futureResult"]=futureResultErrors

# Converting the values to string.
for i, v in scrapTextErrors.items():
    scrapTextErrors[i]=str(v)

# Writing to blob.
blob=BlobClient.from_connection_string(conn_str=connectionString, container_name="gfg-articles-errors", blob_name="ScrapTextErrors.json")
blob.upload_blob(json.dumps(scrapTextErrors), overwrite=True)

{'etag': '"0x8DBB2E99E6B7F4E"',
 'last_modified': datetime.datetime(2023, 9, 11, 17, 7, 45, tzinfo=datetime.timezone.utc),
 'content_md5': bytearray(b'\r\x980\xe8\xee\xc1\xef\xed\xccH\xfcR\xac+\xafE'),
 'client_request_id': 'ba327966-50c5-11ee-b394-cdddc276a591',
 'request_id': 'fc9ca73a-901e-009e-6ed2-e458d9000000',
 'version': '2022-11-02',
 'version_id': None,
 'date': datetime.datetime(2023, 9, 11, 17, 7, 45, tzinfo=datetime.timezone.utc),
 'request_server_encrypted': True,
 'encryption_key_sha256': None,
 'encryption_scope': None}

The errors saved to `ScrapTextErrors.json` occurred because the corresponding links do not have any article content.
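
One way to check this kind of claim is to tally the recorded errors by exception type before they are converted to strings. The sketch below uses a made-up error dictionary in the same `{index: {"link": ..., "error": ...}}` shape; the links and exceptions are illustrative, not taken from the real run.

```python
import json
from collections import Counter

# Hypothetical error records in the same shape the scraper stores.
errors = {
    0: {"link": "https://example.com/a",
        "error": AttributeError("'NoneType' object has no attribute 'find'")},
    7: {"link": "https://example.com/b",
        "error": TimeoutError("timed out")},
    9: {"link": "https://example.com/c",
        "error": AttributeError("'NoneType' object has no attribute 'find'")},
}

# Count how often each exception type occurred.
counts = Counter(type(entry["error"]).__name__ for entry in errors.values())

# Exceptions are not JSON-serializable, so store their repr() instead.
serializable = {
    str(i): {"link": entry["link"], "error": repr(entry["error"])}
    for i, entry in errors.items()
}
payload = json.dumps(serializable)

print(counts)
```

A tally dominated by `AttributeError` would be consistent with pages that lack the expected tags, while timeouts or connection errors would point to a different cause.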