## **Install packages if not yet installed**

In [20]:
import sys

!{sys.executable} -m pip install bs4 # BeautifulSoup
!{sys.executable} -m pip install opendatasets # OpenDatasets
!{sys.executable} -m pip install azure.storage.blob # Azure Blob Storage



## **Reading the dataset**

**1.** Create a file `kaggle.json` containing your Kaggle username and API key. It is used to authenticate the download from Kaggle.

**2.** The URL of the dataset is [https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles](https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles "GeeksForGeeks Articles Dataset"). Download it with the `opendatasets` package. Step 1 is required so that the package picks up your username and API key automatically.

**3.** Read the downloaded dataset.
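As an alternative to typing credentials into the notebook, `kaggle.json` could be generated from environment variables. The sketch below is an assumption, not part of the original workflow: the helper name `write_kaggle_json` and the variable names `KAGGLE_USERNAME` and `KAGGLE_KEY` are chosen here for illustration.

```python
import json
import os
from pathlib import Path


def write_kaggle_json(directory=".", env=os.environ):
    """Write kaggle.json in `directory` from environment variables.

    Hypothetical helper: the variable names used below are an
    assumption made for this sketch.
    """
    username = env.get("KAGGLE_USERNAME")
    key = env.get("KAGGLE_KEY")
    if not username or not key:
        raise RuntimeError("Set KAGGLE_USERNAME and KAGGLE_KEY first.")

    path = Path(directory) / "kaggle.json"
    path.write_text(json.dumps({"username": username, "key": key}))
    return path
```

Calling `write_kaggle_json()` before the download step would then create the file in the working directory without the key ever appearing in a cell.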

In [21]:
import json
import opendatasets as od
import pandas as pd

In [22]:
# Creating kaggle.json file.
with open("kaggle.json", "w") as kaggleFile:
    # Replace the placeholders with your own credentials; never publish a real API key.
    kaggleFile.write(json.dumps({"username": "<your-kaggle-username>", "key": "<your-kaggle-api-key>"}))

In [23]:
# Downloading the dataset.
od.download("https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles")

Downloading geeksforgeeks-articles.zip to ./geeksforgeeks-articles



100%|██████████| 1.31M/1.31M [00:00<00:00, 6.45MB/s]


In [24]:
# Reading the dataset.
articles=pd.read_csv(r"geeksforgeeks-articles/articles.csv")
articles.head()

Unnamed: 0,title,author_id,last_updated,link,category
0,5 Best Practices For Writing SQL Joins,priyankab14,"21 Feb, 2022",https://www.geeksforgeeks.org/5-best-practices...,easy
1,Foundation CSS Dropdown Menu,ishankhandelwals,"20 Feb, 2022",https://www.geeksforgeeks.org/foundation-css-d...,easy
2,Top 20 Excel Shortcuts That You Need To Know,priyankab14,"17 Feb, 2022",https://www.geeksforgeeks.org/top-20-excel-sho...,easy
3,Servlet – Fetching Result,nishatiwari1719,"17 Feb, 2022",https://www.geeksforgeeks.org/servlet-fetching...,easy
4,Suffix Sum Array,rohit768,"21 Feb, 2022",https://www.geeksforgeeks.org/suffix-sum-array/,easy


In [25]:
articles.shape

(34574, 5)

## **Dropping rows with null values**

In [26]:
articles=articles.dropna()

In [27]:
# Reset index.
articles=articles.reset_index().drop("index", axis=1)
articles.head()

Unnamed: 0,title,author_id,last_updated,link,category
0,5 Best Practices For Writing SQL Joins,priyankab14,"21 Feb, 2022",https://www.geeksforgeeks.org/5-best-practices...,easy
1,Foundation CSS Dropdown Menu,ishankhandelwals,"20 Feb, 2022",https://www.geeksforgeeks.org/foundation-css-d...,easy
2,Top 20 Excel Shortcuts That You Need To Know,priyankab14,"17 Feb, 2022",https://www.geeksforgeeks.org/top-20-excel-sho...,easy
3,Servlet – Fetching Result,nishatiwari1719,"17 Feb, 2022",https://www.geeksforgeeks.org/servlet-fetching...,easy
4,Suffix Sum Array,rohit768,"21 Feb, 2022",https://www.geeksforgeeks.org/suffix-sum-array/,easy


In [28]:
articles.shape

(34551, 5)

## **Create connection string to storage account**

The new dataframe will be saved as `.csv` files in Azure Blob storage.

In [29]:
# Replace <storage-account-key> with your own account key; never publish a real key.
connectionString="DefaultEndpointsProtocol=https;AccountName=shivmldatasets;AccountKey=<storage-account-key>;EndpointSuffix=core.windows.net"

## **Scrape text from the URL to get article content**

**1.** Create a new column `text` to store the text scraped with BeautifulSoup.

**2.** Define the function that scrapes the text, given the URL as a parameter.

**3.** In batches of 1024, use multi-threading to call this function for each row and save the resulting dataframe to a `.csv` file in an Azure Blob container.
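The batching pattern in step 3 can be sketched without network access. This is a minimal illustration, not the notebook's scraper: `fetch_text` is a hypothetical stand-in for the scraping function, and the batch size and worker count are arbitrary small values.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_text(i, link):
    """Hypothetical stand-in for a scraper: returns (index, text)."""
    if not link:
        return i, ""  # mimic a page with no content
    return i, f"content of {link}"


def process_in_batches(links, batch_size=4, max_workers=2):
    """Run fetch_text over `links` batch by batch; return one result list per batch."""
    batches = []
    for start in range(0, len(links), batch_size):
        end = min(start + batch_size, len(links))  # exclusive end of this batch
        texts = [""] * (end - start)
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [executor.submit(fetch_text, i, links[i]) for i in range(start, end)]
            for future in as_completed(futures):
                i, text = future.result()
                texts[i - start] = text  # store by index; completion order varies
        batches.append(texts)
    return batches


links = ["a", "b", "", "d", "e", "f"]
batches = process_in_batches(links)
print([len(batch) for batch in batches])  # [4, 2]
```

Returning the index together with the text lets each result be written back to the right row regardless of which thread finishes first.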

In [30]:
from bs4 import BeautifulSoup
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from azure.storage.blob import BlobClient

In [31]:
# Add a new column to save the text scraped from the URLs.
articles["text"]=""
articles.head()

Unnamed: 0,title,author_id,last_updated,link,category,text
0,5 Best Practices For Writing SQL Joins,priyankab14,"21 Feb, 2022",https://www.geeksforgeeks.org/5-best-practices...,easy,
1,Foundation CSS Dropdown Menu,ishankhandelwals,"20 Feb, 2022",https://www.geeksforgeeks.org/foundation-css-d...,easy,
2,Top 20 Excel Shortcuts That You Need To Know,priyankab14,"17 Feb, 2022",https://www.geeksforgeeks.org/top-20-excel-sho...,easy,
3,Servlet – Fetching Result,nishatiwari1719,"17 Feb, 2022",https://www.geeksforgeeks.org/servlet-fetching...,easy,
4,Suffix Sum Array,rohit768,"21 Feb, 2022",https://www.geeksforgeeks.org/suffix-sum-array/,easy,


In [32]:
# Dictionary to save the errors that occur while scraping text.
scrapTextErrors={}

In [33]:
# Set timeout.
TIMEOUT_SECS=60

In [34]:
# Define a function to scrape text.
def scrapText(i, link):
    try:
        # Pass the timeout to requests; without it a stalled connection can block a worker thread indefinitely.
        page=requests.get(link, timeout=TIMEOUT_SECS).text
        parser=BeautifulSoup(page, "html.parser")

        # Get the inner HTML of the <div class="text"></div> tag, which holds the main content.
        # Walk down the tree one level at a time instead of searching recursively, to avoid max recursion errors.
        parser=parser.find("html", recursive=False)
        parser=parser.find("body", recursive=False)
        parser=parser.find("div", id="main", recursive=False)
        parser=parser.find("div", id="home-page", recursive=False)
        parser=parser.find("div", class_="article-page_flex", recursive=False)
        parser=parser.find("div", class_="leftBar", recursive=False)
        parser=parser.find("div", class_="article--viewer", recursive=False)
        parser=parser.find("div", class_="article--viewer_content", recursive=False)
        parser=parser.find("div", class_="a-wrapper", recursive=False)
        parser=parser.find("article", recursive=False)
        
        text=[""]
        for tag in parser.find("div", class_="text", recursive=False).contents:
            # Ignore all the <div> tags inside <div class="text"></div> as they do not have any
            # main content.
            if tag.name!="div":
                text.append(" ".join(tag.stripped_strings))
        # Return the main content.
        return i, "\n".join(text).strip("\n")
    
    except Exception as err:
        scrapTextErrors[i]={"link": link, "error": err}
    return i, ""

In [36]:
%%time
# Run the above function for all the links using multithreading.
# Test for batches.
futureResultErrors=[]
batchesCount, BATCH_SIZE=0, 1024
# Print batch size
print(f"Batch size: {BATCH_SIZE}")

for batch_start in range(0, articles.shape[0], BATCH_SIZE):
    future_to_url={}
    batchesCount+=1 # Batch number of the current batch.
    countEmptyText=0 # Count of empty `text` in the current batch.
    batch_end=min(batch_start+BATCH_SIZE, articles.shape[0])

    with ThreadPoolExecutor(max_workers=128) as executor: 
        for i in range(batch_start, batch_end):
            future_to_url[executor.submit(scrapText, i, articles.loc[i, "link"])]=i
            
        for future in as_completed(future_to_url):
            try:
                i, text=future.result(timeout=TIMEOUT_SECS)
                articles.loc[i, "text"]=str(text)
                # If `text` is empty, update count.
                if text=="":
                    countEmptyText+=1
            except Exception as err:
                futureResultErrors.append(err)
    
    # Print status.
    print(f"Batch #{batchesCount}: Extracted `text` for {(batch_end-batch_start)-countEmptyText} links")
    
    # Create blob for each batch.
    blob=BlobClient.from_connection_string(conn_str=connectionString, container_name="ml-datasets", blob_name=f"gfg-articles-scrapped-{batchesCount}.csv")
    blob.upload_blob(articles.loc[batch_start:batch_end].to_csv())

    # Empty text for this batch.
    articles.loc[batch_start:batch_end, "text"]=""

Batch size: 1024
Batch #1: Extracted `text` for 1023 links
Batch #2: Extracted `text` for 1024 links
Batch #3: Extracted `text` for 1024 links
Batch #4: Extracted `text` for 1024 links
Batch #5: Extracted `text` for 1023 links
Batch #6: Extracted `text` for 1023 links
Batch #7: Extracted `text` for 1024 links
Batch #8: Extracted `text` for 1023 links
Batch #9: Extracted `text` for 1021 links
Batch #10: Extracted `text` for 1022 links
Batch #11: Extracted `text` for 1024 links
Batch #12: Extracted `text` for 1024 links
Batch #13: Extracted `text` for 1024 links
Batch #14: Extracted `text` for 1024 links
Batch #15: Extracted `text` for 1024 links
Batch #16: Extracted `text` for 1022 links
Batch #17: Extracted `text` for 1024 links
Batch #18: Extracted `text` for 1023 links
Batch #19: Extracted `text` for 1023 links
Batch #20: Extracted `text` for 1024 links
Batch #21: Extracted `text` for 1021 links
Batch #22: Extracted `text` for 1023 links
Batch #23: Extracted `text` for 1023 links
...

## **Save errors to a Blob**

In [40]:
# Add the futureResultErrors to the scrapTextErrors and save to a blob.
scrapTextErrors["futureResult"]=futureResultErrors

# Converting the values to string.
for i, v in scrapTextErrors.items():
    scrapTextErrors[i]=str(v)

# Writing to blob.
blob=BlobClient.from_connection_string(conn_str=connectionString, container_name="ml-errors", blob_name="ScrapTextErrors.json")
blob.upload_blob(json.dumps(scrapTextErrors))

{'etag': '"0x8DBA807F74564D4"',
 'last_modified': datetime.datetime(2023, 8, 28, 20, 47, 16, tzinfo=datetime.timezone.utc),
 'content_md5': bytearray(b'\xfd\x7f\x13-\x8a\x9dm\xfd+\xf7\x958\xc2G-q'),
 'client_request_id': '130aa122-45e4-11ee-b78c-33c210a3064a',
 'request_id': 'aafe567d-b01e-0029-75f0-d9a912000000',
 'version': '2022-11-02',
 'version_id': None,
 'date': datetime.datetime(2023, 8, 28, 20, 47, 16, tzinfo=datetime.timezone.utc),
 'request_server_encrypted': True,
 'encryption_key_sha256': None,
 'encryption_scope': None}

These errors occurred because the corresponding links do not have any article content.
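The string conversion before `json.dumps` is needed because exception objects are not JSON-serialisable. A minimal sketch with made-up entries (the keys and messages are illustrative, not actual errors from this run):

```python
import json

# Illustrative entries only: one per-row error record and one list of errors.
errors = {
    7: {"link": "https://example.com/a", "error": ValueError("no content")},
    "futureResult": [TimeoutError("timed out")],
}

try:
    json.dumps(errors)
    serialisable = True
except TypeError:
    serialisable = False  # exception objects cannot be encoded directly

# Convert every value to its string form, then serialise.
as_strings = {key: str(value) for key, value in errors.items()}
payload = json.dumps(as_strings)
print(serialisable)  # False
```

The trade-off is that each stored value becomes a flat string, so the `link` and `error` fields are no longer separate JSON keys when the file is read back.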

## **Cleaning the datasets stored in blob container**

URL format for accessing each dataset: `https://<storage-account-name>.blob.core.windows.net/<blob-container-name>/<blob-name>`
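The URL pattern can be filled in with a small helper. The function name `blob_url` is hypothetical; the account, container, and blob names in the usage line are the ones that appear in this notebook. Reading a blob through a plain URL like this presumably relies on the container allowing anonymous read access.

```python
def blob_url(account, container, blob_name):
    """Build the HTTPS URL of a blob from its account, container, and name."""
    return f"https://{account}.blob.core.windows.net/{container}/{blob_name}"


url = blob_url("shivmldatasets", "ml-datasets", "gfg-articles-scrapped-1.csv")
print(url)
```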

**1.** After reading the first file, I noticed an extra row at the end whose `text` is `NaN`. The same row also appears in the next file, where its `text` is not `NaN`, so the copy at the end of the first file needs to be removed. This does not happen in the last file.

**2.** Also, a new column `Unnamed: 0` contains the original index. This column needs to be renamed.
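One plausible explanation for the extra row in point 1 is that pandas label-based slicing with `.loc` includes the end label, so a slice from `batch_start` to `batch_end` covers one row more than the batch. The sketch below models this with plain Python ranges and small made-up sizes (10 rows, batches of 4); the function names are hypothetical.

```python
def batch_bounds(n_rows, batch_size):
    """Return (start, end) pairs where `end` is the exclusive end of each batch."""
    return [(s, min(s + batch_size, n_rows)) for s in range(0, n_rows, batch_size)]


def labels_in_inclusive_slice(start, end, n_rows):
    """Labels selected when both `start` and `end` are included, as in a label slice."""
    return [label for label in range(start, end + 1) if label < n_rows]


n_rows, batch_size = 10, 4
for start, end in batch_bounds(n_rows, batch_size):
    print(start, end, labels_in_inclusive_slice(start, end, n_rows))
# 0 4 [0, 1, 2, 3, 4]
# 4 8 [4, 5, 6, 7, 8]
# 8 10 [8, 9]
```

Every batch except the last picks up the first label of the next batch, which matches the observation that the last file has no extra row.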

In [48]:
# Read the first file.
pd.read_csv("https://shivmldatasets.blob.core.windows.net/ml-datasets/gfg-articles-scrapped-1.csv")

Unnamed: 0.1,Unnamed: 0,title,author_id,last_updated,link,category,text
0,0,5 Best Practices For Writing SQL Joins,priyankab14,"21 Feb, 2022",https://www.geeksforgeeks.org/5-best-practices...,easy,SQL (Structured Query Language) allows us to s...
1,1,Foundation CSS Dropdown Menu,ishankhandelwals,"20 Feb, 2022",https://www.geeksforgeeks.org/foundation-css-d...,easy,Foundation CSS is an open-source & responsive ...
2,2,Top 20 Excel Shortcuts That You Need To Know,priyankab14,"17 Feb, 2022",https://www.geeksforgeeks.org/top-20-excel-sho...,easy,Although many of us are already aware of Micro...
3,3,Servlet – Fetching Result,nishatiwari1719,"17 Feb, 2022",https://www.geeksforgeeks.org/servlet-fetching...,easy,Servlet is a simple java program that runs on ...
4,4,Suffix Sum Array,rohit768,"21 Feb, 2022",https://www.geeksforgeeks.org/suffix-sum-array/,easy,Suffix Sum ArrayGiven an array arr[] of size N...
...,...,...,...,...,...,...,...
1020,1020,How to remove unused dependencies from composer?,viditawasthi2,"27 Jan, 2021",https://www.geeksforgeeks.org/how-to-remove-un...,easy,Removing unused Dependencies from Composer is ...
1021,1021,Modulo or Remainder Operator in Java,tejswini2000k,"23 Feb, 2022",https://www.geeksforgeeks.org/modulo-or-remain...,easy,Modulo or Remainder Operator returns the remai...
1022,1022,Google’s Coding Competitions You Can Consider ...,sanju6890,"28 Jan, 2021",https://www.geeksforgeeks.org/googles-coding-c...,easy,"Want to grow your coding skills, meet like-min..."
1023,1023,Functional Programming in Java 8+ using the St...,dimpalagrawal21,"09 Dec, 2021",https://www.geeksforgeeks.org/functional-progr...,easy,API is an acronym for Application Programming ...


In [47]:
# Read the last file.
pd.read_csv("https://shivmldatasets.blob.core.windows.net/ml-datasets/gfg-articles-scrapped-34.csv")

Unnamed: 0.1,Unnamed: 0,title,author_id,last_updated,link,category,text
0,33792,Maximum sum of Array formed by replacing each ...,coder001,"29 Oct, 2021",https://www.geeksforgeeks.org/maximum-sum-of-a...,expert,"Given an array arr[] of size N , the task is t..."
1,33793,Smallest N digit number whose sum of square of...,king_tsar,"17 Feb, 2022",https://www.geeksforgeeks.org/smallest-n-digit...,expert,"Given an integer N, find the smallest N digit ..."
2,33794,Minimum length subarray containing all unique ...,Sanjit_Prasad,"18 Nov, 2021",https://www.geeksforgeeks.org/minimum-length-s...,expert,Given an array of size N containing all elemen...
3,33795,Memory Access Methods,rajkumarupadhyay515,"26 Jul, 2021",https://www.geeksforgeeks.org/memory-access-me...,expert,These are 4 types of memory access methods:\n1...
4,33796,Count of subsequences whose product is a diffe...,sauravlal_2233,"17 Nov, 2021",https://www.geeksforgeeks.org/count-of-subsequ...,expert,Given an array arr[] containing N elements tha...
...,...,...,...,...,...,...,...
754,34546,Data Structures | Queue | Question 11,GeeksforGeeks,"28 Jun, 2021",https://www.geeksforgeeks.org/data-structures-...,expert,"An implementation of a queue Q, using two stac..."
755,34547,Data Structures | Binary Trees | Question 1,GeeksforGeeks,"28 Jun, 2021",https://www.geeksforgeeks.org/data-structures-...,expert,Which of the following is true about Binary Tr...
756,34548,Amazon Interview | Set 9,GeeksforGeeks,"28 Apr, 2017",https://www.geeksforgeeks.org/amazon-interview...,expert,How did it start?\nI completed and submitted t...
757,34549,Python Program for Rat in a Maze | Backtracking-2,GeeksforGeeks,"02 Aug, 2021",https://www.geeksforgeeks.org/python-program-f...,expert,We have discussed Backtracking and Knight’s to...


In [57]:
# Clean the datasets.
for i in range(1, 35):
    articlesi=pd.read_csv(f"https://shivmldatasets.blob.core.windows.net/ml-datasets/gfg-articles-scrapped-{i}.csv")
    # Remove last row.
    if i!=34:
        articlesi.drop(articlesi.shape[0]-1, inplace=True)
    # Rename column.
    articlesi.rename(columns={"Unnamed: 0": "id"}, inplace=True)
    # Save to blob without writing the dataframe index, so that no new unnamed column is added.
    blob=BlobClient.from_connection_string(conn_str=connectionString, container_name="ml-datasets", blob_name=f"gfg-articles-scrapped-{i}.csv")
    blob.upload_blob(articlesi.to_csv(index=False), overwrite=True)