<a href="https://colab.research.google.com/github/elakurthyshivani/GFG-Articles-Summarizer/blob/dev%2Fbuilding-dataset/dataset/BuildingDataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Install packages if not yet installed**

In [1]:
import sys

!{sys.executable} -m pip install bs4 # BeautifulSoup
!{sys.executable} -m pip install opendatasets # OpenDatasets
!{sys.executable} -m pip install pyspark # PySpark

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1256 sha256=00cd76fa2dbb1d77fccdb4ecb76869dffec0d9cb3abeb3e13688a285838e2557
  Stored in directory: /root/.cache/pip/wheels/25/42/45/b773edc52acb16cd2db4cf1a0b47117e2f69bb4eb300ed0e70
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22
Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 310.8/310.8 MB 4.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packa

## **Reading the dataset**

**1.** Create a file named `kaggle.json` containing your Kaggle username and API key. It is used to authenticate the dataset download from Kaggle.

**2.** The URL of the dataset is [https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles](https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles "GeeksForGeeks Articles Dataset"). Download it with the `opendatasets` package. Step 1 is required so that the download picks up your username and API key automatically.

**3.** Create a Spark Session to start working with PySpark.

**4.** Read the downloaded dataset.

In [2]:
import json
import opendatasets as od
from pyspark.sql import SparkSession

In [3]:
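# Creating kaggle.json file. Replace the placeholders with your own Kaggle
# credentials, and keep real keys out of shared or committed notebooks.
with open("kaggle.json", "w") as kaggleFile:
    json.dump({"username": "<your-kaggle-username>", "key": "<your-kaggle-api-key>"}, kaggleFile)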
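One way to avoid typing credentials into the notebook at all is to build `kaggle.json` from environment variables. The sketch below is only an illustration: the helper name `write_kaggle_credentials` and the variable names `KAGGLE_USERNAME` and `KAGGLE_KEY` are assumptions made for this example, and the variables are expected to be set before the cell runs.

```python
import json
import os
from pathlib import Path


def write_kaggle_credentials(path="kaggle.json", env=os.environ):
    """Write kaggle.json from environment variables (hypothetical helper)."""
    # Assumed variable names; set them in the shell before starting the notebook.
    username = env.get("KAGGLE_USERNAME")
    key = env.get("KAGGLE_KEY")
    if not username or not key:
        raise RuntimeError("Set KAGGLE_USERNAME and KAGGLE_KEY before running this cell.")
    target = Path(path)
    target.write_text(json.dumps({"username": username, "key": key}))
    # Restrict the file to the current user where the platform supports it.
    target.chmod(0o600)
    return target
```

Calling `write_kaggle_credentials()` in place of the cell above produces the same file layout (`username` and `key`), so the download step does not need to change.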

In [4]:
# Downloading the dataset.
od.download("https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles")

Downloading geeksforgeeks-articles.zip to ./geeksforgeeks-articles

100%|██████████| 1.31M/1.31M [00:00<00:00, 2.46MB/s]

In [5]:
# Create a Spark Session.
spark=SparkSession.builder.appName('geeks_for_geeks_articles').getOrCreate()

In [6]:
# Reading the dataset.
articles=spark.read.option('header', True)\
          .option('inferSchema', True)\
          .csv(r"geeksforgeeks-articles/articles.csv")
articles.show(5, truncate=False)

+--------------------------------------------+----------------+------------+---------------------------------------------------------------------------+--------+
|title                                       |author_id       |last_updated|link                                                                       |category|
+--------------------------------------------+----------------+------------+---------------------------------------------------------------------------+--------+
|5 Best Practices For Writing SQL Joins      |priyankab14     |21 Feb, 2022|https://www.geeksforgeeks.org/5-best-practices-for-writing-sql-joins/      |easy    |
|Foundation CSS Dropdown Menu                |ishankhandelwals|20 Feb, 2022|https://www.geeksforgeeks.org/foundation-css-dropdown-menu/                |easy    |
|Top 20 Excel Shortcuts That You Need To Know|priyankab14     |17 Feb, 2022|https://www.geeksforgeeks.org/top-20-excel-shortcuts-that-you-need-to-know/|easy    |
|Servlet – Fetching Result  

## **Dropping rows with null values**

In [7]:
articles=articles.dropna()

## **Setup logging**

In [8]:
import logging
logging.basicConfig(filename='buildingdataset.log', level=logging.DEBUG,
                    format='%(asctime)s - %(levelname)s - %(message)s')

## **Scrape text from the URL to get article content**

In [9]:
from bs4 import BeautifulSoup
from pyspark.sql.functions import lit, col, udf
import requests

In [10]:
# Add a new column to hold the text scraped from the URLs.
articles=articles.withColumn("text", lit(""))
articles.show(5, truncate=False)

+--------------------------------------------+----------------+------------+---------------------------------------------------------------------------+--------+----+
|title                                       |author_id       |last_updated|link                                                                       |category|text|
+--------------------------------------------+----------------+------------+---------------------------------------------------------------------------+--------+----+
|5 Best Practices For Writing SQL Joins      |priyankab14     |21 Feb, 2022|https://www.geeksforgeeks.org/5-best-practices-for-writing-sql-joins/      |easy    |    |
|Foundation CSS Dropdown Menu                |ishankhandelwals|20 Feb, 2022|https://www.geeksforgeeks.org/foundation-css-dropdown-menu/                |easy    |    |
|Top 20 Excel Shortcuts That You Need To Know|priyankab14     |17 Feb, 2022|https://www.geeksforgeeks.org/top-20-excel-shortcuts-that-you-need-to-know/|easy    |    

In [11]:
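# Define a User Defined Function to scrape text.
def scrapText(link):
    try:
        # The 30-second timeout is an illustrative value so that a stalled
        # request cannot hang the job indefinitely.
        page=requests.get(link, timeout=30).text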
        parser=BeautifulSoup(page, "html.parser")
        # Get the inner HTML of <div class="text"></div> tag. This consists of the main content.
        text=[""]
        for tag in parser.find("div", class_="text").contents:
            # Ignore all the <div> tags inside <div class="text"></div> as they do not have any
            # main content.
            if tag.name!="div":
                text.append(" ".join(tag.stripped_strings))
        # Return the main content.
        return "\n".join(text).strip("\n")
    except Exception as err:
        logging.error(f"ScrapText error ({link}) : {err}")
        return ""

scrapTextUDF=udf(scrapText)
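To see what the extraction loop in `scrapText` keeps and drops without making a network request, the same logic can be run on an inline HTML string. This is a minimal sketch: `SAMPLE_HTML` and the helper name `extract_main_text` are made up for illustration, and the markup only imitates the `<div class="text">` layout that the function above assumes.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: two paragraphs, a heading, and a nested <div>
# that should be ignored.
SAMPLE_HTML = """
<html><body>
<div class="text"><p>First paragraph.</p><div class="ad">Sponsored</div><h2>Heading</h2><p>Second <b>bold</b> paragraph.</p></div>
</body></html>
"""


def extract_main_text(html):
    """Apply the same child-by-child extraction as scrapText to an HTML string."""
    parser = BeautifulSoup(html, "html.parser")
    container = parser.find("div", class_="text")
    if container is None:
        # No main-content container found; mirror scrapText's empty result.
        return ""
    parts = []
    for tag in container.contents:
        # Skip nested <div> children, as scrapText does.
        if tag.name != "div":
            parts.append(" ".join(tag.stripped_strings))
    return "\n".join(parts).strip("\n")


print(extract_main_text(SAMPLE_HTML))
```

Each direct child of the container becomes one line of output, inline tags such as `<b>` are flattened into their parent's line, and the nested `<div>` is skipped.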

In [12]:
# Apply the UDF to text column using the link column.
articles=articles.withColumn("text", scrapTextUDF(articles["link"]))
articles.show(5)

+--------------------+----------------+------------+--------------------+--------+--------------------+
|               title|       author_id|last_updated|                link|category|                text|
+--------------------+----------------+------------+--------------------+--------+--------------------+
|5 Best Practices ...|     priyankab14|21 Feb, 2022|https://www.geeks...|    easy|SQL (Structured Q...|
|Foundation CSS Dr...|ishankhandelwals|20 Feb, 2022|https://www.geeks...|    easy|Foundation CSS is...|
|Top 20 Excel Shor...|     priyankab14|17 Feb, 2022|https://www.geeks...|    easy|Although many of ...|
|Servlet – Fetchin...| nishatiwari1719|17 Feb, 2022|https://www.geeks...|    easy|Servlet is a simp...|
|    Suffix Sum Array|        rohit768|21 Feb, 2022|https://www.geeks...|    easy|Suffix Sum ArrayG...|
+--------------------+----------------+------------+--------------------+--------+--------------------+
only showing top 5 rows



In [13]:
# Check the text present in a row.
print(articles.select("text").take(6)[5][0])

Kelvin and Celsius are two scales of temperature. Both of the scales are used in their own unique way. Kelvin scale is mainly used by scientists to measure the color temperature of the light source. On the other hand, the Celsius scale is used for general purposes like measuring the temperature of the water. These two scales can easily be converted from one to another. Before going straight to Kelvin to Celsius scale. Let’s know more about these scales and where are they used most.
kelvin
Kelvin is a scale of temperature. its unit is (K) as it is an absolute scale we do not use degree with it. kelvin is named after William Thomson, 1st Baron Kelvin. 0 (zero) Kelvin defines absolute null which means the kinetic energy of particles is very less at this temperature. In physical science, kelvin is the primary unit for measuring temperature.
Celsius
Celsius is also a scale of temperature. Its unit is in degree Celsius (°C ). it is named after astronomer Anders Celsius in 1948. On the Celsiu

## **Write to .parquet file**

In [15]:
# Write to Azure Blob Storage?
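The cell above leaves the destination open. As a starting point, a local Parquet write could look like the sketch below. The helper name `save_as_parquet`, the output path `articles.parquet`, and the `overwrite` mode are assumptions for illustration, and writing to Azure Blob Storage would need additional configuration that is not covered here.

```python
def save_as_parquet(df, path="articles.parquet", mode="overwrite"):
    """Write a Spark DataFrame to Parquet at `path` and return the path."""
    # DataFrameWriter.mode() controls what happens when the target exists;
    # "overwrite" replaces any previous output at that path.
    df.write.mode(mode).parquet(path)
    return path


# Usage in this notebook, before the session is stopped:
# save_as_parquet(articles, "articles.parquet")
```

Because Spark evaluates lazily, the write is the step that actually runs the scraping UDF over every row, so it can take much longer than the `show(5)` previews.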

In [16]:
# Stop the spark session.
spark.stop()