<a href="https://colab.research.google.com/github/elakurthyshivani/GFG-Articles-Summarizer/blob/dev%2Fbuilding-dataset/dataset/BuildingDataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Install packages if not yet installed**

In [None]:
import sys

!{sys.executable} -m pip install beautifulsoup4 # BeautifulSoup (imported as bs4)
!{sys.executable} -m pip install opendatasets # OpenDatasets
!{sys.executable} -m pip install pyspark # PySpark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 310.8/310.8 MB 3.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285397 sha256=ba62c97c509c0c9f78127dd4c4e1e4aea701a22e0b3a887671c2a03bca8be1d5
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


## **Reading the dataset**

**1.** Create a file named `kaggle.json` and save your Kaggle username and API key in it. These credentials are used to download the dataset from Kaggle.

**2.** Download the dataset with the `opendatasets` package. The dataset URL is [https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles](https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles "GeeksForGeeks Articles Dataset"). Step 1 must be completed first so that `opendatasets` can pick up your username and API key automatically.

**3.** Create a Spark Session to start working with PySpark.

**4.** Read the downloaded dataset.
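
Step 1 can also be done without typing the API key into the notebook. The following is a minimal sketch, not part of the original notebook: the helper name `write_kaggle_credentials` is hypothetical, and the environment variable names `KAGGLE_USERNAME` and `KAGGLE_KEY` are assumed to have been set beforehand.

```python
import json
import os
from pathlib import Path


def write_kaggle_credentials(path="kaggle.json", env=os.environ):
    """Write kaggle.json from environment variables (hypothetical helper).

    Assumes KAGGLE_USERNAME and KAGGLE_KEY were exported before the
    notebook was started; adjust the names to match your own setup.
    """
    username = env.get("KAGGLE_USERNAME")
    key = env.get("KAGGLE_KEY")
    if not username or not key:
        raise RuntimeError("Set KAGGLE_USERNAME and KAGGLE_KEY before running this cell.")

    target = Path(path)
    target.write_text(json.dumps({"username": username, "key": key}))
    # Restrict the file to the current user (may have no effect on some platforms).
    target.chmod(0o600)
    return target
```

Calling `write_kaggle_credentials()` before the download cell would leave a `kaggle.json` in the working directory, so the key itself never appears in the saved notebook.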

In [None]:
import json
import opendatasets as od
from pyspark.sql import SparkSession

In [None]:
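# Creating kaggle.json file. Replace the placeholders with your own Kaggle
# username and API key; never commit real credentials to the repository.
with open("kaggle.json", "w") as kaggleFile:
    json.dump({"username": "<your-kaggle-username>", "key": "<your-kaggle-api-key>"}, kaggleFile)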

In [None]:
# Downloading the dataset.
od.download("https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-articles")

Skipping, found downloaded files in "./geeksforgeeks-articles" (use force=True to force download)


In [None]:
# Create a Spark Session.
spark=SparkSession.builder.appName('geeks_for_geeks_articles').getOrCreate()

In [None]:
# Reading the dataset.
articles=spark.read.option('header', True)\
          .option('inferSchema', True)\
          .csv(r"geeksforgeeks-articles/articles.csv")
articles.show(5, truncate=False)

+--------------------------------------------+----------------+------------+---------------------------------------------------------------------------+--------+
|title                                       |author_id       |last_updated|link                                                                       |category|
+--------------------------------------------+----------------+------------+---------------------------------------------------------------------------+--------+
|5 Best Practices For Writing SQL Joins      |priyankab14     |21 Feb, 2022|https://www.geeksforgeeks.org/5-best-practices-for-writing-sql-joins/      |easy    |
|Foundation CSS Dropdown Menu                |ishankhandelwals|20 Feb, 2022|https://www.geeksforgeeks.org/foundation-css-dropdown-menu/                |easy    |
|Top 20 Excel Shortcuts That You Need To Know|priyankab14     |17 Feb, 2022|https://www.geeksforgeeks.org/top-20-excel-shortcuts-that-you-need-to-know/|easy    |
|Servlet – Fetching Result  

## **Dropping rows with null values**

In [None]:
articles=articles.dropna()

## **Scrape text from the URL to get the article content**

In [None]:
from bs4 import BeautifulSoup
from pyspark.sql.functions import lit
from urllib.request import urlopen # Or requests

In [None]:
# Add a new column to hold the text scraped from the URLs.
articles=articles.withColumn("text", lit(""))
articles.show()

+--------------------+--------------------+------------+--------------------+--------+----+
|               title|           author_id|last_updated|                link|category|text|
+--------------------+--------------------+------------+--------------------+--------+----+
|5 Best Practices ...|         priyankab14|21 Feb, 2022|https://www.geeks...|    easy|    |
|Foundation CSS Dr...|    ishankhandelwals|20 Feb, 2022|https://www.geeks...|    easy|    |
|Top 20 Excel Shor...|         priyankab14|17 Feb, 2022|https://www.geeks...|    easy|    |
|Servlet – Fetchin...|     nishatiwari1719|17 Feb, 2022|https://www.geeks...|    easy|    |
|    Suffix Sum Array|            rohit768|21 Feb, 2022|https://www.geeks...|    easy|    |
|Kelvin To Celsius...|         ramneek2307|16 Feb, 2022|https://www.geeks...|    easy|    |
|How to Install Mo...|         ramneek2307|12 Feb, 2022|https://www.geeks...|    easy|    |
|7 Highest Paying ...|        vanshika4042|18 Feb, 2022|https://www.geeks...|   

In [None]:
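def scrapText(link):
    # Download the page and parse its HTML.
    page=urlopen(link).read().decode("utf-8")
    parser=BeautifulSoup(page, 'html.parser')
    # Return the first <div class="text"> element, which holds the article body.
    return parser.find_all('div', attrs={"class": "text"})[0]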
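
When a function like this runs over every link in the dataset, a single unreachable or malformed URL would raise and stop the whole job. One possible way to guard the download step is sketched below. The helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
from urllib.request import urlopen


def fetch_html(link, timeout=10):
    """Return the decoded page at `link`, or None if it cannot be fetched.

    Hypothetical helper: the timeout value is an arbitrary choice.
    """
    try:
        with urlopen(link, timeout=timeout) as response:
            # Replace undecodable bytes instead of failing on a bad character.
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers strings that are not valid URLs.
        return None
```

A caller could then skip the parsing step whenever `fetch_html` returns `None` and leave the `text` column empty for that row.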