# Get Feeds

This notebook retrieves the RSS feeds for the specified list of blogs and stores them as parquet files in the __Files__ section of the Lakehouse. The files are organized in folders by retrieval date, which preserves the history. The latest feeds are also written to the _feeds_ table, which the next steps use.
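As a minimal sketch of the date-based folder layout, the helper below builds the same `root/year/month/day/blog` path that the code cells in this notebook use. The function name `feed_path` and the blog name `example_blog` are hypothetical and only serve as illustration.

```python
from datetime import date


def feed_path(root: str, blog: str, day: date) -> str:
    """Build the day-based folder path for one blog's feed (hypothetical helper)."""
    # month and day are not zero-padded, matching the paths used in this notebook
    return f"{root}/{day.year}/{day.month}/{day.day}/{blog}"


print(feed_path("Files/feeds", "example_blog", date(2024, 3, 7)))
# Files/feeds/2024/3/7/example_blog
```

Because each run writes with `mode("overwrite")` into the folder for the current day, running the notebook twice on the same day replaces that day's files, while earlier days stay untouched.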

First, we run the _CONFIG_ notebook, which contains our configuration variables.

### Library dependencies
- __[feedparser](https://pypi.org/project/feedparser/)__: Parse Atom and RSS feeds in Python. _Requires installation_.
- __[pyspark.sql.functions.col](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.col.html)__: Returns a Column based on the given column name.
- __[pyspark.sql.functions.lit](https://spark.apache.org/docs/3.4.0/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lit.html)__: Creates a Column of literal value.
- __[datetime.date](https://docs.python.org/3/library/datetime.html)__: The datetime module supplies classes for manipulating dates and times.
- __[re](https://docs.python.org/3/library/re.html)__: This module provides regular expression matching operations similar to those found in Perl.
- __[pyspark.sql.types](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/data_types.html)__: Define a schema structure and data types.


In [None]:
%run CONFIG

## Get feeds based on blog list

Retrieve the latest feed for each blog in the blog list, keep only the columns needed downstream, and store the result in the Lakehouse.
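The column selection and the author fallback can be illustrated in plain Python on a single feed entry. This is only a sketch: `normalise_entry` and `KEPT_FIELDS` are hypothetical names, and the fallback here is applied per entry, whereas the cell below checks once per feed whether an `author` column exists.

```python
# fields kept from each feed entry, mirroring the columns selected in the cell below
KEPT_FIELDS = ("guidislink", "id", "link", "published", "summary", "title")


def normalise_entry(entry: dict, blog: str, blog_title: str) -> dict:
    """Keep the selected fields, tag the blog, and fall back to the blog title as author."""
    row = {field: entry.get(field) for field in KEPT_FIELDS}
    row["blog"] = blog
    # use the blog title when the entry has no (or an empty) author
    row["author"] = entry.get("author") or blog_title
    return row


entry = {"id": "1", "title": "Hello", "link": "https://example.com/1", "tags": ["x"]}
print(normalise_entry(entry, blog="example_blog", blog_title="Example Blog")["author"])
# Example Blog
```

Fields that are missing from an entry come back as `None`, which corresponds to the nullable fields in the table schema used later.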



In [None]:
################################
# Get feeds based on blog list #
################################

import feedparser
from pyspark.sql.functions import col, lit
from datetime import date
import re

# retrieve current list of blogs from the Lakehouse (originally coming from the SharePoint List)
blogs = spark.read.format("delta").load("Tables/blogs")

# get current date 
current_date = date.today()

# loop through all the blogs and retrieve the RSS feed for each of them
for url in blogs.collect():
    try:
        feed = feedparser.parse(url.RSSFeed)['entries']
    except Exception as ex:
        print(f"[Error] Couldn't get feed for '{url.RSSFeed}': {ex}")
        # skip this blog; otherwise `feed` would be undefined or left over from the previous blog
        continue

    # check whether the feed contains data; this mostly matters for invalid RSS feeds that don't return any entries
    if len(feed) > 0:

        # create spark dataframe out of feed for further manipulation
        df = spark.createDataFrame(feed)

        # write raw results as parquet to the Lakehouse. This is optional and can be deactivated by setting keep_raw_feeds to False
        if keep_raw_feeds:
            # the "header" option only applies to CSV, so it is not needed for parquet
            df.write.mode("overwrite").parquet(f"Files/feeds_raw/{current_date.year}/{current_date.month}/{current_date.day}/{url.blog}")

        # check if "author" column is in feed, as some of them don't have it (e.g. Fabric Blog)
        if "author" in df.columns:
            df = df.select(
                col("author"),
                col("guidislink"),
                col("id"),
                col("link"),
                col("published"),
                col("summary"),
                col("title")
            ).withColumn("blog", lit(url.blog))
        # if there is no author, use the blog title (url.Title) as author
        else:
            df = df.select(
                col("guidislink"),
                col("id"),
                col("link"),
                col("published"),
                col("summary"),
                col("title")
            ).withColumn("blog", lit(url.blog)).withColumn("author", lit(url.Title))

        # write cleaned file to the Lakehouse
        output_path = f"Files/feeds/{current_date.year}/{current_date.month}/{current_date.day}/{url.blog}"
        df.write.mode("overwrite").parquet(output_path)
        print(f"[Log] saved file: {output_path}")
    else:
        print(f"[Log] No feed parsed for {url.RSSFeed}")

## Combine feeds

Combine the most recent version of the feeds into a single table and store it in the lakehouse. 
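The cell below reads only the folder for the current day. If a run ever needs to locate the most recent folder that actually exists, note that the month and day in the folder names are not zero-padded, so comparing the names as strings gives the wrong answer. The sketch below compares parsed dates instead. `parse_partition`, `latest_partition`, and the folder names are hypothetical and not part of the notebook.

```python
from datetime import date


def parse_partition(path: str) -> date:
    """Turn a 'year/month/day' folder suffix such as '2024/3/7' into a date."""
    year, month, day = (int(part) for part in path.split("/"))
    return date(year, month, day)


def latest_partition(paths: list[str]) -> str:
    """Return the folder suffix with the most recent date."""
    return max(paths, key=parse_partition)


folders = ["2024/3/7", "2024/10/1", "2024/9/30"]
print(latest_partition(folders))  # 2024/10/1
print(max(folders))               # 2024/9/30, because strings compare character by character
```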


In [None]:
###########################################
# Get all feeds and combine them in table #
###########################################

from pyspark.sql.types import BooleanType, StringType, StructField, StructType

# define table schema
schema = StructType(
   [StructField('author', StringType(), True),
    StructField('guidislink', BooleanType(), True),
    StructField('id', StringType(), True),
    StructField('link', StringType(), True),
    StructField('published', StringType(), True),
    StructField('summary', StringType(), True),
    StructField('title', StringType(), True),
    StructField('blog', StringType(), True)
   ]
  )

# read the cleaned parquet files written today for all blogs (the "header" option only applies to CSV, so it is not needed)
df = spark.read.format("parquet").schema(schema).load(f"Files/feeds/{current_date.year}/{current_date.month}/{current_date.day}/*")
display(df)

# store the combined version as a new table in the Lakehouse
df.write.format("delta").mode("overwrite").saveAsTable("feeds")