Create a Spark session and configure its CPU and memory resources.
Read all the downloaded Parquet files from the "parquet_files" folder into a DataFrame (df).
The schema is read automatically from the Parquet files, and the DataFrame is registered as a temporary view named "ccnews".
We can now query "ccnews" with SQL just like any other table.

In [37]:
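from pyspark.sql import SparkSession

# local[4] runs Spark inside a single JVM with 4 worker threads.
# Note: in local mode the executors share the driver JVM, so the heap is
# governed by the driver memory, and spark.cores.max only applies to
# standalone/Mesos clusters. Both settings are kept here as in the original.
spark = SparkSession.builder \
    .master('local[4]') \
    .appName('content_analysis') \
    .config('spark.executor.memory', '10gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()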

# Parquet files carry their own schema and compression codec, so the
# CSV-style options (inferSchema, header, compression) are not needed here.
df = spark.read.parquet("./parquet_files/")

df.createOrReplaceTempView("ccnews")
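
As a quick check that the view is queryable, here is a minimal sketch of an ad hoc query. It assumes the `spark` session and the "ccnews" view created above. The helper name `count_articles_by_language` is hypothetical, and the only column it references is `language`, which the later queries also use.

```python
def count_articles_by_language(spark):
    """Return a DataFrame with one row per language and its article count.

    `spark` is expected to be the SparkSession that registered the
    "ccnews" temporary view. The helper name is illustrative only.
    """
    query = """
        SELECT language, COUNT(*) AS articles
        FROM ccnews
        GROUP BY language
        ORDER BY articles DESC
    """
    return spark.sql(query)
```

In the notebook this could be called as `count_articles_by_language(spark).show(10, False)` to list the ten most common languages.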

The following query finds the most frequently covered category among English-language articles for each year.
The data is partitioned by year, and the most frequent category in each year is displayed.
To also see the 2nd or 3rd most frequent category, raise the rank threshold in the last line of the query.
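
The partition-and-rank step can be pictured in plain Python. The sketch below mimics `ROW_NUMBER() OVER (PARTITION BY published_year ORDER BY frequency DESC)` on a handful of made-up rows; the categories and counts are invented for illustration and are not taken from the dataset, and `top_categories_per_year` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (category, year, frequency) rows, invented for illustration.
rows = [
    ("Sports", 2023, 40),
    ("Politics", 2023, 75),
    ("Business", 2023, 60),
    ("Sports", 2024, 90),
    ("Politics", 2024, 55),
]

def top_categories_per_year(rows, max_rank=1):
    """Keep the `max_rank` most frequent categories within each year."""
    # PARTITION BY year: collect the rows of each year together.
    by_year = defaultdict(list)
    for category, year, frequency in rows:
        by_year[year].append((category, frequency))

    result = []
    for year in sorted(by_year, reverse=True):  # newest year first
        # ORDER BY frequency DESC inside the partition.
        ranked = sorted(by_year[year], key=lambda item: item[1], reverse=True)
        # ROW_NUMBER(): number the rows 1, 2, 3, ... and filter on the rank.
        for rank, (category, frequency) in enumerate(ranked, start=1):
            if rank <= max_rank:
                result.append((year, rank, category, frequency))
    return result

top_one = top_categories_per_year(rows, max_rank=1)
top_two = top_categories_per_year(rows, max_rank=2)
```

With `max_rank=1` only the top category of each year survives; raising it to 2 or 3 corresponds to changing the rank threshold in the SQL.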

In [40]:
english_df = spark.sql("""
    select news_category, frequency, published_year, news_language from (
        select news_category, frequency, published_year, news_language,
            ROW_NUMBER() OVER (PARTITION BY published_year ORDER BY frequency DESC) as rank
        from (
            select categories as news_category, count(1) as frequency,
                any_value(language) as news_language,
                YEAR(published_date) as published_year
            from ccnews
            where language = 'en'
            group by categories, published_year
            having LENGTH(categories) != 0 and categories != 'none'
        )
    )
    where rank <= 1
    order by published_year desc
""")

# show() returns None, so keep the DataFrame and display it separately.
english_df.show(english_df.count(), False)

+-------------+---------+--------------+-------------+
|news_category|frequency|published_year|news_language|
+-------------+---------+--------------+-------------+
|News
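
The query above hard-codes the language code and the rank threshold. One possible way to avoid copying it for each language is a small helper that builds the SQL text. This is only a sketch: `build_top_categories_query` and its parameters are hypothetical names, and the column names are the ones used in the query above.

```python
def build_top_categories_query(language, max_rank=1):
    """Build the top-categories-per-year SQL for one language code."""
    # Validate the inputs because they are embedded directly in the SQL text.
    if not language.isalpha():
        raise ValueError("language must be an alphabetic code such as 'en'")
    max_rank = int(max_rank)
    if max_rank < 1:
        raise ValueError("max_rank must be at least 1")

    return f"""
        SELECT news_category, frequency, published_year, news_language
        FROM (
            SELECT news_category, frequency, published_year, news_language,
                   ROW_NUMBER() OVER (PARTITION BY published_year
                                      ORDER BY frequency DESC) AS rank
            FROM (
                SELECT categories AS news_category,
                       COUNT(1) AS frequency,
                       any_value(language) AS news_language,
                       YEAR(published_date) AS published_year
                FROM ccnews
                WHERE language = '{language}'
                GROUP BY categories, published_year
                HAVING LENGTH(categories) != 0 AND categories != 'none'
            )
        )
        WHERE rank <= {max_rank}
        ORDER BY published_year DESC
    """
```

In the notebook, `spark.sql(build_top_categories_query('en', 1))` would then return the same kind of DataFrame as the cell above, and other language codes or rank thresholds need only different arguments.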

The following query finds the two most frequently covered categories among Arabic-language articles for each year.
The data is partitioned by year, and the two most frequent categories in each year are displayed (rank <= 2).
To see more or fewer categories per year, change the rank threshold in the last line of the query.

In [42]:
arabic_df = spark.sql("""
    select news_category, frequency, published_year, news_language from (
        select news_category, frequency, published_year, news_language,
            ROW_NUMBER() OVER (PARTITION BY published_year ORDER BY frequency DESC) as rank
        from (
            select categories as news_category, count(1) as frequency,
                any_value(language) as news_language,
                YEAR(published_date) as published_year
            from ccnews
            where language = 'ar'
            group by categories, published_year
            having LENGTH(categories) != 0 and categories != 'none'
        )
    )
    where rank <= 2
    order by published_year desc
""")

# show() returns None, so keep the DataFrame and display it separately.
arabic_df.show(arabic_df.count(), False)



+--------------------------+---------+--------------+-------------+
|news_category             |frequency|published_year|news_language|
+--------------------------+---------+--------------+-------------+
|أخبار مصر                 |1388     |2024          |ar           |
|اخبار العالم              |872      |2024          |ar           |
|أخبار مصر                 |1242     |2023          |ar           |
|الأخبار                   |651      |2023          |ar           |
|أخبار عالمية              |716      |2022          |ar           |
|أخبار مصر                 |712      |2022          |ar           |
|اخبار اليمن               |18       |2021          |ar           |
|أخبار عالمية              |4        |2021          |ar           |
|أخبار عربية               |1        |2020          |ar           |
|أمريكا                    |1        |2020          |ar           |
|أهم الاخبار               |1        |2019          |ar           |
|ثقافة وفنون               |1        |2019      

                                                                                