# AWS Glue Studio Notebook
This notebook reads JSON data from an S3 bucket, flattens the nested schema, and saves the result as a Parquet file.


#### Optional: Run this cell to see available notebook commands ("magics").


#### Environment Setup



In [5]:
%idle_timeout 2880
%glue_version 3.0
%worker_type G.1X
%number_of_workers 5



Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.38.1 
Current idle_timeout is 2800 minutes.
idle_timeout has been set to 2880 minutes.
Setting Glue version to: 3.0
Previous worker type: G.1X
Setting new worker type to: G.1X
Previous number of workers: 5
Setting new number of workers to: 5


### Libraries

In [1]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from datetime import datetime




Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::505802839350:role/service-role/AWSGlueServiceRole-news
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 0e65cfee-a77d-4928-bc53-cecb47f390f3
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.38.1
--enable-glue-datacatalog true
Waiting for session 0e65cfee-a77d-4928-bc53-cecb47f390f3 to get into ready status...
Session 0e65cfee-a77d-4928-bc53-cecb47f390f3 has been created.



#### Initializing context


In [2]:
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)





In [8]:
current_date = datetime.now().strftime("%Y-%m-%d")





#### Data Loading


In [9]:
input_path = f"s3://news-etl-09-08-23/raw-data/{current_date}/all_news_{current_date}.json"
# The JSON reader infers the schema by default, so only the multiLine option is needed
df = spark.read.option("multiLine", "true").json(input_path)
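
The input path depends on the clock at run time, so a backfill for an earlier day would look for the wrong file. Here is a minimal sketch of one way to make the date explicit, assuming the same bucket layout as in the cell above. `build_input_path` is a hypothetical helper, not part of the original notebook.

```python
from datetime import date

BUCKET = "news-etl-09-08-23"  # bucket name taken from the cell above


def build_input_path(run_date: date) -> str:
    """Return the raw-data S3 path for a given day (hypothetical helper)."""
    day = run_date.strftime("%Y-%m-%d")
    return f"s3://{BUCKET}/raw-data/{day}/all_news_{day}.json"


# Example: path for a backfill of 30 September 2023
print(build_input_path(date(2023, 9, 30)))
```

Passing the date in as an argument keeps the default behavior available through `build_input_path(date.today())` while allowing a specific day to be reprocessed.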





#### Data Transformation


In [10]:
df_flattened = df.select(
    "author",
    "content",
    "description",
    "publishedAt",
    F.col("source.id").alias("source_id"),
    F.col("source.name").alias("source_name"),
    "title",
    "url",
    "urlToImage"
)
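
To make the effect of the `select` concrete, here is a plain-Python sketch of the same reshaping applied to a single record. The sample article is made up for illustration and `flatten_article` is a hypothetical helper. The notebook itself does this in Spark, not on Python dictionaries.

```python
def flatten_article(article: dict) -> dict:
    """Lift the nested `source` struct into top-level source_id / source_name keys."""
    source = article.get("source") or {}
    # Keep every top-level field except the nested struct itself
    flat = {key: value for key, value in article.items() if key != "source"}
    flat["source_id"] = source.get("id")
    flat["source_name"] = source.get("name")
    return flat


# Made-up record with the same shape as the input JSON
sample = {
    "author": "A. Writer",
    "title": "Example headline",
    "source": {"id": "example-id", "name": "Example News"},
}
flat_sample = flatten_article(sample)
print(flat_sample)
```

As in the Spark version, a record whose `source` is missing or null ends up with null (`None`) values for `source_id` and `source_name` instead of raising an error.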





#### Data Saving


In [11]:
output_path = f"s3://news-etl-09-08-23/transformed-data/all_news_flattened_{current_date}.parquet"
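# Overwrite so that re-running the notebook on the same day does not fail on an existing path
df_flattened.write.mode("overwrite").parquet(output_path)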





#### Validation

In [12]:
df_parquet = spark.read.parquet(output_path)
df_parquet.printSchema()
df_parquet.show(5)


root
 |-- author: string (nullable = true)
 |-- content: string (nullable = true)
 |-- description: string (nullable = true)
 |-- publishedAt: string (nullable = true)
 |-- source_id: string (nullable = true)
 |-- source_name: string (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
 |-- urlToImage: string (nullable = true)

+--------------------+--------------------+--------------------+--------------------+----------------+----------------+--------------------+--------------------+--------------------+
|              author|             content|         description|         publishedAt|       source_id|     source_name|               title|                 url|          urlToImage|
+--------------------+--------------------+--------------------+--------------------+----------------+----------------+--------------------+--------------------+--------------------+
|    Quentyn Kennemer|Watch Canelo defe...|Canelo Alvarez pu...|2023-09-30T22:00:01Z
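
Beyond reading the printed schema by eye, the column check can be automated. This is a minimal sketch, assuming the nine columns printed above are the expected output. `missing_columns` is a hypothetical helper that takes a plain list of names, such as `df_parquet.columns`.

```python
EXPECTED_COLUMNS = [
    "author", "content", "description", "publishedAt",
    "source_id", "source_name", "title", "url", "urlToImage",
]


def missing_columns(actual: list) -> list:
    """Return the expected column names that are absent from `actual`."""
    present = set(actual)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# In the notebook this would be called as missing_columns(df_parquet.columns)
print(missing_columns(["author", "title", "url"]))
```

An empty list means every expected column made it into the Parquet output.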

#### Conclusion

The notebook flattened the nested JSON schema and saved the data as a Parquet file in S3.
