Skip to content

Building big data pipelines for news data and providing visualisations

Notifications You must be signed in to change notification settings

aloysiusng/Big-Data-Archiecture-for-News-Dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Solving business questions with news category data set

Business question 1:

Business Question: Can we visualize the distribution of the top 5 categories of articles by year in HuffPost?
Target beneficiary: HuffPost
How does it help? This helps HuffPost to understand the trend of the top 5 categories of articles by year. This can help them to focus on the categories that are trending and produce more articles on those categories.

Business question 2:

Business Question: Can we visualize the number of articles produced each year by various news agencies?
Target beneficiary: News Agencies
How does it help? This helps news agencies to understand the trend of the number of articles produced each year. This can help them to focus on the years that have the highest number of articles produced and produce more articles in those years.

Business question 3:

Business Question: Can we identify what words are commonly used for headlines?
Target beneficiary: Article authors
How does it help? This helps article authors to understand what words are commonly used for headlines. This can help them to spend less time thinking of a great headline and spend more time refining their article.

Fields of the dataset

For each article the attribute are as follows:

Attribute Attribute Description
category string category article belongs to
headline string headline of the article
authors string person authored the article
link string link to the post
short_description string short description of the article
date string date the article was published

Fields of supplementary data from news api

Source: https://newsapi.org/ For each article the attribute are as follows:

Attribute Attribute Description
source_id string id of news agency
source_name string name of news agency
author string author of article
title string title of article
description string description of article
url string url to article
urltoimage string url to image
publishedat string when was the article published

Repository Orientation

  • data folder
    • contains the data set from Kaggle
    • contains SQL queries for the data set used in Athena
    • contains the image for the pipelines' AWS architectures
  • glue folder
    • contains the code for the glue crawler and glue ETL job
  • lambda folder
    • contains the "get_news" folder for the lambda function
      • contains the code for the lambda function (Data Ingestion)
      • contains readme for installing the dependencies
      • readme also contains the code for the lambda function if subscribed to paid subscription
  • terraform folder
    • contains the terraform code for the architecture
  • visualizations folder
    • contains the manifest file for the visualizations on QuickSight
    • contains the example visualizations on QuickSight

Pre-requisites

  1. AWS account
  2. Terraform CLI
  3. News API key

How to set up the project

  1. Clone the repository
  2. cd to lambda/get_news directory, and read the readme.md there before proceeding to the next step
  3. Under the terraform directory, create a terraform.tfvars file with the following content:
AWS_ACCESS_KEY_ID     = "your_aws_access_key_id"
AWS_SECRET_ACCESS_KEY = "your_aws_secret_access_key"
NEWS_API_KEY          = "your_news_api_key"
AWS_ACCOUNT_ID        = "your_aws_account_id"
AWS_REGION            = "your_aws_region"
  1. Run the following commands:
cd terraform
terraform init
terraform apply
  1. Once the terraform is applied, the architecture will be created in AWS. The pipelines are linked to the business questions. AWS Architecture for all pipelines

  2. After provisioning the architecture, you can immediately head to AWS Glue to run the crawler and ETL job.

  3. Thereafter, you can head to AWS Athena to run the saved queries

  4. Examples of QuickSight visualizations are also available in the visualizations folder

  5. To tear down the architecture, run the following command:

cd terraform
terraform destroy

Other Details

Cost: Do note that there might be costs incurred for the AWS services used in this project.

Data Ingestion: Batch (AWS Lambda)

Data Processing: Batch (AWS Glue)

For the scheduled triggers:

  • 1200 daily => lambda get_news
  • 1220 daily => glue crawler
  • 1240 daily => glue ETL job

Limitations

The API key for News API is a free tier key, which means that it is limited to 100 requests per day. This means that the data ingestion is limited to 100 articles per day and it will not contain data for the past years. Will require a paid subscription to News API to get more data -> code prepared in /lambda/readme.md if subscribed to paid subscription

Points for improvement

  • Suboptimal data ingestion of Kaggle Data Set through 2 ways:

    1. Manual downloading of Kaggle dataset and uploading to S3
    2. Manual downloading of Kaggle dataset, replacing it in the code base and run terraform apply

Improvement: Using EC2 or a Lambda function to automate this process in the future

  • Modularising pipelines in Terraform instead of having one big file
    • Some modularisation is done for IAM and ETL jobs, but segregating the pipelines into different files will be better

Learning Points

  1. Deeper understanding of big data pipelines
  2. Using Terraform for big data pipelines
  3. Using AWS Glue, Athena, QuickSight
  4. Exploring PySpark’s functions like ML & SQL

Citation of data sources:

  1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
  2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).
  3. https://www.kaggle.com/datasets/rmisra/news-category-dataset

About

Building big data pipelines for news data and providing visualisations

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published