ETL Process

A project that performs an ETL process: extracting, cleaning, and transforming data, then loading it into GCP BigQuery.

  • Time of analysis v1: November 7, 2023.
  • Tools v1: CSV, Python, SQL, GCP BigQuery.


Background

This project extracts the top headlines from the News API.
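main.py is not shown in this README; the following is a minimal sketch of what the extraction step might look like, assuming the /v2/top-headlines endpoint and a country parameter (both assumptions), with the API key read from the .env file set up below:

from dotenv import load_dotenv
import os
import pandas as pd
import requests

# Read NEWS_API_KEY from the .env file (see Getting started)
load_dotenv()
api_key = os.getenv("NEWS_API_KEY")

# Fetch the top headlines; the endpoint and parameters are assumptions,
# the actual request lives in main.py
response = requests.get(
    "https://newsapi.org/v2/top-headlines",
    params={"country": "us", "apiKey": api_key},
)
response.raise_for_status()

# Flatten the nested article JSON (e.g. source.name) into columns
df = pd.json_normalize(response.json()["articles"])
df.to_csv("top_headlines.csv", index=False)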

Getting started

Clone the repository into a folder:

git clone git@github.com:denisse-orellana/ETL_process.git ETL_process
cd ETL_process

Check your Python version and install the dependencies:

which python3
pip3 install pandas requests "fugue[sql]" python-dotenv

Create a .env file:

touch .env

Add the API key to the .env file:

# .env

NEWS_API_KEY=APIKEYFROMNEWSAPI

Run the code to start the process:

python3 main.py

Two files will be generated:

  • top_headlines.csv
  • top_headlines_after_sql.csv

Workflow description

1. Inspect

df.head(5)       # preview the first rows
df.dtypes        # column types
# df.info()
df.describe()    # summary statistics
print(df.shape)  # (20, 9)

2. Transform

# Drop duplicate rows
df.drop_duplicates(inplace=True)

# Fill missing values with the string 'NaN'
df.fillna('NaN', inplace=True)

# Convert the publication date to a timezone-naive datetime
df['publishedAt'] = pd.to_datetime(df['publishedAt'], utc=True).dt.tz_convert(None)

# Drop rows whose news item was removed by the source
df_filtered_index = df[df['title'] == '[Removed]'].index
df.drop(df_filtered_index, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head(15)
df.shape  # (19, 9)

# Rename the nested source column
df.rename(columns={'source.name': 'source'}, inplace=True)

3. SQL

SELECT
    `publishedAt` AS `date`,
    `source`,
    `author`,
    `title`,
    `description`,
    `content`,
    `url`,
    `urlToImage` AS `image`
FROM
    df
PRINT
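With fugue[sql] installed, this query can run directly against the in-memory pandas DataFrame. A minimal sketch, assuming fugue's fsql entrypoint and swapping PRINT for a YIELD so the result comes back to Python (the repository's actual invocation may differ):

from fugue_sql import fsql

query = """
SELECT
    `publishedAt` AS `date`,
    `source`,
    `author`,
    `title`,
    `description`,
    `content`,
    `url`,
    `urlToImage` AS `image`
FROM df
YIELD DATAFRAME AS result
"""

# Run the FugueSQL query on the local pandas engine; df is passed in by name
result = fsql(query, df=df).run()

# Persist the transformed table for the BigQuery load step
result["result"].as_pandas().to_csv("top_headlines_after_sql.csv", index=False)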

BigQuery: Load data with GCP SDK

First, initialize the GCP SDK:

gcloud init

Select or create a project:

gcloud config set project PROJECT_ID

Create a dataset:

bq --location=US mk -d \
  --default_table_expiration 3600 \
  --description "Dataset for ETL Process." \
  ETL_process

Dataset 'ProjectName:ETL_process' successfully created.

Create a table:

bq load --source_format=CSV --skip_leading_rows=1 ETL_process.top_headlines \
  /Users/denisse/ETL_process/top_headlines_after_sql.csv \
  date:datetime,source:string,author:string,title:string,description:string,content:string,url:string,image:string

Upload complete.
.. (1s) Current status: DONE   
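Alternatively, the same load can be scripted with the google-cloud-bigquery Python client; a sketch assuming the library is installed and gcloud credentials are already configured (this repository uses the bq CLI shown above instead):

from google.cloud import bigquery

client = bigquery.Client()

# Mirror the bq CLI flags: CSV source, skip the header row, explicit schema
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[
        bigquery.SchemaField("date", "DATETIME"),
        bigquery.SchemaField("source", "STRING"),
        bigquery.SchemaField("author", "STRING"),
        bigquery.SchemaField("title", "STRING"),
        bigquery.SchemaField("description", "STRING"),
        bigquery.SchemaField("content", "STRING"),
        bigquery.SchemaField("url", "STRING"),
        bigquery.SchemaField("image", "STRING"),
    ],
)

with open("top_headlines_after_sql.csv", "rb") as f:
    job = client.load_table_from_file(
        f, "ETL_process.top_headlines", job_config=job_config
    )
job.result()  # wait for the load job to finish

table = client.get_table("ETL_process.top_headlines")
print(f"Loaded {table.num_rows} rows")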

Check the table from the terminal:

bq show ETL_process.top_headlines

View the schema and a preview of the table in the BigQuery console: ETL_process.top_headlines.
