Skip to content

ZhixueD/Bigquery-data-analyze-using-SQL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 

Repository files navigation

ELT-pipeline-BigQuery-data-analyze-using-SQL

In this project I analyze data from Spotify, datasets is a big csv file a 3.4G csv file, download from kaggle API: "datasets download -d dhruvildave/spotify-charts", I analyze the data in Google Cloud BigQuery using SQL. I load data from Google Cloud Storage to BigQuery. BigQuery is a Data Warehouse on Google Cloud Platform, which is powerful to handle big data and do data analysis.

https://www.kaggle.com/dhruvildave/spotify-charts

This is a big csv data, total 3.4 G, I use ELT pipeline, to do data analyze.

  1. Download data to google cloud storage
  2. From google cloud storage directly load data to Bigquery
  3. Data transform directly in Bigquery by SQL
  4. Data analyze in Bigquery by SQL

1. Download data and upload to google cloud storage

In linux:

    sudo apt-get install python3-pip
    pip3 install kaggle
    mkdir ~/.kaggle

copy kaggle.json key file from kaggle, and save it to ~/.kaggle

run command line:

    kaggle datasets download -d dhruvildave/spotify-charts

It start to download After finished, unzip it and upload to google cloud storage bucket.

google cloud storage data

2. From google cloud storage directly load data to Bigquery

Create a dataset in Bigquery, and create a table directly from google cloud storage

load data from GCS to bigquery

bigquery schema

3. Data transform directly in Bigquery by SQL

(1). clean data,we need to tidy the data as the artist column contains multiple values in same cell. We will separate each observation into its individual row

 Create or replace table spotify_dataset.spotify2
 SELECT * except (artist), split(artist,',') as artist FROM `t-osprey-337221.spotify_dataset.spotify`;

clean data

You will see artist mode change to repeated, that is an array, we save the result to table spotify2

clean data2

(2). We also found ' -' in artist array field, I should remove this ' -' in artist array field

  SELECT artist FROM `t-osprey-337221.spotify_dataset.spotify2` where ' -' in unnest(artist) LIMIT 1000

sql --

We run SQL to create table spotify3

  create or replace table spotify_dataset.spotify3 as 
  SELECT * except (artist), array(select * from unnest(artist) as x where x !=" -") as artist FROM `t-osprey-337221.spotify_dataset.spotify2`; 

We also flatten the artist field and create table spotify4

4. Data analyze in Bigquery by SQL

(1). List of unique artists on “Top 200”

 with templetable as (select title, date, artist_all FROM `t-osprey-337221.spotify_dataset.spotify3` as p, 
 unnest(p.artist) as artist_all where chart='top200')

 select count(distinct artist_all) from templetable

image

(2). Artist which appeared on the charts maximum number of times

We first flatten the artist field and create table spotify4

     Create table spotify_dataset.spotify4 as 
     (select p.* except(artist), artist_all FROM `t-osprey-337221.spotify_dataset.spotify3` as p, 
      unnest(p.artist) as artist_all)

And then:

     SELECT artist_all, count(*) as num FROM `t-osprey-337221.spotify_dataset.spotify4` 
     group by artist_all order by num desc

image

(3). Finding artists with highest number of streams

  SELECT artist_all, sum(streams) as total_streams FROM `t-osprey-337221.spotify_dataset.spotify4` 
  group by artist_all order by total_streams desc

image

(4). Artist with maximum number of songs to feature on chart

  SELECT artist_all, count(distinct title) as totol_songs FROM `t-osprey-337221.spotify_dataset.spotify4` 
  group by artist_all order by totol_songs desc

image

(5). Get the range of timeline of the data

   SELECT MIN(date) as begin, MAX(date) as time_end FROM `t-osprey-337221.spotify_dataset.spotify4` WHERE chart = 'top200';

image

(6). Total Number of songs by Taylor Swift has appeared in top 200

   SELECT count(distinct title) NoOfSongs FROM `t-osprey-337221.spotify_dataset.spotify4` WHERE artist_all LIKE '%Taylor Swift%';

image

(7). Counting number of times Taylor Swift appeared in the TOP 200 trend

 SELECT Count(*) N_InTop200 
 FROM `t-osprey-337221.spotify_dataset.spotify4` 
 WHERE artist_all 
 LIKE '%Taylor Swift%' 
 AND chart = 'top200';

image

(8). Listing Taylor's TOP 10 songs

  SELECT title, SUM(streams) streams 
  FROM `t-osprey-337221.spotify_dataset.spotify4` 
  WHERE artist_all LIKE '%Taylor Swift%' 
  AND streams IS NOT NULL 
  GROUP BY title 
  ORDER BY streams DESC 
  LIMIT 10;

image

(9). Seeing how many times each song has appeared in top 200

   SELECT title, count(title) AS count 
   FROM `t-osprey-337221.spotify_dataset.spotify4` 
   WHERE artist_all LIKE '%Taylor Swift%' 
   AND chart = 'top200' 
   GROUP BY title 
   ORDER BY count DESC;

image

(10). Highest, Lowest and the mean rank of the songs

     SELECT Title, MIN(rank) Highest, MAX(rank) Lowest, AVG(rank) Avg 
     FROM `t-osprey-337221.spotify_dataset.spotify4`
     WHERE artist_all like '%Taylor Swift%' 
     AND chart='top200' 
     GROUP BY title 
     ORDER BY Avg;

image

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published