In this project I analyze data from Spotify, datasets is a big csv file a 3.4G csv file, download from kaggle API: "datasets download -d dhruvildave/spotify-charts", I analyze the data in Google Cloud BigQuery using SQL. I load data from Google Cloud Storage to BigQuery. BigQuery is a Data Warehouse on Google Cloud Platform, which is powerful to handle big data and do data analysis.
https://www.kaggle.com/dhruvildave/spotify-charts
This is a big csv data, total 3.4 G, I use ELT pipeline, to do data analyze.
- Download data to google cloud storage
- From google cloud storage directly load data to Bigquery
- Data transform directly in Bigquery by SQL
- Data analyze in Bigquery by SQL
In linux:
sudo apt-get install python3-pip
pip3 install kaggle
mkdir ~/.kaggle
copy kaggle.json key file from kaggle, and save it to ~/.kaggle
run command line:
kaggle datasets download -d dhruvildave/spotify-charts
It start to download After finished, unzip it and upload to google cloud storage bucket.
Create a dataset in Bigquery, and create a table directly from google cloud storage
(1). clean data,we need to tidy the data as the artist column contains multiple values in same cell. We will separate each observation into its individual row
Create or replace table spotify_dataset.spotify2
SELECT * except (artist), split(artist,',') as artist FROM `t-osprey-337221.spotify_dataset.spotify`;
You will see artist mode change to repeated, that is an array, we save the result to table spotify2
(2). We also found ' -' in artist array field, I should remove this ' -' in artist array field
SELECT artist FROM `t-osprey-337221.spotify_dataset.spotify2` where ' -' in unnest(artist) LIMIT 1000
We run SQL to create table spotify3
create or replace table spotify_dataset.spotify3 as
SELECT * except (artist), array(select * from unnest(artist) as x where x !=" -") as artist FROM `t-osprey-337221.spotify_dataset.spotify2`;
We also flatten the artist field and create table spotify4
(1). List of unique artists on “Top 200”
with templetable as (select title, date, artist_all FROM `t-osprey-337221.spotify_dataset.spotify3` as p,
unnest(p.artist) as artist_all where chart='top200')
select count(distinct artist_all) from templetable
(2). Artist which appeared on the charts maximum number of times
We first flatten the artist field and create table spotify4
Create table spotify_dataset.spotify4 as
(select p.* except(artist), artist_all FROM `t-osprey-337221.spotify_dataset.spotify3` as p,
unnest(p.artist) as artist_all)
And then:
SELECT artist_all, count(*) as num FROM `t-osprey-337221.spotify_dataset.spotify4`
group by artist_all order by num desc
(3). Finding artists with highest number of streams
SELECT artist_all, sum(streams) as total_streams FROM `t-osprey-337221.spotify_dataset.spotify4`
group by artist_all order by total_streams desc
(4). Artist with maximum number of songs to feature on chart
SELECT artist_all, count(distinct title) as totol_songs FROM `t-osprey-337221.spotify_dataset.spotify4`
group by artist_all order by totol_songs desc
(5). Get the range of timeline of the data
SELECT MIN(date) as begin, MAX(date) as time_end FROM `t-osprey-337221.spotify_dataset.spotify4` WHERE chart = 'top200';
(6). Total Number of songs by Taylor Swift has appeared in top 200
SELECT count(distinct title) NoOfSongs FROM `t-osprey-337221.spotify_dataset.spotify4` WHERE artist_all LIKE '%Taylor Swift%';
(7). Counting number of times Taylor Swift appeared in the TOP 200 trend
SELECT Count(*) N_InTop200
FROM `t-osprey-337221.spotify_dataset.spotify4`
WHERE artist_all
LIKE '%Taylor Swift%'
AND chart = 'top200';
(8). Listing Taylor's TOP 10 songs
SELECT title, SUM(streams) streams
FROM `t-osprey-337221.spotify_dataset.spotify4`
WHERE artist_all LIKE '%Taylor Swift%'
AND streams IS NOT NULL
GROUP BY title
ORDER BY streams DESC
LIMIT 10;
(9). Seeing how many times each song has appeared in top 200
SELECT title, count(title) AS count
FROM `t-osprey-337221.spotify_dataset.spotify4`
WHERE artist_all LIKE '%Taylor Swift%'
AND chart = 'top200'
GROUP BY title
ORDER BY count DESC;
(10). Highest, Lowest and the mean rank of the songs
SELECT Title, MIN(rank) Highest, MAX(rank) Lowest, AVG(rank) Avg
FROM `t-osprey-337221.spotify_dataset.spotify4`
WHERE artist_all like '%Taylor Swift%'
AND chart='top200'
GROUP BY title
ORDER BY Avg;