# Analysis of the Spotify Popular East Asian Dataset

#### About Dataset

Special thanks to CRXXOM, who provided this dataset on Kaggle and gave me the chance to practice EDA. I decided to focus on the top artists dataset to look for insights.
This project was completed by Achmed Azri on 20230906.

This dataset consists of two parts, the top artists and the top tracks, collected from 7 different 'query genres' of the Spotify API: chinese, japanese, korean, k-pop, j-pop, j-idol and j-dance.


#### Note:

1. Datasets related to top artists are sorted by number of followers on Spotify
2. Datasets related to top tracks are sorted by the popularity metric on Spotify
3. Data is extracted using the Spotify API; feel free to use the datasets as long as you comply with their terms of use
4. Some tracks/artists may be duplicated because they appear in two or more query genres
5. Inspirations for using the datasets include an EDA on the top tracks and artists, finding sleeper hits/trending music, or just exploring some new tastes and music!


#### Features in top artist datasets

1. artist_name: name of the artist
2. popularity: a popularity metric calculated by Spotify, ranging from 0-100 with 100 being the most popular
3. followers: number of followers of the artist on Spotify
4. artist_link: external link to artist page on Spotify
5. genres: list of genres that the artist is involved in
6. top_track: top track of the artist based on Spotify API
7. top_track_album: the album that the top track is in
8. top_track_popularity: the popularity of top track; a popularity metric calculated by Spotify, ranging from 0-100 with 100 being the most popular
9. top_track_release_date: release date of the top track
10. top_track_duration_ms: the duration of the top track in milliseconds
11. top_track_explicit: boolean value that indicates whether the top track is explicit or not
12. top_track_link: external link to the top track on Spotify
13. top_track_album_link: external link to the album of the top track on Spotify
14. query_genre: query_genre of the artist



#### Problem Statement;

>* Which columns are unnecessary for our analysis?
>* How many unique artist names are in the top 50 songs?
>* How many explicit tracks are listed in the top 50 songs?
>* Which genre is the most common in the top 50 songs?
>* Relationship between song release date and top track popularity.
>* Relationship between song duration and top track popularity.
>* Relationship between artist followers and top track popularity.
>* Relationship between artist popularity and top track popularity.

I appreciate any comments and feedback, thanks!

In [None]:
# import libraries for EDA & Data Viz

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# reading spotify dataset

URL = '/kaggle/input/spotify-popular-east-asian-artists-and-tracks/east_asia_top_artists.csv'

top_artist = pd.read_csv(URL)
print('successfully read the dataset!')

# Step 1: Data Understanding

> * Understanding the shape of the dataset
> * Understanding the first 10 rows of the dataset
> * Understanding the last 10 rows of the dataset
> * Understanding the summary statistics of the dataset
> * Understanding the data type of each column
> * Understanding each column


In [None]:
top_artist.shape 

In [None]:
top_artist.head(10)

In [None]:
top_artist.tail(10)

In [None]:
top_artist.describe()

In [None]:
top_artist.columns

In [None]:
top_artist.dtypes
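
Note that `describe()` only summarises numeric columns by default, so text columns such as artist_name are left out of the summary above. A minimal sketch on a hypothetical three-row frame (the values are made up for illustration and are not taken from the dataset) shows how `include='all'` brings the text columns in as well:

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration only
sample = pd.DataFrame({
    'artist_name': ['Artist A', 'Artist B', 'Artist B'],
    'popularity': [80, 65, 65],
    'followers': [1000, 500, 500],
})

# Default describe() keeps only the numeric columns
numeric_summary = sample.describe()

# include='all' adds count / unique / top / freq for the text columns
all_summary = sample.describe(include='all')

print(numeric_summary)
print(all_summary)
```

On the real data, the same call would be `top_artist.describe(include='all')`.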

# Step 2: Data Pre-processing

* Pre-processing missing values;
    * Check whether there are any missing values using the isnull and sum methods.
    * Inspect the rows that contain NaN values.
    * Cross-check the missing values against the top_tracks dataset.
    * Handle the missing values using dropna().
    
    
* Pre-processing unnecessary columns;
    * Drop unnecessary columns for our analysis.
    
    
* Pre-processing data by checking duplicated values;
    * Checking duplicated values using .duplicated()
    
    
* Pre-processing data by correcting the datatype;
    * Convert top_track_release_date data type from object to datetime.

In [None]:
top_artist.isnull().sum()

In [None]:
top_artist[top_artist.isna().any(axis=1)]

Based on the above findings, the missing values in the dataset come from 4 different artists who share the same genre (j-idol):

> 1. Yui Aragaki
> 2. French Kiss
> 3. Rino Sashiara
> 4. Eriko Tamura

Since we also have the top_tracks dataset, I will check whether it contains relevant information that can be retrieved to fill in the missing values. See the code and markdown below for details.

In [None]:
# reading top_tracks dataset to see if there are relevant data that can be used to replace missing values in top_artist dataset

URL = '/kaggle/input/spotify-popular-east-asian-artists-and-tracks/east_asia_top_tracks.csv'
top_track = pd.read_csv(URL)

top_track

In [None]:
# filter dataframe with multiple condition
# all artist name used are from missing value in top_artist dataset

filter_list = ['Yui Aragaki', 'French Kiss', 'Rino Sashiara', 'Eriko Tamura']
top_track[top_track.artist_name.isin(filter_list)]


## Dealing with missing data problem: A conclusion and approach.

>* After checking the above result, it was found that only 1 of the 4 artists (Yui Aragaki) is listed in the top_tracks dataset.
>* However, none of that data can fill the missing values: averaging is not meaningful here, and some of the numerical fields are Spotify metrics that we cannot simply assume.
>* In conclusion, the rows with missing values will be dropped from our dataframe.
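
Before dropping rows it can help to confirm how many will be removed. The following is a minimal sketch on made-up rows (the artists and values are hypothetical), assuming that a missing top_track also marks the rows where the other top-track fields are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 'Artist C' has no top track information
sample = pd.DataFrame({
    'artist_name': ['Artist A', 'Artist B', 'Artist C'],
    'top_track': ['Song 1', 'Song 2', np.nan],
    'top_track_popularity': [70.0, 55.0, np.nan],
})

rows_before = len(sample)

# Drop only the rows whose top_track is missing; returns a new DataFrame
cleaned = sample.dropna(subset=['top_track'])

rows_dropped = rows_before - len(cleaned)
print(f'dropped {rows_dropped} of {rows_before} rows')

# Re-check that no missing values remain in any column
print(cleaned.isnull().sum())
```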

In [None]:
# Drop missing values from dataframe

top_artist.dropna(subset=["top_track"], axis=0, inplace=True)

print('successfully dropped all missing values!')

In [None]:
# rechecking the dataset

top_artist.isnull().sum()

In [None]:
# dropping unnecessary columns for our analysis

columns_to_drop = ['Unnamed: 0', 'artist_link', 'top_track_album',
                   'top_track_album_link', 'top_track_link', 'query_genre']
top_artist.drop(columns=columns_to_drop, inplace=True)

In [None]:
top_artist

In [None]:
# checking duplicated values in spotify top artist dataset

top_artist[top_artist.duplicated()]

#### Findings from duplicated values; 

* Based on the dataframe above, there are 89 duplicated records.
* Since they would affect the analysis, all duplicated records will be dropped.
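
One plausible reason for these duplicates (an assumption, based on note 4 about artists appearing under several query genres) is that the rows only differed in query_genre, which has already been dropped. A minimal sketch on made-up rows illustrates the effect:

```python
import pandas as pd

# Hypothetical rows: the same artist returned by two query genres
sample = pd.DataFrame({
    'artist_name': ['Artist A', 'Artist A', 'Artist B'],
    'followers': [1000, 1000, 500],
    'query_genre': ['k-pop', 'korean', 'j-pop'],
})

# With query_genre present the two 'Artist A' rows differ, so nothing is flagged
dupes_with_query_genre = int(sample.duplicated().sum())

# Without query_genre the rows are identical and the second one is flagged
dupes_without_query_genre = int(sample.drop(columns='query_genre').duplicated().sum())

print(dupes_with_query_genre, dupes_without_query_genre)
```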

In [None]:
# dropping all 89 duplicated records.

top_artist.drop_duplicates(inplace=True)
top_artist.duplicated().sum()

In [None]:
# change data types of top_track_release_date from object to datetime

top_artist['top_track_release_date'] = pd.to_datetime(top_artist['top_track_release_date'])


In [None]:
top_artist.dtypes
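
Spotify release dates can be given at year, month or day precision; whether shorter dates occur in this dataset is not checked here. A minimal sketch, using a hypothetical helper `pad_release_date` and made-up values, pads shorter dates before conversion so that a single explicit format applies:

```python
import pandas as pd


def pad_release_date(value: str) -> str:
    """Pad 'YYYY' or 'YYYY-MM' to 'YYYY-MM-DD' (hypothetical helper)."""
    parts = value.split('-')
    # A missing month or day defaults to '01'
    parts += ['01'] * (3 - len(parts))
    return '-'.join(parts)


# Made-up examples of the three precisions
raw = pd.Series(['2023-07-07', '2019-05', '2011'])

release_dates = pd.to_datetime(raw.map(pad_release_date), format='%Y-%m-%d')
print(release_dates)
```

Padding to the first day of the month or year is a design choice for plotting only; it should not be read as the true release day.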

# Step 3: Exploratory Data Analysis (EDA)

#### Limit the dataset to the top 50 rows by top track popularity (descending)

#### Problem Statement;

>* Which columns are unnecessary for our analysis?
>* How many unique artist names are in the top 50 songs?
>* How many explicit tracks are listed in the top 50 songs?
>* Which genre is the most common in the top 50 songs?
>* Relationship between song release date and top track popularity.
>* Relationship between song duration and top track popularity.
>* Relationship between artist followers and top track popularity.
>* Relationship between artist popularity and top track popularity.

In [None]:
# take an independent copy so later edits to top_50 do not touch top_artist
top_50 = top_artist.sort_values(by='top_track_popularity', ascending=False).head(50).copy()
top_50

In [None]:
# checking unique artist_name

top_50['artist_name'].nunique()

Now let's take a look at the genres and their counts in a chart.

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(y=top_50['genres']).set(title='Genres Count')
plt.show()

It is obvious that k-pop dominates the genres column. However, some entries carry a multi-genre classification (such as ['k-pop','k-pop girl group']), so a single row has two labels; take Super Shy by NewJeans as an example. Because of this, I decided to check how many songs fall under one main genre (k-pop, j-pop, etc.) and to group the data by that single genre.
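
If the genres column stores each list as a string such as "['k-pop', 'k-pop girl group']" (an assumption based on the example above), the individual tags can also be counted separately. A minimal sketch on made-up rows:

```python
import ast

import pandas as pd

# Hypothetical rows; genres are stored as string representations of lists
sample = pd.DataFrame({
    'artist_name': ['Artist A', 'Artist B', 'Artist C'],
    'genres': ["['k-pop', 'k-pop girl group']", "['k-pop']", "['j-pop', 'anime']"],
})

# Turn each string into a real Python list
genre_lists = sample['genres'].apply(ast.literal_eval)

# One entry per tag, then count how often each tag appears
tag_counts = genre_lists.explode().value_counts()
print(tag_counts)
```

This counts every tag once per row, so a row tagged with both 'k-pop' and 'k-pop girl group' contributes to both counts.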

In [None]:
# retrieve all songs with 'k-pop' as its genre

top_50[top_50['genres'].str.contains('k-pop')]

In [None]:
# checking how many k-pop songs listed in top 50

top_50['genres'].str.contains('k-pop').sum()

In [None]:
# retrieve songs with other than k-pop as its genre

top_50[~top_50['genres'].str.contains('k-pop')]

In [None]:
# checking how many j-pop songs listed in top 50 since it seems the second largest genres after k-pop

top_50['genres'].str.contains('j-pop').sum()

In [None]:
# checking the rest of genres

(~top_50['genres'].str.contains('j-pop|k-pop')).sum()

Findings;

1. There are 29 k-pop songs.
2. There are 12 j-pop songs.
3. There are 9 songs outside the 2 genres mentioned above.

In [None]:
# assign a single genre label to each k-pop and j-pop song

top_50.loc[top_50['genres'].str.contains('k-pop', case=False), 'genres'] = 'k-pop' 
top_50.loc[top_50['genres'].str.contains('j-pop', case=False), 'genres'] = 'j-pop'
top_50

In [None]:
# plot the new chart after the data transformation

plt.figure(figsize=(8,6))
sns.countplot(data=top_50,y='genres').set(title='Genres Count')
plt.show()

After assigning the new values to the genres column and redrawing the bar chart, we can see that the top genre in the dataset is k-pop.
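
As an alternative to overwriting the genres column, the grouping could be written into a separate column in one step. This is a sketch on made-up rows, not the approach used above; the column name main_genre and the 'other' bucket are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with genre lists stored as strings
sample = pd.DataFrame({
    'genres': ["['k-pop', 'k-pop girl group']", "['j-pop', 'anime']", "['mandopop']"],
})

# Conditions are checked in order; the first match wins
conditions = [
    sample['genres'].str.contains('k-pop', case=False),
    sample['genres'].str.contains('j-pop', case=False),
]

# Rows matching neither condition fall into the 'other' bucket
sample['main_genre'] = np.select(conditions, ['k-pop', 'j-pop'], default='other')
print(sample)
```

Keeping the original genres column makes it possible to revisit the grouping later without reloading the data.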

In [None]:
# Relationship between song release date vs top track popularity

plt.figure(figsize=(12,8))
plt.ticklabel_format(style = 'plain')
sns.scatterplot(data=top_50, x='top_track_release_date', y='top_track_popularity', hue='genres', size='followers').set(title='Song Release Date vs Top Track Popularity')
plt.show()

In [None]:
# Relationship between artist popularity vs top track popularity

plt.figure(figsize=(10,8))
sns.scatterplot(data=top_50, x='popularity', y='top_track_popularity', hue='genres', size='followers').set(title='Artist Popularity vs Top Track Popularity')
plt.show()

In [None]:
# Relationship between artist followers vs top track popularity

plt.figure(figsize=(10,8))
plt.ticklabel_format(style='plain')
sns.scatterplot(data=top_50, x='followers', y='top_track_popularity', hue='genres').set(title='Artist Followers vs Top Track Popularity')
plt.show()
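
The problem statement also asks about song duration, which has no plot above. A minimal sketch on made-up rows converts top_track_duration_ms to minutes so the axis is easier to read; the column name duration_min is an assumption:

```python
import pandas as pd

# Hypothetical rows; the values are invented for illustration only
sample = pd.DataFrame({
    'top_track_duration_ms': [180000, 210000, 240000, 195000],
    'top_track_popularity': [90, 85, 88, 80],
})

# Milliseconds to minutes for a more readable axis
sample['duration_min'] = sample['top_track_duration_ms'] / 60_000

print(sample[['duration_min', 'top_track_popularity']])
```

On the real data, the new column could then be passed to `sns.scatterplot` with `x='duration_min'` and `y='top_track_popularity'`, in the same way as the other scatter plots.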

In [None]:
# checking how many explicit songs in top 50 dataset

sns.countplot(data=top_50, x='top_track_explicit').set(title='Count of Explicit Track')
plt.show()
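
The chart shows the split visually; the exact numbers could be read off as below. This is a minimal sketch on made-up values, assuming top_track_explicit holds True/False:

```python
import pandas as pd

# Hypothetical flags; the real column would be top_50['top_track_explicit']
sample = pd.DataFrame({'top_track_explicit': [False, False, False, True]})

# Absolute counts per value
explicit_counts = sample['top_track_explicit'].value_counts()

# The mean of a boolean column is the share of True values
explicit_share = sample['top_track_explicit'].mean()

print(explicit_counts)
print(f'share of explicit tracks: {explicit_share:.0%}')
```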

### Findings and Insight;

* There are 50 unique artist names in the top 50 dataset.
* Out of 50, more than 40 songs are not explicit. This suggests that a song does not need to be explicit to be popular with listeners.
* Spotify uses multiple genres for some entries (for example, k-pop and k-pop girl group). From what I could find online, the extra label only adds detail. However, the classification is not standardized across the dataset: some k-pop girl groups are labelled only as 'k-pop' instead of 'k-pop, k-pop girl group' in the genres column. Hence, I decided to collapse them into a single genre (where applicable).
* No correlation between song release date and top track popularity.
* No correlation between song duration and top track popularity.
* No correlation between artist followers and top track popularity.
* Positive correlation between artist popularity and top track popularity: the higher the artist's popularity, the higher their top track's popularity score.
* The top genre in the top 50 dataset is K-pop, followed by J-pop.
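
The correlation statements above are read from the scatter plots. A coefficient could back them up; the following is a minimal sketch on made-up numbers, so the printed values say nothing about the real dataset:

```python
import pandas as pd

# Hypothetical rows; the values are invented for illustration only
sample = pd.DataFrame({
    'popularity': [60, 70, 80, 90],
    'followers': [500, 4000, 1500, 3000],
    'top_track_popularity': [62, 71, 79, 91],
})

# Pearson correlation of each column with top_track_popularity
pearson = sample.corr(method='pearson')['top_track_popularity']

# Spearman is rank based, so a few very large follower counts weigh less
spearman = sample.corr(method='spearman')['top_track_popularity']

print(pearson)
print(spearman)
```

Because the top 50 rows were selected by top track popularity, that column covers only a narrow range, so coefficients computed on top_50 should be read with care.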