<div style="text-align: center; background-color: #750E21; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  FINAL PROJECT: RESEARCHING ON MUSIC TASTE WORDWIDELY 📌
</div>

<div style="text-align: center; background-color: #0766AD; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Stage 01 - Data collecting 📌
</div>

## **PURPOSE** 🚀

✨ In this notebook, our primary task is to gather data from multiple sources with the aim of **producing a dataset** consisting of ```2500 rows```; representing the top songs of all time as updated on YouTube.
* Where do we collect the data ? ✨
    * 🍕 Kworb.net statistic - `https://kworb.net/youtube/topvideos.html`
    * 🍕 Kworb.net statistic of each songs - `https://kworb.net/youtube/video/{video_id}.html`
    * 🍕 Youtube of each songs - `https://www.youtube.com/watch?v={video_id}`
    * 🍕 Using APIv3 from Youtube
    * 🍕 Using API from Spotify
    

## **IMPORT LIBRARY** 🎄

In [1]:
import requests 
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup
from googleapiclient.discovery import build
import isodate
from datetime import datetime
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import spotipy
from spotipy.oauth2 import SpotifyOAuth
from youtube_title_parse import get_artist_title

<div style="text-align: left; font-family: 'Trebuchet MS', Arial, sans-serif; color: #FF90BC; padding: 20px; font-size: 38px; font-weight: bold; border-radius: 0 0 0 0">
  STEP 1: Get data of toplist music video on Youtube from Kworb.net statistic 🔥
</div>

🧮 In this part, We intend to produce two files named `kworb_video_url.txt` and `youtube_video_url` from the website link ```https://kworb.net/youtube/topvideos.html``` by parsing html.

1️⃣ The following cell, we go the the forementioned website to explore data on the following features:
* ```Ranking```
* ```Video Url```
* ```Title```
* ```Views```
* ```Yesterday Views```

In [2]:
soup = BeautifulSoup(requests.get("https://kworb.net/youtube/topvideos.html").content, "html.parser")

music_data = []
for rank,tr in enumerate(soup.find_all("tr")[1:]):
    tds = tr.find_all("td")
    
    music_data.append({
        'Ranking': rank + 1,
        'Video Url': tds[0].a['href'],
        'Title': tds[0].text,
        'Views': tds[1].text,
        'Yesterday Views': tds[2].text,
    })

music_data = pd.DataFrame(music_data).set_index('Ranking')
music_data

Unnamed: 0_level_0,Video Url,Title,Views,Yesterday Views
Ranking,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,video/kJQP7kiw5Fk.html,Luis Fonsi - Despacito ft. Daddy Yankee,8327715291,669951
2,video/JGwWNGJdvx8.html,Ed Sheeran - Shape of You (Official Music Video),6148215569,693362
3,video/RgKAFK5djSk.html,Wiz Khalifa - See You Again ft. Charlie Puth [...,6109043053,909804
4,video/OPf0YbXqDm0.html,Mark Ronson - Uptown Funk (Official Video) ft....,5098689027,703972
5,video/9bZkp7q19f0.html,PSY - GANGNAM STYLE(강남스타일) M/V,4975484655,987234
...,...,...,...,...
2496,video/Dcow-Jp3Ak4.html,Ke Personajes Ft Onda Sabanera | Pobre Corazón,325145639,763657
2497,video/HC172grgTwU.html,Same Time Same Jagah (Chaar Din) ● Sandeep Bra...,325054072,89076
2498,video/cAMHx-m9oh8.html,Kya Loge Tum | Akshay Kumar | Amyra Dastur | B...,324747138,324683
2499,video/Fd7lYEtevxQ.html,Xúc Xắc Xúc Xẻ - Bé Bảo An ft Phi Long,324729651,164758


2️⃣ Next, from those features extracted above, we use the ```Video Url``` to extract the key, in other words, ```video_id``` on the Youtube.\
   From the key, we will construct the link to the page of description for video achievements. The url has form like this: ```https://kworb.net/youtube/video/{video_id}.html```.\
   In addition, we also create a file of links that navigates to the mentioned songs on Youtube. The url has form like this:
```https://www.youtube.com/watch?v={video_id}```

In [3]:
music_video_id = []
for url in music_data['Video Url']:
    music_video_id.append(re.findall(r'video/(.*).html', url)[0])

def generate_video_url(video_id):
    url_arr = []
    for video in video_id:
        url_arr.append(f'https://www.youtube.com/watch?v={video}')
    return url_arr

def save_to_txt(url_arr, file_name):
    with open('../data/raw/' + file_name, 'w') as f:
        for url in url_arr:
            f.write(url + '\n')
    print('Save to txt file successfully!')

youtube_video_url = generate_video_url(music_video_id)
save_to_txt(youtube_video_url, 'youtube_video_url.txt')

#save a column of a dataframe to an array
kworb_video_url = music_data['Video Url'].to_numpy()
kworb_video_url = ['https://kworb.net/youtube/' + url for url in kworb_video_url]

save_to_txt(kworb_video_url, 'kworb_video_url.txt')

Save to txt file successfully!
Save to txt file successfully!


<div style="text-align: left; font-family: 'Trebuchet MS', Arial, sans-serif; color: #FF90BC; padding: 20px; font-size: 35px; font-weight: bold; border-radius: 0 0 0 0">
  STEP 2: Crawling data from youtube using api key 🔔
</div>

🔍 With crawling data from youtube using api ket, first we need to create an api key on Google Cloud Console. We have already done this.

In [4]:
api_key = 'AIzaSyDZDKAru1MToGkYvgVq_aSmArEX-qiedUU'

🔍 Since we need to have number of subscribers of each channel, we create a function using api key and channel id to crawl this information.

In [5]:
def get_channel_info_youtube(api_key, channel_id):
    youtube = build('youtube', 'v3', developerKey=api_key)

    try:
        response = youtube.channels().list(
            part='snippet, contentDetails, statistics',
            id=channel_id
        ).execute()

        channel_info = response['items'][0]

        # Extract relevant information
        channel_name = channel_info['snippet']['title']
        subscriber_count = channel_info['statistics']['subscriberCount']
        country = channel_info['snippet'].get("country", "")

        return {
            'channel_name': channel_name,
            'subscriber_count': subscriber_count,
            'country': country
        }

    except Exception as e:
        print(f'An error occurred: {e}')
        return None

- 🔍 Next, we are going to crawl some other informations on from youtube including: `view`, `like`, `duartion`, `channel name`, `subscriber`, `publish time`, `hashtag`. 
- 🔍 Since there are some videos that have been removed from youtube, we will check if the reponse `items` is empty or not, if it is empty we will assign all values to `NaN`.
- 🔍 Besides that, some videos don't allow to take `dislike` so if we don't get it, we will also assin it to `NaN`.

In [6]:
def get_video_info_youtube(api_key, video_id, view_list, like_list, duration_list, channel_name_list, subscriber_list, 
                   publish_time_list, hashtag_list, video_id_list, country_list):
    youtube = build('youtube', 'v3', developerKey=api_key)
    
    response = youtube.videos().list(
        part='snippet, contentDetails, statistics',
        id=video_id
    ).execute()

    if (not response['items']):
        video_id_list.append(video_id)
        view_list.append(np.nan)
        like_list.append(np.nan)
        duration_list.append(np.nan)
        channel_name_list.append(np.nan)
        subscriber_list.append(np.nan)
        country_list.append(np.nan)
        publish_time_list.append(np.nan)
        hashtag_list.append(np.nan)
    else:
        video_info = response['items'][0]

        # Extract relevant information
        views = video_info['statistics']['viewCount']
        
        try: 
            likes = video_info['statistics']['likeCount']
        except: 
            likes = np.nan
            
        duration_iso = video_info['contentDetails']['duration']
        channel_id = video_info['snippet']['channelId']

        # Get number of hashtags
        description = video_info['snippet']['description']
        hashtag_count = description.count('#')

        # Get published time
        published_at = video_info['snippet']['publishedAt']
        publish_time = datetime.strptime(published_at, '%Y-%m-%dT%H:%M:%SZ')

        # Convert ISO duration to human-readable format
        duration_human = isodate.parse_duration(duration_iso)

        # Extract channel name and subscribers
        channel_data = get_channel_info_youtube(api_key, channel_id)
        channel_name = channel_data['channel_name']
        subscribers = channel_data['subscriber_count']
        country = channel_data['country']

        view_list.append(views)
        like_list.append(likes)
        duration_list.append(str(duration_human))
        channel_name_list.append(channel_name)
        subscriber_list.append(subscribers)
        country_list.append(country)
        publish_time_list.append(publish_time)
        hashtag_list.append(hashtag_count)
        video_id_list.append(video_id)

In [7]:
def collect_data_youtube(music_video_id, api_key):
    # Init empty list to store the values of each attribute.
    view_list = []
    like_list = []
    duration_list = []
    channel_name_list = []
    subscriber_list = []
    country_list = []
    publish_time_list = []
    hashtag_list = []
    video_id_list = []
    
    threads = []
    for video_id in music_video_id:
        # Checking whether video_id is blank or not
        if (video_id == ''): 
            continue
        
        # Create thread
        while (threading.active_count() > 20):
            time.sleep(0.1)
        
        thread = threading.Thread(target=get_video_info_youtube, args=(api_key, video_id, view_list, like_list, duration_list, 
                                                               channel_name_list, subscriber_list, publish_time_list, 
                                                               hashtag_list, video_id_list, country_list))
        threads.append(thread)
        thread.start()
        
    for thread in threads:
        thread.join()
        
    data = pd.DataFrame({'Id': video_id_list,
                         'View': view_list,
                         'Like': like_list,
                         'Duration': duration_list,
                         'Channel_name': channel_name_list,
                         'Subscriber': subscriber_list,
                         'Country': country_list,
                         'Publish_time': publish_time_list,
                         'Hashtag': hashtag_list})
    
    return data

In [8]:
youtube_df = collect_data_youtube(music_video_id, api_key)
youtube_df

Unnamed: 0,Id,View,Like,Duration,Channel_name,Subscriber,Country,Publish_time,Hashtag
0,fRh_vgS2dFE,3741939384,16307930,0:03:26,JustinBieberVEVO,31500000,US,2015-10-22 20:00:02,5.0
1,RgKAFK5djSk,6109566451,42265384,0:03:58,Wiz Khalifa Music,15200,,2015-04-07 03:00:03,1.0
2,09R8_2nJtjg,3973123169,15907865,0:05:02,Maroon5VEVO,14400000,US,2015-01-14 15:00:11,0.0
3,k85mRPqvMbE,4191082035,17245404,0:02:52,Crazy Frog,14600000,DE,2009-06-17 04:30:53,3.0
4,CevxZvSJLk8,3919098026,16596905,0:04:30,KatyPerryVEVO,24600000,US,2013-09-05 20:00:22,0.0
...,...,...,...,...,...,...,...,...,...
2495,elueA2rofoo,328190893,1998244,0:04:01,BritneySpearsVEVO,4880000,,2009-10-25 07:10:44,5.0
2496,98WtmW-lfeE,328232666,1650718,0:03:50,KatyPerryVEVO,24600000,US,2010-08-10 19:32:36,3.0
2497,dGzpsBSJZow,328042675,2190840,0:04:56,CanserberoVEVO,210000,,2013-03-02 02:16:42,0.0
2498,DGIgXP9SvB8,327962856,2156263,0:04:53,williamVEVO,3720000,,2013-04-19 07:00:16,1.0


<div style="text-align: left; font-family: 'Trebuchet MS', Arial, sans-serif; color: #FF90BC; padding: 20px; font-size: 35px; font-weight: bold; border-radius: 0 0 0 0">
  STEP 3: Crawling data by parsing HTML with url save in <font color=lightgreen>kworb_video_url.txt</font> 🔔
</div>

🎯 From the extracted urls saved in *`k_worb_video_url.txt`*, now I use each link in the txt file to get access to the page and parsing html and extract the infomation in it.\
🎯 The information extracted included:
- `Title`
- `Most view per day`
- `Most view date`
- `Highest rank`
- `Time to highest rank`
- `Charted- duration`
- `ID`


In [9]:
def get_video_info_web(url, Title, Most_view_in4_1, Most_view_in4_2, Rank_in4_1, Rank_in4_2, Rank_in4_3, Video_id):
    html_content = requests.get(url)#.text
    soup = BeautifulSoup(html_content.content, "lxml")

    video_title = soup.title.text.split(" – ")[0].split("YouTube Stats of ")[1]
    Title.append(video_title)

    most_view_in4 = soup.text.split("Most views in a day: ")[1].split("\n")[0].split(" ")
    most_view_in4[0] = int(most_view_in4[0].replace(',', ''))
    most_view_in4[1] = most_view_in4[1][1:-1]
    Most_view_in4_1.append(most_view_in4[0])
    Most_view_in4_2.append(most_view_in4[1])

    rank_in4 = soup.text.split("Peaked at #")
    if(len(rank_in4) == 1): 
        rank_in4 = [np.nan, np.nan, np.nan]
    elif(len(rank_in4[1].split("\n")[0].split(" ")) <= 6):
        rank_in4 = [rank_in4[1].split("\n")[0].split(" ")[0], 'nan', rank_in4[1].split("\n")[0].split(" ")[4]]
    else: 
        rank_in4 = [rank_in4[1].split("\n")[0].split(" ")[0], rank_in4[1].split("\n")[0].split(" ")[2], rank_in4[1].split("\n")[0].split(" ")[7]]

    rank_in4[0] = float(rank_in4[0])
    rank_in4[1] = float(rank_in4[1])
    rank_in4[2] = float(rank_in4[2])

    Rank_in4_1.append(rank_in4[0])
    Rank_in4_2.append(rank_in4[1])
    Rank_in4_3.append(rank_in4[2])

    video_id = url[32:-5]
    Video_id.append(video_id)

In [10]:
def collect_data_web(course_urls_file):
    #load paths from file
    url_file = open(course_urls_file)
    urls = url_file.readlines()
    urls_filtered = [item[:-1] for item in urls]
    
    #init empty list to store the values of each attribute.
    Title = []
    Most_view_in4_1 = []
    Most_view_in4_2 = []
    Rank_in4_1 = []
    Rank_in4_2 = []
    Rank_in4_3 = []
    Video_id = []

    num_threads = 4

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # Submit tasks to the thread pool
        futures = [executor.submit(get_video_info_web, url, Title, Most_view_in4_1, 
                                   Most_view_in4_2, Rank_in4_1, Rank_in4_2, Rank_in4_3, 
                                   Video_id) for url in urls_filtered]

        # Wait for all tasks to complete
        for future in futures:
            future.result()
    
    data = pd.DataFrame({"Title": Title,
                         "Most view per day": Most_view_in4_1,
                         "Most-view-date": Most_view_in4_2, 
                         "Highest rank": Rank_in4_1,
                         "Time to Highest rank": Rank_in4_2,
                         "Charted-duration": Rank_in4_3, 
                         "Id": Video_id})

    return data

In [11]:
kworb_df = collect_data_web('../data/raw/kworb_video_url.txt')
kworb_df

Unnamed: 0,Title,Most view per day,Most-view-date,Highest rank,Time to Highest rank,Charted-duration,Id
0,Ed Sheeran - Shape of You (Official Music Video),14390704,2017/05/13,1.0,4.0,356.0,JGwWNGJdvx8
1,Wiz Khalifa - See You Again ft. Charlie Puth [...,8818084,2015/05/23,1.0,17.0,449.0,RgKAFK5djSk
2,Luis Fonsi - Despacito ft. Daddy Yankee,25794523,2017/08/05,1.0,35.0,359.0,kJQP7kiw5Fk
3,Mark Ronson - Uptown Funk (Official Video) ft....,6365428,2015/03/21,1.0,7.0,462.0,OPf0YbXqDm0
4,El Chombo - Dame Tu Cosita feat. Cutty Ranks (...,14419156,2018/04/22,1.0,5.0,296.0,FzG4uDgje3M
...,...,...,...,...,...,...,...
2495,Ke Personajes Ft Onda Sabanera | Pobre Corazón,3904383,2023/03/14,11.0,,38.0,Dcow-Jp3Ak4
2496,Same Time Same Jagah (Chaar Din) ● Sandeep Bra...,216961,2021/02/15,,,,HC172grgTwU
2497,Kya Loge Tum | Akshay Kumar | Amyra Dastur | B...,15118533,2023/05/19,1.0,1.0,17.0,cAMHx-m9oh8
2498,Xúc Xắc Xúc Xẻ - Bé Bảo An ft Phi Long,842423,2020/05/01,,,,Fd7lYEtevxQ


<div style="text-align: left; font-family: 'Trebuchet MS', Arial, sans-serif; color: #FF90BC; padding: 20px; font-size: 35px; font-weight: bold; border-radius: 0 0 0 0">
  STEP 4: Save 2 created dataframes to CSV files and Merge them 🔔
</div>

✅ After step 3 and step 4, we finally get the result of pretty features numbers.\
✅ Because, we apply multithreading using library `Threading` in the process of extracting data; therefore the origin order of data is not as the same. The solution we come up with is on the 2 dataframe we share the same feature known as `Id` so that when we could merge 2 dataframe on this column. 

In [12]:
kworb_df.to_csv('../data/raw/raw_kworb_data.csv', index=False)

In [13]:
youtube_df.to_csv('../data/raw/raw_youtube_data.csv', index=False)

In [14]:
youtube_kworb_df = pd.merge(youtube_df, kworb_df, on='Id')
youtube_kworb_df

Unnamed: 0,Id,View,Like,Duration,Channel_name,Subscriber,Country,Publish_time,Hashtag,Title,Most view per day,Most-view-date,Highest rank,Time to Highest rank,Charted-duration
0,fRh_vgS2dFE,3741939384,16307930,0:03:26,JustinBieberVEVO,31500000,US,2015-10-22 20:00:02,5.0,Justin Bieber - Sorry (PURPOSE : The Movement),11051088,2015/12/31,1.0,15.0,115.0
1,RgKAFK5djSk,6109566451,42265384,0:03:58,Wiz Khalifa Music,15200,,2015-04-07 03:00:03,1.0,Wiz Khalifa - See You Again ft. Charlie Puth [...,8818084,2015/05/23,1.0,17.0,449.0
2,09R8_2nJtjg,3973123169,15907865,0:05:02,Maroon5VEVO,14400000,US,2015-01-14 15:00:11,0.0,Maroon 5 - Sugar (Official Music Video),10684581,2015/01/16,1.0,1.0,304.0
3,k85mRPqvMbE,4191082035,17245404,0:02:52,Crazy Frog,14600000,DE,2009-06-17 04:30:53,3.0,Crazy Frog - Axel F (Official Video),4476499,2017/12/31,6.0,,354.0
4,CevxZvSJLk8,3919098026,16596905,0:04:30,KatyPerryVEVO,24600000,US,2013-09-05 20:00:22,0.0,Katy Perry - Roar,11294380,2013/09/07,2.0,5.0,445.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,elueA2rofoo,328190893,1998244,0:04:01,BritneySpearsVEVO,4880000,,2009-10-25 07:10:44,5.0,Britney Spears - Gimme More (Official HD Video),334800,2021/04/23,,,
2496,98WtmW-lfeE,328232666,1650718,0:03:50,KatyPerryVEVO,24600000,US,2010-08-10 19:32:36,3.0,Katy Perry - Teenage Dream (Official Music Video),477346,2010/09/25,15.0,,14.0
2497,dGzpsBSJZow,328042675,2190840,0:04:56,CanserberoVEVO,210000,,2013-03-02 02:16:42,0.0,Canserbero - Maquiavélico (Video Oficial),264932,2023/04/28,,,
2498,DGIgXP9SvB8,327962856,2156263,0:04:53,williamVEVO,3720000,,2013-04-19 07:00:16,1.0,will.i.am - #thatPOWER ft. Justin Bieber,1757495,2013/04/21,5.0,,27.0


In [15]:
youtube_kworb_df.to_csv('../data/raw_youtube_kworb_data.csv', index=False)

<div style="text-align: left; font-family: 'Trebuchet MS', Arial, sans-serif; color: #FF90BC; padding: 20px; font-size: 35px; font-weight: bold; border-radius: 0 0 0 0">
  STEP 5: CRAWL DATA WITH SPOTIFY API 🔔
</div>

✅ In this part, we will extend the features by extracting more information using `SPOTIFY API`.\
✅ We base on the titles of top songs extract from `Youtube` (in 2 previous steps), and we could use the module (we found on Github) named `youtube_parse_title` to get the saparate information on `song name` and `artist name`.\
✅ Some features being provided in this part are as follows:
- `Song Name`
- `Artist Name`
- `Popularity Score (Spotify)`
- `Release date (Spotify)`
- `Genres`

1️⃣ This is necessarity key to API, generated by `Spotify for Developer`. We will use library `spotipy` to access to information provided the API.

In [2]:
SPOTIPY_CLIENT_ID = 'abe1e613b24942b0aabf08c51e29fd92'
SPOTIPY_CLIENT_SECRET = '4a05a2e6b18942e2aa4cef9260d89a1c'
SPOTIPY_REDIRECT_URI = 'http://localhost:8888/callback'

In [3]:
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=SPOTIPY_CLIENT_ID,
                                                  client_secret=SPOTIPY_CLIENT_SECRET,
                                                  redirect_uri=SPOTIPY_REDIRECT_URI,
                                                  scope="user-library-read"))


2️⃣ We will work on the merge dataframe above.\
`raw_youtube_kworb_data.csv`

In [4]:
data_kworb = pd.read_csv('../data/raw/raw_youtube_kworb_data.csv')
data_kworb

Unnamed: 0,Id,View,Like,Duration,Channel_name,Subscriber,Country,Publish_time,Hasgtag,Title,Most view per day,Most-view-date,Highest rank,Time to Highest rank,Charted-duration
0,9bZkp7q19f0,4.977100e+09,27830052.0,0:04:13,officialpsy,18400000.0,,2012-07-15 07:46:32,4.0,PSY - GANGNAM STYLE(강남스타일) M/V,14924298,2012/12/21,1.0,36.0,482.0
1,hT_nvWreIhg,3.927344e+09,17491157.0,0:04:44,OneRepublicVEVO,5470000.0,,2013-05-31 07:00:36,2.0,OneRepublic - Counting Stars,3288973,2018/11/10,4.0,,482.0
2,JGwWNGJdvx8,6.149237e+09,32323841.0,0:04:24,Ed Sheeran,53900000.0,,2017-01-30 10:57:50,3.0,Ed Sheeran - Shape of You (Official Music Video),14390704,2017/05/13,1.0,4.0,356.0
3,lp-EO5I60KA,3.700048e+09,14912926.0,0:04:57,Ed Sheeran,53900000.0,,2014-10-07 13:57:37,3.0,Ed Sheeran - Thinking Out Loud (Official Music...,3771622,2015/02/14,3.0,1.0,304.0
4,CevxZvSJLk8,3.918655e+09,16596026.0,0:04:30,KatyPerryVEVO,24600000.0,US,2013-09-05 20:00:22,0.0,Katy Perry - Roar,11294380,2013/09/07,2.0,5.0,445.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,HC172grgTwU,3.250905e+08,1693226.0,0:05:17,Lokdhun Punjabi,13300000.0,IN,2016-01-18 03:30:00,8.0,Same Time Same Jagah (Chaar Din) ● Sandeep Bra...,216961,2021/02/15,,,
2496,Fd7lYEtevxQ,3.247416e+08,867653.0,0:03:07,Ruby Bảo An,1640000.0,,2011-01-31 13:52:25,8.0,Xúc Xắc Xúc Xẻ - Bé Bảo An ft Phi Long,842423,2020/05/01,,,
2497,Fp8msa5uYsc,3.244410e+08,3355081.0,0:03:33,JustinBieberVEVO,31500000.0,US,2021-10-08 04:00:10,3.0,Justin Bieber - Ghost,3261913,2021/10/08,14.0,,10.0
2498,6EGg0_l-edc,3.245723e+08,1617971.0,0:03:11,Henrique e Juliano,15800000.0,BR,2021-09-03 15:00:14,0.0,Henrique e Juliano - A MAIOR SAUDADE - DVD Ma...,2562984,2022/07/11,56.0,,29.0


3️⃣ 
- This function is written to return the artist and title separately by using module `get_artist_title` in `youtube_parse_html`.
- The aim of this is enhancing the accuracy when searching for song using Spotify API. Due to the fact that an artist could release songs with so many versions from remix, accoustic to slow, reverb; the return results would not meet our demand as the score of popularity is quite humble in comparison with the origin of song.
 

In [5]:
def get_artist_title_youtube(youtube_title):
    try:
        artist, title = get_artist_title(youtube_title)
        return artist, title
    except:
        title = youtube_title
        artist = ""
        return artist, title

4️⃣ This function is written to get track of the desired song. It will return in the form of a `json` file and we will use extract information later with the returned results.

In [17]:
def get_track(artist, title, youtube_title):
    try:
        if (artist == ""):
            search_str = title
        else:   
            search_str = f'artist:{artist} track:{title}'
        result = sp.search(search_str)
        track = result['tracks']['items'][0]
        return track
    except:
        try:
            result = sp.search(youtube_title)
            track = result['tracks']['items'][0]
            return track
        except:
            return np.nan

5️⃣ 
- Before calling those above functions, we will transform the column `Title` of dataframe to an array.
- Calling the function `get_artist_title_youtube` to separate artist and title name
- We still use multithread for better performance, however, we would use the library `ThreadPoolExecutor` instead so that even running in multiple threads, we still have approach to keep the orgin order of array of items.

In [18]:
YouTubeVideo_titles = data_kworb['Title'].to_numpy()
artists = []
titles = []

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(get_artist_title_youtube, youtubetitle) for youtubetitle in YouTubeVideo_titles]
    for future in futures:
        artist, title = future.result()
        artists.append(artist)
        titles.append(title)

6️⃣ Now, we continue to extract in formation by calling `get_track` function above.

In [19]:
tracks = []

with ThreadPoolExecutor(max_workers=15) as executor:
    futures = [executor.submit(get_track, artist, title, youtube_title) for artist, title, youtube_title in zip(artists, titles, YouTubeVideo_titles)]
    for future in futures:
        track = future.result()
        tracks.append(track)
        time.sleep(0.1)


📗 **L E T**\
⚔ Check the length of ```tracks``` .\
⚔ Check if there is lines with ```NaN``` values.

In [20]:
#Check how many tracks are found
print(len(tracks))
#Check how many tracks are nan values
print(sum(pd.isna(tracks)))


2500
0


✅ Wonderful! We could explore all data songs without a missing data.\
✅ Move to the next step, we extract information from the tracks.

In [21]:
#get some track information like popularity, song name, artist name, release date, genre
track_info = []
for track in tracks:
    if isinstance(track, dict):
        track_info.append((track['name'], track['artists'][0]['name'], track['album']['release_date'],track['popularity']))
    else:
        track_info.append((np.nan, np.nan, np.nan, np.nan))

track_info = pd.DataFrame(track_info, columns=['Song Name', 'Artist Name', 'Release Date (Spotify)', 'Popularity Score (Spotify)'])
track_info

Unnamed: 0,Song Name,Artist Name,Release Date (Spotify),Popularity Score (Spotify)
0,Gangnam Style (강남스타일),PSY,2012-01-01,75
1,Counting Stars,OneRepublic,2013-01-01,88
2,Shape of You,Ed Sheeran,2017-03-03,88
3,Thinking out Loud,Ed Sheeran,2014-06-21,85
4,Roar,Katy Perry,2013-10-18,74
...,...,...,...,...
2495,Same Time Tomorrow,Brandon Davis,2023-09-01,23
2496,Xúc Xắc Xúc Xẻ,Bé Bảo An,2020-08-08,11
2497,Ghost,Justin Bieber,2021-03-19,89
2498,A Maior Saudade / Como É Que A Gente Fica - Ao...,Henrique & Juliano,2023-12-01,52


📌 Save track_info to csv so that if we meet trouble with the following code we won't have to call for Spotify API again.\
❗️ The running time is long and the request number to API is limited.

In [22]:
track_info.to_csv('../data/raw/track_info.csv', index=False)

7️⃣
- Why don't we get genres in the previous step, together with `poparily score` or `release date`?
- The reason is that the returned result of searching for song with spotify API doesn't include the genre of the song there.
- And the genre we extract below is not belong to a particular song as well, because spotify API don't support. Insteads, we would use the genre from the artist - owner of the song instead. In some cases, genre of an artist is not kept the same, however, we believe that the number of those artist is not that much. 
- So let's start get the genres. !

In [23]:
def get_genre(track):
    if isinstance(track, dict):
        result = sp.artist(track['artists'][0]['external_urls']['spotify'])
        genre = result['genres']
        return genre
    else:
        return np.nan

genres = []
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(get_genre, track): track for track in tracks}
    for future in as_completed(futures):
        genre = future.result()
        genres.append(genre)


📑 Extend the track_info dataframe with feature `Genre`

In [24]:
track_info['Genre'] = genres
track_info

Unnamed: 0,Song Name,Artist Name,Release Date (Spotify),Popularity Score (Spotify),Genre
0,Gangnam Style (강남스타일),PSY,2012-01-01,75,"[k-rap, korean old school hip hop]"
1,Counting Stars,OneRepublic,2013-01-01,88,[panamanian pop]
2,Shape of You,Ed Sheeran,2017-03-03,88,"[colombian pop, dance pop, latin pop, pop]"
3,Thinking out Loud,Ed Sheeran,2014-06-21,85,"[latin pop, puerto rican pop]"
4,Roar,Katy Perry,2013-10-18,74,"[canadian pop, pop]"
...,...,...,...,...,...
2495,Same Time Tomorrow,Brandon Davis,2023-09-01,23,"[desi pop, filmi, punjabi pop]"
2496,Xúc Xắc Xúc Xẻ,Bé Bảo An,2020-08-08,11,[modern country pop]
2497,Ghost,Justin Bieber,2021-03-19,89,"[agronejo, arrocha, sertanejo, sertanejo unive..."
2498,A Maior Saudade / Como É Que A Gente Fica - Ao...,Henrique & Juliano,2023-12-01,52,"[canadian pop, pop]"


In [26]:
track_info.to_csv('../data/raw/raw_spotify_data.csv', index=False)

<div style="text-align: left; font-family: 'Trebuchet MS', Arial, sans-serif; color: #FF90BC; padding: 20px; font-size: 35px; font-weight: bold; border-radius: 0 0 0 0">
  STEP 6: MERGE DATA 🔔
</div>

⚔ We finally gather all data.\
⚔ We will concat the dataframe produced in step 2 and 3 with the dataframe extracted with Spotify dataframe.\
⚔ `final_raw_data.csv` would be the main data sample that our team use on later phrase.

In [27]:
raw_youtube_kworb = pd.read_csv('../data/raw/raw_youtube_kworb_data.csv')
raw_spotify = pd.read_csv('../data/raw/raw_spotify_data.csv')

In [30]:
final_raw_data = pd.concat([raw_youtube_kworb, raw_spotify], axis=1)
final_raw_data

Unnamed: 0,Id,View,Like,Duration,Channel_name,Subscriber,Country,Publish_time,Hasgtag,Title,Most view per day,Most-view-date,Highest rank,Time to Highest rank,Charted-duration,Song Name,Artist Name,Release Date (Spotify),Popularity Score (Spotify),Genre
0,9bZkp7q19f0,4.977100e+09,27830052.0,0:04:13,officialpsy,18400000.0,,2012-07-15 07:46:32,4.0,PSY - GANGNAM STYLE(강남스타일) M/V,14924298,2012/12/21,1.0,36.0,482.0,Gangnam Style (강남스타일),PSY,2012-01-01,75,"['k-rap', 'korean old school hip hop']"
1,hT_nvWreIhg,3.927344e+09,17491157.0,0:04:44,OneRepublicVEVO,5470000.0,,2013-05-31 07:00:36,2.0,OneRepublic - Counting Stars,3288973,2018/11/10,4.0,,482.0,Counting Stars,OneRepublic,2013-01-01,88,['panamanian pop']
2,JGwWNGJdvx8,6.149237e+09,32323841.0,0:04:24,Ed Sheeran,53900000.0,,2017-01-30 10:57:50,3.0,Ed Sheeran - Shape of You (Official Music Video),14390704,2017/05/13,1.0,4.0,356.0,Shape of You,Ed Sheeran,2017-03-03,88,"['colombian pop', 'dance pop', 'latin pop', 'p..."
3,lp-EO5I60KA,3.700048e+09,14912926.0,0:04:57,Ed Sheeran,53900000.0,,2014-10-07 13:57:37,3.0,Ed Sheeran - Thinking Out Loud (Official Music...,3771622,2015/02/14,3.0,1.0,304.0,Thinking out Loud,Ed Sheeran,2014-06-21,85,"['latin pop', 'puerto rican pop']"
4,CevxZvSJLk8,3.918655e+09,16596026.0,0:04:30,KatyPerryVEVO,24600000.0,US,2013-09-05 20:00:22,0.0,Katy Perry - Roar,11294380,2013/09/07,2.0,5.0,445.0,Roar,Katy Perry,2013-10-18,74,"['canadian pop', 'pop']"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,HC172grgTwU,3.250905e+08,1693226.0,0:05:17,Lokdhun Punjabi,13300000.0,IN,2016-01-18 03:30:00,8.0,Same Time Same Jagah (Chaar Din) ● Sandeep Bra...,216961,2021/02/15,,,,Same Time Tomorrow,Brandon Davis,2023-09-01,23,"['desi pop', 'filmi', 'punjabi pop']"
2496,Fd7lYEtevxQ,3.247416e+08,867653.0,0:03:07,Ruby Bảo An,1640000.0,,2011-01-31 13:52:25,8.0,Xúc Xắc Xúc Xẻ - Bé Bảo An ft Phi Long,842423,2020/05/01,,,,Xúc Xắc Xúc Xẻ,Bé Bảo An,2020-08-08,11,['modern country pop']
2497,Fp8msa5uYsc,3.244410e+08,3355081.0,0:03:33,JustinBieberVEVO,31500000.0,US,2021-10-08 04:00:10,3.0,Justin Bieber - Ghost,3261913,2021/10/08,14.0,,10.0,Ghost,Justin Bieber,2021-03-19,89,"['agronejo', 'arrocha', 'sertanejo', 'sertanej..."
2498,6EGg0_l-edc,3.245723e+08,1617971.0,0:03:11,Henrique e Juliano,15800000.0,BR,2021-09-03 15:00:14,0.0,Henrique e Juliano - A MAIOR SAUDADE - DVD Ma...,2562984,2022/07/11,56.0,,29.0,A Maior Saudade / Como É Que A Gente Fica - Ao...,Henrique & Juliano,2023-12-01,52,"['canadian pop', 'pop']"


In [31]:
final_raw_data.to_csv('../data/raw/final_raw_data.csv', index=False)