# 🎼 NB01: Data Collection

In this notebook, I mainly test my access tokens and retrieve data from several Spotify API and Genius API endpoints, inspecting their outputs to get a sense of how I can use them.

In [None]:
# Importing the necessary libraries
from dotenv import load_dotenv
from functions import *
from bs4 import BeautifulSoup
from pprint import pprint
from auth import *
import pandas as pd
import json
import lyricsgenius

In [2]:
# Defining access_token through calling the get_token() function
access_token = get_token()

In [None]:
# Testing the search_artist() function and printing the data to check that the token and the API call work
artist_data = search_artist("Taylor Swift", access_token)

{'artists': {'href': 'https://api.spotify.com/v1/search?offset=0&limit=1&query=Taylor%20Swift&type=artist', 'limit': 1, 'next': 'https://api.spotify.com/v1/search?offset=1&limit=1&query=Taylor%20Swift&type=artist', 'offset': 0, 'previous': None, 'total': 814, 'items': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/06HL4z0CvFAxyc27GXpf02'}, 'followers': {'href': None, 'total': 126975036}, 'genres': ['pop'], 'href': 'https://api.spotify.com/v1/artists/06HL4z0CvFAxyc27GXpf02', 'id': '06HL4z0CvFAxyc27GXpf02', 'images': [{'url': 'https://i.scdn.co/image/ab6761610000e5ebe672b5f553298dcdccb0e676', 'height': 640, 'width': 640}, {'url': 'https://i.scdn.co/image/ab67616100005174e672b5f553298dcdccb0e676', 'height': 320, 'width': 320}, {'url': 'https://i.scdn.co/image/ab6761610000f178e672b5f553298dcdccb0e676', 'height': 160, 'width': 160}], 'name': 'Taylor Swift', 'popularity': 100, 'type': 'artist', 'uri': 'spotify:artist:06HL4z0CvFAxyc27GXpf02'}]}}
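
The response above is deeply nested, so it can help to pull out only the fields of interest. The following is a minimal sketch, assuming the response keeps the shape shown in the output above; `summarise_artist` is a hypothetical helper, not part of `functions.py`, and the sample payload is trimmed from that output.

```python
# Sample payload trimmed from the search output above (URLs and images omitted)
artist_data = {
    "artists": {
        "total": 814,
        "items": [
            {
                "followers": {"href": None, "total": 126975036},
                "genres": ["pop"],
                "id": "06HL4z0CvFAxyc27GXpf02",
                "name": "Taylor Swift",
                "popularity": 100,
            }
        ],
    }
}


def summarise_artist(search_result):
    """Return a flat dict for the first artist hit, or None if there are no hits.

    Hypothetical helper: assumes the nested layout printed above.
    """
    items = search_result.get("artists", {}).get("items", [])
    if not items:
        return None
    top = items[0]
    return {
        "id": top["id"],
        "name": top["name"],
        "genres": top.get("genres", []),
        "followers": top.get("followers", {}).get("total"),
        "popularity": top.get("popularity"),
    }


summary = summarise_artist(artist_data)
print(summary)
```

Returning `None` for an empty result keeps a failed search from raising an `IndexError` further down the notebook.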


#### Working on the Question - Top Hits Playlist

Here, I gather data from the user-created Spotify playlist "Pop Hits 2000s-2024".

In [None]:
# Getting playlist items from the Spotify API
# Doing this in 3 steps, as the API only allows 100 items per call
top_hits = get_playlist_items("6mtYuOxzl58vSGnEDtZ9uB", "items(track.artists.name, track.name, track.id)", "GB", 100, 0, access_token)
top_hits2 = get_playlist_items("6mtYuOxzl58vSGnEDtZ9uB", "items(track.artists.name, track.name, track.id)", "GB", 100, 100, access_token)
top_hits3 = get_playlist_items("6mtYuOxzl58vSGnEDtZ9uB", "items(track.artists.name, track.name, track.id)", "GB", 100, 200, access_token)
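
Because each call returns at most 100 items, the three calls above could also be written as a loop that keeps requesting pages until one comes back short. This is only a sketch under assumptions: `fetch_all_items` and `fake_fetch` are hypothetical names, and the fake playlist stands in for the real API so the example runs offline.

```python
def fetch_all_items(fetch_page, page_size=100):
    """Collect items page by page until a page returns fewer than page_size items.

    fetch_page is any callable taking (limit, offset) and returning a dict
    with an "items" list, which is the shape the playlist calls above return.
    """
    all_items = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        items = page.get("items", [])
        all_items.extend(items)
        if len(items) < page_size:
            break  # a short (or empty) page means the playlist is exhausted
        offset += page_size
    return all_items


# Stand-in for the API: a made-up playlist of 250 tracks
fake_playlist = [{"track": {"id": f"track{i}"}} for i in range(250)]


def fake_fetch(limit, offset):
    """Hypothetical replacement for the real request, slicing the fake playlist."""
    return {"items": fake_playlist[offset:offset + limit]}


items = fetch_all_items(fake_fetch)
print(len(items))
```

With the notebook's own function, the callable could be something like `lambda limit, offset: get_playlist_items(playlist_id, fields, "GB", limit, offset, access_token)`, assuming the argument order used in the calls above.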

In [5]:
with open("../data/raw/top_hits.json", "w", encoding="utf-8") as f:
    json.dump(top_hits, f, ensure_ascii=False, indent=4)

with open("../data/raw/top_hits2.json", "w", encoding="utf-8") as f:
    json.dump(top_hits2, f, ensure_ascii=False, indent=4)

with open("../data/raw/top_hits3.json", "w", encoding="utf-8") as f:
    json.dump(top_hits3, f, ensure_ascii=False, indent=4)

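# Reading the three pages back in with the same encoding they were written with
with open("../data/raw/top_hits.json", "r", encoding="utf-8") as f1, open("../data/raw/top_hits2.json", "r", encoding="utf-8") as f2, open("../data/raw/top_hits3.json", "r", encoding="utf-8") as f3: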
    json1 = json.load(f1)
    json2 = json.load(f2)
    json3 = json.load(f3)

combined_items = json1["items"] + json2["items"] + json3["items"]

combined_json = {"items": combined_items}

with open("../data/raw/combined_top_hits.json", "w") as output_file:
    json.dump(combined_json, output_file, indent=2)

#### Working on the Question - Girly Pop Music Playlist

I am moving on to the user-created Spotify playlist "girly pop songs". I chose this playlist for its popularity (39,000 saves) and for its diverse, popular, and more current tracks and artists.

For example, despite Sabrina Carpenter's current popularity, she didn't feature much on the Pop Hits playlist, possibly due to user-creation bias. My initial goal was to use Spotify-created playlists to avoid this. However, due to the new API restrictions, this wasn't possible, so I aimed to diversify my dataset by using two different playlists instead.

In [None]:
# Getting playlist items from the Spotify API for "girly pop songs"
# Once again doing this in 3 steps as the API only allows 100 items per call
women_in_pop_data = get_playlist_items("06fIJ0Q8SkYruBcJX2M6C8", "items(track.artists.name, track.name, track.id)", "GB", 100, 0, access_token)
women_in_pop_data2 = get_playlist_items("06fIJ0Q8SkYruBcJX2M6C8", "items(track.artists.name, track.name, track.id)", "GB", 100, 100, access_token)
women_in_pop_data3 = get_playlist_items("06fIJ0Q8SkYruBcJX2M6C8", "items(track.artists.name, track.name, track.id)", "GB", 100, 200, access_token)

In [7]:
with open("../data/raw/women_pop.json", "w", encoding="utf-8") as f:
    json.dump(women_in_pop_data, f, ensure_ascii=False, indent=4)

with open("../data/raw/women_pop2.json", "w", encoding="utf-8") as f:
    json.dump(women_in_pop_data2, f, ensure_ascii=False, indent=4)

with open("../data/raw/women_pop3.json", "w", encoding="utf-8") as f:
    json.dump(women_in_pop_data3, f, ensure_ascii=False, indent=4)

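# Reading the three pages back in with the same encoding they were written with
with open("../data/raw/women_pop.json", "r", encoding="utf-8") as f1, open("../data/raw/women_pop2.json", "r", encoding="utf-8") as f2, open("../data/raw/women_pop3.json", "r", encoding="utf-8") as f3: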
    json1 = json.load(f1)
    json2 = json.load(f2)
    json3 = json.load(f3)

combined_items = json1["items"] + json2["items"] + json3["items"]

combined_json = {"items": combined_items}

with open("../data/raw/combined_women_pop.json", "w") as output_file:
    json.dump(combined_json, output_file, indent=2)
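
Each entry in the combined `items` list is still nested (`track` → `artists` → `name`). As a rough sketch of how the combined file could be flattened into one row per track, assuming the shape implied by the `fields` filter used above: `flatten_items` is a hypothetical helper, the track IDs in the sample are placeholders, and "Example Feature" is a made-up artist included only to show multiple artists on one track.

```python
def flatten_items(items):
    """Turn nested playlist items into flat rows (one dict per track).

    Assumes each item looks like
    {"track": {"artists": [{"name": ...}], "name": ..., "id": ...}}.
    """
    rows = []
    for item in items:
        track = item.get("track")
        if not track:
            continue  # skip entries whose track is missing or null
        rows.append({
            "track_id": track.get("id"),
            "track_name": track.get("name"),
            "artists": ", ".join(a["name"] for a in track.get("artists", [])),
        })
    return rows


# Made-up sample in the assumed shape; IDs are placeholders, not real Spotify IDs
sample = {"items": [
    {"track": {"artists": [{"name": "Sabrina Carpenter"}],
               "id": "id_1", "name": "Please Please Please"}},
    {"track": None},
    {"track": {"artists": [{"name": "Charli XCX"}, {"name": "Example Feature"}],
               "id": "id_2", "name": "Von dutch"}},
]}

rows = flatten_items(sample["items"])
for row in rows:
    print(row)
```

A list of flat dicts like this can be passed straight to `pd.DataFrame(rows)` for later processing.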

### Testing the endpoints, Spotify API functions, and Genius API functions

I have now collected the raw data, which I will process and supplement with lyrics data in NB02. Here, I test some endpoints that I plan to use in the next steps of my analysis.

In [8]:
# Testing the search_song() function/output, alongside the Genius API access token
search_song("Please Please Please", access_token2)

'https://genius.com/Sabrina-carpenter-please-please-please-lyrics'
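
For context, a lookup like this typically boils down to reading the first hit from a Genius search response. The sketch below is not the actual `search_song()` implementation; it assumes the usual `response` → `hits` → `result` layout of the Genius search endpoint, and both `first_hit_url` and the sample payload are hypothetical.

```python
def first_hit_url(search_payload):
    """Return the URL of the first search hit, or None if there are no hits.

    Hypothetical helper: assumes the payload nests results under
    response -> hits -> result.
    """
    hits = search_payload.get("response", {}).get("hits", [])
    if not hits:
        return None
    return hits[0].get("result", {}).get("url")


# Hand-written sample in the assumed layout, reusing the URL from the output above
sample_payload = {
    "response": {
        "hits": [
            {
                "type": "song",
                "result": {
                    "title": "Please Please Please",
                    "url": "https://genius.com/Sabrina-carpenter-please-please-please-lyrics",
                },
            }
        ]
    }
}

url = first_hit_url(sample_payload)
print(url)
```

Taking the first hit is a simplification: for common titles, comparing the hit's artist against the expected artist would be a safer choice.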

In [9]:
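# Creating a lyricsgenius client that strips section headers (e.g. [Chorus]) and skips non-song results
genius = lyricsgenius.Genius(access_token2, remove_section_headers=True, skip_non_songs=True)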

In [10]:
artist = "Taylor Swift"
nb_songs = 1
language = "english"

In [16]:
artist_genius = genius.search_artist(artist, max_songs=nb_songs, sort='popularity')

songs_data = [(song.title, song.lyrics) for song in artist_genius.songs if song is not None]

songs_df = pd.DataFrame(songs_data, columns=['Title', 'Lyrics'])

Searching for songs by Taylor Swift...

Song 1: "All Too Well (10 Minute Version) (Taylor’s Version) [From The Vault]"

Reached user-specified song limit (1).
Done. Found 1 songs.


In [17]:
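# Building a tidy table from songs_df; the artist name is broadcast to every row
data = pd.DataFrame({'artist': artist, 'title': songs_df['Title'], 'lyrics': songs_df['Lyrics']})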

In [None]:
# Calling the get_song_lyrics() function to get the lyrics of a song
# Using the Genius API endpoint "get songs" to get Genius URLs, webscraping Genius to get the lyrics to songs
lyrics = get_song_lyrics("Von dutch", "Charli XCX")

Searching for "Von dutch" by Charli XCX...
Done.
103 ContributorsTranslationsPortuguêsDeutschBahasa IndonesiaУкраїнськаRomânăРусскийTürkçePolskiEspañolItalianoHebrewFrançaisΕλληνικάVon dutch Lyrics[Verse]
It's okay to just admit that you're jealous of me
Yeah, I heard you talk about me, that's the word on the street
You're obsessing, just confess it, put your hands up
It's obvious I'm your number one

[Pre-Chorus]
It's alright to just admit that I'm the fantasy
You're obsessing, just confess it 'cause it's obvious
I'm your number one, I'm your number one
I'm your number one, yeah

[Chorus]
I'm just living that life
Von Dutch, cult classic, but I still pop
I get money, you get mad because the bank's shut
Yeah, I know your little secret, put your hands up
It's so obvious I'm your number one, life
Von Dutch, cult classic in your eardrums
Why you lying? You won't fuck unless he famous
Do that littlе dance, without it, you'd be namelеss
It's so obvious I'm your number one

[Post-Chorus]
I'm
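
The scraped text above still carries page residue (the contributor and translation header before "Lyrics") and section headers such as `[Verse]`. One possible way to clean this up before analysis is sketched below; `clean_lyrics` is a hypothetical helper, and it assumes the header always sits on the first line and ends with the word "Lyrics", as in the output above.

```python
import re


def clean_lyrics(raw):
    """Strip the page header and section headers from scraped lyrics text.

    Hypothetical helper: assumes the first line ends its header with "Lyrics".
    """
    # Drop everything up to and including the first "Lyrics" on the first line
    first_line, sep, rest = raw.partition("\n")
    if "Lyrics" in first_line:
        first_line = first_line.split("Lyrics", 1)[1]
    text = first_line + sep + rest
    # Remove section headers such as [Verse] or [Pre-Chorus]
    text = re.sub(r"\[[^\]\n]*\]", "", text)
    # Collapse runs of blank lines into a single blank line, then trim the ends
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()


# Shortened sample taken from the output above
raw = (
    "103 ContributorsTranslationsDeutschVon dutch Lyrics[Verse]\n"
    "It's okay to just admit that you're jealous of me\n"
    "\n"
    "[Pre-Chorus]\n"
    "It's alright to just admit that I'm the fantasy\n"
)

cleaned = clean_lyrics(raw)
print(cleaned)
```

A song whose title itself contains the word "Lyrics" would be cut too early by this approach, so the result is worth spot-checking on a few tracks.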

Click [here](https://github.com/lse-ds105/w10-summative-deyavuz/tree/main?tab=readme-ov-file#table-of-contents) to navigate back to the Table of Contents!