# Data Collection Notebook

#### The imports below are as follows:
- `spotipy` and `spotipy.oauth2` are used to connect directly to the Spotify API. `spotipy.oauth2` requests access tokens automatically, so they do not have to be created by hand. Spotify access tokens expire after 3600 seconds (1 hour), so automatic handling avoids repeatedly running a separate authorization step just to get fresh credentials for the API.
- `pandas` is used to build the data frame, write it to CSV, and read existing data back in.
- `time` is used to pause between retries when the API connection is rate limited.
- `os` is used to check whether the output path exists, so that the loop adds to the existing CSV file instead of overwriting it.
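
To show what automatic token handling saves you from, here is a minimal sketch of the refresh-on-expiry idea. It is an illustration only, not spotipy's actual implementation: `ExpiringToken`, `fetch_token`, and the injectable `clock` are hypothetical names introduced for this example.

```python
import time


class ExpiringToken:
    """Caches a token and fetches a new one once the old one has expired."""

    def __init__(self, fetch_token, lifetime_seconds=3600, clock=time.monotonic):
        self._fetch_token = fetch_token    # hypothetical callable that returns a token string
        self._lifetime = lifetime_seconds  # Spotify access tokens last 3600 seconds
        self._clock = clock                # injectable so the logic can be tested without waiting
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return the cached token, fetching a fresh one if none exists or it has expired."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch_token()
            self._expires_at = now + self._lifetime
        return self._token
```

With this pattern the caller only ever asks for `get()`, and the hourly expiry is handled behind the scenes, which is the convenience the credentials manager provides in this notebook.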

In [1]:
import spotipy
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
import time
import os

In [53]:
# Set up credentials
client_id = 'this can be obtained from the spotify for developers site linked at the bottom of this notebook'
client_secret = 'this can be obtained from the spotify for developers site linked at the bottom of this notebook'
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# Define a list of playlists (spotipy accepts playlist URLs, URIs, or IDs)
playlist_uris = ['https://open.spotify.com/playlist/37i9dQZF1DZ06evO3GI3JK?si=285b301156544eb1']

# Create an empty list to store track information
tracks_data = []

# Load the existing CSV (same path the script saves to) so new rows are added to it;
# otherwise start with an empty DataFrame
if os.path.isfile('../data/playlist_tracks2.csv'):
    df = pd.read_csv('../data/playlist_tracks2.csv')
else:
    df = pd.DataFrame(columns=['Track', 'Artist', 'Genre', 'Album Title', 'Album Type', 'Release Date',
                               'Thumbnail', 'Acousticness', 'Danceability', 'Energy', 'Instrumentalness',
                               'Popularity', 'Speechiness', 'Track ID'])

# Iterate over playlist URIs
for playlist_uri in playlist_uris:
    retries = 3  # Number of retries
    while retries > 0:
        try:
            # Get playlist tracks
            results = sp.playlist_tracks(playlist_uri)

            # Iterate over tracks
            for item in results['items']:
                track = item['track']

                # Get audio features for the track
                audio_features = sp.audio_features(track['id'])[0]

                # Get the genre(s) of the track
                genres = sp.artist(track['artists'][0]['id'])['genres']
                genre = ', '.join(genres) if genres else ''

                # Create a dictionary to store track information
                track_info = {
                    'Track': track['name'],
                    'Artist': track['artists'][0]['name'],
                    'Genre': genre,
                    'Album Title': track['album']['name'],
                    'Album Type': track['album']['album_type'],
                    'Release Date': track['album']['release_date'],
                    'Thumbnail': track['album']['images'][2]['url'],
                    'Acousticness': audio_features['acousticness'],
                    'Danceability': audio_features['danceability'],
                    'Energy': audio_features['energy'],
                    'Instrumentalness': audio_features['instrumentalness'],
                    'Popularity': track['popularity'],
                    'Speechiness': audio_features['speechiness'],
                    'Track ID': track['id']
                }

                # Append track information to the DataFrame
                df = pd.concat([df, pd.DataFrame(track_info, index=[0])], ignore_index=True)

            # Break out of the retry loop if successful
            break

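        except spotipy.exceptions.SpotifyException as e:
            if e.http_status != 429:
                # Not a rate-limit error: re-raise instead of looping forever
                raise
            # Rate limited: wait as long as Spotify asks (10 seconds by default)
            retry_after = int((e.headers or {}).get('Retry-After', 10))
            print(f"Rate limit exceeded. Retrying after {retry_after} seconds.")
            time.sleep(retry_after)
            retries -= 1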

# Save the DataFrame to the CSV file
df.to_csv('../data/playlist_tracks2.csv', index=False)

print("Track information saved to playlist_tracks2.csv.")

Track information saved to playlist_tracks2.csv.
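
The Spotify API returns playlist tracks one page at a time, so a single `sp.playlist_tracks` call only covers the first page of a long playlist. Below is a minimal sketch of how the remaining pages could be collected. `collect_all_items` is a hypothetical helper, not part of spotipy; it assumes each page is a dictionary with `items` and `next` keys, as in Spotify's paging objects.

```python
def collect_all_items(first_page, fetch_next):
    """Gather 'items' from a paged response and from every page after it."""
    items = []
    page = first_page
    while page is not None:
        items.extend(page['items'])
        # A page whose 'next' field is empty is the last one
        page = fetch_next(page) if page.get('next') else None
    return items


# Intended use with the client from the cell above (not run here):
# all_items = collect_all_items(sp.playlist_tracks(playlist_uri), sp.next)
```

Passing the page-fetching function in as an argument keeps the helper independent of the network, so it can be tried out on plain dictionaries first.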


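The retry logic can also be pulled into a small helper that wraps a single API call. This is a sketch under assumptions rather than the approach used in the cell above: `call_with_retries` is a hypothetical name, and it expects the raised exception to carry `http_status` and `headers` attributes the way `spotipy.exceptions.SpotifyException` does. `retries` should be at least 1.

```python
import time


def call_with_retries(func, retries=3, default_wait=10, sleep=time.sleep):
    """Call func(); on a rate-limit error wait and try again, up to `retries` attempts."""
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:
            # Only rate-limit errors (HTTP 429) are retried; anything else is re-raised,
            # as is a rate-limit error on the final attempt
            if getattr(exc, 'http_status', None) != 429 or attempt == retries - 1:
                raise
            headers = getattr(exc, 'headers', None) or {}
            sleep(int(headers.get('Retry-After', default_wait)))


# Intended use with the client from the cell above (not run here):
# results = call_with_retries(lambda: sp.playlist_tracks(playlist_uri))
```

Wrapping only the API call means a retry repeats just that request, so tracks that were already stored are not processed a second time.
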
---
## Sources
- This [link](https://www.linkedin.com/pulse/extracting-your-fav-playlist-info-spotifys-api-samantha-jones/?utm_source=share&utm_medium=member_ios&utm_campaign=share_via) leads to Samantha Jones's Spotify playlist project. I used her process as a starting point for data collection. She mainly focused on creating a playlist using the API, whereas I am extracting data from public playlists.
- [Spotify Get Recommendations Documentation](https://developer.spotify.com/documentation/web-api/reference/get-recommendations)
- [Spotify for Developers](https://developer.spotify.com/) provides access to the Web API and its additional features. It is also where the client ID and client secret used above are obtained.

#### SIDENOTE
- This [link](https://gist.github.com/andytlr/4104c667a62d8145aa3a) contains additional information on genres, which makes it easier to format them so that they match the Spotify records.
    - For example, I had issues using 'Post-Punk' and 'Hip-Hop' at first, but I learned that genre names need to be lowercase with no special characters in order to make calls.
    - Spotify also has a call limit of 100 at most, but that can cause issues, so lowering it to 50 worked best for me. This may differ by user depending on how many genres you are using at a time and on how busy the server is.
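
The genre formatting described above could be automated with a small helper. This is a minimal sketch, assuming that "no special characters" means replacing punctuation such as hyphens with spaces; `normalize_genre` is a hypothetical name, and the result should be checked against the genre list linked above before it is used in calls.

```python
import re


def normalize_genre(name):
    """Lowercase a genre name and replace runs of punctuation with a single space."""
    cleaned = re.sub(r'[^a-z0-9]+', ' ', name.lower())
    return cleaned.strip()


print(normalize_genre('Post-Punk'))  # post punk
print(normalize_genre('Hip-Hop'))    # hip hop
```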