
###Vinit kataria. "Individual Submission."



## Assignment: A MongoDB JSON Document Database using Spotify API

Batch Processing using MongoDB and the Spotify API

Batch data processing involves collecting, storing, and processing data in large groups or "batches" rather than handling data in real-time as it comes in. This approach allows for more efficient handling of data, especially when dealing with large volumes of information that don’t need to be processed immediately. In this assignment, you will work with both MongoDB, a NoSQL database, and the Spotify API to implement batch processing techniques that involve collecting and analyzing music-related data.
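The idea of grouping records into batches can be sketched in a few lines. This is only an illustration: the helper name `chunked`, the batch size, and the sample records are assumptions and are not part of the assignment.

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def chunked(records: List[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `records`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Hypothetical records standing in for data collected from an API
records = [{"id": n} for n in range(7)]

# Each batch would be processed or stored as one unit
batch_sizes = [len(batch) for batch in chunked(records, 3)]
print(batch_sizes)
```

Seven records with a batch size of three give two full batches and one partial batch.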

These are the steps you will need to follow:
*   [1 - Install dependencies](#1)
*   [2 - Create an Atlas Client on MongoDB](#2)
*   [3 - Create a Spotify APP](#3)
*   [4 - Connect to your app using Spotify's SDK](#4)
*   [5 - Search for Artists by Genre in Spotify API](#5)
*   [6 - Explore your MongoDB collection](#6)
*   [7 - Get all albums from the featured Artists](#7)
*   [8 - Create New MongoDB collection](#8)
*   [9 - Explore your data!](#9)
*   [10 - Create an interactive map using Folium!](#10)

**IMPORTANT!!!!**
## During the course of this assignment, you will encounter the word `None` in several places. Each time you see `None`, replace it with the appropriate variable, method, string, or value for that specific code snippet, unless the `None` is used as a return value to indicate the absence of a result. In that case, `None` is intentionally returned to signify that a result could not be obtained.

<a name='1'></a>
#1 - Install spotify sdk and pymongo in your Google Colab env

In [None]:
!pip install spotipy pymongo --upgrade

Collecting spotipy
  Downloading spotipy-2.25.1-py3-none-any.whl.metadata (5.1 kB)
Collecting pymongo
  Downloading pymongo-4.11.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting redis>=3.5.3 (from spotipy)
  Downloading redis-5.2.1-py3-none-any.whl.metadata (9.1 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading spotipy-2.25.1-py3-none-any.whl (31 kB)
Downloading pymongo-4.11.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Downloading redis-5.2.1-py3-none-any.whl (261 kB)

<a name='2'></a>
#2 - Create an Atlas Client on MongoDB

To sign up for a free MongoDB account, go to https://mongodb.com and create a new free account. Once your account is set up, you will be taken to the screen to create your cluster. Use the default settings for the free Atlas cluster (M0, as they refer to it) and click Create Cluster to get started. This will take you to the Clusters page while your new cluster is created, which takes several minutes.

###Create your Database User and whitelist your IP address
Next, in the Atlas tab Security Quickstart, you will need to complete additional steps to get up and running:

*	Add your username and password, then click Create User: this enables you to log into your cluster.
*	Keep My Local Environment selected: this means adding your network IP addresses to the IP Access List, which can be modified at any time.
*	Click on Add My Current IP Address: this is a security measure that ensures only the IP addresses you verify are allowed to interact with your cluster. To connect to this cluster from multiple locations (school, home, work, etc.), you will need to whitelist each IP address from which you intend to connect.
Finally, click on Finish and Close.

###Connect to your Cluster

Go to Databases. Click Connect to continue. Connecting to a MongoDB Atlas database from Python requires a connection string. To get your connection string, click **Connect Your Application**. In **Select your driver and version**, choose Python 3.6 or later. Your connection string will appear below in **Add your connection string into your application code**. Click COPY to copy the string. Paste this string into the keys.py file as the value of mongo_connection_string. Replace “<PASSWORD>” in the connection string with your password, and replace the database name “myFirstDatabase” with “mySpotifyDatabase”, which will be the database name in this assignment. At the bottom of the Connect to YourClusterName dialog, click Close. You are now ready to interact with your Atlas cluster.
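If your password contains characters such as `@` or `/`, it needs to be percent-encoded before it goes into the connection string. The sketch below shows one way to do the replacement. The helper name `build_connection_string`, the user name, the host, and the password are all placeholders invented for illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(template: str, password: str) -> str:
    """Replace the <PASSWORD> placeholder with a percent-encoded password."""
    return template.replace("<PASSWORD>", quote_plus(password))


# Hypothetical template: the user name and host below are placeholders
template = (
    "mongodb+srv://myUser:<PASSWORD>@cluster0.example.mongodb.net/"
    "mySpotifyDatabase?retryWrites=true&w=majority"
)

connection_string = build_connection_string(template, "p@ss/word")
print(connection_string)
```

Keeping the real password out of the notebook (for example, by reading it from an environment variable) avoids sharing it by accident when the notebook is submitted.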


In [18]:
MONGO_STRING = 'mongodb+srv://spotify:Vinit123@spotify.qf1liyt.mongodb.net/?retryWrites=true&w=majority&appName=Spotify' #Include your mongo connection string here

In [19]:
from pymongo import MongoClient
#START YOUR CODE HERE
atlas_client = MongoClient(MONGO_STRING)   #Pass your cluster connection string to the client method
#END YOUR CODE HERE

In [20]:
#START YOUR CODE HERE
database = atlas_client["spotify"]               #Create a database object and name it for your atlas_client
featured_albums_collection = database["Albums"]      #Select a name for your collection
#END YOUR CODE HERE

<a name='3'></a>
#3 - Create a Spotify APP

To get access to Spotify's API resources, you need to create a Spotify account if you don't already have one. A trial account will be enough to complete this lab.

1. Go to https://developer.spotify.com/, create an account and log in.
2. Click on the account name in the top-right corner and then click on **Dashboard**.
3. Create a new APP using the following details:
   - App name: You can choose the name; make sure you use only an alphanumeric string without special characters
   - App description: `DBMS test API application`
   - Website: leave empty
   - Redirect URIs: `http://localhost:6000`
   - API to use: select `Web API`
4. Click on the **Save** button. If you get an error message saying that your account is not ready, log out, wait a few minutes, and then repeat steps 2-4.
5. On the App Home page, click on **Settings** and reveal `Client ID` and `Client secret`. Make sure you copy those and save them in a separate file!


Here's the link to [the Spotify API documentation](https://developer.spotify.com/documentation/web-api/tutorials/getting-started) that you can refer to while you're working on this assignment.

###client id 498fbe6a291a4d9c93b9d45b4d40b9f3

###client secret 5cfcb5ee4f1643bab1f11bd4da23078c

<a name='4'></a>
#4 - Create a Spotify ClientCredential object using the SDK

The Spotipy SDK is a Python client for interacting with Spotify’s Web API. It provides a range of functions to access and manage data related to artists, albums, tracks, playlists, and user profiles. Here’s an overview of some key capabilities that you will explore in this assignment:





*   **Accessing Artist Information**: With Spotipy, you can retrieve detailed information about artists, including their name, genres, popularity score, and followers. The SDK also allows access to an artist's top tracks and related artists, which can help students explore music trends and build up artist profiles for batch storage in MongoDB.

*   **Track and Album Metadata**: Spotipy enables access to metadata for tracks and albums, such as track name, album name, release date, and track popularity. Additionally, you can retrieve audio features like tempo, danceability, and energy, which provide in-depth details about the music and are valuable for data analysis.

*   **Searching for Content**: Using Spotipy’s search functionality, you can query the Spotify catalog by keywords for artists, albums, playlists, or tracks. This can be instrumental in batch processing, as users can search for multiple artists or songs and gather relevant data in one go.

*   **User Profile and Playlist Management**: Spotipy also supports accessing Spotify user profiles and playlists, though this is less relevant for the assignment. However, this feature could provide additional context or personalization if students wanted to explore user-based music preferences.


*   **Authorization and Access Control**: Spotipy handles authorization with Spotify’s OAuth, ensuring that only authenticated requests are made. This allows students to securely access data and manage the rate limits associated with the Spotify API.

In [21]:
import spotipy
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials

The first step in working with an API is understanding its authentication process. For Spotify, this involves using a Client ID and Client Secret generated by the Spotify app to obtain an access token. The access token is a string containing the credentials and permissions required to access specific resources. For more information, refer to the Spotify [API documentation](https://developer.spotify.com/documentation/web-api/concepts/access-token).

Since each API is designed with unique features, it’s essential to review its documentation thoroughly to access data responsibly. Throughout this lab, you’ll find links to documentation; it’s recommended to review these during and after the session as needed.

Now, let’s create variables to store the client_id and client_secret values.

In [22]:
CLIENT_ID = '498fbe6a291a4d9c93b9d45b4d40b9f3'     #Include your client ID here
CLIENT_SECRET = '5cfcb5ee4f1643bab1f11bd4da23078c' #Include your client Secret here

In [23]:
credentials = SpotifyClientCredentials(
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET
    )

spotify = spotipy.Spotify(client_credentials_manager=credentials, language='en')  #You can change this if you want to get data in a different language

When working with the Spotify API, you'll receive a temporary access token, with its validity period specified in the `expires_in` field (in seconds). Once this token expires, any subsequent requests will fail and return an error with a status code of 401, indicating that the request is unauthorized.

For each API request you send to Spotify, you need to include the access token in the request’s authorization header. The get_auth_header function is provided to streamline this process. It takes the access token as input and returns a properly formatted authorization header, which you can then include in your API requests.
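The helper itself is not shown in this notebook, and the Spotipy client used in the cells here attaches the header for you. As a minimal sketch, assuming the standard `Bearer` scheme that matches the `token_type` Spotify returns, a `get_auth_header` function could look like this (the token value is a placeholder):

```python
def get_auth_header(access_token: str) -> dict:
    """Return the HTTP header that carries a bearer token."""
    return {"Authorization": f"Bearer {access_token}"}


# Placeholder token, not a real credential
headers = get_auth_header("example-access-token")
print(headers)
```

A dictionary like this would be passed as the `headers` argument of an HTTP request when calling the Web API without the SDK.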

**If you get a 401 response, please make sure to create your access token again by executing the code below!**

In [24]:
credentials.get_access_token()



{'access_token': 'BQBvTN9UldmukHTwRQF284jciV6O15Rb9xzSUY8zJqAk3eFBokjkY-0B5FTkVYCqQ5mjimhDthBG4OvaNqOFUti4VU2P9gNXkGciwcLopWxDOJPfxocCeLnk2cb1wzCoWrsXe5FXkQc',
 'token_type': 'Bearer',
 'expires_in': 3600,
 'expires_at': 1743214305}

The response above includes the token's lifetime in seconds (`expires_in`) and its expiry time as a Unix timestamp (`expires_at`). Once the token expires, you will need to create a new one.
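A quick way to tell whether a token needs refreshing is to compare `expires_at` with the current time. The sketch below assumes a token dictionary shaped like the output above. The function name `token_is_expired` and the 60-second safety margin are illustrative choices, not part of the SDK.

```python
import time
from typing import Optional


def token_is_expired(token_info: dict, margin_seconds: int = 60,
                     now: Optional[float] = None) -> bool:
    """Return True if the token has expired or will expire within the margin."""
    current = time.time() if now is None else now
    return current >= token_info["expires_at"] - margin_seconds


# Token metadata shaped like the response above (the access token itself is omitted)
sample_token = {"token_type": "Bearer", "expires_in": 3600, "expires_at": 1743214305}

# Evaluate at two fixed points in time so the result does not depend on the clock
issued_at = sample_token["expires_at"] - sample_token["expires_in"]
fresh = token_is_expired(sample_token, now=issued_at)
stale = token_is_expired(sample_token, now=sample_token["expires_at"])
print(fresh, stale)
```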

<a name='5'></a>
#5 - Search for Artists by Genre in Spotify API

Select one of the following genres to fetch data from Spotify based on the genre of your choice:

* pop
* jazz
* hip-hop

**Your task**:
*   Select one genre from the list above for which you will retrieve data from the Spotify API.
*   Use the limit parameter to specify the number of artist records you want to retrieve. Default: 20. Minimum: 1. Maximum: 50


In [25]:
#START YOUR CODE HERE
GENRE = 'pop'   # Specify your genre here (e.g., "pop", "jazz", "hip-hop")
LIMIT = 20     # Specify the number of artist records you want to retrieve (max 50)
#END YOUR CODE HERE

Now, let's use your selected genre and the Spotify API to retrieve artist data and their albums. You will use the [search()](https://spotipy.readthedocs.io/en/2.22.1/#spotipy.client.Spotify.search) method to find artists based on your selected genre. Then, you will use the [artist_albums()](https://spotipy.readthedocs.io/en/2.22.1/#spotipy.client.Spotify.artist_albums) method to get all albums for each artist.

**Your tasks**:


1.   Use the `search()` method from the Spotify API to search for artists based on the genre you selected. Store the results in a variable called `search_results`.
2.   Extract the `artist_id` from each result and use the `artist_albums()` method to retrieve all albums associated with that artist.
3.   Loop through the results and insert each album into your MongoDB collection (`featured_albums_collection`).

HINT: You should explore the `search_results` and `artist_albums()` response to understand how data is returned. You can also refer to the [MongoDB documentation](https://www.mongodb.com/docs/manual/reference/method/db.collection.insertOne/?msockid=2c010af6d0b963db3ebe1e3ed1496248) to learn more about inserting documents into MongoDB.
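To get a feel for the nesting before calling the API, here is a small sketch that walks a hand-written dictionary shaped like an artist search response. The artist IDs and names are invented placeholders, and a real response contains many more keys.

```python
# Hypothetical, heavily trimmed stand-in for a search response
sample_search_results = {
    "artists": {
        "items": [
            {"id": "artist-id-1", "name": "Example Artist A", "genres": ["pop"]},
            {"id": "artist-id-2", "name": "Example Artist B", "genres": []},
        ],
        "next": None,
    }
}

# The artist records live under the 'artists' -> 'items' keys
artist_ids = [artist["id"] for artist in sample_search_results["artists"]["items"]]
print(artist_ids)

# Inspecting the keys of one record shows which fields are available
print(sorted(sample_search_results["artists"]["items"][0].keys()))
```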

In [26]:
#START YOUR CODE HERE
#Search for artists by genre
search_results = spotify.search(q=f"genre:{GENRE}", type="artist", limit=LIMIT)

#Loop through the artist results and fetch their albums
for artist in search_results['artists']['items']:
    artist_id = artist['id']

    #Retrieve the artist's albums
    albums = spotify.artist_albums(artist_id, album_type='album', limit=LIMIT)

    #Insert each album into MongoDB
    for album in albums['items']:
        featured_albums_collection.insert_one(album)
#END YOUR CODE HERE

<a name='6'></a>
#6 - Explore your MongoDB collection

In this task, you will connect to your MongoDB collection and query specific fields from your dataset. The goal is to extract key information about the artists and their albums from the collection. You will then load this data into a Python list called artists_data and later use it to perform further analysis.

**Your tasks:**


1.   Use the [find()](https://www.mongodb.com/docs/manual/reference/method/db.collection.find/) method in MongoDB to query your `featured_albums_collection` for specific fields. You should retrieve only the following fields:
* artist ID (`artists.id`)
* artist name (`artists.name`)
* artist URI (`artists.external_urls.spotify`)

2.   Once you get the data from your query, create a pandas DataFrame with the results. You should find a way to combine all records into the `artists_data` list.
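One way to combine the records without duplicates is to key them by artist ID. The sketch below works on hand-written documents shaped like the projected query results. The function name `extract_unique_artists` and the sample values are assumptions for illustration.

```python
def extract_unique_artists(album_docs: list) -> list:
    """Flatten the 'artists' arrays and keep one record per artist ID."""
    unique = {}
    for doc in album_docs:
        for artist in doc.get("artists", []):
            # setdefault keeps the first record seen for each ID
            unique.setdefault(artist["id"], {
                "artist_id": artist["id"],
                "artist_name": artist["name"],
                "artist_uri": artist["external_urls"]["spotify"],
            })
    return list(unique.values())


# Hypothetical documents; the same artist appears on two albums
docs = [
    {"artists": [{"id": "a1", "name": "Artist One",
                  "external_urls": {"spotify": "https://open.spotify.com/artist/a1"}}]},
    {"artists": [{"id": "a1", "name": "Artist One",
                  "external_urls": {"spotify": "https://open.spotify.com/artist/a1"}},
                 {"id": "a2", "name": "Artist Two",
                  "external_urls": {"spotify": "https://open.spotify.com/artist/a2"}}]},
]

unique_artists = extract_unique_artists(docs)
print([artist["artist_name"] for artist in unique_artists])
```

A dictionary lookup stays fast as the collection grows, whereas checking `not in` against a list rescans the whole list for every artist.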



In [29]:
#START YOUR CODE HERE
artists_data = []

#Query MongoDB to get artist details
for album in featured_albums_collection.find({}, {
    "artists.id": 1,
    "artists.name": 1,
    "artists.external_urls.spotify": 1,
    "_id": 0
}):

    #Loop through the 'artists' list in each album
    for artist in album.get("artists", []):
        artist_info = {
            "artist_id": artist["id"],
            "artist_name": artist["name"],
            "artist_uri": artist["external_urls"]["spotify"]
        }

        #Avoid adding duplicates
        if artist_info not in artists_data:
            artists_data.append(artist_info)
#END YOUR CODE HERE
pop_df = pd.DataFrame(artists_data)
pop_df

|   | artist_id | artist_name | artist_uri |
|---|---|---|---|
| 0 | 2HPaUgqeutzr3jx5a9WyDV | PARTYNEXTDOOR | https://open.spotify.com/artist/2HPaUgqeutzr3j... |
| 1 | 3TVXtAsR1Inumwj472S9r4 | Drake | https://open.spotify.com/artist/3TVXtAsR1Inumw... |
| 2 | 1URnnhqYAYcrqrcwql10ft | 21 Savage | https://open.spotify.com/artist/1URnnhqYAYcrqr... |
| 3 | 1RyvyyTE3xzB2ZywiAwp0i | Future | https://open.spotify.com/artist/1RyvyyTE3xzB2Z... |
| 4 | 06HL4z0CvFAxyc27GXpf02 | Taylor Swift | https://open.spotify.com/artist/06HL4z0CvFAxyc... |
| 5 | 1Xyo4u8uXC1ZmMpatF05PJ | The Weeknd | https://open.spotify.com/artist/1Xyo4u8uXC1ZmM... |
| 6 | 7tYKF4w9nC0nq9CsPZTHyP | SZA | https://open.spotify.com/artist/7tYKF4w9nC0nq9... |
| 7 | 6qGkLCMQkNGOJ079iEcC5k | Ben Platt | https://open.spotify.com/artist/6qGkLCMQkNGOJ0... |
| 8 | 2wY79sveU1sp5g7SokKOiI | Sam Smith | https://open.spotify.com/artist/2wY79sveU1sp5g... |
| 9 | 2YZyLoL8N0Wb9xBt1NhZWg | Kendrick Lamar | https://open.spotify.com/artist/2YZyLoL8N0Wb9x... |


<a name='7'></a>
#7 - Get all albums from the featured Artists

Now that you have extracted a list of artists from your MongoDB collection in Task #6, your next step is to use the Spotify API to retrieve all albums released by each artist. The goal is to build a comprehensive dataset of albums from the artists featured in your selected genre.

The Spotify API provides an [artist_albums](https://spotipy.readthedocs.io/en/2.22.1/?highlight=artist_albums#spotipy.client.Spotify.artist_albums) endpoint that allows you to query for all albums released by a specific artist. To do this, you will need to use the artist_uri you extracted in Task #6.

Your tasks:


1.   Loop through the `artists_data` object and get the `artist_uri` for each artist. This value will be required for you to call the `artist_albums` endpoint to retrieve the artist's albums.
2.   Use the `spotify.artist_albums()` method to fetch the artist's albums. Store the results in a variable called `albums`.
3.   Add the artist's name to each album returned in the response. This is important for keeping track of which album belongs to which artist.
4.   Append each album to a new list called `artists_albums`.
5.   The Spotify API uses pagination: each JSON response holds one page of results, and its `next` key holds the URL of the following page (or `None` when there are no more pages). Your job is to use the `next` [method](https://spotipy.readthedocs.io/en/2.22.1/?highlight=featured_playlists#spotipy.client.Spotify.next) to get the remaining albums for a given artist.
6.   Make sure to collect the following fields from each album:
* Artist Name
* Album Name
* Release Date
* Total Tracks
* Album Type
* Spotify URL
* Album Genre(s)
* Available Markets
7.   Continue adding the albums to your artists_albums list until you have exhausted all pages of results.
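The pagination loop described in steps 5 and 7 can be sketched without touching the network. `FakePagedClient` below is a made-up stand-in whose `next()` mimics the call shape of the SDK method, and the page contents are invented.

```python
class FakePagedClient:
    """Hypothetical stand-in that serves pre-built pages by their 'next' key."""

    def __init__(self, pages: dict):
        self._pages = pages

    def next(self, result: dict):
        # Return the following page, or None when there is none
        return self._pages.get(result["next"]) if result["next"] else None


def collect_all_items(client, first_page: dict) -> list:
    """Gather 'items' from the first page and every page reachable via 'next'."""
    items = []
    page = first_page
    while page:
        items.extend(page["items"])
        page = client.next(page)
    return items


# Three invented pages chained through their 'next' values
first_page = {"items": ["album-1", "album-2"], "next": "page-2"}
client = FakePagedClient({
    "page-2": {"items": ["album-3"], "next": "page-3"},
    "page-3": {"items": ["album-4"], "next": None},
})

all_albums = collect_all_items(client, first_page)
print(all_albums)
```

Handling the first page inside the same loop as the later pages means the per-album logic only has to be written once.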



In [30]:
#START YOUR CODE HERE
artists_albums = []

#Loop through the artists_data list from Task #6
for artist in artists_data:
    artist_uri = artist['artist_uri']

    #Use the Spotify API to fetch the first page of the artist's albums
    results = spotify.artist_albums(artist_uri, album_type='album')

    #Get artist details to retrieve genres (a list of genre names)
    artist_details = spotify.artist(artist_uri)
    genres = artist_details['genres']

    #Process the current page, then follow 'next' until no pages remain
    while results:
        for album in results['items']:
            #Keep the required fields and add the artist's name to each album
            album_info = {
                "artist_name": artist['artist_name'],
                "album_name": album['name'],
                "release_date": album['release_date'],
                "total_tracks": album['total_tracks'],
                "album_type": album['album_type'],
                "spotify_url": album['external_urls']['spotify'],
                "genres": genres,
                "available_markets": album['available_markets']
            }
            artists_albums.append(album_info)

        #spotify.next() returns None when there is no further page
        results = spotify.next(results)
#END YOUR CODE HERE

In [31]:
# Preview the first 3 albums
for album in artists_albums[:3]:
    print(album)


{'artist_name': 'PARTYNEXTDOOR', 'album_name': '$ome $exy $ongs 4 U', 'release_date': '2025-02-14', 'total_tracks': 21, 'album_type': 'album', 'spotify_url': 'https://open.spotify.com/album/6Rl6YoCarF2GHPSQmmFjuR', 'genres': ['r&b'], 'available_markets': ['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA', 'CL', 'CO', 'CR', 'CY', 'CZ', 'DK', 'DO', 'DE', 'EC', 'EE', 'SV', 'FI', 'FR', 'GR', 'GT', 'HN', 'HK', 'HU', 'IS', 'IE', 'IT', 'LV', 'LT', 'LU', 'MY', 'MT', 'MX', 'NL', 'NZ', 'NI', 'NO', 'PA', 'PY', 'PE', 'PH', 'PL', 'PT', 'SG', 'SK', 'ES', 'SE', 'CH', 'TW', 'TR', 'UY', 'US', 'GB', 'AD', 'LI', 'MC', 'ID', 'JP', 'TH', 'VN', 'RO', 'IL', 'ZA', 'SA', 'AE', 'BH', 'QA', 'OM', 'KW', 'EG', 'MA', 'DZ', 'TN', 'LB', 'JO', 'PS', 'IN', 'BY', 'KZ', 'MD', 'UA', 'AL', 'BA', 'HR', 'ME', 'MK', 'RS', 'SI', 'KR', 'BD', 'PK', 'LK', 'GH', 'KE', 'NG', 'TZ', 'UG', 'AG', 'AM', 'BS', 'BB', 'BZ', 'BT', 'BW', 'BF', 'CV', 'CW', 'DM', 'FJ', 'GM', 'GE', 'GD', 'GW', 'GY', 'HT', 'JM', 'KI', 'LS', 'LR', 'MW', 'MV', 'ML', 

<a name='8'></a>
#8 - Create New MongoDB collection

Now that you have your new object with all artists' albums, you will need to create a new collection in your MongoDB cluster. Store the data you created above in that new collection.

Remember to look at this [documentation](https://www.mongodb.com/docs/manual/tutorial/insert-documents/#:~:text=Collection.-,insertOne()%20inserts%20a%20single%20document%20into%20a%20collection.,value%20to%20the%20new%20document) to learn more about MongoDB. Also, DO NOT FORGET to go and verify that the data is in your MongoDB cluster.
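As an alternative to inserting documents one at a time, a collection's `insert_many()` method accepts a list of documents in a single call. The sketch below uses `InMemoryCollection`, a made-up stand-in that only imitates that call shape so the example runs without a database. The helper name `store_albums` and the sample albums are also assumptions.

```python
from types import SimpleNamespace


class InMemoryCollection:
    """Hypothetical stand-in that imitates the insert_many call shape."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents: list):
        self.documents.extend(documents)
        # Mimic a result object that exposes the inserted IDs
        return SimpleNamespace(inserted_ids=list(range(len(documents))))


def store_albums(collection, albums: list) -> int:
    """Insert all albums in one call and return how many were stored."""
    if not albums:
        # Skip the call for an empty batch, which a real driver may reject
        return 0
    result = collection.insert_many(albums)
    return len(result.inserted_ids)


# Invented sample albums
sample_albums = [
    {"artist_name": "Artist One", "album_name": "First", "total_tracks": 10},
    {"artist_name": "Artist Two", "album_name": "Second", "total_tracks": 12},
]

collection = InMemoryCollection()
stored = store_albums(collection, sample_albums)
print(stored)
```

With a real collection object the same helper would send the whole list in one round trip instead of one request per album.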

In [32]:
#START YOUR CODE HERE
database = atlas_client["spotify"]       #Select the name of your database
albums_collection = database["artist_albums"]  #Select the name of your collection

for album in artists_albums:
    albums_collection.insert_one(album)     #Insert the data into MongoDB
#END YOUR CODE HERE

In [35]:
for album in albums_collection.find().limit(3):
    print(album)

{'_id': ObjectId('67e750444552208a65681dd0'), 'artist_name': 'PARTYNEXTDOOR', 'album_name': '$ome $exy $ongs 4 U', 'release_date': '2025-02-14', 'total_tracks': 21, 'album_type': 'album', 'spotify_url': 'https://open.spotify.com/album/6Rl6YoCarF2GHPSQmmFjuR', 'genres': ['r&b'], 'available_markets': ['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA', 'CL', 'CO', 'CR', 'CY', 'CZ', 'DK', 'DO', 'DE', 'EC', 'EE', 'SV', 'FI', 'FR', 'GR', 'GT', 'HN', 'HK', 'HU', 'IS', 'IE', 'IT', 'LV', 'LT', 'LU', 'MY', 'MT', 'MX', 'NL', 'NZ', 'NI', 'NO', 'PA', 'PY', 'PE', 'PH', 'PL', 'PT', 'SG', 'SK', 'ES', 'SE', 'CH', 'TW', 'TR', 'UY', 'US', 'GB', 'AD', 'LI', 'MC', 'ID', 'JP', 'TH', 'VN', 'RO', 'IL', 'ZA', 'SA', 'AE', 'BH', 'QA', 'OM', 'KW', 'EG', 'MA', 'DZ', 'TN', 'LB', 'JO', 'PS', 'IN', 'BY', 'KZ', 'MD', 'UA', 'AL', 'BA', 'HR', 'ME', 'MK', 'RS', 'SI', 'KR', 'BD', 'PK', 'LK', 'GH', 'KE', 'NG', 'TZ', 'UG', 'AG', 'AM', 'BS', 'BB', 'BZ', 'BT', 'BW', 'BF', 'CV', 'CW', 'DM', 'FJ', 'GM', 'GE', 'GD', 'GW', 'GY', 'HT

<a name='9'></a>
#9 - Explore your data!
You have now collected all albums from the artists found for your selected genre. Your next task is to explore and analyze this data using Python and MongoDB.

Answer the following questions based on the data in your collection:


1.   How many albums are stored in the collection?
2.   Which artist in the collection has the most albums?
3.   Which artist has the highest average number of tracks per album?
4.   What are the 3 most common genres in the collection?
5.   Who are the top 5 artists with the highest average number of tracks per album in the collection? (*Include the Artist Name*)
6.   How many albums in the collection have the genre you selected in Task #5?
7.   What are the top 10 albums with the most tracks? (*Include the Artist Name & the album's genre*)
8.   Which genre has the highest average number of tracks per album?
9.   Which genre in the collection has the most distinct (unique) artists associated with it?
10.  How many albums have the phrase "Deluxe Edition" in their title? (*Include the Artist Name*)

For your reference, here are the MongoDB commands that will be useful for these tasks:

[Aggregate](https://www.mongodb.com/docs/manual/reference/command/aggregate/)

[Find](https://www.mongodb.com/docs/manual/reference/command/find/)
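Before writing a pipeline, it can help to see what a typical `$unwind`, `$group`, `$sort`, `$limit` sequence computes. The sketch below reproduces that logic in plain Python on invented documents. It illustrates the idea only and does not replace running the aggregation in MongoDB.

```python
from collections import Counter


def top_genres(albums: list, limit: int) -> list:
    """Count each genre across all albums and return the most common ones."""
    # Iterating over every genre of every album plays the role of $unwind;
    # Counter plays the role of $group with a $sum of 1
    counts = Counter(genre for album in albums for genre in album.get("genres", []))
    # most_common sorts by count in descending order and truncates, like $sort + $limit
    return counts.most_common(limit)


# Invented sample documents; an album with no genres contributes nothing
sample_albums = [
    {"album_name": "A", "genres": ["pop", "rap"]},
    {"album_name": "B", "genres": ["rap"]},
    {"album_name": "C", "genres": []},
]

print(top_genres(sample_albums, 1))
```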


In [39]:
# Question 1 - How many albums are stored in the collection?
# Answer:598
total_albums = albums_collection.count_documents({})
print("Total number of albums in the collection:", total_albums)


Total number of albums in the collection: 598


In [46]:
# Question 2 - Which artist in the collection has the most albums?
# Answer:Hans Zimmer, Albums: 112

DataAggregationPipeline = [
    {
        "$group": {
            "_id": "$artist_name",
            "album_count": {"$sum": 1}
        }
    },
    {
        "$sort": {"album_count": -1}
    },
    {
        "$limit": 10
    }
]

# Executing the pipeline
most_albums_artist = albums_collection.aggregate(DataAggregationPipeline)

# Printing the result
for artist in most_albums_artist:
    print(f"Artist: {artist['_id']}, Albums: {artist['album_count']}")



Artist: Hans Zimmer, Albums: 112
Artist: Tony Bennett, Albums: 73
Artist: Lil Wayne, Albums: 31
Artist: Taylor Swift, Albums: 30
Artist: Future, Albums: 29
Artist: 2 Chainz, Albums: 22
Artist: Cynthia Erivo, Albums: 21
Artist: Drake, Albums: 18
Artist: Lady Gaga, Albums: 18
Artist: Nicki Minaj, Albums: 18


In [47]:
# Question 3: Which artist has the highest average number of tracks per album?
# Answer Artist: Cynthia Erivo, Average Tracks: 27.38

AvgTracksPipeline = [
    {
        "$group": {
            "_id": "$artist_name",
            "avg_tracks": {"$avg": "$total_tracks"}
        }
    },
    {
        "$sort": {"avg_tracks": -1}
    },
    {
        "$limit": 10
    }
]

# Executing the pipeline
most_tracks_artist = albums_collection.aggregate(AvgTracksPipeline)

# Printing the result
for artist in most_tracks_artist:
    print(f"Artist: {artist['_id']}, Average Tracks: {artist['avg_tracks']:.2f}")


Artist: Cynthia Erivo, Average Tracks: 27.38
Artist: Bradley Cooper, Average Tracks: 24.67
Artist: Nicki Minaj, Average Tracks: 21.72
Artist: SZA, Average Tracks: 20.00
Artist: Taylor Swift, Average Tracks: 19.73
Artist: Post Malone, Average Tracks: 19.73
Artist: Hozier, Average Tracks: 19.00
Artist: Drake, Average Tracks: 18.28
Artist: Lil Wayne, Average Tracks: 18.26
Artist: Hans Zimmer, Average Tracks: 17.63


In [48]:
# Question 4 - What are the 3 most common genres in the collection?
# Answer: soundtrack (112), rap (100), big band (73)


MostCommonGenresPipeline = [
    {"$unwind": "$genres"},
    {"$group": {"_id": "$genres", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 3}
]

# Execute pipeline
top_genres = albums_collection.aggregate(MostCommonGenresPipeline)

# Print results
print("Top 3 Most Common Genres:")
for genre in top_genres:
    print(f"Genre: {genre['_id']}, Count: {genre['count']}")


Top 3 Most Common Genres:
Genre: soundtrack, Count: 112
Genre: rap, Count: 100
Genre: big band, Count: 73


In [49]:
# Question 5 - Who are the top 5 artists with the highest average number of tracks per album in the collection? (*Include the Artist Name*)
# Answer: Cynthia Erivo (27.38), Bradley Cooper (24.67), Nicki Minaj (21.72), SZA (20.00), Taylor Swift (19.73)

Top5AvgTracksPipeline = [
    {
        "$group": {
            "_id": "$artist_name",
            "avg_tracks": {"$avg": "$total_tracks"}
        }
    },
    {
        "$sort": {"avg_tracks": -1}
    },
    {
        "$limit": 5
    }
]

# Execute pipeline
top_5_artists = albums_collection.aggregate(Top5AvgTracksPipeline)

# Print results
print("Top 5 Artists with Highest Average Tracks per Album:")
for artist in top_5_artists:
    print(f"Artist: {artist['_id']}, Avg Tracks: {artist['avg_tracks']:.2f}")


Top 5 Artists with Highest Average Tracks per Album:
Artist: Cynthia Erivo, Avg Tracks: 27.38
Artist: Bradley Cooper, Avg Tracks: 24.67
Artist: Nicki Minaj, Avg Tracks: 21.72
Artist: SZA, Avg Tracks: 20.00
Artist: Taylor Swift, Avg Tracks: 19.73


In [50]:
# Question 6 - How many albums in the collection have the genre you selected in Task #5?
# Answer:Number of albums with the genre 'pop': 33
selected_genre = "pop"

genre_album_count = albums_collection.count_documents({
    "genres": selected_genre
})

print(f"Number of albums with the genre '{selected_genre}':", genre_album_count)


Number of albums with the genre 'pop': 33


In [52]:
# Question 7 - What are the top 10 albums with the most tracks? (*Include the Artist Name & the album's genre*)
# Answer:

Top10AlbumsPipeline = [
    {
        "$project": {
            "album_name": 1,
            "artist_name": 1,
            "genres": 1,
            "total_tracks": 1
        }
    },
    {
        "$sort": {"total_tracks": -1}
    },
    {
        "$limit": 10
    }
]

# Execute the pipeline
top_10_albums = albums_collection.aggregate(Top10AlbumsPipeline)

# Print results
print("Top 10 Albums with the Most Tracks:")
for album in top_10_albums:
    print(f"Artist: {album['artist_name']}, Album: {album['album_name']}, Tracks: {album['total_tracks']}, Genres: {album['genres']}")



Top 10 Albums with the Most Tracks:
Artist: Hans Zimmer, Album: Planet Earth III (Original Television Soundtrack), Tracks: 58, Genres: ['soundtrack']
Artist: Hans Zimmer, Album: Planet Earth II (Original Television Soundtrack), Tracks: 49, Genres: ['soundtrack']
Artist: Taylor Swift, Album: reputation Stadium Tour Surprise Song Playlist, Tracks: 46, Genres: []
Artist: Tony Bennett, Album: Tony Bennett At Carnegie Hall - The Complete Concert, Tracks: 44, Genres: ['vocal jazz', 'christmas', 'big band', 'adult standards', 'jazz']
Artist: Hans Zimmer, Album: White Fang (Original Soundtrack), Tracks: 42, Genres: ['soundtrack']
Artist: SZA, Album: SOS Deluxe: LANA, Tracks: 42, Genres: ['r&b']
Artist: Hans Zimmer, Album: Gladiator: 20th Anniversary Edition, Tracks: 35, Genres: ['soundtrack']
Artist: Hans Zimmer, Album: Frozen Planet II (Original Television Soundtrack), Tracks: 35, Genres: ['soundtrack']
Artist: Taylor Swift, Album: folklore: the long pond studio sessions (from the Disney+ spe

In [53]:
# Question 8 - Which genre has the highest average number of tracks per album?
# Answer:Genre: musicals, Average Tracks per Album: 24.24

HighestAvgTracksByGenrePipeline = [
    {"$unwind": "$genres"},  # Step 1: Expand genres array
    {
        "$group": {
            "_id": "$genres",
            "avg_tracks": {"$avg": "$total_tracks"}  # Calculate average tracks
        }
    },
    {"$sort": {"avg_tracks": -1}},    # Sort descending
    {"$limit": 1}   #  Just the top genre
]

# Execute
top_genre_avg_tracks = albums_collection.aggregate(HighestAvgTracksByGenrePipeline)

# Print result
for genre in top_genre_avg_tracks:
    print(f"Genre: {genre['_id']}, Average Tracks per Album: {genre['avg_tracks']:.2f}")


Genre: musicals, Average Tracks per Album: 24.24


In [54]:
# Question 9 - Which genre in the collection has the most distinct (unique) artists associated with it?
# Answer:Genre: rap, Unique Artists: 4
UniqueArtistsByGenrePipeline = [

    {"$unwind": "$genres"},


    {"$group": {
        "_id": {"genre": "$genres", "artist": "$artist_name"}
    }},


    {"$group": {
        "_id": "$_id.genre",
        "unique_artist_count": {"$sum": 1}
    }},


    {"$sort": {"unique_artist_count": -1}},


    {"$limit": 1}
]

# Run the pipeline
most_unique_artists = albums_collection.aggregate(UniqueArtistsByGenrePipeline)

# Display the result
for genre in most_unique_artists:
    print(f"Genre: {genre['_id']}, Unique Artists: {genre['unique_artist_count']}")



Genre: rap, Unique Artists: 4


In [55]:
top_genre = "rap"

# Get unique artist names for the top genre
unique_artists_pipeline = [
    {"$match": {"genres": top_genre}},
    {"$group": {"_id": "$artist_name"}}
]

unique_artists_cursor = albums_collection.aggregate(unique_artists_pipeline)

print(f" Unique artists in genre '{top_genre}':")
for artist in unique_artists_cursor:
    print("-", artist["_id"])


 Unique artists in genre 'rap':
- 2 Chainz
- Future
- Lil Wayne
- Drake


In [57]:
# Question 10 - How many albums have the word "Deluxe Edition" in their title? (*Include the Artist Name*)
# Answer: 12 albums (case-insensitive match; artists and titles are listed in the output)

deluxe_albums = albums_collection.find(
    {"album_name": {"$regex": "Deluxe Edition", "$options": "i"}},  # case-insensitive match
    {"album_name": 1, "artist_name": 1, "_id": 0}  # only return needed fields
)

# Print results
print(" Albums with 'Deluxe Edition' in the title:\n")
count = 0
for album in deluxe_albums:
    print(f"- Artist: {album['artist_name']}, Album: {album['album_name']}")
    count += 1

print(f"\n Total 'Deluxe Edition' albums found: {count}")



 Albums with 'Deluxe Edition' in the title:

- Artist: Taylor Swift, Album: folklore: the long pond studio sessions (from the Disney+ special) [deluxe edition]
- Artist: Taylor Swift, Album: 1989 (Deluxe Edition)
- Artist: Taylor Swift, Album: Red (Deluxe Edition)
- Artist: Taylor Swift, Album: Speak Now (Deluxe Edition)
- Artist: Lady Gaga, Album: The Fame Monster (Deluxe Edition)
- Artist: Hans Zimmer, Album: Man of Steel (Original Motion Picture Soundtrack) [Deluxe Edition]
- Artist: Hans Zimmer, Album: The Dark Knight Rises (Original Motion Picture Soundtrack) [Deluxe Edition]
- Artist: Justin Bieber, Album: Believe (Deluxe Edition)
- Artist: Justin Bieber, Album: Under The Mistletoe (Deluxe Edition)
- Artist: Nicki Minaj, Album: The Pinkprint (Deluxe Edition)
- Artist: Nicki Minaj, Album: Pink Friday ... Roman Reloaded (Deluxe Edition)
- Artist: Nicki Minaj, Album: Pink Friday (Deluxe Edition)

 Total 'Deluxe Edition' albums found: 12
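
The `$regex` filter with `"$options": "i"` does a case-insensitive substring match, which is why titles ending in `[deluxe edition]` are counted alongside `(Deluxe Edition)`. The sketch below reproduces that behavior with Python's `re` module on made-up titles. It assumes the pattern is a plain literal, for which Python and MongoDB regular expressions behave the same way.

```python
import re

# Hypothetical titles; the pattern and flag mirror the query above
titles = [
    "1989 (Deluxe Edition)",
    "Example Soundtrack [deluxe edition]",
    "Example Album (Deluxe)",
    "Example Album (Super Deluxe Edition)",
    "Plain Album",
]

pattern = re.compile(r"Deluxe Edition", re.IGNORECASE)

# search() matches anywhere in the string, like an unanchored $regex
matches = [title for title in titles if pattern.search(title)]

for title in matches:
    print("-", title)
print(f"Total matches: {len(matches)}")
```

"Example Album (Deluxe)" is not matched because the full phrase is required. If only the number is needed, pymongo's `count_documents` accepts the same filter and returns the count without iterating over a cursor.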


<a name='10'></a>
#10 - Create an interactive map using Folium!

Folium is a Python library that simplifies the creation of interactive, visually appealing maps. It acts as a wrapper for the Leaflet.js JavaScript library, allowing users to create maps in Python without needing to write JavaScript. [Learn More](https://python-visualization.github.io/folium/latest/)

Here are a few key concepts about this library:

* **Map Initialization**: Folium provides a Map class that lets users set a central location and zoom level, initializing a map on which they can place markers or other geographic elements.
* **Adding Markers**: Folium’s Marker class lets you place icons on the map at specific locations. Each marker can carry a popup with details (like artist names and album information). This feature is key for visualizing the different locations where an artist’s album is available.
* **Customization and Interactivity**: The library supports customizing marker icons, colors, and popups, making the map both interactive and visually informative. Users can click on markers to view additional information about the album or artist, which makes exploring data on a map engaging.

In this assignment, Folium will allow you to see where each album is available. Each marker represents a market (country) in which the artist’s album has been released, making it easy to see the global spread and reach of the album. By adding artist name, album name, album's genre and total tracks information to the markers, you can visually assess which artists and albums have the widest distribution.

**Geopy** is a Python library that enables geocoding—converting addresses or location names (like country codes) into latitude and longitude coordinates, which can then be used for plotting on a map.

Here’s how we will be using it:

* **Geocoding Services**: Geopy can connect to multiple geocoding providers (like Nominatim, Google Maps, etc.) to look up geographic data. When given a country code, Geopy queries the provider to retrieve the corresponding latitude and longitude.
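* **Caching and Rate Limiting**: Geocoding providers restrict how often they can be queried, and Geopy offers rate-limiting helpers to keep users from overwhelming the service with requests. This matters in this assignment because many albums are available in many countries. To keep the number of requests down, we will cache coordinates so that each country is looked up only once, no matter how many albums share that market.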
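
The caching and rate-limiting idea can be sketched without touching a real geocoding service. In the example below, `fake_geocode`, `FAKE_COORDS`, and `get_coordinates_cached` are hypothetical stand-ins, and the coordinates are rough illustrative values. A real run would call the geolocator instead and would likely use a delay of about one second, since Nominatim's usage policy allows at most one request per second.

```python
import time

# Hypothetical stand-in for a geocoding provider (not a real Geopy call)
FAKE_COORDS = {"US": (39.8, -98.6), "BR": (-10.3, -53.2)}
call_log = []  # records every simulated provider request


def fake_geocode(country_code):
    call_log.append(country_code)
    return FAKE_COORDS.get(country_code)  # None when the code is unknown


coordinates_cache = {}
_state = {"last_call": None}  # time of the most recent provider request


def get_coordinates_cached(country_code, min_delay=0.01):
    """Return cached coordinates, calling the provider at most once per code."""
    if country_code in coordinates_cache:
        return coordinates_cache[country_code]

    # Rate limiting: wait until min_delay seconds have passed since the last request
    if _state["last_call"] is not None:
        wait = min_delay - (time.monotonic() - _state["last_call"])
        if wait > 0:
            time.sleep(wait)
    _state["last_call"] = time.monotonic()

    coords = fake_geocode(country_code)
    coordinates_cache[country_code] = coords  # failed lookups (None) are cached too
    return coords


markets = ["US", "BR", "US", "US", "XX", "BR", "XX"]
results = [get_coordinates_cached(market) for market in markets]
print(f"{len(markets)} lookups, {len(call_log)} provider requests: {call_log}")
```

Caching the failed lookup for "XX" is a design choice in this sketch: it avoids asking the provider again for a code it could not resolve. Geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that can wrap `geolocator.geocode` to enforce a minimum delay.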

In [59]:
import folium
import time
from pymongo import MongoClient
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

Your tasks:


1.   Because your collection can contain many albums and many markets, first write a pymongo query that retrieves only the records needed for the map. Your task is to create a variable that stores the TOP 3 albums ordered by the highest number of `total_tracks`.
2.   We will then use the Nominatim geocoding provider to create a geolocator instance. Assign a name to the geolocator.
3.   We have defined a function called `get_coordinates` that takes a `country_code` as input and returns a tuple with the coordinates of that country. Because multiple albums may share the same market, we want to avoid repeating API requests for countries whose coordinates we have already looked up. Your task is to check whether the `country_code` passed to `get_coordinates` has been looked up before by testing whether it is in the `coordinates_cache` dictionary.
4.   Inside the `get_coordinates` function, check whether the `country_code` is already in `coordinates_cache`. If it is, return the cached coordinates.
5.   If the `country_code` has not been retrieved before, use the `geocode` method to get the coordinates from the geocoding provider and store the result in `coordinates_cache`.
6.   Iterate through the list of albums obtained in step 1. For each album, extract the following information: `artist_name`, `album_name`, `total_tracks`, and `available_markets`.
7.   Since each album may have multiple `available_markets`, loop through each market. For each market, call the `get_coordinates` function to retrieve its coordinates.
8.   Once you have the coordinates for a market, create a Folium marker for it. Each marker should display the `artist_name`, `album_name`, `total_tracks`, and the current market in a popup. Customize the marker's colors and icon as desired.
9.  If multiple albums are available in the same market, collect all relevant album details and display them in a single marker's popup.
10. Finally, add each marker to the Folium map to visualize the albums and their respective markets. Each marker should show the `artist_name`, `album_name`, `total_tracks`, and `available_markets`. You can change the colors and icon if you like. Learn more about this [here](https://python-visualization.github.io/folium/latest/reference.html#folium.map.Marker)


In [60]:
# Initialize a Folium map centered globally
map_ = folium.Map(location=[20, 0], zoom_start=2)
coordinates_cache = {}

#START YOUR CODE HERE
geolocator = Nominatim(user_agent="album_mapper") #Name your geolocator

top_3_albums = list(albums_collection.find({}, {
    "artist_name": 1,
    "album_name": 1,
    "total_tracks": 1,
    "genres": 1,
    "available_markets": 1,
    "_id": 0
}).sort("total_tracks", -1).limit(3))    #Create your query using pymongo here

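def get_coordinates(country_code):
    """Return (latitude, longitude) for a country code, or None if the lookup fails."""
    # Reuse coordinates that were already looked up
    if country_code in coordinates_cache:
        return coordinates_cache[country_code]

    try:
        location = geolocator.geocode(country_code, timeout=10)
    except GeocoderTimedOut:
        return None  # lookup timed out; skip this market

    if location is None:
        return None  # provider found no match

    # Store the result so the same country is never requested twice
    coords = (location.latitude, location.longitude)
    coordinates_cache[country_code] = coords
    return coords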

# Create a dictionary to store markets and associated album info
market_info = {}

# Loop through each album and collect market information
for album in top_3_albums:
    artist_name = album["artist_name"]
    album_name = album["album_name"]
    total_tracks = album["total_tracks"]
    album_genre = album.get("genres", [])
    available_markets = album["available_markets"]

    for market in available_markets:
        if market not in market_info:
            market_info[market] = []

        market_info[market].append({
            "artist_name": artist_name,
            "album_name": album_name,
            "total_tracks": total_tracks,
            "album_genre": ", ".join(album_genre) if album_genre else "N/A"
        })

# Loop through each market to add markers
for market, albums in market_info.items():
    coords = get_coordinates(market)
    if coords:
        # Prepare popup content
        popup_content = f"<b>Market:</b> {market}<br>"
        for album in albums:
            popup_content += (f"Artist: {album['artist_name']}<br>"
                              f"Album: {album['album_name']}<br>"
                              f"Total Tracks: {album['total_tracks']}<br>"
                              f"Genre: {album['album_genre']}<br><br>")

        # Add a marker with a popup showing all albums for that market
        folium.Marker(
            location=coords,
            popup=popup_content,
            icon=folium.Icon(color="blue", icon="music")
        ).add_to(map_)

    # Pause between lookups so the geocoding service is not overwhelmed
    time.sleep(1)

#END YOUR CODE HERE
map_


In [61]:
map_.save("top_albums_map.html")


# Video Submission

In your **5-minute** (maximum) video, ensure you:

1. **Explain your understanding of the assignment** and the process for each of the 10 steps:
   - **Clear Overview:** Provide a comprehensive overview covering each of the 10 steps in the assignment, such as setting up MongoDB, accessing the Spotify API, and querying data.
   - **Thoughtful Reflections:** Highlight the challenges you encountered and how you solved them. Discuss any key choices you made (e.g., structuring MongoDB queries), showcasing your thought process and approach.
   - **Depth of Understanding:** Demonstrate a strong understanding of each step, reflecting on both the successes and obstacles you faced.

2. **Provide a detailed explanation of the insights you gained** from querying the data:
   - **Learning Outcomes:** Clearly articulate what you learned from the data queries, connecting these insights to specific questions you aimed to answer.
   - **Contributions to Understanding:** Explain how these insights contribute to a deeper understanding of the data and your overall assignment, emphasizing the value of your analysis.

3. **Explain the interactive map** you created with **Folium**:
   - **Insights from the Map:** Describe the insights the map helped you uncover, including any geographical patterns or trends in the data.
   - **Enhancing Analysis:** Discuss how the interactive map enhanced your analysis and contributed to your overall findings, demonstrating its relevance to your assignment.

4. **Use of Individual Credentials**:
   -  Ensure you use your individual Spotify API credentials and MongoDB connection details during your project. This is crucial for accessing your personal data and ensuring the integrity of your work.

5. **Video Format**:
   - Your video must include a recording of yourself as well as your screen. Be sure to show your MongoDB collection, highlighting the data you collected and how it aligns with your assignment objectives.
