# Musicality Analysis Project

## Motivation

Have you ever associated a moment of your life with a song? Music is everywhere. We want to analyze music-related data and see whether we can build a recommendation system using the techniques covered during the course.

In other words, we are interested in:

 * Analysing the collaborations between artists
 * Analysing song similarities between artists
 * Analysing the content of the song lyrics of different artists


## Data used

To achieve these goals, we first defined the data to use in our analysis.

As there are millions of data points about artists and music around the world, we narrowed our study down to the top 100 artists of 7 well-known music genres. The genres used throughout our analysis are the following:

* Pop
* Rock
* Hip-hop
* Blues
* Indie
* Country
* Soul

For each artist of each genre, we are also interested in obtaining:

* An extensive list of their songs
* The other artists they have collaborated with
* The musical characteristics of their top songs
* The lyrics of their top songs

To obtain all this information, we focused on the following tools.

### Spotify

Spotify is probably one of the best-known music streaming platforms in the world. It contains more than 40 million tracks that users can listen to through its multi-platform web and mobile applications. It also provides a *developer platform* that allows more advanced users to obtain data and extra features.

More concretely, we are interested in the Spotify Web API endpoints. Using them, we obtained the desired information about artists, albums, and tracks directly from the Spotify Data Catalogue. We were also interested in obtaining the computed audio features of tracks.

The endpoints used to download our data are the following:

* Search by genre
* Get an Artist's Top Tracks
* Get an Artist's Albums
* Get an Album's Tracks
* Get a Track

To download all the data needed, we used the existing spotipy library for Python. To use it, credentials first had to be created on the Spotify developer website and then defined in the code.
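
As an alternative to writing the credentials directly in the notebook, they could be read from environment variables. The sketch below is only an illustration: `load_spotify_credentials` is a hypothetical helper, and the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are assumed to be set beforehand (they are the names spotipy itself looks up when no credentials are passed).

```python
import os

def load_spotify_credentials(env=os.environ):
    """Read the client ID and secret from a mapping of environment variables."""
    client_id = env.get('SPOTIPY_CLIENT_ID')
    client_secret = env.get('SPOTIPY_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET first')
    return client_id, client_secret

# Example with a stand-in mapping instead of the real environment
client_id, client_secret = load_spotify_credentials(
    {'SPOTIPY_CLIENT_ID': 'my-id', 'SPOTIPY_CLIENT_SECRET': 'my-secret'})
print(client_id)
```

The returned pair can then be passed to `SpotifyClientCredentials` in the same way as the literal values.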

The installation and the code used to set it up are the following.

In [0]:
#Spotify library installation
!pip install spotipy



In [0]:
#Import needed libraries
import sys
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy
import pprint
import csv
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd

In [0]:
#Spotify authentication credentials
client_id = "b06999e849764d0cb9d85ff2b4762fd9"
client_secret = "4565b9044d694deb9cf54eddb9b08e69"

client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

First, the list of the top 100 artists for each genre was created. For this, the *Search by genre* endpoint was used with the response type set to artist. Spotify allows at most 50 objects per response, so we had to make the call twice, using the *limit* and *offset* parameters, to obtain the desired information.
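
The same idea extends to any number of results. As a minimal sketch, the *limit* and *offset* values needed to collect the top `n_items` results can be computed as follows; `page_params` is a hypothetical helper written for illustration, not part of spotipy.

```python
def page_params(n_items, page_size=50):
    """Return the (limit, offset) pairs needed to fetch n_items results."""
    params = []
    offset = 0
    while offset < n_items:
        # The last page may be smaller than page_size
        limit = min(page_size, n_items - offset)
        params.append((limit, offset))
        offset += limit
    return params

print(page_params(100))  # [(50, 0), (50, 50)]
```

For the top 100 artists this gives exactly the two calls described above.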

The downloaded list of artists was saved in the *artists.csv* file, and for each artist the following information was kept.

* Artist
* Genre
* Artist uri
* Artist id

In [0]:
#Download the top 100 artists for each genre defined
def downloadTopArtistsGenre(genres):
  #Open the output file; newline='' lets the csv module control line endings
  with open("artists.csv", "w", newline="", encoding="utf-8") as file:
    fileCsv = csv.writer(file)
    fileCsv.writerow(['Artist','Genre','Artist_uri','Artist_id'])
    for genre in genres:
      #Spotify returns at most 50 items per call, so two pages are requested
      for offset in (0, 50):
        result = sp.search(q='genre:'+genre, type='artist',limit=50,offset=offset)
        for artist in result['artists']['items']:
          #artist JSON dict like: https://developer.spotify.com/documentation/web-api/reference/artists/get-artist/
          fileCsv.writerow([artist['name'],genre,artist['uri'],artist['id']])

In [0]:
#Genres defined
genres = ['pop','rock','hip-hop','blues','indie','country','soul']
downloadTopArtistsGenre(genres)

Next, we retrieved the collaborations of each artist. To do so, we obtained the list of songs of each artist and checked whether each song has multiple artists.

As the Spotify platform does not provide the songs of an artist directly, we first obtained all the albums of each artist, again using the *limit* and *offset* parameters. Then, for each album retrieved, all its songs were downloaded and analyzed. A similar procedure was applied to the singles of each artist, to obtain as much collaboration information as possible, because we realized that some remixes were not included in any album. This time, we retrieved at most 50 singles per artist.
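
Retrieving all the albums of an artist means requesting pages until the reported total is reached. The sketch below illustrates this loop with a stand-in for the service; `fetch_all` and `fake_fetch` are hypothetical names, and the only assumption about the response is that it carries `items` and `total` fields, as Spotify paging objects do.

```python
def fetch_all(fetch_page, page_size=50):
    """Collect every item from a paged endpoint.

    fetch_page(limit, offset) must return a dict with 'items' and 'total'.
    """
    items = []
    offset = 0
    total = None
    while total is None or offset < total:
        page = fetch_page(limit=page_size, offset=offset)
        total = page['total']
        if not page['items']:
            break  # Avoid looping forever if an empty page is returned
        items.extend(page['items'])
        # Advance by the number of items actually received
        offset += len(page['items'])
    return items

# Stand-in for the real service: 120 fake albums served in pages
fake_albums = ['album_%d' % i for i in range(120)]

def fake_fetch(limit, offset):
    return {'items': fake_albums[offset:offset + limit], 'total': len(fake_albums)}

albums = fetch_all(fake_fetch)
print(len(albums))  # 120
```

With spotipy, `fetch_page` could be a small wrapper around `sp.artist_albums` that forwards `limit` and `offset`.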

Furthermore, we realised that we had to exclude the *compilation* albums and lists returned by the same service because, otherwise, the playlists created by Spotify were also retrieved and processed when creating the graph.

In this step, we decided not to restrict the collaborations to our list of artists, as the other collaborations may also provide useful information later in our analysis. That is why, each time more than one artist was found on a song, a new entry was added to the *artists_colab.csv* or *artists_colab_singles.csv* file in the following format, without any other filtering.

* Artist
* Artist id
* Artist colab.
* Artist colab. id
* Song
* Song id

In [0]:
#Download all collaborations of each artist based on their albums
def downloadColaborationsPerArtistsAlbums(artistsList):
  with open("artists_colab.csv", "w", newline="", encoding="utf-8") as file:
    fileCsv = csv.writer(file)
    fileCsv.writerow(['Artist', 'Artist_id', 'ArtistColab','ArtistColab_id', 'Song', 'Song_id'])
    for index, artist in artistsList.iterrows():
      total = 1
      offset = 0
      while offset < total:
        result = sp.artist_albums(artist['Artist_uri'], limit=50, album_type='album', offset=offset)
        total = result['total']
        for album in result['items']:
          #Spotify reports album types in lowercase
          if album['album_type'] != 'compilation':
            result1 = sp.album_tracks(album['uri'], limit=50)
            for song in result1['items']:
              for art in song['artists']:
                #Skip the artist itself; every other credited artist is a collaboration
                if artist['Artist'] != art['name']:
                  row = [artist['Artist'], artist['Artist_id'], art['name'], art['id'], song['name'], song['id']]
                  fileCsv.writerow(row)
        #Move to the next page of up to 50 albums
        offset += 50
  print("Done")

#Download all collaborations of each artist based on their singles
def downloadColaborationsPerArtistsSingles(artistsList):
  with open("artists_colab_singles.csv", "w", newline="", encoding="utf-8") as file:
    fileCsv = csv.writer(file)
    fileCsv.writerow(['Artist', 'Artist_id', 'ArtistColab','ArtistColab_id', 'Song', 'Song_id'])
    for index, artist in artistsList.iterrows():
      #Only the first page is requested: at most 50 singles per artist
      result = sp.artist_albums(artist['Artist_uri'], limit=50, album_type='single', offset=0)
      for album in result['items']:
        #Spotify reports album types in lowercase
        if album['album_type'] != 'compilation':
          result1 = sp.album_tracks(album['uri'], limit=50)
          for song in result1['items']:
            for art in song['artists']:
              #Skip the artist itself; every other credited artist is a collaboration
              if artist['Artist'] != art['name']:
                row = [artist['Artist'], artist['Artist_id'], art['name'], art['id'], song['name'], song['id']]
                fileCsv.writerow(row)
  print("Done")

In [0]:
#Load artists from the previously saved file
artistsList = pd.read_csv('./SGI_Sportify_Project/artists.csv', encoding='utf-8')
#Create artists_colab.csv file
downloadColaborationsPerArtistsAlbums(artistsList)
#Create artists_colab_singles.csv file
downloadColaborationsPerArtistsSingles(artistsList)
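
Since the files keep every collaboration, a later step can decide which ones to use. As a rough sketch under that assumption, the rows could be counted per pair of artists and optionally restricted to a set of artist IDs; `count_collaborations` is a hypothetical helper and the sample rows are made up, following the column layout of *artists_colab.csv*.

```python
import csv
import io
from collections import Counter

def count_collaborations(csv_file, allowed_ids=None):
    """Count songs per (Artist_id, ArtistColab_id) pair.

    If allowed_ids is given, keep only rows whose collaborator
    is also in that set (for example, the top-artist list).
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if allowed_ids is not None and row['ArtistColab_id'] not in allowed_ids:
            continue
        counts[(row['Artist_id'], row['ArtistColab_id'])] += 1
    return counts

# Made-up rows in the artists_colab.csv layout
sample = io.StringIO(
    "Artist,Artist_id,ArtistColab,ArtistColab_id,Song,Song_id\n"
    "A,a1,B,b1,Song 1,s1\n"
    "A,a1,B,b1,Song 2,s2\n"
    "A,a1,C,c1,Song 3,s3\n"
)
counts = count_collaborations(sample, allowed_ids={'a1', 'b1'})
print(counts)  # Counter({('a1', 'b1'): 2})
```

Passing `allowed_ids=None` keeps every pair, which matches the unfiltered files produced above.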


## Network analysis of artists


## Network analysis of songs

## Text analysis of Lyrics

## Conclusions and discussion