# Data Enriching

## 1. Importing file

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('musicians.csv', encoding = 'Windows-1252')
df.head()

Unnamed: 0.1,Unnamed: 0,RANK,BRAND,CATEGORIES,SUBCATEGORIES,FOLLOWERS,ER,iPOSTS ON HASHTAG,MEDIA POSTED
0,0,1,Selena Gomez,celebrities,musicians,105.4M,2.62%,14.5M,1.2k
1,1,2,Taylor Swift,celebrities,musicians,95.2M,1.96%,10.5M,958
2,2,3,Ariana Grande,celebrities,musicians,92.3M,1.43%,16.9M,2.8k
3,3,4,Beyonce,celebrities,musicians,90.6M,2.53%,9.2M,1.4k
4,7,8,Justin Bieber,celebrities,musicians,77.9M,1.66%,26.2M,3.8k


### Dropping 'Unnamed' column

In [3]:
df.columns.values

array(['Unnamed: 0', 'RANK', 'BRAND', 'CATEGORIES', 'SUBCATEGORIES',
       'FOLLOWERS', 'ER', 'iPOSTS ON HASHTAG', 'MEDIA POSTED'],
      dtype=object)

In [4]:
df.drop('Unnamed: 0', axis=1, inplace=True)

In [5]:
df.head()

Unnamed: 0,RANK,BRAND,CATEGORIES,SUBCATEGORIES,FOLLOWERS,ER,iPOSTS ON HASHTAG,MEDIA POSTED
0,1,Selena Gomez,celebrities,musicians,105.4M,2.62%,14.5M,1.2k
1,2,Taylor Swift,celebrities,musicians,95.2M,1.96%,10.5M,958
2,3,Ariana Grande,celebrities,musicians,92.3M,1.43%,16.9M,2.8k
3,4,Beyonce,celebrities,musicians,90.6M,2.53%,9.2M,1.4k
4,8,Justin Bieber,celebrities,musicians,77.9M,1.66%,26.2M,3.8k


In [6]:
df.shape

(30, 8)

In [7]:
df.isnull().sum().sort_values(ascending=False)

MEDIA POSTED         0
iPOSTS ON HASHTAG    0
ER                   0
FOLLOWERS            0
SUBCATEGORIES        0
CATEGORIES           0
BRAND                0
RANK                 0
dtype: int64

### Creating a new column: 'SPOTIFY NAMES'

Since the musicians' Instagram usernames don't always match their usernames on Spotify, let's create another column.

In [8]:
df['BRAND']

0          Selena Gomez
1          Taylor Swift
2         Ariana Grande
3               Beyonce
4         Justin Bieber
5           Nicki Minaj
6            KATY PERRY
7           Miley Cyrus
8        Jennifer Lopez
9           Demi Lovato
10           badgalriri
11    Justin Timberlake
12              Zendaya
13        champagnepapi
14              Shakira
15         xoxo, Joanne
16          harrystyles
17          Niall Horan
18        One Direction
19         Shawn Mendes
20             euanitta
21                Ciara
22       Britney Spears
23            snoopdogg
24         Daddy Yankee
25        elliegoulding
26              50 Cent
27         Luan Santana
28      Louis Tomlinson
29        Tyga / T-Raww
Name: BRAND, dtype: object

In [9]:
df['SPOTIFY NAMES'] = df['BRAND']

In [10]:
df['SPOTIFY NAMES'].replace('KATY PERRY', 'Katy Perry', inplace=True)
df['SPOTIFY NAMES'].replace('badgalriri', 'Rihanna', inplace=True)
df['SPOTIFY NAMES'].replace('champagnepapi', 'Drake', inplace=True)
df['SPOTIFY NAMES'].replace('xoxo, Joanne', 'Lady Gaga', inplace=True)
df['SPOTIFY NAMES'].replace('harrystyles', 'Harry Styles', inplace=True)
df['SPOTIFY NAMES'].replace('euanitta', 'Anitta', inplace=True)
df['SPOTIFY NAMES'].replace('snoopdogg', 'Snoop Dogg', inplace=True)
df['SPOTIFY NAMES'].replace('elliegoulding', 'Ellie Goulding', inplace=True)
df['SPOTIFY NAMES'].replace('Tyga / T-Raww', 'Tyga', inplace=True)

In [11]:
df.head()

Unnamed: 0,RANK,BRAND,CATEGORIES,SUBCATEGORIES,FOLLOWERS,ER,iPOSTS ON HASHTAG,MEDIA POSTED,SPOTIFY NAMES
0,1,Selena Gomez,celebrities,musicians,105.4M,2.62%,14.5M,1.2k,Selena Gomez
1,2,Taylor Swift,celebrities,musicians,95.2M,1.96%,10.5M,958,Taylor Swift
2,3,Ariana Grande,celebrities,musicians,92.3M,1.43%,16.9M,2.8k,Ariana Grande
3,4,Beyonce,celebrities,musicians,90.6M,2.53%,9.2M,1.4k,Beyonce
4,8,Justin Bieber,celebrities,musicians,77.9M,1.66%,26.2M,3.8k,Justin Bieber


## 2. Setting up the API

- In order to find information about the most followed musicians on Instagram, we will use the **spotify API**, more specifically, its library called **spotipy**.
- To begin with, we have to install it:

### Installing

In [12]:
!pip3 install spotipy



### Importing the libraries

In [13]:
from spotipy.oauth2 import SpotifyClientCredentials
import sys
import pprint
import spotipy
import spotipy.util as util

from dotenv import load_dotenv
import os
import re
import requests

### Authentification

In [14]:
load_dotenv()

True

In [15]:
client_id = os.getenv("CLIENT_ID")
client_secret = os.getenv("CLIENT_SECRET")

In [16]:
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(client_id=client_id,
                                                                client_secret=client_secret))

- Now that we have installed **spotipy**, we want to extract information to enrich our dataset.
- We will start by finding each musicians' usernames on Spotify and then their followers.

## 3. Finding SPOTIFY FOLLOWERS

#### Let's create a list with every musicians' username on Spotify

In [17]:
musicians = list(df['SPOTIFY NAMES'])
musicians

['Selena Gomez',
 'Taylor Swift',
 'Ariana Grande',
 'Beyonce',
 'Justin Bieber',
 'Nicki Minaj',
 'Katy Perry',
 'Miley Cyrus',
 'Jennifer Lopez',
 'Demi Lovato',
 'Rihanna',
 'Justin Timberlake',
 'Zendaya',
 'Drake',
 'Shakira',
 'Lady Gaga',
 'Harry Styles',
 'Niall Horan',
 'One Direction',
 'Shawn Mendes',
 'Anitta',
 'Ciara',
 'Britney Spears',
 'Snoop Dogg',
 'Daddy Yankee',
 'Ellie Goulding',
 '50 Cent',
 'Luan Santana',
 'Louis Tomlinson',
 'Tyga']

### FOR loop

We are using Spotipy's endpoint **search** which will allow us to find what we are looking for, in this case, each artists' **total followers**.

In [18]:
spotify_followers = []

for x in musicians:
    track_id = sp.search(q='artist:' + x, type='artist')
    followers = track_id['artists']['items'][0]['followers']['total']
    spotify_followers.append(followers)
    
print(spotify_followers)

[23253551, 33421356, 52450478, 24925121, 39129410, 19126039, 16570488, 13037017, 8861658, 18329116, 39702807, 9838519, 2765611, 50541059, 19977298, 14394953, 10299531, 6156633, 18387241, 29701495, 9891719, 4600733, 7963498, 6300860, 20614030, 8593907, 6441780, 4782196, 4037381, 5975449]


Let's add the information to a new column in the dataframe.

In [19]:
df['SPOTIFY FOLLOWERS'] = spotify_followers

In [20]:
df.head()

Unnamed: 0,RANK,BRAND,CATEGORIES,SUBCATEGORIES,FOLLOWERS,ER,iPOSTS ON HASHTAG,MEDIA POSTED,SPOTIFY NAMES,SPOTIFY FOLLOWERS
0,1,Selena Gomez,celebrities,musicians,105.4M,2.62%,14.5M,1.2k,Selena Gomez,23253551
1,2,Taylor Swift,celebrities,musicians,95.2M,1.96%,10.5M,958,Taylor Swift,33421356
2,3,Ariana Grande,celebrities,musicians,92.3M,1.43%,16.9M,2.8k,Ariana Grande,52450478
3,4,Beyonce,celebrities,musicians,90.6M,2.53%,9.2M,1.4k,Beyonce,24925121
4,8,Justin Bieber,celebrities,musicians,77.9M,1.66%,26.2M,3.8k,Justin Bieber,39129410


## 4. Finding PHOTO

Now, let's use **search** endpoint to find each musicians' **profile photo**.

In [21]:
spotify_photo = []

for x in musicians:
    track_id = sp.search(q='artist:' + x, type='artist')
    photo = track_id['artists']['items'][0]['images'][0]['url']
    spotify_photo.append(photo)
    
print(spotify_photo)

['https://i.scdn.co/image/61fc36467d4945514cdf052ed49286aa8ea3b4fc', 'https://i.scdn.co/image/a37efbc7fd3f5f5df81b48ce9c6de53820b239c1', 'https://i.scdn.co/image/f8c793519d837ca2f920c561535fe62ef32e8e5b', 'https://i.scdn.co/image/ad8b0e5a18a5a443a2678768bd73f59833941abc', 'https://i.scdn.co/image/3ff69320dd62625e0e24737c1965027695369a31', 'https://i.scdn.co/image/2a832cd2b8dd5d0deef6d682ca52ea29f5dda859', 'https://i.scdn.co/image/ecebace064c7a48b7ae4a611b82887aa79163c1e', 'https://i.scdn.co/image/f040656afb8bc16cd44781dd39ea3d02184f5c0f', 'https://i.scdn.co/image/a23a2778e5787da9188cdaa959cedee9391ae4d2', 'https://i.scdn.co/image/f188b71ba0f97ef22c35c820f0a67084cffd24ec', 'https://i.scdn.co/image/1fc2f537d678d701d7d143a8fd4f0c2f29fbde22', 'https://i.scdn.co/image/5b73ff32952c810c98bd5fbe8860ca99414ad6aa', 'https://i.scdn.co/image/fba29219867535aaed72b65c2cb363040cb98103', 'https://i.scdn.co/image/60cfab40c6bb160a1906be45276829d430058005', 'https://i.scdn.co/image/93e6b100a00437a05f57aa

In [22]:
df['SPOTIFY PHOTO'] = spotify_photo

In [23]:
df.head()

Unnamed: 0,RANK,BRAND,CATEGORIES,SUBCATEGORIES,FOLLOWERS,ER,iPOSTS ON HASHTAG,MEDIA POSTED,SPOTIFY NAMES,SPOTIFY FOLLOWERS,SPOTIFY PHOTO
0,1,Selena Gomez,celebrities,musicians,105.4M,2.62%,14.5M,1.2k,Selena Gomez,23253551,https://i.scdn.co/image/61fc36467d4945514cdf05...
1,2,Taylor Swift,celebrities,musicians,95.2M,1.96%,10.5M,958,Taylor Swift,33421356,https://i.scdn.co/image/a37efbc7fd3f5f5df81b48...
2,3,Ariana Grande,celebrities,musicians,92.3M,1.43%,16.9M,2.8k,Ariana Grande,52450478,https://i.scdn.co/image/f8c793519d837ca2f920c5...
3,4,Beyonce,celebrities,musicians,90.6M,2.53%,9.2M,1.4k,Beyonce,24925121,https://i.scdn.co/image/ad8b0e5a18a5a443a26787...
4,8,Justin Bieber,celebrities,musicians,77.9M,1.66%,26.2M,3.8k,Justin Bieber,39129410,https://i.scdn.co/image/3ff69320dd62625e0e2473...


## 5. Editing df

In [24]:
df.dtypes

RANK                  int64
BRAND                object
CATEGORIES           object
SUBCATEGORIES        object
FOLLOWERS            object
ER                   object
iPOSTS ON HASHTAG    object
MEDIA POSTED         object
SPOTIFY NAMES        object
SPOTIFY FOLLOWERS     int64
SPOTIFY PHOTO        object
dtype: object

Since the column "FOLLOWERS" is of type *object*, and we want to be able to plot it later on in a nice bar chart, let's convert it into a numeric variable.

In [25]:
df["FOLLOWERS"].value_counts()

12.9M     2
14.6M     1
31.1M     1
14M       1
57.8M     1
39.6M     1
92.3M     1
55.1M     1
16.9M     1
36M       1
20.4M     1
51.3M     1
11.8M     1
11.6M     1
17.2M     1
105.4M    1
90.6M     1
70.8M     1
13.3M     1
12.2M     1
95.2M     1
18.3M     1
47.3M     1
59.6M     1
28.4M     1
14.5M     1
77.9M     1
19.4M     1
16.6M     1
Name: FOLLOWERS, dtype: int64

In [26]:
df.replace('M','',regex=True, inplace=True)

In [27]:
df["FOLLOWERS"] = df["FOLLOWERS"].astype(float)

In [28]:
df.dtypes

RANK                   int64
BRAND                 object
CATEGORIES            object
SUBCATEGORIES         object
FOLLOWERS            float64
ER                    object
iPOSTS ON HASHTAG     object
MEDIA POSTED          object
SPOTIFY NAMES         object
SPOTIFY FOLLOWERS      int64
SPOTIFY PHOTO         object
dtype: object

Let's change the name of the column to make sure that the data is in *Millions* of followers.

In [29]:
df.rename(columns={"FOLLOWERS": "FOLLOWERS (M)"}, inplace=True)

In [30]:
df.head()

Unnamed: 0,RANK,BRAND,CATEGORIES,SUBCATEGORIES,FOLLOWERS (M),ER,iPOSTS ON HASHTAG,MEDIA POSTED,SPOTIFY NAMES,SPOTIFY FOLLOWERS,SPOTIFY PHOTO
0,1,Selena Gomez,celebrities,musicians,105.4,2.62%,14.5,1.2k,Selena Gomez,23253551,https://i.scdn.co/image/61fc36467d4945514cdf05...
1,2,Taylor Swift,celebrities,musicians,95.2,1.96%,10.5,958,Taylor Swift,33421356,https://i.scdn.co/image/a37efbc7fd3f5f5df81b48...
2,3,Ariana Grande,celebrities,musicians,92.3,1.43%,16.9,2.8k,Ariana Grande,52450478,https://i.scdn.co/image/f8c793519d837ca2f920c5...
3,4,Beyonce,celebrities,musicians,90.6,2.53%,9.2,1.4k,Beyonce,24925121,https://i.scdn.co/image/ad8b0e5a18a5a443a26787...
4,8,Justin Bieber,celebrities,musicians,77.9,1.66%,26.2,3.8k,Justin Bieber,39129410,https://i.scdn.co/image/3ff69320dd62625e0e2473...


## 6. Saving enriched dataset

In [31]:
df.to_csv('musicians_enriched.csv')