<img src="./Images/spotifydatacleaning.jpg" width="500" height="100" align ="center">


#### The data is taken from [Kaggle](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks). The dataset was authored by Yamac Eren Ay and collected using the Spotify Web API.

### Exploring and Understanding Data to test our Hypothesis:






### Importing Dependencies

In [1]:
# Dependencies
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### Setting up Path and Loading Data file to a dataframe

In [2]:
file_path = "Resources/data.csv"

Spotify_data_df = pd.read_csv(file_path,low_memory=False)

### Display DataFrame

In [3]:
Spotify_data_df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.95,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,1920,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,1920-01-01,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,2,1920-10-01,0.0768,122.076,0.299,1920


### Displaying Total music releases by artists over 100 years

In [4]:
Rows = Spotify_data_df.shape[0]
Columns = Spotify_data_df.shape[1]
Total_before = pd.DataFrame({
                     " Total Rows": [Rows],
                     " Total Columns": [Columns]
                    })
Total_before

Unnamed: 0,Total Rows,Total Columns
0,174389,19


#### Note: The data set contains a total of 174,389 records and 19 columns (audio characteristics plus identifying fields such as id, name and artists).



### Understanding and Exploring Audio Characteristics:
##### Following are the features available to test

In [5]:
Spotify_data_df.columns

Index(['acousticness', 'artists', 'danceability', 'duration_ms', 'energy',
       'explicit', 'id', 'instrumentalness', 'key', 'liveness', 'loudness',
       'mode', 'name', 'popularity', 'release_date', 'speechiness', 'tempo',
       'valence', 'year'],
      dtype='object')

### Feature Description: Explains how the features are measured and what they mean
#### Primary:
    - id : Unique song ID

#### Numerical:
    - acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic; 1-High, 0-Low
    - danceability : How suitable a track is for dancing based on a combination of musical elements; 0-Least, 1-Most
    - energy : Measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy; 0-Low, 1-High
    - duration_ms : Duration of the song in milliseconds, typically around 200k to 300k
    - instrumentalness : Predicts whether a track contains no vocals ("ooh" and "aah" sounds are treated as instrumental); 1-No vocals, 0-Vocal
    - valence : Musical positiveness conveyed by a track.
            0-Negative (sad, depressed, angry),
            1-Positive (happy, cheerful, euphoric)
    - popularity : Based on the number of plays and downloads; 0-Less popular, 100-Very popular
    - tempo : Overall estimated tempo of a track in beats per minute (BPM); roughly 50-Low, 100-High
    - liveness : Detects the presence of an audience in the recording; 0-No, 1-Yes
    - loudness : Overall relative loudness of a track in decibels (dB); -60-Low, 0-High
    - speechiness : Presence of spoken words in a track.
              Values > 0.66 describe tracks made up almost entirely of spoken words.
              Values between 0.33 and 0.66 describe tracks that may contain both music and speech, like rap music.
              Values < 0.33 describe mostly music and other non-speech-like tracks.
    - year : The year the music was recorded and released (1920 to 2021 in this data set)

#### Dummy:
    - mode (0 = Minor, 1 = Major)
    - explicit (0 = No explicit content, 1 = Explicit content)

#### Categorical:
    - key : All keys of the octave encoded as values ranging from 0 to 11, starting with C as 0, C# as 1, and so on
    - artists : Song artist(s)
    - release_date : Date when the album was released (year only for some tracks)
    - name : Song name
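The speechiness thresholds described above can be turned into labels. This is a minimal sketch, assuming a few hypothetical sample values in place of the real `speechiness` column; the label names are illustrative and are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample values; the real column is Spotify_data_df["speechiness"].
speechiness = pd.Series([0.0936, 0.0534, 0.45, 0.971], name="speechiness")

# Bin edges follow the thresholds in the feature description (0.33 and 0.66).
# The label names are illustrative assumptions.
speech_class = pd.cut(
    speechiness,
    bins=[0.0, 0.33, 0.66, 1.0],
    labels=["mostly music", "music and speech", "mostly speech"],
    include_lowest=True,  # keep tracks whose speechiness is exactly 0.0
)

# Count how many tracks fall into each bucket, in bin order.
print(speech_class.value_counts(sort=False))
```

Note that `pd.cut` closes each bin on the right by default, so a value of exactly 0.33 lands in the first bucket.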


###  Interesting insights while exploring the data 

#### Some interesting caveats:
    -	This data is simply a sample of the tracks released in those 100 years, not a complete set.
    -	According to the Spotify developers site, popularity is calculated by an algorithm and is based on the
         total number of downloads and plays the track has had and how recent those plays are. While this is accurate for
         newer tracks, it could introduce a bias against older tracks.
    -   Valence describes the mood conveyed by a track. It would be interesting to check whether it tracks
         loudness, i.e. whether the louder the song, the happier and more cheerful it is.
    -   Not all the features are measured in the range 0-1, e.g. to plot the trend of loudness against other features
         over time, we would have to normalize the units to a common scale.
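One way to bring loudness and popularity onto the 0-1 scale mentioned in the last caveat is min-max scaling against their documented ranges. The sketch below assumes the ranges from the feature description (-60 to 0 dB and 0 to 100); `min_max_scale` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows standing in for the full dataframe.
sample = pd.DataFrame({
    "loudness": [-12.628, -7.261, -60.0],
    "popularity": [12, 7, 100],
})

def min_max_scale(values, low, high):
    """Map values from the interval [low, high] onto [0, 1]."""
    return (values - low) / (high - low)

# Assumed ranges: loudness in [-60, 0] dB, popularity in [0, 100].
scaled = pd.DataFrame({
    "loudness_scaled": min_max_scale(sample["loudness"], -60.0, 0.0),
    "popularity_scaled": min_max_scale(sample["popularity"], 0.0, 100.0),
})
print(scaled)
```

Scaling against fixed, documented ranges keeps the result comparable across subsets of the data; scaling against the observed minimum and maximum is an alternative when no documented range exists.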

## Keeping features that are of interest to our analysis

#### -  Filtering out the columns 'duration_ms', 'explicit', 'id', 'key', 'liveness', 'mode', 'name', 'tempo' from the data set.
      
#### - Moving 'year' to the front, ahead of 'artists', 'release_date' and 'acousticness'.
      

In [6]:
Spotify_selected_features = Spotify_data_df[['year', 'artists', 'release_date', 'acousticness',
                                   'danceability', 'energy', 'instrumentalness',
                                   'loudness', 'popularity', 'speechiness', 'valence']]
# Display the selected features (not yet cleaned)
Spotify_selected_features

Unnamed: 0,year,artists,release_date,acousticness,danceability,energy,instrumentalness,loudness,popularity,speechiness,valence
0,1920,['Mamie Smith'],1920,0.991000,0.598,0.224,0.000522,-12.628,12,0.0936,0.6340
1,1920,"[""Screamin' Jay Hawkins""]",1920-01-05,0.643000,0.852,0.517,0.026400,-7.261,7,0.0534,0.9500
2,1920,['Mamie Smith'],1920,0.993000,0.647,0.186,0.000018,-12.098,4,0.1740,0.6890
3,1920,['Oscar Velazquez'],1920-01-01,0.000173,0.730,0.798,0.801000,-7.311,17,0.0425,0.0422
4,1920,['Mixe'],1920-10-01,0.295000,0.704,0.707,0.000246,-6.036,2,0.0768,0.2990
...,...,...,...,...,...,...,...,...,...,...,...
174384,2020,"['DJ Combo', 'Sander-7', 'Tony T']",2020-12-25,0.009170,0.792,0.866,0.000060,-5.089,0,0.0356,0.1860
174385,2021,['Alessia Cara'],2021-01-22,0.795000,0.429,0.211,0.000000,-11.665,0,0.0360,0.2280
174386,2020,['Roger Fly'],2020-12-09,0.806000,0.671,0.589,0.920000,-12.393,0,0.0282,0.7140
174387,2021,['Taylor Swift'],2021-01-07,0.920000,0.462,0.240,0.000000,-12.077,69,0.0377,0.3200


#### Note: Here we can see that the data needs further cleanup
            - Remove special characters from the artists column.
            - We may also need to parse data types and
            - drop null values.

### Removing special characters from the column 'artists' using the lstrip, rstrip and str.replace functions

In [7]:
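# Work on a copy so the selected-features dataframe is left untouched
Spotify_Clean_artist = Spotify_selected_features.copy()
# lstrip/rstrip treat their argument as a set of characters: strip leading "[" and "'" and trailing "'" and "]"
Spotify_Clean_artist["artists"] = Spotify_selected_features["artists"].map(lambda x: x.lstrip("['").rstrip("']")).astype(str)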


In [8]:
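# Remove the remaining single and double quotes (literal match, not a regex)
Spotify_Clean_artist["artists"] = Spotify_Clean_artist["artists"].str.replace("'", "", regex=False).str.replace('"', "", regex=False)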

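#### Note: Since the special characters also included single and double quotes, multiple calls to the str.replace function were needed to remove them. This also drops apostrophes that belong to a name, e.g. Screamin' Jay Hawkins becomes Screamin Jay Hawkins.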
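An alternative that keeps apostrophes inside names is to parse each value as a Python list literal instead of stripping characters. This is a sketch under the assumption that every `artists` value is a valid list literal, as in the rows displayed above; it is not the approach used in this notebook.

```python
import ast

import pandas as pd

# Sample values copied from the rows displayed above.
raw_artists = pd.Series([
    "['Mamie Smith']",
    "[\"Screamin' Jay Hawkins\"]",
    "['DJ Combo', 'Sander-7', 'Tony T']",
])

# ast.literal_eval safely evaluates the string into a real Python list.
parsed_artists = raw_artists.map(ast.literal_eval)

# Join multiple artists into one comma-separated string per track.
joined_artists = parsed_artists.map(", ".join)
print(joined_artists.tolist())
```

`ast.literal_eval` raises an error on malformed values, so a real run would need to decide how to handle any rows that do not parse.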

### Display dataframe with cleaned artist

In [9]:
Spotify_Clean_artist.head()

Unnamed: 0,year,artists,release_date,acousticness,danceability,energy,instrumentalness,loudness,popularity,speechiness,valence
0,1920,Mamie Smith,1920,0.991,0.598,0.224,0.000522,-12.628,12,0.0936,0.634
1,1920,Screamin Jay Hawkins,1920-01-05,0.643,0.852,0.517,0.0264,-7.261,7,0.0534,0.95
2,1920,Mamie Smith,1920,0.993,0.647,0.186,1.8e-05,-12.098,4,0.174,0.689
3,1920,Oscar Velazquez,1920-01-01,0.000173,0.73,0.798,0.801,-7.311,17,0.0425,0.0422
4,1920,Mixe,1920-10-01,0.295,0.704,0.707,0.000246,-6.036,2,0.0768,0.299


### Accounting for Null values if any

In [10]:
#Drop Null values in the dataframe if any
Spotify_clean_df =Spotify_Clean_artist.dropna(how='any')
Spotify_clean_df.shape[0]

174389

#### Note: No rows were dropped, so the dataframe has no null values.
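Comparing row counts confirms that nothing was dropped; counting nulls per column is a complementary check that also shows where any nulls sit. The sketch below uses a small hypothetical dataframe with one deliberately missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing loudness value.
sample = pd.DataFrame({
    "year": [1920, 1921, 1922],
    "loudness": [-12.628, np.nan, -7.261],
})

# Number of null values in each column.
null_counts = sample.isna().sum()
print(null_counts)

# Rows removed by dropna(how="any").
rows_before = len(sample)
rows_after = len(sample.dropna(how="any"))
print(rows_before - rows_after)
```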

### Analyze datatypes of each column

In [11]:
Spotify_clean_df.dtypes

year                  int64
artists              object
release_date         object
acousticness        float64
danceability        float64
energy              float64
instrumentalness    float64
loudness            float64
popularity            int64
speechiness         float64
valence             float64
dtype: object

#### All the features of interest have the expected dtypes, so no column needs a datatype conversion.

### Statistical Analysis of the cleaned dataframe
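The `release_date` column stays as `object` because, in the rows shown earlier, it mixes year-only strings with full dates. A quick consistency check against `year` can be sketched as follows, assuming every value starts with a four-digit year; the sample rows are copied from the displayed output.

```python
import pandas as pd

# Sample rows copied from the displayed output.
sample = pd.DataFrame({
    "year": [1920, 1920, 2021],
    "release_date": ["1920", "1920-01-05", "2021-01-22"],
})

# Assumption: the first four characters of release_date are the year.
release_year = sample["release_date"].str.slice(0, 4).astype(int)

# True where the year column agrees with release_date.
year_matches = release_year.eq(sample["year"])

# True where release_date holds a full YYYY-MM-DD date (10 characters).
has_full_date = sample["release_date"].str.len().eq(10)

print(year_matches.all(), has_full_date.sum())
```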

In [12]:
Spotify_clean_df.describe()

Unnamed: 0,year,acousticness,danceability,energy,instrumentalness,loudness,popularity,speechiness,valence
count,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0
mean,1977.061764,0.499228,0.536758,0.482721,0.197252,-11.750865,25.693381,0.105729,0.524533
std,26.90795,0.379936,0.176025,0.272685,0.334574,5.691591,21.87274,0.18226,0.264477
min,1920.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0
25%,1955.0,0.0877,0.414,0.249,0.0,-14.908,1.0,0.0352,0.311
50%,1977.0,0.517,0.548,0.465,0.000524,-10.836,25.0,0.0455,0.536
75%,1999.0,0.895,0.669,0.711,0.252,-7.499,42.0,0.0763,0.743
max,2021.0,0.996,0.988,1.0,1.0,3.855,100.0,0.971,1.0


#### Provides basic summary statistics for the numeric columns of the dataframe
                       
#### Interesting features:
             No. of releases : 174,389
             Start year : 1920
             End year : 2021
             Possible outliers : instrumentalness is strongly skewed (median 0.000524 vs. mean 0.197), and speechiness to a lesser degree (median 0.0455 vs. mean 0.106); the other columns appear to be fairly evenly spread.
             
             
#### Note: From the table, we can see that there are records with loudness greater than zero (max 3.855). According to the Spotify developers site, the range is between -60 and 0 dB. This could be an error, so we need to remove these records.

In [13]:
Spotify_clean_df = Spotify_clean_df[Spotify_clean_df["loudness"]<0]
loudness_after = pd.DataFrame({
                        "max_loudness":[Spotify_clean_df["loudness"].max()],
                        "min_loudness":[Spotify_clean_df["loudness"].min()]
                        })
loudness_after

Unnamed: 0,max_loudness,min_loudness
0,-0.007,-60.0
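Before dropping rows it can help to count how many fall outside the documented range. This sketch uses `Series.between`, which is inclusive on both ends by default, so a value of exactly 0 dB is kept, unlike the strict `< 0` filter above. The sample values are hypothetical apart from 3.855, the maximum reported by `describe()`.

```python
import pandas as pd

# Hypothetical sample; 3.855 is the maximum reported by describe().
loudness = pd.Series([-12.628, -60.0, 0.0, 3.855], name="loudness")

# True where loudness lies inside the documented range [-60, 0] dB.
in_range = loudness.between(-60.0, 0.0)

# Number of records that would be removed.
out_of_range_count = int((~in_range).sum())

# Keep only the in-range records.
kept = loudness[in_range]
print(out_of_range_count, kept.max())
```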


In [14]:
Spotify_clean_df.head(3)

Unnamed: 0,year,artists,release_date,acousticness,danceability,energy,instrumentalness,loudness,popularity,speechiness,valence
0,1920,Mamie Smith,1920,0.991,0.598,0.224,0.000522,-12.628,12,0.0936,0.634
1,1920,Screamin Jay Hawkins,1920-01-05,0.643,0.852,0.517,0.0264,-7.261,7,0.0534,0.95
2,1920,Mamie Smith,1920,0.993,0.647,0.186,1.8e-05,-12.098,4,0.174,0.689


In [15]:
Rows = Spotify_clean_df.shape[0]
Columns = Spotify_clean_df.shape[1]
Total_after = pd.DataFrame({
                     " Total Rows": [Rows],
                     " Total Columns": [Columns]
                    })
Total_after

Unnamed: 0,Total Rows,Total Columns
0,174354,11


### Saving File for further Analysis
#### Saving file in a .csv format

In [16]:
out_path = "Output_data/spotify_clean.csv"
Spotify_clean_df.to_csv(out_path , index=False , encoding="utf-8")
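`to_csv` does not create missing folders, so the save fails if `Output_data` does not exist yet. A minimal sketch of creating the folder first is shown below; it writes to a temporary directory, and `base_dir` and the sample dataframe are hypothetical stand-ins for the project folder and `Spotify_clean_df`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for Spotify_clean_df.
sample = pd.DataFrame({"year": [1920], "artists": ["Mamie Smith"]})

# Temporary directory used here in place of the project folder.
base_dir = Path(tempfile.mkdtemp())
out_path = base_dir / "Output_data" / "spotify_clean.csv"

# Create the output folder (and any parents) if it is missing.
out_path.parent.mkdir(parents=True, exist_ok=True)

sample.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back to confirm it was written as expected.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```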

### Challenges during data exploration/manipulation 
         
   #### 1. SettingWithCopyError:
            - This error was raised by chained indexing, i.e. chaining two indexing
              methods together while trying to set a value.
   ##### The solution is simple: 
            1. Make a copy of the dataframe and assign values to the copy rather than to a slice of the original.
            2. Replace the chained indexing with a single call to the .loc/.iloc methods in pandas. 
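Both fixes can be sketched on a small hypothetical dataframe. The capping rule used to demonstrate `.loc` is purely illustrative; it is not the cleanup applied in this notebook, which drops such rows.

```python
import pandas as pd

# Hypothetical sample with one out-of-range loudness value.
df = pd.DataFrame({
    "artists": ["['Mamie Smith']", "['Mixe']"],
    "loudness": [-12.628, 3.855],
    "year": [1920, 1920],
})

# Fix 1: take an explicit copy before assigning to a column selection.
selected = df[["artists", "loudness"]].copy()
selected["artists"] = selected["artists"].str.strip("[]'")

# Fix 2: one .loc call (row mask and column together) instead of the
# chained form selected[selected["loudness"] > 0]["loudness"] = 0.0
selected.loc[selected["loudness"] > 0, "loudness"] = 0.0

print(selected)
```

Because `selected` is an independent copy, the original `df` keeps its values.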

  #### 2. Different scales: We can see that loudness and popularity are measured on different ranges. We may have to normalize the data to a common scale or plot them separately to avoid losing detail. Here is an interesting article:
  [Normalization and standardization to fit the scale](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/) 
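Standardization (z-scores) is one of the techniques the linked article covers. A minimal sketch on the five sample rows displayed earlier is shown below; the use of the population standard deviation (`ddof=0`) is an assumption, not a choice made in this notebook.

```python
import pandas as pd

# Loudness and popularity of the first five rows displayed earlier.
sample = pd.DataFrame({
    "loudness": [-12.628, -7.261, -12.098, -7.311, -6.036],
    "popularity": [12, 7, 4, 17, 2],
})

# Subtract each column's mean and divide by its standard deviation,
# so both columns end up with mean 0 and standard deviation 1.
standardized = (sample - sample.mean()) / sample.std(ddof=0)
print(standardized.round(3))
```

After this transformation both features are unitless and can share one axis, at the cost of no longer being readable in dB or popularity points.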