<div align="center"><h1>Model Preparation Notebook</h1></div>

This notebook contains the sequential steps to create the model from the raw dataset.

This notebook covers the following steps:

        1. Importing the required libraries
        2. Downloading the data
        3. Loading the data
        4. Data Visualization
        5. Preprocessing & EDA
        6. Preparing trainable data
        7. Creating the model
        8. Saving it to a .pkl file


Follow the steps.

# 1. Importing Libraries :

In [1]:
from google_drive_downloader import GoogleDriveDownloader as gdd
import pandas as pd

# 2. Downloading the data :

In [2]:

"""
Data Link : https://drive.google.com/file/d/1whUKZ-BB4-VKEanDRVteti5mawjnux5o/view?usp=sharing

The file id is the part of the link between /d/ and /view; we pass it to the downloader.

"""

gdd.download_file_from_google_drive(file_id='1whUKZ-BB4-VKEanDRVteti5mawjnux5o',
                                    dest_path='assets/spotify.csv')

Downloading 1whUKZ-BB4-VKEanDRVteti5mawjnux5o into assets/spotify.csv... Done.
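
The file id used above is the path segment that follows `/d/` in the share link. As a minimal sketch, it can be extracted programmatically instead of being copied by hand. The helper name `extract_drive_file_id` is hypothetical and not part of `google_drive_downloader`; it only assumes share links of the `/file/d/<id>/view` form.

```python
import re


def extract_drive_file_id(share_url: str) -> str:
    """Return the file id embedded in a Google Drive share link (hypothetical helper)."""
    # The id sits between "/d/" and the next "/" in links shaped like .../file/d/<id>/view
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"No file id found in {share_url!r}")
    return match.group(1)


share_url = "https://drive.google.com/file/d/1whUKZ-BB4-VKEanDRVteti5mawjnux5o/view?usp=sharing"
file_id = extract_drive_file_id(share_url)
print(file_id)
```

The resulting `file_id` can then be passed as the `file_id` argument of the download call shown above.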


# 3. Loading the Data :

In [3]:
df = pd.read_csv('assets/spotify.csv')

df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.95,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,1920,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,1920-01-01,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,2,1920-10-01,0.0768,122.076,0.299,1920


# 4. Data Visualization:

In [4]:
df.columns

Index(['acousticness', 'artists', 'danceability', 'duration_ms', 'energy',
       'explicit', 'id', 'instrumentalness', 'key', 'liveness', 'loudness',
       'mode', 'name', 'popularity', 'release_date', 'speechiness', 'tempo',
       'valence', 'year'],
      dtype='object')

### Before preprocessing, the most important task is to build an intuition for what each feature means.


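acousticness : Confidence (0.0 to 1.0) that the track is acoustic

danceability : How suitable the track is for dancing (0.0 to 1.0)

duration_ms : Track length in milliseconds

energy : Perceived intensity and activity of the track (0.0 to 1.0)

explicit : Whether the track has explicit lyrics (1 = yes, 0 = no)

instrumentalness : Likelihood that the track contains no vocals (0.0 to 1.0)

key : Musical key of the track as a pitch class (0 = C, 1 = C#/Db, ..., 11 = B)

liveness : Likelihood that the track was recorded in front of an audience (0.0 to 1.0)

loudness : Overall loudness of the track in decibels (dB)

mode : Modality of the track (1 = major, 0 = minor)

popularity : Popularity score of the track (0 to 100)

release_date : Date the track was released (only the year for some rows)

speechiness : Presence of spoken words in the track (0.0 to 1.0)

tempo : Estimated speed of the track in beats per minute (BPM)

valence : Musical positivity conveyed by the track (0.0 to 1.0)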
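
One quick way to test this intuition is to check that each feature stays inside the range we expect. The sketch below uses a toy frame built from three rows of the `df.head()` output. The ranges in `expected_ranges` are assumptions about the dataset (unit-interval features in [0, 1], `key` in 0..11, `mode` in {0, 1}), not something the notebook states.

```python
import pandas as pd

# Toy rows copied from the df.head() output; in the notebook this would be df itself.
sample = pd.DataFrame({
    "acousticness": [0.991, 0.643, 0.993],
    "danceability": [0.598, 0.852, 0.647],
    "energy": [0.224, 0.517, 0.186],
    "key": [5, 5, 0],
    "mode": [0, 0, 1],
})

# Assumed (inclusive) value ranges per column.
expected_ranges = {
    "acousticness": (0.0, 1.0),
    "danceability": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "key": (0, 11),
    "mode": (0, 1),
}

# Count, per column, how many values fall outside the assumed range.
out_of_range = {
    col: int((~sample[col].between(low, high)).sum())
    for col, (low, high) in expected_ranges.items()
}
print(out_of_range)
```

A non-zero count for any column would point to either a wrong assumption about the feature or bad rows worth inspecting.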

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174389 entries, 0 to 174388
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   acousticness      174389 non-null  float64
 1   artists           174389 non-null  object 
 2   danceability      174389 non-null  float64
 3   duration_ms       174389 non-null  int64  
 4   energy            174389 non-null  float64
 5   explicit          174389 non-null  int64  
 6   id                174389 non-null  object 
 7   instrumentalness  174389 non-null  float64
 8   key               174389 non-null  int64  
 9   liveness          174389 non-null  float64
 10  loudness          174389 non-null  float64
 11  mode              174389 non-null  int64  
 12  name              174389 non-null  object 
 13  popularity        174389 non-null  int64  
 14  release_date      174389 non-null  object 
 15  speechiness       174389 non-null  float64
 16  tempo             17

In [6]:
df.describe()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo,valence,year
count,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0
mean,0.499228,0.536758,232810.0,0.482721,0.068135,0.197252,5.205305,0.211123,-11.750865,0.702384,25.693381,0.105729,117.0065,0.524533,1977.061764
std,0.379936,0.176025,148395.8,0.272685,0.251978,0.334574,3.518292,0.180493,5.691591,0.457211,21.87274,0.18226,30.254178,0.264477,26.90795
min,0.0,0.0,4937.0,0.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0,0.0,1920.0
25%,0.0877,0.414,166133.0,0.249,0.0,0.0,2.0,0.0992,-14.908,0.0,1.0,0.0352,93.931,0.311,1955.0
50%,0.517,0.548,205787.0,0.465,0.0,0.000524,5.0,0.138,-10.836,1.0,25.0,0.0455,115.816,0.536,1977.0
75%,0.895,0.669,265720.0,0.711,0.0,0.252,8.0,0.27,-7.499,1.0,42.0,0.0763,135.011,0.743,1999.0
max,0.996,0.988,5338302.0,1.0,1.0,1.0,11.0,1.0,3.855,1.0,100.0,0.971,243.507,1.0,2021.0


# 5. Preprocessing & EDA:

In [7]:
# Check duplicate rows except first occurrence based on all columns
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
9525,0.567,['Neil Diamond'],0.515,180253,0.641,0,1BmVQ5RGqqtF5cnsv6cQYu,0.0642,5,0.322,-5.573,1,"Girl, You'll Be A Woman Soon",60,1968,0.0272,109.558,0.655,1968
9534,0.0271,['Neil Diamond'],0.56,163907,0.827,0,2SS3WeSe24ZqTlTSK4KzQZ,0.00285,8,0.0551,-4.157,1,Cherry Cherry,54,1968,0.0306,84.383,0.904,1968
16113,0.974,"['Johann Strauss II', 'Riccardo Muti', 'Wiener...",0.219,459053,0.0855,0,5zZbXSRIFe1uWNmEM7f2XI,0.922,0,0.355,-19.703,0,"Frühlingsstimmen, Walzer, Op. 410",34,2021-01-08,0.0404,171.849,0.156,2021
16663,0.355,"['Waylon Jennings', 'Willie Nelson']",0.626,184267,0.457,0,0sFq478LIo9BFwf2qzMzzF,9e-06,4,0.0668,-13.785,1,The Year 2003 Minus 25 - Remastered,43,1978-01-01,0.0384,102.166,0.474,1978
16669,0.202,['Ten Years After'],0.384,224133,0.516,0,19HjHUjCfDrEYhVSIKG6nK,0.18,9,0.114,-12.032,0,I'd Love to Change the World - 2004 Remaster,60,1971-11-11,0.0345,118.129,0.371,1971
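
`head()` only shows the first few duplicated rows. To count them, the boolean mask returned by `duplicated()` can be summed. The toy frame below is made up for illustration and only demonstrates how the `keep` argument changes what gets flagged.

```python
import pandas as pd

# Made-up toy data: "A" appears twice, "C" appears three times, "B" once.
toy = pd.DataFrame({
    "name": ["A", "A", "B", "C", "C", "C"],
    "year": [1920, 1920, 1921, 1930, 1930, 1930],
})

# Default keep="first": every repeat after the first occurrence is flagged.
n_repeats = int(toy.duplicated().sum())

# keep=False: every member of a duplicated group is flagged, first occurrence included.
n_in_duplicate_groups = int(toy.duplicated(keep=False).sum())

print(n_repeats, n_in_duplicate_groups)
```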


In [8]:
#drop column 'id'
df.drop(['id'],axis=1,inplace=True)

In [9]:
# Here we can see 2159 duplicate values across all columns.

# Drop duplicate rows. Note: keep=False removes every copy of a duplicated row,
# including the first occurrence.
df.drop_duplicates(keep=False,inplace=True)
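
The `keep` argument decides which copies survive. As a small illustration on made-up data, `keep="first"` leaves one copy of each duplicated row, while `keep=False` (the setting used in this notebook) removes all of them.

```python
import pandas as pd

# Made-up toy data: the first two rows are identical.
toy = pd.DataFrame({
    "name": ["A", "A", "B"],
    "year": [1920, 1920, 1921],
})

keep_first = toy.drop_duplicates(keep="first")  # one copy of each duplicated row survives
keep_none = toy.drop_duplicates(keep=False)     # every copy of a duplicated row is removed

print(len(keep_first), len(keep_none))
```

Which setting is appropriate depends on whether the duplicated rows are considered trustworthy; `keep=False` is the stricter choice and discards more data.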

In [10]:
df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.95,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,1920,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,1920-01-01,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,0.000246,10,0.402,-6.036,0,Xuniverxe,2,1920-10-01,0.0768,122.076,0.299,1920


In [12]:
df.nunique()

acousticness          4906
artists              35426
danceability          1230
duration_ms          55268
energy                2288
explicit                 2
instrumentalness      5400
key                     12
liveness              1740
loudness             25447
mode                     2
name                135031
popularity              98
release_date         11006
speechiness           1632
tempo                83692
valence               1707
year                   102
dtype: int64

In [13]:
# Check duplicate rows except first occurrence based on the 'name' column
duplicateRowsDF_name = df[df['name'].duplicated()]
duplicateRowsDF_name

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
133,0.700,['Greg Fieler'],0.197,224052,0.5110,0,0.000,4,0.0661,-11.444,0,I'll See You in C U B A,0,1920,0.0413,148.502,0.2670,1920
222,0.993,['Sergei Rachmaninoff'],0.389,218773,0.0880,0,0.527,1,0.3630,-21.091,0,"Morceaux de fantaisie, Op. 3: No. 2, Prélude i...",0,1921,0.0456,92.867,0.0731,1921
236,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berli...",0.279,831667,0.2110,0,0.878,10,0.6650,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",1,1921,0.0366,80.954,0.0594,1921
265,0.988,"['Gaetano Donizetti', 'Arturo Toscanini']",0.341,388867,0.1390,0,0.114,2,0.3040,-19.399,1,Overture,0,1921,0.0643,68.031,0.2870,1921
284,0.994,"['Sergei Rachmaninoff', 'Ruth Laredo']",0.248,117467,0.0876,0,0.907,5,0.1650,-25.786,1,"6 Songs, Op. 38: No. 3, Daisies (Version for P...",0,1921,0.0566,82.025,0.0770,1921
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174355,0.498,['ZAYN'],0.597,196493,0.3680,0,0.000,2,0.1090,-10.151,0,Connexion,52,2021-01-15,0.0936,171.980,0.5900,2021
174359,0.598,['Sean Paul'],0.735,211520,0.8460,0,0.000,10,0.1080,-3.110,0,Get Busy,1,2021-01-22,0.0355,100.197,0.7250,2021
174371,0.995,"['Ludovico Einaudi', 'Johannes Bornlöf']",0.343,206700,0.0165,0,0.878,9,0.0774,-30.915,0,Una Mattina,0,2021-01-23,0.0455,126.970,0.1510,2021
174375,0.988,"['Ludovico Einaudi', 'Johannes Bornlöf']",0.316,303333,0.0573,0,0.879,3,0.1200,-24.121,1,Night,0,2021-01-23,0.0515,81.070,0.0373,2021


In [14]:
# We can see duplicated track names here.

# Drop rows with duplicated names. Note: keep=False removes every row whose name
# appears more than once, including the first occurrence.
df.drop_duplicates(subset=['name'],keep=False,inplace=True)
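
Deduplicating on `name` alone also removes distinct tracks that merely share a title (for example, several works called "Overture"). The sketch below contrasts that with one possible alternative: treating the pair (`name`, `artists`) as the identity of a track. That alternative is an assumption for illustration, not what this notebook does, and the artist values `Artist A` and `Artist B` are hypothetical placeholders.

```python
import pandas as pd

# Toy data: two different "Overture" tracks, one repeated "Get Busy", one unique title.
toy = pd.DataFrame({
    "name": ["Overture", "Overture", "Get Busy", "Get Busy", "Xuniverxe"],
    "artists": ["['Artist A']", "['Artist B']", "['Sean Paul']", "['Sean Paul']", "['Mixe']"],
})

# Notebook's approach: any title that appears more than once is removed entirely.
by_name = toy.drop_duplicates(subset=["name"], keep=False)

# Alternative (assumption): a track is a duplicate only if name AND artists match.
by_name_and_artist = toy.drop_duplicates(subset=["name", "artists"], keep="first")

print(len(by_name), len(by_name_and_artist))
```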

In [15]:
# Verify that no duplicated names remain (the result should be empty)
duplicateRowsDF_name = df[df['name'].duplicated()]
duplicateRowsDF_name

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year


In [16]:
# Print the value counts of every column to inspect how its values are distributed
for column in df.columns:
  print(df[column].value_counts())


0.995000    2423
0.994000    1816
0.993000    1346
0.992000    1144
0.991000     995
            ... 
0.000088       1
0.000003       1
0.000030       1
0.000062       1
0.000843       1
Name: acousticness, Length: 4807, dtype: int64
['Tadeusz Dolega Mostowicz']                         1281
['Эрнест Хемингуэй']                                 1175
['Эрих Мария Ремарк']                                1062
['Francisco Canaro']                                  841
['Georgette Heyer', 'Irina Salkow']                   378
                                                     ... 
['Jascha Heifetz', 'Jack Benny', 'Emanuel Bay']         1
['Gabriel Slick']                                       1
['Maratone', 'Roxanne Emery', 'Mhammed El Alami']       1
['Josh Abbott Band', 'Pat Green']                       1
['Diana Ross', 'Marvin Gaye']                           1
Name: artists, Length: 30490, dtype: int64
0.7210    297
0.5650    295
0.7100    294
0.6310    280
0.7120    278
         ... 
0
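
Printing `value_counts()` for every column produces very long output for continuous features such as `tempo`. A possible refinement, sketched here on toy data, is to restrict the loop to columns with few distinct values. The cutoff `max_unique` is a hypothetical parameter chosen for this toy frame; on the full dataset a larger value (for example 12, the number of distinct `key` values) may be more suitable.

```python
import pandas as pd

# Toy rows with values taken from the df.head() output.
toy = pd.DataFrame({
    "explicit": [0, 0, 1, 0],
    "mode": [1, 0, 1, 1],
    "tempo": [149.976, 86.889, 97.6, 127.997],
})

max_unique = 3  # hypothetical cutoff for "categorical-like" columns

# Keep only the columns whose number of distinct values is at or below the cutoff.
low_cardinality = [col for col in toy.columns if toy[col].nunique() <= max_unique]

# Value counts for those columns only.
counts = {col: toy[col].value_counts().to_dict() for col in low_cardinality}
print(low_cardinality)
print(counts)
```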

In [17]:
# 'release_date' carries the same year information as 'year',
# so drop the 'release_date' column
df.drop(['release_date'],axis=1,inplace=True)
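
Before dropping a column as redundant, it can be worth confirming the redundancy. The sketch below assumes `release_date` is always either `YYYY` or `YYYY-MM-DD` (the two formats visible in the outputs above) and checks that its leading four characters agree with `year`. The toy values are taken from rows shown earlier.

```python
import pandas as pd

# Toy rows using release_date / year pairs that appear in the outputs above.
toy = pd.DataFrame({
    "release_date": ["1920", "1920-01-05", "1920-10-01", "2021-01-08"],
    "year": [1920, 1920, 1920, 2021],
})

# The first four characters hold the year in both assumed formats.
year_from_date = toy["release_date"].str[:4].astype(int)

# Number of rows where the two sources of the year disagree.
mismatches = int((year_from_date != toy["year"]).sum())
print(mismatches)
```

Note that dropping `release_date` does discard the month and day where they are present; only the year is preserved in `year`.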

In [18]:
df.describe()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo,valence,year
count,117842.0,117842.0,117842.0,117842.0,117842.0,117842.0,117842.0,117842.0,117842.0,117842.0,117842.0,117842.0,117842.0,117842.0,117842.0
mean,0.501678,0.537716,232498.7,0.481924,0.078402,0.198798,5.222136,0.220511,-11.897121,0.698444,24.310721,0.124958,116.926356,0.529467,1975.646671
std,0.382145,0.177186,158179.2,0.273626,0.268804,0.33445,3.522553,0.188696,5.776413,0.458936,21.754873,0.212975,30.350377,0.263907,27.273584
min,0.0,0.0,4937.0,0.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0,0.0,1920.0
25%,0.0868,0.414,163467.0,0.246,0.0,0.0,2.0,0.101,-15.144,0.0,0.0,0.0361,93.6035,0.32,1952.0
50%,0.519,0.55,203880.5,0.4615,0.0,0.000588,5.0,0.144,-10.9125,1.0,24.0,0.0476,115.8935,0.543,1976.0
75%,0.904,0.672,265931.8,0.71,0.0,0.266,8.0,0.285,-7.561,1.0,41.0,0.0868,135.01275,0.747,1998.0
max,0.996,0.988,5338302.0,1.0,1.0,1.0,11.0,1.0,3.855,1.0,100.0,0.971,243.507,1.0,2021.0


In [19]:
df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,name,popularity,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,0.0936,149.976,0.634,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,0.000246,10,0.402,-6.036,0,Xuniverxe,2,0.0768,122.076,0.299,1920
5,0.996,['Mamie Smith & Her Jazz Hounds'],0.424,198627,0.245,0,0.799,5,0.235,-11.47,1,Crazy Blues - 78rpm Version,9,0.0397,103.87,0.477,1920


In [20]:
df.shape

(117842, 17)

# 6. Prepare trainable data:

# 7. Create Model :

# 8. Save it to a `.pkl` file :
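
This step is not implemented yet. As a minimal sketch of what it could look like, the standard-library `pickle` module can write any picklable object to a `.pkl` file and read it back. The `model` dictionary below is only a placeholder standing in for a trained model, and the file name `model.pkl` and temporary directory are hypothetical choices.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder standing in for a trained model; any picklable object is handled the same way.
model = {"model_type": "placeholder", "features": ["danceability", "energy", "valence"]}

# Hypothetical output location: a fresh temporary directory.
model_path = Path(tempfile.mkdtemp()) / "model.pkl"

# Save the object to disk in binary mode.
with model_path.open("wb") as f:
    pickle.dump(model, f)

# Load it back to confirm the round trip.
with model_path.open("rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.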