# song_extra_info.csv
- song_id
- name - the name of the song.
- isrc - International Standard Recording Code. In theory it can serve as the identity of a song. Note, however, that ISRCs generated by providers have not been officially verified, so the information encoded in an ISRC, such as country code and reference year, can be misleading or incorrect. Multiple songs can share one ISRC, since a single recording may be re-published several times.
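
To make the fields mentioned above concrete, here is a minimal sketch of splitting an ISRC into its parts, assuming the standard 12-character layout (2-letter country code, 3-character registrant code, 2-digit reference year, 5-digit designation code). The helper `parse_isrc` and the name `ISRC_PATTERN` are hypothetical and not part of the original notebook; the sample code is the one ISRC that appears in the output further down. Since provider-generated ISRCs are unverified, the parsed fields should be treated as hints rather than facts.

```python
import re

# Standard ISRC layout: 2-letter country code, 3-character registrant code,
# 2-digit year of reference, 5-digit designation code.
ISRC_PATTERN = re.compile(r'^([A-Z]{2})([A-Z0-9]{3})(\d{2})(\d{5})$')

def parse_isrc(isrc):
    """Split an ISRC into its fields; return None if missing or malformed."""
    if not isinstance(isrc, str):
        return None  # covers NaN (a float) and other non-string values
    # Normalize: trim whitespace, upper-case, drop optional hyphens
    match = ISRC_PATTERN.match(isrc.strip().upper().replace('-', ''))
    if match is None:
        return None
    country, registrant, year, designation = match.groups()
    # The year is kept as a 2-digit string because the century is ambiguous
    return {'country': country, 'registrant': registrant,
            'year': year, 'designation': designation}

parsed = parse_isrc('TWAE31500124')
print(parsed)
```

With pandas, the same helper could be applied column-wise via `song_extra['isrc'].map(parse_isrc)`; missing values simply come back as `None`.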

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re

In [2]:
song_extra=pd.read_csv('data/song_extra_info.csv')
print(song_extra.head(20))
print(song_extra.info())

                                         song_id  \
0   LP7pLJoJFBvyuUwvu+oLzjT+bI+UeBPURCecJsX1jjs=   
1   ClazTFnk6r0Bnuie44bocdNMM3rdlrq0bCGAsGUWcHE=   
2   u2ja/bZE3zhCGxvbbOB3zOoUjx27u40cf5g09UXMoKQ=   
3   92Fqsy0+p6+RHe2EoLKjHahORHR1Kq1TBJoClW9v+Ts=   
4   0QFmz/+rJy1Q56C1DuYqT9hKKqi5TUqx0sN0IwvoHrw=   
5   QU8f6JR0/cwLGSqJX2XDVzFK0DxMaIUY15ALJXK7ziw=   
6   O1Oj4CmnZhbHl7oyBaHSpGeu5gvcSmUydY3Awmv3uxk=   
7   Tr5R61AuEkN8UelOCzu09ZwQi7/HmP9sQmxf9rFngIg=   
8   ie9l12ZYXEaP4evrBBUvnNnZGdupHSX5NU+tEqB1SDg=   
9   6mICNlckUVGuoK/XGC7bnxXf5s2ZnkpFHShaGL/zM2Y=   
10  BUQwTuzZ8GKEiHtFoI1hFcKRK1W3EEpfD+VLcIVkUzQ=   
11  fuQO8mNakRgp0vDqDJbvorJvMcJMvSjldFKAz6g+27Y=   
12  oIkuw3YGuUhqJd8CMJxvBep4rEXXJxea71l1JO0EhfQ=   
13  jAeBPAOGuLjjF81uYHVj8sayYH6VQhaHGPhTfq+u8O4=   
14  gB4Fu5VOaGR+E1ITkBnb4yU2SdZFW6Q+K/OHPAZhZJk=   
15  uzWI7xZfL3gL2/B4ptZs0XfBuGC20ydak01SjhFuEtc=   
16  EXBuTr6J7UY6MDozwT/UDRVnmW0VGRVfeGBzrxVlX3k=   
17  y2QmHXZMAhfVXwyQoimo5ZvMbNdS8qKCRRqKqU7izew=   
18  isW4S3tq

In [3]:
#Check how many missing values there are in each column
song_extra.isnull().sum()

song_id         0
name            2
isrc       136548
dtype: int64

In [4]:
#Rows with missing names
song_extra[song_extra['name'].isnull()]

Unnamed: 0,song_id,name,isrc
273129,sNVAWeE2/q4auIOdlGc2H3WT2bw99rgk95+MPh81S84=,,TWAE31500124
800087,EqG1FQ2ZMDgqBC8vnSCTqN+TneeuQuSqKnljU2W9f44=,,


One of these songs has no isrc, so there is nothing to look its name up with. For the one that does have an isrc, I could not find the name. These entries are kept as they are.

In [5]:
#Clean format of name column: capitalize first letters and collapse repeated spaces
#The .str accessor skips NaN, so missing names stay missing
#(the earlier check x!=np.nan is always True and turned NaN into the string 'Nan')
song_extra['name']=song_extra['name'].str.title()
song_extra['name']=song_extra['name'].str.replace(' +',' ',regex=True)
print(song_extra.head(20))

                                         song_id  \
0   LP7pLJoJFBvyuUwvu+oLzjT+bI+UeBPURCecJsX1jjs=   
1   ClazTFnk6r0Bnuie44bocdNMM3rdlrq0bCGAsGUWcHE=   
2   u2ja/bZE3zhCGxvbbOB3zOoUjx27u40cf5g09UXMoKQ=   
3   92Fqsy0+p6+RHe2EoLKjHahORHR1Kq1TBJoClW9v+Ts=   
4   0QFmz/+rJy1Q56C1DuYqT9hKKqi5TUqx0sN0IwvoHrw=   
5   QU8f6JR0/cwLGSqJX2XDVzFK0DxMaIUY15ALJXK7ziw=   
6   O1Oj4CmnZhbHl7oyBaHSpGeu5gvcSmUydY3Awmv3uxk=   
7   Tr5R61AuEkN8UelOCzu09ZwQi7/HmP9sQmxf9rFngIg=   
8   ie9l12ZYXEaP4evrBBUvnNnZGdupHSX5NU+tEqB1SDg=   
9   6mICNlckUVGuoK/XGC7bnxXf5s2ZnkpFHShaGL/zM2Y=   
10  BUQwTuzZ8GKEiHtFoI1hFcKRK1W3EEpfD+VLcIVkUzQ=   
11  fuQO8mNakRgp0vDqDJbvorJvMcJMvSjldFKAz6g+27Y=   
12  oIkuw3YGuUhqJd8CMJxvBep4rEXXJxea71l1JO0EhfQ=   
13  jAeBPAOGuLjjF81uYHVj8sayYH6VQhaHGPhTfq+u8O4=   
14  gB4Fu5VOaGR+E1ITkBnb4yU2SdZFW6Q+K/OHPAZhZJk=   
15  uzWI7xZfL3gL2/B4ptZs0XfBuGC20ydak01SjhFuEtc=   
16  EXBuTr6J7UY6MDozwT/UDRVnmW0VGRVfeGBzrxVlX3k=   
17  y2QmHXZMAhfVXwyQoimo5ZvMbNdS8qKCRRqKqU7izew=   
18  isW4S3tq

isrc has many missing values (136548), so this column probably won't be used to draw conclusions.
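
As a rough illustration of how the usefulness of isrc could be quantified, the sketch below computes the share of missing ISRCs and finds ISRCs shared by more than one song. It runs on a small hypothetical frame (`toy`) standing in for `song_extra`; the rows, including the placeholder ISRC `XXAAA1600001`, are made up for the example and are not data from the file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for song_extra (values are made up)
toy = pd.DataFrame({
    'song_id': ['a', 'b', 'c', 'd', 'e'],
    'name': ['Song A', 'Song A (Remaster)', 'Song B', 'Song C', np.nan],
    'isrc': ['TWAE31500124', 'TWAE31500124', 'XXAAA1600001', np.nan, np.nan],
})

# Fraction of rows with no ISRC
missing_share = toy['isrc'].isnull().mean()

# ISRCs that occur on more than one song (value_counts ignores NaN by default)
isrc_counts = toy['isrc'].value_counts()
shared_isrcs = isrc_counts[isrc_counts > 1]

# Row mask for songs whose ISRC is shared; duplicated() treats NaN values
# as equal to each other, so missing ISRCs are excluded explicitly
songs_with_shared_isrc = toy['isrc'].duplicated(keep=False) & toy['isrc'].notnull()

print(missing_share)
print(shared_isrcs)
print(toy[songs_with_shared_isrc])
```

On the real data the same expressions would apply to `song_extra` in place of `toy`.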

In [7]:
#Export clean file as song_extra_info_clean.csv
#index=False keeps the row index from being written as an extra unnamed column
song_extra.to_csv('data/song_extra_info_clean.csv', index=False)