### Bring in raw data and perform data cleaning
**program:** 02_data_clean <br>
**author:** Chris Chan<br>
**date:** Jan 27, 2021<br>
**desc:** Bring data in from the Postgres database and perform data cleaning <br>

**datasources:**<br>
- sb_analytic (balanced df thru 2010)
- billboard analytic (hot 100 thru 2019)
- spotify random (random thru 2020)

In [3]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_formats = ['svg']
%matplotlib inline

plt.rcParams['figure.figsize'] = (9, 6)
sns.set(context='notebook', style='whitegrid', font_scale=1.2)

In [4]:
from sqlalchemy import create_engine

In [5]:
engine = create_engine('postgresql://chrischan:localhost@localhost:5432/m3spotify')
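
The connection string above embeds the user name and password directly in the notebook. A minimal sketch of one alternative, assuming the credentials are available as environment variables: the variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`) and the helper `build_pg_url` are illustrative and not part of the original notebook. Only the standard library is used to assemble the URL.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(env=os.environ):
    """Assemble a postgresql:// URL from environment variables (hypothetical helper)."""
    # quote_plus escapes characters such as '@' or ' ' that would break the URL
    user = quote_plus(env.get("PGUSER", "chrischan"))
    password = quote_plus(env.get("PGPASSWORD", ""))
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "m3spotify")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Example with an explicit mapping instead of the real environment
url = build_pg_url({"PGUSER": "chrischan", "PGPASSWORD": "p@ss word"})
print(url)
```

The resulting string can be passed to `create_engine` in place of the literal URL.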

### Work with Spotify & BB hot 100

**1b. sb analytic**

In [6]:
query = 'SELECT * FROM sb_analytic;'
sbdf = pd.read_sql(query, engine)
sbdf.head(2)

Unnamed: 0,SpotifyID,danceability,energy,key,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,loudness,is_hit,year
0,285pBltuF7vW8TeWk8hdRR,0.511,0.566,6,0,0.2,0.349,0.0,0.34,0.218,83.903,239836,-7.23,1,2018.0
1,7dt6x5M1jzdTEt8oCbisTK,0.68,0.578,10,1,0.04,0.331,0.0,0.135,0.341,145.038,231267,-5.804,1,2018.0


In [7]:
sbdf.describe()

Unnamed: 0,danceability,energy,key,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,loudness,is_hit,year
count,14682.0,14682.0,14682.0,14682.0,14682.0,14682.0,14682.0,14682.0,14682.0,14682.0,14682.0,14682.0,14682.0,14038.0
mean,-0.079946,-0.029549,4.609249,-0.006266,-0.583123,-0.448552,-0.595446,-0.482371,-0.155506,119.871835,240406.4,-8.447724,0.630568,2004.446787
std,26.080135,26.081806,26.443567,26.085725,26.066713,26.071495,26.067194,26.069647,26.078774,41.763557,90779.82,26.180325,0.482667,7.887024
min,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,1958.0
25%,0.497,0.51,2.0,0.0,0.0345,0.0175,0.0,0.0929,0.33,96.98525,200492.2,-9.4515,0.0,1999.0
50%,0.6145,0.684,6.0,1.0,0.0491,0.09835,5e-06,0.129,0.53,119.7705,231693.0,-6.815,1.0,2005.0
75%,0.722,0.823,8.0,1.0,0.105,0.35875,0.00165,0.258,0.724,139.84125,268600.0,-5.08425,1.0,2010.0
max,0.986,1.0,11.0,1.0,0.956,0.996,0.991,0.997,0.992,245.941,4802553.0,0.316,1.0,2019.0
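
The `min` row shows -999 for every audio feature, which looks like a placeholder for missing data rather than a real measurement (danceability and energy otherwise lie between 0 and 1), and it inflates the standard deviations. The sketch below, on a toy frame with made-up rows, shows one way to count such values and convert them to `NaN`. Treating -999 as a sentinel is an assumption inferred from the summary above, not something the source database confirms.

```python
import numpy as np
import pandas as pd

SENTINEL = -999  # assumption: placeholder for "missing", inferred from the min row

# Toy frame standing in for sbdf; the values are illustrative only
toy = pd.DataFrame({
    "SpotifyID": ["a", "b", "c"],
    "danceability": [0.511, -999.0, 0.680],
    "tempo": [83.903, -999.0, 145.038],
    "is_hit": [1, 0, 1],
})

feature_cols = ["danceability", "tempo"]

# Count sentinel hits per column before deciding how to handle them
sentinel_counts = (toy[feature_cols] == SENTINEL).sum()

# Turn sentinels into NaN so describe(), isnull() and dropna() treat them as missing
cleaned = toy.copy()
cleaned[feature_cols] = cleaned[feature_cols].replace(SENTINEL, np.nan)

print(sentinel_counts)
print(cleaned.isnull().sum())
```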


In [8]:
sbdf.is_hit.value_counts(dropna=False)

1    9258
0    5424
Name: is_hit, dtype: int64

*The `year` column has 644 missing values (14038 non-null out of 14682 rows). We will not impute them, so those rows are dropped.*

In [9]:
sbdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14682 entries, 0 to 14681
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   SpotifyID         14682 non-null  object 
 1   danceability      14682 non-null  float64
 2   energy            14682 non-null  float64
 3   key               14682 non-null  int64  
 4   mode              14682 non-null  int64  
 5   speechiness       14682 non-null  float64
 6   acousticness      14682 non-null  float64
 7   instrumentalness  14682 non-null  float64
 8   liveness          14682 non-null  float64
 9   valence           14682 non-null  float64
 10  tempo             14682 non-null  float64
 11  duration_ms       14682 non-null  int64  
 12  loudness          14682 non-null  float64
 13  is_hit            14682 non-null  int64  
 14  year              14038 non-null  float64
dtypes: float64(10), int64(4), object(1)
memory usage: 1.7+ MB


*Drop duplicates*

In [10]:
sbdf = sbdf.drop_duplicates()
sbdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14682 entries, 0 to 14681
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   SpotifyID         14682 non-null  object 
 1   danceability      14682 non-null  float64
 2   energy            14682 non-null  float64
 3   key               14682 non-null  int64  
 4   mode              14682 non-null  int64  
 5   speechiness       14682 non-null  float64
 6   acousticness      14682 non-null  float64
 7   instrumentalness  14682 non-null  float64
 8   liveness          14682 non-null  float64
 9   valence           14682 non-null  float64
 10  tempo             14682 non-null  float64
 11  duration_ms       14682 non-null  int64  
 12  loudness          14682 non-null  float64
 13  is_hit            14682 non-null  int64  
 14  year              14038 non-null  float64
dtypes: float64(10), int64(4), object(1)
memory usage: 1.8+ MB
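
`drop_duplicates()` with no arguments compares entire rows, and the row count stays at 14682 here. If the same track could appear twice with slightly different feature values, a key-based check on `SpotifyID` would catch it. A small sketch on a toy frame with illustrative values follows; whether `sbdf` actually contains repeated IDs is not shown in this notebook.

```python
import pandas as pd

# Toy frame: the two "a" rows share an ID but differ in energy
toy = pd.DataFrame({
    "SpotifyID": ["a", "a", "b"],
    "energy": [0.566, 0.570, 0.578],
})

# Whole-row comparison: nothing is dropped because no two rows are identical
full_row = toy.drop_duplicates()

# Key-based comparison: keep the first row for each SpotifyID
by_id = toy.drop_duplicates(subset="SpotifyID", keep="first")

# Number of rows whose SpotifyID already appeared earlier in the frame
n_repeated_ids = toy["SpotifyID"].duplicated().sum()

print(len(full_row), len(by_id), n_repeated_ids)
```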


In [11]:
# Total number of missing cells across all columns
sbdf.isnull().sum().sum()
#bbdf.isnull().values.any()

644
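
The single total does not say which columns hold the missing cells. A per-column breakdown makes that explicit; the sketch below uses a toy frame with illustrative values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for sbdf
toy = pd.DataFrame({
    "SpotifyID": ["a", "b", "c"],
    "year": [2018.0, np.nan, 2005.0],
})

# Missing cells per column, keeping only columns that have any
null_by_col = toy.isnull().sum()
cols_with_nulls = null_by_col[null_by_col > 0]

print(cols_with_nulls)
```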

In [12]:
# Keep only rows with a known year; rows with a missing year are dropped rather than imputed
sbdf = sbdf[sbdf['year'].notna()]

In [13]:
# Exclude one specific track by its Spotify ID
sbdf = sbdf[sbdf['SpotifyID'] != '39FgoYSPntDNk6vqbwKRKH']

In [14]:
sbdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14037 entries, 0 to 14681
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   SpotifyID         14037 non-null  object 
 1   danceability      14037 non-null  float64
 2   energy            14037 non-null  float64
 3   key               14037 non-null  int64  
 4   mode              14037 non-null  int64  
 5   speechiness       14037 non-null  float64
 6   acousticness      14037 non-null  float64
 7   instrumentalness  14037 non-null  float64
 8   liveness          14037 non-null  float64
 9   valence           14037 non-null  float64
 10  tempo             14037 non-null  float64
 11  duration_ms       14037 non-null  int64  
 12  loudness          14037 non-null  float64
 13  is_hit            14037 non-null  int64  
 14  year              14037 non-null  float64
dtypes: float64(10), int64(4), object(1)
memory usage: 1.7+ MB
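
`year` is stored as `float64`, which is typical for an integer column that contained `NaN`. Once the missing rows are gone, it can optionally be cast to an integer type. This step is not in the original notebook; the sketch below only illustrates the idea on a toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing year
toy = pd.DataFrame({"year": [2018.0, np.nan, 1999.0]})

# Drop rows with a missing year, then cast the remaining values to integers
toy = toy[toy["year"].notna()].copy()
toy["year"] = toy["year"].astype("int64")

print(toy.dtypes)
```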


**2. Save Dataframe for analysis**

In [15]:
sbdf.to_csv('../data/clean/sbdf_clean.csv', index=False, header=True)
print(sbdf)

                    SpotifyID  danceability  energy  key  mode  speechiness  \
0      285pBltuF7vW8TeWk8hdRR         0.511   0.566    6     0       0.2000   
1      7dt6x5M1jzdTEt8oCbisTK         0.680   0.578   10     1       0.0400   
2      78QR3Wp35dqAhFEc2qAGjE         0.897   0.662    1     0       0.2920   
3      2xLMifQCjDGFmkHkpNLD9h         0.834   0.730    8     1       0.2220   
4      2iUXsYOEPhVqEBwsqP70rE         0.596   0.854    7     0       0.4630   
...                       ...           ...     ...  ...   ...          ...   
14677  7xV2k7FEMtUT4IUu4L87it         0.562   0.525    9     1       0.0283   
14678  3e0tyTV5FiV1bcYeRjdDz2         0.404   0.636    4     0       0.0325   
14679  2CQwzG5nbS7ys8CHSlavVg         0.406   0.895    2     0       0.0563   
14680  0MS1NrmBWaCpPLFEXV0VMZ         0.329   0.963    4     1       0.1450   
14681  62wqW6Q9eTozrruWPt9Z9i         0.194   0.251    8     1       0.0371   

       acousticness  instrumentalness  liveness  va