# Data summaries for the MSS dataframe

### Load the `pandas` library

In [210]:
import pandas as pd
pd.__version__

'0.17.1'

### Load the `numpy` library

In [211]:
import numpy as np
np.__version__

'1.10.4'

### Read the dataframe from the `mss_df.pkl` _pickle_ file.

In [212]:
save_load_path = '/Users/David/Desktop'
mss_df = pd.read_pickle(save_load_path+'/mss_df.pkl')

### Display the dimensions (shape) and columns of the dataframe.

In [213]:
print('shape:',mss_df.shape)
print('columns:',mss_df.columns.values)
mss_df.dtypes

shape: (10000, 16)
columns: ['danceability' 'duration' 'energy' 'key' 'loudness' 'mode' 'tempo'
 'time_signature' 'artist_familiarity' 'artist_hotttnesss'
 'song_hotttnesss' 'genre' 'title' 'artist_location'
 'releaseartist_longitude' 'artist_latitude']


danceability               float64
duration                   float64
energy                     float64
key                          int32
loudness                   float64
mode                         int32
tempo                      float64
time_signature               int32
artist_familiarity         float64
artist_hotttnesss          float64
song_hotttnesss            float64
genre                       object
title                       object
artist_location             object
releaseartist_longitude     object
artist_latitude            float64
dtype: object

### Look for categorical variables and for single value variables

In the output below, `mode` takes two values, `key` takes twelve (12), and both `danceability` and `energy` are zero (0) for every row. The `genre` column contains only empty strings.

Variables `mode` and `key` will become categorical variables.

Variables `danceability`, `energy` and `genre` will be removed from the dataframe.

In [209]:
for colname in mss_df.columns.values:
    print('Column:',colname)
    print(mss_df[colname].unique())
    print('')

Column: danceability
[ 0.]

Column: duration
[ 218.93179  148.03546  177.47546 ...,  314.5922   141.73995  386.19383]

Column: energy
[ 0.]

Column: key
[ 1  6  8  0  2  5  4  7  9 10  3 11]

Column: loudness
[-11.197  -9.843  -9.689 ..., -12.087 -12.574  -5.324]

Column: mode
[0 1]

Column: tempo
[  92.198  121.274  100.07  ...,  118.123  137.663  150.575]

Column: time_signature
[4 1 3 5 7 0]

Column: artist_familiarity
[ 0.58179377  0.63063004  0.48735679 ...,  0.38359402  0.65089597
  0.5820348 ]

Column: artist_hotttnesss
[ 0.40199754  0.41749964  0.34342838 ...,  0.22021045  0.31502571
  0.46905598]

Column: song_hotttnesss
[ 0.60211999         nan  0.60450074 ...,  0.66319397  0.69097095
  0.7803116 ]

Column: genre
[b'']

Column: title
[b"I Didn't Mean To" b'Soul Deep' b'Amor De Cabaret' ...,
 b'Sentimental Man' b'Zydeco In D-Minor' b'Shattered Life']

Column: artist_location
[b'California - LA' b'Memphis, TN' b'' ..., b'Aitkin, MN' b'Cullendale, AR'
 b'Alexandria, Egypt']

Col
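
Printing every unique value gets long for wide dataframes. As a compact alternative, here is a minimal sketch that counts distinct values per column with `nunique()`. The `toy` frame and its values are made up to stand in for `mss_df`, and the cutoff of 2 distinct values for categorical candidates is an arbitrary assumption for this toy data.

```python
import pandas as pd

# Hypothetical toy frame standing in for mss_df (values are made up).
toy = pd.DataFrame({
    'danceability': [0.0, 0.0, 0.0, 0.0],
    'mode': [0, 1, 1, 0],
    'tempo': [92.2, 121.3, 100.1, 118.1],
    'genre': [b'', b'', b'', b''],
})

# Number of distinct non-missing values in each column.
n_unique = toy.nunique()

# Columns with a single value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with few distinct values are candidates for categoricals.
# The cutoff (2) is arbitrary and chosen for this toy data only.
candidate_cats = n_unique[(n_unique > 1) & (n_unique <= 2)].index.tolist()

print('constant:', constant_cols)
print('categorical candidates:', candidate_cats)
```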

### Make `mode` and `key` variables categorical

Also confirm that both columns of `mss_df` now have the `category` dtype.

In [200]:
mss_df['mode'] = mss_df['mode'].astype('category')
mss_df['key']  = mss_df['key'] .astype('category')
mss_df['key'].dtype, mss_df['mode'].dtype

(category, category)
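
Beyond the dtype, the levels and their frequencies are worth a look after the conversion. A minimal sketch on made-up values (the `toy_mode` series is hypothetical and not part of `mss_df`):

```python
import pandas as pd

# Hypothetical values standing in for the `mode` column.
toy_mode = pd.Series([0, 1, 1, 0, 1], name='mode')
mode_cat = toy_mode.astype('category')

print(mode_cat.dtype)                  # the dtype after conversion
print(list(mode_cat.cat.categories))   # the distinct levels
print(mode_cat.value_counts())         # frequency of each level
print(mode_cat.describe())             # count, unique, top, freq
```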

### Remove constant variables (as they have only one value)

- Remove `danceability` and `energy` variables as all values are zero (0).
- Remove `genre` variable as all of its values are empty strings.

In [201]:
mss_df = mss_df.drop(['danceability','energy','genre'],axis=1)
mss_df.columns.values

array(['duration', 'key', 'loudness', 'mode', 'tempo',
       'artist_familiarity', 'artist_hotttnesss', 'song_hotttnesss',
       'title', 'artist_location'], dtype=object)

## Descriptive statistics

See [Descriptive statistics - pandas docs](http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics)

There are three ways to do this. 

1. Display a matrix with all numeric variables as columns and the statistics as rows.
1. For each column use `describe()` to get summary statistics for that column. 
1. Get each summary statistic for all variables. 

We start with the first option. In the outputs that were generated before the constant columns were dropped, `danceability` and `energy` are all zero (min and max are both zero), which is why they are removed from the dataframe.
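
The three options side by side, as a minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({
    'duration': [218.9, 148.0, 177.5, 314.6],
    'tempo': [92.2, 121.3, 100.1, 118.1],
})

# 1. One table: statistics as rows, variables as columns.
summary = toy.describe()

# 2. One variable at a time.
duration_summary = toy['duration'].describe()

# 3. One statistic at a time, across all variables.
means = toy.mean()

print(summary)
print(duration_summary)
print(means)
```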

### Get descriptive stats for each column/variable

In [193]:
mss_df.describe()

| | duration | loudness | tempo | artist_familiarity | artist_hotttnesss | song_hotttnesss |
|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 9996.0 | 10000.0 | 5648.0 |
| mean | 238.507518 | -10.485668 | 122.915449 | 0.565456 | 0.385552 | 0.342822 |
| std | 114.137514 | 5.399788 | 35.184412 | 0.160161 | 0.143647 | 0.24722 |
| min | 1.04444 | -51.643 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 176.0322 | -13.16325 | 96.96575 | 0.467611 | 0.325266 | 0.0 |
| 50% | 223.05914 | -9.38 | 120.161 | 0.563666 | 0.380742 | 0.360371 |
| 75% | 276.37506 | -6.5325 | 144.01325 | 0.66802 | 0.453858 | 0.537504 |
| max | 1819.76771 | 0.566 | 262.828 | 1.0 | 1.082503 | 1.0 |

The second option calls `describe()` on each column in turn.


In [170]:
for colname in mss_df.columns.values:
    print('Column:',colname)
    print(mss_df[colname].describe())
    print('')

Column: danceability
count    10000
mean         0
std          0
min          0
25%          0
50%          0
75%          0
max          0
Name: danceability, dtype: float64

Column: duration
count    10000.000000
mean       238.507518
std        114.137514
min          1.044440
25%        176.032200
50%        223.059140
75%        276.375060
max       1819.767710
Name: duration, dtype: float64

Column: energy
count    10000
mean         0
std          0
min          0
25%          0
50%          0
75%          0
max          0
Name: energy, dtype: float64

Column: key
count     10000
unique       12
top           7
freq       1339
Name: key, dtype: int64

Column: loudness
count    10000.000000
mean       -10.485668
std          5.399788
min        -51.643000
25%        -13.163250
50%         -9.380000
75%         -6.532500
max          0.566000
Name: loudness, dtype: float64

Column: mode
count     10000
unique        2
top           1
freq       6911
Name: mode, dtype: int64

Colu

The third option displays each statistic (count, min, max, mean, median) for all variables.

Start by counting the non-missing values for each variable.

In [171]:
mss_df.count(axis=0)

danceability          10000
duration              10000
energy                10000
key                   10000
loudness              10000
mode                  10000
tempo                 10000
artist_familiarity     9996
artist_hotttnesss     10000
song_hotttnesss        5648
genre                 10000
title                 10000
artist_location       10000
dtype: int64
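
The counts above imply 4 missing values in `artist_familiarity` (10000 - 9996) and 4352 in `song_hotttnesss` (10000 - 5648). The missing counts can also be computed directly. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing values (values are made up).
toy = pd.DataFrame({
    'artist_familiarity': [0.58, 0.63, np.nan, 0.38],
    'song_hotttnesss': [0.60, np.nan, np.nan, 0.66],
})

non_missing = toy.count()       # non-missing values per column
missing = toy.isnull().sum()    # missing values per column

print(pd.DataFrame({'non_missing': non_missing, 'missing': missing}))
```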

Then compute the minimum for each variable.

In [172]:
mss_df.min(axis=0)

danceability                0
duration              1.04444
energy                      0
key                         0
loudness              -51.643
mode                        0
tempo                       0
artist_familiarity          0
artist_hotttnesss           0
song_hotttnesss             0
genre                     b''
title                     b''
artist_location           b''
dtype: object

Then compute the maximum of each variable.

In [173]:
mss_df.max(axis=0)

danceability                                            0
duration                                          1819.77
energy                                                  0
key                                                    11
loudness                                            0.566
mode                                                    1
tempo                                             262.828
artist_familiarity                                      1
artist_hotttnesss                                  1.0825
song_hotttnesss                                         1
genre                                                 b''
title                             b'\xc3\x9altimo Desejo'
artist_location       b'\xe5\x8f\xb0\xe5\x8c\x97, Taiwan'
dtype: object
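
The `title` and `artist_location` values above are byte strings, so their minimum and maximum follow byte order. A minimal sketch of decoding such a column to text, assuming the bytes are UTF-8 encoded (the escape sequences above look like UTF-8, but the source does not state the encoding); the `titles` series is a hypothetical stand-in built from two titles that appear in the outputs:

```python
import pandas as pd

# Two byte-string titles taken from the outputs above.
titles = pd.Series([b'Soul Deep', b'\xc3\x9altimo Desejo'], name='title')

# Assumption: the bytes are UTF-8 encoded.
decoded = titles.str.decode('utf-8')

print(decoded.tolist())
```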

Then compute the mean for each variable.

In [174]:
mss_df.mean(axis=0)

danceability            0.000000
duration              238.507518
energy                  0.000000
loudness              -10.485669
tempo                 122.915449
artist_familiarity      0.565456
artist_hotttnesss       0.385552
song_hotttnesss         0.342822
dtype: float64

Then compute the median for each variable.

In [175]:
mss_df.median(axis=0)

danceability            0.000000
duration              223.059140
energy                  0.000000
loudness               -9.380000
tempo                 120.161000
artist_familiarity      0.563666
artist_hotttnesss       0.380742
song_hotttnesss         0.360371
dtype: float64
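
The five separate calls above can also be collapsed into one `agg` call. A minimal sketch on a made-up frame (`toy` and its values are hypothetical). Selecting the numeric columns first is a precaution: pandas 0.17.1 silently skipped object columns in `mean()` and `median()`, as the outputs above show, but newer releases (2.0 and later, as far as we know) raise an error instead.

```python
import pandas as pd

# Hypothetical toy frame with one non-numeric column (values are made up).
toy = pd.DataFrame({
    'duration': [218.9, 148.0, 177.5, 314.6],
    'tempo': [92.2, 121.3, 100.1, 118.1],
    'title': [b'a', b'b', b'c', b'd'],
})

# Keep numeric columns only, then compute all statistics at once.
numeric = toy.select_dtypes(include='number')
stats = numeric.agg(['count', 'min', 'max', 'mean', 'median'])

print(stats)
```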

### Grouped summary statistics

In [179]:
mss_df.groupby('mode').aggregate(np.mean)

| mode | duration | loudness | tempo | artist_familiarity | artist_hotttnesss | song_hotttnesss |
|---|---|---|---|---|---|---|
| 0 | 246.955471 | -10.118045 | 123.735232 | 0.572518 | 0.38974 | 0.356552 |
| 1 | 234.731549 | -10.649985 | 122.549031 | 0.562297 | 0.38368 | 0.336759 |


In [180]:
mss_df.groupby('key').aggregate(np.mean)

| key | duration | loudness | tempo | artist_familiarity | artist_hotttnesss | song_hotttnesss |
|---|---|---|---|---|---|---|
| 0 | 237.823426 | -11.058151 | 121.650326 | 0.54455 | 0.36831 | 0.315054 |
| 1 | 244.883665 | -10.001926 | 123.349775 | 0.584262 | 0.397954 | 0.362002 |
| 2 | 233.626999 | -10.83596 | 123.222114 | 0.562147 | 0.384628 | 0.338356 |
| 3 | 229.262293 | -11.512135 | 118.350529 | 0.55191 | 0.380443 | 0.330331 |
| 4 | 234.139991 | -10.557836 | 122.78616 | 0.561873 | 0.38492 | 0.35286 |
| 5 | 231.096605 | -11.09046 | 122.702699 | 0.556122 | 0.38817 | 0.324215 |
| 6 | 247.835534 | -9.542808 | 121.108782 | 0.577913 | 0.396189 | 0.375138 |
| 7 | 237.767248 | -10.401917 | 124.835913 | 0.567382 | 0.384556 | 0.333403 |
| 8 | 243.467648 | -10.590195 | 123.817277 | 0.563032 | 0.385343 | 0.372773 |
| 9 | 237.190127 | -10.236632 | 125.600899 | 0.566308 | 0.382939 | 0.338552 |
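
Grouped summaries are not limited to one statistic. A minimal sketch computing the mean and the group size together, on a made-up frame (`toy` and its values are hypothetical). It passes the string `'mean'` rather than `np.mean`, which we understand to be the preferred spelling in newer pandas releases.

```python
import pandas as pd

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({
    'mode': [0, 1, 1, 0, 1],
    'tempo': [90.0, 120.0, 100.0, 110.0, 140.0],
    'loudness': [-11.0, -9.0, -10.0, -12.0, -5.0],
})

# Mean and number of rows per group, for two numeric columns.
by_mode = toy.groupby('mode')[['tempo', 'loudness']].agg(['mean', 'count'])

print(by_mode)
```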


### Display correlation matrix for all (numeric) variables

In [177]:
mss_df.corr()

| | duration | loudness | tempo | artist_familiarity | artist_hotttnesss | song_hotttnesss |
|---|---|---|---|---|---|---|
| duration | 1.0 | 0.057247 | -0.009968 | 0.032884 | 0.021068 | 0.007042 |
| loudness | 0.057247 | 1.0 | 0.179354 | 0.239432 | 0.187415 | 0.226461 |
| tempo | -0.009968 | 0.179354 | 1.0 | 0.065946 | 0.050259 | 0.079031 |
| artist_familiarity | 0.032884 | 0.239432 | 0.065946 | 1.0 | 0.811102 | 0.543748 |
| artist_hotttnesss | 0.021068 | 0.187415 | 0.050259 | 0.811102 | 1.0 | 0.522261 |
| song_hotttnesss | 0.007042 | 0.226461 | 0.079031 | 0.543748 | 0.522261 | 1.0 |


### Display covariance matrix for all (numeric) variables

In [178]:
mss_df.cov()

| | duration | loudness | tempo | artist_familiarity | artist_hotttnesss | song_hotttnesss |
|---|---|---|---|---|---|---|
| duration | 13027.372003 | 35.282129 | -40.028926 | 0.601235 | 0.345422 | 0.193065 |
| loudness | 35.282129 | 29.157713 | 34.075157 | 0.207052 | 0.145371 | 0.289452 |
| tempo | -40.028926 | 34.075157 | 1237.94283 | 0.371603 | 0.254014 | 0.681782 |
| artist_familiarity | 0.601235 | 0.207052 | 0.371603 | 0.025652 | 0.018638 | 0.020232 |
| artist_hotttnesss | 0.345422 | 0.145371 | 0.254014 | 0.018638 | 0.020635 | 0.016453 |
| song_hotttnesss | 0.193065 | 0.289452 | 0.681782 | 0.020232 | 0.016453 | 0.061118 |
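
The correlation and covariance matrices are linked: a correlation is the covariance divided by the product of the two standard deviations. For `artist_familiarity` and `artist_hotttnesss`, the values reported in this notebook give 0.018638 / (0.160161 × 0.143647) ≈ 0.81, in line with the 0.811102 in the correlation matrix. The small gap is expected from rounding and from the few rows with missing `artist_familiarity`. A minimal sketch of the same check on a made-up frame without missing values (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame without missing values (values are made up).
toy = pd.DataFrame({
    'artist_familiarity': [0.58, 0.63, 0.49, 0.38, 0.65],
    'artist_hotttnesss': [0.40, 0.42, 0.34, 0.22, 0.32],
})

cov = toy.cov()   # sample covariance matrix
std = toy.std()   # sample standard deviations

# corr(x, y) = cov(x, y) / (std(x) * std(y)), applied elementwise.
corr_from_cov = cov / np.outer(std, std)

print(corr_from_cov)
print(toy.corr())
```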


### Variable descriptions

See the [Analyzer Documentation](http://developer.echonest.com/docs/v4/_static/AnalyzeDocumentation.pdf) 
and the [API Overview](http://developer.echonest.com/docs/v4/contents.html) 
at the Echo Nest developer site http://developer.echonest.com. 
Most of the following descriptions are from the documentation.
Some are from the API Overview.
See also the [FAQ](http://labrosa.ee.columbia.edu/millionsong/faq) at the MSD site. 

Numeric variable descriptions:

- `duration`: "_the duration of a track in seconds as precisely computed by the audio decoder._"
- `loudness`: "_the overall loudness of a track in decibels (dB). Loudness values in the Analyzer are averaged across an entire track and are useful for comparing relative loudness of segments and tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude)._"
- `tempo`: "_the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration._"
- `artist_familiarity`: "_numerical estimation of how familiar an artist currently is to the world._"
- `artist_hotttnesss`
- `song_hotttnesss`

Categorical variable descriptions:

- `mode`: "_indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived._"
- `key`: "_the estimated overall key of a track. The key identifies the tonic triad, the chord, major or minor, which represents the final point of rest of a piece._"

Character variable descriptions:

- `title`
- `artist_location`

In [204]:
mss_df.dtypes

duration               float64
key                   category
loudness               float64
mode                  category
tempo                  float64
artist_familiarity     float64
artist_hotttnesss      float64
song_hotttnesss        float64
title                   object
artist_location         object
dtype: object