# Data summaries for the MSS dataframe

### Load the `pandas` library

In [2]:
import pandas as pd
pd.__version__

'0.17.1'

### Load the `numpy` library

In [3]:
import numpy as np
np.__version__

'1.10.4'

### Read the dataframe from the `mss_df.pkl` _pickle_ file.

In [4]:
save_load_path = '/Users/David/Desktop'
mss_df = pd.read_pickle(save_load_path+'/mss_df.pkl')
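
Joining the directory and file name with `os.path.join` avoids hand-written path separators. The following is a minimal sketch of the save/load round trip, assuming a temporary directory in place of `save_load_path` and a hypothetical `toy_df` in place of the real dataframe:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real dataframe, used only for illustration.
toy_df = pd.DataFrame({'duration': [218.9, 148.0], 'mode': [0, 1]})

# A temporary directory stands in for save_load_path.
demo_dir = tempfile.mkdtemp()
pkl_path = os.path.join(demo_dir, 'toy_df.pkl')

toy_df.to_pickle(pkl_path)            # write the pickle file
loaded_df = pd.read_pickle(pkl_path)  # read it back

# True when the round trip preserved values and dtypes.
print(loaded_df.equals(toy_df))
```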

### Display the dimensions (shape) and columns of the dataframe.

In [5]:
print('shape:',mss_df.shape)
print('columns:',mss_df.columns.values)
mss_df.dtypes

shape: (10000, 115)
columns: ['artist_familiarity' 'artist_hotttnesss' 'artist_id' 'artist_latitude'
 'artist_location' 'artist_longitude' 'bc_0' 'bc_1' 'bc_2' 'bc_3' 'bc_4'
 'bc_5' 'bc_6' 'bc_7' 'bc_8' 'bc_9' 'duration' 'loudness' 'mode' 'release'
 'song_hotttnesss' 'song_id' 'sp_0' 'sp_1' 'sp_10' 'sp_11' 'sp_12' 'sp_13'
 'sp_14' 'sp_15' 'sp_16' 'sp_17' 'sp_18' 'sp_19' 'sp_2' 'sp_20' 'sp_21'
 'sp_22' 'sp_23' 'sp_24' 'sp_25' 'sp_26' 'sp_27' 'sp_28' 'sp_29' 'sp_3'
 'sp_30' 'sp_31' 'sp_32' 'sp_33' 'sp_34' 'sp_35' 'sp_4' 'sp_5' 'sp_6'
 'sp_7' 'sp_8' 'sp_9' 'st_0' 'st_1' 'st_10' 'st_11' 'st_12' 'st_13' 'st_14'
 'st_15' 'st_16' 'st_17' 'st_18' 'st_19' 'st_2' 'st_20' 'st_21' 'st_22'
 'st_23' 'st_24' 'st_25' 'st_26' 'st_27' 'st_28' 'st_29' 'st_3' 'st_30'
 'st_31' 'st_32' 'st_33' 'st_34' 'st_35' 'st_4' 'st_5' 'st_6' 'st_7' 'st_8'
 'st_9' 'tempo' 'title' 'track_id' 'k_0' 'k_1' 'k_2' 'k_3' 'k_4' 'k_5'
 'k_6' 'k_7' 'k_8' 'k_9' 'k_10' 'k_11' 'ts_0' 'ts_1' 'ts_3' 'ts_4' 'ts_5'
 'ts_7']


artist_familiarity    float64
artist_hotttnesss     float64
artist_id              object
artist_latitude       float64
artist_location        object
artist_longitude      float64
bc_0                  float64
bc_1                  float64
bc_2                  float64
bc_3                  float64
bc_4                  float64
bc_5                  float64
bc_6                  float64
bc_7                  float64
bc_8                  float64
bc_9                  float64
duration              float64
loudness              float64
mode                  float64
release                object
song_hotttnesss       float64
song_id                object
sp_0                  float64
sp_1                  float64
sp_10                 float64
sp_11                 float64
sp_12                 float64
sp_13                 float64
sp_14                 float64
sp_15                 float64
                       ...   
st_33                 float64
st_34                 float64
st_35     
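
With 115 columns, the full `dtypes` listing gets truncated in the display. A more compact view may be to count the columns per dtype and to list only the numeric ones. This is a sketch on a hypothetical `toy_df`; the column names are borrowed from the listing above, the values are made up:

```python
import pandas as pd

# Hypothetical miniature of the dataframe: one text column, two numeric ones.
toy_df = pd.DataFrame({
    'artist_id': ['A1', 'A2', 'A3'],
    'duration': [218.9, 148.0, 177.5],
    'tempo': [92.2, 121.3, 100.1],
})

# How many columns there are of each dtype.
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)

# Names of the numeric columns only.
numeric_cols = toy_df.select_dtypes(include=['number']).columns.tolist()
print(numeric_cols)
```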

## Variable descriptions

The variable descriptions below are pulled from these sources:

- [Analyzer Documentation](http://developer.echonest.com/docs/v4/_static/AnalyzeDocumentation.pdf) 
- [API Overview](http://developer.echonest.com/docs/v4/contents.html) 
- [Echo Nest developer site](http://developer.echonest.com)
- [FAQ](http://labrosa.ee.columbia.edu/millionsong/faq) at the MSD site
- [Field list](http://labrosa.ee.columbia.edu/millionsong/pages/field-list) at Lab

#### Numeric variable descriptions:

- `song_hotttnesss`
- `artist_hotttnesss`
- `artist_familiarity`: "_numerical estimation of how familiar an artist currently is to the world._"

- `duration`: "_the duration of a track in seconds as precisely computed by the audio decoder._"
- `loudness`: "_the overall loudness of a track in decibels (dB). Loudness values in the Analyzer are averaged across an entire track and are useful for comparing relative loudness of segments and tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude)._"
- `tempo`: "_the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration._"
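
The relation between tempo and average beat duration quoted above can be illustrated with a small calculation. This is only a sketch of the arithmetic, not the Analyzer's actual estimation procedure; the `beats_start` values are hypothetical:

```python
import numpy as np

# Hypothetical beat markers in seconds (a steady beat every 0.5 s).
beats_start = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])

beat_durations = np.diff(beats_start)       # seconds between consecutive beats
mean_beat_duration = beat_durations.mean()  # average beat duration in seconds
tempo_bpm = 60.0 / mean_beat_duration       # beats per minute

print(tempo_bpm)
```

A beat every half second corresponds to 120 BPM.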

#### Categorical variable descriptions:

- `mode`: "_indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived._"
- `key`: "_the estimated overall key of a track. The key identifies the tonic triad, the chord, major or minor, which represents the final point of rest of a piece._"
- `time_signature`: "_an estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure)._"

#### Character variable descriptions:

- `title`: This contains text and could be mined as such. 
- `release`: 
- `artist_name`: 
- `artist_location`: This should be a categorical variable.

#### Vector variable descriptions (includes segments, sections, etc.):

- `artist_terms`: "_tags provided by Echo Nest_"
- `artist_terms_freq`: ?
- `artist_terms_weight`: ?

Sections: "_a set of section markers, in seconds. Sections are defined by large variations in rhythm or timbre, e.g. chorus, verse, bridge, guitar solo, etc. Each section contains its own descriptions of tempo, key, mode, time signature, and loudness._"

- `sections_start`
- `sections_confidence`

Beats: "_list of beat markers, in seconds. A beat is the basic time unit of a piece of music; for example, each tick of a metronome. Beats are typically multiples of tatums._"

- `beats_start`
- `beats_confidence`

Tatums: "_list of tatum markers, in seconds. Tatums represent the lowest regular pulse train that a listener intuitively infers from the timing of perceived musical events (segments)._"

- `tatums_start`
- `tatums_confidence`

Bars: "_list of bar markers, in seconds. A bar (or measure) is a segment of time defined as a given number of beats. Bar offsets also indicate downbeats, the first beat of the measure._"
 
- `bars_start`
- `bars_confidence`

Segments: "_a set of sound entities (typically under a second) each relatively uniform in timbre and harmony. Segments are characterized by their perceptual onsets and duration in seconds, loudness (dB), pitch and timbral content._" The first three songs have between 500 and 1000 segments. 

The first five (5) _segment_ variables are vectors with one element per segment. 

- `segments_start`: start of segment in seconds?
- `segments_confidence`: confidence of ?

- `segments_loudness_max`: "_peak loudness value within the segment_"
- `segments_loudness_max_time`: "_offset within the segment of the point of maximum loudness_"
- `segments_loudness_start`: "_indicates the loudness level at the start of the segment_"

#### Matrix variable descriptions: 

The `segments_timbre` and `segments_pitches` variables are matrices with twelve (12) columns and one row for each segment.

- `segments_timbre`: "_In simple terms, timbre is what makes a particular musical sound different from another, even when they have the same pitch and loudness. For instance, it is the difference between a guitar and a piano playing the same note at the same loudness._" (Wikipedia)
    - Each of the twelve (12) columns corresponds to a _basis_ function: 
        - "_The first dimension represents the average loudness of the segment_"
        - "_The second emphasizes brightness_"
        - "_The third is more closely correlated to the flatness of a sound_"
        - "_The fourth to sounds with a stronger attack; etc._"
        - "_Timbre vectors are best used in comparison with each other._"
    - Each value is a coefficient for the corresponding basis function. 
- `segments_pitches`: "_Pitch is the quality that makes it possible to judge sounds as higher and lower_" (Wikipedia)
    - Each column corresponds to a [_pitch class_](https://en.wikipedia.org/wiki/Pitch_class), which is a set of all pitches that are a whole number of octaves apart (Wikipedia).
    - The higher values correspond to dominant notes. Often there is a `1.0` in one of the twelve fields.
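
To make the matrix layout concrete, here is a sketch that builds a hypothetical `segments_pitches` matrix (one row per segment, twelve columns) and picks the dominant pitch class of each segment as the column with the largest value. The data are random and the chosen dominant classes are arbitrary; only the shape follows the description above.

```python
import numpy as np

rng = np.random.RandomState(0)
n_segments = 5

# Hypothetical pitch matrix: one row per segment, twelve pitch classes.
segments_pitches = rng.uniform(0.0, 0.9, size=(n_segments, 12))

# Put a 1.0 in one column per segment, mimicking a dominant note.
dominant = np.array([0, 7, 7, 4, 0])
segments_pitches[np.arange(n_segments), dominant] = 1.0

# Dominant pitch class per segment = index of the largest value in each row.
dominant_pitch_class = segments_pitches.argmax(axis=1)

# One possible song-level summary: mean of each pitch class over all segments.
mean_pitches = segments_pitches.mean(axis=0)

print(dominant_pitch_class)
print(mean_pitches.shape)
```

Averaging over the segment axis is one way to reduce a variable-length matrix to a fixed-length vector per song.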


http://www.ifs.tuwien.ac.at/~schindler/pubs/AMR2012.pdf


### Look for categorical variables and for single value variables

Notice that variable `mode` has two values, `key` has twelve (12) values and both `danceability` and `energy` have all values equal to zero (0).

Variables `mode` and `key` will become categorical variables. 

Variables `danceability` and `energy` will be removed from the dataframe.
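
One way to make `mode` and `key` categorical is `astype('category')`. The sketch below uses a hypothetical `toy_df`. It also shows `pd.get_dummies`, which produces indicator columns named like the `k_0` ... `k_11` columns listed earlier; whether those columns were actually created this way is an assumption.

```python
import pandas as pd

# Hypothetical data with the two low-cardinality variables.
toy_df = pd.DataFrame({'mode': [1, 0, 1, 1], 'key': [7, 0, 7, 2]})

# Convert both columns to the categorical dtype.
for colname in ['mode', 'key']:
    toy_df[colname] = toy_df[colname].astype('category')

print(toy_df.dtypes)
print(toy_df['key'].cat.categories.tolist())

# Indicator (one-hot) columns, one per observed key value.
key_dummies = pd.get_dummies(toy_df['key'], prefix='k')
print(key_dummies.columns.tolist())
```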

In [6]:
for colname in mss_df.columns.values:
    print('Column:',colname)
    print(mss_df[colname].unique())
    print('')

Column: artist_familiarity
[ 0.58179377  0.63063004  0.48735679 ...,  0.38359402  0.65089597
  0.5820348 ]

Column: artist_hotttnesss
[ 0.40199754  0.41749964  0.34342838 ...,  0.22021045  0.31502571
  0.46905598]

Column: artist_id
[b'ARD7TVE1187B99BFB1' b'ARMJAGH1187FB546F3' b'ARKRRTF1187B9984DA' ...,
 b'ARMATAK1264D13AC27' b'AR5S26K1187B9A3141' b'AR0ILWV1187FB47CAA']

Column: artist_latitude
[             nan   3.51496800e+01   3.52196200e+01   4.36485600e+01
   3.71679300e+01   3.43110900e+01   3.88991000e+01   4.98038800e+01
   4.08262400e+01   3.78372100e+01   3.26782800e+01   4.07919500e+01
   4.07319700e+01   4.18841500e+01   2.79401700e+01   4.07038000e+01
   3.00861500e+01   2.94244900e+01   3.40534900e+01   4.23316800e+01
   5.15063200e+01   3.59146300e+01   3.43640100e+01   3.27781500e+01
   5.25160700e+01  -2.35628700e+01   5.44830300e+01   3.77791600e+01
  -3.78175300e+01   3.61677800e+01  -2.49162000e+01   1.81152600e+01
   4.15047100e+01   4.07145500e+01   6.45565300e+0

### Constant variables have been removed (as they have only one value)

- Remove `danceability` and `energy` variables as all values are zero (0).
- Remove `genre` variable as all of its values are empty strings.

In [7]:
# mss_df = mss_df.drop(['danceability','energy','genre'],axis=1)
# mss_df.columns.values
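
Instead of naming the constant columns by hand, they can be detected from the number of distinct values per column. A minimal sketch on a hypothetical `toy_df`, assuming a pandas version that provides `DataFrame.nunique` (newer than the 0.17.1 used in this notebook):

```python
import pandas as pd

# Hypothetical data: two constant columns and one that varies.
toy_df = pd.DataFrame({
    'danceability': [0.0, 0.0, 0.0],
    'genre': ['', '', ''],
    'tempo': [92.2, 121.3, 100.1],
})

# Distinct (non-missing) values per column; at most 1 means constant.
n_unique = toy_df.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Drop the constant columns.
reduced_df = toy_df.drop(constant_cols, axis=1)

print(constant_cols)
print(reduced_df.columns.tolist())
```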

## Descriptive statistics

See [Descriptive statistics - pandas docs](http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics)

There are several ways to do this:

1. Display a matrix with all numeric variables as columns and the statistics as rows.
1. For each column use `describe()` to get summary statistics for that column. 
1. Get specific summary statistics for all variables.
1. Display grouped summary statistics.

We start with the first option. Notice that `danceability` and `energy` are all zero (both the min and the max are zero). They will be removed from the dataframe. 

### 1. Get descriptive stats for each numeric column/variable

In [8]:
type(mss_df)

pandas.core.frame.DataFrame

In [9]:
mss_df.describe()

| | artist_familiarity | artist_hotttnesss | artist_latitude | artist_longitude | bc_0 | bc_1 | bc_2 | bc_3 | bc_4 | bc_5 | ... | k_8 | k_9 | k_10 | k_11 | ts_0 | ts_1 | ts_3 | ts_4 | ts_5 | ts_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9996.0 | 10000.0 | 3742.0 | 3742.0 | 9970.0 | 9970.0 | 9966.0 | 9964.0 | 9959.0 | 9953.0 | ... | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 0.565456 | 0.385552 | 37.157357 | -63.933358 | 0.239461 | 0.235951 | 0.226657 | 0.226818 | 0.220867 | 0.22204 | ... | 0.0528 | 0.104 | 0.07 | 0.0738 | 0.0003 | 0.1498 | 0.1278 | 0.6351 | 0.0589 | 0.0281 |
| std | 0.160161 | 0.143647 | 15.598494 | 50.508223 | 0.275182 | 0.23484 | 0.228602 | 0.229048 | 0.228798 | 0.226301 | ... | 0.223645 | 0.305276 | 0.25516 | 0.261458 | 0.017319 | 0.356893 | 0.333884 | 0.481426 | 0.235449 | 0.165267 |
| min | 0.0 | 0.0 | -41.28054 | -162.4365 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.467611 | 0.325266 | 33.74831 | -93.29156 | 0.035 | 0.054 | 0.05 | 0.05 | 0.046 | 0.049 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.563666 | 0.380742 | 38.99792 | -80.23742 | 0.121 | 0.1485 | 0.141 | 0.143 | 0.134 | 0.136 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 75% | 0.66802 | 0.453858 | 43.64856 | -9.40102 | 0.352 | 0.357 | 0.34375 | 0.337 | 0.332 | 0.33 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| max | 1.0 | 1.082503 | 69.65102 | 174.76714 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### 2. Get descriptive stats for each categorical (specified) variable

In [14]:
for colname in ['key','mode','time_signature']:
    print('Column:',colname)
    print(mss_df[colname].describe())
    print('')

Column: key
count     10000
unique       12
top           7
freq       1339
Name: key, dtype: int64

Column: mode
count     10000
unique        2
top           1
freq       6911
Name: mode, dtype: int64

Column: time_signature
count     10000
unique        6
top           4
freq       6351
Name: time_signature, dtype: int64
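
The `count` / `unique` / `top` / `freq` layout above is what `describe()` reports for categorical (or object) data; for a plain numeric column it reports mean, standard deviation, and quantiles instead. A small sketch with a hypothetical `mode` series shows the difference:

```python
import pandas as pd

# Hypothetical values for the `mode` variable.
mode = pd.Series([1, 0, 1, 1, 0, 1], name='mode')

# Numeric summary: count, mean, std, min, quartiles, max.
numeric_summary = mode.describe()

# Categorical summary: count, unique, top (most frequent value), freq.
categorical_summary = mode.astype('category').describe()

print(numeric_summary)
print(categorical_summary)
```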



### 3. Get specific statistics (count, min, max, mean, median) for all variables

Start by counting the non-missing values for each variable.

In [15]:
mss_df.count(axis=0)

duration              10000
key                   10000
loudness              10000
mode                  10000
tempo                 10000
time_signature        10000
artist_familiarity     9996
artist_hotttnesss     10000
song_hotttnesss        5648
title                 10000
artist_location       10000
release               10000
artist_longitude       3742
artist_latitude        3742
dtype: int64
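
The counts above show that `song_hotttnesss`, `artist_longitude`, and `artist_latitude` have many missing values. The share of missing values per column can be computed directly, as in this sketch on a hypothetical `toy_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two columns.
toy_df = pd.DataFrame({
    'duration': [218.9, 148.0, 177.5, 233.4],
    'song_hotttnesss': [0.6, np.nan, np.nan, 0.3],
    'artist_latitude': [np.nan, 35.1, np.nan, np.nan],
})

# isnull() marks missing cells; the column mean of those flags is the
# fraction of missing values. Sort so the sparsest columns come first.
missing_fraction = toy_df.isnull().mean().sort_values(ascending=False)

print(missing_fraction)
```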

Then compute the minimum for each variable.

In [16]:
mss_df.min(axis=0)

duration                                    1.04444
key                                               0
loudness                                    -51.643
mode                                              0
tempo                                             0
time_signature                                    0
artist_familiarity                                0
artist_hotttnesss                                 0
song_hotttnesss                                   0
title                                           b''
artist_location                                 b''
release               b' Lazy Afternoon En Anglais'
artist_longitude                           -162.436
artist_latitude                            -41.2805
dtype: object

Then compute the maximum of each variable.

In [17]:
mss_df.max(axis=0)

duration                                          1819.77
key                                                    11
loudness                                            0.566
mode                                                    1
tempo                                             262.828
time_signature                                          7
artist_familiarity                                      1
artist_hotttnesss                                  1.0825
song_hotttnesss                                         1
title                             b'\xc3\x9altimo Desejo'
artist_location       b'\xe5\x8f\xb0\xe5\x8c\x97, Taiwan'
release                                b'\xc3\x9cberfall'
artist_longitude                                  174.767
artist_latitude                                    69.651
dtype: object

Then compute the mean for each variable.

In [174]:
mss_df.mean(axis=0)

danceability            0.000000
duration              238.507518
energy                  0.000000
loudness              -10.485669
tempo                 122.915449
artist_familiarity      0.565456
artist_hotttnesss       0.385552
song_hotttnesss         0.342822
dtype: float64

Then compute the median for each variable.

In [175]:
mss_df.median(axis=0)

danceability            0.000000
duration              223.059140
energy                  0.000000
loudness               -9.380000
tempo                 120.161000
artist_familiarity      0.563666
artist_hotttnesss       0.380742
song_hotttnesss         0.360371
dtype: float64
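
The five statistics computed separately above can also be collected into a single table. This sketch assumes a pandas version that provides `DataFrame.agg` (newer than the 0.17.1 used in this notebook) and uses a hypothetical `toy_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one missing value.
toy_df = pd.DataFrame({
    'duration': [200.0, 240.0, 280.0],
    'tempo': [100.0, np.nan, 140.0],
})

# One row per statistic, one column per variable; missing values are skipped.
summary = toy_df.agg(['count', 'min', 'max', 'mean', 'median'])

print(summary)
```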

### 4. Grouped summary statistics

In [18]:
mss_df.groupby('mode').aggregate(np.mean)

| mode | duration | loudness | tempo | artist_familiarity | artist_hotttnesss | song_hotttnesss | artist_longitude | artist_latitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 246.955471 | -10.118045 | 123.735232 | 0.572518 | 0.38974 | 0.356552 | -61.690188 | 36.64921 |
| 1 | 234.731549 | -10.649985 | 122.549031 | 0.562297 | 0.38368 | 0.336759 | -64.832809 | 37.36111 |


In [19]:
mss_df.groupby('key').aggregate(np.mean)

| key | duration | loudness | tempo | artist_familiarity | artist_hotttnesss | song_hotttnesss | artist_longitude | artist_latitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 237.823426 | -11.058151 | 121.650326 | 0.54455 | 0.36831 | 0.315054 | -64.682113 | 36.677849 |
| 1 | 244.883665 | -10.001926 | 123.349775 | 0.584262 | 0.397954 | 0.362002 | -64.104942 | 37.781019 |
| 2 | 233.626999 | -10.83596 | 123.222114 | 0.562147 | 0.384628 | 0.338356 | -62.698095 | 38.701531 |
| 3 | 229.262293 | -11.512135 | 118.350529 | 0.55191 | 0.380443 | 0.330331 | -71.522321 | 36.867647 |
| 4 | 234.139991 | -10.557836 | 122.78616 | 0.561873 | 0.38492 | 0.35286 | -62.750859 | 38.83776 |
| 5 | 231.096605 | -11.09046 | 122.702699 | 0.556122 | 0.38817 | 0.324215 | -65.908903 | 35.337487 |
| 6 | 247.835534 | -9.542808 | 121.108782 | 0.577913 | 0.396189 | 0.375138 | -70.918426 | 35.933884 |
| 7 | 237.767248 | -10.401917 | 124.835913 | 0.567382 | 0.384556 | 0.333403 | -59.840107 | 35.158634 |
| 8 | 243.467648 | -10.590195 | 123.817277 | 0.563032 | 0.385343 | 0.372773 | -65.566909 | 38.338619 |
| 9 | 237.190127 | -10.236632 | 125.600899 | 0.566308 | 0.382939 | 0.338552 | -61.838658 | 37.842445 |
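
Group means are easier to interpret next to the group sizes and a measure of spread. The following sketch, on a hypothetical `toy_df`, computes several statistics per group in one call:

```python
import pandas as pd

# Hypothetical data: two songs in mode 0, three in mode 1.
toy_df = pd.DataFrame({
    'mode': [0, 0, 1, 1, 1],
    'duration': [250.0, 240.0, 230.0, 236.0, 239.0],
    'tempo': [124.0, 122.0, 120.0, 125.0, 121.0],
})

grouped = toy_df.groupby('mode')

# Number of rows in each group.
group_sizes = grouped.size()

# Several statistics of one variable per group.
group_stats = grouped['duration'].agg(['mean', 'median', 'std'])

print(group_sizes)
print(group_stats)
```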


### Display correlation matrix for all (numeric) variables

In [80]:
mss_df.corr()

| | duration | loudness | tempo | artist_familiarity | artist_hotttnesss | song_hotttnesss | artist_longitude | artist_latitude | k_0 | k_1 | ... | k_8 | k_9 | k_10 | k_11 | ts_0 | ts_1 | ts_3 | ts_4 | ts_5 | ts_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| duration | 1.0 | 0.057247 | -0.009968 | 0.032884 | 0.021068 | 0.007042 | 0.062514 | 0.045076 | -0.002227 | 0.016708 | ... | 0.010261 | -0.003933 | -0.00191 | 0.027872 | 0.084857 | -0.121058 | 0.014701 | 0.047694 | 0.040995 | 0.025494 |
| loudness | 0.057247 | 1.0 | 0.179354 | 0.239432 | 0.187415 | 0.226461 | 0.029833 | 0.059995 | -0.039393 | 0.026794 | ... | -0.004571 | 0.015713 | -0.007432 | 0.056148 | 0.011939 | -0.135549 | -0.102826 | 0.20435 | -0.034058 | -0.047553 |
| tempo | -0.009968 | 0.179354 | 1.0 | 0.065946 | 0.050259 | 0.079031 | -0.001399 | -0.043421 | -0.01336 | 0.003692 | ... | 0.006052 | 0.026005 | -0.012262 | -0.012792 | -0.060521 | -0.044191 | -0.064889 | 0.102971 | -0.063818 | 0.023827 |
| artist_familiarity | 0.032884 | 0.239432 | 0.065946 | 1.0 | 0.811102 | 0.543748 | 0.017805 | 0.012328 | -0.048511 | 0.035079 | ... | -0.003574 | 0.001814 | 0.002423 | 0.042646 | 0.00176 | -0.059041 | -0.056103 | 0.089193 | -0.003743 | -0.01381 |
| artist_hotttnesss | 0.021068 | 0.187415 | 0.050259 | 0.811102 | 1.0 | 0.522261 | 0.016598 | -0.000315 | -0.0446 | 0.025821 | ... | -0.000344 | -0.006198 | 0.001488 | 0.024685 | -0.01136 | -0.040882 | -0.041911 | 0.061987 | 0.001708 | -0.008857 |
| song_hotttnesss | 0.007042 | 0.226461 | 0.079031 | 0.543748 | 0.522261 | 1.0 | 0.008667 | 0.060418 | -0.041593 | 0.023159 | ... | 0.028748 | -0.005899 | 0.008314 | 0.015601 | 0.00836 | -0.044813 | -0.043264 | 0.073368 | -0.015544 | -0.011061 |
| artist_longitude | 0.062514 | 0.029833 | -0.001399 | 0.017805 | 0.016598 | 0.008667 | 1.0 | 0.018949 | -0.005769 | -0.000953 | ... | -0.007502 | 0.01443 | -0.00493 | -0.000662 | NaN | -0.018595 | -0.036527 | 0.034941 | 0.016886 | -0.008483 |
| artist_latitude | 0.045076 | 0.059995 | -0.043421 | 0.012328 | -0.000315 | 0.060418 | 0.018949 | 1.0 | -0.011962 | 0.011218 | ... | 0.017566 | 0.015282 | 0.005572 | 0.01212 | NaN | -0.0031 | -0.004581 | 7e-06 | 0.020737 | -0.011563 |
| k_0 | -0.002227 | -0.039393 | -0.01336 | -0.048511 | -0.0446 | -0.041593 | -0.005769 | -0.011962 | 1.0 | -0.111118 | ... | -0.087722 | -0.126582 | -0.101934 | -0.104878 | 0.011251 | 0.001109 | -0.015616 | 0.01185 | -0.004483 | -0.000158 |
| k_1 | 0.016708 | 0.026794 | 0.003692 | 0.035079 | 0.025821 | 0.023159 | -0.000953 | 0.011218 | -0.111118 | 1.0 | ... | -0.070611 | -0.101891 | -0.082051 | -0.084421 | 0.015854 | -0.015297 | -0.015192 | 0.003468 | 0.030392 | 0.008663 |
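
A large correlation matrix is hard to scan by eye. One option is to list the variable pairs ordered by absolute correlation. The sketch below does this on a hypothetical `toy_df` with synthetic data; the column names are borrowed from the dataframe, the values are random.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
base = rng.normal(size=200)

# Synthetic data: two strongly related columns and one unrelated column.
toy_df = pd.DataFrame({
    'artist_familiarity': base,
    'artist_hotttnesss': base + 0.5 * rng.normal(size=200),
    'tempo': rng.normal(size=200),
})

corr = toy_df.corr()

# Keep only the upper triangle (k=1 excludes the diagonal of ones),
# so that each pair of variables appears exactly once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order the pairs by absolute correlation, strongest first.
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)

print(ranked)
```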


### Display covariance matrix for all (numeric) variables

In [21]:
mss_df.cov()

| | duration | loudness | tempo | artist_familiarity | artist_hotttnesss | song_hotttnesss | artist_longitude | artist_latitude |
|---|---|---|---|---|---|---|---|---|
| duration | 13027.372003 | 35.282129 | -40.028926 | 0.601235 | 0.345422 | 0.193065 | 386.977898 | 86.174106 |
| loudness | 35.282129 | 29.157713 | 34.075157 | 0.207052 | 0.145371 | 0.289452 | 8.26342 | 5.132109 |
| tempo | -40.028926 | 34.075157 | 1237.94283 | 0.371603 | 0.254014 | 0.681782 | -2.433206 | -23.321705 |
| artist_familiarity | 0.601235 | 0.207052 | 0.371603 | 0.025652 | 0.018638 | 0.020232 | 0.125403 | 0.026814 |
| artist_hotttnesss | 0.345422 | 0.145371 | 0.254014 | 0.018638 | 0.020635 | 0.016453 | 0.095523 | -0.00056 |
| song_hotttnesss | 0.193065 | 0.289452 | 0.681782 | 0.020232 | 0.016453 | 0.061118 | 0.106265 | 0.234598 |
| artist_longitude | 386.977898 | 8.26342 | -2.433206 | 0.125403 | 0.095523 | 0.106265 | 2551.080602 | 14.929101 |
| artist_latitude | 86.174106 | 5.132109 | -23.321705 | 0.026814 | -0.00056 | 0.234598 | 14.929101 | 243.313007 |
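
The covariance and correlation matrices are linked: a correlation is the covariance divided by the product of the two standard deviations. As a rough check, the sketch below recomputes the `duration`/`loudness` correlation from three entries of the covariance matrix. Both variables have no missing values, so the two matrices are based on the same rows for this pair.

```python
import math

# Values copied from the covariance matrix above.
var_duration = 13027.372003          # cov(duration, duration)
var_loudness = 29.157713             # cov(loudness, loudness)
cov_duration_loudness = 35.282129    # cov(duration, loudness)

# correlation = covariance / (std_duration * std_loudness)
corr_duration_loudness = cov_duration_loudness / math.sqrt(var_duration * var_loudness)

print(round(corr_duration_loudness, 4))
```

The result should land close to the 0.057247 reported for this pair in the correlation matrix.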
