# 2 Data Wrangling and Exploratory Data Analysis<a id='2_Data_wrangling'></a>

## 2.1 Contents<a id='2.1_Contents'></a>
* [2 Data wrangling](#2_Data_wrangling)
  * [2.1 Contents](#2.1_Contents)
  * [2.2 Imports](#2.2_Imports)
  * [2.3 Load Spotify Chart Data](#2.3_Load_Spotify_Chart_Data)
  * [2.4 Load Spotify Track Feature Data](#2.4_Load_Spotify_Track_Feature_Data)
  * [2.5 Explore The Data](#2.5_Explore_The_Data)
    * [2.5.1 Average Streams per Position](#2.5.1_Average_Streams_per_Position)
    * [2.5.2 Categorical Features](#2.5.2_Categorical_Features)
      * [2.5.2.1 Unique Genres](#2.5.2.1_Unique_Genres)
      * [2.5.2.2 Unique Decades](#2.5.2.2_Unique_Decades)
    * [2.5.3 Non-Categorical Features](#2.5.3_Non-Categorical_Features)
      * [2.5.3.1 Numeric Data Summary](#2.5.2.1_Numeric_data_summary)
      * [2.5.3.2 Distributions of Feature Values](#2.5.2.2_Distributions_Of_Feature_Values)

## 2.2 Imports<a id='2.2_Imports'></a>

In [1]:
import pandas as pd
from pathlib import Path
import spotipy

## 2.3 Load Spotify Chart Data<a id='2.3_Load_Spotify_Chart_Data'></a>

Spotify chart data is not available through the Spotify API and must be downloaded from https://spotifycharts.com/regional. Each week has its own .csv file, and all of them have been downloaded to a folder for this project. We will use Python to combine and clean this data so that we can link it to track data in the Spotify API via the track ID.

In [2]:
source_files = sorted(Path(r"C:\Users\ashle\Documents\GitHub\Springboard\Capstone Project 3\Data").glob('*.csv'))

dataframes = []
for file in source_files:
    df = pd.read_csv(file) 
    df['source'] = file.name
    dataframes.append(df)

top_200_tracks = pd.concat(dataframes)
top_200_tracks.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Note that these figures are generated using a formula that protects against any artificial inflation of chart positions.,Unnamed: 4,source
0,Position,Track Name,Artist,Streams,URL,regional-global-weekly-2020-03-06--2020-03-13.csv
1,1,Blinding Lights,The Weeknd,41066317,https://open.spotify.com/track/0sf12qNH5qcw8qp...,regional-global-weekly-2020-03-06--2020-03-13.csv
2,2,The Box,Roddy Ricch,37470185,https://open.spotify.com/track/0nbXyq5TXYPCO7p...,regional-global-weekly-2020-03-06--2020-03-13.csv
3,3,Dance Monkey,Tones And I,36071262,https://open.spotify.com/track/1rgnBhdG2JDFTbY...,regional-global-weekly-2020-03-06--2020-03-13.csv
4,4,Don't Start Now,Dua Lipa,32169572,https://open.spotify.com/track/6WrI0LAC5M1Rw2M...,regional-global-weekly-2020-03-06--2020-03-13.csv


In [3]:
missing = pd.concat([top_200_tracks.isnull().sum(), 100 * top_200_tracks.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
Unnamed: 0,0,0.0
Unnamed: 1,0,0.0
Unnamed: 2,0,0.0
Note that these figures are generated using a formula that protects against any artificial inflation of chart positions.,0,0.0
Unnamed: 4,0,0.0
source,0,0.0


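There are no missing values in the data set, but because of the CSV formatting the column names don't make sense: each file starts with a note row, so pandas treats that row as the header and the real header ends up as the first data row. By referring to the head output above, we can create a dictionary that maps the current column names to the appropriate ones.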

In [4]:
top_200_tracks.rename(columns = {'Unnamed: 0':'Position','Unnamed: 1':'Track Name','Unnamed: 2':'Artist','Note that these figures are generated using a formula that protects against any artificial inflation of chart positions.':'Stream Count/Week','Unnamed: 4':'URL'}, inplace = True) 
top_200_tracks.head()

Unnamed: 0,Position,Track Name,Artist,Stream Count/Week,URL,source
0,Position,Track Name,Artist,Streams,URL,regional-global-weekly-2020-03-06--2020-03-13.csv
1,1,Blinding Lights,The Weeknd,41066317,https://open.spotify.com/track/0sf12qNH5qcw8qp...,regional-global-weekly-2020-03-06--2020-03-13.csv
2,2,The Box,Roddy Ricch,37470185,https://open.spotify.com/track/0nbXyq5TXYPCO7p...,regional-global-weekly-2020-03-06--2020-03-13.csv
3,3,Dance Monkey,Tones And I,36071262,https://open.spotify.com/track/1rgnBhdG2JDFTbY...,regional-global-weekly-2020-03-06--2020-03-13.csv
4,4,Don't Start Now,Dua Lipa,32169572,https://open.spotify.com/track/6WrI0LAC5M1Rw2M...,regional-global-weekly-2020-03-06--2020-03-13.csv
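The misaligned header could also be avoided at load time. The sketch below is a hedged alternative, not what the cells above do: it assumes each file starts with a single note row followed by the real header row, and it uses a small inline sample (`raw`, hypothetical and with a shortened note) in place of the downloaded files. Passing `header=1` to `pd.read_csv` takes the column names from the second line and discards the line before it.

```python
import io

import pandas as pd

# Hypothetical miniature of one weekly chart file: a note row, then the real header.
raw = (
    ",,,Note that these figures are generated using a formula,\n"
    "Position,Track Name,Artist,Streams,URL\n"
    "1,Blinding Lights,The Weeknd,41066317,https://open.spotify.com/track/0sf12qNH5qcw8qpgymFOqD\n"
    "2,The Box,Roddy Ricch,37470185,https://open.spotify.com/track/0nbXyq5TXYPCO7pr3N8S4I\n"
)

# header=1 uses the second line as column names and skips the note row above it.
sample = pd.read_csv(io.StringIO(raw), header=1)
print(list(sample.columns))
```

Loaded this way, the numeric columns are parsed as integers straight away and no repeated header rows need to be dropped later.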


Now that the column names have been fixed, we can assume each file contributed one row that repeats the column names, and these rows should not be included in the data set. We expect 52 files with 200 tracks each, or 10,400 rows. Once we remove the extra header rows, we will check the row count. If the count is not 10,400 but is divisible by 200, we will need to check the source folder for extra or missing files. If the count is not divisible by 200, we can assume there are additional rows that need to be removed from the data set.

In [5]:
top_200_tracks.drop(top_200_tracks[top_200_tracks['Position'] == 'Position'].index, inplace = True) 
top_200_tracks.head()

Unnamed: 0,Position,Track Name,Artist,Stream Count/Week,URL,source
1,1,Blinding Lights,The Weeknd,41066317,https://open.spotify.com/track/0sf12qNH5qcw8qp...,regional-global-weekly-2020-03-06--2020-03-13.csv
2,2,The Box,Roddy Ricch,37470185,https://open.spotify.com/track/0nbXyq5TXYPCO7p...,regional-global-weekly-2020-03-06--2020-03-13.csv
3,3,Dance Monkey,Tones And I,36071262,https://open.spotify.com/track/1rgnBhdG2JDFTbY...,regional-global-weekly-2020-03-06--2020-03-13.csv
4,4,Don't Start Now,Dua Lipa,32169572,https://open.spotify.com/track/6WrI0LAC5M1Rw2M...,regional-global-weekly-2020-03-06--2020-03-13.csv
5,5,La Difícil,Bad Bunny,29598307,https://open.spotify.com/track/6NfrH0ANGmgBXyx...,regional-global-weekly-2020-03-06--2020-03-13.csv


In [6]:
top_200_tracks.shape

(10600, 6)
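The row-count reasoning above can be written as a small check. This is only a sketch: `diagnose_row_count` is a hypothetical helper, and the defaults of 52 files and 200 rows per file are the expectations stated in the text.

```python
def diagnose_row_count(n_rows, expected_files=52, rows_per_file=200):
    """Classify a row count against the expected files x rows_per_file total."""
    expected = expected_files * rows_per_file
    files, leftover = divmod(n_rows, rows_per_file)
    if n_rows == expected:
        return "row count matches"
    if leftover == 0:
        # A whole number of files: the folder has extra or missing files.
        return f"{files} files' worth of rows: check the folder for extra or missing files"
    # Not a whole number of files: some rows do not belong in the data set.
    return f"{leftover} stray rows beyond {files} full files: look for rows to remove"


print(diagnose_row_count(10600))
# prints: 53 files' worth of rows: check the folder for extra or missing files
```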

We have 200 more rows than expected. The file names include the beginning and ending dates of each week, so let's create a new column for the week-ending date to see where the extra 200 rows come from.

In [7]:
top_200_tracks['Week Ending'] = top_200_tracks.source.str[35:45]
top_200_tracks.head()

Unnamed: 0,Position,Track Name,Artist,Stream Count/Week,URL,source,Week Ending
1,1,Blinding Lights,The Weeknd,41066317,https://open.spotify.com/track/0sf12qNH5qcw8qp...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13
2,2,The Box,Roddy Ricch,37470185,https://open.spotify.com/track/0nbXyq5TXYPCO7p...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13
3,3,Dance Monkey,Tones And I,36071262,https://open.spotify.com/track/1rgnBhdG2JDFTbY...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13
4,4,Don't Start Now,Dua Lipa,32169572,https://open.spotify.com/track/6WrI0LAC5M1Rw2M...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13
5,5,La Difícil,Bad Bunny,29598307,https://open.spotify.com/track/6NfrH0ANGmgBXyx...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13
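Fixed-position slicing works as long as every file name has the same prefix length. A hedged alternative is to match the date by pattern instead. The sketch below assumes the file names always end in `--YYYY-MM-DD.csv`; the two names in `source` follow the pattern seen above and stand in for the real `source` column.

```python
import pandas as pd

# Stand-in for the `source` column; both names follow the observed naming pattern.
source = pd.Series([
    "regional-global-weekly-2020-03-06--2020-03-13.csv",
    "regional-global-weekly-2020-03-13--2020-03-20.csv",
])

# Capture the date that follows the double hyphen, then parse it as a datetime.
week_ending = pd.to_datetime(
    source.str.extract(r"--(\d{4}-\d{2}-\d{2})\.csv$", expand=False)
)
print(week_ending.dt.strftime("%Y-%m-%d").tolist())
# prints: ['2020-03-13', '2020-03-20']
```

Parsing to a datetime also makes later date arithmetic and sorting more reliable than working with strings.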


In [8]:
top_200_tracks['Week Ending'].unique()

array(['2020-03-13', '2020-03-20', '2020-03-27', '2020-04-03',
       '2020-04-10', '2020-04-17', '2020-04-24', '2020-05-01',
       '2020-05-08', '2020-05-15', '2020-05-22', '2020-05-29',
       '2020-06-05', '2020-06-12', '2020-06-19', '2020-06-26',
       '2020-07-03', '2020-07-10', '2020-07-17', '2020-07-24',
       '2020-07-31', '2020-08-07', '2020-08-14', '2020-08-21',
       '2020-08-28', '2020-09-04', '2020-09-11', '2020-09-18',
       '2020-09-25', '2020-10-02', '2020-10-09', '2020-10-16',
       '2020-10-23', '2020-10-30', '2020-11-06', '2020-11-13',
       '2020-11-20', '2020-11-27', '2020-12-04', '2020-12-11',
       '2020-12-18', '2020-12-25', '2021-01-01', '2021-01-08',
       '2021-01-22', '2021-01-29', '2021-02-05', '2021-02-12',
       '2021-02-19', '2021-02-26', '2021-03-05', '2021-03-12',
       '2021-03-19'], dtype=object)

It appears we downloaded an additional week's worth of data and did not duplicate any weeks in the folder, since we have 53 unique week-ending dates. Note that the list jumps from 2021-01-08 to 2021-01-22, so the week ending 2021-01-15 is not in the folder. An additional week will help with the modelling, so we will leave the data as is.
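Counting unique dates does not show whether the weeks are contiguous. One way to check, sketched here on four of the dates from the output above, is to build the full weekly range between the first and last date and take the difference. The names `observed`, `expected`, and `missing_weeks` are illustrative only.

```python
import pandas as pd

# A few week-ending dates copied from the output above.
observed = pd.to_datetime(["2021-01-01", "2021-01-08", "2021-01-22", "2021-01-29"])

# Every seventh day between the earliest and latest observed date.
expected = pd.date_range(observed.min(), observed.max(), freq="7D")

# Dates that should be present but are not.
missing_weeks = expected.difference(observed)
print(missing_weeks.strftime("%Y-%m-%d").tolist())
# prints: ['2021-01-15']
```

Applied to the full `Week Ending` column, the same comparison would list any week that has no file in the folder.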

Next, we need a field containing only the track ID to link to the data in the Spotify API. Every URL starts with the same 31 characters (`https://open.spotify.com/track/`), and the remaining characters are the track ID. By creating a new column that slices off the first 31 characters, we can isolate the track ID. We can check the slice by comparing it to the URL column.

In [9]:
top_200_tracks['id'] = top_200_tracks.URL.str[31:]
top_200_tracks.head()

Unnamed: 0,Position,Track Name,Artist,Stream Count/Week,URL,source,Week Ending,id
1,1,Blinding Lights,The Weeknd,41066317,https://open.spotify.com/track/0sf12qNH5qcw8qp...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13,0sf12qNH5qcw8qpgymFOqD
2,2,The Box,Roddy Ricch,37470185,https://open.spotify.com/track/0nbXyq5TXYPCO7p...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13,0nbXyq5TXYPCO7pr3N8S4I
3,3,Dance Monkey,Tones And I,36071262,https://open.spotify.com/track/1rgnBhdG2JDFTbY...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13,1rgnBhdG2JDFTbYkYRZAku
4,4,Don't Start Now,Dua Lipa,32169572,https://open.spotify.com/track/6WrI0LAC5M1Rw2M...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13,6WrI0LAC5M1Rw2MnX2ZvEg
5,5,La Difícil,Bad Bunny,29598307,https://open.spotify.com/track/6NfrH0ANGmgBXyx...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13,6NfrH0ANGmgBXyxgV2PeXt
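The slice depends on the URL prefix being exactly 31 characters long. A slightly more defensive variant, sketched below, splits on the last `/` instead. The two URLs are rebuilt from the prefix and the IDs shown above, and the comparison confirms that both approaches agree on this sample.

```python
import pandas as pd

# URLs rebuilt from the common prefix and two track IDs shown in the table above.
urls = pd.Series([
    "https://open.spotify.com/track/0sf12qNH5qcw8qpgymFOqD",
    "https://open.spotify.com/track/0nbXyq5TXYPCO7pr3N8S4I",
])

# Take everything after the final slash rather than relying on a fixed offset.
track_ids = urls.str.rsplit("/", n=1).str[-1]

# Cross-check against the fixed-offset slice used in the notebook.
print((track_ids == urls.str[31:]).all())
# prints: True
```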


## 2.4 Load Spotify Track Feature Data<a id='2.4_Load_Spotify_Track_Feature_Data'></a>

Now that we have the list of all top 200 songs over the past 53 weeks, we can use those track IDs to pull their features from the Spotify API.

In [10]:
import os
import time
from spotipy.oauth2 import SpotifyClientCredentials

# Read the API credentials from environment variables instead of hard-coding them in the notebook.
client_id = os.environ["SPOTIPY_CLIENT_ID"]
client_secret = os.environ["SPOTIPY_CLIENT_SECRET"]
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [11]:
track_list = list(set(top_200_tracks['id']))

In [12]:
print(len(track_list))
print(track_list)

1193
['5UusfWUMMLEXLMc1ViNZoe', '0zzsyWc45eUcSjw4FNHxeN', '7yN9Qx1HtE4c1fhakBz8Ay', '21N4Buj4xsyLb218lYle61', '2b8fOow8UzyDFAE27YhOZM', '0qJeyYAgv6UpvewUxRXAhb', '0EhpEsp4L0oRGM0vmeaN5e', '6RsRMf9e4KUyo3MecGffNL', '3rRin3LyLY92kpEbkCgwf4', '4saklk6nie3yiGePpBwUoc', '6pcywuOeGGWeOQzdUyti6k', '380HmhwTE2NJgawn1NwkXi', '7sKbyYeJnITO1Eh9xd0lKd', '0HC6S4VpCGAZvyxTdrMRIQ', '6juLaduD4STCUDWT0AYun4', '4umIPjkehX1r7uhmGvXiSV', '09mEdoA6zrmBPgTEN5qXmN', '0XinBYhf1X3kdvKQHOX971', '1g3J9W88hTG173ySZR6E9S', '02FaKXXL7KUtRc7K0k54tL', '39LLxExYz6ewLAcYrzQQyP', '2zYzyRzz6pRmhPzyfMEC8s', '7sjFIZ1g5QLJLGja3k592K', '2Y0ktCGrGoGcQFXsGztvhi', '2Oycxb8QbPkpHTo8ZrmG0B', '2vBET2pmrQqafaS6zIaYta', '5gEUDNQvoQjdjklrwPdGwD', '0pgj4EzB1XRqgZemoMNG5D', '6EDO9iiTtwNv6waLwa1UUq', '2Fxmhks0bxGSBdJ92vM42m', '7BqBn9nzAq8spo5e7cZ0dJ', '6kls8cSlUyHW2BUOkDJIZE', '3w9VRlKPvNxj40RdUGRweH', '3xgT3xIlFGqZjYW9QlhJWp', '2hDe0Ls5mVqs1XJqv7sbcM', '2IKJtXeR5UsaUjZB46fTOK', '1AnkdcHl86kEhDvhaKDuIe', '0nrRP2bk19rLc0orkWPQk2', '2lCkn

In [13]:
def getTrackFeatures(track_id):
    """Return metadata and audio features for a single track ID as a flat list."""
    meta = sp.track(track_id)
    features = sp.audio_features(track_id)

    # meta
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']  # first artist credited on the album
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']

    # features (audio_features returns a list with one entry per requested track)
    acousticness = features[0]['acousticness']
    danceability = features[0]['danceability']
    energy = features[0]['energy']
    instrumentalness = features[0]['instrumentalness']
    liveness = features[0]['liveness']
    loudness = features[0]['loudness']
    speechiness = features[0]['speechiness']
    tempo = features[0]['tempo']
    time_signature = features[0]['time_signature']

    # Note: danceability appears twice here and in the column list below,
    # so the resulting data set carries a duplicate danceability column.
    track = [name, album, artist, release_date, length, popularity, danceability, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, time_signature]
    return track

# Loop over the track IDs, pausing briefly between requests.
tracks = []
for track_id in track_list:
    time.sleep(.5)
    tracks.append(getTrackFeatures(track_id))

# Create the data set and save it to a .csv file.
track_features = pd.DataFrame(tracks, columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'danceability', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature'])
track_features.to_csv("spotify.csv", sep=',')
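Requesting one track at a time means two API calls per track. A possible speed-up is to send IDs in batches. The sketch below only shows the batching itself and makes no network calls: `chunked` is a hypothetical helper, the IDs are placeholders, and the batch sizes in the commented usage (50 for `sp.tracks`, 100 for `sp.audio_features`) are assumptions about the API limits that should be checked against the spotipy documentation.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs standing in for the 1,193 entries of track_list.
ids = [f"id{i}" for i in range(1193)]
batches = list(chunked(ids, 100))
print(len(batches), len(batches[-1]))
# prints: 12 93

# With a live client, each batch could then be requested in one call, e.g.:
# for batch in chunked(track_list, 50):      # assumed limit for sp.tracks
#     metas = sp.tracks(batch)["tracks"]
#     feats = sp.audio_features(batch)       # assumed limit of 100 IDs per call
```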

In [14]:
track_features.head()

Unnamed: 0,name,album,artist,release_date,length,popularity,danceability,acousticness,danceability.1,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature
0,@ MEH,@ MEH,Playboi Carti,2020-04-16,166799,68,0.876,0.0136,0.876,0.492,0.000283,0.0678,-8.11,0.153,151.044,4
1,Sine From Above (with Elton John),Chromatica,Lady Gaga,2020-05-29,244880,65,0.642,0.158,0.642,0.792,1e-05,0.68,-5.746,0.0488,122.965,4
2,DEUX TOILES DE MER,QALF,Damso,2020-09-17,315640,67,0.521,0.582,0.521,0.39,7.4e-05,0.112,-9.726,0.128,104.106,4
3,Chica Ideal,Chica Ideal,Sebastian Yatra,2020-10-16,183240,88,0.574,0.0847,0.574,0.891,0.0,0.16,-3.665,0.157,100.978,4
4,Memories,Memories,Maroon 5,2019-09-20,189486,88,0.764,0.837,0.764,0.32,0.0,0.0822,-7.209,0.0546,91.019,4
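The output above contains the danceability values twice. If the duplicate is not wanted, one option is to keep only the first occurrence of each column name. This is a sketch on a one-row frame (`demo`, built from the first row above), not a step the notebook performs.

```python
import pandas as pd

# One-row frame reproducing the duplicated column name.
demo = pd.DataFrame(
    [[0.876, 0.0136, 0.876, 0.492]],
    columns=["danceability", "acousticness", "danceability", "energy"],
)

# columns.duplicated() flags the second and later occurrences of a name.
deduped = demo.loc[:, ~demo.columns.duplicated()]
print(list(deduped.columns))
# prints: ['danceability', 'acousticness', 'energy']
```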


Now that we have pulled all of the features for those tracks, we need to add the track id back to the data set for linking back to the top 200 lists.

In [15]:
track_features['id'] = track_list 

In [16]:
track_features.head()

Unnamed: 0,name,album,artist,release_date,length,popularity,danceability,acousticness,danceability.1,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,id
0,@ MEH,@ MEH,Playboi Carti,2020-04-16,166799,68,0.876,0.0136,0.876,0.492,0.000283,0.0678,-8.11,0.153,151.044,4,5UusfWUMMLEXLMc1ViNZoe
1,Sine From Above (with Elton John),Chromatica,Lady Gaga,2020-05-29,244880,65,0.642,0.158,0.642,0.792,1e-05,0.68,-5.746,0.0488,122.965,4,0zzsyWc45eUcSjw4FNHxeN
2,DEUX TOILES DE MER,QALF,Damso,2020-09-17,315640,67,0.521,0.582,0.521,0.39,7.4e-05,0.112,-9.726,0.128,104.106,4,7yN9Qx1HtE4c1fhakBz8Ay
3,Chica Ideal,Chica Ideal,Sebastian Yatra,2020-10-16,183240,88,0.574,0.0847,0.574,0.891,0.0,0.16,-3.665,0.157,100.978,4,21N4Buj4xsyLb218lYle61
4,Memories,Memories,Maroon 5,2019-09-20,189486,88,0.764,0.837,0.764,0.32,0.0,0.0822,-7.209,0.0546,91.019,4,2b8fOow8UzyDFAE27YhOZM


In [17]:
track_features.set_index(['id'])

id,name,album,artist,release_date,length,popularity,danceability,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature
5UusfWUMMLEXLMc1ViNZoe,@ MEH,@ MEH,Playboi Carti,2020-04-16,166799,68,0.876,0.013600,0.876,0.492,0.000283,0.0678,-8.110,0.1530,151.044,4
0zzsyWc45eUcSjw4FNHxeN,Sine From Above (with Elton John),Chromatica,Lady Gaga,2020-05-29,244880,65,0.642,0.158000,0.642,0.792,0.000010,0.6800,-5.746,0.0488,122.965,4
7yN9Qx1HtE4c1fhakBz8Ay,DEUX TOILES DE MER,QALF,Damso,2020-09-17,315640,67,0.521,0.582000,0.521,0.390,0.000074,0.1120,-9.726,0.1280,104.106,4
21N4Buj4xsyLb218lYle61,Chica Ideal,Chica Ideal,Sebastian Yatra,2020-10-16,183240,88,0.574,0.084700,0.574,0.891,0.000000,0.1600,-3.665,0.1570,100.978,4
2b8fOow8UzyDFAE27YhOZM,Memories,Memories,Maroon 5,2019-09-20,189486,88,0.764,0.837000,0.764,0.320,0.000000,0.0822,-7.209,0.0546,91.019,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6IBcOGPsniK3Pso1wHIhew,Forever After All,What You See Ain't Always What You Get (Deluxe...,Luke Combs,2020-10-23,232533,80,0.487,0.191000,0.487,0.650,0.000000,0.0933,-5.195,0.0253,151.964,4
5H4mXWKcicuLKDn4Jy0sK7,Time Flies,Dark Lane Demo Tapes,Drake,2020-05-01,192931,74,0.864,0.201000,0.864,0.477,0.000000,0.1820,-5.786,0.2240,86.460,4
5i7ThJfYLAzp2DyZuFpF6j,Heart Of Glass (Live from the iHeart Festival),Heart Of Glass / Midnight Sky,Miley Cyrus,2020-09-29,213671,76,0.580,0.000335,0.580,0.908,0.000048,0.0870,-5.303,0.0341,115.016,4
1xK1Gg9SxG8fy2Ya373oqb,Bandido,Bandido,Myke Towers,2020-12-10,232853,94,0.713,0.122000,0.713,0.617,0.000000,0.0962,-4.637,0.0887,168.021,4


We have everything we need to merge the data sets for analysis.

In [18]:
top_200_features = pd.merge(top_200_tracks, track_features, how='left', on='id')

In [19]:
top_200_features.head()

Unnamed: 0,Position,Track Name,Artist,Stream Count/Week,URL,source,Week Ending,id,name,album,...,danceability,acousticness,danceability.1,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature
0,1,Blinding Lights,The Weeknd,41066317,https://open.spotify.com/track/0sf12qNH5qcw8qp...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13,0sf12qNH5qcw8qpgymFOqD,Blinding Lights,Blinding Lights,...,0.513,0.00147,0.513,0.796,0.000209,0.0938,-4.075,0.0629,171.017,4
1,2,The Box,Roddy Ricch,37470185,https://open.spotify.com/track/0nbXyq5TXYPCO7p...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13,0nbXyq5TXYPCO7pr3N8S4I,The Box,Please Excuse Me For Being Antisocial,...,0.896,0.104,0.896,0.586,0.0,0.79,-6.687,0.0559,116.971,4
2,3,Dance Monkey,Tones And I,36071262,https://open.spotify.com/track/1rgnBhdG2JDFTbY...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13,1rgnBhdG2JDFTbYkYRZAku,Dance Monkey,Dance Monkey,...,0.825,0.688,0.825,0.593,0.000161,0.17,-6.401,0.0988,98.078,4
3,4,Don't Start Now,Dua Lipa,32169572,https://open.spotify.com/track/6WrI0LAC5M1Rw2M...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13,6WrI0LAC5M1Rw2MnX2ZvEg,Don't Start Now,Don't Start Now,...,0.794,0.0125,0.794,0.793,0.0,0.0952,-4.521,0.0842,123.941,4
4,5,La Difícil,Bad Bunny,29598307,https://open.spotify.com/track/6NfrH0ANGmgBXyx...,regional-global-weekly-2020-03-06--2020-03-13.csv,2020-03-13,6NfrH0ANGmgBXyxgV2PeXt,La Difícil,YHLQMDLG,...,0.685,0.0861,0.685,0.848,7e-06,0.0783,-4.561,0.0858,179.87,4
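A left merge silently keeps chart rows that find no match and can multiply rows if the feature table has repeated IDs. The sketch below shows how the `validate` and `indicator` options of `pd.merge` can guard against both. The frames `charts` and `features` are tiny hypothetical stand-ins for `top_200_tracks` and `track_features`.

```python
import pandas as pd

# Hypothetical stand-ins: many chart rows per track, one feature row per track.
charts = pd.DataFrame({"id": ["a", "a", "b"], "Position": [1, 2, 3]})
features = pd.DataFrame({"id": ["a", "b"], "tempo": [171.017, 116.971]})

# validate raises MergeError if `features` has duplicate ids;
# indicator adds a _merge column recording where each row came from.
merged = pd.merge(charts, features, how="left", on="id",
                  validate="many_to_one", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
# prints: 3 0
```

A non-zero count of `left_only` rows would indicate chart entries whose features were not retrieved.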


## 2.5 Explore the Data<a id='2.5_Explore_The_Data'></a>

### 2.5.1 Average Streams per Position<a id='2.5.1_Average_Streams_per_Position'></a>

While we would like to predict the chart position of a song, we will do so indirectly. By understanding the average number of streams at each chart position, we can infer a chart position from the stream count produced by our time series forecast.

In [27]:
top_200_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10600 entries, 0 to 10599
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Position           10600 non-null  object 
 1   Track Name         10600 non-null  object 
 2   Artist             10600 non-null  object 
 3   Stream Count/Week  10600 non-null  object 
 4   URL                10600 non-null  object 
 5   source             10600 non-null  object 
 6   Week Ending        10600 non-null  object 
 7   id                 10600 non-null  object 
 8   name               10600 non-null  object 
 9   album              10600 non-null  object 
 10  artist             10600 non-null  object 
 11  release_date       10600 non-null  object 
 12  length             10600 non-null  int64  
 13  popularity         10600 non-null  int64  
 14  danceability       10600 non-null  float64
 15  acousticness       10600 non-null  float64
 16  danceability       106

In [33]:
top_200_features["Stream Count/Week"] = pd.to_numeric(top_200_features["Stream Count/Week"])

In [36]:
top_200_features.groupby(['Position'])['Stream Count/Week'].describe()

Position,count,mean,std,min,25%,50%,75%,max
1,53.0,4.226770e+07,8.627148e+06,31544941.0,37363418.0,40267123.0,45847200.0,80764045.0
2,53.0,3.409511e+07,4.832902e+06,23461354.0,30463204.0,33897676.0,36899304.0,48011162.0
3,53.0,3.066208e+07,4.017041e+06,22418356.0,28540616.0,30727192.0,33715483.0,38503261.0
4,53.0,2.824718e+07,3.402336e+06,21761667.0,26145345.0,28367431.0,30596452.0,38006961.0
5,53.0,2.626384e+07,3.113873e+06,19989826.0,24320342.0,26181061.0,27982035.0,37323134.0
...,...,...,...,...,...,...,...,...
196,53.0,4.712784e+06,3.899717e+05,4244510.0,4427619.0,4598325.0,4894734.0,6286261.0
197,53.0,4.698442e+06,3.879080e+05,4218201.0,4423039.0,4582940.0,4886940.0,6278993.0
198,53.0,4.690774e+06,3.861216e+05,4209179.0,4407901.0,4570806.0,4867644.0,6264613.0
199,53.0,4.678467e+06,3.844950e+05,4206572.0,4394985.0,4557461.0,4864305.0,6257992.0
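To illustrate how a forecast stream count could be turned into a chart position, the sketch below picks the position whose mean weekly streams is closest to the forecast. It is one simple assumption about the mapping, not a method the notebook has settled on. `mean_streams` holds the means for positions 1 to 5 from the table above, and `infer_position` is a hypothetical helper.

```python
import pandas as pd

# Mean weekly streams for positions 1 to 5, taken from the describe() output above.
mean_streams = pd.Series(
    {1: 42_267_700, 2: 34_095_110, 3: 30_662_080, 4: 28_247_180, 5: 26_263_840},
    name="mean_streams",
)


def infer_position(forecast_streams, means=mean_streams):
    """Return the position whose mean stream count is nearest to the forecast."""
    return int((means - forecast_streams).abs().idxmin())


print(infer_position(31_000_000))
# prints: 3
```

With the full table, `means` would be the `mean` column of the grouped summary, indexed by position.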

