# Podcast Metadata Exploration
In this notebook, I explore the podcast metadata to see whether we can accurately use Scout to search for podcast episodes.

In [1]:
import pandas as pd

In [2]:
metadata = pd.read_csv('metadata.tsv', sep='\t')

In [3]:
metadata.head()

Unnamed: 0,show_uri,show_name,show_description,publisher,language,rss_link,episode_uri,episode_name,episode_description,duration,show_filename_prefix,episode_filename_prefix
0,spotify:show:2NYtxEZyYelR6RMKmjfPLB,Kream in your Koffee,A 20-something blunt female takes on the world...,Katie Houle,['en'],https://anchor.fm/s/11b84b68/podcast/rss,spotify:episode:000A9sRBYdVh66csG2qEdj,1: It’s Christmas Time!,On the first ever episode of Kream in your Kof...,12.700133,show_2NYtxEZyYelR6RMKmjfPLB,000A9sRBYdVh66csG2qEdj
1,spotify:show:15iWCbU7QoO23EndPEO6aN,Morning Cup Of Murder,Ever wonder what murder took place on today in...,Morning Cup Of Murder,['en'],https://anchor.fm/s/b07181c/podcast/rss,spotify:episode:000HP8n3hNIfglT2wSI2cA,The Goleta Postal Facility shootings- January ...,"See something, say something. It’s a mantra ma...",6.019383,show_15iWCbU7QoO23EndPEO6aN,000HP8n3hNIfglT2wSI2cA
2,spotify:show:6vZRgUFTYwbAA79UNCADr4,Inside The 18 : A Podcast for Goalkeepers by G...,Inside the 18 is your source for all things Go...,Inside the 18 GK Media,['en'],https://anchor.fm/s/81a072c/podcast/rss,spotify:episode:001UfOruzkA3Bn1SPjcdfa,Ep.36 - Incorporating a Singular Goalkeeping C...,Today’s episode is a sit down Michael and Omar...,43.616333,show_6vZRgUFTYwbAA79UNCADr4,001UfOruzkA3Bn1SPjcdfa
3,spotify:show:5BvKEjaMSuvUsGROGi2S7s,Arrowhead Live!,Your favorite podcast for everything @Chiefs! ...,Arrowhead Live!,['en-US'],https://anchor.fm/s/917dba4/podcast/rss,spotify:episode:001i89SvIQgDuuyC53hfBm,Episode 1: Arrowhead Live! Debut,Join us as we take a look at all current Chief...,58.1892,show_5BvKEjaMSuvUsGROGi2S7s,001i89SvIQgDuuyC53hfBm
4,spotify:show:7w3h3umpH74veEJcbE6xf4,FBoL,"The comedy podcast about toxic characters, wri...",Emily Edwards,['en'],https://www.fuckboisoflit.com/episodes?format=rss,spotify:episode:0025RWNwe2lnp6HcnfzwzG,"The Lion, The Witch, And The Wardrobe - Ashley...",The modern morality tail of how to stay good f...,51.78205,show_7w3h3umpH74veEJcbE6xf4,0025RWNwe2lnp6HcnfzwzG


In [4]:
metadata.columns

Index(['show_uri', 'show_name', 'show_description', 'publisher', 'language',
       'rss_link', 'episode_uri', 'episode_name', 'episode_description',
       'duration', 'show_filename_prefix', 'episode_filename_prefix'],
      dtype='object')

As the output above shows, the metadata table has columns such as show_uri, show_name, show_description, and duration, which will be important when searching for specific podcasts.

In [5]:
metadata.shape

(105360, 12)

There are currently 105,360 rows in Spotify's metadata file. Each row describes one episode together with the show it belongs to.
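One way to confirm what a row represents is to check which identifier column is unique across rows. The sketch below uses a small made-up table (`toy` and its values are hypothetical stand-ins, not the real metadata) with the same `show_uri` and `episode_uri` column names.

```python
import pandas as pd

# Hypothetical stand-in for the metadata table; the values are made up.
toy = pd.DataFrame({
    "show_uri": ["show:a", "show:a", "show:b"],
    "episode_uri": ["ep:1", "ep:2", "ep:3"],
})

rows = len(toy)

# If episode_uri never repeats, each row is exactly one episode.
one_row_per_episode = toy["episode_uri"].is_unique

# show_uri repeats whenever a show has more than one episode.
one_row_per_show = toy["show_uri"].is_unique

print(rows, one_row_per_episode, one_row_per_show)
```

Running the same two `is_unique` checks on `metadata` would show whether the row count should be read as a count of episodes or of shows.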

In [6]:
# how many episodes per show are there?
episodes = metadata['episode_uri'].nunique()

In [11]:
shows = metadata['show_uri'].nunique()

In [12]:
episodes/shows

5.733565520243796

There are about 5.7 episodes per show in this table, on average.
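A single ratio hides how unevenly episodes are spread across shows. As a minimal sketch, assuming one row per episode, the per-show counts can be computed with a group-by; the `toy` table below is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the metadata table: one row per episode.
toy = pd.DataFrame({
    "show_uri": ["show:a", "show:a", "show:a", "show:a", "show:b", "show:c"],
    "episode_uri": ["ep:1", "ep:2", "ep:3", "ep:4", "ep:5", "ep:6"],
})

# Count distinct episodes within each show.
episodes_per_show = toy.groupby("show_uri")["episode_uri"].nunique()

# The mean matches episodes / shows when every episode belongs to one show.
mean_episodes = episodes_per_show.mean()
# The median is less sensitive to a few shows with very many episodes.
median_episodes = episodes_per_show.median()

print(episodes_per_show.describe())
```

If the median on the real table turns out to be well below the mean, a handful of large shows are pulling the average up.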

In [13]:
metadata['language'].value_counts()

['en']       79452
['en-US']    20264
['en-GB']     2335
['en-AU']     1377
['en-CA']      962
['en-PH']      368
['en-IE']      243
['en-NZ']      141
['en-ZA']       86
['nl-BE']       39
['hi']          32
['id']          28
['ml']          11
['ms']           9
['pt']           5
['ga']           2
['en-IN']        2
['en-JM']        2
['ta']           1
['es']           1
Name: language, dtype: int64

Because the podcast data is meant to be English-only, I wanted to see how many English dialects appear in the metadata. A few rows also carry non-English tags (for example nl-BE, hi, and id). I wonder whether the different spellings across dialects will affect our search results at all.
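To group the dialects together, the language tags can be reduced to their base code. The sketch below assumes each value in the `language` column is a string that looks like a Python list (such as `"['en-US']"`), which is how it prints above; `primary_language` is a hypothetical helper and the sample values are a small hand-picked subset.

```python
import ast
import pandas as pd

# Stand-in for metadata['language']; assumed to be list-like strings.
languages = pd.Series(["['en']", "['en-US']", "['en-GB']", "['nl-BE']", "['hi']"])

def primary_language(raw):
    """Return the base code of the first tag, e.g. "['en-US']" -> 'en'."""
    tags = ast.literal_eval(raw)          # "['en-US']" -> ['en-US']
    return tags[0].split("-")[0].lower()  # 'en-US' -> 'en'

base = languages.map(primary_language)
counts = base.value_counts()
english_share = (base == "en").mean()

print(counts)
print(english_share)
```

Applied to the real column, `english_share` would give the fraction of episodes tagged with any English variant.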

In [14]:
metadata.dtypes

show_uri                    object
show_name                   object
show_description            object
publisher                   object
language                    object
rss_link                    object
episode_uri                 object
episode_name                object
episode_description         object
duration                   float64
show_filename_prefix        object
episode_filename_prefix     object
dtype: object

In [15]:
metadata['duration'].describe()

count    105360.000000
mean         33.845715
std          22.735674
min           0.175317
25%          13.552637
50%          31.643375
75%          50.446825
max         304.953900
Name: duration, dtype: float64

On average, an episode is about 34 minutes long, and the longest episode runs about 305 minutes.

In [16]:
# na=False keeps rows with a missing show_name out of the match
metadata[metadata['show_name'].str.contains('Dissect', regex=False, na=False)].shape

(117, 12)

In [17]:
metadata['show_name'].value_counts().head(50)

Chompers                                                                                               1072
Dr Berg’s Healthy Keto and Intermittent Fasting Podcast                                                 746
Coach Corey Wayne                                                                                       615
Optimal Living Daily: Personal Development & Minimalism                                                 424
Daily Fortnite                                                                                          417
The GaryVee Audio Experience                                                                            413
Everyday Positivity                                                                                     396
Pack-A-Day: Your Daily Packers Podcast                                                                  394
Self Improvement Daily                                                                                  376
The Orthobullets Podcast    

Looking at the output above, we can see the top 50 podcasts in the metadata, ranked by the number of episodes each show has. The show with the most episodes is Chompers, a podcast for young kids that makes brushing teeth fun and enjoyable.

In [18]:
# na=False keeps rows with a missing show_name out of the match
true_crime = metadata[metadata['show_name'].str.contains('Unsolved Murders: True Crime Stories', regex=False, na=False)]

In [19]:
true_crime['show_uri']

102       spotify:show:6Mk7Wk6cmrgXWnitO3NdYe
214       spotify:show:6Mk7Wk6cmrgXWnitO3NdYe
426       spotify:show:6Mk7Wk6cmrgXWnitO3NdYe
1651      spotify:show:6Mk7Wk6cmrgXWnitO3NdYe
1836      spotify:show:6Mk7Wk6cmrgXWnitO3NdYe
                         ...                 
103259    spotify:show:6Mk7Wk6cmrgXWnitO3NdYe
103727    spotify:show:6Mk7Wk6cmrgXWnitO3NdYe
104070    spotify:show:6Mk7Wk6cmrgXWnitO3NdYe
104346    spotify:show:6Mk7Wk6cmrgXWnitO3NdYe
105245    spotify:show:6Mk7Wk6cmrgXWnitO3NdYe
Name: show_uri, Length: 194, dtype: object

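Above, I wanted to explore the true crime genre, given its recent popularity. Filtering on the show name 'Unsolved Murders: True Crime Stories' returns 194 rows, and the rows displayed all share one show_uri, so these are 194 episodes of a single show rather than 194 separate shows.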
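A broader look at the genre could match a phrase in both the show name and the show description, then count distinct `show_uri` values rather than rows. This is only a sketch: the `toy` table, its show names, and its descriptions are made up, and a plain substring match on "true crime" is an assumed stand-in for a real genre label, which the metadata does not provide.

```python
import pandas as pd

# Hypothetical stand-in for the metadata table; all values are made up.
toy = pd.DataFrame({
    "show_uri": ["show:a", "show:a", "show:b", "show:c"],
    "show_name": ["Example True Crime Stories", "Example True Crime Stories",
                  "Daily Mystery", "Sports Talk"],
    "show_description": ["Cold cases revisited.", "Cold cases revisited.",
                         "A TRUE CRIME recap every morning.", None],
})

phrase = "true crime"

# Case-insensitive literal match; na=False treats missing text as no match.
in_name = toy["show_name"].str.contains(phrase, case=False, regex=False, na=False)
in_desc = toy["show_description"].str.contains(phrase, case=False, regex=False, na=False)

matches = toy[in_name | in_desc]
n_episodes = len(matches)               # matching rows (episodes)
n_shows = matches["show_uri"].nunique() # distinct shows among those rows

print(n_episodes, n_shows)
```

Counting `show_uri` with `nunique()` keeps the number of shows separate from the number of episodes those shows contribute.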

In [20]:
import json

In [None]:
transcript_example = 'spotify-podcasts-2020/podcasts-transcripts/0/0/show_002B8PbILr169CdsS9ySTH/399kdfMnjw0KYANZU7CQJ0.json'

In [None]:
with open(transcript_example, 'r') as transcript:
    tran = json.load(transcript)

In [None]:
tran.keys()

In [None]:
tran['results'][1].keys()

In [None]:
tran['results'][1]['alternatives'][0].keys()

In [None]:
tran['results'][1]['alternatives'][0]['transcript']
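The cells above drill down through `results`, `alternatives`, and `transcript` for one segment. As a rough sketch, the same path can be walked for every segment to build the full text of an episode. The JSON below is made up to mimic that nesting, `full_transcript` is a hypothetical helper, and the idea that some segments may lack a `transcript` key is an assumption handled defensively.

```python
import json

# Made-up JSON mimicking the nesting explored above; real files carry more fields.
raw = json.dumps({
    "results": [
        {"alternatives": [{"transcript": "Hello and welcome."}]},
        {"alternatives": [{}]},  # assumed case: a segment with no transcript
        {"alternatives": [{"transcript": "Today we talk about podcasts."}]},
    ]
})
tran = json.loads(raw)

def full_transcript(tran):
    """Join the first alternative's transcript from every result segment."""
    pieces = []
    for result in tran.get("results", []):
        alternatives = result.get("alternatives") or [{}]
        text = alternatives[0].get("transcript", "").strip()
        if text:
            pieces.append(text)
    return " ".join(pieces)

text = full_transcript(tran)
print(text)
```

Using `.get` with defaults means a segment without text is skipped instead of raising a `KeyError`.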