# Chai Time Data Science
Chai Time Data Science (CTDS.show) is a podcast series hosted by Sanyam Bhutani, featuring interviews with ML heroes from Kaggle, industry and research. This contest was organized to mark the first anniversary of the CTDS.show podcast. <br>

<b>Podcast related stats and content (subtitles) are provided by the organizers. The goal of this contest is to use these datasets and come up with interesting insights or stories. </b><br>

<b>Judging criteria for the contest - 5 categories: Presentation, Story telling, Visualizations, Insights and Innovation. </b> <br>
    
Let us dive in, explore the data, find insights and see whether we can pen a beautiful story out of one marvelous, sustained year of 85 episodes. The podcasts were made available through YouTube, Spotify, Apple and all other major podcast directories.<br>
Along the way we can get to know the ML heroes better, both through this competition and by deciphering the podcast content.



# 1. Importing Packages
**Importing all the necessary packages as the first step**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly as plty
import seaborn as sns
import plotly.graph_objs as go
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.io as pio
import spacy
import os
%matplotlib inline

# 2. Loading Datasets & Taking a Peek
We load the supplied datasets (described below) and take a peek. Later we shall explore the need for any external datasets.
* Episodes: Has the stats of all episodes of Chai Time Data Science show
* Youtube Thumbnail Types: Metadata of youtube thumbnail types used in CTDS show
* Anchor Thumbnail Types: Metadata of anchor thumbnail types used in CTDS show
* Description: Description of each episode
* Cleaned Subtitles: Subtitles of each episode (cleaned version)

In [None]:
path = '../input/chai-time-data-science/'

In [None]:
df_episodes = pd.read_csv(f'{path}Episodes.csv',parse_dates=['recording_date','release_date'])
df_yt = pd.read_csv(f'{path}YouTube Thumbnail Types.csv')
df_anchortn = pd.read_csv(f'{path}Anchor Thumbnail Types.csv')
df_desc = pd.read_csv(f'{path}Description.csv')

<h3>Let's take a peek at the datasets, primarily gathering info on the number of records and missing values, to get a feel for the different datasets.</h3>

In [None]:
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
pd.set_option('display.expand_frame_repr', False)

In [None]:
print('\033[33m' + 'Episodes Dataset - INFO' + '\033[0m')
df_episodes.info()
df_episodes.head()

In [None]:
print('\033[33m' + 'Youtube Thumbnail Types Dataset - INFO' + '\033[0m')
df_yt.info()
df_yt.head()

In [None]:
print('\033[33m' + 'Anchor Thumbnail Types Dataset - INFO' + '\033[0m')
df_anchortn.info()
df_anchortn.head()

In [None]:
print('\033[33m' + 'Description Dataset - INFO' + '\033[0m')
df_desc.info()
df_desc.head()

# 3. Beginning the Data exploration

# **Questions to ponder**
* What brings the audience?
    * Heroes?
    * Content/format?
    * Aesthetics like thumbnails, tea flavor, recording time, duration etc?
    * Subscribers?
    * Timeseries trend?
* Do viewers stay?
* How are the podcasts being received across multiple mediums?
* Let's find some patterns and insights...

* <b>Based on skimming through the available data, we can spot 2 different categories: Content Related (actual content - subtitles) & Non-Content Related (everything else..)
* Let's begin with the Non-Content related data..</b>

# 4 Non-Content Related
   # 4.1 **Episodes & Show**

In [None]:
print('\033[33m' + 'Episodes Dataset - Exploration stats' + '\033[0m')
df_episodes.describe(include='all').T

# What could we find on Episodes
**Evaluated using the primary YouTube, Spotify and Apple stats.**
(Rounded off)
* <u>Total</u> 
    * Episodes: 85 with 72 unique heroes
    * Duration of Videos: 27191 (hrs)
    * Youtube views: 43616
    * Spotify streams: 6720
    * Spotify listeners: 5455
    * Apple listeners: 1714
* <u>Average</u>
    * Episodes/month: 7
    * Duration of Videos: 3200 (hrs)
    * Youtube views: 513
    * Youtube watch duration: 5.3 minutes
    * Spotify streams: 80
    * Spotify listeners: 65
    * Apple listeners: 21
    * Apple Listen Duration: 29.33 minutes
    

* The highlight of this podcast series is reaching an audience across multiple platforms. Consistently producing podcasts with ML heroes over a whole year is not easy by any means.
* CTDS has given us close to 2 podcasts per week (1.77 to be exact) over the last year, which is monumental.
* The podcasts have reached over 55k viewers across platforms.
* YouTube is the most preferred platform and tops the list in terms of viewership. Spotify and Apple did not garner as much, with 5455 and 1714 unique listeners respectively.
* Though the Apple/Spotify podcasts had fewer views, they gave CTDS an audience that stayed for a significant duration. <u>While the YouTube average watch duration is just 5.3 minutes, the Apple average listen duration is a whopping 29.33 minutes. It looks like fewer distractions (or the format of the show) make audio podcasts a better stage for an engaged audience</u>, while YouTube, with its mass following, garnered more views.
* Should CTDS concentrate on reaching the audio podcast audience, where engagement is good, or adopt a different format for YouTube so that its larger viewership actually watches? Why didn't YouTube viewers stay glued to the videos until the end? Is it because the format is better suited to audio streaming? Let's explore further. <br>
* The initial take: since the episodes are an interview series, the format is best suited for audio-only podcasts.
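
The headline comparisons above can be re-derived with a few lines of arithmetic. The sketch below hard-codes the figures quoted in this section instead of reading `Episodes.csv`, and the dictionary names are purely illustrative. Views and unique listeners are not the same unit, so the combined figure is only a rough audience size.

```python
# Figures copied from the summary above; nothing here is read from Episodes.csv.
totals = {
    "youtube_views": 43616,
    "spotify_listeners": 5455,
    "apple_listeners": 1714,
}
avg_minutes = {"youtube_watch": 5.3, "apple_listen": 29.33}

# Share of the combined audience that came through YouTube.
audience = sum(totals.values())
youtube_share = totals["youtube_views"] / audience

# How much longer the average Apple listen is than the average YouTube watch.
engagement_ratio = avg_minutes["apple_listen"] / avg_minutes["youtube_watch"]

print(f"combined audience: {audience}")
print(f"YouTube share: {youtube_share:.1%}")
print(f"Apple listen / YouTube watch: {engagement_ratio:.2f}x")
```

Adding the 6720 Spotify streams on top of the listeners gives 57505, which may be how a figure above 55k is reached, although streams and listeners on the same platform overlap.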

<h3>We shall start by exploring the YouTube statistics of the episodes, as this medium contributes the most views and reaches the largest audience</h3>

In [None]:
fig = px.scatter(df_episodes, x='episode_id', y='youtube_subscribers', height=400, title='<b>Episodes Vs New Youtube Subscribers</b>', color='youtube_subscribers',
             color_continuous_scale='Viridis')
fig.update_layout(plot_bgcolor='rgb(255,255,255)')
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.data[0].update(mode='markers+lines')
fig.show()

<h3>The YouTube subscriber count did not increase steadily as episodes were released. There is a marginally increasing trend between E12 and E44, with peaks and valleys throughout the journey. Let's explore the trend further </h3>

In [None]:
df_episodes['yt_subs_cumulative'] = df_episodes['youtube_subscribers'].cumsum()
fig = make_subplots(rows=4, cols=1, subplot_titles=("<b>Youtube Views</b>", "<b>Youtube Subscribers</b>", "<b>Youtube Subscribers cumulative sum</b>", "<b>Episode Duration</b>"))

fig.append_trace(go.Scatter(name='youtube views', x=df_episodes.episode_id, y=df_episodes.youtube_views), row=1, col=1),
fig.append_trace(go.Scatter(name='youtube new subscribers', x=df_episodes.episode_id, y=df_episodes.youtube_subscribers), row=2, col=1),
fig.append_trace(go.Scatter(name='youtube subscribers cumulative sum', x=df_episodes.episode_id, y=df_episodes.yt_subs_cumulative), row=3, col=1)
fig.append_trace(go.Scatter(name='episode duration', x=df_episodes.episode_id, y=df_episodes.episode_duration), row=4, col=1)

fig.update_layout(height=1200, width=800, legend_orientation="h", plot_bgcolor='rgb(10,10,10)')
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

# <u>**Episodes vs youtube subscribers & views**</u>
* Episodes with the most views brought in more subscribers for the CTDS show. 3 major peaks (heroes below) were observed after E1, E27 and E49, and 2 minor peaks after E42 and E43.
    * Jeremy Howard, Parul Pandey and Abhishek Thakur brought in 139, 66 and 60 YouTube subscribers respectively. Expectations Vs Reality justified :)
* Episodes 25 to 45 seem to be the golden phase, with better viewership and subscriber peaks.
* Episodes 2 to 16 seem to be a dull phase, with almost flat stats on viewership as well as subscribers.
* The top 5 viewed episodes gained subscribers in almost the same relative order of magnitude.
* Subscribers increased over time, with no significant downward trend except during the fast.ai miniseries, which remained flat.
* The 2 most significant peaks in new subscribers came from the top 3 videos, each of which had more than 1500 views.
* Did the new subscribers bring in more views for the episodes that followed those peaks? Based on the trend, it seems not.
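
The last question above is judged by eye from the plots. One way to put a number on it is a lag-1 correlation: pair the subscribers gained on episode i with the views of episode i+1. The sketch below is a minimal illustration on made-up numbers (not the CTDS data), and `pearson` is a small helper written for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up per-episode numbers in release order (not the CTDS data).
new_subscribers = [139, 5, 4, 8, 66, 7, 6, 60, 9, 5]
views = [4502, 300, 280, 350, 2161, 320, 310, 1528, 400, 330]

# Same-episode relationship: do well-viewed episodes gain more subscribers?
same_episode = pearson(views, new_subscribers)

# Lag-1 relationship: do subscribers gained on episode i predict views on episode i+1?
lagged = pearson(new_subscribers[:-1], views[1:])

print(f"same episode r = {same_episode:.2f}")
print(f"lag-1 r        = {lagged:.2f}")
```

In the notebook the same idea could be applied to `df_episodes['youtube_subscribers']` and `df_episodes['youtube_views']`. A high same-episode value with a low lagged value would support the reading that subscriber peaks did not carry over to later episodes.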

In [None]:
df_episodes['release_dofweek'] = df_episodes['release_date'].dt.dayofweek

df_t = df_episodes.groupby(['release_dofweek'])['youtube_subscribers'].sum().reset_index()
df_t1 = df_episodes.groupby(['release_dofweek'])['episode_id'].count().reset_index()
df_t2 = df_episodes.groupby(['release_dofweek'])['youtube_views'].sum().reset_index()
print("Monday is represented as 0 index")
print('\033[33m' + 'Release Day of Week Vs youtube subscribers' + '\033[0m')
print(df_t)
print('\033[33m' + 'Release Day of Week Vs Episodes count' + '\033[0m')
print(df_t1)
print('\033[33m' + 'Release Day of Week Vs youtube  views' + '\033[0m')
print(df_t2)

In [None]:

fig = make_subplots(rows=2, cols=1, subplot_titles=("Youtube Views", "Youtube Subscribers"))

fig.append_trace(go.Bar(name='youtube views', x=df_episodes.release_dofweek, y=df_episodes.youtube_views), row=1, col=1),
fig.append_trace(go.Bar(name='youtube subscribers', x=df_episodes.release_dofweek, y=df_episodes.youtube_subscribers), row=2, col=1),

fig.update_layout(height=1000, width=800, title_text="<b>Episodes Day of Week (0-Monday) Vs Youtube Stats</b>", legend_orientation="h", plot_bgcolor='rgb(10,10,10)')
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

<h3>Did weekends see more views and subscribers on YouTube? The trend is not evident: the largest gains came on Sunday (466 subscribers & ~18k views) and Thursday (337 subscribers & ~16.3k views), mostly because more episodes were published on those days.</h3>
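
Because the weekday totals depend on how many episodes were released on each day, a per-episode mean is a fairer comparison. The sketch below shows the difference on made-up numbers (not the CTDS data); the variable names are illustrative.

```python
from collections import defaultdict

# Made-up (release weekday, views) pairs; 0 = Monday ... 6 = Sunday.
episodes = [(6, 900), (6, 300), (6, 300), (3, 400), (3, 500), (0, 800)]

views_by_day = defaultdict(list)
for weekday, views in episodes:
    views_by_day[weekday].append(views)

# The total rewards days with many releases; the mean is per episode.
summary = {
    day: {"episodes": len(v), "total": sum(v), "mean": sum(v) / len(v)}
    for day, v in sorted(views_by_day.items())
}
for day, stats in summary.items():
    print(day, stats)
```

Here Sunday wins on total views while Monday wins on views per episode. In the notebook, `df_episodes.groupby('release_dofweek')['youtube_views'].agg(['count', 'sum', 'mean'])` would give the same three columns in one call.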

# 4.2 **Heroes**

# **<u>Heroes & Gender</u>**

In [None]:
print('\033[33m' + 'Episodes that had missing values in heroes column' + '\033[0m')
df_episodes[df_episodes['heroes'].isnull()]

In [None]:
df = df_episodes.groupby('heroes_gender').agg({'episode_id':'size', 'youtube_views':'mean'}).reset_index()
fig = make_subplots(rows=1, cols=2, subplot_titles=("<b>Gender distribution</b>", "<b>Genderwise - youtube avg views/episode</b>"))

fig.append_trace(go.Bar(name='Gender distribution', x=df.heroes_gender, y=df.episode_id, showlegend=False), row=1, col=1),
fig.append_trace(go.Bar(name='youtube avg views/episode', x=df.heroes_gender, y=df.youtube_views, showlegend=False), row=1, col=2),

fig.update_layout(barmode='stack', height=500, width=800, legend_orientation="h", plot_bgcolor='rgb(255,255,255)')
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

* Out of 74 hero-related podcasts, <u>76 unique heroes were interviewed, with repeat appearances from Robert Bracco, Edouard Harris and Shivam Bansal. Two interviews featured multiple heroes in the same podcast.</u>
* The 11 missing values in the 'heroes' column denote episodes that weren't interviews with ML heroes: Sanyam's fast.ai course summaries, the channel intro and the AMA episode. They were all shorter videos, the exception being the AMA episode.
* ~88% of heroes were male and only ~12% were female. A gender imbalance is evident in the data. It could be due to the market, the preferences of CTDS or the availability of heroes for interviews.
* <u>Major pointer here: episodes featuring female heroes had very good average views per episode compared to episodes featuring male heroes. Note to the CTDS show: addressing the gender imbalance would be tangibly positive for the channel as well.</u>

# **<u>Location & Nationality</u>**

In [None]:
# Take labels and values from the same top-10 slice so they stay aligned
top_locations = df_episodes['heroes_location'].value_counts()[:10]
l = top_locations.index
v = top_locations.values

fig = go.Figure()
fig.add_trace(go.Pie(labels=l, values=v, textinfo='label+percent', showlegend=False))
fig.update_layout(height=500, width=600, title_text="<b>Heroes Residing Location</b>")
fig.show()

In [None]:
print('\033[33m' + 'Heroes location & nationality' + '\033[0m')
df_episodes.groupby(['heroes_location','heroes_nationality']).size()

* The heroes' location is predominantly the USA, which contributes ~53% of the podcasts, followed by Canada (8.57%), Germany (7.14%), France (4.29%) and the UK (4.29%).
* The locations and nationalities represented on the CTDS show are nicely diverse, covering 19 different countries and 21 different locations.
* Looking at nationalities residing in different locations, the USA is the most diverse country, hosting 8 different nationalities (excluding US). The rest of the data is too sparse to call out diverse nationalities.
* Heroes from India & Russia (3 each) most often reside outside their home country.
* Heroes from Vietnam, Switzerland, Greece and Ecuador, as well as African heroes, have also been residing outside their home country.
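
The "residing outside their home country" counts come from comparing the location and nationality columns row by row. A minimal sketch of that comparison is below. It uses made-up pairs and assumes both columns spell country names the same way, which may not hold for every row of the real data.

```python
from collections import Counter

# Made-up (location, nationality) pairs; assumes both use the same country names.
heroes = [
    ("USA", "USA"), ("USA", "India"), ("USA", "Russia"),
    ("Canada", "Canada"), ("Germany", "Russia"), ("UK", "India"),
]

# A hero counts as living abroad when location and nationality differ.
abroad = Counter(nat for loc, nat in heroes if loc != nat)

# Distinct foreign nationalities hosted by each location.
foreign_by_location = {}
for loc, nat in heroes:
    if loc != nat:
        foreign_by_location.setdefault(loc, set()).add(nat)

print(abroad.most_common())
print({loc: len(nats) for loc, nats in foreign_by_location.items()})
```

On the notebook's data the equivalent filter would be `df_episodes[df_episodes['heroes_location'] != df_episodes['heroes_nationality']]`, after dropping rows where either column is missing.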

# **<u>Category</u>**

In [None]:
fig, ax = plt.subplots(1,3, figsize = (20,6), sharex=True)
sns.countplot(x='category',data=df_episodes,ax=ax[0])
sns.countplot(x='category',hue='heroes_gender', data=df_episodes,ax=ax[1])
sns.countplot(x='category',hue='recording_time', data=df_episodes,ax=ax[2])
ax[0].title.set_text('Category count')
ax[1].title.set_text('Category Vs Gender')
ax[2].title.set_text('Category Vs Recording Time')
plt.show()

* Heroes from "Industry" top the podcasts (40%), closely followed by "Kaggle" (36%), with comparatively fewer researchers (10.5%).
* Male dominance is apparent here as well: the Kaggle heroes are all male, and within Industry the female to male ratio is 18% to 82%.
* <b><u>Better and best - the Research field has 33.3% female to 66.67% male.</u></b><br>
* Heroes from Industry seem to find night-time recording convenient, probably because of geography. The same is true for the "other" category.

# **4.3 Heroes Vs youtube, Apple and Spotify**

In [None]:
df_tmp = df_episodes.sort_values(by='heroes')
fig = px.bar(df_tmp, x='heroes', y='youtube_views', color='youtube_views',color_continuous_scale=["red", "green", "blue", "yellow"],
              title = '<b>Heroes Vs Youtube Views</b>', height=500)
fig.update_layout(height=800, plot_bgcolor='rgb(10,10,10)')
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

<h3>* Jeremy Howard's episode had the most views on YouTube (4502), followed by Parul Pandey with 2161 views and Abhishek Thakur with 1528 views<br>
* As we saw in section 4.1 above, those top 3 viewed videos also contributed significantly to the CTDS YouTube subscriber base </h3>

In [None]:
df_tmp = df_episodes.sort_values(by='youtube_views',ascending=False)


fig = go.Figure(data=[
    go.Bar(name='Spotify listeners', x=df_tmp.heroes, y=df_tmp.spotify_listeners, marker_color='rgb(0, 102, 57)'),
    go.Bar(name='apple listeners', x=df_tmp.heroes, y=df_tmp.apple_listeners, marker_color='rgb(255, 128, 0)')
])
fig.update_layout(barmode='stack', title='<b>heroes vs spotify-apple</b>', legend=dict(x=-.1, y=1.5),plot_bgcolor='rgb(20,20,20)' )
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

<h3>
* Relatively speaking, Spotify/Apple didn't garner as much attention as YouTube, but they had an engaged audience, especially Apple Podcasts, based on the available data<br>
* The Spotify average listening duration wasn't available, so the comparison can't be completed, but the audio format seems to have worked well with the audience that listened. <u>The Apple listen duration is 29.33 minutes compared to a mere 5.3 minutes on YouTube, about 5x more listening time</u><br>
* Spotify had more unique listeners (5455) than Apple (1714) <br>
* Abhishek Thakur's episode had the most Spotify listeners (456), followed by two surprises, Andrey Lukyanenko (251) and Ryan Chesler (214), whose episodes didn't make the top 5 on YouTube <br>
* In the Apple listener segment, Jeremy Howard's episode tops the list with 96 unique listeners
</h3>

In [None]:
print('\033[33m' + 'Average Stats grouped by heroes ' + '\033[0m')
df_tmp = df_episodes[['episode_duration','heroes','youtube_views','spotify_streams','spotify_listeners','apple_listeners']].sort_values(by='spotify_listeners', ascending=False)
df_tmp.fillna(0).groupby(['heroes']).mean().head()

In [None]:
print("Total videos that had more than avg youtube views:", len(df_episodes[df_episodes['youtube_views'] > 513]))

* Out of 85 videos, 29 had more than the average of 513 YouTube views
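
Fewer than half of the episodes sit above the mean because a handful of hits pull the mean upwards. The sketch below shows the effect on made-up view counts (not the CTDS data) and computes the threshold from the data instead of hard-coding it.

```python
from statistics import mean, median

# Made-up view counts (not the CTDS data); three hits dominate the total.
views = [4502, 2161, 1528, 600, 450, 300, 280, 250, 200, 150]

avg = mean(views)
mid = median(views)
above = [v for v in views if v > avg]

print(f"mean = {avg:.1f}, median = {mid:.1f}")
print(f"episodes above mean = {len(above)} of {len(views)}")
```

In the notebook, the threshold could be taken from `df_episodes['youtube_views'].mean()` so the count stays correct if the data changes. Comparing against the median as well gives a view of the "typical" episode that is less sensitive to the top hits.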

<h3>Did the shorter, host-as-hero episodes have a positive impact on viewership?</h3>

* The host-as-hero episodes, which are the shorter-duration ones, did not increase viewership.
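
The statement above is not backed by a cell in this notebook, so here is a rough sketch of how it could be checked: split episodes at a duration cut-off and compare mean views. The numbers are made up, the durations are assumed to be in seconds, and the 30-minute cut-off is an arbitrary choice for illustration.

```python
from statistics import mean

# Made-up (duration_seconds, views) pairs; not the CTDS data.
episodes = [(600, 200), (900, 250), (1200, 180), (3600, 700), (4200, 500), (5400, 900)]
SHORT_CUTOFF = 30 * 60  # arbitrary 30-minute threshold

short = [views for dur, views in episodes if dur < SHORT_CUTOFF]
long_ = [views for dur, views in episodes if dur >= SHORT_CUTOFF]

short_mean = mean(short)
long_mean = mean(long_)

print(f"short: n={len(short)}, mean views={short_mean:.0f}")
print(f"long:  n={len(long_)}, mean views={long_mean:.0f}")
```

An alternative split that matches the wording above more closely would be on whether the `heroes` column is missing, since those are the episodes without a guest.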

# **4.4 Youtube specifics**

<H3><u>Youtube Impressions & Non-Impressions</u></H3>

In [None]:
fig = go.Figure(data=[
    go.Bar(name='yt impression_views', x=df_episodes.heroes, y=df_episodes.youtube_impression_views),
    go.Bar(name='yt non impression_views', x=df_episodes.heroes, y=df_episodes.youtube_nonimpression_views)
])
fig.update_layout(barmode = 'group', title='<b>heroes vs youtube impressions and non-impressions</b>', legend=dict(x=-.1, y=1.5), plot_bgcolor='rgb(255,255,255)')
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

In [None]:
print("Total Youtube Impression views:", df_episodes.youtube_impression_views.sum())
print("Total Youtube Non Impression views:", df_episodes.youtube_nonimpression_views.sum())

<h3>* Non-impression views on YouTube outnumber impression views (almost twice as many), which indicates that the CTDS brand and Sanyam's network have driven viewership more than YouTube's internal recommendations<br>
* There are a few episodes, such as those with Abhishek Thakur, Julien Chaumond and Chip Huyen, where YouTube impressions overtook external sources, probably because these heroes are so popular, and their networks could have contributed to the views as well.</h3>
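
Both observations above reduce to simple comparisons between the two view columns. The sketch below illustrates them on made-up numbers with placeholder hero names; none of the values come from the CTDS data.

```python
# Made-up per-episode counts: (hero, impression_views, nonimpression_views).
episodes = [
    ("Hero A", 300, 900),
    ("Hero B", 800, 500),
    ("Hero C", 150, 450),
]

total_imp = sum(imp for _, imp, _ in episodes)
total_non = sum(non for _, _, non in episodes)
overall_ratio = total_non / total_imp

# Episodes where impression views exceed non-impression views.
impression_led = [hero for hero, imp, non in episodes if imp > non]

print(f"non-impression / impression = {overall_ratio:.2f}")
print("impression-led episodes:", impression_led)
```

On the real data, the same filter would be `df_episodes[df_episodes.youtube_impression_views > df_episodes.youtube_nonimpression_views]`, which lists the impression-led episodes without reading them off the bar chart.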

**<H3><u>Impact of youtube thumbnails on viewership</u></H3>**

In [None]:
df = df_episodes.groupby('youtube_thumbnail_type').agg({'youtube_views':'sum', 'episode_id':'count','youtube_impression_views': 'sum','youtube_nonimpression_views': 'sum'}).reset_index()
fig = make_subplots(rows=1, cols=4, subplot_titles=("youtube views", "Episodes count","yt impression views" , "yt nonimpression views"))

fig.append_trace(go.Bar(name='youtube views', x=df.youtube_thumbnail_type, y=df.youtube_views, showlegend=False), row=1, col=1),
fig.append_trace(go.Bar(name='# of episodes', x=df.youtube_thumbnail_type, y=df.episode_id, showlegend=False), row=1, col=2),
fig.append_trace(go.Bar(name='youtube impression views', x=df.youtube_thumbnail_type, y=df.youtube_impression_views, showlegend=False), row=1, col=3),
fig.append_trace(go.Bar(name='youtube nonimpression views', x=df.youtube_thumbnail_type, y=df.youtube_nonimpression_views, showlegend=False), row=1, col=4),

fig.update_layout(barmode='stack', height=400, width=900, title = '<b>Youtube Thumbnail Type  Vs  views-episodes-impressions-nonimpressions</b>', legend_orientation="h", plot_bgcolor='rgb(255,255,255)')
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

# Did youtube thumbnails change any audience behavior?
> *0 - default, 1 - default+custom annotation, 2 - custom image+annotation, 3 - custom image+CTDS branding+title/tags*
* Comparing the graphs against the episode counts, the more professional cosmetic changes to the thumbnail did not have a positive impact on episode viewership, nor did they draw more YouTube impressions. In fact, episodes with the full CTDS-branded thumbnail had relatively fewer views per episode than those with the default thumbnail.

**The natsort package is used in this kernel for a couple of sorting needs; the library is handy for natural sorting tasks.**

# **5 Content Related**
# 5.1 Questions from Host Vs Duration

In [None]:
sub_path = '../input/chai-time-data-science/Cleaned Subtitles/'

In [None]:
df_e27 = pd.read_csv(f'{sub_path}E27.csv')
df_e1 = pd.read_csv(f'{sub_path}E1.csv')
df_e49 = pd.read_csv(f'{sub_path}E49.csv')
df_e33 = pd.read_csv(f'{sub_path}E33.csv')
df_e38 = pd.read_csv(f'{sub_path}E38.csv')
df_e26 = pd.read_csv(f'{sub_path}E26.csv')
df_e60 = pd.read_csv(f'{sub_path}E60.csv')
df_e35 = pd.read_csv(f'{sub_path}E35.csv')
df_e34 = pd.read_csv(f'{sub_path}E34.csv')
df_e25 = pd.read_csv(f'{sub_path}E25.csv')

In [None]:
def questions(df):
    df_qt = df[df['Text'].str.contains(r"\?") & df['Speaker'].str.contains("Sanyam Bhutani")]
    df_ttemp = df_qt['Text']
    return df_ttemp

In [None]:
d = questions(df_e27)
d.head()

In [None]:
type(d)

In [None]:
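# 'en' is a spaCy 2 shortcut; spaCy 3 needs the full model name (assumes en_core_web_sm is installed)
nlp = spacy.load('en_core_web_sm', disable=['ner'])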

In [None]:
def word_count(df,e):
    df['tokens'] = df['Text'].apply(lambda x: nlp(x))
    df['Word_count'] = [len(token) for token in df.tokens]
    df_t = df.groupby(['Speaker'])['Word_count'].sum().reset_index()
    df_t['Episode'] = e
    return df_t

In [None]:
def q_count(df):
    df_qt = df[df['Text'].str.contains(r"\?") & df['Speaker'].str.contains("Sanyam Bhutani")]
    length = len(df_qt)
    return length

In [None]:
def c_count(df,e):
    # Total characters spoken per speaker, tagged with the episode id
    df_ct = df.groupby('Speaker').agg({'char_count':'sum'}).reset_index()
    df_ct['episode_id'] = e
    return df_ct

In [None]:
!pip install natsort

In [None]:
ss_list = []
for f_name in os.listdir(f'{sub_path}'):
    ss_list.append(f_name)

In [None]:
from natsort import natsorted
s_list = natsorted(ss_list)

In [None]:
# Collect one row per episode and build the frame once
# (DataFrame.append is not available in pandas 2.x)
rows = []
for f_name in s_list:
    Episodes = pd.read_csv(f'{sub_path}{f_name}')
    ep_id = f_name.split('.')[0]
    rows.append({'episode': ep_id, 'q_count': q_count(Episodes)})
df_qct = pd.DataFrame(rows, columns=['episode', 'q_count'])

In [None]:
# Per-speaker character counts for every episode, concatenated once at the end
frames = []
for f_name in s_list:
    Episodes = pd.read_csv(f'{sub_path}{f_name}')
    ep_id = f_name.split('.')[0]
    Episodes['char_count'] = Episodes['Text'].apply(len)
    frames.append(c_count(Episodes, ep_id))
df_lct = pd.concat(frames, ignore_index=True)[['episode_id', 'Speaker', 'char_count']]

In [None]:
# Label the host's rows 'Host' and every other speaker 'Heroes'
df_lct['speaker_g'] = np.where(df_lct['Speaker'] == 'Sanyam Bhutani', 'Host', 'Heroes')

<H1> <u>Questions from Host across Episodes </u> </H1>

In [None]:
fig = go.Figure(data = go.Scatter(x=df_qct.episode, y=df_qct.q_count, mode='markers+lines'))
fig.update_layout(title = '<b>Episodes Vs Questions</b>', height=700, width=900, xaxis_title="Episodes", yaxis_title="# of questions",legend_orientation="h", plot_bgcolor='rgb(255,255,255)')
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

<H2>Questions being asked by the host and related inference</H2>
* The number of questions asked by the host stays close to a constant mean of about 20 per episode, with a slight dip after E53. Probably the comments, suggestions and learning over time brought standardization and maturity in asking just the right questions. <br>
* E69, the birthday AMA episode, had the peak question count, as expected <br>
* There were very few episodes with a considerable dip in questions.
    * E74 - there weren't any questions from the host. On checking the subtitles, data seems to be missing; watching the podcast confirmed that the data is in fact missing. Will try to extract the data for further analysis later...
    * E25 and E15 - the question count is low. Need to verify whether subtitle data is missing here as well.
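
The question counts above rely on a simple heuristic: a host row counts as a question if its text contains a "?". The sketch below shows that heuristic on a made-up transcript shaped like the subtitle files (Speaker, Text), and contrasts it with counting every question mark. The rows and the "Guest" speaker are invented for illustration.

```python
import re

# Made-up transcript rows shaped like the subtitle files: (Speaker, Text).
rows = [
    ("Sanyam Bhutani", "Welcome to the show. How did you get started?"),
    ("Guest", "Good question. Why not? I started with Kaggle."),
    ("Sanyam Bhutani", "That makes sense."),
    ("Sanyam Bhutani", "What would you do differently? And why?"),
]

HOST = "Sanyam Bhutani"
question_mark = re.compile(r"\?")  # raw string so "\?" matches a literal "?"

# Row-level count: one per host turn that contains at least one "?".
host_question_rows = sum(1 for s, t in rows if s == HOST and question_mark.search(t))

# Mark-level count: every "?" the host utters.
host_question_marks = sum(len(question_mark.findall(t)) for s, t in rows if s == HOST)

print(host_question_rows, host_question_marks)
```

The row-level count is what `q_count` computes. It undercounts turns that bundle several questions, and either variant returns zero when the host's rows are missing from a subtitle file, which is one way an episode can show no questions at all.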

<H1> <u>Episode Introduction Duration and its effect </u> </H1>

In [None]:
# Intro duration: timestamp (in seconds) of the second subtitle row of each episode
rows = []
for f_name in s_list:
    Episodes = pd.read_csv(f'{sub_path}{f_name}')
    Episodes['Duration_Sec'] = Episodes['Time'].str.split(':').apply(lambda t: int(t[0]) * 60 + int(t[1]))
    ep_id = f_name.split('.')[0]
    rows.append({'episode': ep_id, 'intro_duration': Episodes['Duration_Sec'][1]})
df_dur = pd.DataFrame(rows, columns=['episode', 'intro_duration'])

In [None]:
fig = make_subplots(rows=2, cols=1, subplot_titles=("<b>Episode Intro Duration</b>", "<b>Youtube Views</b>"))

fig.append_trace(go.Scatter(name='<b>Episode Intro Duration</b>', x=df_dur.episode, y=df_dur.intro_duration, marker_color='rgb(0, 102, 57)'), row=1, col=1),
fig.append_trace(go.Scatter(name='<b>Youtube Views</b>', x=df_episodes.episode_id, y=df_episodes.youtube_views, marker_color='rgb(0, 76, 153)'), row=2, col=1),

fig.update_layout(height=800, width=900, legend_orientation="h", plot_bgcolor='rgb(255,255,255)')
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

* The introduction duration was on the rise until E35; later it dipped and hovered around 150 seconds. Probably based on comments, suggestions and learning over time, the introduction was made crisper.
* How did it correlate with youtube views?
    * It doesn't look like it had a significant impact on pulling in the audience; at best, views were marginally better after E23 (subscribers also improved over time). After Episode 60 the branding and thumbnails were completely revamped, and that did not have a positive impact either.
    * The introduction is key for the audience to begin listening to the podcast or viewing the YouTube video, as it sets the stage for what is on the table and why to stay tuned.
    * As noted earlier, since YouTube is a visual medium, showing pictures related to the heroes along with the intro might work better. Just a thought.
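
The intro duration is read from the timestamp of the second subtitle row, so it depends on parsing the `Time` strings correctly. The sketch below is a small helper that handles both `M:SS` and `H:MM:SS`. Whether the subtitle files use the longer form past the one-hour mark is an assumption here, and `to_seconds` is a name made up for this example.

```python
def to_seconds(stamp):
    """Convert 'M:SS' or 'H:MM:SS' to a number of seconds."""
    seconds = 0
    for part in stamp.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

# Made-up 'Time' values in the order they might appear in a subtitle file.
times = ["0:13", "2:45", "59:58", "1:00:05"]
parsed = [to_seconds(t) for t in times]

# The notebook treats the second row's timestamp as the end of the intro.
intro_duration = parsed[1]
print(parsed, intro_duration)
```

For the intro alone the two-part form is enough, since the second row falls within the first few minutes. The three-part handling only matters if the same parser is reused for timestamps later in an episode.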

# Flavour of Tea and its effects

In [None]:
data = [dict(type = 'bar',x = df_episodes.flavour_of_tea, y = df_episodes.youtube_views, mode = 'markers',
             transforms = [dict(type = 'aggregate',groups = df_episodes.flavour_of_tea,
                                aggregations = [dict(target = 'y', func = 'avg', enabled = True),])])]

layout = dict(title = '<b>Tea Flavour vs Mean Youtube views</b>',xaxis = dict(title = 'Tea Flavour'),yaxis = dict(title = 'Mean Youtube views'))


fig_dict = dict(data=data,layout=layout)

pio.show(fig_dict, validate=False)

* The visualization above shows the chai consumed by the host before each episode against average YouTube views per episode. <br>
* Did it have any effect on the content or audience? Let's find out.
* <u>Episodes where the Sulemani chai variety was consumed got the most average views per episode at ~1k, followed by Ginger Chai with 720 views/episode. Let's explore other parallels that affected the viewership or the energy level of the host across episodes.</u>

In [None]:
fig = px.scatter(df_episodes, x = df_episodes.episode_id, y=df_episodes.flavour_of_tea, title='<b>Flavors of Tea across Episodes</b>', color=df_episodes.flavour_of_tea)
fig.update_layout(plot_bgcolor='rgb(60,60,60)', xaxis={'categoryorder':'category ascending'})
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.update_xaxes(title_text='Episodes')
fig.show()

In [None]:
yy = df_lct[df_lct['speaker_g'].str.contains("Host")]
fig = px.bar(yy, x = yy.episode_id, y=yy.char_count, title='<b>Episodes Vs Conversation Text length of Host</b>')
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.update_xaxes(title_text='Episodes')
fig.update_yaxes(title_text='Host - Text Length')
fig.update_layout(plot_bgcolor='rgb(255,255,255)')
fig.show()

<b><u>Tea Flavour Vs Conversation text length of host across episodes</u></b>

In [None]:
tea = pd.merge(df_episodes,yy, how='inner',on='episode_id')
tea['char_count'] = tea['char_count'].astype(str).astype(int)
tt = tea.groupby('flavour_of_tea')['char_count'].sum()
tt

# Insights
* The host's energy level or interaction was highest in episodes E69 (Masala Chai), E63 (Paan Rose Green Tea), E48 (Ginger Chai), E44 (Herbal Tea) and E35 (Ginger Chai) <br>
* After grouping the host's character count by tea flavour, as shown in the table above:<br>
    * Masala Chai tops the energy level with ~181k characters uttered by the host
    * Closely followed by Ginger Chai with ~178k
* The host could consume more Masala and Ginger chai (possibly his favorites) to be more energetic and interactive and to bring more liveliness to the podcasts when the situation calls for it, as not all episodes need a lot of talking by the host :)

<H1> <u> Word Count of speakers across Episodes </u> </H1>
<H3>(10 Most and Least viewed Episodes)</H3>

In [None]:
# Word counts per speaker for the 10 most viewed episodes
top_episodes = {'E27': df_e27, 'E1': df_e1, 'E49': df_e49, 'E33': df_e33, 'E38': df_e38,
                'E26': df_e26, 'E60': df_e60, 'E35': df_e35, 'E34': df_e34, 'E25': df_e25}
# Build the frame in one step with pd.concat (DataFrame.append is not available in pandas 2.x)
df_wordct = pd.concat([word_count(frame, ep) for ep, frame in top_episodes.items()], ignore_index=True)

In [None]:
df_e14 = pd.read_csv(f'{sub_path}E14.csv')
df_e20 = pd.read_csv(f'{sub_path}E20.csv')
df_e7 = pd.read_csv(f'{sub_path}E7.csv')
df_e3 = pd.read_csv(f'{sub_path}E3.csv')
df_e16 = pd.read_csv(f'{sub_path}E16.csv')
df_e10 = pd.read_csv(f'{sub_path}E10.csv')
df_e12 = pd.read_csv(f'{sub_path}E12.csv')
df_e2 = pd.read_csv(f'{sub_path}E2.csv')
df_e8 = pd.read_csv(f'{sub_path}E8.csv')
df_e5 = pd.read_csv(f'{sub_path}E5.csv')

In [None]:
# Word counts per speaker for the 10 least viewed episodes
low_episodes = {'E14': df_e14, 'E20': df_e20, 'E7': df_e7, 'E3': df_e3, 'E16': df_e16,
                'E10': df_e10, 'E12': df_e12, 'E2': df_e2, 'E8': df_e8, 'E5': df_e5}
# Build the frame in one step with pd.concat (DataFrame.append is not available in pandas 2.x)
df_wordct_low = pd.concat([word_count(frame, ep) for ep, frame in low_episodes.items()], ignore_index=True)

In [None]:
df_wordct

In [None]:
fig = px.bar(df_wordct, x='Episode', y='Word_count', color='Speaker', color_discrete_sequence=px.colors.qualitative.Prism,
              title = '<b>Top Viewed Episodes - word count host Vs heros</b>', height=500, width=800)
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.update_layout(plot_bgcolor='rgb(0,0,0)')
fig.show()

<h3>In the top viewed episodes, the heroes speak more and are more engaged than the host, which possibly shows that the heroes are opening up gladly and expressing their thoughts, that the questions are good, and other factors. The host to hero word count ratio here is 1 : 4.12</h3>

In [None]:
fig = px.bar(df_wordct_low, x='Episode', y='Word_count', color='Speaker',color_discrete_sequence=px.colors.qualitative.Prism,
              title = '<b>Least Viewed Episodes - Word Count, Host vs Heroes</b>', height=500, width=800)
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.update_layout(plot_bgcolor='rgb(0,0,0)')
fig.show()

<h3>In the least viewed episodes, the heroes speak much less relative to the host than in the top viewed episodes. <br>
    The host-to-hero word count ratio here is just 1 : 2.75.</h3>

# Insights from subtitles/content of the podcasts
* <h3>What do we infer from the top 10 viewed videos?</h3>
    * The word count ratio of Sanyam (host) : heroes is 1 : 4.12. This shows good interaction between host and heroes: the heroes spoke about 4x as many words as the host, which suggests they were opening up and engaging more in the podcast. (This excludes one outlier, E49, where Sanyam talked more than the hero; the ratio there was -1.6.)<br>
    * <u>For the top viewed video, between Jeremy and Sanyam, this word count ratio was a whopping 5.8</u>. This was the highest ratio in the set and definitely the more engaging episode, with more vocal contribution from the hero.<br>
    
* <h3>What do we infer from the 10 least viewed videos?</h3>
    * The word count ratio of Sanyam (host) : heroes is just 1 : 2.75. This suggests the heroes weren't as engaged in these interviews, which might be due to multiple factors. I will explore this further. <br>
    * The host's average word count in the least viewed videos (2115 words) is almost the same as in the most viewed videos (2211 words), so the lower ratio comes mainly from the heroes speaking less.
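The per-episode view behind these points (one ratio per episode, an outlier where the host out-talks the hero, and the host's average word count) can be sketched as below. This is an illustration under assumptions: the helper name `per_episode_ratio`, the host label `"Sanyam Bhutani"`, and the toy numbers are hypothetical, and the "ratio below 1" outlier rule is a stand-in for however E49 was actually identified.

```python
import pandas as pd

def per_episode_ratio(df, host="Sanyam Bhutani"):
    """One row per episode: host words, hero words, hero words per host word."""
    role = df["Speaker"].eq(host).map({True: "host", False: "hero"})
    wide = df.assign(Role=role).pivot_table(
        index="Episode", columns="Role", values="Word_count", aggfunc="sum"
    )
    wide["hero_per_host_word"] = wide["hero"] / wide["host"]
    return wide

# Made-up numbers for illustration only
toy = pd.DataFrame({
    "Episode": ["E1", "E1", "E2", "E2", "E3", "E3"],
    "Speaker": ["Sanyam Bhutani", "Hero A",
                "Sanyam Bhutani", "Hero B",
                "Sanyam Bhutani", "Hero C"],
    "Word_count": [2000, 8000, 3000, 1500, 1000, 5000],
})

ratios = per_episode_ratio(toy)

# Assumed outlier rule: drop episodes where the host spoke more than the hero
kept = ratios[ratios["hero_per_host_word"] >= 1]

print(kept)
print("mean hero words per host word:", kept["hero_per_host_word"].mean())
print("mean host word count:", kept["host"].mean())
```

Keeping the host's average word count next to the ratio makes it easier to see whether a change in the ratio comes from the host or from the heroes.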

# Concluding Statements
* Appreciation goes to CTDS for hosting such an informative, unique podcast series with top ML/Data Science researchers from across the world.
* Though not exponential, YouTube views and subscriptions are increasing marginally over time.
* The podcasts appear to work better as audio episodes than as audio/video (YouTube), as the Apple/Spotify stats versus the YouTube stats show. The former had solid listenership in terms of listening duration, though the audience count was lower.
* For YouTube videos, better editing, exciting visuals, or showing some of the heroes' great work visually (snapshots etc.) might help engage people, as one format might not work well for every medium.
* YouTube non-impression views are higher than impression views (almost 2x). It looks like CTDS has a good network and reputation within a circle that has driven the non-impression views. To reach a larger crowd, organic growth is important.
* The most viewed videos had better engagement from the heroes: the average host-to-hero word count ratio is 1 : 4.12, whereas for the least viewed videos it dropped to just 1 : 2.75.
* YouTube still has the better reach to the masses, so the CTDS show needs to focus more on visuals, SEO, tags, and more engaging questions. Views are still not organic, as they are driven mostly by non-impressions rather than impressions.
* Masala and ginger flavour chai seem to positively impact the host's interaction with the heroes in the podcasts.
* Spotify and Apple are the better suited podcast mediums; marketing or word of mouth will help bring in the audience. Once there, listeners seem to listen through more of each episode than YouTube viewers do.
* Overall, the contribution of CTDS to the Machine Learning/Data Science community is immense. Sanyam's release of podcasts has been consistent over the year, which shows his dedication. The fact that well-respected heroes are willing to give interviews to the CTDS channel speaks to the reputation Sanyam has in the AI community.

All of the above is just my humble opinion based on the data and inference. I truly value all the hard work put into running these podcasts. Kudos.
Hopefully this competition will grow the audience of these podcasts exponentially.

<h3>I am not sure whether Sanyam et al. will find good insights and value in this kernel, but I have learned a lot and will cherish this Kaggle kernel, as it is my first Kaggle submission :)</h3>