# Decoding Yoga Video Trends: Insights from YouTube Data

*This project uses the dataset provided by Ulrike Herold on [Kaggle](https://www.kaggle.com/datasets/ulrikeherold/youtube-channel-yoga-with-kassandra/data) under the CC BY-NC-SA 4.0 license.*


In [2]:
# necessary imports
import pandas as pd

#### **Impact of Video Titles:** Do video titles significantly impact engagement metrics? If so, can NLP techniques reveal patterns in successful titles?

Plan of action:
* I will extract the keywords from each title and compute per-keyword metrics: frequency, plus the mean and standard deviation of the engagement ratio.
* Save the resulting dataframe.
* Use it in later analysis to understand how title keywords relate to the channel's engagement.

In [3]:
df = pd.read_csv("youtube_yoga.csv")
df.head()


Unnamed: 0,channelTitle,videoId,videoTitle,yogaSubject,YogaChallenge,release_date,release_time,duration,viewCount,likeCount,commentCount,defaultLAudioLanguage,videoCategoryLabel
0,Yoga with Kassandra,6INXj5B5uqE,"15 min Morning Yoga Flow - All Levels, No Props",Morning Yoga,False,2024-03-11,10:15:00,17:08,72861,4585,282.0,en,Howto & Style
1,Yoga with Kassandra,I9YJA0zL4yg,"25 min Earth Element Yoga - Grounding, Strengt...",Zodiac yoga,False,2024-03-04,11:15:01,25:23,69360,4239,266.0,en,Howto & Style
2,Yoga with Kassandra,XcXDqvQXgIs,"25 min Fire Element Yoga Flow - Core, Twists &...",Zodiac yoga,False,2024-02-26,11:15:01,24:52,63923,3142,206.0,en,Howto & Style
3,Yoga with Kassandra,C2RAjUEAoLI,10 min Gentle Morning Yoga for Beginners (NO P...,Morning Yoga,False,2024-02-19,11:15:00,11:18,161923,6662,180.0,en,Howto & Style
4,Yoga with Kassandra,bN5l0JJTOLw,20 min Intermediate Yoga Flow - Minimal Cues S...,Intermediate Yoga,False,2024-02-12,11:15:04,21:20,72164,4237,279.0,en,Howto & Style


In [4]:
# any null values?
df.isna().sum()

channelTitle             0
videoId                  0
videoTitle               0
yogaSubject              0
YogaChallenge            0
release_date             0
release_time             0
duration                 0
viewCount                0
likeCount                0
commentCount             2
defaultLAudioLanguage    0
videoCategoryLabel       0
dtype: int64

In [5]:
# assumption: the 2 missing comment counts mean those videos have no comments
df['commentCount'] = df['commentCount'].fillna(0)

# introduce the engagement ratio column = (likes + comments) / views

df['engagement_ratio'] = (df['likeCount']+df['commentCount'])/df['viewCount']

# the dataset is ordered newest-first by release date;
# rearrange it in chronological order

df = df.sort_values('release_date', ascending=True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,channelTitle,videoId,videoTitle,yogaSubject,YogaChallenge,release_date,release_time,duration,viewCount,likeCount,commentCount,defaultLAudioLanguage,videoCategoryLabel,engagement_ratio
0,Yoga with Kassandra,YgeUkk1V_YA,Yoga Timelapse - Fidelity,Other,False,2013-04-21,20:24:47,03:44,17409,337,32.0,en-US,Sports,0.021196
1,Yoga with Kassandra,BbRroBq1I78,Intermediate Yoga Core Workout - 30 min Vinyas...,Intermediate Yoga,False,2014-04-05,17:08:58,24:24,37100,614,23.0,en-US,Sports,0.01717
2,Yoga with Kassandra,Apn9IOQMrns,1 Hour Yin Yoga - Beginners Full Body Yoga Str...,Beginner Yoga,False,2014-04-06,18:25:20,01:11,310255,2710,186.0,en,Howto & Style,0.009334
3,Yoga with Kassandra,4EUUNRfdlhw,Twist & Detox The Body - 30 min Yoga Class,Vinyasa Flow Yoga,False,2014-04-06,03:32:06,31:54,27571,589,27.0,en-US,Sports,0.022342
4,Yoga with Kassandra,ebV0f8h800U,Yoga For Sleep - Restorative Yoga For A Good N...,Vinyasa Flow Yoga,False,2014-04-09,13:36:00,25:10,47471,626,32.0,en-US,Sports,0.013861
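
As a quick sanity check on the formula, the sketch below recomputes the ratio for one video from the dataset preview (`6INXj5B5uqE`: 72,861 views, 4,585 likes, 282 comments) and for a second, hypothetical row whose comment count is missing. The frame name `toy` and the second row are illustrative assumptions only.

```python
import pandas as pd

# First row is taken from the dataset preview; the second row is hypothetical
# and only illustrates how a missing comment count is handled.
toy = pd.DataFrame({
    "viewCount": [72861, 1000],
    "likeCount": [4585, 40],
    "commentCount": [282.0, None],
})

# Treat a missing comment count as zero comments (same assumption as in the notebook).
toy["commentCount"] = toy["commentCount"].fillna(0)

# Engagement ratio = (likes + comments) / views
toy["engagement_ratio"] = (toy["likeCount"] + toy["commentCount"]) / toy["viewCount"]

print(toy["engagement_ratio"].round(6).tolist())
```

The first value should come out as roughly 0.066798, which is (4585 + 282) / 72861.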


In [6]:
# any null values left after adding engagement_ratio?
df.isna().sum()

channelTitle             0
videoId                  0
videoTitle               0
yogaSubject              0
YogaChallenge            0
release_date             0
release_time             0
duration                 0
viewCount                0
likeCount                0
commentCount             0
defaultLAudioLanguage    0
videoCategoryLabel       0
engagement_ratio         1
dtype: int64

In [16]:
# why is there a missing engagement ratio? 

df[df['engagement_ratio'].isna()]

Unnamed: 0,channelTitle,videoId,videoTitle,yogaSubject,YogaChallenge,release_date,release_time,duration,viewCount,likeCount,commentCount,defaultLAudioLanguage,videoCategoryLabel,engagement_ratio
595,Yoga with Kassandra,MJR2QvajPP0,Yoga with Kassandra Live Stream,Other,False,2021-09-07,12:13:45,00:00,0,0,0.0,en,Howto & Style,
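
The missing ratio comes from a 0/0 division: this live stream has zero views, likes, and comments. The sketch below shows how pandas treats division by zero on a small frame; all values in it are hypothetical.

```python
import pandas as pd

# Hypothetical counts: row 0 mimics the live stream (everything is zero),
# row 1 has interactions but zero views, row 2 is an ordinary video.
toy = pd.DataFrame({
    "viewCount": [0, 0, 200],
    "likeCount": [0, 3, 10],
    "commentCount": [0.0, 1.0, 2.0],
})

# 0/0 becomes NaN and x/0 becomes inf; no exception is raised.
ratio = (toy["likeCount"] + toy["commentCount"]) / toy["viewCount"]
print(ratio)

# Keeping only videos with at least one view removes both problem cases.
valid = toy[toy["viewCount"] > 0].reset_index(drop=True)
print(len(valid))
```

Filtering on `viewCount > 0` is slightly more defensive than dropping NaN ratios, because it also excludes any row that would produce an infinite ratio.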


In [7]:
# drop videos with zero views: their engagement ratio is undefined (0/0)
df = df[df['viewCount']>0].reset_index(drop=True)
df.tail()

Unnamed: 0,channelTitle,videoId,videoTitle,yogaSubject,YogaChallenge,release_date,release_time,duration,viewCount,likeCount,commentCount,defaultLAudioLanguage,videoCategoryLabel,engagement_ratio
847,Yoga with Kassandra,bN5l0JJTOLw,20 min Intermediate Yoga Flow - Minimal Cues S...,Intermediate Yoga,False,2024-02-12,11:15:04,21:20,72164,4237,279.0,en,Howto & Style,0.06258
848,Yoga with Kassandra,C2RAjUEAoLI,10 min Gentle Morning Yoga for Beginners (NO P...,Morning Yoga,False,2024-02-19,11:15:00,11:18,161923,6662,180.0,en,Howto & Style,0.042255
849,Yoga with Kassandra,XcXDqvQXgIs,"25 min Fire Element Yoga Flow - Core, Twists &...",Zodiac yoga,False,2024-02-26,11:15:01,24:52,63923,3142,206.0,en,Howto & Style,0.052376
850,Yoga with Kassandra,I9YJA0zL4yg,"25 min Earth Element Yoga - Grounding, Strengt...",Zodiac yoga,False,2024-03-04,11:15:01,25:23,69360,4239,266.0,en,Howto & Style,0.064951
851,Yoga with Kassandra,6INXj5B5uqE,"15 min Morning Yoga Flow - All Levels, No Props",Morning Yoga,False,2024-03-11,10:15:00,17:08,72861,4585,282.0,en,Howto & Style,0.066798


The data is now ready for extracting keywords from the titles.

For this part, I will be using a separate dataframe with only `videoTitle` and `engagement_ratio` columns.
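
When a column subset is taken from a dataframe and new columns are added to it afterwards, pandas versions without Copy-on-Write may emit a `SettingWithCopyWarning`. Taking an explicit `.copy()` is one common way to make the intent unambiguous. A minimal sketch on a toy frame (the titles and ratios are copied from the preview above; `df_toy`, `subset`, and `title_length` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for the full dataframe.
df_toy = pd.DataFrame({
    "videoTitle": [
        "Yoga Timelapse - Fidelity",
        "Twist & Detox The Body - 30 min Yoga Class",
    ],
    "viewCount": [17409, 27571],
    "engagement_ratio": [0.021196, 0.022342],
})

# .copy() makes the subset an independent dataframe, so new columns can be
# added to it without touching (or appearing to touch) df_toy.
subset = df_toy[["videoTitle", "engagement_ratio"]].copy()
subset["title_length"] = subset["videoTitle"].str.len()  # hypothetical extra column

print(subset.columns.tolist())
print(df_toy.columns.tolist())  # the original frame is unchanged
```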

In [26]:
# take an explicit copy so that columns can be added later without a SettingWithCopyWarning
df_title = df[['videoTitle','engagement_ratio']].copy()
df_title.head()

Unnamed: 0,videoTitle,engagement_ratio
0,Yoga Timelapse - Fidelity,0.021196
1,Intermediate Yoga Core Workout - 30 min Vinyas...,0.01717
2,1 Hour Yin Yoga - Beginners Full Body Yoga Str...,0.009334
3,Twist & Detox The Body - 30 min Yoga Class,0.022342
4,Yoga For Sleep - Restorative Yoga For A Good N...,0.013861


In [27]:
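# function to clean up text and remove any special characters
import re

def clean_text(content):
    '''
    Clean the text by replacing @mentions, URLs, and any character
    other than ASCII letters, digits, spaces, or tabs with a space,
    then collapsing repeated whitespace.
    '''
    # raw string so that \w, \/ and \S are passed to the regex engine unchanged
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", content).split())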

In [28]:
df_title['clean_titles'] = [clean_text(title) for title in df_title['videoTitle']] 
df_title.head()



Unnamed: 0,videoTitle,engagement_ratio,clean_titles
0,Yoga Timelapse - Fidelity,0.021196,Yoga Timelapse Fidelity
1,Intermediate Yoga Core Workout - 30 min Vinyas...,0.01717,Intermediate Yoga Core Workout 30 min Vinyasa ...
2,1 Hour Yin Yoga - Beginners Full Body Yoga Str...,0.009334,1 Hour Yin Yoga Beginners Full Body Yoga Stretch
3,Twist & Detox The Body - 30 min Yoga Class,0.022342,Twist Detox The Body 30 min Yoga Class
4,Yoga For Sleep - Restorative Yoga For A Good N...,0.013861,Yoga For Sleep Restorative Yoga For A Good Nig...
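
To see what each branch of the pattern does, here is a self-contained sketch of the same cleaning function applied to one title from the dataset and to one made-up string that contains an @mention and a URL (the second input is hypothetical and does not come from the data):

```python
import re

def clean_text(content):
    """Replace @mentions, URLs and non-alphanumeric characters with spaces."""
    pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
    return " ".join(re.sub(pattern, " ", content).split())

# Title taken from the dataset: "&" and "-" are replaced, extra spaces collapse.
print(clean_text("Twist & Detox The Body - 30 min Yoga Class"))
# Twist Detox The Body 30 min Yoga Class

# Hypothetical string exercising the @mention and URL branches.
print(clean_text("Flow with @yogi123 https://example.com/class"))
# Flow with
```

Because the character class is ASCII-only, accented letters are removed as well, so a word such as "Café" would come out as "Caf". That is worth keeping in mind if titles contain non-English terms.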


In [29]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
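import nltk

# Note: the NLTK 'stopwords' and 'wordnet' corpora must be downloaded once,
# e.g. with nltk.download('stopwords') and nltk.download('wordnet')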

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Function to lemmatize and preprocess text
def preprocess_title(title):
    tokens = title.lower().split()  # Tokenize and convert to lowercase
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and word.isalnum()]  # Lemmatize and remove stopwords
    return ' '.join(tokens)

# Apply preprocessing to the titles
df_title['lemmatized_titles'] = df_title['clean_titles'].apply(preprocess_title)

# Tokenize and preprocess titles using CountVectorizer
vectorizer = CountVectorizer(stop_words='english')  # Also drops scikit-learn's built-in English stop words
X = vectorizer.fit_transform(df_title['lemmatized_titles'])  # Create word frequency matrix
keywords = vectorizer.get_feature_names_out()  # Extract unique tokens (keywords)

# Convert word matrix to DataFrame
keywords_df = pd.DataFrame(X.toarray(), columns=keywords)

# Combine with the original DataFrame
df_keywords = pd.concat([df_title, keywords_df], axis=1)





In [30]:
# Calculate engagement ratio metrics for each keyword
keyword_impact = {}

for keyword in keywords:
    # engagement ratios of the titles in which the keyword occurs exactly once
    # (titles that repeat the keyword have a count above 1 and are not selected)
    engagement = df_keywords.loc[df_keywords[keyword] == 1, 'engagement_ratio']
    keyword_impact[keyword] = [engagement.count(), engagement.mean(), engagement.std()]

# Convert to DataFrame for visualization
keyword_impact_df = pd.DataFrame.from_dict(keyword_impact, orient='index', columns=['Count', 'Mean Engagement Ratio', 'Standard Deviation'])
keyword_impact_df = keyword_impact_df.reset_index().rename(columns={'index': 'Keyword'}).sort_values(by='Count', ascending=False)
keyword_impact_df.reset_index(drop=True, inplace=True)
keyword_impact_df = keyword_impact_df[keyword_impact_df['Count']>0]

In [31]:
keyword_impact_df

Unnamed: 0,Keyword,Count,Mean Engagement Ratio,Standard Deviation
0,min,468,0.029511,0.010132
1,yoga,419,0.030532,0.011864
2,stretch,225,0.028922,0.008648
3,morning,193,0.031467,0.008101
4,30,165,0.028615,0.010441
...,...,...,...,...
813,eat,1,0.035006,
814,muladhara,1,0.013469,
815,erin,1,0.030720,
816,moving,1,0.029030,
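
Two properties of this table are easy to miss. First, the `== 1` filter only selects titles in which a keyword occurs exactly once, so titles that repeat a word (for example "yoga" twice) are not counted for it; a `> 0` filter would include them. Second, keywords that appear in a single title have no standard deviation. The sketch below illustrates both on a hand-built stand-in for `df_keywords`; all numbers in it are hypothetical.

```python
import pandas as pd

# Hand-built stand-in for df_keywords: one count column per keyword.
toy = pd.DataFrame({
    "engagement_ratio": [0.02, 0.03, 0.04],
    "yoga":    [1, 2, 1],   # the second title contains "yoga" twice
    "morning": [0, 0, 1],   # appears in a single title only
})

exactly_once = toy.loc[toy["yoga"] == 1, "engagement_ratio"]
at_least_once = toy.loc[toy["yoga"] > 0, "engagement_ratio"]
print(exactly_once.count(), at_least_once.count())  # 2 3

# Series.std() uses the sample standard deviation (ddof=1) by default,
# so a keyword seen in only one title has an undefined (NaN) deviation.
single = toy.loc[toy["morning"] > 0, "engagement_ratio"]
print(single.std())  # nan
```

Switching the notebook to `> 0` would change the counts and means reported above, so the table as shown should be read as "titles containing the keyword exactly once".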


In [32]:
# our dataset is ready now; we can save it 

keyword_impact_df.to_csv("title_words.csv",index=False)
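
When the file is loaded again for the analysis step, `index=False` matters: without it the index would come back as an extra `Unnamed: 0` column. Below is a minimal sketch of the round trip, using an in-memory buffer instead of a real file and three rows copied from the table above. The `min_count` threshold is a hypothetical choice for illustration, not part of the original analysis.

```python
import io
import pandas as pd

# Three rows copied from the keyword table; "eat" occurs once, so its
# standard deviation is missing.
sample = pd.DataFrame({
    "Keyword": ["min", "yoga", "eat"],
    "Count": [468, 419, 1],
    "Mean Engagement Ratio": [0.029511, 0.030532, 0.035006],
    "Standard Deviation": [0.010132, 0.011864, None],
})

# Write and read back through an in-memory buffer (stands in for the CSV file).
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# Hypothetical threshold: keep keywords seen in at least min_count titles.
min_count = 5
frequent = loaded[loaded["Count"] >= min_count]

print(loaded.columns.tolist())
print(frequent["Keyword"].tolist())
```

Filtering out rare keywords before comparing means is one reasonable option, since a mean computed from one or two titles says little about the keyword itself.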