# Analyzing Video Metadata

### Part 1: Analyzing Tags
- tag vs number of videos: most videos, least videos
- tag vs duration: most duration, least duration, average duration

### Part 2: Analyzing Categories

#### Topic Modelling 
- Unsupervised topic clustering to find which tags belong together.
- Manually assigning an appropriate category to each topic.
- Assigning a category to each video based on its tag list.

#### Category analysis
- category vs number of videos: most videos, least videos
- category vs duration: most duration, least duration, average duration

---

## Imports

In [112]:
import numpy as np
import pandas as pd
import isodate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt

---

## Reading and Understanding the Data

In [43]:
df = pd.read_json('video_relevant_data.json', orient='index')

In [44]:
df.head(5)

Unnamed: 0,id,publishedAt,tags,categoryId,duration,viewCount,likeCount,commentCount,topicCategories
KWWLwotNcTo,KWWLwotNcTo,2022-02-23T03:30:15Z,"[bgmi, dynamogaming, alphaclasher, hydrabts, h...",24,PT13M58S,91677,24724,437,"[https://en.wikipedia.org/wiki/Food, https://e..."
PC_pAgJopIA,PC_pAgJopIA,2021-08-27T14:00:45Z,"[polymars, game dev challenge, $1000, best gam...",28,PT15M4S,546503,16851,728,[https://en.wikipedia.org/wiki/Video_game_cult...
isAFtqGHz6Y,isAFtqGHz6Y,2019-03-16T01:10:24Z,"[python telugu tutorial, python telugu, python...",27,PT27M7S,670671,15805,1951,[https://en.wikipedia.org/wiki/Knowledge]
I2wURDqiXdM,I2wURDqiXdM,2018-07-07T02:16:12Z,"[howCode, how, code, howcode.org, howco.de, py...",27,PT6M41S,610607,25114,871,[https://en.wikipedia.org/wiki/Knowledge]
qzyVMhAW9FQ,qzyVMhAW9FQ,2021-08-21T10:48:05Z,"[simplified learner, python]",27,PT1M,1076470,84993,782,[https://en.wikipedia.org/wiki/Knowledge]


In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42 entries, KWWLwotNcTo to hEgO047GxaQ
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               42 non-null     object
 1   publishedAt      42 non-null     object
 2   tags             42 non-null     object
 3   categoryId       42 non-null     int64 
 4   duration         42 non-null     object
 5   viewCount        42 non-null     int64 
 6   likeCount        42 non-null     int64 
 7   commentCount     42 non-null     int64 
 8   topicCategories  42 non-null     object
dtypes: int64(4), object(5)
memory usage: 3.3+ KB


In [46]:
df.describe()

Unnamed: 0,categoryId,viewCount,likeCount,commentCount
count,42.0,42.0,42.0,42.0
mean,25.238095,4079991.0,92226.095238,5959.97619
std,4.853008,7409415.0,165078.384336,15453.845632
min,1.0,91677.0,4871.0,150.0
25%,27.0,581311.2,19410.0,743.75
50%,27.0,1190828.0,41445.0,1097.0
75%,27.0,3330402.0,80343.0,2889.25
max,28.0,30948170.0,778117.0,84335.0


In [47]:
df.shape

(42, 9)

### Data Preprocessing

In [48]:
# converting duration (ISO 8601 string) to timedelta and publishedAt to datetime
df['duration'] = df['duration'].apply(isodate.parse_duration)
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
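
For readers without `isodate` installed, the duration conversion above can be approximated with the standard library. This is only a minimal sketch: `parse_iso_duration` is a hypothetical helper (not part of the notebook or of `isodate`), and it assumes durations contain only days, hours, minutes and whole seconds, as in `PT13M58S`.

```python
import re
from datetime import timedelta

# Hypothetical helper: handles only the D/H/M/S components of an ISO 8601 duration.
_ISO_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_iso_duration(text: str) -> timedelta:
    """Convert a string such as 'PT13M58S' into a timedelta."""
    match = _ISO_DURATION.match(text)
    if match is None:
        raise ValueError(f"unsupported ISO 8601 duration: {text!r}")
    # Keep only the components that are present in the string.
    parts = {name: int(value) for name, value in match.groupdict().items() if value}
    return timedelta(**parts)


print(parse_iso_duration("PT13M58S"))  # 0:13:58
```

Years, months, weeks and fractional seconds are deliberately not handled here; `isodate.parse_duration` remains the safer choice for arbitrary input.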

In [49]:
# dropping id column as index already has that value
df.drop('id', axis=1, inplace=True)

In [50]:
df.head(2)

Unnamed: 0,publishedAt,tags,categoryId,duration,viewCount,likeCount,commentCount,topicCategories
KWWLwotNcTo,2022-02-23 03:30:15+00:00,"[bgmi, dynamogaming, alphaclasher, hydrabts, h...",24,0 days 00:13:58,91677,24724,437,"[https://en.wikipedia.org/wiki/Food, https://e..."
PC_pAgJopIA,2021-08-27 14:00:45+00:00,"[polymars, game dev challenge, $1000, best gam...",28,0 days 00:15:04,546503,16851,728,[https://en.wikipedia.org/wiki/Video_game_cult...


In [51]:
# fetching complete list of unique tags from tags column
# (a set comprehension avoids splitting tags that themselves contain a comma)
all_tags = list({tag for tags in df['tags'] for tag in tags})

In [52]:
len(all_tags)

518

In [53]:
print(all_tags[0:10])

['Learn', 'master python', 'crossroads', 'applications', 'dr chuck', 'python scripting', 'why learn python', 'python programming examples', 'Feed', 'python coding']


In [54]:
# creating separate column for each unique tag for further analysis
for tag in all_tags:
    df[tag] = df['tags'].apply(lambda x: tag in x).map(int)
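
Adding several hundred columns one at a time can be slow and may trigger pandas fragmentation warnings. As a hedged alternative, the same 0/1 indicator columns can be built in one step from the exploded tag lists. The sketch below uses a small made-up frame (`toy` and its video ids are illustrative, not the notebook's data).

```python
import pandas as pd

# Toy stand-in for the notebook's frame: one list of tags per video id.
toy = pd.DataFrame(
    {"tags": [["python", "tutorial"], ["python"], ["gaming", "bgmi"]]},
    index=["vid_a", "vid_b", "vid_c"],
)

# One row per (video, tag) pair; the video id is repeated in the index.
exploded = toy["tags"].explode()

# One indicator column per tag, then collapse back to one row per video.
indicators = pd.get_dummies(exploded).astype(int).groupby(level=0).max()

# Attach all indicator columns to the original frame in a single join.
toy_with_tags = toy.join(indicators)
print(toy_with_tags)
```

Because the indicators are joined once, the original frame is not modified column by column.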

In [55]:
df.shape

(42, 526)

In [56]:
df.head(1)

Unnamed: 0,publishedAt,tags,categoryId,duration,viewCount,likeCount,commentCount,topicCategories,Learn,master python,...,Learn Python in Hindi,reddy,Geeksforgeeks python,Numpy,telusko,python tutorial for beginners full,TELUGU TUTORIAL,python in hindi,python object,retics
KWWLwotNcTo,2022-02-23 03:30:15+00:00,"[bgmi, dynamogaming, alphaclasher, hydrabts, h...",24,0 days 00:13:58,91677,24724,437,"[https://en.wikipedia.org/wiki/Food, https://e...",0,0,...,0,0,0,0,0,0,0,0,0,0


---

### Analyzing tags

#### Video Count
- tag vs video count
- tag with most videos
- tag with least videos
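
The three questions above can also be answered directly from the tag lists, without the indicator columns. A minimal sketch on made-up data follows; it assumes each tag appears at most once per video, otherwise a repeated tag would be counted more than once.

```python
import pandas as pd

# Toy data: each row is a video with its list of tags.
toy = pd.DataFrame({"tags": [["python", "tutorial"], ["python"], ["gaming"]]})

# Number of videos per tag.
tag_counts = toy["tags"].explode().value_counts()

# Keep every tag that ties for the maximum or minimum count.
most_used = tag_counts[tag_counts == tag_counts.max()]
least_used = tag_counts[tag_counts == tag_counts.min()]

print(most_used)
print(least_used)
```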

In [72]:
tag_vs_vid_count_df = pd.DataFrame(df.iloc[:, 8:].sum().sort_values(ascending=False), columns=["total_vid_count"])

In [73]:
tag_vs_vid_count_df.head(2)

Unnamed: 0,total_vid_count
python,18
python tutorial,17


In [74]:
tag_vs_vid_count_df.tail(2)

Unnamed: 0,total_vid_count
snakes,1
retics,1


In [75]:
tag_vs_vid_count_df.to_csv('tags_vs_video_count.csv', index_label=['Tag'])

In [96]:
# tags with max videos
tag_vs_vid_count_df[tag_vs_vid_count_df['total_vid_count'] == tag_vs_vid_count_df['total_vid_count'].max()]

Unnamed: 0,total_vid_count
python,18


In [97]:
# tags with min videos
tag_vs_vid_count_df[tag_vs_vid_count_df['total_vid_count'] == tag_vs_vid_count_df['total_vid_count'].min()]

Unnamed: 0,total_vid_count
python in single video,1
python mastery,1
complete python,1
How to learn python programming for free,1
#livingthedream,1
...,...
why learn python programming,1
python basic tutorial malayalam,1
Flask in Tamil,1
snakes,1


---

#### Duration
- tag vs duration
- tag with most duration
- tag with least duration
- average duration for each tag
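
As a compact illustration of the per-tag duration statistics listed above, the total and average can be computed in a single grouped aggregation. This is a sketch on made-up durations, not the notebook's data, and the column names in `per_tag` are chosen here only to mirror the ones used in this notebook.

```python
import pandas as pd

# Toy data: tags and a timedelta duration for each video.
toy = pd.DataFrame(
    {
        "tags": [["python", "tutorial"], ["python"], ["gaming"]],
        "duration": pd.to_timedelta(["00:10:00", "00:20:00", "00:05:00"]),
    }
)

# One row per (video, tag) pair, then aggregate durations per tag.
per_tag = (
    toy.explode("tags")
    .groupby("tags")["duration"]
    .agg(total_vid_count="count", avg_duration="mean", total_duration="sum")
)

print(per_tag.sort_values("total_duration", ascending=False))
```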

In [76]:
tag_vs_duration = {}

for column in df.iloc[:, 8:]:
    average_duration = df[df[column] == 1]['duration'].sum()/len(df[df[column] == 1])
    total_duration_for_tag = df[df[column] == 1]['duration'].sum()
    tag_vs_duration[column] = [average_duration, total_duration_for_tag]

tag_vs_duration_df = pd.DataFrame.from_dict(tag_vs_duration, orient='index', columns=['avg_duration', 'total_duration'])

In [77]:
tag_vs_duration_df.head(1)

Unnamed: 0,avg_duration,total_duration
Learn,0 days 00:14:34,0 days 00:14:34


In [92]:
tag_vs_count_duration_df = tag_vs_vid_count_df.join(tag_vs_duration_df)

In [93]:
tag_vs_count_duration_df.head(2)

Unnamed: 0,total_vid_count,avg_duration,total_duration
python,18,0 days 04:13:12.333333333,3 days 03:57:42
python tutorial,17,0 days 05:03:53.235294117,3 days 14:06:05


In [99]:
# max total duration
tag_vs_duration_df[tag_vs_duration_df['total_duration'] == tag_vs_duration_df['total_duration'].max()]

Unnamed: 0,avg_duration,total_duration
python tutorial,0 days 05:03:53.235294117,3 days 14:06:05


In [100]:
# min total duration
tag_vs_duration_df[tag_vs_duration_df['total_duration'] == tag_vs_duration_df['total_duration'].min()]

Unnamed: 0,avg_duration,total_duration
simplified learner,0 days 00:01:00,0 days 00:01:00


In [101]:
# max average duration
tag_vs_duration_df[tag_vs_duration_df['avg_duration'] == tag_vs_duration_df['avg_duration'].max()]

Unnamed: 0,avg_duration,total_duration
dr chuck,0 days 13:40:10,0 days 13:40:10
university of michigan,0 days 13:40:10,0 days 13:40:10
dr. chuck,0 days 13:40:10,0 days 13:40:10
python tutorial 2019,0 days 13:40:10,0 days 13:40:10
u of m,0 days 13:40:10,0 days 13:40:10
charles severance,0 days 13:40:10,0 days 13:40:10
py4e,0 days 13:40:10,0 days 13:40:10


In [102]:
# min average duration
tag_vs_duration_df[tag_vs_duration_df['avg_duration'] == tag_vs_duration_df['avg_duration'].min()]

Unnamed: 0,avg_duration,total_duration
simplified learner,0 days 00:01:00,0 days 00:01:00


#### Storing Tag vs Count & Duration Information in a CSV File

In [104]:
tag_vs_count_duration_df.to_csv('tag_analysis.csv', index_label=['Tag'])

---

### Category Analysis

#### Topic Modelling
- To group tags into categories, we perform topic modelling, i.e. unsupervised clustering of the words that occur in the tags.
- Each resulting topic is then manually assigned a relevant category name.

In [106]:
# converting list to string for topic modelling
df["tags_str"] = df['tags'].apply(lambda x: " ".join(x))

In [110]:
vect = TfidfVectorizer(stop_words='english')
X = vect.fit_transform(df["tags_str"])

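# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())[0:10]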

Unnamed: 0,10,100,1000,12,2000,2016,2018,2019,2020,2021,...,web,wild,wildlife,winner,wins,youtube,yt,zero,zoo,zoology
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.437019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.048558,0.097115,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.165857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.101824,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.120809,...,0.0,0.0,0.0,0.0,0.0,0.0,0.103959,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.262089,0.131044,0.0,0.0,0.0,0.0,0.0,0.471932,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [114]:
# choosing 5 as the number of topics (clusters)
N_TOPICS = 5
nmf = NMF(n_components=N_TOPICS, init='nndsvd')
W = nmf.fit_transform(X)  # Document-topic matrix
H = nmf.components_       # Topic-term matrix

In [117]:
NUM_TOP_WORDS_TO_SHOW = 7
words = np.array(vect.get_feature_names_out())
topic_words = pd.DataFrame(np.zeros((N_TOPICS, NUM_TOP_WORDS_TO_SHOW)), 
                           index=[f'Topic {i + 1}' for i in range(N_TOPICS)],
                           columns=[f'Word {i + 1}' for i in range(NUM_TOP_WORDS_TO_SHOW)]
                          ).astype(str)

for i in range(N_TOPICS):
    ix = H[i].argsort()[::-1][:NUM_TOP_WORDS_TO_SHOW]
    topic_words.iloc[i] = words[ix]

topic_words

Unnamed: 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7
Topic 1,python,programming,tutorial,learn,course,language,beginners
Topic 2,hydraemperorgaming,hydravss8ul,hydrabts,8bitthug,hydradanger,s8ulvlogs,mohitchiikara
Topic 3,telusko,navin,reddy,java,tutorial,google,ai
Topic 4,tamil,python,data,joes,flask,tutor,mysql
Topic 5,python,hindi,tutorial,learn,history,sir,mysirg


In [118]:
# assigning topics
topic_mapping = {
    'Topic 1': 'python tutorials for beginners',
    'Topic 2': 'gaming',
    'Topic 3': 'general programming',
    'Topic 4': 'python programming in tamil',
    'Topic 5': 'python programming in hindi',
}

In [119]:
W = pd.DataFrame(W, columns=[f'Topic {i + 1}' for i in range(N_TOPICS)])
W['max_topic'] = W.apply(lambda x: topic_mapping.get(x.idxmax()), axis=1)
W[pd.notnull(W['max_topic'])].head(2)
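
One edge case is worth noting: `idxmax` returns the first column when every topic weight in a row is zero, so such a video would silently receive the first topic's label. The following sketch, on made-up weights, shows one possible way to flag those rows. The `"unassigned"` label and the two-topic mapping are assumptions for illustration only.

```python
import pandas as pd

# Made-up document-topic weights; the last row has no topic signal at all.
weights = pd.DataFrame(
    [[0.0, 0.83], [0.02, 0.0], [0.0, 0.0]],
    columns=["Topic 1", "Topic 2"],
)
mapping = {"Topic 1": "programming", "Topic 2": "gaming"}

# Strongest topic per row, translated to its manually chosen name.
dominant = weights.idxmax(axis=1).map(mapping)

# Rows whose weights are all zero get an explicit placeholder label.
dominant = dominant.where(weights.max(axis=1) > 0, "unassigned")

print(dominant.tolist())  # ['gaming', 'programming', 'unassigned']
```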

Unnamed: 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,max_topic
0,0.0,0.828908,0.0,0.0,0.0,gaming
1,0.0,0.0,0.016521,0.0,0.0,general programming


In [122]:
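# storing the assigned label as both 'topic' (used by the groupby cells below) and 'category'
df['topic'] = W['max_topic'].to_list()
df['category'] = df['topic']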

In [123]:
df.head(2)

Unnamed: 0,publishedAt,tags,categoryId,duration,viewCount,likeCount,commentCount,topicCategories,Learn,master python,...,Numpy,telusko,python tutorial for beginners full,TELUGU TUTORIAL,python in hindi,python object,retics,tags_str,topic,category
KWWLwotNcTo,2022-02-23 03:30:15+00:00,"[bgmi, dynamogaming, alphaclasher, hydrabts, h...",24,0 days 00:13:58,91677,24724,437,"[https://en.wikipedia.org/wiki/Food, https://e...",0,0,...,0,0,0,0,0,0,0,bgmi dynamogaming alphaclasher hydrabts hydraa...,gaming,gaming
PC_pAgJopIA,2021-08-27 14:00:45+00:00,"[polymars, game dev challenge, $1000, best gam...",28,0 days 00:15:04,546503,16851,728,[https://en.wikipedia.org/wiki/Video_game_cult...,0,0,...,0,0,0,0,0,0,0,polymars game dev challenge $1000 best game wi...,general programming,general programming


#### Analyzing Categories
- category vs total videos
- category vs duration
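
Both comparisons above can be produced in one pass with named aggregation. This is a hedged sketch on made-up data; the column names in `summary` are illustrative.

```python
import pandas as pd

# Toy data: one assigned category and one duration per video.
toy = pd.DataFrame(
    {
        "category": ["gaming", "python", "python"],
        "duration": pd.to_timedelta(["00:10:00", "01:00:00", "02:00:00"]),
    }
)

# Video count, total duration and average duration per category.
summary = (
    toy.groupby("category")["duration"]
    .agg(video_count="count", total_duration="sum", avg_duration="mean")
    .sort_values("avg_duration", ascending=False)
)

print(summary)
```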

In [132]:
# to_frame('category') keeps the count column named 'category' across pandas versions
category_df = df['category'].value_counts().to_frame('category')
category_df

Unnamed: 0,category
python tutorials for beginners,21
python programming in tamil,7
python programming in hindi,7
general programming,5
gaming,2


In [134]:
# Most Popular Category
category_df[category_df.category == category_df.category.max()]

Unnamed: 0,category
python tutorials for beginners,21


In [135]:
# Least Popular Category
category_df[category_df.category == category_df.category.min()]

Unnamed: 0,category
gaming,2


In [139]:
total_duration_cat_df = pd.DataFrame(df.groupby('topic')['duration'].sum())
total_duration_cat_df

topic,duration
gaming,0 days 00:23:47
general programming,0 days 08:27:47
python programming in hindi,0 days 20:45:13
python programming in tamil,0 days 14:22:46
python tutorials for beginners,3 days 06:04:41


In [142]:
# average duration per topic = total duration / number of videos (aligned on the topic index)
avg_duration_cat_df = pd.DataFrame(df.groupby('topic')['duration'].sum() / df['topic'].value_counts(), columns=['avg_duration'])
avg_duration_cat_df

Unnamed: 0,avg_duration
gaming,0 days 00:11:53.500000
general programming,0 days 01:41:33.400000
python programming in hindi,0 days 02:57:53.285714285
python programming in tamil,0 days 02:03:15.142857142
python tutorials for beginners,0 days 03:43:04.809523809


In [144]:
total_duration_cat_df.join(avg_duration_cat_df).sort_values(by='avg_duration', ascending=False)

topic,duration,avg_duration
python tutorials for beginners,3 days 06:04:41,0 days 03:43:04.809523809
python programming in hindi,0 days 20:45:13,0 days 02:57:53.285714285
python programming in tamil,0 days 14:22:46,0 days 02:03:15.142857142
general programming,0 days 08:27:47,0 days 01:41:33.400000
gaming,0 days 00:23:47,0 days 00:11:53.500000


In [145]:
total_duration_cat_df.to_csv("category_analysis.csv")

---