# YouTube Video Classifier

In this project, I try to predict the category of a YouTube video from its description.
The dataset comes from Kaggle: https://www.kaggle.com/rahulanand0070/youtubevideodataset

- Goal: Predict the category that the video belongs to using various classifier models and determine which model is the most suitable for the task.
- Input feature(s): Description. To keep the project simple, I do not use the video title; it could be added as an input feature in a later update.
- Output feature: Category

## Data Overview

### Importing libraries and reading the dataset

In [70]:
import re
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [71]:
df = pd.read_csv('./Youtube Video Dataset.csv')
df.head()

Unnamed: 0,Title,Videourl,Category,Description
0,Madagascar Street Food!!! Super RARE Malagasy ...,/watch?v=EwBA1fOQ96c,Food,🎥GIANT ALIEN SNAIL IN JAPAN! » https://youtu.b...
1,42 Foods You Need To Eat Before You Die,/watch?v=0SPwwpruGIA,Food,This is the ultimate must-try food bucket list...
2,Gordon Ramsay’s Top 5 Indian Dishes,/watch?v=upfu5nQB2ks,Food,We found 5 of the best and most interesting In...
3,How To Use Chopsticks - In About A Minute 🍜,/watch?v=xFRzzSF_6gk,Food,You're most likely sitting in a restaurant wit...
4,Trying Indian Food 1st Time!,/watch?v=K79bXtaRwcM,Food,HELP SUPPORT SINSTV!! Shop Our Sponsors!\nLast...


### Dataset Exploratory Analysis

In [72]:
df.shape

(11211, 4)

In [73]:
df.dtypes

Title          object
Videourl       object
Category       object
Description    object
dtype: object

In [74]:
df.columns

Index(['Title', 'Videourl', 'Category', 'Description'], dtype='object')

### Checking for missing fields
Since I only use the video descriptions in this project, I drop every row that is missing its description.

In [75]:
df.isnull().sum(axis=0)

Title           0
Videourl        0
Category        0
Description    83
dtype: int64

In [76]:
df = df.dropna()
df.isnull().sum()

Title          0
Videourl       0
Category       0
Description    0
dtype: int64
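In this dataset only the Description column has missing values, so a plain `dropna()` is enough. If other columns could also contain gaps, a more targeted variant might look like the sketch below. The `toy` frame and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "Title": ["a", None, "c"],
    "Category": ["Food", "History", "Food"],
    "Description": ["tasty", "old", None],
})

# Drop only the rows whose Description is missing; a missing Title is tolerated.
cleaned = toy.dropna(subset=["Description"])
print(cleaned.shape)
```

With `subset`, a row is removed only when the column that is actually used as the input feature is empty.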

### Checking the value count of different categories

In [77]:
df['Category'].value_counts()

travel blog           2200
Science&Technology    2074
Food                  1828
manufacturing         1699
Art&Music             1682
History               1645
Name: Category, dtype: int64

Since the categories are roughly balanced (the largest has 2200 videos and the smallest 1645), I do not need to resample or reweight the data.
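One way to put a number on "relatively even" is to look at each category's share of the data and at the ratio between the largest and smallest class. The sketch below re-enters the counts printed above by hand; the variable names are my own.

```python
import pandas as pd

# Category counts copied from the value_counts() output above.
counts = pd.Series({
    "travel blog": 2200,
    "Science&Technology": 2074,
    "Food": 1828,
    "manufacturing": 1699,
    "Art&Music": 1682,
    "History": 1645,
})

# Share of the dataset held by each category.
shares = counts / counts.sum()

# Ratio between the most and least frequent category (1.0 means perfectly balanced).
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print("largest / smallest:", round(imbalance_ratio, 2))
```

The largest class is only about 1.34 times the size of the smallest, which is why plain accuracy is a reasonable metric here.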

### Cleaning up text
Using regular expressions, I convert all text to lowercase and remove URLs, emails, and special characters that might reduce the accuracy of the models.

In [78]:
start = time.process_time()
for i, row in df.iterrows():
    # lowercase
    desc = str(row['Description']).lower()
    # remove urls (raw strings keep the regex escapes from being read as string escapes)
    desc = re.sub(r'((http|ftp|https):\/\/)?[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&;:/~\+#]*[\w\-\@?^=%&;/~\+#])?', ' ', desc)
    # remove emails
    desc = re.sub(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)', ' ', desc)
    # remove special chars, keeping only letters and whitespace
    desc = re.sub(r'[^A-Za-z\s]', '', desc)
    # collapse runs of whitespace into a single space
    desc = re.sub(r'\s+', ' ', desc)
    df.at[i, 'Description'] = desc
print('execution time:', time.process_time()-start)

execution time: 8.800837000000229
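The row-by-row loop works, but the same steps can also be written with pandas string methods. The following is a minimal sketch, not the code that produced the results in this notebook: `clean_descriptions`, `URL_RE`, and `EMAIL_RE` are names I made up, and the patterns are the same ones used in the loop above.

```python
import pandas as pd

# Same patterns as in the loop above, written as raw strings.
URL_RE = r'((http|ftp|https):\/\/)?[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&;:/~\+#]*[\w\-\@?^=%&;/~\+#])?'
EMAIL_RE = r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'


def clean_descriptions(descriptions: pd.Series) -> pd.Series:
    """Apply the cleaning steps to a whole column at once."""
    return (
        descriptions.astype(str)
        .str.lower()                                   # lowercase
        .str.replace(URL_RE, ' ', regex=True)          # remove urls
        .str.replace(EMAIL_RE, ' ', regex=True)        # remove emails
        .str.replace(r'[^A-Za-z\s]', '', regex=True)   # keep letters and whitespace
        .str.replace(r'\s+', ' ', regex=True)          # collapse whitespace
    )


# Small made-up sample to show the effect.
sample = pd.Series(["Visit https://example.com NOW!! 42 foods", "🎥GIANT snail"])
cleaned_sample = clean_descriptions(sample)
print(cleaned_sample.tolist())
```

On the real data this would be used as `df['Description'] = clean_descriptions(df['Description'])`. I have not timed it against the loop, so any speed difference is an assumption to verify.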


In [79]:
df.head()

Unnamed: 0,Title,Videourl,Category,Description
0,Madagascar Street Food!!! Super RARE Malagasy ...,/watch?v=EwBA1fOQ96c,Food,giant alien snail in japan go on your tour of ...
1,42 Foods You Need To Eat Before You Die,/watch?v=0SPwwpruGIA,Food,this is the ultimate musttry food bucket list ...
2,Gordon Ramsay’s Top 5 Indian Dishes,/watch?v=upfu5nQB2ks,Food,we found of the best and most interesting indi...
3,How To Use Chopsticks - In About A Minute 🍜,/watch?v=xFRzzSF_6gk,Food,youre most likely sitting in a restaurant with...
4,Trying Indian Food 1st Time!,/watch?v=K79bXtaRwcM,Food,help support sinstv shop our sponsors last lon...


## Training and Testing

### Defining input and output features

In [80]:
X = df['Description']
y = df['Category']

### Declaring TFIDF vectorizer and classifiers

Since descriptions are strings, I need to convert them into numeric feature vectors before a classifier can work with them. To do that, I use sklearn's TfidfVectorizer.

I set stop_words='english' so that stop words (words that do not represent the content of the text, such as 'the' or 'and') are removed from the descriptions. When I trained the models without removing stop words, accuracy dropped for every model, by between 0.54% (SVM) and 3.75% (K-Nearest Neighbors).
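To see what the stop word setting changes, the vocabulary learned with and without it can be compared on a tiny corpus. The two sentences in `docs` are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up descriptions, not taken from the dataset.
docs = ["the chef cooks the pasta", "the history of the roman empire"]

# Vocabulary learned with and without English stop word removal.
with_stop = TfidfVectorizer().fit(docs)
without_stop = TfidfVectorizer(stop_words="english").fit(docs)

print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))
```

Words such as 'the' and 'of' appear in the first vocabulary but not in the second, so they no longer take up feature dimensions.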

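To apply the TFIDF vectorizer to the descriptions during both training and testing, I combine it with each classifier in an sklearn Pipeline. The pipeline fits the vectorizer on the training data only and then reuses that fitted vocabulary to transform the test data.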
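A minimal sketch of that behaviour on made-up data follows. The texts, labels, and the name `toy_clf` are hypothetical; only the Pipeline structure mirrors the one used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented training texts and labels, for illustration only.
train_texts = [
    "spicy noodle soup recipe",
    "grilled cheese sandwich recipe",
    "ancient roman empire battle",
    "medieval castle siege history",
]
train_labels = ["Food", "Food", "History", "History"]

toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns the vocabulary and the classifier from the training texts only.
toy_clf.fit(train_texts, train_labels)

# predict() transforms new text with the already-fitted vocabulary;
# words never seen in training (here 'volcano') are simply ignored.
prediction = toy_clf.predict(["roman soup volcano"])
print(prediction)
print("volcano" in toy_clf.named_steps["tfidf"].vocabulary_)
```

Because the vectorizer is part of the pipeline, cross-validation refits it on every training fold, so no information from the held-out fold leaks into the features.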

The classifier models that I will be training and testing are:
- K-Nearest Neighbors
- Logistic Regression
- Support Vector Machine (SVM)
- Decision Tree
- Random Forest

In [81]:
vectorizer = TfidfVectorizer(stop_words='english')
knn_clf = Pipeline([('tfidf', vectorizer), ('clf', KNeighborsClassifier(n_neighbors = 75))])
log_reg_clf = Pipeline([('tfidf', vectorizer), ('clf', LogisticRegression(max_iter=1000))])
svm_clf = Pipeline([('tfidf', vectorizer), ('clf', SVC())])
dec_tree_clf = Pipeline([('tfidf', vectorizer), ('clf', DecisionTreeClassifier())])
rand_forest_clf = Pipeline([('tfidf', vectorizer), ('clf', RandomForestClassifier(n_estimators=200))])

To measure the effectiveness of the classifiers, I run K-fold cross-validation on each model and compare the mean accuracy scores.

#### Setting up K-Fold

In [82]:
kfold = KFold(n_splits=5, random_state=10, shuffle=True)
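To illustrate what this splitter does, the sketch below runs the same KFold settings on ten dummy samples and prints which indices end up in each test fold. The names `samples`, `kfold_demo`, and `test_indices` are my own.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples standing in for the descriptions.
samples = np.arange(10)

kfold_demo = KFold(n_splits=5, random_state=10, shuffle=True)

test_indices = []
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(samples)):
    # Each fold trains on 8 samples and tests on the remaining 2.
    print(f"fold {fold}: train={len(train_idx)} test={test_idx.tolist()}")
    test_indices.extend(test_idx.tolist())
    fold_sizes.append((len(train_idx), len(test_idx)))
```

Every sample lands in a test fold exactly once, and fixing `random_state` makes the shuffled split reproducible, so every model is evaluated on the same folds.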

#### Measuring the scores of different models

In [83]:
knn_score = cross_val_score(knn_clf, X, y, cv = kfold, scoring = 'accuracy').mean() * 100
print("K-Nearest Neighbor Score:", knn_score)

K-Nearest Neighbor Score: 85.37026560465186


In [84]:
log_reg_score = cross_val_score(log_reg_clf, X, y, cv = kfold, scoring = 'accuracy').mean() * 100
print("Logistic Regression Score:", log_reg_score)

Logistic Regression Score: 90.70809735808675


In [85]:
dec_tree_score = cross_val_score(dec_tree_clf, X, y, cv = kfold, scoring = 'accuracy').mean() * 100
print("Decision Tree Score:", dec_tree_score)

Decision Tree Score: 82.53961254631172


In [86]:
rand_forest_score = cross_val_score(rand_forest_clf, X, y, cv = kfold, scoring = 'accuracy').mean() * 100
print("Random Forest Score:", rand_forest_score)

Random Forest Score: 90.15994427450862


In [87]:
svm_score = cross_val_score(svm_clf, X, y, cv = kfold, scoring = 'accuracy').mean() * 100
print("SVM Score:", svm_score)

SVM Score: 90.48340652351676
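To compare the models side by side, the scores can be collected into one pandas Series and sorted. The sketch below re-enters the printed scores by hand, rounded to two decimals; in the notebook the variables `knn_score`, `log_reg_score`, and so on could be used directly instead.

```python
import pandas as pd

# Mean 5-fold accuracy (%) copied from the outputs above, rounded to two decimals.
scores = pd.Series({
    "K-Nearest Neighbors": 85.37,
    "Logistic Regression": 90.71,
    "Decision Tree": 82.54,
    "Random Forest": 90.16,
    "SVM": 90.48,
})

# Sort from best to worst.
ranking = scores.sort_values(ascending=False)
print(ranking)

# Gap between the best and third-best model, in percentage points.
top_three_gap = ranking.iloc[0] - ranking.iloc[2]
print("gap across top three:", round(top_three_gap, 2))
```

Logistic Regression, SVM, and Random Forest are within about 0.55 percentage points of each other, while K-Nearest Neighbors and Decision Tree trail clearly. With a gap that small, the ranking among the top three might change with a different random seed or fold count.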
