<a href="https://colab.research.google.com/github/arqylabs/title_classifier_youtube/blob/master/title_classifier_youtube.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
## import statements ##
import numpy as np
import pandas as pd
from pandas import json_normalize  # top-level location since pandas 1.0; pandas.io.json.json_normalize is deprecated
import re

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, auc, f1_score, roc_auc_score

import warnings; warnings.simplefilter('ignore')

**Import Data from CSV file**

Use pandas to load the dataset into a DataFrame.

In [0]:
data_files = 'USvideos.csv'

data = pd.read_csv(data_files)
data_videos = data[['video_id', 'title', 'category_id']].copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
data_videos['category_id'] = pd.to_numeric(data_videos['category_id'])
data_videos.head()

Unnamed: 0,video_id,title,category_id
0,2kyS6SvSYSE,WE WANT TO TALK ABOUT OUR MARRIAGE,22
1,1ZAPwfrtAFY,The Trump Presidency: Last Week Tonight with J...,24
2,5qpjK5DgCt4,"Racist Superman | Rudy Mancuso, King Bach & Le...",23
3,puqaWrEC7tY,Nickelback Lyrics: Real or Fake?,24
4,d380meD0W0M,I Dare You: GOING BALD!?,24


In [0]:
category_labels = 'datasets_4549_466349_US_category_id.json'

df = pd.read_json(category_labels)
df['items']

0     {'kind': 'youtube#videoCategory', 'etag': '"m2...
1     {'kind': 'youtube#videoCategory', 'etag': '"m2...
2     {'kind': 'youtube#videoCategory', 'etag': '"m2...
3     {'kind': 'youtube#videoCategory', 'etag': '"m2...
4     {'kind': 'youtube#videoCategory', 'etag': '"m2...
5     {'kind': 'youtube#videoCategory', 'etag': '"m2...
6     {'kind': 'youtube#videoCategory', 'etag': '"m2...
7     {'kind': 'youtube#videoCategory', 'etag': '"m2...
8     {'kind': 'youtube#videoCategory', 'etag': '"m2...
9     {'kind': 'youtube#videoCategory', 'etag': '"m2...
10    {'kind': 'youtube#videoCategory', 'etag': '"m2...
11    {'kind': 'youtube#videoCategory', 'etag': '"m2...
12    {'kind': 'youtube#videoCategory', 'etag': '"m2...
13    {'kind': 'youtube#videoCategory', 'etag': '"m2...
14    {'kind': 'youtube#videoCategory', 'etag': '"m2...
15    {'kind': 'youtube#videoCategory', 'etag': '"m2...
16    {'kind': 'youtube#videoCategory', 'etag': '"m2...
17    {'kind': 'youtube#videoCategory', 'etag': 

In [0]:
categories = json_normalize(data=df['items'])[['id','snippet.title']]
categories.columns = ['category_id','category_title']
categories['category_id'] = pd.to_numeric(categories['category_id'])
categories.sort_values(by=['category_id'], inplace=True)
categories

Unnamed: 0,category_id,category_title
0,1,Film & Animation
1,2,Autos & Vehicles
2,10,Music
3,15,Pets & Animals
4,17,Sports
5,18,Short Movies
6,19,Travel & Events
7,20,Gaming
8,21,Videoblogging
9,22,People & Blogs
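
The column names `id` and `snippet.title` come from how `json_normalize` flattens nested dictionaries. Below is a minimal sketch on hand-written records; the values in `sample_items` are made up for illustration, and only the nesting is assumed to mirror the entries in `items`.

```python
import pandas as pd

# Hypothetical records mimicking the nesting of one entry in `items`.
sample_items = [
    {"kind": "youtube#videoCategory", "id": "1",
     "snippet": {"title": "Film & Animation", "assignable": True}},
    {"kind": "youtube#videoCategory", "id": "10",
     "snippet": {"title": "Music", "assignable": True}},
]

# Nested keys are flattened into dot-separated column names.
flat = pd.json_normalize(sample_items)

sample_categories = flat[["id", "snippet.title"]].copy()
sample_categories.columns = ["category_id", "category_title"]
# The ids arrive as strings, so convert them before merging on them.
sample_categories["category_id"] = pd.to_numeric(sample_categories["category_id"])
print(sample_categories)
```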


In [0]:
data_videos = pd.merge(data_videos, categories, on='category_id')
data_videos.drop(list(data_videos[data_videos['title'] == 'Deleted video'].index),inplace=True)
data_videos

Unnamed: 0,video_id,title,category_id,category_title
0,2kyS6SvSYSE,WE WANT TO TALK ABOUT OUR MARRIAGE,22,People & Blogs
1,0mlNzVSJrT0,Me-O Cats Commercial,22,People & Blogs
2,STI2fI7sKMo,"AFFAIRS, EX BOYFRIENDS, $18MILLION NET WORTH -...",22,People & Blogs
3,KODzih-pYlU,BLIND(folded) CAKE DECORATING CONTEST (with Mo...,22,People & Blogs
4,8mhTWqWlQzU,Wearing Online Dollar Store Makeup For A Week,22,People & Blogs
...,...,...,...,...
40944,V6ElE2xs48c,Game of Zones - S5:E5: The Isle of Van Gundy,43,Shows
40945,V6ElE2xs48c,Game of Zones - S5:E5: The Isle of Van Gundy,43,Shows
40946,V6ElE2xs48c,Game of Zones - S5:E5: The Isle of Van Gundy,43,Shows
40947,V6ElE2xs48c,Game of Zones - S5:E5: The Isle of Van Gundy,43,Shows
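
`pd.merge` defaults to an inner join, so any video whose `category_id` is missing from the category file is dropped without a message. The sketch below shows one possible way to check for such rows on toy data; the frames `videos` and `cats` and the id `999` are invented for the example.

```python
import pandas as pd

# Toy stand-ins for data_videos and categories.
videos = pd.DataFrame({"title": ["a", "b", "c"], "category_id": [22, 24, 999]})
cats = pd.DataFrame({"category_id": [22, 24],
                     "category_title": ["People & Blogs", "Entertainment"]})

# A left merge with indicator=True keeps every video and flags unmatched ids.
checked = videos.merge(cats, on="category_id", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["title", "category_id"]])

# The default inner merge keeps only the rows that found a category.
inner = videos.merge(cats, on="category_id")
print(len(videos), "->", len(inner))
```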


In [0]:
df_unique = data_videos.sort_values('title', ascending=True).drop_duplicates(subset=['title'])

In [0]:
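# Drop rows that are identical in every column.
df_unique = df_unique.drop_duplicates()
# Find video_ids that still occur more than once (same video under different titles).
video_id_counts = df_unique['video_id'].value_counts()
list_video_id_duplicate = video_id_counts[video_id_counts > 1].index
df_unique[df_unique['video_id'].isin(list_video_id_duplicate)]  # inspection only; the result is not stored
# Keep the first row per video_id, in original index order.
df_unique = df_unique.sort_index().drop_duplicates(subset=['video_id'])
df_unique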

Unnamed: 0,video_id,title,category_id,category_title
0,2kyS6SvSYSE,WE WANT TO TALK ABOUT OUR MARRIAGE,22,People & Blogs
1,0mlNzVSJrT0,Me-O Cats Commercial,22,People & Blogs
2,STI2fI7sKMo,"AFFAIRS, EX BOYFRIENDS, $18MILLION NET WORTH -...",22,People & Blogs
4,8mhTWqWlQzU,Wearing Online Dollar Store Makeup For A Week,22,People & Blogs
6,fCTKDn3Q8xQ,Idiot's Guide to Japanese Squat Toilets,22,People & Blogs
...,...,...,...,...
40889,pwGbwYAfSmg,5 books worth reading this summer,29,Nonprofits & Activism
40902,lM0yu7c6lQk,You're not crazy. Apple is slowing down older ...,43,Shows
40907,Q1CFfU2gXHw,Apple HomePod: Everything to know before you b...,43,Shows
40920,7_FJUSBFbJM,Game of Zones - Game of Zones - S5:E1: 'A Gold...,43,Shows


**Data Cleansing Step**

1. Remove symbols, digits, and punctuation from the titles (only runs of ASCII letters are kept)
2. Convert to lowercase

In [0]:
def process_content(content):
    return " ".join(re.findall("[A-Za-z]+",content.lower()))
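
A quick check of what this cleaning does to a few titles. The function is repeated so the snippet runs on its own; the second and third sample titles are invented to show that digits and accented letters are dropped as well.

```python
import re

def process_content(content):
    # Same logic as the cell above: lowercase, then keep only runs of ASCII letters.
    return " ".join(re.findall("[A-Za-z]+", content.lower()))

print(process_content("I Dare You: GOING BALD!?"))  # i dare you going bald
print(process_content("Top 10 Goals of 2017"))      # top goals of
print(process_content("Beyoncé - Live"))            # beyonc live
```

Because the pattern matches ASCII letters only, titles written in other scripts would be reduced to an empty string.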

In [0]:
df_unique['processed_title'] = df_unique['title'].apply(process_content)
train_data = df_unique.copy()

In [0]:
train_data.head()

Unnamed: 0,video_id,title,category_id,category_title,processed_title
0,2kyS6SvSYSE,WE WANT TO TALK ABOUT OUR MARRIAGE,22,People & Blogs,we want to talk about our marriage
1,0mlNzVSJrT0,Me-O Cats Commercial,22,People & Blogs,me o cats commercial
2,STI2fI7sKMo,"AFFAIRS, EX BOYFRIENDS, $18MILLION NET WORTH -...",22,People & Blogs,affairs ex boyfriends million net worth google...
4,8mhTWqWlQzU,Wearing Online Dollar Store Makeup For A Week,22,People & Blogs,wearing online dollar store makeup for a week
6,fCTKDn3Q8xQ,Idiot's Guide to Japanese Squat Toilets,22,People & Blogs,idiot s guide to japanese squat toilets


In [0]:
# encoder = LabelEncoder()
# y = encoder.fit_transform(train_data['Category'])
# train_data['n_category'] = y
# print(y[:5])

**Data Distribution**

In [0]:
categories = train_data['category_title']
titles = train_data['processed_title']
N = len(titles)
print('Number of video',N)

Number of video 6343


In [0]:
labels = list(set(categories))
n_classes = len(labels)
print('possible categories',labels)

possible categories ['Autos & Vehicles', 'People & Blogs', 'Science & Technology', 'Sports', 'Education', 'Gaming', 'Howto & Style', 'Nonprofits & Activism', 'News & Politics', 'Entertainment', 'Music', 'Pets & Animals', 'Film & Animation', 'Travel & Events', 'Comedy', 'Shows']


In [0]:
for l in labels:
    print(('number of "%s" video = %d' % (l,len(train_data.loc[train_data['category_title'] == l]))))

number of "Autos & Vehicles" video = 70
number of "People & Blogs" video = 495
number of "Science & Technology" video = 381
number of "Sports" video = 448
number of "Education" video = 249
number of "Gaming" video = 103
number of "Howto & Style" video = 594
number of "Nonprofits & Activism" video = 14
number of "News & Politics" video = 505
number of "Entertainment" video = 1620
number of "Music" video = 798
number of "Pets & Animals" video = 139
number of "Film & Animation" video = 319
number of "Travel & Events" video = 59
number of "Comedy" video = 545
number of "Shows" video = 4
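
The counts above are heavily imbalanced: Entertainment has 1620 of the 6343 titles, while Shows has 4. A compact way to get the same counts along with each category's share is `value_counts`; the sketch below uses a toy series in place of `train_data['category_title']`.

```python
import pandas as pd

# Toy stand-in for train_data['category_title'].
toy = pd.Series(["Music", "Music", "Comedy", "Shows"], name="category_title")

counts = toy.value_counts()                # absolute count per category
shares = toy.value_counts(normalize=True)  # fraction of all rows per category

summary = pd.DataFrame({"count": counts, "share": shares.round(2)})
print(summary)
```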


**Split the data**

Split the data into 70% training and 30% testing sets, using `random_state=57` so the split is reproducible.
- Data Training : `X_train & y_train`
- Data Testing : `X_test & y_test`

In [0]:
X_train, X_test, y_train, y_test = train_test_split(train_data['processed_title'],train_data['category_title'],test_size=0.3,random_state=57)
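
With categories as small as the ones counted earlier, a plain random split can leave very few examples of a rare class on one side. One possible variation, not used in this notebook, is to pass `stratify` so each class keeps roughly the same proportion in both parts. A minimal sketch on toy labels:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 20 titles of class "a", 10 of class "b".
X = [f"title {i}" for i in range(30)]
y = ["a"] * 20 + ["b"] * 10

# stratify=y preserves the 2:1 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=57, stratify=y
)
print(Counter(y_tr), Counter(y_te))
```

Stratification requires at least two examples per class.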

**Data Pipeline**

1. Tokenizing and counting words using `CountVectorizer` with `stop_words='english'` to remove English stop words
2. Reweighting the counts using `TfidfTransformer` (TF-IDF)
3. Model training using `LogisticRegression`

In [0]:
model = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression()),
                     ])
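
The first two steps can also be written as a single `TfidfVectorizer`, which scikit-learn documents as equivalent to `CountVectorizer` followed by `TfidfTransformer`. The sketch below checks that on three invented titles; it is an alternative formulation, not a change to the pipeline above.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

# Invented sample titles, already lowercased.
docs = ["official music video", "funny cat video", "election news tonight"]

# Two-step version, as in the pipeline above.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step version with the same default settings.
one_step = TfidfVectorizer(stop_words="english").fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```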

**Fitting the model**

In [0]:
text_clf = model.fit(X_train, y_train)

**Predict on the test data**

In [0]:
predicted = model.predict(X_test)

**Confusion Matrix**

In [0]:
confusion_matrix(y_test,predicted)

array([[  0,   0,   0,  16,   0,   0,   0,   0,   0,   0,   0,   0,   3,
          0,   2,   0],
       [  0,  31,   1,  99,   0,   0,   6,  12,   0,   0,   0,   0,   1,
          0,   2,   0],
       [  0,   1,   1,  53,   2,   0,   3,   2,   1,   0,   1,   0,   2,
          0,   0,   0],
       [  0,   2,   0, 447,   3,   0,   9,  18,   5,   0,   3,   1,   2,
          0,   2,   0],
       [  0,   2,   0,  50,  25,   0,   0,   1,   1,   0,   1,   0,   1,
          0,   1,   0],
       [  0,   0,   0,  32,   1,   3,   0,   0,   1,   0,   1,   0,   0,
          0,   0,   0],
       [  0,   0,   0,  97,   0,   0,  98,   4,   0,   0,   3,   0,   0,
          0,   0,   0],
       [  0,   1,   0,  85,   0,   0,   1, 146,   0,   0,   1,   0,   0,
          0,   0,   0],
       [  0,   0,   0,  89,   1,   0,   1,   2,  56,   0,   3,   0,   3,
          0,   1,   0],
       [  0,   0,   0,   4,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0],
       [  0,   6,   1,  98,   
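
The raw array is hard to read without row and column names. By default `confusion_matrix` orders classes by sorted label, so the matrix can be wrapped in a DataFrame using that same order. A minimal sketch on invented labels:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented true and predicted labels.
y_true = ["Music", "Comedy", "Music", "Sports", "Comedy"]
y_pred = ["Music", "Music", "Music", "Sports", "Comedy"]

# Sorted unique labels: the order confusion_matrix uses when labels is not given.
class_names = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=class_names)

cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)
cm_df.index.name = "actual"
cm_df.columns.name = "predicted"
print(cm_df)
```

Rows are actual categories and columns are predicted ones, so off-diagonal cells show which categories get confused.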

**Accuracy Score & Classification Report**

In [0]:
print('accuracy_score',accuracy_score(y_test,predicted))
print('Reporting...')

accuracy_score 0.4908039936941671
Reporting...


In [0]:
# Omit target_names: `labels` comes from an unordered set, while classification_report
# orders its rows by sorted label, so passing it would attach the wrong names to the rows.
print(classification_report(y_test, predicted))

                       precision    recall  f1-score   support

     Autos & Vehicles       0.00      0.00      0.00        21
               Comedy       0.67      0.20      0.31       152
            Education       0.33      0.02      0.03        66
        Entertainment       0.36      0.91      0.52       492
     Film & Animation       0.76      0.30      0.43        82
               Gaming       1.00      0.08      0.15        38
        Howto & Style       0.75      0.49      0.59       202
                Music       0.71      0.62      0.67       234
      News & Politics       0.80      0.36      0.50       156
Nonprofits & Activism       0.00      0.00      0.00         4
       People & Blogs       0.51      0.12      0.19       151
       Pets & Animals       0.92      0.24      0.38        50
 Science & Technology       0.54      0.19      0.28       104
                Shows       0.00      0.00      0.00         3
               Sports       0.83      0.59      0.69  

**Model Evaluation using Cross Validation**

On the training data

In [0]:
cross_val_score(model, X_train, y_train, cv=5)

array([0.48198198, 0.46058559, 0.45945946, 0.48310811, 0.4740991 ])

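On the test data. Note that `cross_val_score` refits the pipeline on folds of the test set, so these scores describe models trained on much less data, not the model fitted above.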

In [0]:
cross_val_score(model, X_test, y_test, cv=5)

array([0.39107612, 0.4015748 , 0.39370079, 0.39736842, 0.42105263])
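
`cross_val_score` reports accuracy by default for classifiers, and with imbalanced classes accuracy can be dominated by the largest category. One option is to also look at macro-averaged F1, which weights every class equally. The sketch below uses invented titles and a toy copy of the pipeline (`toy_model`), so the printed numbers say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Invented titles: four per class.
titles = [
    "official music video", "new song lyrics",
    "live concert music", "album song release",
    "football match highlights", "basketball game highlights",
    "soccer goal match", "tennis final game",
]
cats = ["Music"] * 4 + ["Sports"] * 4

toy_model = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])

# Same folds, two different metrics.
acc = cross_val_score(toy_model, titles, cats, cv=2, scoring="accuracy")
f1_macro = cross_val_score(toy_model, titles, cats, cv=2, scoring="f1_macro")
print("accuracy:", acc.mean(), "macro F1:", f1_macro.mean())
```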

**`predict_title` function**

In [0]:
def predict_title(model, new_data):
    test_data = pd.DataFrame(new_data, columns=['title'])
    test_data['processed_title'] = test_data['title'].apply(process_content)
    
    X_test = test_data['processed_title']
    predictions = model.predict(X_test)
    
    return predictions
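
Since `LogisticRegression` exposes class probabilities, a variant of this function could return the few most likely categories instead of a single label. The helper `top_k_categories` below is hypothetical and is shown on a toy model trained on invented titles; with the real pipeline, the titles would first go through `process_content`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def top_k_categories(model, titles, k=3):
    """Return, for each title, the k most probable (label, probability) pairs."""
    proba = model.predict_proba(titles)
    classes = model.classes_
    # Sort each row's class indices by descending probability and keep the first k.
    top = np.argsort(proba, axis=1)[:, ::-1][:, :k]
    return [[(classes[j], float(row[j])) for j in idx]
            for row, idx in zip(proba, top)]

# Toy model on invented titles, only to make the sketch runnable.
train_titles = ["official music video", "new song lyrics",
                "football match highlights", "tennis final game"]
train_cats = ["Music", "Music", "Sports", "Sports"]
toy_model = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
]).fit(train_titles, train_cats)

ranked = top_k_categories(toy_model, ["music video"], k=2)
print(ranked)
```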

**New Data Sample Test**

In [0]:
t1 = ['Rupiah is the best in Asia today.']
news_title = pd.DataFrame(t1, columns=['title'])

In [0]:
predict_title(model, t1)

array(['News & Politics'], dtype=object)