Using YouTube "Trending" page data to predict the 'Category' of a video given the 'Title'.
======
***

# Table of Contents

## Part 1: Reading and Merging Data Sources
## Part 2: Train (using Naive Bayes)
## Part 3: Test

# Part 1: Reading and Merging Data Sources
***


### Data source:
+ https://www.kaggle.com/datasnaek/youtube-new

### Import Modules

In [17]:
import numpy as np
import pandas as pd
import collections
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Import the CSV and take an initial look:

In [18]:
USvids = pd.read_csv("./dataCSV/USvideos.csv", header=0)
USvids.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...


### Delete unused columns and rename the remaining columns:

In [19]:
keep_columns = ['title', 'category_id']
# Select the two columns and rename them directly; no intermediate CSV round trip is needed
new_USvids = USvids[keep_columns].rename(columns={'title': 'Title', 'category_id': 'Category_ID'})
new_USvids

Unnamed: 0,Title,Category_ID
0,WE WANT TO TALK ABOUT OUR MARRIAGE,22
1,The Trump Presidency: Last Week Tonight with J...,24
2,"Racist Superman | Rudy Mancuso, King Bach & Le...",23
3,Nickelback Lyrics: Real or Fake?,24
4,I Dare You: GOING BALD!?,24
...,...,...
40944,The Cat Who Caught the Laser,15
40945,True Facts : Ant Mutualism,22
40946,I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...,24
40947,How Black Panther Should Have Ended,1


### The data source provides descriptions of Category_ID in a separate JSON file.
### Let's look at the JSON file:

In [20]:
Categories_JSON = pd.read_json("./dataCSV/US_category_id.JSON")
Categories_JSON.head(3)

Unnamed: 0,kind,etag,items
0,youtube#videoCategoryListResponse,"""ld9biNPKjAjgjV7EZ4EKeEGrhao/1v2mrzYSYG6onNLt2...","{'kind': 'youtube#videoCategory', 'etag': '""ld..."
1,youtube#videoCategoryListResponse,"""ld9biNPKjAjgjV7EZ4EKeEGrhao/1v2mrzYSYG6onNLt2...","{'kind': 'youtube#videoCategory', 'etag': '""ld..."
2,youtube#videoCategoryListResponse,"""ld9biNPKjAjgjV7EZ4EKeEGrhao/1v2mrzYSYG6onNLt2...","{'kind': 'youtube#videoCategory', 'etag': '""ld..."


### Create a list of dictionaries with ID and Category label mapping:

In [21]:
CategoryDict = [{'id': item['id'], 'title': item['snippet']['title']} for item in Categories_JSON['items']]
CategoryDict

[{'id': '1', 'title': 'Film & Animation'},
 {'id': '2', 'title': 'Autos & Vehicles'},
 {'id': '10', 'title': 'Music'},
 {'id': '15', 'title': 'Pets & Animals'},
 {'id': '17', 'title': 'Sports'},
 {'id': '18', 'title': 'Short Movies'},
 {'id': '19', 'title': 'Travel & Events'},
 {'id': '20', 'title': 'Gaming'},
 {'id': '21', 'title': 'Videoblogging'},
 {'id': '22', 'title': 'People & Blogs'},
 {'id': '23', 'title': 'Comedy'},
 {'id': '24', 'title': 'Entertainment'},
 {'id': '25', 'title': 'News & Politics'},
 {'id': '26', 'title': 'Howto & Style'},
 {'id': '27', 'title': 'Education'},
 {'id': '28', 'title': 'Science & Technology'},
 {'id': '30', 'title': 'Movies'},
 {'id': '31', 'title': 'Anime/Animation'},
 {'id': '32', 'title': 'Action/Adventure'},
 {'id': '33', 'title': 'Classics'},
 {'id': '34', 'title': 'Comedy'},
 {'id': '35', 'title': 'Documentary'},
 {'id': '36', 'title': 'Drama'},
 {'id': '37', 'title': 'Family'},
 {'id': '38', 'title': 'Foreign'},
 {'id': '39', 'title': 'Horro

### Create a data frame of the above information

In [22]:
CategoriesDF = pd.DataFrame(CategoryDict)
Categories = CategoriesDF.rename(index=str, columns={"id": "Category_ID", "title": "Category"})
Categories.head(3)

Unnamed: 0,Category_ID,Category
0,1,Film & Animation
1,2,Autos & Vehicles
2,10,Music
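
The category labels can also be attached to the video rows directly with a merge. The sketch below uses small stand-in frames (`videos` and `categories` are illustrative names, not variables from this notebook) and assumes the ID is an integer in the CSV data and a string in the JSON data, as the outputs above suggest.

```python
import pandas as pd

# Toy stand-ins for the notebook's new_USvids and Categories frames.
videos = pd.DataFrame({
    "Title": ["The Cat Who Caught the Laser", "How Black Panther Should Have Ended"],
    "Category_ID": [15, 1],          # integers, as read from the CSV
})
categories = pd.DataFrame({
    "Category_ID": ["1", "15"],      # strings, as stored in the JSON file
    "Category": ["Film & Animation", "Pets & Animals"],
})

# Align the key dtypes before merging so integer and string IDs can match.
categories["Category_ID"] = categories["Category_ID"].astype(int)

# A left join keeps every video row, even if its ID has no label.
labelled = videos.merge(categories, on="Category_ID", how="left")
print(labelled)
```

A left join is used here so that a video with an unlabelled ID shows up with a missing `Category` instead of disappearing from the result.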


# Part 2: Train (using Naive Bayes)
***

### Tokenize 'Title' into words and build a matrix of word counts using CountVectorizer:

In [23]:
vector = CountVectorizer()
counts = vector.fit_transform(new_USvids['Title'].values)
vector

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
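
To see what the vectorizer produces, here is a minimal sketch on two titles taken from the table above. The names `toy_titles`, `toy_vector` and `toy_counts` are illustrative only. With the default settings shown in the output above, text is lowercased and single-character tokens such as "I" are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_titles = ["I Dare You: GOING BALD!?", "True Facts : Ant Mutualism"]

toy_vector = CountVectorizer()
toy_counts = toy_vector.fit_transform(toy_titles)   # sparse matrix: one row per title

# vocabulary_ maps each token to its column; columns follow sorted token order.
vocab = sorted(toy_vector.vocabulary_)
print(vocab)
print(toy_counts.toarray())
```

Each row of the matrix holds the number of times each vocabulary word occurs in that title; this is the input the naive Bayes model is trained on.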

### Fit the naive Bayes model with 'Category_ID' as the target:

In [24]:
NB_Model = MultinomialNB()
targets = new_USvids['Category_ID'].values
NB_Model.fit(counts,targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Check Accuracy using a 90/10 train/test split

In [25]:
X = counts
y = targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

NBtest = MultinomialNB().fit(X_train, y_train)
nb_predictions = NBtest.predict(X_test)
acc_nb = NBtest.score(X_test, y_test)
print('The Naive Bayes Algorithm scored an accuracy of', acc_nb)


The Naive Bayes Algorithm scored an accuracy of 0.8964590964590965
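
One caveat about this score: the vectorizer was fit on every title before the split, and if the same title appears in several rows (a video can trend on more than one day), copies of it can land in both the training and the test set, which would inflate the accuracy. The following is a sketch of a stricter check on toy data, assuming the title itself is a reasonable grouping key. It uses scikit-learn's `GroupShuffleSplit`, which is not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GroupShuffleSplit
from sklearn.naive_bayes import MultinomialNB

# Toy data: each title appears twice, mimicking a video that trended on two days.
titles = np.array([
    "cat chases laser", "cat chases laser",
    "messi scores goal", "messi scores goal",
    "funny cat video", "funny cat video",
    "world cup goal highlights", "world cup goal highlights",
])
labels = np.array([15, 15, 17, 17, 15, 15, 17, 17])

# Use the title as the group key so duplicates stay on one side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(titles, labels, groups=titles))

# Fit the vectorizer on training titles only, so test vocabulary does not leak in.
vec = CountVectorizer()
X_train = vec.fit_transform(titles[train_idx])
X_test = vec.transform(titles[test_idx])

model = MultinomialNB().fit(X_train, labels[train_idx])
print("held-out accuracy:", model.score(X_test, labels[test_idx]))
```

On the real data, a score measured this way would likely be lower than the one above, but it would better reflect how the model handles titles it has never seen.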


In [26]:
from joblib import dump

In [27]:
dump(NB_Model, filename="USyoutube_trained.joblib")

['USyoutube_trained.joblib']
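
Note that the saved file contains only the classifier; the fitted `vector` is still needed to turn new titles into counts. One possible way to keep the two together is a scikit-learn `Pipeline`, sketched below on toy data. The file name and the toy titles are illustrative assumptions, not part of the original workflow.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_titles = ["cat chases laser", "messi scores goal", "funny cat video", "world cup goal"]
toy_labels = [15, 17, 15, 17]

# The pipeline vectorizes raw titles and then classifies the resulting counts.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(toy_titles, toy_labels)

# Hypothetical file name, written to a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "title_pipeline.joblib")
dump(pipeline, path)

# After loading, raw strings can be passed straight to predict().
restored = load(path)
restored_prediction = restored.predict(["cat video"])
print(restored_prediction)
```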

## Satisfactory accuracy; training on historical data is complete.
***

# Part 3: Test

### Enter hypothetical titles to predict the category for: 

In [28]:
title_input = input('Enter your Title: ')
Titles = [title_input]

Enter your Title: Messi the best


### Vectorize the titles and pass them to the naive Bayes model:

In [29]:
Titles_counts = vector.transform(Titles)
Predict = NB_Model.predict(Titles_counts)
Predict

array([17])
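
`predict` returns only the most likely category ID. To see how confident the model is, `predict_proba` can be paired with `classes_`, as in this toy sketch (the data and variable names are illustrative). Words that are not in the fitted vocabulary are ignored by `transform`, so a title made up entirely of unseen words is classified from the class priors alone.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_titles = ["cat chases laser", "messi scores goal", "funny cat video", "world cup goal"]
toy_labels = [15, 17, 15, 17]

vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(toy_titles), toy_labels)

# One probability per class, in the same order as model.classes_.
probs = model.predict_proba(vec.transform(["messi goal"]))[0]

# Rank the category IDs from most to least likely.
ranked = sorted(zip(model.classes_, probs), key=lambda pair: pair[1], reverse=True)
for category_id, probability in ranked:
    print(category_id, round(float(probability), 3))
```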

### The output is an array of category IDs. Look each ID up in the category list (from the JSON file) to find its "title":

In [30]:
CategoryNamesList = []
for Category_ID in Predict:
    MatchingCategories = [x for x in CategoryDict if x["id"] == str(Category_ID)]
    if MatchingCategories:
        CategoryNamesList.append(MatchingCategories[0]["title"])
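
The loop above skips any ID that has no match, which would leave the list of names shorter than the list of titles. A possible alternative, sketched here with stand-in data (`category_records` and `predicted_ids` are illustrative names), builds a lookup dictionary once and uses a placeholder for unknown IDs so the two lists stay aligned.

```python
import numpy as np

# Stand-in for the list of {'id', 'title'} dicts built from the JSON file.
category_records = [
    {"id": "15", "title": "Pets & Animals"},
    {"id": "17", "title": "Sports"},
]
predicted_ids = np.array([17, 15, 99])  # 99 is a made-up ID with no label

# Build the ID-to-name lookup once, converting the string IDs to integers.
id_to_title = {int(record["id"]): record["title"] for record in category_records}

# .get() returns a placeholder for unknown IDs instead of dropping them.
names = [id_to_title.get(int(category_id), "Unknown") for category_id in predicted_ids]
print(names)  # ['Sports', 'Pets & Animals', 'Unknown']
```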

### Map these values to the Titles we want to Predict:

In [31]:
TitleDataFrame = []
for i in range(0, len(Titles)):
    TitleToCategories = {'Title': Titles[i],  'Category': CategoryNamesList[i]}
    TitleDataFrame.append(TitleToCategories)

### Convert the resulting list of dicts to a Data Frame:

In [32]:
# TitleDataFrame already holds one {'Title', 'Category'} dict per input title
FinalDF = pd.DataFrame(TitleDataFrame)
# Rename by key so each label is attached to the right column
FinalDF = FinalDF.rename(columns={'Title': 'Hypothetical Video Title', 'Category': 'Predicted Category'})
FinalDF = FinalDF[['Hypothetical Video Title', 'Predicted Category']]

# View Final Prediction Results:

In [33]:
FinalDF

Unnamed: 0,Hypothetical Video Title,Predicted Category
0,Messi the best,Sports
