# Youtube Recommender System
#### Ann Yoo Abbott, Joy Chiang, Shubham Gupta, Annabelle Lee, Alex Kim
Dataset: https://www.kaggle.com/datasnaek/youtube-new

This is a Youtube Recommender System that outputs similar videos you might be interested given a video title.

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import json
import gensim
import gensim.downloader as api
import warnings
warnings.filterwarnings('ignore')
from collections import Counter

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

## Part 0: Data Loading

We are only interested in English speaking countries so we select datasets from United States, Canada, and Great Britain. 

In [2]:
us_video_data = pd.read_csv("dataset/USvideos.csv")
ca_video_data = pd.read_csv("dataset/CAvideos.csv")
gb_video_data = pd.read_csv("dataset/GBvideos.csv")

video_data = us_video_data.append(ca_video_data, ignore_index=True)
video_data = video_data.append(gb_video_data, ignore_index=True)
print(video_data.shape)
video_data.head()

(120746, 16)


Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


There are ~120k trending videos. However, many of them are duplicated videos since one trending video may appears in multiple regions/dates. We will perform data cleaning to remove duplicated videos.

In [3]:
clean_vid_data = video_data.groupby(["title"]).first().reset_index()
print(clean_vid_data.shape)
clean_vid_data.head()

(30626, 16)


Unnamed: 0,title,video_id,trending_date,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,!! THIS VIDEO IS NOTHING BUT PAIN !! | Getting...,PNn8sECd7io,18.04.01,Markiplier,20,2018-01-03T19:33:53.000Z,"getting over it|""markiplier""|""funny moments""|""...",835930,47058,1023,8250,https://i.ytimg.com/vi/PNn8sECd7io/default.jpg,False,False,False,Getting Over It continues with RAGE BEYOND ALL...
1,"#1 Fortnite World Rank - 2,323 Solo Wins!",DvPW66IFhMI,18.09.03,AlexRamiGaming,20,2018-03-09T07:15:52.000Z,"PS4 Battle Royale|""PS4 Pro Battle Royale""|""Bat...",212838,5199,542,11,https://i.ytimg.com/vi/DvPW66IFhMI/default.jpg,False,False,False,Discord For EVERYONE - https://discord.gg/nhud...
2,"#1 Fortnite World Rank - 2,330 Solo Wins!",EXEaMjFeiEk,18.10.03,AlexRamiGaming,20,2018-03-10T06:26:17.000Z,"PS4 Battle Royale|""PS4 Pro Battle Royale""|""Bat...",200764,5620,537,45,https://i.ytimg.com/vi/EXEaMjFeiEk/default.jpg,False,False,False,Discord For EVERYONE - https://discord.gg/nhud...
3,#1 MOST ANTICIPATED VIDEO (Timber Frame House ...,bYvQmusLaxw,17.20.12,Pure Living for Life,24,2017-12-20T02:49:11.000Z,"timber frame|""timber framing""|""timber frame ra...",79152,7761,159,1965,https://i.ytimg.com/vi/bYvQmusLaxw/default.jpg,False,False,False,Shelter Institute: http://bit.ly/2iwXj8B\nFull...
4,#1 WORLD RANKED 1463 SOLO WINS! - FORTNITE BAT...,xQ4Q5b2WwO8,18.18.01,AlexRamiGaming,20,2018-01-17T18:00:05.000Z,"PS4 Battle Royale|""PS4 Pro Battle Royale""|""Bat...",541482,15430,891,40,https://i.ytimg.com/vi/xQ4Q5b2WwO8/default_liv...,False,False,False,►Twitter @AlexRamiGaming\n\n►Tips & Donations\...


After removing the duplicated videos, there are only ~30k unique videos left, which is $\frac{1}{4}$ of the origin dataset.

## Part 1: Category Classification Model
We will train a classifier that given a video title, it can predict its category.

In [4]:
def create_category_map():
    """ 
    Our training data contain only the category_id instead of category names.
    This funtion create a category map, each number maps to a specific category. 
    For example, 10 stands for Music category.
    """
    category_map = {}
    data = {}
    with open('dataset/US_category_id.json', 'r') as outfile:
        data = json.load(outfile)
        for item in data["items"]:
            category_map[item["id"]] = item["snippet"]["title"]
    return category_map
category_map = create_category_map()
category_map

{'1': 'Film & Animation',
 '2': 'Autos & Vehicles',
 '10': 'Music',
 '15': 'Pets & Animals',
 '17': 'Sports',
 '18': 'Short Movies',
 '19': 'Travel & Events',
 '20': 'Gaming',
 '21': 'Videoblogging',
 '22': 'People & Blogs',
 '23': 'Comedy',
 '24': 'Entertainment',
 '25': 'News & Politics',
 '26': 'Howto & Style',
 '27': 'Education',
 '28': 'Science & Technology',
 '29': 'Nonprofits & Activism',
 '30': 'Movies',
 '31': 'Anime/Animation',
 '32': 'Action/Adventure',
 '33': 'Classics',
 '34': 'Comedy',
 '35': 'Documentary',
 '36': 'Drama',
 '37': 'Family',
 '38': 'Foreign',
 '39': 'Horror',
 '40': 'Sci-Fi/Fantasy',
 '41': 'Thriller',
 '42': 'Shorts',
 '43': 'Shows',
 '44': 'Trailers'}

In [5]:
def preprocess_text_df(text):
    """ 
    This function transforms a series to lower case and to plain text without punctuations.  
    """
    text = text.apply(lambda x: x.lower())
    text = text.str.replace(r'[^a-zA-Z ]', '') #text.str.replace(r'[^\w\s]', '')
    return text

def preprocess_text(text):
    """ 
    This function transforms a string to lower case and to plain text without punctuations.  
    
    Example Input: 
        "Dua Lipa - IDGAF (Official Music Video)"
    Example Output:
        'dua lipa  idgaf official music video'
    """
    text = text.lower()
    regex = re.compile('[^a-zA-Z ]')
    text = regex.sub('', text)
    return text

def preprocess_tags(text):
    """ 
    This function transforms a series of tags or a string to lower case and to plain text without punctuations. 
    
    Example Input: 
        'getting over it|"markiplier"|"funny moments"|"rage"|"screaming"|"angry"|"funniest"|"best of"'
    Example Output:
        'getting over it  markiplier   funny moments   rage   screaming   angry   funniest   best of '
    """
    try:
        text = text.apply(lambda x: x.lower())
        text = text.str.replace(r'[^a-zA-Z ]', ' ')
    except:
        text = text.lower()
        regex = re.compile('[^a-zA-Z ]')
        text = regex.sub(' ', text)
    return text

In [6]:
def predict_category(X_train, y_train, X_test, vectorizer, clf):
    """
    This function takes a series and predict a series of category_id.
    """
    X_train = vectorizer.fit_transform(X_train)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(vectorizer.transform(X_test))
    return clf, vectorizer, y_pred

def predict_category_from_title_helper(X_train, y_train, X_test, vectorizer, clf):
    """
    Helper function for predict_category_from_title.
    """
    X_train = vectorizer.fit_transform(X_train)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(vectorizer.transform(pd.Series(X_test)))
    return y_pred

def predict_category_from_title(title):
    """
    This function takes a string and predict its category_id.
    We preset the vectorizer as TfidfVectorizer and we use SGDClassifier as our classification model.
    
    Example Input: 
        'dua lipa  idgaf official music video'
    Example Output:
        10
    """
    if not model:
        return predict_category_from_title_helper(X_train, y_train, title, TfidfVectorizer(), SGDClassifier())[0]
    else:
        return model.predict(vectorizer.transform(pd.Series(title)))[0]

After set up the category classification model functions, we want to pre-train the model and verify the performance.

In [7]:
clean_vid_data["title_clean"] = preprocess_text_df(clean_vid_data.title)
clean_vid_data["tags_clean"] = preprocess_tags(clean_vid_data.tags)
X_train, X_test, y_train, y_test = train_test_split(clean_vid_data["title_clean"] + clean_vid_data["tags_clean"], 
                                                    clean_vid_data['category_id'], 
                                                    test_size=0.1, 
                                                    shuffle=True
                                                    )
model, vectorizer, y_pred = predict_category(X_train, y_train, X_test, TfidfVectorizer(ngram_range=(1, 2)), SGDClassifier())
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.81      0.55      0.65       161
           2       0.90      0.90      0.90        31
          10       0.77      0.87      0.82       222
          15       0.61      0.77      0.68        22
          17       0.86      0.93      0.89       256
          19       0.83      0.69      0.75        29
          20       0.92      0.73      0.81       110
          22       0.77      0.55      0.64       301
          23       0.85      0.78      0.82       224
          24       0.76      0.88      0.82       980
          25       0.82      0.81      0.81       347
          26       0.79      0.89      0.83       178
          27       0.88      0.66      0.75        87
          28       0.84      0.70      0.76        99
          29       1.00      0.33      0.50         6
          43       0.50      0.70      0.58        10

    accuracy                           0.80      3063
   macro avg       0.81   

Our model has around an 80% accuracy.

In [8]:
def category_group(num):
    """
    category_group returns a dataframe in the category specified.
    """
    category_number = num    
    category_name = category_map[str(num)]
    rslt_df = clean_vid_data[clean_vid_data['category_id'] == category_number] 
    return rslt_df.reset_index()

def find_tags(vidtitle, rslt_df) : 
    """
    find_tags function returns a list of tags that relate to a given video title and the category specified.
    """
    row = rslt_df[rslt_df['title'] == vidtitle]
    tags = row['tags']
    list_of_tags = []
    for it in tags:
        ls = it.split('|')
        for l in ls:
            if l != '[none]':
                list_of_tags.append(l)    
    return list_of_tags #[0] if len(list_of_tags) else list_of_tags

def find_tags_from_string(vidtitle, tags) : 
    """
    find_tags function returns a list of tags that relate to a given video title and the category specified.
    """
    list_of_tags = []
    for it in tags:
        ls = it.split('|')
        for l in ls:
            if l != '[none]':
                list_of_tags.append(l)    
    return list_of_tags #[0] if len(list_of_tags) else list_of_tags

In [9]:
category_df10 = category_group(10)
display(category_df10.head())
a0 = find_tags("Dua Lipa - IDGAF (Official Music Video)", category_df10)
print("first: \n", a0, "\n")
a1 = find_tags("Azealia Banks - Anna Wintour", category_df10)
print("--------------------------------------------------------------------------------------------------------------")
print("second: \n", a1, "\n")

Unnamed: 0,index,title,video_id,trending_date,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,title_clean,tags_clean
0,29,"#AboveTheNoise feat. Serena Williams, Neymar J...",oWithLP0VlQ,17.29.11,Beats by Dre,10,2017-11-22T17:23:20.000Z,[none],2152261,7824,158,324,https://i.ytimg.com/vi/oWithLP0VlQ/default.jpg,False,False,False,"In a loud world full of distractions, it's nev...",abovethenoise feat serena williams neymar jr c...,none
1,39,#GiveAToast with Connor McDavid on November 20,PMvQ81VbwMA,17.17.11,Canadian Tire,10,2017-11-16T14:41:05.000Z,"jumpstart|""commercial""|""marketing""|""toasters""|...",7610,13,0,3,https://i.ytimg.com/vi/PMvQ81VbwMA/default.jpg,False,False,False,Order your limited edition toaster on November...,giveatoast with connor mcdavid on november,jumpstart commercial marketing toasters ...
2,47,#LightTheWorld Christmas Concert with The Pian...,t7Fa1GUf-LE,17.16.12,ThePianoGuys,10,2017-12-12T23:39:56.000Z,"ThePianoGuys|""The Piano Guys""|""Secrets""|""David...",196922,8223,128,443,https://i.ytimg.com/vi/t7Fa1GUf-LE/default.jpg,False,False,False,#LightTheWorld is a worldwide effort to share ...,lighttheworld christmas concert with the piano...,thepianoguys the piano guys secrets david...
3,84,$1 Guitar,elmet3x4AeI,17.13.12,Rob Scallon,10,2017-12-04T14:01:02.000Z,"$1 guitar|""one dollar guitar""|""dollar store gu...",804978,40019,713,2590,https://i.ytimg.com/vi/elmet3x4AeI/default.jpg,False,False,False,Rocking out on a budget.\n2nd channel video: h...,guitar,guitar one dollar guitar dollar store gu...
4,241,'The Wilderness' - Kellz & The Truth Experimen...,xLserXBaKV8,17.10.12,LATX,10,2017-12-06T13:38:40.000Z,"the truth experiment|""kellz""|""new music""|""new ...",129703,1140,52,14,https://i.ytimg.com/vi/xLserXBaKV8/default.jpg,False,False,False,Official music video for 'The Wilderness' by K...,the wilderness kellz the truth experiment o...,the truth experiment kellz new music new ...


first: 
 ['dua lipa', '"idgaf"', '"dl1"', '"I don\'t give a fuck"', '"dua idgaf"', '"dua lipa official"', '"dua new video"', '"dua lipa video"', '"dua leepa"', '"warner bros records"', '"warner Bros"', '"so i cut you off"', '"dua lipa i dont give a fuck"', '"Pop"', '"Dance"', '"2018"', '"Debut Album"', '"I Don\'t Give A F"', '"Dueling Duas"', '"Dueling Lipas"', '"Double Dua"', '"Double Lipa"', '"Dueling Dua Lipas"', '"Double Dua Lipas"'] 

--------------------------------------------------------------------------------------------------------------
second: 
 ['Azealia', '"Banks"', '"Anna"', '"Wintour"', '"eOne"', '"Music"', '"Dance"', '"Alternative/Indie"', '"Electronic"', '"Club/Dance"', '"Pop"'] 



## Part 2: Same Category Video Rankings
After we know the category of the given video title, we want to create two types of ranking system that return similar video titles.
1. Ranking based on title
2. Ranking based on tags

### 2.1 Ranking Based on Title
Given a video title, return similar video titles.
There are three methods to calulate the similarity between two titles:
**get_jaccard_sim** and **get_cosine_sim** measure the count of words that appear in two titles, while **get_gensim_sim** calculates the similarity using semantic meaning from a pre-trained model.

In [10]:
video = clean_vid_data[["title", "category_id"]]
video["title_clean"] = preprocess_text_df(video.title)
video.head()

Unnamed: 0,title,category_id,title_clean
0,!! THIS VIDEO IS NOTHING BUT PAIN !! | Getting...,20,this video is nothing but pain getting over...
1,"#1 Fortnite World Rank - 2,323 Solo Wins!",20,fortnite world rank solo wins
2,"#1 Fortnite World Rank - 2,330 Solo Wins!",20,fortnite world rank solo wins
3,#1 MOST ANTICIPATED VIDEO (Timber Frame House ...,24,most anticipated video timber frame house rai...
4,#1 WORLD RANKED 1463 SOLO WINS! - FORTNITE BAT...,20,world ranked solo wins fortnite battle roya...


In [11]:
word_vectors = api.load("glove-wiki-gigaword-100")

In [12]:
#source: https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50
def get_jaccard_sim(str1, str2): 
    """
    This function calculates the Jaccard similarity of given two strings.
    """
    if type(str1) == list:
        a = set(str1)
        b = set(str2)
    elif type(str1) == set:
        a = str1
        b = str1
    else:
        a = set(str1.split()) 
        b = set(str2.split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def get_jaccard_sim_list(list1, list2): 
    """
    This function calculates the Jaccard similarity of given two lists. 
    """
    a = set(list1) 
    b = set(list2)
    c = a.intersection(b)
    if (len(a) + len(b) - len(c)) == 0:
        return 0
    return float(len(c)) / (len(a) + len(b) - len(c))

def get_vectors(*strs):
    """
    This function get vectors from the vectorizer. Pre-set to CountVectorizer.
    """
    text = [t for t in strs]
    vectorizer = CountVectorizer(text)
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def get_cosine_sim(*strs): 
    """
    This function calculates the cosine similarity of given strings. 
    """
    vectors = [t for t in get_vectors(*strs)]
    return cosine_similarity(vectors)[0][1]
    
def get_vectors_list(*lists):
    """
    This function get vectors from the vectorizer. Pre-set to CountVectorizer.
    """
    
    text = [" ".join(t) for t in lists]
    vectorizer = CountVectorizer(text)
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def get_cosine_sim_list(*lists): 
    """
    This function calculates the cosine similarity of given two lists.
    """
    vectors = [t for t in get_vectors_list(*lists)]
    return cosine_similarity(vectors)[0][1]

def get_gensim_sim(first, second):
    """
    This function calculates the similarity using semantic meaning from a pre-trained model.
    """
    try:
        similarity = word_vectors.n_similarity(first.split(), second.split())
    except:
        new_first = []
        new_second = []
        
        for f in first:
            if f in word_vectors:
                new_first += [f]
        for s in second:
            if s in word_vectors:
                new_second += [s]
        similarity = word_vectors.n_similarity(new_first or ['none'], new_second or ['none'])
    return similarity

def get_gensim_sim_list(first_list, second_list):
    """
    This function calculates the similarity using semantic meaning from a pre-trained model.
    """
    try:
        similarity = word_vectors.n_similarity(first_list, second_list)
    except:
        new_first = []
        new_second = []
        
        for f in first_list:
            if f in word_vectors:
                new_first += [f]
        for s in second_list:
            if s in word_vectors:
                new_second += [s]
        similarity = word_vectors.n_similarity(new_first or ['none'], new_second or ['none'])
    return similarity

In [13]:
def find_similar_title(v, function, num):
    """
    This function takes a video title and similarity function and return similar num of video titles.
    """
    category = predict_category_from_title(v)
    video_category = video.loc[video['category_id']==category]

    score = []
    v_split = v.split()
    for t in video_category.title_clean:
        t_split = t.split() 
        score += [function(v_split, t_split)]
    video_category["title_score"] = score
    
    video_category = video_category.loc[
        video_category['title_score']< 1].sort_values(by=['title_score'], ascending=False)
    return video_category.reset_index().drop(columns=['index', 'title_clean'])[0:num]

In [14]:
example_title1 = "#ProudToCreate: Pride 2018"
similar_video_titles = find_similar_title(example_title1, get_gensim_sim, 5)
print(example_title1)
display(similar_video_titles)

example_title2 = "Marshmello - FLY (Official Music Video)"
similar_video_titles = find_similar_title(example_title2, get_gensim_sim, 5)
print(example_title2)
display(similar_video_titles)

example_title3 = "iphone case"
print(example_title3)
find_similar_title(example_title3, get_gensim_sim, 3)

#ProudToCreate: Pride 2018


Unnamed: 0,title,category_id,title_score
0,2018 FIFA World Cup | Forget | ITV,24,0.493527
1,Pyeongchang Winter Olympics 2018 figure skatin...,24,0.436567
2,🏆 UEFA CHAMPIONS LEAGUE 2018,24,0.415133
3,Jimmy Kimmel Monologue 2/15/2018 - Olympic Pye...,24,0.396082
4,2017 Miss Universe Preliminary Competition,24,0.380358


Marshmello - FLY (Official Music Video)


Unnamed: 0,title,category_id,title_score
0,"T.I. Opens Up On Kanye Meeting, Gucci Mane, Gu...",10,0.706863
1,U2: Live in der Berliner U-Bahnlinie U2: „Get ...,10,0.680408
2,Calvin Harris - Nuh Ready Nuh Ready (Official ...,10,0.67999
3,K Money - Hurt You ft. Yung Tory (Official Video),10,0.674404
4,The Ginger Ed Man,10,0.674173


iphone case


Unnamed: 0,title,category_id,title_score
0,iPhone X – Selfies on iPhone X – Apple,28,0.820157
1,Retro iMac iPhone X Cases!,28,0.80353
2,Apple iPhone X - One Week Later,28,0.802672


### 2.2 Ranking Based on Tags
Given a video title, return similar video titles by tags.

In [15]:
def map_titles_to_tags(category):
    """
    This function maps titles to tags in given category.
    """
    category_df = category_group(category)
    title_tags_dict = {}
    for i in range(len(category_df)):
        video_title = category_df["title"][i]
        video_tags = find_tags_from_string(video_title, [category_df["tags"][i]])#find_tags(video_title, category_df)
        title_tags_dict[video_title] = video_tags
    return title_tags_dict

In [16]:
def find_similar_tag_video(title, func, num, interested_tags = None):
    """
    Given a video title, return similar num of video titles by tags.
    """
    category = predict_category_from_title(title)    
    category_df = category_group(category)
    
    if not interested_tags:
        interested_tags = find_tags(title, category_df)
    category_df = category_df[["title"]]
    title_tag_mapping = map_titles_to_tags(category)
    
    score = []
    for title, tags in title_tag_mapping.items():
        score.append(func(interested_tags, tags))
    category_df["tag_score"] = score
    
    category_df = category_df.loc[
        category_df['tag_score'] > 0].loc[
        category_df['tag_score'] != 1].loc[
        category_df['tag_score'] != len(interested_tags)].sort_values(by = ['tag_score'], ascending=False)
    return category_df.reset_index().drop(columns=['index'])[0:num]

In [17]:
title1 = "Calvin Harris, Dua Lipa - One Kiss (Lyric Video)"
print(title1)
similar_videos_tags = find_similar_tag_video(title1, get_jaccard_sim_list, 4)
display(similar_videos_tags)

print()

title2 = "Dua Lipa - IDGAF (Official Music Video)"
print(title2)
similar_videos_tags = find_similar_tag_video(title2, get_jaccard_sim_list, 10)
display(similar_videos_tags)

print()

title3 = "#ProudToCreate: Pride 2018"
print(title3)
similar_videos_tags = find_similar_tag_video(title3, get_jaccard_sim_list, 5)
display(similar_videos_tags)

Calvin Harris, Dua Lipa - One Kiss (Lyric Video)


Unnamed: 0,title,tag_score
0,"Calvin Harris, Dua Lipa - One Kiss (Official V...",0.52
1,"Calvin Harris, Dua Lipa - One Kiss (Behind the...",0.333333
2,"Calvin Harris, Dua Lipa - One Kiss (ZHU Remix)...",0.321429
3,Calvin Harris - Nuh Ready Nuh Ready (Official ...,0.258065



Dua Lipa - IDGAF (Official Music Video)


Unnamed: 0,title,tag_score
0,Azealia Banks - Anna Wintour,0.060606
1,Dua Lipa - New Rules (Live at The BRIT Awards ...,0.055556
2,Bruno Mars Wins Album Of The Year | Acceptance...,0.039216
3,Nina Nesbitt - Somebody Special (Official Video),0.038462
4,Bruno Mars Wins Record Of The Year | Acceptanc...,0.037736
5,"BURNS, Maluma, Rae Sremmurd - Hands On Me (Aud...",0.037037
6,Harry Styles - Kiwi (live in studio),0.037037
7,Harry Styles - Kiwi,0.037037
8,Backstreet Boys - Don't Go Breaking My Heart (...,0.037037
9,Grace VanderWaal - So Much More Than This (Beh...,0.037037



#ProudToCreate: Pride 2018


Unnamed: 0,title,tag_score
0,'Shoes Everywhere!' Out Of The Closet w/ Kimor...,0.073529
1,Top 10 Cringiest Moments from RuPaul's Drag Race,0.049383
2,Jessica Chastain Is Not Afraid to End Kenan Th...,0.028169
3,Queer Eye | Theme Song (All Things) Feat. Bett...,0.027027
4,Driver Disagrees With Jazz Over Raising Money ...,0.026667


## Part 3: Recommender System
This is the main function of recommender system that combine and call all functions above.

In [35]:
def make_recommendations(title, func, tag_func, num, tags = None):
    """
    This function takes a string, a similarity calculate function of your choice, number of 
    recommendation you want, and tags if exist, then output the desire video recommendation dataframe.
    """
    top_titles = find_similar_title(title, func, num)
    top_tags = find_similar_tag_video(title, tag_func, num, tags)
    
    category = predict_category_from_title(title)
    print("Predicted Category: " + str(category_map[str(category)]))
    category_df = category_group(category)
    
    top_titles = top_titles.drop(columns=['category_id'])
    interested_tags = find_tags(title, category_df)
    cleaned_title = preprocess_text(title)
    title_set = set(cleaned_title.split())
    
    title_supplement_score = []
    for t in top_titles["title"]:
        tags = find_tags(t, category_df)
        title_supplement_score.append(tag_func(interested_tags, tags))
    
    tags_supplement_score = []
    for t in top_tags["title"]:
        temp_t = set(preprocess_text(t).split())
        tags_supplement_score.append(func(title_set, temp_t))
    
    top_titles["tag_score"] = title_supplement_score
    top_tags["title_score"] = tags_supplement_score     
    
    combined_top = top_titles.append(top_tags, ignore_index = True)
    combined_top["Combined Score"] = np.array(combined_top["tag_score"]) +  np.array(combined_top["title_score"])
    
    combined_top = combined_top.sort_values(by = ['Combined Score'], ascending=False).reset_index().drop(columns=['index'])
    return combined_top[0:num]

## Part 4:Testing function examples
We will test our recommender system using trending videos today (12/09/2019)

Data from: https://www.youtube.com/channel/UCF0pVplsI8R5kcAqgtoRqoA

In [36]:
# from dataset
title = "#1 Fortnite World Rank - 2,330 Solo Wins!"
make_recommendations(title, get_gensim_sim, get_jaccard_sim_list, 5)

Predicted Category: Gaming


Unnamed: 0,tag_score,title,title_score,Combined Score
0,0.903226,"🔴 #1 World Ranked | 2,532 Solo Wins | Fortnite...",0.89121,1.794436
1,0.903226,"FREE VBUCKS TODAY | FORTNITE WORLD RECORD 2,75...",0.882726,1.785951
2,0.903226,"FORTNITE WORLD RECORD 2,732 SOLO WINS - New Fo...",0.866301,1.769527
3,0.666667,"#1 World Record 3,225 Solo Wins | Fortnite Liv...",0.880466,1.547133
4,0.666667,#1 WORLD RANKED 1463 SOLO WINS! - FORTNITE BAT...,0.874105,1.540772


In [37]:
# https://www.youtube.com/watch?v=Ufye3xSjcqM
# Category: Travel & Events
# Tags: #Singapore #buffet #luxury
# Title: BEST LUXURY BUFFET in Singapore!? Colony Buffet Review at Ritz Carlton

title = "BEST LUXURY BUFFET in Singapore!? Colony Buffet Review at Ritz Carlton"
tags = ["Singapore", "buffet", "luxury"]
make_recommendations(title, get_gensim_sim, get_jaccard_sim_list, 5, tags)

Predicted Category: Travel & Events


Unnamed: 0,tag_score,title,title_score,Combined Score
0,0.0,TAKE A LOOK AT THE SATANlC RlTUALS THAT OCCURR...,0.892067,0.892067
1,0.0,The world's darkest building is at the 2018 Ol...,0.872013,0.872013
2,0.0,LIVE : 129th Rose Parade in California - 2018 ...,0.86134,0.86134
3,0.0,Street Food in Ghana - GIANT CHOP-BAR LUNCH an...,0.857696,0.857696
4,0.0,"Indian STREET FOOD of YOUR DREAMS in KOLKATA, ...",0.854537,0.854537


In [39]:
# https://www.youtube.com/watch?v=Ufye3xSjcqM
# Category: Travel & Events
# Tags: #Singapore #buffet #luxury
# Title: BEST LUXURY BUFFET in Singapore!? Colony Buffet Review at Ritz Carlton

title = "BEST LUXURY BUFFET in Singapore!? Colony Buffet Review at Ritz Carlton"
tags = ["Singapore", "buffet", "luxury"]
make_recommendations(title, get_gensim_sim, get_gensim_sim_list, 5, tags)

Predicted Category: Travel & Events


Unnamed: 0,tag_score,title,title_score,Combined Score
0,1.0,TAKE A LOOK AT THE SATANlC RlTUALS THAT OCCURR...,0.892067,1.892067
1,1.0,LIVE : 129th Rose Parade in California - 2018 ...,0.86134,1.861339
2,1.0,Street Food in Ghana - GIANT CHOP-BAR LUNCH an...,0.857696,1.857696
3,1.0,"Indian STREET FOOD of YOUR DREAMS in KOLKATA, ...",0.854537,1.854537
4,0.78974,LUXURIOUS All You Can Eat BUFFET in Mumbai India!,0.806974,1.596714


In [42]:
# https://www.youtube.com/watch?v=ahZFCF--uRY
# Category: Entertainment
# Tags: #Ghostbusters #OfficialTrailer #Sony
# Title: GHOSTBUSTERS: AFTERLIFE - Official Trailer (HD)

title = "GHOSTBUSTERS: AFTERLIFE - Official Trailer (HD)"
tags = ["Sony",
        "Sony Pictures Entertainment",
        "Ghostbusters: Afterlife",
        "Ghostbusters 2020",
        "Ghostbusters: Afterlife Trailer",
        "Official Trailer",
        "Ghostbusters Official Trailer",
        "Jason Reitman",
        "Dan Aykroyd",
        "Harold Ramis",
        "Carrie Coon",
        "Finn Wolfhard",
        "Mckenna Grace",
        "Paul Rudd",
        "Ghostbusters 2020 Trailer",
        "Ghostbusters: Afterlife Sony",
        "Ghostbusters Afterlife 2020",
        "HD Trailer",
        "Trailer",
        "Trailer 2020"]

top_tags = find_similar_tag_video(title, get_cosine_sim_list, 4, tags)
print("top tags:")
display(top_tags)
make_recommendations(title, get_gensim_sim, get_cosine_sim_list, 5, tags)

top tags:


Unnamed: 0,title,tag_score
0,BLACK PANTHER Movie Clip Killmonger Vs T'Chall...,0.463048
1,JUSTICE LEAGUE Movie Clip Aquaman Island + Tra...,0.453878
2,BLACK DYNAMITE 2 Teaser Trailer #1 NEW (2018) ...,0.453878
3,WOODY WOODPECKER Trailer #1 NEW (2018) Live Ac...,0.453878


Predicted Category: Entertainment


Unnamed: 0,tag_score,title,title_score,Combined Score
0,0.463048,BLACK PANTHER Movie Clip Killmonger Vs T'Chall...,0.732653,1.1957
1,0.453878,JUSTICE LEAGUE Movie Clip Aquaman Island + Tra...,0.730853,1.184731
2,0.453878,WOODY WOODPECKER Trailer #1 NEW (2018) Live Ac...,0.713786,1.167664
3,0.453878,BLACK DYNAMITE 2 Teaser Trailer #1 NEW (2018) ...,0.698186,1.152064
4,0.418454,THE OUTLAW JOHNNY BLACK Trailer #1 NEW (2018) ...,0.623581,1.042036


In [23]:
# https://www.youtube.com/watch?v=oygrmJFKYZY
# Category: Music
# Title: Dua Lipa - Don't Start Now (Official Music Video)

title = "Dua Lipa - Don't Start Now (Official Music Video)"
tags = ["dua lipa",
        "dua",
        "new rules",
        "dua leepa",
        "idgaf",
        "scared to be lonely",
        "be the one",
        "dua lipa dont start now",
        "dua lipa dont start now video",
        "dua new video",
        "dua dont start now official video",
        "dua lipa dont start now official video",
        "dua dsn","dua lipa dont strat now",
        "dua new song",
        "dua new song 2019",
        "if you don't wanna see me dancing with somebody",
        "dont show up",
        "dont come out",
        "dont start caring about me now",
        "nabil",
        "nabil elderkin",
        "nabil dua lipa",
        "dua lipa nabil video",
        "new dua"]

top_tags3 = find_similar_tag_video(title, get_jaccard_sim_list, 4, tags)
print("top tags:")
display(top_tags3)
make_recommendations(title, get_jaccard_sim, get_jaccard_sim_list, 5, tags)

top tags:


Unnamed: 0,title,tag_score
0,Dua Lipa - Golden Slumbers,0.032258
1,"Dua Lipa - IDGAF ft. Charli XCX, Zara Larsson,...",0.026316
2,Dua Lipa - IDGAF (Official Music Video),0.020833


Predicted Category: Music


Unnamed: 0,tag_score,title,title_score,Combined Score
0,0.032258,Dua Lipa - Golden Slumbers,1.0,1.032258
1,0.026316,"Dua Lipa - IDGAF ft. Charli XCX, Zara Larsson,...",1.0,1.026316
2,0.020833,Dua Lipa - IDGAF (Official Music Video),1.0,1.020833
3,0.0,"#AboveTheNoise feat. Serena Williams, Neymar J...",0.0,0.0
4,0.0,Ozuna X Ele A El Dominio - Balenciaga ( Video...,0.0,0.0


In [24]:
# https://www.youtube.com/watch?v=oygrmJFKYZY
# Category: Music
# Title: Dua Lipa - Don't Start Now (Official Music Video)

title = "Dua Lipa - Don't Start Now (Official Music Video)"
tags = ["dua lipa",
        "dua",
        "new rules",
        "dua leepa",
        "idgaf",
        "scared to be lonely",
        "be the one",
        "dua lipa dont start now",
        "dua lipa dont start now video",
        "dua new video",
        "dua dont start now official video",
        "dua lipa dont start now official video",
        "dua dsn","dua lipa dont strat now",
        "dua new song",
        "dua new song 2019",
        "if you don't wanna see me dancing with somebody",
        "dont show up",
        "dont come out",
        "dont start caring about me now",
        "nabil",
        "nabil elderkin",
        "nabil dua lipa",
        "dua lipa nabil video",
        "new dua"]

top_tags2 = find_similar_tag_video(title, get_jaccard_sim_list, 4, tags)
print("top tags:")
display(top_tags2)
make_recommendations(title, get_gensim_sim, get_jaccard_sim_list, 5, tags)

top tags:


Unnamed: 0,title,tag_score
0,Dua Lipa - Golden Slumbers,0.032258
1,"Dua Lipa - IDGAF ft. Charli XCX, Zara Larsson,...",0.026316
2,Dua Lipa - IDGAF (Official Music Video),0.020833


Predicted Category: Music


Unnamed: 0,tag_score,title,title_score,Combined Score
0,0.020833,Dua Lipa - IDGAF (Official Music Video),0.881527,0.90236
1,0.0,"T.I. Opens Up On Kanye Meeting, Gucci Mane, Gu...",0.706863,0.706863
2,0.0,U2: Live in der Berliner U-Bahnlinie U2: „Get ...,0.680408,0.680408
3,0.0,Calvin Harris - Nuh Ready Nuh Ready (Official ...,0.67999,0.67999
4,0.0,K Money - Hurt You ft. Yung Tory (Official Video),0.674404,0.674404


In [45]:
# https://www.youtube.com/watch?v=oygrmJFKYZY
# Category: Music
# Title: Dua Lipa - Don't Start Now (Official Music Video)

title = "Dua Lipa - Don't Start Now (Official Music Video)"
tags = ["dua lipa",
        "dua",
        "new rules",
        "dua leepa",
        "idgaf",
        "scared to be lonely",
        "be the one",
        "dua lipa dont start now",
        "dua lipa dont start now video",
        "dua new video",
        "dua dont start now official video",
        "dua lipa dont start now official video",
        "dua dsn","dua lipa dont strat now",
        "dua new song",
        "dua new song 2019",
        "if you don't wanna see me dancing with somebody",
        "dont show up",
        "dont come out",
        "dont start caring about me now",
        "nabil",
        "nabil elderkin",
        "nabil dua lipa",
        "dua lipa nabil video",
        "new dua"]

top_tags2 = find_similar_tag_video(title, get_gensim_sim_list, 4, tags)
print("top tags:")
display(top_tags2)
make_recommendations(title, get_jaccard_sim, get_gensim_sim_list, 5, tags)

top tags:


Unnamed: 0,title,tag_score
0,Brockhampton Kicks Out Ameer Vann,0.338231
1,Lucas Lucco e Pabllo Vittar - Paraíso,0.31077
2,The Floppotron: Toto - Africa,0.288222
3,Chun Li (Music Video Teaser),0.257716


Predicted Category: Music


Unnamed: 0,tag_score,title,title_score,Combined Score
0,0.338231,Brockhampton Kicks Out Ameer Vann,1.0,1.338231
1,0.31077,Lucas Lucco e Pabllo Vittar - Paraíso,1.0,1.31077
2,0.288222,The Floppotron: Toto - Africa,1.0,1.288222
3,0.257716,Chun Li (Music Video Teaser),1.0,1.257716
4,0.257716,Nicki Minaj - Barbie Tingz (Music Video Teaser),1.0,1.257716


In [44]:
# https://www.youtube.com/watch?v=oygrmJFKYZY
# Category: Music
# Title: Dua Lipa - Don't Start Now (Official Music Video)

title = "Dua Lipa - Don't Start Now (Official Music Video)"
tags = ["dua lipa",
        "dua",
        "new rules",
        "dua leepa",
        "idgaf",
        "scared to be lonely",
        "be the one",
        "dua lipa dont start now",
        "dua lipa dont start now video",
        "dua new video",
        "dua dont start now official video",
        "dua lipa dont start now official video",
        "dua dsn","dua lipa dont strat now",
        "dua new song",
        "dua new song 2019",
        "if you don't wanna see me dancing with somebody",
        "dont show up",
        "dont come out",
        "dont start caring about me now",
        "nabil",
        "nabil elderkin",
        "nabil dua lipa",
        "dua lipa nabil video",
        "new dua"]

top_tags2 = find_similar_tag_video(title, get_gensim_sim_list, 4, tags)
print("top tags:")
display(top_tags2)
make_recommendations(title, get_gensim_sim, get_gensim_sim_list, 5, tags)

top tags:


Unnamed: 0,title,tag_score
0,Brockhampton Kicks Out Ameer Vann,0.338231
1,Lucas Lucco e Pabllo Vittar - Paraíso,0.31077
2,The Floppotron: Toto - Africa,0.288222
3,Chun Li (Music Video Teaser),0.257716


Predicted Category: Music


Unnamed: 0,tag_score,title,title_score,Combined Score
0,1.0,"T.I. Opens Up On Kanye Meeting, Gucci Mane, Gu...",0.706863,1.706863
1,1.0,U2: Live in der Berliner U-Bahnlinie U2: „Get ...,0.680408,1.680408
2,1.0,Calvin Harris - Nuh Ready Nuh Ready (Official ...,0.67999,1.67999
3,1.0,K Money - Hurt You ft. Yung Tory (Official Video),0.674404,1.674404
4,0.257716,Chun Li (Music Video Teaser),0.68849,0.946206


# Testing Live

In [None]:
import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup
import ssl
import json
import ast
import json
import os
from urllib.request import Request, urlopen

In [46]:
# For ignoring SSL certificate errors

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Input from user

url = input('Enter Youtube Video Url- ')

# Making the website believe that you are accessing it using a mozilla browser
def fetch_data(link = url) : 
    
    try :
    
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        webpage = urlopen(req).read()

        # Creating a BeautifulSoup object of the html page for easy extraction of data.

        soup = BeautifulSoup(webpage, 'html.parser')
        html = soup.prettify('utf-8')
        video_details = {}
        other_details = {}

        for span in soup.findAll('span',attrs={'class': 'watch-title'}):
            video_details['TITLE'] = span.text.strip()

        for script in soup.findAll('script',attrs={'type': 'application/ld+json'}):
                channelDesctiption = json.loads(script.text.strip())
                video_details['CHANNEL_NAME'] = channelDesctiption['itemListElement'][0]['item']['name']

        for div in soup.findAll('div',attrs={'class': 'watch-view-count'}):
            video_details['NUMBER_OF_VIEWS'] = div.text.strip()

        for button in soup.findAll('button',attrs={'title': 'I like this'}):
            video_details['LIKES'] = button.text.strip()

        for button in soup.findAll('button',attrs={'title': 'I dislike this'}):
            video_details['DISLIKES'] = button.text.strip()

        for span in soup.findAll('span',attrs={'class': 'yt-subscription-button-subscriber-count-branded-horizontal yt-subscriber-count'}):
            video_details['NUMBER_OF_SUBSCRIPTIONS'] = span.text.strip()

        hashtags = []
        for span in soup.findAll('span',attrs={'class': 'standalone-collection-badge-renderer-text'}):
            for a in span.findAll('a',attrs={'class': 'yt-uix-sessionlink'}):
                temp = a.text.strip()
                hashtags.append(temp[1:])
        video_details['HASH_TAGS'] = hashtags

        with open('output_file.html', 'wb') as file:
            file.write(html)

        with open('data.json', 'w', encoding='utf8') as outfile:
            json.dump(video_details, outfile, ensure_ascii=False,indent=4)
            
        if len(video_details) == 0:
            return "NULL"
        else:
            return video_details
    except:

        return 'Not a valid website url'
    
data = fetch_data()
print(data)
make_recommendations(data["TITLE"], get_gensim_sim,get_gensim_sim_list, 5, data["HASH_TAGS"])

Enter Youtube Video Url- https://www.youtube.com/watch?v=GQrIiqPQ-KY
{'TITLE': 'The Chainsmokers, Illenium - Takeaway (Lyrics) ft. Lennon Stella', 'CHANNEL_NAME': 'Unique Vibes', 'NUMBER_OF_VIEWS': '13,169,055 views', 'LIKES': '146,875', 'DISLIKES': '2,155', 'NUMBER_OF_SUBSCRIPTIONS': '2.16M', 'HASH_TAGS': ['thechainsmokers', 'takeaway', 'illenium']}
Predicted Category: Music


Unnamed: 0,tag_score,title,title_score,Combined Score
0,1.0,Calvin Harris - Nuh Ready Nuh Ready (Official ...,0.676398,1.676398
1,1.0,Jayy Brown - It's Okay (Ft. LB) Official Video,0.669022,1.669022
2,1.0,"Mike WiLL Made-It, Rae Sremmurd, Big Sean - Ar...",0.661302,1.661302
3,1.0,"DJ Khaled - Top Off Trailer ft. JAY Z, Future,...",0.653624,1.653624
4,1.0,Chris Brown - On Purpose (Audio) ft. AGNEZ MO,0.651638,1.651638


In [47]:
title = "The Chainsmokers, Illenium - Takeaway (Lyrics) ft. Lennon Stella"

tags = ["Unique Vibes","Music","pop","pop music","the chainsmokers","takeaway","the chainsmokers takeaway","the chainsmokers takeaway lyrics","the chainsmokers illenium","the chainsmokers illenium takeaway","takeaway lyrics","takeaway the chainsmokers","lyrics takeaway","lyrics the chainsmokers takeaway","the chainsmokers lyrics","the chainsmokers lennon stella","takeaway lyrics the chainsmokers","takeaway the chainsmokers lyrics","chainsmokers takeaway","chainsmokers takeaway lyrics","chainsmokers illenium"]

top_tags3 = find_similar_tag_video(title, get_gensim_sim_list, 4, tags)
print("top tags:")
display(top_tags3)
make_recommendations(title, get_gensim_sim, get_gensim_sim_list, 5, tags)

top tags:


Unnamed: 0,title,tag_score
0,This is Me [Dave Aude Remix] (from The Greates...,0.597538
1,Robin Banks FT Dolo Brassi - Bazerk (Offici...,0.597538
2,The Greatest Showman - Never Enough [Official ...,0.597538
3,SZA: “Supermodel” Explained | The Process | GQ,0.597538


Predicted Category: Music


Unnamed: 0,tag_score,title,title_score,Combined Score
0,1.0,Calvin Harris - Nuh Ready Nuh Ready (Official ...,0.676398,1.676398
1,1.0,Jayy Brown - It's Okay (Ft. LB) Official Video,0.669022,1.669022
2,1.0,"Mike WiLL Made-It, Rae Sremmurd, Big Sean - Ar...",0.661302,1.661302
3,1.0,"DJ Khaled - Top Off Trailer ft. JAY Z, Future,...",0.653624,1.653624
4,1.0,Chris Brown - On Purpose (Audio) ft. AGNEZ MO,0.651638,1.651638
