# Exploration Notebook v0 - Jacopo

To avoid cluttering the main notebook, I'm doing some exploration here. It may get messy as a whole, but each step should be easy to follow.

This is version zero, probably the first of several exploration notebooks. The version numbers should help keep track of them, and I suffix each name with `-Jacopo` to avoid confusion with the main notebook.

## Folder Structure
- `./`: notebooks, README, `.gitignore` and the `data/` folder live in the root folder
- `data/`: holds all data. It is ignored by git because it will eventually grow quite large. Subfolders include `out/` for results and `data-ssd/` for the YouNiverse dataset, symlinked to an SSD



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%reload_ext autoreload

In [8]:
data_path = "data/data-ssd/full/" 

# First, create a randomized sample of the bigger datasets so they are easier to work with.
# skip_rows keeps row 0 and drops each later row with probability 0.8 (a pd.read_csv-style `skiprows` callable).
# It is not used below: the full feather helper file is loaded instead.
import random
skip_rows = lambda i: i > 0 and random.random() < 0.8
#df_metadata = pd.read_json(data_path + "yt_metadata_en.jsonl.gz", lines=True, compression="gzip", encoding="utf-8", nrows=100000)
df_metadata = pd.read_feather(path=data_path + "yt_metadata_helper.feather", use_threads=True)
df_metadata.head()



Unnamed: 0,categories,channel_id,dislike_count,display_id,duration,like_count,upload_date,view_count
0,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,1.0,SBqSc91Hn9g,1159,8.0,2016-09-28,1057.0
1,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,1.0,UuugEl86ESY,2681,23.0,2016-09-28,12894.0
2,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,779.0,oB4c-yvnbjs,1394,1607.0,2016-09-28,1800602.0
3,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,24.0,ZaV-gTCMV8E,5064,227.0,2016-09-28,57640.0
4,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,13.0,cGvL7AvMfM0,3554,105.0,2016-09-28,86368.0
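
The cell above mentions a randomized sample, but `nrows=100000` would only take the first rows of the file and `skip_rows` is never applied. One way to draw a uniform sample from a line-based file in a single pass is reservoir sampling. The sketch below is an illustration under assumptions, not what this notebook ran: the helper name `reservoir_sample` and the seed are made up here.

```python
import random
from typing import Iterable, List


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Return k lines chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            sample.append(line)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


# Toy demonstration on in-memory "lines" instead of the real file.
toy_lines = [f"line {n}" for n in range(1000)]
picked = reservoir_sample(toy_lines, k=10)
print(len(picked))
```

For the real data, the iterable could be the handle returned by `gzip.open(path, "rt")`, with each sampled line parsed by `json.loads` before building a DataFrame. Only the `k` sampled lines are held in memory at any time.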


In [9]:
df_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 8 columns):
 #   Column         Dtype         
---  ------         -----         
 0   categories     object        
 1   channel_id     object        
 2   dislike_count  float64       
 3   display_id     object        
 4   duration       int64         
 5   like_count     float64       
 6   upload_date    datetime64[ns]
 7   view_count     float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(3)
memory usage: 4.3+ GB


In [13]:
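# NOTE: .notna() returns a boolean mask, so each cast below stores a 0/1 "value present" flag
# rather than the original number; the all-ones columns in the later head() output show the effect.
# To shrink the dtypes while keeping the values, cast the column itself (after handling NaN) instead.
df_metadata['dislike_count'] = df_metadata['dislike_count'].notna().astype('int16')
df_metadata['like_count'] = df_metadata['like_count'].notna().astype('int16')
df_metadata['view_count'] = df_metadata['view_count'].notna().astype('int32')
df_metadata['duration'] = df_metadata['duration'].notna().astype('int16')
df_metadata.info()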

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 8 columns):
 #   Column         Dtype         
---  ------         -----         
 0   categories     object        
 1   channel_id     object        
 2   dislike_count  int16         
 3   display_id     object        
 4   duration       int16         
 5   like_count     int16         
 6   upload_date    datetime64[ns]
 7   view_count     int32         
dtypes: datetime64[ns](1), int16(3), int32(1), object(3)
memory usage: 2.9+ GB
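
The `.notna()` casts above save memory but replace every count with a presence flag. A possible alternative, sketched here under assumptions, is to pick the smallest pandas nullable integer dtype that actually holds the column's range, so that missing values stay missing. The helper name `downcast_count` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Candidate nullable integer dtypes, smallest first, with NumPy types for range checks.
_CANDIDATES = [("Int16", np.int16), ("Int32", np.int32), ("Int64", np.int64)]


def downcast_count(col: pd.Series) -> pd.Series:
    """Cast a float count column to the smallest nullable integer dtype that fits."""
    lo, hi = col.min(), col.max()  # NaN is skipped by default
    for name, np_type in _CANDIDATES:
        info = np.iinfo(np_type)
        if pd.isna(lo) or (lo >= info.min and hi <= info.max):
            return col.astype(name)  # NaN becomes <NA>, values are preserved
    raise ValueError("values do not fit in a 64-bit integer")


# Toy data modeled on the head() output shown earlier, plus one missing value.
counts = pd.Series([1.0, 779.0, np.nan, 1800602.0])
small = downcast_count(counts)
print(small.dtype)
```

Nullable dtypes store an extra one-byte mask per value, so the savings are slightly smaller than with plain NumPy integers, but no count is lost or wrapped around.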


In [15]:
# save pickle
import pickle
with open(data_path + "yt_metadata_helper.pkl", "wb") as f:
    pickle.dump(df_metadata, f)

In [17]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_metadata['categories'] = le.fit_transform(df_metadata['categories'])
df_metadata.head()

Unnamed: 0,categories,channel_id,dislike_count,display_id,duration,like_count,upload_date,view_count
0,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,SBqSc91Hn9g,1,1,2016-09-28,1
1,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,UuugEl86ESY,1,1,2016-09-28,1
2,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,oB4c-yvnbjs,1,1,2016-09-28,1
3,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,ZaV-gTCMV8E,1,1,2016-09-28,1
4,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,cGvL7AvMfM0,1,1,2016-09-28,1


In [21]:
print(df_metadata['categories'].value_counts())
print(le.classes_)

6     13720303
4     12276397
10     8881022
9      8305003
12     6910666
16     4354412
7      3968127
3      3795564
14     2403004
5      2359736
1      2256967
2      1172503
17     1096565
11      777449
13      645508
0         1522
15          41
8            5
Name: categories, dtype: int64
['' 'Autos & Vehicles' 'Comedy' 'Education' 'Entertainment'
 'Film & Animation' 'Gaming' 'Howto & Style' 'Movies' 'Music'
 'News & Politics' 'Nonprofits & Activism' 'People & Blogs'
 'Pets & Animals' 'Science & Technology' 'Shows' 'Sports'
 'Travel & Events']
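
As a side note to the label encoding above, pandas' own `category` dtype gives a similar integer encoding while keeping the labels attached to the column. This is a sketch of an alternative on toy data, not the approach used in this notebook; like `LabelEncoder`, it orders the string labels lexicographically, so the empty string gets code 0.

```python
import pandas as pd

# Toy category column, including the empty-string category seen in le.classes_.
cats = pd.Series(["Film & Animation", "Gaming", "", "Gaming"])

encoded = cats.astype("category")
codes = encoded.cat.codes               # small integer codes (int8 for few categories)
labels = list(encoded.cat.categories)   # sorted unique labels

# Decoding is a lookup from code to label.
decoded = [labels[c] for c in codes]
print(list(codes), labels)
```

With this dtype the mapping does not need to be pickled separately, since it travels with the column.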


In [22]:
df_metadata['categories'] = df_metadata['categories'].astype('int8')
df_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 8 columns):
 #   Column         Dtype         
---  ------         -----         
 0   categories     int8          
 1   channel_id     object        
 2   dislike_count  int16         
 3   display_id     object        
 4   duration       int16         
 5   like_count     int16         
 6   upload_date    datetime64[ns]
 7   view_count     int32         
dtypes: datetime64[ns](1), int16(3), int32(1), int8(1), object(2)
memory usage: 2.4+ GB
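
The three `memory usage` figures reported by `info()` can be reproduced by hand: rows times bytes per row, where an `object` column counts only its 8-byte pointer (hence the `+` in the output). The sketch below redoes that arithmetic; the helper names `human_size` and `frame_bytes` are made up for illustration.

```python
ROWS = 72_924_794  # number of entries reported by df_metadata.info()

# Bytes per value for each dtype; object columns count only the pointer.
WIDTH = {"object": 8, "datetime64[ns]": 8, "float64": 8,
         "int64": 8, "int32": 4, "int16": 2, "int8": 1}


def human_size(num_bytes: float) -> str:
    """Format a byte count with 1024-based units and one decimal."""
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if num_bytes < 1024:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.1f} PB"


def frame_bytes(dtypes: dict) -> int:
    """Shallow size of a frame given a {dtype: column count} mapping."""
    return ROWS * sum(WIDTH[d] * n for d, n in dtypes.items())


original = frame_bytes({"object": 3, "datetime64[ns]": 1, "float64": 3, "int64": 1})
downcast = frame_bytes({"object": 3, "datetime64[ns]": 1, "int16": 3, "int32": 1})
encoded = frame_bytes({"object": 2, "datetime64[ns]": 1, "int16": 3, "int32": 1, "int8": 1})

print(human_size(original), human_size(downcast), human_size(encoded))
```

The remaining cost is dominated by the two `object` columns, whose string contents are not included in these numbers.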


In [23]:
# save pickle
import pickle
with open(data_path + "yt_metadata_helper.pkl", "wb") as f:
    pickle.dump(df_metadata, f)

# save categories
with open(data_path + "categories.pkl", "wb") as f:
    pickle.dump(le.classes_, f)

In [24]:
df_metadata.head()

Unnamed: 0,categories,channel_id,dislike_count,display_id,duration,like_count,upload_date,view_count
0,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,SBqSc91Hn9g,1,1,2016-09-28,1
1,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,UuugEl86ESY,1,1,2016-09-28,1
2,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,oB4c-yvnbjs,1,1,2016-09-28,1
3,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,ZaV-gTCMV8E,1,1,2016-09-28,1
4,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,cGvL7AvMfM0,1,1,2016-09-28,1


In [25]:
df_metadata['upload_date'].dt.year.value_counts()

2018    15275859
2019    12723124
2017    12486407
2016     9352771
2015     6808073
2014     5179935
2013     4018279
2012     2926436
2011     1874984
2010     1085424
2009      694586
2008      338040
2007      137250
2006       23294
2005         332
Name: upload_date, dtype: int64

In [7]:
data_path = "data/data-ssd/full/" 
# let's import the tags from the full dataset in order to encode them
# (the id column is called display_id, so that is the key to use in dtype)
df_tags = pd.read_json(data_path + "yt_metadata_en.jsonl.gz", lines=True, compression="gzip", encoding="utf-8",
                       dtype={'display_id': 'str', 'tags': 'str'}, nrows=1000000, convert_dates=['upload_date'])
df_tags.head()

Unnamed: 0,categories,channel_id,crawl_date,description,dislike_count,display_id,duration,like_count,tags,title,upload_date,view_count
0,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:26.270363,Lego City Police Lego Firetruck Cartoons about...,1.0,SBqSc91Hn9g,1159,8.0,"lego city,lego police,lego city police,lego ci...",Lego City Police Lego Firetruck Cartoons about...,2016-09-28,1057.0
1,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:26.914516,Lego Marvel SuperHeroes Lego Hulk Smash Iron-M...,1.0,UuugEl86ESY,2681,23.0,"Lego superheroes,lego hulk,hulk smash,lego mar...",Lego Marvel SuperHeroes Lego Hulk Smash Iron-M...,2016-09-28,12894.0
2,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:26.531203,Lego City Police Lego Fireman Cartoons about L...,779.0,oB4c-yvnbjs,1394,1607.0,"lego city,lego police,lego city police,lego fi...",Lego City Police Lego Fireman Cartoons about L...,2016-09-28,1800602.0
3,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:28.335329,Lego Harry Potter Complete Lego New Movie for ...,24.0,ZaV-gTCMV8E,5064,227.0,"Lego harry potter,new harry potter,harry potte...",Lego Harry Potter Complete Lego New Movie for ...,2016-09-28,57640.0
4,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:30.328487,Lego City Police LONG VIDEO for kids Lego Fire...,13.0,cGvL7AvMfM0,3554,105.0,"lego city,lego police,lego city police,lego fi...",Lego City Police 1 HOUR LONG VIDEO for kids Le...,2016-09-28,86368.0


In [8]:
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 12 columns):
 #   Column         Non-Null Count    Dtype         
---  ------         --------------    -----         
 0   categories     1000000 non-null  object        
 1   channel_id     1000000 non-null  object        
 2   crawl_date     1000000 non-null  object        
 3   description    1000000 non-null  object        
 4   dislike_count  980320 non-null   float64       
 5   display_id     1000000 non-null  object        
 6   duration       1000000 non-null  int64         
 7   like_count     980320 non-null   float64       
 8   tags           1000000 non-null  object        
 9   title          1000000 non-null  object        
 10  upload_date    1000000 non-null  datetime64[ns]
 11  view_count     999999 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(7)
memory usage: 91.6+ MB


In [2]:
# Load data
data_path = "data/data-ssd/full/" 

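dfs = []
# Read the full file in chunks of 1M rows, keeping only display_id, duration, tags and title
for chunk in pd.read_json(data_path+'yt_metadata_en.jsonl.gz', compression="infer", chunksize=1000000, lines=True):
    chunk.drop(columns=['categories', 'channel_id', 'crawl_date', 'description', 'dislike_count', 'like_count', 'view_count', 'upload_date'], inplace=True)

    dfs.append(chunk)
# ignore_index gives the default RangeIndex that to_feather requires
df_tags = pd.concat(dfs, ignore_index=True)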

df_tags.to_feather(data_path+'yt_metadata_tags.feather')

In [3]:
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 4 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   display_id  object
 1   duration    int64 
 2   tags        object
 3   title       object
dtypes: int64(1), object(3)
memory usage: 2.2+ GB
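
The chunked load above drops the eight unwanted columns from every chunk. The same idea can be written as a small reusable helper that selects the wanted columns instead, which is easier to adapt when only `tags` is needed. This is a sketch on in-memory toy data; the helper name `read_jsonl_columns` and the sample records are assumptions.

```python
import io
import pandas as pd


def read_jsonl_columns(source, columns, chunksize):
    """Read a JSON-lines source chunk by chunk, keeping only the given columns."""
    kept = []
    for chunk in pd.read_json(source, lines=True, chunksize=chunksize):
        kept.append(chunk[columns])  # discard everything else right away
    return pd.concat(kept, ignore_index=True)


# Toy stand-in for yt_metadata_en.jsonl.gz with made-up records.
toy_jsonl = io.StringIO(
    '{"display_id": "abc", "tags": "lego city,lego police", "title": "t1", "duration": 10}\n'
    '{"display_id": "def", "tags": "hulk smash", "title": "t2", "duration": 20}\n'
    '{"display_id": "ghi", "tags": "", "title": "t3", "duration": 30}\n'
)
tags_only = read_jsonl_columns(toy_jsonl, ["display_id", "tags"], chunksize=2)
print(tags_only.shape)
```

Each chunk is still parsed in full before the selection, so this lowers peak memory but not parsing time.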


In [None]:
#df_tags.drop(columns=['display_id', 'duration'], inplace=True) # turns out this is more expensive than useful....
#df_tags.info()

In [2]:
data_path = "data/data-ssd/full/" 
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True)
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 4 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   display_id  object
 1   duration    int64 
 2   tags        object
 3   title       object
dtypes: int64(1), object(3)
memory usage: 2.2+ GB


In [3]:
df_tags = df_tags['tags']
df_tags.info()

<class 'pandas.core.series.Series'>
RangeIndex: 72924794 entries, 0 to 72924793
Series name: tags
Non-Null Count     Dtype 
--------------     ----- 
72924794 non-null  object
dtypes: object(1)
memory usage: 556.4+ MB


In [5]:
df_tags.to_frame().to_feather(data_path+'yt_metadata_tags.feather')

In [6]:
df_tags.head()

0    lego city,lego police,lego city police,lego ci...
1    Lego superheroes,lego hulk,hulk smash,lego mar...
2    lego city,lego police,lego city police,lego fi...
3    Lego harry potter,new harry potter,harry potte...
4    lego city,lego police,lego city police,lego fi...
Name: tags, dtype: object

## Vectorization of tags

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
# gensim
import gensim
from gensim.models import Word2Vec, KeyedVectors, FastText, doc2vec
import gensim.downloader as api
from gensim.models.phrases import Phrases
# nltk
# import nltk
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize



In [14]:
for model_name, model_data in sorted(api.info()['models'].items()):
    print("{}: {}".format(model_name, model_data['description'][:40] + "..."))

__testing_word2vec-matrix-synopsis: [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300: ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300: 1 million word vectors trained on Wikipe...
glove-twitter-100: Pre-trained vectors based on  2B tweets,...
glove-twitter-200: Pre-trained vectors based on 2B tweets, ...
glove-twitter-25: Pre-trained vectors based on 2B tweets, ...
glove-twitter-50: Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100: Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200: Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300: Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50: Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300: Pre-trained vectors trained on a part of...
word2vec-ruscorpora-300: Word2vec Continuous Skipgram vectors tra...


In [2]:
glove = api.load("glove-wiki-gigaword-50")
print(glove.most_similar("cat"))

[('dog', 0.9218006134033203), ('rabbit', 0.8487821817398071), ('monkey', 0.8041081428527832), ('rat', 0.7891963720321655), ('cats', 0.7865270376205444), ('snake', 0.7798910737037659), ('dogs', 0.7795815467834473), ('pet', 0.7792249917984009), ('mouse', 0.7731667757034302), ('bite', 0.7728800177574158)]


In [16]:
print(glove.most_similar("youtube"))

[('myspace', 0.8685212135314941), ('uploaded', 0.8573760986328125), ('facebook', 0.8540773391723633), ('twitter', 0.8430657982826233), ('videos', 0.8099467158317566), ('video', 0.7907883524894714), ('downloaded', 0.7642441391944885), ('blog', 0.7627186179161072), ('download', 0.7616114020347595), ('downloads', 0.7580808401107788)]


In [17]:
print(glove.most_similar("google"))

[('yahoo', 0.8942785263061523), ('aol', 0.852712869644165), ('microsoft', 0.8450709581375122), ('internet', 0.8179759979248047), ('web', 0.8175380229949951), ('facebook', 0.8087005615234375), ('ebay', 0.7930072546005249), ('netscape', 0.7912958860397339), ('online', 0.7908353805541992), ('software', 0.7816097140312195)]


In [18]:
print(glove.most_similar("lego"))

[('mindstorms', 0.828197181224823), ('bionicle', 0.7207260131835938), ('jigsaw', 0.6471174955368042), ('dolls', 0.6425037980079651), ('technic', 0.6417526006698608), ('namco', 0.6410648226737976), ('diecast', 0.638368546962738), ('arcade', 0.6373211145401001), ('toy', 0.6369045376777649), ('playmobil', 0.6331797242164612)]


In [19]:
print(glove.most_similar("apple"))

[('blackberry', 0.7543067932128906), ('chips', 0.7438643574714661), ('iphone', 0.7429664134979248), ('microsoft', 0.7334205508232117), ('ipad', 0.7331036925315857), ('pc', 0.7217226624488831), ('ipod', 0.7199784517288208), ('intel', 0.7192243337631226), ('ibm', 0.7146540880203247), ('software', 0.7093585729598999)]


In [20]:
print(glove.most_similar_to_given("cat", ["dog", "mouse", "lego", "apple", "youtube", "google"]))

dog
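
The scores printed by `most_similar` are cosine similarities, and `most_similar_to_given` returns the candidate with the highest one. The sketch below reimplements that selection on three-dimensional toy vectors to make the mechanics explicit. The vectors are invented for illustration and are not real GloVe values.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest(target: str, candidates: List[str], vectors: Dict[str, Sequence[float]]) -> str:
    """Return the candidate whose vector is most similar to the target's."""
    return max(candidates, key=lambda word: cosine(vectors[target], vectors[word]))


# Made-up vectors: "cat" and "dog" point in nearly the same direction.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "lego": [0.1, 0.2, 0.9],
}
best = closest("cat", ["dog", "lego"], toy_vectors)
print(best)
```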


In [35]:
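# first step towards fine-tuning the model with our tags:
# split each comma-separated tag string into a list and inspect the first one
df_tags.head().apply(lambda x: x.split(","))[0]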



['lego city',
 'lego police',
 'lego city police',
 'lego city episodes',
 'videos de lego city',
 'lego policia',
 'lego bomberos',
 'lego fire truck',
 'lego firetruck',
 'lego police chase',
 'lego robbers',
 'lego cartoons',
 'lego movies',
 'lego videos for kids']
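
Splitting all 73M tag strings at once materializes one Python list per row. A lazier option, sketched here under assumptions, is a generator that yields one tag list at a time, so downstream steps such as counting can stream over the data. The helper names `iter_tag_lists` and `count_tags` are hypothetical, and stripping whitespace and dropping empty tags is a choice made for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def iter_tag_lists(tag_strings: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of tags per comma-separated string, skipping empty tags."""
    for raw in tag_strings:
        yield [tag.strip() for tag in raw.split(",") if tag.strip()]


def count_tags(tag_strings: Iterable[str]) -> Counter:
    """Count tag occurrences without keeping all the split lists in memory."""
    counts: Counter = Counter()
    for tags in iter_tag_lists(tag_strings):
        counts.update(tags)
    return counts


# Toy rows modeled on the tags shown above; the last row has no tags.
toy_rows = ["lego city,lego police,lego city police", "lego city, lego fire truck", ""]
top = count_tags(toy_rows).most_common(1)
print(top)
```

A pandas Series can be passed directly as `tag_strings`, since iterating over it yields the stored strings.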

In [3]:
#df_tags = df_tags.apply(lambda x: x.split(",")) # disabled: everything takes forever, ~30 min for this plus all the memory
df_tags.head()

NameError: name 'df_tags' is not defined

In [6]:
# try modin as a parallel drop-in replacement for pandas (note: this rebinds `pd`)
import modin.pandas as pd

In [7]:
# load with modin
data_path = "data/data-ssd/full/"
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True)

KilledWorker: ('parse-c1a3ed1b9b8f3ff7b997c3767e86224f', <WorkerState 'tcp://127.0.0.1:56296', name: 2, status: closed, memory: 0, processing: 1>)

In [None]:
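# NOTE: api.load() returns a KeyedVectors object, which has no build_vocab()/train(), so these calls fail on `glove`.
# Continuing training needs a full Word2Vec model and a corpus of token lists (e.g. the split tags), not raw strings.
glove.build_vocab(df_tags, update=True)
glove.train(df_tags, total_examples=glove.corpus_count, epochs=glove.epochs)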