# Exploration Notebook v0 - Jacopo

To avoid cluttering the main notebook, I'm doing some exploration here. It may get messy as a whole, but each step should be easy to follow.

This will probably be the first (zero) version of several notebooks. The version names should help with keeping track, and I will suffix them with -Jacopo to avoid confusion with the main notebook.

## Folder Structure
- `./`: notebooks, README, .gitignore and the `data/` folder live in the root folder
- `data/`: all data lives here and is ignored by git. It will eventually hold quite a lot, in subfolders such as `out/` for results and `data-ssd/` for the YouNiverse dataset, symlinked to an SSD



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%reload_ext autoreload

In [8]:
data_path = "data/data-ssd/full/" 

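# First idea: draw a random sample of the bigger datasets so they are easier to work with.
# skip_rows is currently unused; passed as skiprows= to pd.read_csv it would keep the header and skip about 80% of the rows.
import random
skip_rows = lambda i: i>0 and random.random() < 0.8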
#df_metadata = pd.read_json(data_path + "yt_metadata_en.jsonl.gz", lines=True, compression="gzip", encoding="utf-8", nrows=100000)
df_metadata = pd.read_feather(path= data_path + "yt_metadata_helper.feather", use_threads=True)
df_metadata.head()



Unnamed: 0,categories,channel_id,dislike_count,display_id,duration,like_count,upload_date,view_count
0,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,1.0,SBqSc91Hn9g,1159,8.0,2016-09-28,1057.0
1,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,1.0,UuugEl86ESY,2681,23.0,2016-09-28,12894.0
2,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,779.0,oB4c-yvnbjs,1394,1607.0,2016-09-28,1800602.0
3,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,24.0,ZaV-gTCMV8E,5064,227.0,2016-09-28,57640.0
4,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,13.0,cGvL7AvMfM0,3554,105.0,2016-09-28,86368.0


In [9]:
df_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 8 columns):
 #   Column         Dtype         
---  ------         -----         
 0   categories     object        
 1   channel_id     object        
 2   dislike_count  float64       
 3   display_id     object        
 4   duration       int64         
 5   like_count     float64       
 6   upload_date    datetime64[ns]
 7   view_count     float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(3)
memory usage: 4.3+ GB


In [13]:
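# NOTE: .notna() returns a boolean mask, so each column below ends up holding
# 0/1 "value present" flags rather than the original counts (see the head() output further down).
df_metadata['dislike_count'] = df_metadata['dislike_count'].notna().astype('int16')
df_metadata['like_count'] = df_metadata['like_count'].notna().astype('int16')
df_metadata['view_count'] = df_metadata['view_count'].notna().astype('int32')
df_metadata['duration'] = df_metadata['duration'].notna().astype('int16')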
df_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 8 columns):
 #   Column         Dtype         
---  ------         -----         
 0   categories     object        
 1   channel_id     object        
 2   dislike_count  int16         
 3   display_id     object        
 4   duration       int16         
 5   like_count     int16         
 6   upload_date    datetime64[ns]
 7   view_count     int32         
dtypes: datetime64[ns](1), int16(3), int32(1), object(3)
memory usage: 2.9+ GB
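
If the goal of the cell above is to shrink the numeric columns while keeping their values, one option is to cast directly instead of going through `.notna()`. The following is a minimal sketch on a toy frame, not the notebook's method: the nullable `Int32`/`Int64` dtypes and the use of `pd.to_numeric(..., downcast="integer")` are my assumptions about what would fit, and the toy values are copied from the `head()` output further up.

```python
import pandas as pd

# Toy stand-in for df_metadata (hypothetical, three rows only).
toy = pd.DataFrame({
    "dislike_count": [1.0, 779.0, None],
    "like_count": [8.0, 1607.0, None],
    "view_count": [1057.0, 1800602.0, 12894.0],
    "duration": [1159, 1394, 2681],
})

# Nullable integer dtypes keep missing values as <NA> instead of forcing a fill.
for col in ["dislike_count", "like_count"]:
    toy[col] = toy[col].astype("Int32")
# View counts can be very large, so a 64-bit type is the cautious choice here.
toy["view_count"] = toy["view_count"].astype("Int64")
# pd.to_numeric picks the smallest integer dtype that holds every value.
toy["duration"] = pd.to_numeric(toy["duration"], downcast="integer")

print(toy.dtypes)
print(toy)
```

The nullable types cost one extra mask byte per value, so the saving is smaller than with plain `int16`, but the counts survive.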


In [15]:
# save pickle
import pickle
with open(data_path + "yt_metadata_helper.pkl", "wb") as f:
    pickle.dump(df_metadata, f)

In [17]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_metadata['categories'] = le.fit_transform(df_metadata['categories'])
df_metadata.head()

Unnamed: 0,categories,channel_id,dislike_count,display_id,duration,like_count,upload_date,view_count
0,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,SBqSc91Hn9g,1,1,2016-09-28,1
1,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,UuugEl86ESY,1,1,2016-09-28,1
2,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,oB4c-yvnbjs,1,1,2016-09-28,1
3,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,ZaV-gTCMV8E,1,1,2016-09-28,1
4,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,cGvL7AvMfM0,1,1,2016-09-28,1


In [21]:
print(df_metadata['categories'].value_counts())
print(le.classes_)

6     13720303
4     12276397
10     8881022
9      8305003
12     6910666
16     4354412
7      3968127
3      3795564
14     2403004
5      2359736
1      2256967
2      1172503
17     1096565
11      777449
13      645508
0         1522
15          41
8            5
Name: categories, dtype: int64
['' 'Autos & Vehicles' 'Comedy' 'Education' 'Entertainment'
 'Film & Animation' 'Gaming' 'Howto & Style' 'Movies' 'Music'
 'News & Politics' 'Nonprofits & Activism' 'People & Blogs'
 'Pets & Animals' 'Science & Technology' 'Shows' 'Sports'
 'Travel & Events']
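
As an alternative to `LabelEncoder`, pandas' own categorical dtype gives the same kind of integer codes and keeps the label lookup attached to the column. This is a sketch on a toy Series (the five values are made up for illustration), not what the notebook actually ran.

```python
import pandas as pd

# Toy category column, including the empty string that also appears in the real data.
cats = pd.Series(["Gaming", "Music", "", "Gaming", "Education"])

# astype("category") stores small integer codes plus one lookup table.
as_cat = cats.astype("category")
codes = as_cat.cat.codes               # int8 when there are fewer than 128 categories
labels = list(as_cat.cat.categories)   # sorted unique values, comparable to le.classes_

print(codes.tolist())
print(labels)
```

With this approach there is no separate `categories.pkl` to keep in sync, since the labels travel with the column.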


In [22]:
df_metadata['categories'] = df_metadata['categories'].astype('int8')
df_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 8 columns):
 #   Column         Dtype         
---  ------         -----         
 0   categories     int8          
 1   channel_id     object        
 2   dislike_count  int16         
 3   display_id     object        
 4   duration       int16         
 5   like_count     int16         
 6   upload_date    datetime64[ns]
 7   view_count     int32         
dtypes: datetime64[ns](1), int16(3), int32(1), int8(1), object(2)
memory usage: 2.4+ GB


In [23]:
# save pickle
import pickle
with open(data_path + "yt_metadata_helper.pkl", "wb") as f:
    pickle.dump(df_metadata, f)

# save categories
with open(data_path + "categories.pkl", "wb") as f:
    pickle.dump(le.classes_, f)

In [24]:
df_metadata.head()

Unnamed: 0,categories,channel_id,dislike_count,display_id,duration,like_count,upload_date,view_count
0,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,SBqSc91Hn9g,1,1,2016-09-28,1
1,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,UuugEl86ESY,1,1,2016-09-28,1
2,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,oB4c-yvnbjs,1,1,2016-09-28,1
3,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,ZaV-gTCMV8E,1,1,2016-09-28,1
4,5,UCzWrhkg9eK5I8Bm3HfV-unA,1,cGvL7AvMfM0,1,1,2016-09-28,1


In [25]:
df_metadata['upload_date'].dt.year.value_counts()

2018    15275859
2019    12723124
2017    12486407
2016     9352771
2015     6808073
2014     5179935
2013     4018279
2012     2926436
2011     1874984
2010     1085424
2009      694586
2008      338040
2007      137250
2006       23294
2005         332
Name: upload_date, dtype: int64

In [7]:
data_path = "data/data-ssd/full/" 
# Let's import the tags from the full dataset in order to encode them
# (the id column in this file is called display_id, not video_id)
df_tags = pd.read_json(data_path + "yt_metadata_en.jsonl.gz", lines=True, compression="gzip", encoding="utf-8", 
dtype={'display_id': 'str', 'tags': 'str'}, nrows=1000000, convert_dates=['upload_date'])
df_tags.head()

Unnamed: 0,categories,channel_id,crawl_date,description,dislike_count,display_id,duration,like_count,tags,title,upload_date,view_count
0,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:26.270363,Lego City Police Lego Firetruck Cartoons about...,1.0,SBqSc91Hn9g,1159,8.0,"lego city,lego police,lego city police,lego ci...",Lego City Police Lego Firetruck Cartoons about...,2016-09-28,1057.0
1,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:26.914516,Lego Marvel SuperHeroes Lego Hulk Smash Iron-M...,1.0,UuugEl86ESY,2681,23.0,"Lego superheroes,lego hulk,hulk smash,lego mar...",Lego Marvel SuperHeroes Lego Hulk Smash Iron-M...,2016-09-28,12894.0
2,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:26.531203,Lego City Police Lego Fireman Cartoons about L...,779.0,oB4c-yvnbjs,1394,1607.0,"lego city,lego police,lego city police,lego fi...",Lego City Police Lego Fireman Cartoons about L...,2016-09-28,1800602.0
3,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:28.335329,Lego Harry Potter Complete Lego New Movie for ...,24.0,ZaV-gTCMV8E,5064,227.0,"Lego harry potter,new harry potter,harry potte...",Lego Harry Potter Complete Lego New Movie for ...,2016-09-28,57640.0
4,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:30.328487,Lego City Police LONG VIDEO for kids Lego Fire...,13.0,cGvL7AvMfM0,3554,105.0,"lego city,lego police,lego city police,lego fi...",Lego City Police 1 HOUR LONG VIDEO for kids Le...,2016-09-28,86368.0


In [8]:
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 12 columns):
 #   Column         Non-Null Count    Dtype         
---  ------         --------------    -----         
 0   categories     1000000 non-null  object        
 1   channel_id     1000000 non-null  object        
 2   crawl_date     1000000 non-null  object        
 3   description    1000000 non-null  object        
 4   dislike_count  980320 non-null   float64       
 5   display_id     1000000 non-null  object        
 6   duration       1000000 non-null  int64         
 7   like_count     980320 non-null   float64       
 8   tags           1000000 non-null  object        
 9   title          1000000 non-null  object        
 10  upload_date    1000000 non-null  datetime64[ns]
 11  view_count     999999 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(7)
memory usage: 91.6+ MB


In [2]:
# Load the full dataset in chunks, keeping only display_id, duration, tags and title
data_path = "data/data-ssd/full/" 

dfs = []
for chunk in pd.read_json(data_path+'yt_metadata_en.jsonl.gz', compression="infer", chunksize=1000000, lines=True):
    # drop the unneeded columns chunk by chunk to keep memory usage down
    chunk.drop(columns=['categories', 'channel_id', 'crawl_date', 'description', 'dislike_count', 'like_count', 'view_count', 'upload_date'], inplace=True)
    dfs.append(chunk)
df_tags = pd.concat(dfs)

df_tags.to_feather(data_path+'yt_metadata_tags.feather')

In [3]:
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 4 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   display_id  object
 1   duration    int64 
 2   tags        object
 3   title       object
dtypes: int64(1), object(3)
memory usage: 2.2+ GB
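
The chunked load above can also be done without pandas, which avoids building full DataFrames for columns that get dropped anyway. A minimal standard-library sketch follows; `iter_fields` is a hypothetical helper name, and the two demo rows are invented stand-ins for records in `yt_metadata_en.jsonl.gz`.

```python
import gzip
import json
import os
import tempfile

def iter_fields(path, fields):
    """Yield one dict per JSONL record, keeping only the requested fields."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            yield {name: record.get(name) for name in fields}

# Tiny demo file standing in for the real dataset (hypothetical rows).
rows = [
    {"display_id": "a1", "tags": "lego city,lego police", "title": "t1", "view_count": 10},
    {"display_id": "b2", "tags": "", "title": "t2", "view_count": 20},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    for row in rows:
        handle.write(json.dumps(row) + "\n")

# Only one record is held in memory at a time while iterating.
kept = list(iter_fields(demo_path, ["display_id", "tags"]))
print(kept)
```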


In [None]:
#df_tags.drop(columns=['display_id', 'duration'], inplace=True) # turns out this is more expensive than useful....
#df_tags.info()

In [2]:
data_path = "data/data-ssd/full/" 
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True)
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 4 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   display_id  object
 1   duration    int64 
 2   tags        object
 3   title       object
dtypes: int64(1), object(3)
memory usage: 2.2+ GB


In [3]:
df_tags = df_tags['tags']
df_tags.info()

<class 'pandas.core.series.Series'>
RangeIndex: 72924794 entries, 0 to 72924793
Series name: tags
Non-Null Count     Dtype 
--------------     ----- 
72924794 non-null  object
dtypes: object(1)
memory usage: 556.4+ MB


In [5]:
df_tags.to_frame().to_feather(data_path+'yt_metadata_tags.feather')

In [6]:
df_tags.head()

0    lego city,lego police,lego city police,lego ci...
1    Lego superheroes,lego hulk,hulk smash,lego mar...
2    lego city,lego police,lego city police,lego fi...
3    Lego harry potter,new harry potter,harry potte...
4    lego city,lego police,lego city police,lego fi...
Name: tags, dtype: object
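
Before vectorizing, it can help to know which tags are frequent at all. A sketch of counting tags in a streaming fashion, so that no list-per-row column has to be materialized; `count_tags` is a hypothetical helper, and lowercasing plus whitespace stripping are my assumptions about sensible normalization, not something decided in this notebook.

```python
from collections import Counter

def count_tags(tag_strings):
    """Count individual tags without building a list per row first."""
    counts = Counter()
    for raw in tag_strings:
        if not raw:                      # skip empty or missing tag strings
            continue
        counts.update(tag.strip().lower() for tag in raw.split(","))
    return counts

# Shortened versions of the tag strings shown above, plus one empty row.
sample = [
    "lego city,lego police,lego city police",
    "Lego superheroes,lego hulk,hulk smash",
    "lego city,lego police",
    "",
]
tag_counts = count_tags(sample)
print(tag_counts.most_common(2))
```

Any iterable of strings works as input, so the same function could be fed chunk by chunk.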

## Vectorization of tags

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
# gensim
import gensim
from gensim.models import Word2Vec, KeyedVectors, FastText, doc2vec
import gensim.downloader as api
from gensim.models.phrases import Phrases
# nltk
# import nltk
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize



In [14]:
for model_name, model_data in sorted(api.info()['models'].items()):
    print("{}: {}".format(model_name, model_data['description'][:40] + "..."))

__testing_word2vec-matrix-synopsis: [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300: ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300: 1 million word vectors trained on Wikipe...
glove-twitter-100: Pre-trained vectors based on  2B tweets,...
glove-twitter-200: Pre-trained vectors based on 2B tweets, ...
glove-twitter-25: Pre-trained vectors based on 2B tweets, ...
glove-twitter-50: Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100: Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200: Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300: Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50: Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300: Pre-trained vectors trained on a part of...
word2vec-ruscorpora-300: Word2vec Continuous Skipgram vectors tra...


In [2]:
glove = api.load("glove-wiki-gigaword-50")
print(glove.most_similar("cat"))

[('dog', 0.9218006134033203), ('rabbit', 0.8487821817398071), ('monkey', 0.8041081428527832), ('rat', 0.7891963720321655), ('cats', 0.7865270376205444), ('snake', 0.7798910737037659), ('dogs', 0.7795815467834473), ('pet', 0.7792249917984009), ('mouse', 0.7731667757034302), ('bite', 0.7728800177574158)]


In [16]:
print(glove.most_similar("youtube"))

[('myspace', 0.8685212135314941), ('uploaded', 0.8573760986328125), ('facebook', 0.8540773391723633), ('twitter', 0.8430657982826233), ('videos', 0.8099467158317566), ('video', 0.7907883524894714), ('downloaded', 0.7642441391944885), ('blog', 0.7627186179161072), ('download', 0.7616114020347595), ('downloads', 0.7580808401107788)]


In [17]:
print(glove.most_similar("google"))

[('yahoo', 0.8942785263061523), ('aol', 0.852712869644165), ('microsoft', 0.8450709581375122), ('internet', 0.8179759979248047), ('web', 0.8175380229949951), ('facebook', 0.8087005615234375), ('ebay', 0.7930072546005249), ('netscape', 0.7912958860397339), ('online', 0.7908353805541992), ('software', 0.7816097140312195)]


In [18]:
print(glove.most_similar("lego"))

[('mindstorms', 0.828197181224823), ('bionicle', 0.7207260131835938), ('jigsaw', 0.6471174955368042), ('dolls', 0.6425037980079651), ('technic', 0.6417526006698608), ('namco', 0.6410648226737976), ('diecast', 0.638368546962738), ('arcade', 0.6373211145401001), ('toy', 0.6369045376777649), ('playmobil', 0.6331797242164612)]


In [19]:
print(glove.most_similar("apple"))

[('blackberry', 0.7543067932128906), ('chips', 0.7438643574714661), ('iphone', 0.7429664134979248), ('microsoft', 0.7334205508232117), ('ipad', 0.7331036925315857), ('pc', 0.7217226624488831), ('ipod', 0.7199784517288208), ('intel', 0.7192243337631226), ('ibm', 0.7146540880203247), ('software', 0.7093585729598999)]


In [20]:
print(glove.most_similar_to_given("cat", ["dog", "mouse", "lego", "apple", "youtube", "google"]))

dog


In [35]:
# fine-tune the model with our tags
df_tags.head().apply(lambda x: x.split(","))[0]



['lego city',
 'lego police',
 'lego city police',
 'lego city episodes',
 'videos de lego city',
 'lego policia',
 'lego bomberos',
 'lego fire truck',
 'lego firetruck',
 'lego police chase',
 'lego robbers',
 'lego cartoons',
 'lego movies',
 'lego videos for kids']
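
Since most tags are multi-word phrases while GloVe holds single words, one simple baseline (not the fine-tuning mentioned in the cell above) is to average the vectors of the words in each tag. The sketch below uses a hypothetical three-dimensional toy dictionary in place of the real `glove` lookups, so the numbers mean nothing; it only shows the mechanics.

```python
import math

# Hypothetical 3-dimensional vectors; a real run would look these up in `glove`.
toy_vectors = {
    "lego": [1.0, 0.0, 0.0],
    "city": [0.0, 1.0, 0.0],
    "police": [0.0, 0.0, 1.0],
}

def embed_tag(tag, vectors):
    """Average the vectors of the known words in a tag; None if none are known."""
    known = [vectors[w] for w in tag.lower().split() if w in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

city = embed_tag("lego city", toy_vectors)
police = embed_tag("lego police", toy_vectors)
similarity = cosine(city, police)
print(city, police, similarity)
```

Words missing from the vocabulary are skipped, which matters for tags in other languages such as `lego policia`.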

In [3]:
#df_tags = df_tags.apply(lambda x: x.split(",")) # jesus christ everything takes forever, 30min for this + all the memory
df_tags.head()

NameError: name 'df_tags' is not defined

## Parallelization with modin, ray, dask ... experiments

In [1]:
# modin
import modin.pandas as pd

In [2]:
# load with modin
data_path = "data/data-ssd/full/"
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True)


    from distributed import Client

    client = Client()

distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-project-nan/dask-worker-space/worker-qellbot8', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-project-nan/dask-worker-space/worker-h7v02y7i', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-project-nan/dask-worker-space/worker-ik6kiyb_', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-project-nan/dask-worker-space/worker-co_tet5h', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-project-nan/dask-worker-space/worker-88ir7lm6', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-

KilledWorker: ('parse-c1a3ed1b9b8f3ff7b997c3767e86224f', <WorkerState 'tcp://127.0.0.1:58042', name: 2, status: closed, memory: 0, processing: 1>)

In [26]:
# I'll leave the output above for reference; let's try again
# Let's try to explicitly use Dask or Ray
import modin.pandas as pd
import modin
print(modin.config.Engine.get())
print(modin.config.NPartitions.get())
print(modin.config.IsRayCluster.get())
print(modin.config.Engine.choices)
print(modin.config.Engine.get_help())
print(modin.config.NPartitions.get_help())
print(modin.config.IsRayCluster.get_help())
# https://github.com/modin-project/modin/pull/292#issuecomment-445915511 mmmh...
modin.config.Engine.put('Ray') # had to install ray from pip, not present in conda
print(modin.config.Engine.get())
data_path = "data/data-ssd/full/"
import timeit
start = timeit.default_timer()
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True)
stop = timeit.default_timer()

Ray
10
None
('Ray', 'Dask', 'Python', 'Native')
MODIN_ENGINE: Distribution engine to run queries by.
	Provide a case-insensitive string (valid examples are: Ray, Dask, Python, Native)
MODIN_NPARTITIONS: How many partitions to use for a Modin DataFrame (along each axis).
	Provide an integer value
MODIN_RAY_CLUSTER: Whether Modin is running on pre-initialized Ray cluster.
	Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
Ray



    import ray
    ray.init()

2022-12-12 15:24:03,661	INFO worker.py:1528 -- Started a local Ray instance.


In [31]:
print('Time: ', stop - start)
df_tags.head()

Time:  304.4392979169497


Unnamed: 0,tags
0,"lego city,lego police,lego city police,lego ci..."
1,"Lego superheroes,lego hulk,hulk smash,lego mar..."
2,"lego city,lego police,lego city police,lego fi..."
3,"Lego harry potter,new harry potter,harry potte..."
4,"lego city,lego police,lego city police,lego fi..."


In [32]:
# let's run it again now that ray is installed and initialized
import time
start = time.time()
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True)
stop = time.time()
print('Time: ', stop - start)
print(start)
print(stop)

[2m[36m(raylet)[0m Spilled 2216 MiB, 2 objects, write throughput 240 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
[2m[36m(raylet)[0m Spilled 4375 MiB, 4 objects, write throughput 339 MiB/s.
[2m[36m(raylet)[0m Spilled 18756 MiB, 17 objects, write throughput 663 MiB/s.


Time:  436.12635588645935
1670855744.754067
1670856180.8804228


In [36]:
# I just realized the bottleneck at this point is the SSD, not the CPU, pandas or Ray!
# Now that the data is in memory, though, it should be faster than pandas. Splitting the tags took 30 min and didn't even finish, so let's try again here
df_tags.info() # what.. took 50s...




<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 1 columns):
 #   Column  Non-Null Count     Dtype 
---  ------  -----------------  ----- 
 0   tags    72924794 non-null  object
dtypes: object(1)
memory usage: 556.4 MB


In [37]:
start = time.time()
df_tags = df_tags['tags'].apply(lambda x: x.split(","))
stop = time.time()
print('Time: ', stop - start)

Time:  0.844296932220459


In [38]:
# WHAT????? Did it actually work?
df_tags.head() # head() takes more time than apply
# Clearly some operations are heavily optimized, while others need to aggregate the whole dataset in a single node and are super slow.
# 5 min and it still didn't finish

KeyboardInterrupt: 
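
A likely explanation for the sub-second `apply` followed by a `head()` that never finishes is that Modin on Ray submits work asynchronously, so the timer stops before anything has actually been computed. That is my reading, not something verified here. The sketch below shows the same effect with a plain Python generator and wraps the repeated `start`/`stop` pattern in a small context manager; `timed` is a hypothetical helper name.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
tags = ["lego city,lego police", "lego hulk,hulk smash"] * 1000

with timed("lazy", timings):
    split_lazy = (t.split(",") for t in tags)   # generator: no work done yet

with timed("forced", timings):
    split_done = list(split_lazy)               # consuming it does the work

print(timings)
```

For a fair comparison between engines, the timed block should include a step that forces the result, not just the call that schedules it.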

In [5]:
# I will copy the feather tags file to my disk and try again
data_path = "data/full/"
import time
start = time.time()
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True)
stop = time.time()
print('Time: ', stop - start)

Time:  59.37336993217468


In [6]:
start = time.time()
print(df_tags.info())
stop = time.time()
print('Time: ', stop - start)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 1 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   tags    object
dtypes: object(1)
memory usage: 556.4+ MB
None
Time:  0.05313467979431152


In [7]:
start = time.time()
print(df_tags.head())
stop = time.time()
print('Time: ', stop - start)

                                                tags
0  lego city,lego police,lego city police,lego ci...
1  Lego superheroes,lego hulk,hulk smash,lego mar...
2  lego city,lego police,lego city police,lego fi...
3  Lego harry potter,new harry potter,harry potte...
4  lego city,lego police,lego city police,lego fi...
Time:  0.01375579833984375


In [8]:
start = time.time()
df_tags = df_tags['tags'].apply(lambda x: x.split(","))
print(df_tags.head())
stop = time.time()
print('Time: ', stop - start)

KeyboardInterrupt: 

In [1]:
#import pandas
import modin.pandas as pd
import modin
import os
os.environ["MODIN_ENGINE"] = "Dask"  # Modin will use Dask
# ray.init()
#ray.init(ignore_reinit_error=True)
# print(modin.config.Engine.get())
# print(modin.config.NPartitions.get())
# print(modin.config.Engine.put('Ray'))
# print(modin.config.Engine.get())
# It gives errors if I try to print these now 
data_path = "data/full/"
import time
start = time.time()
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True)
stop = time.time()
print('Time: ', stop - start)


    from distributed import Client

    client = Client()

distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-project-nan/dask-worker-space/worker-6g0a94de', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-project-nan/dask-worker-space/worker-2xc3dfbb', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-project-nan/dask-worker-space/worker-xrgty7r_', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-project-nan/dask-worker-space/worker-jmyht63b', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-project-nan/dask-worker-space/worker-nnc9u29l', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/jacopoferro/Documents/ADA/ada-2022-

KilledWorker: ('parse-d6e9b0b42113e9ec9cc94cf6a479de0b', <WorkerState 'tcp://127.0.0.1:55891', name: 8, status: closed, memory: 0, processing: 1>)

In [2]:
import pandas as pd
import time
data_path = "data/full/"
start = time.time()
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True)
stop = time.time()
print('Time: ', stop - start)

Time:  54.73409914970398


In [3]:
# So, from the SSD I'm able to use modin with Ray, and it's incredibly fast on apply() but incredibly slow on head() and info(),
# because of the SSD bottleneck and because it has to aggregate things in a single node.
# From the local disk it's much faster to load the data with plain pandas, but I'm hitting bugs with modin: with Ray it doesn't work,
# and with Dask the division of work makes the load fail.
# Let's just use pandas and see how slow the apply is.
start = time.time()
#df_tags = df_tags['tags'].apply(lambda x: x.split(",")) # I stopped at 1min 30s, clearly much slower and even more memory consuming, it tries to copy everything in ram
stop = time.time()
print('Time: ', stop - start)

KeyboardInterrupt: 

In [1]:

from modin.config import Engine
Engine.put("Ray")
import modin.pandas as pd

import os
os.environ.items()
Engine.get()  # use the imported Engine; the bare name `modin` is never imported in this cell
# I was having issues; what I found: conda install modin modin-all modin-ray -c conda-forge
# conda install grpcio -c conda-forge    (this needs to be installed from conda and not pip for my Mac to work)

ModuleNotFoundError: No module named 'modin'

In [1]:
# why not detected anymore?? -- ok finally, had to reinstall everything from scratch and force-reinstall
import numpy as np
import modin
from modin.config import Engine
print(modin.config.Engine.get()) # dask
# ray is only in pip, installed at the end, rest is conda-forge
Engine.put("Ray")
print(modin.config.Engine.get()) # ray
import modin.pandas as pd
import time
data_path = "data/full/"
start = time.time()
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True) 
stop = time.time()
print('Time: ', stop - start)


ImportError: Please `pip install modin[ray]` to install compatible Ray version (>=1.4.0,<1.13.0).

In [1]:
# OK so, if I run it once and install things as I go, it works. If I try to run it again, it doesn't work.
import os 
os.environ.setdefault("MODIN_ENGINE", "Ray") # basically I need to set this before importing modin every time else it doesn't work -- still not that fast
#os.environ.setdefault("MODIN_ENGINE", "Dask") # but this doesn't work either to read the data... what should I doooooo
import modin.pandas as pd
import modin
from modin.config import Engine
print(modin.config.Engine.get()) # 
import time
data_path = "data/full/"
start = time.time()
df_tags = pd.read_feather(path= data_path + "yt_metadata_tags.feather", use_threads=True)
stop = time.time()
print('Time: ', stop - start)

Ray



    import ray
    ray.init()

2022-12-12 22:04:30,847	INFO worker.py:1528 -- Started a local Ray instance.


Time:  286.8405110836029
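
One detail worth keeping in mind with the cell above: `os.environ.setdefault` only writes the variable when it is not already set, so an engine chosen earlier in the same session or shell silently wins. A small standard-library sketch of the difference (no Modin needed to run it):

```python
import os

# Simulate an engine already chosen earlier in the session.
os.environ["MODIN_ENGINE"] = "Dask"

# setdefault only writes when the key is missing, so this is a no-op here.
os.environ.setdefault("MODIN_ENGINE", "Ray")
after_setdefault = os.environ["MODIN_ENGINE"]

# Plain assignment always wins; do it before `import modin.pandas`.
os.environ["MODIN_ENGINE"] = "Ray"
after_assignment = os.environ["MODIN_ENGINE"]

print(after_setdefault, after_assignment)
```

Restarting the kernel before switching engines is probably still the safest route, given the behaviour observed in this notebook.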


In [2]:
print('Time: ', stop - start)

Time:  286.8405110836029


In [3]:
# create csv file with tags
df_tags.to_csv("data/full/yt_metadata_tags.csv", index=False)

In [4]:
start = time.time()
df_tags = df_tags['tags'].apply(lambda x: x.split(","))
stop = time.time()
print('Time: ', stop - start)

Time:  2.705317974090576


Here's the situation: I do think using modin and the like makes sense, but the Feather format just complicates things.
Let's create a CSV version of the Feather file and see how it goes. It's going to be a bigger file, but it should be easier to work with.

In [2]:
import os
os.environ.setdefault("MODIN_ENGINE", "Dask")
import modin.pandas as pd
from distributed.client import Client
from distributed import LocalCluster
client = Client(LocalCluster(n_workers=8, threads_per_worker=8))
import time
data_path = "data/full/"
start = time.time()
df_tags = pd.read_csv(data_path + "yt_metadata_tags.csv")
stop = time.time()
print('Time: ', stop - start)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 61953 instead
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/Users/jacopoferro/opt/anaconda3/envs/ada/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/jacopoferro/opt/anaconda3/envs/ada/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jacopoferro/opt/anaconda3/envs/ada/lib/python3.10/site-packages/distributed/process.py", line 175, in _run
    target(*args, **kwargs)
  File "/Users/jacopoferro/opt/anaconda3/envs/ada/lib/python3.10/site-packages/distributed/nanny.py", line 938, in _run
    loop.run_sync(do_stop)
  File "/Users/jacopoferro/opt/anaconda3/envs/ada/lib/python3.10/site-packages/tornado/ioloop.py", line 523, in run_sync
    self.start()
  File "/Users/jacopoferro/opt/anaconda3/envs/ada/lib/python3.10/site-packages/tornado/platform/asyncio.py",

KeyboardInterrupt: 



In [1]:
import os
os.environ.setdefault("MODIN_ENGINE", "Ray")
import modin.pandas as pd
import ray
ray.init()
import time
data_path = "data/full/"
start = time.time()
df_tags = pd.read_csv(data_path + "yt_metadata_tags.csv")
stop = time.time()
print('Time: ', stop - start)

2022-12-12 23:01:26,715	INFO worker.py:1528 -- Started a local Ray instance.


Time:  78.19615006446838


In [3]:
#df_tags.head() #aaaaaaaaah why does this break everything

df_tags['tags'].loc[0]

'lego city,lego police,lego city police,lego city episodes,videos de lego city,lego policia,lego bomberos,lego fire truck,lego firetruck,lego police chase,lego robbers,lego cartoons,lego movies,lego videos for kids'

In [4]:
df_tags.info()

[2m[36m(raylet)[0m Spilled 11080 MiB, 10 objects, write throughput 1249 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.


<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 72924794 entries, 0 to 72924793
Data columns (total 1 columns):
 #   Column  Non-Null Count     Dtype 
---  ------  -----------------  ----- 
 0   tags    65801931 non-null  object
dtypes: object(1)
memory usage: 556.4 MB
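
The non-null count dropped from 72924794 (Feather) to 65801931 (CSV). The most likely cause is that empty tag strings were written as empty CSV fields and then parsed back as NaN, which is the pandas default. A sketch on a toy two-column frame; the `display_id` column is only there to keep the example unambiguous, and whether the same option should be used on the real file is a judgment call.

```python
import io
import pandas as pd

# Toy frame with one empty tag string (hypothetical ids).
toy = pd.DataFrame({"display_id": ["a1", "b2"], "tags": ["lego city,lego police", ""]})
csv_text = toy.to_csv(index=False)

# Default parsing converts empty fields to NaN.
default_read = pd.read_csv(io.StringIO(csv_text))
# keep_default_na=False leaves empty fields as empty strings.
kept_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["tags"].isna().sum(), kept_read["tags"].isna().sum())
```

Either way, videos without tags carry no information for the vectorization step, so dropping them afterwards is reasonable.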


In [5]:
df_tags = df_tags.dropna()

[2m[36m(raylet)[0m Spilled 11080 MiB, 11 objects, write throughput 1249 MiB/s.
[2m[36m(raylet)[0m Spilled 11080 MiB, 12 objects, write throughput 1249 MiB/s.
[2m[36m(raylet)[0m Spilled 16634 MiB, 17 objects, write throughput 1524 MiB/s.
[2m[36m(raylet)[0m Spilled 33683 MiB, 32 objects, write throughput 1839 MiB/s.


In [6]:
ray.available_resources()

{'CPU': 10.0,
 'memory': 20431613133.0,
 'node:127.0.0.1': 1.0,
 'object_store_memory': 2147483647.0}

In [8]:
ray.is_initialized()

True

In [9]:
df_tags.info()



<class 'modin.pandas.dataframe.DataFrame'>
Int64Index: 65801931 entries, 0 to 72924793
Data columns (total 1 columns):
 #   Column  Non-Null Count     Dtype 
---  ------  -----------------  ----- 
 0   tags    65801931 non-null  object
dtypes: object(1)
memory usage: 1004.1 MB


In [10]:
df_tags.dropna(inplace=True)

In [11]:
df_tags.info()



<class 'modin.pandas.dataframe.DataFrame'>
Int64Index: 65801931 entries, 0 to 72924793
Data columns (total 1 columns):
 #   Column  Non-Null Count     Dtype 
---  ------  -----------------  ----- 
 0   tags    65801931 non-null  object
dtypes: object(1)
memory usage: 1004.1 MB


In [14]:
# tags to vector
import gensim
from gensim.models import Word2Vec, KeyedVectors
import gensim.downloader as api

for i in range(0, 10):
    print(df_tags['tags'].loc[i])

lego city,lego police,lego city police,lego city episodes,videos de lego city,lego policia,lego bomberos,lego fire truck,lego firetruck,lego police chase,lego robbers,lego cartoons,lego movies,lego videos for kids
Lego superheroes,lego hulk,hulk smash,lego marverl superheroes,lego marvel super heroes movie,lego ironman,lego spiderman,spider-man,lego heroes,lego avengers,lego captain america
lego city,lego police,lego city police,lego fire truck,lego cartoons,lego for kids,lego episodes,lego polizi,lego policias,lego police chase,lego policias para niños
Lego harry potter,new harry potter,harry potter new movie,movie harry potter,movie harry potter lego,lego harry,lego movies,lego cartoons
lego city,lego police,lego city police,lego fire truck,lego cartoons,cartoons about lego,lego for kids,lego city movie,videos de lego city,lego policias,lego bomberos,lego polizi,polizi lego,lego para niños,lego fireman
lego marvel,lego marvel superheroes,lego superheroes,lego hulk,hulk smash,lego iro

In [20]:
for model_name, model_data in sorted(api.info()['models'].items()):
    print("{}: {}".format(model_name, model_data['description'][:] + "..."))

__testing_word2vec-matrix-synopsis: [THIS IS ONLY FOR TESTING] Word vecrors of the movie matrix....
conceptnet-numberbatch-17-06-300: ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting....
fasttext-wiki-news-subwords-300: 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens)....
glove-twitter-100: Pre-trained vectors based on  2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/project

In [22]:
conceptnet = api.load("conceptnet-numberbatch-17-06-300")



In [23]:
conceptnet.most_similar("cat")

KeyError: "Key 'cat' not present in vocabulary"

In [24]:
conceptnet.most_similar("/c/en/cat")



[('/c/ur/بلی', 0.9927220344543457),
 ('/c/rup/cãtush', 0.9910101294517517),
 ('/c/tr/kedi', 0.9870569109916687),
 ('/c/hsb/kóčka', 0.9868075847625732),
 ('/c/da/kat', 0.9857985973358154),
 ('/c/az/pişik', 0.9857766032218933),
 ('/c/kk/мысық', 0.9857648611068726),
 ('/c/lt/kate', 0.9850127100944519),
 ('/c/no/pus', 0.9846401214599609),
 ('/c/no/katt', 0.9840860366821289)]
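
The `KeyError` for plain `"cat"` and the neighbours above suggest that the multilingual Numberbatch vocabulary stores every term as a `/c/<lang>/<term>` key. A minimal sketch of a helper that builds such keys follows. `to_conceptnet_key` is a hypothetical name, and the lowercasing and space-to-underscore steps are assumptions based on keys such as `/c/en/####_metres` seen in this notebook, not the official ConceptNet normalisation.

```python
def to_conceptnet_key(term: str, lang: str = "en") -> str:
    """Build a '/c/<lang>/<term>' key (hypothetical helper, simplified normalisation)."""
    # Assumption: lowercase the term and join its words with underscores.
    normalised = "_".join(term.strip().lower().split())
    return f"/c/{lang}/{normalised}"


print(to_conceptnet_key("cat"))
print(to_conceptnet_key("Fire Truck"))
print(to_conceptnet_key("gatto", lang="it"))
```

With the loaded model this would be used as `conceptnet.most_similar(to_conceptnet_key("cat"))`.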

In [40]:
en = [i for i in conceptnet.key_to_index if i.startswith("/c/en/")]
it = [i for i in conceptnet.key_to_index if i.startswith("/c/it/")]
fr = [i for i in conceptnet.key_to_index if i.startswith("/c/fr/")]
de = [i for i in conceptnet.key_to_index if i.startswith("/c/de/")]
es = [i for i in conceptnet.key_to_index if i.startswith("/c/es/")]
pt = [i for i in conceptnet.key_to_index if i.startswith("/c/pt/")]
nl = [i for i in conceptnet.key_to_index if i.startswith("/c/nl/")]

print("Number of English words: {}".format(len(en)))
print("Number of Italian words: {}".format(len(it)))
print("Number of French words: {}".format(len(fr)))
print("Number of German words: {}".format(len(de)))
print("Number of Spanish words: {}".format(len(es)))
print("Number of Portuguese words: {}".format(len(pt)))
print("Number of Dutch words: {}".format(len(nl)))
print("Total number of words: {}".format(len(conceptnet.key_to_index)))

Number of English words: 417194
Number of Italian words: 91828
Number of French words: 296986
Number of German words: 129405
Number of Spanish words: 44756
Number of Portuguese words: 47532
Number of Dutch words: 45245
Total number of words: 1917247
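
The cell above scans the full vocabulary once per language. A sketch of the same grouping in a single pass, run here on a toy key list instead of `conceptnet.key_to_index`; `group_keys_by_language` is a hypothetical helper and it also strips the `/c/<lang>/` prefix.

```python
from collections import defaultdict


def group_keys_by_language(keys, languages):
    """Group '/c/<lang>/<term>' keys by language code in one pass over `keys`."""
    wanted = set(languages)
    grouped = defaultdict(list)
    for key in keys:
        # '/c/en/cat'.split('/', 3) -> ['', 'c', 'en', 'cat']
        parts = key.split("/", 3)
        if len(parts) == 4 and parts[1] == "c" and parts[2] in wanted:
            grouped[parts[2]].append(parts[3])
    return dict(grouped)


# Toy keys standing in for the real vocabulary.
toy_keys = ["/c/en/cat", "/c/it/gatto", "/c/en/dog", "/c/tr/kedi", "/c/fr/chat"]
by_lang = group_keys_by_language(toy_keys, ["en", "it", "fr"])
for lang, terms in sorted(by_lang.items()):
    print(lang, len(terms), terms)
```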


In [41]:
en[:10]

['/c/en/##',
 '/c/en/###',
 '/c/en/####',
 '/c/en/#####',
 '/c/en/#####_metres',
 '/c/en/#####ish',
 '/c/en/####_adapter',
 '/c/en/####_form',
 '/c/en/####_ish',
 '/c/en/####_metres']

In [42]:
en = [i[6:] for i in en]
it = [i[6:] for i in it]
fr = [i[6:] for i in fr]
de = [i[6:] for i in de]
es = [i[6:] for i in es]
pt = [i[6:] for i in pt]
nl = [i[6:] for i in nl]
en[:10]

['##',
 '###',
 '####',
 '#####',
 '#####_metres',
 '#####ish',
 '####_adapter',
 '####_form',
 '####_ish',
 '####_metres']

In [43]:
en.index("cat")

61702

In [59]:
conceptnet.key_to_index["/c/en/cat"]

310898

In [78]:
en_conceptnet = conceptnet[[conceptnet.key_to_index["/c/en/" + i] for i in en]]
it_conceptnet = conceptnet[[conceptnet.key_to_index["/c/it/" + i] for i in it]]
fr_conceptnet = conceptnet[[conceptnet.key_to_index["/c/fr/" + i] for i in fr]]
de_conceptnet = conceptnet[[conceptnet.key_to_index["/c/de/" + i] for i in de]]
es_conceptnet = conceptnet[[conceptnet.key_to_index["/c/es/" + i] for i in es]]
pt_conceptnet = conceptnet[[conceptnet.key_to_index["/c/pt/" + i] for i in pt]]
nl_conceptnet = conceptnet[[conceptnet.key_to_index["/c/nl/" + i] for i in nl]]

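# NOTE: {a, b} is a set literal, so each element is an unordered set, not a (word, index) pair;
# this also overwrites the en_conceptnet array built above
en_conceptnet = [{i, conceptnet.key_to_index["/c/en/" + i]} for i in en]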
en_conceptnet[:10]

[{'##', 249196},
 {'###', 249197},
 {'####', 249198},
 {'#####', 249199},
 {'#####_metres', 249200},
 {'#####ish', 249201},
 {'####_adapter', 249202},
 {'####_form', 249203},
 {'####_ish', 249204},
 {'####_metres', 249205}]

In [79]:
en_conceptnet[310898]

{560094, 'reinvigorating'}
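
Two things go wrong in the lookup above: `{word, index}` builds an unordered set, and indexing the list with 310898 uses the position inside `en`, not the global vocabulary index of `cat`. A toy sketch of the difference, using a made-up three-key vocabulary:

```python
# Toy stand-in for conceptnet.key_to_index (key -> global index).
toy_vocab = {"/c/de/katze": 0, "/c/en/cat": 1, "/c/en/dog": 2}
en_terms = [k[len("/c/en/"):] for k in toy_vocab if k.startswith("/c/en/")]

# Set literal: unordered, so word and index cannot be told apart by position.
as_sets = [{t, toy_vocab["/c/en/" + t]} for t in en_terms]
# Dict: an explicit word -> global index mapping.
as_dict = {t: toy_vocab["/c/en/" + t] for t in en_terms}

print(type(as_sets[0]).__name__)
# Position inside the English-only list vs. global index in the vocabulary.
print(en_terms.index("cat"), as_dict["cat"])
```

The dict version is what the next cell in the notebook switches to.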

In [81]:
en_conceptnet = {i: conceptnet.key_to_index["/c/en/" + i] for i in en}
en_conceptnet["cat"]

310898

In [82]:
conceptnet.key_to_index["/c/en/cat"]

310898

In [83]:
conceptnet.index_to_key[310898]

'/c/en/cat'

In [85]:
conceptnet.most_similar(en_conceptnet["cat"])



[('/c/ur/بلی', 0.9927220344543457),
 ('/c/rup/cãtush', 0.9910101294517517),
 ('/c/tr/kedi', 0.9870569109916687),
 ('/c/hsb/kóčka', 0.9868075847625732),
 ('/c/da/kat', 0.9857985973358154),
 ('/c/az/pişik', 0.9857766032218933),
 ('/c/kk/мысық', 0.9857648611068726),
 ('/c/lt/kate', 0.9850127100944519),
 ('/c/no/pus', 0.9846401214599609),
 ('/c/no/katt', 0.9840860366821289)]

In [90]:

conceptnet.most_similar(en_conceptnet["cat"])[:10]





[('/c/ur/بلی', 0.9927220344543457),
 ('/c/rup/cãtush', 0.9910101294517517),
 ('/c/tr/kedi', 0.9870569109916687),
 ('/c/hsb/kóčka', 0.9868075847625732),
 ('/c/da/kat', 0.9857985973358154),
 ('/c/az/pişik', 0.9857766032218933),
 ('/c/kk/мысық', 0.9857648611068726),
 ('/c/lt/kate', 0.9850127100944519),
 ('/c/no/pus', 0.9846401214599609),
 ('/c/no/katt', 0.9840860366821289)]

In [91]:
# load only the English words from ConceptNet Numberbatch 19.08
en_conceptnet = KeyedVectors.load_word2vec_format("data/conceptnet/numberbatch-en.txt", binary=False)
en_conceptnet.most_similar("cat")[:10]

[('noncat', 0.9977306723594666),
 ('catdom', 0.9933454394340515),
 ('cathood', 0.9817461371421814),
 ('felicidal', 0.980994462966919),
 ('sharp_claws', 0.9793973565101624),
 ('felicide', 0.9761825203895569),
 ('catch_mice', 0.9721653461456299),
 ('cat_translations', 0.9669778347015381),
 ('hunt_mice', 0.9575042128562927),
 ('catly', 0.950741708278656)]

In [93]:
# great, now try an underscore-joined multi-word key
en_conceptnet.most_similar("cat_eating")

[('kitten_eating', 1.0),
 ('reptile_eating', 0.9997147917747498),
 ('domestic_pet_eating', 0.9997147917747498),
 ('animal_eating', 0.9921922087669373),
 ('dog_eating', 0.932854413986206),
 ('bird_eating', 0.922881007194519),
 ('fish_eating', 0.7960236072540283),
 ('eatathon', 0.7284303307533264),
 ('transporting_ammunition', 0.7045318484306335),
 ('unevent', 0.704103946685791)]

In [101]:
en_conceptnet.get_mean_vector("cat_eating cake")

array([-5.36012985e-02, -3.75471869e-03, -3.66847329e-02,  5.68630733e-02,
       -2.56850496e-02,  8.29177909e-03,  2.60618441e-02, -7.11389631e-02,
        4.34547104e-02, -9.07761103e-04, -1.08587241e-03,  1.01425573e-01,
       -1.57286569e-01, -6.47393614e-02,  1.36155143e-01,  3.84928994e-02,
       -6.12306260e-02,  2.35844627e-02,  1.51535133e-02,  1.13535477e-02,
        8.99253786e-02, -2.48617548e-02,  2.62076296e-02,  3.22448760e-02,
       -2.58419034e-03, -3.76479477e-02, -5.33153340e-02,  1.35394018e-02,
        4.19466086e-02, -6.48513157e-03, -2.70928107e-02,  1.20996144e-02,
        1.55380089e-02,  2.45250668e-03,  2.01369114e-02, -1.43000921e-02,
       -3.86177795e-03, -1.38926525e-02, -4.31856252e-02, -1.16776340e-02,
        2.26001274e-02, -2.12824680e-02, -6.01688810e-02,  2.66457126e-02,
       -5.57843857e-02, -1.93522777e-02,  1.47308661e-02,  2.80016032e-03,
        3.39927785e-02, -1.19846202e-02, -1.35007594e-02,  2.58385148e-02,
       -5.45354444e-04,  

In [111]:
en_conceptnet.get_mean_vector("cat_eating cake") - en_conceptnet.get_mean_vector("cat_eating")
en_conceptnet.distances("cat", ["dog"])

array([0.3327399], dtype=float32)
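
As far as I understand gensim, `distances` returns cosine distances, i.e. one minus the cosine similarity, so 0 means same direction and 1 means orthogonal. A plain-Python sketch of that formula on toy 2-d vectors (not the real 300-d embeddings):

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```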

In [115]:
en_conceptnet.get_mean_vector("cat eating cake") - en_conceptnet.get_mean_vector("cat_eating cake")

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.
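
A likely explanation for the all-zero difference, assuming `get_mean_vector` simply iterates over its argument and skips missing keys: a plain string is iterated character by character, so `"cat eating cake"` and `"cat_eating cake"` contribute the same letters whenever neither `' '` nor `'_'` is a key. The toy function below imitates that behaviour; it is a simplification (no normalisation) and not gensim's implementation.

```python
def toy_mean_vector(keys, vectors):
    """Average the vectors of the keys found in `vectors`, skipping the rest."""
    found = [vectors[k] for k in keys if k in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[d] for vec in found) / len(found) for d in range(dim)]


# Toy vocabulary containing single letters and one whole word.
toy_vectors = {"c": [1.0, 0.0], "a": [0.0, 1.0], "t": [1.0, 1.0], "cat": [5.0, 5.0]}

print(toy_mean_vector("cat", toy_vectors))    # string: iterates 'c', 'a', 't'
print(toy_mean_vector(["cat"], toy_vectors))  # list: looks up the key 'cat'
# ' ' and '_' are both missing, so the two strings give the same mean.
print(toy_mean_vector("c at", toy_vectors) == toy_mean_vector("c_at", toy_vectors))
```

If that is indeed what happens, passing a list of keys instead of a string should give the intended phrase vector.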

In [116]:
df_tags['tags'].loc[0]

'lego city,lego police,lego city police,lego city episodes,videos de lego city,lego policia,lego bomberos,lego fire truck,lego firetruck,lego police chase,lego robbers,lego cartoons,lego movies,lego videos for kids'
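
The tags arrive as one comma-separated string, while the English Numberbatch keys seen above are lowercase and underscore-joined (`cat_eating`, `sharp_claws`). A small sketch of a converter; `tags_to_keys` is a hypothetical helper and the normalisation is an assumption, not something the dataset or ConceptNet prescribes.

```python
def tags_to_keys(tag_string: str) -> list:
    """Split a comma-separated tag string into lowercase, underscore-joined keys."""
    keys = []
    for tag in tag_string.split(","):
        tag = tag.strip().lower()
        if tag:  # drop empty pieces, e.g. from a trailing comma
            keys.append("_".join(tag.split()))
    return keys


print(tags_to_keys("lego city,lego police, Lego Fire Truck,"))
```

The resulting list could then be passed to `en_conceptnet.get_mean_vector(...)`. Multi-word tags that are not in the vocabulary would presumably be skipped, so falling back to the individual words may be worth trying.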

In [125]:
en_conceptnet.get_mean_vector(df_tags['tags'].loc[0])

array([-0.02691581,  0.00509281, -0.03811713,  0.03563258, -0.0339757 ,
        0.01534749,  0.02615654, -0.08195128,  0.0339022 ,  0.0065336 ,
        0.01799841,  0.06635644, -0.1723129 , -0.07529027,  0.13962828,
        0.04143422, -0.07883903,  0.0350382 ,  0.00441601,  0.0093426 ,
        0.06817945, -0.01893389,  0.02360734,  0.06237936, -0.00964601,
       -0.02361721, -0.07434096,  0.00178824,  0.04441808, -0.02219541,
       -0.02297872,  0.03257639,  0.03710947,  0.02555624,  0.04567563,
       -0.00636792,  0.00566347, -0.01174579, -0.0438919 , -0.00148428,
        0.03936958, -0.05152805, -0.08072304,  0.03377918, -0.07357135,
       -0.04454736,  0.00227826,  0.00391911,  0.04186884, -0.02958278,
       -0.00072592,  0.01473713, -0.02180761,  0.11400161, -0.03269357,
        0.02720689,  0.01602898, -0.09596177,  0.03973094,  0.05220033,
       -0.03178803,  0.01008175,  0.03263663, -0.037671  ,  0.05715634,
       -0.01832351,  0.02100495, -0.00778977, -0.01885771, -0.00

In [128]:
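# tags to vector: split the comma-separated string into underscore-joined keys first,
# because get_mean_vector expects a list of keys (a raw string is iterated character by character)
df_vecs = df_tags['tags'].apply(lambda x: en_conceptnet.get_mean_vector([t.strip().lower().replace(" ", "_") for t in x.split(",")]))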
df_vecs.info()

Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.


In [None]:
df_vecs.to_csv("data/full/yt_metadata_tags_vecs.csv")
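
Since every cell of `df_vecs` holds a whole vector, `to_csv` would have to store each vector as a single text cell that needs parsing on the way back. One alternative, sketched with the standard library only, is one column per dimension. `write_vectors_csv` and `read_vectors_csv` are hypothetical helpers, and the ids and vectors are toy values.

```python
import csv
import io


def write_vectors_csv(rows, handle):
    """Write (row_id, vector) pairs as 'id, v0, v1, ...' rows."""
    writer = csv.writer(handle)
    for row_id, vector in rows:
        # repr(float) round-trips exactly through float().
        writer.writerow([row_id] + [repr(float(x)) for x in vector])


def read_vectors_csv(handle):
    """Read the rows back into a {row_id: [floats]} dict."""
    return {row[0]: [float(x) for x in row[1:]] for row in csv.reader(handle) if row}


buffer = io.StringIO()  # stands in for a real file opened with newline=""
toy_rows = [("video_a", [0.1, -0.25]), ("video_b", [0.0, 1.5])]
write_vectors_csv(toy_rows, buffer)
buffer.seek(0)
restored = read_vectors_csv(buffer)
print(restored)
```

For the full dataset a binary format would probably be smaller and faster; this only illustrates the layout.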

--- dunno what I was doing under this line, bugs and such; I expanded on it above -----

In [4]:
import time
start = time.time()
#print(df_tags.head()) # Again dangerous
stop = time.time()
print('Time: ', stop - start)

KeyboardInterrupt: 

In [8]:
# Let's build the set of unique lowercased tags so we can fit and save a label encoder
from collections import Counter
from sklearn.preprocessing import LabelEncoder
import time
start = time.time()
# lowercase
df_tags['tags'] = df_tags['tags'].apply(lambda x: [tag.lower() for tag in x])
le = LabelEncoder()
tags = set()
for i in range(len(df_tags)):
    tags.update(df_tags['tags'][i])
tags = list(tags)
le.fit(tags)
# save classes of encoder
import pickle
with open('data/full/tags.pkl', 'wb') as f:
    pickle.dump(le.classes_, f)
# also save a CSV with the tags so it's easier to read
df_tags.to_csv('data/full/tags.csv')
stop = time.time()
print('Time: ', stop - start)

UFuncTypeError: ufunc 'less' did not contain a loop with signature matching types (<class 'numpy.dtype[str_]'>, <class 'numpy.dtype[int64]'>) -> <class 'numpy.dtype[bool_]'>
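
I have not pinned down the cause of this error. One thing that looks off in the cell, assuming the column holds comma-separated strings as in `In [116]`, is that `[tag.lower() for tag in x]` walks over single characters rather than tags; non-string values in the column are another candidate. A plain-Python sketch of building the tag vocabulary while guarding against both; `build_tag_vocabulary` is a hypothetical helper.

```python
def build_tag_vocabulary(tag_strings):
    """Map each unique lowercased tag to an integer id (ids follow sorted order)."""
    tags = set()
    for tag_string in tag_strings:
        if not isinstance(tag_string, str):
            continue  # skip missing values such as None or NaN
        tags.update(t.strip().lower() for t in tag_string.split(",") if t.strip())
    return {tag: idx for idx, tag in enumerate(sorted(tags))}


vocab = build_tag_vocabulary(["Lego City,lego police", "lego city, Lego Hulk", None])
print(vocab)
```

Ids assigned in sorted order are, to my knowledge, also how `LabelEncoder` numbers its classes, so the mapping should be comparable.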

In [None]:
import numpy as np
import gensim
from gensim.models import Word2Vec, KeyedVectors
import gensim.downloader as api  # `from gensim.downloader import api` fails: the module has no `api` name

# load pretrained GloVe vectors
glove = api.load("glove-wiki-gigaword-50")
print(glove.most_similar("cat"))

In [None]:
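# NOTE: api.load returns KeyedVectors, which has no build_vocab/train methods;
# continuing training would need a full Word2Vec model and tokenised tags instead
glove.build_vocab(df_tags, update=True)
glove.train(df_tags, total_examples=glove.corpus_count, epochs=glove.epochs)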