Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: Apache-2.0

# This notebook walks through intermediate results for data processing on Reddit user-behavior data

## Table of contents
1. Download Reddit comments dataset from PushShift.io for May 2008
2. Set rules for anomalous vs benign users along with data processing
3. Generate author/user labels and save to a csv file
4. Generate user and subreddit index files
5. Save edgelist data as csv file
6. Train/validation/test split
7. Get node features using NLP models

In [1]:
import json
import pandas as pd
import os 
import sys

In [2]:
sys.path.append('../../src/')

### 1. Download Reddit dataset and save it in a dataframe 

In [3]:
reddit_raw_data_file_path = '../../data/01_raw/user_behavior/RC_2008-05.zst'

In [4]:
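# The dump is assumed to be decompressed already; splitext drops the ".zst" suffix safely
decompressed_path = os.path.splitext(reddit_raw_data_file_path)[0]
with open(decompressed_path, encoding="utf8") as f:
    df = pd.DataFrame.from_records(map(json.loads, f))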
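
As an alternative, pandas can parse a one-JSON-object-per-line file directly with `pd.read_json(..., lines=True)`. The sketch below is a minimal illustration only: the two records are made up and mimic just a few of the dump's fields, and the name `sample_df` is hypothetical.

```python
import io

import pandas as pd

# Two made-up records in a one-JSON-object-per-line layout (illustrative only)
sample_ndjson = io.StringIO(
    '{"author": "alice", "subreddit": "pics", "score": 12, "body": "hello"}\n'
    '{"author": "bob", "subreddit": "programming", "score": -3, "body": "world"}\n'
)

# lines=True tells pandas that every line is a separate JSON document
sample_df = pd.read_json(sample_ndjson, lines=True)
print(sample_df.shape)
```

Note that `pd.read_json` infers column dtypes on its own, so the resulting dtypes may differ from those produced by `DataFrame.from_records`.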

In [5]:
df.head(10)

Unnamed: 0,link_id,author_flair_css_class,retrieved_on,controversiality,archived,name,edited,subreddit,score,created_utc,...,distinguished,id,author_flair_text,gilded,author,ups,body,parent_id,subreddit_id,downs
0,t3_6hoxb,,1425846806,0,True,t1_c03vgla,False,reddit.com,1,1209600017,...,,c03vgla,,0,AngelaMotorman,1,&gt;I need to print up a pamphlet of facts tha...,t1_c03veph,t5_6,0
1,t3_6holm,,1425846806,0,True,t1_c03vgli,False,pics,1,1209600008,...,,c03vgli,,0,[deleted],1,"its the adrenalin, can you imagine the excitem...",t3_6holm,t5_2qh0u,0
2,t3_6hl0a,,1425846806,0,True,t1_c03vglj,False,pics,1,1209600078,...,,c03vglj,,0,[deleted],1,"Statistically, 51% of you voted for him.\n",t1_c03v6yf,t5_2qh0u,0
3,t3_6hq4l,,1425846806,0,True,t1_c03vglk,False,reddit.com,0,1209600015,...,,c03vglk,,0,[deleted],0,[deleted],t3_6hq4l,t5_6,0
4,t3_6hoyd,,1425846806,0,True,t1_c03vgll,False,worldnews,0,1209600034,...,,c03vgll,,0,[deleted],0,[deleted],t3_6hoyd,t5_2qh13,0
5,t3_6hoyd,,1425846806,0,True,t1_c03vglm,False,worldnews,1,1209600034,...,,c03vglm,,0,BaronVonMannsechs,1,"""Savage"" may only imply inferiority from certa...",t1_c03vfth,t5_2qh13,0
6,t3_6hpzs,,1425846806,0,True,t1_c03vgln,False,programming,2,1209600050,...,,c03vgln,,0,tlack,2,"PHP is still huge. It has a lot of flaws, but ...",t3_6hpzs,t5_2fwo,0
7,t3_6hnvn,,1425846806,0,True,t1_c03vglo,False,business,2,1209600052,...,,c03vglo,,0,mightyarmenian,2,I don't see why joonix is getting downmodded. ...,t1_c03vek3,t5_2qgzg,0
8,t3_6hnim,,1425846806,0,True,t1_c03vglp,False,politics,1,1209600134,...,,c03vglp,,0,[deleted],1,strokes of a pen,t1_c03vcu9,t5_2cneq,0
9,t3_6hq4k,,1425846806,0,True,t1_c03vglq,False,pics,1,1209600068,...,,c03vglq,,0,amstrdamordeath,1,That is obviously a duck.,t3_6hq4k,t5_2qh0u,0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536380 entries, 0 to 536379
Data columns (total 21 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   link_id                 536380 non-null  object
 1   author_flair_css_class  780 non-null     object
 2   retrieved_on            536380 non-null  int64 
 3   controversiality        536380 non-null  int64 
 4   archived                536380 non-null  bool  
 5   name                    536380 non-null  object
 6   edited                  536380 non-null  object
 7   subreddit               536380 non-null  object
 8   score                   536380 non-null  int64 
 9   created_utc             536380 non-null  object
 10  score_hidden            536380 non-null  bool  
 11  distinguished           0 non-null       object
 12  id                      536380 non-null  object
 13  author_flair_text       696 non-null     object
 14  gilded                  536380 non-n

### Observation about the data:
1. There are 536380 rows and 21 columns: each row is a unique comment, described by 21 attributes.
2. The most important attributes are author, subreddit, body, and score. Body is the text of the comment, and score is its net vote count on Reddit (each upvote adds 1 and each downvote subtracts 1). Each record represents one author posting a comment (body) in a subreddit.
3. Each unique author can have multiple comments across more than one subreddit, with a different score for each comment.


### 2. Data processing

#### Data processing steps to get input for the ELAND model. Steps include:
1. Drop records whose absolute score is less than 10
2. Drop users who have fewer than 10 remaining comments
3. Drop the placeholder user [deleted]
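
The three filtering steps can be sketched on a toy DataFrame as follows. This is an illustration under assumptions, not the notebook's exact code: the helper `filter_comments`, its parameter names, and the toy data are hypothetical, and the toy call uses a lower comment threshold so the example stays small.

```python
import pandas as pd

def filter_comments(comments, min_abs_score=10, min_comments=10):
    """Apply the three filtering steps to a comments DataFrame."""
    # 1. keep only comments whose absolute score reaches the threshold
    kept = comments[comments["score"].abs() >= min_abs_score]
    # 2. keep only authors that still have enough comments
    counts = kept["author"].value_counts()
    kept = kept[kept["author"].map(counts) >= min_comments]
    # 3. drop the placeholder author used for deleted accounts
    return kept[kept["author"] != "[deleted]"]

toy = pd.DataFrame({
    "author": ["a", "a", "a", "b", "[deleted]", "[deleted]"],
    "score": [15, -12, 3, 40, 11, 20],
})
# A threshold of 2 comments is used here only because the toy data is tiny
filtered = filter_comments(toy, min_abs_score=10, min_comments=2)
print(filtered)
```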

#### We don't have ground-truth labels for training the model. To generate the user labels needed for the next step, we use a rule that groups users into benign and anomalous users based on their comment score statistics. 
   - Anomalous user: an author who has at least 10 retained comments, at least one of which has a score less than or equal to -10
   - Benign user: an author who has at least 10 retained comments, all of which have a score greater than or equal to 10
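
A minimal sketch of how the cells in this section appear to assign labels: any retained author with at least one comment scoring -10 or lower is marked anomalous (label 1), and the remaining authors are benign (label 0). The helper `label_authors` and the toy data are hypothetical, and the input is assumed to be already filtered.

```python
import pandas as pd

def label_authors(filtered, threshold=10):
    """Return one row per author with label 0 (benign) or 1 (anomalous)."""
    # authors with at least one strongly downvoted comment
    anomalous = set(filtered.loc[filtered["score"] <= -threshold, "author"])
    # authors with strongly upvoted comments and no strongly downvoted ones
    benign = set(filtered.loc[filtered["score"] >= threshold, "author"]) - anomalous
    rows = [(a, 0) for a in sorted(benign)] + [(a, 1) for a in sorted(anomalous)]
    return pd.DataFrame(rows, columns=["author", "label"])

toy = pd.DataFrame({
    "author": ["u1", "u1", "u2", "u2", "u3"],
    "score": [12, 30, 25, -14, -11],
})
labels = label_authors(toy)
print(labels)
```

Here `u2` has both a high-scoring and a low-scoring comment, so it is labeled anomalous rather than benign.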

In [7]:
#Drop records whose absolute score is less than 10
df_score = df.drop(df[abs(df.score) < 10].index)

In [8]:
df.shape, df_score.shape  #most comments have an absolute score below 10

((536380, 21), (43343, 21))

In [9]:
#check lowest score and highest score
df_score.score.min(), df_score.score.max()

(-284, 1522)

In [10]:
df_score['author'].value_counts()

[deleted]          10453
nixonrichard         162
Poromenos            119
otakucode            115
UntakenUsername       96
                   ...  
ventomareiro           1
noirling               1
Spaceman_Spliff        1
cf26                   1
tdieckman              1
Name: author, Length: 8088, dtype: int64

In [11]:
df_score['subreddit'].value_counts()

reddit.com     11097
pics            5383
politics        4718
programming     4422
funny           3656
               ...  
Bacon              1
PHP                1
software           1
lgbt               1
Anarchism          1
Name: subreddit, Length: 61, dtype: int64

In [12]:
#Drop users who have posted fewer than 10 times
counts = df_score['author'].value_counts()
res = df_score[~df_score['author'].isin(counts[counts < 10].index)]

#Drop users that are [deleted]
res = res.drop(res[res.author=='[deleted]'].index)

In [13]:
res['author'].value_counts()

nixonrichard       162
Poromenos          119
otakucode          115
UntakenUsername     96
7oby                85
                  ... 
eusephus            10
derefr              10
andrewnorris        10
FenPhen             10
ajrw                10
Name: author, Length: 787, dtype: int64

In [14]:
#Number of unique users
len(res.author.unique())

787

## Create user labels

In [15]:
benign = pd.DataFrame()
anomaly = pd.DataFrame()

In [16]:
benign = res.copy()  #DataFrame.append was removed in pandas 2.0
print(benign.shape)

(15529, 21)


In [17]:
#remove records with a score less than 10 
benign = benign.drop(benign[benign.score < 10].index)

In [18]:
#check one example of benign author
benign.loc[benign['author'] == 'jonknee'].T

Unnamed: 0,230,14927,54120,113751,183996,238600,238693,299957,338384,338699,339453,353770,411425,412147,426088,426377,428278,431114,515403
link_id,t3_6hpa9,t3_6hta1,t3_6i3qt,t3_6iisd,t3_6j20f,t3_6jg07,t3_6jfve,t3_6jtyv,t3_6k4gz,t3_6k4gz,t3_6k4gz,t3_6k7u9,t3_6kn7n,t3_6kn7n,t3_6kq30,t3_6kqr3,t3_6kqr3,t3_6kr7s,t3_6lf1g
author_flair_css_class,,,,,,,,,,,,,,,,,,,
retrieved_on,1425846807,1425847139,1425847701,1425848538,1425849347,1425850043,1425850044,1425850814,1425851565,1425851567,1425851577,1425851835,1425852667,1425852676,1425852996,1425852999,1425853018,1425853102,1425854332
controversiality,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
archived,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
name,t1_c03vgrw,t1_c03vs4n,t1_c03wmef,t1_c03xwhz,t1_c03zeta,t1_c040l0r,t1_c040l3c,t1_c041whg,t1_c042q73,t1_c042qfu,t1_c042r0s,t1_c04323e,t1_c044any,t1_c044b80,t1_c044lzy,t1_c044m7z,t1_c044nos,t1_c044pvn,t1_c046izm
edited,False,True,False,False,False,False,True,False,True,False,False,False,True,False,False,False,False,True,True
subreddit,programming,reddit.com,reddit.com,reddit.com,entertainment,programming,programming,programming,programming,programming,programming,reddit.com,programming,programming,politics,reddit.com,reddit.com,reddit.com,reddit.com
score,23,11,10,10,15,15,29,11,35,28,11,25,12,17,13,10,12,48,47
created_utc,1209601191,1209672909,1209917033,1210200009,1210599323,1210820045,1210820440,1211125762,1211310090,1211310945,1211312809,1211384268,1211657168,1211660609,1211748243,1211749538,1211758839,1211773274,1212165734


In [19]:
##Anomalous author
anomaly = res.copy()  #DataFrame.append was removed in pandas 2.0

#Remove records with score larger than -10 
anomaly = anomaly.drop(anomaly[anomaly.score > -10].index)

In [20]:
#Example author
anomaly.loc[anomaly['author'] == 'I_AM_A_NEOCON']

Unnamed: 0,link_id,author_flair_css_class,retrieved_on,controversiality,archived,name,edited,subreddit,score,created_utc,...,distinguished,id,author_flair_text,gilded,author,ups,body,parent_id,subreddit_id,downs
1982,t3_6hqjc,,1425846829,0,True,t1_c03vi4l,False,reddit.com,-17,1209609935,...,,c03vi4l,,0,I_AM_A_NEOCON,-17,The proposition that the people are the best k...,t3_6hqjc,t5_6,0
1983,t3_6hqjc,,1425846829,0,True,t1_c03vi4m,False,reddit.com,-15,1209609935,...,,c03vi4m,,0,I_AM_A_NEOCON,-15,The proposition that the people are the best k...,t3_6hqjc,t5_6,0
1984,t3_6hqjc,,1425846829,0,True,t1_c03vi4n,False,reddit.com,-10,1209609935,...,,c03vi4n,,0,I_AM_A_NEOCON,-10,The proposition that the people are the best k...,t3_6hqjc,t5_6,0
33267,t3_6hymo,,1425847375,0,True,t1_c03w6ai,False,reddit.com,-12,1209760815,...,,c03w6ai,,0,I_AM_A_NEOCON,-12,Soldiers die in the middle east but people don...,t1_c03w66v,t5_6,0
33271,t3_6hymo,,1425847375,0,True,t1_c03w6am,False,reddit.com,-13,1209760822,...,,c03w6am,,0,I_AM_A_NEOCON,-13,Soldiers die in the middle east but people don...,t1_c03w66v,t5_6,0
33276,t3_6hymo,,1425847375,0,True,t1_c03w6ar,False,reddit.com,-12,1209760829,...,,c03w6ar,,0,I_AM_A_NEOCON,-12,Soldiers die in the middle east but people don...,t1_c03w66v,t5_6,0
33277,t3_6hymo,,1425847375,0,True,t1_c03w6as,False,reddit.com,-14,1209760829,...,,c03w6as,,0,I_AM_A_NEOCON,-14,Soldiers die in the middle east but people don...,t1_c03w66v,t5_6,0
33278,t3_6hymo,,1425847375,0,True,t1_c03w6at,False,reddit.com,-11,1209760829,...,,c03w6at,,0,I_AM_A_NEOCON,-11,Soldiers die in the middle east but people don...,t1_c03w66v,t5_6,0
33279,t3_6hymo,,1425847375,0,True,t1_c03w6au,False,reddit.com,-12,1209760829,...,,c03w6au,,0,I_AM_A_NEOCON,-12,Soldiers die in the middle east but people don...,t1_c03w66v,t5_6,0
33280,t3_6hymo,,1425847375,0,True,t1_c03w6av,False,reddit.com,-12,1209760829,...,,c03w6av,,0,I_AM_A_NEOCON,-12,Soldiers die in the middle east but people don...,t1_c03w66v,t5_6,0


In [21]:
#Same author can have high score comments and low score comments at the same time 
benign.loc[benign['author'] == 'I_AM_A_NEOCON']

Unnamed: 0,link_id,author_flair_css_class,retrieved_on,controversiality,archived,name,edited,subreddit,score,created_utc,...,distinguished,id,author_flair_text,gilded,author,ups,body,parent_id,subreddit_id,downs
47535,t3_6i2he,,1425847559,0,True,t1_c03whb7,False,pics,13,1209860326,...,,c03whb7,,0,I_AM_A_NEOCON,13,"The person died, in the snow, because of compl...",t1_c03wh59,t5_2qh0u,0
47563,t3_6i2he,,1425847559,0,True,t1_c03whbz,False,pics,85,1209860526,...,,c03whbz,,0,I_AM_A_NEOCON,85,No...he's just...resting. That's all.,t1_c03wh4j,t5_2qh0u,0
90755,t3_6idwb,,1425848153,0,True,t1_c03xeqe,True,pics,10,1210104280,...,,c03xeqe,,0,I_AM_A_NEOCON,10,",'``.._ ,'``.\n ...",t3_6idwb,t5_2qh0u,0
172604,t3_6iz5d,,1425849227,0,True,t1_c03z5zx,True,politics,30,1210532662,...,,c03z5zx,,0,I_AM_A_NEOCON,30,"Uh, hey look a naked chick!\n\n ...",t3_6iz5d,t5_2cneq,0
286327,t3_6jrnz,,1425850649,0,True,t1_c041lxd,True,pics,20,1211038785,...,,c041lxd,,0,I_AM_A_NEOCON,20,The guy behind Hillary is an assassin!,t3_6jrnz,t5_2qh0u,0
302936,t3_6juwd,,1425850850,0,True,t1_c041ysb,True,pics,11,1211140244,...,,c041ysb,,0,I_AM_A_NEOCON,11,Twenty bucks say the rocks are held up by gori...,t3_6juwd,t5_2qh0u,0
304864,t3_6jvfu,,1425850877,0,True,t1_c04209x,False,pics,14,1211151122,...,,c04209x,,0,I_AM_A_NEOCON,14,I'm already on the waiting list for one of tho...,t3_6jvfu,t5_2qh0u,0
305026,t3_6jvf9,,1425850878,0,True,t1_c0420ef,False,reddit.com,14,1211152191,...,,c0420ef,,0,I_AM_A_NEOCON,14,Bonus Army.,t3_6jvf9,t5_6,0
343045,t3_6k5jv,,1425851668,0,True,t1_c042tsm,True,pics,12,1211325658,...,,c042tsm,,0,I_AM_A_NEOCON,12,I'm going out on a limb here and saying the ca...,t3_6k5jv,t5_2qh0u,0
409427,t3_6kn92,,1425852647,0,True,t1_c04494e,True,business,12,1211649492,...,,c04494e,,0,I_AM_A_NEOCON,12,A recession means less people buying his Marg...,t1_c0448xb,t5_2qgzg,0


In [22]:
anomaly_author_names = anomaly.author.unique()
benign_author_names = benign.author.unique()

In [23]:
def common_member(a, b):
    """Return the set of elements common to a and b (empty set if none)."""
    common = set(a) & set(b)
    if not common:
        print("No common elements")
    # always return a set so that len() works on the result
    return common

In [24]:
#Remove authors that overlap in benign and anomalous
overlap_authors = common_member(benign_author_names, anomaly_author_names)
len(overlap_authors)

327

In [25]:
benign = benign[~benign['author'].isin(overlap_authors)]
benign_author_names = benign.author.unique()
print("Number of benign users: ", len(benign.author.unique()))
print("Number of anomalous users: ", len(anomaly.author.unique()))

Number of benign users:  460
Number of anomalous users:  327


### 3. Generate author/user labels and save to a csv file

In [26]:
benign_user_label = pd.DataFrame()
benign_user_label['author'] = benign_author_names
benign_user_label['label'] = 0 #0 as benign user
anomalous_user_label = pd.DataFrame()
anomalous_user_label['author'] = anomaly_author_names
anomalous_user_label['label'] = 1

In [27]:
benign_user_label.shape, anomalous_user_label.shape

((460, 2), (327, 2))

In [28]:
benign_user_label.head(2)

Unnamed: 0,author,label
0,ultimatt42,0
1,jonknee,0


In [29]:
anomalous_user_label.head(2)

Unnamed: 0,author,label
0,I_AM_A_NEOCON,1
1,moogle516,1


In [30]:
user_label = pd.concat([benign_user_label, anomalous_user_label])

In [31]:
# Save user label
user_label_filepath = '../../data/02_intermediate/user_behavior/user_labels.csv'

In [32]:
from anomaly_detection_spatial_temporal_data.utils import ensure_directory

In [33]:
ensure_directory(user_label_filepath)
user_label.to_csv(user_label_filepath, index=False)

### 4. Generate user and subreddit index files

#### Each subreddit topic is given an index and saved as a pickle file. We will be naming the file p2index.pkl
#### Each author is also given an index and saved as a pickle file. We will be naming the file u2index.pkl
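
The index construction and the pickle round trip can be sketched with the standard library alone. The helper `build_index`, the toy subreddit list, and the temporary path are illustrative assumptions.

```python
import os
import pickle
import tempfile

def build_index(names):
    """Map each distinct name to a consecutive integer, in sorted order."""
    return {name: idx for idx, name in enumerate(sorted(set(names)))}

toy_subreddits = ["pics", "programming", "pics", "science"]
toy_p2index = build_index(toy_subreddits)

# Round-trip the mapping through pickle in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "p2index.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_p2index, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

Sorting before enumerating keeps the mapping deterministic across runs, which matters because the feature matrices built later are indexed by these integers.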

In [34]:
benign_prod_names = benign.subreddit.unique()
benign_prod_names = benign_prod_names.tolist()

anomaly_prod_names = anomaly.subreddit.unique()
anomaly_prod_names = anomaly_prod_names.tolist()

In [35]:
total_prod_names = benign_prod_names + anomaly_prod_names
total_prod_names = sorted(list(set(total_prod_names)))

In [36]:
#Map each subreddit name to a consecutive integer index
p2index = {subreddit: idx for idx, subreddit in enumerate(total_prod_names)}

In [37]:
total_author_names = benign_author_names.tolist() + anomaly_author_names.tolist()
total_author_names = sorted(list(set(total_author_names)))

In [38]:
#Map each author name to a consecutive integer index
u2index = {author: idx for idx, author in enumerate(total_author_names)}

### Save the index mapping for author/user and subreddit topic 

In [39]:
import pickle
with open("../../data/02_intermediate/user_behavior/u2index.pkl","wb") as f:
    pickle.dump(u2index, f)

In [40]:
with open("../../data/02_intermediate/user_behavior/p2index.pkl","wb") as f:
    pickle.dump(p2index,f)

### 5. Save edge list as csv file

In [41]:
benign.shape, anomaly.shape

((8057, 21), (886, 21))

In [42]:
edgelist_df = pd.concat([benign, anomaly], ignore_index=True)
edgelist_df = edgelist_df.sort_values(by = 'retrieved_on')
print(edgelist_df.shape)

(8943, 21)


In [43]:
edgelist_df[['author','subreddit','retrieved_on']].head(10)

Unnamed: 0,author,subreddit,retrieved_on
0,ultimatt42,science,1425846806
1,jonknee,programming,1425846807
4,burtonmkz,science,1425846810
5,pavel_lishin,reddit.com,1425846810
6,pavel_lishin,reddit.com,1425846810
7,sblinn,politics,1425846810
2,dons,programming,1425846811
3,Jedravent,politics,1425846811
8,WebZen,politics,1425846811
9,doodahdei,politics,1425846812


In [44]:
edge_list_file_path = "../../data/02_intermediate/user_behavior/edge_list.csv"
edgelist_df[['author','subreddit','retrieved_on']].to_csv(edge_list_file_path, index=False)
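
If a downstream step needs integer node ids instead of names, the saved edge list can be combined with the two index mappings. The sketch below is an assumption about how that might look, not part of this notebook: the column names `u`, `p`, and `t` and the toy data are hypothetical.

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "author": ["bob", "alice", "bob"],
    "subreddit": ["pics", "science", "science"],
    "retrieved_on": [30, 10, 20],
})
toy_u2index = {"alice": 0, "bob": 1}
toy_p2index = {"pics": 0, "science": 1}

# Replace names by their integer indices and order the edges by timestamp
indexed = pd.DataFrame({
    "u": toy_edges["author"].map(toy_u2index),
    "p": toy_edges["subreddit"].map(toy_p2index),
    "t": toy_edges["retrieved_on"],
}).sort_values("t").reset_index(drop=True)
print(indexed)
```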

### 6. Train/validation/test split 

In [45]:
import random

def generate_n_lists(num_of_lists, num_of_elements, value_from=0, value_to=100):
    s = random.sample(range(value_from, value_to + 1), num_of_lists * num_of_elements)
    return [s[i*num_of_elements:(i+1)*num_of_elements] for i in range(num_of_lists)]

l = generate_n_lists(2, 393, 0, 786)

In [46]:
len(l), len(l[0]), len(l[1])

(2, 393, 393)

In [47]:
import numpy as np

In [48]:
data_tvt = (np.array(l[0][:195]), np.array(l[0][195:]), np.array(l[1]))
print(type(data_tvt))
print(len(data_tvt[0]),len(data_tvt[1]), len(data_tvt[2]))

<class 'tuple'>
195 198 393


In [49]:
with open("../../data/02_intermediate/user_behavior/data_tvt.pkl","wb") as f:
    pickle.dump(data_tvt,f)
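
The split above draws indices with an unseeded `random.sample` and hard-coded sizes. Because 2 * 393 = 786 values are drawn from the 787 indices 0 to 786, one user index ends up in no split. A seeded variant that covers every index could look like the sketch below. The helper `split_indices` and its fractions are assumptions chosen to roughly mirror the 195/198/393 proportions.

```python
import numpy as np

def split_indices(num_nodes, train_frac=0.25, val_frac=0.25, seed=0):
    """Shuffle node indices reproducibly and cut them into three parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_nodes)
    n_train = int(num_nodes * train_frac)
    n_val = int(num_nodes * val_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

train_idx, val_idx, test_idx = split_indices(787)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed makes the split repeatable across notebook runs.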

### 7. Get node features using NLP models

- To get node features for each user/author, we preprocess that author's comments, take the 10 most frequent words, look each word up in a pretrained word2vec model, and sum the resulting embeddings into a single 300-dimensional author feature vector.
- To get node features for each subreddit topic, we take the 10 most frequent words across that subreddit's comments and sum their word2vec embeddings in the same way.
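
The aggregation described above can be sketched with a toy embedding table in place of the pretrained model. The helper `top_words_embedding`, the 3-dimensional vectors, and the token list are illustrative assumptions. Words without an embedding contribute a zero vector, as in the cells below.

```python
import collections

import numpy as np

def top_words_embedding(tokens, vectors, dim, top_k=10):
    """Sum the embeddings of the top_k most frequent tokens."""
    counts = collections.Counter(tokens)
    total = np.zeros(dim)
    for word, _ in counts.most_common(top_k):
        # words missing from the embedding table add nothing to the sum
        if word in vectors:
            total += vectors[word]
    return total

# Toy 3-dimensional "embeddings" standing in for the 300-dimensional model
toy_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 2.0, 0.0]),
}
tokens = ["cat", "dog", "cat", "unicorn"]
feature = top_words_embedding(tokens, toy_vectors, dim=3)
print(feature)
```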


#### Steps for comments/posts body processing are:
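1. Convert words to lowercase
2. Remove numbers
3. Remove punctuation and symbols
4. Tokenize the text and remove English stopwords (a `PorterStemmer` is instantiated below but is not applied, so the words are not stemmed or lemmatized)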
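
A dependency-free sketch of these cleaning steps is shown below. It is a simplification under stated assumptions: it splits on whitespace instead of using NLTK's `word_tokenize`, and `TOY_STOPWORDS` is a tiny made-up list standing in for NLTK's English stopwords.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's English list
TOY_STOPWORDS = {"the", "is", "a", "of", "in"}

def clean_comment(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip digits and punctuation, split, and drop stopwords."""
    text = text.replace("\n", " ").replace("\t", " ").lower()
    text = re.sub(r"\d", "", text)        # remove digits
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return [tok for tok in text.split() if tok not in stopwords]

cleaned = clean_comment("PHP is still huge.\nIt has 2 flaws!")
print(cleaned)
```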

In [50]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
import re
import collections
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import gensim.downloader

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Download the pretrained models

In [51]:
vectors = gensim.downloader.load('word2vec-google-news-300')



In [52]:
stopwords = set(nltk.corpus.stopwords.words('english'))
stemmer= PorterStemmer()

### Get the user node features (user2vec) 

In [53]:
type(vectors['hi']),vectors['hi'].shape

(numpy.ndarray, (300,))

In [56]:
final_user2vec_npy = np.zeros((len(u2index), 300))

for u in u2index:
    user = edgelist_df.loc[edgelist_df['author'] == u]
    comment_row_list = []
    for index, rows in user.iterrows():
        my_list = rows.body
        my_list = my_list.replace('\n'," ")
        my_list = my_list.replace('\t'," ")
        my_list = my_list.lower()
        my_list = ''.join([i for i in my_list if not i.isdigit()])
        my_list = re.sub(r'[^\w\s]', ' ', my_list)
        tokens = word_tokenize(my_list)
        my_list = [i for i in tokens if i not in stopwords]
        comment_row_list.append(my_list)
        
    flat_list = [x for xs in comment_row_list for x in xs]
    counter = collections.Counter(flat_list)
    top10 = counter.most_common(10)
    #print(f'top 10 words used by {u} are:', top10)
    final_vectors = np.zeros((10, 300))
    for i, w in enumerate(top10):
        try:
            embedding = vectors[w[0]]
            #embedding = embedding.tolist()
        except KeyError:
            #no pretrained embedding for this word: fall back to a zero vector
            embedding = np.zeros(300)
        final_vectors[i,:]=embedding
    final_embeddings = np.sum(final_vectors, axis=0)    

#     if u2index[u] < 1:
#         print(final_vectors.shape, final_embeddings.shape)
    final_user2vec_npy[u2index[u],:] = final_embeddings

In [57]:
final_user2vec_npy.shape

(787, 300)

In [58]:
# Save the user2vec feature matrix 
userfeat_file = "../../data/02_intermediate/user_behavior/user2vec_npy.npz"
np.savez(userfeat_file,data=final_user2vec_npy)

#### Get the subreddit topic node features (prod2vec)

In [59]:
final_prod2vec_npy = np.zeros((len(p2index), 300))

for p in p2index:
    subreddit = edgelist_df.loc[edgelist_df['subreddit'] == p]
    subreddit_row_list = []
    for index, rows in subreddit.iterrows():
        my_list = rows.body
        my_list = my_list.replace('\n'," ")
        my_list = my_list.replace('\t'," ")
        my_list = my_list.lower()
        my_list = ''.join([i for i in my_list if not i.isdigit()])
        my_list = re.sub(r'[^\w\s]', ' ', my_list)
        tokens = word_tokenize(my_list)
        my_list = [i for i in tokens if i not in stopwords]
        subreddit_row_list.append(my_list)
        
    flat_list = [x for xs in subreddit_row_list for x in xs]
    counter = collections.Counter(flat_list)
    top10 = counter.most_common(10)
    #print(f'top 10 words for subreddit topic {p} are:', top10)

    final_vectors = np.zeros((10, 300))
    for i, w in enumerate(top10):
        try:
            embedding = vectors[w[0]]
            #embedding = embedding.tolist()
        except KeyError:
            #no pretrained embedding for this word: fall back to a zero vector
            embedding = np.zeros(300)
        final_vectors[i,:]=embedding
    final_embeddings = np.sum(final_vectors, axis=0)
    final_prod2vec_npy[p2index[p],:] = final_embeddings

In [60]:
type(final_prod2vec_npy),final_prod2vec_npy.shape

(numpy.ndarray, (47, 300))

In [61]:
# Save the prod2vec feature matrix 
prodfeat_file = "../../data/02_intermediate/user_behavior/prod2vec_npy.npz"
np.savez(prodfeat_file,data=final_prod2vec_npy)
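
To read a saved feature matrix back, `np.load` returns an archive keyed by the keyword used in `np.savez` (here `data`). The sketch below round-trips a small made-up matrix through a temporary file; the file name and array are illustrative.

```python
import os
import tempfile

import numpy as np

toy_features = np.arange(6, dtype=float).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_feat.npz")
    np.savez(path, data=toy_features)
    # the array is stored under the keyword name passed to np.savez
    with np.load(path) as archive:
        loaded = archive["data"]

print(loaded.shape)
```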

# References

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit Dataset.