Below, I train Word2Vec on the text8 corpus and apply the resulting word vectors to the titles and clap counts of the Medium articles. I take a supervised learning approach: the inputs are the titles (each converted to the average of its word vectors) and the outputs are the clap counts. The model is a random forest.

In [4]:
# Load the text8 corpus and train a Word2Vec model on it
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

corpus = api.load('text8')  # download the corpus and return it opened as an iterable
model = Word2Vec(corpus)  # train a model from the corpus (default vector_size is 100)

In [5]:
import pandas as pd
import numpy as np
df = pd.read_csv('medium_data.csv')

I use NLTK's stopword list and the Word2Vec vocabulary to filter out words that are not useful to the model: common stopwords and words the model has no vector for.

In [6]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/benjaminheim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


This function takes a title, keeps the words useful to the model, and returns the average vector for the title. I then add each of the vector's 100 components to the dataframe as its own column.

In [7]:
vocab = set(model.wv.index_to_key)  # a set gives fast membership tests

def vectorize_title(title, model):
    """Return the mean Word2Vec vector of the title's in-vocabulary, non-stopword words."""
    words = [word.lower() for word in title.split()]  # Split title into lowercase words
    words = [word for word in words if word in vocab and word not in stop_words]
    if len(words) == 0:
        return np.zeros(model.vector_size)  # Return zero vector if no words are in the model
    return np.mean([model.wv[word] for word in words], axis=0)

title_vectors = [vectorize_title(title, model) for title in df.title]

# Add one column per vector dimension (100 with the default Word2Vec settings)
frames = [df]
for r in range(model.vector_size):
    frames.append(pd.Series([vec[r] for vec in title_vectors]))
print(len(frames))
df_vec = pd.concat(frames, axis=1)

101
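
To make the averaging step concrete, here is a minimal sketch that uses a tiny hand-made embedding table in place of the trained Word2Vec model. The `toy_vectors` dictionary, the `toy_stop_words` set, and the `average_vector` helper are illustrative assumptions, not part of the notebook; the logic mirrors `vectorize_title`: lowercase the words, drop stopwords and out-of-vocabulary words, then average what remains.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for model.wv
toy_vectors = {
    "python": np.array([1.0, 0.0, 2.0]),
    "data": np.array([3.0, 2.0, 0.0]),
}
toy_stop_words = {"the", "for", "a"}

def average_vector(title, vectors, stop_words, size=3):
    """Average the vectors of the title's known, non-stopword words."""
    words = [w.lower() for w in title.split()]
    kept = [w for w in words if w in vectors and w not in stop_words]
    if not kept:
        return np.zeros(size)  # no usable words: fall back to the zero vector
    return np.mean([vectors[w] for w in kept], axis=0)

# "for" is a stopword and "wizards" has no vector, so only two words remain
avg = average_vector("Python for Data Wizards", toy_vectors, toy_stop_words)
# Nothing survives the filtering here
empty = average_vector("The Wizards", toy_vectors, toy_stop_words)
print(avg)
print(empty)
```

Note that every title with no usable words maps to the same zero vector, so the model cannot tell such titles apart except by their length.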


These columns aren't useful as features for the model, so I remove them.

In [8]:
df_vec = df_vec.drop(columns=['url', 'reading_time', 'subtitle', 'responses', 'publication', 'date'])

In [9]:
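# Add the number of words in each title as an extra feature.
# Naming the series avoids a second column labelled 0 next to the vector columns.
title_length = [len(title.split()) for title in df.title]
frames = [df_vec, pd.Series(title_length, name='title_length')]
df_vec_len = pd.concat(frames, axis=1)
df_vec_len.to_csv('titles_length_vector.csv')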

I use a random forest model to predict claps from the title vector and the title length.

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Everything except the identifiers and the target is a feature
features = df_vec_len.drop(columns=['id', 'title', 'claps'])
target = df_vec_len['claps']

In [11]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=.2, random_state=42)
print(X_train)
# Convert to plain NumPy arrays before fitting
X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

            0         1         2         3         4         5         6   \
163  -0.369952 -1.182686  0.841922 -0.086033 -0.500341 -0.043329 -0.445611   
203   0.079312 -1.627374 -0.220987 -0.463938 -0.169596  0.235753 -0.089013   
841   0.306818 -1.050667 -0.237552  0.319493 -0.347548 -0.270182  0.576232   
1630 -0.114850 -0.251863 -0.054964  0.011068  0.301893  0.339887 -0.206506   
471   0.825590 -1.231775 -0.512317 -0.698986 -0.993363 -0.154073 -0.598387   
...        ...       ...       ...       ...       ...       ...       ...   
1638  0.246301 -0.777275 -0.386258 -0.030485  0.028884  0.391967 -0.257849   
1095  1.319109 -1.911375 -0.496505 -0.525325 -1.119677  0.361343 -0.574371   
1130 -0.040695  0.078110  0.238289  0.000524 -0.448405  0.158006  0.402500   
1294  0.915747 -0.932416 -1.027201 -1.092982 -0.206993  0.303599 -0.171824   
860   0.661275  0.150820 -0.728130 -0.337474 -0.563908 -0.710904  0.303773   

            7         8         9   ...        91        92    
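
The split above holds out 20% of the articles, but the notebook never scores the model on them. One way to check whether the forest has learned anything is to compare its mean absolute error on the held-out set against a baseline that always predicts the mean training clap count. The sketch below is an assumption about how such a check could look; `compare_to_baseline` and the sample numbers are hypothetical and are not results from this dataset.

```python
import numpy as np

def compare_to_baseline(y_true, y_pred, train_mean):
    """Return (model MAE, baseline MAE) for held-out targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    model_mae = np.mean(np.abs(y_true - y_pred))
    # Baseline: predict the same constant (the training mean) for every article
    baseline_mae = np.mean(np.abs(y_true - train_mean))
    return model_mae, baseline_mae

# Made-up numbers for illustration only
model_mae, baseline_mae = compare_to_baseline(
    y_true=[100, 300, 500], y_pred=[150, 250, 500], train_mean=300
)
print(model_mae, baseline_mae)
```

In the notebook this could be called as `compare_to_baseline(y_test, rf.predict(np.asarray(X_test)), y_train.mean())`. A model error that is not clearly below the baseline would suggest the title features carry little signal about claps.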

The function below takes a title and returns the predicted number of claps.

In [12]:
def test_title(title):
    """Predict claps for a title using the same features as in training."""
    title_len = len(title.split())
    vector = vectorize_title(title, model)
    # One row: the vector components followed by the title length
    new_features = np.append(vector, title_len).reshape(1, -1)
    predicted_claps = rf.predict(new_features)
    return predicted_claps

In [13]:
print(test_title("How to Succeed in Business"))
print(test_title("How to Succeed in Business Without Really Trying"))

[510.83]
[333.85]
