# C. Embeddings

In this notebook, I used word embeddings to transform my textual data into numerical representations. For this purpose, I used the data that was cleaned in the notebook 'B. Vectorization' and that can be found in the gold_data folder. Further, I use the same base classifiers and resampling techniques that I used in the previous notebook.

First, I trained a Word2Vec model (both the continuous bag-of-words and the skip-gram implementation) to transform my textual data. Then, I used GloVe embeddings for text representation. However, as I only have a limited dataset to train embedding models, I also used pretrained embedding models. I could then use these pretrained embeddings to transform my train and test set and train my models.

# 0. Data loading

In [1]:
# General Packages #
import os
import pandas as pd
import numpy as np
import string
import re
from scipy.stats import randint
import random
from collections import Counter

# Sklearn Packages #
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, StratifiedKFold, cross_val_predict, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, make_scorer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# NLTK Packages #
import nltk
from nltk.corpus import stopwords
from textblob import TextBlob, Word
from nltk.tokenize import word_tokenize

# Import necessary libraries for handling imbalanced data
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Embedding related imports
import sys
import gensim
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.phrases import Phraser, Phrases
from gensim.models import KeyedVectors
import gensim.downloader
from gensim.models.doc2vec import Doc2Vec, TaggedDocument



In [2]:
# Change to Working Directory with Training Data # 
os.chdir("/Users/Artur/Desktop/thesis_HIR_versie5/coding")
#os.chdir("/Users/juarel/Desktop/studies artur/thesis_HIR/coding")

# Load the preprocessed data #
df_train = pd.read_csv("./data/gold_data/train.csv", header = 0)
df_test = pd.read_csv("./data/gold_data/test.csv", header = 0)

# inspect the data
df_train.head(5)

Unnamed: 0,id,Headline,category,cleaned_headline
0,194578,Head Line: US Patent granted to BASF SE (Delaw...,,head u patent granted se delaware may titled c...
1,564295,Societe Generale Launches a Next-Generation Ca...,,societe generale launch nextgeneration card in...
2,504138,BARCLAYS PLC Form 8.3 - EUTELSAT COMMUNICATION...,,plc form communication
3,91379,ASML: 4Q Earnings Snapshot,,4q earnings snapshot
4,265750,Form 8.3 - AXA INVESTMENT MANAGERS : Booker Gr...,,form investment manager group plc


# 1. Define functions and parameters

Before we continue, we first define some useful functions and parameters that we use throughout this notebook. The first four functions and parameters were also used and defined in the previous notebook.

1. get_classification_metrics: Create a function that returns the classification metrics for each model. Precision, recall and F1 score are macro-averaged: each is computed per class and then averaged with equal weight for every class.

2. Define a dataframe to store the results of the different models. Moreover, also define a dictionary that stores the best parameters for each model.

3. Define the number of splits, the stratified cross validator to ensure class frequencies are considered, and the scoring metric based on the macro-averaged F1 score. We use the F1 score as scoring metric because accuracy is not a good evaluation metric in our case.

4. Define a function that trains and evaluates a model. It takes the model name, the classifier, its parameter grid and the input data. Besides, it also takes 4 parameters as input that give more information about the model that is being trained. This is useful for storing the performance of the different algorithms.



New functions specific to the embedding notebook:

5. Create a function to create embeddings based on a trained model.

6. Create a function to create embeddings based on a pretrained model.
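
To illustrate why the macro-averaged F1 score is used instead of accuracy, the sketch below computes both by hand on a small imbalanced toy example. The labels (including the class name 'Merger') and the helper `macro_f1` are hypothetical; the helper only mirrors what `f1_score(..., average='macro')` is expected to return and is not part of the notebook.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        # Count true positives, false positives and false negatives for this class
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Toy data: a classifier that always predicts the majority class
y_true_toy = ['None'] * 9 + ['Merger']
y_pred_toy = ['None'] * 10

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
macro_f1_toy = macro_f1(y_true_toy, y_pred_toy)
print(accuracy_toy, round(macro_f1_toy, 3))
```

In this toy case the accuracy is 0.9 although the minority class is never predicted, while the macro F1 drops to roughly 0.47 because the minority class contributes an F1 of 0.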

In [3]:
# 1. Function that returns classification metrics
def get_classification_metrics(y_true, y_pred):
    
    # Calculate Model Performance Metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='macro')
    recall = recall_score(y_true, y_pred, average='macro')
    f1 = f1_score(y_true, y_pred, average='macro')


    return accuracy, precision, recall, f1


In [4]:
# 2. Create an empty dataframe to store the results of all the models
results_all_df = pd.DataFrame()

# Add columns for the metrics
columns = ['vectorizer', 'FS', 'classifier', 'resampling','accuracy', 'precision', 'recall', 'f1']
for col in columns:
    results_all_df[col] = 0

# create an empty dictionary to store the optimal parameters
best_params_dict = {}

In [5]:
# 3. Define different parameters
# Define the number of folds for cross-validation
n_splits = 5

# Initialize the stratified k-fold object
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42) # ensures class balances are kept

# Define the scoring metric
scoring = make_scorer(f1_score, average= 'macro')

In [6]:
# 4. Define a function to train and evaluate the different models
def perform_grid_search(name, model, param_grid, X_train, X_test, y_train, y_test,
                       vectorizer, FS, classifier, resampling):
    
    # Define a seed value
    random.seed(7)
        
    # Perform the grid search using cross-validation
    grid_search = GridSearchCV(model, param_grid, cv=skf, scoring=scoring)
    grid_search.fit(X_train, y_train)

    # Get the best model and its hyperparameters
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_

    # Store the best parameters for the current category in the dictionary
    best_params_dict[name] = best_params
    print(f'best parameters: {best_params}')

    # Retrain the best model with the whole training set
    best_model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = best_model.predict(X_test)
    
    # Calculate the class probabilities (skipped for SVM: SVC only offers predict_proba when created with probability=True)
    if classifier != 'SVM':
        y_pred_proba = best_model.predict_proba(X_test)
        
        # Find the highest probability for each observation
        highest_prob = np.amax(y_pred_proba, axis = 1)
    
        # Create a DataFrame with test observations, highest probabilities, and predicted classes
        predictions_df = pd.DataFrame({'Observation_nr': y_test.index, 'Probability': highest_prob, 'Prediction': y_pred})
        
    else:
        # Create a DataFrame with test observations and predicted classes
        predictions_df = pd.DataFrame({'Observation_nr': y_test.index, 'Prediction': y_pred})
        
    # Store the final predictions with its probability for the test set
    predictions_df.to_csv(f'./Output/predictions/{name}.csv', index = False, header = True)
    #predictions_df.to_excel(f'./Output/predictions/{name}.xlsx', index = False, header = True)

    # Calculate the classification metrics
    accuracy, precision, recall, f1 = get_classification_metrics(y_test, y_pred)
    
    # print the results
    print(f'Results for {name}:')
    print(f'Accuracy: {accuracy}')
    print(f'Precision: {precision}')
    print(f'Recall: {recall}')
    print(f'F1: {f1}')
    
    # add the results to the dataframe with all the results
    results_all_df.loc[name] = [vectorizer, FS, classifier, resampling, accuracy, precision, recall, f1]

In [7]:
# 5. Define a function to transform the data into embeddings based on a trained model
def vectorize_embedding(headline, model, size):
    
    # Split the headline into individual words
    words = headline.split()

    # Retrieve word vectors for each word in the headline
    words_vecs = [model.wv[word] for word in words if word in model.wv]

    # If no word vectors are found (i.e., no matching words in the pretrained model),
    # return a zero vector of the defined dimension
    if len(words_vecs) == 0:
        return np.zeros(size)

    # Convert the list of word vectors into a numpy array and average across all words
    words_vecs = np.array(words_vecs)
    mean_vector = words_vecs.mean(axis=0)

    return mean_vector

In [8]:
# 6. Define a function to transform the data into embeddings based on a pretrained model
def vectorize_pretrained(headline, pre_trained_model, size):

    # Split the headline into individual words
    words = headline.split()

    # Retrieve word embedding for each word in the headline
    words_vecs = [pre_trained_model[word] for word in words if word in pre_trained_model]

    # If no word vectors are found (i.e., no matching words in the pretrained model),
    # return a zero vector of the specified dimension
    if len(words_vecs) == 0:
        return np.zeros(size)

    # Convert the list of word vectors into a numpy array and average across word vectors
    words_vecs = np.array(words_vecs)
    mean_vector = words_vecs.mean(axis=0)

    return mean_vector
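
The averaging step used by both functions can be sketched with a toy lookup table. The dictionary `toy_lookup` and the helper `mean_pool` are hypothetical stand-ins for a real embedding model; they only show how a headline is reduced to one vector and when the zero-vector fallback is triggered.

```python
import numpy as np

def mean_pool(headline, lookup, size):
    """Average the vectors of all known words; return a zero vector if none is known."""
    vectors = [lookup[word] for word in headline.split() if word in lookup]
    if not vectors:
        return np.zeros(size)
    return np.mean(vectors, axis=0)

# Hypothetical 2-dimensional embeddings for two words
toy_lookup = {
    'earnings': np.array([1.0, 0.0]),
    'snapshot': np.array([0.0, 1.0]),
}

# '4q' is unknown and is skipped; the two known words are averaged
pooled = mean_pool('4q earnings snapshot', toy_lookup, size=2)

# No word is known, so the fallback zero vector is returned
fallback = mean_pool('plc form communication', toy_lookup, size=2)
print(pooled, fallback)
```

Note that headlines without any known word all map to the same zero vector, so the classifiers cannot distinguish between them.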

In [9]:
# define the independent and dependent variables
X_train = df_train['cleaned_headline']
X_test = df_test['cleaned_headline']

y_train = df_train['category']
y_test = df_test['category']

# 2. Embeddings

In this notebook, I used the same base classifiers and resampling strategies as in the previous notebook 'B. Vectorization'. I initialize each model the first time it is used (section 2.1). Further, I again tune the hyperparameters of each model through cross-validation with GridSearchCV, using the same parameter values as in the previous notebook.

## 2.1 Word2Vec

Word2Vec is the first embedding model that I used to transform my textual data into numerical vectors. It is a neural network-based technique developed at Google to learn vector embeddings. It has two different training architectures, both of which are implemented in this notebook:

1. CBOW = Continuous bag of words. This model predicts a word from its context. Here, the context is given and the algorithm searches for the word that is likely to appear.
2. Skipgram = This model predicts the context from a word. In other words, which words are likely to appear near a given word.
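
The difference between the two training objectives can be sketched by listing the examples each of them would see for one headline. The helpers below are hypothetical and strongly simplified (no subsampling, no negative sampling, fixed window), so they illustrate the idea rather than gensim's actual implementation.

```python
def context_of(tokens, i, window):
    """Words within `window` positions to the left and right of position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def cbow_examples(tokens, window):
    """CBOW: the context words are the input, the centre word is the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = context_of(tokens, i, window)
        if context:
            examples.append((context, target))
    return examples

def skipgram_examples(tokens, window):
    """Skip-gram: the centre word is the input, each context word is a target."""
    examples = []
    for i, centre in enumerate(tokens):
        examples.extend((centre, word) for word in context_of(tokens, i, window))
    return examples

tokens_toy = ['form', 'investment', 'manager', 'group']
cbow_toy = cbow_examples(tokens_toy, window=1)
skip_toy = skipgram_examples(tokens_toy, window=1)
print(cbow_toy[1])   # (['form', 'manager'], 'investment')
print(skip_toy[:2])  # [('form', 'investment'), ('investment', 'form')]
```

CBOW produces one example per word, whereas skip-gram produces one example per (word, context word) pair.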

In [10]:
# Define with what vectorizer we build the models with for storage
vectorizer = 'Word2Vec'

## 2.1 Create word embeddings with CBOW

In [11]:
# Define the implementation method of word2vec
FS = 'cbow'

In [13]:
# Create a list of separate words for each headline as input for the model
headlines_train = [headline.split() for headline in X_train]

In [19]:
# Define the dimension of the vector
size = 300

# define the cbow_model and train on training set headlines
cbow_model = Word2Vec(headlines_train, 
                 min_count = 2,          # Ignore words that appear less than this
                 vector_size = size,      # Dimensionality of word embeddings
                 workers = 8,            # Number of processors (parallelisation)
                 window = 5,             # Context window for words during training
                 epochs = 50,            # Number of epochs training over corpus
                 sg = 0)                 # 0 for CBOW and 1 for skipgram

In [20]:
# Transform the independent variable of the train and the test set
X_train_cbow = np.array([vectorize_embedding(headline, cbow_model, size) for headline in X_train])
X_test_cbow = np.array([vectorize_embedding(headline, cbow_model, size) for headline in X_test])
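
Because the model is trained with `min_count = 2`, words that occur only once are dropped from the vocabulary, and a headline that contains only such words is mapped to the zero vector. The sketch below estimates how often that happens on a toy corpus. The helpers `build_vocab` and `share_without_known_words` and the toy headlines are hypothetical and assume that the vocabulary is determined purely by the frequency threshold.

```python
from collections import Counter

def build_vocab(tokenized_headlines, min_count):
    """Keep every token that occurs at least `min_count` times in the corpus."""
    counts = Counter(token for headline in tokenized_headlines for token in headline)
    return {token for token, count in counts.items() if count >= min_count}

def share_without_known_words(tokenized_headlines, vocab):
    """Fraction of headlines in which no token is part of the vocabulary."""
    empty = sum(
        not any(token in vocab for token in headline)
        for headline in tokenized_headlines
    )
    return empty / len(tokenized_headlines)

headlines_toy = [
    ['form', 'investment', 'manager'],
    ['form', 'communication'],
    ['earnings', 'snapshot'],
    ['investment', 'group'],
]
vocab_toy = build_vocab(headlines_toy, min_count=2)
share_toy = share_without_known_words(headlines_toy, vocab_toy)
print(sorted(vocab_toy), share_toy)
```

Running the same check on `headlines_train` and on the tokenized test headlines would show how many observations the classifiers receive without any usable signal.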

### 2.1.1 Without resampling

In [21]:
# Define the resampling technique
resampling = 'None'

#### A. Logistic regression

In [22]:
# define the model characteristics
model_name = 'cbow_log_w'
classifier = 'logR'

# Initialize the classifier
logreg = LogisticRegression(random_state = 7)

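# Define the parameter grid
# Note: the default 'lbfgs' solver only supports 'l2' or no penalty; 'l1' needs solver='liblinear' or 'saga'
param_grid_log = {
    'penalty': [None, 'l1', 'l2'],  # no penalty, lasso or ridge (older scikit-learn versions use the string 'none')
    'C': [0.1, 1, 10]               # The inverse penalization term (smaller is higher penalization)
}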

In [23]:
# perform a grid search for the logistic regression model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, logreg, param_grid_log, X_train_cbow, X_test_cbow, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

best parameters: {'C': 10, 'penalty': 'l2'}



Results for cbow_log_w:
Accuracy: 0.9148936170212766
Precision: 0.490976349981819
Recall: 0.341393754696866
F1: 0.39620140524709746
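
The solver warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of both, assuming synthetic data from `make_classification` as a stand-in for the embedded headlines, could look as follows. The names `X_toy`, `y_toy`, `scaled_logreg`, `param_grid_toy` and `grid_toy` are hypothetical, and the value `max_iter=1000` is an arbitrary choice rather than a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embedded headlines (assumption, not the thesis data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=7
)

# Scale the features, then fit a logistic regression with a higher iteration limit
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=7),
)

# Pipeline parameters are addressed as '<step name>__<parameter>'
param_grid_toy = {'logisticregression__C': [0.1, 1, 10]}

grid_toy = GridSearchCV(scaled_logreg, param_grid_toy, cv=3, scoring='f1_macro')
grid_toy.fit(X_toy, y_toy)
print(grid_toy.best_params_)
```

Placing the scaler inside the pipeline means it is refitted on the training part of every fold, so no information from the validation fold leaks into the scaling.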


#### B. Decision Tree

In [103]:
# define the model characteristics
model_name = 'cbow_DT_w'
classifier = 'DT'

# Initialize the classifier
tree = DecisionTreeClassifier(random_state = 7)

# Define the parameter grid
param_grid_DT = {
    'criterion': ['gini'],          # Define the splitting criteria: Gini index for node impurity
    'min_samples_leaf': [1, 2],     # Define the minimum number of samples required to be at leaf node
    'max_features': [None, 'sqrt']  # Define the number of features to consider when looking for the best split
}

In [104]:
# perform a grid search for the decision tree model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, tree, param_grid_DT, X_train_cbow, X_test_cbow, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

best parameters: {'criterion': 'gini', 'max_features': None, 'min_samples_leaf': 1}
Results for cbow_DT_w:
Accuracy: 0.8632328463103385
Precision: 0.22000515730664244
Recall: 0.2291982476151896
F1: 0.22370150639969272


#### C. Support Vector Machine

In [105]:
# define the model characteristics
model_name = 'cbow_svm_w'
classifier = 'SVM'

# Initialize the classifier
svm = SVC(random_state = 7)

# Define the parameter grid
param_grid_svm = {
    'C': [0.1, 1, 10, 100], # inverse regularization parameter
    'kernel': ['linear', 'poly', 'rbf'], # what type of kernel need to be used (rbf = radial kernel)
}

In [106]:
# perform a grid search for the support vector machine model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, svm, param_grid_svm, X_train_cbow, X_test_cbow, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

best parameters: {'C': 100, 'kernel': 'rbf'}
Results for cbow_svm_w:
Accuracy: 0.9107638246717218
Precision: 0.4642520570714127
Recall: 0.34885371463321097
F1: 0.3932336178387882


#### D. Random Forest Classifier

In [107]:
# define the model characteristics
model_name = 'cbow_rf_w'
classifier = 'RF'

# Initialize the classifier
rfc = RandomForestClassifier(random_state = 7, n_jobs = -1)

# Define the parameter grid
param_grid_rf = {
    'criterion': ['gini'],          # Define the splitting criteria: Gini index for node impurity
    'n_estimators': [100, 500],     # the number of trees to use when building the model
    'min_samples_leaf': [1, 2],     # Define the minimum number of samples required to be at leaf node
    'max_features': [None, 'sqrt']  # Define the number of features to consider when looking for the best split
}

In [108]:
# perform a grid search for the random forest model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, rfc, param_grid_rf, X_train_cbow, X_test_cbow, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

KeyboardInterrupt: 

#### E. Adaboost classifier

In [None]:
# define the model characteristics
model_name = 'cbow_ada_w'
classifier = 'ADA'

# Initialize decision tree base estimator for AdaBoost
base_estimator = DecisionTreeClassifier(random_state = 7)

# Initialize AdaBoost classifier (scikit-learn >= 1.2 names this argument 'estimator'; older versions use 'base_estimator')
ada = AdaBoostClassifier(estimator = base_estimator, random_state = 7)

# Define parameter grid for AdaBoost
param_grid_ada = {
    'n_estimators': [50, 100, 250],   # the maximum number of estimators before boosting is terminated
    'learning_rate': [0.1, 0.5, 1.0], # weight applied to each classifier at boosting iteration
                                      # A higher learning rate increases the contribution of each classifier. 
}

In [None]:
# perform a grid search for the AdaBoost model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, ada, param_grid_ada, X_train_cbow, X_test_cbow, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

### 2.1.2 With undersampling

In [None]:
# Define the resampling technique
resampling = 'Und'

In [None]:
# Define the categories and the maximum number of samples
categories = df_train["category"].unique()

#### First strategy

#Define the minimum number of samples for a class
min_samples = min(y_train.value_counts())

#Define the maximum number of samples per class to keep the ratio 1 to 4 for each class
max_samples = min_samples*4

#Create a dictionary to store the actual maximum imbalance per class
max_imbalance = {}

#Calculate the actual maximum imbalance for each class
for category in categories:
    
    # Check if the number of samples for the category is lower than the desired maximum
    if (y_train.value_counts()[category]) < max_samples:
        
        # Set the actual maximum to the number of available samples
        max_imbalance[category] = y_train.value_counts()[category]
        
    else:
        # Set the actual maximum to the desired maximum
        max_imbalance[category] = max_samples

#### Second strategy

In [None]:
# Calculate the number of samples in the biggest minority category (second largest class, selected by position)
rus_n = df_train['category'].value_counts().sort_values(ascending=False).iloc[1]

# Dictionary to store the actual maximum imbalance per class for undersampling
max_imbalance_u = {}

# Calculate the actual maximum imbalance for each class
for category in categories:
    if category == 'None':
        max_imbalance_u[category] = rus_n
    else:
        # Set the actual maximum to the number of available samples
        max_imbalance_u[category] = y_train.value_counts()[category]

In [None]:
# Create the random undersampler with maximum imbalance
undersampler = RandomUnderSampler(sampling_strategy = max_imbalance_u, random_state = 7)

# Undersample the data
X_train_cbow_und, y_train_cbow_und = undersampler.fit_resample(X_train_cbow, y_train)
#y_train_cbow_und.value_counts()
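
The second undersampling strategy can be sketched on toy class counts: only the majority class is reduced, down to the size of the largest minority class, while all other classes keep their samples. The helper `undersampling_targets` and the toy labels 'A' and 'B' are hypothetical; the resulting dictionary has the same shape as the one passed to `sampling_strategy`.

```python
from collections import Counter

def undersampling_targets(labels, majority_label):
    """Target number of samples per class after undersampling the majority class."""
    counts = Counter(labels)
    # Size of the largest class other than the majority class
    largest_minority = max(n for label, n in counts.items() if label != majority_label)
    targets = dict(counts)
    # Only the majority class is reduced; the other classes keep all samples
    targets[majority_label] = min(counts[majority_label], largest_minority)
    return targets

toy_labels = ['None'] * 100 + ['A'] * 20 + ['B'] * 5
targets_u = undersampling_targets(toy_labels, 'None')
print(targets_u)  # {'None': 20, 'A': 20, 'B': 5}
```

The remaining imbalance between the minority classes themselves (here 20 versus 5) is left untouched by this strategy.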

#### A. Logistic Regression

In [None]:
# define the model characteristics
model_name = 'cbow_log_u'
classifier = 'logR'

In [None]:
# perform a grid search for the logistic regression model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, logreg, param_grid_log, X_train_cbow_und, X_test_cbow, y_train_cbow_und, y_test,
                   vectorizer, FS, classifier, resampling)

#### B. Decision Tree

In [None]:
# define the model characteristics
model_name = 'cbow_DT_u'
classifier = 'DT'

In [None]:
# perform a grid search for the decision tree model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, tree, param_grid_DT, X_train_cbow_und, X_test_cbow, y_train_cbow_und, y_test,
                   vectorizer, FS, classifier, resampling)

#### C. Support Vector Machine

In [None]:
# define the model characteristics
model_name = 'cbow_svm_u'
classifier = 'SVM'

In [None]:
# perform a grid search for the support vector machine model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, svm, param_grid_svm, X_train_cbow_und, X_test_cbow, y_train_cbow_und, y_test,
                   vectorizer, FS, classifier, resampling)

#### D. Random Forest Classifier

In [None]:
# define the model characteristics
model_name = 'cbow_rf_u'
classifier = 'RF'

In [None]:
# perform a grid search for the random forest model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, rfc, param_grid_rf, X_train_cbow_und, X_test_cbow, y_train_cbow_und, y_test,
                   vectorizer, FS, classifier, resampling)

#### E. Adaboost classifier

In [None]:
# define the model characteristics
model_name = 'cbow_ada_u'
classifier = 'ADA'

In [None]:
# perform a grid search for the AdaBoost model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, ada, param_grid_ada, X_train_cbow_und, X_test_cbow, y_train_cbow_und, y_test,
                   vectorizer, FS, classifier, resampling)

### 2.1.3 With oversampling

In [None]:
# Define the resampling technique
resampling = 'Ove'

In [None]:
# Calculate the number of samples in the majority class (largest class, selected by position)
ove_n = df_train['category'].value_counts().sort_values(ascending=False).iloc[0]

# Oversample until the number of observations equals a fourth of the majority class
max_samples = int(ove_n/4)

# Dictionary to store the actual maximum imbalance per class for oversampling
max_imbalance_o = {}

# Calculate the actual maximum imbalance for each class
for category in categories:
    if category == 'None':
        max_imbalance_o[category] = y_train.value_counts()[category]
    else:
        # Oversample the minority class up to the desired number of samples
        max_imbalance_o[category] = max_samples

In [None]:
# Create the SMOTE oversampler
oversampler = SMOTE(sampling_strategy=max_imbalance_o, random_state=7)

# Oversample the data
X_train_cbow_ove, y_train_cbow_ove = oversampler.fit_resample(X_train_cbow, y_train)
y_train_cbow_ove.value_counts()
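
The oversampling targets can be sketched in the same way on toy class counts. The helper `oversampling_targets` and the toy labels are hypothetical. As far as I am aware, imbalanced-learn's over-samplers reject a target that is smaller than the current size of a class, so the sketch keeps the larger of the two values; this guard is an assumption and is not part of the cell above.

```python
from collections import Counter

def oversampling_targets(labels, majority_label, fraction=0.25):
    """Target number of samples per class after oversampling the minority classes."""
    counts = Counter(labels)
    # Desired minority size: a fixed fraction of the majority class
    desired = int(counts[majority_label] * fraction)
    targets = {}
    for label, n in counts.items():
        if label == majority_label:
            # The majority class is left unchanged
            targets[label] = n
        else:
            # Never ask for fewer samples than the class already has (assumed guard)
            targets[label] = max(n, desired)
    return targets

toy_labels = ['None'] * 100 + ['A'] * 30 + ['B'] * 5
targets_o = oversampling_targets(toy_labels, 'None')
print(targets_o)  # {'None': 100, 'A': 30, 'B': 25}
```

Class 'A' already exceeds a fourth of the majority class and is therefore left at its original size, while class 'B' is raised to 25 samples.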

#### A. Logistic Regression

In [None]:
# define the model characteristics
model_name = 'cbow_log_o'
classifier = 'logR'

In [None]:
# perform a grid search for the logistic regression model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, logreg, param_grid_log, X_train_cbow_ove, X_test_cbow, y_train_cbow_ove, y_test,
                   vectorizer, FS, classifier, resampling)

#### B. Decision Tree

In [None]:
# define the model characteristics
model_name = 'cbow_DT_o'
classifier = 'DT'

In [None]:
# perform a grid search for the decision tree model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, tree, param_grid_DT, X_train_cbow_ove, X_test_cbow, y_train_cbow_ove, y_test,
                   vectorizer, FS, classifier, resampling)

#### C. Support Vector Machine

In [None]:
# define the model characteristics
model_name = 'cbow_svm_o'
classifier = 'SVM'

In [None]:
# perform a grid search for the support vector machine model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, svm, param_grid_svm, X_train_cbow_ove, X_test_cbow, y_train_cbow_ove, y_test,
                   vectorizer, FS, classifier, resampling)

#### D. Random Forest Classifier

In [None]:
# define the model characteristics
model_name = 'cbow_rf_o'
classifier = 'RF'

In [None]:
# perform a grid search for the random forest model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, rfc, param_grid_rf, X_train_cbow_ove, X_test_cbow, y_train_cbow_ove, y_test,
                   vectorizer, FS, classifier, resampling)

#### E. Adaboost classifier

In [None]:
# define the model characteristics
model_name = 'cbow_ada_o'
classifier = 'ADA'

In [None]:
# perform a grid search for the AdaBoost model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, ada, param_grid_ada, X_train_cbow_ove, X_test_cbow, y_train_cbow_ove, y_test,
                   vectorizer, FS, classifier, resampling)

## 2.2 Create word embeddings with Skipgram

In [None]:
# Define the implementation method of word2vec
FS = 'skip'

In [None]:
# Define the dimension of the vector
size = 300

# define the skip_model and train on training set headlines
skip_model = Word2Vec(headlines_train, 
                 min_count = 2,          # Ignore words that appear less than this
                 vector_size = size,      # Dimensionality of word embeddings
                 workers = 8,            # Number of processors (parallelisation)
                 window = 5,             # Context window for words during training
                 epochs = 20,            # Number of epochs training over corpus
                 sg = 1)                 # 0 for CBOW and 1 for skipgram

In [None]:
# Transform the independent variable of the train and the test set
X_train_skip = np.array([vectorize_embedding(headline, skip_model, size) for headline in X_train])
X_test_skip = np.array([vectorize_embedding(headline, skip_model, size) for headline in X_test])

### 2.2.1 Without resampling

In [None]:
# Define the resampling technique
resampling = 'None'

#### A. Logistic regression

In [None]:
# define the model characteristics
model_name = 'skip_log_w'
classifier = 'logR'

In [None]:
# perform a grid search for the logistic regression model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, logreg, param_grid_log, X_train_skip, X_test_skip, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

#### B. Decision Tree

In [None]:
# define the model characteristics
model_name = 'skip_DT_w'
classifier = 'DT'

In [None]:
# perform a grid search for the decision tree model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, tree, param_grid_DT, X_train_skip, X_test_skip, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

#### C. Support Vector Machine

In [None]:
# define the model characteristics
model_name = 'skip_svm_w'
classifier = 'SVM'

In [None]:
# perform a grid search for the support vector machine model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, svm, param_grid_svm, X_train_skip, X_test_skip, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

#### D. Random Forest Classifier

In [None]:
# define the model characteristics
model_name = 'skip_rf_w'
classifier = 'RF'

In [None]:
# perform a grid search for the random forest model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, rfc, param_grid_rf, X_train_skip, X_test_skip, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

#### E. Adaboost classifier

In [None]:
# define the model characteristics
model_name = 'skip_ada_w'
classifier = 'ADA'

In [None]:
# perform a grid search for the AdaBoost model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, ada, param_grid_ada, X_train_skip, X_test_skip, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

### 2.2.2 With undersampling

In [None]:
# Define the resampling technique
resampling = 'Und'

#### Second strategy

In [None]:
# Create the random undersampler with maximum imbalance
undersampler = RandomUnderSampler(sampling_strategy = max_imbalance_u, random_state = 7)

# Undersample the data
X_train_skip_und, y_train_skip_und = undersampler.fit_resample(X_train_skip, y_train)
#y_train_skip_und.value_counts()

#### A. Logistic Regression

In [None]:
# define the model characteristics
model_name = 'skip_log_u'
classifier = 'logR'

In [None]:
# perform a grid search for the logistic regression model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, logreg, param_grid_log, X_train_skip_und, X_test_skip, y_train_skip_und, y_test,
                   vectorizer, FS, classifier, resampling)

#### B. Decision Tree

In [None]:
# define the model characteristics
model_name = 'skip_DT_u'
classifier = 'DT'

In [None]:
# perform a grid search for the decision tree model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, tree, param_grid_DT, X_train_skip_und, X_test_skip, y_train_skip_und, y_test,
                   vectorizer, FS, classifier, resampling)

#### C. Support Vector Machine

In [None]:
# define the model characteristics
model_name = 'skip_svm_u'
classifier = 'SVM'

In [None]:
# perform a grid search for the support vector machine model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, svm, param_grid_svm, X_train_skip_und, X_test_skip, y_train_skip_und, y_test,
                   vectorizer, FS, classifier, resampling)

#### D. Random Forest Classifier

In [None]:
# define the model characteristics
model_name = 'skip_rf_u'
classifier = 'RF'

In [None]:
# perform a grid search for the random forest model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, rfc, param_grid_rf, X_train_skip_und, X_test_skip, y_train_skip_und, y_test,
                   vectorizer, FS, classifier, resampling)

#### E. Adaboost classifier

In [None]:
# define the model characteristics
model_name = 'skip_ada_u'
classifier = 'ADA'

In [None]:
# perform a grid search for the AdaBoost model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, ada, param_grid_ada, X_train_skip_und, X_test_skip, y_train_skip_und, y_test,
                   vectorizer, FS, classifier, resampling)

### 2.2.3 With oversampling

In [None]:
# Define the resampling technique
resampling = 'Ove'

In [None]:
# Create the SMOTE oversampler
oversampler = SMOTE(sampling_strategy=max_imbalance_o, random_state=7)

# Oversample the training data
X_train_skip_ove, y_train_skip_ove = oversampler.fit_resample(X_train_skip, y_train)
#y_train_skip_ove.value_counts()
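
To make the oversampling step more concrete, the following is a minimal NumPy sketch of the idea behind SMOTE: new minority samples are created by interpolating between a minority sample and one of its nearest minority neighbours. It is not imblearn's implementation. The function `smote_sketch`, the toy data and the value `ratio = 0.5` (standing in for `max_imbalance_o`) are assumptions for illustration only. With a float `sampling_strategy`, imblearn aims for that ratio of minority to majority samples after resampling.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=7):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among the minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the synthetic point somewhere on the line segment between the two
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy data: 4 minority samples, 20 majority samples (hypothetical numbers)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
n_majority = 20
ratio = 0.5  # hypothetical stand-in for max_imbalance_o
n_new = int(ratio * n_majority) - len(X_min)  # 10 - 4 = 6 synthetic samples
synthetic = smote_sketch(X_min, n_new)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic point lies between two existing minority points, oversampling in embedding space produces vectors that do not correspond to any real headline.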

#### A. Logistic Regression

In [None]:
# define the model characteristics
model_name = 'skip_log_o'
classifier = 'logR'

In [None]:
# perform a grid search for the logistic regression model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, logreg, param_grid_log, X_train_skip_ove, X_test_skip, y_train_skip_ove, y_test,
                   vectorizer, FS, classifier, resampling)

#### B. Decision Tree

In [None]:
# define the model characteristics
model_name = 'skip_DT_o'
classifier = 'DT'

In [None]:
# perform a grid search for the decision tree model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, tree, param_grid_DT, X_train_skip_ove, X_test_skip, y_train_skip_ove, y_test,
                   vectorizer, FS, classifier, resampling)


#### C. Support Vector Machine

In [None]:
# define the model characteristics
model_name = 'skip_svm_o'
classifier = 'SVM'

In [None]:
# perform a grid search for the support vector machine model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, svm, param_grid_svm, X_train_skip_ove, X_test_skip, y_train_skip_ove, y_test,
                   vectorizer, FS, classifier, resampling)

#### D. Random Forest Classifier

In [None]:
# define the model characteristics
model_name = 'skip_rf_o'
classifier = 'RF'

In [None]:
# perform a grid search for the random forest model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, rfc, param_grid_rf, X_train_skip_ove, X_test_skip, y_train_skip_ove, y_test,
                   vectorizer, FS, classifier, resampling)

#### E. Adaboost classifier

In [None]:
# define the model characteristics
model_name = 'skip_ada_o'
classifier = 'ADA'

In [None]:
# perform a grid search for the AdaBoost model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, ada, param_grid_ada, X_train_skip_ove, X_test_skip, y_train_skip_ove, y_test,
                   vectorizer, FS, classifier, resampling)

## 3. Pretrained embeddings

In [None]:
# define vectorizer method
vectorizer = 'pretrained'

### 3.1 Google news embeddings

In [109]:
# define type of pretrained embedding model
FS = 'google news'

Uses continuous skip-gram: http://vectors.nlpl.eu/repository/?ref=blog.paperspace.com

https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings-nlp/


In [111]:
# KeyedVectors is needed to read the binary word2vec file
from gensim.models import KeyedVectors

# if downloaded offline
google_news = KeyedVectors.load_word2vec_format('/Users/Artur/Desktop/thesis_HIR_versie5/big files/google/GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
# Download the "word2vec-google-news-300" embeddings if not downloaded offline
#import gensim.downloader
#google_news = gensim.downloader.load('word2vec-google-news-300')

The function `vectorize_pretrained` below vectorizes a headline by averaging the pre-trained embeddings of its words.

    Args:
        headline (str): The input headline to be vectorized.
        pre_trained_model (KeyedVectors): The pre-trained word embedding model.
        size (int): The dimensionality of the word vectors; used for the zero vector that is returned when no word of the headline is in the model.

    Returns:
        np.ndarray: The mean of the word vectors of the headline.


In [115]:
def vectorize_pretrained(headline, pre_trained_model, size):

    # Split the headline into individual words
    words = headline.split()

    # Retrieve word embedding for each word in the headline
    words_vecs = [pre_trained_model[word] for word in words if word in pre_trained_model]

    # If no word vectors are found (i.e., no matching words in the pretrained model),
    # return a zero vector of the specified dimension
    if len(words_vecs) == 0:
        return np.zeros(size)

    # Convert the list of word vectors into a numpy array and average across word vectors
    words_vecs = np.array(words_vecs)
    mean_vector = words_vecs.mean(axis=0)

    return mean_vector
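
As a quick sanity check of the averaging logic, here is a small sketch that uses a plain dictionary in place of the pretrained model. A dict supports `word in model` and `model[word]` just like `KeyedVectors`. The names `toy_model` and `average_vector`, and the 3-dimensional vectors, are made up for illustration. `average_vector` repeats the logic of `vectorize_pretrained` so that the example runs on its own.

```python
import numpy as np

# Hypothetical 3-dimensional "model": a dict standing in for gensim KeyedVectors
toy_model = {
    "bank": np.array([1.0, 0.0, 2.0]),
    "bond": np.array([3.0, 2.0, 0.0]),
}

def average_vector(headline, model, size):
    """Same averaging logic as vectorize_pretrained."""
    vecs = [model[word] for word in headline.split() if word in model]
    if not vecs:
        # No word of the headline is known to the model
        return np.zeros(size)
    return np.mean(vecs, axis=0)

known = average_vector("bank issues bond", toy_model, 3)  # "issues" is skipped
unknown = average_vector("xyz qqq", toy_model, 3)         # no known words

print(known)    # [2. 1. 1.]
print(unknown)  # [0. 0. 0.]
```

Out-of-vocabulary words are dropped silently, so two headlines that differ only in unknown words get the same vector, and a headline with no known words maps to the zero vector.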


In [116]:
# Create embeddings of the train and test set based on the pretrained google_news model
X_train_google_news = np.vstack([vectorize_pretrained(headline, google_news, 300) for headline in X_train])
X_test_google_news = np.vstack([vectorize_pretrained(headline, google_news, 300) for headline in X_test])

### 3.1.1 Without resampling

In [110]:
# Define the resampling technique
resampling = 'None'

#### A. Logistic regression

In [None]:
# define the model characteristics
model_name = 'google_log_w'
classifier = 'logR'

In [None]:
# perform a grid search for the logistic regression model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, logreg, param_grid_log, X_train_google_news, X_test_google_news, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

#### B. Decision Tree

In [None]:
# define the model characteristics
model_name = 'google_DT_w'
classifier = 'DT'

In [None]:
# perform a grid search for the decision tree model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, tree, param_grid_DT, X_train_google_news, X_test_google_news, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

#### C. Support Vector Machine

In [None]:
# define the model characteristics
model_name = 'google_svm_w'
classifier = 'SVM'

In [None]:
# perform a grid search for the support vector machine model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, svm, param_grid_svm, X_train_google_news, X_test_google_news, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

#### D. Random Forest Classifier

In [None]:
# define the model characteristics
model_name = 'google_rf_w'
classifier = 'RF'

In [None]:
# perform a grid search for the random forest model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, rfc, param_grid_rf, X_train_google_news, X_test_google_news, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

#### E. Adaboost classifier

In [None]:
# define the model characteristics
model_name = 'google_ada_w'
classifier = 'ADA'

In [None]:
# perform a grid search for the AdaBoost model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, ada, param_grid_ada, X_train_google_news, X_test_google_news, y_train, y_test,
                   vectorizer, FS, classifier, resampling)

### 3.1.2 With undersampling

In [None]:
# Define the resampling technique
resampling = 'Und'

#### Second strategy

In [None]:
# Create the random undersampler with maximum imbalance
undersampler = RandomUnderSampler(sampling_strategy = max_imbalance_u, random_state = 7)

# Undersample the data
X_train_google_news_und, y_train_google_news_und = undersampler.fit_resample(X_train_google_news, y_train)
#y_train_google_news_und.value_counts()
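
The effect of a float `sampling_strategy` on random undersampling can be illustrated with a small NumPy sketch for the binary case: majority samples are dropped at random until the ratio of minority to majority samples reaches the requested value. This is a simplified illustration, not imblearn's code. The function `undersample_sketch`, the toy data and `ratio=0.5` (standing in for `max_imbalance_u`) are assumptions.

```python
import numpy as np

def undersample_sketch(X, y, ratio, majority_label=0, seed=7):
    """Randomly drop majority samples until n_minority / n_majority == ratio."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    # Number of majority samples to keep for the requested ratio
    n_keep = min(len(maj_idx), int(len(min_idx) / ratio))
    kept = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([kept, min_idx]))
    return X[idx], y[idx]

# Toy data: 16 majority (label 0) and 4 minority (label 1) samples
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

X_und, y_und = undersample_sketch(X, y, ratio=0.5)
print(np.bincount(y_und))  # [8 4]
```

All minority samples are kept; only the training set is resampled, while the test set keeps its original class distribution.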

#### A. Logistic Regression

In [None]:
# define the model characteristics
model_name = 'google_log_u'
classifier = 'logR'

In [None]:
# perform a grid search for the logistic regression model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, logreg, param_grid_log, X_train_google_news_und, X_test_google_news,
                    y_train_google_news_und, y_test, vectorizer, FS, classifier, resampling)

#### B. Decision Tree

In [None]:
# define the model characteristics
model_name = 'google_DT_u'
classifier = 'DT'

In [None]:
# perform a grid search for the decision tree model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, tree, param_grid_DT, X_train_google_news_und, X_test_google_news,
                    y_train_google_news_und, y_test, vectorizer, FS, classifier, resampling)

#### C. Support Vector Machine

In [None]:
# define the model characteristics
model_name = 'google_svm_u'
classifier = 'SVM'

In [None]:
# perform a grid search for the support vector machine model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, svm, param_grid_svm, X_train_google_news_und, X_test_google_news,
                    y_train_google_news_und, y_test, vectorizer, FS, classifier, resampling)

#### D. Random Forest Classifier

In [None]:
# define the model characteristics
model_name = 'google_rf_u'
classifier = 'RF'

In [None]:
# perform a grid search for the random forest model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, rfc, param_grid_rf, X_train_google_news_und, X_test_google_news,
                    y_train_google_news_und, y_test, vectorizer, FS, classifier, resampling)

#### E. Adaboost classifier

In [None]:
# define the model characteristics
model_name = 'google_ada_u'
classifier = 'ADA'

In [None]:
# perform a grid search for the AdaBoost model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, ada, param_grid_ada, X_train_google_news_und, X_test_google_news,
                    y_train_google_news_und, y_test, vectorizer, FS, classifier, resampling)

### 3.1.3 With oversampling

In [None]:
# Define the resampling technique
resampling = 'Ove'

In [None]:
# Create the SMOTE oversampler
oversampler = SMOTE(sampling_strategy=max_imbalance_o, random_state=7)

# Oversample the training data
X_train_google_news_ove, y_train_google_news_ove = oversampler.fit_resample(X_train_google_news, y_train)
#y_train_google_news_ove.value_counts()

#### A. Logistic Regression

In [None]:
# define the model characteristics
model_name = 'google_log_o'
classifier = 'logR'

In [None]:
# perform a grid search for the logistic regression model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, logreg, param_grid_log, X_train_google_news_ove, X_test_google_news,
                    y_train_google_news_ove, y_test, vectorizer, FS, classifier, resampling)

#### B. Decision Tree

In [None]:
# define the model characteristics
model_name = 'google_DT_o'
classifier = 'DT'

In [None]:
# perform a grid search for the decision tree model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, tree, param_grid_DT, X_train_google_news_ove, X_test_google_news,
                    y_train_google_news_ove, y_test, vectorizer, FS, classifier, resampling)


#### C. Support Vector Machine

In [None]:
# define the model characteristics
model_name = 'google_svm_o'
classifier = 'SVM'

In [None]:
# perform a grid search for the support vector machine model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, svm, param_grid_svm, X_train_google_news_ove, X_test_google_news,
                    y_train_google_news_ove, y_test, vectorizer, FS, classifier, resampling)

#### D. Random Forest Classifier

In [None]:
# define the model characteristics
model_name = 'google_rf_o'
classifier = 'RF'

In [None]:
# perform a grid search for the random forest model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, rfc, param_grid_rf, X_train_google_news_ove, X_test_google_news,
                    y_train_google_news_ove, y_test, vectorizer, FS, classifier, resampling)

#### E. Adaboost classifier

In [None]:
# define the model characteristics
model_name = 'google_ada_o'
classifier = 'ADA'

In [None]:
# perform a grid search for the AdaBoost model
# the results are automatically stored in results_all_df and best_params_dict
perform_grid_search(model_name, ada, param_grid_ada, X_train_google_news_ove, X_test_google_news,
                    y_train_google_news_ove, y_test, vectorizer, FS, classifier, resampling)

## 4. GloVe embeddings

In [74]:
# Download the "glove-twitter-100" embeddings
import gensim.downloader
glove_twitter_100 = gensim.downloader.load('glove-twitter-100')



In [75]:
X_train_glove = np.vstack([vectorize_pretrained(headline, glove_twitter_100, 100) for headline in X_train])
X_test_glove = np.vstack([vectorize_pretrained(headline, glove_twitter_100, 100) for headline in X_test])

In [76]:
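# perform a grid search for the logistic regression model on the GloVe embeddings
# the results are automatically stored in results_all_df and best_params_dict
# NOTE: model_name, vectorizer, FS, classifier and resampling are not redefined here, so they still
# hold the values of an earlier cell (the output below is labelled 'cbow_log_w')
perform_grid_search(model_name, logreg, param_grid_log, X_train_glove, X_test_glove, y_train, y_test,
                   vectorizer, FS, classifier, resampling)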

best parameters: {'C': 10, 'penalty': 'l2'}
Results for cbow_log_w:
Accuracy: 0.9040133160717588
Precision: 0.404330300307501
Recall: 0.18227815053730115
F1: 0.2288104955249781


In [29]:
# load the whole embedding into memory
embeddings_index = dict()

# each line holds a word followed by the values of its embedding vector;
# the with-statement closes the file even if parsing fails
with open('/Users/Artur/Desktop/thesis_HIR_versie5/big files/glove_6B/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.
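
The parsing step can be tried without the large GloVe file. The sketch below feeds two made-up lines in the same "word followed by space-separated floats" layout through the same logic. The helper `load_glove` and the sample values are hypothetical and only serve to show the resulting data structure.

```python
import io
import numpy as np

def load_glove(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> float32 vector."""
    index = {}
    for line in handle:
        values = line.rstrip().split(" ")
        index[values[0]] = np.asarray(values[1:], dtype="float32")
    return index

# Two made-up 3-dimensional entries in the GloVe text layout
sample = io.StringIO("bank 0.1 0.2 0.3\nbond 0.4 0.5 0.6\n")
toy_index = load_glove(sample)

print(len(toy_index), toy_index["bond"].shape)  # 2 (3,)
```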


In [25]:
headlines

[['head',
  'u',
  'patent',
  'granted',
  'se',
  'delaware',
  'may',
  'titled',
  'composition',
  'comprising',
  'acrylic',
  'block',
  'copolymer',
  'uvcurable',
  'copolymer',
  'method',
  'making',
  'using'],
 ['societe',
  'generale',
  'launch',
  'nextgeneration',
  'card',
  'integrating',
  'dynamic',
  'security',
  'code'],
 ['plc', 'form', 'communication'],
 ['4q', 'earnings', 'snapshot'],
 ['form', 'investment', 'manager', 'group', 'plc'],
 ['plc', 'transaction', 'share'],
 ['northern', 'form', 'group', 'plc'],
 ['plc', 'annual', 'report'],
 ['technology',
  'ag',
  'delaware',
  'applies',
  'u',
  'patent',
  'titled',
  'current',
  'sensor',
  'system',
  'method',
  'sensing',
  'current',
  'conductor'],
 ['plc', 'holding', 'company'],
 ['bank', 'set', '10th', 'coupon', 'rate', 'series', 'bo21', 'bond'],
 ['wipo',
  'publishes',
  'patent',
  '2substituted',
  '5phenyl12dihydro3h3benzazepine3carboxamide',
  'derivative',
  'brd4',
  'inhibitor',
  'treatmen

In [None]:
from keras.preprocessing.text import Tokenizer
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
vocab_size = len(tokenizer.word_index) + 1
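
One common use of `vocab_size` together with a dictionary such as `embeddings_index` is to build an embedding matrix in which row `i` holds the pretrained vector of the word with tokenizer index `i`. The notebook does not show this step, so the following is only a sketch of how it might look. The toy `word_index` and `embeddings_index` below are hypothetical stand-ins for `tokenizer.word_index` and the loaded GloVe vectors.

```python
import numpy as np

# Hypothetical stand-ins for tokenizer.word_index and the loaded GloVe vectors
word_index = {"bank": 1, "bond": 2, "zzz": 3}
embeddings_index = {
    "bank": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "bond": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
dim = 3

# Index 0 is reserved (the tokenizer starts counting at 1), hence the + 1
vocab_size = len(word_index) + 1
embedding_matrix = np.zeros((vocab_size, dim))

for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        # Words without a pretrained vector keep an all-zero row
        embedding_matrix[i] = vector

print(embedding_matrix.shape)  # (4, 3)
```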

In [None]:
# write the results to a CSV file
results_all_df.to_csv('./Output/Model performance/results_embeddings.csv', index = False, header = True)

In [None]:
# Write the dictionary with the best parameters to a JSON file
import json

with open('./Output/parameters/embeddings.json', 'w') as file:
    json.dump(best_params_dict, file)