<a href="https://colab.research.google.com/github/cse-teacher/suggestion-mining/blob/main/suggestion_mining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Suggestion Mining
Suggestion mining is the task of extracting suggestions from user reviews.

Developed: 11 Feb 2024 \\
Last Update: 11 Feb 2024 \\
Author: Muharram Mansoorizadeh, plus various AI tools (Google Search, ChatGPT, Gemini, ...)




## Install Required Packages

In [None]:
#Install required packages and libraries

!apt-get install libenchant-2-2
!pip install emoji
!pip install cleantext
!pip install nltk
!pip install pyenchant
!pip install scikit-learn lightgbm catboost
!pip install gensim
!pip install transformers sentencepiece sacremoses
!pip install ekphrasis

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  aspell aspell-en dictionaries-common enchant-2 hunspell-en-us libaspell15 libhunspell-1.7-0
  libtext-iconv-perl
Suggested packages:
  aspell-doc spellutils wordlist hunspell openoffice.org-hunspell | openoffice.org-core
  libenchant-2-voikko
The following NEW packages will be installed:
  aspell aspell-en dictionaries-common enchant-2 hunspell-en-us libaspell15 libenchant-2-2
  libhunspell-1.7-0 libtext-iconv-perl
0 upgraded, 9 newly installed, 0 to remove and 45 not upgraded.
Need to get 1,431 kB of archives.
After this operation, 5,501 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libtext-iconv-perl amd64 1.7-7build3 [14.3 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libaspell15 amd64 0.60.8-4build1 [325 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/main amd64 dict

In [None]:
!pip3 install spacy

!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 35.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Entity labels of interest (spaCy's label for products is 'PRODUCT')
ner_categories = ['PERSON', 'PRODUCT', 'ORG', 'GPE']


In [None]:
text = 'John drives to Sidny school every day with his windows phone made by microsoft'
doc = nlp(text)
print(doc)

John drives to Sidny school every day with his windows phone made by microsoft


In [None]:
for ent in doc.ents:
  print(ent.text , ent.label, ent.label_)

John 380 PERSON
Sidny 384 GPE
microsoft 383 ORG

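The labels printed above can be used to mask entities in a sentence, which is what the NLTK-based `replace_named_entities` function does later in this notebook. The following is a minimal sketch that assumes the (text, label) pairs from the output above. `mask_entities` is a hypothetical helper, not part of spaCy, and it uses plain string replacement instead of token offsets.

```python
# Entity (text, label) pairs copied from the spaCy output above.
entities = [("John", "PERSON"), ("Sidny", "GPE"), ("microsoft", "ORG")]
ner_categories = ["PERSON", "PRODUCT", "ORG", "GPE"]

def mask_entities(text, entities, allowed):
    """Replace each entity whose label is in `allowed` with that label (hypothetical helper)."""
    for ent_text, label in entities:
        if label in allowed:
            text = text.replace(ent_text, label)
    return text

text = "John drives to Sidny school every day with his windows phone made by microsoft"
masked = mask_entities(text, entities, ner_categories)
print(masked)
```

With real spaCy output, the character offsets `ent.start_char` and `ent.end_char` would be a safer basis for replacement than string matching.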

## Import data

Get the required data files from the GitHub repository.

In [None]:
!git clone https://github.com/cse-teacher/suggestion-mining.git

Cloning into 'suggestion-mining'...
remote: Enumerating objects: 126, done.
remote: Counting objects: 100% (54/54), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 126 (delta 33), reused 0 (delta 0), pack-reused 72
Receiving objects: 100% (126/126), 2.50 MiB | 21.49 MiB/s, done.
Resolving deltas: 100% (67/67), done.


## Prepare data

In [None]:
# Read data from input files
#Reset environment
%reset -f

import numpy as np
import pandas as pd
import random
import sys

#Set default seeds (NumPy is seeded too because np.random is used for sampling later):
random.seed(42)
np.random.seed(42)

#Main Application
folder     = "./suggestion-mining/data/"
train_file = folder + "V1.4_Training.csv"                # alternatives: "Train_Augmented_03.csv", "Train_processed.csv"
valid_file = folder + "SubtaskA_Trial_Test_Labeled.csv"  # alternative: "validation_processed.csv"
test_file  = folder + "SubtaskA_EvaluationData_labeled.csv"

train_df = pd.read_csv(train_file,
                       encoding_errors='ignore', header=None,
                       names=["id", "sentence", "label"])

valid_df = pd.read_csv(valid_file,
                       encoding_errors='ignore', header=None,
                       names=["id", "sentence", "label"])

test_df  = pd.read_csv(test_file,
                       encoding_errors='ignore', header=None,
                       names=["id", "sentence", "label"])

all_df = pd.concat([train_df, valid_df, test_df], axis=0)


#Get the labels:
y_train_original = train_df['label'].values
y_valid_original = valid_df['label'].values
y_test_original  = test_df['label'].values
y_all_original  = all_df['label'].values
train_size = len(train_df['label'])
valid_size = len(valid_df['label'])
test_size  = len(test_df['label'])



**Preprocessing**

In [None]:
import sys
import re
import nltk
import cleantext
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

def remove_nonalpha(text):
  # Keep letters only; digits and punctuation become spaces
  text = re.sub(r'[^A-Za-z]+', ' ', text)
  text = re.sub(r'\s+', ' ', text)
  return text

def remove_nonalphanumeric(text):
  # Keep word characters (letters, digits, underscore); everything else becomes a space
  text = re.sub(r'\W+', ' ', text)
  text = re.sub(r'\s+', ' ', text)
  return text

def remove_stopwords_list(tokens):
  filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
  return filtered_tokens

def remove_stopwords(text):
  tokens = word_tokenize(text)
  filtered_tokens = remove_stopwords_list(tokens)
  return ' '.join(filtered_tokens)

#-----------------------------------
# Replace hyperlinks
#
def replace_hyperlinks(text):
  text = re.sub(r'https?:\/\/\S+', 'hyperlink', text)
  return text

def stem(text):
  tokens = word_tokenize(text.strip())
  tokens_stem =[stemmer.stem(s) for s in tokens]
  return ' '.join(tokens_stem)

#----------------------------------------
# replace_named_entities:
#    Replaces each word or phrase in the input text with its
#    Named Entity Recognition (NER) tag label.
#    Args:
#    text (str): Input text
#
#    Returns:
#    str: Text with named entities replaced by their NER tag labels
#
def replace_named_entities(text):
    # Tokenize the text into words
    words = word_tokenize(text)

    # Tag the words with Part-of-Speech (POS) tags
    tagged_words = pos_tag(words)

    # Perform Named Entity Recognition (NER)
    named_entities = ne_chunk(tagged_words)

    # Replace entities with their NER tag labels
    replaced_text = []
    for entity in named_entities:
        if isinstance(entity, nltk.tree.Tree):
            label = entity.label()
            named_entity_text = " ".join([word for word, tag in entity.leaves()])
            #replaced_text.append(f'<{label}>{named_entity_text}</{label}>')
            replaced_text.append(f'{label}')
            #replaced_text.append('')
        else:
            replaced_text.append(entity[0])

    return " ".join(replaced_text)

#Global callings:
stemmer = SnowballStemmer("english")
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Example usage:
text = "Microsoft should seriously look into getting rid of Syamentc for all these paying stuff"
replaced_text = replace_named_entities(text)
print("Replaced Text:", replaced_text)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Replaced Text: PERSON should seriously look into getting rid of GPE for all these paying stuff


In [None]:
op_replace_hyperlinks      = True
op_remove_nonalphanumeric  = True
op_remove_nonalpha         = True
op_remove_stopwords        = False
op_replace_named_entities  = False
op_stem                    = False

if op_replace_hyperlinks == True:
  # Replace URLs with the placeholder token 'hyperlink':
  train_df['sentence']  = train_df['sentence'].apply(replace_hyperlinks)
  test_df['sentence']   = test_df['sentence'].apply(replace_hyperlinks)
  valid_df['sentence']  = valid_df['sentence'].apply(replace_hyperlinks)
  all_df['sentence']    = all_df['sentence'].apply(replace_hyperlinks)

if op_remove_nonalphanumeric == True:
  train_df['sentence'] = train_df['sentence'].apply(remove_nonalphanumeric)
  valid_df['sentence'] = valid_df['sentence'].apply(remove_nonalphanumeric)
  test_df['sentence']  = test_df['sentence'].apply(remove_nonalphanumeric)
  all_df['sentence']   = all_df['sentence'].apply(remove_nonalphanumeric)

if op_remove_nonalpha  == True:
  train_df['sentence'] = train_df['sentence'].apply(remove_nonalpha)
  valid_df['sentence'] = valid_df['sentence'].apply(remove_nonalpha)
  test_df['sentence']  = test_df['sentence'].apply(remove_nonalpha)
  all_df['sentence']   = all_df['sentence'].apply(remove_nonalpha)


if op_replace_named_entities == True:
  train_df['sentence']  = train_df['sentence'].apply(replace_named_entities)
  test_df['sentence']   = test_df['sentence'].apply(replace_named_entities)
  valid_df['sentence']  = valid_df['sentence'].apply(replace_named_entities)
  all_df['sentence']    = all_df['sentence'].apply(replace_named_entities)

if op_remove_stopwords == True:
  train_df['sentence'] = train_df['sentence'].apply(remove_stopwords)
  valid_df['sentence'] = valid_df['sentence'].apply(remove_stopwords)
  test_df['sentence']  = test_df['sentence'].apply(remove_stopwords)
  all_df['sentence']   = all_df['sentence'].apply(remove_stopwords)

if op_stem == True:
  train_df['sentence'] = train_df['sentence'].apply(stem)
  valid_df['sentence'] = valid_df['sentence'].apply(stem)
  test_df['sentence']  = test_df['sentence'].apply(stem)
  all_df['sentence']   = all_df['sentence'].apply(stem)


In [None]:
train_df['sentence'][195:200].tolist()


[' When creating an app that uses MediaStreamSource to stream audio in Windows Phone everything works just great ',
 ' When porting this app to Windows Phone the sound is flickering and its components are stack overflow ing ',
 ' Here is a discussion on MSDN forums hyperlink And here is a sample project source code hyperlink',
 ' we are publishing the same apps in Windows Phone Store and Windows App Store ',
 ' Now we want to bundle these Apps ']

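The sample above shows the effect of the three enabled options: URLs become `hyperlink`, punctuation and digits disappear, and stray leading or trailing spaces remain. As a rough illustration, the sketch below repeats the three regex steps on one made-up sentence; the sentence and the `example.com` URL are assumptions, not dataset content. The order matters, because a URL must be replaced before non-alphanumeric characters are stripped, otherwise it would be split into separate words.

```python
import re

def replace_hyperlinks(text):
    return re.sub(r'https?:\/\/\S+', 'hyperlink', text)

def remove_nonalphanumeric(text):
    text = re.sub(r'\W+', ' ', text)
    return re.sub(r'\s+', ' ', text)

def remove_nonalpha(text):
    text = re.sub(r'[^A-Za-z]+', ' ', text)
    return re.sub(r'\s+', ' ', text)

# Hypothetical input sentence for illustration only.
raw = "Please add Win10 support, see https://example.com/issue?id=42 !"
step1 = replace_hyperlinks(raw)          # URL -> 'hyperlink'
step2 = remove_nonalphanumeric(step1)    # punctuation -> spaces
step3 = remove_nonalpha(step2)           # digits -> spaces, so 'Win10' becomes 'Win'
print(repr(step3))
```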
## Feature extraction

In [None]:
#Extract BOW feature test
import nltk
import string
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import wordpunct_tokenize
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

#----------------------
# BOW Features
bow_vectorizer = CountVectorizer(analyzer='word',
                                 stop_words=None,
                                 lowercase=True,
                                 encoding='utf-8',
                                 min_df = 3 ,
                                 max_df = 0.975,
                                 ngram_range =(1,5))

bow_vectorizer.fit(all_df['sentence'])
train_bow_features = bow_vectorizer.transform(train_df['sentence']).toarray()
valid_bow_features = bow_vectorizer.transform(valid_df['sentence']).toarray()
test_bow_features  = bow_vectorizer.transform(test_df['sentence']).toarray()
all_bow_features   = bow_vectorizer.transform(all_df['sentence']).toarray()

#----------------------
# TF-IDF Features

# Fit the vectorizer on the sentences to learn vocabulary and IDF weights
tfidf_vectorizer = TfidfVectorizer(stop_words=None,
                                 lowercase=True,
                                 encoding='utf-8',
                                 min_df = 3 ,
                                 max_df = 0.95, #                                 max_features = 5000,
                                 ngram_range =(1,5))

tfidf_vectorizer.fit(all_df['sentence'])

# Transform the sentences into tf-idf vectors
train_tfidf_features = tfidf_vectorizer.transform(train_df['sentence']).toarray()
test_tfidf_features  = tfidf_vectorizer.transform(test_df['sentence']).toarray()
valid_tfidf_features = tfidf_vectorizer.transform(valid_df['sentence']).toarray()
all_tfidf_features   = tfidf_vectorizer.transform(all_df['sentence']).toarray()

#------------------------------------------------
# doc2vec features (gensim Doc2Vec)
#
docs = [wordpunct_tokenize(doc) for doc in all_df['sentence']]
docs1 = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs)]
model = Doc2Vec(docs1, vector_size=300, window=4, min_count=1, workers=4, epochs=100)

#Get the features:
vectors = [model.infer_vector(doc) for doc in docs]
all_d2v_features = np.array(vectors)
train_d2v_features = all_d2v_features[0:train_size,:]
valid_d2v_features = all_d2v_features[train_size:train_size+valid_size,:]
test_d2v_features  = all_d2v_features[train_size+valid_size:,:]

#define global features, empty at first:
X_train     = np.empty([])
X_test      = np.empty([])
X_valid     = np.empty([])
X_all       = np.empty([])
X_train_val = np.empty([])

y_train = y_train_original
y_valid = y_valid_original
y_test  = y_test_original
y_all   = y_all_original
y_train_val = np.concatenate((y_train , y_valid), axis= 0 )

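Both vectorizers above build their vocabulary from word n-grams (`ngram_range`) and then drop rare and very common terms (`min_df`, `max_df`). The sketch below mimics that selection in plain Python on a made-up four-sentence corpus, with smaller parameter values than the notebook uses so the effect is visible. It is an illustration of the idea under these assumptions, not scikit-learn's implementation; `word_ngrams` is a hypothetical helper.

```python
from collections import Counter

def word_ngrams(tokens, n_min, n_max):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical toy corpus, not taken from the dataset.
docs = [
    "it would be nice",
    "it would be great",
    "it would help",
    "nice app",
]

# Document frequency: in how many documents each n-gram occurs at least once.
df = Counter()
for doc in docs:
    df.update(set(word_ngrams(doc.lower().split(), 1, 3)))

min_df = 2      # absolute count: keep terms found in at least 2 documents
max_df = 0.7    # proportion: drop terms found in more than 70% of documents
n_docs = len(docs)
vocabulary = sorted(t for t, c in df.items()
                    if c >= min_df and c / n_docs <= max_df)
print(vocabulary)
```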
In [None]:
#===============================================
# Utility functions
#

import tensorflow as tf
import string
import sklearn
import seaborn as sns
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import csv
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve
from sklearn.model_selection import cross_val_score
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from nltk.tokenize import word_tokenize
from imblearn.over_sampling import SMOTE, RandomOverSampler
from datetime import datetime


#------------------------------------------------
# apply the given option
#
def select_optional_features(feature_group,
                 op_scale_features  = False,
                 op_upsample_smote  = False,
                 op_upsample_over   = False,
                 op_transform_pca   = False,
                 op_downsample_majority = False
                 ):
  global X_train, X_valid, X_test, X_all
  global y_train, y_valid, y_test, y_all
  global X_train_val , y_train_val
  global results_df, start_time_str

  description = feature_group
  y_train = y_train_original
  y_valid = y_valid_original
  y_test  = y_test_original
  # Start the results table with the true test labels; print_results() adds one column per model
  results_df = pd.DataFrame({'labels': y_test})
  start_time_str = f"{datetime.now()}"


  if feature_group == 'tfidf' :
    X_train = train_tfidf_features
    X_test  = test_tfidf_features
    X_valid = valid_tfidf_features
    X_all   = all_tfidf_features
  elif feature_group == 'bow':
    X_train = train_bow_features
    X_test  = test_bow_features
    X_valid = valid_bow_features
    X_all   = all_bow_features
  elif feature_group == 'd2v':
    X_train = train_d2v_features
    X_test  = test_d2v_features
    X_valid = valid_d2v_features
    X_all   = all_d2v_features

  if op_scale_features == True: # Scale numerical features
     scaler  = StandardScaler().fit(X_all); description += ', Standard Scaler'
     X_all   = scaler.transform(X_all)
     X_train = scaler.transform(X_train)
     X_test  = scaler.transform(X_test)
     X_valid = scaler.transform(X_valid)

  if op_upsample_smote == True: # SMOTE oversampling
    smote = SMOTE(sampling_strategy="minority") ; description += ', SMOTE Augmentation'
    X_train, y_train = smote.fit_resample(X_train, y_train)

  if op_upsample_over == True: # Random oversampling
    oversampler = RandomOverSampler(random_state=42); description += ', oversampling Augmentation'
    X_train, y_train = oversampler.fit_resample(X_train, y_train)

  if op_transform_pca == True:  # Do PCA
    #n_comps = min(500 , 0.1 * X_all.shape[1])
    pca = PCA(n_components=0.95).fit(X_all);  description += ', PCA'
    X_train = pca.transform(X_train) ; X_test = pca.transform(X_test)
    X_valid = pca.transform(X_valid) ; X_all = pca.transform(X_all)

  if op_downsample_majority == True:  # Down sample majority class
      # Separate instances for class 1
    class_1_instances = X_train[y_train == 1,:]
    class_0_instances = X_train[y_train == 0,:]
    number_of_samples = class_1_instances.shape[0]
    indices = np.random.choice(class_0_instances.shape[0], number_of_samples, replace=False)
    sampled_class_0_instances = class_0_instances[indices,:]

    # Combine instances for class 1 and sampled instances from class 0
    X_train = np.concatenate([class_1_instances, sampled_class_0_instances])
    y_train = np.concatenate([np.ones(class_1_instances.shape[0]), np.zeros(sampled_class_0_instances.shape[0])])


    # Train + Validation data
  X_train_val = np.concatenate((X_train, X_valid) , axis=0)
  y_train_val = np.concatenate((y_train, y_valid) , axis=0)
  return description

#----------------------------------
# Print results per class
#
def print_per_class_results(y_actual, y_pred, description=''):
  for label in (0,1):
    v0 = accuracy_score(y_actual, y_pred)
    v1 = precision_score(y_actual, y_pred, pos_label=label)
    v2 = recall_score(y_actual, y_pred, pos_label=label)
    v3 = f1_score(y_actual, y_pred, pos_label=label)
    print(f"{description},\t class={label}\tAccuracy={v0:.2f},\t Precision={v1:.2f},\tRecall={v2:.2f}\tF1-score={v3:.2f}")


#----------------------------------
# Print accuracy and per-class precision, recall, and F1 on one line; also log them to a file and to results_df
#
def print_results(y_actual, y_pred, description=''):
  try:
    v00 = accuracy_score(y_actual, y_pred)
    v01 = precision_score(y_actual, y_pred, pos_label=0)
    v02 = recall_score(y_actual, y_pred, pos_label=0)
    v03 = f1_score(y_actual, y_pred, pos_label=0)

    v11 = precision_score(y_actual, y_pred, pos_label=1)
    v12 = recall_score(y_actual, y_pred, pos_label=1)
    v13 = f1_score(y_actual, y_pred, pos_label=1)

    smsg = f"{description},\tAccuracy={v00:.2f},\tC0: Pr={v01:.2f}, Re={v02:.2f}, F1={v03:.2f},\tC1: Pr={v11:.2f}, Re={v12:.2f}, F1={v13:.2f}"
    print(smsg)
    with open(f"results_{start_time_str}.txt", "a") as myfile:
      myfile.write(f"{datetime.now()}\t {smsg}\n")

    results_df.insert(len(results_df.columns),description, y_pred)
  except Exception as error:
      print(f"something went wrong {error}")

# Cut off probabilities at the threshold to produce binary labels (modifies y in place)
def prob2label(y, threshold=0.5):
  y[y <  threshold] = 0
  y[y >= threshold] = 1
  return y

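The `op_downsample_majority` option in `select_optional_features` balances the training set by keeping every class-1 row and an equally sized random sample of class-0 rows. The sketch below shows the same idea in plain Python on toy data. `downsample_majority` is a hypothetical helper, and it assumes that class 1 is the minority and that there are at least as many class-0 rows as class-1 rows.

```python
import random

def downsample_majority(X, y, seed=42):
    """Keep all class-1 rows plus an equally sized random sample of class-0 rows."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    keep = pos + rng.sample(neg, len(pos))   # sampling without replacement
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: 2 positives and 6 negatives.
X = [[i] for i in range(8)]
y = [1, 0, 0, 1, 0, 0, 0, 0]
X_bal, y_bal = downsample_majority(X, y)
print(y_bal)
```

Unlike the notebook version, this sketch reuses the original labels instead of rebuilding them with `np.ones` and `np.zeros`, so the label type is preserved.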

In [None]:
#ttest based keywords selection:
# Import the library
import scipy.stats as stats

def ttest2(X, y):
  # Two-sample t-test per feature column: class 1 versus all other samples
  X1 = X[y == 1, :]
  X2 = X[y != 1, :]
  numcols = X.shape[1]
  sval = np.zeros(numcols, float)
  pval = np.zeros(numcols, float)
  for k in range(numcols):
    test_result = stats.ttest_ind(a=X1[:, k], b=X2[:, k], equal_var=True)
    sval[k] = test_result.statistic
    pval[k] = test_result.pvalue
  return sval, pval

# filter bow features: sort by p-value and keep features whose p-value does not exceed the cutoff
cutoff = 0.22
[A,B] = ttest2(train_bow_features , y_train)
best_bow_features = np.argsort(B, axis=-1, kind=None, order=None)
last_bow_feature = np.min(np.argwhere(B[best_bow_features] > cutoff))
best_bow_feature_names = bow_vectorizer.get_feature_names_out()[best_bow_features[0:last_bow_feature]]

train_bow_features= train_bow_features[:,best_bow_features[0:last_bow_feature]]
test_bow_features = test_bow_features[:,best_bow_features[0:last_bow_feature]]
valid_bow_features = valid_bow_features[:,best_bow_features[0:last_bow_feature]]
all_bow_features   = all_bow_features[:,best_bow_features[0:last_bow_feature]]

# filter tfidf features
[A,B] = ttest2(train_tfidf_features , y_train)
best_tfidf_features = np.argsort(B, axis=-1, kind=None, order=None)
last_tfidf_feature  = np.min(np.argwhere(B[best_tfidf_features] > cutoff))
best_tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()[best_tfidf_features[0:last_tfidf_feature]]

train_tfidf_features= train_tfidf_features[:,best_tfidf_features[0:last_tfidf_feature]]
test_tfidf_features = test_tfidf_features[:,best_tfidf_features[0:last_tfidf_feature]]
valid_tfidf_features = valid_tfidf_features[:,best_tfidf_features[0:last_tfidf_feature]]
all_tfidf_features   = all_tfidf_features[:,best_tfidf_features[0:last_tfidf_feature]]


In [None]:
print(f"Top 10 bow features{best_bow_feature_names[0:10]}")
print(f"Top 10 tfidf features{best_tfidf_feature_names[0:10]}")
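# results_df and start_time_str are created by select_optional_features(); save only if they exist
if 'results_df' in globals():
  results_df.to_csv(f"labels_{start_time_str}.csv")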

Top 10 bow features['be' 'please' 'would' 'should' 'would be' 'it would be' 'it would' 'add'
 'should be' 'to']
Top 10 tfidf features['be' 'should' 'it would be' 'please' 'it would' 'would be' 'would' 'add'
 'should be' 'to']

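The ranking above comes from a two-sample t-test with `equal_var=True`, which uses a pooled variance estimate. As a rough illustration of what is computed for a single feature column, the sketch below evaluates the pooled t statistic by hand. The counts are made up, and `pooled_t_statistic` is a hypothetical helper that returns only the statistic, not the p-value.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances (pooled variance)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Hypothetical counts of one keyword in suggestion and non-suggestion sentences.
counts_suggestion = [1, 1, 0, 1, 2]
counts_other = [0, 0, 1, 0, 0]
t = pooled_t_statistic(counts_suggestion, counts_other)
print(f"t = {t:.3f}")
```

A positive statistic means the feature is more frequent in the suggestion class. The notebook then sorts features by the corresponding p-value.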

# Experiments



## Utility Functions


**Experimental Setup**

In [None]:
# Select options here and run the classifiers as you like:

current_options = select_optional_features(feature_group = 'tfidf',
                 op_scale_features  = False,
                 op_upsample_smote  = False,
                 op_upsample_over   = False,
                 op_transform_pca   = False,
                 op_downsample_majority = False)

print(current_options)

tfidf


## Rule-Based Methods

This section classifies sentences with hand-written keyword rules.

In [None]:
suggestion_keywords = ["should", "could", "might", "ought to", "would", "recommend", "suggest", "consider", "better", "allow"]
# Phrases are compared with lower-cased text, so they must be lower case here
polite_phrases = ["would you mind", "could you please", "i suggest", "please", "if you want to", "be able to", "it would be"]
learned_keywords = best_bow_feature_names[0:1]  # not used by contains_suggestion below

def contains_suggestion(paragraph):
    for keyword in suggestion_keywords:
        if keyword in paragraph.lower():
            return True
    for phrase in polite_phrases:
        if phrase in paragraph.lower():
            return True
    return False

def classify_paragraphs(paragraphs):
    y_pred = []
    for paragraph in paragraphs:
        if contains_suggestion(paragraph):
            y_pred.append(1)
        else:
            y_pred.append(0)
    return y_pred


y_pred = classify_paragraphs(test_df['sentence'])
print_results(y_test, y_pred, 'keywords ' )

keywords ,	Accuracy=0.87,	C0: Pr=0.97, Re=0.89, F1=0.93,	C1: Pr=0.44, Re=0.76, F1=0.56

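The rule above uses substring matching, so a keyword such as "allow" also fires inside unrelated words such as "shallow". One possible refinement, sketched here as an assumption and not as part of the original method, is to match keywords at word boundaries with a regular expression. `contains_keyword` is a hypothetical helper and uses only a subset of the keyword list.

```python
import re

# Subset of the keyword list above, for illustration.
keywords = ["should", "would", "allow", "ought to"]

# \b anchors each keyword at word boundaries, so "allow" no longer matches "shallow".
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    re.IGNORECASE,
)

def contains_keyword(sentence):
    """Return True if any keyword occurs as a whole word or phrase."""
    return pattern.search(sentence) is not None

print(contains_keyword("The water is shallow here"))       # substring only
print(contains_keyword("You should allow custom themes"))  # whole-word match
```

The trade-off is that whole-word matching also stops matching inflected forms such as "allowed" or "suggested", which substring matching catches. It may therefore raise precision at the cost of recall.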

**Linear Discriminant Analysis**

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = lda.predict(X_test)

# Report accuracy and per-class precision, recall, and F1
print_results(y_test, y_pred, 'LDA, ' + current_options)


LDA, tfidf,	Accuracy=0.76,	C0: Pr=0.93, Re=0.79, F1=0.86,	C1: Pr=0.22, Re=0.51, F1=0.31

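The F1 values in these result lines are the harmonic mean of the precision and recall of the same class. As a quick sanity check, the sketch below recomputes the class-1 F1 from the class-1 precision and recall reported in the keyword and LDA results above. Because the inputs are already rounded to two decimals, a recomputed value can in general differ from the reported one by about 0.01.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class-1 (precision, recall, reported F1) copied from the result lines above.
reported = {
    "keywords": (0.44, 0.76, 0.56),
    "LDA":      (0.22, 0.51, 0.31),
}
for name, (precision, recall, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_from(precision, recall):.2f}, reported F1 = {f1:.2f}")
```

The harmonic mean is dominated by the smaller of the two values, which is why a model with high class-1 recall but low class-1 precision still gets a modest F1.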

In [None]:
clf = RandomForestClassifier(n_estimators=501, n_jobs=-1,verbose=1)
clf.fit(X_train_val, y_train_val)
y_pred = clf.predict(X_test)
print_results(y_test , y_pred, 'random forest' + ', '+ current_options)


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   11.7s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:   47.8s
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 501 out of 501 | elapsed:  2.0min finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    0.2s


random forest, tfidf,	Accuracy=0.93,	C0: Pr=0.95, Re=0.97, F1=0.96,	C1: Pr=0.72, Re=0.60, F1=0.65


[Parallel(n_jobs=2)]: Done 446 tasks      | elapsed:    0.4s
[Parallel(n_jobs=2)]: Done 501 out of 501 | elapsed:    0.4s finished


**Basic Methods**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import BayesianRidge
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd

from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler with feature_range=(10, 20); MultinomialNB requires non-negative features
scaler = MinMaxScaler(feature_range=(10, 20))

# Fit the scaler to your data
scaler.fit(X_all)

# Transform your data
X1 = scaler.transform(X_train_val)
X2 = scaler.transform(X_test)

# Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X1, y_train_val)
nb_predictions = nb_classifier.predict(X2)

# Bayesian ridge regression (its continuous output is thresholded at 0.5 below)
bayesian_classifier = BayesianRidge()
bayesian_classifier.fit(X1, y_train_val)
bayesian_predictions = bayesian_classifier.predict(X2)


# Evaluation
print_results(y_test , nb_predictions >=0.5, 'Naive Bayes, '+ current_options)
print_results(y_test , bayesian_predictions>=0.5, 'BayesianRidge, '+ current_options)



Naive Bayes, tfidf,	Accuracy=0.90,	C0: Pr=0.90, Re=1.00, F1=0.95,	C1: Pr=0.57, Re=0.05, F1=0.09
BayesianRidge, tfidf,	Accuracy=0.91,	C0: Pr=0.95, Re=0.95, F1=0.95,	C1: Pr=0.57, Re=0.60, F1=0.58

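`MultinomialNB` rejects negative feature values, which is one reason to rescale features into a positive range before fitting it, for example when standardized, PCA, or doc2vec features are selected. The sketch below shows the linear mapping that min-max scaling applies to one feature column. It is a plain-Python illustration on made-up values, and the handling of a constant column is an assumption of this sketch.

```python
def min_max_scale(column, low, high):
    """Map values linearly so that min(column) -> low and max(column) -> high."""
    c_min, c_max = min(column), max(column)
    span = (c_max - c_min) or 1.0   # assumption: a constant column maps to `low`
    return [low + (x - c_min) / span * (high - low) for x in column]

# Hypothetical feature column that contains negative values.
column = [-2.0, 0.0, 2.0, 6.0]
scaled = min_max_scale(column, 10, 20)
print(scaled)
```

The mapping preserves the ordering and relative spacing of the values; only the offset and the scale change.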

In [None]:
#Some Useful classifiers
def test_basic_models(X1,y1,X2,y2, description):
  classifiers = {
      'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=21, metric="cosine"),
      'Logistic Regression': sklearn.linear_model.LogisticRegression(random_state=42),
      'Support Vector Machine-L': sklearn.svm.SVC(kernel='linear', random_state=42),
      'Support Vector Machine-R': sklearn.svm.SVC(kernel='rbf', random_state=42),
      'Support Vector Machine-S': sklearn.svm.SVC(kernel='sigmoid', random_state=42),
      'Support Vector Machine-WL': sklearn.svm.SVC(kernel="linear", class_weight={1: 10}, random_state=42),
      'Support Vector Machine-WR': sklearn.svm.SVC(kernel="rbf", class_weight={1: 10}, random_state=42),
      'Support Vector Machine-WS': sklearn.svm.SVC(kernel="sigmoid", class_weight={1: 10}, random_state=42),
      'Decision Tree classifier': DecisionTreeClassifier(max_depth=15, random_state=42),
  }

  # Loop through each classifier and evaluate performance
  for name, clf in classifiers.items():
      clf.fit(X1, y1)
      y_pred = clf.predict(X2)
      print_results(y2 , y_pred, 'basic ' + name + ', '+ description)
#---------------------
test_basic_models (X_train_val , y_train_val , X_test , y_test, current_options)


basic K-Nearest Neighbors, tfidf,	Accuracy=0.91,	C0: Pr=0.92, Re=0.99, F1=0.95,	C1: Pr=0.68, Re=0.26, F1=0.38
basic Logistic Regression, tfidf,	Accuracy=0.92,	C0: Pr=0.94, Re=0.98, F1=0.96,	C1: Pr=0.68, Re=0.45, F1=0.54
basic Support Vector Machine-L, tfidf,	Accuracy=0.92,	C0: Pr=0.95, Re=0.96, F1=0.96,	C1: Pr=0.63, Re=0.60, F1=0.62
basic Support Vector Machine-R, tfidf,	Accuracy=0.89,	C0: Pr=0.95, Re=0.93, F1=0.94,	C1: Pr=0.48, Re=0.55, F1=0.51
basic Support Vector Machine-S, tfidf,	Accuracy=0.91,	C0: Pr=0.96, Re=0.94, F1=0.95,	C1: Pr=0.57, Re=0.67, F1=0.62
basic Support Vector Machine-WL, tfidf,	Accuracy=0.84,	C0: Pr=0.98, Re=0.83, F1=0.90,	C1: Pr=0.37, Re=0.85, F1=0.52
basic Support Vector Machine-WR, tfidf,	Accuracy=0.89,	C0: Pr=0.97, Re=0.90, F1=0.93,	C1: Pr=0.47, Re=0.74, F1=0.57
basic Support Vector Machine-WS, tfidf,	Accuracy=0.76,	C0: Pr=0.98, Re=0.75, F1=0.85,	C1: Pr=0.29, Re=0.89, F1=0.44
basic Decision Tree classifier, tfidf,	Accuracy=0.90,	C0: Pr=0.95, Re=0.93, F1=0.94,	C1

**Ensemble Models**

This experiment trains well-known ensemble methods on the dataset.







In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

def test_ensemble_models(X1,y1,X2,y2, description):
  # Initialize classifiers
  classifiers = {
      "Random Forest"     : RandomForestClassifier(n_estimators=101,class_weight={0:1,1:10}, n_jobs=-1,verbose=1),
      "AdaBoost"          : AdaBoostClassifier(n_estimators=101),
      "Gradient Boosting" : GradientBoostingClassifier(),
      "Extra Trees"       : ExtraTreesClassifier(),
      "LightGBM"          : LGBMClassifier(),
      "CatBoost"          : CatBoostClassifier(verbose=0)
  }

  # Loop through each classifier and evaluate performance
  for name, clf in classifiers.items():
      clf.fit(X1, y1)
      y_pred = clf.predict(X2)
      print_results(y2 , y_pred, 'Ensemble, ' + name + ', ' + description)


# Train and evaluate ensemble models
test_ensemble_models (X_train_val , y_train_val , X_test , y_test, current_options)



[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    9.9s
[Parallel(n_jobs=-1)]: Done 101 out of 101 | elapsed:   20.9s finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 101 out of 101 | elapsed:    0.1s finished


Ensemble, Random Forest, tfidf,	Accuracy=0.94,	C0: Pr=0.96, Re=0.97, F1=0.97,	C1: Pr=0.73, Re=0.67, F1=0.70
Ensemble, AdaBoost, tfidf,	Accuracy=0.91,	C0: Pr=0.95, Re=0.94, F1=0.95,	C1: Pr=0.54, Re=0.60, F1=0.57
Ensemble, Gradient Boosting, tfidf,	Accuracy=0.92,	C0: Pr=0.95, Re=0.96, F1=0.96,	C1: Pr=0.62, Re=0.61, F1=0.61
Ensemble, Extra Trees, tfidf,	Accuracy=0.92,	C0: Pr=0.95, Re=0.96, F1=0.96,	C1: Pr=0.65, Re=0.59, F1=0.62
[LightGBM] [Info] Number of positive: 2381, number of negative: 6711
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.051038 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 26574
[LightGBM] [Info] Number of data points in the train set: 9092, number of used features: 860
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.261879 -> initscore=-1.036227
[LightGBM] [Info] Start training from score -1.036227
Ensemble, Ligh

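The LightGBM log above also documents the class imbalance of the combined training and validation data. As a small worked check, the `pavg` value is the share of positive samples, and the `initscore` is the log-odds of that share. The sketch below recomputes both from the counts printed in the log.

```python
import math

# Counts copied from the LightGBM log above.
n_pos, n_neg = 2381, 6711
n_total = n_pos + n_neg

p_avg = n_pos / n_total                      # share of positive samples
init_score = math.log(p_avg / (1 - p_avg))   # log-odds of the positive class

print(f"total={n_total}, pavg={p_avg:.6f}, initscore={init_score:.6f}")
```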
In [None]:
X_train.shape

(8500, 6351)

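The shape above refers to a dense array, because the feature matrices were converted with `.toarray()`. A back-of-the-envelope estimate of its memory footprint, assuming 8-byte float64 values, can be sketched as:

```python
rows, cols = 8500, 6351      # X_train.shape from the output above
bytes_per_value = 8          # assumption: dense float64 values
dense_bytes = rows * cols * bytes_per_value
print(f"{dense_bytes / 1e6:.0f} MB")
```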
**Neural Networks**

This network is trained on the training and validation sets and
tested on the test set.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from datetime import datetime

class MLPModel(nn.Module):
    def __init__(self, input_size, hidden_sizes):
        super(MLPModel, self).__init__()
        self.layers = nn.ModuleList()

        # Input layer
        self.layers.append(nn.Linear(input_size, hidden_sizes[0]))
        self.layers.append(nn.BatchNorm1d(hidden_sizes[0]))
        self.layers.append(nn.Tanh())
        self.layers.append(nn.Dropout(0.2))

        # Hidden layers
        for i in range(1, len(hidden_sizes)):
            self.layers.append(nn.Linear(hidden_sizes[i - 1], hidden_sizes[i]))
            self.layers.append(nn.BatchNorm1d(hidden_sizes[i]))
            self.layers.append(nn.ReLU())
            self.layers.append(nn.Dropout(0.2))

        # Output layer
        self.layers.append(nn.Linear(hidden_sizes[-1], 1))
        self.layers.append(nn.Sigmoid())

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Hyperparameters
torch.cuda.empty_cache()

input_size = X_train_val.shape[1]  # Adjust this based on your input features
hidden_sizes = [500, 250, 100, 50]

# Instantiate the model
model_mlp2 = MLPModel(input_size, hidden_sizes)

# Check if GPU is available and move the model and data to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_mlp2.to(device)
print(device)
# Loss and optimizer
criterion = nn.BCELoss()  #nn.BCEWithLogitsLoss() #nn.CrossEntropyLoss() #nn.MSELoss() #nn.KLDivLoss() # nn.BCELoss()
optimizer = optim.Adam(model_mlp2.parameters(), lr=0.001)

#Generate balanced dataset for training:
# Separate instances for class 1
class_1_instances = X_train_val[y_train_val == 1,:]
class_0_instances = X_train_val[y_train_val == 0,:]
number_of_samples = class_1_instances.shape[0]

index             = np.random.choice(class_0_instances.shape[0], number_of_samples, replace=False)
sampled_class_0_instances = class_0_instances[index,:]

# Combine instances for class 1 and sampled instances from class 0
#balanced_X = np.concatenate([class_1_instances, sampled_class_0_instances])
#balanced_y = np.concatenate([np.ones(class_1_instances.shape[0]), np.zeros(sampled_class_0_instances.shape[0])])
balanced_X = X_train_val ; balanced_y = y_train_val


# Move the training data to the selected device as float tensors
# (balanced_X and balanced_y currently hold the full, unbalanced train+validation set)
data_X = torch.Tensor(balanced_X).to(device)
data_y = torch.Tensor(balanced_y).view(-1, 1).to(device)

# Create DataLoader for the dataset
dataset = TensorDataset(data_X, data_y)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

# Training loop
top_n      = 101 # top n better models
num_epochs = 250
losses     = [10000]* top_n
# Get the current date and time
current_datetime     = datetime.now()
current_datetime_str = f"{current_datetime.strftime('%Y-%m-%d_%H-%M-%S')}"

for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        outputs = model_mlp2(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    if epoch %10 == 0 :
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
    # Save better models:
    cur_loss =loss.item()
    for k in range(len(losses)):
        if cur_loss < losses[k]:
            losses[k] = cur_loss
            if k < top_n:
                model_file_name = f"mlp_model_best_{k:02d}_{current_datetime_str}.pth"
                torch.save(model_mlp2.state_dict(), model_file_name)
                #print(losses)
                break


#save the last model
print(losses)
model_file_name = f"mlp_model_last_{current_datetime_str}.pth"
torch.save(model_mlp2.state_dict(), model_file_name)




#------------------------------------------
# Let's get the training accuracy.
# Set the model to evaluation mode and evaluate it on the training data.
# The training dataloader shuffles the rows, so collect the labels batch by batch
# to keep them aligned with the predictions.
model_mlp2.eval()
predictions = []
targets = []
with torch.no_grad():
  for inputs, labels in dataloader:
    outputs = model_mlp2(inputs)
    predictions.append(outputs.cpu().numpy())
    targets.append(labels.cpu().numpy())
# Threshold the sigmoid outputs at 0.5 to get class labels
predictions = np.concatenate(predictions)
y_pred = predictions >= 0.5
y_true = np.concatenate(targets).ravel()

print_results(y_true , y_pred, 'torch nn, training ' + current_options)



[5.620389856630936e-06, 9.040949407790322e-06, 1.196373068523826e-05, 2.4004553779377602e-05, 3.4125296224374324e-05, 3.956385626224801e-05, 5.206716014072299e-05, 6.47710548946634e-05, 9.248861169908196e-05, 9.431212674826384e-05, 0.0001482561056036502, 0.00045639934251084924, 0.0005832063034176826, 0.006052209530025721, 0.009372045285999775, 0.015000290237367153, 0.024201955646276474, 0.06612623482942581, 0.18303440511226654, 0.24143077433109283, 0.7864460945129395, 1.163494348526001, 1.5431393384933472, 1.8736683130264282, 2.3808646202087402, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 

In [None]:
import os

# print output to the console
print(os.getcwd())
os.chdir('c:/users/mmr')
print(os.getcwd())


# The output shows the working directory before and after the change

c:\users\mmr
c:\users\mmr


In [None]:
#-------------------------
# Test the network
fname = f"mlp_model_best_00_{current_datetime_str}.pth"
model = MLPModel(input_size, hidden_sizes)
# map_location lets a checkpoint saved on the GPU load on a CPU-only machine
model.load_state_dict(torch.load(fname, map_location="cpu"))

#Prepare test data:
new_data = torch.Tensor(X_test) #.to(device)

# Create DataLoader for the new dataset
new_dataset = TensorDataset(new_data)
new_dataloader = DataLoader(new_dataset, batch_size=1, shuffle=False)

predictions = []
model.eval()
# Make predictions on the test data
with torch.no_grad():
  for inputs in new_dataloader:
    outputs = model(inputs[0])#(torch.tensor(X_test))
    #predictions = torch.round(outputs)
    predictions.append(outputs.cpu().data.numpy())

# Calculate accuracy
predictions = np.concatenate(predictions)
y_pred = predictions >= 0.5

print_results(y_test , y_pred, 'torch nn best, ' + current_options)

torch nn best, tfidf,	Accuracy=0.84,	C0: Pr=0.95, Re=0.87, F1=0.91,	C1: Pr=0.35, Re=0.61, F1=0.45


In [None]:
    #'MLP Network': MLPClassifier(hidden_layer_sizes=(150, 100,50), activation='relu', solver='adam', max_iter=1000),
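import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense, Conv1D, MaxPooling1D, Embedding, LSTM, Flatten,
                                     BatchNormalization, Activation, Dropout)
from tensorflow.keras.utils import to_categorical
from sklearn.neural_network import MLPClassifier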


# Define common parameters
train_size, vocab_size  = X_train.shape
print(X_train.shape)
  # Adjust based on your data
max_len  = vocab_size  # Adjust based on your data
numepochs  = 100

model_mlp1 = MLPClassifier(random_state=42, max_iter=50)


model_mlp2 = Sequential()
model_mlp2.add(Dense(500, input_dim=X_train.shape[1]))
model_mlp2.add(BatchNormalization())
model_mlp2.add(Activation(activation='sigmoid'))
model_mlp2.add(Dropout(0.2))
model_mlp2.add(Dense(250))
model_mlp2.add(BatchNormalization())
model_mlp2.add(Activation(activation='relu'))
model_mlp2.add(Dropout(0.2))
model_mlp2.add(Dense(100))
model_mlp2.add(BatchNormalization())
model_mlp2.add(Activation(activation='sigmoid'))
model_mlp2.add(Dropout(0.2))
model_mlp2.add(Dense(50))
model_mlp2.add(BatchNormalization())
model_mlp2.add(Activation(activation='sigmoid'))
model_mlp2.add(Dropout(0.2))
model_mlp2.add(Dense(1,activation=tf.keras.activations.sigmoid))
model_mlp2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define CNN model
model_cnn = Sequential()
model_cnn.add(Embedding(vocab_size, 128, input_length=max_len))
model_cnn.add(Conv1D(32, kernel_size=3, activation='relu'))
model_cnn.add(MaxPooling1D(pool_size=2))
model_cnn.add(Flatten())
model_cnn.add(Dense(128, activation='relu'))
model_cnn.add(Dense(len(set(y_train)), activation='softmax'))
model_cnn.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define RNN model
model_rnn = Sequential()
model_rnn.add(Embedding(vocab_size, 128, input_length=max_len))
model_rnn.add(LSTM(64, return_sequences=True))
model_rnn.add(LSTM(32))
model_rnn.add(Dense(128, activation='relu'))
model_rnn.add(Dense(len(set(y_train)), activation='softmax'))
model_rnn.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define LSTM model
model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size, 128, input_length=max_len))
model_lstm.add(LSTM(128))
model_lstm.add(Dense(64, activation='relu'))
model_lstm.add(Dense(len(set(y_train)), activation='softmax'))
model_lstm.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Evaluate and compare models
# Initialize classifiers
NN = {
    "Modern MLP": model_mlp2,
    "CNN": model_cnn,
    "Recurrent NN": model_rnn,
    "LSTM": model_lstm,
}

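# Loop through each classifier and evaluate performance
for name, clf in NN.items():
  clf.fit(X_train_val, y_train_val, epochs=numepochs)
  y_pred = clf.predict(X_test)
  if y_pred.ndim > 1 and y_pred.shape[1] > 1:
    # Softmax output: pick the most probable class
    y_pred = np.argmax(y_pred , axis=1)
  else:
    # Single sigmoid output: threshold the probability at 0.5
    y_pred = (y_pred >= 0.5).astype(int).ravel()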


  print_results(y_test , y_pred, name + ', '+ current_options)




ModuleNotFoundError: No module named 'tensorflow.keras'

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torchvision.transforms import Lambda
from torch.nn.functional import softmax
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from collections import Counter

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_filters, filter_sizes, num_classes, dropout_rate=0.5):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv_layers = nn.ModuleList([
            nn.Conv1d(in_channels=embedding_dim, out_channels=num_filters, kernel_size=fs)
            for fs in filter_sizes
        ])
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, inputs):
        x = self.embedding(inputs).permute(0, 2, 1)  # Permute to (batch_size, embedding_dim, sequence_length)
        conv_outputs = [conv(x) for conv in self.conv_layers]
        pooled_outputs = [torch.max(conv_output, dim=2)[0] for conv_output in conv_outputs]
        concatenated = torch.cat(pooled_outputs, dim=1)
        concatenated = self.dropout(concatenated)
        output = self.fc(concatenated)
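        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally,
        # and the argmax over logits gives the same predicted class
        return output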





In [None]:
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Reuse the sentences and labels loaded earlier in the notebook
df = all_df

# Example: Preprocess data
tokenizer = get_tokenizer('basic_english')
counter = Counter()
for line in df['sentence']:
    counter.update(tokenizer(line))
vocab = build_vocab_from_iterator([tokenizer(line) for line in df['sentence']], specials=['<unk>', '<pad>'])
vocab.set_default_index(vocab['<unk>'])

# Encode labels as integers and convert each sentence to a tensor of token ids
label_encoder = LabelEncoder()
x_data = []
y_data = label_encoder.fit_transform(df['label'])
for line in df['sentence']:
    x_data.append(torch.tensor([vocab[token] for token in tokenizer(line)]))

# Pad sequences with the <pad> index (the default padding value 0 is the <unk> index here)
x_data = nn.utils.rnn.pad_sequence(x_data, batch_first=True, padding_value=vocab['<pad>'])
y_data = torch.tensor(y_data)

# Split data into train/validation sets
x_data_train, x_data_val, y_data_train, y_data_val = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

# Example: Instantiate the TextCNN model
vocab_size = len(vocab)
embedding_dim = 128
num_filters = 128
filter_sizes = [3, 4, 5]
num_classes = 2
dropout_rate = 0.5

model = TextCNN(vocab_size, embedding_dim, num_filters, filter_sizes, num_classes, dropout_rate)

# Example: Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Example: Train the model
batch_size = 64
epochs = 10

for epoch in range(epochs):
    model.train()
    for i in range(0, len(x_data_train), batch_size):
        optimizer.zero_grad()
        batch_x, batch_y = x_data_train[i:i+batch_size], y_data_train[i:i+batch_size]
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        if (i // batch_size) % 20 == 0:  # report every 20 mini-batches
            print(f'Epoch [{epoch+1}/{epochs}], Step [{i+1}/{len(x_data_train)}], Loss: {loss.item():.4f}')

# Example: Evaluate the model on validation data
model.eval()
with torch.no_grad():
    outputs = model(x_data_val)
    _, predicted = torch.max(outputs, 1)
    correct = (predicted == y_data_val).sum().item()
    accuracy = correct / len(y_data_val)
    print(f'Validation Accuracy: {accuracy:.4f}')
    print_results(y_data_val,predicted , 'Text CNN, ' + current_options )

Validation Accuracy: 0.7461
Text CNN, tfidf,	Accuracy=0.75,	C0: Pr=0.75, Re=1.00, F1=0.85,	C1: Pr=0.00, Re=0.00, F1=0.00
something went wrong Length of values (1985) does not match length of index (833)


  _warn_prf(average, modifier, msg_start, len(result))


**Mixture of Experts**

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Instantiate base classifiers
rf_classifier = RandomForestClassifier()
gb_classifier = GradientBoostingClassifier()
lr_classifier = LogisticRegression()
mlp_classifier = MLPClassifier()

# Train base classifiers on your imbalanced dataset
rf_classifier.fit(X_train_val, y_train_val)
gb_classifier.fit(X_train_val, y_train_val)
lr_classifier.fit(X_train_val, y_train_val)
mlp_classifier.fit(X_train_val, y_train_val)

# Make predictions on the test set
rf_preds = rf_classifier.predict(X_test)
gb_preds = gb_classifier.predict(X_test)
lr_preds = lr_classifier.predict(X_test)
mlp_preds = mlp_classifier.predict(X_test)

# Combine predictions using equal-weight voting (two or more of the four votes give class 1)
ensemble_preds = (0.25 * rf_preds + 0.25 * gb_preds + 0.25 * lr_preds + 0.25 * mlp_preds)
y_pred = np.zeros(y_test.shape)
y_pred[ensemble_preds >= 0.5] =1
# Evaluate the ensemble
print(classification_report(y_test, y_pred))
print_results(y_test, y_pred, 'MoE, Eq. Weight, ' + current_options )


# Make predictions on the training set
y_preds_rf  = rf_classifier.predict(X_train_val).reshape(-1,1)
y_preds_gb  = gb_classifier.predict(X_train_val).reshape(-1,1)
y_preds_lr  = lr_classifier.predict(X_train_val).reshape(-1,1)
y_preds_mlp = mlp_classifier.predict(X_train_val).reshape(-1,1)
y_pred_train_val = np.concatenate((y_preds_rf,y_preds_gb, y_preds_lr, y_preds_mlp ), axis=1)

# Instantiate combiner (meta) classifiers that learn from the base classifiers' predictions
rf_combiner = RandomForestClassifier()
gb_combiner = GradientBoostingClassifier()
lr_combiner = LogisticRegression()
mlp_combiner = MLPClassifier()

rf_combiner.fit(y_pred_train_val, y_train_val)
gb_combiner.fit(y_pred_train_val, y_train_val)
lr_combiner.fit(y_pred_train_val, y_train_val)
mlp_combiner.fit(y_pred_train_val, y_train_val)

# Make predictions on the test set
y_pred_test_rf  = rf_classifier.predict(X_test).reshape(-1,1)
y_pred_test_gb  = gb_classifier.predict(X_test).reshape(-1,1)
y_pred_test_lr  = lr_classifier.predict(X_test).reshape(-1,1)
y_pred_test_mlp = mlp_classifier.predict(X_test).reshape(-1,1)
y_pred_test = np.concatenate((y_pred_test_rf, y_pred_test_gb, y_pred_test_lr, y_pred_test_mlp ), axis=1)

rfc_preds = rf_combiner.predict(y_pred_test)
gbc_preds = gb_combiner.predict(y_pred_test)
lrc_preds = lr_combiner.predict(y_pred_test)
mlpc_preds = mlp_combiner.predict(y_pred_test)

ensemblec_preds = (0.25 * rfc_preds + 0.25 * gbc_preds + 0.25 * lrc_preds + 0.25 * mlpc_preds)

# Evaluate the ensemble
print_results(y_test, prob2label(rfc_preds),  'MoE, rf-cmb, ' + current_options )
print_results(y_test, prob2label(gbc_preds),  'MoE, gb-cmb, ' + current_options )
print_results(y_test, prob2label(lrc_preds),  'MoE, lr-cmb, ' + current_options )
print_results(y_test, prob2label(mlpc_preds), 'MoE, mlpc-cmb, ' + current_options )
print_results(y_test, prob2label(ensemblec_preds), 'MoE, all-cmb, ' + current_options )



              precision    recall  f1-score   support

           0       0.96      0.96      0.96       746
           1       0.64      0.67      0.65        87

    accuracy                           0.93       833
   macro avg       0.80      0.81      0.81       833
weighted avg       0.93      0.93      0.93       833

MoE, Eq. Weight, tfidf,	Accuracy=0.93,	C0: Pr=0.96, Re=0.96, F1=0.96,	C1: Pr=0.64, Re=0.67, F1=0.65
MoE, rf-cmb, tfidf,	Accuracy=0.83,	C0: Pr=0.95, Re=0.86, F1=0.90,	C1: Pr=0.34, Re=0.63, F1=0.44
MoE, gb-cmb, tfidf,	Accuracy=0.84,	C0: Pr=0.96, Re=0.86, F1=0.91,	C1: Pr=0.36, Re=0.67, F1=0.47
MoE, lr-cmb, tfidf,	Accuracy=0.92,	C0: Pr=0.96, Re=0.96, F1=0.96,	C1: Pr=0.63, Re=0.63, F1=0.63
MoE, mlpc-cmb, tfidf,	Accuracy=0.84,	C0: Pr=0.96, Re=0.85, F1=0.90,	C1: Pr=0.36, Re=0.72, F1=0.48
MoE, all-cmb, tfidf,	Accuracy=0.84,	C0: Pr=0.96, Re=0.85, F1=0.90,	C1: Pr=0.35, Re=0.69, F1=0.47
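The combiners above are fitted on predictions that the base classifiers made for the same rows they were trained on, so the combiners see overly confident inputs. A common alternative is to feed the combiner out-of-fold predictions. The sketch below illustrates that idea on synthetic data; the dataset, the two base models, and all variable names are assumptions for illustration and are not the notebook's own setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic, imbalanced stand-in for the notebook's features (assumption)
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold probabilities: each training row is scored by a model
# that did not see that row during fitting
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on the full training split, then score the test split
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# The combiner learns how to weigh the base models' probabilities
combiner = LogisticRegression().fit(meta_train, y_tr)
stacked_pred = combiner.predict(meta_test)
print(meta_train.shape, meta_test.shape, stacked_pred.shape)
```

Whether this helps on the suggestion-mining data would need to be checked with the notebook's own features and `print_results`.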


In [None]:
print(classification_report(y_test,prob2label( mlpc_preds)))

**Word2vec Model**

In this set of experiments we build and test several well-known
word2vec and doc2vec models

In [None]:
#basic functions
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import get_tmpfile

#Get embedding of a document as the mean of embeddings of its words
def get_doc2vec(model ,doc ):
  tokens = doc.lower().split()
  vec = np.zeros(model.vector_size)
  num_tokens = 0
  for token in tokens:
    try:
      vec += model.get_vector(token)
      num_tokens += 1
    except KeyError:
      continue  # skip tokens that are not in the vocabulary

  if num_tokens > 0:
    vec /= num_tokens
  return vec.reshape(1,-1)

#Generate document vectors for all of the sentences:
def get_corpus_embeddings(model , documents):
  # Stack the (1, vector_size) document vectors into one (n_documents, vector_size) array
  X = np.vstack([get_doc2vec(model, doc) for doc in documents])

  return X
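The averaging scheme used by these helper functions can be seen on a toy example. The sketch below replaces the trained model with a hypothetical dictionary of 3-dimensional vectors; `toy_vectors` and `mean_embedding` are illustrative names, not part of the notebook or of gensim.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained model
toy_vectors = {
    "please": np.array([1.0, 0.0, 0.0]),
    "add": np.array([0.0, 1.0, 0.0]),
    "feature": np.array([0.0, 0.0, 1.0]),
}

def mean_embedding(vectors, doc, dim=3):
    """Average the vectors of the known tokens in doc; zeros if none are known."""
    known = [vectors[t] for t in doc.lower().split() if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

docs = ["Please add dark mode", "zzz qqq"]
# One row per document, one column per embedding dimension
X_toy = np.vstack([mean_embedding(toy_vectors, d) for d in docs])
print(X_toy)
```

Only "please" and "add" are known in the first sentence, so its vector is the mean of those two; the second sentence has no known tokens and maps to the zero vector.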

In [None]:
#gensim doc2vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import get_tmpfile

documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(all_df['sentence'].tolist())]
model = Doc2Vec(documents, vector_size=300, window=4, min_count=1, workers=100, dbow_words =1, epochs=100)

#Persist a model to disk:
fname = get_tmpfile("gensim_doc2vec_model")
model.save(fname)
model = Doc2Vec.load(fname)  # you can continue training with the loaded model!

#Generate document vectors for all of the sentences:
X_d2v = get_corpus_embeddings (model.wv, all_df['sentence'].tolist())

test_basic_models(X_d2v[:y_train_val.shape[0],:], y_train_val,
                  X_d2v[y_train_val.shape[0]:,:], y_test, 'gensim_doc2vec, ')



basic K-Nearest Neighbors, gensim_doc2vec, ,	Accuracy=0.88,	C0: Pr=0.95, Re=0.92, F1=0.93,	C1: Pr=0.44, Re=0.55, F1=0.49


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


basic Logistic Regression, gensim_doc2vec, ,	Accuracy=0.89,	C0: Pr=0.94, Re=0.95, F1=0.94,	C1: Pr=0.49, Re=0.44, F1=0.46
basic Support Vector Machine-L, gensim_doc2vec, ,	Accuracy=0.89,	C0: Pr=0.94, Re=0.95, F1=0.94,	C1: Pr=0.49, Re=0.45, F1=0.47
basic Support Vector Machine-R, gensim_doc2vec, ,	Accuracy=0.91,	C0: Pr=0.94, Re=0.95, F1=0.95,	C1: Pr=0.56, Re=0.52, F1=0.54
basic Support Vector Machine-S, gensim_doc2vec, ,	Accuracy=0.76,	C0: Pr=0.91, Re=0.81, F1=0.86,	C1: Pr=0.15, Re=0.30, F1=0.20
basic Support Vector Machine-WL, gensim_doc2vec, ,	Accuracy=0.63,	C0: Pr=0.98, Re=0.60, F1=0.75,	C1: Pr=0.21, Re=0.92, F1=0.34
basic Support Vector Machine-WR, gensim_doc2vec, ,	Accuracy=0.73,	C0: Pr=0.99, Re=0.71, F1=0.83,	C1: Pr=0.27, Re=0.93, F1=0.42
basic Support Vector Machine-WS, gensim_doc2vec, ,	Accuracy=0.51,	C0: Pr=0.97, Re=0.46, F1=0.63,	C1: Pr=0.16, Re=0.86, F1=0.27
basic Decision Tree classifier, gensim_doc2vec, ,	Accuracy=0.80,	C0: Pr=0.94, Re=0.82, F1=0.88,	C1: Pr=0.27, Re=0.57, F1

In [None]:
docs=[]
for doc in documents:
  #print(doc.split())
  docs += [doc.split()]

print(docs)
#alldocs =[[doc.split()], for doc in documents]
print(len(documents))

9925


In [None]:
#gensim FastText
from gensim.models import FastText
from gensim.test.utils import get_tmpfile

documents = all_df['sentence'].tolist()
docs=[]
for doc in documents:
  #print(doc.split())
  docs += [doc.split()]

model_fasttext = FastText(vector_size=40, window=3, min_count=1, sentences=docs, epochs=100)

fname = get_tmpfile("suggestion_fasttext.model")
model_fasttext.save(fname)
model_fasttext = FastText.load(fname)

#Generate document vectors for all of the sentences:
X_ft = get_corpus_embeddings (model_fasttext.wv, all_df['sentence'].tolist())

test_basic_models(X_ft[:y_train_val.shape[0],:], y_train_val,
                  X_ft[y_train_val.shape[0]:,:], y_test, 'gensim-fasttext-words')


basic K-Nearest Neighbors, gensim-fasttext-words,	Accuracy=0.86,	C0: Pr=0.94, Re=0.90, F1=0.92,	C1: Pr=0.39, Re=0.52, F1=0.44
basic Logistic Regression, gensim-fasttext-words,	Accuracy=0.87,	C0: Pr=0.92, Re=0.94, F1=0.93,	C1: Pr=0.37, Re=0.31, F1=0.34
basic Support Vector Machine-L, gensim-fasttext-words,	Accuracy=0.88,	C0: Pr=0.92, Re=0.95, F1=0.93,	C1: Pr=0.38, Re=0.24, F1=0.29
basic Support Vector Machine-R, gensim-fasttext-words,	Accuracy=0.91,	C0: Pr=0.94, Re=0.96, F1=0.95,	C1: Pr=0.58, Re=0.47, F1=0.52
basic Support Vector Machine-S, gensim-fasttext-words,	Accuracy=0.77,	C0: Pr=0.92, Re=0.82, F1=0.87,	C1: Pr=0.20, Re=0.40, F1=0.27
basic Support Vector Machine-WL, gensim-fasttext-words,	Accuracy=0.46,	C0: Pr=0.99, Re=0.40, F1=0.57,	C1: Pr=0.16, Re=0.95, F1=0.27
basic Support Vector Machine-WR, gensim-fasttext-words,	Accuracy=0.61,	C0: Pr=0.99, Re=0.57, F1=0.73,	C1: Pr=0.20, Re=0.93, F1=0.33
basic Support Vector Machine-WS, gensim-fasttext-words,	Accuracy=0.47,	C0: Pr=0.96, Re=0.42

In [None]:
#Test pretrained word2vec models:
import gensim.downloader
models =['fasttext-wiki-news-subwords-300',
         'conceptnet-numberbatch-17-06-300',
         'word2vec-ruscorpora-300',
         'word2vec-google-news-300',
         'glove-wiki-gigaword-50',
         'glove-wiki-gigaword-100',
         'glove-wiki-gigaword-200',
         'glove-wiki-gigaword-300',
         'glove-twitter-25',
         'glove-twitter-50',
         'glove-twitter-100',
         'glove-twitter-200',
         '__testing_word2vec-matrix-synopsis',
         ]

#Play with pretrained word2vec embeddings
for model_name in models:
  print (model_name)
  model_fname = model_name + ".model"
  model_pretrained = gensim.downloader.load(model_name)


  #Generate document vectors for all of the sentences:
  X_gl = get_corpus_embeddings (model_pretrained, all_df['sentence'].tolist())

  test_basic_models(X_gl[:y_train_val.shape[0],:], y_train_val,
                    X_gl[y_train_val.shape[0]:,:], y_test, model_name)


fasttext-wiki-news-subwords-300

KeyboardInterrupt: 

In [None]:
classifier = MLPClassifier(hidden_layer_sizes=(50,25,20,10,5),
                           max_iter=100,activation = 'relu',
                           solver='adam',random_state=100).fit(X_d2v[0:y_train_val.shape[0],:], y_train_val)

y_pred = classifier.predict(X_d2v)
print(confusion_matrix(y_all,y_pred))
print(classification_report(y_all,y_pred))
print_results(y_all , y_pred , 'NN-all')


y_pred = classifier.predict(X_d2v[y_train_val.shape[0]:,:])
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print_results(y_test , y_pred , 'NN-test')




[[7347  110]
 [  65 2403]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      7457
           1       0.96      0.97      0.96      2468

    accuracy                           0.98      9925
   macro avg       0.97      0.98      0.98      9925
weighted avg       0.98      0.98      0.98      9925

NN-all,	Accuracy=0.98,	C0: Pr=0.99, Re=0.99, F1=0.99,	C1: Pr=0.96, Re=0.97, F1=0.96
something went wrong Length of values (9925) does not match length of index (833)
[[676  70]
 [ 38  49]]
              precision    recall  f1-score   support

           0       0.95      0.91      0.93       746
           1       0.41      0.56      0.48        87

    accuracy                           0.87       833
   macro avg       0.68      0.73      0.70       833
weighted avg       0.89      0.87      0.88       833

NN-test,	Accuracy=0.87,	C0: Pr=0.95, Re=0.91, F1=0.93,	C1: Pr=0.41, Re=0.56, F1=0.48


  results_df.insert(len(results_df.columns),description, y_pred)


In [None]:
classifier = MLPClassifier(hidden_layer_sizes=(50,25,20,10,5),
                           max_iter=100,activation = 'relu',
                           solver='adam',random_state=100).fit(X_train_val, y_train_val)

y_pred = classifier.predict(X_all)
print(confusion_matrix(y_all,y_pred))
print(classification_report(y_all,y_pred))
print_results(y_all , y_pred , 'NN-all')


y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print_results(y_test , y_pred , 'NN-test_orig')






[[7335  122]
 [  81 2387]]
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      7457
           1       0.95      0.97      0.96      2468

    accuracy                           0.98      9925
   macro avg       0.97      0.98      0.97      9925
weighted avg       0.98      0.98      0.98      9925

NN-all,	Accuracy=0.98,	C0: Pr=0.99, Re=0.98, F1=0.99,	C1: Pr=0.95, Re=0.97, F1=0.96
something went wrong Length of values (9925) does not match length of index (833)
[[650  96]
 [ 31  56]]
              precision    recall  f1-score   support

           0       0.95      0.87      0.91       746
           1       0.37      0.64      0.47        87

    accuracy                           0.85       833
   macro avg       0.66      0.76      0.69       833
weighted avg       0.89      0.85      0.86       833

NN-test_orig,	Accuracy=0.85,	C0: Pr=0.95, Re=0.87, F1=0.91,	C1: Pr=0.37, Re=0.64, F1=0.47


In [None]:
!pip install tensorflow

Collecting tensorflow
  Using cached tensorflow-2.13.1-cp38-cp38-win_amd64.whl.metadata (2.6 kB)
INFO: pip is looking at multiple versions of tensorflow to determine which version is compatible with other requirements. This could take a while.
  Using cached tensorflow-2.13.0-cp38-cp38-win_amd64.whl.metadata (2.6 kB)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow-intel==2.13.0->tensorflow)
  Using cached protobuf-4.25.3-cp38-cp38-win_amd64.whl.metadata (541 bytes)
Collecting typing-extensions<4.6.0,>=3.6.6 (from tensorflow-intel==2.13.0->tensorflow)
  Using cached typing_extensions-4.5.0-py3-none-any.whl.metadata (8.5 kB)
Collecting tensorboard<2.14,>=2.13 (from tensorflow-intel==2.13.0->tensorflow)
  Using cached tensorboard-2.13.0-py3-none-any.whl.metadata (1.8 kB)
Collecting tensorflow-estimator<2.14,>=2.13.0 (from tensorflow-intel==2.13.0->tensorflow)
  Using cached tensorflow_estimator-2.13.0-py2.py3-none-any.whl.metada

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sqlalchemy 2.0.28 requires typing-extensions>=4.6.0, but you have typing-extensions 4.5.0 which is incompatible.
pydantic 2.6.4 requires typing-extensions>=4.6.1, but you have typing-extensions 4.5.0 which is incompatible.
pydantic-core 2.16.3 requires typing-extensions!=4.7.0,>=4.6.0, but you have typing-extensions 4.5.0 which is incompatible.
tensorflow-metadata 1.13.0 requires protobuf<4,>=3.13, but you have protobuf 4.25.3 which is incompatible.
tensorflow-text 2.10.0 requires tensorflow<2.11,>=2.10.0; platform_machine != "arm64" or platform_system != "Darwin", but you have tensorflow 2.13.0 which is incompatible.
tf-models-official 2.10.1 requires tensorflow~=2.10.0, but you have tensorflow 2.13.0 which is incompatible.
torch 2.2.1+cu121 requires typing-extensions>=4.8.0, but you have typing-extensions 4.5.0 

In [None]:
import tensorflow
from tensorflow import keras
from tensorflow.keras import layers

# Define model architecture; the input size must match the document-vector size
model = keras.Sequential([
  layers.Dense(256, activation="relu", input_shape=(X_d2v.shape[1],)),
  layers.Dense(128, activation="relu"),
  layers.Dense(1, activation="sigmoid")
])

# Weight class 1 (suggestions) more heavily to counter the class imbalance
class_weights ={0:1,1:100}
# The output layer already applies a sigmoid, so the loss takes probabilities (from_logits=False)
model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=["accuracy"])

# Train the model; class_weight applies the per-class weights to the loss
model.fit(X_d2v[0:y_train_val.shape[0],:], y_train_val, epochs=100, class_weight=class_weights)

# Classify new documents
y_pred = model.predict(X_d2v) >= 0.5
print(confusion_matrix(y_all,y_pred))
print(classification_report(y_all,y_pred))
print_results(y_all , y_pred , 'NN-all2')


y_pred = model.predict(X_d2v[y_train_val.shape[0]:,:]) >= 0.5
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print_results(y_test , y_pred , 'NN-test2')


ImportError: cannot import name 'keras' from 'tensorflow' (unknown location)

In [None]:
import tensorflow
dir(tensorflow)

['__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__']

In [None]:
classifier = MLPClassifier(hidden_layer_sizes=(150,100,50),
                           max_iter=100,activation = 'relu',
                           solver='adam',random_state=100).fit(X_train_val, y_train_val)

y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

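# Note: each call below refits the classifier on that split, so every curve is the
# training loss of a separate model, not a validation or test loss of the model above
classifier.fit(X_train,y_train); plt.plot(classifier.loss_curve_,label="fit on train")
classifier.fit(X_valid,y_valid); plt.plot(classifier.loss_curve_,label="fit on validation")
classifier.fit(X_test,y_test); plt.plot(classifier.loss_curve_,label="fit on test")

plt.xlabel("Iteration")
plt.ylabel("Training loss");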
plt.legend(loc='upper right')
plt.title('mlp-tfidf-training')
plt.show()

[[685  61]
 [ 27  60]]
              precision    recall  f1-score   support

           0       0.96      0.92      0.94       746
           1       0.50      0.69      0.58        87

    accuracy                           0.89       833
   macro avg       0.73      0.80      0.76       833
weighted avg       0.91      0.89      0.90       833
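The loss curves in the cell above come from refitting the classifier on each split, so they do not show how a single model's validation loss evolves during training. One possible way to track both losses for the same model is to train epoch by epoch with `partial_fit` and score each split with `log_loss`. The sketch below assumes synthetic data and a small network; the variable names and sizes are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the notebook's features (assumption)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
train_losses, val_losses = [], []
for epoch in range(20):
    # Each partial_fit call runs one pass over the data and keeps the weights
    clf.partial_fit(X_tr, y_tr, classes=np.array([0, 1]))
    train_losses.append(log_loss(y_tr, clf.predict_proba(X_tr)))
    val_losses.append(log_loss(y_va, clf.predict_proba(X_va)))

print(len(train_losses), len(val_losses))
```

The two lists can then be passed to `plt.plot` in the same way as `loss_curve_`.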



**Balanced Classification**

This experiment trains an ensemble of random forests, each on a balanced subset of the training data.
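As a rough illustration of what a balanced subset means here, the sketch below keeps every class-1 row and draws an equal number of class-0 rows. The helper name `balanced_subsample` and the demo arrays are hypothetical, and this version samples without replacement, whereas the notebook's own cell samples with replacement.

```python
import numpy as np

def balanced_subsample(X, y, rng):
    """Keep every class-1 row and draw an equally sized random set of class-0 rows."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Draw as many negatives as there are positives, without replacement
    drawn_neg = rng.choice(neg_idx, size=pos_idx.size, replace=False)
    keep = np.concatenate([pos_idx, drawn_neg])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X_demo = np.arange(20).reshape(10, 2)          # 10 rows, 2 features
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
X_bal, y_bal = balanced_subsample(X_demo, y_demo, rng)
print(X_bal.shape, y_bal.mean())
```

Calling the helper repeatedly with the same generator yields a different negative sample each time, which is what gives the ensemble members their diversity.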

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


# Separate instances for class 1
class_1_instances = X_train_val[y_train_val == 1,:]
class_0_instances = X_train_val[y_train_val == 0,:]
num_classifiers = 15
classifiers = []
number_of_samples = class_1_instances.shape[0]
# Build an ensemble of classifiers (Random Forests in this example)

for ks in range(1,num_classifiers+1):
  # Randomly sample (with replacement) as many class 0 instances as there are class 1 instances
  indices = np.random.choice(class_0_instances.shape[0], number_of_samples, replace=True)
  sampled_class_0_instances = class_0_instances[indices,:]

  # Combine instances for class 1 and sampled instances from class 0
  balanced_X = np.concatenate([class_1_instances, sampled_class_0_instances])
  balanced_y = np.concatenate([np.ones(class_1_instances.shape[0]), np.zeros(sampled_class_0_instances.shape[0])])

  classifier = RandomForestClassifier(n_estimators=101, random_state=100)
  #classifier = sklearn.linear_model.LogisticRegression(random_state=42)
  classifier.fit(balanced_X, balanced_y)
  classifiers.append(classifier)
  y_pred = classifier.predict(X_test)
  print_results(y_test , y_pred,  f'ensemble{ks} ' + current_options)

# Make predictions on the test set using each classifier
predictions = [classifier.predict(X_test) for classifier in classifiers]

# Take a majority vote to get the final ensemble prediction
ensemble_predictions = np.mean(predictions, axis=0) > 0.5

# Evaluate the ensemble performance
ensemble_accuracy = accuracy_score(y_test, ensemble_predictions)
print(f'Ensemble Accuracy: {ensemble_accuracy}')
print_results(y_test , ensemble_predictions,  'ensemble_total ' + current_options)


ensemble1 bow,	Accuracy=0.89,	C0: Pr=0.98, Re=0.89, F1=0.93,	C1: Pr=0.48, Re=0.87, F1=0.62
ensemble2 bow,	Accuracy=0.89,	C0: Pr=0.98, Re=0.89, F1=0.93,	C1: Pr=0.48, Re=0.87, F1=0.62
ensemble3 bow,	Accuracy=0.88,	C0: Pr=0.98, Re=0.88, F1=0.93,	C1: Pr=0.45, Re=0.82, F1=0.58
ensemble4 bow,	Accuracy=0.89,	C0: Pr=0.98, Re=0.90, F1=0.94,	C1: Pr=0.49, Re=0.86, F1=0.62
ensemble5 bow,	Accuracy=0.89,	C0: Pr=0.98, Re=0.89, F1=0.93,	C1: Pr=0.47, Re=0.87, F1=0.62
ensemble6 bow,	Accuracy=0.89,	C0: Pr=0.98, Re=0.89, F1=0.93,	C1: Pr=0.48, Re=0.84, F1=0.61
ensemble7 bow,	Accuracy=0.88,	C0: Pr=0.98, Re=0.88, F1=0.93,	C1: Pr=0.47, Re=0.87, F1=0.61
ensemble8 bow,	Accuracy=0.89,	C0: Pr=0.98, Re=0.89, F1=0.94,	C1: Pr=0.48, Re=0.86, F1=0.62
ensemble9 bow,	Accuracy=0.88,	C0: Pr=0.98, Re=0.89, F1=0.93,	C1: Pr=0.47, Re=0.85, F1=0.60
ensemble10 bow,	Accuracy=0.89,	C0: Pr=0.98, Re=0.90, F1=0.94,	C1: Pr=0.49, Re=0.85, F1=0.62
ensemble11 bow,	Accuracy=0.87,	C0: Pr=0.98, Re=0.87, F1=0.92,	C1: Pr=0.43, Re=0.86, F1=0.57
ensemble12 bow,	Accuracy=0.89,	C0: Pr=0.98, Re=0.89, F1=0.93,	C1: Pr=0.48, Re=0.85, F1=0.61
ensemble13 bow,	Accuracy=0.87,	C0: Pr=0.98, Re=0.88, F1=0.93,	C1: Pr=0.44, Re=0.83, F1=0.58
ensemble14 bow,	Accuracy=0.88,	C0: Pr=0.98, Re=0.89, F1=0.93,	C1: Pr=0.46, Re=0.84, F1=0.60
ensemble15 bow,	Accuracy=0.88,	C0: Pr=0.98, Re=0.88, F1=0.93,	C1: Pr=0.46, Re=0.84, F1=0.59
Ensemble Accuracy: 0.8883553421368547
ensemble_total bow,	Accuracy=0.89,	C0: Pr=0.98, Re=0.89, F1=0.93,	C1: Pr=0.48, Re=0.87, F1=0.62


**Longest Common Subsequence**

In [None]:
def longest_common_subsequence(str1, str2):
    words1 = str1.split()
    words2 = str2.split()

    m = len(words1)
    n = len(words2)

    # Initializing the dp table with zeros
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Building the dp table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if words1[i - 1] == words2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Backtracking to find the longest common subsequence
    lcs_length = dp[m][n]
    lcs = []
    i = m
    j = n
    while i > 0 and j > 0:
        if words1[i - 1] == words2[j - 1]:
            lcs.append ( words1[i - 1])
            i -= 1
            j -= 1
            lcs_length -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            i -= 1
        else:
            j -= 1

    lcs.reverse()
    return lcs , (len(lcs))/ (0.0001+m) # the 0.0001 in the denominator prevents division by zero
# Example usage:
str1        = "roses are red. violets are blue"
str2        = "the garden is full of roses and violets that are blue "
lcs12 , r12 = longest_common_subsequence(str1, str2)
print(f"Longest Common Subsequence:{r12}: {lcs12} ")


Longest Common Subsequence:0.6666555557407376: ['roses', 'violets', 'are', 'blue'] 
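The ratio printed above is the LCS length divided by the number of words in the first string (4 of 6 words), so swapping the arguments changes the score. The sketch below recomputes the example with a two-row dynamic program and also reports a ratio normalized by the longer sentence. The names `lcs_length` and `lcs_ratios` are hypothetical, and the second normalization is only one possible alternative, not something the notebook uses.

```python
def lcs_length(words1, words2):
    """Length of the longest common subsequence, keeping only two DP rows."""
    prev = [0] * (len(words2) + 1)
    for w1 in words1:
        curr = [0]
        for j, w2 in enumerate(words2, start=1):
            if w1 == w2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_ratios(str1, str2):
    """Return (length, ratio by first string, ratio by longer string)."""
    w1, w2 = str1.split(), str2.split()
    length = lcs_length(w1, w2)
    by_first = length / len(w1) if w1 else 0.0
    longer = max(len(w1), len(w2))
    by_longer = length / longer if longer else 0.0
    return length, by_first, by_longer

str1 = "roses are red. violets are blue"
str2 = "the garden is full of roses and violets that are blue "
length, by_first, by_longer = lcs_ratios(str1, str2)
print(length, round(by_first, 4), round(by_longer, 4))
```

Dividing by the first string's length suits the question "how much of this training suggestion reappears in the test sentence", while dividing by the longer length penalizes very long test sentences.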


In [None]:
import sys
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords_list(tokens):
  filtered_tokens = [w for w in tokens if not w.lower() in stop_words]
  return filtered_tokens

def remove_stopwords_str(text):
  # Replace every run of non-alphanumeric characters with a single space
  text = re.sub(r'[^A-Za-z0-9]+', ' ', text)

  tokens = text.split()
  return ' '.join(remove_stopwords_list(tokens))

#--------------------
#

do_lcs = True # switch this flag if you want to run lcs

if do_lcs:
  train = train_df['sentence'].tolist(); train  = [remove_stopwords_str(s) for s in train ]
  test  = test_df['sentence'].tolist();  test   = [remove_stopwords_str(s) for s in test ]
  valid = valid_df['sentence'].tolist(); valid  = [remove_stopwords_str(s) for s in valid ]

  X = train + valid ;   y = y_train_val
  num_train = len(X);   num_test  = len(test)

  with open('train_test_lcs03.txt', 'w') as f:
    for m in range(num_train):
      if y[m] == 1:
        for n in range(num_test):
          temp, r = longest_common_subsequence (X[m], test[n])
          if r >= 0.01:
            print(f"{m}\t{n}\t{y[m]}\t{y_test[n]}\t{r:0.02f}\t{len(temp)}\t{temp}",file=f)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mmr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
import numpy as np
from nltk.corpus import wordnet as wn

def wordnet_path_similarity(word1, word2):
    synsets1 = wn.synsets(word1)
    synsets2 = wn.synsets(word2)
    if not synsets1 or not synsets2:
        return 0
    else:
        max_similarity = 0
        for synset1 in synsets1:
            for synset2 in synsets2:
                path_similarity = synset1.path_similarity(synset2)
                if path_similarity is not None and path_similarity > max_similarity:
                    max_similarity = path_similarity
        return max_similarity

def dtw_distance(s1, s2, similarity_function=wordnet_path_similarity):
    len_s1, len_s2 = len(s1), len(s2)
    dtw_matrix = np.zeros((len_s1 + 1, len_s2 + 1))

    # Initialize the DTW matrix with infinity
    for i in range(len_s1 + 1):
        for j in range(len_s2 + 1):
            dtw_matrix[i, j] = float('inf')

    dtw_matrix[0, 0] = 0

    # Calculate DTW matrix
    for i in range(1, len_s1 + 1):
        for j in range(1, len_s2 + 1):
            cost = 1 - similarity_function(s1[i - 1], s2[j - 1])  # Using WordNet similarity as cost
            dtw_matrix[i, j] = cost + min(dtw_matrix[i - 1, j], dtw_matrix[i, j - 1], dtw_matrix[i - 1, j - 1])

    return dtw_matrix[len_s1, len_s2]

def dtw_distance_str(s1, s2):
  return dtw_distance(s1.split() , s2.split())

# Example lists of strings
list1 = ['cat', 'dog', 'fish']
list2 = ['cat', 'fish', 'bird']

# Calculate the DTW distance (lower means more similar)
dtw_distance_score = dtw_distance(list1, list2)

print("DTW Similarity:", dtw_distance_score)


DTW Similarity: 1.55
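The value printed above is a distance, so smaller means more similar. The pairwise comparison in this notebook rescales it to a similarity score, r = 1 - d / (n1 + n2), where n1 and n2 are the word counts. The sketch below shows that rescaling with an exact-match similarity in place of WordNet so that it runs without any corpus download. `exact_match_similarity` and `dtw_similarity` are hypothetical names, and returning 0.0 for an empty sentence is an assumption made here, not the notebook's behavior.

```python
import numpy as np

def exact_match_similarity(w1, w2):
    """Stand-in for a WordNet similarity: 1.0 for identical words, else 0.0."""
    return 1.0 if w1 == w2 else 0.0

def dtw_distance(s1, s2, similarity_function=exact_match_similarity):
    n1, n2 = len(s1), len(s2)
    dtw = np.full((n1 + 1, n2 + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            cost = 1.0 - similarity_function(s1[i - 1], s2[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return dtw[n1, n2]

def dtw_similarity(s1, s2, similarity_function=exact_match_similarity):
    """Rescale the DTW distance to a similarity: 1 - d / (n1 + n2)."""
    if not s1 or not s2:
        return 0.0  # assumption: an empty sentence matches nothing
    return 1.0 - dtw_distance(s1, s2, similarity_function) / (len(s1) + len(s2))

list1 = ['cat', 'dog', 'fish']
list2 = ['cat', 'fish', 'bird']
d = dtw_distance(list1, list2)
r = dtw_similarity(list1, list2)
print(d, round(r, 4))
```

With exact matching the distance for this pair is larger than the WordNet-based 1.55, because partially related words such as `dog` and `fish` receive no credit.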


In [None]:
#--------------------
#

do_dtw = True # switch this flag if you want to run dtw

if do_dtw:
  train = train_df['sentence'].tolist(); train  = [remove_stopwords_str(s) for s in train ]
  test  = test_df['sentence'].tolist();  test   = [remove_stopwords_str(s) for s in test ]
  valid = valid_df['sentence'].tolist(); valid  = [remove_stopwords_str(s) for s in valid ]

  X = train + valid ;   y = y_train_val
  num_train = len(X);   num_test  = len(test)

  with open('train_test_dtw02.txt', 'w') as f2:
    for m in range(num_train):
      if y[m] == 1:
        s1 = X[m].split() ; n1 = len(s1)
        for n in range(num_test):
          s2 = test[n].split() ; n2 = len(s2)
          r = 1- (dtw_distance (s1, s2) /(n1+n2 + 0.00000001))
          #print(f"{m}\t{n}\t{y[m]}\t{y_test[n]}\t{r:0.02f}")
          print(f"{m}\t{n}\t{y[m]}\t{n1}\t{n2}\t{y_test[n]}\t{r:0.02f}",file=f2)



**A test of BiLSTM**

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Convert numpy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_val_tensor = torch.tensor(X_test, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
y_val_tensor = torch.tensor(y_test, dtype=torch.long)

# Create DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

# Define BiLSTM model
class BiLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout):
        super(BiLSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=1, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # Each document is a single feature vector, so add a sequence axis of length 1:
        # (batch, features) -> (batch, 1, features)
        if text.dim() == 2:
            text = text.unsqueeze(1)
        output, _ = self.lstm(text)
        # Last step of the forward direction and first step of the backward direction
        hidden = torch.cat((output[:, -1, :self.hidden_dim], output[:, 0, self.hidden_dim:]), dim=1)
        return self.fc(self.dropout(hidden))


# Define model parameters
input_dim = X_train_tensor.shape[1]  # number of features per document
embedding_dim = 100
hidden_dim = 128
output_dim = 2  # Assuming binary classification
dropout = 0.5

# Initialize model, loss function, and optimizer
model = BiLSTM(input_dim, hidden_dim, output_dim, dropout)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * inputs.size(0)
    epoch_loss = running_loss / len(train_loader.dataset)
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}')

# Evaluation (the tensors named *_val_* hold the test split)
model.eval()  # Set model to evaluation mode
with torch.no_grad():
    outputs = model(X_val_tensor)
    _, predicted = torch.max(outputs, 1)
    accuracy = (predicted == y_val_tensor).sum().item() / len(y_val_tensor)
    print(f'Test Accuracy: {accuracy:.4f}')

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

def create_lstm_model(max_len, vocab_size, embedding_dim, num_lstm_units):
  """
  Creates a BiLSTM model for imbalanced document classification.

  Args:
      max_len: Maximum sequence length of documents.
      vocab_size: Size of the vocabulary.
      embedding_dim: Dimensionality of word2vec embeddings.
      num_lstm_units: Number of units in the LSTM layer.

  Returns:
      A compiled TensorFlow Keras model.
  """

  # Embedding layer for word2vec vectors
  inputs = tf.keras.Input(shape=(max_len,))
  embeddings = Embedding(vocab_size, embedding_dim, input_length=max_len)(inputs)

  # Bidirectional LSTM layer for capturing long-range dependencies in both directions
  lstm = Bidirectional(LSTM(num_lstm_units, return_sequences=True))(embeddings)

  # Global max pooling to extract the most informative features
  x = tf.keras.layers.GlobalMaxPooling1D()(lstm)

  # Dense layer for classification
  outputs = Dense(1, activation='sigmoid')(x)  # Sigmoid for binary classification

  # Model with Adam optimizer (consider experimenting with other optimizers)
  model = tf.keras.Model(inputs=inputs, outputs=outputs)
  model.compile(loss='binary_crossentropy',  # Binary classification loss; imbalance is handled by class_weight in fit()
                optimizer='adam',
                metrics=['accuracy'])

  return model

# Example usage (replace with your actual data)
max_len = 100  # Adjust based on your data
vocab_size = 10000  # Adjust based on your vocabulary
embedding_dim = 300  # Adjust based on your word2vec embeddings
num_lstm_units = 128  # Adjust based on your dataset complexity

# Load your pre-trained word2vec embeddings (not shown here)
word2vec_embeddings = X_d2v

# Prepare your imbalanced training data (X: sequences, y: labels)
X_train, y_train = X_d2v[0:y_train_val.shape[0],:], y_train_val

# Class weights for handling imbalanced data (optional)
class_weights = {0:1,1:10} #compute_class_weights(y_train)  # Replace with your weight calculation

model = create_lstm_model(max_len, vocab_size, embedding_dim, num_lstm_units)

# Train the model with appropriate class weights (if applicable)
model.fit(X_train, y_train, epochs=10, class_weight=class_weights)  # Adjust epochs

# Evaluate the model on your validation or test set
model.evaluate(X_test, y_test)


Epoch 1/10


ValueError: in user code:

    File "c:\python\python38\lib\site-packages\keras\engine\training.py", line 1160, in train_function  *
        return step_function(self, iterator)
    File "c:\python\python38\lib\site-packages\keras\engine\training.py", line 1146, in step_function  **
        
    File "c:\python\python38\lib\site-packages\keras\engine\training.py", line 1135, in run_step  **
        
    File "c:\python\python38\lib\site-packages\keras\engine\training.py", line 993, in train_step
        
    File "c:\python\python38\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
        
    File "c:\python\python38\lib\site-packages\keras\engine\input_spec.py", line 295, in assert_input_compatibility
        

    ValueError: Input 0 of layer "model_2" is incompatible with the layer: expected shape=(None, 100), found shape=(None, 300)
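
The error above is a shape mismatch: the model was built for sequences of `max_len = 100` token ids, while each row of `X_d2v` is a 300-dimensional document vector. An `Embedding` layer expects integer word indices, not dense vectors. One possible way to prepare such input is sketched below with plain Python and NumPy; the helpers `build_vocab` and `encode_and_pad` and the sample sentences are hypothetical, and the id convention (0 for padding, 1 for unknown words) is an assumption.

```python
from collections import Counter
import numpy as np

def build_vocab(sentences, vocab_size):
    """Map the most frequent words to ids 2..vocab_size-1 (0 = padding, 1 = unknown)."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    most_common = counts.most_common(vocab_size - 2)
    return {w: i + 2 for i, (w, _) in enumerate(most_common)}

def encode_and_pad(sentences, vocab, max_len):
    """Turn sentences into an integer matrix of shape (n_sentences, max_len)."""
    X = np.zeros((len(sentences), max_len), dtype=np.int64)
    for row, s in enumerate(sentences):
        ids = [vocab.get(w, 1) for w in s.lower().split()][:max_len]
        X[row, :len(ids)] = ids
    return X

# Made-up sentences, for illustration only
sentences = ["please add dark mode", "add export"]
vocab = build_vocab(sentences, vocab_size=10)
X_seq = encode_and_pad(sentences, vocab, max_len=5)
X_new = encode_and_pad(["remove ads"], vocab, max_len=5)
print(X_seq.shape)
```

A matrix built this way with `max_len=100` would have the `(None, 100)` shape the model expects. Alternatively, the doc2vec vectors could be fed to a model without an `Embedding` layer.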
