
# Assignment 4: Text Classification on TREC dataset

We are going to use the TREC dataset for this assignment, which is widely considered a benchmark text classification dataset. Read about the TREC dataset here (https://huggingface.co/datasets/CogComp/trec), and also google it to understand it better.

This is what you have to do - use the concepts we have covered so far to accurately predict the 6 coarse labels (if you have googled TREC, you will surely know what I mean) in the test dataset. Train on the train dataset and give results on the test dataset, as simple as that. And experiment, experiment and experiment!

Your experimentation should be 4-tiered-

i) Experiment with preprocessing techniques (different types of stemming, lemmatizing, or do neither and keep the words pure). Needless to say, certain things, like stopword removal, should be common to all the preprocessing pipelines you come up with. Remember: never do stemming and lemmatization together. Note - To find out the best preprocessing technique, use a simple baseline model, say CountVectorizer (BoW) + Logistic Regression, and see which gives the best accuracy. Then proceed with only that preprocessing technique for all the other models.

ii) Try out various vectorisation techniques (BoW, TF-IDF, CBoW, Skip-gram, GloVe, FastText, etc., but transformer models are not allowed) -- At least 5 different types

iii) Tinker with various strategies to combine the word vectors (taking the mean, using an RNN/LSTM, and the other strategies I hinted at at the end of the last session). Note that this applies only to the advanced embedding techniques which generate word embeddings. -- At least 3 different types, one of which should definitely be RNN/LSTM

iv) Finally, experiment with the ML classifier model, which will take the final vector representation of each TREC question and generate the label. E.g. - Logistic regression, decision trees, simple neural network, etc. -- At least 4 different models
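
As a rough illustration of tier (iii), the sketch below pools the word vectors of one question in three ways: mean, element-wise max, and a weighted mean (for example with IDF weights, so that rarer words count more). The function names, the toy 3-dimensional vectors and the weights are all made up for this example; they are assumptions, not part of the assignment or of any library.

```python
import numpy as np

def mean_pool(vectors):
    """Average the word vectors of one question."""
    return np.mean(vectors, axis=0)

def max_pool(vectors):
    """Take the element-wise maximum over the word vectors."""
    return np.max(vectors, axis=0)

def weighted_mean_pool(vectors, weights):
    """Weighted average of the word vectors (weights could be IDF scores)."""
    weights = np.asarray(weights, dtype=float)
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()

# Toy question with three words and hypothetical 3-dimensional embeddings.
word_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 1.0],
])
word_weights = [2.0, 1.0, 1.0]  # assumed weights, one per word

mean_vec = mean_pool(word_vectors)
max_vec = max_pool(word_vectors)
weighted_vec = weighted_mean_pool(word_vectors, word_weights)
```

All three return a fixed-length vector regardless of question length, which is what lets any classifier from tier (iv) consume them.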

So applying some PnC (permutations and combinations), in total you should get more than 40 different combinations. Print out the accuracies of all these combinations in a well-formatted table, and pronounce one of them the best. Also feel free to experiment with more models/embedding techniques than what I have listed here; the goal, after all, is to achieve the highest accuracy, as long as you don't use transformers. Happy experimenting!
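
The "more than 40" figure can be sanity-checked with a quick count. The sketch below assumes one possible selection: two sparse vectorizers that need no combination step, three embedding models, three combination strategies that each yield a fixed-length vector, and four classifiers. The names are placeholders, and a different selection changes the total.

```python
from itertools import product

sparse_vectorizers = ["bow", "tfidf"]                # no combination step needed
embedding_models = ["cbow", "skipgram", "fasttext"]  # produce per-word vectors
combiners = ["mean", "max", "lstm"]
classifiers = ["logreg", "tree", "svm", "mlp"]

# Sparse vectors go straight to a classifier; embeddings need a combiner first.
sparse_runs = list(product(sparse_vectorizers, ["n/a"], classifiers))
embedding_runs = list(product(embedding_models, combiners, classifiers))
all_runs = sparse_runs + embedding_runs

print(len(sparse_runs), len(embedding_runs), len(all_runs))  # prints: 8 36 44
```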

NOTE - While choosing the 4-5 types of each experimentation level, try to choose the best out of all those available. E.g. - For level (iii) - Tinker with various strategies to combine the word vectors - do not include 'mean' if you see it is giving horrendous results. Include the best 3-4 strategies.

### Helper Code to get you started

I have added some helper code to show you how to load the TREC dataset and use it.

In [1]:
!pip install -U datasets nltk scikit-learn gensim tensorflow pandas numpy matplotlib seaborn
!pip install fasttext-wheel

Collecting numpy
  Using cached numpy-2.3.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (62 kB)


In [2]:
import os
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

from datasets import load_dataset
from google.colab import userdata
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score
import gensim
from gensim.models import Word2Vec, FastText
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, GlobalMaxPooling1D, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import re
import time
from collections import defaultdict

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

hf_token = userdata.get('HF_TOKEN')
os.environ['HF_TOKEN'] = hf_token
dataset = load_dataset("trec", token=hf_token)
train_data = dataset['train']
test_data = dataset['test']

print(f"Dataset splits: {list(dataset.keys())}")
print(f"Training samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")
print(f"Features: {train_data.features}")
X_train = [item['text'] for item in train_data]
y_train = [item['coarse_label'] for item in train_data]
X_test = [item['text'] for item in test_data]
y_test = [item['coarse_label'] for item in test_data]


coarse_labels = train_data.features['coarse_label'].names
print(f"\nCoarse label names: {coarse_labels}")
print(f"Number of classes: {len(coarse_labels)}")

print("\n--- First 3 Examples ---")
for i in range(3):
    print(f"Example {i+1}:")
    print(f" Text: {train_data[i]['text']}")
    print(f" Coarse: {train_data[i]['coarse_label']} ({coarse_labels[train_data[i]['coarse_label']]})")
    print(f" Fine: {train_data[i]['fine_label']} ({train_data.features['fine_label'].names[train_data[i]['fine_label']]})")
    print()

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Dataset splits: ['train', 'test']
Training samples: 5452
Test samples: 500
Features: {'text': Value(dtype='string', id=None), 'coarse_label': ClassLabel(names=['ABBR', 'ENTY', 'DESC', 'HUM', 'LOC', 'NUM'], id=None), 'fine_label': ClassLabel(names=['ABBR:abb', 'ABBR:exp', 'ENTY:animal', 'ENTY:body', 'ENTY:color', 'ENTY:cremat', 'ENTY:currency', 'ENTY:dismed', 'ENTY:event', 'ENTY:food', 'ENTY:instru', 'ENTY:lang', 'ENTY:letter', 'ENTY:other', 'ENTY:plant', 'ENTY:product', 'ENTY:religion', 'ENTY:sport', 'ENTY:substance', 'ENTY:symbol', 'ENTY:techmeth', 'ENTY:termeq', 'ENTY:veh', 'ENTY:word', 'DESC:def', 'DESC:desc', 'DESC:manner', 'DESC:reason', 'HUM:gr', 'HUM:ind', 'HUM:title', 'HUM:desc', 'LOC:city', 'LOC:country', 'LOC:mount', 'LOC:other', 'LOC:state', 'NUM:code', 'NUM:count', 'NUM:date', 'NUM:dist', 'NUM:money', 'NUM:ord', 'NUM:other', 'NUM:period', 'NUM:perc', 'NUM:speed', 'NUM:temp', 'NUM:volsize', 'NUM:weight'], id=None)}

Coarse label names: ['ABBR', 'ENTY', 'DESC', 'HUM', 'LOC', 

In [3]:
class TextPreprocessor:
    def __init__(self, method='none'):
        """
        method: 'none', 'porter_stem', 'snowball_stem', 'lemmatize'
        """
        self.method = method
        self.stop_words = set(stopwords.words('english'))

        if method == 'porter_stem':
            self.stemmer = PorterStemmer()
        elif method == 'snowball_stem':
            self.stemmer = SnowballStemmer('english')
        elif method == 'lemmatize':
            self.lemmatizer = WordNetLemmatizer()

    def preprocess_text(self, text):
        text = text.lower()

        text = re.sub(r'[^a-zA-Z\s]', '', text)

        tokens = word_tokenize(text)

        tokens = [token for token in tokens if token not in self.stop_words and len(token) > 2]

        if self.method == 'porter_stem':
            tokens = [self.stemmer.stem(token) for token in tokens]
        elif self.method == 'snowball_stem':
            tokens = [self.stemmer.stem(token) for token in tokens]
        elif self.method == 'lemmatize':
            tokens = [self.lemmatizer.lemmatize(token) for token in tokens]

        return ' '.join(tokens)

    def fit_transform(self, texts):
        return [self.preprocess_text(text) for text in texts]

    def transform(self, texts):
        return [self.preprocess_text(text) for text in texts]
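
One variation that may be worth adding to the preprocessing comparison: NLTK's English stopword list contains question words such as "what", "who", "where" and "how", which are likely informative for question-type labels like the TREC coarse classes. The sketch below drops ordinary stopwords but spares question words. The small `STOP_WORDS` set is a hand-picked subset standing in for `stopwords.words('english')`, and `remove_stopwords` is a hypothetical helper, not part of the class above.

```python
# Illustrative subset of a stopword list (stand-in for NLTK's full list).
STOP_WORDS = {"the", "is", "a", "of", "in", "what", "who", "where", "how", "when"}

# Question words we may want to keep because they hint at the answer type.
QUESTION_WORDS = {"what", "who", "where", "how", "when", "why", "which"}

def remove_stopwords(tokens, keep_question_words=True):
    """Drop stopwords, optionally sparing question words."""
    blocked = STOP_WORDS - QUESTION_WORDS if keep_question_words else STOP_WORDS
    return [tok for tok in tokens if tok not in blocked]

tokens = "who is the president of france".split()
print(remove_stopwords(tokens))                             # ['who', 'president', 'france']
print(remove_stopwords(tokens, keep_question_words=False))  # ['president', 'france']
```

Whether this helps on TREC is an empirical question; it is only a candidate to run through the same baseline comparison.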

In [4]:
def find_best_preprocessing():
    """Find the best preprocessing method using BoW + Logistic Regression"""
    preprocessing_methods = ['none', 'porter_stem', 'snowball_stem', 'lemmatize']
    results = {}

    print("Finding best preprocessing method...")

    for method in preprocessing_methods:
        print(f"Testing preprocessing method: {method}")

        preprocessor = TextPreprocessor(method)
        X_train_processed = preprocessor.fit_transform(X_train)
        X_test_processed = preprocessor.transform(X_test)

        vectorizer = CountVectorizer(max_features=5000, ngram_range=(1, 2))
        X_train_vec = vectorizer.fit_transform(X_train_processed)
        X_test_vec = vectorizer.transform(X_test_processed)

        clf = LogisticRegression(random_state=42, max_iter=1000)
        clf.fit(X_train_vec, y_train)

        y_pred = clf.predict(X_test_vec)
        accuracy = accuracy_score(y_test, y_pred)
        results[method] = accuracy

        print(f"Accuracy with {method}: {accuracy:.4f}")

    # Find best method
    best_method = max(results, key=results.get)
    print(f"\nBest preprocessing method: {best_method} with accuracy: {results[best_method]:.4f}")

    return best_method, results

# Find best preprocessing method
best_preprocessing, preprocessing_results = find_best_preprocessing()

Finding best preprocessing method...
Testing preprocessing method: none
Accuracy with none: 0.7500
Testing preprocessing method: porter_stem
Accuracy with porter_stem: 0.7440
Testing preprocessing method: snowball_stem
Accuracy with snowball_stem: 0.7460
Testing preprocessing method: lemmatize
Accuracy with lemmatize: 0.7480

Best preprocessing method: none with accuracy: 0.7500
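
The comparison above chooses the preprocessing method by its accuracy on the test set, and the gap between the best and worst method (0.750 vs 0.744) is only three questions out of 500. A hedged alternative is to choose with cross-validation on the training data only, so the test set stays untouched until the final evaluation. The sketch below uses a toy corpus and two stand-in normalisers (`keep_words` and `strip_plural`, both invented here; `strip_plural` is a crude substitute for a real stemmer), so the scores themselves mean nothing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def keep_words(text):
    """Baseline: lowercase only."""
    return text.lower()

def strip_plural(text):
    """Crude stand-in for a stemmer: drop a trailing 's' from longer words."""
    words = text.lower().split()
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words)

def compare_preprocessors(texts, labels, preprocessors, cv=3):
    """Return the mean cross-validated accuracy for each preprocessing function."""
    scores = {}
    for name, fn in preprocessors.items():
        pipe = make_pipeline(
            CountVectorizer(preprocessor=fn),
            LogisticRegression(max_iter=1000),
        )
        fold_scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
        scores[name] = float(fold_scores.mean())
    return scores

# Toy data: six "who" questions (class 0) and six "where" questions (class 1).
toy_texts = [
    "who wrote hamlet", "who painted the mona lisa", "who discovered penicillin",
    "who founded microsoft", "who invented the telephone", "who directed jaws",
    "where is paris", "where are the alps", "where is mount everest",
    "where are the pyramids", "where is the nile", "where are the andes",
]
toy_labels = [0] * 6 + [1] * 6

cv_scores = compare_preprocessors(
    toy_texts, toy_labels, {"none": keep_words, "strip_plural": strip_plural}
)
best_name = max(cv_scores, key=cv_scores.get)
```

In the notebook, the same idea would apply `cross_val_score` to `X_train` and `y_train` with each `TextPreprocessor` method, leaving `X_test` for the final table only.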


In [5]:
class VectorizationTechniques:
    def __init__(self, preprocessor):
        self.preprocessor = preprocessor
        self.vectorizers = {}
        self.word2vec_model = None
        self.fasttext_model = None

    def prepare_data(self, X_train, X_test):
        """Preprocess the data"""
        self.X_train_processed = self.preprocessor.fit_transform(X_train)
        self.X_test_processed = self.preprocessor.transform(X_test)

        self.X_train_tokens = [text.split() for text in self.X_train_processed]
        self.X_test_tokens = [text.split() for text in self.X_test_processed]

        return self.X_train_processed, self.X_test_processed

    def bag_of_words(self, max_features=5000):
        """Bag of Words vectorization"""
        vectorizer = CountVectorizer(max_features=max_features, ngram_range=(1, 2))
        X_train_vec = vectorizer.fit_transform(self.X_train_processed)
        X_test_vec = vectorizer.transform(self.X_test_processed)
        self.vectorizers['bow'] = vectorizer
        return X_train_vec.toarray(), X_test_vec.toarray()

    def tfidf(self, max_features=5000):
        """TF-IDF vectorization"""
        vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=(1, 2))
        X_train_vec = vectorizer.fit_transform(self.X_train_processed)
        X_test_vec = vectorizer.transform(self.X_test_processed)
        self.vectorizers['tfidf'] = vectorizer
        return X_train_vec.toarray(), X_test_vec.toarray()

    def word2vec_cbow(self, vector_size=100, window=5, min_count=2):
        """Word2Vec CBOW model"""
        model = Word2Vec(
            sentences=self.X_train_tokens,
            vector_size=vector_size,
            window=window,
            min_count=min_count,
            sg=0,  # CBOW
            seed=42,
            workers=4
        )
        self.word2vec_model = model
        return model

    def word2vec_skipgram(self, vector_size=100, window=5, min_count=2):
        """Word2Vec Skip-gram model"""
        model = Word2Vec(
            sentences=self.X_train_tokens,
            vector_size=vector_size,
            window=window,
            min_count=min_count,
            sg=1,  # Skip-gram
            seed=42,
            workers=4
        )
        return model

    def build_fasttext_model(self, vector_size=100, window=5, min_count=2):
        """FastText model"""
        model = FastText(
            sentences=self.X_train_tokens,
            vector_size=vector_size,
            window=window,
            min_count=min_count,
            seed=42,
            workers=4
        )
        self.fasttext_model = model
        return model

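    def get_glove_embeddings(self, vector_size=100):
        """Placeholder for GloVe: returns random vectors, not pretrained ones.

        No GloVe file is downloaded or read here. The matrix is random noise
        with one row per training-vocabulary word, so it has to be replaced
        with real pretrained vectors before any 'GloVe' result is reported.
        """
        vocab = set()
        for tokens in self.X_train_tokens:
            vocab.update(tokens)

        vocab = sorted(vocab)  # fixed order so the indices are reproducible
        vocab_size = len(vocab)
        rng = np.random.default_rng(42)  # seeded so reruns give the same matrix
        embedding_matrix = rng.normal(size=(vocab_size, vector_size))
        word_to_idx = {word: i for i, word in enumerate(vocab)}

        return embedding_matrix, word_to_idx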

class CombinationStrategies:
    def __init__(self):
        pass

    def mean_pooling(self, embeddings):
        """Average word embeddings"""
        return np.mean(embeddings, axis=0) if len(embeddings) > 0 else np.zeros(embeddings.shape[1] if len(embeddings.shape) > 1 else 100)

    def max_pooling(self, embeddings):
        """Max pooling of word embeddings"""
        return np.max(embeddings, axis=0) if len(embeddings) > 0 else np.zeros(embeddings.shape[1] if len(embeddings.shape) > 1 else 100)

    def get_sentence_embeddings(self, tokenized_sentences, model, strategy='mean'):
        """Convert tokenized sentences to sentence embeddings"""
        sentence_embeddings = []

        for tokens in tokenized_sentences:
            word_embeddings = []
            for token in tokens:
                if token in model.wv:
                    word_embeddings.append(model.wv[token])

            if word_embeddings:
                word_embeddings = np.array(word_embeddings)
                if strategy == 'mean':
                    sentence_emb = self.mean_pooling(word_embeddings)
                elif strategy == 'max':
                    sentence_emb = self.max_pooling(word_embeddings)
                else:
                    sentence_emb = np.mean(word_embeddings, axis=0)  # default to mean
            else:
                sentence_emb = np.zeros(model.vector_size)

            sentence_embeddings.append(sentence_emb)

        return np.array(sentence_embeddings)

    def lstm_combination(self, tokenized_sentences, model, max_len=50):
        """Turn token lists into padded index sequences plus an embedding matrix.

        The index mapping is built from the embedding model's own vocabulary,
        not from the sentences passed in, so separate calls for the train and
        test sets produce indices that refer to the same embedding rows.
        """
        # Index 0 is reserved for padding and for out-of-vocabulary tokens.
        word_to_idx = {word: i + 1 for i, word in enumerate(model.wv.index_to_key)}
        vocab_size = len(word_to_idx) + 1

        sequences = []
        for tokens in tokenized_sentences:
            seq = [word_to_idx.get(token, 0) for token in tokens]
            sequences.append(seq)

        padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')

        # Row i holds the vector of the word with index i; row 0 stays all zeros.
        embedding_matrix = np.zeros((vocab_size, model.vector_size))
        for word, idx in word_to_idx.items():
            embedding_matrix[idx] = model.wv[word]

        return padded_sequences, embedding_matrix, vocab_size


class MLModels:
    def __init__(self):
        self.models = {}

    def logistic_regression(self):
        return LogisticRegression(random_state=42, max_iter=1000)

    def decision_tree(self):
        return DecisionTreeClassifier(random_state=42)

    def random_forest(self):
        return RandomForestClassifier(n_estimators=100, random_state=42)

    def svm(self):
        return SVC(random_state=42, kernel='rbf')

    def mlp_classifier(self):
        return MLPClassifier(hidden_layer_sizes=(128, 64), random_state=42, max_iter=500)

    def lstm_model(self, vocab_size, embedding_matrix, max_len=50, num_classes=6):
        """Create LSTM model for sequence classification"""
        model = Sequential([
            Embedding(vocab_size, embedding_matrix.shape[1],
                     weights=[embedding_matrix], input_length=max_len, trainable=False),
            LSTM(128, dropout=0.2, recurrent_dropout=0.2),
            Dense(64, activation='relu'),
            Dropout(0.3),
            Dense(num_classes, activation='softmax')
        ])

        model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
        return model
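
Since `get_glove_embeddings` above fills its matrix with random numbers, here is a minimal sketch of how pretrained GloVe vectors could be read instead. It assumes the plain-text format of the published GloVe files (one token per line followed by its space-separated floats). `parse_glove_lines` and `mean_vector` are hypothetical helpers, and the three 4-dimensional vectors are toy data standing in for a real file such as `glove.6B.100d.txt`.

```python
import numpy as np

def parse_glove_lines(lines, vocab=None):
    """Parse GloVe text-format lines into a {word: vector} dictionary.

    If `vocab` is given, only words in it are kept, which saves memory.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if vocab is not None and word not in vocab:
            continue
        vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim, dtype=np.float32)

# Toy stand-in for an opened GloVe file
# (real usage would iterate over `open(path, encoding="utf8")`).
toy_lines = [
    "capital 0.1 0.2 0.3 0.4\n",
    "france 0.5 0.6 0.7 0.8\n",
    "banana 0.9 0.9 0.9 0.9\n",
]
glove = parse_glove_lines(toy_lines, vocab={"capital", "france"})
sentence_vec = mean_vector(["capital", "france", "unknownword"], glove, dim=4)
```

Restricting the parse to the training vocabulary keeps the dictionary small, since the full GloVe files cover far more words than TREC uses.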

In [8]:
def run_comprehensive_experiment():
    """Run all combinations of experiments"""

    preprocessor = TextPreprocessor(best_preprocessing)
    vectorizer = VectorizationTechniques(preprocessor)
    combiner = CombinationStrategies()
    ml_models = MLModels()

    X_train_processed, X_test_processed = vectorizer.prepare_data(X_train, X_test)

    results = []

    print("Starting comprehensive experimentation...")
    print("="*80)

    vectorization_methods = [
        'bow', 'tfidf', 'word2vec_cbow', 'word2vec_skipgram', 'fasttext'
    ]

    combination_strategies = ['mean', 'max', 'lstm']

    ml_model_names = ['logistic_regression', 'decision_tree', 'random_forest', 'svm', 'mlp']

    print("Training embedding models...")
    w2v_cbow_model = vectorizer.word2vec_cbow()
    w2v_skipgram_model = vectorizer.word2vec_skipgram()
    fasttext_model = vectorizer.build_fasttext_model()

    embedding_models = {
        'word2vec_cbow': w2v_cbow_model,
        'word2vec_skipgram': w2v_skipgram_model,
        'fasttext': fasttext_model
    }

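    exp_count = 0
    # NOTE: this formula overcounts and evaluates to 70. The loops below run
    # 2 * 5 = 10 sparse experiments plus 3 * (2 * 5 + 1) = 33 embedding
    # experiments (the 'lstm' strategy is paired only with its own network),
    # i.e. 43 in total, so the "/70" in the progress log is not the true total.
    total_experiments = len(vectorization_methods) * len(ml_model_names) + \
                       len([v for v in vectorization_methods if v in embedding_models]) * len(combination_strategies) * len(ml_model_names)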

    # 1. Traditional vectorization methods (BoW, TF-IDF)
    for vec_method in ['bow', 'tfidf']:
        print(f"\nTesting {vec_method.upper()} vectorization...")

        if vec_method == 'bow':
            X_train_vec, X_test_vec = vectorizer.bag_of_words()
        else:  # tfidf
            X_train_vec, X_test_vec = vectorizer.tfidf()

        for model_name in ml_model_names:
            exp_count += 1
            print(f"[{exp_count}/{total_experiments}] {vec_method} + {model_name}")

            try:
                # Get model
                if model_name == 'logistic_regression':
                    model = ml_models.logistic_regression()
                elif model_name == 'decision_tree':
                    model = ml_models.decision_tree()
                elif model_name == 'random_forest':
                    model = ml_models.random_forest()
                elif model_name == 'svm':
                    model = ml_models.svm()
                elif model_name == 'mlp':
                    model = ml_models.mlp_classifier()

                # Train and evaluate
                start_time = time.time()
                model.fit(X_train_vec, y_train)
                y_pred = model.predict(X_test_vec)
                accuracy = accuracy_score(y_test, y_pred)
                train_time = time.time() - start_time

                results.append({
                    'Preprocessing': best_preprocessing,
                    'Vectorization': vec_method.upper(),
                    'Combination': 'N/A',
                    'ML_Model': model_name,
                    'Accuracy': accuracy,
                    'Train_Time': train_time
                })

                print(f"  Accuracy: {accuracy:.4f}")

            except Exception as e:
                print(f"  Error: {str(e)}")
                results.append({
                    'Preprocessing': best_preprocessing,
                    'Vectorization': vec_method.upper(),
                    'Combination': 'N/A',
                    'ML_Model': model_name,
                    'Accuracy': 0.0,
                    'Train_Time': 0.0
                })

    # 2. Embedding methods with combination strategies
    for vec_method in ['word2vec_cbow', 'word2vec_skipgram', 'fasttext']:
        print(f"\nTesting {vec_method.upper()} embeddings...")
        model = embedding_models[vec_method]

        for combination in combination_strategies:
            if combination == 'lstm':
                # Special handling for LSTM
                print(f"  Testing {combination.upper()} combination with LSTM classifier...")

                try:
                    # Prepare data for LSTM
                    X_train_seq, embedding_matrix, vocab_size = combiner.lstm_combination(
                        vectorizer.X_train_tokens, model
                    )
                    X_test_seq, _, _ = combiner.lstm_combination(
                        vectorizer.X_test_tokens, model
                    )

                    # Convert labels to categorical
                    y_train_cat = to_categorical(y_train, num_classes=6)
                    y_test_cat = to_categorical(y_test, num_classes=6)

                    # Create and train LSTM model
                    lstm_model = ml_models.lstm_model(vocab_size, embedding_matrix)

                    start_time = time.time()
                    lstm_model.fit(X_train_seq, y_train_cat, epochs=5, batch_size=32, verbose=0)

                    # Predict
                    y_pred_proba = lstm_model.predict(X_test_seq, verbose=0)
                    y_pred = np.argmax(y_pred_proba, axis=1)
                    accuracy = accuracy_score(y_test, y_pred)
                    train_time = time.time() - start_time

                    exp_count += 1
                    print(f"[{exp_count}] {vec_method} + {combination} + LSTM: {accuracy:.4f}")

                    results.append({
                        'Preprocessing': best_preprocessing,
                        'Vectorization': vec_method,
                        'Combination': combination,
                        'ML_Model': 'LSTM',
                        'Accuracy': accuracy,
                        'Train_Time': train_time
                    })

                except Exception as e:
                    print(f"  Error with LSTM: {str(e)}")
                    results.append({
                        'Preprocessing': best_preprocessing,
                        'Vectorization': vec_method,
                        'Combination': combination,
                        'ML_Model': 'LSTM',
                        'Accuracy': 0.0,
                        'Train_Time': 0.0
                    })

            else:
                # Traditional combination strategies
                print(f"  Testing {combination.upper()} combination...")

                # Get sentence embeddings
                X_train_emb = combiner.get_sentence_embeddings(
                    vectorizer.X_train_tokens, model, combination
                )
                X_test_emb = combiner.get_sentence_embeddings(
                    vectorizer.X_test_tokens, model, combination
                )

                for model_name in ml_model_names:
                    exp_count += 1
                    print(f"[{exp_count}] {vec_method} + {combination} + {model_name}")

                    try:
                        # Get model
                        if model_name == 'logistic_regression':
                            ml_model = ml_models.logistic_regression()
                        elif model_name == 'decision_tree':
                            ml_model = ml_models.decision_tree()
                        elif model_name == 'random_forest':
                            ml_model = ml_models.random_forest()
                        elif model_name == 'svm':
                            ml_model = ml_models.svm()
                        elif model_name == 'mlp':
                            ml_model = ml_models.mlp_classifier()

                        # Train and evaluate
                        start_time = time.time()
                        ml_model.fit(X_train_emb, y_train)
                        y_pred = ml_model.predict(X_test_emb)
                        accuracy = accuracy_score(y_test, y_pred)
                        train_time = time.time() - start_time

                        results.append({
                            'Preprocessing': best_preprocessing,
                            'Vectorization': vec_method,
                            'Combination': combination,
                            'ML_Model': model_name,
                            'Accuracy': accuracy,
                            'Train_Time': train_time
                        })

                        print(f"    Accuracy: {accuracy:.4f}")

                    except Exception as e:
                        print(f"    Error: {str(e)}")
                        results.append({
                            'Preprocessing': best_preprocessing,
                            'Vectorization': vec_method,
                            'Combination': combination,
                            'ML_Model': model_name,
                            'Accuracy': 0.0,
                            'Train_Time': 0.0
                        })

    return results
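
The `if/elif` chain that maps a model name to an estimator appears twice in the function above. One way to shorten it, sketched here on the assumption that the same five scikit-learn estimators and hyperparameters are wanted, is a dictionary of factories. `MODEL_FACTORIES` and `make_model` are names invented for this example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Each entry builds a fresh, unfitted estimator so runs never share state.
MODEL_FACTORIES = {
    "logistic_regression": lambda: LogisticRegression(random_state=42, max_iter=1000),
    "decision_tree": lambda: DecisionTreeClassifier(random_state=42),
    "random_forest": lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    "svm": lambda: SVC(random_state=42, kernel="rbf"),
    "mlp": lambda: MLPClassifier(hidden_layer_sizes=(128, 64), random_state=42, max_iter=500),
}

def make_model(name):
    """Return a new estimator for `name`, or raise ValueError if unknown."""
    try:
        return MODEL_FACTORIES[name]()
    except KeyError:
        raise ValueError(f"Unknown model name: {name!r}") from None

svm_model = make_model("svm")
```

With this, both loops could call `make_model(model_name)`, and a misspelled name fails loudly instead of silently reusing the previous model.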





In [9]:

experiment_results = run_comprehensive_experiment()

results_df = pd.DataFrame(experiment_results)

results_df = results_df.sort_values('Accuracy', ascending=False)

print("\n" + "="*100)
print("COMPREHENSIVE EXPERIMENT RESULTS")
print("="*100)

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print(results_df.to_string(index=False))

best_result = results_df.iloc[0]
print(f"\n BEST PERFORMING MODEL:")
print("="*50)
print(f"Preprocessing: {best_result['Preprocessing']}")
print(f"Vectorization: {best_result['Vectorization']}")
print(f"Combination Strategy: {best_result['Combination']}")
print(f"ML Model: {best_result['ML_Model']}")
print(f"Accuracy: {best_result['Accuracy']:.4f}")
print(f"Training Time: {best_result['Train_Time']:.2f} seconds")

print(f"\n TOP 10 PERFORMING COMBINATIONS:")
print("="*50)
top_10 = results_df.head(10)
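# After sorting, the DataFrame index is the original row position, not the
# rank, so number the rows explicitly.
for rank, (_, row) in enumerate(top_10.iterrows(), start=1):
    print(f"{rank:2d}. {row['Vectorization']:15s} + {row['Combination']:8s} + {row['ML_Model']:18s} = {row['Accuracy']:.4f}")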

Starting comprehensive experimentation...
Training embedding models...

Testing BOW vectorization...
[1/70] bow + logistic_regression
  Accuracy: 0.7500
[2/70] bow + decision_tree
  Accuracy: 0.7140
[3/70] bow + random_forest
  Accuracy: 0.7200
[4/70] bow + svm
  Accuracy: 0.7080
[5/70] bow + mlp
  Accuracy: 0.7000

Testing TFIDF vectorization...
[6/70] tfidf + logistic_regression
  Accuracy: 0.7500
[7/70] tfidf + decision_tree
  Accuracy: 0.7340
[8/70] tfidf + random_forest
  Accuracy: 0.7320
[9/70] tfidf + svm
  Accuracy: 0.7260
[10/70] tfidf + mlp
  Accuracy: 0.7100

Testing WORD2VEC_CBOW embeddings...
  Testing MEAN combination...
[11] word2vec_cbow + mean + logistic_regression
    Accuracy: 0.4180
[12] word2vec_cbow + mean + decision_tree
    Accuracy: 0.5060
[13] word2vec_cbow + mean + random_forest
    Accuracy: 0.5860
[14] word2vec_cbow + mean + svm
    Accuracy: 0.6360
[15] word2vec_cbow + mean + mlp
    Accuracy: 0.5940
  Testing MAX combination...
[16] word2vec_cbow + max + 