#### Bag Of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
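
To make the multiset idea concrete, here is a minimal sketch that uses only the standard library. The helper name `bag_of_words` and the lowercase, whitespace-based tokenization are assumptions made for illustration; they are not how `CountVectorizer` tokenizes.

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset of lowercase, whitespace-separated tokens in text."""
    return Counter(text.lower().split())

first = bag_of_words("the cat chased the dog")
second = bag_of_words("the dog chased the cat")

# Word order is discarded, so both sentences map to the same bag.
print(first == second)

# Multiplicity is kept: "the" is counted once per occurrence.
print(first["the"])
```

The two sentences have different meanings but identical bags, which is exactly the information the model gives up in exchange for a simple fixed-length representation.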

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

"""
bag_of_words_calculate: calculates bag of words for train
                        and test dataframe column
                        
arguments:
    train_df: pandas dataframe
    test_df:  pandas dataframe
    column:   string
"""
def bag_of_words_calculate(train_df, test_df, column):
    count_vectorizer = CountVectorizer(max_df=1.0, min_df=1, max_features=300)
    train_X_bow = count_vectorizer.fit_transform(train_df[column])
    test_X_bow = count_vectorizer.transform(test_df[column])
    return train_X_bow, test_X_bow

"""
bag_of_words_calculate_store: calculates bag of words for train and test dataframe
                              column and stores them as a new column
                        
arguments:
    train_df: pandas dataframe
    test_df:  pandas dataframe
    column:   string
"""
def bag_of_words_calculate_store(train_df, test_df, column):
    train_X_bow,test_X_bow = bag_of_words_calculate(train_df, test_df, column)
    
    vectors = list()
    for v in train_X_bow.toarray():
        vectors.append(v)

    # save bag-of-words vectors as a new column in the train dataframe
    train_df[f"bow_{column}"] = pd.Series(vectors,index=train_df.index)
    
    vectors = list()
    for v in test_X_bow.toarray():
        vectors.append(v)

    # save bag-of-words vectors as a new column in the test dataframe
    test_df[f"bow_{column}"] = pd.Series(vectors,index=test_df.index)

#### Tf-idf

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
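
As a rough illustration of the statistic, the sketch below computes the textbook weighting, term frequency times `log(N / df)`, on a tiny hand-made corpus. The function name `tf_idf` and the sample documents are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute textbook tf-idf weights for a list of tokenized documents."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical three-document corpus, already tokenized.
docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "purred"]]
scores = tf_idf(docs)
print(scores[0])
```

A term that occurs in every document, such as "the" here, gets weight zero, while a term that occurs in a single document gets the highest weight in that document.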

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

"""
tf_idf_calculate: calculates tf-idfs for train and test dataframe column
                        
arguments:
    train_df: pandas dataframe
    test_df:  pandas dataframe
    column:   string
"""
def tf_idf_calculate(train_df, test_df, column):
    tf_idf_vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, max_features=300)
    train_X_tf_idf = tf_idf_vectorizer.fit_transform(train_df[column])
    test_X_tf_idf = tf_idf_vectorizer.transform(test_df[column])
    return train_X_tf_idf, test_X_tf_idf

"""
tf_idf_calculate_store: calculates tf-idfs for train and test dataframe column
                        and stores them as a new column
arguments:
    train_df: pandas dataframe
    test_df:  pandas dataframe
    column:   string
"""
def tf_idf_calculate_store(train_df, test_df, column):
    train_X_tf_idf, test_X_tf_idf = tf_idf_calculate(train_df, test_df, column)

    vectors = list()
    for v in train_X_tf_idf.toarray():
        vectors.append(v)
    # save tf-idfs as a new column in the train dataframe
    train_df[f"tf_idf_{column}"] = pd.Series(vectors, index = train_df.index)
    
    vectors = list()
    for v in test_X_tf_idf.toarray():
        vectors.append(v)
    # save tf-idfs as a new column in the test dataframe
    test_df[f"tf_idf_{column}"] = pd.Series(vectors, index = test_df.index)

#### Word2vec

Word2Vec is a popular technique for learning word embeddings in natural language processing (NLP). It maps each word to a dense vector, which is useful whenever a machine learning model needs a numerical representation of text.

The advantage of using Word2Vec is that distances between vectors reflect relationships between words: words that appear in similar contexts end up close to each other.

Word embeddings (for example word2vec) make it possible to exploit the local context of words and the semantic information in the text corpus.
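
One common way to measure how close two word vectors are is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not output from a trained model, and the names `cosine_similarity` and `toy_embeddings` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example only.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

# Related words point in similar directions, unrelated words do not.
print(king_queen > king_apple)
```

With a trained gensim model, the same comparison would be made on the vectors the model learned rather than on hand-written ones.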

In [3]:
from gensim.models import Word2Vec

"""
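word2vec_create_train: creates and trains a word2vec model for a dataframe column.
                       optionally, it also saves the model to a file

arguments:
    dataframe:           pandas dataframe
    column:              string (the column is expected to hold tokenized sentences)
    word2vec_model_file: string or None (path where the model is saved)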
"""
def word2vec_create_train(dataframe, column, word2vec_model_file = None):
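    # passing the sentences to the constructor builds the vocabulary
    # and already runs an initial training pass
    # note: gensim >= 4.0 renamed the `size` parameter to `vector_size`
    word2vec_model = Word2Vec(dataframe[column], size = 300, window = 5,
                              min_count = 100, sg = 1, hs = 0, negative = 10)
    # continue training for 20 more epochs on the same sentences
    word2vec_model.train(dataframe[column],total_examples=len(dataframe[column]),epochs=20)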
    if word2vec_model_file is not None:
        word2vec_model.save(word2vec_model_file)
    return word2vec_model

"""
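word2vec_sentence_vectorizer: calculates the average of all word embeddings for 
                              each sentence and optionally stores them as a new column

arguments:
    dataframe:      pandas dataframe
    column:         string
    word2vec_model: trained gensim Word2Vec model
    store:          bool (if True, save the vectors as a new dataframe column)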
"""
def word2vec_sentence_vectorizer(dataframe, column, word2vec_model, store = False):
    sentences = dataframe[column].tolist()
    vectors = list()
    # for each sentence we sum the word embeddings of all in-vocabulary words
    # and we divide by the number of those words.
    for sentence in sentences:
        sentence_vector = np.zeros(word2vec_model.wv.vector_size)
        number_of_words = 0
        for word in sentence:
            try:
                sentence_vector = np.add(sentence_vector, word2vec_model.wv[word])
                number_of_words += 1
            except KeyError:
                # the word is not in the model vocabulary, so it is skipped
                pass
        # a sentence without any known word keeps the zero vector,
        # which also avoids a division by zero
        if number_of_words > 0:
            sentence_vector = sentence_vector / number_of_words
        vectors.append(sentence_vector)
        
    # save word2vecs as a new column in the dataframe
    if store is True:
        dataframe[f"word2vec_{column}"] = pd.Series(vectors, index = dataframe.index)
        
    return vectors