<a href="https://colab.research.google.com/github/bhattacharjee/mtu-nlp-assignment/blob/main/assignment1/pipeline_work.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install spacy  nltk spacymoji huggingface -q       >/dev/null 2>&1         
!pip install -q -U tensorflow-text                      >/dev/null 2>&1
!pip install -q tf-models-official                      >/dev/null 2>&1
!python -m spacy download de_core_news_sm               >/dev/null 2>&1
!python -m spacy download de_dep_news_trf               >/dev/null 2>&1
!pip install transformers                               >/dev/null 2>&1

!python -m spacy download de_core_news_sm               >/dev/null 2>&1
!python -m spacy download de_dep_news_trf               >/dev/null 2>&1

!pip install mlxtend                                    >/dev/null 2>&1
!pip install imblearn                                   >/dev/null 2>&1

# handling emojis
!pip install demoji                                     >/dev/null 2>&1

In [2]:
import requests
from functools import lru_cache
import sklearn

@lru_cache(maxsize=10)
def get_train_test_files():
    TRAIN_FILE = 'https://raw.githubusercontent.com/bhattacharjee/mtu-nlp-assignment/main/assignment1/Assessment1_Toxic_Train.csv'
    TEST_FILE = 'https://raw.githubusercontent.com/bhattacharjee/mtu-nlp-assignment/main/assignment1/Assessment1_Toxic_Test_For_Evaluation.csv'
    TRAIN_FILE_LOCAL = 'Assessment1_Toxic_Train.csv'
    TEST_FILE_LOCAL = 'Assessment1_Toxic_Test.csv'

    def download(url, localfile):
        with open(localfile, 'wb') as f:
            r = requests.get(url, allow_redirects=True)
            f.write(r.content)

    download(TRAIN_FILE, TRAIN_FILE_LOCAL)
    download(TEST_FILE, TEST_FILE_LOCAL)

    return TRAIN_FILE_LOCAL, TEST_FILE_LOCAL

def seed_random():
    import numpy as np
    import random
    np.random.seed(0)
    random.seed(0)

sklearn.set_config(display="diagram")

# Functions to read the CSV and do basic cleaning


Cleaning with Python
The data was first loaded using pandas. After that, regular expressions were used to perform the following:
1.	Convert to lowercase
2.	Emojis were replaced with their descriptions. Certain emojis can be relevant to the tasks at hand. The descriptions were modified to be a single word separated by underscores, eg. __thumbs_down__. These emojis are not German, but that should not make any difference to the models.
3.	The roles like @user, @moderator, etc. was removed. This was done because it was assumed that this might introduce bias into the classification, although the description of. The dataset says that chances of a bias are very unlikely.
4.	Ellipses are removed
5.	Any numbers are replaced with a tag, like NUM
6.	URL’s and links are removed
7.	Remove any punctuations
8.	Punctuations at the beginning or end of words are removed

In [3]:
import re
import pandas as pd
import demoji
from functools import lru_cache

def remove_roles(line:str)->str:
    # Remove texts like @USER, @MODERATOR etc
    pat = re.compile(u'\@[A-Za-z]+')
    return re.sub(pat, '', line)

@lru_cache(maxsize=3)
def get_train_test_df_cached():
    train_csv, test_csv = get_train_test_files()
    train_df = pd.read_csv(train_csv)
    test_df = pd.read_csv(test_csv)
    return train_df, test_df

def get_train_test_df():
    tr, te = get_train_test_df_cached()
    return tr.copy(), te.copy()


def remove_emojis(line:str)->str:
    # Replace emojis with their description, eg __thumbs_down__
    demoji_str = demoji.replace_with_desc(line, sep=" ::: ")
    if (demoji_str == line):
        return line
    
    inEmoji = False
    currentEmojiWords = []
    allWords = []

    def accumulate(word:str)->None:
        nonlocal inEmoji
        nonlocal currentEmojiWords
        nonlocal allWords
        if not inEmoji and word != ":::":
            allWords.append(word)
        elif inEmoji:
            if word == ':::':
                currentEmoji = "_".join(currentEmojiWords)
                currentEmoji = "__" + currentEmoji + "__"
                allWords.append(currentEmoji)
                currentEmojiWords = []
            else:
                currentEmojiWords.append(word)
        else: # Not in emoji but ::: is true
            inEmoji = True

    [accumulate(word) for word in demoji_str.split()]

    sentence = " ".join(allWords)
    return sentence


def remove_ellipses(line:str)->str:
    pat = re.compile(u'\.\.+')
    return re.sub(pat, ' ', line)

def to_lower(line:str)->str:
    return line.lower()

def replace_number_with_tag(line:str)->str:
    line = re.sub("\s\d*((\.|\,)\d+)?\s", " nummer ", line)
    line = re.sub('\s\d+$', '', line)
    line = re.sub('^\d+\s', '', line)
    return line

def remove_urls(line:str)->str:
    return re.sub('https?:\/\/\S+', ' hyperlink ', line)

def basic_clean(s:pd.Series)->pd.Series:
    return s.map(to_lower)                                                  \
            .map(remove_emojis)                                             \
            .map(remove_roles)                                              \
            .map(remove_ellipses)                                           \
            .map(replace_number_with_tag)                                   \
            .map(remove_urls)

@lru_cache(maxsize=3)
def get_clean_train_test_df_cached()->tuple:
    train_df, test_df = get_train_test_df()
    train_df['comment_text'] = basic_clean(train_df['comment_text'])
    test_df['comment_text'] = basic_clean(test_df['comment_text'])
    return train_df, test_df

def get_clean_train_test_df():
    tr, te = get_clean_train_test_df_cached()
    return tr.copy(), te.copy()

# Clean using Spacy and Enrich



Cleaning with Spacy
After the above set of steps, further cleaning was performed via a dedicated NLP toolkit. NLTK and spacy were both evaluated, but spacy seemed to be a better library for some tasks, and this was chosen for all asks as a result. The following operations were performed with spacy.
1.	Numbers or symbols are removed, we have already performed this step earlier, but some numbers may still be present.
2.	Stopwords are removed
3.	Punctuations are removed, again this was already done via regular expressions but some may still remain.
4.	Words are lemmatized
Parts of Speech Tagging and Named Entity Recognition
Experiments were tried with both POS tagging and removal of named entities, and without POS tagging and still having named entities. It was found that removal of named entities gave a big boost to model performance. Also, it was found that POS tagging gave a further small gain in model performance.
Both POS tagging and named entity removal were performed by use of the spacy library.
Additional Features
Taking inspiration from the approaches taken by various teams in the GermEval2021 competition, the following features were added:
1.	Number of words with length greater than 3 that have all letters in capital
2.	Number of exclamations
3.	Ratio of exclamations to number of characters


In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
import string
import spacy
from spacymoji import Emoji
import  de_core_news_sm

def is_punct_only(token:str)->bool:
    for c in list(token):
        if c not in string.punctuation:
            return False
    return True

def is_same(l1:list, l2:list)->bool:
    if (len(l1) != len(l2)):
        return False
    for x, y in zip(l1, l2):
        if x != y:
            return False
    return True

def get_num_of_allcap_words(s:str)->int:
    def is_allcaps(s:str)->bool:
        if (len(s) < 3):
            return False
        for c in list(s):
            if not (\
                    (ord(c) <=ord('Z') and ord(c) >= ord('A')) or           \
                    (ord(c) >= ord('0') and ord(c) <= ord('9'))             \
                    ):
                return False
        return True

    if len(s) < 3:
        return 0
    tokens = [w.strip() for w in s.split()]
    return sum([1 for t in tokens if is_allcaps(t)])

def get_percentage_of_excalamations(s:str)->float:
    if len(s) == 0:
        return 0.0
    exclamation_count = sum([1 for c in list(s) if c == '!'])
    return exclamation_count / len(s)


def is_empty_string(s:str)->bool:
    if s == '' or s == None:
        return True
    return False

def do_basic_nlp_cleaning(line:str)->str:
    nltk.download('stopwords', quiet=True)
    nltk.download('punkt', quiet=True)
    nltk.download('wordnet', quiet=True)

    # Tokenize
    tokens = word_tokenize(line)

    # Some tokens start with a punctuation, remove the first one
    def remove_first_punctuation(tok:str)->str:
        return                                                              \
            tok[1:]                                                         \
            if tok[0] in set(string.punctuation) and len(tok) != 0          \
            else tok

    tokens = [remove_first_punctuation(w) for w in tokens]

    # Remove stop words
    stop_words = set(stopwords.words("german"))
    tokens = [w for w in tokens if w not in stop_words]

    # Remove punctuations
    tokens = [w for w in tokens if not is_punct_only(w)]

    # Stem words
    stem = SnowballStemmer('german')
    tokens = [stem.stem(w) for w in tokens]

    return " ".join(tokens)

def get_cleaning_function(remove_named_ents:bool=True, pos_tagging:bool=False):
    #nlp = spacy.load("de_dep_news_trf")
    #nlp = spacy.load("de_core_news_sm")
    nlp = de_core_news_sm.load()
    emoji = Emoji(nlp)
    nlp.add_pipe(emoji, first=True)
    stopwords = spacy.lang.de.stop_words.STOP_WORDS

    def do_basic_nlp_cleaning(line:str)->str:
        def is_interesting_token(token, doc):
            if token.pos_ in set(['NUM', 'SYM']):
                return False
            if remove_named_ents:
                for e in doc.ents:
                    for t in e:
                        if token.text == t.text:
                            return False
            if token.text in stopwords:
                return False
            if (token.is_punct):
                return False
            #if token._.is_emoji:
            #    return False
            return True

        def remove_terminal_punctuations(word):
            word = word.strip()
            while word != "" and word[0] in list(string.punctuation):
                word = word[1:]
            while word != "" and word[-1] in list(string.punctuation):
                word = word[:-1]
            return word

        def get_final_string(tok, doc):
            lemma = tok.lemma_.lower()
            if pos_tagging:
                lemma = lemma + ":" + tok.pos_
                lemma = lemma + ":" + tok.tag_
            return lemma

        doc = nlp(line)
        words = [get_final_string(tok, doc) for tok in doc if is_interesting_token(tok, doc)]
        words = [remove_terminal_punctuations(word) for word in words]
        words = [word for word in words if word != ""]
        return  " ".join(words)

    return do_basic_nlp_cleaning

def get_enriched_dataset(df):
    cleaning_fn = get_cleaning_function(remove_named_ents=True, pos_tagging=True)
    df['cleaned_comment_text'] = df['comment_text'].map(cleaning_fn)
    df['n_all_caps'] = df['comment_text'].map(get_num_of_allcap_words)
    df['perc_exclamations'] = df['comment_text'].map(get_percentage_of_excalamations)
    df['num_exclamations'] = df['comment_text'].map(lambda s: sum([1 for x in list(s) if x == '!']))
    return df

@lru_cache(maxsize=3)
def get_enriched_train_test_dataset_cached():
    train_df, test_df = get_clean_train_test_df()
    train_df = get_enriched_dataset(train_df)
    test_df = get_enriched_dataset(test_df)
    return train_df, test_df

def get_enriched_train_test_dataset():
    train, test = get_enriched_train_test_dataset_cached()
    return train.copy(), test.copy()

train_df, test_df = get_enriched_train_test_dataset()

# Print Enriched Training DF

In [5]:
train_df

Unnamed: 0,comment_text,Sub1_Toxic,Sub2_Engaging,Sub3_FactClaiming,cleaned_comment_text,n_all_caps,perc_exclamations,num_exclamations
0,"gestern bei illner, montag bei nummer ist das...",1,0,1,gestern:ADV:ADV illner:ADJ:ADJA montag:NOUN:NN...,0,0.000000,0
1,mein gott der war erst gestern bei illner. die...,1,0,1,gestern:ADV:ADV redaktionen:NOUN:NN versagen:V...,0,0.000000,0
2,die cdu lässt das so wie so nicht zu . sagen ...,1,0,1,SPACE:_SP lässt:VERB:VVFIN sagen:VERB:VVFIN re...,0,0.000000,0
3,bei meiner beschissenen rente als 2x geschiede...,1,0,1,beschissen:ADJ:ADJA rente:NOUN:NN geschieden:A...,0,0.000000,0
4,"wer nummer jahre zum mindestlohn arbeiten muß,...",1,1,1,nummer:ADJ:ADJA mindestlohn:NOUN:NN arbeiten:V...,0,0.005025,3
...,...,...,...,...,...,...,...,...
3189,hier mal eine info. flüchtlinge werden nummer ...,0,0,0,mal:ADV:ADV info:NOUN:NN flüchtlinge:NOUN:NN n...,0,0.000000,0
3190,.aha .mal abwarten kommt bei uns auch .firmen ...,1,0,1,aha:X:XY mal:X:XY abwarten:NOUN:NN entlassen:P...,0,0.000000,0
3191,.so ist es,0,0,0,SPACE:_SP so:PROPN:NE,0,0.000000,0
3192,.die warten da,0,0,0,SPACE:_SP die:X:XY warten:NOUN:NN,0,0.000000,0


# Multinomial NB (Vectorization Approach 1)
## Use CountVectorizer, Term Frequency, and TF-IDF simultaneously

With experimentation it is found that using all three produces better results

In [6]:
from sklearn.naive_bayes import MultinomialNB, CategoricalNB, BernoulliNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE 
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.compose import ColumnTransformer
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.preprocessing import DenseTransformer
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import AdaBoostClassifier

def get_feature_column_names(df):
    return [cname for cname in df.columns if not cname.startswith('Sub')]

def get_target_column_names(df):
    return [cname for cname in df.columns if cname.startswith('Sub')]

def is_text_column(colname:str)->bool:
    if 'text' in colname:
        return True
    return False

def get_text_columns(df)->list:
    return [cn for cn in df.columns if is_text_column(cn)]

def get_nontext_columns(df)->list:
    return [cn for cn in df.columns if not is_text_column(cn)]

def run_classification(                                                     \
                       dataset:pd.DataFrame,                                \
                       target_column:str,                                   \
                       clf_gen_fn,                                          \
                       use_smote=False)->tuple:
    dataset = dataset[[cn for cn in dataset.columns if cn != 'comment_text']]
    X = dataset[get_feature_column_names(dataset)]
    y = dataset[target_column]
    trainX, testX, trainY, testY = train_test_split(X, y, random_state=0)

    def get_text_pipeline():
        return Pipeline(                                                    \
                        [                                                   \
                            ('cv', CountVectorizer(),),                     \
                            ('tfid', TfidfVectorizer()),                    \
                        ])

                        
    column_trans = make_column_transformer(                                 \
                            (CountVectorizer(ngram_range=(1,1)), 'cleaned_comment_text'),   \
                            (TfidfVectorizer(use_idf=True), 'cleaned_comment_text'),    \
                            (TfidfVectorizer(use_idf=False), 'cleaned_comment_text'),    \
                            remainder=MinMaxScaler(),                       \
                        )

    if use_smote:
        classif_pipeline = Pipeline(                                        \
                                [                                           \
                                    ('column_transformer', column_trans),   \
                                    ('dense', DenseTransformer()),          \
                                    ('smote', SMOTE(n_jobs=-1)),            \
                                    ('clf', clf_gen_fn()),                  \
                                ])
    else:
        classif_pipeline = Pipeline(                                        \
                                [                                           \
                                    ('column_transformer', column_trans),   \
                                    ('dense', DenseTransformer()),          \
                                    ('clf', clf_gen_fn()),                  \
                                ])
    
    
    classif_pipeline.fit(trainX, trainY)
    y_pred = classif_pipeline.predict(testX)

    return accuracy_score(testY, y_pred), f1_score(testY, y_pred), classif_pipeline

def run_classifiers():

    classifiers = {
        "LinearSVC_nosmote": (False, lambda: LinearSVC(),),
        "LinearSVC": (True, lambda: LinearSVC(),),
        "MultinomialNB_nosmote": (False, lambda: MultinomialNB(),),
        "RandomForestClassifier": (True, lambda: RandomForestClassifier(n_jobs=-1),),
        "RandomForestClassifier_nosmote": (False, lambda: RandomForestClassifier(n_jobs=-1),),
        "BernoulliNB_nosmote": (False, lambda: BernoulliNB(),),
    }

    result_df = pd.DataFrame(                                                           \
                    {                                                       \
                        'classifier': pd.Series('str'),                     \
                        'task_name': pd.Series('str'),                      \
                        'metric': pd.Series('str'),                         \
                        'smote': pd.Series(int),                            \
                        'value': pd.Series(float),                          \
                    })

    model_arr = list()
    for clfname, value in classifiers.items():
        use_smote, clfgen = value
        for colname in ['Sub1_Toxic', 'Sub2_Engaging', 'Sub3_FactClaiming']:
            accuracy, f1, model = run_classification(train_df, colname, clfgen, use_smote)
            print(f"{clfname:20.20s} {colname:20.20s} accuracy={accuracy:1.3f}              f1={f1:1.3f}   smote={use_smote}")
            result_dict = {
                'classifier': clfname,
                'task_name': colname,
                'metric': 'accuracy',
                'value': accuracy,
                'smote': 1 if use_smote else 0
            }
            result_df = result_df.append(result_dict, ignore_index=True)
            result_dict = {
                'classifier': clfname,
                'task_name': colname,
                'metric': 'f1_score',
                'value': f1,
                'smote': 1 if use_smote else 0
            }
            result_df = result_df.append(result_dict, ignore_index=True)
            model_arr.append((clfname, colname, model))

        
    return result_df, model_arr

seed_random()
result_df, model_arr = run_classifiers()

print('=' * 80)
print('=' * 80)

from sklearn import set_config

set_config(display="diagram")
import IPython
for clfname, colname, model in model_arr:
    print()
    print('-' * 80)
    print(clfname, colname, ':')
    print()
    IPython.display.display(model)



LinearSVC_nosmote    Sub1_Toxic           accuracy=0.662              f1=0.471   smote=False




LinearSVC_nosmote    Sub2_Engaging        accuracy=0.816              f1=0.584   smote=False




LinearSVC_nosmote    Sub3_FactClaiming    accuracy=0.741              f1=0.578   smote=False




LinearSVC            Sub1_Toxic           accuracy=0.645              f1=0.470   smote=True




LinearSVC            Sub2_Engaging        accuracy=0.810              f1=0.582   smote=True




LinearSVC            Sub3_FactClaiming    accuracy=0.745              f1=0.587   smote=True
MultinomialNB_nosmot Sub1_Toxic           accuracy=0.660              f1=0.205   smote=False
MultinomialNB_nosmot Sub2_Engaging        accuracy=0.797              f1=0.580   smote=False
MultinomialNB_nosmot Sub3_FactClaiming    accuracy=0.735              f1=0.600   smote=False
RandomForestClassifi Sub1_Toxic           accuracy=0.658              f1=0.244   smote=True
RandomForestClassifi Sub2_Engaging        accuracy=0.831              f1=0.620   smote=True
RandomForestClassifi Sub3_FactClaiming    accuracy=0.747              f1=0.570   smote=True
RandomForestClassifi Sub1_Toxic           accuracy=0.662              f1=0.182   smote=False
RandomForestClassifi Sub2_Engaging        accuracy=0.834              f1=0.578   smote=False
RandomForestClassifi Sub3_FactClaiming    accuracy=0.755              f1=0.522   smote=False
BernoulliNB_nosmote  Sub1_Toxic           accuracy=0.672              f1=0


--------------------------------------------------------------------------------
LinearSVC_nosmote Sub2_Engaging :




--------------------------------------------------------------------------------
LinearSVC_nosmote Sub3_FactClaiming :




--------------------------------------------------------------------------------
LinearSVC Sub1_Toxic :




--------------------------------------------------------------------------------
LinearSVC Sub2_Engaging :




--------------------------------------------------------------------------------
LinearSVC Sub3_FactClaiming :




--------------------------------------------------------------------------------
MultinomialNB_nosmote Sub1_Toxic :




--------------------------------------------------------------------------------
MultinomialNB_nosmote Sub2_Engaging :




--------------------------------------------------------------------------------
MultinomialNB_nosmote Sub3_FactClaiming :




--------------------------------------------------------------------------------
RandomForestClassifier Sub1_Toxic :




--------------------------------------------------------------------------------
RandomForestClassifier Sub2_Engaging :




--------------------------------------------------------------------------------
RandomForestClassifier Sub3_FactClaiming :




--------------------------------------------------------------------------------
RandomForestClassifier_nosmote Sub1_Toxic :




--------------------------------------------------------------------------------
RandomForestClassifier_nosmote Sub2_Engaging :




--------------------------------------------------------------------------------
RandomForestClassifier_nosmote Sub3_FactClaiming :




--------------------------------------------------------------------------------
BernoulliNB_nosmote Sub1_Toxic :




--------------------------------------------------------------------------------
BernoulliNB_nosmote Sub2_Engaging :




--------------------------------------------------------------------------------
BernoulliNB_nosmote Sub3_FactClaiming :



In [7]:
def print_df(df, metric, task):
    df = df[(df['metric'] == metric) & (df['task_name'] == task)]
    df = df.sort_values(by=['value'], ascending=False)
    print(df.head(3))
    return df

for task_name in ['Sub1_Toxic', 'Sub2_Engaging', 'Sub3_FactClaiming']:
    print('=' * 80)
    print(task_name)
    print('-' * len(task_name))
    print()
    for metric in ['accuracy', 'f1_score']:
        print_df(result_df, metric, task_name)
        print()


Sub1_Toxic
----------

                        classifier   task_name    metric smote     value
31             BernoulliNB_nosmote  Sub1_Toxic  accuracy     0   0.67209
1                LinearSVC_nosmote  Sub1_Toxic  accuracy     0  0.662078
25  RandomForestClassifier_nosmote  Sub1_Toxic  accuracy     0  0.662078

                classifier   task_name    metric smote     value
2        LinearSVC_nosmote  Sub1_Toxic  f1_score     0  0.470588
8                LinearSVC  Sub1_Toxic  f1_score     1  0.470149
20  RandomForestClassifier  Sub1_Toxic  f1_score     1  0.243767

Sub2_Engaging
-------------

                        classifier      task_name    metric smote     value
27  RandomForestClassifier_nosmote  Sub2_Engaging  accuracy     0  0.833542
21          RandomForestClassifier  Sub2_Engaging  accuracy     1  0.831039
3                LinearSVC_nosmote  Sub2_Engaging  accuracy     0   0.81602

                classifier      task_name    metric smote     value
22  RandomForestClass

# Multinomial NB pipeline (Vectorization Approach 2)
## In this method, word counts are used along with TF-IDF

In [8]:
import sklearn
from sklearn.naive_bayes import MultinomialNB, CategoricalNB, BernoulliNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
#from imblearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.compose import ColumnTransformer
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.preprocessing import DenseTransformer
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import TransformerMixin, BaseEstimator
import sklearn
import itertools

MAX_COMBINATION_NUMBER=3

def get_feature_column_names(df):
    return [cname for cname in df.columns if not cname.startswith('Sub')]

def get_target_column_names(df):
    return [cname for cname in df.columns if cname.startswith('Sub')]

def is_text_column(colname:str)->bool:
    if 'text' in colname:
        return True
    return False

def get_text_columns(df)->list:
    return [cn for cn in df.columns if is_text_column(cn)]

def get_nontext_columns(df)->list:
    return [cn for cn in df.columns if not is_text_column(cn)]

class CustomTextProcessor(BaseEstimator, TransformerMixin):
    def __init__(self, use_tfid=True):
        vect = CountVectorizer() if use_tfid == False else TfidfVectorizer()
        self.tweet_text_transformer = Pipeline(steps=[
                                        ('vect', vect),
                        ])

class CustomCountVectorizer(CountVectorizer):
    # The only difference here is that we don't return a sparse array
    # So that this can work with a transformer
    # Instead we return a dense array
    def __init__(self):
        super(CustomCountVectorizer, self).__init__()
    
    def fit_transform(self, X, y=None):
        return super(CustomCountVectorizer, self).fit_transform(X.squeeze(), y).toarray()
    
    def fit(self, X, y=None):
        return super(CustomCountVectorizer, self).fit(X, y)

    def transform(self, X):
        return super(CustomCountVectorizer, self).transform(X.squeeze()).toarray()

class CustomTfidVectorizer(TfidfVectorizer):
    # The only difference here is that we don't return a sparse array
    # So that this can work with a transformer
    # Instead we return a dense array
    def __init__(self):
        super(CustomTfidVectorizer, self).__init__()
    
    def fit_transform(self, X, y=None):
        return super(CustomTfidVectorizer, self).fit_transform(X.squeeze(), y).toarray()
    
    def fit(self, X, y=None):
        return super(CustomTfidVectorizer, self).fit(X, y)

    def transform(self, X):
        return super(CustomTfidVectorizer, self).transform(X.squeeze()).toarray()

def run_classification(                                                     \
                       dataset:pd.DataFrame,                                \
                       target_column:str,                                   \
                       clf_gen_fn,                                          \
                       use_smote=False)->tuple:

    dataset = dataset[[cn for cn in dataset.columns if cn != 'comment_text']]

    X = dataset[get_feature_column_names(dataset)]
    y = dataset[target_column]
    trainX, testX, trainY, testY = train_test_split(X, y, random_state=0)

    text_columns = ['cleaned_comment_text']
    numeric_columns = ['n_all_caps', 'perc_exclamations', 'num_exclamations']

    preprocessor = ColumnTransformer(
                        remainder='drop',
                        transformers =                                      \
                                    [                                       \
                                        ('text', CustomCountVectorizer(), text_columns), \
                                        ('text2', CustomTfidVectorizer(), text_columns), \
                                        ('num', StandardScaler(with_mean=False, with_std=False), numeric_columns)
                                    ])
    classif_pipeline = Pipeline([('prep', preprocessor), ('classif', clf_gen_fn())])

    classif_pipeline.fit(trainX, trainY)
    y_pred = classif_pipeline.predict(testX)

    return accuracy_score(testY, y_pred), f1_score(testY, y_pred), classif_pipeline



def run_classifiers():

    classifiers = {
        "LinearSVC_nosmote": (False, lambda: LinearSVC(),),
        "MultinomialNB_nosmote": (False, lambda: MultinomialNB(),),
        "RandomForestClassifier": (True, lambda: RandomForestClassifier(n_jobs=-1),),
    }

    result_df = pd.DataFrame(                                                           \
                    {                                                       \
                        'classifier': pd.Series('str'),                     \
                        'task_name': pd.Series('str'),                      \
                        'metric': pd.Series('str'),                         \
                        'smote': pd.Series(int),                            \
                        'value': pd.Series(float),                          \
                    })

    model_arr = list()
    for clfname, value in classifiers.items():
        use_smote, clfgen = value
        for colname in ['Sub1_Toxic', 'Sub2_Engaging', 'Sub3_FactClaiming']:
            accuracy, f1, model = run_classification(train_df, colname, clfgen, use_smote)
            print(f"{clfname:20.20s} {colname:20.20s} accuracy={accuracy:1.3f}              f1={f1:1.3f}   smote={use_smote}")
            result_dict = {
                'classifier': clfname,
                'task_name': colname,
                'metric': 'accuracy',
                'value': accuracy,
                'smote': 1 if use_smote else 0
            }
            result_df = result_df.append(result_dict, ignore_index=True)
            result_dict = {
                'classifier': clfname,
                'task_name': colname,
                'metric': 'f1_score',
                'value': f1,
                'smote': 1 if use_smote else 0
            }
            result_df = result_df.append(result_dict, ignore_index=True)
            model_arr.append((clfname, colname, model))

        
    return result_df, model_arr

seed_random()
result_df, model_arr = run_classifiers()


print('=' * 80)
print('=' * 80)

import IPython


for clfname, colname, model in model_arr:
    print()
    print('-' * 80)
    print(clfname, colname, ":")
    print()
    IPython.display.display(model)



LinearSVC_nosmote    Sub1_Toxic           accuracy=0.663              f1=0.476   smote=False




LinearSVC_nosmote    Sub2_Engaging        accuracy=0.821              f1=0.590   smote=False




LinearSVC_nosmote    Sub3_FactClaiming    accuracy=0.740              f1=0.574   smote=False
MultinomialNB_nosmot Sub1_Toxic           accuracy=0.666              f1=0.303   smote=False
MultinomialNB_nosmot Sub2_Engaging        accuracy=0.762              f1=0.541   smote=False
MultinomialNB_nosmot Sub3_FactClaiming    accuracy=0.710              f1=0.577   smote=False
RandomForestClassifi Sub1_Toxic           accuracy=0.662              f1=0.177   smote=True
RandomForestClassifi Sub2_Engaging        accuracy=0.839              f1=0.583   smote=True
RandomForestClassifi Sub3_FactClaiming    accuracy=0.757              f1=0.531   smote=True

--------------------------------------------------------------------------------
LinearSVC_nosmote Sub1_Toxic :




--------------------------------------------------------------------------------
LinearSVC_nosmote Sub2_Engaging :




--------------------------------------------------------------------------------
LinearSVC_nosmote Sub3_FactClaiming :




--------------------------------------------------------------------------------
MultinomialNB_nosmote Sub1_Toxic :




--------------------------------------------------------------------------------
MultinomialNB_nosmote Sub2_Engaging :




--------------------------------------------------------------------------------
MultinomialNB_nosmote Sub3_FactClaiming :




--------------------------------------------------------------------------------
RandomForestClassifier Sub1_Toxic :




--------------------------------------------------------------------------------
RandomForestClassifier Sub2_Engaging :




--------------------------------------------------------------------------------
RandomForestClassifier Sub3_FactClaiming :



In [9]:
def print_df(df, metric, task):
    df = df[(df['metric'] == metric) & (df['task_name'] == task)]
    df = df.sort_values(by=['value'], ascending=False)
    print(df.head(3))
    return df

for task_name in ['Sub1_Toxic', 'Sub2_Engaging', 'Sub3_FactClaiming']:
    print('=' * 80)
    print(task_name)
    print('-' * len(task_name))
    print()
    for metric in ['accuracy', 'f1_score']:
        print_df(result_df, metric, task_name)
        print()


Sub1_Toxic
----------

                classifier   task_name    metric smote     value
7    MultinomialNB_nosmote  Sub1_Toxic  accuracy     0  0.665832
1        LinearSVC_nosmote  Sub1_Toxic  accuracy     0  0.663329
13  RandomForestClassifier  Sub1_Toxic  accuracy     1  0.662078

                classifier   task_name    metric smote     value
2        LinearSVC_nosmote  Sub1_Toxic  f1_score     0  0.475634
8    MultinomialNB_nosmote  Sub1_Toxic  f1_score     0  0.302872
14  RandomForestClassifier  Sub1_Toxic  f1_score     1  0.176829

Sub2_Engaging
-------------

                classifier      task_name    metric smote     value
15  RandomForestClassifier  Sub2_Engaging  accuracy     1  0.838548
3        LinearSVC_nosmote  Sub2_Engaging  accuracy     0  0.821026
9    MultinomialNB_nosmote  Sub2_Engaging  accuracy     0  0.762203

                classifier      task_name    metric smote     value
4        LinearSVC_nosmote  Sub2_Engaging  f1_score     0  0.590258
16  RandomForestC

In [10]:
#result_df
result_df[result_df['metric'] == 'f1_score'].sort_values(by='value', ascending=False)


Unnamed: 0,classifier,task_name,metric,smote,value
4,LinearSVC_nosmote,Sub2_Engaging,f1_score,0,0.590258
16,RandomForestClassifier,Sub2_Engaging,f1_score,1,0.582524
12,MultinomialNB_nosmote,Sub3_FactClaiming,f1_score,0,0.576642
6,LinearSVC_nosmote,Sub3_FactClaiming,f1_score,0,0.57377
10,MultinomialNB_nosmote,Sub2_Engaging,f1_score,0,0.541063
18,RandomForestClassifier,Sub3_FactClaiming,f1_score,1,0.531401
2,LinearSVC_nosmote,Sub1_Toxic,f1_score,0,0.475634
8,MultinomialNB_nosmote,Sub1_Toxic,f1_score,0,0.302872
14,RandomForestClassifier,Sub1_Toxic,f1_score,1,0.176829
