#Project Summary

This is the source code for the IU Antisemitism Datathon and Hackathon 2020.
The final project classifier is an LSTM Keras network using word embeddings and several hidden layers.
We experimented using several different classification methods before concluding that the LSTM network gave superior performance.

##Methods Explored:
**1) Ngrams Model with Tf-Idf vectorization** <br>
Used NLTK to preprocess the text of each tweet together with the user's profile description. The tweet text concatenated with the user's profile description was stored under the feature column "total_text". After lowercasing, removing stop words, stemming, lemmatizing, and dropping rare tokens, there were ~550 unique words. Used NLTK's frequency distribution to find unigrams, bigrams, and trigrams whose counts differed by more than 10 between antisemitic and clean tweets (in either direction). Those most significant ngrams were used for the next phase of the model. We used sklearn to compute the Term Frequency-Inverse Document Frequency (Tf-Idf) for each ngram. This method gives higher scores to ngrams that appear less frequently in the corpus as a whole but more frequently in an individual tweet, thus giving more weight to terms that are distinctive for a given tweet. The Tf-Idf vectors of these ngrams served as inputs for a Naive Bayes classifier.
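
To make the weighting concrete, here is a minimal sketch of the textbook tf-idf formula on a toy corpus. The corpus and the helper `tf_idf` are hypothetical and are not part of the notebook. sklearn's `TfidfTransformer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of already-preprocessed tokens.
corpus = [
    ["news", "fake", "news"],
    ["news", "report"],
    ["news", "weather", "report"],
]

def tf_idf(term, doc, docs):
    """Textbook tf-idf: term frequency in `doc` times log(N / document frequency)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

# "news" occurs in every document, so its idf (and therefore its score) is 0.
# "fake" occurs in only one document, so it gets a positive score there.
score_news = tf_idf("news", corpus[0], corpus)
score_fake = tf_idf("fake", corpus[0], corpus)
print(score_news, score_fake)
```

Even though "news" appears twice in the first document and "fake" only once, "fake" receives the higher score because it is rare across the corpus.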

**2) Spacy Text Classifiers** <br>
Vectorized the tweet data using the same ngram vocabulary determined in the first NLTK model. The data was vectorized using only a CountVectorizer instead of the tf-idf method of determining ngram frequency. Used sklearn's Pipeline to easily test this method of text preparation on a variety of models, including LogisticRegression, Naive Bayes (MultinomialNB), Support Vector Classifier (SVC), RandomForestClassifier, and AdaBoostClassifier. Finally, these algorithms were combined into a VotingClassifier in which each sub-classifier was given a single vote. The AdaBoostClassifier and the VotingClassifier had the highest accuracy and F1 score of all the models tested.
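
The "one vote per sub-classifier" idea can be sketched in plain Python. This is only an illustration of majority voting, not sklearn's implementation: `hard_vote`, the example votes, and the tie-breaking rule are assumptions of the sketch.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the majority label among one prediction per classifier."""
    counts = Counter(predictions)
    best = max(counts.values())
    # Break ties by choosing the smallest label (an assumption of this sketch).
    return min(label for label, n in counts.items() if n == best)

# Hypothetical predictions from four sub-classifiers for three tweets.
votes_per_tweet = [
    [1, 1, 0, 1],  # three of four vote "antisemitic"
    [0, 0, 1, 0],  # three of four vote "clean"
    [1, 0, 1, 0],  # tie
]
ensemble = [hard_vote(v) for v in votes_per_tweet]
print(ensemble)
```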

**3) LSTM classifier** <br>
Used the Keras Tokenizer, which converts all text to lowercase and strips punctuation, but otherwise skipped the text preprocessing used in the earlier models. The vectorized data is fed into a network with an embedding layer, an LSTM layer, a dense ReLU layer, a dropout layer, and a sigmoid output layer. Despite the decreased amount of text preprocessing, this network outperformed all other models and was ultimately chosen to classify the data.

**4) Classifier Combination** <br>
We hypothesized that the LSTM network's already strong predictions could be augmented by incorporating the predictions of the other classifiers. We engineered feature columns representing the predictions of the Tf-Idf classifier, VotingClassifier, AdaBoostClassifier, and LSTM network. These predictions were used as input for a Support Vector Classifier (SVC). However, this ensemble classifier performed worse, as measured by accuracy and F1 score, than the LSTM network alone. This suggests that the LSTM network already captures the information gleaned by the other text classifiers, and thus was not helped by their input.
Consequently, we chose to use solely the LSTM network.<br>

**Final Result**<br>
The LSTM network outputs a probability between 0 and 1 that the tweet is antisemitic. We found that feeding this output into an SVC, which produces a binary prediction, achieved greater accuracy than splitting the LSTM probability at an arbitrary threshold such as 0.5. The SVC effectively learns the threshold from the training data. Thus, the final classifier is an SVC that outputs a binary prediction given a single input: the probability determined by the LSTM network.

The "Parent Classifier" outputs the binary prediction given the probability that the tweet is antisemitic according to the LSTM Classifier. <br>
The "Parent Classifier" is an SVM that determines the best probability threshold for classification.
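
One way to picture what the Parent Classifier does with a single input is that it roughly learns a cut-off on the LSTM probability. The sketch below finds such a cut-off by brute force, maximizing accuracy on labelled data. It is a simplified stand-in for the SVC, not the method used in this notebook, and `best_threshold` and the data are hypothetical.

```python
def best_threshold(probs, labels):
    """Return the cut-off (taken from the observed probabilities) with the highest accuracy."""
    best_t, best_acc = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        acc = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical LSTM probabilities and true labels.
probs = [0.10, 0.20, 0.35, 0.40, 0.45, 0.80, 0.90]
labels = [0, 0, 0, 1, 1, 1, 1]

threshold, accuracy = best_threshold(probs, labels)
# Accuracy when splitting at the fixed 0.5 cut-off instead.
fixed_accuracy = sum(int((p >= 0.5) == bool(y)) for p, y in zip(probs, labels)) / len(labels)
print(threshold, accuracy, fixed_accuracy)
```

In this made-up data the positive tweets start at a probability of 0.40, so a learned cut-off separates the classes better than the fixed 0.5 split.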

##To Run The File

1) Upload training data and testing data.<br>Click the folder icon to the left of the screen and upload the two files, then change the two variable names below to the names of the two data files. <br>
2) Run All Cells <br>
3) View the output of the last cell in the notebook to see the F1 Score of the classifier.

In [None]:
#Set this variable equal to the name of your training data file(json)
JSON_FILE_NAME = "hackathon2.json"

#Set this variable equal to the name of your test data file(json)
JSON_TEST_DATA_FILE_NAME = "to_test.json"

#Import Necessary Libraries

In [None]:
#To get reproducible results, fix random number generators
import os
import random
import numpy as np
import tensorflow as tf
def reset_random_seeds():
   os.environ['PYTHONHASHSEED']=str(1)
   tf.random.set_seed(1)
   np.random.seed(1)
   random.seed(1)
reset_random_seeds()
session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)

In [None]:
!pip install flair flask

!pip install spacy

!python -m spacy download en
from spacy.lang.en import English
import spacy

import nltk
from nltk.stem import WordNetLemmatizer 
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LinearRegression

from sklearn.naive_bayes import MultinomialNB
from nltk.corpus import wordnet

from flask import abort, Flask, request, render_template
from flair.models import TextClassifier
from flair.data import Sentence
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import chi2
from sklearn import metrics
from sklearn.metrics import accuracy_score
import string
spacy_nlp = spacy.load('en')

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding, Conv1D, MaxPooling1D
from keras.optimizers import RMSprop
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.utils import to_categorical
import keras.backend as K
from keras.callbacks import EarlyStopping

  import pandas.util.testing as tm
Using TensorFlow backend.


#Preprocess Data into Vector Formats

In [None]:
def preprocess(text, remove_uncommon=True):
  #move everything to lowercase and split into tokens
  tokens = nltk.word_tokenize(text.lower())
  #remove stop words
  go_tokens = [w for w in tokens if w not in stop_words]
  #stem text
  ps = PorterStemmer()
  stem_tokens = [ps.stem(w) for w in go_tokens]
  #lemmatize text
  lemmatizer = WordNetLemmatizer()
  final_tokens = [lemmatizer.lemmatize(word) for word in stem_tokens]
  if remove_uncommon:
    #only keep tokens that occur more than 5 times
    myTokenFD = nltk.FreqDist(final_tokens)
    final_tokens = [w for w in final_tokens if myTokenFD[w] > 5]
  return final_tokens

def preprocess_ngrams(text, n):
  #keep rare tokens so that ngrams are built from the full token sequence
  final_tokens = preprocess(text, remove_uncommon=False)
  return list(nltk.ngrams(final_tokens, n))

In [None]:
print(preprocess_ngrams("This! is a bagging! of words bags bag mice bagging mice. bag mice", 3))

[('!', 'bag', '!'), ('bag', '!', 'word'), ('!', 'word', 'bag'), ('word', 'bag', 'bag'), ('bag', 'bag', 'mouse'), ('bag', 'mouse', 'bag'), ('mouse', 'bag', 'mouse'), ('bag', 'mouse', '.'), ('mouse', '.', 'bag'), ('.', 'bag', 'mouse')]
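
For readers unfamiliar with `nltk.ngrams`, the sliding window it produces can be sketched in plain Python. `sliding_ngrams` is a hypothetical helper written for illustration and is not part of NLTK.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["bag", "word", "bag", "mouse"]
bigrams = sliding_ngrams(tokens, 2)
trigrams = sliding_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sequence of k tokens yields k - n + 1 ngrams, which is why the 12 tokens in the example above produce 10 trigrams.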


#Create DataFrame

In [None]:
df = pd.read_json(JSON_FILE_NAME)
df_test = pd.read_json(JSON_TEST_DATA_FILE_NAME)

In [None]:
type(list(df["user"].iloc[4].keys()))#.values[0]["description"])

list

In [None]:
#Add text column "total_text" combining the text of the tweet with the tweet author's profile description
#Add binary column "antisem_binary" that represents tweets as either antisemitic(1) or clean(0)
def create_new_features(df):
  #use each author's own description, falling back to "" when it is missing or null
  descriptions = df["user"].apply(lambda user: user.get("description") or "")
  df["total_text"] = df["text"] + " " + descriptions
  #add the binary classification of antisemitism if the dataframe is training data
  if("antisemitism_rating" in df.columns):
    df["antisem_binary"] = np.where(df["antisemitism_rating"] > 3, 1, 0)
  return df

In [None]:
def get_all_text_from_column(df, column):
  text = df[column]
  text = ' '.join(np.asarray(text))
  return text

In [None]:
df = create_new_features(df)
df_test = create_new_features(df_test)

In [None]:
total_text = get_all_text_from_column(df, "total_text")

In [None]:
def get_semitic_df(df):
  return df[df["antisemitism_rating"] > 3]
def get_clean_df(df):
  return df[df["antisemitism_rating"] <= 3]

In [None]:
sem_df = get_semitic_df(df)
clean_df = get_clean_df(df)
print(sem_df.shape)
print(clean_df.shape)

total_text = get_all_text_from_column(df, "total_text")
sem_text = get_all_text_from_column(sem_df, "total_text")
clean_text = get_all_text_from_column(clean_df, "total_text")

total_tokens = preprocess(total_text)
sem_tokens = preprocess(sem_text)
clean_tokens = preprocess(clean_text)

total_bigrams = preprocess_ngrams(total_text, 2)
sem_bigrams = preprocess_ngrams(sem_text, 2)
clean_bigrams = preprocess_ngrams(clean_text, 2)
total_trigrams = preprocess_ngrams(total_text, 3)
sem_trigrams = preprocess_ngrams(sem_text, 3)
clean_trigrams = preprocess_ngrams(clean_text, 3)

(436, 31)
(569, 31)


#Get Significant Ngrams

In [None]:
def get_freq_distr(tokens):
  return nltk.FreqDist(tokens)
def get_percent_frequency_distr(tokens):
  fd = nltk.FreqDist(tokens)
  num_words = float(sum(fd.values()))
  relfrq = [x/num_words for x in fd.values() ]
  return relfrq
total_fd = get_freq_distr(total_tokens)
sem_fd = get_freq_distr(sem_tokens)
clean_fd = get_freq_distr(clean_tokens)
total_pfd = get_percent_frequency_distr(total_tokens)
sem_pfd = get_percent_frequency_distr(sem_tokens)
clean_pfd = get_percent_frequency_distr(clean_tokens)

total_bi_fd = get_freq_distr(total_bigrams)
sem_bi_fd = get_freq_distr(sem_bigrams)
clean_bi_fd = get_freq_distr(clean_bigrams)

total_tri_fd = get_freq_distr(total_trigrams)
sem_tri_fd = get_freq_distr(sem_trigrams)
clean_tri_fd = get_freq_distr(clean_trigrams)

In [None]:
print(total_fd)

<FreqDist with 556 samples and 30771 outcomes>


In [None]:
significant_words = []
for x in total_fd:
  dif = abs(sem_fd[x] - clean_fd[x])
  if(dif > 10):
    #print(x, p_dif, sem_fd[x], clean_fd[x])
    significant_words.append(x)
print("There are ", len(significant_words), "significant words")

significant_bigrams = []
for x in total_bi_fd:
  dif = abs(sem_bi_fd[x] - clean_bi_fd[x])
  if(dif > 10):
    #print(x, p_dif, sem_fd[x], clean_fd[x])
    significant_bigrams.append(x)
print("There are ", len(significant_bigrams), "significant bigrams")

significant_trigrams = []
for x in total_tri_fd:
  dif = abs(sem_tri_fd[x] - clean_tri_fd[x])
  if(dif > 10):
    #print(x, p_dif, sem_fd[x], clean_fd[x])
    significant_trigrams.append(x)
print("There are ", len(significant_trigrams), "significant trigrams")

There are  140 significant words
There are  92 significant bigrams
There are  52 significant trigrams
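
The selection rule used above keeps an ngram when its raw counts in the two classes differ by more than 10. A minimal sketch of that rule follows, with `collections.Counter` standing in for `nltk.FreqDist` (which is a `Counter` subclass). The helper `significant_items` and the counts are made up for illustration.

```python
from collections import Counter

def significant_items(pos_counts, neg_counts, min_diff=10):
    """Keep items whose counts in the two classes differ by more than min_diff."""
    vocabulary = set(pos_counts) | set(neg_counts)
    return sorted(x for x in vocabulary if abs(pos_counts[x] - neg_counts[x]) > min_diff)

# Made-up token counts for the two classes.
antisemitic_counts = Counter({"alpha": 40, "beta": 12, "gamma": 3})
clean_counts = Counter({"alpha": 5, "beta": 10, "delta": 25})

selected = significant_items(antisemitic_counts, clean_counts)
print(selected)
```

Because the rule compares raw counts rather than relative frequencies, very common tokens such as punctuation can pass the cut even when their per-class rates are similar.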


In [None]:
print(significant_words)
print(significant_bigrams)
print(significant_trigrams)

['medium', 'jew', 'http', ':', 'block', 'ann', 'appelbaum', '&', 'radek', 'sikorski', 'svenska', 'dagbladet😎', 'anti-semit', 'alt-right', '.', 'antysemitów', 'blokuję', 'automatyczni', '#', 'lol', 'dream', 'kike', 'evid', 'parliament', ')', 'fad', 'til', '(', 'sad', '?', 'rt', '@', 'make', 'holocaust', '’', ',', 'un', 'white', '“', 'real', '”', 'de', 'jewish', 'zionazi', "'s", 'hate', '``', '!', 'syria', 'rawr', 'xd', 'christian', 'vanguardiacom', 'huevo', 'donará', 'millón', 'en', 'cauca', 'santand', '//t.co/ku32x06m10', '//t.co/mwbqqk40ol', 'hitler', 'result', 'u', 'think', 'lo', 'la', 'el', 'like', 'million', 'woman', 'call', "''", 'zionist', "'", 'attack', 'muslim', 'remembr', 'day', 'live', '6', 'murder', 'due', 'fake', 'need', 'use', 'maga', 'thing', 'saddest', 'peopl', 'say', 'fuck', 'feel', 'wo', 'nazi', 'rememb', 'auschwitz', 'bet', 'fakenew', 'black', 'bad', 'could', 'apartheid', 'zionazist', 'ethnic', 'clean', 'land', 'militari', 'bomb', 'deport', 'antisemit', 'que', 'le', '

#Train and Test Data

In [None]:
y_train = df["antisem_binary"]
X_train = df.drop("antisem_binary", axis=1)
#y_test = df_test["antisem_binary"]
X_test = df_test
#X_test = df_test.drop("antisem_binary", axis=1)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

In [None]:
print(X_train.shape)
print(y_train.shape)

(1005, 30)
(1005,)


#NGrams Classifier

In [None]:
def get_vocabulary(significant_words, significant_bigrams, significant_trigrams):
  #join each ngram tuple into a single space-separated string (all three tokens for trigrams)
  significant_bigrams = [" ".join(bigram) for bigram in significant_bigrams]
  significant_trigrams = [" ".join(trigram) for trigram in significant_trigrams]
  vocab = significant_words + significant_bigrams + significant_trigrams
  return list(set(vocab))

In [None]:
tfidf_clf = MultinomialNB()
tfidf_transformer = TfidfTransformer()
vocabulary = get_vocabulary(significant_words, significant_bigrams, significant_trigrams)
count_vect = CountVectorizer(ngram_range=(1, 3), vocabulary=vocabulary)
#X_train is the DataFrame of tweets, y_train is the label of each tweet
#count_vect only counts ngrams from the fixed vocabulary above (to decrease complexity)
def train_tfidf_clf(X_train, y_train):
  X_train_counts = count_vect.fit_transform(np.asarray(X_train["total_text"]))
  X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
  tfidf_clf.fit(X_train_tfidf, np.asarray(y_train))
#input: DataFrame with a "total_text" column of unprocessed text
#return classification of each tweet
def predict_tfidf_clf(X_test):
  X_test_counts = count_vect.transform(np.asarray(X_test["total_text"]))
  #apply the same tf-idf weighting that was used during training
  return tfidf_clf.predict(tfidf_transformer.transform(X_test_counts))

In [None]:
#X_test is a DataFrame with a "total_text" column
#y_test is the label for each tweet: 0 for clean, 1 for antisemitic
def score_tfidf_clf(X_test,y_test):
  X_test_counts = count_vect.transform(np.asarray(X_test["total_text"]))
  X_test_tfidf = tfidf_transformer.transform(X_test_counts)
  y_test = np.asarray(y_test)
  score = tfidf_clf.score(X_test_tfidf,y_test)
  return score

In [None]:
train_tfidf_clf(X_train,y_train)

In [None]:
def add_tfidf_pred(df):
  pred = pd.Series(predict_tfidf_clf(df), name="tfidf_pred")
  return pd.concat([df.reset_index(drop=True), pred.reset_index(drop=True)], axis=1)
df = add_tfidf_pred(df)
X_train = add_tfidf_pred(X_train)
X_test = add_tfidf_pred(X_test)

In [None]:
#accuracy = score_tfidf_clf(X_test, y_test)
#f1_score = metrics.f1_score(predict_tfidf_clf(X_test), y_test)
#print("Accuracy of this model: ", accuracy)
#print("F1 Score: ", f1_score)

#Spacy Text Classifier

In [None]:
# Custom transformer that extracts the "total_text" column, stripped and lowercased
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [text.strip().lower() for text in X["total_text"]]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

In [None]:
# Logistic Regression Classifier
classifier = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', count_vect),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,pd.Series(y_train))

Pipeline(memory=None,
         steps=[('cleaner', <__main__.predictors object at 0x7fdce5907710>),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 3), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 t...
                                             'militari', 'cauca', '; #',
                                             'zionazi block', 'fakenew', 'nazi',
                                             '# randum', 'ann', 'fad', 'http :',
                                             'woman', ...])),
                ('classifier',
                 Logis

In [None]:
def get_clf(classifier):
  return Pipeline([("cleaner", predictors()),
                 ('vectorizer', count_vect),
                 ('classifier', classifier)])
def fit_clf(classifier):
  #fitting the pipeline fits the classifier object passed in, so other pipelines wrapping the same object can reuse it
  pipe = get_clf(classifier)
  pipe.fit(X_train, y_train)
def get_clf_f1score(classifier):
  #needs labelled test data: y_test is only defined when the commented-out train/test split is used
  pipe = get_clf(classifier)
  pipe.fit(X_train,y_train)
  #print("Recall: ", metrics.recall_score(y_test, pipe.predict(X_test)))
  #print("Precision: ", metrics.precision_score(y_test, pipe.predict(X_test)))
  return metrics.f1_score(y_test, pipe.predict(X_test))

In [None]:
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
#print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
#print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
#print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
svm = SGDClassifier(loss='hinge', penalty='l2',
                             alpha=1e-3, random_state=42,
                             max_iter=5, tol=None)
svc = SVC(kernel="rbf", C=1.0,gamma="scale")
ada = AdaBoostClassifier(base_estimator=LogisticRegression(), n_estimators=50)

classifiers = [LogisticRegression(), MultinomialNB(), svc, RandomForestClassifier(n_estimators=10), ada]
voter = VotingClassifier(estimators=[("lr", classifiers[0]), ("mnb", classifiers[1]), ("svc", classifiers[2]), ("rf", RandomForestClassifier(n_estimators=10))], voting="hard")
#print("Classifier F1 Scores")
ada_clf = get_clf(ada)
voter_clf = get_clf(voter)
fit_clf(voter)
#print("Voter: ", get_clf_f1score(voter))
for classifier in classifiers.copy():
  fit_clf(classifier)
  #print(classifier.__class__.__name__, ": ", get_clf_f1score(classifier))
  

In [None]:
def add_ada_pred(df):
  pred = pd.Series(ada_clf.predict(df), name="ada_pred")
  return pd.concat([df.reset_index(drop=True), pred.reset_index(drop=True)], axis=1)
df = add_ada_pred(df)
X_train = add_ada_pred(X_train)
X_test = add_ada_pred(X_test)

In [None]:
def add_voter_pred(df):
  pred = pd.Series(voter_clf.predict(df), name="voter_pred")
  return pd.concat([df.reset_index(drop=True), pred.reset_index(drop=True)], axis=1)
df = add_voter_pred(df)
X_train = add_voter_pred(X_train)
X_test = add_voter_pred(X_test)

# **LSTM Network**

In [None]:
X_LSTM_train = X_train["text"]
y_LSTM_train = y_train
#X_LSTM_train,X_LSTM_test,y_LSTM_train,y_LSTM_test = train_test_split(X_LSTM,y_LSTM,test_size=0.1)

In [None]:
max_words = 1000
max_len = 150
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X_LSTM_train)
sequences = tok.texts_to_sequences(X_LSTM_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)
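
By default, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`, fill value 0). The plain-Python sketch below mimics only those defaults so the shape of `sequences_matrix` is easier to picture. `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in seqs:
        kept = seq[-maxlen:]
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

# Hypothetical word-index sequences of different lengths.
sequences_demo = [[4, 7], [1, 2, 3, 4, 5, 6]]
padded_demo = pad_pre(sequences_demo, maxlen=4)
print(padded_demo)
```

Every row ends up with exactly `maxlen` entries, so tweets shorter than 150 tokens are filled with leading zeros and longer ones lose their earliest tokens.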

In [None]:
def get_f1(y_true, y_pred): #taken from old keras source code
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val
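
The same arithmetic can be checked by hand on confusion-matrix counts. The sketch below omits the `K.epsilon()` terms, which only guard against division by zero, and uses made-up counts. As far as we understand, Keras averages a custom metric like `get_f1` over batches, so the value it reports can differ from an F1 score computed once over the whole dataset.

```python
def f1_from_counts(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 30 correct positives, 10 false alarms, 20 misses.
precision, recall, f1 = f1_from_counts(30, 10, 20)
print(precision, recall, f1)
```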

In [None]:
def RNN():
    inputs = Input(name='inputs',shape=[max_len])
    layer = Embedding(max_words,50,input_length=max_len)(inputs)
    layer = LSTM(100)(layer)
    layer = Dense(256,name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.2)(layer)
    layer = Dense(1,name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model
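
As a sanity check on the size of this network, the trainable parameter count can be worked out by hand. The sketch assumes the standard Keras layers with bias terms and an LSTM with four gates. The variable names are chosen for this example.

```python
max_words, embed_dim = 1000, 50
lstm_units, dense_units = 100, 256

# Embedding: one vector of length embed_dim per vocabulary index.
embedding_params = max_words * embed_dim
# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# Dense layers: weights plus one bias per output unit.
fc1_params = lstm_units * dense_units + dense_units
out_params = dense_units * 1 + 1

total_params = embedding_params + lstm_params + fc1_params + out_params
print(embedding_params, lstm_params, fc1_params, out_params, total_params)
```

With roughly 136,000 parameters and about 1,000 training tweets, the dropout layer and the validation split are worth watching for signs of overfitting.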

In [None]:
reset_random_seeds()
model = RNN()
model.compile(loss='binary_crossentropy',optimizer="adam",metrics=['accuracy',get_f1])
callbacks = []#[EarlyStopping(monitor='val_loss',min_delta=0.0001,patience=3)]
model.fit(sequences_matrix,y_LSTM_train,batch_size=128,epochs=20,
          validation_split=0.2,callbacks=callbacks)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 804 samples, validate on 201 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.callbacks.History at 0x7fdce3f69128>

In [None]:
#test_sequences = tok.texts_to_sequences(X_LSTM_test)
#test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)
#accr = model.evaluate(test_sequences_matrix,y_LSTM_test)
#print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}\n  F1Score: {:0.3f}'.format(accr[0],accr[1],accr[2]))

In [None]:
def add_lstm_pred(df):
  if("lstm_pred" in df.columns):
    df = df.drop("lstm_pred", axis=1)
  test_sequences = tok.texts_to_sequences(df["text"])
  test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)
  #print(pd.Series(model.predict(test_sequences_matrix)[:,0]))
  pred = pd.Series(model.predict(test_sequences_matrix)[:,0],name="lstm_pred")
  #print(pred)
  return pd.concat([df.reset_index(drop=True), pred.reset_index(drop=True)], axis=1)
  #return df
def add_lstm_binary_pred(df):
  df["lstm_binary_pred"] = np.where(df["lstm_pred"] > 0.5, 1, 0)
  return df
df = add_lstm_pred(df)
X_train = add_lstm_pred(X_train)
X_test = add_lstm_pred(X_test)

df = add_lstm_binary_pred(df)
X_train = add_lstm_binary_pred(X_train)
X_test = add_lstm_binary_pred(X_test)

In [None]:
#print("Accuracy Score: ", metrics.accuracy_score(X_test["lstm_binary_pred"], y_test))
#print("Recall Score: ", metrics.recall_score(X_test["lstm_binary_pred"], y_test))
#print("Precision Score: ", metrics.precision_score(X_test["lstm_binary_pred"], y_test))
#print("F1 Score: ", metrics.f1_score(X_test["lstm_binary_pred"], y_test))

#Parent Classifier
A Support Vector Classifier takes the predictions of the subclassifiers as input to form an ensemble classifier.

In [None]:
parent = SVC(kernel="rbf", C=1.0,gamma="scale")


#X is the formatted output of each of the classifiers
#y is the label for each tweet: 0 for clean, 1 for antisemitic
#We originally tested using the outputs of the AdaBoost, Voter, Tfidf, and LSTM classifiers as inputs
#columns = ["ada_pred", "voter_pred", "tfidf_pred", "lstm_pred"]

#However, the additional inputs weakened the SVC's predictive power, so we decided to only use the LSTM classifier for prediction.
columns = ["lstm_pred"]
def train_parent(X_train,y_train):
  X_train = X_train[columns]
  parent.fit(X_train,y_train)
def predict_parent(X):
  preds = parent.predict(X[columns])
  return preds
def score_parent(X,y):
  score = parent.score(X[columns],y)
  return score
def f1_parent(X,y):
  preds = predict_parent(X)
  return metrics.f1_score(y,preds)

In [None]:
#This is the f1 score of the classifier taking into account the antisemitic probability given by the LSTM classifier
train_parent(X_train, y_train)
#print("Accuracy: ", accuracy_score(y_test,predict_parent(X_test)))
#print("F1 Score: ", f1_parent(X_test, y_test))
test_predictions = predict_parent(X_test)

In [None]:
print(test_predictions)

[0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0
 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1
 1 0 1 0 1]
