## Natural Language Semantic Analysis for Processing Antisemitic Incident Data ##

After October 7, 2023, the Anti-Defamation League changed its methodology to include anti-Zionist rallies and slogans in its annual Audit of Antisemitic Incidents. Because incidents counted under the new methodology are mixed with incidents counted under the old one, the change confounds quantitative analysis of the data. The following code identifies incidents that entered the dataset under the new methodology so that they can be flagged for removal.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer
import hdbscan


In [None]:
# Read in dataset
df = pd.read_csv('data_incidents_nlp.csv')

In [None]:
# Transform descriptions into embeddings (semantic vectors)
model = SentenceTransformer("all-mpnet-base-v2")
emb = model.encode(df["description"].tolist(), show_progress_bar=True)
df["embedding"] = emb.tolist()

Batches: 100%|██████████| 958/958 [02:52<00:00,  5.54it/s]


In [None]:
# Define the feature matrix and labels
X = np.vstack(df["embedding"].values) # All embeddings stacked into one matrix, one row per incident
y = df["flag_2"].values # 1 = flagged by the initial keyword-flagging process performed in R, 0 = not flagged

In [None]:
C_antizion = X[y == 1].mean(axis=0) # Centroid of all flagged embeddings
C_other   = X[y == 0].mean(axis=0) # Centroid of all other embeddings
antizionism_axis = C_other - C_antizion # Axis points from the flagged centroid toward the other centroid
antizionism_axis = antizionism_axis / np.linalg.norm(antizionism_axis) # Normalize the axis to unit length

In [None]:
scores = cosine_similarity(X, antizionism_axis.reshape(1, -1)).flatten() # Cosine similarity of each embedding with the axis, as a 1-D array
df["antizionism_score_axis"] = (scores - scores.min()) / (scores.max() - scores.min()) # Min-max scale to [0, 1]; given the axis direction, lower values are closer to the flagged centroid
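
The direction of the axis determines how the score should be read. The toy example below is a minimal sketch of the same centroid-difference construction on four made-up 2-D vectors (`X_toy` and `y_toy` are hypothetical and not taken from the dataset). It suggests that, with the axis defined as "other minus flagged", flagged rows end up with the lower scaled scores.

```python
import numpy as np

# Hypothetical 2-D "embeddings"; real embeddings have many more dimensions
X_toy = np.array([
    [1.0, 0.1],  # flagged
    [0.9, 0.2],  # flagged
    [0.1, 1.0],  # not flagged
    [0.2, 0.9],  # not flagged
])
y_toy = np.array([1, 1, 0, 0])

# Centroids of each group
c_flagged = X_toy[y_toy == 1].mean(axis=0)
c_other = X_toy[y_toy == 0].mean(axis=0)

# Unit-length axis pointing from the flagged centroid toward the other centroid
axis = c_other - c_flagged
axis = axis / np.linalg.norm(axis)

# Cosine similarity of each row with the axis (axis already has norm 1)
cos = (X_toy @ axis) / np.linalg.norm(X_toy, axis=1)

# Min-max scale to [0, 1]
scaled = (cos - cos.min()) / (cos.max() - cos.min())

print(scaled[y_toy == 1])  # scores of flagged rows
print(scaled[y_toy == 0])  # scores of unflagged rows
```

If a score that increases with similarity to flagged incidents is preferred, one option is to use `1 - scaled` or to reverse the subtraction when building the axis.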

In [None]:
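# Exploratory check: fit a logistic regression on the embeddings to see how well they reproduce the keyword flag
# (the split is unseeded, so the metrics vary slightly between runs)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25) # Hold out 25% of rows for testing

clr = LogisticRegression(max_iter=500)
clr.fit(X_train, y_train)

probs_test = clr.predict_proba(X_test)[:, 1] # Predicted probability of the flagged class
preds_test = clr.predict(X_test) # Predicted class labels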

In [22]:
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

print("AUC:", roc_auc_score(y_test, probs_test))
print("Log loss:", log_loss(y_test, probs_test))
print("Accuracy:", accuracy_score(y_test, preds_test))

AUC: 0.9979516445499483
Log loss: 0.03829723179893413
Accuracy: 0.9882567849686847
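
Accuracy is easier to interpret next to the share of the majority class, since a model that always predicts the more common label already reaches that share. The class balance of `flag_2` is not shown in this notebook, so the sketch below uses a hypothetical label vector (`y_example`) and a hypothetical helper (`majority_baseline_accuracy`) purely for illustration.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / labels.size

# Hypothetical labels: 90 unflagged rows and 10 flagged rows
y_example = np.array([0] * 90 + [1] * 10)
baseline = majority_baseline_accuracy(y_example)
print("Majority-class baseline accuracy:", baseline)
```

Applied to the notebook's own labels, the same helper could be called as `majority_baseline_accuracy(y_test)` and compared with the reported accuracy.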


In [23]:
from sklearn.model_selection import cross_val_score

auc_scores = cross_val_score(clr, X, y, cv=5, scoring="roc_auc")

print("CV AUC:", auc_scores)
print("Mean AUC:", auc_scores.mean())

CV AUC: [0.998238   0.9981301  0.99862848 0.99733894 0.99828181]
Mean AUC: 0.9981234654265831


In [24]:
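# Sanity check: with shuffled labels the cross-validated AUC should fall to about 0.5 (chance level)
y_shuffled = np.random.permutation(y)
cross_val_score(clr, X, y_shuffled, cv=5, scoring="roc_auc").mean()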

np.float64(0.49250678975045237)
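
A single shuffle gives one chance-level reference point. scikit-learn's `permutation_test_score` repeats the shuffle many times and reports a p-value. The sketch below runs it on synthetic data from `make_classification`, which stands in for the embeddings and is an assumption of this example; the sample size, feature count, and `n_permutations=30` are likewise illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

# Synthetic stand-in for the embedding matrix and the flag labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

clf = LogisticRegression(max_iter=500)

# score: cross-validated AUC on the true labels
# perm_scores: cross-validated AUC for each label permutation
# p_value: share of permutations scoring at least as well as the true labels
score, perm_scores, p_value = permutation_test_score(
    clf, X_demo, y_demo, cv=5, scoring="roc_auc",
    n_permutations=30, random_state=0,
)

print("AUC on true labels:", score)
print("Mean AUC on shuffled labels:", perm_scores.mean())
print("p-value:", p_value)
```

On the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`; more permutations give a finer-grained p-value at the cost of runtime.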

In [None]:
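# Largest-magnitude coefficients of the model fitted on the training split;
# the index is the embedding dimension, which carries no direct semantic label
coef = pd.Series(clr.coef_[0]).sort_values(key=abs, ascending=False)

coef.head(10)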

552   -2.669840
752    2.347832
158    2.302842
683    2.299570
536    2.253939
203   -2.020877
184    2.010700
120   -1.982510
33    -1.934568
343    1.928836
dtype: float64

In [26]:
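from sklearn.model_selection import cross_val_predict

# Out-of-fold predicted probability of flag_2 == 1 for every row (each row is scored by a model that did not train on it)
probs_cv = cross_val_predict(clr, X, y, cv=5, method="predict_proba")[:, 1]

# Score is 1 - P(flagged), so lower values are more likely to be flagged incidents
df["zionism_score_regression"] = 1 - probs_cv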

In [None]:
df.drop(columns=["embedding"]).to_csv("data_incidents_nlp.csv", index=False) # Write the scores back to the CSV (this overwrites the input file); drop embeddings to limit file size