# Environment Setup

### **Run this section before executing the code cells of any other section, so that all required packages are installed.**

In [17]:
import sys
import os

# Check Python version
current_version = sys.version_info

if current_version.major != 3 or current_version.minor != 10:
    print(f"Current Python version is {current_version.major}.{current_version.minor}.")
    print("Switching to Python 3.10. Please wait...")

    # Install Python 3.10
    !sudo apt-get update
    !sudo apt-get install python3.10 python3.10-distutils -y

    # Install virtualenv if not installed
    !pip install virtualenv

    # Create a virtual environment using Python 3.10
    !virtualenv -p python3.10 py310_env
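    # Note: each `!` command runs in its own subshell, so this activation does not
    # carry over to the notebook kernel or to later cells
    !source py310_env/bin/activate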

    print("Python 3.10 environment is ready. Restart the notebook and activate the virtual environment to use it.")
else:
    print(f"Python version is already 3.10 ({sys.version}).")

Python version is already 3.10 (3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]).


In [None]:
!pip3 install spacy==3.2.4
!pip3 install numpy==1.26.4
!pip3 install pandas==2.1.1

# To check versions uncomment:
#!pip3 list
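
To confirm that the pinned versions from the cell above were actually installed, a check along the following lines may help. This is a minimal sketch that uses only the standard library; the `PINNED` dictionary and the `check_versions` helper are illustrative names and are not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the install cell above.
PINNED = {"spacy": "3.2.4", "numpy": "1.26.4", "pandas": "2.1.1"}


def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Report any package whose installed version differs from the pinned one.
for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, the installed versions match the pins.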

In [19]:
!pip install rpy2
%load_ext rpy2.ipython



In [None]:
%%R
install.packages("dplyr")
install.packages("data.table")
install.packages("stringr")
install.packages("tidyr")
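install.packages("R.utils")
# The knowledge_base section also loads these packages
install.packages("haven")
install.packages("quanteda")
install.packages("readxl")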

In [21]:
# Mounts Google Drive so you can import external dependencies (e.g., trained_entity_linker.zip)

#from google.colab import drive
#drive.mount('/content/drive')

### Note: If you have uploaded the **pre-trained entity linker model** into Google Drive for use with this code (```trained_entity_linker.zip```), you can now move directly to the inference section of the notebook!

# knowledge_base

Steps to take before attempting to run knowledge_base:

- **Make sure that ```person_2022.csv```, ```wmpcand_120223_wmpid.csv``` and ```bp2022_house_scraped_face_jasmine.xlsx``` are uploaded onto Google Drive. If they are not located directly in your main drive (e.g., they are in a folder), then the paths used in the code cells must be updated.**

  - Files [```person_2022.csv```](https://github.com/Wesleyan-Media-Project/datasets/blob/main/people/person_2022.csv), [```wmpcand_120223_wmpid.csv```](https://github.com/Wesleyan-Media-Project/datasets/blob/main/candidates/wmpcand_120223_wmpid.csv) and [```bp2022_house_scraped_face_jasmine.xlsx```](https://github.com/Wesleyan-Media-Project/face_url_scraper_2022/blob/main/data/bp2022_house_scraped_face_jasmine.xlsx) can be downloaded/uploaded using the links provided.
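
Before running the R cell, it can be useful to confirm that the input files are where the code expects them. The sketch below assumes the default Colab mount point `/content/drive/MyDrive`; `DRIVE_ROOT`, `REQUIRED_FILES` and `missing_files` are illustrative names. Note that the R cell reads the `.xlsx` file from `/content/` rather than from Drive, so the root may need adjusting per file.

```python
from pathlib import Path

# Assumed Drive mount point; change this if your files live in a subfolder.
DRIVE_ROOT = Path("/content/drive/MyDrive")
REQUIRED_FILES = [
    "person_2022.csv",
    "wmpcand_120223_wmpid.csv",
    "bp2022_house_scraped_face_jasmine.xlsx",
]


def missing_files(root, names):
    """Return the names from `names` that do not exist as files under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).is_file()]


missing = missing_files(DRIVE_ROOT, REQUIRED_FILES)
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All knowledge_base input files were found.")
```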


In [None]:
%%R
library(dplyr)
library(haven)
library(data.table)
library(stringr)
library(quanteda)
library(readxl)
library(tidyr)

setwd("./")

# File paths
# In

# These files are located in our datasets repository (https://github.com/Wesleyan-Media-Project/datasets)
# Make sure that these files are uploaded into the colab environment before attempting to run

path_people_file <- "/content/drive/MyDrive/person_2022.csv"
path_cand_file <- "/content/drive/MyDrive/wmpcand_120223_wmpid.csv"
# Out
path_kb <- "entity_kb.csv"


# People file
people <- fread(path_people_file, encoding = "UTF-8", data.table = F)
# Create some additional person categories
people$pubhealth <- ifelse(people$face_category == "public health related", 1, 0)
people$cabinet <- ifelse(people$face_category == "cabinet", 1, 0)
people$historical <- ifelse(people$face_category == "historical figures", 1, 0)
# In case any of these variables contain NAs (they largely don't any more)
# Make them 0s instead
people$supcourt_2022[is.na(people$supcourt_2022)] <- 0
people$supcourt_former[is.na(people$supcourt_former)] <- 0
people$currsen_2022[is.na(people$currsen_2022)] <- 0
people$prompol[is.na(people$prompol)] <- 0
people$former_uspres[is.na(people$former_uspres)] <- 0
people$intl_leaders[is.na(people$intl_leaders)] <- 0
people$gov2022_gencd[is.na(people$gov2022_gencd)] <- 0

# Make sure there are no duplicate people
if (any(duplicated(people$wmpid))) {
  stop("There are duplicate people.")
}


# Candidate file
# Make sure that genelect is 1 so we ignore duplicate versions of the same candidate who ran for different offices but only made it to the general election in one
# Also retain only relevant variables
cands <- fread(path_cand_file, encoding = "UTF-8", data.table = F)
cands <- cands %>%
  filter(genelect_cd == 1) %>%
  select(wmpid, genelect_cd, cand_id, cand_office, cand_office_st, cand_office_dist, cand_party_affiliation)
# Make sure there are no duplicate candidates
if (any(duplicated(cands$wmpid))) {
  stop("There are duplicate candidates.")
}

# Merge candidate file into people file
people <- left_join(people, cands, by = "wmpid")

# Restrict to only 2022 candidates and other relevant people
# Also retain only relevant variables
people <- people %>%
  filter(genelect_cd == 1 | supcourt_2022 == 1 | supcourt_former == 1 | currsen_2022 == 1 | prompol == 1 | former_uspres == 1 | intl_leaders == 1 | gov2022_gencd == 1 | pubhealth == 1 | cabinet == 1 | historical == 1) %>%
  select(wmpid, full_name, first_name, last_name, fecid_2022a, fecid_2022b, genelect_cd, supcourt_2022, supcourt_former, currsen_2022, prompol, former_uspres, intl_leaders, gov2022_gencd, pubhealth, cabinet, historical, cand_id, cand_office, cand_office_st, cand_office_dist, cand_party_affiliation)


entities_candidate <- people$full_name

tks <- tokens(entities_candidate)
#---- FIRST NAME
# the first word is always the first name
people$first_name_extracted <- unlist(lapply(tks, function(x) {
  x[1]
}))

#---- LAST NAME
# If the name consists of two words, then the second one is the last name
people$last_name_extracted <- unlist(lapply(tks, function(x) {
  if (length(x) == 2) {
    x[2]
  } else {
    NA
  }
}))
# If the name consists of more than two words, then the last one is the last name
last_name_temp <- unlist(lapply(tks, function(x) {
  if (length(x) > 2) {
    x[length(x)]
  } else {
    NA
  }
}))
people$last_name_extracted[is.na(last_name_temp) == F] <- last_name_temp[is.na(last_name_temp) == F]
# if the last word is Jr or Sr, the second-to-last word is the last name
last_word_temp <- unlist(lapply(tks, function(x) {
  x[length(x)]
}))
jr_temp_indices <- which(last_word_temp %in% c(".", "Jr", "Sr"))
jr_temp_names <- entities_candidate[jr_temp_indices]
jr_temp_suffix <- str_extract(jr_temp_names, "[JS]r")
people$suffix_name_extracted <- NA
people$suffix_name_extracted[jr_temp_indices] <- jr_temp_suffix
jr_temp_names_without_suffix <- str_remove(jr_temp_names, " [JS]r\\.?") # remove Jr/Sr plus an optional trailing period
jr_temp_names_without_suffix_tks <- tokens(jr_temp_names_without_suffix)
jr_temp_last_names <- unlist(lapply(jr_temp_names_without_suffix_tks, function(x) {
  x[length(x)]
}))
people$last_name_extracted[jr_temp_indices] <- jr_temp_last_names


# the II, the III
II_temp_indices <- which(last_word_temp %in% c("II", "III"))
II_temp_names <- entities_candidate[II_temp_indices]
II_temp_suffix <- str_extract(II_temp_names, "II+")
people$suffix_name_extracted[II_temp_indices] <- II_temp_suffix
II_temp_names_without_suffix <- str_remove(II_temp_names, " II+") # remove II/III
II_temp_names_without_suffix_tks <- tokens(II_temp_names_without_suffix)
II_temp_last_names <- unlist(lapply(II_temp_names_without_suffix_tks, function(x) {
  x[length(x)]
}))
people$last_name_extracted[II_temp_indices] <- II_temp_last_names

#---- MIDDLE NAMES
name_len <- unlist(lapply(tks, length))
no_middle_name <- which(name_len == 2)
no_middle_name <- sort(unique(c(no_middle_name, jr_temp_indices, II_temp_indices)))
middle_name_indices <- (1:nrow(people))[which((1:nrow(people) %in% no_middle_name) == F)] # people who do have middle names
tks_middle_names <- tks[middle_name_indices]
tks_middle_names <- lapply(tks_middle_names, function(x) {
  x[-1]
}) # remove the first word
tks_middle_names <- lapply(tks_middle_names, function(x) {
  x[-length(x)]
}) # remove the last word
tks_middle_names <- lapply(tks_middle_names, paste0, collapse = " ") # combine them and make a space so that multiple middle names, or "De La" etc. get resolved
tks_middle_names <- str_replace_all(tks_middle_names, " \\.", "\\.") # this does create a problem with periods, clean them up
tks_middle_names <- str_replace(tks_middle_names, "^\\.", "") # remove periods if they are the first char
tks_middle_names <- str_trim(tks_middle_names) # clean up spaces at beginning/end
people$middle_name_extracted <- NA
people$middle_name_extracted[middle_name_indices] <- tks_middle_names
people$middle_name_extracted[which(people$middle_name_extracted == "")] <- NA # some people ended up with an empty middle name, remove


# ----
# JASMINE'S FIXES to candidate names
# This file is located in our face_url_scraper_2022 repository (https://github.com/Wesleyan-Media-Project/face_url_scraper_2022)
# Make sure this file is uploaded to the colab environment at the path below, or update the path
fixes <- read_xlsx("/content/bp2022_house_scraped_face_jasmine.xlsx") # nolint: line_length_linter.
fixes <- fixes %>%
  select(wmpid, cand_name, full_name, starts_with("hc")) %>%
  select(-c(hc_face_note, hc_face_url, hc_office_district, hc_office_district_note))

people <- left_join(people, fixes, by = "wmpid")

# Overwrite with Jasmine's fixes
people$first_name_extracted[is.na(people$hc_first_name) == F] <- people$hc_first_name[is.na(people$hc_first_name) == F]
people$middle_name_extracted[is.na(people$hc_middle_name) == F] <- people$hc_middle_name[is.na(people$hc_middle_name) == F]
people$last_name_extracted[is.na(people$hc_last_name) == F] <- people$hc_last_name[is.na(people$hc_last_name) == F]
people$suffix_name_extracted[is.na(people$hc_suffix) == F] <- people$hc_suffix[is.na(people$hc_suffix) == F]

# Correct names
people$first_name <- people$first_name_extracted
people$middle_name <- people$middle_name_extracted
people$last_name <- people$last_name_extracted
people$suffix_name <- people$suffix_name_extracted
people <- unite(people, "full_name", c(first_name, middle_name, last_name, suffix_name), sep = " ", na.rm = T, remove = F)
people <- unite(people, "full_name_first_last", c(first_name, last_name), sep = " ", na.rm = T, remove = F)
people$full_name <- str_squish(people$full_name)
people$full_name_first_last <- str_squish(people$full_name_first_last)

# ----
# CANDIDATE DESCRIPTIONS
# Party
people$party[!people$cand_party_affiliation %in% c("DEM", "REP")] <- "3rd party"
people$party[people$cand_party_affiliation == "DEM"] <- "Democratic"
people$party[people$cand_party_affiliation == "REP"] <- "Republican"
people$party[is.na(people$cand_party_affiliation)] <- NA

# District number
district_number <- as.character(as.numeric(people$cand_office_dist))
district_number <- str_replace(district_number, "$", "th")
district_number <- str_replace(district_number, "1th", "1st")
district_number <- str_replace(district_number, "2th", "2nd")
district_number <- str_replace(district_number, "3th", "3rd")
district_number <- str_replace(district_number, "11st", "11th")
district_number <- str_replace(district_number, "12nd", "12th")
district_number <- str_replace(district_number, "13rd", "13th")

# State name rather than abbreviation
state_name <- state.name[match(people$cand_office_st, state.abb)]

# Construct the descriptions
people$descr <- NA
for (i in 1:nrow(people)) {
  if (is.na(people$genelect_cd[i]) == F) {
    if (people$cand_office[i] == "H") {
      people$descr[i] <- paste0(people$full_name[i], " is a ", people$party[i], " candidate for the ", district_number[i], " District of ", state_name[i], ".")
    } else if (people$cand_office[i] == "S") {
      people$descr[i] <- paste0(people$full_name[i], " is a ", people$party[i], " Senate candidate in ", state_name[i], ".")
    }
  } else if (people$currsen_2022[i] == 1) {
    people$descr[i] <- paste0(people$full_name[i], " is a Senator.")
  } else if (people$former_uspres[i] == 1) {
    people$descr[i] <- paste0(people$full_name[i], " is a former U.S. president.")
  } else if (people$prompol[i] == 1) {
    people$descr[i] <- paste0(people$full_name[i], " is a prominent politician.")
  } else if (people$intl_leaders[i] == 1) {
    people$descr[i] <- paste0(people$full_name[i], " is an international leader.")
  } else if (people$supcourt_2022[i] == 1) {
    people$descr[i] <- paste0(people$full_name[i], " is a Supreme Court Justice.")
  } else if (people$supcourt_former[i] == 1) {
    people$descr[i] <- paste0(people$full_name[i], " is a former Supreme Court Justice.")
  } else if (people$gov2022_gencd[i] == 1) {
    people$descr[i] <- paste0(people$full_name[i], " is a gubernatorial candidate.")
  } else if (people$pubhealth[i] == 1) {
    people$descr[i] <- paste0(people$full_name[i], " is a public health official.")
  } else if (people$cabinet[i] == 1) {
    people$descr[i] <- paste0(people$full_name[i], " is a cabinet member.")
  } else if (people$historical[i] == 1) {
    people$descr[i] <- paste0(people$full_name[i], " is a historical figure.")
  }
}


# ----
# Candidate aliases
for (i in 1:nrow(people)) {
  cand_names <- c(people$full_name[i], people$last_name[i], people$full_name_first_last[i])
  if (substr(cand_names[1], nchar(cand_names[1]), nchar(cand_names[1])) != "s") {
    cand_aliases <- c(cand_names, paste0(cand_names, "'s"))
  } else {
    cand_aliases <- c(cand_names, paste0(cand_names, "'"))
  }
  cand_aliases <- c(cand_aliases, toupper(cand_aliases))

  people$aliases[[i]] <- c(cand_aliases)
}

# ----
# Create knowledge base
kb <- people %>%
  select(wmpid, full_name, descr, aliases) %>%
  rename(id = wmpid, name = full_name)

# One-off fixes
kb$descr[kb$id == "WMPID1289"] <- "Joe Biden is the U.S. president."
kb$aliases[[1107]] <- str_remove(kb$aliases[[1107]], ",") # Remove commas from MLK because it screws with the csv

# Make sure every alias only exists once (people without middle names or suffixes will have duplicates otherwise)
kb$aliases <- lapply(kb$aliases, unique)

fwrite(kb, path_kb)
# The 4 variables in this file are the only thing
# from this script that enter the entity linker


# train

Steps to take before attempting to run train:

- **Make sure that ```fb_2022_adid_text.csv.gz```, ```fb_2022_adid_var1.csv.gz```, ```entity_kb.csv``` and ```wmp_fb_2022_entities_v082324.csv``` are uploaded onto Google Drive. If they are not located directly in your main drive (e.g., they are in a folder), then the paths used in the code cells must be updated.**

  - Files ```fb_2022_adid_text.csv.gz``` and ```fb_2022_adid_var1.csv.gz``` must be downloaded from our Figshare page. You can get access using this [Data Access form](https://www.creativewmp.com/data-access/).

  - File [```entity_kb.csv```](https://github.com/Wesleyan-Media-Project/entity_linking_2022_usabilitystudy/blob/main/facebook/data/entity_kb.csv) will either already be present as output from the knowledge_base section, or you can download/upload it manually from the link provided. If it is present as output from the knowledge_base section then the path must be changed as specified in the code cells.

  - File [```wmp_fb_2022_entities_v082324.csv```](https://github.com/Wesleyan-Media-Project/datasets/blob/main/wmp_entity_files/Facebook/wmp_fb_2022_entities_v082324.csv) can be downloaded/uploaded manually from the link provided.

In [None]:
!python3 -m spacy download en_core_web_lg

In [None]:
%%R
library(data.table)
library(dplyr)
library(tidyr)

setwd("./")

# Input files
# This is an output from data-post-production/01-merge-results/01_merge_preprocessed_results
# Select fields of 'ad_id', 'page_name', 'disclaimer', 'ad_creative_body',
#        'ad_creative_link_caption', 'ad_creative_link_title',
#        'ad_creative_link_description', 'aws_ocr_text_img',
#        'google_asr_text', 'aws_ocr_text_vid'
#############################################################################################

# Make sure that these files are in the colab environment before attempting to run

path_ads <- "/content/drive/MyDrive/fb_2022_adid_text.csv.gz"
path_adid_to_pageid <- "/content/drive/MyDrive/fb_2022_adid_var1.csv.gz"

# If you ran the knowledge_base section, this path must be changed to "/content/entity_kb.csv"
path_entities_kb <- "/content/drive/MyDrive/entity_kb.csv"

# This file is located in our datasets repository (https://github.com/Wesleyan-Media-Project/datasets)
# Make sure that these files are uploaded into the colab environment before attempting to run

path_wmpent_file <- "/content/drive/MyDrive/wmp_fb_2022_entities_v082324.csv" # nolint: line_length_linter.
# Output files
path_output <- "ads_with_aliases.csv.gz"

# Pdid to wmpid
wmpents <- fread(path_wmpent_file) %>%
  select(pd_id, wmpid)
wmpents <- wmpents[wmpents$wmpid != "", ]

# Ads
df <- fread(path_ads, encoding = "UTF-8")

cols <- c(
  "ad_id", "page_name", "disclaimer", "ad_creative_body", "ad_creative_link_caption", "ad_creative_link_title",
  "ad_creative_link_description", "aws_ocr_text_img", "google_asr_text", "aws_ocr_text_vid"
) # nolint
# Select only the specified columns
df <- df[, ..cols]

# Adid to pdid
adid_to_pageid <-
  fread(path_adid_to_pageid, colClasses = "character") %>%
  select(ad_id, pd_id)

# Combine
df <- inner_join(df, adid_to_pageid, by = "ad_id")
df <- left_join(df, wmpents, by = "pd_id")

# Aliases, then merge in pd_id
aliases <- fread(path_entities_kb, encoding = "UTF-8", data.table = F)
aliases <- select(aliases, c(id, aliases))

# Keep only ads that have a wmpid
# Shape to long format
# Remove empty rows
# Keep only distinct rows based on pd_id and value
df <- df %>%
  filter(wmpid != "") %>%
  pivot_longer(-c(ad_id, pd_id, wmpid)) %>%
  filter(value != "") %>%
  distinct_at(vars(pd_id, value), .keep_all = T)

# Merge in aliases
df <- left_join(df, aliases, by = c("wmpid" = "id"))

# Get rid of ads that have no aliases
df <- df[is.na(df$aliases) == F, ]

fwrite(df, path_output)
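
For readers more comfortable with pandas, the reshaping done by the R cell above can be sketched roughly as below. The toy data and the `ads_to_long` helper are hypothetical. Row order differs from `pivot_longer` (pandas melts column by column), so which duplicate is kept may not match the R output exactly.

```python
import pandas as pd

# Toy stand-ins for the ad text table and the knowledge base (hypothetical values).
ads = pd.DataFrame({
    "ad_id": ["a1", "a2", "a3"],
    "pd_id": ["p1", "p1", "p2"],
    "wmpid": ["WMPID1", "WMPID1", ""],
    "page_name": ["Jane Doe for Congress", "Jane Doe for Congress", "Some PAC"],
    "ad_creative_body": ["Vote for Jane Doe", "", "Hello"],
})
kb = pd.DataFrame({"id": ["WMPID1"], "aliases": ["Jane Doe|Doe"]})


def ads_to_long(ads, kb):
    """Reshape ad text fields to long format and attach sponsor aliases."""
    long = (
        ads[ads["wmpid"] != ""]                      # keep ads with a known sponsor
        .melt(id_vars=["ad_id", "pd_id", "wmpid"])   # one row per ad and text field
        .dropna(subset=["value"])                    # drop missing text
        .loc[lambda d: d["value"] != ""]             # drop empty text
        .drop_duplicates(subset=["pd_id", "value"])  # one row per page and text
    )
    merged = long.merge(kb, how="left", left_on="wmpid", right_on="id")
    # Ads whose sponsor has no aliases are of no use for training.
    return merged.dropna(subset=["aliases"]).drop(columns="id")


print(ads_to_long(ads, kb))
```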


In [None]:
import csv
from pathlib import Path
import os
import random
import json
import pandas as pd
import spacy # Use version 3.2.4
nlp = spacy.load("en_core_web_lg")
from spacy.kb import KnowledgeBase #vscode pylinter complains, actually loads fine
# for spacy version above v3.5
# from spacy.kb import InMemoryLookupKB
from spacy.util import minibatch, compounding
from tqdm import tqdm
import numpy as np
from spacy.training import Example
from spacy.ml.models import load_kb


# Input files
# Make sure that these files are in the colab environment before attempting to run

# If you ran the knowledge_base section, this path must be changed to "/content/entity_kb.csv"
path_candidates = "/content/drive/MyDrive/entity_kb.csv"

path_training_samples = "/content/ads_with_aliases.csv.gz"

# Output files
path_intermediate_kb = "intermediate_kb"
path_output_nlp = "trained_entity_linker"
path_output_kb = "trained_entity_linker"
path_output_kb_vocab = "trained_entity_linker"


#----
# Load the dataset on the candidates
# This contains their id, their name, a description, and aliases for their name

def load_entities():
    entities_loc = Path(path_candidates)

    names = dict()
    descriptions = dict()
    aliases = dict()
    with entities_loc.open("r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=",")
        next(csvreader) # Skip header row
        for row in csvreader:
            qid = row[0]
            name = row[1]
            desc = row[2]
            alias = row[3]
            names[qid] = name
            descriptions[qid] = desc
            aliases[qid] = alias
    return names, descriptions, aliases

# Create 3 dictionaries:
# name_dict - ID -> name
# desc_dict - ID -> description
# aliases_dict - ID -> aliases
name_dict, desc_dict, aliases_dict = load_entities()

# Example content for Biden:
print(f"{'WMPID1289'}, name={name_dict['WMPID1289']}, \
    desc={desc_dict['WMPID1289']}, alias={aliases_dict['WMPID1289']}")

#----
# Create a knowledge base
# So far, information on these people sits in a set of dictionaries
# Now we create a spacy knowledge base and populate it with the data above

# Instantiate a knowledge base with 300-dimensional entity embedding
# for spacy version above v3.5, instantiate the InMemoryLookupKB class instead of KnowledgeBase, which became an abstract class after v3.5
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

# Populate the knowledge base from the csv file
# Starting with the id and description
for qid, desc in desc_dict.items():
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342) # 342 is an arbitrary value

# Create a dictionary, with each unique alias as a key
# and the value being the fecids of all the people with that alias as a list
alias_to_fecids = dict()
for qid, alias in aliases_dict.items():
    for alias_specific in alias.split("|"):
        alias_to_fecids[alias_specific] = alias_to_fecids.get(alias_specific, []) + [qid]

# Now, start adding aliases to the kb
# The probability is 1/number of people with that alias
for alias, fecids in alias_to_fecids.items():
    kb.add_alias(alias=alias, entities=fecids, probabilities=[1/len(fecids) for fecid in fecids])

# Create a list of entity ids (i.e. wmpids in our case) that is looped over later
qids = name_dict.keys()
kb.to_disk(path_intermediate_kb)


#----
df = pd.read_csv(path_training_samples, encoding = 'UTF-8')
aliases = list(df['aliases'].str.split("|"))

# Apply NER to all training samples
TRAIN_DOCS = []
for text in tqdm(df['value']):
    doc = nlp(text)
    TRAIN_DOCS.append(doc)

# Put the character indices in the data frame
df['entity_start'] = np.nan
df['entity_end'] = np.nan
# Loop over the documents
# and record the indices from the NER results in the df
for d in range(len(TRAIN_DOCS)):
    for entity in TRAIN_DOCS[d].ents:
        if str(entity) in aliases[d]:
            print([entity, entity.start_char, entity.end_char])
            df.at[d, 'entity_start'] = entity.start_char
            df.at[d, 'entity_end'] = entity.end_char
            break


# Get the indices of the rows of the data frame where an entity match was detected
detected_entities_indices = np.where(df['entity_start'].isnull().to_numpy()==False)[0]
detected_entities_indices = list(detected_entities_indices)

# Make a new TRAIN_DOCS list with only those
TRAIN_DOCS2 = [TRAIN_DOCS[i] for i in detected_entities_indices]
print("Training on", len(TRAIN_DOCS2), "samples.") #currently about 14k (out of 60k)

# Create the annotations like this
# {'links': {(39, 48): {'H8MO01143': 1.0}}}
starts = [int(df['entity_start'][i]) for i in detected_entities_indices]
ends = [int(df['entity_end'][i]) for i in detected_entities_indices]
fecs = [df['wmpid'][i] for i in detected_entities_indices]
# print(starts, ends, fecs)

annotations = []
for i in range(len(starts)):
    annotations.append({'links': {(starts[i], ends[i]): {fecs[i]: 1.0}}, 'entities': [(starts[i], ends[i], 'PERSON')]})

# Make another version of TRAIN_DOCS, this time making it the correct tuple again,
#  with annotations as the second element
TRAIN_DOCS3 = []
for i in range(len(starts)):
    TRAIN_DOCS3.append((TRAIN_DOCS2[i], annotations[i]))

# Create gold-standard sentences
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")
sentencizer = nlp.get_pipe("sentencizer")
TRAIN_EXAMPLES = []
for i in range(len(starts)):
    example = Example.from_dict(nlp.make_doc(str(TRAIN_DOCS3[i][0])), annotations[i])
    example.reference = sentencizer(example.reference)
    TRAIN_EXAMPLES.append(example)

# Initialize the entity linker component
entity_linker = nlp.add_pipe("entity_linker", config={"incl_prior": False}, last=True)
entity_linker.initialize(get_examples=lambda: TRAIN_EXAMPLES, kb_loader=load_kb(path_intermediate_kb))

# At this point, the untrained component already works
test_doc = nlp("Donald Trump is a former president.")
for ent in test_doc.ents:
    if ent.kb_id_ != 'NIL':
        print(ent.kb_id_)


# Training loop
loss_list = []
with nlp.select_pipes(enable=["entity_linker"]):   # train only the entity_linker
    optimizer = nlp.resume_training() # This used to be begin_training, in spacy3 it seems it's resume because the component has already been initialized
    optimizer.learn_rate = 0.001
    for itn in tqdm(range(500)):   # one itn is one full pass over TRAIN_EXAMPLES
        random.shuffle(TRAIN_EXAMPLES)
        batches = minibatch(TRAIN_EXAMPLES, size=128)#size=compounding(4.0, 32.0, 1.001))  # increasing batch sizes -- seems to train WAY faster with a larger batch size (i.e. 128 rather than 32)
        losses = {} #at the end of the epoch, this will contain the cumulative loss of all of its batches (and as far as I can tell, the loss for one batch is the mean loss of all the samples in it)
        for batch in batches:
            nlp.update(
                batch,
                drop=0.2,      # prevent overfitting
                losses=losses,
                sgd=optimizer,
            )
        #if itn % 50 == 0:
        print(itn, "Losses", losses)   # print the training loss
        loss_list.append(losses['entity_linker'])

print(itn, "Losses", losses)

# Save the nlp object to file
nlp.to_disk(path_output_nlp)
kb.to_disk(path_output_kb)
kb.vocab.to_disk(path_output_kb_vocab)
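
The alias step of the training cell gives every person who shares an alias the same prior probability, 1/n. The plain-Python sketch below shows that bookkeeping in isolation, without spaCy; `alias_priors` and the toy IDs are hypothetical and only illustrate what is passed to `kb.add_alias`.

```python
from collections import defaultdict


def alias_priors(aliases_by_id):
    """Map each alias to (entity_ids, uniform prior probabilities)."""
    alias_to_ids = defaultdict(list)
    for entity_id, alias_string in aliases_by_id.items():
        # Aliases are stored as one pipe-separated string per entity.
        for alias in alias_string.split("|"):
            alias_to_ids[alias].append(entity_id)
    return {
        alias: (ids, [1 / len(ids)] * len(ids))
        for alias, ids in alias_to_ids.items()
    }


# Hypothetical IDs and aliases, for illustration only.
toy = {"ID_A": "Jane Harris|Harris", "ID_B": "John Harris|Harris"}
for alias, (ids, probs) in alias_priors(toy).items():
    print(alias, ids, probs)
```

An ambiguous alias such as a shared last name therefore starts with no preference between its candidates, and the trained model has to decide from context.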


# inference

Steps to take before attempting to run inference:

- **Make sure that ```fb_2022_adid_text.csv.gz``` and ```trained_entity_linker``` are uploaded onto Google Drive. If they are not located directly in your main drive (e.g., they are in a folder), then the paths used in the code cells must be updated.**

  - File ```fb_2022_adid_text.csv.gz``` will either already be present from running the train section, or else it must be downloaded from our Figshare page and then uploaded manually. You can get access using this [Data Access form](https://www.creativewmp.com/data-access/). If it is present as output from the train section, then you must update the path in the code cell as specified.

  - Folder ```trained_entity_linker``` will either already be present as output from the train section, or else it must be downloaded from our Figshare page and then uploaded manually **as a zip file**. You can get access using this [Data Access form](https://www.creativewmp.com/data-access/). If it is present as output from the train section, then you must change the path in the code cell as specified.

  - ***IF YOU ARE PART OF THE USABILITY STUDY YOU CAN SKIP THESE INSTRUCTIONS, AS YOU'LL DOWNLOAD THE FILES IN THE CODE.***


In [22]:
# Direct download of fb_2022_adid_text.csv.gz
!wget -O fb_2022_adid_text.csv.gz https://figshare.com/ndownloader/files/49885527?private_link=c46dc366bbf4294a9be3


In [23]:
%%R

library(data.table)
library(dplyr)
library(tidyr)

setwd("./")

# Input files
# This is an output from data-post-production/01-merge-results/01_merge_preprocessed_results.
# Make sure that this file is in the colab environment before attempting to run

# This path matches the direct download cell above. If the file is on Google Drive instead, change it to "/content/drive/MyDrive/fb_2022_adid_text.csv.gz"
path_ads <- "/content/fb_2022_adid_text.csv.gz"
# Output files
path_prepared_ads <- "inference_all_fb22_ads.csv.gz"

# Ads
df <- fread(path_ads, encoding = "UTF-8")

# Subset to clean text dataframe
df2 <- df %>%
  select(
    ad_id, google_asr_text, page_name, disclaimer, ad_creative_body,
    ad_creative_link_title, ad_creative_link_description,
    aws_ocr_text_img, aws_ocr_text_vid, ad_creative_link_caption
  )

# Aggregate
df3 <- df2 %>%
  pivot_longer(-ad_id) %>%
  filter(value != "") %>%
  mutate(id = paste(ad_id, name, sep = "__")) %>%
  select(-c(ad_id, name))

df3 <- aggregate(df3$id, by = list(df3$value), c)
names(df3) <- c("text", "id")

# Save
fwrite(df3, path_prepared_ads)
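
The aggregation in the R cell above collapses identical texts so that each unique text is processed only once, while remembering which ad and field it came from. A rough pandas equivalent, with hypothetical toy data and a hypothetical `unique_texts` helper, might look like this; it assumes the IDs are pipe-separated in the output file.

```python
import pandas as pd

# Toy ad table with two text fields (hypothetical values).
ads = pd.DataFrame({
    "ad_id": ["a1", "a2"],
    "page_name": ["Jane Doe for Congress", "Jane Doe for Congress"],
    "ad_creative_body": ["Vote on Tuesday", ""],
})


def unique_texts(ads):
    """Collapse ad text fields to unique texts, each with the IDs of its sources."""
    long = ads.melt(id_vars="ad_id", var_name="name", value_name="text")
    long = long[long["text"].notna() & (long["text"] != "")]
    # Build an "<ad_id>__<field>" identifier for every non-empty text.
    long = long.assign(id=long["ad_id"] + "__" + long["name"])
    # One row per unique text, with all of its source identifiers joined.
    grouped = long.groupby("text", sort=True)["id"].agg("|".join)
    return grouped.reset_index()


print(unique_texts(ads))
```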



In [32]:
# Direct download of trained_entity_linker.zip

# Download the file
!wget -O trained_entity_linker.zip https://figshare.wesleyan.edu/ndownloader/files/46512400

# Unzip the file
import os
import zipfile

zip_path = "trained_entity_linker.zip"

# Extract the contents
!unzip trained_entity_linker.zip

# Remove the __MACOSX folder if it exists
if os.path.exists("__MACOSX"):
    !rm -rf __MACOSX/
    print("__MACOSX folder removed.")

__MACOSX folder removed.


In [33]:
import csv
from pathlib import Path
import os
import random
import json
import pandas as pd
import spacy # Use version '3.2.4'
# Make sure that this folder is in the colab environment before attempting to run
# spacy.load needs the unzipped folder, not the .zip file; the train section also saves its model to "trained_entity_linker"
nlp = spacy.load("/content/trained_entity_linker") # trained_entity_linker is output from 02_train_entity_linking.py
from spacy.kb import KnowledgeBase #vscode pylinter complains about this, but it actually loads fine
from spacy.util import minibatch, compounding
import re
import numpy as np
from tqdm import tqdm


# Input files
path_prepared_ads = "/content/inference_all_fb22_ads.csv.gz"
# Output files
path_el_results = "entity_linking_results_fb22.csv.gz"
path_el_results_notext = "entity_linking_results_fb22_notext.csv.gz"

# Read in prepared ads
df = pd.read_csv(path_prepared_ads)

# The code below restricts inference to a random sample of n rows from the input dataframe
# Remove or adjust these lines to run on more rows

df = df.sample(n=500)
df = df.reset_index(drop=True)


df = df.replace(np.nan, '', regex=True)
fields = ['text']


def get_sims(sent_emb, ent_id):

    sentence_encoding = sent_emb
    entity_encodings = np.asarray(nlp.get_pipe('entity_linker').kb.get_vector(ent_id))

    sentence_norm = np.linalg.norm(sentence_encoding, axis=0)
    entity_norm = np.linalg.norm(entity_encodings, axis=0)

    sims = np.dot(entity_encodings, sentence_encoding) / (sentence_norm * entity_norm)

    return(sims)

# Give non-candidates like Kamala Harris a boost in comparison to actual cands
# This is necessary because non-cands don't have much training data, so the model
# almost never picks them
def is_it_kamala(nlpd_doc, possible_cands, likely_cand, boost_size = 0.1):

    sent_emb = nlpd_doc.vector

    sims = []
    for h in possible_cands:

        sim = get_sims(sent_emb, h)
        if h == likely_cand:
            sim += boost_size

        sims.append(sim)

    picked_cand = np.array(sims).argmax()
    picked_cand_id = possible_cands[picked_cand]

    return(picked_cand_id)

harrises = ['WMPID1144',
            'WMPID3207',
            'WMPID2']

barretts = ['WMPID3995',
            'WMPID17']


# This loop can take 6 to 8 hours when run on the full dataset.

for f in fields:

    entities_in_field = []
    entities_in_field_start = []
    entities_in_field_end = []

    for i in tqdm(range(len(df))):

        entities_in_ad = []
        entities_in_ad_start = []
        entities_in_ad_end = []

        if pd.isnull(df[f][i])==False:
            test_text = df[f][i]
            test_doc = nlp(test_text)
            for ent in test_doc.ents:
                if ent.kb_id_ != 'NIL':

                    # Make sure we don't misclassify House as Steve House
                    # Steve House didn't run in 2022 \o/ yay!
                    # if (ent.kb_id_ == 'H0CO06119') & (ent.label_ == 'ORG'):
                    #     pass

                    # Make sure we don't misclassify Kamala as one of the other Harrises
                    if ent.kb_id_ in harrises:
                        # Check if it is actually Kamala
                        harrises_cand = is_it_kamala(test_doc, harrises, 'WMPID2', boost_size = 0.16)
                        entities_in_ad.append(harrises_cand)
                        entities_in_ad_start.append(ent.start_char)
                        entities_in_ad_end.append(ent.end_char)

                    # Make sure we don't misclassify Amy Coney Barrett as Thomas More Barrett
                    # If the EL detects Thomas More Barrett
                    elif ent.kb_id_ == 'WMPID3995':
                        # Check if it is actually Amy Coney
                        barretts_cand = is_it_kamala(test_doc, barretts, 'WMPID17', boost_size = 0.17)
                        entities_in_ad.append(barretts_cand)
                        entities_in_ad_start.append(ent.start_char)
                        entities_in_ad_end.append(ent.end_char)

                    # If it is none of these, proceed as normal
                    else:
                        entities_in_ad.append(ent.kb_id_)
                        entities_in_ad_start.append(ent.start_char)
                        entities_in_ad_end.append(ent.end_char)

        entities_in_field.append(entities_in_ad)
        entities_in_field_start.append(entities_in_ad_start)
        entities_in_field_end.append(entities_in_ad_end)


    df[f + '_detected_entities'] = entities_in_field
    df[f + '_start'] = entities_in_field_start
    df[f + '_end'] = entities_in_field_end

    print(f, "done!")

## Prepare data for additional dictionary search for Trump and Biden only on disclaimer and page name fields
# Split ids
df['id'] = df['id'].str.split('|')
# "Un-deduplicate", or "Re-hydrate", in WMP lingo
df = df.explode('id')
# Split into ad id and field
df_ids = df['id'].str.split('__', expand = True)
df_ids.columns = ['ad_id', 'field']
df = pd.concat([df, df_ids], axis = 1)
df = df.drop(labels = ['id'], axis = 1)
# Split the data frame into disclaimer/page_name, and other
df_1 = df[df['field'].isin(['disclaimer', 'page_name'])]
df_2 = df[df['field'].isin(['disclaimer', 'page_name']) == False]
# Make a copy of df_1
df1 = df_1.copy()
df1.reset_index(drop=True, inplace=True)

# This function does a simple dictionary search on disclaimer and page name fields.
# It only does this search for Biden and Trump.
# If this dictionary search finds any entity that was not detected by the model, it adds the corresponding WMPID to the detected entities list.

def update_detected_entities(df):
    # Mapping of names to their corresponding ids
    name_to_id = {'biden': 'WMPID1289', 'trump': 'WMPID1290'}
    # Iterate over each row in the DataFrame with tqdm
    for index, row in tqdm(df.iterrows(), total=len(df), desc="Processing rows"):
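        # Work on copies of the lists: rows produced by explode() can share
        # the same list objects, so appending in place could also change
        # other rows (including rows outside disclaimer/page_name)
        detected_entities = list(row['text_detected_entities'])

        # Start and end character offsets of the entities already detected
        start_indices = list(row['text_start'])
        end_indices = list(row['text_end'])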

        # Convert the text to lowercase
        text = row['text'].lower()

        # Iterate over each name to be detected
        for name in name_to_id.keys():
            # Find all occurrences of the name in the text
            name_occurrences = [i for i in range(len(text)) if text.startswith(name, i)]

            # Check each occurrence of the name
            for start_index in name_occurrences:
                # Check if the name is already detected by the entity linking model
                already_detected = False
                for start, end in zip(start_indices, end_indices):
                    if start <= start_index < end:
                        already_detected = True
                        break

                # If the name is not already detected, add its ID
                if not already_detected:
                    end_index = start_index + len(name)
                    detected_entities.append(name_to_id[name])
                    start_indices.append(start_index)
                    end_indices.append(end_index)

        # Update the DataFrame with the modified lists
        df.at[index, 'text_detected_entities'] = detected_entities
        df.at[index, 'text_start'] = start_indices
        df.at[index, 'text_end'] = end_indices

    return df


df2 = update_detected_entities(df1)


# Recombine the dataframes
df = pd.concat([df2, df_2], axis = 0)

# Save results
df.to_csv(path_el_results, index=False)
df = df.drop(['text'], axis = 1)
df.to_csv(path_el_results_notext, index=False)

100%|██████████| 500/500 [00:32<00:00, 15.29it/s]


text done!


Processing rows: 100%|██████████| 320/320 [00:00<00:00, 3854.73it/s]
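
The boost used for the Harris and Barrett cases above can be illustrated in isolation. The following is a minimal sketch, not the notebook's own code: the helper names `cosine_similarity` and `pick_with_boost`, the candidate IDs, and the two-dimensional toy vectors are all hypothetical, whereas in the cell above the vectors come from the spaCy knowledge base. It only shows how adding a fixed bonus to one candidate's cosine similarity can change which candidate wins.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def pick_with_boost(sent_vec, cand_vecs, favored_id, boost_size=0.1):
    """Return the candidate ID with the highest similarity after boosting one ID."""
    ids = list(cand_vecs)
    scores = []
    for cand_id in ids:
        score = cosine_similarity(sent_vec, cand_vecs[cand_id])
        if cand_id == favored_id:
            score += boost_size  # fixed bonus for the under-trained candidate
        scores.append(score)
    return ids[int(np.argmax(scores))]


# Toy vectors (hypothetical): CAND_A is closer to the sentence than CAND_B
sentence = np.array([1.0, 0.0])
candidates = {
    "CAND_A": np.array([1.0, 0.1]),
    "CAND_B": np.array([1.0, 0.5]),
}

no_boost = pick_with_boost(sentence, candidates, "CAND_B", boost_size=0.0)
with_boost = pick_with_boost(sentence, candidates, "CAND_B", boost_size=0.16)
print(no_boost, with_boost)
```

Without the boost the closer candidate wins; with a boost of 0.16 the favored candidate overtakes it. The boost size is a tuning choice, so a value that works for one pair of candidates may not transfer to another.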

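The "re-hydration" step in the cell above (splitting the packed `id` column and exploding it into one row per ad and field) may be easier to follow on a tiny example. This is a sketch under assumptions: the ad IDs, field names, and texts are invented, and the `<ad_id>__<field>` key layout is inferred from the split calls in the cell rather than from documentation.

```python
import pandas as pd

# Hypothetical deduplicated rows: each "id" packs one or more
# "<ad_id>__<field>" keys, separated by "|"
dedup = pd.DataFrame({
    "id": ["x_1__disclaimer|x_2__page_name", "x_3__ad_creative_body"],
    "text": ["Paid for by Example PAC", "Vote on Tuesday"],
})

# One row per packed key; reset the index so it is unique again
rehydrated = (
    dedup.assign(id=dedup["id"].str.split("|"))
    .explode("id")
    .reset_index(drop=True)
)

# Split each key into the ad ID and the field name
parts = rehydrated["id"].str.split("__", expand=True)
parts.columns = ["ad_id", "field"]
rehydrated = pd.concat([rehydrated.drop(columns="id"), parts], axis=1)

print(rehydrated)
```

The first input row becomes two output rows that carry the same text, one for each ad and field that shared it.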

In [34]:
%%R
# Post-processing for the entity linking results
# Gather up all detected entities from different fields and put them all together

library(data.table)
library(dplyr)
library(tidyr)
library(stringr)

setwd("./")

# Paths
# In
path_detected_entities <- "/content/entity_linking_results_fb22_notext.csv.gz"
# Out
path_finished_enties <- "detected_entities_fb22.csv.gz"
path_finished_enties_for_ad_tone <- "detected_entities_fb22_for_ad_tone.csv.gz"

# Read in spaCy's detected entities
el <- fread(path_detected_entities)

# Strip the Python list syntax (brackets, quotes, spaces) from the detected
# entities field, leaving a comma-separated string
transform_pylist <- function(x) {
  x <- str_remove_all(x, "\\[|\\]|\\'")
  x <- str_remove_all(x, " ")
  return(x)
}
el$text_detected_entities <- transform_pylist(el$text_detected_entities)
# Remove all ads with no detected entities
el <- el %>% filter(text_detected_entities != "")
# For ad tone, remove disclaimer and page_name
el_at <- el %>% filter(!field %in% c("page_name", "disclaimer"))
# Aggregate over fields, then clean up and put things back into a list
el <- aggregate(el$text_detected_entities, by = list(el$ad_id), c)
el$x <- lapply(el$x, paste, collapse = ",")
el$x <- str_split(el$x, ",")
names(el) <- c("ad_id", "detected_entities")
# Same for ad tone
el_at <- aggregate(el_at$text_detected_entities, by = list(el_at$ad_id), c)
el_at$x <- lapply(el_at$x, paste, collapse = ",")
el_at$x <- str_split(el_at$x, ",")
names(el_at) <- c("ad_id", "detected_entities")

# Save version with combined fields
fwrite(el, path_finished_enties)
fwrite(el_at, path_finished_enties_for_ad_tone)
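
For readers more comfortable with pandas than R, the aggregation in the cell above can be approximated as follows. This is a rough sketch and not a drop-in replacement: the sample rows are invented, `combine_by_ad` is a hypothetical helper, and the output keeps Python lists instead of the format that `fwrite` produces.

```python
import ast

import pandas as pd

# Hypothetical rows shaped like the "notext" results file, where the
# detected entities are stored as the string form of a Python list
el = pd.DataFrame({
    "ad_id": ["x_1", "x_1", "x_2", "x_3"],
    "field": ["disclaimer", "ad_creative_body", "page_name", "ad_creative_body"],
    "text_detected_entities": [
        "['WMPID1289']",
        "['WMPID1290', 'WMPID1289']",
        "[]",
        "['WMPID2']",
    ],
})

# Parse the list strings and drop rows with no detected entities
el["entities"] = el["text_detected_entities"].apply(ast.literal_eval)
el = el[el["entities"].map(len) > 0]


def combine_by_ad(frame):
    """Collect every entity detected in any field of an ad into one list."""
    exploded = frame.explode("entities")
    return (
        exploded.groupby("ad_id")["entities"]
        .agg(list)
        .reset_index(name="detected_entities")
    )


all_fields = combine_by_ad(el)
# For ad tone, leave out the disclaimer and page_name fields
for_ad_tone = combine_by_ad(el[~el["field"].isin(["page_name", "disclaimer"])])

print(all_fields)
print(for_ad_tone)
```

Using `ast.literal_eval` parses the list strings directly, which avoids the regex clean-up needed on the R side.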


In [35]:
# Examine results

data = pd.read_csv('/content/detected_entities_fb22.csv.gz')
data.head(50)

Unnamed: 0,ad_id,detected_entities
0,x_1014277789264163,WMPID4650|WMPID4650|WMPID4650|WMPID4650|WMPID4...
1,x_1023768478192159,WMPID5292
2,x_1025448931468871,WMPID5292
3,x_1027022148014880,WMPID5292
4,x_1030271714332789,WMPID5292
5,x_1033176647384254,WMPID5311
6,x_1036431663721346,WMPID5292
7,x_1040820059962925,WMPID67|WMPID67|WMPID67
8,x_1041128443250119,WMPID5311
9,x_1051179346280418,WMPID5344
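
As the output above shows, the `detected_entities` column is stored as a single string with entity IDs separated by `|`. A short sketch of turning that column back into lists and counting entities is given below; the three rows are made up for illustration and only mimic the shape of the real file.

```python
from collections import Counter

import pandas as pd

# Hypothetical rows in the same shape as the output above
data = pd.DataFrame({
    "ad_id": ["x_1", "x_2", "x_3"],
    "detected_entities": [
        "WMPID67|WMPID67|WMPID67",
        "WMPID5292",
        "WMPID5292|WMPID67",
    ],
})

# Split the pipe-delimited strings into lists of entity IDs
entity_lists = data["detected_entities"].str.split("|")

# Total number of mentions per entity across all ads
mention_counts = Counter(e for ids in entity_lists for e in ids)

# Number of distinct ads in which each entity appears at least once
ads_per_entity = Counter(e for ids in entity_lists for e in set(ids))

print(mention_counts.most_common())
print(ads_per_entity.most_common())
```

Counting distinct ads separately from total mentions matters here because a single ad can mention the same entity many times.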


# Results Analysis Example

Steps to take before attempting to run:

- **Make sure that ```readcsv.py``` and any data that you want to analyze are uploaded to Google Drive. If they are not located directly in your main drive (e.g., they are in a folder), then the paths used in the code cells must be updated.**

  - [```readcsv.py```](https://github.com/Wesleyan-Media-Project/entity_linking_2022_usabilitystudy/blob/main/analysis/readcsv.py) can be downloaded from the link provided and then uploaded manually.
  - [Data](https://github.com/Wesleyan-Media-Project/entity_linking_2022_usabilitystudy/tree/main/facebook/data) will either already be present as output from running other sections of this notebook, or can be downloaded from the GitHub repository at the link provided and uploaded manually. If you upload data manually, the paths must be changed accordingly.

In [None]:
!python /content/drive/MyDrive/readcsv.py --h

usage: readcsv.py [-h] --file FILE [--skiprows SKIPROWS] [--nrows NROWS]
                  [--filter_text FILTER_TEXT]

Filter large CSV files and save results to multiple Excel files if necessary.

options:
  -h, --help            show this help message and exit
  --file FILE           Path to the CSV file.
  --skiprows SKIPROWS   Number of rows to skip at the start of the file.
  --nrows NROWS         Number of rows to read from the file. Read 10000 rows if not specified.
  --filter_text FILTER_TEXT
                        Text to filter the rows.


In [None]:
!python /content/drive/MyDrive/readcsv.py --file /content/entity_linking_results_fb22.csv.gz

Input file: /content/entity_linking_results_fb22.csv.gz
Start reading data from row: 0
Number of rows to read: 10000
Filter text: No filter is applied
Filtered data saved to: Readcsv_Output_20241130_164043.xlsx
