# 4.keyword_subsetting

In this notebook we use the output from the topic model and our immersion journal/manual to generate new keywords through computationally assisted keyword retrieval. Using a rich set of qualitatively evaluated keywords, we then define a subset of bushfire tweets.

In [1]:
#Importing relevant packages
import pandas as pd
#Word embeddings
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
from nltk.stem import PorterStemmer
#King et al. keyword algorithm (replication code)
from keyword_algorithm import *
#Remove unwarranted warnings
pd.options.mode.chained_assignment = None 

## Implementation of King et al.'s (2017) computationally assisted keyword retrieval

Here we combine the algorithm introduced by King et al. (2017) with a pre-trained word embeddings model. We set it up so that the algorithm runs iteratively for a chosen number of iterations. For the keyword algorithm we use the replication code, ```keyword_algorithm.py```, provided here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FMJDCD. We also had to update some functions to be compatible with the latest version of Pandas. To display our workflow we go through one iteration below.
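As a rough illustration of the first step of each iteration, the sketch below splits a toy corpus into a reference set and a search set using plain Python. The function name `build_sets`, the toy documents and the simple regex matching are assumptions made for illustration. The replication code handles this step through its own `ReferenceSet` and `SearchSet` methods, whose internals are not reproduced here. The sketch also assumes that the search set excludes documents already in the reference set.

```python
import re

def build_sets(docs, refkeys, tarkeys):
    """Hypothetical helper: split documents into a reference and a search set."""
    ref_pattern = re.compile("|".join(refkeys))
    tar_pattern = re.compile("|".join(tarkeys))
    reference, search = [], []
    for doc in docs:
        if ref_pattern.search(doc):
            # Documents matching a reference keyword define the topic
            reference.append(doc)
        elif tar_pattern.search(doc):
            # Remaining documents matching a target keyword are mined for new keywords
            search.append(doc)
    return reference, search

# Made-up, already stemmed toy documents
docs = [
    "bushfir crisi in nsw",
    "fire near the coast",
    "recov fund announc",
    "footbal final tonight",
]
reference, search = build_sets(docs, ["bushfir", "firefight"], ["fire", "recov", "nsw"])
print(len(reference), len(search))
```

Documents that match neither list are ignored in the sketch, just as tweets outside both sets play no role in a given iteration of the workflow.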

In [12]:
#Load and prepare data for the model
data = pd.read_csv("data/final_df.csv", index_col=0)
data = data.dropna(subset = ["final_text"]).reset_index(drop = True)
data["index_col"] = data.index
data.to_csv("data/query_df.csv")

In [3]:
#Download model from https://fasttext.cc/docs/en/english-vectors.html
model_dir = "data/wiki-news-300d-1M.vec"
fasttext = KeyedVectors.load_word2vec_format(model_dir)

In [7]:
class QueryBuilder:
    
    def __init__(self, emb_model):
        
        self.query = Keywords()
        self.stemmer = PorterStemmer()
        #Load the data. Change path if necessary
        path = 'data/query_df.csv'
        self.query.LoadDataset(path, text_colname='final_text', 
                    date_colname="created_at", id_colname="index_col")
        #Pre-trained word embeddings model 
        self.we = emb_model
            
    def get_keywords(self, its = 2, top_n = 10, refkeys = [], tarkeys = [], algorithms = ['nbayes', 'logit'], 
                     date_start = "2019-06-01", date_end = "2020-05-30"):
        """
        Loops over the King et al. algorithm to extract relevant keywords used for building a 
        boolean query to subset relevant tweets in the dataset.
        ---------
        arguments:
            - its: Iterations to run the algorithm
            - top_n: integer of how many of the most predictive keywords to extract in each iteration
            - refkeys: list of initial keywords used to create reference set of tweets
            - tarkeys: list of initial keywords used to limit the search set
            - algorithms: list of classifiers to run for extracting keywords 
            - date_start: y/m/d of start date for relevant tweets
            - date_end: y/m/d of end date for relevant tweets
        -----------    
        returns:
            - dictionary of accepted, rejected and nontarget keywords
            - the trained query model object
        """
        
        accepted_keywords = []
        rejected_keywords = []
        nontarget_keywords = []
        
        #Begin loop for mining search set
        for it in range(its):
            print("-"*66)
            print(f"STARTING ITERATION: {it}!")
            if it == 0:
                print(f"INITIAL REFERENCE KEYS: {refkeys} \n INITIAL TARGET KEYS: {tarkeys}")
            print("-"*66)
            
            #Build reference set of tweets
            self.query.ReferenceSet(any_words=refkeys, date_start=date_start, date_end=date_end)
            
            #Use accepted keys as search keys if not the first iteration
            if it > 0:
                self.query.SearchSet(any_words = accepted_keywords, 
                                     date_start=date_start, date_end=date_end)
            else:
                self.query.SearchSet(any_words = tarkeys, 
                                     date_start=date_start, date_end=date_end)
            
            
            #Run King algorithm to find keywords.
            self.query.ProcessData(stem = False, keep_twitter_symbols=False,
                                   remove_wordlist=refkeys)
            self.query.ReferenceKeywords()
            self.query.ClassifyDocs(min_df=5, ref_trainprop=1, algorithms=algorithms)
            self.query.FindTargetSet()
            self.query.FindKeywords()
            
            #Extract target keywords from algorithm results
            target_keywords = self.query.target_keywords[:top_n]
            #Also get the reference set keywords to loop over
            target_keywords += self.query.reference_keywords[:top_n]
            #Append unique nontarget keywords to list of nontarget keys
            for nonkey in self.query.nontarget_keywords[:100]:
                if nonkey not in nontarget_keywords:
                    nontarget_keywords.append(nonkey)
            
            #Loop over each relevant keyword from reference and found target keywords
            for keyword in target_keywords:
                #Check if keyword has already been rejected or accepted
                if keyword in accepted_keywords or keyword in rejected_keywords:
                    continue
                else:
                    inp = input(f"Keep {keyword.upper()} yes or no?")
                    if inp == "y":
                        accepted_keywords.append(keyword)
                        #get similar keywords through most similar pretrained embeddings
                        inp2 = input(f"Look at {keyword.upper()}'s most similar word embeddings, yes or no?")
                        if inp2 == "y":
                            #Look if keyword exist in embedding model dictionary
                            try:
                                #Get the stemmed embedding 
                                embeddings = [self.stemmer.stem(emb[0]) for emb in self.we.most_similar(keyword)]
                                for emb in embeddings:
                                    #Skip embeddings that have already been accepted or rejected
                                    if emb.lower() in accepted_keywords or emb.lower() in rejected_keywords:
                                        continue
                                    else:
                                        inp3 = input(f"Keep embedding {emb.upper()} yes or no?")
                                        if inp3 == "y":
                                            accepted_keywords.append(emb)
                                        elif inp3 =="n":
                                            rejected_keywords.append(emb)
                            except KeyError:
                                #gensim raises KeyError when the keyword is not in the embedding vocabulary
                                print(f"{keyword.upper()} embedding not present in Model!")
                        elif inp2 == "n":
                            pass
                    elif inp == "n":
                        rejected_keywords.append(keyword)
                        
            #Add custom keyword(s) at the end of the loop, as a single keyword or a comma-separated list
            inp4 = input("Do you wish to add any further keywords? If yes, type keyword(s): ")
            #input() always returns a string, so split on commas to support several keywords
            for key in inp4.split(","):
                if key.strip():
                    accepted_keywords.append(key.strip())
            
            print("-"*66)
            print(" "*20, f"CURRENT KEYWORDS AFTER ITERATION {it}")
            print("-"*66)
            print(f"ACCEPTED: \n {accepted_keywords}")
            print(f"REJECTED: \n {rejected_keywords}")
            
        fitted_model = self.query
        keywords = {"accepted_keys":accepted_keywords, "rejected_keys":rejected_keywords,"nontarget_keys":nontarget_keywords}
        
        return keywords, fitted_model
        
        
        

Now we demonstrate the workflow for one iteration. Note that in each subsequent iteration the search set is normally expanded by the accepted keywords. We repeat this until we see no significant increase in the number of documents and no new keywords appear to accept or reject.
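The stopping rule described above can be sketched as a small check run after each iteration. This is only a sketch under stated assumptions: the function name `should_stop`, the 1% growth threshold and the document counts are hypothetical and chosen for illustration, since in our workflow the decision was made by inspection.

```python
def should_stop(doc_counts, new_keywords, min_growth=0.01):
    """Hypothetical stopping rule for the iterative keyword search.

    doc_counts: number of matched documents after each iteration so far
    new_keywords: keywords from the latest iteration not yet accepted or rejected
    min_growth: assumed threshold for a "significant" relative increase
    """
    if len(doc_counts) < 2:
        return False  # need at least two iterations to measure growth
    previous, current = doc_counts[-2], doc_counts[-1]
    growth = (current - previous) / previous if previous else float("inf")
    # Stop only when the set has stopped growing AND nothing new was proposed
    return growth < min_growth and not new_keywords

# Made-up counts: the matched set is still growing and a new keyword appeared
print(should_stop([1000, 1300], ["blaze"]))  # False
# Made-up counts: growth below 1% and no new keywords
print(should_stop([1300, 1305], []))         # True
```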

In [8]:
#Initiate the query builder object
query = QueryBuilder(fasttext)
#Run the workflow using initial keywords found by our topic models to demarcate reference and search set
keywords, model = query.get_keywords(its = 1, top_n=10, 
                                     refkeys=["bushfir|bushfir_crisi|bushfir_affect|firefight"], 
                                     tarkeys=["fire", "disast", "recov", "emerg","wildlif", "nsw"])

Keyword object initialized.
Loaded corpus of size 147440 in 4.02 seconds.
------------------------------------------------------------------
STARTING ITERATION: 0!
INITIAL REFERENCE KEYS: ['bushfir|bushfir_crisi|bushfir_affect|firefight'] 
 INITIAL TARGET KEYS: ['fire', 'disast', 'recov', 'emerg', 'wildlif', 'nsw']
------------------------------------------------------------------
Loaded reference set of size 1991 in 3.23 seconds.
Loaded search set of size 4816 in 7.26 seconds.
Time to process corpus: 2.3 seconds

4199 reference set keywords found.

Document Term Matrix: 6807 by 2595 with 92134 nonzero elements

Time to get document-term matrix: 0.19 seconds

Ref training size: 1991; Search training size: 1589; Training size: 3580; Test size: 4816

Time for Naive Bayes: 0.0 seconds
Time for Logit: 0.18 seconds
1613 documents in target set
3203 documents in non-target set


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


586 target set keywords found
709 non-target set keywords found
Keep FIRE yes or no?y
Look at FIRE's most similar word embeddings, yes or no?y
Keep embedding FLAME yes or no?y
Keep embedding THREE-ALARM yes or no?y
Keep embedding TWO-ALARM yes or no?y
Keep embedding FOUR-ALARM yes or no?y
Keep embedding BLAZE yes or no?y
Keep embedding FIVE-ALARM yes or no?y
Keep embedding FIRE. yes or no?n
Keep embedding CONFLAGR yes or no?y
Keep AFFECT yes or no?n
Keep RECOVERI yes or no?y
Look at RECOVERI's most similar word embeddings, yes or no?y
RECOVERI embedding not present in Model!
Keep SUPPORT yes or no?y
Look at SUPPORT's most similar word embeddings, yes or no?y
Keep embedding SUPPPORT yes or no?n
Keep embedding OPPOS yes or no?n
Keep embedding BACK yes or no?n
Keep embedding SUPORT yes or no?n
Keep embedding SUPPRT yes or no?n
Keep embedding HELP yes or no?y
Keep DEVAST yes or no?y
Look at DEVAST's most similar word embeddings, yes or no?n
Keep SERVIC yes or no?y
Look at SERVIC's most sim

## Defining the Bushfire Subset

Based on the keywords found through the process above and our netnography, we define the subset of bushfire tweets below. Note that this is the final query, which has been refined through random sampling and qualitative coding to evaluate accuracy, as outlined in section 7.

In [19]:
#Define keywords for subsetting
positive_query = "firey|bushfir|bushfir_crisi|bushfir_affect|firefight|fire|blaze|/^nsw$/|firefighters|firemen|conflagration|/^ash$/|smoke|aerial|opbushfireassist|burnings|burns|burnt|burned|burn|burning|nswfires|blacksummer|habitat|flames|two-alarm|three-alarm|four-alarm|five-alarm|blaze|fireman|habitats|wild-life"
negative_query = r"covid|coron|covid|pandemic|epidemic|flu|rona|vaccin|virus"
#Stem each term so it matches the stemmed tweet text; the set drops duplicate stems
stemmer = PorterStemmer()
positive_query = '|'.join({stemmer.stem(w) for w in positive_query.split("|")})
negative_query = '|'.join({stemmer.stem(w) for w in negative_query.split("|")})

#create subset
subset = data.loc[(data["final_text"].str.contains(positive_query))
                 &(~data["final_text"].str.contains(negative_query)) 
                 &(data["created_at"] >= "2019-06-01") 
                 &(data["created_at"] < "2020-06-01")]


subset.shape

(2976, 20)
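Because `str.contains` treats the query as a regular expression by default, each stem matches as a substring, and an entry written as `/^nsw$/` is read by Python with the slashes as literal characters. The sketch below illustrates both effects with the standard `re` module on made-up strings. The `\bnsw\b` pattern is only one possible alternative and is not part of the query used above.

```python
import re

# Substring behaviour: the stem "fire" also matches inside longer words.
substring_hit = bool(re.search("fire", "campfire stories"))
print(substring_hit)  # True

# JavaScript-style delimiters are not regex syntax in Python: the pattern
# needs a literal "/" before the start of the string, so it cannot match.
slash_hit = bool(re.search("/^nsw$/", "nsw"))
print(slash_hit)  # False

# A word-boundary pattern (an assumed alternative) matches the whole token only.
token_hit = bool(re.search(r"\bnsw\b", "fires across nsw today"))
inside_word_hit = bool(re.search(r"\bnsw\b", "nswfires"))
print(token_hit, inside_word_hit)  # True False
```

Whether substring matching is desirable depends on the term: it is what lets a stem such as `bushfir` cover its inflected forms, but short stems can also pick up unrelated words.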

In [21]:
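#Create bushfire dummy (1 if the tweet is in the bushfire subset, else 0)
#isin compares against the values of index_col; `x in series` would test the index labels instead
data["bushfire_dummy"] = data["index_col"].isin(subset["index_col"]).astype(int)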

In [24]:
#Save final_df with bushfire dummy
data.to_csv("data/final_df.csv")

In [None]:
#Save bushfire subset
subset = subset.reset_index(drop = True)
subset.to_csv("data/bushfire_subset.csv")