# Proposed method for expanding power outage keywords:

## Foreword:
As you are by now aware, our Twitter scraping function was designed to seek out keywords that we thought were relevant to power outages.  This prefiltering was very helpful for cutting through the overwhelming volume of tweets posted in the United States each day, and we believed it would increase the proportion of tweets from blackout zones in our data.

In the current version of our scraping function, data is drawn from Twitter using a very limited number of keywords.  Ideally, our keyword vocabulary would be much more expansive, and it would be even better if those keywords were chosen from a corpus of confirmed tweets written during actual blackout events.

Unfortunately, isolating blackouts based on Twitter data proved very challenging.  Without independently confirmed locations for our tweets, we have not been able to implement the improved keyword selection that we originally envisioned.  Nevertheless, we believe that improved keywords are imperative to any future rollout of this project.  In this notebook, we lay out the theory and practice that we think is best for expanding and enriching the keyword vocabulary.

In [1]:
# Libraries:

import pandas as pd
import numpy as np
import time
import gensim

#### The first step in upgrading our keywords is to explore a body of words which we can prove are related to power outages.  

To that end, we will use Gensim and a corpus of Wikipedia articles that we are repurposing here from some earlier classwork.  Our goal is to expand the keywords by looking up the words most similar to seed terms we have chosen that relate to blackouts; this should return a body of words that we may not have even considered.
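To make the lookup concrete, here is a minimal sketch of the idea behind Gensim's `most_similar`: rank every other word in the vocabulary by cosine similarity to a seed word. The three-dimensional vectors and the names `toy_vectors`, `cosine`, and `nearest_words` are invented for illustration only; the real model works on far larger learned vectors.

```python
import math

# Hypothetical toy embeddings; the values are invented purely for illustration.
toy_vectors = {
    "blackout": [0.9, 0.1, 0.0],
    "outage":   [0.8, 0.2, 0.1],
    "storm":    [0.5, 0.5, 0.3],
    "banana":   [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(seed, vectors, topn=2):
    """Return the topn (word, similarity) pairs closest to the seed word."""
    scores = [
        (word, cosine(vectors[seed], vec))
        for word, vec in vectors.items()
        if word != seed  # the seed itself is not a useful neighbour
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

neighbours = nearest_words("blackout", toy_vectors)
print(neighbours)
```

In this toy vocabulary, "outage" ranks above "storm", and "banana" falls out of the top two, which is the behaviour we rely on when expanding the seed terms.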

In [4]:
# Wiki text data: takes ~3 minutes to load.
# This cell is commented out; uncomment it before running the cells below, which need `model`.

# t0 = time.time()
# model = gensim.models.KeyedVectors.load_word2vec_format('./lexvec.enwiki+newscrawl.300d.W.pos.vectors')
# print(time.time() - t0)

In [5]:
# My identified terms:

words = ['blackout','power','outage','electric','electricity','electrical','transformer','watt','wattage','arc',
         'circuit','breaker','cable','fault','conductor','fuse','riser','insulator','meter',
         'interruption','maintenance','relay','grid','severe','weather','storm','substation','surge','switch',
         'switchyard','station','transmission','system','lines','line','frequency','voltage']

## Dictionary of new terms, organized by parent term:

In [6]:
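# Map each seed term to its nearest neighbours in the embedding space
banker = {}
for word in words:
    # most_similar returns (word, similarity) pairs, best match first
    arr = pd.DataFrame(model.most_similar(word, topn=20), columns=['word', 'val'])
    # Keep only neighbours above the similarity cutoff
    arr = arr[arr['val'] > 0.39]
    banker[word] = list(arr['word'])

banker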

NameError: name 'model' is not defined

In [7]:
# List layout: the same neighbours as above, collected into one flat list
bank = []
for word in words:
    arr = pd.DataFrame(model.most_similar(word, topn=20), columns=['word', 'val'])
    arr = arr[arr['val'] > 0.39]
    bank.extend(arr['word'])

# Add original terms to the list
bank.extend(words)

#print(bank)

## Next Steps:

As you can see, we now have a working dictionary of far, far more potential keywords.  This is a great outcome for a number of reasons.  The first is that we've achieved what we set out to do: tweets could just as easily be scraped using this larger body of keywords as they were using the smaller group that generated it.

One unexpected but welcome result of generating these new keywords is that some of them are clearly "off theme".  While "electrical", one of the seed terms above, is conceivably desirable as a keyword, the terms we generated from it are clearly not.  These terms are all very technical and are unlikely to appear in a tweet pertaining to a blackout.

The great thing about this is that we could now begin to refine our keywords based on what does or does not appear to work. 
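One way to start that refinement is sketched below: remove repeated terms from the combined list, then drop anything a reviewer has flagged as off theme. The contents of `candidate_bank` and `off_theme` are hypothetical placeholders rather than real model output, and `prune_keywords` is an illustrative helper, not part of our scraping function.

```python
def prune_keywords(candidates, off_theme):
    """Return candidates in first-seen order, without duplicates or flagged terms."""
    blocked = {term.lower() for term in off_theme}
    seen = set()
    kept = []
    for term in candidates:
        key = term.lower()  # compare case-insensitively
        if key in blocked or key in seen:
            continue
        seen.add(key)
        kept.append(term)
    return kept

# Hypothetical examples; real values would come from `bank` and a manual review.
candidate_bank = ["outage", "blackout", "Outage", "switchgear", "storm", "blackout"]
off_theme = ["switchgear"]

keywords = prune_keywords(candidate_bank, off_theme)
print(keywords)
```

Keeping the exclusion list separate from the generated list means the similarity lookup can be rerun at any time without losing the manual review decisions.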

Whichever keywords we choose to keep at this stage, the next and final stage in our keyword refinement would depend on having a body of confirmed tweets from actual blackouts.  If we had that data, we would count-vectorize those tweets and perform a term frequency analysis on the corpus to help us winnow the useful keywords from the less useful ones.
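If such a corpus becomes available, the counting step could look roughly like the sketch below. It uses only the standard library as a stand-in for a full count vectorizer, and the `confirmed_tweets` strings are invented placeholders, since we do not have confirmed blackout tweets. The helper name `keyword_frequencies` and the simple regular-expression tokenizer are likewise assumptions made for illustration.

```python
import re
from collections import Counter

def keyword_frequencies(tweets, keywords):
    """Count how often each candidate keyword occurs across all tweets."""
    vocab = {k.lower() for k in keywords}
    counts = Counter()
    for tweet in tweets:
        # Very simple tokenizer: lowercase runs of letters and apostrophes
        tokens = re.findall(r"[a-z']+", tweet.lower())
        counts.update(tok for tok in tokens if tok in vocab)
    # Report every keyword, including those that never appear
    return {k: counts[k] for k in sorted(vocab)}

# Invented placeholder tweets; not real data.
confirmed_tweets = [
    "Power is out again, whole block is dark",
    "Huge storm knocked out power, outage map shows our street",
    "No power since the storm started",
]
candidates = ["power", "outage", "storm", "substation"]

freqs = keyword_frequencies(confirmed_tweets, candidates)
unused = [k for k, n in freqs.items() if n == 0]
print(freqs)
print(unused)
```

Keywords that land in `unused` would be the first candidates to drop, while the highest counts point to the terms worth keeping in the scraper.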