# Tweet Vectorizer & Clustering
> Author: [Dawn Graham](https://dawngraham.github.io/)

This notebook preprocesses and vectorizes our tweet text so it can be used in models. I then use logistic regression to determine the words most associated with tweets made during power outages.

Versions used:
- Python 3.6.6
- matplotlib 3.0.2
- nltk 3.3
- numpy 1.15.4
- pandas 0.23.4
- regex 2018.11.22
- seaborn 0.9.0
- sklearn 0.0
- Unidecode 1.0.23

## Import libraries

In [1]:
# General
import pandas as pd
import numpy as np

# For natural language processing
import regex as re
import unidecode
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')

# For logistic regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# For K-Means
from sklearn.cluster import KMeans, k_means
from sklearn.metrics import silhouette_score
from sklearn.datasets.samples_generator import make_blobs

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib

%matplotlib inline

## Read in data

In [2]:
tweets = pd.read_csv('../data/combined_tweets_outages.csv')

# Set `timestamp` to datetime and set it to index
tweets['timestamp'] = pd.to_datetime(tweets['timestamp'])
tweets.set_index('timestamp', inplace=True)
tweets.head()

Unnamed: 0_level_0,id,likes,query,replies,retweets,text,user,outage,outage_state
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2012-11-01 23:50:22,264152432282578945,1,EversourceMA OR EversourceNH OR VelcoVT OR nat...,1.0,3,"Tom May, CEO of Northeast Utilities, the paren...",EversourceMA,1,WV OH PA NJ CT MA NY DE MD IN KY MI
2012-11-01 23:45:13,264151136792109056,0,EversourceMA OR EversourceNH OR VelcoVT OR nat...,0.0,0,@NYGovCuomo @lipanews @nationalgridus @nyseand...,readyforthenet,1,WV OH PA NJ CT MA NY DE MD IN KY MI
2012-11-01 23:34:44,264148498352590849,1,EversourceMA OR EversourceNH OR VelcoVT OR nat...,0.0,1,Some amazing video from the Wareham microburst...,EversourceMA,1,WV OH PA NJ CT MA NY DE MD IN KY MI
2012-11-01 23:34:20,264148399190851584,0,EversourceMA OR EversourceNH OR VelcoVT OR nat...,0.0,0,@nationalgridus Call me if you need some help ...,sparky1000,1,WV OH PA NJ CT MA NY DE MD IN KY MI
2012-11-01 23:31:56,264147793147490304,0,EversourceMA OR EversourceNH OR VelcoVT OR nat...,1.0,8,Current PSNH statewide w/o power: 885. We're d...,EversourceNH,1,WV OH PA NJ CT MA NY DE MD IN KY MI


## Preprocessing

In [3]:
# Preprocessing function
def text_to_words(raw_text):
    
    # Get rid of accents
    unaccented = unidecode.unidecode(raw_text)
    
    # Get rid of punctuation
    letters_only = re.sub("[^a-zA-Z]", " ", unaccented)
    
    # Get all lowercase words
    words = letters_only.lower().split()
    
    # Instantiate and run Lemmatizer
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(i) for i in words]
    
    # Remove stop words
    stops = set(stopwords.words('english'))
    meaningful_words = [w for w in words if not w in stops]
    
    # Join into string and return the result.
    return(" ".join(meaningful_words))

In [4]:
# Clean all tweets
total_tweets = tweets.shape[0]
clean_texts = []

print("Cleaning and parsing the tweets...")

j = 0
for text in tweets['text']:
    # Convert to words, then append to clean_tweets
    clean_texts.append(text_to_words(text))
    
    # If the index is divisible by 1000, print a message
    if (j+1) % 1000 == 0:
        print(f'Comment {j+1} of {total_tweets}.')
    
    j += 1
    
    if j == total_tweets:
        print('Done.')

Cleaning and parsing the tweets...
Comment 1000 of 38069.
Comment 2000 of 38069.
Comment 3000 of 38069.
Comment 4000 of 38069.
Comment 5000 of 38069.
Comment 6000 of 38069.
Comment 7000 of 38069.
Comment 8000 of 38069.
Comment 9000 of 38069.
Comment 10000 of 38069.
Comment 11000 of 38069.
Comment 12000 of 38069.
Comment 13000 of 38069.
Comment 14000 of 38069.
Comment 15000 of 38069.
Comment 16000 of 38069.
Comment 17000 of 38069.
Comment 18000 of 38069.
Comment 19000 of 38069.
Comment 20000 of 38069.
Comment 21000 of 38069.
Comment 22000 of 38069.
Comment 23000 of 38069.
Comment 24000 of 38069.
Comment 25000 of 38069.
Comment 26000 of 38069.
Comment 27000 of 38069.
Comment 28000 of 38069.
Comment 29000 of 38069.
Comment 30000 of 38069.
Comment 31000 of 38069.
Comment 32000 of 38069.
Comment 33000 of 38069.
Comment 34000 of 38069.
Comment 35000 of 38069.
Comment 36000 of 38069.
Comment 37000 of 38069.
Comment 38000 of 38069.
Done.


In [5]:
# Add cleaned tweets to dataframe
tweets = tweets.assign(clean_text = clean_texts)
tweets.head(3)

Unnamed: 0_level_0,id,likes,query,replies,retweets,text,user,outage,outage_state,clean_text
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2012-11-01 23:50:22,264152432282578945,1,EversourceMA OR EversourceNH OR VelcoVT OR nat...,1.0,3,"Tom May, CEO of Northeast Utilities, the paren...",EversourceMA,1,WV OH PA NJ CT MA NY DE MD IN KY MI,tom may ceo northeast utilities parent company...
2012-11-01 23:45:13,264151136792109056,0,EversourceMA OR EversourceNH OR VelcoVT OR nat...,0.0,0,@NYGovCuomo @lipanews @nationalgridus @nyseand...,readyforthenet,1,WV OH PA NJ CT MA NY DE MD IN KY MI,nygovcuomo lipanews nationalgridus nyseandg da...
2012-11-01 23:34:44,264148498352590849,1,EversourceMA OR EversourceNH OR VelcoVT OR nat...,0.0,1,Some amazing video from the Wareham microburst...,EversourceMA,1,WV OH PA NJ CT MA NY DE MD IN KY MI,amazing video wareham microburst dartmouth res...


## Train / Test Split

In [6]:
# Set features and target
features = tweets['clean_text']

X = features
y = tweets['outage']

In [7]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

## Bag of Words

In [8]:
vect = CountVectorizer()

# Fit the vectorizer on our corpus and transform
X_train_vect = vect.fit_transform(X_train)
X_train_vect = pd.DataFrame(X_train_vect.toarray(), columns=vect.get_feature_names())

# Transform the test set
X_test_vect = vect.transform(X_test)

# Transform entire set for K-Means clustering later
X_vect = vect.transform(X)

## Logistic Regression
Use logistic regression to get words that are most likely to be from tweets made during power outages. Cross-validation and accuracy scores help serve as a sanity check.

In [9]:
# Get baseline accuracy score
y_train.value_counts(normalize=True)[1]

0.5366887324436972

In [10]:
# Instantiate model
logreg = LogisticRegression(solver='liblinear')

# Fit on training data.
logreg.fit(X_train_vect, y_train)

# Get scores
print('CV score:', cross_val_score(logreg, X_train_vect, y_train, cv=3).mean())
print('Training accuracy:', logreg.score(X_train_vect, y_train))
print('Testing accuracy:', logreg.score(X_test_vect, y_test))

CV score: 0.7151762974737621
Training accuracy: 0.9055024342404819
Testing accuracy: 0.7160117671779785


#### Confusion Matrix
Check confusion matrix to get a sense of how the model is classifying tweets.

In [11]:
# Create confusion matrix
predictions = logreg.predict(X_test_vect)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

Unnamed: 0,predict neg,predict pos
actual neg,3032,1378
actual pos,1325,3783


## Associated Words

In [12]:
# Create dataframe with coefs and e^coefs for each word
coefs = list(zip(vect.get_feature_names(), logreg.coef_[0].T))
coefs = pd.DataFrame(coefs, columns = ['word','coef'])
coefs['e^coef'] = np.exp(coefs['coef'])

#### Tweets made during power outages
Reminder: Tweets were not necessarily *about* power outages or in locations were outages occurred, but were made at times that there were verified power outages.

In [13]:
# Show words most associated with tweets made during power outages
coefs.sort_values(by='e^coef', ascending=False).head(20)

Unnamed: 0,word,coef,e^coef
17700,mairene,2.527414,12.521084
9160,etrs,2.415532,11.195727
8221,easter,2.022488,7.557101
14376,irene,1.975736,7.211925
21187,ojatg,1.954793,7.062455
27592,snowtober,1.95353,7.053542
20134,nhirene,1.949198,7.023052
20494,noreaster,1.839478,6.293254
17820,map,1.788442,5.980127
20154,nhsandy,1.774881,5.899578


#### Explore tweets with specific words

In [14]:
# Function to get full tweets and count of tweets containing specific words
def get_tweets(word):
    mask = tweets[tweets['clean_text'].str.contains(f'(^|\W){word}($|\W)')].index
    count = 0
    for i in mask:
        count += 1
        print(i, tweets['text'][i], '\n')
    print(f'\nTotal tweets containing "{word}": {count}')

In [15]:
# Look at titles with words most associated with tweets made during power outages
get_tweets('nhsandy')

2012-11-01 23:31:56 Current PSNH statewide w/o power: 885. We're down to three digits!  Outage list: http://www.psnhnews.com/outagelist  #nhsandy 

2012-11-01 22:28:44 Updated town x town listing posted. PSNH statewide total of 1,329 at 6pm. http://www.psnhnews.com/outagelist  #nhsandy 

2012-11-01 21:43:16 Curently: 1,460 PSNH c's statewide remain w/o power. The crews are (safely) rocking! #nhsandy 

2012-11-01 20:32:32 We expect virtually all customers to be restored by midnight tonight! More: http://ow.ly/eXblj  Updated ETRs: http://ow.ly/eXbqg  #nhsandy 

2012-11-01 19:06:52 Latest outage figures http://ow.ly/eX2be  about 4400 customers w/o. Crews making great progress #nhsandy 

2012-11-01 17:53:20 Quite a few are asking -- here is a list of *estimated* restoration times for the towns affected by #nhsandy outages: http://ow.ly/eWD6N  

2012-11-01 17:49:30 Tom May and Gary Long welcome Hydro Québec to NH to help out with the #nhsandy restoration: http://youtu.be/4jDLsmF1tcM  

2012

  This is separate from the ipykernel package so we can avoid doing imports until


#### Notes on findings
- Words in our dataset like `mairene`, `irene`, `nhirene`, `nhsandy`, `risandy`, etc. are highly associated with tweets made during power outages and are usually about outages. While these specific words would not be useful in detecting outages following a future event, they suggest that the name of a given storm (and the name combined with state abbreviations) could be useful in detecting outages following an event.
- Tweets containing `etrs` (standing for Estimated Time of Restorations) could be useful, but most of these are made by power companies providing updates after they already know about outages.
- `ojatg` and `fccenw` are from National Grid's short link to their Outage Central interactive outage map with ETRs. These links are no longer used, apparently because they have shifted to state-specific maps. `Outage Central` and `outage map` could be useful, however are similar to `etrs` and may not provide info about outage locations that power companies don't already know about.
- Taking a closer look at the tweets in our dataset `earthday` and `presidents` suggests these won't be helpful words to use as indicators of power outages.

## K-Means Clustering

In [16]:
kmeans = KMeans(n_clusters=3)
model = kmeans.fit(X_vect)

In [39]:
# Attach predicted cluster to `tweets` dataframe
tweets['predictions'] = model.labels_
tweets.head(3)

Unnamed: 0_level_0,id,likes,query,replies,retweets,text,user,outage,outage_state,clean_text,predictions
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2012-11-01 23:50:22,264152432282578945,1,EversourceMA OR EversourceNH OR VelcoVT OR nat...,1.0,3,"Tom May, CEO of Northeast Utilities, the paren...",EversourceMA,1,WV OH PA NJ CT MA NY DE MD IN KY MI,tom may ceo northeast utilities parent company...,2
2012-11-01 23:45:13,264151136792109056,0,EversourceMA OR EversourceNH OR VelcoVT OR nat...,0.0,0,@NYGovCuomo @lipanews @nationalgridus @nyseand...,readyforthenet,1,WV OH PA NJ CT MA NY DE MD IN KY MI,nygovcuomo lipanews nationalgridus nyseandg da...,0
2012-11-01 23:34:44,264148498352590849,1,EversourceMA OR EversourceNH OR VelcoVT OR nat...,0.0,1,Some amazing video from the Wareham microburst...,EversourceMA,1,WV OH PA NJ CT MA NY DE MD IN KY MI,amazing video wareham microburst dartmouth res...,0


In [31]:
tweets.loc[tweets['predictions'] == 0, 'text']

timestamp
2012-11-01 23:45:13    @NYGovCuomo @lipanews @nationalgridus @nyseand...
2012-11-01 23:34:44    Some amazing video from the Wareham microburst...
2012-11-01 23:34:20    @nationalgridus Call me if you need some help ...
2012-11-01 23:31:30                        Stop following nationalgridus
2012-11-01 23:30:21    Our #MA team is being supported by crews from ...
2012-11-01 23:29:46    @EvanMansolillo Hi Evan, can you provide us wi...
2012-11-01 22:28:44    Updated town x town listing posted. PSNH state...
2012-11-01 22:23:59    @nationalgridus Need Diesel for the Generators...
2012-11-01 21:58:15    Read Diversity Woman Magazine Daily top storie...
2012-11-01 21:24:41    Tnx for your patience! RT @shailchotai: @psnh ...
2012-11-01 20:47:08    @nationalgridus @SenGillibrand Yes. Come. Rest...
2012-11-01 20:18:41    We have under 5,000 customers left.  These are...
2012-11-01 19:57:09    @nationalgridus any estimated restoration time...
2012-11-01 19:40:33    @stephencross @nat

In [32]:
tweets.loc[tweets['predictions'] == 1, 'text']

timestamp
2012-11-01 23:31:56    Current PSNH statewide w/o power: 885. We're d...
2012-11-01 22:41:50    @nationalgridus Thanks 2 the crew who restored...
2012-11-01 21:48:12    Thank you @nationalgridus for finally giving m...
2012-11-01 21:43:16    Curently: 1,460 PSNH c's statewide remain w/o ...
2012-11-01 21:05:26    As of 2 PM, we've restored power to more than ...
2012-11-01 20:51:04    @nationalgridus how do you work in a neighborh...
2012-11-01 20:14:27    @jojones41261  If your neighbor has power but ...
2012-11-01 20:01:10    We continue to make progress in #RI. All elect...
2012-11-01 19:38:04    @Nationalgridus reports 294 #Billerica custome...
2012-11-01 19:27:10    National Grid @nationalgridus - from 73K to 4....
2012-11-01 19:00:54    Kudos to @nationalgridus crews for taking care...
2012-11-01 18:55:53    Got text from @nationalgridUS that power is re...
2012-11-01 18:55:52    Got text from @nationalgridUS that power is re...
2012-11-01 18:54:43    "@nationalgridus: 

In [33]:
tweets.loc[tweets['predictions'] == 2, 'text']

timestamp
2012-11-01 23:50:22    Tom May, CEO of Northeast Utilities, the paren...
2012-11-01 23:05:01    NSTAR crews have restored power to nearly 400,...
2012-11-01 22:15:10    A member of our #UNY team shares restoration e...
2012-11-01 21:20:33    National Grid #RI Pres. Tim Horan & Gov. @Linc...
2012-11-01 20:32:32    We expect virtually all customers to be restor...
2012-11-01 20:15:39    @jojones41261 To answer your other question, o...
2012-11-01 20:15:38    Our #NY teams are working hard to restore serv...
2012-11-01 19:55:08    One of our crew leaders working in Marlborough...
2012-11-01 19:36:43    Ultra-realistic #simulations make difference f...
2012-11-01 19:20:17    Here's a look at one of our #RI team's efforts...
2012-11-01 19:06:52    Latest outage figures http://ow.ly/eX2be  abou...
2012-11-01 17:58:20    @nikihana The crews are out in full force with...
2012-11-01 17:53:20    Quite a few are asking -- here is a list of *e...
2012-11-01 17:36:59    @nitroftam 95% of 

In [21]:
## Create a dataframe for cluster_centers (centroids)
centroids = pd.DataFrame(model.cluster_centers_)
centroids.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,34440,34441,34442,34443,34444,34445,34446,34447,34448,34449
0,0.000418,0.000209,0.0,5.2e-05,0.0,5.2e-05,0.0,0.0,5.2e-05,5.2e-05,...,0.0,5.2e-05,5.2e-05,0.0,5.2e-05,0.0,0.0,5.2e-05,0.0,0.0
1,0.000278,7e-05,7e-05,0.0,7e-05,0.0,7e-05,0.000139,0.0,0.0,...,7e-05,0.0,0.0,7e-05,0.0,7e-05,0.0,0.0,7e-05,7e-05
2,0.000219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000657,0.0,0.000219,0.0,0.0,0.0
