# 03: Identifying Power Outages Using Social Media - Exploratory Data Analysis
### Danielle Medellin, Matthew Malone, Omar Smiley

## Import Libraries 

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Load Data

In [4]:
tweets = pd.read_csv('../data/cleaned_tweets.csv')

In [6]:
tweets.shape

(16913, 13)

In [7]:
tweets.dtypes

tweet_id              int64
username             object
text                 object
tweet_date           object
search_term          object
city                 object
lat                 float64
long                float64
radius               object
query_start          object
name_and_tweet       object
outage_sentiment    float64
state                object
dtype: object

In [8]:
tweets['tweet_date'] = pd.to_datetime(tweets['tweet_date'])

## Exploratory Data Analysis

In [12]:
tweets['outage_sentiment'].describe()

count    16913.000000
mean         0.124359
std          0.096824
min          0.000000
25%          0.060000
50%          0.100000
75%          0.170000
max          1.000000
Name: outage_sentiment, dtype: float64

The minimum outage sentiment is 0 and the maximum is 1. The mean is 0.124, so the average outage sentiment is low. The 75th percentile is 0.17, so at least three quarters of the tweets contain only a low proportion of words that are associated with power outages.
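To put a number on how many tweets fall under a given cutoff, a small helper can compute the share of scores at or below a threshold. This is a minimal sketch: the function name `share_at_or_below` and the toy scores are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def share_at_or_below(scores: pd.Series, threshold: float) -> float:
    """Return the fraction of scores that are <= threshold."""
    return float((scores <= threshold).mean())

# Hypothetical toy scores, NOT values taken from the tweet data
toy_scores = pd.Series([0.0, 0.05, 0.10, 0.15, 0.40, 1.0])
toy_share = share_at_or_below(toy_scores, 0.2)
print(f"{toy_share:.2f}")
```

On the notebook's data, the equivalent call would be `share_at_or_below(tweets['outage_sentiment'], 0.2)`.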

### Looking at tweets with extreme outage sentiment

In [14]:
tweets[tweets['outage_sentiment']==1]['text']

5315                                          power outage
13649                                    this power outage
14876    storm update cps reports there are currently  ...
Name: text, dtype: object

The first two tweets in this list are clearly associated with power outages, because the phrase "power outage" makes up essentially the whole tweet. It makes sense for these tweets to have an outage sentiment value of 1.

In [18]:
# select the single cell by row label and column name
tweets.loc[14876, 'text']

'storm update cps reports there are currently  power outages affecting  customers in the san antonio area'

Looking at the final tweet with an outage sentiment of 1, we see a tweet that is explicitly describing a power outage in a specific area due to a storm. It, again, makes sense that this tweet would have an outage sentiment of 1 and is a great example of a tweet that would give information about a power outage. 
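Because pandas truncates long strings when it displays a Series, it can be convenient to pull the matching tweets out as plain Python strings. The sketch below assumes a hypothetical helper `full_texts` and a toy frame standing in for `tweets`.

```python
import pandas as pd

def full_texts(df: pd.DataFrame, mask: pd.Series, column: str = "text") -> list:
    """Return the untruncated strings in `column` for rows where `mask` is True."""
    return df.loc[mask, column].tolist()

# Hypothetical toy frame standing in for `tweets`
toy = pd.DataFrame({
    "text": ["power outage", "long day at the office", "this power outage"],
    "outage_sentiment": [1.0, 0.0, 1.0],
})
top_texts = full_texts(toy, toy["outage_sentiment"] == 1)
for t in top_texts:
    print(t)
```

An alternative is `pd.set_option('display.max_colwidth', None)`, which tells pandas to stop truncating column contents when printing.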

In [20]:
len(tweets[tweets['outage_sentiment']==0]['text'])

815

In [32]:
tweets[tweets['outage_sentiment']==0]['text'].tail(50)

16171    im requiem for a dream blu ray unrated directo...
16247    ready to see the scariest movie of lightsout  ...
16288    justinwells my man ashtonlogan is a beast he h...
16301    silas nello qa on new record in todays dallas ...
16369    your a voice a crack in the mirror see me now ...
16384    bellamy is ready to crash if only someone woul...
16397    yall lightsout taking a month off to rest reju...
16444    cant stop playing atari and we have discovered...
16509    lookin goodout of curiosity can i ask which ri...
16542    lacy lying she fucking when the lights go outl...
16544    long day at the office whos ready for friday l...
16647                         mmmm   innout burger  irving
16700    ive never seen a president mess with his own p...
16712    cent spoiler alert you are a genius taking a p...
16722    when you finally got weekends off and you dont...
16734    im so bored i kinda wanna rewatch all of power...
16774    power_starz now that was a good ass episode no.

There are 815 tweets in the data set that received an outage sentiment score of 0. This means they are expected to have absolutely no association with power outages. We will spot check a few of these tweets.
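One way to make a spot check reproducible is to draw a random sample of the zero-score tweets with a fixed seed. This is only a sketch: the helper name `spot_check`, the seed, and the toy data are assumptions for illustration.

```python
import pandas as pd

def spot_check(df, score_col="outage_sentiment", text_col="text", n=5, seed=42):
    """Return up to `n` randomly chosen texts whose score is exactly 0."""
    zero = df.loc[df[score_col] == 0, text_col]
    return zero.sample(n=min(n, len(zero)), random_state=seed)

# Hypothetical toy frame standing in for `tweets`
toy = pd.DataFrame({
    "text": ["power outage", "innout burger", "storm knocked out power",
             "long day at the office", "watching a movie"],
    "outage_sentiment": [1.0, 0.0, 0.5, 0.0, 0.0],
})
checked = spot_check(toy, n=2)
print(checked)
```

Fixing `random_state` means the same rows come back on every run, so any commentary written about them stays valid.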

In [37]:
tweets.loc[335, 'name_and_tweet']

'carmen garsia here at the movies to see lightsout they werent kidding where are'

The above tweet contains `lightsout`, which was one of the search terms in our initial scrape. We noticed that this term pulled in many tweets about Lights Out, a horror movie that came out in 2016. For this reason, we removed `lightsout` from our association list, so tweets like the one above are now correctly identified as having no association with a power outage.
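To see how much an ambiguous search term affects the data, one could count the tweets that contain it and average their scores. The sketch below uses a hypothetical helper `summarize_term` and toy rows; it is not the procedure used to build the association list.

```python
import pandas as pd

def summarize_term(df, term, text_col="text", score_col="outage_sentiment"):
    """Count rows whose text contains `term` and average their scores."""
    # regex=False treats `term` as a literal substring; na=False skips missing text
    has_term = df[text_col].str.contains(term, regex=False, na=False)
    return {
        "n_with_term": int(has_term.sum()),
        "mean_score_with_term": float(df.loc[has_term, score_col].mean()),
    }

# Hypothetical toy frame standing in for `tweets`
toy = pd.DataFrame({
    "text": ["ready to see lightsout tonight", "power outage downtown",
             "lightsout was so scary"],
    "outage_sentiment": [0.0, 1.0, 0.0],
})
lightsout_summary = summarize_term(toy, "lightsout")
print(lightsout_summary)
```

A large count paired with a near-zero mean score would suggest the term mostly retrieves off-topic tweets.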

In [38]:
tweets.loc[1738, 'name_and_tweet']

'pahblow i got bills theyre multiplying and im loosing controlcause the powerthey supplyingis gonna get cut off if i dont get my shit together'

The tweet above was pulled into our dataset because it uses the words `power` and `cut`, which were search terms. On closer reading, though, the tweet does not actually have anything to do with a power outage, even though the user mentions their power supply and the possibility of it being shut off.

In [40]:
tweets.loc[16792, 'name_and_tweet']

'ashley bischoff and  barring a scientific breakthrough we may be a looooong way off from electric planes  \xa0with current tech an airbus a would need  kg   lbs of batteries to have the same flight rangethats four times the weight of the entire plane'

In [None]:
# `sent_df` is not defined in this notebook; plot the score column of `tweets` instead
plt.hist(tweets['outage_sentiment'], color='purple')
plt.title('Histogram of Outage Sentiments')
plt.xlabel('Outage Sentiment')
plt.ylabel('Frequency');

In [None]:
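# adding 1 to each value so the log is defined for scores of 0
# (`sentiments` is not defined in this notebook, so use the column in `tweets`)
sent_1 = tweets['outage_sentiment'] + 1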

In [None]:
plt.hist(np.log(sent_1),color='pink')
plt.title('Histogram of Log of Outage Sentiments')
plt.xlabel('Log of Outage Sentiment')
plt.ylabel('Frequency');
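The "add 1, then log" transform used for this histogram is the same thing NumPy exposes as `np.log1p`. A minimal sketch with hypothetical toy scores shows the two agree:

```python
import numpy as np

# Hypothetical toy scores in [0, 1], NOT values taken from the tweet data
scores = np.array([0.0, 0.06, 0.10, 0.17, 1.0])

shifted_log = np.log(scores + 1)   # add 1, then take the natural log
log1p_values = np.log1p(scores)    # same transform in a single call

print(np.allclose(shifted_log, log1p_values))
```

A score of 0 maps to 0 under this transform, so zero-sentiment tweets stay at the left edge of the histogram.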

In [None]:
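# NOTE: `sentiments` is not defined in this notebook, and the loaded CSV already
# has an 'outage_sentiment' column, so this cell raises an error if re-run here
tweets.insert(11,column = 'outage_sentiment', value=sentiments, allow_duplicates=False)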

In [None]:
tweets.head()

In [None]:
plt.boxplot(tweets['outage_sentiment']);

In [None]:
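# `corpus` is not defined in this notebook; assuming it held each tweet's text,
# count the whitespace-separated words in each tweet
word_count = []

for tweet in tweets['text']:
    word_count.append(len(tweet.split()))
    
np.mean(word_count)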

In [None]:
# average tweet length in characters
lengths = []

for tweet in tweets['text']:
    lengths.append(len(tweet))
    
np.mean(lengths)
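Note that `len()` on a string counts characters, not words. A short sketch with hypothetical toy strings shows both measures computed with pandas string methods:

```python
import pandas as pd

# Hypothetical toy texts standing in for tweets['text']
toy_text = pd.Series(["power outage", "this power outage", "innout burger irving"])

char_lengths = toy_text.str.len()             # characters per tweet
word_counts = toy_text.str.split().str.len()  # whitespace-separated words per tweet

print(char_lengths.mean(), word_counts.mean())
```

On the notebook's data, the same two expressions applied to `tweets['text']` would give the average character length and average word count without an explicit loop.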