# Analyze Tweets
## With tweepy and pandas
### By: Eric L. Sammons <elsammons@gmail.com>
---
### Purpose
<table style="width:75%">
    <tr>
        <td>
            This project introduces the reader to social media sentiment analysis. We'll use the NLTK VADER sentiment analyzer and its lexicon, with tweepy to collect tweets and pandas to analyze them.
        </td>
        <td>
            <img src="https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" width="125" height="75">
            <img src="https://static1.squarespace.com/static/538cea80e4b00f1fad490c1b/54668a77e4b00fb778d22a34/54668d8ae4b00fb778d285a2/1416007414694/python_nltk.png" width="125" height="75">
            <img src="https://twilio-cms-prod.s3.amazonaws.com/images/twitter-python-logos.width-808.jpg" width="125" height="75">
        </td>
    </tr>
</table>

### Sentiment Analysis
<table style="width:75%">
    <tr>
        <td>
            Defined as the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
        </td>
    </tr>
</table>

### Before You Begin
<table style="width:75%">
    <tr>
        <td>
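            You will need API credentials from a Twitter developer account. The linked guide covers bearer tokens; the code in this notebook authenticates with a consumer key and secret plus an access token and secret.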
        </td>
        <td>
            <a href="https://developer.twitter.com/en/docs/authentication/oauth-2-0/bearer-tokens">
                Using and Generating Bearer Tokens
            </a>
        </td>
    </tr>
</table>

### Getting Started
<table style="width:75%">
    <tr>
        <td>
            To use this notebook you will need to set up a <strong>.env</strong> file in the root folder of this project. The minimum values required are shown to the right.
        </td>
        <td>
            CONSUMER_KEY="xxxxxx"<br>
            CONSUMER_SECRET="xxxxxxxxxxx"<br>
            ACCESS_KEY="xxxxxxxxxxxx"<br>
            ACCESS_SECRET="xxxxxxxxxxxx"
        </td>
    </tr>
 </table>
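
For readers new to `.env` files, the sketch below shows roughly what loading one involves: each `KEY="value"` line becomes a named setting. It is a simplified, standard-library illustration and not how python-dotenv is actually implemented; `parse_env_lines` is a hypothetical helper name, and the values are placeholders.

```python
def parse_env_lines(lines):
    """Parse KEY="value" lines into a dict, skipping blanks and comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Placeholder values only; real credentials belong in the .env file.
sample = [
    'CONSUMER_KEY="xxxxxx"',
    "# comments and blank lines are ignored",
    "",
    'ACCESS_KEY="xxxxxxxxxxxx"',
]
settings = parse_env_lines(sample)

# Check that every key the notebook reads is present before calling the API.
required = ["CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_KEY", "ACCESS_SECRET"]
missing = [name for name in required if name not in settings]
print("Missing keys:", missing)
```

A check like the `missing` list can give a clearer error than the `KeyError` raised by `os.environ[...]` when a value is absent.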

In [1]:
# Install requirements.
!pip install -Uqr requirements.txt

In [2]:
# We'll need these
from dotenv import find_dotenv, load_dotenv
import os
import tweepy
import datetime
import json
import pandas as pd
from nltk import download
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# download vader_lexicon
download('vader_lexicon')

# Import our helper functions and configs
import search
from lib.helper_functions import flatten_tweets
from lib.helper_functions import calculateCentroid

from IPython.display import HTML

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/esammons/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [3]:
load_dotenv(find_dotenv()) # load our .env file.

# Set up OAuth 1.0a user authentication from the values in .env.
auth = tweepy.OAuthHandler(os.environ['CONSUMER_KEY'], os.environ['CONSUMER_SECRET'])
auth.set_access_token(os.environ['ACCESS_KEY'], os.environ['ACCESS_SECRET'])

# Create the API object; wait automatically when a rate limit is reached.
api = tweepy.API(auth, wait_on_rate_limit=True)

In [4]:
print(f"Searching {search.DAYS} days of tweets.")
today = datetime.date.today()  # starting from today
p_days = today - datetime.timedelta(days=search.DAYS)  # earliest date to include

Searching 7 days of tweets.


In [5]:
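# Build a cursor that pages through the search results.
# Note: api.search was renamed to api.search_tweets in tweepy 4.0.
# The since: operator takes no space after the colon and must be
# separated from the search terms by a space.
tweets_list = tweepy.Cursor(api.search,
                            q=f"{search.SEARCH} since:{p_days}",
                            tweet_mode='extended',
                            ).items()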
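
The query string passed to the search combines the search terms with a `since:` date operator. The sketch below shows one way to compose it with the standard library only; `build_query` is a hypothetical helper (it is not part of `search` or tweepy), and the search terms are shortened from the ones used in this notebook.

```python
import datetime


def build_query(terms, days, today=None):
    """Append a since: operator so results start `days` days before `today`."""
    today = today or datetime.date.today()
    start = today - datetime.timedelta(days=days)
    # Written as since:YYYY-MM-DD, separated from the terms by a single space.
    return f"{terms} since:{start.isoformat()}"


query = build_query("@RedHat OR #RedHat", days=7, today=datetime.date(2021, 9, 7))
print(query)
```

Passing `today` explicitly keeps the helper deterministic, which makes the date arithmetic easy to check.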

In [6]:
# Let's grab some tweets.
tweets = []

for tweet in tweets_list:
    tweets.append(json.dumps(tweet._json))

With our list of <strong>tweets</strong> we will now create our dataframe.

Once the dataframe is created we'll:
* Use our <strong>flatten_tweets</strong> helper to make specific nested fields more accessible.
* Convert <strong>created_at</strong> to datetime.
* Set the dataframe's index to the <strong>created_at</strong> field.
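
To make the flattening step concrete, here is a minimal sketch of what such a helper might do. It is an assumption based on the hyphenated column names that appear later in this notebook (for example `user-screen_name`), not the actual `flatten_tweets` from `lib.helper_functions`; `flatten_tweet` and the sample tweet are hypothetical.

```python
import json


def flatten_tweet(tweet_json):
    """Copy a few nested fields to top-level keys named parent-child."""
    tweet = json.loads(tweet_json)
    user = tweet.get("user") or {}
    tweet["user-screen_name"] = user.get("screen_name")
    tweet["user-location"] = user.get("location")
    retweet = tweet.get("retweeted_status")
    if retweet:
        # Retweets carry the original tweet, including its author and full text.
        original_user = retweet.get("user") or {}
        tweet["retweeted_status-user-screen_name"] = original_user.get("screen_name")
        tweet["retweeted_status-full_text"] = retweet.get("full_text")
    return tweet


# Made-up sample tweet, serialized the same way the notebook stores tweets.
raw = json.dumps({
    "full_text": "RT @someone: hello",
    "user": {"screen_name": "example_user", "location": "Internet"},
    "retweeted_status": {"full_text": "hello", "user": {"screen_name": "someone"}},
})
flat = flatten_tweet(raw)
print(flat["user-screen_name"], flat["retweeted_status-user-screen_name"])
```

Promoting nested values to top-level keys means each one becomes its own dataframe column instead of staying buried in a dictionary.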

In [7]:
tweets = pd.DataFrame(flatten_tweets(tweets))
tweets['created_at'] = pd.to_datetime(tweets['created_at'])
tweets = tweets.set_index('created_at')

With our tweets captured we are now ready to perform sentiment analysis.

---
<table style="width:75%">
    <tr>
        <th>Compound score</th>
        <th>Sentiment</th>
    </tr>
    <tr>
        <td>0</td>
        <td>Neutral</td>
    </tr>
    <tr>
        <td>&gt;0</td>
        <td>Positive</td>
    </tr>
    <tr>
        <td>&lt;0</td>
        <td>Negative</td>
    </tr>
</table>
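
The mapping in the table above can be sketched as a small function. With the default `threshold=0.0` it matches the table; the `threshold` parameter is an addition for illustration, since VADER's documentation suggests a neutral band of roughly ±0.05 around zero as a common alternative. `label_sentiment` is a hypothetical name.

```python
def label_sentiment(compound, threshold=0.0):
    """Map a VADER compound score in [-1, 1] to 'pos', 'neu', or 'neg'."""
    if compound > threshold:
        return "pos"
    if compound < -threshold:
        return "neg"
    return "neu"


for score in (0.62, 0.0, -0.41, 0.03):
    # Second label uses a neutral band, so weak scores such as 0.03 stay neutral.
    print(score, label_sentiment(score), label_sentiment(score, threshold=0.05))
```

A neutral band can be useful because very small compound scores often reflect a single mildly charged word rather than a clear opinion.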

In [8]:
sid = SentimentIntensityAnalyzer()
tweets['scores'] = tweets['full_text'].apply(sid.polarity_scores) # combine with dataframe

# Isolate the compound value from scores and create a new column.
tweets['compound'] = tweets['scores'].apply(lambda score_dict: score_dict['compound'])
# Create a new sentiment column with pos, neg, or neu based on compound.
tweets['sentiment'] = tweets['compound'].apply(lambda c: 'pos' if c > 0 else ('neu' if c == 0 else 'neg'))

Tweets can carry a bounding box, a rectangle that approximates the area the user tweeted from. The bounding box is populated only when the user has allowed location sharing for the application.

In the next step we take the bounding box and calculate its centroid, which gives us a single longitude and latitude pair that mapping tools and libraries can use directly.
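
As a rough illustration of the centroid calculation, the sketch below averages the corners of a bounding box. It assumes the layout used by Twitter place objects, where `bounding_box["coordinates"]` holds one ring of `[longitude, latitude]` corners. `bounding_box_centroid` is a hypothetical function and not necessarily how `calculateCentroid` in `lib.helper_functions` works; the sample coordinates are made up.

```python
def bounding_box_centroid(place):
    """Return (long, lat) at the center of a place's bounding box, or (None, None)."""
    try:
        corners = place["bounding_box"]["coordinates"][0]
    except (TypeError, KeyError, IndexError):
        # No place attached to the tweet, or no bounding box inside it.
        return (None, None)
    if not corners:
        return (None, None)
    longs = [point[0] for point in corners]
    lats = [point[1] for point in corners]
    # For a rectangle, averaging the corners gives the center point.
    return (sum(longs) / len(longs), sum(lats) / len(lats))


sample_place = {
    "bounding_box": {
        "type": "Polygon",
        "coordinates": [[[-80.0, 35.0], [-78.0, 35.0], [-78.0, 36.0], [-80.0, 36.0]]],
    }
}
print(bounding_box_centroid(sample_place))
print(bounding_box_centroid(None))
```

Returning `(None, None)` for tweets without a place keeps the result a two-item tuple in every case, so it can always be split into separate longitude and latitude columns.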

In [9]:
tweets['centroid'] = tweets['place'].apply(calculateCentroid) # calculate centroid
tweets[['long', 'lat']] = pd.DataFrame(tweets['centroid'].tolist(), index=tweets.index) # split centroid into long, lat
tweets['search_criteria'] = search.SEARCH

We capture the minimum and maximum dates so that we can use them in the output file name. This removes ambiguity and manual effort, and lets the consumer of the file identify the date range at a glance.

In [10]:
min_created_dt = tweets.index.min().strftime('%Y%m%d')
max_created_dt = tweets.index.max().strftime('%Y%m%d')

In [11]:
print(f"Earliest Created At Date: {min_created_dt}")
print(f"Latest Created At Date: {max_created_dt}")

Earliest Created At Date: 20210830
Latest Created At Date: 20210907


We write our dataframe out to a CSV file so that we can download it and use it in an analytics tool like Tableau. The file could just as easily be written to an S3 bucket; however, reading a file from S3 can be a bit more difficult than reading it locally.

In [12]:
tweets.to_csv(f'resultsets/tweets_{min_created_dt}_{max_created_dt}.csv',
        index=True,
        sep='~') # write to csv, keep index
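
The date-stamped file name and tilde-delimited output can also be sketched with the standard library alone. `dated_filename` is a hypothetical helper, the two timestamps are the first and last index values shown in this notebook, and the sample row text is made up.

```python
import csv
import datetime
import io


def dated_filename(timestamps, prefix="tweets"):
    """Build prefix_YYYYMMDD_YYYYMMDD.csv from the earliest and latest timestamps."""
    first = min(timestamps).strftime("%Y%m%d")
    last = max(timestamps).strftime("%Y%m%d")
    return f"{prefix}_{first}_{last}.csv"


stamps = [
    datetime.datetime(2021, 9, 7, 6, 32, 22),
    datetime.datetime(2021, 8, 30, 23, 20, 0),
]
print(dated_filename(stamps))

# A tilde delimiter avoids most clashes with commas inside tweet text; the csv
# module still quotes any field that happens to contain the delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="~")
writer.writerow(["created_at", "full_text", "sentiment"])
writer.writerow([stamps[0].isoformat(), "Hello, world ~ example", "neu"])
print(buffer.getvalue())
```

Whatever tool reads the file needs to be told about the `~` delimiter, for example `delimiter="~"` in `csv.reader` or `sep='~'` in `pandas.read_csv`.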

Before we get into any visualizations outside of this notebook, let's look at a few features of the data by the numbers:
* Breakdown by sentiment.
* Breakdown by author.
* Average compound score for negative and positive sentiment.

But first, let's start with a look at:
* The shape of the dataframe.
* The columns we have.
* Column info.
* A sample of the data (the first 5 rows).
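
For readers less familiar with grouped summaries, the sketch below computes the same kinds of breakdowns (counts per group and a mean per group) with plain Python. The rows are made up and only the column names mirror this notebook; the notebook itself uses pandas `groupby` for these steps.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up rows standing in for the dataframe.
rows = [
    {"user-screen_name": "author_a", "sentiment": "neu", "compound": 0.0},
    {"user-screen_name": "author_a", "sentiment": "neg", "compound": -0.5},
    {"user-screen_name": "author_b", "sentiment": "neg", "compound": -0.25},
    {"user-screen_name": "author_b", "sentiment": "pos", "compound": 0.5},
]

# Breakdown by sentiment and by author: count the rows in each group.
by_sentiment = Counter(row["sentiment"] for row in rows)
by_author = Counter(row["user-screen_name"] for row in rows)

# Average compound score per sentiment: collect scores per group, then average.
scores = defaultdict(list)
for row in rows:
    scores[row["sentiment"]].append(row["compound"])
mean_compound = {label: mean(values) for label, values in scores.items()}

print(dict(by_sentiment))
print(dict(by_author))
print(mean_compound)
```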

In [13]:
print(f"There were {tweets.shape[0]} tweets in the past {search.DAYS} days.")

There were 25 tweets in the past 7 days.


In [14]:
tweets.columns

Index(['id', 'id_str', 'full_text', 'truncated', 'display_text_range',
       'entities', 'metadata', 'source', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo',
       'coordinates', 'place', 'contributors', 'retweeted_status',
       'is_quote_status', 'retweet_count', 'favorite_count', 'favorited',
       'retweeted', 'lang', 'user-screen_name',
       'retweeted_status-user-screen_name', 'retweeted_status-full_text',
       'user-location', 'possibly_sensitive', 'extended_entities', 'scores',
       'compound', 'sentiment', 'centroid', 'long', 'lat', 'search_criteria'],
      dtype='object')

In [15]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 25 entries, 2021-09-07 06:32:22+00:00 to 2021-08-30 23:20:00+00:00
Data columns (total 38 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   id                                 25 non-null     int64  
 1   id_str                             25 non-null     object 
 2   full_text                          25 non-null     object 
 3   truncated                          25 non-null     bool   
 4   display_text_range                 25 non-null     object 
 5   entities                           25 non-null     object 
 6   metadata                           25 non-null     object 
 7   source                             25 non-null     object 
 8   in_reply_to_status_id              0 non-null      object 
 9   in_reply_to_status_id_str          0 non-null      object 
 10  in_reply_to_user_id                0 non-null      object 
 11  in_reply_t

In [16]:
tweets.head()

created_at,id,id_str,full_text,truncated,display_text_range,entities,metadata,source,in_reply_to_status_id,in_reply_to_status_id_str,...,user-location,possibly_sensitive,extended_entities,scores,compound,sentiment,centroid,long,lat,search_criteria
2021-09-07 06:32:22+00:00,1435128795544104960,1435128795544104960,RT @ComDannyda: How to #view/#check #files/#fo...,False,"[0, 140]","{'hashtags': [{'text': 'view', 'indices': [23,...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,,,,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neu,"(None, None)",,,@RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-04 18:39:46+00:00,1434224685479256072,1434224685479256072,RT @helpnetsecurity: Building for transactiona...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://help.twitter.com/en/using-twi...",,,...,Silicon Valley,False,,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neu,"(None, None)",,,@RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-04 18:34:43+00:00,1434223416673030149,1434223416673030149,RT @helpnetsecurity: Building for transactiona...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://abdirahiimyassin.weebly.com"" ...",,,...,Internet,False,,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neu,"(None, None)",,,@RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-04 16:13:51+00:00,1434187963873832963,1434187963873832963,RT @helpnetsecurity: Building for transactiona...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://www.powerapps.com"" rel=""nofoll...",,,...,,False,,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neu,"(None, None)",,,@RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-04 16:03:39+00:00,1434185397911924748,1434185397911924748,RT @helpnetsecurity: Building for transactiona...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://help.twitter.com/en/using-twi...",,,...,USA,False,,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neu,"(None, None)",,,@RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...


In [17]:
# Count of tweets by sentiment
tweets.groupby('sentiment')['sentiment'].count()

sentiment
neg     6
neu    18
pos     1
Name: sentiment, dtype: int64

In [18]:
# Count of tweets by author and show the top 10
tweets.groupby('user-screen_name')['user-screen_name'].count().nlargest(10)

user-screen_name
austin_castel      3
helpnetsecurity    3
tester109375070    3
KalemaChris        2
cybersec_feeds     2
opensource_orgs    2
schestowitz        2
CSSalesMan         1
ComDannyda         1
JavaGeekBot        1
Name: user-screen_name, dtype: int64

In [19]:
# Let's get an idea of what the sentiment is by author
tweets.groupby(['user-screen_name', 'sentiment'])['sentiment'].count()

user-screen_name  sentiment
CSSalesMan        neu          1
ComDannyda        neu          1
JavaGeekBot       neg          1
KalemaChris       neu          2
WhiteCh46522218   neu          1
austin_castel     neu          3
cybersec_feeds    neu          2
helpnetsecurity   neu          3
opensource_orgs   neg          2
schestowitz       neu          1
                  pos          1
sysadm_bot        neu          1
tester109375070   neu          3
wealldotcom       neg          1
worldofnubcraft   neg          1
writeffects       neg          1
Name: sentiment, dtype: int64

In [20]:
# Let's get the mean / average of the compound score by sentiment.
tweets.groupby('sentiment')['compound'].mean()

sentiment
neg   -0.413183
neu    0.000000
pos    0.616600
Name: compound, dtype: float64

In [21]:
# How many tweets included location information
print(f"{tweets.shape[0] - tweets['long'].isnull().sum()} tweets contained latitude and longitude.")

0 tweets contained latitude and longitude.


In [22]:
# How many tweets did NOT include location information
# Just an inverse of the above to double check things.
print(f"{tweets['long'].isnull().sum()} tweets did not contain latitude and longitude.")

25 tweets did not contain latitude and longitude.


In [25]:
tweets.groupby('user-location')['user-location'].count()

user-location
                             10
@wealldotcom                  1
Global                        3
Internet                      2
Kampala, Uganda               2
Mandaluyong City, NCR, PH     1
North Pole                    2
Silicon Valley                1
USA                           3
Name: user-location, dtype: int64

In [24]:
tweets.search_criteria

created_at
2021-09-07 06:32:22+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-04 18:39:46+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-04 18:34:43+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-04 16:13:51+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-04 16:03:39+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-04 16:01:28+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-03 04:58:58+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-03 04:58:53+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-02 05:13:33+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-02 04:53:00+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-02 04:52:59+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-02 04:52:54+00:00    @RedHat OR #RedHat OR #LifeatRedHat OR @RedHat...
2021-09-01 20:22:16+00:00    @RedHat OR #

In [26]:
%%html
<div class='tableauPlaceholder' id='viz1631047197051' style='position: relative'><noscript><a href='#'><img alt='Analyze Twitter Data ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;An&#47;AnalyzeTwitterData&#47;AnalyzeTwitterData&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='AnalyzeTwitterData&#47;AnalyzeTwitterData' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;An&#47;AnalyzeTwitterData&#47;AnalyzeTwitterData&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1631047197051');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1000px';vizElement.style.height='927px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1000px';vizElement.style.height='927px';} else { vizElement.style.width='100%';vizElement.style.height='1777px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>