# Analyze Tweets
## With tweepy and pandas
### By: Eric L. Sammons <elsammons@gmail.com>
---
### Purpose
<table style="width:75%">
    <tr>
        <td>
            This project introduces the reader to social media sentiment analysis. We'll use the VADER sentiment analyzer and its lexicon from the nltk library.
        </td>
        <td>
            <img src="https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" width="125" height="75">
            <img src="https://static1.squarespace.com/static/538cea80e4b00f1fad490c1b/54668a77e4b00fb778d22a34/54668d8ae4b00fb778d285a2/1416007414694/python_nltk.png" width="125" height="75">
            <img src="https://twilio-cms-prod.s3.amazonaws.com/images/twitter-python-logos.width-808.jpg" width="125" height="75">
        </td>
    </tr>
</table>

### Sentiment Analysis
<table style="width:75%">
    <tr>
        <td>
            Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
        </td>
    </tr>
</table>

### Before You Begin
<table style="width:75%">
    <tr>
        <td>
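            You will need API credentials from a Twitter developer account. The linked page explains bearer tokens; the code in this notebook authenticates with the consumer and access keys listed under Getting Started.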
        </td>
        <td>
            <a href="https://developer.twitter.com/en/docs/authentication/oauth-2-0/bearer-tokens">
                Using and Generating Bearer Tokens
            </a>
        </td>
    </tr>
</table>

### Getting Started
<table style="width:75%">
    <tr>
        <td>
            To use this notebook you will need to set up a <strong>.env</strong> file in the root folder of this project. The minimum values required are shown to the right.
        </td>
        <td>
            CONSUMER_KEY="xxxxxx"<br>
            CONSUMER_SECRET="xxxxxxxxxxx"<br>
            ACCESS_KEY="xxxxxxxxxxxx"<br>
            ACCESS_SECRET="xxxxxxxxxxxx"
        </td>
    </tr>
 </table>
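
Before calling the API it can help to confirm that all four values are actually present. The sketch below shows one way to check; the helper name `missing_env_keys` is an assumption made for illustration and is not part of this project.

```python
import os

# The four keys this notebook reads from the environment.
REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_KEY", "ACCESS_SECRET")


def missing_env_keys(env=None):
    """Return the required keys that are absent or empty in `env` (hypothetical helper)."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# A plain dict stands in for os.environ here.
example_env = {"CONSUMER_KEY": "xxxxxx", "CONSUMER_SECRET": ""}
print(missing_env_keys(example_env))
# ['CONSUMER_SECRET', 'ACCESS_KEY', 'ACCESS_SECRET']
```

Calling `missing_env_keys()` with no argument after `load_dotenv` checks the real environment; an empty list means we are ready to go.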

In [1]:
# Install requirements.
!pip install -Uqr requirements.txt

In [2]:
# We'll need these
from dotenv import find_dotenv, load_dotenv
import os
import tweepy
import datetime
import json
import pandas as pd
from nltk import download
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# download vader_lexicon
download('vader_lexicon')

# Import our helper functions and configs
import search
from lib.helper_functions import flatten_tweets
from lib.helper_functions import calculateCentroid

from IPython.core.display import HTML

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/esammons/nltk_data...


In [3]:
load_dotenv(find_dotenv()) # load our .env file.

# Set up OAuth 1.0a authentication with our consumer and access keys.
auth = tweepy.OAuthHandler(os.environ['CONSUMER_KEY'], os.environ['CONSUMER_SECRET'])
auth.set_access_token(os.environ['ACCESS_KEY'], os.environ['ACCESS_SECRET'])

# Set API options: wait automatically when the rate limit is reached.
api = tweepy.API(auth, wait_on_rate_limit=True)

In [4]:
print(f"Searching {search.DAYS} days of tweets.")
today = datetime.date.today() # the window ends today
p_days = today - datetime.timedelta(days=search.DAYS) # first day of the lookback window

Searching 7 days of tweets.
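
The lookback window is plain date arithmetic. As a minimal sketch, the function below (the name `search_window` is hypothetical) returns the two date strings that are later passed to the search. A fixed date is used so the result is reproducible. Twitter's standard search only covers roughly the last 7 days, so a larger window is unlikely to return older tweets.

```python
import datetime


def search_window(today, days):
    """Return (since, until) date strings for a lookback of `days` days."""
    start = today - datetime.timedelta(days=days)
    # str() on a date gives the ISO form YYYY-MM-DD.
    return str(start), str(today)


since, until = search_window(datetime.date(2021, 8, 27), 7)
print(since, until)  # 2021-08-20 2021-08-27
```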


In [5]:
# Build a cursor that pages through the search results.
# Note: tweepy v4 renamed api.search to api.search_tweets.
tweets_list = tweepy.Cursor(api.search,
                            q=search.SEARCH,
                            since=str(p_days),
                            until=str(today),
                            tweet_mode='extended',
                            ).items()

In [7]:
# Let's grab some tweets.
tweets = []

for tweet in tweets_list:
    tweets.append(json.dumps(tweet._json))

With our list of <strong>tweets</strong> we will now create our dataframe.

Once the dataframe is created we'll:
* Use our <strong>flatten_tweets</strong> helper to make specific nested fields more accessible.
* Convert <strong>created_at</strong> to datetime.
* Set the dataframe's index to the <strong>created_at</strong> field.

In [8]:
tweets = pd.DataFrame(flatten_tweets(tweets))
tweets['created_at'] = pd.to_datetime(tweets['created_at'])
tweets = tweets.set_index('created_at')
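
The source of `flatten_tweets` is not shown in this notebook. As a rough sketch of the idea, the hypothetical function below copies a few nested fields to top-level keys, using the column names that appear later in the dataframe (`user-screen_name`, `retweeted_status-user-screen_name`, `retweeted_status-full_text`). The sample tweet is made up for illustration, and the real helper may work differently.

```python
import json


def flatten_tweet_sketch(tweet_json):
    """Copy selected nested fields of one tweet to top-level keys (hypothetical)."""
    tweet = json.loads(tweet_json)

    user = tweet.get("user") or {}
    tweet["user-screen_name"] = user.get("screen_name")

    # Retweets carry the original tweet under 'retweeted_status'.
    retweet = tweet.get("retweeted_status")
    if retweet:
        retweet_user = retweet.get("user") or {}
        tweet["retweeted_status-user-screen_name"] = retweet_user.get("screen_name")
        tweet["retweeted_status-full_text"] = retweet.get("full_text")
    return tweet


# Made-up tweet, serialized the same way as the items in our tweets list.
raw = json.dumps({
    "created_at": "Fri Aug 27 18:34:28 +0000 2021",
    "full_text": "RT @example: hello",
    "user": {"screen_name": "someone"},
    "retweeted_status": {"full_text": "hello", "user": {"screen_name": "example"}},
})
flat = flatten_tweet_sketch(raw)
print(flat["user-screen_name"], flat["retweeted_status-user-screen_name"])
```

A list of such dictionaries can be passed straight to `pd.DataFrame`, which is what makes the flattened fields available as ordinary columns.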

With our tweets captured we are now ready to perform sentiment analysis.

---
<table style="width:75%">
    <tr>
        <th>Compound score</th>
        <th>Translation</th>
    </tr>
    <tr>
        <td>0</td>
        <td>Neutral</td>
    </tr>
    <tr>
        <td>&gt; 0</td>
        <td>Positive</td>
    </tr>
    <tr>
        <td>&lt; 0</td>
        <td>Negative</td>
    </tr>
</table>

In [9]:
sid = SentimentIntensityAnalyzer()
tweets['scores'] = tweets['full_text'].apply(sid.polarity_scores) # combine with dataframe

# Isolate the compound value from scores and create a new column.
tweets['compound'] = tweets['scores'].apply(lambda score_dict: score_dict['compound'])
# Create a new sentiment column with pos, neg, or neu based on compound.
tweets['sentiment'] = tweets['compound'].apply(lambda c: 'pos' if c > 0 else ('neu' if c == 0 else 'neg'))
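
The labeling rule can also be written as a small named function, which is easier to test than a nested lambda. This is a sketch only: `label_sentiment` and its `threshold` parameter are assumptions for illustration. With the default threshold of 0 it follows the rule used in this notebook. The VADER project's documentation suggests a neutral band of about ±0.05, which can be tried by passing `threshold=0.05`.

```python
def label_sentiment(compound, threshold=0.0):
    """Map a VADER compound score to 'pos', 'neu', or 'neg'.

    threshold=0.0 reproduces the notebook's rule (only exactly 0 is neutral).
    A wider neutral band is an optional variation, not what the notebook uses.
    """
    if compound > threshold:
        return "pos"
    if compound < -threshold:
        return "neg"
    return "neu"


print(label_sentiment(0.1779))                # pos
print(label_sentiment(0.0))                   # neu
print(label_sentiment(-0.4574))               # neg
print(label_sentiment(0.03, threshold=0.05))  # neu
```

It can be applied the same way as the lambda: `tweets['compound'].apply(label_sentiment)`.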

Tweets can carry a bounding box, an approximate rectangular area around where the user tweeted from. The bounding box is only populated when the user has enabled location sharing for the application.

In the next step we take the bounding box and calculate its centroid. This gives us a single longitude and latitude pair that is easier to use with mapping tools and libraries.

In [10]:
tweets['centroid'] = tweets['place'].apply(calculateCentroid) # calculate centroid
tweets[['long', 'lat']] = pd.DataFrame(tweets['centroid'].tolist(), index=tweets.index) # split centroid into long, lat
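
The source of `calculateCentroid` is not shown here, so the following is only a sketch of how such a helper could work. It assumes the Twitter `place` object stores its box under `bounding_box.coordinates` as a list of rings of `[longitude, latitude]` corners, and it returns `(None, None)` when no place is available, matching the values seen in the data. The name `centroid_sketch` and the sample coordinates are made up for illustration.

```python
def centroid_sketch(place):
    """Return (long, lat) at the centre of a place's bounding box, or (None, None)."""
    if not isinstance(place, dict):
        return (None, None)  # missing place, e.g. None or NaN

    box = place.get("bounding_box") or {}
    rings = box.get("coordinates") or []
    if not rings or not rings[0]:
        return (None, None)

    corners = rings[0]  # each corner is [longitude, latitude]
    longs = [point[0] for point in corners]
    lats = [point[1] for point in corners]
    # Averaging the corners gives the centre of a rectangular box.
    return (sum(longs) / len(longs), sum(lats) / len(lats))


example_place = {
    "bounding_box": {
        "type": "Polygon",
        "coordinates": [[[-80.0, 35.0], [-78.0, 35.0], [-78.0, 36.0], [-80.0, 36.0]]],
    }
}
print(centroid_sketch(example_place))  # (-79.0, 35.5)
print(centroid_sketch(None))           # (None, None)
```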

We capture the minimum and maximum tweet dates so that we can use them in our file name. This removes manual effort and lets the consumer of the file see the date range at a glance.

In [11]:
min_created_dt = tweets.index.min().strftime('%Y%m%d')
max_created_dt = tweets.index.max().strftime('%Y%m%d')

We write our dataframe out to a CSV file so that we can download it and use it in an analytics tool like Tableau. The file could just as easily go to an S3 bucket; however, reading a file from S3 can be a bit more involved than opening it locally.

In [12]:
tweets.to_csv(f'resultsets/tweets_{min_created_dt}_{max_created_dt}.csv', index=True) # write to csv, keep index
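
As a small sketch of the naming scheme, the hypothetical helper below builds the same path from two timestamps with `pathlib`. The example timestamps are the first and last index values reported for this dataset.

```python
import datetime
from pathlib import Path


def resultset_path(first, last, folder="resultsets"):
    """Build the CSV path from the earliest and latest timestamps (hypothetical helper)."""
    stamp = "%Y%m%d"
    return Path(folder) / f"tweets_{first.strftime(stamp)}_{last.strftime(stamp)}.csv"


path = resultset_path(datetime.datetime(2021, 8, 27, 0, 0, 3),
                      datetime.datetime(2021, 8, 27, 18, 34, 28))
print(path.name)  # tweets_20210827_20210827.csv
```

Because `to_csv` does not create missing folders, calling `path.parent.mkdir(parents=True, exist_ok=True)` first avoids an error when `resultsets` does not exist yet.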

Before we get into any visualizations outside of this notebook, let's take a look at a few features of the data by the numbers.
* Breakdown by sentiment.
* Breakdown by author.
* Average compound score for negative and positive sentiment.

But first, let's start with a look at:
* What columns we have.
* Column info.
* A sample of the data, the first 5 rows.

In [32]:
tweets.columns

Index(['id', 'id_str', 'full_text', 'truncated', 'display_text_range',
       'entities', 'metadata', 'source', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo',
       'coordinates', 'place', 'contributors', 'retweeted_status',
       'is_quote_status', 'retweet_count', 'favorite_count', 'favorited',
       'retweeted', 'lang', 'user-screen_name',
       'retweeted_status-user-screen_name', 'retweeted_status-full_text',
       'extended_entities', 'possibly_sensitive', 'quoted_status_id',
       'quoted_status_id_str', 'quoted_status', 'scores', 'compound',
       'sentiment', 'centroid', 'long', 'lat'],
      dtype='object')

In [22]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 336 entries, 2021-08-27 18:34:28+00:00 to 2021-08-27 00:00:03+00:00
Data columns (total 39 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   id                                 336 non-null    int64  
 1   id_str                             336 non-null    object 
 2   full_text                          336 non-null    object 
 3   truncated                          336 non-null    bool   
 4   display_text_range                 336 non-null    object 
 5   entities                           336 non-null    object 
 6   metadata                           336 non-null    object 
 7   source                             336 non-null    object 
 8   in_reply_to_status_id              49 non-null     float64
 9   in_reply_to_status_id_str          49 non-null     object 
 10  in_reply_to_user_id                54 non-null     float64
 11  in_reply_

In [33]:
tweets.head()

created_at,id,id_str,full_text,truncated,display_text_range,entities,metadata,source,in_reply_to_status_id,in_reply_to_status_id_str,...,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,scores,compound,sentiment,centroid,long,lat
2021-08-27 18:34:28+00:00,1431324248107151361,1431324248107151361,RT @RedHat: The results are in. #RedHat @OpenS...,False,"[0, 140]","{'hashtags': [{'text': 'RedHat', 'indices': [3...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,,,,,"{'neg': 0.0, 'neu': 0.922, 'pos': 0.078, 'comp...",0.1779,pos,"(None, None)",,
2021-08-27 18:30:02+00:00,1431323135501078532,1431323135501078532,The results are in. #RedHat @OpenShift managed...,False,"[0, 162]","{'hashtags': [{'text': 'RedHat', 'indices': [2...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://sproutsocial.com"" rel=""nofoll...",,,...,False,,,,"{'neg': 0.0, 'neu': 0.925, 'pos': 0.075, 'comp...",0.1779,pos,"(None, None)",,
2021-08-27 18:25:29+00:00,1431321989256450053,1431321989256450053,RT @Super_Launcher: Join the #birdies 🐦and our...,False,"[0, 139]","{'hashtags': [{'text': 'birdies', 'indices': [...","{'iso_language_code': 'zh', 'result_type': 're...","<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,False,1.430747e+18,1.4307472755595387e+18,,"{'neg': 0.0, 'neu': 0.612, 'pos': 0.388, 'comp...",0.8858,pos,"(None, None)",,
2021-08-27 18:23:04+00:00,1431321382491672584,1431321382491672584,RT @lailah_grant: How my good Networking &amp;...,False,"[0, 144]","{'hashtags': [{'text': 'programmer', 'indices'...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://www.fabricio-dev.com.br/"" rel...",,,...,,1.364264e+18,1.3642640964226335e+18,,"{'neg': 0.0, 'neu': 0.818, 'pos': 0.182, 'comp...",0.5826,pos,"(None, None)",,
2021-08-27 18:23:04+00:00,1431321382353256450,1431321382353256450,"RT @lailah_grant: Day x of Learn on job, was d...",False,"[0, 144]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://www.fabricio-dev.com.br/"" rel...",,,...,,,,,"{'neg': 0.073, 'neu': 0.846, 'pos': 0.081, 'co...",0.0516,pos,"(None, None)",,


In [17]:
# Count of tweets by sentiment
tweets.groupby('sentiment')['sentiment'].count()

sentiment
neg     13
neu    144
pos    179
Name: sentiment, dtype: int64
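
Raw counts are easier to compare as shares of the total. The sketch below does the arithmetic on the counts printed above using plain Python. In pandas, `tweets['sentiment'].value_counts(normalize=True)` should give the same proportions directly.

```python
# Counts copied from the output above.
counts = {"neg": 13, "neu": 144, "pos": 179}
total = sum(counts.values())

# Percentage of all tweets per sentiment, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)   # 336
print(shares)  # {'neg': 3.9, 'neu': 42.9, 'pos': 53.3}
```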

In [45]:
# Count of tweets by author and show the top 10
tweets.groupby('user-screen_name')['user-screen_name'].count().nlargest(10)

user-screen_name
schestowitz      13
jwsmithinfo       8
uiraribeiro       8
aespejo9          5
danieloh30        5
sarimzia          5
DivineOps         4
RedHatJobs        4
1Lpic             3
Discovertech3     3
Name: user-screen_name, dtype: int64

In [44]:
# Let's get an idea of what the sentiment is by author
tweets.groupby(['user-screen_name', 'sentiment'])['sentiment'].count()

user-screen_name  sentiment
100DaysOf2020     neu          1
1Lpic             neu          2
                  pos          1
2017_CB           neu          1
APEX_60MPH        neu          1
                              ..
waqasdaha         pos          1
wayneariola       pos          1
westurner         neg          1
                  neu          1
zerobanana        neu          1
Name: sentiment, Length: 270, dtype: int64

In [47]:
# Let's get the mean / average of the compound score by sentiment.
tweets.groupby('sentiment')['compound'].mean()

sentiment
neg   -0.457415
neu    0.000000
pos    0.509727
Name: compound, dtype: float64