# Analyze Tweets
## With tweepy and pandas
### By: Eric L. Sammons <elsammons@gmail.com>
---
<table style="width:75%">
    <caption><strong>Purpose</strong></caption>
    <tr>
        <td>
            This project introduces the reader to social media sentiment analysis. We'll use NLTK's VADER sentiment analyzer and its lexicon.
        </td>
        <td>
            <img src="https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" width="125" height="75">
            <img src="https://static1.squarespace.com/static/538cea80e4b00f1fad490c1b/54668a77e4b00fb778d22a34/54668d8ae4b00fb778d285a2/1416007414694/python_nltk.png" width="125" height="75">
            <img src="https://twilio-cms-prod.s3.amazonaws.com/images/twitter-python-logos.width-808.jpg" width="125" height="75">
        </td>
    </tr>
</table>
    
            
<table style="width:75%">
    <caption><strong>Sentiment Analysis</strong></caption>
    <tr>
        <td>
            Defined as the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
        </td>
    </tr>
</table>

<table style="width:75%">
    <caption>
        <strong>
            Before You Begin
        </strong>
    </caption>
    <tr>
        <td>
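            You will need API credentials from a Twitter developer account. The code in this notebook authenticates with a consumer key and secret plus an access token and secret; the linked page covers bearer tokens.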
        </td>
        <td>
            <a href="https://developer.twitter.com/en/docs/authentication/oauth-2-0/bearer-tokens">
                Using and Generating Bearer Tokens
            </a>
        </td>
    </tr>
</table>

<table style="width:75%">
    <caption>
        <strong>
            Getting Started
        </strong>
    </caption>
    <tr>
        <td>
            To use this notebook you will need to set up a <strong>.env</strong> file in the root folder of this project. The minimum values required are shown to the right.
        </td>
        <td>
            <pre>CONSUMER_KEY="xxxxxx"
CONSUMER_SECRET="xxxxxxxxxxx"
ACCESS_KEY="xxxxxxxxxxxx"
ACCESS_SECRET="xxxxxxxxxxxx"</pre>
        </td>
    </tr>
 </table>
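
Before authenticating, it can help to confirm that all four values were actually loaded. The following is a minimal sketch, not part of the original notebook: the helper name `missing_credentials` is hypothetical, and a plain dict stands in for `os.environ` so the example runs without a real `.env` file.

```python
import os

# Key names taken from the "Getting Started" table above.
REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_KEY", "ACCESS_SECRET")


def missing_credentials(env=None):
    """Return the required credential names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# A plain dict stands in for os.environ here.
example_env = {"CONSUMER_KEY": "xxxxxx", "CONSUMER_SECRET": "xxxxxxxxxxx"}
print(missing_credentials(example_env))  # ['ACCESS_KEY', 'ACCESS_SECRET']
```

In the notebook itself this check would run after `load_dotenv(find_dotenv())`, with no argument, so that it inspects the real environment.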
    
    

In [2]:
# Install requirements.
!pip install -Uqr requirements.txt

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.4.1 requires numpy~=1.19.2, but you have numpy 1.21.2 which is incompatible.
tensorflow 2.4.1 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
pythonwhat 2.23.1 requires dill~=0.2.7.1, but you have dill 0.3.3 which is incompatible.
pythonwhat 2.23.1 requires jinja2~=2.10, but you have jinja2 3.0.1 which is incompatible.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [29]:
# We'll need these
from dotenv import find_dotenv, load_dotenv
import os
import tweepy
import datetime
import json
import pandas as pd
from nltk import download
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# download vader_lexicon
download('vader_lexicon')

# Import our helper functions and configs
import search
from lib.helper_functions import flatten_tweets
from lib.helper_functions import calculateCentroid

from IPython.display import HTML

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/repl/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [30]:
load_dotenv(find_dotenv()) # load our .env file.

# Authenticate with OAuth 1.0a: consumer key/secret plus access token/secret.
auth = tweepy.OAuthHandler(os.environ['CONSUMER_KEY'], os.environ['CONSUMER_SECRET'])
auth.set_access_token(os.environ['ACCESS_KEY'], os.environ['ACCESS_SECRET'])

# Set API options: sleep and retry automatically when the rate limit is hit.
api = tweepy.API(auth, wait_on_rate_limit=True)

In [31]:
print(f"Searching {search.days} days of tweets.")
today = datetime.date.today() # starting from today
p_days = today - datetime.timedelta(days=search.days) # prior days to look back

Searching 7 days of tweets.
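
The look-back arithmetic above can be wrapped in a small function that also guards the window size. This is a sketch under assumptions: the function name `search_window` and the constant `MAX_LOOKBACK_DAYS` are hypothetical, and the 7-day cap reflects the documented reach of Twitter's standard search index at the time of writing, not anything enforced by this notebook.

```python
import datetime

# Assumption: the standard search index only reaches back about one week.
MAX_LOOKBACK_DAYS = 7


def search_window(days, today=None):
    """Return (since, until) as ISO date strings for a look-back of `days` days."""
    if not 1 <= days <= MAX_LOOKBACK_DAYS:
        raise ValueError(f"days must be between 1 and {MAX_LOOKBACK_DAYS}")
    today = today or datetime.date.today()
    since = today - datetime.timedelta(days=days)
    return str(since), str(today)


# A fixed date keeps the example reproducible.
print(search_window(7, today=datetime.date(2021, 9, 8)))
# ('2021-09-01', '2021-09-08')
```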


In [32]:
# Build a cursor that pages through the search results.
# Note: Tweepy 4.0 renamed api.search to api.search_tweets.
tweets_list = tweepy.Cursor(api.search,
                            q=search.search,
                            since=str(p_days),
                            until=str(today),
                            tweet_mode='extended',
                            ).items()

In [33]:
# Let's grab some tweets, keeping each one's raw JSON as a string.
tweets = []

for tweet in tweets_list:
    tweets.append(json.dumps(tweet._json))

With our list of <strong>tweets</strong> we will now create our dataframe.

Once the dataframe is created we'll:
* Use our <strong>flatten_tweets</strong> helper to make specific nested fields more accessible.
* Convert <strong>created_at</strong> to datetime.
* Set the dataframe's index to the <strong>created_at</strong> field.

In [34]:
tweets = pd.DataFrame(flatten_tweets(tweets))
tweets['created_at'] = pd.to_datetime(tweets['created_at'])
tweets = tweets.set_index('created_at')
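
The `flatten_tweets` helper lives in `lib/helper_functions.py` and is not shown in this notebook. As a rough illustration of the idea only, here is a sketch of a flattening step for a single tweet. The function name `flatten_tweet_sketch`, the output key `user_screen_name`, and the sample tweet are all made up for this example; the real helper may select different fields.

```python
import json


def flatten_tweet_sketch(raw_json):
    """Parse one serialized tweet and lift a few nested fields to the top level.

    Illustrative only: the project's real flatten_tweets may behave differently.
    """
    tweet = json.loads(raw_json)
    user = tweet.get("user") or {}
    return {
        "created_at": tweet.get("created_at"),
        "full_text": tweet.get("full_text"),
        "user_screen_name": user.get("screen_name"),  # hypothetical column name
        "place": tweet.get("place"),
    }


# Made-up tweet, serialized the same way the notebook stores tweets.
raw = json.dumps({
    "created_at": "Wed Sep 08 12:00:00 +0000 2021",
    "full_text": "hello world",
    "user": {"screen_name": "example_user"},
    "place": None,
})
row = flatten_tweet_sketch(raw)
print(row["user_screen_name"])  # example_user
```

A list of such flat dictionaries can be passed straight to `pd.DataFrame`, which is how the cell above consumes the output of `flatten_tweets`.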

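With our tweets captured we are now ready to perform sentiment analysis. The table below shows how we translate VADER's compound score, which ranges from -1 to 1, into a sentiment label.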

---
<table style="width:75%">
    <tr>
        <th>Value</th>
        <th>Translation</th>
    </tr>
    <tr>
        <td>0</td>
        <td>Neutral</td>
    </tr>
    <tr>
        <td>&gt;0</td>
        <td>Positive</td>
    </tr>
    <tr>
        <td>&lt;0</td>
        <td>Negative</td>
    </tr>
</table>

In [35]:
sid = SentimentIntensityAnalyzer()
tweets['scores'] = tweets['full_text'].apply(sid.polarity_scores) # score each tweet's text

# Isolate the compound value from scores and create a new column.
tweets['compound']  = tweets['scores'].apply(lambda score_dict: score_dict['compound'])
# Create a new sentiment column with pos, neg, or neu based on compound.
tweets['sentiment'] = tweets['compound'].apply(lambda c: 'pos' if c > 0 else ('neu' if c == 0 else 'neg'))
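
The labeling rule can also be written as a named function, which makes the cut-off explicit and easy to adjust. This is a sketch rather than the notebook's code: `label_compound` and its `threshold` parameter are hypothetical. With the default of `0.0` it matches the lambda above; a small positive threshold adds a neutral band around zero, similar in spirit to the ±0.05 cut-offs suggested in the VADER documentation.

```python
def label_compound(compound, threshold=0.0):
    """Map a VADER compound score to 'pos', 'neg', or 'neu'.

    threshold=0.0 reproduces the notebook's rule; a larger value widens
    the neutral band to [-threshold, threshold].
    """
    if compound > threshold:
        return "pos"
    if compound < -threshold:
        return "neg"
    return "neu"


scores = (0.6, 0.0, -0.3, 0.02)
print([label_compound(c) for c in scores])
# ['pos', 'neu', 'neg', 'pos']
print([label_compound(c, threshold=0.05) for c in scores])
# ['pos', 'neu', 'neg', 'neu']
```

It would be applied the same way as the lambda: `tweets['compound'].apply(label_compound)`.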

A tweet can carry a bounding box: an approximate rectangular area that the user tweeted from. The bounding box is populated only when the user has made location sharing available to the application.

In the next step we will take the bounding box and calculate its centroid. This gives us a single longitude and latitude pair that is easier to use with mapping tools and libraries.

In [36]:
tweets['centroid'] = tweets['place'].apply(calculateCentroid) # calculate centroid
tweets[['long', 'lat']] = pd.DataFrame(tweets['centroid'].tolist(), index=tweets.index) # split centroid into long, lat
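
The `calculateCentroid` helper is defined in `lib/helper_functions.py` and is not shown here. The sketch below illustrates one way such a centroid could be computed. It assumes the `place` value is a dict shaped like Twitter's place object, where `bounding_box.coordinates` holds one ring of `[longitude, latitude]` corner points. The function name `bounding_box_centroid` and the sample coordinates are made up for this example.

```python
def bounding_box_centroid(place):
    """Return (long, lat) averaged over the box corners, or (None, None).

    Illustrative only: assumes place["bounding_box"]["coordinates"][0] is a
    list of [longitude, latitude] pairs.
    """
    try:
        corners = place["bounding_box"]["coordinates"][0]
    except (TypeError, KeyError, IndexError):
        # Tweet has no place, or the place has no usable bounding box.
        return (None, None)
    if not corners:
        return (None, None)
    longs = [point[0] for point in corners]
    lats = [point[1] for point in corners]
    return (sum(longs) / len(longs), sum(lats) / len(lats))


# Made-up rectangle: longitude -80 to -78, latitude 35 to 36.
sample_place = {
    "bounding_box": {
        "type": "Polygon",
        "coordinates": [[[-80.0, 35.0], [-78.0, 35.0], [-78.0, 36.0], [-80.0, 36.0]]],
    }
}
print(bounding_box_centroid(sample_place))  # (-79.0, 35.5)
print(bounding_box_centroid(None))          # (None, None)
```

Returning a two-element tuple in every case keeps the later split into `long` and `lat` columns straightforward, since each row yields exactly two values.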

Next we capture the earliest and latest tweet dates so that we can use them in our file name. This removes a manual step and lets the consumer of the file see at a glance which date range it covers.

In [37]:
min_created_dt = tweets.index.min().strftime('%Y%m%d')
max_created_dt = tweets.index.max().strftime('%Y%m%d')

We write our dataframe out to a CSV so that we can download it and use it in an analytics tool like Tableau. The file could just as easily go to an S3 bucket; however, reading a file from S3 can be a bit more difficult than reading it locally.

In [38]:
tweets.to_csv(f'resultsets/tweets_{min_created_dt}_{max_created_dt}.csv', index=True) # write to csv, keep index
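
The file-naming step can be sketched as a small function using only the standard library. The name `result_path` and the sample timestamps are hypothetical; the folder and file name pattern follow the cell above.

```python
import datetime


def result_path(min_dt, max_dt, folder="resultsets"):
    """Build the CSV path from the earliest and latest tweet timestamps."""
    return f"{folder}/tweets_{min_dt:%Y%m%d}_{max_dt:%Y%m%d}.csv"


# Made-up timestamps standing in for tweets.index.min() and tweets.index.max().
first = datetime.datetime(2021, 9, 1, 8, 30)
last = datetime.datetime(2021, 9, 8, 17, 0)
print(result_path(first, last))  # resultsets/tweets_20210901_20210908.csv
```

Note that `to_csv` does not create missing directories, so the `resultsets` folder must exist before writing; `os.makedirs('resultsets', exist_ok=True)` is one way to ensure that.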

In [None]:
# Requires: import matplotlib.pyplot as plt
# sentiment = tweets['compound']  # compound scores computed above
# sentiment_resampled = sentiment.resample('1D').mean()
# plt.plot(
#     sentiment_resampled.index.day,
#     sentiment_resampled, color = 'red'
# )


# plt.xlabel('Day')
# plt.ylabel('Sentiment')
# plt.title('Sentiment of Red Hat Tweets')

# plt.show()

In [None]:
# comp_score_resamp = tweets[tweets['sentiment'] == 'pos']['sentiment'].resample('1D').count()
# plt.plot(
#     comp_score_resamp.index.day,
#     comp_score_resamp,
#     color = 'red'
# )

# plt.show()