# Lab 3 - Twitter API

In this lab, you will learn how to retrieve tweet data from Twitter using an open-source library called [Tweepy](https://docs.tweepy.org/en/latest/). Tweepy provides a convenient way to access the Twitter API from Python.

Also, check the official [Twitter API](https://developer.twitter.com/en/docs/twitter-api/getting-started/guide).

This lab is written by Michelle KAN (michellekan@smu.edu.sg) and Jisun AN (jisunan@smu.edu.sg).

Let's first install the tweepy library:<br>

In [None]:
## This is OPTIONAL if you are running the current notebook using Google Colab
!pip install tweepy

## 1) Authentication

The following code imports the tweepy library and its authentication handler. Tweepy uses the [tweepy.OAuthHandler](https://docs.tweepy.org/en/v3.5.0/auth_tutorial.html) class for authentication.

In [None]:
import tweepy
from tweepy import OAuthHandler

Before using the Twitter API, you need a Twitter account and Twitter API authentication credentials.<br>Set your authentication credentials below.<br>

In [None]:
# Consumer/Access key/secret/token obtained from Twitter
# You should have created a Twitter app and gotten these keys.
# Do NOT share your key/secret/token with other students.
consumer_key    = ''
consumer_secret = ''
access_token    = ''
access_secret   = ''
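
Because the keys must not be shared, one option is to keep them out of the notebook and read them from environment variables. The sketch below is only an illustration: the environment variable names and the `load_credentials` helper are hypothetical, and the example runs against a fake environment rather than real keys.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}

def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}

# Example with a fake environment (no real keys involved)
fake_env = {name: "dummy-" + key for key, name in ENV_NAMES.items()}
creds = load_credentials(fake_env)
print(sorted(creds))
```

With real variables set, `creds["consumer_key"]` and the other three values could be assigned to the variables used in this notebook.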

The following code creates an authorization object from the credentials above and uses it to set up access to Twitter's API.

In [None]:
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Create the API object used for all calls to Twitter's REST API.
api = tweepy.API(auth)
#api = tweepy.API(auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# The following code verifies whether the authentication is successful.
# If all goes well, you should see a message saying Authentication OK.
# Otherwise, check your Consumer/Access key/secret/token.
try:
    api.verify_credentials()
    print("Authentication OK")
except Exception as e:
    print("Error during authentication:", e)

## 2) Types of Twitter API & Tweepy Cursor

### 2-1) Twitter REST API

The REST API is used to pull data from Twitter.

We can retrieve tweets that match a query, or tweets posted by a specific user, using `tweepy.Cursor`.

`tweepy.Cursor` handles pagination: when a request returns many tweets spread over several pages, it lets you iterate over all of them in a single loop.
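
To picture what pagination handling means, here is a minimal sketch that does not use Tweepy at all. The `fetch_page` function is a made-up stand-in for one API call that returns a page of results, and `iterate_items` is a simplified illustration of the idea behind `.items(n)`, not Tweepy's actual implementation.

```python
def fetch_page(page_number, page_size=3, total=8):
    """Stand-in for one API call: returns one page of fake tweet IDs."""
    start = page_number * page_size
    return list(range(start, min(start + page_size, total)))

def iterate_items(fetch, limit):
    """Yield up to `limit` items, requesting a new page only when needed."""
    page_number = 0
    yielded = 0
    while yielded < limit:
        page = fetch(page_number)
        if not page:  # no more results
            return
        for item in page:
            if yielded >= limit:
                return
            yield item
            yielded += 1
        page_number += 1

first_five = list(iterate_items(fetch_page, 5))
print(first_five)  # [0, 1, 2, 3, 4]
```

The caller sees one flat sequence of items and never has to deal with page boundaries.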


#### a) Search tweets

The code below returns five tweets containing the search words:

```
search_words = 'covid'
max_tweets = 5
tweets = tweepy.Cursor(api.search, q=search_words, tweet_mode='extended').items(max_tweets)
```


#### b) Users tweets

The code below returns five tweets posted by BillGates:

```
username = 'BillGates'
max_tweets = 5
tweets = tweepy.Cursor(api.user_timeline, id=username, tweet_mode='extended').items(max_tweets)
```


### 2-2) Streaming API tweets
The Twitter Streaming API is used to download Twitter messages in real time. It is useful for obtaining a high volume of tweets, or for creating a live feed using a site stream or user stream. See the [Twitter Streaming API Documentation](https://developer.twitter.com/en/docs/tweets/filter-realtime/overview).

```
keyword = 'covid'
myStream.filter(track=[keyword])
```





## 3) Search Tweets

Now you are ready to search Twitter for recent tweets! 


 
### a) Search Tweets using Keywords


To create this query, you will define:
- the search term
- the start date of your search (optional)

Note: The Search API returns tweets matching the search terms that were posted in the last 7 days. You need a premium account to search further back than 7 days.

(Optional) Uncomment and run the following code snippet if you wish to enable Python logging and see what happens under the hood during each API call.

In [None]:
# import logging
# logging.basicConfig(level=logging.DEBUG,
#                     format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
#                     datefmt='%m-%d %H:%M:%S')
# logger = logging.getLogger(__name__)

In [None]:
# Define the search term and the date_since date as variables
search_words = 'covid'
date_since = "2021-01-24" # start date in YYYY-MM-DD format, e.g. yesterday's date

max_tweets = 5
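
Since the search only covers the last 7 days, a hard-coded date goes stale quickly. As a small sketch, the start date can be computed relative to today with the standard `datetime` module. The `days_ago` helper is a hypothetical name introduced here for illustration.

```python
from datetime import date, timedelta

def days_ago(n, today=None):
    """Return the date `n` days before `today` as a YYYY-MM-DD string."""
    today = today or date.today()
    return (today - timedelta(days=n)).isoformat()

date_since = days_ago(1)  # yesterday
print(date_since)
print(days_ago(1, today=date(2021, 1, 25)))  # 2021-01-24
```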

Below we use `tweepy.Cursor()` to search for tweets containing the specified search_words and to handle pagination. Parameters:
-   `api.search` – Tweepy API method that returns a collection of relevant tweets matching a specified query
-   `q` – the search query string of 500 characters maximum, including operators. Queries may additionally be limited by complexity.
-   `lang` – restricts tweets to the given language
-   `since` – returns tweets created on or after this date. The date should be formatted as YYYY-MM-DD.

You can restrict the number of tweets returned by specifying a number in the `.items()` method. For example, `.items(5)` returns 5 of the most recent tweets.

In [None]:
# Below will return five tweets containing search words 
tweets = tweepy.Cursor(api.search, q=search_words, tweet_mode='extended').items(max_tweets)


In [None]:
# You can add other parameters (lang, since, etc.)
tweets = tweepy.Cursor(api.search, q=search_words, lang="en", since=date_since, tweet_mode='extended').items(max_tweets)


`tweepy.Cursor(...).items()` returns an `ItemIterator` object that you can iterate over to access the collected tweet data. Each tweet item in the iterator has various attributes, including:

- the text of the tweet
- the date the tweet was sent
- and more. 

The code below loops through the object and prints the text associated with each tweet.

In [None]:
tweets = tweepy.Cursor(api.search, q=search_words, lang="en", since=date_since, tweet_mode='extended').items(max_tweets)

# Iterate tweets
for tweet in tweets:
    # print out the tweet ID, creation time, and tweet text
    print("----------------------------------------------------")
    print ('Tweet ID ' + str(tweet.id))
    print ('Created at ' + str(tweet.created_at))
    
    # Extracting tweet text when in Extended Mode
    try: # If it's Retweet
        text = tweet.retweeted_status.full_text
    except AttributeError:  # Not a Retweet
        text = tweet.full_text
    print('\t Tweet: ' + text)
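
The retweet check above is repeated in several cells, so it can be wrapped in a small helper. This is a sketch under the assumption that tweets were fetched with `tweet_mode='extended'`. The `get_full_text` name is hypothetical, and the `SimpleNamespace` objects only mimic the attributes used here, so they are not real Tweepy objects.

```python
from types import SimpleNamespace

def get_full_text(tweet):
    """Return the untruncated text of a tweet fetched in extended mode."""
    original = getattr(tweet, "retweeted_status", None)
    if original is not None:  # retweet: the full text lives on the original tweet
        return original.full_text
    return tweet.full_text

# Stand-in objects that mimic only the attributes used above
plain = SimpleNamespace(full_text="Stay safe everyone")
retweet = SimpleNamespace(
    full_text="RT @someone: Stay safe every...",
    retweeted_status=SimpleNamespace(full_text="Stay safe everyone, wear a mask"),
)
print(get_full_text(plain))
print(get_full_text(retweet))
```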


#### <img align="left" src="https://docs.google.com/uc?id=1m3oi2yHQnNISJ5EhmWVhRsqPFao6qSU4" width="50"/><br><br>Who is Tweeting About 'covid'?

You can access a wealth of information associated with each tweet. 

Below is an example of accessing information about the users who sent the tweets, including their screen names and locations. Note that user locations are entered manually by each user, so you will see a lot of variation in the format of this value.

- `tweet.user.screen_name` provides the Twitter handle of the user who posted the tweet.
- `tweet.user.location` provides the location the user has entered in their profile.

You can try to include other items available within each tweet by checking out the [twitter developer documentation](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet).

In [None]:
tweets = tweepy.Cursor(api.search, q=search_words, lang="en", since=date_since, tweet_mode='extended').items(max_tweets)

for tweet in tweets:
    print("----------------------------------------------------")
    print ('Tweet ID ' + str(tweet.id))
    print (f'Tweeted by: @{tweet.user.screen_name} Created at: {str(tweet.created_at)} Location: {tweet.user.location}' )
    # Extract text when in Extended Mode
    try: # If it's Retweet
        text = tweet.retweeted_status.full_text
    except AttributeError:  # Not a Retweet
        text = tweet.full_text
    print('\t' + text)

#### Save Tweets in a JSON format into a File


The Twitter API limits how many times we can call it within a given time window (the Twitter rate limit). So, it is better to save the collected data to a file than to request it again.

What is JSON? 

JavaScript Object Notation (JSON) is a standard text-based format for representing structured data based on JavaScript object syntax.

Table / Database --> Text format

| id        | name           | tweet  |
| ------------- |:-------------:| -----:|
| 123      | Jisun | Hello |
| 456      | Michelle      |  Welcome |

JSON
`[{"id": 123, "name": "Jisun", "tweet": "Hello"}, {"id": 456, "name": "Michelle", "tweet": "Welcome"}]`
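
As a quick illustration, the table above can be converted to JSON text and back with Python's built-in `json` module. The rows are the same made-up values as in the table.

```python
import json

rows = [
    {"id": 123, "name": "Jisun", "tweet": "Hello"},
    {"id": 456, "name": "Michelle", "tweet": "Welcome"},
]

text = json.dumps(rows)      # Python objects -> JSON text
restored = json.loads(text)  # JSON text -> Python objects

print(text)
print(restored[1]["name"])  # Michelle
```

Note that JSON text always uses double quotes around keys and strings, even though Python displays dictionaries with single quotes.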


In [None]:
import json

In [None]:
# 'mypath' variable can be changed to your local path or Google Drive path
mypath = "."

In [None]:
tweets = tweepy.Cursor(api.search, q=search_words, lang="en", since=date_since, tweet_mode='extended').items(max_tweets)
# Write data into a file
filename = f"{mypath}/tweets_{search_words}.jsons"
with open(filename, "w") as output:
    for tweet in tweets:
        myjson = tweet._json
        output.write(json.dumps(myjson)+"\n")
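
The file written above holds one JSON object per line. The sketch below shows the same write-then-read round trip on made-up records, so it can be tried without calling the Twitter API. The helper names and the sample records are assumptions for illustration only.

```python
import json
import os
import tempfile

def write_json_lines(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as output:
        for record in records:
            output.write(json.dumps(record) + "\n")

def read_json_lines(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as fi:
        return [json.loads(line) for line in fi if line.strip()]

# Made-up records standing in for tweet._json
sample = [{"id": 1, "full_text": "first"}, {"id": 2, "full_text": "second\nline"}]
path = os.path.join(tempfile.mkdtemp(), "tweets_sample.jsons")
write_json_lines(path, sample)
loaded = read_json_lines(path)
print(len(loaded), loaded[1]["full_text"] == "second\nline")
```

Line breaks inside a tweet are escaped by `json.dumps`, which is why each record still fits on a single line.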


Read tweets from the file.

Let's read the first tweet.

In [None]:
# Read data from a file
filename = f"{mypath}/tweets_{search_words}.jsons"

with open(filename) as fi:
    for line_cnt, line in enumerate(fi):
        tweet = json.loads(line.strip())
        break # Break here so that we read the first line of the file
        

In [None]:
# Print JSON formatted text in a readable way
import pprint

pprint.pprint(tweet)

In [None]:
# Check keys in json
tweet.keys()

In [None]:
# How to access values in json
print(tweet['id'])
print(tweet['user']['name'])


#### Extract data from json

In [None]:
# Read data from a file
filename = f"{mypath}/tweets_{search_words}.jsons"

with open(filename) as fi:
    for line_cnt, line in enumerate(fi):
        tweet = json.loads(line.strip())

        tweetid = tweet['id']
        created_at = tweet['created_at']

        # Extract text from tweets in Extended Mode
        if 'retweeted_status' in tweet: # If it's Retweet
            text = tweet['retweeted_status']['full_text']
        else:  # Not a Retweet
            text = tweet['full_text']

        user_screen_name = tweet['user']['screen_name']
        user_location = tweet['user']['location']

        print("--------------------------")
        print (f'Tweet ID: {tweetid}')
        print (f'Tweeted by: @{user_screen_name}, Created at {created_at}, User Location: {user_location}' )
        print(f'\t {text}')

        break #If you want to read other lines, comment this out
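
In the saved JSON, `created_at` is a plain string rather than a date object. The sketch below converts it with `datetime.strptime`. It assumes the string follows the v1.1 format shown in the comment, and `parse_created_at` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Assumed format of the v1.1 "created_at" string,
# e.g. "Wed Oct 10 20:19:24 +0000 2018"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(value):
    """Convert a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, CREATED_AT_FORMAT)

created = parse_created_at("Wed Oct 10 20:19:24 +0000 2018")
print(created.isoformat())  # 2018-10-10T20:19:24+00:00
```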


### Exercise 1

Using the tweets retrieval code example given above, add on the following details for each tweet retrieved:
- Number of times the Tweet has been retweeted (a retweet is when someone shares someone else’s tweet.)
- Source/application used to post the Tweet.
- User's name and friends count in Twitter

You may refer to the [twitter developer documentation](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet). 

An example of the expected tweet output is given as follows: the tweet has been retweeted 21 times, the tweet has been posted using 'Twitter for Android' and user 'Cotonete' has 82 friends in Twitter:<br>
<img align="center" src='https://drive.google.com/uc?export=view&id=1WHGR9Q9ou4_w_zMhioEfVyhV1VxYNenk' style="height: 110px;">

As shown above, there are two ways to do this: you can use the Tweepy API or the saved file.

Try both!


In [None]:
## Enter your code below using Tweepy API



In [None]:
## Enter your code below using the saved file


#### Removing Retweets

In the above example, some of the tweets retrieved may start with the prefix 'RT', which means they are retweets. A retweet is when someone shares someone else’s tweet, similar to sharing on Facebook. Sometimes you may want to remove retweets because they contain duplicate content that might skew your analysis, for example if you are looking at word frequency. Other times, you may want to keep retweets.

Below you ignore all retweets by adding `-filter:retweets` to your query. You may wish to check out the [Tweepy API](https://docs.tweepy.org/en/latest/api.html) documentation for other ways to customize your queries.
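
Building the query string can be kept in one place with a small helper. The sketch below only assembles text and sends nothing to Twitter. `build_query` is a hypothetical name, and it assumes that space-separated terms must all match and that a phrase in double quotes is matched exactly.

```python
def build_query(terms, exclude_retweets=False):
    """Join search terms with spaces and optionally append -filter:retweets."""
    # Wrap multi-word terms in double quotes so they are searched as a phrase
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    query = " ".join(quoted)
    if exclude_retweets:
        query += " -filter:retweets"
    return query

print(build_query(["covid"], exclude_retweets=True))
# covid -filter:retweets
print(build_query(["covid", "clean energy"], exclude_retweets=True))
# covid "clean energy" -filter:retweets
```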

In [None]:
new_search = search_words + " -filter:retweets" 
# new_search has the value "covid -filter:retweets"

tweets = tweepy.Cursor(api.search,q=new_search, lang="en",since=date_since).items(8)

for tweet in tweets:
    print("----------------------------------------------------")
    print (f'Tweeted by: @{tweet.user.screen_name} Created at: {str(tweet.created_at)} Location: {tweet.user.location}' )
    print(f'\tText: {tweet.text}')
    

### Create a Pandas Dataframe From A List of Tweet Data

Instead of displaying the tweets on screen, you can also populate a pandas DataFrame with the retrieved tweet data.

In [None]:
import pandas as pd

# setting parameters and retrieving tweets
new_search = search_words + " -filter:retweets" 
tweets = tweepy.Cursor(api.search,q=new_search, lang="en",since=date_since,tweet_mode='extended').items(8)

## initialise list to be used to store tweets retrieved
tweets_list = []

## appending tweets retrieved into a list
for tweet in tweets:
    
    try: # If it's Retweet
        text = tweet.retweeted_status.full_text
    except AttributeError:  # Not a Retweet
        text = tweet.full_text

    tweets_list.append([tweet.user.screen_name, tweet.created_at, tweet.user.location, text])

# populate dataframe with list of tweets
tweet_df = pd.DataFrame(data=tweets_list, columns=['user','created_at','location','text'])
tweet_df

In [None]:
## save the data into a csv file
tweet_df.to_csv('covid_tweet.csv')

### Search Tweets by Specific User

Besides searching by keyword, we can also retrieve tweets posted by a specific Twitter user.

Parameters:
-   `api.user_timeline` – tweepy api method that returns the most recent statuses (up to 20) posted from the user specified.
-   `id` – unique user ID or screen name of a user
-   `lang` – restricts tweets to the given language
-   `include_rts` – boolean indicator to specify whether to include retweets
-   `exclude_replies` – boolean indicator to specify whether to exclude tweet replies

Similarly, you can restrict the number of tweets returned by specifying a number in the `.items()` method. `.items(10)` will return 10 of the most recent tweets.

Let's look at the following example that retrieves tweets posted by UK Model World Health Organization. 

In [None]:
import pandas as pd

user_id = "UKModelWHO"

## initialise list to be used to store tweets retrieved
tweets_list = []

## appending tweets retrieved into a list
for tweet in tweepy.Cursor(api.user_timeline, id=user_id ,lang="en", include_rts=False, exclude_replies=True, tweet_mode='extended').items(10):
    try: # If it's Retweet
        text = tweet.retweeted_status.full_text
    except AttributeError:  # Not a Retweet
        text = tweet.full_text
    tweets_list.append([tweet.user.screen_name, tweet.id, tweet.created_at, text])

# populate dataframe with list of tweets specifying required column names
tweet_df = pd.DataFrame(data=tweets_list, columns=['user','tweetid','created_at','text'])
tweet_df


In [None]:
## save the data into a csv file
tweet_df.to_csv('ukmodelwho_tweet.csv')

You can also save the tweets in their original JSON format.

In [None]:
user_id = "UKModelWHO"

tweets = tweepy.Cursor(api.user_timeline, id=user_id ,lang="en", include_rts=False, exclude_replies=True, tweet_mode='extended').items(10)

filename = f"{mypath}/tweets_{user_id}.jsons"
with open(filename, "w") as output:
    for tweet in tweets:
        myjson = tweet._json
        output.write(json.dumps(myjson)+"\n")


Create a dataframe from the JSON file

In [None]:
tweets_list = []

filename = f"{mypath}/tweets_{user_id}.jsons"
with open(filename) as fi:
    for line_cnt, line in enumerate(fi):
        tweet = json.loads(line.strip())

        tweetid = tweet['id']
        created_at = tweet['created_at']
        # Extract text from tweets in Extended Mode
        if 'retweeted_status' in tweet: # If it's Retweet
            text = tweet['retweeted_status']['full_text']
        else:  # Not a Retweet
            text = tweet['full_text']

        user_screen_name = tweet['user']['screen_name']

        tweets_list.append([user_screen_name, tweetid, created_at, text])

# populate dataframe with list of tweets specifying required column names
tweet_df = pd.DataFrame(data=tweets_list, columns=['user','tweetid', 'created_at', 'text'])
tweet_df

    

## 4) Streaming API

Step 1: Creating a StreamListener

`on_data()` is called whenever new data arrives from the stream.


In [None]:
class MyStreamListener(tweepy.StreamListener):

    """ A listener that handles tweets as they are received from the stream.
    This is a basic listener that writes each received tweet to the open file `myoutput`.

    """
    def on_data(self, data):
        # Strip the trailing line break, then write one JSON object per line
        myjson = data.strip()
        myoutput.write(myjson+"\n")
        return True

    def on_error(self, status):
        print ("Error", status)
        

Step 2: Creating a Stream


In [None]:
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener, tweet_mode='extended')


You need to stop the process before it collects too much data!!
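
One way to avoid collecting too much data is to let the listener count messages and ask to stop after a fixed number. The sketch below shows only that counting logic on made-up messages, without a live connection. `LimitedWriter` is a hypothetical class, and the idea relies on the assumption that, in Tweepy 3.x, returning `False` from `on_data` ends the stream, so please check this against the version you have installed.

```python
import io

class LimitedWriter:
    """Writes each received message to `output` and asks to stop after `max_tweets`."""

    def __init__(self, output, max_tweets):
        self.output = output
        self.max_tweets = max_tweets
        self.count = 0

    def on_data(self, data):
        self.output.write(data.strip() + "\n")
        self.count += 1
        # Returning False is the stop signal; True means "keep streaming"
        return self.count < self.max_tweets

# Simulate a stream with made-up messages instead of a live connection
buffer = io.StringIO()
listener = LimitedWriter(buffer, max_tweets=2)
for message in ['{"id": 1}\r\n', '{"id": 2}\r\n', '{"id": 3}\r\n']:
    if listener.on_data(message) is False:
        break
print(buffer.getvalue())
```

In the notebook, the same `on_data` logic would go into a subclass of `tweepy.StreamListener`.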

In [None]:
keyword = 'covid'

myfilename = f'{mypath}/stream_tweets_{keyword}.jsons'
myoutput = open(myfilename, 'w')

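try:
    while True:
        # myStream.filter(track=['coronavirus', 'covid', 'chinese virus', 'wuhan', 'ncov', 'sars-cov-2', 'koronavirus', 'corona', 'cdc', 'N95', 'kungflu', 'epidemic', 'outbreak', 'sinophobia', 'china', 'pandemic', 'covd'])
        # filter() blocks while streaming; the loop reconnects if the stream ends
        myStream.filter(track=[keyword])
except KeyboardInterrupt:
    # Raised when you interrupt the kernel to stop collecting
    print("Streaming stopped by user")
finally:
    # Close the file so that buffered tweets are flushed to disk
    myoutput.close()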


In [None]:
outfilename = f"{mypath}/simple_stream_tweets_{keyword}.tsv" 

with open(myfilename) as fi, open(outfilename, 'w') as output:
    # Write header in the file to load the file into dataframe
    output.write("\t".join(['user_screen_name', 'tweetid', 'created_at', 'text'])+"\n")
    
    for line_cnt, line in enumerate(fi):
        try:
            tweet = json.loads(line.strip())
        except json.JSONDecodeError: # The last JSON line may be incomplete
            continue
        
        if 'limit' in tweet:
            continue
        
        tweetid = tweet['id']
        
        created_at = tweet['created_at']
        user_screen_name = tweet['user']['screen_name']

        # Extract tweet text from the Streaming API: long tweets carry
        # their full text under 'extended_tweet'
        text = tweet['text']
        try:
            text = tweet['extended_tweet']['full_text']
        except KeyError:
            pass

        # Below line will remove all tabs and line breaks from text
        text = " ".join(text.split())

        output.write("\t".join([user_screen_name, str(tweetid), created_at, text])+"\n")
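
The whitespace clean-up step matters because a tab or line break inside a tweet would otherwise split one record across several columns or rows of the TSV file. A minimal sketch on a made-up tweet text, where `to_tsv_field` is a hypothetical helper name:

```python
def to_tsv_field(text):
    """Collapse tabs, newlines, and repeated spaces into single spaces."""
    return " ".join(text.split())

raw = "Line one\nLine two\twith a tab   and spaces"
row = "\t".join(["some_user", "123", to_tsv_field(raw)])
print(row.split("\t"))
```

After cleaning, the row splits into exactly three fields.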

    

Reading file into dataframe

In [None]:
infilename = f"{mypath}/simple_stream_tweets_{keyword}.tsv" 
df = pd.read_csv(infilename, sep="\t")
print(df.shape)
df.head()

## Exercise 2

Draw a word cloud using the tweets collected from the Twitter Streaming API.


In [None]:
!conda install --yes -c conda-forge wordcloud

In [None]:
# Import relevant libraries

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

In [None]:
# Enter your code to extract tweets from the dataframe and combine them into one string (Hint: use the join function)


In [None]:
# Enter your code to draw WordCloud


## 5) Sentiment analysis using VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. 

VADER reports not only whether a text is positive or negative but also how strongly positive or negative the sentiment is.

Read more about VADER [here](https://github.com/cjhutto/vaderSentiment).



In [None]:
!pip install vaderSentiment


In [None]:
# import SentimentIntensityAnalyzer class 
# from vaderSentiment.vaderSentiment module. 
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 

In [None]:
# function to print sentiments 
# of the sentence. 
def sentiment_scores(sentence): 
  
    # Create a SentimentIntensityAnalyzer object. 
    sid_obj = SentimentIntensityAnalyzer() 
  
    # The polarity_scores method of the SentimentIntensityAnalyzer
    # object returns a sentiment dictionary
    # that contains pos, neg, neu, and compound scores.
    sentiment_dict = sid_obj.polarity_scores(sentence) 
      
    print("Overall sentiment dictionary is : ", sentiment_dict) 
    print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative") 
    print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral") 
    print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive") 
  
    print("Sentence Overall Rated As", end = " ") 
  
    # decide sentiment as positive, negative and neutral 
    if sentiment_dict['compound'] >= 0.05 : 
        print("Positive") 
  
    elif sentiment_dict['compound'] <= - 0.05 : 
        print("Negative") 
  
    else : 
        print("Neutral") 
  


In [None]:
print("\n1st statement :") 
sentence = "eLearn is the best portal for students." 
# function calling 
sentiment_scores(sentence) 

print("\n2nd Statement :") 
sentence = "study is going on as usual"
sentiment_scores(sentence) 

print("\n3rd Statement :") 
sentence = "I am very sad today."
sentiment_scores(sentence) 
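
The labelling rule used in `sentiment_scores` can also be written as a small function that returns the label instead of printing it. The sketch below uses the same thresholds (0.05 and -0.05) and runs on made-up scores, so it does not need VADER installed. `label_from_compound` is a hypothetical name.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

for score in (0.64, 0.0, -0.48):  # made-up scores, not VADER output
    print(score, label_from_compound(score))
```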
    

## Exercise 3 - Sentiment analysis on the collected tweets

1. Write a python function that returns VADER's 'compound score' of a sentence
2. Re-read your tweets and put them into DataFrame (df)
3. Apply function (1) to every row in a Pandas DataFrame. Hint: Check [this blog post](https://www.geeksforgeeks.org/apply-function-to-every-row-in-a-pandas-dataframe/)
4. Plot histogram of vader compound score


In [None]:
# 1. Write a python function that returns VADER's 'compound score' of a sentence

def vader_compound_score(sentence): 
    
    #[Enter your code]
    


In [None]:
# 2. Re-read your tweets and put them into DataFrame (df)
# Change infilename accordingly

infilename = f"{mypath}/simple_stream_tweets_{keyword}.tsv" 
df = pd.read_csv(infilename, sep="\t")
print(df.shape)
df.head()


In [None]:
# 3. Apply function (1) to every row in a Pandas DataFrame
# Make the column name as 'vader'

df['vader'] = # [Enter your code]


In [None]:
df.head()

In [None]:
# 4. Plot histogram of vader compound score

plt.hist(df.vader)


## 6) [Tip] Connect to Google Drive

All data stored in Colab is lost when your session ends. 
To keep your data, store it in your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Change mypath to the path to your folder

In [None]:
mypath = '/content/drive/Path_to_your_folder'
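
If the folder in `mypath` does not exist yet, opening a file inside it for writing fails. As a small sketch, the standard `pathlib` module can create the folder and build the file name in one step. The `output_path` helper is hypothetical, and a temporary directory stands in for the Google Drive folder so the example runs anywhere.

```python
import tempfile
from pathlib import Path

def output_path(folder, keyword, suffix=".jsons"):
    """Create `folder` if needed and return the file path for `keyword`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"tweets_{keyword}{suffix}"

# A temporary directory stands in for '/content/drive/Path_to_your_folder'
base = Path(tempfile.mkdtemp()) / "lab3_data"
path = output_path(base, "covid")
print(path.name)  # tweets_covid.jsons
```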

In [None]:
# import os
# os.chdir(mypath)  #change dir
# print(os.getcwd())
# !ls