Collect text data using Twitter APIs.
--------------------------------------------------

Many free APIs let us collect data and use it to solve problems. Here we focus on the Twitter API, since tweets feed many NLP applications such as product-review mining and sentiment analysis.

Problem
------------
You want to collect text data using Twitter APIs.

Solution
------------
Twitter holds a huge amount of valuable data, and social media
marketers make their living from it. An enormous number of tweets is
posted every day, and every tweet has some story to tell. When this
data is collected and analyzed, it gives a business many insights
about its company, products, services, and so on.

How It Works
-------------------
Log in to the Twitter developer portal.

Create your own app in the Twitter developer portal and get the keys
listed below. Once you have these credentials, you can start pulling
data. Keys needed:

> • consumer key: Key that identifies the application to the service (Twitter, Facebook, etc.).

> • consumer secret: Password the application uses to authenticate with the authentication server (Twitter, Facebook, etc.).

> • access token: Key given to the client after successful authentication with the above keys.

> • access token secret: Password for the access token.
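
The four keys above are secrets, so one common way to handle them is to read them from environment variables instead of typing them into the notebook. The following is a minimal sketch of that idea; the variable names and the `load_credentials` helper are hypothetical and not part of tweepy.

```python
import os

# Hypothetical environment variable names; use whatever names you export.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED_KEYS if not environ.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_KEYS}
```

With the variables exported, `creds = load_credentials()` returns a dict whose values can be passed to `tweepy.OAuthHandler` and `set_access_token` in place of string literals.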

Useful links:
-----------------
https://iag.me/socialmedia/how-to-create-a-twitter-app-in-8-easy-steps/

https://developer.twitter.com/en/docs/tweets/sample-realtime/overview/GET_statuse_sample

In [1]:
!pip install tweepy

Collecting tweepy
  Downloading https://files.pythonhosted.org/packages/36/1b/2bd38043d22ade352fc3d3902cf30ce0e2f4bf285be3b304a2782a767aec/tweepy-3.8.0-py2.py3-none-any.whl
Collecting requests-oauthlib>=0.7.0
  Downloading https://files.pythonhosted.org/packages/a3/12/b92740d845ab62ea4edf04d2f4164d82532b5a0b03836d4d4e71c6f3d379/requests_oauthlib-1.3.0-py2.py3-none-any.whl
Collecting PySocks>=1.5.7
  Downloading https://files.pythonhosted.org/packages/8d/59/b4572118e098ac8e46e399a1dd0f2d85403ce8bbaad9ec79373ed6badaf9/PySocks-1.7.1-py3-none-any.whl
Collecting oauthlib>=3.0.0
  Downloading https://files.pythonhosted.org/packages/05/57/ce2e7a8fa7c0afb54a0581b14a65b56e62b5759dbc98e80627142b8a3704/oauthlib-3.1.0-py2.py3-none-any.whl (147kB)
Installing collected packages: oauthlib, requests-oauthlib, PySocks, tweepy
Successfully installed PySocks-1.7.1 oauthlib-3.1.0 requests-oauthlib-1.3.0 tweepy-3.8.0




In [3]:
# Once all the credentials are in place, use the code below to fetch the data.

# Install tweepy
# !pip install tweepy

# Import the libraries
import numpy as np
import tweepy
import json
import pandas as pd
from tweepy import OAuthHandler #OAuth open authorization (Third party protocol)

# credentials --> replace the placeholders with your own keys and never share real ones
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# calling API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Provide the query for which you want to pull data. For example,
# pulling tweets for the hashtag "#KhushiyanUnlocked"
query = "#KhushiyanUnlocked"

# Fetching tweets
Tweets = api.search(query, count = 10, lang='en', exclude='retweets',tweet_mode='extended')

for tweet in Tweets:
    print(tweet)
    print("============================================================================================")

# The call above pulls up to 10 tweets that match the hashtag
# "#KhushiyanUnlocked". It requests English tweets only, since the language
# given is 'en', and asks the API to exclude retweets.

Status(_api=<tweepy.api.API object at 0x000001CB3147CF98>, _json={'created_at': 'Sun Mar 01 03:50:26 +0000 2020', 'id': 1233962778576052226, 'id_str': '1233962778576052226', 'full_text': "#KhushiyanUnlocked\nDon't never ever use state Bank of India", 'truncated': False, 'display_text_range': [0, 59], 'entities': {'hashtags': [{'text': 'KhushiyanUnlocked', 'indices': [0, 18]}], 'symbols': [], 'user_mentions': [], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 156202825, 'id_str': '156202825', 'name': 'Digitalogic Infomedia Studios LLC', 'screen_name': 'digitalogicillc', 'location': 'Washington, USA', 'description': 'Official Twitter account of CEO of Digitalogic Infomedia Studios LLC al

By default, api.search returns 15 tweets per request; adding count = 100 raises this to the maximum of 100.

count is just one of the arguments we can adjust; others include the language, the location, etc.
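
To make the arguments above concrete, here is a small sketch that assembles the keyword arguments for a search call. The `search_kwargs` helper is hypothetical. It assumes the standard search operator `-filter:retweets` as a way to leave out retweets, and it clamps `count` to the 1 to 100 range described above.

```python
def search_kwargs(term, count=15, lang="en", exclude_retweets=True):
    """Build keyword arguments for a search call (hypothetical helper)."""
    query = term
    if exclude_retweets:
        # Standard search operator that filters retweets out of the results.
        query = term + " -filter:retweets"
    return {
        "q": query,
        "count": max(1, min(count, 100)),  # the API returns at most 100 per request
        "lang": lang,
        "tweet_mode": "extended",  # ask for the untruncated text (full_text)
    }


print(search_kwargs("#KhushiyanUnlocked", count=500))
```

Assuming the `api` object created earlier, the result could be used as `api.search(**search_kwargs("#KhushiyanUnlocked", count=100))`.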

"""  
Getting the Tweets + Some Attributes
------------------------------------
In this section, we will get some tweets plus some of their related attributes and store them in a structured format.

If we want more than 100 tweets, which we do in our case, api.search alone is not enough. We need tweepy.Cursor, which handles pagination for us: it reads a page of up to 100 tweets, then requests the next page, and so on, so we can iterate over the results as one continuous sequence.

For our purpose, the end result is that it keeps fetching tweets until we stop it by breaking out of the loop.
"""

In [6]:
# start by creating an empty DataFrame with the columns we'll need
df = pd.DataFrame(columns = ['Tweets', 'User', 'User_statuses_count', 
                             'user_followers', 'User_location', 'User_verified',
                             'fav_count', 'rt_count', 'tweet_date'])


In [15]:
# Next, let's define a function as follows.
def stream(data, file_name):
    i = 0
    for tweet in tweepy.Cursor(api.search, q=data, count=100, lang='en').items():
        print(i, end='\r')  # '\r' returns the cursor to the start of the line
        # tweet.text may be truncated; with tweet_mode='extended' the text is in tweet.full_text
        df.loc[i, 'Tweets'] = tweet.text
        df.loc[i, 'User'] = tweet.user.name
        df.loc[i, 'User_statuses_count'] = tweet.user.statuses_count  # number of times the user has tweeted
        df.loc[i, 'user_followers'] = tweet.user.followers_count
        df.loc[i, 'User_location'] = tweet.user.location
        df.loc[i, 'User_verified'] = tweet.user.verified
        df.loc[i, 'fav_count'] = tweet.favorite_count
        df.loc[i, 'rt_count'] = tweet.retweet_count
        df.loc[i, 'tweet_date'] = tweet.created_at
        i += 1
        if i == 1000:  # stop after 1000 tweets
            break
    # Uncomment to save the result; file_name fills the {} placeholder
    # df.to_excel('{}.xlsx'.format(file_name))


"""
Let's look at this function from the inside out:
------------------------------------------------
First, we followed the same methodology of getting each tweet in a for loop, but this time from tweepy.Cursor.

Inside tweepy.Cursor, we pass our api.search and the arguments we want:

q = data: data is whatever text we pass into the stream function for api.search to look up, just as we passed "#KhushiyanUnlocked" in the previous example.

count = 100: Here we set the number of tweets returned per request by api.search to 100, which is the maximum.

lang = 'en': Here I am simply filtering results to return tweets in English only.

Next, I fill my DataFrame with the attributes I am interested in, using the .loc indexer in Pandas and my i counter to write one row per iteration.

The attributes I pass into each column are self-explanatory; see the Twitter API documentation for the other available attributes and experiment with those.

Finally, the function can save the result to an Excel file with "df.to_excel". That line is commented out in the code, so uncomment it to write the file. It uses a placeholder {} instead of a fixed name inside the function so that I can name the file myself when I run the function.

Now I can call my function as follows, looking for tweets about "Taapsee Pannu" and naming my file "my_tweets."

Because we put our api.search into tweepy.Cursor, it does not stop at the first 100 tweets. It keeps requesting pages for as long as results are available, which is why we use i as a counter to stop the loop after 1000 iterations.
"""


In [16]:

# calling the above function; q expects a search string, so pass a str rather than a list
stream(data = 'Taapsee Pannu', file_name = 'my_tweets')




999

In [12]:
# view first 5 records
df.head()

Unnamed: 0,Tweets,User,User_statuses_count,user_followers,User_location,User_verified,fav_count,rt_count,tweet_date
0,RT @VertigoWarrior: Note these names. I`m not ...,Sach7511,4560,57,India,False,0,555,2020-03-01 04:30:15
1,#Thappad’s biggest triumph is in showing the w...,Silverscreen.in,41049,35086,"India, USA",True,0,0,2020-03-01 04:30:04
2,#TaapseePannu also spoke of her desire to be i...,GOODTIMES,31965,180396,India,True,0,0,2020-03-01 04:30:00
3,RT @HindustanTimes: Thappad box office day 2: ...,☯‿☯༎ຶᗩෆriT༎ຶ,3480,390,Follow Dance Studio & Arts,False,0,1,2020-03-01 04:25:05
4,#Thappad box office day 2: @taapsee film sees ...,HT Entertainment,92339,176475,India,True,5,0,2020-03-01 04:21:04


"""
Let's Analyze Some Tweets
--------------------------
"""

In [26]:
# Import TextBlob. It has a built-in sentiment property.
from textblob import TextBlob


# The sentiment property returns a named tuple of the form 
# Sentiment(polarity,subjectivity). The polarity score is a float 
# within the range [-1.0, 1.0]. 
# The subjectivity is a float within the range [0.0, 1.0] 
# where 0.0 is very objective and 1.0 is very subjective.


"""
I would like to add an extra column to this DataFrame that indicates the sentiment of each tweet.

We will also add a column with the tweets stripped of mentions, URLs, and other symbols, then run the sentiment analyzer on those cleaned-up tweets so that it works more effectively.
"""

In [27]:
# Let's start by writing our tweets cleaning function:
import re
def clean_tweet(tweet):
    return ' '.join(re.sub(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)', ' ', tweet).split())
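
To see what the cleaning pattern removes, here is the same function applied to two made-up tweets. The sample strings are invented for illustration. The pattern has three alternatives: @mentions, any character that is not a letter, digit, space, or tab, and URLs.

```python
import re


def clean_tweet(tweet):
    # Replace mentions, non-alphanumeric characters, and URLs with a space,
    # then collapse runs of whitespace into single spaces.
    pattern = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)'
    return ' '.join(re.sub(pattern, ' ', tweet).split())


sample = "RT @user123: Loved #Thappad! https://t.co/abc123"
print(clean_tweet(sample))  # RT Loved Thappad

# The mention pattern stops at an underscore, so part of the handle survives.
print(clean_tweet("@silver_screen great film"))  # screen great film
```

The second call shows a limitation: handles that contain an underscore are only partly removed, because `_` is not in the `[A-Za-z0-9]` character class.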


In [28]:
# Let's also write our sentiment analyzer function:
def analyze_sentiment(tweet):
    analysis = TextBlob(tweet)
    if analysis.sentiment.polarity > 0:
        return 'Positive'
    elif analysis.sentiment.polarity ==0:
        return 'Neutral'
    else:
        return 'Negative'


In [29]:
# Now let's create our new columns:
df['clean_tweet'] = df['Tweets'].apply(clean_tweet)
df['Sentiment'] = df['clean_tweet'].apply(analyze_sentiment)


In [30]:
# Let's look at some random rows to make sure our functions worked correctly.

# Example (row at index 300):
n=300
print('Original tweet:\n'+ df['Tweets'][n])
print()
print('Clean tweet:\n'+df['clean_tweet'][n])
print()
print('Sentiment:\n'+df['Sentiment'][n])


Original tweet:
RT @bhaveshkjha: #TheKapilSharmaShow Sharma ji your today's TRP is going to be very low as u have invited Paapsi Pannu and  Sinha
@KapilSha…

Clean tweet:
RT TheKapilSharmaShow Sharma ji your today s TRP is going to be very low as u have invited Paapsi Pannu and Sinha

Sentiment:
Neutral


"""
------------------------------------------------
Further reading (for those who want to become a "Data Scientist" quickly):
https://towardsdatascience.com/extracting-twitter-data-pre-processing-and-sentiment-analysis-using-python-3-0-7192bd8b47cf
"""

In [31]:
df[df.Sentiment=='Positive'].shape[0]

240

In [32]:
df[df.Sentiment=='Neutral'].shape[0]

716

In [33]:
df[df.Sentiment=='Negative'].shape[0]

44
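
The three counts above add up to the 1000 collected tweets, and it can help to express them as shares. The sketch below only redoes that arithmetic with the numbers printed above; in the notebook itself, `df['Sentiment'].value_counts()` would return all three counts in one call.

```python
# Counts taken from the three cell outputs above.
counts = {"Positive": 240, "Neutral": 716, "Negative": 44}

total = sum(counts.values())

# Share of each label as a percentage of all collected tweets.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, share in shares.items():
    print("%-8s %5.1f%%" % (label, share))
```

Most tweets land in the Neutral bucket because TextBlob assigns a polarity of exactly 0 when it finds no sentiment-bearing words it recognizes.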