### Introduction to Twitter

#### What is Twitter?
Twitter is a micro-blogging social network where users post short messages, called 'Tweets', of up to 280 characters (previously 140).

Link : https://twitter.com/
___


#### User actions on Twitter
> - Tweet -- Post a message on Twitter with text of up to 280 characters, optionally with an image or video.
> - Retweet -- Share a tweet made by another user within Twitter.
> - Reply -- Post a message in response to another user's tweet.
> - Mention -- Tag another user in a tweet or reply.
> - Hashtag -- A tag used to link a tweet to a topic or event.
> - Follow -- Subscribe to a user's tweets. A follower is a user who follows; the user being followed is the followee.
> - Search -- Search for tweets posted by other accounts based on a query.

#### Twitter API
Twitter provides an application programming interface (API) [1]. The API allows us to interact with the social network in many ways, such as fetching a user's tweets, messaging users, and searching for tweets.

#### How does one use the API?
To perform any of the above-mentioned actions through the API, the user needs to create a Twitter Developer App and get the following keys:

> - Consumer Key
> - Consumer Secret
> - Access Token
> - Access Token Secret

These are necessary for the authentication process with the API.
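
One way to keep these four values out of the notebook itself is to read them from environment variables. The sketch below is only an illustration: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are assumptions made for this example, not something Twitter or Tweepy prescribes.

```python
import os

# Hypothetical environment variable names; any naming scheme would do.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

The returned values could then be passed to the authentication handler in place of string literals, so the keys never appear in a shared notebook.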

#### How can I collect data from the API?
The API has various endpoints to perform various actions. We will primarily be focussing on Search and Streaming.

#### API rate limits
The Twitter API is rate limited so that API traffic does not hamper the behaviour of the social network.
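
In practice, a client has to slow down once it hits the limit. Tweepy's `API` constructor accepts `wait_on_rate_limit=True` for this purpose; the sketch below shows the general idea independently of Tweepy. `RateLimitError` and `call_with_backoff` are hypothetical names introduced for this example and are not part of any library.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a client raises when rate limited."""


def call_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on RateLimitError wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(base_delay * 2 ** attempt)
```

Doubling the delay after each failure is a common choice, but a fixed wait until the rate-limit window resets would work as well.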

#### Libraries used
> - Tweepy
> - jsonpickle


In [1]:
import tweepy
from pprint import pprint  # for pretty-printing the tweet
import jsonpickle  # for serialising the tweets to JSON when saving them
# for getting a stream of data (StreamListener exists in Tweepy 3.x; it was removed in Tweepy 4)
from tweepy import Stream
from tweepy.streaming import StreamListener
import json

import plotly.express as px
import plotly.graph_objects as go

from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import numpy as np
import re
from matplotlib import pyplot as plt




# for NLP; the examples below need these resources, downloaded once with:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')
import nltk





In [2]:
import os

# Consumer Key (API Key), Consumer Secret (API Secret).
# Credentials are read from environment variables rather than hard-coded;
# the variable names below are placeholders, use whatever names you have set.
auth = tweepy.OAuthHandler(os.environ['TWITTER_CONSUMER_KEY'],
                           os.environ['TWITTER_CONSUMER_SECRET'])
# Access Token, Access Token Secret
auth.set_access_token(os.environ['TWITTER_ACCESS_TOKEN'],
                      os.environ['TWITTER_ACCESS_TOKEN_SECRET'])

api = tweepy.API(auth)
# Note: this only checks that the client object was created; no request is sent yet.
if (not api):
    print("Authentication failed :(")
else:
    print("Authentication successful!!! :D")

Authentication successful!!! :D


In [3]:
query = '#coronavirusindia'  # this is what we're searching for
en_lang = 'en' # this is used to specify the language of the tweets
popular_results = 'popular' # used to specify the order of tweet results. Accepted values: popular|recent|mixed
extended_mode = 'extended' # used to tell the API not to truncate the tweet

In [4]:
#Query the endpoint
search_results = api.search(q=query, lang=en_lang, result_type=popular_results, 
                            tweet_mode=extended_mode)

In [5]:
for result in search_results:
    pprint(result._json)
    break

{'contributors': None,
 'coordinates': None,
 'created_at': 'Tue Mar 03 08:09:20 +0000 2020',
 'display_text_range': [0, 228],
 'entities': {'hashtags': [{'indices': [211, 228], 'text': 'coronavirusindia'}],
              'symbols': [],
              'urls': [{'display_url': 'twitter.com/rahulgandhi/st…',
                        'expanded_url': 'https://twitter.com/rahulgandhi/status/1227536939479228417',
                        'indices': [229, 252],
                        'url': 'https://t.co/SuEvqMFbQd'}],
              'user_mentions': []},
 'favorite_count': 20142,
 'favorited': False,
 'full_text': 'There are moments in the life of every nation when its leaders '
              'are tested. A true leader would be completely focused on '
              'averting the massive crisis about to be unleashed by the virus '
              'on India and its economy. \n'
              '\n'
              '#coronavirusindia https://t.co/SuEvqMFbQd',
 'geo': None,
 'id': 1234752707883175941,
 ...

In [None]:
#Save the Tweets
file_name = 'data/search_tweets.json'

In [None]:
# Iterate through search results and save the tweet
with open(file_name, 'w') as f:
    for tweet in search_results:
        f.write(jsonpickle.encode(tweet._json, unpicklable=False) +
                        '\n')

In [None]:
# # Override tweepy.StreamListener to add logic to on_data
# class MyListener(StreamListener):
 
#     def on_data(self, data):
#         try:
#             pprint(data)
#             return True
#         except BaseException as e:
#             print("Error on_data: %s" % str(e))
#         return True
 
#     def on_error(self, status):
#         pprint(status)
#         return True

# twitter_stream = Stream(auth, MyListener())
# twitter_stream.filter(track=[query])  # track expects a list of phrases


## Analysing Twitter Data

### Fetch already collected Twitter data
In order to do any kind of analysis, we need a large amount of data. So we will use a dataset of tweets collected using #WorldCup, which can be downloaded here. Once downloaded, move the JSON file to a folder named data in the root of the project.

The dataset is a text file containing the Twitter API responses, one tweet per line. The structure of each tweet is the same as we've seen before.

#### How do I fetch this data?
> - Import the libraries and read the file
> - Convert each tweet string to a Python dictionary

In [None]:


filename = 'data/worldcup-tweets.json'

# Open the file in read mode
with open(filename, 'r') as f:
    tweet_string_list = f.readlines()
    
# Convert Tweets from string to dict
tweet_list = []
for string in tweet_string_list:
    tweet_list.append(json.loads(string))

In [None]:
print(len(tweet_list))
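
Files collected from the API sometimes contain blank or truncated lines, and `json.loads` raises an error on those. A more defensive variant of the loading step might look like the sketch below; `parse_json_lines` is a hypothetical helper and the sample lines are made up.

```python
import json


def parse_json_lines(lines):
    """Parse one JSON object per line, skipping blank or malformed lines."""
    records, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored silently
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped += 1  # count lines that are not valid JSON
    return records, skipped


sample = ['{"id": 1, "text": "goal!"}\n', '\n', '{"id": 2, "text": "tr']
tweets, skipped = parse_json_lines(sample)
print(len(tweets), skipped)  # 1 1
```

Returning the number of skipped lines makes it easy to notice when a file is more damaged than expected.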

## Analysis 1: How many hashtags are contained in the tweets?

### Calculating number of hashtags in a tweet
From the structure of the tweet, we can see that this information is inside the 'entities' dictionary. All we need is the length of its 'hashtags' list.
We define a function to do that.

In [None]:
from collections import Counter, OrderedDict

def get_num_of_hashtags(tweet_list):
    '''
    Returns the counter of number of tweets by 
    the number of hashtags used
    '''
    tweet_hashtags = Counter()
    for tweet in tweet_list:
        tweet_hashtags[len(tweet['entities']['hashtags'])] += 1
    return tweet_hashtags

In [None]:
hashtags_counter = get_num_of_hashtags(tweet_list)
pprint(hashtags_counter)

In [None]:
counter_dict = OrderedDict(hashtags_counter.most_common())
pprint(counter_dict)
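
To see what the counter holds, it can help to run the same counting logic on a few hand-made tweets. The tweets below are invented for illustration and contain only the field that the count depends on.

```python
from collections import Counter

# Four hand-made tweets with 1, 0, 2 and 1 hashtags respectively.
toy_tweets = [
    {"entities": {"hashtags": [{"text": "WorldCup"}]}},
    {"entities": {"hashtags": []}},
    {"entities": {"hashtags": [{"text": "WorldCup"}, {"text": "FRA"}]}},
    {"entities": {"hashtags": [{"text": "CRO"}]}},
]

# Key: number of hashtags in a tweet; value: how many tweets have that many.
hashtags_per_tweet = Counter(
    len(tweet["entities"]["hashtags"]) for tweet in toy_tweets
)
print(sorted(hashtags_per_tweet.items()))  # [(0, 1), (1, 2), (2, 1)]
```

Each pair reads as (number of hashtags, number of tweets), so two of the four toy tweets contain exactly one hashtag.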

### Plot the counts in the form of a bar chart

The following cells plot a bar chart, given labels and their counts.

In [None]:
# Get labels and counts from the dictionary
labels = list(counter_dict.keys())
counts = list(counter_dict.values())



In [None]:
fig = go.Figure([go.Bar(x=labels, y=counts)])
fig.show()

## Analysis 2: Which devices were used to send these tweets?

### Calculating number of tweets for most common devices used
Again, we can find this information in our tweet data: the 'source' field specifies it. It looks something like this:

`<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>`

Now, in order to get the text out of this HTML tag, we use regular expressions.
### Regex method

In [None]:

def remove_html_tags(text):
    '''
    Remove html tags from a string
    '''
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
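
As a quick check, the same pattern can be applied to the sample 'source' value shown earlier. The sketch also contrasts it with a greedy pattern to show why the `?` matters; the variable names are chosen for this example only.

```python
import re

source = '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'

# Non-greedy: each tag is matched separately, so the text between tags survives.
device = re.sub(r"<.*?>", "", source)
print(device)  # Twitter for Android

# Greedy: one match runs from the first '<' to the last '>', removing everything.
greedy = re.sub(r"<.*>", "", source)
print(repr(greedy))  # ''
```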

### Count the tweets per device

As in the previous analysis, we need to count how many tweets were sent from each device. The HTML tags are cleaned afterwards, when we build the chart labels.

In [None]:
def get_device_counts(tweet_list):
    '''
    Returns the 5 most common devices used to post a tweet
    '''
    tweet_device = Counter()
    for tweet in tweet_list:
        tweet_device[tweet['source']] += 1
    tweet_device = tweet_device.most_common(5)
    return tweet_device

### Plot a pie chart
We use a pie chart to show what percentage of the tweets were posted from each device.

### What do the numbers look like?

In [None]:
device_counter = get_device_counts(tweet_list)
pprint(device_counter)

In [None]:
device_labels = []
device_counts = []

for tup in device_counter:
    device_labels.append(remove_html_tags(tup[0]))
    device_counts.append(tup[1])

print(device_labels, '\n', device_counts)

In [None]:
fig = go.Figure(data=[go.Pie(labels=device_labels, values=device_counts)])
fig.show()
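
The percentages the pie chart displays can also be computed by hand. The counts below are made up for illustration and are not taken from the dataset. Note that because `get_device_counts` keeps only the five most common devices, the chart shows shares among those five rather than among all tweets.

```python
from collections import Counter

# Made-up counts for illustration; they are not taken from the dataset.
device_totals = Counter({
    "Twitter for iPhone": 50,
    "Twitter for Android": 30,
    "Twitter Web Client": 20,
})

total = sum(device_totals.values())
# Share of each device as a percentage of the tweets being plotted.
shares = {
    device: round(100 * count / total, 1)
    for device, count in device_totals.most_common()
}
print(shares)
```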

## Analysis 3: Which words are used most often?

### Let's create a word cloud! 
A word cloud, or tag cloud, is a visual representation in which the size of each word depicts its frequency in the text.

In [None]:



def make_word_cloud(tweet_text, stopwords):        
    ball_mask = np.array(Image.open('images/ball.jpg'))
    
    # Generate a word cloud image
    wordcloud = WordCloud(background_color="white", mask=ball_mask,
               stopwords=stopwords, width=1000).generate(tweet_text)

    # Display the generated image:
    plt.figure( figsize=(20,10) )
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.show()

To create a word cloud, we get all the text from our tweets and remove the stop words. Stop words are the most common words in the English language and say little about the topic of a text. Stop words can be articles (a, an, the) or pronouns (you, we, I).

In [None]:
# Join the tweets with a space so that words from adjacent tweets do not fuse together
tweet_text = ' '.join(tweet['text'] for tweet in tweet_list)

stopwords = set(STOPWORDS)
stopwords.add('https')
stopwords.add('BCwn8xx039RT')

make_word_cloud(tweet_text, stopwords)
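
The word sizes in the cloud come from word frequencies computed after the stop words are dropped. The sketch below shows that idea with only the standard library; it is not how the `wordcloud` package tokenises text internally, and the stop word set and `word_frequencies` helper are assumptions made for this example.

```python
import re
from collections import Counter

# A tiny hand-picked stop word set; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "in", "and", "rt", "https"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into words and count the non-stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)


freqs = word_frequencies("The final is in Moscow and the final is tonight")
print(freqs.most_common(1))  # [('final', 2)]
```

A word with a higher count here would be drawn larger in the cloud.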

# Basic NLP


In [None]:
from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=sent_tokenize(text)
print(tokenized_text)

In [None]:
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)

In [None]:
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_word)
(fdist)

In [None]:
fdist.most_common()


In [None]:
import matplotlib.pyplot as plt
fdist.plot(30,cumulative=False)
plt.show()

#### Zipf's law
Zipf's law states that, given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table. So word number n has a frequency proportional to 1/n.
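
If frequency is proportional to 1/n, then rank multiplied by frequency should stay roughly constant down the table. The sketch below checks this on artificial tokens whose counts were chosen to follow 1/n exactly; real text only approximates the pattern. `rank_frequency` is a hypothetical helper written for this example.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) rows, most frequent first."""
    counts = Counter(tokens).most_common()
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(counts, start=1)
    ]


# Artificial tokens whose counts follow 1/n exactly: 12, 6, 4, 3.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
for row in rank_frequency(tokens):
    print(row)
```

For these tokens the last column is 12 in every row; on a real corpus one would expect it to vary, while staying within the same order of magnitude for the most frequent words.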



In [None]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

In [None]:
filtered_sent=[]
for w in tokenized_word:
    if w.lower() not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:",tokenized_word)
print("Filtered Sentence:",filtered_sent)

In [None]:
# Stemming - IDK HOW GOOD THIS IS
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

stemmed_words=[]
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))

print("Filtered Sentence:",filtered_sent)
print("Stemmed Sentence:",stemmed_words)

In [None]:
#Lexicon Normalization
#performing stemming and Lemmatization

from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

words = ["flying","crying", "passing"]
for word in words:
    print("Lemmatized Word:",lem.lemmatize(word,"v"))
    print("Stemmed Word:",stem.stem(word))
    print("---")

In [None]:
sent = "Albert Einstein was born in Ulm, Germany in 1879."

tokens=nltk.word_tokenize(sent)
print(tokens)
nltk.pos_tag(tokens)


In [None]:
from textblob import TextBlob 
sentences = ["This is a very bad session", "Tennis is a good sport", "Tennis is the best sport"]

In [None]:
for sent in sentences:
    print(sent)
    print(TextBlob(sent).sentiment)
    print("---")
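
TextBlob's `sentiment` reports a polarity between -1 (negative) and 1 (positive) together with a subjectivity between 0 and 1. To give a feel for what a polarity score means, the sketch below averages word scores from a tiny made-up lexicon. This is a toy model, not TextBlob's implementation, which is more elaborate; the lexicon values and the `toy_polarity` helper are assumptions made for this example.

```python
# Hypothetical mini-lexicon with made-up polarity scores in [-1, 1].
TOY_LEXICON = {"bad": -0.7, "good": 0.7, "best": 1.0}


def toy_polarity(sentence, lexicon=TOY_LEXICON):
    """Average the scores of the words found in the lexicon (0.0 if none)."""
    scores = [lexicon[w] for w in sentence.lower().split() if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


sentences = ["This is a very bad session", "Tennis is a good sport", "Tennis is the best sport"]
for sent in sentences:
    print(sent, "->", toy_polarity(sent))
```

The toy scorer ignores words such as "very" and "not", which is one reason a real analyser can give different values for the same sentences.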