# Twitter Data Collection and Visualisation

## In this notebook:

1. Attaching to the Twitter API
2. Searching for a specific user
3. Searching for a specific topic
4. Extending the search and working with multi-level JSON Data


# Attaching to the Twitter API

## Questions & Objectives

* Setting up access keys and authentication
* Setting up a handler to manage the connection
* Running a test search

First we import the libraries that handle access to the API (tweepy) and working with JSON data (json).

In [None]:
# Run this cell now to import the libraries

import tweepy #https://github.com/tweepy/tweepy
import json

We then set up the variables that hold the validation keys. You need to add your keys (tokens) and secrets in the spaces below. Make sure to put them between the quotation marks and make sure there are no extra spaces.

In [None]:
# Add in your keys and secrets then run this cell

access_key = ''
access_secret = ''
api_key = ''
api_secret = ''
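
Stray spaces or line breaks copied along with a key are a common cause of failed authentication. As a minimal sketch, a small check could catch them before the keys are used. The helper name `clean_credential` and the example value are hypothetical and are not part of tweepy.

```python
def clean_credential(name, value):
    """Return the credential with surrounding whitespace removed.

    Hypothetical helper for illustration only. Raises ValueError if the
    value is empty or has whitespace inside it.
    """
    cleaned = value.strip()
    if not cleaned:
        raise ValueError(f"{name} is empty; paste your key between the quotes")
    if any(ch.isspace() for ch in cleaned):
        raise ValueError(f"{name} contains a space or line break inside the key")
    return cleaned


# Example with a made-up value (not a real key)
example_key = clean_credential("api_key", "  abc123XYZ \n")
print(example_key)
```

The same call could be applied to each of the four variables above before they are passed on.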

We then set up the authentication handler. We pass the keys and secrets as below and then set up the api object. We can then use this object to connect to the API each time.

In [None]:
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

To test the connection we will run a test query.

We use the API object to ask for some of the recent tweets from the users that you follow.


In [None]:
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

# Searching for a Specific User

* Search for a specific user
* Retrieve data from the Twitter API
* Call specific items from the JSON data object
* Look at the full JSON data

We will now look for tweets from a specific person. To do this we need their Twitter name. If you go to https://twitter.com/BarackObama you can see the Twitter name under the display name. It has an @ sign in front, which we leave out when we pass the name to the API.

For this we use the **get_user** method of the Tweepy API object.

In [None]:
# First we ask the API for the information on the user BarackObama and hold it in an object.
# Naming the argument (screen_name=) is also what newer Tweepy versions require.
user = api.get_user(screen_name='BarackObama')

In [None]:
# The attributes of this object mirror the fields of the JSON that Twitter returns.
# We can read an attribute and print its content.
# We will look more at JSON later
# We can print the screen name as below

print(user.screen_name)

In [None]:
# We can print the number of followers -- check this is correct on the link to the Twitter page

print(user.followers_count)

In [None]:
# We can print the user description

print(user.description)

In [None]:
# To see all of the user information in its raw format we can type:

print(user)
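
The raw printout is dense and hard to scan. As a rough sketch, the json library imported earlier can lay the same kind of data out with one field per line. The `sample_user` dictionary below is made-up data with only a few fields. With a real user object, the assumption is that its `_json` attribute would be passed to `json.dumps` instead.

```python
import json

# Made-up sample data shaped like a few fields of a user record
sample_user = {
    "screen_name": "example_user",
    "followers_count": 1234,
    "description": "An invented account used for illustration",
    "entities": {"url": {"urls": [{"expanded_url": "https://example.org"}]}},
}

# indent=2 puts each field on its own line; sort_keys orders the field names
pretty = json.dumps(sample_user, indent=2, sort_keys=True)
print(pretty)

# Nested values are reached one key (or list index) at a time
print(sample_user["entities"]["url"]["urls"][0]["expanded_url"])
```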

## Minitask

* Try using the information from the user printout to access other fields.
* See if you can work out how to get to the nested fields
* Try looking at another user

In [None]:
# We can get tweets from the API user timeline
# This time we call the user_timeline method with the BarackObama screen name
# Here we ask for the last two tweets
# These are returned in a list object

new_tweets = api.user_timeline(screen_name='BarackObama', count=2)

In [None]:
# Here we can view the first tweet (which, remember, is at index 0 in a list)
new_tweets[0]

# Searching for a Topic

* Search the Twitter API using a keyword
* Retrieve the text from a single tweet
* Retrieve the text from multiple tweets
* Process and clean the text
* Visualise the text

We will now look for tweets that contain a specific word. 

For this we use the **search** method of the Tweepy API object (from Tweepy 4 onwards this method is named **search_tweets**).

In [None]:
# Here we are looking for the word covid
# We are asking for 10 English tweets to be returned
# This is returned as a list

brexit_tweets = api.search(q='covid', lang='en', count=10)

In [None]:
# we can print out the first in the list 

brexit_tweets[0]

This time, rather than reading attributes from the object (as we did with the user object), we work with the JSON directly. Each tweet object keeps its JSON in the _json attribute, which Python holds as a dictionary. We can then look up any item by its key.

(Remember that a dictionary item takes the form {'text': 'this is tweet text'}, which means that we can call for the content of the item by its key.)
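
As a small illustration of looking values up by key, here is a sketch on made-up data. The `sample_tweet` dictionary is invented and holds far fewer fields than a real tweet.

```python
# Made-up tweet data for illustration
sample_tweet = {"text": "this is tweet text", "lang": "en", "retweet_count": 3}

# Look up a value by its key
print(sample_tweet["text"])

# .get() returns a default instead of raising KeyError when a key is absent
place = sample_tweet.get("place", "no place recorded")
print(place)

# List the keys that are available
print(list(sample_tweet.keys()))
```

Using `.get()` is a cautious choice when you are not sure every tweet carries a given field.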

In [None]:
# here we can see all of the json in a nice format

brexit_tweets[0]._json

In [None]:
# Or we can just call the text

brexit_tweets[0]._json['text'] 

In [None]:
# We can put the text into its own list and work with just the text

tweets_text = []
for each in brexit_tweets:
    tweets_text.append(each._json['text'])

In [None]:
# we can see how we have put this into a list

print(tweets_text)

In [None]:
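# We can treat this like we did in earlier badges
# For example we can turn it into a string and tokenise it
# (word_tokenize needs NLTK's tokenizer data; if it is missing, NLTK raises a LookupError naming the resource to download)

from nltk.tokenize import word_tokenize
tweets_string = " ".join(tweets_text)
tokens = word_tokenize(tweets_string)
print(tokens[0:10])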

In [None]:
# We can clean it up like we did earlier, making it all lowercase and removing stopwords

import nltk
import string
nltk.download('stopwords')
from nltk.corpus import stopwords
lowercase_tokens = [token.lower() for token in tokens]
remove_these = set(stopwords.words('english') + list(string.punctuation) + list(string.digits))
filtered_text = [word 
                 for word in lowercase_tokens 
                 if word not in remove_these]
print(filtered_text)

In [None]:
# we can produce word frequencies
from collections import Counter
simple_frequencies_dict = Counter(filtered_text)
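
A Counter can also report its most frequent items directly, which is a quick check before drawing anything. This sketch uses made-up tokens (`sample_tokens`) standing in for the filtered text.

```python
from collections import Counter

# Made-up tokens standing in for the filtered text
sample_tokens = ["covid", "vaccine", "covid", "lockdown", "vaccine", "covid"]
sample_frequencies = Counter(sample_tokens)

# most_common(n) returns the n most frequent items as (word, count) pairs
top_two = sample_frequencies.most_common(2)
print(top_two)  # [('covid', 3), ('vaccine', 2)]
```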

In [None]:
# And word clouds
import matplotlib.pyplot as plt
from wordcloud import WordCloud

cloud = WordCloud(max_font_size=80,colormap="hsv").generate_from_frequencies(simple_frequencies_dict)
plt.figure(figsize=(16,12))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## Minitask

* Try using a visualisation method or a search method you have used before to visualise the text
* Try searching for a different word

# Extending the search and working with multi-level JSON Data

* Search the Twitter API using an extended query with multiple terms
* Search using a tweepy cursor to retrieve more data
* Look at nested data from the JSON

We will now look for tweets that contain any of several words. We can combine query words with the operator 'OR'. This operator says: give me tweets that contain word1 or word2. You might want to do this with related words on the same topic, or use it to cover multiple spellings or typos.
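
When there are many terms, the query string can be built from a list rather than typed by hand. This is a minimal sketch. The list of terms is hypothetical, and the operator is written in capitals so that it is not read as an ordinary word.

```python
# Hypothetical list of search terms
terms = ["covid", "covid19", "#covid"]

# Join the terms with an upper-case OR between them
query = " OR ".join(terms)
print(query)  # covid OR covid19 OR #covid
```

The resulting string could then be passed as the q argument of a search.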

For this we use the **search** method from the Twitter API.

We want to gather more data than we did before. A single call to the search method returns only a limited number of tweets. To retrieve more we use a Tweepy Cursor. Twitter returns the results as multiple pages, almost like a book, but it will only give you one page at a time. Before, we only took the first page. This time we will page through the results using a cursor object. The cursor keeps track of our position in the results and allows us to ask for the next page.
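
The paging idea can be sketched without any network access. Here a hypothetical generator, `fake_pages`, stands in for the API and hands back one list of items at a time. It is not a tweepy function, and the "tweets" are made-up strings.

```python
def fake_pages(items, page_size):
    """Yield successive lists of at most page_size items (a stand-in for API pages)."""
    for start in range(0, len(items), page_size):
        yield items[start:start + page_size]


# Made-up "tweets" for illustration
all_items = [f"tweet {n}" for n in range(1, 8)]

collected = []
for page in fake_pages(all_items, page_size=3):
    # Each page is itself a list, so extend() adds its items one by one
    collected.extend(page)

print(len(collected))  # 7
print(collected[0])    # tweet 1
```

Because each page is a list, extend() gives one flat list of tweets, whereas append() would give a list of pages.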


In [None]:
# We set up a list to hold the tweets so we can then add to it as we iterate through
# Previously the search created the list for us, but here it needs to exist before we can add to it

covid_tweets = []

# We then set up a tweepy cursor to page through the results
# We set up the query with the OR operator (written in capitals)
# We iterate through the pages from the API using a for loop
# Each page is a list of tweets, so we use extend to add its tweets to our list
for page in tweepy.Cursor(api.search, q='covid OR covid19 OR COVID OR COVID19 OR #covid', lang='en', min_retweets="1000").pages(100):
    covid_tweets.extend(page)


In [None]:
# we can see the text from the first tweet

print(covid_tweets[0]._json['text']) 

Twitter data is nested.

This means that it can contain items within items. 

For example, hashtags, user mentions, and URLs are contained within the entities dictionary.

In simplified form this looks like:

{'entities': {'hashtags': [{'text': 'hashtag1'}, {'text': 'hashtag2'}], 'user_mentions': [{'screen_name': 'barackobama'}], 'urls': [{'url': 'www.bbc.co.uk'}]}}
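
To make the nesting concrete, here is a sketch that walks a made-up tweet with the same simplified layout. The `sample_tweet` data is invented, and real tweets carry more fields inside each item.

```python
# Made-up tweet data following the simplified nested layout
sample_tweet = {
    "text": "Example text #hashtag1 #hashtag2",
    "entities": {
        "hashtags": [{"text": "hashtag1"}, {"text": "hashtag2"}],
        "user_mentions": [{"screen_name": "barackobama"}],
        "urls": [{"expanded_url": "https://www.bbc.co.uk"}],
    },
}

# Step down one level at a time: tweet -> entities -> list -> item -> field
entities = sample_tweet["entities"]
hashtags = [tag["text"] for tag in entities["hashtags"]]
mentions = [mention["screen_name"] for mention in entities["user_mentions"]]
links = [link["expanded_url"] for link in entities["urls"]]

print(hashtags)
print(mentions)
print(links)
```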

In [None]:
# The hashtags are contained in a list within the entities dictionary
# This means we need to look up entities, then hashtags, and then iterate through the list
# We set up a list to hold the hashtags so we can then append to it as we iterate through
# We iterate through each tweet, and then through the hashtags in the list
# We add them to the list

covid_hashtags = []
for each in covid_tweets:
    for hashtag in each._json['entities']['hashtags']:
        covid_hashtags.append(hashtag['text'])

In [None]:
# We can then visualise these hashtags in the ways we have learnt before

hashtag_string = " ".join(covid_hashtags)
tokens = word_tokenize(hashtag_string)
simple_frequencies_dict_covid = Counter(tokens)
cloud = WordCloud(max_font_size=80, colormap="viridis", background_color='white').generate_from_frequencies(simple_frequencies_dict_covid)
plt.figure(figsize=(16,12))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## Minitask

* Try creating a visualisation with a different nested item

In [None]:
# Have this as a task -- look for another item of interest, maybe alter to be URLs?
#covid_mentions = []
#for each in covid_tweets:
#   for mention in each._json['entities']['user_mentions']:
#        covid_mentions.append(mention['name'])
#people_dict=Counter(covid_mentions)

In [None]:
#cloud = WordCloud(max_font_size=80,background_color='white',colormap="viridis").generate_from_frequencies(people_dict)
#plt.figure(figsize=(16,12))
#plt.imshow(cloud, interpolation='bilinear')
#plt.axis('off')
#plt.show()

In [None]:
#covid_tweets = []
#for page in tweepy.Cursor(api.search, q='brexit', lang='en', min_retweets="1000").pages(100):
#    covid_tweets.append(page)