<a href="https://colab.research.google.com/github/ade1986/testrepository/blob/master/COMP1804_Lab3_Exercises_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


___
# COMP1804 Lab2.03 - Python Exercises (Collecting and Visualising Twitter Data)
---

We are now going beyond Python basics to experiment with the Python libraries for handling data. Yes, we're jumping in at the deep end here, and this is a good strategy for building on the basics covered so far and learning along the way.

In this lab, you will learn how to collect Twitter data, and you will use the simple string manipulation techniques covered last week, together with the pandas library, to compare the popularity of given words. You will also learn how to use the pyplot library to plot simple graphs.

Social media gives organisations and businesses an unprecedented opportunity to connect with people and customers and to identify relevant prospects. Analysis of online text data is frequently used for opinion mining, sentiment analysis, understanding followers, analysing the reach and results of polls or posts, identifying influencers, improving ROI (return on investment), and much more.

**In this lab we will be using the Twitter Streaming API to download tweets related to the following keyword: "Brexit". By the end of this lab, you should be able to collect Twitter data using Python, do simple analysis, plot small graphs and export the data as a csv/txt file.**




___

#1 Collecting Twitter Data using the Twitter Streaming API
Twitter provides a number of APIs that let developers access data programmatically. In this lab we will be using the Twitter Streaming API to download tweets related to the keyword "Brexit"; the same filter can be extended to further keywords such as "Leave" and "Remain". Twitter also provides a REST API for the same purpose - you can find out more about this [here](https://developer.twitter.com/en.html).

##a) The Twitter Streaming API requires: an API key, API secret, Access Token and Access Token Secret
> To get your keys and tokens, complete the following (detailed instructions in [COMP1804_Lab3_SupportingMaterial-Twitter](https://moodlecurrent.gre.ac.uk/pluginfile.php/1188895/mod_resource/content/4/COMP1804-1819-Lab3%20SupportingMaterial-Twitter.pdf)):

1.   Create a Twitter account at https://twitter.com/signup, if you do not already have one.
2.   To collect Twitter data you need to create a Twitter app. Go to https://apps.twitter.com/ and log in with your Twitter credentials.
3.   Click "Create New App".
4.   Fill out the form, agree to the terms, and click "Create your Twitter application".
5.   On the next page, click the "API keys" tab, and copy your "API key" and "API secret".
6.   Scroll down and click "Create my access token", and copy your "Access token" and "Access token secret".



Steps 5 and 6 generate your credentials which are needed for accessing the Twitter APIs. They are used in the Python code below to connect to the Twitter Streaming API. 
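
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the four values from environment variables. The variable names (`TWITTER_API_KEY` and so on) and the helper `load_credentials` are assumptions made for this illustration; neither Twitter nor Tweepy requires them.

```python
import os

# Hypothetical environment variable names chosen for this example.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_API_KEY",
    "consumer_secret": "TWITTER_API_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise KeyError if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}


# Demonstration with a fake environment; real values would come from os.environ.
fake_env = {var: "dummy-" + name for name, var in CREDENTIAL_VARS.items()}
creds = load_credentials(fake_env)
print(sorted(creds))
```

The returned dictionary could then supply the four variables used in the streaming code, for example `consumer_key = creds["consumer_key"]`.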

##b) Connecting to the Twitter Streaming API and downloading data
The Python Tweepy module is needed to connect to the Twitter Streaming API and to download the data. [Tweepy](https://github.com/tweepy/tweepy) is an open-source module, hosted on GitHub, that provides a set of simple methods to communicate with the Twitter platform and use its API. 

If Tweepy is not already installed in your PyCharm, follow the installation instructions in [COMP1804_Lab3_SupportingMaterial-tweepy](https://moodlecurrent.gre.ac.uk/pluginfile.php/1167723/course/section/674707/COMP1804-1819-Lab3%20SupportingMaterial-Tweepy.pdf).

The code below streams tweets that match the keyword filter, prints each one, and appends it to the file BrexitTweets.txt. Make sure you replace the values of consumer_key, consumer_secret, access_token, and access_token_secret with the credentials generated in steps 5 and 6 above.

When you run the code, you will see the data streamed in the output box below the cell. To interrupt the streaming, click the stop icon on the left of the cell (or press Ctrl+M I).

In [0]:
#Import the necessary classes from the tweepy library
#(StreamListener exists in Tweepy 3.x; it was removed in Tweepy 4.0)
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener


#Variables that contain the user credentials to access the Twitter API
consumer_key = "ENTER YOUR API KEY"                #API Key
consumer_secret = "ENTER YOUR API SECRET"          #API Secret
access_token = "ENTER YOUR ACCESS TOKEN"           #Access Token
access_token_secret = "ENTER YOUR TOKEN SECRET"    #Token Secret


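#This is a basic listener that prints received tweets to stdout and appends them to a file.
class Listener(StreamListener):

    # Override on_data from StreamListener: called with each raw JSON message
    def on_data(self, data):
        print(data)
        # Append the raw message to the file; 'with' closes the file even if an error occurs
        with open('BrexitTweets.txt', 'a') as brexitFile:
            brexitFile.write(data)
        return True  # keep the stream open

    #def on_status(self, status):
    #    print(status.text)

    def on_error(self, status_code):
        # 420 means the client is being rate limited; returning False disconnects the stream
        if status_code == 420:
            return False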


#Handles Twitter authentication and the connection to the Twitter Streaming API
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, Listener())

#Filter the Twitter stream to capture data by keyword; add 'leave' and 'remain' to the list to track them too
stream.filter(track=['brexit'])
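
The listener stores each message as one JSON document per line, which is what allows the file to be read back line by line later. The sketch below reproduces that storage pattern without a network connection. The helper `append_message` and the sample messages are made up for illustration, and a temporary file stands in for BrexitTweets.txt.

```python
import json
import os
import tempfile


def append_message(path, raw_message):
    """Append one raw JSON message to the file, ending it with a single newline."""
    with open(path, "a", encoding="utf-8") as out:
        out.write(raw_message.rstrip() + "\n")


# Hand-written stand-ins for streamed messages (not real tweets).
sample_messages = [
    json.dumps({"id": 1, "text": "first sample"}),
    json.dumps({"id": 2, "text": "second sample"}),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "SampleTweets.txt")
    for message in sample_messages:
        append_message(path, message)

    # Read the file back: one JSON document per line, skipping blank lines.
    with open(path, encoding="utf-8") as source:
        restored = [json.loads(line) for line in source if line.strip()]

print(len(restored))
print(restored[1]["text"])
```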
   

##c) Reading collected data from file


The Twitter Streaming API returns tweets in JSON format. JSON stands for JavaScript Object Notation. This format makes the data easily readable by both humans and machines. Open the tweet file generated in the previous step and note the additional information it contains apart from the main tweet text. The following map of Twitter message details by Raffi Krikorian (2010) illustrates the JSON notation used and the corresponding meaning. More detailed and up-to-date information on the tweet format and the additional information tweets contain can be found on the [Twitter developer website](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object).



![alt text](https://raw.githubusercontent.com/cocoxu/socialmedia-class.github.io/master/assets/img/raffi-krikorian-map-of-a-tweet.png)
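
To make the mapping from JSON to Python concrete, here is a minimal sketch that parses a single message with `json.loads`. The message is hand-written and heavily shortened, so it only imitates the shape of a tweet and is not real data.

```python
import json

# A hand-written, heavily shortened message in the shape of a tweet (not real data).
raw = '''
{
  "id": 123,
  "created_at": "Tue Jan 01 12:00:00 +0000 2019",
  "text": "Sample text #example",
  "lang": "en",
  "user": {"id": 456, "name": "Sample User", "screen_name": "sample_user"},
  "entities": {"hashtags": [{"text": "example", "indices": [12, 20]}]},
  "place": null
}
'''

tweet = json.loads(raw)

# JSON objects become dicts, arrays become lists, and null becomes None.
print(type(tweet).__name__)
print(tweet["user"]["screen_name"])
print([tag["text"] for tag in tweet["entities"]["hashtags"]])
print(tweet["place"] is None)
```

Nested values are reached by chaining keys, which is why a field such as the country needs a check for `None` before `tweet["place"]["country"]` can be read safely.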

###The following is an example of using the Python json or simplejson modules to read in the tweet data in JSON format and display it in a more readable form.
The code demonstrates how to read and process a subset of the tweet details. Other data in JSON format can be processed in a similar way.


In [0]:
# Import the necessary package to process data in JSON format
try:
    import json
except ImportError:
    import simplejson as json

# Read in the file saved in the previous step
tweets_filename = 'BrexitTweets.txt'

# Tidy up the tweets so they're easier to read
with open(tweets_filename, "r") as tweets_file:
    for line in tweets_file:
        try:
            # Read in one line of the file and convert it into a Python dictionary
            tweet = json.loads(line.strip())
            if 'text' in tweet:                     # only messages containing a 'text' field are displayed
                print(tweet['id'])                  # the tweet's id
                print(tweet['created_at'])          # when the tweet was posted
                print(tweet['text'])                # content of the tweet

                print(tweet['user']['id'])          # id of the user who posted the tweet
                print(tweet['user']['name'])        # name of the user
                print(tweet['user']['screen_name']) # name of the user account

                # collect the text of every hashtag used in the tweet
                hashtags = []
                for hashtag in tweet['entities']['hashtags']:
                    hashtags.append(hashtag['text'])
                print(hashtags)

        except (ValueError, KeyError):
            # skip blank lines, lines that are not valid JSON, and messages missing an expected field
            continue

#2 Analysing and visualising collected Twitter data
For this we'll use the json module to parse the tweets, the pandas module for data manipulation, and matplotlib for plotting charts.

In [0]:
import json
import pandas as pd
import matplotlib.pyplot as plt

We will read in the data from BrexitTweets.txt into a list called brexitTweets.

In [0]:
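# Read in the file saved in the previous step
tweets_filename = 'BrexitTweets.txt'

brexitTweets = []
with open(tweets_filename, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
        except ValueError:
            # skip blank lines and lines that are not valid JSON
            continue
        # keep only actual tweets; the stream also sends other messages (e.g. limit notices)
        if 'text' in tweet:
            brexitTweets.append(tweet)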



We can then use the built-in len() function to find out how many tweets we collected.

In [0]:
print(len(brexitTweets))
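
With the tweets loaded, simple string methods are enough to compare how often given words are mentioned. The sketch below counts the tweets whose text contains each keyword, ignoring case. The function name `count_keyword_mentions` and the sample texts are invented for illustration; in the lab the texts would come from `brexitTweets`.

```python
def count_keyword_mentions(texts, keywords):
    """Return {keyword: number of texts containing it}, compared case-insensitively."""
    counts = {keyword: 0 for keyword in keywords}
    for text in texts:
        lowered = text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts


# Invented sample texts standing in for collected tweets.
sample_texts = [
    "Brexit talks continue",
    "Should we LEAVE or remain?",
    "Remain campaigners on Brexit",
]
mentions = count_keyword_mentions(sample_texts, ["brexit", "leave", "remain"])
print(mentions)
```

Note that `in` performs substring matching, so "remain" would also match "remaining"; splitting the text into words first would give stricter counts.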

pandas can be used to structure the tweet data into a DataFrame, which simplifies the data manipulation. A DataFrame is a 2-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). 

Here, we'll create a DataFrame called tweetsDF with three columns: text, lang, and country. The text column contains the main tweet text, the lang column contains the language in which the tweet was written, and the country column contains the country from which the tweet was sent.

We first create an empty tweetsDF. Then the map() function applies a lambda function, which extracts the given key, to each JSON element of the brexitTweets list, and a list of the results is assigned to the corresponding column of tweetsDF.
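
To see what `map()` with a lambda produces before the result is assigned to a column, here is a small sketch on hand-made dictionaries. The sample data is invented and only mimics the fields used in this lab; the list comprehension at the end is shown as an equivalent alternative.

```python
# Hand-made dictionaries in the shape of parsed tweets (not real data).
sample_tweets = [
    {"text": "first", "lang": "en", "place": {"country": "United Kingdom"}},
    {"text": "second", "lang": "fr", "place": None},
]

# map() calls the lambda once per element; list() collects the results.
texts = list(map(lambda tweet: tweet["text"], sample_tweets))

# 'place' can be None, so the country is only read when a place exists.
countries = list(map(
    lambda tweet: tweet["place"]["country"] if tweet["place"] is not None else None,
    sample_tweets,
))

# An equivalent list comprehension, often considered easier to read.
langs = [tweet["lang"] for tweet in sample_tweets]

print(texts)
print(countries)
print(langs)
```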

In [0]:
tweetsDF = pd.DataFrame()

In [0]:
tweetsDF['text'] = list(map(lambda tweet: tweet['text'], brexitTweets))
tweetsDF['lang'] = list(map(lambda tweet: tweet['lang'], brexitTweets))
tweetsDF['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] is not None else None, brexitTweets))
tweetsDF.describe()

Now that we've got the tweets in a more manageable data structure, it's easy to use pyplot from matplotlib to plot charts.
The following code finds the top 5 languages the tweets were written in and displays them in a bar chart.


In [0]:
tweets_by_lang = tweetsDF['lang'].value_counts()

fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Languages', fontsize=15)
ax.set_ylabel('Number of tweets' , fontsize=15)
ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
tweets_by_lang[:5].plot(ax=ax, kind='bar', color='green')
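
`value_counts()` tallies how often each distinct value occurs and returns the counts in descending order, leaving out missing values by default. To make that concrete, the sketch below reproduces the same kind of tally with `collections.Counter` from the standard library instead of pandas. The language codes are invented sample data.

```python
from collections import Counter

# Invented language codes; None stands for a missing value.
sample_langs = ["en", "fr", "en", None, "de", "en", "fr"]

# Drop missing values first, as value_counts() does by default.
lang_counts = Counter(lang for lang in sample_langs if lang is not None)

# most_common(n) returns the n most frequent values, highest count first.
top_two = lang_counts.most_common(2)
print(top_two)
```

Slicing the pandas result with `[:5]` plays the same role as `most_common(5)` here: it keeps only the most frequent entries for plotting.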

The following code plots the top 5 countries from which the tweets were sent.

In [0]:
tweets_by_country = tweetsDF['country'].value_counts()

fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Countries', fontsize=15)
ax.set_ylabel('Number of tweets' , fontsize=15)
ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
tweets_by_country[:5].plot(ax=ax, kind='bar', color='grey')

The same results can also be displayed in a pie chart.

In [0]:
tweets_by_country = tweetsDF['country'].value_counts()
tweets_by_country[:5].plot(kind='pie', label='Top 5 Countries')
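
The remaining goal of this lab is exporting the data as a csv file. One possible approach, sketched below with the standard-library `csv` module, writes one row per tweet using the same three columns as `tweetsDF`. The sample rows and the file name are invented for illustration, and a temporary directory is used so that nothing is left behind.

```python
import csv
import os
import tempfile

# Invented rows with the same columns as tweetsDF.
rows = [
    {"text": "first sample, with a comma", "lang": "en", "country": "United Kingdom"},
    {"text": "second sample", "lang": "fr", "country": None},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "SampleTweets.csv")

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=["text", "lang", "country"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back to check the round trip.
    with open(csv_path, newline="", encoding="utf-8") as source:
        exported = list(csv.DictReader(source))

print(exported[0]["text"])
print(exported[1]["country"] == "")
```

The csv module quotes fields that contain commas and writes `None` as an empty field. With pandas, `tweetsDF.to_csv('BrexitTweets.csv', index=False)` writes the whole DataFrame in a single call.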

#3 Activity
Once you've completed this notebook, experiment with the same code in PyCharm. 

In your PyCharm project, create a file called Lab3_TwitterStreaming.py and copy the streaming code into it. Remember to replace all strings starting with "ENTER YOUR..." with your own Twitter credentials for consumer_key, consumer_secret, access_token, and access_token_secret.