# Analysis of Twitter Data

## Play with Twitter Streaming API
API stands for Application Programming Interface. It is a tool that makes it easy to interact with computer programs and web services. Many web services provide APIs so that developers can interact with their services and access data programmatically. For this programming experiment, we will use the Twitter Streaming API to download tweets related to 2 keywords: "**big data**" and "**data analytic**".
### Step 1: Getting Twitter API keys
In order to access the Twitter Streaming API, we need 4 pieces of information from Twitter: *API key*, *API secret*, *Access token* and *Access token secret*. Follow the steps below to get all 4 elements:
* Create a Twitter account if you do not already have one.
* Go to https://apps.twitter.com/ and log in with your Twitter credentials.
* Click "Create New App".
* Fill out the form, agree to the terms, and click "Create your Twitter application".
* On the next page, click on the "API keys" tab, and copy your "API key" and "API secret".
* Scroll down and click "Create my access token", and copy your "Access token" and "Access token secret".

In [None]:
# Variables that contain the user credentials to access the Twitter API
# (consumer_key / consumer_secret hold the API key / API secret)
consumer_key = 'abc'
consumer_secret = 'def'
access_token = 'ijk'
access_token_secret = 'lmn'
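
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch of one alternative, the four values could be read from environment variables instead. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical choices for this illustration; neither Twitter nor Tweepy requires them.

```python
import os

# Hypothetical environment variable names; use whatever names you exported.
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}

def load_credentials(environ=None):
    """Return a dict with the four credentials, or raise if any is missing."""
    environ = os.environ if environ is None else environ
    missing = sorted(var for var in CREDENTIAL_VARS.values() if not environ.get(var))
    if missing:
        raise KeyError('Missing environment variables: %s' % ', '.join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

With the variables exported, `creds = load_credentials()` returns a dictionary whose entries can be assigned to `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.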

### Step 2: Connecting to Twitter Streaming API and downloading data
We will be using a Python library called **Tweepy** to connect to the Twitter Streaming API and download the data.

If you don't have Tweepy installed on your machine, go to https://github.com/tweepy/tweepy and follow the installation instructions.

You can also run '*pip install tweepy*' from your Anaconda installation.

In [None]:
# Import the necessary classes from the tweepy library
# (this code is written against the Tweepy 3.x API, which provides StreamListener)
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json

# A basic listener that appends each received tweet to tweets.json
# and prints its timestamp and text to stdout.
class MyListener(StreamListener):

    def on_data(self, data):
        try:
            with open('tweets.json', 'a') as f:
                f.write(data)
            dat = json.loads(data)
            print("%s %s" % (dat['created_at'], dat['text']))
        except Exception as e:
            # Report the problem but keep listening
            print("--> Error on_data: %s" % str(e))
        # Returning True keeps the stream connection open
        return True

    def on_error(self, status):
        # Print the status code reported by the stream
        print(status)

if __name__ == '__main__':

    # This handles Twitter authentication and the connection to the Twitter Streaming API
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    twitter_stream = Stream(auth, MyListener())

    # This line filters the Twitter stream to capture tweets matching the keywords and hashtags below
    twitter_stream.filter(track=['big data', 'data analytic', 'data science', '#bigdata', '#datascience', '#dataanalytic'])
    #twitter_stream.filter(track=['*'], languages=['th'])
    #twitter_stream.filter(track=['*'])

## Reading and Understanding the data
The data that we stored in tweets.json is in **JSON** format. JSON stands for *JavaScript Object Notation*. This format makes it easy for humans to read the data and for machines to parse it. Below is an example of one tweet in JSON format. You can see that the tweet contains additional information besides the main text, which in this example is: "*How #BigData and CRM are Shaping Modern Marketing https:\/\/t.co\/TgUYSUp9jT https:\/\/t.co\/V54kea8cT2*".

{"created_at":"Wed Oct 26 16:32:49 +0000 2016","id":791316663312457728,"id_str":"791316663312457728","text":"How #BigData and CRM are Shaping Modern Marketing https:\/\/t.co\/TgUYSUp9jT https:\/\/t.co\/V54kea8cT2","display_text_range":[0,73],"source":"\u003ca href=\"http:\/\/www.sociallymap.com\" rel=\"nofollow\"\u003eSociallymap\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":4327758735,"id_str":"4327758735","name":"Globe Trotter BI","screen_name":"GlobeTrotter_BI","location":null,"url":null,"description":"* R\u00e9seau international de consultants BI *      #Data #BusinessIntelligence #bigdata #datascientist #datamanagement","protected":false,"verified":false,"followers_count":104,"friends_count":212,"listed_count":50,"favourites_count":13,"statuses_count":318,"created_at":"Mon Nov 30 10:15:23 +0000 2015","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"fr","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":....,"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1477499569143"}
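
To make the structure concrete, here is a small sketch of how such a record maps onto Python objects once parsed. The string `raw` is a shortened, hand-written stand-in that keeps only a few fields of the tweet above; the `"place": null` entry is an assumption added for illustration and is not visible in the excerpt.

```python
import json

# Shortened, hand-written stand-in for a tweet (not the full record above).
raw = '''
{"created_at": "Wed Oct 26 16:32:49 +0000 2016",
 "text": "How #BigData and CRM are Shaping Modern Marketing",
 "lang": "en",
 "user": {"screen_name": "GlobeTrotter_BI", "lang": "fr"},
 "place": null}
'''

tweet = json.loads(raw)               # JSON object -> Python dict

print(tweet['text'])                  # top-level field
print(tweet['user']['screen_name'])   # nested JSON object -> nested dict
print(tweet['place'] is None)         # JSON null -> Python None
```

Note that the tweet-level `lang` and the user's `lang` are separate fields and can differ, as they do in this example.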

For the remainder of this lab, we will be using 4 Python libraries: *json* for parsing the data, *pandas* for data manipulation, *matplotlib* for creating charts, and *re* for regular expressions.

The *json* and *re* libraries are part of the Python standard library. You should install *pandas* and *matplotlib* if you don't have them on your machine.

We will start by importing *json*, *pandas*, and *matplotlib* using the commands below:

In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt

Next, we will read the data into a list that we call **tweets_data**.

In [None]:
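import sys
tweets_data_path = 'C:\\Program Files\\Anaconda2\\tweets_bigData_dataAnalytic.json'

tweets_data = []
count = 0
# Read the file line by line; each line is expected to hold one tweet as JSON
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            count = count + 1
            tweet = json.loads(line)
            tweets_data.append(tweet)
            # Progress indicator: one dot per 100 lines, a newline every 7000
            if count % 100 == 0:
                sys.stdout.write('.')
            if count % 7000 == 0:
                sys.stdout.write('\n')
        except Exception as e:
            # Report lines that cannot be parsed and move on
            print(e)
            continue
print("\n%s lines read, %s tweets parsed." % (count, len(tweets_data)))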
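
The loop above treats the file as line-delimited JSON: one tweet object per line. The same idea can be sketched as a small reusable function. This version assumes that blank lines may be present in the file and should be skipped silently, and that malformed lines should be counted rather than printed; `parse_json_lines` is a hypothetical helper name, not part of any library.

```python
import io
import json

def parse_json_lines(lines):
    """Parse an iterable of text lines; return (records, number_of_bad_lines)."""
    records, n_bad = [], 0
    for line in lines:
        line = line.strip()
        if not line:            # skip blank lines
            continue
        try:
            records.append(json.loads(line))
        except ValueError:      # raised by json.loads on malformed JSON
            n_bad += 1
    return records, n_bad

# In-memory stand-in for a tweets file: two valid lines, one blank, one truncated.
sample = io.StringIO(u'{"text": "first"}\n\n{"text": "second"}\n{"text": "trunc\n')
records, n_bad = parse_json_lines(sample)
print(len(records), n_bad)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly in place of the `io.StringIO` sample.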

Next, we will structure the tweets data into a pandas *DataFrame* to simplify the data manipulation. We will start by creating an empty DataFrame called **tweets** using the following command.

In [None]:
tweets = pd.DataFrame()

Next, we will add 3 columns to the **tweets** DataFrame called *text*, *lang*, and *country*. The *text* column contains the tweet, the *lang* column contains the language in which the tweet was written, and the *country* column contains the country from which the tweet was sent.

In [None]:
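# List comprehensions build one value per tweet (a bare map() is a lazy iterator in Python 3)
tweets['text'] = [tweet.get('text', None) for tweet in tweets_data]
tweets['lang'] = [tweet.get('lang', None) for tweet in tweets_data]
# 'place' can be null, so guard before reading the nested 'country' field
tweets['country'] = [tweet['place']['country'] if tweet.get('place') is not None else None for tweet in tweets_data]
print(tweets.head(10))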
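
The conditional expression used for the *country* column guards against tweets whose `place` field is null. A small sketch of the same logic as a named function, which also tolerates records that lack the `place` or `country` keys entirely; `get_country` and the sample records are hypothetical and exist only for this illustration.

```python
def get_country(tweet):
    """Return the tweet's country name, or None if no place information exists."""
    place = tweet.get('place') or {}   # treat a missing or null 'place' as empty
    return place.get('country')

# Hand-written sample records covering the three cases.
samples = [
    {'text': 'a', 'place': {'country': 'France', 'country_code': 'FR'}},
    {'text': 'b', 'place': None},
    {'text': 'c'},
]
print([get_country(t) for t in samples])
```

A list built this way, `[get_country(t) for t in tweets_data]`, could be assigned to the *country* column in place of the inline conditional.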

Next, we will create a chart describing the Top 15 countries from which the tweets were sent.

In [None]:
%matplotlib inline
tweets_by_country = tweets['country'].value_counts()

fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=10)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Countries', fontsize=12)
ax.set_ylabel('Number of tweets' , fontsize=12)
ax.set_title('Top 15 countries', fontsize=12, fontweight='bold')
tweets_by_country[:15].plot(ax=ax, kind='bar', color='blue')
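
For intuition about what `value_counts()` computes: it counts how often each distinct value occurs, leaves out missing values by default, and sorts the result in descending order, so slicing the first 15 entries yields the top 15. A rough standard-library sketch of the same counting, using a hand-written list of country names as stand-in data:

```python
from collections import Counter

# Hand-written stand-in for the 'country' column; None marks a tweet without a place.
countries = ['United States', None, 'India', 'United States',
             None, 'France', 'India', 'United States']

# Count the non-missing values, then take the most frequent ones.
counts = Counter(c for c in countries if c is not None)
top = counts.most_common(2)
print(top)
```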

### **YOUR TURN**
#### Problem01:
Create a chart describing the Top 15 languages in which the tweets were written.

In [None]:
'''
YOUR TURN NOW: 
1) create a chart describing the top 10 native languages that 
the Twitter users speak.
2) create a chart describing ...
''' 
