# Importing data from Twitter with Python and tweepy

<img src="./images/tweepy.png" alt="Twitter logo" style="width: 800px;" align="left"/>

[Twitter](https://twitter.com) provides an API that lets you download data from this social network. To do this we will use Python and the [tweepy](https://github.com/tweepy/tweepy) library.

The aim is to retrieve tweets related to the word 'NoSQL' and store them in a file for later analysis.

The first thing to do is [register](https://www.google.es/webhp?q=create+twitter+app) a new Twitter application via the [Twitter Application Management](https://apps.twitter.com/) page.

After registering our application, Twitter gives us the keys we need to access its API.

In [1]:
import ConfigParser

configParser = ConfigParser.RawConfigParser()   
configParser.read("config.properties")


['config.properties']

In [2]:
consumer_key = configParser.get('Twitter', 'consumer_key')
consumer_secret = configParser.get('Twitter', 'consumer_secret')
access_token_key = configParser.get('Twitter', 'access_token_key')
access_token_secret = configParser.get('Twitter', 'access_token_secret')
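
The notebook does not show `config.properties` itself. As a minimal sketch, assuming an INI-style file with a `[Twitter]` section and the four option names read above, the layout could look like the sample below. The values are placeholders, and the temporary file and the names `SAMPLE`, `parser` and `sample_key` are only there to keep the example self-contained.

```python
import os
import tempfile

try:
    import configparser                      # Python 3 module name
except ImportError:
    import ConfigParser as configparser      # Python 2 name, as used in this notebook

# Hypothetical file contents: placeholder values, not real keys.
SAMPLE = """[Twitter]
consumer_key = YOUR_CONSUMER_KEY
consumer_secret = YOUR_CONSUMER_SECRET
access_token_key = YOUR_ACCESS_TOKEN
access_token_secret = YOUR_ACCESS_TOKEN_SECRET
"""

# Write the sample to a temporary file so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "config.properties")
with open(path, "w") as f:
    f.write(SAMPLE)

parser = configparser.RawConfigParser()
loaded = parser.read(path)                   # list of files that were parsed successfully
sample_key = parser.get("Twitter", "consumer_key")
```

Keeping the keys in a separate file like this means they never appear in the notebook itself.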

The next step is to import the library and log in to Twitter.

In [3]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)

api = tweepy.API(auth)

## Testing the library

To test the library we will retrieve information about a Twitter user: name, description, number of followers, and so on.

In [4]:
user = api.get_user('NoSQLDigest')

In [5]:
print "Name:", user.screen_name
print "Description:", user.description
print "Followers count:", user.followers_count
print "Friends' count:", user.friends_count
print "Statuses Count [Number of Tweets]: ", user.statuses_count


Name: NoSQLDigest
Description: NoSQL Digest of tweets.
Followers count: 9828
Friends' count: 12
Statuses Count [Number of Tweets]:  668072


## Retrieving Tweets via a search term

In [6]:
lookup = 'NoSQL'

The following call downloads tweets quickly, but a single search request returns at most 100 tweets (a limit set by Twitter), no matter how large `count` is.

In [7]:
max_tweets = 200
search_results = api.search(q=lookup, lang = 'en', count=max_tweets)

print len(search_results)


100


In [8]:
from prettytable import PrettyTable

table = PrettyTable(["User", "Date", "Text"])
table.align["User"] = "l"
table.align["Text"] = "l"

for tweet in search_results[0:10]:
    table.add_row([tweet.user.screen_name, tweet.created_at, tweet.text[:60]])

print table

+-----------------+---------------------+--------------------------------------------------------------+
| User            |         Date        | Text                                                         |
+-----------------+---------------------+--------------------------------------------------------------+
| MurphyRichard01 | 2015-10-27 17:53:02 | Hiring - Java with NoSQL developer in Mountain View, CA http |
| LJ_Blanchard    | 2015-10-27 17:52:12 | RT @jose_garde: NoSQL is still the cool kid in class - https |
| Loui_Picard     | 2015-10-27 17:47:18 | RT @jose_garde: NoSQL is still the cool kid in class - https |
| craigmullins    | 2015-10-27 17:46:43 | dashDB use cases: Standalone cloud DW, dev v. QA, data scien |
| opusrs          | 2015-10-27 17:46:09 | Big Data Architect - noSQL, Big Data Analytics, Hadoop, Cent |
| ScriptingMySQL  | 2015-10-27 17:43:01 | NoSQL simply isn't hip anymore                               |
| ...             | ...                 | ...                                                          |
+-----------------+---------------------+--------------------------------------------------------------+

## Retrieving a user's timeline


In [9]:
timeline_results = api.user_timeline(screen_name = 'NoSQLDigest', count = 1000, include_rts = True)
len(timeline_results)

197

## Retrieving Tweets via a search term using a cursor

This method uses a cursor, which requests page after page of results, so we are no longer restricted to the tweets returned by a single call ;-)

In [10]:
c = tweepy.Cursor(api.search, q= lookup).items()
    
search_results = []
while True:
    try:
        tweet = c.next()
        # Insert into db
        search_results.append(tweet)
    except tweepy.TweepError as e:
        print e
        break

print len(search_results)

{"errors":[{"message":"Rate limit exceeded","code":88}]}
1283


This fails too! In this case it is because we hit Twitter's rate limit (error code 88) :-(
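
Error code 88 means the rate limit was exceeded, and Twitter's limits reset in 15-minute windows, so one option is to pause and then carry on. The sketch below shows that idea on a toy iterator. `drain`, `FakeRateLimitError` and `FlakyIterator` are hypothetical names invented for this example, and it assumes the iterator can be resumed after raising (a plain generator cannot). Depending on the version, `tweepy.API` may also accept a `wait_on_rate_limit=True` argument that does this waiting for you.

```python
import time


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit error raised by tweepy."""


def drain(iterator, rate_limit_error, wait_seconds=15 * 60, max_waits=3, sleep=time.sleep):
    """Collect items from `iterator`, pausing whenever `rate_limit_error` is raised."""
    items, waits = [], 0
    while True:
        try:
            items.append(next(iterator))
        except StopIteration:
            break                      # no more results
        except rate_limit_error:
            if waits == max_waits:
                break                  # give up and keep what we have so far
            waits += 1
            sleep(wait_seconds)        # wait for the rate-limit window to reset
    return items


class FlakyIterator(object):
    """Toy iterator that raises once before returning its third item."""

    def __init__(self, values):
        self.values = list(values)
        self.position = 0
        self.failed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.position == 2 and not self.failed:
            self.failed = True
            raise FakeRateLimitError("Rate limit exceeded")
        if self.position >= len(self.values):
            raise StopIteration
        value = self.values[self.position]
        self.position += 1
        return value

    next = __next__                    # Python 2 spelling of the same method


# Record the requested pauses instead of really sleeping for 15 minutes.
pauses = []
collected = drain(FlakyIterator("abcd"), FakeRateLimitError, sleep=pauses.append)
```

Passing `sleep` as a parameter keeps the helper easy to try out without actually waiting.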

## Retrieving a user's timeline using a cursor

In [11]:
import sys

c = tweepy.Cursor(api.user_timeline,id='NoSQLDigest').items()    
timeline_results = []
while True:
    try:
        tweet = c.next()
        # Insert into db
        timeline_results.append(tweet)
    except:
        print "Error: ", sys.exc_info()[0]
        break

print len(timeline_results)

Error:  <type 'exceptions.StopIteration'>
3133


## Visualizing the content of a Tweet

Twitter returns data in [JSON](http://www.json.org/json-es.html) format. Let's see what a tweet looks like:

In [12]:
import pprintpp

tweet = search_results[0]
pprintpp.pprint(tweet._json)

{
    u'contributors': None,
    u'coordinates': None,
    u'created_at': u'Sun Oct 25 06:49:22 +0000 2015',
    u'entities': {
        u'hashtags': [],
        u'symbols': [],
        u'urls': [
            {
                u'display_url': u'mag.sqlauthority.com',
                u'expanded_url': u'http://mag.sqlauthority.com/',
                u'indices': [116, 139],
                u'url': u'https://t.co/hSEh8oTK8c',
            },
        ],
        u'user_mentions': [],
    },
    u'favorite_count': 0,
    u'favorited': False,
    u'geo': None,
    u'id': 658173486075129856,
    u'id_str': u'658173486075129856',
    u'in_reply_to_screen_name': None,
    u'in_reply_to_status_id': None,
    u'in_reply_to_status_id_str': None,
    u'in_reply_to_user_id': None,
    u'in_reply_to_user_id_str': None,
    u'is_quote_status': False,
    u'lang': u'en',
    u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'},
    u'place': None,
    u'possibly_sensitive': False,
    u're
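
Because `tweet._json` is an ordinary nested dictionary, individual fields can be read with plain indexing. The sketch below uses a hand-copied fragment of the tweet printed above; `sample`, `expanded_urls`, `hashtags` and `coordinates` are names chosen for this example only.

```python
# Fragment of the tweet shown above, copied by hand for illustration.
sample = {
    "created_at": "Sun Oct 25 06:49:22 +0000 2015",
    "entities": {
        "hashtags": [],
        "urls": [
            {
                "display_url": "mag.sqlauthority.com",
                "expanded_url": "http://mag.sqlauthority.com/",
                "indices": [116, 139],
                "url": "https://t.co/hSEh8oTK8c",
            },
        ],
        "user_mentions": [],
    },
    "id_str": "658173486075129856",
    "lang": "en",
}

# Lists of entities can be flattened with a comprehension.
expanded_urls = [u["expanded_url"] for u in sample["entities"]["urls"]]
hashtags = [h["text"] for h in sample["entities"]["hashtags"]]

# .get() returns None for keys that are missing from this fragment.
coordinates = sample.get("coordinates")
```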

## Parsing the result

The following functions simplify the information we keep on file, because Twitter provides far more than we need.

In [13]:
def parse_user(usr):
    """Keep only the user fields we want to store."""
    user = {}
    user["created_at"] = usr['created_at']
    user["description"] = usr['description']
    user["favourites_count"] = usr['favourites_count']
    user["followers_count"] = usr['followers_count']
    user["friends_count"] = usr['friends_count']
    user["geo_enabled"] = usr['geo_enabled']
    user["id"] = usr['id']
    user["id_str"] = usr['id_str']
    user["name"] = usr['name']
    user["screen_name"] = usr['screen_name']
    user["statuses_count"] = usr['statuses_count']
    user["profile_image_url"] = usr['profile_image_url']

    # time_zone is optional, so copy it only when Twitter provides a value
    if usr['time_zone'] is not None:
        user["time_zone"] = usr['time_zone']

    # save_profile_image_url is not defined in this notebook;
    # it is assumed to be available from elsewhere
    if user["profile_image_url"] is not None:
        save_profile_image_url(user["id_str"], user["profile_image_url"])

    return user
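
The same "copy only the keys we care about" idea can be sketched with a small generic helper. This is only an illustration, not a replacement for `parse_user`: `pick`, `raw_user` and `slim_user` are hypothetical names, and the sample values are taken from the user shown later in this notebook.

```python
def pick(source, fields):
    """Return a new dict with the requested keys that are present and not None."""
    return {key: source[key] for key in fields if source.get(key) is not None}


# Hand-written sample; "lang" stands in for the many fields we do not keep.
raw_user = {
    "id_str": "90439860",
    "name": "lexinerus",
    "followers_count": 1456,
    "time_zone": None,
    "lang": "es",
}

slim_user = pick(raw_user, ("id_str", "name", "followers_count", "time_zone"))
```

Here `time_zone` is dropped because its value is `None`, mirroring the optional handling in `parse_user`.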

In [14]:
def parse_tweet(t):
    """Keep only the tweet fields we want to store."""
    tweet = {}
    tweet['created_at'] = t['created_at']

    # Despite the key name, 'entities' holds just the hashtag texts
    tweet['entities'] = []
    for k in t['entities']['hashtags']:
        tweet['entities'].append(k['text'])

    # Keep the mentioned users, dropping their character offsets
    tweet['user_mentions'] = []
    for k in t['entities']['user_mentions']:
        k.pop("indices", None)
        tweet['user_mentions'].append(k)

    tweet['favorite_count'] = t['favorite_count']

    if t['geo'] is not None:
        tweet['geo'] = t['geo']

    tweet['id'] = t['id']
    tweet['id_str'] = t['id_str']

    tweet['lang'] = t['lang']
    tweet['retweet_count'] = t['retweet_count']
    tweet['source'] = t['source']
    tweet['text'] = t['text']

    tweet['user'] = parse_user(t['user'])

    # Retweets embed the original tweet, so parse it recursively
    if 'retweeted_status' in t:
        tweet['retweeted_status'] = parse_tweet(t['retweeted_status'])

    # The reply fields live in the raw tweet `t`, not in the simplified `tweet`
    if t.get('in_reply_to_screen_name') is not None:
        in_reply_to_user = {}
        in_reply_to_user['screen_name'] = t['in_reply_to_screen_name']
        in_reply_to_user['id'] = t['in_reply_to_user_id']
        in_reply_to_user['id_str'] = t['in_reply_to_user_id_str']

        tweet['in_reply_to_user'] = in_reply_to_user

    return tweet

We parse the content we've previously downloaded ...

In [15]:
tweets = []

for tweet in search_results:
    tweets.append(parse_tweet(tweet._json))
    

Now the content is still in JSON format, but much simpler.

In [16]:
pprintpp.pprint(tweets[0])

{
    'created_at': u'Sun Oct 25 06:49:22 +0000 2015',
    'entities': [],
    'favorite_count': 0,
    'id': 658173486075129856,
    'id_str': u'658173486075129856',
    'lang': u'en',
    'retweet_count': 0,
    'source': u'<a href="http://ifttt.com" rel="nofollow">IFTTT</a>',
    'text': u'ReTw realmanurana: RT pinaldave: All technology news from the world of data, SQL Server, MySQL, NoSQL, Big Data an… https://t.co/hSEh8oTK8c',
    'user': {
        'created_at': u'Mon Nov 16 17:28:39 +0000 2009',
        'description': u'Computer Science, Technology, Buddhism and Curiosity (Ciencias de la computación, Tecnología, Budismo y Curiosidad)',
        'favourites_count': 28,
        'followers_count': 1456,
        'friends_count': 171,
        'geo_enabled': True,
        'id': 90439860,
        'id_str': u'90439860',
        'name': u'lexinerus',
        'profile_image_url': u'http://pbs.twimg.com/profile_images/378800000102712738/0772fe6a0154b6a4f852c8c71fc82157_normal.jpeg',
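
For later analysis it can be handy to turn the `created_at` string into a `datetime`. A minimal sketch, assuming the timestamp always carries the `+0000` (UTC) offset seen in the output above and that the default English locale is in use; `TWITTER_DATE_FORMAT` and `parse_created_at` are names made up for this example.

```python
from datetime import datetime

# Assumes the offset is always the literal "+0000", as in the tweets shown here.
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def parse_created_at(value):
    """Convert a Twitter 'created_at' string into a naive UTC datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


created = parse_created_at("Sun Oct 25 06:49:22 +0000 2015")
```

The result is a naive `datetime` that should be read as UTC.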
        

## Saving the data in a file

Finally, we write the downloaded data to a file so that we can analyze it later.

In [17]:
import json
    
# Write one JSON document per line; `f` avoids shadowing the built-in `file`
with open('../data/tweets.json', "w") as f:
    for t in tweets:
        r = json.dumps(t)
        f.write(r)
        f.write("\n")
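
Since each line of the output file holds one JSON document, it can be read back line by line. The sketch below writes two made-up records to a temporary file in the same one-document-per-line layout and loads them again; `read_json_lines`, `demo_path` and `demo_tweets` are hypothetical names, and the temporary file stands in for `../data/tweets.json`.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Load a file that contains one JSON document per line."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:                   # skip blank lines
                records.append(json.loads(line))
    return records


# Self-contained demonstration with made-up records.
demo_path = os.path.join(tempfile.mkdtemp(), "tweets.json")
demo_tweets = [{"id_str": "1", "text": "NoSQL"}, {"id_str": "2", "text": "tweepy"}]
with open(demo_path, "w") as f:
    for t in demo_tweets:
        f.write(json.dumps(t))
        f.write("\n")

loaded_tweets = read_json_lines(demo_path)
```

Reading line by line also means a large file never has to be parsed as a single JSON document.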

In [18]:
print "Number of tweets ... ", len(tweets)

Number of tweets ...  2666


In [19]:
tweets = []
for tweet in timeline_results:
    tweets.append(parse_tweet(tweet._json))
    
with open('../data/timeline.json', "w") as f:
    for t in tweets:
        r = json.dumps(t)
        f.write(r)
        f.write("\n")
        
print "Number of tweets ... ", len(tweets)

Number of tweets ...  3134
