# Importing data from Twitter with Python and tweepy

<img src="./images/tweepy.png" alt="Twitter logo" style="width: 800px;" align="left"/>

[Twitter](https://twitter.com) provides an API that lets you download data from this social network. To do this we will use Python and the [tweepy](https://github.com/tweepy/tweepy) library.

The aim is to retrieve tweets related to the word 'NoSQL' and store them in a file for later analysis.

The first thing to do is [register](https://www.google.es/webhp?q=create+twitter+app) a new Twitter application via the [Twitter Application Management](https://apps.twitter.com/) page.

After registering our application, Twitter will give us the keys that we need to access it through its API:

In [55]:
consumer_key = 'hRNtRgjzGq3wq3mt3fbuUkQ2c'
consumer_secret = 'yBbXnvNRpm92wvblpG9xhUMFF7w9sgxLfQT8k15Fs3k1RN4pnQ'
access_token_key = '12391902-qHO3gUBIvKuv7DjajXBmdm2SyZH8vgmR3jcpLVnnM'
access_token_secret = '9ViwfNW5FhOLhahaf4qimDLXfYuqDtGzJ1MmAQM0gN3LK'
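Keys written directly into a notebook are easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads the four values from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the function `load_credentials` are assumptions made for this example, not something Twitter or tweepy define.

```python
import os

# Hypothetical environment variable names, chosen for this example only.
CREDENTIAL_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN_KEY",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four keys as a tuple, in the order of CREDENTIAL_NAMES."""
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        # Fail early with a readable message instead of a later auth error.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_NAMES)
```

With the variables set in the shell that starts the notebook, the four names used in the cell above could be filled with `consumer_key, consumer_secret, access_token_key, access_token_secret = load_credentials()`.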

The next step is to import the library and log in to Twitter.

In [56]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)

api = tweepy.API(auth)

## Testing the library

To test the library we will retrieve information about a Twitter user: name, followers, and so on.

In [57]:
user = api.get_user('NoSQLDigest')



In [58]:
print "Name:", user.screen_name
print "Description:", user.description
print "Followers count:", user.followers_count
print "Friends' count:", user.friends_count
print "Statuses Count [Number of Tweets]: ", user.statuses_count


Name: NoSQLDigest
Description: NoSQL Digest of tweets.
Followers count: 9785
Friends' count: 12
Statuses Count [Number of Tweets]:  668120


## Retrieving Tweets via a search term

In [59]:
lookup ='NoSQL'

A single search request downloads tweets quickly, but Twitter caps the number of results per request: we ask for 200 tweets below and only 100 come back.

In [60]:
max_tweets = 200
search_results = api.search(q=lookup, lang = 'en', count=max_tweets)

print len(search_results)


100
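To collect more than one request allows, the usual approach is to repeat the request and ask each time only for tweets older than the ones already received (the search API accepts a `max_id` parameter for this). The sketch below shows only the paging logic. `fetch_page` is a hypothetical callable standing in for the real API call, and tweets are represented as plain dictionaries with an `"id"` key, which is an assumption of this example (tweepy returns objects with an `id` attribute instead).

```python
def fetch_up_to(fetch_page, max_tweets, page_size=100):
    """Collect up to max_tweets items by requesting successive pages.

    fetch_page(count, max_id) is a hypothetical callable that returns a list
    of dicts with an "id" key, newest first, none with an id above max_id.
    """
    collected = []
    max_id = None  # None means "start from the newest tweet"
    while len(collected) < max_tweets:
        count = min(page_size, max_tweets - len(collected))
        page = fetch_page(count=count, max_id=max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Next page: strictly older than the oldest tweet seen so far.
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected
```

This is roughly the bookkeeping that a cursor takes care of automatically.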




In [61]:
from prettytable import PrettyTable

table = PrettyTable(["User", "Date", "Text"])
table.align["User"] = "l"
table.align["Text"] = "l"

for tweet in search_results[0:10]:
    table.add_row([tweet.user.screen_name, tweet.created_at, tweet.text[:80]])
 
print table

+-----------------+---------------------+----------------------------------------------------------------------------------+
| User            |         Date        | Text                                                                             |
+-----------------+---------------------+----------------------------------------------------------------------------------+
| retweetjava     | 2015-10-21 04:00:19 | RT @MusicHackFest: #GoGettersNetwork #CodeTalk  Apache Hadoop and NoSQL as Analy |
| NeuvooPhpCA     | 2015-10-21 04:00:11 | CyberCoders is hiring a #Senior #Backend Developer - Ruby, Python, PHP, Agile, S |
| GujaratiGuy789  | 2015-10-21 03:57:05 | RT @SoftwareJokes: 3 database admins walked into a NoSQL bar. A little while lat |
| astosyk         | 2015-10-21 03:53:11 | RT @couchbase: Get rapid ramp-up on #NoSQL application development with lessons  |
| tjmickol        | 2015-10-21 03:40:36 | Listening to Sweet Home Wundermude by Spill &amp; Freunde on #kexp #galvanize #z |


## Retrieving a user's timeline


In [62]:
timeline_results = api.user_timeline(screen_name = 'NoSQLDigest', count = 1000, include_rts = True)
len(timeline_results)



198

## Retrieving Tweets via a search term using a cursor

This method uses a cursor, which pages through the results automatically and so gets around the per-request limit ;-)

In [63]:
c = tweepy.Cursor(api.search, q= lookup).items()
    
search_results = []
while True:
    try:
        tweet = c.next()
        # Insert into db
        search_results.append(tweet)
    except tweepy.TweepError as e:
        print e
        break

print len(search_results)



{"errors":[{"message":"Rate limit exceeded","code":88}]}
2648


This fails too! In this case the cause is Twitter's rate limit (error code 88, "Rate limit exceeded") :-(
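The message printed above is Twitter's rate-limit error (code 88). Rate limits reset after a time window (15 minutes for the standard API), so one option is to pause and then continue instead of stopping. The sketch below shows that idea in isolation: `RateLimitError` is a stand-in exception defined here for the example, and `collect_all` is a hypothetical helper, not part of tweepy. It assumes the iterator can be asked again after raising, which is how the cursor loop above is used. tweepy also offers a built-in option, `tweepy.API(auth, wait_on_rate_limit=True)`, that may make a helper like this unnecessary.

```python
import time


class RateLimitError(Exception):
    """Stand-in for the library's rate-limit exception (name is hypothetical)."""


def collect_all(items, limit_error=RateLimitError, wait_seconds=15 * 60,
                max_waits=3, sleep=time.sleep):
    """Drain an iterator, pausing and retrying when the rate limit is hit."""
    results = []
    waits = 0
    while True:
        try:
            results.append(next(items))
        except StopIteration:
            break  # normal end of the results
        except limit_error:
            if waits >= max_waits:
                break  # give up and keep what was collected so far
            waits += 1
            sleep(wait_seconds)  # wait for the window to reset, then retry
    return results
```

The `sleep` parameter exists only so that the waiting can be replaced in a test; in normal use the default `time.sleep` applies.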

## Retrieving a user's timeline using a cursor

In [64]:
import sys

c = tweepy.Cursor(api.user_timeline,id='NoSQLDigest').items()    
timeline_results = []
while True:
    try:
        tweet = c.next()
        # Insert into db
        timeline_results.append(tweet)
    except:
        print "Error: ", sys.exc_info()[0]
        break

print len(timeline_results)



Error:  <type 'exceptions.StopIteration'>
3136


## Visualizing the content of a Tweet

Twitter returns data in [JSON](http://www.json.org/json-es.html) format. Let's see what a tweet looks like:

In [66]:
import pprintpp

tweet = search_results[0]
pprintpp.pprint(tweet._json)

{
    u'contributors': None,
    u'coordinates': None,
    u'created_at': u'Wed Oct 21 04:00:19 +0000 2015',
    u'entities': {
        u'hashtags': [
            {u'indices': [19, 36], u'text': u'GoGettersNetwork'},
            {u'indices': [37, 46], u'text': u'CodeTalk'},
            {u'indices': [131, 135], u'text': u'IoT'},
            {u'indices': [136, 140], u'text': u'Java'},
            {u'indices': [139, 140], u'text': u'API'},
            {u'indices': [139, 140], u'text': u'Linux'},
        ],
        u'symbols': [],
        u'urls': [
            {
                u'display_url': u'bit.ly/1FAQ7B5',
                u'expanded_url': u'http://bit.ly/1FAQ7B5',
                u'indices': [107, 130],
                u'url': u'https://t.co/z2lcnaHeov',
            },
        ],
        u'user_mentions': [
            {
                u'id': 3243621325,
                u'id_str': u'3243621325',
                u'indices': [3, 17],
                u'name': u'Hackathon News',
      

## Parsing the result

The following functions simplify the information we keep on file, because Twitter provides far more than we need.

In [67]:
def parse_user(usr):
    user = {}  
    user["created_at"] = usr['created_at']
    user["description"] = usr['description']
    user["favourites_count"] = usr['favourites_count']
    user["followers_count"] = usr['followers_count']
    user["friends_count"] = usr['friends_count']
    user["geo_enabled"] = usr['geo_enabled']
    user["_id"] = usr['id']
    user["id_str"] =usr['id_str']
    user["name"] = usr['name']
    user["screen_name"] = usr['screen_name']
    user["statuses_count"] = usr['statuses_count']
    user["profile_image_url"] = usr['profile_image_url']
    if usr['time_zone'] is not None:
        user["time_zone"] = usr['time_zone']
    
    return user

In [68]:
def parse_tweet(t):
    tweet = {}
    tweet['created_at'] = t['created_at']
    #for ht in tweet.entities.hashtags:
    #    print ht.text

    tweet['entities'] = []
    for k in t['entities']['hashtags']:
        tweet['entities'].append(k['text'])
  
    tweet['user_mentions'] = []
    for k in t['entities']['user_mentions']:
        k.pop("indices", None)
        tweet['user_mentions'].append(k)

    tweet['favorite_count'] =  t['favorite_count']

    if t['geo'] is not None:
        tweet['geo'] = t['geo']

    tweet['_id'] = t['id']
    tweet['id_str'] = t['id_str']  

    tweet['lang'] = t['lang']
    tweet['retweet_count'] = t['retweet_count']
    tweet['source'] = t['source']
    tweet['text'] = t['text']
    tweet['user'] = parse_user(t['user'])

    if 'retweeted_status' in t:
        tweet['retweeted_status'] = parse_tweet(t['retweeted_status'])
        
    return tweet
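The field-by-field copying in `parse_user` could also be expressed as a list of field names plus a small helper. The following is only a sketch of that alternative: `pick`, `slim_user` and `USER_FIELDS` are names invented for this example, and the field list simply mirrors the keys used in `parse_user` above.

```python
# Keys copied unchanged from the raw user dictionary (same as parse_user).
USER_FIELDS = (
    "created_at", "description", "favourites_count", "followers_count",
    "friends_count", "geo_enabled", "id_str", "name", "screen_name",
    "statuses_count", "profile_image_url",
)


def pick(source, required, optional=()):
    """Copy the required keys, plus any optional keys whose value is not None."""
    picked = {key: source[key] for key in required}
    for key in optional:
        if source.get(key) is not None:
            picked[key] = source[key]
    return picked


def slim_user(usr):
    """Hypothetical equivalent of parse_user built on pick()."""
    user = pick(usr, USER_FIELDS, optional=("time_zone",))
    user["_id"] = usr["id"]  # stored under "_id", as in parse_user
    return user
```

Keeping the field names in one tuple makes it easier to see at a glance what is kept and to add or remove a field later.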

We parse the content we've previously downloaded ...

In [69]:
tweets = []

for tweet in search_results:
    tweets.append(parse_tweet(tweet._json))
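The parsed tweets keep `created_at` as the string Twitter sends, for example `'Wed Oct 21 04:00:19 +0000 2015'`. For later analysis it may be convenient to turn it into a `datetime`. This is a small sketch, written for Python 3 (the `%z` directive is not supported by `strptime` in Python 2); the helper name `parse_created_at` is made up for this example.

```python
from datetime import datetime

# Layout of the created_at strings shown in the tweet output above.
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Convert a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


when = parse_created_at("Wed Oct 21 04:00:19 +0000 2015")
print(when.isoformat())  # 2015-10-21T04:00:19+00:00
```

Note that a `datetime` is not directly JSON-serializable, so if the value is stored in the file it would need to be written back as a string, for instance with `isoformat()`.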
    

Now the content is still in JSON format, but much simpler:

In [71]:
pprintpp.pprint(tweets[0])

{
    '_id': 656681395348217856,
    'created_at': u'Wed Oct 21 04:00:19 +0000 2015',
    'entities': [
        u'GoGettersNetwork',
        u'CodeTalk',
        u'IoT',
        u'Java',
        u'API',
        u'Linux',
    ],
    'favorite_count': 0,
    'id_str': u'656681395348217856',
    'lang': u'en',
    'retweet_count': 1,
    'retweeted_status': {
        '_id': 656600272781877248,
        'created_at': u'Tue Oct 20 22:37:58 +0000 2015',
        'entities': [
            u'GoGettersNetwork',
            u'CodeTalk',
            u'IoT',
            u'Java',
            u'API',
            u'Linux',
        ],
        'favorite_count': 0,
        'id_str': u'656600272781877248',
        'lang': u'en',
        'retweet_count': 1,
        'source': u'<a href="http://ifttt.com" rel="nofollow">IFTTT</a>',
        'text': u'#GoGettersNetwork #CodeTalk  Apache Hadoop and NoSQL as Analysis Engines for IoT Data ▸ https://t.co/z2lcnaHeov #IoT #Java #API #Linux …',
        'user': {
     

## Saving the data in a file

Finally, we save the downloaded data to a file so that we can analyze it later.

In [75]:
import json
    
with open('./data/tweets.json',"w") as file:
    for t in tweets:
        r = json.dumps(t)
        file.write(r)
        file.write("\n")

In [76]:
print "Number of tweets ... ", len(tweets)

Number of tweets ...  3136


In [77]:
tweets = []
for tweet in timeline_results:
    tweets.append(parse_tweet(tweet._json))
    
with open('./data/timeline.json',"w") as file:
    for t in tweets:
        r = json.dumps(t)
        file.write(r)
        file.write("\n")
        
print "Number of tweets ... ", len(tweets)

Number of tweets ...  3136
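The files written above contain one JSON document per line, so they can be read back line by line for the later analysis. A minimal sketch, assuming a file produced by the cells above; the function name `read_json_lines` is made up for this example.

```python
import json


def read_json_lines(path):
    """Load a file that holds one JSON document per line into a list."""
    with open(path) as handle:
        # Skip blank lines so a trailing newline does not cause an error.
        return [json.loads(line) for line in handle if line.strip()]
```

For example, `read_json_lines('./data/timeline.json')` would return a list of dictionaries with the same keys that `parse_tweet` produced.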
