## Storing Twitter data in Riak

The following buckets will be created:

|Bucket|Key|Contents|
|-|-|-|
|tweets|Tweet id|Tweet data in JSON format, including the retweeted tweet and the user data|
|users|User handle (screen_name)|User data in JSON format|
|hashtags|Hashtag|Number of times the hashtag appears across all tweets|
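
As a rough illustration of what the `hashtags` bucket ends up holding, the sketch below computes the same per-hashtag counts in memory. It assumes, as in this dataset, that each tweet's `entities` field is a plain list of hashtag strings. `count_hashtags` and the sample tweets are hypothetical and not part of the notebook.

```python
from collections import Counter


def count_hashtags(tweets):
    """Count how many times each hashtag appears across all tweets."""
    counts = Counter()
    for tweet in tweets:
        # Assumption: 'entities' is a list of hashtag strings
        counts.update(tweet.get("entities", []))
    return counts


# Made-up sample data, only for illustration
sample = [
    {"id_str": "1", "entities": ["NoSQL", "Redis"]},
    {"id_str": "2", "entities": ["NoSQL"]},
]
counts = count_hashtags(sample)
print(counts["NoSQL"])
```

In Riak each of these counts would live under its own key (the hashtag) rather than in a single in-memory mapping.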

The `tweets` bucket will have the following indexes:

|Index name|Contents|Which searches does it allow?|
|-|-|-|
|idx_usr_bin|User who created the tweet|Tweets created by a given user|
|idx_hashtag_bin|Hashtag mentioned in the tweet|Tweets containing a given hashtag|
|idx_user_mentioned_bin|User mentioned in the tweet|Tweets that mention a given user|
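
To make the table concrete, the following sketch lists the index entries that one tweet would receive, without touching Riak. It assumes the tweet layout used in this dataset (`entities` as a list of hashtag strings, `user_mentions` as a list of user objects). `index_entries` and the sample tweet are hypothetical names introduced for illustration only.

```python
def index_entries(tweet_json):
    """Return the (index name, value) pairs that would be attached to a tweet."""
    # One entry for the author of the tweet
    entries = [("idx_usr_bin", tweet_json["user"]["screen_name"])]
    # One entry per hashtag
    for hashtag in tweet_json.get("entities", []):
        entries.append(("idx_hashtag_bin", hashtag))
    # One entry per mentioned user
    for mention in tweet_json.get("user_mentions", []):
        entries.append(("idx_user_mentioned_bin", mention["screen_name"]))
    return entries


# Made-up sample tweet, only for illustration
sample_tweet = {
    "id_str": "1",
    "user": {"screen_name": "alice"},
    "entities": ["NoSQL", "Riak"],
    "user_mentions": [{"screen_name": "bob"}],
}
entries = index_entries(sample_tweet)
for name, value in entries:
    print("%s -> %s" % (name, value))
```

A tweet can carry several entries for the same index, which is what lets one tweet be found under each of its hashtags.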

In [2]:
import riak
from pprintpp import pprint as pp
import json

In [3]:
# connect to database
myClient = riak.RiakClient()
myClient.ping()

True

In [4]:
BUCKET_TWEETS = 'tweets'
BUCKET_USERS = 'users'
BUCKET_HASHTAGS = 'hashtags'

tweets = myClient.bucket(BUCKET_TWEETS)
users = myClient.bucket(BUCKET_USERS)
hashtags = myClient.bucket(BUCKET_HASHTAGS)


In [5]:
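def insert_hashtag(hashtag):
    # Read the current counter; .data is None when the key does not exist yet
    count = hashtags.get(hashtag).data
    if count is None:
        count = 1
    else:
        count = count + 1

    # Store the updated counter under the hashtag key
    hashtags.new(hashtag, data = count).store()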

In [6]:
def insert_user(user_json, replace = True):
    # Users are keyed by their screen_name
    key_user = user_json["screen_name"]
    # Overwrite an existing user only when replace is True
    if replace or not users.get(key_user).exists:
        user = users.new(key_user, user_json)
        user.store()

    return key_user


In [7]:
def insert_tweet(tweet_json):
    tweet = tweets.new(tweet_json["id_str"], tweet_json)
    # Store the author and index the tweet by the author's screen_name
    user = insert_user(tweet_json['user'])
    tweet.add_index('idx_usr_bin', user)

    # In this dataset 'entities' is a list of hashtag strings
    for hashtag in tweet_json['entities']:
        tweet.add_index('idx_hashtag_bin', hashtag)
        insert_hashtag(hashtag)

    # Mentioned users are stored only if they are not already present
    for user_mentioned in tweet_json['user_mentions']:
        tweet.add_index('idx_user_mentioned_bin', user_mentioned["screen_name"])
        insert_user(user_mentioned, replace = False)

    tweet.store()

    # The retweeted tweet, when present, is stored as a tweet of its own
    if 'retweeted_status' in tweet_json:
        insert_tweet(tweet_json['retweeted_status'])


In [8]:
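tweets_data_path = '../data/tweets.json'

# The counter starts at 1, so the final value is one more than
# the number of lines stored without error
i = 1
# The file holds one JSON document per line
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        tweet_json = json.loads(line)

        try:
            insert_tweet(tweet_json)
            i = i + 1
        except Exception:
            # Tweets that cannot be stored are skipped silently
            pass

print("%s tweets processed" % i)

2664 tweets processed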
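
The loading loop above skips failures silently. One possible variation, sketched below, counts stored and failed lines separately so that skipped input is visible. `load_tweets` is a hypothetical helper; the `insert` argument stands in for `insert_tweet`, and unlike the cell above the JSON parsing is also placed inside the `try`.

```python
import io
import json


def load_tweets(lines, insert):
    """Insert one JSON document per line; return (stored, failed) counts."""
    stored, failed = 0, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            insert(json.loads(line))
            stored += 1
        except (ValueError, KeyError):
            # ValueError: malformed JSON; KeyError: a missing expected field
            failed += 1
    return stored, failed


# Made-up input: two valid documents and one malformed line
sample_lines = io.StringIO(u'{"id_str": "1"}\nnot json\n{"id_str": "2"}\n')
seen = []
result = load_tweets(sample_lines, seen.append)
print("%s stored, %s failed" % result)
```

Catching only the two expected exception types means that other problems, such as a lost connection to Riak, would still surface instead of being swallowed.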


## Print the contents of one tweet

Print the contents of the tweet with id 655039580312371200

In [9]:
tweet = tweets.get("655039580312371200").data

pp(tweet)

{
    u'created_at': u'Fri Oct 16 15:16:20 +0000 2015',
    u'entities': [
        u'IBM',
        u'DataEngine',
        u'NoSQL',
        u'PowerSystems',
        u'TCO',
        u'Redis',
        u'CAPI',
        u'Power8',
        u'RLEC',
        u'API',
    ],
    u'favorite_count': 4,
    u'id': 655039580312371200,
    u'id_str': u'655039580312371200',
    u'lang': u'en',
    u'retweet_count': 11,
    u'source': u'<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>',
    u'text': u'The #IBM #DataEngine for #NoSQL on IBM #PowerSystems http://t.co/tllWUqRLiN #TCO #Redis #CAPI #Power8 #RLEC #API http://t.co/BLOyKri5XP',
    u'user': {
        u'created_at': u'Sun Nov 27 22:29:08 +0000 2011',
        u'description': u'IT Consultant #IT #bigdata #analytics #cloud #edtech #K12 #IoT #datagov #MDM #datascience\nTweets are my opinions, not the official position of my employer.',
        u'favourites_count': 1528,
        u'followers_count': 5093,
        u'friends_count': 508

## Show the information associated with one user

Show the information associated with the user DBaker007

In [10]:
user = users.get("DBaker007").data

pp(user)

{
    u'created_at': u'Sun Nov 27 22:29:08 +0000 2011',
    u'description': u'IT Consultant #IT #bigdata #analytics #cloud #edtech #K12 #IoT #datagov #MDM #datascience\nTweets are my opinions, not the official position of my employer.',
    u'favourites_count': 1528,
    u'followers_count': 5093,
    u'friends_count': 5082,
    u'geo_enabled': False,
    u'id': 422966469,
    u'id_str': u'422966469',
    u'name': u'Duane Baker',
    u'profile_image_url': u'http://pbs.twimg.com/profile_images/613317207121006593/fBJIw92n_normal.jpg',
    u'screen_name': u'DBaker007',
    u'statuses_count': 11902,
    u'time_zone': u'Eastern Time (US & Canada)',
}


## List the hashtags that appear in more than 100 tweets

In [11]:
for keys in hashtags.stream_keys():
    for key in keys:
        count = hashtags.get(key).data
        if count > 100:
            print('Hashtag %s:%s' % (key,  count))

Hashtag SoapUi:756
Hashtag Docker:106
Hashtag MongoDb:171
Hashtag sqlserver:142
Hashtag nodejs:109
Hashtag BigData:842
Hashtag hive:473
Hashtag hadoop:138
Hashtag MongoDB:776
Hashtag nosql:605
Hashtag mongodb:114
Hashtag NoSQL:1829
Hashtag java:152
Hashtag Java:794
Hashtag noSQL:112
Hashtag Oracle:159
Hashtag bigdata:268
Hashtag Hadoop:703
Hashtag Couchbase:108
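
The listing above shows that hashtags are counted case-sensitively, so `NoSQL`, `nosql` and `noSQL` are separate keys. If a combined figure is wanted, one option is to merge the counts after reading them, as in this sketch. `merge_case_variants` is a hypothetical helper, and the sample values are copied from the output above.

```python
from collections import defaultdict


def merge_case_variants(counts):
    """Sum the counts of hashtags that differ only in letter case."""
    merged = defaultdict(int)
    for hashtag, count in counts.items():
        merged[hashtag.lower()] += count
    return dict(merged)


# Values taken from the listing above
sample = {"NoSQL": 1829, "nosql": 605, "noSQL": 112, "Hadoop": 703, "hadoop": 138}
merged = merge_case_variants(sample)
print(merged)
```

An alternative would be to lowercase the hashtag before using it as a key at insert time, at the cost of losing the original spelling.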


## List the tweets associated with one hashtag

In [12]:
index_stream = tweets.stream_index("idx_hashtag_bin", 'NoSQL')
for keys in index_stream.results:
    for key in keys:
        tweet = tweets.get(key).data

# Only the last tweet fetched by the loop is printed
user = tweet["user"]["screen_name"]
print("Last tweet:")
pp(tweet)
print("User: %s" % user)


Last tweet:
{
    u'created_at': u'Tue Oct 20 21:42:53 +0000 2015',
    u'entities': [u'DevoxxMA', u'NoSQL'],
    u'favorite_count': 3,
    u'id': 656586409902391296,
    u'id_str': u'656586409902391296',
    u'lang': u'en',
    u'retweet_count': 7,
    u'source': u'<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
    u'text': u'#DevoxxMA speaker of the day:\n\n@ldoguin: @couchbase Developer Advocate. #NoSQL https://t.co/lwOK6W9nfE',
    u'user': {
        u'created_at': u'Sun Jun 03 18:23:51 +0000 2012',
        u'description': u'The biggest developer conference in Africa... every year around November in Morocco. This year the conference will be held on November 16-18, Casablanca.',
        u'favourites_count': 381,
        u'followers_count': 712,
        u'friends_count': 18,
        u'geo_enabled': True,
        u'id': 598544729,
        u'id_str': u'598544729',
        u'name': u'Devoxx Morocco',
        u'profile_image_url': u'http://pbs.twi

## Show the tweets associated with one user

In [13]:
index_stream = tweets.stream_index("idx_usr_bin", user)
for keys in index_stream.results:
    for tweet_key in keys:
        tweet = tweets.get(tweet_key).data
        print("Tweet: %s" % tweet_key)

# Print the last tweet fetched by the loop
pp(tweet)


Tweet: 656586409902391296
{
    u'created_at': u'Tue Oct 20 21:42:53 +0000 2015',
    u'entities': [u'DevoxxMA', u'NoSQL'],
    u'favorite_count': 3,
    u'id': 656586409902391296,
    u'id_str': u'656586409902391296',
    u'lang': u'en',
    u'retweet_count': 7,
    u'source': u'<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
    u'text': u'#DevoxxMA speaker of the day:\n\n@ldoguin: @couchbase Developer Advocate. #NoSQL https://t.co/lwOK6W9nfE',
    u'user': {
        u'created_at': u'Sun Jun 03 18:23:51 +0000 2012',
        u'description': u'The biggest developer conference in Africa... every year around November in Morocco. This year the conference will be held on November 16-18, Casablanca.',
        u'favourites_count': 381,
        u'followers_count': 712,
        u'friends_count': 18,
        u'geo_enabled': True,
        u'id': 598544729,
        u'id_str': u'598544729',
        u'name': u'Devoxx Morocco',
        u'profile_image_url': u'ht
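
The third index, `idx_user_mentioned_bin`, is not queried anywhere in this notebook. A minimal sketch of such a query could look like the following. `tweets_mentioning` is a hypothetical helper, and it assumes that the bucket's `get_index` call returns a page whose `results` attribute lists the matching keys, which is how the Riak Python client is understood to behave for non-streaming queries.

```python
def tweets_mentioning(bucket, screen_name):
    """Return the data of every tweet that mentions the given user."""
    # Assumption: get_index returns a page object exposing the matching keys in .results
    page = bucket.get_index("idx_user_mentioned_bin", screen_name)
    # Fetch each matching tweet by key
    return [bucket.get(key).data for key in page.results]
```

With the `tweets` bucket defined earlier, a call such as `tweets_mentioning(tweets, "couchbase")` would be expected to return the stored tweets that mention that account.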

## List of users

In [31]:
for keys in users.stream_keys():  
    for key in keys:
        print('User %s' % key)

User xurxosanz
User xMAnton
User tmj_GBR_sales
User tmj_GBR_ADV
User seeteegee
User rusgautama
User rubens_ts
User retweetjava
User quantlabs
User orenmazor
User nico_nuggets
User mchacki
User lydiamartinparr
User lisabriercliffe
User linetracer
User jobs_school
User jhabiteici
User itticker
User ichisemasashi
User ibm_ecod_japan
User iankits
User hot_sql_jobs
User higevision
User harsha549
User gomaam
User geneolot
User ganeshkr2005
User exit666
User dunabit
User denovo_mark
User demartek
User danielscarvalho
User cooldezign
User christianphagen
User chris_hawk
User bsrust
User brucesebi
User bihranalytics
User balinderwalia
User analyticsprnews
User akmalchaudhri
User _unaizc_
User XebiaFr
User Skrobola
User SimonEvans76
User OHughe5
User NovikNG
User Nicuz95
User NicoleLagerloef
User NeuvooSantaCla
User LaVernOWithersp
User LTommaseo
User KirkDBorne
User JoeIngeno
User JobsCoding
User JamesJosephIgoe
User HotNewStacks
User Fernanrelenia
User DipenDedania
User DatabaseView
User Cloud