## Storing Twitter data in Riak

The following buckets will be created:

|Bucket|Key|Content|
|-|-|-|
|tweets|Tweet ID|Tweet data in JSON format, including the retweeted tweet and the user data|
|users|User handle (screen_name)|User data in JSON format|
|hashtags|Hashtag|Number of times the hashtag appears across all tweets|

The tweets bucket will have the following indexes:

|Index name|Content|Searches it supports|
|-|-|-|
|idx_usr_bin|User who created the tweet|Tweets created by a given user|
|idx_hashtag_bin|Hashtag mentioned in the tweet|Tweets in which a given hashtag appears|
|idx_user_mentioned_bin|User mentioned in the tweet|Tweets in which a given user is mentioned|
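
To make the index table concrete, the sketch below lists the index entries a single tweet would produce. It runs without a Riak connection. The helper name `index_entries` and the sample tweet are hypothetical, and the field names (`user`, `entities`, `user_mentions`) are assumed to match the preprocessed tweet format used by the loading code in this notebook.

```python
def index_entries(tweet_json):
    """Return (index_name, value) pairs for one tweet (illustrative only)."""
    entries = [('idx_usr_bin', tweet_json['user']['screen_name'])]
    # One entry per hashtag mentioned in the tweet.
    for hashtag in tweet_json.get('entities', []):
        entries.append(('idx_hashtag_bin', hashtag))
    # One entry per user mentioned in the tweet.
    for mention in tweet_json.get('user_mentions', []):
        entries.append(('idx_user_mentioned_bin', mention['screen_name']))
    return entries


# Hypothetical sample tweet, not taken from the dataset.
sample_tweet = {
    'id_str': '1',
    'user': {'screen_name': 'example_author'},
    'entities': ['NoSQL', 'Riak'],
    'user_mentions': [{'screen_name': 'example_reader'}],
}
entries = index_entries(sample_tweet)
for name, value in entries:
    print('%s -> %s' % (name, value))
```

Each pair corresponds to one `tweet.add_index(name, value)` call before the object is stored.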

In [1]:
import riak
from pprintpp import pprint as pp
import json

In [2]:
# connect to database
myClient = riak.RiakClient()
myClient.ping()

True

In [3]:
BUCKET_TWEETS = 'tweets'
BUCKET_USERS = 'users'
BUCKET_HASHTAGS = 'hashtags'

tweets = myClient.bucket(BUCKET_TWEETS)
users = myClient.bucket(BUCKET_USERS)
hashtags = myClient.bucket(BUCKET_HASHTAGS)


In [4]:
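def insert_hashtag(hashtag):
    # Read the current count (None if the key does not exist yet) and add one.
    # This read-then-write is not atomic, so concurrent writers may lose updates.
    count = hashtags.get(hashtag).data
    if count is None:
        count = 1
    else:
        count = count + 1

    hashtags.new(hashtag, data=count).store()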
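
`insert_hashtag` performs one read and one write per hashtag occurrence. When loading from a single process, one possible alternative is to count hashtags in memory first and write each total once. The sketch below shows only the counting step and makes no Riak call. `count_hashtags` is a hypothetical helper, and it assumes the preprocessed format in which `entities` is a list of hashtag strings and retweets are nested under `retweeted_status`.

```python
from collections import Counter


def count_hashtags(tweets):
    """Count hashtag occurrences across tweets, following nested retweets."""
    counts = Counter()
    for tweet_json in tweets:
        # Walk the tweet and then any tweet nested under 'retweeted_status'.
        while tweet_json is not None:
            counts.update(tweet_json.get('entities', []))
            tweet_json = tweet_json.get('retweeted_status')
    return counts


# Hypothetical sample data, not taken from the dataset.
sample_tweets = [
    {'entities': ['NoSQL', 'Riak']},
    {'entities': ['NoSQL'], 'retweeted_status': {'entities': ['NoSQL']}},
]
totals = count_hashtags(sample_tweets)
print(dict(totals))
```

Each total could then be written with a single `hashtags.new(hashtag, data=count).store()` call, as in `insert_hashtag`.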

In [5]:
def insert_user(user_json, replace=True):
    # Store the user under its screen_name.
    # With replace=False, an existing record is left untouched.
    key_user = user_json["screen_name"]
    if replace or not users.get(key_user).exists:
        user = users.new(key_user, user_json)
        user.store()

    return key_user
    

In [6]:
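def insert_tweet(tweet_json):
    # The tweet is keyed by its ID; the author is stored (or refreshed) first.
    tweet = tweets.new(tweet_json["id_str"], tweet_json)
    user = insert_user(tweet_json['user'])
    tweet.add_index('idx_usr_bin', user)

    # 'entities' holds the list of hashtags in the preprocessed data.
    for hashtag in tweet_json['entities']:
        tweet.add_index('idx_hashtag_bin', hashtag)
        insert_hashtag(hashtag)

    # Mentioned users are stored only if they are not already present.
    for user_mentioned in tweet_json['user_mentions']:
        tweet.add_index('idx_user_mentioned_bin', user_mentioned["screen_name"])
        insert_user(user_mentioned, replace=False)

    tweet.store()

    # A retweet embeds the original tweet, which is inserted recursively.
    if 'retweeted_status' in tweet_json:
        insert_tweet(tweet_json['retweeted_status'])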


In [7]:
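tweets_data_path = '../data/tweets.json'

# The counter starts at 1, so the printed total is one more than
# the number of tweets inserted without errors.
i = 1
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        tweet_json = json.loads(line)

        try:
            insert_tweet(tweet_json)
            i = i + 1
        except Exception:
            # Skip tweets that cannot be inserted (e.g. missing fields).
            pass

print("%s tweets processed" % i)

2664 tweets processed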
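
The loading loop above silently discards any tweet that fails to insert. A variation that also reports how many lines failed could look like the sketch below. `load_lines` and the sample lines are hypothetical, and the `handler` argument stands in for `insert_tweet` so the example runs without a Riak connection. Unlike the loop above, it also counts lines that are not valid JSON as failures instead of letting them raise.

```python
import json


def load_lines(lines, handler):
    """Parse line-delimited JSON and pass each record to handler.

    Returns a (processed, failed) pair of counts.
    """
    processed, failed = 0, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            handler(json.loads(line))
            processed += 1
        except (ValueError, KeyError):
            # ValueError: the line is not valid JSON.
            # KeyError: the handler found a required field missing.
            failed += 1
    return processed, failed


# Hypothetical input; in the notebook the lines would come from the tweets file.
seen = []
sample_lines = ['{"id_str": "1"}', 'not json', '', '{"id_str": "2"}']
result = load_lines(sample_lines, lambda t: seen.append(t['id_str']))
print("%s processed, %s failed" % result)
```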


## Print the contents of one tweet

Print the contents of the tweet with ID 655039580312371200.

In [8]:
tweet = tweets.get("655039580312371200").data

pp(tweet)

{
    u'created_at': u'Fri Oct 16 15:16:20 +0000 2015',
    u'entities': [
        u'IBM',
        u'DataEngine',
        u'NoSQL',
        u'PowerSystems',
        u'TCO',
        u'Redis',
        u'CAPI',
        u'Power8',
        u'RLEC',
        u'API',
    ],
    u'favorite_count': 4,
    u'id': 655039580312371200,
    u'id_str': u'655039580312371200',
    u'lang': u'en',
    u'retweet_count': 11,
    u'source': u'<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>',
    u'text': u'The #IBM #DataEngine for #NoSQL on IBM #PowerSystems http://t.co/tllWUqRLiN #TCO #Redis #CAPI #Power8 #RLEC #API http://t.co/BLOyKri5XP',
    u'user': {
        u'created_at': u'Sun Nov 27 22:29:08 +0000 2011',
        u'description': u'IT Consultant #IT #bigdata #analytics #cloud #edtech #K12 #IoT #datagov #MDM #datascience\nTweets are my opinions, not the official position of my employer.',
        u'favourites_count': 1528,
        u'followers_count': 5093,
        u'friends_count': 508

## Show the information associated with one user

Show the information associated with the user DBaker007.

In [9]:
user = users.get("DBaker007").data

pp(user)

{
    u'created_at': u'Sun Nov 27 22:29:08 +0000 2011',
    u'description': u'IT Consultant #IT #bigdata #analytics #cloud #edtech #K12 #IoT #datagov #MDM #datascience\nTweets are my opinions, not the official position of my employer.',
    u'favourites_count': 1528,
    u'followers_count': 5093,
    u'friends_count': 5082,
    u'geo_enabled': False,
    u'id': 422966469,
    u'id_str': u'422966469',
    u'name': u'Duane Baker',
    u'profile_image_url': u'http://pbs.twimg.com/profile_images/613317207121006593/fBJIw92n_normal.jpg',
    u'screen_name': u'DBaker007',
    u'statuses_count': 11902,
    u'time_zone': u'Eastern Time (US & Canada)',
}


## List the hashtags with more than 100 tweets

In [10]:
for keys in hashtags.stream_keys():
    for key in keys:
        count = hashtags.get(key).data
        if count > 100:
            print('Hashtag %s:%s' % (key,  count))

Hashtag Docker:106
Hashtag MongoDb:171
Hashtag noSQL:112
Hashtag hive:473
Hashtag hadoop:138
Hashtag nosql:605
Hashtag mongodb:114
Hashtag NoSQL:1829
Hashtag java:152
Hashtag Java:794
Hashtag bigdata:268
Hashtag MongoDB:776
Hashtag SoapUi:756
Hashtag sqlserver:142
Hashtag nodejs:109
Hashtag BigData:842
Hashtag Hadoop:703
Hashtag Couchbase:108
Hashtag Oracle:159
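
The output above shows that hashtag keys are case-sensitive: `NoSQL`, `nosql` and `noSQL` are counted separately. If case-insensitive totals are wanted, the counts can be merged after reading them, as in the sketch below. `merge_case_variants` is a hypothetical helper, and the input dictionary copies a few of the counts printed above.

```python
from collections import Counter


def merge_case_variants(counts):
    """Sum counts of hashtags that differ only in letter case."""
    merged = Counter()
    for hashtag, count in counts.items():
        merged[hashtag.lower()] += count
    return merged


# Counts copied from the output above.
observed = {
    'NoSQL': 1829, 'nosql': 605, 'noSQL': 112,
    'MongoDB': 776, 'MongoDb': 171, 'mongodb': 114,
}
merged = merge_case_variants(observed)
for hashtag, count in sorted(merged.items()):
    print('Hashtag %s:%s' % (hashtag, count))
```

Lowercasing the hashtag before calling `insert_hashtag` would achieve the same result at load time, at the cost of losing the original spelling in the key.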


## List the tweets associated with one hashtag

In [11]:
index_page = tweets.stream_index("idx_hashtag_bin", 'NoSQL')
for keys in index_page.results:
    for key in keys:
        tweet = tweets.get(key).data

# Only the last tweet fetched is printed.
user = tweet["user"]["screen_name"]
print("Last tweet:")
pp(tweet)
print("User: %s" % user)


Last tweet:
{
    u'created_at': u'Tue Oct 13 11:41:00 +0000 2015',
    u'entities': [u'PostgreSQL', u'NoSQL', u'EnterpriseDB'],
    u'favorite_count': 0,
    u'id': 653898227658412034,
    u'id_str': u'653898227658412034',
    u'lang': u'en',
    u'retweet_count': 5,
    u'source': u'<a href="http://bufferapp.com" rel="nofollow">Buffer</a>',
    u'text': u'How Postgres is taking the fight to the NoSQL pretenders #PostgreSQL #NoSQL #EnterpriseDB http://t.co/ayy6cMMJ7E',
    u'user': {
        u'created_at': u'Tue Jan 03 23:02:47 +0000 2012',
        u'description': u'We provide our clients with full life-cycle advice and #bigdata services: strategy,  data science, implementation, support and training',
        u'favourites_count': 61,
        u'followers_count': 2109,
        u'friends_count': 392,
        u'geo_enabled': True,
        u'id': 454384379,
        u'id_str': u'454384379',
        u'name': u'Big Data Partnership',
        u'profile_image_url': u'http://pbs.twimg.com/prof

## Show the tweets associated with one user

In [12]:
index_page = tweets.stream_index("idx_usr_bin", user)
for keys in index_page.results:
    for tweet_key in keys:
        tweet = tweets.get(tweet_key).data
        print("Tweet: %s" % tweet_key)

# Only the last tweet fetched is printed in full.
pp(tweet)


Tweet: 653898227658412034
{
    u'created_at': u'Tue Oct 13 11:41:00 +0000 2015',
    u'entities': [u'PostgreSQL', u'NoSQL', u'EnterpriseDB'],
    u'favorite_count': 0,
    u'id': 653898227658412034,
    u'id_str': u'653898227658412034',
    u'lang': u'en',
    u'retweet_count': 5,
    u'source': u'<a href="http://bufferapp.com" rel="nofollow">Buffer</a>',
    u'text': u'How Postgres is taking the fight to the NoSQL pretenders #PostgreSQL #NoSQL #EnterpriseDB http://t.co/ayy6cMMJ7E',
    u'user': {
        u'created_at': u'Tue Jan 03 23:02:47 +0000 2012',
        u'description': u'We provide our clients with full life-cycle advice and #bigdata services: strategy,  data science, implementation, support and training',
        u'favourites_count': 61,
        u'followers_count': 2109,
        u'friends_count': 392,
        u'geo_enabled': True,
        u'id': 454384379,
        u'id_str': u'454384379',
        u'name': u'Big Data Partnership',
        u'profile_image_url': u'http://pbs.tw
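
Both index queries above use a nested loop because the results arrive as batches of keys. A small generator can flatten those batches so that callers iterate over individual keys. This is only a sketch: `iter_keys` is a hypothetical helper, and it assumes, as the loops above do, that the result stream yields lists of keys. The sample batches stand in for a real query result.

```python
def iter_keys(batches):
    """Yield individual keys from an iterable of key batches."""
    for batch in batches:
        for key in batch:
            yield key


# Hypothetical batches; with Riak they would come from a streamed index query.
sample_batches = [['1', '2'], [], ['3']]
flat = list(iter_keys(sample_batches))
print(flat)
```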

## List of users

In [13]:
for keys in users.stream_keys():  
    for key in keys:
        print('User %s' % key)

User whitestratus
User voxxed
User vduglued
User trevoirwilliams
User tonyshan
User tmj_sfo_eng
User tlockney
User thinklets
User smqueue_live
User scalajobz
User rodhamlin
User redlocal
User redisfeed
User red1elhaloui
User rainfc
User nodenow
User neffratiti
User natbusa
User marycset
User markhneedham
User mahurtado2
User ldellaquila
User karl_popp
User josh_teneycke
User jorgosac
User jobs4devs
User jobalertcenter1
User itknowingness
User infoworld
User impsardo
User ikwattro
User icrtiou
User hut8uk
User fmassetto
User fernandolazaro5
User fendien
User etiennebrunet
User eoinbrazil
User ekinoExperts
User developerWorks
User datacenter
User cinovo_de
User bitplaces
User binduamadhav
User bd_259
User b4d_tR1p
User anicho
User amber_ht
User abhayit2000
User _GBartolini_
User Vnomicinfo
User UserMalicious
User Ulitzer
User TodoCoders
User TheMarketingU
User Tamr_Inc
User ShopwareDev
User RaySr1946
User NtshFlors409
User NeuvooRouen
User NeuvooMilpitas
User NenadBozicNs
User Moe_CFC
Us