# Storing Twitter data in Riak

![png](https://upload.wikimedia.org/wikipedia/en/8/8e/Riak_distributed_NoSQL_key-value_data_store_logo.png)

The following buckets will be created:

|Bucket|Key|Contents|
|-|-|-|
|tweets|Tweet id|Tweet data in JSON format, including the retweeted tweet and the user data|
|users|User handle (screen_name)|User data in JSON format|
|hashtags|Hashtag|Number of times the hashtag appears across all tweets|

The tweets bucket will have the following secondary indexes:

|Index name|Contents|Searches it enables|
|-|-|-|
|idx_usr_bin|User who created the tweet|Tweets created by a given user|
|idx_hashtag_bin|Hashtag mentioned in the tweet|Tweets in which a given hashtag appears|
|idx_user_mentioned_bin|User mentioned in the tweet|Tweets in which a given user is mentioned|
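
As a rough illustration of the table above, the sketch below derives the secondary-index entries for a single tweet without touching Riak. The helper name `index_entries` and the sample tweet are hypothetical; the field names (`user`, `entities`, `user_mentions`) are assumed to match the preprocessed tweets used in this notebook.

```python
def index_entries(tweet_json):
    """Return (index_name, value) pairs for one tweet (illustrative helper)."""
    # The author's screen_name feeds idx_usr_bin.
    entries = [("idx_usr_bin", tweet_json["user"]["screen_name"])]
    # One entry per hashtag mentioned in the tweet.
    entries += [("idx_hashtag_bin", tag) for tag in tweet_json.get("entities", [])]
    # One entry per mentioned user.
    entries += [
        ("idx_user_mentioned_bin", mention["screen_name"])
        for mention in tweet_json.get("user_mentions", [])
    ]
    return entries


# Hypothetical tweet, reduced to the fields that feed the indexes.
sample_tweet = {
    "id_str": "1",
    "user": {"screen_name": "alice"},
    "entities": ["NoSQL", "Riak"],
    "user_mentions": [{"screen_name": "bob"}],
}

for name, value in index_entries(sample_tweet):
    print(name, value)
```

The `_bin` suffix marks each index as a binary (string) index, so lookups match the stored value exactly, including letter case.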

In [1]:
import riak
from pprintpp import pprint as pp
import json

In [2]:
# connect to database
myClient = riak.RiakClient()
myClient.ping()

True

In [3]:
BUCKET_TWEETS = 'tweets'
BUCKET_USERS = 'users'
BUCKET_HASHTAGS = 'hashtags'

tweets = myClient.bucket(BUCKET_TWEETS)
users = myClient.bucket(BUCKET_USERS)
hashtags = myClient.bucket(BUCKET_HASHTAGS)


In [4]:
def insert_hashtag(hashtag):
    # Read the current count; .data is None when the key does not exist yet.
    count = hashtags.get(hashtag).data
    if count is None:
        count = 1
    else:
        count = count + 1

    hashtags.new(hashtag, data=count).store()
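
`insert_hashtag` performs one read and one write per hashtag occurrence, and the read-modify-write is not atomic if several writers run at once. One possible alternative, sketched here under the assumption that all tweets are processed in a single pass by one process, is to tally occurrences in memory first and store each total once. The helper `count_hashtags` and the sample data are hypothetical.

```python
from collections import Counter


def count_hashtags(tweets_iter):
    """Tally hashtag occurrences, following retweets as insert_tweet does."""
    counts = Counter()
    for tweet in tweets_iter:
        counts.update(tweet.get("entities", []))
        # A retweet embeds the original tweet; count its hashtags too.
        if "retweeted_status" in tweet:
            counts.update(count_hashtags([tweet["retweeted_status"]]))
    return counts


# Hypothetical input: two tweets, the second one a retweet.
sample = [
    {"entities": ["NoSQL", "Riak"]},
    {"entities": ["NoSQL"], "retweeted_status": {"entities": ["NoSQL", "Redis"]}},
]
totals = count_hashtags(sample)
print(totals)

# Each total could then be written once, for example:
#     hashtags.new(tag, data=n).store()
```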

In [5]:
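def insert_user(user_json, replace=True):
    # The screen_name is the key in the users bucket.
    key_user = user_json["screen_name"]
    # Overwrite by default; with replace=False, store only if the user is new.
    if replace or not users.get(key_user).exists:
        user = users.new(key_user, user_json)
        user.store()

    return key_user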
    

In [6]:
def insert_tweet(tweet_json):
    tweet = tweets.new(tweet_json["id_str"], tweet_json)
    # Store the author and index the tweet by the author's screen_name.
    user = insert_user(tweet_json['user'])
    tweet.add_index('idx_usr_bin', user)

    # 'entities' holds the list of hashtags in this dataset.
    for hashtag in tweet_json['entities']:
        tweet.add_index('idx_hashtag_bin', hashtag)
        insert_hashtag(hashtag)

    # Mentioned users are stored only if not already present.
    for user_mentioned in tweet_json['user_mentions']:
        tweet.add_index('idx_user_mentioned_bin', user_mentioned["screen_name"])
        insert_user(user_mentioned, replace=False)

    tweet.store()

    # A retweet embeds the original tweet; store it as well.
    if 'retweeted_status' in tweet_json:
        insert_tweet(tweet_json['retweeted_status'])


In [8]:
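tweets_data_path = '../../data/tweets.json'

# One JSON document per line; count only the tweets that were inserted.
i = 0
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        tweet_json = json.loads(line)

        try:
            insert_tweet(tweet_json)
            i = i + 1
        except Exception:
            # Skip tweets that cannot be stored (e.g. missing fields).
            pass


print("%s tweets processed" % i)

2666 tweets processed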
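
The loop assumes one JSON document per line. As a sketch of how blank or malformed lines could be kept apart from insertion errors, the hypothetical generator `iter_json_lines` below yields only the lines that parse and counts the rest. It is not part of the original notebook.

```python
import json


def iter_json_lines(lines, stats):
    """Yield parsed objects from an iterable of JSON lines.

    Blank lines are skipped; lines that fail to parse are counted in
    stats["malformed"] instead of raising.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            stats["malformed"] = stats.get("malformed", 0) + 1


# Hypothetical input standing in for the lines of tweets.json.
raw_lines = ['{"id_str": "1"}\n', "\n", "not json\n", '{"id_str": "2"}\n']
stats = {}
parsed = list(iter_json_lines(raw_lines, stats))
print(len(parsed), "parsed,", stats.get("malformed", 0), "malformed")
```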


## Print the contents of one tweet

Print the contents of the tweet with id 655039580312371200.

In [9]:
tweet = tweets.get("655039580312371200").data

pp(tweet)

{
    'created_at': 'Fri Oct 16 15:16:20 +0000 2015',
    'entities': [
        'IBM',
        'DataEngine',
        'NoSQL',
        'PowerSystems',
        'TCO',
        'Redis',
        'CAPI',
        'Power8',
        'RLEC',
        'API',
    ],
    'favorite_count': 4,
    'id': 655039580312371200,
    'id_str': '655039580312371200',
    'lang': 'en',
    'retweet_count': 11,
    'source': '<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>',
    'text': 'The #IBM #DataEngine for #NoSQL on IBM #PowerSystems http://t.co/tllWUqRLiN #TCO #Redis #CAPI #Power8 #RLEC #API http://t.co/BLOyKri5XP',
    'user': {
        'created_at': 'Sun Nov 27 22:29:08 +0000 2011',
        'description': 'IT Consultant #IT #bigdata #analytics #cloud #edtech #K12 #IoT #datagov #MDM #datascience\nTweets are my opinions, not the official position of my employer.',
        'favourites_count': 1528,
        'followers_count': 5093,
        'friends_count': 5082,
        'geo_enabled': False,

## Show the information associated with one user

Show the information associated with the user DBaker007.

In [10]:
user = users.get("DBaker007").data

pp(user)

{
    'created_at': 'Sun Nov 27 22:29:08 +0000 2011',
    'description': 'IT Consultant #IT #bigdata #analytics #cloud #edtech #K12 #IoT #datagov #MDM #datascience\nTweets are my opinions, not the official position of my employer.',
    'favourites_count': 1528,
    'followers_count': 5093,
    'friends_count': 5082,
    'geo_enabled': False,
    'id': 422966469,
    'id_str': '422966469',
    'name': 'Duane Baker',
    'profile_image_url': 'http://pbs.twimg.com/profile_images/613317207121006593/fBJIw92n_normal.jpg',
    'screen_name': 'DBaker007',
    'statuses_count': 11902,
    'time_zone': 'Eastern Time (US & Canada)',
}


## List the hashtags with more than 100 occurrences

In [11]:
for keys in hashtags.stream_keys():
    for key in keys:
        count = hashtags.get(key).data
        if count > 100:
            print('Hashtag %s:%s' % (key,  count))

Hashtag nosql:606
Hashtag mongodb:114
Hashtag NoSQL:1830
Hashtag sqlserver:142
Hashtag nodejs:109
Hashtag BigData:843
Hashtag Java:794
Hashtag SQL:101
Hashtag Couchbase:108
Hashtag MongoDB:776
Hashtag java:152
Hashtag MongoDb:171
Hashtag hive:473
Hashtag SoapUi:756
Hashtag bigdata:268
Hashtag noSQL:112
Hashtag Hadoop:703
Hashtag Docker:106
Hashtag Oracle:159
Hashtag hadoop:138


## List the tweets associated with one hashtag

In [12]:
stream = tweets.stream_index("idx_hashtag_bin", 'NoSQL')
# The stream yields batches (lists) of keys; only the last tweet is kept.
for keys in stream.results:
    for key in keys:
        tweet = tweets.get(key).data

user = tweet["user"]["screen_name"]
print("Last tweet:")
pp(tweet)
print("User:", user)


Last tweet:
{
    'created_at': 'Fri Oct 16 01:10:08 +0000 2015',
    'entities': ['CBLiveNY', 'Couchbase', 'NoSQL', 'N1QL'],
    'favorite_count': 3,
    'id': 654826625255780352,
    'id_str': '654826625255780352',
    'lang': 'en',
    'retweet_count': 4,
    'source': '<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>',
    'text': 'Missed #CBLiveNY or just want to revisit the great presentations? Slides are posted: http://t.co/SJB6CMRkSU #Couchbase #NoSQL #N1QL',
    'user': {
        'created_at': 'Wed Jan 12 19:23:02 +0000 2011',
        'description': 'Most complete, open source NoSQL database http://t.co/8hh92daYQK\n\n \nGet started with Couchbase today!',
        'favourites_count': 1023,
        'followers_count': 160978,
        'friends_count': 611,
        'geo_enabled': True,
        'id': 237401623,
        'id_str': '237401623',
        'name': 'Couchbase',
        'profile_image_url': 'http://pbs.twimg.com/profile_images/654754343531380736/aEtiCqGZ_nor
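
In the cell above the stream yields batches (lists) of keys, hence the nested loops, and only the last tweet fetched is printed. A small sketch of flattening such batches into a single sequence of keys with `itertools.chain` follows. The helper `flatten_batches` is hypothetical, and the batches are stand-ins built from plain lists rather than a live Riak stream.

```python
from itertools import chain


def flatten_batches(batches):
    """Turn an iterable of key batches into one flat iterator of keys."""
    return chain.from_iterable(batches)


# Hypothetical batches, shaped like those yielded by a key stream.
batches = [["k1", "k2"], [], ["k3"]]
all_keys = list(flatten_batches(batches))
print(len(all_keys), "keys")
```

With a real stream, iterating over `flatten_batches(...)` applied to the `.results` attribute used in the notebook would replace the nested loops.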

## Show the tweets associated with one user

In [13]:
stream = tweets.stream_index("idx_usr_bin", user)
for keys in stream.results:
    for tweet_key in keys:
        tweet = tweets.get(tweet_key).data
        print("Tweet:", tweet_key)

# Print the last tweet fetched in full.
pp(tweet)


Tweet: 657595303273635840
Tweet: 656600601141358593
Tweet: 655122359364538368
Tweet: 656967829317488640
Tweet: 657924591885656065
Tweet: 657357065812320256
Tweet: 656185750871023616
Tweet: 654765852382785536
Tweet: 654826625255780352
Tweet: 656575352538558464
Tweet: 658077501328072704
Tweet: 657613935609315328
Tweet: 657298949997481985
Tweet: 656922788268253185
Tweet: 656243749274697728
Tweet: 657323674052251648
Tweet: 657000952646508544
Tweet: 655055781696507904
Tweet: 654729787643072512
Tweet: 658027458831126528
Tweet: 656887918468997120
Tweet: 654190712959844352
{
    'created_at': 'Wed Oct 14 07:03:14 +0000 2015',
    'entities': ['Couchbase', 'SQL', 'JSON', 'database', 'developers'],
    'favorite_count': 3,
    'id': 654190712959844352,
    'id_str': '654190712959844352',
    'lang': 'en',
    'retweet_count': 6,
    'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    'text': 'Download #Couchbase 4.0 and build apps using #SQL for #J

## List of users

In [15]:
for keys in users.stream_keys():  
    for key in keys:
        print('User %s' % key)

User unclejack__
User tosbourn
User tmj_usa_jobs
User teralytics
User tableau
User stigbra
User sskshsk
User social_rosie
User seti321
User rkeshavmurthy
User prajyotpro
User piyushswaroop
User nyike
User mogrep
User luisruizpavon
User lucaolivari
User lider_it
User lambdamedia
User kf
User k4l4m4r1s
User julianhyde
User joshollegien
User john_habib
User joaomc
User javapsyche
User hundredmondays
User fintechna
User fcanovai
User datachick
User cutting
User codeRecap
User backendsecret
User arachit1
User YelleHughes
User Victor_A_Herr
User SiliconArmada
User SRFitton
User SPairoux
User OlegYch
User Nubexx
User NeuvooITLee
User MookieFumi
User MongoDB
User MikeMcCrady2
User McMcgregory
User MartinTroup
User MarcMirallesDev
User Makolyte
User Jimmy_Byrd
User IT_Advice_Guru
User HilmyEssam
User GreggsHosting
User GonzalezCarmen
User BrightTALK
User BestJSON
User Beachbody
User yodathekiller
User xeraa
User webdevilopers
User tran_latoya
User tmj_usa_sales
User tarunProfile
User tamraraven