## Storing Twitter data in Riak

The following buckets will be created:

|Bucket|Key|Content|
|-|-|-|
|tweets|Tweet id|Tweet data in JSON format, including the retweeted tweet and the user data|
|users|User code (screen_name)|User data in JSON format|
|hashtags|Hashtag|Number of times the hashtag appears across all tweets|

The tweets bucket will have the following indexes:

|Index name|Content|Which searches does it support?|
|-|-|-|
|idx_usr_bin|User who created the tweet|Tweets created by a given user|
|idx_hashtag_bin|Hashtag mentioned in the tweet|Tweets in which a given hashtag appears|
|idx_user_mentioned_bin|User mentioned in the tweet|Tweets in which a given user is mentioned|
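
To make the index table concrete, the sketch below derives the index entries for a single tweet without touching Riak. It assumes the tweet layout used by the loader code in this notebook: `entities` is a list of hashtag strings and `user_mentions` is a list of dictionaries with a `screen_name` field. The helper name `index_entries` and the sample tweet are made up for illustration.

```python
def index_entries(tweet_json):
    """Return the (index_name, value) pairs that describe one tweet."""
    # One entry for the author of the tweet
    entries = [('idx_usr_bin', tweet_json['user']['screen_name'])]
    # One entry per hashtag (assumed to be plain strings under 'entities')
    for hashtag in tweet_json.get('entities', []):
        entries.append(('idx_hashtag_bin', hashtag))
    # One entry per mentioned user
    for mentioned in tweet_json.get('user_mentions', []):
        entries.append(('idx_user_mentioned_bin', mentioned['screen_name']))
    return entries


# Hypothetical tweet, not taken from the data set
sample = {
    'id_str': '1',
    'user': {'screen_name': 'alice'},
    'entities': ['NoSQL', 'Riak'],
    'user_mentions': [{'screen_name': 'bob'}],
}
entries = index_entries(sample)
for name, value in entries:
    print('%s -> %s' % (name, value))
```

Each pair corresponds to one `add_index(name, value)` call on the Riak object before it is stored, so a tweet with several hashtags ends up with several values under the same index name.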

In [1]:
import riak
from pprintpp import pprint as pp
import json

In [2]:
# connect to database
myClient = riak.RiakClient()
myClient.ping()

True

In [3]:
BUCKET_TWEETS = 'tweets'
BUCKET_USERS = 'users'
BUCKET_HASHTAGS = 'hashtags'

tweets = myClient.bucket(BUCKET_TWEETS)
users = myClient.bucket(BUCKET_USERS)
hashtags = myClient.bucket(BUCKET_HASHTAGS)


In [4]:
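def insert_hashtag(hashtag):
    # Read the current counter; .data is None when the key does not exist yet
    count = hashtags.get(hashtag).data
    if count is None:
        count = 1
    else:
        count = count + 1

    # Write the updated counter back under the same key
    hashtags.new(hashtag, data=count).store()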
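
The counter above performs one read and one write per hashtag occurrence, and a read-modify-write of this kind can lose updates if several loaders run at the same time. One possible alternative, sketched below under stated assumptions, is to count in memory first and write each hashtag once. The helper names (`iter_with_retweets`, `count_hashtags`, `store_counts`) and the sample tweets are hypothetical, `store` is a stand-in for whatever writes a key/value pair, and the approach overwrites any previously stored count, so it only fits a load into an empty bucket.

```python
from collections import Counter


def iter_with_retweets(tweet_json):
    """Yield a tweet and then, recursively, the tweet it retweets (if any)."""
    while tweet_json is not None:
        yield tweet_json
        tweet_json = tweet_json.get('retweeted_status')


def count_hashtags(tweet_stream):
    """Count hashtag occurrences in memory, including retweeted tweets."""
    counts = Counter()
    for tweet_json in tweet_stream:
        for t in iter_with_retweets(tweet_json):
            counts.update(t.get('entities', []))
    return counts


def store_counts(counts, store):
    """Write every counter once; `store` is any callable taking (key, value)."""
    for hashtag, count in counts.items():
        store(hashtag, count)


# Made-up sample data; a plain dict stands in for the hashtags bucket
sample_tweets = [
    {'entities': ['NoSQL', 'Riak']},
    {'entities': ['NoSQL'], 'retweeted_status': {'entities': ['NoSQL']}},
]
saved = {}
store_counts(count_hashtags(sample_tweets), saved.__setitem__)
print(saved)
```

With the bucket defined in this notebook, `store` could be something like `lambda key, value: hashtags.new(key, data=value).store()`, which reuses the same calls as `insert_hashtag`.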

In [5]:
def insert_user(user_json, replace = True):
    key_user = user_json["screen_name"]
    if replace or not users.get(key_user).exists:
        user = users.new(key_user, user_json)
        user.store()
        
    return key_user
    

In [6]:
def insert_tweet(tweet_json):
    tweet = tweets.new(tweet_json["id_str"], tweet_json)
    user = insert_user(tweet_json['user'])
    tweet.add_index('idx_usr_bin', user)
    
    for hashtag in tweet_json['entities']:
        tweet.add_index('idx_hashtag_bin', hashtag)   
        insert_hashtag(hashtag)
        
    for user_mentioned in tweet_json['user_mentions']:
        tweet.add_index('idx_user_mentioned_bin', user_mentioned["screen_name"])   
        insert_user(user_mentioned, replace = False)    
        
    tweet.store()     

    if 'retweeted_status' in tweet_json:
        insert_tweet(tweet_json['retweeted_status'])


In [7]:
tweets_data_path = '../data/tweets.json'

i = 1
# The file holds one JSON document per line
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        tweet_json = json.loads(line)

        try:
            insert_tweet(tweet_json)
            i = i + 1
        except Exception:
            # Skip tweets that cannot be inserted
            pass


print("%s tweets processed" % i)

2664 tweets processed
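
Because the loop above silently skips any tweet that fails, the final number does not say how many lines were rejected. The following is a minimal sketch of a loader that reports both figures. It assumes one JSON document per line, as in the cell above; `load_tweets`, `fake_insert` and the sample lines are hypothetical, and `insert` stands in for `insert_tweet`.

```python
import json


def load_tweets(lines, insert):
    """Parse one JSON document per line and return (loaded, failed)."""
    loaded, failed = 0, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            insert(json.loads(line))
            loaded += 1
        except (ValueError, KeyError):
            # ValueError: malformed JSON; KeyError: a required field is missing
            failed += 1
    return loaded, failed


def fake_insert(tweet_json):
    # Stand-in for insert_tweet: only checks that the key field is present
    tweet_json['id_str']


sample_lines = ['{"id_str": "1"}', 'not json', '', '{"x": 1}']
loaded, failed = load_tweets(sample_lines, fake_insert)
print("%s tweets loaded, %s lines failed" % (loaded, failed))
```

Catching only `ValueError` and `KeyError` is a design choice for the sketch: connection errors from the database would still surface instead of being counted as bad input.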


## Exercise 1

Print the content of the tweet with the id 655039580312371200

In [10]:
tweet = tweets.get("655039580312371200").data

pp(tweet)

{
    u'created_at': u'Fri Oct 16 15:16:20 +0000 2015',
    u'entities': [
        u'IBM',
        u'DataEngine',
        u'NoSQL',
        u'PowerSystems',
        u'TCO',
        u'Redis',
        u'CAPI',
        u'Power8',
        u'RLEC',
        u'API',
    ],
    u'favorite_count': 4,
    u'id': 655039580312371200,
    u'id_str': u'655039580312371200',
    u'lang': u'en',
    u'retweet_count': 11,
    u'source': u'<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>',
    u'text': u'The #IBM #DataEngine for #NoSQL on IBM #PowerSystems http://t.co/tllWUqRLiN #TCO #Redis #CAPI #Power8 #RLEC #API http://t.co/BLOyKri5XP',
    u'user': {
        u'created_at': u'Sun Nov 27 22:29:08 +0000 2011',
        u'description': u'IT Consultant #IT #bigdata #analytics #cloud #edtech #K12 #IoT #datagov #MDM #datascience\nTweets are my opinions, not the official position of my employer.',
        u'favourites_count': 1528,
        u'followers_count': 5093,
        u'friends_count': 508

## Exercise 2

Show the information associated with the user DBaker007

In [12]:
user = users.get("DBaker007").data

pp(user)

{
    u'created_at': u'Sun Nov 27 22:29:08 +0000 2011',
    u'description': u'IT Consultant #IT #bigdata #analytics #cloud #edtech #K12 #IoT #datagov #MDM #datascience\nTweets are my opinions, not the official position of my employer.',
    u'favourites_count': 1528,
    u'followers_count': 5093,
    u'friends_count': 5082,
    u'geo_enabled': False,
    u'id': 422966469,
    u'id_str': u'422966469',
    u'name': u'Duane Baker',
    u'profile_image_url': u'http://pbs.twimg.com/profile_images/613317207121006593/fBJIw92n_normal.jpg',
    u'screen_name': u'DBaker007',
    u'statuses_count': 11902,
    u'time_zone': u'Eastern Time (US & Canada)',
}


In [13]:
# Print the hashtags that appear more than 100 times
for keys in hashtags.stream_keys():
    for key in keys:
        count = hashtags.get(key).data
        if count > 100:
            print('Hashtag %s:%s' % (key,  count))

Hashtag hive:473
Hashtag hadoop:138
Hashtag java:152
Hashtag Java:794
Hashtag SoapUi:756
Hashtag MongoDB:776
Hashtag sqlserver:142
Hashtag nodejs:109
Hashtag BigData:842
Hashtag Docker:106
Hashtag Hadoop:703
Hashtag Couchbase:108
Hashtag bigdata:268
Hashtag noSQL:112
Hashtag MongoDb:171
Hashtag Oracle:159
Hashtag nosql:605
Hashtag mongodb:114
Hashtag NoSQL:1829
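
The listing above shows that hashtag keys are case-sensitive, so `NoSQL`, `nosql` and `noSQL` are counted separately. If a case-insensitive total is wanted, the counts can be merged after the fact, as in this sketch. The dictionary reproduces only the counts printed above (hashtags with more than 100 occurrences), so variants below that threshold are missing from the totals; `merge_case_variants` is a hypothetical helper.

```python
from collections import defaultdict

# Counts copied from the output above (only hashtags with more than 100 uses)
printed_counts = {
    'hive': 473, 'hadoop': 138, 'java': 152, 'Java': 794, 'SoapUi': 756,
    'MongoDB': 776, 'sqlserver': 142, 'nodejs': 109, 'BigData': 842,
    'Docker': 106, 'Hadoop': 703, 'Couchbase': 108, 'bigdata': 268,
    'noSQL': 112, 'MongoDb': 171, 'Oracle': 159, 'nosql': 605,
    'mongodb': 114, 'NoSQL': 1829,
}


def merge_case_variants(counts):
    """Sum the counts of hashtags that differ only in letter case."""
    merged = defaultdict(int)
    for hashtag, count in counts.items():
        merged[hashtag.lower()] += count
    return dict(merged)


merged = merge_case_variants(printed_counts)
# Show the merged totals, largest first
for hashtag, count in sorted(merged.items(), key=lambda kv: -kv[1]):
    print('Hashtag %s:%s' % (hashtag, count))
```

An alternative would be to lowercase the hashtag before using it as a key at load time, at the cost of losing the original spelling.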


In [14]:
# Fetch every tweet indexed under the hashtag 'NoSQL'; only the last one is kept
stream = tweets.stream_index("idx_hashtag_bin", 'NoSQL')
for keys in stream.results:
    for key in keys:
        tweet = tweets.get(key).data

user = tweet["user"]["screen_name"]
print("Last tweet:")
pp(tweet)
print("User: %s" % user)


Last tweet:
{
    u'created_at': u'Fri Oct 16 15:16:20 +0000 2015',
    u'entities': [
        u'IBM',
        u'DataEngine',
        u'NoSQL',
        u'PowerSystems',
        u'TCO',
        u'Redis',
        u'CAPI',
        u'Power8',
        u'RLEC',
        u'API',
    ],
    u'favorite_count': 4,
    u'id': 655039580312371200,
    u'id_str': u'655039580312371200',
    u'lang': u'en',
    u'retweet_count': 11,
    u'source': u'<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>',
    u'text': u'The #IBM #DataEngine for #NoSQL on IBM #PowerSystems http://t.co/tllWUqRLiN #TCO #Redis #CAPI #Power8 #RLEC #API http://t.co/BLOyKri5XP',
    u'user': {
        u'created_at': u'Sun Nov 27 22:29:08 +0000 2011',
        u'description': u'IT Consultant #IT #bigdata #analytics #cloud #edtech #K12 #IoT #datagov #MDM #datascience\nTweets are my opinions, not the official position of my employer.',
        u'favourites_count': 1528,
        u'followers_count': 5093,
        u'frien

In [20]:
# List the keys of the tweets created by that user, then print the last one fetched
stream = tweets.stream_index("idx_usr_bin", user)
for keys in stream.results:
    for tweet_key in keys:
        tweet = tweets.get(tweet_key).data
        print("Tweet: %s" % tweet_key)

pp(tweet)


Tweet: 657223844797751296
Tweet: 655039580312371200
Tweet: 657213912396468224
{
    u'created_at': u'Thu Oct 22 15:16:21 +0000 2015',
    u'entities': [u'datapreparation', u'BI', u'datamining', u'NoSQL'],
    u'favorite_count': 0,
    u'id': 657213912396468224,
    u'id_str': u'657213912396468224',
    u'lang': u'en',
    u'retweet_count': 8,
    u'source': u'<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>',
    u'text': u'10 tools and platforms for #datapreparation @DataScienceCtrl https://t.co/5SbT0YCZ6J #BI #datamining #NoSQL https://t.co/nR7waAnR9a',
    u'user': {
        u'created_at': u'Sun Nov 27 22:29:08 +0000 2011',
        u'description': u'IT Consultant #IT #bigdata #analytics #cloud #edtech #K12 #IoT #datagov #MDM #datascience\nTweets are my opinions, not the official position of my employer.',
        u'favourites_count': 1528,
        u'followers_count': 5093,
        u'friends_count': 5082,
        u'geo_enabled': False,
        u'id': 422966469,
       

In [21]:
for keys in users.stream_keys():
    for key in keys:
        print('User %s' % key)

User zacaj_
User whatevermatt
User vermanivivek
User turkiriyadha
User trengarajan
User spendysf
User shiv_patil4sm
User shakamunyi
User richardanchieta
User rey_atkinson
User red_bin
User recuweb
User portable_hole
User peteaven
User pedrofmb
User omnianax
User nileshmodak
User neo4j
User mebehindadesk
User kogir
User joshingly
User jorgeahernan
User jetbrains
User jeffrey_g_clark
User jdegoes
User jamesddube
User iTunesPodcast
User frankyston
User edcarve
User durtgeek
User daosold
User danielckv
User boldarray
User babarbashir
User awasim
User amandlei316
User ale_dcn
User adron
User adaptorel
User _thinkIT_
User _such_a_surge_
User _EdsonDionisio
User TechProfession1
User Talena_Inc
User Server_failure
User SamuelAdey_
User PeterIturry
User Pentaho
User POS_easy
User NoSQLDigest
User NeuvooCsrPhi
User LukeVizzicks
User Loui_Picard
User JunsongW
User Jaime_Mehl
User Finextra
User Decideo
User DamianoMe
User CloudExpoWire
User CloudCMS
User ChrisMayDev
User BrettWidmann
User BizBuild