# DynamoDB

DynamoDB is one of the storage services offered by AWS. It is a fast, flexible NoSQL database service suited to many kinds of applications. Amazon claims it offers consistent, single-digit millisecond latency at any scale. An optional in-memory cache, DynamoDB Accelerator (DAX), can reduce response times from milliseconds to microseconds, even at millions of requests per second. DynamoDB supports both document and key-value store models. Document-oriented databases are one of the main categories of NoSQL databases. They are designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data.

Relational databases generally store data in separate tables that are defined by the programmer, and a single object may be spread across several tables. Document databases store all information for a given object in a single instance in the database, and every stored object can be different from every other. 

Its flexible data model, reliable performance, and automatic scaling of throughput capacity make it a great fit for mobile, web, gaming, ad tech, IoT, and many other applications.

This notebook outlines some operations on DynamoDB databases using boto3.

### Creating a Table

DynamoDB is schemaless, except for the key schema. This means you have to specify the key schema (attribute name and type) when creating the table, but there is no need to declare any non-key attributes. Each record you insert can have a different set of attributes, as long as it includes the key attributes.

From the [documentation page](http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#DynamoDB.Client.create_table), 

AttributeDefinitions: An array of attributes that describe the key schema for the table and indexes.

KeySchema: Specifies the attributes that make up the primary key for a table or an index. The attributes in KeySchema must also be defined in the AttributeDefinitions array. 

Each KeySchemaElement in the array is composed of: 
- AttributeName - The name of this key attribute.
- KeyType - The role that the key attribute will assume:
    - HASH - partition key
    - RANGE - sort key
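
As a minimal sketch of how these two parameters fit together, the helper below builds matching `KeySchema` and `AttributeDefinitions` lists for a partition key and an optional sort key. The helper name `build_key_definitions` and the `semester` sort key are illustrative assumptions, not part of boto3 or of this notebook's tables.

```python
def build_key_definitions(partition_key, partition_type, sort_key=None, sort_type=None):
    """Return (KeySchema, AttributeDefinitions) lists for a table's primary key.

    Types use DynamoDB's scalar type codes: 'S' (string), 'N' (number), 'B' (binary).
    """
    key_schema = [{"AttributeName": partition_key, "KeyType": "HASH"}]
    attribute_definitions = [
        {"AttributeName": partition_key, "AttributeType": partition_type}
    ]
    if sort_key is not None:
        # Every attribute named in KeySchema also needs an AttributeDefinitions entry.
        key_schema.append({"AttributeName": sort_key, "KeyType": "RANGE"})
        attribute_definitions.append(
            {"AttributeName": sort_key, "AttributeType": sort_type}
        )
    return key_schema, attribute_definitions


# Hypothetical composite key: one item per course per semester.
key_schema, attribute_definitions = build_key_definitions(
    "courseid", "N", sort_key="semester", sort_type="S"
)
print(key_schema)
print(attribute_definitions)
```

The two lists can then be passed as the `KeySchema` and `AttributeDefinitions` arguments of `create_table`.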

Create DynamoDB client and resource objects. 

In [None]:
import boto3
import botocore

# dynamodb = boto3.client('dynamodb')
client = boto3.client('dynamodb')
dynamodb = boto3.resource('dynamodb')

The function below first checks, inside the try block, whether a table with the given name already exists in the database. If it does, the function creates nothing. If the table doesn't exist, the function creates one named after the input parameter **table_name**, with a primary key attribute named by **key_name** whose type is given by **KeyType**.

It prints a message saying either that the table already exists or that the new table has been created.

In [None]:
def create_dynamodb_table(table_name, key_name, KeyType):
    try:
        # describe_table raises ResourceNotFoundException when the table is missing
        client.describe_table(TableName=table_name)
        print("DynamoDB table '" + table_name + "' already exists.")
    except botocore.exceptions.ClientError as e:
        # Re-raise anything other than "table not found" (for example, permission errors)
        if e.response['Error']['Code'] != 'ResourceNotFoundException':
            raise
        print("DynamoDB table '" + table_name + "' does not appear to exist, creating...")
        table = dynamodb.create_table(
                    TableName = table_name,
                    KeySchema = [ { 'AttributeName': key_name,
                                    'KeyType': 'HASH'  } ], # Partition key
                    AttributeDefinitions = [ 
                                  { 'AttributeName': key_name,
                                  'AttributeType': KeyType 
                                  } ],
                    ProvisionedThroughput = { 'ReadCapacityUnits': 1,
                                              'WriteCapacityUnits': 1 }
                )
        # Wait until the table exists.
        table.meta.client.get_waiter('table_exists').wait(TableName=table_name) 
        print("DynamoDB table '" + table_name + "' created.")

Call the create_dynamodb_table() function to create the "dsa_courses" table with "courseid" as the primary key, which is of type number. The value "N" in the call below specifies the data type of the primary key.

In [None]:
create_dynamodb_table("dsa_courses","courseid","N")

### Write to DynamoDB

A table object lets you write records to the table.

In [None]:
table = dynamodb.Table('dsa_courses')

table.put_item(
   Item={
        'coursename': 'Intro to data science',
        'courseid': 7600,
        'credits': 3
    }
)

### Read from DynamoDB

Retrieve records from the table. Use the get_item() method to read an item from a table. You must specify the primary key value to read any item from the table.



In [None]:
import boto3

response = table.get_item(
   Key={
        'courseid': 7600
    }
)

item = response['Item']
name = item['coursename']

print(item)
print("Welcome to, {}" .format(name))
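
When no item matches the key, the `get_item` response simply has no `'Item'` entry, so indexing `response['Item']` raises a `KeyError`. A small sketch of a defensive read follows; the helper name `extract_item` and the hand-written response dictionaries are assumptions for illustration only.

```python
def extract_item(response, default=None):
    """Return the item from a get_item response, or `default` if no item matched."""
    return response.get("Item", default)


# Hand-written dictionaries standing in for real get_item responses.
found = {"Item": {"courseid": 7600, "coursename": "Intro to data science"}}
missing = {"ResponseMetadata": {"HTTPStatusCode": 200}}

print(extract_item(found))
print(extract_item(missing, default={}))
```

With a real table this would be called as `extract_item(table.get_item(Key={'courseid': 7600}))`.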

### Updating Items

Update a record in the table. Use the update_item() method to modify an existing item. You can update values of existing attributes, add new attributes, or remove attributes.

In the example below, we update the course name from 'Intro to data science' to 'Introduction to data science'.

In [None]:
table.update_item(
    Key={
        'courseid':7600
    },
    UpdateExpression='SET coursename= :val1',
    ExpressionAttributeValues={
        ':val1': 'Introduction to data science'
    }
)
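
The kinds of change described above map onto two clauses of an update expression: `SET` both updates existing attributes and adds new ones, and `REMOVE` deletes attributes. The sketch below assembles the keyword arguments for `update_item` from plain Python values. The helper name `build_update_kwargs` and the `instructor` attribute are hypothetical, and the `#` placeholders (passed through `ExpressionAttributeNames`) are used so that attribute names which happen to be DynamoDB reserved words do not break the expression.

```python
def build_update_kwargs(set_values=None, remove_names=None):
    """Build UpdateExpression-related keyword arguments for update_item."""
    names, values, clauses = {}, {}, []
    if set_values:
        parts = []
        for i, (attr, value) in enumerate(set_values.items()):
            names["#s%d" % i] = attr      # placeholder for the attribute name
            values[":v%d" % i] = value    # placeholder for the new value
            parts.append("#s%d = :v%d" % (i, i))
        clauses.append("SET " + ", ".join(parts))
    if remove_names:
        parts = []
        for i, attr in enumerate(remove_names):
            names["#r%d" % i] = attr
            parts.append("#r%d" % i)
        clauses.append("REMOVE " + ", ".join(parts))
    if not clauses:
        raise ValueError("nothing to update")
    kwargs = {
        "UpdateExpression": " ".join(clauses),
        "ExpressionAttributeNames": names,
    }
    if values:
        # Only include the values map when it is non-empty.
        kwargs["ExpressionAttributeValues"] = values
    return kwargs


update_kwargs = build_update_kwargs(
    set_values={"coursename": "Introduction to data science", "instructor": "TBD"},
    remove_names=["credits"],
)
print(update_kwargs["UpdateExpression"])
```

The result would be used as `table.update_item(Key={'courseid': 7600}, **update_kwargs)`.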

### Delete table

In [None]:
response = client.delete_table(
    TableName='dsa_courses'
)

### Let's insert records into a DynamoDB table for the courses offered in DSA

The data about courses is stored in a CSV file. Here are the steps to load data from any CSV file into Amazon DynamoDB.


- Create the pandas dataframe from the source data

- Convert the dataframe to a list of dictionaries (JSON-like records) that can be consumed by any NoSQL database

- Write each record created from the dataframe to the table using the put_item method

In [None]:
# Create the pandas dataframe from the source data

import pandas as pd

create_dynamodb_table("dsa_courses","courseid","N")
table = dynamodb.Table('dsa_courses')

df=pd.read_csv('dsa_courses.csv')

df.columns=["courseid","coursename","credits"]


print("\n Top 5 rows of data in input file",df.head())


# Convert dataframe to list of dictionaries (JSON) that can be consumed by any no-sql database
json_data=df.T.to_dict().values()

for course in json_data:
    table.put_item(Item=course)
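
One caveat with the conversion step: the boto3 resource API does not accept Python `float` values and expects `decimal.Decimal` instead, and pandas represents empty cells as `NaN`, which DynamoDB cannot store as a number. The sketch below cleans a single record before it is passed to `put_item`. The helper name `to_dynamodb_item` and the sample row are assumptions for illustration, and only these two cases (floats and missing values) are handled.

```python
import math
from decimal import Decimal


def to_dynamodb_item(record):
    """Return a copy of `record` that is safe to pass to put_item."""
    item = {}
    for name, value in record.items():
        if value is None:
            continue  # skip missing values entirely
        if isinstance(value, float):
            if math.isnan(value):
                continue  # pandas uses NaN for empty cells
            # Go through str() so 3.1 becomes Decimal('3.1'), not a long binary expansion.
            value = Decimal(str(value))
        item[name] = value
    return item


# Hypothetical row, shaped like one element of json_data.
row = {"courseid": 8610, "coursename": "Statistics", "credits": 3.0, "notes": float("nan")}
item = to_dynamodb_item(row)
print(item)
```

In the loading loop this would become `table.put_item(Item=to_dynamodb_item(course))`.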

In [None]:
# Test
response = table.get_item(Key={'courseid': 8610})
response["Item"]

### Scan

The scan() method, unlike get_item(), reads every item in the table and returns all of the data. If an optional FilterExpression is provided, only the items matching its criteria are returned. However, the filter is applied only after the items have been read, so a filtered scan consumes the same read capacity as an unfiltered one.

The **FilterExpression** used below specifies a condition; only items that satisfy it are returned, and all other items are discarded.

In [None]:
from boto3.dynamodb.conditions import Attr

def scan_table(table_name, filter_key=None, filter_value=None):
    """
    Perform a scan operation on a table.
    Optionally filter on filter_key (attribute name) being equal to filter_value.
    """
    table = dynamodb.Table(table_name)

    # Sample filter expression - Attr('year').between(1950, 1959)
    # If a filter is given as input, only matching records are returned. Otherwise, all records are returned.
    # Note: a single scan call returns at most 1 MB of data.

    if filter_key is not None and filter_value is not None:
        filtering_exp = Attr(filter_key).eq(filter_value)
        response = table.scan(FilterExpression=filtering_exp)
    else:
        response = table.scan()

    return response

### Display table contents

In [None]:
# Display all the items in dsa_courses table
scan_table("dsa_courses")["Items"]

# Stream tweets

### Tweepy library for streaming Twitter data

[Read this doc](http://docs.tweepy.org/en/v3.4.0/streaming_how_to.html) for more information on the functions provided by the Tweepy package.

The Twitter streaming API is used to download Twitter messages in real time. It is useful for obtaining a high volume of tweets, or for creating a live feed using a site stream or user stream.

In Tweepy, an instance of tweepy.Stream establishes a streaming session and routes messages to a StreamListener instance. The on_data method of a stream listener receives all messages and calls functions according to the message type.

Using the streaming API therefore takes three steps.

* Create a class inheriting from StreamListener
* Using that class create a Stream object
* Connect to the Twitter API using the Stream.

##### Step 1: Creating a StreamListener

Create a class MyStreamListener that inherits from StreamListener. Several methods can be overridden. For example, override the on_status() method to print the status text. The on_data method of Tweepy's StreamListener conveniently passes data from statuses to the on_status method.

```python
import tweepy

# Override tweepy.StreamListener to add logic to on_status
class MyStreamListener(tweepy.StreamListener):

    def on_status(self, status):
        print(status.text)
```
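
To make the routing idea concrete without a Twitter connection, here is a simplified, self-contained illustration of an on_data-style dispatcher. It is not Tweepy's actual implementation: the class names `SketchListener` and `CollectingListener` are hypothetical, and the message shapes (`text`, `delete`, `limit`) are assumptions about the JSON the stream sends.

```python
import json


class SketchListener:
    """Simplified illustration of on_data-style routing (not Tweepy's code)."""

    def on_data(self, raw_data):
        message = json.loads(raw_data)
        if "delete" in message:
            return self.on_delete(message["delete"]["status"]["id"])
        if "limit" in message:
            return self.on_limit(message["limit"]["track"])
        if "text" in message:
            return self.on_status(message)
        return True  # ignore message types this sketch does not know

    # Default handlers do nothing; subclasses override the ones they need.
    def on_status(self, status):
        return True

    def on_delete(self, status_id):
        return True

    def on_limit(self, track):
        return True


class CollectingListener(SketchListener):
    """Subclass that only cares about tweets and stores their text."""

    def __init__(self):
        self.seen = []

    def on_status(self, status):
        self.seen.append(status["text"])
        return True


listener_demo = CollectingListener()
listener_demo.on_data('{"text": "hello from the stream"}')
listener_demo.on_data('{"limit": {"track": 12}}')
print(listener_demo.seen)
```

Only the tweet message reaches the overridden `on_status`; the limit notice is routed to the default `on_limit`, which does nothing.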


##### Step 2: Creating a Stream

We need an API object to stream. See the Authentication Tutorial to learn how to get one. Once we have an API object and a status listener, we can create our stream object:

```python
myStreamListener = MyStreamListener()
# Pass the listener instance itself; it must not be called like a function
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)
```



##### Step 3: Starting a Stream
Tweepy offers several kinds of streams; most use cases rely on filter, user_stream, or sitestream. The example below uses filter to stream all tweets containing the word India or America. The track parameter is an array of search terms to stream.

```python
myStream.filter(track=['India','America'])
```

In [None]:
# Write a tab-separated header, matching the tab-separated rows appended by the listener below
with open("Output.txt", "w") as file:
    file.write("tweetid\ttext\n")

In the code cell below, we use the Tweepy library to create a Twitter stream, collect tweets from the stream for 10 seconds, and write them to a text file called "Output.txt".

In [None]:
#Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import tweepy
import json
import time

# import logging
# import time
# from logging.handlers import RotatingFileHandler

# Twitter security credentials, read from environment variables so that secrets
# are not stored in the notebook (the variable names below are our own choice)
import os
ACCESS_TOKEN    = os.environ["TWITTER_ACCESS_TOKEN"]
ACCESS_SECRET   = os.environ["TWITTER_ACCESS_SECRET"]
CONSUMER_KEY    = os.environ["TWITTER_CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["TWITTER_CONSUMER_SECRET"]

start_time = time.time() #grabs the system time


# "listener" class is inheriting from StreamListener. Implement StreamListener to get the stream.
class listener(StreamListener):
    # This is a basic listener that just writes received tweets to a file.

    # Initialize the instance
    def __init__(self, time_limit=5):
        super().__init__()
        self.time = time.time()
        self.limit = time_limit

    # on_data is one of the methods in the StreamListener class. By default it figures out what kind of data Twitter sent
    # and calls an appropriate method to deal with that data type (tweets, deleted tweets, direct messages, and more).
    # Overriding it here means that every raw message is handled by this method instead.
    def on_data(self, data):
        if (time.time() - self.time) < self.limit:
            with open("Output.txt", "a") as tweet_log:
                try:
                    # Twitter returns data in JSON format - we need to decode it first
                    decoded = json.loads(data)
                    # Drop non-ASCII characters, then decode back to str so that no b'...' prefix is written
                    text = decoded['text'].encode('ascii', 'ignore').decode('ascii')
                    # The first column holds the user's screen name
                    msg = '%s\t%s\n' % (decoded['user']['screen_name'], text)
                    tweet_log.write(msg)
                    return True

                except Exception as e:
                    print('failed ondata,', str(e))
                    time.sleep(5)
        else:
            # Returning False stops the stream once the time limit has passed
            return False

    # Ignore retweets. The tweet that is passed into the on_status method is an instance of the Status class.
    # A retweet carries the attribute retweeted_status; if that attribute exists, don't process the tweet.
    # Note: because on_data is overridden above, on_status is not called automatically.
    def on_status(self, status):
        if hasattr(status, 'retweeted_status'):
            return
    

#This handles Twitter authentication and the connection to the Twitter Streaming API
#Set up tweepy to authenticate with Twitter
auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

#Create an API object to pull data from Twitter – we’ll pass in the authentication:

# api = tweepy.API(auth)

#Create an instance of the tweepy Stream class, which will stream the tweets. 
# stream_listener = StreamListener()

#Pass the auth credentials so that Twitter allows the connection.
#Start streaming tweets by calling the filter method. This will start streaming tweets from the filter.json API endpoint.
#We pass in a list of terms to filter on, as the API requires.
twitterStream = Stream(auth, listener(time_limit=10)) #initialize Stream object with a time out limit
twitterStream.filter(track=['India','America'], languages=['en'])  

#The filter call captures tweets matching whatever keywords you pass in track, for example: 'India','America'
# stream.filter(track=['India','America'])
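
Tweet text can itself contain tabs and line breaks, which would split one tweet across several columns or rows of the tab-separated file. The sketch below formats one raw message as a single clean line and skips messages that are not tweets. The helper name `tweet_to_tsv_line` and the sample message are assumptions for illustration.

```python
import json


def tweet_to_tsv_line(raw_data):
    """Return 'screen_name<TAB>text' plus a newline for a tweet, or None otherwise."""
    message = json.loads(raw_data)
    if "text" not in message or "user" not in message:
        return None  # not a tweet, for example a delete or limit notice
    # Collapse tabs, line breaks and repeated spaces so the tweet stays on one line.
    text = " ".join(message["text"].split())
    return "%s\t%s\n" % (message["user"]["screen_name"], text)


# Hypothetical raw message, shaped like the JSON passed to on_data.
sample = json.dumps(
    {"user": {"screen_name": "example_user"}, "text": "India\tvs\nAmerica  today"}
)
line = tweet_to_tsv_line(sample)
print(repr(line))
```

Inside a listener, the returned line would be written to the file only when it is not `None`.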

## Note:

Check whether the TwitterAnalysis table already exists in DynamoDB. Run the list_tables() method to list all the tables present in the database. The TableNames entry in the output shows which tables exist.

In [None]:
table_name='TwitterAnalysis'

response = client.list_tables()
response

In [29]:
# If you see the table called TwitterAnalysis in the output of above cell, uncomment below code lines and run the cell. 
# It will delete the existing table. 

# response = client.delete_table(
#     TableName='TwitterAnalysis'
# )

Create the TwitterAnalysis table and populate it with the tweets in Output.txt.

Follow the steps discussed above to write the delimited text file to a DynamoDB table:

- Create the pandas dataframe from the source data

- Convert the dataframe to a list of dictionaries (JSON-like records) that can be consumed by any NoSQL database

- Write each record created from the dataframe to the table using the put_item method

In [None]:
create_dynamodb_table(table_name,"tweetid","S")
table = dynamodb.Table(table_name)

import pandas as pd
df=pd.read_csv('Output.txt',sep='\t')

df.columns=["tweetid","text"]

print("\n Top 5 rows of data in input file\n\n",df.head())


# Convert dataframe to list of dictionaries (JSON) that can be consumed by any no-sql database
json_data=df.T.to_dict().values()

for tweet in json_data:
    table.put_item(Item=tweet)

In [None]:
response = client.describe_table(
    TableName=table_name
)
response
# scan_table("TwitterAnalysis")

In [None]:
response = client.scan(
    TableName=table_name,
    Select='ALL_ATTRIBUTES'
)
response

# Delete the table

Delete the table by running the cell below.

In [None]:
response = client.delete_table(
    TableName=table_name
)