## Data sourcing

### 0. Prior steps to create the dataset

1. GET tweets using the Twitter REST API ([doc](https://dev.twitter.com/rest/public))
2. Aggregate the tweets into this [single JSON source file](https://github.com/eolecvk/intro_spark_twitter/blob/master/data/tweets.json) with [this script](https://github.com/eolecvk/intro_spark_twitter/blob/master/utils/json_aggregator.py)
3. Upload the source file to S3 (in this case with public READ access)
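
The aggregation in step 2 could look roughly like the sketch below. It is not the linked `json_aggregator.py`, only an illustration under stated assumptions: each API response was saved as its own `.json` file in a local directory, each file holds either a single tweet object or a list of tweets, and the names `aggregate_tweets`, `input_dir`, and `output_path` are hypothetical. The output has one JSON object per line, which is the layout Spark's JSON reader expects by default.

```python
import glob
import json
import os


def aggregate_tweets(input_dir, output_path):
    """Merge every *.json file in input_dir into one line-delimited JSON file.

    Returns the number of tweets written. (Hypothetical helper.)
    """
    count = 0
    with open(output_path, "w") as out:
        for path in sorted(glob.glob(os.path.join(input_dir, "*.json"))):
            # Skip the output file itself if it lives in the input directory
            if os.path.abspath(path) == os.path.abspath(output_path):
                continue
            with open(path) as f:
                payload = json.load(f)
            # Assumption: a file holds either one tweet or a list of tweets
            tweets = payload if isinstance(payload, list) else [payload]
            for tweet in tweets:
                out.write(json.dumps(tweet) + "\n")
                count += 1
    return count
```

Calling `aggregate_tweets("raw_tweets", "tweets.json")` would then produce the single file to upload in step 3; the directory name here is only a placeholder.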

### 1. Create a Spark table from source file stored in S3

The following function checks whether the table already exists. If it does not, the function:

+ Loads the data from S3 to the Databricks FileSystem as a JSON source file
+ Creates a DataFrame from the JSON source file
+ Saves the DataFrame to the Hive Metastore as a Spark table for persistence

In [2]:
SOURCE_URL = "https://s3.amazonaws.com/XXXXXXXXXXXXXXXXXX"
TABLE_NAME = "YOUR_TABLE_NAME"

def createTable(tableName=TABLE_NAME, sourceURL=SOURCE_URL):
    """
    Create a new table from the content of a JSON-formatted file
    stored in S3 and publicly accessible via URL.
    """
    # urllib2 exists only on Python 2; use it when urllib.request is missing
    try:
        from urllib.request import urlopen
    except ImportError:
        from urllib2 import urlopen

    # Test existence of table (the metastore stores table names in lowercase)
    table_exists = (
      sql("SHOW TABLES")
      .filter("tableName = '{}'".format(tableName.lower()))
      .count()
      )

    if table_exists:
        print("> Table `{}` already exists".format(tableName))
    else:
        print("> Table `{}` not found in metastore".format(tableName))

        # Download the file content from the URL as a single string
        print("> Loading data from {}".format(sourceURL))
        json_str = urlopen(sourceURL).read().decode("utf-8")

        # Write the string to the Databricks FileSystem (True = overwrite)
        print("> Saving JSON file to FileSystem")
        dbutils.fs.put("/tmp/tweets.json", json_str, True)

        # Create new df from json file
        print("> Creating new DataFrame from json file")
        new_df = sqlContext.read.json("/tmp/tweets.json")

        # Saving df as table for persistence
        print("> Saving DataFrame as Table `{}`".format(tableName))
        new_df.write.saveAsTable(tableName)

createTable()
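
Before the downloaded string is written to the FileSystem, it can help to confirm that it really is line-delimited JSON. Spark's JSON reader expects one object per line by default, and in its default permissive mode it routes lines it cannot parse to a `_corrupt_record` column rather than failing. The sketch below is one possible pre-check; `check_json_lines` is a hypothetical helper that is not part of the original notebook, and it assumes the source file holds one JSON object per line.

```python
import json


def check_json_lines(text):
    """Return (valid_count, bad_line_numbers) for line-delimited JSON text."""
    valid, bad = 0, []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
            valid += 1
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            bad.append(number)
    return valid, bad
```

Inside `createTable`, this could be called on `json_str` just before `dbutils.fs.put`, for example to print the offending line numbers when the second element of the result is not empty.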

### 2. Check the created table by querying it

In [4]:
# Option_1: Using SQL query
tweet_df_sql = sql("SELECT * FROM {}".format(TABLE_NAME))

# Display
tweet_df_sql.printSchema()
tweet_df_sql.show()

In [5]:
# Option_2: Using sqlContext
tweet_df_sqlContext = sqlContext.table(TABLE_NAME)

# Display
tweet_df_sqlContext.printSchema()
tweet_df_sqlContext.show()