# Ingest Text Data
Labeled text data often comes in a structured format, for example reviews for sentiment analysis, news headlines for topic modeling, or documents for text classification. Such files typically have one column for the label, one column for the text, and sometimes additional attribute columns, so you can treat them as tabular data and ingest them as described in the previous section. Other text data, especially raw text, is unstructured and is often stored in .json or .txt files. This section discusses how to ingest these file types into a SageMaker notebook.


## Set Up Notebook

In [17]:
import io
import boto3
import sagemaker
from sagemaker import get_execution_role
import json

# Get region 
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()  # replace with your own bucket if you have one
# This is the role that SageMaker uses to access AWS resources (S3, CloudWatch) on your behalf
role = get_execution_role()
prefix = 'text_spam/spam'
prefix_json = 'json_jeo'
filename = 'SMSSpamCollection.txt'
filename_json = 'JEOPARDY_QUESTIONS1.json'

## Downloading data from Online Sources

### Text data (in structured .csv format): Twitter -- sentiment140
**Sentiment140** contains 1.6M tweets extracted using the Twitter API. The tweets have been annotated with sentiment (0 = negative, 4 = positive) and topics (hashtags used to retrieve tweets). The dataset contains the following columns:
* `target`: the polarity of the tweet (0 = negative, 4 = positive)
* `ids`: the id of the tweet (2087)
* `date`: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
* `flag`: the query (lyx). If there is no query, then this value is NO_QUERY.
* `user`: the user that tweeted (robotickilldozr)
* `text`: the text of the tweet (Lyx is cool)
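As a minimal sketch of how these six columns map onto a row, the snippet below parses one sample line with Python's `csv` module. It assumes the CSV files have no header row and that every field is quoted. The sample row is assembled from the example values in the list above (with an assumed `target` of 4), and `SENTIMENT140_COLUMNS`, `POLARITY` and `parse_sentiment140` are hypothetical names introduced here for illustration.

```python
import csv
import io

# Column names taken from the dataset description above (assumed order).
SENTIMENT140_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

# Polarity codes from the description: 0 = negative, 4 = positive.
POLARITY = {0: "negative", 4: "positive"}


def parse_sentiment140(lines):
    """Yield one dict per CSV row, with `target` converted to an int."""
    reader = csv.DictReader(lines, fieldnames=SENTIMENT140_COLUMNS)
    for row in reader:
        row["target"] = int(row["target"])
        yield row


# Sample row built from the example values listed above (not real data).
sample = io.StringIO(
    '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
)
rows = list(parse_sentiment140(sample))
print(rows[0]["text"], "->", POLARITY[rows[0]["target"]])
```

For the real files you would pass an open file handle instead of the `StringIO` object. The files are commonly read with `encoding="latin-1"`, but treat that as an assumption to verify against your copy of the data.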

In [3]:
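# Helper functions
def write_to_s3(filename, bucket, prefix):
    """Upload a local file to the bucket under the key <prefix>/<filename>."""
    # filename may include a local directory, which then becomes part of the key
    key = "{}/{}".format(prefix, filename)
    return boto3.Session().resource('s3').Bucket(bucket).upload_file(filename, key)

def upload_to_s3(bucket, prefix, filename):
    """Print the destination URL, then upload the file."""
    url = 's3://{}/{}/{}'.format(bucket, prefix, filename)
    print('Writing to {}'.format(url))
    write_to_s3(filename, bucket, prefix)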

In [4]:
!wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip -O sentimen140.zip
# Uncompressing
!unzip sentimen140.zip -d sentiment140

--2020-10-08 14:51:58--  http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip [following]
--2020-10-08 14:51:58--  https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81363704 (78M) [application/zip]
Saving to: ‘sentimen140.zip’


2020-10-08 14:52:02 (19.0 MB/s) - ‘sentimen140.zip’ saved [81363704/81363704]

Archive:  sentimen140.zip
  inflating: sentiment140/testdata.manual.2009.06.14.csv  
  inflating: sentiment140/training.1600000.processed.noemoticon.csv  


In [7]:
#upload the files to the S3 bucket
import glob
csv_files = glob.glob("sentiment140/*.csv")
for filename in csv_files:
    upload_to_s3(bucket, 'text_sentiment140', filename)

Writing to s3://sagemaker-us-east-2-060356833389/text_sentiment140/sentiment140/training.1600000.processed.noemoticon.csv
Writing to s3://sagemaker-us-east-2-060356833389/text_sentiment140/sentiment140/testdata.manual.2009.06.14.csv


### Text data (in .txt format): SMS Spam data 
The [SMS Spam Data](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) was manually extracted from the Grumbletext website, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. The data consists of a single text file in which each line contains the class label followed by the raw message. We use this dataset to show how to ingest text data in .txt format.
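A minimal sketch of parsing this layout line by line, assuming the label and the message are separated by a single tab (the same assumption the `sep="\t"` read later in this section relies on). The two sample lines are illustrative stand-ins for the file contents, and `parse_sms_line` is a hypothetical helper.

```python
import io


def parse_sms_line(line):
    """Split one line into (label, message) at the first tab."""
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message


# Illustrative stand-in for the contents of SMSSpamCollection.txt.
sample = io.StringIO(
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp\n"
)

# Skip blank lines, then collect (label, message) pairs.
records = [parse_sms_line(line) for line in sample if line.strip()]
labels = [label for label, _ in records]
print(labels)
```

Splitting only at the first tab keeps any later tab characters inside the message text.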

In [8]:
!wget http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip -O spam.zip
!unzip spam.zip -d spam

--2020-10-08 14:53:20--  http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip
Resolving www.dt.fee.unicamp.br (www.dt.fee.unicamp.br)... 143.106.12.20
Connecting to www.dt.fee.unicamp.br (www.dt.fee.unicamp.br)|143.106.12.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 210521 (206K) [application/zip]
Saving to: ‘spam.zip’


2020-10-08 14:53:23 (178 KB/s) - ‘spam.zip’ saved [210521/210521]

Archive:  spam.zip
  inflating: spam/readme             
  inflating: spam/SMSSpamCollection.txt  


In [9]:
txt_files = glob.glob("spam/*.txt")
for filename in txt_files:
    upload_to_s3(bucket, 'text_spam', filename)

Writing to s3://sagemaker-us-east-2-060356833389/text_spam/spam/SMSSpamCollection.txt


### Text Data (in .json format): Jeopardy Question data
The [Jeopardy Question](https://j-archive.com/) data was obtained by crawling the Jeopardy question archive website. It is an unordered list of questions, where each question has the following key-value pairs:

* `category`: the question category, e.g. "HISTORY"
* `value`: the dollar value of the question as a string, e.g. "\$200"
* `question`: the text of the question
* `answer`: the text of the answer
* `round`: one of "Jeopardy!", "Double Jeopardy!", "Final Jeopardy!" or "Tiebreaker"
* `show_number`: the show number as a string, e.g. "4680"
* `air_date`: the show air date in the format YYYY-MM-DD
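Because `value`, `show_number` and `air_date` are all stored as strings, it can help to convert them once after loading. The sketch below shows one way to do that. `parse_value` and `normalize_question` are hypothetical helpers, and the handling of missing values and thousands separators is an assumption about the data rather than something documented above.

```python
from datetime import date


def parse_value(value):
    """Convert a string such as "$200" to an int; return None if it is missing."""
    if value is None:
        return None
    return int(value.replace("$", "").replace(",", ""))


def normalize_question(record):
    """Return a copy of the record with typed value, show_number and air_date."""
    return {
        **record,
        "value": parse_value(record.get("value")),
        "show_number": int(record["show_number"]),
        "air_date": date.fromisoformat(record["air_date"]),
    }


# One record in the shape described above.
raw = {
    "category": "HISTORY",
    "air_date": "2004-12-31",
    "question": "'For the last 8 years of his life, Galileo was under "
                "house arrest for espousing this man's theory'",
    "value": "$200",
    "answer": "Copernicus",
    "round": "Jeopardy!",
    "show_number": "4680",
}
clean = normalize_question(raw)
print(clean["value"], clean["air_date"].year, clean["show_number"])
```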

In [10]:
#json file format
!wget http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz
# Uncompressing
!gunzip JEOPARDY_QUESTIONS1.json.gz
filename = 'JEOPARDY_QUESTIONS1.json'
upload_to_s3(bucket, 'json_jeo', filename)

--2020-10-08 14:53:26--  http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz
Resolving skeeto.s3.amazonaws.com (skeeto.s3.amazonaws.com)... 52.216.25.180
Connecting to skeeto.s3.amazonaws.com (skeeto.s3.amazonaws.com)|52.216.25.180|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12721082 (12M) [application/json]
Saving to: ‘JEOPARDY_QUESTIONS1.json.gz’


2020-10-08 14:53:27 (15.7 MB/s) - ‘JEOPARDY_QUESTIONS1.json.gz’ saved [12721082/12721082]

Writing to s3://sagemaker-us-east-2-060356833389/json_jeo/JEOPARDY_QUESTIONS1.json


## Ingest Data into SageMaker Notebook
## Method 1: Copying data to the Instance
The AWS Command Line Interface (CLI) is an easy way to copy your data from S3 to your SageMaker instance and to copy files between your S3 buckets. It is a quick approach when you are dealing with medium-sized data files, or when you are experimenting and doing exploratory analysis. The documentation can be found here: https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html

In [18]:
#Specify file names
prefix = 'text_spam/spam'
prefix_json = 'json_jeo'
filename = 'SMSSpamCollection.txt'
filename_json = 'JEOPARDY_QUESTIONS1.json'

In [11]:
#copy data to your sagemaker instance using AWS CLI
!aws s3 cp s3://$bucket/$prefix_json/ text/$prefix_json/ --recursive

download: s3://sagemaker-us-east-2-060356833389/json_jeo/JEOPARDY_QUESTIONS1.json to text/json_jeo/JEOPARDY_QUESTIONS1.json
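The same copy can also be scripted from Python instead of a `!` shell escape. The sketch below only builds and prints the `aws s3 cp` argument list. `build_s3_cp_command` and `copy_from_s3` are hypothetical helpers, and `my-example-bucket` is a placeholder rather than a real bucket.

```python
import subprocess


def build_s3_cp_command(bucket, prefix, local_dir, recursive=True):
    """Build the argument list for `aws s3 cp` from an S3 prefix to a local directory."""
    command = ["aws", "s3", "cp", "s3://{}/{}/".format(bucket, prefix), local_dir]
    if recursive:
        command.append("--recursive")
    return command


def copy_from_s3(bucket, prefix, local_dir):
    """Run the copy; this needs the AWS CLI and valid credentials."""
    subprocess.run(build_s3_cp_command(bucket, prefix, local_dir), check=True)


# "my-example-bucket" is a placeholder, not a real bucket.
command = build_s3_cp_command("my-example-bucket", "json_jeo", "text/json_jeo/")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` raises an exception if the CLI exits with an error.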


In [12]:
import json
data_location = "text/{}/{}".format(prefix_json, filename_json)
with open(data_location) as f:
    data = json.load(f)
    print(data[0])

{'category': 'HISTORY', 'air_date': '2004-12-31', 'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", 'value': '$200', 'answer': 'Copernicus', 'round': 'Jeopardy!', 'show_number': '4680'}


## Method 2: Use AWS compatible Python Packages
The easiest way to access your files in S3 without copying them into your instance storage is to use pre-built packages that can read data from a specified path string. For example, the pandas library uses the URI scheme to decide how to access the data: file:// looks on the local file system, while s3:// accesses the data in S3 (pandas relies on the s3fs package for this, which builds on botocore). You can find additional information here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html. For pandas, any valid string path is acceptable, and the string can be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected.

For text data, most of the time you can read the file line by line or use pandas to read it as a DataFrame by specifying a delimiter.

In [19]:
import pandas as pd
data_s3_location = "s3://{}/{}/{}".format(bucket, prefix, filename) # S3 URL
s3_tabular_data = pd.read_csv(data_s3_location, sep="\t", header=None)
s3_tabular_data.head()

|   | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
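The tab-delimited read can be tried without S3 access by pointing `read_csv` at an in-memory sample. This is a sketch under stated assumptions: the column names `label` and `text` are hypothetical, the sample lines are illustrative, and `quoting=csv.QUOTE_NONE` is an optional choice made here because SMS messages can contain stray double-quote characters that the default quoting rules may interpret as field delimiters.

```python
import csv
import io

import pandas as pd

# Illustrative stand-in for the file contents; the second message contains a quote.
sample = io.StringIO(
    'ham\tOk lar... Joking wif u oni...\n'
    'spam\tFree entry in 2 a wkly comp "to win\n'
    'ham\tU dun say so early hor...\n'
)

df = pd.read_csv(
    sample,
    sep="\t",
    header=None,              # the file has no header row
    names=["label", "text"],  # hypothetical column names
    quoting=csv.QUOTE_NONE,   # treat quote characters as ordinary text
)
print(df.shape)
print(df["label"].value_counts().to_dict())
```

The same keyword arguments can be passed when reading from an `s3://` path.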


For JSON files, depending on their structure, you can also use the pandas `read_json` function if the file is a flat JSON file.

In [20]:
data_json_location = "s3://{}/{}/{}".format(bucket, prefix_json, filename_json)
s3_tabular_data_json = pd.read_json(data_json_location, orient='records')
s3_tabular_data_json.head()

|   | category | air_date | question | value | answer | round | show_number |
|---|---|---|---|---|---|---|---|
| 0 | HISTORY | 2004-12-31 | 'For the last 8 years of his life, Galileo was... | \$200 | Copernicus | Jeopardy! | 4680 |
| 1 | ESPN's TOP 10 ALL-TIME ATHLETES | 2004-12-31 | 'No. 2: 1912 Olympian; football star at Carlis... | \$200 | Jim Thorpe | Jeopardy! | 4680 |
| 2 | EVERYBODY TALKS ABOUT IT... | 2004-12-31 | 'The city of Yuma in this state has a record a... | \$200 | Arizona | Jeopardy! | 4680 |
| 3 | THE COMPANY LINE | 2004-12-31 | 'In 1963, live on "The Art Linkletter Show", t... | \$200 | McDonald\'s | Jeopardy! | 4680 |
| 4 | EPITAPHS & TRIBUTES | 2004-12-31 | 'Signer of the Dec. of Indep., framer of the C... | \$200 | John Adams | Jeopardy! | 4680 |
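When a JSON file is not flat, one option is to load the records and flatten them with `pandas.json_normalize`. The records below are hypothetical and only illustrate the nested case; the Jeopardy file itself is flat and does not need this step.

```python
import pandas as pd

# Hypothetical nested records; the Jeopardy file itself is flat.
records = [
    {"id": 1, "text": "hello", "user": {"name": "alice", "lang": "en"}},
    {"id": 2, "text": "hola", "user": {"name": "bob", "lang": "es"}},
]

# Nested keys become dotted column names such as "user.name".
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```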


## Method 3: Use AWS Native methods
https://s3fs.readthedocs.io/en/latest/ <br>
S3Fs is a Pythonic file interface to S3. It builds on top of botocore. The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp, mv, ls, du, glob, etc., as well as put/get of local files to/from S3. 

In [18]:
import s3fs
fs = s3fs.S3FileSystem()
data_s3fs_location = "s3://{}/{}/".format(bucket, prefix)
# List all files under the prefix in your bucket
fs.ls(data_s3fs_location)

['sagemaker-us-east-1-942158337222/text_spam/spam/SMSSpamCollection.txt']

In [23]:
# Open it directly with s3fs
data_s3fs_location = "s3://{}/{}/{}".format(bucket, prefix, filename) # S3 URL
with fs.open(data_s3fs_location) as f:
    # header=None is not passed here, so pandas uses the first line of the file as column names
    print(pd.read_csv(f, sep='\t', nrows=2))

    ham  \
0   ham   
1  spam   

  Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...  
0                      Ok lar... Joking wif u oni...                                                               
1  Free entry in 2 a wkly comp to win FA Cup fina...                                                               
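Because `fs.open` returns a file-like object, reading code can be written against any object with a compatible `open(path, mode)` method and exercised locally before pointing it at S3. In the sketch below, `read_head` is a hypothetical helper and `LocalFS` is a hypothetical stand-in used only so the example runs without AWS credentials.

```python
import os
import tempfile


def read_head(fs, path, n=2, encoding="utf-8"):
    """Return the first n lines of `path`, opened through a file-system object."""
    lines = []
    with fs.open(path, "rb") as f:
        for raw_line in f:
            if len(lines) == n:
                break
            lines.append(raw_line.decode(encoding).rstrip("\r\n"))
    return lines


class LocalFS:
    """Hypothetical stand-in exposing the same open(path, mode) call on local files."""

    def open(self, path, mode="rb"):
        return open(path, mode)


# Write a small illustrative file and read it back through the stand-in.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as out:
        out.write("ham\tOk lar...\nspam\tFree entry\nham\tU dun say so\n")
    head = read_head(LocalFS(), sample_path)

print(head)
```

With S3 access, the intended usage would be to pass the `fs` object created above together with an `s3://` path in place of `LocalFS()` and the local path.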


In [24]:
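# Alternatively, read the object with the low-level boto3 S3 client
s3_obj = boto3.client('s3')
key = "{}/{}".format(prefix_json, filename_json)
s3_clientobj = s3_obj.get_object(Bucket=bucket, Key=key)
# The response body is a byte stream; read it fully and decode it to a string
s3_clientdata = s3_clientobj['Body'].read().decode('utf-8')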

In [29]:
s3_clientdata[:1000]

'[{"category": "HISTORY", "air_date": "2004-12-31", "question": "\'For the last 8 years of his life, Galileo was under house arrest for espousing this man\'s theory\'", "value": "$200", "answer": "Copernicus", "round": "Jeopardy!", "show_number": "4680"}, {"category": "ESPN\'s TOP 10 ALL-TIME ATHLETES", "air_date": "2004-12-31", "question": "\'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves\'", "value": "$200", "answer": "Jim Thorpe", "round": "Jeopardy!", "show_number": "4680"}, {"category": "EVERYBODY TALKS ABOUT IT...", "air_date": "2004-12-31", "question": "\'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year\'", "value": "$200", "answer": "Arizona", "round": "Jeopardy!", "show_number": "4680"}, {"category": "THE COMPANY LINE", "air_date": "2004-12-31", "question": "\'In 1963, live on \\"The Art Linkletter Show\\", this company served its billionth burger\'", "value": "$200", "answer":
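The decoded string is still raw JSON text, so a natural next step is to parse it into Python objects with the `json` module. The sketch below uses `io.BytesIO` as a stand-in for the body returned by `get_object`, so it runs without AWS access. `load_json_records` is a hypothetical helper and the sample payload is shortened for illustration.

```python
import io
import json


def load_json_records(body, encoding="utf-8"):
    """Read a bytes stream fully and parse it as a JSON array of records."""
    return json.loads(body.read().decode(encoding))


# io.BytesIO stands in for the streaming body returned by get_object.
fake_body = io.BytesIO(
    b'[{"category": "HISTORY", "value": "$200", "answer": "Copernicus"}]'
)
records = load_json_records(fake_body)
print(records[0]["answer"])
```

Note that this reads the whole object into memory, which is fine for a file of this size but may not suit much larger objects.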