# Marites ML Component

## Overview

This notebook contains the main logic for the Marites ML component. It uses the AWS Comprehend service to run targeted sentiment analysis, and the results will be used for our graph.

In [4]:
import sys
!{sys.executable} -m pip install -r requirements.txt

!{sys.executable} -m pip show boto3

DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Name: boto3
Version: 1.21.37
Summary: The AWS SDK for Python
Home-page: https://github.com/boto/boto3
Author: Amazon Web Services
Author-email: 
License: Apache License 2.0
Location: /usr/local/lib/python3.9/site-packages
Requires: botocore, jmespath, s3transfer
Required-by: aws-lambda-powertools


## Prepare our data
We assume that the `user-tweets.csv` file is in the current working directory. If it is not present, run the `extract-tweets` notebook to generate it.

In [21]:
# Group all libraries used in this notebook here
import pandas as pd
import re
import boto3
import json
import time
from datetime import datetime
import os
from dotenv import load_dotenv
from io import StringIO
import tarfile

load_dotenv()

print("Import complete.")

Import complete.


In [6]:
# Group all constants used up here
username = 'elonmusk'
region = 'ap-southeast-1'
language_code = 'en'
input_bucket = 'marites-comprehend-input'
output_bucket = 'marites-comprehend-output'
data_access_role_arn = os.environ.get("DATA_ACCESS_ROLE")
input_doc_format = 'ONE_DOC_PER_LINE'

session_id = '1' # we need to generate this in actual implementation
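
The comment on `session_id` notes that a real implementation needs to generate this value. One possible approach, shown here only as a sketch, is a random UUID from the standard library. The helper name `new_session_id` is hypothetical and not part of the original notebook.

```python
import uuid

def new_session_id() -> str:
    """Return a random identifier that is safe to use in an S3 key prefix."""
    # uuid4 is random; .hex drops the dashes, leaving 32 lowercase hex characters
    return uuid.uuid4().hex

session_id = new_session_id()
print(session_id)
```

Because the value is random, two runs for the same username would write to different S3 prefixes instead of overwriting each other.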

In [7]:
def prepare_data(data):
    # Drop the index column that was saved along with the user tweets CSV
    user_tweets = data.drop(columns=["Unnamed: 0"])

    # Strip links from the text: keep only what precedes the first https:// URL
    user_tweets['text'] = user_tweets['text'].apply(lambda x: re.split(r'https://.*', str(x))[0])

    # Remove all non-ASCII characters (this drops emojis, but also accented letters)
    user_tweets = user_tweets.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

    # Remove blank tweets; copy so the assignments below do not act on a view
    user_tweets = user_tweets[user_tweets.text.str.strip().str.len() != 0].copy()

    # Ensure that each tweet sits on a single line (needed for ONE_DOC_PER_LINE input)
    user_tweets['text'] = user_tweets['text'].str.replace('\n', ' ', regex=False)
    user_tweets['text'] = user_tweets['text'].str.replace('\r', ' ', regex=False)

    return user_tweets

raw_tweets = pd.read_csv('./user-tweets.csv')
user_tweets = prepare_data(raw_tweets)
user_tweets.head()

Unnamed: 0,tweet_id,username,text
1,1511297746661253120,thesheetztweetz,Breaking - Amazon $AMZN signed the biggest roc...
2,1511153708654120963,thesheetztweetz,The U.S. Air Force's 388th Fighter Wing tested...
3,1511137391263715331,thesheetztweetz,U.S. Space Force Brig. Gen. Stephen Purdy rece...
4,1511087590832758789,thesheetztweetz,"Due the vent valve issue, the launch director ..."
5,1510994152175149062,thesheetztweetz,The countdown clock has now resumed at T-6:40 ...
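
To make the cleaning steps in `prepare_data` easier to check by hand, here is a minimal sketch that applies the same three transformations to a single string, using only the standard library. The function name `clean_tweet` and the sample tweet are made up for illustration.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the same cleaning steps as prepare_data to one string."""
    # Keep only what precedes the first https:// link
    text = re.split(r'https://.*', text)[0]
    # Drop every non-ASCII character (emojis, but also accented letters)
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Put the whole tweet on one line
    return text.replace('\n', ' ').replace('\r', ' ')

print(repr(clean_tweet('Liftoff! \U0001F680 https://t.co/abc123')))
```

Two things are worth noticing when trying this out: the pattern only targets `https://` links, so plain `http://` links are left in place, and removing an emoji leaves the surrounding whitespace behind.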


## Analyse tweets using AWS

We'll use AWS Comprehend's targeted sentiment analysis to identify a user's sentiment towards a particular topic or entity.

In [8]:
def upload_to_s3(data, bucket_name, file_name):
    text_buffer = StringIO()
    data.text.to_csv(text_buffer, sep=' ', index=False, header=False)
    s3_resource = boto3.resource('s3')
    return s3_resource.Object(bucket_name, file_name).put(Body=text_buffer.getvalue())

date = datetime.now().strftime("%m-%d-%y")
job_suffix = '{}-{}'.format(date, username)
input_file_name = '{}/{}/{}-input.txt'.format(session_id, username, job_suffix)
upload_to_s3(user_tweets, input_bucket, input_file_name)

{'ResponseMetadata': {'RequestId': '8YFH6GRGW3MKY1ED',
  'HostId': '+TlAWrH12mgDidfpUlOu5mN7rCTTKIzyx503SCpxK7vwUSd3a/oz9FEVKy4f2eoceYbVh+zIqBQ=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '+TlAWrH12mgDidfpUlOu5mN7rCTTKIzyx503SCpxK7vwUSd3a/oz9FEVKy4f2eoceYbVh+zIqBQ=',
   'x-amz-request-id': '8YFH6GRGW3MKY1ED',
   'date': 'Sun, 10 Apr 2022 00:55:31 GMT',
   'x-amz-expiration': 'expiry-date="Tue, 12 Apr 2022 00:00:00 GMT", rule-id="comprehend-bucket-lifecycle"',
   'etag': '"d3933e79fb02a6cad17cd1cdb405767b"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'Expiration': 'expiry-date="Tue, 12 Apr 2022 00:00:00 GMT", rule-id="comprehend-bucket-lifecycle"',
 'ETag': '"d3933e79fb02a6cad17cd1cdb405767b"'}
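
One detail of `upload_to_s3` may be worth a second look. With `sep=' '`, `to_csv` applies standard CSV quoting, so a tweet that contains a space is likely to be written wrapped in double quotes. Whether that affects the Comprehend results has not been verified here. As an alternative, the sketch below builds the one-document-per-line body by hand, without any CSV quoting. The helper name `to_one_doc_per_line` is hypothetical.

```python
def to_one_doc_per_line(texts):
    """Return the documents as plain text, one per line, without CSV quoting."""
    lines = []
    for text in texts:
        # A stray line break would split one tweet into two documents
        lines.append(str(text).replace('\r', ' ').replace('\n', ' '))
    # Terminate every document, including the last one, with a newline
    return ''.join(line + '\n' for line in lines)

body = to_one_doc_per_line(['first tweet', 'second "quoted" tweet'])
print(body)
```

The returned string could be passed as `Body` to the S3 `put` call in place of `text_buffer.getvalue()`.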

In [10]:
input_s3_url = 's3://{}/{}/{}'.format(input_bucket, session_id, username)
output_s3_url = 's3://{}/{}/{}'.format(output_bucket, session_id, username)

def start_targeted_sentiment_job(input_s3_url, output_s3_url):
    input_data_config = {
        'S3Uri': input_s3_url,
        'InputFormat': input_doc_format
    }

    output_data_config = {
        'S3Uri': output_s3_url
    }

    job_name = 'Targeted_Sentiment_Job_{}'.format(job_suffix)
    
    comprehend = boto3.client('comprehend', region_name=region)
    return comprehend.start_targeted_sentiment_detection_job(InputDataConfig=input_data_config,
                                                             OutputDataConfig=output_data_config, 
                                                             DataAccessRoleArn=data_access_role_arn, 
                                                             LanguageCode=language_code, 
                                                             JobName=job_name)


response = start_targeted_sentiment_job(input_s3_url, output_s3_url)
job_id = response['JobId']
print('job_id: ' + job_id)

job_id: 1acb7e9516f91f9ee7d50da370a8cace


In [12]:
comprehend = boto3.client('comprehend', region_name=region)

# Poll once a minute until the job reaches a terminal state
while True:
    result = comprehend.describe_targeted_sentiment_detection_job(JobId=job_id)
    job_status = result["TargetedSentimentDetectionJobProperties"]["JobStatus"]
    print("job_status: " + job_status)
    if job_status in ['COMPLETED', 'FAILED', 'STOPPED']:
        break
    time.sleep(60)

job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: COMPLETED
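
The polling loop above has no upper bound on how long it waits. A possible refinement, sketched below under assumptions, is a small helper that gives up after a timeout. It takes any zero-argument function that returns the current status, so it can be tried without calling AWS. The names `wait_for_job` and `TERMINAL_STATES` and the default timeout are illustrative choices, not part of the original notebook.

```python
import time

TERMINAL_STATES = {'COMPLETED', 'FAILED', 'STOPPED'}

def wait_for_job(get_status, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout passes."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError('job still {} after {}s'.format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds

# A fake status source stands in for the Comprehend call in this demo
statuses = iter(['SUBMITTED', 'IN_PROGRESS', 'COMPLETED'])
print(wait_for_job(lambda: next(statuses), sleep=lambda seconds: None))
```

In the notebook, `get_status` could be a lambda that calls `describe_targeted_sentiment_detection_job` and returns the `JobStatus` field.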


## Download the data locally

Download the results so that we can examine the output and decide how to use it.

In [20]:
def download_results(job_id, bucket_name, file_name):
    comprehend = boto3.client('comprehend', region_name=region)
    result = comprehend.describe_targeted_sentiment_detection_job(JobId=job_id)
    result_s3_url = result['TargetedSentimentDetectionJobProperties']['OutputDataConfig']['S3Uri']
    s3_name = 's3://{}/'.format(bucket_name)
    results_aws_filename = result_s3_url.replace(s3_name, '')
    s3 = boto3.client('s3')
    s3.download_file(bucket_name, results_aws_filename, file_name)

local_results_filename = './results-{}.tar.gz'.format(job_suffix)
download_results(job_id, output_bucket, local_results_filename)
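
`download_results` derives the object key by removing the `s3://<bucket>/` prefix from the output URI with a string replace. A slightly more defensive variant, shown as a sketch, parses the URI with the standard library and rejects anything that is not an S3 URI. The helper name `split_s3_uri` and the example URI are made up for illustration.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: {}'.format(uri))
    # The path keeps its leading slash, which is not part of the object key
    return parsed.netloc, parsed.path.lstrip('/')

# Hypothetical URI, used only to show the shape of the result
print(split_s3_uri('s3://example-bucket/1/someuser/results.tar.gz'))
```

The returned bucket and key could then be passed to `s3.download_file` directly, which avoids relying on the bucket name matching the URI.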

In [22]:
def extract_targz(targz_file, output_path=''):
    # Pick the read mode from the file extension; other extensions are ignored
    if targz_file.endswith("tar.gz"):
        mode = "r:gz"
    elif targz_file.endswith("tar"):
        mode = "r:"
    else:
        return
    # The context manager closes the archive even if extraction fails
    with tarfile.open(targz_file, mode) as tar:
        tar.extractall(path=output_path)

output_path = 'output'
extract_targz(local_results_filename, output_path)
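
`extractall` writes every member of the archive to disk, and a member name containing `..` can escape the target directory. The archive here comes from our own Comprehend job, so this is a precaution rather than a known problem. A minimal sketch of a pre-check is shown below. The helper name `unsafe_members` is hypothetical, and the check covers member names only, not symlinks inside the archive.

```python
import os
import tarfile

def unsafe_members(tar_path, output_path):
    """Return names of archive members that would land outside output_path."""
    root = os.path.realpath(output_path)
    bad = []
    # 'r:*' lets tarfile detect the compression (plain tar or gzip) on its own
    with tarfile.open(tar_path, 'r:*') as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(root, member.name))
            # A safe target shares the whole root path as its common prefix
            if os.path.commonpath([root, target]) != root:
                bad.append(member.name)
    return bad
```

A cautious caller could run `unsafe_members(local_results_filename, output_path)` first and only call `extract_targz` when the returned list is empty.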