# Objective

This notebook connects to AWS and runs a short "hello world" check against each of the services the project uses:

1) S3
2) Lambda (try the hello world from the midterm)
3) CloudWatch

To connect, an access key was created in IAM and stored in a local `.env` file. The credentials are temporary and expire after 30 days. Create keys with care, and use separate keys for dev and prod.


* Outside a POC environment, consider other security practices such as SSO.
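Because the keys expire and live in a local file, it can help to fail early when an entry is missing or empty. The following is a minimal sketch, not part of the original notebook: `missing_config_keys` is a hypothetical helper, and the key names are assumed to match the ones this notebook reads when it creates the session.

```python
# Key names assumed to match the lookups used when the boto3 session is created.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key", "region")


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in required if not config.get(key)]


# In-memory dict standing in for dotenv_values("../.env"); values are placeholders.
example = {
    "aws_access_key_id": "example-id",
    "aws_secret_access_key": "",
    "region": "us-east-1",
}
print(missing_config_keys(example))
```

In the notebook this could run right after `dotenv_values`, with an `assert not missing_config_keys(config)` before the session is built.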


In [14]:
# include in requirements
# !pip install boto3
# !pip install pandas
# !pip install pydot
# !pip install python-dotenv

In [15]:
import sys

import boto3
from dotenv import dotenv_values
import pandas as pd

config = dotenv_values("../.env") 

print(sys.version)

3.11.8 (v3.11.8:db85d51d3e, Feb  6 2024, 18:02:37) [Clang 13.0.0 (clang-1300.0.29.30)]


### Initialize Services

In [16]:
# Initialize a session using aws cred
session = boto3.Session(aws_access_key_id=config["aws_access_key_id"],
                        aws_secret_access_key=config["aws_secret_access_key"],
                        region_name=config["region"])

s3 = session.client("s3")
lambda_client = session.client("lambda")
cloudwatch = session.client("logs")  # CloudWatch Logs API

### Test S3 Connection

In [17]:
d = s3.list_buckets()

# show current buckets
b = [n["Name"] for n in d["Buckets"]]


# validate that the buckets this project needs are visible
# (the same check also lives in /tests)

assert 'fmi-lambda-demo' in b, "missing the lambda demo" # midterm
assert 'team4-cosmicai' in b, "missing our team4 cosmicai S3 connection" 

b

['aws-athena-query-results-211125778552-us-east-1',
 'cosmicai-data',
 'cosmicai2',
 'fmi-lambda-demo',
 'group2-s3-bucket',
 'group4-s3-bucket',
 'sagemaker-studio-211125778552-3zpozdpwzcx',
 'sagemaker-studio-211125778552-rrp76qgcj1n',
 'sagemaker-us-east-1-211125778552',
 'team-one-cosmic-data',
 'team-one-s3-cosmic',
 'team2cosmicai',
 'team3cosmicai',
 'team4-cosmicai']
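The two assertions stop at the first missing bucket. A small helper that reports every missing name at once may be easier to reuse in `/tests`. This is a sketch only: `missing_buckets` is a hypothetical name, and it assumes the client exposes `list_buckets()` with the same response shape used above.

```python
def missing_buckets(s3_client, required):
    """Return the names in `required` that list_buckets does not report."""
    existing = {bucket["Name"] for bucket in s3_client.list_buckets()["Buckets"]}
    return sorted(set(required) - existing)


# Usage with the S3 client created earlier (needs valid credentials):
# missing = missing_buckets(s3, ["fmi-lambda-demo", "team4-cosmicai"])
# assert not missing, f"missing buckets: {missing}"
```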

Log events are read from a log stream, and log streams belong to a log group, so the lookup goes from group to stream to events:

```mermaid
graph TD
    A[Log Groups] --> B[Log Streams]
    B --> C[Log Events]
```
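The chain in the diagram can be sketched as one helper that walks from a group to a stream to its events. `first_events` is a hypothetical function, not code from this project; it assumes a CloudWatch Logs client and simply takes the first matching group and the first stream returned.

```python
def first_events(logs_client, group_prefix, max_events=5):
    """Follow group -> stream -> events for the first matching group and stream."""
    groups = logs_client.describe_log_groups(logGroupNamePrefix=group_prefix)["logGroups"]
    if not groups:
        return []
    group_name = groups[0]["logGroupName"]

    streams = logs_client.describe_log_streams(logGroupName=group_name, limit=1)["logStreams"]
    if not streams:
        return []

    response = logs_client.get_log_events(
        logGroupName=group_name,
        logStreamName=streams[0]["logStreamName"],
        limit=max_events,
        startFromHead=True,  # read from the oldest event in the stream
    )
    return [event["message"] for event in response["events"]]


# Usage with a real client (needs valid credentials):
# first_events(cloudwatch, "/aws/lambda/")
```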

***

### Log Groups


In [18]:

l = []
r = cloudwatch.describe_log_groups()

for group in r['logGroups']:
    l.append(group['logGroupName'])

df_log_groups = pd.DataFrame(l, columns=["log_group_names"])

# general listing; exclude sagemaker groups (many instances related to labs)
df_log_groups[~df_log_groups.log_group_names.str.contains("sagemaker")].head(25)

# may be interested in the cosmic ai logs
# df_log_groups[df_log_groups.log_group_names.str.contains("cosmic")]


| | log_group_names |
|---|---|
| 0 | /aws-glue/column-statistics |
| 1 | /aws-glue/jobs/error |
| 2 | /aws-glue/jobs/output |
| 3 | /aws-glue/sessions/error |
| 4 | /aws-glue/sessions/output |
| 5 | /aws-glue/testconnection/error/Redo |
| 6 | /aws-glue/testconnection/error/Redshift_connec... |
| 7 | /aws-glue/testconnection/error/team3-con |
| 8 | /aws-glue/testconnection/output/Redo |
| 9 | /aws-glue/testconnection/output/Redshift_conne... |
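A single `describe_log_groups` call returns one page of results, so an account with many groups may need pagination to see all of them. The sketch below uses the boto3 paginator for that operation; `list_log_group_names` is a hypothetical helper name.

```python
def list_log_group_names(logs_client):
    """Collect log group names across every page of describe_log_groups."""
    names = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        names.extend(group["logGroupName"] for group in page["logGroups"])
    return names


# Usage with the CloudWatch Logs client created earlier (needs valid credentials):
# df_log_groups = pd.DataFrame(list_log_group_names(cloudwatch), columns=["log_group_names"])
```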


### Log Streams
***

In [19]:
LOG_GROUP = "/aws/lambda/data-parallel-init"

l = []

r = cloudwatch.describe_log_streams(logGroupName=LOG_GROUP)
for stream in r['logStreams']:
    l.append(stream['logStreamName'])

df_log_streams_raw = pd.DataFrame(l, columns=["raw_streams"])

In [20]:
# what a stream name should look like
df_log_streams_raw.iloc[-10].values[0]

'2024/11/22/[$LATEST]0eaeff78011e4031a381ecc431f1b26a'

In [21]:
# to make more readable 

df_log_streams = df_log_streams_raw["raw_streams"].str.split(r"\[\$LATEST\]", expand=True)
df_log_streams.columns = ["date_pulled", "stream_hash"]

df_log_streams

| | date_pulled | stream_hash |
|---|---|---|
| 0 | 2024/11/07/ | 0fcf0e06f1174a16aca0adba257efc84 |
| 1 | 2024/11/07/ | 1738a628c5cf4796b51d6c481da4b746 |
| 2 | 2024/11/07/ | 27d9c7e9e6464016b8dfdd647c214fb5 |
| 3 | 2024/11/07/ | 6a1b2157e1c6436db3bdc4e740fd2f41 |
| 4 | 2024/11/20/ | 0d836483f503404fa3ad3d332da2f8c4 |
| 5 | 2024/11/20/ | 344f2b0cdf904d4a9dce3aba4e7158e3 |
| 6 | 2024/11/20/ | 586ddf9995cd4000a65778d5853521c8 |
| 7 | 2024/11/22/ | 0eaeff78011e4031a381ecc431f1b26a |
| 8 | 2024/11/22/ | 272db91f41c1428e988f3084f113f535 |
| 9 | 2024/11/22/ | 3dd38b8393f848bfaff52e67b369a996 |
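The split above is tied to the literal `[$LATEST]` marker. A more general parse might look like the sketch below, which assumes the `YYYY/MM/DD/[version]id` layout seen in the sample stream name. `parse_stream_name` and the returned field names are hypothetical.

```python
import re
from datetime import date

# Layout assumed from the sample stream name: YYYY/MM/DD/[version]id
STREAM_NAME = re.compile(
    r"^(\d{4})/(\d{2})/(\d{2})/\[(?P<version>[^\]]+)\](?P<stream_id>\w+)$"
)


def parse_stream_name(name):
    """Split a stream name into date, version, and id; return None if it does not match."""
    match = STREAM_NAME.match(name)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.group(1, 2, 3))
    return {
        "date": date(year, month, day),
        "version": match.group("version"),
        "stream_id": match.group("stream_id"),
    }


parsed = parse_stream_name("2024/11/22/[$LATEST]0eaeff78011e4031a381ecc431f1b26a")
print(parsed)
```

Returning a real `date` lets the streams be sorted or filtered by day without string handling.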


### Log Events
***

In [22]:
# can now get log events

# try the last stream in the listing; describe_log_streams sorts by stream name
# (ascending) by default, so this is a stream from the most recent date
LOG_STREAM = df_log_streams_raw.iloc[-1].values[0]

r = cloudwatch.get_log_events(logGroupName=LOG_GROUP, logStreamName=LOG_STREAM)

for event in r['events']:
    print(f"Timestamp: {event['timestamp']}, Message: {event['message']}")

Timestamp: 1732415152518, Message: INIT_START Runtime Version: python:3.12.v38	Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:7515e00d6763496e7a147ffa395ef5b0f0c1ffd6064130abb5ecde5a6d630e86

Timestamp: 1732415152929, Message: [INFO]	2024-11-24T02:25:52.928Z		Found credentials in environment variables.

Timestamp: 1732415153152, Message: START RequestId: 0f80b1cb-7fbe-4554-acc8-51db015e343b Version: $LATEST

Timestamp: 1732415153155, Message: [ERROR] KeyError: 'bucket'
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 35, in lambda_handler
    bucket = event["bucket"]
Timestamp: 1732415153157, Message: END RequestId: 0f80b1cb-7fbe-4554-acc8-51db015e343b

Timestamp: 1732415153157, Message: REPORT RequestId: 0f80b1cb-7fbe-4554-acc8-51db015e343b	Duration: 5.24 ms	Billed Duration: 6 ms	Memory Size: 128 MB	Max Memory Used: 81 MB	Init Duration: 631.31 ms	
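Stream names sort by date and then by id, so the last row of a name-sorted listing is not necessarily the stream with the newest events. One possible alternative is to ask CloudWatch Logs to order by last event time. In this sketch, `latest_stream_name` is a hypothetical helper, while `orderBy`, `descending`, and `limit` are parameters of `describe_log_streams`.

```python
def latest_stream_name(logs_client, log_group):
    """Return the name of the stream with the most recent event, or None if the group is empty."""
    response = logs_client.describe_log_streams(
        logGroupName=log_group,
        orderBy="LastEventTime",
        descending=True,
        limit=1,
    )
    streams = response["logStreams"]
    return streams[0]["logStreamName"] if streams else None


# Usage with the CloudWatch Logs client created earlier (needs valid credentials):
# LOG_STREAM = latest_stream_name(cloudwatch, "/aws/lambda/data-parallel-init")
```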



In [23]:
event  # still bound to the last event from the loop above

{'timestamp': 1732415153157,
 'message': 'REPORT RequestId: 0f80b1cb-7fbe-4554-acc8-51db015e343b\tDuration: 5.24 ms\tBilled Duration: 6 ms\tMemory Size: 128 MB\tMax Memory Used: 81 MB\tInit Duration: 631.31 ms\t\n',
 'ingestionTime': 1732415161205}
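The `REPORT` message carries the per-invocation metrics as tab-separated `name: value unit` fields, so it can be turned into numbers with a small parser. This is a sketch under that assumption about the format; `parse_report` and the output key names are hypothetical.

```python
import re

# One "name: value unit" field, e.g. "Billed Duration: 6 ms"
FIELD = re.compile(r"(.+): ([\d.]+) (ms|MB)")


def parse_report(message):
    """Extract numeric metrics from a Lambda REPORT log line."""
    metrics = {}
    for field in message.strip().split("\t"):
        match = FIELD.fullmatch(field)
        if match:
            name, value, unit = match.groups()
            key = f"{name.lower().replace(' ', '_')}_{unit.lower()}"
            metrics[key] = float(value)
    return metrics


# REPORT message copied from the event shown above
report = (
    "REPORT RequestId: 0f80b1cb-7fbe-4554-acc8-51db015e343b\tDuration: 5.24 ms\t"
    "Billed Duration: 6 ms\tMemory Size: 128 MB\tMax Memory Used: 81 MB\t"
    "Init Duration: 631.31 ms\t\n"
)
print(parse_report(report))
```

Applying this to every `REPORT` event would give one row of metrics per invocation, which is one way to approach the metrics TODO at the end of the notebook.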

In [24]:
pd.json_normalize(r, record_path=["events"])

| | timestamp | message | ingestionTime |
|---|---|---|---|
| 0 | 1732415152518 | INIT_START Runtime Version: python:3.12.v38\tR... | 1732415161205 |
| 1 | 1732415152929 | [INFO]\t2024-11-24T02:25:52.928Z\t\tFound cred... | 1732415161205 |
| 2 | 1732415153152 | START RequestId: 0f80b1cb-7fbe-4554-acc8-51db0... | 1732415161205 |
| 3 | 1732415153154 | LAMBDA_WARNING: Unhandled exception. The most ... | 1732415161205 |
| 4 | 1732415153155 | [ERROR] KeyError: 'bucket'\nTraceback (most re... | 1732415161205 |
| 5 | 1732415153157 | END RequestId: 0f80b1cb-7fbe-4554-acc8-51db015... | 1732415161205 |
| 6 | 1732415153157 | REPORT RequestId: 0f80b1cb-7fbe-4554-acc8-51db... | 1732415161205 |
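The `timestamp` and `ingestionTime` columns are milliseconds since the Unix epoch, which is hard to read in a table. A minimal conversion sketch using only the standard library follows; `ms_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc(epoch_ms):
    """Convert integer milliseconds since the Unix epoch to an aware UTC datetime."""
    seconds, millis = divmod(epoch_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Timestamp of the [INFO] event in the table above
print(ms_to_utc(1732415152929).isoformat())
```

On a DataFrame, `pd.to_datetime(df["timestamp"], unit="ms", utc=True)` should give the same result for a whole column at once.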


In [25]:
# check all of them
# for stream in df_log_streams_raw["raw_streams"]:
#     r = cloudwatch.get_log_events(logGroupName=LOG_GROUP, logStreamName=stream)

#     # any valid events?
#     for event in r['events']:
#         print(f"Timestamp: {event['timestamp']}, Message: {event['message']}")
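Instead of looping over every stream and printing all events, the search can be pushed to CloudWatch Logs with `filter_log_events`, which scans a whole group for a pattern. The sketch below assumes a plain `ERROR` term is a useful filter for these logs; `find_error_events` is a hypothetical helper.

```python
def find_error_events(logs_client, log_group, pattern="ERROR"):
    """Return (stream name, timestamp, message) for each event in the group that matches `pattern`."""
    paginator = logs_client.get_paginator("filter_log_events")
    matches = []
    for page in paginator.paginate(logGroupName=log_group, filterPattern=pattern):
        for event in page["events"]:
            matches.append((event["logStreamName"], event["timestamp"], event["message"]))
    return matches


# Usage with the CloudWatch Logs client created earlier (needs valid credentials):
# find_error_events(cloudwatch, "/aws/lambda/data-parallel-init")
```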

In [26]:
# TODO: get out metrics from runs
#   and pipe into a dataset the team can use

# TODO: try a lambda function hello world
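For the Lambda hello world, a synchronous invocation could look like the sketch below. `invoke_json` is a hypothetical helper, and the function name and payload in the usage comment are placeholders rather than values confirmed by this project. The `KeyError: 'bucket'` in the logs above suggests the existing handler expects a `bucket` key in its event.

```python
import json


def invoke_json(lambda_client, function_name, payload):
    """Invoke a Lambda function synchronously and return (function_error, decoded_body)."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the result
        Payload=json.dumps(payload).encode("utf-8"),
    )
    body = json.loads(response["Payload"].read())
    # "FunctionError" is only present when the handler raised an error
    return response.get("FunctionError"), body


# Usage with a Lambda client from the session (needs valid credentials);
# the function name and payload are placeholders:
# error, body = invoke_json(session.client("lambda"), "hello-world", {"bucket": "team4-cosmicai"})
```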