## Reading Content from an S3 Object

Let us understand how we can read the content of an S3 object (file) using Python boto3.
* Create an S3 client using the appropriate profile.
* Get an object name (key). We can use `list_objects` to list the object keys. Each call returns up to 1,000 keys.
* Pick one of the object keys and pass it to `get_object` along with the bucket name.
* The response contains a `Body`, which is a byte stream. We can read the `Body` and decode the bytes to a string.
* We can then process the data using string manipulation functions as per our requirements.
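
Because each listing call returns at most 1,000 keys, a larger prefix needs pagination. The following is a minimal sketch, assuming the boto3 paginator for `list_objects_v2` (the newer listing call). The helper name `iter_object_keys` is hypothetical and not part of boto3; it accepts any client object, so the usage with a real client is shown only as comments.

```python
def iter_object_keys(s3_client, bucket, prefix=""):
    """Yield every object key under `prefix`, one page (up to 1,000 keys) at a time.

    `iter_object_keys` is a hypothetical helper name, not a boto3 function.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when no keys match the prefix.
        for s3_object in page.get("Contents", []):
            yield s3_object["Key"]


# Usage, assuming a configured client and the bucket used in this section:
# import boto3
# s3_client = boto3.client("s3")
# for key in iter_object_keys(s3_client, "itv-genlogs-mana00", "logs/year"):
#     print(key)
```

Yielding keys lazily avoids holding the full listing in memory when a prefix contains many objects.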

In [1]:
import boto3

In [2]:
import os
os.environ.setdefault('AWS_PROFILE', 'itvgenlogs')

'itvgenlogs'

In [3]:
s3_client = boto3.client('s3')

In [4]:
s3_client.list_objects?

Signature: s3_client.list_objects(*args, **kwargs)
Docstring:
.. note::

  This operation is not supported by directory buckets.

Returns some or all (up to 1,000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket. A 200 OK response can contain valid or invalid XML. Be sure to design your application to parse the contents of the response and handle it appropriately.

  This action has been revised. We recommend that you use the newer version, `ListObjectsV2 <https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html>`__, when developing applications. For backward compatibility, Amazon S3 continues to support ``ListObjects``.

The following operations are related to ``ListObjects``:

* `ListObjectsV2 <https://docs.aws.am

### 1. Find the Object name

In [5]:
s3_objects = s3_client.list_objects(
    Bucket='itv-genlogs-mana00',
    Prefix='logs/year' # Limit the response to keys that begin with this prefix
)

In [6]:
s3_objects

{'ResponseMetadata': {'RequestId': '0WGAKB4RF076HP6T',
  'HostId': 'k4xT8TY71bCvZQcYOPhME9jiZacJSYhKkzE+Hmb6/ITOTxZJM+9p7f/r5aOu5qr0FpMte6R3i/3CyZBxgBODSA==',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'k4xT8TY71bCvZQcYOPhME9jiZacJSYhKkzE+Hmb6/ITOTxZJM+9p7f/r5aOu5qr0FpMte6R3i/3CyZBxgBODSA==',
   'x-amz-request-id': '0WGAKB4RF076HP6T',
   'date': 'Fri, 16 Feb 2024 15:21:20 GMT',
   'x-amz-bucket-region': 'ap-southeast-1',
   'content-type': 'application/xml',
   'transfer-encoding': 'chunked',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'IsTruncated': False,
 'Marker': '',
 'Contents': [{'Key': 'logs/year=2024/month=02/day=16/gen_logs_s3-1-2024-02-16-12-51-20-2b9337a1-7e5c-4a5d-84cc-3587b8d40e07',
   'LastModified': datetime.datetime(2024, 2, 16, 12, 52, 22, tzinfo=tzutc()),
   'ETag': '"d7592fc75dbb74e210dfe695cab29f83"',
   'Size': 24450,
   'StorageClass': 'STANDARD',
   'Owner': {'DisplayName': 'laiaddara',
    'ID': '709f2485bbc57aa0687e130826e0d8c48d3beaba7e7f0

In [7]:
s3_objects['Contents']

[{'Key': 'logs/year=2024/month=02/day=16/gen_logs_s3-1-2024-02-16-12-51-20-2b9337a1-7e5c-4a5d-84cc-3587b8d40e07',
  'LastModified': datetime.datetime(2024, 2, 16, 12, 52, 22, tzinfo=tzutc()),
  'ETag': '"d7592fc75dbb74e210dfe695cab29f83"',
  'Size': 24450,
  'StorageClass': 'STANDARD',
  'Owner': {'DisplayName': 'laiaddara',
   'ID': '709f2485bbc57aa0687e130826e0d8c48d3beaba7e7f08305a5a39db5536f4f3'}},
 {'Key': 'logs/year=2024/month=02/day=16/gen_logs_s3-1-2024-02-16-12-53-21-23cefb72-3c41-4dd8-8da0-0597a1b0494d',
  'LastModified': datetime.datetime(2024, 2, 16, 12, 54, 22, tzinfo=tzutc()),
  'ETag': '"8eb5281271d1bcfc800c8893036a6a94"',
  'Size': 12509,
  'StorageClass': 'STANDARD',
  'Owner': {'DisplayName': 'laiaddara',
   'ID': '709f2485bbc57aa0687e130826e0d8c48d3beaba7e7f08305a5a39db5536f4f3'}},
 {'Key': 'logs/year=2024/month=02/day=16/gen_logs_s3-1-2024-02-16-12-54-22-18f05e40-b7a1-452c-8bf8-d059b0913ef5',
  'LastModified': datetime.datetime(2024, 2, 16, 12, 55, 23, tzinfo=tzutc(

In [8]:
len(s3_objects['Contents'])

7

In [9]:
s3_objects['Contents'][0]

{'Key': 'logs/year=2024/month=02/day=16/gen_logs_s3-1-2024-02-16-12-51-20-2b9337a1-7e5c-4a5d-84cc-3587b8d40e07',
 'LastModified': datetime.datetime(2024, 2, 16, 12, 52, 22, tzinfo=tzutc()),
 'ETag': '"d7592fc75dbb74e210dfe695cab29f83"',
 'Size': 24450,
 'StorageClass': 'STANDARD',
 'Owner': {'DisplayName': 'laiaddara',
  'ID': '709f2485bbc57aa0687e130826e0d8c48d3beaba7e7f08305a5a39db5536f4f3'}}

In [10]:
s3_objects['Contents'][0]['Key']

'logs/year=2024/month=02/day=16/gen_logs_s3-1-2024-02-16-12-51-20-2b9337a1-7e5c-4a5d-84cc-3587b8d40e07'

In [11]:
s3_object_key = s3_objects['Contents'][0]['Key']
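
Indexing `s3_objects['Contents'][0]` raises a `KeyError` when no key matches the prefix, because S3 omits `Contents` from the response in that case. A small sketch of a more defensive selection follows. `pick_latest_key` is a hypothetical helper, and choosing the most recently modified object is an assumption made for illustration, not something this walkthrough requires.

```python
import datetime


def pick_latest_key(list_response):
    """Return the key of the most recently modified object, or None if none matched.

    `list_response` is the dict returned by `list_objects`.
    `pick_latest_key` is a hypothetical helper, not part of boto3.
    """
    contents = list_response.get("Contents", [])
    if not contents:
        return None
    latest = max(contents, key=lambda s3_object: s3_object["LastModified"])
    return latest["Key"]


# Stand-in response with the same shape as the output above (keys shortened).
sample_response = {
    "Contents": [
        {"Key": "logs/a", "LastModified": datetime.datetime(2024, 2, 16, 12, 52, 22)},
        {"Key": "logs/b", "LastModified": datetime.datetime(2024, 2, 16, 12, 54, 22)},
    ]
}
print(pick_latest_key(sample_response))         # logs/b
print(pick_latest_key({"IsTruncated": False}))  # None
```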

### 2. Get Single Object content

In [12]:
s3_client.get_object?

Signature: s3_client.get_object(*args, **kwargs)
Docstring:
Retrieves an object from Amazon S3.

In the ``GetObject`` request, specify the full key name for the object.

**General purpose buckets** - Both the virtual-hosted-style requests and the path-style requests are supported. For a virtual hosted-style request example, if you have the object ``photos/2006/February/sample.jpg``, specify the object key name as ``/photos/2006/February/sample.jpg``. For a path-style request example, if you have the object ``photos/2006/February/sample.jpg`` in the bucket named ``examplebucket``, specify the object key name as ``/examplebucket/photos/2006/February/sample.jpg``. For more information about request types, see `HTTP Host Header Bucket Specification <https://docs.aws.amazon.com/AmazonS3/latest/dev/VirtualHosting.html#VirtualHostingSpecifyBuc

In [26]:
s3_object = s3_client.get_object(
    Bucket='itv-genlogs-mana00',
    Key=s3_object_key
)

In [16]:
type(s3_object)

dict

In [17]:
s3_object

{'ResponseMetadata': {'RequestId': '18DF74A6099C14E8',
  'HostId': 'Zc7O64jogDyT5G2UpD03lHMitBPl+f5+jYjTHLhw1XF253zz0BNK4ZpjA2pxALbcd+EZDMx5s3g=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'Zc7O64jogDyT5G2UpD03lHMitBPl+f5+jYjTHLhw1XF253zz0BNK4ZpjA2pxALbcd+EZDMx5s3g=',
   'x-amz-request-id': '18DF74A6099C14E8',
   'date': 'Wed, 20 Jan 2021 23:28:49 GMT',
   'last-modified': 'Tue, 19 Jan 2021 23:26:22 GMT',
   'etag': '"63414c2398f48cd7c5affe0ae3af2132"',
   'accept-ranges': 'bytes',
   'content-type': 'application/octet-stream',
   'content-length': '24460',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'AcceptRanges': 'bytes',
 'LastModified': datetime.datetime(2021, 1, 19, 23, 26, 22, tzinfo=tzutc()),
 'ContentLength': 24460,
 'ETag': '"63414c2398f48cd7c5affe0ae3af2132"',
 'ContentType': 'application/octet-stream',
 'Metadata': {},
 'Body': <botocore.response.StreamingBody at 0x117d5a690>}

In [30]:
s3_object['Body']

<botocore.response.StreamingBody at 0x7fef6ee4b820>

In [18]:
help(s3_object['Body'])

Help on StreamingBody in module botocore.response object:

class StreamingBody(io.IOBase)
 |  StreamingBody(raw_stream, content_length)
 |  
 |  Wrapper class for an http response body.
 |  
 |  This provides a few additional conveniences that do not exist
 |  in the urllib3 model:
 |  
 |      * Set the timeout on the socket (i.e read() timeouts)
 |      * Auto validation of content length, if the amount of bytes
 |        we read does not match the content length, an exception
 |        is raised.
 |  
 |  Method resolution order:
 |      StreamingBody
 |      io.IOBase
 |      _io._IOBase
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __del__(self)
 |  
 |  __enter__(self)
 |  
 |  __exit__(self, type, value, traceback)
 |  
 |  __init__(self, raw_stream, content_length)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Return an iterator to yield 1k chunks from the raw stream.
 |  
 |  __next__(self)
 |      Retu

In [31]:
s3_object['Body'].read() # read() consumes the stream, so reading the same Body again returns b''

b''

In [29]:
s3_object['Body'].read().decode('utf-8')

''
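
The empty results above are expected: `Body` is a stream, and once its contents have been read nothing is left to return. The execution counts suggest the body had already been read in an earlier cell. The sketch below uses `io.BytesIO` as a stand-in for `StreamingBody` (an assumption made so that it runs without AWS access) to show the same behaviour and the usual fix of keeping the bytes in a variable.

```python
import io

# io.BytesIO stands in for s3_object['Body']; like a StreamingBody,
# it returns nothing once its contents have been read.
body = io.BytesIO(b"line one\nline two\n")

raw_bytes = body.read()    # first read() returns the whole payload
second_read = body.read()  # the stream is now exhausted

# Decode the saved bytes rather than reading the stream again.
file_contents = raw_bytes.decode("utf-8")

print(second_read)                 # b''
print(file_contents.splitlines())  # ['line one', 'line two']
```

To read the same object again from S3, call `get_object` again to obtain a fresh `Body`.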

### 3. Final: Put everything together

In [23]:
import boto3

import os
os.environ.setdefault('AWS_PROFILE', 'itvgenlogs')

s3_client = boto3.client('s3')

s3_objects = s3_client.list_objects(
    Bucket='itv-genlogs-mana00',
    Prefix='logs/year'
)

s3_object_key = s3_objects['Contents'][0]['Key']
s3_object = s3_client.get_object(
    Bucket='itv-genlogs-mana00',
    Key=s3_object_key
)

file_contents = s3_object['Body'].read().decode('utf-8')

In [24]:
file_records = file_contents.splitlines()

In [25]:
file_records[:3]

['165.220.71.68 - - [16/Feb/2024:12:50:42 -0800] "GET /department/team%20sports/categories HTTP/1.1" 200 1702 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"',
 '45.171.150.225 - - [16/Feb/2024:12:50:30 -0800] "GET /department/team%20sports/categories HTTP/1.1" 200 2100 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.77.4 (KHTML, like Gecko) Version/7.0.5 Safari/537.77.4"',
 '115.93.215.91 - - [16/Feb/2024:12:50:25 -0800] "GET /categories/indoor/outdoor%20games/products HTTP/1.1" 200 750 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"']
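
Each record above looks like a web server access log line, so the fields can be pulled apart with a regular expression. The following is a minimal sketch, assuming every record follows the layout of the three samples shown. `LOG_PATTERN` and `parse_log_record` are hypothetical names introduced here for illustration.

```python
import re
from urllib.parse import unquote

# Pattern written against the sample records above; other layouts may not match.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_record(record):
    """Split one log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(record)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["path"] = unquote(fields["path"])  # turn %20 back into a space
    return fields


# First record from the output above.
sample_record = (
    '165.220.71.68 - - [16/Feb/2024:12:50:42 -0800] '
    '"GET /department/team%20sports/categories HTTP/1.1" 200 1702 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"'
)
parsed = parse_log_record(sample_record)
print(parsed["ip"], parsed["status"], parsed["path"])
# 165.220.71.68 200 /department/team sports/categories
```

With `file_records` from the cell above, `[parse_log_record(r) for r in file_records]` would give one dict per line, with `None` for any line that does not fit the pattern.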