# 1) Create an S3 Bucket

Use boto3, the AWS software development kit (SDK) for Python. It lets you create, update, and delete AWS resources directly from a Python environment, so Python applications, libraries, and scripts can integrate with AWS services such as Amazon S3, Amazon DynamoDB, and Amazon EC2.

In [8]:
# Import the package
import boto3
import time

# Assign the resource we want to use - in this case "s3"
s3 = boto3.resource('s3')

# Print the names of all buckets available in this AWS account.
for bucket in s3.buckets.all():
    print(bucket.name)

sagemaker-studio-396913707640-axxlnw5zdgi
sagemaker-us-east-1-396913707640
stockpricetest-1736740588


In [18]:
# Helper functions to automate tasks (still to be moved into a module).
def create_s3_bucket(base_name):
    """
    Creates an S3 bucket with a unique name by appending a timestamp.

    Parameters:
        base_name (str): Base name for the bucket.

    Returns:
        str or None: The name of the created bucket, or None if creation failed.
    """
    # Generate a unique bucket name
    bucket_name = f"{base_name}-{int(time.time())}"
    
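    # Create the S3 resource (the high-level boto3 interface)
    s3 = boto3.resource('s3')
    
    try:
        # Create the bucket. Outside us-east-1, AWS also requires a
        # CreateBucketConfiguration with a LocationConstraint.
        s3.create_bucket(Bucket=bucket_name)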
        print(f"S3 bucket '{bucket_name}' has been created successfully.")
        return bucket_name
    except Exception as e:
        print(f"S3 error: {e}")
        return None

In [7]:
# Give the bucket a unique name: bucket names are global, so no two users can own the same name.
# Note: a bucket name must be between 3 and 63 characters long, must start with a lowercase letter or a number, and cannot contain underscores or uppercase letters.
# More info here https://docs.aws.amazon.com/workdocs/latest/userguide/client-name-files.html 
buck_name = 'stockpricetest'
create_s3_bucket( base_name = buck_name )



S3 bucket has been created successfully
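The naming constraints mentioned in the comments can be checked locally before calling AWS. The following is a minimal sketch, not part of the original notebook: `is_valid_bucket_name` and `timestamped_name` are hypothetical helpers, and the regular expression covers only a subset of the S3 rules (length, allowed characters, first and last character).

```python
import re
import time

# Subset of the S3 naming rules: 3-63 characters; lowercase letters, digits,
# dots and hyphens only; first and last character must be a letter or digit.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name):
    """Hypothetical helper: return True if `name` passes the basic checks."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


def timestamped_name(base_name, now=None):
    """Build the same '<base>-<unix timestamp>' name as create_s3_bucket."""
    timestamp = int(time.time() if now is None else now)
    return f"{base_name}-{timestamp}"


# The fixed timestamp reproduces a bucket name listed earlier in the notebook.
candidate = timestamped_name("stockpricetest", now=1736740588)
print(candidate, is_valid_bucket_name(candidate))
print("Stock_Price", is_valid_bucket_name("Stock_Price"))
```

Running the check on the timestamped name rather than on the base name matters because the suffix adds 11 characters, which counts toward the 63-character limit.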


# 2) Data download
Download the data with the yfinance package. It is a third-party package, separate from boto3, so install it first if it is not already available in your environment.

In [44]:
#!pip install yfinance

import pandas as pd
from datetime import datetime
import yfinance as yf

# Initialise parameters
start_date = datetime(2019, 1, 1)
end_date = datetime(2021, 1, 1)

# Get the data
df_data = yf.download('AAPL', start=start_date, end=end_date)
df_data.reset_index(inplace=True)

df_data

[*********************100%***********************]  1 of 1 completed


Price,Date,Close,High,Low,Open,Volume
Ticker,Unnamed: 1_level_1,AAPL,AAPL,AAPL,AAPL,AAPL
0,2019-01-02,37.708595,37.930665,36.827486,36.985083,148158800
1,2019-01-03,33.952534,34.795437,33.907164,34.379953,365248800
2,2019-01-04,35.401951,35.471200,34.336981,34.511292,234428400
3,2019-01-07,35.323162,35.538069,34.838433,35.507026,219111200
4,2019-01-08,35.996529,36.252028,35.464044,35.712376,164101200
...,...,...,...,...,...,...
500,2020-12-24,129.047501,130.504510,128.196772,128.411901,54930100
501,2020-12-28,133.663025,134.298625,130.553438,131.022819,124486200
502,2020-12-29,131.883270,135.716459,131.365008,134.992856,121047300
503,2020-12-30,130.758774,132.978509,130.445853,132.577585,96452100


# 3) Extract, Load & Transform
Carry out data cleaning.

The columns of the data form a MultiIndex with two levels, Price and Ticker: each column has a main label (for example Close) and a sub-label (the ticker, AAPL). Run "df_data.columns" to see the structure of the MultiIndex.
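To see what flattening does to these labels without touching the DataFrame, here is a small sketch on plain tuples, which is the form the labels of a two-level MultiIndex take when iterated. The tuple values are copied from the printed `df_data.columns` output; the `flat_clean` variant is an assumption added for illustration and is not what the notebook uses.

```python
# Column labels as (Price, Ticker) tuples, copied from the notebook output.
columns = [
    ("Date", ""),
    ("Close", "AAPL"),
    ("High", "AAPL"),
    ("Low", "AAPL"),
    ("Open", "AAPL"),
    ("Volume", "AAPL"),
]

# Join the two levels with an underscore, as the notebook does.
# str.strip() only removes whitespace, so ("Date", "") becomes "Date_".
flat = ["_".join(col).strip() for col in columns]

# Variant (an assumption, not the notebook's code): also strip the trailing
# underscore left behind when the second level is empty.
flat_clean = ["_".join(col).strip("_") for col in columns]

print(flat)
print(flat_clean)
```

The notebook keeps the first form, which is why the date column is later referred to as `Date_`.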



In [45]:
print(df_data.columns)

# Flatten the MultiIndex to work with a single-level column index.
df_data.columns = ['_'.join(col).strip() for col in df_data.columns.values]
print(df_data.columns)

# Drop columns that are not needed for this tutorial (here, the date)
df_data.drop(columns=['Date_'], inplace=True)
print(df_data.columns)

# Extract the features: every row except the last one
df_data_features = df_data.iloc[:-1, :]
print(df_data_features)

# Take the open price column (position 3 once the date is dropped) as our target,
# starting from the second row so that each feature row lines up with the next row's open price
df_data_targets = df_data.iloc[1:, 3].rename("Targets")
print( df_data_targets )

MultiIndex([(  'Date',     ''),
            ( 'Close', 'AAPL'),
            (  'High', 'AAPL'),
            (   'Low', 'AAPL'),
            (  'Open', 'AAPL'),
            ('Volume', 'AAPL')],
           names=['Price', 'Ticker'])
Index(['Date_', 'Close_AAPL', 'High_AAPL', 'Low_AAPL', 'Open_AAPL',
       'Volume_AAPL'],
      dtype='object')
Index(['Close_AAPL', 'High_AAPL', 'Low_AAPL', 'Open_AAPL', 'Volume_AAPL'], dtype='object')
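The feature and target slices are offset by one row, so each feature row is paired with the open price of the following trading day. A minimal sketch of that alignment on plain lists, using the first five open prices from the table in section 2 (the variable names here are illustrative, not from the notebook):

```python
# Dates and open prices for the first five rows, copied from the table above.
dates = ["2019-01-02", "2019-01-03", "2019-01-04", "2019-01-07", "2019-01-08"]
opens = [36.985083, 34.379953, 34.511292, 35.507026, 35.712376]

# Same slicing as the notebook: features drop the last row,
# targets drop the first row.
feature_rows = list(zip(dates, opens))[:-1]
targets = opens[1:]

# Pair by position: feature row i goes with target i.
pairs = list(zip(feature_rows, targets))
for (day, open_price), next_open in pairs:
    print(f"{day}: open={open_price} -> target (next open)={next_open}")
```

In pandas the two slices keep their original index labels (features start at 0, targets at 1), so they should be paired by position, for example after resetting the index of the targets, rather than combined with a label-aligned operation.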
