# Initial Data Acquisition

**Author:** Alan Meeson <alan.meeson@capgemini.com>

**Date:** 2023-02-06

This notebook retrieves the data from the BioNTech S3 bucket and stores it to the local disk for processing and analysis.

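Note: the environment variables (`AWS_BUCKET_NAME`, `AWS_ACCESS_KEY`, `AWS_ACCESS_SECRET` and `AWS_HOST`) should be specified in a `.env` file in the parent directory of this notebook (`../.env`), one variable per line, in the format `AWS_BUCKET_NAME=<insert value here>`.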

In [None]:
import dotenv
import boto3
import os

In [None]:
# Specify the location to place the raw data
data_dir = '../data/raw'

In [None]:
# Load the bucket name and credentials from the .env file into the environment
dotenv.load_dotenv('../.env')

input_bucket_name = os.getenv('AWS_BUCKET_NAME')
access_key = os.getenv('AWS_ACCESS_KEY')
access_secret = os.getenv('AWS_ACCESS_SECRET')
s3_host = os.getenv('AWS_HOST')
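`os.getenv` returns `None` for any variable that is not set, so a missing entry in the `.env` file would only surface later as an authentication or bucket error. The following is a minimal sketch of an early check, assuming the three variables used by the cells below are the required ones; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names, not part of the original notebook.

```python
import os

# Variable names taken from the cell above; treating these three as required is an assumption.
REQUIRED_VARS = ("AWS_BUCKET_NAME", "AWS_ACCESS_KEY", "AWS_ACCESS_SECRET")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Report any missing variables before attempting to connect to S3.
missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

The check only prints a message, so the notebook keeps running; raising an exception instead may be preferable when the notebook is executed unattended.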

## List the available data and sizes

In [None]:
session = boto3.Session(
    aws_access_key_id=access_key,
    aws_secret_access_key=access_secret
)
s3 = session.resource('s3')

In [None]:
input_bucket = s3.Bucket(input_bucket_name)
# List every object in the bucket with its size in mebibytes (1 MiB = 1024**2 bytes)
for obj in input_bucket.objects.all():
    print("%s - %d MiB" % (obj.key, obj.size / 1024 ** 2))
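The listing above reports every size in a single unit, which truncates small files to 0. As an illustration, the sketch below picks a binary unit per file; `format_size` is a hypothetical helper and not part of the original notebook.

```python
def format_size(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024  # move up to the next larger unit
    return f"{size:.1f} TiB"


# Example values only; in the notebook the argument would be each object's size.
for example_size in (512, 1536, 5 * 1024 ** 2):
    print(format_size(example_size))
```

In the listing loop, each object's `size` attribute could be passed to `format_size` in place of the fixed division.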

## Download

In [None]:
# Create the output folder if it does not yet exist
os.makedirs(data_dir, exist_ok=True)

In [None]:
# Download the Covid variants dataset into data_dir
input_bucket.download_file('data/Data1_Covid Variants evolution.csv', os.path.join(data_dir, 'Data1_Covid Variants evolution.csv'))

In [None]:
# Download the country indicators dataset into data_dir
input_bucket.download_file('data/Data2_Large set of country indicators.xlsx', os.path.join(data_dir, 'Data2_Large set of country indicators.xlsx'))
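The two downloads above name each file explicitly. If more files were added under the same `data/` prefix, a loop over the bucket listing could fetch them all. This is a sketch under that assumption; `local_path_for` and `download_prefix` are hypothetical helpers, and keeping only the file name assumes no two keys share one.

```python
import os


def local_path_for(key, data_dir):
    """Map an S3 key to a file directly inside data_dir, keeping only the file name."""
    return os.path.join(data_dir, os.path.basename(key))


def download_prefix(bucket, prefix, data_dir):
    """Download every object under `prefix` into data_dir and return the local paths."""
    os.makedirs(data_dir, exist_ok=True)
    downloaded = []
    for obj in bucket.objects.filter(Prefix=prefix):
        if obj.key.endswith("/"):
            continue  # skip zero-byte "folder" placeholder keys
        target = local_path_for(obj.key, data_dir)
        bucket.download_file(obj.key, target)
        downloaded.append(target)
    return downloaded
```

With the notebook's variables this would be called as `download_prefix(input_bucket, 'data/', data_dir)`.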