## S3 Data Pipeline

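Read a CSV from S3 with Boto3 and Pandas, drop rows with missing values, convert the result to Parquet with PyArrow, and upload it back to S3.  
The CSV is expected to already exist at the configured key; the cells below do not catch S3 or parsing errors themselves.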

**Stack:** AWS S3, Boto3, Pandas, PyArrow

In [1]:
import boto3
import pandas as pd
import io
import pyarrow as pa
import pyarrow.parquet as pq

In [2]:
# --------- SETTINGS ---------
bucket_name = "carol-s3-demo"
input_key = "data/sample_input.csv"
output_key = "data/output.parquet"
region = "us-east-1"

In [3]:
# --------- CONNECT TO S3 ---------
s3 = boto3.client('s3', region_name=region)

In [5]:
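# --------- READ CSV FROM S3 ---------
# get_object returns a streaming body; read() pulls the whole object into memory
response = s3.get_object(Bucket=bucket_name, Key=input_key)
csv_data = response['Body'].read()
# Wrap the raw bytes in a file-like object so Pandas can parse them
df = pd.read_csv(io.BytesIO(csv_data))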

print("✅ Original data loaded:")
print(df.head())

✅ Original data loaded:
   id     name   age              email
0   1    Alice  25.0  alice@example.com
1   2      Bob  30.0    bob@example.com
2   3  Charlie  35.0                NaN
3   4    David   NaN  david@example.com
4   5      NaN  40.0    eve@example.com


In [6]:
# --------- TRANSFORM DATA ---------
# Drop every row that has at least one missing value in any column
df_cleaned = df.dropna()

print("✅ Cleaned data:")
print(df_cleaned.head())

✅ Cleaned data:
   id   name   age              email
0   1  Alice  25.0  alice@example.com
1   2    Bob  30.0    bob@example.com
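
Calling `dropna()` with no arguments removes three of the five rows displayed earlier, because each of them is missing exactly one field. A less aggressive variant might look like the sketch below. It rebuilds the five displayed rows by hand, and it assumes that only `email` is mandatory and that a placeholder name is acceptable; both rules are illustrative, not taken from the notebook.

```python
import pandas as pd

# The five rows printed by df.head() in the read cell, rebuilt by hand
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "David", None],
    "age": [25.0, 30.0, 35.0, None, 40.0],
    "email": ["alice@example.com", "bob@example.com", None,
              "david@example.com", "eve@example.com"],
})

# Count missing values per column before deciding what to drop
print(df.isna().sum())

# Assumption: only `email` is required, so only rows missing it are dropped
df_partial = df.dropna(subset=["email"])

# Assumption: a missing name can be replaced by a placeholder; age is left as NaN
df_partial = df_partial.fillna({"name": "unknown"})

print(len(df.dropna()))   # 2 rows survive a blanket dropna()
print(len(df_partial))    # 4 rows survive the targeted version
```

Which columns count as required depends on how the Parquet file is used downstream, so the `subset` list is the part to adapt.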


In [7]:
# --------- CONVERT TO PARQUET ---------
# Convert the DataFrame to an Arrow table, then serialize it into an in-memory buffer
table = pa.Table.from_pandas(df_cleaned)
parquet_buffer = io.BytesIO()
pq.write_table(table, parquet_buffer)

In [8]:
# --------- UPLOAD PARQUET TO S3 ---------
s3.put_object(Bucket=bucket_name, Key=output_key, Body=parquet_buffer.getvalue())
print(f"✅ Parquet file uploaded to s3://{bucket_name}/{output_key}")

✅ Parquet file uploaded to s3://carol-s3-demo/data/output.parquet
