# 🚀 **Data Extraction Notebook**

## 🔍 Overview
This notebook extracts raw data from an AWS S3 bucket using boto3, authenticating with credentials configured through the AWS CLI. The dataset is then loaded into a Pandas DataFrame for analysis, in preparation for transformation and structured storage in a relational database.

## 🎯 Key Objectives
- ✅ AWS Integration: Authenticate with AWS CLI credentials so that no access keys are hardcoded in the notebook.
- ✅ Automated Data Retrieval: Fetch the CSV dataset from an S3 bucket.
- ✅ Data Ingestion & Validation: Load the dataset into a Pandas DataFrame and inspect the first rows to verify that it loaded correctly.



### 1️⃣ Connecting to AWS S3
We use Boto3, the AWS SDK for Python, to connect to S3. Boto3 reads the credentials configured through the AWS CLI (for example with `aws configure`), so no access keys are hardcoded in the notebook.

#### 🔹 Steps:
- Let boto3 resolve the AWS CLI credentials (a default session is created implicitly).
- Create an S3 client to interact with the bucket.
- List the available buckets to confirm access.

In [24]:
import boto3

# No need to manually specify credentials
s3 = boto3.client("s3")

# List buckets to confirm access
buckets = s3.list_buckets()

# Print bucket names
for bucket in buckets["Buckets"]:
    print(bucket["Name"])



data-analytics-001
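
The cell above confirms access at the bucket level. To also confirm which objects a bucket contains, a minimal sketch along these lines could be used. The helper name `list_object_keys` and its `prefix` parameter are hypothetical and not part of the original notebook; the function accepts any client created with `boto3.client("s3")`.

```python
def list_object_keys(s3_client, bucket, prefix=""):
    """Return the keys of all objects in `bucket` that start with `prefix`.

    Hypothetical helper: `s3_client` is assumed to be a boto3 S3 client.
    A paginator is used because a single list_objects_v2 call returns
    only one page of results.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty bucket or prefix carry no "Contents" entry.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

With the client from the cell above, `list_object_keys(s3, "data-analytics-001")` would return the object keys in the bucket shown in the output, assuming the credentials allow listing it.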


### 2️⃣ Extracting Data from S3
We retrieve the CSV file from the specified S3 bucket and load it into a **Pandas DataFrame** for analysis.

In [27]:
import pandas as pd
import io

try:
    # bucket_name and key must already be defined in an earlier cell
    obj = s3.get_object(Bucket=bucket_name, Key=key)
    df = pd.read_csv(io.BytesIO(obj['Body'].read()))

    # Now 'df' is your Pandas DataFrame containing the data from the CSV
    print(df.head())  # Print the first few rows to verify

except Exception as e:
    print(f"Error: {e}")

   item_id                  description  quantity_on_hand  cost purchase_date  \
0     1000  Bennet Farm free-range eggs                29  2.35      2/1/2022   
1     1000  Bennet Farm free-range eggs                27   NaN           NaN   
2     2000                  Ruby's Kale                 3   NaN           NaN   
3     1100        Freshness White beans                13   NaN           NaN   
4     1100        Freshness White beans                53  0.69      2/2/2022   

                                              vendor  price date_sold  \
0          Bennet Farms, Rt. 17 Evansville, IL 55446    NaN       NaN   
1                                                NaN   5.49  2/2/2022   
2                                                NaN   3.99  2/2/2022   
3                                                NaN   1.49  2/2/2022   
4  Freshness, Inc., 202 E. Maple St., St. Joseph,...    NaN       NaN   

       cust quantity_sold item_type   location          unit  
0       NaN
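
The extraction cell depends on `bucket_name` and `key` being defined elsewhere and prints any error instead of raising it. One possible alternative, shown here only as a sketch, wraps the same two calls in a function so that the bucket and key are explicit arguments and failures propagate to the caller. The name `load_csv_from_s3` and the pass-through `read_csv_kwargs` are assumptions made for illustration.

```python
import io

import pandas as pd


def load_csv_from_s3(s3_client, bucket, key, **read_csv_kwargs):
    """Download one S3 object and parse it as CSV into a DataFrame.

    Hypothetical helper: `s3_client` is assumed to be a boto3 S3 client.
    Errors are not caught here, so a missing key or denied access
    surfaces as an exception rather than a printed message.
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # response["Body"] is a stream; read() returns the object's raw bytes.
    raw_bytes = response["Body"].read()
    return pd.read_csv(io.BytesIO(raw_bytes), **read_csv_kwargs)
```

Letting the exception propagate means later cells do not run against a DataFrame that was never assigned.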

In [31]:
df.head()

| | item_id | description | quantity_on_hand | cost | purchase_date | vendor | price | date_sold | cust | quantity_sold | item_type | location | unit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | Bennet Farm free-range eggs | 29 | 2.35 | 2/1/2022 | Bennet Farms, Rt. 17 Evansville, IL 55446 | NaN | NaN | NaN | NaN | Dairy | D12 | dozen |
| 1 | 1000 | Bennet Farm free-range eggs | 27 | NaN | NaN | NaN | 5.49 | 2/2/2022 | 198765.0 | 2 | Dairy | D12 | dozen |
| 2 | 2000 | Ruby's Kale | 3 | NaN | NaN | NaN | 3.99 | 2/2/2022 | NaN | 2 | Produce | p12 | bunch |
| 3 | 1100 | Freshness White beans | 13 | NaN | NaN | NaN | 1.49 | 2/2/2022 | 202900.0 | 2 | Canned | a2 | 12 ounce can |
| 4 | 1100 | Freshness White beans | 53 | 0.69 | 2/2/2022 | Freshness, Inc., 202 E. Maple St., St. Joseph,... | NaN | NaN | NaN | NaN | Canned | a2 | 12 oz can |
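
Beyond viewing the first rows, a few preliminary integrity checks can be collected in one place. The sketch below is one way to do that and is not part of the original notebook; the function name `summarize_raw_frame` and the choice of checks are assumptions.

```python
import pandas as pd


def summarize_raw_frame(df, required_columns):
    """Collect basic integrity facts about a freshly loaded DataFrame.

    Hypothetical helper: reports the frame's size, any expected columns
    that are absent, per-column null counts, and fully duplicated rows.
    """
    missing_columns = [c for c in required_columns if c not in df.columns]
    null_counts = {col: int(n) for col, n in df.isna().sum().items()}
    return {
        "rows": len(df),
        "columns": len(df.columns),
        "missing_columns": missing_columns,
        "null_counts": null_counts,
        "duplicate_rows": int(df.duplicated().sum()),
    }
```

For the dataset above, a call such as `summarize_raw_frame(df, ["item_id", "description", "unit"])` would show how many values are missing in columns like `cost` and `price`, which are only partly populated in the preview.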



### 3️⃣ Storing Data for Further Processing
To pass this DataFrame to subsequent notebooks, we use the **`%store` magic command**.


In [35]:
%store df

Stored 'df' (DataFrame)


This allows the DataFrame to be restored in the **Data Transformation** notebook with `%store -r df`, without re-running this extraction.


In [72]:

# Save the DataFrame as "raw_data.csv"
df.to_csv('raw_data.csv', index=False)


This creates a CSV file, `raw_data.csv`, containing the raw data.
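
The `index=False` argument matters when the file is read back later. The sketch below uses a small made-up DataFrame and in-memory buffers (both assumptions, used so that nothing is written to disk) to show the difference between writing with and without the row index.

```python
import io

import pandas as pd

# Small illustrative frame; the values are placeholders, not the real dataset.
sample = pd.DataFrame({"item_id": [1000, 2000], "unit": ["dozen", "bunch"]})

# With index=False only the data columns are written.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# With the default index=True the row index is written as an unnamed
# first column, which read_csv loads back as "Unnamed: 0".
buffer_with_index = io.StringIO()
sample.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```

Writing without the index keeps the saved file's columns identical to the DataFrame's, so a downstream notebook does not need to drop an extra column.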

## Summary
✅ Successfully connected to AWS S3.  
✅ Retrieved the dataset from the S3 bucket.  
✅ Loaded the data into a Pandas DataFrame.  
✅ Stored the DataFrame with `%store` and saved it as `raw_data.csv` for later use in data transformation.

Next, proceed to the **Data Transformation** notebook, where we clean and preprocess the extracted data.
