# Data Ingestion

In this lesson, you will learn how to ingest data into SageMaker from various sources, including Amazon S3, public datasets, and custom data sources. Understanding how to effectively bring data into SageMaker is the first step in the data preparation process.

## Learning Objectives
- Identify various data sources compatible with SageMaker.
- Ingest data into SageMaker from Amazon S3.
- Understand different data formats and their implications.
- Explore public datasets available for machine learning.
- Learn about custom data sources and ingestion methods.

## Why This Matters

Every machine learning project starts with data you can locate and read. If ingestion goes wrong (a missing file, an unexpected format, a permissions error), every later preparation and training step inherits the problem.

### Data Sources

Data sources are the origins from which data is obtained for machine learning models. In SageMaker, common data sources include Amazon S3, public datasets, and custom APIs.

```python
# Example of accessing data from Amazon S3
import boto3

# Create an S3 client (credentials come from the notebook's IAM role or your local AWS configuration)
s3 = boto3.client('s3')

# List all buckets visible to these credentials
buckets = s3.list_buckets()
for bucket in buckets['Buckets']:
    print(bucket['Name'])
```
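
The listing above covers S3 only. For the third source named in this section, a custom API, a minimal sketch might look like the following. It assumes the API returns a JSON array of flat objects. The function names `fetch_json` and `records_to_rows` are hypothetical helpers written for this illustration and are not part of SageMaker or boto3.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download and parse a JSON document from a (hypothetical) custom API endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def records_to_rows(records):
    """Turn a list of dicts into a column list plus rows of values.

    Keys missing from a record become None, so every row has the same width.
    """
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    rows = [[record.get(column) for column in columns] for record in records]
    return columns, rows


# Offline demonstration with a payload shaped like an API response.
sample_payload = '[{"id": 1, "label": "cat"}, {"id": 2, "score": 0.9}]'
columns, rows = records_to_rows(json.loads(sample_payload))
print(columns)
print(rows)
```

In a notebook, `records_to_rows(fetch_json(url))` would combine the two steps, with `url` pointing at your own endpoint.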

#### Micro-Exercise 1: Identify Data Sources
List at least three different data sources that can be used in SageMaker.

```python
# Example data sources
# 1. Amazon S3
# 2. Public datasets
# 3. Custom APIs
```

### Data Formats

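Data formats define how data is laid out in storage. Common formats include CSV, JSON, and Parquet. CSV is plain text and widely supported but carries no type information. JSON can represent nested records, typically at the cost of larger files. Parquet is a compressed, columnar binary format that is usually faster to scan for analytics but requires a library such as pyarrow to read. These differences affect both processing speed and compatibility.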

```python
# Example of reading a CSV file from S3 with pandas
import pandas as pd

# Load a CSV file from S3 (pandas needs the s3fs package installed to read s3:// paths)
csv_file_path = 's3://my-bucket/my-data.csv'
data = pd.read_csv(csv_file_path)
print(data.head())
```
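
To make the differences between formats concrete, the sketch below round-trips the same records through CSV and JSON using only the standard library. The sample records and the `|` separator for the nested list are assumptions chosen for illustration. Parquet is left out because it needs an extra library.

```python
import csv
import io
import json

records = [
    {"id": 1, "price": 9.5, "tags": ["new", "sale"]},
    {"id": 2, "price": 12.0, "tags": []},
]

# CSV: flat text. Nested values must be serialized by hand, and every
# value comes back as a string when the data is read again.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "price", "tags"])
writer.writeheader()
for record in records:
    writer.writerow({**record, "tags": "|".join(record["tags"])})
csv_rows = list(csv.DictReader(io.StringIO(csv_buffer.getvalue())))

# JSON: numbers and nested lists survive the round trip unchanged.
json_rows = json.loads(json.dumps(records))

print(type(csv_rows[0]["price"]).__name__)   # str
print(type(json_rows[0]["price"]).__name__)  # float
print(json_rows[0]["tags"])
```

The practical consequence is that data loaded from CSV usually needs explicit type conversion before training, while JSON keeps basic types and nesting intact.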

#### Micro-Exercise 2: Describe Data Ingestion Steps
Describe the steps to ingest data from an S3 bucket.

```python
# Steps to ingest data
# 1. Create an S3 bucket
# 2. Upload data
# 3. Use SageMaker to access the bucket
```

## Examples

### Example 1: Ingesting Data from S3
This example creates an S3 bucket and uploads a dataset to it so that the data can be read from SageMaker.

```bash
# Create an S3 bucket (bucket names are globally unique, so replace my-bucket with a name of your own)
aws s3 mb s3://my-bucket

# Upload data
aws s3 cp local-file.csv s3://my-bucket/
```
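
Once the file is uploaded, reading it back from a SageMaker notebook could be sketched as follows. The helpers `split_s3_uri` and `read_csv_rows` are hypothetical names introduced here. They take any S3 client object, which in practice would be `boto3.client('s3')`, and rely only on its `get_object` call.

```python
import csv
import io
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_csv_rows(s3_client, uri, encoding="utf-8"):
    """Fetch a CSV object through an S3 client and return its rows as dicts."""
    bucket, key = split_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # The response body is a stream; read it fully and decode it to text.
    text = response["Body"].read().decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))
```

With the upload command above, the object key is `local-file.csv`, so a call such as `read_csv_rows(boto3.client('s3'), 's3://my-bucket/local-file.csv')` would return the rows, provided the notebook's IAM role is allowed to read the bucket.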

### Example 2: Using Public Datasets
This example shows how to access a public dataset from a repository and ingest it into SageMaker.

```python
# Access a public dataset
import pandas as pd

# Placeholder URL: replace it with the address of a real public CSV file
dataset_url = 'https://example.com/public-dataset.csv'
dataset = pd.read_csv(dataset_url)
print(dataset.head())
```

## Main Exercise
In this exercise, you will create an S3 bucket, upload a dataset to it, and load a public dataset into SageMaker. You will then verify that both datasets were ingested successfully.

### Steps:
1. Create an S3 bucket:
   ```bash
   aws s3 mb s3://my-bucket
   ```
2. Upload your local dataset:
   ```bash
   aws s3 cp local-file.csv s3://my-bucket/
   ```
3. Access a public dataset:
   ```python
   import pandas as pd
   # Placeholder URL: replace it with the address of a real public CSV file
   dataset_url = 'https://example.com/public-dataset.csv'
   public_data = pd.read_csv(dataset_url)
   print(public_data.head())
   ```
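
The steps above stop short of the verification the exercise asks for. One way to check an ingested CSV, sketched here with the standard library, is to confirm that the expected columns are present and that at least one data row arrived. The helper name `check_csv_text` and the sample column names are assumptions for illustration.

```python
import csv
import io


def check_csv_text(text, required_columns, min_rows=1):
    """Return a list of problems found in CSV text; an empty list means it passed."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = []
    missing = [column for column in required_columns if column not in header]
    if missing:
        problems.append(f"missing columns: {missing}")
    row_count = sum(1 for _ in reader)
    if row_count < min_rows:
        problems.append(f"expected at least {min_rows} rows, found {row_count}")
    return problems


good = "id,label\n1,cat\n2,dog\n"
bad = "id\n"
print(check_csv_text(good, ["id", "label"]))
print(check_csv_text(bad, ["id", "label"]))
```

With pandas, a comparable check would inspect `public_data.shape` and `public_data.columns` after loading.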

### Expected Outcomes:
- The dataset from S3 is successfully ingested into SageMaker.
- The public dataset is successfully accessed and ingested into SageMaker.

## Common Mistakes
- Not verifying data integrity (for example, row counts, expected columns, or a checksum) before ingestion.
- Ignoring data format compatibility, which can lead to parsing errors.
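
As a hedged illustration of the first point, one common integrity check compares a checksum of the local file with a checksum of the bytes read back after upload. The sketch below uses SHA-256 from the standard library. The helper name `sha256_of_file` and the sample payload are assumptions, and the "received" bytes stand in for data that would be downloaded from S3.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration: write a small file, then compare its digest with the digest
# of the same bytes as they might be read back from storage.
payload = b"id,label\n1,cat\n"
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(payload)
    local_path = handle.name

local_digest = sha256_of_file(local_path)
received_digest = hashlib.sha256(payload).hexdigest()  # stand-in for downloaded bytes
os.remove(local_path)

print(local_digest == received_digest)  # True
```

A mismatch between the two digests would indicate that the file changed or was truncated somewhere between the source and the notebook.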

## Recap
In this lesson, you learned about various data sources and formats compatible with SageMaker. You practiced ingesting data from Amazon S3 and public datasets. In the next lesson, we will explore data preprocessing techniques to prepare the ingested data for model training.

```python
# Additional code cell for practice
# Example of listing objects in an S3 bucket
import boto3

# Create an S3 client
s3 = boto3.client('s3')

# Specify the bucket name
bucket_name = 'my-bucket'

# List objects in the specified S3 bucket (a single call returns at most 1,000 keys)
objects = s3.list_objects_v2(Bucket=bucket_name)
if 'Contents' in objects:
    for obj in objects['Contents']:
        print(obj['Key'])
else:
    print('No objects found in the bucket.')
```
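
A single `list_objects_v2` call returns at most 1,000 keys, so larger buckets need pagination. A minimal sketch using the boto3 paginator follows. The helper name `iter_object_keys` is hypothetical, and the function accepts any client object that provides `get_paginator`, such as `boto3.client('s3')`.

```python
def iter_object_keys(s3_client, bucket_name, prefix=None):
    """Yield every object key in a bucket, following pagination page by page."""
    paginator = s3_client.get_paginator("list_objects_v2")
    params = {"Bucket": bucket_name}
    if prefix:
        # Restrict the listing to keys that start with the given prefix.
        params["Prefix"] = prefix
    for page in paginator.paginate(**params):
        # Pages for an empty bucket or prefix have no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

In a notebook, `for key in iter_object_keys(boto3.client('s3'), 'my-bucket'): print(key)` would print every key regardless of how many objects the bucket holds.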