# **Explore the Use Case and Analyze the Datasets**

## **Introduction**

![](2023-12-28-12-04-36.png)

![](2023-12-28-12-27-43.png)

![](2023-12-28-12-28-15.png)

![](2023-12-28-12-29-10.png)

![](2023-12-28-12-29-52.png)

**Scaling Out:** Distributed model training in parallel across various instances

![](2023-12-28-12-32-20.png)

**Note:** Looking at the toolbox for this step, you will use Amazon Simple Storage Service (Amazon S3) and Amazon Athena to ingest, store, and query your data. With AWS Glue, you will catalog the data and its schema. For statistical bias detection in data, you will learn how to work with Amazon SageMaker Data Wrangler and Amazon SageMaker.

![](2023-12-28-12-35-09.png)

![](2023-12-28-12-41-50.png)

![](2023-12-28-12-42-37.png)

![](2023-12-28-12-46-31.png)

![](2023-12-28-12-48-13.png)

![](2023-12-28-12-49-00.png)

**Note:** Multi-class classification is a supervised learning task, so you need to provide your text classifier model with labeled examples from which it can learn to classify the products and the product reviews into the right sentiment classes.
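A minimal sketch of what such labeled examples might look like. The star-rating-to-sentiment mapping, the class values (-1, 0, 1), and the sample reviews are assumptions made up for illustration; they are not taken from the dataset in the screenshots.

```python
from typing import List, Tuple


def rating_to_sentiment(star_rating: int) -> int:
    """Map a 1-5 star rating to a sentiment class.

    Hypothetical mapping (an assumption for illustration):
    1-2 stars -> -1 (negative), 3 stars -> 0 (neutral), 4-5 stars -> 1 (positive).
    """
    if not 1 <= star_rating <= 5:
        raise ValueError(f"star_rating must be between 1 and 5, got {star_rating}")
    if star_rating <= 2:
        return -1
    if star_rating == 3:
        return 0
    return 1


# Made-up reviews paired with star ratings.
raw_reviews: List[Tuple[str, int]] = [
    ("The fabric feels cheap and the zipper broke.", 1),
    ("It is okay, nothing special.", 3),
    ("Fits perfectly, I love it!", 5),
]

# Each labeled example is a (review_text, sentiment_class) pair that a
# supervised text classifier could learn from.
labeled_examples = [(text, rating_to_sentiment(stars)) for text, stars in raw_reviews]

for text, label in labeled_examples:
    print(f"{label:>2}  {text}")
```

With three sentiment classes instead of two, the task is multi-class rather than binary classification.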

![](2023-12-28-12-52-24.png)

## **Working with Data**

### **Data Ingestion and Exploration**

Imagine your e-commerce company is collecting all the customer feedback across all online channels. You need to capture customer feedback streaming from social media channels, feedback captured and transcribed through support center calls, incoming emails, mobile apps, website data, and much more. To do that, you need a flexible and elastic repository that can store many different file formats, from structured data such as CSV files to unstructured data such as support center call audio files.

![](2023-12-28-13-03-52.png)

You can ingest data in its raw format without any prior data transformation, whether it's structured relational data in the form of CSV or TSV files, semi-structured data such as JSON or XML files, or unstructured data such as images, audio, and media files. You can also ingest streaming data, such as an application delivering a continuous feed of log files, or feeds from social media channels, into your data lake.

A data lake needs to be governed. With new data arriving at any point in time, you need to implement ways to discover and catalog the new data. You also need to secure and control access to the data to comply with the applicable data security, privacy, and governance regulations. With this governance in place, you can now give data science and machine learning teams access to large and diverse datasets.

![](2023-12-28-13-12-41.png)

Data lakes are often built on top of object storage, such as Amazon S3. You're probably familiar with file and block storage. File storage stores and manages data as individual files organized in hierarchical file folder structures. In contrast, block storage stores and manages data as individual chunks called blocks. Each block receives a unique identifier, but no additional metadata is stored with that block. With object storage, data is stored and managed as objects, which consist of the data itself, any relevant metadata (such as when the object was last modified), and a unique identifier. Object storage is particularly helpful for storing and retrieving growing amounts of data of any type, which makes it a good foundation for data lakes. Amazon S3 gives you access to durable and highly available object storage in the cloud.
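To make the object model concrete, here is a toy in-memory sketch. It is not how Amazon S3 is implemented; the class names `StoredObject` and `ToyObjectStore`, the key, and the metadata fields are all hypothetical and serve only to show data, metadata, and a unique identifier travelling together.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class StoredObject:
    """An object bundles the data, its metadata, and a unique identifier."""
    data: bytes
    metadata: Dict[str, str]
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ToyObjectStore:
    """In-memory stand-in for a bucket: a flat mapping from key to object."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def put(self, key: str, data: bytes, **user_metadata: str) -> StoredObject:
        # System metadata is recorded alongside any user-supplied metadata.
        metadata = {
            "last_modified": datetime.now(timezone.utc).isoformat(),
            "size_bytes": str(len(data)),
            **user_metadata,
        }
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[key] = obj
        return obj

    def get(self, key: str) -> StoredObject:
        return self._objects[key]


store = ToyObjectStore()
store.put(
    "reviews/2023/feedback.csv",
    b"review_id,review_body\n1,Great fit\n",
    content_type="text/csv",
)
obj = store.get("reviews/2023/feedback.csv")
print(obj.object_id, obj.metadata)
```

Note that the key `reviews/2023/feedback.csv` only looks like a folder path. In this sketch the store is a flat mapping, and the slashes are simply part of the key.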

![](2023-12-28-13-14-01.png)

![](2023-12-28-13-15-26.png)

![](2023-12-28-13-16-09.png)

![](2023-12-28-13-19-27.png)

![](2023-12-28-13-20-10.png)

![](2023-12-28-13-22-49.png)

To do that, import the AWS Data Wrangler Python library as shown here, and then call the `catalog.create_database` function, providing a name for the database to create. AWS Data Wrangler also offers a convenience function called `catalog.create_csv_table` that you can use to register the CSV data with the AWS Glue Data Catalog. The function stores only the schema and the metadata in the AWS Glue Data Catalog table that you specify. The actual data remains in your S3 bucket.
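A minimal sketch of those two calls, assuming the `awswrangler` package is installed and AWS credentials are configured. The database name, table name, S3 path, and column schema below are hypothetical placeholders, not values from the course.

```python
from typing import Dict

# Hypothetical names and location, chosen only for illustration.
DATABASE_NAME = "reviews_db"
TABLE_NAME = "reviews"
S3_PATH = "s3://my-example-bucket/reviews/csv/"

# Hypothetical schema: column name -> Athena/Glue data type.
COLUMN_TYPES: Dict[str, str] = {
    "review_id": "string",
    "product_category": "string",
    "star_rating": "int",
    "review_body": "string",
}


def register_csv_data() -> None:
    """Create a Glue database and register the CSV files' schema as a table.

    Only schema and metadata are written to the Glue Data Catalog;
    the CSV files themselves stay in S3.
    """
    # Imported lazily so this module can be loaded without the library installed.
    import awswrangler as wr

    # This call is expected to fail if the database already exists.
    wr.catalog.create_database(name=DATABASE_NAME)

    wr.catalog.create_csv_table(
        database=DATABASE_NAME,
        table=TABLE_NAME,
        path=S3_PATH,
        columns_types=COLUMN_TYPES,
    )
```

Calling `register_csv_data()` once is enough; after that the table can be referenced by name in SQL queries.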

![](2023-12-28-13-29-59.png)

![](2023-12-28-13-31-06.png)

Athena is an interactive query service that lets you run standard SQL queries to explore your data. Athena is serverless, which means you don't need to set up any infrastructure to run those queries. No matter how large the data you want to query is, you can simply type your SQL query, referencing the dataset schema you provided in the AWS Glue Data Catalog. No data is loaded or moved, and here is a sample SQL query.

Again, this database and table contain only the metadata of your data. The data still resides in S3, and when you run this Python command, AWS Data Wrangler will send this SQL query to Amazon Athena. Athena then runs the query on the specified dataset and stores the results in S3, and it also returns the results in a Pandas DataFrame, as specified in the command shown here.
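A minimal sketch of sending a query to Athena from Python, assuming the `awswrangler` package is installed, AWS credentials are configured, and a table has already been registered in the Glue Data Catalog. The database name, table name, and column names are hypothetical.

```python
# Hypothetical names, chosen only for illustration.
DATABASE_NAME = "reviews_db"
TABLE_NAME = "reviews"


def build_sample_query(table: str) -> str:
    """Return a SQL statement that counts reviews per product category."""
    return (
        "SELECT product_category, COUNT(*) AS review_count "
        f"FROM {table} "
        "GROUP BY product_category "
        "ORDER BY review_count DESC"
    )


def run_sample_query():
    """Send the query to Athena and return the result as a Pandas DataFrame."""
    # Imported lazily so this module can be loaded without the library installed.
    import awswrangler as wr

    return wr.athena.read_sql_query(
        sql=build_sample_query(TABLE_NAME),
        database=DATABASE_NAME,
    )


print(build_sample_query(TABLE_NAME))
```

The table name is interpolated directly into the SQL string here because it is a fixed constant; values that come from user input should not be inserted this way.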

![](2023-12-28-13-39-24.png)

Athena is based on Presto, an open source distributed SQL engine, developed for this exact use case, running interactive queries against data sources of all sizes. And remember, no installation or infrastructure setup is needed, and no data movement is required. Just register your data with AWS Glue and use Amazon Athena to explore your datasets from the comfort of your Python environment. 

### **Data Visualization**

![](2023-12-28-13-50-30.png)

![](2023-12-28-17-02-00.png)

![](2023-12-28-16-59-10.png)

![](2023-12-28-16-59-56.png)

![](2023-12-28-17-01-24.png)

![](2023-12-28-17-04-30.png)

![](2023-12-28-17-17-10.png)

![](2023-12-28-17-13-31.png)

![](2023-12-28-17-14-08.png)

![](2023-12-28-17-14-49.png)

![](2023-12-28-17-19-52.png)