## Module 1: Prepare Data Using SageMaker Data Wrangler

### Preprocessing & feature engineering
You need to prepare your dataset for machine learning. For that, we will use SageMaker Data Wrangler (DW). In this module, we will perform the following tasks:

1. Ingest data from S3
2. Visualize and analyze the data
3. Process and transform the data to clean up and encode the dataset
4. Export data to Feature Store

![Data Wrangler Flow 01](statics/module_01_dw01.png)


## Import your dataset from S3

Select *Amazon S3* from Data sources. The *Import a dataset from S3* page is displayed.

![DW02](statics/module_01_dw02.jpg)

1. Navigate to the bucket and folder that contains the tlc303-5gcell.csv file.
2. Select the tlc303-5gcell.csv file. You'll see a preview of the data.
3. **Sampling Options:** You have the option to import your entire dataset into Data Wrangler or to sample a portion of it.

> **Note**
> The larger the dataset, the more accurate your analyses and visualizations will be and the longer they may take to render. By importing only a sample, rendering time may improve, but at the possible expense of losing influential data points. Random and stratified sampling strategies may help mitigate issues like these, but this depends on the distribution of the data and your unique use case.

> The following sampling settings apply only during interactive mode within Data Wrangler. When exporting (for example, to training or S3), these settings are ignored. To return a smaller subset of the data when exporting, use the Split Data transform.

When importing from Amazon S3, the following sampling options are available:

> **None** – Import the entire dataset.

> **First K** – Sample the first K rows of the dataset, where K is an integer that you specify.

> **Randomized** – Takes a random sample of a size that you specify.

> **Stratified** – Takes a stratified random sample. A stratified sample preserves the ratio of values in a column.

4. Let's accept all the defaults (First K and 50,000) and click the **Import** button.
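
To make the difference between the sampling options concrete, here is a minimal pandas sketch of First K, Randomized, and Stratified sampling. It is not how Data Wrangler implements sampling internally; the helper names, the toy data, and the `is_abnormal` column are hypothetical and exist only for illustration.

```python
import pandas as pd


def first_k(df: pd.DataFrame, k: int) -> pd.DataFrame:
    """First K: keep the first k rows, in file order."""
    return df.head(k)


def randomized(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Randomized: draw n rows uniformly at random (all rows if n is too large)."""
    return df.sample(n=min(n, len(df)), random_state=seed)


def stratified(df: pd.DataFrame, column: str, frac: float, seed: int = 0) -> pd.DataFrame:
    """Stratified: sample the same fraction from each value of `column`,
    so the ratio of values in that column is preserved."""
    return df.groupby(column).sample(frac=frac, random_state=seed)


# Hypothetical toy data: 10 abnormal rows at the top of the file, then 90 normal rows.
toy = pd.DataFrame({
    "cell_id": range(100),
    "is_abnormal": [1] * 10 + [0] * 90,
})

first = first_k(toy, 20)                          # ordered rows only
rand = randomized(toy, 20)                        # uniform random rows
strat = stratified(toy, "is_abnormal", frac=0.2)  # 20% of each class

print("full data abnormal share:", toy["is_abnormal"].mean())
print("First K abnormal share:", first["is_abnormal"].mean())
print("Stratified abnormal share:", strat["is_abnormal"].mean())
```

In this toy example the file is sorted so that the abnormal rows come first, which makes First K over-represent them, while the stratified sample keeps the original 10% share. Whether that matters for your data depends on how the source file is ordered.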



## Navigating Data Wrangler Workspace

After importing the data, you will see a summary page with 3 tabs: Data, Analysis, and Training.

* The Data tab summarizes the steps added to the data source up to this point in the data transformation. Expand an individual step to modify it.
* The Analysis tab shows the visualizations and reports that have been generated.
* The Training tab is used to train an AutoML model (covered in more detail later).

![DW03](statics/module_01_dw03.png)


## Get Insights On Data and Data Quality

1. Use the **< Data flow** button on the top left to get to the main data flow workspace.
2. To get some insights on the data we've just imported, click the add icon (a plus sign) next to the *Data types* node in the Data Flow diagram and select *Get data insights*.

![DW04](statics/module_01_dw04.png)

This is a shortcut that takes us to the analysis page, which lists the analysis types we can choose from and apply.

![DW05](statics/module_01_dw05.png)
1. By default, **Analysis Type** is set to **Data Quality and Insights Report**.

Data Quality and Insights Report is a quick way to get a better understanding of your dataset. It generates a comprehensive report of your data across the following topics: Summary, Duplicate Rows, Anomalous Samples, Target Column, Quick Model, Feature Summary, Feature Details, Samples, and Definitions. You can export this report to share or review at a different time. Let’s look at some of the analysis in more detail.

1. For **Target column**, select **5g_sgnb_abnormal_release_rate**

To determine good 5G accessibility, we will use `abnormal_release_rate`, which represents the likelihood of connectivity drops. Any `abnormal_release_rate` > 0 is treated as a high probability of an anomaly. We will then train a classification model that predicts the likelihood of connectivity drops from input features such as network utilization, contention rates, health index, and throughput parameters. This use case is part of the 5G performance observability initiative, which aims to predict any potential loss of connectivity to the 5G radio network in the next hour, helping to ensure a seamless and uninterrupted user experience.

1. For **Problem Type** select **Classification**
2. Click the **Create** button to generate the report.
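
The rule described above (any abnormal release rate greater than 0 counts as a likely anomaly) can be sketched in pandas as a binary label. This is only an illustration of the thresholding idea, not a step the workshop performs here; the `is_abnormal` label name, the helper function, and the toy values are assumptions.

```python
import pandas as pd

# Target column name as given in the text.
TARGET = "5g_sgnb_abnormal_release_rate"


def add_binary_label(df: pd.DataFrame, target: str = TARGET,
                     label: str = "is_abnormal") -> pd.DataFrame:
    """Return a copy of df with a 0/1 label: 1 when the target rate is above 0.

    The label name `is_abnormal` is hypothetical, chosen for this sketch.
    """
    out = df.copy()
    out[label] = (out[target] > 0).astype(int)
    return out


# Hypothetical toy values, not taken from the workshop dataset.
toy = pd.DataFrame({TARGET: [0.0, 0.0, 0.013, 0.0, 0.25]})
labeled = add_binary_label(toy)
print(labeled)
```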

**Summary:** A brief summary of the data that includes general information such as missing values, invalid values, feature types, and outlier counts.

* This dataset is already pretty clean.
* 0% missing values
* 0% duplicate rows
* 100% valid
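
If you want to sanity-check figures like these outside Data Wrangler, the missing-value and duplicate-row percentages can be approximated with pandas. This is a rough sketch under the assumption that "missing" means null cells and "duplicate" means fully identical rows; it does not reproduce the report's validity check, and the function name and toy data are hypothetical.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Approximate the missing-value and duplicate-row percentages of a frame."""
    n_rows = len(df)
    n_cells = df.size  # rows * columns
    missing_cells = int(df.isna().sum().sum())
    duplicate_rows = int(df.duplicated().sum())  # repeats after the first occurrence
    return {
        "rows": n_rows,
        "missing_pct": 100.0 * missing_cells / n_cells if n_cells else 0.0,
        "duplicate_row_pct": 100.0 * duplicate_rows / n_rows if n_rows else 0.0,
    }


# Hypothetical toy data with one missing cell and one repeated row.
toy = pd.DataFrame({
    "utilization": [1, 1, 2, 3],
    "throughput": [2.0, 2.0, None, 4.0],
})
summary = quality_summary(toy)
print(summary)
```

On a clean dataset such as the one used in this module, both percentages would be expected to come out at 0.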

## While You Wait for the Data to Load, Explore the Feature Store Console

### Feature Store

![XXX](statics/feature_store.png)