# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a set of tools and processes for better understanding the data, such as:

- Creating statistical summaries (mean, min, max, ...)
- Identifying outliers
- Visualizing feature distributions

In the Machine Learning pyramid, EDA is the foundational layer: it builds an understanding of the data so that critical insights are not missed.

> **NIST** provided a handbook about EDA [here](https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm).

EDA should always be done on a new dataset, to avoid missing insights and
negatively impacting the model.

> **IBM** has a blog post explaining the "whats", "whys" and "hows" about EDA
[here](https://www.ibm.com/cloud/learn/exploratory-data-analysis).

---

## Pandas

> Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. -- [Pandas](https://pandas.pydata.org)

### [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

Spreadsheet-like container for data, where rows are individual events and columns are features and target values.
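
As a minimal sketch of that layout, the snippet below builds a small `DataFrame` by hand. The column names and values are hypothetical toy data, chosen only to show features and a target side by side.

```python
import pandas as pd

# Hypothetical toy data: each row is one observation,
# each column is a feature or the target value.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.3, 5.8],                       # feature
        "sepal_width": [3.5, 3.0, 3.3, 2.7],                        # feature
        "species": ["setosa", "setosa", "virginica", "virginica"],  # target
    }
)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type of each column
print(df.head())  # first rows of the table
```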

### [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)

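Provides summary statistics for the numeric columns in the DataFrame (the default behavior when numeric and non-numeric columns are mixed):

- Count
- Mean
- Standard deviation
- Minimum
- Maximum
- Quantiles (25%, 50% and 75% by default)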
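
The summary listed above can be sketched as follows. The `height_cm` and `label` columns are made-up examples; the non-numeric `label` column is there only to show that it is left out of the default summary.

```python
import pandas as pd

# Hypothetical data: one numeric column and one non-numeric column.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
        "label": ["a", "b", "a", "b", "a"],
    }
)

# With default arguments only the numeric column is summarized.
summary = df.describe()
print(summary)

# Individual statistics can be read from the result by row label.
print(summary.loc["mean", "height_cm"])
```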

### [`hist()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html)

Calculates and plots a histogram for each column. Helpful to identify outliers and the shape of each distribution:

- Normal
- Uniform
- Chi-Square
- F
- ...
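
A rough illustration of comparing distribution shapes, assuming `matplotlib` is installed (pandas uses it for plotting). The data is synthetic and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic samples drawn from two known distributions (seeded for repeatability).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(
    {
        "normal": rng.normal(loc=0.0, scale=1.0, size=1000),
        "uniform": rng.uniform(low=-3.0, high=3.0, size=1000),
    }
)

# One histogram per numeric column; `axes` is an array of matplotlib Axes.
axes = df.hist(bins=30, figsize=(8, 3))

for ax in axes.ravel():
    print(ax.get_title())  # each subplot is titled with its column name
```

In a notebook the figure is typically rendered inline; in a plain script, `matplotlib.pyplot.show()` or `savefig()` would be needed to see it.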

### [`corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)

Calculates the pairwise correlation matrix of the columns (Pearson by default). Strongly correlated attributes carry largely redundant information, so they add little to ML models.
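
One possible way to spot such redundant attributes is sketched below. The data is hypothetical (`length_in` is roughly `length_cm` converted to inches), and the `0.9` threshold is an arbitrary choice for illustration, not a recommended value.

```python
import pandas as pd

# Hypothetical data: two columns measure the same quantity in different units.
df = pd.DataFrame(
    {
        "length_cm": [1.0, 2.0, 3.0, 4.0, 5.0],
        "length_in": [0.39, 0.79, 1.18, 1.57, 1.97],
        "noise": [2.0, -1.0, 0.5, 3.0, -2.0],
    }
)

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))

# Collect each pair of distinct columns whose absolute correlation
# exceeds the (arbitrary) threshold.
threshold = 0.9
strong_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(strong_pairs)
```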

---

## Exercises

01. **Using SageMaker Studio for EDA in Pandas**: The objective of this
    exercise was not to learn Pandas, but to try AWS SageMaker Studio for the
    first time.

02. **Data Wrangler**: This exercise doesn't contain a notebook. Instead it has
    a CSV file that should be uploaded to S3. Then, use AWS SageMaker Data
    Wrangler to create a pipeline that transforms the date field into year,
    month and day. After that, create two visualizations: a table summary and a
    histogram. Wrap up by exporting to S3.

03. **Ground Truth**: Another exercise without a notebook. It's just a JSONL
    file that should be uploaded to S3. Using AWS SageMaker Ground Truth,
    create a labeling job, add a labeling workforce, and finally label the
    data. After stopping the job, the `output.manifest` should contain the
    labeled data, as well as some metadata. This exercise was done within AWS
    SageMaker Studio, then copied locally.

04. **EDA Iris Dataset**: The fourth and final exercise of this lesson is all
    about applying the EDA techniques learned to the famous Iris dataset,
    available in scikit-learn's `datasets` module.
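
As a starting point for the Iris exercise, a minimal sketch (not the exercise's actual solution) could load the dataset into a `DataFrame` and apply the methods described above. It assumes scikit-learn 0.23 or later, where `load_iris` accepts `as_frame=True`.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset as a pandas DataFrame (features plus a "target" column).
iris = load_iris(as_frame=True)
df = iris.frame

print(df.shape)       # rows and columns
print(df.describe())  # summary statistics per numeric column
print(df.corr())      # pairwise correlations, including the target
```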