## 2. Exploratory Data Analysis

## Part-One Sanitize and prepare data for modeling

**Important Topics**
* Dataset generation
    * Amazon SageMaker Ground Truth
    * Amazon SageMaker Ground Truth Plus
    * Amazon Mechanical Turk 
* Descriptive & Informative statistics
* Handling missing values and outliers

### a. Dataset Generation
Amazon SageMaker enables you to identify raw data, such as images, text files, and videos; add informative labels; and generate labeled synthetic data to create high-quality training datasets for your machine learning (ML) models. SageMaker offers two options, Amazon SageMaker Ground Truth Plus and Amazon SageMaker Ground Truth, which provide you with the flexibility to use an expert workforce to create and manage data labeling workflows on your behalf or manage your own data labeling workflows.

**Amazon SageMaker Ground Truth**
If you want the flexibility to build and manage your own data labeling workflows and workforce, you can use SageMaker Ground Truth. SageMaker Ground Truth is a data labeling service that makes it easy to label data and gives you the option to use human annotators through Amazon Mechanical Turk, third-party vendors, or your own private workforce.

**Amazon SageMaker Ground Truth Plus**
With SageMaker Ground Truth Plus, you can create high-quality training datasets without having to build labeling applications or manage labeling workforces on your own. SageMaker Ground Truth Plus helps reduce data labeling costs by up to 40%. SageMaker Ground Truth Plus provides an expert workforce that is trained on ML tasks and can help meet your data security, privacy, and compliance requirements. You upload your data, and then SageMaker Ground Truth Plus creates and manages data labeling workflows and the workforce on your behalf.

**Amazon Mechanical Turk (MTurk)** 
Amazon Mechanical Turk is a crowdsourcing marketplace that makes it easier for individuals and businesses to outsource their processes and jobs to a distributed workforce who can perform these tasks virtually. This could include anything from conducting simple data validation and research to more subjective tasks like survey participation and content moderation. MTurk enables companies to harness the collective intelligence, skills, and insights from a global workforce to streamline business processes, augment data collection and analysis, and accelerate machine learning development.

### b. Descriptive & Informative Statistics 
The first thing you should do, before cleaning the data, is use descriptive statistics to better understand your data. Descriptive statistics help you gain valuable insights into your data so that you can more effectively preprocess the data and prepare it for your ML model. Descriptive statistics can be organized into three categories: overall, attribute, and multivariate statistics.

**Overall Statistics**
Overall statistics include the number of rows (or instances) and the number of columns (or features or attributes) in your dataset. This information, which relates to the dimensions of your data, is really important. For example, it can indicate that you have too many features, which can lead to high dimensionality and poor model performance. 

**Attribute Statistics**
Attribute statistics are another type of descriptive statistic, specifically for numeric attributes, and are used to get a better sense of the shape of your attributes. This includes properties like the mean, standard deviation, variance, and minimum and maximum values.
![image.png](attachment:image.png)
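
The overall and attribute statistics described above can be computed in a few lines. This is a minimal sketch using pandas; the dataset and its column names (`age`, `income`) are made up for illustration.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "income": [38000, 52000, 61000, 45000, 88000],
})

# Overall statistics: number of rows (instances) and columns (features).
n_rows, n_cols = df.shape

# Attribute statistics for the numeric columns.
summary = df.describe()   # count, mean, std, min, quartiles, max
variance = df.var()       # sample variance (ddof=1), i.e. std squared

print(n_rows, n_cols)
print(summary.loc[["mean", "std", "min", "max"]])
print(variance)
```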

**Multivariate Statistics**
Multivariate statistics mostly have to do with the correlations and relationships between your attributes. Identifying correlations is important, because they can impact model performance. When features are closely correlated and they’re all used in the same model to predict the response variable, there could be problems. For example, the model loss might fail to converge to a minimum.
Correlation ranges from -1 to 1. A correlation of 1 means the two numerical features have a perfect positive linear relationship: Y increases in exact step with X (Y = aX + b with a > 0). A correlation of -1 means a perfect negative linear relationship (a < 0). Values in between quantify how strong the linear relationship is. A correlation of 0 means there is no linear relationship, but it does not mean there is no relationship at all; the two variables can still be related in a nonlinear way.
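
To make the three cases concrete, here is a small sketch using NumPy's `np.corrcoef`. The data is invented: a symmetric set of `x` values, so that the quadratic example has exactly zero linear correlation even though `y` is fully determined by `x`.

```python
import numpy as np

# Illustrative values, symmetric around zero.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the correlation of the pair.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # perfect positive linear relationship
r_neg = np.corrcoef(x, -3 * x)[0, 1]      # perfect negative linear relationship
r_quad = np.corrcoef(x, x ** 2)[0, 1]     # strong relationship, but not linear

print(round(r_pos, 6), round(r_neg, 6), round(r_quad, 6))
```

The last value is zero (up to floating-point rounding), which is why a low correlation should not be read as "these features are unrelated".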

**Types of Correlations**<br>
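* Pearson Correlation - measures the strength of the linear relationship between two variables and is typically used when the data has a Gaussian distribution. As a rule of thumb, for a Pearson’s correlation coefficient to indicate a notable correlation, the coefficient value should be above 0.5 for a positive correlation, or below -0.5 for a negative correlation. Anything between -0.5 and 0.5 falls into the indeterminate range.
* Spearman's Correlation - a rank-based measure of how monotonic the relationship between two variables is; used when the data has a non-Gaussian distribution.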
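
The difference between the two coefficients shows up on data that is monotonic but not linear. The sketch below uses invented data (`y = exp(x)`) and computes Spearman's correlation as Pearson's correlation applied to the ranks, which is its definition; this also avoids depending on SciPy.

```python
import numpy as np
import pandas as pd

# Illustrative data: y always increases with x, but far from linearly.
x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)

pearson = x.corr(y)                  # default method is Pearson (linear association)
spearman = x.rank().corr(y.rank())   # Pearson on ranks = Spearman

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman reports a perfect monotonic relationship, while Pearson is noticeably lower because the relationship is not a straight line.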

### c. Handling missing values and outliers
**Outliers** are points in your dataset that lie at an abnormal distance from other values. They are not always something you want to clean up, because they can add richness to your dataset. But they can also make it harder to make accurate predictions, because they skew values away from the other, more normal, values related to that feature. Moreover, an outlier can also indicate that the data point actually belongs to another column.
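
One common convention for flagging outliers (an assumption here, not something these notes prescribe) is the 1.5 × IQR rule: anything more than 1.5 interquartile ranges below the first quartile or above the third quartile is flagged. A minimal sketch with invented values:

```python
import pandas as pd

# Illustrative measurements; 95 is far away from the rest.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# 1.5 * IQR fences: a widely used rule of thumb, not a universal definition.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())
```

Flagging a point is only the first step; whether to drop, cap, or keep it depends on whether it is an error or a genuine, informative observation.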

**Missing data** can make it difficult to accurately interpret the relationship between the related feature and the target variable, so, regardless of how the data ended up missing, it is important to deal with the issue.

Here are a few ways to handle missing data: 
* Remove the columns or rows that include the missing data
* Fill the missing value with the column mean, a zero, or another numerical value using imputation

Neither of these approaches is without trade-offs, so choose carefully based on the data. Generally speaking, if the data is small and you cannot afford to lose too many data points, you should choose one of the imputation techniques.
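
Both approaches can be sketched with pandas. The DataFrame below is hypothetical (the column names are made up), and mean imputation is shown only as one of several possible fill strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value in each column.
df = pd.DataFrame({
    "height_cm": [170.0, np.nan, 182.0, 165.0],
    "weight_kg": [68.0, 75.0, np.nan, 59.0],
})

# Option 1: remove every row that contains a missing value.
dropped = df.dropna()

# Option 2: impute each missing value with the mean of its column.
imputed = df.fillna(df.mean())

print(len(df), "rows originally,", len(dropped), "after dropping")
print(imputed)
```

On this tiny example, dropping rows discards half of the data, which illustrates why imputation is usually preferred for small datasets.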

## Part-Two Perform Feature Engineering

**Important Topics**
* Data Augmentation
    * Scaling & Normalizing
    * Dimensionality reduction
    * Date formatting
    * One-hot encoding
    * Data Transformation

### a. Data Augmentation
Data can be messy in several ways. For instance, maybe your algorithm expects to see data written in English, but there are some words in your dataset from different languages. Or maybe there are special characters in some of the words, or even just a lot of space between words. The key is to make sure you are standardizing your data. If your algorithm requires English, make sure it’s all in English. 

**Make sure the data is on the same scale**<br>
**Standardize language and grammar**<br>
**Make sure a column doesn’t include multiple features**<br>
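
As a rough illustration of the last two points, the sketch below cleans up a hypothetical `location` column that mixes two features (city and state) with inconsistent spacing and casing. The column names and values are invented.

```python
import pandas as pd

# Hypothetical messy column: two features in one field, inconsistent formatting.
df = pd.DataFrame({"location": ["  Seattle, WA", "portland,OR ", "AUSTIN ,  tx"]})

# Split the combined field into separate columns (labeled 0 and 1).
parts = df["location"].str.split(",", expand=True)

# Standardize whitespace and casing in each new feature.
df["city"] = parts[0].str.strip().str.title()
df["state"] = parts[1].str.strip().str.upper()

print(df[["city", "state"]])
```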

**Dimensionality reduction**<br>
You may need to perform feature engineering because of the dimensionality of your dataset, particularly if there are too many features for your model to handle. To reduce the number of features, you can apply dimensionality reduction techniques like principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE).
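
To show the idea behind PCA, here is a minimal sketch that implements it directly with NumPy's singular value decomposition rather than a library class. The data is synthetic: three features where the third is almost a copy of the first, so two components should capture nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 3 features; feature 3 nearly duplicates feature 1.
base = rng.normal(size=(100, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.01 * rng.normal(size=100),
])

# PCA: center the data, then take the SVD; rows of Vt are the principal directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of total variance captured by each component (sorted, largest first).
explained_variance_ratio = S ** 2 / np.sum(S ** 2)

# Keep the top k components and project the data onto them.
k = 2
X_reduced = X_centered @ Vt[:k].T

print(explained_variance_ratio.round(4))
print(X_reduced.shape)
```

In practice you would more likely use a library implementation; the point here is only that the reduced dataset has fewer columns while retaining almost all of the variance.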

**Data Transformation**<br>
For numerical features, you can do what is referred to as transformation. One example is a polynomial transformation, where you take the square and cube of the original feature and use all three columns as separate attributes while training your model.
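
A polynomial transformation of this kind is a one-liner per new column in pandas. The feature name `x` and its values are placeholders.

```python
import pandas as pd

# Placeholder feature; in practice this would be a real numeric column.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# Add the square and cube as additional attributes alongside the original.
df["x_squared"] = df["x"] ** 2
df["x_cubed"] = df["x"] ** 3

print(df)
```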

**Scaling & Normalizing**<br>
Keep in mind that not all ML algorithms are sensitive to the scale of the input features. Here is a collection of commonly used scaling and normalizing transformations for data science and ML projects:
* StandardScaler - Subtracts the mean of each feature and divides by its standard deviation, so each feature ends up with mean = 0 and unit variance. It does not make the data normally distributed.
* MinMax scaling - MinMaxScaler rescales each feature to a fixed range, [0, 1] by default. A different range such as [-1, 1] can be requested explicitly; it is not selected automatically when the data contains negative values.
* MaxAbs scaling - Divides each column by its maximum absolute value, so values fall in [-1, 1]. It does not shift or center the data.
* Robust scaling - Subtracts the median and scales the data by a quantile range (the interquartile range by default), which makes it less sensitive to outliers.
* Normalizer - Normalizes each row (sample), rather than each column, to unit norm.
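
The formulas behind these transformations are short enough to write out by hand. The sketch below uses plain NumPy on an invented matrix and is intended to mirror the default behavior of the scikit-learn classes with the same names; treat that correspondence as an assumption and check the library documentation before relying on it.

```python
import numpy as np

# Invented data: 4 samples (rows), 2 features (columns) on very different scales.
X = np.array([[1.0, -10.0],
              [2.0,   0.0],
              [3.0,  10.0],
              [4.0,  40.0]])

# Standard scaling: zero mean, unit variance per column.
standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: each column mapped to [0, 1], even with negative inputs.
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Max-abs scaling: divide by the largest absolute value; no centering.
maxabs = X / np.abs(X).max(axis=0)

# Robust scaling: subtract the median, divide by the interquartile range.
q1, median, q3 = np.percentile(X, [25, 50, 75], axis=0)
robust = (X - median) / (q3 - q1)

# Normalizer: works on rows, scaling each sample to unit L2 norm.
row_normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

print(minmax)
print(row_normalized)
```

Note that the first four operate column by column (`axis=0`), while the normalizer operates row by row (`axis=1`).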

## Part-Three Analyze and visualize data for ML

**Important Topics**
* Scatter plots
* Box plots
* Histograms
* Scatter matrix
* Correlation matrix
* Heatmaps
* Confusion matrix

### a. Scatter plots 
A scatter plot is a graph that presents the relationship between two variables in a dataset. It represents data points on a two-dimensional plane, or Cartesian system. The independent variable or attribute is plotted on the X-axis, while the dependent variable is plotted on the Y-axis.
![image-3.png](attachment:image-3.png)

### b. Box plots 
A box plot shows how tightly the data is grouped, whether the data is skewed, and how symmetric it is.
![image-4.png](attachment:image-4.png)
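
The quantities a box plot draws can also be computed directly, which helps when reading one. This sketch uses invented, right-skewed data and NumPy's `np.percentile`; comparing the two halves of the box is a quick, informal skew check rather than a formal test.

```python
import numpy as np

# Invented, right-skewed sample (already sorted for readability).
data = np.array([1, 2, 2, 3, 3, 4, 5, 8, 15], dtype=float)

# The box spans Q1 to Q3, with a line at the median.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                  # box height: how tightly the middle 50% is grouped

lower_half = median - q1       # distance from Q1 to the median
upper_half = q3 - median       # distance from the median to Q3

print(q1, median, q3, iqr)
print("upper half wider than lower half:", upper_half > lower_half)
```

A wider upper half, as here, suggests the distribution is skewed toward larger values; roughly equal halves suggest symmetry.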

### c. Histograms
By looking at the histogram of an individual feature, you can see the overall behavior of that particular variable. Is it normally distributed? Is there only one peak, or are there multiple peaks in the data? You can also spot other important high-level properties, like skewness, for that particular variable.
![image-5.png](attachment:image-5.png)
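
A histogram is just a count of values per bin, so the same questions can be answered numerically. The sketch below uses `np.histogram` on invented data with two clusters; the bin count and range are arbitrary choices for the example.

```python
import numpy as np

# Invented data with two clusters: one near 1-2 and one near 8-9.
data = np.array([1, 1, 2, 1, 2, 8, 9, 9, 8, 9, 5])

# Three equal-width bins over [0, 9]: [0, 3), [3, 6), [6, 9].
counts, edges = np.histogram(data, bins=3, range=(0, 9))

print(edges)    # bin boundaries
print(counts)   # number of values in each bin
```

The counts are high in the outer bins and low in the middle one, which is the two-peak (bimodal) pattern you would see in the plotted histogram.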

### d. Scatter matrix
A scatter matrix consists of several pair-wise scatter plots of variables presented in a matrix format. It can be used to determine whether the variables are correlated and whether the correlation is positive or negative.
![image-6.png](attachment:image-6.png)
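
The numeric counterpart of a scatter matrix is a correlation matrix, which holds one coefficient per pair of variables. This is a small sketch with a hypothetical dataset (the column names and values are invented), using pandas' `DataFrame.corr`.

```python
import pandas as pd

# Hypothetical dataset; values chosen so the trends are easy to see.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
    "hours_gaming": [9, 7, 6, 4, 2],
})

# Pairwise Pearson correlations; the diagonal is always 1.
corr = df.corr()

print(corr.round(2))
```

The sign of each entry tells you whether the corresponding scatter plot slopes upward or downward. The plots themselves can be drawn with `pandas.plotting.scatter_matrix(df)`, which requires matplotlib.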

### e. Heatmaps
Heatmaps use color intensity to show the volume of locations or events within a dataset, directing viewers toward the areas of a visualization that matter most.


Visualizations help give you a better idea of what’s inside a particular feature and help you answer questions like these: 
* What’s the range of the data?
* What’s the peak of the data?
* Are there any outliers?
* Are there any interesting patterns in the data? 

Data visualization will also help you determine whether you need to clean and preprocess your data before model training.


