# Exploratory Data Analysis

The primary goal of EDA is to maximize the analyst's insight into a data set and its underlying structure, while still providing the specific items that an analyst would want to extract from it.
Insight implies detecting and uncovering underlying structure in the data. Such structure may not be captured by any list of specific items; those items serve as the targets of an analysis, but the real insight and "feel" for a data set come as the analyst judiciously probes and explores the various subtleties of the data. That "feel" comes almost exclusively from applying graphical techniques, which together serve as the window into the essence of the data. Graphics are irreplaceable: no quantitative analogue gives the same insight as well-chosen graphics.
To get a "feel" for the data, it is not enough for the analyst to know what is in the data; the analyst must also know what is not in the data, and the only way to do that is to draw on human pattern-recognition and comparative abilities, applied through a series of judicious graphical techniques.


## Objectives

#### [1] Know the data types of the dataset – whether continuous/discrete/categorical

#### [2] Understand how single categorical data is distributed/Varied (Table, Mode)


#### [3] Understand how single numeric data is distributed/Varied (Mean, Median, Mode, Variance, Skewness, etc.)

#### [4] Test the effect of different levels of a single categorical feature on a numeric feature (2-sample test, ANOVA)

#### [5] Understand how multi-dimensional (All) data is distributed – MCA, PCA


#### [6] Identify missing or NA values


#### [7] Identify correlated features (Numeric/Categorical) – Collinearity


#### [8] Test whether a pattern or a feature is randomly generated


#### [9] Identify outliers


#### [10] Identify/sort the significant features / dataset to target (If supervised)


#### [11] Test for a linear relationship between the data and the target variable


#### [12] Check whether there is any synergy (interaction) between features in their effect on the target feature


#### [13] Identify the normality of numeric features


#### [14] Identify the variance pattern of numeric features


#### [15] Identify the effects of different categorical factors/numeric values on target value


## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

## [1] Know the data types of the dataset – whether continuous/discrete/categorical

## [9] Identify outliers

Outliers are sometimes extreme values that fall a long way outside the other observations. For example, in a normal distribution, outliers may be values in the tails of the distribution. However, outliers can also be values in the normal range that fall outside the current data pattern (detecting these requires building a model).

An outlier is an observation that appears to deviate markedly from other observations in the sample.
Identification of potential outliers is important for the following reasons.

An outlier may indicate bad data. For example, the data may have been coded incorrectly or an experiment may not have been run correctly. If it can be determined that an outlying point is in fact erroneous, then the outlying value should be deleted from the analysis (or corrected if possible).

In some cases, it may not be possible to determine if an outlying point is bad data. Outliers may be due to random variation or may indicate something scientifically interesting. In any event, we typically do not want to simply delete the outlying observation. However, if the data contains significant outliers, we may need to consider the use of robust statistical techniques.

*High-leverage point: an observation with extreme predictor values; removing it can noticeably change the fitted model.

#### Identify: 

http://machinelearningmastery.com/how-to-identify-outliers-in-your-data/

- [Plotting]: Box plot, QQ-plot, histogram

- [Extreme Value Analysis]: Determine the statistical tails of the underlying distribution of the data. For example, statistical methods like z-scores on univariate data.

- [Probabilistic and Statistical Models]: Determine unlikely instances from a probabilistic model of the data. For example, Gaussian mixture models optimized using expectation-maximization.

- [Linear Models]: Projection methods that model the data in lower dimensions using linear correlations. For example, principal component analysis; data with large residual errors may be outliers.

- [Proximity-based Models]: Data instances that are isolated from the mass of the data as determined by cluster, density or nearest neighbor analysis.

- [Information Theoretic Models]: Outliers are detected as data instances that increase the complexity (minimum code length) of the dataset.

- [High-Dimensional Outlier Detection]: Methods that search subspaces for outliers, given the breakdown of distance-based measures in higher dimensions (curse of dimensionality).
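
As a minimal sketch of the plotting and extreme value approaches above, the snippet below flags univariate outliers with z-scores and with the 1.5 × IQR rule that box plots use. The function names, the thresholds (3 standard deviations, 1.5 × IQR) and the sample data are illustrative assumptions, not values prescribed by these notes.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (box-plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: 20 ordinary readings plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95]

print(zscore_outliers(data))  # [95]
print(iqr_outliers(data))     # [95]
```

Note that a single extreme value inflates the mean and standard deviation, so on very small samples the z-score rule can fail to flag it; the IQR rule is less sensitive to this.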


#### Solve:

*First investigate the cause: is the point really an outlier?

- Simply remove the outlier (dangerous)

- Choose a robust statistical model

- Change/normalize the outlier's value (e.g., replace it with a percentile value)
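
The last option, replacing extreme values with a percentile value, can be sketched as follows. The 5th and 95th percentile cut-offs and the helper name `clip_to_percentiles` are assumptions chosen for illustration; the right cut-offs depend on the data.

```python
import statistics


def clip_to_percentiles(values, lower_pct=5, upper_pct=95):
    """Replace values beyond the given percentiles with the percentile values."""
    # quantiles(n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lower, upper = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lower), upper) for v in values]


# Hypothetical sample: 20 ordinary readings plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95]
clipped = clip_to_percentiles(data)

print(max(data), "->", max(clipped))
```

Unlike deletion, this keeps the sample size unchanged, but it still alters the data, so it is worth recording which values were replaced.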




In [None]:
# -------- R

In [None]:
# -------- Python

## [7] Identify correlated features (Numeric/Categorical) – Collinearity

Multicollinearity increases the standard errors of the coefficients. Increased standard errors in turn mean that the coefficients of some independent variables may be found not to be significantly different from 0. In other words, by inflating the standard errors, multicollinearity makes some variables statistically insignificant when they should be significant. Without multicollinearity (and thus with lower standard errors), those coefficients might be significant.

#### Identify:

- Variance Inflation Factor (VIF)

The VIF measures the impact of collinearity among the predictors in a regression model: it assesses how much the variance of an estimated regression coefficient increases when the predictors are correlated. The VIF is 1/Tolerance, where the tolerance of a predictor is 1 - R² from regressing that predictor on all the others, so the VIF is always greater than or equal to 1. If no predictors are correlated, the VIFs will all be 1. There is no formal VIF cutoff for determining the presence of multicollinearity. Values above 10 are often regarded as indicating multicollinearity, but in weaker models values above 2.5 may be a cause for concern. Rule of thumb: a VIF of 1 means no correlation; a VIF of 5 to 10 means high correlation.
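
A minimal sketch of the VIF computation, assuming NumPy is available: each predictor is regressed on the remaining predictors (with an intercept), and the VIF is 1 / (1 - R²) of that regression. The function name `vif` and the synthetic data are illustrative assumptions; the sketch does not handle perfectly collinear columns, where R² equals 1.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of a 2-D predictor array."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n_rows), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        factors.append(1.0 / (1.0 - r_squared))
    return factors


# Hypothetical predictors: x2 is nearly a copy of x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)

factors = vif(np.column_stack([x1, x2, x3]))
print([round(f, 2) for f in factors])
```

In this setup the first two factors should come out far above 10 and the third close to 1, matching the thresholds discussed above.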


#### Solve:

- Remove highly correlated predictors from the model. If you have two or more factors with a high VIF, remove one from the model. Because they supply redundant information, removing one of the correlated factors usually doesn't drastically reduce the R-squared. 

- Use Partial Least Squares Regression (PLS) or Principal Components Analysis (PCA), methods that reduce the predictors to a smaller set of uncorrelated components.
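
To illustrate the second option, the sketch below projects correlated predictors onto their first principal components using an SVD and checks that the resulting component scores are uncorrelated. It assumes NumPy; the helper name `principal_components`, the choice of two components and the synthetic data are assumptions for illustration. Using the scores as regressors in place of the original predictors is the idea behind principal components regression.

```python
import numpy as np


def principal_components(X, n_components):
    """Project the centered data onto its first n_components principal directions."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


# Hypothetical predictors: x2 is nearly a copy of x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)

scores = principal_components(np.column_stack([x1, x2, x3]), n_components=2)
corr = np.corrcoef(scores, rowvar=False)
print(np.round(corr, 6))  # off-diagonal entries are numerically zero
```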


In [None]:
# ---------- R

In [None]:
# ---------- Python