# Visualizations

#### Overview
Visualization tools to ...
- Explore individual features
  * Histograms
  * Plots
  * Statistics
- Explore feature relations
  * Scatter plots
  * Correlation plots
  * Plot (index vs. feature statistics)
  * And more

### <center>EDA is an art!</center>
    
<center>And visualizations are our art tools</center>


## Art tools

### Histograms
```python
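plt.hist(x)  # matplotlib: histogram of a single array or Series
df.hist()    # pandas: one histogram per numeric column (df.plot(kind='hist') overlays them)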
```

#### Never make a conclusion based on a single plot.
Try several different plots to prove your hypothesis.

* Original data
![histogram-confuse](img/hist-confuse.png)

* After applying `np.log` to the data
![histogram-confuse2](img/hist-confuse2.png)
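
As a minimal sketch of looking at the same feature in more than one way, the snippet below draws a histogram of the raw values next to a histogram of the `np.log`-transformed values. The data is synthetic (a log-normal sample standing in for a real feature), and the names `x`, `x_log` and the choice `bins=50` are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, strictly positive feature (an assumption for illustration only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Raw values: most of the mass is squeezed into the first few bins.
ax_raw.hist(x, bins=50)
ax_raw.set_title("Original data")

# Log-transformed values: the shape of the distribution becomes visible.
x_log = np.log(x)
ax_log.hist(x_log, bins=50)
ax_log.set_title("np.log applied")

# Call plt.show() when running as a script; notebooks render the figure automatically.
```

Changing `bins` is another cheap way to get a second view of the same feature.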


#### Additional information
* In the two plots above, the `peak` sits exactly at the mean value, which suggests the organizers may have filled the `NaN` values with the average.
  * We can replace the missing values we found with something like `-999`
  * Or we can generate a new feature that indicates the value was missing - particularly useful for linear models!
  * XGBoost can handle missing values (`NaN`) on its own.
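
The two manual options above could be sketched as follows. The toy table, the column name `feature` and the variable `suspected_fill` are hypothetical; only the `-999` placeholder comes from the notes.

```python
import pandas as pd

# Hypothetical toy data: the value 5.0 plays the role of the suspicious peak
# (missing values that appear to have been filled with the mean).
df = pd.DataFrame({"feature": [1.0, 5.0, 9.0, 5.0, 5.0]})
suspected_fill = 5.0

# Option 1: a binary indicator marking the values we believe were missing.
df["feature_was_missing"] = (df["feature"] == suspected_fill).astype(int)

# Option 2: replace the suspected fill value with an out-of-range constant.
df["feature_filled"] = df["feature"].where(df["feature"] != suspected_fill, -999)
```

An equality test like this also flags genuine observations that happen to equal the fill value, so it is worth checking how many rows are affected.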
  

### Plot (index vs. value)

![index-vs-value](img/plot-index-vs-value.png)

If we observe horizontal lines on this kind of plot,
* we can see that there are lots of **repeated values** in this feature

Also, note the randomness over the indices.
* we see horizontal patterns but no vertical ones
* it means the data is properly shuffled

![index-vs-value](img/plot-index-vs-value2.png)

We can also color code the points according to their labels.
* we can clearly see the data is not shuffled well here
* it is sorted by class label
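
A color-coded index-versus-value plot might look like the sketch below. The data is synthetic and deliberately sorted by class label, so the plot shows the unshuffled pattern described above; `x` and `y` are placeholder names for a feature and its labels.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic example (an assumption): two classes, rows sorted by class label.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 500)               # labels: the first 500 rows are class 0
x = rng.normal(loc=y * 3.0, scale=1.0)   # feature whose mean depends on the class

fig, ax = plt.subplots()
# Row index on the horizontal axis, feature value on the vertical axis,
# one color per class label.
ax.scatter(np.arange(len(x)), x, c=y, s=5)
ax.set_xlabel("row index")
ax.set_ylabel("feature value")
```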

### Feature Statistics

![feature-stats](img/feature-stats.png)

```python
df.describe()
x.mean()
x.std()
```

### Tools for individual feature exploration

Histograms
> `plt.hist(x)`

Plot (index versus value)
> `plt.plot(x, '.')`

Statistics
> `df.describe()`<br>`x.mean()`<br>`x.var()`

Other tools:
> `x.value_counts()`<br>`x.isnull()`
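
For a quick feel of what these calls return, here is a small sketch on a hypothetical `pandas` Series containing repeated and missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with repeated and missing values.
x = pd.Series([1.0, 2.0, 2.0, np.nan, 2.0, np.nan, 3.0])

counts = x.value_counts()      # frequency of each distinct value (NaN excluded by default)
n_missing = x.isnull().sum()   # number of missing entries
summary = x.describe()         # count, mean, std, min, quartiles, max
```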

## Exploring feature relations

![feature-relation-scatter](img/feature-relation-scatter.png)

* If the task is **classification**, it's convenient to **color code the points by their labels**, as in the picture above
  * The color indicates the class of the object
* For **regression**, heat-map-like coloring can be used, too
  * alternatively, the target value can be visualized by point size
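
Both coloring schemes can be sketched with `matplotlib`'s `scatter`. The features `x1`, `x2` and the targets `y_class`, `y_reg` below are synthetic and purely illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(size=300)
x2 = rng.uniform(size=300)

# Hypothetical targets for illustration.
y_class = (x1 + x2 > 1).astype(int)   # classification: binary label
y_reg = x1 * x2                       # regression: continuous target

fig, (ax_clf, ax_reg) = plt.subplots(1, 2, figsize=(10, 4))

# Classification: one color per class.
ax_clf.scatter(x1, x2, c=y_class, cmap="coolwarm", s=10)
ax_clf.set_title("colored by class")

# Regression: a continuous colormap plus point size growing with the target.
points = ax_reg.scatter(x1, x2, c=y_reg, cmap="viridis", s=5 + 40 * y_reg)
fig.colorbar(points, ax=ax_reg, label="target")
ax_reg.set_title("colored and sized by target")
```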
  
### We can effectively use scatter plots to check whether the data distributions in the train and test sets are the same.

![scatter-c1-c2-test](img/scatter-c1-c2-test.png)
* Test points have no labels, so they are colored gray
* The red points are mixed with some of the gray ones --- `good`
* But the other gray points lie in a region where we have no training data --- `bad`
  * **If you see a discrepancy between the distributions of the colored and gray points, you should probably stop and check whether you are doing it right**.
    * It could be a bug, a completely overfitted feature, or something else unhealthy.
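
A train/test overlay of this kind could be drawn roughly as below. The arrays `train`, `train_labels` and `test` are synthetic, with the test set intentionally covering a wider region. The bounding-box check at the end is an extra assumption of this sketch, not something from the notes.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: the test set covers a wider region than the training set.
train = rng.uniform(0.0, 1.0, size=(300, 2))
train_labels = (train[:, 0] > 0.5).astype(int)
test = rng.uniform(0.0, 2.0, size=(300, 2))

fig, ax = plt.subplots()
# Draw the unlabeled test points first, in gray, so the training points stay visible.
ax.scatter(test[:, 0], test[:, 1], color="gray", s=10, label="test")
ax.scatter(train[:, 0], train[:, 1], c=train_labels, cmap="coolwarm", s=10, label="train")
ax.legend()

# A crude numeric check: share of test points outside the training bounding box.
lo, hi = train.min(axis=0), train.max(axis=0)
outside = ((test < lo) | (test > hi)).any(axis=1).mean()
```

A large value of `outside` points to the same kind of discrepancy that the gray-only region shows in the plot.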
    
    
<br>

### Case 1
![scatter-feature1-feature2](img/scatter-f1-f2.png)

<center>**Feature Relation**</center>
$$X2 \leq 1 - X1$$

How do we use feature relations?
* For tree-based models, we can create new features such as the **difference or ratio between X1 and X2**.
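
Generating such features is a one-liner each in `pandas`. The table below is a hypothetical stand-in with columns named `X1` and `X2`, and the new column names are illustrative.

```python
import pandas as pd

# Hypothetical feature table with the two related columns.
df = pd.DataFrame({"X1": [0.2, 0.5, 0.7], "X2": [0.6, 0.5, 0.1]})

# Standard decision trees split on one feature at a time,
# so explicit combinations of two features can help them.
df["X1_minus_X2"] = df["X1"] - df["X2"]
df["X1_over_X2"] = df["X1"] / df["X2"]   # assumes X2 is never zero
```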

### Case 2
![scatter-class1-class2](img/scatter-c1-c2.png)

* It's hard to say what the true relation between the features is
* but after all, our goal is not to decode the data but **to generate new features and get a better score**.

#### This plot gives us an idea of how to generate new features from these two
* There are several triangles in the plot
  * so we could create a feature indicating which triangle a given point belongs to, and hope that this feature helps.


## Exploring feature relations: pairs/groups

When you have a small number of features, you can plot all the pairwise scatter plots at once using the scatter matrix function from `pandas`. It's pretty handy.

```python
pd.plotting.scatter_matrix(df)  # the top-level pd.scatter_matrix is gone in current pandas
```

![scatter-matrix](img/scatter-matrix.png)


We can also compute some kind of distance between the columns of our feature table and store the results in a matrix of size (number of features) × (number of features).


#### For example, we can compute the correlation between the columns.

```python
corr = df.corr()   # feature-by-feature correlation matrix
plt.matshow(corr)  # display the matrix as an image
```

![corr-between-counts](img/corr-counts.png)

We can run some kind of clustering like K-means clustering on the rows and columns of this matrix and reorder the features.
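
One possible way to do this reordering is sketched below. It assumes scikit-learn's `KMeans` as the clustering implementation (the notes name K-means but no library) and uses synthetic data with two interleaved groups of near-duplicate features. The column names and `n_clusters=2` are choices made for this toy example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: two groups of near-duplicate features, interleaved by column.
base_a, base_b = rng.normal(size=(2, 500))
df = pd.DataFrame({
    "a0": base_a + 0.1 * rng.normal(size=500),
    "b0": base_b + 0.1 * rng.normal(size=500),
    "a1": base_a + 0.1 * rng.normal(size=500),
    "b1": base_b + 0.1 * rng.normal(size=500),
})

corr = df.corr()

# Cluster the rows of the correlation matrix; features with similar
# correlation profiles receive the same cluster label.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(corr.values)

# Reorder rows and columns so that features from the same cluster are adjacent.
order = np.argsort(labels, kind="stable")
corr_clustered = corr.iloc[order, order]
```

Passing `corr_clustered` to `plt.matshow` then shows the groups as blocks along the diagonal.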


![corr-between-counts-clustered](img/corr-counts-clustered.png)


### This brings us to the last topic of our discussion, `feature groups`.

There are groups of very similar features!
* **It's a good idea to generate new features based on the groups.**
* Statistics calculated over a group can work well as features.
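
As a small sketch of group-based features, suppose the plots suggested that the hypothetical columns `f1`, `f2` and `f3` form one group. Row-wise statistics over that group can then be added as new columns.

```python
import pandas as pd

# Hypothetical table; suppose the plots suggested that f1, f2, f3 form one group.
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0],
    "f2": [1.5, 2.5, 2.0],
    "f3": [0.5, 3.0, 4.0],
    "other": [10.0, 20.0, 30.0],
})
group = ["f1", "f2", "f3"]

# Row-wise statistics over the group become new features.
df["group_mean"] = df[group].mean(axis=1)
df["group_std"] = df[group].std(axis=1)
df["group_max"] = df[group].max(axis=1)
```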


### Another visualization that helps to find feature groups is `statistics` of each feature
* Calculate stats for each column and plot them against column index.

```python
df.mean().plot(style='.')
df.std().plot(style='.')
```
![feature-groups-by-statistics](img/feature-grouped-bystat.png)

If we sort the values, we can clearly see some grouping.

![feature-groups-by-statistics-sorted](img/feature-grouped-bystat-sorted.png)

```python
df.mean().sort_values().plot(style='.')
```


#### And more...
* How many rows are there in which the value of the first feature is bigger than that of the second feature?
* How many distinct combinations do the features have in the dataset?

> "With such custom functions, we should build the matrix manually - and we can use the `matshow` function from `matplotlib`."
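
Building such matrices by hand could look like the sketch below. The feature table is synthetic, and the names `greater` and `combos` are illustrative. The double loop is kept deliberately simple and costs one pass over the data per pair of features.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical feature table.
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])

values = df.to_numpy()
n_features = values.shape[1]

# greater[i, j] = number of rows where feature i is bigger than feature j.
greater = np.zeros((n_features, n_features), dtype=int)
# combos[i, j] = number of distinct (feature i, feature j) value pairs.
combos = np.zeros((n_features, n_features), dtype=int)
for i in range(n_features):
    for j in range(n_features):
        greater[i, j] = int((values[:, i] > values[:, j]).sum())
        combos[i, j] = len(np.unique(values[:, [i, j]], axis=0))

# Display one of the custom matrices as an image.
im = plt.matshow(greater)
plt.colorbar(im)
```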



## Conclusion

### Explore individual features
* Histogram
* Plot (index vs. value)
* Statistics

### Explore feature relations
* Pairs
  - Scatter plot, scatter matrix
  - Corrplot
* Groups
  - Corrplot + clustering
  - Plot (index vs. feature statistics)