# Intro to Data Science @ SzISz Part II.
## Data discovery

### Table of contents
- <a href="#What-is-Data-Discovery?">Data Discovery Theory</a>
- <a href="#Let's-do-it-then!">Let's do it!</a>
- <a href="#What-about-this-dataset?">Example</a>
- <a href="#More-data!">More example</a>


### What is Data Discovery?
Data discovery is the process of looking into the data and trying to figure out:
- what is interesting in the data
- what one can do with it
- whether it needs extensive preprocessing

From <a href="https://en.wikipedia.org/wiki/Data_discovery#Definition">Wikipedia</a>:
> Data Discovery is a user-driven process of searching for patterns or specific items in a data set.
> Data Discovery applications use visual tools such as geographical maps, pivot-tables, and heat-maps
> to make the process of finding patterns or specific items rapid and intuitive. Data Discovery may 
> leverage statistical and data mining techniques to accomplish these goals.

### Why is it important?
It speeds up the whole process by giving you insight into:
- whether the data can be used at all
- the necessary preprocessing steps
- the possible algorithms
- the interesting data points
- which features to use

### Tools
Anything works. Two important factors:
- speed __->__ base statistics
- ease of understanding __->__ PLOTS-PLOTS-PLOTS!

### Let's do it then! 
Load the built-in iris dataset with sklearn's `load_iris` function and explore it! Hint: pass the `return_X_y=True` parameter to get the features and labels as arrays, create a `pandas.DataFrame` from them, then use the `pandas.DataFrame`'s `plot` method for plotting. You can try the `pandas.DataFrame`'s `describe` method as well.

#### Answer the following questions:
- What is the task to solve?
- Did anything interesting show up?
- What question should we ask about the dataset?
- How should we solve the task?
- What should we do as the first step of preprocessing?

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd

from sklearn.datasets import load_iris

- Load the dataset into a pandas DataFrame

In [None]:
# return_X_y must be passed by keyword in recent scikit-learn releases
X, y = load_iris(return_X_y=True)

df = pd.DataFrame(X)
df['label'] = y

df.head()
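The frame above has integer column names (0 to 3), which makes later plots harder to read. As a minimal sketch, assuming scikit-learn 0.23 or newer, `load_iris(as_frame=True)` returns the same data with named columns. The names `named_df` and `species` are introduced here for illustration only; the rest of this notebook keeps using `df`.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn for pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# iris.frame holds the four named feature columns plus an integer 'target' column
named_df = iris.frame.copy()

# Map the integer class codes to species names for easier reading
code_to_name = dict(enumerate(iris.target_names))
named_df['species'] = named_df['target'].map(code_to_name)

named_df.head()
```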

- Plot the data points

In [None]:
# x = column 1 (sepal width), y = column 3 (petal width), colored by class label
df.plot(x=1, y=3, kind='scatter', c='label', colormap='Set1')

- Generate basic statistics about the data

In [None]:
df.describe()

In [None]:
# Select the four feature columns explicitly, leaving out 'label'
sns.boxplot(data=df[list(range(4))])

- Generate basic statistics by labels

In [None]:
df.groupby('label').describe()
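The full `describe` output per label is wide. A more compact view, sketched here under the assumption that the mean and standard deviation are enough for a first look, uses `agg` on the same grouping. The name `summary` is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
df = pd.DataFrame(X)
df['label'] = y

# One row per label; columns form a (feature, statistic) MultiIndex
summary = df.groupby('label').agg(['mean', 'std'])
print(summary.round(2))
```

Comparing the means against the standard deviations gives a quick hint about which features separate the classes well.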

In [None]:
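fig, axs = plt.subplots(ncols=3, sharey="row", figsize=(16,6))

# One panel per class label, with the four feature columns in each
for i in range(3):
    sns.boxplot(data=df[df.label == i][list(range(4))], ax=axs[i])
    axs[i].set_title('label {}'.format(i))

# plt.show() avoids the non-GUI backend warning that fig.show() raises with inline plots
plt.show()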
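As an alternative to the three separate panels above, the data can be reshaped to long format so that one plot compares all labels side by side. This is a sketch only; the column names `feature` and `value` and the variable `long_df` are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
df = pd.DataFrame(X)
df['label'] = y

# melt turns the four feature columns into (feature, value) pairs,
# keeping the label next to every measurement
long_df = df.melt(id_vars='label', var_name='feature', value_name='value')
print(long_df.head())
```

Passing the result to `sns.boxplot(data=long_df, x='feature', y='value', hue='label')` should then draw the per-label boxes for every feature in a single figure.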

- Plot every feature against each other!

In [None]:
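fig, ax = plt.subplots(nrows=4, ncols=4, sharex="col", sharey="row", figsize=(12,12))

# Row i shows feature i on the y axis and column j shows feature j on the x axis,
# so the shared axes match the data they display
for i, row in enumerate(ax):
    for j, cell in enumerate(row):
        # 'Set1' keeps all three labels visible; 'hot' would draw the last label in white
        cell.scatter(df[j], df[i], c=df['label'], cmap='Set1')
        cell.set_title('x: {} - y: {}'.format(j, i))

plt.show()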

- Generate the correlation matrix

In [None]:
df.corr()

In [None]:
sns.heatmap(df.corr())
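Note that `df.corr()` above also includes the `label` column. To rank only the feature pairs by the strength of their correlation, one possible sketch keeps the upper triangle of the feature-only matrix and sorts it by absolute value. The names `features`, `mask` and `pairs` are illustrative, and `sort_values(key=...)` assumes pandas 1.1 or newer.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
features = pd.DataFrame(X)

corr = features.corr()

# Upper triangle without the diagonal, so every pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to a Series indexed by (feature, feature) and sort by |correlation|
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Strongly correlated features carry overlapping information, which is worth knowing before choosing which features to use.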

- Dealing with missing values and outliers

In [None]:
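filtered = df
outliers = pd.DataFrame()
# A value counts as an outlier when it lies 2 or more standard deviations from its column mean
for i in range(4):
    upper_thres = df[i].mean() + 2 * df[i].std()
    lower_thres = df[i].mean() - 2 * df[i].std()
    # Keep only the rows that are inside the band for every feature seen so far
    filtered = filtered[(upper_thres > filtered[i]) & (lower_thres < filtered[i])]
    # Collect the rows that fall outside the band for this feature
    outliers = pd.concat((outliers, df[(upper_thres <= df[i]) | (lower_thres >= df[i])]))

# A row that is extreme in several features was appended once per feature; keep one copy
outliers = outliers[~outliers.index.duplicated()]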
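The two-standard-deviation band above is one possible rule. Another common choice is the 1.5 × IQR rule, which is the convention box plot whiskers usually follow, so it flags the points drawn beyond the whiskers. The helper `iqr_outlier_mask` below is a hypothetical function written for this sketch, not part of pandas.

```python
import pandas as pd
from sklearn.datasets import load_iris


def iqr_outlier_mask(frame, k=1.5):
    """Return a boolean frame that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR] of its own column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column thresholds with the frame's columns
    return frame.lt(q1 - k * iqr, axis=1) | frame.gt(q3 + k * iqr, axis=1)


X, y = load_iris(return_X_y=True)
features = pd.DataFrame(X)

mask = iqr_outlier_mask(features)
print(mask.sum())  # number of flagged values per feature

# Rows flagged in at least one feature
iqr_outliers = features[mask.any(axis=1)]
```

Unlike the mean and standard deviation, quartiles are barely affected by the extreme values themselves, which is why the two rules can disagree.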

In [None]:
sns.heatmap(filtered.corr())

In [None]:
# Same axes as the first scatter plot; 'Set1' keeps every label visible on a white background
outliers.plot(x=1, y=3, kind='scatter', c='label', colormap='Set1')
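The cells above handle outliers only. The iris data has no missing entries, so the sketch below uses a small toy frame with made-up values to show how missing values could be counted and then either dropped or filled. The column names `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing measurement
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "label": [0, 0, 1, 1],
})

# Count the missing values in every column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each feature column with its own median
filled = toy.fillna(toy[["a", "b"]].median())
```

Dropping is the simplest option but loses data; filling keeps every row at the cost of inventing values, so the choice depends on how many entries are missing.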

---