# Intro to Data Science @ SzISz Part II.
## Data discovery

### Table of contents
- <a href="#What-is-Data-Discovery?">Theory</a>
- <a href="#Let's-do-it-then!">Action I.</a>
- <a href="#What-about-this-dataset?">Action II.</a>
- <a href="#More-data!">Action III.</a>


### What is Data Discovery?
Data discovery is the process of looking into the data and trying to figure out:
- what is interesting in it
- what you can do with it
- whether it needs extensive preprocessing

From <a href="https://en.wikipedia.org/wiki/Data_discovery#Definition">Wikipedia</a>:
> Data Discovery is a user-driven process of searching for patterns or specific items in a data set.
> Data Discovery applications use visual tools such as geographical maps, pivot-tables, and heat-maps
> to make the process of finding patterns or specific items rapid and intuitive. Data Discovery may 
> leverage statistical and data mining techniques to accomplish these goals.

### Why is it important?
It speeds up the whole process by giving you insight into:
- whether the data can be used at all
- the necessary preprocessing steps
- the applicable algorithms
- the interesting data points

### Tools
Anything works. Two factors matter most:
- speed __->__ base statistics
- ease of understanding __->__ PLOTS-PLOTS-PLOTS!

### Let's do it then!
The data is in `'../data/misterious.csv'`. Read it with `pandas.read_csv`, then plot it with the `plot` method of `pandas.DataFrame` (hint: press shift-tab inside the parentheses to see the docstring). You can try the `describe` method of `pandas.DataFrame` as well.

#### Answer the following questions:
- What question should we ask?
- What can be the task to solve?
- How?
- Did anything interesting show up?
- What should we do as the first step of preprocessing?
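
A minimal sketch of the read, describe, plot loop described above. The real columns of `'../data/misterious.csv'` are not known here, so the example reads a small made-up CSV from a string; the column names `a`, `b`, `c` and all values are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file; names and values are made up.
csv_text = """a,b,c
1.0,2.0,0
1.5,2.5,0
3.0,0.5,1
3.5,1.0,1
"""

# With the real file this would be: pd.read_csv('../data/misterious.csv')
df = pd.read_csv(io.StringIO(csv_text))

# Base statistics: count, mean, std, min, quartiles and max per numeric column.
summary = df.describe()
print(summary)

# Quick look: one line per column, plotted against the row index.
ax = df.plot()
```

`describe` is the fast path to base statistics, while `plot` gives the first visual impression; both are cheap enough to run on every new dataset.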

In [None]:
import numpy as np
import pandas as pd

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

### Plot every feature against each other!
Hints:
- <a href="http://matplotlib.org/examples/pylab_examples/subplots_demo.html">`subplots`</a>
- <a href="http://matplotlib.org/examples/shapes_and_collections/scatter_demo.html">`scatterplots`</a>
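
One possible way to combine the two hints is a grid of subplots with a scatter plot in every off-diagonal cell. The sketch below uses randomly generated data with hypothetical column names `f0`, `f1`, `f2`, since the real dataset is not assumed here; the histogram on the diagonal is a design choice, not a requirement of the exercise.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 50 rows, 3 numeric features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f0", "f1", "f2"])

n = df.shape[1]
fig, axes = plt.subplots(n, n, figsize=(3 * n, 3 * n), squeeze=False)

for i, y_col in enumerate(df.columns):
    for j, x_col in enumerate(df.columns):
        ax = axes[i, j]
        if i == j:
            # A feature against itself carries no information, show its distribution.
            ax.hist(df[x_col], bins=10)
        else:
            ax.scatter(df[x_col], df[y_col], s=10)
        if i == n - 1:
            ax.set_xlabel(x_col)
        if j == 0:
            ax.set_ylabel(y_col)

fig.tight_layout()
```

`pandas.plotting.scatter_matrix` and `seaborn.pairplot` produce a similar grid in one call; writing the loop by hand mostly helps to understand what those helpers do.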

### What was the data about?
Let's <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set" style="color: black; text-decoration: none; cursor: default;">find out</a>! This time use the data file `'../data/i.csv'`.<a href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html" style="color: white; text-decoration: none; cursor: default;">.. or just use this :D</a>
- use `sklearn`'s <a href="http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">`DecisionTreeClassifier`</a> on it:
   - import sklearn's `DecisionTreeClassifier` model!
   - read the data
   - init the model
   - fit the model
   - predict on the data
   - display the resulting tree (see: <a href="http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html">export_graphviz</a> then <a href="http://www.webgraphviz.com/">this tool</a>, result should be similar to <a href="http://scikit-learn.org/stable/_images/iris.svg">this</a>)
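
The steps above can be sketched as follows. This version assumes the data is loaded through `sklearn.datasets.load_iris` instead of the CSV file, and that the fitted classifier is applied with `predict`; `random_state=0` is only there to make the result repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Read the data (assumption: the bundled copy instead of '../data/i.csv').
iris = load_iris()
X, y = iris.data, iris.target

# Init and fit the model.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Apply the fitted model to the same data it was trained on.
y_hat = clf.predict(X)
accuracy = (y_hat == y).mean()
print("training accuracy:", accuracy)

# With out_file=None the DOT description of the tree is returned as a string,
# which can be pasted into a Graphviz renderer.
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
)
print(dot_source[:200])
```

Note that the accuracy printed here is measured on the training data, so it says little about how the tree would perform on unseen flowers.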

#### Story time: scikit-learn's interface

The `fit`-`transform`-`predict` principle:

Every sklearn estimator has a `fit` method and, depending on the estimator's role,  
a __`transform`__ (+`fit_transform`) method, e.g. scalers and vectorizers,  
or a __`predict`__ (+`fit_predict`, mainly for clustering) method, e.g. classifiers and regressors.

For example:
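```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X, y)           # learn from the inputs X and the targets y
y_hat = clf.predict(X)  # predicted targets for X
```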
Where:
- `X` is always the input data  
- `y` is the target data
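
The classifier example shows the `predict` side of the interface. A transformer follows the same pattern with `transform`; the sketch below uses `StandardScaler` on a tiny made-up matrix as one illustration (the choice of scaler and the numbers are assumptions, any transformer would do).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical input: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Two-step version: learn the column means and deviations, then apply them.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# One-step shortcut doing both on the same data.
X_scaled_again = StandardScaler().fit_transform(X)

print(X_scaled)
```

No `y` is needed here, since scaling only looks at the input data.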

---

### What about this dataset?
No mysteries this time. Let's use the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html">20 newsgroups</a> dataset!

#### Questions:
- What can we see from the raw data?
- What should we do with it?
- How can we visualize the data?
- What question should we answer from this?

### How should we represent texts? a.k.a. Basics of Text Mining

#### Create document vectors!
- Split the documents into words
- Count the occurrences of each word
- Each word is a feature, its count is the value -> We've got a vector!

#### Write a function which returns the document as a list of words!
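
One possible solution, as a sketch: the function name `tokenize`, the regular expression and the two sample sentences are all assumptions made for illustration. The second half shows how the word lists turn into count vectors over a shared vocabulary.

```python
import re
from collections import Counter


def tokenize(document):
    """Lowercase the text and return its words as a list."""
    return re.findall(r"[a-z0-9']+", document.lower())


# Hypothetical mini corpus.
docs = ["The cat sat on the mat.", "The dog sat."]
tokens = [tokenize(doc) for doc in docs]

# Every distinct word becomes one feature (one vector position).
vocabulary = sorted({word for doc_tokens in tokens for word in doc_tokens})

# Each document becomes a vector of word counts over the vocabulary.
vectors = []
for doc_tokens in tokens:
    counts = Counter(doc_tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

The regular expression is deliberately crude: it ignores punctuation and accents, which is enough to see the idea but not enough for real text.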

### There must be a better way of doing this!

`Scikit-learn` is here to save the day again (this won't be the last time!):  
Let's use <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">`sklearn.feature_extraction.text.CountVectorizer`</a>
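
A minimal sketch of `CountVectorizer` on a two-sentence toy corpus (the sentences are made up; on the real data the list of documents would come from `fetch_20newsgroups(...).data`). With default settings the vectorizer lowercases the text and keeps tokens of at least two characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus standing in for the newsgroup posts.
docs = ["The cat sat on the mat.", "The dog sat."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # scipy sparse matrix: documents x words

# vocabulary_ maps each word to its column index.
words = sorted(vectorizer.vocabulary_)

print(X.shape)
print(words)
print(X.toarray())
```

The result is stored as a sparse matrix because most documents contain only a tiny fraction of the vocabulary; `toarray()` is fine for a toy example but can exhaust memory on a real corpus.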

### Let's try to analyze/visualize the documents this time!

But first, check the shape of that matrix!  
What should we do? (hint: docstring!)

To visualize, use the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html">`t-SNE`</a> method!
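
A sketch of one possible pipeline, under the assumption that the wide, sparse count matrix is first compressed with `TruncatedSVD` and only then passed to t-SNE. The twelve short documents are made up, and the parameter values (`n_components=5`, `perplexity=3`) are chosen only so that the example runs on such a tiny corpus.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

# Hypothetical mini corpus; the real input would be the newsgroup posts.
docs = [
    "the rocket reached orbit after launch",
    "nasa plans a new space launch",
    "the satellite orbit was adjusted",
    "astronauts trained for the space mission",
    "the team won the hockey game",
    "the goalie saved the final shot",
    "hockey players skate on ice",
    "the game went to overtime",
    "the new graphics card renders faster",
    "install the driver for the video card",
    "the image format supports compression",
    "render the scene with better graphics",
]

counts = CountVectorizer().fit_transform(docs)
print(counts.shape)  # few rows, many columns, mostly zeros

# Step 1: compress the sparse counts into a few dense components.
reduced = TruncatedSVD(n_components=5, random_state=0).fit_transform(counts)

# Step 2: embed into 2D; perplexity must stay below the number of documents.
tsne = TSNE(n_components=2, perplexity=3, init="random", random_state=0)
embedding = tsne.fit_transform(reduced)

plt.scatter(embedding[:, 0], embedding[:, 1])
```

On the full dataset, larger values for both the number of SVD components and the perplexity are likely more appropriate, and coloring the points by newsgroup makes the plot easier to read.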

---

### More data!

Let's look into our first <a href="https://www.kaggle.com/c/job-salary-prediction">Kaggle dataset</a> and find out as much as we can!