# Data

What is the problem?
- Objective
- Current solution

Problem category
- Supervised
    - Classification
    - Regression
- Unsupervised
- Reinforcement

Problem scale
- Batch learning
- Online learning

## Get the Data

Popular open data repositories:
- [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php)
- [Kaggle Datasets](https://www.kaggle.com/datasets)

## Visualize the Data

Quick look at the data (with `df` a pandas `DataFrame`)
- `df.head()`
- `df.info()`
- `df.describe()`
- `df['feature'].value_counts()`
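
A minimal sketch of these inspection calls, assuming the data lives in a pandas `DataFrame` named `df`. The toy columns (`rooms`, `price`, `region`) are hypothetical and only there to make the example runnable.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "rooms": [3, 4, 2, 5, 3],
    "price": [210.0, 340.0, 150.0, None, 220.0],
    "region": ["north", "south", "north", "east", "south"],
})

print(df.head())      # first five rows
df.info()             # dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics for the numeric columns

# Frequency of each category in a single column
region_counts = df["region"].value_counts()
print(region_counts)
```

`info()` is usually the quickest way to spot missing values: here `price` reports one fewer non-null entry than the other columns.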

## Prepare the Data

Generate training and test sets
- Add a unique identifier to each instance
- `train_test_split(data, test_size=0.2, random_state=42)`
- `StratifiedShuffleSplit()`
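
The two splitting strategies above can be sketched as follows. The `data` frame and its `category` column are made up for illustration; the point is that a stratified split keeps the class proportions of `category` identical in the training and test sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Hypothetical dataset with an imbalanced categorical column (80% "a", 20% "b").
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "feature": rng.normal(size=100),
    "category": np.repeat(["a", "b"], [80, 20]),
})

# Purely random split: 80% train, 20% test, reproducible via random_state.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

# Stratified split: sample so that each category keeps its overall proportion.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(data, data["category"]))
strat_train_set = data.iloc[train_idx]
strat_test_set = data.iloc[test_idx]

print(strat_test_set["category"].value_counts(normalize=True))
```

For small or imbalanced datasets the random split can over- or under-represent a category by chance, which is the usual reason to prefer the stratified variant.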

Feature union

Data cleaning
- Fix missing values
    - Get rid of the corresponding districts (rows)
    - Get rid of the whole attribute (column)
    - Set the missing values to some value (zero, mean, median)
    - `SimpleImputer(strategy='median')` (the older `Imputer` class has been removed from scikit-learn)
- Handling text and categorical attributes
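    - `LabelEncoder()`: maps each category to an integer; however, ML algorithms will assume that two nearby values are more similar than two distant values, which rarely holds for unordered categories.
    - One-hot encoding: `OneHotEncoder()`
    - Apply both transformations in one shot: `LabelBinarizer()` (designed for target labels; recent scikit-learn versions let `OneHotEncoder()` handle string categories directly)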
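
A short sketch of the cleaning steps above, assuming a recent scikit-learn where median imputation is provided by `SimpleImputer` and `OneHotEncoder` accepts string categories directly. The `rooms` and `region` columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with one missing numeric value and one text attribute.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 2.0, 5.0],
    "region": ["north", "south", "north", "east"],
})

# Replace missing numeric values with the column median (here: 3.0).
imputer = SimpleImputer(strategy="median")
rooms_filled = imputer.fit_transform(df[["rooms"]])

# One-hot encode the categorical column; the output is sparse by default,
# so convert it to a dense array for display.
encoder = OneHotEncoder()
region_onehot = encoder.fit_transform(df[["region"]]).toarray()

print(rooms_filled.ravel())
print(encoder.categories_[0])  # learned categories, in sorted order
print(region_onehot)
```

Fit the imputer and the encoder on the training set only, then reuse the fitted objects to transform the test set, so that no information leaks from test to train.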

Feature scaling
- *min-max scaling*
- *standardization*
- `StandardScaler()`
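
To make the difference between the two scaling schemes concrete, here is a small sketch on a made-up column that contains one outlier. `MinMaxScaler` is not named in the list above; it is scikit-learn's transformer for min-max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature; the last value is an outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (x - min) / (max - min), result bounded to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, result has zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

print(X_minmax.ravel())
print(X_std.ravel())
```

With min-max scaling the outlier squeezes the other four values into a narrow band near 0, whereas standardization is not bounded and tends to be less affected by a single extreme value.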

# Training

## Select & Train a Model

Supervised learning
- Regression
    - `LinearRegression()`
    - `DecisionTreeRegressor()`
    - `RandomForestRegressor()`
- Classification
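
The three regressors listed above share the same `fit` / `predict` interface, so they can be tried side by side. The sketch below uses synthetic data (an assumption, not part of these notes) with a known linear relationship.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: y depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model and predict the first five training instances.
predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X[:5])
    print(name, predictions[name])
```

The unconstrained decision tree reproduces the training targets exactly, which is a sign of overfitting rather than of a good model; error measured on the training set alone is not a reliable guide.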

Unsupervised Learning

## Fine-tune the Model

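Choose a performance measure: $\ell_k$ norm

- $k=1$: *Mean Absolute Error (MAE)*, corresponding to the *Manhattan norm*, `mean_absolute_error()`
- $k=2$: *Root Mean Square Error (RMSE)*, corresponding to the *Euclidean norm*; take the square root of `mean_squared_error()`
- The higher the norm index, the more it focuses on large values and neglects small ones, so RMSE is more sensitive to outliers than MAE.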
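
A small worked example of the two measures, using made-up targets and predictions. Note that `mean_squared_error()` returns the mean squared error, so the square root is taken explicitly to obtain the RMSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions; the errors are 1, 0, -2 and 4.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 3.0])

mae = mean_absolute_error(y_true, y_pred)            # (1 + 0 + 2 + 4) / 4 = 1.75
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # sqrt((1 + 0 + 4 + 16) / 4) = sqrt(5.25)

print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```

The single error of 4 contributes 16 of the 21 squared-error units, which is why the RMSE (about 2.29) ends up well above the MAE (1.75).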

Cross validation
- `cross_val_score()`
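
A minimal sketch of K-fold cross-validation with `cross_val_score()`, on synthetic data assumed for illustration. scikit-learn scoring functions follow a "greater is better" convention, so the MSE is returned negated and has to be flipped before taking the square root.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (assumption for the sake of a runnable example).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold.
scores = cross_val_score(
    DecisionTreeRegressor(random_state=42),
    X,
    y,
    scoring="neg_mean_squared_error",
    cv=5,
)
rmse_scores = np.sqrt(-scores)  # convert negated MSE into RMSE per fold

print("RMSE per fold:", rmse_scores)
print("Mean:", rmse_scores.mean(), "Std:", rmse_scores.std())
```

Reporting the standard deviation alongside the mean gives a rough idea of how precise the estimate is.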

Grid search
- `GridSearchCV()`
- `RandomizedSearchCV()`
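
Both search strategies can be sketched on a small random forest. The hyperparameter values and the synthetic data below are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic regression problem (assumption for the sake of a runnable example).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=150)

# Grid search: evaluate every combination (2 x 2 = 4 candidates).
param_grid = {"n_estimators": [10, 30], "max_features": [1, 2]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)
print("Grid search best:", grid_search.best_params_)

# Randomized search: sample a fixed number of candidates (n_iter) instead.
param_distributions = {"n_estimators": [10, 20, 30, 40], "max_features": [1, 2]}
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=3,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
random_search.fit(X, y)
print("Randomized search best:", random_search.best_params_)
```

Randomized search is usually preferable when the search space is large, since `n_iter` caps the compute budget regardless of how many values each hyperparameter can take.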

Ensemble Methods
- Combine the models that perform best
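
One possible way to combine models, sketched here with scikit-learn's `VotingRegressor`, which averages the predictions of its members. The choice of this class, of the two member models, and of the synthetic data are assumptions; the notes above do not prescribe a specific ensemble method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (assumption for the sake of a runnable example).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Average the predictions of a linear model and a random forest.
ensemble = VotingRegressor([
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
])
ensemble.fit(X, y)

ensemble_pred = ensemble.predict(X[:3])
member_preds = [est.predict(X[:3]) for est in ensemble.estimators_]

print("Members: ", member_preds)
print("Ensemble:", ensemble_pred)
```

Ensembles tend to help most when the individual models make different kinds of errors.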

Feature weights
- `feature_importances_`
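
A brief sketch of reading `feature_importances_` from a fitted random forest. The data is synthetic and constructed so that only the first feature carries signal; the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on feature 0 only; 1 and 2 are pure noise.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=300)
feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X, y)

# Importances are normalized to sum to 1; list them from most to least important.
importances = forest.feature_importances_
for name, score in sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Features with consistently negligible importance are candidates for removal, which can simplify the model and speed up training.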

## Save the Model

```python
import joblib  # standalone package; `sklearn.externals.joblib` was removed from scikit-learn

# Persist the fitted model to disk
joblib.dump(my_model, "my_model.pkl")
# and later, reload it
my_model_loaded = joblib.load("my_model.pkl")
```

# Evaluate on the Test Set

Present your solution
- What you have learned
- What worked and what did not
- What assumptions were made
- What your system's limitations are