# Intro to Machine Learning in Python

This guide covers the very basics of running machine learning algorithms in Python. 


## Assumptions

This guide assumes your data is stored in [CSV files](https://en.wikipedia.org/wiki/Comma-separated_values). `pandas`, which this guide uses, can also read other structured formats, but those are beyond the scope of this guide.

It is also assumed that you already understand the terminology around machine learning. If you are comfortable with the following terms:

* table
* row
* column
* data frame
* vector
* feature
* dependent variable
* independent variable
* model
* fit
* prediction

... you will be fine. Otherwise, take a look at [How To Talk About Data in Machine Learning (Terminology from Statistics and Computer Science)](https://machinelearningmastery.com/data-terminology-in-machine-learning/) for a quick primer.

## Which Python version should I use?

There are two major versions of Python: Python 2 and Python 3. Python 2 reached end of life on January 1, 2020, so Python 3 is the default choice, but some production systems still force users onto one version or the other. My recommendation is this:

*Use **Python 2** if any of the following are true*:
* You know you will be shipping code to a production environment that does not support Python 3.
* You are using a specific library that is only available in Python 2.

Otherwise, use Python 3.

All examples in this document are in Python 3, but the Python 2 versions are almost identical.

## The basic pattern

The simplest pattern to use for machine learning is:

1. **Ingest**: import the data into a local data structure
1. **Groom**: clean and reshape the data into the schema the algorithm expects
1. **Split**: break the data into a training set and a testing set
1. **Select**: pick an algorithm appropriate for the data and the situation
1. **Fit**: build a model of the data using the selected algorithm
1. **Predict**: compute new results from the model
1. **Display**: show a range of predictions from the model

## Libraries

This guide uses:

* `sklearn` (published on PyPI as `scikit-learn`) for all machine learning algorithms
* `pandas` to help import and groom data
* `matplotlib` for displaying graphical representations of the output

```bash
$ pip install scikit-learn pandas matplotlib
```

## Ingest

Loading a CSV file is easy with `pandas`:

```python
import pandas as pd
dataset = pd.read_csv(filename)  # filename: path to your CSV file
```

`dataset` is now a data frame containing the full contents of your input file, with the type of each column inferred from its values.

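For example, to load a file named `data.csv` from the current working directory, pass its path as a string: `dataset = pd.read_csv('data.csv')`.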
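
To see the type inference in action without a file on disk, here is a minimal sketch that reads a small in-memory CSV. The column names and values are made up for illustration, and `io.StringIO` stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a real file on disk
csv_text = """age,salary,purchased
25,50000.0,yes
32,64000.5,no
47,81000.0,yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# pandas infers a type for each column from its values
print(dataset.dtypes)
print(dataset.shape)  # (3, 3)
```

Checking `dataset.dtypes` right after loading is a cheap way to catch columns that were read as text when you expected numbers.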

## Groom

### Dependent and independent variables

For a supervised learning algorithm to work, the data needs to be split into a table holding the independent variables and a column holding the dependent variable. If you are using all columns of your dataset, and if the last column is the dependent variable and the other columns are your independent variables, then you just need to do this:

```python
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
```
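
As a rough illustration of what those two lines produce, the sketch below applies them to a tiny hand-built data frame. The column names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical data: two feature columns followed by the target in the last column
dataset = pd.DataFrame({
    "rooms": [2, 3, 4],
    "area": [50.0, 75.0, 120.0],
    "price": [100.0, 150.0, 240.0],
})

X = dataset.iloc[:, :-1].values  # every column except the last
y = dataset.iloc[:, -1].values   # only the last column

print(X.shape)  # (3, 2): three rows, two features
print(y.shape)  # (3,): one target value per row
```

`X` is a two-dimensional array and `y` is one-dimensional, which is the shape most `sklearn` estimators expect.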

`pandas` makes it very simple to re-build your table in a variety of ways and to extract just the columns you want or need for your specific purpose. For instance, if you *only* wanted the columns at positions 2 and 3 (counting from zero) as independent variables from your input file, and if your dependent variable was in column 0, then the above could be re-written as:

```python
X = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 0].values
```
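
Positional slices are easy to get wrong when columns are reordered, and the end of a slice is exclusive (`2:4` picks positions 2 and 3). As an alternative sketch, the same selection can be made by column name. The data frame and its column names here are hypothetical.

```python
import pandas as pd

dataset = pd.DataFrame({
    "label": [0, 1, 0],
    "id": [101, 102, 103],
    "height": [1.6, 1.8, 1.7],
    "weight": [60.0, 80.0, 70.0],
})

# Positional selection, as in the snippet above
X_pos = dataset.iloc[:, 2:4].values
y_pos = dataset.iloc[:, 0].values

# Equivalent selection by column name
X_named = dataset[["height", "weight"]].values
y_named = dataset["label"].values

print((X_pos == X_named).all())  # True
print((y_pos == y_named).all())  # True
```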

### Missing data



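Many `sklearn` algorithms cannot handle missing values, so gaps need to be filled in before fitting. One simple approach is to replace each missing value with the mean of its column:

```python
import numpy as np
from sklearn.impute import SimpleImputer
start, end = 0, X.shape[1]  # column range to impute (end exclusive); narrow it to your numeric columns
# Replace each NaN in the selected columns with that column's mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, start:end] = imputer.fit_transform(X[:, start:end])
```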
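
Here is a small worked example of mean imputation, assuming a recent scikit-learn release where the imputer class is `SimpleImputer` in `sklearn.impute`. The array values are made up so the result is easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with two missing entries
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Replace each NaN with the mean of the non-missing values in its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imputer.fit_transform(X)

print(X)
# Column 0: the mean of 1 and 3 is 2; column 1: the mean of 10 and 20 is 15
```

Mean imputation only makes sense for numeric columns; for categorical columns, `strategy="most_frequent"` is one alternative.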

## Split
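
A minimal sketch of this step, assuming `train_test_split` from `sklearn.model_selection` is used to do the split. The data, the 80/20 ratio, and the `random_state` value are illustrative choices, not recommendations from this guide.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

The model is fitted only on the training set, so the testing set can show how well it handles rows it has not seen.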

## Select

## Fit
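
As a sketch of what fitting looks like, the example below assumes linear regression was the selected algorithm. That choice is arbitrary and only for illustration; other `sklearn` estimators expose the same `fit` method. The training data is made up and follows `y = 2x + 1` exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data generated from y = 2x + 1 (no noise)
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

# Fitting estimates the slope and intercept from the training data
model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_, model.intercept_)  # approximately [2.] and 1.0
```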

## Predict
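
A minimal sketch of prediction, again assuming a linear regression model and made-up data that follows `y = 2x + 1`. The held-out rows in `X_test` play the role of the testing set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 2x + 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])
X_test = np.array([[4.0], [5.0]])
y_test = np.array([9.0, 11.0])

model = LinearRegression().fit(X_train, y_train)

# predict() takes a 2-D array of features and returns one value per row
y_pred = model.predict(X_test)
print(y_pred)  # approximately [ 9. 11.]

# For regressors, score() reports R^2 on the given data; 1.0 is a perfect fit
print(model.score(X_test, y_test))
```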

## Display
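
One way to show a range of predictions is to plot the fitted model over the observed data with `matplotlib`. The sketch below assumes a single feature and a linear regression model; the data is randomly generated, and the output file name `predictions.png` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy data around y = 2x + 1
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = 2 * X.ravel() + 1 + rng.normal(scale=1.0, size=30)

model = LinearRegression().fit(X, y)

# Predict over an evenly spaced grid to draw the fitted line
X_grid = np.linspace(0, 10, 100).reshape(-1, 1)
y_grid = model.predict(X_grid)

fig, ax = plt.subplots()
ax.scatter(X, y, label="observations")
ax.plot(X_grid, y_grid, color="red", label="model predictions")
ax.set_xlabel("feature")
ax.set_ylabel("target")
ax.legend()
fig.savefig("predictions.png")  # hypothetical output file name
```

In an interactive session or notebook, drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving to a file.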

## References

* [Machine Learning A-Z™: Hands-On Python & R](https://www.udemy.com/machinelearning/learn/v4/overview)
  * Most of the code in this repo was inspired by what I learned in that course. I have adapted it heavily for easier re-use and readability.

* [How To Talk About Data in Machine Learning (Terminology from Statistics and Computer Science)](https://machinelearningmastery.com/data-terminology-in-machine-learning/)

* [pandas](https://pandas.pydata.org/)

* [scikit-learn (`sklearn`)](https://scikit-learn.org/)

* [Matplotlib](https://matplotlib.org/)
