In this work, we'll cover the basics of data analysis and preprocessing, and solve a classification problem with one of the simplest machine learning algorithms, __kNN__, using real data on 537,577 Black Friday transactions in retail stores. The data contains the attributes described in the table:

| Column name | Description |
|---|---|
| User_ID | Customer identifier |
| Product_ID | Product identifier |
| Gender | Gender of customer |
| Age | Age range |
| Occupation | Customer's occupation (coded) |
| City_Category | City category (A,B,C) |
| Stay_In_Current_City_Years | Years in city |
| Marital_Status | Marital status |
| Product_Category_1 / Product_Category_2 / Product_Category_3 | Product category codes (main category and additional categories the product may belong to) |
| Purchase | Purchase amount, $ |

# Introduction to Machine Learning in Python

Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.

A core objective of a learner is to generalize from its experience. Generalization in this context is the ability of a learning machine to perform accurately on new, unseen examples/tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences) and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.

## Working with data

Let's build a model that predicts the age category from the purchase amount.

### Imports and configurations

The primary data analysis will be carried out with the __Pandas__ package, one of the data-oriented Python packages, which simplifies importing and analyzing data. __Pandas__ works on top of two popular libraries, `numpy` for numerical computation and `matplotlib` for plotting, which makes data manipulation and visualization more convenient.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Reading data

The input data is stored in a file with the `.csv` extension.

Data from a CSV file can be read with the `read_csv` function. By default, fields are assumed to be separated by commas.

In [2]:
blackFriday_df = pd.read_csv('BlackFriday.csv')
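
To try `read_csv` without the dataset file at hand, the sketch below parses a few comma-separated rows from an in-memory string. The names `sample_csv` and `sample_df` are made up for illustration, and the rows are a small excerpt in the same layout as the data; the real file is read exactly as in the cell above.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the same comma-separated layout as the file.
sample_csv = io.StringIO(
    "User_ID,Product_ID,Gender,Age,Purchase\n"
    "1000001,P00069042,F,0-17,8370\n"
    "1000001,P00248942,F,0-17,15200\n"
    "1000002,P00285442,M,55+,7969\n"
)

# sep="," is the default; it is spelled out here only to make the assumption visible.
sample_df = pd.read_csv(sample_csv, sep=",")

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # column types inferred by pandas
```

If a file uses another delimiter, such as a semicolon, the same `sep` parameter is the place to change it.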

Reading a CSV file with pandas produces an object called a __DataFrame__, which consists of rows and columns.

In [3]:
type(blackFriday_df)

pandas.core.frame.DataFrame

To display the first and last rows, use the `head` and `tail` methods.

Look at the data:

In [4]:
# code here

Analyze the loaded data. To do this, use the `describe` method.

In [5]:
# code here
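
As one possible way to approach the two cells above, here is a minimal sketch of `head`, `tail`, and `describe` on a tiny made-up frame. `demo_df` and its values are assumptions standing in for `blackFriday_df`; the same calls apply to the real data.

```python
import pandas as pd

# Hypothetical miniature frame standing in for blackFriday_df.
demo_df = pd.DataFrame({
    "Age": ["0-17", "0-17", "55+", "26-35", "26-35", "26-35"],
    "Purchase": [8370, 15200, 7969, 1422, 1057, 5000],
})

first_rows = demo_df.head(2)  # first two rows (the default is five)
last_rows = demo_df.tail(2)   # last two rows
summary = demo_df.describe()  # count, mean, std, min, quartiles, max

print(first_rows)
print(last_rows)
print(summary)
```

By default `describe` summarizes only the numeric columns, so the string column `Age` does not appear in `summary`.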

### Data picking

We don't need all the source data, so we have to weed out the extra columns.

Keep only the customer ID, product ID, gender, age, and purchase amount for further analysis.

In [6]:
blackFriday_df = blackFriday_df[['User_ID', 'Product_ID', 'Gender', 'Age', 'Purchase']]

blackFriday_df.head()

| | User_ID | Product_ID | Gender | Age | Purchase |
|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 7969 |


Look at the source data (column `Age`). The number of occurrences of each unique value can be viewed with the `value_counts` method.

In [7]:
# code here
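
A minimal sketch of `value_counts` on a made-up series follows. The series `ages` is an assumption for illustration; on the real data the call would be made on the `Age` column of the frame.

```python
import pandas as pd

# Hypothetical age categories standing in for blackFriday_df['Age'].
ages = pd.Series(["0-17", "26-35", "26-35", "55+", "26-35", "0-17"], name="Age")

# Number of rows per distinct value, sorted from most to least frequent.
age_counts = ages.value_counts()

print(age_counts)
print(ages.nunique())  # number of distinct categories
```

`value_counts` answers "how many rows fall into each category", while `nunique` answers "how many categories are there".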

## Data preprocessing

Let's visualize the source data.

Build a histogram of the distribution of purchase amounts using `plot` with the parameter `kind='hist'`.

In [8]:
# code here

Next, build a bar chart of the distribution of age categories using `plot` with the parameter `kind='bar'`.

In [9]:
# code here
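
The two plots could be sketched as follows on a small made-up frame. `demo_df`, the bin count, and the titles are illustrative assumptions; the `Agg` backend line only lets the sketch run without a display and is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature frame standing in for blackFriday_df.
demo_df = pd.DataFrame({
    "Age": ["0-17", "0-17", "55+", "26-35", "26-35", "26-35"],
    "Purchase": [8370, 15200, 7969, 1422, 1057, 5000],
})

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: purchase amounts are grouped into 5 equal-width bins.
demo_df["Purchase"].plot(kind="hist", bins=5, ax=ax_hist, title="Purchase amount")

# Bar chart: one bar per age category, with height equal to the number of rows.
age_counts = demo_df["Age"].value_counts().sort_index()
age_counts.plot(kind="bar", ax=ax_bar, title="Age category")

fig.tight_layout()
```

A histogram suits the numeric `Purchase` column, while the categorical `Age` column first has to be counted per category before it can be drawn as bars.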

Data for training classification models must be in numeric format.

Make sure that the source data has type `int` or `float`.

In [10]:
blackFriday_df['Age'].dtype

dtype('O')

In [11]:
blackFriday_df['Purchase'].dtype

dtype('int64')

As you can see, the age data is stored as strings (dtype `object`), so it has to be preprocessed first. To do this, you can use the `preprocessing` module from the `sklearn` library.

In [12]:
from sklearn import preprocessing

In [13]:
# code here
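
One way to turn the age strings into numbers is `LabelEncoder` from `sklearn.preprocessing`. This is a sketch of one option, not necessarily the intended solution, and the series `ages` is a made-up stand-in for the `Age` column.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical age categories standing in for blackFriday_df['Age'].
ages = pd.Series(["0-17", "55+", "26-35", "0-17", "26-35"], name="Age")

encoder = preprocessing.LabelEncoder()
# fit_transform learns the sorted set of labels and maps each one to an integer.
age_codes = encoder.fit_transform(ages)

print(list(encoder.classes_))  # labels in the order of their integer codes
print(list(age_codes))
print(encoder.inverse_transform([2]))  # map a code back to its label
```

The encoder keeps the mapping in `classes_`, so predicted codes can later be translated back into age ranges with `inverse_transform`.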

## Solving the classification task using the kNN algorithm

Remember that we need to build a model to predict the age category depending on the total cost of purchases.

kNN (k Nearest Neighbors) is one of the simplest classification algorithms. Classification in machine learning is the task of assigning an object to one of a set of predetermined classes on the basis of its formalized features.

To classify each of the test sample objects, you must perform the following operations sequentially:

* Calculate the distance to each of the objects of the training sample.
* Select k objects of the training sample, the distance to which is minimal.
* The class of the object being classified is the class most often found among the k nearest neighbors.
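
The three steps above can be sketched directly in NumPy. This is an illustrative toy version, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the sample points are made up, and `sklearn` provides its own implementation.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every training object.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k training objects with the smallest distance.
    nearest = np.argsort(distances)[:k]
    # Step 3: the most frequent class among those k neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical one-feature training sample with two classes.
X_train = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1])))
print(knn_predict(X_train, y_train, np.array([5.1])))
```

Choosing an odd `k` avoids tied votes when there are two classes; with more classes ties remain possible, and this sketch resolves them in favor of the class encountered first among the neighbors.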

[Learn more](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)

### Imports and configurations

The `sklearn` library contains the classes and functions needed for training and testing models.

In [14]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

### Training the model

Create a model that implements the kNN algorithm with 3 neighbors using `sklearn`.

In [15]:
# code here

For training, the model needs an array of features (X) and an array of classes (y). __Important__: the data must be passed as __arrays__, and X must be two-dimensional (one row per object, one column per feature).

In [16]:
# code here
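
A possible way to build the two arrays from a frame is sketched below. `demo_df` and the column name `Age_code` (the numerically encoded age) are assumptions for illustration, not names from the dataset.

```python
import pandas as pd

# Hypothetical frame: Age_code stands for the already encoded age category.
demo_df = pd.DataFrame({
    "Purchase": [8370, 15200, 1422, 7969],
    "Age_code": [0, 0, 0, 2],
})

# Double brackets keep X two-dimensional: one row per object, one column per feature.
X = demo_df[["Purchase"]].to_numpy()
# Single brackets give a one-dimensional array of class labels.
y = demo_df["Age_code"].to_numpy()

print(X.shape, y.shape)
```

Estimators in `sklearn` expect `X` with shape `(n_samples, n_features)`, which is why a single feature still needs the double brackets.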

Then, to build the training and test samples from the source data, we can use the `train_test_split` function from `scikit-learn`.

A common split between the training and test samples is __70% / 30%__.

In [17]:
# code here
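
A minimal sketch of a 70% / 30% split on made-up arrays follows. The data, `random_state=42`, and the use of `stratify` are illustrative choices rather than requirements of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 objects with one feature and two balanced classes.
X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)

# test_size=0.3 sends 30% of the objects to the test sample.
# random_state fixes the shuffle so the split is reproducible.
# stratify=y keeps the class proportions similar in both samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Keeping the test sample out of training is what makes the later accuracy estimate a measure of generalization rather than memorization.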

Then train the model by calling its `fit` method.

In [18]:
# code here

### Model accuracy test

We can use the model's `predict` method to predict the age category from the purchase amounts in `X_test`.

In [19]:
# code here

To test the accuracy, you can use the functions of the `metrics` module, for example `accuracy_score`.

In [20]:
# code here
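
Putting the steps of this section together, a complete workflow might look like the sketch below. The data is synthetic and deliberately well separated, so the resulting score says nothing about the Black Friday dataset; all variable names and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: two groups of "purchase amounts", far apart from each other.
rng = np.random.default_rng(0)
low = rng.normal(1000, 100, 50)
high = rng.normal(9000, 100, 50)
X = np.concatenate([low, high]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = KNeighborsClassifier(n_neighbors=3)  # kNN with 3 neighbors
model.fit(X_train, y_train)                  # kNN "training" stores the training sample
y_pred = model.predict(X_test)               # classes predicted for unseen objects

# Share of test objects whose predicted class matches the true class.
accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)
```

To look for a better model, the same loop can be repeated with different values of `n_neighbors` or with additional features, comparing `accuracy` each time.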

_Try to improve the accuracy of the model._

### Homework

Analyze data about the age and physical characteristics of molluscs using this [dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data)

| Original name | Name of column |
|---|---|
| M |sex |
| x0.455 | length |
| x0.365 | diameter |
| x0.095 | height |
| x0.514 | whole_weight |
| x0.2245 | shucked_weight |
| x0.101 | viscera_weight |
| x0.15 | shell_weight |
| x15 | rings |
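
The "original names" in the table look like the values of the file's first data row, which suggests the file has no header line. Under that assumption, the columns can be named while reading, as in this sketch. It parses a single row taken from the table from an in-memory string; `column_names`, `sample_row`, and `abalone_df` are made-up names.

```python
import io

import pandas as pd

# Column names from the right-hand column of the table above.
column_names = [
    "sex", "length", "diameter", "height", "whole_weight",
    "shucked_weight", "viscera_weight", "shell_weight", "rings",
]

# One data row reconstructed from the "Original name" column of the table.
sample_row = io.StringIO("M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n")

# header=None tells pandas that the first line is data, not column names.
abalone_df = pd.read_csv(sample_row, header=None, names=column_names)

print(abalone_df)
```

For the real file, the in-memory string would be replaced by the dataset path or URL given above.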

Try to train a model on your own that captures the dependencies between characteristics of your choice from the following list:

* Diameter
* Height
* Full weight

## Conclusion

kNN is one of the simplest classification algorithms, so it is often ineffective in real-world problems. Besides classification accuracy, another problem of this classifier is classification speed: if there are N objects in the training sample, M objects in the test sample, and K is the dimension of the feature space, then classifying the test sample can take O(K * M * N) operations. Nevertheless, the kNN workflow is a good starting point for exploring machine learning.