# SCIKIT-LEARN
[Official documentation](https://scikit-learn.org/stable/)

***
 

In [35]:
##  Alt + Enter - Insert cell below

# Scikit-learn Machine Learning module
import sklearn as sk

# Efficient numerical arrays
import numpy as np

# Plotting
import matplotlib.pyplot as plt

# Working with DataFrames
import pandas as pd

# Part 1 - Overview of the Scikit-learn library

## Scikit-learn

Scikit-learn is a Python machine learning library.
In general, a machine learning problem takes a set of n samples of data and then tries to predict properties of unknown data. [1](#section)

Scikit-learn interoperates with NumPy and SciPy for numerical operations, as well as with Matplotlib for plotting and Pandas for data manipulation.

Three main techniques of Machine Learning can be highlighted: classification, regression and clustering. [2](#section)

### 1. Classification

Identifying which category an object belongs to.
Typical applications are spam detection and image recognition.
Example algorithms include logistic regression, support vector machines, random forests and nearest neighbors.
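As a rough illustration of classification, the sketch below trains a nearest neighbors classifier on a handful of made-up points. The data, the labels `"small"` and `"large"`, and the choice of `n_neighbors=3` are assumptions for the example only.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data, made up for illustration: one feature, two categories
X_train = [[1.0], [1.2], [0.8], [5.0], [5.3], [4.9]]
y_train = ["small", "small", "small", "large", "large", "large"]

# Each new sample gets the majority label of its 3 closest training samples
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

predictions = clf.predict([[1.1], [5.1]])
print(predictions)
```

The output of `predict` is a category label for each new sample, not a number on a continuous scale.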

### 2. Regression

Predicting a continuous-valued attribute associated with an object.
Typical applications are predicting drug response or stock prices.
Example algorithms include linear regression, nearest neighbors and random forests.
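A minimal regression sketch, assuming synthetic data generated from the line y = 2x + 1 (the data is invented for this example): a linear regression model is fitted and then asked for the value at an unseen point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, made up for illustration: y = 2x + 1 with no noise
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0

reg = LinearRegression()
reg.fit(X, y)

# Predict a continuous value for an input the model has not seen
predicted = reg.predict([[5.0]])
print(reg.coef_, reg.intercept_, predicted)
```

Unlike a classifier, the model returns a continuous number, and the fitted `coef_` and `intercept_` should be close to the slope and offset used to generate the data.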

### 3. Clustering

Automatic grouping of similar objects into sets.
Typical applications are customer segmentation and grouping experiment outcomes.
Example algorithms include k-Means and spectral clustering.
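Clustering can be sketched in the same way. The example below assumes six invented 2-D points that form two obvious groups and asks k-Means to find them; no labels are given to the algorithm. The values of `n_init` and `random_state` are arbitrary choices made for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data, made up for illustration: two well-separated groups of points
points = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],
])

# Ask for two clusters; no target labels are passed to the algorithm
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing the same number.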

# Part 2 - Demonstrations of three scikit-learn algorithms.

## Algorithm 1 - <span style="color:red">*Logistic regression*</span>

Despite its name, logistic regression is a linear model for <b>classification</b> rather than regression.  [3](#section)

We are going to use the iris dataset to see logistic regression at work. The data set ships with scikit-learn, so it can be loaded directly from the library.
Let us load the data set and look at it before we do any prediction.

In [64]:
# Importing the dataset loader
from sklearn.datasets import load_iris
data = load_iris()

In [65]:
# Printing the whole dataset object (a dictionary-like container of arrays and metadata)
print(data)

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
     

In [78]:
# Let us print it as a dataframe with added column headings 
df = pd.DataFrame(data['data'], columns=data['feature_names'])
print(df)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns]


In [79]:
# We are adding another column - 'species'.
# It holds the known class label (0, 1 or 2) of each sample; the model will learn to predict it
df['species'] = data['target']

In [74]:
print(df)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  5.1               3.5                1.4               0.2   
1                  4.9               3.0                1.4               0.2   
2                  4.7               3.2                1.3               0.2   
3                  4.6               3.1                1.5               0.2   
4                  5.0               3.6                1.4               0.2   
..                 ...               ...                ...               ...   
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8   

     species  
0          0

In [81]:
# Let us show only the first 10 rows for a neater view
df.head(10)

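|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0 |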
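The `species` column stores integer codes. As a small sketch, the codes can be translated into readable names using the `target_names` entry of the loaded dataset. The column name `species_name` is a hypothetical choice made for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['species'] = data['target']

# Build a lookup from integer code to name, e.g. 0 -> 'setosa'
code_to_name = dict(enumerate(data['target_names']))

# 'species_name' is a hypothetical column added only for readability
df['species_name'] = df['species'].map(code_to_name)
print(df[['species', 'species_name']].drop_duplicates())
```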


So, the first four columns hold the measurements (features) that the model will analyze, and the fifth column, `species`, holds the known class label that the model will learn to predict.
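One possible way to carry out the prediction is sketched below. It is a minimal example under stated assumptions, not the notebook's own continuation: the 75/25 train/test split, `random_state=0` and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()
X = data['data']      # the four measurement columns
y = data['target']    # the species codes 0, 1 and 2

# Hold back a quarter of the samples so the model is scored on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# max_iter is raised (an assumption) to give the solver room to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predicted = model.predict(X_test)        # predicted species codes
accuracy = model.score(X_test, y_test)   # fraction of correct predictions
print(predicted)
print(accuracy)
```

Comparing `predicted` with `y_test` shows how often the model recovers the known species from the measurements alone.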

## References


<a id='section'></a>
 [1. Machine learning: the problem setting](https://scikit-learn.org/stable/tutorial/basic/tutorial.html "Press to check the reference source")
 
 [2. Scikit-learn official documentation](https://scikit-learn.org/stable/index.html "Press to check the reference source")
 
 [3. Logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression "Press to check the reference source")

***
## End
