# Scikit-Learn

- Scikit-Learn is a package that provides efficient versions of a large number of common ML algorithms. 
- Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation.
- A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.


## Data Representation in Scikit-Learn

- The data describing the samples can be thought of as a two-dimensional numerical array, called the features matrix.
- By convention, this matrix is stored in a variable named X. 
- The features matrix is assumed to be two-dimensional, with shape [n_samples, n_features], and is most often contained in a NumPy array or a Pandas DataFrame.
    - The samples (i.e., rows) always refer to the individual objects described by the dataset.
    - The features (i.e., columns) refer to the distinct observations that describe each sample in a quantitative manner.
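
As a small illustration of the layout described above, the sketch below loads the iris dataset and shows the features matrix both as a NumPy array and as a Pandas DataFrame. The variable name `X_df` is chosen only for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Features matrix as a NumPy array: one row per sample, one column per feature
X = iris.data
print(X.shape)  # (150, 4)

# The same matrix as a DataFrame, with the feature names as column labels
X_df = pd.DataFrame(X, columns=iris.feature_names)
print(X_df.head())
```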

## Target array
- The target array, called y, is usually one-dimensional, with length n_samples, and is generally contained in a NumPy array or Pandas Series.
- The target array may have continuous numerical values, or discrete classes/labels.
- The target array is usually the quantity we want to predict from the data.
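
To make the two kinds of target concrete, the sketch below compares a discrete target (iris class labels) with a continuous one (the diabetes dataset bundled with scikit-learn). The names `y_class` and `y_reg` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_iris

# Discrete target: one integer class label per sample
y_class = load_iris().target
print(y_class.shape, np.unique(y_class))  # (150,) [0 1 2]

# Continuous target: one real-valued measurement per sample
y_reg = load_diabetes().target
print(y_reg.shape, y_reg.dtype)
```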




## Scikit-Learn’s Estimator API

### Estimator objects
- An estimator is any object that learns from data.
    - It may be a classification, regression, or clustering algorithm, or a transformer that extracts/filters useful features from raw data.
- Fitting data: the main API implemented by scikit-learn is that of the estimator.
    - All estimator objects expose a fit method that takes a dataset (usually a 2-d array):
    
<tt> >>> estimator.fit(data)</tt>

- Estimator parameters: All the parameters of an estimator can be set when it is instantiated or by modifying the corresponding attribute:

<tt> >>> estimator = Estimator(param1=1, param2=2) </tt>
- Estimated parameters: When an estimator is fitted to data, its parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object whose names end with an underscore:
    
    <tt> >>> estimator.estimated_param_ </tt>
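
The three ideas above (fitting, estimator parameters, estimated parameters) can be sketched with a concrete estimator. `LinearRegression` and the toy data are chosen here only for illustration; any other estimator follows the same pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line y = 2x + 1 (made up for this example)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

# Estimator parameter: chosen by the user at instantiation
model = LinearRegression(fit_intercept=True)
print(model.get_params()["fit_intercept"])

# Fitting: parameters are estimated from the data
model.fit(X_toy, y_toy)

# Estimated parameters: attribute names end with an underscore
print(model.coef_, model.intercept_)  # approximately [2.] and 1.0
```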


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
import sklearn as sk

In [None]:
from sklearn import datasets

In [None]:
iris = datasets.load_iris()

In [None]:
iris

In [None]:
X = iris.data

In [None]:
y = iris.target

In [None]:
y

In [None]:
X

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier()

In [None]:
knn.fit(X,y)

In [None]:
knn.predict(X)

In [None]:
len(X)

In [None]:
# Train on the first 100 samples only; the remaining 50 are held out
knn.fit(X[:100], y[:100])

In [None]:
knn.predict(X[100:])

In [None]:
result=knn.predict(X[100:])

In [None]:
y[100:]-result

In [None]:
# Count each difference between true and predicted labels; 0 means a correct prediction
unique, counts = np.unique(y[100:] - result, return_counts=True)
dict(zip(unique, counts))
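
The poor result of this split has a simple cause: the iris samples are stored ordered by class, so the first 100 rows contain only classes 0 and 1, and the last 50 contain only class 2. A quick check, using the illustrative name `y_all`:

```python
import numpy as np
from sklearn.datasets import load_iris

y_all = load_iris().target

# Classes seen during training on the first 100 samples
print(np.unique(y_all[:100]))  # [0 1]

# Classes present in the held-out last 50 samples
print(np.unique(y_all[100:]))  # [2]
```

A classifier trained this way never sees class 2, so it cannot predict it. Shuffling the samples before splitting avoids the problem.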

In [None]:
y

In [None]:
# Random permutation of the row indices, used to shuffle the samples before splitting
indice = np.random.permutation(len(X))

In [None]:
indice

In [None]:
X_train = X[indice[:100]]
y_train = y[indice[:100]]
X_test = X[indice[100:]]
y_test = y[indice[100:]]

In [None]:
knn.fit(X_train,y_train)

In [None]:
result=knn.predict(X_test)

In [None]:
y_test - result

In [None]:
unique, counts = np.unique((y_test-result), return_counts=True) 
dict(zip(unique, counts))
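
The manual shuffle, split, and error count above can also be written with scikit-learn's own helpers. The sketch below uses `train_test_split` and `accuracy_score`; the `random_state=0` seed and the stratified split are assumptions made here for reproducibility, not part of the original cells.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Shuffle and split in one call: 100 training samples, 50 test samples,
# keeping the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=0, stratify=y
)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Fraction of test samples whose predicted label matches the true label
acc = accuracy_score(y_test, knn.predict(X_test))
print(acc)
```

Accuracy summarizes in one number what the dictionary of label differences shows: the share of zeros among the differences.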

In [None]:
# Use only the single nearest neighbor (the default is n_neighbors=5)
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
knn.fit(X_train,y_train)

In [None]:
result=knn.predict(X_test)

In [None]:
result

In [None]:
unique, counts = np.unique((y_test-result), return_counts=True) 
dict(zip(unique, counts))

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtc = DecisionTreeClassifier()

In [None]:
dtc.fit(X_train,y_train)
result=dtc.predict(X_test)

In [None]:
y_test - result

In [None]:
unique, counts = np.unique((y_test-result), return_counts=True) 
dict(zip(unique, counts))
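
Because every estimator exposes the same `fit` and `predict` methods, the comparison carried out cell by cell above can be condensed into a loop. This is only a sketch: the `models` and `scores` names, the fixed seed, and the use of the `score` method (mean accuracy for classifiers) are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=0, stratify=y
)

# Different algorithms, same interface
models = {
    "knn": KNeighborsClassifier(n_neighbors=1),
    "tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the given test data
    scores[name] = model.score(X_test, y_test)

print(scores)
```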

## 1) Create a DataFrame containing the iris values
## 2) Create two new attributes representing the petal area and the sepal area
## 3) Use the new DataFrame to make the prediction
## 4) Use a classifier not yet introduced (Bayes?)
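
One possible way to approach the four steps is sketched below. It rests on several assumptions: the "area" is approximated as length times width, the column names are made up for readability, the split uses a fixed seed, and `GaussianNB` is just one candidate for the "Bayes" classifier.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# 1) DataFrame with the four original measurements (short column names are our own)
df = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)

# 2) Two derived attributes; length * width is a rough rectangular approximation
df["sepal_area"] = df["sepal_length"] * df["sepal_width"]
df["petal_area"] = df["petal_length"] * df["petal_width"]

# 3) Use the extended DataFrame as the features matrix
X_train, X_test, y_train, y_test = train_test_split(
    df, iris.target, test_size=50, random_state=0, stratify=iris.target
)

# 4) A classifier not used earlier in the notebook: Gaussian naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
nb_score = nb.score(X_test, y_test)
print(nb_score)
```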