# SCIKIT-LEARN

## LINKS

- https://scikit-learn.org/stable/tutorial/index.html
- https://www.tutorialspoint.com/scikit_learn/index.htm

- Also known as "sklearn"
- Aids in Machine Learning and Statistical Modeling
    - Classification
    - Regression
    - Clustering
    - Dimensionality Reduction
- Sklearn is built upon
    - NumPy
    - SciPy
    - Matplotlib

https://www.tutorialspoint.com/scikit_learn/scikit_learn_introduction.htm

Sklearn focuses on modeling data rather than on loading, manipulating, or summarizing it.

Groups of models and other features provided by sklearn include:
- Supervised Learning Algorithms
- Unsupervised Learning Algorithms
- Clustering
- Cross-Validation
- Dimensionality Reduction
- Ensemble Methods
- Feature Extraction
- Open Source
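
As a rough illustration of a few of the groups listed above, the sketch below runs an ensemble method under cross-validation, then applies dimensionality reduction followed by clustering. The particular estimators (`RandomForestClassifier`, `PCA`, `KMeans`) and their parameter values are assumptions chosen for the example, not something these notes prescribe.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, Y = load_iris(return_X_y=True)

# Supervised learning + ensemble method + cross-validation:
# score a random forest on 5 different train/validation folds.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(forest, X, Y, cv=5)
print("Accuracy per fold:", scores)

# Dimensionality reduction: project the 4 features onto 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)
print("Reduced shape:", X_2d.shape)

# Unsupervised learning / clustering: group the samples without using Y.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)
print("Cluster label of first 5 samples:", labels[:5])
```

Every estimator follows the same pattern (`fit`, then `predict` or `transform`), which is why the three steps look so similar.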

## MODELING PROCESS

### DATASET LOADING

<pre>
<font color='green'>

A dataset is a collection of data.
Each dataset has two components:
- Features : the input variables of the data. AKA predictors, inputs, attributes.
    - Feature Matrix : the collection of features, one row per sample and one column per feature.
    - Feature Names : the list of the names of the features.
- Response : the output variable that depends on the feature variables. AKA target, label, output.
    - Response Vector : the response column. Generally, there is only one response column.
    - Target Names : the possible values that the response vector can take.

</font>
</pre>

In [23]:
# Loading the "iris" dataset

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
Y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature names :", feature_names)
print("Target names :", target_names)
print("\nFirst 10 rows of X:\n", X[:10])

Feature names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names : ['setosa' 'versicolor' 'virginica']

First 10 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
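
To tie the dataset terms above to the actual arrays, the following minimal sketch checks the shape of the feature matrix and the response vector, and maps the integer responses back to their target names. The variable names are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # feature matrix: one row per sample, one column per feature
Y = iris.target    # response vector: one integer class index per sample

print(X.shape)     # (150, 4)
print(Y.shape)     # (150,)

# One feature name per column of X
print(len(iris.feature_names) == X.shape[1])

# The response vector stores indices into target_names
first_labels = iris.target_names[Y[:3]]
print(first_labels)
```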


### SPLITTING THE DATASET

In [25]:
# Split the iris dataset 70:30: 70% for training and the remaining 30% for testing

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
Y = iris.target

# random_state=1 makes the shuffled split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(105, 4)
(45, 4)
(105,)
(45,)
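
A plain random split does not guarantee that each class appears in the same proportion in both parts. As a hedged variation on the split above, the sketch below passes `stratify=Y` to `train_test_split` so that class proportions are preserved; using stratification here is an assumption for illustration and is not part of the original cell.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
Y = iris.target

# stratify=Y keeps the class balance of Y in both the training and testing parts
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1, stratify=Y
)

# Count how many samples of each class landed in each part
train_counts = np.bincount(Y_train)
test_counts = np.bincount(Y_test)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

Because iris has 50 samples per class, the stratified split places an equal number of each class in the training set and in the testing set.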
