### What is Scikit-learn?
- Scikit-learn is an open-source Python library that simplifies the process of building machine learning models.
- It offers a clear and consistent interface that helps both beginners and experienced users work efficiently.
- Supports tasks like classification, regression, clustering and preprocessing.
- Makes model building fast and reliable.
- Provides ready-to-use tools for training and evaluation.
- Reduces complexity by avoiding manual implementation of algorithms.

### Steps to Build a Model with scikit-learn
- Load the dataset.
- Preprocess the data.
- Split the data into training and testing sets.
- Train a model and make predictions.
- Evaluate the model's accuracy.

### Why Choose scikit-learn?
- It supports easy algorithm switching and hyperparameter tuning.
- It offers tools for preprocessing, training and evaluation.
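
The two claims above can be illustrated with a rough sketch. It assumes the built-in Iris data and two arbitrarily chosen classifiers; the `n_neighbors` grid is an illustrative assumption, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Algorithm switching: every estimator exposes the same fit/score methods,
# so swapping the model does not change the surrounding code.
models = {
    "logistic_regression": LogisticRegression(max_iter=200),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(name, "test accuracy:", scores[name])

# Hyperparameter tuning: try several n_neighbors values with
# 5-fold cross-validation on the training data only.
param_grid = {"n_neighbors": [1, 3, 5, 7]}  # illustrative grid (assumption)
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
```

Each step used here (loading, splitting, training, scoring) is explained in the sections that follow.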

### Step 1: Loading Dataset
#### What is a dataset?
- A dataset is a collection of data. It has the following two components:
- 1.) Features (X) : The input variables of the data. They are also known as inputs, attributes, or predictors.
  - (a.) Feature Matrix : The collection of all features, in case there is more than one.
  - (b.) Feature Names : The list of the names of all the features.
- 2.) Response (y) : The output variable, which depends on the feature variables. It is also known as the target, label or output.
  - (a.) Response Vector : Represents the response column. Generally we have just one response column.
  - (b.) Target Names : The possible values taken by the response vector.

### Scikit-learn provides built-in datasets such as Iris and Digits (Boston Housing was removed in version 1.2). Using the Iris dataset:
- load_iris() loads the data.
- X stores the feature data.
- y stores the target labels.
- feature_names and target_names give descriptive names.

In [5]:
from sklearn.datasets import load_iris
# Load the Iris dataset (a dictionary-like Bunch object)
iris = load_iris()
X = iris.data      # feature matrix
y = iris.target    # response vector
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature Names: ",feature_names)
print("Target Names : ",target_names)
print("\n Type of X is : ",type(X))
print("\n First 5 Rows of X: \n",X[:5])

Feature Names:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target Names :  ['setosa' 'versicolor' 'virginica']

 Type of X is :  <class 'numpy.ndarray'>

 First 5 Rows of X: 
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
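
To connect the output above to the terms defined earlier (feature matrix, response vector, target names), here is a small sketch that inspects the shapes and maps the integer labels back to their names. The variable `first_labels` is introduced only for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Feature matrix: one row per flower, one column per feature.
print("Feature matrix shape:", X.shape)    # (150, 4)
# Response vector: one integer label per flower.
print("Response vector shape:", y.shape)   # (150,)

# target_names is a NumPy array, so the integer labels can index it directly.
first_labels = iris.target_names[y[:5]]
print("First 5 labels:", first_labels)
```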


### Step 2: Splitting the Dataset
- To check the accuracy of the model, we split the dataset into two parts.
- Training set : Used to train the model.
- Testing set : Used to test the model.
- Using train_test_split() : we split the Iris dataset so that 70% is used for training and 30% for testing (test_size=0.3); random_state=1 ensures reproducibility.
- test_size : The fraction of the data reserved for testing. For example, test_size=0.3 with the 150 rows of X gives 150*0.3=45 test rows.
- random_state : Fixes the seed of the shuffle so that the split is the same on every run.
- After splitting, we get:
- X_train, y_train -> Training Data.
- X_test, y_test  -> Testing Data.

In [7]:
from sklearn.model_selection import train_test_split
# Hold out 30% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print("X_train Shape:",  X_train.shape)
print("X_test Shape:", X_test.shape)
print("Y_train Shape:", y_train.shape)
print("Y_test Shape:", y_test.shape)

X_train Shape: (105, 4)
X_test Shape: (45, 4)
Y_train Shape: (105,)
Y_test Shape: (45,)
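
The effect of random_state described above can be checked with a short sketch: two calls with the same seed should return identical arrays. The `stratify=y` argument at the end is an optional extra that this notebook does not use; it is shown only as one way to keep the class proportions equal in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two splits with the same random_state produce the same four arrays.
split_a = train_test_split(X, y, test_size=0.3, random_state=1)
split_b = train_test_split(X, y, test_size=0.3, random_state=1)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits with the same random_state:", same_split)

# Optional: stratify=y preserves the class balance in the test set.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
class_counts = np.bincount(y_test_strat)
print("Test rows per class with stratify=y:", class_counts)
```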


### Step 3: Handling Categorical Data
- Most ML algorithms work only with numerical inputs, so categorical (text) data must be converted into numbers.
- If the categories are not encoded properly, models can misinterpret them, for example by reading an order into arbitrary integer codes.

### Scikit-learn provides multiple encoding methods :
#### 1. Label Encoding:
- It converts each category into a unique integer.
- For example, a column with the categories 'cat', 'dog' and 'bird' is converted to the integers 0, 1 and 2. LabelEncoder assigns them in sorted order, so 'bird' -> 0, 'cat' -> 1 and 'dog' -> 2.
- Integer codes suit categories that have a meaningful order, such as "Low", "Medium" and "High". Note that LabelEncoder assigns codes alphabetically rather than by that order, and scikit-learn intends it for target labels; OrdinalEncoder is the counterpart for ordered input features.
#### LabelEncoder():
- Creates an encoder object that converts categorical values into numerical values.
#### fit_transform():
- This method first fits the encoder to the categorical data and then transforms the categories into numerical values.


In [8]:
from sklearn.preprocessing import LabelEncoder
categorical_feature = ['Dog','Cat','Cat','Dog']
encoder = LabelEncoder()
# Learn the categories and convert them to integer codes in one step
encoder_feature = encoder.fit_transform(categorical_feature)
print('Encoder Feature:',encoder_feature)

Encoder Feature: [1 0 0 1]
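
To see which integer stands for which category, and what happens with an ordered scale, the following sketch may help. The `sizes` list is a made-up example, and OrdinalEncoder is swapped in here as an alternative that this notebook does not otherwise use.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["Dog", "Cat", "Cat", "Dog"])

# classes_ lists the categories in sorted order; the position is the code.
print("Classes:", encoder.classes_)                  # ['Cat' 'Dog']
# inverse_transform maps the codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print("Decoded:", decoded)

# LabelEncoder sorts alphabetically, so an ordered scale loses its order:
sizes = ["Low", "High", "Medium", "Low"]             # hypothetical data
alphabetical_codes = LabelEncoder().fit_transform(sizes)
print("Alphabetical codes:", alphabetical_codes)     # [1 0 2 1]

# OrdinalEncoder accepts an explicit category order (expects 2D input).
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered_codes = ordinal.fit_transform([[s] for s in sizes]).ravel()
print("Ordered codes:", ordered_codes)               # [0. 2. 1. 0.]
```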


#### 2. One-Hot Encoding:
- It creates a separate binary column for each category. This is useful when the categories do not have any natural ordering.
- For example: cat, dog, bird -> 3 columns (cat/dog/bird) filled with 1s and 0s.
- The input must be reshaped into a 2D array.
- OneHotEncoder(sparse_output=False) generates the binary columns as a dense NumPy array.

In [11]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
categorical_feature = ['Dog','Cat','Bird','Cat','Dog','Dog']
# OneHotEncoder expects a 2D array: one row per sample, one column per feature
categorical_feature = np.array(categorical_feature).reshape(-1,1)
encoder = OneHotEncoder(sparse_output=False)
encoder_feature = encoder.fit_transform(categorical_feature)
print('Onehotencoder:\n',encoder_feature)

Onehotencoder:
 [[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]


### Step 4: Training the Model
- Now that our data is ready, it's time to train an ML model.
- Scikit-learn has many ML algorithms with a consistent interface for training, evaluation and prediction.
- Here we use Logistic Regression :
- log_reg = LogisticRegression(max_iter=200) : Creates a logistic regression classifier object that runs at most 200 solver iterations.
- log_reg.fit(X_train, y_train) : Fits the model, adjusting its parameters to best fit the training data.

In [12]:
from sklearn.linear_model import LogisticRegression
# max_iter caps the number of solver iterations
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
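
To make "adjusting its parameters" more concrete, here is a small sketch that inspects what fit() learned. It repeats the loading and splitting steps so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

# Attributes ending in an underscore are set by fit().
print("Classes seen during training:", log_reg.classes_)   # [0 1 2]
# One row of weights per class, one column per feature.
print("Coefficient shape:", log_reg.coef_.shape)           # (3, 4)
# One intercept per class.
print("Intercept shape:", log_reg.intercept_.shape)        # (3,)
```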

### Step 5: Make Predictions
- Once trained, we use the model to make predictions on X_test by calling the predict method.
- log_reg.predict(X_test) : Uses the trained logistic regression model to predict a class label for each row of the test data X_test.

In [14]:
y_pred = log_reg.predict(X_test)
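
The same predict method also works on new measurements. The sketch below assumes a single hypothetical flower (`new_flower`) and shows how to turn the predicted integer into a species name; predict_proba is included as a way to see the per-class probabilities.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1
)
log_reg = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Hypothetical measurement: sepal length, sepal width, petal length, petal width (cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]

# predict returns an array with one integer label per input row.
predicted_label = log_reg.predict(new_flower)[0]
predicted_name = iris.target_names[predicted_label]
print("Predicted species:", predicted_name)

# predict_proba returns one probability per class; each row sums to 1.
probabilities = log_reg.predict_proba(new_flower)
print("Class probabilities:", probabilities.round(3))
```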

### Step 6: Evaluating Model Accuracy
- Check how well the model performs by comparing y_test with y_pred.
- Here we use the accuracy_score function from the metrics module, which returns the fraction of predictions that match the true labels.

In [15]:
from sklearn import metrics
print('\n Logistic Regression Accuracy Score :\n',metrics.accuracy_score(y_test, y_pred))


 Logistic Regression Accuracy Score :
 0.9777777777777777
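
As a rough check of what accuracy_score computes, the sketch below reproduces it by hand and adds a confusion matrix, which shows where the misclassifications fall. The confusion matrix goes beyond what this notebook uses and is shown only as an optional extra.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
log_reg = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Accuracy is the fraction of test rows whose prediction matches the true label.
manual_accuracy = np.mean(y_pred == y_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Manual accuracy  :", manual_accuracy)
print("accuracy_score   :", accuracy)

# Rows are true classes, columns are predicted classes;
# the diagonal counts the correct predictions.
cm = metrics.confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)
```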
