# Scikit-Learn

- Scikit-learn provides simple and efficient tools for pre-processing and predictive modeling.

**Pre-processing**
- Imputing missing values
- Encoding categorical variables
- Scaling/normalizing data
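
As a rough illustration of the encoding and scaling steps listed above, here is a minimal sketch using `OneHotEncoder` and `StandardScaler` from `sklearn.preprocessing`. The `toy` DataFrame and its column names are hypothetical and are not taken from the data set used later in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical toy data, made up for this sketch
toy = pd.DataFrame({
    "size": ["Small", "Medium", "Small", "High"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# encode the categorical column as one 0/1 column per category
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy[["size"]]).toarray()

# scale the numeric column to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[["price"]])

print(encoder.categories_)
print(encoded)
print(scaled.ravel())
```

Both objects follow the same create, fit, transform pattern that the imputer uses later in this notebook.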

**Model building**
- Classification: identifying the category to which an object belongs.
- Regression: predicting a continuous-valued attribute associated with an object.

**Automating the process**
- After building the model, create pipelines that automate the pre-processing steps and predict the target using the final model.

**Steps to build a model in scikit-learn.**

1. Import the model
2. Prepare the data set
3. Separate the independent and the target values
4. Create an object of the model
5. Fit the model with the data
6. Use the model to predict the target
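
The six steps above can be sketched end to end on a tiny example. The data, the column names (`height_cm`, `weight_kg`, `animal`), and the choice of `DecisionTreeClassifier` are assumptions made only for illustration.

```python
# 1. import the model
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# 2. prepare the data set (hypothetical toy data)
toy = pd.DataFrame({
    "height_cm": [20, 25, 30, 50, 60, 70],
    "weight_kg": [3, 4, 5, 20, 25, 30],
    "animal": ["cat", "cat", "cat", "dog", "dog", "dog"],
})

# 3. separate the independent variables and the target
X = toy[["height_cm", "weight_kg"]]
y = toy["animal"]

# 4. create an object of the model
model = DecisionTreeClassifier(random_state=0)

# 5. fit the model with the data
model.fit(X, y)

# 6. use the model to predict the target for new data points
new_points = pd.DataFrame({"height_cm": [22, 65], "weight_kg": [4, 28]})
predictions = model.predict(new_points)
print(predictions)
```

In practice the model would be evaluated on data it has not seen during fitting; that step is left out here to keep the sketch short.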

In [1]:
# import the scikit-learn library 
import sklearn

**If you got an error while running the above cell, install the library by using the following command.**

If you are using Anaconda with Python 3: **!pip install scikit-learn**

If you are using Jupyter with Python 3: **!pip3 install scikit-learn**

In [2]:
# check the version
sklearn.__version__

'1.0.2'

**We have seen in the pandas notebook that we have some missing values in our data.**
**We will impute those missing values using the scikit-learn imputer.**

In [3]:
# read the data set and check for the null values
import pandas as pd
data = pd.read_csv('big_mart_sales.csv')
data.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [4]:
# import the simple imputer 
from sklearn.impute import SimpleImputer

- To impute the missing values, we will use **SimpleImputer**.
- First, we create an object of the imputer and define the strategy.
- We will impute Item_Weight with the mean value and Outlet_Size with the most frequent value.
- Fit the object with the data.
- Transform the data.

In [9]:
# create the object of the imputer for Item_Weight and Outlet_Size
impute_weight = SimpleImputer(strategy='mean')
impute_size = SimpleImputer(strategy='most_frequent')

In [10]:
# fit the Item_Weight imputer with the data and transform
impute_weight.fit(data[['Item_Weight']])
data.Item_Weight = impute_weight.transform(data[['Item_Weight']])

In [11]:
# fit the Outlet_Size imputer with the data and transform
impute_size.fit(data[['Outlet_Size']])
data.Outlet_Size = impute_size.transform(data[['Outlet_Size']])

In [12]:
# check the null values
data.isna().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64
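
The separate `fit` and `transform` calls used above can also be combined into a single `fit_transform` call. The sketch below shows this on a small hypothetical DataFrame that reuses two column names from the data set but whose values are made up; `.ravel()` flattens the returned 2-D array into one column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Item_Weight": [10.0, np.nan, 20.0],
    "Outlet_Size": pd.Series(["Small", np.nan, "Small"], dtype=object),
})

# mean strategy for the numeric column
weight_imputer = SimpleImputer(strategy="mean")
toy["Item_Weight"] = weight_imputer.fit_transform(toy[["Item_Weight"]]).ravel()

# most frequent strategy for the categorical column
size_imputer = SimpleImputer(strategy="most_frequent")
toy["Outlet_Size"] = size_imputer.fit_transform(toy[["Outlet_Size"]]).ravel()

# the values learned during fit are stored in statistics_
print(weight_imputer.statistics_, size_imputer.statistics_)
print(toy.isna().sum())
```

Keeping the fitted imputer object around matters: the same learned mean or most frequent value can then be applied to new data with `transform`.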

- Now, after the pre-processing step, we separate the independent and target variables and pass the data to the model object to train the model.

- If we have to identify the category of an object based on some features, for example whether a given picture shows a cat or a dog, these are **Classification Problems**.

- Or, if we have to predict a continuous attribute, such as sales, based on some features, these are **Regression Problems**.

**Scikit-learn** has tools that help you build regression models, classification models, and many others.

In [13]:
# some of the basic models that scikit-learn provides
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
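
To make the regression case concrete, here is a minimal sketch that fits `LinearRegression` on hypothetical data in which the target is exactly `3 * x + 5`. The numbers are invented for illustration and have nothing to do with the sales data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: one feature, target = 3 * x + 5
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 5.0

# create the model object and fit it
reg = LinearRegression()
reg.fit(X, y)

# predict a continuous value for a new data point
predicted = reg.predict(np.array([[5.0]]))
print(reg.coef_, reg.intercept_, predicted)
```

A classifier such as `DecisionTreeClassifier` is used in the same way; only the target changes from a continuous value to a category label.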

After we have built the model, whenever new data points are added to the existing data, we need to perform the same pre-processing steps again before we can use the model to make predictions. This becomes a tedious and time-consuming process!

So, scikit-learn provides tools to create a pipeline of all those steps, which makes your work a lot easier.

In [14]:
from sklearn.pipeline import Pipeline
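
As a minimal sketch of the idea, the pipeline below chains a `SimpleImputer` with a `LinearRegression` model, so that new data with missing values can be passed straight to `predict`. The step names (`impute`, `model`), the toy values, and the choice of model are assumptions for illustration; only the column names are borrowed from the data set above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# hypothetical training data with one missing Item_Weight
train = pd.DataFrame({
    "Item_Weight": [5.0, np.nan, 10.0, 15.0],
    "Item_MRP": [100.0, 150.0, 200.0, 250.0],
})
target = pd.Series([1000.0, 1500.0, 2000.0, 2500.0])

# each step is a (name, object) pair; the last step is the model
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])

# fit runs the imputer and then trains the model on the imputed data
pipe.fit(train, target)

# new data goes through the same imputation before prediction
new_data = pd.DataFrame({"Item_Weight": [np.nan], "Item_MRP": [180.0]})
prediction = pipe.predict(new_data)
print(prediction)
```

Because the imputer is fitted inside the pipeline, the mean learned from the training data is reused for every new data point, which is exactly the repeated pre-processing the text describes.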