# Basics of scikit-learn

- scikit-learn provides simple and efficient tools for preprocessing and predictive modelling.
- Steps to build a model in scikit-learn:
    - Import the model and prepare the data set.
    - Separate the independent and target variables.
    - Create an object of the model.
    - Fit the model with the data.
    - Use the model to predict target.
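
The five steps above can be sketched on a tiny synthetic data set. This is a minimal illustration, not the Big Mart workflow: the column names `x1`, `x2`, `y` and the relationship `y = 2*x1 + 3*x2 + 1` are made up for the example, and `LinearRegression` is just one possible choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Step 1: import the model (above) and prepare a small synthetic data set.
# The columns and the rule y = 2*x1 + 3*x2 + 1 are hypothetical.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
toy["y"] = 2 * toy["x1"] + 3 * toy["x2"] + 1

# Step 2: separate the independent variables (X) from the target (y).
X = toy[["x1", "x2"]]
y = toy["y"]

# Step 3: create an object of the model.
model = LinearRegression()

# Step 4: fit the model with the data.
model.fit(X, y)

# Step 5: use the fitted model to predict the target.
predictions = model.predict(X)
```

Because the synthetic target is noise-free, the fitted `model.coef_` and `model.intercept_` should recover the coefficients used to generate it.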

## Setting up library and importing data

In [1]:
# Install scikit-learn library
!pip install scikit-learn





In [2]:
# Importing necessary libraries
import sklearn
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%matplotlib inline

print(pd.__version__)
print(np.__version__)
print(sklearn.__version__)

2.1.1
1.26.1
1.5.0


In [8]:
# check the version 
sklearn.__version__

'1.5.0'

In [4]:
# Load the data
data = pd.read_csv('datasets/big_mart_sales.csv')

# Check the data
data.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [5]:
# Checking for NULL values
data.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

## Impute the missing values

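To impute the missing values, we use **`SimpleImputer`**. First, we create one imputer object per column and define its strategy: `mean` for the numeric `Item_Weight` column and `most_frequent` for the categorical `Outlet_Size` column.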

In [6]:
# import the SimpleImputer
from sklearn.impute import SimpleImputer

# create the object of the imputer for Item_Weight and Outlet_Size
impute_weight = SimpleImputer(strategy = 'mean')
impute_size = SimpleImputer(strategy = 'most_frequent')

In [7]:
# fit the Item_Weight imputer with the data and transform
impute_weight.fit(data[['Item_Weight']])
data.Item_Weight = impute_weight.transform(data[['Item_Weight']])

In [8]:
# fit the Outlet_Size imputer with the data and transform
impute_size.fit(data[['Outlet_Size']])
# transform returns a 2-D array; ravel() flattens it back to a single column
data.Outlet_Size = impute_size.transform(data[['Outlet_Size']]).ravel()

In [9]:
# Check for null values.
data.isna().sum()

Item_Identifier                 0
Item_Weight                     0
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                     0
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
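
As a compact illustration of the two strategies used above, the sketch below imputes a three-row toy frame. The frame and its column names (`weight`, `size`) are hypothetical. It also uses `fit_transform`, which is shorthand for calling `fit` and then `transform` on the same data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "weight": [10.0, np.nan, 20.0],
    "size": pd.Series(["Small", "Small", np.nan], dtype=object),
})

# 'mean' replaces the missing weight with the column mean, (10 + 20) / 2 = 15.
toy["weight"] = SimpleImputer(strategy="mean").fit_transform(toy[["weight"]]).ravel()

# 'most_frequent' replaces the missing size with the most common value, "Small".
toy["size"] = SimpleImputer(strategy="most_frequent").fit_transform(toy[["size"]]).ravel()

print(toy)
```

Note that `transform` only returns the imputed values; they have to be assigned back to the DataFrame for the missing entries to disappear from `isna().sum()`.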

- After the preprocessing step, we separate the independent and target variables and pass the data to the model object to train the model.
- If the task is to identify the category of an object based on some features, it is a **`classification problem`**.
- If the task is to predict a continuous value based on some features, it is a **`regression problem`**.
- **`scikit-learn`** provides tools for building regression and classification models, among many others.

In [10]:
# Two of the basic models that scikit-learn provides.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
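
To make the regression versus classification distinction concrete, here is a small sketch that fits each of the two imported models on made-up data. The feature values, the continuous target and the `"low"`/`"high"` labels are all hypothetical and chosen only so the behaviour is easy to follow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# One hypothetical feature, four samples.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Regression: the target is a continuous value (here y = 10 * x).
y_continuous = np.array([10.0, 20.0, 30.0, 40.0])
regressor = LinearRegression().fit(X, y_continuous)
predicted_value = regressor.predict([[5.0]])

# Classification: the target is a category label.
y_category = np.array(["low", "low", "high", "high"])
classifier = DecisionTreeClassifier(random_state=0).fit(X, y_category)
predicted_label = classifier.predict([[3.5]])

print(predicted_value, predicted_label)
```

Both models share the same `fit` and `predict` interface; only the type of target they are trained on differs.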

After we have built the model, whenever new data points are added to the existing data, we need to perform the same preprocessing steps again before we can use the model to make predictions. This quickly becomes tedious and time-consuming.
- scikit-learn therefore provides tools to chain all of those steps into a pipeline, which makes this work much easier.

In [11]:
from sklearn.pipeline import Pipeline
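
A minimal sketch of how such a pipeline might look, assuming a toy frame with made-up values that loosely mirrors the `Item_Weight`, `Outlet_Size` and `Item_Outlet_Sales` columns. The use of `ColumnTransformer` and `OneHotEncoder`, and the step names, are choices made for this illustration rather than something taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with missing values in both feature columns.
train = pd.DataFrame({
    "Item_Weight": [9.0, np.nan, 17.0, 19.0],
    "Outlet_Size": pd.Series(["Medium", "Medium", np.nan, "High"], dtype=object),
    "Item_Outlet_Sales": [3700.0, 450.0, 2100.0, 730.0],
})
X = train[["Item_Weight", "Outlet_Size"]]
y = train["Item_Outlet_Sales"]

# Preprocessing: mean-impute the numeric column; impute and then
# one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("weight", SimpleImputer(strategy="mean"), ["Item_Weight"]),
    ("size", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Outlet_Size"]),
])

# Chain preprocessing and the model into a single estimator.
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipeline.fit(X, y)

# New data goes through the same preprocessing automatically.
new_rows = pd.DataFrame({
    "Item_Weight": [np.nan],
    "Outlet_Size": pd.Series([np.nan], dtype=object),
})
new_predictions = pipeline.predict(new_rows)
```

The imputers inside the pipeline reuse the statistics learned during `fit`, so new rows with missing values are filled in the same way as the training data before the model sees them.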