<a href="https://colab.research.google.com/github/akkulu95/machine_learning/blob/main/fit_and_fit_transform_study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Here we discuss the difference between fit(), transform(), and fit_transform() in scikit-learn.
Here's the link: https://www.geeksforgeeks.org/what-is-the-difference-between-transform-and-fit_transform-in-sklearn-python/#:~:text=The%20fit(data)%20method%20is,fit()%20method.

* For a scaler such as StandardScaler, the fit(data) method computes the mean and standard deviation of each feature, to be used later for scaling.
* The transform(data) method performs the scaling using the mean and standard deviation computed by fit().
* The fit_transform(data) method does both: it fits and then transforms the same data in a single call.
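
The three methods above can be sketched without scikit-learn at all. The class below is a hypothetical, single-feature stand-in (the name TinyStandardScaler and the attribute std_ are made up for illustration and are not part of any library); it only aims to show which step learns the statistics and which step applies them.

```python
from statistics import fmean, pstdev


class TinyStandardScaler:
    """Hypothetical, simplified stand-in for a standard scaler (one feature only)."""

    def fit(self, values):
        # Learn the statistics; nothing is transformed here.
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)  # population standard deviation
        return self

    def transform(self, values):
        # Apply the statistics learned by fit().
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        # Convenience: fit, then transform the same values.
        return self.fit(values).transform(values)


train = [1.0, 2.0, 3.0, 4.0, 5.0]
two_step = TinyStandardScaler().fit(train).transform(train)
one_step = TinyStandardScaler().fit_transform(train)
print(two_step == one_step)
```

Because fit() returns the object itself, the two-step call can be chained, and it produces the same list as the one-step fit_transform() call.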

Suppose scaling is one of our data preprocessing steps. To demonstrate, we use the built-in iris dataset.

In [1]:
from sklearn import datasets
import pandas as pd
  
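iris = datasets.load_iris()
# Caution: load_iris() orders the columns as sepal length, sepal width,
# petal length, petal width. The labels below differ from that order and
# repeat 'sepal width', so data['sepal width'] selects the last two columns.
data = pd.DataFrame(iris.get('data'), columns=[
    'sepal length', 'petal length', 'sepal width', 'sepal width'])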
data.head()

Unnamed: 0,sepal length,petal length,sepal width,sepal width.1
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


Let us split the data into train and test sets.

In [2]:
from sklearn.model_selection import train_test_split

# Features: every column except the last; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    data.iloc[:, :-1], data['sepal width'], test_size=0.33, random_state=42)

Now let us apply standard scaling to the columns labelled sepal width. Scaling in general means converting columns to a common numeric scale. Standard scaling in particular subtracts a column's mean and divides by its standard deviation, so the transformed column has mean 0 and standard deviation 1.
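
As a rough illustration of what standard scaling computes, the NumPy sketch below applies z = (x - mean) / std to each column by hand. The small array X is made up for the example and is not taken from iris; the population standard deviation (ddof=0) is used, which is NumPy's default.

```python
import numpy as np

# Two features on very different scales (illustrative values, not from iris).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

mean = X.mean(axis=0)   # per-column mean
std = X.std(axis=0)     # per-column population standard deviation (ddof=0)
Z = (X - mean) / std    # z = (x - mean) / std, column by column

print(Z.mean(axis=0))   # each column's mean is now (approximately) 0
print(Z.std(axis=0))    # each column's standard deviation is now (approximately) 1
```

After scaling, both columns sit on the same numeric scale even though the raw values differed by two orders of magnitude.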

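The fit() method computes the statistics needed for standard scaling (each column's mean and standard deviation) but does not apply the transformation. The statistics are stored on the scaler object itself, in attributes such as mean_ and scale_. fit() does not return transformed data; it returns the fitted scaler, which is why the cell below displays StandardScaler().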

In [3]:
from sklearn.preprocessing import StandardScaler
  
scaler = StandardScaler()
scaler.fit(data['sepal width'])

StandardScaler()
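
The StandardScaler() echoed above is the fitted scaler itself. A small sketch of how one might inspect what fit() learned, using a made-up single-column array rather than the iris data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])  # illustrative single-column data

scaler = StandardScaler()
returned = scaler.fit(X)

print(returned is scaler)   # fit() hands back the same scaler object
print(scaler.mean_)         # learned per-column mean
print(scaler.scale_)        # learned per-column standard deviation
```

Note that the input is two-dimensional (rows by columns); StandardScaler expects that shape even for a single feature.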

The transform() method uses the statistics stored by fit() and applies the actual transformation to the columns. So fit() followed by transform() is a two-step process that completes the transformation in the second step. Unlike fit(), transform() returns the transformed array.

In [4]:
scaler.transform(data['sepal width'])

array([[-1.34022653e+00, -1.31544430e+00],
       [-1.34022653e+00, -1.31544430e+00],
       [-1.39706395e+00, -1.31544430e+00],
       [-1.28338910e+00, -1.31544430e+00],
       [-1.34022653e+00, -1.31544430e+00],
       [-1.16971425e+00, -1.05217993e+00],
       [-1.34022653e+00, -1.18381211e+00],
       [-1.28338910e+00, -1.31544430e+00],
       [-1.34022653e+00, -1.31544430e+00],
       [-1.28338910e+00, -1.44707648e+00],
       [-1.28338910e+00, -1.31544430e+00],
       [-1.22655167e+00, -1.31544430e+00],
       [-1.34022653e+00, -1.44707648e+00],
       [-1.51073881e+00, -1.44707648e+00],
       [-1.45390138e+00, -1.31544430e+00],
       [-1.28338910e+00, -1.05217993e+00],
       [-1.39706395e+00, -1.05217993e+00],
       [-1.34022653e+00, -1.18381211e+00],
       [-1.16971425e+00, -1.18381211e+00],
       [-1.28338910e+00, -1.18381211e+00],
       [-1.16971425e+00, -1.31544430e+00],
       [-1.28338910e+00, -1.05217993e+00],
       [-1.56757623e+00, -1.31544430e+00],
       [-1.

The fit_transform() method performs both steps in a single call. Below, it refits the scaler on X_train and returns the scaled training features.

In [5]:
scaler.fit_transform(X_train)

array([[-0.13835603, -0.26550845,  0.22229072],
       [ 2.14752625, -0.02631165,  1.61160773],
       [-0.25866563, -0.02631165,  0.39595535],
       [-0.8602136 ,  1.16967238, -1.39857913],
       [ 2.26783585, -0.50470526,  1.66949594],
       [-0.01804644, -0.74390206,  0.16440251],
       [-0.739904  ,  0.93047557, -1.39857913],
       [-0.98052319,  1.16967238, -1.45646733],
       [-0.8602136 ,  1.88726279, -1.10913808],
       [-0.98052319, -2.4182797 , -0.18292674],
       [ 0.58350153, -0.74390206,  0.62750818],
       [-1.22114238,  0.93047557, -1.10913808],
       [-0.98052319, -0.02631165, -1.28280271],
       [-0.8602136 ,  0.69127877, -1.2249145 ],
       [-0.25866563, -0.74390206,  0.22229072],
       [-0.8602136 ,  0.93047557, -1.34069092],
       [-0.13835603, -0.02631165,  0.22229072],
       [ 2.26783585,  1.88726279,  1.66949594],
       [-1.46176157,  0.45208196, -1.39857913],
       [ 0.46319194, -0.26550845,  0.28017893],
       [-0.13835603, -1.22229567,  0.685
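
To compare the two-step and one-step routes on identical input, a check along the following lines could be used. It is only a sketch: it loads the raw iris array directly instead of reusing the data frame from the cells above, and it creates a fresh scaler for each route.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # all four iris measurements as a NumPy array

two_step = StandardScaler().fit(X).transform(X)   # fit, then transform
one_step = StandardScaler().fit_transform(X)      # both in one call

print(np.allclose(two_step, one_step))
```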

Calling fit() followed by transform() on a dataset produces the same result as calling fit_transform() on it. Next, we have to ensure that the same transformation is applied to the test dataset. We should not call fit() (or fit_transform()) on the test dataset: that would replace the statistics learned from the training data with ones computed from the test data, so the two sets would be scaled inconsistently and information from the test set would leak into preprocessing. Instead, we call transform() directly on the test dataset, reusing the scaler fitted on X_train.

In [6]:
scaler.transform(X_test)

array([[ 0.34288234, -0.50470526,  0.51173177],
       [-0.13835603,  1.88726279, -1.2249145 ],
       [ 2.26783585, -0.98309887,  1.78527236],
       [ 0.22257275, -0.26550845,  0.39595535],
       [ 1.1850495 , -0.50470526,  0.56961997],
       [-0.49928482,  0.93047557, -1.34069092],
       [-0.25866563, -0.26550845, -0.12503853],
       [ 1.30535909,  0.21288516,  0.7432846 ],
       [ 0.46319194, -1.93988609,  0.39595535],
       [-0.01804644, -0.74390206,  0.0486261 ],
       [ 0.82412072,  0.45208196,  0.7432846 ],
       [-1.22114238, -0.02631165, -1.39857913],
       [-0.37897522,  1.16967238, -1.45646733],
       [-1.10083279,  0.21288516, -1.34069092],
       [-0.8602136 ,  1.88726279, -1.34069092],
       [ 0.58350153,  0.69127877,  0.51173177],
       [ 0.82412072, -0.02631165,  1.14850206],
       [-0.25866563, -1.22229567,  0.0486261 ],
       [-0.13835603, -0.50470526,  0.39595535],
       [ 0.70381112, -0.50470526,  1.03272565],
       [-1.34145197,  0.45208196, -1.282
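
The difference between reusing the training statistics and refitting on the test set can be seen in a short, self-contained sketch. It assumes the raw iris array and the same test_size and random_state as above; the variable names X_test_scaled and X_test_refit are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

scaler = StandardScaler().fit(X_train)      # statistics come from training rows only
X_test_scaled = scaler.transform(X_test)    # reuse them on the test rows

# Refitting on the test rows (shown only for contrast) uses different statistics.
X_test_refit = StandardScaler().fit_transform(X_test)

print(X_test_scaled.mean(axis=0))   # generally not exactly 0: train statistics were used
print(X_test_refit.mean(axis=0))    # approximately 0: test statistics were used
```

The first result is the one to feed to a model trained on the scaled training data, since both sets then share a single transformation.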