### Data Preprocessing Steps
    - Type casting (checking the datatype of the columns)
    - Checking for duplicates
    - Scaling data (numerical continuous features)
        - Standard Scaling
        - MinMax Scaling
    - Encoding of data
        - Label Encoding
        - One-hot Encoding
    - Missing value treatment
        - Dropping the rows/columns
        - Imputation techniques (fill with 0 or a suitable value such as the mean or median)
            - SimpleImputer
            - KNN Imputer
    - Handling outliers
        - Ceiling/capping the values
    - Handling imbalanced data
        - Undersampling, oversampling (SMOTE)
    - Handling date/time data
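
The encoding step in the list above is not demonstrated elsewhere in these notes, so here is a minimal sketch of both variants. The `toy` frame and its `city` column are made up for illustration; `LabelEncoder` and `OneHotEncoder` are the scikit-learn classes named later in the notes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data, made up for this illustration
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "city": ["Pune", "Delhi", "Mumbai", "Delhi"],
})

# Label encoding: each category becomes an integer (classes are sorted, so F=0, M=1)
label_enc = LabelEncoder()
toy["gender_label"] = label_enc.fit_transform(toy["gender"])

# One-hot encoding: one 0/1 column per category
onehot_enc = OneHotEncoder()
city_onehot = onehot_enc.fit_transform(toy[["city"]]).toarray()
city_columns = list(onehot_enc.categories_[0])
city_df = pd.DataFrame(city_onehot, columns=city_columns)
```

Label encoding implies an order between categories, so it is usually reserved for targets or ordinal features; one-hot encoding avoids that at the cost of extra columns.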
   
### Introduction to Sklearn architecture
    - fit
    - transform
    - fit_transform
    - predict
    - predict_proba
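
A short sketch of how these five methods fit together, assuming the iris data and a `LogisticRegression` classifier chosen only for illustration. Transformers such as `StandardScaler` expose `fit`/`transform`/`fit_transform`; predictors expose `fit`/`predict`, and classifiers usually also `predict_proba`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformers: fit learns the statistics, transform applies them
scaler = StandardScaler()
scaler.fit(X)                                        # learns mean_ and scale_ per column
X_scaled = scaler.transform(X)                       # applies (x - mean) / std
X_scaled_again = StandardScaler().fit_transform(X)   # fit + transform in one call

# Predictors: fit learns the model, predict / predict_proba use it
clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, y)
labels = clf.predict(X_scaled[:3])         # predicted class labels
probs = clf.predict_proba(X_scaled[:3])    # one probability per class; each row sums to 1
```

In a real workflow the scaler is fitted on the training split only and then applied to the test split with `transform`.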

In [None]:
# The package is called scikit-learn; in code it is imported as sklearn

In [6]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [25]:
dataset = load_iris()
(dataset.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [29]:
dataset["target_names"]

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [36]:
dataset["feature_names"]

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [37]:
pd.DataFrame(dataset["data"], columns=dataset["feature_names"])

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [28]:
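# Without column names the columns are numbered 0..3.
# (Passing the three target_names as columns would fail: the data has four columns.)
pd.DataFrame(dataset["data"])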

Unnamed: 0,0,1,2,3
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [15]:
len(dataset["target"])

150

In [12]:
print(dataset["DESCR"])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [38]:
df = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
df["target"] = dataset["target"]
df.sample(10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
149,5.9,3.0,5.1,1.8,2
53,5.5,2.3,4.0,1.3,1
139,6.9,3.1,5.4,2.1,2
116,6.5,3.0,5.5,1.8,2
85,6.0,3.4,4.5,1.6,1
51,6.4,3.2,4.5,1.5,1
113,5.7,2.5,5.0,2.0,2
18,5.7,3.8,1.7,0.3,0
131,7.9,3.8,6.4,2.0,2
61,5.9,3.0,4.2,1.5,1


### Check data imbalance

In [39]:
df["target"].value_counts()

target
0    50
1    50
2    50
Name: count, dtype: int64

    - The classes are balanced: each of the three classes has 50 samples

### Check for duplicates

In [40]:
df.duplicated().sum()

1

In [42]:
df[df.duplicated()]

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
142,5.8,2.7,5.1,1.9,2


In [46]:
df[df["sepal length (cm)"]==5.8 ].sort_values(by = "sepal width (cm)")

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
92,5.8,2.6,4.0,1.2,1
67,5.8,2.7,4.1,1.0,1
82,5.8,2.7,3.9,1.2,1
101,5.8,2.7,5.1,1.9,2
142,5.8,2.7,5.1,1.9,2
114,5.8,2.8,5.1,2.4,2
14,5.8,4.0,1.2,0.2,0
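
The manual lookup above can also be done directly with `duplicated(keep=False)`, which flags every member of a duplicate group rather than only the later copies. A minimal sketch, rebuilding the iris frame under the hypothetical name `iris_df` so it does not touch `df`:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
iris_df["target"] = iris["target"]

# keep=False marks every row that has an identical twin
all_copies = iris_df[iris_df.duplicated(keep=False)]

# keep="first" (the default) marks only the repeats, which is what drop_duplicates removes
deduplicated = iris_df.drop_duplicates()
```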


In [48]:
# drop_duplicates returns a new DataFrame, so assign the result back.
# Use either this assignment or inplace=True, not both.
df = df.drop_duplicates()


In [51]:
df[140:]

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
140,6.7,3.1,5.6,2.4,2
141,6.9,3.1,5.1,2.3,2
143,6.8,3.2,5.9,2.3,2
144,6.7,3.3,5.7,2.5,2
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2
149,5.9,3.0,5.1,1.8,2


### Check missing values

In [54]:
df.isna().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

In [97]:
df = pd.DataFrame({
    "age":[20, np.nan, 25, 67],
    "salary":[50000, 60000, np.nan, 450000],
    "gender":["M","F","M","F"]
    
    
})

df

Unnamed: 0,age,salary,gender
0,20.0,50000.0,M
1,,60000.0,F
2,25.0,,M
3,67.0,450000.0,F



### Check the datatype of all columns

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     3 non-null      float64
 1   salary  3 non-null      float64
 2   gender  4 non-null      object 
dtypes: float64(2), object(1)
memory usage: 224.0+ bytes


In [60]:
df.isna().sum()

age       1
salary    1
gender    0
dtype: int64

In [78]:
df["salary"].unique()

array([ 50000.,  60000.,     nan, 450000.])

In [79]:
np.where(df["salary"].dtype=="object")

(array([], dtype=int64),)

In [80]:
df.sort_values(by = "salary")

Unnamed: 0,age,salary,gender
0,20.0,50000.0,M
1,,60000.0,F
3,67.0,450000.0,F
2,25.0,,M


#### Delete the column/row

In [86]:
df

Unnamed: 0,age,salary,gender
0,20.0,50000.0,M
1,,60000.0,F
2,25.0,,M
3,67.0,450000.0,F


In [101]:
df.dropna(subset=["salary"] )

Unnamed: 0,age,salary,gender
0,20.0,50000.0,M
1,,60000.0,F
3,67.0,450000.0,F


In [102]:
df

Unnamed: 0,age,salary,gender
0,20.0,50000.0,M
1,,60000.0,F
2,25.0,,M
3,67.0,450000.0,F
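
As the unchanged `df` above shows, `dropna` returns a new frame and leaves the original alone. The sketch below lists the most common variants on the same four rows; the name `people` is a stand-in so that `df` is not overwritten.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "age": [20, np.nan, 25, 67],
    "salary": [50000, 60000, np.nan, 450000],
    "gender": ["M", "F", "M", "F"],
})

rows_any = people.dropna()                       # drop rows with at least one NaN
rows_salary = people.dropna(subset=["salary"])   # only consider the salary column
cols_any = people.dropna(axis=1)                 # drop columns that contain a NaN

# To keep the result, assign it; reset_index renumbers the surviving rows
people_clean = people.dropna().reset_index(drop=True)
```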


### Imputing the missing value

In [105]:
df.fillna("N/A")

Unnamed: 0,age,salary,gender
0,20.0,50000.0,M
1,N/A,60000.0,F
2,25.0,N/A,M
3,67.0,450000.0,F
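
Filling numeric columns with a string turns them into `object` columns, so for modelling it is usually better to fill each column with a numeric statistic. A minimal sketch on the same four rows (the name `people` and the choice of median for age and mean for salary are only for illustration):

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "age": [20, np.nan, 25, 67],
    "salary": [50000, 60000, np.nan, 450000],
    "gender": ["M", "F", "M", "F"],
})

# Pass a dict to fill each column with its own value; the columns stay numeric
filled = people.fillna({
    "age": people["age"].median(),
    "salary": people["salary"].mean(),
})
```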


### SimpleImputer and KNN Imputer
    - SimpleImputer fills missing values using a simple strategy such as the mean, the median, or a constant value
    - KNNImputer is a more informed way to impute: it fills a missing value from the values of the most similar rows (nearest neighbours)
    
    Things to keep in mind:
        - Outliers affect the SimpleImputer result (strongly with the mean strategy, much less with the median strategy)
        - Whenever distances are involved, scale the data first; otherwise the distance is dominated by the features with the larger scale
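
To make the first point concrete, here is a small sketch comparing the mean and median strategies on the salary values used in the cells that follow (one missing value, one extreme outlier). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

salary = pd.DataFrame({"salary": [50000, 60000, np.nan, 45000, 32000,
                                  61000, 34000, 65000, 7500000]})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

mean_filled = mean_imputer.fit_transform(salary)
median_filled = median_imputer.fit_transform(salary)

# The fill value each strategy learned for the salary column
mean_value = mean_imputer.statistics_[0]      # pulled up by the 7,500,000 outlier
median_value = median_imputer.statistics_[0]  # stays near the typical salaries
```

With these numbers the mean fill is 980875 while the median fill is 55000, which is much closer to a typical salary in the column.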

In [178]:
from sklearn.impute import SimpleImputer, KNNImputer

In [244]:
df = pd.DataFrame({
    "age":[20, np.nan, 25, 67, 70, 20, 45, 25, 64],
    "salary":[50000, 60000, np.nan, 45000, 32000, 61000, 34000, 65000, 7500000],
    "gender":["M","F","M","F","F","F","F","F","M"]
    
    
})

df

Unnamed: 0,age,salary,gender
0,20.0,50000.0,M
1,,60000.0,F
2,25.0,,M
3,67.0,45000.0,F
4,70.0,32000.0,F
5,20.0,61000.0,F
6,45.0,34000.0,F
7,25.0,65000.0,F
8,64.0,7500000.0,M


In [254]:
simp_imp = SimpleImputer(strategy='mean')
simp_imp.fit(df[["age","salary"]])

In [255]:
simpleimputed_df = pd.DataFrame(simp_imp.transform(df[["age","salary"]]))
simpleimputed_df

Unnamed: 0,0,1
0,20.0,50000.0
1,42.0,60000.0
2,25.0,980875.0
3,67.0,45000.0
4,70.0,32000.0
5,20.0,61000.0
6,45.0,34000.0
7,25.0,65000.0
8,64.0,7500000.0


In [247]:
# list(dir(simp_imp))

In [256]:
simp_imp.statistics_

array([4.20000e+01, 9.80875e+05])

In [257]:
df["age"].mean(), df["salary"].mean()

(42.0, 980875.0)

In [258]:
knn_imputer = KNNImputer(n_neighbors=1, weights="distance")
knn_imputed_df = pd.DataFrame(knn_imputer.fit_transform(df[['age',"salary"]]))

In [259]:
df

Unnamed: 0,age,salary,gender
0,20.0,50000.0,M
1,,60000.0,F
2,25.0,,M
3,67.0,45000.0,F
4,70.0,32000.0,F
5,20.0,61000.0,F
6,45.0,34000.0,F
7,25.0,65000.0,F
8,64.0,7500000.0,M


In [260]:
simpleimputed_df

Unnamed: 0,0,1
0,20.0,50000.0
1,42.0,60000.0
2,25.0,980875.0
3,67.0,45000.0
4,70.0,32000.0
5,20.0,61000.0
6,45.0,34000.0
7,25.0,65000.0
8,64.0,7500000.0


In [261]:
knn_imputed_df

Unnamed: 0,0,1
0,20.0,50000.0
1,20.0,60000.0
2,25.0,65000.0
3,67.0,45000.0
4,70.0,32000.0
5,20.0,61000.0
6,45.0,34000.0
7,25.0,65000.0
8,64.0,7500000.0


### Scaling of data
        - Bring all the features to a similar scale
        - This usually helps when building ML models, especially ones based on distances or gradient descent

In [113]:
# df = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])

In [114]:
# df

#### Standard Scaler
    - After scaling, the mean is 0 and the standard deviation is 1
    - Subtract the mean and divide by the standard deviation
    - z(x) = (x - mean(X)) / std(X)  ---->  transformation (the z-score, or standard score, of x)
    - x = (z(x) * std(X)) + mean(X)  ---->  inverse transform
    
    - What does the z-score mean? It shows how far the data point is from the mean, in units of standard deviation
            - A z-score of 0.8 means that the data point is 0.8 standard deviations above the mean
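
The two formulas above can be checked in a few lines of NumPy. This is a sketch on the age values used in the next cells; note that `np.std` divides by n (population standard deviation), which is an assumption of this example rather than something the formulas prescribe.

```python
import numpy as np

ages = np.array([20, 23, 25, 67, 45], dtype=float)

mean, std = ages.mean(), ages.std()   # NumPy's std uses ddof=0 (population)
z = (ages - mean) / std               # transformation
restored = z * std + mean             # inverse transform
```

After the transformation `z` has mean 0 and standard deviation 1, and `restored` equals the original ages.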

In [116]:
df = pd.DataFrame({
    "age":[20, 23, 25, 67, 45],
    "salary":[50000, 60000, 12000, 45000, 60000],
    "gender":["M","F","M","F","M"]
    
    
})

df

Unnamed: 0,age,salary,gender
0,20,50000,M
1,23,60000,F
2,25,12000,M
3,67,45000,F
4,45,60000,M


In [126]:
# df["age"].mean(), df["age"].std()
df["salary"].mean(), df["salary"].std()

(45400.0, 19768.662069042508)

In [129]:
# inverse transform by hand: mean + std * z, for a z-score of about 0.23 (row 0)
45400.0 + (19768.662069042508 * 0.23)

49946.79227587978

In [124]:
df["age_scaled"] = (df["age"] - df["age"].mean())/df["age"].std()
df["salary_scaled"] = (df["salary"] - df["salary"].mean())/df["salary"].std()
df

Unnamed: 0,age,salary,gender,age_scaled,salary_scaled
0,20,50000,M,-0.803017,0.232692
1,23,60000,F,-0.652451,0.738543
2,25,12000,M,-0.552074,-1.689543
3,67,45000,F,1.555845,-0.020234
4,45,60000,M,0.451697,0.738543


In [130]:
# df["age_scaled"].mean(), df["age_scaled"].std()   # only the last expression of a cell is displayed
df["salary_scaled"].mean(), df["salary_scaled"].std()

(4.4408920985006264e-17, 0.9999999999999999)

#### MinMax Scaling
     - x_scaled = (x - min(X)) / range(X), where range(X) = max(X) - min(X)
     - The data is scaled to the range [0, 1]
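
A minimal NumPy sketch of the formula, its inverse, and the rescaling to another interval. The target interval `[a, b] = [-1, 1]` is only an example.

```python
import numpy as np

ages = np.array([20, 23, 25, 67, 45], dtype=float)

lo, hi = ages.min(), ages.max()
scaled = (ages - lo) / (hi - lo)      # values now lie in [0, 1]
restored = scaled * (hi - lo) + lo    # inverse transform

# Rescaling to another interval [a, b]: a + scaled * (b - a)
a, b = -1.0, 1.0
scaled_ab = a + scaled * (b - a)
```

Because the minimum and maximum define the scale, a single extreme value squeezes all the other points into a narrow band, so min-max scaling is sensitive to outliers.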

In [133]:
df["age_minmax"] = (df["age"] - df["age"].min())/(df["age"].max() - df["age"].min())
df

Unnamed: 0,age,salary,gender,age_scaled,salary_scaled,age_minmax
0,20,50000,M,-0.803017,0.232692,0.0
1,23,60000,F,-0.652451,0.738543,0.06383
2,25,12000,M,-0.552074,-1.689543,0.106383
3,67,45000,F,1.555845,-0.020234,1.0
4,45,60000,M,0.451697,0.738543,0.531915


### Using Sklearn module

        - preprocessing --> StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
        - metrics --> mean_squared_error, r2_score, accuracy_score
        - model_selection
        - linear_model --> LinearRegression, LogisticRegression

In [136]:
df = pd.DataFrame({
    "age":[20, 23, 25, 67, 45],
    "salary":[50000, 60000, 12000, 45000, 60000],
    "gender":["M","F","M","F","M"]
    
    
})

df

Unnamed: 0,age,salary,gender
0,20,50000,M
1,23,60000,F
2,25,12000,M
3,67,45000,F
4,45,60000,M


In [134]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [137]:
std_scaler = StandardScaler()

In [139]:
std_scaler.fit(df[["age"]])

In [147]:
# pandas .std() defaults to ddof=1 (sample std); np.std defaults to ddof=0 (population std)
df["age"].std(), np.std(df["age"])

(19.924858845171276, 17.82133552795637)
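
The two numbers differ because of the `ddof` setting, and this is also why the scikit-learn result differs slightly from the manual `age_scaled` column computed earlier. The sketch below checks that `StandardScaler` divides by the population standard deviation (ddof=0); the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

ages = pd.DataFrame({"age": [20, 23, 25, 67, 45]})

sample_std = ages["age"].std()              # ddof=1, divides by n - 1
population_std = ages["age"].std(ddof=0)    # ddof=0, divides by n

scaler = StandardScaler().fit(ages)
sklearn_std = scaler.scale_[0]              # the divisor StandardScaler uses

# Manual scaling with ddof=0 reproduces the scikit-learn output
manual_ddof0 = (ages["age"] - ages["age"].mean()) / population_std
sklearn_scaled = scaler.transform(ages)[:, 0]
```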

In [148]:
df["age_scaled2"] =  std_scaler.transform(df[["age"]])

In [149]:
df

Unnamed: 0,age,salary,gender,age_scaled2
0,20,50000,M,-0.8978
1,23,60000,F,-0.729463
2,25,12000,M,-0.617238
3,67,45000,F,1.739488
4,45,60000,M,0.505013


In [150]:
std_scaler = StandardScaler()
std_scaler.fit_transform(df[["age","salary"]])

array([[-0.89780028,  0.26015703],
       [-0.72946273,  0.82571578],
       [-0.61723769, -1.88896624],
       [ 1.73948804, -0.02262235],
       [ 0.50501266,  0.82571578]])

In [154]:
min_max = MinMaxScaler()
df[["age_min","salary_min"]] =    min_max.fit_transform(df[["age", "salary"]])

In [155]:
df

Unnamed: 0,age,salary,gender,age_scaled2,age_min,salary_min
0,20,50000,M,-0.8978,0.0,0.791667
1,23,60000,F,-0.729463,0.06383,1.0
2,25,12000,M,-0.617238,0.106383,0.0
3,67,45000,F,1.739488,1.0,0.6875
4,45,60000,M,0.505013,0.531915,1.0


In [157]:
df = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
# df["target"] = dataset["target"]
df.sample(10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
95,5.7,3.0,4.2,1.2
47,4.6,3.2,1.4,0.2
65,6.7,3.1,4.4,1.4
128,6.4,2.8,5.6,2.1
87,6.3,2.3,4.4,1.3
7,5.0,3.4,1.5,0.2
22,4.6,3.6,1.0,0.2
142,5.8,2.7,5.1,1.9
143,6.8,3.2,5.9,2.3
0,5.1,3.5,1.4,0.2


In [161]:
std_scaler = StandardScaler()
df_transformed = std_scaler.fit_transform(df)
df_transformed = pd.DataFrame(df_transformed, columns=df.columns)

In [163]:
df_transformed

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,-0.900681,1.019004,-1.340227,-1.315444
1,-1.143017,-0.131979,-1.340227,-1.315444
2,-1.385353,0.328414,-1.397064,-1.315444
3,-1.506521,0.098217,-1.283389,-1.315444
4,-1.021849,1.249201,-1.340227,-1.315444
...,...,...,...,...
145,1.038005,-0.131979,0.819596,1.448832
146,0.553333,-1.282963,0.705921,0.922303
147,0.795669,-0.131979,0.819596,1.053935
148,0.432165,0.788808,0.933271,1.448832


In [165]:
min_max = MinMaxScaler()
df_minmax = min_max.fit_transform(df)
df_minmax.shape

(150, 4)

In [168]:
# list only the public attributes and methods (names without a leading underscore)
[name for name in dir(min_max) if not name.startswith("_")]

['clip',
 'copy',
 'data_max_',
 'data_min_',
 'data_range_',
 'feature_names_in_',
 'feature_range',
 'fit',
 'fit_transform',
 'get_feature_names_out',
 'get_params',
 'inverse_transform',
 'min_',
 'n_features_in_',
 'n_samples_seen_',
 'partial_fit',
 'scale_',
 'set_output',
 'set_params',
 'transform']

In [176]:
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3



In [177]:
# inverse_transform maps the scaled values back to the original units (first five rows shown)
std_scaler.inverse_transform(df_transformed)[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In [169]:
# list only the public attributes and methods (names without a leading underscore)
[name for name in dir(std_scaler) if not name.startswith("_")]

['copy',
 'feature_names_in_',
 'fit',
 'fit_transform',
 'get_feature_names_out',
 'get_params',
 'inverse_transform',
 'mean_',
 'n_features_in_',
 'n_samples_seen_',
 'partial_fit',
 'scale_',
 'set_output',
 'set_params',
 'transform',
 'var_',
 'with_mean',
 'with_std']

In [172]:
std_scaler.mean_, std_scaler.var_

(array([5.84333333, 3.05733333, 3.758     , 1.19933333]),
 array([0.68112222, 0.18871289, 3.09550267, 0.57713289]))

#### Outlier Removal
    - Ceiling or capping (clipping) the data limits the influence of outliers
    - In some cases the outlier rows are removed instead
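
The cells below cap the salary at hand-picked quantiles. A common alternative, sketched here as an illustration and not as the method used in these notes, is the 1.5 × IQR rule (Tukey's fences), which derives the limits from the quartiles. The salary values are the ones from the imputation example.

```python
import numpy as np
import pandas as pd

salary = pd.Series([50000, 60000, np.nan, 45000, 32000,
                    61000, 34000, 65000, 7500000], name="salary")

q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # Tukey's fences

is_outlier = (salary < lower) | (salary > upper)  # comparisons with NaN are False
capped = salary.clip(lower=lower, upper=upper)    # option 1: cap the values
removed = salary[~is_outlier]                     # option 2: drop the outlier rows
```

Here only the 7,500,000 entry falls outside the fences; capping replaces it with the upper fence, while removal drops the row and keeps the missing value for a later imputation step.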

In [262]:
# df here is the age/salary/gender frame from the imputation section, not the iris frame
df["salary"]

0      50000.0
1      60000.0
2          NaN
3      45000.0
4      32000.0
5      61000.0
6      34000.0
7      65000.0
8    7500000.0
Name: salary, dtype: float64

In [269]:
df["salary"].quantile(0.1), df["salary"].quantile(0.80)

(33400.0, 63400.0)

In [273]:
# cap salary between its 0.001 and 0.80 quantiles; missing values stay NaN
df["salary_clipped"] = np.clip(df["salary"], a_min=df["salary"].quantile(0.001), a_max=df["salary"].quantile(0.80))
df

Unnamed: 0,age,salary,gender,salary_clipped
0,20.0,50000.0,M,50000.0
1,,60000.0,F,60000.0
2,25.0,,M,
3,67.0,45000.0,F,45000.0
4,70.0,32000.0,F,32014.0
5,20.0,61000.0,F,61000.0
6,45.0,34000.0,F,34000.0
7,25.0,65000.0,F,63400.0
8,64.0,7500000.0,M,63400.0


### Assignment:
    - How do you detect missing values in a dataframe?
    - Study the StandardScaler object and try out inverse_transform
    - Compare the results of the KNN imputer with uniform and distance weights (along with outlier removal and scaling)