# Feature Scaling in EDA (Exploratory Data Analysis)

1. **Definition**:
    - Feature scaling rescales the independent variables (features) of a dataset to a common range or distribution, so that no feature dominates simply because of its units.

2. **Importance**:
    - Prevents features with large numeric ranges from dominating distance calculations and gradient updates.
    - Improves the training stability, and often the performance, of scale-sensitive machine learning algorithms.

3. **Common Methods**:
    - **Min-Max Scaling (Normalization)**:
      - Scales the data to a fixed range, usually 0 to 1.
      - Formula: X_scaled = (X - X_min) / (X_max - X_min)
    - **Standardization (Z-score Normalization)**:
      - Scales the data to have a mean of 0 and a standard deviation of 1.
      - Formula: X_scaled = (X - mean) / std
    - **Robust Scaling**:
      - Uses the median and the interquartile range (IQR = Q3 - Q1) for scaling.
      - Less sensitive to outliers.
      - Formula: X_scaled = (X - median) / IQR

4. **When to Use**:
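    - Algorithms that compute distances or inner products (e.g., KNN, SVM, K-means).
    - Models trained with gradient descent (e.g., linear regression, logistic regression, neural networks), where features on similar scales help the optimizer converge.
    - Principal Component Analysis (PCA), which otherwise favors the features with the largest variance.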

5. **Implementation in Python**:
    ```python
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

    # Example data
    data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

    # Min-Max Scaling
    min_max_scaler = MinMaxScaler()
    data_min_max = min_max_scaler.fit_transform(data)

    # Standardization
    standard_scaler = StandardScaler()
    data_standard = standard_scaler.fit_transform(data)

    # Robust Scaling
    robust_scaler = RobustScaler()
    data_robust = robust_scaler.fit_transform(data)
    ```

6. **Considerations**:
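    - Fit the scaler on the training data only, then use it to transform both the training and test data, so that test-set statistics do not leak into training.
    - Be cautious with categorical features: one-hot encoded (0/1) columns are usually left unscaled.
    - Check the distribution of features before and after scaling; the methods above shift and rescale values but do not change the shape of the distribution.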

7. **Conclusion**:
    - Feature scaling is a crucial step in the EDA process.
    - It helps scale-sensitive machine learning models train stably and perform consistently.
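
The advice to fit the scaler on the training data only can be sketched as follows. The feature matrix is hypothetical (two made-up columns on very different scales), and the split parameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 8 rows, 2 features on very different scales.
X = np.array([
    [25, 1000.0], [32, 52000.0], [47, 81000.0], [51, 0.0],
    [38, 125000.0], [29, 43000.0], [60, 99000.0], [44, 15000.0],
])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# Training columns are centered on 0; test columns generally are not.
print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(6))
```

Calling `fit_transform` on the test set instead would recompute the mean and standard deviation from test rows, which is the leakage the consideration above warns about.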

### Implementation of Feature Scaling


In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns   

In [7]:
df=pd.read_csv('assets/cleaned_data.csv')

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        10000 non-null  int64  
 1   Gender             10000 non-null  object 
 2   Age                10000 non-null  float64
 3   Tenure             10000 non-null  int64  
 4   Balance            10000 non-null  float64
 5   NumOfProducts      10000 non-null  int64  
 6   HasCrCard          10000 non-null  int64  
 7   IsActiveMember     10000 non-null  int64  
 8   EstimatedSalary    10000 non-null  float64
 9   Exited             10000 non-null  int64  
 10  Geography_Germany  10000 non-null  bool   
 11  Geography_Spain    10000 non-null  bool   
 12  Gender_encoded     10000 non-null  float64
dtypes: bool(2), float64(4), int64(6), object(1)
memory usage: 879.0+ KB


In [9]:
# Assign the filled column back to df. Calling fillna(..., inplace=True) on a
# column selection is chained assignment: pandas 2.x emits a FutureWarning for it,
# and in pandas 3.0 it no longer modifies the original DataFrame.
df["Age"] = df["Age"].fillna(df["Age"].mean())


In [13]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [14]:
df.describe()

| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Gender_encoded |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 650.5288 | 38.893505 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 | 0.5464 |
| std | 96.653299 | 10.306848 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 | 0.497867 |
| min | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 | 0.0 |
| 25% | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 | 0.0 |
| 50% | 652.0 | 38.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 | 1.0 |
| 75% | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 | 1.0 |
| max | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 | 1.0 |
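
The summary above shows how different the ranges are: `Age` spans 18 to 92 while `Balance` spans 0 to about 250,000. A minimal sketch of what that does to a Euclidean distance, using three hypothetical customers (not rows from the dataset) and the min and max values from the summary:

```python
import math

# Observed ranges copied from the summary above.
AGE_MIN, AGE_MAX = 18.0, 92.0
BAL_MIN, BAL_MAX = 0.0, 250898.09

def min_max(x, lo, hi):
    """Min-max scale a single value to [0, 1]."""
    return (x - lo) / (hi - lo)

def scale(point):
    """Scale an (Age, Balance) pair with the observed ranges."""
    age, balance = point
    return (min_max(age, AGE_MIN, AGE_MAX), min_max(balance, BAL_MIN, BAL_MAX))

# Hypothetical customers as (Age, Balance).
a = (20.0, 50000.0)
b = (80.0, 50500.0)  # 60 years older, almost the same balance
c = (20.0, 52000.0)  # same age, slightly higher balance

raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)
scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values, `b` looks closer to `a` than `c` does because the 60-year age gap is tiny next to the balance units. After scaling, the ordering flips.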


In [15]:
df.head()

| | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | Female | 42.0 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | False | False | 0.0 |
| 1 | 608 | Female | 41.0 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | False | True | 0.0 |
| 2 | 502 | Female | 42.0 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | False | False | 0.0 |
| 3 | 699 | Female | 39.0 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | False | False | 0.0 |
| 4 | 850 | Female | 43.0 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | False | True | 0.0 |


In [16]:
scaler = StandardScaler()

# Work on a copy of the two numeric columns to be scaled
new_df = df[["Age", "Tenure"]].copy()


In [17]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     10000 non-null  float64
 1   Tenure  10000 non-null  int64  
dtypes: float64(1), int64(1)
memory usage: 156.4 KB


In [18]:
# fit_transform learns each column's mean and standard deviation,
# then returns the standardized values as a NumPy array
transformed_df = scaler.fit_transform(new_df)
print(transformed_df)

[[ 0.30141612 -1.04175968]
 [ 0.20438839 -1.38753759]
 [ 0.30141612  1.03290776]
 ...
 [-0.28075021  0.68712986]
 [ 0.30141612 -0.69598177]
 [-1.05697198 -0.35020386]]
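
The first printed row can be checked by hand with the standardization formula. This is a sketch that copies the `mean` and `std` values for `Age` and `Tenure` from the `df.describe()` output; `StandardScaler` divides by the population standard deviation, so the sample `std` that pandas reports is converted first. The helper `z_score` is a name introduced here for illustration.

```python
import math

n = 10000  # number of rows reported by df.info()

def z_score(x, mean, sample_std, n):
    """Standardize x using the population std derived from a sample std."""
    population_std = sample_std * math.sqrt((n - 1) / n)
    return (x - mean) / population_std

# First row of the data: Age = 42.0, Tenure = 2.
z_age = z_score(42.0, 38.893505, 10.306848, n)
z_tenure = z_score(2, 5.0128, 2.892174, n)

print(round(z_age, 3), round(z_tenure, 3))  # 0.301 -1.042
```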


In [21]:
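# Wrap the scaled array in a DataFrame so the original column names are kept
scaled_df = pd.DataFrame(transformed_df, columns=new_df.columns)
# describe() reports the sample standard deviation (ddof=1), while StandardScaler
# divides by the population standard deviation, hence std = 1.00005 rather than exactly 1
scaled_df.describe()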

| | Age | Tenure |
|---|---|---|
| count | 10000.0 | 10000.0 |
| mean | 1.01501e-15 | -1.078249e-16 |
| std | 1.00005 | 1.00005 |
| min | -2.027249 | -1.733315 |
| 25% | -0.6688611 | -0.6959818 |
| 50% | -0.08669477 | -0.004425957 |
| 75% | 0.4954716 | 0.6871299 |
| max | 5.152802 | 1.724464 |
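
The `max` row above shows `Age` reaching about 5.15 standard deviations above the mean, so a few extreme values are present. The notebook does not apply robust scaling. As a rough sketch on toy data (the arrays below are made up), this is how the mean and standard deviation react to a single outlier compared with the median and IQR used in the robust scaling formula:

```python
import numpy as np

def summary(values):
    """Return the statistics used by standardization and by robust scaling."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "mean": values.mean(),
        "std": values.std(),
        "median": np.median(values),
        "iqr": q3 - q1,
    }

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 500.0])  # hypothetical extreme value

s_clean, s_out = summary(clean), summary(with_outlier)

# Scale the same value (4.0) with each pair of statistics.
z_clean = (4.0 - s_clean["mean"]) / s_clean["std"]
z_out = (4.0 - s_out["mean"]) / s_out["std"]
r_clean = (4.0 - s_clean["median"]) / s_clean["iqr"]
r_out = (4.0 - s_out["median"]) / s_out["iqr"]

print("z-score:", round(z_clean, 3), "->", round(z_out, 3))
print("robust: ", r_clean, "->", r_out)
```

The z-score of the unchanged value moves substantially once the outlier is added, while the robust-scaled value stays the same in this toy case.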


In [22]:
# Rescale each column to [0, 1] using its observed minimum and maximum
min_max = MinMaxScaler()
transformed_df_minmax = min_max.fit_transform(new_df)

In [23]:
print(transformed_df_minmax)

[[0.32432432 0.2       ]
 [0.31081081 0.1       ]
 [0.32432432 0.8       ]
 ...
 [0.24324324 0.7       ]
 [0.32432432 0.3       ]
 [0.13513514 0.4       ]]
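
The first row of this output can be reproduced with the min-max formula, using the `min` and `max` values of `Age` (18 and 92) and `Tenure` (0 and 10) from `df.describe()`. The function name `min_max_scale` is introduced here for illustration only.

```python
def min_max_scale(x, x_min, x_max):
    """Apply X_scaled = (X - X_min) / (X_max - X_min) to a single value."""
    return (x - x_min) / (x_max - x_min)

# First row of the data: Age = 42.0, Tenure = 2.
age_scaled = min_max_scale(42.0, 18.0, 92.0)
tenure_scaled = min_max_scale(2, 0, 10)

print(round(age_scaled, 8), tenure_scaled)  # 0.32432432 0.2
```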


In [24]:
from sklearn.preprocessing import OneHotEncoder



In [28]:
df2=pd.read_csv('assets/Churn_Modelling.csv')

In [29]:
# Initialize the OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the 'Geography' column
geography_encoded = encoder.fit_transform(df2[['Geography']]).toarray()

# Create a DataFrame with the encoded columns
geography_encoded_df = pd.DataFrame(geography_encoded, columns=encoder.get_feature_names_out(['Geography']))

# Concatenate the original dataframe with the encoded columns
df2_encoded = pd.concat([df2, geography_encoded_df], axis=1)

# Drop the original 'Geography' column
df2_encoded.drop('Geography', axis=1, inplace=True)

# Display the first few rows of the new dataframe
df2_encoded.head()

| | RowNumber | CustomerId | Surname | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_France | Geography_Germany | Geography_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 15647311 | Hill | 608 | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0.0 | 0.0 | 1.0 |
| 2 | 3 | 15619304 | Onio | 502 | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1.0 | 0.0 | 0.0 |
| 3 | 4 | 15701354 | Boni | 699 | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 1.0 | 0.0 | 0.0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0.0 | 0.0 | 1.0 |
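
The encoder above produces one column per country, whereas the cleaned dataset loaded at the start has only `Geography_Germany` and `Geography_Spain`. How that file was prepared is not shown. One common way to arrive at that layout, sketched here as an assumption on toy data, is to drop the first category so that a row with both indicators false means France:

```python
import pandas as pd

# Toy column with the three country values seen in the data.
toy = pd.DataFrame({"Geography": ["France", "Spain", "France", "Germany"]})

# drop_first=True removes the first category in sorted order (France).
dummies = pd.get_dummies(toy["Geography"], prefix="Geography", drop_first=True)

print(dummies.columns.tolist())  # ['Geography_Germany', 'Geography_Spain']
print(dummies)
```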


In [30]:
df2.head()

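| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |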
