In [1]:
import numpy as np
import pandas as pd
import math
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split

In [2]:
raw_data = pd.read_csv('./credit+approval/crx.data', header=None, na_values="?")
cont_attr = [1,2,7,10,13,14]
dist_attr = []
for i in range(15):
    if i not in cont_attr:
        dist_attr.append(i)


# 1. Handling Missing Values
+ For continuous columns we use the __mean__
+ For discrete columns we use the __mode__

## Dealing with Missing Values in Machine Learning (copilot response)

Handling missing values is an important step in the data preprocessing phase of a machine learning project. Here are some common methods to deal with missing values:

### 1. Remove Missing Values

- **Remove Rows:** If the dataset has a small number of missing values, you can simply remove those rows.

    ```python
    # Remove rows with any missing values
    data_cleaned = data.dropna()
    ```

### 2. Impute Missing Values

- **Mean/Median/Mode Imputation:** Replace missing values with the mean, median, or mode of the respective column.

    ```python
    from sklearn.impute import SimpleImputer

    # Replace missing values with mean
    mean_imputer = SimpleImputer(strategy='mean')
    data_imputed = mean_imputer.fit_transform(data)

    # Replace missing values with median
    median_imputer = SimpleImputer(strategy='median')
    data_imputed = median_imputer.fit_transform(data)

    # Replace missing values with mode
    mode_imputer = SimpleImputer(strategy='most_frequent')
    data_imputed = mode_imputer.fit_transform(data)

    ```
- **Constant Value Imputation:** Replace missing values with a specific constant value, such as 0 or -1.
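
Constant value imputation can be sketched as follows. The toy DataFrame and its column names (`age`, `income`) are assumptions made up for illustration; `SimpleImputer(strategy="constant")` and `DataFrame.fillna` are the two usual ways to do it.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing entry per column
data = pd.DataFrame({"age": [25.0, np.nan, 40.0],
                     "income": [3000.0, 4200.0, np.nan]})

# Replace every missing value with the constant 0
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
data_imputed = pd.DataFrame(constant_imputer.fit_transform(data),
                            columns=data.columns, index=data.index)

# Equivalent pandas one-liner
data_filled = data.fillna(0)
```

A constant such as 0 or -1 is only a sensible choice when it cannot be confused with a real value of the feature.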
### 3. Predict Missing Values

- **Use Machine Learning Models:** Predict missing values based on other features in the dataset.
    ```python 
    from sklearn.ensemble import RandomForestRegressor

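    # Assume 'data' is a DataFrame whose other columns are numeric and complete,
    # and we want to predict the missing values in 'target_column'
    missing_data = data[data['target_column'].isnull()].copy()  # copy, so the assignment below does not write to a view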
    available_data = data[data['target_column'].notnull()]

    # Train a model on available data
    model = RandomForestRegressor()
    model.fit(available_data.drop('target_column', axis=1), available_data['target_column'])

    # Predict missing values
    missing_data['target_column'] = model.predict(missing_data.drop('target_column', axis=1))

    # Combine the datasets back together and restore the original row order
    data_imputed = pd.concat([available_data, missing_data]).sort_index()

    ```
### 4. Use Algorithms That Support Missing Values

- Certain algorithms, such as some decision-tree implementations and __XGBoost__, can handle missing values natively, without requiring explicit imputation.

    ```python 
    from xgboost import XGBClassifier

    # Train an XGBoost model on data with missing values
    model = XGBClassifier()
    model.fit(X_train, y_train)
    ```
### 5. Create a Missing Indicator

- Create an additional feature indicating whether a value was missing.
    ```python 
    # Create an indicator column for missing values
    data['missing_indicator'] = data['target_column'].isnull().astype(int)

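    # Impute missing values after creating the indicator
    from sklearn.impute import SimpleImputer
    mean_imputer = SimpleImputer(strategy='mean')
    # fit_transform returns a 2D array; flatten it back to one column
    data['target_column'] = mean_imputer.fit_transform(data[['target_column']]).ravel()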

    ```
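
The indicator and the imputation can also be produced in one step with the `add_indicator` option of `SimpleImputer`. This is a minimal sketch on a made-up array; by default the imputer appends one indicator column for each feature that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy array: only the first feature has a missing value
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

# Columns of X_out: imputed feature 0, imputed feature 1,
# missing indicator for feature 0
```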

**Conclusion**

Handling missing values appropriately can significantly improve the performance of your machine learning models. The method you choose depends on the nature of your data and the specific requirements of your project. Explore and understand the impact of missing values on your dataset before deciding on a strategy.

In [3]:
# Continuous => mean
# Discrete => mode
print("Before Preprocessing:",raw_data.isna().sum())
for index in range(15):
    if index in cont_attr:
        _mean = raw_data[index].mean()
        raw_data[index] = raw_data[index].replace(math.nan, _mean)
        continue
    _mode = raw_data[index].mode()[0]
    raw_data[index] = raw_data[index].replace(math.nan, _mode)

print("After Preprocessing:",raw_data.isna().sum())

Before Preprocessing: 0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64
After Preprocessing: 0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64
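
The same mean/mode imputation can be written without an explicit loop using `DataFrame.fillna`. The sketch below runs on a small made-up frame (`toy`, `toy_cont`, and `toy_dist` are hypothetical stand-ins for `raw_data`, `cont_attr`, and `dist_attr`), so it does not depend on the dataset file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for raw_data: column 0 is continuous, column 1 is discrete
toy = pd.DataFrame({0: [1.0, np.nan, 3.0, 4.0],
                    1: ["a", "b", None, "b"]})
toy_cont, toy_dist = [0], [1]

# Mean for continuous columns, most frequent value for discrete columns
toy[toy_cont] = toy[toy_cont].fillna(toy[toy_cont].mean())
toy[toy_dist] = toy[toy_dist].fillna(toy[toy_dist].mode().iloc[0])
```

`mode()` can return several rows when there are ties, which is why the first row is selected with `.iloc[0]`, mirroring the `mode()[0]` used in the loop above.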


# 2. Normalization


1. min_max_scale method: 
    $$ x_{scale} = \frac{x-x_{min}}{x_{max}-x_{min}}$$

2. mean_scale method:
    $$ x_{scale} = \frac{x-x_{mean}}{x_{max}-x_{min}}$$
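
As a quick check of the min_max_scale formula, the sketch below applies it by hand to a small made-up vector and compares the result with scikit-learn's `MinMaxScaler`. The values are chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values: min is 2, max is 10
x = np.array([2.0, 4.0, 6.0, 10.0])

# min_max_scale by hand: (x - x_min) / (x_max - x_min)
x_scaled = (x - x.min()) / (x.max() - x.min())

# The same transformation through scikit-learn (expects a 2D array)
sk_scaled = MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()
```

Both arrays contain 0, 0.25, 0.5, and 1: the minimum maps to 0, the maximum to 1, and everything else falls proportionally in between.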



## Why Does Normalization Take Place in Machine Learning? (Copilot response to this question)

Normalization is a crucial step in the machine learning process for several key reasons:

1. **Improved Convergence During Training:**
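   - **Gradient Descent:** Algorithms that use gradient descent for optimization, such as neural networks, benefit significantly from normalized data. Features on comparable scales give a better-conditioned loss surface, which usually means faster and more stable convergence. Normalization does not by itself prevent the optimizer from ending up in a poor local minimum.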

2. **Equal Contribution of Features:**
   - **Feature Scaling:** When features in a dataset have different ranges, those with larger ranges can dominate the learning algorithm, making it difficult for the model to learn from other features. Normalization ensures that all features contribute equally to the model's performance.

3. **Reduced Sensitivity to Feature Scaling:**
   - **Distance Metrics:** Algorithms that use distance metrics, such as k-Nearest Neighbors (k-NN) and Support Vector Machines (SVM), are sensitive to the scale of features. Normalization ensures that each feature contributes equally to the distance computation, improving the accuracy of these algorithms.

4. **Enhanced Model Performance:**
   - **Consistency:** Normalized data often lead to better performance across various machine learning models. Consistently scaled data help the algorithms learn more effectively and make better predictions.

5. **Prevents Bias:**
   - **Bias Reduction:** Without normalization, the learning algorithm might be biased towards features with higher magnitudes. Normalization ensures that no single feature dominates the learning process, leading to more balanced and unbiased models.
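
The effect on distance-based methods can be illustrated with a small sketch. The four applicants below (age in years, income in dollars) are made-up numbers, and `dist` is a helper defined only for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical applicants: [age in years, income in dollars]
people = np.array([[25.0, 50_000.0],
                   [60.0, 50_500.0],
                   [26.0, 52_000.0],
                   [40.0, 90_000.0]])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw features: income dominates, so person 1 looks closer to person 0
# than person 2 does, despite a 35-year age gap
raw_01 = dist(people[0], people[1])
raw_02 = dist(people[0], people[2])

# After min-max scaling both features contribute on the same [0, 1] scale
scaled = MinMaxScaler().fit_transform(people)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])
```

On the raw data `raw_01 < raw_02`, while after scaling the order flips and `scaled_02 < scaled_01`, so a nearest-neighbour search would return a different neighbour.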

### Common Normalization Techniques

1. **Min-Max Scaling:**
   - Rescales the feature to a fixed range, typically $[0, 1]$ or $[-1, 1]$.

      ```python
      from sklearn.preprocessing import MinMaxScaler
      scaler = MinMaxScaler()
      normalized_data = scaler.fit_transform(data)
      ```

2. **Standardization (Z-Score Normalization):**
   - Centers the data around the mean with a unit standard deviation.

      ```python
      from sklearn.preprocessing import StandardScaler
      scaler = StandardScaler()
      standardized_data = scaler.fit_transform(data)
      ```

3. **Robust Scaler:**
   - Uses median and interquartile range for scaling, making it robust to outliers.

      ```python
      from sklearn.preprocessing import RobustScaler
      scaler = RobustScaler()
      robust_scaled_data = scaler.fit_transform(data)
      ```

Normalizing your data helps many machine learning models train more reliably and learn from all of the features provided. It is a preprocessing step that can significantly affect the performance and accuracy of your models.
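
To see how the three scalers differ, the sketch below applies each of them to a single made-up column that contains one outlier. The values are an assumption chosen to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature column with one outlier (100)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

min_max = MinMaxScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()
robust = RobustScaler().fit_transform(values).ravel()
```

With min-max scaling the outlier maps to 1 and the four ordinary values are squeezed into roughly the bottom 3% of the range. `RobustScaler` subtracts the median (3) and divides by the interquartile range (2), so the ordinary values stay spread between -1 and 0.5 while the outlier remains far away at 48.5.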

In [4]:
for index in cont_attr:
    x = raw_data[index]
    x_min = raw_data[index].min()
    x_max = raw_data[index].max()
    print("=====")
    print("before index: {0}-min: {1}-max: {2}".format(index, x_min, x_max))
    raw_data[index] = (x-x_min)/(x_max-x_min)

    x_min = raw_data[index].min()
    x_max = raw_data[index].max()
    print("after index: {0}-min: {1}-max: {2}".format(index, x_min, x_max))
    

=====
before index: 1-min: 13.75-max: 80.25
after index: 1-min: 0.0-max: 1.0
=====
before index: 2-min: 0.0-max: 28.0
after index: 2-min: 0.0-max: 1.0
=====
before index: 7-min: 0.0-max: 28.5
after index: 7-min: 0.0-max: 1.0
=====
before index: 10-min: 0-max: 67
after index: 10-min: 0.0-max: 1.0
=====
before index: 13-min: 0.0-max: 2000.0
after index: 13-min: 0.0-max: 1.0
=====
before index: 14-min: 0-max: 100000
after index: 14-min: 0.0-max: 1.0


In [5]:
# using MinMaxScaler (the columns were already scaled in the previous cell, so the values do not change)

min_max_scaler = MinMaxScaler()
x = raw_data[cont_attr].values
x_scaled = min_max_scaler.fit_transform(x)

normalized = pd.DataFrame(x_scaled, columns=cont_attr, index=raw_data.index)
raw_data[cont_attr] = normalized
raw_data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,0.256842,0.000000,u,g,w,v,0.043860,t,t,0.014925,f,g,0.1010,0.00000,+
1,a,0.675489,0.159286,u,g,q,h,0.106667,t,t,0.089552,f,g,0.0215,0.00560,+
2,a,0.161654,0.017857,u,g,q,h,0.052632,t,f,0.000000,f,g,0.1400,0.00824,+
3,b,0.211729,0.055000,u,g,w,v,0.131579,t,t,0.074627,t,g,0.0500,0.00003,+
4,b,0.096541,0.200893,u,g,w,v,0.060000,t,f,0.000000,f,s,0.0600,0.00000,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,0.110226,0.360179,y,p,e,h,0.043860,f,f,0.000000,f,g,0.1300,0.00000,-
686,a,0.134135,0.026786,u,g,c,v,0.070175,f,t,0.029851,t,g,0.1000,0.00394,-
687,a,0.172932,0.482143,y,p,ff,ff,0.070175,f,t,0.014925,t,g,0.1000,0.00001,-
688,b,0.062707,0.007321,u,g,aa,v,0.001404,f,f,0.000000,f,g,0.1400,0.00750,-


## Why and How Nominal Variables Convert into Numerical Values in Machine Learning (copilot response)

### Why Convert Nominal Variables?

Nominal variables are categorical variables whose values are distinct categories or labels without any intrinsic order. Examples include gender, country, or color. Converting these variables into numerical values is essential for the following reasons:

1. **Compatibility with Algorithms:**
   - Many machine learning algorithms require numerical input. Algorithms such as linear regression, logistic regression, and neural networks cannot process categorical data directly.

2. **Improved Performance:**
   - Numerical encoding of categorical variables often leads to better model performance by enabling the algorithms to find meaningful patterns and relationships.

3. **Distance Metrics:**
   - Algorithms that rely on distance metrics (e.g., k-Nearest Neighbors, k-Means clustering) need numerical values to calculate distances between data points.

### How to Convert Nominal Variables

Several techniques can be used to convert nominal variables into numerical values:

#### 1. **Label Encoding**

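Label encoding assigns a unique integer to each category. scikit-learn's `LabelEncoder` numbers the categories in sorted order and is intended mainly for target labels. Integer codes suit ordinal data only when they follow the inherent order of the categories, and they can introduce unintended ordinal relationships for nominal data.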

```python
from sklearn.preprocessing import LabelEncoder

# Example data
data = ['red', 'blue', 'green', 'blue', 'green', 'red']

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the data
encoded_data = label_encoder.fit_transform(data)

print(encoded_data)
```
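
To see which integer each category receives, the fitted encoder's `classes_` attribute can be inspected. This sketch reuses the color list from the example above; `mapping` is a helper dictionary built only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'blue', 'green', 'red']

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

# Codes can be mapped back to the original labels
decoded = encoder.inverse_transform(codes)
```

Here `mapping` is `{'blue': 0, 'green': 1, 'red': 2}`, which shows that the numbering follows alphabetical order rather than any meaningful ranking of the colors.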



#### 2. **One-Hot Encoding**

One-hot encoding creates binary columns for each category, ensuring no ordinal relationships are introduced. It is suitable for nominal data.

```python 
import pandas as pd

# Example data
data = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'green', 'red']})

# Perform one-hot encoding
one_hot_encoded_data = pd.get_dummies(data, columns=['color'])

print(one_hot_encoded_data)
```

In [None]:

# using numpy arrays
'''
one_enc = OneHotEncoder(handle_unknown='ignore',sparse_output=False)
new_raw_data = one_enc.fit_transform(raw_data[dist_attr]) 
# print(new_raw_data.shape)
new_raw_data_df = pd.DataFrame(new_raw_data)

new_raw_data_df.index = raw_data.index
raw_data = raw_data.drop(dist_attr, axis=1)

X= pd.concat([new_raw_data_df, raw_data], axis=1)

X

'''

In [7]:
# using pandas built-in functions
# (the OneHotEncoder version in the previous cell is disabled; this cell performs the same encoding)

one_hot = pd.get_dummies(raw_data[dist_attr])
raw_data = raw_data.drop(dist_attr, axis=1)
X = raw_data.join(one_hot)
X

Unnamed: 0,1,2,7,10,13,14,15,0_a,0_b,3_l,...,6_z,8_f,8_t,9_f,9_t,11_f,11_t,12_g,12_p,12_s
0,0.256842,0.000000,0.043860,0.014925,0.1010,0.00000,+,False,True,False,...,False,False,True,False,True,True,False,True,False,False
1,0.675489,0.159286,0.106667,0.089552,0.0215,0.00560,+,True,False,False,...,False,False,True,False,True,True,False,True,False,False
2,0.161654,0.017857,0.052632,0.000000,0.1400,0.00824,+,True,False,False,...,False,False,True,True,False,True,False,True,False,False
3,0.211729,0.055000,0.131579,0.074627,0.0500,0.00003,+,False,True,False,...,False,False,True,False,True,False,True,True,False,False
4,0.096541,0.200893,0.060000,0.000000,0.0600,0.00000,+,False,True,False,...,False,False,True,True,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0.110226,0.360179,0.043860,0.000000,0.1300,0.00000,-,False,True,False,...,False,True,False,True,False,True,False,True,False,False
686,0.134135,0.026786,0.070175,0.029851,0.1000,0.00394,-,True,False,False,...,False,True,False,False,True,False,True,True,False,False
687,0.172932,0.482143,0.070175,0.014925,0.1000,0.00001,-,True,False,False,...,False,True,False,False,True,False,True,True,False,False
688,0.062707,0.007321,0.001404,0.000000,0.1400,0.00750,-,False,True,False,...,False,True,False,True,False,True,False,True,False,False


#### 3. **Ordinal Encoding**

Ordinal encoding assigns integer values to categories based on a specified order. It is appropriate for ordinal data where the order of categories matters.

```python 
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Example data with an inherent order
data = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small']})

# Define the order of categories
categories = [['small', 'medium', 'large']]

# Initialize the OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=categories)

# Fit and transform the data
encoded_data = ordinal_encoder.fit_transform(data[['size']])

print(encoded_data)
```

#### 4. **Target Encoding**
Target encoding replaces each category with the mean of the target variable for that category. It can be useful for high-cardinality categorical variables but may introduce data leakage if not applied carefully.

```python 

import pandas as pd

# Example data
data = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'A'],
                     'target': [1, 2, 3, 4, 5]})

# Calculate the mean target for each category
target_mean = data.groupby('category')['target'].mean()

# Replace categories with the mean target value
data['category_encoded'] = data['category'].map(target_mean)

print(data)
```

In [8]:
# splitting the dataset

Y = np.array(X.pop(15))
X = np.array(X)

print(X.shape, Y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=.2)
print(X_test.shape, X_train.shape)

(690, 46) (690,)
(138, 46) (552, 46)
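
The steps in this notebook (imputation, scaling, one-hot encoding, splitting) can also be arranged so that the preprocessing statistics are learned from the training split only. The sketch below is one possible arrangement using `ColumnTransformer` and `Pipeline`, not the approach used above. The toy frame, its column names (`amount`, `kind`), and the random labels are assumptions standing in for `crx.data`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: one continuous and one discrete column
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "amount": rng.uniform(0, 1000, size=n),
    "kind": rng.choice(["a", "b", "c"], size=n),
})
df.loc[::10, "amount"] = np.nan          # inject some missing values
labels = rng.choice(["+", "-"], size=n)

# Split first; random_state makes the split reproducible,
# stratify keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    df, labels, test_size=0.2, random_state=0, stratify=labels)

preprocess = ColumnTransformer([
    ("cont", Pipeline([("impute", SimpleImputer(strategy="mean")),
                       ("scale", MinMaxScaler())]), ["amount"]),
    ("dist", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                       ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["kind"]),
])

# Means, modes, min/max, and categories come from the training data only
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
```

Fitting on the training split and only transforming the test split keeps information from the test rows out of the imputation and scaling statistics.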
