### Random Noise Addition

This a data augmentation technique where a small amount of random noise is added to the existing data points in a dataset. The purpose of adding noise is to introduce variability to the data while preserving its overall characteristics. 


In [7]:
import pandas as pd
import numpy as np

data = pd.read_csv("german_credit_prepared.csv")

# Filter data slice
good_credit_slice = data[data['credit_history'] == "all credits at this bank paid back duly"]

# Augment credit amount and duration

# augmented_credit_amount = good_credit_slice['credit_amount'] * np.random.uniform(0.9, 1.1, len(good_credit_slice))
augmented_duration = good_credit_slice['duration_in_month'] + np.random.randint(-3, 4, len(good_credit_slice))
augmented_credit_amount = good_credit_slice['credit_amount'] + np.random.normal(0.0, 1.0, len(good_credit_slice))


# Create augmented data
augmented_data = good_credit_slice.copy()
augmented_data['credit_amount'] = augmented_credit_amount
augmented_data['duration_in_month'] = augmented_duration

# Concatenate augmented data with original data
augmented_data_combined = pd.concat([data, augmented_data], ignore_index=True)

# Export 
augmented_data_combined.to_csv("augmented_data_strategy1bis.csv", index=False)

2 datasets are formed, one using a uniform distribution on the credit amount where the original value is multiplied by a number between 0.9 and 1.1. (augmented_data_strategy1.csv) 

Uniform distribution - accuracy : 0.7904761904761904 

The other one is by adding a white noise to the credit amount. (augmented_data_strategy1bis.csv)

Gaussian Distribution - accuracy : 0.7857142857142857

However, a better result was achieved by multiplying the original value by the gaussian distribution instead of adding it. The accuracy was 0.8095238095238095.


### Feature Scaling

We are using now Feature Scaling, a preprocessing technique to standardize or normalize the range of independent variables (features) in the dataset.
The main goal is to bring all the features to a similar scale so that they contribute equally to the learning process. There are two common methods of feature scaling: Min-Max Scaling and Standardization.

Here, the used method is Min-Max Scaling, it transforms features to a specific range, usually between 0 and 1. It preserves the relationships between the data points by normalizing them.

Its formula is X_scaled = (X - X_min) / (X_max - X_min).


In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv("german_credit_prepared.csv")

# Filter data slices
other_slice = data[data['purpose'] == "Other"]
duration_36_slice = data[data['duration_in_month'] == 36]
low_balance_slice = data[data['account_check_status'] == "<0 DM"]
good_credit_slice = data[data['credit_history'] == "all credits at this bank paid back duly"]

# Apply feature scaling to credit_amount and age for non-empty slices
scaler = MinMaxScaler()

if not other_slice.empty:
    scaled_credit_amount_other = scaler.fit_transform(other_slice[['credit_amount']])
    augmented_other_slice = other_slice.copy()
    augmented_other_slice['credit_amount'] = scaled_credit_amount_other

if not duration_36_slice.empty:
    scaled_age_duration_36 = scaler.fit_transform(duration_36_slice[['age']])
    augmented_duration_36_slice = duration_36_slice.copy()
    augmented_duration_36_slice['age'] = scaled_age_duration_36

# Concatenate augmented slices with the original data
augmented_data_slices = [data]
if not other_slice.empty:
    augmented_data_slices.append(augmented_other_slice)
if not duration_36_slice.empty:
    augmented_data_slices.append(augmented_duration_36_slice)
augmented_data_slices.append(low_balance_slice)
augmented_data_slices.append(good_credit_slice)
augmented_data = pd.concat(augmented_data_slices, ignore_index=True)

# Export 
augmented_data.to_csv("augmented_data_strategy3.csv", index=False)


accuracy: 0.7709251101321586

### Categorical Data Augmentation with Random Replacements

This technique involves randomly replacing categorical feature values with alternative values to create augmented data. It focuses on introducing variability and diversity into the categorical features of a dataset. Categorical features are variables that represent discrete categories or groups; in our case, "credit_history," "purpose," or "gender" are such categorical features, but we're going to focus only on "credit_history".

The alternative values we are going to take should correlate with the original data. For instance, if all credits are paid back duly, we won't choose "critical account".

In [12]:
import pandas as pd
import numpy as np

data = pd.read_csv("german_credit_prepared.csv")

# Filter data slice
credit_history_slice = data[data['credit_history'] == "all credits at this bank paid back duly"]

# Define the values in the credit_history column
credit_history_values = ['existing credits paid back duly till now	', 'delay in paying off in the past	', 'no credits taken/ all credits paid back duly', 'all credits at this bank paid back duly']

# Augment credit history using random replacements
augmented_credit_history = np.random.choice(credit_history_values, size=len(credit_history_slice), replace=True)

# Create augmented data
augmented_credit_history_slice = credit_history_slice.copy()
augmented_credit_history_slice['credit_history'] = augmented_credit_history

# Concatenate augmented slice with the original data
augmented_data = pd.concat([data, augmented_credit_history_slice], ignore_index=True)


# Export
augmented_data.to_csv("augmented_data_strategy4.csv", index=False)


accuracy: 0.7904761904761904

### Random Noise Addition (again)

We are going to add random noise on more features this time. The distribution is going to be normal for some and gaussian for others.

In [14]:
import pandas as pd
import numpy as np

# Load your dataset
data = pd.read_csv("german_credit_prepared.csv")

# Filter data slices
good_credit_slice = data[data['credit_history'] == "all credits at this bank paid back duly"]

# Augment multiple columns
columns_to_augment_normal = ['credit_amount', 'installment_as_income_perc']
columns_to_augment_uniform = [ 'duration_in_month',  'age']


# Define the range of perturbations for each column
perturbation_ranges = {
    'credit_amount': (0.9, 1.1),
    'duration_in_month': (-3, 4),
    'installment_as_income_perc': (0.9, 1.1),
    'age': (-2, 3)
}

# Create augmented data (uniform noise)
augmented_data = pd.DataFrame()
for column in columns_to_augment_uniform:
    perturbation_range = perturbation_ranges[column]
    
    # Apply perturbations
    perturbation_values = np.random.randint(perturbation_range[0], perturbation_range[1], len(good_credit_slice))
    augmented_column = good_credit_slice[column] * perturbation_values
    
    augmented_slice = good_credit_slice.copy()
    augmented_slice[column] = augmented_column
    
    augmented_data = pd.concat([augmented_data, augmented_slice], ignore_index=True)


# Create augmented data (Gaussian noise)
for column in columns_to_augment_normal:
    perturbation_range = perturbation_ranges[column]
    
    # Apply perturbations
    perturbation_values = np.random.normal(perturbation_range[0], perturbation_range[1], len(good_credit_slice))
    augmented_column = good_credit_slice[column] * perturbation_values
    
    augmented_slice = good_credit_slice.copy()
    augmented_slice[column] = augmented_column
    
    augmented_data = pd.concat([augmented_data, augmented_slice], ignore_index=True)

# Concatenate augmented data with original data
augmented_data_combined = pd.concat([data, augmented_data], ignore_index=True)

# Export augmented data to a new CSV file
augmented_data_combined.to_csv("augmented_data_strategy5.csv", index=False)


accuracy : 0.8083333333333333

### Data Duplication

By duplicating data, we increase the size of the Dataset. This could be good useful if the dataset is small, although it comes with a risk of overfitting.

In [32]:
import pandas as pd

data = pd.read_csv("german_credit_prepared.csv")

# Duplicate the dataset to double its size
duplicated_data = pd.concat([data, data], ignore_index=True)

# Export
duplicated_data.to_csv("augmented_data_strategy6.csv", index=False)


accuracy : 0.765


### Feature Scaling + Random Noise Addition + Shuffle

Here, we are combining two of the most fruitful methods we used that can be mixed without much difficulty. After that, the dataset will get shuffled to prevent any inherent pattern and to reduce possible bias.

In [20]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler


data = pd.read_csv("german_credit_prepared.csv")


# Filter data slices
other_slice = data[data['purpose'] == "Other"]
duration_36_slice = data[data['duration_in_month'] == 36]
low_balance_slice = data[data['account_check_status'] == "<0 DM"]
good_credit_slice = data[data['credit_history'] == "all credits at this bank paid back duly"]

# Apply feature scaling to credit_amount and age for non-empty slices
scaler = MinMaxScaler()

if not other_slice.empty:
    scaled_credit_amount_other = scaler.fit_transform(other_slice[['credit_amount']])
    scaked_other_slice = other_slice.copy()
    scaled_other_slice['credit_amount'] = scaled_credit_amount_other

if not duration_36_slice.empty:
    scaled_age_duration_36 = scaler.fit_transform(duration_36_slice[['age']])
    scaled_duration_36_slice = duration_36_slice.copy()
    scaled_duration_36_slice['age'] = scaled_age_duration_36

# Concatenate augmented slices with the original data
scaled_data_slices = [data]
if not other_slice.empty:
    scaled_data_slices.append(scaled_other_slice)
if not duration_36_slice.empty:
    scaled_data_slices.append(scaled_duration_36_slice)
scaled_data_slices.append(low_balance_slice)
scaled_data_slices.append(good_credit_slice)
scaled_data = pd.concat(scaled_data_slices, ignore_index=True)



# Augment multiple columns
columns_to_augment_normal = ['credit_amount', 'installment_as_income_perc']
columns_to_augment_uniform = [ 'duration_in_month',  'age']


# Define the range of perturbations for each column
perturbation_ranges = {
    'credit_amount': (0.9, 1.1),
    'duration_in_month': (-3, 4),
    'installment_as_income_perc': (0.9, 1.1),
    'age': (-2, 3)
}

good_credit_slice = scaled_data[scaled_data['credit_history'] == "all credits at this bank paid back duly"]


# Create augmented data (uniform noise)
augmented_data = pd.DataFrame()
for column in columns_to_augment_uniform:
    perturbation_range = perturbation_ranges[column]
    
    # Apply perturbations
    perturbation_values = np.random.randint(perturbation_range[0], perturbation_range[1], len(good_credit_slice))
    augmented_column = good_credit_slice[column] * perturbation_values
    
    augmented_slice = good_credit_slice.copy()
    augmented_slice[column] = augmented_column
    
    augmented_data = pd.concat([augmented_data, augmented_slice], ignore_index=True)


# Create augmented data (Gaussian noise)
for column in columns_to_augment_normal:
    perturbation_range = perturbation_ranges[column]
    
    # Apply perturbations
    perturbation_values = np.random.normal(perturbation_range[0], perturbation_range[1], len(good_credit_slice))
    augmented_column = good_credit_slice[column] * perturbation_values
    
    augmented_slice = good_credit_slice.copy()
    augmented_slice[column] = augmented_column
    
    augmented_data = pd.concat([augmented_data, augmented_slice], ignore_index=True)

# Concatenate augmented data with original data
augmented_data_combined = pd.concat([data, augmented_data], ignore_index=True)

# Shuffle the augmented dataset
shuffled_augmented_data = augmented_data.sample(frac=1, random_state=42).reset_index(drop=True)

# Export the shuffled augmented data to a new CSV file
shuffled_augmented_data.to_csv("augmented_data_strategy7.csv", index=False)


After training the model a few times with augmented datasets using this method, we can get these accuracies:    
0.9625   
0.975       
0.9375  
0.95    
which is by far better than the initial 0.755 we had with the original dataset.