## Navigation
1. [Start Here](hey.ipynb)
1. [Load Data and Clean](/eda.ipynb)
1. [To Clean, or Not To Clean?](eval_v1.ipynb)
1. Generate Datasets
    1. [Faker Naive](faker_naive.ipynb)
    1. [Faker Plus](faker_plus.ipynb)
    1. [SDV Naive](sdv_v1.ipynb)
    1. [SDV More Better](sdv_v2.ipynb)
    1. [SDV TVAE]()
1. Compare and Evaluate Performance
    1. [First impressions](eval_v2.ipynb)
    1. [Loan financial models](eval_v3.ipynb)
    1. [Predicting default risk](eval_v4.ipynb)
    1. [How hackable]()

# Faker Plus
#### Faker, but more better

The most naive use of Faker to generate data leaves much to be desired. This notebook defines a custom module that generates slightly more sophisticated Faker data, which better matches the statistical properties of the original data set: min, max, mean, standard deviation, and the frequency distributions of categorical variables.
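
As a rough illustration of what matching the statistical properties means for a numeric column, the sketch below draws values from a normal distribution using a column's mean and standard deviation, then clips them to the observed range. The name `loan_amnt_stats` and all of its numbers are made up for the example; they are not taken from the original data.

```python
import numpy as np

# Hypothetical summary statistics for one numeric column (illustrative values only).
loan_amnt_stats = {"min": 500.0, "max": 40000.0, "mean": 15000.0, "stddev": 9000.0}

rng = np.random.default_rng(seed=42)  # seeded so the sketch is repeatable

# Draw from a normal distribution centred on the column mean...
samples = rng.normal(
    loc=loan_amnt_stats["mean"],
    scale=loan_amnt_stats["stddev"],
    size=1_000,
)
# ...then clip so no value falls outside the observed range.
samples = np.clip(samples, loan_amnt_stats["min"], loan_amnt_stats["max"])

print(samples.min(), samples.max(), round(samples.mean(), 2))
```

Clipping piles the out-of-range draws up at the min and max, so the result only approximates the shape of the original column.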

Note that this should not be used for personally identifiable information such as names and addresses: the module treats such string values as categorical and simply replicates the original strings.
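
To see why, here is a minimal sketch of frequency-based categorical sampling applied to a name column. The names and probabilities are invented for illustration; the point is only that every sampled value is one of the original strings.

```python
import numpy as np

# Hypothetical "name" column from an original dataset (made-up values).
original_names = ["Ada Lovelace", "Grace Hopper", "Alan Turing"]
distribution = {"Ada Lovelace": 0.5, "Grace Hopper": 0.3, "Alan Turing": 0.2}

rng = np.random.default_rng(seed=0)
fake_names = rng.choice(
    list(distribution.keys()),
    size=10,
    p=list(distribution.values()),
)

# Every "fake" value is one of the original strings, copied verbatim.
leaked = set(fake_names.tolist()) <= set(original_names)
print(leaked)
```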

These results will be compared to the data sets generated by the most simplistic use of Faker(). I have affectionately named this version "Faker Plus".

In [1]:
# Build a more sophisticated Faker dataset that preserves more of the
# statistical properties of the original dataset: min, max, mean, stddev,
# and frequency distributions for categorical variables.
# The metadata for these is stored in a JSON file.
#
# The code below exercises the DataFrameGenerator class using a JSON config file.
import json
import logging
import os
import random
import sys

import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()

# Configure logging
logging.basicConfig(
    filename='fauxnalysis.log',
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

class DataFrameGenerator:
    def __init__(self, config):
        """Initialize the DataFrameGenerator with a configuration."""
        self.config = config
        self.fake = Faker()
        self.n_rows = config.get('n_rows', 10)
        self.data = {}

    def generate_dataframe(self):
        """Generate a DataFrame based on the configuration."""
        for col in self.config['columns']:
            col_name = col['name']
            col_type = col['type']
            null_count = col.get('null_count', 0)

            self.data[col_name] = self.generate_column_data(col_type, col)

            # Introduce null values
            if null_count > 0:
                self._introduce_nulls(col_name, null_count)

        logging.info("Generated fake DataFrame.")
        return pd.DataFrame(self.data)

    def generate_column_data(self, col_type, col_config):
        """Generate data for a specific column type."""
        generator = {
            'categorical': self._generate_categorical_data,
            'float': self._generate_numerical_data,
            'int': self._generate_numerical_data,
            'bool': lambda _: np.random.choice([True, False], self.n_rows),
            'state': self._generate_state_data,
            'bothify': self._generate_bothify_data,
            'string': lambda _: [self.fake.sentence() for _ in range(self.n_rows)],
            'year': lambda _: [self.fake.year() for _ in range(self.n_rows)],
            'date': lambda _: [self._generate_date() for _ in range(self.n_rows)]
        }

        if col_type in generator:
            return generator[col_type](col_config)
        
        raise ValueError(f"Unsupported data type: {col_type}")
    
    def _generate_date(self):
        """Return a random year between 2013 and 2018 (inclusive) as a string."""
        # Resample until Faker returns a year inside the target window.
        while True:
            year = self.fake.year()
            if 2012 < int(year) < 2019:
                return year


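    def _introduce_nulls(self, col_name, null_count):
        """Randomly introduce null values into the specified column."""
        # Work on a plain list: assigning None into a NumPy string or bool
        # array would coerce it to text or False instead of a real null.
        values = list(self.data[col_name])
        indices = np.random.choice(self.n_rows, null_count, replace=False)
        for index in indices:
            values[index] = None
        self.data[col_name] = values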

    def _generate_numerical_data(self, col_config):
        """Generate numerical data based on statistical parameters."""
        min_val = col_config.get('min')
        max_val = col_config.get('max')

        if min_val is None or max_val is None:
            raise ValueError("Both 'min' and 'max' values must be specified.")

        mean = col_config.get('mean', (min_val + max_val) / 2)
        # Accept both 'stddev' and the 'std_dev' key written by create_metadata_json.
        default_stddev = (max_val - min_val) / 6
        stddev = col_config.get('stddev', col_config.get('std_dev', default_stddev))

        col_type = col_config['type']
        if col_type in ['float', 'curr']:
            data = np.random.normal(loc=mean, scale=stddev, size=self.n_rows)
        elif col_type == 'int':
            # randint excludes the upper bound, so add 1 to make max_val reachable.
            data = np.random.randint(min_val, max_val + 1, self.n_rows)
        else:
            raise ValueError("Unsupported numerical type.")

        # Clip the data so every value lies within [min_val, max_val].
        data = np.clip(data, min_val, max_val)
        return data.tolist()  # Plain Python list, so nulls can be assigned later

    def _generate_bothify_data(self, col_config):
        """Generate data using Faker's bothify method with a custom format."""
        format_string = col_config.get('format', '???###')
        return [self.fake.bothify(format_string) for _ in range(self.n_rows)]

    def _generate_state_data(self, col_config):
        """Generate state names, sampled from a frequency distribution if one is given."""
        # If a distribution is provided, sample according to it
        if 'distribution' in col_config:
            distribution = col_config['distribution']
            state_names = list(distribution.keys())
            total = sum(distribution.values())
            # Re-normalize so the probabilities sum to 1
            probabilities = [p / total for p in distribution.values()]
            return np.random.choice(state_names, self.n_rows, p=probabilities)

        # Otherwise fall back to Faker's random states
        use_abbr = col_config.get('abbreviation', False)
        return [self.fake.state_abbr() if use_abbr else self.fake.state() for _ in range(self.n_rows)]

    def _generate_categorical_data(self, col_config):
        """Generate categorical data based on the input series frequency."""
        if 'distribution' not in col_config:
            raise ValueError("Frequency distribution must be provided for categorical data.")

        distribution = col_config['distribution']
        categories = list(distribution.keys())
        probabilities = list(distribution.values())
        # Re-normalize so the probabilities sum to 1
        total = sum(probabilities)
        probabilities = [p / total for p in probabilities]
        
        return np.random.choice(categories, self.n_rows, p=probabilities)

class ConfigLoader:
    @staticmethod
    def load_config(json_file):
        """Load column specifications from a JSON file."""
        try:
            with open(json_file, 'r') as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError) as e:
            logging.error(f"Error loading JSON file: {e}")
            raise

class Fauxnalysis:
    def __init__(self, config_file):
        """Initialize the Fauxnalysis instance with a configuration file."""
        self.config = ConfigLoader.load_config(config_file)

    def generate_fake_data(self):
        """Generate fake data and return it as a DataFrame."""
        generator = DataFrameGenerator(self.config)
        return generator.generate_dataframe()

    def save_to_csv(self, dataframe, filename):
        """Save the DataFrame to a CSV file."""
        dataframe.to_csv(filename, index=False, compression='gzip')
        logging.info(f"DataFrame saved to {filename}")
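
For reference, here is a minimal sketch of a config in the shape that `DataFrameGenerator` reads: a top-level `n_rows` and a `columns` list whose entries carry the keys each generator method looks up. The column names and values are invented for illustration.

```python
import json

# Hypothetical config in the shape DataFrameGenerator reads:
# a top-level "n_rows" plus a "columns" list. All names and values are made up.
example_config = {
    "n_rows": 5,
    "columns": [
        {"name": "amount", "type": "float", "min": 0.0, "max": 100.0,
         "mean": 40.0, "stddev": 15.0, "null_count": 1},
        {"name": "term_months", "type": "int", "min": 12, "max": 60},
        {"name": "grade", "type": "categorical",
         "distribution": {"A": 0.5, "B": 0.3, "C": 0.2}},
        {"name": "state", "type": "state", "abbreviation": True},
        {"name": "member_id", "type": "bothify", "format": "??-####"},
    ],
}

# Serialize it the way it would be stored on disk for ConfigLoader.load_config.
config_json = json.dumps(example_config, indent=4)
print(config_json)
```

Saved to a file, this is the kind of input that `Fauxnalysis(config_file).generate_fake_data()` would consume.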

In [2]:
# Let's create a new config json file for the accepted data set
# Accepted dataframe
df = pd.read_csv('FILEPATH', compression='gzip', low_memory=False)

In [None]:
# Now process that dataframe to create json metadata file
# Function to process DataFrame and create JSON metadata
def create_metadata_json(dataframe):
    metadata = []

    for col in dataframe.columns:
        try:
            if pd.api.types.is_object_dtype(dataframe[col]):
                unique_count = dataframe[col].nunique()
                
                # Treat object columns with more than 50 unique values as free-text strings
                if unique_count > 50:
                    column_metadata = {
                        "name": col,
                        "type": "string"
                    }
                else:
                    # Calculate the relative frequency of each category
                    value_counts = dataframe[col].value_counts(normalize=True).to_dict()
                    # Round for readability; rounding can leave the total slightly off 1
                    # (and zero out rare categories), so the generator re-normalizes
                    total = sum(value_counts.values())
                    distribution = {key: round(value / total, 2) for key, value in value_counts.items()}
                    
                    # Create the metadata entry for categorical columns
                    column_metadata = {
                        "name": col,
                        "type": "categorical",
                        "distribution": distribution
                    }
                metadata.append(column_metadata)
                
            elif pd.api.types.is_numeric_dtype(dataframe[col]):
                # Determine if the numeric column is float or integer
                if pd.api.types.is_float_dtype(dataframe[col]):
                    column_metadata = {
                        "name": col,
                        "type": "float",
                        "min": float(dataframe[col].min()),  # Convert to float
                        "max": float(dataframe[col].max()),  # Convert to float
                        "mean": float(dataframe[col].mean()),  # Convert to float
                        "std_dev": float(dataframe[col].std())  # Convert to float
                    }
                elif pd.api.types.is_integer_dtype(dataframe[col]):
                    column_metadata = {
                        "name": col,
                        "type": "int",
                        "min": int(dataframe[col].min()),  # Convert to int
                        "max": int(dataframe[col].max()),  # Convert to int
                        "mean": float(dataframe[col].mean()),  # Convert to float
                        "std_dev": float(dataframe[col].std())  # Convert to float
                    }
                else:
                    continue  # Skip non-standard numeric types

                metadata.append(column_metadata)
                
        except Exception as e:
            print(f"Error processing column '{col}': {e}")
            continue  # Move on to the next column

    return metadata

# Generate the metadata
metadata = create_metadata_json(df)

# Convert metadata to JSON string
metadata_json = json.dumps(metadata, indent=4)

# Print the metadata JSON
print(metadata_json)

# Save the JSON to a file
with open('metadata.json', 'w') as json_file:
    json_file.write(metadata_json)

print("Metadata saved to 'metadata.json'.")
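
The two-decimal rounding in the metadata step is why the generator re-normalizes before sampling: NumPy's `choice` raises a `ValueError` when the probabilities do not sum to 1. The sketch below shows the effect on a made-up column with three equally common categories; the data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up column with three equally common categories.
grades = pd.Series(["A", "B", "C"] * 100)

# Relative frequencies, rounded to two decimals as in the metadata step.
rounded = grades.value_counts(normalize=True).round(2).to_dict()
print(sum(rounded.values()))  # three times 0.33, just under 1

# Re-normalize before sampling so the probabilities sum to 1 again.
total = sum(rounded.values())
probabilities = [p / total for p in rounded.values()]

rng = np.random.default_rng(seed=1)
sample = rng.choice(list(rounded.keys()), size=5, p=probabilities)
print(sample)
```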

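To clean it up, I edited metadata.json by hand so that the "Faker Plus" generator would be as accurate as possible. Note that `DataFrameGenerator` reads `config['columns']` and `config.get('n_rows', 10)`, so the bare list written above has to be wrapped in a JSON object with `columns` and `n_rows` keys.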
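
One part of that manual edit could also be scripted. The sketch below wraps a bare list of column specs in an object with `n_rows` and `columns` keys, which is the shape `DataFrameGenerator` looks up. The helper `wrap_metadata`, the sample columns, and the temporary path are all assumptions made for this example.

```python
import json
import os
import tempfile

# Stand-in for the list that create_metadata_json writes (made-up columns).
column_specs = [
    {"name": "amount", "type": "float", "min": 0.0, "max": 100.0},
    {"name": "grade", "type": "categorical", "distribution": {"A": 0.6, "B": 0.4}},
]


def wrap_metadata(columns, n_rows):
    """Hypothetical helper: wrap a bare column list in the dict the generator reads."""
    return {"n_rows": n_rows, "columns": columns}


config = wrap_metadata(column_specs, n_rows=1000)

# Write to a temporary path here; in the notebook this would be metadata.json.
path = os.path.join(tempfile.mkdtemp(), "metadata.json")
with open(path, "w") as f:
    json.dump(config, f, indent=4)

# Read it back to confirm the structure survives the round trip.
with open(path) as f:
    reloaded = json.load(f)
print(list(reloaded.keys()))
```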

In [None]:
# How many rows?
n_rows = len(df)
n_rows

> Stop here and make any needed edits to the metadata.json before continuing

In [None]:
# Running the Fauxnalysis generator on my metadata.json
# Create an instance of Fauxnalysis
config_file = 'metadata.json'  # Ensure this file exists in the directory
fa = Fauxnalysis(config_file)

# Generate the fake DataFrame
fake_df = fa.generate_fake_data()

# Print a summary of the generated DataFrame
print("Generated Fake DataFrame:")
fake_df.info(verbose=True)
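
Beyond `info()`, a quick side-by-side of summary statistics gives a first impression of how closely the generated columns track the originals. This is only a sketch on made-up stand-in frames (`real_df` and `synthetic_df` are hypothetical names, not the notebook's data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

# Stand-ins for the original and generated frames (made-up data).
real_df = pd.DataFrame({"amount": rng.normal(100, 20, 500)})
synthetic_df = pd.DataFrame({"amount": np.clip(rng.normal(100, 20, 500), 40, 160)})

# Side-by-side summary statistics for every shared numeric column.
numeric_cols = real_df.select_dtypes("number").columns.intersection(synthetic_df.columns)
comparison = pd.concat(
    {
        "real": real_df[numeric_cols].describe().T,
        "fake": synthetic_df[numeric_cols].describe().T,
    },
    axis=1,
)
print(comparison[[("real", "mean"), ("fake", "mean"), ("real", "std"), ("fake", "std")]])
```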

In [None]:
# Take a look at our memory usage
fake_df.memory_usage(index=False, deep=True).to_csv('FILEPATH')
fake_df.memory_usage(index=False, deep=True)

In [8]:
# Save the DataFrame to a CSV file
fake_df.to_csv('FILEPATH', index=False)