# Synthetic data generator for anomaly detection technical challenge problem

### Objective

- Create an N-column dataframe with three explanatory feature columns and a binary output/response variable column.
- Each column should have null values missing at random.
- Each explanatory feature column should consist of random values ranging from -1 to 1.
- The dataframe should be 10000 rows in length, so the resulting dataframe should be a 10000 x N matrix.
- Save the dataframe to a CSV called `raw_data.csv`  
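Read literally, the objective can be met in a few lines of NumPy and pandas. The sketch below is only an illustration under assumptions the objective does not state: the function name `make_objective_frame`, the fixed seed, the 1% null rate, and the roughly even class balance are all hypothetical choices.

```python
import numpy as np
import pandas as pd


def make_objective_frame(n_rows=10_000, n_features=3, null_prob=0.01, seed=0):
    """Hypothetical helper: uniform features in [-1, 1), a binary response, random nulls."""
    rng = np.random.default_rng(seed)

    # Explanatory features drawn uniformly from [-1, 1)
    X = rng.uniform(-1.0, 1.0, size=(n_rows, n_features))
    df = pd.DataFrame(X, columns=[f"x{i}" for i in range(n_features)])

    # Binary response stored as float so that it can hold NaN
    df["y"] = rng.integers(0, 2, size=n_rows).astype(float)

    # Blank out each cell independently with probability null_prob
    null_mask = rng.random(df.shape) < null_prob
    return df.mask(null_mask)


df_objective = make_objective_frame()
print(df_objective.shape)
```

Calling `df_objective.to_csv('raw_data.csv', index=False)` would then write the file named in the objective.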

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification

# Approach 1: Using scikit-learn's `make_classification` function

`make_classification` is a function in scikit-learn's `datasets` module that generates a random n-class classification problem. It first creates clusters of points normally distributed (std=1) about the vertices of an `n_informative`-dimensional hypercube with sides of length 2*`class_sep` and assigns an equal number of clusters to each class. It then introduces interdependence between these features and adds various types of further noise to the data. See the API reference documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification).
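As a quick illustration of the call described above, the following sketch generates a small dataset and inspects its shape and labels. The sample count, feature counts, and seed are arbitrary choices for the example, not values used later in this notebook.

```python
from sklearn.datasets import make_classification

# Small illustrative call; the sizes and seed are arbitrary choices
X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=6,
    n_informative=3,      # dimension of the hypercube the clusters sit on
    n_redundant=2,        # linear combinations of the informative features
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    class_sep=1.0,
    random_state=0,
)

print(X_demo.shape)                  # one row per sample, one column per feature
print(sorted(set(y_demo.tolist())))  # the class labels present
```

Note that scikit-learn requires `n_classes * n_clusters_per_class` to be at most `2 ** n_informative`, since each cluster is placed on its own hypercube vertex.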

First, I enter the number of rows for the dataset.

In [2]:
### enter number of rows for the dataset
n_rows = int(input("Please enter number of rows for the dataset \n"))

Please enter number of rows for the dataset 
 10000


I enter the number of features I want in the dataset.

In [3]:
### enter number of features for the dataset
n_features = int(input("Please enter number of features for the dataset \n"))

Please enter number of features for the dataset 
 40


Then, I enter the number of informative features. These features span the `n_informative`-dimensional hypercube whose vertices the class clusters are centered on.

In [4]:
### enter number of informative features for the dataset
n_informative = int(input("Please enter number of informative features for the dataset \n"))

Please enter number of informative features for the dataset 
 3


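Then, I enter the probability of anomalies expected in the response variable. The value is entered as a percentage and divided by 100 to give `flip_y`, the fraction of samples whose class `make_classification` assigns at random, so entering `.01` yields `flip_y = 0.0001`.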

In [5]:
##### enter probability of anomalies in the output variables
flip_y = float(input("Please enter probability of anomalies expected in the response variable from 0 to 100 \n"))/100

Please enter probability of anomalies expected in the response variable from 0 to 100 
 .01


In [6]:
print(flip_y)

0.0001
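Because the prompt asks for a percentage, the division by 100 is easy to misread. A small helper makes the conversion and its valid range explicit; `percent_to_fraction` is a hypothetical name introduced here for illustration and is not part of the notebook.

```python
def percent_to_fraction(percent):
    """Hypothetical helper: convert a percentage in [0, 100] to a fraction in [0, 1]."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return percent / 100


print(percent_to_fraction(0.01))  # the value entered above
print(percent_to_fraction(1))     # one percent
```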


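Then, I enter the weights, that is, the proportion of samples assigned to each class. Note that this value is stored in `weights` but is not passed to the `DataGenerator` class below.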


In [7]:
weights = float(input("Please enter proportion of samples assigned to each class from 0 to 100 \n"))/100

Please enter proportion of samples assigned to each class from 0 to 100 
 5
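For reference, `make_classification` takes `weights` as a sequence with one proportion per class rather than a single number. The sketch below assumes the 5 entered above is meant as the percentage share of the positive class (an interpretation, not something the notebook states) and sets `flip_y=0` so that no labels are reassigned.

```python
from sklearn.datasets import make_classification

minority_share = 5 / 100  # assumed meaning of the value entered above

X_w, y_w = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    weights=[1 - minority_share, minority_share],  # one entry per class
    flip_y=0,  # no random label reassignment, so proportions are not perturbed
    random_state=0,
)

print(y_w.mean())  # observed share of class 1
```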


#### Create a class `DataGenerator`

In [8]:
class DataGenerator:
    
    def __init__(self, n_samples, n_features, n_informative, n_redundant, n_repeated, n_classes, 
                 n_clusters_per_class, flip_y, null_prob):
        ''' generate data using the make_classification function '''
        self.n_samples = n_samples
        self.n_features = n_features
        self.n_informative = n_informative
        self.n_redundant = n_redundant
        self.n_repeated = n_repeated
        self.n_classes = n_classes
        self.n_clusters_per_class = n_clusters_per_class
        self.flip_y = flip_y
        self.null_prob = null_prob

    def generate_random_data(self):
        X,y = make_classification(n_samples=self.n_samples, n_features=self.n_features,
                            n_informative=self.n_informative, n_redundant=self.n_redundant, n_repeated=self.n_repeated, 
                            n_classes=self.n_classes, n_clusters_per_class=self.n_clusters_per_class, 
                            flip_y=self.flip_y, class_sep=1.0, hypercube=True, shift=0.0, 
                            scale=1.0, shuffle=True, random_state=None)
        return X,y
        
    def create_dataframe(self):
        X,y = self.generate_random_data()
        featurenames = list()
        for i in range(self.n_features):
            featurenames.append(f"x{i}")
        df = pd.concat([pd.DataFrame(X, columns=featurenames), 
                        pd.DataFrame(y, columns=['y'])], axis=1)

        ### generate random null values
        X = df.drop('y', axis=1) ### create dataframe with just X variables
        for col in X.columns:
            mask_values = np.random.choice([np.nan, True], size=df.shape[0], p=[self.null_prob, 1 - self.null_prob])
            df[col] = df[col]*mask_values
        return df
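The null-injection step in `create_dataframe` relies on the fact that multiplying a value by `np.nan` yields `np.nan`. An equivalent formulation, shown here as a sketch rather than as the notebook's method, draws a boolean mask and applies `DataFrame.mask`. The helper name `add_random_nulls`, the toy frame, and the seed are hypothetical.

```python
import numpy as np
import pandas as pd


def add_random_nulls(frame, columns, null_prob, seed=None):
    """Hypothetical helper: set each cell in `columns` to NaN with probability null_prob."""
    rng = np.random.default_rng(seed)
    out = frame.copy()
    # True marks a cell to blank out; each cell is drawn independently
    null_mask = rng.random((len(out), len(columns))) < null_prob
    out[columns] = out[columns].mask(null_mask)
    return out


toy = pd.DataFrame({"x0": np.arange(1000.0), "x1": np.ones(1000), "y": np.zeros(1000)})
toy_with_nulls = add_random_nulls(toy, ["x0", "x1"], null_prob=0.1, seed=0)
print(toy_with_nulls.isna().mean())  # fraction of nulls per column
```

Leaving `y` out of `columns` mirrors the class above, which only blanks out the feature columns.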

#### Run the `DataGenerator` class and generate the dataframe

In [9]:
generator = DataGenerator(n_samples=n_rows, n_features=n_features, n_informative=n_informative, 
                          n_redundant=2, n_repeated=0, n_classes=2, flip_y=flip_y, 
                          n_clusters_per_class=2, 
                          null_prob=.001)
df = generator.create_dataframe()

#### Inspect the resulting dataframe

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x0      9989 non-null   float64
 1   x1      9986 non-null   float64
 2   x2      9990 non-null   float64
 3   x3      9988 non-null   float64
 4   x4      9991 non-null   float64
 5   x5      9992 non-null   float64
 6   x6      9991 non-null   float64
 7   x7      9985 non-null   float64
 8   x8      9991 non-null   float64
 9   x9      9989 non-null   float64
 10  x10     9993 non-null   float64
 11  x11     9982 non-null   float64
 12  x12     9987 non-null   float64
 13  x13     9987 non-null   float64
 14  x14     9992 non-null   float64
 15  x15     9994 non-null   float64
 16  x16     9992 non-null   float64
 17  x17     9988 non-null   float64
 18  x18     9988 non-null   float64
 19  x19     9990 non-null   float64
 20  x20     9995 non-null   float64
 21  x21     9995 non-null   float64
 22 

In [11]:
df.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x31,x32,x33,x34,x35,x36,x37,x38,x39,y
0,1.563312,-1.417454,-0.49021,-1.568689,0.367421,0.070303,-0.535064,0.161154,0.130255,0.753823,...,-0.611644,1.807765,0.089677,-1.289779,0.303392,0.791001,-1.147663,0.311395,-0.097479,1
1,-1.952179,0.127013,1.538298,0.346906,1.188424,0.812583,-0.931559,0.815114,-0.331135,-1.20211,...,0.126718,-1.478616,-0.57393,-0.300777,-2.489546,1.101826,-0.736897,1.109234,0.411488,0
2,0.966699,-0.476276,0.491713,1.778521,-0.977243,-0.465307,-1.026294,0.526787,-0.423048,1.584421,...,0.372821,2.019703,1.059188,1.710667,-0.395971,0.533767,-1.245003,0.315812,-0.696607,0
3,0.24579,0.217226,0.044321,0.741132,2.287139,-0.013715,-0.359327,0.510777,0.765334,0.655096,...,-0.257412,-1.630353,1.33175,-0.744338,-0.536401,-0.619755,-0.867709,-0.840118,0.388468,0
4,0.224607,0.593478,-0.069498,-1.340076,0.573069,0.926663,-1.141124,-0.172414,-1.868113,0.791384,...,1.622028,-1.032052,-2.15192,-0.81478,-0.248285,-0.529315,1.086025,-0.307067,0.555658,0


### Save resulting dataframe to csv

In [12]:
df.to_csv('../data/raw/raw_data.csv', index=False)

# Approach 2: Using numpy's `randint` function

This function generates arrays of random integers drawn from a discrete uniform distribution over the half-open interval from a specified low value (inclusive) to a high value (exclusive). See the reference documentation [here](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html). The feature columns of the resulting dataset will be independent of each other; that is, there will be no interdependence between the features, as each is generated by an independent random process.
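The two properties mentioned above, the exclusive upper bound and the lack of dependence between columns, can be checked on a small sample. This is a minimal sketch; the sample size and the seed are arbitrary choices made only so the example is repeatable.

```python
import numpy as np

rng = np.random.RandomState(0)  # seeded only to make the sketch repeatable

# 5000 rows, 3 independent integer columns drawn from [-200, 500)
X_int = rng.randint(-200, 500, size=(5000, 3))

print(X_int.min(), X_int.max())  # the upper bound 500 itself is never drawn

# Pairwise correlations between columns should be close to zero
corr = np.corrcoef(X_int, rowvar=False)
print(np.round(corr, 2))
```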

#### Create a class `DataGeneratorRandomInt`

In [13]:
class DataGeneratorRandomInt:
    def __init__(self, num_rows, num_features, min_value, max_value, null_prob, flip_y):
        self.num_rows = num_rows
        self.num_features = num_features
        self.min_value = min_value
        self.max_value = max_value
        self.null_prob = null_prob
        self.flip_y= flip_y

    def generate_random_data(self):
        X = np.random.randint(self.min_value, self.max_value, size=(self.num_rows+1, self.num_features+1))
        
        ### generate binary output variable    
        y = np.random.choice([0, 1], size=self.num_rows, p=[1-self.flip_y, self.flip_y])  # Binary output column
        return X,y

    def create_dataframe(self):
        X,y = self.generate_random_data()
        data = pd.DataFrame(data=X[1:,1:], index=X[1:,0], columns=X[0,1:])
        y = pd.DataFrame(pd.Series(data=y[:]), columns=['y'])
        
        # create column names
        featurenames = list()
        for i in range(self.num_features):
            featurenames.append(f"x{i}")
        data.columns = featurenames

        ### generate random null values
        for col in data.columns:
            mask_values = np.random.choice([np.nan, True], size=data.shape[0], p=[self.null_prob, 1 - self.null_prob])
            data[col] = data[col]*mask_values
        
        data = data.reset_index()
        
        ### create resulting dataframe
        df = pd.concat([pd.DataFrame(data, columns=featurenames),
                        pd.DataFrame(y, columns=['y'])], axis=1)
        
        return df


### Create dataframe

In [14]:
# Generate a small portion of null values at random
generator = DataGeneratorRandomInt(num_rows = n_rows, min_value = -200, max_value=500, 
                                   num_features = n_features,
                                   null_prob = .001, flip_y= flip_y)
df = generator.create_dataframe()

In [15]:
df.shape

(10000, 41)
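Beyond the shape, a few more checks help confirm that a generated frame matches the objective. The helper below is a hypothetical sketch (`summarize_frame` is not part of the notebook) that reports the shape, the overall null fraction of the feature columns, and the response counts. It is demonstrated on a small stand-in frame rather than on the `df` above.

```python
import numpy as np
import pandas as pd


def summarize_frame(frame, target="y"):
    """Hypothetical helper: basic sanity statistics for a generated dataset."""
    features = frame.drop(columns=[target])
    return {
        "shape": frame.shape,
        # Share of missing cells across all feature columns
        "feature_null_fraction": float(features.isna().to_numpy().mean()),
        # Number of rows per response value, counting missing responses too
        "target_counts": frame[target].value_counts(dropna=False).to_dict(),
    }


# Small stand-in frame for illustration
demo = pd.DataFrame(
    {"x0": [1.0, np.nan, 3.0, 4.0], "x1": [0.5, 0.5, 0.5, 0.5], "y": [0, 0, 1, 0]}
)
print(summarize_frame(demo))
```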

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x0      9989 non-null   float64
 1   x1      9988 non-null   float64
 2   x2      9991 non-null   float64
 3   x3      9990 non-null   float64
 4   x4      9989 non-null   float64
 5   x5      9987 non-null   float64
 6   x6      9988 non-null   float64
 7   x7      9988 non-null   float64
 8   x8      9989 non-null   float64
 9   x9      9994 non-null   float64
 10  x10     9994 non-null   float64
 11  x11     9989 non-null   float64
 12  x12     9993 non-null   float64
 13  x13     9984 non-null   float64
 14  x14     9992 non-null   float64
 15  x15     9990 non-null   float64
 16  x16     9994 non-null   float64
 17  x17     9986 non-null   float64
 18  x18     9990 non-null   float64
 19  x19     9991 non-null   float64
 20  x20     9992 non-null   float64
 21  x21     9991 non-null   float64
 22 

### Save resulting dataframe to csv

In [17]:
df.to_csv('../data/raw/raw_data2.csv', index=False)