# Data Science Quick Tip #015: Synthesizing Your Own Test Data
In this notebook, we'll be sharing how to synthesize your own test data for test purposes. We will cover how to synthesize data for three use cases: binary classification, multiclass classification, and regression.

## Project Setup

In [1]:
# Importing the required Python libraries
import pandas as pd
from sklearn.datasets import make_blobs, make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

## Use Case #1: Binary Classification

In [19]:
# Generating synthetic binary classification in the form of arrays
X, y = make_classification(n_samples = 10000,
                           n_features = 25,
                           n_informative = 10,
                           n_redundant = 10,
                           n_repeated = 5,
                           n_classes = 2,
                           weights = [0.6, 0.4])

In [20]:
# Transforming the arrays into Pandas DataFrames
df_X = pd.DataFrame(data = X)
df_y = pd.DataFrame(data = y, columns = ['target'])

In [21]:
# Performing a split on the data to save data as a holdout, validation set
X_train, X_val, y_train, y_val = train_test_split(df_X, df_y)

In [22]:
# Instantiating the binary classification model with the RandomForestClassifier algorithm
binary_classification_model = RandomForestClassifier()

In [23]:
# Training the binary classification model against the training data
binary_classification_model.fit(X_train, y_train)

  binary_classification_model.fit(X_train, y_train)


RandomForestClassifier()

In [24]:
# Getting inferential predictions from the validation dataset
val_preds = binary_classification_model.predict(X_val)

In [25]:
# Generating validation metrics by comparing the inferential predictions (val_preds) to the actuals (y_val)
val_accuracy = accuracy_score(y_val, val_preds)
val_roc_auc_score = roc_auc_score(y_val, val_preds)
val_f1_score = f1_score(y_val, val_preds)

In [26]:
# Printing out the average validation metrics
print(f'Accuracy score: {val_accuracy}')
print(f'ROC AUC score: {val_f1_score}')
print(f'F1 score: {val_f1_score}')

Accuracy score: 0.9484
ROC AUC score: 0.9330565646081993
F1 score: 0.9330565646081993
