### Synthetic Data Generation using Generative AI

In [3]:
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU, BatchNormalization
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv('screentime_analysis.csv')

data.head()

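|   | Date | App | Usage (minutes) | Notifications | Times Opened |
|---|------|-----|-----------------|---------------|--------------|
| 0 | 2024-08-07 | Instagram | 81 | 24 | 57 |
| 1 | 2024-08-08 | Instagram | 90 | 30 | 53 |
| 2 | 2024-08-26 | Instagram | 112 | 33 | 17 |
| 3 | 2024-08-22 | Instagram | 82 | 11 | 38 |
| 4 | 2024-08-12 | Instagram | 59 | 47 | 16 |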


The dataset contains the following columns:

- Date: The date of the screen time record.
- App: The name of the app.
- Usage (minutes): Total usage time of the app, in minutes.
- Notifications: The number of notifications received.
- Times Opened: The number of times the app was opened.

To build a GAN that generates synthetic data, we need to:

1. **Drop unnecessary columns**: We will not generate the Date or App fields, as they are identifiers rather than measurements. Instead, we’ll focus on Usage, Notifications, and Times Opened. If you want to include the App column, convert its values to numeric form first.
2. **Normalize the data**: GAN training is usually more stable on normalized data, here scaled to between 0 and 1.
3. **Prepare the dataset** for training: Ensure the remaining columns are numeric and ready for the model.
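
If the App column is kept, one common option is one-hot encoding. The sketch below uses `pd.get_dummies` on a small hypothetical frame; the rows and the `WhatsApp` value are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical mini-frame that reuses two of the dataset's column names
sample = pd.DataFrame({
    "App": ["Instagram", "WhatsApp", "Instagram"],
    "Usage (minutes)": [81, 40, 59],
})

# One-hot encode App: one 0/1 column per distinct app name
encoded = pd.get_dummies(sample, columns=["App"], dtype=int)
print(encoded.columns.tolist())
```

With this encoding, the generator’s output layer would need one unit per encoded column rather than 3.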

In [4]:
# drop unnecessary columns
data_gan = data.drop(columns=['Date', 'App'])

# initialize a MinMaxScaler to normalize the data between 0 and 1
scaler = MinMaxScaler()

# normalize the data
normalized_data = scaler.fit_transform(data_gan)

# convert back to a DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=data_gan.columns)

normalized_df.head()

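|   | Usage (minutes) | Notifications | Times Opened |
|---|-----------------|---------------|--------------|
| 0 | 0.677966 | 0.163265 | 0.571429 |
| 1 | 0.754237 | 0.204082 | 0.530612 |
| 2 | 0.940678 | 0.224490 | 0.163265 |
| 3 | 0.686441 | 0.074830 | 0.377551 |
| 4 | 0.491525 | 0.319728 | 0.153061 |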


The dataset has been normalized: Usage, Notifications, and Times Opened now take values between 0 and 1. Now, let’s move on to building the GAN model.
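
To make the scaling concrete, here is a minimal NumPy sketch of the min-max formula applied to the five Usage values shown earlier. The column minimum and maximum (1 and 119) are an assumption inferred from the normalized output, not values read from the file.

```python
import numpy as np

# Usage (minutes) values from the first five rows of the dataset
usage = np.array([81, 90, 112, 82, 59], dtype=float)

# Assumption: column min and max inferred from the normalized values shown above
col_min, col_max = 1.0, 119.0

scaled = (usage - col_min) / (col_max - col_min)    # forward transform
restored = scaled * (col_max - col_min) + col_min   # inverse transform

print(np.round(scaled, 6))
print(restored)
```

The inverse step is the same arithmetic that `scaler.inverse_transform` performs later when generated samples are mapped back to the original scale.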

## Using GANs to Build a Generative AI Model for Synthetic Data Generation
Here’s the process to define and train the GAN:

1. The generator will be trained to produce data similar to the normalized Usage, Notifications, and Times Opened columns.
2. The discriminator will be trained to distinguish between the real and generated data.
3. Next, we will alternate between training the discriminator and the generator. The discriminator will be trained to classify real vs fake data, and the generator will be trained to fool the discriminator.

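Let’s start building the GAN. The generator takes a latent noise vector as input and produces a synthetic sample with the same three features as the data. It uses LeakyReLU activations, which keep a small nonzero gradient for negative inputs, batch normalization after each hidden layer, and a sigmoid output so every generated value lies between 0 and 1, like the normalized data: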

In [5]:
latent_dim = 100  # latent space dimension (size of the random noise input)

def build_generator(latent_dim):
    model = Sequential([
        Dense(128, input_dim=latent_dim),
        LeakyReLU(alpha=0.01),
        BatchNormalization(momentum=0.8),
        Dense(256),
        LeakyReLU(alpha=0.01),
        BatchNormalization(momentum=0.8),
        Dense(512),
        LeakyReLU(alpha=0.01),
        BatchNormalization(momentum=0.8),
        Dense(3, activation='sigmoid')  # output layer for generating 3 features
    ])
    return model

In [6]:
# create the generator
generator = build_generator(latent_dim)
generator.summary()

Here’s an example of generating data with the generator before any training. The weights are still at their initial values, so the samples are not yet meaningful:

In [8]:
# generate random noise for 1000 samples
noise = np.random.normal(0, 1, (1000, latent_dim))

# generate synthetic data using the generator
generated_data = generator.predict(noise)

# display the generated data
generated_data[:5]  # show first 5 samples

array([[0.6060132 , 0.42501992, 0.39655977],
       [0.52695286, 0.47599155, 0.51267755],
       [0.4648849 , 0.4885294 , 0.44336262],
       [0.52922916, 0.4034784 , 0.45273483],
       [0.47769418, 0.4243987 , 0.4684544 ]], dtype=float32)
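
The two activations used in the generator can be written out directly. This is a plain NumPy sketch, not the Keras implementation: LeakyReLU passes positive inputs unchanged and scales negative ones by `alpha`, while the sigmoid maps any real number into the interval (0, 1), which is why every value in the array above lies in that range.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x where x > 0, alpha * x otherwise
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x))
print(sigmoid(x))
```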

Now, the discriminator will take a real or synthetic data sample and classify it as real or fake:

In [10]:
def build_discriminator():
    model = Sequential([
        Dense(512, input_shape=(3,)),
        LeakyReLU(alpha=0.01),
        Dense(256),
        LeakyReLU(alpha=0.01),
        Dense(128),
        LeakyReLU(alpha=0.01),
        Dense(1, activation='sigmoid')  # output: 1 neuron for real/fake classification
    ])
    model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
    return model

# create the discriminator
discriminator = build_discriminator()
discriminator.summary()
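
The discriminator is compiled with binary cross-entropy. As a rough NumPy sketch of that loss (the helper name `bce` and the clipping constant are our own choices, not the Keras internals): a discriminator that outputs 0.5 for every sample scores ln 2 ≈ 0.693, which is a useful reference point when reading training logs.

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    # clip to avoid log(0); eps is an illustrative choice
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

labels = np.array([1.0, 1.0, 0.0, 0.0])     # 1 = real, 0 = fake

undecided = bce(labels, np.full(4, 0.5))                   # always predicts 0.5
confident = bce(labels, np.array([0.9, 0.8, 0.2, 0.1]))    # mostly correct
print(undecided, confident)
```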

Next, we combine the generator and discriminator into a single model. The discriminator’s weights are frozen inside this combined model so that only the generator is updated when the combined model is trained:

In [11]:
def build_gan(generator, discriminator):
    # freeze the discriminator’s weights while training the generator
    discriminator.trainable = False

    model = Sequential([generator, discriminator])
    model.compile(loss='binary_crossentropy', optimizer=Adam())
    return model

# create the GAN
gan = build_gan(generator, discriminator)
gan.summary()
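
The `summary()` outputs are not reproduced here, but the model sizes can be checked by hand from the layer definitions. The sketch below uses the standard counts: a Dense layer has `n_in * n_out` weights plus `n_out` biases, and a BatchNormalization layer has 4 values per feature (two trainable, two moving statistics). The totals are our own arithmetic, not copied from Keras output.

```python
def dense_params(n_in, n_out):
    # weight matrix plus bias vector
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    # gamma, beta (trainable) + moving mean, moving variance (non-trainable)
    return 4 * n_features

generator_total = (
    dense_params(100, 128) + batchnorm_params(128)
    + dense_params(128, 256) + batchnorm_params(256)
    + dense_params(256, 512) + batchnorm_params(512)
    + dense_params(512, 3)
)
discriminator_total = (
    dense_params(3, 512) + dense_params(512, 256)
    + dense_params(256, 128) + dense_params(128, 1)
)
print(generator_total, discriminator_total)  # 182659 166401
```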

Now, we will train the GAN using the following steps:

1. Generate random noise.
2. Use the generator to create fake data.
3. Train the discriminator on both real and fake data.
4. Train the generator via the GAN to fool the discriminator.
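
In the standard GAN formulation, which these steps follow, the discriminator D and the generator G play a minimax game. The notation below (x for a real sample, z for a noise vector) is ours. With binary cross-entropy, step 3 minimizes the discriminator loss, and step 4, which trains on fake samples labeled as 1, corresponds to the commonly used non-saturating generator loss.

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\log\big(1 - D(G(z))\big)\big]

\mathcal{L}_D = -\,\mathbb{E}_{x}\big[\log D(x)\big]
               - \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big],
\qquad
\mathcal{L}_G = -\,\mathbb{E}_{z}\big[\log D(G(z))\big]
```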

In [13]:
def train_gan(gan, generator, discriminator, data, epochs=10000, batch_size=128, latent_dim=100):
    for epoch in range(epochs):
        # select a random batch of real data
        idx = np.random.randint(0, data.shape[0], batch_size)
        real_data = data[idx]

        # generate a batch of fake data (verbose=0 suppresses the per-call progress bar)
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        fake_data = generator.predict(noise, verbose=0)

        # labels for real and fake data
        real_labels = np.ones((batch_size, 1))  # real data has label 1
        fake_labels = np.zeros((batch_size, 1))  # fake data has label 0

        # train the discriminator
        d_loss_real = discriminator.train_on_batch(real_data, real_labels)
        d_loss_fake = discriminator.train_on_batch(fake_data, fake_labels)

        # train the generator via the GAN
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        valid_labels = np.ones((batch_size, 1)) 
        g_loss = gan.train_on_batch(noise, valid_labels)

        # print the progress every 1000 epochs
        if epoch % 1000 == 0:
            print(f"Epoch {epoch}: D Loss: {0.5 * np.add(d_loss_real, d_loss_fake)}, G Loss: {g_loss}")

In [None]:
train_gan(gan, generator, discriminator, normalized_data, epochs=10000, batch_size=128, latent_dim=latent_dim)

Epoch 0: D Loss: [0.69507 0.25   ], G Loss: 0.7105870246887207

In [16]:
# generate new data
noise = np.random.normal(0, 1, (1000, latent_dim))  # generate 1000 synthetic samples
generated_data = generator.predict(noise)

# convert the generated data back to the original scale
generated_data_rescaled = scaler.inverse_transform(generated_data)

# convert to DataFrame
generated_df = pd.DataFrame(generated_data_rescaled, columns=data_gan.columns)

generated_df.head()

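|   | Usage (minutes) | Notifications | Times Opened |
|---|-----------------|---------------|--------------|
| 0 | 1.000253 | 0.000372 | 1.000049 |
| 1 | 1.000019 | 0.000035 | 1.000003 |
| 2 | 1.000060 | 0.000084 | 1.000018 |
| 3 | 1.000030 | 0.000156 | 1.000036 |
| 4 | 1.000057 | 0.000167 | 1.000026 |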
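
The five generated rows are nearly identical, and they sit at about 1, 0, and 1, which are the column minimums implied by the normalized values shown earlier (an inference from the displayed rows, not something read from the file). That pattern suggests the generator may have collapsed to a single point rather than learned the data distribution. One simple check, sketched here with a hypothetical helper `looks_collapsed` and an arbitrary threshold, is to compare the per-column spread of generated and real samples:

```python
import numpy as np

def looks_collapsed(real, fake, ratio=0.05):
    """Flag columns whose generated std is below `ratio` times the real std."""
    real_std = np.asarray(real, dtype=float).std(axis=0)
    fake_std = np.asarray(fake, dtype=float).std(axis=0)
    return fake_std < ratio * real_std

# First five real and generated rows as displayed in this notebook
real_head = np.array([[81, 24, 57], [90, 30, 53], [112, 33, 17],
                      [82, 11, 38], [59, 47, 16]])
fake_head = np.array([
    [1.000253, 0.000372, 1.000049],
    [1.000019, 0.000035, 1.000003],
    [1.000060, 0.000084, 1.000018],
    [1.000030, 0.000156, 1.000036],
    [1.000057, 0.000167, 1.000026],
])

print(looks_collapsed(real_head, fake_head))
```

In the notebook itself, the same call could take `data_gan.values` and `generated_df.values`. If every column is flagged, it is worth checking whether the discriminator is actually being updated during training, since the effect of setting `trainable = False` relative to `compile` may differ between Keras versions.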
