# Example Commands for Data Generation and Visualization

This Jupyter Notebook demonstrates how to generate synthetic datasets and visualize them using Python libraries such as `matplotlib`, `numpy`, and `pandas`. The notebook includes the following steps:

1. **Importing Libraries**: Importing necessary libraries for data manipulation and visualization.
2. **Setting Plot Parameters**: Configuring `matplotlib` plot parameters for consistent styling.
3. **Generating Synthetic Data**: Creating two sets of points drawn from Gaussian distributions: 200 points centered at (0.5, 1.5) and 300 points centered at (1.5, 0.5), both with a standard deviation of 0.5.
4. **Descriptive Statistics**: Displaying descriptive statistics for the first dataset and the first few rows of the second.
5. **Data Labeling and Combination**: Adding labels to the datasets, combining them, and shuffling the combined dataset.
6. **Data Analysis**: Performing group-by operations to show counts and aggregate statistics (mean, standard deviation) for each label.
7. **Data Visualization**: 
   - Creating boxplots for the features.
   - Plotting histograms to show the distribution of the features.
   - Generating scatter plots to visualize the relationship between the features, including a scatter plot with different colors for each class.

This notebook is useful for demonstrating basic data manipulation, statistical analysis, and visualization techniques in Python.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
plt.rcParams.update({'font.family': 'cmr10',
                     'font.size': 12,
                     'axes.unicode_minus': False,
                     'axes.labelsize': 12,
                     'figure.figsize': (3, 3),
                     'figure.dpi': 80,
                     'mathtext.fontset': 'cm',
                     'mathtext.rm': 'serif',
                     'xtick.direction': 'in',
                     'ytick.direction': 'in',
                     'xtick.top': True,
                     'ytick.right': True
                     })

In [None]:
# Make a dataset with x and y that has two Gaussian peaks
CENTER1X = 0.5
CENTER2X = 1.5
CENTER1Y = 1.5
CENTER2Y = 0.5
SIGMA = 0.5
# Draw 200 random points around the first center
POINTS_1 = []
for count in range(200):
    x = np.random.normal(CENTER1X, SIGMA)
    y = np.random.normal(CENTER1Y, SIGMA)
    POINTS_1.append([x, y])
POINTS_1 = np.array(POINTS_1)
POINTS_1 = pd.DataFrame(POINTS_1, columns=['x', 'y'])

# Draw 300 random points around the second center
POINTS_2 = []
for count in range(300):
    x = np.random.normal(CENTER2X, SIGMA)
    y = np.random.normal(CENTER2Y, SIGMA)
    POINTS_2.append([x, y])
POINTS_2 = np.array(POINTS_2)
POINTS_2 = pd.DataFrame(POINTS_2, columns=['x', 'y'])
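
The two loops above can also be written without explicit iteration. The following is a minimal sketch, assuming NumPy's `default_rng` generator is an acceptable substitute for the legacy `np.random.normal` calls. The helper name `make_cluster`, the lowercase variable names, and the seed `0` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

SIGMA = 0.5


def make_cluster(rng, center_x, center_y, n_points, sigma=SIGMA):
    """Return n_points Gaussian samples around (center_x, center_y) as a DataFrame."""
    # loc broadcasts across rows, so column 0 is centered on center_x
    # and column 1 on center_y.
    samples = rng.normal(loc=[center_x, center_y], scale=sigma, size=(n_points, 2))
    return pd.DataFrame(samples, columns=['x', 'y'])


# Seeded generator (the seed 0 is an arbitrary assumption) so reruns match.
rng = np.random.default_rng(0)
points_1 = make_cluster(rng, 0.5, 1.5, 200)
points_2 = make_cluster(rng, 1.5, 0.5, 300)
print(points_1.shape, points_2.shape)  # (200, 2) (300, 2)
```

Seeding the generator is a design choice: it makes the statistics and plots further down repeatable, at the cost of no longer seeing run-to-run variation.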


In [None]:
POINTS_1.describe()

In [None]:
POINTS_2.head(5)

In [None]:
# Add a "label" column to the dataframes
POINTS_1['label'] = 0
POINTS_2['label'] = 1
# Combine the two dataframes
POINTS = pd.concat([POINTS_1, POINTS_2], ignore_index=True)
# Shuffle the data
POINTS = POINTS.sample(frac=1).reset_index(drop=True)
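
Because `sample(frac=1)` draws a new permutation on every run, the row order differs between executions. Below is a minimal sketch of a repeatable shuffle, assuming a fixed `random_state` is acceptable. The toy frame and the seed `42` are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'x': [0.1, 0.2, 0.3, 0.4], 'label': [0, 0, 1, 1]})

# The same random_state gives the same permutation on every call.
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))  # True
# Shuffling reorders rows but keeps every row exactly once.
print(sorted(shuffled_a['x']) == sorted(toy['x']))  # True
```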

In [None]:
POINTS.head(5)

In [None]:
# Example pandas command: count the rows in each label group
# (the label stays as the index so each count can be matched to its class)
POINTS.groupby(['label']).count()

In [None]:
# Example pandas command: mean and standard deviation of each feature per label
POINTS.groupby(['label']).agg(["mean", "std"])
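
As a small worked illustration of what these group-by calls return, here is a sketch on a hand-made frame. The values are invented for the example and are not taken from the generated data.

```python
import pandas as pd

toy = pd.DataFrame({
    'x': [0.0, 1.0, 2.0, 4.0, 6.0],
    'y': [1.0, 1.0, 3.0, 3.0, 3.0],
    'label': [0, 0, 1, 1, 1],
})

# Number of rows per label.
counts = toy.groupby('label').size()
print(counts.loc[0], counts.loc[1])  # 2 3

# Mean and sample standard deviation of each feature, per label.
# The result has a two-level column index: (feature, statistic).
stats = toy.groupby('label').agg(['mean', 'std'])
print(stats.loc[1, ('x', 'mean')])  # 4.0
print(stats.loc[1, ('x', 'std')])   # 2.0
```

Note that pandas reports the sample standard deviation (dividing by n - 1), which is why the three values 2, 4, 6 give 2.0.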

In [None]:
# Make a boxplot of each feature
plt.figure(figsize=(5, 5))

plt.boxplot([POINTS['x'],
             POINTS['y']],
            labels=['x', 'y'],
            patch_artist=True,
            boxprops=dict(facecolor='white', color='black'),
            medianprops=dict(color='red'),
            whiskerprops=dict(color='black'),
            flierprops=dict(marker='o', markerfacecolor='red', markersize=5, linestyle='none'),
)
plt.ylabel('Value')
plt.xlabel('Feature')
plt.title('Boxplot of Features')
plt.show()

Although the medians differ, the minima, maxima, and quartiles of the two features are otherwise broadly similar. Let's plot the distributions of `x` and `y` to see how similar they really are.
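
The comparison read off the boxplot can also be checked numerically. Below is a minimal sketch, assuming a seeded regeneration of the same mixture (200 points around (0.5, 1.5) and 300 around (1.5, 0.5), with a standard deviation of 0.5). The seed and the lowercase variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, assumed for repeatability
sigma = 0.5
cluster_1 = rng.normal([0.5, 1.5], sigma, size=(200, 2))
cluster_2 = rng.normal([1.5, 0.5], sigma, size=(300, 2))
points = pd.DataFrame(np.vstack([cluster_1, cluster_2]), columns=['x', 'y'])

# Minimum, quartiles, median, and maximum of each feature, side by side.
summary = points.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(summary)

# The larger cluster sits at high x and low y, so the median of x
# is expected to exceed the median of y.
print(summary.loc[0.5, 'x'] > summary.loc[0.5, 'y'])
```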

In [None]:
plt.figure(figsize=(5, 5))
plt.hist(POINTS['x'], bins=np.linspace(0, 2, 21), alpha=0.5, label='x', color='C0')
plt.hist(POINTS['y'], bins=np.linspace(0, 2, 21), alpha=0.5, label='y', color='C1')
plt.xlabel('Value')
plt.ylabel('Count')
plt.title('Histogram of Features')
plt.legend()
plt.show()
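
One caveat about the histogram above: `bins=np.linspace(0, 2, 21)` covers only the interval [0, 2], and values outside the bin range are left out of the counts. The sketch below estimates how many points that affects, assuming a seeded sample with the same mixture as the `x` feature. The seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
# Same mixture as the x feature: 200 points at 0.5 and 300 at 1.5.
values = np.concatenate([rng.normal(0.5, 0.5, 200), rng.normal(1.5, 0.5, 300)])

bins = np.linspace(0, 2, 21)  # 20 bins of width 0.1
counts, edges = np.histogram(values, bins=bins)

# Values below the first edge or above the last edge are not counted.
outside = int(np.sum((values < bins[0]) | (values > bins[-1])))
print(f"{counts.sum()} of {values.size} values fall inside the bins")
print(f"{outside} values lie outside [0, 2]")
```

If the tails matter, widening the bin range (for example `np.linspace(-1, 3, 41)`, which keeps the 0.1 bin width) would include them.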

We can also make scatter plots of `y` against `x`, first with all points in one color and then with one color per class.

In [None]:
# Scatter plot of all points, ignoring the class label
plt.figure(figsize=(5, 5))
plt.scatter(POINTS['x'], POINTS['y'])
plt.show()

In [None]:
plt.figure(figsize=(5, 5))

# Split the points by label so each class gets its own color
class_1 = POINTS[POINTS['label'] == 0]
class_2 = POINTS[POINTS['label'] == 1]
plt.scatter(class_1['x'], class_1['y'], label='Class 1', color='C0')
plt.scatter(class_2['x'], class_2['y'], label='Class 2', color='C1')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot of Features')
plt.legend()
plt.show()
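
With more than two classes, filtering each class by hand becomes repetitive. Below is a sketch of the same kind of plot driven by `groupby`, using a toy frame with invented values. The legend text `Label 0` / `Label 1` is an assumption and differs from the `Class 1` / `Class 2` naming used above.

```python
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    'x': [0.4, 0.6, 0.5, 1.4, 1.6, 1.5],
    'y': [1.5, 1.4, 1.6, 0.5, 0.6, 0.4],
    'label': [0, 0, 0, 1, 1, 1],
})

fig, ax = plt.subplots(figsize=(5, 5))
# One scatter call per label; 'C0', 'C1', ... are matplotlib's default cycle colors.
for label, group in toy.groupby('label'):
    ax.scatter(group['x'], group['y'], label=f'Label {label}', color=f'C{label}')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Scatter Plot of Features')
ax.legend()
plt.show()
```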