
# Section 6 Python Example: Simulating Data Generation

Simulating data is a valuable technique in data science. It is particularly useful for testing hypotheses, validating models, and training machine learning algorithms when real datasets are incomplete or unavailable, or when privacy concerns restrict their use. Python libraries such as NumPy and SciPy can be used to create synthetic datasets that mimic the characteristics of real-world data. This section works through practical examples of generating simulated data in Python, from basic statistical distributions to more complex data structures.

1. Generating Random Data with NumPy:

NumPy is a fundamental package for scientific computing in Python. It includes support for a wide range of mathematical operations and has powerful data generation capabilities. Here, we'll use NumPy to generate random data from different statistical distributions:

In [3]:
import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(42)

# Generate 10 random integers from 1 to 99 (the upper bound, 100, is exclusive)
random_integers = np.random.randint(1, 100, size=10)
print("Random Integers:", random_integers)

# Generate 1000 samples from a standard normal distribution (mean 0, standard deviation 1)
normal_data = np.random.normal(loc=0, scale=1, size=1000)
print("Normal Distribution Sample Mean:", np.mean(normal_data))
# np.std computes the population standard deviation by default (ddof=0)
print("Normal Distribution Sample Standard Deviation:", np.std(normal_data))

# Generate 1000 samples from a uniform distribution on [0, 1)
uniform_data = np.random.uniform(low=0, high=1, size=1000)
print("Uniform Distribution Sample Mean:", np.mean(uniform_data))
print("Uniform Distribution Sample Range:", (np.min(uniform_data), np.max(uniform_data)))

Random Integers: [52 93 15 72 61 21 83 87 75 75]
Normal Distribution Sample Mean: 0.025354699638558926
Normal Distribution Sample Standard Deviation: 1.0003731428167348
Uniform Distribution Sample Mean: 0.5016124771821006
Uniform Distribution Sample Range: (0.0032182636042786816, 0.9994137257706666)
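The calls above use NumPy's legacy global random state. NumPy's documentation recommends the newer `Generator` interface, created with `np.random.default_rng`, for new code. The sketch below repeats the same three draws with that interface. The variable names are illustrative, and the values will differ from the output above because the two interfaces use different underlying bit generators.

```python
import numpy as np

# A local generator with its own seed; it does not touch NumPy's global state.
rng = np.random.default_rng(42)

# Integers from 1 to 99: as with randint, the upper bound is exclusive by default.
rng_integers = rng.integers(1, 100, size=10)

# 1000 draws from a standard normal distribution.
rng_normal = rng.normal(loc=0, scale=1, size=1000)

# 1000 draws from the uniform distribution on [0, 1).
rng_uniform = rng.uniform(low=0, high=1, size=1000)

print("Random Integers:", rng_integers)
print("Normal Sample Mean:", rng_normal.mean())
print("Normal Sample Standard Deviation:", rng_normal.std())
print("Uniform Sample Range:", (rng_uniform.min(), rng_uniform.max()))

# Re-creating the generator with the same seed reproduces the same sequence.
rng_again = np.random.default_rng(42)
same_integers = rng_again.integers(1, 100, size=10)
print("Reproducible:", np.array_equal(rng_integers, same_integers))
```

Passing a generator object into functions, rather than relying on a global seed, makes it easier to keep separate parts of a simulation reproducible independently of one another.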


2. Simulating Time Series Data:

Time series data is sequential data indexed in time order, and is common in economics, environmental science, and server logs. Here’s how you might simulate a simple daily temperature dataset in Python, drawing each day's value independently from a normal distribution around a fixed mean:

In [4]:
# Generate a time series of daily temperatures
np.random.seed(42)
days = 365
mean_temperature = 20  # degrees Celsius
temperature_variation = 10  # standard deviation of the daily temperatures (degrees Celsius)

daily_temperatures = np.random.normal(loc=mean_temperature, scale=temperature_variation, size=days)
dates = pd.date_range(start='2021-01-01', periods=days, freq='D')
temperature_series = pd.Series(data=daily_temperatures, index=dates)

print("Simulated Daily Temperatures for 2021:")
print(temperature_series.head())

Simulated Daily Temperatures for 2021:
2021-01-01    24.967142
2021-01-02    18.617357
2021-01-03    26.476885
2021-01-04    35.230299
2021-01-05    17.658466
Freq: D, dtype: float64
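The series above draws each day independently around a fixed mean, so it has no seasonal pattern and no day-to-day persistence. As one possible refinement, the sketch below adds an annual cosine cycle and first-order autoregressive (AR(1)) noise. The amplitude, the AR coefficient `phi`, the noise scale, and the timing of the cycle are all illustrative assumptions, not values taken from a real climate record.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
days = 365
mean_temperature = 20      # degrees Celsius, as in the example above
seasonal_amplitude = 8     # assumed half-range of the annual cycle
phi = 0.7                  # assumed AR(1) coefficient (day-to-day persistence)
noise_scale = 2            # assumed standard deviation of the daily shocks

# Annual cycle: a cosine wave with period `days` that is at its minimum
# on day 0 and reaches its maximum half a year later.
day_index = np.arange(days)
seasonal = -seasonal_amplitude * np.cos(2 * np.pi * day_index / days)

# AR(1) noise: each day's deviation is phi times yesterday's plus a new shock.
shocks = rng.normal(loc=0, scale=noise_scale, size=days)
ar_noise = np.zeros(days)
ar_noise[0] = shocks[0]
for t in range(1, days):
    ar_noise[t] = phi * ar_noise[t - 1] + shocks[t]

dates = pd.date_range(start="2021-01-01", periods=days, freq="D")
seasonal_series = pd.Series(mean_temperature + seasonal + ar_noise, index=dates)

print(seasonal_series.head())
print("Lag-1 autocorrelation:", seasonal_series.autocorr(lag=1))
```

Unlike the independent draws, consecutive days in this series tend to be similar, which shows up as a positive lag-1 autocorrelation.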


3. Creating a Synthetic Classification Dataset with Scikit-learn:

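For machine learning applications, especially classification tasks, simulated datasets with known properties are useful for testing models. Scikit-learn offers utilities for generating datasets for various machine learning tasks. Here’s how to generate a simple binary classification dataset with 20 features, of which 2 are informative, 10 are redundant (random linear combinations of the informative ones), and the remaining 8 are random noise: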

In [5]:
import pandas as pd
from sklearn.datasets import make_classification

# Generate a binary classification dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=10, n_classes=2, random_state=42)

# Convert to a DataFrame for easier inspection
feature_names = [f"Feature_{i+1}" for i in range(X.shape[1])]
data = pd.DataFrame(X, columns=feature_names)
data['Target'] = y

print("Synthetic Classification Dataset Preview:")
print(data.head())

Synthetic Classification Dataset Preview:
   Feature_1  Feature_2  Feature_3  Feature_4  Feature_5  Feature_6  \
0   0.866690   0.563581  -0.919093  -0.533558   0.830336   0.248221   
1   0.624825   0.901690  -0.669515  -0.921694  -0.790474   0.725767   
2   0.484315   1.358280  -0.528156  -1.429220  -0.756351  -0.955540   
3   0.925419   0.474106  -0.979591  -0.431315  -1.304470  -0.032695   
4  -0.518667  -1.146568   0.561319   1.196641  -0.039555  -1.276749   

   Feature_7  Feature_8  Feature_9  Feature_10  ...  Feature_12  Feature_13  \
0  -1.859457   0.181866  -0.856084    0.003484  ...    0.605559   -1.249347   
1  -1.695181   0.109395   0.471468    0.457738  ...    0.649019   -0.663373   
2  -1.785998  -0.127918  -1.422254    0.960709  ...    0.785842   -0.198320   
3  -1.894064   0.384065   0.669673   -0.113594  ...    0.591843   -1.395164   
4   1.692146   0.271579   0.681501   -0.745771  ...   -0.709469    0.359963   

[output truncated]
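One way to confirm that a generated dataset behaves as intended is to check its class balance and fit a simple model to it. The sketch below regenerates the data with the same settings, holds out a test split, and trains a logistic regression. The 70/30 split and the choice of classifier are assumptions made for illustration. The accuracy is printed rather than quoted here, since it may vary with the installed scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same settings as the example above.
X, y = make_classification(n_samples=100, n_features=20, n_informative=2,
                           n_redundant=10, n_classes=2, random_state=42)

# Count the samples in each class; by default the classes are roughly balanced.
class_counts = np.bincount(y)
print("Samples per class:", class_counts)

# Hold out 30% of the rows, stratified so both splits keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Fit a simple baseline classifier and score it on the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Held-out accuracy:", test_accuracy)
```

Because the informative features are known by construction, a baseline that performs no better than chance on such data would point to a problem in the modelling code rather than in the data.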

These examples demonstrate the flexibility of Python for generating synthetic data. Such simulations let researchers and analysts test and develop analytical models and methods when real data is limited. Simulated data must be used judiciously, however: it yields meaningful insights about real scenarios only if it reflects the characteristics of real data closely enough.
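One simple check of whether simulated values resemble a reference sample is a two-sample Kolmogorov-Smirnov test, available in SciPy as `scipy.stats.ks_2samp`. In the sketch below, the "observed" sample is itself simulated, standing in for real measurements that this section does not provide. The two candidate simulations and their parameters are likewise hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in for real measurements (hypothetical): mean 20, standard deviation 5.
observed = rng.normal(loc=20, scale=5, size=500)

# One candidate uses the same parameters; the other uses a shifted mean.
simulated_good = rng.normal(loc=20, scale=5, size=500)
simulated_poor = rng.normal(loc=25, scale=5, size=500)

# The KS statistic is the largest gap between the two empirical CDFs,
# so smaller values indicate more similar distributions.
result_good = stats.ks_2samp(observed, simulated_good)
result_poor = stats.ks_2samp(observed, simulated_poor)

print("Matched parameters: statistic =", result_good.statistic,
      "p-value =", result_good.pvalue)
print("Shifted mean:       statistic =", result_poor.statistic,
      "p-value =", result_poor.pvalue)
```

A small p-value suggests that the two samples come from different distributions. A large p-value only means the test found no evidence of a difference, which is not proof that the simulation matches the real data.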

References:

VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.

McKinney, W. (2012). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media.

Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.