## Exercise 0: Synthetic Data Generation

### By:
Auberth Eduardo Hurtado

### Date:
2025-01-17

### Description:

Create an artificial dataset by following the steps described in the test. These steps will be performed in Python.


## 📚 Libraries used

In [1]:
import pandas as pd
import numpy as np
import uuid
from datetime import datetime, timedelta
import random
import warnings
import matplotlib

# Hide warnings
warnings.filterwarnings('ignore')

### 1. Write a Python script that generates a dataset with at least 50,000 rows following this schema:
```json
{
"order_id": "uuid",
"customer_id": "random_int(1, 10_000)",
"product_id": "random_int(1, 1_000)",
"quantity": "random_int(1, 20)",
"price": "random_float(1.0, 500.0)",
"discount": "random_float(0.0, 0.3)",
"order_date": "random_date(2023-01-01, 2024-12-31)",
"shipping_priority": "random_choice(['Low', 'Medium', 'High'])",
"region": "random_choice(['North', 'South', 'East', 'West'])"
}
```
#### Make sure that:
- `order_id` is unique.
- `order_date` is distributed with a seasonal pattern and an increasing trend (more orders in 2024).
- `discount` is inversely correlated with `price`.
- `shipping_priority` is proportional to `region` (for example, higher priority in "North").

### 💾 Parameters

In [2]:
num_rows = 50000
max_customer_id = 10000
max_product_id = 1000
max_quantity = 20
max_price = 500
max_discount = 0.3
start_date = datetime(2023, 1, 1)
end_date = datetime(2024, 12, 31)
noise_percent = 0.05
noise_options = ['NULL', -9999, np.nan]


## 👷 1. Data generation

- **order_id**: must be unique.

`order_id` is a UUID (Universally Unique Identifier), so I use the `uuid` library to generate it.

In [3]:
order_id = [str(uuid.uuid4()) for _ in range(num_rows)]
#print(order_id)
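
Random UUIDs are unique in practice, but the requirement can also be checked explicitly. The following is a minimal sketch of such a check; the names `n_unique` and `is_unique` are illustrative and do not appear in the original notebook.

```python
import uuid

num_rows = 50_000  # same value as in the Parameters cell
order_id = [str(uuid.uuid4()) for _ in range(num_rows)]

# A set drops duplicates, so equal lengths mean every ID is unique.
n_unique = len(set(order_id))
is_unique = n_unique == num_rows
print(f"{n_unique} unique IDs out of {num_rows}")
```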

- **customer_id**: random integer between 1 and 10000.

In [4]:
customer_id = np.random.randint(1, max_customer_id + 1, num_rows)
#print(customer_id.min(), customer_id.max())


- **product_id**: random integer between 1 and 1000.

In [5]:
product_id = np.random.randint(1, max_product_id + 1, num_rows)
#print(product_id.min(), product_id.max())

- **quantity**: random integer between 1 and 20.

In [6]:
quantity = np.random.randint(1, max_quantity + 1, num_rows)
#print(quantity.min(), quantity.max())
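
The `+ 1` in the three calls above is needed because `np.random.randint` excludes its upper bound. One possible alternative, shown only as a sketch, is NumPy's `Generator.integers` with `endpoint=True`, which also makes it easy to fix a seed for reproducibility. The seed value and the names `rng` and `quantity_alt` are assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(2025)  # assumed seed, only for reproducibility
num_rows = 50_000
max_quantity = 20

# endpoint=True makes the upper bound inclusive, so no "+ 1" is needed.
quantity_alt = rng.integers(1, max_quantity, size=num_rows, endpoint=True)
print(quantity_alt.min(), quantity_alt.max())
```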

- **price**: random float between 1.0 and 500.0.

In [None]:
price = np.random.uniform(1.0, max_price, num_rows)
print(price.min(), price.max())

- **discount**: random number between 0.0 and 0.3. It should be inversely correlated with `price`

In [8]:
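# Linear mapping: the discount falls from max_discount toward 0 as price approaches max_price.
# This is a deterministic function of price, so the correlation is exactly -1.
discount = max_discount - (price / max_price) * max_discount
discount = np.clip(discount, 0.0, max_discount)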
#print(discount.min(), discount.max())
#print("Correlation between price and discount:", np.corrcoef(price, discount)[0, 1])
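
The cell above derives `discount` directly from `price`, so the values are not random on their own. If a random component is wanted while keeping the inverse relationship, one option is to add noise to the same linear base. This is a sketch under assumptions: `noise_scale`, the seed, and the name `noisy_discount` are hypothetical and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, only for reproducibility
num_rows = 50_000
max_price = 500
max_discount = 0.3
noise_scale = 0.05  # hypothetical spread of the random component

price = rng.uniform(1.0, max_price, num_rows)

# Linear base that falls from max_discount toward 0 as price rises.
base = max_discount * (1 - price / max_price)
# Gaussian jitter adds randomness; clipping keeps values in [0, max_discount].
noisy_discount = np.clip(base + rng.normal(0.0, noise_scale, num_rows), 0.0, max_discount)

corr = np.corrcoef(price, noisy_discount)[0, 1]
print(f"Correlation between price and discount: {corr:.3f}")
```

A larger `noise_scale` weakens the correlation; a value of 0 reproduces the deterministic mapping.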

- **order_date**: random date between 2023-01-01 and 2024-12-31. Additionally, it should be distributed with a seasonal pattern and an increasing trend (more orders in 2024).

In [9]:
orders_weight = {2023: 0.4, 2024: 0.6}

def orders_date(start, end, weights):
    """
    Draws a random date between the given start and end dates,
    favoring years with a higher weight (increasing trend).

    Uses rejection sampling: a uniformly drawn date is kept with probability
    weights[date.year]; otherwise a new date is drawn. Only the yearly trend
    is modeled here: dates are uniform within each year, so no seasonal
    pattern is applied.

    Parameters:
    start (datetime): The start date of the range.
    end (datetime): The end date of the range.
    weights (dict): A dictionary with years as keys and acceptance
                    probabilities (between 0 and 1) as values.

    Returns:
    datetime: A random date within the specified range.
    """
    span_days = (end - start).days
    date = start + timedelta(days=random.randint(0, span_days))
    # Reject the candidate and redraw until it passes the year-weight test.
    while random.random() > weights[date.year]:
        date = start + timedelta(days=random.randint(0, span_days))
    return date

order_date = [orders_date(start_date, end_date, orders_weight) for _ in range(num_rows)]
#orders = pd.DataFrame([date.strftime("%y-%m") for date in order_date]).value_counts().sort_index()
#orders.plot(kind="line", title="Number of Orders per Month", xlabel="Date", ylabel="Number of Orders")
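
The year weights above produce the increasing trend. A seasonal pattern could be layered on top by also weighting each calendar month. The sketch below illustrates one way to do this with a single weighted draw over all days in the range; the monthly multipliers in `month_factor` are hypothetical values chosen for illustration, and the seed and variable names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
num_rows = 50_000

# One candidate per calendar day in the requested range.
days = pd.date_range("2023-01-01", "2024-12-31", freq="D")

# Trend: each 2024 day weighs 1.5x a 2023 day (mirrors the 0.4 / 0.6 split).
year_weight = np.where(days.year == 2024, 0.6, 0.4)
# Seasonality: hypothetical monthly multipliers peaking in November and December.
month_factor = np.array([0.8, 0.8, 0.9, 1.0, 1.0, 0.9, 0.9, 1.0, 1.0, 1.1, 1.4, 1.6])
weights = year_weight * month_factor[days.month.to_numpy() - 1]
probs = weights / weights.sum()

# Draw day positions with the combined probabilities, then map back to dates.
picked = rng.choice(len(days), size=num_rows, p=probs)
seasonal_dates = days[picked]

share_2024 = (seasonal_dates.year == 2024).mean()
print(f"Share of orders in 2024: {share_2024:.2f}")
```

Drawing all dates in one call also avoids the per-row Python loop used by rejection sampling.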

- **shipping_priority**: random choice from ['Low', 'Medium', 'High']. Additionally, it should be proportional to `region` (for example, higher priority in "North").

In [10]:
# region: random choice from ['North', 'South', 'East', 'West']
region = np.random.choice(['North', 'South', 'East', 'West'], num_rows)

# Initial draw; every row is overwritten below because each region has its own branch.
shipping_priority = np.random.choice(['Low', 'Medium', 'High'], num_rows, p=[0.2, 0.3, 0.5])
# Redraw shipping_priority with region-specific probabilities (Low, Medium, High)
for i in range(num_rows):
    if region[i] == 'North':
        shipping_priority[i] = np.random.choice(['Low', 'Medium', 'High'], p=[0.1, 0.3, 0.6])
    elif region[i] == 'South':
        shipping_priority[i] = np.random.choice(['Low', 'Medium', 'High'], p=[0.3, 0.5, 0.2])
    elif region[i] == 'East':
        shipping_priority[i] = np.random.choice(['Low', 'Medium', 'High'], p=[0.4, 0.4, 0.2])
    elif region[i] == 'West':
        shipping_priority[i] = np.random.choice(['Low', 'Medium', 'High'], p=[0.5, 0.3, 0.2])
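
The loop above calls `np.random.choice` once per row. The same region-specific probabilities can be applied with one draw per region, which is usually much faster for 50,000 rows. This is a sketch rather than the notebook's method; `region_probs`, `region_vec`, and `priority_vec` are illustrative names, and the seed is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed, only for reproducibility
num_rows = 50_000
priorities = np.array(["Low", "Medium", "High"])

# Same per-region probabilities (Low, Medium, High) as in the loop above.
region_probs = {
    "North": [0.1, 0.3, 0.6],
    "South": [0.3, 0.5, 0.2],
    "East": [0.4, 0.4, 0.2],
    "West": [0.5, 0.3, 0.2],
}

region_vec = rng.choice(list(region_probs), size=num_rows)
priority_vec = np.empty(num_rows, dtype=object)

# One vectorized draw per region instead of one draw per row.
for name, p in region_probs.items():
    mask = region_vec == name
    priority_vec[mask] = rng.choice(priorities, size=mask.sum(), p=p)

north_high = (priority_vec[region_vec == "North"] == "High").mean()
print(f"Share of High priority in North: {north_high:.2f}")
```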


- DataFrame creation

In [None]:
sales = {
    "order_id": order_id,
    "customer_id": customer_id,
    "product_id": product_id,
    "quantity": quantity,
    "price": price,
    "discount": discount,
    "order_date": order_date,
    "shipping_priority": shipping_priority,
    "region": region
}

df = pd.DataFrame(sales)
df.head()

## 👷 2. Introduction of noise

Introduce noise and missing values in 5% of the rows according to these criteria:
- In at least three random columns per row.
- Noise options: remove values, insert strings such as "NULL", or insert extreme numbers (e.g. -9999).

In [12]:
noisy_rows = df.sample(n=int(noise_percent * num_rows)).index
columns = df.columns.tolist()
columns.remove('order_id')  # I'll keep 'order_id' intact to maintain the primary key.

for idx in noisy_rows:
    # Pick three distinct columns to corrupt in this row.
    noisy_columns = random.sample(columns, 3)
    for col in noisy_columns:
        #print("Adding noise to (row,col):", (idx, col))
        noise_type = random.choice(noise_options)
        #print(df[col].dtype, type(noise_type))
        # Use the value drawn above instead of drawing a second time.
        df.at[idx, col] = noise_type

#df.head()
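
Writing a string such as "NULL" into a numeric column forces pandas to store mixed types, which recent pandas versions flag with a warning (hidden here by `warnings.filterwarnings`). One way to make this explicit is to cast the columns to `object` before injecting noise, and then verify how many rows were affected. The sketch below uses a small stand-in frame; `demo`, `noisy`, and `is_noise` are hypothetical names, and the seed is an assumption.

```python
import random
import numpy as np
import pandas as pd

random.seed(7)  # assumed seed, only for reproducibility
noise_options = ["NULL", -9999, np.nan]

# Small stand-in frame; the column names follow the schema above.
demo = pd.DataFrame({
    "order_id": [f"id-{i}" for i in range(200)],
    "quantity": np.arange(200) % 20 + 1,
    "price": np.linspace(1.0, 500.0, 200),
    "discount": np.linspace(0.3, 0.0, 200),
    "region": ["North", "South", "East", "West"] * 50,
})

# Object dtype lets one column hold strings, numbers, and NaN together.
noisy = demo.astype(object)
cols = [c for c in noisy.columns if c != "order_id"]
noisy_idx = noisy.sample(frac=0.05, random_state=7).index

for idx in noisy_idx:
    for col in random.sample(cols, 3):
        noisy.at[idx, col] = random.choice(noise_options)

# Flag every cell that holds one of the three noise markers.
cells = noisy[cols]
is_noise = (cells == "NULL") | (cells == -9999) | cells.isna()
rows_with_noise = int(is_noise.any(axis=1).sum())
print(f"{rows_with_noise} of {len(noisy)} rows carry noise")
```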

## 📊 3. Saving results 

Save the generated dataset to raw_sales_data.csv


In [13]:
df.to_csv(r'../data/raw/raw_sales_data.csv', index=False)
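
`to_csv` fails if the target folder does not exist. A small helper can create the parent folders first. This is a sketch, not part of the original notebook: `save_csv` is a hypothetical helper, and the demo writes to a temporary folder rather than to `../data/raw/`.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_csv(frame, path):
    """Write frame to path, creating missing parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


# Demo in a temporary folder; the notebook itself targets ../data/raw/.
with tempfile.TemporaryDirectory() as tmp:
    sample = pd.DataFrame({"order_id": ["a", "b"], "price": [1.5, 2.5]})
    written = save_csv(sample, Path(tmp) / "data" / "raw" / "raw_sales_data.csv")
    round_trip = pd.read_csv(written)

print(round_trip.shape)
```

When reading the file back, keep in mind that `pd.read_csv` treats the string `NULL` as a missing value by default; passing `keep_default_na=False` keeps it as text.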