## Creating a Sample Dataset with 10 Different Data Columns

### Import the required libraries

In [1]:
import pandas as pd
import numpy as np

### Set a random seed so that NumPy generates the same numbers on every run

In [2]:
# Set random seed for reproducibility
np.random.seed(42)
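The effect of seeding can be checked directly. The sketch below is illustrative only: the variable names are hypothetical, and the `default_rng` variant is an alternative to the global seed used in this notebook, not a replacement for it. Switching to it would produce different numbers from the outputs shown later.

```python
import numpy as np

# Seeding the legacy global generator twice yields identical draws.
np.random.seed(42)
first = np.random.randint(18, 65, 5)
np.random.seed(42)
second = np.random.randint(18, 65, 5)
same_global = bool((first == second).all())

# Alternative: a local Generator keeps the seed out of global state.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
same_local = bool((rng_a.integers(18, 65, 5) == rng_b.integers(18, 65, 5)).all())

print(same_global, same_local)
```

Both comparisons should report matching draws, because each pair of generators starts from the same seed.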

### Create 100 Rows of Data with 10 Columns: Age, Income, Category, Items Purchased, Product Description, Transaction Date, Is Member, Customer Satisfaction Score, Purchased Product Category, Loyalty Tier

#### Column Descriptions for the Dataset

| **Column Name**                 | **Description**                                                              |
|----------------------------------|------------------------------------------------------------------------------|
| `Age`                           | Age of the individual (Discrete numerical).                                 |
| `Income`                        | Annual income of the individual in USD (Continuous numerical).              |
| `Category`                      | Spending category (`Low`, `Medium`, `High`) (Categorical).                 |
| `Items Purchased`               | Number of items purchased in a transaction (Discrete numerical).           |
| `Product Description`           | Description of the product purchased (`Gadget`, `Furniture`, `Clothing`) (Text data). |
| `Transaction Date`              | Date of the transaction (Date/Time).                                        |
| `Is Member`                     | Whether the customer is a member (`True`/`False`) (Boolean).                |
| `Customer Satisfaction Score`   | Satisfaction rating, nominally on a 1 to 10 scale; sampled values are not clipped (Target for regression). |
| `Purchased Product Category`    | Binary target indicating purchase category (e.g., `0 = Non-Tech`, `1 = Tech`). |
| `Loyalty Tier`                  | Loyalty program tier: `0 = Basic`, `1 = Silver`, `2 = Gold` (Multi-class target). |

This dataset provides a variety of data types, including:
- **Numerical features** for regression and numerical analysis.
- **Categorical features** for classification tasks.
- **Text features** for natural language processing.
- **Date/Time features** for time series analysis.
- **Binary and multi-class targets** for classification problems.

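Because it mixes these data types, the dataset can be used as input for all 20 machine learning algorithms. Every column is drawn independently at random, so the targets have no real relationship to the features.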


In [3]:
# Create 100 rows of data
data = {
    'Age': np.random.randint(18, 65, 100),  # Discrete numerical feature
    'Income': np.random.uniform(20000, 150000, 100),  # Continuous numerical feature
    'Category': np.random.choice(['Low', 'Medium', 'High'], size=100),  # Categorical feature
    'Items Purchased': np.random.randint(1, 10, 100),  # Discrete numerical feature
    'Product Description': np.random.choice(['Gadget', 'Furniture', 'Clothing'], size=100),  # Text data
    'Transaction Date': pd.date_range('2024-01-01', periods=100, freq='D'),  # Date/Time feature
    'Is Member': np.random.choice([True, False], size=100),  # Boolean feature
    'Customer Satisfaction Score': np.random.normal(7, 1.5, 100).round(1),  # Continuous target (e.g., regression)
    'Purchased Product Category': np.random.choice([0, 1], size=100),  # Binary target
    'Loyalty Tier': np.random.choice([0, 1, 2], size=100),  # Multi-class target
}
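The value ranges produced by the calls above are worth a quick check. The sketch below uses hypothetical variable names and an arbitrary seed, and draws a larger sample than the notebook does so that the bounds are visible. It shows that the upper bound of `np.random.randint` is exclusive and that `np.random.normal` is unbounded. The final `clip` call is an optional assumption, not part of the original dataset.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, used only for this illustration

# randint excludes the upper bound: ages fall in 18..64, items in 1..9.
ages = np.random.randint(18, 65, 10_000)
items = np.random.randint(1, 10, 10_000)

# A normal distribution has no hard limits, so scores can leave the 1-10 scale.
scores = np.random.normal(7, 1.5, 10_000).round(1)

print("Age range:", ages.min(), "to", ages.max())
print("Items range:", items.min(), "to", items.max())
print("Score range:", scores.min(), "to", scores.max())

# Optional (not in the original): force scores back onto the 1-10 scale.
clipped = scores.clip(1, 10)
print("Clipped score range:", clipped.min(), "to", clipped.max())
```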

### Convert the data to a pandas DataFrame

In [4]:
# Convert to DataFrame
df = pd.DataFrame(data)
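To illustrate what this conversion does, the sketch below builds a much smaller frame from a dictionary of equal-length arrays. The names `small` and `small_df` are hypothetical. Each dictionary key becomes a column label, and each array keeps its own dtype.

```python
import numpy as np
import pandas as pd

# A three-row stand-in for the notebook's dictionary (hypothetical values).
small = {
    'Age': np.array([25, 40, 31]),
    'Is Member': np.array([True, False, True]),
    'Transaction Date': pd.date_range('2024-01-01', periods=3, freq='D'),
}

small_df = pd.DataFrame(small)

print(small_df.shape)   # (rows, columns)
print(small_df.dtypes)  # one dtype per column
```

All arrays in the dictionary must have the same length; otherwise `pd.DataFrame` raises a `ValueError`.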

### Display the first five rows of the 100-row dataset with its 10 column labels

In [5]:
df.head()

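|   | Age | Income | Category | Items Purchased | Product Description | Transaction Date | Is Member | Customer Satisfaction Score | Purchased Product Category | Loyalty Tier |
|---|-----|--------|----------|-----------------|---------------------|------------------|-----------|-----------------------------|----------------------------|--------------|
| 0 | 56 | 58153.462713 | Medium | 5 | Clothing | 2024-01-01 | True | 6.8 | 0 | 2 |
| 1 | 46 | 21830.376953 | High | 3 | Furniture | 2024-01-02 | False | 7.6 | 1 | 0 |
| 2 | 32 | 45849.512532 | High | 1 | Furniture | 2024-01-03 | True | 8.0 | 0 | 1 |
| 3 | 60 | 112474.453857 | High | 4 | Furniture | 2024-01-04 | True | 6.4 | 0 | 0 |
| 4 | 25 | 122722.820269 | High | 5 | Gadget | 2024-01-05 | True | 7.3 | 0 | 0 |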


### Convert categorical variables into binary vectors (one-hot encoding)

In [6]:
# One-hot encode categorical and text features
df_encoded = pd.get_dummies(df, columns=['Category', 'Product Description'], drop_first=True)
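The effect of `drop_first=True` is easier to see on a toy frame. In the sketch below, `toy`, `full`, and `reduced` are hypothetical names. `pd.get_dummies` orders the category labels alphabetically, so `High` is the first level and is the one dropped. A row where both remaining indicator columns are false therefore represents `High`.

```python
import pandas as pd

toy = pd.DataFrame({'Category': ['Low', 'Medium', 'High', 'Low']})

# Without drop_first: one indicator column per category.
full = pd.get_dummies(toy, columns=['Category'])

# With drop_first: the first (alphabetical) category becomes the baseline.
reduced = pd.get_dummies(toy, columns=['Category'], drop_first=True)

print(list(full.columns))     # ['Category_High', 'Category_Low', 'Category_Medium']
print(list(reduced.columns))  # ['Category_Low', 'Category_Medium']
print(reduced)
```

Dropping one level avoids a redundant column, since its value is implied by the others. This matters mainly for linear models, where the full set of indicators is perfectly collinear with an intercept.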

### Display the first five rows of the encoded dataset

In [7]:
df_encoded.head()

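|   | Age | Income | Items Purchased | Transaction Date | Is Member | Customer Satisfaction Score | Purchased Product Category | Loyalty Tier | Category_Low | Category_Medium | Product Description_Furniture | Product Description_Gadget |
|---|-----|--------|-----------------|------------------|-----------|-----------------------------|----------------------------|--------------|--------------|-----------------|-------------------------------|----------------------------|
| 0 | 56 | 58153.462713 | 5 | 2024-01-01 | True | 6.8 | 0 | 2 | False | True | False | False |
| 1 | 46 | 21830.376953 | 3 | 2024-01-02 | False | 7.6 | 1 | 0 | False | False | True | False |
| 2 | 32 | 45849.512532 | 1 | 2024-01-03 | True | 8.0 | 0 | 1 | False | False | True | False |
| 3 | 60 | 112474.453857 | 4 | 2024-01-04 | True | 6.4 | 0 | 0 | False | False | True | False |
| 4 | 25 | 122722.820269 | 5 | 2024-01-05 | True | 7.3 | 0 | 0 | False | False | False | True |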


### Save the encoded dataset as a CSV file

In [8]:
# Save as CSV for reuse
df_encoded.to_csv('labeled_dataset.csv', index=False)
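When the file is loaded again, note that CSV stores dates as plain text. The sketch below uses an in-memory buffer and a hypothetical three-row `frame` in place of the real file. It shows one way to restore the datetime column on reload by using the `parse_dates` argument of `pd.read_csv`.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    'Transaction Date': pd.date_range('2024-01-01', periods=3, freq='D'),
    'Is Member': [True, False, True],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default read: the date column comes back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Asking for date parsing restores a datetime column.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['Transaction Date'])

print(plain.dtypes)
print(parsed.dtypes)
```

The same `parse_dates` argument should apply when reading `labeled_dataset.csv` back from disk.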

### Print summary information for the dataset

In [9]:
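# Summary of dataset
print("Dataset created with labeled columns:")
df_encoded.info()  # info() prints its report directly and returns None
print(df.describe())  # Descriptive statistics for the original (unencoded) DataFrame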

Dataset created with labeled columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   Age                            100 non-null    int32         
 1   Income                         100 non-null    float64       
 2   Items Purchased                100 non-null    int32         
 3   Transaction Date               100 non-null    datetime64[ns]
 4   Is Member                      100 non-null    bool          
 5   Customer Satisfaction Score    100 non-null    float64       
 6   Purchased Product Category     100 non-null    int64         
 7   Loyalty Tier                   100 non-null    int64         
 8   Category_Low                   100 non-null    bool          
 9   Category_Medium                100 non-null    bool          
 10  Product Description_Furniture  100 non-null    bo