### Dataset Structure
1. **Employee Data**:
   - `EmployeeID`: Unique identifier for employees.
   - `Name`: Random names for employees.
   - `Age`: Age of employees.
   - `Department`: Department they work in (e.g., HR, Sales, Tech).
   - `JoiningDate`: Date they joined the company.
   - `MonthlySalary`: Monthly salary of the employee.
   - `PerformanceScore`: A score (1-10) representing employee performance.

2. **Sales Data**:
   - `TransactionID`: Unique identifier for sales transactions.
   - `EmployeeID`: Reference to the employee who made the sale.
   - `Product`: Name of the product sold.
   - `QuantitySold`: Quantity of the product sold.
   - `SaleAmount`: Total amount of the sale.
   - `TransactionDate`: Date of the transaction.

### 1. Employee Data (`employee_df`)
| EmployeeID | Name           | Age | Department | JoiningDate | MonthlySalary | PerformanceScore |
|------------|----------------|-----|------------|-------------|---------------|------------------|
| 1001       | Casey Martinez | 50  | HR         | 2019-03-01  | 10400         | 8                |
| 1002       | Taylor Smith   | 36  | Tech       | 2021-10-25  | 6170          | 2                |
| 1003       | Taylor Smith   | 29  | HR         | 2016-07-17  | 12874         | 6                |

- **Columns:**
  - `EmployeeID`: Unique identifier for each employee.
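  - `Name`: Randomly generated names (drawn with Python's `random` module, which is not seeded here, so they differ between runs).
  - `Age`: Employee age (22-59; the upper bound of `np.random.randint(22, 60)` is exclusive).
  - `Department`: Department name (HR, Sales, Tech, Marketing).
  - `JoiningDate`: Random joining dates between 2015 and 2024.
  - `MonthlySalary`: Random monthly salary (3000-14999; the upper bound is exclusive).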
  - `PerformanceScore`: Score between 1 and 10.

### 2. Sales Data (`sales_df`)
| TransactionID | EmployeeID | Product    | QuantitySold | SaleAmount | TransactionDate |
|---------------|------------|------------|--------------|------------|-----------------|
| 2001          | 1048       | Headphones | 4            | 1888       | 2023-09-18      |
| 2002          | 1085       | Monitor    | 2            | 1326       | 2023-09-18      |
| 2003          | 1039       | Tablet     | 5            | 2965       | 2023-12-16      |

- **Columns:**
  - `TransactionID`: Unique identifier for each sale.
  - `EmployeeID`: Reference to the employee who made the sale.
  - `Product`: Product name (Laptop, Phone, Tablet, Monitor, Headphones).
  - `QuantitySold`: Quantity of products sold (1-9; the upper bound of `np.random.randint(1, 10)` is exclusive).
  - `SaleAmount`: Total amount of the sale.
  - `TransactionDate`: Date of the transaction in 2023.
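Because `EmployeeID` in the sales data refers back to the employee data, the two tables can be joined on that column. The following is a minimal sketch under assumptions: the frames are small stand-ins for `employee_df` and `sales_df`, and the transactions are hypothetical rows chosen so that every `EmployeeID` has a match.

```python
import pandas as pd

# Small stand-in for employee_df (IDs and departments from the sample rows)
employees = pd.DataFrame({
    "EmployeeID": [1001, 1002, 1003],
    "Department": ["HR", "Tech", "HR"],
})

# Hypothetical transactions that reference those employees
sales = pd.DataFrame({
    "TransactionID": [2001, 2002, 2003],
    "EmployeeID": [1002, 1003, 1002],
    "SaleAmount": [1888, 1326, 2965],
})

# Left join keeps every sale; validate checks that each EmployeeID
# appears at most once on the employee side
merged = sales.merge(employees, on="EmployeeID", how="left", validate="many_to_one")
print(merged)
```

The `validate="many_to_one"` argument is optional; it raises an error if the employee table unexpectedly contains duplicate IDs.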

### Next Steps
You can now use these datasets to:
- **Define custom functions** and apply them using `apply`, `map`, and `lambda`.
- Explore data manipulation tasks like:
  - Calculating the total sales made by each employee.
  - Creating a new column to categorize employee performance.
  - Extracting and transforming date features.

Remember, the goal is to enhance your understanding of using custom functions and advanced pandas features.
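As a quick orientation, here is a minimal sketch of how `map`, `apply`, and `lambda` differ. The small frame reuses values from the sample rows; the department codes and the bonus rule are hypothetical and exist only for illustration.

```python
import pandas as pd

# Tiny illustrative frame; values mirror the sample employee rows
df = pd.DataFrame({
    "Department": ["HR", "Tech", "HR"],
    "MonthlySalary": [10400, 6170, 12874],
    "PerformanceScore": [8, 2, 6],
})

# Series.map with a dict: element-wise lookup (hypothetical department codes)
dept_codes = {"HR": "H", "Tech": "T"}
df["DeptCode"] = df["Department"].map(dept_codes)

# Series.apply with a lambda: element-wise transformation of one column
df["AnnualSalary"] = df["MonthlySalary"].apply(lambda s: s * 12)

# DataFrame.apply with axis=1: the function receives one row at a time
def bonus(row):
    """Hypothetical rule: 10% of monthly salary for scores of 8 or above."""
    return row["MonthlySalary"] * 0.10 if row["PerformanceScore"] >= 8 else 0.0

df["Bonus"] = df.apply(bonus, axis=1)
print(df)
```

As a rule of thumb, `map` and column-level `apply` work on one value at a time, while `apply(..., axis=1)` is the option to reach for when the function needs several columns from the same row.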



In [32]:
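# Synthetic datasets using pandas and NumPy
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

# Setting a random seed for reproducibility.
# Note: this seeds NumPy only. generate_random_name() below uses Python's
# `random` module, which np.random.seed() does not affect, so names vary between runs.
np.random.seed(42)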

# Helper function to generate random names
def generate_random_name():
    first_names = ["Alex", "Jordan", "Taylor", "Morgan", "Casey", "Drew", "Robin", "Reese"]
    last_names = ["Smith", "Johnson", "Lee", "Brown", "Garcia", "Martinez", "Clark", "Lewis"]
    return f"{random.choice(first_names)} {random.choice(last_names)}"

# Generating Employee Data
num_employees = 100
employee_ids = np.arange(1001, 1001 + num_employees)
names = [generate_random_name() for _ in range(num_employees)]
ages = np.random.randint(22, 60, size=num_employees)
departments = np.random.choice(['HR', 'Sales', 'Tech', 'Marketing'], size=num_employees)
joining_dates = [datetime(2015, 1, 1) + timedelta(days=np.random.randint(0, 365*10)) for _ in range(num_employees)]
monthly_salaries = np.random.randint(3000, 15000, size=num_employees)
performance_scores = np.random.randint(1, 11, size=num_employees)

# Creating the Employee DataFrame
employee_df = pd.DataFrame({
    'EmployeeID': employee_ids,
    'Name': names,
    'Age': ages,
    'Department': departments,
    'JoiningDate': joining_dates,
    'MonthlySalary': monthly_salaries,
    'PerformanceScore': performance_scores
})

# Generating Sales Data
num_sales = 500
transaction_ids = np.arange(2001, 2001 + num_sales)
employee_ref_ids = np.random.choice(employee_ids, size=num_sales)
products = np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones'], size=num_sales)
quantities_sold = np.random.randint(1, 10, size=num_sales)
sale_amounts = quantities_sold * np.random.randint(100, 1000, size=num_sales)
transaction_dates = [datetime(2023, 1, 1) + timedelta(days=np.random.randint(0, 365)) for _ in range(num_sales)]

# Creating the Sales DataFrame
sales_df = pd.DataFrame({
    'TransactionID': transaction_ids,
    'EmployeeID': employee_ref_ids,
    'Product': products,
    'QuantitySold': quantities_sold,
    'SaleAmount': sale_amounts,
    'TransactionDate': transaction_dates
})

# Display the first few rows of each DataFrame to verify
employee_df.head(), sales_df.head()


(   EmployeeID            Name  Age Department JoiningDate  MonthlySalary  \
 0        1001  Casey Martinez   50         HR  2019-03-01          10400   
 1        1002    Taylor Smith   36       Tech  2021-10-25           6170   
 2        1003    Taylor Smith   29         HR  2016-07-17          12874   
 3        1004     Reese Clark   42       Tech  2018-07-15           5255   
 4        1005    Morgan Smith   40      Sales  2024-10-21           4154   
 
    PerformanceScore  
 0                 8  
 1                 2  
 2                 6  
 3                 7  
 4                 2  ,
    TransactionID  EmployeeID     Product  QuantitySold  SaleAmount  \
 0           2001        1048  Headphones             4        1888   
 1           2002        1085     Monitor             2        1326   
 2           2003        1039      Tablet             5        2965   
 3           2004        1100      Laptop             9        5742   
 4           2005        1033      Tablet             1         768   
 
    TransactionDate  
 0       2023-09-18  
 1       2023-09-18  
 2       2023-12-16  
 3       2023-12-18  
 4       2023-01-08  )

In [33]:
employee_df.head()

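|   | EmployeeID | Name           | Age | Department | JoiningDate | MonthlySalary | PerformanceScore |
|---|------------|----------------|-----|------------|-------------|---------------|------------------|
| 0 | 1001       | Casey Martinez | 50  | HR         | 2019-03-01  | 10400         | 8                |
| 1 | 1002       | Taylor Smith   | 36  | Tech       | 2021-10-25  | 6170          | 2                |
| 2 | 1003       | Taylor Smith   | 29  | HR         | 2016-07-17  | 12874         | 6                |
| 3 | 1004       | Reese Clark    | 42  | Tech       | 2018-07-15  | 5255          | 7                |
| 4 | 1005       | Morgan Smith   | 40  | Sales      | 2024-10-21  | 4154          | 2                |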


In [34]:
sales_df.head()

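|   | TransactionID | EmployeeID | Product    | QuantitySold | SaleAmount | TransactionDate |
|---|---------------|------------|------------|--------------|------------|-----------------|
| 0 | 2001          | 1048       | Headphones | 4            | 1888       | 2023-09-18      |
| 1 | 2002          | 1085       | Monitor    | 2            | 1326       | 2023-09-18      |
| 2 | 2003          | 1039       | Tablet     | 5            | 2965       | 2023-12-16      |
| 3 | 2004          | 1100       | Laptop     | 9            | 5742       | 2023-12-18      |
| 4 | 2005          | 1033       | Tablet     | 1            | 768        | 2023-01-08      |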
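
The generation code above draws ages, salaries, and quantities with `np.random.randint`, which samples from the half-open interval `[low, high)`. The sketch below illustrates the consequence for `QuantitySold`; the seed and the sample size are arbitrary choices made for this illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, for illustration only

# randint(low, high) excludes high, so these values lie in 1..9
quantities = np.random.randint(1, 10, size=10_000)
print(quantities.min(), quantities.max())

# To allow 10 as a value, pass high=11
quantities_incl = np.random.randint(1, 11, size=10_000)
print(quantities_incl.min(), quantities_incl.max())
```

Note that changing the bounds in the generation cell would also change every value drawn afterwards, because all the draws share one seeded random stream.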


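**Computing `total_sales` with a custom function and `.apply()`**

We define a custom function and pass it to `.apply()` with `axis=1`, so it runs once per row and multiplies `QuantitySold` by `SaleAmount`. Note that `SaleAmount` was generated as quantity times a unit price and is described above as the total amount of the sale, so this product counts the quantity twice. Treat `total_sales` as an exercise in row-wise `apply` rather than as a revenue figure.

In [35]:
def calculate_sales(row):
    # Row-wise product of the two columns; called once per row by apply(axis=1)
    return row['QuantitySold'] * row['SaleAmount']


sales_df['total_sales'] = sales_df.apply(calculate_sales, axis=1)
sales_df.head()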

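|   | TransactionID | EmployeeID | Product    | QuantitySold | SaleAmount | TransactionDate | total_sales |
|---|---------------|------------|------------|--------------|------------|-----------------|-------------|
| 0 | 2001          | 1048       | Headphones | 4            | 1888       | 2023-09-18      | 7552        |
| 1 | 2002          | 1085       | Monitor    | 2            | 1326       | 2023-09-18      | 2652        |
| 2 | 2003          | 1039       | Tablet     | 5            | 2965       | 2023-12-16      | 14825       |
| 3 | 2004          | 1100       | Laptop     | 9            | 5742       | 2023-12-18      | 51678       |
| 4 | 2005          | 1033       | Tablet     | 1            | 768        | 2023-01-08      | 768         |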
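
For the task listed earlier, the total sales made by each employee, one possible approach is to sum `SaleAmount` per `EmployeeID` with `groupby`. This is a sketch rather than the notebook's own method: the frame below reuses three rows from the output above and adds one hypothetical transaction (ID 2006) so that an employee appears twice.

```python
import pandas as pd

# Three rows from sales_df plus one hypothetical row (2006) for employee 1048
sales = pd.DataFrame({
    "TransactionID": [2001, 2002, 2003, 2006],
    "EmployeeID": [1048, 1085, 1039, 1048],
    "SaleAmount": [1888, 1326, 2965, 500],
})

# SaleAmount is already the per-transaction total, so a plain sum is enough
sales_per_employee = (
    sales.groupby("EmployeeID")["SaleAmount"]
    .sum()
    .rename("TotalSales")
    .reset_index()
)
print(sales_per_employee)
```

`groupby` sorts the group keys by default, so the result is ordered by `EmployeeID`.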


**Categorizing `PerformanceScore` with `apply` and a `lambda`**

In [36]:
# Chained conditional expressions are evaluated left to right:
# 8-10 -> Excellent, 5-7 -> Good, 3-4 -> Average, 1-2 -> Poor
employee_df['Performance_category'] = employee_df['PerformanceScore'].apply(
    lambda x: "Excellent" if x >= 8 else "Good" if x >= 5 else "Average" if x >= 3 else "Poor"
)
employee_df.head()

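|   | EmployeeID | Name           | Age | Department | JoiningDate | MonthlySalary | PerformanceScore | Performance_category |
|---|------------|----------------|-----|------------|-------------|---------------|------------------|----------------------|
| 0 | 1001       | Casey Martinez | 50  | HR         | 2019-03-01  | 10400         | 8                | Excellent            |
| 1 | 1002       | Taylor Smith   | 36  | Tech       | 2021-10-25  | 6170          | 2                | Poor                 |
| 2 | 1003       | Taylor Smith   | 29  | HR         | 2016-07-17  | 12874         | 6                | Good                 |
| 3 | 1004       | Reese Clark    | 42  | Tech       | 2018-07-15  | 5255          | 7                | Good                 |
| 4 | 1005       | Morgan Smith   | 40  | Sales      | 2024-10-21  | 4154          | 2                | Poor                 |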
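
When the chained conditional becomes hard to read, the same thresholds can be written as a named function or expressed as bins with `pd.cut`. The sketch below assumes integer scores from 1 to 10; the sample scores are illustrative values, not notebook output.

```python
import pandas as pd

scores = pd.Series([8, 2, 6, 7, 2, 3, 5, 10, 1], name="PerformanceScore")

def categorize(score):
    """Same thresholds as the lambda: >=8, >=5, >=3, otherwise Poor."""
    if score >= 8:
        return "Excellent"
    if score >= 5:
        return "Good"
    if score >= 3:
        return "Average"
    return "Poor"

via_apply = scores.apply(categorize)

# Equivalent binning for integer scores; intervals are right-inclusive:
# (0, 2] Poor, (2, 4] Average, (4, 7] Good, (7, 10] Excellent
via_cut = pd.cut(
    scores,
    bins=[0, 2, 4, 7, 10],
    labels=["Poor", "Average", "Good", "Excellent"],
)

print(pd.DataFrame({"score": scores, "apply": via_apply, "cut": via_cut}))
```

`pd.cut` returns an ordered categorical column, which can be convenient for sorting or grouping by category later.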


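**Calculating `Year_since_joining` with the `.dt` accessor**

In [37]:
# Timestamp for the moment the cell runs, so results depend on the run date
current_date = pd.to_datetime('today')

# Whole years elapsed, approximating a year as 365 days (leap days are ignored)
employee_df['Year_since_joining'] = (current_date - employee_df['JoiningDate']).dt.days // 365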

In [38]:
employee_df.head()

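|   | EmployeeID | Name           | Age | Department | JoiningDate | MonthlySalary | PerformanceScore | Performance_category | Year_since_joining |
|---|------------|----------------|-----|------------|-------------|---------------|------------------|----------------------|--------------------|
| 0 | 1001       | Casey Martinez | 50  | HR         | 2019-03-01  | 10400         | 8                | Excellent            | 5                  |
| 1 | 1002       | Taylor Smith   | 36  | Tech       | 2021-10-25  | 6170          | 2                | Poor                 | 3                  |
| 2 | 1003       | Taylor Smith   | 29  | HR         | 2016-07-17  | 12874         | 6                | Good                 | 8                  |
| 3 | 1004       | Reese Clark    | 42  | Tech       | 2018-07-15  | 5255          | 7                | Good                 | 6                  |
| 4 | 1005       | Morgan Smith   | 40  | Sales      | 2024-10-21  | 4154          | 2                | Poor                 | 0                  |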
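
The `.dt` accessor also exposes calendar components directly, which covers the date-feature extraction task listed earlier. A minimal sketch, using the first three joining dates from the output above; the new column names are illustrative choices.

```python
import pandas as pd

joining = pd.DataFrame({
    "JoiningDate": pd.to_datetime(["2019-03-01", "2021-10-25", "2016-07-17"]),
})

# Calendar components extracted element-wise through the .dt accessor
joining["JoinYear"] = joining["JoiningDate"].dt.year
joining["JoinMonth"] = joining["JoiningDate"].dt.month
joining["JoinWeekday"] = joining["JoiningDate"].dt.day_name()

print(joining)
```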


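**Calculating `DaysSinceTransaction` with the `.dt` accessor**

In [43]:
# Reuses current_date from the earlier cell
sales_df['DaysSinceTransaction'] = (current_date - sales_df['TransactionDate']).dt.days
sales_df.head()

|   | TransactionID | EmployeeID | Product    | QuantitySold | SaleAmount | TransactionDate | total_sales | DaysSinceTransaction |
|---|---------------|------------|------------|--------------|------------|-----------------|-------------|----------------------|
| 0 | 2001          | 1048       | Headphones | 4            | 1888       | 2023-09-18      | 7552        | 422                  |
| 1 | 2002          | 1085       | Monitor    | 2            | 1326       | 2023-09-18      | 2652        | 422                  |
| 2 | 2003          | 1039       | Tablet     | 5            | 2965       | 2023-12-16      | 14825       | 333                  |
| 3 | 2004          | 1100       | Laptop     | 9            | 5742       | 2023-12-18      | 51678       | 331                  |
| 4 | 2005          | 1033       | Tablet     | 1            | 768        | 2023-01-08      | 768         | 675                  |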
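
Because `pd.to_datetime('today')` changes with every run, the day counts above will not reproduce on a later date. One option is to pin a reference date. The sketch below uses 2024-11-13, which is inferred from the day counts in the output above and is not stated anywhere in the notebook.

```python
import pandas as pd

# Transaction dates from the first five rows of sales_df
sales = pd.DataFrame({
    "TransactionDate": pd.to_datetime(
        ["2023-09-18", "2023-09-18", "2023-12-16", "2023-12-18", "2023-01-08"]
    ),
})

# Fixed reference date (an assumption inferred from the outputs above)
reference_date = pd.Timestamp("2024-11-13")

sales["DaysSinceTransaction"] = (reference_date - sales["TransactionDate"]).dt.days
print(sales)
```

With a pinned date the column becomes reproducible; switching back to `pd.Timestamp.today().normalize()` gives a value that moves with the calendar.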
