## map

`map` is a built-in Python function that applies a given function to every item of an iterable (e.g., list, tuple) and returns a new iterable (a map object).

`map(function, iterable, *iterables)`
- `function`: The function you want to apply to each item in the iterable. It can be a built-in function, a user-defined function, or even an anonymous function (lambda).
- `iterable`: The iterable (e.g., list, tuple) whose elements will be processed. Additional iterables can be passed when `function` takes more than one argument.

The `map` function doesn't return a list directly—it returns a `map` object, which you can convert into a list or other data structures as needed.
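
A practical consequence of getting a `map` object rather than a list is that it can be consumed only once. The short sketch below (variable names are illustrative) shows a second pass over the same object yielding nothing:

```python
numbers = [1, 2, 3]
doubled = map(lambda x: x * 2, numbers)

first_pass = list(doubled)   # consumes the iterator
second_pass = list(doubled)  # nothing left to yield

print(first_pass)   # [2, 4, 6]
print(second_pass)  # []
```

If the results are needed more than once, convert to a list first and reuse the list.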

### Converting Data Types

In [None]:
numbers = [1, 2, 3, 4, 5]
str_numbers = map(str, numbers)
print(list(str_numbers))

### Applying a Custom Function

In [None]:
def square(x):
    return x ** 2

numbers = [1, 2, 3, 4, 5]

squares = map(square, numbers)
print(list(squares))


### Using a Lambda Function

In [None]:
numbers = [1, 2, 3, 4, 5]

squares = map(lambda x: x**2, numbers)
print(list(squares))

### Processing Multiple Iterables

In [None]:
list1 = [1, 2, 3]
list2 = [4, 5, 6]

sums = map(lambda x, y: x + y, list1, list2)
print(list(sums))
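
When the iterables differ in length, `map` stops as soon as the shortest one is exhausted. A small illustration with made-up prices and quantities:

```python
prices = [10.0, 20.0, 30.0]
quantities = [2, 3]  # one element shorter than prices

# The third price is silently ignored because quantities runs out first.
totals = list(map(lambda p, q: p * q, prices, quantities))
print(totals)  # [20.0, 60.0]
```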

### Data Cleaning

#### Removing leading and trailing spaces from strings

In [None]:
data = ["  Hello ", " World ", "Python  "]

cleaned_data = map(str.strip, data)
print(list(cleaned_data))

In Python, `strip` is not a standalone function; it's a method that belongs to string objects. That means you normally call `.strip()` on a string instance (e.g., `"hello".strip()`), and it doesn't exist as a global function like `len()` or `print()`.

`str.strip` refers to the `strip` method looked up on the `str` class itself rather than on an instance. Accessed this way, it is an ordinary callable that takes the string as its first argument, so `str.strip(s)` is equivalent to `s.strip()`. That is why passing `str.strip` to `map` works: `map` calls it once with each string in `data`.
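
The equivalence can be checked directly. The second half of this sketch shows one way to pass an extra argument, which `map(str.strip, ...)` alone cannot do; the sample strings are illustrative:

```python
text = "  Hello "

# Calling the method through the class requires passing the instance explicitly.
via_class = str.strip(text)
via_instance = text.strip()
print(via_class == via_instance)  # True

# A lambda can fix extra arguments, here the characters to strip.
tags = ["--alpha--", "-beta", "gamma-"]
stripped = list(map(lambda s: s.strip("-"), tags))
print(stripped)  # ['alpha', 'beta', 'gamma']
```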

**What If Data Contains Non-Strings?**
If your data contains non-string elements, `str.strip` raises a `TypeError` because it only accepts strings. To handle this, convert each element with `str()` before stripping. Note that this turns `None` into the string `'None'`, which may need separate handling:

In [None]:
data = ["  hello  ", None, 42]
result = map(lambda x: str(x).strip(), data)  # Convert to string, then strip
print(list(result))

#### Formatting dates from YYYYMMDD to YYYY-MM-DD

In [None]:
dates = [20231222, 20240101, 20240214]

formatted_dates = map(lambda d: f"{str(d)[:4]}-{str(d)[4:6]}-{str(d)[6:]}", dates)
print(list(formatted_dates))
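
String slicing reformats any eight-digit value, including impossible dates. As an alternative sketch, the standard library's `datetime.strptime` can do the same conversion while rejecting invalid input; the helper name `format_yyyymmdd` is hypothetical:

```python
from datetime import datetime

def format_yyyymmdd(value):
    """Parse an integer or string in YYYYMMDD form and return 'YYYY-MM-DD'."""
    return datetime.strptime(str(value), "%Y%m%d").strftime("%Y-%m-%d")

dates = [20231222, 20240101, 20240214]
formatted = list(map(format_yyyymmdd, dates))
print(formatted)  # ['2023-12-22', '2024-01-01', '2024-02-14']
```

A value such as `20240230` raises a `ValueError` here instead of producing a malformed date string.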

#### Replacing None with 0 in a list

In [None]:
data = [1, 2, None, 4, None, 6]
cleaned_data = map(lambda x: 0 if x is None else x, data)
print(list(cleaned_data)) 


## Deep Copy

**Why Create a Deep Copy?**
By default, if you assign one DataFrame to another, both will share the same underlying data. This means any changes you make to the new DataFrame will also reflect in the original DataFrame.

In [None]:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = df1  # Not a copy; df2 is just a reference to df1

df2['A'] = [10, 20, 30]  # Changes df2
print(df1)  # df1 is also changed
# Output:
#     A  B
# 0  10  4
# 1  20  5
# 2  30  6


To avoid this, you need to create a copy.



### Using `copy()` for a Deep Copy

Use the `.copy()` method to create a new DataFrame that is completely independent of the original.

In [None]:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = df1.copy()  # Creates an independent copy

df2['A'] = [10, 20, 30]  # Changes df2
print(df1)  # df1 remains unchanged
# Output:
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6
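
The difference between an alias and a copy can also be checked with Python's identity operator. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

original = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
alias = original               # same object under a second name
independent = original.copy()  # deep=True is the default

print(alias is original)        # True
print(independent is original)  # False

# Editing the copy leaves the original untouched.
independent.loc[0, "A"] = 99
print(original.loc[0, "A"])     # 1
```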


### Assign Only a Subset

If you only want a subset of the original DataFrame, you can still use `.copy()` to ensure independence:

In [None]:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = df1[['A']].copy()  # Copy only column 'A'

df2['A'] = [10, 20, 30]  # Changes df2
print(df1)  # df1 remains unchanged
# Output:
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6

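**What Happens Without `.copy()`?**
Without `.copy()`, whether a selection shares data with the original depends on the operation and the pandas version. Selecting a list of columns, as in `df1[['A']]`, returns a new DataFrame, so assigning to it does not modify `df1`. However, pandas versions without Copy-on-Write enabled may emit a `SettingWithCopyWarning` here because the intent is ambiguous. Calling `.copy()` makes the independence explicit and avoids the warning.

In [None]:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = df1[['A']]  # Column selection with a list; no explicit copy

df2['A'] = [10, 20, 30]  # May emit SettingWithCopyWarning; df1 is not modified
print(df1)
# Output:
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6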

## np.bincount 

`np.bincount` is a NumPy function that counts the occurrences of each integer value in an array of non-negative integers. It's a fast and efficient way to compute frequency distributions for integers.

`np.bincount(x, weights=None, minlength=0)`
- x: Input array of non-negative integers.
    - This array represents the values for which you want to count occurrences.
- weights (optional): Array of the same length as x that assigns weights to each value.
    - If provided, instead of counting occurrences, it computes the sum of weights for each unique integer.
- minlength (optional): Minimum length of the output array.
    - If `max(x) + 1` is smaller than `minlength`, the output array is padded with zeros up to length `minlength`.


### Count Occurrences

In [None]:
import numpy as np

# Input array
x = [0, 1, 1, 2, 2, 2, 3]

# Count occurrences
counts = np.bincount(x)
print(counts)


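The output is `[1 2 3 1]`: position `i` of the result holds the number of times the value `i` occurs in `x`.

- Value 0 appears 1 time.
- Value 1 appears 2 times.
- Value 2 appears 3 times.
- Value 3 appears 1 time.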

### Add Weights

In [None]:
# Input array
x = [0, 1, 1, 2, 2, 2, 3]

# Weights for each element in x
weights = [1, 0.5, 0.5, 2, 2, 2, 3]

# Weighted counts
weighted_counts = np.bincount(x, weights=weights)
print(weighted_counts)


- Index 0: Weighted sum = 1
- Index 1: Weighted sum = 0.5 + 0.5 = 1
- Index 2: Weighted sum = 2 + 2 + 2 = 6
- Index 3: Weighted sum = 3

### Using minlength

In [None]:
x = [0, 1, 1, 2]

# Count occurrences with minlength=5
counts = np.bincount(x, minlength=5)
print(counts)


The output array is padded with zeros to ensure a length of 5.
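
Because `np.bincount` only accepts non-negative integers, it raises a `ValueError` on negative input. For such data, `np.unique` with `return_counts=True` is one alternative; the sketch below uses made-up values:

```python
import numpy as np

values = np.array([-2, -2, 0, 5, 5, 5])

# np.unique returns the sorted distinct values and how often each occurs.
unique_values, counts = np.unique(values, return_counts=True)
print(unique_values)  # [-2  0  5]
print(counts)         # [2 1 3]
```

Unlike `np.bincount`, the result has no zero-filled slots for values that never occur.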

## Benford's Law

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate a dataset (e.g., population data)
data = np.random.lognormal(mean=2, sigma=1, size=1000)

In [None]:
data[:10]

In [None]:
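# Extract the first significant digit of each value.
# Scientific notation handles values below 1 (e.g., 0.05 -> '5.000...e-02' -> 5),
# which int(x) would truncate to 0.
first_digits = [int(f"{x:.15e}"[0]) for x in data if x > 0]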

In [None]:
# Compute observed distribution
observed = np.bincount(first_digits, minlength=10)[1:]  # Exclude 0
observed = observed / sum(observed)

In [None]:
# Expected distribution from Benford's Law
expected = [np.log10(1 + 1/d) for d in range(1, 10)]
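
For reference, the list comprehension above evaluates the standard statement of Benford's Law for a leading digit $d$:

```latex
P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d \in \{1, 2, \dots, 9\}
```

This gives $P(1) \approx 0.301$ and $P(9) \approx 0.046$, and the nine probabilities sum to $\log_{10} 10 = 1$.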

In [None]:
plt.bar(range(1, 10), observed, alpha=0.7, label="Observed")
plt.plot(range(1, 10), expected, 'ro-', label="Expected (Benford's Law)")
plt.xlabel("First Digit")
plt.ylabel("Proportion")
plt.legend()
plt.show()


**Quantify Deviations**
Use statistical tests to determine if the deviations are significant.

- Chi-Square Goodness-of-Fit Test:
    - Tests whether the observed distribution matches the expected distribution.
- Kolmogorov-Smirnov Test:
    - Compares the cumulative distribution of observed and expected frequencies.
- MAD (Mean Absolute Deviation):
    - Measures the average deviation of observed frequencies from expected values.
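
Of the three measures listed, MAD is the simplest to compute by hand. The sketch below is one possible implementation, not a reference one: the helper names `first_digit` and `benford_mad` are hypothetical, and the powers of two serve only as a deterministic sample known to follow Benford's Law closely.

```python
import numpy as np

def first_digit(value):
    """Leading non-zero digit of a positive number, read from scientific notation."""
    return int(f"{value:.15e}"[0])

def benford_mad(values):
    """Mean absolute deviation between observed first-digit proportions and Benford's Law."""
    digits = [first_digit(v) for v in values if v > 0]
    counts = np.bincount(digits, minlength=10)[1:]   # drop the unused slot for digit 0
    observed = counts / counts.sum()
    expected = np.log10(1 + 1 / np.arange(1, 10))
    return float(np.mean(np.abs(observed - expected)))

powers_of_two = [2.0 ** n for n in range(1000)]
mad = benford_mad(powers_of_two)
print(f"MAD: {mad:.4f}")
```

A MAD close to zero indicates close agreement; what counts as "too large" depends on the sample size and the application.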

In [None]:
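# Chi-Square Goodness-of-Fit Test:
from scipy.stats import chisquare

# The test needs raw counts, not proportions, so recompute the digit counts.
observed_counts = np.bincount(first_digits, minlength=10)[1:]
expected_counts = np.array(expected) * observed_counts.sum()

# Perform chi-square test
chi_stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
print("Chi-Square Statistic:", chi_stat)
print("P-Value:", p_value)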


**Interpret the Results**
- Expected Outcome
    - If the data follows Benford's Law, the observed distribution of first digits will closely match the expected distribution.

- Significant deviations might indicate
    - Fraudulent manipulation (e.g., fabricated numbers).
    - Data entry errors.
    - Specific systematic issues.


**Use Case: Expense Report Fraud**
A company suspects that employees might be falsifying expense reports. You can use Benford's Law to check whether **the first digits of expense amounts deviate significantly from the expected distribution**.

## Dynamic smoothing

Dynamic smoothing is a technique used to adjust data values dynamically based on specific criteria or conditions. Unlike static smoothing, which applies a fixed formula uniformly to all data points, dynamic smoothing adapts the smoothing intensity depending on context, position, or other variables. The example in this section uses a sigmoid (logistic) weight to blend each raw value with a baseline.

**Use Cases of Sigmoid Smoothing**
- Time-Series Data:
    - Smooth time-series values so that points on one side of a transition point (`nmid`) stay close to the raw data while points on the other side are pulled toward a baseline.
    - Example: Smoothing temperature data around specific days.
- Seasonal Adjustment:
    - Gradually smooth noisy data over time to reveal trends while respecting seasonality.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Simulate a dataset
np.random.seed(42)
x = np.arange(1, 101)  # Context variable: a sequence of 100 points (e.g., position, time, or sequence index)
y_dow = 10 + 2 * np.sin(x / 10) + np.random.normal(scale=0.5, size=len(x))  # Original data with noise
y_avg = np.mean(y_dow)  # Baseline or reference value (e.g., mean)

# Sigmoid smoothing parameters
nmid = 50  # Transition center: at x == nmid the sigmoid weight is exactly 0.5
c = 10  # Transition width: a smaller c makes the transition steeper

# Apply smoothing formula: the sigmoid weight goes from ~0 (x << nmid, value pulled to y_avg)
# to ~1 (x >> nmid, raw value kept)
y_dow_smooth = y_avg + (y_dow - y_avg) / (1 + np.exp(-(x - nmid) / c))

# Plot the original and smoothed data
plt.figure(figsize=(12, 6))
plt.plot(x, y_dow, label="Original Data (y_dow)", alpha=0.7)
plt.plot(x, y_dow_smooth, label="Smoothed Data (y_dow_smooth)", linewidth=2, color="orange")
plt.axhline(y_avg, color="green", linestyle="--", label="Baseline (y_avg)")
plt.axvline(nmid, color="red", linestyle="--", label="Smoothing Center (nmid)")
plt.xlabel("Index (num)")
plt.ylabel("Value")
plt.title("Sigmoid-Based Smoothing Demonstration")
plt.legend()
plt.grid()
plt.show()
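
To make the role of `nmid` and `c` concrete, the sigmoid factor in the formula can be evaluated on its own. This sketch wraps it in a hypothetical helper, `sigmoid_weight`, and prints the weight at the start, center, and end of the range used above:

```python
import numpy as np

def sigmoid_weight(x, nmid, c):
    """Fraction of the raw deviation from the baseline that is kept at position x."""
    return 1.0 / (1.0 + np.exp(-(x - nmid) / c))

nmid, c = 50, 10
for position in (1, 50, 100):
    # Prints roughly 0.007, 0.5, and 0.993 respectively.
    print(position, round(float(sigmoid_weight(position, nmid, c)), 3))
```

Points well below `nmid` are therefore pulled almost entirely to the baseline, points well above it keep nearly all of their raw value, and the weight is exactly 0.5 at `nmid`.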


**Reducing Noise in Sparse or Uneven Data**
- Problem: Some days of the week may have insufficient data, leading to unreliable or overly noisy target encoding values.
    - For example, if Monday has only a few observations, its target mean might not represent its true pattern.
- Solution: Sigmoid smoothing balances the raw target mean (from DOW-specific data) with a baseline (global average). This ensures that sparse or noisy days are smoothed toward the baseline while still reflecting the observed trends.
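
One way to realize this, assumed here rather than stated in the text, is to feed the number of observations per day into the sigmoid so that sparsely observed days get a low weight. All counts, rates, and parameter values below are made up for illustration:

```python
import numpy as np

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
counts = np.array([5, 400, 380, 420, 390, 150, 20])              # hypothetical records per day
raw_rate = np.array([0.20, 0.03, 0.04, 0.03, 0.05, 0.08, 0.15])  # hypothetical raw target means

# Global baseline: count-weighted mean of the per-day rates.
global_rate = np.average(raw_rate, weights=counts)

nmid, c = 50, 10  # hypothetical: trust the raw rate once a day has well over 50 records
weight = 1.0 / (1.0 + np.exp(-(counts - nmid) / c))
smoothed = global_rate + (raw_rate - global_rate) * weight

for day, n, raw, s in zip(days, counts.tolist(), raw_rate, smoothed):
    print(f"{day}: n={n:3d} raw={raw:.3f} smoothed={s:.3f}")
```

With these numbers, Monday (5 records) lands almost exactly on the global rate, while Tuesday (400 records) keeps its raw rate.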

**Handling Edge Cases or Outliers**
- Problem: Certain days might have extreme target values due to outliers or anomalies.
    - For example, a one-time spike in transactions on Sunday might inflate its target encoding value.
- Solution: Sigmoid smoothing down-weights the influence of extreme deviations by blending the raw value with the global average. The gradual transition reduces the risk of overfitting to anomalies.


**Dynamic Smoothing Based on DOW Position**
- Problem: Some days may be more important (e.g., weekends for retail sales) and need to retain more of their raw target behavior, while others (e.g., midweek) might require stronger smoothing.
- Solution: The sigmoid function adjusts the smoothing intensity dynamically:
    - Days whose index lies well above the reference point `nmid` retain most of their raw target values.
    - Days whose index lies well below `nmid` are smoothed heavily toward the global baseline.
    - At `nmid` itself, the raw deviation from the baseline is halved.

**Preventing Overfitting**
- Problem: When using raw target encoding, the model might overfit to small variations in target values for each DOW.
    - For instance, it might assume Sunday is always "special" because of a high target mean, even if the sample size is small.
- Solution: Sigmoid smoothing regularizes the target encoding, ensuring the model generalizes better by reducing reliance on overly specific DOW trends.

**Preserving Global and Local Patterns**
- Problem: Balancing the global trend (e.g., overall target average) with local patterns (e.g., DOW-specific means) can be tricky.
- Solution: Sigmoid smoothing provides a flexible way to preserve both:
    - Raw DOW-specific target means highlight local patterns.
    - The global average acts as a stabilizing baseline for days with less data or extreme values.

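Suppose you are encoding DOW fraud probabilities, and your data shows:

- Weekends (Saturday/Sunday) have high variance.
- Weekdays (Monday-Friday) have low variance.

A possible parameter choice, assuming days are indexed Monday = 1 through Sunday = 7:

- `nmid=6`: Centers the transition on Saturday, as weekends are more critical.
- `c=10`: A gradual transition that spreads the change across the whole week; on a 1-7 index the sigmoid weight varies only from about 0.38 (Monday) to about 0.52 (Sunday).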

## OOT Data

### What is OOT Data?
- Definition:
    - OOT (out-of-time) data is a dataset reserved exclusively for testing or validating a model after it has been trained, typically drawn from a time period later than the training data.
    - It is never used during the training phase, so it shows whether the model generalizes to unseen data.
- Difference from Test Data:
    - OOT data is sometimes used interchangeably with "test data," but in some cases, it refers to a completely independent dataset collected after the initial training/testing split, such as data from a future time period or a different geographic region.

### Why is OOT Data Important?
- Exposes Overfitting:
    - If the same dataset is used for both training and evaluation, an overfit model goes unnoticed because its performance metrics look overly optimistic.
- Tests Generalization:
    - OOT data provides a realistic test of the model's ability to perform well on new, unseen data.
- Validates Real-World Performance:
    - OOT data often mimics real-world conditions or future distributions, making it ideal for assessing how the model performs in production.
- Detects Data Drift:
    - By comparing OOT data to training data, you can identify shifts in data distributions (e.g., seasonality changes, user behavior shifts).


### How is OOT Data Used?
- Model Validation:
    - Once the model is trained and evaluated on a validation set, OOT data is used to provide a final unbiased estimate of its performance.
    - Example: Splitting data into training, validation, and OOT sets.
- Monitoring Performance in Production:
    - OOT data can simulate real-world data to evaluate whether the model's predictions remain accurate when applied to unseen data.
- Model Robustness Testing:
    - OOT data can help verify how well the model handles different scenarios or distributions, such as:
        - Data **from a different time period**.
        - Data **from a different region or demographic**.
- Fraud Detection:
    - OOT data is often used in fraud detection models to test **whether the model can identify new fraudulent patterns that were not present in the training data**.

### How to Create OOT Data?
- Time-Based Splitting:
    - Use a cutoff date to reserve future data for OOT.
    - Example:
        - **Training: January–June**.
        - **OOT: July–December**.
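- Random Sampling:
    - Randomly sample a portion of the dataset (e.g., 20%) as a holdout. Because the sample comes from the same period as the training data, it tests generalization but not robustness over time.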
- Scenario-Based Sampling:
    - Create OOT datasets that reflect specific conditions (e.g., **different regions**, **user groups**, or event periods).
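
The time-based split can be sketched with pandas as follows. The DataFrame, its column names, and the cutoff date are all hypothetical stand-ins for a real transaction log:

```python
import pandas as pd

# Hypothetical transaction log: one row per day across 2023.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", "2023-12-31", freq="D"),
})
df["amount"] = range(len(df))  # placeholder feature

cutoff = pd.Timestamp("2023-07-01")  # assumed cutoff: July onward is out-of-time
train = df[df["date"] < cutoff].copy()
oot = df[df["date"] >= cutoff].copy()

print(len(train), len(oot))  # 181 184
```

Every OOT row is strictly later than every training row, which is the property a random split does not provide.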

## np.argsort

`numpy.argsort` is a function in the NumPy library that returns the indices that would sort an array. It is useful when you need to know the order of elements in an array without modifying the original array.

- a: The input array.
- axis:
    - Axis along which to sort.
    - Default is -1 (last axis).
    - Use None to flatten the array before sorting.
- kind:
    - Sorting algorithm: "quicksort", "mergesort", "heapsort", or "stable".
    - Default: "quicksort".
- order: If a is a structured array, this is the field(s) to sort by.
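
The `kind` and `axis` parameters are easiest to see on small inputs. The arrays in this sketch are illustrative:

```python
import numpy as np

# Ties: a stable sort keeps equal elements in their original order.
a = np.array([2, 1, 2, 1])
stable_order = np.argsort(a, kind="stable")
print(stable_order)  # [1 3 0 2]

# 2-D input: axis=1 sorts each row independently, axis=None flattens first.
m = np.array([[3, 1, 2],
              [9, 7, 8]])
row_order = np.argsort(m, axis=1)      # each row becomes [1 2 0]
flat_order = np.argsort(m, axis=None)
print(row_order)
print(flat_order)  # [1 2 0 4 5 3]
```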


In [None]:
import numpy as np

# Example array
a = np.array([10, 3, 7, 1])

# Get indices to sort the array
sorted_indices = np.argsort(a)
print(sorted_indices)  # Output: [3 1 2 0]

# Use indices to sort the array
sorted_array = a[sorted_indices]
print(sorted_array)  # Output: [ 1  3  7 10]
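
A common use of these indices is reordering a second, parallel array, for example to pick the top-scoring items. The `scores` and `names` arrays below are made up:

```python
import numpy as np

scores = np.array([10, 3, 7, 1])
names = np.array(["a", "b", "c", "d"])

descending = np.argsort(scores)[::-1]  # indices from largest to smallest score
top2 = descending[:2]
print(names[top2])   # ['a' 'c']
print(scores[top2])  # [10  7]
```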
