# 3. Conditional Transformations

Conditional transformations modify values only where a stated condition holds, for example flagging rows above a threshold or relabeling a category. They are a common way to derive features and clean data before analysis and modeling.

## Concept

### Modifying Data Values Conditionally
- Conditional transformations let you adjust values in a dataset based on logical conditions.
- They are widely used for: 
  - Creating new categorical variables.
  - Handling outliers or special cases.
  - Applying business rules to customize data.

## Topics to Cover
### 1. Using `numpy.where()` for Conditional Value Assignments
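- A vectorized function: `np.where(condition, x, y)` returns `x` where the condition is true and `y` elsewhere.
- Evaluates the whole column in one pass instead of looping in Python, so it scales well to large datasets.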

### 2. Using Pandas `.apply()` with Custom Conditional Logic
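- Calls a user-defined function on each row (`axis=1`) or each column (`axis=0`) of a DataFrame.
- Suited to logic that is hard to express with vectorized operations. It runs a Python function once per row or column, so it is slower than `np.where()` on large data.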

### 3. Updating Values Conditionally Within a Column
- Assign through `.loc[condition, column]` to overwrite only the rows that match a condition.
- Useful for simple in-place updates such as relabeling a category or capping a value.



In [None]:
import pandas as pd
import numpy as np

# Load a sample dataset
data_path = '../DataSets/Data_COVID19_Indonesia.csv'
covid_data = pd.read_csv(data_path)
print('Dataset Preview:')
print(covid_data.head())

### Using `numpy.where()` for Conditional Value Assignments

Suppose we want to create a new column indicating whether the 'New Cases' value in each row exceeds 1000.


In [None]:
# Add a column indicating high case counts
covid_data['High Cases'] = np.where(covid_data['New Cases'] > 1000, 'Yes', 'No')
print('Data with High Cases Column:')
print(covid_data[['Date', 'New Cases', 'High Cases']].head())
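
One detail worth checking with `np.where()` is how missing values are treated. The sketch below uses a small made-up frame (the values and the 'High Cases (NaN kept)' column name are illustrative, not taken from the dataset) to show that a comparison against `NaN` is false, so a missing count is silently labelled 'No' unless it is masked afterwards.

```python
import numpy as np
import pandas as pd

# Toy data with one missing count (illustrative values only)
toy = pd.DataFrame({'New Cases': [250.0, 1500.0, np.nan, 1000.0]})

# NaN > 1000 evaluates to False, so the missing row is labelled 'No'
toy['High Cases'] = np.where(toy['New Cases'] > 1000, 'Yes', 'No')

# One possible way to keep missing inputs visible:
# blank out the label wherever the source value is missing
toy['High Cases (NaN kept)'] = toy['High Cases'].where(toy['New Cases'].notna())

print(toy)
```

Whether a missing count should read as 'No' or stay missing depends on the analysis, so treat the masking step as one option rather than a rule.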

### Using Pandas `.apply()` with Custom Conditional Logic

You can use `.apply()` to apply a function row-wise or column-wise for more complex conditional logic.

Suppose we want to classify rows based on both 'New Cases' and 'New Deaths'.


In [None]:
# Define a custom function
def classify_risk(row):
    if row['New Cases'] > 1000 and row['New Deaths'] > 50:
        return 'High Risk'
    elif row['New Cases'] > 500:
        return 'Moderate Risk'
    else:
        return 'Low Risk'

# Apply the function to the DataFrame
covid_data['Risk Level'] = covid_data.apply(classify_risk, axis=1)
print('Data with Risk Level Column:')
print(covid_data[['Date', 'New Cases', 'New Deaths', 'Risk Level']].head())
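
The same classification can often be written without a row-wise function. The following is a minimal sketch using `np.select()`, assuming the same thresholds as `classify_risk` and a small made-up frame in place of the real dataset. Conditions are evaluated in list order and the first match wins, which mirrors the `if`/`elif` chain.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'New Cases': [1200, 800, 300, 1500],
    'New Deaths': [60, 10, 5, 20],
})

# Conditions are checked in order; the first one that is True decides the label
conditions = [
    (toy['New Cases'] > 1000) & (toy['New Deaths'] > 50),
    toy['New Cases'] > 500,
]
choices = ['High Risk', 'Moderate Risk']

# Rows matching no condition receive the default label
toy['Risk Level'] = np.select(conditions, choices, default='Low Risk')
print(toy)
```

The last row has more than 1000 cases but only 20 deaths, so it falls through to the second condition, just as it would in the `elif` branch.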

### Updating Values Conditionally Within a Column

Sometimes, you may want to modify values in a column directly based on conditions.

For instance, suppose we want to replace all 'High Risk' values in the 'Risk Level' column with 'Critical'.


In [None]:
# Update values in the Risk Level column
covid_data.loc[covid_data['Risk Level'] == 'High Risk', 'Risk Level'] = 'Critical'
print('Data with Updated Risk Level:')
print(covid_data[['Date', 'New Cases', 'New Deaths', 'Risk Level']].head())
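
The `.loc` pattern also works for numeric updates such as capping outliers. The sketch below assumes a made-up frame and an arbitrary cap of 4000 chosen only for illustration. It assigns through a single `.loc[mask, column]` call; chained indexing such as `df[mask]['col'] = value` may modify a temporary copy and leave the original unchanged.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({'New Cases': [120, 5000, 80, 12000]})
cap = 4000  # hypothetical threshold, not derived from the dataset

# Work on a copy so the uncapped values stay available for comparison
capped = toy.copy()
capped.loc[capped['New Cases'] > cap, 'New Cases'] = cap

# clip() expresses the same rule in one call
same_as_clip = capped['New Cases'].equals(toy['New Cases'].clip(upper=cap))
print(capped)
print(same_as_clip)
```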

### Conclusion

Conditional transformations modify data based on specific conditions. `numpy.where()` handles vectorized two-way choices, `.apply()` with a custom function handles multi-branch logic at some cost in speed, and `.loc` assignment updates matching rows in place.

# 4. Using apply() and map()

The `apply()` and `map()` methods in Pandas run custom functions over DataFrames and Series, giving you a flexible way to express transformations in plain Python.

## Concept

### Applying Functions to Transform Data Flexibly
- `apply()`: Applies a function to each row or column of a DataFrame, or to each element of a Series.
- `map()`: Transforms a Series element by element, using a function or a dictionary that maps old values to new ones.

Reach for them when a transformation is awkward to express with built-in vectorized operations.

## Topics to Cover
### `apply()`
- **Row-wise and Column-wise Operations**: `axis=0` (the default) passes each column to the function; `axis=1` passes each row.
- **Using Custom Functions**: Apply user-defined functions for tailored operations.
- **Applying Complex Logic**: Implement advanced transformations using Python logic and conditionals.
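
The difference between the two `axis` values is easiest to see on a tiny frame. This sketch uses made-up columns `a` and `b` (not part of the dataset): with the default `axis=0` the function receives each column as a Series, and with `axis=1` it receives each row.

```python
import pandas as pd

# Toy frame with hypothetical column names
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (default): the lambda sees one column at a time
col_ranges = toy.apply(lambda col: col.max() - col.min())

# axis=1: the lambda sees one row at a time
row_sums = toy.apply(lambda row: row['a'] + row['b'], axis=1)

print(col_ranges)  # one value per column
print(row_sums)    # one value per row
```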

### `map()`
- **Transformations for Series Objects**: Apply functions to each element in a Series.
- **Value Mapping with Dictionaries or Functions**: Map specific values to new ones using a dictionary or custom logic.



In [None]:
import pandas as pd
import numpy as np

# Load a sample dataset
data_path = '../DataSets/Data_COVID19_Indonesia.csv'
covid_data = pd.read_csv(data_path)
print('Dataset Preview:')
print(covid_data.head())

### Applying Custom Functions Row-wise/Column-wise

Using `apply()` allows you to perform transformations across rows or columns. For example, let’s calculate the ratio of 'New Cases' to 'Total Cases'.


In [None]:
# Calculate the ratio of New Cases to Total Cases
covid_data['Case Ratio'] = covid_data.apply(
    lambda row: row['New Cases'] / row['Total Cases'] if row['Total Cases'] > 0 else 0,
    axis=1,
)
print('Data with Case Ratio Column:')
print(covid_data[['Date', 'New Cases', 'Total Cases', 'Case Ratio']].head())
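
For a plain ratio like this, a vectorized expression should give the same result without calling a Python function per row. The sketch below compares the two approaches on a made-up frame (values are illustrative); `Series.where()` keeps the ratio where 'Total Cases' is positive and substitutes 0 elsewhere.

```python
import numpy as np
import pandas as pd

# Toy data, including a row with zero total cases
toy = pd.DataFrame({'New Cases': [5, 10, 0], 'Total Cases': [50, 200, 0]})

# Row-wise version, matching the lambda used with apply()
ratio_apply = toy.apply(
    lambda row: row['New Cases'] / row['Total Cases'] if row['Total Cases'] > 0 else 0,
    axis=1,
)

# Vectorized version: divide whole columns, then replace rows
# where the denominator is not positive with 0
ratio_vec = (toy['New Cases'] / toy['Total Cases']).where(toy['Total Cases'] > 0, 0)

print(np.allclose(ratio_apply, ratio_vec))
```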

### Mapping Specific Values with a Dictionary

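Using `map()` allows you to map specific values in a Series. For example, let’s map 'Yes' and 'No' in a 'Student' column to 1 and 0, respectively. The COVID-19 dataset loaded above may not contain a 'Student' column, so the code checks for it first and does nothing if it is missing.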


In [None]:
# Map Yes/No to 1/0 in the Student column
if 'Student' in covid_data.columns:  # Ensure the column exists
    covid_data['Student Mapped'] = covid_data['Student'].map({'Yes': 1, 'No': 0})
    print('Data with Student Mapped Column:')
    print(covid_data[['Student', 'Student Mapped']].head())
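
Because the guarded cell may print nothing, here is a self-contained sketch of the same idea on a made-up Series. It also shows a behavior worth knowing: values with no key in the dictionary become `NaN`. The -1 sentinel used to fill them is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up answers, including one value the dictionary does not cover
answers = pd.Series(['Yes', 'No', 'Maybe', 'Yes'])

# Dictionary mapping: 'Maybe' has no key, so it maps to NaN
mapped = answers.map({'Yes': 1, 'No': 0})

# Optionally replace unmapped values with a sentinel and restore integers
mapped_filled = mapped.fillna(-1).astype(int)

# map() also accepts a function, applied to each element
upper = answers.map(str.upper)

print(mapped)
print(mapped_filled)
print(upper)
```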

### Normalizing Data Using a Custom Function

Min-max normalization rescales values to the range [0, 1] by subtracting the column minimum and dividing by the range (maximum minus minimum). Let’s normalize the 'Total Cases' column.


In [None]:
# Define a normalization function
def normalize(series):
    return (series - series.min()) / (series.max() - series.min())

# Apply normalization to the Total Cases column
covid_data['Normalized Total Cases'] = normalize(covid_data['Total Cases'])
print('Data with Normalized Total Cases:')
print(covid_data[['Total Cases', 'Normalized Total Cases']].head())
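
The function above divides by zero when every value in the column is identical. A minimal sketch of one way to guard against that, and to normalize several columns at once with column-wise `apply()`, is shown below. Mapping a constant column to 0.0 is an assumption made for this example, and the toy frame and its 'Constant' column are made up.

```python
import pandas as pd

def normalize(series):
    """Min-max scale a numeric Series to [0, 1]; a constant Series maps to 0.0."""
    value_range = series.max() - series.min()
    if value_range == 0:
        # Assumed convention: no spread means every value scales to 0.0
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / value_range

# Toy data (illustrative values only)
toy = pd.DataFrame({'Total Cases': [10, 20, 40], 'Constant': [7, 7, 7]})

# apply() with the default axis=0 passes each column to normalize()
scaled = toy.apply(normalize)
print(scaled)
```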

### Conclusion

The `apply()` and `map()` methods cover most custom transformations in Pandas. Use `apply()` for row-wise and column-wise operations on a DataFrame, and `map()` for element-wise transformations or dictionary lookups on a single Series. For simple arithmetic, a direct vectorized expression is usually faster than either.

# 5. Best Practices and Tips

Transforming data in Pandas is a critical step in data preprocessing. Following best practices ensures efficiency, reliability, and reproducibility of your transformations. Below are essential tips to keep in mind while working with data transformations.

## Avoid Overwriting Original Data

### Why It Matters
- Overwriting original data can lead to data loss or unintended modifications that are difficult to trace.
- Maintaining the integrity of the original dataset allows you to reference it if transformations need to be revisited.

### How to Avoid
- Always create a copy of the dataset before performing transformations:

```python
# Create a copy of the dataset
transformed_data = original_data.copy()
print(transformed_data.head())  # Preview the copied dataset
```
- Use `.copy()` when slicing or filtering subsets of data to avoid unintended changes:

```python
# Avoid SettingWithCopyWarning
subset = original_data[['Column1', 'Column2']].copy()
print(subset.head())  # Inspect the subset
```
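
To confirm that a copy is independent of its source, the following sketch filters a made-up frame, modifies the copy, and checks that the original is unchanged. The column names follow the placeholders used above, and the values are illustrative.

```python
import pandas as pd

# Toy stand-in for original_data
original_data = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': [4, 5, 6]})

# Filter rows, then copy so later edits cannot touch original_data
subset = original_data[original_data['Column1'] > 1].copy()
subset['Column2'] = subset['Column2'] * 10

print(subset)
print(original_data)  # still holds the unmodified values
```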

## Performance Considerations

### Why It Matters
- Efficient operations save time and computational resources, especially with large datasets.
- Vectorized operations in Pandas are significantly faster than Python loops.

### Tips for Improved Performance
1. **Use Vectorized Operations**:
   - Avoid using `apply()` or `for` loops unless necessary. Instead, use built-in Pandas or NumPy functions that operate on entire Series or DataFrames.

   ```python
   # Vectorized addition
   data['New Column'] = data['Column1'] + data['Column2']
   print(data[['Column1', 'Column2', 'New Column']].head())
   ```

2. **Leverage NumPy for Numerical Operations**:
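   - NumPy functions such as `np.log` operate on a whole Series at once. Note that `np.log` returns `-inf` for 0 and `NaN` for negative values, so check the input range first: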

   ```python
   import numpy as np
   data['Log Column'] = np.log(data['Numeric Column'])
   print(data[['Numeric Column', 'Log Column']].head())
   ```

3. **Filter Rows and Columns Efficiently**:
   - Use boolean indexing or `.loc[]` for filtering data:

   ```python
   filtered_data = data.loc[data['Value'] > 10, ['Column1', 'Column2']]
   print(filtered_data.head())
   ```
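
To see the effect of the first tip, the sketch below times a row-wise `apply()` against the equivalent vectorized addition on synthetic data. The frame size (20,000 rows) and the random values are assumptions for illustration. Absolute timings depend on the machine, so no figures are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data: two columns of random floats
rng = np.random.default_rng(0)
data = pd.DataFrame({'Column1': rng.random(20_000), 'Column2': rng.random(20_000)})

# Row-wise apply(): one Python function call per row
start = time.perf_counter()
looped = data.apply(lambda row: row['Column1'] + row['Column2'], axis=1)
apply_seconds = time.perf_counter() - start

# Vectorized addition: a single operation over both columns
start = time.perf_counter()
vectorized = data['Column1'] + data['Column2']
vector_seconds = time.perf_counter() - start

print(f'apply: {apply_seconds:.4f}s, vectorized: {vector_seconds:.4f}s')
```

Both approaches produce the same values; only the time taken differs.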

## Testing Transformations

### Why It Matters
- Validating transformations ensures accuracy and prevents errors from propagating through the analysis pipeline.

### How to Test Transformations
1. **Preview Results**:
   - Use `.head()`, `.tail()`, or `.sample()` to inspect transformed data.

   ```python
   print(data.head())
   print(data.sample(5))  # Random sample for validation
   ```

2. **Check Data Consistency**:
   - Use `.info()` and `.describe()` to ensure data types and statistics align with expectations.

   ```python
   data.info()  # info() prints its report directly and returns None
   print(data.describe())
   ```

3. **Assert Expected Outcomes**:
   - Use assertions to test specific conditions:

   ```python
   assert data['New Column'].isnull().sum() == 0, 'Missing values found in New Column'
   assert data['Value'].max() <= 100, 'Values exceed expected range'
   ```
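
Assertions become easier to reuse when grouped in a small function that can be called after each transformation. The helper below, `check_high_cases`, is a hypothetical name introduced for this sketch. It assumes the 'High Cases' rule from the earlier section and a made-up frame.

```python
import numpy as np
import pandas as pd

def check_high_cases(df):
    """Hypothetical helper: verify that 'High Cases' follows the > 1000 rule."""
    labels = set(df['High Cases'].unique())
    assert labels <= {'Yes', 'No'}, f'Unexpected labels: {labels}'
    expected = np.where(df['New Cases'] > 1000, 'Yes', 'No')
    assert (df['High Cases'] == expected).all(), 'Label disagrees with the rule'

# Toy data (illustrative values only); 1000 itself is not "high"
data = pd.DataFrame({'New Cases': [10, 2000, 1000]})
data['High Cases'] = np.where(data['New Cases'] > 1000, 'Yes', 'No')

check_high_cases(data)  # raises AssertionError if a check fails
print('All checks passed')
```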

## Document Transformation Steps

### Why It Matters
- Clear documentation ensures reproducibility and helps others (or future you) understand the steps taken.
- Provides a record of the transformations applied, which is especially useful in collaborative projects.

### How to Document
1. **Comment Your Code**:
   - Add clear comments describing the purpose of each transformation:

   ```python
   # Add a column indicating high case counts
   data['High Cases'] = np.where(data['New Cases'] > 1000, 'Yes', 'No')
   ```

2. **Use Notebooks for Analysis**:
   - Jupyter notebooks allow combining code, markdown, and visualizations for a clear narrative:

   ```python
   print(data.head())
   ```

3. **Maintain a Change Log**:
   - Keep track of major transformations in a separate document or notebook section.

   ```markdown
   ## Change Log
   - Added 'High Cases' column based on 'New Cases'.
   - Filtered data to exclude outliers.
   ```

## Conclusion

Working on copies, preferring vectorized operations, validating results, and documenting each step keeps your data transformations efficient, accurate, and reproducible, and it reduces the time spent tracing errors later in the analysis pipeline.