# 1. Introduction to Data Transformation

Data transformation is a critical process in data analysis and preparation, involving the modification, reorganization, or conversion of data to make it more suitable for analysis or modeling.

## Definition
Data transformation is the process of converting raw data into a format or structure that is more appropriate for analysis, visualization, or machine learning. It includes tasks like filtering, aggregating, and mapping data to create new features or enhance its usability.

## Importance
Transforming data is an essential step in any data analysis or machine learning pipeline. Here’s why it’s crucial:

### 1. Preparing Data for Analysis or Modeling
- Many datasets contain raw, unprocessed information that cannot be directly used in analysis or predictive models.
- Data transformation ensures data is in the right format, scale, and structure for specific analytical tasks.

### 2. Enhancing Data Interpretability and Usability
- Transformation improves the clarity and usefulness of data by aligning it with analytical goals.
- For example, converting categorical data to numerical representations can make it usable in machine learning models.
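
As a minimal sketch of the categorical-to-numerical conversion mentioned above, the snippet below encodes one column by order and another as indicator columns. The DataFrame, its column names, and the `size_order` mapping are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: two categorical columns
df = pd.DataFrame({"product": ["A", "B", "C", "A"],
                   "size": ["small", "large", "medium", "small"]})

# Ordinal encoding: map each category to an integer that respects its order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)

# One-hot encoding: one 0/1 indicator column per category (no order implied)
encoded = pd.get_dummies(df, columns=["product"], dtype=int)
print(encoded)
```

Ordinal codes suit categories with a natural order, while one-hot columns avoid implying an order that does not exist.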

### 3. Supporting Specific Analytical Needs or Objectives
- Transformation enables tailoring the dataset to focus on particular aspects of interest.
- For instance, extracting key metrics or creating aggregated summaries can provide deeper insights into the data.
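
One way to build such an aggregated summary is a `groupby` followed by `agg`. This is only a sketch: the `sales` table and its values are made up for the example.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({"region": ["North", "South", "North", "South", "East"],
                      "revenue": [120.0, 80.0, 200.0, 50.0, 90.0]})

# One row per region with total, average, and number of records
summary = sales.groupby("region")["revenue"].agg(["sum", "mean", "count"])
print(summary)
```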

## Applications
### 1. Filtering Subsets of Data for Exploratory Analysis
- Focus on relevant data by filtering subsets based on conditions.
- Example: Selecting rows where sales exceed a certain threshold to analyze high-performing products.

### 2. Applying Business Rules to Transform Data
- Apply domain-specific rules to modify or organize data.
- Example: Categorizing customers into different tiers based on their purchase history.
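
A tiering rule like this can be sketched with `pd.cut`, which assigns each value to a labeled bin. The thresholds, tier names, and the `customers` table below are assumptions for illustration, not rules taken from a real business.

```python
import pandas as pd

# Hypothetical purchase totals per customer
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                          "total_purchases": [150, 1200, 5400, 800]})

# Assumed rule: (0, 500] -> Bronze, (500, 2000] -> Silver, above 2000 -> Gold
bins = [0, 500, 2000, float("inf")]
labels = ["Bronze", "Silver", "Gold"]
customers["tier"] = pd.cut(customers["total_purchases"], bins=bins, labels=labels)
print(customers)
```

By default `pd.cut` uses right-closed bins, so a value of exactly 500 is Bronze, and a value of exactly 0 falls outside the first bin and becomes `NaN`.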

### 3. Creating New Features for Machine Learning Models
- Engineer new features that enhance predictive power.
- Example: Calculating the ratio of revenue to customer lifetime value to identify valuable customers.
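
A ratio feature like the one in this example might be computed as follows. The column names and values are hypothetical, and replacing a zero denominator with `NaN` is one possible design choice rather than the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical customer metrics
customers = pd.DataFrame({"revenue": [500.0, 1200.0, 300.0],
                          "lifetime_value": [1000.0, 1500.0, 0.0]})

# Treat a zero lifetime value as missing so the ratio is NaN instead of inf
denominator = customers["lifetime_value"].replace(0, np.nan)
customers["revenue_to_clv"] = customers["revenue"] / denominator
print(customers)
```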

Data transformation is the bridge between raw data and actionable insights, making it a fundamental skill for data analysts and machine learning practitioners alike.

# 2. Filtering and Selecting Subsets

Filtering and selecting subsets of data is a foundational skill in data analysis. It enables analysts to focus on relevant data, reducing noise and preparing datasets for further processing.

## Concept
### Filtering Data Based on Conditions or Subsets
Filtering involves extracting rows or columns from a dataset based on specific criteria. This process is critical for narrowing down data to relevant subsets that meet analytical requirements.

### Importance in Data Preprocessing and Analysis
- Filtering helps reduce dataset size, making it easier to manage and analyze.
- It improves the clarity and relevance of data by focusing on key attributes.
- It is essential for targeted analysis, exploratory data analysis (EDA), and feature engineering.

## Topics to Cover

### 1. Indexing (`loc[]` and `iloc[]`)
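- **`loc[]`**: Label-based indexing for selecting rows and columns by their index and column labels. Label slices include the end label.
- **`iloc[]`**: Position-based indexing for selecting rows and columns by integer position, starting at 0. Position slices exclude the end position.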
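
The difference is easiest to see when the index holds labels rather than the default integers. The sketch below assumes a small `people` table indexed by name, a hypothetical variant of the example data used later in this section.

```python
import pandas as pd

# Hypothetical table indexed by name rather than by 0..n-1
people = pd.DataFrame({"Age": [24, 35, 19, 42],
                       "Salary": [50000, 60000, 45000, 80000]},
                      index=["Alice", "Bob", "Charlie", "David"])

by_label = people.loc["Bob", "Salary"]       # row label, column label
by_position = people.iloc[1, 1]              # second row, second column

label_slice = people.loc["Alice":"Charlie"]  # end label included: 3 rows
position_slice = people.iloc[0:2]            # end position excluded: 2 rows

print(by_label, by_position)
print(label_slice)
print(position_slice)
```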

### 2. Boolean Filtering
- Apply conditions to filter rows, e.g., selecting rows where a column exceeds a threshold.
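- Combine multiple conditions with `&` (and), `|` (or), and `~` (not). Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `>` and `<`.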
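
A short sketch of `|` and `~` on a small table follows. The data mirrors the example DataFrame used in the code examples of this section.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "David"],
                   "Age": [24, 35, 19, 42],
                   "Salary": [50000, 60000, 45000, 80000]})

# OR: keep rows that satisfy at least one condition
young_or_high_paid = df[(df["Age"] < 25) | (df["Salary"] > 70000)]

# NOT: invert a condition (equivalent to Age <= 30 here)
not_over_30 = df[~(df["Age"] > 30)]

print(young_or_high_paid)
print(not_over_30)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series and raise a `ValueError`, which is why the symbolic operators are used.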

### 3. Advanced Filters
- Use `isin()` to keep rows whose column value appears in a list of allowed values.
- Use `between()` to keep rows whose column value falls within a range. Both endpoints are included by default.
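
Both methods can be tried on the same small table. This sketch assumes pandas 1.3 or later, where `between()` accepts string values such as `"neither"` for `inclusive`.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "David"],
                   "Age": [24, 35, 19, 42],
                   "Salary": [50000, 60000, 45000, 80000]})

# isin(): keep rows whose Name is in the list
selected_names = df[df["Name"].isin(["Alice", "David"])]

# between(): both endpoints are included by default
in_range = df[df["Salary"].between(45000, 60000)]

# Exclude both endpoints (assumes pandas >= 1.3)
strictly_inside = df[df["Salary"].between(45000, 60000, inclusive="neither")]

print(selected_names)
print(in_range)
print(strictly_inside)
```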

## Code Examples

### Selecting Rows Based on a Single Condition


In [None]:
import pandas as pd

# Example DataFrame
data = {"Name": ['Alice', 'Bob', 'Charlie', 'David'],
        "Age": [24, 35, 19, 42],
        "Salary": [50000, 60000, 45000, 80000]}
df = pd.DataFrame(data)

# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print('Filtered DataFrame:')
print(filtered_df)

### Combining Multiple Conditions


In [None]:
# Filter rows where Age is greater than 30 and Salary is greater than 55000
filtered_df = df[(df['Age'] > 30) & (df['Salary'] > 55000)]
print('Filtered DataFrame with Multiple Conditions:')
print(filtered_df)

### Filtering Specific Columns by Names or Indices


In [None]:
# Select specific columns
selected_columns = df[['Name', 'Salary']]
print('Selected Columns:')
print(selected_columns)

# Select columns by index position using iloc
selected_columns = df.iloc[:, [0, 2]]
print('Selected Columns by Index:')
print(selected_columns)

# 2. Filtering and Selecting Subsets (Applied to a Real Dataset)

This part revisits filtering and selecting subsets using a COVID-19 dataset for Indonesia. These operations let us focus on the parts of a dataset that meet certain conditions, which is critical for preprocessing, exploratory analysis, and creating targeted datasets.

## Concept

### Filtering Data Based on Conditions or Subsets
Filtering involves extracting rows or columns that meet specific criteria. These criteria can include numerical thresholds, string matching, or logical combinations.
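
String matching is the one criterion above that the later examples do not cover, so here is a minimal sketch using the `.str` accessor. The table and its `City` values are hypothetical.

```python
import pandas as pd

# Hypothetical table with a text column that has mixed case and a missing value
df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "David"],
                   "City": ["Jakarta", "Bandung", "jakarta selatan", None]})

# Case-insensitive substring match; na=False treats missing values as no match
in_jakarta = df[df["City"].str.contains("jakarta", case=False, na=False)]

# Prefix match on another column
starts_with_d = df[df["Name"].str.startswith("D")]

print(in_jakarta)
print(starts_with_d)
```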

### Importance in Data Preprocessing and Analysis
1. **Data Reduction**: Helps reduce the size of a dataset, making it easier to analyze.
2. **Focus on Relevance**: Allows you to concentrate on data relevant to your analysis goals.
3. **Improved Efficiency**: Reduces memory use and computation time by working only with the necessary data.

## Topics to Cover
### Indexing (`loc[]` and `iloc[]`)
- `loc[]`: Label-based indexing to select rows and columns by their labels.
- `iloc[]`: Position-based indexing to select rows and columns by their integer indices.

### Boolean Filtering
- Use conditional expressions to filter rows where specific column values meet criteria.
- Example: Selecting rows where a column exceeds a threshold.

### Combining Conditions
- Combine multiple conditions using logical operators (`&` for AND, `|` for OR, `~` for NOT), wrapping each condition in parentheses.
- Example: Selecting rows where two or more conditions are simultaneously satisfied.

### Advanced Filters
- Use `.isin()` to filter rows where a column’s value is in a specified list.
- Use `.between()` to filter rows where a column’s value falls within a range.



In [None]:
import pandas as pd

# Load a sample dataset
data_path = '../DataSets/Data_COVID19_Indonesia.csv'
covid_data = pd.read_csv(data_path)
print('Dataset Preview:')
print(covid_data.head())

### Selecting Rows Based on a Single Condition

Suppose we want to filter rows where the number of new cases exceeds 500.


In [None]:
# Filter rows where 'New Cases' > 500
high_cases = covid_data[covid_data['New Cases'] > 500]
print('Rows where New Cases > 500:')
print(high_cases.head())

### Combining Multiple Conditions

Now, let’s filter rows where the number of new cases exceeds 500 **and** the number of new deaths is less than 10.


In [None]:
# Filter rows with multiple conditions
filtered_data = covid_data[(covid_data['New Cases'] > 500) & (covid_data['New Deaths'] < 10)]
print('Rows where New Cases > 500 and New Deaths < 10:')
print(filtered_data.head())

### Filtering Specific Columns by Names

You can select specific columns by passing a list of their names. For example, let’s select only the columns 'Date', 'New Cases', and 'New Deaths'.


In [None]:
# Select specific columns
subset_columns = covid_data[['Date', 'New Cases', 'New Deaths']]
print('Subset with Selected Columns:')
print(subset_columns.head())
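
Row filtering and column selection can also be done in one step with `loc[]`, by passing a boolean mask for the rows and a list of names for the columns. The sketch below uses a small stand-in DataFrame with made-up values so it runs without the CSV file. Only the column names are borrowed from the dataset above.

```python
import pandas as pd

# Stand-in data with hypothetical values (not taken from the real dataset)
sample = pd.DataFrame({"Date": ["3/1/2020", "3/2/2020", "3/3/2020"],
                       "New Cases": [120, 650, 900],
                       "New Deaths": [2, 5, 30]})

# Rows where New Cases > 500, keeping only two columns
subset = sample.loc[sample["New Cases"] > 500, ["Date", "New Cases"]]
print(subset)
```

Using a single `loc[]` call also avoids chained indexing, which can trigger `SettingWithCopyWarning` if the result is modified later.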

### Advanced Filtering: Using `.isin()` and `.between()`

#### Using `.isin()`
Filter rows where the 'Province' column matches any value in a list (e.g., 'DKI Jakarta' or 'Maluku Utara').


In [None]:
# Filter using .isin()
provinces = ['DKI Jakarta', 'Maluku Utara']
filtered_by_province = covid_data[covid_data['Province'].isin(provinces)]
print('Rows where Province is DKI Jakarta or Maluku Utara:')
print(filtered_by_province.head())

#### Using `.between()`
Filter rows where 'Total Cases' falls within a specified range (e.g., between 1000 and 5000, with both endpoints included).


In [None]:
# Filter using .between()
filtered_by_cases = covid_data[covid_data['Total Cases'].between(1000, 5000)]
print('Rows where Total Cases are between 1000 and 5000:')
print(filtered_by_cases.head())

### Conclusion

Filtering and selecting subsets of data are essential operations in Pandas for narrowing down datasets to meet specific analysis needs. By combining techniques like indexing, boolean filtering, and advanced methods (`isin`, `between`), you can efficiently extract and manipulate relevant data.