
# Data Transformation and Feature Engineering


## Overview


Data transformation and feature engineering are essential steps in data analysis and machine learning workflows. They involve manipulating and modifying data to make it more suitable for analysis or to create new features that can improve the performance of predictive models. Pandas, a popular data manipulation library in Python, provides a wide range of functions and methods to facilitate these tasks.

Data transformation involves converting, cleaning, and reshaping data to make it more consistent and meaningful. It often involves tasks such as handling missing values, converting data types, removing duplicates, and normalizing or scaling variables. These transformations help to ensure data quality, improve the accuracy of analyses, and enhance the effectiveness of machine learning algorithms.
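As a minimal sketch of these steps, the snippet below uses a small made-up DataFrame (the column names and values are invented for illustration and do not come from any dataset in this notebook). It removes a duplicate row, converts a string column to numbers, fills a missing value with the median, and min-max scales the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values are invented for illustration only
raw = pd.DataFrame({
    "age": ["25", "32", "32", "47"],                # numbers stored as strings
    "income": [40000.0, 55000.0, 55000.0, np.nan],  # one missing value
})

# Remove exact duplicate rows (the second and third rows are identical)
clean = raw.drop_duplicates().reset_index(drop=True)

# Convert the string column to a numeric dtype
clean["age"] = pd.to_numeric(clean["age"])

# Fill the missing income with the median of the observed values
clean["income"] = clean["income"].fillna(clean["income"].median())

# Min-max scale income to the [0, 1] range
income_range = clean["income"].max() - clean["income"].min()
clean["income_scaled"] = (clean["income"] - clean["income"].min()) / income_range

print(clean)
```

Median imputation and min-max scaling are only one possible choice here; the right strategy depends on the data and the model.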

Feature engineering, on the other hand, focuses on creating new features or extracting relevant information from existing ones. This process aims to capture the underlying patterns or relationships in the data that can improve model performance. Feature engineering techniques include creating interaction terms, binning or discretizing variables, encoding categorical variables, applying mathematical transformations, and deriving time-based features.
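A few of these techniques can be sketched on a toy table. The `orders` DataFrame and all of its columns below are hypothetical and exist only to illustrate an interaction term, a mathematical transformation, and time-based features.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration
orders = pd.DataFrame({
    "price": [10.0, 20.0, 5.0],
    "quantity": [3, 1, 10],
    "order_date": pd.to_datetime(["2023-01-15", "2023-06-30", "2023-12-24"]),
})

# Interaction term: the product of two existing features
orders["revenue"] = orders["price"] * orders["quantity"]

# Mathematical transformation: log(1 + x) compresses large values
orders["log_revenue"] = np.log1p(orders["revenue"])

# Time-based features derived from a datetime column
orders["order_month"] = orders["order_date"].dt.month
orders["order_dayofweek"] = orders["order_date"].dt.dayofweek  # Monday = 0

print(orders)
```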

Pandas provides a powerful set of tools to perform these data transformation and feature engineering tasks. It offers functions for handling missing data, reshaping and pivoting data, applying mathematical and statistical operations, merging and joining datasets, and much more. Additionally, pandas seamlessly integrates with other libraries commonly used in data analysis and machine learning, such as NumPy, scikit-learn, and matplotlib, allowing for a comprehensive and efficient workflow.

Fluency with these pandas capabilities is a core skill for data scientists and analysts. Well-chosen transformations and features can reduce noise, expose patterns that the raw columns hide, and improve the accuracy of predictive models.



# Applying mathematical operations to columns



In Pandas, you can apply mathematical operations to columns using arithmetic operators or built-in mathematical functions. These operations allow you to perform calculations on specific columns and create new columns based on the results. Here's an example of applying mathematical operations to columns in the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Apply mathematical operations to columns
dataset['BMI_squared'] = dataset['BMI'] ** 2
dataset['Glucose_divided_by_Age'] = dataset['Glucose'] / dataset['Age']

# Print the modified dataset
print(dataset.head())


In this example, we load the Pima Indian Diabetes dataset using Pandas. We then apply two different mathematical operations to create new columns.

1. The line `dataset['BMI_squared'] = dataset['BMI'] ** 2` squares the values in the 'BMI' column and assigns the result to a new column called 'BMI_squared'. This operation calculates the BMI squared for each record in the dataset.

2. The line `dataset['Glucose_divided_by_Age'] = dataset['Glucose'] / dataset['Age']` divides the values in the 'Glucose' column by the values in the 'Age' column and assigns the result to a new column called 'Glucose_divided_by_Age'. This operation calculates the ratio of Glucose to Age for each record in the dataset.

Finally, we print the modified dataset using `dataset.head()` to display the first few rows, including the newly created columns.
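
The same kind of calculation can also be written with Series methods or NumPy functions instead of operators. The following is a minimal sketch on a small stand-in DataFrame with made-up values (it is not loaded from the dataset URL), and the new column names are illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with made-up values (not rows loaded from the real dataset)
sample = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "Insulin": [0, 94, 168],
    "Age": [50, 31, 32],
})

# Method form of the `/` operator; equivalent to sample["Glucose"] / sample["Age"]
sample["Glucose_per_Age"] = sample["Glucose"].div(sample["Age"]).round(2)

# NumPy functions operate element-wise on a whole column
sample["Insulin_log"] = np.log1p(sample["Insulin"])  # log(1 + x), defined at 0
sample["Glucose_sqrt"] = np.sqrt(sample["Glucose"])

print(sample)
```

The method forms (`add()`, `sub()`, `mul()`, `div()`) behave like the operators but accept extra arguments such as `fill_value`, which can be convenient when one of the columns contains missing values.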


# Creating new features from existing ones


Creating new features from existing ones in pandas allows you to extract additional information or combine existing features to enhance the predictive power of your dataset. These new features can capture relationships, patterns, or domain-specific insights that might not be apparent in the original features alone.

Here's an example of creating a new feature from existing ones in the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Create a new feature: BMI category
dataset['BMI_Category'] = pd.cut(dataset['BMI'], bins=[0, 18.5, 24.9, 29.9, 100], labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

# Print the dataset with the new feature
print(dataset.head())


In this example, we load the Pima Indian Diabetes dataset using Pandas. We then create a new feature called 'BMI_Category' by sorting the 'BMI' values into ranges. The `pd.cut()` function takes the bin edges and, by default, builds right-closed intervals: (0, 18.5], (18.5, 24.9], (24.9, 29.9], and (29.9, 100]. Each interval receives the corresponding label ('Underweight', 'Normal', 'Overweight', 'Obese'). Because the first interval excludes its left edge, any record with a BMI of exactly 0 falls outside every bin and is assigned NaN.

The resulting dataset has an additional column 'BMI_Category' that holds the BMI category for each record. This new feature is a coarser, categorical representation of BMI, which can be useful for grouping, summarizing, or modeling.

You can create new features by applying various transformations, combining columns, extracting information from text or dates, performing mathematical operations, or any other operation that derives meaningful information from existing features in your dataset.
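
The bin-edge behavior of `pd.cut()` is easy to get wrong, so here is a small sketch on made-up BMI values chosen to land on the edges. The second variant is an assumption on our part, not part of the original example: it treats 0 as a missing measurement and uses left-closed bins with the commonly cited 18.5 / 25 / 30 cutoffs.

```python
import numpy as np
import pandas as pd

# Made-up BMI values chosen to hit the bin edges; 0 stands in for an unrecorded value
bmi = pd.Series([0.0, 18.5, 24.9, 25.0, 30.0, 43.1], name="BMI")
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# Default behavior: right-closed bins (0, 18.5], (18.5, 24.9], ...
# A value of exactly 0 falls outside the first bin and becomes NaN
default_cut = pd.cut(bmi, bins=[0, 18.5, 24.9, 29.9, 100], labels=labels)

# Alternative (assumption): mark 0 as missing, then use left-closed bins
# [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_clean = bmi.replace(0, np.nan)
left_closed = pd.cut(bmi_clean, bins=[0, 18.5, 25, 30, np.inf], right=False, labels=labels)

print(pd.DataFrame({"BMI": bmi, "default": default_cut, "left_closed": left_closed}))
```

Note how a BMI of 18.5 is labeled 'Underweight' by the default bins but 'Normal' by the left-closed ones; which convention is appropriate depends on the definition you want to follow.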


# Handling categorical variables using one-hot encoding


Many machine learning algorithms require numerical input, so categorical variables must be converted to numbers before modeling. One-hot encoding is a common technique for this: it converts a categorical variable into binary vectors, where each category becomes a separate binary feature. This lets the algorithm use categorical information without implying an order between the categories.

In Pandas, you can perform one-hot encoding using the `get_dummies()` function. This function converts categorical variables into dummy/indicator variables, creating new binary columns for each category. Here's an example of using one-hot encoding on the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Select the categorical variable to encode
categorical_variable = "Outcome"

# Perform one-hot encoding
encoded_dataset = pd.get_dummies(dataset, columns=[categorical_variable])

# Print the encoded dataset
print(encoded_dataset.head())


In this example, we load the Pima Indian Diabetes dataset and select the variable to encode, "Outcome", which records whether an individual has diabetes. "Outcome" is already stored as 0/1, so encoding it here only demonstrates the mechanics. We pass the column name to `get_dummies()` through the `columns` argument, and the function creates one indicator column per category (`Outcome_0` and `Outcome_1`) that marks which category is present in each row. In pandas 2.0 and later these indicator columns hold `True`/`False` by default; pass `dtype=int` to get 1/0 instead. Finally, we print the first few rows with `encoded_dataset.head()`.

After one-hot encoding, the original categorical variable will be replaced by the binary columns representing each category. This allows the categorical information to be effectively utilized by machine learning algorithms that require numerical input.
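
One-hot encoding is more informative on a column with several text categories. The sketch below uses a made-up `people` DataFrame (hypothetical rows, not loaded from the dataset) and shows two optional arguments of `get_dummies()`: `dtype=int` for 0/1 output and `drop_first=True` to drop one redundant level.

```python
import pandas as pd

# Made-up rows with a multi-level categorical column
people = pd.DataFrame({
    "Age": [50, 31, 32, 21],
    "BMI_Category": ["Obese", "Overweight", "Normal", "Overweight"],
})

# dtype=int forces 0/1 output; recent pandas versions default to True/False
encoded = pd.get_dummies(people, columns=["BMI_Category"], dtype=int)

# drop_first=True drops the first level, which the remaining columns already imply
encoded_reduced = pd.get_dummies(
    people, columns=["BMI_Category"], dtype=int, drop_first=True
)

print(encoded)
print(encoded_reduced)
```

Dropping one level is often preferred for linear models, where a full set of indicator columns is perfectly collinear with the intercept; tree-based models usually do not need it.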


# Reflection Points

1. **Applying mathematical operations to columns**:
   - How can mathematical operations be applied to columns in a Pandas DataFrame?
   - What are some commonly used mathematical operations (e.g., addition, subtraction, multiplication, division) and their corresponding methods or functions in Python?
   - How can you perform mathematical operations on specific columns or groups of columns in a DataFrame?

2. **Creating new features from existing ones**:
   - Why is feature creation important in data analysis and machine learning?
   - What are some techniques for creating new features based on existing ones?
   - How can you use mathematical operations, string manipulation, or functions to generate new features?
   - What considerations should be taken into account when creating new features, such as dealing with missing data or outliers?

3. **Handling categorical variables using one-hot encoding**:
   - What are categorical variables, and why do we need to handle them differently?
   - How does one-hot encoding work, and why is it commonly used to encode categorical variables?
   - What libraries and functions can you use in Python to perform one-hot encoding?
   - How can you handle scenarios with a large number of categories or deal with new categories in test data?


Answers to the Reflection Points:

1. Example answers could include:
   - Mathematical operations can be applied to columns using arithmetic operators (+, -, *, /) or the equivalent DataFrame and Series methods `add()`, `subtract()`, `multiply()`, and `divide()` (also available as `sub()`, `mul()`, and `div()`).
   - You can perform operations on specific columns by accessing them using DataFrame indexing or by applying operations to groups of columns using functions like `apply()`.
   
2. Example answers could include:
   - Feature creation allows you to extract more meaningful information or capture specific patterns from existing data.
   - Techniques for creating new features include combining columns, transforming variables, deriving statistical aggregations, or applying domain-specific knowledge.
   - Mathematical operations, string manipulation methods, or custom functions can be used to generate new features.

3. Example answers could include:
   - Categorical variables represent discrete values and need to be encoded numerically for machine learning algorithms to process them.
   - One-hot encoding converts categorical variables into binary columns, representing the presence or absence of each category.
   - One-hot encoding can be performed with the `get_dummies()` function from pandas or the `OneHotEncoder` class from scikit-learn.
   - Considerations include handling missing values or handling categories that are not present in the training data.
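
The last point, categories that appear in test data but not in training data, can be sketched with pandas alone. The `train` and `test` frames below are hypothetical; the idea is to encode each one and then align the test columns to the training layout with `reindex()`.

```python
import pandas as pd

# Hypothetical training and test data; "purple" never appears in training
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})

train_encoded = pd.get_dummies(train, columns=["color"], dtype=int)
test_encoded = pd.get_dummies(test, columns=["color"], dtype=int)

# Align test columns to the training layout:
# columns for unseen categories are dropped, missing ones are filled with 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(test_aligned)
```

With this approach an unseen category ends up as a row of all zeros. scikit-learn's `OneHotEncoder(handle_unknown="ignore")` gives the same result when the encoder is fitted on the training data only.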


# Exercise


1. Load the dataset from the provided URL.
2. Explore the dataset to understand its structure and statistics.
3. Handle any missing values in the dataset.
4. Perform feature engineering by adding new relevant features (e.g., age groups, BMI category, etc.).
5. Encode categorical variables (if any) into numerical format.
6. Normalize or scale numerical features if required.
7. Save the preprocessed data to a new CSV file.



# Sample Solution

In [None]:
import pandas as pd

# Step 1: Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['pregnancies', 'glucose', 'blood_pressure', 'skin_thickness', 'insulin', 'bmi', 'diabetes_pedigree', 'age', 'outcome']
df = pd.read_csv(url, names=columns)

# Step 2: Explore the dataset
print(df.head())       # Display the first few rows
df.info()              # Overview of data types and non-null counts (prints directly, returns None)
print(df.describe())   # Statistical summary of the dataset

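# Step 3: Handle missing values (if any)
# df.info() reports no nulls, but several columns (e.g. skin_thickness, insulin) contain
# zeros that likely stand in for missing measurements. For this exercise they are left as-is.
# To treat them as missing, replace them with NaN and then use df.fillna() or df.dropna().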

# Step 4: Feature engineering
# Example: Create age groups based on age
age_bins = [0, 25, 40, 60, float('inf')]
age_labels = ['young', 'young-adult', 'adult', 'senior']
df['age_group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels)

# Example: Categorize BMI into three categories
bmi_bins = [0, 18.5, 24.9, float('inf')]
bmi_labels = ['underweight', 'normal', 'overweight']
df['bmi_category'] = pd.cut(df['bmi'], bins=bmi_bins, labels=bmi_labels)

# Step 5: Encode categorical variables (if any)
# The original columns are all numeric, but the engineered 'age_group' and
# 'bmi_category' columns are categorical, so one-hot encode them.
df = pd.get_dummies(df, columns=['age_group', 'bmi_category'])

# Step 6: Normalize or scale numerical features (if required)
# For this exercise, assume that the numerical features are already in an appropriate scale.
# If required, you can use techniques like Min-Max Scaling or Standard Scaling.

# Step 7: Save the preprocessed data to a new CSV file
df.to_csv('pima_indian_preprocessed.csv', index=False)

print("Preprocessing and Feature Engineering Completed!")


   pregnancies  glucose  blood_pressure  skin_thickness  insulin   bmi  \
0            6      148              72              35        0  33.6   
1            1       85              66              29        0  26.6   
2            8      183              64               0        0  23.3   
3            1       89              66              23       94  28.1   
4            0      137              40              35      168  43.1   

   diabetes_pedigree  age  outcome  
0              0.627   50        1  
1              0.351   31        0  
2              0.672   32        1  
3              0.167   21        0  
4              2.288   33        1  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   pregnancies        768 non-null    int64  
 1   glucose            768 non-null    int64  
 2   blood_pressure     768 non-null 
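
The sample solution skips steps 3 and 6. The following is a minimal sketch of how they might look, run on a small made-up DataFrame that reuses the solution's column names rather than the downloaded file. Treating zeros as missing, imputing with the median, and min-max scaling are all assumptions chosen for illustration, not the only reasonable options.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse the sample solution's column names
demo = pd.DataFrame({
    "glucose": [148, 85, 0, 89],
    "bmi": [33.6, 26.6, 23.3, 0.0],
    "age": [50, 31, 32, 21],
})

# Step 3 (alternative): treat 0 as missing where 0 is not a plausible measurement,
# then fill each gap with the column median
for col in ["glucose", "bmi"]:
    demo[col] = demo[col].replace(0, np.nan)
    demo[col] = demo[col].fillna(demo[col].median())

# Step 6 (alternative): min-max scale the numeric columns to the [0, 1] range
numeric_cols = ["glucose", "bmi", "age"]
col_min = demo[numeric_cols].min()
col_max = demo[numeric_cols].max()
scaled = (demo[numeric_cols] - col_min) / (col_max - col_min)

print(demo)
print(scaled)
```

In a modeling workflow the median, minimum, and maximum would normally be computed on the training split only and then reused for the test split.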

# A quiz on Data Transformation and Feature Engineering


1. Which pandas method applies a function along an axis of a DataFrame (or to each value of a Series)?
<br>a) apply()
<br>b) map()
<br>c) transform()
<br>d) operate()

2. What does the following line of code do?
   `df['new_column'] = df['column1'] + df['column2']`
<br>a) Creates a new column in the DataFrame with the sum of 'column1' and 'column2'
<br>b) Multiplies 'column1' and 'column2' and assigns the result to a new column
<br>c) Subtracts 'column2' from 'column1' and assigns the result to a new column
<br>d) Raises an error because mathematical operations cannot be applied directly to columns in pandas

3. How can you create a new feature in a pandas DataFrame by combining existing features?
<br>a) By using the merge() function
<br>b) By using the concat() function
<br>c) By using the add() function
<br>d) By using the assign() function

4. Which pandas function is used to handle categorical variables by converting them into one-hot encoded columns?
<br>a) get_dummies()
<br>b) one_hot_encode()
<br>c) categorical_encode()
<br>d) encode_categorical()

5. What is the purpose of one-hot encoding categorical variables?
<br>a) To convert categorical variables into numerical values
<br>b) To reduce the dimensionality of the dataset
<br>c) To make the data more visually appealing
<br>d) To perform mathematical operations on categorical variables

---
Answers:

1. a) apply()
2. a) Creates a new column in the DataFrame with the sum of 'column1' and 'column2'
3. d) By using the assign() function
4. a) get_dummies()
5. a) To convert categorical variables into numerical values
---
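
For reference, the methods named in answers 1 and 3 can be tried on a tiny example. This is a minimal sketch on a made-up `frame`; the column names mirror the ones used in question 2.

```python
import pandas as pd

# Made-up data using the column names from question 2
frame = pd.DataFrame({"column1": [1, 2, 3], "column2": [10, 20, 30]})

# assign() returns a new DataFrame with the extra column; `frame` itself is unchanged
with_sum = frame.assign(new_column=frame["column1"] + frame["column2"])

# apply() on a Series calls the function once per value
doubled = frame["column1"].apply(lambda x: x * 2)

# apply() on a DataFrame with axis=1 calls the function once per row
row_total = frame.apply(lambda row: row["column1"] + row["column2"], axis=1)

print(with_sum)
print(doubled.tolist(), row_total.tolist())
```

For simple arithmetic like this, the vectorized operator form is usually faster than `apply()`, which is better reserved for logic that cannot be expressed with column operations.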