In [None]:
import pandas as pd
import numpy as np  # for nan

## **Data Type Conversion**

In [None]:
# Data Types
data_types = data.dtypes
print("Data Types for Each Column in Your Data")
print(data_types)

In [None]:
import pandas as pd


def convert_data_types(data):
  """
  Prints data types for each column and allows user to change them.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.

  Returns:
      pandas.DataFrame: The DataFrame potentially with changed data types.
  """

  # Explain data types in a dictionary for easy reference
  dtype_explanations = {
      'int64': "Integer (whole numbers, positive or negative)",
      'float64': "Decimal number",
      'object': "Text data (strings)",
      'category': "Categorical data (limited set of options)",
      'datetime64[ns]': "Date and time",
      'bool': "Boolean (True or False)"
  }

  # Print data types with explanations
  for col, dtype in data.dtypes.items():
    # dtype is a dtype object, so convert it to a string before the dictionary lookup
    print(f"- {col}: {dtype} ({dtype_explanations.get(str(dtype), 'Unknown')})")

  # Prompt user for data type changes
  change_dtypes = input("Would you like to change any data types (y/n)? ").lower()
  if change_dtypes == "y":
    while True:
      # Ask for column and desired data type (column names are case-sensitive)
      col_to_change = input("Enter the column name to change the data type: ").strip()
      new_dtype = input(f"Enter the desired new data type ({', '.join(dtype_explanations)}): ").strip().lower()

      # Check if column exists and new data type is valid
      if col_to_change in data.columns and new_dtype in dtype_explanations.keys():
        try:
          # Attempt conversion (handles potential errors)
          data[col_to_change] = data[col_to_change].astype(new_dtype)
          print(f"Data type for '{col_to_change}' changed to {new_dtype}.")
          # **Modified break logic:**
          break_loop = input("Do you want to convert another column (y/n)? ").lower()
          if break_loop != "y":
            break
        except (ValueError, TypeError) as e:
          print(f"Error converting '{col_to_change}' to {new_dtype}: {e}")
          # **Prompt to continue after error**
          continue_loop = input("Would you like to try converting another column (y/n)? ").lower()
          if continue_loop != "y":
            break
      else:
        print("Invalid column name or data type. Please try again.")

  return data

# Example usage 
data = convert_data_types(data.copy())  # Avoid modifying original data


Focus: Data cleaning addresses inconsistencies, errors, and missing values within the data itself.

Data Type Conversion: In this context, converting data types is often a cleaning step when the data type is incorrect or incompatible with how the data should be represented.
Examples:
- Inconsistent date formats (text to datetime).
- Text values in numerical columns (text to numerical).
- Incorrect data types due to import issues (e.g., strings instead of integers).
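
The three examples above can be sketched in a few lines. This is a minimal illustration rather than part of the notebook's functions: the `raw` DataFrame and its column names are made up, and `errors="coerce"` is one possible policy (unparseable entries become `NaT`/`NaN` instead of raising an error).

```python
import pandas as pd

# Hypothetical raw data, as it might look after a messy CSV import.
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-02-10", "not a date"],
    "quantity": ["3", "7", "n/a"],
    "price": ["19.99", "5.50", "12.00"],
})

cleaned = raw.copy()

# Text to datetime: entries that do not match the format become NaT.
cleaned["order_date"] = pd.to_datetime(
    cleaned["order_date"], format="%Y-%m-%d", errors="coerce"
)

# Text to numeric: entries that cannot be parsed become NaN.
cleaned["quantity"] = pd.to_numeric(cleaned["quantity"], errors="coerce")

# Clean numeric strings can be converted directly with astype.
cleaned["price"] = cleaned["price"].astype("float64")

print(cleaned.dtypes)
print(cleaned)
```

Coercing keeps the conversion from failing on a single bad value, but it also hides those values as missing data, so it is worth counting the resulting `NaN`/`NaT` entries afterwards.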

## **Dealing With Normality and Skewness**

The most efficient way to assess normality and skewness in your data columns depends on a few factors:

**1. Number of Columns:**

* **Few Columns:** `(still need implementation)` If you have a small number of columns (less than 10), visual inspection using histograms and QQ plots might be the most efficient approach. These techniques are easy to understand and interpret, providing a quick grasp of the data distribution.

* **Many Columns:** With a large number of columns (more than 10), visual inspection becomes cumbersome. Here, statistical tests like the Shapiro-Wilk test can be more efficient. You can calculate the test statistic and p-value for each column to identify potential deviations from normality. A p-value below a chosen threshold (e.g., 0.05) indicates that the column is unlikely to be normally distributed; a p-value above it means only that the test found no evidence against normality.

**2. Desired Level of Detail:**

* **Basic Assessment:** If you just need a quick indication of normality or skewness, histograms and statistical tests with p-values provide a sufficient level of detail.

* **Detailed Analysis:** For a more in-depth analysis, you can combine both approaches. Start with histograms and QQ plots to get a visual sense of the distribution, and then follow up with statistical tests to confirm your observations or explore borderline cases with p-values close to the chosen threshold.

Here's a breakdown of the efficiency considerations:

| Method | Efficiency for Few Columns | Efficiency for Many Columns | Level of Detail |
|---|---|---|---|
| Histograms & QQ Plots | High (easy to interpret visually) | Low (time-consuming for many columns) | High (visual assessment of shape) |
| Statistical Tests | Low (requires calculations) | High (efficient for many columns) | Moderate (p-value indicates normality likelihood) |

**Combined Approach:**

In practice, a combination of visual inspection and statistical tests often offers the best balance between efficiency and detail. Start with histograms and QQ plots for a quick overview, then use statistical tests for more rigorous confirmation, especially when dealing with many columns.

Here are some additional factors to consider:

* **Computational Resources:** If computational resources are limited, visual methods might be preferred. Statistical tests, especially for large datasets, can require more processing power.
* **Domain Knowledge:** If you have domain knowledge about the data, you might have an initial expectation about the normality of certain features. This can guide your choice of method, focusing on tests for features where normality is critical for your analysis.

Ultimately, the most efficient approach depends on your specific needs and the size of your dataset. Combining visual and statistical methods often provides a comprehensive and efficient way to assess normality and skewness in your data columns. 
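
For the "many columns" case, the statistical route described above could look like the sketch below. It assumes SciPy is available; `normality_report`, the `alpha` parameter, and the synthetic `demo` columns are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic demo data: one roughly normal column, one right-skewed column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "roughly_normal": rng.normal(loc=50, scale=5, size=300),
    "right_skewed": rng.exponential(scale=2.0, size=300),
})


def normality_report(df, alpha=0.05):
    """Return skewness and the Shapiro-Wilk p-value for each numeric column."""
    rows = []
    for col in df.select_dtypes(include=[np.number]).columns:
        values = df[col].dropna()
        _, p_value = stats.shapiro(values)
        rows.append({
            "column": col,
            "skewness": values.skew(),
            "shapiro_p": p_value,
            # True means the test rejects normality at the chosen alpha.
            "rejects_normality": bool(p_value < alpha),
        })
    return pd.DataFrame(rows).set_index("column")


report = normality_report(demo)
print(report)
```

A table like this makes it easy to shortlist the columns that deserve a histogram or QQ plot, which keeps the visual step manageable even for wide datasets.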

## **Normalizing/Scaling Data**

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler


def scale(data):
  """
  Identifies skewed features, suggests corrections, and performs scaling/normalization.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.

  Returns:
      pandas.DataFrame: The transformed DataFrame with addressed skewness and scaling/normalization.
  """
  # Use the column names (not a DataFrame) so they can index `data` later on
  numerical_cols = data.select_dtypes(include=[np.number]).columns.tolist()
  skewed_cols = []  # List to store column names with skewness

  # Threshold for skewness (adjust as needed)
  skewness_threshold = 0.5

  for col in numerical_cols:
    # Calculate skewness
    skew = data[col].skew()
    if abs(skew) > skewness_threshold:
      skewed_cols.append(col)
      print(f"Column '{col}' appears skewed (skewness: {skew:.2f}).")


      # Inform decision-making
      print("Here's a brief explanation of the available correction methods:")
      print("  - Log transformation (log(x + 1)): This method is often effective for right-skewed data (where values are concentrated on the left side of the distribution).")
      print("    It compresses the larger values and stretches the smaller ones, aiming for a more symmetrical distribution.")
      print("  - Square root transformation (sqrt(x)): This method can be helpful for moderately skewed data, positive-valued features, or data with a large number of zeros.")
      print("    It reduces the influence of extreme values and can bring the distribution closer to normality.")
      print("**Please consider the characteristics of your skewed feature(s) when making your choice.**")
      print("If you're unsure, you can experiment with both methods and compare the results visually (e.g., using histograms) to see which one normalizes the data more effectively for your specific case.")

      # User prompt for addressing skewness
      action = input("Do you want to address the skewness (y/n)? ").lower()
      if action == "y":
        

        # User chooses to address skewness
        while True:  # Loop until a valid choice is made
          fix_method = input("Choose a correction method (log/sqrt/none): ").lower()
          if fix_method == "log":
            # log1p computes log(x + 1); it is only defined for values greater than -1
            data[col] = np.log1p(data[col])
            print(f"Applied log transformation to column '{col}'.")
            break
          elif fix_method == "sqrt":
            # sqrt is only defined for non-negative values
            data[col] = np.sqrt(data[col])
            print(f"Applied square root transformation to column '{col}'.")
            break
          elif fix_method == "none":
            print(f"Skewness in '{col}' remains unaddressed.")
            break
          else:
            print("Invalid choice. Please choose 'log', 'sqrt', or 'none'.")

      else:
        print(f"Skewness in '{col}' remains unaddressed.")
    

  # User prompt for scaling/normalization (if applicable)
  if len(numerical_cols) > 0:

    print("Here's a brief explanation of the available scaling/normalization methods:")
    print("  - Standard scaling: This method transforms features by subtracting the mean and dividing by the standard deviation.")
    print("    This results in features centered around zero with a standard deviation of 1.")
    print("    It's a common choice for algorithms that work best with centered, comparably scaled features (e.g., Logistic Regression, Support Vector Machines).")
    print("  - Min-max scaling: This method scales each feature to a specific range, typically between 0 and 1.")
    print("    It achieves this by subtracting the minimum value and then dividing by the difference between the maximum and minimum values in the feature.")
    print("    This can be useful for algorithms that are sensitive to the scale of features (e.g., K-Nearest Neighbors).")
    print("**Choosing the right method depends on your data and the algorithm you're using.**")
    print("  - If you're unsure about the underlying distribution of your data, standard scaling might be a safer choice as it doesn't make assumptions about normality.")
    print("  - If your algorithm expects features within a fixed, bounded range (e.g., 0 to 1), min-max scaling might be preferable.")
    print("Consider the characteristics of your data and algorithm when making your decision. You can also experiment with both methods")
    print("and compare the results using model performance metrics to see which one works best for your specific case.")

    action = input("Do you want to scale or normalize the numerical features (y/n)? ").lower()
    if action == "y":

      while True:  # Loop until a valid choice is made
        method = input("Choose scaling/normalization method (standard/minmax/skip): ").lower()
        if method in ["standard", "minmax"]:
          if method == "standard":
            scaler = StandardScaler()
            data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
            print(f"Applied standard scaling to numerical features.")
          else:
            scaler = MinMaxScaler(feature_range=(0, 1))
            data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
            print(f"Applied min-max scaling to numerical features (range 0-1).")
          break  # Exit the loop if a valid choice is made
        elif method == "skip":
          print("Skipping scaling/normalization.")
          break
        else:
          print("Invalid choice. Please choose 'standard', 'minmax', or 'skip'.")


  if not skewed_cols:
    print("No significant skewness detected in numerical columns.")
  return data


# Example usage:
preprocessed_data = scale(data.copy())  # Operate on a copy to avoid modifying original data


The `scale` function offers a user-guided approach to data preprocessing, addressing both skewness in numerical features and optional scaling/normalization. Here's a breakdown of its functionalities:

1. **Skewness Identification:**
   - It identifies features with significant skewness (asymmetry in the distribution) based on a fixed threshold (`skewness_threshold = 0.5`, which can be adjusted in the code).
   - For skewed features, it informs the user about the skewness value and prompts them to address it.

2. **Skewness Correction (Optional):**
   - If the user chooses to address skewness, it offers options for log or square root transformation to potentially reduce skewness.
   - The chosen transformation is applied to the specific feature(s).

3. **Scaling/Normalization (Optional):**
   - After addressing skewness (or if no skewness is found), it prompts the user to decide if they want to scale or normalize the numerical features.
   - If the user chooses to proceed, it offers options for standard scaling (centering and scaling features to have zero mean and unit variance) or min-max scaling (scaling features to a specific range, typically 0-1).
   - The chosen scaling/normalization method is applied using scikit-learn's `StandardScaler` or `MinMaxScaler` to transform the numerical features.


Overall, this function simplifies data preprocessing by guiding the user through common steps like handling skewness and applying scaling/normalization techniques. It provides flexibility and explanations, making it easier to understand and customize the preprocessing process.
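
The same steps can be tried without the interactive prompts. The sketch below uses synthetic data (the `income` and `age` columns are made up) and applies `np.log1p` followed by both scalers, so the effect of each step can be inspected directly; it is an illustration of the technique, not a replacement for `scale`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic demo data (hypothetical column names).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=500),  # strongly right-skewed
    "age": rng.normal(loc=40, scale=10, size=500),        # roughly symmetric
})

# Step 1: reduce right skew with log(x + 1).
skew_before = demo["income"].skew()
demo["income"] = np.log1p(demo["income"])
skew_after = demo["income"].skew()
print(f"income skewness: {skew_before:.2f} -> {skew_after:.2f}")

# Step 2: scale the numeric columns with each method for comparison.
numeric_cols = demo.select_dtypes(include=[np.number]).columns.tolist()

standardized = pd.DataFrame(
    StandardScaler().fit_transform(demo[numeric_cols]),
    columns=numeric_cols,
    index=demo.index,
)
minmaxed = pd.DataFrame(
    MinMaxScaler(feature_range=(0, 1)).fit_transform(demo[numeric_cols]),
    columns=numeric_cols,
    index=demo.index,
)

print(standardized.describe().loc[["mean", "std"]])
print(minmaxed.describe().loc[["min", "max"]])
```

In a modeling workflow the scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking information from the test data.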


## **Creating Interaction Features**

In [None]:
def create_interaction_features(data, categorical_cols=None):
  """
  Creates interaction features from categorical columns in a DataFrame.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.
      categorical_cols (list, optional): A list of column names to consider for interaction features. If None, all categorical columns will be used. Defaults to None.

  Returns:
      pandas.DataFrame: The DataFrame with additional interaction features.
  """

  if categorical_cols is None:
    categorical_cols = [col for col in data.columns if data[col].dtype == 'category']

  if not categorical_cols:
    print("No categorical columns found in the data. Skipping interaction feature creation.")
    return data

  # Display recommendations before prompting user
  print("** Recommendations for Interaction Features:**")
  print("- Interaction features can capture complex relationships, potentially improving model performance.")
  print("- However, creating all possible interactions can lead to data sparsity and longer training times.")
  print("- Consider your domain knowledge to prioritize specific interactions.")
  print("- Start with a smaller set and use feature selection techniques for better interpretability.")


  # Get user confirmation to proceed
  action = input("Do you want to create interaction features from categorical columns (y/n)? ").lower()
  if action != "y":
    print("Skipping interaction feature creation.")
    return data

  # Prompt user to choose specific columns or create all possible interactions
  while True:
    choice = input("Choose interaction feature creation method (all/specific): ").lower()
    if choice in ["all", "specific"]:
      break
    else:
      print("Invalid choice. Please choose 'all' or 'specific'.")

  if choice == "all":
    # Create one interaction feature per unordered pair of columns
    # (avoids creating both col1_x_col2 and the redundant col2_x_col1)
    for i, col1 in enumerate(categorical_cols):
      for col2 in categorical_cols[i + 1:]:
        data[f"{col1}_x_{col2}"] = data[col1].astype(str) + "_" + data[col2].astype(str)
    print("Created all possible pairwise interaction features.")

  else:
    # Prompt user to choose specific columns for interaction
    selected_cols = []
    while True:
      # Column names are case-sensitive, so only surrounding whitespace is stripped
      col_name = input("Enter a categorical column name (or 'done' to finish): ").strip()
      if col_name == "done":
        if not selected_cols:
          print("No columns selected. Skipping interaction feature creation.")
        else:
          # One interaction feature per unordered pair of selected columns
          for i, col1 in enumerate(selected_cols):
            for col2 in selected_cols[i + 1:]:
              data[f"{col1}_x_{col2}"] = data[col1].astype(str) + "_" + data[col2].astype(str)
          print(f"Created interaction features for selected columns: {', '.join(selected_cols)}")
        break
      elif col_name in categorical_cols:
        selected_cols.append(col_name)
        print(f"Column '{col_name}' added for interaction features.")
      else:
        print(f"Invalid column name: '{col_name}'. Please choose from categorical columns.")

  return data

create_interaction_features(data.copy())

The `create_interaction_features` function offers a user-guided approach to creating interaction features from categorical columns in a DataFrame. Here's a breakdown of its functionalities:

- **Input:** It takes a DataFrame (`data`) and an optional list of categorical column names (`categorical_cols`).
- **Categorical Column Identification:** If no `categorical_cols` are provided, it identifies all categorical columns in the DataFrame.
- **User Confirmation:** It prompts the user to confirm if they want to create interaction features.
- **Interaction Method Selection:** If the user chooses to proceed, it offers two options:
    1. **Create All Pairwise Interactions:** This creates interaction features for all unique combinations of categorical columns.
    2. **Create Interactions for Specific Columns:** This allows the user to select specific categorical columns for interaction feature generation.
- **Feature Creation:** Based on the user's choice, it adds one new column per column pair, named "<col1>_x_<col2>" (e.g., "column1_x_column2"), whose values are the two category labels joined by an underscore.
- **Informative Messages:** It provides clear messages throughout the process, explaining the purpose, choices available, and actions taken based on user input.
- **Output:** It returns the modified DataFrame with additional interaction features (if created).

Overall, this function simplifies interaction feature creation by guiding the user through the process and offering flexibility in choosing the level of interaction desired. It enhances user control and understanding during data preprocessing.
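
A non-interactive version of the pairwise step can be written with `itertools.combinations`, which yields each unordered pair exactly once. The helper name `add_pairwise_interactions` and the demo columns are hypothetical and used only for illustration.

```python
from itertools import combinations

import pandas as pd

# Small demo frame with three categorical columns (made-up values).
demo = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "L"],
    "shape": ["round", "square", "round"],
}).astype("category")


def add_pairwise_interactions(df, cols):
    """Return a copy of df with one '<a>_x_<b>' column per unordered pair."""
    out = df.copy()
    for col1, col2 in combinations(cols, 2):
        out[f"{col1}_x_{col2}"] = out[col1].astype(str) + "_" + out[col2].astype(str)
    return out


with_interactions = add_pairwise_interactions(demo, ["color", "size", "shape"])
print(with_interactions.columns.tolist())
print(with_interactions[["color_x_size", "color_x_shape", "size_x_shape"]])
```

With `k` columns this produces `k * (k - 1) / 2` new features, so the count grows quickly; that is the sparsity concern raised in the recommendations printed by the function.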


## **Feature Binning**

In [None]:
def create_feature_bins(data, continuous_cols=None, n_bins=5):
  """
  Creates bins (intervals) for continuous features in a DataFrame.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.
      continuous_cols (list, optional): A list of column names to bin. If None, all continuous columns will be considered. Defaults to None.
      n_bins (int, optional): The number of bins to create for each feature. Defaults to 5.

  Returns:
      pandas.DataFrame: The DataFrame with new bin features (categorical).
  """

  if continuous_cols is None:
    continuous_cols = [col for col in data.columns if data[col].dtype in ['float64', 'int64']]

  if not continuous_cols:
    print("No continuous features found in the data. Skipping binning.")
    return data

  # Get user confirmation to proceed
  action = input("Do you want to create bins for continuous features (y/n)? ").lower()
  if action != "y":
    print("Skipping binning.")
    return data

  # Allow user to choose specific columns or bin all continuous features
  while True:
    choice = input("Choose binning method (all/specific): ").lower()
    if choice in ["all", "specific"]:
      break
    else:
      print("Invalid choice. Please choose 'all' or 'specific'.")

  if choice == "all":
    # Bin all continuous features
    for col in continuous_cols:
      bins = pd.cut(data[col], bins=n_bins, labels=False) + 1  # Add 1 for informative bin names
      data[f"binned_{col}"] = bins.astype("category")
      print(f"Created bins for feature '{col}'.")

  else:
    # Prompt user to choose specific columns for binning
    selected_cols = []
    while True:
      # Column names are case-sensitive, so only surrounding whitespace is stripped
      col_name = input("Enter a continuous feature name (or 'done' to finish): ").strip()
      if col_name == "done":
        if not selected_cols:
          print("No columns selected. Skipping binning.")
        else:
          for col in selected_cols:
            bins = pd.cut(data[col], bins=n_bins, labels=False) + 1  # Add 1 for informative bin names
            data[f"binned_{col}"] = bins.astype("category")
            print(f"Created bins for feature '{col}'.")
        break
      elif col_name in continuous_cols:
        selected_cols.append(col_name)
        print(f"Feature '{col_name}' added for binning.")
      else:
        print(f"Invalid column name: '{col_name}'. Please choose from continuous features.")

  return data

create_feature_bins(data.copy())

The `create_feature_bins` function offers an interactive approach to creating bins (intervals) for continuous features in a DataFrame. Here's a breakdown of its functionalities:

- **Input:** It takes a DataFrame (`data`), an optional list of continuous column names (`continuous_cols`), and the desired number of bins per feature (`n_bins`).
- **Continuous Feature Identification:** If no `continuous_cols` are provided, it identifies all continuous features (numeric data types).
- **User Confirmation:** It prompts the user to confirm if they want to create bins for continuous features.
- **Binning Method Selection:** If the user chooses to proceed, it allows them to choose between:
   1. **Binning All Continuous Features:** This creates bins for all identified continuous features.
   2. **Binning Specific Features:** This allows the user to select specific continuous features for binning.
- **Feature Selection for Binning:** If specific features are chosen, it guides the user through selecting features for binning.  
- **Bin Creation:** Based on the chosen method and user selection, it creates new categorical features in the DataFrame named "binned_<original_feature_name>". Each new feature represents the bin (interval) a data point falls into for the corresponding continuous feature.
- **Informative Messages:** It provides clear messages throughout the process, explaining the purpose, choices available, and actions taken based on user input.
- **Output:** It returns the modified DataFrame with additional binned features (categorical).

Overall, this function simplifies feature binning by providing a user-friendly interface and allowing for flexibility in selecting features and binning approach. It empowers users to participate in the data preprocessing step and potentially improve the model's ability to capture non-linear relationships in the data. 
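
`create_feature_bins` uses `pd.cut`, which splits the value range into equal-width intervals. The sketch below contrasts that with `pd.qcut`, which is not used in the function above and is shown here only as an alternative: it produces bins with roughly equal counts. The `ages` values are made up.

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 67, 90], name="age")

# Equal-width bins: each bin covers the same span of values.
equal_width = pd.cut(ages, bins=3, labels=False) + 1

# Equal-frequency bins: each bin holds about the same number of rows.
equal_freq = pd.qcut(ages, q=3, labels=False) + 1

comparison = pd.DataFrame({
    "age": ages,
    "equal_width": equal_width,
    "equal_freq": equal_freq,
})
print(comparison)
print(equal_width.value_counts().sort_index().tolist())  # [5, 2, 2]
print(equal_freq.value_counts().sort_index().tolist())   # [3, 3, 3]
```

On skewed data, equal-width bins can leave most rows in one or two bins, as the counts here show, so the choice between the two depends on whether the interval boundaries or the bin populations matter more.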

## **Feature Creation**

In [None]:
import sympy as sp  # Symbolic math library (optional)

def create_custom_features(data):
  """
  Allows users to define and create custom features from existing features.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.

  Returns:
      pandas.DataFrame: The DataFrame with additional custom features.
  """

  print("** Feature Creation Options:")
  print("- Define a new feature using existing features with mathematical expressions.")
  print("- Create interaction features from categorical columns.")  # Reference existing function

  while True:
    choice = input("Choose a feature creation method (expression/interaction/none): ").lower()
    if choice in ["expression", "interaction", "none"]:
      break
    else:
      print("Invalid choice. Please choose 'expression', 'interaction', or 'none'.")

  if choice == "expression":
    # Feature creation using expressions
    while True:
      expression = input("Enter a mathematical expression using existing feature names (or 'done' to finish): ")
      if expression == "done":
        break

      # Validate expression using symbolic math library (optional)
      try:
        sp.sympify(expression)  # Raises SympifyError if the expression cannot be parsed
      except (sp.SympifyError, TypeError):
        print("Invalid expression. Please use existing feature names and basic mathematical operators (+, -, *, /).")
        continue

      # Create and add the new feature
      new_feature_name = input("Enter a name for the new feature: ")
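      try:
        # DataFrame.eval resolves names in the expression against the column names;
        # the built-in eval() would look for Python variables and fail with NameError
        data[new_feature_name] = data.eval(expression)
        print(f"Created new feature: '{new_feature_name}'")
        break  # Exit the loop once a feature has been created
      except (NameError, SyntaxError, TypeError, ValueError) as e:
        print(f"Error evaluating expression: {e}. Please check for typos or invalid syntax.")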

  elif choice == "interaction":
    # Call the existing create_interaction_features function (assuming it's defined)
    data = create_interaction_features(data.copy())  # Avoid modifying original data

  else:
    print("Skipping custom feature creation.")

  return data

create_custom_features(data.copy())

The `create_custom_features` function provides an interactive interface for users to define and create custom features from existing features in a DataFrame. Here's a breakdown of its functionalities:

- **Input:** It takes a DataFrame (`data`) containing the features to potentially use for custom feature creation.
- **Feature Creation Options:** It presents two main approaches for creating custom features:
   1. **Expression-based Feature Creation:** Users can define a new feature using mathematical expressions involving existing feature names and basic operators (+, -, *, /).
   2. **Interaction Feature Creation (Optional):** It offers the option to leverage the `create_interaction_features` function (assuming it's defined elsewhere) to create interaction features from categorical columns.
- **User Choice:** The user selects their preferred method for creating custom features.
- **Expression Definition (if chosen):**
   - It guides the user through defining a mathematical expression.
   - It performs basic validation on the expression using the `sympy` library to catch malformed expressions before evaluation. This step is optional and can be removed if `sympy` is not installed.
   - It prompts the user for a name for the newly created feature.
   - It evaluates the user-defined expression on the DataFrame and adds the result as a new feature.
- **Interaction Feature Creation (if chosen):**
   - It calls the existing `create_interaction_features` function (assuming it's defined) to handle interaction feature creation. A copy of the DataFrame is passed, so the input DataFrame is not modified in place.
- **Output:** It returns the modified DataFrame with any newly created custom features.

Overall, this function empowers users to participate in feature engineering by defining new features based on their domain knowledge and understanding of the data. It offers flexibility in the approach and provides some guidance for expression-based feature creation. Remember to exercise caution with expression evaluation and potentially implement additional validation if needed for your specific use case.
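
Expression-based feature creation can be sketched with `DataFrame.eval`, which looks up bare names in the expression among the DataFrame's columns. The helper `add_expression_feature` and the demo columns are hypothetical; untrusted input would still need validation beyond what is shown here.

```python
import pandas as pd

demo = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})


def add_expression_feature(df, name, expression):
    """Return a copy of df with a new column computed from a column expression."""
    out = df.copy()
    out[name] = out.eval(expression)
    return out


result = add_expression_feature(demo, "revenue", "price * quantity")
print(result)

# An expression that refers to a column that does not exist is rejected.
try:
    add_expression_feature(demo, "bad", "price * missing_column")
    rejected = False
except Exception as exc:
    rejected = True
    print(f"Rejected: {type(exc).__name__}")
```

Returning a copy keeps the helper free of side effects, which makes it easier to try several candidate expressions against the same starting DataFrame.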

## **Encoding**

#### **One-Hot Encoding**

In [None]:
def create_one_hot_encoding(data, categorical_cols=None):
  """
  Creates one-hot encoded features from categorical columns in a DataFrame.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.
      categorical_cols (list, optional): A list of column names to encode. If None, all categorical columns will be considered. Defaults to None.

  Returns:
      pandas.DataFrame: The DataFrame with additional one-hot encoded features.
  """

  if categorical_cols is None:
    categorical_cols = [col for col in data.columns if data[col].dtype == "object"]

  if not categorical_cols:
    print("No categorical features found in the data. Skipping one-hot encoding.")
    return data

  print("One-hot encoding is a technique for representing categorical features (like 'color' or 'size') as separate binary features.")
  print("Imagine a feature 'color' with values 'red', 'green', and 'blue'. One-hot encoding would create three new features:")
  print("  - 'color_red' (1 if the color is red, 0 otherwise)")
  print("  - 'color_green' (1 if the color is green, 0 otherwise)")
  print("  - 'color_blue' (1 if the color is blue, 0 otherwise)")
  print("This allows machine learning models to understand the relationships between these categories more effectively.")
  print("However, one-hot encoding can increase the number of features in your data significantly, which might require more computational resources.")


  # Get user confirmation to proceed
  action = input("Do you want to create one-hot encoded features (y/n)? ").lower()
  if action != "y":
    print("Skipping one-hot encoding.")
    return data

  # Informative message about one-hot encoding
  print("One-hot encoding will create a separate binary feature for each unique category in a categorical column.")

  # Option to choose all or specific categorical features
  while True:
    choice = input("Choose encoding method (all/specific): ").lower()
    if choice in ["all", "specific"]:
      break
    else:
      print("Invalid choice. Please choose 'all' or 'specific'.")

  if choice == "all":
    # Encode all categorical features
    data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)
    print("Created one-hot encoded features for all categorical columns.")

  else:
    # Prompt user to choose specific columns for encoding
    selected_cols = []
    while True:
      # Column names are case-sensitive, so only surrounding whitespace is stripped
      col_name = input("Enter a categorical feature name (or 'done' to finish): ").strip()
      if col_name == "done":
        if not selected_cols:
          print("No columns selected. Skipping one-hot encoding.")
        else:
          data = pd.get_dummies(data, columns=selected_cols, drop_first=True)
          print(f"Created one-hot encoded features for selected columns.")
        break
      elif col_name in categorical_cols:
        selected_cols.append(col_name)
        print(f"Feature '{col_name}' added for one-hot encoding.")
      else:
        print(f"Invalid column name: '{col_name}'. Please choose from categorical features.")

  return data

create_one_hot_encoding(data.copy())

The provided function, `create_one_hot_encoding`, effectively addresses one-hot encoding for categorical features in a DataFrame. Here's a breakdown of its functionality:

**Functionality:**

1. **Input:**
    - `data`: The DataFrame containing the data.
    - `categorical_cols (optional)`: A list of column names to encode (defaults to all categorical columns).

2. **Categorical Feature Identification:**
    - If `categorical_cols` is not provided, it identifies all object data type columns as potential categorical features.
    - It checks if any categorical features exist and informs the user if none are found.

3. **Explanation of One-Hot Encoding:**
    - If categorical features are present, it provides a clear explanation of one-hot encoding, including the creation of separate binary features for each unique category.
    - It highlights both the benefit (improved model understanding) and the drawback (increased number of features) of one-hot encoding.

4. **User Confirmation:**
    - It asks the user for confirmation ("y/n") to proceed with creating one-hot encoded features.

5. **Informative Message:**
    - If the user confirms, it provides another informative message explaining how each categorical feature will have separate binary features.

6. **Choice of Encoding Method:**
    - It prompts the user to choose between encoding all categorical features or selecting specific ones ("all/specific").
    - It uses a `while` loop to ensure a valid choice ("all" or "specific").

7. **Encoding Process:**
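    - **All Categorical Features:** If "all" is chosen, it uses `pandas.get_dummies` to encode all categorical features specified in `categorical_cols`. It also sets `drop_first=True` to avoid the dummy variable trap (perfectly collinear, redundant columns). As a result, a column with k categories yields k-1 binary features, so the three-valued 'color' example in the printed explanation would produce two new columns rather than three.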
    - **Specific Categorical Features:** If "specific" is chosen, it uses a `while` loop to allow the user to enter column names one at a time. Valid selections are added to `selected_cols`. It uses `pandas.get_dummies` to encode only the chosen features with `drop_first=True`.

8. **Output:**
    - It returns the DataFrame (`data`) with additional one-hot encoded features.

**Additional Notes:**

- The example call passes a copy of the DataFrame (`data.copy()`), so the original data is left unchanged; the function itself does not make a copy.
- The informative messages and user interaction make the process user-friendly and transparent.

Overall, this function provides a well-structured and informative approach to creating one-hot encoded features in Python.
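
As a minimal sketch of the encoding step described above, the snippet below calls `pandas.get_dummies` with `drop_first=True` on a small made-up DataFrame. The column names (`fruit`, `price`) and values are assumptions chosen for illustration and do not come from the notebook's data.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "orange", "apple"],
    "price": [1.0, 0.5, 0.8, 1.1],
})

# drop_first=True drops the first category ('apple' in sorted order),
# so 'apple' rows are represented by all remaining dummy columns being 0/False.
encoded = pd.get_dummies(df, columns=["fruit"], drop_first=True)

print(encoded.columns.tolist())  # ['price', 'fruit_banana', 'fruit_orange']
```

Three categories produce only two new columns, because the dropped category is implied when the other two are both zero.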

#### **Label Encoding**

In [None]:
import sklearn.preprocessing  # provides LabelEncoder, used below


def create_label_encoding(data, categorical_cols=None):
  """
  Creates label encoded features from categorical columns in a DataFrame.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.
      categorical_cols (list, optional): A list of column names to encode. If None, all categorical columns will be considered. Defaults to None.

  Returns:
      pandas.DataFrame: The DataFrame with label encoded features (integers).
  """

  if categorical_cols is None:
    categorical_cols = [col for col in data.columns if data[col].dtype == "object"]

  if not categorical_cols:
    print("No categorical features found in the data. Skipping label encoding.")
    return data

  print("Label encoding is a simpler way to handle categorical features. It assigns a unique number to each different category.")
  print("For example, a feature 'fruit' with values 'apple', 'banana', and 'orange' might be encoded as:")
  print("  - apple: 0")
  print("  - banana: 1")
  print("  - orange: 2")
  print("This allows machine learning models to process the data more easily. However, it's important to be aware of a potential drawback:")
  print("  - Many models read the integers as an ordering, even if the categories have no inherent order.")
  print("For example, 'orange' (encoded as 2) might seem 'greater' than 'apple' (encoded as 0) to the model, even though they are just different fruits.")
  print("If your categories have a natural order, label encoding can be a reasonable choice. If they have no order, consider one-hot encoding instead.")


  # Get user confirmation to proceed
  action = input("Do you want to create label encoded features (y/n)? ").lower()
  if action != "y":
    print("Skipping label encoding.")
    return data

  # Informative message about label encoding
  print("Label encoding assigns a unique integer value to each category in a categorical column.")
  print("** Caution:** This might introduce unintended ordering between categories.")

  # Option to choose all or specific categorical features
  while True:
    choice = input("Choose encoding method (all/specific): ").lower()
    if choice in ["all", "specific"]:
      break
    else:
      print("Invalid choice. Please choose 'all' or 'specific'.")

  if choice == "all":
    # Encode all categorical features
    for col in categorical_cols:
      le = sklearn.preprocessing.LabelEncoder()
      data[col] = le.fit_transform(data[col])
    print("Created label encoded features for all categorical columns.")

  else:
    # Prompt user to choose specific columns for encoding
    selected_cols = []
    while True:
      # Keep the original casing so names match the DataFrame's columns
      col_name = input("Enter a categorical feature name (or 'done' to finish): ").strip()
      if col_name.lower() == "done":
        if not selected_cols:
          print("No columns selected. Skipping label encoding.")
        else:
          for col in selected_cols:
            le = sklearn.preprocessing.LabelEncoder()
            data[col] = le.fit_transform(data[col])
          print("Created label encoded features for selected columns.")
        break
      elif col_name in categorical_cols:
        selected_cols.append(col_name)
        print(f"Feature '{col_name}' added for label encoding.")
      else:
        print(f"Invalid column name: '{col_name}'. Please choose from categorical features.")

  return data

create_label_encoding(data.copy())

Here is a summary of the `create_label_encoding` function:

**Functionality:**

1. **Input:**
    - `data`: The DataFrame containing the data.
    - `categorical_cols (optional)`: A list of column names to encode (defaults to all categorical columns).

2. **Categorical Feature Identification:**
    - If `categorical_cols` is not provided, it identifies all object data type columns as potential categorical features.
    - It checks if any categorical features exist and informs the user if none are found.

3. **Explanation of Label Encoding:**
    - If categorical features are present, it provides a clear explanation of label encoding, including assigning unique integer values to each category.
    - It showcases an example for clarity.
    - It highlights the benefit (easier processing for models) and the drawback (potential introduction of unintended order) of label encoding.
    - It emphasizes the importance of considering the inherent order of categories when choosing label encoding.

4. **User Confirmation:**
    - It asks the user for confirmation ("y/n") to proceed with creating label encoded features.

5. **Informative Message:**
    - If the user confirms, it provides another informative message explaining the integer assignment to categories and a cautionary note about potential order assumptions.

6. **Choice of Encoding Method:**
    - It prompts the user to choose between encoding all categorical features or selecting specific ones ("all/specific").
    - It uses a `while` loop to ensure a valid choice ("all" or "specific").

7. **Encoding Process:**
    - **All Categorical Features:** If "all" is chosen, it iterates through each column in `categorical_cols`. Inside the loop, it creates a `LabelEncoder` object from `sklearn.preprocessing`. It uses `fit_transform` on the corresponding column in the DataFrame to encode the categories and update the column with the encoded values.
    - **Specific Categorical Features:** If "specific" is chosen, it uses a `while` loop to allow the user to enter column names one at a time. Valid selections are added to `selected_cols`. It follows a similar approach as "all" but iterates only through `selected_cols` to encode specific features.

8. **Output:**
    - It returns the DataFrame (`data`) with the categorical columns replaced by their label encoded integer values.

**Additional Notes:**

- The function modifies the DataFrame it receives in place; the call above passes `data.copy()`, so the original data is left untouched.
- The informative messages and user interaction make the process user-friendly and transparent.

Overall, this function provides a well-structured and informative approach to creating label encoded features in Python. It effectively explains the concepts and guides the user through the process while highlighting potential limitations.
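
The ordering caveat can be made concrete with a short sketch. It assumes a made-up `sizes` column and shows that `LabelEncoder` assigns integers in sorted (alphabetical) order of the labels, not by meaning. `OrdinalEncoder` with an explicit `categories` list is shown as one possible alternative when the categories do have a natural order; it is not used by the function above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical data: sizes have a natural order small < medium < large.
sizes = pd.Series(["medium", "small", "large", "small"], name="size")

# LabelEncoder sorts the labels, so the codes follow alphabetical order.
le = LabelEncoder()
codes = le.fit_transform(sizes).tolist()
mapping = {str(label): i for i, label in enumerate(le.classes_)}
print(mapping)  # {'large': 0, 'medium': 1, 'small': 2}
print(codes)    # [1, 2, 0, 2]

# OrdinalEncoder lets the caller state the intended order explicitly.
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered = oe.fit_transform(sizes.to_frame()).ravel().astype(int).tolist()
print(ordered)  # [1, 0, 2, 0]
```

With `LabelEncoder`, 'large' receives the smallest code, which is the opposite of what a reader might expect for an ordered feature.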

## **Handling Class Imbalance**

In [None]:
def handle_class_imbalance(data, target_col):
  """
  Provides options to handle class imbalance in a dataset.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.
      target_col (str): The name of the column containing the target variable.

  Returns:
      pandas.DataFrame: The DataFrame with potentially balanced classes.
  """

  # Display class distribution
  print("** Class Distribution:")
  class_counts = data[target_col].value_counts().sort_values(ascending=False)
  print(class_counts)

  # Check for imbalance
  majority_class = class_counts.index[0]
  majority_count = class_counts.iloc[0]
  # Rough heuristic: flag imbalance when the largest class holds more than half of all rows
  imbalanced = majority_count / len(data) > 0.5

  if not imbalanced:
    print("Class distribution seems balanced. Skipping imbalance handling.")
    return data

  # Explain class imbalance
  print("\n** What is Class Imbalance?**")
  print("In machine learning, class imbalance occurs when a classification task has a significant skew")
  print("in the number of examples between different classes. Typically, one class (the majority class)")
  print("has many more examples than the other classes (the minority class).")
  print("This imbalance can lead to models that are biased towards the majority class and perform poorly")
  print("on the minority class.")

  # Get user choice for handling imbalance
  print("** Handling Class Imbalance:")
  print("- Undersampling (reduce majority class size)")
  print("  - Recommended if the majority class might be noisy or irrelevant.")
  print("- Oversampling (increase minority class size)")
  print("  - Recommended if the minority class is informative and you have enough data.")
  print("  - We will use the Synthetic Minority Oversampling Technique (SMOTE) for oversampling to avoid overfitting.")
  print("- No action (continue with imbalanced data)")
  print("  - Only recommended if the class imbalance doesn't significantly affect the model.")


  while True:
    choice = input("Choose an option (undersample/oversample/none): ").lower()
    if choice in ["undersample", "oversample", "none"]:
      break
    else:
      print("Invalid choice. Please choose 'undersample', 'oversample', or 'none'.")

  if choice == "none":
    print("Continuing with imbalanced data.")
    return data

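  # Handle undersampling or oversampling based on user choice
  print(f"Selected '{choice}'.")
  try:
    sampling_ratio = float(input("Enter desired sampling ratio (between 0 and 1): "))
  except ValueError:
    sampling_ratio = 0.0  # Non-numeric input is rejected by the range check below
  if sampling_ratio <= 0 or sampling_ratio > 1:
    print("Invalid sampling ratio. Please enter a value between 0 and 1.")
    return data  # Avoid errors with invalid ratio

  from imblearn.under_sampling import RandomUnderSampler  # Import for undersampling
  from imblearn.over_sampling import SMOTE  # Import for oversampling

  # imblearn samplers take features and target separately and return an (X, y) pair
  X = data.drop(columns=[target_col])
  y = data[target_col]

  if choice == "undersample":
    # Keep the requested fraction of the majority class (at least one row)
    n_keep = max(1, int(sampling_ratio * majority_count))
    rus = RandomUnderSampler(sampling_strategy={majority_class: n_keep})
    X_res, y_res = rus.fit_resample(X, y)
    print(f"Undersampled majority class to {n_keep} samples.")
  else:
    # 'auto' grows every minority class to the majority size, so the ratio is not used here.
    # SMOTE interpolates between rows, so all feature columns must be numeric.
    sm = SMOTE(sampling_strategy="auto")
    X_res, y_res = sm.fit_resample(X, y)
    print("Oversampled minority class to match the majority class size.")

  # Reassemble features and target into one DataFrame
  data = X_res.copy()
  data[target_col] = list(y_res)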

  # Display final class distribution
  print("** Final Class Distribution:")
  class_counts = data[target_col].value_counts().sort_values(ascending=False)
  print(class_counts)

  return data

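# The function needs the target column name as its second argument
target_column = input("Enter the name of the column containing the target variable: ")
handle_class_imbalance(data.copy(), target_column)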

Here is a summary of the `handle_class_imbalance` function:

**Functionality:**

1. **Input:**
    - `data`: The DataFrame containing the data.
    - `target_col`: The name of the column containing the target variable (class labels).

2. **Initial Analysis:**
    - It displays the class distribution using `value_counts` to show the number of examples for each class.
    - It calculates the ratio of the majority class size to the total number of samples to identify imbalance (ratio > 0.5 signifies imbalance).

3. **Class Imbalance Explanation (if applicable):**
    - If the data is imbalanced, it provides a clear explanation of class imbalance, its consequences (biased models towards the majority class), and its impact on model performance.

4. **User Choice for Handling Imbalance (if applicable):**
    - If the data is imbalanced, it presents the user with options to handle the imbalance:
        - Undersampling (reduce majority class size) - recommended for noisy or irrelevant majority class.
        - Oversampling (increase minority class size) - recommended for informative minority class with sufficient data. It mentions using SMOTE for oversampling to avoid overfitting.
        - No action (continue with imbalanced data) - only recommended if the imbalance has minimal impact on the model.
    - It uses a `while` loop to ensure a valid choice ("undersample", "oversample", or "none").

5. **Handling Imbalance (if applicable):**
    - Based on the user's choice:
        - **No Action:** If "none" is chosen, it informs the user and returns the original data.
        - **Undersampling or Oversampling:** 
            - It prompts the user for a desired sampling ratio (between 0 and 1), which is applied when undersampling.
            - It imports the necessary classes from `imblearn.under_sampling` and `imblearn.over_sampling`.
            - If undersampling is chosen, it creates a `RandomUnderSampler` object with the desired sampling ratio for the majority class and uses `fit_resample` to reduce its size.
            - If oversampling is chosen, it creates a `SMOTE` object and uses `fit_resample` to increase the minority class size to match the majority. The features and the target are passed to `fit_resample` separately, and SMOTE requires numeric features.
            - It provides feedback on the sampling action performed.

6. **Final Class Distribution:**
    - It displays the final class distribution after any potential balancing actions.

7. **Output:**
    - It returns the DataFrame (`data`) with potentially balanced classes (depending on the user's choice).

**Additional Notes:**

- The function does not copy its input itself; the call above passes `data.copy()`, so the original DataFrame is left untouched.
- User interaction allows for informed decision-making about handling class imbalance.
- Informative messages guide the user through the process.

Overall, this function provides a well-structured and informative approach to addressing class imbalance in Python. It effectively balances user guidance with clear explanations and execution.
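
To illustrate what the sampling ratio means for undersampling, here is a pandas-only sketch. It is a simplified stand-in for `RandomUnderSampler`, not the library's implementation, and the helper name `undersample_majority` and the toy columns are assumptions made for this example.

```python
import pandas as pd


def undersample_majority(df, target_col, ratio, random_state=0):
    """Keep `ratio` of the majority-class rows and all rows of the other classes."""
    counts = df[target_col].value_counts()
    majority = counts.idxmax()
    n_keep = max(1, int(ratio * counts.max()))  # never drop the class entirely
    majority_rows = df[df[target_col] == majority].sample(
        n=n_keep, random_state=random_state
    )
    other_rows = df[df[target_col] != majority]
    return pd.concat([majority_rows, other_rows]).reset_index(drop=True)


# Hypothetical toy data: 8 rows of class 'a', 2 rows of class 'b'.
df = pd.DataFrame({"x": range(10), "label": ["a"] * 8 + ["b"] * 2})
balanced = undersample_majority(df, "label", ratio=0.5)

print(balanced["label"].value_counts().to_dict())  # {'a': 4, 'b': 2}
```

A ratio of 0.5 keeps half of the majority rows and leaves the minority class unchanged, which mirrors the `int(sampling_ratio * majority_count)` calculation in the function.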

## **Feature Selection**

In [None]:
target_column = input("Enter the name of the column containing the target variable (the variable you wish to predict/classify):")

def feature_selection(data, target_col):
  """
  Provides options for feature selection in machine learning tasks.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.
      target_col (str): The name of the column containing the target variable.

  Returns:
      pandas.DataFrame: The DataFrame with potentially reduced features.
  """

  # Display initial information
  print("** Feature Selection helps identify the most relevant features for your machine learning model.")
  print("It can improve model performance, reduce training time, and make the model easier to interpret.")

  # Get user preference for selection method
  print("\n** Feature Selection Methods:")
  print("- Filter Methods (based on statistical tests for individual features)")
  print("- Wrapper Methods (use a machine learning model to evaluate feature subsets)")
  print("- Embedded Methods (integrated within a machine learning model)")
  print("\n** We will focus on Filter Methods for this session.**")

  while True:
    choice = input("Do you want to proceed with Filter Methods (y/n)? ").lower()
    if choice in ["y", "n"]:
      break
    else:
      print("Invalid choice. Please choose 'y' or 'n'.")

  if choice == "n":
    print("Skipping feature selection. Using all features.")
    return data

  # Filter Method Selection
  print("\n** Filter Methods Options:")
  print("- Select K Best (choose a specific number of features)")
  print("- Select Percentile (choose a percentage of features)")
  print("** We will use Select K Best for this session.**")

  # Select K Best configuration
  while True:
    try:
      k = int(input("Enter the desired number of features to select (integer): "))
      if k > 0:
        break
      else:
        print("Invalid number. Please enter a positive integer.")
    except ValueError:
      print("Invalid input. Please enter an integer.")

  # Feature selection using SelectKBest
  from sklearn.feature_selection import SelectKBest
  from sklearn.feature_selection import chi2  # Example statistical test

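  X = data.drop(target_col, axis=1)  # Separate features (X) and target (y)
  y = data[target_col]

  # chi2 requires numeric, non-negative features, so encode categorical columns first
  k = min(k, X.shape[1])  # SelectKBest cannot select more features than exist
  selector = SelectKBest(chi2, k=k)  # Use chi-square test for filter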
  selector.fit(X, y)
  selected_features = X.columns[selector.get_support(indices=True)]

  # Informative output
  print(f"\n** Selected Features using SelectKBest (chi-square):**")
  for feature in selected_features:
    print(f"- {feature}")

  # User confirmation for using selected features
  print("\n** These features will be used for model training.**")
  while True:
    choice = input("Continue with selected features (y/n)? ").lower()
    if choice in ["y", "n"]:
      break
    else:
      print("Invalid choice. Please choose 'y' or 'n'.")

  if choice == "n":
    print("Original features will be used for model training.")
    return data

  # Return DataFrame with selected features
  return data[selected_features]

feature_selection(data.copy(), target_column)


This Python function, `feature_selection`, offers an interactive interface to assist with feature selection in machine learning tasks. It guides the user through the process, focusing on Filter Methods (specifically, Select K Best in this version). Here's a breakdown:

- **Input:** It takes a DataFrame (`data`) containing your features and target variable, plus the name of the target column.
- **Functionality:**
    - Explains the benefits of feature selection.
    - Prompts the user to either proceed with Filter Methods (current focus) or skip selection altogether.
    - If Filter Methods are chosen, it provides basic explanations of options (Select K Best, Select Percentile).
    - It guides the user through selecting a desired number of features (k) for Select K Best.
    - It performs feature selection using `SelectKBest` with chi-square test (as an example) to identify relevant features.
    - It informs the user about the selected features.
    - It asks for confirmation before using the selected features and modifying the DataFrame.
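- **Output:** It returns a new DataFrame potentially containing a reduced set of features based on user choices and the selected method. When the selection is accepted, the returned DataFrame holds only the selected feature columns and does not include the target column.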

This function prioritizes user interaction and offers a starting point for feature selection. Note that the chi-square test only accepts numeric, non-negative features, so text columns must be encoded and negative values rescaled before running it. Other techniques, such as correlation analysis or feature importance scores, can be valuable tools alongside it.
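
The selection step can be sketched on its own with a small made-up dataset. The feature names (`signal`, `noise`, `weak`) and values are assumptions for illustration; only `SelectKBest` and `chi2` come from the function above.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data; chi2 needs numeric, non-negative features.
X = pd.DataFrame({
    "signal": [0, 0, 1, 5, 6, 5],  # differs strongly between the two classes
    "noise":  [3, 4, 3, 4, 3, 3],  # same total in both classes
    "weak":   [2, 3, 2, 3, 2, 3],  # nearly the same total in both classes
})
y = pd.Series([0, 0, 0, 1, 1, 1])

# Score each feature against the target and keep the single best one.
selector = SelectKBest(chi2, k=1).fit(X, y)
selected = X.columns[selector.get_support()].tolist()

print(selected)  # ['signal']
```

The per-feature scores are available in `selector.scores_`, which can help when deciding on a sensible value for `k`.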