In [None]:
import pandas as pd
import numpy as np  # for nan

## **Data Type Conversion**

In [None]:
# Data Types
data_types = data.dtypes
print("Data Types for Each Column in Your Data")
print(data_types)

In [None]:
import pandas as pd


def convert_data_types(data):
  """
  Prints data types for each column and allows user to change them.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.

  Returns:
      pandas.DataFrame: The DataFrame potentially with changed data types.
  """

  # Explain data types in a dictionary for easy reference
  dtype_explanations = {
      'int64': "Integer (whole numbers, positive or negative)",
      'float64': "Decimal number",
      'object': "Text data (strings)",
      'category': "Categorical data (limited set of options)",
      'datetime64[ns]': "Date and time",
      'bool': "Boolean (True or False)"
  }

  # Print data types with explanations
  for col, dtype in data.dtypes.items():
    # dtype is a NumPy/pandas dtype object, so convert it to str for the lookup
    print(f"- {col}: {dtype} ({dtype_explanations.get(str(dtype), 'Unknown')})")

  # Prompt user for data type changes
  change_dtypes = input("Would you like to change any data types (y/n)? ").lower()
  if change_dtypes == "y":
    while True:
      # Ask for column and desired data type
      # Column names are case-sensitive, so only strip whitespace here
      col_to_change = input("Enter the column name to change the data type: ").strip()
      new_dtype = input("Enter the desired new data type (int64, float64, object, category, datetime64[ns], bool): ").strip().lower()

      # Check if column exists and new data type is valid
      if col_to_change in data.columns and new_dtype in dtype_explanations:
        try:
          # Attempt conversion (handles potential errors)
          data[col_to_change] = data[col_to_change].astype(new_dtype)
          print(f"Data type for '{col_to_change}' changed to {new_dtype}.")
          # **Modified break logic:**
          break_loop = input("Do you want to convert another column (y/n)? ").lower()
          if break_loop != "y":
            break
        except (ValueError, TypeError) as e:
          print(f"Error converting '{col_to_change}' to {new_dtype}: {e}")
          # **Prompt to continue after error**
          continue_loop = input("Would you like to try converting another column (y/n)? ").lower()
          if continue_loop != "y":
            break
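      else:
        print("Invalid column name or data type.")
        # Offer a way out so the loop cannot trap the user
        if input("Would you like to try again (y/n)? ").lower() != "y":
          break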

  return data

# Example usage 
data = convert_data_types(data.copy())  # Convert a copy, then rebind `data` to the result


Focus: Data cleaning addresses inconsistencies, errors, and missing values within the data itself.

Data Type Conversion: In this context, converting data types is often a cleaning step when the data type is incorrect or incompatible with how the data should be represented.
Examples:
- Inconsistent date formats (text to datetime).
- Text values in numerical columns (text to numerical).
- Incorrect data types due to import issues (e.g., strings instead of integers).
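
The three cases above can be sketched without any prompts. This is a minimal illustration, assuming a small hypothetical DataFrame (`raw` and its column names are invented for the example); `errors="coerce"` turns unparseable entries into missing values instead of raising.

```python
import pandas as pd

# Hypothetical raw data illustrating the three cases listed above
raw = pd.DataFrame({
    "signup_date": ["2024-01-05", "2024-02-05", "not a date"],
    "price": ["10.5", "7", "unknown"],
    "quantity": ["1", "2", "3"],
})

cleaned = raw.copy()

# Text to datetime: entries that do not match the format become NaT
cleaned["signup_date"] = pd.to_datetime(
    cleaned["signup_date"], format="%Y-%m-%d", errors="coerce"
)

# Text to numeric: entries that cannot be parsed become NaN
cleaned["price"] = pd.to_numeric(cleaned["price"], errors="coerce")

# Strings that are all valid integers can be cast directly
cleaned["quantity"] = cleaned["quantity"].astype("int64")

print(cleaned.dtypes)
```

Coercing keeps the pipeline running, but the resulting missing values still need to be handled afterwards.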

## **Dealing With Normality and Skewness**

The most efficient way to assess normality and skewness in your data columns depends on a few factors:

**1. Number of Columns:**

* **Few Columns:** `(still need implementation)` If you have a small number of columns (fewer than 10), visual inspection using histograms and QQ plots is usually the most efficient approach. Both are easy to interpret and give a quick picture of each distribution's shape.

* **Many Columns:** With a large number of columns (10 or more), visual inspection becomes cumbersome, and statistical tests like the Shapiro-Wilk test scale better. Calculate the test statistic and p-value for each column. The null hypothesis is that the data are normally distributed, so a p-value below a chosen threshold (e.g., 0.05) is evidence of non-normality, while a larger p-value only means the test found no evidence against normality.
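
For the many-columns case, the per-column test can be sketched as below. This is an illustration rather than part of the notebook's pipeline: `shapiro_by_column`, its `alpha` parameter, and the synthetic `demo` data are hypothetical names, and it assumes SciPy is installed.

```python
import numpy as np
import pandas as pd
from scipy import stats


def shapiro_by_column(df, alpha=0.05):
    """Run the Shapiro-Wilk test on every numeric column of df."""
    rows = []
    for col in df.select_dtypes(include=[np.number]).columns:
        values = df[col].dropna()
        if len(values) < 3:  # scipy.stats.shapiro needs at least 3 observations
            continue
        w_stat, p_value = stats.shapiro(values)
        rows.append({
            "column": col,
            "W": w_stat,
            "p_value": p_value,
            "skewness": values.skew(),
            "reject_normality": p_value < alpha,
        })
    return pd.DataFrame(rows)


# Synthetic data for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "roughly_normal": rng.normal(loc=50, scale=5, size=500),
    "right_skewed": rng.exponential(scale=2.0, size=500),
})

report = shapiro_by_column(demo)
print(report)
```

SciPy's documentation notes that the p-value may not be accurate for more than 5000 observations, so for large columns it can be reasonable to test a random sample instead.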

**2. Desired Level of Detail:**

* **Basic Assessment:** If you just need a quick indication of normality or skewness, histograms and statistical tests with p-values provide a sufficient level of detail.

* **Detailed Analysis:** For a more in-depth analysis, you can combine both approaches. Start with histograms and QQ plots to get a visual sense of the distribution, and then follow up with statistical tests to confirm your observations or explore borderline cases with p-values close to the chosen threshold.

Here's a breakdown of the efficiency considerations:

| Method | Efficiency for Few Columns | Efficiency for Many Columns | Level of Detail |
|---|---|---|---|
| Histograms & QQ Plots | High (easy to interpret visually) | Low (time-consuming for many columns) | High (visual assessment of shape) |
| Statistical Tests | Low (requires calculations) | High (efficient for many columns) | Moderate (one p-value per column; a small value is evidence against normality) |

**Combined Approach:**

In practice, a combination of visual inspection and statistical tests often offers the best balance between efficiency and detail. Start with histograms and QQ plots for a quick overview, then use statistical tests for more rigorous confirmation, especially when dealing with many columns.
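
The visual half of this approach could look like the sketch below, which draws a histogram and a QQ plot side by side for each chosen column. It assumes matplotlib and SciPy are available; `plot_distribution` and the synthetic `demo` data are hypothetical names used only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats


def plot_distribution(df, columns):
    """Draw a histogram and a QQ plot side by side for each listed column."""
    fig, axes = plt.subplots(
        len(columns), 2, figsize=(8, 3 * len(columns)), squeeze=False
    )
    for row, col in enumerate(columns):
        values = df[col].dropna()
        axes[row, 0].hist(values, bins=30)
        axes[row, 0].set_title(f"{col}: histogram")
        # probplot compares sample quantiles with those of a normal distribution
        stats.probplot(values, dist="norm", plot=axes[row, 1])
        axes[row, 1].set_title(f"{col}: QQ plot")
    fig.tight_layout()
    return fig


# Synthetic data for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({"right_skewed": rng.exponential(scale=2.0, size=500)})

fig = plot_distribution(demo, ["right_skewed"])
```

In a QQ plot, points that bend away from the reference line at one end indicate a heavier tail on that side, which is the visual counterpart of a non-zero skewness value.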

Here are some additional factors to consider:

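* **Computational Resources:** If computational resources are limited, visual methods might be preferred. Statistical tests on large datasets can require more processing power, and with very large samples they tend to reject normality even for small, practically irrelevant deviations, so testing a random subsample is often enough.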
* **Domain Knowledge:** If you have domain knowledge about the data, you might have an initial expectation about the normality of certain features. This can guide your choice of method, focusing on tests for features where normality is critical for your analysis.

Ultimately, the most efficient approach depends on your specific needs and the size of your dataset. Combining visual and statistical methods often provides a comprehensive and efficient way to assess normality and skewness in your data columns. 

## **Normalizing/Scaling Data**

In [None]:
import numpy as np  # needed for np.number, np.log1p, and np.sqrt below
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler


def scale(data):
  """
  Identifies skewed features, suggests corrections, and performs scaling/normalization.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.

  Returns:
      pandas.DataFrame: The transformed DataFrame with addressed skewness and scaling/normalization.
  """
  # Keep the column labels (not a DataFrame) so they can be used for iteration and indexing
  numerical_cols = data.select_dtypes(include=[np.number]).columns
  skewed_cols = []  # List to store column names with skewness

  # Threshold for skewness (adjust as needed)
  skewness_threshold = 0.5

  for col in numerical_cols:
    # Calculate skewness
    skew = data[col].skew()
    if abs(skew) > skewness_threshold:
      skewed_cols.append(col)
      print(f"Column '{col}' appears skewed (skewness: {skew:.2f}).")


      # Inform decision-making
      print("Here's a brief explanation of the available correction methods:")
      print("  - Log transformation (log(x + 1)): This method is often effective for right-skewed data (where values are concentrated on the left side of the distribution).")
      print("    It compresses the larger values and stretches the smaller ones, aiming for a more symmetrical distribution.")
      print("  - Square root transformation (sqrt(x)): This method can be helpful for moderately skewed data, positive-valued features, or data with a large number of zeros.")
      print("    It reduces the influence of extreme values and can bring the distribution closer to normality.")
      print("**Please consider the characteristics of your skewed feature(s) when making your choice.**")
      print("If you're unsure, you can experiment with both methods and compare the results visually (e.g., using histograms) to see which one normalizes the data more effectively for your specific case.")

      # User prompt for addressing skewness
      action = input("Do you want to address the skewness (y/n)? ").lower()
      if action == "y":
        

        # User chooses to address skewness
        while True:  # Loop until a valid choice is made
          fix_method = input("Choose a correction method (log/sqrt/none): ").lower()
          if fix_method in ["log", "sqrt"]:
            if (data[col] < 0).any():
              # sqrt(x) is undefined for negative values and log(x + 1) for values <= -1
              print(f"Column '{col}' has negative values; no transformation applied.")
            elif fix_method == "log":
              data[col] = np.log1p(data[col])  # log(1 + x), so zeros stay finite
              print(f"Applied log transformation to column '{col}'.")
            else:
              data[col] = np.sqrt(data[col])
              print(f"Applied square root transformation to column '{col}'.")
            break  # Exit the loop once a valid choice is handled
          elif fix_method == "none":
            print(f"Skewness in '{col}' remains unaddressed.")
            break
          else:
            print("Invalid choice. Please choose 'log', 'sqrt', or 'none'.")

      else:
        print(f"Skewness in '{col}' remains unaddressed.")
    

  # User prompt for scaling/normalization (if applicable)
  if len(numerical_cols) > 0:

    print("Here's a brief explanation of the available scaling/normalization methods:")
    print("  - Standard scaling: This method transforms features by subtracting the mean and dividing by the standard deviation.")
    print("    This results in features centered around zero with a standard deviation of 1.")
    print("    It's suitable for algorithms that are sensitive to feature scale, such as regularized or gradient-based models (e.g., Logistic Regression, Support Vector Machines).")
    print("  - Min-max scaling: This method scales each feature to a specific range, typically between 0 and 1.")
    print("    It achieves this by subtracting the minimum value and then dividing by the difference between the maximum and minimum values in the feature.")
    print("    This can be useful for algorithms that are sensitive to the scale of features (e.g., K-Nearest Neighbors).")
    print("**Choosing the right method depends on your data and the algorithm you're using.**")
    print("  - If you're unsure, standard scaling is a reasonable default because it does not require normally distributed data.")
    print("  - If your algorithm expects inputs in a bounded range (such as 0 to 1), min-max scaling might be preferable.")
    print("Consider the characteristics of your data and algorithm when making your decision. You can also experiment with both methods")
    print("and compare the results using model performance metrics to see which one works best for your specific case.")

    action = input("Do you want to scale or normalize the numerical features (y/n)? ").lower()
    if action == "y":

      while True:  # Loop until a valid choice is made
        method = input("Choose scaling/normalization method (standard/minmax/skip): ").lower()
        if method in ["standard", "minmax"]:
          if method == "standard":
            scaler = StandardScaler()
            data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
            print(f"Applied standard scaling to numerical features.")
          else:
            scaler = MinMaxScaler(feature_range=(0, 1))
            data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
            print(f"Applied min-max scaling to numerical features (range 0-1).")
          break  # Exit the loop if a valid choice is made
        elif method == "skip":
          print("Skipping scaling/normalization.")
          break
        else:
          print("Invalid choice. Please choose 'standard', 'minmax', or 'skip'.")


  if not skewed_cols:
    print("No significant skewness detected in numerical columns.")
  return data


# Example usage:
preprocessed_data = scale(data.copy())  # Operate on a copy to avoid modifying original data


The `scale` function offers a user-guided approach to data preprocessing, addressing both skewness in numerical features and optional scaling/normalization. Here's a breakdown of its functionalities:

1. **Skewness Identification:**
   - It identifies features with significant skewness (asymmetry in the distribution) based on a fixed threshold (`skewness_threshold = 0.5` in the code, which you can adjust).
   - For skewed features, it informs the user about the skewness value and prompts them to address it.

2. **Skewness Correction (Optional):**
   - If the user chooses to address skewness, it offers options for log or square root transformation to potentially reduce skewness.
   - The chosen transformation is applied to the specific feature(s).

3. **Scaling/Normalization (Optional):**
   - After addressing skewness (or if no skewness is found), it prompts the user to decide if they want to scale or normalize the numerical features.
   - If the user chooses to proceed, it offers options for standard scaling (centering and scaling features to have zero mean and unit variance) or min-max scaling (scaling features to a specific range, typically 0-1).
   - The chosen scaling/normalization method is applied using scikit-learn's `StandardScaler` or `MinMaxScaler` to transform the numerical features.
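
When it is unclear whether the log or square root transformation fits a feature better, the effect of each can be compared numerically before committing. The following is a minimal sketch, not part of `scale`: `compare_skew_corrections` and the synthetic `income` series are hypothetical names, and the helper assumes non-negative values.

```python
import numpy as np
import pandas as pd


def compare_skew_corrections(series):
    """Return the skewness of a non-negative series before and after each transformation."""
    if (series < 0).any():
        raise ValueError("log(x + 1) and sqrt(x) as used here expect non-negative values")
    return pd.Series({
        "original": series.skew(),
        "log": np.log1p(series).skew(),
        "sqrt": np.sqrt(series).skew(),
    })


# Synthetic right-skewed data for illustration only
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000), name="income")

result = compare_skew_corrections(income)
print(result.round(2))

# The transformation whose skewness is closest to zero
best = result.abs().idxmin()
print(f"Closest to symmetric: {best}")
```

Skewness closest to zero is only one criterion; a histogram of the transformed values is still worth a look before deciding.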


Overall, this function simplifies data preprocessing by guiding the user through common steps like handling skewness and applying scaling/normalization techniques. It provides flexibility and explanations, making it easier to understand and customize the preprocessing process.
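
The two scaling options can be checked against their formulas on a tiny example. This sketch assumes a hypothetical four-row DataFrame (`df` and its column names are invented) and compares scikit-learn's output with a manual calculation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data for illustration only
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 65.0, 80.0, 95.0],
})

standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
min_maxed = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# StandardScaler divides by the population standard deviation (ddof=0),
# whereas pandas' std() defaults to the sample standard deviation (ddof=1)
manual_standard = (df - df.mean()) / df.std(ddof=0)
manual_min_max = (df - df.min()) / (df.max() - df.min())

print(np.allclose(standardized, manual_standard))
print(np.allclose(min_maxed, manual_min_max))
```

Both comparisons should agree, which also shows why standardized values computed by hand with pandas' default `std()` differ slightly from scikit-learn's.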


## **Feature Engineering**

## **Label Encoding**

## **Handling Class Imbalance**

## **Feature Selection**