In [None]:
import pandas as pd
import numpy as np  # for nan

## **Data Type Conversion**

In [None]:
# Data Types
data_types = data.dtypes
print("Data Types for Each Column in Your Data")
print(data_types)

In [None]:
import pandas as pd


def convert_data_types(data):
  """
  Prints data types for each column and allows user to change them.

  Args:
      data (pandas.DataFrame): The DataFrame containing the data.

  Returns:
      pandas.DataFrame: The DataFrame potentially with changed data types.
  """

  # Explain data types in a dictionary for easy reference
  dtype_explanations = {
      'int64': "Integer (whole numbers, positive or negative)",
      'float64': "Decimal number",
      'object': "Text data (strings)",
      'category': "Categorical data (limited set of options)",
      'datetime64[ns]': "Date and time",
      'bool': "Boolean (True or False)"
  }

  # Print data types with explanations
  # (use the DataFrame passed in, and look up each dtype by its string name)
  for col, dtype in data.dtypes.items():
    print(f"- {col}: {dtype} ({dtype_explanations.get(str(dtype), 'Unknown')})")

  # Prompt user for data type changes
  change_dtypes = input("Would you like to change any data types (y/n)? ").lower()
  if change_dtypes == "y":
    while True:
      # Ask for column and desired data type
      # (column names are case-sensitive, so the input is not lowercased)
      col_to_change = input("Enter the column name to change the data type: ").strip()
      new_dtype = input("Enter the desired new data type (int64, float64, object, etc.): ").strip().lower()

      # Check if column exists and new data type is valid
      if col_to_change in data.columns and new_dtype in dtype_explanations:
        try:
          # Attempt conversion (handles potential errors)
          data[col_to_change] = data[col_to_change].astype(new_dtype)
          print(f"Data type for '{col_to_change}' changed to {new_dtype}.")
          # **Modified break logic:**
          break_loop = input("Do you want to convert another column (y/n)? ").lower()
          if break_loop != "y":
            break
        except (ValueError, TypeError) as e:
          print(f"Error converting '{col_to_change}' to {new_dtype}: {e}")
          # **Prompt to continue after error**
          continue_loop = input("Would you like to try converting another column (y/n)? ").lower()
          if continue_loop != "y":
            break
      else:
        print("Invalid column name or data type. Please try again.")

  return data

# Example usage (assuming data is your DataFrame)
data = convert_data_types(data.copy())  # Avoid modifying original data
print(data)


Focus: Data cleaning addresses inconsistencies, errors, and missing values within the data itself.

Data Type Conversion: In this context, converting data types is often a cleaning step when the data type is incorrect or incompatible with how the data should be represented.
Examples:
- Inconsistent date formats (text to datetime).
- Text values in numerical columns (text to numerical).
- Incorrect data types due to import issues (e.g., strings instead of integers).
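
The three cases above can be sketched without any prompts. This is a minimal illustration, not part of the function above: the `raw` DataFrame and its column names are made up, and `errors="coerce"` is one possible policy (bad entries become missing values rather than raising an error).

```python
import pandas as pd

# Hypothetical raw data where every column was imported as text
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-02-17", "not a date"],
    "quantity": ["3", "7", "n/a"],
    "store_id": ["101", "102", "103"],
})

cleaned = raw.copy()
# Text to datetime; unparseable entries become NaT instead of raising
cleaned["order_date"] = pd.to_datetime(cleaned["order_date"], errors="coerce")
# Text to numeric; unparseable entries become NaN
cleaned["quantity"] = pd.to_numeric(cleaned["quantity"], errors="coerce")
# Clean digit strings can be cast directly
cleaned["store_id"] = cleaned["store_id"].astype("int64")

print(cleaned.dtypes)
```

Coercing hides bad values as missing data, so it is worth counting the new missing values afterwards (for example with `cleaned.isna().sum()`) before moving on.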

## **Dealing With Normality and Skewness**

The most efficient way to assess normality and skewness in your data columns depends on a few factors:

**1. Number of Columns:**

* **Few Columns:** If you have a small number of columns (fewer than 10), visual inspection using histograms and QQ plots might be the most efficient approach. These techniques are easy to understand and interpret, providing a quick grasp of the data distribution.

* **Many Columns:** With a large number of columns (more than 10), visual inspection becomes cumbersome. Here, statistical tests like the Shapiro-Wilk test can be more efficient. You can calculate the test statistic and p-value for each column. The null hypothesis is that the data are normally distributed, so a p-value below a chosen threshold (e.g., 0.05) is evidence that the column is non-normal. A p-value above the threshold does not prove normality; it only means the test found no strong evidence against it.

**2. Desired Level of Detail:**

* **Basic Assessment:** If you just need a quick indication of normality or skewness, histograms and statistical tests with p-values provide a sufficient level of detail.

* **Detailed Analysis:** For a more in-depth analysis, you can combine both approaches. Start with histograms and QQ plots to get a visual sense of the distribution, and then follow up with statistical tests to confirm your observations or explore borderline cases with p-values close to the chosen threshold.

Here's a breakdown of the efficiency considerations:

| Method | Efficiency for Few Columns | Efficiency for Many Columns | Level of Detail |
|---|---|---|---|
| Histograms & QQ Plots | High (easy to interpret visually) | Low (time-consuming for many columns) | High (visual assessment of shape) |
| Statistical Tests | Low (requires calculations) | High (efficient for many columns) | Moderate (p-value measures evidence against normality) |

**Combined Approach:**

In practice, a combination of visual inspection and statistical tests often offers the best balance between efficiency and detail. Start with histograms and QQ plots for a quick overview, then use statistical tests for more rigorous confirmation, especially when dealing with many columns.
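
As a rough sketch of the visual step, the snippet below computes the numbers behind a histogram and a QQ plot for one synthetic, right-skewed sample. The data and variable names are assumptions for illustration. `scipy.stats.probplot` returns the QQ coordinates and a line fit; passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) would draw the plot instead of only returning the values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # synthetic, right-skewed

# Histogram: bin counts show where the mass sits (a long right tail here)
counts, bin_edges = np.histogram(sample, bins=20)

# QQ plot data: theoretical normal quantiles vs. sorted sample values.
# r is the correlation of the QQ points with a straight line.
(theoretical_q, ordered_values), (slope, intercept, r_raw) = stats.probplot(sample, dist="norm")
_, (_, _, r_log) = stats.probplot(np.log(sample), dist="norm")

print(f"QQ correlation, raw: {r_raw:.3f}; after log: {r_log:.3f}")
```

A QQ correlation close to 1 means the points lie near a straight line, which is what a roughly normal column looks like.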

Here are some additional factors to consider:

* **Dataset Size:** Statistical tests on very large columns take longer to compute, and with many rows they tend to flag even trivial deviations from normality. In that case, testing a random sample of rows or relying on plots may be more practical.
* **Domain Knowledge:** If you have domain knowledge about the data, you might have an initial expectation about the normality of certain features. This can guide your choice of method, focusing on tests for features where normality is critical for your analysis.

Ultimately, the most efficient approach depends on your specific needs and the size of your dataset. Combining visual and statistical methods often provides a comprehensive and efficient way to assess normality and skewness in your data columns. 
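
For the many-columns case, one possible sketch runs the Shapiro-Wilk test on every numeric column and collects the results in a table. The helper name `normality_report`, the `alpha` parameter, and the synthetic `demo` data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def normality_report(df, alpha=0.05):
    """Return skewness and Shapiro-Wilk results for each numeric column."""
    rows = []
    for col in df.select_dtypes(include=[np.number]).columns:
        values = df[col].dropna()
        if len(values) < 3:  # Shapiro-Wilk needs at least 3 observations
            continue
        w_stat, p_value = stats.shapiro(values)
        rows.append({
            "column": col,
            "skewness": values.skew(),
            "shapiro_w": w_stat,
            "p_value": p_value,
            # True means the test found evidence against normality
            "reject_normality": p_value < alpha,
        })
    return pd.DataFrame(rows)


rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "roughly_normal": rng.normal(loc=50, scale=5, size=300),
    "right_skewed": rng.exponential(scale=2.0, size=300),
})
report = normality_report(demo)
print(report)
```

Columns flagged in `reject_normality` are the ones worth a closer look with a histogram or QQ plot.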

In [None]:
# Correcting Skewed Data

def identify_skewed_data(data):
    """
    Identifies skewed features in a DataFrame and prompts for correction.

    Args:
        data (pandas.DataFrame): The DataFrame containing the data.

    Returns:
        pandas.DataFrame: The original DataFrame (potentially with user-guided transformations).
    """
    numerical_cols = data.select_dtypes(include=[np.number]).columns
    skewed_cols = []  # List to store column names with skewness

    # Threshold for skewness (adjust as needed)
    skewness_threshold = 0.5

    for col in numerical_cols:
        # Calculate skewness
        skew = data[col].skew()

        # This check must sit inside the loop so that every column is examined
        if abs(skew) > skewness_threshold:
            skewed_cols.append(col)
            print(f"Column '{col}' appears skewed (skewness: {skew:.2f}).")
            action = input("Do you want to address the skewness (y/n)? ").lower()
            if action == "y":
                # User chooses to address skewness
                fix_method = input("Choose a correction method (log/sqrt/none): ").lower()
                if fix_method in ["log", "sqrt"]:
                    if (data[col] < 0).any():
                        # log and sqrt are undefined for negative values
                        print(f"Column '{col}' has negative values; skipping {fix_method} transformation.")
                    elif fix_method == "log":
                        data[col] = np.log(data[col] + 1)  # Avoid log(0) errors by adding 1
                        print(f"Applied log transformation to column '{col}'.")
                    else:
                        data[col] = np.sqrt(data[col])
                        print(f"Applied square root transformation to column '{col}'.")
                else:
                    print(f"Skewness in '{col}' remains unaddressed.")
            else:
                print(f"Skewness in '{col}' remains unaddressed.")

    if not skewed_cols:
        print("No significant skewness detected in numerical columns.")

    return data

# Example usage (outside the function, so it actually runs)
data = pd.DataFrame({ 'col1': [1, 2, 3, 4, 5], 'col2': [10, 100, 1000, 10000, 100000]})
data = identify_skewed_data(data.copy())  # Avoid modifying original data


**Explanation:**

1. Function Definition: The identify_skewed_data function takes a DataFrame (data) as input.
2. Numerical Columns: It selects numerical columns using select_dtypes.
3. Skewness Threshold: A threshold for skewness (skewness_threshold) is defined (adjustable based on your needs).
4. Iterating Through Columns: The code loops through each numerical column.
5. Skewness Calculation: It calculates the skewness using data[col].skew().
6. Identifying Skewed Columns: If the absolute value of skewness is greater than the threshold, the column name is added to the skewed_cols list, and a message is printed informing the user about the skewness value.
7. User Prompt: The user is then prompted to address the skewness (yes/no).
8. User Choice: If the user chooses "y", another prompt asks for a correction method (log, sqrt, or none).
9. Transformation Options: Based on the chosen method (log or sqrt), the data in that column is transformed using np.log or np.sqrt (with safeguards to avoid log(0) errors). A message confirms the applied transformation.
10. No Transformation: If the user chooses "n" or an invalid method, a message indicates that the skewness remains unaddressed.
11. No Skewness Detected: If no columns have significant skewness, a message informs the user.
12. Data Return: The original DataFrame is returned (potentially with user-guided transformations on skewed columns).
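
To see what the transformation does without the interactive prompts, here is a small non-interactive sketch. The helper `reduce_skew` is hypothetical and not part of the function above; it uses `np.log1p`, which computes log(1 + x), and refuses negative input because neither transform is defined there.

```python
import numpy as np
import pandas as pd


def reduce_skew(series, method="log"):
    """Apply a log(1 + x) or square-root transform to a non-negative Series."""
    if (series.dropna() < 0).any():
        raise ValueError("log and sqrt transforms need non-negative values")
    if method == "log":
        return np.log1p(series)  # log(1 + x), defined at x = 0
    if method == "sqrt":
        return np.sqrt(series)
    raise ValueError(f"unknown method: {method!r}")


demo = pd.DataFrame({"col2": [10, 100, 1000, 10000, 100000]})
before = demo["col2"].skew()
after = reduce_skew(demo["col2"], "log").skew()
print(f"skewness before: {before:.2f}, after log: {after:.2f}")
```

Comparing skewness before and after is a quick check that the chosen transform actually helped for that column.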

In [None]:
# Normalizing/Scaling Data

Normalization and scaling are closely related data preprocessing techniques used in machine learning, but they have subtle differences:

**Normalization:**

* **Goal:** Bring data features onto a common, well-defined scale (often the range 0 to 1 or -1 to 1). This keeps features with large raw values from dominating model training and can improve the convergence and stability of some algorithms.
* **Methods:** Common normalization techniques include:
    * **Min-Max Scaling:** Scales each feature to a range between a minimum value (usually 0) and a maximum value (usually 1).
    * **Standardization (Z-score):** Subtracts the mean of each feature from its values and then divides by the standard deviation. The result has a mean of 0 and a standard deviation of 1, but the shape of the distribution is unchanged: skewed data stays skewed, and the values are not confined to a fixed range.

**Scaling:**

* **Goal:** Scale data features to have a similar range or variance. This can be helpful for algorithms that are sensitive to the scale of the data. While normalization achieves a specific range, scaling can involve transformations to a broader range depending on the chosen method.
* **Methods:** Scaling encompasses various techniques, including normalization (min-max scaling and standardization) as well as:
    * **Robust Scaling:** Similar to standardization but uses the median and interquartile range (IQR) to be less sensitive to outliers.

**Here's an analogy:**

Imagine ingredients for a recipe.

* **Normalization:** This is like measuring all ingredients in teaspoons or grams (a specific unit system).
* **Scaling:** This is like adjusting the amounts so that no single ingredient overwhelms the others, whatever units each one is measured in.

**Key Points:**

* **Normalization is a specific type of scaling:**  Normalization techniques (min-max scaling and standardization) fall under the broader category of scaling.
* **Scaling can be more general:** Scaling can encompass methods beyond normalization, like robust scaling, which might be preferable in some scenarios.
* **Focus:** Normalization emphasizes bringing features to a specific range, while scaling focuses on making features have similar scales or variances.

**In practice, the terms "normalization" and "scaling" are sometimes used interchangeably, especially when referring to min-max scaling or standardization.** However, it's important to understand the nuances to choose the most appropriate technique for your data and modeling task.
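
The three methods named above can be written out directly in pandas. This is a minimal sketch on a made-up `income` column with one outlier; the formulas follow the same idea as scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler`, but the code below does not use that library.

```python
import pandas as pd

# Hypothetical feature with one large outlier
demo = pd.DataFrame({"income": [20.0, 30.0, 40.0, 50.0, 1000.0]})
x = demo["income"]

# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1
z_score = (x - x.mean()) / x.std(ddof=0)

# Robust scaling: centre on the median, divide by the interquartile range
iqr = x.quantile(0.75) - x.quantile(0.25)
robust = (x - x.median()) / iqr

print(pd.DataFrame({"min_max": min_max, "z_score": z_score, "robust": robust}))
```

With the outlier present, min-max scaling squeezes the four ordinary values close to 0, while robust scaling keeps them spread out, which is the reason it can be preferable for data with outliers.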


## **Feature Engineering**

## **Label Encoding**

## **Handling Class Imbalance**

## **Feature Selection**