Data cleaning and data preprocessing are both crucial steps in preparing data for analysis or machine learning models. However, they serve distinct purposes within the data preparation pipeline:

**Data Cleaning:**

* **Focus:** Addresses inconsistencies, errors, and missing values within the data itself.
* **Goals:**
    * Ensure data accuracy and consistency.
    * Improve data quality for analysis.
    * Prepare data for further processing.
* **Techniques:**
    * Identifying and handling missing values (imputation, removal).
    * Detecting and correcting outliers (winsorization, removal).
    * Dealing with inconsistencies (e.g., formatting errors, typos).
    * Handling invalid or irrelevant data points.
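
As a rough illustration of the cleaning techniques listed above, the sketch below fixes formatting inconsistencies, imputes missing values, and drops duplicate rows. The `raw` DataFrame and its `city` and `temp_c` columns are made up for this example, and median imputation is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: inconsistent text, missing values, and a duplicate row
raw = pd.DataFrame({
    "city": ["Paris", " paris", "Berlin", "Berlin", "Rome"],
    "temp_c": [21.0, 19.0, np.nan, np.nan, 25.0],
})

cleaned = raw.copy()

# Fix formatting inconsistencies: trim whitespace and unify capitalization
cleaned["city"] = cleaned["city"].str.strip().str.title()

# Impute missing temperatures with the column median (an assumed strategy)
cleaned["temp_c"] = cleaned["temp_c"].fillna(cleaned["temp_c"].median())

# Drop rows that are exact duplicates after the fixes above
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(cleaned)
```

Note that the duplicate check runs last: the two Berlin rows only become identical once their missing temperatures have been imputed.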

**Data Preprocessing:**

* **Focus:** Transforms the data into a format suitable for analysis or modeling algorithms.
* **Goals:**
    * Improve model performance by making data more usable.
    * Reduce computational complexity for models.
    * Engineer new features that might be more informative.
* **Techniques:**
    * Feature scaling or normalization (putting all features on the same scale).
    * Encoding categorical variables (converting text categories to numerical values).
    * Feature selection (choosing relevant features for modeling).
    * Feature engineering (creating new features based on existing ones).
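
The preprocessing techniques above might look like the following minimal sketch. The `df` DataFrame, its column names, and the derived `income_per_age` feature are illustrative assumptions. Min-max scaling is written out by hand here, although a library scaler would serve equally well.

```python
import pandas as pd

# Hypothetical, already-cleaned data
df = pd.DataFrame({
    "age": [20, 30, 40],
    "income": [30000, 60000, 90000],
    "color": ["red", "blue", "red"],
})

processed = df.copy()

# Feature scaling: min-max scale each numeric column to the range [0, 1]
for col in ["age", "income"]:
    col_min, col_max = processed[col].min(), processed[col].max()
    processed[col] = (processed[col] - col_min) / (col_max - col_min)

# Encoding: one-hot encode the categorical column into 0/1 indicator columns
processed = pd.get_dummies(processed, columns=["color"], dtype=int)

# Feature engineering: derive a new feature from the unscaled originals
processed["income_per_age"] = df["income"] / df["age"]

print(processed)
```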

**Key Differences:**

* **Data cleaning** deals with the quality and integrity of the raw data, while **data preprocessing** focuses on transforming the data for specific modeling tasks.
* **Data cleaning** is often more about fixing errors and inconsistencies, while **data preprocessing** involves feature engineering and preparing the data for algorithms.

**In summary:**

* Data cleaning is a prerequisite for data preprocessing. You clean the data before transforming it for modeling.
* Both data cleaning and data preprocessing are essential steps for building robust and effective machine learning models.

# **data_handler/preprocessing.py**


In [1]:
import pandas as pd
import json
import numpy as np

## **Reading Data**

In [2]:
# read_data

def read_data(file):
    """
    Reads data from an uploaded file or a file path (supports CSV, Excel, TSV, JSON).

    Args:
        file (object or str): The uploaded file object from a Flask request,
            or a path string (e.g. when running in a notebook).

    Returns:
        pandas.DataFrame (or list/dict): The loaded data in a suitable format.
    """
    # Identify file format based on filename extension (consider MIME type detection with the magic library)
    # Flask uploads expose .filename; a plain string is treated as the path itself
    filename = str(getattr(file, "filename", file)).lower()

    if filename.endswith(".csv"):
        data = pd.read_csv(file)
    elif filename.endswith(".xlsx"):
        data = pd.read_excel(file)
    elif filename.endswith(".tsv"):
        data = pd.read_csv(file, sep="\t")  # Use tab separator for TSV
    elif filename.endswith(".json"):
        try:
            # Assuming JSON data represents a list or dictionary
            if isinstance(file, str):
                with open(file) as f:
                    data = json.load(f)
            else:
                data = json.load(file)
        except json.JSONDecodeError as err:
            raise ValueError("Invalid JSON format. Please check your data.") from err
    # Future development: image formats could be handled here, for example
    # elif file.content_type.startswith("image/"):
    #     load and pre-process the image (libraries like OpenCV)
    #     and return a suitable data structure for image data
    else:
        raise ValueError("Unsupported file format. Please upload CSV, Excel, TSV, or JSON files.")

    return data


# input() returns a path string, which read_data accepts alongside Flask uploads
file = input("Please enter the path to your data file. We support CSV, Excel, TSV, and JSON: ")
data = read_data(file)

## **Data Cleaning**

In [3]:
# Descriptive Statistics
data_descriptive_statistics = data.describe()
print("Descriptive Statistics of Your Data:\n", data_descriptive_statistics)


This will display summary statistics for numerical columns in your data, including:

- count: The number of non-null values in each column.
- mean: The average value.
- std: The standard deviation.
- min: The minimum value.
- 25%: The first quartile (25th percentile).
- 50%: The median (50th percentile).
- 75%: The third quartile (75th percentile).
- max: The maximum value.

Here's how the summary statistics from `data.describe()` can inform data cleaning methods for robust modeling:

**1. Analyzing Central Tendency:**

* **Mean & Median:** These statistics represent the "average" value in a column. Significant deviations between mean and median can indicate a skewed distribution.

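* **Skew:** Skewness measures the asymmetry of the distribution directly. `data.describe()` does not report it, but `data.skew()` does. A positive skew means most data points are concentrated towards lower values with a long tail towards higher values, while a negative skew means most points sit at higher values with a tail towards lower values.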

**Implications for Cleaning:**

* Skewed data can affect the performance of some machine learning models. Depending on the model and the severity of the skew, you might consider data transformation (e.g., log transformation) or using models robust to skewed data.
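
To make the mean/median comparison and the effect of a log transformation concrete, here is a small sketch. The `values` series is invented for illustration, and `np.log1p` is one assumed choice of transformation; it only applies to non-negative data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed column: one large value pulls the mean upward
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 50], name="amount")

print("mean:  ", values.mean())    # pulled towards the large value
print("median:", values.median())  # stays near the bulk of the data
print("skew:  ", values.skew())    # positive for a right-skewed column

# log1p computes log(1 + x), which compresses large values and tolerates zeros
transformed = np.log1p(values)
print("skew after log1p:", transformed.skew())
```

A mean well above the median, as here, is a quick hint of right skew even before computing the skewness itself.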

**2. Understanding Dispersion:**

* **Standard Deviation (Std):** This statistic shows how spread out the data is from the mean. A high Std indicates high variability, while a low Std suggests the data is clustered around the mean.

* **Minimum & Maximum:** These values reveal the range of the data. Outliers (values far from the rest) can be identified by comparing them to the IQR (Interquartile Range) or a certain number of standard deviations from the mean.

**Implications for Cleaning:**

* Outliers can significantly impact some models. You might need to decide on handling outliers through winsorization (capping them to a certain threshold), removal, or using models less sensitive to outliers.
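
A minimal sketch of IQR-based detection followed by capping is shown below. The `scores` series is made up, and capping at the 1.5 × IQR fences is one assumed threshold. Classical winsorization caps at chosen percentiles instead.

```python
import pandas as pd

# Hypothetical column with one extreme value
scores = pd.Series([10, 12, 11, 13, 12, 11, 14, 95], name="score")

# IQR fences: values beyond 1.5 * IQR from the quartiles count as outliers
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Each comparison needs its own parentheses because | binds tighter than < and >
is_outlier = (scores < lower) | (scores > upper)
print("outliers found:", int(is_outlier.sum()))

# Cap values at the fences rather than dropping the rows
capped = scores.clip(lower=lower, upper=upper)
print(capped.tolist())
```

Capping keeps every row, which can matter when the other columns in an outlier row still carry useful information.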

**3. Exploring Data Types:**

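* **Data Types:** `data.describe()` summarizes only numerical columns by default, so a column you expected to be numeric that is missing from its output was probably read as text. Use `data.dtypes` or `data.info()` to see the type (e.g., int, float, object) of each column. Inconsistencies or incorrect data types (e.g., dates stored as text) can lead to errors in modeling.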

**Implications for Cleaning:**

* You might need to convert data types (e.g., text to numeric for numerical features, handling dates appropriately) to ensure compatibility with modeling algorithms.

**Overall, `data.describe()` provides a high-level overview of the data's central tendency, dispersion, and potential issues like missing values and outliers. By analyzing these statistics, you can identify areas where data cleaning is necessary to prepare your data for robust modeling.**


In [None]:
# Null Values for each column
missing_values = data.isnull().sum()
print("Missing/Null Values for Each Column/Feature of Your Data:\n", missing_values)


In [None]:
# Data Types
data_types = data.dtypes
print("Data Types for Each Column in Your Data:\n", data_types)

Focus: Data cleaning addresses inconsistencies, errors, and missing values within the data itself.

Data Type Conversion: In this context, converting data types is often a cleaning step when the data type is incorrect or incompatible with how the data should be represented.
Examples:
- Inconsistent date formats (text to datetime).
- Text values in numerical columns (text to numerical).
- Incorrect data types due to import issues (e.g., strings instead of integers).
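
The conversions listed above could be sketched as follows. The `raw` DataFrame and its columns are hypothetical. Passing `errors="coerce"` turns unparseable entries into missing values (`NaN` or `NaT`) instead of raising, which leaves them to be handled by the missing-value step.

```python
import pandas as pd

# Hypothetical import where every column arrived as text
raw = pd.DataFrame({
    "price": ["10.5", "7", "n/a"],
    "signup": ["2024-01-05", "2024-02-10", "not a date"],
    "units": ["1", "2", "3"],
})

converted = raw.copy()

# Text to numeric: entries that cannot be parsed become NaN
converted["price"] = pd.to_numeric(converted["price"], errors="coerce")

# Text to datetime: entries that cannot be parsed become NaT
converted["signup"] = pd.to_datetime(converted["signup"], errors="coerce")

# Strings that are all valid integers can be cast directly
converted["units"] = converted["units"].astype("int64")

print(converted.dtypes)
```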

In [None]:
# Outlier Detection

def identify_and_handle_outliers(data):
    """
    Identifies outliers and prompts user for imputation, removal, or keeping outliers.

    Args:
        data (pandas.DataFrame): The DataFrame containing the data.

    Returns:
        pandas.DataFrame: The modified DataFrame (potentially with outliers left unchanged).
    """
    data = data.copy()  # Work on a copy so the caller's DataFrame is not modified
    numerical_cols = data.select_dtypes(include=[np.number]).columns
    outliers_exist = False  # Flag to track presence of outliers

    for col in numerical_cols:
        # Calculate quartiles and IQR
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1

        # Identify outliers based on IQR outlier rule
        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        # Each comparison needs parentheses because | binds tighter than < and >
        outlier_mask = (data[col] < lower_bound) | (data[col] > upper_bound)
        outlier_count = outlier_mask.sum()

        if outlier_count == 0:
            continue  # Nothing to handle in this column

        outliers_exist = True
        print(f"Found {outlier_count} potential outliers in column '{col}'.")
        print("""
        Outlier Treatment Options:

        * Imputation: Replaces outliers with estimates (mean, median, mode) to preserve data.
        * Removal: Removes rows containing outliers, suitable for errors or irrelevant data.
        * Keep: Leave outliers unchanged for further analysis (consider impact on results).

        Choosing the right option depends on the number of outliers, their impact on analysis, and data quality.
        """)
        action = input("Do you want to (i)mpute, (r)emove, or (k)eep outliers (i/r/k)? ").lower()
        if action == "i":
            # FUTURE DEVELOPMENT: See markdown below this cell to determine which imputation method to choose.
            print("""
            Choosing the Right Imputation Method:

            * **Mean:** Use mean if the data is normally distributed (consider histograms or normality tests). Mean is sensitive to outliers, so consider if there are extreme values that might distort the average.

            * **Median:** Use median if the data is skewed (uneven distribution) or has extreme outliers. Median is less sensitive to outliers compared to mean and represents the 'middle' value in the data.

            * **Mode:** Use mode for categorical data with a dominant value. Mode represents the most frequent value in the data and is suitable for non-numerical categories.
            """)
            imputation_method = input("Choose imputation method (mean/median/mode): ").lower()
            if imputation_method == "mean":
                fill_value = data[col].mean()
                print(f"Imputing outliers in '{col}' with mean.")
            elif imputation_method == "median":
                fill_value = data[col].median()
                print(f"Imputing outliers in '{col}' with median.")
            else:
                # Mode imputation (consider using libraries like scikit-learn for categorical data handling)
                fill_value = data[col].mode()[0]  # Assuming single most frequent value
                print(f"Imputing outliers in '{col}' with mode (considering first most frequent value).")
            # mask() replaces flagged values and upcasts integer columns when needed
            data[col] = data[col].mask(outlier_mask, fill_value)
        elif action == "r":
            # Remove rows with outliers; copy so later column assignments are safe
            data = data[~outlier_mask].copy()
            print(f"Removing rows with outliers in column '{col}'.")
        elif action == "k":
            print(f"Keeping outliers in column '{col}' for further analysis.")
        else:
            print(f"Invalid choice. Outliers in '{col}' remain unaddressed.")

    if not outliers_exist:
        print("No outliers detected in numerical columns.")

    return data


The most efficient way to assess normality and skewness in your data columns depends on a few factors:

**1. Number of Columns:**

* **Few Columns:** If you have a small number of columns (fewer than 10), visual inspection using histograms and QQ plots might be the most efficient approach. These techniques are easy to understand and interpret, providing a quick grasp of the data distribution.

* **Many Columns:** With a large number of columns (more than 10), visual inspection becomes cumbersome. Here, statistical tests like the Shapiro-Wilk test can be more efficient. You can calculate the test statistic and p-value for each column to identify potential deviations from normality. A p-value below a chosen threshold (e.g., 0.05) suggests the data is likely non-normal, while a p-value above it only means the test found no evidence against normality.

**2. Desired Level of Detail:**

* **Basic Assessment:** If you just need a quick indication of normality or skewness, histograms and statistical tests with p-values provide a sufficient level of detail.

* **Detailed Analysis:** For a more in-depth analysis, you can combine both approaches. Start with histograms and QQ plots to get a visual sense of the distribution, and then follow up with statistical tests to confirm your observations or explore borderline cases with p-values close to the chosen threshold.

Here's a breakdown of the efficiency considerations:

| Method | Efficiency for Few Columns | Efficiency for Many Columns | Level of Detail |
|---|---|---|---|
| Histograms & QQ Plots | High (easy to interpret visually) | Low (time-consuming for many columns) | High (visual assessment of shape) |
| Statistical Tests | Low (requires calculations) | High (efficient for many columns) | Moderate (p-value measures evidence against normality, not the shape of the distribution) |

**Combined Approach:**

In practice, a combination of visual inspection and statistical tests often offers the best balance between efficiency and detail. Start with histograms and QQ plots for a quick overview, then use statistical tests for more rigorous confirmation, especially when dealing with many columns.

Here are some additional factors to consider:

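* **Computational Resources:** Both approaches are inexpensive for typical tabular data, although plotting every column of a very wide dataset can become slow. With very large samples, normality tests tend to flag even trivial deviations, so pair the p-value with a visual check or a skewness value.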
* **Domain Knowledge:** If you have domain knowledge about the data, you might have an initial expectation about the normality of certain features. This can guide your choice of method, focusing on tests for features where normality is critical for your analysis.

Ultimately, the most efficient approach depends on your specific needs and the size of your dataset. Combining visual and statistical methods often provides a comprehensive and efficient way to assess normality and skewness in your data columns. 
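
For the many-columns case, the statistical route might look like the sketch below, which runs `scipy.stats.shapiro` on each numerical column and records the skewness next to the p-value. The synthetic `df`, its column names, and the 0.05 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: one roughly normal column and one right-skewed column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "roughly_normal": rng.normal(loc=50, scale=5, size=200),
    "right_skewed": rng.exponential(scale=2.0, size=200),
})

alpha = 0.05  # assumed significance threshold
rows = []
for col in df.select_dtypes(include=[np.number]).columns:
    # Shapiro-Wilk tests the null hypothesis that the sample is normal
    statistic, p_value = stats.shapiro(df[col].dropna())
    rows.append({
        "column": col,
        "skew": df[col].skew(),
        "shapiro_p": p_value,
        "reject_normality": p_value < alpha,
    })

report = pd.DataFrame(rows).set_index("column")
print(report)
```

Columns flagged in `reject_normality` are natural candidates for a follow-up histogram or QQ plot, which keeps the visual work limited to the cases that need it.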

---

Here are some prominent machine learning methods used to analyze photos:

**1. Convolutional Neural Networks (CNNs):**

* This is a dominant approach for image analysis tasks like:
    * **Image Classification:** Classifying images based on their content (e.g., cat, dog, car). CNNs excel at recognizing patterns and features in images.
    * **Object Detection:** Identifying and locating specific objects within an image (e.g., identifying pedestrians, traffic signs). CNNs can not only classify objects but also pinpoint their location in the image.
    * **Image Segmentation:** Dividing an image into regions corresponding to different objects or parts of the scene. This helps in understanding the image composition and relationships between objects.

**2. Generative Adversarial Networks (GANs):**

* These involve two competing neural networks:
    * **Generator:** Creates new images that resemble real photos based on the data it's trained on.
    * **Discriminator:** Tries to distinguish between real photos and the generated images.
* Applications include:
    * **Image Inpainting:** Filling in missing parts of an image realistically.
    * **Style Transfer:** Applying the style of one image to another (e.g., making a photo look like a painting).
    * **Super-Resolution:** Creating a higher-resolution version of an image.

**3. Autoencoders:**

* These are neural networks that learn to compress an input image into a lower-dimensional representation (encoded) and then reconstruct the original image from that encoding (decoded).
* Applications include:
    * **Anomaly Detection:** Identifying unusual or unexpected images by comparing the reconstruction error.
    * **Dimensionality Reduction:** Representing images in a more compact form for storage or processing.
    * **Data Denoising:** Removing noise from images.

**4. Object Recognition with Transformers:**

* While CNNs are traditional leaders, transformers, known for their success in natural language processing, are making inroads in image recognition. 
* These models process image data by breaking it down into smaller patches and analyzing relationships between them, potentially offering an alternative approach to CNNs for certain tasks.

**Choosing the Right Method:**

The choice of machine learning method for photo analysis depends on the specific task you want to accomplish. Here are some factors to consider:

* **Task Type:** Classification, object detection, segmentation, image generation, etc.
* **Data Availability:** The amount and type of labeled photo data available for training.
* **Computational Resources:** The complexity of the model and the hardware required to train and run it.

**Additional Considerations:**

* **Explainability:** Some methods, like CNNs, can be challenging to interpret in terms of how they arrive at their decisions. This is an ongoing area of research in machine learning. 
* **Transfer Learning:** Pre-trained models on large image datasets can be fine-tuned for specific tasks, reducing training time and potentially improving performance.

By understanding these methods and the factors influencing their choice, you can leverage machine learning to extract valuable insights from your photo data.

In [None]:
# Handling Class Imbalance in preprocessing section

In [None]:

def clean_data(data):
    # Your data cleaning logic here (handling missing values, outliers, etc.)
    pass

def pre_process_data(data, task):
    # Specific pre-processing steps based on chosen task
    pass

def handle_data_upload():
    # Load data from uploaded file
    data = ...
    
    # Preprocess data
    # cleaned_data = preprocessing.clean_data(data)
    # preprocessed_data = preprocessing.pre_process_data(cleaned_data, chosen_task)
    
    # Use preprocessed data for further analysis or modeling