# DataMop Tutorial

Welcome to the tutorial for `datamop`, the ultimate Python package for cleaning and preparing your datasets with minimal effort. Data cleaning can often feel like the most tedious part of any data analysis or machine learning project. Missing values, inconsistent scales, and different data types can slow you down and distract from the real task: extracting insights from your data.

That is where the `datamop` package comes in! This easy-to-use package automates many common data cleaning tasks, such as imputing missing values, encoding categorical features, and scaling numerical features. It saves you time and effort while keeping your data consistent, complete, and ready for analysis.

Here we show example usage for each function in the package: `sweep_nulls`, `column_encoder`, and `column_scaler`. With `datamop`, you can focus more on analysis and less on tedious preprocessing.

## Importing and Version Checking


Before we get started, let's import the `datamop` package (it must already be installed). We will demonstrate each function in the package with examples using the Airbnb Open Data from Kaggle.

In [60]:
# import modules
import pandas as pd
import numpy as np
from datamop.sweep_nulls import sweep_nulls
from datamop.column_encoder import column_encoder
from datamop.column_scaler import column_scaler

# import Airbnb Open Data
data = pd.read_csv("../src/data/Airbnb_Open_Data.csv")

## Handling missing values with `sweep_nulls()`

One of the most common challenges in the data cleaning process is dealing with missing values. `datamop` provides a function called `sweep_nulls()` to help you handle them. The `sweep_nulls()` function scans your dataset for missing values and lets you handle them with one of several strategies: 'mean' (numeric only), 'median' (numeric only), 'mode', 'constant', and 'drop'.

Let's start by checking the missing values in the dataset:

In [None]:
data.info()

In [None]:
data.head()

In [None]:
data.isnull().sum()

### Imputing all columns

When a dataset has missing values across multiple columns, `sweep_nulls()` makes it easy to impute all columns at once. This ensures missing data is handled consistently throughout the dataset, whether you’re using the mean, median, mode, or a custom value for imputation.

Since 'mean' and 'median' are designed for numerical features only, it is better to use 'mode', 'constant' or 'drop' when you have mixed data types in the dataset. 

In [None]:
# using mode to impute missing value with the most common values in the column
sweep_nulls(data, strategy='mode')

In [None]:
# using constant to impute missing value with a number
sweep_nulls(data, strategy='constant', fill_value=-999)
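
To make the two strategies concrete, here is a rough pandas-only sketch of what 'mode' and 'constant' imputation amount to on a small, made-up DataFrame. It does not call `datamop` and is not its implementation; the column values and the per-column fill values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one text column
toy = pd.DataFrame({
    "price": [100.0, np.nan, 100.0, 250.0],
    "country": ["US", "US", None, "CA"],
})

# 'mode' idea: fill each column with its most frequent value
most_common = toy.mode().iloc[0]
filled_mode = toy.fillna(most_common)

# 'constant' idea: fill missing cells with a fixed placeholder
# (a per-column mapping keeps numbers and strings in matching columns)
filled_const = toy.fillna({"price": -999, "country": "missing"})

print(filled_mode)
print(filled_const)
```

A placeholder such as -999 is easy to spot later, but it will distort statistics like the mean unless it is filtered out first.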

### Imputing specific numerical columns

If you want to focus on imputing missing values in specific numerical columns of your dataset without affecting other columns, you can achieve this by using `sweep_nulls()` to select the desired columns and apply an imputation strategy only to them.

In [None]:
# using mean to impute price and service fee columns
sweep_nulls(data, strategy='mean', columns=['price', 'service fee'])

In [None]:
# using constant to impute price and service fee columns with a negative number
sweep_nulls(data, strategy='constant', columns=['price', 'service fee'], fill_value=-999)
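
For comparison, restricting mean imputation to selected columns can be sketched in plain pandas as follows. This is only an illustration of the idea on made-up data, not how `sweep_nulls()` is implemented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns and one text column
toy = pd.DataFrame({
    "price": [100.0, np.nan, 300.0],
    "service fee": [20.0, 60.0, np.nan],
    "country": ["US", None, "US"],
})

# Impute only the chosen numeric columns with their own column means
cols = ["price", "service fee"]
filled = toy.copy()
filled[cols] = filled[cols].fillna(filled[cols].mean())

print(filled)
```

The `country` column keeps its missing value because it is not in `cols`.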

### Imputing specific categorical columns

When a dataset has missing values in categorical columns, you can impute specific columns by filling them with the mode or with a custom value.

In [None]:
# using constant to impute missing value with a string
sweep_nulls(data, strategy='constant', columns=['host_identity_verified'], fill_value='missing')

In [None]:
# using mode to impute missing value with the most common values in the column
sweep_nulls(data, strategy='mode', columns=['country'])

### Dropping columns

Some columns may have so many missing values that they are unhelpful for analysis. Imputing them can introduce noise, so `sweep_nulls()` also allows you to drop columns with missing values.

In [None]:
# dropping one column
sweep_nulls(data, strategy='drop', columns=['instant_bookable'])

In [None]:
# dropping multiple columns
sweep_nulls(data, strategy='drop', columns=['instant_bookable', 'host_identity_verified'])
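
Before dropping anything, it can help to measure how much of each column is actually missing. The pandas-only sketch below does that on made-up data; the 50% cutoff and the column names are assumptions for illustration, not a `datamop` default.

```python
import pandas as pd

# Hypothetical toy data where "license" is mostly empty
toy = pd.DataFrame({
    "price": [100.0, 200.0, 300.0, 400.0],
    "license": [None, None, None, "A1"],
    "country": ["US", None, "US", "US"],
})

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Assumed cutoff: drop columns that are more than half empty
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
trimmed = toy.drop(columns=to_drop)

print(missing_share)
print(to_drop)
```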

## Scaling Numerical Features with `column_scaler()`

When working with numerical data, inconsistent scales can distort analysis or machine learning results. 
For example, a column measuring `price` in thousands might dominate another column measuring `rating` on a 1-5 scale. 
Scaling the numerical data to a consistent range or distribution mitigates this problem.

The `column_scaler()` function in the `datamop` package allows users to scale any numeric column in a dataset. It supports two methods:
- **Min-Max Scaling**: Scale values to a specific range, such as `[0, 1]` or `[10, 20]`.
- **Standard Scaling**: Transform values to have a mean of `0` and a standard deviation of `1`.

The `column_scaler()` function allows flexible usage for both in-place scaling (replacing the original column) and creating a new scaled column.
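
The two methods correspond to the textbook formulas sketched below in plain pandas. The helper names are hypothetical and this is not `datamop`'s code; in particular, whether the package divides by the population or the sample standard deviation is not stated in this tutorial, so the `ddof=0` choice is an assumption. The min-max helper also assumes the column has at least two distinct values.

```python
import pandas as pd

def minmax_scale(values: pd.Series, new_min: float = 0.0, new_max: float = 1.0) -> pd.Series:
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    old_min, old_max = values.min(), values.max()
    return new_min + (values - old_min) * (new_max - new_min) / (old_max - old_min)

def standard_scale(values: pd.Series) -> pd.Series:
    """Subtract the mean and divide by the population standard deviation (assumed ddof=0)."""
    return (values - values.mean()) / values.std(ddof=0)

ratings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
print(minmax_scale(ratings, 10, 20).tolist())     # [10.0, 12.5, 15.0, 17.5, 20.0]
print(standard_scale(ratings).round(3).tolist())  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```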

Let’s walk through how to use `column_scaler()`.


### Preparing the Data for Scaling

In [72]:
data.head()

Unnamed: 0,id,host_identity_verified,neighbourhood group,neighbourhood,lat,long,country,instant_bookable,cancellation_policy,room type,...,price,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,price_scaled
0,1001254,unconfirmed,Brooklyn,Kensington,40.64749,-73.97237,United States,False,strict,Private room,...,966.0,1.024837,10.0,9.0,10/19/2021,0.002222,4.0,6.0,286.0,418.608696
1,1002102,verified,Manhattan,Midtown,40.75362,-73.98377,United States,False,moderate,Entire home/apt,...,142.0,-1.462885,30.0,45.0,5/21/2022,0.004112,4.0,2.0,228.0,132.0
2,1002403,,Manhattan,Harlem,40.80902,-73.9419,United States,True,flexible,Private room,...,620.0,-0.015483,3.0,0.0,,,5.0,1.0,352.0,298.26087
3,1002755,unconfirmed,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,True,moderate,Entire home/apt,...,368.0,-0.769338,30.0,270.0,7/5/2019,0.05145,4.0,1.0,322.0,210.608696
4,1003689,verified,Manhattan,East Harlem,40.79851,-73.94399,United States,False,moderate,Entire home/apt,...,204.0,-1.266883,10.0,9.0,11/19/2018,0.001,3.0,1.0,289.0,153.565217


#### Pre-check column types
Before scaling, ensure that numerical columns are in the correct format. 
In the Airbnb dataset, the `price` and `service fee` columns are objects because they contain a `$` sign. 
We need to remove the `$` sign and convert these columns to floats. 
Additionally, we’ll demonstrate scaling on the `reviews per month` column, which is already a numeric column.

In [61]:
# Clean "price" and "service fee" columns by removing unwanted characters and converting to float
data["price"] = data["price"].str.strip().str.replace(r"[^0-9.]", "", regex=True).astype(float)

data["service fee"] = data["service fee"].str.strip().str.replace(r"[^0-9.]", "", regex=True).astype(float)

# Verify the changes
data[["price", "service fee"]].head()

Unnamed: 0,price,service fee
0,966.0,193.0
1,142.0,28.0
2,620.0,124.0
3,368.0,74.0
4,204.0,41.0
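
To see what the cleaning step does in isolation, here is the same chain applied to a small, made-up Series of price strings. The sample values (including the thousands separator) are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical raw values as they might appear in a CSV export
raw = pd.Series(["$966 ", "$1,060 ", "$41 "])
print(is_numeric_dtype(raw))  # False: the values are still strings

# Keep only digits and the decimal point, then parse as float
cleaned = raw.str.strip().str.replace(r"[^0-9.]", "", regex=True).astype(float)

print(cleaned.tolist())           # [966.0, 1060.0, 41.0]
print(is_numeric_dtype(cleaned))  # True
```

Because the pattern removes every character other than digits and `.`, it would also strip a minus sign, so it is only suitable for columns that cannot be negative.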


### Example 1: Min-Max Scaling

Let’s scale the `reviews per month` column to a range between 0 and 1 using min-max scaling.\
The scaled values will replace the original column (`inplace=True`).


In [62]:
# Using min-max scaling to scale "reviews per month" to a range between 0 and 1
column_scaler(data, column="reviews per month", method="minmax", new_min=0, new_max=1, inplace=True)

# Verify the scaled column
data[["reviews per month"]].head()



Unnamed: 0,reviews per month
0,0.002222
1,0.004112
2,
3,0.05145
4,0.001


### Example 2: Custom Min-Max Scaling with a New Column

Now let’s scale the `price` column to a range between 100 and 500. Instead of modifying the original column, we’ll create a new column called `price_scaled` by setting `inplace=False`.


In [63]:
# Using min-max scaling to scale "price" to a range between 100 and 500
column_scaler(data, column="price", method="minmax", new_min=100, new_max=500, inplace=False)

# Verify the new scaled column
data[["price", "price_scaled"]].head()



Unnamed: 0,price,price_scaled
0,966.0,418.608696
1,142.0,132.0
2,620.0,298.26087
3,368.0,210.608696
4,204.0,153.565217
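
The numbers in the output above can be checked by hand with the min-max formula. They are consistent with a `price` column that ranges from 50 to 1200; those bounds are inferred from the displayed values rather than stated anywhere in this tutorial, so treat them as an assumption.

```python
# Assumed bounds of the original "price" column (inferred from the output, not given)
old_min, old_max = 50.0, 1200.0
new_min, new_max = 100.0, 500.0

def rescale(price: float) -> float:
    """Min-max formula: shift to zero, stretch to the new width, shift to new_min."""
    return new_min + (price - old_min) * (new_max - new_min) / (old_max - old_min)

for price in [966.0, 142.0, 620.0, 368.0, 204.0]:
    print(price, round(rescale(price), 6))
```

For example, 966 maps to 100 + (966 - 50) * 400 / 1150, which is about 418.608696.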


### Example 3: Standard Scaling

Let's scale the `service fee` column using the standard scaling method, 
which transforms the values to have a mean of 0 and standard deviation of 1.
The scaled values will replace the original column (`inplace=True`).

In [64]:
# Using standard scaling method on "service fee" column
column_scaler(data, column="service fee", method="standard", inplace=True)

# Verify the scaled column
data[["service fee"]].head()



Unnamed: 0,service fee
0,1.024837
1,-1.462885
2,-0.015483
3,-0.769338
4,-1.266883


### Edge Case 1: Scaling Column with Single Unique Value

If a column contains only a single unique value, `column_scaler()` automatically assigns the midpoint of the range for min-max scaling and issues a warning message.

In [65]:
# Create DataFrame with a single-value column
single_value_df = pd.DataFrame({"price": [100, 100, 100]})

# Scale the column using min-max scaling
scaled_df = column_scaler(single_value_df, column="price", method="minmax", new_min=0, new_max=1)

# Verify the result
scaled_df



Unnamed: 0,price
0,0.5
1,0.5
2,0.5
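
The special case is needed because the min-max formula divides by `max - min`, which is zero for a constant column. A minimal sketch of the fallback described above might look like this; the function name and warning text are hypothetical and not taken from `datamop`.

```python
import warnings
import pandas as pd

def minmax_or_midpoint(values: pd.Series, new_min: float = 0.0, new_max: float = 1.0) -> pd.Series:
    old_min, old_max = values.min(), values.max()
    if old_min == old_max:
        # Constant column: the usual formula would compute 0 / 0
        warnings.warn("Column has a single unique value; using the midpoint of the range.")
        return pd.Series((new_min + new_max) / 2, index=values.index)
    return new_min + (values - old_min) * (new_max - new_min) / (old_max - old_min)

constant_prices = pd.Series([100, 100, 100])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = minmax_or_midpoint(constant_prices)

print(result.tolist())  # [0.5, 0.5, 0.5]
print(len(caught))      # 1
```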


### Edge Case 2: Handling Missing Values (NaN)

If a column contains missing values (`NaN`), `column_scaler()` leaves them unchanged and issues a warning. This ensures no data is lost or imputed incorrectly.

In [66]:
# Create a DataFrame with NaN values
nan_df = pd.DataFrame({"reviews per month": [10, np.nan, 20]})

# Scale the column using min-max scaling
scaled_nan_df = column_scaler(nan_df, column="reviews per month", method="minmax", new_min=0, new_max=1)

# Verify the result
scaled_nan_df



Unnamed: 0,reviews per month
0,0.0
1,
2,1.0
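
The same behaviour falls out of plain pandas arithmetic, since `min()` and `max()` skip `NaN` by default and any arithmetic on `NaN` yields `NaN`. The sketch below illustrates this without calling `datamop`.

```python
import numpy as np
import pandas as pd

reviews = pd.Series([10, np.nan, 20])

# min() and max() ignore NaN by default (skipna=True)
low, high = reviews.min(), reviews.max()

# NaN propagates through the arithmetic, so the missing entry stays missing
scaled = (reviews - low) / (high - low)
print(scaled.tolist())  # [0.0, nan, 1.0]
```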


### Error Case 1: Using Non-Existent Column

If the specified column does not exist in the DataFrame, `column_scaler()` raises a `KeyError`.

In [67]:
# Pass non existent column in the 'column' argument
try:
    column_scaler(data, column="Non_existent", method="minmax")
except KeyError as e:
    print(e)

'Column not found in the DataFrame.'


### Error Case 2: Using Non-Numeric Columns

If you attempt to scale a non-numeric column, such as a column of strings, `column_scaler()` raises a `ValueError`.

In [68]:
# Pass column of objects (country column) to column scaler
try:
    column_scaler(data, column="country", method="minmax")
except ValueError as e:
    print(e)

Column must have numeric values.


### Error Case 3: Using Invalid Method

If you specify a method other than `minmax` or `standard`, `column_scaler()` raises a `ValueError`.

In [69]:
# Pass invalid method to column scaler
try:
    column_scaler(data, column="price", method="invalid_method")
except ValueError as e:
    print(e)

Invalid method. Method should be `minmax` or `standard`.


### Error Case 4: Using Invalid `new_min` and `new_max` Values

For min-max scaling, if the `new_min` is greater than `new_max`, `column_scaler()` raises a `ValueError`.

In [71]:
# Pass new_min greater than new_max
try:
    column_scaler(data, column="price", method="minmax", new_min=10, new_max=5)
except ValueError as e:
    print(e)

`new_min` cannot be greater than `new_max`.
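
Taken together, the four error cases suggest input checks along the following lines. This is a hypothetical sketch: the function name, the order of the checks, and the default arguments are assumptions, and only the error messages are copied from the outputs above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def validate_scaler_args(df, column, method="minmax", new_min=0, new_max=1):
    """Hypothetical validation mirroring the errors shown in this tutorial."""
    if column not in df.columns:
        raise KeyError("Column not found in the DataFrame.")
    if not is_numeric_dtype(df[column]):
        raise ValueError("Column must have numeric values.")
    if method not in ("minmax", "standard"):
        raise ValueError("Invalid method. Method should be `minmax` or `standard`.")
    if method == "minmax" and new_min > new_max:
        raise ValueError("`new_min` cannot be greater than `new_max`.")

toy = pd.DataFrame({"price": [1.0, 2.0], "country": ["US", "CA"]})
bad_calls = [
    {"column": "Non_existent"},
    {"column": "country"},
    {"column": "price", "method": "invalid_method"},
    {"column": "price", "new_min": 10, "new_max": 5},
]

raised = []
for kwargs in bad_calls:
    try:
        validate_scaler_args(toy, **kwargs)
    except (KeyError, ValueError) as err:
        raised.append(type(err).__name__)
        print(type(err).__name__, err)
```

Printing a `KeyError` shows its message wrapped in quotes, which is why the output of Error Case 1 is quoted while the `ValueError` messages are not.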
