# DataMop Tutorial

Welcome to the tutorial for `datamop`, a Python package for cleaning and preparing your datasets with minimal effort. Data cleaning can often feel like the most tedious part of any data analysis or machine learning project. Missing values, inconsistent scales, and mixed data types can slow you down and distract from the real task: extracting insights from your data.

That is where the `datamop` package comes in! This easy-to-use package automates many common data cleaning tasks, such as imputing missing values, encoding categorical features, and scaling numerical features. It saves you time and effort while keeping your data consistent, complete, and ready for analysis.

Here we will show example usage for each function in the package: `sweep_nulls`, `column_encoder`, and `column_scaler`. After these steps, your messy data will be ready to use. With `datamop`, you can focus more on analysis and less on tedious preprocessing.

## Importing and Version Checking


Before we get started, make sure `datamop` is installed, then import it. We will demonstrate each function in the `datamop` package with examples using the Airbnb Open Data from Kaggle.

In [1]:
# import modules
import pandas as pd
import datamop
from datamop.sweep_nulls import sweep_nulls
from datamop.column_encoder import column_encoder
from datamop.column_scaler import column_scaler

# import Airbnb Open Data (pass the path to your local copy of the CSV to read_csv)
data = pd.read_csv()


## Handling missing values with `sweep_nulls()`

One of the most common challenges in the data cleaning process is dealing with missing values. `datamop` provides a convenient function called `sweep_nulls()` to help you handle them. The `sweep_nulls()` function scans your dataset for missing values and lets you handle them using one of several strategies: 'mean' (numeric only), 'median' (numeric only), 'mode', 'constant', and 'drop'.

Let's start by checking the missing values in the dataset:

In [None]:
data.info()

In [None]:
data.isnull().sum()
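
To make the output of these checks concrete, here is a minimal sketch on a small toy frame. The column names mirror the Airbnb examples, but the values are invented for illustration and do not come from the dataset. It uses only plain pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented, not taken from the Airbnb dataset
toy = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, 80.0],
    "country": ["United States", None, "United States", None],
    "instant_bookable": [True, False, True, False],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Fraction of missing values in each column (between 0 and 1)
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

Looking at the fraction as well as the count helps when deciding between imputing a column and dropping it.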

### Imputing all columns

When dealing with datasets containing missing values across multiple columns, `sweep_nulls()` makes it easy to impute all columns simultaneously. This ensures consistent handling of missing data throughout the dataset, whether you’re using the mean, median, mode, or a custom value for imputation.

Since 'mean' and 'median' are designed for numerical features only, it is better to use 'mode', 'constant' or 'drop' when you have mixed data types in the dataset. 

In [None]:
# use mode to impute missing values with the most common value in each column
sweep_nulls(data, strategy='mode')

In [None]:
# use constant to impute missing values with a number
sweep_nulls(data, strategy='constant', fill_value=-999)
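
For intuition about why 'mode' suits mixed data types, the sketch below shows a rough plain-pandas equivalent of mode imputation across all columns. This is an illustration under our own assumptions, not how `sweep_nulls()` is implemented, and the toy values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with mixed types; values are invented
toy = pd.DataFrame({
    "price": [100.0, np.nan, 100.0, 80.0],
    "country": ["United States", None, "United States", "Canada"],
})

# Columns where 'mean' or 'median' would be meaningful
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# DataFrame.mode() returns one row per tied mode;
# row 0 holds the first mode of each column
mode_filled = toy.fillna(toy.mode().iloc[0])

print(numeric_cols)
print(mode_filled)
```

The mode is defined for both numbers and strings, so one call can cover every column, whereas a mean only exists for the numeric ones.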

### Imputing specific numerical columns

If you want to impute missing values in specific numerical columns without affecting the rest of your dataset, pass the desired columns to `sweep_nulls()` through the `columns` argument. The imputation strategy is then applied only to those columns.

In [None]:
# using mean to impute price and service fee columns
sweep_nulls(data, strategy='mean', columns=['price', 'service fee'])

In [None]:
# using constant to impute price and service fee columns with a negative number
sweep_nulls(data, strategy='constant', columns=['price', 'service fee'], fill_value=-999)
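
As a point of comparison, column-restricted mean imputation can be sketched in plain pandas as follows. This is a rough equivalent under our own assumptions, not the `sweep_nulls()` implementation. The toy values and the extra `minimum nights` column are hypothetical and only serve to show that untouched columns keep their missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented
toy = pd.DataFrame({
    "price": [100.0, np.nan, 200.0],
    "service fee": [20.0, 40.0, np.nan],
    "minimum nights": [2.0, np.nan, 5.0],
})

cols = ["price", "service fee"]

# Work on a copy so the original frame stays unchanged
filled = toy.copy()

# Fill each selected column with its own mean; other columns are left alone
filled[cols] = filled[cols].fillna(filled[cols].mean())

print(filled)
```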

### Imputing specific categorical columns

For categorical columns, you can impute missing values in specific columns by filling them with the mode or with a custom value.

In [None]:
# use constant to impute missing values with a string
sweep_nulls(data, strategy='constant', columns=['host_identity_verified'], fill_value='missing')

In [None]:
# use mode to impute missing values with the most common value in the column
sweep_nulls(data, strategy='mode', columns=['country'])
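
The two categorical strategies behave differently: a constant label keeps missingness visible as its own category, while the mode folds missing rows into the most common category. A minimal plain-pandas sketch of that difference, using invented toy values and not the `sweep_nulls()` implementation:

```python
import pandas as pd

# Hypothetical toy data; values are invented
toy = pd.DataFrame({
    "host_identity_verified": ["verified", None, "unconfirmed", None],
    "country": ["United States", "United States", None, "Canada"],
})

filled = toy.copy()

# Constant: missing entries become an explicit 'missing' category
filled["host_identity_verified"] = filled["host_identity_verified"].fillna("missing")

# Mode: missing entries take the most frequent value in the column
most_common = filled["country"].mode().iloc[0]
filled["country"] = filled["country"].fillna(most_common)

print(filled["host_identity_verified"].value_counts())
print(filled["country"].value_counts())
```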

### Dropping columns

Some columns may have so many missing values that they are unhelpful for analysis. Imputing them can introduce noise, so `sweep_nulls()` allows you to drop columns with missing values.

In [None]:
# dropping one column
sweep_nulls(data, strategy='drop', columns=['instant_bookable'])

In [None]:
# dropping multiple columns
sweep_nulls(data, strategy='drop', columns=['instant_bookable', 'host_identity_verified'])
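
To decide which columns are candidates for dropping, one option is to compare each column's share of missing values against a cutoff. The sketch below does this in plain pandas. The 50% threshold, the `license` column, and the toy values are all assumptions made for illustration, and the sketch is not part of `datamop`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented
toy = pd.DataFrame({
    "price": [100.0, 120.0, np.nan, 80.0],
    "license": [None, None, None, "41662/AL"],
    "instant_bookable": [True, False, True, True],
})

# Assumed cutoff: drop columns with more than half of their values missing
threshold = 0.5

missing_share = toy.isnull().mean()
to_drop = missing_share[missing_share > threshold].index.tolist()

trimmed = toy.drop(columns=to_drop)

print(to_drop)
print(trimmed.columns.tolist())
```

The resulting list of column names could then be passed as the `columns` argument in the cells above.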