# `dfcleaner` demonstration: usage examples

## Importing `dfcleaner` in a project

To use `dfcleaner` in a project, import the subpackages that match the operations you need.

In [1]:
# import dfcleaner
# print(dfcleaner.__version__)

from dfcleaner import catvars  # methods operating on categorical variables
from dfcleaner import strings  # methods operating on strings
# etc...

Let's also import `pandas`, which this demonstration uses to build the example data.

In [2]:
import pandas as pd

## Categorical variables

To show how `dfcleaner` works on categorical variables, let's first create a dummy `pandas` dataframe with two columns: one holding a categorical variable full of "mistakes" and one holding the "right" categories.

In [3]:
dirty_cats = ['monday', 'Tusday', 'Wednesday', 'thurda', 'Firday', 'saty', 'Sunday']
clean_cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

df = pd.DataFrame({'dirty_cats': dirty_cats, 'clean_cats': clean_cats})
df

|   | dirty_cats | clean_cats |
|---|------------|------------|
| 0 | monday     | Monday     |
| 1 | Tusday     | Tuesday    |
| 2 | Wednesday  | Wednesday  |
| 3 | thurda     | Thursday   |
| 4 | Firday     | Friday     |
| 5 | saty       | Saturday   |
| 6 | Sunday     | Sunday     |


### catvars.categorical_variables()
Retrieves a `pandas.DataFrame`'s categorical variables and their unique values.
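
This notebook does not show a call to `catvars.categorical_variables()`, and its signature is not given here. As a rough illustration of the idea only, the following sketch uses plain `pandas` to collect the unique values of each text or category column. The helper name `categorical_summary` and the choice of which dtypes count as "categorical" are assumptions, not `dfcleaner`'s implementation.

```python
import pandas as pd
from pandas.api import types as pdtypes


def categorical_summary(frame):
    """Hypothetical helper: map each categorical-like column to its unique values."""
    summary = {}
    for col in frame.columns:
        series = frame[col]
        # Assumption: object, string and category columns count as categorical.
        is_categorical = (
            isinstance(series.dtype, pd.CategoricalDtype)
            or pdtypes.is_object_dtype(series)
            or pdtypes.is_string_dtype(series)
        )
        if is_categorical:
            # drop_duplicates keeps the order of first appearance.
            summary[col] = series.drop_duplicates().tolist()
    return summary


demo = pd.DataFrame({
    "day": ["Monday", "monday", "Monday"],
    "count": [1, 2, 3],
})
summary = categorical_summary(demo)
print(summary)
```

A summary like this makes near-duplicate categories such as `Monday` and `monday` easy to spot before cleaning them.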

## Operations on strings

### strings.remove_chars()
Removes recurring unwanted characters or substrings from a target variable of type `str`, `list`, `tuple`, or `pandas.Series`.

In [4]:
clean_pdSeries = strings.remove_chars(df['dirty_cats'], ['m', 'W', 'y'])
print(clean_pdSeries)
type(clean_pdSeries)

0       monda
1       Tusda
2    Wednesda
3      thurda
4       Firda
5         sat
6       Sunda
Name: dirty_cats, dtype: object


pandas.core.series.Series

In [5]:
clean_str = strings.remove_chars('monday', ['m', 'y'])
print(clean_str)
type(clean_str)

onda


str
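
For readers who want to see the idea without the library, here is a minimal sketch of substring removal for `str`, `list`, and `tuple` targets. The name `remove_chars_sketch` is hypothetical and the logic is an assumption about the behavior, not `dfcleaner`'s code; `pandas.Series` handling is deliberately left out.

```python
def remove_chars_sketch(target, chars):
    """Hypothetical stand-in: strip every substring in `chars` from `target`."""
    if isinstance(target, str):
        for unwanted in chars:
            target = target.replace(unwanted, "")
        return target
    # For a list or tuple, clean each element and keep the container type.
    return type(target)(remove_chars_sketch(item, chars) for item in target)


print(remove_chars_sketch("monday", ["m", "y"]))
print(remove_chars_sketch(["monday", "Sunday"], ["y"]))
```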

### strings.match_ratio()
Provides several methods for comparing two strings, each returning a match ratio.

In [6]:
str1 = 'mnday'
str2 = 'Monday'
print(f"Character by character comparison ratio: {strings.match_ratio(str1, str2, method='charbychar', case_sensitive=False)}")
print(f"Slice each 2 characters comparison ratio: {strings.match_ratio(str1, str2, method='sliceeach2', case_sensitive=False)}")
print(f"Slice each 3 characters comparison ratio: {strings.match_ratio(str1, str2, method='sliceeach3', case_sensitive=False)}")
print(f"Common characters ratio: {strings.match_ratio(str1, str2, method='commonchars', case_sensitive=False)}")

Character by character comparison ratio: 0.16666666666666666
Slice each 2 characters comparison ratio: 0.6
Slice each 3 characters comparison ratio: 0.5
Common characters ratio: 0.8333333333333334
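
The four ratios above can be reproduced with simple definitions. The sketch below is inferred from the printed numbers for this one pair of strings, not taken from the library source, so the function names and the choice of denominator (the larger of the two counts) are assumptions and may not match `match_ratio` on other inputs.

```python
def windows(text, size):
    """All overlapping substrings of length `size`."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def char_by_char(a, b):
    # Compare characters position by position.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def slice_each(a, b, size):
    # Count windows of `a` that also occur among the windows of `b`.
    wa, wb = windows(a, size), windows(b, size)
    shared = sum(1 for w in wa if w in wb)
    return shared / max(len(wa), len(wb))


def common_chars(a, b):
    # Count characters of `a` that appear anywhere in `b`.
    shared = sum(1 for c in a if c in b)
    return shared / max(len(a), len(b))


# Lowercasing both inputs stands in for case_sensitive=False.
s1, s2 = "mnday".lower(), "Monday".lower()
print(char_by_char(s1, s2))
print(slice_each(s1, s2, 2))
print(slice_each(s1, s2, 3))
print(common_chars(s1, s2))
```

Position-by-position comparison drops sharply after a single missing letter, while the window and common-character ratios tolerate it, which is why the four scores differ so much for the same pair.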


### strings.slice()
Returns the `target` string as a list of overlapping substrings of length `chunk_size`, each starting one character after the previous one.

In [10]:
string = "house"
print(strings.slice(string, 2))
print(strings.slice(string, 3))
print(strings.slice(string, 4))

['ho', 'ou', 'us', 'se']
['hou', 'ous', 'use']
['hous', 'ouse']
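
The output above is a sliding window over the string. A minimal sketch that produces the same lists, assuming a window step of one character, follows; `sliding_slices` is a hypothetical name, not part of `dfcleaner`.

```python
def sliding_slices(target, chunk_size):
    """Overlapping substrings of length `chunk_size`, moving one character at a time."""
    return [target[i:i + chunk_size] for i in range(len(target) - chunk_size + 1)]


for size in (2, 3, 4):
    print(sliding_slices("house", size))
```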
