# `tydier` demonstration: usage examples

## Importing `tydier` in a project

To use `tydier` in a project, import the subpackages that match the type of operations you need to perform.

In [1]:
from tydier import catvars  # methods operating on categorical variables
from tydier import numvars  # methods operating on numeric variables
from tydier import strings  # methods operating on strings

Let's also import `pandas`, which this demonstration relies on.

In [2]:
import pandas as pd

## Categorical variables

To show how `tydier` works on categorical variables, let's first create a dummy `pandas` dataframe with two columns: one holding a categorical variable full of typos, and one holding the correct values. The days of the week serve as the example.

In [3]:
dirty_cats = ['monday', 'Tusday', 'Wednesday', 'thurda', 'Firday', 'saty', 'Sunday']
clean_cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

df = pd.DataFrame({'dirty_cats': dirty_cats, 'clean_cats': clean_cats})
df

| | dirty_cats | clean_cats |
|---|---|---|
| 0 | monday | Monday |
| 1 | Tusday | Tuesday |
| 2 | Wednesday | Wednesday |
| 3 | thurda | Thursday |
| 4 | Firday | Friday |
| 5 | saty | Saturday |
| 6 | Sunday | Sunday |


### catvars.categorical_variables()
Retrieves a `pandas.DataFrame`'s categorical variables and their unique values.

In [4]:
print(catvars.categorical_variables(df))
print()
catvars.categorical_variables(df, display=True)

{'dirty_cats': array(['monday', 'Tusday', 'Wednesday', 'thurda', 'Firday', 'saty',
       'Sunday'], dtype=object), 'clean_cats': array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'], dtype=object)}

(1) dirty_cats | 7 unique values:
['monday' 'Tusday' 'Wednesday' 'thurda' 'Firday' 'saty' 'Sunday']

(2) clean_cats | 7 unique values:
['Monday' 'Tuesday' 'Wednesday' 'Thursday' 'Friday' 'Saturday' 'Sunday']



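### catvars.inconsistent_categories()
Finds inconsistent categorical values in a `pd.Series` by checking them against a list of permitted values. In the example below, `mapping_dict=True` returns a dictionary that maps each inconsistent value to a permitted one.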

In [5]:
catvars.inconsistent_categories(dirty_cats, clean_cats, mapping_dict=True)

{'Firday': 'Friday',
 'saty': 'Saturday',
 'thurda': 'Thursday',
 'Tusday': 'Tuesday',
 'monday': 'Monday'}
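
As a rough illustration of the first half of this task (spotting values that are not in the permitted list), here is a minimal plain-Python sketch. The helper name `find_unlisted` is hypothetical and is not part of `tydier`; the sketch only detects the inconsistent values and does not attempt the fuzzy matching that produces the mapping above.

```python
def find_unlisted(values, permitted):
    """Return values that do not appear in `permitted`, in first-seen order.

    Hypothetical helper for illustration only; not tydier's implementation.
    """
    allowed = set(permitted)
    seen = set()
    unlisted = []
    for value in values:
        # Exact, case-sensitive membership test against the permitted values
        if value not in allowed and value not in seen:
            seen.add(value)
            unlisted.append(value)
    return unlisted


dirty_cats = ['monday', 'Tusday', 'Wednesday', 'thurda', 'Firday', 'saty', 'Sunday']
clean_cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

unlisted = find_unlisted(dirty_cats, clean_cats)
print(unlisted)  # ['monday', 'Tusday', 'thurda', 'Firday', 'saty']
```

The five values it reports are the same five keys that appear in the mapping dictionary above.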

The resulting mapping can then be used to replace the inconsistent values in a `pd.DataFrame` column:

In [6]:
mapping = catvars.inconsistent_categories(dirty_cats, clean_cats, mapping_dict=True)
df['cleaned_dirty_cats'] = df['dirty_cats'].replace(mapping)
df

| | dirty_cats | clean_cats | cleaned_dirty_cats |
|---|---|---|---|
| 0 | monday | Monday | Monday |
| 1 | Tusday | Tuesday | Tuesday |
| 2 | Wednesday | Wednesday | Wednesday |
| 3 | thurda | Thursday | Thursday |
| 4 | Firday | Friday | Friday |
| 5 | saty | Saturday | Saturday |
| 6 | Sunday | Sunday | Sunday |


## Numeric variables

### numvars.currency_to_float()
Automatically cleans a variable that contains currency values and prepares it for analysis by converting it to `float`. The target variable can be of type `str`, `list`, `tuple`, or `pandas.Series`.

In [7]:
prices = pd.Series([' $50,    00', '30, 00€'])
print(numvars.currency_to_float(prices))

0    50.0
1    30.0
dtype: float64
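
To make the kind of cleaning involved more concrete, here is a minimal sketch for a single string using only the standard library. The function name `currency_str_to_float` is hypothetical, and the logic is an assumption based on the two inputs above (stray whitespace, a currency symbol, and a comma as the decimal separator); it is not `tydier`'s actual implementation.

```python
import re


def currency_str_to_float(text):
    """Convert a messy currency string to float (illustrative sketch only)."""
    # Keep digits and separators; drop currency symbols and whitespace
    digits = re.sub(r"[^\d,.]", "", text)
    # Assumption: a comma is a decimal separator, as in the examples above
    return float(digits.replace(",", "."))


converted = [currency_str_to_float(p) for p in [' $50,    00', '30, 00€']]
print(converted)  # [50.0, 30.0]
```

Under this assumption, an input that uses a comma as a thousands separator (for example `'1,234.56'`) would raise a `ValueError`, so a real implementation needs a more careful rule for separators.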


## Operations on strings

### strings.remove_chars()
Removes recurring unwanted characters or substrings from a target variable of type `str`, `list`, `tuple`, or `pandas.Series`.

In [8]:
clean_pdSeries = strings.remove_chars(df['dirty_cats'], ['F', 'T', 'W'])
print(clean_pdSeries)
type(clean_pdSeries)

0      monday
1       usday
2    ednesday
3      thurda
4       irday
5        saty
6      Sunday
Name: dirty_cats, dtype: object


pandas.core.series.Series

In [9]:
clean_str = strings.remove_chars(['monday', 'tuesday'], ['m', 'y'])
print(clean_str)
type(clean_str)

['onda', 'tuesda']


list
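
For a single string, the same effect can be sketched with repeated calls to `str.replace`. The helper `strip_substrings` is a hypothetical name used only for illustration; it reproduces the list example above but makes no claim about how `tydier` handles the other input types.

```python
def strip_substrings(text, unwanted):
    """Remove every occurrence of each unwanted substring (sketch only)."""
    for item in unwanted:
        # str.replace is case-sensitive and removes all occurrences
        text = text.replace(item, "")
    return text


cleaned = [strip_substrings(s, ['m', 'y']) for s in ['monday', 'tuesday']]
print(cleaned)  # ['onda', 'tuesda']
```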

### strings.match_ratio()
Compares two strings using one of several methods and returns a match ratio.

In [12]:
str1 = 'mnday'
str2 = 'Monday'

print("'Character by character' comparison ratio: " + str(strings.match_ratio(str1, str2, method='charbychar', case_sensitive=False)))
print("'Slice each 2 characters' comparison ratio: " + str(strings.match_ratio(str1, str2, method='sliceeach2', case_sensitive=False)))
print("'Slice each 3 characters' comparison ratio: " + str(strings.match_ratio(str1, str2, method='sliceeach3', case_sensitive=False)))
print("'Common characters' ratio: " + str(strings.match_ratio(str1, str2, method='commonchars', case_sensitive=False)))

'Character by character' comparison ratio: 0.16666666666666666
'Slice each 2 characters' comparison ratio: 0.6
'Slice each 3 characters' comparison ratio: 0.5
'Common characters' ratio: 0.8333333333333334
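
The sketch below shows one plausible way to arrive at the four ratios printed above. The formulas are assumptions reverse-engineered from this single example, and the function names are hypothetical; `tydier`'s real implementation may differ, especially on strings with repeated characters or strings shorter than the window size.

```python
def windows(text, size):
    """All overlapping substrings of length `size`."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def char_by_char_ratio(a, b):
    """Share of positions holding the same character, over the longer length."""
    a, b = a.lower(), b.lower()
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def window_ratio(a, b, size):
    """Share of overlapping windows of `a` that also occur in `b`."""
    wa, wb = windows(a.lower(), size), windows(b.lower(), size)
    shared = sum(1 for w in wa if w in wb)
    return shared / max(len(wa), len(wb))


def common_chars_ratio(a, b):
    """Distinct shared characters, over the longer length."""
    a, b = a.lower(), b.lower()
    return len(set(a) & set(b)) / max(len(a), len(b))


ratios = (
    char_by_char_ratio('mnday', 'Monday'),
    window_ratio('mnday', 'Monday', 2),
    window_ratio('mnday', 'Monday', 3),
    common_chars_ratio('mnday', 'Monday'),
)
print(ratios)
```

The low character-by-character score shows why positional comparison is fragile: a single missing letter shifts every later character out of alignment, while the window-based and common-character comparisons are unaffected by the shift.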


### strings.slice()
Splits a `target` string into overlapping chunks of length `chunk_size` and returns them as a `list`.

In [11]:
string = "house"

print(strings.slice(string, 2))
print(strings.slice(string, 3))
print(strings.slice(string, 4))

['ho', 'ou', 'us', 'se']
['hou', 'ous', 'use']
['hous', 'ouse']
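
The outputs above correspond to a sliding window that advances one character at a time. A minimal plain-Python sketch of that behaviour follows; `sliding_chunks` is a hypothetical name, not the library function, and the step size of one is inferred from the example.

```python
def sliding_chunks(target, chunk_size):
    """Return every overlapping substring of length `chunk_size` (sketch)."""
    # The window starts at each index that leaves room for a full chunk
    return [target[i:i + chunk_size]
            for i in range(len(target) - chunk_size + 1)]


chunks_2 = sliding_chunks("house", 2)
chunks_3 = sliding_chunks("house", 3)
chunks_4 = sliding_chunks("house", 4)
print(chunks_2)  # ['ho', 'ou', 'us', 'se']
print(chunks_3)  # ['hou', 'ous', 'use']
print(chunks_4)  # ['hous', 'ouse']
```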
