# `tydier` demonstration: usage examples

## Importing `tydier` in a project

To use `tydier` in a project, import its main module. Aliasing it as `ty` is good practice.

In [2]:
import tydier as ty

For this demonstration, let's also import `pandas`.

In [3]:
import pandas as pd

## Categorical variables

To show how `tydier` works on categorical variables, let's first create a dummy `pandas` dataframe with two columns: one holding a categorical variable full of typos and one holding the correct values. We use the days of the week as an example.

In [4]:
dirty_cats = ['monday', 'Tusday', 'Wednesday',
              'thurda', 'Firday', 'saty', 'Sunday']
clean_cats = ['Monday', 'Tuesday', 'Wednesday',
              'Thursday', 'Friday', 'Saturday', 'Sunday']

df = pd.DataFrame({'dirty_cats': dirty_cats, 'clean_cats': clean_cats})
df

|   | dirty_cats | clean_cats |
|---|------------|------------|
| 0 | monday     | Monday     |
| 1 | Tusday     | Tuesday    |
| 2 | Wednesday  | Wednesday  |
| 3 | thurda     | Thursday   |
| 4 | Firday     | Friday     |
| 5 | saty       | Saturday   |
| 6 | Sunday     | Sunday     |


### catvars.categorical_variables()
**Retrieves a `pandas.DataFrame`'s categorical variables and their unique values.**

In [5]:
print(ty.categorical_variables(df))
print()
ty.categorical_variables(df, display=True)

{'dirty_cats': array(['monday', 'Tusday', 'Wednesday', 'thurda', 'Firday', 'saty',
       'Sunday'], dtype=object), 'clean_cats': array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'], dtype=object)}

(1) dirty_cats | 7 unique values:
['monday' 'Tusday' 'Wednesday' 'thurda' 'Firday' 'saty' 'Sunday']

(2) clean_cats | 7 unique values:
['Monday' 'Tuesday' 'Wednesday' 'Thursday' 'Friday' 'Saturday' 'Sunday']



### utilities.clean_col_names()
**Method for cleaning column names of a `pandas.DataFrame`.**

To show the usage of this method, let's first rename our dataframe's columns in an "untidy" way.

In [6]:
df.columns = ['    Dirty  categorieS', ':/Clean !@#$%&*(){}[];:.,/\\|˜ Categories']

Now that the column names are inconsistent, we can call `clean_col_names()` on them. The method tidies each name and returns a list of cleaned column names; assigning that list back to `df.columns` renames the dataframe's columns.

In [7]:
df.columns = ty.clean_col_names(df.columns)
print(df.columns)

['dirty', 'categories']
['clean', 'categories']
Index(['dirty_categories', 'clean_categories'], dtype='object')
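
For intuition, a similar cleanup can be approximated with the standard library alone. The sketch below is not `tydier`'s implementation: `tidy_name` is a hypothetical helper that keeps only runs of ASCII letters and digits, lowercases them, and joins them with underscores. For the two names used in this demonstration it gives the same result as the output above.

```python
import re


def tidy_name(name):
    """Hypothetical helper: approximate column-name cleanup (not tydier's code)."""
    # Keep only runs of ASCII letters and digits; anything else separates words.
    words = re.findall(r"[A-Za-z0-9]+", name)
    # Lowercase each word and join the words with underscores.
    return "_".join(word.lower() for word in words)


untidy = ["    Dirty  categorieS", ":/Clean !@#$%&*(){}[];:.,/\\|˜ Categories"]
tidy = [tidy_name(name) for name in untidy]
print(tidy)
```

Names containing accented letters or other non-ASCII characters would need a wider character class than the one assumed here.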


### catvars.inconsistent_categories()
**Finds inconsistent categorical values in a `pandas.Series` by checking it against a list of permitted values.**

In [8]:
ty.inconsistent_categories(dirty_cats, clean_cats)

['Firday', 'saty', 'thurda', 'monday', 'Tusday']

Setting `mapping_dict` to `True` returns a dictionary that we can pass to `pandas.Series.replace()` to automatically replace inconsistent categorical values in a `pandas.Series`:

In [9]:
mapping = ty.inconsistent_categories(dirty_cats, clean_cats, mapping_dict=True)
df['cleaned_dirty_cats'] = df['dirty_categories'].replace(mapping)
df

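|   | dirty_categories | clean_categories | cleaned_dirty_cats |
|---|------------------|------------------|--------------------|
| 0 | monday           | Monday           | Monday             |
| 1 | Tusday           | Tuesday          | Tuesday            |
| 2 | Wednesday        | Wednesday        | Wednesday          |
| 3 | thurda           | Thursday         | Thursday           |
| 4 | Firday           | Friday           | Friday             |
| 5 | saty             | Saturday         | Saturday           |
| 6 | Sunday           | Sunday           | Sunday             |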
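
To make the idea of such a mapping concrete, here is a minimal sketch that builds a comparable dictionary with the standard library's `difflib`. It is a stand-in technique, not the matching logic `tydier` uses; `build_mapping` and its `cutoff` parameter are hypothetical names introduced only for this illustration.

```python
import difflib


def build_mapping(dirty_values, clean_values, cutoff=0.6):
    """Hypothetical helper: map each non-permitted value to its closest permitted one."""
    # Compare in lowercase so that capitalisation alone never decides a match.
    canonical = {value.lower(): value for value in clean_values}
    mapping = {}
    for value in dirty_values:
        if value in clean_values:
            continue  # already a permitted category, nothing to fix
        close = difflib.get_close_matches(
            value.lower(), list(canonical), n=1, cutoff=cutoff
        )
        if close:
            mapping[value] = canonical[close[0]]
    return mapping


dirty = ['monday', 'Tusday', 'Wednesday', 'thurda', 'Firday', 'saty', 'Sunday']
clean = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
mapping = build_mapping(dirty, clean)
print(mapping)
```

A dictionary of this shape can be passed to `pandas.Series.replace()` in the same way as the one returned with `mapping_dict=True`. Values with no candidate above the cutoff are simply left out of the mapping, so they stay unchanged.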


## Numeric variables

### numvars.currency_to_float()
**Automatically cleans a variable containing a numeric value expressed in currency notation (a string composed of a numeric value plus a currency symbol or three-letter code) and prepares it for analysis by converting it to `float`. The target variable can be of type `str`, `list`, `tuple`, or `pandas.Series`.**

With a string:

In [10]:
value, currency = ty.currency_to_float('$50,00')
print(value, currency)

50.0 $


With a `list`/`tuple`:

In [11]:
prices = ["EUR 1200,45", "  23,000.12 $", "123,000.56USD", "$45", "$ 56,90"]

v, c = ty.currency_to_float(prices)
print("Values:")
print(v)
print("Currencies:")
print(c)

Values:
[1200.45, 23000.12, 123000.56, 45.0, 56.9]
Currencies:
['EUR', '$', 'USD', '$', '$']


And finally, with a `pandas.Series`:

In [12]:
prices = pd.Series(prices)
print(prices)

v, c = ty.currency_to_float(prices)
print("\nValues:")
print(v)
print("\nCurrencies:")
print(c)

0      EUR 1200,45
1      23,000.12 $
2    123,000.56USD
3              $45
4          $ 56,90
dtype: object

Values:
0      1200.45
1     23000.12
2    123000.56
3        45.00
4        56.90
dtype: float64

Currencies:
0    EUR
1      $
2    USD
3      $
4      $
dtype: object


## Operations on strings

### strings.remove_chars()
**Simple method for cleaning unwanted characters or substrings from a target variable of type `str`, `list`, `tuple`, or `pandas.Series`.**

In [13]:
clean_pdSeries = ty.remove_chars(df['dirty_categories'], ['F', 'T', 'W'])
print(clean_pdSeries)
type(clean_pdSeries)

0      monday
1       usday
2    ednesday
3      thurda
4       irday
5        saty
6      Sunday
Name: dirty_categories, dtype: object


pandas.core.series.Series

In [14]:
clean_str = ty.remove_chars(['monday', 'tuesday'], ['m', 'y'])
print(clean_str)
type(clean_str)

['onda', 'tuesda']


list
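
The outputs above suggest that the return type follows the input type. A minimal sketch of that dispatch for `str`, `list`, and `tuple` inputs could look like the following; `strip_substrings` is a hypothetical helper, and the `pandas.Series` case is left out to keep the example free of dependencies.

```python
def strip_substrings(target, unwanted):
    """Hypothetical helper: remove every substring in `unwanted` from `target`."""

    def clean(text):
        # Remove the unwanted substrings one after another.
        for item in unwanted:
            text = text.replace(item, "")
        return text

    if isinstance(target, str):
        return clean(target)
    if isinstance(target, (list, tuple)):
        # Rebuild the same container type that was passed in.
        return type(target)(clean(element) for element in target)
    raise TypeError("expected a str, list, or tuple")


print(strip_substrings(['monday', 'tuesday'], ['m', 'y']))
print(strip_substrings(('monday', 'tuesday'), ['day']))
print(strip_substrings('Wednesday', ['W']))
```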

### strings.match_ratio()
**Provides several methods for comparing two strings, each returning a match ratio.**

In [15]:
str1 = 'mnday'
str2 = 'Monday'

print("'Character by character' comparison ratio: " +
      str(ty.match_ratio(str1, str2, method='charbychar', case_sensitive=False)))
print("'Slice each 2 characters' comparison ratio: " +
      str(ty.match_ratio(str1, str2, method='sliceeach2', case_sensitive=False)))
print("'Slice each 3 characters' comparison ratio: " +
      str(ty.match_ratio(str1, str2, method='sliceeach3', case_sensitive=False)))
print("'Common characters' ratio: " + str(ty.match_ratio(str1,
      str2, method='commonchars', case_sensitive=False)))

'Character by character' comparison ratio: 0.16666666666666666
'Slice each 2 characters' comparison ratio: 0.6
'Slice each 3 characters' comparison ratio: 0.5
'Common characters' ratio: 0.8333333333333334
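
The two character-based ratios can be read as simple counts divided by the length of the longer string. The sketch below is one interpretation that is consistent with the numbers printed above for `'mnday'` and `'Monday'`; it is inferred from that output, not taken from `tydier`'s source, and both function names are hypothetical.

```python
def char_by_char_ratio(a, b, case_sensitive=False):
    """Hypothetical helper: share of positions holding the same character."""
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    # zip() stops at the shorter string, so extra characters count as mismatches.
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / longest


def common_chars_ratio(a, b, case_sensitive=False):
    """Hypothetical helper: share of characters of the shorter string found in the longer."""
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    shorter, longer = sorted((a, b), key=len)
    if not longer:
        return 1.0
    shared = sum(1 for ch in shorter if ch in longer)
    return shared / len(longer)


print(char_by_char_ratio('mnday', 'Monday'))
print(common_chars_ratio('mnday', 'Monday'))
```

A single missing letter shifts every following position, which is why the position-wise ratio is so low here while the common-characters ratio stays high.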


### strings.slice()
**Returns the `target` string split into overlapping chunks of `chunk_size` characters, as a `list`.**

In [16]:
string = "house"

print(ty.slice(string, 2))
print(ty.slice(string, 3))
print(ty.slice(string, 4))

['ho', 'ou', 'us', 'se']
['hou', 'ous', 'use']
['hous', 'ouse']
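
The output shows that the chunks overlap: each one starts one character after the previous one. A sliding window like this can be sketched in a single list comprehension. The same windows also give one plausible reading of the `'sliceeach2'` and `'sliceeach3'` ratios shown earlier, namely shared chunks divided by the larger chunk count. Both helpers below are hypothetical and inferred from the printed results only.

```python
def sliding_chunks(text, chunk_size):
    """Hypothetical helper: all overlapping windows of `chunk_size` characters."""
    return [text[i:i + chunk_size] for i in range(len(text) - chunk_size + 1)]


def chunk_ratio(a, b, chunk_size):
    """Hypothetical helper: chunks of `a` found in `b`, over the larger chunk count."""
    chunks_a = sliding_chunks(a.lower(), chunk_size)
    chunks_b = sliding_chunks(b.lower(), chunk_size)
    most = max(len(chunks_a), len(chunks_b))
    if most == 0:
        return 0.0  # both strings are shorter than one chunk
    shared = sum(1 for chunk in chunks_a if chunk in chunks_b)
    return shared / most


print(sliding_chunks("house", 2))
print(sliding_chunks("house", 3))
print(chunk_ratio("mnday", "Monday", 2))
print(chunk_ratio("mnday", "Monday", 3))
```

Because the count runs over the chunks of the first string, repeated chunks can make this sketch asymmetric; a set-based comparison would avoid that at the cost of ignoring repetitions.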
