# `clean_date()`

The task is to identify, clean, and impute the missing components of a date, and then either transform the date into a standardized format or split it into its components. Users can also define their own transformation functions.

# API
  
## Function header

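from typing import Optional, Union

import dask.dataframe as dd
import pandas as pd

def clean_date(
    df: Union[pd.DataFrame, dd.DataFrame],
    column: str,
    target_format: str = "YYYY-MM-DD HH:MM:SS",
    fix_empty: str = "auto_minimum",
    show_report: bool = False,
    customized_rule: Optional[str] = None,
    split: bool = False,
):
    ...  # body omitted; only the header is specified here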

## Format specification

The user can specify the target date format using one of the following formats:

* `YYYY/MM/DD HH:MM:SS`

* `YYYY-MM-DD HH:MM:SS`

* `YY-MM-DD HH:MM:SS`

* `YY/MM/DD HH:MM:SS`

* `DD/MM/YYYY HH:MM:SS`

* `MM/DD/YYYY HH:MM:SS`

* `DD-MM-YYYY HH:MM:SS`

* `MM-DD-YYYY HH:MM:SS`

* `DD-MM-YY HH:MM:SS`

* `MM-DD-YY HH:MM:SS`

* `DD/MM/YY HH:MM:SS`

* `MM/DD/YY HH:MM:SS`
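
One way such format strings could be interpreted is by translating them into `strftime` directives from Python's standard library. The sketch below is only an illustration: `to_strftime`, `DATE_TOKENS`, and `TIME_TOKENS` are hypothetical names that do not come from the specification, and the sketch assumes every format has a date part and a time part separated by one space, as in the list above.

```python
from datetime import datetime

# Hypothetical token tables. The date part and the time part are translated
# separately because "MM" means month in the former and minute in the latter.
DATE_TOKENS = {"YYYY": "%Y", "YY": "%y", "MM": "%m", "DD": "%d"}
TIME_TOKENS = {"HH": "%H", "MM": "%M", "SS": "%S"}


def to_strftime(target_format: str) -> str:
    """Translate e.g. 'YYYY-MM-DD HH:MM:SS' into '%Y-%m-%d %H:%M:%S'."""
    date_part, time_part = target_format.split(" ")
    # "YYYY" is replaced before "YY" so that the four-digit year wins.
    for token in ("YYYY", "YY", "MM", "DD"):
        date_part = date_part.replace(token, DATE_TOKENS[token])
    for token in ("HH", "MM", "SS"):
        time_part = time_part.replace(token, TIME_TOKENS[token])
    return f"{date_part} {time_part}"


stamp = datetime(2020, 9, 22, 21, 25, 18)
print(stamp.strftime(to_strftime("DD/MM/YYYY HH:MM:SS")))  # 22/09/2020 21:25:18
```

Splitting on the space first keeps the two meanings of `MM` apart without needing a more elaborate tokenizer.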

### Parameter `target_format` 

The `target_format` parameter sets the format of the cleaned date and should be one of the formats specified above. The **default target format** is `"YYYY-MM-DD HH:MM:SS"`.

*Example.* df = 

| DateTime |
| --- |
| 2020-09-22 21:25:18 |

`clean_date(df, "DateTime", target_format="mm/dd hh:mm")` returns

| DateTime |
| --- |
| 09/22 21:25 |

### Parameter `fix_empty` 

The `fix_empty` parameter controls how missing components of a date are filled. It takes one of the values `'empty'`, `'auto_nearest'`, or `'auto_minimum'`. The **default** is `"auto_minimum"`.

* empty: leave the missing component as it is

*Example.* df = 

| DateTime |
| --- |
| 2020-09-22 |

`clean_date(df, "DateTime", fix_empty='empty')` returns

| DateTime | cleaned_DateTime | 
| --- | --- | 
| 2020-09-22 | 2020-09-22 --:--:-- | 

* auto_nearest: 
    * for hours, minutes and seconds, fill them with zeros
    * for years, months and days, fill them with the nearest value (in the example below, the missing year becomes 2020)
    
*Example.* df = 

| DateTime |
| --- |
| 09-22 |

`clean_date(df, "DateTime", fix_empty='auto_nearest')` returns

| DateTime | cleaned_DateTime | 
| --- | --- | 
| 09-22 | 2020-09-22 00:00:00 | 

* auto_minimum: 
    * for hours, minutes and seconds, fill them with zeros
    * for years, months and days, fill them with the minimum value (in the example below, the missing day becomes 01)
    
*Example.* df = 

| DateTime |
| --- |
| 2020-09 |

`clean_date(df, "DateTime", fix_empty='auto_minimum')` returns

| DateTime | cleaned_DateTime | 
| --- | --- | 
| 2020-09 | 2020-09-01 00:00:00 | 
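
The three strategies can be sketched as a small function over already-parsed components. This is an illustration under stated assumptions, not the library's implementation: `fix_empty_parts`, `COMPONENTS`, and `MINIMUM` are hypothetical names, a date is represented as a dict whose missing components are `None`, the minimum year of 1 is a guess, and "nearest" is interpreted here as "take the value from a reference date", which the specification does not define.

```python
from datetime import datetime
from typing import Dict, Optional

COMPONENTS = ("year", "month", "day", "hour", "minute", "second")
# Assumed minimum values; the minimum year is not defined by the specification.
MINIMUM = {"year": 1, "month": 1, "day": 1}


def fix_empty_parts(
    parts: Dict[str, Optional[int]],
    fix_empty: str = "auto_minimum",
    reference: Optional[datetime] = None,
) -> Dict[str, Optional[int]]:
    """Fill components that are missing (None) according to the strategy."""
    if fix_empty not in ("empty", "auto_nearest", "auto_minimum"):
        raise ValueError(f"unknown fix_empty value: {fix_empty!r}")
    reference = reference or datetime.now()
    filled: Dict[str, Optional[int]] = {}
    for name in COMPONENTS:
        value = parts.get(name)
        if value is not None or fix_empty == "empty":
            filled[name] = value  # keep known values; 'empty' leaves gaps as None
        elif name in ("hour", "minute", "second"):
            filled[name] = 0  # both auto strategies zero-fill the time part
        elif fix_empty == "auto_minimum":
            filled[name] = MINIMUM[name]
        else:
            # auto_nearest, under the assumption described above
            filled[name] = getattr(reference, name)
    return filled


print(fix_empty_parts({"year": 2020, "month": 9}))
# {'year': 2020, 'month': 9, 'day': 1, 'hour': 0, 'minute': 0, 'second': 0}
```

Passing an explicit `reference` keeps the `auto_nearest` branch deterministic, which makes the behaviour easier to test than a hidden call to the current time.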

### Parameter `show_report` 
If `show_report = True`, a report is generated that contains:

1) how many values were repaired
2) how they were repaired
3) a visualization
4) values that need repair but for which the library has no repair rule (human-in-the-loop required)

An example and an illustrative picture still need to be added.

### Parameter `customized_rule` 
Users can define a customized format transformation function, which takes an input string and returns the corresponding output string. An example of `customized_rule` still needs to be added.
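
Until an official example is available, a string-to-string rule of the kind described might look like the sketch below. The function name `dotted_to_iso` and the dotted input format are hypothetical, and how such a function would be handed to `clean_date()` is not specified here, so the sketch only calls it directly.

```python
from datetime import datetime


def dotted_to_iso(value: str) -> str:
    """Hypothetical rule: rewrite 'DD.MM.YYYY' as 'YYYY-MM-DD'."""
    parsed = datetime.strptime(value.strip(), "%d.%m.%Y")
    return parsed.strftime("%Y-%m-%d")


print(dotted_to_iso("22.09.2020"))  # 2020-09-22
```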

### Parameter `split`

If `split=True`, the output is split into one column per identified component of the date.

*Example.* df = 

| DateTime |
| --- |
| 2020-09-22 21:25:18 |

`clean_date(df, "DateTime", split=True)` returns

| DateTime | Year | Month | Day | Hour | Minute | Second |
| --- | --- | --- | --- | --- | --- | --- |
| 2020-09-22 21:25:18 | 2020 | 09 | 22 | 21 | 25 | 18 |
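
For a single value, the split shown above can be sketched with the standard library alone. `split_date` is a hypothetical helper, and it assumes the input is already in the default `YYYY-MM-DD HH:MM:SS` form; applying it to a whole DataFrame column is left out.

```python
from datetime import datetime
from typing import Dict


def split_date(value: str) -> Dict[str, str]:
    """Split 'YYYY-MM-DD HH:MM:SS' into zero-padded string components."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return {
        "Year": f"{parsed.year:04d}",
        "Month": f"{parsed.month:02d}",
        "Day": f"{parsed.day:02d}",
        "Hour": f"{parsed.hour:02d}",
        "Minute": f"{parsed.minute:02d}",
        "Second": f"{parsed.second:02d}",
    }


print(split_date("2020-09-22 21:25:18"))
# {'Year': '2020', 'Month': '09', 'Day': '22', 'Hour': '21', 'Minute': '25', 'Second': '18'}
```

The components are kept as zero-padded strings so that the month appears as `09`, matching the table above.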

# Supplementary material

## Python libraries

1. [dateutil](https://github.com/dateutil/dateutil). Generic parsing of dates in almost any string format.



**Key points**

+ Parse input string into different components
+ Fix missing components based on default setting
+ Transform formats
+ Extensions: time conversion between different timezones
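
As a rough sketch of the timezone extension, a naive timestamp can be shifted between two fixed UTC offsets with the standard library. `convert_timezone` is a hypothetical helper, and fixed offsets are an assumption made to keep the example self-contained; named timezones with daylight-saving rules would need a timezone database.

```python
from datetime import datetime, timedelta, timezone


def convert_timezone(
    value: datetime, from_offset_hours: float, to_offset_hours: float
) -> datetime:
    """Reinterpret a naive datetime from one fixed UTC offset in another."""
    source = timezone(timedelta(hours=from_offset_hours))
    target = timezone(timedelta(hours=to_offset_hours))
    # Attach the source offset, convert, then drop tzinfo to stay naive.
    return value.replace(tzinfo=source).astimezone(target).replace(tzinfo=None)


print(convert_timezone(datetime(2020, 9, 22, 21, 25, 18), 0, -7))
# 2020-09-22 14:25:18
```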