# `clean_address()`

The task is to identify and clean each component of an address, and transform the address into a standardized form or split the address into its components. Since parsing addresses is hard, we will start with common address formats in the United States (and perhaps Canada since it's similar), and progressively support more cases.

I propose that we follow the [pyaddress](https://github.com/SwoopSearch/pyaddress) library (main parsing script [here](https://github.com/SwoopSearch/pyaddress/blob/master/address/address.py)) which can identify the following [address attributes](https://github.com/SwoopSearch/pyaddress#address-1):

Example address: (The Estates) 123 N. Main St. Apt 10, Brooklyn, NY 53703

* `house number` (123)
* `street prefix` (N)
* `street` (Main)
* `street suffix` (St)
* `apartment` (Apt 10)
* `building` (The Estates)
* `city` (Brooklyn)
* `state` (NY)
* `zipcode` (53703)
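To make the attribute list concrete, here is a minimal regex sketch that pulls these attributes out of the example address. It is not pyaddress's implementation: the pattern `ADDRESS_RE`, the helper `parse_address`, and the short suffix and apartment word lists are all hypothetical, and they only cover addresses laid out like the example.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for one layout only:
# "(building) number prefix. street suffix. apartment, city, ST zipcode"
ADDRESS_RE = re.compile(
    r"""
    ^\s*
    (?:\((?P<building>[^)]*)\)\s*)?                        # optional "(The Estates)"
    (?P<house_number>\d+)\s+                               # "123"
    (?:(?P<street_prefix>[NSEW])\.?\s+)?                   # optional "N."
    (?P<street>[A-Za-z ]+?)\s+                             # "Main" (lazy match)
    (?P<street_suffix>Street|St|Avenue|Ave|Road|Rd)\.?     # illustrative suffix list
    (?:\s+(?P<apartment>(?:Apt|Unit|Suite)\s+\w+))?        # optional "Apt 10"
    \s*,\s*
    (?P<city>[A-Za-z .]+?)\s*,\s*                          # "Brooklyn"
    (?P<state>[A-Z]{2})\s+                                 # "NY"
    (?P<zipcode>\d{5})                                     # "53703"
    \s*$
    """,
    re.VERBOSE,
)


def parse_address(text: str) -> Optional[Dict[str, str]]:
    """Return the attributes found in `text`, or None if the layout is not recognized."""
    match = ADDRESS_RE.match(text)
    if match is None:
        return None
    # Optional groups that did not participate are None; leave them out.
    return {name: value for name, value in match.groupdict().items() if value is not None}


example = "(The Estates) 123 N. Main St. Apt 10, Brooklyn, NY 53703"
parsed = parse_address(example)
```

A single pattern like this breaks down quickly on real data, which is why starting from an existing parser seems preferable to writing one from scratch.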

# API
  
## Function header

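```python
from typing import Optional, Union

import dask.dataframe as dd
import pandas as pd


def clean_address(
    df: Union[pd.DataFrame, dd.DataFrame],
    column: str,
    input_format: Optional[str] = None,
    output_format: Optional[str] = None,
    split: bool = False,
    inplace: bool = False,
) -> Union[pd.DataFrame, dd.DataFrame]:
    ...  # proposed header only; the body is still to be written
```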

## Format specification

The user can specify the address format using the following keywords:

* `house_number`
* `street_prefix_abbr` ("N", "S", "E", or "W")
* `street_prefix_full` ("North", "South", "East", or "West")
* `street` 
* `street_suffix_abbr` (e.g. "St", "Ave")
* `street_suffix_full` (e.g. "Street", "Avenue")
* `apartment`
* `building`
* `city`
* `state_abbr` (e.g. "NY")
* `state_full` (e.g. "New York")
* `zipcode`

The **default output format** is `"(building) house_number street_prefix_abbr. street street_suffix_abbr. apartment, city, state_abbr zipcode"`, omitting any missing attributes.
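As a rough illustration of "omitting any missing attributes", the sketch below renders the default layout from a dictionary of attributes keyed by the keywords above. The helper name `format_default` and the dictionary-based input are assumptions made for this example, not part of the proposed API.

```python
from typing import Dict, List


def format_default(parts: Dict[str, str]) -> str:
    """Render the attributes that are present in the default layout."""
    street_line: List[str] = []
    if "building" in parts:
        street_line.append(f"({parts['building']})")
    if "house_number" in parts:
        street_line.append(parts["house_number"])
    if "street_prefix_abbr" in parts:
        street_line.append(parts["street_prefix_abbr"] + ".")
    if "street" in parts:
        street_line.append(parts["street"])
    if "street_suffix_abbr" in parts:
        street_line.append(parts["street_suffix_abbr"] + ".")
    if "apartment" in parts:
        street_line.append(parts["apartment"])

    state_zip = " ".join(parts[k] for k in ("state_abbr", "zipcode") if k in parts)
    segments = [" ".join(street_line), parts.get("city", ""), state_zip]
    # Drop empty segments so that no stray ", " is left behind.
    return ", ".join(segment for segment in segments if segment)


formatted = format_default(
    {
        "house_number": "1600",
        "street": "Pennsylvania",
        "street_suffix_abbr": "Ave",
        "city": "Washington",
        "state_abbr": "DC",
        "zipcode": "20500",
    }
)
# formatted == "1600 Pennsylvania Ave., Washington, DC 20500"
```

Attaching the trailing period to the prefix and suffix only when they exist avoids output such as `"1600 . Pennsylvania"` when an attribute is missing.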

The user can also split the output into several columns by separating the column templates in `output_format` with `"\t"`.

*Example.* df = 

| address |
| --- |
| 1600 Pennsylvania Avenue, Washington, DC 20500 |

`clean_address(df, "address", output_format = "house_number street street_suffix_abbr. \t city \t state_abbr")` returns

| address | house_number_street_street_suffix | city | state |
| --- | --- | --- | --- |
| 1600 Pennsylvania Avenue, Washington, DC 20500 | 1600 Pennsylvania Ave. | Washington | DC |
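One possible way to interpret such a format string is sketched below: split it on tabs, substitute each keyword with its value, and build the column name from the keywords with the `_abbr`/`_full` ending removed. The helper `render_columns` and the column naming rule are assumptions inferred from the example table, and the input is an already-parsed dictionary rather than a DataFrame.

```python
import re
from typing import Dict

# Keywords are assumed to be runs of lowercase letters and underscores.
KEYWORD_RE = re.compile(r"[a-z_]+")


def render_columns(output_format: str, parts: Dict[str, str]) -> Dict[str, str]:
    """Map each tab-separated template in `output_format` to a column name and value."""
    columns: Dict[str, str] = {}
    for template in output_format.split("\t"):
        template = template.strip()
        keywords = KEYWORD_RE.findall(template)
        # Column name: keywords joined by "_", without the "_abbr"/"_full" ending.
        name = "_".join(re.sub(r"_(abbr|full)$", "", keyword) for keyword in keywords)
        # Replace each keyword with its value; unknown or missing keywords become "".
        value = KEYWORD_RE.sub(lambda m: parts.get(m.group(0), ""), template)
        columns[name] = " ".join(value.split())  # collapse leftover whitespace
    return columns


parts = {
    "house_number": "1600",
    "street": "Pennsylvania",
    "street_suffix_abbr": "Ave",
    "city": "Washington",
    "state_abbr": "DC",
    "zipcode": "20500",
}
row = render_columns("house_number street street_suffix_abbr. \t city \t state_abbr", parts)
```

This sketch does not yet remove punctuation that belongs to a missing attribute, so a real implementation would need an extra rule for that case.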

### Parameter `input_format` 

The user can specify `input_format` with the same keywords as the output format. This could be useful if the input format is unusual but consistent.
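A sketch of how a user-supplied `input_format` might be turned into a parser: each keyword becomes a named regex group and everything between keywords is matched literally. The per-keyword patterns in `KEYWORD_PATTERNS` and the helper `compile_input_format` are hypothetical and cover only a few keywords.

```python
import re
from typing import Dict, Pattern

# Hypothetical pattern for each supported keyword (deliberately incomplete).
KEYWORD_PATTERNS: Dict[str, str] = {
    "house_number": r"\d+",
    "street": r"[A-Za-z ]+?",
    "street_suffix_full": r"Street|Avenue|Road",
    "city": r"[A-Za-z ]+?",
    "state_abbr": r"[A-Z]{2}",
    "zipcode": r"\d{5}",
}


def compile_input_format(input_format: str) -> Pattern[str]:
    """Build a regex with one named group per keyword in `input_format`."""
    pieces = []
    # Splitting on a capturing group keeps the keywords at odd indices
    # and the literal text between them at even indices.
    for index, token in enumerate(re.split(r"([a-z_]+)", input_format)):
        if index % 2 == 1:
            pieces.append(f"(?P<{token}>{KEYWORD_PATTERNS[token]})")
        else:
            pieces.append(re.escape(token))
    return re.compile("^" + "".join(pieces) + "$")


pattern = compile_input_format("city | state_abbr | zipcode")
match = pattern.match("Washington | DC | 20500")
fields = match.groupdict() if match else None
```

A keyword that is not in `KEYWORD_PATTERNS` raises a `KeyError` here; a real implementation would probably validate the format string first and report the unsupported keyword.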

### Parameter `split`

If `True`, the output is split into one column per identified attribute of the address.

*Example.* df = 

| address |
| --- |
| 1600 Pennsylvania Avenue, Washington, DC 20500 |

`clean_address(df, "address", split=True)` returns

| address | house_number | street | street_suffix | city | state | zipcode |
| --- | --- | --- | --- | --- | --- | --- |
| 1600 Pennsylvania Avenue, Washington, DC 20500 | 1600 | Pennsylvania | Avenue | Washington | DC | 20500 |
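The table above only has columns for attributes that were actually identified. A minimal sketch of that behavior, assuming each row has already been parsed into a dictionary: the helper `split_rows` and the column order in `ATTRIBUTE_ORDER` are hypothetical, and the result is a plain column-oriented dictionary rather than a DataFrame.

```python
from typing import Dict, List, Optional

# Assumed column order, following the default output format.
ATTRIBUTE_ORDER = [
    "building",
    "house_number",
    "street_prefix",
    "street",
    "street_suffix",
    "apartment",
    "city",
    "state",
    "zipcode",
]


def split_rows(parsed_rows: List[Dict[str, str]]) -> Dict[str, List[Optional[str]]]:
    """Turn per-row attribute dicts into columns, keeping only attributes seen at least once."""
    present = [
        attribute
        for attribute in ATTRIBUTE_ORDER
        if any(attribute in row for row in parsed_rows)
    ]
    # Rows that lack an attribute get None in that column.
    return {attribute: [row.get(attribute) for row in parsed_rows] for attribute in present}


rows = [
    {
        "house_number": "1600",
        "street": "Pennsylvania",
        "street_suffix": "Avenue",
        "city": "Washington",
        "state": "DC",
        "zipcode": "20500",
    },
    {
        "house_number": "123",
        "street_prefix": "N",
        "street": "Main",
        "street_suffix": "St",
        "city": "Brooklyn",
        "state": "NY",
        "zipcode": "53703",
    },
]
columns = split_rows(rows)
```

A dictionary of equal-length lists like `columns` has the shape that `pd.DataFrame` accepts, so the new columns could be joined onto the original frame in a later step.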

# Supplementary material

## Python libraries

1. [street-address](https://github.com/pnpnpn/street-address). Street address parser and formatter using regex and pattern matching.

2. [pyaddress](https://github.com/SwoopSearch/pyaddress). Street address parser using regex and pattern matching that is more comprehensive than street-address.

3. [usaddress-scourgify](https://github.com/GreenBuildingRegistry/usaddress-scourgify). Great reference, similar to pyaddress. Cleans and normalizes addresses using regex.

4. [usaddress](https://github.com/datamade/usaddress). Uses a probabilistic model to guess components. (US address components [here](https://usaddress.readthedocs.io/en/latest/#details))

## C library

1. [libpostal](https://github.com/openvenues/libpostal). Comprehensive address parsing/normalization library that uses statistical NLP techniques.

## Stack Overflow

1. [Data Cleaning (Addresses) Python](https://stackoverflow.com/questions/58907196/data-cleaning-addresses-python)

2. [How to parse freeform street/postal address out of text, and into components](https://stackoverflow.com/questions/11160192/how-to-parse-freeform-street-postal-address-out-of-text-and-into-components)
