openclean comes with the following operators to help users find anomalies, make fixes and wrangle their datasets.
.. note::

    This list is growing and will be updated periodically.
We use a sample of NYC open data with completed job codes at various locations in New York City to demonstrate some examples.
.. jupyter-execute::

    import os

    path_to_file = os.path.join(os.getcwd(), 'source', 'data')
.. jupyter-execute::

    from openclean.data.load import dataset

    ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))
    ds.head()
openclean makes it easy to identify any functional dependency violations in datasets.
.. jupyter-execute::

    from openclean.operator.map.violations import fd_violations

    fd1_violations = fd_violations(ds, ['Street Name', 'GIS_NTA_NAME'], ['Borough'])

    print('# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is {}\n'.format(len(fd1_violations)))
    for key, gr in fd1_violations.items():
        print(gr[['Street Name', 'GIS_NTA_NAME', 'Borough']])
We identify one violation in the above example: row 11. Someone with domain knowledge should be able to point out that the assignment (BROADWAY, SoHo-TriBeCa-Civic Center-Little Italy) -> QUEENS is incorrect. Let's fix this using one of the repair strategies available to us:
.. jupyter-execute::

    from openclean.operator.collector.repair import Shortest, Vote, conflict_repair

    # Define the conflict resolution strategy. We use a majority vote for the RHS attribute.
    strategy = {'Borough': Vote()}

    # Resolve the conflicts.
    resolved = conflict_repair(conflicts=fd1_violations, strategy=strategy, in_order=False)
This should replace the violating value with the most frequent value in the group and resolve the conflict.
.. jupyter-execute::

    violation_group = resolved[(resolved['Street Name']=='BROADWAY') & (resolved['GIS_NTA_NAME']=='SoHo-TriBeCa-Civic Center-Little Italy')]

    fd2_violations = fd_violations(resolved, ['Street Name', 'GIS_NTA_NAME'], ['Borough'])
    print('# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is {}\n'.format(len(fd2_violations)))
    print(violation_group)
A complete list of repair strategies is available in the API Reference.
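To illustrate the idea behind the Vote strategy, here is a minimal sketch of majority-vote conflict resolution in plain Python. This is an illustrative toy, not openclean's implementation, and the example values are hypothetical:

```python
from collections import Counter

def majority_vote(values):
    """Resolve a conflict group by picking its most frequent value."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical conflict group for FD(Street Name, GIS_NTA_NAME -> Borough):
# two rows agree on MANHATTAN, one erroneous row says QUEENS.
conflicting_boroughs = ['MANHATTAN', 'MANHATTAN', 'QUEENS']
print(majority_vote(conflicting_boroughs))  # MANHATTAN
```

The erroneous minority value is overwritten by the group's majority value, which is exactly how the violation above is resolved.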
Depending on the use case, missing values can be handled by the filter transformation (removing them) or by the update transformation (new values). They are both explained in :ref:`transform-ref`. We demonstrate both again here.
.. jupyter-execute::

    misspelled_data = dataset(os.path.join(path_to_file, 'misspellings.csv'))
    misspelled_data['Borough'].value_counts()
We see there are empty values in this column. First, let's update them using a lambda function built on the is_empty value function.
.. jupyter-execute::

    from openclean.operator.transform.update import update
    from openclean.function.value.null import is_empty

    updated_misspelled = update(misspelled_data, 'Borough', lambda x: 'Unknown' if is_empty(x) else x)
    updated_misspelled['Borough'].value_counts()
We've replaced the missing values with 'Unknown'. However, because there is no way to be sure of the correct value without using other columns or the data augmentation techniques from the :ref:`enrich-ref` section, we instead filter out the nulls in this example using the IsNotEmpty eval function.
.. jupyter-execute::

    from openclean.operator.transform.filter import filter
    from openclean.function.eval.null import IsNotEmpty

    misspelled_data = filter(misspelled_data, predicate=IsNotEmpty('Borough'))
    misspelled_data['Borough'].value_counts()
The whole range of eval functions and value functions can be accessed in the Eval and Value packages of the API Reference, respectively, as explained in :ref:`concepts-ref`.
openclean can help identify misspellings and data entry bugs using its powerful string matcher class. It helps detect anomalous values using:
- Phonetic Matching

  Phonetic algorithms transform the input strings to normalized phonetic encodings before comparing them. openclean has the following phonetic algorithms:

  - NYSIIS
  - Soundex
  - Metaphone

- Fuzzy Matching

  An implementation of fuzzy string matching using n-gram overlaps and Levenshtein or cosine distance.
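To make the two approaches concrete, here is a hedged sketch in plain Python: a simplified American Soundex encoding (phonetic matching) and a character-trigram Jaccard overlap (fuzzy matching). These are illustrative re-implementations, not openclean's own encoders, and the simplified Soundex omits some edge cases of the full standard:

```python
def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus up to three digit codes."""
    codes = {}
    for group, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                         ('l', '4'), ('mn', '5'), ('r', '6')]:
        for ch in group:
            codes[ch] = digit
    name = name.lower()
    digits = []
    prev = codes.get(name[0], '')
    for ch in name[1:]:
        if ch in 'hw':
            continue  # h and w do not separate duplicate codes
        code = codes.get(ch, '')
        if code and code != prev:
            digits.append(code)
        prev = code
    return (name[0].upper() + ''.join(digits) + '000')[:4]


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the character trigrams of two strings."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0


print(soundex('Robert'), soundex('Rupert'))        # both R163
print(soundex('BROOKLYN'), soundex('BROOKLIN'))    # same phonetic code
print(trigram_similarity('BROOKLYN', 'BROOKLIN'))  # 0.5
```

A phonetic matcher would treat two values as candidates when their encodings agree, while a fuzzy matcher scores every vocabulary entry and keeps those above a similarity threshold.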
StringMatcher objects ingest a vocabulary and a matching algorithm that is used to identify dataset values that are misspelled. The results can optionally be stored in an openclean Mapping and reused later with other datasets as translation tables.
.. jupyter-execute::

    from openclean.function.matching.base import DefaultStringMatcher
    from openclean.function.matching.fuzzy import FuzzySimilarity
    from openclean.data.mapping import Mapping

    VOCABULARY = ['BROOKLYN', 'MANHATTAN', 'STATEN ISLAND', 'BRONX', 'QUEENS']

    matcher = DefaultStringMatcher(
        vocabulary=VOCABULARY,
        similarity=FuzzySimilarity()
    )

    map = Mapping()
    for query in set(misspelled_data['Borough']):
        map.add(query, matcher.find_matches(query))

    print(map)
The map shows that each misspelling matched exactly one value from the vocabulary, so the map can be used to fix the Borough column. The user has to intervene manually and update the map whenever a query value has zero matches, or more than one match, from the vocabulary.
Fixing is easy: we can use the update operation with the Lookup eval function (which provides a default value if a key is not found in the map).
.. jupyter-execute::

    from openclean.function.eval.domain import Lookup
    from openclean.operator.transform.update import update
    from openclean.function.eval.base import Col

    fixed = update(misspelled_data, 'Borough', Lookup(columns=['Borough'], mapping=map.to_lookup(), default=Col('Borough')))

    print(fixed['Borough'].unique())
- KNN Clustering

  openclean lets users apply KNN clustering to identify values in a dataset that are potential variations of one another (e.g., values with missing or extra punctuation) and suggests a replacement for each. A well-put example demonstrating this can be found here.
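The intuition can be sketched with a simple greedy similarity clustering in plain Python. This is an illustrative toy, not openclean's KNN implementation; the difflib ratio, the 0.8 threshold, and the sample values are all assumptions:

```python
from difflib import SequenceMatcher

def cluster_values(values, threshold=0.8):
    """Greedily group values whose similarity to a cluster's first member exceeds threshold."""
    clusters = []
    for v in values:
        for cluster in clusters:
            if SequenceMatcher(None, v.lower(), cluster[0].lower()).ratio() >= threshold:
                cluster.append(v)
                break
        else:
            clusters.append([v])
    return clusters

def suggest_replacement(cluster, frequencies):
    """Suggest the most frequent variant in a cluster as the replacement value."""
    return max(cluster, key=lambda v: frequencies.get(v, 0))

values = ['BROOKLYN', 'BROOKLYN.', 'BROOKLN', 'QUEENS']
print(cluster_values(values))  # [['BROOKLYN', 'BROOKLYN.', 'BROOKLN'], ['QUEENS']]
```

Each multi-member cluster flags a set of likely variations, and the most frequent member is a natural candidate for the standardized value.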
Users will often face situations where a dataset contains spelling variations (with extra or missing punctuation) of, for example, a business or street name. openclean provides multiple ways to enforce consistency across datasets:
- Token Signature Outliers

  When values in a dataset are expected to contain a signature token, this functionality helps identify anomalies. For example, an address column often requires values to contain a street suffix. The Token Signature class highlights any values that don't contain one. Here is a notebook demonstrating this.
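The underlying check can be sketched in a few lines of plain Python. This is an illustrative toy, not the Token Signature class itself, and the abbreviated suffix set and sample addresses are assumptions:

```python
# Assumed, abbreviated signature of street-suffix tokens.
STREET_SUFFIXES = {'STREET', 'ST', 'AVENUE', 'AVE', 'ROAD', 'RD',
                   'BOULEVARD', 'BLVD', 'PLACE', 'PL'}

def missing_signature(values, signature):
    """Return values that contain no token from the signature set."""
    return [v for v in values if not any(tok in signature for tok in v.upper().split())]

addresses = ['WEST 23 STREET', '5 AVENUE', 'CANAL ST', 'BROADWAY']
print(missing_signature(addresses, STREET_SUFFIXES))  # ['BROADWAY']
```

Values without any signature token are surfaced for review; whether a flagged value like 'BROADWAY' is an error or a legitimate exception is a call for someone with domain knowledge.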
- KNN Clustering

  As mentioned in the previous section, KNN clustering can be used to identify sets of similar values. openclean takes it a step further by recommending a replacement value for each identified cluster.
These two tools become even more powerful when combined, as demonstrated in this notebook, where we first replace different token signature representations with a standard one and then use KNN clustering to fix the remaining value variations in a street name column.
openclean provides several statistical anomaly detection operators implemented using the scikit-learn machine learning library: DBSCAN, Isolation Forest, Local Outlier Factor, One-Class SVM, and Robust Covariance.
Here we use a simple ensemble approach that applies all these operators to the dataset's GIS_NTA_NAME column.
.. jupyter-execute::

    from collections import Counter

    from openclean.embedding.feature.default import UniqueSetEmbedding
    from openclean.profiling.anomalies.sklearn import (
        dbscan,
        isolation_forest,
        local_outlier_factor,
        one_class_svm,
        robust_covariance
    )

    ensemble = Counter()
    for f in [dbscan, isolation_forest, local_outlier_factor, one_class_svm, robust_covariance]:
        ensemble.update(f(ds, 'GIS_NTA_NAME', features=UniqueSetEmbedding()))
We then count, for each value, the number of operators that classified it as an outlier.
.. jupyter-execute::

    # Output values that have been classified as outliers by at least three of the
    # five operators.
    prev = 0
    for value, count in ensemble.most_common():
        if count < 3:
            break
        if count < prev:
            print()
        if count != prev:
            print('{}\t{}'.format(count, value))
        else:
            print('\t{}'.format(value))
        prev = count
These statistically detected anomalies could be neighborhoods with fewer job requests, or misspellings, which is something a user with domain knowledge can verify.
Users can create their own data cleaning operators, apply them, and reuse them as required. With :ref:`notebook-extension`, these eval functions or callables can also be registered in a UI and applied to datasets visually. The following screen capture shows how custom functions together with openclean-notebook enhance a user's data munging experience:
.. only:: html

    .. figure:: data/custom_func.gif
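At its simplest, a custom value function is just a Python callable. Here is a minimal, hypothetical sketch (the normalize_borough helper is invented for illustration):

```python
def normalize_borough(value: str) -> str:
    """A hypothetical custom value function: trim whitespace and uppercase."""
    return value.strip().upper()

print(normalize_borough('  brooklyn '))  # BROOKLYN

# A callable like this can be passed to the update transformation shown
# earlier, e.g. update(ds, 'Borough', normalize_borough), or registered with
# the openclean-notebook UI for visual application.
```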