In this project, we identify commercial products listed on Google that refer to the same entity as products listed on Amazon by comparing the similarity of their listings. This problem is called Entity Resolution.
- Applied powerful and scalable text analysis techniques.
- Performed entity resolution across two datasets of commercial products.
- Discussed the use cases for broadcast variables.
- Implemented a scalable ER algorithm.
Entity Resolution (ER) is the task of finding records that refer to the same entity across different data sources (e.g., data files, books, websites, databases). ER is necessary when joining datasets based on entities that may or may not share a common identifier (e.g., database key, URI, national identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A dataset that has undergone ER may be referred to as being cross-linked.
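To make the idea concrete, here is a minimal sketch of similarity-based ER in plain Python. It is an illustration under stated assumptions, not necessarily the method used in this project: the records and IDs are made up, the similarity measure is assumed to be cosine similarity over raw token counts, and the helper names `tokenize`, `cosine_similarity`, and `resolve` and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty record is similar to nothing
    return dot / (norm_a * norm_b)


def resolve(left, right, threshold=0.5):
    """Return (left_id, right_id, score) for pairs scoring at or above threshold."""
    left_vecs = {k: Counter(tokenize(v)) for k, v in left.items()}
    right_vecs = {k: Counter(tokenize(v)) for k, v in right.items()}
    matches = []
    # Compare every record on one side with every record on the other.
    for lid, lvec in left_vecs.items():
        for rid, rvec in right_vecs.items():
            score = cosine_similarity(lvec, rvec)
            if score >= threshold:
                matches.append((lid, rid, score))
    # Highest-scoring candidate pairs first.
    return sorted(matches, key=lambda m: -m[2])


# Hypothetical records, invented for illustration only.
google = {
    "g1": "Acme Photo Editor 5.0 for Windows",
    "g2": "Zeta Tax Planner Deluxe",
}
amazon = {
    "a1": "Acme Photo Editor 5.0 (Windows)",
    "a2": "Orion Garden Hose 50 ft",
}

matches = resolve(google, amazon)
for left_id, right_id, score in matches:
    print(left_id, right_id, round(score, 3))
```

In this sketch, only the pair `g1`/`a1` clears the threshold, because the two listings share almost all of their tokens even though they have no common identifier. Note that the nested loop compares every pair of records, so its cost grows with the product of the two dataset sizes; a scalable approach would need to avoid the full pairwise comparison.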