
Introduction

  • Deep semantic matching has broad applications in information retrieval, question answering, dialog systems, paraphrase detection, etc.
  • In production (e.g., a retrieval-based chatbot), information retrieval is typically performed in two steps: retrieval (e.g., Elasticsearch, an inverted index) followed by reranking (e.g., a deep matching model)

Models

Statistical models

  • TFIDF
  • BM25 (Best Match), see the scoring sketch after this list
    • k1 controls term frequency saturation; typical values lie in [1.2, 2.0]
    • b controls document length normalization; values lie in [0, 1], usually 0.75
  • Query likelihood
  • Jaccard
    • Size of intersection divided by size of union of two sets
    • Word duplication does not matter
    def jaccard_sim(str1, str2):
        # Tokenize by whitespace and deduplicate (word duplication does not matter)
        a = set(str1.split())
        b = set(str2.split())
        # |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| - |A ∩ B|
        c = a.intersection(b)
        return float(len(c)) / (len(a) + len(b) - len(c))
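
A minimal BM25 scoring sketch, assuming a toy whitespace-tokenized corpus held in memory (the corpus, function name, and default parameters below are illustrative, not taken from any specific library):

    import math
    from collections import Counter

    def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
        """Score one tokenized document against a tokenized query."""
        N = len(corpus)                               # number of documents
        avgdl = sum(len(d) for d in corpus) / N       # average document length
        tf = Counter(doc_terms)
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in corpus if term in d)  # document frequency
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            # k1 controls term frequency saturation, b controls length normalization
            denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        return score

    corpus = [doc.split() for doc in ["deep semantic match", "bm25 ranking function", "semantic text search"]]
    print(bm25_score("semantic match".split(), corpus[0], corpus))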
    

Deep learning models

  • Interaction-based is preferred

  • Representation-based

    • Principle
    • Examples
      • DSSM, CDSSM, ARC-I
  • Model Review: A Deep Look into Neural Ranking Models for Information Retrieval

  • Implementation

    • MatchZoo
    • SPTAG
    • Annoy
    • Faiss
    • HNSW
      • How to get the k-NN points for a query? (see the ANN search sketch after this list)
        • The search starts from a random sample on the top layer; the search on a layer stops when no closer neighbor can be found.
        • The closest neighbor discovered on the current layer is treated as the starting point (i.e., the “enter point”) of the search on the next lower layer, until the search reaches the bottom layer.
        • On the bottom layer, the standard NN-Descent search is adopted to find the top-k nearest neighbors.
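
A minimal approximate nearest neighbor search sketch using Faiss's HNSW index (a sketch assuming faiss-cpu and numpy are installed; the dimensionality, graph parameters, and random vectors below are illustrative stand-ins for real embeddings):

    import numpy as np
    import faiss  # pip install faiss-cpu

    d = 128                     # embedding dimension (illustrative)
    M = 32                      # number of graph neighbors per node

    # Random vectors stand in for document/query embeddings from a matching model
    xb = np.random.random((10000, d)).astype("float32")   # indexed documents
    xq = np.random.random((5, d)).astype("float32")       # queries

    index = faiss.IndexHNSWFlat(d, M)   # HNSW graph over raw (flat) vectors
    index.hnsw.efConstruction = 200     # build-time candidate list size
    index.add(xb)

    index.hnsw.efSearch = 64            # query-time candidate list size
    distances, ids = index.search(xq, 10)   # top-10 approximate neighbors per query
    print(ids[0])

Larger M and efSearch improve recall at the cost of memory and query latency.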

Learning to Rank

  • Use machine learning models to build ranking models
  • Features
    • BM25
    • PageRank
    • Edit distance
  • Categories
    • Pointwise
    • Pairwise
    • Listwise
  • LambdaMART
  • Set retrieval
    • Precision
      • Proportion of retrieved documents that are relevant, |retrieved & relevant| / |retrieved|
    • Recall
      • Proportion of relevant documents that are retrieved, |retrieved & relevant| / |relevant|
    • F1
      • Harmonic mean of precision and recall
  • Ranked retrieval (see the metrics sketch after this list)
    • Precision@k

      • Proportion of top-k documents that are relevant
    • Recall@k

      • Proportion of relevant documents that are in the top-k
    • Precision-recall curve

    • Average precision

      • Average of precision@k over the ranks k at which relevant documents are retrieved; relevant documents that are never retrieved contribute a precision of 0, so the average runs until recall reaches 1.0
      • Mean Average Precision (MAP)
        • Take the mean of the average precision values across a set of queries
      • Example: the number of relevant documents for this query is 10
    • Discounted Cumulative Gain (DCG)

      • Assumption: relevance is graded, i.e., there are more than two levels of relevance
      • Normalized Discounted Cumulative Gain (NDCG)
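
A minimal sketch of the set and ranked retrieval metrics above in plain Python (the binary and graded relevance lists in the usage example are illustrative):

    import math

    def precision_recall_f1(num_retrieved_relevant, num_retrieved, num_relevant):
        """Set retrieval: precision, recall, and their harmonic mean (F1)."""
        p = num_retrieved_relevant / num_retrieved
        r = num_retrieved_relevant / num_relevant
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        return p, r, f1

    def precision_at_k(rels, k):
        """rels: 0/1 relevance flags of the results in ranked order."""
        return sum(rels[:k]) / k

    def recall_at_k(rels, k, num_relevant):
        return sum(rels[:k]) / num_relevant

    def average_precision(rels, num_relevant):
        """Average precision@k over the ranks where a relevant document appears;
        relevant documents that are never retrieved contribute 0."""
        hits, total = 0, 0.0
        for rank, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                total += hits / rank
        return total / num_relevant

    def dcg_at_k(gains, k):
        """gains: graded relevance labels of the results in ranked order."""
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

    def ndcg_at_k(gains, k):
        ideal = dcg_at_k(sorted(gains, reverse=True), k)
        return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

    # Example: 10 relevant documents exist for the query and 5 results are returned
    ranked = [1, 0, 1, 1, 0]
    print(precision_at_k(ranked, 5), recall_at_k(ranked, 5, 10), average_precision(ranked, 10))
    print(ndcg_at_k([3, 2, 3, 0, 1], 5))

MAP is obtained by averaging average_precision over a set of queries.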

Books

Tutorials

Industrial applications