# Sprint 16

## Thesis Introduction

I selected **Named Entity Recognition** as my area of interest: I really like working on Natural Language Processing, and this problem is also one of the tasks at my university, so I expect to be able to combine the two in this work.

### 1. Li, Jing, et al. "A survey on deep learning for named entity recognition." IEEE Transactions on Knowledge and Data Engineering (2020).

**Why did I choose this paper?**

For any problem, reading a survey is a natural first step before looking for innovative work. A survey shows what has already been done to solve the problem, what the challenges are, and where future research may go.

**Abstract the paper**

- Named Entity Recognition (NER) is the task of identifying text spans that mention named entities and classifying them into predefined categories (e.g., person, location, organization). NER serves as the basis for a variety of natural language applications such as question answering, text summarization, and machine translation.
- Traditional approaches include:
    - Rule-based Approaches:
        - Pros: Work well when the lexicon is exhaustive.
        - Cons: Cannot be transferred to other domains.
    - Unsupervised Learning Approaches:
        - Pros: Reduce the work of labelling data.
        - Cons: Need time to interpret the results.
    - Feature-based Supervised Learning Approaches:
        - Pros: Better generalization.
        - Cons: Require a considerable amount of engineering skill and domain expertise.
- The main contribution of the paper is its survey of Deep Learning (DL) techniques for NER, organized under a new taxonomy of DL-based NER.
![image](fig1.png)
- Core strengths of Deep Learning approaches:
    - Benefit from non-linear transformations, which generate non-linear mappings from input to output.
    - Save effort on designing NER features, compared to feature-based approaches.
    - Deep neural NER models can be trained end-to-end by gradient descent, which makes it possible to design complex NER systems.
- Architectures of every component in taxonomy:
    - Distributed Representation for Input: one-hot vector representation, word-level representation, character-level representation (mostly use CNN-based and RNN-based), hybrid representation, etc.
    - Context encoder: CNN, RNN, recursive NN, deep transformer, neural language model
    - Tag decoder: MLP + Softmax, Conditional Random Fields (CRF), RNN, Pointer Networks.
- Comparison:
    - The work with the highest F1-score on the CoNLL03 dataset pre-trains a bidirectional Transformer model in a cloze-style manner.
    - The work with the highest F1-score on the OntoNotes5.0 dataset uses BERT and dice loss.
- Representative recent deep learning techniques applied to new NER problem settings and applications:
    - Deep Multi-task Learning
    - Deep Transfer Learning
    - Deep Active Learning
    - Deep Adversarial Learning
    - Deep Reinforcement Learning
    - Neural Attention
- Challenges faced by NER systems:
    - Data Annotation
    - Informal Text and Unseen Entities
- Future directions:
    - Fine-grained NER and Boundary Detection
    - Joint NER and Entity Linking
    - DL-based NER on Informal Text with Auxiliary Resource
    - Scalability of DL-based NER
    - Deep Transfer Learning for NER
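
The task definition and the F1-based comparison above can be made concrete with a small sketch. It assumes the common BIO tagging scheme and exact-match scoring (a predicted span counts only if both its boundaries and its type are correct); the sentence, the tags, and the function names `bio_to_spans` and `entity_f1` are illustrative choices, not taken from the survey.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel flushes a span that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current span
        if label is not None:
            spans.add((start, i, label))  # close the open span
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # "B-" opens a span; a stray "I-" is treated as an opening tag too.
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match F1 over entity spans."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_spans)
    recall = true_pos / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


tokens = ["Michael", "Jordan", "was", "born", "in", "Brooklyn"]
gold = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "O", "O", "B-LOC"]  # wrong boundary for the person

print(sorted(bio_to_spans(gold)))  # [(0, 2, 'PER'), (5, 6, 'LOC')]
print(entity_f1(gold, pred))       # 0.5
```

The prediction gets only one of two gold spans right and one of its own two spans right, so precision and recall are both 0.5. This is why boundary errors are penalized as heavily as type errors under exact-match evaluation.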

### 2. Kim, Ji-Hwan, and Philip C. Woodland. "A rule-based named entity recognition system for speech input." Sixth International Conference on Spoken Language Processing. 2000.

**Why did I choose this paper?**

This paper is a representative of rule-based approaches to NER.

**Abstract the paper**

- Propose a rule-based (transformation-based) NER system which uses the Brill rule inference approach.
    - The preprocessing includes: add word features and look up name lists.
    - The rule generation includes: generate applicable rules, update environments, find the best rule, and update the NE labels in the training data.
![image](fig2.png)
- Compare the performance between the proposed method and IdentiFinder, one of the most successful stochastic systems.
    - In the baseline case (no punctuation and no capitalisation), both systems show almost equal performance.
    - The performance of both systems degrades linearly with added speech recognition errors, and their rates of degradation are almost equal.
    - They conclude that automatic rule inference is a viable alternative to the HMM-based approach to named entity recognition, but it retains the advantages of a rule-based approach.
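
The rule-generation loop above (generate applicable rules, find the best rule, update the labels) can be sketched as a greedy procedure. This is a minimal illustration under stated assumptions, not the paper's system: it uses a single hypothetical rule template ("if the previous word is X, change label A to B"), lowercased toy data, and the names `Rule`, `apply_rule`, and `learn_rules` are invented here. The real system uses richer templates with word features and name lists.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    prev_word: str   # trigger: the preceding word
    from_label: str  # label the rule rewrites
    to_label: str    # label the rule writes


def apply_rule(rule: Rule, words: List[str], labels: List[str]) -> List[str]:
    """Return a new label sequence with the rule applied wherever it matches."""
    out = list(labels)
    for i in range(1, len(words)):
        if words[i - 1] == rule.prev_word and labels[i] == rule.from_label:
            out[i] = rule.to_label
    return out


def count_errors(labels: List[str], gold: List[str]) -> int:
    return sum(l != g for l, g in zip(labels, gold))


def learn_rules(words: List[str], gold: List[str], labels: List[str],
                max_rules: int = 5) -> List[Rule]:
    """Greedy loop: propose rules at error sites, keep the one removing most errors."""
    learned: List[Rule] = []
    for _ in range(max_rules):
        # Candidate rules are instantiated only where the current label is wrong.
        candidates = {
            Rule(words[i - 1], labels[i], gold[i])
            for i in range(1, len(words))
            if labels[i] != gold[i]
        }
        best: Optional[Rule] = None
        best_gain = 0
        for rule in sorted(candidates):  # sorted for a deterministic tie-break
            after = apply_rule(rule, words, labels)
            gain = count_errors(labels, gold) - count_errors(after, gold)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:
            break  # no rule gives a net improvement
        learned.append(best)
        labels = apply_rule(best, words, labels)  # update labels for the next round
    return learned


# Lowercased, unpunctuated toy input, loosely mimicking speech recognizer output.
words = ["mr", "smith", "met", "mr", "jones", "and", "mr", "brown", "in", "paris"]
gold = ["O", "PER", "O", "O", "PER", "O", "O", "PER", "O", "LOC"]
initial = ["O"] * len(words)

for rule in learn_rules(words, gold, initial):
    print(rule)
```

On this toy data the rule triggered by "mr" is learned first because it fixes three errors, and the rule triggered by "in" follows because it fixes the remaining one. The ordered rule list is the learned model; tagging new text means applying the rules in that order.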

### 3. D. M. Bikel, R. Schwartz, and R. M. Weischedel, “An algorithm that learns what’s in a name,” Mach. Learn., vol. 34, no. 1-3, pp. 211–231, 1999.

**Why did I choose this paper?**

This is the representative of feature-based machine learning approaches in NER. Also, paper (2) made a comparison to this one.

**Abstract the paper**

- Present IdentiFinder, a hidden Markov model that learns to solve NER problems.
- The representation of HMM model: ![image](fig3.png)
    - States include: NOT-A-NAME, ORGANIZATION, PERSON, and the two special states START-OF-SENTENCE and END-OF-SENTENCE.
    - Every word is represented by a state in the bigram model, and there is a probability associated with every transition from the current word to the next word.
- Add a word feature to each word, so that words are treated as ordered pairs `<w, f>`. Word features slightly increased performance in the experiments.
- Implement the model in C++ and evaluate it on English, Spanish and speech input; IdentiFinder reached performance of around 90% on newswire. IdentiFinder’s performance is also competitive with approaches based on handcrafted rules on mixed-case text and superior on text where case information is not available.
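
The ordered pairs `<w, f>` can be illustrated with a small sketch that maps each token to one coarse surface feature. The feature names and their priority order below are a simplified, assumed subset chosen for illustration; they should not be read as the exact feature table from the paper, and `word_feature` and `to_pairs` are hypothetical helper names.

```python
import re
from typing import List, Tuple


def word_feature(word: str, is_first: bool) -> str:
    """Map a token to one coarse surface feature (first matching test wins)."""
    if re.fullmatch(r"\d{2}", word):
        return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word):
        return "fourDigitNum"
    if any(c.isdigit() for c in word):
        return "otherNum"
    if word.isupper():
        return "allCaps"
    if is_first:
        return "firstWord"  # capitalization of a sentence-initial word is uninformative
    if word[:1].isupper():
        return "initCap"
    if word.islower():
        return "lowerCase"
    return "other"


def to_pairs(sentence: List[str]) -> List[Tuple[str, str]]:
    """Turn a tokenized sentence into <w, f> pairs."""
    return [(w, word_feature(w, i == 0)) for i, w in enumerate(sentence)]


for pair in to_pairs(["The", "IBM", "office", "opened", "in", "1999", "in", "Boston"]):
    print(pair)
```

Because the feature is computed deterministically from the word, the model can fall back on it for unseen words: an unknown capitalized token still carries the `initCap` signal even when the word itself has no statistics.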

### 4. P. Zhou, S. Zheng, J. Xu, Z. Qi, H. Bao, and B. Xu, “Joint extraction of multiple relations and entities by using a hybrid neural network,” in CCL-NLP-NABD. Springer, 2017, pp. 135–146.

**Why did I choose this paper?**

This is one of the representatives of deep learning approaches. Also, I had some knowledge about the architecture that they used.

**Abstract the paper**
- The proposed model uses a hybrid neural network to learn sentence features automatically and does not rely on any Natural Language Processing (NLP) tools, such as a dependency parser.
- The architecture consists of 5 components: Input Layer, Embedding Layer, BLSTM Layer (Bidirectional LSTM layer), RC Module and NER Module. ![image](fig4.png)
    - Input Layer
    - Embedding Layer: Each word in the sentence s is looked up in the embedding matrix and transformed into its word embedding by a matrix-vector product. The sentence is then fed to the next layer as a real-valued matrix.
    - BLSTM Layer: LSTM was proposed to overcome the vanishing gradient problem of RNNs, and BLSTM extends the unidirectional LSTM with a second hidden layer in which the hidden-to-hidden connections flow in the opposite temporal order. BLSTM can therefore exploit information from both the past and the future.
    - Relation Classification Module: Consists of:
        - Convolution Layer
        - Max Pooling Layer
        - Sigmoid Activation Layer
    - Named Entity Recognition Module: Consists of:
        - LSTM Decoder
        - Softmax Activation Layer
- Experiments on the CoNLL04 dataset show that the model, using only word embeddings as input features, achieves state-of-the-art performance.
- Analyze the effect of sentence length and relations.
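
The embedding step can be illustrated in plain Python: multiplying the embedding matrix by a one-hot vector simply selects one column. The vocabulary, the 3-dimensional matrix `W`, and the helper names below are toy assumptions made for this sketch; in the real model the matrix is learned or initialized from pre-trained embeddings.

```python
from typing import Dict, List

# Hypothetical toy vocabulary and embedding matrix W of shape (d, |V|) with d = 3:
# column j holds the embedding of word j.
vocab: Dict[str, int] = {"john": 0, "lives": 1, "in": 2, "paris": 3}
W: List[List[float]] = [
    [0.1, 0.4, 0.0, 0.7],
    [0.2, 0.5, 0.3, 0.8],
    [0.3, 0.6, 0.9, 0.1],
]


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if j == index else 0.0 for j in range(size)]


def embed(word: str) -> List[float]:
    """Word embedding as the matrix-vector product of W and the word's one-hot vector."""
    v = one_hot(vocab[word], len(vocab))
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def embed_sentence(words: List[str]) -> List[List[float]]:
    """One embedding per word: the real-valued matrix passed to the next layer."""
    return [embed(w) for w in words]


sentence = ["john", "lives", "in", "paris"]
matrix = embed_sentence(sentence)
print(embed("paris"))                # [0.7, 0.8, 0.1], i.e. column 3 of W
print(len(matrix), len(matrix[0]))   # 4 3
```

In practice frameworks skip the explicit one-hot product and index the column directly, which gives the same result at lower cost.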

### 5. X. Li, X. Sun, Y. Meng, J. Liang, F. Wu, and J. Li, “Dice loss for data-imbalanced NLP tasks,” CoRR, vol. abs/1911.02855, 2019.

**Why did I choose this paper?**

- This is one of the representatives of deep learning approaches.
- State-of-the-art mentioned by the survey (1).

**Abstract the paper**
- Data imbalance is an issue in many NLP tasks: negative examples significantly outnumber positive ones, and the huge number of easy-negative examples overwhelms training.
- Propose to use dice loss in place of the standard cross-entropy objective, which is actually accuracy-oriented and so creates a discrepancy between training and test, where F1 is the evaluation metric. The dice loss is F1-oriented.
- Experiment on part-of-speech tagging and named entity recognition. All experiments show an increase in F1-score when using the dice loss compared to other losses, including cross entropy, weighted cross entropy, dice coefficient and focal loss.
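
The gap between accuracy and F1 on imbalanced data, and the idea of a Dice-based loss, can be sketched in a few lines. The soft loss below is a common set-level formulation with an assumed smoothing term `smooth`; it is meant as an illustration of the idea and is not necessarily the exact per-example variant defined in the paper. All function names and the toy data are invented for this sketch.

```python
from typing import List


def accuracy(pred: List[int], gold: List[int]) -> float:
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def dice_coefficient(pred: List[int], gold: List[int]) -> float:
    """Dice on hard 0/1 labels: 2*TP / (2*TP + FP + FN), which equals F1."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def soft_dice_loss(probs: List[float], gold: List[int], smooth: float = 1.0) -> float:
    """1 - soft Dice, computed on predicted probabilities so it can be optimized."""
    overlap = sum(p * g for p, g in zip(probs, gold))
    total = sum(probs) + sum(gold)
    return 1.0 - (2.0 * overlap + smooth) / (total + smooth)


# 2 positive tokens among 20; the model predicts "negative" everywhere.
gold = [1, 1] + [0] * 18
all_negative = [0] * 20

print(accuracy(all_negative, gold))          # 0.9
print(dice_coefficient(all_negative, gold))  # 0.0
print(soft_dice_loss([0.0] * 20, gold))      # high loss despite 90% accuracy
```

A model that ignores every entity still scores 90% accuracy here, while its Dice (F1) is zero. An objective built on Dice therefore keeps penalizing the model until it actually recovers the rare positive class.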