<a href="https://colab.research.google.com/github/babupallam/Msc_AI_Module2_Natural_Language_Processing/blob/main/Note_01_Introduction_to_Word_Categorization_and_Tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


- **Categorizing and Tagging Words** is one of the foundational concepts in Natural Language Processing (NLP).
- This section introduces the essential tasks of identifying different parts of speech (POS) in a sentence and marking words with their corresponding tags.
- These tags represent lexical categories such as nouns, verbs, and adjectives. Tagging plays a pivotal role in numerous NLP applications, from syntactic parsing to machine translation.



### **What is Word Categorization?**

- Word categorization refers to classifying words into **lexical categories** (also called **parts of speech**).
- These categories are essentially groups of words that share similar syntactic or grammatical roles.
- The process helps computers understand the function of words in sentences, which is critical for building robust language models.

- The most common lexical categories include:
  - **Nouns (N)**: Represent people, places, or things (e.g., "dog," "city").
  - **Verbs (V)**: Indicate actions or states (e.g., "run," "is").
  - **Adjectives (ADJ)**: Describe nouns or pronouns (e.g., "blue," "large").
  - **Adverbs (ADV)**: Modify verbs, adjectives, or other adverbs (e.g., "quickly," "very").



### **Importance of Word Categorization in NLP**

- Word categorization is a critical step in **syntactic parsing**, **text-to-speech systems**, **information extraction**, and other NLP tasks.
- By categorizing words into their respective lexical categories, machines can start building a structured representation of language.
- For example, by identifying a word as a **noun**, the system can expect it to be the subject or object of a sentence, aiding in understanding the sentence's meaning.



### **What is Word Tagging (POS Tagging)?**

- **Part-of-speech tagging** (POS tagging) is the process of marking each word in a sentence with its corresponding part of speech.
- It is the automated version of word categorization, where computational models assign the most probable tag to each word based on its context.

For example, consider the sentence:  
*"The quick brown fox jumps over the lazy dog."*  
A POS tagger would label each word as follows:

- The (DET)  
- quick (ADJ)  
- brown (ADJ)  
- fox (NOUN)  
- jumps (VERB)  
- over (ADP)  
- the (DET)  
- lazy (ADJ)  
- dog (NOUN)

In this tagging example, each word is assigned a tag that reflects its grammatical function.
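
In code, a tagged sentence is commonly represented as a list of `(word, tag)` pairs. The sketch below reproduces the example above with a hand-written lookup table; `TOY_LEXICON` and `toy_tag` are hypothetical names used only for illustration, and a real tagger would infer tags from context rather than look them up.

```python
# Toy lookup "tagger": the lexicon is hand-written for this one sentence
# and is an assumption for illustration, not a real tagging model.
TOY_LEXICON = {
    "the": "DET", "quick": "ADJ", "brown": "ADJ", "fox": "NOUN",
    "jumps": "VERB", "over": "ADP", "lazy": "ADJ", "dog": "NOUN",
}

def toy_tag(sentence):
    """Return a list of (word, tag) pairs; unknown words get the tag 'X'."""
    words = sentence.rstrip(".").split()
    return [(w, TOY_LEXICON.get(w.lower(), "X")) for w in words]

tagged = toy_tag("The quick brown fox jumps over the lazy dog.")
for word, tag in tagged:
    print(f"{word} ({tag})")
```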



### **Is There a Standard for Creating POS Tags?**



Yes, there are several **standards** used for creating Part-of-Speech (POS) tags. The choice of tagset often depends on the language, the complexity of the task, and the needs of the NLP project. Some widely used **POS tagging standards** include:




#### **1. Penn Treebank Tagset**
This is one of the most widely used tagsets for English-language corpora. It contains 36 POS tags (plus additional tags for punctuation and symbols), which provide a detailed breakdown of parts of speech. The tagset is used to annotate the **Wall Street Journal (WSJ) Corpus** and other large datasets.

##### Example Tags:
- **NN**: Singular noun (e.g., *dog*).
- **VB**: Base form of a verb (e.g., *run*).
- **JJ**: Adjective (e.g., *blue*).
- **RB**: Adverb (e.g., *quickly*).

##### Usage:
The Penn Treebank tagset is often used for syntactic parsing and as training data for supervised machine learning models. It provides detailed categorization, distinguishing between, for example, different forms of verbs like base form (VB), past tense (VBD), or gerunds (VBG).
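
To make the verb distinctions concrete, the sketch below lists the six Penn Treebank verb tags and picks the verbs out of a small hand-tagged sentence. The sentence and the helper name `describe_verbs` are assumptions made for this example.

```python
# The six verb tags of the Penn Treebank tagset.
PENN_VERB_TAGS = {
    "VB": "base form",
    "VBD": "past tense",
    "VBG": "gerund or present participle",
    "VBN": "past participle",
    "VBP": "non-3rd person singular present",
    "VBZ": "3rd person singular present",
}

def describe_verbs(tagged):
    """Keep only verb tokens and attach a readable description of the tag."""
    return [(w, PENN_VERB_TAGS[t]) for w, t in tagged if t in PENN_VERB_TAGS]

# Hand-tagged toy sentence (illustrative, not taken from a corpus).
sentence = [("She", "PRP"), ("runs", "VBZ"), ("and", "CC"), ("ran", "VBD")]
print(describe_verbs(sentence))
```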



#### **2. Universal POS Tagset**
The **Universal POS Tagset** is a simplified tagset used for multilingual NLP tasks. It was developed by researchers to be more general and cross-lingual, supporting multiple languages while reducing the complexity of language-specific tagsets.

##### Example Tags:
- **NOUN**: General noun.
- **VERB**: General verb.
- **ADJ**: Adjective.
- **ADV**: Adverb.

##### Usage:
The Universal POS Tagset is commonly used in tasks that require consistency across multiple languages, such as **machine translation** or **cross-lingual text classification**. It is also used in widely distributed corpora, like the **Universal Dependencies** project.




#### **3. Brown Corpus Tagset**
The **Brown Corpus Tagset** was one of the earliest tagsets used for POS tagging in the **Brown Corpus**, an American English corpus that became a standard for linguistic analysis. The Brown Corpus has a fine-grained set of POS tags (87 tags) that are highly specific.

##### Example Tags:
- **NP**: Proper noun.
- **AT**: Article (e.g., *the*).
- **VBZ**: Verb, 3rd person singular present (e.g., *runs*).

##### Usage:
It is still used in some NLP projects, but newer tagsets such as the Penn Treebank tagset have replaced it in more recent applications. However, the Brown Corpus tagset remains useful for studying linguistic variation in American English.




#### **4. CLAWS Tagset**
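The **CLAWS Tagset** takes its name from CLAWS (Constituent Likelihood Automatic Word-tagging System), a tagger developed at Lancaster University; an early version was used to tag the **Lancaster-Oslo/Bergen (LOB) Corpus**. The tagset exists in several versions, such as C5, C6, and C7, which differ in how fine-grained their distinctions are (C5 is the most compact of the three). CLAWS is especially known for its high accuracy in automatic tagging.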

##### Example Tags:
- **NN1**: Singular noun.
- **VV0**: Base form of verb.
- **DT0**: General determiner.

##### Usage:
CLAWS has been widely used for corpus linguistics and text analysis in British English. It was one of the first automatic taggers and is known for its precision in tagging various parts of speech.




#### **5. Morphosyntactic Tagsets**
These tagsets capture **morphological information** alongside syntactic roles. For instance, in morphologically rich languages like Russian, Arabic, or Finnish, a word’s tag might encode not just its POS but also its **tense**, **gender**, **number**, or **case**.

##### Example (for Russian):
- **NOUN-fem-sing**: A singular, feminine noun.
- **VERB-past-masc**: A masculine past tense verb.

##### Usage:
Such tagsets are used in tasks like **morphological tagging** for languages with complex grammar systems. They are crucial for languages where word forms carry a significant amount of syntactic and semantic information.
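
A morphosyntactic tag can be unpacked into a POS plus named features. The sketch below assumes the hyphen-separated format of the two illustrative tags above; that format, the `FEATURE_NAMES` table, and `parse_morph_tag` are hypothetical and do not correspond to any particular published tagset.

```python
# Maps each feature abbreviation to a (feature name, value) pair.
# The abbreviations follow the illustrative tags above, not a real standard.
FEATURE_NAMES = {
    "fem": ("gender", "feminine"),
    "masc": ("gender", "masculine"),
    "sing": ("number", "singular"),
    "plur": ("number", "plural"),
    "past": ("tense", "past"),
}

def parse_morph_tag(tag):
    """Split 'POS-feat-feat' into a dict with the POS and its features."""
    pos, *features = tag.split("-")
    parsed = {"pos": pos}
    for feat in features:
        name, value = FEATURE_NAMES[feat]
        parsed[name] = value
    return parsed

print(parse_morph_tag("NOUN-fem-sing"))
print(parse_morph_tag("VERB-past-masc"))
```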




#### **Key Considerations When Choosing a Tagset**
- **Task Complexity**: For simple tasks like text classification, the Universal Tagset might suffice. However, for more complex tasks like parsing or machine translation, more detailed tagsets like the Penn Treebank or Brown Corpus tagsets might be necessary.
- **Language**: Some tagsets, like the Penn Treebank, are English-specific, while others like the Universal Tagset are designed for cross-lingual tasks.
- **Corpus Size**: In large corpora, a more granular tagset like the CLAWS tagset may be helpful for detailed linguistic analysis.
- **Generalizability**: Universal tagsets are better suited for multilingual tasks, where consistency across languages is important.


### **Lexical Categories: Open and Closed Classes**
Words are typically divided into two major classes: **open class** and **closed class** words.

- **Open Class**:
  - This class includes words that frequently gain new members.
  - Examples are **nouns**, **verbs**, **adjectives**, and **adverbs**.
  - New words like "cyberspace" or "blogging" are typically added to these classes as language evolves.
  
- **Closed Class**:
  - Closed classes consist of words that rarely change or expand.
  - Examples include **prepositions** (e.g., "on," "in," "above") and **pronouns** (e.g., "he," "she," "they").
  - These categories remain relatively fixed over time.
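
The open/closed distinction can be expressed as two sets of tags. The sketch below uses Universal-style tag names; placing `DET` and `CONJ` in the closed class is an assumption that goes beyond the two examples named above, and `word_class` is a hypothetical helper.

```python
# Open classes readily admit new words; closed classes rarely do.
OPEN_CLASS = {"NOUN", "VERB", "ADJ", "ADV"}
CLOSED_CLASS = {"ADP", "PRON", "DET", "CONJ"}  # DET and CONJ are assumed here

def word_class(tag):
    """Return 'open', 'closed', or 'other' for a Universal-style tag."""
    if tag in OPEN_CLASS:
        return "open"
    if tag in CLOSED_CLASS:
        return "closed"
    return "other"

sentence = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
    ("dog", "NOUN"),
]

# Group the words of the sentence by class.
summary = {}
for word, tag in sentence:
    summary.setdefault(word_class(tag), []).append(word)
print(summary)
```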



### **Challenges in Word Tagging**
  - While tagging might seem straightforward, the real challenge lies in handling **ambiguities** and **contextual variations**.
  - For instance, words like **"bank"** can be tagged differently depending on the sentence:
    - **Noun**: "He went to the **bank**."
    - **Verb**: "I will **bank** on your support."

  - Words like this belong to more than one lexical category and are said to be **POS-ambiguous** (a form of lexical ambiguity); the tagger has to rely on the surrounding context to choose the correct part of speech.
  - This makes POS tagging not only a lexical task but also one that involves syntactic and semantic understanding of sentences.
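
The role of context can be illustrated with a deliberately simple rule: look at the word immediately before the ambiguous one. This is a minimal sketch under the assumption that a preceding modal signals a verb and a preceding determiner signals a noun; the word lists and the function `tag_ambiguous` are hypothetical, and practical taggers use statistical or neural models instead of hand-written rules like these.

```python
DETERMINERS = {"the", "a", "an"}
MODALS = {"will", "can", "would", "should", "may", "might", "must"}

def tag_ambiguous(words, index):
    """Guess NOUN or VERB for words[index] from the preceding word only."""
    prev = words[index - 1].lower() if index > 0 else ""
    if prev in MODALS:
        return "VERB"   # e.g. "will bank"
    if prev in DETERMINERS:
        return "NOUN"   # e.g. "the bank"
    return "NOUN"       # fallback guess when the rule has no evidence

for text in ["He went to the bank.", "I will bank on your support."]:
    words = text.rstrip(".").split()
    position = words.index("bank")
    print(text, "->", tag_ambiguous(words, position))
```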



### **Universal POS Tagset**
  - The **Universal POS Tagset** is a simplified set of POS categories used across different languages.
  - It reduces the complexity of the more detailed tagsets (like those in the Brown Corpus) by consolidating tags into high-level categories.
  - The Universal Tagset includes tags such as:
    - **ADJ** (Adjective)
    - **NOUN** (Noun)
    - **VERB** (Verb)
    - **ADV** (Adverb)
    - **DET** (Determiner)

  - By standardizing these tags, NLP systems can perform tagging across multiple languages with similar POS structures, making it easier to build multi-lingual models.
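
Consolidating a detailed tagset into Universal categories amounts to a many-to-one mapping. The sketch below hand-writes a partial mapping from a few Penn Treebank tags; it is intentionally incomplete, and the names `PENN_TO_UNIVERSAL` and `to_universal` are assumptions made for this example.

```python
# Partial, hand-written mapping from Penn Treebank tags to Universal tags.
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "IN": "ADP",
}

def to_universal(tagged):
    """Convert (word, Penn tag) pairs; tags missing from the table become 'X'."""
    return [(w, PENN_TO_UNIVERSAL.get(t, "X")) for w, t in tagged]

penn_tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
               ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(to_universal(penn_tagged))
```

Several fine-grained tags (for example `VB`, `VBD`, `VBG`, `VBZ`) collapse into one coarse tag, which is the consolidation described above.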



### **Example of a Tagged Corpus**

  - In order to learn tagging, NLP models are often trained using **tagged corpora**, which are collections of text where every word has been manually tagged with its POS.
  - One widely used corpus is the **Brown Corpus**, which provides a large dataset of English text with detailed POS tags.
  - A tagged sentence in the style of such a corpus, shown here with Universal tags rather than the original Brown tags, might look like this:

    ```
    ('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ('on', 'ADP'), ('the', 'DET'), ('mat', 'NOUN')
    ```

Tagged corpora like this help the model learn which words are likely to be associated with certain tags in various contexts.
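
One of the simplest ways to learn from a tagged corpus is to remember the most frequent tag of each word. The sketch below trains such a unigram tagger on a tiny hand-made corpus; `TOY_CORPUS`, `train_unigram`, `tag_words`, and the default tag `NOUN` for unseen words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (illustrative, not taken from the Brown Corpus).
TOY_CORPUS = [
    [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
     ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")],
    [("The", "DET"), ("dog", "NOUN"), ("sat", "VERB"),
     ("on", "ADP"), ("the", "DET"), ("cat", "NOUN")],
    [("Dogs", "NOUN"), ("run", "VERB")],
]

def train_unigram(corpus):
    """Map each lower-cased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default="NOUN"):
    """Tag known words from the model; unseen words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_unigram(TOY_CORPUS)
print(tag_words("The dog sat on the mat".split(), model))
```

If NLTK and its `brown` corpus data are available, `nltk.corpus.brown.tagged_sents(tagset='universal')` could supply real training sentences in the same list-of-pairs shape.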



### **List of Widely Used Tagged Corpora**



### 1. **Penn Treebank**
The **Penn Treebank** is one of the most widely used resources for training NLP models. It contains about 4.5 million words of tagged data, primarily from the Wall Street Journal (WSJ) corpus.

- **Tags**: Based on the **Penn Treebank Tagset** (36 POS tags).
- **Use**: Often used in syntactic parsing, POS tagging, and language modeling tasks.
  
---

### 2. **Brown Corpus**
The **Brown Corpus** is a pioneering corpus that contains over a million words of American English text. It was one of the first corpora to be tagged for parts of speech, using the **Brown Tagset** (87 tags).

- **Tags**: Highly detailed, distinguishes between different types of verbs, nouns, and determiners.
- **Use**: Training POS taggers, linguistic analysis, and studies of American English.

---

### 3. **Universal Dependencies (UD)**
**Universal Dependencies** is a multilingual treebank project that provides POS tagging, syntactic dependencies, and morphological annotations for over 100 languages.

- **Tags**: Uses the **Universal POS Tagset**, a simplified cross-lingual tagset.
- **Use**: Cross-lingual POS tagging, syntactic parsing, and dependency parsing.

---

### 4. **Lancaster-Oslo/Bergen (LOB) Corpus**
The **LOB Corpus** is a British English equivalent to the Brown Corpus. It consists of around one million words and is tagged using the **CLAWS tagset**.

- **Tags**: CLAWS tagset, which is known for its fine-grained distinctions.
- **Use**: Training POS taggers, studying British English, and comparative linguistic studies with the Brown Corpus.

---

### 5. **Treebank-3**
**Treebank-3** is a later release of the Penn Treebank project, which includes additional text from the Switchboard corpus (transcribed telephone conversations) and other sources.

- **Tags**: Uses the **Penn Treebank Tagset**.
- **Use**: Used for POS tagging, syntactic parsing, and discourse analysis, especially in conversational text.

---

### 6. **CoNLL-2000 Chunking Corpus**
The **CoNLL-2000** corpus is used for **shallow parsing** (chunking) tasks. Each token is annotated with a POS tag as well as a chunk tag.

- **Tags**: Uses the Penn Treebank Tagset for POS tagging and additional tags for chunk types (e.g., NP for noun phrase).
- **Use**: Chunking tasks, POS tagging, and syntactic analysis.
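
CoNLL-2000 data is stored one token per line, with the word, its POS tag, and its chunk tag separated by spaces; chunk tags use a `B-`/`I-`/`O` scheme. The sketch below reads a few lines in that format and collects the noun-phrase chunks. The sample lines are illustrative, and `read_rows` and `extract_chunks` are hypothetical helper names.

```python
# A few lines in the "word POS chunk" layout (illustrative sample).
SAMPLE = """\
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
. . O
"""

def read_rows(text):
    """Split each non-empty line into a (word, pos, chunk) tuple."""
    return [tuple(line.split()) for line in text.splitlines() if line.strip()]

def extract_chunks(rows, chunk_type="NP"):
    """Collect the word sequences labelled B-<type> followed by I-<type>."""
    chunks, current = [], []
    for word, _pos, chunk in rows:
        if chunk == f"B-{chunk_type}":
            if current:                      # close the previous chunk
                chunks.append(" ".join(current))
            current = [word]                 # start a new chunk
        elif chunk == f"I-{chunk_type}" and current:
            current.append(word)             # continue the open chunk
        else:
            if current:                      # any other tag ends the chunk
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

rows = read_rows(SAMPLE)
print(extract_chunks(rows))          # noun phrases
print(extract_chunks(rows, "VP"))    # verb phrases
```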

---

### 7. **Wall Street Journal (WSJ) Corpus**
A subset of the Penn Treebank, the **WSJ Corpus** contains the text of news articles from the Wall Street Journal. It has around 1 million words.

- **Tags**: Penn Treebank Tagset.
- **Use**: Used in parsing, POS tagging, and language modeling tasks.

---

### 8. **TIGER Corpus**
The **TIGER Corpus** is a German-language corpus that includes around 900,000 words of German text tagged with POS tags and syntactic structure.

- **Tags**: Uses the **Stuttgart-Tübingen Tagset (STTS)** for German POS tagging.
- **Use**: Training POS taggers, dependency parsing, and syntactic analysis of German.

---

### 9. **Indian Languages Corpora (Indic Treebanks)**
A collection of treebanks for various Indian languages such as Hindi, Bengali, and Telugu. These corpora are part of the **Indian Languages Corpora Initiative (ILCI)**.

- **Tags**: Language-specific tagsets adapted from the Universal POS Tagset.
- **Use**: Multilingual POS tagging, syntactic parsing, and linguistic research on Indian languages.

---

### 10. **Mac-Morpho Corpus (Brazilian Portuguese)**
The **Mac-Morpho Corpus** contains around 1.5 million words of Brazilian Portuguese text. It is manually tagged with POS labels.

- **Tags**: Uses the **Mac-Morpho tagset** for Brazilian Portuguese.
- **Use**: POS tagging for Portuguese, used in speech recognition and language modeling for Portuguese.

---

### 11. **Sinica Treebank (Chinese)**
The **Sinica Treebank** is a corpus of Chinese text tagged with POS tags and syntactic structures. It contains around 150,000 words.

- **Tags**: Uses a custom Chinese tagset.
- **Use**: POS tagging, syntactic parsing, and Chinese language modeling.

---

### 12. **CoNLL-2003 Named Entity Recognition (NER) Corpus**
Although primarily used for Named Entity Recognition (NER), the **CoNLL-2003 Corpus** also includes POS tags for English and German texts.

- **Tags**: Uses Penn Treebank tagset for English and STTS for German.
- **Use**: Named entity recognition, POS tagging, and sequence labeling tasks.

---

### 13. **Europarl Corpus**
The **Europarl Corpus** consists of parallel text aligned across several European languages, extracted from the proceedings of the European Parliament.

- **Tags**: The corpus is distributed as plain aligned text; POS tags are typically added automatically, using the Universal POS Tagset or language-specific tagsets.
- **Use**: POS tagging, machine translation, and cross-lingual language processing.


### **Comparison of Tagged Corpora by Use Case**



| **Corpus**                          | **Language**            | **Tagset**                         | **Size**             | **Primary Use Cases**                        | **Key Features**                           |
|-------------------------------------|-------------------------|------------------------------------|----------------------|----------------------------------------------|--------------------------------------------|
| **Penn Treebank**                   | English                 | Penn Treebank Tagset (36 tags)     | 4.5 million words    | POS tagging, syntactic parsing, language modeling | Detailed syntactic annotations, includes the WSJ corpus |
| **Brown Corpus**                    | English                 | Brown Tagset (87 tags)             | 1 million words      | POS tagging, linguistic analysis             | One of the earliest tagged corpora, detailed tagging |
| **Universal Dependencies (UD)**     | Multilingual (100+ languages) | Universal POS Tagset              | Varies by language   | Multilingual POS tagging, syntactic parsing  | Cross-lingual consistency, used for dependency parsing |
| **Lancaster-Oslo/Bergen (LOB) Corpus** | British English         | CLAWS Tagset                      | 1 million words      | POS tagging, British English analysis        | Focus on British English, comparable to Brown Corpus |
| **Treebank-3**                      | English (spoken, written) | Penn Treebank Tagset              | ~1 million words     | POS tagging, syntactic parsing, discourse analysis | Contains conversational data (e.g., Switchboard Corpus) |
| **CoNLL-2000 Chunking Corpus**      | English                 | Penn Treebank Tagset + Chunk tags  | ~260,000 tokens      | Chunking, POS tagging, syntactic parsing     | Used in the CoNLL-2000 shared task on chunking (shallow parsing) |
| **Wall Street Journal (WSJ) Corpus** | English                 | Penn Treebank Tagset              | 1 million words      | POS tagging, syntactic parsing, text analysis | Part of the Penn Treebank, newswire data focus |
| **TIGER Corpus**                    | German                  | Stuttgart-Tübingen Tagset (STTS)   | 900,000 words        | POS tagging, syntactic parsing for German    | Comprehensive syntactic annotations for German |
| **Indian Languages Corpora (ILCI)** | Hindi, Telugu, Bengali   | Custom tagsets adapted from Universal POS | Varies by language  | Multilingual POS tagging, syntactic parsing for Indian languages | Key resource for NLP in Indian languages |
| **Mac-Morpho Corpus**               | Brazilian Portuguese     | Mac-Morpho Tagset                 | 1.5 million words    | POS tagging, speech recognition for Portuguese | Key corpus for Brazilian Portuguese text processing |
| **Sinica Treebank**                 | Chinese                 | Custom Chinese tagset             | 150,000 words        | POS tagging, syntactic parsing for Chinese   | Rich syntactic and POS annotations for Mandarin |
| **CoNLL-2003 NER Corpus**           | English, German          | Penn Treebank (English), STTS (German) | ~300,000 words (English) | Named entity recognition, POS tagging, sequence labeling | Used in NER competitions, includes both POS and NER tags |
| **Europarl Corpus**                 | Multilingual (21 European languages) | Universal POS / Language-specific | Varies by language   | Machine translation, cross-lingual POS tagging | Multilingual, parallel text for machine translation |


### **Tagging Applications in NLP**
POS tagging serves as an essential preprocessing step in many high-level NLP tasks:
- **Text-to-Speech Systems**: Helps in pronunciation by indicating stress patterns and how words should be spoken.
- **Machine Translation**: Improves translation quality by maintaining syntactic structure across languages.
- **Named Entity Recognition (NER)**: By identifying nouns (often proper nouns), systems can detect entities such as people, locations, and organizations.
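
As a small illustration of the NER use case, the sketch below groups consecutive proper-noun tokens into candidate entity spans. It assumes Penn Treebank tags (`NNP`, `NNPS`) on a hand-tagged toy sentence; `proper_noun_spans` is a hypothetical helper, and real NER systems rely on much more than POS tags.

```python
PROPER_NOUN_TAGS = {"NNP", "NNPS"}

def proper_noun_spans(tagged):
    """Join runs of adjacent proper nouns into candidate entity strings."""
    spans, current = [], []
    for word, tag in tagged:
        if tag in PROPER_NOUN_TAGS:
            current.append(word)
        else:
            if current:                      # a non-proper-noun ends the run
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# Hand-tagged toy sentence (illustrative, not taken from a corpus).
example = [("Ada", "NNP"), ("Lovelace", "NNP"), ("visited", "VBD"),
           ("London", "NNP"), ("in", "IN"), ("spring", "NN"), (".", ".")]
print(proper_noun_spans(example))
```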


Note: This note was prepared with the help of AI assistance and other materials, including journals and articles.

