# **Assignment 2: Advanced Text Pre-Processing**

### **Q1: Comparing Stemming and Lemmatization**

`Stemming and lemmatization` are both used to reduce words to their base forms.


- Using NLTK's `PorterStemmer` and `WordNetLemmatizer`, process the following list of words:  
    > ["running", "flies", "better", "studying", "mice"]

    - Show the output for both stemming and lemmatization.  

- Discuss the differences in their outputs. When would lemmatization be preferred over stemming in real-world applications?

- Identify a domain-specific NLP use case where lemmatization would improve model performance. Justify your answer.
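
As a starting point for the first bullet, here is a minimal sketch. It assumes NLTK is installed and that the WordNet corpus has been fetched once with `nltk.download("wordnet")`. The helper names `stem_all` and `lemmatize_all` are illustrative and not part of NLTK. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, which matters for words such as "running" and "better".

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

WORDS = ["running", "flies", "better", "studying", "mice"]


def stem_all(words):
    """Apply the rule-based Porter stemmer to each word."""
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in words]


def lemmatize_all(words, pos="n"):
    """Look each word up in WordNet; pos is 'n', 'v', 'a', or 'r'.

    Requires the WordNet corpus: nltk.download("wordnet").
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=pos) for word in words]
```

Calling `stem_all(WORDS)` and then `lemmatize_all(WORDS)` with different `pos` values produces the outputs to compare in the discussion.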

### **Q2: POS Tagging in Complex Sentences**

`POS tagging` identifies grammatical roles in text.

- Use spaCy to perform POS tagging on the following text:  

    > **The quick brown fox jumps over the lazy dog while the cat sleeps.**

    - List each word and its corresponding POS tag.  


- Discuss how POS tagging can improve the performance of NLP tasks like sentiment analysis or machine translation. Provide a real-world example for each.
- **Real-World Challenge**: 
    - Suggest a method to handle POS tagging for low-resource languages where pre-trained models are unavailable.

### **Q3: Named Entity Recognition (NER) in Financial News**

`NER` helps identify specific entities like people, organizations, and dates in the text.

- Use spaCy's NER model to extract named entities from the following text:  
    > **On October 5, 2023, Microsoft announced its acquisition of OpenAI for $10 billion.**

    - List each entity and its label (e.g., PERSON, ORG, DATE, MONEY).  

- Modify the text to include ambiguous entities (e.g., "Apple" could refer to the company or the fruit). Run NER again and discuss how ambiguity affects results.
- **Optional**: In financial text analysis, what additional entity categories (e.g., stock tickers, financial metrics) would you add to a NER model? Explain why these categories are critical.
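
For the optional question, a rule-based pass is one way to prototype new categories before training a model. The sketch below is only an illustration: the labels `TICKER` and `PERCENT_CHANGE`, the cashtag convention (`$MSFT`), and the function name are assumptions, not part of spaCy.

```python
import re

# Hypothetical patterns: cashtags such as "$MSFT" and percentage figures.
TICKER_RE = re.compile(r"\$[A-Z]{1,5}\b")
PERCENT_RE = re.compile(r"\b\d+(?:\.\d+)?%")


def extract_financial_entities(text):
    """Return (text, label) pairs for ticker-like and percentage-like spans."""
    entities = [(m.group(0), "TICKER") for m in TICKER_RE.finditer(text)]
    entities += [(m.group(0), "PERCENT_CHANGE") for m in PERCENT_RE.finditer(text)]
    return entities


sample = "Shares of $MSFT rose 3.5% after the announcement, while $AAPL fell 1%."
print(extract_financial_entities(sample))
```

Because `$10` starts with a digit, the ticker pattern does not fire on money amounts such as "$10 billion".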

### **Q4: Coreference Resolution in Conversational Text**

`Coreference resolution` identifies when different words refer to the same entity.

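- Use spaCy with a coreference extension to resolve coreferences in the following text (core spaCy pipelines do not ship a coreference component; the `spacy-experimental` package provides one):  

    > **John went to the store. He bought milk and eggs. Later, he decided to bake a cake.**

    - Rewrite the text with all references explicitly resolved.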

- Discuss the importance of coreference resolution in applications like chatbots and summarization tools. Provide a real-world example for each.

- **Optional**: Coreference resolution often struggles with long, complex documents. Suggest one method to improve its accuracy in such cases.
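
To make the expected "rewritten" output concrete, here is a toy heuristic. It is not a coreference model and not spaCy's approach: it simply replaces "he" or "she" with the most recently seen name from a list you supply. The function name and the `known_names` parameter are hypothetical.

```python
import re

PRONOUNS = {"he", "she"}


def resolve_pronouns(text, known_names):
    """Replace subject pronouns with the most recently mentioned known name."""
    last_name = None

    def replace(match):
        nonlocal last_name
        word = match.group(0)
        if word in known_names:
            last_name = word  # remember the latest antecedent candidate
            return word
        if word.lower() in PRONOUNS and last_name is not None:
            return last_name
        return word

    return re.sub(r"[A-Za-z]+", replace, text)


text = "John went to the store. He bought milk and eggs. Later, he decided to bake a cake."
print(resolve_pronouns(text, {"John"}))
# John went to the store. John bought milk and eggs. Later, John decided to bake a cake.
```

The heuristic breaks as soon as two people are mentioned, which is a useful contrast when discussing why learned coreference models are needed.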

### **Q5: Preprocessing for Legal Documents**

Legal documents often contain long sentences, complex structures, and domain-specific terms.

- Given the following legal text, perform stemming, lemmatization, and POS tagging using NLTK or spaCy:  
    > **The agreement shall be terminated upon mutual consent of both parties.**

- Reflect on the results:
    - Which preprocessing technique is most suitable for analyzing legal text?  
    - Discuss how domain-specific stopwords (e.g., "agreement," "parties") can improve model performance.

- Suggest one practical application of NLP in the legal industry and outline a preprocessing pipeline for it.
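
The effect of domain-specific stopwords can be sketched in a few lines. The two stopword sets below are deliberately tiny and hypothetical; a real pipeline would start from a library list (NLTK or spaCy) and extend it with terms chosen from corpus frequencies.

```python
import re

# Hypothetical, minimal lists for illustration only.
GENERAL_STOPWORDS = {"the", "shall", "be", "upon", "of", "both"}
LEGAL_STOPWORDS = {"agreement", "parties"}


def content_tokens(text, extra_stopwords=frozenset()):
    """Lowercase, tokenize on letters, and drop general plus extra stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stopwords = GENERAL_STOPWORDS | set(extra_stopwords)
    return [token for token in tokens if token not in stopwords]


text = "The agreement shall be terminated upon mutual consent of both parties."
print(content_tokens(text))
# ['agreement', 'terminated', 'mutual', 'consent', 'parties']
print(content_tokens(text, LEGAL_STOPWORDS))
# ['terminated', 'mutual', 'consent']
```

Whether removing "agreement" and "parties" helps depends on the task: they carry little signal when every document is a contract, but they may matter when classifying document types.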

### **Q6: Analyzing Customer Feedback with Advanced Preprocessing**

Imagine you're tasked with analyzing customer feedback for a hotel chain.

- Create a small dataset of 5 fictional reviews, such as:  
    > **The staff was friendly, but the rooms were unclean.**

    > **Amazing location and excellent service, but the food quality needs improvement.**

- Perform the following steps for each review:
    - Tokenize the text into sentences and words.  
    - Apply POS tagging to identify nouns and adjectives.  

- Based on the adjectives, classify each review as positive, negative, or neutral. Explain your reasoning.
- **Optional**: Discuss how advanced preprocessing techniques like NER and coreference resolution could enhance sentiment analysis for customer feedback.
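
One simple way to turn adjectives into a label is a lexicon count. The sketch below assumes the reviews have already been POS-tagged; the `(word, tag)` pairs are hand-assigned here using Universal POS tags (the scheme behind spaCy's `token.pos_`), and a real tagger's output may differ. The two adjective sets are hypothetical and far too small for real use.

```python
# Hypothetical mini-lexicons; extend them for your own reviews.
POSITIVE_ADJ = {"friendly", "amazing", "excellent"}
NEGATIVE_ADJ = {"unclean"}


def classify_review(tagged_tokens):
    """Score a review as (#positive adjectives) - (#negative adjectives)."""
    adjectives = [word.lower() for word, pos in tagged_tokens if pos == "ADJ"]
    score = sum(a in POSITIVE_ADJ for a in adjectives) - sum(
        a in NEGATIVE_ADJ for a in adjectives
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


REVIEW_1 = [("The", "DET"), ("staff", "NOUN"), ("was", "AUX"), ("friendly", "ADJ"),
            ("but", "CCONJ"), ("the", "DET"), ("rooms", "NOUN"), ("were", "AUX"),
            ("unclean", "ADJ")]
REVIEW_2 = [("Amazing", "ADJ"), ("location", "NOUN"), ("and", "CCONJ"),
            ("excellent", "ADJ"), ("service", "NOUN"), ("but", "CCONJ"),
            ("the", "DET"), ("food", "NOUN"), ("quality", "NOUN"),
            ("needs", "VERB"), ("improvement", "NOUN")]

print(classify_review(REVIEW_1))  # neutral
print(classify_review(REVIEW_2))  # positive
```

The second review is scored as positive even though it criticizes the food, because "needs improvement" contains no adjective. That gap is worth addressing in your reasoning.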

### **Q7: Domain-Specific NER for Healthcare**

NER models often need customization for specific domains like healthcare.

- Create a fictional dataset of 5 sentences related to medical reports. Example:  
    > **Patient John Doe, aged 45, was prescribed Amoxicillin on January 10, 2023.**

    > **Dr. Smith diagnosed the patient with Type 2 Diabetes.**

- Use spaCy's NER model to extract named entities. Identify entities like PERSON, ORG, DATE, and any other relevant labels.  

- Suggest three additional entity categories (e.g., drug names, medical conditions) that would improve NER performance in the healthcare domain. Discuss how these entities could be used in real-world applications like electronic health records or medical research.
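
Custom categories can be prototyped with spaCy's rule-based `EntityRuler` before any model training. The sketch below assumes spaCy 3.x is installed and uses a blank English pipeline, so it needs no model download but also finds no PERSON or DATE entities. The labels `DRUG` and `CONDITION` and both patterns are hypothetical examples.

```python
import spacy

# Blank pipeline: tokenizer only, no statistical NER.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Hypothetical labels and patterns for illustration.
    {"label": "DRUG", "pattern": [{"LOWER": "amoxicillin"}]},
    {"label": "CONDITION",
     "pattern": [{"LOWER": "type"}, {"LOWER": "2"}, {"LOWER": "diabetes"}]},
])


def extract_entities(text):
    """Return (entity text, label) pairs found by the rule-based matcher."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


print(extract_entities("Dr. Smith diagnosed the patient with Type 2 Diabetes."))
```

To combine these rules with the pretrained labels, the same ruler can be added to a loaded pipeline instead, for example with `nlp.add_pipe("entity_ruler", before="ner")`.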

### **Q8: Preprocessing Long Documents with Multiple Challenges**

Long documents such as research papers or contracts pose unique preprocessing challenges.

- Suggest a strategy to preprocess the following text efficiently:  

    > **Artificial Intelligence has seen tremendous growth in recent years. In 2023, OpenAI released GPT-4, a state-of-the-art language model.**
    - Include stemming, lemmatization, and POS tagging in your strategy.  

- Discuss the role of coreference resolution in summarizing such documents. Provide a step-by-step example of how it can be applied.
- **Optional**: What are the limitations of current NLP tools when handling long documents, and how can they be addressed?
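
One common strategy for long documents is to split the text into sentence-based chunks and process each chunk separately. The sketch below uses a naive regex splitter as a stand-in for a proper sentence segmenter; it will break on abbreviations such as "Dr.", and both function names are illustrative.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences, max_sentences=50):
    """Yield chunks of at most max_sentences sentences, joined by spaces."""
    for start in range(0, len(sentences), max_sentences):
        yield " ".join(sentences[start:start + max_sentences])


text = ("Artificial Intelligence has seen tremendous growth in recent years. "
        "In 2023, OpenAI released GPT-4, a state-of-the-art language model.")
sentences = split_sentences(text)
for chunk in chunk_sentences(sentences, max_sentences=1):
    print(chunk)
```

Each chunk can then go through stemming, lemmatization, and POS tagging independently. Keep in mind that a pronoun in one chunk may refer to an entity in an earlier chunk, so coreference resolution may need overlapping chunks or a pass over the full text.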

### **Q9: Medical Report Analysis**

The healthcare industry generates a vast amount of unstructured textual data in the form of medical reports, patient records, and prescriptions. Your task is to preprocess and analyze medical reports to extract key information for downstream tasks like diagnosis support, patient monitoring, or research insights.

**Task**

- You are building an NLP pipeline to process medical reports. Your system needs to:
    - Identify key entities like patient names, ages, diseases, and prescribed medications.
    - Resolve coreferences to ensure accurate entity linkage (e.g., “he” or “the patient” refers to the same individual).
    - Prepare the text for further analysis, such as classification or clustering.

**Dataset**

Here is a small sample of medical text for this task:

1. Patient John Doe, aged 45, was diagnosed with Type 2 Diabetes. He was prescribed Metformin and advised to follow a low-carb diet.
2. Jane Smith, a 34-year-old woman, complained of severe migraines. Dr. Allen suggested a CT scan and prescribed Sumatriptan.
3. A 60-year-old male patient was admitted for hypertension. He was treated with Amlodipine and discharged with a follow-up scheduled for next week.
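
The three reports above can be held in a Python list, and a first cleaning pass of the kind Task 1 asks for might look like the sketch below. The names `REPORTS`, `ABBREVIATIONS`, and `clean_text` are illustrative, and the abbreviation table contains only the one example given in the task.

```python
import re

REPORTS = [
    "Patient John Doe, aged 45, was diagnosed with Type 2 Diabetes. "
    "He was prescribed Metformin and advised to follow a low-carb diet.",
    "Jane Smith, a 34-year-old woman, complained of severe migraines. "
    "Dr. Allen suggested a CT scan and prescribed Sumatriptan.",
    "A 60-year-old male patient was admitted for hypertension. He was treated "
    "with Amlodipine and discharged with a follow-up scheduled for next week.",
]

# Hypothetical abbreviation table; extend it for real data.
ABBREVIATIONS = {"CT": "computed tomography"}


def clean_text(text):
    """Expand abbreviations, lowercase, drop punctuation, collapse spaces."""
    # Expand first, while casing still distinguishes "CT" from "ct".
    for short, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    text = text.lower()
    # Keep hyphens ("low-carb", "34-year-old"); replace other punctuation.
    text = re.sub(r"[^a-z0-9\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = [clean_text(report) for report in REPORTS]
print(cleaned[1])
```

Lowercasing and punctuation removal discard capitalization and sentence boundaries, which statistical NER models often rely on, so it may be worth running NER on the original text and using the cleaned text for the later classification or clustering step.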

#### **Tasks**

1. **Text Cleaning and Preprocessing**
    - Write a function to clean the text by:
        - Removing irrelevant characters (e.g., punctuation and extra spaces).
        - Converting all text to lowercase.
        - Handling common medical abbreviations (e.g., “CT scan” → “computed tomography scan”).
    - Apply the function to the dataset.

2. **POS Tagging**
    - Use spaCy to perform POS tagging on the preprocessed text.
    - Extract all nouns and adjectives, as they are critical for understanding the medical context.
    - Reflect: Why are these POS tags particularly useful in this domain?

3. **Named Entity Recognition (NER)**
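    - Use spaCy’s NER model to extract entities like the ones below. The pretrained English pipelines label PERSON but have no AGE, DISEASE, or DRUGS labels, so those categories need custom rules (e.g., spaCy’s `EntityRuler`) or a domain-specific model:
        - PERSON (patient names and doctors)
        - AGE (e.g., 45 years, 34 years)
        - DISEASE (e.g., Type 2 Diabetes, migraines)
        - DRUGS (e.g., Metformin, Sumatriptan, Amlodipine)
    - For each sentence, list the identified entities and their types.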

4. **Advanced Analysis**
    - Create a dictionary structure to store the following for each patient:
        - Name
        - Age
        - Diagnosed Disease(s)
        - Prescribed Medication(s)
    - Populate the dictionary using the extracted entities and coreference-resolved text.

5. **Programming Challenge: Visualization and Insights**
    - Write a Python script to visualize the frequency of diseases and medications using matplotlib or seaborn.
    - Generate a bar chart showing:
        - Top 3 most frequently diagnosed diseases.
        - Top 3 most prescribed medications.
    - Reflect: How can these visualizations assist healthcare providers?
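
Tasks 4 and 5 can be sketched together as follows. The records in `PATIENTS` are filled in by hand from the sample reports and stand in for the output of your own NER and coreference steps. The field names and the helpers `top_counts` and `plot_top` are illustrative, and matplotlib is imported inside `plot_top` so that the counting part runs without it.

```python
from collections import Counter

# Hand-filled records standing in for the output of Tasks 3 and 4.
PATIENTS = [
    {"name": "John Doe", "age": 45,
     "diseases": ["Type 2 Diabetes"], "medications": ["Metformin"]},
    {"name": "Jane Smith", "age": 34,
     "diseases": ["migraines"], "medications": ["Sumatriptan"]},
    {"name": None, "age": 60,  # the third report does not name the patient
     "diseases": ["hypertension"], "medications": ["Amlodipine"]},
]


def top_counts(records, field, n=3):
    """Return the n most frequent values of a list-valued field."""
    counts = Counter(item for record in records for item in record[field])
    return counts.most_common(n)


def plot_top(records, n=3):
    """Draw side-by-side bar charts of top diseases and medications."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    panels = (("diseases", "Top diagnosed diseases"),
              ("medications", "Top prescribed medications"))
    for ax, (field, title) in zip(axes, panels):
        labels, values = zip(*top_counts(records, field, n))
        ax.bar(labels, values)
        ax.set_title(title)
        ax.set_ylabel("Count")
    fig.tight_layout()
    return fig


print(top_counts(PATIENTS, "diseases"))
```

With only three reports every count is 1, so the charts become informative only once the pipeline is run on a larger set of reports.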