# **Assignment 1: Exploring NLP Fundamentals and Preprocessing**

### **Q1: NLP in the Real World**

- Choose **two industries** (e.g., healthcare, finance, e-commerce, or another) and discuss one **NLP use case** for each. For each use case:  
   - Describe the problem and how NLP can solve it.  
   - List three challenges in applying NLP to this scenario (e.g., data availability, multilingual support, or ambiguity).  

- As a data scientist, you are tasked with designing an **NLP-based chatbot** for a university.  
   - Describe in detail each step of the pipeline you would follow to build the chatbot, including data collection, preprocessing, and feature extraction.  
   - Suggest two real-world datasets you might use to train this chatbot.

### **Q2: Practical Challenges in Language Ambiguities**

- Write **Python code** to analyze the lexical ambiguity of the word "bat" in the following sentences. Use any pre-trained NLP library (e.g., spaCy) to determine its **Part of Speech (POS)** in each sentence, and comment on whether the POS tag alone is enough to tell the two senses apart:  

    > **The bat flew across the cave.**

    > **He hit the ball with a bat.**

- Syntactic ambiguity often arises in complex sentences. Rewrite the following sentence to **resolve the ambiguity** in two different ways:  

>    **I saw the man with the telescope**

    
- Semantic ambiguity can make NLP tasks difficult. Propose a strategy or algorithm to resolve semantic ambiguity in text.
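
**Hint (optional starting point):** one family of strategies compares the words around an ambiguous term with short definitions (glosses) of each candidate sense. The sketch below is a simplified Lesk-style overlap count, not a full implementation. The sense inventory `SENSES`, its glosses, and the small `STOPWORDS` set are hand-written assumptions for illustration; a real solution would draw glosses from a lexical resource and would likely need stemming or embeddings to match related word forms.

```python
# Toy sense inventory: glosses are hand-written for illustration,
# not taken from WordNet or any other lexical resource.
SENSES = {
    "bank": {
        "financial_institution": "an institution that accepts deposits and lends money",
        "river_edge": "the sloping land beside a river or lake",
    }
}

# Tiny hand-picked stopword set (an assumption; use a fuller list in practice).
STOPWORDS = {"a", "an", "the", "that", "and", "or", "of", "to", "on", "in", "she", "he", "we"}


def content_words(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def simplified_lesk(word, sentence):
    """Return the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "She went to the bank to deposit money"))  # financial_institution
print(simplified_lesk("bank", "We sat on the bank of the river"))        # river_edge
```

Exact word overlap is brittle: "deposit" in the sentence does not match "deposits" in the gloss, so the first example is decided by "money" alone. Your proposal should address this kind of weakness.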

### **Q3: Advanced Text Cleaning for Social Media**

Social media data is often noisy, with hashtags, mentions, and emojis.

- Write a Python function that performs the following steps on a given tweet:  
    - Removes mentions (e.g., `@username`), hashtags (e.g., `#topic`), and URLs.  
    - Replaces emojis with their textual meaning (use the `emoji` library).  
    - Removes stopwords using NLTK.

- Use your function to preprocess the following tweet:  

>   **@John I love #NLP! Check this out: https://nlp-tutorial.com 😊 #DataScience**

- Explain how this cleaned tweet can be used in a **sentiment analysis model**.
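
**Hint (optional starting point):** the removal step can be sketched with regular expressions from the standard library. The patterns and the function name `strip_social_tokens` below are assumptions chosen for illustration and only cover the first step; emoji replacement (for example with `emoji.demojize`) and NLTK stopword removal are left for you to add. The sample text is deliberately different from the assignment tweet.

```python
import re

# Illustrative patterns; real tweets may need stricter or broader rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def strip_social_tokens(text):
    """Remove URLs, mentions, and hashtags, then collapse extra whitespace."""
    text = URL_RE.sub(" ", text)      # URLs first, so '#' or '@' inside a URL goes with it
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "Thanks @alice for the #python tips at https://example.com/guide today"
print(strip_social_tokens(sample))  # Thanks for the tips at today
```

Note that this sketch deletes the whole hashtag, as the task asks. Whether the hashtag word itself carries useful signal is worth considering in your sentiment analysis discussion.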

### **Q4: Tokenization and Real-World Text**

**Tokenization** plays a critical role in dividing text into meaningful units.

- Tokenize the following text into **sentences** using NLTK's `nltk.sent_tokenize()` and spaCy's sentence segmentation (`doc.sents`).  

>    **Dr. Smith graduated from Stanford University in 2003. He now works at Google as a Senior Data Scientist**

- Tokenize the same text into **words** using:  
    - `nltk.word_tokenize()`  
    - spaCy's tokenizer  

> **Note:** Compare the outputs. Which method handles punctuation better?

- Real-world Challenge:  
    - Explain why sentence and word tokenization can be difficult for languages like Chinese or Arabic.  
    - Suggest an NLP library/tool that effectively handles tokenization for such languages.
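
**Hint (optional starting point):** to see why dedicated tokenizers exist, it can help to watch a naive approach fail. The sketch below uses only the standard library; `naive_sentence_split` is a hypothetical helper, not how NLTK or spaCy work internally, and the sample texts are made up for illustration.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# The abbreviation "Mr." is wrongly treated as a sentence boundary.
print(naive_sentence_split("Mr. Lee joined the lab in 2019. He studies parsing."))
# ['Mr.', 'Lee joined the lab in 2019.', 'He studies parsing.']

# Whitespace splitting finds no word boundaries in unsegmented Chinese text.
print("我喜欢自然语言处理".split())
# ['我喜欢自然语言处理']
```

Compare this behavior with what the library tokenizers return for the text above.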

### **Q5: Stopwords and Custom Filters**

Removing **stopwords** often reduces noise in data, but not every stopword is irrelevant in every context.

- Write a Python script that uses NLTK's English stopword list to remove stopwords from the following customer review:

    > **The delivery was quick, and the packaging was excellent, but the product quality was poor.**

- Add a custom stopword filter to remove domain-specific words like "delivery" and "packaging." Explain why custom stopwords might be useful in specific NLP tasks.

- Discuss a scenario where **keeping stopwords** (e.g., "not", "but") might be critical for the model's performance.
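
**Hint (optional starting point):** a stopword list is just a set, so it can be extended with domain terms or trimmed to protect words you want to keep. In the sketch below, `BASE_STOPWORDS` is a small hand-written stand-in for NLTK's much longer English list, and `KEEP` and `DOMAIN_STOPWORDS` are hypothetical choices for a made-up hotel review.

```python
# Small hand-written stand-in for NLTK's English stopword list (assumption).
BASE_STOPWORDS = {"the", "was", "is", "a", "an", "and", "but", "not", "it", "i"}
KEEP = {"not", "but"}                  # hypothetical: protect negation and contrast words
DOMAIN_STOPWORDS = {"hotel", "room"}   # hypothetical domain-specific additions


def remove_stopwords(text, stopwords):
    """Lowercase, strip surrounding punctuation, and drop tokens in `stopwords`."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [t for t in tokens if t and t not in stopwords]


review = "The hotel room was not clean."

print(remove_stopwords(review, BASE_STOPWORDS))
# ['hotel', 'room', 'clean']

custom = (BASE_STOPWORDS - KEEP) | DOMAIN_STOPWORDS
print(remove_stopwords(review, custom))
# ['not', 'clean']
```

The first output has lost the negation entirely, which is the kind of effect your discussion should explain.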

### **Q6: Creating N-Grams for Product Reviews**

**N-grams** are helpful for understanding the context of text.

- Define what an **n-gram** is and explain its role in capturing context.

- Using Python, generate **bigrams** and **trigrams** for the following review:  

    > **The product quality is amazing and works as expected.**

- Use the output of your n-grams to perform a basic frequency analysis. Identify the most common bigram and trigram, and note how you handle ties.  

- Discuss how you would use n-grams in a **recommendation system** for e-commerce products.
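
**Hint (optional starting point):** an n-gram is a run of `n` consecutive tokens, so one simple way to build them is to zip a token list with shifted copies of itself. The helper `ngrams` below is a minimal sketch using only the standard library (NLTK also offers `nltk.ngrams`), and the sample sentence is made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) in order of appearance."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "the cat sat on the mat and the cat slept".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(1))   # [(('the', 'cat'), 2)]
print(max(trigram_counts.values()))   # 1  (every trigram occurs once here)
```

In a short text with few repeated words, most or all n-grams occur exactly once, so a frequency ranking may consist largely of ties.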

### **Q7: Case Study: Preprocessing Job Descriptions**

Imagine you're building an NLP model to analyze job descriptions and match them with resumes.  

- Create a small dataset of 5 job descriptions. Example:  
    > **We are looking for a data scientist with expertise in Python and SQL.**

    > **The candidate should have experience in machine learning and data visualization.**

- Write a Python script to preprocess these job descriptions:  
    - Convert text to lowercase.  
    - Remove punctuation and stopwords.  
    - Tokenize the text into words.  

- Perform a frequency analysis on the preprocessed job descriptions. Identify the top 5 most common words.
- **Optional:** How might preprocessing job descriptions improve the performance of a job-matching NLP model? What challenges might arise when handling multilingual job descriptions?
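
**Hint (optional starting point):** the three preprocessing steps and the frequency count can be chained in a few lines. The sketch below uses only the standard library; `STOPWORDS` is a small hand-written stand-in for a real stopword list, whitespace splitting stands in for a proper tokenizer, and the two sample descriptions are made up, so replace all three with your own choices.

```python
import string
from collections import Counter

# Hand-written stand-in for a real stopword list (assumption).
STOPWORDS = {"we", "are", "for", "a", "an", "the", "with", "in", "and", "of", "to"}

# Translation table that deletes ASCII punctuation characters.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, remove punctuation, split on whitespace, and drop stopwords."""
    cleaned = text.lower().translate(PUNCT_TABLE)
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


docs = [
    "Seeking a backend engineer with Go and Docker experience.",
    "The engineer will maintain Docker images and CI pipelines.",
]

counts = Counter(tok for doc in docs for tok in preprocess(doc))
print(counts["engineer"], counts["docker"])  # 2 2
```

Calling `counts.most_common(5)` on your own dataset gives the top five words. Note that deleting punctuation outright also changes tokens such as "C++" or "Node.js", which matters for job descriptions.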