# 👩‍💻 Clean Your First NLP Dataset: News Headlines Edition

## 📋 Overview

In this activity, you will preprocess a small dataset of news headlines, turning raw text into a clean, structured form ready for analysis. Just as raw sugarcane is refined into sugar, you'll take noisy text and reduce it to meaningful tokens that natural language processing models can use.

## 🎯 Learning Outcomes

By the end of this lab, you will be able to:

- ✅ Clean and preprocess raw text data by removing noise and standardizing text
- ✅ Tokenize text into individual words for analysis
- ✅ Remove stopwords to focus on significant content
- ✅ Apply stemming or lemmatization to reduce words to their base forms

## Task 1: Data Loading and Exploration

**Context:** Properly loading and exploring the dataset helps in understanding the types of noise present in the text.

**Steps:**
1. A variable, `news_headlines`, has been created for you to use for this lab.
2. Explore its structure to understand the various types of noise present in the data, such as punctuation, URLs, emojis, and stopwords.
3. Inspect the first few entries to ascertain common patterns and elements to address in cleaning.

In [None]:
# Task 1: Loading and Exploration
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

# Download required resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample dataset of news headlines
news_headlines = [
    "Breaking news: Market hits new highs! See details at https://marketnews.com.",
    "Bitcoin hits 50k! Is it the new gold? 🤔 #cryptocurrency",
    "Experts debate climate impact at global summit on environmental change.",
    "COVID-19 updates: New variants and vaccination efforts continue."
]

# Explore / Inspect Data
# Your code here...

**💡 Tip:** Use `print()` and basic string operations to explore the dataset.
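
One possible way to approach the exploration, as a minimal sketch using only the standard library: print each headline and flag a few kinds of noise. The helper `describe_noise` and its checks are illustrative assumptions, not part of the lab's starter code. The sample list repeats two of the lab's headlines so the block runs on its own.

```python
import re

# Two headlines copied from the lab's dataset so this sketch is self-contained
sample_headlines = [
    "Breaking news: Market hits new highs! See details at https://marketnews.com.",
    "COVID-19 updates: New variants and vaccination efforts continue.",
]

def describe_noise(text):
    """Hypothetical helper: report which kinds of noise a headline contains."""
    return {
        "length": len(text),
        "has_url": bool(re.search(r"https?://\S+", text)),
        "has_hashtag": bool(re.search(r"#\w+", text)),
        "has_digits": bool(re.search(r"\d", text)),
        "has_non_ascii": not text.isascii(),  # rough proxy for emojis
    }

for i, headline in enumerate(sample_headlines):
    print(f"[{i}] {headline}")
    print("    ", describe_noise(headline))
```

A summary like this can help you decide which cleaning rules the next task actually needs.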

**⚙️ Test Your Work:**

- Display the first 5 news headlines with their raw text (the sample dataset contains only four, so all of them will be shown).

**Expected output:** The original raw text headlines showing various types of noise.

## Task 2: Cleaning Text Data

**Context:** Cleaning raw text data by removing noise is crucial for accurate NLP analysis.

**Steps:**

1. Develop functions to clean the text data. They should:
   - Remove any HTML tags, URLs, and unnecessary symbols.
   - Handle special characters and numbers, deciding which elements should be retained or discarded.
   - Transform all text to a uniform case to standardize the data.

In [None]:
# Task 2: Cleaning Text Data

**💡 Tip:** Use regular expressions (`re`) for cleaning operations.
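
The decision about numbers is easier to make once you see its effect. The sketch below is one possible cleaning function, not the required solution. The name `clean` and the `keep_digits` flag are hypothetical and introduced only for illustration.

```python
import re

def clean(text, keep_digits=True):
    """Hypothetical helper: drop URLs and punctuation, optionally drop digits, lowercase."""
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"[^\w\s]", "", text)         # drop anything that is not a word character or whitespace
    if not keep_digits:
        text = re.sub(r"\d+", "", text)         # remove runs of digits
    text = re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace
    return text.lower()

headlines = [
    "Bitcoin hits 50k! Is it the new gold? 🤔 #cryptocurrency",
    "COVID-19 updates: New variants and vaccination efforts continue.",
]

for headline in headlines:
    print("keep digits:", clean(headline, keep_digits=True))
    print("drop digits:", clean(headline, keep_digits=False))
```

With digits removed, "50k" collapses to a stray "k" and "COVID-19" becomes "covid". Whether that loss is acceptable depends on the analysis you plan to run.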

**⚙️ Test Your Work:**
- Print a cleaned version of the first 5 news headlines (the sample dataset contains only four).

**Expected output:** The cleaned text without noise, standardized to lowercase.

## Task 3: Tokenization

**Context:** Tokenizing the cleaned text into individual words enables further analysis steps.

**Steps:**

1. Tokenize the cleaned text into individual words.

In [None]:
# Task 3: Tokenization

**💡 Tip:** Use `word_tokenize` from `nltk` for tokenization.
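
To see why a tokenizer is preferred over a plain `split()`, the sketch below compares whitespace splitting with a simple regular-expression tokenizer on raw text. The function `simple_tokenize` is a rough stand-in chosen for illustration. It is not how `word_tokenize` works internally, and its output may differ from NLTK's.

```python
import re

raw = "Breaking news: Market hits new highs!"

# Whitespace splitting leaves punctuation glued to the neighboring word
print(raw.split())

def simple_tokenize(text):
    """Illustrative tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Punctuation now comes out as separate tokens
print(simple_tokenize(raw))
```

On text that has already been cleaned of punctuation the two approaches agree far more often, which is one reason the lab cleans before tokenizing.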

**⚙️ Test Your Work:**

- Print the tokenized version of the first 5 news headlines (the sample dataset contains only four).

**Expected output:** Lists of tokenized words for each headline.

## Task 4: Stopword Removal

**Context:** Removing stopwords helps focus on the content carrying the most significant insights.

**Steps:**

1. Remove common stopwords from your tokens.
2. Reflect on whether additional, context-specific stopwords would enhance your dataset.

In [None]:
# Task 4: Stopword Removal

**💡 Tip:** Use `stopwords.words('english')` from `nltk` to get the list of stopwords.
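
The idea of context-specific stopwords can be sketched as a set union. In the example below, `base_stop_words` is a tiny hand-written stand-in for the NLTK list and `domain_stop_words` is a hypothetical set of headline boilerplate. Both are assumptions made so the block runs without any downloads.

```python
# Stand-in for stopwords.words('english'); the real NLTK list is much longer
base_stop_words = {"at", "the", "is", "it", "and", "on"}

# Hypothetical context-specific additions for news headlines
domain_stop_words = {"breaking", "news", "updates", "see", "details"}

tokens = ["breaking", "news", "market", "hits", "new", "highs", "see", "details", "at"]

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stop_words]

# General-purpose stopwords only
print(remove_stopwords(tokens, base_stop_words))

# General-purpose plus domain-specific stopwords
print(remove_stopwords(tokens, base_stop_words | domain_stop_words))
```

Using a `set` keeps each membership test fast, and the union makes it easy to switch the domain list on or off while you compare results.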

**⚙️ Test Your Work:**

- Print the tokenized version of the first 5 news headlines after stopword removal (the sample dataset contains only four).

**Expected output:** Lists of tokenized words without common stopwords.

## Task 5: Apply Stemming or Lemmatization

**Context:** Stemming or lemmatization reduces tokens to their base forms, so that inflected variants such as "hits" and "hit" are treated as the same term in later NLP tasks.

**Steps:**

1. Implement stemming or lemmatization to reduce the tokens to their base forms.
2. Decide which method is more appropriate depending on your subsequent analytical tasks.

In [None]:
# Task 5: Apply Stemming or Lemmatization

**💡 Tip:** Use `WordNetLemmatizer` from `nltk` for lemmatization.
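
To help with the choice between the two methods, here is a toy contrast between rule-based suffix stripping (the idea behind stemming) and dictionary lookup (the idea behind lemmatization). Both `toy_stem` and `toy_lemmatize` are deliberately simplified, hypothetical helpers. They are not NLTK's algorithms and should not be used for the lab itself.

```python
def toy_stem(word):
    """Toy rule-based stemmer: strip one common suffix, with no vocabulary check."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lookup table standing in for a real lexical database such as WordNet
TOY_LEMMAS = {"hits": "hit", "highs": "high", "updates": "update"}

def toy_lemmatize(word):
    """Toy lemmatizer: return the dictionary form if known, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["hits", "highs", "updates"]:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word)}")
```

The rule-based version turns "updates" into the non-word "updat", while the lookup returns "update". Real stemmers and lemmatizers tend to show a similar trade-off between simplicity and linguistic accuracy.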

**⚙️ Test Your Work:**

- Print the tokenized and lemmatized version of the first 5 news headlines (the sample dataset contains only four).

**Expected output:** Lists of lemmatized words for each headline.

### ✅ Success Checklist

- Successfully loaded and explored the dataset
- Cleaned the text data by removing noise and standardizing text
- Tokenized the cleaned text into individual words
- Removed stopwords to focus on significant content
- Applied stemming or lemmatization to reduce words to their base forms

### 🔍 Common Issues & Solutions

**Problem:** Text data not cleaning properly.

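**Solution:** Test each regular expression on a single headline before applying it to the whole dataset, and apply the cleaning steps in a deliberate order (for example, remove URLs before stripping punctuation).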

**Problem:** Tokenization errors.

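**Solution:** Verify that the required `nltk` resources (`punkt`, `punkt_tab`, `stopwords`, and `wordnet`) have been downloaded before you call the functions that depend on them. A missing resource raises a `LookupError`.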

### 🔑 Key Points

- Cleaning and preprocessing text data is crucial for accurate NLP analysis.
- Tokenization, stopword removal, and lemmatization help transform raw text into analyzable tokens.
- Proper preprocessing ensures that the data is ready for further NLP tasks.

## 💻 Exemplar Solution

<details>    
<summary><strong>Click HERE to see an exemplar solution</strong></summary>    

```python
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

# Download required resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample dataset of news headlines
news_headlines = [
    "Breaking news: Market hits new highs! See details at https://marketnews.com.",
    "Bitcoin hits 50k! Is it the new gold? 🤔 #cryptocurrency",
    "Experts debate climate impact at global summit on environmental change.",
    "COVID-19 updates: New variants and vaccination efforts continue."
]

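def clean_text(text):
    # Remove HTML tags first, so a URL inside a tag attribute does not swallow the text after it
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove punctuation, symbols, and emojis (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # Remove digits; note that "50k" becomes "k" and "COVID-19" becomes "covid"
    text = re.sub(r'\d+', '', text)
    # Lowercase text
    text = text.lower()
    return text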

def preprocess_headlines(headlines):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    cleaned_data = []
    for headline in headlines:
        cleaned = clean_text(headline)
        tokens = word_tokenize(cleaned)
        # Drop stopwords, then lemmatize; lemmatize() treats each token as a noun by default
        filtered_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
        cleaned_data.append(filtered_tokens)
    return cleaned_data

# Preprocess the headlines
cleaned_headlines = preprocess_headlines(news_headlines)
print(cleaned_headlines)
```

</details>