# Going beyond a single word
Objective: To understand how analysing sequences of words can provide more context and meaning than analysing single words alone. We will learn how to extract and analyse these sequences, called N-grams.

### What are N-grams?
So far, our analysis (frequency counts, word clouds) has been based on single words, or unigrams. An N-gram is a sequence of N consecutive words:
- __Unigram__: 'prime'
- __Bigram__: 'prime minister'
- __Trigram__: 'the prime minister'

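Analysing bigrams and trigrams is powerful because they capture phrases and concepts that single words miss. For example, 'prime minister' names a specific role that neither 'prime' nor 'minister' conveys on its own.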
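To make the definition concrete, here is a minimal sketch that builds N-grams by hand from a list of tokens. The helper name `make_ngrams` and the sample sentence are illustrative assumptions, not part of the dataset or of any library.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Slide a window of width n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, split on whitespace
tokens = "the prime minister spoke today".split()

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of 5 tokens yields 4 bigrams and 3 trigrams, because the window can start at one fewer position each time N grows by one.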

### Setup: Loading and Cleaning the data

In [None]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# --- Setup all the cleaning tools ---
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

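def clean_text(raw_text):
    # Lowercase so that 'Prime' and 'prime' are treated as the same word
    text = raw_text.lower()
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # Split on whitespace to get individual tokens
    tokens = text.split()
    # Drop common English stop words
    tokens = [word for word in tokens if word not in stop_words]
    # Reduce each word to its dictionary form (lemma)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(tokens) # Join back to string for vectorizer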

# --- Load and Clean the Data ---
url = 'https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv'
bbc_df = pd.read_csv(url)
bbc_df['cleaned_text'] = bbc_df['text'].apply(clean_text)

print("Setup complete. The BBC dataset is loaded and cleaned.")

## Extracting N-grams with Scikit-learn
A convenient way to get N-gram counts is to use one of scikit-learn's text vectorizers. We'll use CountVectorizer for simplicity. The key is the ngram_range parameter, a (min_n, max_n) tuple that sets the smallest and largest N-gram size to extract:
- ngram_range=(1,1): Unigrams (default)
- ngram_range=(2,2): Bigrams only
- ngram_range=(1,2): Both unigrams and bigrams

Let's find the most common bigrams.

In [None]:
# Initialize a CountVectorizer to find bigrams.
# max_features=20 keeps only the 20 bigrams with the highest total count in the corpus
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), max_features=20)

# Create the bigram document-term matrix
bigram_matrix = bigram_vectorizer.fit_transform(bbc_df['cleaned_text'])

# Get the feature names (the bigrams).
# Note: these come back in alphabetical order, not ranked by frequency
bigram_features = bigram_vectorizer.get_feature_names_out()

print("The 20 most frequent bigrams (listed alphabetically):")
print(bigram_features)

## Exercise
Find the top 15 most frequent trigrams (sequences of 3 words) in the dataset.
1. Initialise a new CountVectorizer.
2. Set the ngram_range parameter to (3,3) to look for trigrams.
3. Set max_features=15 to keep just the top 15.
4. Fit the vectorizer on the 'cleaned_text' column.
5. Get the feature names and print them.

In [None]:
# Your Code here