# Discovering Hidden Themes with Topic Modelling
The objective of this notebook is to investigate topic modelling: automatically discovering the main themes in a collection of documents. This notebook introduces two crucial concepts: __vectorisation__ (turning text into numbers) and __Latent Dirichlet Allocation__ (LDA), a popular topic modelling algorithm.

## Setup and Data Cleaning
As before, we start by setting up our environment and running our cleaning pipeline on the BBC dataset. Topic models work best with clean, lemmatised text.

In [None]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# --- Setup all the cleaning tools ---
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(raw_text):
    text = raw_text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

# --- Load and Clean the Data ---
url = 'https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv'
bbc_df = pd.read_csv(url)
bbc_df['final_tokens'] = bbc_df['text'].apply(clean_text)

print("Setup complete. The BBC dataset is loaded and cleaned.")
bbc_df.head()

## Words to Numbers
You can think of vectorisation as a step beyond tokenising: it converts text into numbers that a machine can work with. The most common method is to create a Document-Term Matrix, where every row is a document and every column is a unique word in the entire vocabulary. The value in each cell is a number representing the importance of that word in that document.
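
To make the idea concrete, here is a minimal sketch of a Document-Term Matrix built from raw word counts, using only the Python standard library. The three toy documents and the variable names are invented for illustration; they are not part of the BBC dataset.

```python
from collections import Counter

# Three tiny, already-tokenised documents (illustrative only)
toy_docs = [
    ["team", "win", "match"],
    ["party", "win", "election"],
    ["team", "team", "goal"],
]

# The vocabulary is every unique word, sorted so the column order is stable
vocabulary = sorted({word for doc in toy_docs for word in doc})

# One row per document, one column per vocabulary word, cell = raw count
doc_term_matrix = []
for doc in toy_docs:
    counts = Counter(doc)
    doc_term_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```

Raw counts are the simplest possible cell value. Most cells are zero, which is why real libraries store this kind of matrix in a sparse format.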

We will use __TF-IDF__ (Term Frequency-Inverse Document Frequency), a smart way to calculate these values:
- __Term Frequency (TF)__: How often a word appears in a document.
- __Inverse Document Frequency (IDF)__: How rare a word is across all documents.

The TF-IDF score gives more weight to words that are frequent in one document but rare everywhere else, making them good keywords for that document.
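
The calculation can be sketched by hand on a toy corpus. This sketch assumes the textbook definitions: TF is a word's share of the document, and IDF is the log of the total number of documents divided by the number of documents containing the word. scikit-learn's `TfidfVectorizer` applies a smoothed and normalised variant, so its numbers will differ. The documents and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

# Three tiny, already-tokenised documents (illustrative only)
toy_docs = [
    ["team", "win", "match", "goal"],
    ["party", "win", "election", "vote"],
    ["team", "goal", "goal", "striker"],
]

def tf_idf(docs):
    """Return one {word: score} dict per document using textbook TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(toy_docs)
for i, doc_scores in enumerate(scores):
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    print(i, [(word, round(score, 3)) for word, score in ranked])
```

In the second document, `win` scores lower than `election` because `win` also appears in the first document, so it is less distinctive.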

We'll be using the powerful scikit-learn library for this.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# The scikit-learn vectorizer needs strings, not lists of tokens.
# So, we'll join our tokens back into a single string.
bbc_df['cleaned_text_joined'] = bbc_df['final_tokens'].apply(lambda tokens: " ".join(tokens))

# Initialize the vectorizer
# max_df=0.95 means "ignore words that appear in more than 95% of documents" (too common)
# min_df=2 means "ignore words that appear in fewer than 2 documents" (too rare)
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

# Create the TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(bbc_df['cleaned_text_joined'])

# The output is a sparse matrix - let's see its shape
print(tfidf_matrix.shape)

## Building the Topic Model with LDA
Now that we have a numerical format, we can build our topic model. We'll use the LDA algorithm. 

It works by assuming that each document is a mix of different topics, and each topic is a mix of different words. It's an unsupervised algorithm, meaning we don't give it any labels or answers; instead, it takes the data and attempts to identify patterns (the topics) on its own.
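
The "mix of topics, mix of words" assumption can be illustrated by simulating the story LDA imagines for how a document is written. This is a toy sketch with two hand-written topics and made-up probabilities, not the scikit-learn implementation; `sample_dirichlet` and `generate_document` are hypothetical helpers.

```python
import random

# Two hand-written topics: each is a probability distribution over words
topics = {
    "sport": {"match": 0.4, "goal": 0.3, "team": 0.3},
    "politics": {"election": 0.4, "party": 0.3, "vote": 0.3},
}

def sample_dirichlet(alphas, rng):
    """Draw proportions that are non-negative and sum to 1."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, rng):
    topic_names = list(topics)
    # 1. Each document gets its own mix of topics
    mixture = sample_dirichlet([0.5] * len(topic_names), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word according to the document's mix
        topic = rng.choices(topic_names, weights=mixture)[0]
        # 3. Pick a word according to that topic's word distribution
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return mixture, words

rng = random.Random(42)
mixture, words = generate_document(8, rng)
print([round(p, 2) for p in mixture], words)
```

Fitting LDA works in the opposite direction: it only sees the words and tries to recover topic distributions and per-document mixtures that could plausibly have produced them.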

We do have to tell it how many topics we want it to look for. Since we know that the BBC data has 5 categories, 5 is the perfect number to start with.

In [7]:
from sklearn.decomposition import LatentDirichletAllocation

# We are looking for 5 topics
num_topics = 5

# Create the LDA model
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)

# Fit the model on our TF-IDF matrix
lda.fit(tfidf_matrix)

Fitted model parameters (as displayed by scikit-learn):
n_components = 5
doc_topic_prior = None
topic_word_prior = None
learning_method = 'batch'
learning_decay = 0.7
learning_offset = 10.0
max_iter = 10
batch_size = 128
evaluate_every = -1
total_samples = 1000000.0


## Interpreting the Results
The model has now been trained, but how do we understand the topics it found? We need to look at the most important words for each of the 5 topics.
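
"Most important" here means the words with the largest weights in a topic's row of the fitted model (`components_` in scikit-learn). A minimal, library-free sketch of that selection on made-up weights follows; `toy_vocab`, `toy_topic_weights` and `top_words` are invented for illustration.

```python
toy_vocab = ["election", "goal", "match", "party", "team", "vote"]

# One row of weights per topic, aligned with toy_vocab (invented numbers)
toy_topic_weights = [
    [0.1, 2.5, 3.0, 0.2, 1.8, 0.1],
    [2.9, 0.1, 0.2, 2.2, 0.1, 1.5],
]

def top_words(weights, vocab, n):
    """Return the n words with the largest weights, highest first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]

for topic_idx, weights in enumerate(toy_topic_weights):
    print(f"Topic #{topic_idx}:", " ".join(top_words(weights, toy_vocab, 3)))
```

The notebook's helper does the same thing with NumPy's `argsort`, reading the sorted indices from the end to get the highest weights first.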

In [8]:
# A helper function to print the top words for each topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = f"Topic #{topic_idx}: "
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

# Get the feature names (our vocabulary) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the top 10 words for each of our 5 topics
print("Top words for each topic:")
print_top_words(lda, feature_names, 10)

Top words for each topic:
Topic #0: foxx swank cabir blackpool hendrix eastwood millan daylewis ossie diageo
Topic #1: game said film player best win mobile year world england
Topic #2: said mr government year labour election people party blair minister
Topic #3: fiat mock commodore mirza goldsmith hingis casino viotti baa desailly
Topic #4: kenteris thanou greek wenger iaaf athens olympics gallas tzekos seafarer



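__Look at the results__: the output of this model will not be perfect, but we should still be able to identify some themes. Here, Topic #2 is clearly politics (government, labour, election, party), while Topic #1 blends sport, film and technology words. Topics #0, #3 and #4 are dominated by rare names rather than general themes. When topics look like this, we may need more data or more topics.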
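
Another way to inspect a model is to ask which topic dominates each document. The sketch below does this on a tiny invented corpus using `fit_transform`, which returns one row of topic proportions per document. As an assumption worth testing rather than a recommendation, it feeds LDA raw word counts from `CountVectorizer` instead of TF-IDF weights, which is the input used in scikit-learn's own LDA example; whether that gives cleaner topics on the BBC data is not shown here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny invented corpus: two sport-like and two politics-like documents
toy_texts = [
    "team win match goal striker",
    "match goal team player win",
    "party election vote minister government",
    "government minister election party vote",
]

# Raw word counts instead of TF-IDF weights (an assumption, see above)
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(toy_texts)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
# One row per document; each row holds topic proportions that sum to 1
doc_topics = toy_lda.fit_transform(count_matrix)

for text, proportions in zip(toy_texts, doc_topics):
    dominant = proportions.argmax()
    print(f"Topic #{dominant} ({proportions[dominant]:.2f}): {text}")
```

On the real data, the same call on the notebook's fitted model would let us compare each article's dominant topic with its known BBC category.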

## Exercise 
We told the model to find 5 topics, which we know is the 'correct' number for this dataset.

But in the real world, you don't know the number of topics before you start, so you usually have to experiment.
The task:
1. Re-run the LDA model, but this time set n_components=3.
2. Print the top words for these 3 topics.
3. Do the topics seem as clear and distinct as they were? Or have they become further jumbled?

## Conclusion
You have now applied machine learning to topic modelling to help identify topics within a dataset. This can be a great way to get an overview of a collection of documents and to identify categories.