
# ☁️ Keyword Frequency and Word Cloud Visualisation

This notebook loads a CSV file of extracted keywords, builds a frequency list, and generates a word cloud to visualise the most common terms. It includes an adjustable list of custom stopwords and basic text preprocessing for cleaner results.
The aim is to get a better understanding of what the dataset contains by displaying its most frequent keywords.


## 📦 Import Required Libraries

We import:
- `pandas` to load the CSV file
- `wordcloud` to generate the visualisation
- `matplotlib` to display the image
- `Counter` to calculate word frequency
- `string` to help with punctuation removal


In [None]:
import pandas as pd 
from wordcloud import WordCloud 
import matplotlib.pyplot as plt
from collections import Counter
import string



## 📄 Load Your CSV File

We load one of the enhanced datasets and extract the `keywords` column for analysis. You can substitute any dataset and any text column.

In [None]:

# Load your CSV file
df = pd.read_csv('data/AoT5_enhanced.csv')

# Extract the 'keywords' column
keywords = df['keywords'].dropna().tolist()
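
If you switch to a different dataset or column, it can help to check that the column exists and to cast its values to strings before joining them. The following is a minimal sketch under stated assumptions: the in-memory CSV, the `title` column, and the `*_sample` names are invented for illustration and are not part of the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (invented sample data)
sample_csv = io.StringIO(
    'title,keywords\n'
    'A,"health, nutrition"\n'
    'B,\n'
    'C,exercise\n'
)
df_sample = pd.read_csv(sample_csv)

column = 'keywords'
if column not in df_sample.columns:
    raise KeyError(f"Column '{column}' not found; available: {list(df_sample.columns)}")

# Drop missing rows and make sure every remaining value is a string
keywords_sample = df_sample[column].dropna().astype(str).tolist()
print(keywords_sample)
```

Row `B` has no keywords, so `dropna()` removes it before the list is built.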



## 🧩 Combine All Keywords into One Text Block

We join the list of keyword strings into a single large string for processing.


In [None]:

# Combine all keywords into a single string
text = ' '.join(keywords)
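
To illustrate the joining step, here is a small sketch with an invented list of keyword strings (the values are assumptions, not taken from the dataset). The space separator keeps the last keyword of one row from running into the first keyword of the next.

```python
# Invented sample rows, each holding one or more keywords
sample_keywords = ['health, nutrition', 'exercise; sleep', 'nutrition']

# Join the rows with a single space between them
sample_text = ' '.join(sample_keywords)
print(sample_text)
```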



## 🧹 Define and Apply Custom Stopwords

We define a custom list of stopwords (domain-specific words we want to leave out of the word cloud).
We also lowercase the text and remove punctuation.


In [None]:

# Define custom stopwords. A few frequent terms are predefined; adjust the list to suit your dataset.
custom_stopwords = set(['thousand','thought','read','youre','weve','day','say','feel','settings','youll','uks','don','website', 'health', 'uk', 'welcome', 'way', 'world', 'team', 'working', 'cookies', 
                        'site', 'information', 'provide', 'healthcare', 'using', 'need', 'share', 'staff', 
                        'know', 'range', 'visit', 'unique', 'understand', 'services', 'really', 'things', 
                        'association', 'best', 'think', 'used', 'latest', 'project', 'offer', 'online', 
                        'taking', 'british', 'understanding', 'better', 'weeks', 'group', 'doctors', 'wide', 
                        'today', 'run', 'page', 'join'])

# Additional preprocessing: remove punctuation and make lowercase
translator = str.maketrans('', '', string.punctuation)
text = text.lower().translate(translator)
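
Deleting punctuation outright joins the parts of hyphenated or unspaced terms, which is why stopwords such as `youre` and `weve` appear without apostrophes. The sketch below compares that behaviour with one possible alternative, replacing each punctuation character with a space. The sample string and variable names are invented for illustration, and which approach suits your data is a judgment call.

```python
import string

# Invented sample text (not from the dataset)
sample_text = 'Mental-Health, well-being; NHS (UK)'

# Approach used in this notebook: delete punctuation characters
delete_table = str.maketrans('', '', string.punctuation)
deleted = sample_text.lower().translate(delete_table)

# Alternative: map each punctuation character to a space instead
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = sample_text.lower().translate(space_table)

print(deleted.split())
print(spaced.split())
```

With deletion, `mental-health` becomes the single token `mentalhealth`; with spaces, it becomes `mental` and `health`. Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes and dashes pass through both approaches unchanged.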



## 🧮 Count Word Frequency

We split the processed text into individual words, remove stopwords, and count how often each word appears.


In [None]:

# Split the text into words and filter out stopwords
words = text.split()
filtered_words = [word for word in words if word not in custom_stopwords]

# Calculate word frequency
word_counts = Counter(filtered_words)
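
The filtering and counting steps can be sketched on a tiny invented example. The words and stopwords below are assumptions chosen only to show the mechanics.

```python
from collections import Counter

# Invented sample of already-cleaned words
sample_words = 'nutrition website exercise nutrition health sleep exercise nutrition'.split()
sample_stopwords = {'website', 'health'}

# Keep only the words that are not stopwords, then count them
sample_filtered = [word for word in sample_words if word not in sample_stopwords]
sample_counts = Counter(sample_filtered)
print(sample_counts)
```

A `Counter` behaves like a dictionary that maps each word to its number of occurrences.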



## 🔢 Display the Most Common Words

We print the top 100 most frequent keywords from the cleaned dataset.


In [None]:

# Display the most common words
most_common_words = word_counts.most_common(100)
print("Most common words:")
for word, count in most_common_words:
    print(f"{word}: {count}")
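
Raw counts can be hard to compare across datasets of different sizes. As an optional sketch, the example below also prints each word's share of all counted words. The counts are invented for illustration.

```python
from collections import Counter

# Invented counts standing in for word_counts
sample_counts = Counter({'nutrition': 3, 'exercise': 2, 'sleep': 1})
total = sum(sample_counts.values())

# most_common(n) returns (word, count) pairs sorted from most to least frequent
top_words = sample_counts.most_common(2)
for word, count in top_words:
    print(f"{word}: {count} ({count / total:.0%})")
```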



## ☁️ Generate and Display the Word Cloud

We use the `WordCloud` class from the `wordcloud` library to create a word cloud from the filtered keywords and plot it using `matplotlib`.


In [None]:

# Generate the word cloud from the filtered words
wordcloud = WordCloud(
    width=800,
    height=400,
    background_color='white',
    stopwords=custom_stopwords,
).generate(' '.join(filtered_words))

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Keywords')
plt.show()
