# Getting Ready for Analysis - Cleaning and Pre-processing Text
In the last notebook we extracted basic features from the raw text of the news articles. That gave us a good high-level overview, but to dig deeper we need to clean up the text.

## Why do we need to clean text?
Computers need consistency. To a computer, "Election", "election", and "election." are three different words. Our goal is to standardise the text so that we can group and analyze words by their actual meaning. This process is called pre-processing.
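
As a quick illustration in plain Python (the sample words are hypothetical and not taken from the dataset), the three spellings compare as different strings until we normalise them:

```python
import string

# Three spellings of the same word, as a computer sees them
words = ["Election", "election", "election."]

# Without cleaning, every spelling counts as a distinct word
print(len(set(words)))

# Lowercase each word and strip punctuation from both ends
normalised = [w.lower().strip(string.punctuation) for w in words]

# After cleaning, all three collapse into a single word
print(set(normalised))
```

This is only a sketch of the idea; the pipeline below does the same kind of normalisation on whole articles with pandas.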

In this notebook we will build a cleaning pipeline to:
- Convert text to lowercase
- Remove punctuation and special characters
- Break text into a list of individual words (tokenisation)
- Remove common, low-value "stopwords"
- Reduce words to their root form (lemmatisation)

## Setup: Loading the BBC Dataset
We'll start by loading the data into our bbc_df DataFrame so that this notebook is self-contained.

In [1]:
import pandas as pd

# Load the data from the stable URL
url = 'https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv'
bbc_df = pd.read_csv(url)

print("BBC News dataset loaded successfully!")
bbc_df.head()

BBC News dataset loaded successfully!


Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


## The cleaning pipeline
We will create new columns in our DataFrame at each step to see how the text transforms. 

### Step 1: Lowercasing
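This is the simplest first step, and an important one. It ensures that we don't treat the same word differently based on its capitalisation. The sample rows of this dataset already appear to be lowercase, so the output below looks unchanged, but the step keeps the pipeline safe for other data.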

In [2]:
# Create a new column 'cleaned_text' with the lowercase version of the text
bbc_df['cleaned_text'] = bbc_df['text'].str.lower()

bbc_df[['text', 'cleaned_text']].head()

Unnamed: 0,text,cleaned_text
0,tv future in the hands of viewers with home th...,tv future in the hands of viewers with home th...
1,worldcom boss left books alone former worldc...,worldcom boss left books alone former worldc...
2,tigers wary of farrell gamble leicester say ...,tigers wary of farrell gamble leicester say ...
3,yeading face newcastle in fa cup premiership s...,yeading face newcastle in fa cup premiership s...
4,ocean s twelve raids box office ocean s twelve...,ocean s twelve raids box office ocean s twelve...


### Step 2: Removing Punctuation & Special Characters
Next, we'll remove all punctuation. This prevents "win" and "win." from being treated as two different words. We'll use a simple regular expression that keeps only word characters (letters, digits and underscores) and whitespace.

In [3]:
# Use .str.replace() with a regular expression
# The regex [^\w\s] means "any character that is NOT a word character or a whitespace character"
bbc_df['cleaned_text'] = bbc_df['cleaned_text'].str.replace(r'[^\w\s]', '', regex=True)

bbc_df[['text', 'cleaned_text']].head()

Unnamed: 0,text,cleaned_text
0,tv future in the hands of viewers with home th...,tv future in the hands of viewers with home th...
1,worldcom boss left books alone former worldc...,worldcom boss left books alone former worldc...
2,tigers wary of farrell gamble leicester say ...,tigers wary of farrell gamble leicester say ...
3,yeading face newcastle in fa cup premiership s...,yeading face newcastle in fa cup premiership s...
4,ocean s twelve raids box office ocean s twelve...,ocean s twelve raids box office ocean s twelve...
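
To see what the pattern does on text that still contains punctuation, here is a minimal sketch using Python's built-in `re` module on a hypothetical sentence (not from the dataset). It uses the same pattern as the cell above:

```python
import re

# Hypothetical sentence with punctuation, an apostrophe and an underscore
sample = "Win! The team_a can't lose... can they?"

# Lowercase first, then drop anything that is not a word character or whitespace
cleaned = re.sub(r"[^\w\s]", "", sample.lower())

print(cleaned)
# win the team_a cant lose can they
```

Two details are worth noticing: the underscore survives because `\w` counts it as a word character, and removing the apostrophe merges "can't" into "cant". Whether that matters depends on the analysis you have in mind.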


### Step 3: Tokenisation
Now we'll break our clean string of text into a list of individual words, or __tokens__. This is the standard input format for most natural language processing tasks.

In [4]:
# Use .str.split() which splits the string by spaces into a list of words
bbc_df['tokenized_text'] = bbc_df['cleaned_text'].str.split()

bbc_df[['cleaned_text', 'tokenized_text']].head()

Unnamed: 0,cleaned_text,tokenized_text
0,tv future in the hands of viewers with home th...,"[tv, future, in, the, hands, of, viewers, with..."
1,worldcom boss left books alone former worldc...,"[worldcom, boss, left, books, alone, former, w..."
2,tigers wary of farrell gamble leicester say ...,"[tigers, wary, of, farrell, gamble, leicester,..."
3,yeading face newcastle in fa cup premiership s...,"[yeading, face, newcastle, in, fa, cup, premie..."
4,ocean s twelve raids box office ocean s twelve...,"[ocean, s, twelve, raids, box, office, ocean, ..."
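
Calling split with no arguments splits on any run of whitespace, which matters when the text contains double spaces or line breaks. A small sketch with a hypothetical string shows the difference from splitting on a single space:

```python
# Hypothetical string with a double space and a newline
text = "worldcom boss  left books\nalone"

# No argument: split on any run of whitespace and drop empty strings
tokens = text.split()

# Explicit single space: leaves an empty token and keeps the newline
naive = text.split(" ")

print(tokens)
# ['worldcom', 'boss', 'left', 'books', 'alone']
print(naive)
# ['worldcom', 'boss', '', 'left', 'books\nalone']
```

The pandas `.str.split()` call used above behaves like the first version when no pattern is given.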


### Step 4: Removing Stop Words
Stop words are very common words like "the", "a", "is", "in", which often don't carry much meaning. Removing them helps us focus on the important keywords. We'll use a standard library for NLP called NLTK (Natural Language Toolkit) to get a list of English stop words.

In [5]:
# You might need to download the stopwords list the first time you run this
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Get the standard list of English stop words
stop_words = set(stopwords.words('english'))

# Create a function to remove stop words from a list of tokens
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

# Apply this function to our tokenized text
bbc_df['tokens_no_stop'] = bbc_df['tokenized_text'].apply(remove_stopwords)

bbc_df[['tokenized_text', 'tokens_no_stop']].head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Work\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,tokenized_text,tokens_no_stop
0,"[tv, future, in, the, hands, of, viewers, with...","[tv, future, hands, viewers, home, theatre, sy..."
1,"[worldcom, boss, left, books, alone, former, w...","[worldcom, boss, left, books, alone, former, w..."
2,"[tigers, wary, of, farrell, gamble, leicester,...","[tigers, wary, farrell, gamble, leicester, say..."
3,"[yeading, face, newcastle, in, fa, cup, premie...","[yeading, face, newcastle, fa, cup, premiershi..."
4,"[ocean, s, twelve, raids, box, office, ocean, ...","[ocean, twelve, raids, box, office, ocean, twe..."
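
The filtering logic itself is independent of NLTK. The sketch below uses a tiny hand-written set, `demo_stop_words`, which is a hypothetical stand-in for the real NLTK list, so that it runs without any downloads:

```python
# Hypothetical, hand-written stop word set for illustration only;
# the real pipeline uses NLTK's much longer English list
demo_stop_words = {"the", "in", "of", "with"}

tokens = ["tv", "future", "in", "the", "hands", "of", "viewers"]

# Keep only the tokens that are not in the stop word set
filtered = [t for t in tokens if t not in demo_stop_words]

print(filtered)
# ['tv', 'future', 'hands', 'viewers']
```

Storing the stop words in a set rather than a list keeps each membership check fast, which adds up over thousands of articles. The comparison is case-sensitive, which is one reason lowercasing comes earlier in the pipeline.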


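### Step 5: Lemmatisation
The final step reduces words to their base or dictionary form (e.g. "elections" -> "election"). This is a powerful way to group words with the same root meaning. Note that NLTK's WordNetLemmatizer treats every word as a noun unless told otherwise, so verb forms such as "elected" are left unchanged by the code below.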

In [6]:
# You might need to download wordnet the first time
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Create a function to lemmatize a list of tokens
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

# Apply this function to our tokens
bbc_df['final_tokens'] = bbc_df['tokens_no_stop'].apply(lemmatize_tokens)

bbc_df[['tokens_no_stop', 'final_tokens']].head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Work\AppData\Roaming\nltk_data...


Unnamed: 0,tokens_no_stop,final_tokens
0,"[tv, future, hands, viewers, home, theatre, sy...","[tv, future, hand, viewer, home, theatre, syst..."
1,"[worldcom, boss, left, books, alone, former, w...","[worldcom, bos, left, book, alone, former, wor..."
2,"[tigers, wary, farrell, gamble, leicester, say...","[tiger, wary, farrell, gamble, leicester, say,..."
3,"[yeading, face, newcastle, fa, cup, premiershi...","[yeading, face, newcastle, fa, cup, premiershi..."
4,"[ocean, twelve, raids, box, office, ocean, twe...","[ocean, twelve, raid, box, office, ocean, twel..."


### Before vs After 
So why did we do all that? Let's look at the 20 most common words in our dataset _before_ and _after_ cleaning. This will show the power of pre-processing.

In [7]:
# Get all words from the original raw text column
all_raw_words = bbc_df['text'].str.lower().str.split().explode()

# Get all words from our final cleaned tokens column
all_cleaned_words = bbc_df['final_tokens'].explode()

# Calculate the frequency of the top 20 words for each
top_20_raw = all_raw_words.value_counts().head(20)
top_20_cleaned = all_cleaned_words.value_counts().head(20)

print("--- Top 20 Words (Before Cleaning) ---")
print(top_20_raw)
print("\n--- Top 20 Words (After Cleaning) ---")
print(top_20_cleaned)

--- Top 20 Words (Before Cleaning) ---
text
the     52567
to      24955
of      19947
and     18561
a       18251
in      17570
s        9007
for      8884
is       8515
that     8135
it       7584
on       7460
was      6016
he       5933
be       5765
with     5313
said     5072
as       4976
has      4952
have     4745
Name: count, dtype: int64

--- Top 20 Words (After Cleaning) ---
final_tokens
said          7254
mr            3045
year          2830
would         2577
also          2156
people        2044
new           1970
u             1926
one           1809
could         1511
game          1471
time          1449
last          1381
first         1283
say           1268
world         1214
government    1189
two           1181
company       1113
film          1113
Name: count, dtype: int64


Notice how the _before_ list is dominated by stopwords, while the _after_ list contains meaningful keywords like "said", "mr", "year", and "government". Our data is now much more useful.

## Exercise 
Let's put it all together. 

Create a single function named clean_text that takes a raw text string as input and performs the entire cleaning pipeline:
- lowercasing
- punctuation removal
- tokenisation
- stop word removal
- lemmatisation

Apply this function to the original 'text' column to create a new column and show the head of your DataFrame.

We will be using the function you create here in future notebooks.

In [None]:
# Your code for the Exercise here.
# You will need to re-use the stop_words and lemmatizer variables we created earlier.

## Conclusion
You should now have a DataFrame that contains a column ('final_tokens') of clean, standardised tokens. This clean data is the standard starting point for almost any advanced NLP task. You are now prepared to move on to tasks such as sentiment analysis, topic modelling and even machine learning.