To process the text files, I use the Natural Language Toolkit (NLTK). More powerful toolkits can process text files without step-by-step coding. For learning purposes, however, I use NLTK to familiarize myself with the concepts.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /Users/winston/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/winston/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/winston/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/winston/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /Users/winston/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

In [3]:
#Disable warning in Anaconda
import warnings
warnings.filterwarnings('ignore')

#### Load the dataset

In [4]:
df = pd.DataFrame(columns = ['content','class'])

In practice, text files come in various encodings. To detect each file's encoding so that its content can be read into a Unicode string, we use UnicodeDammit from the bs4 library.
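
To see why the encoding matters, here is a minimal sketch using only the standard library. The helper `decode_with_fallback` and its list of candidate encodings are hypothetical; it is a much simpler stand-in for the detection that UnicodeDammit performs, not a description of how bs4 works internally.

```python
# Hypothetical helper: try a short list of candidate encodings in order
# and return the decoded text together with the encoding that worked.
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin-1")):
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

word = "caf\u00e9"                    # 'cafe' with an acute accent on the e
utf8_bytes = word.encode("utf-8")     # the accented letter takes two bytes
cp1252_bytes = word.encode("cp1252")  # the accented letter takes one byte

text_a, enc_a = decode_with_fallback(utf8_bytes)
text_b, enc_b = decode_with_fallback(cp1252_bytes)
print(repr(text_a), enc_a)
print(repr(text_b), enc_b)
```

The same word is stored as different bytes depending on the encoding, so reading a file with the wrong one either fails or produces garbled characters.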

In [5]:
from bs4 import UnicodeDammit

In [6]:
path = 'Data/bbc/'
rows = [] #Collect one row per file and build the dataframe once (DataFrame.append was removed in pandas 2.0)
for directory in os.listdir(path):
    directory = os.path.join(path, directory)
    if os.path.isdir(directory):
        for filename in os.listdir(directory):
            filename = os.path.join(directory, filename)
            #Use UnicodeDammit to detect a suitable encoding for the text data
            with open(filename,'rb') as f: #UnicodeDammit needs the raw bytes, so read in binary mode ('rb')
                suggestion = UnicodeDammit(f.read())
                encoding = suggestion.original_encoding

            #Read the file again with the detected encoding and record its content and class
            with open(filename, encoding=encoding) as f:
                rows.append({'content': f.read(), 'class': os.path.basename(directory)})

df = pd.DataFrame(rows, columns=['content','class'])

Check the loaded data:

In [7]:
#Overview the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   content  2225 non-null   object
 1   class    2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [8]:
#Check missing data
df.isnull().sum()

content    0
class      0
dtype: int64

In [9]:
#Check duplicates
df.duplicated().sum()

98

In [10]:
#Drop duplicates and check again
df.drop_duplicates(inplace = True)
df.duplicated().sum()

0

In [11]:
df.head()

Unnamed: 0,content,class
0,Musicians to tackle US red tape\n\nMusicians' ...,entertainment
1,"U2's desire to be number one\n\nU2, who have w...",entertainment
2,Rocker Doherty in on-stage fight\n\nRock singe...,entertainment
3,Snicket tops US box office chart\n\nThe film a...,entertainment
4,Ocean's Twelve raids box office\n\nOcean's Twe...,entertainment


In [12]:
df.tail()

Unnamed: 0,content,class
2220,Warning over Windows Word files\n\nWriting a M...,tech
2221,Fast lifts rise into record books\n\nTwo high-...,tech
2222,Nintendo adds media playing to DS\n\nNintendo ...,tech
2223,Fast moving phone viruses appear\n\nSecurity f...,tech
2224,Hacker threat to Apple's iTunes\n\nUsers of Ap...,tech


#### Clean the text

In [13]:
#Remove HTTP links
df['content'] = df['content'].replace(r'((https|http)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*','',regex = True)

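The regex used for links, where the http:// or https:// prefix is optional, so bare domains such as bbc.co.uk are matched as well: ((https|http)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*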
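
As a quick check of what this pattern removes, the sketch below applies it to a few made-up sample strings with the standard `re` module. The sample sentences are assumptions for illustration, not taken from the dataset.

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
URL_PATTERN = re.compile(
    r'((https|http)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*'
)

samples = [
    "See http://news.bbc.co.uk/sport for details",  # full link with scheme
    "visit bbc.co.uk now",                          # bare domain, no scheme
    "It ended.Next came rain",                      # missing space after a full stop
]

cleaned = [URL_PATTERN.sub('', s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Because the scheme is optional, two words joined by a full stop with no space between them also look like a domain to this pattern and are removed. That side effect is worth keeping in mind when inspecting the cleaned text.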

In [14]:
#Remove end of line characters
df['content'] = df['content'].replace(r'[\r\n]+',' ',regex=True)

In [15]:
#Remove numbers and any token that contains a digit
df['content'] = df['content'].replace(r'\w*\d+\w*','',regex=True)

In [16]:
#Remove punctuation (the underscore is listed separately because \w matches it)
df['content'] = df['content'].replace(r'[^\w\s]|_',' ',regex=True)

In [17]:
#Replace multiple spaces with one space
df['content'] = df['content'].replace(r'\s{2,}',' ',regex=True)

In [18]:
#Convert to lower case
df['content'] = df['content'].str.lower()

In [19]:
#Trim the whitespace at the start and end of each text
df['content'] = df['content'].str.strip()
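
The cleaning steps in cells 14 to 19 can be traced on a single string with the standard `re` module. This is a minimal sketch: the function name `clean_text` and the sample sentence are assumptions for illustration, and the underscore is removed explicitly because `\w` treats it as a word character.

```python
import re

def clean_text(text):
    text = re.sub(r'[\r\n]+', ' ', text)    # end-of-line characters -> space
    text = re.sub(r'\w*\d+\w*', '', text)   # drop tokens containing digits
    text = re.sub(r'[^\w\s]|_', ' ', text)  # punctuation and underscore -> space
    text = re.sub(r'\s{2,}', ' ', text)     # collapse runs of whitespace
    text = text.lower()                     # lower case
    return text.strip()                     # trim both ends

sample = "Sales rose 12% in 2004,\nsaid the firm's CEO.  "
result = clean_text(sample)
print(result)
```

Note that the apostrophe in "firm's" is replaced by a space, which leaves a stray "s" token behind.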

In [20]:
#Remove empty rows
df = df[df['content'] != '']

In [21]:
#Remove stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    text_split = text.split()
    text = [word for word in text_split if word not in stop_words]
    return ' '.join(text)    

In [22]:
df['content'] = df['content'].apply(remove_stopwords)

Stop words are common English words, such as 'the', 'is', and 'of', that do not add much meaning to a sentence. They can usually be ignored without changing what the sentence is about.
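
A minimal sketch of the same filtering idea on one sentence. The set `SAMPLE_STOP_WORDS` is a small hand-picked assumption used only for illustration; NLTK's English list is much longer.

```python
# Hypothetical, hand-picked stop words for illustration only.
SAMPLE_STOP_WORDS = {"the", "is", "on", "a", "of"}

def drop_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    # Keep only the words that are not in the stop word set.
    return " ".join(word for word in text.split() if word not in stop_words)

result = drop_stop_words("the price of oil is rising on a weak dollar")
print(result)
```

Using a set rather than a list keeps each membership check fast, which matters when the function is applied to every word of every document.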

In [23]:
#Use the WordNet lemmatizer to get the base form of each word
lemmatizer = WordNetLemmatizer()

In [24]:
#Function to map a Penn Treebank POS tag (from pos_tag) to a WordNet POS tag
def get_wordnet_pos(treebank_tag):
    '''
    Return the WordNet POS tag (a, n, r, v) expected by the WordNet lemmatizer
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        #Nouns and all other tags: noun is the lemmatizer's default POS
        return wordnet.NOUN
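
The tagger returns Penn Treebank tags such as 'JJ' or 'VBD', while the lemmatizer expects one of WordNet's single-letter codes. The standalone sketch below shows the same first-letter mapping with the literal codes 'a', 'v', 'r', and 'n', which are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV`, and `wordnet.NOUN`. The names `TREEBANK_TO_WORDNET` and `to_wordnet_code` are hypothetical.

```python
# First letter of a Penn Treebank tag -> WordNet POS code.
TREEBANK_TO_WORDNET = {"J": "a", "V": "v", "R": "r", "N": "n"}

def to_wordnet_code(treebank_tag):
    # Any tag whose first letter is not in the table falls back to noun.
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1], "n")

tags = ["JJ", "VBD", "RB", "NNS", "IN"]
codes = [to_wordnet_code(tag) for tag in tags]
print(list(zip(tags, codes)))
```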

Stemming is the process of reducing inflected words to their root form. For example, 'consult', 'consultant', 'consulting', 'consultative', etc. are all reduced to 'consult' after stemming.

However, a stem is not always a real word: 'studying' and 'study' become 'studi', while 'change' and 'changing' become 'chang'.

Thus, lemmatization, a similar but more accurate method, is preferred here. Lemmatization also converts a word to its base form, but it looks that form (the lemma) up in a dictionary, WordNet in this case, so 'studying' tagged as a verb becomes 'study'.
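
The stemming behavior described above can be checked with NLTK's `PorterStemmer`, which needs no extra data downloads. This is only a sketch of the comparison: the notebook itself does not use a stemmer, and the word list is taken from the examples in the text.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["consult", "consulting", "consultant",
         "study", "studying", "change", "changing"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems such as 'studi' and 'chang' group related words together but are not dictionary words, which is the main reason the lemmatizer is used in the next cells.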

In [25]:
def lemmatize_text(text):
    #Tag each token with its part of speech, then lemmatize it with the matching WordNet POS
    pos_tag_list = pos_tag(word_tokenize(text))
    lemmatized = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tag_list]
    return ' '.join(lemmatized)

In [26]:
df['content'] = df['content'].apply(lemmatize_text)

In [27]:
#Save the processed text data to an Excel file
df.to_excel('bbc_cleaned.xlsx',index=False)