## Problem Statement: Natural Language Processing on the BBC News dataset available on Kaggle

In this tutorial, I perform natural language processing tasks in Python using NLTK, spaCy, gensim (Word2Vec), and scikit-learn (TF-IDF). The techniques covered are tokenization, stemming, lemmatization, named entity recognition, word embeddings, and document similarity calculation.

## Tasks:

##### 1. Import the necessary libraries: Start by importing the required libraries, including NLTK, spaCy, gensim, and scikit-learn.

In [9]:
!pip install gensim



In [10]:
!pip install spacy



In [1]:
import nltk
import pandas as pd
import spacy
from gensim.models import Word2Vec
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from scipy.spatial import distance
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from spacy import displacy

In [2]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/jaspreetkaur/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [3]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jaspreetkaur/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
spacy.cli.download("en_core_web_sm") 

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
nlp = spacy.load("en_core_web_sm")

In [6]:
onjWNL = WordNetLemmatizer()
objPorterStemmer = PorterStemmer()

In [7]:
stWords = stopwords.words('english')

##### 2. Load the dataset: Load the BBC News dataset (bbc_news.csv) into a pandas DataFrame using the read_csv() function. The dataset contains 20,536 rows and 5 columns (title, pubDate, guid, link, description); the description column holds the article text used in the following steps.

In [2]:
dataBbcNews = pd.read_csv("bbc_news.csv")

In [3]:
dataFrameTop5Rows = dataBbcNews.head()

In [10]:
dataBbcNews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20536 entries, 0 to 20535
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        20536 non-null  object
 1   pubDate      20536 non-null  object
 2   guid         20536 non-null  object
 3   link         20536 non-null  object
 4   description  20536 non-null  object
dtypes: object(5)
memory usage: 802.3+ KB


In [11]:
textColumn = dataBbcNews['title']
textColumn

0        Ukraine: Angry Zelensky vows to punish Russian...
1        War in Ukraine: Taking cover in a town under a...
2               Ukraine war 'catastrophic for global food'
3        Manchester Arena bombing: Saffie Roussos's par...
4        Ukraine conflict: Oil price soars to highest l...
                               ...                        
20531    UCI Cycling World Championships 2023: Tom Pidc...
20532    Hibernian 3-1 Luzern: Aston Villa Europa Confe...
20533    The Hundred 2023: Stevie Eskinazi steers Welsh...
20534    The Hundred 2023: Shabnam Ismail's match-winni...
20535              Mortgage rates: Five ways to save money
Name: title, Length: 20536, dtype: object

In [12]:
articleColumn =  dataBbcNews['description']
articleColumn

0        The Ukrainian president says the country will ...
1        Jeremy Bowen was on the frontline in Irpin, as...
2        One of the world's biggest fertiliser firms sa...
3        The parents of the Manchester Arena bombing's ...
4        Consumers are feeling the impact of higher ene...
                               ...                        
20531    Great Britain's Tom Pidcock is accused of "cra...
20532    Two late goals give Hibernian a potentially pr...
20533    Stevie Eskinazi hits a classy 43 from 18 balls...
20534    Watch the best plays from day 10 of The Hundre...
20535    Experts give advice for those who might be wor...
Name: description, Length: 20536, dtype: object
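Before tokenizing, it can help to confirm that the text column is complete. The following is only a sketch under stated assumptions: it builds a tiny in-memory stand-in for bbc_news.csv (the rows are invented placeholders, not data from the real file) and uses a hypothetical helper, `summarize_text_column`, to report missing values and duplicates.

```python
import io

import pandas as pd

# Tiny stand-in for bbc_news.csv; these rows are illustrative placeholders.
SAMPLE_CSV = io.StringIO(
    "title,description\n"
    "Headline A,First short summary.\n"
    "Headline B,Second summary that is a little longer.\n"
    "Headline C,First short summary.\n"
)


def summarize_text_column(frame, column):
    """Return basic quality statistics for one text column (hypothetical helper)."""
    series = frame[column]
    return {
        "rows": len(series),
        "missing": int(series.isna().sum()),
        "duplicates": int(series.duplicated().sum()),
        "mean_chars": float(series.str.len().mean()),
    }


sample_frame = pd.read_csv(SAMPLE_CSV)
summary = summarize_text_column(sample_frame, "description")
print(summary)
```

On the real DataFrame, the equivalent call would be `summarize_text_column(dataBbcNews, "description")`.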

##### 3. Tokenization with NLTK: Implement tokenization using NLTK's word_tokenize() and sent_tokenize() functions. Apply these functions to a sample news article from the dataset.

In [13]:
word_tokenize(articleColumn[0])
# perform word tokenization on the first row of the article column

['The',
 'Ukrainian',
 'president',
 'says',
 'the',
 'country',
 'will',
 'not',
 'forgive',
 'or',
 'forget',
 'those',
 'who',
 'murder',
 'its',
 'civilians',
 '.']

In [14]:
sent_tokenize(articleColumn[1]) 
# perform sentence tokenization on the second row of the article column

['Jeremy Bowen was on the frontline in Irpin, as residents came under Russian fire while trying to flee.']
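The sample description above is a single sentence, so the difference between the two tokenizers is hard to see. The sketch below uses an invented two-sentence string and two naive regex-based stand-ins (`naive_sent_tokenize` and `naive_word_tokenize` are hypothetical names, not NLTK functions) to show what each level of tokenization produces. NLTK's sent_tokenize() relies on the trained Punkt model, which copes with abbreviations such as "Dr." that this simple rule would split incorrectly.

```python
import re
from typing import List

# Invented sample text with two sentences.
TEXT = "Oil prices rose sharply. Consumers are feeling the impact!"


def naive_sent_tokenize(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def naive_word_tokenize(text: str) -> List[str]:
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


sentences = naive_sent_tokenize(TEXT)
words = [naive_word_tokenize(sentence) for sentence in sentences]

print(sentences)
print(words[0])
```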

##### 4. Stemming and Lemmatization with NLTK: Implement stemming and lemmatization using NLTK's PorterStemmer and WordNetLemmatizer classes. Apply them to a sample news article from the dataset.

In [15]:
# Apply stemming to a single row.
sentence = articleColumn[0].lower()  # lowercase so stop-word matching works
wordTokenized = word_tokenize(sentence)
# drop English stop words before stemming
sentenceCleaned = [s for s in wordTokenized if s not in stWords]
print("====implement stemming on a single row values===")
for i in sentenceCleaned:
    print(f"The actual word is ==> {i} and stem word is ==> {objPorterStemmer.stem(i)}")

====implement stemming on a single row values===
The actual word is ==> ukrainian and stem word is ==> ukrainian
The actual word is ==> president and stem word is ==> presid
The actual word is ==> says and stem word is ==> say
The actual word is ==> country and stem word is ==> countri
The actual word is ==> forgive and stem word is ==> forgiv
The actual word is ==> forget and stem word is ==> forget
The actual word is ==> murder and stem word is ==> murder
The actual word is ==> civilians and stem word is ==> civilian
The actual word is ==> . and stem word is ==> .
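As the output shows, a stem is a truncated form and need not be a dictionary word ("presid", "countri"). The point of stemming is that inflected forms of one word collapse to the same stem. A minimal sketch, assuming only that NLTK is installed (PorterStemmer needs no downloaded corpus); the extra word forms "forgives" and "forgiving" are added here for illustration:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the stemming output shown above.
words = ["president", "country", "civilians", "says"]
stems = {word: stemmer.stem(word) for word in words}

# Inflected forms of one verb (added for illustration) share a single stem.
family = ["forgive", "forgives", "forgiving"]
family_stems = {stemmer.stem(word) for word in family}

print(stems)
print(family_stems)
```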


In [16]:
# Apply lemmatization to the same row (the default POS is noun).
print("====implement Lemmatization on single row values===")
for i in sentenceCleaned:
    print(f"The actual word is ==> {i} and lemmatize word is ==> {onjWNL.lemmatize(i)}")

====implement Lemmatization on single row values===
The actual word is ==> ukrainian and lemmatize word is ==> ukrainian
The actual word is ==> president and lemmatize word is ==> president
The actual word is ==> says and lemmatize word is ==> say
The actual word is ==> country and lemmatize word is ==> country
The actual word is ==> forgive and lemmatize word is ==> forgive
The actual word is ==> forget and lemmatize word is ==> forget
The actual word is ==> murder and lemmatize word is ==> murder
The actual word is ==> civilians and lemmatize word is ==> civilian
The actual word is ==> . and lemmatize word is ==> .


In [17]:
# Map a Penn Treebank POS tag to the matching WordNet POS constant, which is passed to the lemmatizer later.
def getPosTag(tagValue):
    if tagValue.startswith('J'):
        return wordnet.ADJ
    elif tagValue.startswith('V'):
        return wordnet.VERB
    elif tagValue.startswith('N'):
        return wordnet.NOUN
    elif tagValue.startswith('R'):
        return wordnet.ADV
    else:
        return None
    

In [18]:
# Lemmatize each word according to its POS tag.
tokenized = sent_tokenize(sentence)
for i in tokenized:
    wordsList = word_tokenize(i)

    # remove stop words from wordsList
    wordsList = [w for w in wordsList if w not in stWords]

    # tag each remaining word with its part of speech
    PartsofSpeech = pos_tag(wordsList)
    for word, tag in PartsofSpeech:
        wordnetPos = getPosTag(tag)
        # skip tokens (such as punctuation) that have no WordNet POS
        if wordnetPos is not None:
            word_lemma = onjWNL.lemmatize(word, pos=wordnetPos)
            print(f"Original word: {word}, Lemmatized word: {word_lemma}")

Original word: ukrainian, Lemmatized word: ukrainian
Original word: president, Lemmatized word: president
Original word: says, Lemmatized word: say
Original word: country, Lemmatized word: country
Original word: forgive, Lemmatized word: forgive
Original word: forget, Lemmatized word: forget
Original word: murder, Lemmatized word: murder
Original word: civilians, Lemmatized word: civilian


##### 5. Named Entity Recognition with spaCy: Use spaCy's pre-trained model to perform named entity recognition on a sample news article from the dataset. Visualize the named entities using displaCy.

In [None]:
# Run named entity recognition on the row at index 5 of the article column.
# The original casing is kept because spaCy's NER uses capitalization as a cue.
sentence = articleColumn[5]
doc = nlp(sentence)
NER = [(ent.text,ent.label_) for ent in doc.ents]
displacy.serve(doc, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...
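displacy.serve() starts a local web server and keeps the cell running until it is interrupted. In a notebook, displacy.render() is often more convenient because it returns the markup directly. The sketch below is an illustration under assumptions: the entity labels are assigned by hand through displaCy's manual mode (they are not model output), the sentence is adapted from the second description shown earlier, and `char_span` is a hypothetical helper.

```python
from spacy import displacy

# Sentence adapted from the second description shown earlier.
text = "Jeremy Bowen was on the frontline in Irpin."


def char_span(source, phrase, label):
    """Build a displaCy manual-mode entity dict from a phrase (hypothetical helper)."""
    start = source.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# Hand-assigned entities (an assumption for illustration, not model predictions).
manual_doc = {
    "text": text,
    "ents": [
        char_span(text, "Jeremy Bowen", "PERSON"),
        char_span(text, "Irpin", "GPE"),
    ],
    "title": None,
}

# jupyter=False returns the HTML as a string instead of displaying it.
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(html[:80])
```

With a loaded model, the same call takes the processed document instead: `displacy.render(doc, style="ent")`.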



##### 6. Word2Vec with gensim: Implement Word2Vec using gensim's Word2Vec function on the entire dataset.  Train the model and get the vector representation of a sample word.

In [4]:
# Tokenize every description; the resulting token lists are the Word2Vec training sentences.
dataBbcNews['tokenized_sents'] = dataBbcNews['description'].apply(word_tokenize)
#dataBbcNews['tokenized_sents']

In [5]:
# Create the Word2Vec model: CBOW (sg=0), a context window of 5, ignoring words seen fewer than 2 times
wordModel = Word2Vec(window=5, min_count=2, workers=4, sg=0)

# Build the vocabulary from the tokenized descriptions, then train the model
wordModel.build_vocab(dataBbcNews['tokenized_sents'], progress_per=1000)
wordModel.train(dataBbcNews['tokenized_sents'],
                total_examples=wordModel.corpus_count, epochs=wordModel.epochs)

(1485838, 2015590)

In [6]:
###### Note: I have commented out some code for clarity
###### and show only the code required for the problem statement.

In [7]:
#wordModel.wv["president"]

In [8]:
#wordModel.wv["civilians"]

In [9]:
#wordModel.wv["Manchester"]

In [10]:
wordModel.wv.similarity(w1="civilians", w2="president")

0.7765721

In [11]:
wordModel.wv.similarity(w1="civilians", w2="Manchester")

0.41614118

In [12]:
wordModel.wv.similarity(w1="parents", w2="Ukrainian")

0.9338075

##### 7. TF-IDF with scikit-learn: Implement TF-IDF using scikit-learn's TfidfVectorizer function on the entire dataset.  Transform the dataset using the fitted vectorizer and calculate the cosine similarity between two news articles.

In [13]:
# Convert the description column to a list of strings (fit_transform accepts any iterable of strings).
# Note: the slice below keeps only rows 1 to 4 (four descriptions), not the entire dataset.
descList = dataBbcNews['description'].tolist()
descList = descList[1:5]
#descList

In [14]:
# create object
vectorizer = TfidfVectorizer()

In [15]:
# pass list object to fit_transform method
tfidfMatrix = vectorizer.fit_transform(descList)

In [16]:
#feature_names = vectorizer.get_feature_names_out()

In [17]:
# Dense TF-IDF vector for the first document in descList (dataset row 1)
textToVector1 = tfidfMatrix.toarray()[0].tolist()
#textToVector1

In [18]:
# Dense TF-IDF vector for the second document in descList (dataset row 2)
textToVector2 = tfidfMatrix.toarray()[1].tolist()
# textToVector2

In [19]:
# distance.cosine returns the cosine distance (1 - similarity), so convert it back and express it as a percentage
cosine = distance.cosine(textToVector1, textToVector2)
print('Similarity of two sentences are equal to ',round((1-cosine)*100,2),'%')

Similarity of two sentences are equal to  7.64 %


In [20]:
textToVector3 = tfidfMatrix.toarray()[3].tolist()
# textToVector3

In [21]:
# same conversion for the first and fourth documents in descList
cosine = distance.cosine(textToVector1, textToVector3)
print('Similarity of two sentences are equal to ',round((1-cosine)*100,2),'%')

Similarity of two sentences are equal to  5.82 %
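The pairwise comparisons above can also be computed in one call with scikit-learn's cosine_similarity(), which was imported at the start but not used. It accepts the sparse TF-IDF matrix directly, so no conversion to dense lists is needed. A minimal sketch, using three invented documents rather than the dataset rows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented documents: the first two share words, the third shares none with the first.
documents = [
    "oil price soars to highest level",
    "consumers feel the impact of higher oil price",
    "parents speak about the arena bombing",
]

# Rows of the TF-IDF matrix are L2-normalized by default.
tfidf = TfidfVectorizer().fit_transform(documents)

# Entry [i, j] is the cosine similarity between documents i and j.
similarity = cosine_similarity(tfidf)
print(similarity.round(2))
```

For the notebook's data, `cosine_similarity(tfidfMatrix)` would return the full 4 x 4 matrix for the four descriptions in descList.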
