Natural Language Processing (NLTK: Natural Language Toolkit) in Python

In [None]:
# Import NLTK and download its data packages (corpora, tokenizers, taggers)
import nltk

nltk.download('all') # only needs to run once per environment

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    |   Package bcp47 is already up-to-dat

True

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Build the stop-word collection once here so it is not rebuilt on every call to the function below.
# A set makes the membership test inside get_keywords fast.
stop_words = set(stopwords.words('english'))
# index_col='id' makes the 'id' column the DataFrame index. This assumes that every 'id' is unique.
df = pd.read_csv('drive/MyDrive/VoiceofCustomer.csv', index_col='id')
df

id,satisfaction,satisfaction score,voice of customer
11,neutral or dissatisfied,4,NIL
12,neutral or dissatisfied,3,NIL
20,neutral or dissatisfied,3,NIL
22,neutral or dissatisfied,4,NIL
55,neutral or dissatisfied,3,NIL
...,...,...,...
129796,neutral or dissatisfied,4,NIL
129814,neutral or dissatisfied,4,I encountered a delay in receiving my requeste...
129828,neutral or dissatisfied,3,NIL
129843,neutral or dissatisfied,3,NIL
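
Setting `index_col='id'` only makes sense if every id is unique. A minimal sketch of how that assumption could be checked is shown below. The inline CSV and the names `csv_text`, `demo_df`, and `duplicated_ids` are made up for illustration and are not part of the real file.

```python
import io
import pandas as pd

# Illustrative sample with a deliberately repeated id (12); not taken from the real file.
csv_text = """id,satisfaction
11,neutral or dissatisfied
12,satisfied
12,satisfied
"""

demo_df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Index.is_unique is False as soon as any index label occurs more than once.
print(demo_df.index.is_unique)

# Index.duplicated() marks every repeat after the first occurrence of a label.
duplicated_ids = demo_df.index[demo_df.index.duplicated()].unique().tolist()
print(duplicated_ids)
```

If the check fails on the real data, one option is to keep the default integer index and leave `id` as an ordinary column.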


Here we define the function that **df.apply** will apply to each row in the next cell. **get_keywords** takes a row as its argument and returns a string of comma-separated keywords, for example "meaning,word,himalaya". Inside the function we lowercase the text, tokenize it, drop punctuation and numbers with **isalpha()**, filter out the stop words, and join the remaining keywords into a single string.

In [None]:
# This function is applied to every row of the DataFrame.
# See the pandas documentation for DataFrame.apply.

def get_keywords(row):
    some_text = row['voice of customer']
    lowered = some_text.lower()
    tokens = word_tokenize(lowered)
    # keep purely alphabetic tokens that are not stop words
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    keywords_string = ','.join(keywords)
    return keywords_string
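
To see the lowercase, tokenize, filter, and join steps on a single sentence, here is a rough stand-alone sketch. It is not the NLTK pipeline itself. It swaps `word_tokenize` for a simple regular expression and uses a tiny hand-written stop-word set, so it runs without any NLTK data. The names `DEMO_STOP_WORDS` and `demo_keywords` are hypothetical.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (the real list is much longer).
DEMO_STOP_WORDS = {"i", "a", "in", "my", "the", "was"}

def demo_keywords(text):
    """Lowercase, tokenize, drop stop words, and join the rest with commas."""
    lowered = text.lower()
    # Simplification: take runs of ASCII letters instead of word_tokenize + isalpha().
    tokens = re.findall(r"[a-z]+", lowered)
    keywords = [token for token in tokens if token not in DEMO_STOP_WORDS]
    return ",".join(keywords)

print(demo_keywords("I encountered a delay in my order!"))
```

The regular expression treats contractions and non-English letters differently from `word_tokenize`, so results on real reviews may differ from what **get_keywords** produces.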

Now that the function is defined, we call **df.apply(get_keywords, axis=1)**. This returns a pandas Series (a one-dimensional labeled array) holding one keyword string per row. Since we want this Series to be part of our DataFrame, we add it as a new column using **df['keywords'] = df.apply(get_keywords, axis=1)**

In [None]:
# applying the get_keywords function to our dataframe and saving the results
# as a new column in our dataframe called 'keywords'
# axis=1 means that we will apply get_keywords to each row and not each column

df['keywords'] = df.apply(get_keywords, axis=1)
df
#df.to_excel("reviewskeywords.xlsx")

id,satisfaction,satisfaction score,voice of customer,keywords
11,neutral or dissatisfied,4,NIL,nil
12,neutral or dissatisfied,3,NIL,nil
20,neutral or dissatisfied,3,NIL,nil
22,neutral or dissatisfied,4,NIL,nil
55,neutral or dissatisfied,3,NIL,nil
...,...,...,...,...
129796,neutral or dissatisfied,4,NIL,nil
129814,neutral or dissatisfied,4,I encountered a delay in receiving my requeste...,"encountered,delay,receiving,requested,special,..."
129828,neutral or dissatisfied,3,NIL,nil
129843,neutral or dissatisfied,3,NIL,nil
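
The output above shows that the placeholder `NIL` ends up as the keyword `nil`. If `NIL` simply means "no comment" (an assumption about this dataset), one possible approach is to have `read_csv` treat it as a missing value and return an empty string for those rows. The sketch below uses a small inline CSV with illustrative values and a simplified whitespace tokenizer in place of NLTK. `demo_df` and `demo_keywords_or_empty` are hypothetical names.

```python
import io
import pandas as pd

# Tiny inline sample that mimics the CSV layout shown above (values are illustrative).
csv_text = """id,satisfaction,satisfaction score,voice of customer
11,neutral or dissatisfied,4,NIL
12,neutral or dissatisfied,3,The seat was broken
"""

# na_values tells pandas to read the literal string "NIL" as a missing value (NaN).
demo_df = pd.read_csv(io.StringIO(csv_text), index_col="id", na_values=["NIL"])

def demo_keywords_or_empty(row):
    text = row["voice of customer"]
    if pd.isna(text):
        return ""  # no comment, so no keywords
    # Simplified tokenization; the notebook's own NLTK steps could be used here instead.
    return ",".join(word for word in text.lower().split() if word.isalpha())

demo_df["keywords"] = demo_df.apply(demo_keywords_or_empty, axis=1)
print(demo_df["keywords"].tolist())
```

Whether an empty string or a missing value is the better result for such rows depends on how the keywords column is used later.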
