DATASET: Tweets to and from companies doing customer support on Twitter.

Step 1: Import Libraries

In [None]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string


Step 2: Load the dataset

In [None]:
full_df = pd.read_csv('/content/sample.csv', nrows=5000)
# Work on an explicit copy so that adding columns does not raise SettingWithCopyWarning
df = full_df[["text"]].copy()
df["text"] = df["text"].astype(str)
full_df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,119237,105834,True,Wed Oct 11 06:55:44 +0000 2017,@AppleSupport causing the reply to be disregar...,119236.0,
1,119238,ChaseSupport,False,Wed Oct 11 13:25:49 +0000 2017,@105835 Your business means a lot to us. Pleas...,,119239.0
2,119239,105835,True,Wed Oct 11 13:00:09 +0000 2017,@76328 I really hope you all change but I'm su...,119238.0,
3,119240,VirginTrains,False,Tue Oct 10 15:16:08 +0000 2017,@105836 LiveChat is online at the moment - htt...,119241.0,119242.0
4,119241,105836,True,Tue Oct 10 15:17:21 +0000 2017,@VirginTrains see attached error message. I've...,119243.0,119240.0


# Lower Casing:

The idea is to convert the input text into the same casing format so that 'text', 'Text' and 'TEXT' are treated the same way.

By default, lower casing is done by most modern vectorizers and tokenizers, such as sklearn's TfidfVectorizer (parameter lowercase=True) and the Keras Tokenizer (parameter lower=True).

So we need to set these parameters to False whenever our use case requires the original casing to be preserved.

In [None]:
df["text_lower"] = df["text"].str.lower()
df.head()

Unnamed: 0,text,text_lower
0,@AppleSupport causing the reply to be disregar...,@applesupport causing the reply to be disregar...
1,@105835 Your business means a lot to us. Pleas...,@105835 your business means a lot to us. pleas...
2,@76328 I really hope you all change but I'm su...,@76328 i really hope you all change but i'm su...
3,@105836 LiveChat is online at the moment - htt...,@105836 livechat is online at the moment - htt...
4,@VirginTrains see attached error message. I've...,@virgintrains see attached error message. i've...
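
As a small illustration of why the casing matters, the sketch below counts a few made-up tokens before and after lower casing. The token list is an assumption chosen only for this example and is not taken from the dataset.

```python
from collections import Counter

# Toy tokens (made up for illustration) that differ only in casing.
tokens = ["text", "Text", "TEXT", "tweet"]

# Without lower casing, each variant is counted as a separate word.
raw_counts = Counter(tokens)

# After lower casing, the three variants collapse into a single entry.
lower_counts = Counter(token.lower() for token in tokens)

print(raw_counts["text"])    # 1
print(lower_counts["text"])  # 3
```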


# Removal of Punctuations:

Another common text preprocessing technique is to remove the punctuation from the text data. This is again a text standardization process that will help to treat 'hurray' and 'hurray!' in the same way.

We also need to carefully choose the list of punctuation symbols to exclude depending on the use case. For example, string.punctuation in Python contains the following punctuation symbols:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

We can add or remove symbols as per our need.

In [None]:
# drop the new column created in last cell
df.drop(["text_lower"], axis=1, inplace=True)

PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

df["text_wo_punct"] = df["text"].apply(lambda text: remove_punctuation(text))
df.head()

Unnamed: 0,text,text_wo_punct
0,@AppleSupport causing the reply to be disregar...,AppleSupport causing the reply to be disregard...
1,@105835 Your business means a lot to us. Pleas...,105835 Your business means a lot to us Please ...
2,@76328 I really hope you all change but I'm su...,76328 I really hope you all change but Im sure...
3,@105836 LiveChat is online at the moment - htt...,105836 LiveChat is online at the moment https...
4,@VirginTrains see attached error message. I've...,VirginTrains see attached error message Ive tr...
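
The punctuation list can be adjusted before it is used. The sketch below is one possible variation, assuming we want to keep '@' and '#' because they mark mentions and hashtags in tweets. The names KEEP, CUSTOM_PUNCT and remove_custom_punctuation are hypothetical and not part of the notebook.

```python
import string

# Assumption for illustration: keep '@' and '#' (mentions and hashtags),
# remove every other symbol in string.punctuation.
KEEP = "@#"
CUSTOM_PUNCT = "".join(ch for ch in string.punctuation if ch not in KEEP)

# Build the translation table once instead of on every call.
_TABLE = str.maketrans("", "", CUSTOM_PUNCT)

def remove_custom_punctuation(text):
    """Remove all punctuation except the symbols listed in KEEP."""
    return text.translate(_TABLE)

sample = "@AppleSupport my phone won't update!! #help"
cleaned = remove_custom_punctuation(sample)
print(cleaned)  # @AppleSupport my phone wont update #help
```

Whether mentions and hashtags should survive this step depends on the downstream task, so the choice of KEEP is only an example.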


# Removal of stopwords:
Stopwords are commonly occurring words in a language, like 'the', 'a' and so on.

They can be removed from the text most of the time, as they don't provide valuable information for downstream analysis.

In cases like Part of Speech tagging, we should not remove them, as they provide very valuable information about the POS.

These stopword lists are already compiled for different languages and we can readily use them.

For example, the stopword list for the English language from the nltk package can be seen below.

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [None]:
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

df["text_wo_stop"] = df["text_wo_punct"].apply(lambda text: remove_stopwords(text))
df.head()

Unnamed: 0,text,text_wo_punct,text_wo_stop
0,@AppleSupport causing the reply to be disregar...,AppleSupport causing the reply to be disregard...,AppleSupport causing reply disregarded tapped ...
1,@105835 Your business means a lot to us. Pleas...,105835 Your business means a lot to us Please ...,105835 Your business means lot us Please DM na...
2,@76328 I really hope you all change but I'm su...,76328 I really hope you all change but Im sure...,76328 I really hope change Im sure wont Becaus...
3,@105836 LiveChat is online at the moment - htt...,105836 LiveChat is online at the moment https...,105836 LiveChat online moment httpstcoSY94VtU8...
4,@VirginTrains see attached error message. I've...,VirginTrains see attached error message Ive tr...,VirginTrains see attached error message Ive tr...
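
Note that the nltk stopword list is all lower case, so capitalized tokens such as 'I' or 'Your' survive the filter in the output above. The sketch below shows one possible case-insensitive variant. To stay self-contained it uses a tiny stand-in set, STOPWORDS_DEMO, which is an assumption; in the notebook it would be the STOPWORDS set built from nltk. The function name remove_stopwords_ci is also hypothetical.

```python
# Tiny stand-in stopword set (an assumption, to keep the sketch self-contained);
# in the notebook this would be set(stopwords.words('english')).
STOPWORDS_DEMO = {"i", "you", "your", "a", "to", "the", "all", "but", "because"}

def remove_stopwords_ci(text):
    """Remove stopwords, comparing in lower case but keeping the original casing."""
    return " ".join(
        word for word in str(text).split() if word.lower() not in STOPWORDS_DEMO
    )

sample = "I really hope you all change but Im sure Because"
result = remove_stopwords_ci(sample)
print(result)  # really hope change Im sure
```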


# Removal of Frequent words


In the previous preprocessing step, we removed the stopwords based on language information. But if we have a domain-specific corpus, we might also have some frequent words which are of little importance to us.

So this step is to remove the frequent words in the given corpus. If we use something like tf-idf, this is largely taken care of, since words that appear in most documents receive a low weight.

Let us get the most common words and then remove them in the next step.

In [None]:
from collections import Counter
cnt = Counter()
for text in df["text_wo_stop"].values:
    for word in text.split():
        cnt[word] += 1

cnt.most_common(10)

[('I', 34),
 ('us', 25),
 ('DM', 19),
 ('help', 17),
 ('httpstcoGDrqU22YpT', 12),
 ('AppleSupport', 11),
 ('Thanks', 11),
 ('phone', 9),
 ('Hi', 8),
 ('get', 8)]

In [None]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["text_wo_stopfreq"] = df["text_wo_stop"].apply(lambda text: remove_freqwords(text))
df.head()

Unnamed: 0,text,text_wo_punct,text_wo_stop,text_wo_stopfreq
0,@AppleSupport causing the reply to be disregar...,AppleSupport causing the reply to be disregard...,AppleSupport causing reply disregarded tapped ...,causing reply disregarded tapped notification ...
1,@105835 Your business means a lot to us. Pleas...,105835 Your business means a lot to us Please ...,105835 Your business means lot us Please DM na...,105835 Your business means lot Please name zip...
2,@76328 I really hope you all change but I'm su...,76328 I really hope you all change but Im sure...,76328 I really hope change Im sure wont Becaus...,76328 really hope change Im sure wont Because ...
3,@105836 LiveChat is online at the moment - htt...,105836 LiveChat is online at the moment https...,105836 LiveChat online moment httpstcoSY94VtU8...,105836 LiveChat online moment httpstcoSY94VtU8...
4,@VirginTrains see attached error message. I've...,VirginTrains see attached error message Ive tr...,VirginTrains see attached error message Ive tr...,VirginTrains see attached error message Ive tr...
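
To see why tf-idf already discounts very frequent words, here is a minimal sketch of the plain inverse document frequency, idf = log(N / df), on a made-up corpus. This is the textbook form only; libraries typically use smoothed variants, so the exact numbers they produce will differ. The corpus and the function name are assumptions for illustration.

```python
import math

# Toy corpus (made up): "help" appears in every document, "refund" in only one.
docs = [
    "please help with my phone",
    "help my order is late",
    "help i need a refund",
]

def inverse_document_frequency(term, documents):
    """Plain idf = log(N / df). Assumes the term occurs in at least one document."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc.split())
    return math.log(n_docs / doc_freq)

idf_help = inverse_document_frequency("help", docs)
idf_refund = inverse_document_frequency("refund", docs)

print(idf_help)    # 0.0, a word found in every document carries no weight
print(idf_refund)  # log(3), roughly 1.0986
```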


# Removal of Rare words
This is very similar to the previous preprocessing step, but here we remove the rare words from the corpus.

In [None]:
# Drop the two columns which are no longer needed
df.drop(["text_wo_punct", "text_wo_stop"], axis=1, inplace=True)

n_rare_words = 10
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

df["text_wo_stopfreqrare"] = df["text_wo_stopfreq"].apply(lambda text: remove_rarewords(text))
df.head()

Unnamed: 0,text,text_wo_stopfreq,text_wo_stopfreqrare
0,@AppleSupport causing the reply to be disregar...,causing reply disregarded tapped notification ...,causing reply disregarded tapped notification ...
1,@105835 Your business means a lot to us. Pleas...,105835 Your business means lot Please name zip...,105835 Your business means lot Please name zip...
2,@76328 I really hope you all change but I'm su...,76328 really hope change Im sure wont Because ...,76328 really hope change Im sure wont Because ...
3,@105836 LiveChat is online at the moment - htt...,105836 LiveChat online moment httpstcoSY94VtU8...,105836 LiveChat online moment httpstcoSY94VtU8...
4,@VirginTrains see attached error message. I've...,VirginTrains see attached error message Ive tr...,VirginTrains see attached error message Ive tr...


We can also combine all the word lists (stopwords, frequent words and rare words) into a single set and remove them in one pass.
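
A minimal sketch of that combined approach is shown below. The corpus, the tiny stopword set and the choice of removing only the single most frequent and single rarest word are all assumptions made to keep the example short.

```python
from collections import Counter

# Toy corpus and a tiny stopword set, both made up for illustration.
texts = ["the phone is broken", "the phone is slow", "the screen is cracked"]
stopwords_demo = {"the", "is"}

# Count only the words that are not stopwords.
counts = Counter(
    word for text in texts for word in text.split() if word not in stopwords_demo
)
ranked = counts.most_common()
freq_words = {word for word, _ in ranked[:1]}   # the single most frequent word
rare_words = {word for word, _ in ranked[-1:]}  # the single least frequent word

# One combined set means one pass over each text instead of three.
REMOVE = stopwords_demo | freq_words | rare_words

def remove_words(text):
    """Drop every word that appears in the combined REMOVE set."""
    return " ".join(word for word in text.split() if word not in REMOVE)

cleaned = [remove_words(text) for text in texts]
print(cleaned)  # ['broken', 'slow', 'screen']
```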

# Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.

For example, if there are two words in the corpus, walks and walking, then stemming will strip the suffix to make them both walk. But in another example, with the two words console and consoling, the stemmer will remove the suffix and make them consol, which is not a proper English word.

There are several types of stemming algorithms available, and one of the best known and most widely used is the Porter stemmer. We can use the nltk package for this.

In [None]:
from nltk.stem.porter import PorterStemmer

# Drop the two columns
df.drop(["text_wo_stopfreq", "text_wo_stopfreqrare"], axis=1, inplace=True)

stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

df["text_stemmed"] = df["text"].apply(lambda text: stem_words(text))
df.head()

Unnamed: 0,text,text_stemmed
0,@AppleSupport causing the reply to be disregar...,@applesupport caus the repli to be disregard a...
1,@105835 Your business means a lot to us. Pleas...,@105835 your busi mean a lot to us. pleas dm y...
2,@76328 I really hope you all change but I'm su...,@76328 i realli hope you all chang but i'm sur...
3,@105836 LiveChat is online at the moment - htt...,@105836 livechat is onlin at the moment - http...
4,@VirginTrains see attached error message. I've...,@virgintrain see attach error message. i'v tri...


We can see that words like private and propose have the e at the end chopped off due to stemming. This is not intended. What can we do about that? We can use lemmatization in such cases.
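
The behaviour described above can be checked on individual words. The sketch below runs the same nltk PorterStemmer on the example words from this section; the word list is simply collected from the text.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Example words mentioned in this section.
words = ["walks", "walking", "console", "consoling", "private", "propose"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
# walks -> walk
# walking -> walk
# console -> consol
# consoling -> consol
# private -> privat
# propose -> propos
```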

Also, this Porter stemmer is for the English language. If we are working with other languages, we can use the Snowball stemmer. The languages supported by the Snowball stemmer are:

In [None]:
from nltk.stem.snowball import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')
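
As a brief sketch of how one of these languages would be used, the snippet below creates a German Snowball stemmer. The sample words are made up for illustration, and no expected output is shown because the exact stems depend on the Snowball rules for that language.

```python
from nltk.stem.snowball import SnowballStemmer

# "german" is one of the languages in the tuple above; the sample words
# are an assumption chosen only for this example.
german_stemmer = SnowballStemmer("german")
german_words = ["laufen", "läuft", "gelaufen"]

german_stems = [german_stemmer.stem(word) for word in german_words]
print(german_stems)
```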

# Lemmatization
Lemmatization is similar to stemming in that it reduces inflected words to a base form, but it differs in that it makes sure the root word (also called the lemma) belongs to the language.

As a result, it is generally slower than stemming. So depending on the speed requirement, we can choose to use either stemming or lemmatization.

Let us use the WordNetLemmatizer in nltk to lemmatize our sentences.

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df["text_lemmatized"] = df["text"].apply(lambda text: lemmatize_words(text))
df.head()

Unnamed: 0,text,text_stemmed,text_lemmatized
0,@AppleSupport causing the reply to be disregar...,@applesupport caus the repli to be disregard a...,@AppleSupport causing the reply to be disregar...
1,@105835 Your business means a lot to us. Pleas...,@105835 your busi mean a lot to us. pleas dm y...,@105835 Your business mean a lot to us. Please...
2,@76328 I really hope you all change but I'm su...,@76328 i realli hope you all chang but i'm sur...,@76328 I really hope you all change but I'm su...
3,@105836 LiveChat is online at the moment - htt...,@105836 livechat is onlin at the moment - http...,@105836 LiveChat is online at the moment - htt...
4,@VirginTrains see attached error message. I've...,@virgintrain see attach error message. i'v tri...,@VirginTrains see attached error message. I've...
