**LOADING THE DATA**

In [2]:
import pandas as pd

In [3]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [4]:
!ls /content/drive/MyDrive/twcs.csv

/content/drive/MyDrive/twcs.csv


In [5]:
df=pd.read_csv("/content/drive/MyDrive/twcs.csv")

In [6]:
df

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4,6.0
...,...,...,...,...,...,...,...
2811769,2987947,sprintcare,False,Wed Nov 22 08:43:51 +0000 2017,"@823869 Hey, we'd be happy to look into this f...",,2987948.0
2811770,2987948,823869,True,Wed Nov 22 08:35:16 +0000 2017,@115714 wtf!? I’ve been having really shitty s...,2987947,
2811771,2812240,121673,True,Thu Nov 23 04:13:07 +0000 2017,@143549 @sprintcare You have to go to https://...,,2812239.0
2811772,2987949,AldiUK,False,Wed Nov 22 08:31:24 +0000 2017,"@823870 Sounds delicious, Sarah! 😋 https://t.c...",,2987950.0


In [10]:
df.shape

(2811774, 7)

In [7]:
corpus = df["text"]

In [8]:
corpus

0          @115712 I understand. I would like to assist y...
1              @sprintcare and how do you propose we do that
2          @sprintcare I have sent several private messag...
3          @115712 Please send us a Private Message so th...
4                                         @sprintcare I did.
                                 ...                        
2811769    @823869 Hey, we'd be happy to look into this f...
2811770    @115714 wtf!? I’ve been having really shitty s...
2811771    @143549 @sprintcare You have to go to https://...
2811772    @823870 Sounds delicious, Sarah! 😋 https://t.c...
2811773    @AldiUK  warm sloe gin mince pies with ice cre...
Name: text, Length: 2811774, dtype: object

**LOWER CASING**

Convert all the text in the corpus to lowercase.

In [12]:
corpus = corpus.str.lower()

In [13]:
corpus

0          @115712 i understand. i would like to assist y...
1              @sprintcare and how do you propose we do that
2          @sprintcare i have sent several private messag...
3          @115712 please send us a private message so th...
4                                         @sprintcare i did.
                                 ...                        
2811769    @823869 hey, we'd be happy to look into this f...
2811770    @115714 wtf!? i’ve been having really shitty s...
2811771    @143549 @sprintcare you have to go to https://...
2811772    @823870 sounds delicious, sarah! 😋 https://t.c...
2811773    @aldiuk  warm sloe gin mince pies with ice cre...
Name: text, Length: 2811774, dtype: object

**REMOVING THE HTML TAGS**

Remove any HTML tags present in the text.

In [14]:
import re

In [15]:
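# Compile once; the non-greedy match removes everything from '<' to the next '>'.
HTML_TAG_PATTERN = re.compile(r'<.*?>')

def remove_html_tags(text):
    return HTML_TAG_PATTERN.sub("", text)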

In [16]:
corpus = corpus.apply(remove_html_tags)

In [17]:
corpus[70]

'@chipotletweets no diet coke and a literal bone this boorito was extra spooky! https://t.co/c4occtduct'
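
The pattern `<.*?>` removes everything between any `<` and the next `>`, so it can also delete ordinary text that happens to contain both characters. The sketch below compares it with a stricter pattern that requires a tag name right after `<`. The stricter pattern and the name `remove_html_tags_strict` are illustrative assumptions, not part of the original notebook, and the stricter pattern does not cover HTML comments or doctype declarations.

```python
import re

LOOSE_TAG = re.compile(r"<.*?>")               # pattern used in the notebook
STRICT_TAG = re.compile(r"</?[a-zA-Z][^>]*>")  # assumed alternative: '<' must be followed by a tag name

def remove_html_tags_strict(text):
    """Remove only spans that look like opening or closing HTML tags."""
    return STRICT_TAG.sub("", text)

sample_html = "<b>hello</b> world"
sample_text = "price < 5 and qty > 2"

print(LOOSE_TAG.sub("", sample_text))        # the loose pattern also eats '< 5 and qty >'
print(remove_html_tags_strict(sample_text))  # the strict pattern leaves plain text alone
print(remove_html_tags_strict(sample_html))  # real tags are still removed
```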

**REMOVING URLS**

Remove all URLs from the text using a regular expression wrapped in a function.

In [18]:
# Compile once; matches http(s):// links and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_url(text):
    return URL_PATTERN.sub("", text)

In [19]:
corpus = corpus.apply(remove_url)

In [20]:
corpus

0          @115712 i understand. i would like to assist y...
1              @sprintcare and how do you propose we do that
2          @sprintcare i have sent several private messag...
3          @115712 please send us a private message so th...
4                                         @sprintcare i did.
                                 ...                        
2811769    @823869 hey, we'd be happy to look into this f...
2811770    @115714 wtf!? i’ve been having really shitty s...
2811771    @143549 @sprintcare you have to go to  and ask...
2811772                  @823870 sounds delicious, sarah! 😋 
2811773    @aldiuk  warm sloe gin mince pies with ice cre...
Name: text, Length: 2811774, dtype: object
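
Deleting a URL leaves the surrounding whitespace behind, which is why rows 2811771 and 2811772 above contain a double space and a trailing space. A minimal sketch of collapsing that whitespace follows; the helper name `normalize_whitespace` is an assumption. In this notebook the later chat-word step happens to do the same thing implicitly, because it splits and re-joins each text.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace to a single space and trim both ends."""
    # str.split() with no arguments splits on any whitespace run
    # and drops empty strings, so re-joining yields single spaces.
    return " ".join(text.split())

print(normalize_whitespace("you have to go to  and ask"))
print(normalize_whitespace("sounds delicious, sarah! 😋 "))
```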

**REMOVING PUNCTUATION**

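Remove the ASCII punctuation marks listed in `string.punctuation` from the text. Non-ASCII marks such as the curly apostrophe (’) are not in that set, so they remain in the output below.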

In [21]:
import string

In [22]:
# Build the translation table once; it deletes every character in string.punctuation.
PUNCT_TRANSLATOR = str.maketrans('', '', string.punctuation)

def remove_punctuation(text):
    return text.translate(PUNCT_TRANSLATOR)

In [23]:
corpus = corpus.apply(remove_punctuation)

In [24]:
corpus

0          115712 i understand i would like to assist you...
1               sprintcare and how do you propose we do that
2          sprintcare i have sent several private message...
3          115712 please send us a private message so tha...
4                                           sprintcare i did
                                 ...                        
2811769    823869 hey wed be happy to look into this for ...
2811770    115714 wtf i’ve been having really shitty serv...
2811771    143549 sprintcare you have to go to  and ask t...
2811772                     823870 sounds delicious sarah 😋 
2811773    aldiuk  warm sloe gin mince pies with ice crea...
Name: text, Length: 2811774, dtype: object
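
Row 2811770 still contains `i’ve` because `string.punctuation` covers ASCII characters only. One possible way to also strip Unicode punctuation is sketched below, assuming it is acceptable to delete every character whose Unicode category starts with `P`. The function name `remove_punctuation_unicode` is hypothetical. Emoji belong to a symbol category, so they are left untouched.

```python
import string
import sys
import unicodedata

# Translation table built once: every Unicode punctuation code point maps to None (deleted).
_PUNCT_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
}
# string.punctuation also contains symbols such as '$' and '+', which are not
# in a 'P' category, so add them explicitly to match the original behaviour.
_PUNCT_TABLE.update({ord(ch): None for ch in string.punctuation})

def remove_punctuation_unicode(text):
    """Delete ASCII and Unicode punctuation; other characters are kept."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation_unicode("wtf!? i’ve been"))
print(remove_punctuation_unicode("sarah! 😋"))
```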

**CONVERTING THE CHATWORDS**

The abbreviations present in the text are converted into their long forms.

In [25]:
chat_words = {
"AFAIK":"As Far As I Know",
"AFK":"Away From Keyboard",
"ASAP":"As Soon As Possible",
"ATK":"At The Keyboard",
"ATM":"At The Moment",
"A3":"Anytime, Anywhere, Anyplace",
"BAK":"Back At Keyboard",
"BBL":"Be Back Later",
"BBS":"Be Back Soon",
"BFN":"Bye For Now",
"B4N":"Bye For Now",
"BRB":"Be Right Back",
"BRT":"Be Right There",
"BTW":"By The Way",
"B4":"Before",
"CU":"See You",
"CUL8R":"See You Later",
"CYA":"See You",
"FAQ":"Frequently Asked Questions",
"FC":"Fingers Crossed",
"FWIW":"For What It's Worth",
"FYI":"For Your Information",
"GAL":"Get A Life",
"GG":"Good Game",
"GN":"Good Night",
"GMTA":"Great Minds Think Alike",
"GR8":"Great",
"G9":"Genius",
"IC":"I See",
"ICQ":"I Seek you",
"ILU": "I Love You",
"IMHO":"In My Honest Opinion",
"IMO":"In My Opinion",
"IOW":"In Other Words",
"IRL":"In Real Life",
"KISS":"Keep It Simple Stupid",
"LDR":"Long Distance Relationship",
"LMAO":"Laugh My A.. Off",
"LOL":"Laughing Out Loud",
"LTNS":"Long Time No See",
"L8R":"Later",
"MTE":"My Thoughts Exactly",
"M8":"Mate",
"NRN":"No Reply Necessary",
"OIC":"Oh I See",
"PITA":"Pain In The A..",
"PRT":"Party",
"PRW":"Parents Are Watching",
"ROFL":"Rolling On The Floor Laughing",
"ROFLOL":"Rolling On The Floor Laughing Out Loud",
"ROTFLMAO":"Rolling On The Floor Laughing My A.. Off",
"SK8":"Skate",
"STATS":"Your sex and age",
"ASL":"Age, Sex, Location",
"THX":"Thank You",
"TTFN":"Ta-Ta For Now",
"TTYL":"Talk To You Later",
"U":"You",
"U2":"You Too",
"U4E":"Yours For Ever",
"WB":"Welcome Back",
"WTF":"What The F...",
"WTG":"Way To Go",
"WUF":"Where Are You From",
"W8":"Wait",
"TFW":"That feeling when",
"MFW":"My face when",
"MRW":"My reaction when",
"IFYP":"I feel your pain",
"LOL":"Laughing out loud",
"TNTL":"Trying not to laugh",
"JK":"Just kidding",
"IDC":"I don’t care",
"ILY":"I love you",
"IMU":"I miss you",
"ADIH":"Another day in hell",
"ZZZ":"Sleeping, bored, tired",
"WYWH":"Wish you were here",
"TIME":"Tears in my eyes",
"BAE":"Before anyone else",
"FIMH":"Forever in my heart",
"BSAAW":"Big smile and a wink",
"BWL":"Bursting with laughter",
"LMAO":"Laughing my a.. off",
"BFF":"Best friends forever",
"CSL":"Can’t stop laughing",
}

In [26]:
def chat_conversion(text):
    new_text=[]
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [27]:
corpus = corpus.apply(chat_conversion)

In [28]:
corpus = corpus.str.lower()

In [29]:
corpus

0          115712 i understand i would like to assist you...
1               sprintcare and how do you propose we do that
2          sprintcare i have sent several private message...
3          115712 please send us a private message so tha...
4                                           sprintcare i did
                                 ...                        
2811769    823869 hey wed be happy to look into this for ...
2811770    115714 what the f... i’ve been having really s...
2811771    143549 sprintcare you have to go to and ask th...
2811772                      823870 sounds delicious sarah 😋
2811773    aldiuk warm sloe gin mince pies with ice cream...
Name: text, Length: 2811774, dtype: object
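
Because chat words are expanded after punctuation removal, punctuation inside the expansions is reintroduced (row 2811770 now contains `what the f...`). A hedged sketch of one way around this is to normalize the expansion values once, before applying them. The names `sample_chat_words`, `clean_chat_words` and `chat_conversion_clean` are illustrative, and the dictionary is only a small subset of the one above.

```python
import string

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

# Small illustrative subset of the chat_words dictionary.
sample_chat_words = {
    "WTF": "What The F...",
    "FWIW": "For What It's Worth",
}

# Lowercase each expansion and strip ASCII punctuation so the
# expanded text matches the already-cleaned corpus.
clean_chat_words = {
    key: value.translate(_STRIP_PUNCT).lower()
    for key, value in sample_chat_words.items()
}

def chat_conversion_clean(text, mapping=clean_chat_words):
    """Replace each whitespace-separated word that is a known abbreviation."""
    return " ".join(mapping.get(w.upper(), w) for w in text.split())

print(chat_conversion_clean("wtf fwiw it works"))
```

With the values normalized up front, the second `str.lower()` pass over the whole corpus would no longer be needed.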

**REMOVING STOPWORDS**

Stopwords are commonly used English words that occur frequently but carry little meaning on their own, so they are removed here. The word 'not' is kept, because it carries important meaning when a sentence is negative.

In [30]:
import nltk
nltk.download("all")

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

In [31]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stopword set once instead of on every call.
stop_words = set(stopwords.words('english'))
stop_words.discard('not')  # Keep 'not' so that negation is preserved

def remove_stopwords(text):
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return ' '.join(filtered_text)


In [32]:
corpus = corpus.apply(remove_stopwords)

In [33]:
corpus

0          115712 understand would like assist would need...
1                                         sprintcare propose
2          sprintcare sent several private messages one r...
3          115712 please send us private message assist c...
4                                                 sprintcare
                                 ...                        
2811769    823869 hey wed happy look please send us direc...
2811770    115714 f ... ’ really shitty service day get s...
2811771    143549 sprintcare go ask add hulu service acco...
2811772                      823870 sounds delicious sarah 😋
2811773    aldiuk warm sloe gin mince pies ice cream best...
Name: text, Length: 2811774, dtype: object
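
Since punctuation was removed earlier, contractions in the corpus have lost their apostrophes (for example `we'd` became `wed` in row 2811769), so any contracted entries in a stopword list no longer match them. The sketch below shows one way to add punctuation-free variants to the stopword set. The word list is a hypothetical sample standing in for `stopwords.words('english')`, and `str.split` stands in for `word_tokenize` to keep the example self-contained.

```python
import string

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

# Hypothetical sample list; in the notebook this would come from NLTK.
sample_stopwords = ["i", "to", "do", "don't", "you're", "not"]

def build_stopword_set(words, keep=("not",)):
    """Return the stopwords plus their punctuation-free forms, minus the kept words."""
    normalized = set(words)
    normalized |= {w.translate(_STRIP_PUNCT) for w in words}
    return normalized - set(keep)

def remove_stopwords_simple(text, stop_words):
    """Drop stopwords from whitespace-separated text."""
    return " ".join(w for w in text.split() if w not in stop_words)

stop_set = build_stopword_set(sample_stopwords)
print(remove_stopwords_simple("i dont know youre not to blame", stop_set))
```

A stripped contraction can collide with a real word (`wed` may also mean Wednesday), so whether this trade-off is acceptable depends on the task.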

**EMOJI CONVERSION**

The emojis present in the corpus are converted into text form so that further preprocessing can be carried out.

In [34]:
!pip install emoji
import emoji

Collecting emoji
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Installing collected packages: emoji
Successfully installed emoji-2.10.1


In [35]:
def convert_emoji(text):
    # demojize replaces each emoji with its ':name:' text alias rather than removing it
    clean_text=emoji.demojize(text)
    return clean_text

In [36]:
corpus = corpus.apply(convert_emoji)

In [37]:
corpus

0          115712 understand would like assist would need...
1                                         sprintcare propose
2          sprintcare sent several private messages one r...
3          115712 please send us private message assist c...
4                                                 sprintcare
                                 ...                        
2811769    823869 hey wed happy look please send us direc...
2811770    115714 f ... ’ really shitty service day get s...
2811771    143549 sprintcare go ask add hulu service acco...
2811772    823870 sounds delicious sarah :face_savoring_f...
2811773    aldiuk warm sloe gin mince pies ice cream best...
Name: text, Length: 2811774, dtype: object
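
The converted emoji are wrapped in colons (`:face_savoring_food:`), and those colons become punctuation in the corpus again. A minimal sketch of unwrapping the aliases with a regular expression follows. The helper name `unwrap_emoji_aliases` is an assumption, and the pattern assumes that all other colons were already removed by the punctuation step. `emoji.demojize` also accepts a `delimiters` argument, which may be a more direct option.

```python
import re

# Matches ':alias:' where the alias contains no colon or whitespace.
_ALIAS = re.compile(r":([^:\s]+):")

def unwrap_emoji_aliases(text):
    """Replace ':alias:' with ' alias ' and collapse the resulting whitespace."""
    return " ".join(_ALIAS.sub(r" \1 ", text).split())

print(unwrap_emoji_aliases("sounds delicious sarah :face_savoring_food:"))
print(unwrap_emoji_aliases(":thumbs_up::red_heart:"))
```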

**TOKENIZATION**

Tokenization is the process of splitting a sequence of text into smaller parts, known as tokens. Tokens can be as small as characters or as long as words; here the corpus is split into word-level tokens.

In [38]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [39]:
corpus = corpus.apply(word_tokenize)

In [40]:
corpus

0          [115712, understand, would, like, assist, woul...
1                                      [sprintcare, propose]
2          [sprintcare, sent, several, private, messages,...
3          [115712, please, send, us, private, message, a...
4                                               [sprintcare]
                                 ...                        
2811769    [823869, hey, wed, happy, look, please, send, ...
2811770    [115714, f, ..., ’, really, shitty, service, d...
2811771    [143549, sprintcare, go, ask, add, hulu, servi...
2811772    [823870, sounds, delicious, sarah, :, face_sav...
2811773    [aldiuk, warm, sloe, gin, mince, pies, ice, cr...
Name: text, Length: 2811774, dtype: object
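
To make the difference between word-level and character-level tokens concrete, here is a small sketch on one row of the corpus. It uses `str.split` as a simplified stand-in for `word_tokenize`, which is an assumption made to keep the example free of NLTK data downloads.

```python
sample = "sprintcare i did"

# Word-level tokens: split on whitespace.
word_tokens = sample.split()

# Character-level tokens: every non-space character is its own token.
char_tokens = list(sample.replace(" ", ""))

print(word_tokens)
print(len(char_tokens), char_tokens[:3])
```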

**STEMMING AND LEMMATIZATION**

Stemming strips the last few characters from a word using heuristic rules, so the result is not always a real word (for example, 'propose' becomes 'propos' in the output below). Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma, and therefore takes more time to execute. Since our data is very large, we perform stemming here.

In [41]:
from nltk.stem import PorterStemmer

# Create the stemmer once and reuse it for every row.
stemmer = PorterStemmer()

def stemming(text):
    # 'text' is a list of tokens produced by word_tokenize
    return [stemmer.stem(word) for word in text]

In [42]:
corpus = corpus.apply(stemming)

In [43]:
corpus

0          [115712, understand, would, like, assist, woul...
1                                        [sprintcar, propos]
2          [sprintcar, sent, sever, privat, messag, one, ...
3          [115712, pleas, send, us, privat, messag, assi...
4                                                [sprintcar]
                                 ...                        
2811769    [823869, hey, wed, happi, look, pleas, send, u...
2811770    [115714, f, ..., ’, realli, shitti, servic, da...
2811771    [143549, sprintcar, go, ask, add, hulu, servic...
2811772    [823870, sound, delici, sarah, :, face_savorin...
2811773    [aldiuk, warm, sloe, gin, minc, pie, ice, crea...
Name: text, Length: 2811774, dtype: object
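
On a corpus of this size the same words are stemmed many times over. A possible speed-up, sketched here under the assumption that the stem of a word depends only on the word itself, is to memoize the stem function. The helper `make_cached_stemmer` is hypothetical, and the toy stem function below only exists so the example runs without NLTK.

```python
from functools import lru_cache

def make_cached_stemmer(stem_fn, maxsize=None):
    """Wrap a word-level stem function so each distinct word is stemmed only once."""
    cached = lru_cache(maxsize=maxsize)(stem_fn)

    def stem_tokens(tokens):
        return [cached(tok) for tok in tokens]

    return stem_tokens

# Toy stand-in for a real stemmer, used only for this demonstration.
calls = []
def toy_stem(word):
    calls.append(word)
    return word[:-1] if word.endswith("e") else word

stem_tokens = make_cached_stemmer(toy_stem)
print(stem_tokens(["sprintcare", "propose", "sprintcare"]))
print(len(calls))  # number of times toy_stem actually ran

# Intended use in the notebook (requires nltk):
#   stem_tokens = make_cached_stemmer(PorterStemmer().stem)
#   corpus = corpus.apply(stem_tokens)
```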