<a href="https://colab.research.google.com/github/bilaloumehdi/TP_NLP/blob/master/NLP_TP1_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Install and Connect To Kaggle**

In [1]:
#install kaggle
!pip install kaggle



In [3]:
# authenticate to kaggle (expects kaggle.json, the Kaggle API token file, in the current directory)
!mkdir -p ~/.kaggle
!cp ./kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


# **Download the dataset and unzip it**

In [4]:
!kaggle datasets download -d thoughtvector/customer-support-on-twitter


Downloading customer-support-on-twitter.zip to /content
 99% 167M/169M [00:01<00:00, 201MB/s]
100% 169M/169M [00:01<00:00, 157MB/s]


In [5]:
#unzip
!unzip customer-support-on-twitter.zip

Archive:  customer-support-on-twitter.zip
  inflating: sample.csv              
  inflating: twcs/twcs.csv           


# **Load the Data from sample.csv**



In [7]:
import pandas as pd
df = pd.read_csv('./sample.csv')

df.head()


| | tweet_id | author_id | inbound | created_at | text | response_tweet_id | in_response_to_tweet_id |
|---|---|---|---|---|---|---|---|
| 0 | 119237 | 105834 | True | Wed Oct 11 06:55:44 +0000 2017 | @AppleSupport causing the reply to be disregar... | 119236.0 | |
| 1 | 119238 | ChaseSupport | False | Wed Oct 11 13:25:49 +0000 2017 | @105835 Your business means a lot to us. Pleas... | | 119239.0 |
| 2 | 119239 | 105835 | True | Wed Oct 11 13:00:09 +0000 2017 | @76328 I really hope you all change but I'm su... | 119238.0 | |
| 3 | 119240 | VirginTrains | False | Tue Oct 10 15:16:08 +0000 2017 | @105836 LiveChat is online at the moment - htt... | 119241.0 | 119242.0 |
| 4 | 119241 | 105836 | True | Tue Oct 10 15:17:21 +0000 2017 | @VirginTrains see attached error message. I've... | 119243.0 | 119240.0 |


In [9]:
# extract the text column
df_text = df.text
df_text

0     @AppleSupport causing the reply to be disregar...
1     @105835 Your business means a lot to us. Pleas...
2     @76328 I really hope you all change but I'm su...
3     @105836 LiveChat is online at the moment - htt...
4     @VirginTrains see attached error message. I've...
                            ...                        
88    @105860 I wish Amazon had an option of where I...
89    They reschedule my shit for tomorrow https://t...
90    @105861 Hey Sara, sorry to hear of the issues ...
91    @Tesco bit of both - finding the layout cumber...
92    @105861 If that doesn't help please DM your fu...
Name: text, Length: 93, dtype: object

# **Preprocessing**

# 1. Convert the Text Column to a List of Lines

**Advantages:**

* **Lower Overhead**: The list holds only the tweet texts as plain Python strings, without the other columns or the index. The list itself is not necessarily smaller than the corresponding Series, so the saving matters mainly when the rest of the DataFrame is no longer needed.

* **Improved Iteration Speed**: Converting the column to a list moves the data from a tabular structure into a simple linear one. For sequential, element-by-element processing such as the loops below, iterating over a list is generally faster than iterating over a DataFrame.

* **Compatibility**: Some libraries and NLP tools expect a list of strings rather than a DataFrame. Converting to a list makes it easier to combine the different components of an NLP pipeline.



In [10]:
# convert lines to a list
df_text_list = df_text.values.tolist()

df_text_list[:5]

['@AppleSupport causing the reply to be disregarded and the tapped notification under the keyboard is opened😡😡😡',
 '@105835 Your business means a lot to us. Please DM your name, zip code and additional details about your concern. ^RR https://t.co/znUu1VJn9r',
 "@76328 I really hope you all change but I'm sure you won't! Because you don't have to!",
 '@105836 LiveChat is online at the moment - https://t.co/SY94VtU8Kq or contact 03331 031 031 option 1, 4, 3 (Leave a message) to request a call back',
 "@VirginTrains see attached error message. I've tried leaving a voicemail several times in the past week https://t.co/NxVZjlYx1k"]

# 2. Removing Punctuation, Links and Emojis

We use the **string** module to remove punctuation, the **re** module to remove links, and the **emoji** library to remove emojis.

In [12]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
Installing collected packages: emoji
Successfully installed emoji-2.8.0


In [19]:
# imports needed by the helper functions below
import re
import string
import emoji

def remove_punctuation(sentence):
  """Return the sentence without ASCII punctuation characters."""
  return ''.join(char for char in sentence if char not in string.punctuation)

def remove_emojis(sentence):
  """Return the sentence without the characters listed in emoji.EMOJI_DATA."""
  return ''.join(char for char in sentence if char not in emoji.EMOJI_DATA)

def remove_links(sentence):
  """Return the sentence without http(s) and www links."""
  return re.sub(r'https?\S+|www\.\S+', '', sentence)


In [20]:
# lowercase, then remove links, punctuation and emojis
import string
import emoji
import re

text_without_punc = []

for sentence in df_text_list:
  # convert to lowercase
  sentence = sentence.lower()

  # remove links first: once punctuation is stripped, URLs are harder to recognise
  sentence = remove_links(sentence)
  sentence = remove_punctuation(sentence)
  sentence = remove_emojis(sentence)

  text_without_punc.append(sentence)

# print the first 5 lines
print(text_without_punc[:5])

['applesupport causing the reply to be disregarded and the tapped notification under the keyboard is opened', '105835 your business means a lot to us please dm your name zip code and additional details about your concern rr ', '76328 i really hope you all change but im sure you wont because you dont have to', '105836 livechat is online at the moment   or contact 03331 031 031 option 1 4 3 leave a message to request a call back', 'virgintrains see attached error message ive tried leaving a voicemail several times in the past week ']
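
The order of link removal and punctuation removal matters. If punctuation is stripped first, a URL loses its `://` and `.` characters, so only a loose pattern still catches what is left of an `https` link, and a `www.` link is no longer recognisable at all. The sketch below compares both orders using only the standard library. The sample sentence is made up, `clean_links_first` and `clean_punct_first` are hypothetical helper names, and `str.translate` is shown as one possible alternative to the character-by-character loop, not as the notebook's method.

```python
import re
import string

# Same pattern as remove_links in the notebook.
URL_PATTERN = re.compile(r"https?\S+|www\.\S+")
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_links_first(sentence):
    """Lowercase, drop links, then drop punctuation."""
    sentence = URL_PATTERN.sub("", sentence.lower())
    return sentence.translate(PUNCT_TABLE)

def clean_punct_first(sentence):
    """Lowercase, drop punctuation, then try to drop links."""
    sentence = sentence.lower().translate(PUNCT_TABLE)
    return URL_PATTERN.sub("", sentence)

# Made-up example sentence, not taken from the dataset.
sample = "Check www.example.com or https://t.co/abc123 now!"

print(repr(clean_links_first(sample)))  # 'check  or  now'
print(repr(clean_punct_first(sample)))  # 'check wwwexamplecom or  now'
```

With punctuation removed first, the `www` link survives as the junk token `wwwexamplecom`, while removing links first leaves only the real words.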


# 3. Tokenization

In [21]:
import nltk
nltk.download('punkt')

tokens_list = []

for sentence in text_without_punc:

  tokenized_words = nltk.word_tokenize(sentence)
  tokens_list.append(tokenized_words)

print(tokens_list[:5])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[['applesupport', 'causing', 'the', 'reply', 'to', 'be', 'disregarded', 'and', 'the', 'tapped', 'notification', 'under', 'the', 'keyboard', 'is', 'opened'], ['105835', 'your', 'business', 'means', 'a', 'lot', 'to', 'us', 'please', 'dm', 'your', 'name', 'zip', 'code', 'and', 'additional', 'details', 'about', 'your', 'concern', 'rr'], ['76328', 'i', 'really', 'hope', 'you', 'all', 'change', 'but', 'im', 'sure', 'you', 'wont', 'because', 'you', 'dont', 'have', 'to'], ['105836', 'livechat', 'is', 'online', 'at', 'the', 'moment', 'or', 'contact', '03331', '031', '031', 'option', '1', '4', '3', 'leave', 'a', 'message', 'to', 'request', 'a', 'call', 'back'], ['virgintrains', 'see', 'attached', 'error', 'message', 'ive', 'tried', 'leaving', 'a', 'voicemail', 'several', 'times', 'in', 'the', 'past', 'week']]
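
Because punctuation has already been removed at this point, plain whitespace splitting gives a similar result for many sentences. The sketch below is a dependency-free approximation, assuming the cleaned text contains only words and spaces. It is not a drop-in replacement, since `nltk.word_tokenize` applies additional rules and the two can differ on some inputs. `whitespace_tokenize` is a hypothetical helper name.

```python
def whitespace_tokenize(sentence):
    """Split a cleaned sentence into tokens on runs of whitespace."""
    # str.split() without arguments treats any run of whitespace as one
    # separator and never yields empty strings, so the double spaces left
    # behind by link removal are harmless.
    return sentence.split()

# Fragment of a cleaned sentence from the output above.
cleaned = "105836 livechat is online at the moment   or contact 03331 031 031"
print(whitespace_tokenize(cleaned))
```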


# 4. Remove Stop Words

In [22]:
nltk.download('stopwords')
from nltk.corpus import stopwords

# a set gives fast membership tests
stop_words = set(stopwords.words('english'))

# keep only the tokens that are not stop words, sentence by sentence
text_without_stopwords = [
  [word for word in sentence if word not in stop_words]
  for sentence in tokens_list
]

print(text_without_stopwords[:5])

[['applesupport', 'causing', 'reply', 'disregarded', 'tapped', 'notification', 'keyboard', 'opened'], ['105835', 'business', 'means', 'lot', 'us', 'please', 'dm', 'name', 'zip', 'code', 'additional', 'details', 'concern', 'rr'], ['76328', 'really', 'hope', 'change', 'im', 'sure', 'wont', 'dont'], ['105836', 'livechat', 'online', 'moment', 'contact', '03331', '031', '031', 'option', '1', '4', '3', 'leave', 'message', 'request', 'call', 'back'], ['virgintrains', 'see', 'attached', 'error', 'message', 'ive', 'tried', 'leaving', 'voicemail', 'several', 'times', 'past', 'week']]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
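
Tokens such as 'im', 'wont' and 'dont' survive the filter in the output above. Punctuation was stripped earlier, so they no longer match apostrophe forms such as "don't" in a stop word list. One possible remedy is to add a punctuation-free variant of every stop word before filtering. The sketch below uses a small hand-written stop word set, which is an assumption for illustration only, since the notebook uses NLTK's list.

```python
import string

# Hypothetical miniature stop word list, for illustration only.
raw_stop_words = {"i", "you", "to", "have", "don't", "won't", "i'm"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Add the punctuation-free variant of every stop word ("don't" -> "dont").
normalized_stop_words = raw_stop_words | {
    word.translate(PUNCT_TABLE) for word in raw_stop_words
}

# Tokens of the third tweet after cleaning and tokenization.
tokens = ["i", "really", "hope", "you", "all", "change", "but", "im", "sure",
          "you", "wont", "because", "you", "dont", "have", "to"]

filtered = [token for token in tokens if token not in normalized_stop_words]
print(filtered)
# ['really', 'hope', 'all', 'change', 'but', 'sure', 'because']
```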


# 5. Lemmatisation

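Because the dataset is small in this case, we can afford lemmatisation, which is slower than stemming but maps each word to a valid dictionary form. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, which is why verb forms such as 'causing' are left unchanged and 'us' becomes 'u' in the output below.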

In [23]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer  =  WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [24]:
# lemmatize every token, sentence by sentence
lemmatized_text = [
  [lemmatizer.lemmatize(word) for word in sentence]
  for sentence in text_without_stopwords
]

print(lemmatized_text[:5])

[['applesupport', 'causing', 'reply', 'disregarded', 'tapped', 'notification', 'keyboard', 'opened'], ['105835', 'business', 'mean', 'lot', 'u', 'please', 'dm', 'name', 'zip', 'code', 'additional', 'detail', 'concern', 'rr'], ['76328', 'really', 'hope', 'change', 'im', 'sure', 'wont', 'dont'], ['105836', 'livechat', 'online', 'moment', 'contact', '03331', '031', '031', 'option', '1', '4', '3', 'leave', 'message', 'request', 'call', 'back'], ['virgintrains', 'see', 'attached', 'error', 'message', 'ive', 'tried', 'leaving', 'voicemail', 'several', 'time', 'past', 'week']]


# **Result**
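Finally, we join the tokens of each sentence back into a single string. This gives one preprocessed text per tweet, which is convenient for the vectorization step, since many vectorizers (for example scikit-learn's `CountVectorizer`) take one string per document by default.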

In [25]:

# join the tokens of each sentence with single spaces
result = [' '.join(sentence) for sentence in lemmatized_text]

# convert to a DataFrame only for a nicer display
df_result = pd.DataFrame(result, columns=['text'])
df_result

| | text |
|---|---|
| 0 | applesupport causing reply disregarded tapped ... |
| 1 | 105835 business mean lot u please dm name zip ... |
| 2 | 76328 really hope change im sure wont dont |
| 3 | 105836 livechat online moment contact 03331 03... |
| 4 | virgintrains see attached error message ive tr... |
| ... | ... |
| 88 | 105860 wish amazon option get shipped ups stor... |
| 89 | reschedule shit tomorrow |
| 90 | 105861 hey sara sorry hear issue ask lay speed... |
| 91 | tesco bit finding layout cumbersome removing i... |
