# Pre-processing of Chialing and Celine

# Objectives

In this notebook, our objective is to transform the dataset that we collected by web scraping Ankor reviews on Amazon into a more structured table focused on the titles and comments, so that we can analyse the wording using the nltk package.
This work aims to facilitate the future visualisation of the data in Power BI.

**Required packages:**
- **numpy**
- **pandas**
- **nltk**: state-of-the-art library for NLP analysis (https://github.com/nltk/nltk)

# 1. Settings

### 1.1 Import Libraries

In [None]:
import re
import string
from string import punctuation

import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize, WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

# Download the NLTK resources used below (stop word list, tokenizer models, WordNet)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


### 1.2 Import Data

In [None]:
df_date = pd.read_csv('https://raw.githubusercontent.com/emaleeeeeee/review_mini/main/REVIEWS%20RESCRAP.csv')
df_date

Unnamed: 0,Customer,Title,Comment,Rating,Date
0,EDGAR,Great projector,"Great item, worth every penny. I was skeptical...",5.0,
1,Jesse,A little projector that packs a punch!,Bought this for my wife for her birthday. She ...,5.0,
2,Brian,"Well thought-out, high-quality","Every time I use this projector, I like it mor...",5.0,
3,SSG Cross,Truly amazing piece of hardware. I'm blown away!,First the not so great: Initially it is kind o...,5.0,
4,chris,Awesome little projector,Awesome little projector! Google and Hulu does...,5.0,
...,...,...,...,...,...
1454,Sergii,not working with its own Twitch app,"Upset, not working with its own Twitch app. us...",2.0,2021-05-23 00:00:00.000
1455,Juan Pablo Navarro Almeida,Poco brillo y baja resolución,"No tiene buen brillo, no sirve para jugar play...",2.0,2018-12-21 00:00:00.000
1456,Sk jhonybasha,Badluck,One day the nebula get heated more and suddenl...,1.0,2020-01-08 00:00:00.000
1457,Len,Wont last long....,,3.0,2019-09-28 00:00:00.000


# 2. Data Cleaning

In the dataset that we have, we will be focussing on the "Title" and "Comment" columns as they contain words we can analyse.

We decided to keep the "Title" column as users have to summarise their reviews in only a few words. Therefore, it is where there will be the strongest words as opposed to the "Comment" column where opinions can be more diluted. (eg. Title: **Terrible** product, Comment: This product **does not work** correctly)

In [None]:
# Merge the Title and Comment columns into one
# (a review with a missing Title or Comment yields NaN and is removed by dropna below)
df_date['Review'] = df_date['Title'] + ' ' + df_date['Comment']

# Drop the columns that are no longer needed, then drop incomplete reviews
df = df_date.drop(columns=['Customer', 'Rating', 'Date', 'Title', 'Comment'])
df = df.dropna()
df

Unnamed: 0,Review
0,"Great projector Great item, worth every penny...."
1,A little projector that packs a punch! Bought ...
2,"Well thought-out, high-quality Every time I us..."
3,Truly amazing piece of hardware. I'm blown awa...
4,Awesome little projector Awesome little projec...
...,...
1453,"Not worth the price Easy to use, but it’s dark..."
1454,"not working with its own Twitch app Upset, not..."
1455,Poco brillo y baja resolución No tiene buen br...
1456,Badluck One day the nebula get heated more and...
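
The row count drops after the merge because concatenating a string with a missing value gives a missing value. The sketch below illustrates this on toy data (the two rows and the names `strict` and `lenient` are made up for the illustration), together with one possible alternative that keeps title-only reviews. It is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the second review has a title but no comment
toy = pd.DataFrame({
    "Title": ["Great projector", "Wont last long...."],
    "Comment": ["Worth every penny.", np.nan],
})

# String + NaN gives NaN, so dropna() would discard the second review
strict = toy["Title"] + " " + toy["Comment"]

# Filling missing parts with "" first keeps the title on its own
lenient = (toy["Title"].fillna("") + " " + toy["Comment"].fillna("")).str.strip()

print(strict.isna().tolist())
print(lenient.tolist())
```

Whether title-only reviews should be kept depends on the analysis; here they are dropped.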


To facilitate future modelling, we decided to restructure the dataframe so that it has 1 word per line instead of 1 comment per line. The new dataset and the old one will be connected via an ID column that we create here.

In [None]:
# Start the index at 1 instead of 0 and expose it as an ID column for each review
df.index = np.arange(1, len(df) + 1)
df['ID'] = df.index

# Split each review into one word per row, keeping the ID of the originating review
new_df = pd.DataFrame(df['Review'].str.split(' ').tolist(), index=df['ID']).stack()

# Drop the word-position level and turn the ID level into a regular column
new_df = new_df.reset_index(level=1, drop=True).reset_index()

# Rename the columns
new_df.columns = ['ID', 'Word']
new_df

Unnamed: 0,ID,Word
0,1,Great
1,1,projector
2,1,Great
3,1,"item,"
4,1,worth
...,...,...
107366,1441,services
107367,1441,and
107368,1441,fast
107369,1441,turn
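
The same one-word-per-row reshaping can also be sketched with `explode`, which may be easier to read. The toy dataframe below is hypothetical. Note that `str.split()` without an argument splits on any whitespace and drops empty strings, which differs slightly from splitting on a single space.

```python
import pandas as pd

# Toy reviews (hypothetical) with their ID column
toy = pd.DataFrame({"ID": [1, 2], "Review": ["Great projector", "Not worth it"]})

# Turn each review into a list of words, then give every word its own row
words = (
    toy.assign(Word=toy["Review"].str.split())
       .explode("Word")[["ID", "Word"]]
       .reset_index(drop=True)
)
print(words)
```

Each word keeps the ID of its review, so the long table can be joined back to the original one.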


The last step will be to clean the text:
- Every word needs to be in lower case for normalisation.
- Punctuation needs to be removed as it is not relevant to our analysis and would interfere with the next steps.
- Negations are very important for our study as they allow us to differentiate between good and bad reviews. To avoid problems with variants of the apostrophe in contractions and with words that are not correctly separated (don't --> don' + 't), we will expand these words into their original form (don't --> do + not).

In [None]:
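def text_cleaning(text):
    # Lower-case the text for normalisation
    text = str(text).lower()

    # Expand contractions; ['’] matches both the straight and the typographic apostrophe,
    # and \b avoids touching words that merely start with a quote mark
    text = re.sub(r"n['’]t\b", " not", text)
    text = re.sub(r"['’]re\b", " are", text)
    text = re.sub(r"['’]s\b", " is", text)
    text = re.sub(r"['’]d\b", " would", text)
    text = re.sub(r"['’]ll\b", " will", text)
    text = re.sub(r"['’]t\b", " not", text)
    text = re.sub(r"['’]ve\b", " have", text)
    text = re.sub(r"['’]m\b", " am", text)

    # Remove the remaining ASCII punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    return text

# Clean every word once; df_text is reused by the tokenisation step
df_text = new_df['Word'].apply(text_cleaning)
new_df['Word'] = df_text
new_df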

Unnamed: 0,ID,Word
0,1,great
1,1,projector
2,1,great
3,1,item
4,1,worth
...,...,...
107366,1441,services
107367,1441,and
107368,1441,fast
107369,1441,turn
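
Apostrophe variants matter because reviews mix the straight apostrophe (') and the typographic one (’). The small check below is only an illustration: the helper `strip_punct` and the two patterns are hypothetical names. It shows that a pattern matching only the typographic apostrophe turns "don't" into "dont" and loses the negation.

```python
import re
import string

def strip_punct(text):
    # Remove ASCII punctuation (the typographic apostrophe is not part of it)
    return text.translate(str.maketrans("", "", string.punctuation))

curly_only = re.compile(r"n’t\b")     # matches the typographic apostrophe only
both = re.compile(r"n['’]t\b")        # matches both variants

samples = ["don't", "don’t"]
curly_result = [strip_punct(curly_only.sub(" not", w)) for w in samples]
both_result = [strip_punct(both.sub(" not", w)) for w in samples]

print(curly_result)
print(both_result)
```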


# 3. Preprocessing

Several steps will be implemented here:
- Tokenisation
- Lemmatisation --> Reduce a word to its dictionary form (eg: giving --> give)
- Stemming --> Cut the ending of a word to reach its stem (eg: services --> servic)
- Define stopwords --> Define words that are judged not useful for the analysis (eg: and, a, an, ...)

### 3.1 Tokenisation

In [None]:
#Tokenize words

def tokenizer(token_text):
    token_text = word_tokenize(token_text)
    return token_text

df_text=df_text.apply(tokenizer)
df_text

0             [great]
1         [projector]
2             [great]
3              [item]
4             [worth]
             ...     
107366     [services]
107367          [and]
107368         [fast]
107369         [turn]
107370         [over]
Name: Word, Length: 107371, dtype: object

### 3.2 Lemmatisation and Stemming

Several words can share the same root. It is therefore useful to convert them to a common form so that each word is counted only once.

Eg: giving, gave and given all have the same root: "give". Our analysis does not need to catalogue all of those forms. Instead of counting each of them separately, this step gathers them under the form "give" to facilitate the analysis.

One of the methods used here is **Stemming**, which cuts the last letters of a word until the stem is reached. This method is rather effective on most English words. However, the resulting stem may not be a real word.

Example of stemming:
- boater, boating, boats --> boat
- service --> servic

Here, we are using PorterStemmer for the stemming (https://tartarus.org/martin/PorterStemmer/)

We are also using **Lemmatisation** because, in contrast to stemming, lemmatisation looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words. Example:
- was, been, am --> be
- mice --> mouse

The reason why we are using both methods is that lemmatisers require a lot more knowledge about the structure of a language than stemmers. As a result, a lemmatiser often returns the input word unchanged because it did not succeed in finding the root of the word. For instance, NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as "giving" are typically left as they are.
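
To get a feel for what the stemmer returns, a few words can be stemmed in isolation. This is a small sketch using NLTK's `PorterStemmer`; the word list is arbitrary. Suffix stripping generally does not map irregular forms such as "gave" back to "give", which is one reason to keep a lemmatised column as well.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Arbitrary sample words: two plurals and two verb forms
words = ["services", "apps", "giving", "gave"]
stems = {word: ps.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "-->", stem)
```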

In [None]:
# Define and apply the lemmatiser and the stemmer

wnl = WordNetLemmatizer()
ps = PorterStemmer()

def lemmatize(s):
    # Lemmatise every token of a tokenised cell
    return [wnl.lemmatize(word) for word in s]

def stemm(s):
    # Stem every token of a tokenised cell
    return [ps.stem(word) for word in s]

new_df['Lemme'] = df_text.apply(lemmatize)
new_df['Stemme'] = df_text.apply(stemm)
new_df

Unnamed: 0,ID,Word,Lemme,Stemme
0,1,great,[great],[great]
1,1,projector,[projector],[projector]
2,1,great,[great],[great]
3,1,item,[item],[item]
4,1,worth,[worth],[worth]
...,...,...,...,...
107366,1441,services,[service],[servic]
107367,1441,and,[and],[and]
107368,1441,fast,[fast],[fast]
107369,1441,turn,[turn],[turn]
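
The Lemme and Stemme columns hold Python lists, which are written to CSV as text such as `['servic']`. If plain strings are preferred for the visualisation tool, the lists can be joined first. This is an optional sketch on toy data; the column name `Stemme_text` is hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical) mimicking the list-valued Stemme column
toy = pd.DataFrame({
    "Word": ["services", "do not"],
    "Stemme": [["servic"], ["do", "not"]],
})

# Join the tokens of each list into one space-separated string
toy["Stemme_text"] = toy["Stemme"].str.join(" ")
print(toy)
```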


### 3.3 Stopping

In a sentence, some words like "for" or "and" add little meaning of their own. These words are called "**stop words**" and must be identified.

In our case, we only want to flag them instead of removing them entirely because they can be useful for statistical purposes. Moreover, to study what customers don't like about the product, negative reviews listing the features that don't work are very important. Negations such as the word "not" are therefore essential to our study and must be kept.

Here we used the nltk list of stopwords and added some of our own.

In [None]:
# Define the stop words: our own additions, the NLTK English list, and punctuation marks
our_stopwords = ['also', 'would', 'one']

stopwords1 = our_stopwords + list(stopwords.words('english')) + list(punctuation) + ['’', ',']
print(len(stopwords1), stopwords1, sep='\n\n')

216

['also', 'would', 'one', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', '

In [None]:
new_df['stopword']=new_df['Word'].isin(stopwords1)
new_df

Unnamed: 0,ID,Word,Lemme,Stemme,stopword
0,1,great,[great],[great],False
1,1,projector,[projector],[projector],False
2,1,great,[great],[great],False
3,1,item,[item],[item],False
4,1,worth,[worth],[worth],False
...,...,...,...,...,...
107366,1441,services,[service],[servic],False
107367,1441,and,[and],[and],True
107368,1441,fast,[fast],[fast],False
107369,1441,turn,[turn],[turn],False


In [None]:
new_df[new_df['ID']==41]

Unnamed: 0,ID,Word,Lemme,Stemme,stopword
3450,41,love,[love],[love],False
3451,41,it,[it],[it],True
3452,41,it,[it],[it],True
3453,41,is,[is],[is],True
3454,41,difficult,[difficult],[difficult],False
3455,41,to,[to],[to],True
3456,41,add,[add],[add],False
3457,41,apps,[apps],[app],False
3458,41,but,[but],[but],True
3459,41,love,[love],[love],False
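
Because stop words are flagged rather than deleted, a later step can filter them while still keeping negations. The sketch below shows one way to do that. The short `stop_set`, the `negations` set and the `negation` column are illustrative assumptions, not part of this notebook's output.

```python
import pandas as pd

# Illustrative subset only; the NLTK English list is much longer
stop_set = {"it", "is", "to", "but", "not", "no"}
# Hypothetical list of negations we want to keep despite being stop words
negations = {"not", "no", "nor"}

toy = pd.DataFrame({"Word": ["it", "is", "not", "difficult"]})
toy["stopword"] = toy["Word"].isin(stop_set)
toy["negation"] = toy["Word"].isin(negations)

# Keep every non-stop word, plus any negation
kept = toy[~toy["stopword"] | toy["negation"]]
print(kept)
```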


In [None]:
#2021/10/17
new_df.to_csv('Preprocessing.csv',encoding='utf-8-sig')
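
The `utf-8-sig` encoding writes a byte order mark at the start of the file, which helps some spreadsheet tools detect UTF-8. By default `to_csv` also writes the index as an extra unnamed column, and passing `index=False` avoids that. A small round-trip sketch on toy data, written to a temporary file (the file name is arbitrary):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"ID": [1, 1], "Word": ["great", "projector"]})

# Write to a temporary location without the index column
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
toy.to_csv(path, index=False, encoding="utf-8-sig")

# The file starts with the UTF-8 byte order mark
with open(path, "rb") as f:
    has_bom = f.read().startswith(b"\xef\xbb\xbf")

# Reading with the same encoding restores the original columns
round_trip = pd.read_csv(path, encoding="utf-8-sig")
print(has_bom, list(round_trip.columns))
```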
