# Natural Language Processing

### Data Preprocessing

In [1]:
# Importing the dataset
dataset_original = read.delim('Restaurant_Reviews.tsv', 
                    quote = '',
                    stringsAsFactors = FALSE)

In [2]:
head(dataset_original, 10)

| Review | Liked |
| --- | --- |
| Wow... Loved this place. | 1 |
| Crust is not good. | 0 |
| Not tasty and the texture was just nasty. | 0 |
| Stopped by during the late May bank holiday off Rick Steve recommendation and loved it. | 1 |
| The selection on the menu was great and so were the prices. | 1 |
| Now I am getting angry and I want my damn pho. | 0 |
| Honeslty it didn't taste THAT fresh.) | 0 |
| The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer. | 0 |
| The fries were great too. | 1 |
| A great touch. | 1 |


In [3]:
dim(dataset_original)

### Cleaning the texts

In [4]:
# install.packages('tm')
library(tm)
corpus = VCorpus(VectorSource(dataset_original$Review))

# Lowercase each word
corpus = tm_map(corpus, content_transformer(tolower))

Loading required package: NLP


In [5]:
dataset_original$Review[1]

In [6]:
as.character(corpus[[1]])

In [7]:
# Removing all the numbers
corpus = tm_map(corpus, removeNumbers)

In [8]:
dataset_original$Review[29]

In [9]:
as.character(corpus[[29]])

In [10]:
# Removing all punctuation
corpus = tm_map(corpus, removePunctuation)

In [11]:
dataset_original$Review[1]

In [12]:
as.character(corpus[[1]])

In [13]:
# Removing stop words, e.g. 'the', 'a', 'an', 'in', 'on': common articles,
# prepositions and other function words in tm's default English list
corpus = tm_map(corpus, removeWords, stopwords())

In [14]:
dataset_original$Review[1]

In [15]:
as.character(corpus[[1]])

In [16]:
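# Stemming: reduce each word to its root (e.g. 'loved' -> 'love')
# stemDocument needs the SnowballC package to be installed
# install.packages('SnowballC')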
corpus = tm_map(corpus, stemDocument)

In [17]:
dataset_original$Review[1]

In [18]:
as.character(corpus[[1]])

In [19]:
# Removing white space if any
# corpus = tm_map(corpus, stripWhitespace)
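
To see what the cleaning steps do to a single review, here is a minimal base-R sketch that approximates them without `tm`. The three-word stop-word list and the variable names are assumptions made for illustration; the list returned by `stopwords()` is much longer, and stemming is left out because base R has no stemmer.

```r
# Approximate the cleaning steps on one review using base R only.
review <- "Wow... Loved this place."

x <- tolower(review)                      # lowercase
x <- gsub("[[:digit:]]+", "", x)          # remove numbers
x <- gsub("[[:punct:]]+", "", x)          # remove punctuation

# Hypothetical, very short stop-word list (tm's real list is far longer).
toy_stopwords <- c("this", "the", "a")
pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
x <- gsub(pattern, "", x, perl = TRUE)    # remove stop words

# Removing words leaves doubled spaces behind, which is why a
# whitespace-stripping step is usually applied last.
cleaned <- trimws(gsub("\\s+", " ", x))
cleaned
```

Deleting words leaves gaps behind, so the final whitespace step is what makes the result a clean, single-spaced string.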

### Creating the Bag of Words model

In [20]:
dtm = DocumentTermMatrix(corpus)

In [21]:
dim(dtm)

In [22]:
dtm

<<DocumentTermMatrix (documents: 1000, terms: 1577)>>
Non-/sparse entries: 5435/1571565
Sparsity           : 100%
Maximal term length: 32
Weighting          : term frequency (tf)
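
The matrix printed above has one row per review and one column per distinct term, with each cell holding a count. A minimal base-R sketch of the same idea on three made-up, already-cleaned documents (the data and the names `docs`, `vocab` and `bow` are illustrative assumptions, not part of the notebook):

```r
# Three hypothetical, already-cleaned documents.
docs <- c("wow love place", "crust good", "love crust love")

# Split each document into tokens and collect the vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document.
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

# vapply returns terms x documents; transpose to documents x terms.
bow <- t(counts)
colnames(bow) <- vocab
bow
```

Most cells are zero even in this tiny example, which is why the real matrix is stored in a sparse format and reports a sparsity close to 100%.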

In [23]:
# Drop very rare terms: keep only terms whose sparsity is below 0.999,
# i.e. terms that appear in more than 0.1% of the reviews
dtm = removeSparseTerms(dtm, 0.999)

In [24]:
dtm

<<DocumentTermMatrix (documents: 1000, terms: 691)>>
Non-/sparse entries: 4549/686451
Sparsity           : 99%
Maximal term length: 12
Weighting          : term frequency (tf)
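
The effect of the sparsity threshold can be illustrated on a small matrix. This is a sketch in base R, assuming (as the `tm` documentation describes) that a term is kept only when its share of zero entries is below the threshold; the toy matrix and the 0.7 threshold are made up for illustration.

```r
# Hypothetical toy document-term matrix: 5 documents x 3 terms.
toy_dtm <- matrix(c(1, 0, 1,
                    2, 0, 0,
                    0, 1, 0,
                    1, 0, 1,
                    0, 0, 0),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(NULL, c("great", "rubber", "place")))

# Sparsity of a term = fraction of documents in which it does not occur.
term_sparsity <- colMeans(toy_dtm == 0)

# Keep only the terms whose sparsity is below the threshold.
threshold <- 0.7
kept <- toy_dtm[, term_sparsity < threshold, drop = FALSE]
colnames(kept)
```

Here `rubber` occurs in only one of five documents (sparsity 0.8) and is dropped, while `great` and `place` are kept.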

In [25]:
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = dataset_original$Liked

In [26]:
# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(1234)
split = sample.split(dataset$Liked, SplitRatio = 0.80)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

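# Fitting Naive Bayes to the Training set
# Column 692 is Liked (the 691 term columns come first), so it is
# dropped from the predictors and passed separately as the target
# install.packages('e1071')
library(e1071)
classifier = naiveBayes(x = training_set[-692],
                        y = training_set$Liked)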

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-692])

# Making the Confusion Matrix
cm = table(test_set[, 692], y_pred)
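
The hard-coded index 692 only works while the matrix has exactly 691 term columns. One possible alternative, sketched here on a made-up two-term data frame (the `toy` data and variable names are assumptions for illustration), is to select the target column by name:

```r
# Hypothetical miniature version of the modelling data frame.
toy <- data.frame(good = c(1, 0),
                  bad = c(0, 1),
                  Liked = factor(c(1, 0), levels = c(0, 1)))

# Select predictors by excluding the target column by name.
predictors <- toy[, setdiff(names(toy), "Liked"), drop = FALSE]

# If a numeric index is still needed, look it up instead of typing it.
target_index <- which(names(toy) == "Liked")
names(predictors)
target_index
```

With this approach the code keeps working if the sparsity threshold, and therefore the number of term columns, changes.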

In [27]:
cm

   y_pred
     0  1
  0  9 91
  1  7 93
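
The confusion matrix above can be reduced to a single accuracy figure. A minimal sketch that rebuilds the matrix by hand from the printed counts (the names `cm_nb` and `accuracy_nb` are illustrative, not from the notebook):

```r
# Rebuild the Naive Bayes confusion matrix from the printed counts
# (rows = actual label, columns = predicted label).
cm_nb <- matrix(c(9, 91,
                  7, 93),
                nrow = 2, byrow = TRUE,
                dimnames = list(actual = c("0", "1"),
                                y_pred = c("0", "1")))

# Accuracy = correct predictions (the diagonal) / all predictions.
accuracy_nb <- sum(diag(cm_nb)) / sum(cm_nb)

# Share of test reviews that were predicted as positive.
share_predicted_positive <- sum(cm_nb[, "1"]) / sum(cm_nb)

accuracy_nb
share_predicted_positive
```

This gives an accuracy of 0.51, with 92% of the 200 test reviews predicted as positive, so on this split the model is barely better than always guessing "liked".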

In [28]:
# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(1234)
split = sample.split(dataset$Liked, SplitRatio = 0.80)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Fitting Random Forest to the Training set
# install.packages('randomForest')
library(randomForest)
classifier = randomForest(x = training_set[-692],
                          y = training_set$Liked,
                          ntree = 10)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-692])

# Making the Confusion Matrix
cm = table(test_set[, 692], y_pred)

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.


In [29]:
cm

   y_pred
     0  1
  0 76 24
  1 28 72
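
Accuracy alone hides how the errors are distributed, so it can help to derive precision and recall as well. A minimal sketch that rebuilds the Random Forest matrix from the counts printed above (the variable names are illustrative assumptions):

```r
# Rebuild the Random Forest confusion matrix from the printed counts
# (rows = actual label, columns = predicted label).
cm_rf <- matrix(c(76, 24,
                  28, 72),
                nrow = 2, byrow = TRUE,
                dimnames = list(actual = c("0", "1"),
                                y_pred = c("0", "1")))

tp <- cm_rf["1", "1"]  # liked reviews predicted as liked
fp <- cm_rf["0", "1"]  # disliked reviews predicted as liked
fn <- cm_rf["1", "0"]  # liked reviews predicted as disliked

accuracy_rf  <- sum(diag(cm_rf)) / sum(cm_rf)
precision_rf <- tp / (tp + fp)
recall_rf    <- tp / (tp + fn)
f1_rf        <- 2 * precision_rf * recall_rf / (precision_rf + recall_rf)

c(accuracy = accuracy_rf, precision = precision_rf,
  recall = recall_rf, f1 = f1_rf)
```

For these counts the accuracy is 0.74, precision 0.75, recall 0.72 and F1 roughly 0.73, with errors spread fairly evenly across the two classes.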