## NLP & Binary Classification: Yelp Reviews
https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#

**Dataset Information:**

1,000 sentences from yelp.com, each labelled with positive or negative sentiment.

**Attribute Information (1 feature and 1 class):**

- Sentence (the review text)
- Score: 1 (positive) or 0 (negative)

**Objective of this project:**

Predict the sentiment (positive or negative) of each sentence.

## Data

In [18]:
options(warn=-1)
# Load Data
df_original  <- read.delim('yelp_labelled.txt', quote = '',
                           stringsAsFactors = FALSE,header=FALSE)
colnames(df_original) = c('text','label')

In [19]:
# Inspect Data
head(df_original)

| text | label |
|---|---|
| Wow... Loved this place. | 1 |
| Crust is not good. | 0 |
| Not tasty and the texture was just nasty. | 0 |
| Stopped by during the late May bank holiday off Rick Steve recommendation and loved it. | 1 |
| The selection on the menu was great and so were the prices. | 1 |
| Now I am getting angry and I want my damn pho. | 0 |


In [20]:
dim(df_original)

In [21]:
str(df_original)

'data.frame':	1000 obs. of  2 variables:
 $ text : chr  "Wow... Loved this place." "Crust is not good." "Not tasty and the texture was just nasty." "Stopped by during the late May bank holiday off Rick Steve recommendation and loved it." ...
 $ label: int  1 0 0 1 1 0 0 0 1 1 ...


In [22]:
summary(df_original)

     text               label    
 Length:1000        Min.   :0.0  
 Class :character   1st Qu.:0.0  
 Mode  :character   Median :0.5  
                    Mean   :0.5  
                    3rd Qu.:1.0  
                    Max.   :1.0  

In [23]:
table(df_original$label)


  0   1 
500 500 

## Data preprocessing

**Clean Text**

In [24]:
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(df_original$text))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
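
To make the effect of these steps concrete, here is a rough base-R imitation applied to single sentences. It is only a sketch: `clean_sentence` and `toy_stopwords` are hypothetical names, the stop-word list is a tiny stand-in for `stopwords()`, and stemming is left out.

```r
# Rough base-R imitation of the cleaning steps (no stemming).
# `toy_stopwords` is a hypothetical, deliberately tiny stand-in for tm::stopwords().
toy_stopwords <- c("this", "is", "not", "and", "the", "was", "just")

clean_sentence <- function(x, stop_words = toy_stopwords) {
  x <- tolower(x)                              # lower-case
  x <- gsub("[[:digit:]]+", "", x)             # drop numbers
  x <- gsub("[[:punct:]]+", "", x)             # drop punctuation
  words <- strsplit(x, "[[:space:]]+")[[1]]    # split on whitespace
  words <- words[nzchar(words) & !(words %in% stop_words)]  # drop stop words
  paste(words, collapse = " ")                 # single spaces between tokens
}

cleaned_1 <- clean_sentence("Wow... Loved this place.")
cleaned_2 <- clean_sentence("Crust is not good.")
print(cleaned_1)
print(cleaned_2)
```

In this toy version the negation in "Crust is not good." disappears with the stop words. Whether the real pipeline does the same can be checked with `"not" %in% stopwords()`.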

**Create Bag-of-Words model**

In [25]:
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
df = as.data.frame(as.matrix(dtm))
dim(df)
df$label = df_original$label
dim(df)
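
As an illustration of what the document-term matrix holds, the sketch below builds a tiny bag-of-words matrix by hand in base R. The three toy documents are made up for the example. The per-term sparsity at the end is the quantity that `removeSparseTerms` thresholds; as I read the tm documentation, it keeps terms whose sparsity is below the given value.

```r
# Toy bag-of-words: one row per document, one column per term, cells = counts.
docs <- c("love place", "crust good", "love love crust")

tokens <- strsplit(docs, " ", fixed = TRUE)   # split each document into words
vocab  <- sort(unique(unlist(tokens)))        # vocabulary = all distinct terms

# Count how often each vocabulary term occurs in each document
counts <- vapply(tokens,
                 function(tok) as.integer(table(factor(tok, levels = vocab))),
                 integer(length(vocab)))
bow <- t(counts)                              # documents in rows, terms in columns
dimnames(bow) <- list(doc = seq_along(docs), term = vocab)
print(bow)

# Fraction of documents in which each term is absent (its sparsity)
term_sparsity <- colMeans(bow == 0)
print(term_sparsity)
```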

**Encode label**

In [26]:
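df$label = factor(df$label, levels = c(0, 1))
# Look up the label column by name instead of hard-coding its position (692)
label_col = which(colnames(df) == 'label')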

In [27]:
library(caTools)
library(caret)
seed = 101 #random seed for reproducibility
set.seed(seed) 

**Split Train / Test Sets**

In [28]:
split = sample.split(df$label, SplitRatio = 0.80)
train_set = subset(df, split == TRUE)
test_set = subset(df, split == FALSE)
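
`sample.split` is documented to preserve the class ratios of the vector it is given. A minimal base-R sketch of the same idea, a stratified 80/20 split, is shown here on a toy label vector with the same 500/500 balance. The names `labels`, `train_idx` and `in_train` are illustrative, and the rows selected will not match those chosen by `sample.split`.

```r
set.seed(101)                                # same seed value as the notebook
labels <- factor(rep(c(0, 1), each = 500))   # toy stand-in for df$label

# Sample 80% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(labels), labels),
                           function(idx) sample(idx, size = round(0.8 * length(idx)))))
in_train <- seq_along(labels) %in% train_idx

print(table(labels[in_train]))    # class counts in the training part
print(table(labels[!in_train]))   # class counts in the test part
```

With balanced classes this gives 400 training and 100 test rows per class.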

## Model Training / Evaluation

In [29]:
# Fit a random forest to the training set
library(randomForest)
model = randomForest(x = train_set[-label_col],
                     y = train_set$label,
                     ntree = 200)

In [30]:
# Make predictions
predictions = predict(model, newdata = test_set[-label_col])
# Evaluate the results
confusionMatrix(predictions, test_set$label)

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 77 24
         1 23 76
                                       
               Accuracy : 0.765        
                 95% CI : (0.7, 0.8219)
    No Information Rate : 0.5          
    P-Value [Acc > NIR] : 1.354e-14    
                                       
                  Kappa : 0.53         
 Mcnemar's Test P-Value : 1            
                                       
            Sensitivity : 0.7700       
            Specificity : 0.7600       
         Pos Pred Value : 0.7624       
         Neg Pred Value : 0.7677       
             Prevalence : 0.5000       
         Detection Rate : 0.3850       
   Detection Prevalence : 0.5050       
      Balanced Accuracy : 0.7650       
                                       
       'Positive' Class : 0            
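
The headline statistics can be recomputed by hand from the four cell counts printed above. The sketch below does this in base R; the variable names are illustrative. Because the 'Positive' class is 0, sensitivity here is the share of negative reviews that were correctly identified.

```r
# Confusion matrix copied from the output above (rows = prediction, cols = reference)
cm <- matrix(c(77, 23, 24, 76), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)            # correct predictions / all predictions
sensitivity <- cm["0", "0"] / sum(cm[, "0"])      # recall on class 0 (the 'positive' class)
specificity <- cm["1", "1"] / sum(cm[, "1"])      # recall on class 1
ppv         <- cm["0", "0"] / sum(cm["0", ])      # precision for class 0
npv         <- cm["1", "1"] / sum(cm["1", ])      # precision for class 1

# Cohen's kappa: agreement beyond what the row/column totals alone would give
expected    <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              kappa = cohen_kappa), 4))
```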
                                       