## NLP & Binary Classification: Amazon Product Reviews
https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#

**Dataset Information:**

1,000 sentences from amazon.com product reviews, each labelled with positive or negative sentiment.

**Attribute Information (1 feature and 1 class):**

- Sentence: the review text
- Score: 1 (for positive) or 0 (for negative)

**Objective of this project:**

Predict the sentiment (positive or negative) of a sentence.

## Data

In [15]:
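options(warn = -1)  # suppress warnings in the notebook output
# Load data: tab-separated file with no header row.
# quote = '' keeps quotation marks inside sentences from being read as field quotes.
df_original <- read.delim('amazon_cells_labelled.txt', quote = '',
                          stringsAsFactors = FALSE, header = FALSE)
colnames(df_original) = c('text', 'label')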

In [16]:
# Inspect Data
head(df_original)

text,label
So there is no way for me to plug it in here in the US unless I go by a converter.,0
"Good case, Excellent value.",1
Great for the jawbone.,1
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!,0
The mic is great.,1
I have to jiggle the plug to get it to line up right to get decent volume.,0


In [17]:
dim(df_original)

In [18]:
str(df_original)

'data.frame':	1000 obs. of  2 variables:
 $ text : chr  "So there is no way for me to plug it in here in the US unless I go by a converter." "Good case, Excellent value." "Great for the jawbone." "Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!" ...
 $ label: int  0 1 1 0 1 0 0 1 0 0 ...


In [19]:
summary(df_original)

     text               label    
 Length:1000        Min.   :0.0  
 Class :character   1st Qu.:0.0  
 Mode  :character   Median :0.5  
                    Mean   :0.5  
                    3rd Qu.:1.0  
                    Max.   :1.0  

In [20]:
table(df_original$label)


  0   1 
500 500 

## Data preprocessing

**Clean Text**

In [21]:
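library(tm)
library(SnowballC)  # stemmer used by stemDocument
corpus = VCorpus(VectorSource(df_original$text))
corpus = tm_map(corpus, content_transformer(tolower))  # lowercase
corpus = tm_map(corpus, removeNumbers)                 # drop digits
corpus = tm_map(corpus, removePunctuation)             # drop punctuation
corpus = tm_map(corpus, removeWords, stopwords())      # drop stopwords (English by default)
corpus = tm_map(corpus, stemDocument)                  # reduce words to their stems
corpus = tm_map(corpus, stripWhitespace)               # collapse repeated whitespace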
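
To make the effect of the cleaning steps concrete, here is a minimal base-R sketch that approximates the lowercase, number, punctuation and whitespace steps on one sentence from the dataset. The function name `clean_text_sketch` is hypothetical, and stopword removal and stemming are deliberately left out, so this is not a drop-in replacement for the `tm` pipeline above.

```r
# Base-R approximation of the lowercase / number / punctuation / whitespace
# steps; stopword removal and stemming are intentionally left out.
clean_text_sketch <- function(x) {
  x <- tolower(x)                     # lowercase
  x <- gsub("[[:digit:]]+", "", x)    # drop digits
  x <- gsub("[[:punct:]]+", "", x)    # drop punctuation (no space is inserted)
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)
}

example <- "Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!"
cleaned <- clean_text_sketch(example)
print(cleaned)
# [1] "tied to charger for conversations lasting more than minutesmajor problems"
```

Because punctuation is deleted rather than replaced by a space, `minutes.MAJOR` fuses into the single token `minutesmajor`. Reviews with missing spaces after a full stop can therefore produce rare, uninformative terms.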

**Create Bag-of-Words model**

In [22]:
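dtm = DocumentTermMatrix(corpus)      # one row per review, one column per term
dtm = removeSparseTerms(dtm, 0.999)   # drop the rarest terms (maximal allowed sparsity 0.999)
df = as.data.frame(as.matrix(dtm))
dim(df)                               # dimensions before adding the label
df$label = df_original$label
dim(df)                               # dimensions after adding the label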
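
As an illustration of what a bag-of-words matrix contains, the sketch below builds a tiny document-term matrix by hand in base R. The three documents are made-up, stem-like strings and the variable names are hypothetical; `DocumentTermMatrix` produces a sparse object with the same rows-are-documents, columns-are-terms layout.

```r
# Toy documents, already cleaned and stemmed (made up for illustration)
docs <- c("good case excel valu", "great jawbon", "mic great")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))   # one column per distinct term

# Count each vocabulary term per document (terms x documents), then transpose
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
toy_dtm <- t(counts)
colnames(toy_dtm) <- vocab
print(toy_dtm)
```

Each cell holds how many times a term occurs in a document. Word order is discarded, which is why the representation is called a "bag" of words.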

**Encode label**

In [23]:
df$label = factor(df$label, levels = c(0, 1))
# Look up the label column by name instead of hard-coding its index
label_col = which(colnames(df) == 'label')

In [24]:
library(caTools)
library(caret)
seed = 101 #random seed for reproducibility
set.seed(seed) 

**Split Train and Test Sets**

In [25]:
split = sample.split(df$label, SplitRatio = 0.80)
train_set = subset(df, split == TRUE)
test_set = subset(df, split == FALSE)
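
`sample.split` is documented to preserve the relative ratios of the labels in each partition. The base-R sketch below shows one way such a stratified 80/20 split can be done by hand. It uses a stand-in label vector with 500 items per class rather than `df$label`, and it is not claimed to reproduce the exact rows that `sample.split` selects.

```r
set.seed(101)
labels <- factor(rep(c(0, 1), each = 500))   # stand-in for df$label

# Sample 80% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(labels), labels),
                           function(idx) sample(idx, size = round(0.80 * length(idx)))))
in_train <- seq_along(labels) %in% train_idx

print(table(labels[in_train]))    # class counts in the training rows
print(table(labels[!in_train]))   # class counts in the held-out rows
```

With balanced classes this yields 400 training and 100 test rows per class.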

## Model Training / Evaluation

In [26]:
# Fit randomForest to the training set
library(randomForest)
model = randomForest(x = train_set[-label_col],
                     y = train_set$label,
                     ntree = 200)

In [27]:
# Make predictions
predictions = predict(model, newdata = test_set[-label_col])
# Evaluate the results
confusionMatrix(predictions, test_set$label)

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 80 21
         1 20 79
                                          
               Accuracy : 0.795           
                 95% CI : (0.7323, 0.8487)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.59            
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8000          
            Specificity : 0.7900          
         Pos Pred Value : 0.7921          
         Neg Pred Value : 0.7980          
             Prevalence : 0.5000          
         Detection Rate : 0.4000          
   Detection Prevalence : 0.5050          
      Balanced Accuracy : 0.7950          
                                          
       'Positive' Class : 0               
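
The headline statistics can be recomputed by hand from the four counts in the confusion matrix, which helps when reading the report. The sketch below is plain base R with hypothetical variable names. It follows the report in treating class 0 (negative sentiment) as the 'Positive' class.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(80, 20, 21, 79), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # class 0 correctly identified
specificity <- cm["1", "1"] / sum(cm[, "1"])   # class 1 correctly identified
ppv         <- cm["0", "0"] / sum(cm["0", ])   # precision for class 0
npv         <- cm["1", "1"] / sum(cm["1", ])

# Cohen's kappa: observed agreement corrected for chance agreement
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              kappa = kappa_stat), 4))
```

Since the 'Positive' class here is 0, the sensitivity of 0.80 describes how well negative reviews are detected. To report metrics relative to positive sentiment, `confusionMatrix` accepts a `positive` argument, for example `positive = "1"`.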
                                          