# Sentiment Analysis - Shopee Code League 2020

## 2.0 Implementing Trained Model

### Introduction
After reviewing the task from [Shopee Code League 2020 - Sentiment Analysis](https://www.kaggle.com/c/student-shopee-code-league-sentiment-analysis/overview), our team built and trained a model in the earlier part.

We will now use this script to run the trained model on the test data and submit the predictions to Kaggle. The resulting test score gives us a baseline before we fine-tune the script further to improve it.

To do this, we will prepare the test data, unpickle the trained model, generate predictions, and save them to the output file.

#### Team Introduction
Team Name: **JNNY** <br/>
Team Members: **Natalie, James, Yong Xian, Nicky** <br/>
Script Prepared by: **Nicky** [@ahjimomo](https://github.com/ahjimomo)

## 2.1 Library & Data Import

In [1]:
import numpy as np 
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
import re

In [2]:
# Importing the dataset

test_raw = pd.read_csv('input/test.csv')

test_raw # 60427 reviews imported

| | review_id | review |
|---|---|---|
| 0 | 1 | Great danger, cool, motif and cantik2 jg model... |
| 1 | 2 | One of the shades don't fit well |
| 2 | 3 | Very comfortable |
| 3 | 4 | Fast delivery. Product expiry is on Dec 2022. ... |
| 4 | 5 | it's sooooo cute! i like playing with the glit... |
| ... | ... | ... |
| 60422 | 60423 | Product has been succesfully ordered and shipp... |
| 60423 | 60424 | Opening time a little scared. Fear dalemnya de... |
| 60424 | 60425 | The product quality is excellent. The origina... |
| 60425 | 60426 | They 're holding up REALLY well also . |


## 2.2 Data Cleaning & Preparation

Just as we prepared the data for training the model, we will prepare the test data in the same way.

The basic text pre-processing techniques are the same as for the train set, using [Python Regular Expressions](https://developers.google.com/edu/python/regular-expressions) as follows <br/>
    - Removing special characters
    - Removing words containing numbers
    - Removing single characters
    - Removing multiple whitespaces
    - Making text all lowercase

After the cleaning process, we will then move on to tokenizing the data for the TF-IDF approach.

In [3]:
# Creating an output folder

In [4]:
# Extract the review text (features) from the test data

test_features = test_raw.iloc[:, 1].values

# Check the first 5 features
for review in test_features[:5]:
    print(review, "\n")

Great danger, cool, motif and cantik2 jg models. Delivery cepet. Tp packing less okay krn only wear clear plastic nerawang klihtan contents jd 

One of the shades don't fit well 

Very comfortable 

Fast delivery. Product expiry is on Dec 2022. Product wrap properly. No damage on the item. 

it's sooooo cute! i like playing with the glitters better than browsing on my phone now. item was also deliered earlier than i expected. thank you seller! may you have more buyers to come. 😊😊😊 



In [5]:
processed_test_features = []

for review in test_features:

    # Replace all special (non-word) characters with spaces
    processed_feature = re.sub(r'\W', ' ', str(review))

    # Remove all words that include digits / numbers
    processed_feature = re.sub(r'\w*\d\w*', ' ', processed_feature)

    # Remove all single characters surrounded by whitespace
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove a single character at the start of the text
    # (the caret must not be escaped, otherwise it matches a literal '^')
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

    # Substitute multiple spaces with a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature)

    # Convert to lowercase (assign back to the same variable that gets appended)
    processed_feature = processed_feature.lower()

    # Append cleaned review to processed list
    processed_test_features.append(processed_feature)

In [6]:
# Checking if the reviews have been cleaned up
for review in processed_test_features[:5]:
    print(review, "\n")

great danger cool motif and jg models delivery cepet tp packing less okay krn only wear clear plastic nerawang klihtan contents jd 

one of the shades don fit well 

very comfortable 

fast delivery product expiry is on dec product wrap properly no damage on the item  

it sooooo cute like playing with the glitters better than browsing on my phone now item was also deliered earlier than expected thank you seller may you have more buyers to come  
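
Because the test reviews have to be cleaned in the same way as the training reviews, the steps above could also be wrapped in one reusable function. The following is a minimal sketch rather than the notebook's own code: the name `clean_review` is hypothetical, and the final `.strip()` is an added assumption that drops leading and trailing spaces.

```python
import re

def clean_review(text):
    """Apply the cleaning steps from this section to one review (hypothetical helper)."""
    text = re.sub(r'\W', ' ', str(text))          # special characters -> spaces
    text = re.sub(r'\w*\d\w*', ' ', text)         # drop words containing digits
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # drop single letters between spaces
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)     # drop a single letter at the start
    text = re.sub(r'\s+', ' ', text)              # collapse repeated whitespace
    # Assumption: strip() is an extra step that the loop above does not perform
    return text.lower().strip()

# Two of the sample reviews shown earlier in this section
cleaned = [clean_review(r) for r in ["One of the shades don't fit well",
                                     "Fast delivery. Product expiry is on Dec 2022."]]
print(cleaned)
```

Calling the same function from both the training and the test script would keep the two cleaning pipelines from drifting apart.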



#### Vectorization

We will use the same `TF-IDF` vectorizer settings as before; the vectorizer also handles tokenization and the filtering of stop words for our test dataset. <br/>

As before, we will use `max_features` = 2500 and a `max_df` of 80% to remove the all-too-common words, but with `min_df` reduced to 5 as the test set is smaller in size.
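
One point worth illustrating with toy data: a TF-IDF vectorizer learns its vocabulary, and therefore its column order, from whatever text it is fitted on. The sketch below is an illustration under stated assumptions, not the team's pipeline: the review strings and variable names are made up, and it assumes the vectorizer fitted during training is still available (for example, pickled alongside the model). It compares reusing a fitted vectorizer via `transform` with fitting a new one on the test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the real train / test reviews (hypothetical data)
train_reviews = [
    "fast delivery good product",
    "bad packing slow delivery",
    "good quality fast shipping",
]
test_reviews = ["good delivery", "terrible seller"]

# Fit once on the training text, then only transform the test text
train_vectorizer = TfidfVectorizer()
X_train = train_vectorizer.fit_transform(train_reviews)
X_test = train_vectorizer.transform(test_reviews)

# A vectorizer refitted on the test text learns a different vocabulary
refit_vectorizer = TfidfVectorizer()
X_test_refit = refit_vectorizer.fit_transform(test_reviews)

print("train columns:", X_train.shape[1])
print("test columns (reused vectorizer):", X_test.shape[1])
print("test columns (refitted vectorizer):", X_test_refit.shape[1])
```

With the reused vectorizer, a given column stands for the same word in both matrices, and words never seen in training (such as "terrible") are simply ignored. With a refitted vectorizer the columns are rebuilt from the test text, so they need not line up with what a trained model expects.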

In [7]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features = 2500, min_df = 5, max_df = 0.8, stop_words = stopwords.words('english'))
processed_test_features = vectorizer.fit_transform(processed_test_features).toarray()

MemoryError: 
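
The `MemoryError` is consistent with the size of the dense array that `.toarray()` has to allocate. As a rough back-of-the-envelope check, assuming the full 2500 features are kept and the vectorizer's default `float64` output (8 bytes per value):

```python
# Rough size of the dense TF-IDF matrix requested by .toarray()
n_reviews = 60427        # rows, from the import step above
n_features = 2500        # upper bound set by max_features
bytes_per_value = 8      # assumption: default float64 output

dense_bytes = n_reviews * n_features * bytes_per_value
dense_gib = dense_bytes / 2**30

print(f"{dense_bytes:,} bytes, about {dense_gib:.2f} GiB")
```

One possible way around this, offered as a suggestion rather than something this notebook does, is to skip `.toarray()` and keep the sparse matrix returned by `fit_transform`; scikit-learn's random forest accepts sparse input in `predict`.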

## 2.3 Time to predict the test data for submission! 

We now have to unpickle the trained models to make predictions on the test data and output them in the format required by the challenge.

Since we have 2 trained models, we will prepare 2 sets of submissions in `csv` format.

The task requires our submission to be in `csv` format with only 2 columns:
    - review_id (int), the indexes of the reviews, starting from 0
    - rating (int), the predictions made by our model

In [None]:
# Unpickle our earlier trained models

# Model 1 : Accuracy of 45.5% on y-test

filename_01 = "rfc_model_01.pkl"
with open(filename_01, 'rb') as file1:
    pickle_rfc_01 = pickle.load(file1)
    
# Model 2 : Accuracy of 45.36% on y-test
filename_02 = "rfc_model_02.pkl"
with open(filename_02, 'rb') as file2:
    pickle_rfc_02 = pickle.load(file2)

In [None]:
# Time to predict!

predictions_01 = pickle_rfc_01.predict(processed_test_features)

In [None]:
predictions_02 = pickle_rfc_02.predict(processed_test_features)

In [None]:
# Check that the predicted ratings fall between 1 and 5
# Check lengths to ensure the number of predicted labels matches the number of input reviews

print(min(predictions_01), min(predictions_02))
print(max(predictions_01), max(predictions_02))
print(len(predictions_01), len(predictions_02), "\nOriginal:", len(test_raw))
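
The same checks could be wrapped in a small helper so that a bad prediction array is caught before a file is written. A minimal sketch: `check_predictions` is a hypothetical helper, the 1 to 5 rating range is taken from the comment above, and the sample inputs are made up.

```python
def check_predictions(predictions, expected_length, valid_ratings=range(1, 6)):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(predictions) != expected_length:
        problems.append(
            f"expected {expected_length} predictions, got {len(predictions)}"
        )
    # Collect every distinct rating that falls outside the allowed range
    unexpected = sorted({int(p) for p in predictions if int(p) not in valid_ratings})
    if unexpected:
        problems.append(f"unexpected ratings: {unexpected}")
    return problems

# Toy examples with made-up predictions
print(check_predictions([5, 3, 1, 4], expected_length=4))
print(check_predictions([5, 0, 7], expected_length=4))
```

In this notebook the call would look like `check_predictions(predictions_01, len(test_raw))`.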

#### Preparing the submission document

We will need to attach the predictions to the original indexing before removing the features (reviews/sentiments) to have the document in the correct submission format. 

In [None]:
# Creating copies for ease of reusability later

submission_01 = test_raw.copy()

submission_02 = test_raw.copy()

# Checking if copy is successful
submission_01.head()

In [None]:
# Attaching the predictions to their review_id
submission_01['rating'] = predictions_01
submission_02['rating'] = predictions_02

# Checking if columns are correctly attached
submission_01.head()

In [None]:
# Removing the review column

submission_01 = submission_01.drop("review", axis = 1)
submission_02 = submission_02.drop("review", axis = 1)

# Checks if column has been dropped correctly
submission_01.head()

In [None]:
# Last check before saving to csv.

print("submission doc 1:", submission_01.shape)
print("submission doc 2:", submission_02.shape)

In [None]:
# Output results in csv format

submission_01.to_csv("shopeecodeleague_SentimentAnalysis_TeamJnny_01.csv", index = False, encoding = 'utf8')
submission_02.to_csv("shopeecodeleague_SentimentAnalysis_TeamJnny_02.csv", index = False, encoding = 'utf8')
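
The copy, attach, and drop steps above could also be collapsed into building the two-column frame directly. A minimal sketch with made-up stand-in data: `build_submission` is a hypothetical helper, and an in-memory buffer replaces the real output file so nothing is written to disk.

```python
import io
import pandas as pd

def build_submission(review_ids, predictions):
    """Build a submission frame with only review_id and rating columns."""
    if len(review_ids) != len(predictions):
        raise ValueError("review_ids and predictions must have the same length")
    return pd.DataFrame({"review_id": list(review_ids), "rating": list(predictions)})

# Toy stand-ins for test_raw['review_id'] and a model's predictions
submission = build_submission([1, 2, 3], [5, 3, 4])

# Write to an in-memory buffer instead of a file for this illustration
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Building the frame from the two needed columns avoids carrying the review text along and makes a length mismatch fail early.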

## Congratulations!

We have now completed & submitted our work. The test scores are as follows on the [Kaggle Leaderboard](https://www.kaggle.com/c/student-shopee-code-league-sentiment-analysis/leaderboard#score):

`Model_01`: 0.018133 <br/>
`Model_02`: 0.018122

The performance is really poor, so how can we improve it? In `part 3`, we will look into adding other techniques in the hope of improving our model and the results.

#### Thank you ;)