### This is in continuation with the previous notebook titled "Atharva Joshi 200425197 Q1,2,3".

# Questions 4 and 5

In [1]:
import csv                               # csv reader
from sklearn.svm import LinearSVC
from nltk.classify import SklearnClassifier
from random import shuffle
from sklearn.pipeline import Pipeline

In [2]:
# load data from a file and append it to the rawData
def loadData(path, Text=None):
    with open(path,encoding='utf8') as f:
        reader = csv.reader(f, delimiter='\t')
        for line in reader:
            if line[0] == "DOC_ID":  # skip the header
                continue
            (Id, Text,AmazonCategory,AmazonRating,AmazonVerifiedReview,Label) = parseReview(line)
            rawData.append((Id, Text,AmazonCategory,AmazonRating,AmazonVerifiedReview,Label))


def splitData(percentage):
    # A method to split the data between trainData and testData 
    dataSamples = len(rawData)
    halfOfData = int(len(rawData)/2)
    trainingSamples = int((percentage*dataSamples)/2)
    for (_, Text, AmazonCategory, AmazonRating, AmazonVerifiedReview, Label) in rawData[:trainingSamples] + rawData[halfOfData:halfOfData+trainingSamples]:
        trainData.append((toFeatureVector(preProcess(Text),AmazonCategory,AmazonRating,AmazonVerifiedReview),Label))
    for (_, Text, AmazonCategory, AmazonRating, AmazonVerifiedReview, Label) in rawData[trainingSamples:halfOfData] + rawData[halfOfData+trainingSamples:]:
        testData.append((toFeatureVector(preProcess(Text),AmazonCategory,AmazonRating,AmazonVerifiedReview),Label))

### In the previous notebook the FScore was fairly low, so here we try to improve it, starting with more features. Earlier we used only the review text. We now add three supporting features: Category, Rating and whether the review is verified. Category tells us whether a specific category attracts more fake reviews, e.g. a mobile phone company might plant fake positive reviews of its own products to influence buyers, so the category can be an important signal. Rating helps because a large number of 5-star or 1-star ratings may indicate paid positive or negative reviews, which we need to filter out. Verified Review tells us whether the reviewer has actually used the product; verified reviewers are more likely to give a genuine review. We also tried Product ID and Product Title instead of Category and Rating and got an average FScore of 0.537 and an average accuracy of 0.524, which is much lower. That makes sense, since ID and Title don't really contribute to deciding whether a review is real or fake.
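### One way to check whether a metadata column carries any signal is to look at the share of fake reviews per value of that column. The sketch below is only an illustration: the rows in `toy_reviews` and the helper `fake_share_by` are hypothetical and are not taken from the dataset or from the notebook's code.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (rating, verified_flag, label). Not real dataset rows.
toy_reviews = [
    ("5", "N", "fake"),
    ("5", "N", "fake"),
    ("5", "Y", "real"),
    ("3", "Y", "real"),
    ("1", "N", "fake"),
    ("1", "Y", "real"),
]

def fake_share_by(rows, column):
    """Return {value: fraction of rows labelled 'fake'} for one metadata column."""
    counts = defaultdict(Counter)
    for row in rows:
        # The label is assumed to be the last element of each row.
        counts[row[column]][row[-1]] += 1
    return {value: c["fake"] / sum(c.values()) for value, c in counts.items()}

print(fake_share_by(toy_reviews, 0))  # share of fake reviews per rating
print(fake_share_by(toy_reviews, 1))  # share of fake reviews per verified flag
```

### A column whose values all show roughly the same share of fake reviews is unlikely to help the classifier much.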

In [3]:
# Convert a line from the input file into an (id, text, category, rating, verified, label) tuple
def parseReview(reviewLine):
    if reviewLine[1]=='__label2__':
        reviewLine[1]=realLabel
    else:
        reviewLine[1]=fakeLabel
    return (reviewLine[0], reviewLine[8], reviewLine[4],reviewLine[2],reviewLine[3],reviewLine[1])

### Earlier only the id, text and label columns were returned. parseReview is now tweaked to accommodate the three new features: Category, Rating and Verified Review are at column positions 4, 2 and 3 respectively. We also map "\_\_label2\_\_" to "real" and "\_\_label1\_\_" to "fake" for easier reading.
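### The column positions can be made easier to read by naming them. This is a minimal sketch, not the notebook's parseReview: the constant names, `parse_review_row` and the sample line are hypothetical, and columns 5 to 7 of the sample hold placeholder values because their meaning is not used here.

```python
import csv
import io

# Column positions as used by parseReview in this notebook.
COL_ID, COL_LABEL, COL_RATING, COL_VERIFIED, COL_CATEGORY, COL_TEXT = 0, 1, 2, 3, 4, 8
LABEL_NAMES = {"__label1__": "fake", "__label2__": "real"}

def parse_review_row(row):
    """Return (id, text, category, rating, verified, label) for one parsed row."""
    return (
        row[COL_ID],
        row[COL_TEXT],
        row[COL_CATEGORY],
        row[COL_RATING],
        row[COL_VERIFIED],
        LABEL_NAMES[row[COL_LABEL]],  # raises KeyError on an unexpected label
    )

# Hypothetical tab-separated line; "x", "y", "z" are placeholders.
sample = "1\t__label1__\t4\tN\tPC\tx\ty\tz\tGreat product\n"
row = next(csv.reader(io.StringIO(sample), delimiter="\t"))
parsed = parse_review_row(row)
print(parsed)
```

### Unlike parseReview, which treats every label other than "\_\_label2\_\_" as fake, this sketch fails loudly on an unknown label.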

In [4]:
# TEXT PREPROCESSING AND FEATURE VECTORIZATION
import re,nltk,string
from nltk.util import ngrams
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
# Input: a string of one review
def preProcess(text):
    
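    #Remove @tags (raw string avoids an invalid-escape warning for \S)
    TAG_CLEANING_RE = r"@\S+"
    text=re.sub(TAG_CLEANING_RE, ' ', text)
    
    #Removing website links, then any remaining run of non-alphanumeric characters
    LINKS_CLEANING_RE = r"https?:\S+|[^A-Za-z0-9]+"
    text=re.sub(LINKS_CLEANING_RE, ' ', text)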
    
    #Remove Punctuation
    text=text.translate(text.maketrans('', '', string.punctuation))
    
    #remove white spaces
    text=text.strip()
    
    #Convert To Lower Case
    text=text.lower()
    
    #Tokenize
    tokens = nltk.word_tokenize(text)
    
    #Stop Words removal
    stop_words = set(stopwords.words('english'))
    tokens = [i for i in tokens if not i in stop_words]
    
    #Porter Stemming
    #stemmer= PorterStemmer()
    #tokens = [stemmer.stem(i) for i in tokens]
    
    #Lemmatization
    lemmatizer=WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(i) for i in tokens]
    
    #ngrams = zip(*[tokens[i:] for i in range(2)])
    #tokens= [" ".join(ngram) for ngram in ngrams]
    
    return tokens

### In the previous notebook we mentioned some problems with the tokenisation process, notably that the number of tokens was far too large. These unnecessary tokens can be removed easily. We do that as follows:
### 1) @ Tags:
### Some users mention other users or the seller in their reviews. This doesn't affect the result and only adds to the corpus, so we use a regular expression to remove such tags. Without this step we get an average FScore of 0.776154 and an average accuracy of 0.776154.
### 2) Website Links:
### Some users include website links, perhaps to suggest similar, better or cheaper products to other users. We remove those too. Without this step we get an average FScore of 0.772114 and an average accuracy of 0.775528.
### 3) Punctuation
### e.g. "I am happy." is tokenised into ['I','am','happy','.']. The full stop doesn't matter for the decision but unnecessarily enlarges the vector space, so we remove all punctuation. Without this step we get an average FScore of 0.772178 and an average accuracy of 0.773148.
### 4) Lower Case
### e.g. the two sentences "I am Happy" and "I am happy" produce the distinct tokens ['I','am','Happy','happy'], although Happy and happy are the same word. To avoid this we convert the whole text to lower case. Without this step we get an average FScore of 0.771459 and an average accuracy of 0.772156.
### 5) Remove white spaces
### Since we are splitting on white space, we need to make sure there isn't any before or after the text. strip removes the leading and trailing white space. This made no difference to the FScore or accuracy.
### 6) Tokenisation
### Here we use the NLTK library (nltk.word_tokenize) to split the text into tokens. Because punctuation has already been removed, this largely amounts to splitting on white space.
### 7) Stop words removal
### Examples of stop words are "a", "and", "but", "how", "or" and "what". These words don't help classify the text and probably have large counts since they occur in almost every sentence, overshadowing more important words (remember that language is Zipfian). Without this step we get an average FScore of 0.775621 and an average accuracy of 0.775543.
### 8) Stemming
### e.g. Lion and Lions. The plural doesn't matter in this context. Stemming strips suffixes to reduce a word to a stem, so Lions becomes Lion and the two are treated as equal. The stem need not be a real word: the Porter stemmer turns apple/apples into appl and berry/berries into berri. We tried the Porter Stemmer, which is widely used.
### 9) Lemmatization
### In lemmatization each word is mapped to its dictionary form (lemma), e.g. apples becomes apple and berries becomes berry. This establishes equality between words that share a lemma but appear in different forms.
### Note: Lemmatization performs better than stemming here. With stemming we get an average FScore of 0.772320 and an average accuracy of 0.772187. Using both doesn't make any difference.
### 10) Bigrams
### So far we divided the text into single tokens, called unigrams. Here we tried bigrams, which pair each word with the previous one and so capture some context. We also tried higher orders such as trigrams and 4-grams, but the results were much the same. Unigrams still gave a slightly better result than the other n-grams, which gave an average FScore of 0.780357 and an average accuracy of 0.779553.
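### The cleaning steps and the n-gram construction can be sketched with the standard library alone. This is a simplified illustration, not the notebook's preProcess: `clean_text` and `make_ngrams` are hypothetical names, a plain `split()` stands in for nltk.word_tokenize, and stop-word removal and lemmatization are left out.

```python
import re
import string

TAG_RE = re.compile(r"@\S+")         # @mentions
LINK_RE = re.compile(r"https?:\S+")  # http/https links

def clean_text(text):
    """Remove @tags, links and punctuation, then strip and lower-case."""
    text = TAG_RE.sub(" ", text)
    text = LINK_RE.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.strip().lower()

def make_ngrams(tokens, n):
    """Join every run of n consecutive tokens into one space-separated feature."""
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]

tokens = clean_text("@seller I am Happy. See http://example.com now!").split()
print(tokens)                  # unigrams
print(make_ngrams(tokens, 2))  # bigrams
```

### With n=1 `make_ngrams` returns the unigrams unchanged, so the same helper covers unigrams, bigrams and higher orders.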

In [5]:
featureDict = {} # A global dictionary of features

def toFeatureVector(tokens,AmazonCategory,AmazonRating,AmazonVerifiedReview):
    featureDictLocal={}
    for t in tokens:
        try:
            featureDict[t] += 1
            featureDictLocal[t] += 1
        except KeyError:            
            featureDict[t] = 1
            featureDictLocal[t] = 1
    featureDict.update({'AmazonCategory':AmazonCategory,'AmazonRating':AmazonRating,'AmazonVerifiedReview':AmazonVerifiedReview})
    featureDictLocal.update({'AmazonCategory':AmazonCategory,'AmazonRating':AmazonRating,'AmazonVerifiedReview':AmazonVerifiedReview})
    return featureDictLocal

### Earlier we assigned weights based on the frequency of the words appearing in the text. Here we tried dividing those counts by the total length of the dataset, so that the weight still grows with the word count but stays small enough not to overflow. This method gave a lower average FScore of 0.765648 and an average accuracy of 0.764412. We also tried CountVectorizer and the TF-IDF vectorizer, which are available in libraries, but they gave similar performance. Hence we reverted to plain frequency counts, which is what the code above does. As before, we maintain both a global and a local dictionary.
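### The two weighting schemes discussed above can be compared on a small example. This is a sketch under assumptions: `count_features`, `scaled_features` and the value passed as `total` are hypothetical, and the exact normaliser used in our experiment is not reproduced here.

```python
from collections import Counter

def count_features(tokens):
    """Raw frequency of each token within one review."""
    return dict(Counter(tokens))

def scaled_features(tokens, total):
    """Frequency of each token divided by a caller-supplied normaliser."""
    return {token: n / total for token, n in Counter(tokens).items()}

tokens = ["good", "good", "price", "fast"]
print(count_features(tokens))
print(scaled_features(tokens, total=4))  # total=4 is an illustrative value only
```

### Using Counter also avoids the try/except bookkeeping needed when incrementing a plain dictionary.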

In [6]:
# TRAINING AND VALIDATING OUR CLASSIFIER
def trainClassifier(trainData):
    print("Training Classifier...")
    pipeline =  Pipeline([('svc', LinearSVC(penalty='l2',max_iter=2000, loss='hinge',dual=True,random_state=100,
                                            verbose=1,C=0.001,class_weight='balanced', fit_intercept=True,
                                            intercept_scaling=1,multi_class='ovr'))])
    return SklearnClassifier(pipeline).train(trainData)

### LinearSVC is the SVM we use to train the classifier. Its parameters can be tweaked to optimise its performance and help it converge to a minimum of the loss function. We observed the following while tuning:
### 1) penalty: the default is L2. L1 leads to a sparse coef_, but with dual=True L1 is not supported with either hinge or squared_hinge, so it raised an error and was not used.
### 2) tol: the default value (1e-4 for LinearSVC) is suitable. Increasing or decreasing it reduces both the average FScore and the average accuracy to around 0.77.
### 3) max_iter: raised from 1000 to 2000 because of the ConvergenceWarning we encountered during training. No change in FScore or accuracy.
### 4) loss: hinge gave a slightly higher F1 score than squared_hinge (0.78 vs 0.77).
### 5) dual: True (the default). It should be False only when n_samples > n_features, which is not the case here (16800 training samples, 30324 features), so we did not change it.
### 6) random_state: has no effect on FScore and accuracy.
### 7) verbose: has no effect on the scores; it only prints extra training output.
### 8) C: the regularization parameter, which must be strictly positive. The default is 1.0. Decreasing it raised the FScore and accuracy slightly, whereas increasing it gave around 0.76.
### 9) class_weight: the "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). The default is None, where all classes have weight one. It made no difference to the score.
### 10) fit_intercept and intercept_scaling: default values are used. They don't affect the FScore.
### 11) multi_class: changing it to 'crammer_singer' lowered the FScore to around 0.74, so we kept the default, 'ovr'.
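### The "balanced" class-weight formula quoted above, n_samples / (n_classes * count of each class), can be worked through by hand. The helper `balanced_class_weights` and the label lists below are hypothetical and only illustrate the arithmetic; they do not call scikit-learn.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

skewed = ["fake"] * 3 + ["real"] * 1  # illustrative imbalanced labels
even = ["fake"] * 2 + ["real"] * 2    # illustrative balanced labels

print(balanced_class_weights(skewed))
print(balanced_class_weights(even))
```

### When both classes are equally frequent every weight comes out as 1.0, the same as class_weight=None. If the training data is balanced, as the half-and-half slicing in splitData suggests, that would explain why this setting made no difference.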

In [7]:
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
crossValidationActual=[]
def crossValidate(dataset, folds):
    shuffle(dataset)
    cv_results = []
    accuracySum = 0
    accuracyAverage = 0.0
    foldSize = int(len(dataset)/folds)
    for i in range(0,len(dataset),foldSize):
        crossValidationTestData = dataset[i:i+foldSize]
        crossValidationTrainData = dataset[:i] + dataset[i+foldSize:]
        classifier = trainClassifier(crossValidationTrainData)
        crossValidationActual = [x[1] for x in crossValidationTestData]
        crossValidationPredictedLabels = predictLabels(crossValidationTestData,classifier)
        cv_results.append(precision_recall_fscore_support(crossValidationActual, crossValidationPredictedLabels, average='weighted'))
        accuracySum+=(accuracy_score(crossValidationActual, crossValidationPredictedLabels))
    # average over the folds actually run instead of a hard-coded 10
    print('\nAverage Accuracy:%f' % (accuracySum/len(cv_results)))
    return cv_results

### The algorithm and implementation remain the same here. We could also use library routines, such as the KFold class imported above, to do K-fold cross-validation.
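### The manual fold slicing can be made robust to dataset sizes that are not a multiple of the fold count. The generator below is a sketch with a hypothetical name (`k_fold_slices`); it always yields exactly `folds` test slices and spreads any remainder over the first folds.

```python
def k_fold_slices(n_samples, folds):
    """Yield (start, stop) bounds of the test slice for each of `folds` folds."""
    base, extra = divmod(n_samples, folds)
    start = 0
    for i in range(folds):
        # The first `extra` folds receive one additional sample each.
        stop = start + base + (1 if i < extra else 0)
        yield start, stop
        start = stop

data = list(range(10))  # stand-in for a shuffled dataset
for start, stop in k_fold_slices(len(data), 3):
    test_part = data[start:stop]
    train_part = data[:start] + data[stop:]
    print(len(train_part), len(test_part))
```

### With 16800 samples and 10 folds this gives ten slices of 1680 samples, matching the fold size used by crossValidate.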

In [8]:
# PREDICTING LABELS GIVEN A CLASSIFIER

def predictLabels(reviewSamples, classifier):
    return classifier.classify_many(map(lambda t: t[0], reviewSamples))

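def predictLabel(reviewSample, AmazonCategory, AmazonRating, AmazonVerifiedReview, classifier):
    # toFeatureVector needs the three metadata features as well as the tokens
    return classifier.classify(toFeatureVector(preProcess(reviewSample), AmazonCategory, AmazonRating, AmazonVerifiedReview))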

In [9]:
# MAIN
sumFScore=0
# loading reviews
# initialize global lists that will be appended to by the methods below
rawData = []          # the filtered data from the dataset file (should be 21000 samples)
trainData = []        # the pre-processed training data as a percentage of the total dataset (currently 80%, or 16800 samples)
testData = []         # the pre-processed test data as a percentage of the total dataset (currently 20%, or 4200 samples)

# the output classes
fakeLabel = 'fake'
realLabel = 'real'

# references to the data files
reviewPath = 'amazon_reviews.txt'

# Do the actual stuff (i.e. call the functions we've made)
# We parse the dataset and put it in a raw data list
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing the dataset...",sep='\n')

loadData(reviewPath) 

# We split the raw dataset into a set of training data and a set of test data (80/20)
# You do the cross validation on the 80% (training data)
# We print the number of training samples and the number of features before the split
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing training and test data...",sep='\n')
splitData(0.8)
# We print the number of training samples and the number of features after the split
print("After split, %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Training Samples: ", len(trainData), "Features: ", len(featureDict), sep='\n')

# QUESTION 3 - Make sure there is a function call here to the
# crossValidate function on the training set to get your results
validationResults=crossValidate(trainData,10)
for i in range(len(validationResults)):
    print('Fold ' + str(i+1) + ': \nPrecision: %f\tRecall: %f\tF Score:%f' % validationResults[i][:3])
    sumFScore+=validationResults[i][2]
print('Average FScore:%f' % (sumFScore/len(validationResults)))

Now 0 rawData, 0 trainData, 0 testData
Preparing the dataset...
Now 21000 rawData, 0 trainData, 0 testData
Preparing training and test data...
After split, 21000 rawData, 16800 trainData, 4200 testData
Training Samples: 
16800
Features: 
30324
Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]
Average Accuracy:0.782857
Fold 1: 
Precision: 0.783885	Recall: 0.775595	F Score:0.774163
Fold 2: 
Precision: 0.782435	Recall: 0.778571	F Score:0.777937
Fold 3: 
Precision: 0.795714	Recall: 0.791667	F Score:0.790222
Fold 4: 
Precision: 0.780969	Recall: 0.780357	F Score:0.780101
Fold 5: 
Precision: 0.800612	Recall: 0.793452	F Score:0.792266
Fold 6: 
Precision: 0.787950	Recall: 0.784524	F Score:0.784169
Fold 7: 
Pr

### Here we can see the results for each fold. After all these changes, the average FScore is around 0.78 and the average accuracy is also about 0.78, which is significantly better than what we had in the previous notebook. Hence the changes we made to improve performance worked. The number of features has also been reduced significantly.

# Evaluate on test set

In [10]:
# Finally, check the accuracy of your classifier by training on all the training data
# and testing on the test set
# Will only work once all functions are complete
functions_complete = True  # set to True once you're happy with your methods for cross val
if functions_complete:
    print(testData[0])   # have a look at the first test data instance
    classifier = trainClassifier(trainData)  # train the classifier
    testTrue = [t[1] for t in testData]   # get the ground-truth labels from the data
    testPred = predictLabels(testData, classifier)  # classify the test data to get predicted labels
    finalScores = precision_recall_fscore_support(testTrue, testPred, average='weighted') # evaluate
    print("Done training!")
    print("Precision: %f\nRecall: %f\nF Score:%f" % finalScores[:3])
    print("Accuracy: %f" % accuracy_score(testTrue, testPred))

({'assortment': 1, 'really': 1, 'hershey': 1, 'best': 1, 'little': 1, 'one': 1, 'always': 1, 'excited': 1, 'whenever': 1, 'holiday': 1, 'come': 1, 'AmazonCategory': 'Grocery', 'AmazonRating': '5', 'AmazonVerifiedReview': 'N'}, 'fake')
Training Classifier...
[LibLinear]Done training!
Precision: 0.815233
Recall: 0.808095
F Score:0.807003
Accuracy: 0.808095


### On the held-out test set, the FScore (0.807) and the accuracy (0.808) show a significant improvement compared to the previous notebook.