In the [Interpreting the Relationship between Sentiment and Engagement](https://github.com/Lotass/Natural_Language_Processing) notebook, I investigated the Admin Posts dataset and discovered dependencies between several independent variables and the Sentiment variable.

Given those dependencies, applying an __NLP/Word2vec model__ to the Admin post text just to fit a __Sentiment Analysis__ model looks like unnecessary __overhead__.

### My approach: build a __Kernel-SVM model__ on the independent variables that are highly correlated with the dependent variable, and apply the feature extraction technique LDA to boost the model's performance

In [1]:
# import dependencies
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# import dataset
admin_posts = pd.read_excel("Hyundai_admin_post_data.xls")
admin_posts = admin_posts.reset_index()
admin_posts.drop(["index","Post_id","user_name","creation_date","creation_time","month","updated_date",
                  "updated_time","link","picture","user_id","text","type","ontologies","entities","Total_Sentiment"],
                   axis=1, inplace=True)

In [2]:
x = admin_posts.iloc[:,:].values # independent variables
y = admin_posts.iloc[:,2].values # sentiment as dependent variable
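
Note that `iloc[:, :]` selects every column, so `x` still contains column 2, the sentiment itself. A minimal sketch of separating the target from the features, using a small made-up frame (the names `toy_posts`, `likes`, `shares` and `comments` are hypothetical; only the position of the sentiment column follows the cell above):

```python
import pandas as pd

# Toy stand-in for admin_posts; the column names are hypothetical.
toy_posts = pd.DataFrame(
    {
        "likes": [10, 3, 25, 7],
        "shares": [1, 0, 4, 2],
        "Sentiment": [2, 0, 3, -1],
        "comments": [5, 1, 9, 0],
    },
    columns=["likes", "shares", "Sentiment", "comments"],
)

target_col = toy_posts.columns[2]                      # sentiment sits at position 2
y_toy = toy_posts[target_col].values                   # dependent variable
x_toy = toy_posts.drop(columns=[target_col]).values    # features without the target
```

With the target removed, the scores reported further down would likely change, so they should be re-measured rather than assumed.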

List each unique sentiment value and its frequency.

In [3]:
dict((i, list(y).count(i)) for i in y)

{-8: 1,
 -6: 1,
 -4: 4,
 -3: 1,
 -2: 11,
 -1: 8,
 0: 156,
 1: 21,
 2: 43,
 3: 21,
 4: 12,
 5: 3,
 6: 2,
 7: 5,
 8: 3,
 9: 2,
 11: 2}
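
The same tally can also be produced in a single pass with `collections.Counter`. A small sketch on a made-up label list (the values in `labels` are illustrative, not the dataset):

```python
from collections import Counter

labels = [0, 2, 0, -1, 2, 0, 1]  # illustrative sentiment labels, not the real data

counts = dict(Counter(labels))   # one pass over the labels
for value in sorted(counts):
    print("%d: %d" % (value, counts[value]))
```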

Split the data into training and test sets, with __60%__ for training and __40%__ for testing.
__Note:__ I also tried a 70%/30% split, but the current configuration gave better results.

In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, train_size=0.6,  random_state= 0)

I need to apply __Feature Scaling__ to standardize the data before applying the __LDA Feature Extraction__ technique.

In [5]:
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
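
To make the step concrete: `StandardScaler` learns a per-column mean and standard deviation from the training set only and reuses them on the test set. A NumPy-only sketch of the same arithmetic on toy arrays (`toy_train` and `toy_test` are made-up values):

```python
import numpy as np

toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

mean = toy_train.mean(axis=0)   # per-column mean, from the training data only
std = toy_train.std(axis=0)     # population standard deviation (ddof=0)

train_scaled = (toy_train - mean) / std   # what fit_transform computes
test_scaled = (toy_test - mean) / std     # what transform computes, reusing mean and std
```

The scaled training columns have zero mean and unit variance; the test columns generally do not, because they are scaled with the training statistics.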



Apply linear discriminant analysis (__LDA__) to extract the features that best separate the multiple class values of the __Sentiment__ dependent variable.

In [6]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA (n_components = 2)
x_train = lda.fit_transform(x_train, y_train)
x_test = lda.transform(x_test)
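
A self-contained sketch of the same projection on scikit-learn's bundled iris data (a stand-in, not the Admin Posts data; the `iris_*`, `ix_*` and `lda_demo` names are illustrative). LDA can produce at most `min(n_classes - 1, n_features)` components, so the three iris classes allow exactly two:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

iris_x, iris_y = load_iris(return_X_y=True)
ix_train, ix_test, iy_train, iy_test = train_test_split(
    iris_x, iris_y, test_size=0.4, random_state=0
)

lda_demo = LinearDiscriminantAnalysis(n_components=2)
ix_train_2d = lda_demo.fit_transform(ix_train, iy_train)  # supervised: the fit uses the labels
ix_test_2d = lda_demo.transform(ix_test)                  # reuse the fitted projection
```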



I will try to optimize the __hyperparameters__ of the __Kernel-SVM__ Algorithm using the __Grid Search__ Technique.

In [7]:
from sklearn.svm import SVC 

# possible hyperparameters
c_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
gamma_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]

# accuracy
best_score = 0
best_params = {'C': None, 'gamma': None}

# Kernel SVM Hyperparameters optimization
for c in c_values:
    for gamma in gamma_values:
        # train the model for every hyperparameter value pair
        classifier = SVC(kernel = 'rbf', C = c, gamma = gamma)
        classifier.fit(x_train, y_train)
        score = classifier.score(x_test, y_test)
        
        # rate the accuracy of the model using each hyperparameter value pair
        if score > best_score:
            best_score = score
            best_params['C'] =  c
            best_params['gamma'] = gamma

# best score
print best_score, best_params 

0.957983193277 {'C': 1, 'gamma': 0.01}
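
The loop above scores every candidate on the test set, so the best score is selected on the same data it is later reported on. One common alternative is to cross-validate on the training set only, for example with scikit-learn's `GridSearchCV`. A minimal sketch on synthetic data (the `make_classification` dataset, the reduced grid and the `demo_*`/`dx_*` names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data, not the Admin Posts dataset.
demo_x, demo_y = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)
dx_train, dx_test, dy_train, dy_test = train_test_split(
    demo_x, demo_y, test_size=0.4, random_state=0
)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(dx_train, dy_train)                    # tuning only sees the training data

test_accuracy = search.score(dx_test, dy_test)    # held-out estimate for the best model
print(search.best_params_)
```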


Fit the __Kernel-SVM__ model with a __Gaussian kernel__ to the training data, using the derived optimal values of __C__ and __gamma__.

In [8]:
from sklearn.svm import SVC 
classifier = SVC(kernel = 'rbf', random_state = 0, C = 1, gamma = 0.01)
classifier.fit(x_train, y_train)

# Predicting the test set sentiment values
y_pred = classifier.predict(x_test)
print y_pred 

[ 2  0  0  0  0  0  0  8  0  4  1  1  0  1  2  0  0 -4  0  5  0  0  2  0  0
  3  0  1  0  8  4  0  8  0  7  6  0  0  1  0 -1  0  2  2  4  0 -2  0  0  0
  0  0  0  0  0  0  0  0  2  3  0  0  0  1  3  0  0  0  0 -2  1  0 -2  0  8
  0  0  0  0  1  2  0  8 -1  2  7  0  0  0  0 -1  0  0  2  0  2  0  3  1  0
  0  2  0  0  0  0  2  3  8 -2  0  0  0  0  0  2  2  3  0]


Compute the __Confusion matrix__ of the __Kernel-SVM__  Model

In [9]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print cm

[[ 0  0  0  0  0  0  0  0  0  0  0  0  1  0  0]
 [ 0  1  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  4  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  3  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0 69  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  9  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0 14  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  6  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  3  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  1  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  1  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  2  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  1  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  2  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  2  0  0]]


The diagonal of the __Confusion matrix__ holds the __True Positives (TP)__, i.e. the correctly classified samples: 0+1+4+3+69+9+14+6+3+1+1+2+1+0+0 = __114__ of the 119 test samples.

#### Calculating precision, recall and F1 score from the confusion matrix of a multiclass classification problem
True positive: diagonal position, cm(x, x).
False positive: sum of column x (without main diagonal), sum(cm(:, x))-cm(x, x).
False negative: sum of row x (without main diagonal), sum(cm(x, :), 2)-cm(x, x).
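
These definitions translate directly into NumPy. The sketch below uses a small made-up 3-class matrix (`toy_cm` is illustrative, not the matrix above) and the usual formulas: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (scikit-learn's layout).
toy_cm = np.array([[5, 1, 0],
                   [2, 3, 1],
                   [0, 0, 4]])

tp = np.diag(toy_cm).astype(float)   # cm(x, x)
fp = toy_cm.sum(axis=0) - tp         # column sum without the diagonal
fn = toy_cm.sum(axis=1) - tp         # row sum without the diagonal

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

For a class that is never predicted, TP + FP is zero and precision is undefined, so a real matrix with empty columns needs that case handled explicitly.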


Calculate the __Kernel-SVM__  Model Accuracy

In [10]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)*100   # accuracy as a percentage
print "The Kernel-SVM Accuracy = %f" % acc, "%"


The Kernel-SVM Accuracy = 95.798319 %
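
As a cross-check, the same figure follows from the confusion matrix printed earlier: accuracy is the diagonal sum divided by the total number of test samples. A short worked sketch using the counts copied from that matrix:

```python
# Diagonal of the confusion matrix (correct predictions per class).
diagonal = [0, 1, 4, 3, 69, 9, 14, 6, 3, 1, 1, 2, 1, 0, 0]
# The only non-zero off-diagonal entries (misclassified samples).
off_diagonal = [1, 2, 2]

correct = sum(diagonal)
total = correct + sum(off_diagonal)
accuracy_pct = 100.0 * correct / total
print("%.6f" % accuracy_pct)
```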
