# Thesis
***

## Aspect-Based Sentiment Analysis
***

### Project Lifecycle 

* 1. Problem Statement

* 2. Data Collection 

* 3. EDA

* 4. Machine Learning Approach

* 5. Result validation and Report

## Load Formatted Data

### Import libraries

In [1]:
import numpy as np
import pandas as pd
import re
import time
import seaborn as sns
import matplotlib.pyplot as plt
import pickle

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter, defaultdict
import spacy
from tqdm import tqdm

import datapurifier as dp
from datapurifier import Mleda, Nlpeda, Nlpurifier, MlReport
from datapurifier import NLAutoPurifier

In [2]:
sentence_aspect_df = pd.read_csv('sentence_aspect_df.csv')

In [3]:
print(sentence_aspect_df.shape)

(76545, 7)


In [4]:
sentence_aspect_df.sample(2)

| | location_id | sentences | clean_sentences | ind | aspect | aspect_score | aspect_seed_word |
|---|---|---|---|---|---|---|---|
| 63866 | 3603705 | I actually contacted onstar 2x,both times I co... | i actually contacted onstar 2x both times i co... | 63866 | MISCELLANEOUS | 0.207828 | provider |
| 35446 | 3603735 | The 1st agent I talked to tried telling there ... | the 1st agent i talked to tried telling there ... | 35446 | ADMINISTRATION | 0.402787 | agent |


# 4. Machine Learning Approach

Ref Paper: https://ieeexplore.ieee.org/document/9279217

* 4.1 Aspect Extraction  
* 4.2 Aspect Detection
* 4.3 Aspect's Sentiment detection

## 4.3 Aspect's Sentiment detection

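* Using the pretrained `nlptown/bert-base-multilingual-uncased-sentiment` model (a multilingual BERT-base model fine-tuned on product reviews, not DistilBERT) to predict a 1 to 5 star sentiment rating for each sentence. The rating is then attributed to the aspect already assigned to that sentence.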

Ref: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment?text=I+like+you.+I+love+you
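
The model returns five logits per sentence, one per star rating. A minimal sketch of how such logits map to a rating is shown here in plain Python. The function name `stars_from_logits` and the logit values are hypothetical and only illustrate the mapping; the notebook itself applies `torch.argmax` to the model output.

```python
import math

def stars_from_logits(logits):
    """Map five class logits to (star rating, probability of that rating)."""
    # Softmax, with the maximum subtracted first for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Class index 0 corresponds to 1 star and index 4 to 5 stars
    best = max(range(len(probs)), key=probs.__getitem__)
    return best + 1, probs[best]

# Made-up logits, for illustration only
stars, confidence = stars_from_logits([-2.1, -1.0, 0.3, 1.9, 2.4])
print(stars, round(confidence, 3))
```

Keeping the softmax probability alongside the rating is optional, but it can help to flag sentences where the model is split between neighbouring ratings.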

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [7]:
# Load the model and move it to the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

bert_model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
bert_model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elemen

In [16]:
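# Calculate the sentiment value.
# The model predicts a star rating from 1 to 5: 1 indicates the most negative sentiment and 5 the most positive.
sentiment_scores = []

with torch.no_grad():  # inference only, so gradient tracking is not needed
    for sentence in tqdm(sentence_aspect_df['clean_sentences']):
        # Truncate to BERT's 512-token limit so that very long sentences do not raise an error
        tokens = tokenizer.encode(sentence, return_tensors='pt', truncation=True, max_length=512)
        logits = bert_model(tokens.to(device)).logits
        # argmax returns a class index from 0 to 4; adding 1 maps it to 1-5 stars
        sentiment_scores.append(int(torch.argmax(logits)) + 1)

sentence_aspect_df['aspect_sentiment_score'] = sentiment_scores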

100%|████████████████████████████████████████████████████████████████████████████| 76545/76545 [16:46<00:00, 76.08it/s]
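
The loop above runs one forward pass per sentence. A possible way to shorten the run is to score sentences in fixed-size batches. The sketch below shows only the chunking logic; `score_in_batches`, `dummy_score_batch` and the batch size are hypothetical names and values, and the stand-in scorer exists only so the example runs without the model.

```python
def score_in_batches(sentences, score_batch, batch_size=32):
    """Apply score_batch to consecutive chunks and return one flat list of scores."""
    scores = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # score_batch must return one score per sentence in the batch
        scores.extend(score_batch(batch))
    return scores

# Stand-in scorer (assumption): returns a constant rating of 3 for every sentence
def dummy_score_batch(batch):
    return [3 for _ in batch]

demo_scores = score_in_batches(
    ["good", "bad", "fine", "ok", "great"], dummy_score_batch, batch_size=2
)
print(demo_scores)
```

With the real model, `score_batch` could tokenize the batch with `tokenizer(batch, padding=True, truncation=True, return_tensors='pt')`, run `bert_model` under `torch.no_grad()`, and return `(logits.argmax(dim=1) + 1).tolist()`. Whether this is faster, and which batch size works best, depends on the hardware.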


In [17]:
sentence_aspect_df.to_csv('sentence_aspect_sentiment_df.csv', index=False)

In [21]:
sentence_aspect_df.head(2)

| | location_id | sentences | clean_sentences | ind | aspect | aspect_score | aspect_seed_word | aspect_sentiment_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 3603702 | The representative that helped me was amazing. | the representative that helped me was amazing | 0 | ADMINISTRATION | 0.604803 | representative | 5 |
| 1 | 3603702 | She was kind, professional and went over and b... | she was kind professional and went over and be... | 1 | EXPERIENCE | 0.302708 | experience | 3 |


# 5. Report/ Result Analysis

* The report dashboard was built with Google Data Studio

* Report Link: https://datastudio.google.com/reporting/18a4a89f-da16-411e-8d02-141a8f705d26
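
Before loading the saved CSV into a dashboard, a per-aspect summary can serve as a quick sanity check. The sketch below averages `aspect_sentiment_score` per `aspect` over a few illustrative rows. The helper name `mean_score_by_aspect` and the sample rows are assumptions, and the source does not state which aggregations the dashboard itself uses.

```python
from collections import defaultdict

def mean_score_by_aspect(rows):
    """Return {aspect: mean aspect_sentiment_score} for an iterable of row dicts."""
    totals = defaultdict(lambda: [0, 0])  # aspect -> [sum of scores, row count]
    for row in rows:
        entry = totals[row["aspect"]]
        # int() also accepts the string values that csv.DictReader produces
        entry[0] += int(row["aspect_sentiment_score"])
        entry[1] += 1
    return {aspect: total / count for aspect, (total, count) in totals.items()}

# Illustrative rows only, not taken from the full dataset
sample_rows = [
    {"aspect": "ADMINISTRATION", "aspect_sentiment_score": 5},
    {"aspect": "EXPERIENCE", "aspect_sentiment_score": 3},
    {"aspect": "ADMINISTRATION", "aspect_sentiment_score": 2},
]
aspect_means = mean_score_by_aspect(sample_rows)
print(aspect_means)
```

On the full file, the same summary can be obtained with pandas via `pd.read_csv('sentence_aspect_sentiment_df.csv').groupby('aspect')['aspect_sentiment_score'].mean()`.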

End