## Importing required libraries

This notebook applies Natural Language Processing (NLP), the subfield of artificial intelligence concerned with enabling computers to understand, interpret, and generate human language, to customer reviews of smartphones.
Our goal is to extract aspect-based sentiments from those reviews.

We start by importing the following libraries:

1. **pandas** (imported as `pd`): Pandas is a powerful library for data manipulation and analysis. It provides data structures like DataFrames that are widely used in handling textual data during NLP tasks.

2. **numpy** (imported as `np`): NumPy is a fundamental library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, making it invaluable for numerical processing in NLP.

3. **emoji**: The emoji library allows us to work with emojis within the text data. Emojis are becoming increasingly common in text communication, and this library facilitates their handling in NLP tasks.

4. **re**: Python's built-in `re` module provides regular expressions for pattern matching and string manipulation. Regular expressions play a central role in preprocessing text and extracting information from it.

5. **string**: The `string` module provides various functions and constants related to string manipulation. It comes in handy when dealing with text preprocessing and filtering out unwanted characters.

6. **collections.Counter**: This class from the `collections` module is useful for counting the occurrences of elements in a collection. It is especially helpful for analyzing the frequency distribution of words in text data.

In addition, we use **spaCy** (`spacy`), an efficient NLP framework, and import the `Tokenizer` class from the `spacy.tokenizer` module to split text into individual tokens. The `pprint` module pretty-prints results so that they are easier to read.

Let's import these libraries and get started.

In [3]:
# Required Libraries

#Base and Cleaning 
import pandas as pd
import numpy as np
import emoji
import re
import string
from collections import Counter


#Natural Language Processing (NLP)
import spacy
from spacy.tokenizer import Tokenizer
from pprint import pprint
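
As a quick illustration of how `re`, `string`, and `Counter` fit together, the sketch below counts word frequencies in a few made-up review snippets. The sample texts and the `word_frequencies` helper are hypothetical and not part of the dataset or of this notebook's pipeline.

```python
import re
import string
from collections import Counter

# Hypothetical sample reviews, used only for illustration
sample_reviews = [
    "Great camera, great battery!",
    "Battery drains fast. Camera is OK.",
]

def word_frequencies(texts):
    """Lowercase each text, strip punctuation, and count the remaining words."""
    # Translation table that deletes every ASCII punctuation character
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        cleaned = text.lower().translate(strip_punct)
        # Every run of non-whitespace characters is treated as one word
        counts.update(re.findall(r"\S+", cleaned))
    return counts

freq = word_frequencies(sample_reviews)
print(freq.most_common(3))
```
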

## Importing Android and iOS Reviews Data

In this section, we import the Android and iOS smartphone review data from CSV files. We use the `pandas` library to read the files and store the data in data frames for further exploration.

### Importing Android Reviews Data

We first import the Android review data from the CSV file named `android_reviews_new.csv`. This file contains user reviews and feedback for various Android smartphones.


In [7]:
android_reviews = pd.read_csv("../Data collection/android_reviews_new.csv")

### Importing iOS Reviews Data
Next, we import the iOS smartphone review data from the CSV file named `ios_reviews.csv`. This file contains user reviews and feedback for the iPhone 11 and iPhone SE.

In [8]:
ios_reviews = pd.read_csv("../Data collection/ios_reviews.csv")

In [13]:
android_reviews.head(10)

Unnamed: 0,Review,Make,Model,Software
0,good,Samsung,Galaxy S21 Ultra,Android
1,No for that you have to spend $300 to buy a Ga...,Samsung,Galaxy S21 Ultra,Android
2,"So no heart rate reader, no Sp02 reader... ?",Samsung,Galaxy S21 Ultra,Android
3,"So no heart rate reader, no Sp02 reader... ?",Samsung,Galaxy S21 Ultra,Android
4,"Abs123, 24 Jul 2021I bought Samsung ultra s21 ...",Samsung,Galaxy S21 Ultra,Android
5,I bought Samsung ultra s21 4 days ago and sinc...,Samsung,Galaxy S21 Ultra,Android
6,"""The Always On Display also works at 60Hz - we...",Samsung,Galaxy S21 Ultra,Android
7,"Bob, 05 Jul 2021I'd be interested to see if th...",Samsung,Galaxy S21 Ultra,Android
8,"No SD card slot, no money from me. I have the ...",Samsung,Galaxy S21 Ultra,Android
9,"I'm not a computer or technical savvy person, ...",Samsung,Galaxy S21 Ultra,Android


In [15]:
ios_reviews.head(10)

Unnamed: 0,Review,Make,Model,Software
0,Very bad experience with this iPhone xr phone....,Apple,11,iOS
1,Amazing phone with amazing camera coming from ...,Apple,11,iOS
2,So I got the iPhone XR just today. The product...,Apple,11,iOS
3,I've been an android user all my life until I ...,Apple,11,iOS
4,I was delivered a phone that did not work imme...,Apple,11,iOS
5,It has been a month since I started using my i...,Apple,11,iOS
6,The phone is hanging. Video quality is not ver...,Apple,11,iOS
7,I'll use this review to mostly say what I'm no...,Apple,11,iOS
8,Went with the iPhone XR after over a month of ...,Apple,11,iOS
9,NOTE:,Apple,11,iOS


## Data Exploration: Shape, Data Types, and Missing Values

Before analyzing the Android and iOS reviews, let's perform some initial data exploration to understand the structure and quality of our data frames.

### Checking the Shape of Data Frames

To start, let's examine the shape of our data frames to understand the number of rows and columns in each data set.


In [9]:
# Shape of Android Reviews Data Frame
android_shape = android_reviews.shape
print(f"Android Reviews Data Frame Shape: {android_shape}")

# Shape of iOS Reviews Data Frame
ios_shape = ios_reviews.shape
print(f"iOS Reviews Data Frame Shape: {ios_shape}")

Android Reviews Data Frame Shape: (4664, 4)
iOS Reviews Data Frame Shape: (14722, 4)


### Checking Data Types
Next, let's inspect the data types of the columns in both data frames. Understanding the data types is essential to ensure that the data is appropriately represented for further analysis.

In [11]:
# Data Types in Android Reviews Data Frame
android_data_types = android_reviews.dtypes
print("Data Types in Android Reviews Data Frame:")
print(android_data_types)

# Data Types in iOS Reviews Data Frame
ios_data_types = ios_reviews.dtypes
print("\nData Types in iOS Reviews Data Frame:")
print(ios_data_types)

Data Types in Android Reviews Data Frame:
Review      object
Make        object
Model       object
Software    object
dtype: object

Data Types in iOS Reviews Data Frame:
Review      object
Make        object
Model       object
Software    object
dtype: object


### Convert the values to string type

The `dtypes` output shows that every column holds text (`object`), so we cast all columns explicitly to `str` to guarantee a consistent type.

In [12]:
all_columns = list(android_reviews) # Creates list of all column headers
android_reviews[all_columns] = android_reviews[all_columns].astype(str)

all_columns = list(ios_reviews) # Creates list of all column headers
ios_reviews[all_columns] = ios_reviews[all_columns].astype(str)

### Checking Missing Values

Lastly, let's check whether the data frames contain any missing values. Missing reviews can distort the analysis, so they need to be handled before we continue.

In [16]:
"""
Remove null rows. Only the reviews were
fetched; all other columns were hardcoded during CSV creation,
so we can safely drop these rows.
"""
print(len(android_reviews[android_reviews.Review.isnull()]))
print(len(ios_reviews[ios_reviews.Review.isnull()]))

android_reviews = android_reviews.dropna()
ios_reviews = ios_reviews.dropna()

0
0
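
One caveat about the order of these steps: in pandas 1.x and 2.x, casting a column with `astype(str)` turns a missing value into the literal string `nan`, which `isnull()` no longer flags. Since the cast above ran before this check, a count of zero does not by itself prove that no reviews were missing. The plain-Python sketch below mimics the effect on a made-up list and shows a check that still catches such placeholders. The `is_missing` helper and the sample values are hypothetical.

```python
import math

# Hypothetical raw column: one real review, one None, one NaN, one blank
raw_values = ["good", None, float("nan"), "   "]

# Casting everything to str hides the missing entries as ordinary text
as_text = [str(v) for v in raw_values]

def is_missing(value):
    """Return True for None, NaN, blank strings, and 'nan'/'none' placeholders."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in {"", "nan", "none"}

missing_before = sum(is_missing(v) for v in raw_values)
missing_after = sum(is_missing(v) for v in as_text)
print(as_text, missing_before, missing_after)
```

A genuine review consisting only of the word "nan" or "none" would be flagged too, so running the null check before the string cast is the simpler option.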


## Data Cleaning: User Reviews Preprocessing

In this section, we clean and preprocess the user reviews. Raw reviews often contain elements that reduce the accuracy of the analysis, so we apply the following cleaning steps:

1. **Handling URLs**: User reviews might contain URLs or hyperlinks that do not contribute to the sentiment analysis. We remove them so that only meaningful text remains.

2. **Handling Emojis**: Emojis convey emotion graphically, but they are difficult to process as text. We strip them from the reviews so that they do not interfere with our NLP tasks.

These steps turn the raw reviews into clean text that is ready for sentiment analysis.


In [17]:
def give_emoji_free_text(text):
    """
    Removes emojis from a review
    Accepts:
        Text (review)
    Returns:
        Text (emoji-free review)
    """
    return emoji.replace_emoji(text)

def url_free_text(text):
    """
    Removes URLs (anything starting with "http") from a review
    """
    return re.sub(r'http\S+', '', text)

# Create a new column with emoji-free reviews
android_reviews['emoji_free_text'] = android_reviews['Review'].apply(give_emoji_free_text)
ios_reviews['emoji_free_text'] = ios_reviews['Review'].apply(give_emoji_free_text)

# Create a new column with URL-free reviews
android_reviews['url_free_text'] = android_reviews['emoji_free_text'].apply(url_free_text)
ios_reviews['url_free_text'] = ios_reviews['emoji_free_text'].apply(url_free_text)
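
The pattern `http\S+` only catches links that start with `http`. A slightly broader sketch, which also matches bare `www.` links and collapses the whitespace left behind, might look like the following. The `strip_urls` name, the `URL_PATTERN` constant, and the sample sentence are illustrative assumptions, not part of the pipeline above.

```python
import re

# Matches http://, https://, or bare www. links up to the next whitespace
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)

def strip_urls(text):
    """Remove URLs, then collapse repeated whitespace into single spaces."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

sample = "See https://example.com/review and www.example.org for details"
cleaned = strip_urls(sample)
print(cleaned)
```
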

## Sentiment Analysis: Importing Pre-trained ABSA Checkpoints

In this section, we perform Aspect-Based Sentiment Analysis (ABSA) using pre-trained checkpoints. ABSA is an NLP task that identifies the sentiment polarity expressed towards specific aspects or entities mentioned in a text.

To simplify the implementation, we use the `pyabsa` library, which provides access to several pre-trained ABSA checkpoints. These checkpoints are pre-trained models that detect the sentiment associated with individual aspects in user reviews, product feedback, or any other text that expresses opinions about multiple aspects.

### Importing Available Checkpoints

To get started, we will import the `available_checkpoints` function from `pyabsa`. This function allows us to retrieve a mapping of the available pre-trained ABSA checkpoints that we can use for our sentiment analysis tasks.

In [18]:
from pyabsa import available_checkpoints
checkpoint_map = available_checkpoints()

/bin/sh: nvidia-smi: command not found
No CUDA GPU found in your device
[2023-08-10 14:58:26] (2.3.1) PyABSA(2.3.1): If your code crashes on Colab, please use the GPU runtime. Then run "pip install pyabsa[dev] -U" and restart the kernel.
Or if it does not work, you can use v1.16.27
Try to downgrade transformers<=4.29.0.
[2023-08-10 14:58:36] (2.3.1) Please specify the task code, e.g. from pyabsa import TaskCodeOption
  _warn(f"unclosed running multiprocessing pool {self!r}",


### Importing ATEPCCheckpointManager

To get started with aspect extraction, we import `ATEPCCheckpointManager` from `pyabsa`. This manager class provides convenient functions to access and manage pre-trained checkpoints for Aspect Term Extraction and Polarity Classification (ATEPC).

#### Initializing the Aspect Extractor
Next, we initialize the aspect extractor using the `get_aspect_extractor` function of `ATEPCCheckpointManager`, which returns a pre-trained aspect extractor for ABSA.


In [20]:
from pyabsa import ATEPCCheckpointManager

aspect_extractor = ATEPCCheckpointManager.get_aspect_extractor(checkpoint='english',
                                   auto_device=True  # False means load model on CPU
                                   )

[2023-08-10 14:59:31] (2.3.1) Downloading checkpoint:english
[2023-08-10 14:59:31] (2.3.1) Notice: The pretrained model are used for testing, it is recommended to train the model on your own custom datasets
[2023-08-10 14:59:31] (2.3.1) Checkpoint already downloaded, skip
[2023-08-10 14:59:31] (2.3.1) Load aspect extractor from checkpoints/ATEPC_ENGLISH_CHECKPOINT/fast_lcf_atepc_English_cdw_apcacc_82.36_apcf1_81.89_atef1_75.43
[2023-08-10 14:59:31] (2.3.1) config: checkpoints/ATEPC_ENGLISH_CHECKPOINT/fast_lcf_atepc_English_cdw_apcacc_82.36_apcf1_81.89_atef1_75.43/fast_lcf_atepc.config
[2023-08-10 14:59:31] (2.3.1) state_dict: checkpoints/ATEPC_ENGLISH_CHECKPOINT/fast_lcf_atepc_English_cdw_apcacc_82.36_apcf1_81.89_atef1_75.43/fast_lcf_atepc.state_dict
[2023-08-10 14:59:31] (2.3.1) model: None
[2023-08-10 14:59:31] (2.3.1) tokenizer: checkpoints/ATEPC_ENGLISH_CHECKPOINT/fast_lcf_atepc_English_cdw_apcacc_82.36_apcf1_81.89_atef1_75.43/fast_lcf_atepc.tokenizer
[2023-08-10

Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2Model: ['mask_predictions.classifier.bias', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.LayerNorm.weight', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Special tokens have b

In [21]:
inference_source = android_reviews['url_free_text'].to_list()
atepc_result = aspect_extractor.extract_aspect(inference_source=inference_source,  #
                          pred_sentiment=True,  # Predict the sentiment of extracted aspect terms
                          )

preparing ate inference dataloader: 100%|██████████| 4664/4664 [00:08<00:00, 530.28it/s] 
extracting aspect terms: 100%|██████████| 146/146 [58:48<00:00, 24.17s/it]   
preparing apc inference dataloader: 100%|██████████| 7992/7992 [03:09<00:00, 42.15it/s]  
  lcf_cdm_vec = torch.tensor(
  float(x) for x in F.softmax(i_apc_logits).cpu().numpy().tolist()
classifying aspect sentiments: 100%|██████████| 250/250 [1:16:49<00:00, 18.44s/it]   


[2023-08-10 17:18:43] (2.3.1) The results of aspect term extraction have been saved in /Users/chiraghs/Library/Mobile Documents/com~apple~CloudDocs/UON/Dissertation/Notebooks/Aspect Term Extraction and Polarity Classification.FAST_LCF_ATEPC.result.json
[2023-08-10 17:18:43] (2.3.1) Example 0: good
[2023-08-10 17:18:43] (2.3.1) Example 1: No for that you have to spend $ 300 to buy a <Galaxy:Neutral Confidence:0.9989> watch : D
[2023-08-10 17:18:43] (2.3.1) Example 2: So no heart rate reader , no Sp02 reader . . . ?
[2023-08-10 17:18:43] (2.3.1) Example 3: So no heart rate reader , no Sp02 reader . . . ?
[2023-08-10 17:18:43] (2.3.1) Example 4: Abs123 , 24 Jul 2021I bought Samsung ultra s21 4 days ago and since day 1 its restarting on its own randomly , stuc . . . moreCheck a place you bought it
[2023-08-10 17:18:43] (2.3.1) Example 5: I bought Samsung ultra s21 4 days ago and since day 1 its restarting on its own randomly , stucks at <android recovery:Negative Confidence:0

The `checkpoint` parameter specifies which pre-trained model to load. In this case, we use `'english'` to load an English-language model.

The `auto_device` parameter is set to `True` here, which means the model is loaded on a GPU when one is available. To load the model on the CPU instead, set `auto_device=False`.

With the extractor initialized, the call to `extract_aspect` above processes every cleaned Android review and, because `pred_sentiment=True`, also predicts the sentiment polarity of each extracted aspect term. No CUDA GPU was found on this machine, so the run used the CPU and took more than two hours for the 4,664 reviews.
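
Given that the full run above took more than two hours, it may be worth removing exact duplicate reviews (rows 2 and 3 of the Android preview are identical) and trying a small sample before processing everything. The sketch below uses plain Python lists; the `dedupe_preserving_order` helper and the sample data are hypothetical.

```python
def dedupe_preserving_order(texts):
    """Drop exact duplicates while keeping the first occurrence of each text."""
    # dict keys keep insertion order, so the original ordering is preserved
    return list(dict.fromkeys(texts))

# Hypothetical stand-in for android_reviews['url_free_text'].to_list()
reviews = [
    "good",
    "So no heart rate reader, no Sp02 reader... ?",
    "So no heart rate reader, no Sp02 reader... ?",
    "No SD card slot, no money from me.",
]

unique_reviews = dedupe_preserving_order(reviews)
trial_batch = unique_reviews[:2]  # small sample for a quick trial run
print(len(reviews), len(unique_reviews), trial_batch)
```

A list such as `trial_batch` could then be passed as `inference_source` to `extract_aspect` to inspect the output before committing to the full run. Deduplicating changes the alignment with the data frame rows, so keep a mapping if the results need to be joined back.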

## Output Format
The output from the aspect extractor is a structured representation of the aspects found in the input text. It typically includes the following information for each identified aspect:

1. **Aspect Term**: The specific aspect or entity mentioned in the text.

2. **Positional Information**: The start and end positions of the aspect term in the original text.

3. **Aspect Category**: The category or type of the aspect. In ABSA more generally, aspects are often grouped into categories such as "service," "price," or "quality"; the log output above shows only the term, its polarity, and a confidence score.

4. **Sentiment Polarity**: Whether the sentiment expressed towards the aspect is positive, negative, or neutral, reported together with a confidence score (for example, `<Galaxy:Neutral Confidence:0.9989>` in the log above).
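
To make the structure concrete, the sketch below flattens results into one row per aspect. It assumes each result is a dictionary with parallel `aspect`, `sentiment`, and `confidence` lists plus the original `sentence`. That layout is an assumption about the PyABSA output and should be checked against `atepc_result[0]` before relying on it. The sample data is made up, apart from the `Galaxy` entry, which mirrors the log output above.

```python
# Hypothetical results shaped like the assumed extractor output
sample_results = [
    {"sentence": "good", "aspect": [], "sentiment": [], "confidence": []},
    {
        "sentence": "No for that you have to spend $ 300 to buy a Galaxy watch : D",
        "aspect": ["Galaxy"],
        "sentiment": ["Neutral"],
        "confidence": [0.9989],
    },
]

def flatten_results(results):
    """Return one dict per extracted aspect, tagged with its review index."""
    rows = []
    for review_id, result in enumerate(results):
        triples = zip(result["aspect"], result["sentiment"], result["confidence"])
        for aspect, sentiment, confidence in triples:
            rows.append({
                "review_id": review_id,
                "aspect": aspect.lower(),  # normalize case so variants group together
                "sentiment": sentiment,
                "confidence": confidence,
            })
    return rows

aspect_rows = flatten_results(sample_results)
print(aspect_rows)
```

Reviews without any extracted aspect, such as the one-word review "good", simply contribute no rows.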

### Utilizing Aspect Information
The extracted aspects and their associated sentiment polarities provide valuable insights for further analysis. We can use this information to understand which aspects of a product or service are positively or negatively perceived by users. These insights can be leveraged for sentiment analysis, product improvement, and better customer satisfaction.
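
One simple way to act on this is to tally sentiment labels per aspect and rank aspects by their share of negative mentions. The sketch below does this with `collections.Counter` on made-up `(aspect, sentiment)` pairs; the data and the `negative_share` helper are hypothetical, and the label strings are assumed to match those in the extractor output.

```python
from collections import Counter, defaultdict

# Hypothetical (aspect, sentiment) pairs, e.g. built from the extractor output
pairs = [
    ("battery", "Negative"), ("battery", "Negative"), ("battery", "Positive"),
    ("camera", "Positive"), ("camera", "Positive"), ("camera", "Negative"),
    ("screen", "Neutral"),
]

def negative_share(aspect_sentiments):
    """Map each aspect to the fraction of its mentions labelled Negative."""
    tallies = defaultdict(Counter)
    for aspect, sentiment in aspect_sentiments:
        tallies[aspect][sentiment] += 1
    return {
        aspect: counts["Negative"] / sum(counts.values())
        for aspect, counts in tallies.items()
    }

shares = negative_share(pairs)
# Aspects with the highest share of negative mentions come first
ranked = sorted(shares, key=shares.get, reverse=True)
print(ranked, shares)
```

With real data, it is sensible to ignore aspects that are mentioned only a handful of times, since their shares are unreliable.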

Let's now leverage the aspect information and proceed with sentiment analysis to understand the overall sentiment expressed in user reviews.