## 📌 Introduction

Customer reviews are **unstructured text data**: they show how people feel about a product or service, but that signal is hard to read at scale without automation.

This project uses **sentiment analysis** to classify **Amazon reviews** as **positive**, **negative**, or **neutral**, giving a summary view of customer opinion.


## 🎯 Objective

The objective of this project is to build an **automated sentiment analysis system** that determines the **sentiment** expressed in **Amazon customer reviews** using **Natural Language Processing (NLP)** techniques.


## 🛠️ Methodology

### 1. **Data Collection**
The dataset consists of **Amazon customer reviews**, loaded here from the Amazon Fine Food Reviews `Reviews.csv` file on Kaggle.

---

### 2. **Data Preparation**

Tasks include:

- **2.1** Loading and reading the dataset  
- **2.2** Cleaning and transforming the data to ensure high-quality input for analysis
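
The cleaning task above can be sketched roughly as follows. This is a minimal illustration, assuming pandas and the `Id`, `Score`, and `Text` columns; the `clean_reviews` helper and the toy frame are hypothetical and are not the notebook's actual cleaning code.

```python
import pandas as pd

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop unusable rows and tidy the review text."""
    out = df.dropna(subset=["Text"]).copy()            # a review without text cannot be scored
    out["Text"] = out["Text"].astype(str).str.strip()  # trim stray leading/trailing whitespace
    out = out[out["Text"] != ""]                       # drop reviews that are empty after trimming
    out = out.drop_duplicates(subset=["Text"])         # keep one copy of repeated review texts
    return out.reset_index(drop=True)

# Toy data (assumption) standing in for the real reviews
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Score": [5, 1, 5, 3],
    "Text": ["  Great coffee!  ", None, "Great coffee!", "   "],
})
cleaned = clean_reviews(toy)
print(cleaned)
```

Whether duplicate texts should really be dropped depends on the analysis; it is shown here only as one plausible cleaning choice.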

---

### 3. **Preprocessing**

Several preprocessing steps were undertaken:

- **3.1** Tokenization  
- **3.2** Stop-word removal  
- **3.3** Stemming/Lemmatization  

These steps help **standardize** the text and **reduce noise**.
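
The three steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, and `crude_stem` is a deliberately simple stand-in for a real stemmer (such as the Porter stemmer shipped with NLTK), not the method used in this notebook.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a fuller list
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "was", "were", "it", "this", "of", "to", "i"}

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (keeping inner apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token: str) -> str:
    """Strip a few common suffixes, leaving at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("The dogs loved this treat and were asking for more"))
# → ['dog', 'lov', 'treat', 'ask', 'for', 'more']
```

Stems such as `lov` are not dictionary words; that is normal for stemming, and lemmatization is the usual alternative when readable base forms are needed.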

---

### 4. **Sentiment Analysis**

The final step involves classifying each review as:

- **Positive**
- **Negative**
- **Neutral**


In [1]:
# twython is optional here: nothing below uses it directly, so skip it if it is not installed
try:
    from twython import Twython
except ModuleNotFoundError:
    Twython = None

In [None]:
# import the core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

plt.style.use('ggplot')

In [None]:
file_path="/kaggle/input/amazon-fine-food-reviews/Reviews.csv"

In [None]:
#load the dataset into a dataframe
dataset=pd.read_csv(file_path)

In [None]:
#read the first 5 rows of the dataset
dataset.head()

In [None]:
#check the dataset shape 
dataset.shape

<h5 style="color:#183D3D">The dataset contains 568454 rows and 10 columns</h5>

In [None]:
#dataset columns name 
dataset.columns

In [None]:
# count the missing values in each column
dataset.isna().sum()

In [None]:
# total number of missing values across all columns
dataset.isna().sum().sum()

In [None]:
# keep only the columns needed for this project
# (dropped: UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Time, Summary)
dataset = dataset[['Id', 'ProductId', 'Score', 'Text']].copy()

In [None]:
dataset.shape

## 🧠 How Sentiment Analysis Works

### 1. **Tokenization**  
The first step is to split the text into **tokens**, the individual words or phrases that the analyzer scores. Breaking the text into these uniform units lets each one be processed in a consistent way.

### 2. **Polarity Score Calculation**  
For each text, a set of **polarity scores** is calculated. The analyzer used here returns four values, stored under the keys `pos`, `neg`, `neu`, and `compound`:

- **Positive (Pos):** The share of the text that reads as positive.  
- **Negative (Neg):** The share of the text that reads as negative.  
- **Neutral (Neu):** The share of the text that reads as neutral.  
- **Compound (Comp):** A single normalized value between -1 (most negative) and +1 (most positive) that summarizes the overall sentiment of the text.

Together, **Pos**, **Neg**, and **Neu** describe how the text splits across the three tones, while **Comp** gives one overall assessment:
- **Positive Comp values** → positive sentiment  
- **Negative Comp values** → negative sentiment  
- **Comp values near zero** → neutral sentiment
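
To make the compound score concrete, here is a small sketch of how a raw summed valence can be squashed into the [-1, 1] range and then turned into a label. The `alpha=15` constant and the ±0.05 cut-offs are the commonly cited VADER defaults; treat both as assumptions here, since the notebook itself does not state them. The function names are illustrative.

```python
import math

def normalize(raw_score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded summed valence into [-1, 1] (assumed VADER-style normalization)."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    return max(-1.0, min(1.0, norm))

def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to a sentiment label using symmetric cut-offs (assumed values)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical raw valence sums for three reviews
for raw in (4.0, 0.0, -2.5):
    compound = normalize(raw)
    print(raw, round(compound, 4), label_from_compound(compound))
```

In the notebook, `label_from_compound` could be applied to the `compound` column produced by the analyzer; the cut-offs are worth tuning against the star ratings rather than taking as fixed.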


In [None]:
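from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm  # progress bar for the row-by-row loop below

# SentimentIntensityAnalyzer needs the VADER lexicon; fetch it if it is not already available
nltk.download('vader_lexicon', quiet=True)

sia = SentimentIntensityAnalyzer()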

In [None]:
# compute the polarity scores for every review in the dataset
# res maps each review Id to a dict with the keys neg, neu, pos, and compound
res = {}
for i, row in tqdm(dataset.iterrows(), total=len(dataset)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)

In [None]:
#show the polarity score for each text 
res

In [None]:
# convert res to a DataFrame called vaders (one row per Id), then merge it with the dataset on the Id column
vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index': 'Id'})
vaders = vaders.merge(dataset, on='Id', how='left')
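
The reshaping step is easier to follow on a toy example. The sketch below uses made-up scores and two made-up reviews (all `toy_*` names are assumptions) to show how a dictionary keyed by `Id` becomes one row per review and is then joined back to the review data.

```python
import pandas as pd

# Toy stand-ins (assumptions) for the real `res` dictionary and `dataset` frame
toy_res = {
    1: {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.8},
    2: {"neg": 0.6, "neu": 0.4, "pos": 0.0, "compound": -0.7},
}
toy_dataset = pd.DataFrame({"Id": [1, 2], "Score": [5, 1], "Text": ["Loved it", "Awful"]})

# orient="index" gives one row per review Id, the same shape as pd.DataFrame(toy_res).T
toy_vaders = pd.DataFrame.from_dict(toy_res, orient="index")
toy_vaders = toy_vaders.rename_axis("Id").reset_index()

# join the scores to the review columns on the shared Id key
toy_vaders = toy_vaders.merge(toy_dataset, on="Id", how="left")
print(toy_vaders)
```

Naming the join key with `on="Id"` keeps the merge from silently matching on any other column the two frames happen to share.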


In [None]:
vaders.head(5)

In [None]:
# plot the mean VADER scores for each star rating
fig, axs = plt.subplots(1, 3, figsize=(12, 3))
sns.barplot(data=vaders, x='Score', y='pos', ax=axs[0])
sns.barplot(data=vaders, x='Score', y='neu', ax=axs[1])
sns.barplot(data=vaders, x='Score', y='neg', ax=axs[2])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')
plt.tight_layout()
plt.show()