# Assignment - 4 Sentiment Analysis for AI Infra and Architecture

By Abhinav Raghav

*   200517685
*   AIDI -F23



# *Important Note:*

*Hello Professor, I encountered unexpected issues with my Azure keys last night, which prevented timely submission. Desperate to meet the deadline, I proceeded to process everything locally, including database ingestion.*

*I regret any inconvenience caused and kindly request your understanding and leniency in grading. I assure you this was a one-time occurrence, and I will take precautions to avoid such issues in the future.*

*Hope you understand and like my implementation!*


## Acquiring Data :
In this project I'll be doing sentiment analysis of Amazon reviews. The architecture combines natural language processing (NLP) techniques, machine learning, and visualization tools. Sentiment analysis is performed using pre-trained models (NLTK, TextBlob) and a custom sentiment model trained with scikit-learn. The results are visualized using matplotlib and Plotly.


`AMAZON-REVIEW-DATA-CLASSIFICATION.csv`

First, we'll read in the data :

In [53]:
import pandas as pd

reviews_df = pd.read_csv('AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

Let's take a look at the data :

In [None]:
reviews_df.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


We will apply sentiment analysis with pretrained models to the `reviewText` column and see the results. In a proper workflow, we would first apply all the necessary text preprocessing to this column.
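
As a rough idea of what such preprocessing could look like, here is a minimal sketch using only the standard library. The function name `preprocess_review` and the exact cleaning rules are assumptions made for illustration; they are not part of this notebook's pipeline.

```python
import re

def preprocess_review(text: str) -> str:
    """Lowercase, drop non-letter characters (keeping apostrophes), collapse whitespace."""
    text = text.lower()
    # Replace anything that is not a letter, apostrophe, or whitespace with a space
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Collapse runs of whitespace (including newlines and tabs) into single spaces
    return re.sub(r"\s+", " ", text).strip()

cleaned = preprocess_review("Waste of money!!! It wouldn't load to my system.")
print(cleaned)
```

Cleaning like this is mainly useful for the bag-of-words model trained later. VADER uses punctuation and capitalization as intensity cues, so the pretrained models below are probably better served by the raw text.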

## Sentiment Analysis:

### Using `nltk`
A sentiment model is built into `nltk`. First let's download the `vader` package:

In [54]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [55]:
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

Let's take a sample review and check the sentiment:

In [56]:
sample_review = reviews_df['reviewText'][2]
sample_review

"Waste of money!!! It wouldn't load to my system."

In [57]:
sia.polarity_scores(sample_review)

{'neg': 0.315, 'neu': 0.685, 'pos': 0.0, 'compound': -0.5684}
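
To turn the `compound` score into a label, a common convention is to treat scores of at least 0.05 as positive and scores of at most -0.05 as negative, with everything in between as neutral. The helper below is a small sketch of that rule; the name `label_from_compound` and the default threshold are assumptions, not part of `nltk`.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the output above
scores = {'neg': 0.315, 'neu': 0.685, 'pos': 0.0, 'compound': -0.5684}
label = label_from_compound(scores['compound'])
print(label)
```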

### Using `TextBlob`

In [None]:
from textblob import TextBlob

TextBlob(sample_review)

TextBlob("Waste of money!!! It wouldn't load to my system.")

In [None]:
TextBlob(sample_review).sentiment

Sentiment(polarity=-0.390625, subjectivity=0.0)

### Training a Sentiment Model from Scratch

Below, we will quickly train a sentiment model using `scikit-learn` from the class labels provided for the review text.

In [None]:
# Remove NAs
reviews = reviews_df.dropna()

In [None]:
reviewText = reviews['reviewText'].str.replace('\n', ' ').str.replace('\t', ' ').str.lower()
reviewText

0        purchased for youngster who inherited my "too ...
1                                    unable to open or use
2         waste of money!!! it wouldn't load to my system.
3        i attempted to install this os on two differen...
4        i've spent 14 fruitless hours over the past tw...
                               ...                        
69995    purchased for my daughter's 8th birthday.  we ...
69996    visual link i is the only spanish learning sof...
69997    great product for making up all your needed le...
69998    another tax year almost over.  i could not hav...
69999    great journaling software at a good price, try...
Name: reviewText, Length: 69977, dtype: object

In [None]:
y = reviews['isPositive']

Though we have not fully preprocessed the text, we will apply simple count vectorization and see the initial result:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')

X = cv.fit_transform(reviewText)
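
To see what count vectorization produces, here is a small pure-Python sketch of the same idea on two reviews. It uses the token pattern that `CountVectorizer` applies by default (words of two or more characters). The stop list is a tiny illustrative one, not scikit-learn's English list, so the vocabulary will differ from the real `cv`.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")      # tokens of two or more word characters
STOP_WORDS = {"to", "or", "of", "it", "my"}   # illustrative stop list (assumption)

def tokenize(doc):
    """Lowercase the document and keep tokens that are not stop words."""
    return [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in STOP_WORDS]

docs = [
    "unable to open or use",
    "waste of money!!! it wouldn't load to my system.",
]

# One Counter per document, then a shared, alphabetically sorted vocabulary
counts = [Counter(tokenize(d)) for d in docs]
vocabulary = sorted(set().union(*counts))

# Document-term matrix: one row per review, one column per vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a review and each column counts one vocabulary word. `X` above has the same layout, stored as a sparse matrix.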

Now let's split into train and test sets, using 30% of the data as the test set:

In [None]:
# Train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

X_train.shape

(48983, 51017)

In [None]:
X_test.shape

(20994, 51017)
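
The `stratify=y` argument asks `train_test_split` to keep the same share of positive and negative labels in both splits. The sketch below illustrates the idea in plain Python on made-up labels. The function `stratified_split` is a simplified stand-in written for this example, not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    """Split indices so each class contributes about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 60 positive and 40 negative reviews
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels)

test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_share)
```

Passing `random_state` to `train_test_split` would additionally make the split, and therefore the reported accuracy, reproducible across runs.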

Now we can fit and evaluate the model, predicting positive sentiment (`y = 1`):

In [None]:
# Fit and evaluate the model
from sklearn.linear_model import LogisticRegression

# Instantiate
sentiment_logit = LogisticRegression(max_iter=2000)

# Fit
sentiment_logit.fit(X_train, y_train)

# score
sentiment_logit.score(X_test, y_test)

0.8484805182433076

We can see the model has ~85% accuracy on the test set! Before looking at which words are most predictive of positive sentiment, let's view the reviews split by label:

In [None]:
import pandas as pd

# Filter positive and negative reviews
positive_reviews = reviews_df[reviews_df['isPositive'] == 1]
negative_reviews = reviews_df[reviews_df['isPositive'] == 0]

# Display a table of positive reviews
print("Positive Reviews:")
display(positive_reviews[['reviewText', 'isPositive']])

# Display a table of negative reviews
print("\nNegative Reviews:")
display(negative_reviews[['reviewText', 'isPositive']])


Positive Reviews:


Unnamed: 0,reviewText,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",1.0
6,The download doesn't take long at all. And it'...,1.0
7,This program is positively wonderful for word ...,1.0
8,Fantastic protection!! Great customer support!!,1.0
9,Obviously Win 7 now the last great operating s...,1.0
...,...,...
69993,I would recommend this software for students w...,1.0
69996,Visual Link I is the ONLY Spanish learning sof...,1.0
69997,Great product for making up all your needed le...,1.0
69998,Another tax year almost over. I could not hav...,1.0



Negative Reviews:


Unnamed: 0,reviewText,isPositive
1,unable to open or use,0.0
2,Waste of money!!! It wouldn't load to my system.,0.0
3,I attempted to install this OS on two differen...,0.0
4,I've spent 14 fruitless hours over the past tw...,0.0
5,I purchased the home and business because I wa...,0.0
...,...,...
69985,"VERY OLD PROGRAM, NOT COMPATIBLE NOR UPGRADABL...",0.0
69987,The description reads that the program is user...,0.0
69991,I have installed this product 4 times and have...,0.0
69994,"I kid you not, this suite,program,resource *TH...",0.0


In [None]:
pip install plotly



# Top 10 Positive and Negative Coefficients :
- These are the values obtained from the fitted sentiment analysis model (the logistic regression above).

- Coefficients represent the weight or influence of specific words in determining whether a given text (in this case, reviews) is positive or negative.



In [None]:
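# Build a Series of model coefficients indexed by vocabulary word.
# coef_series is not defined in an earlier cell; this line assumes it comes from
# the fitted sentiment_logit and the CountVectorizer cv above.
coef_series = pd.Series(sentiment_logit.coef_[0], index=cv.get_feature_names_out())

# Convert coefficients series to DataFrame
coef_df = coef_series.to_frame(name='Coefficient')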

# Display the coefficients as a table
print("Top 10 Positive Coefficients:")
print(coef_df.nlargest(10, 'Coefficient'))

print("\nTop 10 Negative Coefficients:")
print(coef_df.nsmallest(10, 'Coefficient'))

Top 10 Positive Coefficients:
            Coefficient
excellent      2.647593
fantastic      2.222196
flawlessly     2.173198
downfall       2.107640
roofs          2.055563
relief         2.053154
flawless       1.926934
awesome        1.893672
organizing     1.893254
highly         1.875323

Top 10 Negative Coefficients:
               Coefficient
worst            -2.735212
useless          -2.468813
unusable         -2.456357
waste            -2.349445
unacceptable     -2.123390
sucks            -2.103137
disappointing    -2.079721
terrible         -2.040809
crashing         -1.999931
worthless        -1.966514
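
One way to read these numbers: in a logistic regression, each coefficient is a change in the log-odds of the positive class, so exponentiating it gives an odds ratio. With count features, one extra occurrence of a word multiplies the odds of a positive review by `exp(coefficient)`, holding the other word counts fixed. The sketch below applies this to a few coefficients copied from the tables above.

```python
import math

# Coefficients copied from the tables above
top_coefficients = {
    "excellent": 2.647593,
    "fantastic": 2.222196,
    "worst": -2.735212,
    "useless": -2.468813,
}

# Odds ratio per additional occurrence of the word in a review
odds_ratios = {word: math.exp(coef) for word, coef in top_coefficients.items()}

for word, ratio in odds_ratios.items():
    print(f"{word}: odds of a positive review multiplied by {ratio:.2f}")
```

Ratios above 1 push the prediction towards positive and ratios below 1 push it towards negative.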


In [None]:
import plotly.express as px
import plotly.graph_objects as go

# Function to create a bar plot for positive and negative coefficients
def plot_coefficients(coef_series, title):
    fig = px.bar(
        coef_series,
        orientation='h',
        labels={'index': 'Feature', 'value': 'Coefficient'},
        title=title,
        color=coef_series.index,
        color_continuous_scale=px.colors.diverging.RdYlGn,
    )
    fig.update_layout(yaxis=dict(categoryorder='total ascending'))
    return fig

# Plot positive coefficients
positive_fig = plot_coefficients(coef_series.nlargest(10), 'Top 10 Positive Coefficients')
positive_fig.show()

# Plot negative coefficients
negative_fig = plot_coefficients(coef_series.nsmallest(10), 'Top 10 Negative Coefficients')
negative_fig.show()

# Save the figures as HTML files
positive_fig.write_html('positive_coefficients.html')
negative_fig.write_html('negative_coefficients.html')

# Print the HTML filenames
print("Positive Coefficients HTML saved as 'positive_coefficients.html'")
print("Negative Coefficients HTML saved as 'negative_coefficients.html'")


Positive Coefficients HTML saved as 'positive_coefficients.html'
Negative Coefficients HTML saved as 'negative_coefficients.html'


# Word Cloud for Reviews:
- Context: The word cloud represents the most frequent words in the reviews. The size of each word in the cloud is proportional to its frequency in the entire dataset.

- Interpretation: Larger words are mentioned more frequently in reviews, providing an overview of the most common terms used by reviewers.

In [None]:
pip install wordcloud plotly




In [None]:
from wordcloud import WordCloud
import plotly.express as px
import pandas as pd

# Concatenate all reviews into a single string
all_reviews = ' '.join(reviews_df['reviewText'].dropna())

# Generate a word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_reviews)

# Convert the word cloud to a Plotly figure
wordcloud_fig = px.imshow(wordcloud.to_array(), binary_string=True)
wordcloud_fig.update_layout(title='Word Cloud for Reviews', xaxis=dict(visible=False), yaxis=dict(visible=False))

# Display the interactive word cloud
wordcloud_fig.show()

# Generate frequencies from the word cloud and convert to DataFrame
word_frequencies = wordcloud.process_text(all_reviews)
word_df = pd.DataFrame(list(word_frequencies.items()), columns=['Word', 'Frequency'])

# Display the top words in tabular form
print("Top Words and Their Frequencies:")
print(word_df.nlargest(10, 'Frequency'))


Top Words and Their Frequencies:
         Word  Frequency
40   software      20137
14        use      20025
57    program      19728
74    product      15776
58       work      13200
198       one      12373
41   computer      11365
178   problem      10384
23       will      10347
171      time       9975


# Distribution of Positive and Negative Comments:

In [None]:
import plotly.express as px

# Data preparation
positive_count = reviews_df['isPositive'].sum()
negative_count = len(reviews_df) - positive_count
labels = ['Positive', 'Negative']
sizes = [positive_count, negative_count]

# Create a Pie chart using Plotly
fig = px.pie(
    values=sizes,
    names=labels,
    labels={'Positive': 'Positive', 'Negative': 'Negative'},
    title='Distribution of Positive and Negative Comments',
    color_discrete_sequence=['#66b3ff', '#ff9999']
)

# Update layout for percentage display
fig.update_traces(textinfo='percent+label')

# Show the interactive pie chart
fig.show()


In [None]:
# Create a DataFrame for tabular representation
data = {'Sentiment': labels, 'Count': sizes, 'Percentage': [f'{size/sum(sizes)*100:.2f}%' for size in sizes]}
sentiment_df = pd.DataFrame(data)

# Display the tabular form
print("Sentiment Distribution Table:")
print(sentiment_df)


Sentiment Distribution Table:
  Sentiment    Count Percentage
0  Positive  43692.0     62.42%
1  Negative  26308.0     37.58%
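
This class balance also gives a useful baseline for the classifier trained earlier: a model that always predicts the majority class would already be right for every positive review. The sketch below compares that baseline with the test accuracy reported above. Note that the counts come from all 70,000 rows while the model was trained after dropping rows with missing values, so the comparison is approximate.

```python
# Counts from the sentiment distribution table and accuracy from the model cell above
positive_count = 43692
negative_count = 26308
model_accuracy = 0.8484805182433076

# Accuracy of always predicting the majority class
baseline_accuracy = max(positive_count, negative_count) / (positive_count + negative_count)
improvement = model_accuracy - baseline_accuracy

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
print(f"Improvement over baseline: {improvement:.4f}")
```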


# Distribution of Review Ratings:

- Context: This bar plot shows the distribution of review labels in the `isPositive` column, where '1' represents positive reviews and '0' represents negative reviews.

- Interpretation: It provides insight into the overall sentiment of the reviews. A distribution skewed towards '1' indicates a higher prevalence of positive sentiment, while a balanced distribution suggests a mix of positive and negative sentiments.

In [None]:
import plotly.express as px

# Count the number of positive and negative reviews
# sort_index() orders the counts by label (0 = negative, 1 = positive) so they match the names below
review_counts = reviews_df['isPositive'].value_counts().sort_index()

# Create a Bar chart using Plotly
fig = px.bar(
    x=['Negative', 'Positive'],
    y=review_counts,
    color=['Negative', 'Positive'],
    labels={'x': 'Sentiment', 'y': 'Count'},
    title='Distribution of Review Ratings',
    color_discrete_map={'Negative': '#ff9999', 'Positive': '#66b3ff'}  # Same colors as the pie chart
)

# Show the interactive bar chart
fig.show()


In [None]:
# Create a DataFrame for tabular representation
data = {'Sentiment': ['Negative', 'Positive'], 'Count': review_counts.values}
sentiment_df = pd.DataFrame(data)

# Display the tabular form
print("Sentiment Distribution Table:")
print(sentiment_df)


Sentiment Distribution Table:
  Sentiment  Count
0  Negative  26308
1  Positive  43692


# Time-based Analysis:

- Context: The line plot shows the number of reviews over time, aggregated by month. Each point on the plot represents the count of reviews in a specific month.

- Interpretation: Trends in the number of reviews over time can reveal patterns, such as seasonal variations or changes in user engagement. A sudden increase or decrease might be attributed to specific events or product releases.
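
The monthly aggregation can be illustrated with the standard library on the five `time` values shown in the first rows of the dataset. This is only a sketch of the idea: unlike a pandas resample, a `Counter` does not include months with zero reviews.

```python
from collections import Counter
from datetime import datetime, timezone

# Unix timestamps (seconds) from the first five rows of the 'time' column
timestamps = [1361836800, 1452643200, 1433289600, 1518912000, 1441929600]

def month_key(ts: int) -> str:
    """Convert a Unix timestamp to a 'YYYY-MM' month label in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")

# Count how many reviews fall into each month
monthly_counts = Counter(month_key(ts) for ts in timestamps)

for month, count in sorted(monthly_counts.items()):
    print(month, count)
```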

In [None]:
import plotly.express as px

# Convert timestamp to datetime
reviews_df['datetime'] = pd.to_datetime(reviews_df['time'], unit='s')

# Resample data by month and count reviews
monthly_reviews = reviews_df.resample('M', on='datetime').size().reset_index(name='Review Count')

# Create a Line chart using Plotly
fig = px.line(
    monthly_reviews,
    x='datetime',
    y='Review Count',
    labels={'datetime': 'Date', 'Review Count': 'Review Count'},
    title='Monthly Review Count Over Time',
    line_shape='linear',
    color_discrete_sequence=['#66b3ff']
)

# Show the interactive line chart
fig.show()


In [None]:
# Create a DataFrame for tabular representation
print("Monthly Review Count Over Time:")
print(monthly_reviews)

Monthly Review Count Over Time:
      datetime  Review Count
0   1999-11-30             4
1   1999-12-31             7
2   2000-01-31            11
3   2000-02-29            14
4   2000-03-31            16
..         ...           ...
223 2018-06-30           206
224 2018-07-31           156
225 2018-08-31            75
226 2018-09-30             6
227 2018-10-31             3

[228 rows x 2 columns]


# Distribution of Review Length:

- Context: The histogram displays the distribution of the length of reviews, measured in the number of characters.

- Interpretation: It helps understand the variability in the length of reviews. Peaks in certain length ranges may indicate common review styles—short and concise versus long and detailed. This information can guide strategies for responding to reviews based on their length.

In [None]:
import plotly.express as px

# Create a new column for review length, handling missing values
reviews_df['review_length'] = reviews_df['reviewText'].apply(lambda x: len(str(x)) if pd.notna(x) else 0)

# Create a Histogram using Plotly
fig = px.histogram(
    reviews_df,
    x='review_length',
    nbins=50,
    labels={'review_length': 'Review Length', 'count': 'Frequency'},
    title='Distribution of Review Length',
    color_discrete_sequence=['skyblue']
)

# Show the interactive histogram
fig.show()


In [None]:
import numpy as np

# Bin the review lengths directly: the Plotly histogram trace stores the raw values,
# not bin edges or counts, so the table is computed with NumPy instead.
# np.histogram uses 50 equal-width bins, which may differ from the bins Plotly draws.
counts, bin_edges = np.histogram(reviews_df['review_length'], bins=50)

# Create a DataFrame for tabular representation
review_length_df = pd.DataFrame({
    'Review Length Bins': [f'{left:.0f}-{right:.0f}' for left, right in zip(bin_edges[:-1], bin_edges[1:])],
    'Frequency': counts
})

# Display the tabular form
print("Distribution of Review Length:")
print(review_length_df)


# Verification Status Distribution:

- True (Verified): Reviews marked as "True" in the verification status distribution are those that have undergone a verification process. This process often involves confirming that the reviewer has indeed purchased or used the product they are reviewing.

- False (Not Verified): Reviews marked as "False" in the distribution are those that have not undergone the verification process. These reviews may lack confirmation of the reviewer's actual experience with the product.

In [None]:
import plotly.graph_objects as go

# Count the verification status
verification_counts = reviews_df['verified'].value_counts()

# Create a Bar chart using Plotly graph_objects
fig = go.Figure(data=[
    go.Bar(x=verification_counts.index, y=verification_counts.values, marker_color=['#66b3ff', '#ff9999'])
])

# Update layout for better readability
fig.update_layout(
    title='Verification Status Distribution',
    xaxis_title='Verification Status',
    yaxis_title='Count',
    showlegend=False
)

# Show the interactive bar chart
fig.show()


In [None]:
# Create a DataFrame for tabular representation
verification_df = pd.DataFrame({
    'Verification Status': verification_counts.index,
    'Count': verification_counts.values
})

# Display the tabular form
print("Verification Status Distribution:")
print(verification_df)

Verification Status Distribution:
   Verification Status  Count
0                 True  47208
1                False  22792


# Summary Sentiment Statistics:

- Count: The total number of reviews for which sentiment analysis of the summary has been conducted. In this case, there are 70,000 reviews.

- Mean: The average sentiment polarity score across all the reviews. Sentiment polarity typically ranges from -1 to 1, where -1 indicates a highly negative sentiment, 0 indicates neutral, and 1 indicates a highly positive sentiment.

- Standard Deviation: A measure of the variability or dispersion of sentiment scores. A higher standard deviation indicates a wider range of sentiment scores, suggesting more diverse opinions among the reviews.

- Min: The minimum sentiment polarity score observed. In this case, the minimum score is -1, indicating the most negative sentiment.

- 25th Percentile (Q1): The sentiment polarity score below which 25% of the reviews fall. It provides insight into the lower range of sentiment scores.

- 50th Percentile (Median or Q2): The median sentiment polarity score, representing the middle value when all scores are sorted in ascending order. It separates the higher 50% of sentiment scores from the lower 50%.

- 75th Percentile (Q3): The sentiment polarity score below which 75% of the reviews fall. It provides insight into the higher range of sentiment scores.

- Max: The maximum sentiment polarity score observed. In this case, the maximum score is 1, indicating the most positive sentiment.
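
To make these definitions concrete, here is a small sketch that computes the same statistics on five made-up polarity scores using the standard library. The scores are hypothetical and chosen only for illustration. The `inclusive` quantile method interpolates linearly between sorted values, which should correspond to the default percentiles reported by pandas.

```python
import statistics

# Hypothetical polarity scores, chosen only to illustrate the statistics
polarities = [-1.0, 0.0, 0.0, 0.5, 1.0]

summary = {
    "count": len(polarities),
    "mean": statistics.mean(polarities),
    "std": statistics.stdev(polarities),  # sample standard deviation (n - 1 in the denominator)
    "min": min(polarities),
    "quartiles": statistics.quantiles(polarities, n=4, method="inclusive"),  # Q1, median, Q3
    "max": max(polarities),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Note how both the first quartile and the median can be exactly 0 when many scores are neutral.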



In [None]:
import plotly.express as px

# Calculate sentiment polarity for each summary using TextBlob
reviews_df['summary_sentiment'] = reviews_df['summary'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)

# Create a Histogram using Plotly
fig = px.histogram(
    reviews_df,
    x='summary_sentiment',
    nbins=30,
    labels={'summary_sentiment': 'Sentiment Polarity', 'count': 'Frequency'},
    title='Distribution of Summary Sentiment',
    color_discrete_sequence=['lightgreen']
)

# Show the interactive histogram
fig.show()

# Compute and display summary sentiment statistics
summary_sentiment_stats = reviews_df['summary_sentiment'].describe()
print("Summary Sentiment Statistics:")
print(summary_sentiment_stats)

Summary Sentiment Statistics:
count    70000.000000
mean         0.165343
std          0.396973
min         -1.000000
25%          0.000000
50%          0.000000
75%          0.433333
max          1.000000
Name: summary_sentiment, dtype: float64


- Overall Sentiment: The mean sentiment score gives an overall indication of whether the reviews, based on their summaries, are generally positive, negative, or neutral.

- Variability: The standard deviation helps assess how consistent or varied the sentiment is across the reviews. A higher standard deviation suggests a wider range of sentiments.

- Distribution Percentiles: The percentiles (25th, 50th, 75th) provide insights into the distribution of sentiment scores, helping to identify the central tendency and spread of sentiments.

- Extreme Sentiments: The minimum and maximum values highlight the most negative and most positive sentiments expressed in the reviews.



# Distribution of Log Votes:

- Logarithmic Transformation: The vote counts associated with reviews have been transformed using the logarithmic function. Logarithmic transformations are commonly used to compress data with a wide range into a more manageable and interpretable form. In this context, it's applied to the vote counts.

- Log Votes: Each review is associated with a vote count, and the logarithmic transformation is applied to these counts. The transformed values are then used to create bins for the distribution.

- Histogram: The distribution is visualized as a histogram, where the x-axis represents the logarithmically transformed vote counts, and the y-axis represents the frequency or count of reviews falling into each bin.

- Interpretation: The histogram provides insights into the distribution of user engagement or voting activity on reviews. Logarithmic transformation is often used when there is a wide range of values, and it helps in emphasizing differences in the lower range of values.

- Logarithmic Scale: The logarithmic scale is beneficial when dealing with data that spans several orders of magnitude, as it can reveal patterns and details in the lower end of the scale that might be overshadowed in a linear scale.

In [None]:
import plotly.express as px

# Create a Histogram using Plotly
fig = px.histogram(
    reviews_df,
    x='log_votes',
    nbins=30,
    labels={'log_votes': 'Log Votes', 'count': 'Frequency'},
    title='Distribution of Log Votes',
    color_discrete_sequence=['salmon']
)

# Show the interactive histogram
fig.show()


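# Count how many reviews share each log_votes value and display the distribution
vote_distribution = reviews_df['log_votes'].value_counts()
print("Log Votes Distribution:")
print(vote_distribution)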


Log Votes Distribution:
0.000000    50341
1.098612     4977
1.386294     3006
1.609438     2078
1.791759     1468
            ...  
4.718499        1
5.117994        1
5.916202        1
6.363028        1
4.969813        1
Name: log_votes, Length: 226, dtype: int64
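
The dataset does not document which transform produced `log_votes`. A quick check suggests a natural logarithm: exponentiating the most frequent values above gives whole numbers. Whether those numbers are the raw vote counts or the counts plus one is an assumption we cannot confirm from this output.

```python
import math

# The five most frequent log_votes values from the output above
observed_log_votes = [0.0, 1.098612, 1.386294, 1.609438, 1.791759]

# Undo a natural-log transform and round away the printing precision
recovered = [round(math.exp(value), 3) for value in observed_log_votes]

for log_value, original in zip(observed_log_votes, recovered):
    print(f"{log_value:.6f} -> {original}")
```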


- Common Engagement: Most reviews may receive a relatively low number of votes, contributing to the peak on the lower end of the logarithmic scale.

- Few Highly Voted Reviews: There might be a smaller number of reviews that received a significantly higher number of votes, contributing to a long tail on the higher end of the logarithmic scale.

- User Participation: Understanding the distribution of log votes can provide insights into the level of user engagement with the reviews. It helps identify whether a few reviews receive a large number of votes or if the engagement is more evenly distributed.

- Impactful Reviews: Reviews with higher vote counts may indicate that they had a more significant impact on users or garnered more attention from the community.

# Conclusions Reached:

- Model Accuracy : The logistic regression model trained with scikit-learn achieves an accuracy of approximately 84.8% on the test set. This suggests good performance in classifying reviews as positive or negative.

- Sentiment Distribution : There are 43,692 positive reviews and 26,308 negative reviews, indicating a higher count of positive sentiments in the dataset.

- Top Positive Coefficients : Words like "excellent," "fantastic," and "flawless" have high positive coefficients, suggesting they strongly contribute to positive sentiments.

- Top Negative Coefficients : Words like "worst," "useless," and "unusable" have high negative coefficients, indicating strong associations with negative sentiments.

- Top Words and Frequencies: Commonly used words include "software," "use," "program," and "product." Understanding these frequent words helps grasp the main topics discussed in reviews.

- Review Volume Over Time : The monthly review count shows an overall increasing trend, with some fluctuations and a drop in the final months of the data (75, 6, and 3 reviews from August to October 2018). This information can be valuable for understanding the temporal dynamics of reviews.

- Distribution of Review Length : The histogram of review lengths shows the range of lengths present in the dataset; per-bin frequencies are not reported here.

- Verification Status : 47,208 reviews are verified, while 22,792 are not, so roughly two thirds (67.4%) of the reviews are verified.

- Summary Sentiment Statistics : The summary sentiment statistics show that, on average, summaries tend to have a mildly positive sentiment (mean = 0.165). The 25th percentile and the median are both 0, so a large share of summaries score as neutral.