# Assignment 8 (Week 9)

## Name: Theresa Louise Bazudde

## Importing necessary libraries

In [None]:
# Built-in library
import itertools
import re
from typing import Any, Optional, Sequence, Union

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 1_000

#removing stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

#Lemmatizing
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

#word cloud
from wordcloud import WordCloud

#sentiment analysis
from textblob import TextBlob

#model building
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


### Data Columns
- Review Title: The title of the review.
- Review_Text: The full text of the review.
- Verified_Buyer: Whether the reviewer is a verified buyer of the product.
- Review_Date: The date the review was published relative to the review scrape date.
- Review_Location: The location of the reviewer.
- Review_Upvotes: How many times the review was upvoted by other reviewers.
- Review_Downvotes: How many times the review was downvoted by other reviewers.
- Product: The name of the product the review was issued for.
- Brand: The brand of the product.
- Scrape Date: The date the data was pulled from the web.

### Objectives
1. Exploratory Data Analysis.
2. Sentiment Analysis.

## Loading the data

In [None]:
df = pd.read_csv(r"C:\Users\user\Documents\Folder\CSV\UltaSkincareReviews.csv")
df.head()

In [None]:
#dataset info
df.info()

The data contains both numerical and categorical data.

## Exploratory Data Analysis

The following steps will be performed:
<br>-checking for null values and removing them
<br>-converting the Review_Text column to string data type
<br>-converting the Review_Text column to lowercase
<br>-changing all contractions to their full form
<br>-splitting data into numerical and categorical
<br>-making count plots to see which products were reviewed most and whether buyers are verified

As I did not collect the data myself, I must check for null values, as they would cause errors down the line.

In [None]:
# Checking for null values
df.isna().sum()

The dataset is large, and therefore removing a few data points would not cause a major difference.

In [None]:
#removing null values
df = df[~df["Review_Text"].isnull()]
df = df[~df["Review_Location"].isnull()]

In [None]:
#checking that the null values are gone
df.isnull().sum()

To have a uniform data type in the column, all reviews are converted to strings.

In [None]:
# Changing all reviews to string variables
df['Review_Text'] = df["Review_Text"].astype(str)

As Python string comparison is case sensitive, the review text is converted to lower case so that, for example, 'Love' and 'love' are not falsely identified as different words.

In [None]:
# Changing all to lower case
df['Review_Text'] = df['Review_Text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df.head()

All contractions in the review text column are converted to their full form for better analysis.
A dictionary containing contractions and their full forms is created, and each word of a review is then looked up in it and replaced by its full form.

In [None]:
#a dictionary of contractions
contractions = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'll": "he will",
    "he's": "he is",
    "i'd": "i would",
    "i'll": "i will",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it's": "it is",
    "let's": "let us",
    "mustn't": "must not",
    "shan't": "shall not",
    "she'd": "she would",
    "she'll": "she will",
    "she's": "she is",
    "shouldn't": "should not",
    "that's": "that is",
    "there's": "there is",
    "they'd": "they would",
    "they'll": "they will",
    "they're": "they are",
    "they've": "they have",
    "we'd": "we would",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "where's": "where is",
    "who'll": "who will",
    "who's": "who is",
    "won't": "will not",
    "wouldn't": "would not",
    "you'd": "you would",
    "you'll": "you will",
    "you're": "you are",
    "you've": "you have",
    "'s": " is"
}
#changing contractions to full form
df['Review_Text'] = df['Review_Text'].apply(lambda x: ' '.join([contractions.get(word, word) for word in x.split()]))
df.head()
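
The lookup above works on whitespace-separated tokens, so a token with attached punctuation (for example `don't,`) is not found in the dictionary, and the `"'s"` key can never equal a whole token. A minimal sketch of a regex-based alternative follows. The names `SAMPLE_CONTRACTIONS` and `expand_contractions` are hypothetical, and treating every leftover `'s` as "is" is an assumption that is wrong for possessives.

```python
import re

# Small illustrative subset; the notebook's full `contractions` dict could be passed instead.
SAMPLE_CONTRACTIONS = {
    "don't": "do not",
    "it's": "it is",
    "can't": "cannot",
}

def expand_contractions(text, mapping):
    """Replace whole-word contractions, then expand any leftover "'s" suffix."""
    # Longest keys first so that longer contractions win over shorter ones.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    expanded = pattern.sub(lambda m: mapping[m.group(1)], text)
    # Assumption: a remaining "'s" means "is"; this is wrong for possessives.
    return re.sub(r"(?<=\w)'s\b", " is", expanded)

print(expand_contractions("don't worry, it's great and the bottle's small.", SAMPLE_CONTRACTIONS))
# prints: do not worry, it is great and the bottle is small.
```

In the notebook this could be applied with `df['Review_Text'].apply(lambda x: expand_contractions(x, contractions))`, leaving the `"'s"` entry out of the dictionary.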

A count plot for the products to determine which product is more popular amongst those reviewed

In [None]:
#products
sns.countplot(y='Product', data=df)

#### Conclusion
Daily Superfoliant and Daily Microfoliant are the most popular among the reviewed products

## Word Cloud

In [None]:
# join all review text for each product into a single string
grouped_reviews = df.groupby('Product')['Review_Text'].apply(lambda x: ' '.join(x)).reset_index()

# create a word cloud for each product
for index, row in grouped_reviews.iterrows():
    product = row['Product']
    text = row['Review_Text']
    wordcloud = WordCloud(width=800, height=800, background_color='white', max_words=50, contour_width=3, contour_color='steelblue').generate(text)
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.title(product)
    plt.show()

## Sentiment Analysis
Questions to be answered are:
<ol>
<li>Are most reviews positive or negative?</li>
<li>Which products have majority negative reviews?</li>
<li>Which have majority positive reviews?</li></ol>

Stop words such as 'the' and 'and' are removed because they carry little meaning for the analysis.

In [None]:
stop = stopwords.words('english')
df['Review_Text'] = df['Review_Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df.head()
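
The filtering step can be sketched on a single string as below. `SAMPLE_STOP_WORDS`, `remove_stop_words`, and the `keep` parameter are hypothetical names used only for illustration. NLTK's English list includes negations such as 'not', so an optional `keep` set is shown as one possible design choice for a sentiment task, not as what the notebook does.

```python
# Hypothetical miniature stop-word list; in the notebook this would be
# stopwords.words('english') from NLTK.
SAMPLE_STOP_WORDS = {"the", "and", "is", "a", "it", "not"}

def remove_stop_words(text, stop_words, keep=frozenset()):
    """Drop whitespace-separated tokens that are stop words, except those in keep."""
    # A set gives constant-time membership checks, unlike a list.
    effective = set(stop_words) - set(keep)
    return " ".join(token for token in text.split() if token not in effective)

print(remove_stop_words("the cream is light and it absorbs fast", SAMPLE_STOP_WORDS))
# prints: cream light absorbs fast
print(remove_stop_words("it is not good", SAMPLE_STOP_WORDS, keep={"not"}))
# prints: not good
```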

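Lemmatising reduces words to their root form, for example 'running' to 'run', so that different forms of the same word are treated alike. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun by default, so verb forms such as 'running' are only reduced when a part-of-speech tag is supplied.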

In [None]:
#lemmatising
df['Review_Text'] = df['Review_Text'].apply(lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split()]))
df.head(5)

Special characters are also removed as they add no meaning to the sentiment analysis.

In [None]:
#removing special characters
df['Review_Text'] = df['Review_Text'].apply(lambda x: re.sub(r'[^A-Za-z0-9\s]+', '', x))
df.head()
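
To see what this pattern does to a single review, here is a small sketch. The function name `strip_special_characters` and its `replace_with` parameter are hypothetical. Deleting the matched characters fuses tokens such as `10/10` into `1010`, so replacing them with a space is shown as an alternative.

```python
import re

# Raw string so the backslash in \s is passed to the regex engine unchanged.
NON_ALNUM = re.compile(r"[^A-Za-z0-9\s]+")

def strip_special_characters(text, replace_with=""):
    """Replace runs of non-alphanumeric, non-whitespace characters, then tidy spaces."""
    cleaned = NON_ALNUM.sub(replace_with, text)
    return " ".join(cleaned.split())

sample = "love it!!! 10/10, would buy again"
print(strip_special_characters(sample))        # love it 1010 would buy again
print(strip_special_characters(sample, " "))   # love it 10 10 would buy again
```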

### TextBlob polarity
TextBlob's default sentiment analyzer is lexicon-based rather than a trained classifier.
<br>It scores text using a set of predefined words and their associated polarities.
<br>Polarity lies between -1 and 1, with negative words giving rise to negative values and positive words giving positive values.


In [None]:
#polarity of the reviews
df['Polarity'] = df['Review_Text'].map(lambda x: TextBlob(x).sentiment.polarity)
df.head()
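
The idea of scoring text from predefined words and their associated polarities can be illustrated with a toy example. This is not TextBlob's actual algorithm: the lexicon values and the names `TOY_LEXICON` and `toy_polarity` are invented for illustration, and the real analyzer uses a much larger lexicon and also accounts for modifiers and negation.

```python
# Hypothetical word scores for illustration only.
TOY_LEXICON = {"love": 0.5, "great": 0.8, "terrible": -1.0, "dry": -0.2}

def toy_polarity(text, lexicon):
    """Average the scores of the words found in the lexicon; 0.0 if none are found."""
    scores = [lexicon[word] for word in text.lower().split() if word in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("love this great cream", TOY_LEXICON))   # positive score
print(toy_polarity("terrible and dry", TOY_LEXICON))        # negative score
print(toy_polarity("arrived on monday", TOY_LEXICON))       # no known words
```

Text with no known words scores exactly zero, which is why the handling of zero matters when polarities are later turned into labels.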

### Polarity distribution
The distribution of polarities amongst the various reviews is examined to see whether most customers are happy or not.

In [None]:
#seeing polarity distribution
df[["Polarity"]].hist(bins=20, figsize=(15, 10))

These polarities are then converted to a sentiment label: 'positive' if the polarity is greater than or equal to zero, otherwise 'negative'.

In [None]:
df['Sentiment'] = df['Polarity'].apply(lambda x: 'positive' if x >= 0 else 'negative')
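
The rule in the cell above counts a polarity of exactly zero as positive. A possible three-way variant, sketched below, labels such reviews separately. `label_sentiment` and `neutral_band` are hypothetical names, and this is an alternative for comparison rather than the labelling used in the rest of this notebook.

```python
def label_sentiment(polarity, neutral_band=0.0):
    """Map a polarity in [-1, 1] to a label; neutral_band is a hypothetical tolerance."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

for score in (0.6, 0.0, -0.25):
    print(score, label_sentiment(score))
# prints:
# 0.6 positive
# 0.0 neutral
# -0.25 negative
```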

### Polarities for the different products
A count plot is used to analyze the different products and their sentiments. The aim is to find out whether some products have more negative reviews than positive ones.

In [None]:
g = sns.catplot(x='Sentiment', kind='count', data=df, hue='Product', palette='bright')

### Conclusion
<ul>
<li>The polarity distribution leans towards positive values; therefore, the majority of reviews are positive.</li>
<li>All products have more positive reviews than negative ones.</li>
</ul>

## Machine learning model

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['Review_Text'], df['Sentiment'], test_size=0.2)

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train a linear support vector machine (SVM) classifier
clf = LinearSVC()
clf.fit(X_train_vec, y_train)



##### Evaluating the model

In [None]:
accuracy = clf.score(X_test_vec, y_test)
print('Accuracy:', accuracy)
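
Because most reviews are labelled positive, accuracy alone may hide how the minority class is handled. The sketch below computes per-class recall in plain Python. The labels are made up for illustration and are not results from this notebook, and the helper names are hypothetical. In the notebook itself, `sklearn.metrics.classification_report(y_test, clf.predict(X_test_vec))` would give a similar per-class breakdown.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))

def recall_per_class(y_true, y_pred):
    """Fraction of each actual class that was predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    totals = Counter(y_true)
    return {label: counts[(label, label)] / totals[label] for label in totals}

# Hypothetical labels for illustration only; not results from the notebook.
actual = ["positive", "positive", "positive", "positive", "negative"]
predicted = ["positive", "positive", "positive", "positive", "positive"]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)                             # 0.8
print(recall_per_class(actual, predicted))  # {'positive': 1.0, 'negative': 0.0}
```

In this made-up case a model that always predicts 'positive' reaches 0.8 accuracy while finding none of the negative reviews.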