<a href="https://colab.research.google.com/github/acesanu/Regulatory_Compliance_Analysis/blob/main/RegulatoryComplianceAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Introduction
According to the Consumer Response Annual Report of the Consumer Financial Protection Bureau (CFPB), the Bureau sent approximately 264,100 (or 80%) of the complaints it received to companies for review and response, referred 14% to other regulatory agencies, and found 4% to be incomplete.

In [None]:
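# Q 11: What are the Most Common Words in Neutral Reviews?
# For neutral reviews, we select rows with star_rating equal to 3 and polarity equal to 0.

# Build the boolean mask first so that both conditions are properly parenthesised.
neutral_mask = (reviews_df['star_rating'] == 3) & (reviews_df['polarity'] == 0)

# Sample 10,000 neutral reviews (fixed seed for reproducibility) and generate the word cloud.
wordcloud = WordCloud(width=2500, height=2000, max_words=50,
                      background_color='White').generate(
    str(reviews_df[neutral_mask].sample(10000, random_state=0)['clean_reviews']))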

In [None]:
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud)
plt.axis('off')

Observations:
The most common words for neutral reviews are charge, use, monitor, anyone, etc.
These words don't express a clear sentiment and hence are appropriate for neutral reviews.
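
A word cloud sizes each word by how often it occurs, so the same ranking can be cross-checked with a plain frequency count. The sketch below is only an illustration: `sample_reviews` is made-up data standing in for the `clean_reviews` column, and joining the full texts with `" ".join(...)` is our assumption about how to pass complete review text rather than the abbreviated printed form that `str()` returns for a long pandas Series.

```python
from collections import Counter

# Hypothetical cleaned reviews, standing in for reviews_df['clean_reviews'].
sample_reviews = [
    "charge lasts a day use it daily",
    "monitor works anyone can use it",
    "charge is ok monitor is ok",
]

# Join the full review texts, then split on whitespace to get tokens.
tokens = " ".join(sample_reviews).split()

# Count occurrences and list the most frequent tokens.
word_counts = Counter(tokens)
top_words = word_counts.most_common(3)
print(top_words)
```

The joined string could also be passed to `WordCloud(...).generate(...)` in place of `str(...)`.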

In [None]:
# Q 12: What are the Most Common words in Negative Reviews?

# For negative reviews, we are setting star_rating to 1 and polarity to -1.

wordcloud = WordCloud(width=2500, height=2000, max_words=50,
                      background_color= 'White').generate(str(reviews_df[(reviews_df['star_rating']== 1) &
                                                      (reviews_df['polarity'] == -1)]['clean_reviews']))

In [None]:
plt.figure(figsize= (10, 10))
plt.imshow(wordcloud)
plt.axis('off')

Observation:
1) The most common words for negative reviews are horrible, worst, terrible, awful, broke, etc.
2) These words clearly express a negative emotion and are appropriate for negative reviews.

Post Data Processing & Analysis

1) After completing the analysis of the data, we can move on to fitting our Machine Learning models.
2) However, our dataset still contains many redundant columns that won't help the model make predictions.
3) We also need to remove samples whose subjectivity is lower than the subjectivity threshold value of 0.3.
4) And we need to create a sentiment column containing the labels for our machine learning model.
5) In this section, we will remove all the redundant columns, drop samples that don't satisfy our selection criteria, and then create a sentiment column.
6) We will also split the data into two subsets for training and testing purposes.
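
The first two steps of this plan (keeping only the essential columns and applying the subjectivity threshold) can be sketched on a toy dataframe. The data, the extra `review_id` column, and the names `toy_reviews`, `toy_essential` and `SUBJECTIVITY_THRESHOLD` are all made up for illustration; only the column names and the 0.3 threshold come from this notebook.

```python
import pandas as pd

# Toy stand-in for reviews_df; values and the 'review_id' column are hypothetical.
toy_reviews = pd.DataFrame({
    "review_id": [1, 2, 3, 4],
    "star_rating": [5, 3, 1, 4],
    "clean_reviews": ["love it", "it is a phone", "worst purchase ever", "pretty good"],
    "polarity": [0.5, 0.0, -1.0, 0.6],
    "subjectivity": [0.6, 0.0, 1.0, 0.8],
})

SUBJECTIVITY_THRESHOLD = 0.3  # threshold value stated in the text

# Keep only the columns needed for modelling; .copy() gives an independent dataframe.
toy_essential = toy_reviews[["star_rating", "clean_reviews", "polarity", "subjectivity"]].copy()

# Drop samples whose subjectivity falls below the threshold.
toy_essential = toy_essential[toy_essential["subjectivity"] >= SUBJECTIVITY_THRESHOLD]
print(toy_essential)
```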

In [None]:
# Removing Redundant columns
# We will remove every redundant column from our dataset
# We will create a new dataframe containing only the essential features and use this dataframe down the line.

essential_df = reviews_df[['star_rating', 'clean_reviews', 'polarity', 'subjectivity']]
essential_df.head()

# The new dataframe only contains the essential features: star_rating, clean_reviews, polarity, subjectivity.

In [None]:
# Removing Samples having Subjectivity less than 0.3
# Here, we will remove the samples having subjectivity lower than the subjectivity threshold value of 0.3

essential_df['subjectivity'].min()


In [None]:
essential_df = essential_df[essential_df['subjectivity'] >= 0.3]

In [None]:
# Checking the minimum value of subjectivity after removing samples
essential_df['subjectivity'].min()

# The minimum value of subjectivity has increased from 0 to 0.3
# We have successfully removed every sample having subjectivity less than 0.3

Creating Sentiment Column
Now, we will create a sentiment column which will provide the labels for our training samples used in the Machine Learning models.
We will use both the star_rating and polarity values to divide our reviews into different sentiments.

For positive reviews, we are using a star_rating higher than 3 and a polarity value greater than or equal to 0.5
We use such a high polarity cutoff because of the large volume of positive reviews.
We need to reduce the size of our dataset for training; otherwise our session will crash due to a shortage of RAM.

In [None]:
essential_df[(essential_df['star_rating'] > 3) & (essential_df['polarity'] >= 0.5)].head()

In [None]:
# .copy() avoids a SettingWithCopyWarning when the sentiment column is added.
positive_df = essential_df[(essential_df['star_rating'] > 3) & (essential_df['polarity'] >= 0.5)].copy()

In [None]:
positive_df['sentiment'] = 'positive'

In [None]:
positive_df.head()

# For neutral reviews, we are using a star_rating equal to 3 and a polarity value between -0.1 and 0.1

essential_df[(essential_df['star_rating'] == 3) & (essential_df['polarity'] >=-0.1) & (essential_df['polarity'] <= 0.1)].head()

neutral_df = essential_df[(essential_df['star_rating'] == 3) & (essential_df['polarity'] >= -0.1) & (essential_df['polarity'] <= 0.1)].copy()

neutral_df['sentiment']  = 'neutral'

neutral_df.head()

In [None]:
# For negative reviews, we are using a star_rating less than 3 and a polarity value less than 0.

essential_df[(essential_df['star_rating'] < 3) & (essential_df['polarity'] < 0)].head()

negative_df = essential_df[(essential_df['star_rating'] < 3) & (essential_df['polarity'] < 0)].copy()

negative_df['sentiment'] = 'negative'

negative_df.head()

In [None]:
# Joining all 3 dataframes to create our final dataset

sentiment_df = pd.concat([positive_df, neutral_df, negative_df], ignore_index=True)
sentiment_df.head()

sentiment_df.tail()

# The final dataset contains the selected reviews along with their respective sentiments.
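
The three labelling rules used above can also be applied in a single pass. The sketch below is an alternative formulation, not the approach used in this notebook: it relies on `numpy.select`, and the toy data plus the names `toy`, `conditions` and `labeled` are hypothetical. Rows matching none of the rules are dropped, which mirrors what happens when only the three filtered dataframes are concatenated. Unlike the concatenation, the original row order is preserved here.

```python
import numpy as np
import pandas as pd

# Toy ratings and polarity scores (made up for illustration).
toy = pd.DataFrame({
    "star_rating": [5, 4, 3, 3, 1, 2],
    "polarity": [0.8, 0.2, 0.05, 0.4, -0.7, 0.3],
})

# One boolean condition per sentiment, using the thresholds described in the text.
conditions = [
    (toy["star_rating"] > 3) & (toy["polarity"] >= 0.5),
    (toy["star_rating"] == 3) & toy["polarity"].between(-0.1, 0.1),
    (toy["star_rating"] < 3) & (toy["polarity"] < 0),
]
labels = ["positive", "neutral", "negative"]

# Rows that satisfy no condition receive the placeholder label and are removed.
toy["sentiment"] = np.select(conditions, labels, default="unlabeled")
labeled = toy[toy["sentiment"] != "unlabeled"].reset_index(drop=True)
print(labeled)
```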

In [None]:
# Proportional Distribution of Sentiments
# Plotting the proportional distribution of sentiments
sentiment_df['sentiment'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=80, figsize=(8, 8),
                                              fontsize=14, legend=True, cmap='summer')

plt.ylabel('Sentiments', fontsize=16)
plt.title('Proportional Distribution of Sentiments', fontsize=18)

Observations:
Our final dataset contains about 72.4% positive reviews, 20.1% negative reviews, and 7.6% neutral reviews.
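
The percentages shown in the pie chart can also be read off numerically. This is a small sketch on made-up labels (`toy_labels` is hypothetical); on the real data the same call would be applied to `sentiment_df['sentiment']`.

```python
import pandas as pd

# Hypothetical labels with an imbalance similar in spirit to the real dataset.
toy_labels = pd.Series(["positive"] * 7 + ["negative"] * 2 + ["neutral"] * 1)

# normalize=True returns fractions; convert to percentages rounded to one decimal.
shares = toy_labels.value_counts(normalize=True).mul(100).round(1)
print(shares)
```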

Saving the Final Dataset as a CSV File
We will save our final dataset as a csv file in the local system.
This will allow us to resume from this point onwards if we want to make changes to the model building process.

In [None]:
# Saving the dataframe to a csv file
sentiment_df.to_csv('review_data.csv', index=False, encoding='utf-8')

# To load the saved csv file into a dataframe
sentiment_df = pd.read_csv('review_data.csv')
sentiment_df.head()

Data Splitting

Now, we will split the dataset into Train and Test subsets.
We will use 80% data for training and the remaining 20% data for testing our models.

First, we will separate the reviews and their respective sentiment labels from the data.

In [None]:
# Separating the Reviews from the dataset
X = sentiment_df['clean_reviews'].values
X[:5]

In [None]:
# Separating the labels
y = sentiment_df['sentiment'].values
y[:5]

In [None]:
# After separating the reviews and labels, we will split the data into train and test sets.
# Using scikit-learn's train_test_split function to split the dataset into train and test sets.
# 80% of the data will be in the train set and 20% in the test set, as specified by test_size=0.2

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

In [None]:
# Checking the shapes of the training and test sets.
print('Training Data Shape:', X_train.shape, y_train.shape)
print('Testing Data Shape:', X_test.shape, y_test.shape)

# The data has been divided into training and test sets.
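
Because the sentiment classes are imbalanced, one option worth considering is a stratified split, which keeps the class proportions the same in both subsets. This is a suggestion on our part, not what the cell above does. The sketch uses the `stratify` parameter of scikit-learn's `train_test_split` on made-up data; `toy_X` and `toy_y` are hypothetical names.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: 50 reviews with a 70/20/10 class mix.
toy_X = [f"review {i}" for i in range(50)]
toy_y = ["positive"] * 35 + ["negative"] * 10 + ["neutral"] * 5

# stratify=toy_y asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("Train:", dict(train_counts))
print("Test:", dict(test_counts))
```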

In [None]:
# Model Development & Evaluation - in this section we build our Machine Learning models and fit them to the training data.

# Building Machine Learning Model.
# First, we build a tokenizer function that splits each review into a list of tokens.
# A token is a single word in this case; each review is split on whitespace.

In [None]:
def tokenizer(text):
    return text.split()
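
As a quick check of the tokenizer's behaviour, the sketch below calls it on a made-up review. Note that `str.split()` with no arguments splits on any run of whitespace (spaces, tabs, newlines), not only on a single space.

```python
def tokenizer(text):
    # str.split() with no arguments splits on any run of whitespace.
    return text.split()


# Hypothetical review containing a double space and a tab.
tokens = tokenizer("battery  life is\tgreat")
print(tokens)
```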

Building a TFIDF Vectorizer
First, we create a vocabulary of unique tokens from the entire set of documents (i.e., the reviews).
Then we construct a feature vector for each document that contains the term frequency, i.e. how often each word occurs in that document.
Term frequency is the number of times a term t occurs in a document d.
TFIDF stands for term frequency-inverse document frequency (tf-idf).
It is used to downweight frequently occurring words in the feature vectors, which typically don't contain useful or discriminatory information.

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t, d)$$
Here, tf(t, d) is the term frequency, and
idf(t, d) is the inverse document frequency.

$$\text{idf}(t, d) = \log \frac{n_d}{1 + \text{df}(d, t)}$$
Here $n_d$ is the total number of documents, and
df(d, t) is the number of documents d that contain the term t.

With max_features=15000, the resulting matrix has the 15,000 most common words, ranked by their term frequency across the data, as its columns.
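
To make the formulas above concrete, here is a minimal hand computation on a made-up three-document corpus. It reads the idf definition as log(n_d / (1 + df(d, t))) and assumes the natural logarithm. The helper names `idf` and `tf_idf` and the corpus are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed idf variant and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

# Made-up corpus of three tiny "reviews".
docs = [
    "great phone great battery",
    "bad phone",
    "great price",
]
tokenized = [d.split() for d in docs]
n_d = len(tokenized)  # total number of documents

# df(d, t): number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Inverse document frequency as defined in the text (natural log assumed).
    return math.log(n_d / (1 + doc_freq[term]))


def tf_idf(term, tokens):
    # Term frequency is the raw count of the term in the document.
    return tokens.count(term) * idf(term)


# "great" occurs in 2 of 3 documents, "battery" in only 1.
print("great:", tf_idf("great", tokenized[0]))
print("battery:", tf_idf("battery", tokenized[0]))
```

Although "great" occurs twice in the first document, it receives a lower weight than "battery" because it appears in more documents, which is the downweighting effect described above.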

In [None]:
tfidf = TfidfVectorizer(max_features=15000, tokenizer=tokenizer)

Creating a Machine Learning Pipeline
The pipeline first vectorizes the data, creating a TFIDF feature matrix from the dataset, and then passes that matrix to our classifier.
The classifier in this case is the Multinomial Naive Bayes classifier.
This algorithm is well suited to vectorized text that contains a large number of features.
Creating a pipeline lets us streamline our Machine Learning workflow by performing multiple steps in a single pass.
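
The idea can be tried end to end on a handful of made-up reviews. In this sketch the data and the names `toy_reviews`, `toy_labels` and `toy_pipeline` are hypothetical, and the vectorizer uses scikit-learn's default settings rather than the custom tokenizer and `max_features` used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data: two positive and two negative reviews.
toy_reviews = [
    "great phone love it",
    "love this great product",
    "terrible phone broke",
    "awful product broke fast",
]
toy_labels = ["positive", "positive", "negative", "negative"]

# Step 1 vectorizes the raw text; step 2 classifies the resulting TF-IDF features.
toy_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# A single fit call runs both steps; predict accepts raw text directly.
toy_pipeline.fit(toy_reviews, toy_labels)
toy_pred = toy_pipeline.predict(["great love", "awful broke"])
print(toy_pred)
```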

In [None]:
tfidf_mnb = Pipeline([('vect', tfidf), ('clf', MultinomialNB())])

In [None]:
# Fitting our pipeline with the training data
tfidf_mnb.fit(X_train, y_train)

In [None]:
# Making Predictions on the Test set
y_pred = tfidf_mnb.predict(X_test)

In [None]:
y_pred[:5]

Model Evaluation

Checking the model accuracy on both the train and test sets.
We use the pipeline's score method, which calculates the accuracy of the model on the given data.

In [None]:
print('Model Accuracy for the Train set:', tfidf_mnb.score(X_train, y_train))

In [None]:
print('Model Accuracy for the Test set:', tfidf_mnb.score(X_test, y_test))

Observations:

We get an accuracy of about 92% on both the train set and the test set.
This suggests that our model is not overfitting.
It is generalizing well to unseen data and giving good results.
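
Accuracy is the fraction of samples whose predicted label matches the true label. Because about 72.4% of the final dataset is positive, a classifier that always predicted positive would already be right roughly 72% of the time (assuming the test split has a similar class mix), so the 92% figure is best read against that baseline. The sketch below illustrates both quantities on made-up labels; `y_true` and `y_hat` are hypothetical arrays, not the notebook's results.

```python
import numpy as np

# Made-up true labels and predictions for ten reviews.
y_true = np.array(["positive"] * 7 + ["negative"] * 2 + ["neutral"])
y_hat = np.array(["positive"] * 7 + ["negative"] * 2 + ["positive"])

# Accuracy: share of predictions that match the true label.
accuracy = np.mean(y_true == y_hat)

# Majority-class baseline: accuracy of always predicting the most frequent class.
class_names, counts = np.unique(y_true, return_counts=True)
majority_baseline = counts.max() / counts.sum()

print("Accuracy:", accuracy)
print("Majority-class baseline:", majority_baseline)
```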

In [None]:
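# Rows are the true labels, columns the predicted labels (confusion_matrix sorts labels alphabetically).
pd.DataFrame(confusion_matrix(y_test, y_pred), columns=['negative', 'neutral', 'positive'], index=['negative', 'neutral', 'positive'])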

In [None]:
print(classification_report(y_test, y_pred))

Our model gives its best results for the positive sentiment.
This might be due to the large proportion of positive reviews in the dataset.
The performance for negative reviews is also good, with an F1 score of 85%.
However, our model struggles when predicting the neutral sentiment.
One likely reason is that many neutral reviews contain words that can be associated with both positive and negative reviews.
Also, there are far fewer neutral reviews in the data.
Together, these lead to a lot of false negatives for the neutral sentiment.
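
The per-class figures in the classification report follow directly from the confusion matrix. The sketch below derives precision, recall and F1 from a made-up matrix. The numbers in `cm` are hypothetical and are not this model's results; they are chosen only to show how many false negatives translate into low recall for a small class.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predicted classes.
class_names = ["negative", "neutral", "positive"]
cm = np.array([
    [80, 5, 15],
    [10, 10, 20],
    [5, 5, 350],
])

true_pos = np.diag(cm)                 # correctly classified samples per class
precision = true_pos / cm.sum(axis=0)  # column sums: everything predicted as the class
recall = true_pos / cm.sum(axis=1)     # row sums: everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(class_names, precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy matrix, 30 of the 40 truly neutral reviews are predicted as another class, so recall for neutral is only 0.25 even though overall accuracy is high.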

In [None]:
# Creating a dataframe containing the test set reviews and sentiments, along with the predictions made by the model for comparison

evaluation_df = pd.DataFrame({'reviews': X_test, 'sentiment': y_test, 'sentiment_tfidf_mnb': y_pred})
evaluation_df.head()

Observations:

The sentiments we assigned to the reviews were based on the star ratings and the polarity values calculated by TextBlob's sentiment analysis function.
Now, we can visually compare the sentiment predictions made by our model on these reviews.
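
Beyond scanning the first few rows, the disagreements can be isolated by filtering. A small sketch on made-up rows follows; `toy_eval` is a hypothetical stand-in that uses the same column names as `evaluation_df`.

```python
import pandas as pd

# Made-up evaluation rows with the same columns as evaluation_df.
toy_eval = pd.DataFrame({
    "reviews": ["love it", "it is fine", "broke in a week"],
    "sentiment": ["positive", "neutral", "negative"],
    "sentiment_tfidf_mnb": ["positive", "positive", "negative"],
})

# Keep only the rows where the model's prediction differs from the assigned label.
mismatches = toy_eval[toy_eval["sentiment"] != toy_eval["sentiment_tfidf_mnb"]]
print(mismatches)
```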

Conclusion:

We cleaned the reviews by:
Changing the case of each word to lowercase,
Fixing certain words,
Removing all the punctuation marks from each review, and
Removing any additional white space from each review.
Then, we calculated the polarity and subjectivity values for each review.
This allowed us to analyze our data in depth and find relationships between features such as star rating and polarity.
We also calculated a threshold for the subjectivity value in our reviews.
We then found the most common words associated with the different sentiments.
After analyzing the data, we:
Removed all redundant columns from the data,
Removed all the samples having subjectivity less than 0.3, i.e. the subjectivity threshold,
Divided the reviews into sentiments based on star rating and polarity, and
Finally, split the data into training and test sets.
During model building, we first created a TFIDF matrix of our data and trained a Multinomial Naive Bayes model.
This model was then used to make predictions on the test set.
It achieved an accuracy of about 92% on both the train and test sets.
This suggests that the model is not overfitting and is generalizing well to unseen data.