# Sentiment Analysis for Customer Reviews Challenge

## Challenge:
Develop a robust Sentiment Analysis classifier for XYZ customer reviews, automating the categorization into positive, negative, or neutral sentiments. Utilize Natural Language Processing (NLP) techniques, exploring different sentiment analysis methods.

## Problem Statement:
XYZ organization, a global online retail giant, accumulates a vast number of customer reviews daily. Extracting sentiments from these reviews offers insights into customer satisfaction, product quality, and market trends. The challenge is to create an effective sentiment analysis model that accurately classifies XYZ customer reviews.

### Important Instructions:

1. Make sure this ipynb file that you have cloned is in the __Project__ folder on the Desktop. The Dataset is also available in the same folder.
2. Ensure that all the cells in the notebook can be executed without any errors.
3. Once the Challenge has been completed, save the SentimentAnalysis.ipynb notebook in the __*Project*__ folder on the Desktop. If the file is not present in that folder, auto-evaluation will fail.
4. Print the evaluation metrics of the model. 
5. Before you submit the challenge for evaluation, please make sure you have assigned the Accuracy score of the model that was created for evaluation.
6. Assign the Accuracy score obtained for the model created in this challenge to the specified variable in the predefined function *submit_accuracy_score*. The solution is to be written between the comments `# code starts here` and `# code ends here`
7. Please do not make any changes to the variable names and the function name *submit_accuracy_score* as this will be used for automated evaluation of the challenge. Any modification in these names will result in unexpected behaviour.

### --------------------------------------- CHALLENGE CODE STARTS HERE --------------------------------------------

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report,accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [3]:
df = pd.read_csv("Reviews.csv")
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [4]:
df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [5]:
df.describe()

| | Id | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time |
|---|---|---|---|---|---|
| count | 568454.0 | 568454.0 | 568454.0 | 568454.0 | 568454.0 |
| mean | 284227.5 | 1.743817 | 2.22881 | 4.183199 | 1296257000.0 |
| std | 164098.679298 | 7.636513 | 8.28974 | 1.310436 | 48043310.0 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 939340800.0 |
| 25% | 142114.25 | 0.0 | 0.0 | 4.0 | 1271290000.0 |
| 50% | 284227.5 | 0.0 | 1.0 | 5.0 | 1311120000.0 |
| 75% | 426340.75 | 2.0 | 2.0 | 5.0 | 1332720000.0 |
| max | 568454.0 | 866.0 | 923.0 | 5.0 | 1351210000.0 |
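
The summary above shows a mean `Score` of about 4.18 and a median of 5, which suggests the ratings lean heavily toward 5 stars. One quick way to check this is to look at the share of reviews per rating. The sketch below uses a small made-up `scores` series as a stand-in for `df['Score']`; its values are illustrative assumptions, not the real distribution.

```python
import pandas as pd

# Hypothetical stand-in for df['Score']; the real column has 568,454 rows
scores = pd.Series([5, 5, 5, 4, 1, 5, 3, 5, 2, 5], name="Score")

# Share of reviews per star rating, ordered from 1 to 5
distribution = scores.value_counts(normalize=True).sort_index()
print(distribution)
```

On the real data, the same call would be `df['Score'].value_counts(normalize=True).sort_index()`.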


In [6]:
df = df.drop(columns=["Id", "ProductId", "UserId", "ProfileName", "HelpfulnessNumerator", "HelpfulnessDenominator", "Time"])

In [7]:
df.head()

| | Score | Summary | Text |
|---|---|---|---|
| 0 | 5 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 1 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 4 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 2 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | Great taffy | Great taffy at a great price. There was a wid... |
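
The challenge statement asks for positive, negative, or neutral labels, while `Score` is a 1 to 5 star rating. One possible mapping, shown here only as a sketch, treats 1 and 2 stars as negative, 3 as neutral, and 4 and 5 as positive. The thresholds, the helper `score_to_sentiment`, and the `Sentiment` column are assumptions made for illustration; the notebook itself trains on the raw `Score`.

```python
import pandas as pd

def score_to_sentiment(score: int) -> str:
    """Map a 1-5 star score to a coarse sentiment label (assumed thresholds)."""
    if score <= 2:
        return "negative"
    if score == 3:
        return "neutral"
    return "positive"

# Tiny stand-in for the real DataFrame, using the five scores shown above
sample = pd.DataFrame({"Score": [5, 1, 4, 2, 5]})
sample["Sentiment"] = sample["Score"].apply(score_to_sentiment)
print(sample)
```

Training on such a column instead of `Score` would turn the task into the three-class problem the challenge describes.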


In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import re
import matplotlib.pyplot as plt
import seaborn as sns
 


def preprocess_text(text):
    # Text cleaning: strip HTML tags, then keep only letters and whitespace
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

# Handling Missing Values: drop rows without review text before cleaning,
# because re.sub raises a TypeError when it receives NaN
df.dropna(subset=['Text'], inplace=True)

df['CleanedText'] = df['Text'].apply(preprocess_text)

 
# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['CleanedText'], df['Score'], test_size=0.2, random_state=42)
 
# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
 
# Models to be used
models = {
    # 'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(),
    # 'Support Vector Machine': SVC(),
    # 'Random Forest': RandomForestClassifier(),
    # 'K-Nearest Neighbors': KNeighborsClassifier()
}
 
# Metrics used inside the loop, imported once instead of on every iteration
from sklearn.metrics import confusion_matrix, f1_score

# Training and testing multiple models
for model_name, model in models.items():
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)

    # Per-class F1 scores as a bar chart
    f1_scores = f1_score(y_test, y_pred, labels=model.classes_, average=None)
    plt.figure(figsize=(7, 6))
    plt.bar(model.classes_, f1_scores)
    plt.xlabel('Score')
    plt.ylabel('F1-Score')
    plt.title('F1-Score per Class')
    plt.show()

    # Confusion matrix heatmap (rows: true labels, columns: predicted labels)
    plt.figure(figsize=(10, 8))
    conf_matrix = confusion_matrix(y_test, y_pred, labels=model.classes_)
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=model.classes_, yticklabels=model.classes_)
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()

    # Evaluation: Calculate accuracy for each model
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model: {model_name}, Accuracy: {accuracy:.4f}")
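
The cell above vectorizes and fits in separate steps and splits the data without stratification. As a hedged alternative, the sketch below chains TF-IDF and logistic regression in a scikit-learn `Pipeline`, stratifies the split so each rating keeps its share in both sets, and passes `class_weight="balanced"` to counter the skew toward 5-star reviews. The toy reviews, the variable names, and the choice of `max_iter=1000` are assumptions for illustration and are not taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up reviews standing in for df['CleanedText'] and df['Score']
toy_texts = [
    "terrible taste would not buy again", "awful product arrived broken",
    "worst purchase ever very bad", "bad quality and stale",
    "okay nothing special", "average taste fine for the price",
    "it is alright not great", "decent but unremarkable",
    "great taffy at a great price", "excellent quality dog food",
    "love it best snack ever", "wonderful flavor highly recommend",
]
toy_scores = [1, 1, 1, 1, 3, 3, 3, 3, 5, 5, 5, 5]

# stratify keeps the class proportions the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_scores, test_size=0.25, stratify=toy_scores, random_state=42
)

# The vectorizer is fitted on the training fold only, inside the pipeline
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
pipeline.fit(X_tr, y_tr)

toy_accuracy = accuracy_score(y_te, pipeline.predict(X_te))
print(f"Toy accuracy: {toy_accuracy:.2f}")
```

The toy accuracy itself carries no meaning; the point is the structure, which could be applied to the real columns in place of the separate vectorizer and model objects.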

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# One word cloud per star rating, built from the review text
categoryratings = df['Score'].unique()
for rating in categoryratings:
    print(rating)
    # Select the reviews for this rating; skip missing text so join() does not fail
    reviews = df.loc[df['Score'] == rating, 'Text'].dropna().astype(str)
    text = " ".join(reviews)
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(f'Word Cloud for Rating {rating}')
    plt.axis('off')
    plt.show()

In [9]:
accuracy = accuracy_score(y_test, y_pred) 
report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(report)

Accuracy: 0.74
Classification Report:
              precision    recall  f1-score   support

           1       0.66      0.69      0.67     10326
           2       0.47      0.22      0.30      5855
           3       0.48      0.32      0.38      8485
           4       0.52      0.26      0.34     16123
           5       0.80      0.95      0.87     72902

    accuracy                           0.74    113691
   macro avg       0.59      0.49      0.51    113691
weighted avg       0.71      0.74      0.71    113691
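
The report above shows strong recall for 5-star reviews and weak recall for ratings 2 to 4, so the 0.74 accuracy is best read against a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent class, using the per-class support values copied from the report; the variable names are illustrative.

```python
# Per-class support copied from the classification report above
support = {1: 10326, 2: 5855, 3: 8485, 4: 16123, 5: 72902}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = support[majority_class] / total
print(f"Always predicting {majority_class}: {baseline_accuracy:.4f}")
```

The baseline comes out at roughly 0.64, so the model's 0.74 is about ten points above a predictor that ignores the text, and the macro-averaged F1 of 0.51 is the more telling figure for the minority ratings.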



### --------------------------------------- CHALLENGE CODE ENDS HERE --------------------------------------------

### NOTE:
1. Assign the Accuracy score obtained for the model created in this challenge to the specified variable in the predefined function *submit_accuracy_score* below. The solution is to be written between the comments `# code starts here` and `# code ends here`
2. Please do not make any changes to the variable names and the function name *submit_accuracy_score* as this will be used for automated evaluation of the challenge. Any modification in these names will result in unexpected behaviour.

In [10]:
def submit_accuracy_score()-> float:
    #accuracy should be in the range of 0.0 to 1.0
    accuracy = 0.0
    # code starts here
   
    # code ends here
    return accuracy