# Global Solution 1 - 2TIAR

## Team members
* André Gomes Monteiro RM: 89168
* Larissa Dias Cardomingo RM: 88842
* Luara Maria Marino RM: 89375
 

## Sentiment analysis of reviews of products sold on Amazon
Everyone knows that Amazon is one of the largest companies today, but did you know that sentiment analysis of the reviews users leave on its marketplace is frequently used to improve products and services? And of course, being a technology company, Amazon does not do this manually: machine learning algorithms run daily to perform these analyses automatically.

With that in mind, a small sample of the Amazon reviews dataset can be found [here](https://github.com/prof-renato/data/blob/main/amazon_sentiment_analysis.csv.gz?raw=true). The team's task is to develop a model that infers whether a given review is good or bad.

The evaluation metrics are up to the team, but keep in mind that negative reviews have a much greater impact on the business than positive ones. Justify the choice of metric accordingly.
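
To make this trade-off concrete, the sketch below computes accuracy alongside precision and recall for the `bad` class from two label lists. The labels are invented toy data and `class_metrics` is a hypothetical helper name, not part of the assignment; it only illustrates why a metric focused on the negative class can tell a different story than accuracy.

```python
def class_metrics(y_true, y_pred, positive="bad"):
    """Return (accuracy, precision, recall), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # bad, predicted bad
    fp = sum(t != positive and p == positive for t, p in pairs)  # good, predicted bad
    fn = sum(t == positive and p != positive for t, p in pairs)  # bad, predicted good
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels (made up for illustration): 8 good reviews and 2 bad ones
y_true = ["good"] * 8 + ["bad"] * 2
# A hypothetical model that misses one of the two bad reviews
y_pred = ["good"] * 8 + ["bad", "good"]

accuracy, precision, recall = class_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision(bad)={precision:.2f} recall(bad)={recall:.2f}")
```

In this toy case accuracy is 0.90 even though half of the bad reviews go undetected, which is the kind of gap that recall on the `bad` class exposes.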

## Installing the modules

In [14]:
%%capture
!pip install nltk
!pip install imbalanced-learn

## Importing the modules

In [15]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.pipeline import Pipeline
from sklearn.metrics import classification_report

## Exploratory analysis

In [16]:
df = pd.read_csv('amazon_sentiment_analysis.csv')
df.head(10)

| | review | sentiment |
|---|---|---|
| 0 | Expensive Junk: This product consists of a pie... | bad |
| 1 | Toast too dark: Even on the lowest setting, th... | bad |
| 2 | Excellent imagery...dumbed down story: I enjoy... | good |
| 3 | Are we pretending everyone is married?: The au... | bad |
| 4 | Not worth your time: Might as well just use a ... | bad |
| 5 | Book reads like written for grade schoolers: I... | bad |
| 6 | Jeanne de Florette & Manon of the Springs: I s... | bad |
| 7 | Theater Projector Ceiling Mount: Would not fit... | bad |
| 8 | This import is sooooooooooo good: This is a gr... | good |
| 9 | Garbage: The handle broke clean off after TWO ... | bad |


In [17]:
# remove missing values
df = df.dropna()

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 559500 entries, 0 to 559499
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   review     559500 non-null  object
 1   sentiment  559500 non-null  object
dtypes: object(2)
memory usage: 12.8+ MB


In [19]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Histogram(x=df['sentiment'], name="count", texttemplate="%{x}", textfont_size=20))
fig.show()

In [20]:
# Optionally work with just a sample of the dataset to speed up experiments
# df = df.sample(10000, random_state=42)

## Text preprocessing

In [21]:
%%capture
# Download the stopwords package (%%capture must be the first line of the cell)
nltk.download('stopwords')


In [22]:
# Build the stopword set once; calling stopwords.words() for every word is very slow
STOPWORDS = set(stopwords.words('english'))

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Drop punctuation characters and join the rest back into a string
    nopunc = ''.join(char for char in mess if char not in string.punctuation)

    # Keep only the words that are not stopwords (case-insensitive check)
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]

text_process(df['review'].iloc[1])

['Toast',
 'dark',
 'Even',
 'lowest',
 'setting',
 'toast',
 'dark',
 'liking',
 'Also',
 'light',
 'stays',
 'lit',
 'unplug',
 'avoid',
 'wasting',
 'electricity',
 'quality',
 'expected',
 'Cuisinart']

## Train/test split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.35, random_state=9)

In [24]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((363675,), (195825,), (363675,), (195825,))

## Model training

In [25]:
model = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
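
The first two pipeline steps can be sketched by hand on a toy corpus. The code below assumes the default `TfidfTransformer` settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The corpus and the function name `tfidf_rows` are made up for illustration, and this is not how scikit-learn is implemented internally.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each token appears
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed default settings)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # bag-of-words step: raw token counts
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])  # L2-normalize the row
    return vocab, rows

# Toy tokenized documents (invented), shaped like the output of text_process
docs = [["toast", "dark", "toast"], ["great", "toast"], ["great", "book"]]
vocab, rows = tfidf_rows(docs)
print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

Tokens that occur in fewer documents (such as `book` here) receive a higher weight than tokens shared across documents (such as `great`), and these weighted vectors are what the classifier is trained on.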

In [26]:
model.fit(X_train, y_train)

In [27]:
y_pred =  model.predict(X_test)
y_pred[:10]


array(['good', 'good', 'bad', 'bad', 'good', 'good', 'bad', 'good',
       'good', 'bad'], dtype='<U4')

## Model evaluation

In [28]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         bad       0.84      0.87      0.85     97961
        good       0.87      0.83      0.85     97864

    accuracy                           0.85    195825
   macro avg       0.85      0.85      0.85    195825
weighted avg       0.85      0.85      0.85    195825
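
As a quick consistency check, the F1 scores in the report can be recomputed from the printed precision and recall. The sketch below also evaluates F2, which weights recall more heavily than precision. `f_beta` is a hypothetical helper, and because the inputs are the rounded values from the report, the results are approximate.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Rounded (precision, recall) pairs taken from the classification report
report = {"bad": (0.84, 0.87), "good": (0.87, 0.83)}

for label, (precision, recall) in report.items():
    f1 = f_beta(precision, recall)
    f2 = f_beta(precision, recall, beta=2.0)
    print(f"{label}: F1={f1:.2f} F2={f2:.2f}")
```

Both F1 values round to 0.85, in agreement with the report. For the `bad` class, F2 comes out slightly higher than F1 because its recall exceeds its precision.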



## Saving the model

In [31]:
import pickle

# Save the trained pipeline to disk
pkl_filename = "sentiment_analysis_model.pkl"

with open(pkl_filename, 'wb') as file:
    pickle.dump(model, file)

In [32]:
# Load the model (text_process must be defined in the session, since pickle stores it by reference)
with open(pkl_filename, 'rb') as file:
    sa_model = pickle.load(file)

sa_model

## Manual test

In [29]:
def modelPredict(phrase):
    # Wrap the phrase in a list, since predict expects an iterable of documents
    return sa_model.predict([phrase]).tolist()[0]

In [33]:
modelPredict("Amazon is great if your looking for a deal! If you want it on time, it's a crap shoot! Amazon's 'guaranteed delivery' last a joke!")

'bad'

In [34]:
modelPredict("It was delivered quickly and I always have good, quick service from Amazon. I love that Inbox Dollars let's me get these rewards")

'good'

In [35]:
modelPredict("I received a package today from Amazon that was atrocious. Why would a driver leave such a package on my doorstep. You want to know why because he didn't give a damn and calling Amazon to report him is like pulling teeth you have all these international customer service people who can care less of what you are complaining about. I'm going to think long and hard before ordering from them. Had to be a crack head")

'bad'

## Final remarks

The evaluation metric chosen was accuracy: the final model classified 85% of the test reviews correctly, which we consider a reasonably good result. The classification report also shows a recall of 0.87 for the `bad` class.