Import all the packages used throughout the model.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


In [2]:
#read the data
df_review = pd.read_csv("IMDB Dataset.csv")
#check the class balance
df_review.value_counts(subset='sentiment')

sentiment
negative    25000
positive    25000
Name: count, dtype: int64

The dataset is already perfectly balanced, so no rebalancing is needed.  
Likewise, the data is already clean, so no cleaning step is required either.

In [3]:
#split the data into training and test sets, using an 80/20 ratio
treino, teste = train_test_split(df_review, test_size=0.2, random_state=42)
treino_x, treino_y = treino['review'], treino['sentiment']
teste_x, teste_y = teste['review'], teste['sentiment']

treino_x: independent variables (the reviews) used to train the model.  
treino_y: dependent variables/labels (sentiment) that the model must predict.  
teste_x: independent variables used to test the model's accuracy.  
teste_y: labels used to check the model's predictions against the actual categories.
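
As a rough illustration of what an 80/20 split does, here is a minimal pure-Python sketch. The function `simple_train_test_split` and the toy `rows` are hypothetical names introduced for this example; this is not how scikit-learn's `train_test_split` is implemented, only the same idea: shuffle with a fixed seed, then cut.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (review, sentiment) pairs
rows = [(f"review {i}", "positive" if i % 2 else "negative") for i in range(50)]
train, test = simple_train_test_split(rows)

# Separate the independent variable (text) from the label, as in the cell above
train_x = [review for review, _ in train]
train_y = [label for _, label in train]
test_x = [review for review, _ in test]
test_y = [label for _, label in test]

print(len(train_x), len(test_x))  # 40 10
```

The test rows are never seen during training, which is what makes the scores computed on them a fair estimate.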

To turn the text into numeric vectors, we will use the bag-of-words (BOW) technique.  
It was chosen because: 1) the frequency of words in the reviews matters and can indicate sentiment; 2) the order of the words matters less.  
The BOW will be weighted with Term Frequency-Inverse Document Frequency (TF-IDF), which weights each word in a document (here, an individual review) according to how often it appears across the whole corpus (the full set of documents/reviews).
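
To make the weighting concrete, here is a simplified pure-Python sketch of TF-IDF over a toy corpus. It assumes whitespace tokenization, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization; `tfidf_vectors` and `docs` are hypothetical names for this example, and `TfidfVectorizer` does more than this (its own tokenizer, stop-word removal, sparse output).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # bag of words: counts only, order is discarded
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = [
    "great movie great cast",
    "boring movie",
    "cast great movie great",  # same words as the first document, different order
]
vectors = tfidf_vectors(docs)

# Word order does not matter: documents 0 and 2 get the same weights
same_weights = vectors[0].keys() == vectors[2].keys() and all(
    math.isclose(vectors[0][t], vectors[2][t]) for t in vectors[0]
)
print(same_weights)  # True

# "movie" appears in every document, so it weighs less than the rarer "boring"
print(vectors[1]["boring"] > vectors[1]["movie"])  # True
```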

In [4]:
tfidf = TfidfVectorizer(stop_words='english') #remove English stop words
treino_x_vetor = tfidf.fit_transform(treino_x) #learn the vocabulary and IDF weights from the training set, then vectorize it
teste_x_vetor = tfidf.transform(teste_x) #only vectorize the test set, reusing what was learned from the training set
treino_x_vetor

<40000x92692 sparse matrix of type '<class 'numpy.float64'>'
	with 3543198 stored elements in Compressed Sparse Row format>

Next, we test different supervised learning (classification) models to select the most accurate one. We will try Support Vector Machines (SVM), a decision tree, and logistic regression, comparing their mean accuracy, F1 score, classification report, and confusion matrix.

In [5]:
#fit the algorithms to the data
#SVM
svc = SVC(kernel='linear')
svc.fit(treino_x_vetor, treino_y)


In [6]:
#Decision Tree
dec_tree = DecisionTreeClassifier()
dec_tree.fit(treino_x_vetor, treino_y)

In [8]:
#Logistic regression
log_reg = LogisticRegression()
log_reg.fit(treino_x_vetor, treino_y)

In [10]:
#Mean Accuracy
print(svc.score(teste_x_vetor, teste_y))
print(dec_tree.score(teste_x_vetor, teste_y))
print(log_reg.score(teste_x_vetor, teste_y))

0.8964
0.7228
0.8941


Since the decision tree's result (0.7228) was considerably below those of the SVM (0.8964) and logistic regression (0.8941), we will use only the latter two in the remaining comparisons.

In [11]:
#F1 score
print(f1_score(teste_y, svc.predict(teste_x_vetor), 
         labels=['positive', 'negative'], average=None))
print(f1_score(teste_y, log_reg.predict(teste_x_vetor), 
         labels=['positive', 'negative'], average=None))

[0.89805156 0.89469404]
[0.89655172 0.89152924]


The scores are very close, with the SVM keeping a small advantage.

In [12]:
#Classification report
print(classification_report(teste_y,
                            svc.predict(teste_x_vetor),
                            labels=['positive', 'negative']))
print(classification_report(teste_y,
                            log_reg.predict(teste_x_vetor),
                            labels=['positive', 'negative']))

              precision    recall  f1-score   support

    positive       0.89      0.91      0.90      5039
    negative       0.90      0.89      0.89      4961

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

              precision    recall  f1-score   support

    positive       0.88      0.91      0.90      5039
    negative       0.91      0.88      0.89      4961

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



Again, a small margin in favor of the SVM. Because its accuracy (0.8964) is above 0.895, the report rounds it to 0.90.  
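
The rounding can be checked directly with the mean accuracy values printed earlier (0.8964 for the SVM, 0.8941 for logistic regression). This is only a small sketch of two-decimal formatting; the variable names are introduced for the example.

```python
svm_accuracy = 0.8964      # svc.score(...) from the mean accuracy cell
log_reg_accuracy = 0.8941  # log_reg.score(...) from the same cell

# Two-decimal formatting, as shown in the classification report
print(f"{svm_accuracy:.2f}")      # 0.90
print(f"{log_reg_accuracy:.2f}")  # 0.89
```

`classification_report` also accepts a `digits` argument (2 by default), so passing `digits=4` should make the difference between the two models visible in the report itself.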

In [14]:
#Confusion matrix
conf_mat_svc = confusion_matrix(teste_y,
                            svc.predict(teste_x_vetor),
                            labels=['positive', 'negative'])
conf_mat_logreg = confusion_matrix(teste_y,
                            log_reg.predict(teste_x_vetor),
                            labels=['positive', 'negative'])

In [16]:
print(conf_mat_svc)
print(conf_mat_logreg)

[[4563  476]
 [ 560 4401]]
[[4589  450]
 [ 609 4352]]


Another close result. Overall, based on these results, the SVM predicts slightly better. However, logistic regression is worth considering in its place: although slightly less accurate, it takes less processing time, especially when training the model (in this case, it was effectively more than 70x faster than the SVM).
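
The scores reported in the earlier cells can be recomputed from the two confusion matrices alone. The sketch below assumes the layout produced with `labels=['positive', 'negative']`: rows are the true classes, columns are the predicted classes. `metrics_from_confusion` is a hypothetical helper written for this example.

```python
def metrics_from_confusion(matrix):
    """Derive metrics from a 2x2 matrix ordered [positive, negative].

    Rows are true classes and columns are predicted classes.
    """
    (tp, fn), (fp, tn) = matrix
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision_pos": tp / (tp + fp),
        "recall_pos": tp / (tp + fn),
        "f1_pos": 2 * tp / (2 * tp + fp + fn),
        "f1_neg": 2 * tn / (2 * tn + fp + fn),
    }

# Values copied from the confusion matrix output above
conf_mat_svc = [[4563, 476], [560, 4401]]
conf_mat_logreg = [[4589, 450], [609, 4352]]

svc_metrics = metrics_from_confusion(conf_mat_svc)
logreg_metrics = metrics_from_confusion(conf_mat_logreg)

for name, metrics in [("SVM", svc_metrics), ("Logistic regression", logreg_metrics)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

The accuracies come out as 0.8964 and 0.8941, matching the mean accuracy cell, and the F1 values match the F1 score cell. The matrices also show where the models differ: logistic regression finds slightly more of the positive reviews (4589 vs. 4563), but mislabels more negative reviews as positive (609 vs. 560).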

We have now trained a sentiment prediction model for English-language movie reviews, based on 50 thousand reviews evenly split between positive and negative, with an 80/20 train/test ratio. The dataset was last updated 4 years ago. We will now test the model on a few IMDB reviews of the film [Dungeons & Dragons: Honor Among Thieves](https://www.imdb.com/title/tt2906216/), released in 2023. For these final tests, we consider a review "positive" if its rating is 7 to 10, and negative otherwise. 

In [17]:
#Starting with a 10/10 review, https://www.imdb.com/review/rw8940209
print(svc.predict(tfidf.transform(["It's not off, and you can say that a movie lives up to a trailer when you have a trailer as action, packed, funny, and full of Easter eggs, as dungeons & dragons: honor among thieves. However, somehow, this movie did exactly that. It is absolutely hilarious, it has all the charm and banter Of guardians of the Galaxy with incredible action choreography like The Man with the Iron Fist, and a dynamic and love-to-hate villain with a fantastic performance. This movie is extremely fun. It's rare that a movie is both entertaining and good - but this movie really is everything you could ever want it to be. Fans of the game will be delighted with the endless references and Easter eggs, and while the movie does play fast and loose with the rules, one has to in order to make a film work, and it worked brilliantly. Pine and Rodriguez have a surprising on screen chemistry and this lovable group of adventures will surely win over the hearts of fans of critical role as well. I highly recommend this movie to any fans of action and fantasy."])))

['positive']


In [19]:
#Still on the positive side, but moving to an 8/10, https://www.imdb.com/review/rw8963551
print(svc.predict(tfidf.transform(["A surprisingly strong blockbuster - surpassing my initial low expectations (I'm not even slightly a gamer, don't know the source material & am well aware adaptations usually tend to falter when put to the big screen) - so despite my inability to comment on accuracy when translating ideas from one format to another, I am genuinely impressed by the strength of Dungeons & Dragons (in terms of judging the piece solely as a movie, in & of itself), from first impressions. Although far from being a masterpiece, the fantasy film doesn't necessarily have to be in order for it to be an entertaining watch (or valid, as Honour Amongst Thieves isn't trying to be profound - quite the opposite; revelling in the fact that it never takes the story too seriously - resulting in some honestly amusing moments that triggered audible laughter from the audience, repeatedly) & there are a plethora of attributes to appreciate; the genuinely funny, self referential humour, sharp wit, continuous narrative pay-offs, effective utilisations of practical FX throughout (blended amongst VFX for maximum impact) & ingeniously inventive magical sequences (of pure cinematic spectacle) that advance the plot brilliantly & enhance fight scenes - in ways no previous filmmakers have seemingly ever thought to try before (despite the existence of numerous / similar brands like Harry Potter, Doctor Strange etc. The wizardry here is actually uniquely realised & visually original) & therefore, I seriously recommend seeing the release at the cinema - since it's well worth your time."])))

['positive']


In [20]:
#Moving to the edge of what we consider "positive", 7/10, https://www.imdb.com/review/rw8954230
print(svc.predict(tfidf.transform(["I was lucky enough to catch a sneak preview of the Movie. I have played Dungeons and Dragons since the basic game came out in the late 70ies. So far, every adaptation I have seen trying to turn the D&D world into a movie has fallen flat on its face mostly because it was trying to appeal to way too many audiences and include way to much modern BS. This movie was decent, It has a very adventurous feel and definitely played to both fans of the game and those who have never played the game. I like the fact that they didn't spoil the movie with any modern elements like the movies of the past had tried to do. It had a good story and great special effects. Is it the story I would have told? Nope but I don't have a $100,000,000 to make a movie and my movie would have catered more to the D&D world and player. It also most likely would have been a complete flop because it's a very small niche market! They don't call this the movie business for nothing and I understand that. Let's face it, we won't see another movie if it doesn't make money! I would definitely pay to see a series of this type of movie providing they stay on the same path they are on. The movie was far from perfect from a game player stance, but the movie has been the truest to the spirit of the game to date. Game players will be divided on the movie's success but everyone else will most likely be entertained."])))

['positive']


In [21]:
#Starting on what we would consider negative reviews, 5/10, https://www.imdb.com/review/rw9007949
print(svc.predict(tfidf.transform(["Despite its rich source material and an enthusiastic cast, the film falls flat in its execution, succumbing to a string of cliches and predictable story beats that leave audiences feeling underwhelmed. The film's narrative structure is a textbook example of the team of heroes trope, complete with characters who possess specific skill sets needed to overcome various obstacles. However, the film's reliance on such a formulaic approach results in a sense of déjà vu for viewers who are familiar with fantasy adventures. To its detriment, the film fails to deviate from or subvert the expected conventions, leading to a predictable and lackluster experience. The filmmakers' affection for the source material is apparent, but it doesn't translate into a well-crafted movie. The film's visual design lacks creativity, with Forge's city resembling a generic fantasy video game setting. The film's use of CGI is at times jarring, with magic-driven sequences feeling detached from the more grounded practical effects. The world-building falls short, and the film misses the opportunity to create immersive and visually captivating environments. While the cast delivers competent performances, their efforts are hindered by the film's shortcomings. Pine's rough charisma, Rodriguez's physicality, and Grant's smarminess are all on display, but they are not enough to elevate the material. The film suffers from a lack of substance, and despite its lengthy runtime of 139 minutes, it feels devoid of depth and genuine emotion. The characters are constantly moving from one plan to the next, but the repetitive structure leaves little room for meaningful development or stakes."])))

['negative']


In [22]:
#A second 5/10 review, to try to provoke a false positive, https://www.imdb.com/review/rw8939818
print(svc.predict(tfidf.transform(["Dungeons & Dragons: Honor Among Thieves is advertised as a movie for everyone, DND player or not. This made me hopeful for the movie, however I was disappointed when I found myself getting bored and lost at points throughout it. The world is interesting, but there wasn't much time spent on accustoming the viewer to it. You're pretty much just thrown in, which I definitely could have looked past if the story held a grip on me, but I found it to be slightly cliché and uninteresting. Take this for what you will, as I'm not really an action movie person in the first place, but I was definitely lost to daydreaming during some of the middle parts. The action to meaningful story ratio simply wasn't my cup of tea. The actors and actresses all give great preformances, though. Justice Smith's character goes through a fulfilling character arc, one of the things I liked, and Hugh Grant plays a very comical and convincing villian. If you are familiar with DND and/or enjoy movies with a lot of action, I would give this a watch, but it wasn't for me."])))

['negative']


In [23]:
#Last test, a 2/10 review with a touch of sarcasm, https://www.imdb.com/review/rw8973689
print(svc.predict(tfidf.transform(["I really tried to like the film. So many great reviews here, but in the end it's always a matter of taste. I thought it was really boring, no one in our cinema laughed a lot or shared in the excitement and somehow everyone was glad when it was over. The budget was big, the expectations big, what could go wrong? We know a lot now. The posed drama doesn't fit the one-liners, the story itself has no substance and afterwards I watched the trailer, although I never watch trailers and it's true, that's all you need to see. It's a film that relies on a few familiar faces and a lot of bad CGI. It has no charm and will probably disappear into the annals of history. It's a shame about the wasted opportunity. If you liked Thor 4, you can watch this film without hesitation."])))

['negative']
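
The six spot checks can be tallied against the 7-to-10 rule stated before the tests. This is a small sketch: `rating_to_label` is a hypothetical helper, and the pairs below are the ratings from the cell comments together with the labels printed by the SVM.

```python
def rating_to_label(rating, threshold=7):
    """Map a 1-10 IMDB rating to a sentiment label (7 or above is positive)."""
    return "positive" if rating >= threshold else "negative"

# (rating given by the reviewer, label predicted by the SVM)
results = [
    (10, "positive"),
    (8, "positive"),
    (7, "positive"),
    (5, "negative"),
    (5, "negative"),
    (2, "negative"),
]

matches = sum(rating_to_label(rating) == predicted for rating, predicted in results)
print(f"{matches}/{len(results)} predictions agree with the rating-based label")
# 6/6 predictions agree with the rating-based label
```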
