Question 39

Text classification is the process of assigning labels or categories to text documents based on their content. A machine learning model is trained to identify patterns in labeled texts and is then used to classify new texts automatically. Typical applications include:
* Sentiment analysis
* News categorization
* Spam filtering
* Topic classification
* Intent detection in chatbots
* Medical diagnosis
* Fake news detection
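
As a minimal sketch of the train-then-classify idea, the snippet below fits a tiny word-count Naive Bayes model on four labeled sentences and classifies two new ones. The corpus, the labels, and the helper names `fit` and `predict` are all invented for illustration and are not part of the exercises that follow.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (text, label) pairs invented for illustration.
train = [
    ("great profit strong growth", "positive"),
    ("strong rally great earnings", "positive"),
    ("heavy loss weak demand", "negative"),
    ("weak outlook heavy selling", "negative"),
]

def fit(samples):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict("great growth", *model))
print(predict("weak demand heavy loss", *model))
```

The same two steps (learn statistics from labeled text, then score unseen text) appear in every classifier used in the later questions.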

Question 40

In [77]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, classification_report
from sklearn.naive_bayes import GaussianNB,BernoulliNB,MultinomialNB
from sklearn.svm import SVC
from sklearn import tree
from sklearn.linear_model import LinearRegression,LogisticRegression
from nltk.stem import RSLPStemmer
import gensim, logging, warnings
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models.ldamulticore import LdaMulticore

In [78]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [79]:
class NaiveBayes:
    """Gaussian Naive Bayes for dense NumPy feature matrices."""

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)
        # Compute the mean, variance, and prior for each class.
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)
        self._priors = np.zeros(n_classes, dtype=np.float64)
        for idx, c in enumerate(self._classes):
            X_c = X[y == c]
            self._mean[idx, :] = X_c.mean(axis=0)
            self._var[idx, :] = X_c.var(axis=0)
            self._priors[idx] = X_c.shape[0] / float(n_samples)

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        posteriors = []
        # Compute the log posterior for each class.
        for idx, c in enumerate(self._classes):
            prior = np.log(self._priors[idx])
            posterior = np.sum(np.log(self._pdf(idx, x)))
            posterior = posterior + prior
            posteriors.append(posterior)
        # Return the class with the highest posterior probability.
        return self._classes[np.argmax(posteriors)]

    def _pdf(self, class_idx, x):
        # Gaussian density of each feature under the given class.
        mean = self._mean[class_idx]
        var = self._var[class_idx]
        numerator = np.exp(-((x - mean) ** 2) / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
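
The class above takes the log of a density that was first computed with `np.exp`, which can underflow to zero for points far from the class mean. A possible alternative, sketched here under the assumption that a small smoothing constant is acceptable, is to compute the Gaussian log-density directly. The function name `gaussian_log_pdf` and the `eps` parameter are hypothetical and not part of the notebook.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var, eps=1e-9):
    """Log of the normal density, computed without exponentiating first.

    eps is a hypothetical smoothing constant that guards against zero variance.
    """
    var = var + eps
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

mean = np.array([0.0, 1.0])
var = np.array([1.0, 0.25])

# For ordinary inputs this matches taking the log of the density.
x = np.array([0.5, 1.2])
pdf = np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
print(np.allclose(gaussian_log_pdf(x, mean, var, eps=0.0), np.log(pdf)))

# Far from the mean the density itself underflows to 0.0, so its log would
# be -inf, while the log-space form stays finite.
x_far = np.array([60.0, 1.0])
print(gaussian_log_pdf(x_far, mean, var))
```

Summing these per-feature values and adding the log prior gives the same ranking of classes as `_predict`, as long as no density underflows.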

In [80]:
def accuracy(y_test, y_pred):
    return np.sum(y_test == y_pred) / len(y_test)

In [81]:
X, y = datasets.make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
nb = NaiveBayes()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)
print("Naive Bayes classification accuracy", accuracy(y_test, predictions))

Naive Bayes classification accuracy 0.81


Question 41

In [82]:
df = pd.read_csv("/kaggle/input/stockmarket-sentiment-dataset/stock_data.csv")
df

Unnamed: 0,Text,Sentiment
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,1
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,1
2,user I'd be afraid to short AMZN - they are lo...,1
3,MNTA Over 12.00,1
4,OI Over 21.37,1
...,...,...
5786,Industry body CII said #discoms are likely to ...,-1
5787,"#Gold prices slip below Rs 46,000 as #investor...",-1
5788,Workers at Bajaj Auto have agreed to a 10% wag...,1
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...",1


In [83]:
df['Sentiment'].value_counts()

Sentiment
 1    3685
-1    2106
Name: count, dtype: int64

In [84]:
df['vector'] = df['Text'].apply(lambda text: nlp(text).vector)  

In [85]:
df.head()

Unnamed: 0,Text,Sentiment,vector
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,1,"[-0.37225932, -1.9083202, -0.96866316, 1.28718..."
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,1,"[-1.9204868, -1.6881311, 0.30030242, 0.4985219..."
2,user I'd be afraid to short AMZN - they are lo...,1,"[-3.4280977, 2.5345087, -2.7803833, 1.5270884,..."
3,MNTA Over 12.00,1,"[0.06870751, -1.7497749, -0.15267503, 0.1832, ..."
4,OI Over 21.37,1,"[0.007576001, -0.80037993, -0.32887203, 0.7365..."


In [86]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values,
    df.Sentiment,
    test_size=0.2,
    random_state=2022
)

In [87]:
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [88]:
from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)


clf = NaiveBayes()
clf.fit(scaled_train_embed, y_train)

In [89]:
y_pred = clf.predict(scaled_test_embed)
print("Naive Bayes classification accuracy", accuracy(y_test, y_pred))

Naive Bayes classification accuracy 0.630716134598792


Question 42

In [90]:
#y_pred = clf.predict(scaled_test_embed)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          -1       0.48      0.44      0.46       416
           1       0.70      0.73      0.72       743

    accuracy                           0.63      1159
   macro avg       0.59      0.59      0.59      1159
weighted avg       0.62      0.63      0.63      1159
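
A useful reference point for the 0.63 accuracy is the score of a model that always predicts the most frequent class. The sketch below derives that baseline from the support counts in the report; the variable names are chosen here for illustration.

```python
# Support counts per class, copied from the classification report above.
support = {-1: 416, 1: 743}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline comes out at roughly 0.64, so on this split the classifier is not yet beating a constant prediction, even though its recall on the minority class (0.44) shows it is not simply predicting one label.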



Question 43

In [91]:
model = MultinomialNB()
model.fit(scaled_train_embed,y_train)
model.score(scaled_test_embed, y_test)

0.6410698878343399

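Yes, the results are very close: scikit-learn's MultinomialNB scores 0.641 against 0.631 for the custom Gaussian implementation. Note that 0.641 equals 743/1159, the share of the positive class in the test set, which is the score a model would get by predicting class 1 for every example.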

Question 44

Naive Bayes has some characteristics that suit text classification:
* it treats words as conditionally independent given the class
* it handles sparse data well

The Naive Bayes classifier also has limitations and may not work well in certain situations, including:
* Dependence between features
* Irony and sarcasm
* Class imbalance
* Tasks that call for more complex models
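
The independence assumption can be written compactly. As a sketch, with $c$ a class, $x_1, \dots, x_n$ the features of one document, and $\hat{y}$ the predicted label (symbols introduced here for illustration), the classifier picks:

```latex
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \right]
```

Each feature contributes its own term to the sum, so strongly correlated features tend to be counted more than once, which is one way the first limitation shows up in practice.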

Question 45

In [92]:
model_tree = tree.DecisionTreeClassifier()
model_tree = model_tree.fit(scaled_train_embed, y_train)
model_tree.score(scaled_test_embed,y_test)

0.5823986194995686

In [93]:
linear_model = LinearRegression().fit(scaled_train_embed,y_train)
linear_model.score(scaled_test_embed, y_test)

0.1880578231078095
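
The value above is not an accuracy: `score()` on a scikit-learn regressor returns the coefficient of determination (R²). The sketch below, on synthetic data rather than the stock dataset, contrasts R² with the accuracy obtained by thresholding the regression output at 0. The data, the seed, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical, not the stock dataset): labels are -1 or 1.
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = np.where(X @ true_w + 0.5 * rng.normal(size=200) > 0, 1, -1)

# Ordinary least squares with an intercept column.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ coef

# R^2: the quantity a regressor's score() reports.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Accuracy: threshold the continuous prediction at 0 to get a label.
y_label = np.where(y_hat > 0, 1, -1)
acc = np.mean(y_label == y)

print(f"R^2 = {r2:.3f}, accuracy = {acc:.3f}")
```

For a like-for-like comparison with the other classifiers, LogisticRegression (already imported at the top of the notebook) would be the natural linear model to try.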

In [94]:
svm_model = SVC(kernel='linear', C=1.0)
svm_model.fit(scaled_train_embed, y_train)
svm_model.score(scaled_test_embed,y_test)

0.7238999137187231

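Question 46

The SVM gave the best result of all the models, with an accuracy of 0.724. The decision tree (0.582) still fell behind our Naive Bayes model (0.631). The linear regression value of 0.188 is very low, but score() on a regressor returns the coefficient of determination (R²) rather than accuracy, so it is not directly comparable to the other figures.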

Question 47

In [95]:
data = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [96]:
new_data=data.iloc[0:500]

In [97]:
new_data['sentiment'].value_counts()

sentiment
negative    263
positive    237
Name: count, dtype: int64

In [98]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# when adding a column to a slice.
new_data = new_data.assign(vector=new_data['review'].apply(lambda text: nlp(text).vector))


In [99]:
X_train, X_test, y_train, y_test = train_test_split(
    new_data.vector.values,
    new_data.sentiment,
    test_size=0.2,
    random_state=2022
)

In [100]:
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [101]:
from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)


gaussian_model = GaussianNB()
gaussian_model.fit(scaled_train_embed, y_train)

In [102]:
y_pred = gaussian_model.predict(scaled_test_embed)
print("Naive Bayes classification accuracy", accuracy(y_test, y_pred))

Naive Bayes classification accuracy 0.67


Question 48

In [103]:
data = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [104]:
def preprocess(text):
    doc = nlp(text)
    
    no_stop_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(no_stop_words)

In [105]:
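def stemming(text):
    # NOTE: RSLPStemmer targets Portuguese, and iterating over a string yields
    # single characters, so on these English reviews the only visible effect
    # is lowercasing (see the output of pipeline() below).
    stemmer = RSLPStemmer()
    stem_words = [stemmer.stem(word) for word in text]
    return "".join(stem_words)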
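
The stemming function above loops over the characters of the string rather than over its words. The sketch below shows the difference with a toy suffix stripper; `toy_stem` is a hypothetical stand-in for a real stemmer (for English text, NLTK's PorterStemmer or SnowballStemmer would be the usual candidates).

```python
def toy_stem(word):
    """Hypothetical suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "daring adventures"

# Iterating over a string yields characters, so nothing is ever stemmed.
char_level = "".join(toy_stem(ch) for ch in text)

# Splitting first yields words, which is what a stemmer expects.
word_level = " ".join(toy_stem(w) for w in text.split())

print(char_level)  # daring adventures
print(word_level)  # dar adventure
```

Switching the notebook to a word-level loop would change the recorded outputs, so the results shown in this question reflect stop-word removal and lowercasing only.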

In [106]:
def pipeline(text):
    text_without_stop_words=preprocess(text)
    text_after_stemming=stemming(text_without_stop_words)
    return text_after_stemming
pipeline("Security is mostly a superstition. Life is either a daring adventure or nothing.")

'security superstition life daring adventure'

In [107]:
new_data=data.iloc[0:500]

In [108]:
new_data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
495,"""American Nightmare"" is officially tied, in my...",negative
496,"First off, I have to say that I loved the book...",negative
497,This movie was extremely boring. I only laughe...,negative
498,I was disgusted by this movie. No it wasn't be...,negative


In [109]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# when adding a column to a slice.
new_data = new_data.assign(new_review=new_data["review"].apply(pipeline))


In [110]:
new_data

Unnamed: 0,review,sentiment,new_review
0,One of the other reviewers has mentioned that ...,positive,reviewers mentioned watching 1 oz episode hook...
1,A wonderful little production. <br /><br />The...,positive,wonderful little production < br /><br />the f...
2,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...
3,Basically there's a family where a little boy ...,negative,basically family little boy jake thinks zombie...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter mattei love time money visually stunnin...
...,...,...,...
495,"""American Nightmare"" is officially tied, in my...",negative,american nightmare officially tied opinion pat...
496,"First off, I have to say that I loved the book...",negative,loved book animal farm read 9th grade class gr...
497,This movie was extremely boring. I only laughe...,negative,movie extremely boring laughed times decided r...
498,I was disgusted by this movie. No it wasn't be...,negative,disgusted movie graphic sex scenes ruined imag...


In [111]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# when adding a column to a slice.
new_data = new_data.assign(vector=new_data['new_review'].apply(lambda text: nlp(text).vector))


In [112]:
X_train, X_test, y_train, y_test = train_test_split(
    new_data.vector.values,
    new_data.sentiment,
    test_size=0.2,
    random_state=2022
)

In [113]:
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [114]:
from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)


gaussian_model = GaussianNB()
gaussian_model.fit(scaled_train_embed, y_train)

In [115]:
y_pred = gaussian_model.predict(scaled_test_embed)
print("Naive Bayes classification accuracy", accuracy(y_test, y_pred))

Naive Bayes classification accuracy 0.86


Accuracy improves from 0.67 to 0.86. Removing stop words and punctuation eliminates very common words that contribute little to sentiment classification, so the classifier can focus on the more informative words. With only 100 test reviews, though, part of the difference may be due to chance.

Question 49

In [116]:
df = pd.read_csv("/kaggle/input/movies-dataset/movie_dataset.csv")

In [117]:
df.head()

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


In [118]:
df=df.head(500)

In [119]:
df["new_overview"] = df["overview"].apply(pipeline)
df

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director,new_overview
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron,22nd century paraplegic marine dispatched moon...
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski,captain barbossa long believed dead come life ...
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes,cryptic message bond past sends trail uncover ...
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan,following death district attorney harvey dent ...
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton,john carter war weary military captain inexpli...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,495,79000000,Adventure Action Science Fiction,http://www.themysteriousisland.com/,72545,mission mysterious island missing person durin...,en,Journey 2: The Mysterious Island,Sean Anderson partners with his mom's boyfrien...,40.723459,...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Believe the Impossible. Discover the Incredible.,Journey 2: The Mysterious Island,5.8,1030,Dwayne Johnson Josh Hutcherson Kristin Davis V...,"[{'name': 'Michael Bostick', 'gender': 2, 'dep...",Brad Peyton,sean anderson partners mom boyfriend mission f...
496,496,78000000,Animation Family Comedy,,109451,inventor food scientist,en,Cloudy with a Chance of Meatballs 2,After the disastrous food storm in the first f...,41.247402,...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Something big was leftover.,Cloudy with a Chance of Meatballs 2,6.4,915,Bill Hader Anna Faris James Caan Will Forte An...,"[{'name': 'Mary Hidalgo', 'gender': 1, 'depart...",Cody Cameron,disastrous food storm film flint friends force...
497,497,78000000,Crime Thriller Horror,,9533,psychopath serial killer fbi agent,en,Red Dragon,"Former FBI Agent Will Graham, who was once alm...",10.083905,...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Before the Silence.,Red Dragon,6.7,1115,Anthony Hopkins Edward Norton Ralph Fiennes Ha...,"[{'name': 'Danny Elfman', 'gender': 2, 'depart...",Brett Ratner,fbi agent graham killed savage hannibal cannib...
498,498,100000000,Western Adventure,,2023,horse race horse racehorse,en,Hidalgo,"Set in 1890, this is the story of a Pony Expre...",16.759252,...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Unbridled. Unbroken. Unbeaten.,Hidalgo,6.5,318,Viggo Mortensen Zuleikha Robinson Omar Sharif ...,"[{'name': 'James Newton Howard', 'gender': 2, ...",Joe Johnston,set 1890 story pony express courier travels ar...


In [120]:
docs = list(df["new_overview"])

In [121]:
print(len(docs))

500


In [122]:
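def tokenizer(text):
    doc = nlp(text)

    no_stop_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
    # NOTE: the lowercased list is built but not returned, so tokens keep
    # their original casing (e.g. 'New', 'York' in a later output).
    no_stop_words_lowered = [tok.lower() for tok in no_stop_words]

    return no_stop_words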

In [124]:
docs = [tokenizer(token) for token in docs]

In [125]:
docs[0]

['22nd',
 'century',
 'paraplegic',
 'marine',
 'dispatched',
 'moon',
 'pandora',
 'unique',
 'mission',
 'torn',
 'following',
 'orders',
 'protecting',
 'alien',
 'civilization']

In [126]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

In [127]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

In [128]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [129]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 55
Number of documents: 500
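
Only 55 tokens survive because few words appear in at least 20 of the 500 overviews. The sketch below approximates the document-frequency rule in plain Python on a made-up corpus. It is not gensim's implementation (which, among other things, also caps the vocabulary size through its `keep_n` parameter), and the function name is hypothetical.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and in at most
    a fraction `no_above` of all docs."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical mini-corpus of tokenized documents.
docs = [
    ["world", "war", "hero"],
    ["world", "love"],
    ["world", "war"],
    ["world", "city"],
]

print(sorted(filter_by_doc_freq(docs, no_below=2, no_above=0.5)))  # ['war']
```

Here "world" is dropped for being too common and "hero", "love", and "city" for being too rare, which mirrors how a strict `no_below` shrinks the vocabulary on a small corpus.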


In [130]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10
chunksize = 500
passes = 20
iterations = 100
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

In [131]:
top_topics = model.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -4.4169.
[([(0.20106986, 'world'),
   (0.094433814, 'find'),
   (0.090884425, 'team'),
   (0.06724107, 'race'),
   (0.06539463, 'mission'),
   (0.05037568, 'battle'),
   (0.047818035, 'earth'),
   (0.04414724, 'crew'),
   (0.04086263, 'save'),
   (0.038693737, 'forces'),
   (0.036543563, 'way'),
   (0.03632532, 'army'),
   (0.034192536, 'finds'),
   (0.03025987, 'human'),
   (0.021161223, 'evil'),
   (0.02090097, 'home'),
   (0.01877165, 'epic'),
   (0.016250607, 'man'),
   (0.014677253, 'war'),
   (0.010603868, 'mysterious')],
  -3.0925339643601695),
 ([(0.19783418, '  '),
   (0.1860835, 'time'),
   (0.056976974, 'home'),
   (0.051161867, 'life'),
   (0.05096049, 'epic'),
   (0.04894247, 'men'),
   (0.048493683, 'years'),
   (0.046876732, 'love'),
   (0.041813266, 'world'),
   (0.038745496, 'lives'),
   (0.032838713, 'save'),
   (0.03268077, 'young'),
   (0.025040464, 'story'),
   (0.02313174, 'city'),
   (0.018020999, 'set'),
   (0.013505763, 'year'),
   (0.0

Question 50

In [132]:
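# log_perplexity() returns the per-word likelihood bound (a negative number),
# not the perplexity itself.
perplexity = model.log_perplexity(corpus)
print(f"Perplexity: {perplexity:.2f}")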

Perplexity: -4.40
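
The printed value is the per-word likelihood bound. gensim's own log messages convert it to a perplexity estimate as 2 raised to the negative bound, and the sketch below applies that conversion to the number shown above. The variable names are chosen here for illustration.

```python
# Per-word likelihood bound as printed by the cell above.
per_word_bound = -4.40

# Perplexity estimate: 2 ** (-bound). Lower is better.
perplexity = 2 ** (-per_word_bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

Note that this bound was computed on the training corpus itself, so it is likely optimistic compared with a held-out evaluation.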


Question 51

In [133]:
from operator import itemgetter

def name_topics(lda_model, num_words=3):
    topic_names = []

    for topic_id in range(lda_model.num_topics):
        topic_words = lda_model.show_topic(topic_id, num_words)
        topic_name = ", ".join(word for word, _ in topic_words)
        topic_names.append(f"Topic {topic_id + 1}: {topic_name}")

    return topic_names

# Get topic names
topic_names = name_topics(model, num_words=5)

# Print the topic names
for name in topic_names:
    print(name)

Topic 1: young, wife, world, help, set
Topic 2: life, family, agent, human, home
Topic 3: new, city, life, adventure, planet
Topic 4: world, find, team, race, mission
Topic 5: old, named, son, men, year
Topic 6: evil, man, father, group, high
Topic 7: stop, true, love, mission, dangerous
Topic 8: discovers, secret, john, years, mysterious
Topic 9:   , time, home, life, epic
Topic 10: war, protect, world, forces, city


Question 52

In [134]:
movie_overview = [
    "Eight years after the Joker's reign of chaos, Batman is coerced out of exile with the assistance of the mysterious Selina Kyle in order to defend Gotham City from the vicious guerrilla terrorist Bane.",
    "In a post-apocalyptic wasteland, a woman rebels against a tyrannical ruler in search for her homeland with the aid of a group of female prisoners, a psychotic worshiper and a drifter named Max.",
    "Armed with only one word, Tenet, and fighting for the survival of the entire world, a Protagonist journeys through a twilight world of international espionage on a mission that will unfold in something beyond real time.",
    "As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea and by the co-founder who was later squeezed out of the business.",
    "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.",
    "After narrowly escaping a bizarre accident, a troubled teenager is plagued by visions of a man in a large rabbit suit who manipulates him to commit a series of crimes.",
    "On the hottest day of the year on a street in the Bedford-Stuyvesant section of Brooklyn, everyone's hate and bigotry smolders and builds until it explodes into violence.",
    "A New York City police officer tries to save his estranged wife and several others taken hostage by terrorists during a Christmas party at the Nakatomi Plaza in Los Angeles."
]

In [135]:
new_movie_overview = [pipeline(overview) for overview in movie_overview]

In [136]:
new_movie_overview

['years joker reign chaos batman coerced exile assistance mysterious selina kyle order defend gotham city vicious guerrilla terrorist bane',
 'post apocalyptic wasteland woman rebels tyrannical ruler search homeland aid group female prisoners psychotic worshiper drifter named max',
 'armed word tenet fighting survival entire world protagonist journeys twilight world international espionage mission unfold real time',
 'harvard student mark zuckerberg creates social networking site known facebook sued twins claimed stole idea co founder later squeezed business',
 'menace known joker wreaks havoc chaos people gotham batman accept greatest psychological physical tests ability fight injustice',
 'narrowly escaping bizarre accident troubled teenager plagued visions man large rabbit suit manipulates commit series crimes',
 'hottest day year street bedford stuyvesant section brooklyn hate bigotry smolders builds explodes violence',
 'new york city police officer tries save estranged wife taken

In [137]:
new_movie_overview = [tokenizer(token) for token in new_movie_overview]

In [138]:
new_movie_overview

[['years',
  'joker',
  'reign',
  'chaos',
  'batman',
  'coerced',
  'exile',
  'assistance',
  'mysterious',
  'selina',
  'kyle',
  'order',
  'defend',
  'gotham',
  'city',
  'vicious',
  'guerrilla',
  'terrorist',
  'bane'],
 ['post',
  'apocalyptic',
  'wasteland',
  'woman',
  'rebels',
  'tyrannical',
  'ruler',
  'search',
  'homeland',
  'aid',
  'group',
  'female',
  'prisoners',
  'psychotic',
  'worshiper',
  'drifter',
  'named',
  'max'],
 ['armed',
  'word',
  'tenet',
  'fighting',
  'survival',
  'entire',
  'world',
  'protagonist',
  'journeys',
  'twilight',
  'world',
  'international',
  'espionage',
  'mission',
  'unfold',
  'real',
  'time'],
 ['harvard',
  'student',
  'mark',
  'zuckerberg',
  'creates',
  'social',
  'networking',
  'site',
  'known',
  'facebook',
  'sued',
  'twins',
  'claimed',
  'stole',
  'idea',
  'co',
  'founder',
  'later',
  'squeezed',
  'business'],
 ['menace',
  'known',
  'joker',
  'wreaks',
  'havoc',
  'chaos',
  'peop

In [139]:
movie_topic_distributions = [model.get_document_topics(dictionary.doc2bow(token)) for token in new_movie_overview]

In [140]:
liked_movie_overview = "The jury in a New York City murder trial is frustrated by a single member whose skeptical caution forces them to more carefully consider the evidence before jumping to a hasty verdict."

In [141]:
liked_movie_overview_tokenized = tokenizer(liked_movie_overview)

In [142]:
liked_movie_overview_tokenized

['jury',
 'New',
 'York',
 'City',
 'murder',
 'trial',
 'frustrated',
 'single',
 'member',
 'skeptical',
 'caution',
 'forces',
 'carefully',
 'consider',
 'evidence',
 'jumping',
 'hasty',
 'verdict']

In [143]:
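liked_movie_topic_distribution = model.get_document_topics(dictionary.doc2bow(liked_movie_overview_tokenized))
# NOTE: max() on (topic_id, probability) tuples compares topic_id first, so it
# returns the entry with the highest topic id, not the most probable topic.
liked_movie_most_relevant_topic = max(liked_movie_topic_distribution)

In [144]:
# Same caveat: each value below is the film's entry for its highest topic id.
similarities = [max(topic_distribution) for topic_distribution in movie_topic_distributions]
print(similarities)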

[(9, 0.8434716), (9, 0.023275519), (9, 0.013617788), (9, 0.08003938), (9, 0.08003938), (9, 0.036063712), (9, 0.03606368), (9, 0.013617825)]


In [148]:
similar_movie_indices = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)

recommended_movies = [movie_overview[i] for i in similar_movie_indices[0:3]]

print("Recommended Movies:")
for movie in recommended_movies:
    print(movie)

Recommended Movies:
Eight years after the Joker's reign of chaos, Batman is coerced out of exile with the assistance of the mysterious Selina Kyle in order to defend Gotham City from the vicious guerrilla terrorist Bane.
As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea and by the co-founder who was later squeezed out of the business.
When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.
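
The ranking above sorts candidates by their weight on topic 9 and does not involve the liked film's own topics. One possible alternative, sketched below with invented distributions, is to compare each candidate's full topic distribution with the liked film's using cosine similarity. The helper names, film titles, and probabilities are all hypothetical.

```python
import math
from operator import itemgetter

NUM_TOPICS = 10

def to_dense(sparse, num_topics=NUM_TOPICS):
    """Turn [(topic_id, prob), ...] into a fixed-length list of probabilities."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse:
        dense[topic_id] = prob
    return dense

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical topic distributions in the (topic_id, probability) format.
liked = [(2, 0.70), (9, 0.30)]
candidates = {
    "film A": [(2, 0.60), (5, 0.40)],
    "film B": [(0, 0.10), (9, 0.90)],
    "film C": [(4, 1.00)],
}

# Dominant topic: compare on the probability, not on the topic id.
print(max(liked, key=itemgetter(1)))  # (2, 0.7)

# Rank candidates by similarity to the liked film's topic mix.
ranked = sorted(
    candidates,
    key=lambda title: cosine(to_dense(liked), to_dense(candidates[title])),
    reverse=True,
)
print(ranked)  # ['film A', 'film B', 'film C']
```

With this approach the recommendation changes when the liked film changes, which is the behavior a recommender is expected to have.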


Question 53

The result looks coherent, since it suggests action films and a film involving a lawsuit. Note, however, that the ranking sorts candidates by their weight on topic 9 and never uses the liked film's own topic distribution, so the match may be partly coincidental. To improve the recommender, we could build it on a larger dataset with a wider variety of films, and it would probably perform better if it used word embeddings as input.