## 1. Data Preprocessing

We import a text:

In [1]:
# process "The Time Machine", a science fiction novel
import os
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
path = 'The_Time_Machine/timemachine.txt'

We load the text into a pandas DataFrame:

In [2]:
df = pd.DataFrame(columns=['title','article'])

with open(path, 'r') as f:
    articles = f.readlines()
    text = ""
    for article in articles:
        text = text + article
    text = text.replace('\n',' ')
    text = text.replace('-',' ')
    text = text.replace('_', ' ')
    df_new = pd.DataFrame({'title':['The Time Machine.txt'],'article':[text]})
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, df_new], ignore_index=True)
df

Unnamed: 0,title,article
0,The Time Machine.txt,I The Time Traveller (for so it will be conv...
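
The three `replace` calls above can be collected into one helper. This is a minimal sketch, not the notebook's own code: the name `normalize_text` is hypothetical, and collapsing repeated whitespace is an extra step the original cell does not perform (the original leaves runs of spaces in the text).

```python
import re

def normalize_text(raw: str) -> str:
    """Replace newlines, hyphens and underscores with spaces, then collapse whitespace."""
    text = re.sub(r"[\n\-_]", " ", raw)
    # Extra step (assumption): squeeze runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

sample = "I\n\nThe Time Traveller was in an after-dinner mood"
print(normalize_text(sample))
```

With such a helper, the cell body could reduce to reading the file once with `f.read()` and passing the result through `normalize_text`.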


In [3]:
print(df.article[0])



## 2. PyCaret environment

In [4]:
from pycaret.nlp import *

We preprocess the text: stopword removal, special character removal, bigram extraction, etc.

In [5]:
exp_nlp101 = setup(data=df, target='article', session_id=123)

Description,Value
session_id,123
Documents,1
Vocab Size,3547
Custom Stopwords,False


In [6]:
plot_model()

In [7]:
plot_model(plot = 'bigram')

In [8]:
# choose some meaningful bigrams as topics
topics = ['time traveller','electronic work','thick dust','time dimension','silent man']

# Sentence Transformer


We import the package Sentence Transformer and load the stsb-roberta-large model:

In [9]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-roberta-large')

We prepend 'supports' and 'opposes' to each topic, which gives ten boolean predicate sentences (two per topic):

In [10]:
# generate beliefs
beliefs = []

for topic in topics:
    pos_sentence = 'supports ' + topic
    neg_sentence = 'opposes ' + topic
    beliefs.append(pos_sentence)
    beliefs.append(neg_sentence)

beliefs

['supports time traveller',
 'opposes time traveller',
 'supports electronic work',
 'opposes electronic work',
 'supports thick dust',
 'opposes thick dust',
 'supports time dimension',
 'opposes time dimension',
 'supports silent man',
 'opposes silent man']

We encode these sentences as embeddings with the sentence transformer model:

In [11]:
beliefs_embeddings = model.encode(beliefs)
beliefs_embeddings.shape

(10, 1024)

In [12]:
# split articles
sentences = df.article[0].split('.')
sentences

['I   The Time Traveller (for so it will be convenient to speak of him) was expounding a recondite matter to us',
 ' His grey eyes shone and twinkled, and his usually pale face was flushed and animated',
 ' The fire burned brightly, and the soft radiance of the incandescent lights in the lilies of silver caught the bubbles that flashed and passed in our glasses',
 ' Our chairs, being his patents, embraced and caressed us rather than submitted to be sat upon, and there was that luxurious after dinner atmosphere when thought roams gracefully free of the trammels of precision',
 ' And he put it to us in this way  marking the points with a lean forefinger  as we sat and lazily admired his earnestness over this new paradox (as we thought it) and his fecundity',
 "  'You must follow me carefully",
 ' I shall have to controvert one or two ideas that are almost universally accepted',
 ' The geometry, for instance, they taught you at school is founded on a misconception',
 "'  'Is not that rath

In [13]:
# get sentences embeddings
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

(2014, 1024)
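
Splitting on '.' alone ignores '?' and '!' and can leave empty or very short fragments in the list. A slightly more careful splitter might look like the sketch below. The function name `split_sentences` and the `min_chars` cutoff are assumptions for illustration, and the regular expression still mis-splits abbreviations such as "Mr.".

```python
import re
from typing import List

def split_sentences(text: str, min_chars: int = 3) -> List[str]:
    """Split after '.', '!' or '?' followed by whitespace; drop tiny fragments."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]

sample = "The fire burned brightly. Was it late? Nobody spoke!"
print(split_sentences(sample))
```

`nltk.sent_tokenize` is another option if the punkt tokenizer data is installed; it handles abbreviations better than a plain regular expression.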

Calculate the similarity between the sentences and the beliefs using cosine_similarity:

In [14]:
similarity = []
for belief in beliefs_embeddings:
    max_simi = 0
    for sentence in sentence_embeddings:
        simi = cosine_similarity([belief], [sentence])[0][0]
        # keep the maximum: if one sentence is strongly related to a belief,
        # we treat that belief as the writer's belief
        max_simi = max(simi, max_simi)
    similarity.append(max_simi)
similarity

[0.5588477,
 0.5278709,
 0.62620753,
 0.44709566,
 0.5989601,
 0.54649675,
 0.577142,
 0.5700769,
 0.5217863,
 0.6457208]
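
To make explicit what the loop above computes, here is a dependency-free sketch of cosine similarity and the per-belief maximum. The toy 2-D vectors stand in for the 1024-dimensional embeddings and are purely illustrative. Unlike the cell above, this version does not start the maximum at 0, so an all-negative row would yield a negative value.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def max_similarity(belief_vecs, sentence_vecs):
    """For each belief vector, the highest cosine similarity to any sentence vector."""
    return [max(cosine(b, s) for s in sentence_vecs) for b in belief_vecs]

# Toy vectors (assumption): two "beliefs" and three "sentences".
toy_beliefs = [[1.0, 0.0], [0.0, 1.0]]
toy_sentences = [[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
print(max_similarity(toy_beliefs, toy_sentences))
```

With scikit-learn, `cosine_similarity(beliefs_embeddings, sentence_embeddings).max(axis=1)` should give the same per-belief maxima in a single call, since the function accepts two matrices and returns the full pairwise similarity matrix.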

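In similarity, the values at even indices (0, 2, ...) are the similarities for the 'supports' sentences and the values at odd indices (1, 3, ...) are those for the 'opposes' sentences. For each topic, we print whichever of the two sentences has the higher similarity as the author's attitude: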

In [20]:
for i in range(len(similarity)):
    if i%2==1:
        if similarity[i]>similarity[i-1]:
            print(beliefs[i])
        else:
            print(beliefs[i-1])

supports time traveller
supports electronic work
supports thick dust
supports time dimension
opposes silent man


### Conclusion: "supports" and "opposes" cannot always be used to build the belief sentences.
### Next exploration: Try to find suitable verbs in the text.

Dino: There is not enough of a difference between positive and negative position to conclude one way or another.
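
One way to make this concern concrete is to report the margin between the two similarities and only decide when it exceeds a threshold. The sketch below reuses the values printed above. The function name `decide` and the threshold `min_margin=0.1` are arbitrary assumptions, not calibrated values.

```python
beliefs = [
    "supports time traveller", "opposes time traveller",
    "supports electronic work", "opposes electronic work",
    "supports thick dust", "opposes thick dust",
    "supports time dimension", "opposes time dimension",
    "supports silent man", "opposes silent man",
]
similarity = [0.5588477, 0.5278709, 0.62620753, 0.44709566, 0.5989601,
              0.54649675, 0.577142, 0.5700769, 0.5217863, 0.6457208]

def decide(beliefs, similarity, min_margin=0.1):
    """Pair each 'supports'/'opposes' score and decide only if the gap is large enough."""
    results = []
    for pos in range(0, len(beliefs), 2):
        margin = similarity[pos] - similarity[pos + 1]
        if abs(margin) < min_margin:
            verdict = "undecided"
        elif margin > 0:
            verdict = beliefs[pos]
        else:
            verdict = beliefs[pos + 1]
        results.append((verdict, round(margin, 3)))
    return results

for verdict, margin in decide(beliefs, similarity):
    print(f"{verdict:28s} margin={margin:+.3f}")
```

With this threshold, only two of the five topics get a verdict; the other three margins are below 0.1.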

Dino: Chuanyang, this is not a code file, it's a Jupyter notebook. All the comments should be in markdown cells. The only comments in code cells should illustrate the code, not the intent behind the code.

# Extract the most common Verb
Try to find the most common verbs by using Counter

In [15]:
# lowercase article
df.article = df.article.apply(lambda x : x.lower())
df

Unnamed: 0,title,article
0,The Time Machine.txt,i the time traveller (for so it will be conv...


Here is an example of converting verbs to their base form with the default lemmatizer. The results are poor because the lemmatizer treats every word as a noun by default, so we then lemmatize each word based on its POS tag.

In [16]:
# example: convert verbs to their base form
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text = "does do did doing studies study studied studying cries cry cried crying"
tokenization = nltk.word_tokenize(text)
new_text = ""
for w in tokenization:
    new_text = new_text + wordnet_lemmatizer.lemmatize(w)
    new_text = new_text + ' '
    
new_text

'doe do did doing study study studied studying cry cry cried cry '

In [17]:
# get words' pos_tag
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [18]:
# convert verbs to their base form based on pos_tag
def lemma(sentence):
    tokens = word_tokenize(sentence) 
    tagged_sent = pos_tag(tokens)     # get pos tag

    wnl = WordNetLemmatizer()
    lemmas_sent = ''
    for tag in tagged_sent:
        wordnet_pos = get_wordnet_pos(tag[1]) or wordnet.NOUN
        lemmas_sent = lemmas_sent + wnl.lemmatize(tag[0], pos=wordnet_pos)
        lemmas_sent = lemmas_sent + ' '

    return lemmas_sent

sentence = "does do did doing studies study studied studying cries cry cried crying"
lemma(sentence)

'do do do do study study study study cry cry cry cry '

In [19]:
print(df.article[0])



In [20]:
# apply lemmatization to article
df.article = df.article.apply(lambda x : lemma(x))
print(df.article[0])



In [21]:
# use Counter to select the most common verbs
# small example
from collections import Counter

example = ['I','like','this','photo','like']
print(Counter(example))

Counter({'like': 2, 'I': 1, 'this': 1, 'photo': 1})


In [22]:
# collect all verbs and nouns
def collect_verbs_nonus(text):
    verbs = []
    nonus = []
    tokens = word_tokenize(text)
    tagged_tokens = pos_tag(tokens)

    for token in tagged_tokens:
        if token[1].startswith('V'):
            verbs.append(token[0])
        elif token[1].startswith('N'):
            nonus.append(token[0])
    return verbs,nonus
text = 'Today the Netherlands celebrates King\'s Day. To honor this tradition, the Netherlands embassy in San Francisco invited me to'
verbs,nonus = collect_verbs_nonus(text)
verbs,nonus

(['celebrates', 'honor', 'invited'],
 ['Today',
  'Netherlands',
  'King',
  'Day',
  'tradition',
  'Netherlands',
  'embassy',
  'San',
  'Francisco'])

In [23]:
v_counter = Counter(verbs)
n_counter = Counter(nonus)

In [24]:
# print the two most common verbs
v_counter.most_common(2)

[('celebrates', 1), ('honor', 1)]

In [25]:
# print the two most common nouns
n_counter.most_common(2)

[('Netherlands', 2), ('Today', 1)]

In [26]:
# apply to the full article
verbs,nonus = collect_verbs_nonus(df.article[0])
v_counter = Counter(verbs)
n_counter = Counter(nonus)

In [27]:
# select the 30 most common verbs
common_verb = v_counter.most_common(30)
common_verb

[('be', 1119),
 ('have', 506),
 ('i', 169),
 ('come', 138),
 ('do', 113),
 ('say', 112),
 ('go', 100),
 ('saw', 87),
 ('seem', 77),
 ('see', 70),
 ('think', 66),
 ('make', 63),
 ('find', 62),
 ('felt', 56),
 ('take', 53),
 ('look', 53),
 ('get', 49),
 ('know', 47),
 ('begin', 42),
 ('put', 36),
 ('leave', 27),
 ('run', 27),
 ('follow', 26),
 ("'s", 25),
 ('tell', 25),
 ('stand', 24),
 ('hear', 22),
 ('sit', 21),
 ('move', 21),
 ('turn', 21)]
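
The list above includes tokens such as 'i' and "'s" that the tagger labelled as verbs, along with auxiliaries like 'be' and 'have'. A small filter can remove them before ranking. This is a sketch under assumptions: the stop list `AUXILIARY_VERBS` and the name `filter_verbs` are hypothetical, and the sample counts are copied from the output above.

```python
from collections import Counter

# Hypothetical stop list; extend as needed.
AUXILIARY_VERBS = {"be", "have", "do"}

def filter_verbs(counter, min_len=2, stop=AUXILIARY_VERBS):
    """Keep alphabetic tokens of at least min_len characters that are not in the stop list."""
    return Counter({
        word: count
        for word, count in counter.items()
        if word.isalpha() and len(word) >= min_len and word not in stop
    })

sample = Counter({"be": 1119, "have": 506, "i": 169, "come": 138, "'s": 25, "tell": 25})
print(filter_verbs(sample).most_common(3))
```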

In [28]:
# select the most suitable verb
def select_best_verb(topic):
    max_simi = 0
    best_belief = ""
    for verb in common_verb:
#    for verb in v_counter:
        sentence = verb[0] + ' ' + topic
        simi = compute_similarity(sentence)
        if simi > max_simi:
            best_belief = sentence
            max_simi = simi
    return best_belief, max_simi

In [29]:
# compute similarity between article and beliefs
def compute_similarity(belief):
    beliefs_embedding = model.encode(belief)
    max_simi = 0
    for sentence in sentence_embeddings:
        simi = cosine_similarity([beliefs_embedding], [sentence])[0][0]
        # keep the maximum: if one sentence is strongly related to a belief,
        # we treat that belief as the writer's belief
        max_simi = max(simi, max_simi)
    return max_simi

In [30]:
# generate beliefs
beliefs = []

for topic in topics:
    best_belief,max_simi = select_best_verb(topic)
    beliefs.append(best_belief)
beliefs

['leave time traveller',
 'say electronic work',
 'seem thick dust',
 'leave time dimension',
 'seem silent man']

### Conclusion: In this part, I extracted the most common verbs, but these verbs are not distinctive and behave more like stop words.
### Next exploration: Try tf-idf.

# TF-IDF

TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

In [31]:
# a small example of TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today"
]
 
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)
 
# print all features (get_feature_names() was removed in scikit-learn 1.2)
print(list(tfidf_vec.get_feature_names_out()))
 
# print each feature and its index
print(tfidf_vec.vocabulary_)
 
# print tf-idf matrix
print(tfidf_matrix)

['beautiful', 'day', 'dinner', 'for', 'is', 'it', 'like', 'pondering', 'question', 'the', 'this', 'today', 'tonight', 'weather', 'what', 'worth']
{'what': 14, 'is': 4, 'the': 9, 'weather': 13, 'like': 6, 'today': 11, 'for': 3, 'dinner': 2, 'tonight': 12, 'this': 10, 'question': 8, 'worth': 15, 'pondering': 7, 'it': 5, 'beautiful': 0, 'day': 1}
  (0, 11)	0.3710221459250386
  (0, 6)	0.47059454669821993
  (0, 13)	0.47059454669821993
  (0, 9)	0.47059454669821993
  (0, 4)	0.24557575678403082
  (0, 14)	0.3710221459250386
  (1, 12)	0.506765426545092
  (1, 2)	0.506765426545092
  (1, 3)	0.506765426545092
  (1, 4)	0.2644512224141842
  (1, 14)	0.3995396830595886
  (2, 7)	0.4838025881780501
  (2, 15)	0.4838025881780501
  (2, 8)	0.4838025881780501
  (2, 10)	0.4838025881780501
  (2, 4)	0.25246826075544676
  (3, 1)	0.506765426545092
  (3, 0)	0.506765426545092
  (3, 5)	0.506765426545092
  (3, 11)	0.3995396830595886
  (3, 4)	0.2644512224141842
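
The values printed above can be reproduced by hand, which shows what the vectorizer does with its default settings: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The sketch below uses only the standard library. The helper names `tokenize` and `tfidf` are hypothetical, and the tokenizer only approximates the default pattern (lowercased tokens of two or more word characters, which is why "a" is missing from the vocabulary).

```python
import math
import re
from collections import Counter

corpus = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

def tfidf(corpus):
    """Return one {term: weight} dict per document, L2-normalized."""
    docs = [Counter(tokenize(d)) for d in corpus]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n) / (1 + df_t)) + 1 for t, df_t in df.items()}
    rows = []
    for doc in docs:
        weights = {t: tf * idf[t] for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf(corpus)
print(rows[0]["weather"], rows[0]["today"], rows[0]["is"])
```

The three printed weights correspond to entries (0, 13), (0, 11) and (0, 4) of the matrix above.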


Apply the tf-idf algorithm to the article, treating each sentence as a document:

In [60]:
# extract all verbs and nouns from the article
verbs,nonus = collect_verbs_nonus(df.article[0])

In [61]:
# get the tfidf matrix
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(sentences)

In [62]:
# word and index
features = tfidf_vec.vocabulary_
features

{'the': 4365,
 'time': 4433,
 'traveller': 4496,
 'for': 1721,
 'so': 4000,
 'it': 2375,
 'will': 4841,
 'be': 363,
 'convenient': 869,
 'to': 4444,
 'speak': 4047,
 'of': 2958,
 'him': 2088,
 'was': 4751,
 'expounding': 1508,
 'recondite': 3483,
 'matter': 2673,
 'us': 4634,
 'his': 2092,
 'grey': 1932,
 'eyes': 1526,
 'shone': 3854,
 'and': 172,
 'twinkled': 4554,
 'usually': 4642,
 'pale': 3064,
 'face': 1530,
 'flushed': 1695,
 'animated': 184,
 'fire': 1635,
 'burned': 535,
 'brightly': 505,
 'soft': 4006,
 'radiance': 3419,
 'incandescent': 2221,
 'lights': 2545,
 'in': 2213,
 'lilies': 2550,
 'silver': 3905,
 'caught': 611,
 'bubbles': 527,
 'that': 4364,
 'flashed': 1665,
 'passed': 3101,
 'our': 3014,
 'glasses': 1874,
 'chairs': 632,
 'being': 397,
 'patents': 3107,
 'embraced': 1376,
 'caressed': 585,
 'rather': 3439,
 'than': 4362,
 'submitted': 4216,
 'sat': 3699,
 'upon': 4627,
 'there': 4374,
 'luxurious': 2618,
 'after': 99,
 'dinner': 1151,
 'atmosphere': 297,
 'when':

In [63]:
# The tfidf_matrix is a sparse matrix, convert it to dense matrix
dense_matrix = tfidf_matrix.todense()
dense_matrix.shape

(2014, 4936)

In [64]:
# use each term's largest tf-idf value over all sentences as its score
import numpy as np
max_tfidf = np.max(dense_matrix, axis=0)
arr_tfidf = max_tfidf.getA()     # convert matrix to numpy array
arr_tfidf

array([[0.32089585, 0.23560016, 0.23560016, ..., 0.4165639 , 0.366907  ,
        0.32705   ]])

In [65]:
# create a dictionary whose keys are verbs and whose values are their tf-idf scores
set_verbs = set(verbs)
verb_dic = {}
for verb in set_verbs:
    if verb in features:
        verb_dic[verb] = arr_tfidf[0][features[verb]]

In [66]:
# sort the dictionary
sorted_verb = dict(sorted(verb_dic.items(), key=lambda item: item[1], reverse=True))
sorted_verb

{'gutenberg': 1.0,
 'doubt': 0.7397543549459972,
 'wait': 0.7375615173932877,
 'plain': 0.7325719001604722,
 'quartz': 0.7281740431243764,
 'argue': 0.7071093664523651,
 'damp': 0.7030863961192625,
 'rest': 0.6988656132476937,
 'turn': 0.6966263396882154,
 'jump': 0.6942986366360127,
 'minute': 0.6883075790020153,
 'want': 0.6868279195560346,
 'account': 0.686500802166211,
 'good': 0.685511093706903,
 'explain': 0.6795463618073765,
 'truth': 0.6753196430998754,
 'naked': 0.6750148936686133,
 'sunset': 0.6720486388441117,
 'license': 0.658803280632874,
 'object': 0.6551343489521531,
 'hope': 0.6538679168941883,
 'face': 0.6514806155551899,
 'glaring': 0.6467269347133495,
 'struggle': 0.6455384091847866,
 'round': 0.6448387722784474,
 'fell': 0.6437499404477114,
 'apologize': 0.643662010948258,
 'afraid': 0.6426470024519285,
 'frenzy': 0.642161461407677,
 'view': 0.6417985516746293,
 'question': 0.6407681190694284,
 'silent': 0.6373856792662592,
 'defend': 0.6326225502756951,
 'cry': 0.6

In [70]:
# take the 30 verbs with the highest tf-idf scores (sorted_verb is already sorted)
common_verb = list(sorted_verb)[:30]
len(common_verb)

30

In [71]:
# select the most suitable verb
def select_best_verb(topic):
    max_simi = 0
    best_belief = ""
    for verb in common_verb:
#    for verb in v_counter:
        sentence = verb + ' ' + topic
        simi = compute_similarity(sentence)
        if simi > max_simi:
            best_belief = sentence
            max_simi = simi
    return best_belief, max_simi
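
`select_best_verb` is defined twice in this notebook, and the two versions differ only in whether each candidate is a `(verb, count)` tuple or a plain string. A single helper that takes plain verb strings and a scoring function would cover both cases. This is a sketch, not the notebook's code: `select_best_belief` and `toy_score` are hypothetical names, and `toy_score` merely stands in for `compute_similarity` so the example runs without the model.

```python
from typing import Callable, Iterable, Tuple

def select_best_belief(verbs: Iterable[str], topic: str,
                       score: Callable[[str], float]) -> Tuple[str, float]:
    """Return the '<verb> <topic>' sentence with the highest score, and that score."""
    # Score each candidate exactly once, since the real scorer is expensive.
    scored = [(score(f"{verb} {topic}"), f"{verb} {topic}") for verb in verbs]
    best_score, best_sentence = max(scored, key=lambda pair: pair[0])
    return best_sentence, best_score

def toy_score(sentence: str) -> float:
    # Stand-in scorer (assumption): longer sentences score higher.
    return len(sentence) / 100

print(select_best_belief(["go", "explain", "turn"], "time traveller", toy_score))
```

In the notebook this might be called as `select_best_belief((v for v, _ in v_counter.most_common(30)), topic, compute_similarity)` for the frequency-based list, or with `common_verb` directly for the tf-idf list. Note that `max` raises `ValueError` if the verb list is empty.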

In [72]:
# generate beliefs
# try the verbs with the highest tf-idf scores
beliefs = []

for topic in topics:
    best_belief,max_simi = select_best_verb(topic)
    beliefs.append(best_belief)
beliefs

['explain time traveller',
 'gutenberg electronic work',
 'gutenberg thick dust',
 'gutenberg time dimension',
 'turn silent man']
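
The verb 'gutenberg' and the topic 'electronic work' suggest that Project Gutenberg license text is still in the input file. A possible fix is to cut that text before building the DataFrame. The sketch below assumes the file contains the usual "*** START OF ..." and "*** END OF ..." marker lines; the exact wording of those markers is an assumption and should be checked against the file. The function must run on the raw text, before newlines are replaced.

```python
import re

# Marker patterns (assumption): check the exact wording in the actual file.
START_MARKER = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*", re.IGNORECASE)
END_MARKER = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE)

def strip_gutenberg(raw: str) -> str:
    """Drop everything before the start marker and after the end marker, if present."""
    start = START_MARKER.search(raw)
    if start:
        raw = raw[start.end():]
    end = END_MARKER.search(raw)
    if end:
        raw = raw[:end.start()]
    return raw.strip()

sample = (
    "Header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE TIME MACHINE ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE TIME MACHINE ***\n"
    "License text"
)
print(strip_gutenberg(sample))
```

If neither marker is found, the text is returned unchanged apart from stripped leading and trailing whitespace.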

Dino: The TfIdf verbs are indeed less general and more time-traveler-focused verbs than the most common verbs found previously. Selecting bigrams as topics and then prepending a verb is an interesting approach, but it's too haphazard (random). You need a more structured algorithm that will work for all texts.