In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle
import numpy as np
import random

### TF-IDF Vectorization

In [2]:
with open('all_forums_posts_text.data', 'rb') as file:
    posts_text = pickle.load(file)

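# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(
    lowercase=True,
    max_df=0.5,          # drop terms that appear in more than 50% of posts
    min_df=2,            # drop terms that appear in fewer than 2 posts
    ngram_range=(1, 2),  # use unigrams and bigrams
    max_features=10000,  # keep the 10,000 most frequent remaining terms
)
tfidf_vectors = tfidf_vectorizer.fit_transform(posts_text)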

with open('all_forums_tfidf_vector.data', 'wb') as file:
    pickle.dump(tfidf_vectors, file)

with open('tfidf_vectorizer.pkl', 'wb') as file:
    pickle.dump(tfidf_vectorizer, file)
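
Saving the fitted vectorizer alongside the vectors makes it possible to embed text that was not in the original corpus. The sketch below illustrates that idea under stated assumptions: `toy_posts` is a hypothetical stand-in for `posts_text`, the `min_df`/`max_df` filters are left at their defaults because the toy corpus is tiny, and the pickle round-trip happens in memory rather than through `tfidf_vectorizer.pkl`.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for posts_text
toy_posts = [
    "honda civic will not start in the morning",
    "honda civic shuts off when i stop",
    "best pasta recipe with tomato sauce",
]

vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))
toy_vectors = vectorizer.fit_transform(toy_posts)

# Round-trip through pickle in memory, mirroring the dump to disk
restored = pickle.loads(pickle.dumps(vectorizer))

# transform (not fit_transform) reuses the learned vocabulary and IDF weights,
# so the new vector has the same columns as the corpus vectors
new_vector = restored.transform(["my civic will not start"])
print(toy_vectors.shape, new_vector.shape)
```

Calling `fit_transform` on the new text instead would learn a fresh vocabulary, and the resulting vector could no longer be compared against the stored corpus vectors.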

### Similarity Example #1

Both the query post and the most similar post concern Honda Civics and ask for help with starting or stalling problems: the car in the query post struggles to start, while the car in the matched post shuts off when stopped.

In [None]:
# Choose random post as query
random_index = random.randint(0, tfidf_vectors.shape[0] - 1)
query_vector = tfidf_vectors[random_index]

# Calculate cosine similarity
similarity_scores = cosine_similarity(query_vector, tfidf_vectors).flatten()

# Exclude query post itself from recommendations
similarity_scores[random_index] = -1

# Index of most similar post
most_similar_index = np.argmax(similarity_scores)

# Recommend the most similar post
print(f"Query Post: {posts_text[random_index]}")
print(f"Most Similar Post: {posts_text[most_similar_index]}")
print(f"Similarity Score: {similarity_scores[most_similar_index]}")

Query Post: 
2012 Honda Civic usually won’t start right up in the morning but always starts after a few tries. Cranks a few times first try and then seems dead, leave key on cranks a few more times and slows to stop, turn key off and back on cranks better like it never happened or does the same thing again. I’m pretty sure the door lock actuator in drivers side is bad not sure if it is parasitic on power to ignition? Please any ideas

Most Similar Post: 
1991 Honda Civic shuts off when I stop sometimes but turns back on…please leave any solutions

Similarity Score: 0.21726805949542058
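
The cell above returns only the single best match. A minimal sketch of how the same steps could return the top `k` matches with their scores follows. The helper name `top_k_similar` and the corpus `toy_posts` are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_k_similar(vectors, query_index, k=3):
    """Return (index, score) pairs for the k posts most similar to the query."""
    # Slice (rather than index) so the query stays a 2-D, single-row matrix
    query = vectors[query_index:query_index + 1]
    scores = cosine_similarity(query, vectors).flatten()
    scores[query_index] = -1  # exclude the query post itself
    k = min(k, len(scores) - 1)
    best = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in best]


# Hypothetical toy corpus standing in for posts_text
toy_posts = [
    "honda civic will not start in the morning",
    "honda civic shuts off when stopped",
    "best pasta recipe with tomato sauce",
    "utilitarianism is a consequentialist ethical philosophy",
]
vectors = TfidfVectorizer().fit_transform(toy_posts)

results = top_k_similar(vectors, query_index=0, k=2)
for index, score in results:
    print(f"{score:.3f}  {toy_posts[index]}")
```

Looking at several ranked matches instead of one makes it easier to judge whether a low top score, such as the one above, still separates related posts from unrelated ones.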


### Similarity Example #2

Not all of the scraped data is in English. In this example, both posts are in German. They both mention political movements, protest or rebellion against authority, and specific topics such as capitalism and nationalism.

In [4]:
# Choose random post as query
random_index = random.randint(0, tfidf_vectors.shape[0] - 1)
query_vector = tfidf_vectors[random_index]

# Calculate cosine similarity
similarity_scores = cosine_similarity(query_vector, tfidf_vectors).flatten()

# Exclude query post itself from recommendations
similarity_scores[random_index] = -1

# Index of most similar post
most_similar_index = np.argmax(similarity_scores)

# Recommend the most similar post
print(f"Query Post: {posts_text[random_index]}")
print(f"Most Similar Post: {posts_text[most_similar_index]}")
print(f"Similarity Score: {similarity_scores[most_similar_index]}")

Query Post: 
Die Gelbe Westen Bewegung scheint sich global auszubreiten. In Europa, Nordamerika, Asien wird zur Zeit in gelben Westen demonstriert.
Die gelben Westen sind kein homogener Haufen, obwohl sie an Occupy Wallstreet errinern, richtet sich ihr Protest primär gegen die Regierung und nicht gegen das Finanzkapital.
Sie wollen weder Kapitalismus, Nationalismus oder Kommunismus, sie wissen meistens überhaupt nicht was sie wollen. Dennoch scheint eine Masse von Menschen in Ländern der ersten und zweiten Welt mit ihren Lebensbedingungen massiv unzufrieden zu sein.
Es kann natürlich sein, dass sich das ganze wieder verläuft, aber es gibt auch ein Potential, dass sich dieser Unwille früher oder später politisch artikulieren wird. Wer Politik abseits der etablierten Parteien betreiben möchte, dem empfehle ich sich mit diesem Thema mal auseinanderzusetzen.

Most Similar Post: 

Ich erlaube mir, den Vortrag hier auf Deutsch wiederzugeben, mit einigen Anmerkungen und Erklärungen natürlich.
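
TF-IDF has no notion of language: it only compares overlapping terms. One plausible contributor to same-language matches, offered here as an assumption rather than something the notebook verifies, is that common function words of a minority language are rare across the whole corpus, so they survive the `max_df` filter and carry weight. The toy sketch below (hypothetical sentences, default vectorizer settings) shows two German posts on different topics scoring higher than a German and an English post on the same topic.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy posts, not taken from the scraped data
toy_posts = [
    # German, politics (query)
    "die bewegung richtet sich gegen die regierung und nicht gegen das kapital",
    # German, cooking
    "das rezept ist nicht schwer und die suppe hilft gegen hunger",
    # English, politics
    "the movement protests against the government and not against capital",
]

vectors = TfidfVectorizer().fit_transform(toy_posts)

# Compare the query (row 0) against every post
scores = cosine_similarity(vectors[0:1], vectors).flatten()

print(f"German politics vs German cooking:   {scores[1]:.3f}")
print(f"German politics vs English politics: {scores[2]:.3f}")
```

The German pair shares only function words such as "die", "und", and "gegen", yet it outscores the cross-language pair, which shares no terms at all.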

### Similarity Example #3

Both posts are philosophical discussions of values and ethics that bring up utilitarianism and consequentialism. Both authors also discuss the future and how they currently describe their own positions.

In [6]:
# Choose random post as query
random_index = random.randint(0, tfidf_vectors.shape[0] - 1)
query_vector = tfidf_vectors[random_index]

# Calculate cosine similarity
similarity_scores = cosine_similarity(query_vector, tfidf_vectors).flatten()

# Exclude query post itself from recommendations
similarity_scores[random_index] = -1

# Index of most similar post
most_similar_index = np.argmax(similarity_scores)

# Recommend the most similar post
print(f"Query Post: {posts_text[random_index]}")
print(f"Most Similar Post: {posts_text[most_similar_index]}")
print(f"Similarity Score: {similarity_scores[most_similar_index]}")

Query Post: 

I used to be a utilitarian. Now I have become mostly uncertain. I’ve pondered about my uncertainty so much that I decided that it was the only thing that was constant and that could give me a kind of stability. Uncertainty was the only basis I could use to get somewhere. Epistemological consequentialism (EC) is the philosophy I’m currently trying to shape on that basis. But first of all, let’s consider utilitarianism first and then move on to why I started getting doubts.
Utilitarianism
Utilitarianism is an altruistic consequentialist normative ethical philosophy that aims to optimize utility as defined by some kind or set of non-negative mental state(s). Since there are many different kinds and sets of non-negative mental state(s) there are different strands of utilitarianism – see below. My definition here is a bit of a monster, so we need to break it down a bit:


Altruisitic means that people should care about more than their ow