## Example Use Case for Movies
https://www.kernix.com/blog/recommender-system-based-on-natural-language-processing_p10

Algorithm used: LSI (latent semantic indexing, also known as LSA, latent semantic analysis).
Idea: Texts that contain similar words have a similar meaning.

## Preprocessing
  
We create a so-called bag of words. This means that for each text we throw all words into a "bag", so we ignore the ordering and just look at which words occur and how often. This can be thought of as a matrix where each row corresponds to a word and each column to a text. Each entry is either 1 or 0 (the word occurred or did not occur), the number of occurrences, or the tf-idf value (term frequency - inverse document frequency).  
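To make the tf-idf option a bit more concrete, here is a minimal sketch. It assumes one common variant (raw count times `log(N / document_frequency)`); other libraries use slightly different formulas, and the helper name `tf_idf` and the toy texts are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(tokenized_texts):
    """Return (vocabulary, matrix) with one row per word and one column per text."""
    n_texts = len(tokenized_texts)
    counts = [Counter(tokens) for tokens in tokenized_texts]
    vocabulary = sorted(set(w for tokens in tokenized_texts for w in tokens))
    # document frequency: in how many texts does each word occur?
    doc_freq = {w: sum(1 for c in counts if w in c) for w in vocabulary}
    matrix = [
        [c[w] * math.log(n_texts / doc_freq[w]) for c in counts]
        for w in vocabulary
    ]
    return vocabulary, matrix

# hypothetical, already tokenized toy texts
docs = [["text", "about", "blockchain"], ["text", "about", "iot", "iot"]]
vocabulary, matrix = tf_idf(docs)
for word, row in zip(vocabulary, matrix):
    print(word, [round(v, 2) for v in row])
```

Words that occur in every text ("about", "text") get weight 0 under this variant, while words that are specific to one text get a positive weight.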

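In [1]:
import numpy as np
texts = ["This is a text about blockchain.", "Is that a text about IoT?"]

def tokenize(text):
  # lower-case so that "Is" and "is" become the same word
  return text.lower().split(" ")

def split_words(texts):
  # collect the distinct words of all texts; sorted so the row order is reproducible
  words = set()
  for t in texts:
    words = words.union(tokenize(t))
  return sorted(words)
  
words = split_words(texts)
print(words)
# one row per word, one column per text; compare against tokens, not substrings of the raw text
print(np.array([[int(w in tokenize(s)) for s in texts] for w in words]))

['a', 'about', 'blockchain.', 'iot?', 'is', 'text', 'that', 'this']
[[1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [0 1]
 [1 0]]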


By lower-casing we identified that "Is" and "is" are the same word, but "text." and "text", for example, are still seen as different. So we want an additional step that deletes non-word characters. This can be done easily with a regex.
Words like "book" and "books" or "walk" and "walked" are still seen as different. To eliminate those differences we need smarter, language-specific algorithms. This is called stemming; an example library is Snowball.
For some details see http://snowball.tartarus.org/texts/introduction.html
(I'm pretty sure there is some neural network solution for this too. ~1980-1990 technology.)
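The regex clean-up mentioned above could look roughly like this. It is only a sketch: the function name `tokenize` is made up here, and `\w+` keeps digits and underscores as well as letters, which may or may not be what you want.

```python
import re

def tokenize(text):
    """Lower-case the text and keep only runs of word characters."""
    return re.findall(r"\w+", text.lower())

first = tokenize("This is a text about blockchain.")
second = tokenize("Is that a text about IoT?")
print(first)
print(second)
```

With this, "blockchain." and "iot?" lose their punctuation, so "text." and "text" would end up as the same word.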

In [2]:
from nltk.stem import SnowballStemmer
words = ["book", "books", "walk", "walked", "die", "dying", "happy", "unhappy",
"become", "became"]
words2 = ["money", "cash", "cheaply", "reply", "sun", "sunshine",
"dictator", "dictatorship", "house", "huose", ]
stemmer = SnowballStemmer("english") 
stemmed = [stemmer.stem(w) for w in words]
stemmed2 = [stemmer.stem(w) for w in words2]
print(stemmed)
print(stemmed2)

['book', 'book', 'walk', 'walk', 'die', 'die', 'happi', 'unhappi', 'becom', 'becam']
['money', 'cash', 'cheapli', 'repli', 'sun', 'sunshin', 'dictat', 'dictatorship', 'hous', 'huos']


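I would say this is better than what I could have implemented and definitely useful, but it has some serious limitations. For example, "become" and "became" are not merged, synonyms like "money" and "cash" stay unrelated, and the typo "huose" is not recognized as "house".  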
  
  
Also, if you look at the first example, the two texts would be scored as quite similar because they have the meaningless words "is", "a" and "about" in common. Another preprocessing step is therefore to delete such useless words (stop words).  
We do this by taking a static list of known English stop words and deleting those words from our texts.
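A minimal sketch of that filtering step, using a deliberately tiny, hand-written stop word list (an assumption for illustration; real lists such as the one shipped with NLTK are much longer):

```python
# Hypothetical, tiny stop word list; real lists contain well over a hundred entries.
STOPWORDS = {"a", "about", "is", "that", "this"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in stopwords]

first = remove_stopwords(["this", "is", "a", "text", "about", "blockchain"])
second = remove_stopwords(["is", "that", "a", "text", "about", "iot"])
print(first)
print(second)
```

After filtering, the two example texts only share the word "text", which is a much more honest picture of how similar they are.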
  
Usually this word-text matrix is really sparse, so instead of keeping a trillion zeros in memory we use a sparse matrix notation, saving only the triples (row_number, column_number, value) where the value is not 0. That means storing 3 $\cdot$ nr_non_zeroes numbers instead of rows $\cdot$ columns. This sparse representation is also called the corpus.
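The triple notation can be sketched in plain Python as follows. The helper names `to_triples` and `to_dense` are invented for this example; in practice a library such as `scipy.sparse` would do this for you.

```python
def to_triples(dense):
    """Return (row, column, value) for every non-zero entry of a dense matrix."""
    return [
        (i, j, v)
        for i, row in enumerate(dense)
        for j, v in enumerate(row)
        if v != 0
    ]

def to_dense(triples, n_rows, n_cols):
    """Rebuild the dense matrix from its triples."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, v in triples:
        dense[i][j] = v
    return dense

# made-up 3 x 6 matrix with only three non-zero entries
dense = [
    [0, 0, 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 3],
]
triples = to_triples(dense)
print(triples)
print(3 * len(triples), "numbers instead of", len(dense) * len(dense[0]))
```

Here the triples need 9 numbers instead of 18. The saving only pays off when the matrix is mostly zeros; a dense matrix would actually get bigger in this notation.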
  
With this translation we get a word vector for each text, and we could simply measure how similar two of these vectors are. That would be an algorithm that has not learned anything from the data, though.
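One way to measure how similar two such vectors are is cosine similarity. The text above does not name a specific measure, so treat this choice (and the toy vectors) as an assumption:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# assumed vocabulary order: blockchain, iot, text
text_1 = [1, 0, 1]
text_2 = [0, 1, 1]
similarity = cosine_similarity(text_1, text_2)
print(similarity)
```

The two vectors share one of their two words, so the similarity comes out at roughly 0.5. Two texts without any common word would always score 0, no matter how related their topics are.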
  
## Creating an LSI model
The idea is to reduce dimensions and learn topics, so that the algorithm can learn which words are similar instead of only checking whether two texts contain the same words. If there is a text like "bmw is a car" and one like "vw is a car", it will learn the topic "bmw car vw". If we then get two texts "i have a bmw" and "i want a vw", the algorithm can know that they both talk about cars, while just comparing the bag-of-words vectors of those two would not show any similarity.

In [3]:
import numpy as np
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords  # needs the corpus once: nltk.download("stopwords")
texts = ["bmw are cars", "vw is a car", "sun and beach"]

texts = [t.split(" ") for t in texts]
print(texts)

def stem_and_stop(texts):
    stemmer = SnowballStemmer("english")
    # the corpus name must be lower case; "English" raises an OSError
    stopwords_set = set(stopwords.words("english"))
    # drop stop words first (the list holds unstemmed forms), then stem what is left
    kept = [[w.lower() for w in words if w.lower() not in stopwords_set] for words in texts]
    return [[stemmer.stem(w) for w in words] for words in kept]

final_words = stem_and_stop(texts)
# sorted so that the row order of the matrix is reproducible
distinct_words = sorted(set(w for text in final_words for w in text))
print(final_words)
print(distinct_words)
# one row per word, one column per text
words_text_mat = np.array([[int(w in s) for s in final_words] for w in distinct_words])
print(words_text_mat)

# the columns of u are the topics; the sign of each column is arbitrary
u, s, vh = np.linalg.svd(words_text_mat, compute_uv=True, full_matrices=False)
u = np.round(u, 1)
print(u)

[['bmw', 'are', 'cars'], ['vw', 'is', 'a', 'car'], ['sun', 'and', 'beach']]

The columns of the matrix `u` are the topics, ordered by importance. The most important topic is the one containing vw, bmw and car, and the second most important topic contains the other words, sun and beach.  
  
So the algorithm learned that vw and bmw are both cars, or at least made a connection between them. Now we score the similarity between the two new texts "bmw in the sun" and "vw on the beach". These are translated into word vectors as:

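In [None]:
new_texts = ["bmw in the sun", "vw on the beach"]

# translate the new texts into word vectors over the vocabulary from the cell above
new_words = stem_and_stop([t.split(" ") for t in new_texts])
mat = np.array([[int(w in text) for text in new_words] for w in distinct_words])
print(mat)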
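To see the effect on the two new texts, here is a self-contained sketch that compares them once as plain word vectors and once after projecting onto the two strongest topics. The word-text matrix is written out by hand (assuming the vocabulary order beach, bmw, car, sun, vw), and the projection is a plain `U^T x`, which is a simplification of what LSI libraries do.

```python
import numpy as np

# rows: beach, bmw, car, sun, vw (assumed order)
# columns: "bmw are cars", "vw is a car", "sun and beach" after stemming and stop word removal
word_text_mat = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
], dtype=float)

u, s, vh = np.linalg.svd(word_text_mat, full_matrices=False)
topics = u[:, :2]  # keep only the two strongest topics

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

bmw_sun = np.array([0, 1, 0, 1, 0], dtype=float)   # "bmw in the sun"
vw_beach = np.array([1, 0, 0, 0, 1], dtype=float)  # "vw on the beach"

bow_similarity = cosine(bmw_sun, vw_beach)
topic_similarity = cosine(topics.T @ bmw_sun, topics.T @ vw_beach)
print("bag of words:", bow_similarity)
print("topic space: ", topic_similarity)
```

The two texts share no word, so the bag-of-words similarity is 0. In topic space both texts are a mix of the "car" topic and the "sun/beach" topic, so their similarity is close to 1.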

In [None]:
from recommender.nlp import LanguageProcessing
from recommender.database import Database
d = Database()
L = LanguageProcessing(d)


In [None]:
#L.ldamodel.get_document_topics()
import numpy as np
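u = L.lsi.get_topics()  # rows are treated as topics, columns as dictionary words
ur = np.round(u, 2)
nr_words = 4  # number of words printed per topic
print("Top 10 topics")
def get_topics(v1, sgn):
    # print the nr_words words with the largest (sgn=True) or smallest (sgn=False) weight in v1
    x = [(L.dictionary[i], v1[i]) for i in range(len(v1))]# if abs(v1[i]) > 0.1]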
    x.sort(key=lambda a:a[1], reverse=True)
    if sgn:
            #sum([abs(a[1]) for a in x[0:3]]) >= sum([abs(a[1]) for a in x[-nr_words:]]):
        print(x[0:nr_words])
    else:
        tmp = x[-nr_words:]
        tmp.reverse()
        print(tmp)
for i in range(10):
    v1 = ur[i, :]

    get_topics(v1, True)


In [None]:
#L.destinations.index[L.destinations["iata_code"] == "BGI"][0]
L.model.get_topics()

In [None]:
for city_name in ["NYC", "BER", "LON", "BGI", "PMI", "HKG"]:
    print("\n" + city_name)
    city_index = L.destinations.index[L.destinations["iata_code"] == city_name][0]
    topic_vec=[x[1] for x in L.lsi[L.corpus[city_index]]]
    topic_vec_abs = np.abs(topic_vec)
    topic_indices = np.argpartition(topic_vec_abs, [-1,-2,-3])[-3:]
    u = L.lsi.get_topics()
    for i in topic_indices:
        sgn = topic_vec[i] > 0
        ur = u[i,:]
        get_topics(ur,sgn)

In [None]:
from recommender.nlp import LanguageProcessing
from recommender.database import Database
d = Database()
L = LanguageProcessing(d)
L.optimize_parameters()
