# 1. Introduction

In this post I show how natural language processing (NLP) techniques can be applied to a recommender system. A common real-world case: after finishing a course on Coursera, we often want to take related courses to gain deeper insight into the field. It would therefore be useful if the platform could automatically recommend relevant courses to a learner.

One option for the Coursera case is to tag every course with labels in advance and, when a learner finishes a course, suggest other courses that share its labels. However, manually labelling a large number of courses is very time-consuming. Algorithms such as collaborative filtering could also help, but approaches that depend on user feedback do not work for a new course on Coursera that has little feedback (data) from users yet. **A better approach**, in the spirit of content-based filtering, is to use the description of each course and apply NLP techniques to compute, for instance, the similarity between two courses. The following sections show the main techniques we use.

So let's get started!

# 2. Construct the NLP Models

In this section we show how to create a recommendation system for our Coursera users. Let's first import modules for this task.

## 2.1 Load the dataset

In [55]:
import nltk
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
from gensim import corpora, models, similarities

Then we load the data, which can be found here: [Coursera Corpus](http://t.cn/RhjgPkv) (password: **oppc**).

In [9]:
# Read one course per line; the with-block closes the file automatically.
with open('F:/Data Analysis/github/Text-Data-Analysis/Recommender System/coursera_corpus', encoding='gb18030',
          errors='ignore') as file:
    courses = [line.strip() for line in file]
courses[:2]

['Writing II: Rhetorical Composing\tRhetorical Composing engages you in a series of interactive reading, research, and composing activities along with assignments designed to help you become more effective consumers and producers of alphabetic, visual and multimodal texts.  Join us to become more effective writers... and better citizens.\tRhetorical Composing is a course where writers exchange words, ideas,     talents, and support. You will be introduced to a variety of rhetorical     concepts鈥攖hat is, ideas and techniques to inform and persuade audiences鈥攖hat     will help you become a more effective consumer and producer of written,     visual, and multimodal texts. The class includes short videos, demonstrations,     and activities. We envision Rhetorical Composing as a learning community that includes both those enrolled in this course and the instructors. We bring our expertise in writing, rhetoric and course design, and we have designed the assignments and course infrastructure 

From the output, we see that each line holds tab-separated fields, with the course name first. Hence we use the following code to extract the course names.

In [3]:
courses_name = [course.split('\t')[0] for course in courses]
courses_name[:10]

['Writing II: Rhetorical Composing',
 'Genetics and Society: A Course for Educators',
 'General Game Playing',
 'Genes and the Human Condition (From Behavior to Biotechnology)',
 'A Brief History of Humankind',
 'New Models of Business in Society',
 'Analyse Num茅rique pour Ing茅nieurs',
 'Evolution: A Course for Educators',
 'Coding the Matrix: Linear Algebra through Computer Science Applications',
 'The Dynamic Earth: A Course for Educators']
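The stray characters in the outputs above (for example `Num茅rique`) look like UTF-8 text that was decoded as gb18030. Here is a minimal sketch of how that damage arises and how it can be reversed, assuming the corpus file is in fact UTF-8 encoded (the data source does not confirm this):

```python
# Assumption: the corpus file is UTF-8, so decoding it as gb18030 garbles
# every non-ASCII character.
original = "Analyse Numérique pour Ingénieurs"

# Reproduce the damage: UTF-8 bytes read back with the wrong codec.
garbled = original.encode("utf-8").decode("gb18030", errors="ignore")

# Undo it: re-encode with the wrong codec, then decode as UTF-8.
repaired = garbled.encode("gb18030").decode("utf-8")

print(garbled)
print(repaired)
```

If that assumption holds, passing `encoding='utf-8'` to `open` when loading the corpus should avoid the garbling in the first place.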

## 2.2 Preprocess the data

Generally, text pre-processing involves the following steps:

1. Lowercase the words
2. Remove punctuation
3. Remove stopwords
4. Stem the words

In [27]:
text_lower = [[word for word in document.lower().split()] for document in courses]
print(text_lower[0], end = '\t')

['writing', 'ii:', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading,', 'research,', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic,', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words,', 'ideas,', 'talents,', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', 'variety', 'of', 'rhetorical', 'concepts鈥攖hat', 'is,', 'ideas', 'and', 'techniques', 'to', 'inform', 'and', 'persuade', 'audiences鈥攖hat', 'will', 'help', 'you', 'become', 'a', 'more', 'effective', 'consumer', 'and', 'producer', 'of', 'written,', 'visual,', 'and', 'multimodal', 'texts.', 'the', 'class', 'includes', 'short', 'videos,', 'demonstr

The output above shows that tokens such as 'texts.' still have punctuation attached to the word. It is therefore better to use the **word_tokenize** function from nltk, which gives a more meaningful result.

In [30]:
text_lower = [[word.lower() for word in word_tokenize(document)] for document in courses]
print(text_lower[0], end = '\t')

['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading', ',', 'research', ',', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic', ',', 'visual', 'and', 'multimodal', 'texts', '.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers', '...', 'and', 'better', 'citizens', '.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'and', 'support', '.', 'you', 'will', 'be', 'introduced', 'to', 'a', 'variety', 'of', 'rhetorical', 'concepts鈥攖hat', 'is', ',', 'ideas', 'and', 'techniques', 'to', 'inform', 'and', 'persuade', 'audiences鈥攖hat', 'will', 'help', 'you', 'become', 'a', 'more', 'effective', 'consumer', 'and', 'producer', 'of', 'written', ',', 'visual', ',', 'and', 'multimodal', 'texts', '.
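To see the difference between whitespace splitting and tokenizing in isolation, here is a rough stdlib-only stand-in for a tokenizer. The regular expression is an illustrative assumption and is far simpler than what `word_tokenize` actually does:

```python
import re

sentence = "Join us to become more effective writers... and better citizens."

# Whitespace splitting leaves punctuation glued to the words.
by_whitespace = sentence.lower().split()

# Rough stand-in for a tokenizer: runs of word characters, or runs of
# punctuation, each become their own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")
by_regex = TOKEN_PATTERN.findall(sentence.lower())

print(by_whitespace[-1])
print(by_regex[-2:])
```

With whitespace splitting the last token keeps its trailing period, while the regex version emits the word and the period separately.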

The words and the punctuation marks are now separate tokens. The next step is to remove the punctuation.

In [40]:
# Matches any single ASCII punctuation character.
punctuation_pattern = re.compile('[%s]' % re.escape(string.punctuation))
text_lower_no_punctuation = []

for text in text_lower:
    new_text = []
    for token in text:
        # Strip punctuation from the token; keep it only if something is left.
        new_token = punctuation_pattern.sub('', token)
        if new_token:
            new_text.append(new_token)
    text_lower_no_punctuation.append(new_text)

print(text_lower_no_punctuation[0], end="\t")

['writing', 'ii', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading', 'research', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic', 'visual', 'and', 'multimodal', 'texts', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers', 'and', 'better', 'citizens', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words', 'ideas', 'talents', 'and', 'support', 'you', 'will', 'be', 'introduced', 'to', 'a', 'variety', 'of', 'rhetorical', 'concepts鈥攖hat', 'is', 'ideas', 'and', 'techniques', 'to', 'inform', 'and', 'persuade', 'audiences鈥攖hat', 'will', 'help', 'you', 'become', 'a', 'more', 'effective', 'consumer', 'and', 'producer', 'of', 'written', 'visual', 'and', 'multimodal', 'texts', 'the', 'class', 'includes', 'short', 'videos', 'demonstrations', 'and', 'a
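The same filtering can also be written without a regular expression. The sketch below uses `str.translate` on a small made-up token list; `drop_punctuation` is a hypothetical helper name, not part of the original notebook:

```python
import string

# Translation table that deletes every character in string.punctuation.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def drop_punctuation(tokens):
    """Strip ASCII punctuation from each token and drop tokens left empty."""
    stripped = (token.translate(STRIP_PUNCT) for token in tokens)
    return [token for token in stripped if token]

sample = ["writing", "ii", ":", "texts", ".", "writers", "...", "e-mail"]
print(drop_punctuation(sample))
```

Like the regex-based loop, this only knows about the ASCII characters in `string.punctuation`. Non-ASCII characters such as Unicode dashes are not covered, which is why tokens like `concepts鈥攖hat` pass through unchanged.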

The punctuation has been removed. Next, we remove stopwords, which carry little meaning. Fortunately, nltk provides a stopword list that is helpful for this task.

In [43]:
# Requires the nltk stopwords corpus, e.g. via nltk.download('stopwords').
english_stopwords = stopwords.words('english')

In [49]:
text_filtered = [[word for word in document if word not in english_stopwords ] for document in text_lower_no_punctuation]
print(text_filtered[0], end = "\t")

['writing', 'ii', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', 'research', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', 'visual', 'multimodal', 'texts', 'join', 'us', 'become', 'effective', 'writers', 'better', 'citizens', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', 'ideas', 'talents', 'support', 'introduced', 'variety', 'rhetorical', 'concepts鈥攖hat', 'ideas', 'techniques', 'inform', 'persuade', 'audiences鈥攖hat', 'help', 'become', 'effective', 'consumer', 'producer', 'written', 'visual', 'multimodal', 'texts', 'class', 'includes', 'short', 'videos', 'demonstrations', 'activities', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors', 'bring', 'expertise', 'writing', 'rhetoric', 'course', 'design', 'designed', 'assignments', 'course', 'infrastructure', 'help', '
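One small design note: testing membership in a list scans the list on every lookup, while a set gives constant-time lookups on average. The sketch below shows the same filter with a set, using a tiny made-up stopword list rather than nltk's:

```python
# Hypothetical mini stopword list for illustration; the post uses
# stopwords.words('english') from nltk instead.
STOPWORDS = frozenset({"a", "and", "in", "of", "to", "you", "with"})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]

tokens = ["engages", "you", "in", "a", "series", "of", "interactive", "reading"]
print(remove_stopwords(tokens))
```

In the notebook the equivalent change would be to wrap the nltk list once, for example `set(english_stopwords)`, before filtering.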

Finally, we stem the text. Stemming in NLP reduces the different variants of a word to a common stem. For instance, playing, played, and play are all treated as the same word: play. nltk has many useful stemmers; the best-known ones are the Lancaster stemmer and the Porter stemmer. Here we use the Lancaster stemmer on this Coursera corpus.

In [54]:
st = LancasterStemmer()
text_stemmed = [[st.stem(word) for word in document] for document in text_filtered]
print(text_stemmed[0], end = "\t")

['writ', 'ii', 'rhet', 'compos', 'rhet', 'compos', 'eng', 'sery', 'interact', 'read', 'research', 'compos', 'act', 'along', 'assign', 'design', 'help', 'becom', 'effect', 'consum', 'produc', 'alphabet', 'vis', 'multimod', 'text', 'join', 'us', 'becom', 'effect', 'writ', 'bet', 'cit', 'rhet', 'compos', 'cours', 'writ', 'exchang', 'word', 'idea', 'tal', 'support', 'introduc', 'vary', 'rhet', 'concepts鈥攖h', 'idea', 'techn', 'inform', 'persuad', 'audiences鈥攖h', 'help', 'becom', 'effect', 'consum', 'produc', 'writ', 'vis', 'multimod', 'text', 'class', 'includ', 'short', 'video', 'demonst', 'act', 'envid', 'rhet', 'compos', 'learn', 'commun', 'includ', 'enrol', 'cours', 'instruct', 'bring', 'expert', 'writ', 'rhet', 'cours', 'design', 'design', 'assign', 'cours', 'infrastruct', 'help', 'shar', 'expery', 'writ', 'stud', 'profess', 'us', 'collab', 'facilit', 'wex', 'writ', 'exchang', 'plac', 'exchang', 'work', 'feedback']	
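To make the idea of stemming concrete, here is a deliberately naive suffix stripper. It is a toy for illustration only and is not the Lancaster algorithm; `naive_stem` and its three suffix rules are assumptions made up for this sketch:

```python
def naive_stem(word):
    """Toy stemmer: strip one common suffix. Not the Lancaster algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Require a stem of at least three letters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["play", "playing", "played", "plays"]
print({word: naive_stem(word) for word in variants})
```

Real stemmers apply many more rules. The Lancaster stemmer in particular is quite aggressive, as the output above shows: 'rhetorical' becomes 'rhet' and 'writing' becomes 'writ'.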

# 3. Construct the model

In [67]:
dictionary = corpora.Dictionary(text_stemmed)

In [58]:
corpus = [dictionary.doc2bow(text) for text in text_stemmed]
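`corpora.Dictionary` assigns an integer id to each distinct token, and `doc2bow` turns a document into a sparse list of `(token_id, count)` pairs. The stdlib-only sketch below imitates that idea on two made-up documents. The id assignment order is an illustrative choice and will not match gensim's:

```python
from collections import Counter

docs = [
    ["machin", "learn", "dat", "learn"],
    ["algorithm", "design", "dat"],
]

# Assign ids in order of first appearance (illustrative choice only).
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

print(to_bow(docs[0]))
```

Each document is now a short list of ids and counts, and word order is discarded.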

In [60]:
tfidf = models.TfidfModel(corpus)

In [62]:
corpus_tfidf = tfidf[corpus]
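TF-IDF down-weights terms that occur in many documents and up-weights terms that are specific to a few. The sketch below works through a toy case, assuming the weighting is term count times `log2(N / df)` followed by scaling to unit length. That is my understanding of gensim's defaults, so treat the exact formula as an assumption:

```python
import math

# Toy bag-of-words corpus: lists of (token_id, count) pairs.
bows = [
    [(0, 1), (1, 2), (2, 1)],
    [(2, 1), (3, 1), (4, 1)],
]

# Document frequency: number of documents containing each token id.
doc_freq = {}
for bow in bows:
    for token_id, _ in bow:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf(bow, n_docs=len(bows)):
    """Weight counts by log2(N / df), then scale to unit length."""
    weighted = [(i, tf * math.log2(n_docs / doc_freq[i])) for i, tf in bow]
    weighted = [(i, w) for i, w in weighted if w > 0]  # drop zero weights
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(i, w / norm) for i, w in weighted]

print(tfidf(bows[0]))
```

Token 2 appears in both documents, so its weight is zero and it drops out. The remaining two weights keep their 1:2 ratio after normalization.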

In [63]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

In [64]:
index = similarities.MatrixSimilarity(lsi[corpus])

In [70]:
ml_course = text_stemmed[210]
ml_bow = dictionary.doc2bow(ml_course)

In [72]:
ml_lsi = lsi[ml_bow]
print(ml_lsi)

[(0, -8.6933670528331479), (1, -0.66447585480810489), (2, -1.0886790044639523), (3, -0.052309285044395269), (4, -4.4845025795785798), (5, 0.51587283626060187), (6, -2.0423517702051046), (7, -1.5686940271221004), (8, 0.16306096471472381), (9, 0.62193933557425851)]


In [73]:
sims = index[ml_lsi]

In [76]:
sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sort_sims[:10])

[(210, 1.0), (174, 0.9801591), (189, 0.95262301), (238, 0.94242418), (63, 0.94210637), (184, 0.93176317), (141, 0.93109751), (221, 0.92591852), (74, 0.92006928), (220, 0.91861022)]
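`MatrixSimilarity` scores documents by cosine similarity, which is why the query course (210) matches itself with a score of 1.0. The sketch below shows the same ranking step in plain Python. The vectors and course names are invented for illustration and do not come from the corpus:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical topic-space vectors and names, invented for illustration.
names = ["Course A", "Course B", "Course C"]
vectors = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [0.0, 1.0, 0.0]]

query_id = 0
sims = [cosine(vectors[query_id], v) for v in vectors]

# Rank by descending similarity and skip the query itself.
ranked = sorted(enumerate(sims), key=lambda item: -item[1])
recommendations = [(names[i], s) for i, s in ranked if i != query_id]
print(recommendations)
```

Skipping the query's own index is a small addition over the notebook's ranking, since a course should not be recommended to someone who has just finished it.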


In [79]:
print(courses_name[220])

Algorithms: Design and Analysis, Part 2
