# **Extract Bag of Words (BoW) Features from Course Textual Content**


First, let's install and import required libraries:


In [1]:
# uncomment the installs if you haven't done before
#!pip install nltk==3.6.7
#!pip install gensim==4.1.2

In [2]:
import smart_open
import gensim
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora

%matplotlib inline

Download stopwords


In [3]:
# Turn warnings off. Warnings usually help you understand and improve your code,
# but they are hidden here because they may contain personal information.

import warnings
warnings.filterwarnings('ignore')

`quiet=True` hides the download log, since it may contain personal information. You can remove it to see what has been downloaded and where.

In [7]:
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

True

In [8]:
# also set a random state
rs = 123

### Bag of Words (BoW) features


BoW features are essentially the counts or frequencies of each word that appears in a text (string). Let's illustrate it with some simple examples.


Suppose we have two course descriptions as follows:


In [9]:
course1 = "this is an introduction data science course which introduces data science to beginners"

In [10]:
course2 = "machine learning for beginners"

In [11]:
courses = [course1, course2]
courses

['this is an introduction data science course which introduces data science to beginners',
 'machine learning for beginners']

The first step is to split the two strings into words (tokens). In text processing, a token is the smallest unit of text, such as a word, a punctuation mark, or a phrase. The process of transforming a string into a collection of tokens is called `tokenization`.


One common way to do `tokenization` is to use the built-in `split()` method of Python's `str` class. However, in this lab, we want to leverage the `nltk` (Natural Language Toolkit) package, which is probably the most commonly used package for processing text or natural language.


More specifically, we will apply the `word_tokenize()` function to the content of each course (a string):


In [12]:
# Tokenize the two courses
tokenized_courses = [word_tokenize(course) for course in courses]

In [13]:
tokenized_courses

[['this',
  'is',
  'an',
  'introduction',
  'data',
  'science',
  'course',
  'which',
  'introduces',
  'data',
  'science',
  'to',
  'beginners'],
 ['machine', 'learning', 'for', 'beginners']]

As you can see from the cell output, the two courses have been tokenized and turned into two lists of tokens.
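For comparison, here is a minimal sketch of the simpler `split()` approach mentioned earlier. On these lowercase, punctuation-free descriptions it yields the same tokens as the output above. The `punctuated` string is a made-up example (not from the course data) that shows where plain whitespace splitting falls short.

```python
# Plain whitespace tokenization with the built-in str.split().
course1 = "this is an introduction data science course which introduces data science to beginners"
course2 = "machine learning for beginners"

split_courses = [course.split() for course in (course1, course2)]
print(split_courses[1])
# ['machine', 'learning', 'for', 'beginners']

# split() only breaks on whitespace, so punctuation stays glued to the words.
punctuated = "machine learning, for beginners."  # hypothetical example text
print(punctuated.split())
# ['machine', 'learning,', 'for', 'beginners.']
```

A tokenizer such as `word_tokenize()` is generally expected to separate punctuation into its own tokens, which is one reason to prefer it on raw text.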


Next, we want to create a token dictionary to index all tokens, that is, to assign a key/index to each token. One way to index tokens is to use the `gensim` package, another popular package for processing textual data:


In [14]:
# Create a token dictionary for the two courses
tokens_dict = gensim.corpora.Dictionary(tokenized_courses)

In [15]:
print(tokens_dict.token2id)

{'an': 0, 'beginners': 1, 'course': 2, 'data': 3, 'introduces': 4, 'introduction': 5, 'is': 6, 'science': 7, 'this': 8, 'to': 9, 'which': 10, 'for': 11, 'learning': 12, 'machine': 13}
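To make the idea of a token dictionary concrete, the sketch below builds the same kind of mapping with plain Python. `build_token2id` is a hypothetical helper written for illustration; the alphabetical ordering within each document is an assumption chosen because it reproduces the IDs printed above for this small example, not a description of how `gensim` is implemented.

```python
tokenized_courses = [
    ["this", "is", "an", "introduction", "data", "science", "course",
     "which", "introduces", "data", "science", "to", "beginners"],
    ["machine", "learning", "for", "beginners"],
]


def build_token2id(documents):
    """Assign an integer ID to every distinct token (hypothetical helper)."""
    token2id = {}
    for doc in documents:
        # Visit each document's distinct tokens in alphabetical order and
        # give the next free ID to any token not seen before.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


manual_token2id = build_token2id(tokenized_courses)
print(manual_token2id)
```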


With the token dictionary, we could count each token in the two example courses ourselves and output two BoW feature vectors. More conveniently, the `gensim` package provides a `doc2bow()` method that generates BoW features out of the box.


In [16]:
# Generate BoW features for each course
courses_bow = [tokens_dict.doc2bow(course) for course in tokenized_courses]

In [17]:
courses_bow

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 2),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 2),
  (8, 1),
  (9, 1),
  (10, 1)],
 [(1, 1), (11, 1), (12, 1), (13, 1)]]

It outputs two BoW lists where each element is a tuple, e.g., `(0, 1)` and `(7, 2)`. The first element of the tuple is the token ID and the second element is its count. So `(0, 1)` means `('an', 1)` and `(7, 2)` means `('science', 2)`.
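To see where these tuples come from, here is a rough hand-rolled equivalent that counts token IDs with `collections.Counter`. `manual_doc2bow` is a hypothetical helper for illustration only, and the `token2id` mapping is copied from the dictionary output above.

```python
from collections import Counter

# Token-to-ID mapping copied from the dictionary printed earlier.
token2id = {'an': 0, 'beginners': 1, 'course': 2, 'data': 3, 'introduces': 4,
            'introduction': 5, 'is': 6, 'science': 7, 'this': 8, 'to': 9,
            'which': 10, 'for': 11, 'learning': 12, 'machine': 13}

course1_tokens = ["this", "is", "an", "introduction", "data", "science", "course",
                  "which", "introduces", "data", "science", "to", "beginners"]
course2_tokens = ["machine", "learning", "for", "beginners"]


def manual_doc2bow(tokens, token2id):
    """Count known tokens and return (token_id, count) pairs sorted by ID."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


print(manual_doc2bow(course1_tokens, token2id))
print(manual_doc2bow(course2_tokens, token2id))
# [(1, 1), (11, 1), (12, 1), (13, 1)]
```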


We can use the following code snippet to print each token and its count:


In [18]:
for course_idx, course_bow in enumerate(courses_bow):
    print(f"Bag of words for course {course_idx}:")
    # For each token index, print its bow value (word count)
    for token_index, token_bow in course_bow:
        token = tokens_dict.get(token_index)
        print(f"--Token: '{token}', Count:{token_bow}")

Bag of words for course 0:
--Token: 'an', Count:1
--Token: 'beginners', Count:1
--Token: 'course', Count:1
--Token: 'data', Count:2
--Token: 'introduces', Count:1
--Token: 'introduction', Count:1
--Token: 'is', Count:1
--Token: 'science', Count:2
--Token: 'this', Count:1
--Token: 'to', Count:1
--Token: 'which', Count:1
Bag of words for course 1:
--Token: 'beginners', Count:1
--Token: 'for', Count:1
--Token: 'learning', Count:1
--Token: 'machine', Count:1


If we lay each of these long lists out horizontally, with one position per token in the dictionary, the two courses become two numerical feature vectors.
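As an illustration, the sparse `(token ID, count)` pairs can be expanded into dense vectors whose length equals the dictionary size. This is a minimal sketch: `to_dense` is a hypothetical helper, and the BoW values are copied from the output above.

```python
# Sparse BoW features copied from the doc2bow output above.
courses_bow = [
    [(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 1),
     (6, 1), (7, 2), (8, 1), (9, 1), (10, 1)],
    [(1, 1), (11, 1), (12, 1), (13, 1)],
]
vocab_size = 14  # number of tokens in the dictionary


def to_dense(bow, size):
    """Expand (token_id, count) pairs into a fixed-length count vector."""
    vector = [0] * size
    for token_id, count in bow:
        vector[token_id] = count
    return vector


dense_vectors = [to_dense(bow, vocab_size) for bow in courses_bow]
for vector in dense_vectors:
    print(vector)
# [1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 0, 0, 0]
# [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```

Position 1 (`beginners`) is the only non-zero entry the two vectors share, which reflects the single word the two descriptions have in common.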


### BoW dimensionality reduction


A document may contain tens of thousands of words, which makes the BoW feature vector very high-dimensional. One common way to reduce the dimensionality is to filter out relatively uninformative tokens such as stop words, and sometimes adpositions and adjectives.


Note there are many other ways to reduce dimensionality such as `stemming` and `lemmatization` but they are beyond the scope of this capstone project. You are encouraged to explore them yourself.


We can use the English stop words provided in `nltk`:


In [19]:
stop_words = set(stopwords.words('english'))

In [20]:
# Feel free to uncomment the line below to see the stop words
#stop_words

Then we can filter those English stop words from the tokens in course1:


In [21]:
# Tokens in course 1
tokenized_courses[0]

['this',
 'is',
 'an',
 'introduction',
 'data',
 'science',
 'course',
 'which',
 'introduces',
 'data',
 'science',
 'to',
 'beginners']

In [22]:
processed_tokens = [w for w in tokenized_courses[0] if w.lower() not in stop_words]

In [23]:
processed_tokens

['introduction',
 'data',
 'science',
 'course',
 'introduces',
 'data',
 'science',
 'beginners']

You can see the number of tokens for ```course1``` has been reduced.


Another common way is to keep only the nouns in the text. We can use the `nltk.pos_tag()` function to analyze the part of speech (POS) and annotate each word.


In [24]:
tags = nltk.pos_tag(tokenized_courses[0])
tags

[('this', 'DT'),
 ('is', 'VBZ'),
 ('an', 'DT'),
 ('introduction', 'NN'),
 ('data', 'NNS'),
 ('science', 'NN'),
 ('course', 'NN'),
 ('which', 'WDT'),
 ('introduces', 'VBZ'),
 ('data', 'NNS'),
 ('science', 'NN'),
 ('to', 'TO'),
 ('beginners', 'NNS')]

As we can see, the tokens tagged `NN` or `NNS` (`introduction`, `data`, `science`, `course`, and `beginners`) are the nouns, and we may keep only them in the BoW feature vector.
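A minimal sketch of that noun filter is shown below. It assumes the Penn Treebank convention that noun tags start with `NN` (`NN`, `NNS`, `NNP`, `NNPS`); the `tags` list is copied from the output above so the snippet runs without calling `nltk`.

```python
# POS tags copied from the nltk.pos_tag() output above.
tags = [('this', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('introduction', 'NN'),
        ('data', 'NNS'), ('science', 'NN'), ('course', 'NN'), ('which', 'WDT'),
        ('introduces', 'VBZ'), ('data', 'NNS'), ('science', 'NN'), ('to', 'TO'),
        ('beginners', 'NNS')]

# Keep only tokens whose tag starts with 'NN' (assumed noun prefix).
nouns = [word for word, pos in tags if pos.startswith('NN')]
print(nouns)
# ['introduction', 'data', 'science', 'course', 'data', 'science', 'beginners']
```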


### TASK: Extract BoW features for course textual content and build a dataset


By now you have learned what a BoW feature is, so let's start extracting BoW features from some real course textual content.


In [25]:
course_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_processed.csv"
course_content_df = pd.read_csv(course_url)
course_content_df.head()

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
0,ML0201EN,robots are coming build iot apps with watson ...,have fun with iot and learn along the way if ...
1,ML0122EN,accelerating deep learning with gpu,training complex deep learning models with lar...
2,GPXX0ZG0EN,consuming restful services using the reactive ...,learn how to use a reactive jax rs client to a...
3,RP0105EN,analyzing big data in r using apache spark,apache spark is a popular cluster computing fr...
4,GPXX0Z2PEN,containerizing packaging and running a sprin...,learn how to containerize package and run a ...


In [26]:
course_content_df.iloc[0, :]

COURSE_ID                                               ML0201EN
TITLE          robots are coming  build iot apps with watson ...
DESCRIPTION    have fun with iot and learn along the way  if ...
Name: 0, dtype: object

The course content dataset has three columns: `COURSE_ID`, `TITLE`, and `DESCRIPTION`. `TITLE` and `DESCRIPTION` are both text columns from which we want to extract BoW features.


Let's join those two text columns together.


In [27]:
# Merge the TITLE and DESCRIPTION columns into one text column
course_content_df['course_texts'] = course_content_df[['TITLE', 'DESCRIPTION']].agg(' '.join, axis=1)
course_content_df = course_content_df.reset_index()
course_content_df['index'] = course_content_df.index

In [28]:
course_content_df.iloc[0, :]

index                                                           0
COURSE_ID                                                ML0201EN
TITLE           robots are coming  build iot apps with watson ...
DESCRIPTION     have fun with iot and learn along the way  if ...
course_texts    robots are coming  build iot apps with watson ...
Name: 0, dtype: object

We have prepared a `tokenize_course()` function for you to tokenize the course content:


In [29]:
def tokenize_course(course, keep_only_nouns=True):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(course)
    # Remove English stop words and numbers
    word_tokens = [w for w in word_tokens if w.lower() not in stop_words and not w.isnumeric()]
    # Despite the parameter name, this step does not keep nouns only: it drops tokens
    # whose POS tag is in filter_list (wh-words, prepositions, adverbs, modals, etc.)
    if keep_only_nouns:
        filter_list = ['WDT', 'WP', 'WRB', 'FW', 'IN', 'JJR', 'JJS', 'MD', 'PDT', 'POS', 'PRP', 'RB', 'RBR', 'RBS',
                       'RP']
        tags = nltk.pos_tag(word_tokens)
        word_tokens = [word for word, pos in tags if pos not in filter_list]

    return word_tokens

Let's try it on the first course.


In [30]:
a_course = course_content_df.iloc[0, :]['course_texts']
a_course

'robots are coming  build iot apps with watson  swift  and node red have fun with iot and learn along the way  if you re a swift developer and want to learn more about iot and watson ai services in the cloud  raspberry pi   and node red  you ve found the right place  you ll build iot apps to read temperature data  take pictures with a raspcam  use ai to recognize the objects in those pictures  and program an irobot create 2 robot  '

In [31]:
tokenize_course(a_course)

['robots',
 'coming',
 'build',
 'iot',
 'apps',
 'watson',
 'swift',
 'red',
 'fun',
 'iot',
 'learn',
 'way',
 'swift',
 'developer',
 'want',
 'learn',
 'iot',
 'watson',
 'ai',
 'services',
 'cloud',
 'raspberry',
 'pi',
 'node',
 'red',
 'found',
 'place',
 'build',
 'iot',
 'apps',
 'read',
 'temperature',
 'data',
 'take',
 'pictures',
 'raspcam',
 'use',
 'ai',
 'recognize',
 'objects',
 'pictures',
 'program',
 'irobot',
 'create',
 'robot']

Next, you will need to write some code snippets to generate the BoW features for each course. Let's start by tokenizing all courses in `course_content_df`:


_TODO: Use the provided tokenize_course() method to tokenize all courses in course_content_df['course_texts']._


In [32]:
# WRITE YOUR CODE HERE
   
all_tokenized_courses = [tokenize_course(course, True) for course in course_content_df['course_texts']]

In [33]:
# feel free to uncomment all_tokenized_courses to see the result
#all_tokenized_courses

Then we need to create a token dictionary, `tokens_dict`.


_TODO: Use gensim.corpora.Dictionary(tokenized_courses) to create a token dictionary._


In [34]:
# WRITE YOUR CODE HERE
tokens_dict = gensim.corpora.Dictionary(all_tokenized_courses)

Then we can use the `doc2bow()` method to generate BoW features for each tokenized course.


_TODO: Use tokens_dict.doc2bow() to generate BoW features for each tokenized course._


In [35]:
# WRITE YOUR CODE HERE
all_courses_bow = [tokens_dict.doc2bow(course) for course in all_tokenized_courses]

In [37]:
res_rows = []
for doc_index, course_bow in enumerate(all_courses_bow):
    # For each (token ID, count) pair, look up the token and record a row
    for token_index, count in course_bow:
        token = tokens_dict.get(token_index)
        res_rows.append([doc_index, token, count])

print(res_rows[:2])

[[0, 'ai', 2], [0, 'apps', 2]]


In [38]:

# Insert the actual course ID right after the course index in every row
for elem in res_rows:
    course_id = course_content_df['COURSE_ID'].iloc[elem[0]]
    elem.insert(1, course_id)
    
print(res_rows[:50])

[[0, 'ML0201EN', 'ai', 2], [0, 'ML0201EN', 'apps', 2], [0, 'ML0201EN', 'build', 2], [0, 'ML0201EN', 'cloud', 1], [0, 'ML0201EN', 'coming', 1], [0, 'ML0201EN', 'create', 1], [0, 'ML0201EN', 'data', 1], [0, 'ML0201EN', 'developer', 1], [0, 'ML0201EN', 'found', 1], [0, 'ML0201EN', 'fun', 1], [0, 'ML0201EN', 'iot', 4], [0, 'ML0201EN', 'irobot', 1], [0, 'ML0201EN', 'learn', 2], [0, 'ML0201EN', 'node', 1], [0, 'ML0201EN', 'objects', 1], [0, 'ML0201EN', 'pi', 1], [0, 'ML0201EN', 'pictures', 2], [0, 'ML0201EN', 'place', 1], [0, 'ML0201EN', 'program', 1], [0, 'ML0201EN', 'raspberry', 1], [0, 'ML0201EN', 'raspcam', 1], [0, 'ML0201EN', 'read', 1], [0, 'ML0201EN', 'recognize', 1], [0, 'ML0201EN', 'red', 2], [0, 'ML0201EN', 'robot', 1], [0, 'ML0201EN', 'robots', 1], [0, 'ML0201EN', 'services', 1], [0, 'ML0201EN', 'swift', 2], [0, 'ML0201EN', 'take', 1], [0, 'ML0201EN', 'temperature', 1], [0, 'ML0201EN', 'use', 1], [0, 'ML0201EN', 'want', 1], [0, 'ML0201EN', 'watson', 2], [0, 'ML0201EN', 'way', 1], 

Lastly, you need to append the BoW features for each course into a new BoW dataframe. The new dataframe needs to include the following columns (you may include other relevant columns as well):
- 'doc_index': the course index starting from 0
- 'doc_id': the actual course id such as `ML0201EN`
- 'token': the tokens for each course
- 'bow': the bow value for each token


_TODO: Create a new course_bow dataframe based on the extracted BoW features._


In [39]:
# WRITE YOUR CODE HERE

bow_dicts = {
    "doc_index": [x[0] for x in res_rows],
    "doc_id": [x[1] for x in res_rows],
    "token": [x[2] for x in res_rows],
    "bow": [x[3] for x in res_rows],
}

course_bow_df = pd.DataFrame(bow_dicts)
course_bow_df

Unnamed: 0,doc_index,doc_id,token,bow
0,0,ML0201EN,ai,2
1,0,ML0201EN,apps,2
2,0,ML0201EN,build,2
3,0,ML0201EN,cloud,1
4,0,ML0201EN,coming,1
...,...,...,...,...
10358,306,excourse93,modifying,1
10359,306,excourse93,objectives,1
10360,306,excourse93,pieces,1
10361,306,excourse93,plugins,1
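As a possible shortcut, the rows can also be passed to `pd.DataFrame` directly with a `columns` list, and the resulting long table can be pivoted into one count vector per course. The sketch below uses only a few rows copied from the output above; `bow_df` and `bow_matrix` are illustrative names, not part of the lab.

```python
import pandas as pd

# A few rows copied from the output above: [doc_index, doc_id, token, bow].
res_rows = [
    [0, "ML0201EN", "ai", 2],
    [0, "ML0201EN", "apps", 2],
    [0, "ML0201EN", "build", 2],
    [306, "excourse93", "plugins", 1],
]

# Build the long-format BoW dataframe in one step.
bow_df = pd.DataFrame(res_rows, columns=["doc_index", "doc_id", "token", "bow"])

# Pivot to wide format: one row per course, one column per token,
# with 0 where a token does not occur in a course.
bow_matrix = bow_df.pivot_table(index="doc_id", columns="token",
                                values="bow", aggfunc="sum", fill_value=0)
print(bow_matrix)
```

The long format is compact for sparse data, while the wide format matches the horizontal feature vectors described earlier.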
