### This file
- tokenizes the course descriptions from `spreadsheet2.csv`
- vectorizes the word frequencies for each course
- outputs three files: the courses, the words, and the word-frequency matrix
***

### Read data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('spreadsheet2.csv')
df.fillna('', inplace=True)

In [3]:
df.head()

,level,dept,cno,name,description,prerequisite
0,Artificial Intelligence,CSC,457,Expert Systems,A study of the development of expert systems. ...,CSC403
1,Artificial Intelligence,CSC,458,Symbolic Programming,Concepts of symbolic programming as embodied i...,CSC403
2,Artificial Intelligence,DSC,478,Programming Machine Learning Applications,The course will focus on the implementations o...,DSC441-CSC401
3,Artificial Intelligence,CSC,480,Artificial Intelligence I,An in-depth survey of important concepts probl...,CSC403
4,Artificial Intelligence,CSC,481,Introduction to Image Processing,The course is a prerequisite for more advanced...,CSC412


***
# Tokenization

In [4]:
## load every description into a list
doc_complete = df['description'].tolist()

In [5]:
import nltk
nltk.download("stopwords")
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
## Preprocessing
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()

In [7]:
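## Cleaning
def clean(doc):
    """Lowercase, drop stop words, strip punctuation, and lemmatize one description."""
    # 1. lowercase and drop stop words
    stop_free = " ".join(w for w in doc.lower().split() if w not in stop)
    # 2. delete punctuation characters
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    # 3. lemmatize each remaining token (noun lemmas by default)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized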
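
One consequence of the step order in `clean` is easier to see on a small input: stop words are filtered before punctuation is stripped, and punctuation is deleted rather than replaced by a space. The sketch below reproduces only those two steps with a tiny hand-picked stop list. `TOY_STOP` and `toy_clean` are illustrative names that are not part of this notebook, and lemmatization is left out so that no NLTK data is needed.

```python
import string

# Hypothetical, hand-picked stop list standing in for NLTK's English list.
TOY_STOP = {"the", "of", "and", "a", "in"}
PUNCT = set(string.punctuation)

def toy_clean(doc):
    """Mimic the first two steps of clean(): stop-word filter, then punctuation removal."""
    stop_free = " ".join(w for w in doc.lower().split() if w not in TOY_STOP)
    return "".join(ch for ch in stop_free if ch not in PUNCT)

# The hyphen is deleted, so the two halves of "rule-based" are fused.
print(toy_clean("The design of rule-based systems, and the like."))

# "and," is not in the stop list when it is checked, so "and" survives.
print(toy_clean("Trees, graphs and, finally, heaps"))
```

This is why tokens such as `rulebased` and `highlevel` show up in the cleaned descriptions.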

In [8]:
doc_clean = [clean(doc) for doc in doc_complete]

In [9]:
## see the first five results
for i in range(5):
    print(df.iloc[i,3])
    print(doc_clean[i], end='\n\n')

Expert Systems
study development expert system student use commercial package develop standalone embedded expert system topic include rulebased system decision tree forward backward chaining inference reasoning uncertainty intelligent agent

Symbolic Programming
concept symbolic programming embodied language lisp basic data control structure lisp symbolic expression interpreter function recursion iteration technique prototyping building conceptually advanced system environment encourages procedural data abstraction advanced topic may include prolog intelligent tutoring system intelligent agent natural language processing assignment focus basic ai technique class intended anyone need rapidly develop large complex system

Programming Machine Learning Applications
course focus implementation various data mining machine learning technique using highlevel programming language student hand experience developing supervised unsupervised machine learning algorithm learn employ technique context

***
# CountVectorizer

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [11]:
## feed the documents into the vectorizer
X = vectorizer.fit_transform(doc_clean)
X

<164x1428 sparse matrix of type '<class 'numpy.int64'>'
	with 5361 stored elements in Compressed Sparse Row format>
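
To make the shape above concrete: each row of `X` is one course description and each column is one vocabulary word, and the cell holds how often that word occurs in that description. The following is a minimal standard-library sketch of that idea on two toy documents. `toy_docs`, `vocab` and `counts` are illustrative names, and this is not how `CountVectorizer` is implemented internally (for instance, its default tokenizer also drops single-character tokens).

```python
from collections import Counter

toy_docs = ["expert system rule system", "machine learning system"]

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({tok for doc in toy_docs for tok in doc.split()})

# One row per document, one column per vocabulary word.
counts = []
for doc in toy_docs:
    freq = Counter(doc.split())
    counts.append([freq[word] for word in vocab])

print(vocab)   # column labels
print(counts)  # 2 x 5 count matrix
```

Most cells are zero because each document uses only a small part of the vocabulary, which is why the real result is stored as a sparse matrix.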

In [12]:
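## get all the tokens
## (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0 and later)
words = vectorizer.get_feature_names_out()
# print(words)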

In [13]:
## get the 2d word frequency array
matrix = X.toarray()
print(matrix.shape)
matrix

(164, 1428)


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [14]:
## wrap the matrix in a DataFrame so it can be written to csv
mat = pd.DataFrame(matrix)
mat.shape

(164, 1428)

## Output files

In [15]:
## word freq matrix
mat.to_csv("word_freq.csv", index=False)

In [16]:
## course codes (dept + cno, e.g. CSC457)
courses = (df['dept'].astype(str) + df['cno'].astype(str)).tolist()
pd.DataFrame({'courses': courses}).to_csv("courses.csv", index=False)

In [17]:
## words
pd.DataFrame({'words':words}).to_csv("words.csv", index=False)
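
The three files are meant to be used together: row i of `word_freq.csv` belongs to row i of `courses.csv`, and column j belongs to row j of `words.csv`. A small sketch of that lookup follows, using in-memory toy lists in place of the files. All values and the helper `top_words` are made up for illustration.

```python
toy_courses = ["CSC457", "DSC478"]
toy_words = ["expert", "learning", "machine", "system"]
toy_matrix = [
    [2, 0, 0, 3],  # counts for CSC457
    [0, 2, 2, 1],  # counts for DSC478
]

def top_words(course, k=2):
    """Return the k most frequent (word, count) pairs for a course code."""
    # The course's position in the course list selects the matrix row.
    row = toy_matrix[toy_courses.index(course)]
    # Pair each count with the word at the same column position, then rank.
    ranked = sorted(zip(toy_words, row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

print(top_words("CSC457"))
print(top_words("DSC478"))
```

Because the alignment is purely positional, the rows of the three files must not be reordered independently of each other.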