Game Plan:

- Read in the Blogs & Tags
- Generate Feature Vectors
    * Version 1: Naive approach - Use All Words
    * Version 2: Filter overly common and rare words
    * Version 3: Only use adjectives & nouns (needs POS tagging; may also need NER)
    * Version 4: Investigate topic modeling
    * Version 5: n-grams
- Apply Machine Learning Algorithm
    * Naive Bayes
    * kNN
    * K-Means
    ...
- Evaluate Results
    * Manual Evaluation
    * Categorical Evaluation (Maybe use F-Score)
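
Versions 2 and 5 of the feature-vector plan can probably be covered by `CountVectorizer` options alone. The sketch below uses a made-up four-document corpus, not the blog data, and the `min_df` / `max_df` thresholds are arbitrary placeholders that would need tuning on the real entries.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only (not the blog data).
docs = [
    "liferay portal search",
    "liferay portal wiki",
    "liferay portal maven",
    "liferay osgi",
]

# Version 1: every word becomes a feature.
all_words = CountVectorizer().fit(docs)

# Version 2: drop words that appear in more than 90% of the documents
# (max_df) or in fewer than 2 documents (min_df). Thresholds are assumptions.
filtered = CountVectorizer(min_df=2, max_df=0.9).fit(docs)

# Version 5: use unigrams and bigrams as features.
with_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

vocab_all = sorted(all_words.vocabulary_)
vocab_filtered = sorted(filtered.vocabulary_)
vocab_bigrams = sorted(with_bigrams.vocabulary_)

print(vocab_all)
print(vocab_filtered)   # 'liferay' is too common, the one-off words too rare
print(vocab_bigrams)
```

On this toy corpus only `portal` survives the filter, which shows how aggressive the thresholds can be on a small collection.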
   

Read in data

In [1]:
import os

script_directory = os.getcwd()
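# Build the path with os.path.join so the code does not rely on '\c'
# being left alone as an unrecognized escape sequence
relative_file_path = os.path.join('data', 'cleaned up (final).txt')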
INPUT_FILE_PATH = os.path.join(script_directory, relative_file_path)

print(script_directory)

input_file = open(INPUT_FILE_PATH, 'r', encoding='utf-8')

all_titles = list()
all_blog_entries = list()
all_tags = list()

while True:
    title_line = input_file.readline().strip()
    blog_line = input_file.readline().strip()
    tag_line = input_file.readline().strip()
    tag_list = tag_line.split(',')

    input_file.readline() # This is just to burn the extra new line
    #print("Title: {}\nBlog: {}\nTags: {}\n".format(title_line, blog_line[:100], tag_line))
    #print(tag_list)

    # Keep the record only if it has tags; the first empty tag line is
    # treated as end of file (readline() returns '' once the file is exhausted)
    if tag_line:
        all_titles.append(title_line)
        all_blog_entries.append(blog_line)
        all_tags.append(tag_list)
    else:
        print("End of File")
        print("Title: {}\nBlog: {}\nTags: {}\n".format(title_line, blog_line[:100], tag_line))
        break # EOF

input_file.close()

print("Title Entries: {}".format(len(all_titles)))
print("Blog Entries:  {}".format(len(all_blog_entries)))
print("Tag Entries:   {}".format(len(all_tags)))

D:\Jupyter Projects
End of File
Title: 
Blog: 
Tags: 

Title Entries: 970
Blog Entries:  970
Tag Entries:   970
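
The reading loop above assumes four-line records: title, blog text, comma-separated tags, blank line. A possible alternative is sketched below as a generator. Note that it deliberately behaves differently from the cell above: it stops only at true end of file, keeps untagged records so they can be filtered afterwards, and strips whitespace around each tag. Whether that is right for this dataset is an assumption. `read_records` is a hypothetical helper name and the sample text is invented.

```python
import io

def read_records(handle):
    """Yield (title, blog, tags) tuples from a file of four-line records."""
    while True:
        title = handle.readline()
        if not title:  # readline() returns '' only at end of file
            break
        blog = handle.readline().strip()
        tag_line = handle.readline().strip()
        handle.readline()  # burn the blank separator line
        # Strip each tag and drop empty ones (assumption: padding is noise).
        tags = [t.strip() for t in tag_line.split(',') if t.strip()]
        yield title.strip(), blog, tags

# Invented sample in the same layout; the second record has no tags.
sample = io.StringIO(
    "First post\nsome blog text\nwiki, migration\n\n"
    "Untagged post\nmore text\n\n\n"
    "Last post\neven more text\nosgi\n\n"
)

records = list(read_records(sample))
tagged = [r for r in records if r[2]]
print(len(records), len(tagged))
```

With a real file, the same function could be called inside `with open(INPUT_FILE_PATH, 'r', encoding='utf-8') as handle:` so the file is closed automatically.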


Convert to Feature Vectors

In [2]:
# Experimentation: train a supervised multi-label classifier that predicts tags from the blog text

# Code is based off of https://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories

from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
training_counts = count_vector.fit_transform(all_blog_entries)

print(training_counts.shape)
print(count_vector.vocabulary_.get(u'the'))

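# Normalize the count vectors (TF - Term Frequency, no IDF weighting).
# With the default norm='l2', each row is scaled to unit Euclidean length
# rather than divided by the raw document length.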
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(training_counts)
training_tf = tf_transformer.transform(training_counts)
print(training_tf.shape)



# Training a Classifier
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(all_blog_entries)
y_train = all_tags

#print(y_train)

#return

# Convert the training tag lists to a binary indicator matrix (one column per tag)
mlb = MultiLabelBinarizer()
y_train_binary_array = mlb.fit_transform(y_train)

# print(y_train_binary_array)

#X_test = np.array([all_blog_entries[0], all_blog_entries[1]])
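# Note: these 25 entries are also part of X_train, so the predictions below
# show how well the model fits data it has already seen, not how it generalizes
X_test = np.array(all_blog_entries[100:125])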
target_names = list(mlb.classes_)

#print(X_test)
#print(target_names)
#print(len(target_names))
#print("\n\n\n\n\n")

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train_binary_array)
predicted = classifier.predict(X_test)
predicted_as_string = mlb.inverse_transform(predicted)

print("\n")

#for item, labels in zip(X_test, predicted_as_string):
#    print('%s => %s' % (item[100:], ', '.join(target_names[x] for x in labels)))

for item, labels in zip(X_test, predicted_as_string):
    print("\n{} => \n\t{}".format(item[:100], labels))


(970, 27747)
24328
(970, 27747)



yes weve finally done it liferays community wiki has been migrated from mediawiki to liferays own wi => 
	('migration', 'wiki')

the much awaited liferay 51 bits are now available for download one of the features that i worked on => 
	('glassfish v3', 'osgi')

recently openportal portlet container was fully integrated in liferay portal the openportal portlet  => 
	('jsr 286', 'jsr-286', 'portlet container', 'portlet v2.0', 'portlets')

opensso is an open source project for single sign on from sun microsystems this document describes h => 
	('liferay  opensso  glassfish',)

brian just wrote a great blog about the meetup and i couldnt resist adding my usual philosophical sh => 
	('collaboration', 'liferay', 'meetup', 'social')

weve made some enhancements to the search infrastructure of the portal recently we abstracted out lu => 
	('5.1.1', 'search engine', 'solr')

the opensource software movement also known as the free software movement received a bo
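
The tuples printed above come from `MultiLabelBinarizer.inverse_transform`, which maps each row of the binary prediction matrix back to tag names. A minimal sketch of that round trip, using three invented tag lists instead of the blog tags:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Invented tag lists, one list per document.
tags = [["wiki", "migration"], ["osgi"], ["wiki"]]

mlb = MultiLabelBinarizer()
binary = mlb.fit_transform(tags)

classes = list(mlb.classes_)   # column order: tags sorted alphabetically
rows = binary.tolist()         # one row per document, one column per tag
round_trip = mlb.inverse_transform(binary)

print(classes)      # ['migration', 'osgi', 'wiki']
print(rows)         # [[1, 0, 1], [0, 1, 0], [0, 0, 1]]
print(round_trip)
```

Because every distinct string becomes its own column, spelling variants such as 'jsr 286' and 'jsr-286' in the output above are treated as unrelated tags.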

Now to split test & training data
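
Holding out a test set is what makes the categorical evaluation from the game plan possible, since predictions can then be scored against tags the classifier never saw. The sketch below shows one way such a score might be computed, a micro-averaged F-score over invented true and predicted tags. The choice of micro averaging is an assumption, not something settled in the plan.

```python
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Invented ground-truth tags and predictions for three documents.
y_true_tags = [["wiki", "migration"], ["osgi"], ["maven"]]
y_pred_tags = [("wiki",), ("osgi",), ()]   # empty tuple: no tag predicted

# Fit the binarizer on every tag seen on either side so that both
# matrices share the same columns.
mlb = MultiLabelBinarizer()
mlb.fit(y_true_tags + [list(t) for t in y_pred_tags])
y_true = mlb.transform(y_true_tags)
y_pred = mlb.transform(y_pred_tags)

# Micro averaging pools true/false positives and negatives over all tags.
micro_f1 = f1_score(y_true, y_pred, average="micro")
print(round(micro_f1, 3))
```

Here two of the four true tags are predicted with no false positives, so precision is 1.0, recall is 0.5, and the F-score is 2/3. In the notebook itself the binarizer is fitted on the training tags only, so any tag that occurs only in `y_test` would have no column and would be ignored with a warning when `y_test` is transformed.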

In [3]:
import numpy as np
from sklearn.model_selection import train_test_split
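from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer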

X_train, X_test, y_train, y_test = train_test_split(all_blog_entries, all_tags, test_size=0.2, random_state=0)

X_train = np.array(X_train)
X_test = np.array(X_test)

# Note to self:
# X_train => y_train
# X_test => y_test

print("X_train: {}".format(X_train.shape))
print("y_train: {}".format(len(y_train)))
print("X_test:  {}".format(X_test.shape))
print("y_test:  {}".format(len(y_test)))

# Convert the training tag lists to a binary indicator matrix (one column per tag)
mlb = MultiLabelBinarizer()
y_train_binary_array = mlb.fit_transform(y_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train_binary_array)
predicted = classifier.predict(X_test)
predicted_as_string = mlb.inverse_transform(predicted)

print("\n")

for item, labels in zip(X_test, predicted_as_string):
    print("\n{} => \n\t{}".format(item[:100], labels))


X_train: (776,)
y_train: 776
X_test:  (194,)
y_test:  194



software should be modular it has been a core thought of developers and architects for quite a long  => 
	()

starting to recover from jetlag after a two week trip los angeles and liferay retreat one of the thi => 
	('maven',)

i took a look at jboss portal and concluded that nothing is wrong with liferay haha buismart guy hah => 
	()

objective implement many to many relation in liferay development using service builder tool liferay  => 
	()

ive learned a lot while writing liferay in action im going through my final edits now before the boo => 
	()

you probably know the basic installation instructions for liferay bundles unzip and run startupsh wi => 
	()

publish version hook plugin this article is described about the custom publish version hook which is => 
	()

my very first blog post to be honest ive said for months that i was going to start doing this but th => 
	()

last friday i had the pleasure of attending our ver