These are the supervised learning projects for Slide Rule's Data Science Intensive Program, taken from the lectures of Udacity's Intro to Machine Learning course. The projects cover the Naive Bayes, SVM, and Decision Tree algorithms. They were originally separate, but I combined them because they all use the same data. The goal is to predict the author of an email from the email's contents.

Further details on each project can be found here:

Naive Bayes: https://www.udacity.com/course/viewer#!/c-ud120/l-2254358555/m-2959448580

SVM: https://www.udacity.com/course/viewer#!/c-ud120/l-2252188570/e-3020238710/m-3037398541

Decision Trees: https://www.udacity.com/course/viewer#!/c-ud120/l-2258728540/m-2987588597

Most of the code below reproduces the project templates provided by Udacity. One major difference is that I adapted the code for Python 3. The original templates can be found here:

https://github.com/udacity/ud120-projects

In [4]:
#Original: https://github.com/udacity/ud120-projects/blob/master/tools/email_preprocess.py

import pickle
import numpy

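# sklearn.cross_validation was removed in scikit-learn 0.20;
# model_selection provides the same train_test_split function
from sklearn import model_selection as cross_validation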
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

def preprocess(words_file = r"C:\Users\Gordon\Documents\Projects\ud120-projects\tools\word_data.pkl", 
               authors_file=r"C:\Users\Gordon\Documents\Projects\ud120-projects\tools\email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels
    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    with open(authors_file, 'rb') as authors_file_handler:
        authors = pickle.load(authors_file_handler)

    with open(words_file, 'rb') as words_file_handler:
        word_data = pickle.load(words_file_handler)

    ### test_size is the fraction of events assigned to the test set (remainder go into training)
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, 
                                                                                                 authors, 
                                                                                                 test_size=0.1, 
                                                                                                 random_state=42)
	
    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)

    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    #selector = SelectPercentile(f_classif, percentile=1)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails:", len(labels_train)-sum(labels_train))
    
    return features_train_transformed, features_test_transformed, labels_train, labels_test
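
The vectorization step above fits the TF-IDF vocabulary on the training emails only and then reuses that vocabulary for the test emails. Below is a minimal sketch of that pattern on a made-up four-message corpus. The texts and variable names are illustrative assumptions, not part of the Enron data, and `max_df` is left out because it would discard most terms in such a tiny corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the email texts (hypothetical, not from the real dataset)
train_texts = [
    "please review the budget report",
    "budget meeting moved to friday",
    "football tickets for friday",
]
test_texts = ["quarterly budget forecast"]

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")

# fit_transform learns the vocabulary and IDF weights from the training texts
train_matrix = vectorizer.fit_transform(train_texts)

# transform reuses that vocabulary; words unseen in training are ignored
test_matrix = vectorizer.transform(test_texts)

print("train shape:", train_matrix.shape)
print("test shape:", test_matrix.shape)
print("'forecast' in vocabulary:", "forecast" in vectorizer.vocabulary_)
```

Both matrices end up with the same number of columns, so a classifier trained on one can score the other, and nothing from the test emails leaks into the learned vocabulary.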

#### Naive Bayes

In [6]:
""" 
    this is the code to accompany the Lesson 1 (Naive Bayes) mini-project 

    use a Naive Bayes Classifier to identify emails by their authors
    
    authors and labels:
    Sara has label 0
    Chris has label 1
    
"""
    
import sys
from time import time
sys.path.append(r"C:\Users\Gordon\Documents\Projects\ud120-projects\tools")

### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

#########################################################
### your code goes here ###

from sklearn.naive_bayes import GaussianNB as GB

clf = GB()
t0 = time()
clf.fit(features_train,labels_train)
print("training time:", round(time()-t0, 3), "s")
t1 = time()
pred = clf.predict(features_test)
print("predicting time:", round(time()-t1, 3), "s")

# named "accuracy" to avoid shadowing sklearn.metrics.accuracy_score
accuracy = clf.score(features_test,labels_test)
print(accuracy)

#########################################################

UnpicklingError: the STRING opcode argument must be quoted
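
This error is commonly reported when a text-protocol (protocol 0) pickle written by Python 2 has had its line endings converted from LF to CRLF, for example by a Git checkout on Windows. Whether that is what happened here is an assumption, since the notebook does not say. The sketch below reproduces the message with a hand-written pickle and shows one possible workaround; `loads_with_unix_newlines` is a hypothetical helper, not part of the Udacity tools.

```python
import pickle


def loads_with_unix_newlines(raw_bytes):
    """Unpickle a text-protocol pickle after converting CRLF line endings to LF.

    Hypothetical helper. Only appropriate for protocol-0 pickles: binary
    protocols can legitimately contain a CR byte followed by an LF byte.
    """
    return pickle.loads(raw_bytes.replace(b"\r\n", b"\n"))


# Hand-written protocol-0 pickle of the string 'hello', damaged with a CRLF
damaged = b"S'hello'\r\n."

try:
    pickle.loads(damaged)
except pickle.UnpicklingError as exc:
    print("raw load failed:", exc)

print("after normalizing newlines:", loads_with_unix_newlines(damaged))
```

For a file on disk, the same idea would be to read the bytes with `open(path, 'rb').read()` and pass them to the helper in place of calling `pickle.load` on the file handle.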

#### SVM

In [None]:
""" 
    this is the code to accompany the Lesson 2 (SVM) mini-project

    use an SVM to identify emails from the Enron corpus by their authors
    
    Sara has label 0
    Chris has label 1

"""
    
import sys
from time import time
sys.path.append(r"C:\Users\Gordon\Documents\Projects\ud120-projects\tools")

### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

#########################################################
### your code goes here ###

# train on 1% of the data; // keeps the slice index an integer in Python 3
features_train = features_train[:len(features_train)//100]
labels_train = labels_train[:len(labels_train)//100]

from sklearn.svm import SVC

clf = SVC(kernel='linear')
t0 = time()
clf.fit(features_train,labels_train)
print("training time:", round(time()-t0, 3), "s")
t1 = time()
pred=clf.predict(features_test)
print("predicting time:", round(time()-t1, 3), "s")
print(clf.score(features_test,labels_test))
#########################################################
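
The SVM cell trains on only the first 1% of the training set to keep the fit fast. In Python 3 the slice bound has to be an integer, so the cutoff needs floor division. A small sketch with plain lists, where `first_fraction` and the toy data are hypothetical and used only for illustration:

```python
def first_fraction(items, divisor=100):
    """Return the first len(items) // divisor elements (hypothetical helper)."""
    cutoff = len(items) // divisor  # floor division yields an int
    return items[:cutoff]


# Toy stand-ins for the feature rows and their labels
features = list(range(1750))
labels = [i % 2 for i in features]

small_features = first_fraction(features)
small_labels = first_fraction(labels)
print(len(small_features), len(small_labels))

# True division produces a float, which Python 3 rejects as a slice index
try:
    features[:len(features) / 100]
except TypeError as exc:
    print("float slice index rejected:", exc)
```

Taking the same cutoff from both sequences keeps each feature row aligned with its label.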

#### Decision Trees

In [None]:
""" 
    this is the code to accompany the Lesson 3 (decision tree) mini-project

    use a DT to identify emails from the Enron corpus by their authors
    
    Sara has label 0
    Chris has label 1

"""
    
import sys
from time import time
sys.path.append(r"C:\Users\Gordon\Documents\Projects\ud120-projects\tools")

### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

#########################################################
### your code goes here ###
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.metrics import accuracy_score

clf = DTC(min_samples_split=40)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
# accuracy_score expects (y_true, y_pred)
accuracy = accuracy_score(labels_test, pred)
print(accuracy)
#########################################################
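
The accuracy reported by each classifier is the fraction of test emails whose predicted author matches the true author. A minimal pure-Python sketch of that calculation follows; the `accuracy` function and the toy labels are illustrative assumptions, not scikit-learn's implementation.

```python
def accuracy(labels_true, labels_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    if not labels_true:
        raise ValueError("cannot compute accuracy of empty sequences")
    matches = sum(t == p for t, p in zip(labels_true, labels_pred))
    return matches / len(labels_true)


# Toy labels: 0 for Sara, 1 for Chris
truth = [0, 1, 1, 0, 1]
guess = [0, 1, 0, 0, 1]

print(accuracy(truth, guess))
```

Because this measure only counts matches, swapping the two arguments gives the same number, although other metrics such as precision and recall do depend on the order.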