# Curious Comments 
![commentpic](comment_structure.png)

Comments are a critical part of any review. We now analyze the comments in our Q data: we judge their predictive power and examine the role they play in a course's Q rating.

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import os
import math
from itertools import chain
import ast

We begin by creating a dataframe of comments. For each course and semester we randomly subsample 5 comments (the value of MIN_PER_COURSESEM_REVIEWS below), and we only consider course-semesters with 5 or more comments. Our dataframe keeps the Course, C_Semester, Comments and Positive columns and adds a Sampled_Comments column. A course is defined to have an 'Overall Positive Rating' if it has been given more positive ratings than negative ratings across all semesters in which it has been rated.

In [2]:
MIN_PER_COURSESEM_REVIEWS = 5

In [3]:
bigdf=pd.read_csv("bigdf.csv")
bigdf = bigdf.reset_index(drop=True)  # reset_index returns a new dataframe, so assign it back
bigdf.head(5)

Unnamed: 0,C_Department,C_Number,Course,C_CatNum,C_ID,C_Semester,C_Year,C_Term,C_Overall,C_Workload,C_Difficulty,C_Recommendation,C_Enrollment,C_ResponseRate,I_First,I_Last,I_ID,I_Overall,I_EffectiveLectures,I_Accessible,I_GeneratesEnthusiasm,I_EncouragesParticipation,I_UsefulFeedback,I_ReturnsAssignmentsTimely,QOverall_1,QOverall_2,QOverall_3,QOverall_4,QOverall_5,QDifficulty_1,QDifficulty_2,QDifficulty_3,QDifficulty_4,QDifficulty_5,QWorkload_1,QWorkload_2,QWorkload_3,QWorkload_4,QWorkload_5,Comments,Sem_Average,Positive
0,HISTSCI,270.0,HISTSCI-270,58523,2697,Spring '12,2011,2,4.67,2.33,3.33,5.0,6,50.0,Rebecca,Lemov,79de794d3e2e19eb71a2033b0ec0b76d,4.67,4.33,4.0,4.33,5.0,4.5,4.0,0,0,0,1,2,0,0,2,1,0,0,2,1,0,0,[u'This course is a perfect example of what gr...,4.22635,True
1,EXPOS,20.132,EXPOS-20.132,22108,1676,Fall '14,2014,1,4.1,7.1,,3.5,13,76.92,Owen,Chen,1341ccb7bd27f47e68625b63b15281d1,4.5,4.6,3.7,3.9,3.9,4.1,4.6,0,1,1,4,4,0,1,2,7,3,1,3,4,1,0,"[u'The class has a fairly high work load, but ...",4.24437,False
2,EXPOS,20.132,EXPOS-20.132,22108,1676,Fall '13,2013,1,3.5,2.6,3.9,3.2,13,100.0,Owen,Chen,1341ccb7bd27f47e68625b63b15281d1,3.8,4.0,2.5,3.5,4.1,4.3,4.4,0,1,1,4,4,0,1,2,7,3,1,3,4,1,0,[u'Philosophy of the State with Dr. Chen offer...,4.256888,False
3,EXPOS,20.132,EXPOS-20.132,22108,1676,Fall '12,2012,1,3.73,2.47,3.67,3.47,15,100.0,Owen,Chen,1341ccb7bd27f47e68625b63b15281d1,3.87,4.33,2.64,3.93,4.18,3.64,3.82,0,1,1,4,4,0,1,2,7,3,1,3,4,1,0,[u'This was by far my favorite course. Dr. Che...,4.190299,False
4,EXPOS,20.132,EXPOS-20.132,22108,1676,Fall '11,2011,1,3.85,2.0,3.54,3.62,13,100.0,Owen,Chen,1341ccb7bd27f47e68625b63b15281d1,4.08,3.75,3.31,3.92,4.46,4.23,4.08,0,1,1,4,4,0,1,2,7,3,1,3,4,1,0,"[u'Be prepared to read', u'Discussions were gr...",4.185893,False


In [4]:
def sample_comments(commentsListAsString):
    if type(commentsListAsString) != str:
        return ""
    else:
        allComments = ast.literal_eval(commentsListAsString)
        if len(allComments) >= MIN_PER_COURSESEM_REVIEWS:
            return " ".join(np.random.choice(allComments, MIN_PER_COURSESEM_REVIEWS, replace=False))
        else:
            return ""

subdf = bigdf[['Course','C_Semester','Comments', 'Positive']].dropna()
subdf["Sampled_Comments"] = subdf.Comments.map(sample_comments)
subdf = subdf[subdf.Sampled_Comments != ""]
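
To see the sampling rule in isolation, here is a minimal sketch on toy data. The helper name `sample_joined`, the toy comments and the fixed seed are assumptions made for illustration; the notebook's own `sample_comments` uses `np.random.choice` instead of the standard library.

```python
import ast
import random

MIN_COMMENTS = 5  # assumed threshold, mirroring MIN_PER_COURSESEM_REVIEWS

def sample_joined(comments_as_string, k=MIN_COMMENTS, seed=0):
    """Return k randomly chosen comments joined by spaces, or "" if there are too few."""
    if not isinstance(comments_as_string, str):
        return ""  # missing values (e.g. NaN) carry no comments
    comments = ast.literal_eval(comments_as_string)  # "['a', 'b']" -> ['a', 'b']
    if len(comments) < k:
        return ""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return " ".join(rng.sample(comments, k))  # sample without replacement

raw = "['Great class.', 'Lots of reading.', 'Fun.', 'Hard exams.', 'Take it.', 'Skip it.']"
sampled = sample_joined(raw)
too_few = sample_joined("['Only one comment.']")
```

Rows whose result is the empty string are the ones the notebook filters out afterwards.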

In [5]:
subdf.head(12)

Unnamed: 0,Course,C_Semester,Comments,Positive,Sampled_Comments
1,EXPOS-20.132,Fall '14,"[u'The class has a fairly high work load, but ...",False,Lot of work/reading but very rewarding. You re...
2,EXPOS-20.132,Fall '13,[u'Philosophy of the State with Dr. Chen offer...,False,This class is always rumored to be very diffic...
3,EXPOS-20.132,Fall '12,[u'This was by far my favorite course. Dr. Che...,False,Take this class if you are interested in the c...
4,EXPOS-20.132,Fall '11,"[u'Be prepared to read', u'Discussions were gr...",False,You'll have to think deeply about your opinion...
6,EXPOS-20.133,Spring '12,[u'This is a fantastic course if you have an i...,False,"This class is really difficult, but if you enj..."
7,EXPOS-20.133,Spring '14,[u'This is not a bad Expos course. Dr. Chen is...,False,Be ready to speak up in class If you like phil...
13,EXPOS-20.131,Fall '14,[u'If you are interested in political philosop...,False,The preceptor gives good feedback. This course...
14,EXPOS-20.131,Fall '13,"[u'This class involves quite a bit of reading,...",False,"If you have any interest at all in philosophy,..."
15,EXPOS-20.131,Fall '12,"[u""A basic, but thorough, understanding of phi...",False,"In this class, you will get to work through a ..."
16,EXPOS-20.131,Fall '11,"[u'It is a great class, which combines develop...",True,"If you decide to take this course, it will be ..."


Now we will convert our comments dataframe, subdf, to a Spark dataframe for text analysis.

In [6]:
#setup spark
import os
import findspark
findspark.init()
print findspark.find()
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)
import sys
rdd = sc.parallelize(xrange(10),10)
rdd.map(lambda x: sys.version).collect()
sys.version
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

/home/vagrant/spark


In [7]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')
from sklearn.feature_extraction import text 
stopwords=text.ENGLISH_STOP_WORDS
import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")
#Useless verbs courtesy of: http://mbweston.com/2012/11/26/writing-editing-find-and-eliminate-useless-verbs/
uselessverbs = ['be','is','are','be','was','were','been','being',
                'go','goes','went','gone','going','put','puts','putting',
                'do','does','did','done','doing',
                'come','comes','came','coming',
                'have','have','has','had','having',
                'can','could','begin','begins','began','begun','beginning',
                'seem','seems','seemed','seeming',
                'get','got','gotten','getting',
                'become','became','becoming']

We write a get_parts function to parse the language in the comments. This customized function returns, for each sentence, lists of the lemmatized nouns, adjectives and verbs, and it keeps only sentences that contain at least one of each.

In [8]:
def get_parts(thetext):
    thetext=re.sub(regex1, ' ', thetext)
    thetext=re.sub(regex2, ' ', thetext)
    nouns=[]
    descriptives=[]
    verbs = []
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        verbs.append([])
        for token in sentence:
            #print token
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
                elif token[1] in ['VB','VBP','VBZ','VBG','VBD','VBN']:
                    if token[4] in stopwords or token[4] in uselessverbs or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    verbs[i].append(token[4])
    out=zip(nouns, descriptives,verbs)
    nouns2=[]
    descriptives2=[]
    verbs2 = []
    for n,d,v in out:
        if len(n)!=0 and len(d)!=0 and len(v)!=0:
            nouns2.append(n)
            descriptives2.append(d)
            verbs2.append(v)
    return nouns2, descriptives2, verbs2
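
The filtering logic of get_parts can be sketched without the `pattern` library by starting from tokens that are already tagged. This is only an illustration: the `split_parts` name, the tiny stopword set and the hand-tagged sentences are assumptions, and the punctuation and "useless verb" filters of the real function are left out.

```python
NOUN_TAGS = {"NN", "NNS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}
VERB_TAGS = {"VB", "VBP", "VBZ", "VBG", "VBD", "VBN"}
STOP = {"the", "a", "be", "this"}  # tiny stand-in for the real stopword list

def split_parts(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (tag, lemma) pairs."""
    nouns, adjectives, verbs = [], [], []
    for sentence in tagged_sentences:
        n = [lem for tag, lem in sentence if tag in NOUN_TAGS and lem not in STOP]
        a = [lem for tag, lem in sentence if tag in ADJ_TAGS and lem not in STOP]
        v = [lem for tag, lem in sentence if tag in VERB_TAGS and lem not in STOP]
        if n and a and v:  # keep a sentence only if it has all three parts of speech
            nouns.append(n)
            adjectives.append(a)
            verbs.append(v)
    return nouns, adjectives, verbs

tagged = [
    [("DT", "this"), ("NN", "class"), ("VBZ", "require"), ("JJ", "heavy"), ("NN", "reading")],
    [("JJ", "great"), ("NN", "professor")],  # no verb, so this sentence is dropped
]
toy_nouns, toy_adjectives, toy_verbs = split_parts(tagged)
```

Because incomplete sentences are dropped, the three returned lists always have the same length, which is what lets the later cells treat position i in each list as the same sentence.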

In [9]:
subdf = sqlsc.createDataFrame(subdf)
subdf.show(5)

+------------+----------+--------------------+--------+--------------------+
|      Course|C_Semester|            Comments|Positive|    Sampled_Comments|
+------------+----------+--------------------+--------+--------------------+
|EXPOS-20.132|  Fall '14|[u'The class has ...|   false|Lot of work/readi...|
|EXPOS-20.132|  Fall '13|[u'Philosophy of ...|   false|This class is alw...|
|EXPOS-20.132|  Fall '12|[u'This was by fa...|   false|Take this class i...|
|EXPOS-20.132|  Fall '11|[u'Be prepared to...|   false|You'll have to th...|
|EXPOS-20.133|Spring '12|[u'This is a fant...|   false|This class is rea...|
+------------+----------+--------------------+--------+--------------------+
only showing top 5 rows



In [10]:
comment_parts = subdf.rdd.map(lambda r: get_parts(r.Sampled_Comments))
comment_parts.take(1)

[([[u'quality', u'instruction', u'course', u'instructor', u'time'],
   [u'paper', u'family', u'state'],
   [u'lot', u'work', u'summary', u'assignment'],
   [u'class', u'work', u'load', u'writer'],
   [u'course'],
   [u'subject', u'course', u'overall', u'expectation'],
   [u'purpose',
    u'class',
    u'freshman',
    u'writing',
    u'skill',
    u'class',
    u'resource',
    u'purpose'],
   [u'work', u'effort', u'essay'],
   [u'essay', u'topic', u'guidance'],
   [u'feedback', u'draft', u'draft'],
   [u'feedback', u'office', u'hour'],
   [u'peer', u'resource', u'guidance', u'class'],
   [u'class', u'requirement']],
  [[u'generous'],
   [u'final'],
   [u'prepared', u'dense', u'philosophical', u'online'],
   [u'high', u'helpful'],
   [u'unlikely'],
   [u'passionate'],
   [u'useful'],
   [u'manageable'],
   [u'reasonable', u'little'],
   [u'vague', u'unhelpful', u'final'],
   [u'difficult', u'constructive', u'accessible'],
   [u'writing'],
   [u'particular']],
  [[u'receive'],
   [u'foc

In [11]:
%%time
parsedcomments=comment_parts.collect()

CPU times: user 171 ms, sys: 37.4 ms, total: 208 ms
Wall time: 1min 38s


We begin our text analysis with an LDA (latent Dirichlet allocation) of the nouns in the comments.

In [12]:
[e[0] for e in parsedcomments[:3]]

[[[u'quality', u'instruction', u'course', u'instructor', u'time'],
  [u'paper', u'family', u'state'],
  [u'lot', u'work', u'summary', u'assignment'],
  [u'class', u'work', u'load', u'writer'],
  [u'course'],
  [u'subject', u'course', u'overall', u'expectation'],
  [u'purpose',
   u'class',
   u'freshman',
   u'writing',
   u'skill',
   u'class',
   u'resource',
   u'purpose'],
  [u'work', u'effort', u'essay'],
  [u'essay', u'topic', u'guidance'],
  [u'feedback', u'draft', u'draft'],
  [u'feedback', u'office', u'hour'],
  [u'peer', u'resource', u'guidance', u'class'],
  [u'class', u'requirement']],
 [[u'class'],
  [u'lot', u'philosophy', u'professor', u'lot'],
  [u'course', u'lot', u'reading', u'writing'],
  [u'philosophy',
   u'stimulating',
   u'class',
   u'discussion',
   u'variety',
   u'topic',
   u'student',
   u'share'],
  [u'class', u'lot'],
  [u'class', u'improvement', u'writing', u'understanding', u'writing'],
  [u'student', u'class', u'philosophy', u'level', u'discussion'],


In [13]:
ldadatardd=sc.parallelize([ele[0] for ele in parsedcomments]).flatMap(lambda l: l)
ldadatardd.cache()
ldadatardd.take(5)

[[u'quality', u'instruction', u'course', u'instructor', u'time'],
 [u'paper', u'family', u'state'],
 [u'lot', u'work', u'summary', u'assignment'],
 [u'class', u'work', u'load', u'writer'],
 [u'course']]

In [14]:
ldadatardd.flatMap(lambda word: word).take(5)

[u'quality', u'instruction', u'course', u'instructor', u'time']

In [15]:
vocabtups = (ldadatardd.flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
             .map(lambda (x,y): x)
             .zipWithIndex()
).cache()

In [16]:
vocab=vocabtups.collectAsMap()
id2word=vocabtups.map(lambda (x,y): (y,x)).collectAsMap()

In [17]:
id2word[0], vocab.keys()[5], vocab[vocab.keys()[5]]

(u'stuff\xe2\u20ac', u'databasis', 5)

In [18]:
len(vocab.keys())

4322

In [19]:
from collections import defaultdict
def helperfunction(element):
    d = defaultdict(int)
    for k in element:
        d[vocab[k]] += 1
    return d.items()
documents = ldadatardd.map(lambda w: helperfunction(w))

In [20]:
documents.take(5)

[[(1315, 1), (2575, 1), (2959, 1), (1030, 1), (3615, 1)],
 [(2243, 1), (3798, 1), (1622, 1)],
 [(1619, 1), (3644, 1), (173, 1), (2894, 1)],
 [(2675, 1), (4291, 1), (173, 1), (2787, 1)],
 [(1315, 1)]]
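
The vocabulary and bag-of-words steps above can be reproduced on a toy corpus in plain Python. This is a sketch rather than the Spark pipeline itself: the `toy_` names and the three sentences are made up, and ids are assigned in sorted order only so that the result is reproducible.

```python
from collections import Counter

toy_sentences = [["class", "work", "load"], ["course"], ["class", "class", "essay"]]

# Give every distinct word an integer id.
toy_vocab = {word: idx for idx, word in
             enumerate(sorted({w for s in toy_sentences for w in s}))}
toy_id2word = {idx: word for word, idx in toy_vocab.items()}

def to_bow(sentence):
    """Turn a list of words into sorted (word_id, count) pairs."""
    counts = Counter(toy_vocab[w] for w in sentence)
    return sorted(counts.items())

toy_corpus = [to_bow(s) for s in toy_sentences]
```

Each entry of `toy_corpus` has the same shape as the rows printed by `documents.take(5)`: one (word id, count) pair per distinct word in the sentence.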

In [21]:
corpus=documents.collect()

In [None]:
import gensim

In [None]:
lda2 = gensim.models.ldamodel.LdaModel(corpus = corpus, num_topics = 2, id2word=id2word, chunksize=200, passes = 10)

Below, we print the topics we find using LDA.

In [None]:
lda2.print_topics()

The first topic (let us call this Topic 0) includes the combination of words: 

- class, lot, time, work, way, professor, problem, section, exam, and fun.


The second topic (let us call this Topic 1) includes the following combination of words: 

- course, material, lecture, student, topic, reading, person, science, year, and lecturer.


Topic 0 seems to encompass the more interactive, qualitative, personable aspects of the course with key terms including class, professor, problem, section, and fun. Topic 1, by contrast, seems to encompass the more solitary, logistical, factual aspects of a course that a student experiences with key terms including material, lecture, student, topic, reading, year, and lecturer. One thing worth noticing is that Topic 1 includes "course" as a key term while Topic 0 includes "class". While "course" and "class" are often used interchangeably in language, "class" arguably connotes a more personal, interactive experience than "course", which is more administrative and logistical and more likely to be used as an umbrella term for everything from everyday class to homework. This would support our separation of Topic 0 and 1 into more interactive/personable aspects and more logistical/solitary aspects, respectively.

In order to further evaluate our initial hypothesis that course reviews are split along two topics (interactive, qualitative, personable aspects vs. solitary, logistical aspects), we will output the words of some sentences, along with the probability of each sentence belonging to Topic 0 and Topic 1, to qualitatively check that our topics are reasonable and supported.

In [None]:
for bow in corpus[0:1200:60]:
    print bow
    print lda2.get_document_topics(bow)
    print " ".join([id2word[e[0]] for e in bow])
    print "=========================================="
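
When reading the output of the loop above, it can help to sort each bag-of-words into "mostly Topic 0", "mostly Topic 1" or "mixed". The sketch below assumes the per-document output is a list of (topic id, probability) pairs for a two-topic model, as printed above. The `dominant_topic` helper and its 0.2 margin are arbitrary assumptions chosen for illustration, not thresholds used in the analysis.

```python
def dominant_topic(topic_probs, margin=0.2):
    """Label a document by comparing its Topic 0 and Topic 1 probabilities."""
    probs = dict(topic_probs)
    p0, p1 = probs.get(0, 0.0), probs.get(1, 0.0)  # a missing topic counts as 0
    if abs(p0 - p1) < margin:
        return "mixed"
    return "Topic 0" if p0 > p1 else "Topic 1"

label_a = dominant_topic([(0, 0.85), (1, 0.15)])
label_b = dominant_topic([(0, 0.35), (1, 0.65)])
label_c = dominant_topic([(0, 0.55), (1, 0.45)])
```

Under this margin, a document with a 35% chance of belonging to Topic 0 is still labelled Topic 1, even though its Topic 0 share is far from negligible.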

The "sentences" (or bag-of-words) which have a much greater probability of belonging to Topic 0 include:
- class writer load work
- class bit pass work
- class cost intervention u'very benefit reform lot issue
- class student success belief stand risk course question ability
- class kink
- way professor class discussion material
- food professor person class
- class section discussion student
- time homework week

The words in these sentences are more descriptive and relate to more creative, interactive, person-to-person aspects of a course. Specifically, "writer", "intervention", "reform", "issue", "success", "belief", "stand", "risk", "question", "kink", and "discussion" all imply rich and diverse elements of the course experience. As a Harvard student, I know that the word "section" also implies discussion and collaboration since sections for Harvard classes provide an opportunity outside of lecture to engage more closely with course material and consist of tight-knit groups.


The "sentences" which have a much greater probability of belonging to Topic 1 include:
- concept background
- reading
- thought-provoking discussion debate material staff teaching
- time bit course u"must
- career course regret
- education perspective lot debate history

Words in these sentences that are not present in the previous cluster of sentences and that stand out as implying more impersonal, logistical, or practical aspects of a course include "concept", "background", "reading", "material", "time", "must", "education", "history", and arguably "staff" (since "staff" is a somewhat impersonal way to refer to professors and teaching fellows). Although words like "thought-provoking", "discussion", "debate" and "teaching" do appear, it is worthwhile to note that the sentence/bag-of-words in which they appear still has a relatively high probability of belonging to Topic 0 (~35%).

The "sentences" which have more equal probabilities of belonging to Topic 0 or 1 include:
- time assignment philosophy night
- course lot
- chance education course lot style
- student reason
- way class overview study

For sentences/bag-of-words with relatively equal probabilities of belonging to Topic 0 and Topic 1, we can observe both words implying more interactive, creative aspects ("philosophy", "style", "student", "reason") and words implying more logistical aspects ("time", "assignment", "overview", "study").


From our analysis of the topic probabilities and bag-of-words above, there appears to be evidence to support our initial hypothesis that course reviews are split along two topics: Topic 0, which includes more interactive, qualitative, personable aspects of a course, and Topic 1, which includes more solitary, logistical aspects.

TO DOs:
- We can consider doing "verbs" (use TextBlob)
- Detect "not" before adjectives, this shouldn't be too difficult
- Test before pushing
- Look at differences across departments (Jesse/Andrew)


Let us now continue with a sentiment analysis of the adjectives in the comments using Naive Bayes. We begin by extracting the adjectives as we did before with the nouns. 

In [None]:
nbdatardd=sc.parallelize([ele[1] for ele in parsedcomments])
nbdatardd.cache()
nbdatardd.take(3)

In [None]:
adjvocabtups = (nbdatardd.flatMap(lambda l: l).flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
             .map(lambda (x,y): x)
             .zipWithIndex()
).cache()
adjvocab=adjvocabtups.collectAsMap()

In [None]:
len(adjvocab)

Now we need to flatten all of the adjectives for a particular semester of a course's comments into a single document, and then collect these documents into an array to get our adjective "features". We also retrieve our response array from the Positive column of subdf, since this is the response variable we are trying to predict. The two arrays should have equal length if we do this correctly.

In [None]:
import itertools
Xarraypre=nbdatardd.map(lambda l: " ".join(list(itertools.chain.from_iterable(l))))
Xarray=Xarraypre.collect()
resparray = subdf.rdd.map(lambda r: r.Positive).collect()

In [None]:
len(Xarray), len(resparray)
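
The flattening step is easy to check on a toy example. In this sketch the nested lists stand in for the per-sentence adjective lists of two course-semesters; the data and the `toy_` names are made up.

```python
from itertools import chain

# One entry per course-semester; each entry is a list of per-sentence adjective lists.
toy_parsed = [
    [["generous"], ["dense", "philosophical"]],
    [["great"], ["hard", "fair"], ["fun"]],
]
toy_labels = [False, True]  # stand-in for the Positive column

# Flatten the sentences of each course-semester into one space-separated document.
toy_docs = [" ".join(chain.from_iterable(sentences)) for sentences in toy_parsed]
```

There is exactly one document per label, which is the equal-length check performed by `len(Xarray), len(resparray)`.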

Next we use mask to create a train and test split.

In [None]:
from sklearn.cross_validation import train_test_split
itrain, itest = train_test_split(xrange(len(Xarray)), train_size=0.7)
mask=np.ones(len(Xarray), dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)
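
The idea behind the mask is that a single boolean array selects the training rows, and its negation selects the test rows. Here is a minimal sketch with NumPy only. The seed, the ten toy rows and the `toy_mask` name are assumptions, and the shuffle is done by hand instead of with `train_test_split`.

```python
import numpy as np

rng = np.random.RandomState(0)     # fixed seed so the split is repeatable
n_docs = 10
order = rng.permutation(n_docs)    # shuffled row indices
n_train = (7 * n_docs) // 10       # 70% of the rows go to training

toy_mask = np.zeros(n_docs, dtype=bool)
toy_mask[order[:n_train]] = True   # True marks a training row

data = np.arange(n_docs) * 10
train_rows, test_rows = data[toy_mask], data[~toy_mask]
```

Because the same mask indexes both the feature matrix and the label array, the training features and training labels stay aligned row by row.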

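We then transform the adjectives into a bag-of-words representation and write some support functions for the Naive Bayes analysis. The function below transforms X_col (which consists of word-based "sentences", i.e. bag-of-words "documents") into a feature matrix using the vectorizer, which is passed in as a parameter.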

In [None]:
def make_xy(X_col, y_col, vectorizer):
    X = vectorizer.fit_transform(X_col)
    y = y_col
    return X, y

We plan on using log-likelihood as a scoring metric, so we write a support function to compute it.

In [None]:
def log_likelihood(clf, x, y):
    prob = clf.predict_log_proba(x)
    negatives = y == False
    positives = ~negatives
    # column 0 holds log P(negative), column 1 holds log P(positive)
    return prob[negatives, 0].sum() + prob[positives, 1].sum()
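
To make the metric concrete: the log-likelihood adds up, over all documents, the log of the probability that the classifier assigned to the true class. The sketch below works this out by hand for three made-up documents; `toy_log_likelihood` and the probabilities are illustrative assumptions.

```python
import math

def toy_log_likelihood(probs, labels):
    """probs[i] = (P(negative), P(positive)) for document i; labels[i] is True or False."""
    return sum(math.log(p[1] if y else p[0]) for p, y in zip(probs, labels))

toy_probs = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)]
toy_truth = [False, True, True]

# log(0.9) + log(0.8) + log(0.5) = log(0.36)
toy_score = toy_log_likelihood(toy_probs, toy_truth)
```

The score is always at most zero. It approaches zero as the classifier puts probability close to 1 on the correct class, and it drops sharply when the classifier is confidently wrong.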

We define a function to estimate the cross-validated score (given a classifier, data, and a scoring function).

In [None]:
from sklearn.cross_validation import KFold

def cv_score(clf, x, y, score_func, nfold=5):
    """
    Uses 5-fold cross validation to estimate a score of a classifier
    
    Inputs
    ------
    clf : Classifier object
    x : Input feature vector
    y : Input class labels
    score_func : Function like log_likelihood, that takes (clf, x, y) as input,
                 and returns a score
                 
    Returns
    -------
    The average score obtained by splitting (x, y) into 5 folds of training and 
    test sets, fitting on the training set, and evaluating score_func on the test set
    
    Examples
    cv_score(clf, x, y, log_likelihood)
    """
    result = 0
    for train, test in KFold(y.size, nfold): # split data into train/test groups, 5 times
        clf.fit(x[train], y[train]) # fit
        result += score_func(clf, x[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average
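
For readers who want to see what the folds look like, here is a dependency-free sketch of k-fold splitting and score averaging. It is an assumption-laden stand-in for scikit-learn's KFold: `kfold_indices` and `toy_cv_score` are hypothetical helpers, and the folds are simply consecutive blocks of rows.

```python
def kfold_indices(n, nfold=5):
    """Yield (train, test) index lists; consecutive blocks of rows form the test folds."""
    fold_sizes = [n // nfold + (1 if i < n % nfold else 0) for i in range(nfold)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def toy_cv_score(fit_and_score, n, nfold=5):
    """Average fit_and_score(train, test) over all folds."""
    total = sum(fit_and_score(tr, te) for tr, te in kfold_indices(n, nfold))
    return total / float(nfold)

toy_folds = list(kfold_indices(7, nfold=3))
avg_test_size = toy_cv_score(lambda tr, te: len(te), 7, nfold=3)
```

Every row appears in exactly one test fold, so each document is scored once by a model that never saw it during fitting.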

We also define a useful function for visualizing the calibration of a probabilistic classifier (in order to recognize whether our classifier is over-confident or under-confident).

In [None]:
def calibration_plot(clf, xtest, ytest):
    prob = clf.predict_proba(xtest)[:, 1]
    outcome = ytest
    data = pd.DataFrame(dict(prob=prob, outcome=outcome))

    #group outcomes into bins of similar probability
    bins = np.linspace(0, 1, 20)
    cuts = pd.cut(prob, bins)
    binwidth = bins[1] - bins[0]
    
    #fraction of positive outcomes and number of examples in each bin
    cal = data.groupby(cuts).outcome.agg(['mean', 'count'])
    cal['pmid'] = (bins[:-1] + bins[1:]) / 2
    cal['sig'] = np.sqrt(cal.pmid * (1 - cal.pmid) / cal['count'])
        
    #the calibration plot
    ax = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    p = plt.errorbar(cal.pmid, cal['mean'], cal['sig'])
    plt.plot(cal.pmid, cal.pmid, linestyle='--', lw=1, color='k')
    plt.ylabel("Empirical P(+)")
    
    #the distribution of P(+)
    ax = plt.subplot2grid((3, 1), (2, 0), sharex=ax)
    
    plt.bar(left=cal.pmid - binwidth / 2, height=cal['count'],
            width=.95 * (bins[1] - bins[0]),
            fc=p[0].get_color())
    
    plt.xlabel("Predicted P(+)")
    plt.ylabel("Number")
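
The numbers behind a calibration plot can be computed without any plotting. The sketch below bins predicted probabilities and compares each bin's midpoint with the observed fraction of positives. It assumes four equal-width bins and made-up predictions, and `calibration_table` is a hypothetical helper, not the function above.

```python
def calibration_table(probs, outcomes, n_bins=4):
    """Return one (bin_midpoint, count, observed_positive_rate) row per non-empty bin."""
    width = 1.0 / n_bins
    hits = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        b = min(int(p / width), n_bins - 1)  # p == 1.0 falls in the last bin
        hits[b].append(1 if y else 0)
    return [((b + 0.5) * width, len(h), sum(h) / float(len(h)))
            for b, h in enumerate(hits) if h]

toy_pred = [0.1, 0.2, 0.6, 0.7, 0.9, 0.95]
toy_outcomes = [False, False, True, False, True, True]
toy_table = calibration_table(toy_pred, toy_outcomes)
```

For a well-calibrated classifier the observed rate in each row is close to the bin midpoint. Rates nearer to 0.5 than the midpoint suggest over-confidence, and rates more extreme than the midpoint suggest under-confidence.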

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

We convert Xarray and resparray into numpy arrays for use with sklearn.

In [None]:
X=np.array(Xarray)
y=np.array(resparray)

Now we write a cross-validation loop to find the best hyper-parameters for our Naive Bayes analysis.

In [None]:
#the grid of parameters to search over
alphas = [0, .1, 1, 5, 10, 50]
min_dfs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
maxscore=-np.inf
for alpha in alphas:
    for min_df in min_dfs:         
        # note: scikit-learn ignores min_df when a fixed vocabulary is supplied
        vectorizer = CountVectorizer(vocabulary = adjvocab, min_df = min_df)
        Xthis, ythis = make_xy(X, y, vectorizer)
        Xtrainthis=Xthis[mask]
        ytrainthis=ythis[mask]
        clf = MultinomialNB(alpha=alpha)
        cvscore = cv_score(clf, Xtrainthis, ytrainthis, log_likelihood)
        if cvscore > maxscore:
            maxscore = cvscore
            best_alpha, best_min_df = alpha, min_df

In [None]:
print "alpha: %f" % best_alpha
print "min_df: %f" % best_min_df
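
To show what the smoothing parameter alpha does, here is a from-scratch sketch of a multinomial Naive Bayes classifier on four made-up documents. It is not scikit-learn's implementation: `train_nb` and `predict_nb` are hypothetical helpers, and this sketch assumes alpha > 0 so that no logarithm of zero is taken.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """docs: list of word lists; labels: list of booleans. Returns (log priors, log likelihoods)."""
    vocab = sorted({w for d in docs for w in d})
    log_prior, log_lik = {}, {}
    for c in (False, True):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        log_prior[c] = math.log(len(class_docs) / float(len(docs)))
        # alpha is added to every word count, so unseen words never get probability 0
        log_lik[c] = {w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
                      for w in vocab}
    return log_prior, log_lik

def predict_nb(model, doc):
    """Return the class with the highest log posterior; out-of-vocabulary words are skipped."""
    log_prior, log_lik = model
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
              for c in log_prior}
    return max(scores, key=scores.get)

toy_train = [["great", "fun"], ["great", "helpful"], ["boring", "vague"], ["vague", "hard"]]
toy_y = [True, True, False, False]
toy_model = train_nb(toy_train, toy_y, alpha=1.0)
toy_prediction = predict_nb(toy_model, ["great", "hard"])
```

With alpha = 1 the document "great hard" scores 0.3 × 0.1 under the positive class and 0.1 × 0.2 under the negative class, so it is predicted positive. Larger alpha values pull all word probabilities toward uniform, which is why alpha is tuned by cross-validation.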

Now that we have determined the best parameters, we are ready to run the Naive Bayes classifier and create a calibration plot. 

In [None]:
vectorizer = CountVectorizer(vocabulary = adjvocab, min_df=best_min_df)
X2, y2 = make_xy(X, y, vectorizer)
xtrain=X2[mask]
ytrain=y2[mask]
xtest=X2[~mask]
ytest=y2[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)

# Print the accuracy on the training and test datasets
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print "Accuracy on training data: %0.2f" % (training_accuracy)
print "Accuracy on test data:     %0.2f" % (test_accuracy)
calibration_plot(clf, xtest, ytest)

We now repeat the same Naive Bayes analysis using the verbs in the comments in place of the adjectives.

In [None]:
nbdatardd_verbs=sc.parallelize([ele[2] for ele in parsedcomments])
nbdatardd_verbs.cache()
nbdatardd_verbs.take(3)

In [None]:
verbvocabtups = (nbdatardd_verbs.flatMap(lambda l: l).flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
             .map(lambda (x,y): x)
             .zipWithIndex()
).cache()
verbvocab=verbvocabtups.collectAsMap()

In [None]:
#build one verb document per course-semester, as we did for the adjectives
#(X holds the adjective documents, so it cannot be reused with the verb vocabulary)
Xverbs=np.array(nbdatardd_verbs.map(lambda l: " ".join(itertools.chain.from_iterable(l))).collect())

#the grid of parameters to search over
alphas = [0, .1, 1, 5, 10, 50]
min_dfs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
maxscore=-np.inf
for alpha in alphas:
    for min_df in min_dfs:
        vectorizer = CountVectorizer(vocabulary = verbvocab, min_df = min_df)
        Xthis, ythis = make_xy(Xverbs, y, vectorizer)
        Xtrainthis=Xthis[mask]
        ytrainthis=ythis[mask]
        clf = MultinomialNB(alpha=alpha)
        cvscore = cv_score(clf, Xtrainthis, ytrainthis, log_likelihood)
        if cvscore > maxscore:
            maxscore = cvscore
            best_alpha, best_min_df = alpha, min_df

In [None]:
print "alpha: %f" % best_alpha
print "min_df: %f" % best_min_df

In [None]:
vectorizer = CountVectorizer(vocabulary = verbvocab, min_df=best_min_df)
X2, y2 = make_xy(Xverbs, y, vectorizer)
xtrain=X2[mask]
ytrain=y2[mask]
xtest=X2[~mask]
ytest=y2[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)

# Print the accuracy on the training and test datasets
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print "Accuracy on training data: %0.2f" % (training_accuracy)
print "Accuracy on test data:     %0.2f" % (test_accuracy)
calibration_plot(clf, xtest, ytest)