# Curious Comments 
![commentpic](comment_structure.png)

A critical part of any review is its comments. We now analyze the comments in our Q data: we judge their predictive power and examine the role they play in a course's Q rating.

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import os
import math
from itertools import chain
import ast

We begin by creating a dataframe of comments. For each course-semester we randomly subsample `MIN_PER_COURSESEM_REVIEWS` comments (set to 5 below) and keep only course-semesters with at least that many comments. The resulting dataframe has the columns Course, C_Semester, Comments, Positive, and Sampled_Comments. A course is defined to have an 'Overall Positive Rating' if it has been given more positive ratings than negative ratings across all semesters in which it was rated.
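The 'Overall Positive Rating' rule can be sketched in a few lines. This is a minimal illustration on made-up data, assuming each rated semester carries a boolean positive flag; the function name `overall_positive` and the toy rows are hypothetical and are not taken from `bigdf.csv`.

```python
from collections import defaultdict

# Hypothetical (course, positive flag) pairs, one per rated semester.
ratings = [
    ("A-1", True), ("A-1", True), ("A-1", False),
    ("B-2", False), ("B-2", True),
]

def overall_positive(rows):
    """Map each course to True when its positive semesters outnumber its negative ones."""
    tally = defaultdict(lambda: [0, 0])  # course -> [n_positive, n_negative]
    for course, positive in rows:
        tally[course][0 if positive else 1] += 1
    return {course: pos > neg for course, (pos, neg) in tally.items()}

labels = overall_positive(ratings)
print(labels)
```

A tie (as for the toy course `B-2`) counts as not positive here, since the rule asks for strictly more positive than negative ratings.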

In [2]:
MIN_PER_COURSESEM_REVIEWS = 5

In [3]:
bigdf=pd.read_csv("bigdf.csv")
bigdf = bigdf.reset_index(drop=True)
bigdf.head(5)

Unnamed: 0,C_Department,C_Number,Course,C_CatNum,C_ID,C_Semester,C_Year,C_Term,C_Overall,C_Workload,C_Difficulty,C_Recommendation,C_Enrollment,C_ResponseRate,I_First,I_Last,I_ID,I_Overall,I_EffectiveLectures,I_Accessible,I_GeneratesEnthusiasm,I_EncouragesParticipation,I_UsefulFeedback,I_ReturnsAssignmentsTimely,QOverall_1,QOverall_2,QOverall_3,QOverall_4,QOverall_5,QDifficulty_1,QDifficulty_2,QDifficulty_3,QDifficulty_4,QDifficulty_5,QWorkload_1,QWorkload_2,QWorkload_3,QWorkload_4,QWorkload_5,Comments,Sem_Average,Positive
0,HISTSCI,270.0,HISTSCI-270,58523,2697,Spring '12,2011,2,4.67,2.33,3.33,5.0,6,50.0,Rebecca,Lemov,79de794d3e2e19eb71a2033b0ec0b76d,4.67,4.33,4.0,4.33,5.0,4.5,4.0,0,0,0,1,2,0,0,2,1,0,0,2,1,0,0,[u'This course is a perfect example of what gr...,4.22635,True
1,EXPOS,20.132,EXPOS-20.132,22108,1676,Fall '14,2014,1,4.1,7.1,,3.5,13,76.92,Owen,Chen,1341ccb7bd27f47e68625b63b15281d1,4.5,4.6,3.7,3.9,3.9,4.1,4.6,0,1,1,4,4,0,1,2,7,3,1,3,4,1,0,"[u'The class has a fairly high work load, but ...",4.24437,False
2,EXPOS,20.132,EXPOS-20.132,22108,1676,Fall '13,2013,1,3.5,2.6,3.9,3.2,13,100.0,Owen,Chen,1341ccb7bd27f47e68625b63b15281d1,3.8,4.0,2.5,3.5,4.1,4.3,4.4,0,1,1,4,4,0,1,2,7,3,1,3,4,1,0,[u'Philosophy of the State with Dr. Chen offer...,4.256888,False
3,EXPOS,20.132,EXPOS-20.132,22108,1676,Fall '12,2012,1,3.73,2.47,3.67,3.47,15,100.0,Owen,Chen,1341ccb7bd27f47e68625b63b15281d1,3.87,4.33,2.64,3.93,4.18,3.64,3.82,0,1,1,4,4,0,1,2,7,3,1,3,4,1,0,[u'This was by far my favorite course. Dr. Che...,4.190299,False
4,EXPOS,20.132,EXPOS-20.132,22108,1676,Fall '11,2011,1,3.85,2.0,3.54,3.62,13,100.0,Owen,Chen,1341ccb7bd27f47e68625b63b15281d1,4.08,3.75,3.31,3.92,4.46,4.23,4.08,0,1,1,4,4,0,1,2,7,3,1,3,4,1,0,"[u'Be prepared to read', u'Discussions were gr...",4.185893,False


In [4]:
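def sample_comments(commentsListAsString):
    # Missing comments are read as NaN (a float), not a string.
    if not isinstance(commentsListAsString, str):
        return ""
    # The Comments column stores the repr of a Python list; parse it safely.
    allComments = ast.literal_eval(commentsListAsString)
    if len(allComments) < MIN_PER_COURSESEM_REVIEWS:
        return ""
    # Draw a fixed-size random sample without replacement and join it into one string.
    return " ".join(np.random.choice(allComments, MIN_PER_COURSESEM_REVIEWS, replace=False))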

subdf = bigdf[['Course','C_Semester','Comments', 'Positive']].dropna()
subdf["Sampled_Comments"] = subdf.Comments.map(sample_comments)
subdf = subdf[subdf.Sampled_Comments != ""]

In [5]:
subdf.head(12)

Unnamed: 0,Course,C_Semester,Comments,Positive,Sampled_Comments
1,EXPOS-20.132,Fall '14,"[u'The class has a fairly high work load, but ...",False,"Unfortunately, I am unlikely to recommend this..."
2,EXPOS-20.132,Fall '13,[u'Philosophy of the State with Dr. Chen offer...,False,Great class albeit with a lot of reading. Assu...
3,EXPOS-20.132,Fall '12,[u'This was by far my favorite course. Dr. Che...,False,Writing about philosophy is kind of an idiosyn...
4,EXPOS-20.132,Fall '11,"[u'Be prepared to read', u'Discussions were gr...",False,You'll have to think deeply about your opinion...
6,EXPOS-20.133,Spring '12,[u'This is a fantastic course if you have an i...,False,It would be very hard if you're not interested...
7,EXPOS-20.133,Spring '14,[u'This is not a bad Expos course. Dr. Chen is...,False,Be prepared to read a lot of challenging mater...
13,EXPOS-20.131,Fall '14,[u'If you are interested in political philosop...,False,Besides the great readings and interesting sub...
14,EXPOS-20.131,Fall '13,"[u'This class involves quite a bit of reading,...",False,Be ready for lots of reading! Be prepared for ...
15,EXPOS-20.131,Fall '12,"[u""A basic, but thorough, understanding of phi...",False,I feel like I missed a great opportunity on th...
16,EXPOS-20.131,Fall '11,"[u'It is a great class, which combines develop...",True,Philosophy of the State isn't for everybody: t...


Next, we convert our comments dataframe, `subdf`, to a Spark dataframe for text analysis.

In [6]:
# Set up Spark
import sys
import findspark
findspark.init()
print(findspark.find())
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)
# Sanity check: run a trivial job so the workers report their Python version.
rdd = sc.parallelize(range(10), 10)
rdd.map(lambda x: sys.version).collect()
sys.version
from pyspark.sql import SQLContext
sqlsc = SQLContext(sc)

/usr/local/opt/apache-spark/libexec


In [7]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
# Characters treated as punctuation when filtering tokens.
punctuation = list('.,;:!?()[]{}`\'"@#$^&*+-|=~_')
from sklearn.feature_extraction import text 
stopwords=text.ENGLISH_STOP_WORDS
import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")

We write a `get_parts` function that parses each comment and returns, sentence by sentence, the lemmatized nouns and the adjectives (descriptives) it contains.
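The core of that parsing step is a filter on part-of-speech tags. A dependency-free sketch of the filter, run on hand-tagged tokens, may make the rule easier to see. The names `keep` and `split_by_tag`, the tiny stopword and punctuation sets, and the tagged sentences are all assumptions made for illustration; the notebook itself gets its tags and lemmas from `pattern.en`.

```python
# Tiny stand-ins for the real stopword list and punctuation set (assumptions).
STOPWORDS = {"the", "a", "other"}
PUNCT = set(".,;:!?")

ADJECTIVE_TAGS = ("JJ", "JJR", "JJS")
NOUN_TAGS = ("NN", "NNS")

def keep(lemma):
    """Reject empty lemmas, stopwords, one-character tokens, and tokens edged by punctuation."""
    if not lemma or len(lemma) == 1 or lemma in STOPWORDS:
        return False
    return lemma[0] not in PUNCT and lemma[-1] not in PUNCT

def split_by_tag(tagged_sentence):
    """Split (lemma, tag) pairs into (nouns, descriptives)."""
    nouns = [l for l, t in tagged_sentence if t in NOUN_TAGS and keep(l)]
    descriptives = [l for l, t in tagged_sentence if t in ADJECTIVE_TAGS and keep(l)]
    return nouns, descriptives

# Hand-tagged toy sentences: (lemma, Penn Treebank tag).
sentences = [
    [("the", "DT"), ("reading", "NN"), ("be", "VBZ"), ("heavy", "JJ"),
     ("but", "CC"), ("fair", "JJ"), (".", ".")],
    [("take", "VB"), ("it", "PRP"), ("!", ".")],
]

parts = [split_by_tag(s) for s in sentences]
# Keep only sentences with at least one noun and one descriptive.
parts = [(n, d) for n, d in parts if n and d]
print(parts)
```

The second toy sentence has neither a noun nor an adjective, so it is dropped, which mirrors the final filtering step in `get_parts`.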

In [8]:
def get_parts(thetext):
    # Replace runs of dots ("...") and dashes ("--") with a space before parsing.
    thetext=re.sub(regex1, ' ', thetext)
    thetext=re.sub(regex2, ' ', thetext)
    nouns=[]
    descriptives=[]
    # Each parsed token is [word, POS tag, chunk tag, PNP tag, lemma]; token[4] is the lemma.
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            if len(token[4]) >0:
                # Adjectives (base, comparative, superlative) are collected as descriptives.
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                # Singular and plural common nouns are collected as nouns.
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    # Keep only sentences that contain at least one noun and one descriptive.
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2

In [9]:
subdf = sqlsc.createDataFrame(subdf)
subdf.show(5)

+------------+----------+--------------------+--------+--------------------+
|      Course|C_Semester|            Comments|Positive|    Sampled_Comments|
+------------+----------+--------------------+--------+--------------------+
|EXPOS-20.132|  Fall '14|[u'The class has ...|   false|Unfortunately, I ...|
|EXPOS-20.132|  Fall '13|[u'Philosophy of ...|   false|Great class albei...|
|EXPOS-20.132|  Fall '12|[u'This was by fa...|   false|Writing about phi...|
|EXPOS-20.132|  Fall '11|[u'Be prepared to...|   false|You'll have to th...|
|EXPOS-20.133|Spring '12|[u'This is a fant...|   false|It would be very ...|
+------------+----------+--------------------+--------+--------------------+
only showing top 5 rows



In [11]:
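# Note: r.Comments is the raw string form of the full comment list (e.g. "[u'...', ...]"),
# so list-syntax residue such as u'the can show up among the parsed tokens.
comment_parts = subdf.rdd.map(lambda r: get_parts(r.Comments))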
comment_parts.take(5)

[([[u'class', u'work', u'load', u'writer'],
   [u'quality', u'instruction', u'course', u'instructor', u'time'],
   [u'paper', u'family', u'state'],
   [u'lot', u'grader'],
   [u'teacher', u'though\\xe2\\u20ac\\u201dhe'],
   [u'course'],
   [u'subject', u'course', u'overall', u'expectation'],
   [u'purpose',
    u'class',
    u'freshman',
    u'writing',
    u'skill',
    u'class',
    u'resource',
    u'purpose'],
   [u'work', u'effort', u'essay'],
   [u'essay', u'topic', u'guidance'],
   [u'feedback', u'draft', u'draft'],
   [u'feedback', u'office', u'hour'],
   [u'peer', u'resource', u'guidance', u'class'],
   [u'class', u'requirement'],
   [u'lot', u'work', u'summary', u'assignment']],
  [[u"u'the", u'high', u'helpful'],
   [u'generous'],
   [u'final'],
   [u'u"a', u'hard'],
   [u'good', u'smart', u'nice'],
   [u'unlikely'],
   [u'passionate'],
   [u'useful'],
   [u'manageable'],
   [u'reasonable', u'little'],
   [u'vague', u'unhelpful', u'final'],
   [u'difficult', u'constructive',

In [None]:
%%time
parsedcomments=comment_parts.collect()

We begin our text analysis by fitting a latent Dirichlet allocation (LDA) topic model to the nouns in the comments.
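LDA works on bag-of-words vectors rather than raw text, so each sentence first has to become a list of (word id, count) pairs. A minimal plain-Python sketch of that conversion follows, using made-up sentences; the cells after it do the same thing with Spark. The helper name `to_bow` and the toy data are hypothetical.

```python
from collections import Counter

# Toy "sentences": each one is a list of nouns (made-up data).
sentences = [
    ["class", "work", "load"],
    ["course", "class"],
    ["work", "work", "essay"],
]

# Assign an integer id to every distinct word, in first-seen order.
vocab = {}
for sentence in sentences:
    for word in sentence:
        vocab.setdefault(word, len(vocab))

# Reverse mapping, used to turn ids back into readable words.
id2word = {idx: word for word, idx in vocab.items()}

def to_bow(sentence):
    """Convert a list of words into sorted (word id, count) pairs."""
    counts = Counter(vocab[word] for word in sentence)
    return sorted(counts.items())

corpus = [to_bow(s) for s in sentences]
print(corpus)
```

A list of such (id, count) lists, together with an id-to-word mapping, is the corpus format that gensim's `LdaModel` accepts.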

In [None]:
[e[0] for e in parsedcomments[:3]]

In [None]:
ldadatardd=sc.parallelize([ele[0] for ele in parsedcomments]).flatMap(lambda l: l)
ldadatardd.cache()
ldadatardd.take(5)

In [None]:
ldadatardd.flatMap(lambda word: word).take(5)

In [None]:
vocabtups = (ldadatardd.flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
             .map(lambda kv: kv[0])
             .zipWithIndex()
).cache()

In [None]:
vocab=vocabtups.collectAsMap()
id2word=vocabtups.map(lambda kv: (kv[1], kv[0])).collectAsMap()

In [None]:
id2word[0], list(vocab.keys())[5], vocab[list(vocab.keys())[5]]

In [None]:
len(vocab.keys())

In [None]:
from collections import defaultdict
def helperfunction(element):
    # Turn one sentence (a list of words) into a bag of words: (word id, count) pairs.
    d = defaultdict(int)
    for k in element:
        d[vocab[k]] += 1
    return list(d.items())
documents = ldadatardd.map(lambda w: helperfunction(w))

In [None]:
documents.take(5)

In [None]:
corpus=documents.collect()

In [None]:
import gensim

In [None]:
lda2 = gensim.models.ldamodel.LdaModel(corpus = corpus, num_topics = 2, id2word=id2word, chunksize=200, passes = 10)

Below, we print the topics found by LDA.

In [None]:
lda2.print_topics()

The first topic (let us call this Topic 0) includes the combination of words: course, lot, material, time, way, work, experience, paper, field, and person. 


The second topic (let us call this Topic 1) includes the combination of words: class, course, student, professor, lecture, reading, history, topic, discussion, and fun.


Topic 1 seems to encompass the more interactive, qualitative, personable aspects of the course, with key terms including student, professor, discussion, lecture, and fun. Topic 0, by contrast, seems to encompass the more solitary, logistical, factual aspects of the course, with key terms including material, time, work, experience, paper, and field. Note that both topics include "course" as a key term, but only Topic 1 includes "class". While "course" and "class" are often used interchangeably, "class" arguably connotes a more personal, interactive experience than "course", which is more administrative and logistical and more likely to be used as an umbrella term for everything from everyday class to homework.


To further evaluate our initial hypothesis that course reviews are split along two topics (interactive, qualitative, personable aspects vs. solitary, logistical aspects), we output the words of some sentences, along with each sentence's probability of belonging to Topic 0 and Topic 1, to check qualitatively that our topics are reasonable and supported.

In [None]:
for bow in corpus[0:1000:100]:
    print(bow)
    print(lda2.get_document_topics(bow))
    print(" ".join([id2word[e[0]] for e in bow]))
    print("==========================================")

The "sentences" (bags of words) that have a much greater probability of belonging to Topic 0 include:
- field class thesis work
- bit work lab
- order course thing


The words in these sentences relate to more impersonal, logistical aspects of a course. Specifically, "field", "thesis", "work", and "order" describe inflexible, solitary, required aspects of a course, while "bit" and "thing" are vaguer but nevertheless imply a degree of impartiality and detachment.



The "sentences" that have a much greater probability of belonging to Topic 1 include:
- depth course grad class history student question
- concept commitment moment life
- section lecture exam reading drawback concept
- class entertaining
- evolution class pre cinema

Words in these sentences that are absent from the previous cluster of sentences and that stand out as implying more creative, interactive, person-to-person aspects of a course include "depth", "history", "question", "concept", "commitment", "moment", "life", "drawback", "lecture", and "entertaining".

The "sentences" which have more equal probabilities of belonging to Topic 0 or 1 include:
- lecture person
- resource professor man kind topic

We can observe words implying more logistical aspects ("resource", "topic") and more interactive, creative aspects ("professor", "kind"). 
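One way to make "much greater probability" versus "more equal probabilities" concrete is to threshold the gap between the two topic probabilities. The sketch below assumes input in the shape that gensim's `get_document_topics` returns, a list of (topic id, probability) pairs. The function name `label_topic` and the 0.3 margin are arbitrary assumptions for illustration, not values used in this analysis.

```python
def label_topic(topic_probs, margin=0.3):
    """Label a bag of words as 'Topic 0', 'Topic 1', or 'mixed'.

    topic_probs: list of (topic_id, probability) pairs; a topic that is
    missing from the list is treated as having probability 0.
    margin: minimum probability gap required to assign a single topic.
    """
    probs = dict(topic_probs)
    p0 = probs.get(0, 0.0)
    p1 = probs.get(1, 0.0)
    if p0 - p1 > margin:
        return "Topic 0"
    if p1 - p0 > margin:
        return "Topic 1"
    return "mixed"

# Made-up probabilities, for illustration only.
examples = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.45), (1, 0.55)],
    [(1, 0.97)],
]
result = [label_topic(e) for e in examples]
print(result)
```

A larger margin moves more sentences into the mixed group; the right value depends on how confident the topic assignments need to be.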


Of course, words such as "lecture" can belong to either topic since lecture is both a logistical, required part of most courses and an engaging, potentially interactive opportunity for students to learn from professors. From our analysis of the topic probabilities and bag-of-words above, however, there appears to be evidence to support our initial hypothesis that course reviews are split along two topics: Topic 0, which includes more solitary, logistical aspects and Topic 1, which includes more interactive, qualitative, personable aspects of a course.

TO DOs:
- We can consider doing "verbs" (use TextBlob)
- Detect "not" before adjectives, this shouldn't be too difficult
- Test before pushing
- Look at differences across departments (Jesse/Andrew)
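
For the negation item, a minimal sketch of one possible approach: when the token immediately before an adjective is a negation word, emit a combined `not_<adjective>` token so that "not helpful" and "helpful" stay distinct. The function name `mark_negated_adjectives`, the negation list, and the hand-tagged sentences are assumptions for illustration.

```python
ADJECTIVE_TAGS = ("JJ", "JJR", "JJS")
NEGATIONS = ("not", "n't", "never")

def mark_negated_adjectives(tagged_sentence):
    """Return the adjectives of a sentence, prefixed with 'not_' when directly negated.

    tagged_sentence: list of (word, POS tag) pairs.
    """
    out = []
    for i, (word, tag) in enumerate(tagged_sentence):
        if tag not in ADJECTIVE_TAGS:
            continue
        previous = tagged_sentence[i - 1][0].lower() if i > 0 else ""
        out.append("not_" + word if previous in NEGATIONS else word)
    return out

# Hand-tagged toy sentences (made-up data).
negated = mark_negated_adjectives(
    [("the", "DT"), ("lectures", "NNS"), ("were", "VBD"), ("not", "RB"), ("helpful", "JJ")]
)
plain = mark_negated_adjectives([("great", "JJ"), ("class", "NN")])
print(negated, plain)
```

This only catches a negation that sits directly before the adjective; a phrase such as "not very helpful" would need a wider look-back window.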
