# Curious Comments 
![commentpic](comment_structure.png)

A critical part of any review is its comments. We now analyze the comments in our Q data: we assess their predictive power and the role they play in a course's Q rating.

In [250]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import os
import math
from itertools import chain
import ast

We begin by creating a dataframe of comments. We pool each course's comments across semesters, keep only courses that have at least 10 comments, and randomly subsample 8 comments per course.

In [251]:
MAX_PER_COURSE_SAMPLES = 8
#print tempdf[tempdf['C_Number'] == 'SOC-STD 98nw'].Comments.tolist()[0]
#tempdf.groupby('C_Number').groups

In [281]:
bigdf = pd.read_csv("bigdf.csv")
bigdf = bigdf.reset_index(drop=True)  # reset_index returns a new frame, so keep the result
tempdf = bigdf[['C_Department','C_Number','Positive','Comments']].dropna()
subdf = pd.DataFrame()
for dept, courseNum in tempdf.groupby(['C_Department','C_Number']).groups:
    allCourseComments = []
    # Parenthesize each comparison: & binds more tightly than ==
    allSemesterComments = tempdf[(tempdf.C_Number == courseNum) & (tempdf.C_Department == dept)].Comments.tolist()
    for eachSemesterComments in allSemesterComments:
        # Each cell holds the string representation of a list of comments
        allCourseComments = allCourseComments + ast.literal_eval(eachSemesterComments)
    if len(allCourseComments) >= 10:
        sample = np.random.choice(allCourseComments, MAX_PER_COURSE_SAMPLES, replace=False)
        coursedf = pd.DataFrame()
        coursedf['Comment'] = sample
        coursedf['Course'] = dept + courseNum
        subdf = pd.concat([subdf,coursedf]).reset_index(drop=True)

In [279]:
subdf.head(12)
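
The pool-filter-sample logic described above can be sketched without pandas. This is a minimal illustration on made-up data, not the notebook's own code: the helper `subsample_comments`, the constant `MIN_COMMENTS`, and the toy rows are all assumptions introduced here, and the standard library's `random.sample` stands in for `np.random.choice(..., replace=False)`.

```python
import random
from collections import defaultdict

MAX_PER_COURSE_SAMPLES = 8   # same constant as in the notebook
MIN_COMMENTS = 10            # threshold stated in the text (name is hypothetical)

def subsample_comments(rows, seed=None):
    """rows: iterable of (department, course_number, comments_for_one_semester).

    Returns {(department, course_number): [sampled comments]} for every
    course whose pooled comment count reaches MIN_COMMENTS.
    """
    rng = random.Random(seed)
    pooled = defaultdict(list)
    for dept, number, comments in rows:
        # Pool comments from all semesters of the same course.
        pooled[(dept, number)].extend(comments)
    return {
        course: rng.sample(comments, MAX_PER_COURSE_SAMPLES)  # without replacement
        for course, comments in pooled.items()
        if len(comments) >= MIN_COMMENTS
    }

# Toy data (hypothetical): one course with 12 comments over two semesters,
# and one course with only 3 comments, which is filtered out.
toy_rows = [
    ("DEPT-A", "101", ["a%d" % i for i in range(7)]),
    ("DEPT-A", "101", ["b%d" % i for i in range(5)]),
    ("DEPT-B", "202", ["c0", "c1", "c2"]),
]
sampled = subsample_comments(toy_rows, seed=0)
```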

Now we convert our comments dataframe, subdf, to a Spark DataFrame for text analysis.

In [224]:
#setup spark
import os
import findspark
findspark.init()
print findspark.find()
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local')
    .setAppName('pyspark')
    .set("spark.executor.memory", "5g"))
sc = pyspark.SparkContext(conf=conf)
import sys
rdd = sc.parallelize(xrange(10),10)
rdd.map(lambda x: sys.version).collect()
sys.version
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

/usr/local/opt/apache-spark/libexec


In [225]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
punctuation = list('.,;:!?()[]{}`\'\"@#$^&*+-|=~_')  # escape the apostrophe so it is actually included
from sklearn.feature_extraction import text 
stopwords=text.ENGLISH_STOP_WORDS
import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")

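We write a `get_parts` function that parses each comment and returns, sentence by sentence, its lemmatized nouns and adjectives. Stopwords, single-character tokens, and tokens that begin or end with punctuation are skipped, and only sentences containing at least one noun and one adjective are kept.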

In [226]:
def get_parts(thetext):
    thetext=re.sub(regex1, ' ', thetext)
    thetext=re.sub(regex2, ' ', thetext)
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            #print token
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2
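
The filtering rules inside `get_parts` can be isolated from the parser. The sketch below assumes the sentences have already been tagged and lemmatized, and takes them as lists of `(tag, lemma)` pairs. The names `keep` and `split_parts`, the tiny `STOPWORDS` set, and the example sentences are hypothetical stand-ins. It does not call `pattern` or scikit-learn.

```python
# Hypothetical stand-ins for the notebook's stopword and punctuation lists.
STOPWORDS = frozenset(["many", "other", "such"])
PUNCTUATION = set(".,;:!?()[]{}`'\"@#$^&*+-|=~_")
ADJ_TAGS = ("JJ", "JJR", "JJS")
NOUN_TAGS = ("NN", "NNS")

def keep(lemma):
    """Apply the same exclusions as get_parts to a single lemma."""
    return (len(lemma) > 1
            and lemma not in STOPWORDS
            and lemma[0] not in PUNCTUATION
            and lemma[-1] not in PUNCTUATION)

def split_parts(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (tag, lemma) pairs."""
    nouns, descriptives = [], []
    for sentence in tagged_sentences:
        n = [lemma for tag, lemma in sentence if tag in NOUN_TAGS and keep(lemma)]
        d = [lemma for tag, lemma in sentence if tag in ADJ_TAGS and keep(lemma)]
        # Keep a sentence only if it has at least one noun and one adjective.
        if n and d:
            nouns.append(n)
            descriptives.append(d)
    return nouns, descriptives

tagged = [
    [("DT", "this"), ("VBZ", "be"), ("DT", "a"), ("JJ", "great"), ("NN", "course")],
    [("NNS", "reading"), ("VBP", "be"), ("RB", "very")],  # no adjective: dropped
]
nouns, descriptives = split_parts(tagged)
```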

In [227]:
subdf = sqlsc.createDataFrame(subdf)
subdf.show(5)

+--------------------+------------+
|             Comment|    C_Number|
+--------------------+------------+
|Take this class i...|SOC-STD 98nw|
|Be prepared to le...|SOC-STD 98nw|
|The course is cha...|SOC-STD 98nw|
|This is a great c...|SOC-STD 98nw|
|If you are at all...|SOC-STD 98nw|
+--------------------+------------+
only showing top 5 rows



In [228]:
comment_parts = subdf.rdd.map(lambda r: get_parts(r.Comment))
comment_parts.take(5)

[([[u'class', u'thesis', u'field', u'work'],
   [u'course', u'experience', u'research', u'thesis'],
   [u'healthcare', u'way', u'problem', u'day'],
   [u'note', u'tea', u'party', u'elephant', u'bias', u'course'],
   [u'person', u'view', u'coverage'],
   [u'view']],
  [[u'actual'],
   [u'true', u'original', u'qualitative'],
   [u'great', u'important', u'political', u'sociological', u'economic'],
   [u'super', u'conservative', u'obsessed', u'general', u'liberal'],
   [u'moderate', u'political', u'universal', u'important'],
   [u'better']]),
 ([[u'healthcare', u'individual'], [u'class', u'reading', u'time']],
  [[u'prepared', u'knowledgeable'], [u'great', u'little', u'unstructured']]),
 ([[u'course', u'workload', u'load'],
   [u'course'],
   [u'practice', u'research', u'thesis']],
  [[u'great'], [u'glad'], [u'confident', u'well-equipped', u'senior']]),
 ([[u'class', u'writing', u'thesis', u'topic', u'healthcare', u'healthcare'],
   [u'course', u'material'],
   [u'method', u'thesis', u'opt

In [229]:
%%time
parsedcomments=comment_parts.collect()

CPU times: user 172 ms, sys: 152 ms, total: 324 ms
Wall time: 1min 4s


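We begin our text analysis by fitting a latent Dirichlet allocation (LDA) topic model to the nouns in the comments, treating the nouns of each sentence as one document.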

In [230]:
[e[0] for e in parsedcomments[:3]]

[[[u'class', u'thesis', u'field', u'work'],
  [u'course', u'experience', u'research', u'thesis'],
  [u'healthcare', u'way', u'problem', u'day'],
  [u'note', u'tea', u'party', u'elephant', u'bias', u'course'],
  [u'person', u'view', u'coverage'],
  [u'view']],
 [[u'healthcare', u'individual'], [u'class', u'reading', u'time']],
 [[u'course', u'workload', u'load'],
  [u'course'],
  [u'practice', u'research', u'thesis']]]

In [231]:
ldadatardd=sc.parallelize([ele[0] for ele in parsedcomments]).flatMap(lambda l: l)
ldadatardd.cache()
ldadatardd.take(5)

[[u'class', u'thesis', u'field', u'work'],
 [u'course', u'experience', u'research', u'thesis'],
 [u'healthcare', u'way', u'problem', u'day'],
 [u'note', u'tea', u'party', u'elephant', u'bias', u'course'],
 [u'person', u'view', u'coverage']]

In [232]:
ldadatardd.flatMap(lambda word: word).take(5)

[u'class', u'thesis', u'field', u'work', u'course']

In [233]:
vocabtups = (ldadatardd.flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
             .map(lambda (x,y): x)
             .zipWithIndex()
).cache()

In [234]:
vocab=vocabtups.collectAsMap()
id2word=vocabtups.map(lambda (x,y): (y,x)).collectAsMap()

In [235]:
id2word[0], vocab.keys()[5], vocab[vocab.keys()[5]]

(u'req', u'dynamic', 5)

In [236]:
len(vocab.keys())

4661
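
For readers without a Spark context, the vocabulary step can be sketched in plain Python. This is an illustration under assumptions, not the pipeline above: `build_vocab` is a hypothetical helper, and it assigns ids in sorted order for reproducibility, whereas `zipWithIndex` assigns them by partition order. The two example documents are the first two noun lists shown earlier.

```python
from collections import Counter
from itertools import chain

def build_vocab(documents):
    """Map each distinct word to an integer id, and build the inverse map."""
    counts = Counter(chain.from_iterable(documents))  # word -> frequency
    # Sorted ids are an assumption of this sketch; Spark's zipWithIndex
    # numbers items by partition order instead.
    vocab = {word: idx for idx, word in enumerate(sorted(counts))}
    id2word = {idx: word for word, idx in vocab.items()}
    return vocab, id2word

docs = [["class", "thesis", "field", "work"],
        ["course", "experience", "research", "thesis"]]
vocab, id2word = build_vocab(docs)
```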

In [237]:
from collections import defaultdict
def helperfunction(element):
    d = defaultdict(int)
    for k in element:
        d[vocab[k]] += 1
    return d.items()
documents = ldadatardd.map(lambda w: helperfunction(w))

In [238]:
documents.take(5)

[[(3660, 1), (4627, 1), (292, 1), (195, 1)],
 [(292, 1), (2404, 1), (1421, 1), (2645, 1)],
 [(105, 1), (1002, 1), (3788, 1), (4170, 1)],
 [(3777, 1), (2346, 1), (1421, 1), (4305, 1), (1656, 1), (4216, 1)],
 [(2242, 1), (1558, 1), (3110, 1)]]
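
Each document above is a list of `(word_id, count)` pairs, the bag-of-words format that gensim expects. A minimal plain-Python equivalent of `helperfunction` follows. The name `to_bow` and the three-word `toy_vocab` are hypothetical, and the output is sorted by id only to make it deterministic.

```python
from collections import Counter

def to_bow(words, vocab):
    """Convert a list of words into sorted (word_id, count) pairs."""
    counts = Counter(vocab[w] for w in words)
    return sorted(counts.items())

toy_vocab = {"lecture": 0, "exam": 1, "section": 2}  # hypothetical ids
bow = to_bow(["section", "lecture", "exam", "lecture"], toy_vocab)
# A repeated word ("lecture") yields a count of 2 for its id.
```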

In [239]:
corpus=documents.collect()

In [240]:
import gensim

In [241]:
lda2 = gensim.models.ldamodel.LdaModel(corpus = corpus, num_topics = 2, id2word=id2word, chunksize=200, passes = 10)

Below, we print the topics found by LDA.

In [243]:
lda2.print_topics()

[u'0.049*course + 0.040*lot + 0.039*material + 0.033*time + 0.026*way + 0.025*work + 0.017*experience + 0.015*paper + 0.014*field + 0.014*person',
 u'0.171*class + 0.079*course + 0.027*student + 0.026*professor + 0.026*lecture + 0.025*reading + 0.024*history + 0.022*topic + 0.021*discussion + 0.016*fun']

The first topic (let us call this Topic 0) includes the combination of words: course, lot, material, time, way, work, experience, paper, field, and person. 


The second topic (let us call this Topic 1) includes the combination of words: class, course, student, professor, lecture, reading, history, topic, discussion, and fun.


Topic 1 seems to capture the more interactive, qualitative, personable aspects of a course, with key terms including student, professor, discussion, lecture, and fun. Topic 0, by contrast, seems to capture the more solitary, logistical, factual aspects, with key terms including material, time, work, experience, paper, and field. Note that both topics include "course" as a key term, but only Topic 1 includes "class". Although the two words are often used interchangeably, "class" arguably connotes a more personal, interactive experience, whereas "course" is more administrative and logistical and more likely to serve as an umbrella term for everything from everyday class meetings to homework.


To further evaluate our initial hypothesis that course reviews split along two topics (interactive, qualitative, personable aspects vs. solitary, logistical aspects), we print the words of some sentences along with each sentence's probability of belonging to Topic 0 and Topic 1, as a qualitative check that our topics are reasonable.

In [249]:
for bow in corpus[0:1000:100]:
    print bow
    print lda2.get_document_topics(bow)
    print " ".join([id2word[e[0]] for e in bow])
    print "=========================================="

[(3660, 1), (4627, 1), (292, 1), (195, 1)]
[(0, 0.69985952214752012), (1, 0.30014047785247994)]
field class thesis work
[(3425, 1), (1421, 1), (4111, 1), (4627, 1), (2477, 1), (4319, 1), (1567, 1)]
[(0, 0.33855889890284857), (1, 0.66144110109715148)]
depth course grad class history student question
[(3178, 1), (195, 1), (254, 1)]
[(0, 0.62540422226333714), (1, 0.37459577773666292)]
bit work lab
[(1113, 1), (4515, 1), (405, 1), (729, 1)]
[(0, 0.29945806159612509), (1, 0.70054193840387502)]
concept commitment moment life
[(3846, 1), (3110, 1)]
[(0, 0.49992283880028238), (1, 0.50007716119971768)]
lecture person
[(2370, 1), (3846, 2), (3884, 1), (1805, 1), (792, 1), (1113, 1)]
[(0, 0.19442318510549644), (1, 0.8055768148945035)]
section lecture exam reading drawback concept
[(4627, 1), (551, 1)]
[(0, 0.16684756233003178), (1, 0.83315243766996827)]
class entertaining
[(280, 1), (1421, 1), (1551, 1)]
[(0, 0.84685027552522507), (1, 0.15314972447477493)]
order course thing
[(1465, 1), (538, 1),

The "sentences" (or bags of words) which have a much greater probability of belonging to Topic 0 include:
- field class thesis work
- bit work lab
- order course thing


The words in these sentences relate to the more impersonal, logistical aspects of a course. Specifically, "field", "thesis", "work", and "order" describe inflexible, solitary, requirement-driven aspects of a course, while "bit" and "thing" are vaguer but nevertheless imply a degree of impartiality and detachment.



The "sentences" which have a much greater probability of belonging to Topic 1 include:
- depth course grad class history student question
- concept commitment moment life
- section lecture exam reading drawback concept
- class entertaining
- evolution class pre cinema

Words in these sentences that are absent from the previous cluster of sentences, and that stand out as implying more creative, interactive, person-to-person aspects of a course, include "depth", "history", "question", "concept", "commitment", "moment", "life", "drawback", "lecture", and "entertaining".

The "sentences" which have more equal probabilities of belonging to Topic 0 or 1 include:
- lecture person
- resource professor man kind topic

The second of these mixes words implying more logistical aspects ("resource", "topic") with words implying more interactive, creative aspects ("professor", "kind").


Of course, words such as "lecture" can belong to either topic since lecture is both a logistical, required part of most courses and an engaging, potentially interactive opportunity for students to learn from professors. From our analysis of the topic probabilities and bag-of-words above, however, there appears to be evidence to support our initial hypothesis that course reviews are split along two topics: Topic 0, which includes more solitary, logistical aspects and Topic 1, which includes more interactive, qualitative, personable aspects of a course.
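
The three-way grouping used in this discussion (mostly Topic 0, mostly Topic 1, roughly equal) can be made explicit with a small helper. This is only a sketch: `label_document` and its `margin` of 0.2 are assumptions chosen for illustration, not values taken from the analysis. The probabilities are rounded from the output printed earlier.

```python
def label_document(topic_probs, margin=0.2):
    """topic_probs: list of (topic_id, probability) pairs, in the shape
    returned by get_document_topics. Returns the leading topic id, or None
    when the top two probabilities differ by less than `margin`
    (an arbitrary cutoff chosen for this sketch)."""
    ranked = sorted(topic_probs, key=lambda tp: tp[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None
    return ranked[0][0]

# Probabilities rounded from the printed output.
examples = {
    "field class thesis work": [(0, 0.700), (1, 0.300)],
    "lecture person": [(0, 0.500), (1, 0.500)],
    "class entertaining": [(0, 0.167), (1, 0.833)],
}
labels = {text: label_document(probs) for text, probs in examples.items()}
```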

TO DOs:
- We can consider doing "verbs" (use TextBlob)
- Detect "not" before adjectives, this shouldn't be too difficult
- Test before pushing
- Look at differences across departments (Jesse/Andrew)
