# Curious Comments 
![commentpic](comment_structure.png)

Comments are a critical part of any review. We now analyze the comments in our Q data: we judge their predictive power and examine the role they play in a course's Q rating.

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import os
import math
from itertools import chain
import ast

We begin by creating a dataframe of comments. We randomly subsample 10 comments per course and only consider courses with 10 or more comments. Our dataframe consists of three columns: Comment, Course, and Overall Positive Rating. A course is defined to have an 'Overall Positive Rating' if it has been given more positive ratings than negative ratings across all the semesters in which it was rated.

In [2]:
MAX_PER_COURSE_SAMPLES = 10
#print tempdf[tempdf['C_Number'] == 'SOC-STD 98nw'].Comments.tolist()[0]
#tempdf.groupby('C_Number').groups

In [3]:
bigdf=pd.read_csv("bigdf.csv")
bigdf = bigdf.reset_index(drop=True)  # reset_index returns a new frame, so assign it back
tempdf = bigdf[['Course','Comments', 'Positive']].dropna()
subdf = pd.DataFrame()
for course in tempdf.groupby('Course').groups:
    allCourseComments = []
    allSemesterComments = tempdf[tempdf.Course == course].Comments.tolist()
    allSemesterPositives = tempdf[tempdf.Course == course].Positive.tolist()
    for eachSemesterComments in allSemesterComments:
        allCourseComments = allCourseComments + ast.literal_eval(eachSemesterComments)
    if len(allCourseComments) >= MAX_PER_COURSE_SAMPLES:
        sample = np.random.choice(allCourseComments, MAX_PER_COURSE_SAMPLES, replace=False)
        coursedf = pd.DataFrame()
        coursedf['Comment'] = sample
        coursedf['Course'] = course
        #citation: http://stackoverflow.com/questions/1518522/python-most-common-element-in-a-list
        coursedf['Overall Positive Rating'] = max(set(allSemesterPositives), key=allSemesterPositives.count)
        subdf = pd.concat([subdf,coursedf]).reset_index(drop=True)
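
The sampling and labelling logic of the cell above can be illustrated on toy data without pandas. This is only a minimal sketch under stated assumptions, not the notebook's code: the helper name `sample_course` and the toy comments are made up for illustration, and `random.sample` and `collections.Counter` stand in for `np.random.choice(..., replace=False)` and the `max(set(...), key=...count)` idiom.

```python
import ast
import random
from collections import Counter

MAX_PER_COURSE_SAMPLES = 10

def sample_course(semester_comment_strings, semester_positives, rng):
    """Return (sampled comments, majority label), or None if too few comments.

    semester_comment_strings: one string per semester, each the repr of a
    list of comments (the shape the Comments column is assumed to have).
    """
    all_comments = []
    for raw in semester_comment_strings:
        # literal_eval safely turns "['a', 'b']" back into a Python list
        all_comments.extend(ast.literal_eval(raw))
    if len(all_comments) < MAX_PER_COURSE_SAMPLES:
        return None
    # sample without replacement, so no comment is picked twice
    sample = rng.sample(all_comments, MAX_PER_COURSE_SAMPLES)
    # most common value of the per-semester Positive flag
    majority = Counter(semester_positives).most_common(1)[0][0]
    return sample, majority

# Hypothetical toy course: three semesters, 12 comments in total
toy_comments = [repr(["comment %d" % i for i in range(0, 5)]),
                repr(["comment %d" % i for i in range(5, 9)]),
                repr(["comment %d" % i for i in range(9, 12)])]
toy_positives = [True, False, True]

rng = random.Random(0)
result = sample_course(toy_comments, toy_positives, rng)
too_small = sample_course([repr(["only one"])], [True], rng)
```

Here `result` holds ten distinct comments together with the label `True`, while `too_small` is `None` because that course falls below the ten-comment threshold.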

In [4]:
subdf.head(14)

| | Comment | Course | Overall Positive Rating |
|---|---|---|---|
| 0 | The opportunity to have a one-on-one tutorial ... | RELIGION-98a | True |
| 1 | This is an invaluable opportunity, make it yours. | RELIGION-98a | True |
| 2 | Do your reading, go to class prepared, and you... | RELIGION-98a | True |
| 3 | Excellent readings, great leader, really enjoy... | RELIGION-98a | True |
| 4 | It's a thorough, informative way to delve into... | RELIGION-98a | True |
| 5 | Well, all religion concentrators need to take ... | RELIGION-98a | True |
| 6 | Be prepared to let your intellectual horizons ... | RELIGION-98a | True |
| 7 | Though it was created for me, I think this syl... | RELIGION-98a | True |
| 8 | Can learn a lot studying under Chip's tutelage. | RELIGION-98a | True |
| 9 | If you get a small tutorial like I did, be pre... | RELIGION-98a | True |


Next we set up Spark so that we can convert our comments dataframe, subdf, to a Spark dataframe for text analysis.

In [5]:
#setup spark
import os
import findspark
findspark.init()
print findspark.find()
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local')
    .setAppName('pyspark')
    .set("spark.executor.memory", "5g"))
sc = pyspark.SparkContext(conf=conf)
import sys
rdd = sc.parallelize(xrange(10),10)
rdd.map(lambda x: sys.version).collect()
sys.version
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

/home/vagrant/spark


In [6]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
punctuation = list('.,;:!?()[]{}`\'\"@#$^&*+-|=~_')  # escape the apostrophe so it is actually part of the list
from sklearn.feature_extraction import text 
stopwords=text.ENGLISH_STOP_WORDS
import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")

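We write a get_parts function that parses each comment and extracts, sentence by sentence, the lemmatized nouns and adjectives ("descriptives"), dropping stopwords, single-character tokens, and tokens that begin or end with punctuation. Only sentences that contain at least one noun and at least one adjective are kept.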

In [7]:
def get_parts(thetext):
    thetext=re.sub(regex1, ' ', thetext)
    thetext=re.sub(regex2, ' ', thetext)
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            #print token
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2
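
The filtering step inside get_parts can be seen in isolation on hand-tagged tokens. The sketch below is an illustration under assumptions rather than the notebook's implementation: tokens are simplified to hypothetical `(word, tag, lemma)` triples (pattern's parsed tokens carry more fields, with the tag at index 1 and the lemma at index 4), the tags were assigned by hand, and `STOPWORDS` is a tiny stand-in for sklearn's `ENGLISH_STOP_WORDS`.

```python
NOUN_TAGS = {'NN', 'NNS'}
ADJ_TAGS = {'JJ', 'JJR', 'JJS'}
PUNCTUATION = set('.,;:!?()[]{}`\'"@#$^&*+-|=~_')
STOPWORDS = {'the', 'a', 'is', 'and', 'this', 'other'}  # hypothetical stand-in

def keep(lemma):
    """Drop stopwords, lemmas shorter than 2 characters, and lemmas
    that start or end with a punctuation character."""
    return (len(lemma) > 1 and lemma not in STOPWORDS
            and lemma[0] not in PUNCTUATION and lemma[-1] not in PUNCTUATION)

def split_parts(tagged_sentences):
    """Return (nouns, descriptives), one word list per retained sentence."""
    nouns, descriptives = [], []
    for sentence in tagged_sentences:
        n = [lemma for _, tag, lemma in sentence if tag in NOUN_TAGS and keep(lemma)]
        d = [lemma for _, tag, lemma in sentence if tag in ADJ_TAGS and keep(lemma)]
        # keep a sentence only if it has at least one noun AND one adjective
        if n and d:
            nouns.append(n)
            descriptives.append(d)
    return nouns, descriptives

# Hand-tagged toy input (tags assigned manually for illustration)
tagged = [
    [('Excellent', 'JJ', 'excellent'), ('readings', 'NNS', 'reading'),
     (',', ',', ','), ('great', 'JJ', 'great'), ('leader', 'NN', 'leader')],
    [('Do', 'VB', 'do'), ('your', 'PRP$', 'your'), ('reading', 'NN', 'reading')],
]
parts = split_parts(tagged)
```

The second toy sentence has a noun but no adjective, so it is dropped entirely. This is the same reason a comment can come back from get_parts as `([], [])`.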

In [8]:
subdf = sqlsc.createDataFrame(subdf)
subdf.show(5)

+--------------------+------------+-----------------------+
|             Comment|      Course|Overall Positive Rating|
+--------------------+------------+-----------------------+
|The opportunity t...|RELIGION-98a|                   true|
|This is an invalu...|RELIGION-98a|                   true|
|Do your reading, ...|RELIGION-98a|                   true|
|Excellent reading...|RELIGION-98a|                   true|
|It's a thorough, ...|RELIGION-98a|                   true|
+--------------------+------------+-----------------------+
only showing top 5 rows



In [9]:
comment_parts = subdf.rdd.map(lambda r: get_parts(r.Comment))
comment_parts.take(5)

[([[u'opportunity', u'tutorial', u'individual']],
  [[u'one-on-one', u'incredible']]),
 ([[u'opportunity']], [[u'invaluable']]),
 ([], []),
 ([[u'reading', u'leader', u'reading', u'material']],
  [[u'excellent', u'great', u'enjoyed']]),
 ([[u'way', u'dialogue', u'religion', u'science', u'regard', u'evolution']],
  [[u'thorough', u'informative']])]

In [10]:
%%time
parsedcomments=comment_parts.collect()

CPU times: user 158 ms, sys: 21.1 ms, total: 179 ms
Wall time: 55.3 s


We begin our text analysis by fitting a latent Dirichlet allocation (LDA) topic model to the nouns in the comments.

In [11]:
[e[0] for e in parsedcomments[:3]]

[[[u'opportunity', u'tutorial', u'individual']], [[u'opportunity']], []]

In [12]:
ldadatardd=sc.parallelize([ele[0] for ele in parsedcomments]).flatMap(lambda l: l)
ldadatardd.cache()
ldadatardd.take(5)

[[u'opportunity', u'tutorial', u'individual'],
 [u'opportunity'],
 [u'reading', u'leader', u'reading', u'material'],
 [u'way', u'dialogue', u'religion', u'science', u'regard', u'evolution'],
 [u'non-concentrator', u'class', u'religion']]

In [13]:
ldadatardd.flatMap(lambda word: word).take(5)

[u'opportunity', u'tutorial', u'individual', u'opportunity', u'reading']
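
The two flatMap calls above each remove one level of nesting. A plain-Python sketch of the same idea, using `itertools.chain` in place of Spark and the first three parsed comments copied from the output above (with the `u` prefixes dropped), might look like this:

```python
from itertools import chain

# Each parsed comment is a (nouns, descriptives) pair holding one word
# list per retained sentence.
parsed = [
    ([['opportunity', 'tutorial', 'individual']], [['one-on-one', 'incredible']]),
    ([['opportunity']], [['invaluable']]),
    ([], []),
]

# Step 1: keep only the nouns and flatten comments into sentences
# (the role of flatMap(lambda l: l) on the RDD).
sentences = list(chain.from_iterable(nouns for nouns, _ in parsed))

# Step 2: flatten sentences into individual words
# (the role of flatMap(lambda word: word)).
words = list(chain.from_iterable(sentences))
```

After step 1 each sentence is its own "document" for LDA, and the empty third comment contributes nothing.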

In [14]:
vocabtups = (ldadatardd.flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
             .map(lambda (x,y): x)
             .zipWithIndex()
).cache()

In [15]:
vocab=vocabtups.collectAsMap()
id2word=vocabtups.map(lambda (x,y): (y,x)).collectAsMap()

In [16]:
id2word[0], vocab.keys()[5], vocab[vocab.keys()[5]]

(u'req', u'dynasty', 5)

In [17]:
len(vocab.keys())

3898
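
For readers without a Spark installation, the vocabulary construction can be sketched in plain Python. This is an approximation under assumptions, not the notebook's code: `Counter` plays the role of the map and reduceByKey steps, and the ids are assigned in sorted order so that they are reproducible, whereas the order produced by zipWithIndex depends on how Spark partitions the data.

```python
from collections import Counter

# Toy documents taken from the first rows of ldadatardd shown above
docs = [['opportunity', 'tutorial', 'individual'],
        ['opportunity'],
        ['reading', 'leader', 'reading', 'material']]

# Count every word across all documents (like map + reduceByKey)
counts = Counter(word for doc in docs for word in doc)

# Assign each distinct word an integer id, then build the reverse map
vocab = {word: idx for idx, word in enumerate(sorted(counts))}
id2word = {idx: word for word, idx in vocab.items()}
```

`vocab` maps words to ids and `id2word` maps ids back to words. The latter is what gensim needs in order to print readable topics.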

In [18]:
from collections import defaultdict
def helperfunction(element):
    d = defaultdict(int)
    for k in element:
        d[vocab[k]] += 1
    return d.items()
documents = ldadatardd.map(lambda w: helperfunction(w))

In [19]:
documents.take(5)

[[(2464, 1), (682, 1), (781, 1)],
 [(682, 1)],
 [(2003, 1), (3742, 1), (1526, 2)],
 [(2085, 1), (1233, 1), (3850, 1), (1841, 1), (981, 1), (858, 1)],
 [(3360, 1), (981, 1), (3423, 1)]]
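
Each document above is a bag of words: a list of (word id, count) pairs, so a word that appears twice in a sentence gets a count of 2. A compact sketch of the same conversion with `collections.Counter` follows; the helper name `to_bow` and the small `vocab` are hypothetical and only serve the example.

```python
from collections import Counter

# Hypothetical word -> id mapping for the toy example
vocab = {'individual': 0, 'leader': 1, 'material': 2,
         'opportunity': 3, 'reading': 4, 'tutorial': 5}

def to_bow(doc, vocab):
    """Turn a list of words into sorted (word_id, count) pairs."""
    return sorted(Counter(vocab[word] for word in doc).items())

bow = to_bow(['reading', 'leader', 'reading', 'material'], vocab)
# bow == [(1, 1), (2, 1), (4, 2)]
```

The pair `(4, 2)` records that "reading" occurs twice, which mirrors the count of 2 in the third document of the output above.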

In [20]:
corpus=documents.collect()

In [21]:
import gensim

In [22]:
lda2 = gensim.models.ldamodel.LdaModel(corpus = corpus, num_topics = 2, id2word=id2word, chunksize=200, passes = 10)

Below, we print the topics found by LDA.

In [23]:
lda2.print_topics()

[(0,
  u'0.129*course + 0.046*material + 0.026*lecture + 0.025*topic + 0.023*professor + 0.015*way + 0.014*history + 0.013*year + 0.012*problem + 0.011*question'),
 (1,
  u'0.179*class + 0.043*lot + 0.031*student + 0.029*time + 0.025*work + 0.022*reading + 0.017*discussion + 0.016*paper + 0.015*fun + 0.014*research')]
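
Each topic is printed as a string of `weight*word` terms, where the weight is the probability of that word under the topic. To work with these programmatically, a small parser can help. The sketch below is a convenience written for this output format, with the hypothetical helper name `parse_topic`; it also strips double quotes around the word, since some gensim versions print the terms that way.

```python
def parse_topic(topic_string):
    """Split '0.129*course + 0.046*material' into [(word, weight), ...]."""
    pairs = []
    for term in topic_string.split(' + '):
        weight, word = term.split('*', 1)
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs

# Topic strings copied from the print_topics() output above
topics = {
    0: ('0.129*course + 0.046*material + 0.026*lecture + 0.025*topic + '
        '0.023*professor + 0.015*way + 0.014*history + 0.013*year + '
        '0.012*problem + 0.011*question'),
    1: ('0.179*class + 0.043*lot + 0.031*student + 0.029*time + '
        '0.025*work + 0.022*reading + 0.017*discussion + 0.016*paper + '
        '0.015*fun + 0.014*research'),
}

parsed = {topic_id: parse_topic(s) for topic_id, s in topics.items()}
top_words = {topic_id: [word for word, _ in pairs]
             for topic_id, pairs in parsed.items()}
```

With the topics in this form, it is easy to check which topic a given word belongs to or to compare the weight of a word across topics.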

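LDA is randomly initialized, so the top words and their weights change from run to run. The word lists discussed below were evidently taken from a different run than the output printed above, which is why the two do not match word for word.

The first topic (let us call this Topic 0) includes the combination of words: course, lot, material, time, way, work, experience, paper, field, and person.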


The second topic (let us call this Topic 1) includes the combination of words: class, course, student, professor, lecture, reading, history, topic, discussion, and fun.


Topic 1 seems to encompass the more interactive, qualitative, personable aspects of the course, with key terms including student, professor, discussion, lecture, and fun. Topic 0, by contrast, seems to encompass the more solitary, logistical, factual aspects of the course, with key terms including material, time, work, experience, paper, and field. It is worth noticing that both topics include "course" as a key term but only Topic 1 includes "class". While "course" and "class" are often used interchangeably, "class" arguably connotes a more personal, interactive experience than "course", which is more administrative and logistical and more likely to be used as an umbrella term for everything from everyday class to homework.


To further evaluate our initial hypothesis that course reviews are split along two topics (interactive, qualitative, personable aspects vs. solitary, logistical aspects), we output the words of some sentences, along with the probability of each sentence belonging to Topic 0 and Topic 1, to check qualitatively that our topics are reasonable and supported.

In [24]:
for bow in corpus[0:1000:100]:
    print bow
    print lda2.get_document_topics(bow)
    print " ".join([id2word[e[0]] for e in bow])
    print "=========================================="

[(2464, 1), (682, 1), (781, 1)]
[(0, 0.37402945436085688), (1, 0.62597054563914312)]
individual opportunity tutorial
[(929, 2), (2628, 1), (390, 2), (3718, 1), (2369, 1), (2775, 1), (3228, 1)]
[(0, 0.70656544083864192), (1, 0.29343455916135819)]
diet edge nutrition recommendation finding guideline math
[(3225, 1), (1563, 1), (2575, 1)]
[(0, 0.62695519457571203), (1, 0.37304480542428808)]
lecture learning homework
[(2563, 1)]
[(0, 0.74986962180603833), (1, 0.25013037819396161)]
instructor
[(2297, 1), (514, 1), (459, 1), (1885, 1)]
[(0, 0.8911380396522296), (1, 0.1088619603477704)]
introduction linguist syntax field
[(1368, 1), (3433, 1), (1475, 1), (2741, 1)]
[(0, 0.32013442627167082), (1, 0.67986557372832912)]
orgo undergrad enthusiasm step
[(3360, 1), (2578, 1)]
[(0, 0.16667798047327986), (1, 0.83332201952672014)]
class fun
[(2297, 1), (2851, 1), (1309, 1)]
[(0, 0.62510619781304499), (1, 0.37489380218695495)]
introduction department language
[(403, 1)]
[(0, 0.25015581326860203), (1, 0

The "sentences" (or bags of words) that have a much greater probability of belonging to Topic 0 include:
- field class thesis work
- bit work lab
- order course thing


The words in these sentences relate to more impersonal, logistical aspects of a course. Specifically, "field", "thesis", "work", and "order" describe inflexible, solitary, requirement-driven aspects of a course, while "bit" and "thing" are vaguer but nevertheless imply a degree of impartiality and detachment.



The "sentences" that have a much greater probability of belonging to Topic 1 include:
- depth course grad class history student question
- concept commitment moment life
- section lecture exam reading drawback concept
- class entertaining
- evolution class pre cinema

Words in these sentences that do not appear in the previous cluster of sentences and that stand out as implying more creative, interactive, person-to-person aspects of a course include "depth", "history", "question", "concept", "commitment", "moment", "life", "drawback", "lecture", and "entertaining".

The "sentences" which have more equal probabilities of belonging to Topic 0 or 1 include:
- lecture person
- resource professor man kind topic

We can observe words implying more logistical aspects ("resource", "topic") and more interactive, creative aspects ("professor", "kind"). 
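
The three groups above (strongly Topic 0, strongly Topic 1, and roughly equal) can be made explicit with a small rule applied to the topic distribution of each document. The sketch below is one possible way to do this, not something the analysis above relied on: the function name `label_document` and the `margin` of 0.3 are assumptions chosen for illustration, and the probabilities are copied from the printed output of the previous cell.

```python
def label_document(topic_probs, margin=0.3):
    """Return the dominant topic id, or 'mixed' if the two most likely
    topics are closer together than `margin`."""
    ranked = sorted(topic_probs, key=lambda pair: pair[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return 'mixed'
    return ranked[0][0]

# (topic id, probability) pairs as returned by get_document_topics
examples = {
    'introduction linguist syntax field':
        [(0, 0.8911380396522296), (1, 0.1088619603477704)],
    'class fun':
        [(0, 0.16667798047327986), (1, 0.83332201952672014)],
    'lecture learning homework':
        [(0, 0.62695519457571203), (1, 0.37304480542428808)],
}

labels = {text: label_document(probs) for text, probs in examples.items()}
```

With this margin, the first sentence is labelled Topic 0, the second Topic 1, and the third "mixed". A different margin would move the boundary, so the value should be treated as a tuning choice.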


Of course, words such as "lecture" can belong to either topic since lecture is both a logistical, required part of most courses and an engaging, potentially interactive opportunity for students to learn from professors. From our analysis of the topic probabilities and bag-of-words above, however, there appears to be evidence to support our initial hypothesis that course reviews are split along two topics: Topic 0, which includes more solitary, logistical aspects and Topic 1, which includes more interactive, qualitative, personable aspects of a course.

TO DOs:
- We can consider doing "verbs" (use TextBlob)
- Detect "not" before adjectives (this shouldn't be too difficult)
- Test before pushing
- Look at differences across departments (Jesse/Andrew)
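
As a starting point for the negation item in the list above, one simple approach is to look a few tokens back from each adjective for a negating word. The following is a rough sketch under assumptions, not a finished design: the names `NEGATORS` and `mark_negated_adjectives`, the window of two tokens, and the hand-tagged `(word, tag)` input are all hypothetical.

```python
NEGATORS = {'not', "n't", 'never', 'no'}
ADJ_TAGS = {'JJ', 'JJR', 'JJS'}

def mark_negated_adjectives(tagged_tokens, window=2):
    """Return the adjectives in a sentence, prefixing 'not_' to those
    that follow a negator within `window` tokens."""
    marked = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag in ADJ_TAGS:
            preceding = [w.lower() for w, _ in tagged_tokens[max(0, i - window):i]]
            negated = any(w in NEGATORS for w in preceding)
            marked.append(('not_' if negated else '') + word.lower())
    return marked

# Hand-tagged toy sentence (tags assigned manually for illustration)
tagged = [('This', 'DT'), ('class', 'NN'), ('is', 'VBZ'), ('not', 'RB'),
          ('very', 'RB'), ('difficult', 'JJ'), ('but', 'CC'),
          ('lectures', 'NNS'), ('are', 'VBP'), ('engaging', 'JJ')]

marked = mark_negated_adjectives(tagged)
# marked == ['not_difficult', 'engaging']
```

Treating "not_difficult" as its own token would let a later model distinguish it from "difficult". A fixed window is crude, though, and will miss negations that sit further away from the adjective.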
