# Yopine Natural Content Categories

What are the natural content 'Categories' that have been created via app usage?

Today, all content in the ‘Explore’ view is presented simply in chronological order.  Users shouldn’t have to sift through content that is of no interest to them.  Everything we need to know is contained in the questions and the responses.

The goal is to let Machine Learning do that work.

There is one table in the Yopine schema that contains all relevant data:  POLL
https://www.dropbox.com/s/6l2o2mitruef5x7/PollTable.png?dl=0

There are two columns in the POLL table that contain all relevant data:  pollQuestion & pollAnswer
https://www.dropbox.com/s/xwjin2qehkuoixu/PollTableCols.png?dl=0

Here is the data file:
https://www.dropbox.com/s/iql221nnoyk8ntf/poll3.csv?dl=0

There are two places from which to draw the data that might contain the natural categories:
1.  pollQuestion is nice and easy: it's just a string of natural language.
    In the example, from the string 'Fitness: What's your jam?' we want 'Fitness'.
2.  pollAnswer requires some crafty parsing, as the relevant data is buried in each record in the field "answerText".
    In the example dictionary we want 'Weights', 'Yoga', 'Running', 'Classes' and 'Other ->'.
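
A minimal sketch of both extractions. It assumes that a question's category is whatever precedes the first colon, and that pollAnswer is a JSON list of objects that each carry an "answerText" key. The helper names `question_category` and `answer_texts`, and the vote counts in the sample record, are hypothetical and only there for illustration.

```python
import json


def question_category(poll_question):
    """Return the text before the first colon, or None if there is no colon."""
    prefix, sep, _ = poll_question.partition(':')
    return prefix.strip() if sep else None


def answer_texts(poll_answer_json):
    """Return the answerText of every answer in a pollAnswer JSON string."""
    return [answer['answerText'] for answer in json.loads(poll_answer_json)]


# Hypothetical record shaped like the example above (vote counts are made up)
sample_question = "Fitness: What's your jam?"
sample_answer = json.dumps([
    {"answerText": "Weights", "voteCount": 2},
    {"answerText": "Yoga", "voteCount": 1},
    {"answerText": "Running", "voteCount": 3},
    {"answerText": "Classes", "voteCount": 0},
    {"answerText": "Other ->", "voteCount": 0},
])

print(question_category(sample_question))
print(answer_texts(sample_answer))
```

Questions without a colon, such as 'Where should we eat?', return None here, so they would still need a model to place them in a category.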

#### We might also like (perhaps a future endeavor) to add weight to "answerText" by applying its corresponding "voteCount" integer.
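
One possible way to apply that weighting, sketched under the assumption that every answer object carries an integer "voteCount" next to its "answerText". The function name `weighted_word_counts` and the sample data are hypothetical. Each word in an answer is credited with that answer's vote count.

```python
import json
from collections import Counter


def weighted_word_counts(poll_answer_json):
    """Sum voteCount per lower-cased word across all answers of one poll."""
    counts = Counter()
    for answer in json.loads(poll_answer_json):
        votes = answer.get("voteCount", 0)  # treat a missing count as zero votes
        for word in answer["answerText"].lower().split():
            counts[word] += votes
    return counts


# Made-up poll for illustration
sample = json.dumps([
    {"answerText": "Taco Bell", "voteCount": 3},
    {"answerText": "Taco truck", "voteCount": 1},
    {"answerText": "Otto", "voteCount": 0},
])

weights = weighted_word_counts(sample)
print(weights.most_common())
```

With this choice an answer that nobody voted for contributes a weight of zero. If unvoted answers should still count, adding 1 to every vote count is one simple alternative.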

#### The first step is to create the dataframe

In [3]:
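from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
import numpy as np

# Load the POLL export into a dataframe.
# Assumes poll3.csv (linked above) has been downloaded next to this notebook.
data = pd.read_csv('poll3.csv')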


Brennan reviewed and organized the DF into groups (categories) - poll3_grouped.csv 
https://www.dropbox.com/s/kv7mjndpxa291x3/poll3_grouped.csv?dl=0

He added the 'a _groups' column to it and organized the rows into these groups: animals, anyone, books, brands, business, celebrity, class, coffee, contest, dancing, donate, drinks, fashion, favorites, food, games, gear, health, hobbies, holidays, home, jobs, joke, love, meetup, mood, movies, music, news, outdoors, party, places, plans, politics, religion, rides, school, shopping, smoke, social, sports, startups, tech, test, travel, tv, weather;
plus 1657 rows left uncategorized.

* I need to get the code that performed the above

#### LDA for Natural Language Processing - I want to create a repeatable model that, given any pollAnswer or pollQuestion

In [3]:
import lda  # not able to import LDA (it is a separate package from scikit-learn)
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
sentences = ["my name is sinan", "Im gary", "This is gary and sinan"]
# Instantiate a count vectorizer with two additional parameters:
# drop English stop words and keep 1- to 3-word n-grams
vect = CountVectorizer(stop_words='english', ngram_range=(1, 3))
sentences_train = vect.fit_transform(sentences)

# Instantiate an LDA model
model = lda.LDA(n_topics=10, n_iter=500)
model.fit(sentences_train)  # Fit the model
n_top_words = 10
topic_word = model.topic_word_
# get_feature_names_out() replaces get_feature_names() in recent scikit-learn
feature_names = np.array(vect.get_feature_names_out())
for i, topic_dist in enumerate(topic_word):
    # argsort is ascending, so slice from the end to get the n_top_words heaviest words
    topic_words = feature_names[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    print('Topic {}: {}'.format(i, ', '.join(topic_words)))



Topic 0: sinan, im gary, im, gary sinan, gary
Topic 1: gary sinan, sinan, im gary, im, gary
Topic 2: gary, sinan, im gary, im, gary sinan
Topic 3: sinan, im gary, im, gary sinan, gary
Topic 4: sinan, im gary, im, gary sinan, gary
Topic 5: im, sinan, im gary, gary sinan, gary
Topic 6: gary, sinan, im gary, im, gary sinan
Topic 7: im gary, sinan, im, gary sinan, gary
Topic 8: sinan, im gary, im, gary sinan, gary
Topic 9: sinan, im gary, im, gary sinan, gary


#### The next (future) process to be applied is k-means and TfidfVectorizer

In [6]:
# import TFIDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# pollAnswer field is a JSON object
# `data` is the POLL dataframe loaded from poll3.csv
import json

# code to look at one single record:
# read the Sony pollAnswer into an object called 'sony'
# sony = json.loads(data.pollAnswer[3881])

# Print each question followed by a tally of votes per answer
for i in range(len(data)):
    poll = json.loads(data.pollAnswer[i])
    print(data.pollQuestion[i])
    poll_tally = {}
    for answer in poll:
        # default to an empty list so an answer without 'votes' counts as 0
        # instead of reusing the value from the previous answer
        poll_tally[answer['answerText']] = len(answer.get('votes', []))
    print(poll_tally)

# Extract the sentence lists from data: one string per poll
questions = data.pollQuestion
answers = [' '.join(b['answerText'] for b in json.loads(a)) for a in data.pollAnswer]
# for q, a in zip(questions, answers):
#     print(q, a)

Where should we eat?
{u'morandimotandi': 0, u'otto': 1}
Where should we meet?
{u'tcd': 1, u'taco bell': 0}
Servira esto?
{u'si ca\xf1on': 1, u'nada q ver': 0}
Where should we eat?
{u'test': 1}
Where should we eat?
{u'fastidious funnel cakes ': 0, u"mopey's moon pies": 1}
Where should we meet?
{u'half way, right at the borderline.': 5, u'in the middle ': 3}
What should we do tonight?
{u'hike': 0, u'bike ride': 0}
Will I get out of grand jury duty today?
{u'yes': 1, u'no': 1}
Best Mother's Day brunch spot in NYC?  In DC?
{u'Bonaparte - DC': 1, u'Balthazar - NYC': 1, u'Balthazar is the bomb in NYC. ': 1}
Where should we do happy hour?
{u'\u3147\u3147': 0, u'\u3147': 0}
Where should I go for vacation?
{u'\u3147\u3139': 0, u'\u3147': 0}
Would you really shoot someone that broke into your house?  What if it were a coworker?
{u'I also shoot their bleeding corpse': 0, u'Yes. More shots if a coworker. ': 0}
What book should I read for book club?
{u'hhh': 0, u'50 Shades of Gary': 0}
Hawks or Win

K-means + TfidfVectorizer - to group similar data into sets (clustering)

In [None]:
# sentence_list=[data]
# Need to dump this into a new df

# sentence_list=['hello how are you', "I am doing great", "my name is abc"]

vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(answers)
# vectorized = vectorizer.fit_transform(questions)

km = KMeans(n_clusters=3, init='k-means++', n_init=10)
km.fit(vectorized)
print(vectorized)

# I now need to take the tf-idf scores and interpret them

km.labels_

# Stuck here - types I believe (non-string values in the data).
# Also note: calling fit_transform again refits the vocabulary, so the feature
# columns for `questions` would not match the ones km was trained on.
# transform() reuses the vocabulary learned from `answers`.
km.predict(vectorizer.transform(answers))
km.predict(vectorizer.transform(questions))

# Run KMeans
est = KMeans(n_clusters=3, init='random')
# need to remove non strings from data first

# `d` is not defined yet; it stands for the cleaned, vectorized data
est.fit(d)
y_kmeans = est.predict(d)

#### I would like some output of the form pollQuestion in -> dataframe table, and pollAnswer in -> dataframe table, that I can add to my slide presentation.
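
A rough sketch of what such a table could look like, assuming each poll has already been given a cluster label. The questions and answers are taken from the output above, but the `cluster` values are made up, and the column names `answers`, `cluster`, `polls` and `example` are hypothetical.

```python
import pandas as pd

# `cluster` stands in for km.labels_ from a fitted model (values are made up)
table = pd.DataFrame({
    "pollQuestion": [
        "Where should we eat?",
        "Where should we meet?",
        "What should we do tonight?",
    ],
    "answers": ["morandimotandi otto", "tcd taco bell", "hike bike ride"],
    "cluster": [0, 0, 1],
})

# One row per cluster: how many polls it holds and one example question
summary = (
    table.groupby("cluster")
    .agg(polls=("pollQuestion", "count"), example=("pollQuestion", "first"))
    .reset_index()
)

print(summary.to_string(index=False))
```

For slides, `summary.to_html()` produces a table that can be pasted into most presentation tools, or the printed text can be copied directly.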

## DAT4-FinalProj slides - https://www.dropbox.com/s/on072oij3p4jlno/DAT4-FinalProj.pptx?dl=0