# Introduction

**The purpose of this analysis is to find the products that are purchased most often in the HappyDB dataset.**  
These products are more likely to make people feel happy.

**I'll use KOKO, a rule-based entity extraction system, to perform the analysis.**  

KOKO allows users to specify conditions of desirable entities with a declarative language (see [KOKO syntax](#koko_syntax)).  
Those entities that obtain scores higher than a threshold are extracted.

KOKO is especially suitable for entity extraction with limited evidence in the corpus (e.g. extraction of cafe names within only one or a few blogs). 

**The whole analysis described in this notebook comprises the following steps:**  

- Data preprocessing: load HappyDB dataset and convert it to a text file as input to KOKO.
- KOKO introduction: briefly introduce the syntax and semantics of KOKO, with an example query.
- Entity extraction: write and evaluate a KOKO query that extracts product names from the dataset.

Let's get started!

# 1. Data preprocessing

First, let's load the data and take a look at the happy moments.  

## Load HappyDB

In [19]:
import pandas as pd

data = pd.read_csv('./cleaned_hm.csv')
data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1.1.1,hmid,wid,reflection_period,cleaned_hm,original_hm,modified
0,0,0,0,3833.0,31526.0,8962.0,24h,I found a silver coin from 1852 buried in the ...,I found a silver coin from 1852 buried in the ...,False
1,1,1,1,9336.0,37050.0,10252.0,24h,"This one is pretty minuscule, we had to go to ...","This one is pretty minuscule, we had to go to ...",False
2,2,2,2,18454.0,46196.0,586.0,24h,The word problem is never part of a happy pers...,The word aproblema is never part of a happy pe...,False
3,3,3,3,22355.0,50108.0,409.0,24h,I started studying bagavthgeetha .,I started studying bagavthgeetha .,True
4,4,4,4,27323.0,55093.0,731.0,24h,i go to old age home and service enjoy the mom...,i go to oldage home and service enjoy the moment.,False


Within the dataset, the most interesting part, which is also the input to our analysis, is the 'cleaned_hm' column.  

'cleaned_hm' stands for "cleaned happy moments". Let's take a look.

In [24]:
# None disables truncation (the older value -1 is rejected by recent pandas versions)
pd.set_option('display.max_colwidth', None)
data_clean = data['cleaned_hm']
data_clean.head()

0    I found a silver coin from 1852 buried in the sand of CapAo da Canoa beach, located in southern Brazil.
1    This one is pretty minuscule, we had to go to the laundromat today and obviously I had all three of my kids with me, my 5 year old, my 2 year old and my 5 month old. Last time we had to go they were all butts, my 5 month old wanted to scream the whole time and my two girls were all 

## Identify purchasing-related moments

Let's use Python to take a peek at all the happy moments related to purchasing behavior -- 
i.e., moments that contain keywords 'buy', 'bought' or 'purchase'.  

This helps us understand the patterns in which products appear in the happy moments, and facilitates writing conditions in later steps when we use KOKO.

In [25]:
num_moments = 100
assert num_moments < data_clean.size
keywords = ('buy', 'bought', 'purchase')
print('Happy moments involving purchasing:\n')
# Scan the first num_moments moments for any purchasing keyword
for i, moment in enumerate(data_clean.iloc[:num_moments]):
    if any(keyword in moment for keyword in keywords):
        print("{}: {}".format(i, moment))

Happy moments involving purchasing:

38: I waited patiantly for my income tax return and finally received it. And no i haven't went rogue with impulse purchases. I just did  the noble thing to do and take care of home, made a couple of investments and put the rest in my bank account.
45: I am happy when i purchase a new vehicle 
94: My husband purchased a new nan for me.
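
The loop above checks each moment one by one and is case-sensitive. As a rough alternative, the same filter can be sketched with pandas string methods. The sample sentences and the names `moments`, `pattern`, and `purchasing` are made up for illustration and are not part of HappyDB or the original notebook.

```python
import pandas as pd

# Toy stand-in for data['cleaned_hm']; these sentences are illustrative only.
moments = pd.Series([
    "I am happy when i purchase a new vehicle",
    "I went for a walk with my dog.",
    "My husband Bought me flowers.",
    "I was busy all day but finished my work.",
])

# \b anchors each keyword at the start of a word;
# case=False also catches capitalised forms such as "Bought".
pattern = r"\b(?:buy|bought|purchase)"
mask = moments.str.contains(pattern, case=False, regex=True, na=False)

# Keep the original index so the printed ids match the row positions.
purchasing = moments[mask]
for i, moment in purchasing.items():
    print("{}: {}".format(i, moment))
```

Passing `na=False` treats missing moments as non-matches, which avoids errors if the column contains empty cells.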


<a id='koko_syntax'></a>
# 2. Introduction to KOKO

Before using KOKO to extract products from HappyDB, I'll give a brief introduction of KOKO's query language.

**Here's an example query *Q_prod* that I'll use to extract products.**

In [26]:
with open('./products.koko', 'r') as query:
    print(query.read())

extract "Ngrams(1,1)" x from "./happydb.txt" if
		   ("bought a new" x {0.01}) or
		   ("bought a few" x {0.01}) or
		   ("bought some" x {0.01}) or
		   ("bought a" x {0.01}) or
		   (x "I bought" {0.01}) or		   
		   ("purchased a new" x {0.01}) or
		   ("purchased a few" x {0.01}) or
		   ("purchased some" x {0.01}) or
		   ("purchased a" x {0.01}) or
		   (x "I purchased" {0.01})
with threshold 0.2
excluding (str(x) matches ".*(new|NEW|few).*")
excluding (str(x) matches ".*(,|\.|;|!|\$|\(|\)|-).*")
excluding (str(x) matches ".*[0-9]+.*")
excluding (str(x) matches ".*(and|or|so|the|this|that).*")
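
The conditions in the query above follow a regular shape: each verb is combined with a few quantifiers, plus one condition where the entity precedes "I" and the verb. The sketch below generates the same condition list from those parts. `build_query` is a hypothetical helper written for this notebook, not part of the KOKO package, and it only emits syntax that already appears in *Q_prod* (the excluding clauses are left out for brevity).

```python
def build_query(verbs, quantifiers, source="./happydb.txt",
                weight=0.01, threshold=0.2):
    """Assemble a KOKO-style query string (hypothetical helper)."""
    conditions = []
    for verb in verbs:
        # Entity follows the purchase phrase, e.g. ("bought a new" x {0.01})
        for quantifier in quantifiers:
            conditions.append('("{} {}" x {{{}}})'.format(verb, quantifier, weight))
        # Entity precedes the purchase phrase, e.g. (x "I bought" {0.01})
        conditions.append('(x "I {}" {{{}}})'.format(verb, weight))
    body = " or\n\t".join(conditions)
    return 'extract "Ngrams(1,1)" x from "{}" if\n\t{}\nwith threshold {}'.format(
        source, body, threshold)

query = build_query(["bought", "purchased"], ["a new", "a few", "some", "a"])
print(query)
```

Generating the query this way makes it easy to try additional verbs (for example "ordered") without hand-editing every line.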



To understand *Q_prod*, let's take a look at the syntax of KOKO.

<a id='koko_syntax'></a>
## KOKO syntax

**extract** ⟨keyword⟩ x **from** ⟨document name⟩ **if**  
⟨condition⟩  
(**with threshold** ⟨threshold⟩)  
[**excluding** ⟨e-condition⟩]

where the conditions are defined as follows:

*⟨condition⟩ ::= ⟨condition⟩ or ⟨condition⟩ |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(x {⟨string⟩} ⟨weight⟩) |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(x ⟨string⟩ ⟨weight⟩) |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(x near ⟨string⟩ ⟨weight⟩) |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(str(x) matches ⟨pattern⟩) |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(str(x) [contains|mentions] {⟨string⟩} ⟨weight⟩) |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(str(x) [contains|mentions] ⟨string⟩ ⟨weight⟩)*
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  
*⟨weight⟩ ::= empty | number in [0,1]*
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  
*⟨threshold⟩ ::= number in [0,1]*
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  
*⟨pattern⟩ ::= ⟨regular expression⟩*

Specifically, a KOKO query extracts entities *x* of type ⟨keyword⟩ from the document ⟨document name⟩ 
if the score of *x* exceeds the threshold ⟨threshold⟩.  

The score of *x* is computed as the sum of the weights of the conditions that *x* satisfies in the dataset.
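
To make the idea of accumulating evidence concrete, here is a toy sketch in plain Python. It is not KOKO's implementation: it assumes that every occurrence of a prefix phrase followed by a token adds the condition's weight to that token's score, and that scores are capped at 1.0. The function `sketch_scores` and the sample text are hypothetical.

```python
import re
from collections import Counter

def sketch_scores(text, prefixes, weight=0.01, threshold=0.2):
    """Toy scoring: each '<prefix> <token>' occurrence adds `weight`.

    Illustrative assumption only, not KOKO's actual algorithm.
    """
    counts = Counter()
    for prefix in prefixes:
        # Capture the single token that follows the prefix phrase.
        for match in re.finditer(re.escape(prefix) + r"\s+(\w+)", text):
            counts[match.group(1)] += 1
    # Accumulate the weights and cap at 1.0 (the cap is an assumption).
    scores = {token: min(1.0, n * weight) for token, n in counts.items()}
    # Keep only tokens whose score exceeds the threshold.
    return {token: s for token, s in scores.items() if s > threshold}

# Made-up corpus: "car" appears 25 times after "bought a", "kite" only 3 times.
text = "I bought a car. " * 25 + "She bought a kite. " * 3
result = sketch_scores(text, ["bought a"])
print(result)
```

With a weight of 0.01 and a threshold of 0.2, a token needs more than twenty matching occurrences under this toy model, which is why rarely mentioned products drop out.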

**Let's use the query *Q_prod* presented above as an example.**

In *Q_prod*, the ⟨keyword⟩ is "Ngrams(1,1)", which means all the unigrams (single tokens) in the document.  
We can also use "Ents" for named entities, or "Nps" for noun phrases.  

There are ten conditions in ⟨condition⟩. For example, ("bought a new" x {0.01}) means that an entity *x* preceded by the string "bought a new" will have its score increased by 0.01, i.e., the weight of the condition.  

The first "excluding" clause drops any candidate token that contains "new", "NEW", or "few". Without it, the condition ("bought a" x {0.01}) would also score "new" in "bought a new car", whereas we are interested in "car".  
The other "excluding" clauses play a similar role.
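
The effect of the "excluding" patterns can be checked outside KOKO with Python's `re` module. This is only an approximation: it assumes KOKO's `matches` behaves like a full-string regular-expression match, and the candidate tokens are made up for illustration.

```python
import re

# The four patterns from the excluding clauses of Q_prod, copied verbatim.
EXCLUDE_PATTERNS = [
    r".*(new|NEW|few).*",
    r".*(,|\.|;|!|\$|\(|\)|-).*",
    r".*[0-9]+.*",
    r".*(and|or|so|the|this|that).*",
]

def is_excluded(token):
    """Return True if any excluding pattern matches the whole token."""
    return any(re.fullmatch(p, token) for p in EXCLUDE_PATTERNS)

# Hypothetical candidate tokens, not taken from the KOKO results.
candidates = ["car", "new", "house", "iPhone7", "motorcycle", "sofa"]
excluded = {token: is_excluded(token) for token in candidates}
for token, flag in excluded.items():
    print(token, flag)
```

Under this approximation, "motorcycle" and "sofa" are filtered out as well, because the last pattern matches "or" and "so" anywhere inside a token. If that is not intended, a stricter pattern that matches only whole words may be worth trying.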

# 3. Entity extraction with KOKO

Now we are ready to run the query for product extraction.  

First, we need to install the KOKO package.

## Install KOKO

To install KOKO locally, simply run the following command:

    pip install pykoko

## Generate plain-text happy moments

KOKO queries take texts as input. Let's generate plain-text happy moments for query evaluation.

In [27]:
# Write every cleaned happy moment to a plain-text file, one moment per line
with open('./happydb.txt', 'w') as ofile:
    for moment in data['cleaned_hm']:
        ofile.write("\t" + moment + '\n')

print("Plain-text happy moments are generated!")

Plain-text happy moments are generated!


## Run KOKO

After KOKO is installed, we can run the example query *Q_prod*.  

Considering the size of the dataset, it might take several minutes to get results.

In [15]:
import koko

koko.run('./products.koko')

Parsed query: extract "/Users/chen/Research/Playground/Github_Playground/happydb/data/happyDB_clean.txt" Ngrams(1,1) from "x" if
	("bought a new" x { 0.01 }) or
	("bought a few" x { 0.01 }) or
	("bought some" x { 0.01 }) or
	("bought a" x { 0.01 }) or
	(x "I bought" { 0.01 }) or
	("purchased a new" x { 0.01 }) or
	("purchased a few" x { 0.01 }) or
	("purchased some" x { 0.01 }) or
	("purchased a" x { 0.01 }) or
	(x "I purchased" { 0.01 })   
with threshold 0.20
excluding
	(str(x) matches ".*(new|NEW|few).*")
	(str(x) matches ".*(,|\.|;|!|\$|\(|\)|-).*")
	(str(x) matches ".*[0-9]+.*")
	(str(x) matches ".*(and|or|so|the|this|that).*")


Results:

Entity name                                        Entity score
car                                                1.000000
When                                               1.000000
pair                                               0.470000
house                                              0.440000
laptop                                     

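The results contain some noise, such as "When" and "pair". "When" likely comes from the (x "I bought") condition, as in "When I bought ...", and "pair" from phrases like "bought a pair of ...". But most entries are relevant.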

**It seems that expensive purchases, such as cars, houses, or laptops, are mentioned most often in HappyDB.**  

This makes sense. Expensive purchases are often products that people long for but can afford only after saving money for an extended period of time.  
No wonder such purchases make people happy.

We can also use spaCy or Google NLP for document parsing, instead of KOKO's default parser.

**Here's an example of using spaCy as the parser**

In [17]:
import spacy

koko.run('./products.koko', doc_parser='spacy')

INFO 2017-09-22 15:15:48,160 - Loading SpaCy English models
INFO 2017-09-22 15:15:50,853 - Done
Parsed query: extract "/Users/chen/Research/Playground/Github_Playground/happydb/data/happyDB_clean.txt" Ngrams(1,1) from "x" if
	("bought a new" x { 0.01 }) or
	("bought a few" x { 0.01 }) or
	("bought some" x { 0.01 }) or
	("bought a" x { 0.01 }) or
	(x "I bought" { 0.01 }) or
	("purchased a new" x { 0.01 }) or
	("purchased a few" x { 0.01 }) or
	("purchased some" x { 0.01 }) or
	("purchased a" x { 0.01 }) or
	(x "I purchased" { 0.01 })   
with threshold 0.20
excluding
	(str(x) matches ".*(new|NEW|few).*")
	(str(x) matches ".*(,|\.|;|!|\$|\(|\)|-).*")
	(str(x) matches ".*[0-9]+.*")
	(str(x) matches ".*(and|

The results differ slightly, but the extracted entities are identical except for "TV".

# 4. Conclusion

This is a quick and preliminary analysis of the HappyDB dataset. In the analysis:

- I tried to extract the products that tend to make people happy.

- I introduced KOKO, an entity extraction system with a declarative rule-based specification language.

- I showed how to write KOKO queries that concisely specify desirable entities, and how to use the KOKO runtime to extract these entities.