# MINING OPINION FEATURES IN CUSTOMER REVIEWS
### by Minqing Hu and Bing Liu

## Abstract of research paper

#### MOTIVE

To enhance customer satisfaction and the shopping experience, online merchants let their customers review or express opinions on the products they buy.

As e-commerce expands and more and more ordinary users review products, the number of reviews for a product grows so large that the comments become hard to digest and lose their usefulness. This makes it very hard for a potential customer to read them and decide whether to buy the product.

In this research, we propose to study the problem of feature-based opinion summarization of customer reviews of products sold online. The task is performed in two steps:
1. Identify the features of the product that customers have expressed opinions on (called opinion features), and rank the features by how frequently they appear in the reviews.
2. For each feature, identify how many customer reviews express positive or negative opinions.


#### OBJECTIVE
This project aims to summarize all the customer reviews of a product.
Unlike traditional summarization tasks, the authors aim to report the specific features of a product that customers have opinions on, and whether each opinion is positive or negative.
###### In this paper, the authors focus only on mining the opinion/product features that reviewers have commented on.

#### APPROACH
The paper proposes a number of techniques based on data mining and NLP methods to mine opinion/product features.
After crawling all the reviews, the system stores them in a "REVIEW DATABASE".
The paper incorporates 2 techniques for discovering terms:
1. Symbolic Approach : relies on a SYNTACTIC DESCRIPTION of terms, namely noun phrases.
2. Statistical Approach : finds the infrequent features by exploiting the fact that people use the same adjectives to describe different subjects.

#### MODEL
<img src = "model.jpg">

##### 1. POS Tagging
The task focuses on finding features that appear explicitly as nouns or noun phrases in the reviews. To identify them, part-of-speech (POS) tagging is used.
Each sentence is saved in the Review database along with the POS tag of each word.
A Transaction file is then created after preprocessing, which includes stopword removal, stemming and fuzzy matching.
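The following is a minimal sketch of how a transaction file could be built once sentences are tagged. It assumes the input is already POS-tagged with Penn Treebank tags (for example by `nltk.pos_tag`); the function name `build_transactions`, the tiny stopword list and the sample sentences are hypothetical, and stemming and fuzzy matching are left out.

```python
# Hypothetical sketch: one "transaction" (list of nouns) per tagged sentence.
# Assumption: tags follow the Penn Treebank scheme, so noun tags start with "NN".

STOP_WORDS = {"the", "a", "an", "is", "it", "this", "of"}  # illustrative only


def build_transactions(tagged_sentences):
    """Return a list of noun lists, one per sentence that contains a noun."""
    transactions = []
    for sentence in tagged_sentences:
        nouns = [
            word.lower()
            for word, tag in sentence
            if tag.startswith("NN") and word.lower() not in STOP_WORDS
        ]
        if nouns:  # sentences without nouns contribute no transaction
            transactions.append(nouns)
    return transactions


# Hand-tagged sample data (made up for illustration).
tagged = [
    [("The", "DT"), ("battery", "NN"), ("life", "NN"), ("is", "VBZ"), ("great", "JJ")],
    [("I", "PRP"), ("love", "VBP"), ("it", "PRP")],
    [("The", "DT"), ("screen", "NN"), ("is", "VBZ"), ("small", "JJ")],
]
transactions = build_transactions(tagged)
print(transactions)
```

Each transaction keeps only the nouns of one sentence, which is the form a frequent itemset miner expects as input.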
##### 2.  Frequent feature generation
This step finds the features that people are most interested in. To do this, association rule mining is used to find all frequent itemsets, where an "itemset" is a set of words or a phrase that occurs together.
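As a rough illustration of frequent itemset generation, the sketch below counts every word combination per transaction by brute force. This is a simplification and not the association rule miner used in the paper; the function name `frequent_itemsets`, the `min_support` and `max_size` parameters and the sample transactions are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets that occur in at least
    min_support (a fraction) of the transactions.

    Itemsets are sorted tuples of words with at most max_size words.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicates and word order
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    min_count = min_support * len(transactions)
    return {itemset: c for itemset, c in counts.items() if c >= min_count}


# Made-up transactions: the nouns found in four review sentences.
transactions = [
    ["battery", "life"],
    ["battery", "life", "screen"],
    ["screen"],
    ["battery"],
]
frequent = frequent_itemsets(transactions, min_support=0.5)
print(frequent)
```

Brute-force counting is only practical because review sentences contain few nouns; an Apriori-style miner would avoid counting combinations whose subsets are already infrequent.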
##### 3. Feature Pruning
Not all frequent features generated by association rule mining are useful or genuine features, so feature pruning is necessary. The paper uses 2 techniques:
1. Compactness pruning : for a feature phrase and a sentence that contains the phrase, look at the POSITION INFORMATION of every word of the phrase and check whether the phrase is "COMPACT" (see def. in paper) in the sentence. If the phrase is not compact in at least 2 sentences in the review database, it is pruned.
2. Redundancy pruning : if a feature has a "p-Support" lower than the specified minimum value (3 in the paper) and the feature is a subset of another feature phrase, it is pruned.
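The two pruning rules can be sketched as follows. Two assumptions are made here and should be checked against the paper: a phrase is treated as "compact" in a sentence when neighbouring phrase words are at most 3 positions apart, and "p-Support" is treated as the number of sentences that contain the feature but no feature phrase that is a superset of it. All function names and the sample data are hypothetical.

```python
def is_compact(phrase, tokens, max_gap=3):
    """True if all phrase words occur in tokens and neighbouring occurrences
    are at most max_gap positions apart (first occurrence of each word)."""
    try:
        positions = sorted(tokens.index(word) for word in phrase)
    except ValueError:  # a phrase word is missing from the sentence
        return False
    return all(b - a <= max_gap for a, b in zip(positions, positions[1:]))


def compactness_prune(phrases, sentences, min_compact=2):
    """Keep multi-word phrases that are compact in >= min_compact sentences."""
    kept = []
    for phrase in phrases:
        if len(phrase) == 1:  # single words are not checked for compactness
            kept.append(phrase)
        elif sum(is_compact(phrase, s) for s in sentences) >= min_compact:
            kept.append(phrase)
    return kept


def p_support(feature, phrases, sentences):
    """Count sentences with the feature but without any superset phrase."""
    supersets = [p for p in phrases if set(feature) < set(p)]
    count = 0
    for s in sentences:
        has_feature = all(w in s for w in feature)
        has_superset = any(all(w in s for w in p) for p in supersets)
        if has_feature and not has_superset:
            count += 1
    return count


def redundancy_prune(phrases, sentences, min_p_support=3):
    """Drop features that are subsets of another phrase and have low p-support."""
    kept = []
    for f in phrases:
        is_subset = any(set(f) < set(p) for p in phrases)
        if is_subset and p_support(f, phrases, sentences) < min_p_support:
            continue
        kept.append(f)
    return kept


# Made-up tokenized sentences and candidate features.
sentences = [
    ["the", "battery", "life", "is", "great"],
    ["battery", "life", "could", "be", "longer"],
    ["life", "is", "short", "and", "the", "manual", "is", "long"],
    ["the", "manual", "says", "nothing", "useful", "about", "the", "battery"],
]
candidates = [("battery", "life"), ("life",), ("battery", "manual")]
after_compactness = compactness_prune(candidates, sentences)
after_redundancy = redundancy_prune(after_compactness, sentences)
print(after_compactness, after_redundancy)
```

In this toy data, `("battery", "manual")` is dropped because its words never appear close together, and `("life",)` is dropped because it almost always occurs as part of `("battery", "life")`.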

##### 4. Opinion word extraction
Opinion words are words that people use to express a positive or negative opinion.
People often express their opinions of a product feature using opinion words located near the feature in the sentence. Using all the frequent features that remain after pruning, opinion words can therefore be extracted from the review database. They are saved after stemming and fuzzy matching to create the "Opinion word list".
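A minimal sketch of this extraction step is shown below. It assumes that opinion words are adjectives (Penn Treebank tags starting with `JJ`) and that "around the feature" means within a fixed window of tokens; the window size, the function name `extract_opinion_words` and the sample sentences are illustrative choices, and features are simplified to single words.

```python
def extract_opinion_words(tagged_sentences, features, window=3):
    """Collect adjectives that lie within `window` tokens of a feature word."""
    opinion_words = set()
    for sentence in tagged_sentences:
        words = [w.lower() for w, _ in sentence]
        feature_positions = [i for i, w in enumerate(words) if w in features]
        for i, (word, tag) in enumerate(sentence):
            near_feature = any(abs(i - p) <= window for p in feature_positions)
            if tag.startswith("JJ") and near_feature:
                opinion_words.add(word.lower())
    return opinion_words


# Hand-tagged sample data (made up for illustration).
tagged_reviews = [
    [("The", "DT"), ("screen", "NN"), ("is", "VBZ"), ("bright", "JJ")],
    [("Awful", "JJ"), ("day", "NN")],  # no feature here, so "awful" is skipped
]
opinion_list = extract_opinion_words(tagged_reviews, {"screen"})
print(opinion_list)
```

Adjectives in sentences without any frequent feature are ignored, which keeps unrelated sentiment out of the opinion word list.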
##### 5. Infrequent feature identification
The observation “opinions tend to appear closely together with features” can be used to identify infrequent features.
For each sentence in the review database, if it contains no frequent feature but one or more opinion words, the noun/noun phrase nearest to the opinion word is stored in the feature set as an infrequent feature.
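The rule above can be sketched as follows, assuming POS-tagged input with Penn Treebank tags and treating "nearest" as the smallest token distance. The function name `infrequent_features` and the sample data are hypothetical, and only single nouns (not noun phrases) are handled.

```python
def infrequent_features(tagged_sentences, frequent_features, opinion_words):
    """Return nouns nearest to an opinion word in sentences that contain
    no frequent feature."""
    found = set()
    for sentence in tagged_sentences:
        words = [w.lower() for w, _ in sentence]
        if any(w in frequent_features for w in words):
            continue  # sentence is already covered by a frequent feature
        noun_positions = [
            i for i, (_, tag) in enumerate(sentence) if tag.startswith("NN")
        ]
        if not noun_positions:
            continue
        for i, w in enumerate(words):
            if w in opinion_words:
                nearest = min(noun_positions, key=lambda p: abs(p - i))
                found.add(words[nearest])
    return found


# Hand-tagged sample data (made up for illustration).
tagged_reviews = [
    [("The", "DT"), ("screen", "NN"), ("is", "VBZ"), ("bright", "JJ")],
    [("The", "DT"), ("software", "NN"), ("is", "VBZ"), ("amazing", "JJ")],
    [("I", "PRP"), ("bought", "VBD"), ("it", "PRP"), ("yesterday", "NN")],
]
rare = infrequent_features(tagged_reviews, {"screen"}, {"bright", "amazing"})
print(rare)
```

Only the second sentence contributes a feature: the first contains a frequent feature and the third contains no opinion word.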

### IMPLEMENTATION

In [2]:
import re
import os
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize


##### Function Definitions

In [3]:
data_dir = 'data/'
def getList():
    # Print the name of every product file in the data directory.
    prod_list = os.listdir(data_dir)
    for p in prod_list:
        print (p)

In [114]:
def get_RDB (prod_name):
    # Load the review file for a product and return its text without "##".
    f = os.path.join(data_dir, prod_name + ".txt")
    with open(f, 'r') as content_file:
        content = content_file.read()
    # Splitting on "##" and re-joining removes every "##" separator.
    return ''.join(content.split("##"))

In [5]:
def tagIt (data):
    # Debug helper: print only the first item of data.
    try :
        for item in data:
            print (item)
            break
    except Exception as e:
        print ("Sorry.. exception occurred:", e)

In [6]:
def tag (data):
    # Tokenize each item and print its (word, POS tag) pairs.
    try :
        for item in data:
            tokenized = nltk.word_tokenize (item)
            tagged = nltk.pos_tag (tokenized)
            print (tagged)
    except Exception as e:
        print ("Sorry.. exception occurred:", e)

In [56]:
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
def normalize (content):
#     tok_list = [(ps.stem(w)).lower() for w in content]
#     tok_list = [w for w in tok_list if not w in  ['(',')','%','``','.',',', '\'',"''","'s"]]
#     filtered_sentence = [w for w in tok_list if len(w)>1]
#     print (filtered_sentence)
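    # Lowercase the tokens and drop punctuation tokens; stemming (ps) and
    # stopword removal (stop_words) are not applied in this version.
    tok_list = [w.lower() for w in content]
    tok_list = [w for w in tok_list if w not in ['(',')','%','``','.',',', '\'',"''","'s","#"]]
    print (tok_list)
    return tok_list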
    

##### Main code

In [116]:
rdb = get_RDB ("ipod")

    

