# Text data analysis with a knowledge-based system

# Data preparation

We will use a dataset consisting of baby product reviews on Amazon.com.

In [1]:
NUMBER_OF_REVIEWS_TO_ANALYZE = 100000

In [2]:
import pandas as pd

In [3]:
products = pd.read_csv("../valt_sa_data/amazon_baby.csv")[['review', 'rating']]

In [4]:
products = products[0:NUMBER_OF_REVIEWS_TO_ANALYZE]

In [5]:
products

| | review | rating |
|---|---|---|
| 0 | These flannel wipes are OK, but in my opinion ... | 3 |
| 1 | it came early and was not disappointed. i love... | 5 |
| 2 | Very soft and comfortable and warmer than it l... | 5 |
| 3 | This is a product well worth the purchase. I ... | 5 |
| 4 | All of my kids have cried non-stop when I trie... | 5 |
| 5 | When the Binky Fairy came to our house, we did... | 5 |
| 6 | Lovely book, it's bound tightly so you may not... | 4 |
| 7 | Perfect for new parents. We were able to keep ... | 5 |
| 8 | A friend of mine pinned this product on Pinter... | 5 |
| 9 | This has been an easy way for my nanny to reco... | 4 |


## Build the word count vector for each review

Let us look at one specific review.

In [6]:
products.iloc[9]

review    This has been an easy way for my nanny to reco...
rating                                                    4
Name: 9, dtype: object

Now we will perform three simple steps:

1. Remove punctuation using Python's built-in string functionality.
2. Transform each review into counts of positive and negative words.
3. Predict the rating from the ratio of positive words to all positive and negative words.
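
The steps above can be sketched on a few toy reviews. This is a minimal illustration, not the notebook's implementation: the names `POSITIVE`, `NEGATIVE`, `strip_punctuation` and `predict_rating` are hypothetical, and the tiny word sets are made up to stand in for the CSV lexicons the notebook loads.

```python
import string

# Hypothetical miniature lexicons (assumption); the notebook reads the real ones from CSV files.
POSITIVE = {"love", "great", "soft"}
NEGATIVE = {"broke", "bad"}


def strip_punctuation(text):
    """Step 1: drop every ASCII punctuation character."""
    return "".join(ch for ch in text if ch not in string.punctuation)


def predict_rating(review):
    """Steps 2 and 3: count sentiment words, then map their ratio to 1-5 stars."""
    words = strip_punctuation(review).split()
    # Matching is case-sensitive, as in the notebook cells.
    positive = sum(1 for w in words if w in POSITIVE)
    negative = sum(1 for w in words if w in NEGATIVE)
    n = positive + negative
    if n == 0:
        return 3  # no sentiment words: fall back to the neutral rating
    return 1 + int(round(float(positive) / n * 4))


print(predict_rating("So soft, I love it!"))      # 5
print(predict_rating("It broke, bad product."))   # 1
print(predict_rating("I love it but it broke."))  # 3
print(predict_rating("Arrived on Tuesday."))      # 3
```

A review with only positive words lands on 5, one with only negative words on 1, and a review with no lexicon words at all falls back to 3.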

In [7]:
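import string

def remove_punctuation(text):
    # Drop every ASCII punctuation character; works on both Python 2 and Python 3.
    return ''.join(ch for ch in text if ch not in string.punctuation)

# Cast to str first so that missing reviews (NaN) become 'nan' instead of raising an error.
review_without_puctuation = products['review'].apply(str).apply(remove_punctuation)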

In [8]:
significant_words = pd.read_csv('../valt_sa_data/positive-negative-words.csv', header=None)[0].tolist()

positive_words = pd.read_csv('../valt_sa_data/positive-words.csv', header=None)[0].tolist()
negative_words = pd.read_csv('../valt_sa_data/negative-words.csv', header=None)[0].tolist()
        
def count_number_of_significant_words(text):
    # Default to the neutral rating when the review has no sentiment words.
    prediction = 3
    words = text['review'].split()
    # Count how often each lexicon word occurs in the review.
    word_dict = {}
    for word in significant_words:
        word_dict[word] = 0
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
    # Total occurrences of positive and of negative words.
    positive = 0
    negative = 0
    for positive_word in positive_words:
        if positive_word in word_dict:
            positive += word_dict[positive_word]

    for negative_word in negative_words:
        if negative_word in word_dict:
            negative += word_dict[negative_word]

    n = positive + negative
    if n > 0:
        # Map the share of positive words (0..1) onto the 1-5 rating scale.
        prediction = 1 + int(round(float(positive) / n * 4))

    return pd.Series(prediction)

predictions_df = pd.DataFrame(review_without_puctuation).apply(count_number_of_significant_words, axis=1)
predictions_df.columns = ['prediction']

products_with_words = products.join(predictions_df)
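
One detail of the prediction formula deserves a note: the built-in `round` does not behave the same on every Python version. Python 2 rounds exact halves away from zero, while Python 3 rounds them to the nearest even integer, so reviews whose positive share is exactly 1/8 or 5/8 can receive a different prediction. The sketch below compares the formula as written with a variant that always rounds halves up; the function names are hypothetical and the variant is only one possible way to pin the behaviour down.

```python
import math


def stars_builtin_round(positive, negative):
    """The notebook's formula, using whatever round() the interpreter provides."""
    n = positive + negative
    if n == 0:
        return 3
    return 1 + int(round(float(positive) / n * 4))


def stars_half_up(positive, negative):
    """Same mapping, but exact halves always round up, on any Python version."""
    n = positive + negative
    if n == 0:
        return 3
    return 1 + int(math.floor(float(positive) / n * 4 + 0.5))


# (positive, negative) word counts; 1/8 and 5/8 hit exact halves after scaling by 4.
for positive, negative in [(0, 4), (1, 7), (2, 2), (5, 3), (4, 0)]:
    print(positive, negative,
          stars_builtin_round(positive, negative),
          stars_half_up(positive, negative))
```

On Python 3 the two functions disagree for the `(1, 7)` and `(5, 3)` cases, so rerunning the notebook on a different interpreter may not reproduce the outputs shown here exactly.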

Now let us compare the actual ratings with the predictions.

In [9]:
products_with_words

| | review | rating | prediction |
|---|---|---|---|
| 0 | These flannel wipes are OK, but in my opinion ... | 3 | 3 |
| 1 | it came early and was not disappointed. i love... | 5 | 3 |
| 2 | Very soft and comfortable and warmer than it l... | 5 | 5 |
| 3 | This is a product well worth the purchase. I ... | 5 | 5 |
| 4 | All of my kids have cried non-stop when I trie... | 5 | 5 |
| 5 | When the Binky Fairy came to our house, we did... | 5 | 4 |
| 6 | Lovely book, it's bound tightly so you may not... | 4 | 3 |
| 7 | Perfect for new parents. We were able to keep ... | 5 | 5 |
| 8 | A friend of mine pinned this product on Pinter... | 5 | 5 |
| 9 | This has been an easy way for my nanny to reco... | 4 | 5 |


## Evaluate the model

In [10]:
from sklearn.metrics import classification_report, confusion_matrix

y_true = products_with_words['rating']
y_predicted = products_with_words['prediction']

# Rows are true ratings, columns are predicted ratings.
cm = confusion_matrix(y_true, y_predicted)

print('Confusion matrix:')
print(cm)

print('Classification report:')
print(classification_report(y_true, y_predicted))


Confusion matrix:
[[ 1169  1818  3353  1683   897]
 [  413   807  2345  1762   997]
 [  381   707  2850  3015  2204]
 [  296   736  4224  7058  5977]
 [  707  1296  9185 18505 27615]]
Classification report:
             precision    recall  f1-score   support

          1       0.39      0.13      0.20      8920
          2       0.15      0.13      0.14      6324
          3       0.13      0.31      0.18      9157
          4       0.22      0.39      0.28     18291
          5       0.73      0.48      0.58     57308

avg / total       0.52      0.39      0.43    100000
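
To put these numbers in context, the overall accuracy can be read off the confusion matrix and compared with a trivial baseline that always predicts the most frequent rating. The sketch below copies the matrix from the output above into plain Python lists; the variable names are illustrative, and the baseline is an added point of comparison rather than part of the original analysis.

```python
# Confusion matrix copied from the output above
# (rows: true rating 1-5, columns: predicted rating 1-5).
cm = [
    [1169, 1818, 3353, 1683, 897],
    [413, 807, 2345, 1762, 997],
    [381, 707, 2850, 3015, 2204],
    [296, 736, 4224, 7058, 5977],
    [707, 1296, 9185, 18505, 27615],
]

total = sum(sum(row) for row in cm)
# Correct predictions sit on the diagonal.
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / float(total)

# Accuracy of always predicting the most frequent true rating.
majority_baseline = max(sum(row) for row in cm) / float(total)

print("accuracy:          %.3f" % accuracy)           # 0.395
print("majority baseline: %.3f" % majority_baseline)  # 0.573
```

On this sample the word-count model is correct for roughly 39% of the reviews, whereas always predicting 5 stars would be correct for roughly 57%, which suggests the rule-based approach has substantial room for improvement.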

