## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
import spacy 
nlp = spacy.load("en_core_web_sm")

### Read reviews data

In [16]:
# Read the whole file; the context manager closes it even if reading fails
with open("./Samsung.txt", 'r', encoding="utf-8") as con:
    samsung_reviews = con.read()

In [17]:
len(samsung_reviews.split("\n"))

46355
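
Note that `split("\n")` also counts empty strings: one for a trailing newline at the end of the file, and one for every blank line. Whether `Samsung.txt` contains any is not shown here, so the following is only a minimal sketch of a stricter count, using a small made-up string in place of `samsung_reviews`. The sample text and the helper name `split_reviews` are illustrative assumptions, not part of the dataset.

```python
def split_reviews(raw_text):
    """Return one stripped review per non-blank line of raw_text."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]

# Hypothetical stand-in for samsung_reviews, with a blank line and a trailing newline
sample_text = "Great phone\n\nBattery lasts long\n"

naive_count = len(sample_text.split("\n"))   # counts the empty strings too
clean_reviews = split_reviews(sample_text)   # keeps only non-blank lines

print(naive_count, len(clean_reviews))
```

`str.splitlines` also breaks on other line boundaries such as `\r\n`, so on real data its count can differ slightly from `split("\n")` for reasons other than blank lines.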

### The dataset is a text file with one review per line

In [47]:
samsung_reviews.split("\n")[3:10]

['It works good but it goes slow sometimes but its a very good phone I love it',
 'Great phone to replace my lost phone. The only thing is the volume up button does not work, but I can still go into settings to adjust. Other than that, it does the job until I am eligible to upgrade my phone again.Thaanks!',
 'I originally was using the Samsung S2 Galaxy for Sprint and wanted to return back to the Samsung EPIC 4G for Sprint because I really missed the keyboard, I really liked the smaller compact size of the phone, and I still needed some of the basic functions of a smart phone (i.e. checking e-mail, getting directions, text messaging) Because the phone is not as powerful as the newer cell phones out there, just be aware that the more applications you install the slower the phone runs and will most likely freeze up from time to time. But the camera works great, the video is great as well, and even the web browsing is decent and gives me what I need. I also notice that battery life lasts 

### Will our hypothesis hold on real-world data? `Product features---POS_NOUN`

In [53]:
review1=samsung_reviews.split("\n")[0]
review1=nlp(review1)

### Let's run the NLP parse on part of one review in our dataset

In [56]:
for tok in review1[0:10]:
    print(tok.text,"---",tok.lemma_,"---",tok.pos_)

I --- I --- PRON
feel --- feel --- VERB
so --- so --- ADV
LUCKY --- LUCKY --- NOUN
to --- to --- PART
have --- have --- AUX
found --- find --- VERB
this --- this --- DET
used --- use --- VERB
( --- ( --- PUNCT


#### Real-world data is usually messy. Observe the words `found` and `used`

In [57]:
pos = []
lemma = []
text = []
for tok in review1:
    pos.append(tok.pos_)
    lemma.append(tok.lemma_)
    text.append(tok.text)

In [64]:
nlp_table = pd.DataFrame({'text':text,'lemma':lemma,'pos':pos})
nlp_table.head(100)

     text  lemma    pos
0       I      I   PRON
1    feel   feel   VERB
2      so     so    ADV
3   LUCKY  LUCKY   NOUN
4      to     to   PART
..    ...    ...    ...
81   from   from    ADP
82   them   they   PRON
83  again  again    ADV
84      !      !  PUNCT


In [86]:
## Get most frequent lemma forms of verbs (change 'VERB' to 'NOUN' to list nouns instead)
nlp_table[nlp_table['pos']=='VERB']['lemma'].value_counts()

lemma
use           2
feel          1
find          1
upgrade       1
sell          1
like          1
fall          1
want          1
thank         1
appreciate    1
say           1
recommend     1
Name: count, dtype: int64
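
The same filter-and-count step can be written without a DataFrame. The sketch below uses `collections.Counter` on the first ten tokens printed by the parse of `review1`, hard-coded as `(text, lemma, pos)` tuples so that it runs without spaCy. The helper name `top_lemmas` is hypothetical.

```python
from collections import Counter

# First ten tokens of review1, copied from the parse output shown earlier
tokens = [
    ("I", "I", "PRON"), ("feel", "feel", "VERB"), ("so", "so", "ADV"),
    ("LUCKY", "LUCKY", "NOUN"), ("to", "to", "PART"), ("have", "have", "AUX"),
    ("found", "find", "VERB"), ("this", "this", "DET"), ("used", "use", "VERB"),
    ("(", "(", "PUNCT"),
]

def top_lemmas(rows, pos_tag, n=5):
    """Count lowercased lemmas of tokens tagged pos_tag; return the n most common."""
    counts = Counter(lemma.lower() for _text, lemma, pos in rows if pos == pos_tag)
    return counts.most_common(n)

verb_counts = top_lemmas(tokens, "VERB")
noun_counts = top_lemmas(tokens, "NOUN")
print(verb_counts)
print(noun_counts)
```

With only ten tokens, the single "noun" is `lucky`, which is arguably an adjective that the tagger got wrong. Counts like these only become informative once they are aggregated over many reviews.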

#### It seems plausible that if we extract all the nouns from the reviews and look at the 5 most frequent lemmatised nouns, we will be able to identify what people are talking about

### Let's repeat this experiment on a larger set of reviews

In [87]:
nouns = []
for review in samsung_reviews.split("\n")[0:100]:
    doc = nlp(review)
    for tok in doc:
        if tok.pos_=="NOUN":
            nouns.append(tok.lemma_.lower())

### Let's add a way to keep track of time

In [94]:
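from tqdm import tqdm
# NOTE: this cell filters on "VERB", so the list (still named `nouns`) holds
# verb lemmas and the counts below are verbs; use "NOUN" to collect nouns.
nouns = []
for review in tqdm(samsung_reviews.split("\n")[0:1000]):
    doc = nlp(review)
    for tok in doc:
        if tok.pos_=="VERB":
            nouns.append(tok.lemma_.lower())
pd.Series(nouns).value_counts().head(5)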

100%|██████████| 1000/1000 [00:05<00:00, 199.60it/s]


have    397
work    271
use     216
get     190
love    165
Name: count, dtype: int64

In [90]:
len(samsung_reviews.split("\n"))

46355

### Did you notice anything? What do you think will be the time taken to process all the reviews?

In [97]:
(46355//1000)*17 # estimated seconds, assuming 17 s per 1000 reviews

782

In [99]:
(46355//1000)*5 # seconds on my computer (about 5 s per 1000 reviews, per the tqdm output above)

230

In [100]:
782//60 # whole minutes

13

In [102]:
230//60 # whole minutes on my computer

3
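
The estimates above use integer division, which drops the last 355 reviews and any leftover seconds. A minimal sketch of the same extrapolation with proportional scaling is shown below. The helper name `estimate_runtime` is hypothetical, and 17 s and 5 s per 1000 reviews are simply the figures used in the cells above.

```python
def estimate_runtime(total_reviews, sample_size, sample_seconds):
    """Scale the time measured on a sample linearly to the full dataset.

    Returns (whole_minutes, remaining_seconds), rounded to the nearest second.
    """
    total_seconds = round(total_reviews * sample_seconds / sample_size)
    return divmod(total_seconds, 60)

slow_estimate = estimate_runtime(46355, 1000, 17)  # 17 s per 1000 reviews
fast_estimate = estimate_runtime(46355, 1000, 5)   # 5 s per 1000 reviews
print(slow_estimate, fast_estimate)
```

This assumes processing time grows linearly with the number of reviews, which ignores differences in review length, so treat the result as a rough guide only.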

## Summary
- The POS-tag-based rule seems to be working well
- We need to find a way to reduce the time taken to process all the reviews