## Importing libraries

In [12]:
import pandas as pd
import numpy as np
import os
import spacy 
from tqdm import tqdm

### Read reviews data

In [13]:
con=open("./Samsung.txt",'r', encoding="utf-8")
samsung_reviews=con.read()
con.close()

The `con.close()` call closes the file "./Samsung.txt" after it has been read. Closing files matters for two reasons:

 - It frees the system resources that were allocated to the file, such as memory and file handles.
 - For files opened for writing, it ensures that buffered changes are flushed to disk, avoiding data loss or corruption. (This does not apply here, because the file is opened read-only.)

You should always close files after using them, preferably with the `with` statement, which closes the file automatically when the block of code ends. For example, you can rewrite the code above as:

In [14]:
# the code above can be rewritten as follows
with open("./Samsung.txt",'r', encoding="utf-8") as con:
    samsung_reviews=con.read()
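
To check the claim that `with` closes the file, here is a minimal sketch. It writes a small throwaway file (a hypothetical stand-in for "./Samsung.txt", with made-up contents) and inspects the file object's `closed` attribute, both after a normal exit and after an exception inside the block.

```python
import os
import tempfile

# Create a small throwaway file; it stands in for "./Samsung.txt" (illustrative only).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "reviews.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("good phone\nbattery drains fast\n")

# Normal case: the file is open inside the block and closed right after it.
with open(path, "r", encoding="utf-8") as con:
    text = con.read()
    open_inside = not con.closed
closed_after = con.closed

# Error case: the file is still closed when the block exits via an exception.
try:
    with open(path, "r", encoding="utf-8") as con2:
        raise ValueError("simulated failure while reading")
except ValueError:
    closed_after_error = con2.closed

print(open_inside, closed_after, closed_after_error)
```

All three flags should print as `True`, which is why the `with` form is usually preferred over a manual `close()` that an exception could skip.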

### Can we reduce the time taken?
[Pipelines (Spacy)](https://spacy.io/usage/processing-pipelines)


<img src='./images/spacy_pipeline.png'>

In [15]:
# load a shorter pipeline: skip the parser and NER components
nlp=spacy.load('en_core_web_sm',disable=['parser','ner'])

In [16]:
nouns = []
for review in tqdm(samsung_reviews.split("\n")[0:1000]):
    doc = nlp(review)
    for tok in doc:
        if tok.pos_=="NOUN":
            nouns.append(tok.lemma_.lower())

  0%|          | 0/1000 [00:00<?, ?it/s]

100%|██████████| 1000/1000 [00:03<00:00, 280.53it/s]


In [17]:
len(samsung_reviews.split("\n"))

46355

In [9]:
(46355/1000)*6 # for instructor's computer

278.13

In [18]:
(46355/1000)*3 # for my computer

139.065

In [10]:
278/60 # for instructor's computer

4.633333333333334

In [20]:
round(139.065/60, 2) # for my computer

2.32
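
The back-of-the-envelope estimates above can be wrapped in a small helper. This is only a sketch: the function name `estimate_minutes` is hypothetical, and it assumes run time grows linearly with the number of reviews, which holds only roughly when reviews vary in length.

```python
def estimate_minutes(total_items, sample_items, sample_seconds):
    """Linearly extrapolate the run time of a sample to the full data set."""
    seconds = (total_items / sample_items) * sample_seconds
    return seconds / 60


# 46355 reviews in total; the 1000-review sample took about 3 s here
# and about 6 s on the instructor's computer.
mine = estimate_minutes(46355, 1000, 3)
instructor = estimate_minutes(46355, 1000, 6)
print(f"{mine:.1f} min, {instructor:.1f} min")
```

The estimate is only as good as the sample: if the first 1000 reviews are shorter than average, the full run will take longer than predicted.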

### Let's process all the reviews now and see whether the time taken is lower

In [25]:
nouns = []
for review in tqdm(samsung_reviews.split("\n")):
    doc = nlp(review)
    for tok in doc:
        if tok.pos_=="NOUN":
            nouns.append(tok.lemma_.lower())

  0%|          | 0/46355 [00:00<?, ?it/s]

100%|██████████| 46355/46355 [03:46<00:00, 204.90it/s]
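
A further option, not used in this notebook, is to hand spaCy the texts in batches through `nlp.pipe` instead of calling `nlp(review)` once per review. The sketch below assumes the `nlp` object and `samsung_reviews` string from the cells above; the helper name `extract_nouns` and the batch size of 500 are illustrative choices, and any speed-up would need to be measured on your own machine.

```python
def extract_nouns(nlp, texts, batch_size=500):
    """Collect lower-cased noun lemmas, letting spaCy process texts in batches."""
    nouns = []
    # nlp.pipe yields one Doc per input text and batches the work internally
    for doc in nlp.pipe(texts, batch_size=batch_size):
        for tok in doc:
            if tok.pos_ == "NOUN":
                nouns.append(tok.lemma_.lower())
    return nouns


# Usage with the objects defined earlier in this notebook (not run here):
# nouns = extract_nouns(nlp, tqdm(samsung_reviews.split("\n")))
```

The loop body is identical to the one above; only the way documents are produced changes, so the resulting list of nouns should be the same.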


### Does the hypothesis of nouns capturing `product features` hold?

In [26]:
nouns=pd.Series(nouns)
nouns.value_counts().head(5)

phone      43507
battery     4334
product     3992
screen      3838
time        3810
Name: count, dtype: int64

In [27]:
nouns.value_counts().head(10)

phone      43507
battery     4334
product     3992
screen      3838
time        3810
card        3384
price       3149
problem     3137
camera      2936
app         2593
Name: count, dtype: int64
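
For readers who want to see what `value_counts()` is doing, the same frequency ranking can be sketched with the standard library's `collections.Counter`. The list `sample_nouns` below is a tiny made-up example, not the review data.

```python
from collections import Counter

# Tiny made-up sample; in the notebook this would be the full list of nouns
sample_nouns = ["phone", "battery", "phone", "screen", "battery", "phone"]

counts = Counter(sample_nouns)
# most_common(n) returns (item, count) pairs, most frequent first
top_two = counts.most_common(2)
print(top_two)
```

`Counter` works on the plain Python list, while `pd.Series.value_counts()` is more convenient when the counts feed into further pandas analysis.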

### We now know that people mention `battery`, `product`, `screen`, etc., but we still don't know in what context they mention these keywords

### Summary:
 - The most frequent lemmatised nouns tell us which product features people talk about in product reviews
 - To process the review data faster, spaCy lets us run only the parts of the model inference pipeline we need, by disabling the other components via the `disable` parameter of `spacy.load()`