<a href="https://colab.research.google.com/github/anurag0308/Natural_Languange_Processing/blob/master/Extracting_Key_Selling_Keywords_from_product_descriptions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Extracting Search Engine Appropriate Keywords and Key Selling Points from a Product's description in E-Commerce Websites

**Problem Statement: Extract the following from a dataset of mobile phone descriptions**

1. Search Engine Appropriate Keywords

2. Key Selling Points


**1. Extracting Search Engine Appropriate Keywords:**

---
Proposed approach: search-engine-appropriate keywords can be:

  a. Specific: proper nouns, e.g. Samsung Galaxy, Flip cover, Redmi note 3.

  b. General: nouns other than proper nouns that name features, e.g. smartphone, earphone, tangle-free earphone.

---

**Solution: POS tagging + TF-IDF + n-grams**

For each document:

1. Extract the nouns and proper nouns from the preprocessed text (stop-word removal + lemmatization).

2. Join the extracted words to make a new document (pos_filtered_doc).

On the list of pos_filtered_docs:

3. Apply an n-gram TfidfVectorizer and, for each document, extract the terms with the top-n TF-IDF values.
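The steps above can be sketched end to end on toy data. The block below is a minimal, hand-rolled illustration in plain Python, not the notebook's implementation: it uses an unsmoothed tf × idf score, whereas scikit-learn's TfidfVectorizer applies idf smoothing and L2 normalization by default, so the exact weights will differ. The helper names (`ngrams`, `top_n_tfidf`) and the toy documents are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def top_n_tfidf(docs, top_n=3, n_min=1, n_max=2):
    """For each document, return the top_n n-grams ranked by a plain tf * idf score."""
    grams_per_doc = [ngrams(d.lower().split(), n_min, n_max) for d in docs]
    # Document frequency: the number of documents each n-gram appears in.
    doc_freq = Counter(g for grams in grams_per_doc for g in set(grams))
    n_docs = len(docs)
    results = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        scores = {
            g: (c / len(grams)) * math.log(n_docs / doc_freq[g])
            for g, c in counts.items()
        }
        # Highest score first; ties are broken alphabetically for a stable order.
        ranked = sorted(scores, key=lambda g: (-scores[g], g))
        results.append(ranked[:top_n])
    return results

# Toy stand-ins for pos_filtered_docs (hypothetical, not taken from the dataset).
toy_docs = [
    "flip cover samsung galaxy flip cover",
    "earphone tangle free earphone samsung",
    "battery charger samsung battery",
]
keywords = top_n_tfidf(toy_docs, top_n=2)
print(keywords)
```

In this unsmoothed variant, a term that occurs in every document (here `samsung`) gets an idf of zero and can only rank below terms that are specific to a document.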



In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import word_tokenize

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data = pd.read_csv('/content/drive/My Drive/amazon_phone_dataset.csv')

In [None]:
data.head()

Unnamed: 0,Product_name,by_info,Product_url,Product_img,Product_price,rating,total_review,ans_ask,prod_des,feature,cust_review
0,"Samsung Galaxy M10 (Ocean Blue, 3+32GB)",Samsung,https://www.amazon.in/Samsung-Galaxy-Ocean-Blu...,https://images-na.ssl-images-amazon.com/images...,"₹ 7,990.00",4.0 out of 5 stars,"7,353 customer reviews",1000+ answered questions,The Samsung Galaxy M10 is especially created f...,13MP+5MP ultra-wide angle dual camera | 5MP f2...,"Well, I was a bit sceptical before buying this..."
1,"Redmi 6 Pro (Black, 4GB RAM, 64GB Storage)",Mi,https://www.amazon.in/Redmi-Pro-Black-64GB-Sto...,https://images-na.ssl-images-amazon.com/images...,,4.1 out of 5 stars,"32,250 customer reviews",1000+ answered questions,"Qualcomm Snapdragon 625, 2.0 GHz processor wit...",12MP+5MP dual rear camera | 5MP front facing c...,"Display quality is top notch, overall the qual..."
2,"Coolpad Cool 3 Plus (Ocean Blue, 2GB RAM, 16GB...",Coolpad,https://www.amazon.in/Coolpad-Cool-Plus-Ocean-...,https://images-na.ssl-images-amazon.com/images...,"₹ 5,999.00",3.1 out of 5 stars,76 customer reviews,69 answered questions,"Coolpad Cool 3 Plus-Designed for all, 5.71'' H...","13MP primary camera with bokeh mode, timelapse...",Low bagget high kwality***It's not good Phone ...
3,"Redmi 6 Pro (Black, 3GB RAM, 32GB Storage)",Mi,https://www.amazon.in/Redmi-Pro-Black-32GB-Sto...,https://images-na.ssl-images-amazon.com/images...,,4.1 out of 5 stars,"32,250 customer reviews",1000+ answered questions,"Qualcomm Snapdragon 625, 2.0 GHz processor wit...",12MP+5MP dual rear camera | 5MP front facing c...,"Display quality is top notch, overall the qual..."
4,Nokia 105 (Black),Nokia,https://www.amazon.in/Nokia-105-Black/dp/B0745...,https://images-na.ssl-images-amazon.com/images...,,4.1 out of 5 stars,"6,474 customer reviews",1000+ answered questions,The design Shaped for your palm Featuring a cu...,4.572 centimeters (1.8-inch) display with 240 ...,Using the mobile phone for last 3 months. I am...


In [None]:
data = data[['Product_name', 'prod_des']].dropna().reset_index(drop=True)

In [None]:
data

Unnamed: 0,Product_name,prod_des
0,"Samsung Galaxy M10 (Ocean Blue, 3+32GB)",The Samsung Galaxy M10 is especially created f...
1,"Redmi 6 Pro (Black, 4GB RAM, 64GB Storage)","Qualcomm Snapdragon 625, 2.0 GHz processor wit..."
2,"Coolpad Cool 3 Plus (Ocean Blue, 2GB RAM, 16GB...","Coolpad Cool 3 Plus-Designed for all, 5.71'' H..."
3,"Redmi 6 Pro (Black, 3GB RAM, 32GB Storage)","Qualcomm Snapdragon 625, 2.0 GHz processor wit..."
4,Nokia 105 (Black),The design Shaped for your palm Featuring a cu...
...,...,...
3409,TokyoTon Mobile Battery for Moto M XT1662 XT16...,TokyoTon Mobile Battery for Moto M XT1662 XT16...
3410,AM Safe x Cable for iPhone LED Fast Charging D...,Fast Charging Nylon Braided USB LED Cable for ...
3411,Ktrack Metal and Plastic Open Pry Screwdriver ...,Product: Tool Metal and plastic. 8 in 1 tools ...
3412,Samsung S-View Flip Cover for Samsung Galaxy S...,"The S-View Flip Cover, Clear allows you to see..."



**1. Search Appropriate Keywords**

---

Search engine appropriate keywords can be:

a. Specific: e.g. Samsung Galaxy, Flip cover, Redmi note 3, etc., i.e. proper nouns

b. General: features such as smartphone, earphone, tangle free earphone (nouns other than proper)

Task: To extract search engine appropriate keywords for a given product description



In [None]:
def extract_noun(doc):
  tokens = nlp(doc)  # run the spaCy pipeline once and reuse the result for both lists
  return ([t for t in tokens if t.pos_ == 'NOUN'], [t for t in tokens if t.pos_ == 'PROPN'])

In [None]:
data['noun_keywords'] =  data['prod_des'].apply(extract_noun)

In [None]:
df_noun_out = pd.DataFrame(data = {'Product_name': data['Product_name'],'prod_des':data['prod_des'],'keywords':data['noun_keywords']})

In [None]:
df_noun_out.head()

Unnamed: 0,Product_name,prod_des,keywords
0,"Samsung Galaxy M10 (Ocean Blue, 3+32GB)",The Samsung Galaxy M10 is especially created f...,"([millennials, edge, infinity, V, display, ang..."
1,"Redmi 6 Pro (Black, 4GB RAM, 64GB Storage)","Qualcomm Snapdragon 625, 2.0 GHz processor wit...","([GHz, processor, nm, architecture, battery, c..."
2,"Coolpad Cool 3 Plus (Ocean Blue, 2GB RAM, 16GB...","Coolpad Cool 3 Plus-Designed for all, 5.71'' H...","([HD, dewdrop, display, GB, ROM, upto, sensor,..."
3,"Redmi 6 Pro (Black, 3GB RAM, 32GB Storage)","Qualcomm Snapdragon 625, 2.0 GHz processor wit...","([GHz, processor, nm, architecture, battery, c..."
4,Nokia 105 (Black),The design Shaped for your palm Featuring a cu...,"([design, palm, body, island, layout, dialling..."


In [None]:
df_noun_out.tail()

Unnamed: 0,Product_name,prod_des,keywords
3409,TokyoTon Mobile Battery for Moto M XT1662 XT16...,TokyoTon Mobile Battery for Moto M XT1662 XT16...,"([], [TokyoTon, Mobile, Battery, Moto, M, XT16..."
3410,AM Safe x Cable for iPhone LED Fast Charging D...,Fast Charging Nylon Braided USB LED Cable for ...,"([USB], [Fast, Charging, Nylon, Braided, LED, ..."
3411,Ktrack Metal and Plastic Open Pry Screwdriver ...,Product: Tool Metal and plastic. 8 in 1 tools ...,"([Product, Metal, plastic, tools, scrapers, sc..."
3412,Samsung S-View Flip Cover for Samsung Galaxy S...,"The S-View Flip Cover, Clear allows you to see...","([folio, complement, edge, calls, alarms, even..."
3413,TheGiftKart Full Body 3 in 1 Slim Fit 360 Degr...,"DESCRIPTION:br>Beautiful design, elegant appea...","([DESCRIPTION, design, appearance, protection,..."


In [None]:
df_noun_out.to_csv('df_pos_output.csv',index = False)

**2. POS + TF-IDF approach**

In [None]:
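def tuple_to_list(t):
  nouns, proper_nouns = t
  # Build a new list so the tuple stored in the DataFrame is not mutated when the cell is rerun.
  return [str(i) for i in nouns + proper_nouns]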

In [None]:
data['keywords'] = df_noun_out['keywords'].apply(tuple_to_list)

In [None]:
data['keywords']

0       [millennials, edge, infinity, V, display, angl...
1       [GHz, processor, nm, architecture, battery, ca...
2       [HD, dewdrop, display, GB, ROM, upto, sensor, ...
3       [GHz, processor, nm, architecture, battery, ca...
4       [design, palm, body, island, layout, dialling,...
                              ...                        
3409    [TokyoTon, Mobile, Battery, Moto, M, XT1662, X...
3410    [USB, Fast, Charging, Nylon, Braided, LED, Cab...
3411    [Product, Metal, plastic, tools, scrapers, scr...
3412    [folio, complement, edge, calls, alarms, event...
3413    [DESCRIPTION, design, appearance, protection, ...
Name: keywords, Length: 3414, dtype: object

In [None]:
df = [' '.join(i) for i in data['keywords']]

In [None]:
df

['millennials edge infinity V display angle camera processor smartphone Samsung Galaxy M10 Galaxy M10',
 'GHz processor nm architecture battery capacity cm camera portrait mode flash camera portrait mode Proximity sensor E compass Qualcomm Snapdragon FHD+ 1080x2280 Display GB GB Flash Memory Stock Android Oreo MP MP PDAF HDR LED MP Gyroscope Accelerometer IR Blaster',
 'HD dewdrop display GB ROM upto sensor Support Coolpad Cool GB RAM GB Fingerprint Faceunlock Gradient ID Helio A22 Quad Core Processor Android Pie OTG',
 'GHz processor nm architecture battery capacity cm camera portrait mode flash camera portrait mode Proximity sensor E compass Qualcomm Snapdragon FHD+ 1080x2280 Display GB GB Flash Memory Stock Android Oreo MP MP PDAF HDR LED MP Gyroscope Accelerometer IR Blaster',
 'design palm body island layout dialling texting feeling quality phone palm hand battery hours talk time month dawn dusk companion day polycarbonate shell color flashlight games moment Nokia standby Nokia Sn

**Trial 1:**

In [None]:
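# The corpus is passed to fit_transform, not to the constructor (whose first parameter is `input`).
vectorizer1 = TfidfVectorizer(lowercase = True, analyzer='word', stop_words='english', min_df = 0.1,max_df = 0.9)
tfidfmat1 = vectorizer1.fit_transform(df)
print("TFIDF shape: ", tfidfmat1.shape)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print("Terms in TFIDF: ",vectorizer1.get_feature_names_out().tolist())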

TFIDF shape:  (3414, 27)
Terms in TFIDF:  ['access', 'android', 'battery', 'buttons', 'camera', 'case', 'cover', 'design', 'device', 'display', 'experience', 'features', 'gb', 'material', 'mobile', 'music', 'phone', 'phones', 'power', 'product', 'protection', 'quality', 'scratches', 'screen', 'smartphone', 'technology', 'time']


In [None]:
tfidfmat1

<3414x27 sparse matrix of type '<class 'numpy.float64'>'
	with 16118 stored elements in Compressed Sparse Row format>
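With `min_df = 0.1` and `max_df = 0.9`, only terms that appear in at least 10% and at most 90% of the descriptions are kept, which is likely why Trial 1 ends up with just 27 generic terms and no brand or model names. The sketch below imitates that document-frequency filter in plain Python on toy data; it is an illustration of the idea under simplified whitespace tokenization, not scikit-learn's code, and `filter_by_doc_freq` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_doc_freq(docs, min_df=0.1, max_df=0.9):
    """Keep terms whose document count lies in [min_df * N, max_df * N]."""
    n_docs = len(docs)
    # Count each term once per document, however often it repeats inside it.
    doc_freq = Counter(term for d in docs for term in set(d.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df * n_docs <= df <= max_df * n_docs
    )

# Toy corpus (hypothetical): 'samsung' is in every document, most other terms in one.
toy_docs = [
    "samsung galaxy cover",
    "samsung galaxy battery",
    "samsung earphone cover",
    "samsung charger",
]
kept = filter_by_doc_freq(toy_docs, min_df=0.5, max_df=0.9)
print(kept)
```

Terms that are too rare (`battery`, `earphone`, `charger`) and too common (`samsung`) are both discarded, so a high `min_df` tends to remove exactly the product-specific words that make good search keywords.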

**Trial 1.2 : Top 5**

---

Simple top-n: for each document, take the n terms with the highest TF-IDF values.

In [None]:
def topn_indices(tfidfmat,n):
  # Densify once (not once per row), then take the n highest-scoring columns of each row.
  dense = tfidfmat.toarray()
  return np.argsort(dense, axis=1)[:, -n:]

In [None]:
def get_topn_multi_keywords(tfidfmat,n,terms):
  topn_lst_indices = topn_indices(tfidfmat,n)
  topn_lst_indices = np.array(topn_lst_indices).reshape((tfidfmat.shape[0],n))
  doc_keywords = [[terms[i] for i in l] for l in topn_lst_indices]
  return doc_keywords
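One caveat of taking the last n indices of an argsort: a row with fewer than n non-zero weights still yields n indices, so some of the returned terms may have a TF-IDF value of zero for that document. The sketch below shows one possible way to skip such terms, in plain Python on a toy matrix. The function name `top_n_nonzero_terms`, the vocabulary and the weights are all hypothetical, and unlike the argsort version it returns terms from highest to lowest weight.

```python
import heapq

def top_n_nonzero_terms(row, vocab, n=3):
    """Return up to n terms with the highest non-zero weight in one TF-IDF row."""
    scored = [(weight, term) for weight, term in zip(row, vocab) if weight > 0]
    # nlargest sorts by weight first, so the result is in descending order.
    return [term for _, term in heapq.nlargest(n, scored)]

# Toy vocabulary and dense TF-IDF rows (hypothetical values).
toy_vocab = ["battery", "camera", "cover", "display"]
toy_rows = [
    [0.0, 0.8, 0.0, 0.6],   # only two terms occur in this document
    [0.5, 0.0, 0.7, 0.1],
]
toy_keywords = [top_n_nonzero_terms(row, toy_vocab, n=3) for row in toy_rows]
print(toy_keywords)
```

The first document gets two keywords rather than three, because padding the list with a zero-weight term would suggest a keyword that never occurs in the description.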

In [None]:
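# Vocabulary indexed by TF-IDF column (use get_feature_names() on scikit-learn < 1.0).
terms = vectorizer1.get_feature_names_out()
doc_keywords = get_topn_multi_keywords(tfidfmat1,3,terms)
df_tfidf_out1 = pd.DataFrame(data = {'Product_name': data['Product_name'],'prod_des':data['prod_des'],'multi_keywords':doc_keywords})
df_tfidf_out1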