## In this notebook

* we pull reviews from the database that correspond to our product category (`Monitors`)
* we extract keyphrases from the reviews
* we build database tables to store the keyphrases for the product category, along with scores and other metadata

In [37]:
from sqlalchemy import create_engine
import psycopg2 
import io


In [38]:
import pandas as pd
import json

In [39]:
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 200
pd.options.display.max_rows = 1000

In [40]:
conn_string = 'postgresql+psycopg2://gabbydbuser:gabbyDBpass@localhost:5432/gabbyDB'

In [41]:
db = create_engine(conn_string)
conn = db.connect()

## Getting data for product category

In [42]:
monitor_reviews_query = \
    '''SELECT BR.*
        FROM baseline_reviews BR,  
        (SELECT asin
        FROM baseline_products 
        WHERE title ILIKE '%%inch%%' 
        AND title ILIKE '%%monitor%%') AS BP 
        WHERE BR.asin = BP.asin; '''
monitor_reviews = pd.read_sql(monitor_reviews_query, conn)

In [43]:
monitor_reviews.shape

(45480, 10)
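
The query above selects reviews through an implicit join on a filtered subquery. As a minimal sketch, the same filter can be written as an explicit `JOIN`. The example runs against an in-memory SQLite database with made-up rows (the `Acme` titles and the `A1`..`A3` ASINs are assumptions, not data from `gabbyDB`), and uses SQLite's `LIKE`, which is case-insensitive for ASCII, as a stand-in for PostgreSQL's `ILIKE`.

```python
import sqlite3

# In-memory stand-in for the notebook's PostgreSQL tables (toy rows, not real data).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE baseline_products (asin TEXT, title TEXT);
    CREATE TABLE baseline_reviews (review_id INTEGER, asin TEXT);
    INSERT INTO baseline_products VALUES
        ('A1', 'Acme 24 Inch LED Monitor'),
        ('A2', 'Acme Studio Monitor Speakers'),
        ('A3', 'Acme 10 inch tablet');
    INSERT INTO baseline_reviews VALUES (1, 'A1'), (2, 'A1'), (3, 'A2'), (4, 'A3');
""")

# Same title filter as the notebook query, written as an explicit JOIN.
# SQLite's LIKE is case-insensitive for ASCII, standing in for ILIKE here.
query = """
    SELECT BR.*
    FROM baseline_reviews AS BR
    JOIN baseline_products AS BP ON BR.asin = BP.asin
    WHERE BP.title LIKE '%inch%'
      AND BP.title LIKE '%monitor%'
    ORDER BY BR.review_id
"""
matched_review_ids = [row[0] for row in con.execute(query)]
print(matched_review_ids)
```

Only products whose title contains both terms contribute reviews. A title match of this kind may still admit products that are not display monitors when both words happen to appear, so the result is worth spot-checking.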

In [44]:
monitor_reviews.head()

Unnamed: 0,review_id,rating,sentiment,vote,verified,reviewerID,asin,reviewText,reviewTitle,reviewTime
0,34130,4.0,positive,18.0,False,AA797LFUG2ZGS,B00020E4KK,I have the earlier reincarnation of this monitor PL-190M which I have had for about 2 years. I bought it when LCDs used to be very expensive. Mine has not a single bad pixel and I am very happy wi...,Sleek style & excellent quality,2005-02-24
1,54659,4.0,positive,0.0,True,A1A27N3E2A4PVT,B000A5S926,Awesome sound for the price.,Awesome sound for the price.,2017-07-26
2,30563,5.0,positive,0.0,True,A1OHXJOZU7XCKJ,B000196YF0,I was truly impressed with this monitor . It looks like a CRT which is impressive . It was reconditioned\nso I got the monitor at a fraction of the cost .No scratches at all on the screen . That's...,Planar LCD Monitor,2012-07-25
3,30564,5.0,positive,0.0,True,AABDY8APOV7W5,B000196YF0,"Everything arrived on time and in good order. And, Amazon, don't tell me how many words to use!! This is new and I don't like it.",planar,2011-10-21
4,30565,5.0,positive,0.0,True,A1MGHYD1O6HEUP,B000196YF0,arrived in 1 week was super easy to hook-up to laptop for 2nd monitor works awesome GREAT PRICE THE PICTURE IS Many times better than the picture on my toshiba laptop will be getting one for my so...,excellent picture and was very cheap,2010-10-26


## Computing Keyterms

* ensure that brand names are removed from keyterms
* check for differences between negative and positive keyterms (extracted from negative and positive reviews)
* NOTE: cannot run yake or other textacy algos on the concatenated reviews, as spacy fails with `[E088] Text of length 22512050 exceeds maximum of 1000000.`
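
The `E088` error appears when all reviews are joined into one document that exceeds spaCy's length limit. One possible workaround, sketched here in plain Python, is to group reviews into batches whose joined length stays under the limit and process each batch separately. The helper name `batch_texts` and its parameters are hypothetical, and whether corpus-level scores computed per batch are meaningful is not established here.

```python
def batch_texts(texts, max_chars=1_000_000, sep=" "):
    """Group texts into batches whose joined length stays within max_chars.

    A single text longer than max_chars still ends up in a batch of its own,
    so callers would need to handle that case separately.
    """
    batches, current, current_len = [], [], 0
    for text in texts:
        # Length this text adds, counting a separator if the batch is non-empty.
        added = len(text) + (len(sep) if current else 0)
        if current and current_len + added > max_chars:
            # Close the current batch and start a new one with this text.
            batches.append(sep.join(current))
            current, current_len = [], 0
            added = len(text)
        current.append(text)
        current_len += added
    if current:
        batches.append(sep.join(current))
    return batches


toy_reviews = ["great monitor", "bad stand", "ok"]
toy_batches = batch_texts(toy_reviews, max_chars=20)
print(toy_batches)
```

The cells below take the simpler route of running `nlp` on one review at a time, which keeps every document far below the limit.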

In [48]:
monitor_reviews['contents'] = (monitor_reviews['reviewTitle'] + '.' + monitor_reviews['reviewText']).str.lower()

In [49]:
monitor_reviews[monitor_reviews['sentiment'] == 'negative'].shape

(8373, 11)

In [50]:
monitor_reviews[monitor_reviews['sentiment'] == 'positive'].shape

(37107, 11)

In [51]:
import spacy 
nlp = spacy.load('en_core_web_sm')

In [52]:
from tqdm.notebook import tqdm
tqdm.pandas()

In [91]:
key_phrase_data = {}

def _extract_phrases_and_metadata(e_nc, review):
    # drop this chunk if it is empty or if more than 50% of its terms are stop words
    toks = e_nc.text.split()
    n_stop = sum(tok in spacy.lang.en.stop_words.STOP_WORDS for tok in toks)
    if not toks or n_stop / len(toks) > .5:
        return
    # insert into key_phrase_data
    if e_nc.text not in key_phrase_data:
        key_phrase_data[e_nc.text] = {
            'reviews': set(), #review ids
            'reviewers': set(), #reviewer ids
            'products': set(), #product asins
            'n_positive': 0, #number of occurrences in positive reviews
            'n_negative': 0, #number of occurrences in negative reviews
            #'n_reviewers_positive': 0, #reviewers ids
            #'n_reviewers_negative': 0, #reviewers ids
        }
    key_phrase_data[e_nc.text]['reviews'].add(review['review_id'])
    key_phrase_data[e_nc.text]['reviewers'].add(review['reviewerID'])
    if review['sentiment'] == 'positive':
        key_phrase_data[e_nc.text]['n_positive'] += 1
        #key_phrase_data[e_nc.text]['n_reviewers_positive'] += 1
    else:
        key_phrase_data[e_nc.text]['n_negative'] += 1
        #key_phrase_data[e_nc.text]['n_reviewers_negative'] += 1
    key_phrase_data[e_nc.text]['products'].add(review['asin'])

def get_phrase_metadata(review):
    #print(review)
    doc = nlp(review['contents'])
    for sent in doc.sents:
        for nc in sent.noun_chunks:
            _extract_phrases_and_metadata(nc, review)
        for e in sent.ents:
            _extract_phrases_and_metadata(e, review)

def extract_phrases_and_metadata_from_reviews(df):
    df.progress_apply(lambda review: get_phrase_metadata(review), axis=1)
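
The stop-word rule used in `_extract_phrases_and_metadata` can be checked in isolation. This is a minimal sketch with a tiny, hypothetical stop-word set (`TOY_STOP_WORDS`); the notebook itself uses `spacy.lang.en.stop_words.STOP_WORDS`. A phrase is kept when at most half of its whitespace-separated tokens are stop words.

```python
# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# spacy.lang.en.stop_words.STOP_WORDS instead.
TOY_STOP_WORDS = {"the", "a", "this", "of", "it", "i"}


def stop_word_ratio(phrase, stop_words=TOY_STOP_WORDS):
    """Fraction of whitespace-separated tokens that are stop words."""
    tokens = phrase.lower().split()
    if not tokens:
        return 1.0  # treat an empty phrase as all stop words so it is dropped
    return sum(tok in stop_words for tok in tokens) / len(tokens)


def keep_phrase(phrase, max_ratio=0.5):
    """Keep a phrase unless more than max_ratio of its tokens are stop words."""
    return stop_word_ratio(phrase) <= max_ratio


for phrase in ["this monitor", "the earlier reincarnation", "it", ""]:
    print(repr(phrase), keep_phrase(phrase))
```

Because the threshold is strict (`> 0.5` drops), a two-word phrase with one stop word, such as `this monitor`, survives the filter. This is consistent with very generic phrases appearing near the top of the results further down.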

In [71]:
extract_phrases_and_metadata_from_reviews(monitor_reviews)

  0%|          | 0/45480 [00:00<?, ?it/s]

In [72]:
key_phrases = pd.DataFrame.from_dict(key_phrase_data, orient='index')

In [73]:
import pickle

In [74]:
key_phrases.to_pickle('../data/amazon-review-data-2018/sample/key_phrases_monitors.pkl')

In [75]:
key_phrases.head()

Unnamed: 0,reviews,reviewers,n_positive,n_negative,n_reviewers_positive,n_reviewers_negative,products
sleek style,{34130},{AA797LFUG2ZGS},1,0,1,0,{B00020E4KK}
excellent quality.i,"{57457, 34130, 776948, 908566}","{A39H1GO9E3YQIB, A1YXH8WIKCPYDO, AA797LFUG2ZGS, A24B990VP9QC6C}",4,0,4,0,"{B00PC9HFO8, B00020E4KK, B000BMBUAQ, B00JR6GCZA}"
the earlier reincarnation,{34130},{AA797LFUG2ZGS},1,0,1,0,{B00020E4KK}
this monitor,"{524289, 360450, 524291, 524292, 524293, 753672, 753673, 524298, 524299, 524300, 524301, 524302, 360464, 360465, 753681, 753690, 360476, 753692, 753693, 360480, 753696, 1343522, 753699, 753710, 13...","{AIN25734EXYX, A363YSQOQMF33P, A2BSH7IVT8AOKG, A1JH469QSCW288, A2W2E6BPPYZO0T, A3MF5GC81U7W4Y, A1HURX19SW1B6K, A20M4E1J5RDIY8, A15SZSX85U5D25, A3W2N6FHHKB6XJ, A3NV817WV1UTSX, ATGCPGRRYSDWA, A3QQ7S...",10892,2788,10892,2788,"{B00WR6ZQTK, B0009Z66QI, B00DSCYR7E, B004XIALVS, B00Y09G6JG, B00NAKT2J2, B000R0FG5W, B005BZND6M, B006HIKIHO, B00CE2T19I, B007IDKWOG, B007HSKSMI, B00WJXPLJG, B00FECAUSG, B005ZT5C2M, B00Q9NX4UU, B01..."
about 2 years,"{149280, 530439, 663080, 897865, 472878, 428273, 34130, 203314, 485841, 668789, 122359, 611224, 380191}","{A3LF55JE58OG5M, AZLYIET1TXBK2, AA797LFUG2ZGS, A1XOYUNOJMQKCT, AU4OXQZQXVOBA, A3UTOBG03S1AMG, A1CMSLD56I0C0, A3UH7Q5PBLXBFK, A1C77DEGL3QOS4, A31P3WRM4BRRAV, AFWTFYB9ZRPDU, A2374KS5MTBWKI}",14,10,14,10,"{B007HSKSMI, B00020E4KK, B00D601UC8, B00OKSEVTY, B009F1IKFC, B002FOAZSG, B0017SDMGI, B00EZSUWFG, B0098Y77U0, B0062K9LXE, B00ES9GPGW, B00B3YQG4Q}"


In [76]:
#
#spacy.lang.en.stop_words.STOP_WORDS

In [77]:
key_phrases.shape

(228913, 7)

In [78]:
key_phrases['reviews'].apply(lambda r: len(r)).sum()

684283

In [79]:
import numpy as np

In [80]:
key_phrases['reviewers'].apply(lambda r: len(r))

sleek style                     1
excellent quality.i             4
the earlier reincarnation       1
this monitor                 7695
about 2 years                  12
                             ... 
the beginner user               1
auto frequency search           1
a beginner unit                 1
this headset                    1
your transmitter                1
Name: reviewers, Length: 228913, dtype: int64

In [85]:
key_phrases['category'] = 'Monitor'
key_phrases['n_reviewers'] = key_phrases['reviewers'].apply(lambda r: len(r))
key_phrases['n_reviews'] = key_phrases['reviews'].apply(lambda r: len(r))
key_phrases['reviewer_idf'] = np.log(monitor_reviews.shape[0]/key_phrases['n_reviewers'])
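
As a worked check of the `reviewer_idf` column, the cell above computes the natural log of the number of reviews divided by the number of distinct reviewers who used the phrase. The sketch below reproduces that arithmetic with the standard library. The function name is hypothetical, and the constant is the row count of `monitor_reviews` reported earlier.

```python
import math

N_REVIEWS = 45480  # monitor_reviews.shape[0] in the notebook


def reviewer_idf(n_reviewers, n_total=N_REVIEWS):
    """Natural log of total reviews over distinct reviewers using the phrase."""
    return math.log(n_total / n_reviewers)


# Reviewer counts taken from the notebook: "this monitor" has 7695 reviewers,
# and a phrase used by a single reviewer gets the maximum score.
print(round(reviewer_idf(7695), 6))
print(round(reviewer_idf(1), 6))
```

Note that the numerator counts reviews while the denominator counts reviewers. Using the number of distinct reviewers in the numerator would be an alternative normalization, though that is not what the notebook does.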

In [86]:
key_phrases.iloc[0]['reviews']

{34130}

In [87]:
key_phrases[key_phrases['n_reviewers'] >= 5].sort_values('reviewer_idf').reset_index().head()

Unnamed: 0,index,reviews,reviewers,n_positive,n_negative,n_reviewers_positive,n_reviewers_negative,products,category,n_reviewers,n_reviews,reviewer_idf
0,this monitor,"{524289, 360450, 524291, 524292, 524293, 753672, 753673, 524298, 524299, 524300, 524301, 524302, 360464, 360465, 753681, 753690, 360476, 753692, 753693, 360480, 753696, 1343522, 753699, 753710, 13...","{AIN25734EXYX, A363YSQOQMF33P, A2BSH7IVT8AOKG, A1JH469QSCW288, A2W2E6BPPYZO0T, A3MF5GC81U7W4Y, A1HURX19SW1B6K, A20M4E1J5RDIY8, A15SZSX85U5D25, A3W2N6FHHKB6XJ, A3NV817WV1UTSX, ATGCPGRRYSDWA, A3QQ7S...",10892,2788,10892,2788,"{B00WR6ZQTK, B0009Z66QI, B00DSCYR7E, B004XIALVS, B00Y09G6JG, B00NAKT2J2, B000R0FG5W, B005BZND6M, B006HIKIHO, B00CE2T19I, B007IDKWOG, B007HSKSMI, B00WJXPLJG, B00FECAUSG, B005ZT5C2M, B00Q9NX4UU, B01...",Monitor,7695,8259,1.776702
1,the monitor,"{524289, 753665, 524292, 524293, 524298, 524299, 753674, 524301, 524302, 753677, 753678, 753681, 753690, 1343522, 753699, 753706, 753710, 753723, 753731, 753738, 753741, 753749, 753752, 753763, 75...","{AIN25734EXYX, A2BSH7IVT8AOKG, A2W2E6BPPYZO0T, A1HURX19SW1B6K, A39YHXDJP6LPWX, A20M4E1J5RDIY8, A2QVE21Y4SI1R7, ATGCPGRRYSDWA, A3QQ7SNO51QY5R, A2D8AQI8OZY9VS, A3QFOE3ZD1EWPR, A3U207W7TF1L3F, AG2RS2...",9363,3182,9363,3182,"{B0009Z66QI, B00DSCYR7E, B004XIALVS, B00Y09G6JG, B000R0FG5W, B00CE2T19I, B007IDKWOG, B007HSKSMI, B00WJXPLJG, B00FECAUSG, B005ZT5C2M, B005JN9310, B00IEZGWI2, B00GA2OUOY, B00A7OZ49G, B01GFG3MCK, B00...",Monitor,6398,6819,1.961287
2,the price,"{1024000, 360450, 524290, 524292, 360453, 1024006, 753671, 360456, 360457, 1024010, 1204230, 1227682, 1204237, 524302, 1204239, 360464, 753681, 753682, 974867, 753684, 1204244, 1299326, 974871, 12...","{AORTVSGHMP38H, A1IYJOUXK7SSP7, A2W959ENBJ23OE, A363YSQOQMF33P, A35MVSSQW5RKH9, A2EA4X9BIXF123, A24LYSNVPET5EV, A30RLRW6S8LYGT, AC4BGBGAVCVXT, A3JTJLZLRFJXN2, A31V95T8ZPL53Q, A1PWFRD1XASSCC, A2MEY...",4541,644,4541,644,"{B0009Z66QI, B00DSCYR7E, B004XIALVS, B00Y09G6JG, B000R0FG5W, B005BZND6M, B00CE2T19I, B0071I0EL4, B007IDKWOG, B007HSKSMI, B00WJXPLJG, B00FECAUSG, B005ZT5C2M, B017VXMZB0, B00069QNCY, B00Q9NX4UU, B00...",Monitor,4461,4611,2.3219
3,the screen,"{1024000, 974851, 524293, 974853, 1204231, 524299, 974859, 524301, 524302, 753679, 974867, 1204243, 1204245, 753688, 1204248, 1204254, 1204257, 974882, 1343522, 139305, 753705, 1302572, 1317898, 7...","{AVNF983B8CMP7, A2BSH7IVT8AOKG, A2FOCA7MOEWE4P, AJSJEDQR13SNH, AB3Y9FA5ALTK2, A2MEY3LCF6N3Z0, AOA743KA8G2KF, AJ2TUF976WQEC, A3MF5GC81U7W4Y, A39YHXDJP6LPWX, A1JS70MP1NPZ18, A2DKPB5N1CZU6G, A8V7WUK4...",4272,1855,4272,1855,"{B00WR6ZQTK, B004XIALVS, B00Y09G6JG, B00NAKT2J2, B005BZND6M, B00CE2T19I, B0071I0EL4, B007IDKWOG, B007HSKSMI, B00WJXPLJG, B01GNL6UVM, B005ZT5C2M, B017VXMZB0, B00Q9NX4UU, B00IEZGWI2, B00GA2OUOY, B00...",Monitor,4043,4265,2.420286
4,2,"{1024000, 1007620, 1024006, 1204236, 974863, 1286167, 974874, 1204252, 1286173, 1343522, 1294373, 139304, 1204269, 974895, 360498, 548916, 139319, 360506, 1204286, 360511, 139329, 753737, 1351755,...","{A363YSQOQMF33P, A2O2QFMRPSPOM, A39YI2DO6NQF8, A9EUOD3NR6Z52, A2MEY3LCF6N3Z0, A1ZVFCPHCWFV71, A1X69RZ08LRNGL, A1US4CULFV04FT, A3CF9OL5UMIZQY, A2QVE21Y4SI1R7, A3HEY3AXJUNHRF, A3OBL0OYFQP77X, A1UB2V...",2554,839,2554,839,"{B00WR6ZQTK, B004XIALVS, B00Y09G6JG, B00NAKT2J2, B000R0FG5W, B00CE2T19I, B0071I0EL4, B007IDKWOG, B007HSKSMI, B00WJXPLJG, B01GNL6UVM, B005ZT5C2M, B017VXMZB0, B00069QNCY, B00Q9NX4UU, B005JN9310, B00...",Monitor,2518,2618,2.893808


In [88]:
key_phrases['key_phrase_id'] = list(range(key_phrases.shape[0]))

In [90]:
key_phrases = key_phrases.reset_index().rename(columns={'index': 'phrase'})[[ 
    'key_phrase_id', 'phrase', 'reviews', 'reviewers', 'products', 'n_positive', 'n_negative', 'category', 'reviewer_idf'
]]

In [92]:
key_phrases[key_phrases['n_positive'] < key_phrases['n_negative']].sort_values('n_negative', ascending=False).head()

Unnamed: 0,key_phrase_id,phrase,reviews,reviewers,products,n_positive,n_negative,category,reviewer_idf
4327,4327,the issue,"{524293, 594949, 643078, 713735, 594448, 1297429, 1355285, 83488, 54305, 941092, 1345061, 474151, 246824, 488487, 905259, 866863, 1184307, 488504, 908346, 525372, 612933, 1049671, 908362, 994382, ...","{AGND98TS1KMQ4, AZADBEUXFBZD4, A198M1L1QCCNHR, A2I9MRRXFIPAOO, AXF7Q1ULTPQOD, AUQ1Z9SUPMKUW, A2Q51U49D16L05, A1VK3POF9ND2G2, AQDS0SKR0LVUT, AVUWENSU0MJ5P, A2T7APXLU6DGYW, A21NUW7ZM9Y6T2, A3QNBXVZ1...","{B0009Z66QI, B00AZMLIDQ, B010BEDXSA, B00Y09G6JG, B00C8T5KOW, B000PDD130, B004K6O3C6, B0098Y77U0, B00DT0L87C, B007ILEHNU, B00JR6GCZA, B00D601UC8, B015WCV70W, B01BV1Q0L4, B00AYA7NBA, B00PXYRMPE, B01...",148,167,Monitor,5.119226
7184,7184,no way,"{345600, 905217, 1224193, 225796, 610820, 594953, 685065, 1135113, 659989, 1355285, 889367, 866841, 77341, 747549, 643104, 487970, 473125, 488487, 437803, 791085, 668725, 525367, 1184315, 265279, ...","{A2STPUTDS1ALC6, A1KBYHP3MIRPV, AUQ1Z9SUPMKUW, A2LP73ZEANU8MY, A3M82I6X801R9L, A1OADVGY776K9T, A1HWQBVEUOUWFY, A3BWXFJQ0STV0Z, A2541WPVQU8B0H, A2V125RU0CIT86, A25CQUD6VMNB7W, AQZL4RGKSC8KH, A2T36D...","{B0019HDAP0, B00UF8ZTAS, B00K6E8ACU, B00AZMLIDQ, B010BEDXSA, B00FMBOLIY, B00Y09G6JG, B000R0FG5W, B00558ORME, B003UT2C4U, B0098Y77U0, B003Y73Q4S, B00KRA5RTC, B00JR6GCZA, B007IDKWOG, B00AYA7NAQ, B00...",128,164,Monitor,5.145298
1697,1697,your money,"{578565, 713733, 758280, 1334793, 941067, 1012235, 720406, 713755, 941084, 756261, 1038374, 203309, 1389613, 866870, 406584, 511033, 203322, 247355, 747576, 1129529, 1184316, 1314885, 77385, 68513...","{A393028LDV2V0P, A1NDKOP4QR8VWR, A9EUOD3NR6Z52, AC3UHZKNWT938, A2VSXP6W0F5FMT, A3AAVD4WTVFKMO, A2QRBQ66XVLRBR, AJHX4SMDG8V49, AOAGPIRYQPQWE, A1DTNQHAXPG3S2, A3V48ZQJGFU3X5, A2BRPB9YYL48FB, A3VXRSF...","{B00FMBOLIY, B00C8T5KOW, B0098Y77U0, B003Y73Q4S, B00KRA5RTC, B015WCV70W, B01HGAZTJI, B00E7LVBCE, B005ZT5C2M, B00069QNCY, B00IEZGWI2, B00B332A9C, B0051GN8JI, B00KCIZBJ0, B0045IIZKU, B007SLDF7O, B00...",41,133,Monitor,5.701147
3258,3258,this issue,"{1024000, 178177, 1382403, 1128968, 488458, 473099, 594442, 720395, 1305612, 1129489, 228882, 1297426, 228884, 228887, 371736, 578585, 643097, 246811, 941083, 488479, 660000, 747553, 758305, 79107...","{A16D6AD0492ZYP, A3V79Y437V09C5, AUQ1Z9SUPMKUW, A36HIGDJ7UZ5F6, A2Q51U49D16L05, A388QZS2LA7HCE, A19ZGA16O68V6Q, A1M2QDWEC0L5CZ, A2L6NE9NUB8TSL, A2SKIYL6PT0VF8, A3V23601KMXSAT, A3ENKBYY8G1OR4, A36W...","{B002R0JJYO, B00Y09G6JG, B00C8T5KOW, B004K6O3C6, B0098Y77U0, B003Y73Q4S, B01CNAVSK0, B00JR6GCZA, B007HSKSMI, B015WCV70W, B01BV1Q0L4, B00PXYRMPE, B01CKU7HDK, B00IEZGWI2, B00B332A9C, B0063BM5NK, B00...",128,130,Monitor,5.313382
8322,8322,warranty,"{1293825, 1295361, 889348, 316933, 1128964, 747531, 1038347, 602138, 83488, 247331, 488487, 720426, 345644, 428081, 984114, 1285686, 668727, 839225, 1325113, 345659, 199740, 514622, 548932, 118432...","{A1UOCDSP3Q612D, A2DDE4CYU6JCEX, A1IUFC87K94CJA, A3P8GMXS1SPUTN, A289BG1P365S5B, A3IRCDZFPFLK0, A2Q4KWK8K4JGUF, A21NUW7ZM9Y6T2, ATMSDTM6KB29M, A26XDMR7KUNN70, A192TCJEPB1M3G, A2YJ2UGI5C36FS, A1D3N...","{B00WR6ZQTK, B0019HDAP0, B010BEDXSA, B00C8T5KOW, B01BECUNCC, B000PDD130, B003UT2C4U, B003Y73Q4S, B007ILEHNU, B00JR6GCZA, B007HSKSMI, B00D601UC8, B015WCV70W, B00PXYRMPE, B0176WQ792, B005ZT5C2M, B00...",58,125,Monitor,5.688075
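
Comparing raw `n_positive` and `n_negative` counts favors the positive class, because there are 37107 positive reviews and only 8373 negative ones. One way to account for this, sketched below as an assumption and not as part of the notebook's pipeline, is to compare occurrence rates per review class using a smoothed log ratio. The function name and the smoothing constant are illustrative choices.

```python
import math

N_POSITIVE_REVIEWS = 37107  # positive review count from the notebook output
N_NEGATIVE_REVIEWS = 8373   # negative review count from the notebook output


def negative_log_ratio(n_positive, n_negative, smoothing=1.0):
    """Log of (negative rate / positive rate), with additive smoothing.

    Values above 0 mean the phrase occurs relatively more often in negative
    reviews; values below 0 mean it leans positive.
    """
    neg_rate = (n_negative + smoothing) / (N_NEGATIVE_REVIEWS + smoothing)
    pos_rate = (n_positive + smoothing) / (N_POSITIVE_REVIEWS + smoothing)
    return math.log(neg_rate / pos_rate)


# (n_positive, n_negative) pairs copied from the notebook tables.
counts = {
    "the price": (4541, 644),
    "the screen": (4272, 1855),
    "your money": (41, 133),
}
scores = {p: negative_log_ratio(n_pos, n_neg) for p, (n_pos, n_neg) in counts.items()}
for phrase, score in scores.items():
    print(phrase, round(score, 2))
```

Under this measure `the screen` leans negative even though its raw positive count is higher, which the `n_positive < n_negative` filter above would miss.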


In [93]:
key_phrases.shape

(228913, 9)

In [30]:
monitor_reviews.shape

(45480, 10)

### Saving to file

In [94]:
key_phrases.to_csv('../data/amazon-review-data-2018/sample/key_phrases_monitors.csv')

In [95]:
key_phrases.to_pickle('../data/amazon-review-data-2018/sample/key_phrases_monitors.pkl')

## Saving key phrases to database
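- the `reviews`, `reviewers` and `products` columns hold Python sets, so we split the `key_phrases` data frame into several tables: a root table, a scores table, and two-column link tables that pair each `key_phrase_id` with a review ID, reviewer ID, etc.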
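
The split can be sketched as follows, using an in-memory SQLite database in place of PostgreSQL and two toy phrases with review IDs copied from the outputs above. The table names mirror the ones written below, but the schema here is a simplified assumption.

```python
import sqlite3

# Toy rows shaped like key_phrases: (key_phrase_id, phrase, set of review ids).
toy_phrases = [
    (0, "sleek style", {34130}),
    (1, "about 2 years", {34130, 149280}),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE key_phrase_root (key_phrase_id INTEGER PRIMARY KEY, phrase TEXT)")
con.execute("CREATE TABLE key_phrase_reviews (key_phrase_id INTEGER, review_id INTEGER)")

con.executemany(
    "INSERT INTO key_phrase_root VALUES (?, ?)",
    [(pid, phrase) for pid, phrase, _ in toy_phrases],
)

# One row per (phrase, review) pair: the set-valued column is "exploded".
link_rows = [(pid, rid) for pid, _, reviews in toy_phrases for rid in sorted(reviews)]
con.executemany("INSERT INTO key_phrase_reviews VALUES (?, ?)", link_rows)

# Recover the phrases attached to one review by joining the two tables.
phrases_for_review = [
    row[0]
    for row in con.execute(
        """SELECT r.phrase
           FROM key_phrase_root AS r
           JOIN key_phrase_reviews AS kr ON kr.key_phrase_id = r.key_phrase_id
           WHERE kr.review_id = ?
           ORDER BY r.key_phrase_id""",
        (34130,),
    )
]
print(phrases_for_review)
```

Keeping the set-valued columns out of the root table means each table has only scalar columns, and lookups in either direction (phrase to reviews, review to phrases) become plain joins.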

In [96]:
key_phrases.sample(10)

Unnamed: 0,key_phrase_id,phrase,reviews,reviewers,products,n_positive,n_negative,category,reviewer_idf
152433,152433,the inner and outer sides,{858358},{A3K5JLENAAC8XS},{B00MSOND8C},0,1,Monitor,10.725028
125563,125563,recent dell firmware fix,{714422},{A108XABRHAA9E7},{B00GVE7QEC},1,0,Monitor,10.725028
141470,141470,major smudging,{758041},{A1U5B0CNYGRUB2},{B00ITORMNM},0,1,Monitor,10.725028
21891,21891,"19, 20,","{110896, 137817}",{A3GA4K7MVW9JUF},{B0012TVEOO},2,0,Monitor,10.725028
19514,19514,qa,"{437664, 488480, 1049763, 890214, 281480, 590398, 890216, 1049484, 897807, 1049457, 941076, 112373, 1084598, 1199988, 573786, 1223991, 139294}","{AAIDZGF6BKKOT, A18XG9T9FW4KGV, A3LJGXODSO0J3K, A3LKN9GND01EWJ, A2CVZERVHR9B53, A2YHMZIQPZGPB3, A2D2YJQAFRLXIM, ASAQO7HN14BNK, A33RM8D38PH1YC, ANAANW7N0E73N, A1Y8SSNYJDR157, A3U47C8IBDWF1F, A141OP...","{B00ZOO348C, B00OKSEWL6, B01CX26WPY, B012UNOCJY, B01B9IDLAW, B009H0XQQY, B004HJ7JAE, B00139S3U6, B00O0Z5682, B00RORBPEW, B00CE2T19I, B00C2RPW8O, B007SIC6KY}",11,7,Monitor,7.952439
141247,141247,a stud location,{754524},{A2HZ1217XQZ2EN},{B00IM3XC8E},1,0,Monitor,10.725028
216912,216912,barely big enough @ 1360x768 pixels,{1294023},{A38R6K4AGOC0Q3},{B001OIOQ1Q},1,0,Monitor,10.725028
215546,215546,the best replacement,{1288568},{A1RJMH6CB07LP5},{B000S5XEVY},0,1,Monitor,10.725028
68225,68225,quality components,{352059},{AWUH97GWPLV86},{B005HPSFWI},1,0,Monitor,10.725028
124938,124938,big system sound,{699521},{A22G0HMYN6JQ3M},{B00G5AH2YG},1,0,Monitor,10.725028


In [97]:
key_phrases.dtypes

key_phrase_id      int64
phrase            object
reviews           object
reviewers         object
products          object
n_positive         int64
n_negative         int64
category          object
reviewer_idf     float64
dtype: object

In [99]:
# phrase_id and phrase
key_phrase_root = key_phrases[['key_phrase_id', 'phrase', 'category']]
key_phrase_root.sample(5)

Unnamed: 0,key_phrase_id,phrase,category
174300,174300,the four soft-touch buttons,Monitor
120832,120832,my operating position,Monitor
39057,39057,the weights,Monitor
71239,71239,five stars.perfect size,Monitor
83238,83238,just a good picture,Monitor


In [100]:
key_phrase_root.to_sql('key_phrase_root', con=conn, index=False, method='multi')

228913

In [102]:
# phrase_id and scores
key_phrase_scores = key_phrases[['key_phrase_id', 'n_positive', 'n_negative', 'reviewer_idf']]


In [107]:
key_phrase_scores = key_phrase_scores.assign(n_reviews=key_phrases['reviews'].apply(lambda x: len(x)))
key_phrase_scores = key_phrase_scores.assign(n_reviewers=key_phrases['reviewers'].apply(lambda x: len(x)))

In [110]:
key_phrase_scores.sample(5)

Unnamed: 0,key_phrase_id,n_positive,n_negative,reviewer_idf,n_reviews,n_reviewers
23753,23753,15,2,8.016978,17,15
39649,39649,3,1,9.338734,4,4
140365,140365,1,0,10.725028,1,1
161408,161408,1,1,10.031881,2,2
218203,218203,1,0,10.725028,1,1


In [111]:
key_phrase_scores.to_sql('key_phrase_scores', con=conn, index=False, method='multi')

228913

In [119]:
#phrase_ids and review_ids
key_phrase_reviews = key_phrases[['key_phrase_id', 'reviews']].explode('reviews').rename(columns={'reviews': 'review_id'})
key_phrase_reviews.shape

(684283, 2)

In [120]:
key_phrase_reviews.sample(10)

Unnamed: 0,key_phrase_id,review_id
203001,203001,1204195
494,494,137818
112009,112009,596889
182010,182010,994638
4633,4633,629041
183827,183827,994463
10959,10959,1271874
9324,9324,889561
170182,170182,908442
87797,87797,488500


In [121]:
key_phrase_reviews.to_sql('key_phrase_reviews', con=conn, index=False, method='multi')

684283

In [63]:
#key_phrases.to_sql('key_phrases', con=conn,index=False, method='multi')
#baseline_products_asin5count.to_sql('baseline_products', con=conn, if_exists='replace',index=False, method='multi')

# Scratch

In [None]:
import phrase_filters
import phrase_extraction

In [None]:
import importlib
importlib.reload(phrase_filters)
importlib.reload(phrase_extraction)

<module 'phrase_extraction' from '/Users/nimblenotions/Experiments/GetGabby/notebooks/phrase_extraction.py'>

In [None]:

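from collections import Counter

# NOTE: the enclosing `def` line is missing from this scratch cell; the function
# name and the `spacy_docs` parameter are assumed here so that the body can run.
def count_noun_chunks_and_entities(spacy_docs):
    # collect lower-cased noun chunks and named entities from every sentence
    all_noun_chunks = [e.text.lower() for sd in spacy_docs for sent in sd.sents for e in sent.noun_chunks]
    all_entities = [e.text.lower() for sd in spacy_docs for sent in sd.sents for e in sent.ents]

    # a phrase's score is its total number of occurrences
    spacy_counts = Counter(all_noun_chunks + all_entities)

    spacy_count_df = pd.DataFrame({
        'phrase': list(spacy_counts.keys()),
        'score': list(spacy_counts.values())
    })
    return spacy_count_df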

In [30]:
phrases_ent_nc = phrase_extraction.keyterm_extraction_entities_and_noun_chunks(monitor_reviews)

  0%|          | 0/45480 [00:00<?, ?it/s]

In [31]:
phrases_ent_nc.shape

(241216, 2)

In [32]:
phrases_ent_nc.head()

Unnamed: 0,phrase,score
0,excellent monitor,250
1,dslrs.this,1
2,an excellent monitor,136
3,your dslr,4
4,i,121403


In [33]:
phrases_ent_nc.to_csv('phrases_ent_nc.csv')

In [51]:
#sorted(list(phrase_filters.spacy_stopwords))

In [40]:
phrases_ent_nc_no_stopwords = phrase_filters.filter_phrases_containing_stopwords(phrases_ent_nc)
phrases_ent_nc_1stopword = phrase_filters.filter_phrases_containing_more_than_n_stopwords(phrases_ent_nc, n=1)



In [41]:
phrases_ent_nc_no_stopwords.shape

(87509, 2)

In [42]:
phrases_ent_nc_1stopword.shape

(212579, 2)

In [43]:
phrases_ent_nc_no_stopwords.sort_values('score', ascending=False).head(100)

Unnamed: 0,phrase,score
228108,2,3393
8026,samsung,2959
18659,second,2414
78287,4k,2331
82,amazon,2299
228126,3,2197
228179,1,1655
2510,hdmi,1647
3025,windows,1624
228109,4,1489


In [37]:
phrases_ent_nc_1stopword.sort_values('score', ascending=False).head(100)

Unnamed: 0,phrase,score
4,i,121403
13,it,96043
29,you,36851
5,this,21477
34,that,20232
19,this monitor,13680
48,the monitor,12545
195,they,10877
40,me,9590
37,which,8020


In [44]:
phrases_ent_nc.sort_values('score', ascending=False).head(100)

Unnamed: 0,phrase,score
4,i,121403
13,it,96043
29,you,36851
5,this,21477
34,that,20232
19,this monitor,13680
48,the monitor,12545
195,they,10877
40,me,9590
37,which,8020


In [None]:
phrases_ent_nc = phrase_filters.filter_phrases_containing_brand_model_terms(phrases_ent_nc, brand_model_terms)
phrases_ent_nc.shape
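
The helper `phrase_filters.filter_phrases_containing_brand_model_terms` and the `brand_model_terms` list are not shown in this notebook. The following is only a sketch of what a token-based brand filter could look like, not the module's actual implementation. The function names and the two-item brand set are hypothetical; the example phrases are taken from the outputs above.

```python
def contains_brand_term(phrase, brand_terms):
    """True if any whitespace-separated token of the phrase is a brand term."""
    return any(tok in brand_terms for tok in phrase.lower().split())


def drop_brand_phrases(phrases, brand_terms):
    """Return the phrases that contain no brand term, preserving order."""
    return [p for p in phrases if not contains_brand_term(p, brand_terms)]


toy_brand_terms = {"samsung", "dell"}  # illustrative; not the notebook's list
toy_phrases = ["samsung", "recent dell firmware fix", "hdmi", "quality components"]
filtered = drop_brand_phrases(toy_phrases, toy_brand_terms)
print(filtered)
```

Matching on whole tokens avoids dropping phrases that merely contain a brand name as a substring, but it would miss tokens with attached punctuation, so some normalization may be needed in practice.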

# SCRATCH

In [20]:
phrases_yake = phrase_extraction.keyterm_extraction_yake(monitor_reviews, 1000)

massive doc length =  22512050
Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/Users/nimblenotions/opt/anaconda3/envs/gabby-env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3398, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/var/folders/2p/4k_lvbl52y5bxqvn873xgfs00000gn/T/ipykernel_37027/2515965506.py", line 1, in <cell line: 1>
    phrases_yake = phrase_extraction.keyterm_extraction_yake(monitor_reviews, 1000)
  File "/Users/nimblenotions/Experiments/GetGabby/notebooks/phrase_extraction.py", line 117, in keyterm_extraction_yake
    massive_spacy_doc = _construct_textacy_document(df)
  File "/Users/nimblenotions/Experiments/GetGabby/notebooks/phrase_extraction.py", line 82, in _construct_textacy_document
    massive_spacy_doc = nlp(massive_doc)
  File "/Users/nimblenotions/opt/anaconda3/envs/gabby-env/lib/python3.10/site-packages/spacy/language.py", line 1008, in __call__
    doc = self._ensure_doc(text)
  File "/Users/nimblenotions/opt/anaconda3/envs/gabby-env/lib/py