# Notebook: Subject extraction

## 1. Import Libraries

This notebook uses the following libraries:
- **nltk**: a widely used Python toolkit for natural language processing (https://github.com/nltk/nltk)
- **stanza**: an NLP library developed by the Stanford NLP Group (https://github.com/stanfordnlp/stanza)
- **rake_nltk**: a Python implementation of the RAKE keyword-extraction algorithm built on NLTK (https://github.com/csurfer/rake-nltk)

In [None]:
import pandas as pd
import numpy as np
from pandas.core.frame import DataFrame
import re
import string

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from string import punctuation

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 2. Read the Data

In [None]:
df= pd.read_csv('https://raw.githubusercontent.com/emaleeeeeee/new_df/main/new_df.csv')
df = df[['Title','Review']]
df


Unnamed: 0,Title,Review
0,Read their Privacy Policy. They’re intrusive!,Have you checked out their ‘Privacy Policy’?Th...
1,"Great hardware, but software doesn't work. Do...","I very rarely write reviews on Amazon, but I f..."
2,"I’m keeping it, but...",The stripped-down version of Android is a real...
3,Lies,Who reviewed this POS and gave it 5 stars?1. B...
4,It doesn’t work on most formats.,The projector will play Netflix off it’s nativ...
...,...,...
289,Not worth the price,"Easy to use, but it’s dark. It takes me a whil..."
290,not working with its own Twitch app,"Upset, not working with its own Twitch app. us..."
291,Poco brillo y baja resolución,"No tiene buen brillo, no sirve para jugar play..."
292,Badluck,One day the nebula get heated more and suddenl...


In [None]:
df = pd.read_csv('REVIEWS RESCRAP.csv')
df = df[~df['Comment'].isnull()]

# Drop the reviews that contain common Spanish words or phrases
spanish_markers = ['muy', 'bueno', 'en el', 'ninguno', 'tiempo', 'teléfonos', 'calidad de imagen']
is_spanish = df['Comment'].str.contains('|'.join(spanish_markers))
df = df[~is_spanish]
df.index  = np.arange(1, len(df)+1)
#2021/10/17
df.to_csv('All Reviews.csv', encoding='utf-8-sig')
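
The filter above flags a comment as Spanish as soon as one marker appears anywhere in it, so an English phrase such as "seen elsewhere" is also caught because it contains the substring "en el". The sketch below shows the same idea on plain strings with a single compiled pattern anchored on word boundaries. The helper name `looks_spanish` is hypothetical, and ignoring case is an added assumption (the cell above is case-sensitive).

```python
import re

# Marker words copied from the filtering cell above.
SPANISH_MARKERS = ["muy", "bueno", "en el", "ninguno", "tiempo",
                   "teléfonos", "calidad de imagen"]

# One alternation, anchored on word boundaries so that a marker only
# matches as a whole word or phrase, not inside longer words.
_SPANISH_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in SPANISH_MARKERS) + r")\b",
    flags=re.IGNORECASE,  # assumption: case is ignored in this sketch
)

def looks_spanish(comment):
    """Return True if the comment contains at least one Spanish marker."""
    return bool(_SPANISH_RE.search(str(comment)))

comments = [
    "Great picture, works well",
    "Muy buena calidad de imagen",
    "I have seen elsewhere a cheaper one",  # contains the substring "en el"
    "No tiene buen brillo",                 # Spanish, but no marker present
]
english_only = [c for c in comments if not looks_spanish(c)]
print(english_only)
```

A marker list stays a heuristic either way: the last sample is Spanish but passes the filter because none of the markers occurs in it.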

## 3. Text Cleaning

 1. Convert the text to lower case  
 2. Expand contractions
 3. Remove the punctuation
 
 For example:   
 - doesn't -> does not
 - I’ve  -> I have

In [None]:
#Text cleaning

def text_cleaning(text):
    
    text = text.lower()
    # Remove line breaks
    text = re.sub('\n', '', str(text))
    # Expand contractions: n't with either apostrophe style,
    # the other forms with the typographic apostrophe (’) only
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"n\’t", " not", text)
    text = re.sub(r"\’re", " are", text)
    text = re.sub(r"\’s", " is", text)
    text = re.sub(r"\’d", " would", text)
    text = re.sub(r"\’ll", " will", text)
    text = re.sub(r"\’t", " not", text)
    text = re.sub(r"\’ve", " have", text)
    text = re.sub(r"\’m", " am", text)
    # Remove ASCII punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', str(text))
    return text
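
With the rules above, "I've" written with a straight apostrophe ends up as "ive" once punctuation is removed, and "won't" becomes "wo not". The sketch below is one possible variant, not the notebook's method: it normalises both apostrophe styles first and handles a few irregular negations. The names `expand_contractions` and `clean_title` and the small `IRREGULAR` table are illustrative assumptions.

```python
import re
import string

# Irregular negations handled before the generic "n't" rule.
# This short table is an illustrative assumption, not an exhaustive list.
IRREGULAR = {"won't": "will not", "can't": "can not", "shan't": "shall not"}

# Generic suffix rules, applied in order. "'s" is ambiguous (is / has /
# possessive); it is mapped to " is" to mirror the function above.
SUFFIX_RULES = [
    (r"n't\b", " not"),
    (r"'re\b", " are"),
    (r"'s\b", " is"),
    (r"'d\b", " would"),
    (r"'ll\b", " will"),
    (r"'ve\b", " have"),
    (r"'m\b", " am"),
]

def expand_contractions(text):
    """Lower-case the text and expand common English contractions."""
    text = str(text).lower().replace("\n", "")
    # Normalise typographic quotes so one rule set covers both styles.
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    for short, full in IRREGULAR.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    for pattern, repl in SUFFIX_RULES:
        text = re.sub(pattern, repl, text)
    return text

def clean_title(text):
    """Expand contractions, then strip ASCII punctuation."""
    text = expand_contractions(text)
    return text.translate(str.maketrans("", "", string.punctuation))

print(clean_title("Won't last long"))
print(clean_title("They\u2019re intrusive!"))
```

Normalising the apostrophes up front keeps a single rule per contraction instead of one rule per apostrophe style.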

In [None]:
df['Title']=df['Title'].apply(text_cleaning)
df

Unnamed: 0,Title,Review
0,read their privacy policy they are intrusive,Have you checked out their ‘Privacy Policy’?Th...
1,great hardware but software does not work do ...,"I very rarely write reviews on Amazon, but I f..."
2,i am keeping it but,The stripped-down version of Android is a real...
3,lies,Who reviewed this POS and gave it 5 stars?1. B...
4,it does not work on most formats,The projector will play Netflix off it’s nativ...
...,...,...
289,not worth the price,"Easy to use, but it’s dark. It takes me a whil..."
290,not working with its own twitch app,"Upset, not working with its own Twitch app. us..."
291,poco brillo y baja resolución,"No tiene buen brillo, no sirve para jugar play..."
292,badluck,One day the nebula get heated more and suddenl...


## 4. Reset the Index (TBD)

In [None]:
df.index = np.arange(1, len(df)+1)
df['ID'] = df.index
df

Unnamed: 0,Title,Review,ID
1,read their privacy policy they are intrusive,Have you checked out their ‘Privacy Policy’?Th...,1
2,great hardware but software does not work do ...,"I very rarely write reviews on Amazon, but I f...",2
3,i am keeping it but,The stripped-down version of Android is a real...,3
4,lies,Who reviewed this POS and gave it 5 stars?1. B...,4
5,it does not work on most formats,The projector will play Netflix off it’s nativ...,5
...,...,...,...
290,not worth the price,"Easy to use, but it’s dark. It takes me a whil...",290
291,not working with its own twitch app,"Upset, not working with its own Twitch app. us...",291
292,poco brillo y baja resolución,"No tiene buen brillo, no sirve para jugar play...",292
293,badluck,One day the nebula get heated more and suddenl...,293


In [None]:
#Split each title into one word per row, keeping the ID of the review it came from
new_df = pd.DataFrame(df['Title'].str.split(' ').tolist(), index=df['ID']).stack()
new_df

ID    
1    0          read
     1         their
     2       privacy
     3        policy
     4          they
             ...    
292  4    resolución
293  0       badluck
294  0          wont
     1          last
     2          long
Length: 1626, dtype: object

In [None]:
#Reset index
new_df = new_df.reset_index([0, 'ID'])
new_df

Unnamed: 0,ID,0
0,1,read
1,1,their
2,1,privacy
3,1,policy
4,1,they
...,...,...
1621,292,resolución
1622,293,badluck
1623,294,wont
1624,294,last


In [None]:
#Renaming columns
new_df.columns = ['ID', 'Word']
new_df

Unnamed: 0,ID,Word
0,1,read
1,1,their
2,1,privacy
3,1,policy
4,1,they
...,...,...
1621,292,resolución
1622,293,badluck
1623,294,wont
1624,294,last


In [None]:
#Text cleaning

def text_cleaning(text):
    
    text = text.lower()
#    text = re.sub('\[.*?\]', '', str(text))
#    text = re.sub("\\W"," ",text) # remove special chars
#    text = re.sub('https?://\S+|www\.\S+', '', str(text))
#    text = re.sub('<.*?>+', '', str(text))
    text = re.sub('\n', '', str(text))
#    text = re.sub('\w*\d\w*', '', str(text))

    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"n\’t", " not", text)
    text = re.sub(r"\’re", " are", text)
    text = re.sub(r"\’s", " is", text)
    text = re.sub(r"\’d", " would", text)
    text = re.sub(r"\’ll", " will", text)
    text = re.sub(r"\’t", " not", text)
    text = re.sub(r"\’ve", " have", text)
    text = re.sub(r"\’m", " am", text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', str(text))
    return text

# Clean each word once (joining a single string and cleaning it a second time changes nothing)
df_text = new_df['Word'].apply(text_cleaning)

new_df['Word'] = df_text
new_df

Unnamed: 0,ID,Word
0,1,read
1,1,their
2,1,privacy
3,1,policy
4,1,they
...,...,...
1621,292,resolución
1622,293,badluck
1623,294,wont
1624,294,last


In [None]:
#Tokenize words

from nltk.tokenize import word_tokenize

def tokenizer(token_text):
    token_text = word_tokenize(token_text)
    return token_text

df_text=df_text.apply(tokenizer)
df_text

0             [read]
1            [their]
2          [privacy]
3           [policy]
4             [they]
            ...     
1621    [resolución]
1622       [badluck]
1623          [wont]
1624          [last]
1625          [long]
Name: Word, Length: 1626, dtype: object

In [None]:
new_df['Word']

0             read
1            their
2          privacy
3           policy
4             they
           ...    
1621    resolución
1622       badluck
1623          wont
1624          last
1625          long
Name: Word, Length: 1626, dtype: object

In [None]:
new_df_word = new_df['Word'] 
new_df_word

0             read
1            their
2          privacy
3           policy
4             they
           ...    
1621    resolución
1622       badluck
1623          wont
1624          last
1625          long
Name: Word, Length: 1626, dtype: object

In [None]:
#Tokenize words

from nltk.tokenize import word_tokenize

def tokenizer(token_text):
    token_text = word_tokenize(token_text)
    return token_text

df_text = new_df_word.apply(tokenizer)
df_text1 = " ".join(str(tokens) for tokens in df_text)
df_text1

"['read'] ['their'] ['privacy'] ['policy'] ['they'] ['are'] ['intrusive'] ['great'] ['hardware'] ['but'] ['software'] ['does'] ['not'] ['work'] [] ['do'] ['not'] ['buy'] ['unless'] ['you'] ['want'] ['to'] ['project'] ['from'] ['hdmi'] ['source'] ['only'] ['i'] ['am'] ['keeping'] ['it'] ['but'] ['lies'] ['it'] ['does'] ['not'] ['work'] ['on'] ['most'] ['formats'] ['needs'] ['to'] ['be'] ['brighter'] ['2'] ['weeks'] ['7'] ['etc'] ['support'] ['calls'] ['and'] [] ['76'] ['worth'] ['of'] ['hdmi'] ['cables'] ['and'] ['connectors'] ['but'] ['still'] ['does'] ['not'] ['work'] ['a'] ['good'] ['attempt'] ['but'] ['needs'] ['improvement'] ['lovely'] ['outside'] ['but'] ['the'] ['guts'] ['on'] ['this'] ['unit'] ['leaves'] ['something'] ['to'] ['be'] ['desired'] ['no'] ['longevity'] ['sporadic'] ['functionality'] ['it'] ['must'] ['be'] ['unlocked'] ['almost'] ['love'] ['it'] ['not'] ['worth'] ['the'] ['price'] ['terrible'] ['software'] ['and'] ['barely'] ['usable'] ['mala'] ['resolucion'] ['en'] [

## 5. Apply POS Tagging (TBD)

In [None]:
from nltk import word_tokenize, pos_tag, pos_tag_sents

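texts = new_df_word.tolist()

# Each entry of texts is a single word, so every word is tagged in isolation,
# without the context of the title it came from
tagged_texts = pos_tag_sents(map(word_tokenize, texts))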
tagged_texts

[[('read', 'NN')],
 [('their', 'PRP$')],
 [('privacy', 'NN')],
 [('policy', 'NN')],
 [('they', 'PRP')],
 [('are', 'VBP')],
 [('intrusive', 'JJ')],
 [('great', 'JJ')],
 [('hardware', 'NN')],
 [('but', 'CC')],
 [('software', 'NN')],
 [('does', 'VBZ')],
 [('not', 'RB')],
 [('work', 'NN')],
 [],
 [('do', 'VB')],
 [('not', 'RB')],
 [('buy', 'VB')],
 [('unless', 'IN')],
 [('you', 'PRP')],
 [('want', 'NN')],
 [('to', 'TO')],
 [('project', 'NN')],
 [('from', 'IN')],
 [('hdmi', 'NN')],
 [('source', 'NN')],
 [('only', 'RB')],
 [('i', 'NN')],
 [('am', 'VBP')],
 [('keeping', 'VBG')],
 [('it', 'PRP')],
 [('but', 'CC')],
 [('lies', 'NNS')],
 [('it', 'PRP')],
 [('does', 'VBZ')],
 [('not', 'RB')],
 [('work', 'NN')],
 [('on', 'IN')],
 [('most', 'JJS')],
 [('formats', 'NNS')],
 [('needs', 'NNS')],
 [('to', 'TO')],
 [('be', 'VB')],
 [('brighter', 'NN')],
 [('2', 'CD')],
 [('weeks', 'NNS')],
 [('7', 'CD')],
 [('etc', 'NN')],
 [('support', 'NN')],
 [('calls', 'NNS')],
 [('and', 'CC')],
 [],
 [('76', 'CD')]

In [None]:
new_df['pos_tag'] = tagged_texts
new_df['pos_tag']

0             [(read, NN)]
1          [(their, PRP$)]
2          [(privacy, NN)]
3           [(policy, NN)]
4            [(they, PRP)]
               ...        
1621    [(resolución, NN)]
1622       [(badluck, NN)]
1623          [(wont, NN)]
1624          [(last, JJ)]
1625          [(long, RB)]
Name: pos_tag, Length: 1626, dtype: object

In [None]:
new_df['pos_tag'].str.get(0)[20][1]

'NN'

In [None]:
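# Extract the tag by slicing the string form of each entry at fixed positions.
# Caveat: this only fits three-character tags; 'PRP$' comes out as 'RP$' and
# two-letter tags keep a leading quote (removed in a later cell).
new_df['pos_tag'] = new_df['pos_tag'].astype(str)
L = new_df['pos_tag'].str.get(-6)+new_df['pos_tag'].str.get(-5)+new_df['pos_tag'].str.get(-4)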

In [None]:
new_df['tag'] = L

In [None]:
new_df[['ID', 'Word', 'tag']]

Unnamed: 0,ID,Word,tag
0,1,read,'NN
1,1,their,RP$
2,1,privacy,'NN
3,1,policy,'NN
4,1,they,PRP
...,...,...,...
1621,292,resolución,'NN
1622,293,badluck,'NN
1623,294,wont,'NN
1624,294,last,'JJ


In [None]:
new_df['tag'] = new_df['tag'].str.replace("'", '')
new_df['tag']

0        NN
1       RP$
2        NN
3        NN
4       PRP
       ... 
1621     NN
1622     NN
1623     NN
1624     JJ
1625     RB
Name: tag, Length: 1626, dtype: object
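
The `RP$` values in the output above come from slicing the string form of each tagged entry at fixed positions, which cuts the first letter of `PRP$`. A minimal sketch of reading the tag straight from the `(word, tag)` tuple instead is shown below. The sample data copies a few entries from the outputs above, and the helper name `first_tag` is hypothetical.

```python
# A few entries in the shape returned by pos_tag_sents above:
# one list per word, holding a single (word, tag) tuple, or empty.
tagged_texts = [[("read", "NN")], [("their", "PRP$")], [], [("formats", "NNS")]]

def first_tag(tagged):
    """Return the tag of the first (word, tag) pair, or None if empty."""
    return tagged[0][1] if tagged else None

tags = [first_tag(t) for t in tagged_texts]
print(tags)  # ['NN', 'PRP$', None, 'NNS']

# For comparison, the fixed-position slice on the string form:
as_text = str(tagged_texts[1])  # "[('their', 'PRP$')]"
print(as_text[-6] + as_text[-5] + as_text[-4])  # RP$
```

In pandas, the same result should be reachable with `new_df['pos_tag'].str.get(0).str.get(1)`, provided it is applied before the column is converted to strings.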

In [None]:
new_df[['ID','Word','tag']].head(20)

Unnamed: 0,ID,Word,tag
0,1,read,NN
1,1,their,RP$
2,1,privacy,NN
3,1,policy,NN
4,1,they,PRP
5,1,are,VBP
6,1,intrusive,JJ
7,2,great,JJ
8,2,hardware,NN
9,2,but,CC


In [None]:
new_df = new_df.dropna()
new_df

Unnamed: 0,ID,Word,pos_tag,tag
0,1,read,"[('read', 'NN')]",NN
1,1,their,"[('their', 'PRP$')]",RP$
2,1,privacy,"[('privacy', 'NN')]",NN
3,1,policy,"[('policy', 'NN')]",NN
4,1,they,"[('they', 'PRP')]",PRP
...,...,...,...,...
1621,292,resolución,"[('resolución', 'NN')]",NN
1622,293,badluck,"[('badluck', 'NN')]",NN
1623,294,wont,"[('wont', 'NN')]",NN
1624,294,last,"[('last', 'JJ')]",JJ


Examples:
- it does not work on most formats --> not work formats
- This projector has good hardware but it also has bad software --> good hardware / bad software
- 2 weeks 7 support calls but still does not work --> not work
- read their policy, they are intrusive --> policy intrusive

How could this be done?
- Loop over all review IDs.
- If the tag is NN, NNS, JJ or RB, append the word to the list for that ID.
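
The two steps above can be sketched in plain Python as follows. The rows copy IDs 1 and 5 from the outputs above (with the full `PRP$` tag), and the function name `keywords_by_id` is hypothetical. This is only one way to write the loop.

```python
from collections import defaultdict

# Rows in the shape of new_df[['ID', 'Word', 'tag']].
rows = [
    (1, "read", "NN"), (1, "their", "PRP$"), (1, "privacy", "NN"),
    (1, "policy", "NN"), (1, "they", "PRP"), (1, "are", "VBP"),
    (1, "intrusive", "JJ"),
    (5, "it", "PRP"), (5, "does", "VBZ"), (5, "not", "RB"),
    (5, "work", "NN"), (5, "on", "IN"), (5, "most", "JJS"),
    (5, "formats", "NNS"),
]

KEEP_TAGS = {"NN", "NNS", "JJ", "RB"}

def keywords_by_id(rows, keep_tags=KEEP_TAGS):
    """Group the kept words by review ID, preserving word order."""
    kept = defaultdict(list)
    for review_id, word, tag in rows:
        if tag in keep_tags:
            kept[review_id].append(word)
    return {review_id: " ".join(words) for review_id, words in kept.items()}

print(keywords_by_id(rows))
# {1: 'read privacy policy intrusive', 5: 'not work formats'}
```

An ID whose words are all filtered out does not appear in the result. On the dataframe itself, `new_df[new_df['tag'].isin(list_tag)].groupby('ID')['Word'].agg(' '.join)` should give the same grouping without an explicit loop.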


In [None]:
ID_1 = new_df['ID'] ==1
ID_2 = new_df['ID'] ==2
ID_3 = new_df['ID'] ==3
ID_4 = new_df['ID'] ==4
ID_5 = new_df['ID'] ==5
ID_6 = new_df['ID'] ==6
ID_7 = new_df['ID'] ==7

NN = new_df['tag'] =='NN'
NNS = new_df['tag'] =='NNS'
PRP = new_df['tag'] =='PRP'
ADJ = new_df['tag'] =='JJ'
RB = new_df['tag'] =='RB'

new_df[ID_5]

Unnamed: 0,ID,Word,pos_tag,tag
33,5,it,"[('it', 'PRP')]",PRP
34,5,does,"[('does', 'VBZ')]",VBZ
35,5,not,"[('not', 'RB')]",RB
36,5,work,"[('work', 'NN')]",NN
37,5,on,"[('on', 'IN')]",IN
38,5,most,"[('most', 'JJS')]",JJS
39,5,formats,"[('formats', 'NNS')]",NNS


In [None]:
list_tag = ["NN", "NNS", "JJ", "RB"]

df_tmp = new_df[ID_1]
df_test = df_tmp[ df_tmp["tag"].isin(list_tag) ]
print(' '.join(df_test["Word"].to_list()))
df_tmp

read privacy policy intrusive


Unnamed: 0,ID,Word,pos_tag,tag
0,1,read,"[('read', 'NN')]",NN
1,1,their,"[('their', 'PRP$')]",RP$
2,1,privacy,"[('privacy', 'NN')]",NN
3,1,policy,"[('policy', 'NN')]",NN
4,1,they,"[('they', 'PRP')]",PRP
5,1,are,"[('are', 'VBP')]",VBP
6,1,intrusive,"[('intrusive', 'JJ')]",JJ


## 6. Apply Cao's package:
- Separate the sentence into one part per subject

For example: "this is a good movie, but the room is too dark." -> "this is a good movie" & "but the room is too dark."
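
The code in this section reaches that split by grouping word ids that are linked by a dependency relation, then merging the groups that share an id. The helper `merge_list_common_elements` is called in the code but not defined in this section, so the sketch below is only an assumption of what such a merge step could look like. It uses a small union-find instead of networkx, and the name `merge_common` is hypothetical.

```python
def merge_common(groups):
    """Merge lists that share at least one element (connected components)."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for group in groups:
        for item in group:
            parent.setdefault(item, item)
        # Link every element of the group to its first element.
        for item in group[1:]:
            parent[find(item)] = find(group[0])

    merged = {}
    for item in parent:
        merged.setdefault(find(item), []).append(item)
    # Sort the ids inside each cluster, then order clusters by first id.
    return sorted(sorted(cluster) for cluster in merged.values())

print(merge_common([[4, 5], [1, 2], [2, 3], [7], [5, 6]]))
# [[1, 2, 3], [4, 5, 6], [7]]
```

Each resulting cluster is a set of word positions that belong to the same subject. The code below then maps those positions back to words.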

In [None]:
"""
Created on Mon Mar 29 14:13:12 2021

@author: Cao Tri DO
"""

import nltk

import networkx 
from networkx.algorithms.components.connected import connected_components
import pandas as pd

from nltk.corpus import stopwords
import stanza
import collections


#%% subfunctions

def theme_extraction(txt, stop_words, nlp, threshold = 7):
    """
    Function to extract the subjects (themes) from a verbatim

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    stop_words : list
        list of stop words to ignore when extracting subjects.
    nlp : stanza object
        stanza object engine.
    threshold : int, optional
        maximum number of tokens for a sentence to be kept whole as a
        single subject. The default is 7.

    Returns
    -------
    lst_subject : list
        list of extracted subjects, each one a list of words.

    """
    try:
        # Lower the text
        doc = nlp(txt.lower())
        # Initiate void list
        lst_subject = []
        
        # Loop for all sentences in the verbatims
        for sent in doc.sentences:
            # If the sentence is too short, keep all of it as one subject
            if len(sent.tokens) <= threshold:          
                # print("Sentence is too short. Keep all words")
                finalcluster = [word.text for word in sent.words]
                lst_subject.append(finalcluster)  
            # For the sentences that are long enough
            else:
                # Extract subject using the POS tags and the dependency relations (deprel)
                # print("Sentence is long. Apply subject extraction")
                finalcluster = theme_clustering(sent.text, stop_words, nlp)
                
                # Only apply the subject engine if there are extracted subject
                if len(finalcluster)>0: # case when there are issues e.g., 'PRPBLEME D IMPRESSION?????????????????????????????????'
                    ##################################################################
                    # Group first element if there are many of them
                    ##################################################################
                    # list of size of element
                    lst_size = [len(element) for element in finalcluster]
                    # index of the first element with more than one word
                    try:
                        first_element = next(x[0] for x in enumerate(lst_size) if x[1] > 1)
                    except StopIteration: # no element has more than one word
                        first_element = len(lst_size)-1
                    
                    finalcluster_first_group = []
                    tmp_first_element = []
                    for idx_element in range(0,first_element+1):
                        element = finalcluster[idx_element]
                        tmp_first_element.extend(element)
                    finalcluster_first_group.append(tmp_first_element)
                    for idx_element in range(first_element+1, len(finalcluster)):
                        element = finalcluster[idx_element]
                        finalcluster_first_group.append(element)
                    
                    ##################################################################
                    
                    finalcluster_clean = []
                    count = 0
                    tmp = []
                    for element_idx in range(0,len(finalcluster_first_group)):
                        element = finalcluster_first_group[element_idx]
                        if element_idx == 0 and len(element)== 1:
                            tmp = element
                        elif len(tmp)>0 or len(element)>1:
                            tmp.extend(element)
                            finalcluster_clean.append(tmp)
                            tmp = []
                            count = len(finalcluster_clean)-1
                        else:
                            tmp.extend(element)
                            finalcluster_clean[count].extend(tmp)
                            tmp = []
                    
                    for element in finalcluster_clean:
                        lst_subject.append(element)    
    except Exception:
        # On any failure, print the offending verbatim and return no subject
        print(txt)
        lst_subject = []
    
    return lst_subject


#%% utilities functions

def get_meta_data_nlp(txt, nlp):
    """
    Function to transform stanza object result to dataframe

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    nlp : stanza object
        stanza object engine.

    Returns
    -------
    df : dataframe
        dataframe allowing to analyze the results from Stanza.

    """
    doc = nlp(txt)
    
    df = pd.DataFrame()
    count = 1
    for sentence in doc.sentences:
        print(sentence.text)
        for word in sentence.words:
            df.loc[count, "word"]  = word.text
            df.loc[count, "lemma"] = word.lemma
            # df.loc[count, "feats"] = word.feats 
            df.loc[count, "pos"] = word.pos
            df.loc[count, "upos"] = word.upos
            df.loc[count, "xpos"] = word.xpos
            df.loc[count, "id head"] = word.head
            df.loc[count, "word root"] = sentence.words[word.head-1].text if word.head > 0 else "root"
            df.loc[count, "deprel"] = word.deprel
            # print(word.text, " | ", word.lemma, " | ", word.pos)
            count = count +1
    print(df)
    
    return df


def get_monolabel(X_test):
    """
    Split a dataframe into monolabel and multilabel rows

    Parameters
    ----------
    X_test : dataframe
        containing the data, with an "id_unique" column.

    Returns
    -------
    df_monolabel : dataframe
        rows whose id_unique appears only once.
    df_multilabel : dataframe
        rows whose id_unique appears more than once.
    """
    df_theme = X_test.copy()
    mask = df_theme["id_unique"].value_counts()
    lst_unique_idx = list(mask[mask>1].index)
    df_monolabel = X_test[~X_test['id_unique'].isin(lst_unique_idx)].copy()
    df_multilabel = X_test[X_test['id_unique'].isin(lst_unique_idx)].copy()
    return df_monolabel, df_multilabel


#%% subfunctions for the code 

def theme_clustering(txt, stop_words, nlp):
    """
    Main program to cluster the subject

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    stop_words : list
        list of stop to not considered when extract subject.
    nlp : stanza object
        stanza object engine.

    Returns
    -------
    lst_merged_cluster_word_all : list
        list of subject identified by the algorithm.

    """
    # sentList = clean_txt(txt)
    sentList = clean_txt(txt, nlp)


    # for line in sentList:
    lst_merged_cluster_word_all = []
    for line in sentList.sentences:
        
        # NLTK POS Tagging
        # taggedList = nltk_pos_tag(line)
        # taggedList = stanza_pos_tag(line)

        # join consecutive noun
        # finaltxt, newwordList, taggedList_new, bool_valid_txt = join_consecutive_noun(taggedList, nlp, noun_tag = "NOUN")
        bool_valid_txt = True
        
        if bool_valid_txt:
            # Create Stanford object model
            # doc = nlp(txt) # Object of Stanford NLP Pipeleine
            taggedList_new, taggedList_id_new = stanza_pos_tag(line)
            
            # Get the dependency relations between the words and
            # convert them into the appropriate format.
            # Note: get_dependencies only reads the first sentence of sentList.
            dep_node, dep_node_id = get_dependencies(sentList)
    
            # we will select only those sublists from the <dep_node> that could probably contain the features
            totalfeatureList, featureList, featureList_id, categories = select_potential_features(taggedList_new, taggedList_id_new)
            
            # Now using the <dep_node> list and the <featureList> we will determine 
            # to which of the words these features in the feature list are related to.
            fcluster = create_relation(featureList, dep_node)
            fcluster_id = create_relation(featureList_id, dep_node_id)
            
            # finalcluster_out, finalcluster, dic = select_NN_feature_relation(totalfeatureList, fcluster)
            
            # Merge list commonelements
            lst_merged_cluster_id = merge_list_common_elements(fcluster_id)
            
            # Order the id within the list
            lst_merged_cluster_id_sort1 = sort_list_list(lst_merged_cluster_id)
            
            # Order the element by id of 1st element
            lst_merged_cluster_id_keysort = sort_list_by_key(lst_merged_cluster_id_sort1)
            
            # Bug fix: if a list has only 1 element and falls within the range of the previous list, merge it into that list
            # The first list (index 0) is never merged
            lst_merged_cluster_id_no_orphelin = combine_orphelin_list(lst_merged_cluster_id_keysort)
            
            
            # lst_merged_cluster_id_sorted = sort_list_list(lst_merged_cluster_id)
            lst_merged_cluster_id_sorted = sort_list_list(lst_merged_cluster_id_no_orphelin)
            
            lst_merged_cluster_word = convert_idx_lst_to_word_lst(lst_merged_cluster_id_sorted, taggedList_new)
            
            
        else:
            lst_merged_cluster_word = []   
        
        for element in lst_merged_cluster_word:
            lst_merged_cluster_word_all.append(element)
    
    return lst_merged_cluster_word_all


def combine_orphelin_list(lst_merged_cluster_id_keysort):
    """
    Function to merge single-element ("orphelin", i.e. orphan) lists into the
    previous list when their id falls within the range of that list
    For example : [[1,3], [2], [4]] will give [[1,3,2], [4]]
    (the ids inside each list are sorted in a later step)

    Parameters
    ----------
    lst_merged_cluster_id_keysort : list
        original list to combine.

    Returns
    -------
    lst_merged_cluster_id_no_orphelin : list
        combined list.

    """
    
    # Initialisation
    lst_merged_cluster_id_no_orphelin = []
    
    # Make a copy
    lst_merged_cluster_id_no_orphelin.append(lst_merged_cluster_id_keysort[0].copy())
    id_to_compare = 0
    
    # Loop
    for i in range(1,len(lst_merged_cluster_id_keysort)):
        # Only if the list is with 1 element
        if len(lst_merged_cluster_id_keysort[i])==1:
            id_element = lst_merged_cluster_id_keysort[i][0]
            # Meta data of previous list
            previous_list = lst_merged_cluster_id_no_orphelin[id_to_compare]
            min_id_previous_list = min(previous_list)
            max_id_previous_list = max(previous_list)
            # If new list within the range of previous list, merge with it
            if id_element>=min_id_previous_list and id_element<=max_id_previous_list:
                lst_merged_cluster_id_no_orphelin[id_to_compare].extend(lst_merged_cluster_id_keysort[i].copy())
            else:
                lst_merged_cluster_id_no_orphelin.append(lst_merged_cluster_id_keysort[i].copy())
                id_to_compare = len(lst_merged_cluster_id_no_orphelin)-1
        else:
            lst_merged_cluster_id_no_orphelin.append(lst_merged_cluster_id_keysort[i].copy())
            id_to_compare = len(lst_merged_cluster_id_no_orphelin)-1   
    return lst_merged_cluster_id_no_orphelin

def sort_list_by_key(lst_merged_cluster_id_sort1):
    """
    Function to order a list of lists by the first id of each list
    For example : [[2], [1,3], [4]] will give [[1,3], [2], [4]]

    Parameters
    ----------
    lst_merged_cluster_id_sort1 : list
        original list, unsorted.

    Returns
    -------
    final_list : list
        list ordered by the first id of each inner list.

    """
    dct_list = {}
    for i in range(0, len(lst_merged_cluster_id_sort1)):
        first_id = lst_merged_cluster_id_sort1[i][0]
        dct_list[first_id] = lst_merged_cluster_id_sort1[i]
    
    # sort dict key
    od = collections.OrderedDict(sorted(dct_list.items()))
    
    final_list = []
    for element in od:
        final_list.append(od[element])
    return final_list
    

def sort_list_list(lst_merged_cluster_id_no_orphelin):
    """
    Function to sort the word ids inside each list of a list of lists

    Parameters
    ----------
    lst_merged_cluster_id_no_orphelin : list
        list of lists of word ids.

    Returns
    -------
    lst_merged_cluster_id_sorted : list
        copy of the input in which each inner list is sorted in ascending order.

    """
    lst_merged_cluster_id_sorted = lst_merged_cluster_id_no_orphelin.copy()
    for i in range(0, len(lst_merged_cluster_id_sorted)):
        lst_merged_cluster_id_sorted[i] = sorted(lst_merged_cluster_id_no_orphelin[i]).copy()
    # Initiate the void dictionnary
    # dct_idx = {}
    # # Loop on all element of the list
    # for element in lst_merged_cluster_id_no_orphelin:
    #     count = element[0]
    #     dct_idx[count] = element
    # # sort the list
    # dct_idx_sorted = dict(sorted(dct_idx.items(), key=lambda item: item[1]))
    # lst_merged_cluster_id_sorted = []
    # for key_element in dct_idx_sorted:
    #     lst_merged_cluster_id_sorted.append(dct_idx_sorted[key_element])

    return lst_merged_cluster_id_sorted

def convert_idx_lst_to_word_lst(lst_merged_cluster_id, taggedList_new):
    """
    Function to convert the idx to the words

    Parameters
    ----------
    lst_merged_cluster_id : list
        list of index.
    taggedList_new : list
        list of (word, pos) tuples.

    Returns
    -------
    lst_merged_cluster_word : list
        list of words.

    """
    lst_merged_cluster_word = []
    for lst in lst_merged_cluster_id:
        lst_word = []
        for element in lst:
            word_tmp = taggedList_new[element-1][0]
            lst_word.append(word_tmp)
        lst_merged_cluster_word.append(lst_word)
    return lst_merged_cluster_word

def clean_txt(txt, nlp):
    """
    Function to clean the text by lowering it and parsing it with stanza

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    nlp : stanza object
        stanza object engine.

    Returns
    -------
    sentList : stanza object
        parsed document; its sentences are available in sentList.sentences.

    """
    txt = txt.lower() # LowerCasing the given Text
    sentList = nlp(txt)
    
    return sentList

def clean_txt_nltk(txt):
    """
    Function to clean the text by lowering it and splitting it into sentences using nltk

    Parameters
    ----------
    txt : str
        verbatim to analyze.

    Returns
    -------
    sentList : list
        list of sentences.

    """
    txt = txt.lower() # LowerCasing the given Text
    sentList = nltk.sent_tokenize(txt) # Splitting the text into sentences

    return sentList


def nltk_pos_tag(line):
    """
    for each sentence in the <sentList> tokenize it and perform POS Tagging and store it into a Tagged List
    """
    txt_list = nltk.word_tokenize(line) # Splitting up into words
    taggedList = nltk.pos_tag(txt_list) # Doing Part-of-Speech Tagging to each word
    return taggedList

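def stanza_pos_tag(line):
    """
    Build two parallel lists from a stanza sentence:
    (word text, POS) tuples and (word id, POS) tuples
    """
    taggedList = []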
    taggedList_id = []
    for word in line.words:
        taggedList.append((word.text, word.pos))
        taggedList_id.append((word.id, word.pos))
        
    return taggedList, taggedList_id

def join_consecutive_noun(taggedList, nlp, noun_tag = "NOUN"):
    """
    there are many instances where a feature is represented by multiple words 
    so we need to handle that first by joining multiple words features into a one-word feature
    
     noun_tag = "NOUN" (stanza) or NN (nltk)
    """
    newwordList = []
    flag = 0
    for i in range(0,len(taggedList)-1):
        if(taggedList[i][1]==noun_tag and taggedList[i+1][1]==noun_tag): # If two consecutive words are Nouns then they are joined together
            newwordList.append(taggedList[i][0]+taggedList[i+1][0])
            flag=1
        else:
            if(flag==1):
                flag=0
                continue
            newwordList.append(taggedList[i][0])
            if(i==len(taggedList)-2):
                newwordList.append(taggedList[i+1][0])

    finaltxt = ' '.join(word for word in newwordList) 
    
    # Tokenize and POS Tag the new sentence.
    # new_txt_list = nltk.word_tokenize(finaltxt)
    # wordsList = [w for w in new_txt_list if not w in stop_words]
    # taggedList = nltk.pos_tag(wordsList)
    # print(wordsList)
    doc = nlp(finaltxt)

    if len(finaltxt) > 0:
        taggedList, taggedList_id = stanza_pos_tag(doc.sentences[0])
        bool_valid_txt = True
    else:
        taggedList = []
        bool_valid_txt = False

    return finaltxt, newwordList, taggedList, bool_valid_txt

def get_dependencies(doc):
    """
    Function to extract the dependencies from the analysis from Stanza
    Only the first sentence of the document is read

    Parameters
    ----------
    doc : stanza object
        stanza object applied to a verbatim.

    Returns
    -------
    dep_node : list
        one [word, head word, relation] list per dependency
        (the head stays 0 for the root).
    dep_node_id : list
        one [word id, head id, relation] list per dependency.

    """
    # Getting the dependency relations between the words
    dep_node = []
    dep_node_id = []
    for dep_edge in doc.sentences[0].dependencies:
        dep_node.append([dep_edge[2].text, dep_edge[0].id, dep_edge[1]])
        dep_node_id.append([dep_edge[2].id, dep_edge[0].id, dep_edge[1]])
        
    # Coverting it into appropriate format
    for i in range(0, len(dep_node)):
        if (int(dep_node[i][1]) != 0):
            dep_node[i][1] = dep_node[(int(dep_node[i][1]) - 1)][0]
            
    return dep_node, dep_node_id

def select_potential_features(taggedList_new, taggedList_id_new):
    """
    Select the candidate words (every token except punctuation)
    from the POS-tagged sentence.

    Parameters
    ----------
    taggedList_new : list
        list of (word, POS tag) pairs.
    taggedList_id_new : list
        list of the matching entries built from the word ids.

    Returns
    -------
    totalfeatureList : list
        the (word, POS tag) pairs that are kept.
    featureList : list
        the same pairs, each converted to a list.
    featureList_id : list
        the matching id entries, each converted to a list.
    categories : list
        the kept words only.

    """
    totalfeatureList = []
    featureList_id = []
    featureList = []
    categories = []
    
    for idx in range(0,len(taggedList_new)):
        i = taggedList_new[idx]
        i_idx = taggedList_id_new[idx]
        # Keep every token that is not punctuation.
        # Stricter alternatives, kept for reference:
        # if(i[1]=='JJ' or i[1]=='NN' or i[1]=='JJR' or i[1]=='NNS' or i[1]=='RB'):
        # if(i[1]=='ADJ' or i[1]=='NOUN' or i[1]=='ADV' or i[1]=="VERB" or i[1]=="PRON" or i[1]=="DET"):
        if(i[1]!='PUNCT'):
            featureList.append(list(i)) # Kept entries, converted to lists
            totalfeatureList.append(i) # Same entries, in their original form
            featureList_id.append(list(i_idx))
            categories.append(i[0])

    return totalfeatureList, featureList, featureList_id, categories

def create_relation(featureList, dep_node):
    """
    Function to create relationship from the dep_node

    Parameters
    ----------
    featureList : list
        list of potential subjects.
    dep_node : list
        tree of dependencies.

    Returns
    -------
    fcluster : list
        list of clusters.

    """
    
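    # Dependency relations used to link words together.
    # stanza reports Universal Dependencies labels, so the older Stanford
    # labels in this list ("dobj", "agent", "neg", "prep_of", "acomp") are
    # not expected to match; keeping them is harmless.
    lst_deprel = ["nsubj", 
                  "acl:relcl", 
                  "obj", "dobj", 
                  "agent", 
                  "advmod", 
                  "amod", 
                  "neg", 
                  "prep_of", 
                  "acomp", 
                  "xcomp", 
                  "compound",
                  "cop"
                  ]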
    
    fcluster = []
    
    for i in featureList:
        filist = []
        for j in dep_node:
            if((j[0]==i[0] or j[1]==i[0]) and (j[2] in lst_deprel)):
                if(j[0]==i[0]):
                    filist.append(j[1])
                else:
                    filist.append(j[0])
        fcluster.append([i[0], filist])
    
    return fcluster

def select_NN_feature_relation(totalfeatureList, fcluster):
    """
    Keep only the clusters whose main word is a noun.

    Parameters
    ----------
    totalfeatureList : list
        list of (word, POS tag) pairs for the candidate words.
    fcluster : list
        list of [word, related words] clusters.

    Returns
    -------
    finalcluster_out : list
        noun clusters, flattened into sorted word lists without duplicates.
    finalcluster : list
        noun clusters in their original [word, related words] form.
    dic : dict
        mapping from each word to its POS tag.

    """
    finalcluster = []
    dic = {}
    
    for i in totalfeatureList:
        dic[i[0]] = i[1]
    
    for i in fcluster:
        if(dic[i[0]]=="NOUN"):
            finalcluster.append(i)
    
    finalcluster_out = flat_list(finalcluster)
    
    return finalcluster_out, finalcluster, dic


#%%

def merge_list_common_elements(fcluster_id):
    """
    Function to merge the lists that have common elements

    Parameters
    ----------
    fcluster_id : list
        list of clusters (each one may contain one level of nested lists).

    Returns
    -------
    lst_merged_sorted : list
        merged clusters, each one sorted.

    """
    lst_flat = flat_list(fcluster_id)
    G = to_graph(lst_flat)
    set_merged = list(connected_components(G))
    lst_merged = list(map(list,set_merged))
    
    lst_merged_sorted = []
    for element in lst_merged:
        lst_merged_sorted.append(sorted(element))
    
    return lst_merged_sorted


def flat_list(finalcluster):
    """Flatten each cluster by one level, sort it and drop duplicated clusters."""
    lst_final = []
    for lst_tmp in finalcluster:
        lst_tmp_final = []
        for element in lst_tmp:
            if isinstance(element, list):
                lst_tmp_final.extend(element)
            else:
                lst_tmp_final.append(element)
        lst_tmp_final = sorted(lst_tmp_final)
        lst_final.append(lst_tmp_final)
    
    # Remove duplicated lists (the order of the result is not preserved)
    set_theme = set(map(tuple,lst_final))
    lst_theme = list(map(list,set_theme))
    
    return lst_theme


def to_graph(l):
    G = networkx.Graph()
    for part in l:
        # each sublist is a bunch of nodes
        G.add_nodes_from(part)
        # it also implies a number of edges:
        G.add_edges_from(to_edges(part))
    return G

def to_edges(l):
    """ 
        Treat `l` as a path and yield its edges:
        to_edges(['a','b','c','d']) -> [('a','b'), ('b','c'), ('c','d')]
    """
    it = iter(l)
    # Default to None so that an empty list simply yields no edge
    last = next(it, None)

    for current in it:
        yield last, current
        last = current    
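
The graph helpers above merge lists that share at least one element by taking connected components. A dependency-free sketch of the same idea, using a small union-find instead of networkx, might look like the following. `merge_overlapping` is a hypothetical name and the input is made up for illustration.

```python
def merge_overlapping(lists):
    """Merge lists that share at least one element (connected components)."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for group in lists:
        for item in group:
            parent.setdefault(item, item)
        # Link every item of the group to the group's first item
        for item in group[1:]:
            parent[find(item)] = find(group[0])

    components = {}
    for item in parent:
        components.setdefault(find(item), []).append(item)
    return sorted(sorted(c) for c in components.values())

merged = merge_overlapping([[1, 2], [2, 3], [5, 6], [7]])
print(merged)  # [[1, 2, 3], [5, 6], [7]]
```

Unlike `merge_list_common_elements`, this sketch also sorts the outer list, which makes the result easier to compare in a test.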

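The noun-merging step earlier in this cell can be illustrated on a hand-tagged sentence. The sketch below is a simplified variant written for this example: `merge_noun_runs` is a hypothetical helper, the tags are written by hand rather than produced by stanza, and a whole run of nouns is collapsed into one token instead of being joined pair by pair.

```python
def merge_noun_runs(tagged, noun_tag="NOUN"):
    """Join each run of consecutive nouns into a single token."""
    merged = []
    run = []  # nouns collected for the current run
    for word, tag in tagged:
        if tag == noun_tag:
            run.append(word)
        else:
            if run:
                merged.append("".join(run))
                run = []
            merged.append(word)
    if run:  # flush a run that ends the sentence
        merged.append("".join(run))
    return merged

# Hand-written tags, for illustration only
sample = [("the", "DET"), ("battery", "NOUN"), ("life", "NOUN"),
          ("is", "AUX"), ("short", "ADJ")]
merged_words = merge_noun_runs(sample)
print(merged_words)  # ['the', 'batterylife', 'is', 'short']
```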

In [None]:
def initialize_stanza():
    """
    Function to initialize the stanza engine (default parameters for Engie)

    Returns
    -------
    nlp : object
        stanza object.

    """
    nlp = stanza.Pipeline(lang="en", 
                          use_gpu = True, 
                          verbose = False, 
                          processors= {
                                          'tokenize':'default', 
                                          'pos' :'default',
                                          'lemma':'default', 
                                          'depparse':'default',
                                        })
    return nlp

In [None]:
# Define the stanza object
stanza.download('en')
nlp = initialize_stanza()

2021-10-03 04:48:05 INFO: Downloading default packages for language: en (English)...
2021-10-03 04:48:09 INFO: File exists: /root/stanza_resources/en/default.zip.
2021-10-03 04:48:16 INFO: Finished downloading models and saved to /root/stanza_resources.


In [None]:
txt = 'the movie is so bad! i will never see it again!'
# Use our custom phrase extraction code
result = theme_extraction(txt = txt, stop_words = [], nlp = nlp, threshold = 7)

for tmp_text in result:
    print(tmp_text)

['the', 'movie', 'is', 'so', 'bad', '!']
['i', 'will', 'never', 'see', 'it', 'again', '!']
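
To picture how `get_dependencies` and `create_relation` work together on the first sentence above, here is a small sketch on hand-written dependency triples. The triples follow Universal Dependencies conventions but are not real stanza output, and `resolve_heads`, `related_words` and the reduced relation set are assumptions made for this example.

```python
# Each triple is [dependent word, head id, relation]; ids are 1-based and
# a head id of 0 marks the root. Hand-written for "the movie is so bad".
triples = [
    ["the", 2, "det"],
    ["movie", 5, "nsubj"],
    ["is", 5, "cop"],
    ["so", 5, "advmod"],
    ["bad", 0, "root"],
]

def resolve_heads(triples):
    """Replace each head id with the head word (the root keeps 0)."""
    words = [t[0] for t in triples]
    return [[w, words[h - 1] if h != 0 else 0, rel] for w, h, rel in triples]

def related_words(word, edges, relations):
    """Collect the words linked to `word` through one of the kept relations."""
    out = []
    for dep, head, rel in edges:
        if rel in relations and word in (dep, head):
            out.append(head if dep == word else dep)
    return out

edges = resolve_heads(triples)
kept = {"nsubj", "cop", "advmod", "amod"}  # reduced set, for illustration
movie_cluster = related_words("movie", edges, kept)
bad_cluster = related_words("bad", edges, kept)
print(movie_cluster)  # ['bad']
print(bad_cluster)    # ['movie', 'is', 'so']
```

Matching is done on the word text, as in `create_relation`, so a word that appears twice in a sentence would share one cluster.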


In [None]:
def cao_package(text):
  # Reuse the pipeline created above: building a new one on every call is slow
  result = theme_extraction(txt = text, stop_words = [], nlp = nlp, threshold = 7)
  return result

In [None]:
try_df = df.head(5)
try_df

Unnamed: 0,Title,Review,ID
1,read their privacy policy they are intrusive,Have you checked out their ‘Privacy Policy’?Th...,1
2,great hardware but software does not work do ...,"I very rarely write reviews on Amazon, but I f...",2
3,i am keeping it but,The stripped-down version of Android is a real...,3
4,lies,Who reviewed this POS and gave it 5 stars?1. B...,4
5,it does not work on most formats,The projector will play Netflix off it’s nativ...,5


## 7. Apply Rake package
- Extract the key information from each sentence.

For example: "this is a good movie" & "but the room is too dark." -> "good movie" & "room dark"
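
RAKE builds its candidate phrases by splitting the text at stopwords and punctuation, then ranks them. The sketch below shows only the splitting step in plain Python, with a tiny hand-picked stopword set. `candidate_phrases` is a hypothetical helper and not part of `rake_nltk`, which also scores and orders the phrases.

```python
import re

# Tiny stopword set chosen for this example (the notebook uses NLTK's list)
STOPWORDS = {"this", "is", "a", "but", "the", "too", "does", "it"}

def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of consecutive non-stopword words."""
    phrases = []
    # Punctuation also ends a phrase, so split into fragments first
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(" ".join(current))
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases

print(candidate_phrases("this is a good movie"))
# ['good movie']
print(candidate_phrases("it does not work"))
# ['not work']
```

Because "not" is absent from the stopword set, the negation stays attached to the word that follows it. The notebook gets the same effect by removing "not" from NLTK's stopword list.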

In [None]:
# rake_nltk is a separate package; if the import fails, install it first:
# !pip install rake-nltk
from rake_nltk import Rake
from nltk.corpus import stopwords 

In [None]:
#nltk.download('stopwords')
stopword = list(stopwords.words('english'))
# Keep "not" so that negations such as "not work" stay in the key phrases
stopword.remove('not')

In [None]:
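def rake_package(text):
  # Extract the key phrases of a text with the custom stopword list
  r = Rake(stopwords = stopword)
  r.extract_keywords_from_text(text)  # stores the phrases on the Rake object
  extraction = r.get_ranked_phrases()  # phrases ordered from highest to lowest score
  return extraction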

In [None]:
df['Rake'] = df['Title'].apply(rake_package)
df

Unnamed: 0,Title,Review,ID,Rake
1,read their privacy policy they are intrusive,Have you checked out their ‘Privacy Policy’?Th...,1,"[privacy policy, read, intrusive]"
2,great hardware but software does not work do ...,"I very rarely write reviews on Amazon, but I f...",2,"[not buy unless, not work, hdmi source, great ..."
3,i am keeping it but,The stripped-down version of Android is a real...,3,[keeping]
4,lies,Who reviewed this POS and gave it 5 stars?1. B...,4,[lies]
5,it does not work on most formats,The projector will play Netflix off it’s nativ...,5,"[not work, formats]"
...,...,...,...,...
290,not worth the price,"Easy to use, but it’s dark. It takes me a whil...",290,"[not worth, price]"
291,not working with its own twitch app,"Upset, not working with its own Twitch app. us...",291,"[twitch app, not working]"
292,poco brillo y baja resolución,"No tiene buen brillo, no sirve para jugar play...",292,"[poco brillo, baja resolución]"
293,badluck,One day the nebula get heated more and suddenl...,293,[badluck]

