# Notebook: Subject extraction

## 1. Import Libraries

This notebook uses the following libraries:
- **nltk**: a widely used Python toolkit for NLP (https://github.com/nltk/nltk)
- **stanza**: the Stanford NLP Group's Python library for linguistic analysis (https://github.com/stanfordnlp/stanza)
- **rake_nltk**: a Python implementation of the RAKE algorithm using the NLTK library (https://github.com/csurfer/rake-nltk)

In [None]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
import seaborn as sns
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

from string import punctuation
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm

%matplotlib inline
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk import tokenize,WordNetLemmatizer, PorterStemmer
from nltk.corpus import wordnet

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


## 2. Read the Data

In [None]:
# Re-scraped dataset
df= pd.read_csv('https://raw.githubusercontent.com/emaleeeeeee/review_mini/main/REVIEWS%20RESCRAP.csv')
df.columns = ['Customer','Title','Review','Rating','Date']
df = df[['Title','Review']]
df = df.dropna()
spanish = df.index[(df['Review'].str.contains('muy')) | (df['Review'].str.contains('bueno'))| (df['Review'].str.contains('en el'))|  (df['Review'].str.contains('ninguno')) | (df['Review'].str.contains('tiempo'))| 
(df['Review'].str.contains('teléfonos'))| df['Review'].str.contains('calidad de imagen') ]
bad_df = df.index.isin(spanish)
df = df[~bad_df]
df

Unnamed: 0,Title,Review
0,Great projector,"Great item, worth every penny. I was skeptical..."
1,A little projector that packs a punch!,Bought this for my wife for her birthday. She ...
2,"Well thought-out, high-quality","Every time I use this projector, I like it mor..."
3,Truly amazing piece of hardware. I'm blown away!,First the not so great: Initially it is kind o...
4,Awesome little projector,Awesome little projector! Google and Hulu does...
...,...,...
1452,Low Res,"Only 480p so very blurry, was not acceptable f..."
1453,Not worth the price,"Easy to use, but it’s dark. It takes me a whil..."
1454,not working with its own Twitch app,"Upset, not working with its own Twitch app. us..."
1456,Badluck,One day the nebula get heated more and suddenl...
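
The chained `str.contains` calls in the filter above can be collapsed into a single pattern. The following is a minimal sketch on a toy frame, assuming the same marker words; `SPANISH_MARKERS` and `drop_spanish` are illustrative names and not part of the notebook.

```python
import re
import pandas as pd

# Marker words copied from the filter above; the helper names are hypothetical.
SPANISH_MARKERS = ['muy', 'bueno', 'en el', 'ninguno', 'tiempo',
                   'teléfonos', 'calidad de imagen']

def drop_spanish(frame, column='Review', markers=SPANISH_MARKERS):
    """Return the rows whose text contains none of the marker substrings."""
    pattern = '|'.join(re.escape(m) for m in markers)
    is_spanish = frame[column].str.contains(pattern, regex=True)
    return frame[~is_spanish]

toy = pd.DataFrame({'Review': ['Great item, worth every penny.',
                               'Muy bueno, la calidad de imagen es excelente.',
                               'Only 480p so very blurry.']})
kept = drop_spanish(toy)
```

Like the original filter, this is case-sensitive substring matching, so an English phrase such as "seven elephants" would also match the marker 'en el'.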


## Text cleaning

 1. Turn the text into lower case  
 2. Expand contractions
 3. Remove the punctuation
 
 For example:   
 - won't -> will not
 - I've  -> I have
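
One possible way to express the contraction step is a lookup table applied with a single regular expression. This is only a sketch: the table entries and the helper name `expand_contractions` are assumptions made for illustration, not the notebook's own list.

```python
import re

# Illustrative contraction table; the entries and helper name are assumptions.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "can not",
    "i'm": "i am",
    "i've": "i have",
    "you've": "you have",
}

def expand_contractions(text):
    """Lowercase text, unify apostrophes, then expand known contractions."""
    text = text.lower().replace("’", "'")
    # Try longer keys first so no key is shadowed by a shorter one
    keys = sorted(CONTRACTIONS, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text)

example = expand_contractions("I’ve tried it and it won't connect")
```

Normalising the curly apostrophe first means each contraction only has to be listed once.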

In [None]:
#Text cleaning
import re
import string
def text_cleaning(text):
    """Lowercase a review, split it at 'however'/'but', and expand contractions."""
    text = str(text).lower()
    # Start a new sentence before "however" and before a "but" that opens a new line
    text = re.sub(r"however", ". however", text)
    text = re.sub(r"\nbut", ". but", text)
    text = re.sub(r"\n", "", text)
    # Irregular contractions first, with either a straight or a curly apostrophe
    text = re.sub(r"won['’]t", "will not", text)
    text = re.sub(r"can['’]t", "can not", text)
    text = re.sub(r"i['’]m", "i am", text)
    text = re.sub(r"you['’]ve", "you have", text)
    # Generic suffixes (curly apostrophe only)
    text = re.sub(r"n’t", " not", text)  # "don’t" -> "do not"
    text = re.sub(r"’re", " are", text)
    text = re.sub(r"’s", " is", text)
    text = re.sub(r"’d", " would", text)
    text = re.sub(r"’ll", " will", text)
    text = re.sub(r"’ve", " have", text)
    text = re.sub(r"’m", " am", text)
    return text

In [None]:
df['Review']=df['Review'].apply(text_cleaning)
df

Unnamed: 0,Title,Review
0,Great projector,"great item, worth every penny. i was skeptical..."
1,A little projector that packs a punch!,bought this for my wife for her birthday. she ...
2,"Well thought-out, high-quality","every time i use this projector, i like it mor..."
3,Truly amazing piece of hardware. I'm blown away!,first the not so great: initially it is kind o...
4,Awesome little projector,awesome little projector! google and hulu does...
...,...,...
1452,Low Res,"only 480p so very blurry, was not acceptable f..."
1453,Not worth the price,"easy to use, but it is dark. it takes me a whi..."
1454,not working with its own Twitch app,"upset, not working with its own twitch app. us..."
1456,Badluck,one day the nebula get heated more and suddenl...


## Cut the sentence
1. We planned to use Cao's package, but it kept crashing the kernel by running out of RAM.
2. Even on only 5 titles, it took about 20 minutes.
3. So we use NLTK to split each review into sentences.

In [None]:
# for example:
text = "this actor is great. but the movie was awful. The theatre was amazing"
print(sent_tokenize(text))

['this actor is great.', 'but the movie was awful.', 'The theatre was amazing']


In [None]:
def nltk_sen(text):
  return sent_tokenize(text)

In [None]:
df['cut'] = df['Review'].apply(nltk_sen)

## Text cleaning: remove the list brackets

In [None]:
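# regex=False: "[" and "]" are regex metacharacters, so treat them as literal text
df['cut'] = df['cut'].astype(str).str.replace("]", '', regex=False)
df['cut'] = df['cut'].str.replace("[", '', regex=False)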
df['cut']

0       'great item, worth every penny.', 'i was skept...
1       'bought this for my wife for her birthday.', '...
2       'every time i use this projector, i like it mo...
3       "first the not so great: initially it is kind ...
4       'awesome little projector!', 'google and hulu ...
                              ...                        
1452    'only 480p so very blurry, was not acceptable ...
1453    'easy to use, but it is dark.', 'it takes me a...
1454    'upset, not working with its own twitch app.',...
1456    'one day the nebula get heated more and sudden...
1458              'excellent services and fast turn over'
Name: cut, Length: 1422, dtype: object

## Put it into a DataFrame to split each sentence

In [None]:
cut_df = pd.DataFrame(df['cut'])
cut_df.index = np.arange(1, len(df)+1)
cut_df

Unnamed: 0,cut
1,"'great item, worth every penny.', 'i was skept..."
2,"'bought this for my wife for her birthday.', '..."
3,"'every time i use this projector, i like it mo..."
4,"""first the not so great: initially it is kind ..."
5,"'awesome little projector!', 'google and hulu ..."
...,...
1418,"'only 480p so very blurry, was not acceptable ..."
1419,"'easy to use, but it is dark.', 'it takes me a..."
1420,"'upset, not working with its own twitch app.',..."
1421,'one day the nebula get heated more and sudden...


In [None]:
#Link each review to an ID by matching ID number to index number
cut_df.index = np.arange(1, len(cut_df)+1)
cut_df['ID'] = cut_df.index
cut_df

Unnamed: 0,cut,ID
1,"'great item, worth every penny.', 'i was skept...",1
2,"'bought this for my wife for her birthday.', '...",2
3,"'every time i use this projector, i like it mo...",3
4,"""first the not so great: initially it is kind ...",4
5,"'awesome little projector!', 'google and hulu ...",5
...,...,...
1418,"'only 480p so very blurry, was not acceptable ...",1418
1419,"'easy to use, but it is dark.', 'it takes me a...",1419
1420,"'upset, not working with its own twitch app.',...",1420
1421,'one day the nebula get heated more and sudden...,1421


In [None]:
#Splitting each review at every comma (one fragment per row) & keeping the ID of the right review
new_df = pd.DataFrame(cut_df['cut'].str.split(',').tolist(), index=cut_df['ID']).stack()
new_df

ID     
1     0                                          'great item
      1                                  worth every penny.'
      2     'i was skeptical about it after seeing commen...
      3     'in all actually i barely hear it when the vo...
      4     but it does an amazing job in cooling the pro...
                                 ...                        
1420  0                                               'upset
      1                not working with its own twitch app.'
      2                                    'useless for me.'
1421  0    'one day the nebula get heated more and sudden...
1422  0              'excellent services and fast turn over'
Length: 9693, dtype: object
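
Splitting the stringified list on ',' also cuts sentences at their internal commas, which is why 'great item' and 'worth every penny.' end up on separate rows above. If one row per sentence is wanted instead, a possible alternative is to keep the Python lists and use `DataFrame.explode` (available since pandas 0.25). The sketch below uses toy data and a naive regex splitter, `naive_sent_split` (a hypothetical stand-in for `sent_tokenize`), so that it runs without any NLTK downloads.

```python
import re
import pandas as pd

def naive_sent_split(text):
    """Stand-in for sent_tokenize: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

toy = pd.DataFrame({
    'ID': [1, 2],
    'Review': ['great item, worth every penny. i was skeptical at first.',
               'awesome little projector!'],
})

# Keep the sentences as real lists, then emit one row per list element
toy['Sentence'] = toy['Review'].apply(naive_sent_split)
sentences = (toy[['ID', 'Sentence']]
             .explode('Sentence')
             .reset_index(drop=True))
```

With this approach no bracket or quote clean-up is needed afterwards, because the sentences are never converted to a string representation of a list.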

In [None]:
#Reset index: drop the per-review fragment counter and turn ID back into a column
new_df = new_df.reset_index(level=1, drop=True).reset_index()
#Renaming columns
new_df.columns = ['ID', 'Sentence']

### Text cleaning: remove the leftover quote characters

In [None]:
new_df['Sentence'] = new_df['Sentence'].astype(str).str.replace("'", '')

In [None]:
new_df

Unnamed: 0,ID,Sentence
0,1,great item
1,1,worth every penny.
2,1,i was skeptical about it after seeing comment...
3,1,in all actually i barely hear it when the vol...
4,1,but it does an amazing job in cooling the pro...
...,...,...
9688,1420,upset
9689,1420,not working with its own twitch app.
9690,1420,useless for me.
9691,1421,one day the nebula get heated more and suddenl...


In [None]:
new_df.loc[new_df['ID']==390,'Sentence']

3616                       we love this little projector.
3617     we have used it inside and out and the pictur...
3618     what a fun way to watch football outside or m...
3619                              perfect christmas gift.
3620     customer service is also very responsive when...
Name: Sentence, dtype: object

## Tokenize & lemmatize the cut sentences

In [None]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
#nltk
wnl = WordNetLemmatizer()
def lemmatize(s):
     lemmatized = [wnl.lemmatize(word,get_wordnet_pos(word)) for word in word_tokenize(s) if word not in string.punctuation]
     return lemmatized

In [None]:
new_df['Sentence'] = new_df['Sentence'].apply(lemmatize)

In [None]:
new_df

Unnamed: 0,ID,Sentence
0,1,"[great, item]"
1,1,"[worth, every, penny]"
2,1,"[i, be, skeptical, about, it, after, see, comm..."
3,1,"[in, all, actually, i, barely, hear, it, when,..."
4,1,"[but, it, do, an, amaze, job, in, cool, the, p..."
...,...,...
9688,1420,[upset]
9689,1420,"[not, work, with, it, own, twitch, app]"
9690,1420,"[useless, for, me]"
9691,1421,"[one, day, the, nebula, get, heat, more, and, ..."
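
`get_wordnet_pos` tags each word on its own, so the tagger never sees the surrounding sentence. A possible variant is to tag the whole token list once with `nltk.pos_tag(tokens)` and then map every Penn Treebank tag to a WordNet part of speech. The sketch below shows only that mapping step, on hand-written `(word, tag)` pairs shaped like `pos_tag` output, so it runs without NLTK; `to_wordnet_pairs` is a hypothetical helper name.

```python
# WordNet POS codes, equal to wordnet.ADJ, wordnet.NOUN, wordnet.VERB, wordnet.ADV
PENN_TO_WORDNET = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}

def to_wordnet_pairs(tagged_tokens):
    """Map (word, Penn tag) pairs to (word, WordNet POS), defaulting to noun."""
    return [(word, PENN_TO_WORDNET.get(tag[:1].upper(), 'n'))
            for word, tag in tagged_tokens]

# Hand-written tags in the shape returned by nltk.pos_tag(word_tokenize(sentence))
tagged = [('it', 'PRP'), ('does', 'VBZ'), ('an', 'DT'),
          ('amazing', 'JJ'), ('job', 'NN')]
pairs = to_wordnet_pairs(tagged)
```

Each resulting pair could then be passed to `wnl.lemmatize(word, pos)` in place of the per-word lookup.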


### Text cleaning again -> preparation for RAKE

In [None]:
# regex=False: "[" and "]" are regex metacharacters, so treat them as literal text
new_df['Sentence'] = new_df['Sentence'].astype(str).str.replace("[", '', regex=False)
new_df['Sentence'] = new_df['Sentence'].str.replace("]", '', regex=False)
new_df['Sentence'] = new_df['Sentence'].str.replace("'", '', regex=False)
new_df['Sentence'] = new_df['Sentence'].str.replace(",", '', regex=False)
new_df['Sentence']

0                                              great item
1                                       worth every penny
2       i be skeptical about it after see comment on h...
3       in all actually i barely hear it when the volu...
4            but it do an amaze job in cool the projector
                              ...                        
9688                                                upset
9689                      not work with it own twitch app
9690                                       useless for me
9691    one day the nebula get heat more and suddenly ...
9692                 excellent service and fast turn over
Name: Sentence, Length: 9693, dtype: object

## Import Rake package for subject extraction

In [None]:
! pip install rake_nltk



In [None]:
# import RAKE
from rake_nltk import Rake
import nltk

In [None]:
# one example of how it works:


r = Rake() # Uses NLTK's English stopwords and all punctuation characters as phrase delimiters.

text='Bad picture quality of all products!'
a=r.extract_keywords_from_text(text)
b=r.get_ranked_phrases()
c=r.get_ranked_phrases_with_scores()

print(b)
print(c)

['bad picture quality', 'products']
[(9.0, 'bad picture quality'), (1.0, 'products')]
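
The scores above can be reproduced by hand. RAKE splits the text into candidate phrases at stop words and punctuation, scores each word as degree / frequency (degree counts the words it co-occurs with inside phrases, itself included), and sums the word scores per phrase. The following is a simplified sketch of that idea, not `rake_nltk`'s actual code; `mini_rake` and the two-word stop list are assumptions made for illustration.

```python
import re
from collections import defaultdict

def mini_rake(text, stop_words):
    """Score candidate phrases with RAKE's degree-to-frequency word ratio."""
    phrases = []
    # Punctuation ends a phrase, so split the text into fragments first
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stop_words:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    freq = defaultdict(int)    # occurrences of each word
    degree = defaultdict(int)  # co-occurrences within phrases, itself included
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    scores = {' '.join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(((s, p) for p, s in scores.items()), reverse=True)

ranked = mini_rake('Bad picture quality of all products!', stop_words={'of', 'all'})
```

Each of the three words in 'bad picture quality' has degree 3 and frequency 1, giving 3 + 3 + 3 = 9.0, while 'products' stands alone and scores 1.0.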


In [None]:
# define our stopwords

nltk.download('stopwords')
our_stopwords = ['also', 'would', 'one','still']
stopword = list(stopwords.words('english'))+our_stopwords
stopword.remove('not')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
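
Removing 'not' from the stop list matters because RAKE cuts phrases at every stop word: if 'not' is a stop word, the negation disappears from the extracted subject. A small sketch of the phrase-splitting step, with a toy stop list chosen for illustration (`candidate_phrases` and `toy_stops` are hypothetical names):

```python
def candidate_phrases(sentence, stop_words):
    """Split a sentence into runs of consecutive non-stop-words."""
    phrases, current = [], []
    for word in sentence.lower().split():
        if word in stop_words:
            if current:
                phrases.append(' '.join(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(' '.join(current))
    return phrases

# Toy stop list for illustration only; the notebook uses NLTK's English list
toy_stops = {'with', 'it', 'own', 'not'}
sentence = 'not work with it own twitch app'

with_not_removed = candidate_phrases(sentence, toy_stops)
with_not_kept = candidate_phrases(sentence, toy_stops - {'not'})
```

With 'not' treated as a stop word the first phrase is just 'work'; keeping it yields 'not work', which preserves the complaint.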


In [None]:
def rake_package(text):
  r = Rake(stopwords =stopword)
  a = r.extract_keywords_from_text(text)
  extraction = r.get_ranked_phrases()
  return extraction

In [None]:
new_df['Rake'] = new_df['Sentence'].apply(rake_package)

In [None]:
new_df.loc[new_df['ID']==390]

Unnamed: 0,ID,Sentence,Rake
3616,390,we love this little projector,"[little projector, love]"
3617,390,we have use it inside and out and the picture ...,"[use, picture, inside, great]"
3618,390,what a fun way to watch football outside or mo...,"[watch football outside, grown child want, fun..."
3619,390,perfect christmas gift,[perfect christmas gift]
3620,390,customer service be also very responsive when ...,"[customer service, add content, responsive, qu..."


### Text cleaning again -> preparation for split into dataframe

In [None]:
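# regex=False: "[" and "]" are regex metacharacters, so treat them as literal text
new_df['Rake'] = new_df['Rake'].astype(str).str.replace("]", '', regex=False)
new_df['Rake'] = new_df['Rake'].str.replace("[", '', regex=False)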
new_df['Rake']

0                                            'great item'
1                                     'worth every penny'
2               'see comment', 'skeptical', 'loud', 'fan'
3              'barely hear', 'volume', 'low', 'actually'
4                        'amaze job', 'projector', 'cool'
                              ...                        
9688                                              'upset'
9689                             'twitch app', 'not work'
9690                                            'useless'
9691    'nebula get heat', 'bad odour smell', 'switch'...
9692                     'fast turn', 'excellent service'
Name: Rake, Length: 9693, dtype: object

### Turn it into a DataFrame keyed by ID and split again

In [None]:
df1 = new_df [['ID','Rake']]

In [None]:
#Splitting each RAKE result into 1 phrase per row & keeping the ID of the right review
df1 = pd.DataFrame(new_df['Rake'].str.split(',').tolist(), index=new_df['ID']).stack()
df1

ID     
1     0            'great item'
      0     'worth every penny'
      0           'see comment'
      1             'skeptical'
      2                  'loud'
                   ...         
1421  5                  'lose'
      6                   'day'
      7                  'come'
1422  0             'fast turn'
      1     'excellent service'
Length: 33947, dtype: object

In [None]:
#Reset index: drop the per-review phrase counter and turn ID back into a column
df1 = df1.reset_index(level=1, drop=True).reset_index()
#Renaming columns
df1.columns = ['ID', 'subject']

### Final text cleaning

In [None]:
#Text cleaning
import re
import string
def text_cleaning(text):
    text = re.sub('[%s]' % re.escape(string.punctuation), '', str(text))
    text = text.strip()
    text = " ".join(text.split())
    return text

df1['subject_clean'] = df1['subject'].apply(text_cleaning)

In [None]:
df1['Len'] = df1['subject_clean'].apply(lambda x: len(word_tokenize(x)))
df1

Unnamed: 0,ID,subject,subject_clean,Len
0,1,'great item',great item,2
1,1,'worth every penny',worth every penny,3
2,1,'see comment',see comment,2
3,1,'skeptical',skeptical,1
4,1,'loud',loud,1
...,...,...,...,...
33942,1421,'lose',lose,1
33943,1421,'day',day,1
33944,1421,'come',come,1
33945,1422,'fast turn',fast turn,2


In [None]:
_deepnote_run_altair(df1, """{"$schema":"https://vega.github.io/schema/vega-lite/v4.json","mark":{"type":"bar","tooltip":{"content":"data"}},"height":220,"autosize":{"type":"fit"},"data":{"name":"placeholder"},"encoding":{"x":{"field":"subject_clean","type":"nominal","sort":{"field":"COUNT(*)","op":"count","order":"descending"},"scale":{"type":"linear","zero":false}},"y":{"field":"COUNT(*)","type":"quantitative","sort":null,"aggregate":"count","scale":{"type":"linear","zero":true},"bin":false},"color":{"field":"","type":"nominal","sort":null,"scale":{"type":"linear","zero":false}}}}""")

In [None]:
df1['subject_clean'].value_counts().head(10)

             985
use          589
projector    369
not          315
love         259
great        240
get          215
device       198
remote       179
size         163
Name: subject_clean, dtype: int64

In [None]:
df1

Unnamed: 0,ID,subject,subject_clean,Len
0,1,'great item',great item,2
1,1,'worth every penny',worth every penny,3
2,1,'see comment',see comment,2
3,1,'skeptical',skeptical,1
4,1,'loud',loud,1
...,...,...,...,...
33942,1421,'lose',lose,1
33943,1421,'day',day,1
33944,1421,'come',come,1
33945,1422,'fast turn',fast turn,2


In [None]:
#Exclude rows whose cleaned subject is empty (Len == 0)
df1 = df1[df1.Len != 0]
df1

Unnamed: 0,ID,subject,subject_clean,Len
0,1,'great item',great item,2
1,1,'worth every penny',worth every penny,3
2,1,'see comment',see comment,2
3,1,'skeptical',skeptical,1
4,1,'loud',loud,1
...,...,...,...,...
33942,1421,'lose',lose,1
33943,1421,'day',day,1
33944,1421,'come',come,1
33945,1422,'fast turn',fast turn,2


In [None]:
# Inspect the phrases that are exactly 8 words long
df1.loc[df1['Len']==8,'subject_clean'].value_counts().head(30)

castle crusher 4 player w friend watch movie                  1
internal power supply always need connection not easy         1
que se lo tenga conectado igual se descarga                   1
build quality nebula capsulethe nebula capsule feel denser    1
paso de 4 horas ante que se apague                            1
native amazon prime app — black panther look                  1
not install local apps like king 5 news                       1
support many android apps albeit not always perfect           1
picture quality doesnt seem quite 720p despite claim          1
living room plex app work great work great                    1
Name: subject_clean, dtype: int64

In [None]:
# Most frequent two-word phrases
df1.loc[df1['Len']==2,'subject_clean'].value_counts().head(50)

picture quality       102
not work               70
nebula capsule         57
customer service       56
sound quality          54
work great             50
battery life           49
bluetooth speaker      44
watch movie            40
work well              40
remote control         34
great product          32
portable projector     30
dark room              30
highly recommend       26
little projector       24
amazon prime           23
not use                23
great picture          22
movie night            21
great projector        20
image quality          19
mogo pro               19
customer support       19
good quality           18
app store              17
not get                17
screen mirror          16
not sure               16
really like            16
pretty good            16
hdmi cable             16
loud enough            15
much well              15
not wait               14
really good            13
phone app              13
small projector        13
5 star      

In [None]:
df1.loc[df1['ID']==41]

Unnamed: 0,ID,subject,subject_clean,Len
1105,41,'add apps',add apps,2
1106,41,'difficult',difficult,1
1107,41,'love',love,1


In [None]:
#2021/10/17
df1.to_csv('Review Extraction.csv',encoding='utf-8-sig')

In [None]:
_deepnote_run_altair(df1, """{"$schema":"https://vega.github.io/schema/vega-lite/v4.json","mark":{"type":"bar","tooltip":{"content":"data"}},"height":220,"autosize":{"type":"fit"},"data":{"name":"placeholder"},"encoding":{"x":{"field":"Len","type":"quantitative","sort":{"field":"COUNT(*)","op":"count","order":"descending"},"scale":{"type":"linear","zero":false},"axis":{"grid":false}},"y":{"field":"COUNT(*)","type":"quantitative","sort":{"field":"COUNT(*)","op":"count","order":"descending"},"aggregate":"count","scale":{"type":"linear","zero":true},"bin":false},"color":{"field":"","type":"nominal","sort":null,"scale":{"type":"linear","zero":false}}}}""")

## Cao's package (we failed to get it running)

In [None]:
from nltk.corpus import stopwords

nltk.download('stopwords')
our_stopwords = ['also', 'would', 'one','still','actually']
stopword = list(stopwords.words('english'))+our_stopwords
stopword.remove('not')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
df['Rake'] = df['Title'].apply(rake_package)
df['Rake'][0]

['great projector']

In [None]:
print('Original Text', df['Title'] [3])
print('After Rake', df['Rake'] [3])

Original Text Truly amazing piece of hardware. I'm blown away!
After Rake ['truly amazing piece', 'blown away', 'hardware']


In [None]:
"""
Created on Mon Mar 29 14:13:12 2021

@author: Cao Tri DO
"""

import nltk

import networkx 
from networkx.algorithms.components.connected import connected_components
import pandas as pd

from nltk.corpus import stopwords
import stanza
import collections


#%% subfunctions

def theme_extraction(txt, stop_words, nlp, threshold = 7):
    """
    Function to extract themes from a verbatim

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    stop_words : list
        list of stop words to ignore when extracting subjects.
    nlp : stanza object
        stanza object engine.
    threshold : int, optional
        Sentences with at most this many tokens are kept whole as a single subject. The default is 7.

    Returns
    -------
    lst_subject : list
        list of subject analyzed.

    """
    try:
        # Lower the text
        doc = nlp(txt.lower())
        # Initiate void list
        lst_subject = []
        
        # Loop over all sentences in the verbatim
        for sent in doc.sentences:
            # If the sentence is too short, keep all of it as one subject
            if len(sent._tokens) <= threshold:          
                # print("Sentence is too short. Keep all words")
                finalcluster = [word.text for word in sent.words]
                lst_subject.append(finalcluster)  
            # For the sentences that are long enough
            else:
                # Extract subject using the POS and xparel method
                # print("Sentence is long. Apply subject extraction")
                finalcluster = theme_clustering(sent.text, stop_words, nlp)
                
                # Only apply the subject engine if there are extracted subject
                if len(finalcluster)>0: # case when there are issues e.g., 'PRPBLEME D IMPRESSION?????????????????????????????????'
                    ##################################################################
                    # Group first element if there are many of them
                    ##################################################################
                    # list of size of element
                    lst_size = [len(element) for element in finalcluster]
                    # index of the first element with more than one word
                    try:
                        first_element = next(x[0] for x in enumerate(lst_size) if x[1] > 1)
                    except StopIteration: # if all are of size 1
                        first_element = len(lst_size)-1
                    
                    finalcluster_first_group = []
                    tmp_first_element = []
                    for idx_element in range(0,first_element+1):
                        element = finalcluster[idx_element]
                        tmp_first_element.extend(element)
                    finalcluster_first_group.append(tmp_first_element)
                    for idx_element in range(first_element+1, len(finalcluster)):
                        element = finalcluster[idx_element]
                        finalcluster_first_group.append(element)
                    
                    ##################################################################
                    
                    finalcluster_clean = []
                    count = 0
                    tmp = []
                    for element_idx in range(0,len(finalcluster_first_group)):
                        element = finalcluster_first_group[element_idx]
                        if element_idx == 0 and len(element)== 1:
                            tmp = element
                        elif len(tmp)>0 or len(element)>1:
                            tmp.extend(element)
                            finalcluster_clean.append(tmp)
                            tmp = []
                            count = len(finalcluster_clean)-1
                        else:
                            tmp.extend(element)
                            finalcluster_clean[count].extend(tmp)
                            tmp = []
                    
                    for element in finalcluster_clean:
                        lst_subject.append(element)    
    except Exception:  # on any failure, print the offending text and return no subject
        print(txt)
        lst_subject = []
    
    return lst_subject


#%% utilities functions

def get_meta_data_nlp(txt, nlp):
    """
    Function to transform stanza object result to dataframe

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    nlp : stanza object
        stanza object engine.

    Returns
    -------
    df : dataframe
        dataframe allowing to analyze the results from Stanza.

    """
    doc = nlp(txt)
    
    df = pd.DataFrame()
    count = 1
    for sentence in doc.sentences:
        print(sentence._text)
        for word in sentence.words:
            df.loc[count, "word"]  = word.text
            df.loc[count, "lemma"] = word.lemma
            # df.loc[count, "feats"] = word.feats 
            df.loc[count, "pos"] = word.pos
            df.loc[count, "upos"] = word.upos
            df.loc[count, "xpos"] = word.xpos
            df.loc[count, "id head"] = word.head
            df.loc[count, "word root"] = sentence.words[word.head-1].text if word.head > 0 else "root"
            df.loc[count, "deprel"] = word.deprel
            # print(word.text, " | ", word.lemma, " | ", word.pos)
            count = count +1
    print(df)
    
    return df


def get_monolabel(X_test):
    """
    Extract monolabel from dataframe

    Parameters
    ----------
    X_test : dataframe
        containing the data.

    Returns
    -------
    df_monolabel : dataframe
        Contains only the monolabel.
    df_multilabel : dataframe
        Contains only the multilabel.
    """
    df_theme = X_test.copy()
    mask = df_theme["id_unique"].value_counts()
    lst_unique_idx = list(mask[mask>1].index)
    df_monolabel = X_test[~X_test['id_unique'].isin(lst_unique_idx)].copy()
    df_multilabel = X_test[X_test['id_unique'].isin(lst_unique_idx)].copy()
    return df_monolabel, df_multilabel


#%% subfunctions for the code 

def theme_clustering(txt, stop_words, nlp):
    """
    Main program to cluster the subject

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    stop_words : list
        list of stop words to ignore when extracting subjects.
    nlp : stanza object
        stanza object engine.

    Returns
    -------
    lst_merged_cluster_word_all : list
        list of subject identified by the algorithm.

    """
    # sentList = clean_txt(txt)
    sentList = clean_txt(txt, nlp)


    # for line in sentList:
    lst_merged_cluster_word_all = []
    for line in sentList.sentences:
        
        # NLTK POS Tagging
        # taggedList = nltk_pos_tag(line)
        # taggedList = stanza_pos_tag(line)

        # join consecutive noun
        # finaltxt, newwordList, taggedList_new, bool_valid_txt = join_consecutive_noun(taggedList, nlp, noun_tag = "NOUN")
        bool_valid_txt = True
        
        if bool_valid_txt:
            # Create Stanford object model
            # doc = nlp(txt) # Object of Stanford NLP Pipeline
            taggedList_new, taggedList_id_new = stanza_pos_tag(line)
            
            # Getting the dependency relations between the words
            # Converting it into the appropriate format
            dep_node, dep_node_id = get_dependencies(sentList)
    
            # we will select only those sublists from the <dep_node> that could probably contain the features
            totalfeatureList, featureList, featureList_id, categories = select_potential_features(taggedList_new, taggedList_id_new)
            
            # Now using the <dep_node> list and the <featureList> we will determine 
            # to which of the words these features in the feature list are related to.
            fcluster = create_relation(featureList, dep_node)
            fcluster_id = create_relation(featureList_id, dep_node_id)
            
            # finalcluster_out, finalcluster, dic = select_NN_feature_relation(totalfeatureList, fcluster)
            
            # Merge list commonelements
            lst_merged_cluster_id = merge_list_common_elements(fcluster_id)
            
            # Order the id within the list
            lst_merged_cluster_id_sort1 = sort_list_list(lst_merged_cluster_id)
            
            # Order the element by id of 1st element
            lst_merged_cluster_id_keysort = sort_list_by_key(lst_merged_cluster_id_sort1)
            
            # Bug fix: if a list has only 1 element and that element lies within the range of the previous list, merge it into that list
            # The 1st list (index 0) is not taken into account
            lst_merged_cluster_id_no_orphelin = combine_orphelin_list(lst_merged_cluster_id_keysort)
            
            
            # lst_merged_cluster_id_sorted = sort_list_list(lst_merged_cluster_id)
            lst_merged_cluster_id_sorted = sort_list_list(lst_merged_cluster_id_no_orphelin)
            
            lst_merged_cluster_word = convert_idx_lst_to_word_lst(lst_merged_cluster_id_sorted, taggedList_new)
            
            
        else:
            lst_merged_cluster_word = []   
        
        for element in lst_merged_cluster_word:
            lst_merged_cluster_word_all.append(element)
    
    return lst_merged_cluster_word_all


def combine_orphelin_list(lst_merged_cluster_id_keysort):
    """
    Function to merge orphan (single-element) lists that fall within the range of the previous list
    For example : [[1,3], [2], [4]] will give [[1,3,2], [4]] (the ids are sorted in a later step)

    Parameters
    ----------
    lst_merged_cluster_id_keysort : list
        original list to combine.

    Returns
    -------
    lst_merged_cluster_id_no_orphelin : list
        combined list.

    """
    
    # Initialisation
    lst_merged_cluster_id_no_orphelin = []
    
    # Make a copy
    lst_merged_cluster_id_no_orphelin.append(lst_merged_cluster_id_keysort[0].copy())
    id_to_compare = 0
    
    # Loop
    for i in range(1,len(lst_merged_cluster_id_keysort)):
        # Only if the list is with 1 element
        if len(lst_merged_cluster_id_keysort[i])==1:
            id_element = lst_merged_cluster_id_keysort[i][0]
            # Meta data of previous list
            previous_list = lst_merged_cluster_id_no_orphelin[id_to_compare]
            min_id_previous_list = min(previous_list)
            max_id_previous_list = max(previous_list)
            # If new list within the range of previous list, merge with it
            if id_element>=min_id_previous_list and id_element<=max_id_previous_list:
                lst_merged_cluster_id_no_orphelin[id_to_compare].extend(lst_merged_cluster_id_keysort[i].copy())
            else:
                lst_merged_cluster_id_no_orphelin.append(lst_merged_cluster_id_keysort[i].copy())
                id_to_compare = len(lst_merged_cluster_id_no_orphelin)-1
        else:
            lst_merged_cluster_id_no_orphelin.append(lst_merged_cluster_id_keysort[i].copy())
            id_to_compare = len(lst_merged_cluster_id_no_orphelin)-1   
    return lst_merged_cluster_id_no_orphelin

def sort_list_by_key(lst_merged_cluster_id_sort1):
    """
    Sort a list of lists by the first id of each sublist.
    For example: [[2], [1, 3], [4]] gives [[1, 3], [2], [4]]

    Parameters
    ----------
    lst_merged_cluster_id_sort1 : list
        unsorted list of non-empty lists of ids.

    Returns
    -------
    final_list : list
        list of lists ordered by first id.

    """
    # Sorting on the first id directly keeps sublists that share a first id,
    # which a dict keyed on that id would silently overwrite.
    final_list = sorted(lst_merged_cluster_id_sort1, key=lambda lst: lst[0])
    return final_list
    

def sort_list_list(lst_merged_cluster_id_no_orphelin):
    """
    Sort the ids inside each sublist; the order of the sublists is unchanged.

    Parameters
    ----------
    lst_merged_cluster_id_no_orphelin : list
        list of lists of word ids.

    Returns
    -------
    lst_merged_cluster_id_sorted : list
        new list in which every sublist is sorted in ascending order.

    """
    lst_merged_cluster_id_sorted = [sorted(lst) for lst in lst_merged_cluster_id_no_orphelin]

    return lst_merged_cluster_id_sorted

def convert_idx_lst_to_word_lst(lst_merged_cluster_id, taggedList_new):
    """
    Convert lists of word ids into lists of words.

    Parameters
    ----------
    lst_merged_cluster_id : list
        list of lists of 1-based word ids.
    taggedList_new : list
        list of (word, tag) tuples, in sentence order.

    Returns
    -------
    lst_merged_cluster_word : list
        list of lists of words.

    """
    lst_merged_cluster_word = []
    for lst in lst_merged_cluster_id:
        lst_word = []
        for element in lst:
            word_tmp = taggedList_new[element-1][0]
            lst_word.append(word_tmp)
        lst_merged_cluster_word.append(lst_word)
    return lst_merged_cluster_word

def clean_txt(txt, nlp):
    """
    Lowercase the text and run the stanza pipeline on it
    (the pipeline splits the text into sentences).

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    nlp : stanza.Pipeline
        initialized stanza pipeline.

    Returns
    -------
    sentList : stanza Document
        annotated document; iterate over sentList.sentences to get the sentences.

    """
    txt = txt.lower() # LowerCasing the given Text
    sentList = nlp(txt)
    
    return sentList

def clean_txt_nltk(txt):
    """
    Lowercase the text and split it into sentences using nltk.

    Parameters
    ----------
    txt : str
        verbatim to analyze.

    Returns
    -------
    sentList : list
        list of sentences.

    """
    txt = txt.lower() # LowerCasing the given Text
    sentList = nltk.sent_tokenize(txt) # Splitting the text into sentences

    return sentList


def nltk_pos_tag(line):
    """
    Tokenize one sentence with nltk and return its list of (word, POS tag) tuples.
    """
    txt_list = nltk.word_tokenize(line) # Splitting up into words
    taggedList = nltk.pos_tag(txt_list) # Doing Part-of-Speech Tagging to each word
    return taggedList

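def stanza_pos_tag(line):
    """
    Build the tagged lists of one stanza sentence: (word, POS tag) tuples
    and the matching (word id, POS tag) tuples.
    """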
    taggedList = []
    taggedList_id = []
    for word in line.words:
        taggedList.append((word.text, word.pos))
        taggedList_id.append((word.id, word.pos))
        
    return taggedList, taggedList_id

def join_consecutive_noun(taggedList, nlp, noun_tag = "NOUN"):
    """
    A feature is often expressed by several words, so each pair of
    consecutive nouns is first joined into a single token, then the rebuilt
    sentence is tagged again with stanza.

    noun_tag = "NOUN" (stanza) or "NN" (nltk)
    """
    newwordList = []
    flag = 0
    for i in range(0,len(taggedList)-1):
        if(taggedList[i][1]==noun_tag and taggedList[i+1][1]==noun_tag): # If two consecutive words are nouns, they are joined together
            newwordList.append(taggedList[i][0]+taggedList[i+1][0])
            flag=1
        else:
            if(flag==1):
                # Word i is already part of the previous joined pair: skip it
                flag=0
                # ...but keep the last word, which the loop never visits on its own
                if(i==len(taggedList)-2):
                    newwordList.append(taggedList[i+1][0])
                continue
            newwordList.append(taggedList[i][0])
            if(i==len(taggedList)-2):
                newwordList.append(taggedList[i+1][0])

    finaltxt = ' '.join(newwordList)

    # POS tag the new sentence with stanza (only if there is something to tag)
    if len(finaltxt) > 0:
        doc = nlp(finaltxt)
        taggedList, taggedList_id = stanza_pos_tag(doc.sentences[0])
        bool_valid_txt = True
    else:
        taggedList = []
        bool_valid_txt = False

    return finaltxt, newwordList, taggedList, bool_valid_txt

def get_dependencies(doc):
    """
    Extract the dependency relations of the first sentence analyzed by stanza.

    Parameters
    ----------
    doc : stanza Document
        result of the stanza pipeline applied to a verbatim.

    Returns
    -------
    dep_node : list
        list of [word, head word, relation]; the head stays 0 for the root.
    dep_node_id : list
        list of [word id, head id, relation].

    """
    # Getting the dependency relations between the words
    dep_node = []
    dep_node_id = []
    for dep_edge in doc.sentences[0].dependencies:
        dep_node.append([dep_edge[2].text, dep_edge[0].id, dep_edge[1]])
        dep_node_id.append([dep_edge[2].id, dep_edge[0].id, dep_edge[1]])
        
    # Converting the head id into the head word (id 0 is the root and is left as is)
    for i in range(0, len(dep_node)):
        if (int(dep_node[i][1]) != 0):
            dep_node[i][1] = dep_node[(int(dep_node[i][1]) - 1)][0]
            
    return dep_node, dep_node_id

def select_potential_features(taggedList_new, taggedList_id_new):
    """
    Select the potential subject words of a sentence: every token that is
    not punctuation.

    Parameters
    ----------
    taggedList_new : list
        list of (word, POS tag) tuples.
    taggedList_id_new : list
        list of (word id, POS tag) tuples, aligned with taggedList_new.

    Returns
    -------
    totalfeatureList : list
        selected (word, POS tag) tuples.
    featureList : list
        same selection, each entry as a [word, POS tag] list.
    featureList_id : list
        matching [word id, POS tag] lists.
    categories : list
        selected words only.

    """
    totalfeatureList = []
    featureList_id = []
    featureList = []
    categories = []
    
    for idx in range(0,len(taggedList_new)):
        i = taggedList_new[idx]
        i_idx = taggedList_id_new[idx]
        # Keep every token except punctuation.
        # Alternative, stricter filter (disabled):
        # if(i[1]=='ADJ' or i[1]=='NOUN' or i[1]=='ADV' or i[1]=="VERB" or i[1]=="PRON" or i[1]=="DET"):
        if(i[1]!='PUNCT'):
            featureList.append(list(i)) # [word, POS tag] as a list
            totalfeatureList.append(i) # (word, POS tag) tuple as is
            featureList_id.append(list(i_idx))
            categories.append(i[0])

    return totalfeatureList, featureList, featureList_id, categories

def create_relation(featureList, dep_node):
    """
    Function to create relationship from the dep_node

    Parameters
    ----------
    featureList : list
        list of potential subjects.
    dep_node : list
        tree of dependencies.

    Returns
    -------
    fcluster : list
        list of clusters.

    """
    
    lst_deprel = ["nsubj", 
                  "acl:relcl", 
                  "obj", "dobj", 
                  "agent", 
                  "advmod", 
                  "amod", 
                  "neg", 
                  "prep_of", 
                  "acomp", 
                  "xcomp", 
                  "compound",
                  
                  "cop"
                  ]
    
    fcluster = []
    
    for i in featureList:
        filist = []
        for j in dep_node:
            if((j[0]==i[0] or j[1]==i[0]) and (j[2] in lst_deprel)):
                if(j[0]==i[0]):
                    filist.append(j[1])
                else:
                    filist.append(j[0])
        fcluster.append([i[0], filist])
    
    return fcluster

def select_NN_feature_relation(totalfeatureList, fcluster):
    """
    Keep only the clusters whose head word is a noun.

    Parameters
    ----------
    totalfeatureList : list
        list of (word, POS tag) tuples of the potential subjects.
    fcluster : list
        list of [word, related words] clusters.

    Returns
    -------
    finalcluster_out : list
        noun clusters, flattened, sorted and de-duplicated by flat_list.
    finalcluster : list
        noun clusters as [word, related words].
    dic : dict
        mapping word -> POS tag.

    """
    finalcluster = []
    dic = {}
    
    for i in totalfeatureList:
        dic[i[0]] = i[1]
    
    for i in fcluster:
        if(dic[i[0]]=="NOUN"):
            finalcluster.append(i)
    
    finalcluster_out = flat_list(finalcluster)
    
    return finalcluster_out, finalcluster, dic


#%%

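def merge_list_common_elements(fcluster_id):
    """
    Merge the lists that have at least one element in common
    (connected components of the graph built by to_graph).

    Parameters
    ----------
    fcluster_id : list
        list of clusters of ids, possibly nested (flattened with flat_list).

    Returns
    -------
    lst_merged_sorted : list
        merged clusters, each sorted in ascending order.

    """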
    lst_flat = flat_list(fcluster_id)
    G = to_graph(lst_flat)
    set_merged = list(connected_components(G))
    lst_merged = list(map(list,set_merged))
    
    lst_merged_sorted = []
    for element in lst_merged:
        lst_merged_sorted.append(sorted(element))
    
    return lst_merged_sorted


def flat_list(finalcluster):
    """
    Flatten each cluster (one level of nesting), sort it and remove duplicated clusters.
    The order of the returned clusters is not guaranteed, because duplicates are removed with a set.
    """
    lst_final = []
    for lst_tmp in finalcluster:
        lst_tmp_final = []
        for element in lst_tmp:
            if isinstance(element, list):
                lst_tmp_final.extend(element)
            else:
                lst_tmp_final.append(element)
        lst_tmp_final = sorted(lst_tmp_final)
        lst_final.append(lst_tmp_final)
    
    # Remove duplicated list
    set_theme = set(map(tuple,lst_final))
    lst_theme = list(map(list,set_theme))
    
    return lst_theme


def to_graph(l):
    """Build a graph in which each sublist of `l` is a chain of connected nodes."""
    G = networkx.Graph()
    for part in l:
        # each sublist is a bunch of nodes
        G.add_nodes_from(part)
        # it also implies a number of edges:
        G.add_edges_from(to_edges(part))
    return G

def to_edges(l):
    """ 
        treat `l` as a path and yield its edges 
        to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
    """
    it = iter(l)
    last = next(it)

    for current in it:
        yield last, current
        last = current    
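
`merge_list_common_elements` relies on `to_graph`, `to_edges` and networkx's `connected_components` to merge every cluster that shares at least one id with another. As a rough illustration of that idea only, here is a dependency-free sketch that swaps networkx for a small union-find structure; `merge_common` is a hypothetical name, not a function of this notebook.

```python
def merge_common(lists):
    """Merge lists that share at least one element (connected components)."""
    parent = {}

    def find(x):
        # Register x if needed, then walk up to its representative,
        # shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for part in lists:
        if not part:
            continue
        find(part[0])  # make sure single-element lists are registered
        # Link consecutive elements, the same edges that to_edges yields.
        for a, b in zip(part, part[1:]):
            union(a, b)

    # Group every element under its representative.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(group) for group in groups.values())


merged = merge_common([[1, 3], [3, 5], [2], [4, 6], [6, 7]])
print(merged)
```

Here `[1, 3]` and `[3, 5]` share the id 3 and `[4, 6]` and `[6, 7]` share the id 6, so each pair collapses into one cluster while `[2]` stays alone.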


In [None]:
import nltk

import networkx 
from networkx.algorithms.components.connected import connected_components
import pandas as pd

from nltk.corpus import stopwords
import stanza
stanza.download('en')
import collections


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2021-10-15 12:38:49 INFO: Downloading default packages for language: en (English)...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.3.0/models/default.zip:   0%|          | 0…

2021-10-15 12:39:05 INFO: Finished downloading models and saved to /root/stanza_resources.
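
The cluster post-processing helpers defined earlier (`sort_list_by_key`, `combine_orphelin_list`, `sort_list_list`) are easier to follow on a small input. The sketch below is a compact, self-contained re-implementation written for illustration only; the names `sort_by_first_id`, `absorb_orphans` and `sort_each` are hypothetical stand-ins for the three helpers and assume non-empty sublists.

```python
def sort_by_first_id(clusters):
    # Order the clusters by their first id (role of sort_list_by_key).
    return sorted(clusters, key=lambda cluster: cluster[0])


def absorb_orphans(clusters):
    # Merge a single-id cluster into the previous cluster when its id lies
    # within that cluster's [min, max] range (role of combine_orphelin_list).
    merged = [list(clusters[0])]
    for cluster in clusters[1:]:
        previous = merged[-1]
        if len(cluster) == 1 and min(previous) <= cluster[0] <= max(previous):
            previous.extend(cluster)
        else:
            merged.append(list(cluster))
    return merged


def sort_each(clusters):
    # Sort the ids inside every cluster (role of sort_list_list).
    return [sorted(cluster) for cluster in clusters]


step1 = sort_by_first_id([[2], [1, 3], [4]])
step2 = absorb_orphans(step1)
step3 = sort_each(step2)
print(step1, step2, step3)
```

The orphan id 2 falls between 1 and 3, so it is appended to `[1, 3]`; the final sort then restores the word order inside the cluster.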


In [None]:
import stanza
def initialize_stanza():
    """
    Function to initialize the stanza engine (default parameters for Engie)

    Returns
    -------
    nlp : object
        stanza object.

    """
    nlp = stanza.Pipeline(lang="en", 
                          use_gpu = True, 
                          verbose = False, 
                          processors= {
                                          'tokenize':'default', 
                                          'pos' :'default',
                                          'lemma':'default', 
                                          'depparse':'default',
                                        })
    return nlp

In [None]:
# Define the stanza object
nlp = initialize_stanza()

In [None]:
# Define the text example
txt = df['Review'][0]

# Use our custom phrase extraction code
result = theme_extraction(txt = txt, stop_words = stopword, nlp = nlp, threshold = 7)

extraction = [tmp_text for tmp_text in result]
print(result)

[['great', 'item', ',', 'worth', 'every', 'penny', '.'], ['i', 'was', 'skeptical', 'about', 'it', 'after'], ['seeing', 'comments', 'on', 'how'], ['loud', 'the', 'fan', 'is'], ['in', 'all', 'actually', 'i', 'barely', 'hear', 'it', 'when', 'the'], ['volume', 'is', 'low', 'but'], ['it', 'does', 'an', 'amazing', 'job', 'in'], ['cooling', 'the', 'projector'], ['every', 'time', 'the', 'capsule', 'is', 'cool', 'after'], ['running', 'hours'], ['over', '3'], ['highly', 'recommended', 'the'], ['picture', 'below', 'is', 'on', 'a', 'stucco', 'exterior', 'wall', 'but', 'the'], ['image', 'still', 'looks', 'amazing']]
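
The result is a list of token lists, one per extracted phrase. To read the phrases as plain strings, one possible post-processing step is to join the tokens and drop those made only of punctuation. This is a minimal sketch, not part of the extraction code; `tokens_to_phrase` is a hypothetical helper and the sample is the first three lists of the output above.

```python
from string import punctuation


def tokens_to_phrase(tokens):
    """Join tokens into one string, skipping tokens made only of punctuation."""
    kept = [tok for tok in tokens if not all(ch in punctuation for ch in tok)]
    return " ".join(kept)


# First three token lists copied from the printed result above.
sample = [
    ['great', 'item', ',', 'worth', 'every', 'penny', '.'],
    ['i', 'was', 'skeptical', 'about', 'it', 'after'],
    ['seeing', 'comments', 'on', 'how'],
]
phrases = [tokens_to_phrase(tokens) for tokens in sample]
print(phrases)
```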


In [None]:
def cao_package(text):
  # Reuse the pipeline created above: building a new stanza pipeline on
  # every call would reload the models for each review
  result = theme_extraction(txt = text, stop_words = [], nlp = nlp, threshold = 7)
  return result 

In [None]:

#df['cut'] = df['Review'].apply(cao_package)
