# Notebook: Theme extraction

<center><img src="https://media-exp1.licdn.com/dms/image/C5112AQHxkUuGyjk_Zw/article-cover_image-shrink_600_2000/0/1520139427384?e=1635379200&v=beta&t=E1kiu2v6CzpQ0OnIAhniRoEeS0zwlZ-PeA_-Au1_ang" title="Python Logo"/></center>



The objective of this notebook is to give an example of unsupervised theme extraction on simple and complex domains.

**Topic analysis** (also called topic detection, topic modeling, or topic extraction) is a machine learning technique that organizes and understands large collections of text data, by assigning “tags” or categories according to each individual text’s topic or theme.

Topic analysis uses natural language processing (NLP) to break down human language so that you can find patterns and unlock semantic structures within texts to extract insights and help make data-driven decisions.

The two most common approaches for topic analysis with machine learning are NLP topic modeling and NLP topic classification.

**Topic modeling** is an unsupervised machine learning technique. This means it can infer patterns and cluster similar expressions without needing predefined topic tags or labeled training data. These algorithms can be applied quickly and easily, but they have a downside: they tend to be less accurate than supervised approaches.

**Text classification** or **topic extraction from text**, on the other hand, needs to know the topics of a text before starting the analysis, because you need to tag data in order to train a topic classifier. Although there’s an extra step involved, topic classifiers pay off in the long run, and they’re much more precise than clustering techniques.

# 1. Install libraries

The coding example relies on the following libraries:
- **nltk**: a widely used library for NLP analysis (https://github.com/nltk/nltk)
- **stanza**: an NLP library developed at Stanford University (https://github.com/stanfordnlp/stanza)
- **rake_nltk**: a Python implementation of the RAKE algorithm using the NLTK library (https://github.com/csurfer/rake-nltk)
- **networkx**: a library for creating and analyzing graphs (https://github.com/networkx)

## 1.1 Install the package from pip

We first install the different libraries

In [None]:
# Install the nltk package
!pip install nltk



In [None]:
# Install the rake_nltk package
!pip install rake_nltk



In [None]:
# Install the networkx package
!pip install networkx



In [None]:
!pip install tqdm



## 1.2 Download the plugins within the libraries

The Stanza package requires downloading the models of the language used in the code. In our case, the example documents are in English, so we download the English package with the command `stanza.download(<language>)`.

More information on how to get started with the package can be found at: https://stanfordnlp.github.io/stanza/#getting-started

Note: Stanza has an installation bug, so its installation has been moved to the `init.ipynb` notebook.

In [None]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# 2. RAKE Algorithm

## 2.1 Presentation

RAKE (**Rapid Automatic Keyword Extraction**) is an efficient keyword extraction algorithm that operates on individual documents, which makes it suitable for dynamic collections. It can be applied to new domains easily and handles many types of documents well, especially text that follows standard grammar conventions.
RAKE is based on the observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, that is, function words such as ‘and’, ‘of’, and ‘the’ that carry minimal lexical meaning. Stop words are typically dropped by information retrieval systems and excluded from many text analyses because they are considered uninformative. Words that do carry meaning related to the text are described as content-bearing and are called content words.

Source: https://medium.datadriveninvestor.com/rake-rapid-automatic-keyword-extraction-algorithm-f4ec17b2886c

<center><img src="https://miro.medium.com/max/1276/0*HlO7We9cJhaLa5QT" title="Python Logo"/></center>
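
To make the scoring idea concrete, here is a minimal pure-Python sketch of RAKE. It is not the `rake_nltk` implementation: the stop word list `STOP_WORDS` and the helpers `candidate_phrases` and `rake_scores` are hypothetical names introduced for illustration. The sketch assumes the usual degree-to-frequency scoring, where each word scores `degree / frequency` and a phrase scores the sum over its words.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop word list for this sketch only;
# the notebook uses NLTK's English list instead.
STOP_WORDS = {"this", "is", "but", "the", "was", "a", "of", "and"}


def candidate_phrases(text, stop_words):
    """Split text into runs of content words, cutting at stop words and punctuation."""
    phrases = []
    # Punctuation always ends a phrase, so split on it first.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stop_words:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(phrases):
    """Score each phrase as the sum of degree(word) / frequency(word)."""
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            # Degree counts co-occurrences inside the phrase, the word itself included.
            degree[word] += len(phrase)
    return {
        " ".join(phrase): sum(degree[w] / frequency[w] for w in phrase)
        for phrase in set(phrases)
    }


text = "this film is great but the movie was awful. The theatre was amazing"
phrases = candidate_phrases(text, STOP_WORDS)
scores = rake_scores(phrases)
print(phrases)
print(scores)
```

On this text every candidate phrase is a single word that occurs once, so each phrase scores 1.0. Multi-word phrases score higher because each of their words has a larger degree.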



## 2.2 Let's test the RAKE algorithm on a simple example

Let's now test the RAKE algorithm on a simple example. We have a text: `this film is great but the movie was awful. The theatre was amazing`. The objective is to extract the different subjects. In this case, the answers we expect are:
- `great film` 
- `movie awful` 
- `theatre amazing`

In [None]:
# Text Example
txt = "this film is great but the movie was awful. The theatre was amazing"

Define the stopwords list:

In [None]:
from nltk.corpus import stopwords

stopword = list(stopwords.words('english'))
stopword.remove('not')

The code below applies the RAKE algorithm:

In [None]:
# Import the libraries
from rake_nltk import Rake
from nltk.corpus import stopwords 

# Define the RAKE object
r = Rake(stopwords = stopword) # Custom stopword list: NLTK English stopwords without "not"; punctuation defaults to all punctuation characters.

# Make the extraction (extract_keywords_from_text stores its results in r and returns None)
a=r.extract_keywords_from_text(txt)
b=r.get_ranked_phrases()
c=r.get_ranked_phrases_with_scores()

print("Original text:", txt)
print("Extracted Keywords from the text:", a)
print("Ranked phrases:", b)
print("Ranked phrases with scores:", c)

Original text: this film is great but the movie was awful. The theatre was amazing
Extracted Keywords from the text: None
Ranked phrases: ['theatre', 'movie', 'great', 'film', 'awful', 'amazing']
Ranked phrases with scores: [(1.0, 'theatre'), (1.0, 'movie'), (1.0, 'great'), (1.0, 'film'), (1.0, 'awful'), (1.0, 'amazing')]


## 2.3 Limits of the RAKE Algorithm

We can observe that the algorithm succeeded in extracting the important words but did not succeed in associating the words that belong to the same concept. To improve on this, we need another method that first splits a comment into its different parts, and then we apply the RAKE algorithm to each part to extract its subject.

# 3. Package to separate the different themes in a comment

## 3.1 Phrase separation based on the ABSA method

One challenge when using the RAKE algorithm is to first separate the different concepts. For this, we can take inspiration from the aspect-based sentiment analysis (ABSA) method presented in this article: https://medium.com/analytics-vidhya/aspect-based-sentiment-analysis-a-practical-approach-8f51029bbc4a

We adapted this code to take our business case into account.
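
Before the full implementation, the core idea can be sketched on a toy input: link words through a whitelist of dependency relations, then treat each connected group as one phrase. The triples in `EDGES` are written by hand to resemble a Universal Dependencies parse; they are an assumption of this sketch, not output produced by Stanza. The name `cluster_by_relations` is also hypothetical.

```python
from collections import defaultdict

# Hand-written (word id, head id, relation) triples for
# "this film is great but the movie was awful".
# They imitate a Universal Dependencies parse and are an assumption of this
# sketch, not output produced by Stanza.
WORDS = {1: "this", 2: "film", 3: "is", 4: "great", 5: "but",
         6: "the", 7: "movie", 8: "was", 9: "awful"}
EDGES = [
    (1, 2, "det"), (2, 4, "nsubj"), (3, 4, "cop"), (4, 0, "root"),
    (5, 9, "cc"), (6, 7, "det"), (7, 9, "nsubj"), (8, 9, "cop"),
    (9, 4, "conj"),
]
# Subset of the relations kept by the notebook's create_relation helper.
KEPT_RELATIONS = {"nsubj", "cop", "amod", "advmod", "obj", "compound"}


def cluster_by_relations(words, edges, kept):
    """Group word ids linked by a kept relation (connected components)."""
    neighbours = defaultdict(set)
    for child, head, relation in edges:
        if relation in kept and head != 0:
            neighbours[child].add(head)
            neighbours[head].add(child)

    seen, clusters = set(), []
    for start in sorted(words):
        if start in seen:
            continue
        # Depth-first walk collecting every id reachable from `start`.
        stack, component = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(neighbours[node] - seen)
        clusters.append(sorted(component))
    return clusters


clusters = cluster_by_relations(WORDS, EDGES, KEPT_RELATIONS)
for cluster in clusters:
    print([WORDS[i] for i in cluster])
```

Because the `conj` and `cc` relations are not in the whitelist, `film is great` and `movie was awful` end up in separate groups. Words such as `this`, `but`, and `the` remain isolated, which is why the full code adds post-processing steps that attach single-word groups to a neighbouring group.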

In [None]:
"""
Created on Mon Mar 29 14:13:12 2021

@author: Cao Tri DO
"""

import nltk

import networkx 
from networkx.algorithms.components.connected import connected_components
import pandas as pd

from nltk.corpus import stopwords
import stanza
stanza.download('en')
import collections


#%% subfunctions

def theme_extraction(txt, stop_words, nlp, threshold = 7):
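    """
    Extract the themes (groups of related words) from a verbatim

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    stop_words : list
        list of stop words to ignore when extracting subjects
        (passed on to theme_clustering, which currently does not use it).
    nlp : stanza object
        stanza object engine.
    threshold : int, optional
        maximum number of tokens for a sentence to be kept whole as a
        single subject. The default is 7.

    Returns
    -------
    lst_subject : list
        list of extracted subjects, each one a list of words.

    """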
    try:
        # Lower the text
        doc = nlp(txt.lower())
        # Initiate void list
        lst_subject = []
        
        # Loop for all sentences in the verbatims
        for sent in doc.sentences:
            # If the sentence is too short, keep it all in one subject
            if len(sent.tokens) <= threshold:          
                # print("Sentence is too short. Keep all words")
                finalcluster = [word.text for word in sent.words]
                lst_subject.append(finalcluster)  
            # For the sentences that are long enough
            else:
                # Extract subjects using the POS tags and dependency relations (deprel)
                # print("Sentence is long. Apply subject extraction")
                finalcluster = theme_clustering(sent.text, stop_words, nlp)
                
                # Only apply the subject engine if there are extracted subject
                if len(finalcluster)>0: # case when there are issues e.g., 'PRPBLEME D IMPRESSION?????????????????????????????????'
                    ##################################################################
                    # Group first element if there are many of them
                    ##################################################################
                    # list of the size of each element
                    lst_size = [len(element) for element in finalcluster]
                    # index of the first element with more than one word
                    try:
                        first_element = next(x[0] for x in enumerate(lst_size) if x[1] > 1)
                    except StopIteration: # if all elements are of size 1
                        first_element = len(lst_size)-1
                    
                    finalcluster_first_group = []
                    tmp_first_element = []
                    for idx_element in range(0,first_element+1):
                        element = finalcluster[idx_element]
                        tmp_first_element.extend(element)
                    finalcluster_first_group.append(tmp_first_element)
                    for idx_element in range(first_element+1, len(finalcluster)):
                        element = finalcluster[idx_element]
                        finalcluster_first_group.append(element)
                    
                    ##################################################################
                    
                    finalcluster_clean = []
                    count = 0
                    tmp = []
                    for element_idx in range(0,len(finalcluster_first_group)):
                        element = finalcluster_first_group[element_idx]
                        if element_idx == 0 and len(element)== 1:
                            tmp = element
                        elif len(tmp)>0 or len(element)>1:
                            tmp.extend(element)
                            finalcluster_clean.append(tmp)
                            tmp = []
                            count = len(finalcluster_clean)-1
                        else:
                            tmp.extend(element)
                            finalcluster_clean[count].extend(tmp)
                            tmp = []
                    
                    for element in finalcluster_clean:
                        lst_subject.append(element)    
    except Exception: # on any failure, print the verbatim and return no subject
        print(txt)
        lst_subject = []
    
    return lst_subject


#%% utilities functions

def get_meta_data_nlp(txt, nlp):
    """
    Function to transform stanza object result to dataframe

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    nlp : stanza object
        stanza object engine.

    Returns
    -------
    df : dataframe
        dataframe allowing to analyze the results from Stanza.

    """
    doc = nlp(txt)
    
    df = pd.DataFrame()
    count = 1
    for sentence in doc.sentences:
        print(sentence.text)
        for word in sentence.words:
            df.loc[count, "word"]  = word.text
            df.loc[count, "lemma"] = word.lemma
            # df.loc[count, "feats"] = word.feats 
            df.loc[count, "pos"] = word.pos
            df.loc[count, "upos"] = word.upos
            df.loc[count, "xpos"] = word.xpos
            df.loc[count, "id head"] = word.head
            df.loc[count, "word root"] = sentence.words[word.head-1].text if word.head > 0 else "root"
            df.loc[count, "deprel"] = word.deprel
            # print(word.text, " | ", word.lemma, " | ", word.pos)
            count = count +1
    print(df)
    
    return df


def get_monolabel(X_test):
    """
    Extract monolabel from dataframe

    Parameters
    ----------
    X_test : dataframe
        containing the data.

    Returns
    -------
    df_monolabel : dataframe
        Contains only the monolabel.
    df_multilabel : dataframe
        Contains only the multilabel.
    """
    df_theme = X_test.copy()
    mask = df_theme["id_unique"].value_counts()
    lst_unique_idx = list(mask[mask>1].index)
    df_monolabel = X_test[~X_test['id_unique'].isin(lst_unique_idx)].copy()
    df_multilabel = X_test[X_test['id_unique'].isin(lst_unique_idx)].copy()
    return df_monolabel, df_multilabel


#%% subfunctions for the code 

def theme_clustering(txt, stop_words, nlp):
    """
    Main program to cluster the subject

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    stop_words : list
        list of stop to not considered when extract subject.
    nlp : stanza object
        stanza object engine.

    Returns
    -------
    lst_merged_cluster_word_all : list
        list of subject identified by the algorithm.

    """
    # sentList = clean_txt(txt)
    sentList = clean_txt(txt, nlp)


    # for line in sentList:
    lst_merged_cluster_word_all = []
    for line in sentList.sentences:
        
        # NLTK POS Tagging
        # taggedList = nltk_pos_tag(line)
        # taggedList = stanza_pos_tag(line)

        # join consecutive noun
        # finaltxt, newwordList, taggedList_new, bool_valid_txt = join_consecutive_noun(taggedList, nlp, noun_tag = "NOUN")
        bool_valid_txt = True
        
        if bool_valid_txt:
            # Create Stanford object model
            # doc = nlp(txt) # Object of Stanford NLP Pipeleine
            taggedList_new, taggedList_id_new = stanza_pos_tag(line)
            
            # Getting the dependency relations between the words
            # Converting it into appropriate format
            # (get_dependencies only reads the first sentence of sentList)
            dep_node, dep_node_id = get_dependencies(sentList)
    
            # we will select only those sublists from the <dep_node> that could probably contain the features
            totalfeatureList, featureList, featureList_id, categories = select_potential_features(taggedList_new, taggedList_id_new)
            
            # Now using the <dep_node> list and the <featureList> we will determine 
            # to which of the words these features in the feature list are related to.
            fcluster = create_relation(featureList, dep_node)
            fcluster_id = create_relation(featureList_id, dep_node_id)
            
            # finalcluster_out, finalcluster, dic = select_NN_feature_relation(totalfeatureList, fcluster)
            
            # Merge list commonelements
            lst_merged_cluster_id = merge_list_common_elements(fcluster_id)
            
            # Order the id within the list
            lst_merged_cluster_id_sort1 = sort_list_list(lst_merged_cluster_id)
            
            # Order the element by id of 1st element
            lst_merged_cluster_id_keysort = sort_list_by_key(lst_merged_cluster_id_sort1)
            
            # Bug fix: if a list has only 1 element and that element lies within the range of the previous list, merge it into that list
            # Do not take into account 1st element 0
            lst_merged_cluster_id_no_orphelin = combine_orphelin_list(lst_merged_cluster_id_keysort)
            
            
            # lst_merged_cluster_id_sorted = sort_list_list(lst_merged_cluster_id)
            lst_merged_cluster_id_sorted = sort_list_list(lst_merged_cluster_id_no_orphelin)
            
            lst_merged_cluster_word = convert_idx_lst_to_word_lst(lst_merged_cluster_id_sorted, taggedList_new)
            
            
        else:
            lst_merged_cluster_word = []   
        
        for element in lst_merged_cluster_word:
            lst_merged_cluster_word_all.append(element)
    
    return lst_merged_cluster_word_all


def combine_orphelin_list(lst_merged_cluster_id_keysort):
    """
    Function to merge orphan (single-element) lists into the previous list
    when their id lies within the range of that list
    For example : [[1,3], [2], [4]] will give [[1,3,2], [4]] (the ids are sorted in a later step)

    Parameters
    ----------
    lst_merged_cluster_id_keysort : list
        original list to combine.

    Returns
    -------
    lst_merged_cluster_id_no_orphelin : list
        combined list.

    """
    
    # Assignment
    lst_merged_cluster_id_no_orphelin = []
    
    # Make a copy
    lst_merged_cluster_id_no_orphelin.append(lst_merged_cluster_id_keysort[0].copy())
    id_to_compare = 0
    
    # Loop
    for i in range(1,len(lst_merged_cluster_id_keysort)):
        # Only if the list is with 1 element
        if len(lst_merged_cluster_id_keysort[i])==1:
            id_element = lst_merged_cluster_id_keysort[i][0]
            # Meta data of previous list
            previous_list = lst_merged_cluster_id_no_orphelin[id_to_compare]
            min_id_previous_list = min(previous_list)
            max_id_previous_list = max(previous_list)
            # If new list within the range of previous list, merge with it
            if id_element>=min_id_previous_list and id_element<=max_id_previous_list:
                lst_merged_cluster_id_no_orphelin[id_to_compare].extend(lst_merged_cluster_id_keysort[i].copy())
            else:
                lst_merged_cluster_id_no_orphelin.append(lst_merged_cluster_id_keysort[i].copy())
                id_to_compare = len(lst_merged_cluster_id_no_orphelin)-1
        else:
            lst_merged_cluster_id_no_orphelin.append(lst_merged_cluster_id_keysort[i].copy())
            id_to_compare = len(lst_merged_cluster_id_no_orphelin)-1   
    return lst_merged_cluster_id_no_orphelin

def sort_list_by_key(lst_merged_cluster_id_sort1):
    """
    Function to sort a list of list into an ordered list
    For example : [[2], [1,3], [4]] will give [[1,3], [2], [4]]

    Parameters
    ----------
    lst_merged_cluster_id_sort1 : list
        original list unsorted.

    Returns
    -------
    final_list : list
        sorted list.

    """
    dct_list = {}
    for i in range(0, len(lst_merged_cluster_id_sort1)):
        first_id = lst_merged_cluster_id_sort1[i][0]
        dct_list[first_id] = lst_merged_cluster_id_sort1[i]
    
    # sort dict key
    od = collections.OrderedDict(sorted(dct_list.items()))
    
    final_list = []
    for element in od:
        final_list.append(od[element])
    return final_list
    

def sort_list_list(lst_merged_cluster_id_no_orphelin):
    """
    Function to sort the ids inside each sub-list of a list of lists

    Parameters
    ----------
    lst_merged_cluster_id_no_orphelin : list
        list of lists of word ids.

    Returns
    -------
    lst_merged_cluster_id_sorted : list
        same list of lists, with the ids of each sub-list in ascending order.

    """
    lst_merged_cluster_id_sorted = lst_merged_cluster_id_no_orphelin.copy()
    for i in range(0, len(lst_merged_cluster_id_sorted)):
        lst_merged_cluster_id_sorted[i] = sorted(lst_merged_cluster_id_no_orphelin[i]).copy()
    # Initiate the void dictionnary
    # dct_idx = {}
    # # Loop on all element of the list
    # for element in lst_merged_cluster_id_no_orphelin:
    #     count = element[0]
    #     dct_idx[count] = element
    # # sort the list
    # dct_idx_sorted = dict(sorted(dct_idx.items(), key=lambda item: item[1]))
    # lst_merged_cluster_id_sorted = []
    # for key_element in dct_idx_sorted:
    #     lst_merged_cluster_id_sorted.append(dct_idx_sorted[key_element])

    return lst_merged_cluster_id_sorted

def convert_idx_lst_to_word_lst(lst_merged_cluster_id, taggedList_new):
    """
    Function to convert the idx to the words

    Parameters
    ----------
    lst_merged_cluster_id : list
        list of index.
    taggedList_new : list
        list of (word, POS tag) tuples.

    Returns
    -------
    lst_merged_cluster_word : list
        list of words.

    """
    lst_merged_cluster_word = []
    for lst in lst_merged_cluster_id:
        lst_word = []
        for element in lst:
            word_tmp = taggedList_new[element-1][0]
            lst_word.append(word_tmp)
        lst_merged_cluster_word.append(lst_word)
    return lst_merged_cluster_word

def clean_txt(txt, nlp):
    """
    Function to clean the text by lower and separate the sentence using stanza

    Parameters
    ----------
    txt : str
        verbatim to analyze.
    nlp : stanza object
        stanza object engine.

    Returns
    -------
    sentList : list
        list of sentences.

    """
    txt = txt.lower() # LowerCasing the given Text
    sentList = nlp(txt)
    
    return sentList

def clean_txt_nltk(txt):
    """
    Function to clean the text by lower and separate the sentence using nltk

    Parameters
    ----------
    txt : str
        verbatim to analyze.

    Returns
    -------
    sentList : list
        list of sentences.

    """
    txt = txt.lower() # LowerCasing the given Text
    sentList = nltk.sent_tokenize(txt) # Splitting the text into sentences

    return sentList


def nltk_pos_tag(line):
    """
    for each sentence in the <sentList> tokenize it and perform POS Tagging and store it into a Tagged List
    """
    txt_list = nltk.word_tokenize(line) # Splitting up into words
    taggedList = nltk.pos_tag(txt_list) # Doing Part-of-Speech Tagging to each word
    return taggedList

def stanza_pos_tag(line):
    taggedList = []
    taggedList_id = []
    for word in line.words:
        taggedList.append((word.text, word.pos))
        taggedList_id.append((word.id, word.pos))
        
    return taggedList, taggedList_id

def join_consecutive_noun(taggedList, nlp, noun_tag = "NOUN"):
    """
    there are many instances where a feature is represented by multiple words 
    so we need to handle that first by joining multiple words features into a one-word feature
    
     noun_tag = "NOUN" (stanza) or NN (nltk)
    """
    newwordList = []
    flag = 0
    for i in range(0,len(taggedList)-1):
        if(taggedList[i][1]==noun_tag and taggedList[i+1][1]==noun_tag): # If two consecutive words are Nouns then they are joined together
            newwordList.append(taggedList[i][0]+taggedList[i+1][0])
            flag=1
        else:
            if(flag==1):
                flag=0
                continue
            newwordList.append(taggedList[i][0])
            if(i==len(taggedList)-2):
                newwordList.append(taggedList[i+1][0])

    finaltxt = ' '.join(word for word in newwordList) 
    
    # Tokenize and POS Tag the new sentence.
    # new_txt_list = nltk.word_tokenize(finaltxt)
    # wordsList = [w for w in new_txt_list if not w in stop_words]
    # taggedList = nltk.pos_tag(wordsList)
    # print(wordsList)
    doc = nlp(finaltxt)

    if len(finaltxt) > 0:
        taggedList, taggedList_id = stanza_pos_tag(doc.sentences[0])
        bool_valid_txt = True
    else:
        taggedList = []
        bool_valid_txt = False

    return finaltxt, newwordList, taggedList, bool_valid_txt

def get_dependencies(doc):
    """
    Function to extract the dependencies from the analysis from Stanza

    Parameters
    ----------
    doc : stanza object
        stanza object applied to a verbatim.

    Returns
    -------
    dep_node : list
        list of [word, head word, relation] entries (the head stays 0 for the root).
    dep_node_id : list
        list of [word id, head id, relation] entries.

    """
    # Getting the dependency relations between the words
    dep_node = []
    dep_node_id = []
    for dep_edge in doc.sentences[0].dependencies:
        dep_node.append([dep_edge[2].text, dep_edge[0].id, dep_edge[1]])
        dep_node_id.append([dep_edge[2].id, dep_edge[0].id, dep_edge[1]])
        
    # Converting it into appropriate format
    for i in range(0, len(dep_node)):
        if (int(dep_node[i][1]) != 0):
            dep_node[i][1] = dep_node[(int(dep_node[i][1]) - 1)][0]
            
    return dep_node, dep_node_id

def select_potential_features(taggedList_new, taggedList_id_new):
    """
    Function to select the words that can be part of a subject
    (every word except punctuation)

    Parameters
    ----------
    taggedList_new : list
        list of (word, POS tag) tuples.
    taggedList_id_new : list
        list of (word id, POS tag) tuples.

    Returns
    -------
    totalfeatureList : list
        selected (word, POS tag) tuples.
    featureList : list
        selected [word, POS tag] lists.
    featureList_id : list
        selected [word id, POS tag] lists.
    categories : list
        selected words only.

    """
    totalfeatureList = []
    featureList_id = []
    featureList = []
    categories = []
    count = 1
    
    for idx in range(0,len(taggedList_new)):
        i = taggedList_new[idx]
        i_idx = taggedList_id_new[idx]
        # if(i[1]=='JJ' or i[1]=='NN' or i[1]=='JJR' or i[1]=='NNS' or i[1]=='RB'):
        if(i[1]!='PUNCT'):
        # if(i[1]=='ADJ' or i[1]=='NOUN' or i[1]=='ADV' or i[1]=="VERB" or i[1]=="PRON" or i[1]=="DET"):
            featureList.append(list(i)) # For features for each sentence
            totalfeatureList.append(i) # Stores the features of all the sentences in the text
            featureList_id.append(list(i_idx))
            categories.append(i[0])
        count = count+1

    return totalfeatureList, featureList, featureList_id, categories

def create_relation(featureList, dep_node):
    """
    Function to create relationship from the dep_node

    Parameters
    ----------
    featureList : list
        list of potential subjects.
    dep_node : list
        tree of dependencies.

    Returns
    -------
    fcluster : list
        list of clusters.

    """
    
    lst_deprel = ["nsubj", 
                  "acl:relcl", 
                  "obj", "dobj", 
                  "agent", 
                  "advmod", 
                  "amod", 
                  "neg", 
                  "prep_of", 
                  "acomp", 
                  "xcomp", 
                  "compound",
                  
                  "cop"
                  ]
    
    fcluster = []
    
    for i in featureList:
        filist = []
        for j in dep_node:
            if((j[0]==i[0] or j[1]==i[0]) and (j[2] in lst_deprel)):
                if(j[0]==i[0]):
                    filist.append(j[1])
                else:
                    filist.append(j[0])
        fcluster.append([i[0], filist])
    
    return fcluster

def select_NN_feature_relation(totalfeatureList, fcluster):
    """
    Focus the subject on the noun

    Parameters
    ----------
    totalfeatureList : list
        list of potential features for subject.
    fcluster : list
        list of extracted potential subjects.

    Returns
    -------
    finalcluster_out : list
        list of extracted potential subjects focus on noun.
    finalcluster : list
        list of extracted potential subjects focus on noun.
    dic : TYPE
        dic of extracted potential subjects focus on noun..

    """
    finalcluster = []
    dic = {}
    
    for i in totalfeatureList:
        dic[i[0]] = i[1]
    
    for i in fcluster:
        if(dic[i[0]]=="NOUN"):
            finalcluster.append(i)
    
    finalcluster_out = flat_list(finalcluster)
    
    return finalcluster_out, finalcluster, dic


#%%

def merge_list_common_elements(fcluster_id):
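    """
    Function to merge the lists that have common elements

    Parameters
    ----------
    fcluster_id : list
        list of [word id, [related word ids]] entries.

    Returns
    -------
    lst_merged_sorted : list
        list of merged id lists (one per connected component), each sorted.

    """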
    lst_flat = flat_list(fcluster_id)
    G = to_graph(lst_flat)
    set_merged = list(connected_components(G))
    lst_merged = list(map(list,set_merged))
    
    lst_merged_sorted = []
    for element in lst_merged:
        lst_merged_sorted.append(sorted(element))
    
    return lst_merged_sorted


def flat_list(finalcluster):
    lst_final = []
    for lst_tmp in finalcluster:
        lst_tmp_final = []
        for element in lst_tmp:
            if type(element) == list:
                lst_tmp_final.extend(element)
            else:
                lst_tmp_final.append(element)
        lst_tmp_final = sorted(lst_tmp_final)
        lst_final.append(lst_tmp_final)
    
    # Remove duplicated list
    set_theme = set(map(tuple,lst_final))
    lst_theme = list(map(list,set_theme))
    
    return lst_theme


def to_graph(l):
    G = networkx.Graph()
    for part in l:
        # each sublist is a bunch of nodes
        G.add_nodes_from(part)
        # it also implies a number of edges:
        G.add_edges_from(to_edges(part))
    return G

def to_edges(l):
    """ 
        treat `l` as a path and return its edges 
        to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
    """
    it = iter(l)
    last = next(it)

    for current in it:
        yield last, current
        last = current    
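

The id post-processing above can be hard to follow, so here is a condensed sketch of two of its steps: ordering the clusters, then absorbing single-id "orphan" clusters. The names `sort_by_first_id` and `absorb_orphans` are hypothetical; they are meant to mirror `sort_list_by_key` and `combine_orphelin_list`, not to replace them.

```python
def sort_by_first_id(clusters):
    """Sort ids inside each cluster, then order clusters by their smallest id."""
    return sorted((sorted(cluster) for cluster in clusters), key=lambda c: c[0])


def absorb_orphans(clusters):
    """Merge a single-id cluster into the previous cluster when its id
    lies within that cluster's id range."""
    merged = []
    for cluster in clusters:
        previous = merged[-1] if merged else None
        if (previous is not None and len(cluster) == 1
                and min(previous) <= cluster[0] <= max(previous)):
            previous.append(cluster[0])
            previous.sort()
        else:
            # Copy so the caller's lists are never modified.
            merged.append(list(cluster))
    return merged


ordered = sort_by_first_id([[2], [1, 3], [4]])
print(ordered)                   # [[1, 3], [2], [4]]
print(absorb_orphans(ordered))   # [[1, 2, 3], [4]]
```

Unlike the notebook version, this sketch sorts the ids immediately after each merge, so no separate sorting pass is needed afterwards.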



The code below initializes the `stanza` object used in the rest of the notebook.

In [None]:
import stanza

def initialize_stanza():
    """
    Function to initialize the stanza engine (default parameters for Engie)

    Returns
    -------
    nlp : object
        stanza object.

    """
    nlp = stanza.Pipeline(lang="en", 
                          use_gpu = True, 
                          verbose = False, 
                          processors= {
                                          'tokenize':'default', 
                                          'pos' :'default',
                                          'lemma':'default', 
                                          'depparse':'default',
                                        })
    return nlp

## 3.2 Application on our simple example

Let's define a simple example

In [None]:
# Define the text example
txt = "this film is great but the movie was awful. The theatre was amazing"

We first initiate the `stanza` object

In [None]:
# Define the stanza object
nlp = initialize_stanza()

We then separate the different phrases in our example

In [None]:
# Use our custom phrase extraction code
result = theme_extraction(txt = txt, stop_words = [], nlp = nlp, threshold = 7)

for tmp_text in result:
    print(tmp_text)

['this', 'film', 'is', 'great', 'but', 'the']
['movie', 'was', 'awful']
['the', 'theatre', 'was', 'amazing']


We can see from the result above that our new package has succeeded in separating the different phrases.
We define the stopword list below. Be careful: in our case study, the negation `not` is very important and should not be in the stopword list, although it is there by default.

In [None]:
from nltk.corpus import stopwords

nltk.download('stopwords')
stopword = list(stopwords.words('english'))
stopword.remove('not')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
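
To see why keeping `not` matters, the short sketch below splits a sentence at stop words, as RAKE does when it builds candidate phrases. The two miniature stop word sets and the helper `split_on_stop_words` are hypothetical and serve only this illustration; they are not the NLTK list.

```python
def split_on_stop_words(tokens, stop_words):
    """Cut a token list at every stop word and return the remaining phrases."""
    phrases, current = [], []
    for token in tokens:
        if token in stop_words:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical miniature stop word lists, used only for this illustration.
default_like = {"the", "is", "was", "not"}
without_not = default_like - {"not"}

tokens = "the film was not great".split()
with_default = split_on_stop_words(tokens, default_like)
with_negation = split_on_stop_words(tokens, without_not)
print(with_default)   # ['film', 'great']
print(with_negation)  # ['film', 'not great']
```

With `not` treated as a stop word, the negative opinion reads as a positive one; removing it from the list keeps `not great` together as a single phrase.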


We finally apply the RAKE algorithm

In [None]:
from rake_nltk import Rake
from nltk.corpus import stopwords 
r = Rake(stopwords =stopword) # Custom stopword list: NLTK English stopwords without "not"; punctuation defaults to all punctuation characters.

for tmp_text in result:
    text= ' '.join(tmp_text)
    print('-----------------------------')
    print(text)
    a=r.extract_keywords_from_text(text)
    b=r.get_ranked_phrases()
    c=r.get_ranked_phrases_with_scores()
    print(a)
    print(b)
    print(c)

-----------------------------
this film is great but the
None
['great', 'film']
[(1.0, 'great'), (1.0, 'film')]
-----------------------------
movie was awful
None
['movie', 'awful']
[(1.0, 'movie'), (1.0, 'awful')]
-----------------------------
the theatre was amazing
None
['theatre', 'amazing']
[(1.0, 'theatre'), (1.0, 'amazing')]


We can observe that the obtained result is what we were expecting

## 3.3 Conclusion

From this simple example, we show that we achieved our objective of extracting the subjects by combining two methods:
- phrase separation
- subject extraction

<center><img src="https://media.makeameme.org/created/victory-at-last.jpg" title="Python Logo" width = 300/></center>


