<a href="https://colab.research.google.com/github/fininsight/text-mining-tutorial/blob/master/3_%ED%95%B5%EC%8B%AC_%ED%82%A4%EC%9B%8C%EB%93%9C_%EC%B6%94%EC%B6%9C_Keyword_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Keyword Extraction

# 1 TF-IDF

### 1) Sample text

In [0]:
d1 = "The cat sat on my face. I hate a cat."
d2 = "The dog sat on my bed. I love a dog." 

### 2) TF-IDF with sklearn

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

document_ls = [d1, d2]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(document_ls)

word2id = defaultdict(lambda : 0)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
for idx, feature in enumerate(vectorizer.get_feature_names_out()):
    word2id[feature] = idx

### 3) Convert to a DataFrame and print

In [99]:
import pandas as pd
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
count_vect_df = pd.DataFrame(tfidf.todense(), columns=vectorizer.get_feature_names_out())
count_vect_df

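|   | bed | cat | dog | face | hate | love | my | on | sat | the |
|---|-----|-----|-----|------|------|------|----|----|-----|-----|
| 0 | 0.0 | 0.706006 | 0.0 | 0.353003 | 0.353003 | 0.0 | 0.251164 | 0.251164 | 0.251164 | 0.251164 |
| 1 | 0.353003 | 0.0 | 0.706006 | 0.0 | 0.0 | 0.353003 | 0.251164 | 0.251164 | 0.251164 | 0.251164 |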
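
The values in the first row can be reproduced by hand, which makes it easier to see what the vectorizer computes. The sketch below assumes the `TfidfVectorizer` defaults (lowercasing, tokens of two or more word characters, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row). The helper names `tokenize` and `tfidf_row` are hypothetical and exist only for this illustration.

```python
import math
import re
from collections import Counter

docs = [
    "The cat sat on my face. I hate a cat.",
    "The dog sat on my bed. I love a dog.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring the default token pattern (so "I" and "a" are dropped).
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    tf = Counter(tokens)
    # Raw weight = term count * smoothed idf.
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    # L2-normalise the row so that its Euclidean length is 1.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

row0 = tfidf_row(tokenized[0])
print(round(row0["cat"], 6), round(row0["the"], 6))  # 0.706006 0.251164
```

"cat" scores highest in the first document because it occurs twice there and appears in only one of the two documents, while "the" occurs in both and therefore gets the minimum idf of 1.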


### 4) Print terms in descending order of TF-IDF score

In [111]:
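import numpy as np

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_array = np.array(vectorizer.get_feature_names_out(), dtype=str)
tfidf_sorting = np.argsort(tfidf[0].toarray()).flatten()[::-1]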

n = 3
top_n = feature_array[tfidf_sorting][:n]

top_n

array(['cat', 'hate', 'face'], dtype='<U4')



---



# 2 TextRank

<img src="https://i.stack.imgur.com/ohF5r.png" />

## 2.1 Implementing TextRank from scratch
(Based on: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)

In [0]:
#Source of text:
#https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents

Text = "The FAANG stocks won’t see much more growth in the near future, according to Bill Studebaker, founder and Chief Investment Officer of Robo Global. \
Studebaker argues we are seeing a 'reallocation' that will continue from large-cap tech stocks into market-weight stocks. \
The FAANG stocks have had a rough few weeks, and have been hit hard since March 12. \
One FAANG to look out for, in the midst of all this, is Amazon, according to Studebaker. \
The stock market is seeing a 'reallocation' out of FAANG stocks, which are not where the smart money is, founder and Chief Investment Officer of Robo Global Bill Studebaker told Business Insider. \
The FAANG stocks (Facebook, Apple, Amazon, Netflix, Google) are all down considerably since March 12, a trend that accelerated when news of a massive Facebook data scandal broke, sending the tech-heavy Nasdaq into a downward frenzy. \
Investors are wondering what’s next. \
And what’s next isn’t good news for FAANG stock optimists, Studebaker thinks. 'This is a dead trade' for the next several months, he said. 'I wouldn’t expect there to be a lot of performance attribution coming from the FAANG stocks,' he added. That is, if the stock market is to see gains in the next several months, they will largely not come from the big tech companies. \
The market is seeing a 'reallocation out of large-cap technology, into other parts of the market,' he said. And this trend could continue for the foreseeable future. 'When you get these reallocation trades, a de-risking, this can go on for months and months.' The FAANG’s are pricey stocks, he said, pointing out that investors will 'factor in the law of big numbers,' he said. 'Just because they’re big cap doesn’t mean they’re safe,' he added. \
Still, he doesn’t necessarily think that investors are going to shift drastically into value stocks. 'With an increasingly favorable macro backdrop, you have strong growth demand.' \
Studebaker, who runs an artificial intelligence and robotics exchange-traded fund with $4 billion in assets under management, thinks that AI and robotics are better areas of growth. His ETF is up 27% in the past year, while the FAANG stocks are also largely up over that same span, even if they are down since March 12. \
While many point to artificial intelligence as an area that will be a boost to Google and Amazon, Studebaker doesn’t see that as a sign of significant growth for the FAANGs. He pointed out that 'eighty to ninety percent of their businesses are still search,' and that 'AI doesn’t really move the needle on the business.' He also said 'the revenue mix [attributable to AI] in those businesses are insignificant.' \
And while he’s not bullish on FAANG’s, he does say that the one FAANG to still watch out for is Amazon, simply because ecommerce still represents a small portion of the global retail market, giving the company room to grow." 

### 1) Tokenization

Prepare the text for analysis by splitting it into tokens.

In [123]:
import nltk
from nltk import word_tokenize
import string

nltk.download('punkt')

text = word_tokenize(Text)

print ("Tokenized Text: \n")
print (text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Tokenized Text: 

['The', 'FAANG', 'stocks', 'won', '’', 't', 'see', 'much', 'more', 'growth', 'in', 'the', 'near', 'future', ',', 'according', 'to', 'Bill', 'Studebaker', ',', 'founder', 'and', 'Chief', 'Investment', 'Officer', 'of', 'Robo', 'Global', '.', 'Studebaker', 'argues', 'we', 'are', 'seeing', 'a', "'reallocation", "'", 'that', 'will', 'continue', 'from', 'large-cap', 'tech', 'stocks', 'into', 'market-weight', 'stocks', '.', 'The', 'FAANG', 'stocks', 'have', 'had', 'a', 'rough', 'few', 'weeks', ',', 'and', 'have', 'been', 'hit', 'hard', 'since', 'March', '12', '.', 'One', 'FAANG', 'to', 'look', 'out', 'for', ',', 'in', 'the', 'midst', 'of', 'all', 'this', ',', 'is', 'Amazon', ',', 'according', 'to', 'Studebaker', '.', 'The', 'stock', 'market', 'is', 'seeing', 'a', "'reallocation", "'", 'out', 'of', 'FAANG', 'stocks', ',', 'which', 'are', 'not', 'where', 'the', 'smart

### 2) POS Tagging

Attach a part-of-speech tag to each token of the tokenized text.

http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [140]:
nltk.download('averaged_perceptron_tagger')
POS_tag = nltk.pos_tag(text)

print(POS_tag)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[('The', 'DT'), ('FAANG', 'NNP'), ('stocks', 'NNS'), ('won', 'VBD'), ('’', 'JJ'), ('t', 'NN'), ('see', 'VBP'), ('much', 'RB'), ('more', 'JJR'), ('growth', 'NN'), ('in', 'IN'), ('the', 'DT'), ('near', 'JJ'), ('future', 'NN'), (',', ','), ('according', 'VBG'), ('to', 'TO'), ('Bill', 'NNP'), ('Studebaker', 'NNP'), (',', ','), ('founder', 'NN'), ('and', 'CC'), ('Chief', 'NNP'), ('Investment', 'NNP'), ('Officer', 'NNP'), ('of', 'IN'), ('Robo', 'NNP'), ('Global', 'NNP'), ('.', '.'), ('Studebaker', 'NNP'), ('argues', 'VBZ'), ('we', 'PRP'), ('are', 'VBP'), ('seeing', 'VBG'), ('a', 'DT'), ("'reallocation", 'NN'), ("'", 'POS'), ('that', 'WDT'), ('will', 'MD'), ('continue', 'VB'), ('from', 'IN'), ('large-cap', 'JJ'), ('tech', 'NN'), ('stocks', 'NNS'), ('into', 'IN'), ('market-weight', 'JJ'), ('stocks', 'NNS'), (

### 3) Lemmatization
    
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

In [164]:
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

lemmatized_text = []
for word in POS_tag:
  lemmatized_text.append(str(wordnet_lemmatizer.lemmatize(word[0])))
        
print(lemmatized_text)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
['The', 'FAANG', 'stock', 'won', '’', 't', 'see', 'much', 'more', 'growth', 'in', 'the', 'near', 'future', ',', 'according', 'to', 'Bill', 'Studebaker', ',', 'founder', 'and', 'Chief', 'Investment', 'Officer', 'of', 'Robo', 'Global', '.', 'Studebaker', 'argues', 'we', 'are', 'seeing', 'a', "'reallocation", "'", 'that', 'will', 'continue', 'from', 'large-cap', 'tech', 'stock', 'into', 'market-weight', 'stock', '.', 'The', 'FAANG', 'stock', 'have', 'had', 'a', 'rough', 'few', 'week', ',', 'and', 'have', 'been', 'hit', 'hard', 'since', 'March', '12', '.', 'One', 'FAANG', 'to', 'look', 'out', 'for', ',', 'in', 'the', 'midst', 'of', 'all', 'this', ',', 'is', 'Amazon', ',', 'according', 'to', 'Studebaker', '.', 'The', 'stock', 'market', 'is', 'seeing', 'a', "'reallocation", "'", 'out', 'of', 'FAANG', 'stock', ',', 'which', 'are', 'not', 'where', 'the', 'smart', 'money', 'is', ',

### 4) Stopword handling and removal of unwanted parts of speech

In [170]:
stopwords = []  # list of stopwords

# POS tags that are eligible as keywords
wanted_POS = ['NN','NNS','NNP','NNPS']

# Register every token whose POS tag is not eligible as a stopword
for word in POS_tag:
    if word[1] not in wanted_POS:
        stopwords.append(word[0])

# Add punctuation characters to the stopwords
punctuations = list(str(string.punctuation))
stopwords = stopwords + punctuations


# Add user-defined tokens to the stopwords
stopwords_plus = ['t', 'isn']
stopwords = stopwords + stopwords_plus
stopwords = set(stopwords)


processed_text = []
for word in lemmatized_text:
    if word not in stopwords:
        processed_text.append(word)
print(processed_text)

['stock', 'growth', 'future', 'Bill', 'Studebaker', 'founder', 'Chief', 'Investment', 'Officer', 'Robo', 'Global', 'Studebaker', "'reallocation", 'tech', 'stock', 'stock', 'stock', 'week', 'March', 'midst', 'Amazon', 'Studebaker', 'stock', 'market', "'reallocation", 'stock', 'money', 'founder', 'Chief', 'Investment', 'Officer', 'Robo', 'Global', 'Bill', 'Studebaker', 'Business', 'Insider', 'stock', 'Facebook', 'Apple', 'Amazon', 'Netflix', 'Google', 'March', 'trend', 'news', 'Facebook', 'data', 'scandal', 'Nasdaq', 'frenzy', 'Investors', 'news', 'stock', 'optimist', 'Studebaker', 'trade', 'month', 'lot', 'performance', 'attribution', 'stock', 'stock', 'market', 'gain', 'month', 'tech', 'company', 'market', "'reallocation", 'technology', 'part', 'market', 'trend', 'future', 'reallocation', 'trade', 'de-risking', 'month', 'month', 'stock', 'investor', 'law', 'number', 'cap', 'mean', 'investor', 'value', 'stock', 'macro', 'backdrop', 'growth', 'demand', 'Studebaker', 'run', 'intelligence'
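
The cell above filters by word identity: once a surface form has received an unwanted tag anywhere in the text, it is dropped everywhere. A possible alternative, shown here only as a sketch and not as the notebook's method, is to filter position by position, keeping the lemma at position k only when the tag at position k is wanted. The function name `keep_wanted` and the toy data are assumptions made for this illustration, and the result on the full text may differ from the output above.

```python
WANTED_POS = {"NN", "NNS", "NNP", "NNPS"}

def keep_wanted(lemmas, tagged_tokens, wanted_pos=WANTED_POS):
    """Keep lemma k only if the POS tag of token k is in wanted_pos."""
    return [
        lemma
        for lemma, (_, tag) in zip(lemmas, tagged_tokens)
        if tag in wanted_pos
    ]

# Toy input: (token, tag) pairs as returned by a POS tagger, plus their lemmas.
toy_tagged = [("The", "DT"), ("stocks", "NNS"), ("see", "VBP"), ("growth", "NN")]
toy_lemmas = ["The", "stock", "see", "growth"]
print(keep_wanted(toy_lemmas, toy_tagged))  # ['stock', 'growth']
```

Because the decision is made per position, a lemma that differs from its original token is still filtered by the tag of that token.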

In [171]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

### 5) Build the list of unique tokens

Build the list of unique tokens that will serve as the vertices of the graph.

In [172]:
vocabulary = list(set(processed_text))
print(vocabulary)

['lot', 'portion', 'Insider', 'Nasdaq', 'needle', 'midst', 'investor', 'company', 'Investors', 'macro', 'asset', 'Bill', 'span', 'gain', 'doe', 'fund', 'Robo', 'data', 'year', 'law', 'mean', 'performance', 'backdrop', 'money', 'optimist', 'technology', 'scandal', 'demand', 'run', 'area', 'revenue', 'trade', 'news', 'part', 'business', 'founder', 'Netflix', 'FAANGs', 'stock', 'Apple', 'Facebook', 'mix', 'de-risking', 'Amazon', 'future', 'ecommerce', 'Studebaker', 'room', 'March', 'month', 'number', 'attributable', 'market', 'reallocation', 'cap', 'growth', 'percent', 'management', 'sign', "'AI", 'Chief', 'trend', 'attribution', 'point', 'Investment', 'robotics', 'ETF', 'AI', 'boost', 'Global', 'Officer', 'tech', 'week', "'reallocation", 'Business', 'frenzy', 'intelligence', 'Google', 'value']


### 6) Build the graph

TextRank is a graph-based model, so we first need to build a graph. Each word in the vocabulary serves as a vertex, and each vertex is identified by the word's index in the vocabulary list.

The weighted_edge matrix stores the edge connections among all vertices.
The edges built here are weighted and undirected.

weighted_edge[i][j] contains the weight of the edge between the word at vocabulary index i and the word at vocabulary index j.

If weighted_edge[i][j] is zero, there is no edge between the words at indices i and j.

Two words are connected (and so are the indices i and j that represent them) if they co-occur within a window of window_size tokens in processed_text.

Each time such a co-occurrence is found at a new pair of positions in the text, weighted_edge[i][j] is increased by 1/(distance between the positions of the two words).

The covered_coocurrences list holds the pairs of absolute positions in processed_text whose co-occurrence has already been counted, so that the same two words at the same positions are not counted repeatedly while the window slides one token at a time.

The score of every vertex is initialized to one.

Self-connections are not considered, so weighted_edge[i][i] is always zero.

In [0]:
import numpy as np
import math
vocab_len = len(vocabulary)

weighted_edge = np.zeros((vocab_len,vocab_len),dtype=np.float32)

score = np.zeros((vocab_len),dtype=np.float32)
window_size = 3
covered_coocurrences = []

for i in range(0,vocab_len):
    score[i]=1
    for j in range(0,vocab_len):
        if j==i:
            weighted_edge[i][j]=0
        else:
            for window_start in range(0,(len(processed_text)-window_size)):
                
                window_end = window_start+window_size
                
                window = processed_text[window_start:window_end]
                
                if (vocabulary[i] in window) and (vocabulary[j] in window):
                    
                    index_of_i = window_start + window.index(vocabulary[i])
                    index_of_j = window_start + window.index(vocabulary[j])
                      
                    if [index_of_i,index_of_j] not in covered_coocurrences:
                        weighted_edge[i][j]+=1/math.fabs(index_of_i-index_of_j)
                        covered_coocurrences.append([index_of_i,index_of_j])
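
The 1/distance weighting is easier to follow on a tiny input. The sketch below is a simplified variant, not a drop-in replacement for the cell above: it walks over pairs of positions directly instead of sliding a window and tracking covered pairs, so its numbers can differ slightly from the notebook's matrix (for example when a word repeats inside one window). The function name `build_weighted_edges` and the toy tokens are assumptions made for this illustration.

```python
from collections import defaultdict

def build_weighted_edges(tokens, window_size=3):
    """Return {(word_a, word_b): weight} using 1/distance co-occurrence weights."""
    edges = defaultdict(float)
    for p, word_p in enumerate(tokens):
        # Positions q that fit into one window of window_size tokens together with p.
        for q in range(p + 1, min(p + window_size, len(tokens))):
            word_q = tokens[q]
            if word_p == word_q:
                continue  # self-connections are not considered
            weight = 1.0 / (q - p)
            edges[(word_p, word_q)] += weight
            edges[(word_q, word_p)] += weight  # undirected: store both directions
    return dict(edges)

toy_tokens = ["stock", "market", "stock", "growth"]
toy_edges = build_weighted_edges(toy_tokens, window_size=3)
print(toy_edges[("stock", "market")])   # 2.0 (adjacent twice: positions 0-1 and 1-2)
print(toy_edges[("market", "growth")])  # 0.5 (two positions apart)
```

Each position pair is visited exactly once, which removes the need for a list such as covered_coocurrences.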


### Calculating weighted summation of connections of a vertex

inout[i] contains the sum of the weights of all undirected edges attached to the vertex represented by i.

In [0]:
inout = np.zeros((vocab_len),dtype=np.float32)

for i in range(0,vocab_len):
    for j in range(0,vocab_len):
        inout[i]+=weighted_edge[i][j]

### Scoring Vertices

The formula used for scoring a vertex represented by i is:

score[i] = (1-d) + d x [ Summation(j) ( (weighted_edge[i][j]/inout[j]) x score[j] ) ], where j ranges over the vertices that have a connection with i.

d is the damping factor.

The score is iteratively updated until convergence. 

In [154]:
MAX_ITERATIONS = 50
d=0.85
threshold = 0.0001 #convergence threshold

for iter in range(0,MAX_ITERATIONS):
    prev_score = np.copy(score)
    
    for i in range(0,vocab_len):
        
        summation = 0
        for j in range(0,vocab_len):
            if weighted_edge[i][j] != 0:
                summation += (weighted_edge[i][j]/inout[j])*score[j]
                
        score[i] = (1-d) + d*(summation)
    
    if np.sum(np.fabs(prev_score-score)) <= threshold: #convergence condition
        print("Converging at iteration "+str(iter)+"....")
        break


Converging at iteration 31....
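
To see the update rule at work on something small enough to check by hand, the sketch below applies the same formula to a three-vertex chain a - b - c with unit edge weights. It uses plain Python lists instead of NumPy arrays; the function name `textrank_scores` and the toy matrix are assumptions made for this illustration.

```python
def textrank_scores(weights, d=0.85, max_iterations=50, threshold=1e-4):
    """weights[i][j] is the symmetric edge weight between vertices i and j."""
    n = len(weights)
    inout = [sum(row) for row in weights]  # total edge weight at each vertex
    score = [1.0] * n                      # every vertex starts with score 1
    for _ in range(max_iterations):
        prev = list(score)
        for i in range(n):
            summation = sum(
                weights[i][j] / inout[j] * score[j]
                for j in range(n)
                if weights[i][j] != 0
            )
            score[i] = (1 - d) + d * summation
        # Stop once the total change between two sweeps is small enough.
        if sum(abs(a - b) for a, b in zip(prev, score)) <= threshold:
            break
    return score

# Toy chain graph a - b - c: b is connected to both neighbours.
toy_weights = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
toy_scores = textrank_scores(toy_weights)
print(toy_scores)
```

The middle vertex b ends up with the highest score, because it receives the full score of both a and c, while a and c each receive only half of b's score.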


In [155]:
for i in range(0,vocab_len):
    print("Score of "+vocabulary[i]+": "+str(score[i]))

Score of lot: 0.6848181
Score of portion: 0.6340738
Score of Insider: 0.6298209
Score of Nasdaq: 0.8359115
Score of needle: 0.7524147
Score of isn: 0.7430971
Score of midst: 0.6222233
Score of investor: 1.2873713
Score of company: 0.8264576
Score of Investors: 0.8021918
Score of macro: 0.6596142
Score of asset: 0.75754374
Score of Bill: 1.0430568
Score of span: 0.6415029
Score of gain: 0.59739095
Score of doe: 0.68117535
Score of fund: 0.73801804
Score of Robo: 1.0141851
Score of data: 0.75828475
Score of year: 0.65219885
Score of law: 0.7377731
Score of mean: 0.76092434
Score of performance: 0.6963699
Score of backdrop: 0.6586655
Score of money: 0.59004873
Score of optimist: 0.60736376
Score of technology: 0.6115385
Score of scandal: 0.8069137
Score of demand: 0.6338073
Score of run: 0.6435892
Score of area: 1.2065072
Score of revenue: 0.77174926
Score of trade: 1.157185
Score of news: 1.2283493
Score of part: 0.61795616
Score of business: 1.9873168
Score of founder: 1.0434319
Score o

### Phrase Partitioning

Partition lemmatized_text into phrases, using the stopwords in it as delimiters.
These phrases are the candidate keyphrases to be extracted.

In [156]:
phrases = []

phrase = " "
for word in lemmatized_text:
    
    if word in stopwords:
        if phrase!= " ":
            phrases.append(str(phrase).strip().split())
        phrase = " "
    elif word not in stopwords:
        phrase+=str(word)
        phrase+=" "

print("Partitioned Phrases (Candidate Keyphrases): \n")
print(phrases)

Partitioned Phrases (Candidate Keyphrases): 

[['stock'], ['growth'], ['future'], ['Bill', 'Studebaker'], ['founder'], ['Chief', 'Investment', 'Officer'], ['Robo', 'Global'], ['Studebaker'], ["'reallocation"], ['tech', 'stock'], ['stock'], ['stock'], ['week'], ['March'], ['midst'], ['Amazon'], ['Studebaker'], ['stock', 'market'], ["'reallocation"], ['stock'], ['money'], ['founder'], ['Chief', 'Investment', 'Officer'], ['Robo', 'Global', 'Bill', 'Studebaker'], ['Business', 'Insider'], ['stock'], ['Facebook'], ['Apple'], ['Amazon'], ['Netflix'], ['Google'], ['March'], ['trend'], ['news'], ['Facebook', 'data', 'scandal'], ['Nasdaq'], ['frenzy'], ['Investors'], ['isn'], ['news'], ['stock', 'optimist'], ['Studebaker'], ['trade'], ['month'], ['lot'], ['performance', 'attribution'], ['stock'], ['stock', 'market'], ['gain'], ['month'], ['tech', 'company'], ['market'], ["'reallocation"], ['technology'], ['part'], ['market'], ['trend'], ['future'], ['reallocation', 'trade'], ['de-risking'], ['mo
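
The same partitioning idea can be written compactly with `itertools.groupby`, which groups consecutive tokens by whether they are stopwords. This is a sketch under the assumption that tokens and stopwords are plain strings; the function name `partition_phrases` and the toy data are hypothetical.

```python
from itertools import groupby

def partition_phrases(tokens, stopwords):
    """Split tokens into runs of consecutive non-stopwords."""
    return [
        list(group)
        for is_stopword, group in groupby(tokens, key=lambda t: t in stopwords)
        if not is_stopword
    ]

toy_tokens = ["The", "stock", "market", "is", "seeing", "a", "reallocation"]
toy_stopwords = {"The", "is", "seeing", "a"}
print(partition_phrases(toy_tokens, toy_stopwords))
# [['stock', 'market'], ['reallocation']]
```

One difference from the loop in the cell above: a phrase at the very end of the text is emitted here even when no stopword follows it, whereas the loop only appends a phrase when it reaches the next stopword.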

### Create a list of unique phrases.

Repeated phrases (keyphrase candidates) serve no purpose from here on, so only one copy of each is kept.

In [157]:
unique_phrases = []

for phrase in phrases:
    if phrase not in unique_phrases:
        unique_phrases.append(phrase)

print("Unique Phrases (Candidate Keyphrases): \n")
print(unique_phrases)

Unique Phrases (Candidate Keyphrases): 

[['stock'], ['growth'], ['future'], ['Bill', 'Studebaker'], ['founder'], ['Chief', 'Investment', 'Officer'], ['Robo', 'Global'], ['Studebaker'], ["'reallocation"], ['tech', 'stock'], ['week'], ['March'], ['midst'], ['Amazon'], ['stock', 'market'], ['money'], ['Robo', 'Global', 'Bill', 'Studebaker'], ['Business', 'Insider'], ['Facebook'], ['Apple'], ['Netflix'], ['Google'], ['trend'], ['news'], ['Facebook', 'data', 'scandal'], ['Nasdaq'], ['frenzy'], ['Investors'], ['isn'], ['stock', 'optimist'], ['trade'], ['month'], ['lot'], ['performance', 'attribution'], ['gain'], ['tech', 'company'], ['market'], ['technology'], ['part'], ['reallocation', 'trade'], ['de-risking'], ['investor'], ['law'], ['number'], ['cap'], ['mean'], ['value', 'stock'], ['macro', 'backdrop'], ['growth', 'demand'], ['run'], ['intelligence'], ['robotics'], ['fund'], ['asset'], ['management'], ['AI'], ['area'], ['ETF'], ['year'], ['span'], ['point'], ['boost'], ['sign'], ['FAANG

### Thinning the list of candidate keyphrases

Remove single-word keyphrase candidates that also appear as part of a multi-word candidate.

In [158]:
for word in vocabulary:
    for phrase in unique_phrases:
        # If the word exists as a single-word phrase and also appears inside
        # a multi-word phrase (len(phrase) > 1), drop the single-word phrase.
        if (word in phrase) and ([word] in unique_phrases) and (len(phrase)>1):
            unique_phrases.remove([word])
            break  # [word] is gone; stop scanning the list that was just modified
            
print("Thinned Unique Phrases (Candidate Keyphrases): \n")
print(unique_phrases)

Thinned Unique Phrases (Candidate Keyphrases): 

[['future'], ['Bill', 'Studebaker'], ['founder'], ['Chief', 'Investment', 'Officer'], ['Robo', 'Global'], ["'reallocation"], ['tech', 'stock'], ['week'], ['March'], ['midst'], ['Amazon'], ['stock', 'market'], ['money'], ['Robo', 'Global', 'Bill', 'Studebaker'], ['Business', 'Insider'], ['Apple'], ['Netflix'], ['Google'], ['trend'], ['news'], ['Facebook', 'data', 'scandal'], ['Nasdaq'], ['frenzy'], ['Investors'], ['isn'], ['stock', 'optimist'], ['month'], ['lot'], ['performance', 'attribution'], ['gain'], ['tech', 'company'], ['technology'], ['part'], ['reallocation', 'trade'], ['de-risking'], ['investor'], ['law'], ['number'], ['cap'], ['mean'], ['value', 'stock'], ['macro', 'backdrop'], ['growth', 'demand'], ['run'], ['intelligence'], ['robotics'], ['fund'], ['asset'], ['management'], ['AI'], ['area'], ['ETF'], ['year'], ['span'], ['point'], ['boost'], ['sign'], ['FAANGs'], ['percent'], ['business'], ["'AI"], ['needle'], ['revenue', 'mi
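
The thinning step can also be expressed without modifying a list while looping over it. The sketch below first collects every word that occurs inside a multi-word phrase and then keeps a single-word phrase only if its word is not in that set. It assumes every phrase is a non-empty list of strings; the function name `thin_candidates` and the toy data are hypothetical.

```python
def thin_candidates(phrases):
    """Drop single-word phrases whose word also occurs in a multi-word phrase."""
    # Words that occur inside at least one multi-word phrase.
    words_in_multi = {word for phrase in phrases if len(phrase) > 1 for word in phrase}
    return [
        phrase
        for phrase in phrases
        if len(phrase) > 1 or phrase[0] not in words_in_multi
    ]

toy_phrases = [["stock"], ["growth"], ["tech", "stock"], ["market"], ["stock", "market"]]
print(thin_candidates(toy_phrases))
# [['growth'], ['tech', 'stock'], ['stock', 'market']]
```

Building the set once also avoids rescanning the whole phrase list for every vocabulary word.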

### Scoring Keyphrases

Score the phrases (candidate keyphrases) and build a list of keyphrases/keywords
by joining each tokenized phrase back into a single string.
A phrase's score is the sum of the scores of its member words (the text units ranked by the graph algorithm).


In [159]:
phrase_scores = []
keywords = []
for phrase in unique_phrases:
    phrase_score=0
    keyword = ''
    for word in phrase:
        keyword += str(word)
        keyword += " "
        phrase_score+=score[vocabulary.index(word)]
    phrase_scores.append(phrase_score)
    keywords.append(keyword.strip())

i=0
for keyword in keywords:
    print("Keyword: '"+str(keyword)+"', Score: "+str(phrase_scores[i]))
    i+=1

Keyword: 'future', Score: 1.0961461067199707
Keyword: 'Bill Studebaker', Score: 4.527086615562439
Keyword: 'founder', Score: 1.0434318780899048
Keyword: 'Chief Investment Officer', Score: 3.07278048992157
Keyword: 'Robo Global', Score: 2.026546359062195
Keyword: ''reallocation', Score: 1.517419695854187
Keyword: 'tech stock', Score: 6.0930193066596985
Keyword: 'week', Score: 0.6178255677223206
Keyword: 'March', Score: 1.6795504093170166
Keyword: 'midst', Score: 0.6222233176231384
Keyword: 'Amazon', Score: 2.1962010860443115
Keyword: 'stock market', Score: 7.556353330612183
Keyword: 'money', Score: 0.5900487303733826
Keyword: 'Robo Global Bill Studebaker', Score: 6.553632974624634
Keyword: 'Business Insider', Score: 1.2468579411506653
Keyword: 'Apple', Score: 0.6438391208648682
Keyword: 'Netflix', Score: 0.6378358602523804
Keyword: 'Google', Score: 1.141761064529419
Keyword: 'trend', Score: 1.1278831958770752
Keyword: 'news', Score: 1.2283493280410767
Keyword: 'Facebook data scandal', S

### Ranking Keyphrases

Rank the keyphrases by their calculated scores and display the top keywords_num of them.

In [160]:
sorted_index = np.flip(np.argsort(phrase_scores),0)

keywords_num = 10

print("Keywords:\n")

for i in range(0,keywords_num):
    print(str(keywords[sorted_index[i]])+", ")

Keywords:

stock market, 
Robo Global Bill Studebaker, 
tech stock, 
value stock, 
stock optimist, 
Bill Studebaker, 
Chief Investment Officer, 
Facebook data scandal, 
growth demand, 
Amazon, 
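
Scoring and ranking can be combined in one small function, sketched here on toy data. It assumes word scores are available as a dictionary rather than as an array indexed through the vocabulary; the function name `rank_phrases` and the sample scores are hypothetical and are not taken from the run above.

```python
def rank_phrases(phrases, word_scores, top_n=10):
    """Score each phrase as the sum of its word scores and return the top_n."""
    scored = [
        (" ".join(phrase), sum(word_scores[word] for word in phrase))
        for phrase in phrases
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]

toy_word_scores = {"stock": 2.0, "market": 1.5, "growth": 1.0}
toy_phrases = [["growth"], ["stock", "market"], ["stock"]]
print(rank_phrases(toy_phrases, toy_word_scores, top_n=2))
# [('stock market', 3.5), ('stock', 2.0)]
```

Because the phrase score is a plain sum, longer phrases tend to rank higher, which is consistent with 'Robo Global Bill Studebaker' appearing near the top of the list above.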


## 2.2 TextRank with gensim

In [174]:
# gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0.
from gensim.summarization import keywords

keywords(Text).split('\n')

['stocks',
 'stock',
 'studebaker',
 'trade',
 'trades',
 'amazon',
 'tech',
 'attribution',
 'attributable',
 'cap',
 'facebook',
 'market',
 'future',
 'growth',
 'thinks',
 'think',
 'frenzy',
 'investment',
 'big',
 'global',
 'favorable macro']

In [175]:
# gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0.
from gensim.summarization.summarizer import summarize

print(summarize(Text))

The FAANG stocks won’t see much more growth in the near future, according to Bill Studebaker, founder and Chief Investment Officer of Robo Global.
Studebaker argues we are seeing a 'reallocation' that will continue from large-cap tech stocks into market-weight stocks.
The stock market is seeing a 'reallocation' out of FAANG stocks, which are not where the smart money is, founder and Chief Investment Officer of Robo Global Bill Studebaker told Business Insider.
