### Dependent Libraries

In [22]:
import pandas as pd
import csv
# gensim.summarization was removed in gensim 4.0, so this notebook needs gensim 3.x
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

### Preprocess Input Data

In [10]:
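df = pd.read_csv('./cve_data_description_only.csv')
# drop rows with missing values
df.dropna(inplace=True)

cve_arr = []

# skip descriptions flagged with '**' markers (e.g. '** RESERVED **' or '** REJECT **')
for cve in df['Description']:
    if '**' not in cve:
        cve_arr.append(cve)

# remove duplicates (set() does not preserve order, so row order can vary between runs)
cve_arr = list(set(cve_arr))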

### Summarization

1. clean the raw text (remove numbers, punctuation, etc.)
2. split the clean text into individual sentences
3. tokenize each sentence into its words
4. convert the tokenized sentences into sentence vectors (a sequence of word vectors, or the average of the word vectors in a sentence)
5. create a similarity matrix (# of sentences by # of sentences) where each entry is the similarity between 2 sentence vectors (cosine similarity)
6. create a graph out of the similarity matrix (nodes: sentences, edges: similarity)
7. rank each node (PageRank algorithm) to score the importance of each sentence
8. output the top 'k' sentences
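
The steps above can be sketched in plain Python. This is a minimal illustration only, not gensim's actual implementation: the function names (`split_sentences`, `tokenize`, `cosine`, `pagerank`, `summarize_sketch`) are hypothetical, sentence vectors are simple bag-of-words counts rather than word embeddings, and the damping factor of 0.85 is an assumed default.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Step 2: split after sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def tokenize(sentence):
    # Steps 1 and 3: lowercase and keep alphabetic tokens only
    # (this drops numbers and punctuation).
    return re.findall(r'[a-z]+', sentence.lower())


def cosine(a, b):
    # Step 5: cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pagerank(matrix, damping=0.85, iterations=50):
    # Step 7: power iteration over the weighted similarity graph.
    n = len(matrix)
    scores = [1.0 / n] * n
    row_sums = [sum(row) for row in matrix]
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * matrix[i][j] / row_sums[i]
                for i in range(n) if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def summarize_sketch(text, k=1):
    sentences = split_sentences(text)
    if len(sentences) <= k:
        return sentences  # nothing to rank
    # Step 4: one bag-of-words vector per sentence (an assumed simplification).
    vectors = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    # Steps 5-6: the similarity matrix doubles as the graph's adjacency matrix.
    matrix = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(matrix)
    # Step 8: keep the k best-scoring sentences, in their original order.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the others receives no incoming score, so it ranks last. This is also why a single-sentence input has nothing to rank against.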

### Keyword Extraction

1. Follow the same process as Summarization, but build the graph from words instead of sentences, so the ranking picks the most important words rather than the most important sentences
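
A word-level version of the same idea can be sketched as follows. This is an illustration under stated assumptions, not gensim's `keywords` implementation: `keywords_sketch`, the tiny `STOPWORDS` set, and the co-occurrence window of 2 are all hypothetical choices. Words become nodes, words that appear near each other are linked, and the same PageRank-style iteration scores them.

```python
import re
from collections import defaultdict

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {'a', 'an', 'the', 'in', 'of', 'to', 'via', 'and', 'or', 'for'}


def keywords_sketch(text, top_n=5, window=2, damping=0.85, iterations=50):
    words = [w for w in re.findall(r'[a-z]+', text.lower()) if w not in STOPWORDS]

    # Build an undirected co-occurrence graph: words within `window`
    # positions of each other are linked.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    nodes = list(dict.fromkeys(words))  # unique words, first-seen order
    if not nodes:
        return []

    # PageRank-style power iteration on the unweighted word graph.
    scores = {w: 1.0 / len(nodes) for w in nodes}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) / len(nodes)
            + damping * sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in nodes
        }

    return sorted(nodes, key=lambda w: scores[w], reverse=True)[:top_n]
```

Words that co-occur with many different neighbors accumulate the highest scores, which is the word-level analogue of a sentence being similar to many other sentences.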

In [20]:
keywords_arr = []

for i, cve in enumerate(cve_arr):
    # attempt to summarize the input CVE description
    try:
        summary = summarize(cve, ratio=0.2)
    except Exception:
        # summarization failed (e.g. the input was only one sentence),
        # so fall back to the raw description
        summary = cve

    raw_keywords = keywords(cve, words=5, split=True)
    summary_keywords = keywords(summary, words=5, lemmatize=True, split=True)

    # print the first 50 results for manual inspection
    if i < 50:
        print('raw:', cve)
        print('raw keywords:', raw_keywords)
        print('summary:', summary)
        print('summary keywords:', summary_keywords)
        print('------------------------------------')

    # only the raw-text keywords are saved; the summary keywords are printed for comparison
    keywords_arr.append(str(raw_keywords))

raw: Multiple unspecified vulnerabilities in the kernel in Sun Solaris 8 through 10 allow local users to cause a denial of service (panic), related to the support for retrieval of kernel statistics, and possibly related to the sfmmu_mlspl_enter or sfmmu_mlist_enter functions.
raw keywords: ['unspecified', 'local', 'panic related', 'functions']
summary: Multiple unspecified vulnerabilities in the kernel in Sun Solaris 8 through 10 allow local users to cause a denial of service (panic), related to the support for retrieval of kernel statistics, and possibly related to the sfmmu_mlspl_enter or sfmmu_mlist_enter functions.
summary keywords: ['local', 'unspecified', 'panic related', 'statistics']
------------------------------------
raw: Multiple directory traversal vulnerabilities in Splunk 4.x before 4.2.5 allow remote authenticated users to read arbitrary files via a .. (dot dot) in a URI to (1) Splunk Web or (2) the Splunkd HTTP Server, aka SPL-45243.
raw keywords: ['arbitrary', 'remote

raw: Liesbeth base CMS stores sensitive information under the web root with insufficient access control, which allows remote attackers to download an include file containing account credentials via a direct request for config.inc.
raw keywords: ['access', 'remote', 'file', 'account', 'sensitive']
summary: Liesbeth base CMS stores sensitive information under the web root with insufficient access control, which allows remote attackers to download an include file containing account credentials via a direct request for config.inc.
summary keywords: ['remote', 'access', 'account', 'file', 'sensitive']
------------------------------------
raw: The spotim-comments plugin before 4.0.4 for WordPress has multiple XSS issues.
raw keywords: ['xss', 'issues', 'multiple', 'spotim']
summary: The spotim-comments plugin before 4.0.4 for WordPress has multiple XSS issues.
summary keywords: ['xss', 'spotim plugin', 'issues']
------------------------------------
raw: Multiple memory leaks in kadmin/server

In [23]:
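fields = ['Description', 'Keywords']
# pair each description with its extracted keywords
output_data = [[description, kw] for description, kw in zip(cve_arr, keywords_arr)]

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('cve_keywords.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    writer.writerows(output_data)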