Extractive Summarization Section

In [1]:
import json
import pandas as pd
import numpy as np
from helper_functions import *

In [2]:
df = pd.read_csv('Data/processed/LongForm_Clean_Lemma_Telehealth.csv')

In [3]:
df.head(1)

Unnamed: 0,Journal Title,Article Title,Date Published,Authors,Abstract,Keywords,Citation,Content,Content_Length,Abstract_Length,Parsed_Keywords,Parsed_Keywords_Length,Subfield,et_al_Count,LongForm,LongForm_Clean_Content,LongForm_Clean_Content_length,LongForm_Clean_Content_Lemma,Classification
0,Psychological Services,The Effectiveness of Telepsychology With Veter...,2021,Michael J. McClellan; Richard Osbaldiston; Ron...,Veterans face a variety of stressors due to th...,"KEYWORDS:\r\n\r\ntelepsychology, meta-analysis...","McClellan, M. J., Osbaldiston, R., Wu, R., Yea...",Veterans face a variety of stressors related t...,37477,2411,"['telepsychology,', 'meta-analysis,', 'veteran...",5,Clinical & Counseling Psychology,50,Veterans face a variety of stressors related t...,Veterans face a variety of stressors related t...,38022,veteran face a variety of stressor relate to t...,Covid


In [4]:
# Keep only the columns needed for summarization:

summ_df = df[['Abstract', 'LongForm']].copy()

In [5]:
# Some of the articles do not include an Abstract, so we will drop those and redo the index:

print(len(summ_df))
summ_df.dropna(inplace=True)
summ_df.reset_index(drop=True, inplace=True)
print(len(summ_df))

44
39


That dropped 5 articles that did not have an abstract.

In [6]:
# Create a new column with parenthetical references removed. These could interfere with the
# summarization algorithms, so we strip them first (DBB).
# Use summ_df (not df) so the rows line up with the re-indexed frame:

summ_df['LF_no_refs'] = summ_df['LongForm'].apply(remove_text_in_parens)
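
The helper `remove_text_in_parens` comes from `helper_functions` and its body is not shown in this notebook. As a rough illustration only, a helper of this kind could be written with a regular expression. The name `remove_text_in_parens_sketch` and the exact whitespace handling below are assumptions, not the project's actual implementation.

```python
import re

# Hypothetical stand-in for helper_functions.remove_text_in_parens;
# the real helper is not shown in this notebook and may behave differently.
_PAREN_RE = re.compile(r"\s*\([^()]*\)")


def remove_text_in_parens_sketch(text):
    """Remove parenthesized spans such as '(Smith et al., 2019)' from text."""
    previous = None
    # Repeat until nothing changes so nested parentheses are removed inside-out.
    while previous != text:
        previous = text
        text = _PAREN_RE.sub("", text)
    return text


print(remove_text_in_parens_sketch(
    "Telehealth is effective (Smith et al., 2019) for veterans."
))
```

A pattern like this also removes non-citation parentheticals, such as abbreviations introduced in parentheses, so it trades some content for cleaner input to the summarizer.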

<h1>GENSIM Summary Section:

In [7]:
# gensim.summarization was removed in gensim 4.0, so this import requires gensim < 4.0:
from gensim.summarization import summarize

In [8]:
# Proportion of sentences to keep in each summary (ratio recommended by Dr. Diana):

summ_ratio = 0.05

In [9]:
def gnsm_summary(text):
    """Return a gensim extractive summary of `text`, keeping `summ_ratio` of its sentences."""
    return summarize(text, ratio=summ_ratio)
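
To make the effect of `ratio` concrete, here is a minimal, dependency-free sketch of an extractive summarizer that keeps a fixed share of sentences. It is not gensim's algorithm (gensim ranks sentences with a TextRank variant). This stand-in simply scores each sentence by the average corpus frequency of its words, and the function name `frequency_summary` is hypothetical.

```python
import math
import re
from collections import Counter


def frequency_summary(text, ratio=0.05):
    """Keep the top `ratio` share of sentences, ranked by average word frequency.

    Illustrative stand-in only; this is not gensim's TextRank-based summarize().
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freqs = Counter(word for words in tokenized for word in words)
    scores = [
        sum(freqs[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    # Always keep at least one sentence, even for very small ratios.
    keep = max(1, math.ceil(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    # Emit the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(top))


print(frequency_summary("Cats sleep. Cats eat fish. Dogs bark.", ratio=0.34))
```

With a ratio of 0.05, a 200-sentence article would be reduced to roughly 10 sentences under this scheme.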

In [10]:
# Create a column with a gensim summary for each paper:

summ_df['gnsm_summ'] = summ_df['LF_no_refs'].apply(gnsm_summary)

In [11]:
summ_df.head()

Unnamed: 0,Abstract,LongForm,LF_no_refs,gnsm_summ
0,Veterans face a variety of stressors due to th...,Veterans face a variety of stressors related t...,Veterans face a variety of stressors related t...,Telepsychology has generally been found to be ...
1,The COVID-19 pandemic and its requirements for...,The impact of the coronavirus disease 2019 pan...,The impact of the coronavirus disease 2019 pan...,"Finally, because the stakes are so significant..."
2,Forensic e-mental health is an area of psychol...,"In November 2019, coronavirus disease 2019—the...","In November 2019, coronavirus disease 2019—the...","Accessibility is increasingly important, consi..."
3,Many university training clinics are facing nu...,Anxiety and depressive disorders are among the...,Anxiety and depressive disorders are among the...,A review of telepsychology’s specific ethical ...
4,The emergence of the Covid-19 pandemic at the ...,"In mid-March, 2020, the authors—as well as the...","In mid-March, 2020, the authors—as well as the...",The uncertain duration of this new professiona...


<h1>BLEU Scoring Section:

In [12]:
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu

In [13]:
bleu_score_dict = {}

for row in range(len(summ_df)):
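    # Using (1,0,0,0) weights to score unigram overlap only, since higher-order scores were infinitesimal.
    # Note: sentence_bleu expects a list of tokenized references and a tokenized hypothesis. With raw
    # strings, as passed here, NLTK treats each character as a token, so these are not word-level scores.

    bleu_score = sentence_bleu(summ_df.Abstract[row], summ_df.gnsm_summ[row], weights=(1,0,0,0))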
    print(bleu_score)
    bleu_score_dict[row] = bleu_score

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


0.018680445151033388
0.005193456245131133
0.00798782807151008
0.01736274049741906
0.010332661290322585
0.014018691588785043
0.012211981566820278
0.018316831683168312
0.01856973528249704
0.04094165813715455
0.020521353300055466
0.007171593493090785
0.011656441717791406
0.00852382797365362
0.014630577907827357
0.009372559229367349
0.009045465365389192
0.015753938484621154
0.014701378254211327
0.011604868383809794
0.013971358714634997
0.013293943870014768
0.025557011795543903
0.017565872020075285
0.009155873157659672
0.023099133782483152
0.06365503080082137
0.008836876044900882
0.03271028037383178
0.05087440381558029
0.02
0.010851217747769468
0.022950819672131147
0.03575547866205306
0.00809810273021749
0.008586679043861681
0.01150180742688137
0.024268823895457377
0.01956702747710241
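
For comparison, a word-level unigram BLEU can be sketched without NLTK. The function below is a simplified illustration: the name `unigram_bleu`, the lowercasing, and the whitespace tokenization are assumptions. It computes clipped unigram precision multiplied by the standard brevity penalty for a single reference. The equivalent NLTK call would pass token lists, for example `sentence_bleu([abstract.split()], summary.split(), weights=(1,0,0,0))`.

```python
import math
from collections import Counter


def unigram_bleu(reference, hypothesis):
    """Word-level BLEU-1 for one reference: brevity penalty * clipped unigram precision."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(ref_tokens)
    hyp_counts = Counter(hyp_tokens)
    # Clip each hypothesis word count by its count in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    precision = clipped / len(hyp_tokens)
    # Penalize hypotheses that are shorter than the reference.
    if len(hyp_tokens) > len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(hyp_tokens))
    return brevity_penalty * precision


print(unigram_bleu("the cat sat on the mat", "the the cat"))
```

Because the summaries here are typically longer than the abstracts, the brevity penalty would usually be 1 and the score would reduce to clipped unigram precision.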


In [14]:
# Create a column for the BLEU scores from bleu_score_dict (its keys are the row indices):
summ_df['gnsm_bleu'] = pd.Series(bleu_score_dict)

In [15]:
summ_df.head()

Unnamed: 0,Abstract,LongForm,LF_no_refs,gnsm_summ,gnsm_bleu
0,Veterans face a variety of stressors due to th...,Veterans face a variety of stressors related t...,Veterans face a variety of stressors related t...,Telepsychology has generally been found to be ...,0.01868
1,The COVID-19 pandemic and its requirements for...,The impact of the coronavirus disease 2019 pan...,The impact of the coronavirus disease 2019 pan...,"Finally, because the stakes are so significant...",0.005193
2,Forensic e-mental health is an area of psychol...,"In November 2019, coronavirus disease 2019—the...","In November 2019, coronavirus disease 2019—the...","Accessibility is increasingly important, consi...",0.007988
3,Many university training clinics are facing nu...,Anxiety and depressive disorders are among the...,Anxiety and depressive disorders are among the...,A review of telepsychology’s specific ethical ...,0.017363
4,The emergence of the Covid-19 pandemic at the ...,"In mid-March, 2020, the authors—as well as the...","In mid-March, 2020, the authors—as well as the...",The uncertain duration of this new professiona...,0.010333


<h1>Rouge Scoring Section:

In [16]:
from rouge import rouge_score

rouge_scorer = rouge_score.rouge_n

In [17]:
rouge_scores_dict = {}

for row in range(len(summ_df)):
    
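    # Note: rouge_n takes (evaluated, reference) sequences of sentences and defaults to n=2. Passing raw
    # strings, as here, may score character-level rather than word-level overlap, and with the abstract
    # passed first, 'p' and 'r' are relative to the abstract rather than the summary.
    rouge_scores = rouge_scorer(summ_df.Abstract[row], summ_df.gnsm_summ[row])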
    print(rouge_scores)
    rouge_scores_dict[row] = rouge_scores

{'f': 0.7098515469575528, 'p': 0.7185792349726776, 'r': 0.7013333333333334}
{'f': 0.6514745261692387, 'p': 0.8804347826086957, 'r': 0.5170212765957447}
{'f': 0.7154046949462809, 'p': 0.898360655737705, 'r': 0.5943600867678959}
{'f': 0.7471074330742435, 'p': 0.837037037037037, 'r': 0.6746268656716418}
{'f': 0.6833114275227458, 'p': 0.8524590163934426, 'r': 0.5701754385964912}
{'f': 0.7361299003050973, 'p': 0.7953216374269005, 'r': 0.6851385390428212}
{'f': 0.6814268094019539, 'p': 0.8147058823529412, 'r': 0.5856236786469344}
{'f': 0.7483660080721091, 'p': 0.7435064935064936, 'r': 0.7532894736842105}
{'f': 0.7072135735569743, 'p': 0.7911392405063291, 'r': 0.639386189258312}
{'f': 0.7399267349912115, 'p': 0.6644736842105263, 'r': 0.8347107438016529}
{'f': 0.7176079684462645, 'p': 0.7714285714285715, 'r': 0.6708074534161491}
{'f': 0.7258805464849007, 'p': 0.8977272727272727, 'r': 0.609254498714653}
{'f': 0.7170953052149931, 'p': 0.8200692041522492, 'r': 0.6370967741935484}
{'f': 0.72282608
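
As a reference point for reading the `f`, `p`, and `r` values, a word-level ROUGE-N can be sketched in a few lines. This is a count-based illustration under assumed whitespace tokenization and lowercasing. The name `rouge_n_sketch` is hypothetical, and the `rouge` package's own implementation may differ in details such as how repeated n-grams are handled.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_sketch(reference, candidate, n=1):
    """Word-level ROUGE-N: n-gram overlap between a candidate and a reference."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"f": f1, "p": precision, "r": recall}


print(rouge_n_sketch("the cat sat on the mat", "the cat lay on the mat", n=2))
```

Here recall is the share of the reference (the abstract) recovered by the candidate (the generated summary), and precision is the share of the candidate that appears in the reference.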

In [18]:
# Isolate the F1 scores:

rouge_f1_dict = {}

for row, score_dict in rouge_scores_dict.items():
    f1_score = score_dict['f']
    print(f1_score)
    rouge_f1_dict[row] = f1_score

0.7098515469575528
0.6514745261692387
0.7154046949462809
0.7471074330742435
0.6833114275227458
0.7361299003050973
0.6814268094019539
0.7483660080721091
0.7072135735569743
0.7399267349912115
0.7176079684462645
0.7258805464849007
0.7170953052149931
0.7228260820641836
0.7152103509875264
0.7453987681014059
0.7357664184842667
0.6844319725832222
0.729577459887919
0.7067448630792219
0.7745664689934931
0.6848874548668336
0.7715735990746706
0.6791044726144465
0.6731571577392492
0.6810810760971879
0.5351473873377862
0.7030129076181791
0.6550632861490547
0.6372360797138238
0.680921047718144
0.6878306829327288
0.6277873020333108
0.6269841221164809
0.6470588188346021
0.6832116739881294
0.6948356757514799
0.6996805061833845
0.6648575255461928


In [19]:
# Create a column for the ROUGE F1 scores from rouge_f1_dict (its keys are the row indices):
summ_df['gnsm_rouge_f1'] = pd.Series(rouge_f1_dict)

In [20]:
summ_df.head()

Unnamed: 0,Abstract,LongForm,LF_no_refs,gnsm_summ,gnsm_bleu,gnsm_rouge_f1
0,Veterans face a variety of stressors due to th...,Veterans face a variety of stressors related t...,Veterans face a variety of stressors related t...,Telepsychology has generally been found to be ...,0.01868,0.709852
1,The COVID-19 pandemic and its requirements for...,The impact of the coronavirus disease 2019 pan...,The impact of the coronavirus disease 2019 pan...,"Finally, because the stakes are so significant...",0.005193,0.651475
2,Forensic e-mental health is an area of psychol...,"In November 2019, coronavirus disease 2019—the...","In November 2019, coronavirus disease 2019—the...","Accessibility is increasingly important, consi...",0.007988,0.715405
3,Many university training clinics are facing nu...,Anxiety and depressive disorders are among the...,Anxiety and depressive disorders are among the...,A review of telepsychology’s specific ethical ...,0.017363,0.747107
4,The emergence of the Covid-19 pandemic at the ...,"In mid-March, 2020, the authors—as well as the...","In mid-March, 2020, the authors—as well as the...",The uncertain duration of this new professiona...,0.010333,0.683311


<h1>Keyword Extraction Section:

In [21]:
from rake_nltk import Rake

In [22]:
df_overview = pd.read_csv("./Data/processed/LongForm_Clean_Lemma_Telehealth.csv")

In [23]:
# Join all lemmatized articles into one corpus, dropping "et al" citation residue:
keyword_corpus = " ".join(df_overview["LongForm_Clean_Content_Lemma"].str.replace("et al",""))

r_extraction = Rake()

r_extraction.extract_keywords_from_text(keyword_corpus)


r_extraction.get_ranked_phrases()[:10]

['treat various disorder bee 2008 postel de haan de jong 2008 include posttraumatic stress disorder germain marchand bouchard drouin guay 2009 depression sloan gallagher feinstein lee pruneau 2011 anxiety ruskin 2004 substance use frueh henderson myrick 2005 chronic pain macea gajos calil fregni 2010',
 'largely positive adler pritchett kauth nadorff 2014 baird whitney caedo 2018 brooks manson bair dailey shore 2012 cunningham connors lever stephan 2013 levy strachan 2013 mitchell maclaren morton carachi 2009 moreau 2018 whitten kuwahara 2004 wynn bergvik pettersen fossum 2012',
 'neuropsychological testing bouchard 2004 cullum hynan grosch parikh weiner 2014 cullum weiner gehrmann hynan 2006 gehrman shah miles kuna godleski 2016 gros yoder tuerk lozano acierno 2011 hilty 2013 morland 2014 morland hynes mackintosh resick chard 2011the veteran ’',
 'encopresis eg davis sampilo gallagher landrum malone 2013 palermo wilson peters lewandowski somhegyi 2009 richardson frueh grubaugh egede e