Extractive Summarization Section

In [1]:
import json
import pandas as pd
import numpy as np
from helper_functions import *

In [2]:
clusters = pd.read_csv('../Data/summarization_cluster_analysis.csv.csv')

In [4]:
clusters.head(2)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,year,class,cleaned_pca_component_1,cleaned_pca_component_2,cleaned_pca_dbscan_class,cleaned_tsne_dim_1,cleaned_tsne_dim_2,cleaned_tsne_dbscan_class,...,lemma_lg_tsne_dim_1,lemma_lg_tsne_dim_2,lemma_lg_tsne_dbscan_class,Polarity,Subjectivity,Journal Title,Article Title,Abstract,Keywords,Content_Length
0,0,0,2021,post-pandemic,-1.293127,-0.298761,1,-105.3824,-37.88513,0,...,-122.16082,64.9606,0,0.052444,0.391682,Psychological Services,The Effectiveness of Telepsychology With Veter...,Veterans face a variety of stressors due to th...,"KEYWORDS:\n\ntelepsychology, meta-analysis, ve...",37477
1,1,1,2021,post-pandemic,0.015101,-0.601342,0,-14.273545,12.252538,0,...,-105.67202,-34.4959,0,0.08133,0.432419,"Psychology, Public Policy, and Law",Making the Case for Videoconferencing and Remo...,The COVID-19 pandemic and its requirements for...,"KEYWORDS:\n\nremote child custody evaluations,...",74025


In [6]:
# There are 3 clusters in total; articles assigned to -1 are not affiliated with any cluster:
clusters.cleaned_pca_dbscan_class.unique()

array([ 1,  0, -1,  2], dtype=int64)
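DBSCAN implementations such as scikit-learn's mark noise points with the label -1, so the number of real clusters is the number of distinct labels excluding -1. A minimal sketch of that bookkeeping on a toy frame (the titles and labels below are made up for illustration; only the column name is taken from this notebook):

```python
import pandas as pd

# Toy stand-in for the cluster label column; the values are illustrative only.
toy = pd.DataFrame({
    'Article Title': ['a', 'b', 'c', 'd', 'e', 'f'],
    'cleaned_pca_dbscan_class': [1, 0, -1, 2, 0, 1],
})

labels = toy['cleaned_pca_dbscan_class']

# Noise points carry the label -1 and do not form a cluster of their own:
n_clusters = labels[labels != -1].nunique()
n_noise = int((labels == -1).sum())

# Articles per label, noise included:
label_counts = labels.value_counts().sort_index()

print('Clusters:', n_clusters)
print('Unassigned articles:', n_noise)
print(label_counts)
```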

In [7]:
df = pd.read_csv('../Data/processed/Telehealth.csv')

In [22]:
df.head(1)

Unnamed: 0,Journal Title,Article Title,Date Published,Authors,Abstract,Keywords,Citation,Content,Content_Length,Abstract_Length,Parsed_Keywords,Parsed_Keywords_Length,Subfield,et_al_Count,Stopwords_Lemma_Longform_Clean_Content,Clean_Content,Classification
0,Psychological Services,The Effectiveness of Telepsychology With Veter...,2021,Michael J. McClellan; Richard Osbaldiston; Ron...,Veterans face a variety of stressors due to th...,"KEYWORDS:\n\ntelepsychology, meta-analysis, ve...","McClellan, M. J., Osbaldiston, R., Wu, R., Yea...",Veterans face a variety of stressors related t...,37477,2411,"['telepsychology,', 'meta-analysis,', 'veteran...",5,Clinical & Counseling Psychology,50,veteran face variety stressor relate military ...,veteran face variety stressor relate military ...,Covid


In [8]:
# Combine the two dataframes on the article title (inner join; shared column names get _x / _y suffixes):
merged = df.merge(clusters, on='Article Title')

In [9]:
merged.head(1)

Unnamed: 0,Journal Title_x,Article Title,Date Published,Authors,Abstract_x,Keywords_x,Citation,Content,Content_Length_x,Abstract_Length,...,lemma_lg_pca_dbscan_class,lemma_lg_tsne_dim_1,lemma_lg_tsne_dim_2,lemma_lg_tsne_dbscan_class,Polarity,Subjectivity,Journal Title_y,Abstract_y,Keywords_y,Content_Length_y
0,Psychological Services,The Effectiveness of Telepsychology With Veter...,2021,Michael J. McClellan; Richard Osbaldiston; Ron...,Veterans face a variety of stressors due to th...,"KEYWORDS:\n\ntelepsychology, meta-analysis, ve...","McClellan, M. J., Osbaldiston, R., Wu, R., Yea...",Veterans face a variety of stressors related t...,37477,2411,...,-1,-122.16082,64.9606,0,0.052444,0.391682,Psychological Services,Veterans face a variety of stressors due to th...,"KEYWORDS:\n\ntelepsychology, meta-analysis, ve...",37477
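Merging on a free-text title silently drops rows whose titles do not match exactly and duplicates rows when a title repeats. The sketch below, on two hypothetical toy frames, shows how `validate` and `indicator` can surface both problems; it is an illustration, not a check that was run on the real data.

```python
import pandas as pd

# Hypothetical toy frames standing in for df and clusters.
left = pd.DataFrame({
    'Article Title': ['Paper A', 'Paper B', 'Paper C'],
    'Abstract': ['abs a', 'abs b', 'abs c'],
    'Content': ['text a', 'text b', 'text c'],
})
right = pd.DataFrame({
    'Article Title': ['Paper A', 'Paper B', 'Paper D'],
    'Abstract': ['abs a', 'abs b', 'abs d'],
    'lemma_tsne_dbscan_class': [0, 1, 2],
})

# Default inner join; the shared 'Abstract' column becomes Abstract_x / Abstract_y.
# validate raises a MergeError if a title is duplicated on either side.
toy_merged = left.merge(right, on='Article Title', validate='one_to_one')

# An outer join with indicator=True shows which titles failed to match.
audit = left.merge(right, on='Article Title', how='outer', indicator=True)
unmatched = audit.loc[audit['_merge'] != 'both', 'Article Title'].tolist()

print(toy_merged.columns.tolist())
print('Unmatched titles:', unmatched)
```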


In [10]:
# Drop everything we don't need to focus on the text and clusters:

summ_df = merged[['Abstract_x', 'Content', 'lemma_tsne_dbscan_class']].copy()

Create the three cluster corpora, excluding the three papers in the unassigned -1 category:

In [54]:
summ_df_0 = summ_df[summ_df['lemma_tsne_dbscan_class']==0]
summ_df_1 = summ_df[summ_df['lemma_tsne_dbscan_class']==1]
summ_df_2 = summ_df[summ_df['lemma_tsne_dbscan_class']==2]

KeyError: 'lemma_tsne_dbscan_class'

In [59]:
print('Articles in 0 Cluster: ', len(summ_df_0))
print('Articles in 1 Cluster: ', len(summ_df_1))
print('Articles in 2 Cluster: ', len(summ_df_2))

Articles in 0 Cluster:  18
Articles in 1 Cluster:  11
Articles in 2 Cluster:  12


In [14]:
summ_df_0.head(1)

Unnamed: 0,Abstract_x,Content,lemma_tsne_dbscan_class
0,Veterans face a variety of stressors due to th...,Veterans face a variety of stressors related t...,0


In [15]:
summ_df_1.head(1)

Unnamed: 0,Abstract_x,Content,lemma_tsne_dbscan_class
3,Many university training clinics are facing nu...,Anxiety and depressive disorders are among the...,1


In [16]:
summ_df_2.head(1)

Unnamed: 0,Abstract_x,Content,lemma_tsne_dbscan_class
7,Although the medical impacts of COVID-19 are n...,The primary focus of COVID-19 has been on its ...,2


In [21]:
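def create_text(df):
  # Concatenate every article body in the cluster into one string:
  text = ''
  abstracts = ''
  for i in df['Content']:
    text = text + i + ' '
  for j in df['Abstract_x']:
    # Missing abstracts are read in as NaN (a float), so skip anything that is not a string:
    if isinstance(j, str):
      abstracts = abstracts + j + ' '
  return text, abstracts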

corpus_0, abstracts_0 = create_text(summ_df_0)
corpus_1, abstracts_1 = create_text(summ_df_1)
corpus_2, abstracts_2 = create_text(summ_df_2)
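The same concatenation can be written with `str.join`, which avoids rebuilding the string on every iteration. This is a sketch with a hypothetical helper name (`join_texts`) and toy inputs; unlike `create_text`, it leaves no trailing space.

```python
def join_texts(values):
    # Keep only real strings (this skips NaN / None) and join them with single spaces.
    return ' '.join(v for v in values if isinstance(v, str))

# Toy inputs; a pandas column such as df['Content'] can be passed the same way.
toy_content = ['First body.', 'Second body.']
toy_abstracts_in = ['First abstract.', float('nan'), None]

toy_corpus = join_texts(toy_content)
toy_abstracts = join_texts(toy_abstracts_in)

print(toy_corpus)
print(toy_abstracts)
```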

<h1>GENSIM Summary Section:</h1>

In [22]:
# Note: gensim.summarization was removed in Gensim 4.0, so this import requires gensim < 4.0:
from gensim.summarization import summarize

In [38]:
corpora = [corpus_0, corpus_1, corpus_2]

abstracts = [abstracts_0, abstracts_1, abstracts_2]

clean_corpora = []

clean_abstracts = []

# Remove all parenthetical citations, which add no real value to summarization:
for corpus in corpora:
    print('Length Before: ',len(corpus))
    corpus = remove_text_in_parens(corpus)
    print('Length After:  ',len(corpus))
    clean_corpora.append(corpus)
    
for abstract in abstracts:
    print('Length Before: ',len(abstract))
    abstract = remove_text_in_parens(abstract)
    print('Length After:  ',len(abstract))
    clean_abstracts.append(abstract)

summ_corpora = []

for text in clean_corpora:
    # A ratio of 0.05 was recommended by Dr. Diana. It is preferred to word_count=150 here because each corpus combines
    # every article in the cluster, and a ratio scales the summary with the increased length of the corpus:
    summ_corpora.append(summarize(text, ratio=0.05))
    
summ_df = pd.DataFrame(summ_corpora, columns=['Summaries'])
summ_df['Abstracts'] = clean_abstracts

Length Before:  730421
Length After:   651225
Length Before:  341586
Length After:   316731
Length Before:  344330
Length After:   311581
Length Before:  34575
Length After:   33539
Length Before:  12367
Length After:   11770
Length Before:  19163
Length After:   18694
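`remove_text_in_parens` comes from `helper_functions`, whose source is not shown here, so the following is only a guess at the kind of cleaning it performs. The helper name `strip_parentheticals` and the sample sentence are hypothetical. Note that a regex like this removes every parenthetical, not only citations.

```python
import re

def strip_parentheticals(text):
    # Remove innermost (...) groups repeatedly so that nested parentheses are handled.
    pattern = re.compile(r'\s*\([^()]*\)')
    previous = None
    while previous != text:
        previous = text
        text = pattern.sub('', text)
    return text

sample = 'Telepsychology is effective (Smith, 2020) for veterans (see also Lee (2019)).'
cleaned = strip_parentheticals(sample)

print('Length Before: ', len(sample))
print('Length After:  ', len(cleaned))
print(cleaned)
```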


In [39]:
summ_df.head()

Unnamed: 0,Summaries,Abstracts
0,"Given the consequences of these barriers, more...",Veterans face a variety of stressors due to th...
1,"Notably, social distancing guidelines, includi...",Many university training clinics are facing nu...
2,"Not surprisingly then, one of the largest and ...",Although the medical impacts of COVID-19 are n...
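To make the `ratio` argument concrete: an extractive summarizer scores every sentence and keeps the top fraction in their original order. gensim's `summarize` used a TextRank variant for the scoring; the sketch below swaps in a much simpler word-frequency score, so it illustrates the ratio mechanics only and will not reproduce gensim's output. The function name and toy text are hypothetical.

```python
import math
import re
from collections import Counter

def frequency_summarize(text, ratio=0.05):
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return ''
    tokenized = [re.findall(r'[a-z]+', s.lower()) for s in sentences]
    freqs = Counter(w for words in tokenized for w in words)

    # Sentence score = mean corpus frequency of its words (not TextRank).
    def score(words):
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    # Keep a fixed fraction of the sentences, but always at least one.
    n_keep = max(1, math.ceil(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(tokenized[i]), reverse=True)
    keep = sorted(ranked[:n_keep])  # restore document order
    return ' '.join(sentences[i] for i in keep)

toy_text = ('Telehealth helps veterans. '
            'Telehealth helps rural veterans access care. '
            'The weather was nice.')
summary = frequency_summarize(toy_text, ratio=0.34)
print(summary)
```

Because the number of kept sentences grows with the input, a ratio keeps the summary proportional to a corpus built from many articles, whereas a fixed word count would not.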


<h1>BLEU Scoring Section:</h1>

In [40]:
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu

In [42]:
bleu_score_dict = {}

for row in range(len(summ_df)):
    # Use weights=(1,0,0,0) to score unigram overlap only, since the higher-order n-gram scores were infinitesimal.
    # Note: sentence_bleu expects a list of tokenized references and a tokenized hypothesis;
    # raw strings like these are iterated character by character.
    print(row)
    bleu_score = sentence_bleu(summ_df.Abstracts[row], summ_df.Summaries[row], weights=(1,0,0,0))
    print(bleu_score)
    bleu_score_dict[row] = bleu_score
    
# Create a column for the BLEU score from bleu_score_dict:
summ_df['Bleu Score'] = list(bleu_score_dict.values())

0
0.001100556070435588
1
0.001683501683501683
2
0.0019931578164509886


In [43]:
summ_df.head()

Unnamed: 0,Summaries,Abstracts,Bleu Score
0,"Given the consequences of these barriers, more...",Veterans face a variety of stressors due to th...,0.001101
1,"Notably, social distancing guidelines, includi...",Many university training clinics are facing nu...,0.001684
2,"Not surprisingly then, one of the largest and ...",Although the medical impacts of COVID-19 are n...,0.001993
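With `weights=(1,0,0,0)`, BLEU reduces to the clipped unigram precision of the hypothesis multiplied by a brevity penalty. NLTK's `sentence_bleu` takes the references first, as a list of token lists, and then the hypothesis as a token list. The hand-rolled sketch below shows that arithmetic for a single reference, assuming simple lowercase whitespace tokenization; the function name and sentences are hypothetical, and it is not a drop-in replacement for NLTK.

```python
import math
from collections import Counter

def unigram_bleu(reference, hypothesis):
    # Assumption: lowercase whitespace tokenization is good enough for illustration.
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(ref_tokens)
    hyp_counts = Counter(hyp_tokens)
    # Clip each hypothesis token count by its count in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    precision = clipped / len(hyp_tokens)
    # Brevity penalty: penalize hypotheses that are not longer than the reference.
    c, r = len(hyp_tokens), len(ref_tokens)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * precision

toy_reference = 'the cat sat on the mat'
toy_hypothesis = 'the the the cat'
toy_bleu = unigram_bleu(toy_reference, toy_hypothesis)
print(toy_bleu)
```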


BLEU scores are very low, but this isn't too surprising since BLEU wasn't really designed to evaluate automatic summarization techniques. ROUGE will be a better metric to go by:

<h1>Rouge Scoring Section:</h1>

In [45]:
from rouge import Rouge 

rouge_scorer = Rouge().get_scores

In [46]:
rouge_scores_dict = {}

for row in range(len(summ_df)):
    # get_scores takes the hypothesis (generated summary) first and the reference (abstracts) second:
    rouge_scores = rouge_scorer(summ_df.Summaries[row], summ_df.Abstracts[row])
    print(rouge_scores)
    rouge_scores_dict[row] = rouge_scores

[{'rouge-1': {'f': 0.4862243420210064, 'p': 0.3565244279529994, 'r': 0.7642509942554132}, 'rouge-2': {'f': 0.16968929694124865, 'p': 0.12442016286980724, 'r': 0.2667403314917127}, 'rouge-l': {'f': 0.37779894733657743, 'p': 0.29026354319180087, 'r': 0.5409276944065484}}]
[{'rouge-1': {'f': 0.33592261110789656, 'p': 0.21907857006308545, 'r': 0.7198492462311558}, 'rouge-2': {'f': 0.10907491212461867, 'p': 0.07112810707456979, 'r': 0.2338152105593966}, 'rouge-l': {'f': 0.2694560630009139, 'p': 0.18463302752293578, 'r': 0.4984520123839009}}]
[{'rouge-1': {'f': 0.5134625694453173, 'p': 0.40012589173310953, 'r': 0.7163786626596544}, 'rouge-2': {'f': 0.19849178101947484, 'p': 0.1546694648478489, 'r': 0.2769635475385194}, 'rouge-l': {'f': 0.3668088359371192, 'p': 0.2945205479452055, 'r': 0.48612538540596095}}]
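In these dictionaries `p` is precision (the share of summary n-grams that also occur in the abstracts), `r` is recall (the share of abstract n-grams recovered by the summary), and `f` is their harmonic mean. A minimal sketch of ROUGE-1 with clipped unigram counts follows, assuming lowercase whitespace tokenization. The `rouge` package may tokenize and count differently, so the numbers are not expected to match its output exactly.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    # Assumption: lowercase whitespace tokens stand in for the library's tokenizer.
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((hyp & ref).values())
    p = overlap / sum(hyp.values()) if hyp else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {'f': f, 'p': p, 'r': r}

toy_rouge = rouge_1('the cat sat', 'the cat sat on the mat')
print(toy_rouge)
```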


In [47]:
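# Build one DataFrame per cluster, keyed by name, instead of overwriting a single variable:
rouge_frames = {}
for cluster, i in enumerate(rouge_scores_dict.values()):
  print(i)
  rouge_frames['rougescores' + str(cluster)] = pd.DataFrame(i)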

[{'rouge-1': {'f': 0.4862243420210064, 'p': 0.3565244279529994, 'r': 0.7642509942554132}, 'rouge-2': {'f': 0.16968929694124865, 'p': 0.12442016286980724, 'r': 0.2667403314917127}, 'rouge-l': {'f': 0.37779894733657743, 'p': 0.29026354319180087, 'r': 0.5409276944065484}}]
[{'rouge-1': {'f': 0.33592261110789656, 'p': 0.21907857006308545, 'r': 0.7198492462311558}, 'rouge-2': {'f': 0.10907491212461867, 'p': 0.07112810707456979, 'r': 0.2338152105593966}, 'rouge-l': {'f': 0.2694560630009139, 'p': 0.18463302752293578, 'r': 0.4984520123839009}}]
[{'rouge-1': {'f': 0.5134625694453173, 'p': 0.40012589173310953, 'r': 0.7163786626596544}, 'rouge-2': {'f': 0.19849178101947484, 'p': 0.1546694648478489, 'r': 0.2769635475385194}, 'rouge-l': {'f': 0.3668088359371192, 'p': 0.2945205479452055, 'r': 0.48612538540596095}}]


In [48]:
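scores = []
frames = []
# Each value is a one-element list holding that cluster's scores, keyed by metric:
for i, k in rouge_scores_dict.items():
  print(k[0])
  for score, d in k[0].items():
    scores.append(score)
    # One frame per metric, with f / p / r as the row index:
    frames.append(pd.DataFrame.from_dict(d, orient='index'))

# The keys must follow the order in which the frames were appended (cluster by cluster, metric by metric):
rouge = pd.concat(frames, keys=['Rouge 1 Cluster 0', 'Rouge 2 Cluster 0', 'Rouge l Cluster 0', 
                        'Rouge 1 Cluster 1', 'Rouge 2 Cluster 1', 'Rouge l Cluster 1',
                        'Rouge 1 Cluster 2', 'Rouge 2 Cluster 2', 'Rouge l Cluster 2'
                        ])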

{'rouge-1': {'f': 0.4862243420210064, 'p': 0.3565244279529994, 'r': 0.7642509942554132}, 'rouge-2': {'f': 0.16968929694124865, 'p': 0.12442016286980724, 'r': 0.2667403314917127}, 'rouge-l': {'f': 0.37779894733657743, 'p': 0.29026354319180087, 'r': 0.5409276944065484}}
{'rouge-1': {'f': 0.33592261110789656, 'p': 0.21907857006308545, 'r': 0.7198492462311558}, 'rouge-2': {'f': 0.10907491212461867, 'p': 0.07112810707456979, 'r': 0.2338152105593966}, 'rouge-l': {'f': 0.2694560630009139, 'p': 0.18463302752293578, 'r': 0.4984520123839009}}
{'rouge-1': {'f': 0.5134625694453173, 'p': 0.40012589173310953, 'r': 0.7163786626596544}, 'rouge-2': {'f': 0.19849178101947484, 'p': 0.1546694648478489, 'r': 0.2769635475385194}, 'rouge-l': {'f': 0.3668088359371192, 'p': 0.2945205479452055, 'r': 0.48612538540596095}}


In [50]:
rouge

Unnamed: 0,Unnamed: 1,0
Rouge 1 Cluster 0,f,0.486224
Rouge 1 Cluster 0,p,0.356524
Rouge 1 Cluster 0,r,0.764251
Rouge 2 Cluster 0,f,0.169689
Rouge 2 Cluster 0,p,0.12442
Rouge 2 Cluster 0,r,0.26674
Rouge l Cluster 0,f,0.377799
Rouge l Cluster 0,p,0.290264
Rouge l Cluster 0,r,0.540928
Rouge 1 Cluster 1,f,0.335923


In [60]:
rouge.to_csv('../Data/rouge_scores.csv')

In [61]:
summ_df.head()

Unnamed: 0,Summaries,Abstracts,Bleu Score
0,"Given the consequences of these barriers, more...",Veterans face a variety of stressors due to th...,0.001101
1,"Notably, social distancing guidelines, includi...",Many university training clinics are facing nu...,0.001684
2,"Not surprisingly then, one of the largest and ...",Although the medical impacts of COVID-19 are n...,0.001993


In [62]:
summ_df = summ_df[['Summaries', 'Bleu Score']]

In [63]:
summ_df.to_csv('../Data/Extractive Summaries.csv')

In [70]:
' '.join(summ_df['Summaries'][0].split()[0:150])

'Given the consequences of these barriers, more work needs to be done to reduce their impact on veterans.Telepsychology, or the use of technology to provide mental health services, broadly encompasses a variety of direct formats including videoconferencing, phone, and instant messaging as well as a variety of indirect formats such as email, self-help apps, or websites. In this article, telepsychology is more narrowly defined as the use of videoconferencing and telephone technologies to provide mental health services in order to reflect that the bulk of the available research is conducted using one of these two mediums. Furthermore, this increased privacy associated with telepsychology visits can help prevent some of the negative stigma or embarrassment that veterans might experience during face-to-face visits by allowing them to avoid those in-office contacts.Effectiveness of TelepsychologyOne question that must be presented to mental health providers and researchers is whether veteran

In [71]:
' '.join(summ_df['Summaries'][1].split()[0:150])

'Notably, social distancing guidelines, including guidance to wear face coverings, disinfectant procedures, and general fear among both clients and providers, present significant barriers to care from practical, economic, health, and personal comfort perspectives. Furthermore, while public Internet access is nearly ubiquitous today, the Pew Research Center reported 75% of U.S. adults also have broadband Internet service at home, suggesting fast, reliable, and more secure connectivity that provides a unique opportunity to leverage telepsychology to address mental health needs.Telepsychology is defined as “the provision of psychological services using telecommunication technologies… [including, but not limited] to telephone, mobile devices, interactive videoconferencing, email, chat, text, and Internet”. Additionally, telepsychology via videoconferencing and via telephone is associated with strong therapeutic alliances and increased cost-effectiveness, lowering costs by 10% per patient a

In [72]:
' '.join(summ_df['Summaries'][2].split()[0:150])

'Not surprisingly then, one of the largest and most sustained effects of the COVID-19 pandemic is its impact on mental health and, by extension, the prosperity of nations worldwide.Due to the financial, social, and psychological stress of COVID-19, and the reduction in supports attributed to physical distancing requirements, it is expected that anxiety, depression, and traumatic stress will increase dramatically as a function of COVID-19. Telemental health may also be effective in reducing common barriers to accessing treatment, such as transportation to treatment sessions, increasing access to evidence-based services in rural areas or in communities without specialized mental health services in Canada, and in low- and middle-income countries with low funding for in person services.Importantly, although there are myriad benefits to telemental health, there are also notable limitations, and these should be weighed prior to the widespread adoption of telemental health practices with all 