# NLP With Deep Learning (W266)

Submission by *Carolina Arriaga, Ayman, Abhi Sharma*

Winter 2021 | UC Berkeley

## Notebook Overview

This notebook prepares the data the team needs to study the relationship between the human-annotated dimensions (`coherence, fluency, consistency, relevance`) and the metrics implemented in the `SummaryScorer` notebook.

References list:

https://arxiv.org/pdf/2007.12626.pdf

https://github.com/Yale-LILY/SummEval

# Data Prep

In [None]:
%cd /content
!pwd

/content
/content


## Utilities

In [None]:
import numpy as np
import pandas as pd

def get_link_for_model(num):
  return 'https://storage.googleapis.com/sfr-summarization-repo-research/M{}.tar.gz'.format(num)

def get_link_for_human_annotation():
  return 'https://storage.googleapis.com/sfr-summarization-repo-research/model_annotations.aligned.jsonl'
  
# there are 24 models - 0 to 23 as per the link here
# https://github.com/Yale-LILY/SummEval#model-outputs
total_models = 24

def get_links_for_all_models():
  # space-separated list of archive URLs, suitable for passing to wget
  return ' '.join(get_link_for_model(i) for i in range(total_models))


In [None]:
!wget {get_link_for_human_annotation()}

--2021-11-27 21:25:30--  https://storage.googleapis.com/sfr-summarization-repo-research/model_annotations.aligned.jsonl
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.5.128, 64.233.184.128, 64.233.167.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.5.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5839062 (5.6M) [application/octet-stream]
Saving to: ‘model_annotations.aligned.jsonl’


2021-11-27 21:25:31 (43.5 MB/s) - ‘model_annotations.aligned.jsonl’ saved [5839062/5839062]



In [None]:
!wget {get_links_for_all_models()}

--2021-11-27 21:25:31--  https://storage.googleapis.com/sfr-summarization-repo-research/M0.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.5.128, 64.233.167.128, 74.125.133.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.5.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3821775 (3.6M) [application/x-gzip]
Saving to: ‘M0.tar.gz’


2021-11-27 21:25:31 (150 MB/s) - ‘M0.tar.gz’ saved [3821775/3821775]

--2021-11-27 21:25:31--  https://storage.googleapis.com/sfr-summarization-repo-research/M1.tar.gz
Reusing existing connection to storage.googleapis.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 3703162 (3.5M) [application/x-gzip]
Saving to: ‘M1.tar.gz’


2021-11-27 21:25:32 (91.7 MB/s) - ‘M1.tar.gz’ saved [3703162/3703162]

--2021-11-27 21:25:32--  https://storage.googleapis.com/sfr-summarization-repo-research/M2.tar.gz
Reusing existing connection to storage.googleapis.com:443.
HTTP req

In [None]:
!for f in *.tar.gz; do tar -xf "$f"; done

In [None]:
from os import listdir
from os.path import isfile, join

def get_jsonl_files_for_model(model_num):
  assert isinstance(model_num, int)
  path = "M{}/aligned".format(model_num)
  files = [f for f in listdir(path) if isfile(join(path, f))]
  return [path + '/' + f for f in files if f.endswith('jsonl')]

In [None]:
get_jsonl_files_for_model(23)

['M23/aligned/outputs_dynamicmix_cnn_dailymail.aligned.jsonl',
 'M23/aligned/outputs_c4_cnn_dailymail.aligned.jsonl',
 'M23/aligned/outputs_hugenews_cnn_dailymail.aligned.jsonl']

In [None]:
import json

def get_model_result_list(model_num, all_model_variants=True):
  file_list = get_jsonl_files_for_model(model_num)
  assert len(file_list) > 0
  if not all_model_variants:
    file_list = [file_list[0]]
  
  corrupted_val = "cnndm/dailymail/stories/9f270039c861e75ee2f01e4e2898a9ea04a96b26.story"
  data = []
  for f in file_list:
    # each line of a .jsonl file is one JSON record
    with open(f, 'r') as jsonl_file:
      json_list = list(jsonl_file)

    for json_str in json_list:
      model_result = json.loads(json_str)
      model_result['model_id'] = 'M' + str(model_num)
      model_result['model_variant'] = f
      # there is a single corrupted record in model #2, so we remove that
      if model_num == 2 and model_result['filepath'] == corrupted_val:
        continue
      data.append(model_result)

  return data

In [None]:
def get_all_model_data(all_model_variants=True):
  data = []
  for i in range(total_models):
    data_for_model = get_model_result_list(i, all_model_variants)
    # verify keys in result for every model
    first = data_for_model[0]
    assert 'reference' in first.keys()
    assert 'decoded' in first.keys()
    assert 'id' in first.keys()
    assert 'filepath' in first.keys()
    assert 'model_id' in first.keys()
    assert 'model_variant' in first.keys()
    data.extend(data_for_model)
  return data

## Model Summaries

In [None]:
import pandas as pd

data = get_all_model_data()
model_summ = pd.DataFrame(data)
model_summ.head()

Unnamed: 0,reference,decoded,id,filepath,model_id,model_variant
0,marseille prosecutor says `` so far no videos ...,"marseille , france -lrb- cnn -rrb- the french ...",cnn-test-469c6ac05092ca5997728c9dfc19f9ab6b936e40,cnndm/cnn/stories/469c6ac05092ca5997728c9dfc19...,M0,M0/aligned/outputs.aligned.jsonl
1,membership gives the icc jurisdiction over all...,-lrb- cnn -rrb- the palestinian authority offi...,cnn-test-f001ec5c4704938247d27a44948eebb37ae98d01,cnndm/cnn/stories/f001ec5c4704938247d27a44948e...,M0,M0/aligned/outputs.aligned.jsonl
2,amnesty 's annual death penalty report catalog...,-lrb- cnn -rrb- governments around the world a...,cnn-test-e2706dce6cf26bc61b082438188fdb6e130d9e40,cnndm/cnn/stories/e2706dce6cf26bc61b082438188f...,M0,M0/aligned/outputs.aligned.jsonl
3,amnesty international releases its annual revi...,"-lrb- cnn -rrb- on may 28 , 2014 , some 7,000 ...",cnn-test-c222979bd1cfbc7d3ff821e9c738e3dbd29b14f4,cnndm/cnn/stories/c222979bd1cfbc7d3ff821e9c738...,M0,M0/aligned/outputs.aligned.jsonl
4,museum : anne frank died earlier than previous...,"-lrb- cnn -rrb- seventy years ago , anne frank...",cnn-test-203886369feea77bbc35715e6d7e518b751f57de,cnndm/cnn/stories/203886369feea77bbc35715e6d7e...,M0,M0/aligned/outputs.aligned.jsonl


In [None]:
model_summ.shape

(517027, 6)

In [None]:
# check for null vals in df
np.where(pd.isnull(model_summ))

(array([], dtype=int64), array([], dtype=int64))

In [None]:
# check for empty-string vals in df
np.where(model_summ.applymap(lambda x: x == ''))

(array([ 23072,  23267,  30707,  32247,  32247,  32522,  33110,  61748,
         73238, 425385, 425734, 428213, 428955, 435262, 436467, 441326,
        448859, 449390, 449574, 449826, 450278, 450618, 451061, 451369,
        451583, 452632, 452644, 452972, 453027, 453445, 453815, 454051,
        454795, 455649, 456532, 456925, 457381, 457743, 458701, 458832,
        458885, 467637, 468970]),
 array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))

In [None]:
# some rows have an empty summary output from the model
# we keep these as-is because this is the model's actual output
model_summ.iloc[23072]

reference        richard dysart best known for leland mckenzie ...
decoded                                                           
id               cnn-test-c34d84d38ccfd021ba0b3712dc23feadd455af5b
filepath         cnndm/cnn/stories/c34d84d38ccfd021ba0b3712dc23...
model_id                                                        M2
model_variant                     M2/aligned/outputs.aligned.jsonl
Name: 23072, dtype: object
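
The empty-string check above returns raw row and column positions, which are hard to read at a glance. A minimal sketch of a per-column and per-model summary, using a small made-up frame (`toy` and its values are hypothetical stand-ins for `model_summ`, not data from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for model_summ; values are illustrative only.
toy = pd.DataFrame({
    "reference": ["ref a", "", "ref c"],
    "decoded": ["", "", "summary c"],
    "model_id": ["M2", "M2", "M5"],
})

# Boolean mask of cells that are exactly the empty string.
empty_mask = toy.eq("")

# Number of empty cells in each column.
empty_per_column = empty_mask.sum()
print(empty_per_column)

# Rows with an empty 'decoded' summary, counted per model.
empty_by_model = toy.loc[empty_mask["decoded"]].groupby("model_id").size()
print(empty_by_model)
```

Grouping by `model_id` makes it easier to see whether empty summaries are concentrated in a few models.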

## Annotator Data

In [None]:
def get_annotation_data(with_mturk=False):
  with open('/content/model_annotations.aligned.jsonl', 'r') as json_file:
    json_list = list(json_file)

  data = []
  for json_str in json_list:
      annotation = json.loads(json_str)
      result = {}
      result['id'] = annotation['id']
      result['model_id'] = annotation['model_id']
      result['decoded'] = annotation['decoded']
      result['filepath'] = annotation['filepath']

      # there are 3 expert and 5 mturk outputs
      expert = 'expert_annotations'
      turk = 'turker_annotations'
      assert len(annotation[expert]) == 3
      assert len(annotation[turk]) == 5

      dims =  ["coherence", "consistency", "fluency", "relevance"]
      
      ### add expert individual and avg scores ###
      # go through each dim
      for d in dims:
        summ = 0
        # go through each expert
        for e in range(len(annotation[expert])):
          assert d in annotation[expert][e].keys()

          result['expert_{}_{}'.format(e, d)] = annotation[expert][e][d]
          summ += annotation[expert][e][d]
        
        result['all_expert_avg_{}'.format(d)] = 1.0 * summ/len(annotation[expert])

      if with_mturk:
        ### add turk individual and avg scores ###
        # go through each dim
        for d in dims:
          summ = 0
          # go through each turk
          for t in range(len(annotation[turk])):
            assert d in annotation[turk][t].keys()

            result['turk_{}_{}'.format(t, d)] = annotation[turk][t][d]
            summ += annotation[turk][t][d]
          
          result['all_turk_avg_{}'.format(d)] = 1.0 * summ/len(annotation[turk])

      data.append(result)
  return data
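
The expert and turker branches above repeat the same flatten-and-average logic. A minimal sketch of how it could be factored into one helper, assuming each annotator entry is a dict keyed by dimension as in `model_annotations.aligned.jsonl`; `average_scores` is a hypothetical helper and the scores below are illustrative values only:

```python
DIMS = ["coherence", "consistency", "fluency", "relevance"]

def average_scores(annotator_scores, prefix):
    """Flatten per-annotator scores and add a per-dimension average.

    annotator_scores: list of dicts, one per annotator, keyed by dimension.
    prefix: column prefix such as 'expert' or 'turk'.
    """
    result = {}
    for d in DIMS:
        scores = [a[d] for a in annotator_scores]
        # one column per annotator, e.g. expert_0_coherence
        for i, s in enumerate(scores):
            result["{}_{}_{}".format(prefix, i, d)] = s
        # one average column per dimension, e.g. all_expert_avg_coherence
        result["all_{}_avg_{}".format(prefix, d)] = sum(scores) / len(scores)
    return result

# Illustrative annotator scores (hypothetical values).
experts = [
    {"coherence": 2, "consistency": 1, "fluency": 4, "relevance": 2},
    {"coherence": 1, "consistency": 1, "fluency": 2, "relevance": 1},
    {"coherence": 1, "consistency": 1, "fluency": 3, "relevance": 2},
]
row = average_scores(experts, "expert")
print(row["all_expert_avg_fluency"])
```

The same helper could be called with the turker list and the prefix `'turk'` to produce the `turk_*` columns.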

In [None]:
import pandas as pd

data = get_annotation_data(with_mturk=True)
annotations = pd.DataFrame(data)
annotations.head()

Unnamed: 0,id,model_id,decoded,filepath,expert_0_coherence,expert_1_coherence,expert_2_coherence,all_expert_avg_coherence,expert_0_consistency,expert_1_consistency,expert_2_consistency,all_expert_avg_consistency,expert_0_fluency,expert_1_fluency,expert_2_fluency,all_expert_avg_fluency,expert_0_relevance,expert_1_relevance,expert_2_relevance,all_expert_avg_relevance,turk_0_coherence,turk_1_coherence,turk_2_coherence,turk_3_coherence,turk_4_coherence,all_turk_avg_coherence,turk_0_consistency,turk_1_consistency,turk_2_consistency,turk_3_consistency,turk_4_consistency,all_turk_avg_consistency,turk_0_fluency,turk_1_fluency,turk_2_fluency,turk_3_fluency,turk_4_fluency,all_turk_avg_fluency,turk_0_relevance,turk_1_relevance,turk_2_relevance,turk_3_relevance,turk_4_relevance,all_turk_avg_relevance
0,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M11,paul merson was brought on with only seven min...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,2,1,1,1.333333,1,1,1,1.0,4,2,3,3.0,2,1,2,1.666667,3,3,3,3,3,3.0,3,3,3,3,3,3.0,4,4,4,4,4,4.0,3,3,3,3,3,3.0
1,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M13,paul merson has restarted his row with andros ...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,3,2,2,2.333333,5,5,5,5.0,5,5,5,5.0,2,3,3,2.666667,2,2,2,2,2,2.0,3,3,3,3,3,3.0,2,2,2,2,2,2.0,3,3,3,3,3,3.0
2,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M1,paul merson has restarted his row with andros ...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,2,2,3,2.333333,5,5,5,5.0,5,5,5,5.0,2,4,2,2.666667,4,5,4,4,2,3.8,5,4,5,5,2,4.2,4,4,4,4,3,3.8,5,4,5,5,4,4.6
3,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M14,paul merson has restarted his row with andros ...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,2,1,2,1.666667,5,5,5,5.0,5,5,5,5.0,3,2,3,2.666667,5,5,5,5,5,5.0,5,5,5,5,5,5.0,5,5,5,5,5,5.0,4,4,4,4,4,4.0
4,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M15,paul merson has restarted his row with andros ...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,4,3,3,3.333333,5,5,5,5.0,3,3,4,3.333333,4,4,4,4.0,2,2,2,2,2,2.0,4,4,4,4,4,4.0,2,2,2,2,2,2.0,4,4,4,4,4,4.0


In [None]:
annotations.shape

(1600, 44)

In [None]:
# check for null vals in df
np.where(pd.isnull(annotations))

(array([], dtype=int64), array([], dtype=int64))

In [None]:
# check for empty-string vals in df
np.where(annotations.applymap(lambda x: x == ''))

(array([], dtype=int64), array([], dtype=int64))

In [None]:
# note that annotations don't have all 24 models present in them
annotations[['model_id', 'id']].groupby(['model_id']).count()

model_id,id
M0,100
M1,100
M10,100
M11,100
M12,100
M13,100
M14,100
M15,100
M17,100
M2,100


## Join Annotator Data & Model Summaries

Key for join: combination of `id` and `model_id`
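
To illustrate how a join on this composite key behaves when a model has several variants, here is a minimal sketch with made-up frames (`ann` and `summ` are hypothetical stand-ins for `annotations` and `model_summ`, and the variant names are invented). Each variant of a model contributes its own row for an annotated story, so the merged frame can have more rows than the annotation frame.

```python
import pandas as pd

# Hypothetical stand-ins for the annotation and model-summary frames.
ann = pd.DataFrame({
    "id": ["s1", "s1"],
    "model_id": ["M11", "M13"],
    "decoded": ["summary a", "summary b"],
})
summ = pd.DataFrame({
    "id": ["s1", "s1", "s1"],
    "model_id": ["M11", "M11", "M13"],
    "decoded": ["summary a", "other summary", "summary b"],
    "model_variant": ["M11/v1", "M11/v2", "M13/v1"],
})

# Inner join on the composite key; shared non-key columns get _x/_y suffixes.
merged = pd.merge(ann, summ, on=["id", "model_id"])
print(merged.shape)

# Keep only the variant whose output matches the annotated summary.
matched = merged.loc[merged["decoded_x"] == merged["decoded_y"]]
print(matched[["model_id", "model_variant"]])
```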

In [None]:
joined = pd.merge(annotations, model_summ, on = ['id', 'model_id'])

In [None]:
# there are more than 1600 rows here because each model variant contributes its own row
# we filter these later, keeping only rows where the variant's decoded summary equals the annotation's
joined.shape

(3232, 48)

In [None]:
np.where(pd.isnull(joined))

(array([], dtype=int64), array([], dtype=int64))

In [None]:
np.where(joined.applymap(lambda x: x == ''))

(array([], dtype=int64), array([], dtype=int64))

In [None]:
joined.head()

Unnamed: 0,id,model_id,decoded_x,filepath_x,expert_0_coherence,expert_1_coherence,expert_2_coherence,all_expert_avg_coherence,expert_0_consistency,expert_1_consistency,expert_2_consistency,all_expert_avg_consistency,expert_0_fluency,expert_1_fluency,expert_2_fluency,all_expert_avg_fluency,expert_0_relevance,expert_1_relevance,expert_2_relevance,all_expert_avg_relevance,turk_0_coherence,turk_1_coherence,turk_2_coherence,turk_3_coherence,turk_4_coherence,all_turk_avg_coherence,turk_0_consistency,turk_1_consistency,turk_2_consistency,turk_3_consistency,turk_4_consistency,all_turk_avg_consistency,turk_0_fluency,turk_1_fluency,turk_2_fluency,turk_3_fluency,turk_4_fluency,all_turk_avg_fluency,turk_0_relevance,turk_1_relevance,turk_2_relevance,turk_3_relevance,turk_4_relevance,all_turk_avg_relevance,reference,decoded_y,filepath_y,model_variant
0,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M11,paul merson was brought on with only seven min...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,2,1,1,1.333333,1,1,1,1.0,4,2,3,3.0,2,1,2,1.666667,3,3,3,3,3,3.0,3,3,3,3,3,3.0,4,4,4,4,4,4.0,3,3,3,3,3,3.0,andros townsend an 83rd minute sub in tottenha...,paul merson was brought on with only seven min...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,M11/aligned/outputs_novelty.aligned.jsonl
1,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M11,paul merson was brought on with only seven min...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,2,1,1,1.333333,1,1,1,1.0,4,2,3,3.0,2,1,2,1.666667,3,3,3,3,3,3.0,3,3,3,3,3,3.0,4,4,4,4,4,4.0,3,3,3,3,3,3.0,andros townsend an 83rd minute sub in tottenha...,paul merson was brought on with only seven min...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,M11/aligned/outputs_baseline.aligned.jsonl
2,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M11,paul merson was brought on with only seven min...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,2,1,1,1.333333,1,1,1,1.0,4,2,3,3.0,2,1,2,1.666667,3,3,3,3,3,3.0,3,3,3,3,3,3.0,4,4,4,4,4,4.0,3,3,3,3,3,3.0,andros townsend an 83rd minute sub in tottenha...,paul merson was brought on with only seven min...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,M11/aligned/outputs_novelty+lm.aligned.jsonl
3,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M13,paul merson has restarted his row with andros ...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,3,2,2,2.333333,5,5,5,5.0,5,5,5,5.0,2,3,3,2.666667,2,2,2,2,2,2.0,3,3,3,3,3,3.0,2,2,2,2,2,2.0,3,3,3,3,3,3.0,andros townsend an 83rd minute sub in tottenha...,paul merson has restarted his row with andros ...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,M13/aligned/outputs.aligned.jsonl
4,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M1,paul merson has restarted his row with andros ...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,2,2,3,2.333333,5,5,5,5.0,5,5,5,5.0,2,4,2,2.666667,4,5,4,4,2,3.8,5,4,5,5,2,4.2,4,4,4,4,3,3.8,5,4,5,5,4,4.6,andros townsend an 83rd minute sub in tottenha...,paul merson has restarted his row with andros ...,cnndm/dailymail/stories/8764fb95bfad8ee8492748...,M1/aligned/outputs.aligned.jsonl


In [None]:
# models that were in the model summaries but were not in the annotations 
# this shows that the annotations didn't include some models
assert len(list(model_summ.model_id.unique())) == total_models
missing_models = [x for x in list(model_summ.model_id.unique()) if x not in list(joined.model_id.unique())] 
print(missing_models)
# there were 16 models remaining in the annotations, for which we have 16x100 = 1600 data points 
print(total_models - len(missing_models))

['M3', 'M4', 'M6', 'M7', 'M16', 'M18', 'M19', 'M21']
16


In [None]:
# the annotation's summary output for the model should match the model variant's summary output
# only keep those rows where it matches as the annotation was done only for that variant
joined = joined.loc[joined['decoded_x'] == joined['decoded_y']]

In [None]:
# we see the rows decrease because the scoring (annotation) was done on a subset of the variants of a single model
joined.shape

(1653, 48)

In [None]:
# shows that some variants of a single model produced the same summary as another variant of the same model
# for example, 38 examples of M5's variant produced the same summary output as the baseline M5 model 
joined[['model_id', 'model_variant']].value_counts()

model_id  model_variant                                     
M0        M0/aligned/outputs.aligned.jsonl                      100
M2        M2/aligned/outputs.aligned.jsonl                      100
M14       M14/aligned/outputs.aligned.jsonl                     100
M15       M15/aligned/outputs_coverage.aligned.jsonl            100
M11       M11/aligned/outputs_novelty.aligned.jsonl             100
M17       M17/aligned/outputs_11B.aligned.jsonl                 100
M9        M9/aligned/outputs_extabs+rl+rerank.aligned.jsonl     100
M10       M10/aligned/outputs_encdec.aligned.jsonl              100
M20       M20/aligned/outputs_zeroshot.aligned.jsonl            100
M12       M12/aligned/outputs.aligned.jsonl                     100
M22       M22/aligned/outputs_cnndm.aligned.jsonl               100
M5        M5/aligned/outputs_rouge.aligned.jsonl                100
M8        M8/aligned/outputs_ptrgen+cov.aligned.jsonl           100
M1        M1/aligned/outputs.aligned.jsonl             
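
A minimal sketch of how such multiple matches could be located, using a made-up frame (`toy_joined` is a hypothetical stand-in for `joined`, and the variant names are invented): count rows per (`id`, `model_id`) pair and keep the pairs that appear more than once.

```python
import pandas as pd

# Hypothetical stand-in for the filtered `joined` frame.
toy_joined = pd.DataFrame({
    "id": ["s1", "s1", "s2", "s2"],
    "model_id": ["M5", "M5", "M5", "M0"],
    "model_variant": ["M5/variant_a", "M5/variant_b", "M5/variant_a", "M0/base"],
})

# Rows per (story, model) pair; more than one means several variants matched.
pair_counts = toy_joined.groupby(["id", "model_id"]).size()
multi_matched = pair_counts[pair_counts > 1]
print(multi_matched)

# One possible policy (an assumption, not the notebook's choice): keep the first variant.
deduped = toy_joined.drop_duplicates(subset=["id", "model_id"], keep="first")
print(len(deduped))
```

Whether to deduplicate at all depends on the downstream analysis; the notebook itself keeps every matching variant.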

In [None]:
# clean up dataframe and store for future use
joined = joined.rename(columns={"id": "story_id", "decoded_x": "decoded"})
joined = joined.drop(columns=['filepath_x', 'filepath_y', 'decoded_y'])

In [None]:
joined.head()

Unnamed: 0,story_id,model_id,decoded,expert_0_coherence,expert_1_coherence,expert_2_coherence,all_expert_avg_coherence,expert_0_consistency,expert_1_consistency,expert_2_consistency,all_expert_avg_consistency,expert_0_fluency,expert_1_fluency,expert_2_fluency,all_expert_avg_fluency,expert_0_relevance,expert_1_relevance,expert_2_relevance,all_expert_avg_relevance,turk_0_coherence,turk_1_coherence,turk_2_coherence,turk_3_coherence,turk_4_coherence,all_turk_avg_coherence,turk_0_consistency,turk_1_consistency,turk_2_consistency,turk_3_consistency,turk_4_consistency,all_turk_avg_consistency,turk_0_fluency,turk_1_fluency,turk_2_fluency,turk_3_fluency,turk_4_fluency,all_turk_avg_fluency,turk_0_relevance,turk_1_relevance,turk_2_relevance,turk_3_relevance,turk_4_relevance,all_turk_avg_relevance,reference,model_variant
0,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M11,paul merson was brought on with only seven min...,2,1,1,1.333333,1,1,1,1.0,4,2,3,3.0,2,1,2,1.666667,3,3,3,3,3,3.0,3,3,3,3,3,3.0,4,4,4,4,4,4.0,3,3,3,3,3,3.0,andros townsend an 83rd minute sub in tottenha...,M11/aligned/outputs_novelty.aligned.jsonl
3,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M13,paul merson has restarted his row with andros ...,3,2,2,2.333333,5,5,5,5.0,5,5,5,5.0,2,3,3,2.666667,2,2,2,2,2,2.0,3,3,3,3,3,3.0,2,2,2,2,2,2.0,3,3,3,3,3,3.0,andros townsend an 83rd minute sub in tottenha...,M13/aligned/outputs.aligned.jsonl
4,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M1,paul merson has restarted his row with andros ...,2,2,3,2.333333,5,5,5,5.0,5,5,5,5.0,2,4,2,2.666667,4,5,4,4,2,3.8,5,4,5,5,2,4.2,4,4,4,4,3,3.8,5,4,5,5,4,4.6,andros townsend an 83rd minute sub in tottenha...,M1/aligned/outputs.aligned.jsonl
5,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M14,paul merson has restarted his row with andros ...,2,1,2,1.666667,5,5,5,5.0,5,5,5,5.0,3,2,3,2.666667,5,5,5,5,5,5.0,5,5,5,5,5,5.0,5,5,5,5,5,5.0,4,4,4,4,4,4.0,andros townsend an 83rd minute sub in tottenha...,M14/aligned/outputs.aligned.jsonl
6,dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2,M15,paul merson has restarted his row with andros ...,4,3,3,3.333333,5,5,5,5.0,3,3,4,3.333333,4,4,4,4.0,2,2,2,2,2,2.0,4,4,4,4,4,4.0,2,2,2,2,2,2.0,4,4,4,4,4,4.0,andros townsend an 83rd minute sub in tottenha...,M15/aligned/outputs_coverage.aligned.jsonl


# Store Data as CSV




In [None]:
import datetime
from google.colab import files

now = datetime.datetime.now()
filename = now.strftime("%Y-%m-%d-%H-%M-%S")

compression_opts = dict(method='zip', archive_name='data.csv')

joined.to_csv('{}.zip'.format(filename), index=False, compression = compression_opts)
files.download('{}.zip'.format(filename))
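
A minimal sketch of a round-trip check for this kind of export, assuming a local temporary directory instead of Colab's download flow; the frame and file name below are made up. `pandas.read_csv` can read a single-file zip archive directly.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small frame standing in for `joined`.
frame = pd.DataFrame({
    "story_id": ["s1", "s2"],
    "all_expert_avg_fluency": [3.0, 5.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "export.zip")
    # Same options as the notebook cell: a zip archive containing data.csv.
    frame.to_csv(path, index=False,
                 compression=dict(method="zip", archive_name="data.csv"))
    # Compression is inferred from the .zip extension.
    restored = pd.read_csv(path)

print(restored.shape)
```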


# Misc Utils

In [1]:
!pip install -q datasets

In [None]:
import datasets

all_articles = datasets.load_dataset("cnn_dailymail", "3.0.0")

In [7]:
import datasets

def get_cnndm_by_id(dataset, story_id, return_article_only=True):
  # strip the source/split prefix (e.g. 'dm-test-', 'cnn-val-') to get the raw hash
  for source in ('dm', 'cnn'):
    for split in ('test', 'train', 'dev', 'val'):
      story_id = story_id.replace('{}-{}-'.format(source, split), '')

  # filter once and reuse the match for both fields
  matches = dataset.filter(lambda x: x['id'] == story_id)
  if len(matches) == 0:
    return None

  article = matches['article'][0]
  highlight = matches['highlights'][0]
  if return_article_only:
    return article

  return article, highlight
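
Filtering the whole dataset for each lookup scans every row each time. A minimal sketch of an alternative for repeated lookups, assuming the records are available as an iterable of dicts with `id`, `article`, and `highlights` keys; `strip_split_prefix` and `build_index` are hypothetical helpers and the records below are invented:

```python
import re

# Matches prefixes such as 'dm-test-' or 'cnn-train-' at the start of an id.
PREFIX_RE = re.compile(r"^(?:dm|cnn)-(?:test|train|dev|val)-")

def strip_split_prefix(story_id):
    """Remove the source/split prefix from a story id, if present."""
    return PREFIX_RE.sub("", story_id)

def build_index(records):
    """Map each record's id to the record itself for constant-time lookup."""
    return {r["id"]: r for r in records}

# Invented records for illustration.
records = [
    {"id": "abc123", "article": "article one", "highlights": "highlight one"},
    {"id": "def456", "article": "article two", "highlights": "highlight two"},
]
index = build_index(records)

hit = index.get(strip_split_prefix("dm-test-abc123"))
print(hit["article"] if hit else None)

miss = index.get(strip_split_prefix("cnn-test-zzz"))
print(miss)
```

Building the index costs one pass over the data, which pays off once more than a handful of ids need to be resolved.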

In [8]:
story_id = 'fbbafa743a8c2ecd2cedf65c6c61956b2db8ec5c'
print(get_cnndm_by_id(all_articles['test'], story_id))

Loading cached processed dataset at /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234/cache-f34474f8f74c4d4a.arrow


(CNN)One of the biggest TV events of all time is being reimagined for new audiences. "Roots," the epic miniseries about an African-American slave and his descendants, had a staggering audience of over 100 million viewers back in 1977. Now A&E networks are remaking the miniseries, to air in 2016. A&E, Lifetime and History (formerly the History Channel) announced Thursday that the three networks would simulcast a remake of the saga of Kunta Kinte, an African who was captured, shipped to America and sold into slavery to work on a Virginia plantation. LeVar Burton, who portrayed Kinte in the original, will co-executive produce the new miniseries. A press release describes the new version as "original" and "contemporary" and will draw more from Alex Haley's classic novel, "Roots: The Saga of an American Family." Producers will consult scholars in African and African-American history for added authenticity. "We are proud to bring this saga to fans of the original, as well as to a new generat