<a href="https://colab.research.google.com/github/caitao1234/Huggingface_Toturials/blob/main/notebooks/0a.%20Parse%20data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Crawl dataset with all submissions info
OpenReview Venue Crawling

In [33]:
%load_ext autoreload
%autoreload 2

import time
import pandas as pd
import multiprocessing as mp
from multiprocessing import Pool
from tqdm.notebook import tqdm  # notebook-friendly progress bars
import requests

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Crawl list of all submissions
Here we fetch the _notes_ (the list of all submissions) through OpenReview's API, which is much faster than Selenium-based scraping.


In [34]:
DATA_PATH = '../data/'
venue = 'ICLR.cc/2023/Conference'
venue_short = 'iclr2023'

In [35]:
def get_conference_notes(venue, blind_submission=False):
    """
    Get all notes (submissions) of a conference from the OpenReview API.
    If results are not final, you should set blind_submission=True.
    """

    blind_param = '-/Blind_Submission' if blind_submission else ''
    offset = 0
    notes = []
    while True:
        print('Offset:', offset, 'Data:', len(notes))
        url = f'https://api.openreview.net/notes?invitation={venue}/{blind_param}&offset={offset}'
        response = requests.get(url)
        response.raise_for_status()  # fail early on HTTP errors
        data = response.json()
        # an empty page means everything has been fetched
        if len(data['notes']) == 0:
            break
        # step to the next page (pages held 1000 notes in this crawl)
        offset += 1000
        notes.extend(data['notes'])
    return notes

In [36]:
raw_notes = get_conference_notes(venue, blind_submission=True)
print("Number of submissions:", len(raw_notes))

Offset: 0 Data: 0
Offset: 1000 Data: 1000
Offset: 2000 Data: 2000
Offset: 3000 Data: 3000
Offset: 4000 Data: 3811
Number of submissions: 3811
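
The loop in `get_conference_notes` is plain offset-based pagination. As a minimal offline sketch, the same stopping rule can be exercised against an in-memory list. The names `paginate`, `fetch_page` and `fake_fetch` are hypothetical and not part of OpenReview's API. This variant advances the offset by the number of items actually received, an assumption made here so that the loop does not depend on a fixed page size:

```python
def paginate(fetch_page, page_size=1000):
    """Collect items from an offset-paginated source until a page comes back empty."""
    offset, items = 0, []
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        items.extend(page)
        # advance by what was actually received, not by the requested size
        offset += len(page)
    return items


# toy stand-in for the remote API: 3811 fake records served in slices
fake_db = list(range(3811))

def fake_fetch(offset, limit):
    return fake_db[offset:offset + limit]


collected = paginate(fake_fetch)
print("Number of items:", len(collected))
```

Passing the fetch function as an argument keeps the pagination logic testable without network access.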


In [37]:
df_raw = pd.json_normalize(raw_notes)
# set index as first column
# df_raw.set_index(df_raw.columns[0], inplace=True)
df_raw.head()

Unnamed: 0,id,original,number,cdate,mdate,ddate,tcdate,tmdate,tddate,forum,...,content.student_author,content.Please_choose_the_closest_area_that_your_submission_falls_into,content.paperhash,content.pdf,content.supplementary_material,content._bibtex,content.venue,content.venueid,content.TL;DR,content.community_implementations
0,RUzSobdYy0V,pmo4AKuE4-p,6620,1663850590815,,,1663850590815,1677758485903,,RUzSobdYy0V,...,,"Social Aspects of Machine Learning (eg, AI saf...",adebayo|quantifying_and_mitigating_the_impact_...,/pdf/8fa4751c3b6bc13a0eefd3b9a9dd75dc9359f20f.pdf,/attachment/151652f4d981a49f9dfa81be992839a243...,"@inproceedings{\nadebayo2023quantifying,\ntitl...",ICLR 2023 poster,ICLR.cc/2023/Conference,,
1,N3kGYG3ZcTi,kVYulJycT2K,6611,1663850589829,,,1663850589829,1676330777348,,N3kGYG3ZcTi,...,,Deep Learning and representational learning,zhuang|suppression_helps_lateral_inhibitionins...,/pdf/bc66a3bbb804a7158ba77a4de9f91a196e8eaf9a.pdf,,"@misc{\nzhuang2023suppression,\ntitle={Suppres...",Submitted to ICLR 2023,ICLR.cc/2023/Conference,Improving feature learning with lateral inhibi...,
2,tmIiMPl4IPa,RAIF4RUF0T,6610,1663850589709,,,1663850589709,1690120015409,,tmIiMPl4IPa,...,,"Machine Learning for Sciences (eg biology, phy...",tran|factorized_fourier_neural_operators,/pdf/c381fdf1b7600bdbaba7b4a98c1679006ec61c83.pdf,,"@inproceedings{\ntran2023factorized,\ntitle={F...",ICLR 2023 poster,ICLR.cc/2023/Conference,An efficient and scalable neural PDE solver us...,[![CatalyzeX](/images/catalyzex_icon.svg) 1 co...
3,mhnHqRqcjYU,ix_LR-W0OM2,6603,1663850588877,,,1663850588877,1677757114293,,mhnHqRqcjYU,...,,Deep Learning and representational learning,narshana|dfpc_data_flow_driven_pruning_of_coup...,/pdf/a04d739740d3a54486c4a47bf7d26dd24b41732d.pdf,,"@inproceedings{\nnarshana2023dfpc,\ntitle={{DF...",ICLR 2023 poster,ICLR.cc/2023/Conference,We propose a novel data-free algorithm to acce...,
4,sZI1Oj9KBKy,vRziu1jJDu,6601,1663850588630,,,1663850588630,1677757168918,,sZI1Oj9KBKy,...,,Deep Learning and representational learning,murti|tvsprune_pruning_nondiscriminative_filte...,/pdf/54b7911797398691422146138209e69d0674e5de.pdf,,"@inproceedings{\nmurti2023tvsprune,\ntitle={{T...",ICLR 2023 poster,ICLR.cc/2023/Conference,We use the total variation distance between th...,
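
The dotted column names above (`content.title`, `content.keywords`, ...) come from `pd.json_normalize`, which flattens nested dictionaries into one column per leaf key. A small sketch on made-up notes (the ids and titles below are illustrative, not real submissions):

```python
import pandas as pd

# two toy notes shaped like the API response: metadata plus a nested 'content' dict
toy_notes = [
    {'id': 'n1', 'number': 1, 'content': {'title': 'First paper', 'keywords': ['a', 'b']}},
    {'id': 'n2', 'number': 2, 'content': {'title': 'Second paper', 'keywords': []}},
]

toy_df = pd.json_normalize(toy_notes)
# nested keys are joined with '.', while list values are kept as Python lists
print(sorted(toy_df.columns))
```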


## (Optional) Older crawled data

In [87]:
# Read data from old version
df_old = pd.read_csv('iclr2023_20221120.csv')
df_old

Unnamed: 0,id,title,keywords,ratings,confidences,withdraw,review_lengths
0,kRvZ2PcsxjJj,Quantum reinforcement learning,"['quantum reinforcement learning', 'multi-agen...","[1, 1, 1, 1]","[5, 5, 5, 5]",1,"[45, 49, 25, 283]"
1,RUzSobdYy0V,Quantifying and Mitigating the Impact of Label...,[],"[5, 6, 8]","[4, 3, 3]",0,"[443, 274, 401]"
2,N3kGYG3ZcTi,Suppression helps: Lateral Inhibition-inspired...,"['Lateral Inhibition', 'Convolutional Neural N...","[3, 5, 3, 1]","[5, 5, 5, 5]",0,"[333, 360, 362, 304]"
3,tmIiMPl4IPa,Factorized Fourier Neural Operators,"['fourier transform', 'fourier operators', 'pd...","[8, 6, 3, 8, 3]","[5, 4, 4, 2, 2]",0,"[203, 142, 323, 520, 635]"
4,mhnHqRqcjYU,DFPC: Data flow driven pruning of coupled chan...,"['Pruning', 'Data Free', 'Model Compression']","[8, 6, 6]","[3, 2, 3]",0,"[302, 90, 257]"
...,...,...,...,...,...,...,...
4869,IJwhRE510b,ELODI: Ensemble Logit Difference Inhibition fo...,"['positive-congruent training', 'negative flip...","[5, 5, 6, 8]","[4, 2, 3, 4]",0,"[711, 446, 464, 1337]"
4870,4XMAzZasId,Model-agnostic Measure of Generalization Diffi...,"['generalization', 'inductive bias', 'informat...","[8, 3, 3, 3]","[3, 3, 3, 4]",0,"[595, 802, 1393, 698]"
4871,KjKZaJ5Gbv,Efficient Multi-Task Reinforcement Learning vi...,"['Reinforcement Learning', 'Multitask Reinforc...","[3, 5, 3]","[4, 5, 4]",1,"[187, 590, 812]"
4872,ED2Jjms9A4H,Efficient Exploration via Fragmentation and Re...,"['fragmentation', 'recall', 'exploration', 'co...","[3, 5, 5, 5]","[4, 4, 3, 3]",0,"[1326, 523, 1099, 580]"


In [53]:
papers_ids = df_old['id'].values
print("Number of papers (including old):", len(papers_ids))

Number of papers (including old): 4874
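
The older crawl lists more papers than the fresh one (4874 vs. 3811). A quick way to see which ids appear in only one of the two crawls is a set difference. The sketch below uses made-up ids. In the notebook, the inputs would presumably be `df_old['id']` and `df_raw['id']`:

```python
# toy id lists standing in for the old and the fresh crawl
old_ids = ['id_a', 'id_b', 'id_c', 'id_d']
new_ids = ['id_b', 'id_c', 'id_e']

only_old = sorted(set(old_ids) - set(new_ids))   # present before, missing now
only_new = sorted(set(new_ids) - set(old_ids))   # new since the old crawl
in_both = sorted(set(old_ids) & set(new_ids))

print("Only in old crawl:", only_old)
print("Only in new crawl:", only_new)
print("In both:", in_both)
```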


## Crawl forums of each submission
Here we fetch the forum of each submission. This can be fairly fast thanks to:
- OpenReview's API (we use requests)
- Multiprocessing to parallelize the scraping of each paper

In [39]:
# Create multiprocessing pool of requests over index of dataframe

extra = "trash=true&details=replyCount%2Cwritable%2Crevisions%2Coriginal%2Coverwriting%2Cinvitation%2Ctags"

def get_paper_data(paper_id, extra='', timeout=5):
    try:
        url = f"https://api.openreview.net/notes?forum={paper_id}&{extra}"
        response = requests.get(url, timeout=timeout)
        data = response.json()
        return data
    except requests.exceptions.Timeout:
        print(f"Error for paper {paper_id}: Request timed out")
        return None
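    except (requests.exceptions.RequestException, ValueError):
        # other network errors, or a response body that is not valid JSON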
        print(f"Error for paper {paper_id}: General error")
        return None

def retry_get_paper_data(paper_id, extra='', timeout=5, retries=10):
    for i in range(retries):
        data = get_paper_data(paper_id, extra, timeout)
        if data is not None:
            return data
    print(f"Error for paper {paper_id}: All {retries} attempts failed")
    return None

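def get_paper_data_multi(paper_ids, ratio=0.8):
    # use a fraction of the available cores, but always at least one process
    num_processes = max(1, int(ratio*mp.cpu_count()))
    with Pool(num_processes) as p:
        # note: imap calls retry_get_paper_data with its defaults, so `extra` above is not forwarded
        data = list(tqdm(p.imap(retry_get_paper_data, paper_ids), total=len(paper_ids)))
    return data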
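
Since the work here is waiting on HTTP responses rather than computing, a thread pool is a possible alternative to a process pool, and `functools.partial` is one way to forward keyword arguments such as `extra` to every call. The sketch below is an assumption-laden illustration, not the notebook's method: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real request function so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fetch_all(fetch_one, paper_ids, max_workers=16, **kwargs):
    """Call fetch_one(paper_id, **kwargs) for every id, returning results in input order."""
    worker = partial(fetch_one, **kwargs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs
        return list(pool.map(worker, paper_ids))


def fake_fetch(paper_id, extra='', timeout=5):
    # offline stand-in: echoes its arguments instead of calling the API
    return {'forum': paper_id, 'extra': extra, 'timeout': timeout}


results = fetch_all(fake_fetch, ['a', 'b', 'c'], extra='details=replyCount')
print(results[0])
```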

In [86]:
# keep only the id, title and keywords columns
df_raw_filtered = df_raw[['id', 'content.title', 'content.keywords']]
df_raw_filtered

Unnamed: 0,id,content.title,content.keywords
0,RUzSobdYy0V,Quantifying and Mitigating the Impact of Label...,[]
1,N3kGYG3ZcTi,Suppression helps: Lateral Inhibition-inspired...,"[Lateral Inhibition, Convolutional Neural Netw..."
2,tmIiMPl4IPa,Factorized Fourier Neural Operators,"[fourier transform, fourier operators, pde, na..."
3,mhnHqRqcjYU,DFPC: Data flow driven pruning of coupled chan...,"[Pruning, Data Free, Model Compression]"
4,sZI1Oj9KBKy,TVSPrune - Pruning Non-discriminative filters ...,"[Structured pruning, model compression]"
...,...,...,...
3806,P5Z-Zl9XJ7,Continuous-Discrete Convolution for Geometry-S...,"[Protein representation learning, 3D geometry ..."
3807,IJwhRE510b,ELODI: Ensemble Logit Difference Inhibition fo...,"[positive-congruent training, negative flip, e..."
3808,4XMAzZasId,Model-agnostic Measure of Generalization Diffi...,"[generalization, inductive bias, information t..."
3809,ED2Jjms9A4H,Efficient Exploration via Fragmentation and Re...,"[fragmentation, recall, exploration, cognitive..."


In [41]:
# ids = list(df_raw_filtered['id'])
ids = df_old['id'].values # use old ids to get data from old papers
data = get_paper_data_multi(ids, ratio=1)

  0%|          | 0/4874 [00:00<?, ?it/s]

Error for paper PLUXnnxUdr4: Request timed out
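
Timeouts like the one above are retried immediately by `retry_get_paper_data`. A common refinement is to wait a little longer between attempts (exponential backoff). The following is a minimal sketch under that assumption. `retry_with_backoff` and `flaky` are hypothetical names, and the `sleep` parameter is injectable only so that the example runs instantly:

```python
import time


def retry_with_backoff(func, retries=5, base_delay=0.5, sleep=time.sleep):
    """Call func() until it returns something other than None, doubling the wait each time."""
    for attempt in range(retries):
        result = func()
        if result is not None:
            return result
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# toy function that fails twice before succeeding
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    return {'notes': []} if calls['n'] >= 3 else None


waits = []                                   # record the delays instead of sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```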


In [56]:
type(data)

list

In [57]:
print(data[0])

{'notes': [{'id': 'tlqdB1VCIb', 'original': None, 'number': 1, 'cdate': 1674241738301, 'pdate': None, 'mdate': None, 'ddate': None, 'tcdate': 1674241738301, 'tmdate': 1674241738301, 'tddate': None, 'forum': 'RUzSobdYy0V', 'replyto': 'RUzSobdYy0V', 'invitation': 'ICLR.cc/2023/Conference/Paper6620/-/Decision', 'content': {'title': 'Paper Decision', 'decision': 'Accept: poster', 'metareview:_summary,_strengths_and_weaknesses': 'This paper investigates the effect of label error on the model’s disparity metrics (e.g., calibration, FPR, FNR) on both the training and test set. The authors found that empirically, label errors have a larger influence on minority groups than on majority groups. The authors proposed a method to estimate the influence of changing a single training input’s label on a model’s group disparity metric. Reviewers agree that the studied problem is important and may have many practical implications and that the proposed method is well-motivated. At the same time, reviewer

In [None]:
# keep only the notes of each forum (one list per paper);
# failed requests are stored as None, so map them to an empty list
notes = [d['notes'] if d is not None else [] for d in data]

In [44]:
def save_notes_to_file(notes, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        for note in notes:
            file.write(str(note) + '\n')


In [47]:
# write one forum (its list of notes) per line
output_file = 'output1.txt'
save_notes_to_file(notes, output_file)
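
`save_notes_to_file` writes `str(note)`, which is a Python repr and cannot be read back with a JSON parser. If the file needs to be reloaded later, one option is the JSON Lines layout (one JSON document per line). This is a sketch with hypothetical helper names (`save_notes_jsonl`, `load_notes_jsonl`) and toy data:

```python
import json
import os
import tempfile


def save_notes_jsonl(notes, output_file):
    """Write one JSON document per line."""
    with open(output_file, 'w', encoding='utf-8') as file:
        for note in notes:
            file.write(json.dumps(note, ensure_ascii=False) + '\n')


def load_notes_jsonl(input_file):
    """Read the file back into a list, skipping blank lines."""
    with open(input_file, encoding='utf-8') as file:
        return [json.loads(line) for line in file if line.strip()]


# toy data: one forum with a single note, and one empty forum
toy_notes = [[{'id': 'r1', 'content': {'title': 'Paper Decision'}}], []]
path = os.path.join(tempfile.mkdtemp(), 'notes.jsonl')
save_notes_jsonl(toy_notes, path)
restored = load_notes_jsonl(path)
print(restored == toy_notes)
```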


In [48]:
def filter_data(item,
                review_keys=['summary_of_the_paper', 'strength_and_weaknesses', 'clarity,_quality,_novelty_and_reproducibility', 'summary_of_the_review'],
                decision=True):
    """Filter only ratings, confidence, withdraw status and decisions"""
    # parse each note
    withdraw = 0
    # filter meta note
    meta_note = [d for d in item if 'Paper' not in d['invitation']]
    # check withdrawn
    withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
    # decision
    if decision:
        try:
            if withdraw == 0:
                decision_note = [d for d in item if 'Decision' in d['invitation']]
                decision = decision_note[0]['content']['decision']
            else:
                decision = ''
        except (IndexError, KeyError):
            decision = ''
    # filter reviewer comments
    comment_notes = [d for d in item \
                     if 'Official_Review' in d['invitation'] and 'recommendation' in d['content'].keys()]
    comment_notes = sorted(comment_notes, key=lambda d: d['number'])[::-1]
    ratings = [int(note['content']['recommendation'].split(':')[0]) for note in comment_notes]
    confidences = [int(note['content']['confidence'].split(':')[0]) for note in comment_notes]
    review_lengths = [sum(len(note['content'][key].split()) for key in review_keys) for note in comment_notes] # review lengths

    data = {'ratings': ratings, 'confidences': confidences, 'withdraw': withdraw, 'review_lengths': review_lengths}
    if decision: data['decision'] = decision
    return data

In [68]:
def filter_data(item,
                review_keys=['summary_of_the_paper', 'strength_and_weaknesses', 'clarity,_quality,_novelty_and_reproducibility', 'summary_of_the_review'],
                decision=True):
    """Filter ratings, confidences, withdraw status, review texts and decision.

    `review_keys` is kept for compatibility with the previous version; it is not used here.
    """
    # the meta note is the note whose invitation does not contain 'Paper' (the submission itself)
    try:
        meta_note = [d for d in item if 'Paper' not in d['invitation']]
        withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
    except (IndexError, TypeError):
        # Skip this data item if it has no meta note, or if the request failed (item is None)
        return None

    # decision (left empty for withdrawn papers or when no decision note exists)
    if decision:
        try:
            if withdraw == 0:
                decision_note = [d for d in item if 'Decision' in d['invitation']]
                decision = decision_note[0]['content']['decision']
            else:
                decision = ''
        except (IndexError, KeyError):
            decision = ''

    # keep official reviews that carry a recommendation, highest note number first
    comment_notes = [d for d in item
                     if 'Official_Review' in d['invitation'] and 'recommendation' in d['content']]
    comment_notes = sorted(comment_notes, key=lambda d: d['number'])[::-1]

    # scores are parsed as the integer before the first ':'
    ratings = [int(note['content']['recommendation'].split(':')[0]) for note in comment_notes]
    confidences = [int(note['content']['confidence'].split(':')[0]) for note in comment_notes]
    # review texts (a review that lacks a field is left out of that list)
    summary_of_the_paper = [note['content']['summary_of_the_paper'] for note in comment_notes if 'summary_of_the_paper' in note['content']]
    strength_and_weaknesses = [note['content']['strength_and_weaknesses'] for note in comment_notes if 'strength_and_weaknesses' in note['content']]
    clarity_quality_novelty = [note['content']['clarity,_quality,_novelty_and_reproducibility'] for note in comment_notes if 'clarity,_quality,_novelty_and_reproducibility' in note['content']]
    summary_of_the_review = [note['content']['summary_of_the_review'] for note in comment_notes if 'summary_of_the_review' in note['content']]

    data = {'ratings': ratings, 'confidences': confidences, 'withdraw': withdraw,
            'summary_of_the_paper': summary_of_the_paper, 'strength_and_weaknesses': strength_and_weaknesses,
            'clarity_quality_novelty': clarity_quality_novelty, 'summary_of_the_review': summary_of_the_review}
    # an empty decision string is falsy, so the key is omitted in that case
    if decision: data['decision'] = decision
    return data
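
`filter_data` turns each score field into an integer by taking whatever precedes the first colon. The sketch below isolates that step on made-up strings. The labels after the colon are illustrative and not copied from OpenReview, and `leading_int` and `mean_rating` are hypothetical helpers:

```python
def leading_int(text):
    """Return the integer before the first ':' in a string such as '6: some label'."""
    return int(text.split(':')[0])


def mean_rating(recommendations):
    """Average the leading integers, or return None for a paper without reviews."""
    scores = [leading_int(r) for r in recommendations]
    return sum(scores) / len(scores) if scores else None


# illustrative recommendation strings
sample = ['8: accept', '6: borderline', '3: reject']
print([leading_int(s) for s in sample], mean_rating(sample))
```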


In [79]:
# filter data in a pool of processes
with Pool(8) as p:
    filtered_notes = list(tqdm(p.imap(filter_data, notes), total=len(notes)))

  0%|          | 0/4874 [00:00<?, ?it/s]

In [81]:
import pandas as pd

# Assume filtered_notes is the list of filtered data items
# Convert filtered_notes to a DataFrame and fill missing values with a default value (e.g. -1)
df = pd.DataFrame(filtered_notes).fillna(-1)
df

AttributeError: ignored

In [75]:
# create dataframe
# ratings = pd.DataFrame(filtered_notes)
# ratings.head()
type(filtered_notes)
import pandas as pd

# Assume filtered_notes is the list of filtered data items
# Filter filtered_notes, removing None values and non-dict elements
filtered_notes = [item for item in filtered_notes if item is not None and isinstance(item, dict)]

# Convert filtered_notes to a DataFrame
df = pd.DataFrame(filtered_notes)


In [76]:
df

Unnamed: 0,ratings,confidences,withdraw,summary_of_the_paper,strength_and_weaknesses,clarity_quality_novelty,summary_of_the_review,decision
0,"[5, 6, 8]","[4, 3, 3]",0,[This paper studies the effect of label error ...,[Strength:\n+ The research problems are import...,[This paper generally is well-written and easy...,"[For me, the motivation and research problems ...",Accept: poster
1,"[3, 6, 3, 1]","[5, 5, 5, 5]",0,[The authors propose to add a biologically ins...,[Strengths:\n1) There are considerable gains o...,[The proposed work is not novel and evaluation...,[The proposed work is preliminary and lacks th...,Reject
2,"[8, 6, 5, 8, 6]","[5, 4, 4, 2, 3]",0,"[In this work, the authors proposed a novel ne...",[Strength:\n- The work shows significant impro...,[The paper is clearly written. It has good qua...,[Overall I find this paper interesting. It has...,Accept: poster
3,"[8, 6, 6]","[3, 2, 3]",0,[This paper tackles an important problem of ne...,[Strength: \n- The problem considered is an im...,"[In general, the paper is clearly written and ...","[All in all, I have some minor concerns regard...",Accept: poster
4,"[8, 6, 8, 3]","[3, 3, 4, 4]",0,[In this paper authors propose a mechanism to ...,[Strengths\n------------\nThe paper clearly ca...,[The paper is very clear. It explicitly calls ...,[This paper guides the reader through the desi...,Accept: poster
...,...,...,...,...,...,...,...,...
3845,"[6, 6, 6, 6]","[2, 3, 4, 4]",0,[This work proposed a new genre of convolution...,[**Strengths:**\n\n1. The motivation is clearl...,[**Minor Questions/Problems:**\n\nFigure 3: ac...,[The paper has proposed an interesting convolu...,Accept: poster
3846,"[5, 5, 6, 6]","[4, 2, 3, 4]",0,[The paper addresses the task of updating a mo...,[Strengths:\n* Theoretical analysis of the dif...,[The paper is sufficiently novel and reproduci...,[The paper addresses an important topic and pr...,Reject
3847,"[8, 3, 3, 3]","[3, 3, 3, 4]",0,"[This work proposes a novel notion called ""ind...",[# Strength\n1. I find the paper written in a ...,[Clarity is very good\n\nQuality is good\n\nNo...,[At this point I am very positive to this work...,Reject
3848,"[5, 5, 6, 5]","[4, 4, 3, 3]",0,[This paper proposes a function approximation ...,[The idea forgoes the current trend of having ...,"[In general the paper clarity is ok, but shoul...",[The topic of the paper is worth further study...,Reject


In [90]:
import pandas as pd

# Assume filtered_notes is the list of filtered data items
# Assume df_old is the original DataFrame

# Find the indices in filtered_notes that need to be filtered out
filter_indices = [idx for idx, item in enumerate(filtered_notes) if item is None or not isinstance(item, dict)]

filter_indices
# Drop the corresponding rows from df_old
df_raw_filtered_after = df_old.drop(df_old.index[filter_indices])


NameError: ignored

In [91]:
df_raw_filtered_after

Unnamed: 0,id,title,keywords,ratings,confidences,withdraw,review_lengths
0,kRvZ2PcsxjJj,Quantum reinforcement learning,"['quantum reinforcement learning', 'multi-agen...","[1, 1, 1, 1]","[5, 5, 5, 5]",1,"[45, 49, 25, 283]"
1,RUzSobdYy0V,Quantifying and Mitigating the Impact of Label...,[],"[5, 6, 8]","[4, 3, 3]",0,"[443, 274, 401]"
2,N3kGYG3ZcTi,Suppression helps: Lateral Inhibition-inspired...,"['Lateral Inhibition', 'Convolutional Neural N...","[3, 5, 3, 1]","[5, 5, 5, 5]",0,"[333, 360, 362, 304]"
3,tmIiMPl4IPa,Factorized Fourier Neural Operators,"['fourier transform', 'fourier operators', 'pd...","[8, 6, 3, 8, 3]","[5, 4, 4, 2, 2]",0,"[203, 142, 323, 520, 635]"
4,mhnHqRqcjYU,DFPC: Data flow driven pruning of coupled chan...,"['Pruning', 'Data Free', 'Model Compression']","[8, 6, 6]","[3, 2, 3]",0,"[302, 90, 257]"
...,...,...,...,...,...,...,...
3846,NUBuJsAq1U,Determinant regularization for Deep Metric Lea...,"['Deep Metric Learning', 'Generalization', 'Ja...","[5, 3, 1, 3]","[5, 3, 4, 2]",0,"[265, 397, 484, 363]"
3847,ybFkELZjuc,Data-Efficient and Interpretable Tabular Anoma...,"['Anomaly Detection', 'Interpretability']","[5, 5, 6, 5]","[3, 5, 4, 3]",0,"[526, 585, 325, 386]"
3848,e25n9Z29PeC,Extracting Expert's Goals by What-if Interpret...,"['counterfactuals', 'explaining decision-makin...","[5, 5, 5, 3]","[5, 3, 2, 4]",1,"[394, 232, 652, 524]"
3849,9aokcgBVIj1,FiT: Parameter Efficient Few-shot Transfer Lea...,"['few-shot learning', 'transfer learning', 'fe...","[8, 5, 8, 6, 6]","[2, 4, 3, 2, 4]",0,"[320, 467, 240, 208, 187]"


In [92]:
# Concatenate column-wise with the filtered metadata (pd.concat aligns rows on index labels, not on position)
df_final = pd.concat([df_raw_filtered_after, df], axis=1)
df_final.head()

Unnamed: 0,id,title,keywords,ratings,confidences,withdraw,review_lengths,ratings.1,confidences.1,withdraw.1,summary_of_the_paper,strength_and_weaknesses,clarity_quality_novelty,summary_of_the_review,decision
0,kRvZ2PcsxjJj,Quantum reinforcement learning,"['quantum reinforcement learning', 'multi-agen...","[1, 1, 1, 1]","[5, 5, 5, 5]",1.0,"[45, 49, 25, 283]","[5, 6, 8]","[4, 3, 3]",0.0,[This paper studies the effect of label error ...,[Strength:\n+ The research problems are import...,[This paper generally is well-written and easy...,"[For me, the motivation and research problems ...",Accept: poster
1,RUzSobdYy0V,Quantifying and Mitigating the Impact of Label...,[],"[5, 6, 8]","[4, 3, 3]",0.0,"[443, 274, 401]","[3, 6, 3, 1]","[5, 5, 5, 5]",0.0,[The authors propose to add a biologically ins...,[Strengths:\n1) There are considerable gains o...,[The proposed work is not novel and evaluation...,[The proposed work is preliminary and lacks th...,Reject
2,N3kGYG3ZcTi,Suppression helps: Lateral Inhibition-inspired...,"['Lateral Inhibition', 'Convolutional Neural N...","[3, 5, 3, 1]","[5, 5, 5, 5]",0.0,"[333, 360, 362, 304]","[8, 6, 5, 8, 6]","[5, 4, 4, 2, 3]",0.0,"[In this work, the authors proposed a novel ne...",[Strength:\n- The work shows significant impro...,[The paper is clearly written. It has good qua...,[Overall I find this paper interesting. It has...,Accept: poster
3,tmIiMPl4IPa,Factorized Fourier Neural Operators,"['fourier transform', 'fourier operators', 'pd...","[8, 6, 3, 8, 3]","[5, 4, 4, 2, 2]",0.0,"[203, 142, 323, 520, 635]","[8, 6, 6]","[3, 2, 3]",0.0,[This paper tackles an important problem of ne...,[Strength: \n- The problem considered is an im...,"[In general, the paper is clearly written and ...","[All in all, I have some minor concerns regard...",Accept: poster
4,mhnHqRqcjYU,DFPC: Data flow driven pruning of coupled chan...,"['Pruning', 'Data Free', 'Model Compression']","[8, 6, 6]","[3, 2, 3]",0.0,"[302, 90, 257]","[8, 6, 8, 3]","[3, 3, 4, 4]",0.0,[In this paper authors propose a mechanism to ...,[Strengths\n------------\nThe paper clearly ca...,[The paper is very clear. It explicitly calls ...,[This paper guides the reader through the desi...,Accept: poster
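
In the preview above, the `ratings` and `ratings.1` columns of the same row differ, which suggests the two frames were not row-aligned when they were concatenated. One way to avoid relying on positions or index labels is to keep the paper id next to each filtered item and join on it. This is a minimal sketch on toy data. The variable names are hypothetical, and in the notebook the inputs would presumably be `ids` and `filtered_notes`:

```python
import pandas as pd

# toy inputs: ids in crawl order and the per-paper results (None = skipped item)
ids = ['p1', 'p2', 'p3']
filtered = [{'ratings': [5, 6]}, None, {'ratings': [8]}]

# attach the id to every valid item before dropping the invalid ones
rows = [{'id': pid, **item} for pid, item in zip(ids, filtered) if isinstance(item, dict)]
df_reviews = pd.DataFrame(rows)

df_meta = pd.DataFrame({'id': ids, 'title': ['A', 'B', 'C']})

# join on the id column so every review row meets its own metadata row
df_merged = df_meta.merge(df_reviews, on='id', how='inner')
print(df_merged)
```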


In [93]:
df_final.to_csv('thps.csv')

## Save filtered dataset
We save a smaller version of the dataset in CSV format, containing only the data we need for our analysis. This file can also be stored directly on GitHub.

In [None]:
# Save dataframe as csv
# rename title
df_final.rename(columns={'content.title': 'title'}, inplace=True)
#rename keywords
df_final.rename(columns={'content.keywords': 'keywords'}, inplace=True)
df_final.to_csv(f'{DATA_PATH}{venue_short}_{time.strftime("%Y%m%d")}.csv', index=False)
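
List-valued columns such as `ratings` are written to CSV as text, which is why the older CSV loaded earlier shows keywords as quoted strings. When reading such a file back, `ast.literal_eval` can be passed as a converter to rebuild the lists. A small round-trip sketch on toy data, using an in-memory buffer instead of a file:

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({'id': ['p1', 'p2'], 'ratings': [[5, 6, 8], [3, 3]]})

# write to an in-memory CSV
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# plain read: the lists come back as strings
buffer.seek(0)
df_str = pd.read_csv(buffer)

# read with a converter: the strings are parsed back into Python lists
buffer.seek(0)
df_back = pd.read_csv(buffer, converters={'ratings': ast.literal_eval})

print(type(df_str['ratings'][0]), type(df_back['ratings'][0]))
```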

## Save full crawled dataset

Note that this dataset is raw and contains everything, so it will be pretty large (>100 MB)!

In [None]:
# Save dataframe as hdf5
notes_df = pd.DataFrame([n['notes'] for n in data])
count_df = pd.DataFrame({'notes_count': [n['count'] for n in data]})
df = pd.concat([df_raw, notes_df, count_df], axis=1)
df.to_hdf(f'{DATA_PATH}{venue_short}_data_full_{time.strftime("%Y%m%d")}.h5', key='df', mode='w')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->axis0] [items->None]

  df.to_hdf(f'{DATA_PATH}{venue_short}_data_full_{time.strftime("%Y%m%d")}.h5', key='df', mode='w')
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->Index([                                                                    'id',
                                                                     'original',
                                                                        'mdate',
                                                                        'ddate',
                                                                       'tddate',
                                                                        'forum',
                                                                      'replyto',
                  