# Template Notebook for Milestones

In this notebook you will write your code, producing the required output for each Milestone.

Your notebook must contain 3 types of cells:

- (1) Code cells: Cells that contain code snippets, capturing one cohesive fragment of your code.
- (2) Corresponding explanation cells: Each code cell must be followed by a text cell containing an **English** explanation of what the corresponding code cell does and what its purpose is.
- (3) One reflection cell: One cell at the bottom of the notebook that contains your individual reflection, in **English**, on your process working on this milestone. It could cover technical problems and how you overcame them, social problems and how you dealt with them (group work is hard!), or prior skills and knowledge that made certain parts of the task easier for you. These are just suggestions; your individual reflection will of course contain different or additional aspects.

In [4]:
import json

In [22]:
# Read the input file; each line holds one JSON document.
with open('ir-anthology-07-11-2021-ss23.jsonl', 'r') as json_file:
    json_list = list(json_file)

datasets = []

for json_str in json_list:
    result = json.loads(json_str)

    # Join the author and editor names into space-separated strings.
    authors = " ".join(result['authors'])
    editors = " ".join(result['editors'])

    # Abstract and book title are optional; fall back to an empty string.
    abstract = result.get("abstract", "")
    bookTitle = result.get("bookTitle", "")

    doc_id = result['id']
    # Concatenate all fields, then collapse the double and triple spaces left by empty fields.
    text = (result['title'] + " " + authors + " " + editors + " " + result['year'] + " " + abstract + " " + bookTitle).replace("   ", " ").replace("  ", " ")
    datasets.append({'doc_id': doc_id, 'text': text})

print("Example converted dataset:")
print(datasets[0])

with open("ir_datasets.jsonl", 'w') as f:
    for item in datasets:
        f.write(json.dumps(item) + "\n")
    

Example converted dataset:
{'doc_id': '2019.sigirconf_workshop-2019birndl.0', 'text': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019 Muthu Kumar Chandrasekaran Philipp Mayr 2019 '}


### Documentation

This script converts a JSONL file of metadata about information retrieval research papers into a new JSONL file whose records have only two attributes, doc_id and text, which makes the data suitable for search.
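To make the target format concrete, here is a minimal sketch of the mapping for a single record. The function name `to_search_record` and the example record are hypothetical and invented for illustration. The sketch also swaps in a different whitespace normalization than the notebook code: it uses `split()` and `join()`, which collapses runs of any length and trims the ends, so its output can differ slightly from the script's (for instance, no trailing space).

```python
def to_search_record(record):
    """Map one anthology record to a {'doc_id', 'text'} dictionary."""
    parts = [
        record["title"],
        " ".join(record.get("authors", [])),
        " ".join(record.get("editors", [])),
        str(record.get("year", "")),
        record.get("abstract", ""),
        record.get("bookTitle", ""),
    ]
    # split() without arguments drops empty pieces, so whitespace runs of
    # any length collapse to a single space and the ends are trimmed.
    text = " ".join(" ".join(parts).split())
    return {"doc_id": record["id"], "text": text}


# Hypothetical record, invented for illustration only.
example = {
    "id": "example.0",
    "title": "A Sample  Title",
    "authors": ["Ada Lovelace"],
    "editors": [],
    "year": "2021",
}
converted = to_search_record(example)
print(converted)
```

Missing optional fields simply contribute nothing to the text, which is why the empty editors list and the absent abstract leave no extra spaces behind.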

- Open the input file ir-anthology-07-11-2021-ss23.jsonl for reading.
- Read the file line by line and store the lines in the list json_list.
- Create an empty list, datasets, to store the processed data.
- Iterate through each JSON string in json_list. For every element:
    - Load the JSON string into a dictionary named result
    - Concatenate the author names in result['authors'] and store them in the authors variable
    - Concatenate the editor names in result['editors'] and store them in the editors variable
    - Check if an abstract is present; if so, store it in the abstract variable
    - Check if a book title is present; if so, store it in the bookTitle variable
    - Extract the document ID from result['id']
    - Create a single text string containing the title, authors, editors, year, abstract, and book title, with extra spaces removed
    - Append a dictionary containing the doc_id and text to the datasets list
- Open the output file ir_datasets.jsonl for writing.
- Write each item in the datasets list as a JSON string followed by a newline character to the output file.
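The read and write steps above could also be done in a single streaming pass, so the whole file never has to sit in memory. The following is only a sketch under assumptions: `convert_jsonl`, its `convert` parameter, and the throwaway demo records are hypothetical names and data, not part of the original script.

```python
import json
import os
import tempfile


def convert_jsonl(src_path, dst_path, convert):
    """Stream src_path line by line, apply convert, and write to dst_path."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines instead of failing on them
            dst.write(json.dumps(convert(json.loads(line))) + "\n")
            count += 1
    return count


# Demonstration on a temporary file with made-up records.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "input.jsonl")
dst_path = os.path.join(tmp_dir, "output.jsonl")
with open(src_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"id": "a.0", "title": "First"}) + "\n")
    f.write(json.dumps({"id": "a.1", "title": "Second"}) + "\n")

written = convert_jsonl(
    src_path,
    dst_path,
    lambda r: {"doc_id": r["id"], "text": r["title"]},
)

# Read the result back to confirm each line is a valid JSON record.
with open(dst_path, "r", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(written, rows)
```

Passing the per-record conversion in as a function keeps the file handling separate from the field logic, so either part can be changed or tested on its own.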

### Reflection

The task was tough because we faced a few technical issues, mainly with installing and handling Docker. Also, our programming skills varied, so we needed good communication within our group, making sure everyone could follow the results.