# Homework 5 - The eternal significance of publications and citations!

#### Group 15 <br>

<div style="float: left;">
    <table>
        <tr>
            <th>Student</th>
            <th>GitHub</th>
            <th>Matricola</th>
            <th>E-Mail</th>
        </tr>
        <tr>
            <td>André Leibrant</td>
            <td>JesterProphet</td>
            <td>2085698</td>
            <td>andre.leibrant@gmx.de</td>
        </tr>
    </table>
</div>

#### Import Libraries and Modules

In [1]:
import csv
import json
import pandas as pd

## 1. Data
In this homework, you will work on a dataset that contains information about a group of papers and their citation relationships. You can find and download the dataset [here](https://www.kaggle.com/datasets/mathurinache/citation-network-dataset).

### Graphs setup
Based on the available data, you will create two graphs to model our relationships as follows:

1. **Citation graph:** This graph should represent the paper's citation relationships. We want this graph to be unweighted and directed. The citation should represent the citation given from one paper to another. For example, if paper A has cited paper B, we should expect an edge from node A to B.

2. **Collaboration graph:** This graph should represent the collaborations of the paper's authors. This graph should be weighted and undirected. Consider an appropriate weighting scheme for your edges to make your graph weighted.
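
As a rough illustration of the two structures described above, the following sketch builds both graphs from a few made-up records using only the standard library. The toy papers, the `authors` and `references` field layout (modelled on the dataset preview further down), and the choice of "number of co-authored papers" as edge weight are assumptions for illustration; the assignment leaves the weighting scheme open.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy records, shaped like the dataset's entries
sample_papers = [
    {"id": 1, "authors": [{"id": 10}, {"id": 11}], "references": [2]},
    {"id": 2, "authors": [{"id": 10}, {"id": 11}, {"id": 12}], "references": []},
    {"id": 3, "authors": [{"id": 12}], "references": [1, 2]},
]

# Citation graph: directed and unweighted (edge A -> B means "A cites B")
citation_graph = defaultdict(set)
for paper in sample_papers:
    for cited_id in paper.get("references", []):
        citation_graph[paper["id"]].add(cited_id)

# Collaboration graph: undirected, weighted by the number of shared papers
# (assumed weighting scheme). Sorting the pair makes (a, b) == (b, a).
collaboration_weights = Counter()
for paper in sample_papers:
    author_ids = sorted({author["id"] for author in paper["authors"]})
    for a, b in combinations(author_ids, 2):
        collaboration_weights[(a, b)] += 1
```

Storing each undirected edge under a sorted pair of author ids keeps a single entry per collaboration regardless of the order in which the authors are listed.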

### Data pre-processing
The dataset is quite large and may not fit in your memory when you try constructing your graph. So, what is the solution? You should focus your investigation on a subgraph. You can work on the most connected component in the graph. However, you must first construct and analyze the connections to identify the connected components.

As a result, you will attempt to approximate that most connected component by performing the following steps:

1. Identify the top **10,000 papers** with the <u>highest number of citations</u>.


2. Then the **nodes** of your graphs would be as follows:

    **Citation graph:** you can consider each of the papers as your nodes

    **Collaboration graph:** the authors of these papers would be your nodes


3. For the **edges** of the two graphs, you would have the following cases:

    **Citation graph:** only consider the citation relationship between these 10,000 papers and ignore the rest.

    **Collaboration graph:** only consider the collaborations between the authors of these 10,000 papers and ignore the rest.
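
The edge rule for the citation graph in step 3 can be sketched as a simple filter: an edge is kept only when both endpoints belong to the selected set. The ids and reference lists below are hypothetical stand-ins for the top 10,000 papers.

```python
# Hypothetical ids standing in for the top 10,000 papers
top_ids = {1, 2, 3}

# Hypothetical reference lists; 99 and 42 are outside the selected set
references_by_paper = {1: [2, 99], 2: [], 3: [1, 42]}

# Keep an edge only if both the citing and the cited paper are selected
citation_edges = [
    (citing, cited)
    for citing, cited_list in references_by_paper.items()
    if citing in top_ids
    for cited in cited_list
    if cited in top_ids
]
```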
    
---

We read the large JSON file line by line, so that it never has to fit into memory as a whole, and store each paper's `id` together with its `n_citation` value in a list.

---

In [None]:
# File path for papers json file
file_path = "data.json"

# List to store paper information
papers = []

# Open the papers json file
with open(file_path, "r") as file:
    
    # Iterate through every line of the file
    for line_number, line in enumerate(file):
        
        # Remove the leading comma
        if line.startswith(","):
            line = line[1:]
        
        # Skip the opening and closing bracket
        if len(line) > 2:
            
            # Skip lines that have non-readable characters
            try:
            
                # Parse the json line
                data = json.loads(line)

                # Extract the id and number of citations (set to 0 if doesn't exist)
                paper_id = data["id"]
                paper_citations = data.get("n_citation", 0)

                # Append the information to the list
                papers.append({"id": paper_id, "n_citations": paper_citations})
                
            # Ignore lines that are not valid JSON or have no "id" field
            except (json.JSONDecodeError, KeyError):
                pass
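
The per-line handling above could also be factored into a small helper, which makes it easy to try on a few lines before running it over the whole file. This is a minimal sketch: `parse_record` and the sample lines are hypothetical, and it assumes one JSON object per line with an optional leading comma, as the loop above does.

```python
import json


def parse_record(line):
    """Return the parsed paper dict, or None for lines that hold no record."""
    # Drop surrounding whitespace and the leading comma, if any
    text = line.strip().lstrip(",")

    # Lines such as "[" or "]" carry no record
    if len(text) <= 2:
        return None

    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Hypothetical lines mimicking the layout of the file
sample_lines = ["[\n", '{"id": 1, "n_citation": 5}\n', ',{"id": 2}\n', "]\n"]
records = [r for r in map(parse_record, sample_lines) if r is not None]
```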

---

In the next step we sort the list by `n_citations` and extract the top 10,000 entries.

---

In [None]:
# Sort the list based on the number of citations
papers.sort(key=lambda paper: paper["n_citations"], reverse=True)

# Extract the top 10000 entries
top_10000_papers = papers[:10000]
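
Sorting the full list works, but only the first 10,000 entries are needed. As an alternative sketch, `heapq.nlargest` from the standard library returns the same entries as sorting in descending order and slicing, without ordering the rest of the list. The sample data below is hypothetical.

```python
import heapq

# Hypothetical citation counts for five papers
sample = [{"id": i, "n_citations": c} for i, c in enumerate([5, 42, 7, 19, 3])]

# Pick the three most cited papers, in descending order
top_3 = heapq.nlargest(3, sample, key=lambda paper: paper["n_citations"])
```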

---

To keep the results safe and avoid rerunning the previous steps, we save them to a CSV file.

---

In [None]:
# File path for the output csv file
csv_file_path = "top_10000_papers_ids.csv"

# Write the data to the csv file
with open(csv_file_path, "w", newline="") as csv_file:
    
    # Define the csv header
    fieldnames = ["id", "n_citations"]

    # Create a csv writer
    csv_writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    # Write the header to the csv file
    csv_writer.writeheader()

    # Write each entry to the csv file
    for paper in top_10000_papers:
        csv_writer.writerow({"id": paper["id"], "n_citations": paper["n_citations"]})
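
A short round-trip sketch of the same write, using an in-memory buffer instead of a file on disk. It shows `writerows` as a one-call alternative to the loop and illustrates that the `csv` module returns every value as a string when reading, so ids need an explicit `int` conversion unless a library such as Pandas infers the type. The two rows are taken from the table shown further down.

```python
import csv
import io

rows = [
    {"id": 2912565176, "n_citations": 42437},
    {"id": 2151103935, "n_citations": 35541},
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "n_citations"])
writer.writeheader()
writer.writerows(rows)

# Read the rows back; every value arrives as a string
buffer.seek(0)
restored = list(csv.DictReader(buffer))
restored_ids = [int(row["id"]) for row in restored]
```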

---

In the next step, we read the CSV file into a Pandas DataFrame and extract the paper `id` values into a list.

---

In [2]:
# Read the saved CSV file into a DataFrame
df_top_10000_papers = pd.read_csv("top_10000_papers_ids.csv")
df_top_10000_papers

Unnamed: 0,id,n_citations
0,2912565176,42437
1,2151103935,35541
2,2911964244,34741
3,1973948212,32053
4,2153635508,31047
...,...,...
9995,2034987376,559
9996,2035754981,559
9997,2062270497,559
9998,2070127246,559


In [3]:
# Convert id column to list
paper_ids = df_top_10000_papers["id"].tolist()

---

Finally, we iterate through the JSON file a second time and extract only the papers that are in the top 10,000 list.

---

In [7]:
from io import StringIO

# Use a set for constant-time membership checks
paper_id_set = set(paper_ids)

# Collect one temporary DataFrame per matching paper
frames = []

# Open the papers json file
with open(file_path, "r") as file:

    # Iterate through every line of the file
    for line in file:

        # Remove the leading comma
        if line.startswith(","):
            line = line[1:]

        # Skip the opening and closing bracket
        if len(line) > 2:

            # Skip lines that cannot be parsed
            try:

                # Parse the json line
                data = json.loads(line)

                # Only keep the paper if its id is in the top 10000 papers list
                if data["id"] in paper_id_set:

                    # Load the json string into a temporary Pandas DataFrame
                    tmp_df = pd.read_json(StringIO(line), lines=True)
                    frames.append(tmp_df)

            except (ValueError, KeyError):
                pass

# Concatenate all temporary DataFrames in a single step
df_top_10000_papers = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

In [11]:
df_top_10000_papers

Unnamed: 0,id,authors,title,year,n_citation,page_start,page_end,doc_type,publisher,volume,issue,doi,fos,venue,references,indexed_abstract,alias_ids
0,852874,"[{'name': 'David A. Randell', 'id': 2776268300...",A Spatial Logic based on Regions and Connection.,1992,1709,165,176,Conference,,,,,"[{'name': 'Mereotopology', 'w': 0.52232}, {'na...",{'raw': 'Principles of Knowledge Representatio...,,,
1,1699105,"[{'name': 'Nicolas T. Courtois', 'org': 'Crypt...",Algebraic attacks on stream ciphers with linea...,2003,758,345,359,Conference,Springer Verlag,,,10.1007/978-3-540-45146-4_11,"[{'name': 'Computer science', 'w': 0.42724}, {...",{'raw': 'Theory and Application of Cryptograph...,"[17553438, 29630963, 149636774, 1488905225, 15...","{'IndexLength': 205, 'InvertedIndex': {'criter...",
2,2169610,"[{'name': 'Jason Weston', 'id': 2058584252}, {...",Support vector machines for multi-class patter...,1999,643,219,224,Conference,,,,,"[{'name': 'Structured support vector machine',...",{'raw': 'The European Symposium on Artificial ...,,,
3,4214443,"[{'name': 'Kai Petersen', 'org': 'School of En...",Systematic mapping studies in software enginee...,2008,1032,68,77,Conference,BCS Learning & Development Ltd.,,,10.14236/ewic/ease2008.8,"[{'name': 'Systematic review', 'w': 0.57769}, ...",{'raw': 'Evaluation and Assessment in Software...,"[1965321492, 2082372845, 2103944702, 210478924...","{'IndexLength': 187, 'InvertedIndex': {'them.'...",
4,4508078,"[{'name': 'Hamish Cunningham', 'id': 207329895...",A framework and graphical development environm...,2002,1470,168,175,Conference,,,,,"[{'name': 'Development environment', 'w': 0.0}...",{'raw': 'Meeting of the Association for Comput...,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,2913772345,"[{'name': 'Hugues Hoppe', 'id': 2206701583}, {...",Mesh optimization,1993,1157,19,26,Conference,,,,10.1145/166117.166119,"[{'name': 'Mesh optimization', 'w': 0.0}, {'na...",{'raw': 'International Conference on Computer ...,"[1753524162, 1985020152, 2043896016, 204693154...",,
9996,2913932916,"[{'name': 'Ruslan Salakhutdinov', 'org': 'Depa...",Semantic hashing,2009,944,969,978,Journal,,50,7,10.1016/j.ijar.2008.11.006,"[{'name': 'Locality-sensitive hashing', 'w': 0...",{'raw': 'International Journal of Approximate ...,"[205159212, 1880262756, 1974393942, 1978394996...","{'IndexLength': 165, 'InvertedIndex': {'u0027u...",
9997,2913999547,"[{'name': 'Ying-Cheng Lai', 'id': 2914799073}]",Controlling chaos,1994,3290,62,67,,,8,1,10.1063/1.4823262,"[{'name': 'Control theory', 'w': 0.36276}, {'n...",{'raw': 'Computers in Physics archive'},,,
9998,2914019270,"[{'name': 'L. A. Zadeh', 'id': 2252586558}]",Fuzzy algorithms,1996,780,60,,,,,,,"[{'name': 'Computer science', 'w': 0.459070000...","{'raw': 'Fuzzy sets, fuzzy logic, and fuzzy sy...","[1972693464, 2292996153, 2912565176]",,


---

We save the final result to a CSV file.

---

In [9]:
# Save results in CSV file
df_top_10000_papers.to_csv("top_10000_papers.csv", index=False)
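
One caveat with this step: columns that hold lists or dictionaries, such as `authors` and `references`, are written to the CSV as their string representation, so reading the file back yields strings rather than Python objects. The sketch below shows one way to recover them with `ast.literal_eval`, assuming the cells contain plain Python literals; the sample strings are modelled on the preview above.

```python
import ast

# Cells as they would come back from the CSV file (strings, not lists)
authors_cell = "[{'name': 'L. A. Zadeh', 'id': 2252586558}]"
references_cell = "[1972693464, 2292996153, 2912565176]"

# Turn the string representations back into Python objects
authors = ast.literal_eval(authors_cell)
references = ast.literal_eval(references_cell)
```

With Pandas, the same function can be passed through the `converters` argument of `pd.read_csv`. Empty cells (papers without references, for example) are not valid literals and would need separate handling.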