## Who do authors in the top 5 economics journals collaborate with?
The collaboration network in the top 5 economics journals is a subset of the collaboration network of all published economists, which is in turn a subset of the acquaintance network of economists. The EconLit database is commonly treated as a sufficient representation of the network of publishing economists. That network is also assumed to mirror the acquaintance network, which is not easily observed.

A publication in a top 5 journal signals that the article is especially novel and of a high standard of quality. Since our data consist strictly of metadata from the top 5 economics journals, which have been the most prestigious journals since the 1970s (see Tinbergen ranks), any model results are interpreted within this context. Specifically, these are observations of the top-performing economists and their peers and are not representative of all economists.

It may take an academic several years to reach the level at which they produce work of sufficient quality and novelty to publish in the top 5. That is not to say they have never published before as economists or academics.

Questions when analyzing the top 5 network, taking into consideration first-time and repeat authors:

- Do first-time authors collaborate thereafter, and how frequently?
- Does a halo effect exist for first-time authors?
- How do authors behave after publishing in the top 5? Are they more likely to publish again?
- Are collaborations more likely between authors who are already top 5 authors, or do authors tend to collaborate with new entrants?
- How often do new entrants continue to publish at top 5 quality?
- If authors publish again, what is the split between single-authored and coauthored publications, and do they always collaborate with new top 5 authors or always with authors who are already in the network?

Dataset construction:
We classify each author pair into one of three categories:
- 0: both authors are already in the network when they collaborate.
- 1: one author is new to the network.
- 2: both authors are new to the network.
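
The pair categories can be illustrated with a minimal sketch on toy data. The frame `toy`, the column `pair_category`, and the rule that an author counts as "new" in the year of their first paper in the data are all assumptions made for illustration, not the definition used in the thesis.

```python
import pandas as pd
from itertools import combinations

# Hypothetical toy data: one row per (paper, author).
toy = pd.DataFrame({
    "Paper_ID": ["p1", "p1", "p2", "p2", "p3", "p3"],
    "Author":   ["A",  "B",  "A",  "C",  "A",  "B"],
    "Year":     [1990, 1990, 1995, 1995, 2000, 2000],
})

# Assumption: an author is "new" in the year of their first paper in the data.
first_year = toy.groupby("Author")["Year"].min()

rows = []
for (paper, year), authors in toy.groupby(["Paper_ID", "Year"])["Author"]:
    for a, b in combinations(sorted(authors), 2):
        # Count how many of the two authors enter the network with this paper.
        n_new = int(first_year[a] == year) + int(first_year[b] == year)
        rows.append({"Paper_ID": paper, "Author": a, "Coauthor": b,
                     "Year": year, "pair_category": n_new})

pairs = pd.DataFrame(rows)
print(pairs)
```

Under this rule, an author with several papers in their entry year counts as new on all of them. A stricter definition would need an ordering of papers within a year, which the toy data does not carry.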


In [None]:
import pandas as pd
from itertools import combinations


In [3]:
upd10 = pd.read_pickle("flattened_co-author_10.pkl")
upd20 = pd.read_pickle("flattened_co-author_20.pkl")
upd5 = pd.read_pickle("flattened_co-author_5.pkl")

In [None]:
base_path="/Users/sijiawu/Work/Thesis/Data/Affiliations/"
data_base_path="/Users/sijiawu/Work/Thesis/Data/"
nets_path="/Users/sijiawu/Work/80YearsEconomicResearch/032_auth_graph_gen/networks/"
pdf_base_path="/Users/sijiawu/Dropbox/80YearsEconomicResearch/Data/0_PDF/"

proc_auths_all = pd.read_pickle(base_path+"proc_auth_aff_flat.pkl")
aff_sub=pd.read_pickle(base_path+"affiliations_combined_sub.pkl")
j_data=pd.read_pickle(data_base_path+"Combined/011_merged_proc_scopus_inception_2020_w_counts.pkl")
all_refs=pd.read_excel('../031_proc_refs_full_set/refs_1940_2020.xlsx')
relevant=pd.read_excel('../031_proc_refs_full_set/refs_1940_2020_top5.xlsx')

In [None]:
j_data["id"]=j_data["URL"].str.split("/").str[-1]
relevant["id_o"]=relevant["id_o"].astype(str)
relevant["year_o"]=relevant["year_o"].astype(int)
proc_auths_all["id_o"]=proc_auths_all["url"].str.split("/").str[-1]
proc_auths_all["a1_order_str"]=proc_auths_all["a1_order"].astype(str)
relevant_sub=relevant[["ref_ord", "id_o", "year_o","match_id"]]

In [None]:
ex_content=['MISC', 'Errata','Discussion', 'Review', 'Review2']
content=['Article', 'Comment', 'Reply', 'Rejoinder']

In [None]:
# Load dataset (Assuming it has 'Paper_ID', 'Author', and 'Year')
df = proc_auths_all[["id_o","a1","last", "a1_order_str", "year"]].rename(columns={'id_o': 'Paper_ID', 'a1_order_str': 'Author', "year":"Year"})
# Step 1: Expand multi-author papers into pairwise relationships
df_expanded = df.groupby(["Paper_ID", "Year"])['Author'].apply(
    lambda x: list(combinations(x, 2)) if len(x) > 1 else [(x.iloc[0], None)]
).explode()

# Convert tuples to separate columns
df_expanded = pd.DataFrame(df_expanded.tolist(), index=df_expanded.index, columns=["Author", "Coauthor"]).reset_index()

# Keep necessary columns
df_expanded = df_expanded[["Author", "Coauthor", "Year"]]
df_expanded_alt=df_expanded[["Author", "Coauthor", "Year"]].fillna(-1)

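# Step 2: Count occurrences of coauthorship per year
# Note: groupby excludes rows with a missing key by default, so solo-authored rows
# (Coauthor is None) are not counted in df_expanded. df_expanded_alt keeps them by
# using -1 as a placeholder coauthor.
df_expanded["Coauthorship_Count"] = df_expanded.groupby(["Author", "Coauthor", "Year"])['Year'].transform('count')
df_expanded_alt["Coauthorship_Count"] = df_expanded_alt.groupby(["Author", "Coauthor", "Year"])['Year'].transform('count')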


# Determine first and last publication year for each author
author_first_pub = df.groupby("Author")["Year"].min()
author_last_pub = df.groupby("Author")["Year"].max()
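
As a rough sketch of how first and last publication years could feed into the question of whether new entrants publish again, the example below builds per-author summaries on toy data. The frame `toy` and the columns `n_papers`, `repeat_author`, and `active_span` are hypothetical names introduced here; only `Paper_ID`, `Author`, and `Year` follow the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per (paper, author).
toy = pd.DataFrame({
    "Paper_ID": ["p1", "p1", "p2", "p2", "p3"],
    "Author":   ["A",  "B",  "A",  "C",  "A"],
    "Year":     [1990, 1990, 1995, 1995, 2000],
})

# Per-author first year, last year, and number of distinct papers.
stats = toy.groupby("Author").agg(
    first_year=("Year", "min"),
    last_year=("Year", "max"),
    n_papers=("Paper_ID", "nunique"),
)

# Assumption: a "repeat" author is one with more than one paper in the data.
stats["repeat_author"] = stats["n_papers"] > 1
stats["active_span"] = stats["last_year"] - stats["first_year"]

# Share of authors who publish again after their first paper.
repeat_share = stats["repeat_author"].mean()
print(stats)
print(repeat_share)
```

Counting distinct `Paper_ID` values instead of rows guards against duplicated author rows for the same paper.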