# Individual and gender inequality in computer science: A career study of cohorts from 1970 to 2000

## Part 1: Preprocessing

In this notebook, we produce three dataframes. First, a `features` dataframe (saved as 'features.csv.gz') that contains all the variables used in the prediction models (plus two more). Second, a `counts` dataframe (saved as 'counts.csv.gz') that records, for each author and cohort, how many publications were produced and how many citations were received in each career age and cumulatively up to that career age. Third, a `counts_first` dataframe (saved as 'counts_first.csv.gz') that holds the same information, restricted to publications produced as a first author. These dataframes will be used in subsequent notebooks.

---

### 1. Imports

Many of the custom functions we need are stored in a utilities file.

In [None]:
import numpy as np
import pandas as pd

from utils import *

### 2. Parameters

- `COHORT_FIRST` and `COHORT_LAST` set the interval in which cohort members have published their first paper,
- `CAREER_AGES` sets for how many years we study careers,
- `WINDOW_SIZE` sets the length of the early career,
- `DROPOUT_SIZE` sets the number of consecutive years which define that an author has left academia.

In [None]:
COHORT_FIRST = 1970
COHORT_LAST = 2000
CAREER_AGES = 15
WINDOW_SIZE = 3
DROPOUT_SIZE = 10
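
To make the roles of these parameters concrete, the following minimal sketch maps a publication year to a career age and checks whether it falls into the early-career window or the study period. The helper names `career_age`, `in_early_career`, and `in_study_period` are illustrative and are not part of `utils`. The convention that the cohort year counts as career age 1 is an assumption, chosen to be consistent with the `career_length` variable computed in section 4.1.

```python
CAREER_AGES = 15
WINDOW_SIZE = 3

def career_age(year, cohort):
    """Career age of a year; the cohort year itself is career age 1 (assumed convention)."""
    return year - cohort + 1

def in_early_career(year, cohort, window_size=WINDOW_SIZE):
    """True if the year lies within the first `window_size` career ages."""
    return cohort <= year < cohort + window_size

def in_study_period(year, cohort, career_ages=CAREER_AGES):
    """True if the year lies within the first `career_ages` career ages."""
    return cohort <= year < cohort + career_ages

cohort = 1995
for year in (1995, 1997, 1998, 2009, 2010):
    print(year, career_age(year, cohort),
          in_early_career(year, cohort), in_study_period(year, cohort))
```

With these settings, an author from the 1995 cohort has an early career spanning 1995 to 1997 and a study period spanning 1995 to 2009.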

### 3. Load data

Download all files from [here](https://doi.org/10.7802/2642) into the 'data' directory and load them:

In [None]:
# 1,704,919 authors (comma-separated): name (integer), gender (string)
authors = pd.read_csv('../data/authors.csv.gz')

# 9,471,668 relationships (comma-separated): author (string), year (integer), pub_id (string)
authors_publications = pd.read_csv('../data/authors_publications.csv.gz')

# 8,938,798 relationships (comma-separated): id1 (string), id2 (string), year (integer)
citations = pd.read_csv('../data/citations.csv.gz')

# 2,285,112 publications (comma-separated): pub_id (string), year (integer), venue (string), 
# h5_index (integer), ranking (float), deciles (integer), quantiles (integer)
publications = pd.read_csv('../data/publications.csv.gz')

# 48,555 publications: pub_id (string)
publications_arxiv = pd.read_csv('../data/publications_arxiv.csv.gz')

# 3,078,230 relationships (comma-separated): pub_id (string), year (integer), authors (string), 
# num_authors (integer), is_alpha (boolean), first_author (string)
publications_authors = pd.read_csv('../data/publications_authors.csv.gz')

Remove duplicates and preprints from arXiv:

In [None]:
authors_publications.drop_duplicates(subset=['author', 'pub_id'], inplace=True)
authors_publications = authors_publications.loc[~authors_publications.pub_id.isin(publications_arxiv['pub_id'])]

citations.drop_duplicates(inplace=True)
citations = citations.loc[(~citations.id1.isin(publications_arxiv['pub_id']))]
citations = citations.loc[(~citations.id2.isin(publications_arxiv['pub_id']))]
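
The filtering pattern used above can be checked on a toy example. This is only a sketch: the author and publication identifiers are invented, and the `toy_` names are not used elsewhere in the notebook.

```python
import pandas as pd

# Invented toy data: one duplicated (author, pub_id) pair and one arXiv preprint ('x1').
toy_links = pd.DataFrame({
    'author': ['a1', 'a1', 'a2', 'a3'],
    'pub_id': ['p1', 'p1', 'p2', 'x1'],
    'year':   [1990, 1990, 1991, 1992],
})
toy_arxiv = pd.DataFrame({'pub_id': ['x1']})

# Drop repeated (author, pub_id) pairs, then drop rows whose pub_id is an arXiv preprint.
toy_clean = toy_links.drop_duplicates(subset=['author', 'pub_id'])
toy_clean = toy_clean.loc[~toy_clean['pub_id'].isin(toy_arxiv['pub_id'])]

print(toy_clean)
```

Only the rows for 'p1' (once) and 'p2' remain.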

### 4. Construct dataframes

#### 4.1. Feature: Baseline

Here, we begin constructing the `features` dataframe. First, we extract the "Cohort", the year in which an author had their first publication, plus a 'career_length' variable:

In [None]:
groupByAuthor = authors_publications.groupby(['author'])
groupByAuthorMinYearData = groupByAuthor['year'].min()
groupByAuthorMaxYearData = groupByAuthor['year'].max()

features = groupByAuthorMinYearData.to_frame(name='cohort')
features['end_year'] = groupByAuthorMaxYearData
features = features.reset_index()
features = features.drop_duplicates()
features = features.dropna(how='any')
features['career_length'] = features['end_year'] - features['cohort'] + 1
del features['end_year']
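
As a quick sanity check of the cohort logic, the same quantities can be derived on invented toy data. The sketch below uses named aggregation instead of two separate group-by calls; this is one possible formulation, not the one used in the cell above.

```python
import pandas as pd

# Invented toy data: author 'a1' publishes in 1990, 1992 and 1999; 'a2' only in 1995.
toy = pd.DataFrame({
    'author': ['a1', 'a1', 'a1', 'a2'],
    'year':   [1990, 1992, 1999, 1995],
})

# First and last publication year per author in one pass.
toy_features = (toy.groupby('author')['year']
                   .agg(cohort='min', end_year='max')
                   .reset_index())

# Both end points count, hence the + 1.
toy_features['career_length'] = toy_features['end_year'] - toy_features['cohort'] + 1

print(toy_features)
```

An author with a single publication year therefore has a career length of 1, not 0.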

#### 4.2. Features: Gender

Merge in the gender variable (to be transformed into "Male", "Female", and "Undetected" dummies in part 4):

In [None]:
features = features.merge(authors, left_on='author', right_on='name', how='left')
features.drop('name', axis=1, inplace=True)
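
A left merge keeps every author in `features`, even if no gender entry exists. The sketch below illustrates this on invented toy data; the gender labels 'f' and 'm' are placeholders and may not match the coding used in 'authors.csv.gz'. The optional `validate='many_to_one'` argument is an addition for illustration: it makes pandas raise an error if a name occurs more than once on the right-hand side.

```python
import pandas as pd

# Invented toy data; author 'a3' has no gender entry.
toy_features = pd.DataFrame({'author': ['a1', 'a2', 'a3'],
                             'cohort': [1990, 1991, 1992]})
toy_authors = pd.DataFrame({'name': ['a1', 'a2'],
                            'gender': ['f', 'm']})  # placeholder labels

merged = toy_features.merge(toy_authors, left_on='author', right_on='name',
                            how='left', validate='many_to_one')
merged = merged.drop('name', axis=1)

print(merged)
```

The row count is unchanged, and the author without a match receives a missing value for `gender`.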

#### 4.3. Temporary dataframes

To engineer the remaining features and construct the `counts` dataframes, some temporary dataframes are needed:

In [None]:
# citations for every author and paper
publications_citations_no_uncited = authors_publications.merge(citations, left_on='pub_id', right_on='id2', how='inner', suffixes=('_pub', '_cit'))
publications_citations_no_uncited = publications_citations_no_uncited.merge(features[['author', 'cohort']], on='author', how='inner')
publications_citations_no_uncited = publications_citations_no_uncited[publications_citations_no_uncited.year_pub <= publications_citations_no_uncited.year_cit]

In [None]:
# citations per paper
paper_paper_citations = publications_citations_no_uncited[['id1', 'id2', 'year_pub', 'year_cit']]
paper_paper_citations = paper_paper_citations.drop_duplicates(subset=['id1', 'id2'])
paper_total_citations = paper_paper_citations.groupby('id2')['id1'].count()

In [None]:
# papers per author and cohort
publications_start_year = authors_publications.merge(features[['author', 'cohort']], on='author', how='inner')

In [None]:
# publication first-author relationships
publications_first_author = publications_authors.merge(features[['author', 'cohort']], left_on='first_author', right_on='author', how='left')
publications_first_author = publications_first_author.drop('first_author', axis='columns')

In [None]:
# author citations per year
authors_yearly_citations = publications_citations_no_uncited.groupby(['author', 'year_cit'])['id1'].count()
authors_yearly_citations = authors_yearly_citations.reset_index()
authors_yearly_citations = authors_yearly_citations.rename(columns={'id1': 'num_cit', 'year_cit': 'year'})

In [None]:
# author publications per year
authors_yearly_publications = authors_publications.groupby(['author', 'year'])['pub_id'].count().reset_index()
authors_yearly_publications = authors_yearly_publications.rename(columns={'pub_id': 'num_pub'})

#### 4.4. Features: Early achievement

This set contains four variables. First, "Productivity", the cumulative number of publications authored in the early career (set to be the first three career ages):

In [None]:
early_career_publications_reduced = publications_start_year[publications_start_year.year < publications_start_year['cohort'] + WINDOW_SIZE]
early_career_publications_ = early_career_publications_reduced.groupby('author').agg({'pub_id': 'nunique'}).reset_index()
early_career_publications_ = early_career_publications_.rename({'pub_id': 'productivity'}, axis='columns')

features = features.merge(early_career_publications_, on='author', how='left')

Second, "Productivity (1st author)", the cumulative number of publications authored in the early career as a first author:

In [None]:
publications_first_author_early = publications_first_author[(publications_first_author.year < publications_first_author['cohort'] + WINDOW_SIZE)]
publications_first_author_early = publications_first_author_early.groupby('author').agg({'pub_id': 'count'}).reset_index()
publications_first_author_early.rename({'pub_id': 'productivity_first'}, axis='columns', inplace=True)

features = features.merge(publications_first_author_early, on='author', how='left')
features['productivity_first'] = features['productivity_first'].fillna(0)
features['productivity_first'] = features['productivity_first'].astype(int)

Third, "Impact", the cumulative number of citations received in the early career:

In [None]:
col_name_early = 'impact'
early_career_impact = publications_citations_no_uncited[(publications_citations_no_uncited.year_pub < publications_citations_no_uncited['cohort'] + WINDOW_SIZE) & (publications_citations_no_uncited.year_cit < publications_citations_no_uncited['cohort'] + WINDOW_SIZE)]
early_career_impact = early_career_impact.groupby('author')['id1'].count()
early_career_impact = early_career_impact.rename(col_name_early)
early_career_impact = early_career_impact.reset_index()

features = features.merge(early_career_impact, on='author', how='left')
features[col_name_early] = features[col_name_early].fillna(0)
features[col_name_early] = features[col_name_early].astype(int)
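
The double condition in the cell above means that a citation only counts towards "Impact" if both the cited paper and the citing paper appeared within the early-career window. A minimal sketch on invented toy data:

```python
import pandas as pd

WINDOW_SIZE = 3

# Invented toy data: one row per (author, citing paper id1) pair.
toy_cit = pd.DataFrame({
    'author':   ['a1', 'a1', 'a1', 'a2'],
    'id1':      ['c1', 'c2', 'c3', 'c4'],
    'year_pub': [1990, 1991, 1991, 1995],
    'year_cit': [1991, 1992, 1994, 2000],
    'cohort':   [1990, 1990, 1990, 1995],
})

# First year after the early career, per row.
cutoff = toy_cit['cohort'] + WINDOW_SIZE

# Keep citations where both publication and citation fall before the cutoff.
early = toy_cit[(toy_cit['year_pub'] < cutoff) & (toy_cit['year_cit'] < cutoff)]
toy_impact = early.groupby('author')['id1'].count().rename('impact')

print(toy_impact)
```

Author 'a1' receives two early citations (the 1994 citation arrives too late), and 'a2' does not appear in the result at all, which is why the missing values are filled with 0 after the merge.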

Fourth, "Top source", the smallest h5-index-based quartile rank among all journals and conference proceedings in which an author has published during the early career:

In [None]:
early_career_venues = publications_start_year.merge(publications[['pub_id', 'quantiles']], on='pub_id', how='inner')
early_career_venues_ec = early_career_venues[early_career_venues.year < early_career_venues['cohort'] + WINDOW_SIZE]
early_career_venues_gr = early_career_venues_ec.groupby('author').agg({'quantiles': 'min'})
early_career_venues_gr = early_career_venues_gr.reset_index()

features = features.merge(early_career_venues_gr, on='author', how='left')
features['quantiles'] = features['quantiles'].fillna(4)
features['top_source'] = features['quantiles'].apply(quantile_binary)
del features['quantiles']

#### 4.5. Features: Social support

This set contains three variables. First, "Collaboration network", the number of distinct co-authors in the early career:

In [None]:
combined_early_degree = publications_start_year[(publications_start_year.year < publications_start_year['cohort'] + WINDOW_SIZE)]
combined_early_degree = combined_early_degree.drop_duplicates(subset=['author', 'pub_id'])
combined_early_degree = combined_early_degree[['author', 'pub_id']]
combined_early_degree = combined_early_degree.merge(publications_start_year, on='pub_id')
combined_early_degree = combined_early_degree[combined_early_degree.author_x != combined_early_degree.author_y]
combined_early_degree = combined_early_degree.drop_duplicates(subset=['author_x', 'author_y'])
combined_early_degree = combined_early_degree.groupby('author_x')['author_y'].count().reset_index()
combined_early_degree.rename({'author_x': 'author', 'author_y': 'collaboration_network'}, axis='columns', inplace=True)

features = features.merge(combined_early_degree, on='author', how='left')
features['collaboration_network'] = features['collaboration_network'].fillna(0)
features['collaboration_network'] = features['collaboration_network'].astype(int)
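
The self-merge on `pub_id` above pairs every author with all other authors of the same early-career papers and then counts distinct partners. The same idea can be expressed with plain Python sets. The sketch below uses invented toy data and is meant as an illustration of the logic, not as a replacement for the cell above.

```python
from collections import defaultdict

# Invented toy data: early-career papers and their author lists.
early_papers = {
    'p1': ['a1', 'a2', 'a3'],
    'p2': ['a1', 'a2'],
    'p3': ['a4'],           # single-author paper
}

# For every author, collect the set of distinct co-authors across all papers.
coauthors = defaultdict(set)
for authors_of_paper in early_papers.values():
    for author in authors_of_paper:
        coauthors[author].update(a for a in authors_of_paper if a != author)

collaboration_network = {author: len(peers) for author, peers in coauthors.items()}
print(collaboration_network)
```

Repeated collaborations with the same person count once, and an author with only single-author papers has a collaboration network of 0.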

Second, "Senior support", the largest h-index of all co-authors in the early career:

In [None]:
# h-index of all authors
papers_authors = publications_citations_no_uncited[['author', 'year_pub']].drop_duplicates(subset=['author', 'year_pub'])
# collect one result per year and concatenate once (DataFrame.append was removed in pandas 2.0)
hind_frames = []
for year_x in papers_authors.year_pub.unique():
    # use a separate name so that the `authors` dataframe loaded in section 3 is not overwritten
    authors_in_year = papers_authors[papers_authors.year_pub == year_x].author.values
    author_hind_at_year = author_h_index_in_year_X(publications_citations_no_uncited, authors_in_year, year_x)
    hind_frames.append(author_hind_at_year)
all_authors_hind = pd.concat(hind_frames)
papers_authors = papers_authors.merge(all_authors_hind, how='left')
papers_authors['h-index'] = papers_authors['h-index'].fillna(0)
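
For reference, the h-index is the largest number h such that an author has h papers with at least h citations each. The function below is a minimal sketch of this standard definition. It is not the implementation of `author_h_index_in_year_X` from the utilities file, which additionally restricts the citation data to a given year.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` papers all have at least `rank` citations
        else:
            break         # counts are sorted, so no later paper can qualify
    return h

print(h_index([10, 8, 5, 4, 3]))
print(h_index([25, 8, 5, 3, 3]))
```

An author without cited papers has an h-index of 0, which matches the `fillna(0)` step above.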

In [None]:
# largest h-index of early-career co-authors
combined_early_coauthor = publications_start_year[(publications_start_year.year < publications_start_year['cohort'] + WINDOW_SIZE)]
combined_early_coauthor = combined_early_coauthor.drop_duplicates(subset=['author', 'pub_id'])
combined_early_coauthor = combined_early_coauthor[['author', 'pub_id']]
combined_early_coauthor = combined_early_coauthor.merge(publications_start_year, on='pub_id')
combined_early_coauthor = combined_early_coauthor[combined_early_coauthor.author_x != combined_early_coauthor.author_y]
combined_early_coauthor = combined_early_coauthor.drop_duplicates(subset=['author_x', 'author_y'])
combined_early_coauthor = combined_early_coauthor.merge(papers_authors, left_on=['author_y', 'year'], right_on=['author', 'year_pub'])
combined_early_coauthor = combined_early_coauthor.groupby('author_x')['h-index'].max().reset_index()
combined_early_coauthor.rename({'author_x': 'author', 'h-index': 'senior_support'}, axis='columns', inplace=True)
combined_early_coauthor = combined_early_coauthor[['author', 'senior_support']]

features = features.merge(combined_early_coauthor, on='author', how='left')
features['senior_support'] = features['senior_support'].fillna(0)
features['senior_support'] = features['senior_support'].astype(int)

Third, "Team size", the median number of authors of all publications produced in the early career:

In [None]:
publications_early = publications_start_year[(publications_start_year.year < publications_start_year['cohort'] + WINDOW_SIZE)]
paper_team_size = publications_early.groupby('pub_id').agg({'author': 'nunique'}).reset_index()
paper_team_size = paper_team_size.rename({'author': 'team_size'}, axis='columns')
publications_early = publications_early.merge(paper_team_size, on='pub_id', how='left')
team_size_median = publications_early.groupby('author').agg({'team_size': 'median'}).reset_index()

features = features.merge(team_size_median, on='author', how='left')
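
The two-step aggregation (authors per paper, then median per author) can be illustrated with plain Python. The toy data is invented and assumes that every listed author is within their own early career when the paper appears, because the cell above counts team members from `publications_early` only.

```python
from statistics import median

# Invented toy data: early-career papers and their author lists.
early_papers = {
    'p1': ['a1', 'a2', 'a3'],
    'p2': ['a1', 'a2'],
    'p3': ['a1'],
}

# Step 1: team size per paper; step 2: list of team sizes per author.
team_sizes = {}
for authors_of_paper in early_papers.values():
    size = len(set(authors_of_paper))
    for author in authors_of_paper:
        team_sizes.setdefault(author, []).append(size)

# Step 3: median team size per author.
team_size_median = {author: median(sizes) for author, sizes in team_sizes.items()}
print(team_size_median)
```

With an even number of papers the median is the mean of the two middle values, so "Team size" is not necessarily an integer.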

#### 4.6. Dependent variable: Dropout

"Dropout" is a boolean dependent variable that is true if an author has not published for at least ten consecutive years within the first 15 career ages:

In [None]:
pubs_grouped = publications_start_year[(publications_start_year.year >= publications_start_year.cohort) & (publications_start_year.year < publications_start_year.cohort + CAREER_AGES)]
pubs_grouped = pubs_grouped.groupby('author').agg({'year': lambda x: sorted(list(x))})
pubs_grouped['year'] = pubs_grouped['year'].apply(lambda x: sorted(list_append(x, x[0] + CAREER_AGES)))
pubs_grouped['absence_list'] = pubs_grouped['year'].apply(np.diff)
pubs_grouped['last_consec_ca'] = pubs_grouped['absence_list'].apply(get_last_consec)
# keep all gaps as a plain list: gaps of 0 or 1 can never exceed a longer gap, so the maximum below is unaffected
pubs_grouped['absence_list'] = pubs_grouped['absence_list'].apply(list)
pubs_grouped['max_absence'] = pubs_grouped['absence_list'].apply(max)
pubs_grouped['max_absence'] = pubs_grouped['max_absence'] - 1
pubs_grouped.reset_index(inplace=True)

features = features.merge(pubs_grouped[['author', 'max_absence', 'last_consec_ca']], on='author', how='left')
features['dropout'] = features['max_absence'] >= DROPOUT_SIZE
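
The gap computation can be traced on a single author. The sketch below follows the same steps as the cell above (append the first year after the study period, take differences between consecutive years, subtract one from the largest difference) in plain Python. The function names `max_absence` and `is_dropout` are illustrative, and the sketch assumes that `list_append` from the utilities file simply appends a value to a list.

```python
CAREER_AGES = 15
DROPOUT_SIZE = 10

def max_absence(years, career_ages=CAREER_AGES):
    """Longest run of full calendar years without a publication in the study period."""
    years = sorted(years)
    cohort = years[0]
    in_window = [y for y in years if y < cohort + career_ages]
    # The first year after the study period closes the last gap.
    boundaries = in_window + [cohort + career_ages]
    gaps = [later - earlier for earlier, later in zip(boundaries, boundaries[1:])]
    # A difference of d between two years leaves d - 1 years in between.
    return max(gaps) - 1

def is_dropout(years):
    return max_absence(years) >= DROPOUT_SIZE

print(max_absence([1990, 1991, 1993]), is_dropout([1990, 1991, 1993]))
print(max_absence([1990, 1996, 2003]), is_dropout([1990, 1996, 2003]))
```

In the first case the author is silent from 1994 to 2004 (11 years) and counts as a dropout; in the second case the longest silence is 6 years.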

#### 4.7. Dependent variable: Success

"Success" is a numerical dependent variable that measures the increase, after the early-career period, in the cumulative number of citations received by all publications published up to and including career age 15:

In [None]:
col_name_end = 'end_career_impact'
end_career_impact = publications_citations_no_uncited[(publications_citations_no_uncited.year_pub < publications_citations_no_uncited['cohort'] + CAREER_AGES) & (publications_citations_no_uncited.year_cit < publications_citations_no_uncited['cohort'] + CAREER_AGES)]
end_career_impact = end_career_impact.groupby('author')['id1'].count()
end_career_impact = end_career_impact.rename(col_name_end)
end_career_impact = end_career_impact.reset_index()

features = features.merge(end_career_impact, on='author', how='left')
features[col_name_end] = features[col_name_end].fillna(0)
features['success'] = features[col_name_end] - features[col_name_early]
features['success'] = features['success'].astype(int)
del features[col_name_end]
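
In other words, "Success" is the difference between two cumulative citation counts that use the same filter with different horizons. A minimal sketch for a single author with invented citation data, where `cumulative_citations` is an illustrative helper:

```python
CAREER_AGES = 15
WINDOW_SIZE = 3

def cumulative_citations(citations, cohort, horizon):
    """Count citations where both publication and citation fall within `horizon` career ages."""
    return sum(1 for year_pub, year_cit in citations
               if year_pub < cohort + horizon and year_cit < cohort + horizon)

cohort = 1990
# Invented (year_pub, year_cit) pairs for one author.
citations = [(1990, 1991), (1991, 1992), (1991, 1994), (1996, 2001), (2003, 2006)]

impact = cumulative_citations(citations, cohort, WINDOW_SIZE)
end_career_impact = cumulative_citations(citations, cohort, CAREER_AGES)
success = end_career_impact - impact

print(impact, end_career_impact, success)
```

The citation from 2006 lies outside the 15 career ages and is ignored, so this author has an impact of 2 and a success of 2.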

#### 4.8. Counts

Here, we construct the `counts` and `counts_first` dataframes:

In [None]:
start_years = get_start_years(COHORT_FIRST, COHORT_LAST, features)

In [None]:
# create publication and citation dataframes for first authors
author_year_numPub_first = publications_first_author.groupby(['author', 'year'])['pub_id'].count().reset_index()
author_year_numPub_first = author_year_numPub_first.rename(columns={'pub_id': 'num_pub'})
publications_citations_no_uncited_first = publications_citations_no_uncited.merge(publications_first_author[['author', 'pub_id']], how='inner')
citations_year_auth_first = publications_citations_no_uncited_first.groupby(['author', 'year_cit'])['id1'].count()
citations_year_auth_first = citations_year_auth_first.reset_index()
citations_year_auth_first = citations_year_auth_first.rename(columns={'id1': 'num_cit', 'year_cit': 'year'})

In [None]:
# construct temporary dataframes
temp_df = create_counts(features, authors_yearly_citations, authors_yearly_publications, start_years, CAREER_AGES)
temp_df_first = create_counts(features, citations_year_auth_first, author_year_numPub_first, start_years, CAREER_AGES)

In [None]:
# add window-based counts
counts = create_counts_win(temp_df, publications_citations_no_uncited, WINDOW_SIZE, start_years)
counts_first = create_counts_win(temp_df_first, publications_citations_no_uncited_first, WINDOW_SIZE, start_years, file_ext='_first')
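
The functions `create_counts` and `create_counts_win` are defined in the utilities file and are not reproduced here. To illustrate the kind of table they are described as producing (counts in each career age and cumulatively up to that career age), the following sketch builds such rows for one author from invented yearly counts. The function name and the keys `career_age`, `num`, and `cum_num` are hypothetical and may differ from the actual column names.

```python
from itertools import accumulate

CAREER_AGES = 15

def counts_by_career_age(yearly, cohort, career_ages=CAREER_AGES):
    """Yearly and cumulative counts for career ages 1..career_ages (hypothetical layout)."""
    # Years without an entry count as 0; years outside the study period are ignored.
    per_age = [yearly.get(cohort + age - 1, 0) for age in range(1, career_ages + 1)]
    cumulative = list(accumulate(per_age))
    return [{'career_age': age, 'num': n, 'cum_num': c}
            for age, (n, c) in enumerate(zip(per_age, cumulative), start=1)]

# Invented yearly publication counts for an author from the 1990 cohort.
rows = counts_by_career_age({1990: 2, 1992: 1, 2004: 3, 2005: 9}, cohort=1990)
print(rows[0], rows[2], rows[14])
```

The count for 2005 belongs to career age 16 and is therefore not part of the table.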

### 5. Save dataframes

Dataframes are saved into the 'data' directory:

In [None]:
features[features['cohort'] >= COHORT_FIRST].to_csv('../data/features.csv.gz', index=False, encoding='utf-8', compression='gzip')
counts.to_csv('../data/counts.csv.gz', index=False, encoding='utf-8', compression='gzip')
counts_first.to_csv('../data/counts_first.csv.gz', index=False, encoding='utf-8', compression='gzip')
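
Subsequent notebooks can read the compressed files directly, since `pd.read_csv` infers gzip compression from the '.gz' extension. The round trip can be checked with a small sketch that writes an invented toy dataframe to a temporary directory; the file name 'features_toy.csv.gz' is only used for this illustration.

```python
import os
import tempfile

import pandas as pd

# Invented toy dataframe with a subset of the feature columns.
toy = pd.DataFrame({'author': ['a1', 'a2'],
                    'cohort': [1990, 1995],
                    'dropout': [True, False]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'features_toy.csv.gz')
    toy.to_csv(path, index=False, encoding='utf-8', compression='gzip')
    restored = pd.read_csv(path)  # compression is inferred from the extension

print(restored)
```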