## Dataset Questions
Hey Russ,

I've been playing with the articles dataset and am confused by a couple of things. Primarily, there are articles in the dataset that:
1. Are duplicated.
2. Have the `article_id` set to something that doesn't make sense.

I know this dataset was cleaned by you somewhere else to make the concept-article network CSV, so this might not be particularly important, but I figured I'd send it to you and see if you had any idea what was going on.

#### Preliminaries

In [1]:
# load some packages
import pandas as pd

# configuration
DATA_PATH = 'datasets/concept_network/' # path to your data directory
ARTICLE_FILE = 'dimensions_2021_09_01_articles_category_for_2l_code_102.csv.gz' # Applied Mathematics articles

# load the dataset
articles_df = pd.read_csv(DATA_PATH+ARTICLE_FILE)
# throws an error when low_memory=False, I think bc of the year column in the three degenerate entries at the end

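
If the CSV has to be used directly, one possible workaround is to read every column as a string and convert `year` afterwards. This is only a minimal sketch on a made-up in-memory CSV, assuming the problem really is mixed types in the `year` column; `toy_csv`, `toy_df` and `bad_year_rows` are hypothetical names, not part of the real dataset.

```python
import io
import pandas as pd

# Hypothetical toy CSV standing in for the real file: the last row has
# abstract text where the id and the year should be.
toy_csv = io.StringIO(
    "article_id,year\n"
    "pub.1000000001,2019\n"
    "pub.1000000002,2020\n"
    "We study a class of equations,not a year\n"
)

# Read every column as a string so pandas does not have to guess dtypes.
toy_df = pd.read_csv(toy_csv, dtype=str)

# Convert year afterwards; values that are not numbers become NaN.
toy_df["year"] = pd.to_numeric(toy_df["year"], errors="coerce")

# Rows whose year could not be parsed are candidates for the degenerate entries.
bad_year_rows = toy_df[toy_df["year"].isna()]
print(bad_year_rows)
```

On the real file the same idea would flag any row whose `year` is not numeric, which may or may not be exactly the three degenerate entries.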


In [2]:
## Solution - Use a parquet file
ARTICLE_FILE = 'dimensions_2021_09_01_articles_category_for_2l_code_102.parquet.gz' # Applied Mathematics articles

# load the dataset
articles_df = pd.read_parquet(DATA_PATH+ARTICLE_FILE, engine='fastparquet')

#### Duplicated Entries
It looks like there are three entries that are duplicated a combined 120k times.

In [3]:
## Check whether rows are duplicated
num_rows = len(articles_df)
num_unique_rows = len(articles_df.drop_duplicates())
num_nonunique_rows = num_rows - num_unique_rows

f'There are {num_nonunique_rows} duplicated rows'

'There are 0 duplicated rows'

In [4]:
## Find duplicated rows
assert articles_df['article_id'].nunique() == num_unique_rows # unique article id iff unique row

unique_article_counts = articles_df.groupby(
        'article_id' # group rows by article id
    ).size() # count the rows in each group
unique_article_counts = unique_article_counts[unique_article_counts != 1] # drop ids that appear once; only repeated ids are left
unique_article_counts

Series([], dtype: int64)

In [5]:
## Count the number of duplicated rows
(unique_article_counts-1).sum() # subtract 1 bc row should (prolly) be counted once

0

Essentially, there are 3 articles that make up an extra 120k rows in the dataframe.

I don't 100% know how this affects our analysis (if at all after you provide the cleaned network data file) but, at the very least, it seems very memory inefficient.
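
For reference, the duplicate count can also be sketched with `DataFrame.duplicated`, shown here on a made-up frame rather than the real data. The names `toy_df`, `is_repeated`, `copies_per_article` and `num_extra_rows` are hypothetical, and the sketch assumes that a repeated `article_id` means a fully repeated row.

```python
import pandas as pd

# Hypothetical toy frame: one article appears three times, another once.
toy_df = pd.DataFrame({
    "article_id": ["pub.1", "pub.1", "pub.1", "pub.2"],
    "year": [2019, 2019, 2019, 2020],
})

# keep=False marks every copy of a repeated row, not just the later ones.
is_repeated = toy_df.duplicated(keep=False)

# Number of copies for each repeated article_id.
copies_per_article = toy_df.loc[is_repeated, "article_id"].value_counts()

# Extra rows = copies beyond the first one for each repeated article.
num_extra_rows = int((copies_per_article - 1).sum())

print(copies_per_article)
print(num_extra_rows)
```

The extra-row count should agree with `len(df) - len(df.drop_duplicates())`, which gives a quick cross-check of the two approaches.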

#### Weird Articles
There are 3 (unique) articles where the `article_id` column looks like the first part of the abstract and is inconsistent with the IDs of the other articles.

In [6]:
## find these weird entries
unique_articles_df = articles_df.drop_duplicates() # we don't want duplicates anymore

unique_articles_df[unique_articles_df['article_id'].str[:4] != 'pub.'] # all real articles start with 'pub.' then have a number. These 3 don't

Unnamed: 0,article_id,year,date,doi,volume,issue,pages,abstract_preferred,journal_title,metrics_times_cited


Again, I don't know what impact this would have (if any), but at the very least I assume it means these three articles aren't included in our network, since the abstract, which we parse for topics, is in the `article_id` column and not the `abstract_preferred` column.
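
A slightly stricter version of the ID check could test the whole string against a pattern instead of only the first four characters. This is a sketch on made-up values, assuming valid IDs are exactly `pub.` followed by digits (that format is inferred from the comment in the cell above, not confirmed); `toy_df`, `looks_valid` and `malformed_df` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frame: two well-formed ids, one abstract fragment, one missing id.
toy_df = pd.DataFrame({
    "article_id": ["pub.1000000001", "pub.1000000002", "We study a class of", None],
})

# Assumed format: 'pub.' followed only by digits. fullmatch requires the whole
# string to match, and na=False treats missing ids as invalid.
looks_valid = toy_df["article_id"].str.fullmatch(r"pub\.\d+", na=False)

# Everything that does not match the assumed format.
malformed_df = toy_df[~looks_valid]
print(malformed_df)
```

Unlike the `.str[:4]` comparison, this would also catch an ID that starts with `pub.` but continues with abstract text, as well as missing IDs.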