Data cleaning is definitely a "practice makes perfect" skill. Using this dataset of open-access article prices paid by the Wellcome Trust between 2012 and 2013:

1. Determine the five most common journals and the total articles for each. 
2. Next, calculate the mean, median, and standard deviation of the open-access cost per article for each journal.
3. For a real bonus round, identify the open-access prices paid by subject area.

You will need to do considerable data cleaning in order to extract accurate estimates, and may want to look into data encoding methods if you get stuck.
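
On the encoding hint: the costs in this file carry a "£" sign, and how that byte is decoded decides whether the CSV loads at all. The sketch below uses a tiny in-memory sample (made up for illustration, not taken from the Wellcome file) and assumes a single-byte encoding such as Latin-1; the exercise does not state the real file's encoding.

```python
import io
import pandas as pd

# Hypothetical one-row sample, encoded as Latin-1 so that '£' is the single byte 0xA3
raw = "journal,cost\nPLoS One,£1000.00\n".encode("latin-1")

# Decoding the same bytes as UTF-8 fails: 0xA3 is not a valid UTF-8 start byte
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Naming the matching encoding recovers the pound sign intact
sample = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(utf8_ok, sample.loc[0, "cost"])
```

If a load fails with a `UnicodeDecodeError`, trying a different `encoding=` argument is usually the first thing to check.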

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
columns = ['PMID/PMCID', 'publisher', 'journal', 'article', 'cost']
df = pd.read_csv('WELLCOME/WELLCOME_APCspend2013_forThinkful.csv', encoding = 'unicode_escape', 
                 header=0, names=columns)


In [3]:
print(df.head())
print(df.info())
print(df.describe())

              PMID/PMCID publisher                 journal  \
0                    NaN       CUP  Psychological Medicine   
1             PMC3679557       ACS       Biomacromolecules   
2  23043264  PMC3506128        ACS              J Med Chem   
3    23438330 PMC3646402       ACS              J Med Chem   
4   23438216 PMC3601604        ACS              J Org Chem   

                                             article      cost  
0  Reduced parahippocampal cortical thickness in ...     £0.00  
1  Structural characterization of a Model Gram-ne...  £2381.04  
2  Fumaroylamino-4,5-epoxymorphinans and related ...   £642.56  
3  Orvinols with mixed kappa/mu opioid receptor a...   £669.64  
4  Regioselective opening of myo-inositol orthoes...   £685.88  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2127 entries, 0 to 2126
Data columns (total 5 columns):
PMID/PMCID    1928 non-null object
publisher     2127 non-null object
journal       2126 non-null object
article       2127 non-nul

In [4]:
# Look at the 15 most frequent journals to see which names need standardizing
print(df.journal.value_counts().head(15))

PLoS One                                           92
PLoS ONE                                           62
Journal of Biological Chemistry                    48
Nucleic Acids Research                             21
Proceedings of the National Academy of Sciences    19
PLoS Neglected Tropical Diseases                   18
Human Molecular Genetics                           18
Nature Communications                              17
PLoS Genetics                                      15
PLoS Pathogens                                     15
Neuroimage                                         15
Brain                                              14
BMC Public Health                                  14
NeuroImage                                         14
PLOS ONE                                           14
Name: journal, dtype: int64
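
The counts above split single journals across several spellings ("PLoS One", "PLoS ONE", "PLOS ONE"; "Neuroimage", "NeuroImage"). A small sketch, using made-up names rather than the real column, of how lower-casing and trimming merge such variants before counting:

```python
import pandas as pd

# Hypothetical journal names that differ only in case or stray whitespace
names = pd.Series(["PLoS One", "PLoS ONE", "PLOS ONE ", "Neuroimage", "NeuroImage"])

raw_counts = names.value_counts()
clean_counts = names.str.lower().str.strip().value_counts()

print(len(raw_counts), len(clean_counts))
print(clean_counts)
```

Case and whitespace account for only part of the variation; abbreviations and misspellings still need explicit rules.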


In [5]:
# Check which currencies the prices are in: every cost is in either pounds or dollars
print(df[df.cost.str.find('$') != -1].cost)
print(df[df.cost.str.find('£') == -1].cost)

178        1674$
179      1375.8$
180        2010$
181     1204.38$
182      1254.6$
183        1476$
337      671.04$
1599       1440$
1600     1460.3$
1601    1476.47$
1602    1570.87$
1603    1600.25$
1604    1600.25$
Name: cost, dtype: object
178        1674$
179      1375.8$
180        2010$
181     1204.38$
182      1254.6$
183        1476$
337      671.04$
1599       1440$
1600     1460.3$
1601    1476.47$
1602    1570.87$
1603    1600.25$
1604    1600.25$
Name: cost, dtype: object


In [6]:
# Assume a conversion rate of 0.76 pounds per dollar
# Pound amounts start with '£': strip the first character
# Dollar amounts end with '$': strip the last character and convert to pounds

df.cost = df.cost.apply(lambda x: float(x[1:]) if not x[0].isdigit() else float(x[:-1]) * 0.76)
print(df.cost.iloc[[1603, 1604]])

1603    1216.19
1604    1216.19
Name: cost, dtype: float64
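
The lambda above relies on every cost being either "£" followed by a number or a number followed by "$". A stricter variant, sketched below, fails loudly on any other format instead of silently mis-parsing it. The function name `parse_cost` and the two regular expressions are illustrative assumptions, not part of the original notebook; the 0.76 rate is the same assumption as in the cell above.

```python
import re

USD_TO_GBP = 0.76  # assumed rate, same as the cell above

POUNDS = re.compile(r"£(\d+(?:\.\d+)?)")    # e.g. '£2381.04'
DOLLARS = re.compile(r"(\d+(?:\.\d+)?)\$")  # e.g. '1600.25$'

def parse_cost(raw, usd_to_gbp=USD_TO_GBP):
    """Return the cost in pounds; raise ValueError for an unrecognized format."""
    text = raw.strip()
    pounds = POUNDS.fullmatch(text)
    if pounds:
        return float(pounds.group(1))
    dollars = DOLLARS.fullmatch(text)
    if dollars:
        return float(dollars.group(1)) * usd_to_gbp
    raise ValueError(f"unrecognized cost format: {raw!r}")

print(parse_cost("£2381.04"), round(parse_cost("1600.25$"), 2))
```

Applied to the raw string column, this would be `df.cost.apply(parse_cost)` in place of the lambda.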


In [7]:
# Convert journal names to lower case, trim surrounding whitespace, and drop trailing periods
df.journal = df.journal.str.lower()
df.journal = df.journal.str.strip()
df.journal = df.journal.str.rstrip('.')

# Convert articles to lower case as well
df.article = df.article.str.lower()
df.article = df.article.str.strip()
df.article = df.article.str.rstrip('.')

df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2127 entries, 0 to 2126
Data columns (total 5 columns):
PMID/PMCID    1928 non-null object
publisher     2127 non-null object
journal       2126 non-null object
article       2127 non-null object
cost          2127 non-null float64
dtypes: float64(1), object(4)
memory usage: 83.2+ KB


In [8]:
# Remove every occurrence of "the " (this is not limited to the start of the name)
df.journal = df.journal.str.replace('the ', '')
# Get rid of double spaces
df.journal = df.journal.str.replace('  ', ' ')

In [9]:
# Lots of different abbreviations for american journal of
df.journal = df.journal.str.replace('americal', 'american')
df.journal = df.journal.str.replace('american', 'am')

# Standardize the journal abbreviation
df.journal = df.journal.str.replace('joural', 'journal')
df.journal = df.journal.str.replace('jounal', 'journal')
df.journal = df.journal.str.replace('journal of', 'j')
df.journal = df.journal.str.replace('journal for', 'j')
df.journal = df.journal.str.replace('jnl of', 'j')
df.journal = df.journal.str.replace('journal', 'j')
df.journal = df.journal.str.replace('jnl', 'j')
df.journal = df.journal.str.replace('j.', 'j', regex=False)
df.journal = df.journal.str.replace('the j', 'j')
df.journal = df.journal.str.replace('jounral', 'j')
df.journal = df.journal.str.replace('js', 'j')

# Replace '&' with 'and'
df.journal = df.journal.str.replace('&', 'and')

In [11]:
# Misspellings
df.journal = df.journal.str.replace('bioohysica', 'biophysica')
df.journal = df.journal.str.replace('clinicla', 'clinical')
df.journal = df.journal.str.replace('heath', 'health')
df.journal = df.journal.str.replace('behaviour', 'behavior')
df.journal = df.journal.str.replace('neuropathol$ ', 'neuropathologica ', regex=True)
df.journal = df.journal.str.replace('brt', 'british')
df.journal = df.journal.str.replace('angewande', 'angewandte')
df.journal = df.journal.str.replace('opthalmology', 'ophthalmology')
df.journal = df.journal.str.replace('infect dis', 'infectious diseases')
df.journal = df.journal.str.replace('sci$', 'science', regex=True)
df.journal = df.journal.str.replace('agfents', 'agents')
df.journal = df.journal.str.replace('biinformatics', 'bioinformatics')
df.journal = df.journal.str.replace('britsh$', 'british', regex=True)
df.journal = df.journal.str.replace('child: care, health development', 'child: care, health and development')
df.journal = df.journal.str.replace('epigentics', 'epigenetics')
df.journal = df.journal.str.replace('psychiatty', 'psychiatry')
df.journal = df.journal.str.replace('epidemiol$', 'epidemiology', regex=True)
df.journal = df.journal.str.replace('epidemology', 'epidemiology')
df.journal = df.journal.str.replace('immunol$', 'immunology', regex=True)
df.journal = df.journal.str.replace('heptology', 'hepatology')
df.journal = df.journal.str.replace('biophysical', 'biophysical')
df.journal = df.journal.str.replace('experiements', 'experiments')
df.journal = df.journal.str.replace('proceddings', 'proceedings')

# public library of science
df.journal = df.journal.str.replace('public library of science', 'plos')
df.journal = df.journal.str.replace('plos 1', 'plos one')
df.journal = df.journal.str.replace('plosone', 'plos one')

In [12]:
# Different abbreviations for biology
df.journal = df.journal.str.replace('biology', 'biol')



# International journal
df.journal = df.journal.str.replace('inyernational', 'international')  # misspelling
df.journal = df.journal.str.replace('international j for', 'int j')
df.journal = df.journal.str.replace('international', 'int')
df.journal = df.journal.str.replace('the int j', 'int j')



In [13]:
# angewandte chemie international edition magazines
df.journal = df.journal.str.replace('angewandte chemie int edition', 'angewandte chemie')
df.journal = df.journal.str.replace('angew chems int ed', 'angewandte chemie')

# abbreviations for tropical medicine and hygiene
df.journal = df.journal.str.replace('trop med hyg', 'tropical medicine and hygiene')

In [15]:
# clean up the acta crystallographica
df.journal = df.journal.str.replace('acta crystallographica section d.*', 'acta crystallographica section d', regex=True)
df.journal = df.journal.str.replace('acta crystallographica, section d.*', 'acta crystallographica section d', regex=True)
#df.journal = df.journal.str.replace('acta d', 'acta crystallographica section d')
df.journal = df.journal.str.replace('acta crystallography d', 'acta crystallographica section d')
df.journal = df.journal.str.replace('acta crystallographica section f:.*', 'acta crystallographica section f', regex=True)

# clean up embo
df.journal = df.journal.str.replace('embo.*', 'embo', regex=True)
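
Patterns such as 'embo.*' only work when `str.replace` treats them as regular expressions. Older pandas releases did so by default; pandas 2.0 changed the default to `regex=False`, so it is safer to state the intent explicitly. A short sketch with made-up names shows the difference:

```python
import pandas as pd

# Hypothetical journal names for illustration
names = pd.Series(["embo j", "embo reports", "brain"])

as_regex = names.str.replace("embo.*", "embo", regex=True)     # '.*' consumes the rest of the name
as_literal = names.str.replace("embo.*", "embo", regex=False)  # looks for the literal text 'embo.*'

print(as_regex.tolist())    # ['embo', 'embo', 'brain']
print(as_literal.tolist())  # ['embo j', 'embo reports', 'brain']
```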


In [17]:
df.journal = df.journal.str.replace('curr biol', 'current biol')
df.journal = df.journal.str.replace('current opinions in', 'current opinion in')

# Clean up 'dev world bioeth'
df.journal = df.journal.str.replace(r'dev\.', 'dev', regex=True)
df.journal = df.journal.str.replace('bioeth$', 'bioethics', regex=True)
df.journal = df.journal.str.replace('developing', 'dev')

df.journal = df.journal.str.replace('development science', 'developmental science')

df.journal = df.journal.str.replace('blood j 2012', 'blood')

df.journal = df.journal.str.replace('scientific reports.*', 'scientific reports', regex=True)
df.journal = df.journal.str.replace('studies in history and philosophy of science part c.*', 'studies in history and philosophy of science part c', regex=True)
df.journal = df.journal.str.replace('virology j', 'virology')
df.journal = df.journal.str.replace('trop med int health', 'tropical medicine and int health')
df.journal = df.journal.str.replace('chem$', 'chemistry', regex=True)
df.journal = df.journal.str.replace('j biol chemistry', 'j biological chemistry')
df.journal = df.journal.str.replace('acta neuropathol$', 'acta neuropathologica', regex=True)
df.journal = df.journal.str.replace('human mol genetics', 'human molecular genetics')
df.journal = df.journal.str.replace('visulaized', 'visualized')
df.journal = df.journal.str.replace('sex transm infect', 'sexually transmitted infections')
df.journal = df.journal.str.replace('trends in neuroscience$', 'trends in neurosciences', regex=True)
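
The long chain of replacements above depends on order: 'chem$' has to run before 'j biol chemistry' can match, for example. One way to make that ordering easier to see is to keep the rules in a list and apply them in a loop. The sketch below is an illustrative restructuring with three rules taken from the cells above; the names `RULES` and `normalize` are assumptions, not part of the original notebook.

```python
import pandas as pd

# (pattern, replacement, is_regex) triples, applied top to bottom
RULES = [
    ("journal of", "j", False),
    ("chem$", "chemistry", True),  # regex: only a trailing 'chem'
    ("j biol chemistry", "j biological chemistry", False),
]

def normalize(series, rules=RULES):
    """Apply each replacement rule in order and return the cleaned series."""
    for pattern, replacement, is_regex in rules:
        series = series.str.replace(pattern, replacement, regex=is_regex)
    return series

# Hypothetical inputs: two spellings of the same journal
names = pd.Series(["journal of biological chemistry", "j biol chem"])
cleaned = normalize(names)
print(cleaned.tolist())
```

Both spellings end up as 'j biological chemistry', and each new rule is a one-line addition to the list.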

In [19]:
grouped = df.groupby('journal').count()

### 5 Most used journals and number of articles for each

In [20]:
# 5 most used journals and number of articles for each
grouped.cost.nlargest(5)

journal
plos one                  208
j biological chemistry     71
neuroimage                 29
nucleic acids research     26
plos genetics              24
Name: cost, dtype: int64
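
The same ranking can also be read straight off the `journal` column with `value_counts`. The two approaches count slightly different things: `value_counts` counts rows per journal, while `groupby(...).cost.count()` counts only rows whose cost is not missing. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical data: one missing cost and one missing journal name
toy = pd.DataFrame({
    "journal": ["plos one", "plos one", "plos one", "neuroimage", "neuroimage", "brain", None],
    "cost": [900.0, 950.0, None, 2300.0, 2200.0, 1500.0, 1000.0],
})

by_rows = toy["journal"].value_counts().head(2)                # rows per journal
by_costs = toy.groupby("journal")["cost"].count().nlargest(2)  # non-missing costs per journal

print(by_rows)
print(by_costs)
```

At this point in the notebook no cost is missing yet, so both approaches would give the same numbers.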

In [21]:
grouped = df.groupby('journal').cost.agg(['mean', 'median', 'std'])

In [22]:
print((grouped.sort_values(by='mean', ascending=False)).head(15))

                                            mean      median            std
journal                                                                    
pone-d12-17947                        999999.000  999999.000            NaN
molecluar and cellular endocrinology  999999.000  999999.000            NaN
experimental cell research            999999.000  999999.000            NaN
expert reviews in molecular medicine  999999.000  999999.000            NaN
j paediatric urology                  999999.000  999999.000            NaN
frontiers in cognition                999999.000  999999.000            NaN
oxford university press               999999.000  999999.000            NaN
qualitative research                  999999.000  999999.000            NaN
hbm j human brain mapping             999999.000  999999.000            NaN
pmedicine-d-12-03130                  999999.000  999999.000            NaN
genetics in medicine                  501499.500  501499.500  704984.753736
veterinary p

In [23]:
# Looks like 999999 is used as a placeholder for a missing cost
df[df['cost']==999999].head(5)

     PMID/PMCID                   publisher                               journal  \
149  PMC3234811                       ASBMB                j biological chemistry   
227     3708772              BioMed Central                          bmc genomics   
277  PMC3668259                         BMC                                trials   
358  PMC3219211  Cambridge University Press  expert reviews in molecular medicine   
404  PMC3533396       Company of Biologists                        j cell science   

                                               article      cost  
149  picomolar nitric oxide signals from central ne...  999999.0  
227  phenotypic, genomic, and transcriptional chara...  999999.0  
277  community resource centres to improve the heal...  999999.0  
358  pharmacological targets in the ubiquitin syste...  999999.0  
404  pka isoforms coordinate mrna fate during nutri...  999999.0  


In [24]:
# Replace the 999999 placeholder values with NaN
df['cost'] = df['cost'].mask(df['cost'] > 999000)
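
Leaving the 999999 placeholder in the column would inflate every statistic computed from it, whereas pandas skips NaN in aggregations by default. A minimal sketch on made-up costs, using `Series.replace` as one of several equivalent ways to do the substitution:

```python
import numpy as np
import pandas as pd

# Hypothetical costs with one placeholder value
costs = pd.Series([1000.0, 999999.0, 2000.0])
cleaned = costs.replace(999999.0, np.nan)

print(costs.mean())    # 334333.0, dominated by the placeholder
print(cleaned.mean())  # 1500.0, NaN is ignored
```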

In [25]:
#df["cost"] = df.groupby("journal").transform(lambda x: x.fillna(x.mean()))['cost'] 
grouped = df.groupby('journal').cost.agg(['max', 'min', 'count'])

In [26]:
print(grouped.sort_values('max', ascending=False).head())

                            max      min  count
journal                                        
movement disorders     201024.0  1005.00     15
plos one               192645.0   122.31    200
public service review    6000.0  6000.00      1
lancet neurology         5760.0  4320.00      2
lancet                   4800.0   838.35      5
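
Sorting by the maximum surfaces the two extreme values here, but a relative check can flag suspicious prices more systematically. The sketch below marks any cost more than ten times its journal's median; both the factor of ten and the toy data are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical costs: one value looks like a misplaced decimal point
toy = pd.DataFrame({
    "journal": ["movement disorders"] * 4 + ["lancet"] * 3,
    "cost": [2010.24, 2010.24, 201024.0, 1005.0, 4800.0, 838.35, 2400.0],
})

# Per-row median of the row's own journal
median = toy.groupby("journal")["cost"].transform("median")
toy["suspect"] = toy["cost"] > 10 * median

print(toy[toy["suspect"]])
```

Flagged rows still need a manual look, as in the cells that follow; the check only narrows down where to look.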


In [27]:
# Look into why movement disorders is so expensive - appears one price is off
print(df[df['journal']=='movement disorders'])

                      PMID/PMCID publisher             journal  \
1975                 PMC3739940      Wiley  movement disorders   
1976                  PMC3660780     Wiley  movement disorders   
1977              PMC3633239\n\n     Wiley  movement disorders   
1978              PMC3664413\n\n     Wiley  movement disorders   
1979              PMC3664414\n\n     Wiley  movement disorders   
1980              PMC3664415\n\n     Wiley  movement disorders   
1981              PMC3664426\n\n     Wiley  movement disorders   
1982              PMC3664430\n\n     Wiley  movement disorders   
1983              PMC3672686\n\n     Wiley  movement disorders   
1984                  PMC3748791     Wiley  movement disorders   
1985                         NaN     Wiley  movement disorders   
1986                         NaN     Wiley  movement disorders   
1987              PMC3664409\n\n     Wiley  movement disorders   
1988                 PMC3739929      Wiley  movement disorders   
1989  Pub 

In [28]:
# We find that:
#    1987  Limb amputations in fixed dystonia: a form of ...  201024.00  
# This price is likely a typo considering how many of them are listed at 2010.24
# Fix the typo

df.iloc[1987, df.columns.get_loc('cost')] = df.iloc[1987, df.columns.get_loc('cost')]/100
print(df.iloc[1987])

PMID/PMCID                                       PMC3664409\n\n
publisher                                                 Wiley
journal                                      movement disorders
article       limb amputations in fixed dystonia: a form of ...
cost                                                    2010.24
Name: 1987, dtype: object


In [29]:
# Look into PLOS expensive article
print(df[df['journal']=='plos one'].sort_values('cost', ascending=False).head())

             PMID/PMCID                  publisher   journal  \
1470            3547931  Public Library of Science  plos one   
1468  PMCID: PMC3617094  Public Library of Science  plos one   
1467  PMCID: PMC3649981  Public Library of Science  plos one   
1466           23853603  Public Library of Science  plos one   
1465           23991236  Public Library of Science  plos one   

                                                article       cost  
1470  reducing stock-outs of life saving malaria com...  192645.00  
1468  functional il6r 368ala allele impairs classica...    1785.36  
1467  duplication and retention biases of essential ...    1775.50  
1466  mosaic vsg's and the sclae of trypanosoma bruc...    1745.00  
1465  in vivo imaging of trypanosome brain interacti...    1692.00  


In [30]:
# It is less clear if this value is a result of a decimal place typo, so convert it to NaN
df.iloc[1470, df.columns.get_loc('cost')] = float('NaN')
print(df.iloc[1470])

PMID/PMCID                                              3547931
publisher                             Public Library of Science
journal                                                plos one
article       reducing stock-outs of life saving malaria com...
cost                                                        NaN
Name: 1470, dtype: object


### Mean, median, and std of cost of different journal articles

In [31]:
# Sorted by number of articles, largest first
grouped = df.groupby('journal').cost.agg(['mean', 'median', 'std', 'count'])
print(grouped.sort_values('count', ascending=False).head())

                               mean   median         std  count
journal                                                        
plos one                 935.577236   896.99  194.654385    199
j biological chemistry  1385.794638  1314.53  390.485278     69
neuroimage              2215.168276  2326.43  266.653947     29
nucleic acids research  1149.000000   852.00  442.940447     26
plos genetics           1643.110909  1712.73  153.366825     22


In [32]:
# Sorted by most expensive articles
grouped_expensive = df.groupby('journal').cost.agg(['mean', 'median', 'std', 'count'])
print(grouped_expensive.sort_values('mean', ascending=False).head())

                          mean   median          std  count
journal                                                    
public service review  6000.00  6000.00          NaN      1
lancet neurology       5040.00  5040.00  1018.233765      2
cell j                 4041.05  4041.05          NaN      1
cell host and microbe  4032.46  4032.46   273.763461      2
immunity               3934.75  3934.75   190.791552      2
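
The NaN entries in the `std` column are expected: pandas computes the sample standard deviation (`ddof=1`), which is undefined for a journal with a single article. The sketch below shows this, and reproduces the lancet neurology figure from the two prices listed for that journal earlier (4320 and 5760).

```python
import pandas as pd

single = pd.Series([6000.0])        # one article, as for public service review
pair = pd.Series([4320.0, 5760.0])  # min and max listed for lancet neurology

print(single.std())        # NaN: sample std needs at least two values
print(single.std(ddof=0))  # 0.0: population std is defined for one value
print(round(pair.std(), 6))
```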


### Price grouped by subject

In [33]:
# Group the subjects
categories = {'cell': 'cell',
              'health': 'health',
              'clinical': 'health',
              'nutrition': 'health',
              'arthritis': 'health',
              'depression': 'health',
              'diabe': 'health',
              'derma': 'health',
              'epilep': 'health',
              'obesity': 'health',
              'preventive medicine': 'health',
              'gene': 'genetic',
              'genom': 'genetic',
              'bio': 'bio',
              'neuro': 'brain',
              'brain': 'brain',
              'cereb': 'brain',
              'cortex': 'brain',
              'hippocampus': 'brain',
              'parasit': 'parasite',
              'vir': 'viruses, diseases, and pathogens',
              'pathogen': 'viruses, diseases, and pathogens',
              'movement': 'health',
              'nature': 'nature',
              'paediatric': 'obstetrics and pediatric',
              'child': 'obstetrics and pediatric',
              'hepa': 'hepa',
              'chem': 'chem',
              'psychia': 'psychiatry and psychology',
              'psycho': 'psychiatry and psychology',
              'aids': 'viruses, diseases, and pathogens',
              'hiv': 'viruses, diseases, and pathogens',
              'malaria': 'viruses, diseases, and pathogens',
              'disease': 'viruses, diseases, and pathogens',
              'infection': 'viruses, diseases, and pathogens',
              'pathol': 'viruses, diseases, and pathogens',
              'pharm': 'medicine',
              'medicine': 'medicine',
              'blood': 'blood',
              'hema': 'blood',
              'nucleic acids': 'cell',
              'animal': 'nature',
              'immunolo': 'immunology',
              'addiction': 'health',
              'drug and alcohol': 'health',
              'alcohol': 'health',
              'history': 'history',
              'rna': 'cell',
              'crystal': 'crystal',
              'haematologica': 'blood',
              'epidemiology': 'viruses, diseases, and pathogens',
              'fertility': 'obstetrics and pediatric',
              'fetus': 'obstetrics and pediatric',
              'fetal': 'obstetrics and pediatric',
              'obstet': 'obstetrics and pediatric',
              'endocrin': 'endocrinology',
              'cardio': 'health',
              'behavior research': 'psychiatry and psychology',
              'chronic illness': 'health',
              'cognition': 'brain', 
              'allergy': 'health',
              'neural': 'brain'}

In [34]:
df['subject'] = 'other'
# If the article title contains a keyword, assign that keyword's subject;
# keys later in `categories` overwrite matches from earlier keys
for key, value in categories.items():
    df['subject'] = np.where(df['article'].str.contains(key, na=False), value, df['subject'])

# Repeat for the journal name. This runs second so that journal-based
# categories take precedence over title-based ones
for key, value in categories.items():
    df['subject'] = np.where(df['journal'].str.contains(key, na=False), value, df['subject'])
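
Because the keys are matched as plain substrings, a short key can fire inside an unrelated word, and whichever matching key comes last in `categories` wins. The sketch below uses made-up titles to show the effect for the key 'vir', along with one possible tightening using a word-boundary anchor. The anchor is an assumption for illustration, not what the notebook does.

```python
import pandas as pd

# Hypothetical journal names
titles = pd.Series(["environmental health perspectives", "virology"])

loose = titles.str.contains("vir")                   # plain substring match
anchored = titles.str.contains(r"\bvir", regex=True)  # 'vir' only at the start of a word

print(loose.tolist())     # [True, True]: 'environmental' contains 'vir'
print(anchored.tolist())  # [False, True]
```

With the loose match, the first title would be labelled 'viruses, diseases, and pathogens' rather than 'health', since 'vir' comes after 'health' in the dictionary.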

In [35]:
df['subject'].value_counts()

viruses, diseases, and pathogens    331
other                               258
brain                               255
cell                                194
chem                                181
bio                                 181
health                              162
genetic                             149
medicine                            111
psychiatry and psychology            72
immunology                           33
obstetrics and pediatric             31
history                              30
endocrinology                        30
blood                                28
nature                               28
parasite                             24
hepa                                 16
crystal                              13
Name: subject, dtype: int64

In [36]:
others = df[df['subject']=='other']
group_others = others.groupby('journal').cost.agg(['count']).sort_values('count', ascending=False)
#print(group_others)


In [37]:
subject_price = df.groupby('subject').cost.agg(['mean', 'count']).sort_values('mean', ascending=False)

In [38]:
print(subject_price)

                                         mean  count
subject                                             
nature                            2658.186667     27
psychiatry and psychology         2240.165833     72
immunology                        2093.797879     33
endocrinology                     2021.697931     29
medicine                          1943.641038    106
brain                             1943.585685    248
cell                              1901.615319    188
bio                               1887.538947    180
hepa                              1877.372500     16
parasite                          1826.238636     22
health                            1794.164074    162
viruses, diseases, and pathogens  1762.223189    325
genetic                           1750.571261    145
obstetrics and pediatric          1741.887931     29
history                           1709.245517     29
other                             1649.332440    250
chem                              1640.613933 