To complete this challenge, determine the five most common journals and the total articles for each. Next, calculate the mean, median, and standard deviation of the open-access cost per article for each journal.

You will need to do considerable data cleaning in order to extract accurate estimates. You may want to look into data encoding methods if you get stuck. For a real bonus round, identify the open access prices paid by subject area.

Remember not to modify the data directly. Instead, write a cleaning script that will load the raw data and whip it into shape. Jupyter notebooks are a great format for this. Keep a record of your decisions: well-commented code is a must for recording your data cleaning decision-making process. Submit a link to your script and results below and discuss them with your mentor at your next session.

Data file at https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/WELLCOME/WELLCOME_APCspend2013_forThinkful.csv


In [171]:
#importing modules 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

In [101]:
#read data into notebook
url = r"https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/WELLCOME/WELLCOME_APCspend2013_forThinkful.csv"
local = r"C:\Users\Chris\Documents\thinkful\data_sets\WELLCOME_APCspend2013_forThinkful.csv"
data_df_raw = pd.read_csv(url, encoding='latin_1')

data_df_raw.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [102]:
#save copy of data (.copy() so later edits do not touch data_df_raw)
data_df = data_df_raw.copy()
data_df.shape

(2127, 5)
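
Note on saving a copy: in pandas, plain assignment only binds a second name to the same DataFrame, while .copy() creates an independent one. A minimal sketch of the difference, using a made-up two-row frame in place of data_df_raw (the names raw, alias and snapshot are illustrative only):

```python
import pandas as pd

# Toy frame standing in for data_df_raw (values are made up for illustration).
raw = pd.DataFrame({"Journal title": ["J Med Chem", "PLoS One"]})

alias = raw            # second name for the same object
snapshot = raw.copy()  # independent copy

alias.columns = ["journ"]  # renaming through the alias also renames raw

print(alias is raw)            # True
print(list(raw.columns))       # ['journ']
print(list(snapshot.columns))  # ['Journal title']
```

This matters here because the column rename further down would otherwise change the raw frame as well.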

In [87]:
#check data type
data_df.dtypes

PMid     object
pub      object
journ    object
art      object
cost     object
dtype: object

In [103]:
#rename columns
data_df.columns = ['PMid', 'pub', 'journ', 'art', 'cost']
data_df.head()

Unnamed: 0,PMid,pub,journ,art,cost
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [194]:
#sort by article names to check for near duplicates
data_df.sort_values(by='art').head()

Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid,cost_n
1231,Pmid:24048963 (Epub Sept 2013),Oxford University Press,Journal Of Infectious Diseases,Persistent Endothelial Activation After Plas...,£2841.60,24048963.0,,2841.6
1729,Pmc3536945,Springer,Brain Topography,"""A Novel Method For Reducing The Effect Of Ton...",£1889.90,,3536945.0,1889.9
1026,,Nature,Biosocieties,"""Creating The 'Ethics Industry': Mary Warnock,...",£1800.00,,,1800.0
1287,Pmc3547059,Plos,Plos One,"""Involvement Of Ephb1 Receptors Signalling In ...",£1023.41,,3547059.0,1023.41
1485,3569446,Public Library Of Science,Plos One,"""The Words Will Pass With The Blowing Wind""; S...",£825.68,,3569446.0,825.68


In [117]:
data_df['journ'].head()

0    Psychological Medicine
1         Biomacromolecules
2                J Med Chem
3                J Med Chem
4                J Org Chem
Name: journ, dtype: object

In [111]:
#fails: applymap expects a function, but str.replace(r'\n', '') is called right away with too few arguments
data_df = data_df.applymap(str.replace(r'\n', ''))

TypeError: replace() takes at least 2 arguments (1 given)
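
One way to strip newlines from every column in a single step is to pass a function to DataFrame.apply and use the .str accessor inside it. This is only a sketch on a made-up frame, and it assumes every column holds strings, as the dtypes output suggests for this data:

```python
import pandas as pd

# Toy frame with embedded newlines (made-up values).
df = pd.DataFrame({"journ": ["J Med\n Chem"], "art": ["Some\n title"]})

# apply() receives each column as a Series; regex=False treats "\n" as a
# literal newline character rather than a pattern.
cleaned = df.apply(lambda col: col.str.replace("\n", "", regex=False))

print(cleaned.loc[0, "journ"])  # J Med Chem
print(cleaned.loc[0, "art"])    # Some title
```

The difference from the failing cell is that a callable is handed over, instead of the result of calling str.replace.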

In [113]:
#remove newline chars (literal '\n' with regex=False behaves the same across pandas versions)
data_df.journ = data_df.journ.str.replace('\n', '', regex=False)
data_df.PMid = data_df.PMid.str.replace('\n', '', regex=False)
data_df.pub = data_df.pub.str.replace('\n', '', regex=False)
data_df.art = data_df.art.str.replace('\n', '', regex=False)
data_df.cost = data_df.cost.str.replace('\n', '', regex=False)

In [114]:
#title-case every text column so that case variants compare equal
data_df.journ = data_df.journ.str.title()
data_df.PMid = data_df.PMid.str.title()
data_df.pub = data_df.pub.str.title()
data_df.art = data_df.art.str.title()
data_df.cost = data_df.cost.str.title()

In [193]:
data_df.head(20)

Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid,cost_n
0,,Cup,Psychological Medicine,Reduced Parahippocampal Cortical Thickness In ...,£0.00,,,0.0
1,Pmc3679557,Acs,Biomacromolecules,Structural Characterization Of A Model Gram-Ne...,£2381.04,,3679557.0,2381.04
2,23043264 Pmc3506128,Acs,J Med Chem,"Fumaroylamino-4,5-Epoxymorphinans And Related ...",£642.56,23043264.0,3506128.0,642.56
3,23438330 Pmc3646402,Acs,J Med Chem,Orvinols With Mixed Kappa/Mu Opioid Receptor A...,£669.64,23438330.0,3646402.0,669.64
4,23438216 Pmc3601604,Acs,J Org Chem,Regioselective Opening Of Myo-Inositol Orthoes...,£685.88,23438216.0,3601604.0,685.88
5,Pmc3579457,Acs,Journal Of Medicinal Chemistry,Comparative Structural And Functional Studies ...,£2392.20,,3579457.0,2392.2
6,Pmc3709265,Acs,Journal Of Proteome Research,Mapping Proteolytic Processing In The Secretom...,£2367.95,,3709265.0,2367.95
7,23057412 Pmc3495574,Acs,Mol Pharm,Quantitative Silencing Of Egfp Reporter Gene B...,£649.33,23057412.0,3495574.0,649.33
8,Pmcid: Pmc3780468,Acs (Amercian Chemical Society) Publications,Acs Chemical Biology,A Novel Allosteric Inhibitor Of The Uridine Di...,£1294.59,,3780468.0,1294.59
9,Pmcid: Pmc3621575,Acs (Amercian Chemical Society) Publications,Acs Chemical Biology,Chemical Proteomic Analysis Reveals The Drugab...,£1294.78,,3621575.0,1294.78


In [190]:
#check for near duplicate journal names
data_df['journ'].unique()

array(['Psychological Medicine', 'Biomacromolecules', 'J Med Chem',
       'J Org Chem', 'Journal Of Medicinal Chemistry',
       'Journal Of Proteome Research', 'Mol Pharm',
       'Acs Chemical Biology',
       'Journal Of Chemical Information And Modeling', 'Biochemistry',
       'Gastroenterology', 'Journal Of Biological Chemistry',
       'Journal Of Immunology', 'Acs Chemical Neuroscience', 'Acs Nano',
       'American Chemical Society', 'Analytical Chemistry',
       'Bioconjugate Chemistry', 'Journal Of Medicinal Chemistry ',
       'Journal Of The American Chemical Society', 'Chest',
       'Journal Of Neurophysiology', 'Journal Of Physiology',
       'The Journal Of Neurophysiology', 'American Journal Of Psychiatry',
       'Americal Journal Of Psychiatry', 'Behavioral Neuroscience',
       'Emotion', 'Health Psychology', 'Journal Of Abnormal Psychology',
       'Journal Of Consulting And Clinical Psychology',
       'Journal Of Experimental Psychology:  Animal Behaviour Proc
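
The array above shows three kinds of near duplicates: trailing or doubled spaces ('Journal Of Medicinal Chemistry '), abbreviations ('J Med Chem'), and misspellings or a leading 'The'. A possible way to collapse them is sketched below. The ALIASES table and the clean_journal helper are hypothetical; the table lists only a few variants visible above, and treating 'J Med Chem' as the Journal Of Medicinal Chemistry is an assumption.

```python
import re

# Hypothetical alias table: keys are variants seen in the unique() output,
# values are the spelling chosen as canonical (an assumption).
ALIASES = {
    "J Med Chem": "Journal Of Medicinal Chemistry",
    "The Journal Of Neurophysiology": "Journal Of Neurophysiology",
    "Americal Journal Of Psychiatry": "American Journal Of Psychiatry",
}

def clean_journal(name):
    """Collapse whitespace, then map known variants to one spelling."""
    if not isinstance(name, str):  # leave missing values (NaN) untouched
        return name
    tidy = re.sub(r"\s+", " ", name).strip()
    return ALIASES.get(tidy, tidy)

print(clean_journal("Journal Of Medicinal Chemistry "))
print(clean_journal("J Med Chem"))
print(clean_journal("Americal Journal Of Psychiatry"))
```

Applied to the notebook's frame this would look like data_df['journ'].map(clean_journal); the alias table would need to grow as more variants are spotted.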

In [158]:
#Extract pmids (8 digits) and pmcids (7 digits); expand=False returns a Series
data_df['PMid2'] = data_df.PMid.str.extract(r"(?<!\d)(\d{8})(?!\d)", expand=False)
data_df['PMCid'] = data_df.PMid.str.extract(r"(?<!\d)(\d{7})(?!\d)", expand=False)


In [159]:
data_df.head()


Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid
0,,Cup,Psychological Medicine,Reduced Parahippocampal Cortical Thickness In ...,£0.00,,
1,Pmc3679557,Acs,Biomacromolecules,Structural Characterization Of A Model Gram-Ne...,£2381.04,,3679557.0
2,23043264 Pmc3506128,Acs,J Med Chem,"Fumaroylamino-4,5-Epoxymorphinans And Related ...",£642.56,23043264.0,3506128.0
3,23438330 Pmc3646402,Acs,J Med Chem,Orvinols With Mixed Kappa/Mu Opioid Receptor A...,£669.64,23438330.0,3646402.0
4,23438216 Pmc3601604,Acs,J Org Chem,Regioselective Opening Of Myo-Inositol Orthoes...,£685.88,23438216.0,3601604.0
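
How the two patterns work: the lookarounds (?<!\d) and (?!\d) require that the run of digits is not part of a longer number, so an 8-digit run is taken as a PMID and a 7-digit run as a PMCID. The same logic can be checked with the standard re module on a few strings taken from the tables in this notebook (split_ids is a hypothetical helper name):

```python
import re

PMID_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")   # exactly eight digits
PMCID_RE = re.compile(r"(?<!\d)(\d{7})(?!\d)")  # exactly seven digits

def split_ids(raw):
    """Return (pmid, pmcid) as strings, or None where nothing matches."""
    pmid = PMID_RE.search(raw)
    pmcid = PMCID_RE.search(raw)
    return (pmid.group(1) if pmid else None,
            pmcid.group(1) if pmcid else None)

print(split_ids("23043264 Pmc3506128"))             # ('23043264', '3506128')
print(split_ids("Pmid:24048963 (Epub Sept 2013)"))  # ('24048963', None)
print(split_ids("Pmc In Progress"))                 # (None, None)
```

This is a length-based heuristic: a bare 7-digit number is treated as a PMCID even without a Pmc prefix, and an identifier with a different number of digits is missed.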


In [191]:
#show rows that share a PMCid (keep=False flags every occurrence, not just the repeats)
data_df[data_df['PMCid'].duplicated(keep=False)].sort_values(by=['PMCid'], ascending=True).head()


Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid,cost_n
128,Pmc3173209,Asbmb,Journal Of Biological Chemistry,The T Cell Receptor Triggering Apparatus Is Co...,£1286.86,,3173209,1286.86
155,Pmc3173209,Asbmb,The Journal Of Biological Chemistry,The T-Cel Receptor Triggering Apparatus Is Com...,£1281.15,,3173209,1281.15
176,22738332 Pmc3381227,Biomed Central,Biomed Central,Long-Term Impact Of Systemic Bacterial Infecti...,£1350.00,22738332.0,3381227,1350.0
546,22155499 Pmc3381227,Elsevier,Elsevier,Age Related Changes In Microglial Phenotype Va...,£2152.76,22155499.0,3381227,2152.76
2063,Pmid: 23344974 Pmc3401426,Wiley-Blackwell,Chembiochem,Using A Fragment-Based Approach To Target Prot...,£2048.93,23344974.0,3401426,2048.93


In [163]:
#change cost column to numeric (rows without a leading £ become NaN)
data_df["cost_n"] = data_df.cost.str.extract(r"£(.*)", expand=False)
data_df.cost_n = pd.to_numeric(data_df["cost_n"])

In [170]:
#confirm cost is numeric now
data_df.cost_n.sum()

51172418.16
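
A total of about 51 million over 2,127 rows works out to roughly £24,000 per article, while the rows shown so far cost between £0 and about £2,900. That gap suggests a few extreme values are inflating the sum. A sketch of how one might look for them, on made-up numbers; the 999999.0 entry and the CEILING cut-off are assumptions for illustration, not values confirmed from the data:

```python
import pandas as pd

# Made-up costs; 999999.0 stands in for a suspected placeholder value.
cost_n = pd.Series([642.56, 2381.04, 0.0, 999999.0, 1294.59])

print(cost_n.describe())   # a max far above the 75% quartile is a warning sign
print(cost_n.nlargest(2))  # inspect the largest entries directly

CEILING = 10_000  # hypothetical cut-off, to be chosen after inspecting the data
plausible = cost_n[(cost_n > 0) & (cost_n < CEILING)]
print(plausible.mean())
```

Whether £0.00 rows should also be dropped is a separate cleaning decision and is only assumed here.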

In [192]:
data_df[pd.isnull(data_df['PMid2']) & pd.isnull(data_df['PMCid'])].head()

Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid,cost_n
0,,Cup,Psychological Medicine,Reduced Parahippocampal Cortical Thickness In ...,£0.00,,,0.0
21,,American Chemical Society,Acs Chemical Biology,Discovery Of ?2 Adrenergic Receptor Ligands Us...,£947.07,,,947.07
43,,American Psychiatric Association,American Journal Of Psychiatry,Methamphetamine-Induced Disruption Of Frontost...,£2351.73,,,2351.73
90,Pmc In Progress,American Society For Microbiology,Infection And Immunity,Analysis Of Antibodies To Newly Described Plas...,£2034.00,,,2034.0
93,,American Society For Microbiology,Journal Of Virology,The Human Adenovirus Type 5 L4 Promoter Is Act...,£1312.59,,,1312.59
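
Once journal names and costs are clean, the figures the challenge asks for (the five most common journals, with article count, mean, median and standard deviation of cost) could be produced along these lines. This is a sketch on a made-up frame that reuses the notebook's column names journ and cost_n; it is not the result for the real data.

```python
import pandas as pd

# Toy stand-in for the cleaned frame (made-up rows).
df = pd.DataFrame({
    "journ": ["Plos One", "Plos One", "Plos One",
              "J Med Chem", "J Med Chem", "Chest"],
    "cost_n": [800.0, 1000.0, 1200.0, 600.0, 700.0, 2000.0],
})

# Five most frequent journal names.
top = df["journ"].value_counts().head(5)

# Per-journal statistics, restricted to those journals.
summary = (
    df[df["journ"].isin(top.index)]
    .groupby("journ")["cost_n"]
    .agg(["count", "mean", "median", "std"])
    .sort_values("count", ascending=False)
)
print(summary)
```

Note that pandas computes the sample standard deviation (ddof=1), so a journal with a single article gets NaN in the std column.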
