### 1. Using this dataset of article open-access prices paid by the Wellcome Trust between 2012 and 2013, determine the five most common journals and the total number of articles for each.

#### Answer:  
~~~~
plos one                           190
journal of biological chemistry     53
neuroimage                          29
plos genetics                       24
plos pathogens                      24
~~~~

In [328]:
import pandas as pd
import numpy as np

In [329]:
# load the CSV file into the data frame df_apc; pass the encoding argument because the file is ISO-8859-1 encoded.
df_apc = pd.read_csv('WELLCOME_APCspend2013_forThinkful.csv', encoding="ISO-8859-1")

In [330]:
df_apc.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [331]:
# look at the distinct journal titles to see how consistently they are written.
df_apc['Journal title'].unique()

array(['Psychological Medicine', 'Biomacromolecules', 'J Med Chem',
       'J Org Chem', 'Journal of Medicinal Chemistry',
       'Journal of Proteome Research', 'Mol Pharm',
       'ACS Chemical Biology',
       'Journal of Chemical Information and Modeling', 'Biochemistry',
       'Gastroenterology', 'Journal of Biological Chemistry',
       'Journal of Immunology', 'ACS Chemical Neuroscience', 'ACS NANO',
       'American Chemical Society', 'Analytical Chemistry',
       'Bioconjugate Chemistry', 'Journal of Medicinal Chemistry ',
       'Journal of the American Chemical Society', 'ACS Nano', 'CHEST',
       'Journal of Neurophysiology', 'Journal of Physiology',
       'The Journal of Neurophysiology', 'American Journal of Psychiatry',
       'Americal Journal of Psychiatry', 'Behavioral Neuroscience',
       'Emotion', 'Health Psychology', 'Journal of Abnormal Psychology',
       'Journal of Consulting and Clinical Psychology',
       'Journal of Experimental Psychology:  Animal Be

In [332]:
df_apc['Journal title'].count()

2126

In [333]:
df_apc['Journal title'].nunique()

984

In [334]:
# Create the 'journals' series and lower-case it so titles that differ only in case are counted together.
journals = df_apc['Journal title']
journals_lower = journals.str.lower()
journals_lower.value_counts().head(6)

plos one                           190
journal of biological chemistry     53
neuroimage                          29
plos genetics                       24
plos pathogens                      24
nucleic acids research              23
Name: Journal title, dtype: int64
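
The title list above also contains variants that differ only by trailing whitespace (for example `'Journal of Medicinal Chemistry '`), which lower-casing alone does not merge. A minimal sketch of a slightly stricter normalisation follows. It runs on a small made-up sample rather than the real file, and `normalize_titles` and `sample_titles` are hypothetical names used only for illustration.

```python
import pandas as pd


def normalize_titles(titles: pd.Series) -> pd.Series:
    """Lower-case titles, trim outer whitespace, collapse repeated spaces."""
    return (
        titles.str.lower()
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
    )


# Made-up sample; these rows are not taken from the real dataset.
sample_titles = pd.Series([
    "PLoS One", "PLOS ONE", "plos one ",
    "ACS Nano", "ACS NANO",
    "Journal of Medicinal Chemistry", "Journal of Medicinal Chemistry ",
    "Neuroimage",
])

# Count articles per normalised title and keep the five most common.
top_counts = normalize_titles(sample_titles).value_counts().head(5)
print(top_counts)
```

Applied to the real data, this could merge a few more variants and shift the counts slightly; the figures in the answer above were produced with lower-casing only.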

### 2. Calculate the mean, median, and standard deviation of the open-access cost per article for each journal.

#### Answer (computed across all articles rather than per journal):
~~~~
count     2078.000000
mean      1820.986583
std        814.167299
min          0.000000
25%       1260.000000
50%       1851.220000
75%       2302.730000
max      13200.000000
median   1851.22
~~~~

In [335]:
df_apc.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [336]:
# keep only the text after the leading pound sign (£) in each cost value.
df_apc["COST (£) charged to Wellcome (inc VAT when charged)"] = df_apc["COST (£) charged to Wellcome (inc VAT when charged)"].apply(lambda x : x[1:])

In [337]:
df_apc.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,685.88


In [338]:
df_apc.dtypes

PMID/PMCID                                             object
Publisher                                              object
Journal title                                          object
Article title                                          object
COST (£) charged to Wellcome (inc VAT when charged)    object
dtype: object

In [339]:
# remove any '$' characters from the cost column so the values can be converted to numbers later.
df_apc["COST (£) charged to Wellcome (inc VAT when charged)"] = df_apc["COST (£) charged to Wellcome (inc VAT when charged)"].str.replace('$', '', regex=False)

In [340]:
df_apc.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,685.88


In [341]:
# convert cost data to float data type (the np.float alias was removed in NumPy 1.24, so use astype(float)).
df_apc["COST (£) charged to Wellcome (inc VAT when charged)"] = df_apc["COST (£) charged to Wellcome (inc VAT when charged)"].astype(float)
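
The three cleaning steps above (drop the leading character, remove `$`, cast to float) can also be written as one vectorised pass. The sketch below assumes the raw values are strings such as `£2381.04`; the sample values are made up and `parse_cost` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def parse_cost(raw: pd.Series) -> pd.Series:
    """Strip currency symbols and convert to float.

    Entries that still cannot be parsed become NaN instead of raising.
    """
    cleaned = (
        raw.astype(str)
        .str.replace("£", "", regex=False)
        .str.replace("$", "", regex=False)
        .str.strip()
    )
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up sample covering the cases handled in the cells above.
raw_costs = pd.Series(["£0.00", "£2381.04", "£642.56$", " £685.88", "n/a"])
costs = parse_cost(raw_costs)
print(costs)
```

Using `errors="coerce"` is a design choice: it keeps the conversion from failing on a stray entry, at the price of silently producing NaN, so it is worth checking `costs.isna().sum()` afterwards.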

In [342]:
df_apc.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,685.88


In [343]:
# making sure cost is of type float.
df_apc.dtypes

PMID/PMCID                                              object
Publisher                                               object
Journal title                                           object
Article title                                           object
COST (£) charged to Wellcome (inc VAT when charged)    float64
dtype: object

In [344]:
df_apc["COST (£) charged to Wellcome (inc VAT when charged)"].mean()

24060.945989656793

In [345]:
# The above mean looks way too high, so check the max value.
df_apc["COST (£) charged to Wellcome (inc VAT when charged)"].max()

999999.0

In [346]:
df_apc.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,685.88


In [347]:
# sort by cost in descending order to inspect the largest values.
df_apc.sort_values(["COST (£) charged to Wellcome (inc VAT when charged)"], ascending=False).head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
1065,PMCID: PMC3615658,Nature Publishing Group,EMBO Reports,Physiological release of endogenous tau is sti...,999999.0
1675,Not yet available,Sage Publishing,Qualitative Research,Picturing commuting: photovoice and seeking we...,999999.0
1939,,Wiley,HBM JNL Human Brain Mapping,Phase informed model for motion and susceptibi...,999999.0
825,,Elsevier,Veterinary Parasitology,Persistence of the efficacy of copper oxide wi...,999999.0
491,PMCID: PMC3464430,Elsevier,Cell,piRNAs can trigger a multigenerational epigene...,999999.0


In [348]:
df_apc.dtypes

PMID/PMCID                                              object
Publisher                                               object
Journal title                                           object
Article title                                           object
COST (£) charged to Wellcome (inc VAT when charged)    float64
dtype: object

In [349]:
# rename the cumbersome column name to 'cost'.
df_rename = df_apc.rename(columns={"COST (£) charged to Wellcome (inc VAT when charged)": "cost"})

In [350]:
df_rename.dtypes

PMID/PMCID        object
Publisher         object
Journal title     object
Article title     object
cost             float64
dtype: object

In [351]:
df_rename.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,cost
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,685.88


In [352]:
df_rename.tail()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,cost
2122,2901593,Wolters Kluwer Health,Circulation Research,Mechanistic Links Between Na+ Channel (SCN5A) ...,1334.15
2123,3748854,Wolters Kluwer Health,AIDS,Evaluation of an empiric risk screening score ...,1834.77
2124,3785148,Wolters Kluwer Health,Pediatr Infect Dis J,Topical umbilical cord care for prevention of ...,1834.77
2125,PMCID:\n PMC3647051\n,Wolters Kluwer N.V./Lippinott,AIDS,Grassroots Community Organisations' Contributi...,2374.52
2126,PMID: 23846567 (Epub July 2013),Wolters Kluwers,Journal of Acquired Immune Deficiency Syndromes,A novel community health worker tool outperfor...,2034.75


In [353]:
df_rename.dtypes

PMID/PMCID        object
Publisher         object
Journal title     object
Article title     object
cost             float64
dtype: object

In [354]:
# remove the suspiciously high value 999999.00, which looks like a placeholder rather than a real charge.
df_no_nines = df_rename[df_rename.cost != 999999.00]
df_no_nines.sort_values(["cost"], ascending=False).head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,cost
1987,PMC3664409\n\n,Wiley,Movement Disorders,Limb amputations in fixed dystonia: a form of ...,201024.0
1470,3547931,Public Library of Science,PLoS One,Reducing stock-outs of life saving Malaria Com...,192645.0
986,,MacMillan,,Fungal Disease in Britain and the United State...,13200.0
1619,543219,public.service.co.uk,Public Service Review,Laboratory Science in Tropical Medicine,6000.0
800,PMID: 23041239 /PMCID: PMC3490334,Elsevier,The Lancet Neurology,Genetic risk factors for ischaemic stroke and ...,5760.0


In [355]:
# just checking to make sure cost is still a float data type.
df_no_nines.dtypes

PMID/PMCID        object
Publisher         object
Journal title     object
Article title     object
cost             float64
dtype: object

In [356]:
# remove the two remaining cost figures that are implausibly high for a single article.
df_no_large = df_no_nines[df_no_nines.cost != 201024.00]
df_no_large2 = df_no_large[df_no_large.cost != 192645.00]
df_no_large2.sort_values(["cost"], ascending=False).head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,cost
986,,MacMillan,,Fungal Disease in Britain and the United State...,13200.0
1619,543219,public.service.co.uk,Public Service Review,Laboratory Science in Tropical Medicine,6000.0
800,PMID: 23041239 /PMCID: PMC3490334,Elsevier,The Lancet Neurology,Genetic risk factors for ischaemic stroke and ...,5760.0
552,23541370 PMC3744751,Elsevier,Elsevier,Effects of relative weight gain and linear gro...,4800.0
798,PMCID:\n PMC3627205\n,Elsevier,The Lancet,Effects of unconditional and conditional cash ...,4800.0
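
Filtering each suspicious value by exact equality works here because only three distinct values are involved. A more general sketch treats `999999` as a placeholder and drops anything above a cutoff. The cutoff of 100,000 is an assumption chosen only to separate the two six-figure entries from the rest, and `drop_implausible_costs` is a hypothetical helper. The sample rows reuse cost values shown in the outputs above.

```python
import pandas as pd

SENTINEL = 999999.0        # value that appears to be a placeholder, not a charge
MAX_PLAUSIBLE = 100000.0   # assumed cutoff; not taken from the dataset's documentation


def drop_implausible_costs(df: pd.DataFrame, column: str = "cost") -> pd.DataFrame:
    """Return a copy without sentinel rows and without rows above the cutoff.

    Rows whose cost is NaN are dropped as well, because NaN <= cutoff is False.
    """
    keep = (df[column] != SENTINEL) & (df[column] <= MAX_PLAUSIBLE)
    return df.loc[keep].copy()


sample = pd.DataFrame({
    "Journal title": ["EMBO Reports", "Movement Disorders", "PLoS One",
                      "The Lancet", "AIDS"],
    "cost": [999999.0, 201024.0, 192645.0, 4800.0, 1834.77],
})

cleaned = drop_implausible_costs(sample)
print(cleaned.sort_values("cost", ascending=False))
```

Returning a copy keeps later assignments on the filtered frame from triggering pandas' chained-assignment warning.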


In [357]:
# give the cleaned data frame a clearer name and compute summary statistics for cost.
df_clean = df_no_large2
df_clean['cost'].describe()

count     2078.000000
mean      1820.986583
std        814.167299
min          0.000000
25%       1260.000000
50%       1851.220000
75%       2302.730000
max      13200.000000
Name: cost, dtype: float64

In [358]:
print(df_clean['cost'].median())

1851.22
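
The figures above summarise the cost column as a whole. Since the question asks for statistics per journal, a group-by on the journal title would be needed. A minimal sketch follows, on a made-up sample and assuming a frame with `Journal title` and `cost` columns like `df_clean`; `journal_key` and `per_journal` are names introduced here for illustration.

```python
import pandas as pd

# Made-up sample; the costs are not from the real dataset.
sample = pd.DataFrame({
    "Journal title": ["PLoS One", "PLOS ONE", "plos one",
                      "Neuroimage", "NeuroImage"],
    "cost": [1000.0, 1200.0, 1400.0, 2000.0, 2400.0],
})

# Group on a normalised title so spelling variants fall into the same journal.
journal_key = sample["Journal title"].str.lower().str.strip()

per_journal = (
    sample.groupby(journal_key)["cost"]
    .agg(["count", "mean", "median", "std"])
    .sort_values("count", ascending=False)
)
print(per_journal)
```

Note that `std` is the sample standard deviation, so it is NaN for any journal with a single article.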
