# Challenge: Data Cleaning and Validation

Unit 1 / Lesson 3 / Project 6

Data cleaning is definitely a "practice makes perfect" skill.

Use [this dataset](https://www.dropbox.com/s/19cjdi7wqhlfcpt/WELLCOME.zip?dl=0) of article open-access prices paid by the WELLCOME Trust between 2012 and 2013.

- Determine the five most common journals and the total articles for each. 
- Calculate the mean, median, and standard deviation of the open-access cost per article for each journal .

You will need to do considerable data cleaning in order to extract accurate estimates, and may want to look into [data encoding methods](https://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16) if you get stuck.

For a real bonus round, identify the open access prices paid by subject area.

Don't modify the data directly. Instead, write a cleaning script that will load the raw data and whip it into shape. Jupyter notebooks are a great format for this. Keep a record of your decisions: well-commented code is a must for recording your data cleaning decision-making progress.


### Determine the five most common journals and the total articles for each:

Based on our analysis of the clean data, we found that the top five journals by total number of articles are the following:

Public Library of Science: `307`

ASBMB Journal of Biological Chemistry: `71`

Elsevier Journal of Neuroimage: `34`

Oxford University Press Nucleic Acids Research journal: `26`

Proceedings of the National Academy of Science: `28`
    

### Calculate the mean, median, and standard deviation of the open-access cost per article for each journal 

The __Public Libray of Science database__ has the following publishing costs:

    A mean cost of 1790.34 GBP
    A median cost of 1020.42 GBP
    A standard deviation in cost of 11,214.64 GBP


The __ASBMB Journal of Biological Chemistry__ has the following publishing costs:

    A mean cost of 1396.80 GBP
    A median cost of 1314.53 GBP
    A standard deviation in cost of 364.66 GBP


The __Elsevier Journal of Neuroimage__ has the following publishing costs:

    A mean cost of 2050.75 GBP
    A median cost of 2289.24 GBP
    A standard deviation in cost of 472.21 GBP


The __Oxford University Press Nucleic Acids Research journal__ has the following publishing costs:

    A mean cost of 1149.00 GBP
    A median cost of 852.00 GBP
    A standard deviation in cost of 442.94 GBP


The __Proceedings of the National Academy of Science__ has the following publishing costs:

    A mean cost of 785.71 GBP
    A median cost of 755.57 GBP
    A standard deviation in cost of 413.97 GBP

In [1]:
import numpy as np
import pandas as pd

loc = 'WELLCOME APCspend2013 forThinkful.csv'

df = pd.read_csv(loc,
                 encoding='ISO-8859-1',
                 #encoding = "utf-8"
                )

df.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [2]:
# let's rename the columns to make them easier to read
df = df.rename(columns={'PMID/PMCID':'pmid_pmcid',
                        'Publisher':'publisher',
                        'Journal title':'journal',
                        'Article title':'article',
                        'COST (£) charged to Wellcome (inc VAT when charged)':'cost_gbp'})


df.head()

Unnamed: 0,pmid_pmcid,publisher,journal,article,cost_gbp
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [3]:
# sort Journal title column and reset index
df_stripped = df['journal'].str.strip()
df_sorted = df.sort_values('journal')
df_sorted = df_sorted.reset_index(drop=True)
df_sorted.head()

Unnamed: 0,pmid_pmcid,publisher,journal,article,cost_gbp
0,,American Chemical Society,ACS Chemical Biology,Discovery of ?2 Adrenergic Receptor Ligands Us...,£947.07
1,: PMC3805332,American Chemical Society,ACS Chemical Biology,Synthesis of alpha-glucan in mycobacteria invo...,£2286.73
2,PMCID: PMC3780468,ACS (Amercian Chemical Society) Publications,ACS Chemical Biology,A Novel Allosteric Inhibitor of the Uridine Di...,£1294.59
3,PMCID: PMC3621575,ACS (Amercian Chemical Society) Publications,ACS Chemical Biology,Chemical proteomic analysis reveals the drugab...,£1294.78
4,PMID: 24015914 PMC3833349,American Chemical Society,ACS Chemical Biology,Discovery of an allosteric inhibitor binding s...,£1267.76


In [4]:
# let'se see how many journals we have before we start cleaning the data
pd.Series(df_sorted['journal'].unique()).count()

984

In [5]:
# set uniform letter case
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.lower()
df_sorted['journal'] = pd.Series(df_sorted['journal']).str.lower()
df_sorted['article'] = pd.Series(df_sorted['article']).str.lower()

# drop the £ symbol from cost_gbp
df_sorted['cost_gbp'] = pd.Series(df_sorted['cost_gbp'].str.replace('£',''))

# it looks like some journals have 'journal of ...' or 'j ...' before the name of the journal
# this could cause duplicate entries when we count articles, so let's drop them
# drop 'journal of' and 'j ...'
df_sorted['journal'] = pd.Series(df_sorted['journal']).str.replace('journal of ', '')
df_sorted['journal'] = pd.Series(df_sorted['journal']).str.replace('j ', '')

# there are multiple journals with 'plos' as a prefix
# this is a peer-reviewed journal that covers multiple topics so they should be grouped as one journal
# let's drop 'plos' so we can group them together
df_sorted['journal'] = pd.Series(df_sorted['journal']).str.replace('plos', '')
# plos stands for 'public library of science', so let's drop that too
df_sorted['journal'] = pd.Series(df_sorted['journal']).str.replace('public library of science', 'plos')

# strip any whitespace
df_sorted['journal'] = pd.Series(df_sorted['journal']).str.strip()

#view our cleaned data
pd.Series(df_sorted['journal'].unique()).count()

873

In [6]:
# view journals by article count
df_sorted['article'].groupby(df_sorted['journal']).agg('count').sort_values(ascending=False)

journal
one                                                           200
biological chemistry                                           54
neuroimage                                                     29
nucleic acids research                                         26
genetics                                                       26
pathogens                                                      24
proceedings of the national academy of sciences                22
neglected tropical diseases                                    20
human molecular genetics                                       19
nature communications                                          19
neuroscience                                                   16
movement disorders                                             15
bmc public health                                              14
brain                                                          14
biochemical journal                                            12
de

In [7]:
# let's view the top journals by artical count
df_sorted['article'].groupby(df_sorted['journal']).agg('count').sort_values(ascending=False).head(10)

journal
one                                                200
biological chemistry                                54
neuroimage                                          29
nucleic acids research                              26
genetics                                            26
pathogens                                           24
proceedings of the national academy of sciences     22
neglected tropical diseases                         20
human molecular genetics                            19
nature communications                               19
Name: article, dtype: int64

In [11]:
# let's look at our top five journals

# let's start with the public library of science
df_sorted.loc[df_sorted['publisher'].str.contains('public library', na=False)]

# biological chemistry
df_sorted.loc[df_sorted['journal'].str.contains('biological chemistry', na=False)]

# neuroimage
df_sorted.loc[df_sorted['journal'].str.contains('neuroimage', na=False)]

# nucleic acids research
df_sorted.loc[df_sorted['journal'].str.contains('nucleic acids research', na=False)]

# genetics
df_sorted.loc[df_sorted['journal'].str.contains('genetics', na=False)]

# pathogens
df_sorted.loc[df_sorted['journal'].str.contains('pathogens', na=False)]

#proceedings of the national academy of science
df_sorted.loc[df_sorted['journal'].str.contains('proceedings of the national academy of science', na=False)]

Unnamed: 0,pmid_pmcid,publisher,journal,article,cost_gbp
1800,PMCID: PMC3780889,national academy of sciences,pnas (proceedings of the national academy of s...,activation of the canonical ikk complex by k63...,853.64
1864,PMC3557024,national academy of sciences,proceedings of the national academy of sciences,sexual reproduction and mating-type-mediated s...,664.89
1865,3666720,national academy of sciences,proceedings of the national academy of sciences,presynaptic maturation in auditory hair cells ...,765.36
1866,3752214,national academy of sciences,proceedings of the national academy of sciences,progressive hearing loss and gradual deteriora...,793.02
1867,3396515,pnas,proceedings of the national academy of sciences,multistep molecular mechanism for bone morphog...,206.32
1868,2766312,dartmouth journal services,proceedings of the national academy of sciences,analysis of synthetic lethality reveals geneti...,1241.1
1869,PMC3511132,dartmouth journal services,proceedings of the national academy of sciences,sgta antagonizes bag6-mediated protein triage,603.42
1870,23341602 PMC3568321,pnas,proceedings of the national academy of sciences,specular reflections and the estimation of shape,663.3
1871,PMCID:\n PMC3529010,national academy of sciences,proceedings of the national academy of sciences,interactions between the nucleosome histone co...,614.95
1872,PMCID:\n PMC3479458\n,national academy of sciences,proceedings of the national academy of sciences,structural basis for the recognition and cleav...,605.17


Public Library of Science (PLOS) is a large open source database of scientific articles. It appears multiple journals in our database are published by the PLOS.


The Biological Chemistry journal is published by the American Society for Biochemistry and Molecular Biology (ASBMB).


The Neuroimage journal is published by an organization called Elsevier.


The Nucleic Acids Research journal is published by the Oxford University Press.


There are multiple journals titled 'Genetics', most of which are published by the PLOS, so we'll omit them from out count of the top five journals. The same is true with the next most populous journal, 'Pathogens', we'll omit that too.


The next qualifying journal is the Proceeding of the National Academy of Sciences.


So according to our cleaning and sorting, the top journals with the most articles in our database are the Public Library of Science, the ASBMB Biological Chemistry Journal, Elsevier's Neuroimage journal, the Oxford University Press' Nucleic Acids Research journal, and the Proceedings of the National Academy of Sciences.

In [18]:
# let's clean up some of the data in the PLOS journal
# there are multiple names for the same publisher: plos, public library of science, plos(public library of science)
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'plos \(public library of science\)', 'public library of science')
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'plos', 'public library of science')

# let's clean up the data for the american society for biochemistry and molecular biology
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'asbmb', 'american society for biochemistry and molecular biology')
df_sorted.loc[df_sorted['publisher'].str.contains('molecular biology')]

# let's clean up the data for the neuroimage journal
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'elseveier science', 'elseveier')
df_sorted.loc[df_sorted['publisher'].str.contains('elseveier')]

# let's clean up the data for the nucleic research journal
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'oup', 'oxford university press')
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'oxford university press\n', 'oxford university press')
df_sorted.loc[df_sorted['publisher'].str.contains('oxford university press')]

# let's clean up the data for the proceedings of the national academy of sciences
df_sorted['journal'] = pd.Series(df_sorted['journal']).str.replace(
    'pnas (proceedings of the national academy of sciences)',
    'proceedings of the national academy of sciences')
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'pnas', 'national academy of sciences')
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'proceedings of the national academy of sciences', 'national academy of sciences')
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'national academy of sciences, usa', 'national academy of sciences')
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'national academy of sciences usa', 'national academy of sciences')
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'national academy of sciences of the united states of america', 'national academy of sciences')
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'pnas author publication', 'national academy of sciences')
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'national academy of sciences author publication', 'national academy of sciences')
df_sorted['publisher'] = pd.Series(df_sorted['publisher']).str.replace(
    'national academy of sciences \(national academy of sciences\)', 'national academy of sciences')

# let's view our data and see if we've cleaned it up
df_sorted.loc[df_sorted['publisher'].str.contains('public library', na=False)]
df_sorted.loc[df_sorted['journal'].str.contains('biological chemistry', na=False)]
df_sorted.loc[df_sorted['journal'].str.contains('neuroimage', na=False)]
df_sorted.loc[df_sorted['journal'].str.contains('nucleic acids research', na=False)]
df_sorted.loc[df_sorted['journal'].str.contains('proceedings of the national academy of science', na=False)]

Unnamed: 0,pmid_pmcid,publisher,journal,article,cost_gbp
1800,PMCID: PMC3780889,national academy of sciences,pnas (proceedings of the national academy of s...,activation of the canonical ikk complex by k63...,853.64
1864,PMC3557024,national academy of sciences,proceedings of the national academy of sciences,sexual reproduction and mating-type-mediated s...,664.89
1865,3666720,national academy of sciences,proceedings of the national academy of sciences,presynaptic maturation in auditory hair cells ...,765.36
1866,3752214,national academy of sciences,proceedings of the national academy of sciences,progressive hearing loss and gradual deteriora...,793.02
1867,3396515,national academy of sciences,proceedings of the national academy of sciences,multistep molecular mechanism for bone morphog...,206.32
1868,2766312,dartmouth journal services,proceedings of the national academy of sciences,analysis of synthetic lethality reveals geneti...,1241.1
1869,PMC3511132,dartmouth journal services,proceedings of the national academy of sciences,sgta antagonizes bag6-mediated protein triage,603.42
1870,23341602 PMC3568321,national academy of sciences,proceedings of the national academy of sciences,specular reflections and the estimation of shape,663.3
1871,PMCID:\n PMC3529010,national academy of sciences,proceedings of the national academy of sciences,interactions between the nucleosome histone co...,614.95
1872,PMCID:\n PMC3479458\n,national academy of sciences,proceedings of the national academy of sciences,structural basis for the recognition and cleav...,605.17


In [20]:
print('Top 5 Research Journals by Article Count:')
print('Public Library of Science:',
      df_sorted['publisher'].str.contains('public library', na=False).sum())
print('ASBMB Journal of Biological Chemistry:',
      df_sorted['publisher'].str.contains('american society for biochemistry', na=False).sum())
print('Elsevier Journal of Neuroimage:',
      df_sorted['journal'].str.contains('neuroimage', na=False).sum())
print('Oxford University Press Nucleic Acids Research journal:',
      df_sorted['journal'].str.contains('nucleic acids', na=False).sum())
print('Proceedings of the National Academy of Science:',
      df_sorted['journal'].str.contains('proceedings of the national academy of science', na=False).sum())

Top 5 Research Journals by Article Count:
Public Library of Science: 307
ASBMB Journal of Biological Chemistry: 71
Elsevier Journal of Neuroimage: 34
Oxford University Press Nucleic Acids Research journal: 26
Proceedings of the National Academy of Science: 28


### Determine the five most common journals and the total articles for each:

Based on our analysis of the clean data, we found that the top five journals by total number of articles are the following:

Public Library of Science: 307

ASBMB Journal of Biological Chemistry: 71

Elsevier Journal of Neuroimage: 34

Oxford University Press Nucleic Acids Research journal: 26

Proceedings of the National Academy of Science: 28

In [201]:
# now let's find the mean, median, and standard deviation
# of the cost of publishing an article in the top five journals

# let's start with the public library of science
plos = df_sorted.loc[df_sorted['publisher'].str.contains('public library', na=False)]

plos['cost_gbp'] = plos['cost_gbp'].str.replace('\$', '')
plos = plos.drop(plos[plos['cost_gbp'] == '999999.00'].index)
plos['cost_gbp'] = plos['cost_gbp'].apply(lambda x: pd.to_numeric(x, errors='coerce'))

plos['cost_gbp'].describe()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


count       292.000000
mean       1790.337123
std       11214.640461
min         122.310000
25%         871.442500
50%        1020.425000
75%        1430.287500
max      192645.000000
Name: cost_gbp, dtype: float64

In [204]:
# ASBMB Journal of Biological Chemistry
asbmb = df_sorted.loc[df_sorted['publisher'].str.contains('american society for biochemistry', na=False)]

asbmb['cost_gbp'] = asbmb['cost_gbp'].str.replace('\$', '')
asbmb = asbmb.drop(asbmb[asbmb['cost_gbp'] == '999999.00'].index)
asbmb['cost_gbp'] = asbmb['cost_gbp'].apply(lambda x: pd.to_numeric(x, errors='coerce'))

asbmb['cost_gbp'].describe()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


count      69.000000
mean     1396.807246
std       364.664885
min       381.040000
25%      1166.600000
50%      1314.530000
75%      1566.890000
max      2501.070000
Name: cost_gbp, dtype: float64

In [205]:
# Elsevier Journal of Neuroimage
neuro = df_sorted.loc[df_sorted['journal'].str.contains('neuroimage', na=False)]

neuro['cost_gbp'] = neuro['cost_gbp'].str.replace('\$', '')
neuro = neuro.drop(neuro[neuro['cost_gbp'] == '999999.00'].index)
neuro['cost_gbp'] = neuro['cost_gbp'].apply(lambda x: pd.to_numeric(x, errors='coerce'))

neuro['cost_gbp'].describe()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


count      34.000000
mean     2050.756176
std       472.211498
min       987.750000
25%      1762.690000
50%      2289.245000
75%      2395.760000
max      2503.340000
Name: cost_gbp, dtype: float64

In [206]:
# Oxford University Press Nucleic Acids Research journal
oup = df_sorted.loc[df_sorted['journal'].str.contains('nucleic acids', na=False)]

oup['cost_gbp'] = oup['cost_gbp'].str.replace('\$', '')
oup = oup.drop(oup[oup['cost_gbp'] == '999999.00'].index)
oup['cost_gbp'] = oup['cost_gbp'].apply(lambda x: pd.to_numeric(x, errors='coerce'))

oup['cost_gbp'].describe()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


count      26.000000
mean     1149.000000
std       442.940447
min       710.000000
25%       852.000000
50%       852.000000
75%      1704.000000
max      2184.000000
Name: cost_gbp, dtype: float64

In [207]:
# Proceedings of the National Academy of Science
pnas = df_sorted.loc[df_sorted['journal'].str.contains('proceedings of the national academy of science', na=False)]

pnas['cost_gbp'] = pnas['cost_gbp'].str.replace('\$', '')
pnas = pnas.drop(pnas[pnas['cost_gbp'] == '999999.00'].index)
pnas['cost_gbp'] = pnas['cost_gbp'].apply(lambda x: pd.to_numeric(x, errors='coerce'))

pnas['cost_gbp'].describe()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


count      28.000000
mean      785.713214
std       413.977213
min       206.320000
25%       617.080000
50%       755.570000
75%       797.170000
max      2691.680000
Name: cost_gbp, dtype: float64

### Calculate the mean, median, and standard deviation of the open-access cost per article for each journal 

The __Public Libray of Science database__ has the following publishing costs:

    A mean cost of 1790.34 GBP
    A median cost of 1020.42 GBP
    A standard deviation in cost of 11,214.64 GBP


The __ASBMB Journal of Biological Chemistry__ has the following publishing costs:

    A mean cost of 1396.80 GBP
    A median cost of 1314.53 GBP
    A standard deviation in cost of 364.66 GBP


The __Elsevier Journal of Neuroimage__ has the following publishing costs:

    A mean cost of 2050.75 GBP
    A median cost of 2289.24 GBP
    A standard deviation in cost of 472.21 GBP


The __Oxford University Press Nucleic Acids Research journal__ has the following publishing costs:

    A mean cost of 1149.00 GBP
    A median cost of 852.00 GBP
    A standard deviation in cost of 442.94 GBP


The __Proceedings of the National Academy of Science__ has the following publishing costs:

    A mean cost of 785.71 GBP
    A median cost of 755.57 GBP
    A standard deviation in cost of 413.97 GBP`