To complete this challenge, determine the five most common journals and the total articles for each. Next, calculate the mean, median, and standard deviation of the open-access cost per article for each journal.

You will need to do considerable data cleaning in order to extract accurate estimates. You may want to look into data encoding methods if you get stuck. For a real bonus round, identify the open access prices paid by subject area.

Remember not to modify the data directly. Instead, write a cleaning script that will load the raw data and whip it into shape. Jupyter notebooks are a great format for this. Keep a record of your decisions: well-commented code is a must for recording your data cleaning decision-making process. Submit a link to your script and results below and discuss it with your mentor at your next session.

Data file at https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/WELLCOME/WELLCOME_APCspend2013_forThinkful.csv


In [420]:
#importing modules 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import scipy


In [268]:
#read data into notebook
url = r"https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/WELLCOME/WELLCOME_APCspend2013_forThinkful.csv"
local = r"C:\Users\Chris\Documents\thinkful\data_sets\WELLCOME_APCspend2013_forThinkful.csv"
data_df_raw = pd.read_csv(url, encoding='latin_1')

data_df_raw.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88
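
The `encoding='latin_1'` argument matters because of the pound sign in the cost column. A minimal sketch of why, using only the standard library (the helper `try_decode` is a hypothetical name for illustration, and the assumption that the file is Latin-1 encoded is taken from the `read_csv` call above):

```python
# The pound sign is a single byte in Latin-1 but two bytes in UTF-8.
pound_latin1 = "£".encode("latin_1")   # b'\xa3'
pound_utf8 = "£".encode("utf-8")       # b'\xc2\xa3'

def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

# Bytes as they would appear in a Latin-1 encoded file.
sample = "£642.56".encode("latin_1")
as_utf8 = try_decode(sample, "utf-8")      # None: 0xa3 alone is not valid UTF-8
as_latin1 = try_decode(sample, "latin_1")  # decodes cleanly
```

Reading such a file with the wrong codec either raises an error or garbles the currency symbol, which would break any later pattern that looks for `£`.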


In [450]:
#work on a copy so the raw data stays untouched and need not be re-imported
data_df = data_df_raw.copy()
data_df.shape

(2127, 5)

In [451]:
#check data type
data_df.dtypes

PMid     object
pub      object
journ    object
art      object
cost     object
dtype: object

In [452]:
#rename columns
data_df.columns = ['PMid', 'pub', 'journ', 'art', 'cost']
data_df.head()

Unnamed: 0,PMid,pub,journ,art,cost
0,,Cup,Psychological Medicine,Reduced Parahippocampal Cortical Thickness In ...,£0.00
1,Pmc3679557,Acs,Biomacromolecules,Structural Characterization Of A Model Gram-Ne...,£2381.04
2,23043264 Pmc3506128,Acs,J Med Chem,"Fumaroylamino-4,5-Epoxymorphinans And Related ...",£642.56
3,23438330 Pmc3646402,Acs,J Med Chem,Orvinols With Mixed Kappa/Mu Opioid Receptor A...,£669.64
4,23438216 Pmc3601604,Acs,J Org Chem,Regioselective Opening Of Myo-Inositol Orthoes...,£685.88


In [453]:
#sort by article names to check for near duplicates
data_df.sort_values(by='art').head()

Unnamed: 0,PMid,pub,journ,art,cost
1231,Pmid:24048963 (Epub Sept 2013),Oxford University Press,Journal Of Infectious Diseases,Persistent Endothelial Activation After Plas...,£2841.60
1729,Pmc3536945,Springer,Brain Topography,"""A Novel Method For Reducing The Effect Of Ton...",£1889.90
1026,,Nature,Biosocieties,"""Creating The 'Ethics Industry': Mary Warnock,...",£1800.00
1287,Pmc3547059,Plos,Plos One,"""Involvement Of Ephb1 Receptors Signalling In ...",£1023.41
1485,3569446,Public Library Of Science,Plos One,"""The Words Will Pass With The Blowing Wind""; S...",£825.68


In [454]:
data_df['journ'].head()

0    Psychological Medicine
1         Biomacromolecules
2                J Med Chem
3                J Med Chem
4                J Org Chem
Name: journ, dtype: object

In [500]:
#why doesn't this work?
#data2_df = data_df[['PMid', 'pub', 'journ', 'art']].applymap(lambda x: x.str.replace(r"\n", ""))
#Answ: applymap hands the lambda one cell at a time; .str is a Series accessor, and the NaN cells are plain floats, hence the error below.


AttributeError: ("'float' object has no attribute 'str'", 'occurred at index PMid')
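
One way to apply a string method to several columns at once is `DataFrame.apply`, which passes each column to the function as a Series. This is a sketch on a made-up frame (`toy` and its values are hypothetical), not the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with a stray newline and a missing value.
toy = pd.DataFrame({
    "pub": ["ACS\n", np.nan],
    "journ": ["J Med\n Chem", "Plos One"],
})

# apply passes each column as a Series, so the .str accessor is available
# and NaN cells pass through as NaN instead of raising.
cleaned = toy[["pub", "journ"]].apply(
    lambda col: col.str.replace("\n", "", regex=False)
)
```

This replaces the five near-identical per-column assignments with a single expression, at the cost of requiring every selected column to hold strings.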

In [456]:
#remove newline chars (regex=True is explicit because newer pandas defaults to literal matching)
data_df.journ = data_df.journ.str.replace(r'\n', '', regex=True)
data_df.PMid = data_df.PMid.str.replace(r'\n', '', regex=True)
data_df.pub = data_df.pub.str.replace(r'\n', '', regex=True)
data_df.art = data_df.art.str.replace(r'\n', '', regex=True)
data_df.cost = data_df.cost.str.replace(r'\n', '', regex=True)

In [457]:
#normalise case in every text column (title case), not just the journal name
data_df.journ = data_df.journ.str.title()
data_df.PMid = data_df.PMid.str.title()
data_df.pub = data_df.pub.str.title()
data_df.art = data_df.art.str.title()
data_df.cost = data_df.cost.str.title()

In [458]:
data_df.head()

Unnamed: 0,PMid,pub,journ,art,cost
0,,Cup,Psychological Medicine,Reduced Parahippocampal Cortical Thickness In ...,£0.00
1,Pmc3679557,Acs,Biomacromolecules,Structural Characterization Of A Model Gram-Ne...,£2381.04
2,23043264 Pmc3506128,Acs,J Med Chem,"Fumaroylamino-4,5-Epoxymorphinans And Related ...",£642.56
3,23438330 Pmc3646402,Acs,J Med Chem,Orvinols With Mixed Kappa/Mu Opioid Receptor A...,£669.64
4,23438216 Pmc3601604,Acs,J Org Chem,Regioselective Opening Of Myo-Inositol Orthoes...,£685.88


In [486]:
#check for near duplicate journal names
data_df['journ'].unique()[:40]

array(['Psychological Medicine', 'Biomacromolecules', 'J Med Chem',
       'J Org Chem', 'Journal Of Medicinal Chemistry',
       'Journal Of Proteome Research', 'Mol Pharm',
       'Acs Chemical Biology',
       'Journal Of Chemical Information And Modeling', 'Biochemistry',
       'Gastroenterology', 'Journal Of Biological Chemistry',
       'Journal Of Immunology', 'Acs Chemical Neuroscience', 'Acs Nano',
       'American Chemical Society', 'Analytical Chemistry',
       'Bioconjugate Chemistry',
       'Journal Of The American Chemical Society', 'Chest',
       'Journal Of Neurophysiology', 'Journal Of Physiology',
       'The Journal Of Neurophysiology', 'American Journal Of Psychiatry',
       'Americal Journal Of Psychiatry', 'Behavioral Neuroscience',
       'Emotion', 'Health Psychology', 'Journal Of Abnormal Psychology',
       'Journal Of Consulting And Clinical Psychology',
       'Journal Of Experimental Psychology:  Animal Behaviour Process',
       'Journal Of Experiment
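
Scanning the array by eye catches pairs such as 'American Journal Of Psychiatry' / 'Americal Journal Of Psychiatry'. A possible way to surface such pairs automatically is the standard library's `difflib`. This is only a sketch: the function name `near_duplicates` and the 0.9 cutoff are assumptions chosen for illustration, and the names are a handful copied from the output above.

```python
import difflib

journals = [
    "American Journal Of Psychiatry",
    "Americal Journal Of Psychiatry",
    "Journal Of Neurophysiology",
    "The Journal Of Neurophysiology",
    "Chest",
]

def near_duplicates(names, cutoff=0.9):
    """Return sorted pairs of distinct names whose similarity ratio is at least cutoff."""
    pairs = set()
    for name in names:
        others = [n for n in names if n != name]
        for match in difflib.get_close_matches(name, others, n=3, cutoff=cutoff):
            # Sort each pair so (a, b) and (b, a) collapse into one entry.
            pairs.add(tuple(sorted((name, match))))
    return sorted(pairs)

pairs = near_duplicates(journals)
```

The flagged pairs still need a human decision about which spelling to keep; a lower cutoff finds more candidates but also more false positives.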

In [460]:
#remove record with null journal
data_df = data_df[data_df['journ'].notnull()]


In [506]:
#clean journal names
data_df.loc[:,'journ'] = data_df['journ'].str.replace(r"\s+", " ", regex=True)  #collapse runs of whitespace
data_df.loc[:,'journ'] = data_df['journ'].str.strip()  #drop leading/trailing whitespace
data_df.loc[:,'journ'] = data_df['journ'].str.replace("Public Library Of Science One", "Plos One", regex=False)
data_df.loc[:,'journ'] = data_df['journ'].str.replace("&", "and", regex=False)
data_df.loc[:,'journ'] = data_df['journ'].str.title()
##unless the data set is huge, prefer creating new columns over overwriting existing ones.


In [507]:
%who  #way of listing all outstanding variables in environment



data_df	 data_df_PMCnotnulls	 data_df_PMnotnulls	 data_df_nulls	 data_df_raw	 data_grp_df	 data_nonquid	 group	 high	 
json	 local	 low	 minmax	 name	 np	 pd	 plt	 quantiles	 
re	 scipy	 sqlite3	 testd	 url	 w_test	 windsorise	 winsorize	 


In [532]:
#Extract pmids and pmcids
data_df['PMid2'] = data_df.PMid.str.extract(r"((?<!\d)\d{8}(?!\d))")
data_df['PMCid'] = data_df.PMid.str.extract(r"(?<!\d)(\d{7})(?!\d)")


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]
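
The warning above most likely appears because `data_df` was produced by filtering another frame, so pandas cannot tell whether the new column should also appear in the original. A common remedy, sketched here on a made-up frame (`toy` and `subset` are hypothetical names), is to take an explicit copy right after filtering:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "journ": ["Plos One", np.nan, "Chest"],
    "cost": ["£1", "£2", "£3"],
})

# .copy() makes the filtered result an independent frame, so adding a
# column to it is unambiguous and leaves `toy` untouched.
subset = toy[toy["journ"].notnull()].copy()
subset["cost_n"] = pd.to_numeric(subset["cost"].str.replace("£", "", regex=False))
```

The copy costs some memory, which is rarely a concern for a data set of a few thousand rows.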


In [463]:
data_df.head()


Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid
0,,Cup,Psychological Medicine,Reduced Parahippocampal Cortical Thickness In ...,£0.00,,
1,Pmc3679557,Acs,Biomacromolecules,Structural Characterization Of A Model Gram-Ne...,£2381.04,,3679557.0
2,23043264 Pmc3506128,Acs,J Med Chem,"Fumaroylamino-4,5-Epoxymorphinans And Related ...",£642.56,23043264.0,3506128.0
3,23438330 Pmc3646402,Acs,J Med Chem,Orvinols With Mixed Kappa/Mu Opioid Receptor A...,£669.64,23438330.0,3646402.0
4,23438216 Pmc3601604,Acs,J Org Chem,Regioselective Opening Of Myo-Inositol Orthoes...,£685.88,23438216.0,3601604.0
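
The two patterns used for extraction can be checked in isolation with the standard `re` module. The sketch below reuses the same lookaround patterns; the helper name `split_ids` is hypothetical. Note the built-in assumption: any standalone 7-digit run is treated as a PMCID and any standalone 8-digit run as a PMID, so a 7-digit PMID would be misfiled.

```python
import re

PMID_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")   # exactly eight digits
PMCID_RE = re.compile(r"(?<!\d)(\d{7})(?!\d)")  # exactly seven digits

def split_ids(raw):
    """Return (pmid, pmcid) found in a raw id string; None where absent."""
    pmid = PMID_RE.search(raw)
    pmcid = PMCID_RE.search(raw)
    return (pmid.group(1) if pmid else None,
            pmcid.group(1) if pmcid else None)

both = split_ids("23043264 Pmc3506128")
pmc_only = split_ids("Pmc3679557")
neither = split_ids("Pmc In Progress")
```

The negative lookbehind and lookahead stop the 7-digit pattern from matching inside an 8-digit number.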


In [464]:
#show duplicates on id?
data_df[data_df['PMCid'].duplicated(keep=False)].sort_values(by=['PMCid'], ascending=True).head()


Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid
128,Pmc3173209,Asbmb,Journal Of Biological Chemistry,The T Cell Receptor Triggering Apparatus Is Co...,£1286.86,,3173209
155,Pmc3173209,Asbmb,The Journal Of Biological Chemistry,The T-Cel Receptor Triggering Apparatus Is Com...,£1281.15,,3173209
176,22738332 Pmc3381227,Biomed Central,Biomed Central,Long-Term Impact Of Systemic Bacterial Infecti...,£1350.00,22738332.0,3381227
546,22155499 Pmc3381227,Elsevier,Elsevier,Age Related Changes In Microglial Phenotype Va...,£2152.76,22155499.0,3381227
2063,Pmid: 23344974 Pmc3401426,Wiley-Blackwell,Chembiochem,Using A Fragment-Based Approach To Target Prot...,£2048.93,23344974.0,3401426


In [465]:
#change cost column to numeric
data_df["cost_n"] = data_df.cost.str.extract(r"£(.*)")
data_df.cost_n = pd.to_numeric(data_df["cost_n"], errors='coerce')
data_df.cost_n.sum()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


51159218.16

In [466]:
#inspect null costs
data_df[data_df['cost_n'].isnull()]
#the null costs are dollar amounts (e.g. 1674$), which the £ pattern did not match


Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid,cost_n
178,Pmc3398262,Biomed Central,Bmc Biology,Detailed Interrogation Of Trypanosome Cell Bio...,1674$,,3398262.0,
179,Pmid: 24020915 Pmc3846689,Biomed Central,Bmc Genetics,The Physical Capability Of Community-Based Men...,1375.8$,24020915.0,3846689.0,
180,Pmid:23442822,Biomed Central,Bmc Genome Biology,Patterns Of Prokaryotic Lateral Gene Transfers...,2010$,23442822.0,,
181,Pmc2843621,Biomed Central,Bmc Genomics,Trichomonas Vaginalis Vast Bspa-Like Gene Fami...,1204.38$,,2843621.0,
182,Pmcid: Pmc3636053,Biomed Central,Bmc Genomics,Enhancing The Utility Of Proteomics Signature ...,1254.6$,,3636053.0,
183,3526451,Biomed Central,Bmc Genomics,Advances In Genome-Wide Rnai Cellular Screens:...,1476$,,3526451.0,
337,23931322 Pmcid: Pmc3736666,Byophysical Society,Biophysical Journal,Aggregation Modulators Interfer With Membrane ...,671.04$,23931322.0,3736666.0,
1599,23308065,Public Library Of Science,Plos Pathogens,Transmission Of Equine Influenza Virus During ...,1440$,23308065.0,,
1600,Pmid:23633946,Public Library Of Science,Plos Pathogens,The Mnn2 Mannosyltransferase Family Modulates ...,1460.3$,23633946.0,,
1601,Pmc3610638,Public Library Of Science,Plos Pathogens,Dna Break Site At Fragile Subtelomeres Determi...,1476.47$,,3610638.0,


In [467]:
#strip the dollar sign so these costs parse as numbers; no exchange rate is applied, so the values remain in dollars
data_nonquid = data_df[data_df['cost_n'].isnull()]
data_nonquid.loc[:,'cost_n'] = pd.to_numeric(data_nonquid.loc[:,'cost'].str.replace(r"\$", "", regex=True))
data_df.update(data_nonquid, join='left', overwrite=True, filter_func=None, errors='ignore')


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[col] = expressions.where(mask, this, that)
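
Because the step above mixes pound and dollar amounts in one numeric column, it may help to keep the currency alongside the amount so a conversion can be applied later. A sketch using only the standard library follows; `parse_cost` and the "GBP"/"USD" labels are hypothetical, and no exchange rate is assumed here.

```python
import re

# Optional leading pound sign, a decimal number, optional trailing dollar sign.
COST_RE = re.compile(r"^\s*(£)?\s*(\d+(?:\.\d+)?)\s*(\$)?\s*$")

def parse_cost(raw):
    """Return (amount, currency) for strings like '£642.56' or '1674$'."""
    match = COST_RE.match(raw)
    if match is None:
        return (None, None)
    pound, amount, dollar = match.groups()
    currency = "GBP" if pound else "USD" if dollar else None
    return (float(amount), currency)

pounds = parse_cost("£642.56")
dollars = parse_cost("1375.8$")
unknown = parse_cost("n/a")
```

With the currency recorded, dollar rows can be converted with a documented rate or excluded, rather than silently treated as pounds.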


In [468]:
data_df[data_df['cost_n'].isnull()]
#nulls gone now

Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid,cost_n


In [469]:
#confirm cost is numeric now
data_df.dtypes

PMid       object
pub        object
journ      object
art        object
cost       object
PMid2      object
PMCid      object
cost_n    float64
dtype: object

In [470]:
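#separate nulls and nonnulls
#NOTE: pd.isnull keeps the rows where the id is missing, so despite their names the two *notnulls frames below hold rows that lack the id; pd.notnull would select rows that have it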
data_df_PMCnotnulls = data_df[pd.isnull(data_df['PMCid'])]
data_df_PMnotnulls = data_df[pd.isnull(data_df['PMid2'])]                           
data_df_nulls = data_df[pd.isnull(data_df['PMid2']) & pd.isnull(data_df['PMCid'])]
data_df[pd.isnull(data_df['PMid2']) & pd.isnull(data_df['PMCid'])].head()

Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid,cost_n
0,,Cup,Psychological Medicine,Reduced Parahippocampal Cortical Thickness In ...,£0.00,,,0.0
21,,American Chemical Society,Acs Chemical Biology,Discovery Of ?2 Adrenergic Receptor Ligands Us...,£947.07,,,947.07
43,,American Psychiatric Association,American Journal Of Psychiatry,Methamphetamine-Induced Disruption Of Frontost...,£2351.73,,,2351.73
90,Pmc In Progress,American Society For Microbiology,Infection And Immunity,Analysis Of Antibodies To Newly Described Plas...,£2034.00,,,2034.0
93,,American Society For Microbiology,Journal Of Virology,The Human Adenovirus Type 5 L4 Promoter Is Act...,£1312.59,,,1312.59


In [471]:
data_df.shape

(2126, 8)

In [472]:
data_df_PMnotnulls.shape

(1672, 8)

In [473]:
#the counts here are rows MISSING each id (the frames were built with pd.isnull): 479 rows lack a PMCid versus 1672 lacking a PMid2
data_df_PMCnotnulls.shape

(479, 8)

In [474]:
#fill in missing PMcid from other entries
data_df_nulls.loc[:,'PMCid_assumed'] = data_df_nulls.merge(data_df_PMCnotnulls, on='art').loc[:,'PMCid_y']
data_df_nulls.loc[:,'PMid2_assumed'] = data_df_nulls.merge(data_df_PMnotnulls, on='art').loc[:,'PMid2_y']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)


In [475]:
data_df_nulls.head()
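#this wasn't able to fill in the nulls: merge returns a fresh 0..n-1 index, so the assigned values no longer line up with the row labels of data_df_nulls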

Unnamed: 0,PMid,pub,journ,art,cost,PMid2,PMCid,cost_n,PMCid_assumed,PMid2_assumed
0,,Cup,Psychological Medicine,Reduced Parahippocampal Cortical Thickness In ...,£0.00,,,0.0,,
21,,American Chemical Society,Acs Chemical Biology,Discovery Of ?2 Adrenergic Receptor Ligands Us...,£947.07,,,947.07,,
43,,American Psychiatric Association,American Journal Of Psychiatry,Methamphetamine-Induced Disruption Of Frontost...,£2351.73,,,2351.73,,
90,Pmc In Progress,American Society For Microbiology,Infection And Immunity,Analysis Of Antibodies To Newly Described Plas...,£2034.00,,,2034.0,,
93,,American Society For Microbiology,Journal Of Virology,The Human Adenovirus Type 5 L4 Promoter Is Act...,£1312.59,,,1312.59,,
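
An alternative that keeps the original row index is to build a title-to-id lookup and use `Series.map`. The sketch below runs on a made-up frame (`toy`, `known`, and the titles are hypothetical) and assumes titles match exactly, which the near-duplicate titles seen earlier suggest will not always hold in the real data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "art": ["Title A", "Title A", "Title B"],
    "PMCid": ["3173209", np.nan, np.nan],
})

# One known PMCid per article title, taken from rows where it is present.
known = (
    toy.dropna(subset=["PMCid"])
       .drop_duplicates("art")
       .set_index("art")["PMCid"]
)

# map looks each title up in `known` and keeps the original row index,
# so fillna aligns row by row.
toy["PMCid_assumed"] = toy["PMCid"].fillna(toy["art"].map(known))
```

Rows whose title has no known id anywhere in the data simply stay null.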


In [509]:
#drop outliers: keep only rows with cost below 4101
data_removed = data_df.loc[data_df['cost_n'] < 4101]

In [513]:
#winsorise: clamp costs to the 3rd and 97th percentiles
low = .03
high = .97
quantiles = data_df['cost_n'].quantile([low, high])
data_df['cost_n'] = data_df['cost_n'].clip(lower=quantiles[low], upper=quantiles[high])


4100.061875

In [491]:
#winsorise with func
def minmax(series, low=.03, high=.97):
    #clamp every value in the series to its [low, high] quantile range
    quantiles = series.quantile([low, high])
    return series.clip(lower=quantiles[low], upper=quantiles[high])

#no loop needed: clip works on the whole series at once
#data_df['cost_n'] = minmax(data_df['cost_n'])
testd = pd.Series([0,5,5,5,5,5,5,5,6,6,6,6,6,6,6,0])
minmax(testd, .1, .9)

In [530]:
#summarize dataframe (outliers not yet removed here, so extreme costs inflate sum, mean and std)
data_grp_df = data_df_PMnotnulls.groupby('journ').agg({'cost_n': ['sum', 'mean', 'std', 'count']})
data_grp_df.columns = ['sum_of_cost', 'mean_cost', 'std_cost', 'article_count']
data_grp_df.sort_values(by='article_count', inplace=True, ascending=False)
data_grp_df.iloc[:8,:]

journ,sum_of_cost,mean_cost,std_cost,article_count
Plos One,6338831.1,38185.729518,187409.570898,166
Journal Of Biological Chemistry,1062178.73,23603.971778,148860.768915,45
Neuroimage,55470.6,2218.824,268.668926,25
Plos Genetics,2030867.15,96707.959524,300303.644375,21
Nucleic Acids Research,20022.0,1053.789474,370.640823,19
Plos Neglected Tropical Diseases,27741.48,1541.193333,146.178543,18
Proceedings Of The National Academy Of Sciences,14484.85,804.713889,514.742636,18
Nature Communications,1045775.4,65360.9625,249238.443178,16


In [531]:
#same summary on the outlier-trimmed data, now including the median
data_grp_df = data_removed.groupby('journ').agg({'cost_n': ['sum', 'mean', 'median', 'std', 'count']})
data_grp_df.columns = ['sum_of_cost', 'mean_cost', 'median_cost', 'std_cost', 'article_count']
data_grp_df.sort_values(by='article_count', inplace=True, ascending=False)
data_grp_df.iloc[:8,:]


journ,sum_of_cost,mean_cost,median_cost,std_cost,article_count
Plos One,173036.85,945.556557,896.99,174.789313,183
Journal Of Biological Chemistry,74380.25,1430.389423,1301.14,395.035207,52
Neuroimage,64239.88,2215.168276,2326.43,266.653947,29
Nucleic Acids Research,29874.0,1149.0,852.0,442.940447,26
Proceedings Of The National Academy Of Sciences,18118.94,823.588182,742.045,439.216887,22
Plos Pathogens,34603.07,1572.866818,1600.25,161.780891,22
Plos Genetics,36148.44,1643.110909,1712.73,153.366825,22
Plos Neglected Tropical Diseases,31270.38,1563.519,1516.115,156.521088,20
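
The challenge asks for the article count, mean, median, and standard deviation for the five most common journals. The same result can be produced in one chain with named aggregation, which avoids renaming the columns afterwards. This is a sketch on made-up numbers (`toy` is hypothetical, and `head(2)` stands in for `head(5)` because the toy frame is small):

```python
import pandas as pd

toy = pd.DataFrame({
    "journ": ["Plos One", "Plos One", "Plos One", "Neuroimage", "Neuroimage", "Chest"],
    "cost_n": [800.0, 900.0, 1000.0, 2000.0, 2400.0, 1500.0],
})

summary = (
    toy.groupby("journ")["cost_n"]
       # keyword = output column name, value = aggregation to run
       .agg(article_count="count", mean_cost="mean",
            median_cost="median", std_cost="std")
       .sort_values("article_count", ascending=False)
)

top = summary.head(2)  # use head(5) on the real data
```

Note that `std` here is the sample standard deviation (pandas divides by n - 1 by default), so journals with a single article get a null value.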


In [537]:
for number in np.arange(0,11,2):
    print(number)

0
2
4
6
8
10
