## Handling publishers
Publishers are not uniquely identified in Unpaywall's data. This is a problem because, if we want to look at what individual publishers are doing with regard to OA, we might only get part of the picture when a publisher appears under more than one name in the data.

There are a few possible ways to solve this. For example, one could cross-reference with CrossRef and use the CrossRef member IDs to disambiguate publishers, but that might take a lot of time, and there is the further problem that some publishers also have more than one CrossRef member ID.

So here we will try an ad hoc solution: simply searching the complete list of publisher names in the Unpaywall data to find name variants and imprint names for major publishers. This is certain to yield imperfect results, but checking against the major publishers' websites shows that the totals are close enough.

In [44]:
import pymongo

# connect to MongoDB (default host and port) and select the Unpaywall snapshot collection
client = pymongo.MongoClient()
db = client['unpaywall']
coll = db['snapshot']

In [46]:
# Get a list of all the publishers' names
publishers = list(coll.distinct( "publisher" ))
# check how many publishers and look at a few examples
len(publishers), publishers[:10]

(67104,
 [None,
  '',
  '"Advocate Adviser",',
  '"Bazaar Exchange and Mart"',
  '"Buzzacott",',
  '"Cage Birds, ",',
  '"Calpe"',
  '"Canary and Cage-bird Life"',
  '"Canary and cage-bird life"',
  '"Ceylon Observer" Press,'])

In [47]:
# now perform a search for a publisher's name / imprint name and see what comes back
# build on this so that you can consolidate publishers' names
# doing this several times with several different publisher names allows creation of the tools.PublisherNameConsolidator class below
# if you edit this class, you need to copy it to tools.PublisherNameConsolidator.py
[x for x in publishers if 'MDPI' in str(x)]

['MDPI', 'MDPI AG']
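
The substring search above is case-sensitive, so variants such as a lower-case spelling would be missed. A minimal sketch of a case-insensitive version is shown below; the helper name `find_name_variants` and the small `sample` list are illustrative assumptions, not part of the original notebook (the sample values are taken from the outputs shown above).

```python
def find_name_variants(names, needle):
    """Return the sorted, de-duplicated names that contain `needle`, ignoring case.

    `names` may contain None (as the Unpaywall publisher list does); those are skipped.
    """
    needle = needle.lower()
    return sorted({str(n) for n in names if n is not None and needle in str(n).lower()})


# hypothetical stand-in for the full `publishers` list
sample = [None, '', 'MDPI', 'MDPI AG', '"Calpe"']
variants = find_name_variants(sample, 'mdpi')
print(variants)
```

Case-insensitive matching will also return more false positives for short strings, so the results still need to be checked by eye before they are added to the class below.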

In [None]:
"""
This simple class attempts to consolidate all the different names that appear for publishers in Unpaywall data
Consider that using member_ids in CrossRef data might be a smarter way to do this. 
"""

class PublisherNameConsolidator:

    def __init__(self):
        # each key is a consolidated publisher label; each value lists every raw name
        # in the global `publishers` list that is attributed to that publisher
        self.publisher_dct = {
            'CUP': [str(x) for x in publishers
                    if x == 'CUP' or 'Cambridge University Press' in str(x)],
            'OUP': [str(x) for x in publishers
                    if x == 'OUP' or 'Oxford University Press' in str(x)],
            'SAGE': [str(x) for x in publishers if 'SAGE' in str(x)],
            'Elsevier': [
                str(x) for x in publishers
                if any(
                    [y in str(x)  ## Close matches
                     for y in ['Harcourt', 'Morgan Kaufmann', 'Cell Press',
                               'Churchill Livingstone', 'Pergamon', 'B Saunders', 'B. Saunders']]
                    + [z == str(x)  ## Exact matches
                       for z in ['Butterworth-Heinemann Ltd.', 'Academic Press', 'Medicine Publishing']]
                    + ['Elsevier' in str(x)]
                ) and all(
                    [a not in str(x)  ## Exclusions
                     for a in ['V. Masson', 'G. Masson', 'V Masson', 'G Masson']]
                )
            ],
            'Springer': [
                str(x) for x in publishers
                if any(
                    [y in str(x)  ## Close matches
                     for y in ['Palgrave', 'Biomed Central', 'BioMed Central', 'MacMillan', 'Macmillan']]
                    + [z == str(x)  ## Exact matches
                       for z in ['Vogel ', 'Vogel,']]
                    + ['Nature Publishing' in str(x) or 'Springer' in str(x)]
                )
            ],
            'Wiley': [
                str(x) for x in publishers
                if any(
                    [y in str(x)  ## Close matches
                     for y in ['Jossey-Bass', 'Wrox Press', 'Wrightbooks',
                               'Howell Book House', 'Audel', 'Capstone']]
                    + [z == str(x)  ## Exact matches
                       for z in ['Pfeiffer', 'Pfeiffer: A Wiley Imprint']]
                    + ['Wiley' in str(x) or 'Blackwell' in str(x)]
                )
            ],
            'T+F': [
                str(x) for x in publishers
                if any(
                    [y in str(x)  ## Close matches
                     for y in ['Informa ', 'F1000', 'Faculty of 1000', 'Cogent',
                               'Routledge', 'Garland Science', 'Spon Press']]
                    + [z == str(x)  ## Exact matches
                       for z in ['Psychology Press']]
                    + ['Taylor' in str(x) and 'Francis' in str(x)]
                )
            ],
            'MDPI': ['MDPI', 'MDPI AG'],
        }
        # invert the dictionary: raw name -> consolidated label
        # (setdefault keeps the first label if a raw name matches more than one publisher)
        publisher_name_consolidator = dict()
        for key, value in self.publisher_dct.items():
            for string in value:
                publisher_name_consolidator.setdefault(string, key)
        self.publisher_name_consolidator = publisher_name_consolidator
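
The class ends by inverting the label-to-names dictionary into a lookup from raw name to consolidated label. A minimal sketch of that inversion, and of how the lookup might be applied to a single record, follows. The function names `invert_groups` and `consolidate` and the tiny `groups` dictionary are illustrative assumptions, not part of the original notebook; falling back to the raw name for unmatched publishers is also an assumption about the intended use.

```python
def invert_groups(groups):
    """Turn {label: [raw names]} into {raw name: label}.

    setdefault keeps the first label seen, so a raw name listed under two
    labels is assigned to whichever label comes first in `groups`.
    """
    lookup = {}
    for label, names in groups.items():
        for name in names:
            lookup.setdefault(name, label)
    return lookup


def consolidate(name, lookup):
    """Return the consolidated label for `name`, or `name` itself if it is not in the lookup."""
    return lookup.get(str(name), name)


# hypothetical miniature of publisher_dct
groups = {'MDPI': ['MDPI', 'MDPI AG']}
lookup = invert_groups(groups)

print(consolidate('MDPI AG', lookup))
print(consolidate('"Calpe"', lookup))
```

Returning the raw name for unmatched publishers keeps the long tail of small publishers intact instead of collapsing them into a single "other" bucket.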

In [48]:
# test that you can load the object

# from tools import PublisherNameConsolidator

# publisher_name_consolidator = PublisherNameConsolidator().publisher_name_consolidator
# publisher_dct = PublisherNameConsolidator().publisher_dct