# PCAD notebook 3

This notebook processes electronic inventory data from Alma, merging electronic collection IDs and portfolio data into the main dataframe. It also loads, cleans, and merges data from BTAA SPR and Portico (the trusted repositories) into the main dataframe. Finally, it produces a list of MMS IDs for print records in Alma that are candidates for withdrawal based on preliminary matching to PCA and the trusted repositories.

This notebook requires an Alma Analytics report of PCA-eligible portfolios. To produce this report in Alma Analytics:
1. Edit the Analytics report `/shared/University of Minnesota/Twin Cities/PCAD/PCAD-collection-IDs-to-title-info`
2. In the report, change the electronic collection IDs filter to match the e-collection IDs from the current version of the PCA spreadsheet provided by ERM or CMCC.
3. Run the report, save it, and export it as CSV.
4. If the report reaches 400000 lines, run it twice to get around export limits: once with a filter where RCOUNT(1) is less than or equal to 400000, and once where RCOUNT(1) is greater than 400000.
5. There will likely be character encoding issues. The easiest workaround is to open the exported CSV files in Notepad++, change the encoding to UTF-8 under the 'Encoding' menu, and save the files.
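
As an alternative to changing the encoding by hand in a text editor, the re-encoding step could be scripted. The following is a minimal sketch, not part of the original workflow: `reencode_to_utf8` is a hypothetical helper, and the caller must supply the export's actual current encoding, which this sketch does not detect.

```python
from pathlib import Path


def reencode_to_utf8(src_path, dst_path, source_encoding):
    """Rewrite a text file as UTF-8 (hypothetical helper for illustration).

    source_encoding must be the file's real current encoding, for example
    'utf-16' or 'cp1252'; it is an input here, not something the code guesses.
    """
    text = Path(src_path).read_text(encoding=source_encoding)
    Path(dst_path).write_text(text, encoding='utf-8')
    return dst_path
```

The re-encoded file can then be read with `pd.read_csv(..., dtype=str)` as in the cells below.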

Required files/inputs:
- `mmsids_with_ecollids_all_{date}.pkl` file produced by the `get_ecoll_ids.py` script
- Alma Analytics report of PCA portfolios created based on the spreadsheet of PCA-eligible collections
- `p_and_e_{date}.pkl` produced by PCAD notebook 2
- Excel files showing coverage overlap for each trusted repository (BTAA SPR and Portico)

Outputs:
- `all_groups_{date}.pkl` file of main dataframe
- Tab-delimited file `all_p_mms_ids-pcad-preserved_{date}.txt` to create itemized set in Alma of candidate print records
- Various other `.pkl` files to save progress and review (if needed)

In [2]:
import ast
import math
import re
import pandas as pd
import numpy as np
from os.path import splitext
from io import BytesIO
from datetime import date
today = str(date.today()).replace('-','')

In [3]:
#change filename
ecolls = pd.read_pickle('mmsids_with_ecollids_all_20201109.pkl')
ecolls

Unnamed: 0,MMS ID,E Coll ID
0,9976741105001701,61781897120001701
0,9974069275901701,61809184890001701
0,9974069275901701,61781433670001701
0,9974069275901701,61726636620001701
0,9974069275901701,61692584080001701
...,...,...
0,9968274140001701,61535210660001701
0,9969007820001701,61620731210001701
0,9968320030001701,61789849490001701
0,9968320030001701,61645309830001701


In [4]:
#change filename
p_and_e_data = pd.read_pickle('p_and_e_20201109.pkl')
p_and_e_data

Unnamed: 0,record_index,MMS_ID,Title,OCN,ISSN,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,ISSN_cluster,p_or_e,ISSN_group_id,OCN_group_id,both_groups,matches_group_id,ISSN_to_match
0,87349,9935224330001701,Publishers' world.,[988619],[0555-6384],"[2489456, 567791231, 1695359]",[0000-0019],[],[],"[567791231, 1695359, 988619, 2489456]","[0000-0019, 0555-6384]",p,[0i],[35959o],"[0i, 35959o]",0,0555-6384
1,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],"[2150-4008, 0000-0019]",e,[0i],[35959o],"[0i, 35959o]",0,2150-4008
2,58136,9913446020001701,Publishers weekly yearbook,[9604938],[0000-0469],[],[0000-0019],[],[],[9604938],"[0000-0469, 0000-0019]",p,[0i],[35959o],"[0i, 35959o]",0,0000-0469
3,60626,9934112930001701,Publishers weekly,[2489456],[0000-0019],"[37309426, 9604938]","[0000-0019, 000--0019, 0000-0469, 2150-4008]",[],[],"[37309426, 9604938, 2489456]","[2150-4008, 0000-0469, 0000-0019]",p,[0i],[35959o],"[0i, 35959o]",0,0000-0019
4,52478,9937257820001701,The Book publishing annual,[1114932096],[0000-0787],[],[0000-0019],[],[],[1114932096],"[0000-0787, 0000-0019]",p,[0i],[9122o],"[0i, 9122o]",0,0000-0787
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129003,2519,9939323230001701,UNLV gaming law journal.,[666937502],[],[],[],[],[],[666937502],[],p,[48442i],[99928o],"[48442i, 99928o]",101650,
129020,21367,9935136780001701,Northern history,[1760664],[0078-172X],[],[],[],[],[1760664],[0078-172X],p,[9994i],[40621o],"[9994i, 40621o]",101667,0078-172X
129021,105065,9969162920001701,Northern history.,[679734406],[1745-8706],[],[0078-172X],[],[],[679734406],"[0078-172X, 1745-8706]",e,[9994i],[100478o],"[9994i, 100478o]",101667,1745-8706
129074,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],"[0030-5316, 2157-8745]",e,[5104i],[99982o],"[5104i, 99982o]",101697,2157-8745


In [5]:
edf2 = pd.merge(p_and_e_data,ecolls,how='left',left_on='MMS_ID',right_on='MMS ID')
edf2

Unnamed: 0,record_index,MMS_ID,Title,OCN,ISSN,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,ISSN_cluster,p_or_e,ISSN_group_id,OCN_group_id,both_groups,matches_group_id,ISSN_to_match,MMS ID,E Coll ID
0,87349,9935224330001701,Publishers' world.,[988619],[0555-6384],"[2489456, 567791231, 1695359]",[0000-0019],[],[],"[567791231, 1695359, 988619, 2489456]","[0000-0019, 0555-6384]",p,[0i],[35959o],"[0i, 35959o]",0,0555-6384,,
1,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],"[2150-4008, 0000-0019]",e,[0i],[35959o],"[0i, 35959o]",0,2150-4008,9967008940001701,61821822470001701
2,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],"[2150-4008, 0000-0019]",e,[0i],[35959o],"[0i, 35959o]",0,2150-4008,9967008940001701,61816638400001701
3,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],"[2150-4008, 0000-0019]",e,[0i],[35959o],"[0i, 35959o]",0,2150-4008,9967008940001701,61816608130001701
4,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],"[2150-4008, 0000-0019]",e,[0i],[35959o],"[0i, 35959o]",0,2150-4008,9967008940001701,61815504800001701
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102032,105065,9969162920001701,Northern history.,[679734406],[1745-8706],[],[0078-172X],[],[],[679734406],"[0078-172X, 1745-8706]",e,[9994i],[100478o],"[9994i, 100478o]",101667,1745-8706,9969162920001701,61549770130001701
102033,105065,9969162920001701,Northern history.,[679734406],[1745-8706],[],[0078-172X],[],[],[679734406],"[0078-172X, 1745-8706]",e,[9994i],[100478o],"[9994i, 100478o]",101667,1745-8706,9969162920001701,61535213430001701
102034,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],"[0030-5316, 2157-8745]",e,[5104i],[99982o],"[5104i, 99982o]",101697,2157-8745,9977101571201701,61808527020001701
102035,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],"[0030-5316, 2157-8745]",e,[5104i],[99982o],"[5104i, 99982o]",101697,2157-8745,9977101571201701,61808506700001701
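
Because one bibliographic record can belong to several electronic collections, the left merge above repeats a record once per matching e-collection ID, which is why the same title appears on several consecutive rows. A toy illustration of that behavior, using invented IDs rather than the real data:

```python
import pandas as pd

# Toy stand-ins for p_and_e_data and ecolls; the IDs are invented for illustration.
bibs = pd.DataFrame({'MMS_ID': ['A1', 'B2'], 'p_or_e': ['p', 'e']})
colls = pd.DataFrame({'MMS ID': ['B2', 'B2'], 'E Coll ID': ['C10', 'C20']})

merged = pd.merge(bibs, colls, how='left', left_on='MMS_ID', right_on='MMS ID')

# 'A1' has no e-collection, so it is kept once with missing values;
# 'B2' matches two collections, so it appears on two rows.
print(merged)
```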


### Merge PCAD data from Analytics into the e dataframe

In [6]:
#change filename
pcad0 = pd.read_csv('PCAD-collection-IDs-to-title-info-1.csv',sep=',',dtype=str)
pcad0

Unnamed: 0,Electronic Collection Id,Electronic Collection Public Name,Electronic Collection Public Name (override),MMS Id,Title,Begin Publication Date,End Publication Date,ISSN,Coverage Information Combined,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active)
0,61535209430001701,American Medical Association Backfiles,,9966629010001701,Archives of ophthalmology.,1879,1950,2375-057X; 0093-0326,Available from 1929 volume: 1 issue: 1 until ...,Only global,53536123280001701,In Repository,,,,
1,61535209430001701,American Medical Association Backfiles,,9966757510001701,Archives of surgery.,1920,1950,2376-3590; 0272-5533,Available from 1920 volume: 1 issue: 1 until ...,Only global,53536420410001701,In Repository,,,,
2,61535209430001701,American Medical Association Backfiles,,9966759170001701,The archives of internal medicine.,1908,1950,0730-188X,Available from 1908 volume: 1 issue: 1 until ...,Only global,53536407060001701,In Repository,,,,
3,61535209430001701,American Medical Association Backfiles,,9966774590001701,Archives of otolaryngology.,1960,1985,2376-3817; 0003-9977,Available from 1960 volume: 72 issue: 1 until...,Only global,53536464550001701,In Repository,,,,
4,61535209430001701,American Medical Association Backfiles,,9966774910001701,American journal of diseases of children.,1960,1993,2374-3018; 0002-922X,Available from 1960 volume: 100 issue: 1 unti...,Only global,53536464180001701,In Repository,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42466,61816118310001701,Mary Ann Liebert Publishers Journals,Mary Ann Liebert Backfile,9977118809701701,Central nervous system trauma.,1984,1987,0737-5999,Available from 1984 volume: 1 issue: 1 until ...,Only global,53816118110001701,In Repository,,,,
42467,61816118310001701,Mary Ann Liebert Publishers Journals,Mary Ann Liebert Backfile,9977118809801701,CyberPsychology and behavior.,1998,2009,1557-8364; 1094-9313,Available from 1998 volume: 1 issue: 1 until ...,Only global,53816118140001701,In Repository,,,,
42468,61816118310001701,Mary Ann Liebert Publishers Journals,Mary Ann Liebert Backfile,9977118809901701,AIDS patient care.,1987,1995,0893-5068; 0893-5068,Available from 1987 volume: 1 issue: 1 until ...,Only global,53816118210001701,In Repository,,,,
42469,61816598280001701,Wiley Online Library Surgery Backfiles,,9966956150001701,British journal of surgery.,1913,9999,1365-2168; 0007-1323,Available from 1913 volume: 1 issue: 1 until ...,Only local,53816598250001701,In Repository,,,,


In [None]:
'''Run this cell only if Analytics report exceeded 400000 lines
pcad1 = pd.read_csv('PCAD-collection-IDs-to-title-info-1-20180809.csv',sep='\t',dtype=str)
pcad = pd.concat([pcad0,pcad1],ignore_index=True)
pcad.drop_duplicates(inplace=True)
pcad'''
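
If the report had to be exported in several parts, the parts could also be combined with a small helper. This is a sketch only: `load_report_parts` is a hypothetical name, and it assumes every part uses the same delimiter and the same column headers.

```python
import pandas as pd


def load_report_parts(paths, sep=','):
    """Read one or more exported report parts and stack them into one dataframe."""
    parts = [pd.read_csv(path, sep=sep, dtype=str) for path in paths]
    combined = pd.concat(parts, ignore_index=True)
    # Rows that appear in more than one part are kept only once.
    return combined.drop_duplicates().reset_index(drop=True)
```

Passing a single-element list returns that one report with any duplicate rows removed, so the same call works whether or not the export was split.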

#### Remove rows where electronic lifecycle is "Deleted"

In [7]:
pcad = pcad0
pcad = pcad[pcad['Lifecycle'] != 'Deleted']
pcad

Unnamed: 0,Electronic Collection Id,Electronic Collection Public Name,Electronic Collection Public Name (override),MMS Id,Title,Begin Publication Date,End Publication Date,ISSN,Coverage Information Combined,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active)
0,61535209430001701,American Medical Association Backfiles,,9966629010001701,Archives of ophthalmology.,1879,1950,2375-057X; 0093-0326,Available from 1929 volume: 1 issue: 1 until ...,Only global,53536123280001701,In Repository,,,,
1,61535209430001701,American Medical Association Backfiles,,9966757510001701,Archives of surgery.,1920,1950,2376-3590; 0272-5533,Available from 1920 volume: 1 issue: 1 until ...,Only global,53536420410001701,In Repository,,,,
2,61535209430001701,American Medical Association Backfiles,,9966759170001701,The archives of internal medicine.,1908,1950,0730-188X,Available from 1908 volume: 1 issue: 1 until ...,Only global,53536407060001701,In Repository,,,,
3,61535209430001701,American Medical Association Backfiles,,9966774590001701,Archives of otolaryngology.,1960,1985,2376-3817; 0003-9977,Available from 1960 volume: 72 issue: 1 until...,Only global,53536464550001701,In Repository,,,,
4,61535209430001701,American Medical Association Backfiles,,9966774910001701,American journal of diseases of children.,1960,1993,2374-3018; 0002-922X,Available from 1960 volume: 100 issue: 1 unti...,Only global,53536464180001701,In Repository,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42466,61816118310001701,Mary Ann Liebert Publishers Journals,Mary Ann Liebert Backfile,9977118809701701,Central nervous system trauma.,1984,1987,0737-5999,Available from 1984 volume: 1 issue: 1 until ...,Only global,53816118110001701,In Repository,,,,
42467,61816118310001701,Mary Ann Liebert Publishers Journals,Mary Ann Liebert Backfile,9977118809801701,CyberPsychology and behavior.,1998,2009,1557-8364; 1094-9313,Available from 1998 volume: 1 issue: 1 until ...,Only global,53816118140001701,In Repository,,,,
42468,61816118310001701,Mary Ann Liebert Publishers Journals,Mary Ann Liebert Backfile,9977118809901701,AIDS patient care.,1987,1995,0893-5068; 0893-5068,Available from 1987 volume: 1 issue: 1 until ...,Only global,53816118210001701,In Repository,,,,
42469,61816598280001701,Wiley Online Library Surgery Backfiles,,9966956150001701,British journal of surgery.,1913,9999,1365-2168; 0007-1323,Available from 1913 volume: 1 issue: 1 until ...,Only local,53816598250001701,In Repository,,,,


#### Read in PCAD collections data from spreadsheet

In [8]:
#change filename; may need to add or remove renaming pairs depending on headers in PCAD collections spreadsheet
pcad_ynm = pd.read_excel('PCAD II Final_summary.xlsx',dtype=str)
pcad_ynm.rename(columns={'Collection ID':'Electronic Collection Id', 'PCA&ILL': 'PCAD?'},inplace=True)
pcad_ynm

Unnamed: 0,Electronic Collection Id,Collection Public Name,PCAD?
0,61757894800001701,ACSESS Digital Library,Yes
1,61647584340001701,AIAA Aerospace Research Central Journals,Yes
2,61549768600001701,AIP Digital Archive,Yes
3,61535215970001701,American Chemical Society Legacy Archive,Yes
4,61535209430001701,American Medical Association Backfiles,Yes
...,...,...,...
195,61816598280001701,Wiley Online Library Surgery Backfiles,Yes
196,61535213310001701,Wiley Online Library Veterinary Backfiles,Yes
197,61644185620001701,Women''s Magazine Archive I,Yes
198,61618810070001701,Women''s Wear Daily Archive,Yes


In [9]:
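# .copy() makes an independent dataframe so a column can be added to it later without a SettingWithCopyWarning
pcad_ynm = pcad_ynm[['Electronic Collection Id','Collection Public Name','PCAD?']].copy()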
pcad_ynm

Unnamed: 0,Electronic Collection Id,Collection Public Name,PCAD?
0,61757894800001701,ACSESS Digital Library,Yes
1,61647584340001701,AIAA Aerospace Research Central Journals,Yes
2,61549768600001701,AIP Digital Archive,Yes
3,61535215970001701,American Chemical Society Legacy Archive,Yes
4,61535209430001701,American Medical Association Backfiles,Yes
...,...,...,...
195,61816598280001701,Wiley Online Library Surgery Backfiles,Yes
196,61535213310001701,Wiley Online Library Veterinary Backfiles,Yes
197,61644185620001701,Women''s Magazine Archive I,Yes
198,61618810070001701,Women''s Wear Daily Archive,Yes


In [10]:
# Map a collection name to a vendor label by case-insensitive substring match
def code_vendor(collection_public_name):
    x = collection_public_name.lower()
    if 'wiley' in x:
        return 'Wiley'
    elif 'elsevier' in x:
        return 'Elsevier'
    elif 'jstor' in x:
        return 'JSTOR'
    elif 'sage' in x:
        return 'SAGE'
    elif 'springer' in x:
        return 'Springer'
    elif 'taylor & francis' in x:
        return 'Taylor & Francis'
    else:
        return 'other'
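
The same keyword matching can be written as a lookup table, which may be easier to extend when vendors are added. This is an alternative sketch rather than the notebook's own function; `VENDOR_KEYWORDS` and `code_vendor_from_table` are hypothetical names.

```python
# (keyword, label) pairs checked in order, mirroring the if/elif chain above.
VENDOR_KEYWORDS = [
    ('wiley', 'Wiley'),
    ('elsevier', 'Elsevier'),
    ('jstor', 'JSTOR'),
    ('sage', 'SAGE'),
    ('springer', 'Springer'),
    ('taylor & francis', 'Taylor & Francis'),
]


def code_vendor_from_table(collection_public_name):
    """Return the first vendor label whose keyword occurs in the name, else 'other'."""
    name = collection_public_name.lower()
    for keyword, label in VENDOR_KEYWORDS:
        if keyword in name:
            return label
    return 'other'


print(code_vendor_from_table('Wiley Online Library Surgery Backfiles'))  # Wiley
print(code_vendor_from_table('AIP Digital Archive'))  # other
```

As in the if/elif version, matching is by substring, so a short keyword such as `sage` would also match a collection name that merely contains those letters (for example, a name containing "message").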

In [11]:
pcad_ynm['Vendor_key'] = pcad_ynm['Collection Public Name'].apply(code_vendor)
pcad_ynm

Unnamed: 0,Electronic Collection Id,Collection Public Name,PCAD?,Vendor_key
0,61757894800001701,ACSESS Digital Library,Yes,other
1,61647584340001701,AIAA Aerospace Research Central Journals,Yes,other
2,61549768600001701,AIP Digital Archive,Yes,other
3,61535215970001701,American Chemical Society Legacy Archive,Yes,other
4,61535209430001701,American Medical Association Backfiles,Yes,other
...,...,...,...,...
195,61816598280001701,Wiley Online Library Surgery Backfiles,Yes,Wiley
196,61535213310001701,Wiley Online Library Veterinary Backfiles,Yes,Wiley
197,61644185620001701,Women''s Magazine Archive I,Yes,other
198,61618810070001701,Women''s Wear Daily Archive,Yes,other


In [12]:
pcad_ynm['Vendor_key'].value_counts()

other               75
Wiley               52
JSTOR               23
Taylor & Francis    18
SAGE                14
Springer            13
Elsevier             5
Name: Vendor_key, dtype: int64

In [13]:
pcad = pd.merge(pcad,pcad_ynm,how='inner',on='Electronic Collection Id')
pcad

Unnamed: 0,Electronic Collection Id,Electronic Collection Public Name,Electronic Collection Public Name (override),MMS Id,Title,Begin Publication Date,End Publication Date,ISSN,Coverage Information Combined,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active),Collection Public Name,PCAD?,Vendor_key
0,61535209430001701,American Medical Association Backfiles,,9966629010001701,Archives of ophthalmology.,1879,1950,2375-057X; 0093-0326,Available from 1929 volume: 1 issue: 1 until ...,Only global,53536123280001701,In Repository,,,,,American Medical Association Backfiles,Yes,other
1,61535209430001701,American Medical Association Backfiles,,9966757510001701,Archives of surgery.,1920,1950,2376-3590; 0272-5533,Available from 1920 volume: 1 issue: 1 until ...,Only global,53536420410001701,In Repository,,,,,American Medical Association Backfiles,Yes,other
2,61535209430001701,American Medical Association Backfiles,,9966759170001701,The archives of internal medicine.,1908,1950,0730-188X,Available from 1908 volume: 1 issue: 1 until ...,Only global,53536407060001701,In Repository,,,,,American Medical Association Backfiles,Yes,other
3,61535209430001701,American Medical Association Backfiles,,9966774590001701,Archives of otolaryngology.,1960,1985,2376-3817; 0003-9977,Available from 1960 volume: 72 issue: 1 until...,Only global,53536464550001701,In Repository,,,,,American Medical Association Backfiles,Yes,other
4,61535209430001701,American Medical Association Backfiles,,9966774910001701,American journal of diseases of children.,1960,1993,2374-3018; 0002-922X,Available from 1960 volume: 100 issue: 1 unti...,Only global,53536464180001701,In Repository,,,,,American Medical Association Backfiles,Yes,other
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32664,61816118310001701,Mary Ann Liebert Publishers Journals,Mary Ann Liebert Backfile,9977118809701701,Central nervous system trauma.,1984,1987,0737-5999,Available from 1984 volume: 1 issue: 1 until ...,Only global,53816118110001701,In Repository,,,,,Mary Ann Liebert Publishers Journals,Yes,other
32665,61816118310001701,Mary Ann Liebert Publishers Journals,Mary Ann Liebert Backfile,9977118809801701,CyberPsychology and behavior.,1998,2009,1557-8364; 1094-9313,Available from 1998 volume: 1 issue: 1 until ...,Only global,53816118140001701,In Repository,,,,,Mary Ann Liebert Publishers Journals,Yes,other
32666,61816118310001701,Mary Ann Liebert Publishers Journals,Mary Ann Liebert Backfile,9977118809901701,AIDS patient care.,1987,1995,0893-5068; 0893-5068,Available from 1987 volume: 1 issue: 1 until ...,Only global,53816118210001701,In Repository,,,,,Mary Ann Liebert Publishers Journals,Yes,other
32667,61816598280001701,Wiley Online Library Surgery Backfiles,,9966956150001701,British journal of surgery.,1913,9999,1365-2168; 0007-1323,Available from 1913 volume: 1 issue: 1 until ...,Only local,53816598250001701,In Repository,,,,,Wiley Online Library Surgery Backfiles,Yes,Wiley


In [14]:
df_pcad = pd.merge(edf2, pcad, how='left', left_on=['MMS_ID','E Coll ID'],
                   right_on=['MMS Id','Electronic Collection Id'],suffixes=['_bib','_portfolio'])
df_pcad

Unnamed: 0,record_index,MMS_ID,Title_bib,OCN,ISSN_bib,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,...,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active),Collection Public Name,PCAD?,Vendor_key
0,87349,9935224330001701,Publishers' world.,[988619],[0555-6384],"[2489456, 567791231, 1695359]",[0000-0019],[],[],"[567791231, 1695359, 988619, 2489456]",...,,,,,,,,,,
1,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],...,,,,,,,,,,
2,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],...,,,,,,,,,,
3,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],...,,,,,,,,,,
4,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102154,105065,9969162920001701,Northern history.,[679734406],[1745-8706],[],[0078-172X],[],[],[679734406],...,,,,,,,,,,
102155,105065,9969162920001701,Northern history.,[679734406],[1745-8706],[],[0078-172X],[],[],[679734406],...,,,,,,,,,,
102156,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],...,Only global,53815435190001701,In Repository,,,,,"Taylor & Francis Biological, Earth & Environme...",Yes,Taylor & Francis
102157,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],...,Only local,53815442430001701,In Repository,,,,,"Taylor & Francis Biological, Earth, Environmen...",Yes,Taylor & Francis


In [15]:
df_pcad['Vendor_key'].value_counts()

Wiley               5404
other               2075
JSTOR               1966
Elsevier            1372
SAGE                1020
Taylor & Francis     921
Springer             581
Name: Vendor_key, dtype: int64

#### Filter for rows with some kind of PCAD indicator

In [16]:
edf_pcad = df_pcad[(df_pcad['p_or_e'] == 'e') & (df_pcad['PCAD?'].isin(['Yes','Maybe']))]
edf_pcad

Unnamed: 0,record_index,MMS_ID,Title_bib,OCN,ISSN_bib,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,...,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active),Collection Public Name,PCAD?,Vendor_key
24,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,Only global,53588425000001701,In Repository,>=,,4,Waiting for Renewal,JSTOR Life Sciences Collection,Yes,JSTOR
27,128733,9968441380001701,IEEE transactions on ultrasonics engineering,[66903667],[2162-1373],[],[0893-6706],[],[],[66903667],...,Only global,53620359120001701,In Repository,,,,,IEEE/IET Electronic Library (IEL) Journals,Yes,other
34,112352,9967759560001701,Journal of the University Film Association,[560253122],[2328-1944],[],[0041-9311],[],[],[560253122],...,Only global,53538633520001701,In Repository,,,,Waiting for Renewal,JSTOR Arts and Sciences V,Yes,JSTOR
37,107439,9967145520001701,Journal of the University Film and Video Assoc...,[669053443],[2328-1936],[],[0734-919X],[],[],[669053443],...,Only global,53537295580001701,In Repository,,,,Waiting for Renewal,JSTOR Arts and Sciences V,Yes,JSTOR
51,111951,9968429800001701,Journal of the Institute of Actuaries,[669212538],[2058-1009],[],[0020-2681],[],[],[669212538],...,Only global,53747073660001701,In Repository,,,,,JSTOR Arts and Sciences X,Yes,JSTOR
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102066,123828,9968665290001701,Proceedings and addresses of the American Phil...,[58839985],[2325-9248],[],[0065-972X],[],[],[58839985],...,Only global,53540700110001701,In Repository,>=,,4,Waiting for Renewal,JSTOR Arts and Sciences VII,Yes,JSTOR
102111,117510,9967987860001701,The Americas - Academy of American Franciscan ...,[45202766],[1533-6247],[],[0003-1615],[],[],[45202766],...,Only global,53539160390001701,In Repository,>=,,6,Waiting for Renewal,JSTOR Arts and Sciences VII,Yes,JSTOR
102114,125537,9968947900001701,International journal of adhesion and adhesives,[38995556],[1879-0127],[],[0143-7496],[],[],[38995556],...,Only local,53624617390001701,In Repository,,,,,Elsevier ScienceDirect Journals Complete,Yes,Elsevier
102156,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],...,Only global,53815435190001701,In Repository,,,,,"Taylor & Francis Biological, Earth & Environme...",Yes,Taylor & Francis


In [17]:
edf_pcad.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'OCN', 'ISSN_bib',
       'Related_OCNs', 'Related_ISSNs', 'Vol_nos', 'Gov_doc_nos',
       'OCN_cluster', 'ISSN_cluster', 'p_or_e', 'ISSN_group_id',
       'OCN_group_id', 'both_groups', 'matches_group_id', 'ISSN_to_match',
       'MMS ID', 'E Coll ID', 'Electronic Collection Id',
       'Electronic Collection Public Name',
       'Electronic Collection Public Name (override)', 'MMS Id',
       'Title_portfolio', 'Begin Publication Date', 'End Publication Date',
       'ISSN_portfolio', 'Coverage Information Combined', 'Coverage Statement',
       'Portfolio Id', 'Lifecycle', 'Embargo Operator', 'Embargo Months',
       'Embargo Years', 'Status (Active)', 'Collection Public Name', 'PCAD?',
       'Vendor_key'],
      dtype='object')

In [18]:
epcad_groups = list(edf_pcad['matches_group_id'])
print(len(epcad_groups))
# Preview only the first 10 group IDs rather than printing the full list
epcad_groups[:10]

13339


[2, 5, 7, 7, 12, 12, 20, 77, 83, 92]

In [19]:
epcad_data = df_pcad[df_pcad['matches_group_id'].isin(epcad_groups)]
epcad_data

Unnamed: 0,record_index,MMS_ID,Title_bib,OCN,ISSN_bib,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,...,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active),Collection Public Name,PCAD?,Vendor_key
14,69875,9939153590001701,The Wilson journal of ornithology.,[62791211],[1559-4491],[],[],[],[],[62791211],...,,,,,,,,,,
15,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
16,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
17,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
18,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102118,125537,9968947900001701,International journal of adhesion and adhesives,[38995556],[1879-0127],[],[0143-7496],[],[],[38995556],...,,,,,,,,,,
102119,125537,9968947900001701,International journal of adhesion and adhesives,[38995556],[1879-0127],[],[0143-7496],[],[],[38995556],...,,,,,,,,,,
102156,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],...,Only global,53815435190001701,In Repository,,,,,"Taylor & Francis Biological, Earth & Environme...",Yes,Taylor & Francis
102157,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],...,Only local,53815442430001701,In Repository,,,,,"Taylor & Francis Biological, Earth, Environmen...",Yes,Taylor & Francis


In [20]:
no_pcad_data = df_pcad[~df_pcad['matches_group_id'].isin(epcad_groups)]
no_pcad_data

Unnamed: 0,record_index,MMS_ID,Title_bib,OCN,ISSN_bib,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,...,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active),Collection Public Name,PCAD?,Vendor_key
0,87349,9935224330001701,Publishers' world.,[988619],[0555-6384],"[2489456, 567791231, 1695359]",[0000-0019],[],[],"[567791231, 1695359, 988619, 2489456]",...,,,,,,,,,,
1,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],...,,,,,,,,,,
2,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],...,,,,,,,,,,
3,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],...,,,,,,,,,,
4,105258,9967008940001701,Publishers weekly,[37309426],[2150-4008],[],[0000-0019],[],[],[37309426],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102151,21367,9935136780001701,Northern history,[1760664],[0078-172X],[],[],[],[],[1760664],...,,,,,,,,,,
102152,105065,9969162920001701,Northern history.,[679734406],[1745-8706],[],[0078-172X],[],[],[679734406],...,,,,,,,,,,
102153,105065,9969162920001701,Northern history.,[679734406],[1745-8706],[],[0078-172X],[],[],[679734406],...,,,,,,,,,,
102154,105065,9969162920001701,Northern history.,[679734406],[1745-8706],[],[0078-172X],[],[],[679734406],...,,,,,,,,,,


In [21]:
no_pcad_data['matches_group_id'].nunique()

10280
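
The cells above split `df_pcad` by whether a row's `matches_group_id` belongs to a group that contains at least one PCAD e-record, so print rows travel with their matching electronic rows. A toy illustration with invented values, assuming the same column names:

```python
import pandas as pd

toy = pd.DataFrame({
    'matches_group_id': [1, 1, 2, 3, 3],
    'p_or_e': ['p', 'e', 'p', 'p', 'e'],
    'PCAD?': [None, 'Yes', None, None, None],
})

# Groups that contain at least one electronic row flagged Yes or Maybe.
is_pcad_e = (toy['p_or_e'] == 'e') & toy['PCAD?'].isin(['Yes', 'Maybe'])
pcad_groups = set(toy.loc[is_pcad_e, 'matches_group_id'])

in_pcad_group = toy['matches_group_id'].isin(pcad_groups)
with_pcad = toy[in_pcad_group]      # every row of group 1, print and electronic
without_pcad = toy[~in_pcad_group]  # groups 2 and 3
```

Using a set for the group IDs removes the repeated values that a plain list of `matches_group_id` would carry; the `isin` result is the same either way.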

In [22]:
df_epcad = epcad_data

In [23]:
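# Check for the literal string 'nan' in the PCAD? column; true missing values (NaN) left by the merge do not match this comparison
df_epcad[df_epcad['PCAD?'] == 'nan']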

Unnamed: 0,record_index,MMS_ID,Title_bib,OCN,ISSN_bib,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,...,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active),Collection Public Name,PCAD?,Vendor_key


In [24]:
# Drop any rows holding the literal string 'nan'; rows whose PCAD? value is missing (NaN) are kept
df_epcad = df_epcad[df_epcad['PCAD?'] != 'nan']
df_epcad

Unnamed: 0,record_index,MMS_ID,Title_bib,OCN,ISSN_bib,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,...,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active),Collection Public Name,PCAD?,Vendor_key
14,69875,9939153590001701,The Wilson journal of ornithology.,[62791211],[1559-4491],[],[],[],[],[62791211],...,,,,,,,,,,
15,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
16,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
17,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
18,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102118,125537,9968947900001701,International journal of adhesion and adhesives,[38995556],[1879-0127],[],[0143-7496],[],[],[38995556],...,,,,,,,,,,
102119,125537,9968947900001701,International journal of adhesion and adhesives,[38995556],[1879-0127],[],[0143-7496],[],[],[38995556],...,,,,,,,,,,
102156,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],...,Only global,53815435190001701,In Repository,,,,,"Taylor & Francis Biological, Earth & Environme...",Yes,Taylor & Francis
102157,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],...,Only local,53815442430001701,In Repository,,,,,"Taylor & Francis Biological, Earth, Environmen...",Yes,Taylor & Francis


#### Find groups with p and e, p only, and e only

In [28]:
find_only_p_or_e = df_epcad[['record_index','matches_group_id','p_or_e','MMS_ID','Title_bib']]
find_only_p_or_e

Unnamed: 0,record_index,matches_group_id,p_or_e,MMS_ID,Title_bib
14,69875,2,p,9939153590001701,The Wilson journal of ornithology.
15,113344,2,e,9968879310001701,The Wilson journal of ornithology
16,113344,2,e,9968879310001701,The Wilson journal of ornithology
17,113344,2,e,9968879310001701,The Wilson journal of ornithology
18,113344,2,e,9968879310001701,The Wilson journal of ornithology
...,...,...,...,...,...
102118,125537,101582,e,9968947900001701,International journal of adhesion and adhesives
102119,125537,101582,e,9968947900001701,International journal of adhesion and adhesives
102156,114862,101697,e,9977101571201701,Oriental insects
102157,114862,101697,e,9977101571201701,Oriental insects


In [29]:
# Reassign instead of using inplace=True on a slice, which raises a SettingWithCopyWarning
find_only_p_or_e = find_only_p_or_e.drop_duplicates()
find_only_p_or_e

Unnamed: 0,record_index,matches_group_id,p_or_e,MMS_ID,Title_bib
14,69875,2,p,9939153590001701,The Wilson journal of ornithology.
15,113344,2,e,9968879310001701,The Wilson journal of ornithology
25,57684,5,p,9963550760001701,IEEE transactions on ultrasonics engineering
26,128733,5,e,9968441380001701,IEEE transactions on ultrasonics engineering
32,112352,7,e,9967759560001701,Journal of the University Film Association
...,...,...,...,...,...
102112,124654,101581,e,9924328340001701,Americas (Online)
102113,31683,101582,p,9957960200001701,International journal of adhesion and adhesives
102114,125537,101582,e,9968947900001701,International journal of adhesion and adhesives
102156,114862,101697,e,9977101571201701,Oriental insects


In [30]:
# Collapse each match group to one row; every other column becomes a list of its unique values
find_only_p_or_e = find_only_p_or_e.groupby(['matches_group_id']).agg(lambda x: list(set(x))).reset_index()
find_only_p_or_e

Unnamed: 0,matches_group_id,record_index,p_or_e,MMS_ID,Title_bib
0,2,"[113344, 69875]","[e, p]","[9939153590001701, 9968879310001701]","[The Wilson journal of ornithology, The Wilson..."
1,5,"[57684, 128733]","[e, p]","[9963550760001701, 9968441380001701]",[IEEE transactions on ultrasonics engineering]
2,7,"[112352, 78492, 107439]","[e, p]","[9967759560001701, 9933370000001701, 996714552...","[Journal of the University Film Association, J..."
3,12,"[88618, 111951]","[e, p]","[9939481760001701, 9968429800001701]",[Journal of the Institute of Actuaries]
4,20,"[122066, 89596]","[e, p]","[9963135560001701, 9975966292301701]","[Journal of tissue culture methods., Journal o..."
...,...,...,...,...,...
6652,101504,"[113874, 3295]","[e, p]","[9974799725001701, 9953296770001701]",[Nomos]
6653,101522,"[41280, 26122, 118219, 61904, 18706, 123828, 8...","[e, p]","[9959156260001701, 9967988520001701, 991429894...","[Year book - American Philosophical Society, P..."
6654,101581,"[17960, 124654, 117510]","[e, p]","[9924328340001701, 9946768760001701, 996798786...",[The Americas - Academy of American Franciscan...
6655,101582,"[125537, 31683]","[e, p]","[9957960200001701, 9968947900001701]",[International journal of adhesion and adhesives]


In [31]:
find_only_p_or_e['p_or_e'] = find_only_p_or_e['p_or_e'].apply(lambda x: sorted(x))
find_only_p_or_e

Unnamed: 0,matches_group_id,record_index,p_or_e,MMS_ID,Title_bib
0,2,"[113344, 69875]","[e, p]","[9939153590001701, 9968879310001701]","[The Wilson journal of ornithology, The Wilson..."
1,5,"[57684, 128733]","[e, p]","[9963550760001701, 9968441380001701]",[IEEE transactions on ultrasonics engineering]
2,7,"[112352, 78492, 107439]","[e, p]","[9967759560001701, 9933370000001701, 996714552...","[Journal of the University Film Association, J..."
3,12,"[88618, 111951]","[e, p]","[9939481760001701, 9968429800001701]",[Journal of the Institute of Actuaries]
4,20,"[122066, 89596]","[e, p]","[9963135560001701, 9975966292301701]","[Journal of tissue culture methods., Journal o..."
...,...,...,...,...,...
6652,101504,"[113874, 3295]","[e, p]","[9974799725001701, 9953296770001701]",[Nomos]
6653,101522,"[41280, 26122, 118219, 61904, 18706, 123828, 8...","[e, p]","[9959156260001701, 9967988520001701, 991429894...","[Year book - American Philosophical Society, P..."
6654,101581,"[17960, 124654, 117510]","[e, p]","[9924328340001701, 9946768760001701, 996798786...",[The Americas - Academy of American Franciscan...
6655,101582,"[125537, 31683]","[e, p]","[9957960200001701, 9968947900001701]",[International journal of adhesion and adhesives]


In [32]:
# Join the sorted p/e flags into one string per group: 'p', 'e', or 'e p'
find_only_p_or_e['pore'] = find_only_p_or_e['p_or_e'].apply(lambda x: ' '.join(x))
find_only_p_or_e

Unnamed: 0,matches_group_id,record_index,p_or_e,MMS_ID,Title_bib,pore
0,2,"[113344, 69875]","[e, p]","[9939153590001701, 9968879310001701]","[The Wilson journal of ornithology, The Wilson...",e p
1,5,"[57684, 128733]","[e, p]","[9963550760001701, 9968441380001701]",[IEEE transactions on ultrasonics engineering],e p
2,7,"[112352, 78492, 107439]","[e, p]","[9967759560001701, 9933370000001701, 996714552...","[Journal of the University Film Association, J...",e p
3,12,"[88618, 111951]","[e, p]","[9939481760001701, 9968429800001701]",[Journal of the Institute of Actuaries],e p
4,20,"[122066, 89596]","[e, p]","[9963135560001701, 9975966292301701]","[Journal of tissue culture methods., Journal o...",e p
...,...,...,...,...,...,...
6652,101504,"[113874, 3295]","[e, p]","[9974799725001701, 9953296770001701]",[Nomos],e p
6653,101522,"[41280, 26122, 118219, 61904, 18706, 123828, 8...","[e, p]","[9959156260001701, 9967988520001701, 991429894...","[Year book - American Philosophical Society, P...",e p
6654,101581,"[17960, 124654, 117510]","[e, p]","[9924328340001701, 9946768760001701, 996798786...",[The Americas - Academy of American Franciscan...,e p
6655,101582,"[125537, 31683]","[e, p]","[9957960200001701, 9968947900001701]",[International journal of adhesion and adhesives],e p


In [33]:
find_only_p_or_e['pore'].value_counts()

e p    6657
Name: pore, dtype: int64
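
The grouping, sorting, and joining steps above can also be expressed as a single aggregation. The following is only a minimal sketch on hypothetical toy data: the `toy` frame and the `classify_groups` helper are illustrative names and are not part of this notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for df_epcad; only the two columns used here.
toy = pd.DataFrame({
    "matches_group_id": [2, 2, 2, 5, 9, 9],
    "p_or_e": ["p", "e", "e", "p", "e", "e"],
})

def classify_groups(frame):
    """Return a Series mapping each group ID to 'p', 'e', or 'e p'."""
    return (
        frame.groupby("matches_group_id")["p_or_e"]
        .agg(lambda s: " ".join(sorted(set(s))))
    )

pore = classify_groups(toy)
print(pore.value_counts())
```

Sorting the unique flags before joining keeps the label stable, so a group with both record types is always labeled `e p` and never `p e`.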

In [34]:
only_p = find_only_p_or_e[find_only_p_or_e['pore'] == 'p']
only_p

Unnamed: 0,matches_group_id,record_index,p_or_e,MMS_ID,Title_bib,pore


In [35]:
only_p_groups = list(only_p['matches_group_id'])
print(len(only_p_groups))
only_p_groups

0


[]

In [36]:
p_only_data = df_epcad[df_epcad['matches_group_id'].isin(only_p_groups)]
p_only_data

Unnamed: 0,record_index,MMS_ID,Title_bib,OCN,ISSN_bib,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,...,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active),Collection Public Name,PCAD?,Vendor_key


In [None]:
# Save the p-only rows for review, but only if any were found
if not p_only_data.empty:
    p_only_data.to_pickle('p_e_no_pcad.pkl')

In [37]:
only_e = find_only_p_or_e[find_only_p_or_e['pore'] == 'e']
only_e

Unnamed: 0,matches_group_id,record_index,p_or_e,MMS_ID,Title_bib,pore


In [38]:
only_e_groups = list(only_e['matches_group_id'])
only_e_groups

[]

In [39]:
e_only_data = df_epcad[df_epcad['matches_group_id'].isin(only_e_groups)]
e_only_data

Unnamed: 0,record_index,MMS_ID,Title_bib,OCN,ISSN_bib,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,...,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active),Collection Public Name,PCAD?,Vendor_key
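
Because a filter like the ones above can legitimately return no rows, any save step is easier to rerun when it checks for an empty frame first. A minimal sketch of such a guard follows; `save_if_nonempty` and the temporary file names are hypothetical and do not appear elsewhere in this notebook.

```python
import os
import tempfile
import pandas as pd

def save_if_nonempty(frame, path):
    """Pickle `frame` to `path` only when it has rows; return True if a file was written."""
    if frame.empty:
        return False
    frame.to_pickle(path)
    return True

# Hypothetical demonstration in a temporary directory.
tmp_dir = tempfile.mkdtemp()
empty_path = os.path.join(tmp_dir, "empty.pkl")
full_path = os.path.join(tmp_dir, "full.pkl")

wrote_empty = save_if_nonempty(pd.DataFrame(columns=["MMS_ID"]), empty_path)
wrote_full = save_if_nonempty(pd.DataFrame({"MMS_ID": ["9939153590001701"]}), full_path)
print(wrote_empty, wrote_full)
```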


In [None]:
# Save the e-only rows for review, but only if any were found
if not e_only_data.empty:
    e_only_data.to_pickle('pe_e_no_pcad.pkl')

In [40]:
p_and_e = find_only_p_or_e[find_only_p_or_e['pore'] == 'e p']
p_and_e

Unnamed: 0,matches_group_id,record_index,p_or_e,MMS_ID,Title_bib,pore
0,2,"[113344, 69875]","[e, p]","[9939153590001701, 9968879310001701]","[The Wilson journal of ornithology, The Wilson...",e p
1,5,"[57684, 128733]","[e, p]","[9963550760001701, 9968441380001701]",[IEEE transactions on ultrasonics engineering],e p
2,7,"[112352, 78492, 107439]","[e, p]","[9967759560001701, 9933370000001701, 996714552...","[Journal of the University Film Association, J...",e p
3,12,"[88618, 111951]","[e, p]","[9939481760001701, 9968429800001701]",[Journal of the Institute of Actuaries],e p
4,20,"[122066, 89596]","[e, p]","[9963135560001701, 9975966292301701]","[Journal of tissue culture methods., Journal o...",e p
...,...,...,...,...,...,...
6652,101504,"[113874, 3295]","[e, p]","[9974799725001701, 9953296770001701]",[Nomos],e p
6653,101522,"[41280, 26122, 118219, 61904, 18706, 123828, 8...","[e, p]","[9959156260001701, 9967988520001701, 991429894...","[Year book - American Philosophical Society, P...",e p
6654,101581,"[17960, 124654, 117510]","[e, p]","[9924328340001701, 9946768760001701, 996798786...",[The Americas - Academy of American Franciscan...,e p
6655,101582,"[125537, 31683]","[e, p]","[9957960200001701, 9968947900001701]",[International journal of adhesion and adhesives],e p


In [41]:
pe_groups = list(p_and_e['matches_group_id'])
print(len(pe_groups))
pe_groups

6657


[2, 5, 7, 12, 20, 77, 83, 92, 202, 213,
 252, 254, 259, 267, 273, 277, 284, 292, 315, 340,
 477, 496, 511, 513, 537, 660, 686, 707, 739, 744,
 747, 764, 803, 819, 877, 895, 914, 1001, 1005, 1038,
 1066, 1086, 1087, 1204, 1208, 1245, 1345, 1475, 1526, 1527,
 1534, 1547, 1606, 1725, 1745, 1772, 1809, 1812, 1846, 1857,
 1869, 1920, 1960, 2059, 2065, 2067, 2078, 2308, 2337, 2356,
 2364, 2365, 2394, 2435, 2471, 2492, 2505, 2529, 2537, 2557,
 2561, 2571, 2592, 2603, 2604, 2619, 2701, 2725, 2734, 2775,
 2808, 2908, 2951, 2961, 2990, 2993, 3023, 3095, 3156, 3399,
 3401, 3434, 3443, 3447, 3448, 3475, 3478, 3483, 3489, 3513,
 3518, 3544, 3561, 3563, 3564, 3571, 3573, 3592, 3674, 3682,
 3686, 3698, 3702, 3716, 3724, 3740, 3794, 3806, 3818, 3828,
 3840, 3861, 3875, 3896, 3899, 3919, 3928, 3960, 4013, 4029,
 4035, 4058, 4105, 4117, 4136, 4188, 4201, 4219, 4237, 4258,
 ...]

In [42]:
p_and_e_data = df_epcad[df_epcad['matches_group_id'].isin(pe_groups)]
p_and_e_data

Unnamed: 0,record_index,MMS_ID,Title_bib,OCN,ISSN_bib,Related_OCNs,Related_ISSNs,Vol_nos,Gov_doc_nos,OCN_cluster,...,Coverage Statement,Portfolio Id,Lifecycle,Embargo Operator,Embargo Months,Embargo Years,Status (Active),Collection Public Name,PCAD?,Vendor_key
14,69875,9939153590001701,The Wilson journal of ornithology.,[62791211],[1559-4491],[],[],[],[],[62791211],...,,,,,,,,,,
15,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
16,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
17,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
18,113344,9968879310001701,The Wilson journal of ornithology,[66901429],[1938-5447],[],[1559-4491],[],[],[66901429],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102118,125537,9968947900001701,International journal of adhesion and adhesives,[38995556],[1879-0127],[],[0143-7496],[],[],[38995556],...,,,,,,,,,,
102119,125537,9968947900001701,International journal of adhesion and adhesives,[38995556],[1879-0127],[],[0143-7496],[],[],[38995556],...,,,,,,,,,,
102156,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],...,Only global,53815435190001701,In Repository,,,,,"Taylor & Francis Biological, Earth & Environme...",Yes,Taylor & Francis
102157,114862,9977101571201701,Oriental insects,[668436876],[2157-8745],[],[0030-5316],[],[],[668436876],...,Only local,53815442430001701,In Repository,,,,,"Taylor & Francis Biological, Earth, Environmen...",Yes,Taylor & Francis


In [43]:
p_and_e_data.to_pickle(f'p_and_e_pcad_{today}.pkl')

In [44]:
p_and_e_data['PCAD?'].value_counts()

Yes    13339
Name: PCAD?, dtype: int64

In [45]:
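# Despite the variable name, this selects the electronic ('e') rows. A count equal to the
# total above confirms that every PCAD-flagged row is an electronic record.
p = p_and_e_data[p_and_e_data['p_or_e']=='e']
p['PCAD?'].value_counts()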

Yes    13339
Name: PCAD?, dtype: int64
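
The two counts above match, which suggests that only electronic rows carry a PCAD flag. That comparison can also be made explicit with a direct check. This is a minimal sketch on hypothetical toy data; `toy`, `flagged`, and the other names are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows: one print record and three electronic records,
# two of which carry a PCAD flag.
toy = pd.DataFrame({
    "p_or_e": ["p", "e", "e", "e"],
    "PCAD?": [None, "Yes", None, "Yes"],
})

# Keep only the rows with a PCAD flag, then look at their record types.
flagged = toy[toy["PCAD?"].notnull()]
flagged_by_type = flagged["p_or_e"].value_counts().to_dict()
all_flagged_are_e = bool(flagged["p_or_e"].eq("e").all())
print(flagged_by_type, all_flagged_are_e)
```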

In [46]:
df = p_and_e_data
df.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'OCN', 'ISSN_bib',
       'Related_OCNs', 'Related_ISSNs', 'Vol_nos', 'Gov_doc_nos',
       'OCN_cluster', 'ISSN_cluster', 'p_or_e', 'ISSN_group_id',
       'OCN_group_id', 'both_groups', 'matches_group_id', 'ISSN_to_match',
       'MMS ID', 'E Coll ID', 'Electronic Collection Id',
       'Electronic Collection Public Name',
       'Electronic Collection Public Name (override)', 'MMS Id',
       'Title_portfolio', 'Begin Publication Date', 'End Publication Date',
       'ISSN_portfolio', 'Coverage Information Combined', 'Coverage Statement',
       'Portfolio Id', 'Lifecycle', 'Embargo Operator', 'Embargo Months',
       'Embargo Years', 'Status (Active)', 'Collection Public Name', 'PCAD?',
       'Vendor_key'],
      dtype='object')

In [48]:
df = df[['record_index', 'MMS_ID', 'Title_bib', 'p_or_e', 'ISSN_bib','ISSN_cluster',
         'matches_group_id', 'Electronic Collection Id', 'Electronic Collection Public Name',
         'Electronic Collection Public Name (override)', 'Title_portfolio', 
         'ISSN_portfolio', 'Coverage Information Combined', 'Coverage Statement',
         'Portfolio Id', 'Collection Public Name', 'PCAD?','Vendor_key']]
df

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,ISSN_bib,ISSN_cluster,matches_group_id,Electronic Collection Id,Electronic Collection Public Name,Electronic Collection Public Name (override),Title_portfolio,ISSN_portfolio,Coverage Information Combined,Coverage Statement,Portfolio Id,Collection Public Name,PCAD?,Vendor_key
14,69875,9939153590001701,The Wilson journal of ornithology.,p,[1559-4491],[1559-4491],2,,,,,,,,,,,
15,113344,9968879310001701,The Wilson journal of ornithology,e,[1938-5447],"[1559-4491, 1938-5447]",2,,,,,,,,,,,
16,113344,9968879310001701,The Wilson journal of ornithology,e,[1938-5447],"[1559-4491, 1938-5447]",2,,,,,,,,,,,
17,113344,9968879310001701,The Wilson journal of ornithology,e,[1938-5447],"[1559-4491, 1938-5447]",2,,,,,,,,,,,
18,113344,9968879310001701,The Wilson journal of ornithology,e,[1938-5447],"[1559-4491, 1938-5447]",2,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102118,125537,9968947900001701,International journal of adhesion and adhesives,e,[1879-0127],"[0143-7496, 1879-0127]",101582,,,,,,,,,,,
102119,125537,9968947900001701,International journal of adhesion and adhesives,e,[1879-0127],"[0143-7496, 1879-0127]",101582,,,,,,,,,,,
102156,114862,9977101571201701,Oriental insects,e,[2157-8745],"[0030-5316, 2157-8745]",101697,61808527020001701,"Taylor & Francis Biological, Earth & Environme...","Taylor & Francis Biological, Earth, Environmen...",Oriental insects,2157-8745; 0030-5316,Available from 1967 volume: 1 until 1996 volu...,Only global,53815435190001701,"Taylor & Francis Biological, Earth & Environme...",Yes,Taylor & Francis
102157,114862,9977101571201701,Oriental insects,e,[2157-8745],"[0030-5316, 2157-8745]",101697,61808506700001701,"Taylor & Francis Biological, Earth, Environmen...","Taylor & Francis Biological, Earth, Environmen...",Oriental insects,2157-8745; 0030-5316,Available from 1997 volume: 31 issue: 1 until...,Only local,53815442430001701,"Taylor & Francis Biological, Earth, Environmen...",Yes,Taylor & Francis


In [49]:
df_e_trunc = df[df['PCAD?'].notnull()]
df_e_trunc

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,ISSN_bib,ISSN_cluster,matches_group_id,Electronic Collection Id,Electronic Collection Public Name,Electronic Collection Public Name (override),Title_portfolio,ISSN_portfolio,Coverage Information Combined,Coverage Statement,Portfolio Id,Collection Public Name,PCAD?,Vendor_key
24,113344,9968879310001701,The Wilson journal of ornithology,e,[1938-5447],"[1559-4491, 1938-5447]",2,61588428410001701,JSTOR Life Sciences Collection,,The Wilson journal of ornithology.,1938-5447; 1559-4491,Available from 2006 volume: 118 issue: 1;,Only global,53588425000001701,JSTOR Life Sciences Collection,Yes,JSTOR
27,128733,9968441380001701,IEEE transactions on ultrasonics engineering,e,[2162-1373],"[0893-6706, 2162-1373]",5,61619505660001701,IEEE/IET Electronic Library (IEL) Journals,IEEE Journal Archive 1884-1999,IEEE transactions on ultrasonics engineering.,2162-1373; 0893-6706,Available from 1963 volume: 10 issue: 1 until...,Only global,53620359120001701,IEEE/IET Electronic Library (IEL) Journals,Yes,other
34,112352,9967759560001701,Journal of the University Film Association,e,[2328-1944],"[0041-9311, 2328-1944]",7,61535210810001701,JSTOR Arts and Sciences V,,Journal of the University Film Association.,2328-1944; 0041-9311,Available from 1968 volume: 20 issue: 1 until...,Only global,53538633520001701,JSTOR Arts and Sciences V,Yes,JSTOR
37,107439,9967145520001701,Journal of the University Film and Video Assoc...,e,[2328-1936],"[2328-1936, 0734-919X]",7,61535210810001701,JSTOR Arts and Sciences V,,Journal of the University Film and Video Assoc...,2328-1936; 0734-919X,Available from 1982 volume: 34 issue: 1 until...,Only global,53537295580001701,JSTOR Arts and Sciences V,Yes,JSTOR
51,111951,9968429800001701,Journal of the Institute of Actuaries,e,[2058-1009],"[0020-2681, 2058-1009]",12,61746266290001701,JSTOR Arts and Sciences X,,Journal of the Institute of Actuaries.,2058-1009; 0020-2681,Available from 1886 volume: 25 issue: 5 until...,Only global,53747073660001701,JSTOR Arts and Sciences X,Yes,JSTOR
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102066,123828,9968665290001701,Proceedings and addresses of the American Phil...,e,[2325-9248],"[2325-9248, 0065-972X]",101522,61535211010001701,JSTOR Arts and Sciences VII,,Proceedings and addresses of the American Phil...,2325-9248; 0065-972X,Available from 1927 volume: 1;,Only global,53540700110001701,JSTOR Arts and Sciences VII,Yes,JSTOR
102111,117510,9967987860001701,The Americas - Academy of American Franciscan ...,e,[1533-6247],"[1533-6247, 0003-1615]",101581,61535211010001701,JSTOR Arts and Sciences VII,,The Americas.,1533-6247; 0003-1615,Available from 1944 volume: 1 issue: 1;,Only global,53539160390001701,JSTOR Arts and Sciences VII,Yes,JSTOR
102114,125537,9968947900001701,International journal of adhesion and adhesives,e,[1879-0127],"[0143-7496, 1879-0127]",101582,61624504590001701,Elsevier ScienceDirect Journals Complete,,International journal of adhesion & adhesives,1879-0127; 0143-7496,Available from 1980-07- volume: 1 issue: 1;,Only local,53624617390001701,Elsevier ScienceDirect Journals Complete,Yes,Elsevier
102156,114862,9977101571201701,Oriental insects,e,[2157-8745],"[0030-5316, 2157-8745]",101697,61808527020001701,"Taylor & Francis Biological, Earth & Environme...","Taylor & Francis Biological, Earth, Environmen...",Oriental insects,2157-8745; 0030-5316,Available from 1967 volume: 1 until 1996 volu...,Only global,53815435190001701,"Taylor & Francis Biological, Earth & Environme...",Yes,Taylor & Francis


In [51]:
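# Fill missing values per column and reassign; calling fillna(inplace=True) on a
# single selected column may not update df_e_trunc itself
df_e_trunc = df_e_trunc.fillna({'Electronic Collection Id': '',
                                'Electronic Collection Public Name (override)': ''})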
df_e_trunc

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,ISSN_bib,ISSN_cluster,matches_group_id,Electronic Collection Id,Electronic Collection Public Name,Electronic Collection Public Name (override),Title_portfolio,ISSN_portfolio,Coverage Information Combined,Coverage Statement,Portfolio Id,Collection Public Name,PCAD?,Vendor_key
24,113344,9968879310001701,The Wilson journal of ornithology,e,[1938-5447],"[1559-4491, 1938-5447]",2,61588428410001701,JSTOR Life Sciences Collection,,The Wilson journal of ornithology.,1938-5447; 1559-4491,Available from 2006 volume: 118 issue: 1;,Only global,53588425000001701,JSTOR Life Sciences Collection,Yes,JSTOR
27,128733,9968441380001701,IEEE transactions on ultrasonics engineering,e,[2162-1373],"[0893-6706, 2162-1373]",5,61619505660001701,IEEE/IET Electronic Library (IEL) Journals,IEEE Journal Archive 1884-1999,IEEE transactions on ultrasonics engineering.,2162-1373; 0893-6706,Available from 1963 volume: 10 issue: 1 until...,Only global,53620359120001701,IEEE/IET Electronic Library (IEL) Journals,Yes,other
34,112352,9967759560001701,Journal of the University Film Association,e,[2328-1944],"[0041-9311, 2328-1944]",7,61535210810001701,JSTOR Arts and Sciences V,,Journal of the University Film Association.,2328-1944; 0041-9311,Available from 1968 volume: 20 issue: 1 until...,Only global,53538633520001701,JSTOR Arts and Sciences V,Yes,JSTOR
37,107439,9967145520001701,Journal of the University Film and Video Assoc...,e,[2328-1936],"[2328-1936, 0734-919X]",7,61535210810001701,JSTOR Arts and Sciences V,,Journal of the University Film and Video Assoc...,2328-1936; 0734-919X,Available from 1982 volume: 34 issue: 1 until...,Only global,53537295580001701,JSTOR Arts and Sciences V,Yes,JSTOR
51,111951,9968429800001701,Journal of the Institute of Actuaries,e,[2058-1009],"[0020-2681, 2058-1009]",12,61746266290001701,JSTOR Arts and Sciences X,,Journal of the Institute of Actuaries.,2058-1009; 0020-2681,Available from 1886 volume: 25 issue: 5 until...,Only global,53747073660001701,JSTOR Arts and Sciences X,Yes,JSTOR
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102066,123828,9968665290001701,Proceedings and addresses of the American Phil...,e,[2325-9248],"[2325-9248, 0065-972X]",101522,61535211010001701,JSTOR Arts and Sciences VII,,Proceedings and addresses of the American Phil...,2325-9248; 0065-972X,Available from 1927 volume: 1;,Only global,53540700110001701,JSTOR Arts and Sciences VII,Yes,JSTOR
102111,117510,9967987860001701,The Americas - Academy of American Franciscan ...,e,[1533-6247],"[1533-6247, 0003-1615]",101581,61535211010001701,JSTOR Arts and Sciences VII,,The Americas.,1533-6247; 0003-1615,Available from 1944 volume: 1 issue: 1;,Only global,53539160390001701,JSTOR Arts and Sciences VII,Yes,JSTOR
102114,125537,9968947900001701,International journal of adhesion and adhesives,e,[1879-0127],"[0143-7496, 1879-0127]",101582,61624504590001701,Elsevier ScienceDirect Journals Complete,,International journal of adhesion & adhesives,1879-0127; 0143-7496,Available from 1980-07- volume: 1 issue: 1;,Only local,53624617390001701,Elsevier ScienceDirect Journals Complete,Yes,Elsevier
102156,114862,9977101571201701,Oriental insects,e,[2157-8745],"[0030-5316, 2157-8745]",101697,61808527020001701,"Taylor & Francis Biological, Earth & Environme...","Taylor & Francis Biological, Earth, Environmen...",Oriental insects,2157-8745; 0030-5316,Available from 1967 volume: 1 until 1996 volu...,Only global,53815435190001701,"Taylor & Francis Biological, Earth & Environme...",Yes,Taylor & Francis


In [52]:
# Pack the e-collection fields into one stringified list per row.
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning.
df_e_trunc = df_e_trunc.assign(
    e_coll_info=df_e_trunc.apply(lambda x: str([x['Electronic Collection Id'],
                                                x['Electronic Collection Public Name'],
                                                x['Electronic Collection Public Name (override)'],
                                                x['Coverage Information Combined']]),
                                 axis=1))
df_e_trunc

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,ISSN_bib,ISSN_cluster,matches_group_id,Electronic Collection Id,Electronic Collection Public Name,Electronic Collection Public Name (override),Title_portfolio,ISSN_portfolio,Coverage Information Combined,Coverage Statement,Portfolio Id,Collection Public Name,PCAD?,Vendor_key,e_coll_info
24,113344,9968879310001701,The Wilson journal of ornithology,e,[1938-5447],"[1559-4491, 1938-5447]",2,61588428410001701,JSTOR Life Sciences Collection,,The Wilson journal of ornithology.,1938-5447; 1559-4491,Available from 2006 volume: 118 issue: 1;,Only global,53588425000001701,JSTOR Life Sciences Collection,Yes,JSTOR,"['61588428410001701', 'JSTOR Life Sciences Col..."
27,128733,9968441380001701,IEEE transactions on ultrasonics engineering,e,[2162-1373],"[0893-6706, 2162-1373]",5,61619505660001701,IEEE/IET Electronic Library (IEL) Journals,IEEE Journal Archive 1884-1999,IEEE transactions on ultrasonics engineering.,2162-1373; 0893-6706,Available from 1963 volume: 10 issue: 1 until...,Only global,53620359120001701,IEEE/IET Electronic Library (IEL) Journals,Yes,other,"['61619505660001701', 'IEEE/IET Electronic Lib..."
34,112352,9967759560001701,Journal of the University Film Association,e,[2328-1944],"[0041-9311, 2328-1944]",7,61535210810001701,JSTOR Arts and Sciences V,,Journal of the University Film Association.,2328-1944; 0041-9311,Available from 1968 volume: 20 issue: 1 until...,Only global,53538633520001701,JSTOR Arts and Sciences V,Yes,JSTOR,"['61535210810001701', 'JSTOR Arts and Sciences..."
37,107439,9967145520001701,Journal of the University Film and Video Assoc...,e,[2328-1936],"[2328-1936, 0734-919X]",7,61535210810001701,JSTOR Arts and Sciences V,,Journal of the University Film and Video Assoc...,2328-1936; 0734-919X,Available from 1982 volume: 34 issue: 1 until...,Only global,53537295580001701,JSTOR Arts and Sciences V,Yes,JSTOR,"['61535210810001701', 'JSTOR Arts and Sciences..."
51,111951,9968429800001701,Journal of the Institute of Actuaries,e,[2058-1009],"[0020-2681, 2058-1009]",12,61746266290001701,JSTOR Arts and Sciences X,,Journal of the Institute of Actuaries.,2058-1009; 0020-2681,Available from 1886 volume: 25 issue: 5 until...,Only global,53747073660001701,JSTOR Arts and Sciences X,Yes,JSTOR,"['61746266290001701', 'JSTOR Arts and Sciences..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102066,123828,9968665290001701,Proceedings and addresses of the American Phil...,e,[2325-9248],"[2325-9248, 0065-972X]",101522,61535211010001701,JSTOR Arts and Sciences VII,,Proceedings and addresses of the American Phil...,2325-9248; 0065-972X,Available from 1927 volume: 1;,Only global,53540700110001701,JSTOR Arts and Sciences VII,Yes,JSTOR,"['61535211010001701', 'JSTOR Arts and Sciences..."
102111,117510,9967987860001701,The Americas - Academy of American Franciscan ...,e,[1533-6247],"[1533-6247, 0003-1615]",101581,61535211010001701,JSTOR Arts and Sciences VII,,The Americas.,1533-6247; 0003-1615,Available from 1944 volume: 1 issue: 1;,Only global,53539160390001701,JSTOR Arts and Sciences VII,Yes,JSTOR,"['61535211010001701', 'JSTOR Arts and Sciences..."
102114,125537,9968947900001701,International journal of adhesion and adhesives,e,[1879-0127],"[0143-7496, 1879-0127]",101582,61624504590001701,Elsevier ScienceDirect Journals Complete,,International journal of adhesion & adhesives,1879-0127; 0143-7496,Available from 1980-07- volume: 1 issue: 1;,Only local,53624617390001701,Elsevier ScienceDirect Journals Complete,Yes,Elsevier,"['61624504590001701', 'Elsevier ScienceDirect ..."
102156,114862,9977101571201701,Oriental insects,e,[2157-8745],"[0030-5316, 2157-8745]",101697,61808527020001701,"Taylor & Francis Biological, Earth & Environme...","Taylor & Francis Biological, Earth, Environmen...",Oriental insects,2157-8745; 0030-5316,Available from 1967 volume: 1 until 1996 volu...,Only global,53815435190001701,"Taylor & Francis Biological, Earth & Environme...",Yes,Taylor & Francis,"['61808527020001701', 'Taylor & Francis Biolog..."
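
Storing the packed fields as a string keeps each cell hashable, and the string can be turned back into a list later with `ast.literal_eval` (the notebook already imports `ast`). The sketch below illustrates the round trip using values from the first row shown above. Whether later cells unpack the column this way is an assumption, and the variable names are illustrative.

```python
import ast

# Values copied from the first row shown above; the variable names are illustrative.
fields = ['61588428410001701', 'JSTOR Life Sciences Collection', '',
          'Available from 2006 volume: 118 issue: 1;']

packed = str(fields)                   # the form stored in the e_coll_info column
unpacked = ast.literal_eval(packed)    # parse the string back into a list

# A NaN left in the list is rendered as the bare name `nan`, which literal_eval rejects.
try:
    ast.literal_eval(str([float('nan')]))
    nan_parses = True
except ValueError:
    nan_parses = False

print(unpacked == fields, nan_parses)
```

This may be one reason to fill missing values with empty strings before packing: a list containing NaN would not survive the round trip.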


In [53]:
# Drop the e-collection columns that are no longer needed; reassigning instead of
# dropping in place avoids SettingWithCopyWarning
df_e_trunc = df_e_trunc.drop(columns=['Electronic Collection Id','Electronic Collection Public Name',
                                      'Electronic Collection Public Name (override)',
                                      'Coverage Statement','Collection Public Name'])
df_e_trunc

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,ISSN_bib,ISSN_cluster,matches_group_id,Title_portfolio,ISSN_portfolio,Coverage Information Combined,Portfolio Id,PCAD?,Vendor_key,e_coll_info
24,113344,9968879310001701,The Wilson journal of ornithology,e,[1938-5447],"[1559-4491, 1938-5447]",2,The Wilson journal of ornithology.,1938-5447; 1559-4491,Available from 2006 volume: 118 issue: 1;,53588425000001701,Yes,JSTOR,"['61588428410001701', 'JSTOR Life Sciences Col..."
27,128733,9968441380001701,IEEE transactions on ultrasonics engineering,e,[2162-1373],"[0893-6706, 2162-1373]",5,IEEE transactions on ultrasonics engineering.,2162-1373; 0893-6706,Available from 1963 volume: 10 issue: 1 until...,53620359120001701,Yes,other,"['61619505660001701', 'IEEE/IET Electronic Lib..."
34,112352,9967759560001701,Journal of the University Film Association,e,[2328-1944],"[0041-9311, 2328-1944]",7,Journal of the University Film Association.,2328-1944; 0041-9311,Available from 1968 volume: 20 issue: 1 until...,53538633520001701,Yes,JSTOR,"['61535210810001701', 'JSTOR Arts and Sciences..."
37,107439,9967145520001701,Journal of the University Film and Video Assoc...,e,[2328-1936],"[2328-1936, 0734-919X]",7,Journal of the University Film and Video Assoc...,2328-1936; 0734-919X,Available from 1982 volume: 34 issue: 1 until...,53537295580001701,Yes,JSTOR,"['61535210810001701', 'JSTOR Arts and Sciences..."
51,111951,9968429800001701,Journal of the Institute of Actuaries,e,[2058-1009],"[0020-2681, 2058-1009]",12,Journal of the Institute of Actuaries.,2058-1009; 0020-2681,Available from 1886 volume: 25 issue: 5 until...,53747073660001701,Yes,JSTOR,"['61746266290001701', 'JSTOR Arts and Sciences..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102066,123828,9968665290001701,Proceedings and addresses of the American Phil...,e,[2325-9248],"[2325-9248, 0065-972X]",101522,Proceedings and addresses of the American Phil...,2325-9248; 0065-972X,Available from 1927 volume: 1;,53540700110001701,Yes,JSTOR,"['61535211010001701', 'JSTOR Arts and Sciences..."
102111,117510,9967987860001701,The Americas - Academy of American Franciscan ...,e,[1533-6247],"[1533-6247, 0003-1615]",101581,The Americas.,1533-6247; 0003-1615,Available from 1944 volume: 1 issue: 1;,53539160390001701,Yes,JSTOR,"['61535211010001701', 'JSTOR Arts and Sciences..."
102114,125537,9968947900001701,International journal of adhesion and adhesives,e,[1879-0127],"[0143-7496, 1879-0127]",101582,International journal of adhesion & adhesives,1879-0127; 0143-7496,Available from 1980-07- volume: 1 issue: 1;,53624617390001701,Yes,Elsevier,"['61624504590001701', 'Elsevier ScienceDirect ..."
102156,114862,9977101571201701,Oriental insects,e,[2157-8745],"[0030-5316, 2157-8745]",101697,Oriental insects,2157-8745; 0030-5316,Available from 1967 volume: 1 until 1996 volu...,53815435190001701,Yes,Taylor & Francis,"['61808527020001701', 'Taylor & Francis Biolog..."


In [54]:
# Pack the portfolio ID and title into one stringified list per row.
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning.
df_e_trunc = df_e_trunc.assign(
    portfolio_info=df_e_trunc.apply(lambda x: str([x['Portfolio Id'], x['Title_portfolio']]),
                                    axis=1))
df_e_trunc

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,ISSN_bib,ISSN_cluster,matches_group_id,Title_portfolio,ISSN_portfolio,Coverage Information Combined,Portfolio Id,PCAD?,Vendor_key,e_coll_info,portfolio_info
24,113344,9968879310001701,The Wilson journal of ornithology,e,[1938-5447],"[1559-4491, 1938-5447]",2,The Wilson journal of ornithology.,1938-5447; 1559-4491,Available from 2006 volume: 118 issue: 1;,53588425000001701,Yes,JSTOR,"['61588428410001701', 'JSTOR Life Sciences Col...","['53588425000001701', 'The Wilson journal of o..."
27,128733,9968441380001701,IEEE transactions on ultrasonics engineering,e,[2162-1373],"[0893-6706, 2162-1373]",5,IEEE transactions on ultrasonics engineering.,2162-1373; 0893-6706,Available from 1963 volume: 10 issue: 1 until...,53620359120001701,Yes,other,"['61619505660001701', 'IEEE/IET Electronic Lib...","['53620359120001701', 'IEEE transactions on ul..."
34,112352,9967759560001701,Journal of the University Film Association,e,[2328-1944],"[0041-9311, 2328-1944]",7,Journal of the University Film Association.,2328-1944; 0041-9311,Available from 1968 volume: 20 issue: 1 until...,53538633520001701,Yes,JSTOR,"['61535210810001701', 'JSTOR Arts and Sciences...","['53538633520001701', 'Journal of the Universi..."
37,107439,9967145520001701,Journal of the University Film and Video Assoc...,e,[2328-1936],"[2328-1936, 0734-919X]",7,Journal of the University Film and Video Assoc...,2328-1936; 0734-919X,Available from 1982 volume: 34 issue: 1 until...,53537295580001701,Yes,JSTOR,"['61535210810001701', 'JSTOR Arts and Sciences...","['53537295580001701', 'Journal of the Universi..."
51,111951,9968429800001701,Journal of the Institute of Actuaries,e,[2058-1009],"[0020-2681, 2058-1009]",12,Journal of the Institute of Actuaries.,2058-1009; 0020-2681,Available from 1886 volume: 25 issue: 5 until...,53747073660001701,Yes,JSTOR,"['61746266290001701', 'JSTOR Arts and Sciences...","['53747073660001701', 'Journal of the Institut..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102066,123828,9968665290001701,Proceedings and addresses of the American Phil...,e,[2325-9248],"[2325-9248, 0065-972X]",101522,Proceedings and addresses of the American Phil...,2325-9248; 0065-972X,Available from 1927 volume: 1;,53540700110001701,Yes,JSTOR,"['61535211010001701', 'JSTOR Arts and Sciences...","['53540700110001701', 'Proceedings and address..."
102111,117510,9967987860001701,The Americas - Academy of American Franciscan ...,e,[1533-6247],"[1533-6247, 0003-1615]",101581,The Americas.,1533-6247; 0003-1615,Available from 1944 volume: 1 issue: 1;,53539160390001701,Yes,JSTOR,"['61535211010001701', 'JSTOR Arts and Sciences...","['53539160390001701', 'The Americas.']"
102114,125537,9968947900001701,International journal of adhesion and adhesives,e,[1879-0127],"[0143-7496, 1879-0127]",101582,International journal of adhesion & adhesives,1879-0127; 0143-7496,Available from 1980-07- volume: 1 issue: 1;,53624617390001701,Yes,Elsevier,"['61624504590001701', 'Elsevier ScienceDirect ...","['53624617390001701', 'International journal o..."
102156,114862,9977101571201701,Oriental insects,e,[2157-8745],"[0030-5316, 2157-8745]",101697,Oriental insects,2157-8745; 0030-5316,Available from 1967 volume: 1 until 1996 volu...,53815435190001701,Yes,Taylor & Francis,"['61808527020001701', 'Taylor & Francis Biolog...","['53815435190001701', 'Oriental insects']"


In [55]:
df_e_trunc.drop(['Portfolio Id','Title_portfolio','ISSN_portfolio'],inplace=True, axis=1)
df_e_trunc

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,ISSN_bib,ISSN_cluster,matches_group_id,Coverage Information Combined,PCAD?,Vendor_key,e_coll_info,portfolio_info
24,113344,9968879310001701,The Wilson journal of ornithology,e,[1938-5447],"[1559-4491, 1938-5447]",2,Available from 2006 volume: 118 issue: 1;,Yes,JSTOR,"['61588428410001701', 'JSTOR Life Sciences Col...","['53588425000001701', 'The Wilson journal of o..."
27,128733,9968441380001701,IEEE transactions on ultrasonics engineering,e,[2162-1373],"[0893-6706, 2162-1373]",5,Available from 1963 volume: 10 issue: 1 until...,Yes,other,"['61619505660001701', 'IEEE/IET Electronic Lib...","['53620359120001701', 'IEEE transactions on ul..."
34,112352,9967759560001701,Journal of the University Film Association,e,[2328-1944],"[0041-9311, 2328-1944]",7,Available from 1968 volume: 20 issue: 1 until...,Yes,JSTOR,"['61535210810001701', 'JSTOR Arts and Sciences...","['53538633520001701', 'Journal of the Universi..."
37,107439,9967145520001701,Journal of the University Film and Video Assoc...,e,[2328-1936],"[2328-1936, 0734-919X]",7,Available from 1982 volume: 34 issue: 1 until...,Yes,JSTOR,"['61535210810001701', 'JSTOR Arts and Sciences...","['53537295580001701', 'Journal of the Universi..."
51,111951,9968429800001701,Journal of the Institute of Actuaries,e,[2058-1009],"[0020-2681, 2058-1009]",12,Available from 1886 volume: 25 issue: 5 until...,Yes,JSTOR,"['61746266290001701', 'JSTOR Arts and Sciences...","['53747073660001701', 'Journal of the Institut..."
...,...,...,...,...,...,...,...,...,...,...,...,...
102066,123828,9968665290001701,Proceedings and addresses of the American Phil...,e,[2325-9248],"[2325-9248, 0065-972X]",101522,Available from 1927 volume: 1;,Yes,JSTOR,"['61535211010001701', 'JSTOR Arts and Sciences...","['53540700110001701', 'Proceedings and address..."
102111,117510,9967987860001701,The Americas - Academy of American Franciscan ...,e,[1533-6247],"[1533-6247, 0003-1615]",101581,Available from 1944 volume: 1 issue: 1;,Yes,JSTOR,"['61535211010001701', 'JSTOR Arts and Sciences...","['53539160390001701', 'The Americas.']"
102114,125537,9968947900001701,International journal of adhesion and adhesives,e,[1879-0127],"[0143-7496, 1879-0127]",101582,Available from 1980-07- volume: 1 issue: 1;,Yes,Elsevier,"['61624504590001701', 'Elsevier ScienceDirect ...","['53624617390001701', 'International journal o..."
102156,114862,9977101571201701,Oriental insects,e,[2157-8745],"[0030-5316, 2157-8745]",101697,Available from 1967 volume: 1 until 1996 volu...,Yes,Taylor & Francis,"['61808527020001701', 'Taylor & Francis Biolog...","['53815435190001701', 'Oriental insects']"


In [56]:
# Collect the distinct e-collection and portfolio entries for each MMS ID
dfeg = df_e_trunc[['MMS_ID','e_coll_info','portfolio_info']].groupby(['MMS_ID']).agg(lambda x: list(set(x))).reset_index()
dfeg

Unnamed: 0,MMS_ID,e_coll_info,portfolio_info
0,9966211480001701,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624689300001701', 'Structural safety']]"
1,9966211500001701,"[['61535213730001701', 'SpringerLink Historica...","[['53535250720001701', 'Plant molecular biolog..."
2,9966211580001701,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624691020001701', 'Solid state ionics']]"
3,9966211610001701,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624691280001701', 'Soil & tillage researc..."
4,9966211630001701,"[['61535212360001701', 'Elsevier SD Backfile B...","[['53535250570001701', 'Regulatory peptides.']..."
...,...,...,...
7156,9977118809801701,"[['61816118310001701', 'Mary Ann Liebert Publi...","[['53816118140001701', 'CyberPsychology and be..."
7157,9977118809901701,"[['61816118310001701', 'Mary Ann Liebert Publi...","[['53816118210001701', 'AIDS patient care.']]"
7158,9977242251801701,"[['61765175560001701', 'Wiley Online Library D...","[['53831074140001701', 'Crop science, soil sci..."
7159,9977255009801701,"[['61789512600001701', 'JSTOR Arts & Sciences ...","[['53831868390001701', 'Wiener Zeitschrift fü..."


In [57]:
dfes = df_e_trunc[['MMS_ID','Coverage Information Combined','PCAD?','Vendor_key']].groupby(['MMS_ID']).agg(lambda x: list(set(x))).reset_index()
dfes

Unnamed: 0,MMS_ID,Coverage Information Combined,PCAD?,Vendor_key
0,9966211480001701,[ Available from 1995-03- volume: 17 issue: 1;],[Yes],[Elsevier]
1,9966211500001701,[ Available from 1982 volume: 1 issue: 1 until...,[Yes],[Springer]
2,9966211580001701,[ Available from 1980-04- volume: 1 issue: 1;],[Yes],[Elsevier]
3,9966211610001701,[ Available from 1980 volume: 1;],[Yes],[Elsevier]
4,9966211630001701,[ Available from 1980 volume: 1 until 2004-12-...,[Yes],[Elsevier]
...,...,...,...,...
7156,9977118809801701,[ Available from 1998 volume: 1 issue: 1 until...,[Yes],[other]
7157,9977118809901701,[ Available from 1987 volume: 1 issue: 1 until...,[Yes],[other]
7158,9977242251801701,[ Available from 1999 volume: 44 issue: 1 unti...,[Yes],[Wiley]
7159,9977255009801701,[ Available from 2000 volume: 44;],[Yes],[JSTOR]


In [58]:
df_e_combo = pd.merge(dfeg, dfes, how='left',on='MMS_ID')
df_e_combo

Unnamed: 0,MMS_ID,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key
0,9966211480001701,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624689300001701', 'Structural safety']]",[ Available from 1995-03- volume: 17 issue: 1;],[Yes],[Elsevier]
1,9966211500001701,"[['61535213730001701', 'SpringerLink Historica...","[['53535250720001701', 'Plant molecular biolog...",[ Available from 1982 volume: 1 issue: 1 until...,[Yes],[Springer]
2,9966211580001701,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624691020001701', 'Solid state ionics']]",[ Available from 1980-04- volume: 1 issue: 1;],[Yes],[Elsevier]
3,9966211610001701,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624691280001701', 'Soil & tillage researc...",[ Available from 1980 volume: 1;],[Yes],[Elsevier]
4,9966211630001701,"[['61535212360001701', 'Elsevier SD Backfile B...","[['53535250570001701', 'Regulatory peptides.']...",[ Available from 1980 volume: 1 until 2004-12-...,[Yes],[Elsevier]
...,...,...,...,...,...,...
7156,9977118809801701,"[['61816118310001701', 'Mary Ann Liebert Publi...","[['53816118140001701', 'CyberPsychology and be...",[ Available from 1998 volume: 1 issue: 1 until...,[Yes],[other]
7157,9977118809901701,"[['61816118310001701', 'Mary Ann Liebert Publi...","[['53816118210001701', 'AIDS patient care.']]",[ Available from 1987 volume: 1 issue: 1 until...,[Yes],[other]
7158,9977242251801701,"[['61765175560001701', 'Wiley Online Library D...","[['53831074140001701', 'Crop science, soil sci...",[ Available from 1999 volume: 44 issue: 1 unti...,[Yes],[Wiley]
7159,9977255009801701,"[['61789512600001701', 'JSTOR Arts & Sciences ...","[['53831868390001701', 'Wiener Zeitschrift fü...",[ Available from 2000 volume: 44;],[Yes],[JSTOR]
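
The two group-and-merge steps above could also be done in one pass. The following is a minimal sketch on invented toy data (the frame `toy` and the helper `unique_sorted` are hypothetical names, not part of this notebook). It sorts each list because the iteration order of a `set` of strings is not guaranteed to be the same from one Python session to the next.

```python
import pandas as pd

# Toy stand-in for df_e_trunc; all IDs and titles below are invented.
toy = pd.DataFrame({
    'MMS_ID': ['991', '991', '992'],
    'e_coll_info': ["['611', 'Coll A']", "['611', 'Coll A']", "['612', 'Coll B']"],
    'portfolio_info': ["['531', 'Title X']", "['532', 'Title X']", "['533', 'Title Y']"],
    'Vendor_key': ['JSTOR', 'JSTOR', 'Wiley'],
})


def unique_sorted(values):
    """Collapse a group's values to a sorted list of distinct entries."""
    return sorted(set(values))


# One groupby covers every non-key column, so no follow-up merge is needed.
toy_combo = toy.groupby('MMS_ID').agg(unique_sorted).reset_index()
print(toy_combo)
```

Whether a stable ordering matters depends on how the lists are compared later; if they are only displayed, the original `list(set(x))` is equivalent.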


In [59]:
p_and_e_data.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'OCN', 'ISSN_bib',
       'Related_OCNs', 'Related_ISSNs', 'Vol_nos', 'Gov_doc_nos',
       'OCN_cluster', 'ISSN_cluster', 'p_or_e', 'ISSN_group_id',
       'OCN_group_id', 'both_groups', 'matches_group_id', 'ISSN_to_match',
       'MMS ID', 'E Coll ID', 'Electronic Collection Id',
       'Electronic Collection Public Name',
       'Electronic Collection Public Name (override)', 'MMS Id',
       'Title_portfolio', 'Begin Publication Date', 'End Publication Date',
       'ISSN_portfolio', 'Coverage Information Combined', 'Coverage Statement',
       'Portfolio Id', 'Lifecycle', 'Embargo Operator', 'Embargo Months',
       'Embargo Years', 'Status (Active)', 'Collection Public Name', 'PCAD?',
       'Vendor_key'],
      dtype='object')

In [60]:
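# .copy() makes this an independent dataframe, so the column assignments
# in the following cells do not trigger SettingWithCopyWarning
p_and_e_data = p_and_e_data[['record_index', 'MMS_ID', 'Title_bib', 'ISSN_bib','ISSN_cluster',
                             'p_or_e', 'matches_group_id', 'ISSN_to_match']].copy()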
p_and_e_data

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match
14,69875,9939153590001701,The Wilson journal of ornithology.,[1559-4491],[1559-4491],p,2,1559-4491
15,113344,9968879310001701,The Wilson journal of ornithology,[1938-5447],"[1559-4491, 1938-5447]",e,2,1938-5447
16,113344,9968879310001701,The Wilson journal of ornithology,[1938-5447],"[1559-4491, 1938-5447]",e,2,1938-5447
17,113344,9968879310001701,The Wilson journal of ornithology,[1938-5447],"[1559-4491, 1938-5447]",e,2,1938-5447
18,113344,9968879310001701,The Wilson journal of ornithology,[1938-5447],"[1559-4491, 1938-5447]",e,2,1938-5447
...,...,...,...,...,...,...,...,...
102118,125537,9968947900001701,International journal of adhesion and adhesives,[1879-0127],"[0143-7496, 1879-0127]",e,101582,1879-0127
102119,125537,9968947900001701,International journal of adhesion and adhesives,[1879-0127],"[0143-7496, 1879-0127]",e,101582,1879-0127
102156,114862,9977101571201701,Oriental insects,[2157-8745],"[0030-5316, 2157-8745]",e,101697,2157-8745
102157,114862,9977101571201701,Oriental insects,[2157-8745],"[0030-5316, 2157-8745]",e,101697,2157-8745


In [61]:
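# Convert the list columns to strings: lists are unhashable, so
# drop_duplicates() in the next cell cannot compare them directly
p_and_e_data['ISSN_bib'] = p_and_e_data['ISSN_bib'].apply(str)
p_and_e_data['ISSN_cluster'] = p_and_e_data['ISSN_cluster'].apply(str)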
p_and_e_data


Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match
14,69875,9939153590001701,The Wilson journal of ornithology.,['1559-4491'],['1559-4491'],p,2,1559-4491
15,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447
16,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447
17,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447
18,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447
...,...,...,...,...,...,...,...,...
102118,125537,9968947900001701,International journal of adhesion and adhesives,['1879-0127'],"['0143-7496', '1879-0127']",e,101582,1879-0127
102119,125537,9968947900001701,International journal of adhesion and adhesives,['1879-0127'],"['0143-7496', '1879-0127']",e,101582,1879-0127
102156,114862,9977101571201701,Oriental insects,['2157-8745'],"['0030-5316', '2157-8745']",e,101697,2157-8745
102157,114862,9977101571201701,Oriental insects,['2157-8745'],"['0030-5316', '2157-8745']",e,101697,2157-8745


In [62]:
p_and_e_data.drop_duplicates(inplace=True)
p_and_e_data


Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match
14,69875,9939153590001701,The Wilson journal of ornithology.,['1559-4491'],['1559-4491'],p,2,1559-4491
15,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447
25,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706
26,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373
32,112352,9967759560001701,Journal of the University Film Association,['2328-1944'],"['0041-9311', '2328-1944']",e,7,2328-1944
...,...,...,...,...,...,...,...,...
102112,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,
102113,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496
102114,125537,9968947900001701,International journal of adhesion and adhesives,['1879-0127'],"['0143-7496', '1879-0127']",e,101582,1879-0127
102156,114862,9977101571201701,Oriental insects,['2157-8745'],"['0030-5316', '2157-8745']",e,101697,2157-8745
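
The string conversion before `drop_duplicates()` matters because pandas hashes row values to find duplicates, and Python lists are not hashable, so de-duplicating a list column typically fails with a `TypeError`. Here is a minimal sketch on invented data. The round trip back to lists with `ast.literal_eval` is an assumption about how the strings might be used later, not something this section does.

```python
import ast
import pandas as pd

# Toy frame with a list-valued column; the ISSNs are only illustrative.
toy = pd.DataFrame({
    'MMS_ID': ['991', '991', '992'],
    'ISSN_cluster': [['1559-4491', '1938-5447'],
                     ['1559-4491', '1938-5447'],
                     ['0893-6706']],
})

# Stringify the list column so rows can be hashed and compared.
as_text = toy.assign(ISSN_cluster=toy['ISSN_cluster'].apply(str))
deduped = as_text.drop_duplicates()

# Optional round trip: parse the string form back into real lists.
restored = deduped.assign(
    ISSN_cluster=deduped['ISSN_cluster'].apply(ast.literal_eval)
)
print(restored)
```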


In [63]:
p_and_e_data['MMS_ID'].nunique()

15651

In [64]:
df_combo = pd.merge(p_and_e_data,df_e_combo,how='left',on='MMS_ID')
df_combo

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key
0,69875,9939153590001701,The Wilson journal of ornithology.,['1559-4491'],['1559-4491'],p,2,1559-4491,,,,,
1,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447,"[['61588428410001701', 'JSTOR Life Sciences Co...","[['53588425000001701', 'The Wilson journal of ...",[ Available from 2006 volume: 118 issue: 1;],[Yes],[JSTOR]
2,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,,,
3,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other]
4,112352,9967759560001701,Journal of the University Film Association,['2328-1944'],"['0041-9311', '2328-1944']",e,7,2328-1944,"[['61535210810001701', 'JSTOR Arts and Science...","[['53538633520001701', 'Journal of the Univers...",[ Available from 1968 volume: 20 issue: 1 unti...,[Yes],[JSTOR]
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15654,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,,,
15655,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,,,
15656,125537,9968947900001701,International journal of adhesion and adhesives,['1879-0127'],"['0143-7496', '1879-0127']",e,101582,1879-0127,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier]
15657,114862,9977101571201701,Oriental insects,['2157-8745'],"['0030-5316', '2157-8745']",e,101697,2157-8745,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442430001701', 'Oriental insects'], ['...",[ Available from 1997 volume: 31 issue: 1 unti...,[Yes],[Taylor & Francis]
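
A left merge like the one above keeps every row of the left dataframe and leaves the e-resource columns empty where no portfolio data exists. As a hedged sketch (toy data; `records`, `e_info`, and `checked` are hypothetical names), the `indicator` and `validate` options of `pd.merge` can report how many rows found a match and confirm that the right-hand side has one row per `MMS_ID`:

```python
import pandas as pd

# Invented stand-ins for p_and_e_data and df_e_combo.
records = pd.DataFrame({'MMS_ID': ['991', '992', '993'],
                        'p_or_e': ['p', 'e', 'e']})
e_info = pd.DataFrame({'MMS_ID': ['992'],
                       'Vendor_key': [['JSTOR']]})

# validate raises MergeError if e_info has duplicate MMS_ID values;
# indicator adds a '_merge' column recording where each row came from.
checked = pd.merge(records, e_info, how='left', on='MMS_ID',
                   validate='many_to_one', indicator=True)

match_counts = checked['_merge'].value_counts()
print(match_counts)
```

Rows labeled `left_only` are the ones with no e-collection data, which in this notebook would include all print records.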


In [65]:
df_combo.to_pickle(f'p_and_e_with_combined_coverage_{today}.pkl')

## Create dataframes of BTAA SPR and Portico data

#### BTAA SPR

In [66]:
# Change the filename if necessary
btaa = pd.read_excel('BTAA-SPR.xlsx')
btaa

Unnamed: 0,ISSN Number,Title 1 (Print),Publisher (Print),Title 2 (Print),Publisher (Print).1,Title 3 (Print),Publisher (Print).2,(more bib records?),Match?,SPR Holdings,SPR Missing
0,0001-0782,Communications of the ACM.,Association for Computing Machinery,,,,,,YES,"2 (1959)-43 (2000), 45 (2002)-46 (2003), 50 (2...","v.44 (2001), v.47 (2004)-v.49 (2006), v.51 (20..."
1,0001-1541,AIChE journal.,Wiley Subscription Services Inc,,,,,,YES,"1 (1955)-50 (2004), 52 (2006)","v.51 (2005), v.53 (2007)-"
2,0001-2092,AORN journal.,Association of Operating Room Nurses,,,,,,YES,"4 (1966)-48 (1988), 51 (1990)-85 (2007)","v.1 (1963)-v.3 (1965),v.49 (1989)-50 (1989),v...."
3,0001-2815,Tissue antigens.,Munksgaard,,,,,,YES,1 (1971)-52 (1998),v.53 (1999)-
4,0001-2998,Seminars in nuclear medicine.,Grune & Stratton etc,,,,,,YES,1 (1971)-35 (2005),v.36 (2006)-
...,...,...,...,...,...,...,...,...,...,...,...
9649,8756-7547,"Genetic, social, and general psychology monogr...",Heldref Publications,,,,,,,,
9650,8756-7555,College teaching.,Heldref Publications a division of Helen Dwigh...,,,,,,,,
9651,8756-8225,Journal of college student psychotherapy.,Haworth Press,,,,,,,,
9652,8756-9264,Clinical progress in electrophysiology and pac...,Futura Pub Co,,,,,,,,


In [67]:
btaa_cols = list(btaa.columns)
btaa_cols

['ISSN Number',
 'Title 1 (Print)',
 'Publisher (Print)',
 'Title 2 (Print)',
 'Publisher (Print).1',
 'Title 3 (Print)',
 'Publisher (Print).2',
 '(more bib records?)',
 'Match?',
 'SPR Holdings',
 'SPR Missing']

In [68]:
btaa_cols_edit = []
for x in btaa_cols:
    btaa_cols_edit.append(str(x) + '_BTAA-SPR')
btaa_cols_edit

['ISSN Number_BTAA-SPR',
 'Title 1 (Print)_BTAA-SPR',
 'Publisher (Print)_BTAA-SPR',
 'Title 2 (Print)_BTAA-SPR',
 'Publisher (Print).1_BTAA-SPR',
 'Title 3 (Print)_BTAA-SPR',
 'Publisher (Print).2_BTAA-SPR',
 '(more bib records?)_BTAA-SPR',
 'Match?_BTAA-SPR',
 'SPR Holdings_BTAA-SPR',
 'SPR Missing_BTAA-SPR']

In [69]:
btaa_dict = dict(zip(btaa_cols, btaa_cols_edit))
btaa_dict

{'ISSN Number': 'ISSN Number_BTAA-SPR',
 'Title 1 (Print)': 'Title 1 (Print)_BTAA-SPR',
 'Publisher (Print)': 'Publisher (Print)_BTAA-SPR',
 'Title 2 (Print)': 'Title 2 (Print)_BTAA-SPR',
 'Publisher (Print).1': 'Publisher (Print).1_BTAA-SPR',
 'Title 3 (Print)': 'Title 3 (Print)_BTAA-SPR',
 'Publisher (Print).2': 'Publisher (Print).2_BTAA-SPR',
 '(more bib records?)': '(more bib records?)_BTAA-SPR',
 'Match?': 'Match?_BTAA-SPR',
 'SPR Holdings': 'SPR Holdings_BTAA-SPR',
 'SPR Missing': 'SPR Missing_BTAA-SPR'}

In [70]:
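# Note: index=str also converts the row labels to strings; the merge below
# joins on columns, so the index type does not affect it
btaa.rename(index=str, columns = btaa_dict, inplace=True)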
btaa

Unnamed: 0,ISSN Number_BTAA-SPR,Title 1 (Print)_BTAA-SPR,Publisher (Print)_BTAA-SPR,Title 2 (Print)_BTAA-SPR,Publisher (Print).1_BTAA-SPR,Title 3 (Print)_BTAA-SPR,Publisher (Print).2_BTAA-SPR,(more bib records?)_BTAA-SPR,Match?_BTAA-SPR,SPR Holdings_BTAA-SPR,SPR Missing_BTAA-SPR
0,0001-0782,Communications of the ACM.,Association for Computing Machinery,,,,,,YES,"2 (1959)-43 (2000), 45 (2002)-46 (2003), 50 (2...","v.44 (2001), v.47 (2004)-v.49 (2006), v.51 (20..."
1,0001-1541,AIChE journal.,Wiley Subscription Services Inc,,,,,,YES,"1 (1955)-50 (2004), 52 (2006)","v.51 (2005), v.53 (2007)-"
2,0001-2092,AORN journal.,Association of Operating Room Nurses,,,,,,YES,"4 (1966)-48 (1988), 51 (1990)-85 (2007)","v.1 (1963)-v.3 (1965),v.49 (1989)-50 (1989),v...."
3,0001-2815,Tissue antigens.,Munksgaard,,,,,,YES,1 (1971)-52 (1998),v.53 (1999)-
4,0001-2998,Seminars in nuclear medicine.,Grune & Stratton etc,,,,,,YES,1 (1971)-35 (2005),v.36 (2006)-
...,...,...,...,...,...,...,...,...,...,...,...
9649,8756-7547,"Genetic, social, and general psychology monogr...",Heldref Publications,,,,,,,,
9650,8756-7555,College teaching.,Heldref Publications a division of Helen Dwigh...,,,,,,,,
9651,8756-8225,Journal of college student psychotherapy.,Haworth Press,,,,,,,,
9652,8756-9264,Clinical progress in electrophysiology and pac...,Futura Pub Co,,,,,,,,
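
The four cells above build a rename mapping by hand. For reference, pandas offers `DataFrame.add_suffix`, which appends a suffix to every column label in one call. This is a small sketch on an invented one-row frame (`toy_btaa` is a hypothetical name):

```python
import pandas as pd

# Toy frame using three of the BTAA column names; the row is illustrative.
toy_btaa = pd.DataFrame({
    'ISSN Number': ['0001-0782'],
    'Match?': ['YES'],
    'SPR Holdings': ['2 (1959)-43 (2000)'],
})

# add_suffix returns a new dataframe and leaves the original unchanged.
suffixed = toy_btaa.add_suffix('_BTAA-SPR')
print(list(suffixed.columns))
```

Unlike `rename(..., inplace=True)`, this returns a new dataframe, so the result has to be assigned back to a name.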


#### Merge SPR data into df_combo

In [71]:
df_pcad_btaa = pd.merge(df_combo,btaa,how='left',left_on='ISSN_to_match',right_on='ISSN Number_BTAA-SPR')
df_pcad_btaa

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,Title 1 (Print)_BTAA-SPR,Publisher (Print)_BTAA-SPR,Title 2 (Print)_BTAA-SPR,Publisher (Print).1_BTAA-SPR,Title 3 (Print)_BTAA-SPR,Publisher (Print).2_BTAA-SPR,(more bib records?)_BTAA-SPR,Match?_BTAA-SPR,SPR Holdings_BTAA-SPR,SPR Missing_BTAA-SPR
0,69875,9939153590001701,The Wilson journal of ornithology.,['1559-4491'],['1559-4491'],p,2,1559-4491,,,...,,,,,,,,,,
1,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447,"[['61588428410001701', 'JSTOR Life Sciences Co...","[['53588425000001701', 'The Wilson journal of ...",...,,,,,,,,,,
2,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,IEEE transactions on ultrasonics engineering.,Institute of Electrical and Electronics Engineers,,,,,,YES,10 (1963),
3,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",...,,,,,,,,,,
4,112352,9967759560001701,Journal of the University Film Association,['2328-1944'],"['0041-9311', '2328-1944']",e,7,2328-1944,"[['61535210810001701', 'JSTOR Arts and Science...","[['53538633520001701', 'Journal of the Univers...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15654,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,...,,,,,,,,,,
15655,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,...,International journal of adhesion and adhesives.,IPC Science and Technology Press,,,,,,YES,10 (1990)-26 (2006),"v.1 (1980)-v.9 (1989), v.27 (2007)-"
15656,125537,9968947900001701,International journal of adhesion and adhesives,['1879-0127'],"['0143-7496', '1879-0127']",e,101582,1879-0127,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",...,,,,,,,,,,
15657,114862,9977101571201701,Oriental insects,['2157-8745'],"['0030-5316', '2157-8745']",e,101697,2157-8745,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442430001701', 'Oriental insects'], ['...",...,,,,,,,,,,
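
One detail to keep in mind for this merge: some rows have no value in `ISSN_to_match`, and pandas matches missing keys on the left to missing keys on the right, unlike a SQL join. Whether the BTAA spreadsheet actually contains rows with a blank ISSN is not known here, so the following is only a sketch on invented data showing the effect and one way to guard against it (`left`, `right`, and `right_clean` are hypothetical names):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'MMS_ID': ['991', '992', '993'],
                     'ISSN_to_match': ['0893-6706', np.nan, '1559-4491']})
right = pd.DataFrame({'ISSN Number_BTAA-SPR': ['0893-6706', np.nan],
                      'Match?_BTAA-SPR': ['YES', 'YES']})

# Naive merge: the NaN key on the left pairs with the NaN key on the right.
naive = pd.merge(left, right, how='left',
                 left_on='ISSN_to_match', right_on='ISSN Number_BTAA-SPR')

# Guarded merge: drop right-hand rows without an ISSN first, and let
# validate confirm that each ISSN appears at most once on the right.
right_clean = right.dropna(subset=['ISSN Number_BTAA-SPR'])
safe = pd.merge(left, right_clean, how='left',
                left_on='ISSN_to_match', right_on='ISSN Number_BTAA-SPR',
                validate='many_to_one')

print(naive['Match?_BTAA-SPR'].notna().sum(), safe['Match?_BTAA-SPR'].notna().sum())
```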


#### Portico

In [73]:
portico = pd.read_excel('portico_2020-Oct.xlsx')
portico

Unnamed: 0,Library Input Row,Electronic Collection Id,Electronic Collection Public Name,Portfolio Id,Available For Group,Availability,Lifecycle,ISSN,MMS Id,Bibliographic Lifecycle,...,Earliest Volume Preserved,Latest Year Preserved,Latest Volume Preserved,Num Preserved Articles,Num Preserved Issues,Num Preserved Volumes,URL to Journal in Audit Interface,Portico Content Provider,Portico Title ID,Notes
0,1,61535209430001701,American Medical Association Backfiles,53536123280001701,TwinCities,Available,In Repository,2375-057X; 0093-0326,9966629010001701,In Repository,...,1,1997.0,115,29375.0,450.0,52.0,http://audit.portico.org/stable?cs=ISSN_000399...,American Medical Association (through 2017),ISSN_00039950 | ISSN_00930326 | ISSN_00966339,
1,2,61535209430001701,American Medical Association Backfiles,53536407060001701,TwinCities,Available,In Repository,0730-188X,9966759170001701,In Repository,...,,,,,,,,,,
2,3,61535209430001701,American Medical Association Backfiles,53536420410001701,TwinCities,Available,In Repository,2376-3590; 0272-5533,9966757510001701,In Repository,...,,,,,,,,,,
3,4,61535209430001701,American Medical Association Backfiles,53536464180001701,TwinCities,Available,In Repository,2374-3018; 0002-922X,9966774910001701,In Repository,...,91,1997.0,151,16545.0,505.0,85.0,http://audit.portico.org/stable?cs=ISSN_000292...,American Medical Association (through 2017),ISSN_0002922X | ISSN_00966916 | ISSN_00968994 ...,
4,5,61535209430001701,American Medical Association Backfiles,53536464550001701,TwinCities,Available,In Repository,2376-3817; 0003-9977,9966774590001701,In Repository,...,1,1997.0,123,11918.0,311.0,52.0,http://audit.portico.org/stable?cs=ISSN_000399...,American Medical Association (through 2017),ISSN_00039977 | ISSN_00966894 | ISSN_02760673 ...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30871,30872,61816118310001701,Mary Ann Liebert Publishers Journals,53816118240001701,TwinCities,Available,In Repository,2151-1136,9975418997301701,In Repository,...,24,2018.0,32,450.0,52.0,9.0,http://audit.portico.org/stable?cs=ISSN_21511136,"Mary Ann Liebert, Inc.",ISSN_21511136,
30872,30873,61816118310001701,Mary Ann Liebert Publishers Journals,53816118260001701,TwinCities,Available,In Repository,1937-3392; 1937-3384,9966973510001701,In Repository,...,14,2019.0,25,1175.0,120.0,12.0,http://audit.portico.org/stable?cs=ISSN_19373384,"Mary Ann Liebert, Inc.",ISSN_19373384,
30873,30874,61816118310001701,Mary Ann Liebert Publishers Journals,53816118280001701,TwinCities,Available,In Repository,1937-3376; 1937-3368,9966973540001701,In Repository,...,14,2019.0,25,495.0,67.0,12.0,http://audit.portico.org/stable?cs=ISSN_19373368,"Mary Ann Liebert, Inc.",ISSN_19373368,
30874,30875,61816598280001701,Wiley Online Library Surgery Backfiles,53816598250001701,TwinCities,Available,In Repository,1365-2168; 0007-1323,9966956150001701,In Repository,...,1,2020.0,107,35244.0,912.0,107.0,http://audit.portico.org/stable?cs=ISSN_00071323,"John Wiley & Sons, Inc.",ISSN_00071323,


In [74]:
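# Split the pipe-delimited 'Linking ISSN' value into a list.
# Note: the pieces keep any whitespace around the '|' delimiter, and a
# missing value becomes ['nan'] because of str(x)
portico['Linking ISSN list'] = portico['Linking ISSN'].apply(lambda x: str(x).split('|'))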
portico

Unnamed: 0,Library Input Row,Electronic Collection Id,Electronic Collection Public Name,Portfolio Id,Available For Group,Availability,Lifecycle,ISSN,MMS Id,Bibliographic Lifecycle,...,Latest Year Preserved,Latest Volume Preserved,Num Preserved Articles,Num Preserved Issues,Num Preserved Volumes,URL to Journal in Audit Interface,Portico Content Provider,Portico Title ID,Notes,Linking ISSN list
0,1,61535209430001701,American Medical Association Backfiles,53536123280001701,TwinCities,Available,In Repository,2375-057X; 0093-0326,9966629010001701,In Repository,...,1997.0,115,29375.0,450.0,52.0,http://audit.portico.org/stable?cs=ISSN_000399...,American Medical Association (through 2017),ISSN_00039950 | ISSN_00930326 | ISSN_00966339,,"[0003-9950 , 0093-0326 , 2374-765X]"
1,2,61535209430001701,American Medical Association Backfiles,53536407060001701,TwinCities,Available,In Repository,0730-188X,9966759170001701,In Repository,...,,,,,,,,,,[0730-188X]
2,3,61535209430001701,American Medical Association Backfiles,53536420410001701,TwinCities,Available,In Repository,2376-3590; 0272-5533,9966757510001701,In Repository,...,,,,,,,,,,[0272-5533]
3,4,61535209430001701,American Medical Association Backfiles,53536464180001701,TwinCities,Available,In Repository,2374-3018; 0002-922X,9966774910001701,In Repository,...,1997.0,151,16545.0,505.0,85.0,http://audit.portico.org/stable?cs=ISSN_000292...,American Medical Association (through 2017),ISSN_0002922X | ISSN_00966916 | ISSN_00968994 ...,,"[0002-922X , 0096-8994 , 0096-6916 , 1072-4..."
4,5,61535209430001701,American Medical Association Backfiles,53536464550001701,TwinCities,Available,In Repository,2376-3817; 0003-9977,9966774590001701,In Repository,...,1997.0,123,11918.0,311.0,52.0,http://audit.portico.org/stable?cs=ISSN_000399...,American Medical Association (through 2017),ISSN_00039977 | ISSN_00966894 | ISSN_02760673 ...,,"[2376-3817 , 0096-6894 , 0276-0673 , 1538-3..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30871,30872,61816118310001701,Mary Ann Liebert Publishers Journals,53816118240001701,TwinCities,Available,In Repository,2151-1136,9975418997301701,In Repository,...,2018.0,32,450.0,52.0,9.0,http://audit.portico.org/stable?cs=ISSN_21511136,"Mary Ann Liebert, Inc.",ISSN_21511136,,[2151-1136]
30872,30873,61816118310001701,Mary Ann Liebert Publishers Journals,53816118260001701,TwinCities,Available,In Repository,1937-3392; 1937-3384,9966973510001701,In Repository,...,2019.0,25,1175.0,120.0,12.0,http://audit.portico.org/stable?cs=ISSN_19373384,"Mary Ann Liebert, Inc.",ISSN_19373384,,[1937-3384]
30873,30874,61816118310001701,Mary Ann Liebert Publishers Journals,53816118280001701,TwinCities,Available,In Repository,1937-3376; 1937-3368,9966973540001701,In Repository,...,2019.0,25,495.0,67.0,12.0,http://audit.portico.org/stable?cs=ISSN_19373368,"Mary Ann Liebert, Inc.",ISSN_19373368,,[1937-3368]
30874,30875,61816598280001701,Wiley Online Library Surgery Backfiles,53816598250001701,TwinCities,Available,In Repository,1365-2168; 0007-1323,9966956150001701,In Repository,...,2020.0,107,35244.0,912.0,107.0,http://audit.portico.org/stable?cs=ISSN_00071323,"John Wiley & Sons, Inc.",ISSN_00071323,,[1365-2168]


In [75]:
# Build one row per linking ISSN, indexed by the portico row it came from
melted_ids = pd.concat([pd.DataFrame(v, index=np.repeat(k,len(v))) for k,v in portico['Linking ISSN list'].to_dict().items()])
melted_ids = melted_ids.rename(columns={0:'Linking ISSN split'})
# Drop empty strings left behind by the split
melted_ids = melted_ids[melted_ids['Linking ISSN split'] != '']
melted_ids

Unnamed: 0,Linking ISSN split
0,0003-9950
0,0093-0326
0,2374-765X
1,0730-188X
2,0272-5533
...,...
30871,2151-1136
30872,1937-3384
30873,1937-3368
30874,1365-2168


In [76]:
# Join on the index so each portico row repeats once per linking ISSN
portico_gran = pd.merge(portico,melted_ids,how="left",left_index=True,right_index=True)
portico_gran

Unnamed: 0,Library Input Row,Electronic Collection Id,Electronic Collection Public Name,Portfolio Id,Available For Group,Availability,Lifecycle,ISSN,MMS Id,Bibliographic Lifecycle,...,Latest Volume Preserved,Num Preserved Articles,Num Preserved Issues,Num Preserved Volumes,URL to Journal in Audit Interface,Portico Content Provider,Portico Title ID,Notes,Linking ISSN list,Linking ISSN split
0,1,61535209430001701,American Medical Association Backfiles,53536123280001701,TwinCities,Available,In Repository,2375-057X; 0093-0326,9966629010001701,In Repository,...,115,29375.0,450.0,52.0,http://audit.portico.org/stable?cs=ISSN_000399...,American Medical Association (through 2017),ISSN_00039950 | ISSN_00930326 | ISSN_00966339,,"[0003-9950 , 0093-0326 , 2374-765X]",0003-9950
0,1,61535209430001701,American Medical Association Backfiles,53536123280001701,TwinCities,Available,In Repository,2375-057X; 0093-0326,9966629010001701,In Repository,...,115,29375.0,450.0,52.0,http://audit.portico.org/stable?cs=ISSN_000399...,American Medical Association (through 2017),ISSN_00039950 | ISSN_00930326 | ISSN_00966339,,"[0003-9950 , 0093-0326 , 2374-765X]",0093-0326
0,1,61535209430001701,American Medical Association Backfiles,53536123280001701,TwinCities,Available,In Repository,2375-057X; 0093-0326,9966629010001701,In Repository,...,115,29375.0,450.0,52.0,http://audit.portico.org/stable?cs=ISSN_000399...,American Medical Association (through 2017),ISSN_00039950 | ISSN_00930326 | ISSN_00966339,,"[0003-9950 , 0093-0326 , 2374-765X]",2374-765X
1,2,61535209430001701,American Medical Association Backfiles,53536407060001701,TwinCities,Available,In Repository,0730-188X,9966759170001701,In Repository,...,,,,,,,,,[0730-188X],0730-188X
2,3,61535209430001701,American Medical Association Backfiles,53536420410001701,TwinCities,Available,In Repository,2376-3590; 0272-5533,9966757510001701,In Repository,...,,,,,,,,,[0272-5533],0272-5533
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30871,30872,61816118310001701,Mary Ann Liebert Publishers Journals,53816118240001701,TwinCities,Available,In Repository,2151-1136,9975418997301701,In Repository,...,32,450.0,52.0,9.0,http://audit.portico.org/stable?cs=ISSN_21511136,"Mary Ann Liebert, Inc.",ISSN_21511136,,[2151-1136],2151-1136
30872,30873,61816118310001701,Mary Ann Liebert Publishers Journals,53816118260001701,TwinCities,Available,In Repository,1937-3392; 1937-3384,9966973510001701,In Repository,...,25,1175.0,120.0,12.0,http://audit.portico.org/stable?cs=ISSN_19373384,"Mary Ann Liebert, Inc.",ISSN_19373384,,[1937-3384],1937-3384
30873,30874,61816118310001701,Mary Ann Liebert Publishers Journals,53816118280001701,TwinCities,Available,In Repository,1937-3376; 1937-3368,9966973540001701,In Repository,...,25,495.0,67.0,12.0,http://audit.portico.org/stable?cs=ISSN_19373368,"Mary Ann Liebert, Inc.",ISSN_19373368,,[1937-3368],1937-3368
30874,30875,61816598280001701,Wiley Online Library Surgery Backfiles,53816598250001701,TwinCities,Available,In Repository,1365-2168; 0007-1323,9966956150001701,In Repository,...,107,35244.0,912.0,107.0,http://audit.portico.org/stable?cs=ISSN_00071323,"John Wiley & Sons, Inc.",ISSN_00071323,,[1365-2168],1365-2168


#### Merge Portico data into df with BTAA data

In [77]:
portico_gran.columns

Index(['Library Input Row', 'Electronic Collection Id',
       'Electronic Collection Public Name', 'Portfolio Id',
       'Available For Group', 'Availability', 'Lifecycle', 'ISSN', 'MMS Id',
       'Bibliographic Lifecycle', 'Title (Complete)', 'Linking ISSN',
       'Portico Match', 'Preservation Service', 'Portico Title',
       'Portico ISSN', 'PCA', 'Status', 'Earliest Year Preserved',
       'Earliest Volume Preserved', 'Latest Year Preserved',
       'Latest Volume Preserved', 'Num Preserved Articles',
       'Num Preserved Issues', 'Num Preserved Volumes',
       'URL to Journal in Audit Interface', 'Portico Content Provider',
       'Portico Title ID', 'Notes', 'Linking ISSN list', 'Linking ISSN split'],
      dtype='object')

In [78]:
portico_gran = portico_gran[['ISSN','Title (Complete)','Portico Match','Portico Title','PCA',
       'Status', 'Earliest Year Preserved', 'Latest Year Preserved','Linking ISSN list', 'Linking ISSN split']]
portico_gran

Unnamed: 0,ISSN,Title (Complete),Portico Match,Portico Title,PCA,Status,Earliest Year Preserved,Latest Year Preserved,Linking ISSN list,Linking ISSN split
0,2375-057X; 0093-0326,Archives of ophthalmology.,Yes,Archives of Ophthalmology (1929-1950) | Archiv...,No,Preserved,1929.0,1997.0,"[0003-9950 , 0093-0326 , 2374-765X]",0003-9950
0,2375-057X; 0093-0326,Archives of ophthalmology.,Yes,Archives of Ophthalmology (1929-1950) | Archiv...,No,Preserved,1929.0,1997.0,"[0003-9950 , 0093-0326 , 2374-765X]",0093-0326
0,2375-057X; 0093-0326,Archives of ophthalmology.,Yes,Archives of Ophthalmology (1929-1950) | Archiv...,No,Preserved,1929.0,1997.0,"[0003-9950 , 0093-0326 , 2374-765X]",2374-765X
1,0730-188X,The archives of internal medicine.,No,,,,,,[0730-188X],0730-188X
2,2376-3590; 0272-5533,Archives of surgery.,No,,,,,,[0272-5533],0272-5533
...,...,...,...,...,...,...,...,...,...,...
30871,2151-1136,Videourology.,Yes,Videourology,Yes,Preserved,2010.0,2018.0,[2151-1136],2151-1136
30872,1937-3392; 1937-3384,"Tissue engineering. Part C, Methods.",Yes,"Tissue Engineering, Part C, Methods",Yes,Preserved,2008.0,2019.0,[1937-3384],1937-3384
30873,1937-3376; 1937-3368,"Tissue engineering. Part B, Reviews.",Yes,"Tissue Engineering, Part B, Reviews",Yes,Preserved,2008.0,2019.0,[1937-3368],1937-3368
30874,1365-2168; 0007-1323,British journal of surgery.,Yes,British Journal of Surgery,Yes,Preserved,1913.0,2020.0,[1365-2168],1365-2168


In [79]:
portico_gran = portico_gran[(portico_gran['Portico Match'] == 'Yes') & (portico_gran['PCA'] == 'Yes') & 
                            (portico_gran['Status'] == 'Preserved')]
portico_gran

Unnamed: 0,ISSN,Title (Complete),Portico Match,Portico Title,PCA,Status,Earliest Year Preserved,Latest Year Preserved,Linking ISSN list,Linking ISSN split
30,1553-0795; 0002-9106,American journal of anatomy.,Yes,American Journal of Anatomy,Yes,Preserved,1901.0,1991.0,[0002-9106],0002-9106
31,1554-527X; 0736-0266,Journal of orthopaedic research.,Yes,Journal of Orthopaedic Research,Yes,Preserved,1983.0,2020.0,[1554-527X],1554-527X
32,1520-6327; 0739-4462,Archives of insect biochemistry and physiology.,Yes,Archives of Insect Biochemistry and Physiology,Yes,Preserved,1983.0,2020.0,[0739-4462],0739-4462
33,1097-0029; 1059-910X,Microscopy research and technique.,Yes,Microscopy Research and Technique,Yes,Preserved,1992.0,2020.0,[1097-0029],1097-0029
34,1097-4652; 0021-9541,Journal of cellular physiology.,Yes,Journal of Cellular Physiology,Yes,Preserved,1966.0,2020.0,[0021-9541],0021-9541
...,...,...,...,...,...,...,...,...,...,...
30870,0893-5068; 0893-5068,AIDS patient care.,Yes,AIDS Patient Care | AIDS Patient Care and STDs,Yes,Preserved,1987.0,2019.0,"[2332-4023 , 1087-2914]",1087-2914
30871,2151-1136,Videourology.,Yes,Videourology,Yes,Preserved,2010.0,2018.0,[2151-1136],2151-1136
30872,1937-3392; 1937-3384,"Tissue engineering. Part C, Methods.",Yes,"Tissue Engineering, Part C, Methods",Yes,Preserved,2008.0,2019.0,[1937-3384],1937-3384
30873,1937-3376; 1937-3368,"Tissue engineering. Part B, Reviews.",Yes,"Tissue Engineering, Part B, Reviews",Yes,Preserved,2008.0,2019.0,[1937-3368],1937-3368


In [80]:
portico_cols = list(portico_gran.columns)
portico_cols

['ISSN',
 'Title (Complete)',
 'Portico Match',
 'Portico Title',
 'PCA',
 'Status',
 'Earliest Year Preserved',
 'Latest Year Preserved',
 'Linking ISSN list',
 'Linking ISSN split']

In [81]:
# Append a _PORTICO suffix to every column name
portico_cols_edit = [f'{col}_PORTICO' for col in portico_cols]
portico_cols_edit

['ISSN_PORTICO',
 'Title (Complete)_PORTICO',
 'Portico Match_PORTICO',
 'Portico Title_PORTICO',
 'PCA_PORTICO',
 'Status_PORTICO',
 'Earliest Year Preserved_PORTICO',
 'Latest Year Preserved_PORTICO',
 'Linking ISSN list_PORTICO',
 'Linking ISSN split_PORTICO']

In [82]:
portico_dict = dict(zip(portico_cols, portico_cols_edit))
portico_dict

{'ISSN': 'ISSN_PORTICO',
 'Title (Complete)': 'Title (Complete)_PORTICO',
 'Portico Match': 'Portico Match_PORTICO',
 'Portico Title': 'Portico Title_PORTICO',
 'PCA': 'PCA_PORTICO',
 'Status': 'Status_PORTICO',
 'Earliest Year Preserved': 'Earliest Year Preserved_PORTICO',
 'Latest Year Preserved': 'Latest Year Preserved_PORTICO',
 'Linking ISSN list': 'Linking ISSN list_PORTICO',
 'Linking ISSN split': 'Linking ISSN split_PORTICO'}
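
The suffixing done in the cells above could also be expressed with pandas' built-in `DataFrame.add_suffix`, which skips the intermediate list and dict. This is only a sketch on a toy frame: `toy` and `toy_suffixed` are hypothetical names, and the single row is a cut-down stand-in for `portico_gran`.

```python
import pandas as pd

# Toy stand-in for portico_gran (hypothetical, three columns only)
toy = pd.DataFrame({'ISSN': ['2151-1136'], 'PCA': ['Yes'], 'Status': ['Preserved']})

# add_suffix returns a new DataFrame and leaves the original untouched
toy_suffixed = toy.add_suffix('_PORTICO')
print(list(toy_suffixed.columns))
```

Because `add_suffix` returns a new frame rather than modifying one in place, it should also avoid the copy-of-a-slice warning that an in-place rename on a filtered frame can raise.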

In [83]:
# Reassign rather than rename in place: an in-place rename on a filtered
# slice triggers pandas' SettingWithCopyWarning
portico_gran = portico_gran.rename(index=str, columns=portico_dict)
portico_gran

Unnamed: 0,ISSN_PORTICO,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO
30,1553-0795; 0002-9106,American journal of anatomy.,Yes,American Journal of Anatomy,Yes,Preserved,1901.0,1991.0,[0002-9106],0002-9106
31,1554-527X; 0736-0266,Journal of orthopaedic research.,Yes,Journal of Orthopaedic Research,Yes,Preserved,1983.0,2020.0,[1554-527X],1554-527X
32,1520-6327; 0739-4462,Archives of insect biochemistry and physiology.,Yes,Archives of Insect Biochemistry and Physiology,Yes,Preserved,1983.0,2020.0,[0739-4462],0739-4462
33,1097-0029; 1059-910X,Microscopy research and technique.,Yes,Microscopy Research and Technique,Yes,Preserved,1992.0,2020.0,[1097-0029],1097-0029
34,1097-4652; 0021-9541,Journal of cellular physiology.,Yes,Journal of Cellular Physiology,Yes,Preserved,1966.0,2020.0,[0021-9541],0021-9541
...,...,...,...,...,...,...,...,...,...,...
30870,0893-5068; 0893-5068,AIDS patient care.,Yes,AIDS Patient Care | AIDS Patient Care and STDs,Yes,Preserved,1987.0,2019.0,"[2332-4023 , 1087-2914]",1087-2914
30871,2151-1136,Videourology.,Yes,Videourology,Yes,Preserved,2010.0,2018.0,[2151-1136],2151-1136
30872,1937-3392; 1937-3384,"Tissue engineering. Part C, Methods.",Yes,"Tissue Engineering, Part C, Methods",Yes,Preserved,2008.0,2019.0,[1937-3384],1937-3384
30873,1937-3376; 1937-3368,"Tissue engineering. Part B, Reviews.",Yes,"Tissue Engineering, Part B, Reviews",Yes,Preserved,2008.0,2019.0,[1937-3368],1937-3368


In [84]:
df_pcad_btaa_portico = pd.merge(df_pcad_btaa,portico_gran,how='left',left_on='ISSN_to_match',right_on='ISSN_PORTICO')
df_pcad_btaa_portico

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,ISSN_PORTICO,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO
0,69875,9939153590001701,The Wilson journal of ornithology.,['1559-4491'],['1559-4491'],p,2,1559-4491,,,...,,,,,,,,,,
1,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447,"[['61588428410001701', 'JSTOR Life Sciences Co...","[['53588425000001701', 'The Wilson journal of ...",...,,,,,,,,,,
2,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,,,,,,,,,,
3,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",...,,,,,,,,,,
4,112352,9967759560001701,Journal of the University Film Association,['2328-1944'],"['0041-9311', '2328-1944']",e,7,2328-1944,"[['61535210810001701', 'JSTOR Arts and Science...","[['53538633520001701', 'Journal of the Univers...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16158,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,...,,,,,,,,,,
16159,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,...,,,,,,,,,,
16160,125537,9968947900001701,International journal of adhesion and adhesives,['1879-0127'],"['0143-7496', '1879-0127']",e,101582,1879-0127,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",...,,,,,,,,,,
16161,114862,9977101571201701,Oriental insects,['2157-8745'],"['0030-5316', '2157-8745']",e,101697,2157-8745,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442430001701', 'Oriental insects'], ['...",...,,,,,,,,,,
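
Most of the Portico columns in the preview above are empty, so it may help to count how many rows actually found a Portico match. One way to do that, sketched here on toy data, is the `indicator=True` option of `pd.merge`. The frames `left` and `right` and their rows are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy frames (hypothetical rows; column names follow the notebook)
left = pd.DataFrame({'MMS_ID': ['a', 'b', 'c'],
                     'ISSN_to_match': ['2151-1136', '0143-7496', '']})
right = pd.DataFrame({'ISSN_PORTICO': ['2151-1136'],
                      'Status_PORTICO': ['Preserved']})

# indicator=True adds a '_merge' column recording where each row came from
checked = pd.merge(left, right, how='left',
                   left_on='ISSN_to_match', right_on='ISSN_PORTICO',
                   indicator=True)

# 'both' means the row matched a Portico record; 'left_only' means it did not
n_matched = int((checked['_merge'] == 'both').sum())
n_unmatched = int((checked['_merge'] == 'left_only').sum())
print(n_matched, n_unmatched)
```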


In [85]:
df_pcad_btaa_portico.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_bib', 'ISSN_cluster',
       'p_or_e', 'matches_group_id', 'ISSN_to_match', 'e_coll_info',
       'portfolio_info', 'Coverage Information Combined', 'PCAD?',
       'Vendor_key', 'ISSN Number_BTAA-SPR', 'Title 1 (Print)_BTAA-SPR',
       'Publisher (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Publisher (Print).1_BTAA-SPR', 'Title 3 (Print)_BTAA-SPR',
       'Publisher (Print).2_BTAA-SPR', '(more bib records?)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'SPR Missing_BTAA-SPR',
       'ISSN_PORTICO', 'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO',
       'Linking ISSN list_PORTICO', 'Linking ISSN split_PORTICO'],
      dtype='object')

In [86]:
# Columns used to identify duplicate rows after the merge
dedupe_cols = [
    'record_index', 'MMS_ID', 'Title_bib', 'p_or_e', 'matches_group_id', 'ISSN_to_match',
    'ISSN Number_BTAA-SPR', 'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
    'Title 3 (Print)_BTAA-SPR', 'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR',
    'Title (Complete)_PORTICO', 'Portico Match_PORTICO', 'Portico Title_PORTICO',
    'PCA_PORTICO', 'Status_PORTICO', 'Earliest Year Preserved_PORTICO',
    'Latest Year Preserved_PORTICO', 'Linking ISSN split_PORTICO',
]
df_pcad_btaa_portico.drop_duplicates(subset=dedupe_cols, inplace=True)
df_pcad_btaa_portico

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,ISSN_PORTICO,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO
0,69875,9939153590001701,The Wilson journal of ornithology.,['1559-4491'],['1559-4491'],p,2,1559-4491,,,...,,,,,,,,,,
1,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447,"[['61588428410001701', 'JSTOR Life Sciences Co...","[['53588425000001701', 'The Wilson journal of ...",...,,,,,,,,,,
2,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,,,,,,,,,,
3,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",...,,,,,,,,,,
4,112352,9967759560001701,Journal of the University Film Association,['2328-1944'],"['0041-9311', '2328-1944']",e,7,2328-1944,"[['61535210810001701', 'JSTOR Arts and Science...","[['53538633520001701', 'Journal of the Univers...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16158,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,...,,,,,,,,,,
16159,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,...,,,,,,,,,,
16160,125537,9968947900001701,International journal of adhesion and adhesives,['1879-0127'],"['0143-7496', '1879-0127']",e,101582,1879-0127,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",...,,,,,,,,,,
16161,114862,9977101571201701,Oriental insects,['2157-8745'],"['0030-5316', '2157-8745']",e,101697,2157-8745,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442430001701', 'Oriental insects'], ['...",...,,,,,,,,,,


In [87]:
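# Rows that share an MMS_ID. Only the last expression in a cell is displayed,
# so this result is not shown; the full dataframe below is.
df_pcad_btaa_portico[df_pcad_btaa_portico.duplicated(['MMS_ID'], keep=False)].sort_values(['MMS_ID'])
df_pcad_btaa_portico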

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,ISSN_PORTICO,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO
0,69875,9939153590001701,The Wilson journal of ornithology.,['1559-4491'],['1559-4491'],p,2,1559-4491,,,...,,,,,,,,,,
1,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447,"[['61588428410001701', 'JSTOR Life Sciences Co...","[['53588425000001701', 'The Wilson journal of ...",...,,,,,,,,,,
2,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,,,,,,,,,,
3,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",...,,,,,,,,,,
4,112352,9967759560001701,Journal of the University Film Association,['2328-1944'],"['0041-9311', '2328-1944']",e,7,2328-1944,"[['61535210810001701', 'JSTOR Arts and Science...","[['53538633520001701', 'Journal of the Univers...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16158,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,...,,,,,,,,,,
16159,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,...,,,,,,,,,,
16160,125537,9968947900001701,International journal of adhesion and adhesives,['1879-0127'],"['0143-7496', '1879-0127']",e,101582,1879-0127,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",...,,,,,,,,,,
16161,114862,9977101571201701,Oriental insects,['2157-8745'],"['0030-5316', '2157-8745']",e,101697,2157-8745,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442430001701', 'Oriental insects'], ['...",...,,,,,,,,,,
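
To actually look at the records that share an MMS ID, the filtered result needs to be assigned or placed last in the cell. A minimal sketch on toy data follows; `toy` and `dupes` are hypothetical names and the rows are illustrative only.

```python
import pandas as pd

# Toy frame in which MMS_ID '2' appears twice (hypothetical data)
toy = pd.DataFrame({'MMS_ID': ['1', '2', '2', '3'],
                    'Status_PORTICO': [None, 'Preserved', 'Preserved', None]})

# keep=False flags every member of a duplicate set, not only the repeats
dupes = toy[toy.duplicated(['MMS_ID'], keep=False)].sort_values('MMS_ID')
print(dupes)
```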


In [89]:
df_pcad_btaa_portico.to_pickle(f'df_pcad_repos_{today}.pkl')

In [90]:
df_pcad_btaa_portico.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_bib', 'ISSN_cluster',
       'p_or_e', 'matches_group_id', 'ISSN_to_match', 'e_coll_info',
       'portfolio_info', 'Coverage Information Combined', 'PCAD?',
       'Vendor_key', 'ISSN Number_BTAA-SPR', 'Title 1 (Print)_BTAA-SPR',
       'Publisher (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Publisher (Print).1_BTAA-SPR', 'Title 3 (Print)_BTAA-SPR',
       'Publisher (Print).2_BTAA-SPR', '(more bib records?)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'SPR Missing_BTAA-SPR',
       'ISSN_PORTICO', 'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO',
       'Linking ISSN list_PORTICO', 'Linking ISSN split_PORTICO'],
      dtype='object')

In [91]:
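# Column subset for review. The result is not assigned, so df_pcad_btaa_portico keeps all of its columns.
df_pcad_btaa_portico[['record_index', 'MMS_ID', 'Title_bib', 'ISSN_bib', 'ISSN_cluster',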
       'p_or_e', 'matches_group_id', 'ISSN_to_match', 'e_coll_info',
       'portfolio_info', 'Coverage Information Combined', 'PCAD?',
       'Vendor_key', 'ISSN Number_BTAA-SPR', 'Title 1 (Print)_BTAA-SPR',
       'Title 2 (Print)_BTAA-SPR','Title 3 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR',
       'ISSN_PORTICO', 'Title (Complete)_PORTICO',
       'Portico Match_PORTICO', 'Portico Title_PORTICO', 'PCA_PORTICO',
       'Status_PORTICO', 'Earliest Year Preserved_PORTICO',
       'Latest Year Preserved_PORTICO']]
df_pcad_btaa_portico

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,ISSN_PORTICO,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO
0,69875,9939153590001701,The Wilson journal of ornithology.,['1559-4491'],['1559-4491'],p,2,1559-4491,,,...,,,,,,,,,,
1,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447,"[['61588428410001701', 'JSTOR Life Sciences Co...","[['53588425000001701', 'The Wilson journal of ...",...,,,,,,,,,,
2,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,,,,,,,,,,
3,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",...,,,,,,,,,,
4,112352,9967759560001701,Journal of the University Film Association,['2328-1944'],"['0041-9311', '2328-1944']",e,7,2328-1944,"[['61535210810001701', 'JSTOR Arts and Science...","[['53538633520001701', 'Journal of the Univers...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16158,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,...,,,,,,,,,,
16159,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,...,,,,,,,,,,
16160,125537,9968947900001701,International journal of adhesion and adhesives,['1879-0127'],"['0143-7496', '1879-0127']",e,101582,1879-0127,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",...,,,,,,,,,,
16161,114862,9977101571201701,Oriental insects,['2157-8745'],"['0030-5316', '2157-8745']",e,101697,2157-8745,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442430001701', 'Oriental insects'], ['...",...,,,,,,,,,,


#### Determine PCAD groups with repository coverage

In [92]:
df = df_pcad_btaa_portico

In [93]:
df.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_bib', 'ISSN_cluster',
       'p_or_e', 'matches_group_id', 'ISSN_to_match', 'e_coll_info',
       'portfolio_info', 'Coverage Information Combined', 'PCAD?',
       'Vendor_key', 'ISSN Number_BTAA-SPR', 'Title 1 (Print)_BTAA-SPR',
       'Publisher (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Publisher (Print).1_BTAA-SPR', 'Title 3 (Print)_BTAA-SPR',
       'Publisher (Print).2_BTAA-SPR', '(more bib records?)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'SPR Missing_BTAA-SPR',
       'ISSN_PORTICO', 'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO',
       'Linking ISSN list_PORTICO', 'Linking ISSN split_PORTICO'],
      dtype='object')

In [94]:
# Keep rows that have Portico preservation years or BTAA SPR holdings
has_preservation = df[df['Earliest Year Preserved_PORTICO'].notnull() | df['SPR Holdings_BTAA-SPR'].notnull()]
has_preservation

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,ISSN_PORTICO,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO
2,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,,,,,,,,,,
7,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],['0020-2681'],p,12,0020-2681,,,...,,,,,,,,,,
16,25439,9963082590001701,Giornale degli economisti e annali di economia,['0017-0097'],['0017-0097'],p,92,0017-0097,,,...,,,,,,,,,,
29,78022,9964617400001701,Mathematics of the USSR. Izvestija,['0025-5726'],['0025-5726'],p,267,0025-5726,,,...,,,,,,,,,,
32,62770,9942472340001701,Laboratory techniques in biochemistry and mole...,['0075-7535'],['0075-7535'],p,277,0075-7535,,,...,0075-7535,Laboratory techniques in biochemistry and mole...,Yes,Laboratory Techniques in Biochemistry and Mole...,Yes,Preserved,,,[2589-2789],2589-2789
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16146,91677,9953205880001701,Physics letters. Section B,['0370-2693'],"['0031-9163', '0370-2693']",p,101461,0370-2693,,,...,,,,,,,,,,
16149,18706,9950698160001701,Proceedings and addresses of the American Phil...,['0065-972X'],"['0065-972X', '0003-049X']",p,101522,0065-972X,,,...,,,,,,,,,,
16150,26122,9914298940001701,Proceedings of the American Philosophical Society,['0003-049X'],['0003-049X'],p,101522,0003-049X,,,...,,,,,,,,,,
16156,17960,9946768760001701,The Americas,['0003-1615'],"['1533-6247', '0003-1615']",p,101581,0003-1615,,,...,,,,,,,,,,


In [95]:
p_pres_groups = has_preservation['matches_group_id'].unique().tolist()
p_pres_groups

[5,
 12,
 92,
 267,
 277,
 513,
 537,
 686,
 707,
 739,
 744,
 895,
 914,
 1001,
 1066,
 1087,
 1208,
 1345,
 1527,
 1606,
 1725,
 1745,
 1809,
 1857,
 2059,
 2065,
 2067,
 2078,
 2364,
 2435,
 2492,
 2529,
 2537,
 2557,
 2571,
 2592,
 2603,
 2604,
 2725,
 2734,
 2961,
 2993,
 3023,
 3399,
 3434,
 3475,
 3489,
 3544,
 3573,
 3682,
 3702,
 3724,
 3740,
 3794,
 3861,
 3960,
 4013,
 4058,
 4105,
 4188,
 4237,
 4258,
 4265,
 4273,
 4300,
 4321,
 4338,
 4344,
 4361,
 4368,
 4376,
 4380,
 4381,
 4393,
 4401,
 4422,
 4508,
 4511,
 4542,
 4588,
 4618,
 4690,
 4851,
 4904,
 4909,
 4954,
 4961,
 5058,
 5087,
 5088,
 5093,
 5110,
 5186,
 5225,
 5371,
 5420,
 5433,
 5600,
 5660,
 5665,
 5700,
 5712,
 5714,
 5857,
 6100,
 6126,
 6195,
 6213,
 6294,
 6311,
 6326,
 6346,
 6352,
 6367,
 6419,
 6426,
 6429,
 6433,
 6434,
 6475,
 6476,
 6596,
 6619,
 6637,
 6687,
 6728,
 6797,
 6825,
 6841,
 6851,
 6894,
 6935,
 6959,
 7012,
 7120,
 7141,
 7152,
 7242,
 7254,
 7257,
 7339,
 7340,
 7341,
 7356,
 7520,
 ...

In [96]:
len(p_pres_groups)

3543

In [97]:
p_pres_groups_df = df[df['matches_group_id'].isin(p_pres_groups)]
p_pres_groups_df

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,ISSN_PORTICO,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO
2,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,,,,,,,,,,
3,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",...,,,,,,,,,,
7,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],['0020-2681'],p,12,0020-2681,,,...,,,,,,,,,,
8,111951,9968429800001701,Journal of the Institute of Actuaries,['2058-1009'],"['0020-2681', '2058-1009']",e,12,2058-1009,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",...,,,,,,,,,,
15,123907,9967115530001701,Giornale degli economisti e annali di economia,[''],['0017-0097'],e,92,,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16156,17960,9946768760001701,The Americas,['0003-1615'],"['1533-6247', '0003-1615']",p,101581,0003-1615,,,...,,,,,,,,,,
16157,117510,9967987860001701,The Americas - Academy of American Franciscan ...,['1533-6247'],"['1533-6247', '0003-1615']",e,101581,1533-6247,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",...,,,,,,,,,,
16158,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,...,,,,,,,,,,
16159,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,...,,,,,,,,,,


In [98]:
no_repo_p_pres_groups_df = df[~df['matches_group_id'].isin(p_pres_groups)]
no_repo_p_pres_groups_df

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,ISSN_PORTICO,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO
0,69875,9939153590001701,The Wilson journal of ornithology.,['1559-4491'],['1559-4491'],p,2,1559-4491,,,...,,,,,,,,,,
1,113344,9968879310001701,The Wilson journal of ornithology,['1938-5447'],"['1559-4491', '1938-5447']",e,2,1938-5447,"[['61588428410001701', 'JSTOR Life Sciences Co...","[['53588425000001701', 'The Wilson journal of ...",...,,,,,,,,,,
4,112352,9967759560001701,Journal of the University Film Association,['2328-1944'],"['0041-9311', '2328-1944']",e,7,2328-1944,"[['61535210810001701', 'JSTOR Arts and Science...","[['53538633520001701', 'Journal of the Univers...",...,,,,,,,,,,
5,107439,9967145520001701,Journal of the University Film and Video Assoc...,['2328-1936'],"['2328-1936', '0734-919X']",e,7,2328-1936,"[['61535210810001701', 'JSTOR Arts and Science...","[['53537295580001701', 'Journal of the Univers...",...,,,,,,,,,,
6,78492,9933370000001701,Journal of the University Film and Video Assoc...,['0734-919X'],"['0041-9311', '0734-919X']",p,7,0734-919X,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16140,36337,9957951680001701,Bird-banding,['0006-3630'],['0006-3630'],p,101457,0006-3630,,,...,,,,,,,,,,
16147,113874,9974799725001701,Nomos,[''],['0078-0979'],e,101504,,"[['61789519150001701', 'JSTOR Arts & Sciences ...","[['53789566180001701', 'Nomos.']]",...,,,,,,,,,,
16148,3295,9953296770001701,Nomos,['0078-0979'],['0078-0979'],p,101504,0078-0979,,,...,,,,,,,,,,
16161,114862,9977101571201701,Oriental insects,['2157-8745'],"['0030-5316', '2157-8745']",e,101697,2157-8745,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442430001701', 'Oriental insects'], ['...",...,,,,,,,,,,
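
The cells above split the dataframe at the group level: if any record in a `matches_group_id` group has repository coverage, every record in that group is kept. The same split can be sketched with `groupby(...).transform('any')` instead of an intermediate list of group IDs. The toy frame and the `'v.1-10'` holdings value below are made up for illustration.

```python
import pandas as pd

# Toy frame: group 5 has one record with SPR holdings, group 7 has none
toy = pd.DataFrame({
    'MMS_ID': ['p1', 'e1', 'p2', 'e2'],
    'p_or_e': ['p', 'e', 'p', 'e'],
    'matches_group_id': [5, 5, 7, 7],
    'Earliest Year Preserved_PORTICO': [float('nan')] * 4,
    'SPR Holdings_BTAA-SPR': ['v.1-10', None, None, None],  # hypothetical value
})

# Row-level flag: does this record have any repository coverage?
row_has_pres = (toy['Earliest Year Preserved_PORTICO'].notnull()
                | toy['SPR Holdings_BTAA-SPR'].notnull())

# Broadcast the flag to every record in the same group
group_has_pres = row_has_pres.groupby(toy['matches_group_id']).transform('any')

covered = toy[group_has_pres]
not_covered = toy[~group_has_pres]
print(covered['MMS_ID'].tolist(), not_covered['MMS_ID'].tolist())
```

Here the electronic record `e1` is kept even though its own coverage columns are empty, because its print counterpart in group 5 has holdings.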


In [99]:
p_pres_groups_df.to_pickle(f'all_groups_{today}.pkl')
no_repo_p_pres_groups_df.to_pickle(f'no_repo_groups_{today}.pkl')

In [100]:
df = p_pres_groups_df
df

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,ISSN_PORTICO,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO
2,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,,,,,,,,,,
3,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",...,,,,,,,,,,
7,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],['0020-2681'],p,12,0020-2681,,,...,,,,,,,,,,
8,111951,9968429800001701,Journal of the Institute of Actuaries,['2058-1009'],"['0020-2681', '2058-1009']",e,12,2058-1009,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",...,,,,,,,,,,
15,123907,9967115530001701,Giornale degli economisti e annali di economia,[''],['0017-0097'],e,92,,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16156,17960,9946768760001701,The Americas,['0003-1615'],"['1533-6247', '0003-1615']",p,101581,0003-1615,,,...,,,,,,,,,,
16157,117510,9967987860001701,The Americas - Academy of American Franciscan ...,['1533-6247'],"['1533-6247', '0003-1615']",e,101581,1533-6247,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",...,,,,,,,,,,
16158,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,...,,,,,,,,,,
16159,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,...,,,,,,,,,,


In [101]:
df['MMS_ID'].nunique()

8628

#### Get list of physical MMS IDs where group has p2e, PCAD, and repo coverage
The tab-delimited file written below is used to populate an itemized set in Alma of print records that are candidates for withdrawal.

In [102]:
p_mms_ids = df[df['p_or_e'] == 'p']['MMS_ID'].unique().tolist()
len(p_mms_ids)

4454

In [103]:
pmms = pd.DataFrame(p_mms_ids, columns=['MMS ID'])
pmms.to_csv(f'all_p_mms_ids-pcad-preserved_{today}.txt', sep='\t', index=False)
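
Before loading the file into Alma, it can be worth confirming that the IDs survive a round trip as text and contain no duplicates. The sketch below writes to an in-memory buffer instead of a file; the toy frame reuses three MMS IDs shown earlier in this notebook, and the names `toy`, `buffer`, and `round_trip` are hypothetical.

```python
import io
import pandas as pd

# Toy frame: two print records and one electronic record
toy = pd.DataFrame({
    'MMS_ID': ['9963550760001701', '9968441380001701', '9939481760001701'],
    'p_or_e': ['p', 'e', 'p'],
})
p_ids = toy[toy['p_or_e'] == 'p']['MMS_ID'].unique().tolist()

# Write the single-column, tab-delimited output to memory
buffer = io.StringIO()
pd.DataFrame(p_ids, columns=['MMS ID']).to_csv(buffer, sep='\t', index=False)

# Read it back with dtype=str so the long IDs stay text rather than numbers
buffer.seek(0)
round_trip = pd.read_csv(buffer, sep='\t', dtype=str)
print(round_trip)
```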