# PCAD notebook 5

This notebook completes the analysis, resulting in dataframes that are close to the final output.
- Counts items per title and merges that data with the main dataframe
- Processes chronology data for print holdings, trusted repositories, and electronic coverage so that overlap can be calculated
- Calculates the number of volumes per title that are candidates for withdrawal based on PCA and repository holdings
- Parses the main dataframe into separate dataframes by number of vendors, number of locations, and vendor groups

This is the trickiest section of the entire process. The last big drop-duplicates operation requires recasting list-type columns to strings, and then recasting the columns of the resulting dataframe to their original types. Some trial and error may be involved there; please improve the code if you can.
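
The string round trip described above can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's own code: the `rows` data and all variable names are made up, and it assumes the list values contain only literals that `ast.literal_eval` can parse back.

```python
import ast

# Hypothetical rows: (MMS ID, list of "library location" strings).
rows = [
    ('9954539020001701', ['TVET PER', 'TBIOM PERS']),
    ('9954539020001701', ['TVET PER', 'TBIOM PERS']),
    ('9936524420001701', ['TBIOM PERS']),
]

# Lists are unhashable, so cast them to strings before de-duplicating.
as_strings = [(mms_id, str(locs)) for mms_id, locs in rows]

# dict.fromkeys drops duplicates while keeping first-seen order.
deduped = list(dict.fromkeys(as_strings))

# Recast the string column back to real lists.
restored = [(mms_id, ast.literal_eval(locs)) for mms_id, locs in deduped]
print(restored)
```

The same idea applies column by column in a dataframe: cast with `str`, de-duplicate, then map `ast.literal_eval` over the columns that originally held lists.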

Required files/inputs:
- `.pkl` file of location-filtered item-level enumeration and chronology data produced by PCAD notebook 4
- "All groups" pickle file produced by PCAD notebook 3 (`all_groups_{date}.pkl`)

Outputs:
- `multi_loc_100*.pkl` files (8 files)
- `single_loc_100*.pkl` files (8 files)

Because this notebook is lengthy and contains several complex operations, several other `.pkl` files are saved before critical or risky operations; these can be discarded if they are not needed.

In [1]:
import ast
import math
import re
import pandas as pd
import numpy as np
from os.path import splitext
from datetime import date
today = str(date.today()).replace('-','')

In [2]:
# change filename to match the .pkl file produced by PCAD notebook 4
chron = pd.read_pickle('items-enumchron-20201011.pkl')
chron

Unnamed: 0,001,954$a,954$b,954$c,954$d,954$e,954$f,954$g,954$h,954$i
0,9936524420001701,23409495250001701,31951000208506R,8,1978,v.8 (1978),TBIOM,PERS,TBIOM,PERS
0,9954539020001701,23484960990001701,31951P002253064,350,1994,v.350:no.1-3 (1994),TBIOM,PERS,TBIOM,PERS
0,9954539020001701,23484960900001701,31951P00454482T,354,1996,v.354:no.4-6+suppl. (1996),TBIOM,PERS,TBIOM,PERS
0,9954539020001701,23484960670001701,31951000265036H,278-279,1973,v.278-279+suppl. (1973),TBIOM,PERS,TBIOM,PERS
0,9954539020001701,23484960620001701,31951000265039B,284-285,1974,v.284-285 (1974),TBIOM,PERS,TBIOM,PERS
...,...,...,...,...,...,...,...,...,...,...
0,9961581080001701,23516118280001701,31951P00204631A,39,1993,v.39:no.7-12 (1993),TBIOM,PERS,TBIOM,PERS
0,9954865450001701,23486326540001701,31951P01062120F,88,2009,v.88A:no.3-4(2009),TBIOM,PERS,TBIOM,PERS
0,9961654610001701,23516343280001701,31951P005115126,12,1996,v.12 (1996),TBIOM,PERS,TBIOM,PERS
0,9946991890001701,23453395340001701,319510026777515,117-118,1922-23,bd.117-118 (1922-23),TBIOM,PERS,TBIOM,PERS


In [4]:
chron.columns

Index(['001', '954$a', '954$b', '954$c', '954$d', '954$e', '954$f', '954$g',
       '954$h', '954$i'],
      dtype='object')

In [5]:
chron.rename(columns={'001':'001-MMS_ID', '954$a':'954$a-Holdings', '954$b':'954$b-barcode', 
                      '954$c':'954$c-enum', '954$d':'954$d-chron', '954$e':'954$e-descr', 
                      '954$f':'954$f-perm-lib', '954$g':'954$g-perm-loc','954$h':'954$h-curr-lib', 
                      '954$i':'954$i-curr-loc'},inplace=True)
chron

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc
0,9936524420001701,23409495250001701,31951000208506R,8,1978,v.8 (1978),TBIOM,PERS,TBIOM,PERS
1,9954539020001701,23484960990001701,31951P002253064,350,1994,v.350:no.1-3 (1994),TBIOM,PERS,TBIOM,PERS
2,9954539020001701,23484960900001701,31951P00454482T,354,1996,v.354:no.4-6+suppl. (1996),TBIOM,PERS,TBIOM,PERS
3,9954539020001701,23484960670001701,31951000265036H,278-279,1973,v.278-279+suppl. (1973),TBIOM,PERS,TBIOM,PERS
4,9954539020001701,23484960620001701,31951000265039B,284-285,1974,v.284-285 (1974),TBIOM,PERS,TBIOM,PERS
...,...,...,...,...,...,...,...,...,...,...
254171,9961581080001701,23516118280001701,31951P00204631A,39,1993,v.39:no.7-12 (1993),TBIOM,PERS,TBIOM,PERS
254172,9954865450001701,23486326540001701,31951P01062120F,88,2009,v.88A:no.3-4(2009),TBIOM,PERS,TBIOM,PERS
254173,9961654610001701,23516343280001701,31951P005115126,12,1996,v.12 (1996),TBIOM,PERS,TBIOM,PERS
254174,9946991890001701,23453395340001701,319510026777515,117-118,1922-23,bd.117-118 (1922-23),TBIOM,PERS,TBIOM,PERS


In [6]:
chron = chron[chron['954$h-curr-lib'].notnull()]
chron

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc
0,9936524420001701,23409495250001701,31951000208506R,8,1978,v.8 (1978),TBIOM,PERS,TBIOM,PERS
1,9954539020001701,23484960990001701,31951P002253064,350,1994,v.350:no.1-3 (1994),TBIOM,PERS,TBIOM,PERS
2,9954539020001701,23484960900001701,31951P00454482T,354,1996,v.354:no.4-6+suppl. (1996),TBIOM,PERS,TBIOM,PERS
3,9954539020001701,23484960670001701,31951000265036H,278-279,1973,v.278-279+suppl. (1973),TBIOM,PERS,TBIOM,PERS
4,9954539020001701,23484960620001701,31951000265039B,284-285,1974,v.284-285 (1974),TBIOM,PERS,TBIOM,PERS
...,...,...,...,...,...,...,...,...,...,...
254171,9961581080001701,23516118280001701,31951P00204631A,39,1993,v.39:no.7-12 (1993),TBIOM,PERS,TBIOM,PERS
254172,9954865450001701,23486326540001701,31951P01062120F,88,2009,v.88A:no.3-4(2009),TBIOM,PERS,TBIOM,PERS
254173,9961654610001701,23516343280001701,31951P005115126,12,1996,v.12 (1996),TBIOM,PERS,TBIOM,PERS
254174,9946991890001701,23453395340001701,319510026777515,117-118,1922-23,bd.117-118 (1922-23),TBIOM,PERS,TBIOM,PERS


In [7]:
chron['curr-lib-loc'] = chron.apply(lambda row: row['954$h-curr-lib'] + ' ' + row['954$i-curr-loc'], axis=1)
chron

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc,curr-lib-loc
0,9936524420001701,23409495250001701,31951000208506R,8,1978,v.8 (1978),TBIOM,PERS,TBIOM,PERS,TBIOM PERS
1,9954539020001701,23484960990001701,31951P002253064,350,1994,v.350:no.1-3 (1994),TBIOM,PERS,TBIOM,PERS,TBIOM PERS
2,9954539020001701,23484960900001701,31951P00454482T,354,1996,v.354:no.4-6+suppl. (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS
3,9954539020001701,23484960670001701,31951000265036H,278-279,1973,v.278-279+suppl. (1973),TBIOM,PERS,TBIOM,PERS,TBIOM PERS
4,9954539020001701,23484960620001701,31951000265039B,284-285,1974,v.284-285 (1974),TBIOM,PERS,TBIOM,PERS,TBIOM PERS
...,...,...,...,...,...,...,...,...,...,...,...
254171,9961581080001701,23516118280001701,31951P00204631A,39,1993,v.39:no.7-12 (1993),TBIOM,PERS,TBIOM,PERS,TBIOM PERS
254172,9954865450001701,23486326540001701,31951P01062120F,88,2009,v.88A:no.3-4(2009),TBIOM,PERS,TBIOM,PERS,TBIOM PERS
254173,9961654610001701,23516343280001701,31951P005115126,12,1996,v.12 (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS
254174,9946991890001701,23453395340001701,319510026777515,117-118,1922-23,bd.117-118 (1922-23),TBIOM,PERS,TBIOM,PERS,TBIOM PERS
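
The row-wise `apply` above can also be written as a column-wise operation, which is usually faster on a frame of this size. The sketch below uses a made-up three-row frame (`items` is a hypothetical name) with the same two column names; it assumes both columns hold strings.

```python
import pandas as pd

# Toy stand-in for the item-level frame; only the two columns used here.
items = pd.DataFrame({
    '954$h-curr-lib': ['TBIOM', 'TVET', 'ZMLAC'],
    '954$i-curr-loc': ['PERS', 'PER', 'OWL'],
})

# Column-wise concatenation: no Python-level loop over rows.
items['curr-lib-loc'] = items['954$h-curr-lib'] + ' ' + items['954$i-curr-loc']
print(items['curr-lib-loc'].tolist())
```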


In [8]:
cgs = chron.groupby('001-MMS_ID').agg(lambda x: list(set(x))).reset_index()
cgs

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc,curr-lib-loc
0,9912002260001701,"[23305012440001701, 23305012240001701, 2330501...","[31951P00378985E, 31951P00654897M, 31951P00992...","[22, 23, 26, 7, 12, 6, 33, 16, 31, 21, 30, 20,...","[nan, 2004, 2010, 1978, 1984, 1995, 1971, 1977...","[v.3:no.1 (1973:winter), v.30:no.3-4 (2000), v...",[TBIOM],[PERS],[TBIOM],[PERS],[TBIOM PERS]
1,9912023030001701,"[23305079150001701, 23305079170001701, 2330507...","[31951P00548994H, 31951P00403583M, 31951D01258...","[22, 7, 26, 23, 12, 6, 21, 16, 20, 1, 25, 14, ...","[nan, 1978, 1995, 1987, 1977, 2001, 1980, 1985...","[v.13:no.2 (1986:Summer), v.6 (1977), v.22 (19...",[TBIOM],[PERS],[TBIOM],[PERS],[TBIOM PERS]
2,9912037080001701,[23305080030001701],[31951P00784468K],[60],[2004],[v.60:no.1-2 2004],[TSCI],[PER],[TSCI],[PER],[TSCI PER]
3,9912038890001701,"[23305086820001701, 23305086840001701, 2330508...","[31951P00686774A, 31951P003775736, 31951P00607...","[325, 320, 319, 318, 324, 321]","[1994, 1995, 1997]","[t.324:no.1-6 (1997), t.319:no.7-12 (1994), v....","[TCOS, ZMLAC]","[SN1, OWL]","[TCOS, ZMLAC]","[SN1, OWL]","[TCOS SN1, ZMLAC OWL]"
4,9912065260001701,"[23305229070001701, 23305229110001701, 2330522...","[31951000608115E, 31951000608122H, 31951000608...","[3-4, 20, 22, 19, 23, 5-7, 15-16, 13-14, 11-12...","[1962-63, 1962, 1964, 1963, 1962/66, 1965, 196...","[v.22 (1966), v.15-16 (1965), v.8-10 (1964), v...",[ZMLAC],[OWL],[ZMLAC],[OWL],[ZMLAC OWL]
...,...,...,...,...,...,...,...,...,...,...,...
4448,9975901508601701,"[23485188320001701, 23485188170001701, 2348518...","[31951D02173957I, 31951D02173954O, 31951D02173...","[10-14, 19-22, 15-18, 23-26]","[1934-1937, 1942-1945, 1938-1941, 1929-1933]","[v.10-14 (1929-1933), v.15-18 (1934-1937), v.2...",[TMAGR],[PER],[TMAGR],[PER],[TMAGR PER]
4449,9975989409901701,"[23361481890001701, 23361483880001701, 2336148...","[31951000597515S, 31951D002011808, 31951D00201...","[46, 7, 44-45, 1-2, 6, 45, 43, 37, 34-35, 46-4...","[1905, 1906, 1890, 1904, 1888/89, 1885-1885/18...","[v.44-45 (1907), v.43 (1906), v.10 (1889/90), ...","[TWILS, TMAGR]","[PERC, PER]","[TWILS, TMAGR]","[PERC, PER]","[TWILS PERC, TMAGR PER]"
4450,9976125196401701,"[23454315640001701, 23454315530001701, 2345431...","[31951D00263557D, 31951D00263542Q, 31951D00263...","[26-27, 24-25, 19, 17, 34-35, 32-33, 36-37, 40...","[1933-1934, 1934-1935, 1930-1931, 1932, 1938, ...","[Bd.18 (1930), Bd.26-27 (1932-33), Bd.28-29 (1...",[TMAGR],[PER],[TMAGR],[PER],[TMAGR PER]
4451,9976125496801701,"[23454315800001701, 23454315810001701, 2345431...","[31951D00263534P, 31951D00263528K, 31951D00263...","[16, 3-4, 15, 7, 8, 14, 12, 10, 5, 13, 11, 9, ...","[1926, 1924-1925, 1925, 1928, 1927, 1929]","[Bd.13 (1929), Bd.11 (1928), Bd.7 (1926), Bd.1...",[TMAGR],[PER],[TMAGR],[PER],[TMAGR PER]
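
`list(set(x))` removes duplicates but does not guarantee the order of the resulting list, so the lists can differ between runs. If a stable order matters, one possible alternative is to de-duplicate through a dict, which preserves insertion order. The frame and the helper name `unique_in_order` below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy item-level data: title A has a repeated chronology value.
items = pd.DataFrame({
    '001-MMS_ID': ['A', 'A', 'A', 'B'],
    '954$d-chron': ['1994', '1996', '1994', '2004'],
})

def unique_in_order(values):
    """De-duplicate while keeping first-seen order."""
    return list(dict.fromkeys(values))

grouped = items.groupby('001-MMS_ID').agg(unique_in_order).reset_index()
print(grouped)
```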


In [9]:
chron0 = pd.merge(chron, cgs[['001-MMS_ID','curr-lib-loc']],how='left',on='001-MMS_ID')
chron0

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc,curr-lib-loc_x,curr-lib-loc_y
0,9936524420001701,23409495250001701,31951000208506R,8,1978,v.8 (1978),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS]
1,9954539020001701,23484960990001701,31951P002253064,350,1994,v.350:no.1-3 (1994),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]"
2,9954539020001701,23484960900001701,31951P00454482T,354,1996,v.354:no.4-6+suppl. (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]"
3,9954539020001701,23484960670001701,31951000265036H,278-279,1973,v.278-279+suppl. (1973),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]"
4,9954539020001701,23484960620001701,31951000265039B,284-285,1974,v.284-285 (1974),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]"
...,...,...,...,...,...,...,...,...,...,...,...,...
254171,9961581080001701,23516118280001701,31951P00204631A,39,1993,v.39:no.7-12 (1993),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[ZMLAC UMDN, TBIOM PERS]"
254172,9954865450001701,23486326540001701,31951P01062120F,88,2009,v.88A:no.3-4(2009),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS]
254173,9961654610001701,23516343280001701,31951P005115126,12,1996,v.12 (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS]
254174,9946991890001701,23453395340001701,319510026777515,117-118,1922-23,bd.117-118 (1922-23),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS]
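
The `_x` and `_y` suffixes above are the defaults that `pd.merge` applies when both frames share a non-key column name: `_x` marks the left (item-level) column and `_y` the right (title-level) one. The sketch below, on made-up data, shows how the `suffixes` parameter could give the columns more descriptive names. Later cells in this notebook rely on the default names, so adopting this would mean renaming there as well.

```python
import pandas as pd

# Item-level frame: one location string per item.
items = pd.DataFrame({
    '001-MMS_ID': ['A', 'A', 'B'],
    'curr-lib-loc': ['TBIOM PERS', 'TVET PER', 'TSCI PER'],
})
# Title-level frame: one list of locations per title.
per_title = pd.DataFrame({
    '001-MMS_ID': ['A', 'B'],
    'curr-lib-loc': [['TBIOM PERS', 'TVET PER'], ['TSCI PER']],
})

# Keep the item-level name unchanged and label the title-level list '_ALL'.
merged = pd.merge(items, per_title, how='left', on='001-MMS_ID',
                  suffixes=('', '_ALL'))
print(merged.columns.tolist())
```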


In [10]:
item_ct = chron0.groupby('001-MMS_ID').agg(lambda x: len(x))
item_ct

001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc,curr-lib-loc_x,curr-lib-loc_y
9912002260001701,74,74,74,74,74,74,74,74,74,74,74
9912023030001701,35,35,35,35,35,35,35,35,35,35,35
9912037080001701,1,1,1,1,1,1,1,1,1,1,1
9912038890001701,13,13,13,13,13,13,13,13,13,13,13
9912065260001701,14,14,14,14,14,14,14,14,14,14,14
...,...,...,...,...,...,...,...,...,...,...,...
9975901508601701,4,4,4,4,4,4,4,4,4,4,4
9975989409901701,34,34,34,34,34,34,34,34,34,34,34
9976125196401701,14,14,14,14,14,14,14,14,14,14,14
9976125496801701,14,14,14,14,14,14,14,14,14,14,14


In [11]:
item_ct.reset_index(inplace=True)
item_ct

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc,curr-lib-loc_x,curr-lib-loc_y
0,9912002260001701,74,74,74,74,74,74,74,74,74,74,74
1,9912023030001701,35,35,35,35,35,35,35,35,35,35,35
2,9912037080001701,1,1,1,1,1,1,1,1,1,1,1
3,9912038890001701,13,13,13,13,13,13,13,13,13,13,13
4,9912065260001701,14,14,14,14,14,14,14,14,14,14,14
...,...,...,...,...,...,...,...,...,...,...,...,...
4448,9975901508601701,4,4,4,4,4,4,4,4,4,4,4
4449,9975989409901701,34,34,34,34,34,34,34,34,34,34,34
4450,9976125196401701,14,14,14,14,14,14,14,14,14,14,14
4451,9976125496801701,14,14,14,14,14,14,14,14,14,14,14


In [12]:
item_ct = item_ct[['001-MMS_ID','954$b-barcode']].rename(columns={'954$b-barcode':'all_item_count'})
item_ct

Unnamed: 0,001-MMS_ID,all_item_count
0,9912002260001701,74
1,9912023030001701,35
2,9912037080001701,1
3,9912038890001701,13
4,9912065260001701,14
...,...,...
4448,9975901508601701,4
4449,9975989409901701,34
4450,9976125196401701,14
4451,9976125496801701,14
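
Cells 10 through 12 count rows per title by applying `len` to every column and then keeping one of the resulting columns. A shorter route to the same two-column frame is `groupby(...).size()`. The sketch below uses a made-up frame; `items` is a hypothetical name.

```python
import pandas as pd

# Toy item-level data: three items for title A, one for title B.
items = pd.DataFrame({
    '001-MMS_ID': ['A', 'A', 'A', 'B'],
    '954$b-barcode': ['b1', 'b2', 'b3', 'b4'],
})

# size() counts rows per group; reset_index(name=...) names the count column.
item_ct = (
    items.groupby('001-MMS_ID')
    .size()
    .reset_index(name='all_item_count')
)
print(item_ct)
```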


In [13]:
#optional check of a specific MMS ID
item_ct[item_ct['001-MMS_ID'] == '9956479890001701']

Unnamed: 0,001-MMS_ID,all_item_count
3628,9956479890001701,181


In [14]:
chron2 = pd.merge(chron0,item_ct,how='left',on='001-MMS_ID')
chron2

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc,curr-lib-loc_x,curr-lib-loc_y,all_item_count
0,9936524420001701,23409495250001701,31951000208506R,8,1978,v.8 (1978),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],123
1,9954539020001701,23484960990001701,31951P002253064,350,1994,v.350:no.1-3 (1994),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146
2,9954539020001701,23484960900001701,31951P00454482T,354,1996,v.354:no.4-6+suppl. (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146
3,9954539020001701,23484960670001701,31951000265036H,278-279,1973,v.278-279+suppl. (1973),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146
4,9954539020001701,23484960620001701,31951000265039B,284-285,1974,v.284-285 (1974),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146
...,...,...,...,...,...,...,...,...,...,...,...,...,...
254171,9961581080001701,23516118280001701,31951P00204631A,39,1993,v.39:no.7-12 (1993),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[ZMLAC UMDN, TBIOM PERS]",64
254172,9954865450001701,23486326540001701,31951P01062120F,88,2009,v.88A:no.3-4(2009),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],48
254173,9961654610001701,23516343280001701,31951P005115126,12,1996,v.12 (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],13
254174,9946991890001701,23453395340001701,319510026777515,117-118,1922-23,bd.117-118 (1922-23),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],216


In [15]:
chron2.fillna('',inplace=True)
chron2

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc,curr-lib-loc_x,curr-lib-loc_y,all_item_count
0,9936524420001701,23409495250001701,31951000208506R,8,1978,v.8 (1978),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],123
1,9954539020001701,23484960990001701,31951P002253064,350,1994,v.350:no.1-3 (1994),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146
2,9954539020001701,23484960900001701,31951P00454482T,354,1996,v.354:no.4-6+suppl. (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146
3,9954539020001701,23484960670001701,31951000265036H,278-279,1973,v.278-279+suppl. (1973),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146
4,9954539020001701,23484960620001701,31951000265039B,284-285,1974,v.284-285 (1974),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146
...,...,...,...,...,...,...,...,...,...,...,...,...,...
254171,9961581080001701,23516118280001701,31951P00204631A,39,1993,v.39:no.7-12 (1993),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[ZMLAC UMDN, TBIOM PERS]",64
254172,9954865450001701,23486326540001701,31951P01062120F,88,2009,v.88A:no.3-4(2009),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],48
254173,9961654610001701,23516343280001701,31951P005115126,12,1996,v.12 (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],13
254174,9946991890001701,23453395340001701,319510026777515,117-118,1922-23,bd.117-118 (1922-23),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],216


In [16]:
# Year patterns, tried from most to least specific:
# "1962-1963" or "1962/1963", then "1962-63" or "1962/63", then a bare "1962".
YEAR_PATTERNS = [
    re.compile(r'[1-2][0-9]{3}[-/][1-2][0-9]{3}'),
    re.compile(r'[1-2][0-9]{3}[-/][0-9]{2}'),
    re.compile(r'[1-2][0-9]{3}'),
]

def year_search(x):
    """Return the first year or year range found in x, or '' if there is none."""
    for pattern in YEAR_PATTERNS:
        match = pattern.search(x)
        if match:
            return match.group(0)
    return ""

In [17]:
chron2['descr-year'] = chron2['954$e-descr'].apply(year_search)
chron2

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc,curr-lib-loc_x,curr-lib-loc_y,all_item_count,descr-year
0,9936524420001701,23409495250001701,31951000208506R,8,1978,v.8 (1978),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],123,1978
1,9954539020001701,23484960990001701,31951P002253064,350,1994,v.350:no.1-3 (1994),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146,1994
2,9954539020001701,23484960900001701,31951P00454482T,354,1996,v.354:no.4-6+suppl. (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146,1996
3,9954539020001701,23484960670001701,31951000265036H,278-279,1973,v.278-279+suppl. (1973),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146,1973
4,9954539020001701,23484960620001701,31951000265039B,284-285,1974,v.284-285 (1974),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146,1974
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
254171,9961581080001701,23516118280001701,31951P00204631A,39,1993,v.39:no.7-12 (1993),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[ZMLAC UMDN, TBIOM PERS]",64,1993
254172,9954865450001701,23486326540001701,31951P01062120F,88,2009,v.88A:no.3-4(2009),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],48,2009
254173,9961654610001701,23516343280001701,31951P005115126,12,1996,v.12 (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],13,1996
254174,9946991890001701,23453395340001701,319510026777515,117-118,1922-23,bd.117-118 (1922-23),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],216,1922-23
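
The extracted values are still strings such as `1922-23`, so they cannot be compared numerically yet. As a rough sketch of how such a string might be turned into a start and end year for overlap calculations, the hypothetical helper `expand_year_range` below handles the three shapes that `year_search` returns. It is an assumption for illustration, not the method used later in this notebook, and it treats a two-digit end year that is smaller than the start as rolling into the next century.

```python
import re

# A four-digit year, optionally followed by '-' or '/' and a 4- or 2-digit year.
_RANGE = re.compile(r'([12][0-9]{3})(?:[-/]([0-9]{4}|[0-9]{2}))?')

def expand_year_range(chron):
    """Return (start, end) as ints, or None if no year is found."""
    match = _RANGE.search(chron)
    if match is None:
        return None
    start = int(match.group(1))
    tail = match.group(2)
    if tail is None:
        return (start, start)          # single year
    if len(tail) == 4:
        return (start, int(tail))      # e.g. 1929-1933
    end = start - start % 100 + int(tail)  # e.g. 1922-23 -> 1923
    if end < start:
        end += 100                     # e.g. 1999-00 -> 2000
    return (start, end)

for value in ['1978', '1922-23', '1962/66', '1929-1933', '']:
    print(repr(value), expand_year_range(value))
```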


In [18]:
chron2['chron'] = chron2.apply(lambda row: row['954$d-chron'] if (row['954$d-chron'] != '') else row['descr-year'], axis=1)
chron2

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc,curr-lib-loc_x,curr-lib-loc_y,all_item_count,descr-year,chron
0,9936524420001701,23409495250001701,31951000208506R,8,1978,v.8 (1978),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],123,1978,1978
1,9954539020001701,23484960990001701,31951P002253064,350,1994,v.350:no.1-3 (1994),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146,1994,1994
2,9954539020001701,23484960900001701,31951P00454482T,354,1996,v.354:no.4-6+suppl. (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146,1996,1996
3,9954539020001701,23484960670001701,31951000265036H,278-279,1973,v.278-279+suppl. (1973),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146,1973,1973
4,9954539020001701,23484960620001701,31951000265039B,284-285,1974,v.284-285 (1974),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[TVET PER, TBIOM PERS]",146,1974,1974
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
254171,9961581080001701,23516118280001701,31951P00204631A,39,1993,v.39:no.7-12 (1993),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,"[ZMLAC UMDN, TBIOM PERS]",64,1993,1993
254172,9954865450001701,23486326540001701,31951P01062120F,88,2009,v.88A:no.3-4(2009),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],48,2009,2009
254173,9961654610001701,23516343280001701,31951P005115126,12,1996,v.12 (1996),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],13,1996,1996
254174,9946991890001701,23453395340001701,319510026777515,117-118,1922-23,bd.117-118 (1922-23),TBIOM,PERS,TBIOM,PERS,TBIOM PERS,[TBIOM PERS],216,1922-23,1922-23


In [19]:
chron3 = chron2.groupby(['001-MMS_ID']).agg(lambda x: list(set(x))).reset_index()
chron3

Unnamed: 0,001-MMS_ID,954$a-Holdings,954$b-barcode,954$c-enum,954$d-chron,954$e-descr,954$f-perm-lib,954$g-perm-loc,954$h-curr-lib,954$i-curr-loc,curr-lib-loc_x,all_item_count,descr-year,chron
0,9912002260001701,"[23305012440001701, 23305012240001701, 2330501...","[31951P00378985E, 31951P00654897M, 31951P00992...","[22, 23, 26, 7, 12, 6, 33, 16, 31, 21, 30, 20,...","[, 2004, 2010, 1978, 1984, 1995, 1971, 1977, 1...","[v.3:no.1 (1973:winter), v.30:no.3-4 (2000), v...",[TBIOM],[PERS],[TBIOM],[PERS],[TBIOM PERS],[74],"[2004, 2010, 1978, 1984, 1995, 1971, 1977, 198...","[2004, 2010, 1978, 1984, 1995, 1971, 1977, 198..."
1,9912023030001701,"[23305079150001701, 23305079170001701, 2330507...","[31951P00548994H, 31951P00403583M, 31951D01258...","[22, 7, 26, 23, 12, 6, 21, 16, 20, 1, 25, 14, ...","[, 1978, 1995, 1987, 1977, 2001, 1980, 1985, 1...","[v.13:no.2 (1986:Summer), v.6 (1977), v.22 (19...",[TBIOM],[PERS],[TBIOM],[PERS],[TBIOM PERS],[35],"[1978, 1995, 1987, 1977, 2001, 1980, 1985, 199...","[1978, 1995, 1987, 1977, 2001, 1980, 1985, 199..."
2,9912037080001701,[23305080030001701],[31951P00784468K],[60],[2004],[v.60:no.1-2 2004],[TSCI],[PER],[TSCI],[PER],[TSCI PER],[1],[2004],[2004]
3,9912038890001701,"[23305086820001701, 23305086840001701, 2330508...","[31951P00686774A, 31951P003775736, 31951P00607...","[325, 320, 319, 318, 324, 321]","[1994, 1995, 1997]","[t.324:no.1-6 (1997), t.319:no.7-12 (1994), v....","[TCOS, ZMLAC]","[SN1, OWL]","[TCOS, ZMLAC]","[SN1, OWL]","[TCOS SN1, ZMLAC OWL]",[13],"[1994, 1995, 1997]","[1994, 1995, 1997]"
4,9912065260001701,"[23305229070001701, 23305229110001701, 2330522...","[31951000608115E, 31951000608122H, 31951000608...","[3-4, 20, 22, 19, 23, 5-7, 15-16, 13-14, 11-12...","[1962-63, 1962, 1964, 1963, 1962/66, 1965, 196...","[v.22 (1966), v.15-16 (1965), v.8-10 (1964), v...",[ZMLAC],[OWL],[ZMLAC],[OWL],[ZMLAC OWL],[14],"[1962-63, 1962, 1964, 1963, 1962/66, 1965, 196...","[1962-63, 1962, 1964, 1963, 1962/66, 1965, 196..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4448,9975901508601701,"[23485188320001701, 23485188170001701, 2348518...","[31951D02173957I, 31951D02173954O, 31951D02173...","[10-14, 19-22, 15-18, 23-26]","[1934-1937, 1942-1945, 1938-1941, 1929-1933]","[v.10-14 (1929-1933), v.15-18 (1934-1937), v.2...",[TMAGR],[PER],[TMAGR],[PER],[TMAGR PER],[4],"[1934-1937, 1942-1945, 1938-1941, 1929-1933]","[1934-1937, 1942-1945, 1938-1941, 1929-1933]"
4449,9975989409901701,"[23361481890001701, 23361483880001701, 2336148...","[31951000597515S, 31951D002011808, 31951D00201...","[46, 7, 44-45, 1-2, 6, 45, 43, 37, 34-35, 46-4...","[1905, 1906, 1890, 1904, 1888/89, 1885-1885/18...","[v.44-45 (1907), v.43 (1906), v.10 (1889/90), ...","[TWILS, TMAGR]","[PERC, PER]","[TWILS, TMAGR]","[PERC, PER]","[TWILS PERC, TMAGR PER]",[34],"[1905, 1906, 1890, 1904, 1888/89, 1885-1885, 1...","[1905, 1906, 1890, 1904, 1888/89, 1885-1885/18..."
4450,9976125196401701,"[23454315640001701, 23454315530001701, 2345431...","[31951D00263557D, 31951D00263542Q, 31951D00263...","[26-27, 24-25, 19, 17, 34-35, 32-33, 36-37, 40...","[1933-1934, 1934-1935, 1930-1931, 1932, 1938, ...","[Bd.18 (1930), Bd.26-27 (1932-33), Bd.28-29 (1...",[TMAGR],[PER],[TMAGR],[PER],[TMAGR PER],[14],"[1932-33, 1936-37, 1930-31, 1932, 1938, 1935, ...","[1933-1934, 1934-1935, 1930-1931, 1932, 1938, ..."
4451,9976125496801701,"[23454315800001701, 23454315810001701, 2345431...","[31951D00263534P, 31951D00263528K, 31951D00263...","[16, 3-4, 15, 7, 8, 14, 12, 10, 5, 13, 11, 9, ...","[1926, 1924-1925, 1925, 1928, 1927, 1929]","[Bd.13 (1929), Bd.11 (1928), Bd.7 (1926), Bd.1...",[TMAGR],[PER],[TMAGR],[PER],[TMAGR PER],[14],"[1926, 1924-25, 1925, 1928, 1927, 1929]","[1926, 1924-1925, 1925, 1928, 1927, 1929]"
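
Note that `curr-lib-loc_y` is missing from the output above. Its cells are lists, and `set()` cannot hash a list, so the aggregation most likely failed for that column and the pandas version used here dropped it without an error; newer pandas releases may raise an exception instead. The plain-Python sketch below (made-up data) shows the underlying failure and one way around it.

```python
# Two identical list values, as found in the curr-lib-loc_y column.
locations = [['TVET PER', 'TBIOM PERS'], ['TVET PER', 'TBIOM PERS']]

try:
    set(locations)
except TypeError as err:
    print(f'set() failed: {err}')

# Tuples are hashable, so converting first makes de-duplication possible.
unique = [list(t) for t in set(tuple(loc) for loc in locations)]
print(unique)
```

This is why the notebook carries the title-level location list in a separate frame (`chron4`) and casts it to a string before dropping duplicates.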


In [20]:
# rename() returns a new dataframe, which avoids modifying a slice of chron2 in place
chron4 = chron2[['001-MMS_ID','curr-lib-loc_y']].rename(columns={'curr-lib-loc_y':'curr-lib-loc_ALL'})
chron4

Unnamed: 0,001-MMS_ID,curr-lib-loc_ALL
0,9936524420001701,[TBIOM PERS]
1,9954539020001701,"[TVET PER, TBIOM PERS]"
2,9954539020001701,"[TVET PER, TBIOM PERS]"
3,9954539020001701,"[TVET PER, TBIOM PERS]"
4,9954539020001701,"[TVET PER, TBIOM PERS]"
...,...,...
254171,9961581080001701,"[ZMLAC UMDN, TBIOM PERS]"
254172,9954865450001701,[TBIOM PERS]
254173,9961654610001701,[TBIOM PERS]
254174,9946991890001701,[TBIOM PERS]


In [21]:
# cast the list column to str so that drop_duplicates can compare the values;
# assign() returns a new dataframe instead of writing into a slice
chron4 = chron4.assign(**{'curr-lib-loc_ALL': chron4['curr-lib-loc_ALL'].apply(str)})
chron4

Unnamed: 0,001-MMS_ID,curr-lib-loc_ALL
0,9936524420001701,['TBIOM PERS']
1,9954539020001701,"['TVET PER', 'TBIOM PERS']"
2,9954539020001701,"['TVET PER', 'TBIOM PERS']"
3,9954539020001701,"['TVET PER', 'TBIOM PERS']"
4,9954539020001701,"['TVET PER', 'TBIOM PERS']"
...,...,...
254171,9961581080001701,"['ZMLAC UMDN', 'TBIOM PERS']"
254172,9954865450001701,['TBIOM PERS']
254173,9961654610001701,['TBIOM PERS']
254174,9946991890001701,['TBIOM PERS']


In [22]:
# reassign rather than use inplace=True, which would modify a slice
chron4 = chron4.drop_duplicates()
chron4

Unnamed: 0,001-MMS_ID,curr-lib-loc_ALL
0,9936524420001701,['TBIOM PERS']
1,9954539020001701,"['TVET PER', 'TBIOM PERS']"
5,9957961320001701,['TBIOM PERS']
7,9956272140001701,"['TBIOM PERS', 'TZDS GEN']"
8,9915259360001701,['TBIOM PERS']
...,...,...
241614,9931402570001701,['TWILS GEN']
243145,9949826010001701,['TWILS PER']
246532,9963550760001701,['TZDS GEN']
252477,9942510430001701,['ZMLAC OWL']


In [23]:
chron3.columns

Index(['001-MMS_ID', '954$a-Holdings', '954$b-barcode', '954$c-enum',
       '954$d-chron', '954$e-descr', '954$f-perm-lib', '954$g-perm-loc',
       '954$h-curr-lib', '954$i-curr-loc', 'curr-lib-loc_x', 'all_item_count',
       'descr-year', 'chron'],
      dtype='object')

In [24]:
chron3 = chron3[['001-MMS_ID','curr-lib-loc_x','all_item_count', 'chron']]
chron3

Unnamed: 0,001-MMS_ID,curr-lib-loc_x,all_item_count,chron
0,9912002260001701,[TBIOM PERS],[74],"[2004, 2010, 1978, 1984, 1995, 1971, 1977, 198..."
1,9912023030001701,[TBIOM PERS],[35],"[1978, 1995, 1987, 1977, 2001, 1980, 1985, 199..."
2,9912037080001701,[TSCI PER],[1],[2004]
3,9912038890001701,"[TCOS SN1, ZMLAC OWL]",[13],"[1994, 1995, 1997]"
4,9912065260001701,[ZMLAC OWL],[14],"[1962-63, 1962, 1964, 1963, 1962/66, 1965, 196..."
...,...,...,...,...
4448,9975901508601701,[TMAGR PER],[4],"[1934-1937, 1942-1945, 1938-1941, 1929-1933]"
4449,9975989409901701,"[TWILS PERC, TMAGR PER]",[34],"[1905, 1906, 1890, 1904, 1888/89, 1885-1885/18..."
4450,9976125196401701,[TMAGR PER],[14],"[1933-1934, 1934-1935, 1930-1931, 1932, 1938, ..."
4451,9976125496801701,[TMAGR PER],[14],"[1926, 1924-1925, 1925, 1928, 1927, 1929]"


In [25]:
chron_combo = pd.merge(chron3, chron4, how='left',on='001-MMS_ID')
chron_combo

Unnamed: 0,001-MMS_ID,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL
0,9912002260001701,[TBIOM PERS],[74],"[2004, 2010, 1978, 1984, 1995, 1971, 1977, 198...",['TBIOM PERS']
1,9912023030001701,[TBIOM PERS],[35],"[1978, 1995, 1987, 1977, 2001, 1980, 1985, 199...",['TBIOM PERS']
2,9912037080001701,[TSCI PER],[1],[2004],['TSCI PER']
3,9912038890001701,"[TCOS SN1, ZMLAC OWL]",[13],"[1994, 1995, 1997]","['TCOS SN1', 'ZMLAC OWL']"
4,9912065260001701,[ZMLAC OWL],[14],"[1962-63, 1962, 1964, 1963, 1962/66, 1965, 196...",['ZMLAC OWL']
...,...,...,...,...,...
4448,9975901508601701,[TMAGR PER],[4],"[1934-1937, 1942-1945, 1938-1941, 1929-1933]",['TMAGR PER']
4449,9975989409901701,"[TWILS PERC, TMAGR PER]",[34],"[1905, 1906, 1890, 1904, 1888/89, 1885-1885/18...","['TWILS PERC', 'TMAGR PER']"
4450,9976125196401701,[TMAGR PER],[14],"[1933-1934, 1934-1935, 1930-1931, 1932, 1938, ...",['TMAGR PER']
4451,9976125496801701,[TMAGR PER],[14],"[1926, 1924-1925, 1925, 1928, 1927, 1929]",['TMAGR PER']


In [26]:
# change filename to match the "all groups" pickle file produced by PCAD notebook 3
groups_df = pd.read_pickle('all_groups_20201110.pkl')
groups_df

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,ISSN_PORTICO,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO
2,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,,,,,,,,,,
3,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",...,,,,,,,,,,
7,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],['0020-2681'],p,12,0020-2681,,,...,,,,,,,,,,
8,111951,9968429800001701,Journal of the Institute of Actuaries,['2058-1009'],"['0020-2681', '2058-1009']",e,12,2058-1009,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",...,,,,,,,,,,
15,123907,9967115530001701,Giornale degli economisti e annali di economia,[''],['0017-0097'],e,92,,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16156,17960,9946768760001701,The Americas,['0003-1615'],"['1533-6247', '0003-1615']",p,101581,0003-1615,,,...,,,,,,,,,,
16157,117510,9967987860001701,The Americas - Academy of American Franciscan ...,['1533-6247'],"['1533-6247', '0003-1615']",e,101581,1533-6247,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",...,,,,,,,,,,
16158,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,...,,,,,,,,,,
16159,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,...,,,,,,,,,,


In [27]:
df2 = pd.merge(groups_df,chron_combo,how='left',left_on='MMS_ID',right_on='001-MMS_ID')
df2

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO,001-MMS_ID,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL
0,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,,,,,,9963550760001701,[TZDS GEN],[1],[1963-1966],['TZDS GEN']
1,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",...,,,,,,,,,,
2,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],['0020-2681'],p,12,0020-2681,,,...,,,,,,9939481760001701,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS']
3,111951,9968429800001701,Journal of the Institute of Actuaries,['2058-1009'],"['0020-2681', '2058-1009']",e,12,2058-1009,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",...,,,,,,,,,,
4,123907,9967115530001701,Giornale degli economisti e annali di economia,[''],['0017-0097'],e,92,,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8867,17960,9946768760001701,The Americas,['0003-1615'],"['1533-6247', '0003-1615']",p,101581,0003-1615,,,...,,,,,,9946768760001701,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']"
8868,117510,9967987860001701,The Americas - Academy of American Franciscan ...,['1533-6247'],"['1533-6247', '0003-1615']",e,101581,1533-6247,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",...,,,,,,,,,,
8869,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,...,,,,,,,,,,
8870,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,...,,,,,,9957960200001701,[TSCI PER],[6],"[1989-90, 1985-86, 1982, 1980/81, 1983-84, 198...",['TSCI PER']


In [28]:
df2.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_bib', 'ISSN_cluster',
       'p_or_e', 'matches_group_id', 'ISSN_to_match', 'e_coll_info',
       'portfolio_info', 'Coverage Information Combined', 'PCAD?',
       'Vendor_key', 'ISSN Number_BTAA-SPR', 'Title 1 (Print)_BTAA-SPR',
       'Publisher (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Publisher (Print).1_BTAA-SPR', 'Title 3 (Print)_BTAA-SPR',
       'Publisher (Print).2_BTAA-SPR', '(more bib records?)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'SPR Missing_BTAA-SPR',
       'ISSN_PORTICO', 'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO',
       'Linking ISSN list_PORTICO', 'Linking ISSN split_PORTICO', '001-MMS_ID',
       'curr-lib-loc_x', 'all_item_count', 'chron', 'curr-lib-loc_ALL'],
      dtype='object')

#### Process chron data into usable year lists and ranges

In [29]:
dates = df2['chron'].dropna()
dates

0                                             [1963-1966]
2       [, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...
5       [1970, 1939, 1969, 1950, 1960, 1978, 1940, 199...
7       [1970, 1981-1982, 1980-1981, 1978, 1984, 1983-...
8       [1970, 1978, 1984, 1995, 1987, 1983, 2009, 197...
                              ...                        
8862                                         [1984, 1985]
8863    [2004, 2012, 1940, 1975-76, 1964, 1987, 1981-8...
8866    [1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...
8867    [, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...
8870    [1989-90, 1985-86, 1982, 1980/81, 1983-84, 198...
Name: chron, Length: 4628, dtype: object

In [30]:
dates = dates[dates.apply(len) > 0]
dates

0                                             [1963-1966]
2       [, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...
5       [1970, 1939, 1969, 1950, 1960, 1978, 1940, 199...
7       [1970, 1981-1982, 1980-1981, 1978, 1984, 1983-...
8       [1970, 1978, 1984, 1995, 1987, 1983, 2009, 197...
                              ...                        
8862                                         [1984, 1985]
8863    [2004, 2012, 1940, 1975-76, 1964, 1987, 1981-8...
8866    [1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...
8867    [, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...
8870    [1989-90, 1985-86, 1982, 1980/81, 1983-84, 198...
Name: chron, Length: 4628, dtype: object

In [31]:
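def date_fix(dlist):
    """Expand a list of chron strings (e.g. '1973-89', '1980/81') into a sorted list of unique integer years."""
    fixed_dates = []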
    for date in dlist:
        #date = date.replace(' ','')
            
        date = str(date).strip()
        if date == '':
            print(date)
            print('Nothin?')
        #1973-89

        elif re.findall(r"^\d{4}-\d{2}$", date):
            #print(date + " range")
            year, partyear = re.split('-',date)
            century = year[:2]
            newyear = century + partyear
            rangeYear = int(newyear) + 1
            year_range = list(range(int(year), rangeYear))
            #print(year_range)
            fixed_dates.extend(year_range)
            #print(fixed_dates)

        elif re.findall(r"^\d{4}$", date):
            ##print(date + " four digits")
            fixed_dates.append(int(date))
            ##print(fixed_dates)

        elif re.findall(r"^\d{4}-\d{4}$",date):
            #print(date + " year-year")
            year1, year2 = re.split(r'[-]',date)
            #print(year1)
            #print(year2)
            year_range = list(range(int(year1),int(year2)+1))
            #print(year_range)
            fixed_dates.extend(year_range)
            #print(fixed_dates)

        elif re.findall(r"^\d{4}\/\d{4}$",date):
            #print(date + " year/year")
            year1, year2 = re.split(r'[/]',date)
            #print(year1)
            #print(year2)
            year_range = list(range(int(year1),int(year2)+1))
            #print(year_range)
            fixed_dates.extend(year_range)
            #print(fixed_dates)

        elif re.findall(r"^\d{4}\/\d{2}$",date):
            #print(date + " year/yr")
            year, partyear = re.split(r'[/]',date)
            century = year[:2]
            newyear = century + partyear
            rangeYear = int(newyear) + 1
            year_range = list(range(int(year), rangeYear))
            #print(year_range)
            fixed_dates.extend(year_range)
            #print(fixed_dates)

        #1990/91-1991/92
        elif re.findall(r"^\d{4}\/\d{2}-\d{4}\/\d{2}$",date):
            #print(date)
            p = re.search(r"(?P<year1>^\d{4})(\/)(?P<yrpt2>\d{2})(-)(?P<year3>\d{4})(\/)(?P<yrpt4>\d{2}$)",date)
            #print(p)
            y1 = p.group('year1')
            c1 = y1[:2]
            y2 = c1 + p.group('yrpt2')
            range1 = list(range(int(y1),int(y2)+1))
            #print(range1)
            fixed_dates.extend(range1)

            y3 = p.group('year3')
            c2 = y3[:2]

            range3 = list(range(int(y2),int(y3)+1))
            #print(range3)
            fixed_dates.extend(range3)

            y4 = c2 + p.group('yrpt4')
            range2 = list(range(int(y3),int(y4)+1))
            #print(range2)
            fixed_dates.extend(range2)

            #print(fixed_dates)

        #2003/2004-2004/2005
        elif re.findall(r"^\d{4}\/\d{4}-\d{4}\/\d{4}$",date):
            #print(date + " year/year-year/year")
            date1, date2 = re.split('-',date)

            year1, year2 = re.split(r'[/]',date1)
            year_range1 = list(range(int(year1), int(year2)+1))
            #print(year_range1)
            fixed_dates.extend(year_range1)

            year3, year4 = re.split(r'[/]',date2)
            year_range2 = list(range(int(year2), int(year3)+1))
            #print(year_range2)
            fixed_dates.extend(year_range2)

            year_range3 = list(range(int(year3),int(year4)+1))
            #print(year_range3)
            fixed_dates.extend(year_range3)

            #print(fixed_dates)

        #2001-2001/2002
        elif re.findall(r"^\d{4}-\d{4}\/\d{4}$",date):
            #print(date + " year-year/year")
            date1, date2 = re.split('-',date)

            fixed_dates.append(int(date1))

            year1, year2 = re.split(r'[/]',date2)
            year_range = list(range(int(year1), int(year2)+1))
            #print(year_range)
            fixed_dates.extend(year_range)
            #print(fixed_dates)

        #1999/2000-2000
        elif re.findall(r"^\d{4}\/\d{4}-\d{4}$",date):
            #print(date + " year/year-year")
            date1, date2 = re.split('-',date)

            fixed_dates.append(int(date2))

            year1, year2 = re.split(r'[/]',date1)
            year_range = list(range(int(year1), int(year2)+1))
            #print(year_range)
            fixed_dates.extend(year_range)
            #print(fixed_dates)

        #1985-89/90
        elif re.findall(r"^\d{4}-\d{2}\/\d{2}$",date):
            #print(date + " year-yr/yr")
            y1,y2,y3 = re.split(r'[/-]',date)
            century = y1[:2]
            ny2 = century + y2
            ny3 = century + y3

            year_range1 = list(range(int(y1), int(ny2)+1))
            #print(year_range1)
            fixed_dates.extend(year_range1)

            year_range2 = list(range(int(ny2), int(ny3)+1))
            #print(year_range2)
            fixed_dates.extend(year_range2)

            #print(fixed_dates)        

        #1988/89-90
        elif re.findall(r"^\d{4}\/\d{2}-\d{2}$",date):
            #print(date + " year/yr-yr")
            year1, years = re.split(r'[/]',date)
            partyear1, partyear2 = re.split('-',years)
            century = year1[:2]
            newyear1 = century + partyear1
            newyear2 = century + partyear2

            rangeYear1 = int(newyear1) + 1
            year_range1 = list(range(int(year1), rangeYear1))
            #print(year_range1)
            fixed_dates.extend(year_range1)

            rangeYear2 = int(newyear2) + 1
            year_range2 = list(range(int(newyear1), rangeYear2))
            #print(year_range2)
            fixed_dates.extend(year_range2)

            #print(fixed_dates)    

        #2001/02-2004
        elif re.findall(r"^\d{4}\/\d{2}-\d{4}$",date):
            #print(date + " year/yr-year")
            years, year2 = re.split(r'[-]',date)
            year1, partyear =  re.split(r'[/]',years)
            century = year1[:2]
            newyear = century + partyear

            rangeYear1 = int(newyear) + 1
            year_range1 = list(range(int(year1), rangeYear1))
            #print(year_range1)
            fixed_dates.extend(year_range1)

            rangeYear2 = int(year2) + 1
            year_range2 = list(range(rangeYear1, rangeYear2))
            #print(year_range2)
            fixed_dates.extend(year_range2)
            #print(fixed_dates)            

        #1999/2000-2000/01
        elif re.findall(r"^\d{4}\/\d{4}-\d{4}\/\d{2}$",date):
            #print(date + " year/year-year/yr")
            ys1, ys2 = re.split(r'[-]',date)
            y1a, y1b  =  re.split(r'[/]',ys1)

            year_range1 = list(range(int(y1a), int(y1b)+1))
            #print(year_range1)
            fixed_dates.extend(year_range1)

            y2,py2 = re.split(r'[/]',ys2)
            ce2 = y2[:2]
            ny2 = ce2 + py2
            year_range2 = list(range(int(y2),int(ny2)+1))
            #print(year_range2)
            fixed_dates.extend(year_range2)
            #print(fixed_dates)  

        #1989-1990/91
        elif re.findall(r"^\d{4}-\d{4}\/\d{2}$",date):
            #print(date + " year-year/yr")
            y1, ys2 = re.split(r'[-]',date)
            y2a, y2b  =  re.split(r'[/]',ys2)

            year_range1 = list(range(int(y1), int(y2a)+1))
            #print(year_range1)
            fixed_dates.extend(year_range1)

            ce2 = y2a[:2]
            ny2 = ce2 + y2b
            year_range2 = list(range(int(y2a),int(ny2)+1))
            #print(year_range2)
            fixed_dates.extend(year_range2)
            #print(fixed_dates)  

        #1999,2001
        elif re.findall(r"^\d{4},\d{4}$",date):
            #print(date + " year,year")
            year1, year2 = re.split(r'[,]',date)
            year_range = [int(year1),int(year2)]
            #print(year_range)
            fixed_dates.extend(year_range)
            #print(fixed_dates)

        #200602006/2007
        elif re.findall(r"^\d{4}0\d{4}\/\d{4}$",date):
            p = re.search(r"(?P<year1>^\d{4})(?P<zero>0)(?P<range>\d{4}\/\d{4}$)",date)
            #print(p)
            y1 = p.group('year1')
            yset = p.group('range')
            year1,year2 = re.split(r'[/]',yset)
            yrange = list(range(int(y1),int(year1)+1))
            #print(yrange)
            fixed_dates.extend(yrange)
            yrange2 = list(range(int(year1),int(year2)+1))
            #print(yrange2)
            fixed_dates.extend(yrange2)

        #1960-68, 1978-2005
        elif re.findall(r"^\d{4}-\d{2},\s\d{4}-\d{4}$",date):
            #print(date)
            p = re.search(r"(?P<year1>^\d{4})(-)(?P<yrpt1>\d{2})(,\s)(?P<year3>\d{4})(-)(?P<year4>\d{4}$)",date)
            #p = re.search("(?P<year1>\d{4})(-)(?P<yrpt1>\d{2})(,\s)(?P<year3>\d{4})(-)",date)
            #print(p)
            y1 = p.group('year1')
            c1 = y1[:2]
            y2 = c1 + p.group('yrpt1')
            range1 = list(range(int(y1),int(y2)+1))
            #print(range1)
            fixed_dates.extend(range1)

            range2 = list(range(int(p.group('year3')),int(p.group('year4'))+1))
            #print(range2)
            fixed_dates.extend(range2)
            #print(fixed_dates)

        #1948-66,1970
        elif re.findall(r"^\d{4}-\d{2},\d{4}$",date):
            #print(date)
            p = re.search(r"(?P<year1>^\d{4})(-)(?P<yrpt1>\d{2})(,)(?P<year3>\d{4}$)",date)
            #print(p)
            y1 = p.group('year1')
            c1 = y1[:2]
            y2 = c1 + p.group('yrpt1')
            range1 = list(range(int(y1),int(y2)+1))
            #print(range1)
            fixed_dates.extend(range1)

            fixed_dates.append(int(p.group('year3')))
            #print(fixed_dates)

        #1988/89-89/90
        elif re.findall(r"^\d{4}\/\d{2}-\d{2}\/\d{2}$",date):
            #print(date)
            p = re.search(r"(?P<year1>^\d{4})(\/)(?P<yrpt2>\d{2})(-)(?P<yrpt3>\d{2})(\/)(?P<yrpt4>\d{2}$)",date)
            #print(p)
            y1 = p.group('year1')
            c1 = y1[:2]
            y2 = c1 + p.group('yrpt2')
            range1 = list(range(int(y1),int(y2)+1))
            #print(range1)
            fixed_dates.extend(range1)

            y3 = c1 + p.group('yrpt3')
            range2 = list(range(int(y2),int(y3)+1))
            #print(range2)
            fixed_dates.extend(range2)

            range3 = list(range(int(y3),int(c1 + p.group('yrpt4'))+1))
            #print(range3)
            fixed_dates.extend(range3)
            #print(fixed_dates)

        #1998/99-1999/2000
        elif re.findall(r"^\d{4}\/\d{2}-\d{4}\/\d{4}$",date):
            #print(date)
            p = re.search(r"(?P<year1>^\d{4})(\/)(?P<yrpt2>\d{2})(-)(?P<y3>\d{4})(\/)(?P<y4>\d{4}$)",date)
            #print(p)
            y1 = p.group('year1')
            c1 = y1[:2]
            y2 = c1 + p.group('yrpt2')
            range1 = list(range(int(y1),int(y2)+1))
            #print(range1)
            fixed_dates.extend(range1)

            y3 = p.group('y3')
            range2 = list(range(int(y2),int(y3)+1))
            #print(range2)
            fixed_dates.extend(range2)

            range3 = list(range(int(y3),int(p.group('y4'))+1))
            #print(range3)
            fixed_dates.extend(range3)
            #print(fixed_dates)

        #2015=2016
        elif re.findall(r"^\d{4}=\d{4}$",date):
            #print(date)
            p = re.search(r"(?P<year1>^\d{4})(=)(?P<year2>\d{4}$)",date)
            #print(p)
            #print(type(p.group('year1')))
            #print(type(p.group('year2')))
            range1 = list(range(int(p.group('year1')),int(p.group('year2'))+1))
            #print(range1)
            fixed_dates.extend(range1)
            #print(fixed_dates)

        #200120/02-2003/2004
        elif re.findall(r"^\d{6}\/\d{2}-\d{4}\/\d{4}$",date):
            #print(date)
            yrs1,yrs2 = re.split(r'[-]',date)
            yrs1 = yrs1.replace("/","")
            year1 = yrs1[:4]
            year2 = yrs1[4:]
            #print(year1)
            #print(year2)

            year3,year4 = yrs2.split('/')
            #print(year3)
            #print(year4)

            range1 = list(range(int(year1),int(year2)+1))
            #print(range1)
            fixed_dates.extend(range1)

            range2 = list(range(int(year2),int(year3)+1))
            #print(range2)
            fixed_dates.extend(range2)

            range3 = list(range(int(year3),int(year4)+1))
            #print(range3)
            fixed_dates.extend(range3)

            #print(fixed_dates)

        #1-1
        elif re.findall(r"^\d{1}-\d{1}$",date):
            print(date)
            print(" probably not a date?")

        #1997/98- 1998
        elif re.findall(r"^\d{4}\/\d{2}-\s\d{4}$",date):
            #print(date)
            yrs,year3 = re.split(r'[-]',date)
            year3 = year3.strip()
            year1,year2 = re.split(r'[/]',yrs)
            c1 = year1[:2]
            year2 = c1 + year2

            #print(year1)
            #print(year2)
            #print(year3)

            range1 = list(range(int(year1),int(year2)+1))
            #print(range1)
            fixed_dates.extend(range1)

            range2 = list(range(int(year2),int(year3)+1))
            #print(range2)
            fixed_dates.extend(range2)

            #print(fixed_dates)

        #2001-10-12
        elif re.findall(r"^\d{4}-\d{2}-\d{2}$",date):
            #print(date + " year-mm-dd")
            year, mo, day = re.split('-',date)
            fixed_dates.append(int(year))

            #print(fixed_dates)

        #1976-19.78
        elif re.findall(r"^\d{4}-\d{2}\.\d{2}$",date):
            #print(date + " year-yy.yy")
            year1, year2 = re.split('-',date)
            #print(year1)
            year2 = year2.replace(".","")
            #print(year2)
            rangeyear = list(range(int(year1),int(year2)+1))
            fixed_dates.extend(rangeyear)

            #print(fixed_dates)

        else:
            print(date)
            print("UNKNOWN FORMAT")
            #fixed_dates.append(int(date))
            #print(fixed_dates)

    return list(sorted(set(fixed_dates)))
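
The two-digit branches in `date_fix` (for example `1973-89` or `1980/81`) build the end year by reusing the century of the start year. A chron value that crosses a century boundary, such as `1999-00`, would therefore expand to an empty range. Below is a minimal sketch of one way to handle that case. It assumes that an end year which would otherwise fall before the start year belongs to the following century. `expand_short_year` is a hypothetical helper, not part of this notebook.

```python
def expand_short_year(year, partyear):
    """Return the inclusive list of years for a 'YYYY' start and a 'YY' end.

    Hypothetical helper: assumes that a two-digit end year which would land
    before the start year belongs to the following century.
    """
    start = int(year)
    end = int(year[:2] + partyear)
    if end < start:
        end += 100  # e.g. 1999-00 is read as 1999-2000
    return list(range(start, end + 1))


print(expand_short_year('1973', '75'))  # [1973, 1974, 1975]
print(expand_short_year('1999', '00'))  # [1999, 2000]
```

If something like this were adopted, each `century + partyear` computation in the function could call the helper instead, which would also shorten the branches.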

In [32]:
fixed_dates = dates.apply(date_fix)
fixed_dates


Nothin?
[...]
1984198
UNKNOWN FORMAT
[...]

0                                [1963, 1964, 1965, 1966]
2       [1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...
5       [1939, 1940, 1941, 1942, 1946, 1947, 1949, 195...
7       [1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...
8       [1969, 1970, 1972, 1975, 1976, 1978, 1980, 198...
                              ...                        
8862                                         [1984, 1985]
8863    [1935, 1936, 1937, 1938, 1939, 1940, 1943, 194...
8866    [1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...
8867    [1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...
8870    [1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...
Name: chron, Length: 4628, dtype: object
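
The printed `Nothin?` and `UNKNOWN FORMAT` messages do not say which row they came from, which makes a value like `1984198` hard to trace back to a title. Below is a minimal sketch of one way to keep that link. It assumes a parser that returns `None` for values it cannot read. `parse_year` and `collect_unparsed` are hypothetical names, the toy parser handles only plain four-digit years, and the row indices in `sample` are made up for illustration.

```python
import re


def parse_year(value):
    """Toy parser: return [year] for a plain four-digit year, else None."""
    value = str(value).strip()
    if re.fullmatch(r"\d{4}", value):
        return [int(value)]
    return None


def collect_unparsed(chron_by_index):
    """Map each row index to the non-empty values the parser rejected."""
    problems = {}
    for idx, values in chron_by_index.items():
        # Skip empty strings; keep anything non-empty that fails to parse.
        bad = [v for v in values if str(v).strip() and parse_year(v) is None]
        if bad:
            problems[idx] = bad
    return problems


# Made-up sample: index -> list of chron strings
sample = {0: ['1984', '1985'], 7: ['', '1984198', '1990']}
print(collect_unparsed(sample))  # {7: ['1984198']}
```

A pandas Series of lists could be passed the same way, since `Series.items()` also yields (index, value) pairs.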

In [33]:
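def ranges(ints):
    """Yield an inclusive (start, end) tuple for each run of consecutive integers; yield '' if ints is empty."""
    #print(ints)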
    ints = sorted(set(ints))
    if ints == []:
        print('empty list')
        yield ''
    else:
        range_start = previous_number = ints[0]
        for number in ints[1:]:
            #print(number)
            #print(type(number))
            #print(previous_number)
            #print(type(previous_number))
            if number == (previous_number + 1):
                previous_number = number
            else:
                yield range_start, previous_number
                range_start = previous_number = number
        yield range_start, previous_number
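
To see what this collapsing step produces, here is a self-contained sketch of the same idea applied to a small made-up list of years. `collapse_years` is a hypothetical stand-in for `ranges`: it returns a list instead of yielding, and returns an empty list for empty input rather than `''`.

```python
def collapse_years(years):
    """Collapse a collection of years into inclusive (start, end) runs."""
    runs = []
    for year in sorted(set(years)):
        if runs and year == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], year)  # extend the current run
        else:
            runs.append((year, year))       # start a new run
    return runs


print(collapse_years([1966, 1963, 1964, 1965, 1970, 1972, 1973]))
# [(1963, 1966), (1970, 1970), (1972, 1973)]
```

A single isolated year appears as a run whose start and end are equal, matching entries such as `(1890, 1890)` in the output of the next cell.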

In [34]:
date_ranges = fixed_dates.apply(ranges).apply(list)
date_ranges

empty list
[...]

0                                          [(1963, 1966)]
2       [(1890, 1890), (1892, 1892), (1895, 1895), (19...
5       [(1939, 1942), (1946, 1947), (1949, 1951), (19...
7                                          [(1967, 1992)]
8       [(1969, 1970), (1972, 1972), (1975, 1976), (19...
                              ...                        
8862                                       [(1984, 1985)]
8863    [(1935, 1940), (1943, 1944), (1947, 1951), (19...
8866           [(1937, 1947), (1952, 1952), (1960, 2003)]
8867                                       [(1944, 2014)]
8870                                       [(1980, 1990)]
Name: chron, Length: 4628, dtype: object

In [35]:
dates_df = pd.DataFrame(fixed_dates)
dates_df

Unnamed: 0,chron
0,"[1963, 1964, 1965, 1966]"
2,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193..."
5,"[1939, 1940, 1941, 1942, 1946, 1947, 1949, 195..."
7,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
8,"[1969, 1970, 1972, 1975, 1976, 1978, 1980, 198..."
...,...
8862,"[1984, 1985]"
8863,"[1935, 1936, 1937, 1938, 1939, 1940, 1943, 194..."
8866,"[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194..."
8867,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195..."


In [36]:
ranges_df = pd.DataFrame(date_ranges)
ranges_df

Unnamed: 0,chron
0,"[(1963, 1966)]"
2,"[(1890, 1890), (1892, 1892), (1895, 1895), (19..."
5,"[(1939, 1942), (1946, 1947), (1949, 1951), (19..."
7,"[(1967, 1992)]"
8,"[(1969, 1970), (1972, 1972), (1975, 1976), (19..."
...,...
8862,"[(1984, 1985)]"
8863,"[(1935, 1940), (1943, 1944), (1947, 1951), (19..."
8866,"[(1937, 1947), (1952, 1952), (1960, 2003)]"
8867,"[(1944, 2014)]"


In [37]:
combo_ranges = pd.merge(dates_df,ranges_df,how='outer',right_index=True,left_index=True,suffixes=['_as_list','_ranges_calc'])
combo_ranges

Unnamed: 0,chron_as_list,chron_ranges_calc
0,"[1963, 1964, 1965, 1966]","[(1963, 1966)]"
2,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19..."
5,"[1939, 1940, 1941, 1942, 1946, 1947, 1949, 195...","[(1939, 1942), (1946, 1947), (1949, 1951), (19..."
7,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[(1967, 1992)]"
8,"[1969, 1970, 1972, 1975, 1976, 1978, 1980, 198...","[(1969, 1970), (1972, 1972), (1975, 1976), (19..."
...,...,...
8862,"[1984, 1985]","[(1984, 1985)]"
8863,"[1935, 1936, 1937, 1938, 1939, 1940, 1943, 194...","[(1935, 1940), (1943, 1944), (1947, 1951), (19..."
8866,"[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]"
8867,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]"


In [38]:
df_with_ranges = pd.merge(df2,combo_ranges,how='left',right_index=True,left_index=True)
df_with_ranges

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO,001-MMS_ID,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc
0,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],['0893-6706'],p,5,0893-6706,,,...,,,,9963550760001701,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]"
1,128733,9968441380001701,IEEE transactions on ultrasonics engineering,['2162-1373'],"['0893-6706', '2162-1373']",e,5,2162-1373,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",...,,,,,,,,,,
2,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],['0020-2681'],p,12,0020-2681,,,...,,,,9939481760001701,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19..."
3,111951,9968429800001701,Journal of the Institute of Actuaries,['2058-1009'],"['0020-2681', '2058-1009']",e,12,2058-1009,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",...,,,,,,,,,,
4,123907,9967115530001701,Giornale degli economisti e annali di economia,[''],['0017-0097'],e,92,,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8867,17960,9946768760001701,The Americas,['0003-1615'],"['1533-6247', '0003-1615']",p,101581,0003-1615,,,...,,,,9946768760001701,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]"
8868,117510,9967987860001701,The Americas - Academy of American Franciscan ...,['1533-6247'],"['1533-6247', '0003-1615']",e,101581,1533-6247,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",...,,,,,,,,,,
8869,124654,9924328340001701,Americas (Online),[''],['0003-1615'],e,101581,,,,...,,,,,,,,,,
8870,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],['0143-7496'],p,101582,0143-7496,,,...,,,,9957960200001701,[TSCI PER],[6],"[1989-90, 1985-86, 1982, 1980/81, 1983-84, 198...",['TSCI PER'],"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[(1980, 1990)]"


In [39]:
#check to make sure every row with data had a range calculated
df_with_ranges[df_with_ranges['chron_as_list'].notnull() & df_with_ranges['chron_ranges_calc'].isna()]

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_bib,ISSN_cluster,p_or_e,matches_group_id,ISSN_to_match,e_coll_info,portfolio_info,...,Latest Year Preserved_PORTICO,Linking ISSN list_PORTICO,Linking ISSN split_PORTICO,001-MMS_ID,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc


In [40]:
df_with_ranges.to_pickle(f'df_with_date_ranges_{today}.pkl')

#### Extract only the columns useful for coverage comparison

In [41]:
df = df_with_ranges
df.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_bib', 'ISSN_cluster',
       'p_or_e', 'matches_group_id', 'ISSN_to_match', 'e_coll_info',
       'portfolio_info', 'Coverage Information Combined', 'PCAD?',
       'Vendor_key', 'ISSN Number_BTAA-SPR', 'Title 1 (Print)_BTAA-SPR',
       'Publisher (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Publisher (Print).1_BTAA-SPR', 'Title 3 (Print)_BTAA-SPR',
       'Publisher (Print).2_BTAA-SPR', '(more bib records?)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'SPR Missing_BTAA-SPR',
       'ISSN_PORTICO', 'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO',
       'Linking ISSN list_PORTICO', 'Linking ISSN split_PORTICO', '001-MMS_ID',
       'curr-lib-loc_x', 'all_item_count', 'chron', 'curr-lib-loc_ALL',
       'chron_as_list', 'chron_ranges_calc'],
      dtype='object')

In [42]:
df = df[['record_index', 'MMS_ID', 'Title_bib','ISSN_cluster',
         'p_or_e', 'matches_group_id','e_coll_info','portfolio_info',
         'Coverage Information Combined', 'PCAD?', 'Vendor_key',
         'Title 1 (Print)_BTAA-SPR','Title 2 (Print)_BTAA-SPR', 'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 
         'Title (Complete)_PORTICO','Portico Match_PORTICO', 'Portico Title_PORTICO', 
         'PCA_PORTICO','Status_PORTICO', 'Earliest Year Preserved_PORTICO',
         'Latest Year Preserved_PORTICO', 'curr-lib-loc_x','all_item_count', 
         'chron', 'curr-lib-loc_ALL', 'chron_as_list', 'chron_ranges_calc']]
df

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc
0,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,,,,,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]"
1,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,,,,
2,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,,,,,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19..."
3,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,,,,
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8867,17960,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,...,,,,,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]"
8868,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,,,,
8869,124654,9924328340001701,Americas (Online),['0003-1615'],e,101581,,,,,...,,,,,,,,,,
8870,31683,9957960200001701,International journal of adhesion and adhesives,['0143-7496'],p,101582,,,,,...,,,,,[TSCI PER],[6],"[1989-90, 1985-86, 1982, 1980/81, 1983-84, 198...",['TSCI PER'],"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[(1980, 1990)]"


In [43]:
df = df.sort_values(by=['matches_group_id','Title_bib','p_or_e'])
df

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc
1,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,,,,
0,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,,,,,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]"
3,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,,,,
2,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,,,,,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19..."
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8869,124654,9924328340001701,Americas (Online),['0003-1615'],e,101581,,,,,...,,,,,,,,,,
8867,17960,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,...,,,,,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]"
8868,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,,,,
8871,125537,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],...,,,,,,,,,,


In [44]:
df = df[((df['p_or_e'] == 'e') & (df['PCAD?'].notnull())) | (df['p_or_e'] == 'p')]
df

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc
1,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,,,,
0,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,,,,,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]"
3,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,,,,
2,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,,,,,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19..."
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8866,61904,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,...,,,,,"[TSCI PER, ZMLAC OWL]",[53],"[1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]"
8867,17960,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,...,,,,,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]"
8868,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,,,,
8871,125537,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],...,,,,,,,,,,


In [45]:
df['matches_group_id'].nunique()

3543

In [46]:
dfe = df[df['p_or_e'] == 'e']
dfe

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc
1,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,,,,
3,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,,,,
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,,,,
6,104047,9968936470001701,Mathematics of the USSR. Izvestija (Online),"['0025-5726', '2169-5075']",e,267,"[['61695747580001701', 'Institute of Physics T...","[['53695747410001701', 'Mathematics of the USS...",[Unknown],[Yes],...,,,,,,,,,,
10,115295,9968336850001701,Laboratory techniques in biochemistry and mole...,['0075-7535'],e,277,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624652880001701', 'Laboratory techniques ...",[ Available from 2007 volume: 32 until 2009 vo...,[Yes],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8856,118480,9968239890001701,Physics letters.,"['0375-9601', '1873-2429']",e,101461,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624670170001701', 'Physics letters.']]",[ Available from 1967-01-02 volume: 24 issue: 1;],[Yes],...,,,,,,,,,,
8857,118481,9968240180001701,Physics letters.,"['1873-2445', '0370-2693']",e,101461,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624670190001701', 'Physics letters.']]",[ Available from 1967-01-09 volume: 24 issue: 1;],[Yes],...,,,,,,,,,,
8864,123828,9968665290001701,Proceedings and addresses of the American Phil...,"['2325-9248', '0065-972X']",e,101522,"[['61535211010001701', 'JSTOR Arts and Science...","[['53540700110001701', 'Proceedings and addres...",[ Available from 1927 volume: 1;],[Yes],...,,,,,,,,,,
8868,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,,,,
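
`dfe` is a boolean-mask selection from `df`, so later assignments to its columns can trigger pandas' SettingWithCopyWarning. A minimal sketch of one common way to avoid this, taking an explicit copy before modifying the subset. The frame `toy` and its values are invented for illustration and are not the notebook's data.

```python
import pandas as pd

# Toy stand-in for df (illustrative values only)
toy = pd.DataFrame({'p_or_e': ['e', 'p', 'e'],
                    'PCAD?': [['Yes'], None, ['Yes']]})

# .copy() makes the subset an independent frame, so assigning to it is unambiguous
toy_e = toy[toy['p_or_e'] == 'e'].copy()
toy_e['PCAD?'] = toy_e['PCAD?'].apply(str)

print(toy_e)
print(toy)   # the parent frame still holds the original list values
```

The trade-off is a little extra memory for the copy, which is usually negligible at this scale.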


In [47]:
dfe.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e',
       'matches_group_id', 'e_coll_info', 'portfolio_info',
       'Coverage Information Combined', 'PCAD?', 'Vendor_key',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'Title (Complete)_PORTICO',
       'Portico Match_PORTICO', 'Portico Title_PORTICO', 'PCA_PORTICO',
       'Status_PORTICO', 'Earliest Year Preserved_PORTICO',
       'Latest Year Preserved_PORTICO', 'curr-lib-loc_x', 'all_item_count',
       'chron', 'curr-lib-loc_ALL', 'chron_as_list', 'chron_ranges_calc'],
      dtype='object')

In [48]:
# Cast the list-type PCAD? values to strings; assign() returns a new frame,
# which avoids writing into a slice of df
dfe = dfe.assign(**{'PCAD?': dfe['PCAD?'].apply(str)})


In [49]:
dfe = dfe[['MMS_ID','Title_bib','p_or_e','PCAD?','Coverage Information Combined']]
dfe

Unnamed: 0,MMS_ID,Title_bib,p_or_e,PCAD?,Coverage Information Combined
1,9968441380001701,IEEE transactions on ultrasonics engineering,e,['Yes'],[ Available from 1963 volume: 10 issue: 1 unti...
3,9968429800001701,Journal of the Institute of Actuaries,e,['Yes'],[ Available from 1886 volume: 25 issue: 5 unti...
4,9967115530001701,Giornale degli economisti e annali di economia,e,['Yes'],[ Available from 1939 volume: 1 until 2012;]
6,9968936470001701,Mathematics of the USSR. Izvestija (Online),e,['Yes'],[Unknown]
10,9968336850001701,Laboratory techniques in biochemistry and mole...,e,['Yes'],[ Available from 2007 volume: 32 until 2009 vo...
...,...,...,...,...,...
8856,9968239890001701,Physics letters.,e,['Yes'],[ Available from 1967-01-02 volume: 24 issue: 1;]
8857,9968240180001701,Physics letters.,e,['Yes'],[ Available from 1967-01-09 volume: 24 issue: 1;]
8864,9968665290001701,Proceedings and addresses of the American Phil...,e,['Yes'],[ Available from 1927 volume: 1;]
8868,9967987860001701,The Americas - Academy of American Franciscan ...,e,['Yes'],[ Available from 1944 volume: 1 issue: 1;]


In [50]:
# Keep the original df index as a 'record_index' column; chaining returns a
# new frame instead of modifying a slice in place
dfe = dfe.reset_index().rename(columns={'index':'record_index'})
dfe


Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,PCAD?,Coverage Information Combined
0,1,9968441380001701,IEEE transactions on ultrasonics engineering,e,['Yes'],[ Available from 1963 volume: 10 issue: 1 unti...
1,3,9968429800001701,Journal of the Institute of Actuaries,e,['Yes'],[ Available from 1886 volume: 25 issue: 5 unti...
2,4,9967115530001701,Giornale degli economisti e annali di economia,e,['Yes'],[ Available from 1939 volume: 1 until 2012;]
3,6,9968936470001701,Mathematics of the USSR. Izvestija (Online),e,['Yes'],[Unknown]
4,10,9968336850001701,Laboratory techniques in biochemistry and mole...,e,['Yes'],[ Available from 2007 volume: 32 until 2009 vo...
...,...,...,...,...,...,...
3964,8856,9968239890001701,Physics letters.,e,['Yes'],[ Available from 1967-01-02 volume: 24 issue: 1;]
3965,8857,9968240180001701,Physics letters.,e,['Yes'],[ Available from 1967-01-09 volume: 24 issue: 1;]
3966,8864,9968665290001701,Proceedings and addresses of the American Phil...,e,['Yes'],[ Available from 1927 volume: 1;]
3967,8868,9967987860001701,The Americas - Academy of American Franciscan ...,e,['Yes'],[ Available from 1944 volume: 1 issue: 1;]


In [51]:
def avail_from(x):
    # Return the four-digit year that follows "Available from", or NaN if absent
    match = re.search(r'Available\sfrom\s([1-2][0-9]{3})', x)
    if match:
        return match.group(1)
    else:
        return np.nan

In [52]:
melted_cov = pd.concat([pd.DataFrame(v, index=np.repeat(k,len(v))) for k,v in dfe['Coverage Information Combined'].to_dict().items()])
melted_cov = melted_cov.rename(columns={0:'Coverage_atomic'})
melted_cov = melted_cov[melted_cov['Coverage_atomic'] != '']
melted_cov

Unnamed: 0,Coverage_atomic
0,Available from 1963 volume: 10 issue: 1 until...
1,Available from 1886 volume: 25 issue: 5 until...
2,Available from 1939 volume: 1 until 2012;
3,Unknown
4,Available from 2007 volume: 32 until 2009 vol...
...,...
3964,Available from 1967-01-02 volume: 24 issue: 1;
3965,Available from 1967-01-09 volume: 24 issue: 1;
3966,Available from 1927 volume: 1;
3967,Available from 1944 volume: 1 issue: 1;
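
The `pd.concat` expression above gives each element of a coverage list its own row while repeating the parent row's index. A minimal sketch of the same reshaping with `Series.explode`, on toy data (the coverage lists below are invented for illustration). One difference to be aware of: `explode` keeps a NaN row for an empty list, whereas the concat approach yields no row, so the sketch drops NaN explicitly.

```python
import pandas as pd

# Toy stand-in for dfe['Coverage Information Combined'] (illustrative values)
toy = pd.DataFrame({
    'Coverage Information Combined': [
        ['Available from 1944 until 2000;', 'Available from 1893 volume: 1;'],
        ['Unknown'],
        [],
    ]
})

# One row per list element; the original row index is repeated
toy_melted = (
    toy['Coverage Information Combined']
    .explode()
    .dropna()                      # empty lists become NaN rows; drop them
    .rename('Coverage_atomic')
    .to_frame()
)
toy_melted = toy_melted[toy_melted['Coverage_atomic'] != '']
print(toy_melted)
```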


In [53]:
melted_cov['avail-start'] = melted_cov['Coverage_atomic'].apply(lambda x: avail_from(x))
melted_cov

Unnamed: 0,Coverage_atomic,avail-start
0,Available from 1963 volume: 10 issue: 1 until...,1963
1,Available from 1886 volume: 25 issue: 5 until...,1886
2,Available from 1939 volume: 1 until 2012;,1939
3,Unknown,
4,Available from 2007 volume: 32 until 2009 vol...,2007
...,...,...
3964,Available from 1967-01-02 volume: 24 issue: 1;,1967
3965,Available from 1967-01-09 volume: 24 issue: 1;,1967
3966,Available from 1927 volume: 1;,1927
3967,Available from 1944 volume: 1 issue: 1;,1944


In [54]:
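def avail_until(x):
    # Return the four-digit year that follows "until", or NaN if absent.
    # A capture group is used instead of str.strip('until '), which strips a
    # set of characters rather than a prefix.
    match = re.search(r'until\s([1-2][0-9]{3})', x)
    if match:
        return match.group(1)
    else:
        return np.nan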

In [55]:
melted_cov['avail-end'] = melted_cov['Coverage_atomic'].apply(lambda x: avail_until(x))
melted_cov

Unnamed: 0,Coverage_atomic,avail-start,avail-end
0,Available from 1963 volume: 10 issue: 1 until...,1963,1963
1,Available from 1886 volume: 25 issue: 5 until...,1886,1995
2,Available from 1939 volume: 1 until 2012;,1939,2012
3,Unknown,,
4,Available from 2007 volume: 32 until 2009 vol...,2007,2009
...,...,...,...
3964,Available from 1967-01-02 volume: 24 issue: 1;,1967,
3965,Available from 1967-01-09 volume: 24 issue: 1;,1967,
3966,Available from 1927 volume: 1;,1927,
3967,Available from 1944 volume: 1 issue: 1;,1944,
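
The two `apply` calls above can also be expressed with pandas' vectorized string methods. A minimal sketch using `Series.str.extract`, assuming the same "Available from YYYY ... until YYYY" pattern; the three sample strings are taken from the output above, and the name `parsed` is hypothetical.

```python
import pandas as pd

cov = pd.Series(['Available from 1939 volume: 1 until 2012;',
                 'Available from 1967-01-02 volume: 24 issue: 1;',
                 'Unknown'], name='Coverage_atomic')

# With a single capture group and expand=False, str.extract returns a Series;
# rows that do not match come back as NaN
starts = cov.str.extract(r'Available\sfrom\s([12][0-9]{3})', expand=False)
ends = cov.str.extract(r'until\s([12][0-9]{3})', expand=False)

parsed = pd.DataFrame({'Coverage_atomic': cov,
                       'avail-start': starts,
                       'avail-end': ends})
print(parsed)
```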


In [56]:
mc_curr = melted_cov[(melted_cov['avail-start'].notnull()) & (melted_cov['avail-end'].isna())]
mc_curr

Unnamed: 0,Coverage_atomic,avail-start,avail-end
5,Available from 1893 volume: 1 issue: 1;,1893,
7,Available from 1922 volume: 4;,1922,
12,Available from 1996 volume: 43 issue: 1;,1996,
13,Available from 1890 volume: 1 issue: 1;,1890,
15,Available from 2011-06- volume: 56 issue: 1;,2011,
...,...,...,...
3964,Available from 1967-01-02 volume: 24 issue: 1;,1967,
3965,Available from 1967-01-09 volume: 24 issue: 1;,1967,
3966,Available from 1927 volume: 1;,1927,
3967,Available from 1944 volume: 1 issue: 1;,1944,


In [57]:
# Open-ended coverage (a start year but no "until" year) is treated as running
# through 2019; assign() returns a new frame rather than writing into a slice
mc_curr = mc_curr.assign(**{'avail-end': '2019'})
mc_curr


Unnamed: 0,Coverage_atomic,avail-start,avail-end
5,Available from 1893 volume: 1 issue: 1;,1893,2019
7,Available from 1922 volume: 4;,1922,2019
12,Available from 1996 volume: 43 issue: 1;,1996,2019
13,Available from 1890 volume: 1 issue: 1;,1890,2019
15,Available from 2011-06- volume: 56 issue: 1;,2011,2019
...,...,...,...
3964,Available from 1967-01-02 volume: 24 issue: 1;,1967,2019
3965,Available from 1967-01-09 volume: 24 issue: 1;,1967,2019
3966,Available from 1927 volume: 1;,1927,2019
3967,Available from 1944 volume: 1 issue: 1;,1944,2019


In [58]:
mc_com = melted_cov.combine_first(mc_curr)
mc_com

Unnamed: 0,Coverage_atomic,avail-start,avail-end
0,Available from 1963 volume: 10 issue: 1 until...,1963,1963
1,Available from 1886 volume: 25 issue: 5 until...,1886,1995
2,Available from 1939 volume: 1 until 2012;,1939,2012
3,Unknown,,
4,Available from 2007 volume: 32 until 2009 vol...,2007,2009
...,...,...,...
3964,Available from 1967-01-02 volume: 24 issue: 1;,1967,2019
3965,Available from 1967-01-09 volume: 24 issue: 1;,1967,2019
3966,Available from 1927 volume: 1;,1927,2019
3967,Available from 1944 volume: 1 issue: 1;,1944,2019
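
Cells 56 to 58 split out the open-ended rows, set their end year, and fold them back in with `combine_first`. A minimal sketch of the same idea in a single step, using a boolean mask with `.loc`; the toy frame is invented for illustration, and `'2019'` is simply the cutoff the notebook already uses.

```python
import numpy as np
import pandas as pd

# Toy stand-in for melted_cov (illustrative values only)
toy = pd.DataFrame({
    'avail-start': ['1963', '1927', np.nan],
    'avail-end':   ['1963', np.nan, np.nan],
})

# Rows with a start year but no end year are treated as still current
open_ended = toy['avail-start'].notna() & toy['avail-end'].isna()
toy.loc[open_ended, 'avail-end'] = '2019'   # same cutoff year the notebook uses
print(toy)
```

Because the mask is applied row by row, this form does not depend on index alignment, which may matter when the index contains repeated labels as `melted_cov`'s does.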


In [59]:
dfe2 = mc_com[mc_com['avail-start'].notnull() & mc_com['avail-end'].notnull()]
dfe2

Unnamed: 0,Coverage_atomic,avail-start,avail-end
0,Available from 1963 volume: 10 issue: 1 until...,1963,1963
1,Available from 1886 volume: 25 issue: 5 until...,1886,1995
2,Available from 1939 volume: 1 until 2012;,1939,2012
4,Available from 2007 volume: 32 until 2009 vol...,2007,2009
5,Available from 1944-1-1 until 2000-10-31; Av...,1944,2000
...,...,...,...
3964,Available from 1967-01-02 volume: 24 issue: 1;,1967,2019
3965,Available from 1967-01-09 volume: 24 issue: 1;,1967,2019
3966,Available from 1927 volume: 1;,1927,2019
3967,Available from 1944 volume: 1 issue: 1;,1944,2019


In [60]:
dfe2_no_date_calc = mc_com[mc_com['avail-start'].isna() | mc_com['avail-end'].isna()]
dfe2_no_date_calc

Unnamed: 0,Coverage_atomic,avail-start,avail-end
3,Unknown,,
39,Unknown,,
60,Unknown,,
126,Unknown,,
127,Unknown,,
...,...,...,...
3556,Unknown,,
3655,Unknown,,
3656,Unknown,,
3740,Unknown,,


In [61]:
dfe2_no_date_calc.shape

(99, 3)

In [62]:
dfe2_no_date_calc.reset_index(inplace=True)
dfe2_no_date_calc.rename(columns={'index':'record_index'},inplace=True)
dfe2_no_date_calc

Unnamed: 0,record_index,Coverage_atomic,avail-start,avail-end
0,3,Unknown,,
1,39,Unknown,,
2,60,Unknown,,
3,126,Unknown,,
4,127,Unknown,,
...,...,...,...,...
94,3556,Unknown,,
95,3655,Unknown,,
96,3656,Unknown,,
97,3740,Unknown,,


In [63]:
dfe2_no_date_calc = pd.merge(dfe, dfe2_no_date_calc, how='right', on= 'record_index')
dfe2_no_date_calc

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,PCAD?,Coverage Information Combined,Coverage_atomic,avail-start,avail-end
0,3,9968429800001701,Journal of the Institute of Actuaries,e,['Yes'],[ Available from 1886 volume: 25 issue: 5 unti...,Unknown,,
1,209,9967220140001701,Seminars in oncology,e,['Yes'],[ Available from 2001-02- volume: 28;],Unknown,,
2,246,9968504000001701,Tennyson research bulletin.,e,['Yes'],[ Available from 1985-1-1 until 1986-12-31; A...,Unknown,,
3,345,9966479180001701,Mathematical methods in the applied sciences,e,['Yes'],[ Available from 1996 volume: 19 issue: 1;],Unknown,,
4,397,9975630388601701,Clinical research practices and drug regulator...,e,['Yes'],[ Available from 1983 volume: 1 issue: 1 until...,Unknown,,
...,...,...,...,...,...,...,...,...,...
94,3253,,,,,,Unknown,,
95,3291,,,,,,Unknown,,
96,3556,,,,,,Unknown,,
97,3656,,,,,,Unknown,,
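
In the merged output above, `Unknown` appears beside titles whose combined coverage string is populated, and the last rows have no title at all. One possible cause, offered as an assumption rather than a diagnosis: `record_index` in `dfe` holds the original `df` index, while `record_index` in `dfe2_no_date_calc` comes from `melted_cov`, whose index is `dfe`'s positional index. A minimal sketch on toy data (all names and values hypothetical) of how the two join choices differ:

```python
import pandas as pd

# Toy stand-in for dfe: 'record_index' holds the ORIGINAL df index,
# while the frame's own index is positional (0, 1, 2, 3)
toy_dfe = pd.DataFrame({
    'record_index': [1, 3, 4, 6],
    'Title_bib': ['Title A', 'Title B', 'Title C', 'Title D'],
})

# Toy stand-in for the no-date rows: 'pos' is a row position in toy_dfe
toy_unknown = pd.DataFrame({'pos': [3], 'Coverage_atomic': ['Unknown']})

# Joining the position against the original-index column pairs the wrong title
by_column = toy_unknown.merge(toy_dfe, left_on='pos',
                              right_on='record_index', how='left')

# Joining the position against toy_dfe's own index pairs the intended title
by_position = toy_unknown.merge(toy_dfe, left_on='pos',
                                right_index=True, how='left')

print(by_column['Title_bib'].tolist())
print(by_position['Title_bib'].tolist())
```

Cell 73 later in this notebook joins `dfe` to `dfeg` with `left_index=True`, which corresponds to the second form.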


In [64]:
dfe2_no_date_calc_df = pd.merge(df, dfe2_no_date_calc, how='right', on= 'MMS_ID')
dfe2_no_date_calc_df

Unnamed: 0,record_index_x,MMS_ID,Title_bib_x,ISSN_cluster,p_or_e_x,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined_x,PCAD?_x,...,chron_as_list,chron_ranges_calc,record_index_y,Title_bib_y,p_or_e_y,PCAD?_y,Coverage Information Combined_y,Coverage_atomic,avail-start,avail-end
0,111951.0,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12.0,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,3,Journal of the Institute of Actuaries,e,['Yes'],[ Available from 1886 volume: 25 issue: 5 unti...,Unknown,,
1,121955.0,9967220140001701,Seminars in oncology,"['0093-7754', '1532-8708']",e,4588.0,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624692560001701', 'Seminars in oncology.']]",[ Available from 2001-02- volume: 28;],[Yes],...,,,209,Seminars in oncology,e,['Yes'],[ Available from 2001-02- volume: 28;],Unknown,,
2,115366.0,9968504000001701,Tennyson research bulletin.,['0082-2841'],e,5186.0,"[['61686647970001701', 'Periodicals Archive On...","[['53686647540001701', 'Tennyson research bull...",[ Available from 1985-1-1 until 1986-12-31; A...,[Yes],...,,,246,Tennyson research bulletin.,e,['Yes'],[ Available from 1985-1-1 until 1986-12-31; A...,Unknown,,
3,121390.0,9966479180001701,Mathematical methods in the applied sciences,"['0170-4214', '1099-1476']",e,6841.0,"[['61765175560001701', 'Wiley Online Library D...","[['53743071260001701', 'Mathematical methods i...",[ Available from 1996 volume: 19 issue: 1;],[Yes],...,,,345,Mathematical methods in the applied sciences,e,['Yes'],[ Available from 1996 volume: 19 issue: 1;],Unknown,,
4,125290.0,9975630388601701,Clinical research practices and drug regulator...,['0735-7915'],e,7786.0,"[['61697871840001701', 'Taylor & Francis Medic...","[['53697870630001701', 'Clinical research prac...",[ Available from 1983 volume: 1 issue: 1 until...,[Yes],...,,,397,Clinical research practices and drug regulator...,e,['Yes'],[ Available from 1983 volume: 1 issue: 1 until...,Unknown,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,,,,,,,,,,,...,,,3253,,,,,Unknown,,
103,,,,,,,,,,,...,,,3291,,,,,Unknown,,
104,,,,,,,,,,,...,,,3556,,,,,Unknown,,
105,,,,,,,,,,,...,,,3656,,,,,Unknown,,


In [65]:
no_date_groups = list(dfe2_no_date_calc_df['matches_group_id'])
len(no_date_groups)

107

In [66]:
dfe2_no_date_calc = df[df['matches_group_id'].isin(no_date_groups)]
dfe2_no_date_calc

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc
3,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,,,,
2,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,,,,,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19..."
209,121955,9967220140001701,Seminars in oncology,"['0093-7754', '1532-8708']",e,4588,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624692560001701', 'Seminars in oncology.']]",[ Available from 2001-02- volume: 28;],[Yes],...,,,,,,,,,,
210,65075,9949345650001701,Seminars in oncology,['0093-7754'],p,4588,,,,,...,,,,,[TBIOM PERS],[87],"[2004, 2010, 1984, 1995, 1987, 2001, 2009, 198...",['TBIOM PERS'],"[1974, 1975, 1976, 1977, 1978, 1979, 1980, 198...","[(1974, 1982), (1984, 2011)]"
245,12584,9922468710001701,Tennyson research bulletin,['0082-2841'],p,5186,,,,,...,,,,,[TWILS PER],[9],"[1982/91, 2012/2016, 2018, 1997/2001, 2017, 20...",['TWILS PER'],"[1982, 1983, 1984, 1985, 1986, 1987, 1988, 198...","[(1982, 2019)]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3740,104355,9966748550001701,Biochemistry,"['0006-2960', '1520-4995']",e,39912,"[['61535215970001701', 'American Chemical Soci...","[['53536372850001701', 'Biochemistry.']]",[ Available from 1962 volume: 1 issue: 1 until...,[Yes],...,,,,,,,,,,
3739,62835,9942519690001701,Biochemistry,['0006-2960'],p,39912,,,,,...,,,,,[ZMLAC UMDN],[158],"[1970, 1978, 1984, 1964, 1963, 1977, 1971, 198...",['ZMLAC UMDN'],"[1962, 1963, 1964, 1965, 1966, 1967, 1968, 196...","[(1962, 1993)]"
3741,77056,9956319410001701,Biochemistry,['0006-2960'],p,39912,,,,,...,,,,,"[TBIOM PERS, TZDS GEN, ZMLAC OWL]",[865],"[2004, 1970, 1978, 1995, 1984, 1964, 1971, 197...","['TBIOM PERS', 'TZDS GEN', 'ZMLAC OWL']","[1962, 1963, 1964, 1965, 1966, 1967, 1968, 196...","[(1962, 2009)]"
3742,115410,9968561600001701,Macromolecules,"['0024-9297', '1520-5835']",e,39912,"[['61535215970001701', 'American Chemical Soci...","[['53540476540001701', 'Macromolecules.']]",[ Available from 1968 volume: 1 issue: 1 until...,[Yes],...,,,,,,,,,,


In [68]:
dfe2_no_date_calc.to_pickle(f'dfe2_no_date_calc_{today}.pkl')

In [69]:
# Expand each start/end pair into the full list of covered years; assign()
# returns a new frame rather than writing into a slice of mc_com
dfe2 = dfe2.assign(**{'pcad-range': dfe2.apply(lambda row: list(range(int(row['avail-start']),int(row['avail-end'])+1)), axis=1)})
dfe2


Unnamed: 0,Coverage_atomic,avail-start,avail-end,pcad-range
0,Available from 1963 volume: 10 issue: 1 until...,1963,1963,[1963]
1,Available from 1886 volume: 25 issue: 5 until...,1886,1995,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189..."
2,Available from 1939 volume: 1 until 2012;,1939,2012,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194..."
4,Available from 2007 volume: 32 until 2009 vol...,2007,2009,"[2007, 2008, 2009]"
5,Available from 1944-1-1 until 2000-10-31; Av...,1944,2000,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195..."
...,...,...,...,...
3964,Available from 1967-01-02 volume: 24 issue: 1;,1967,2019,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
3965,Available from 1967-01-09 volume: 24 issue: 1;,1967,2019,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
3966,Available from 1927 volume: 1;,1927,2019,"[1927, 1928, 1929, 1930, 1931, 1932, 1933, 193..."
3967,Available from 1944 volume: 1 issue: 1;,1944,2019,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195..."


In [70]:
dfe2.reset_index(inplace=True)
dfe2.rename(columns={'index':'record_index'},inplace=True)
dfe2

Unnamed: 0,record_index,Coverage_atomic,avail-start,avail-end,pcad-range
0,0,Available from 1963 volume: 10 issue: 1 until...,1963,1963,[1963]
1,1,Available from 1886 volume: 25 issue: 5 until...,1886,1995,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189..."
2,2,Available from 1939 volume: 1 until 2012;,1939,2012,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194..."
3,4,Available from 2007 volume: 32 until 2009 vol...,2007,2009,"[2007, 2008, 2009]"
4,5,Available from 1944-1-1 until 2000-10-31; Av...,1944,2000,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195..."
...,...,...,...,...,...
5576,3964,Available from 1967-01-02 volume: 24 issue: 1;,1967,2019,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
5577,3965,Available from 1967-01-09 volume: 24 issue: 1;,1967,2019,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
5578,3966,Available from 1927 volume: 1;,1927,2019,"[1927, 1928, 1929, 1930, 1931, 1932, 1933, 193..."
5579,3967,Available from 1944 volume: 1 issue: 1;,1944,2019,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195..."


In [71]:
dfeg = dfe2[['record_index','pcad-range']].groupby(['record_index']).agg(lambda x: list(set([item for sublist in x for item in sublist]))).reset_index()
dfeg

Unnamed: 0,record_index,pcad-range
0,0,[1963]
1,1,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189..."
2,2,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194..."
3,4,"[2008, 2009, 2007]"
4,5,"[1893, 1894, 1895, 1896, 1897, 1898, 1899, 190..."
...,...,...
3908,3964,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
3909,3965,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
3910,3966,"[1927, 1928, 1929, 1930, 1931, 1932, 1933, 193..."
3911,3967,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195..."


In [72]:
dfeg['pcad-range'] = dfeg['pcad-range'].apply(lambda x: sorted(x))
dfeg

Unnamed: 0,record_index,pcad-range
0,0,[1963]
1,1,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189..."
2,2,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194..."
3,4,"[2007, 2008, 2009]"
4,5,"[1893, 1894, 1895, 1896, 1897, 1898, 1899, 190..."
...,...,...
3908,3964,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
3909,3965,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
3910,3966,"[1927, 1928, 1929, 1930, 1931, 1932, 1933, 193..."
3911,3967,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195..."
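
Cells 71 and 72 merge all year lists that share a `record_index` into one de-duplicated list and then sort it. A minimal sketch of the same union done in one pass on toy data (the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for dfe2[['record_index', 'pcad-range']]
toy = pd.DataFrame({
    'record_index': [5, 5, 7],
    'pcad-range': [[1944, 1945, 1946], [1946, 1947], [2007, 2008]],
})

# For each record, take the set union of its year lists and sort the result
merged = (
    toy.groupby('record_index')['pcad-range']
       .apply(lambda ranges: sorted(set().union(*ranges)))
       .reset_index()
)
print(merged)
```

Sorting inside the same function removes the need for a separate sorting cell, since a set has no guaranteed order.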


In [73]:
dfe_pcad = pd.merge(dfe, dfeg, left_index=True, how='left', right_on= 'record_index')
dfe_pcad

Unnamed: 0,record_index,record_index_x,MMS_ID,Title_bib,p_or_e,PCAD?,Coverage Information Combined,record_index_y,pcad-range
0.0,0,1,9968441380001701,IEEE transactions on ultrasonics engineering,e,['Yes'],[ Available from 1963 volume: 10 issue: 1 unti...,0.0,[1963]
1.0,1,3,9968429800001701,Journal of the Institute of Actuaries,e,['Yes'],[ Available from 1886 volume: 25 issue: 5 unti...,1.0,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189..."
2.0,2,4,9967115530001701,Giornale degli economisti e annali di economia,e,['Yes'],[ Available from 1939 volume: 1 until 2012;],2.0,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194..."
,3,6,9968936470001701,Mathematics of the USSR. Izvestija (Online),e,['Yes'],[Unknown],,
3.0,4,10,9968336850001701,Laboratory techniques in biochemistry and mole...,e,['Yes'],[ Available from 2007 volume: 32 until 2009 vo...,4.0,"[2007, 2008, 2009]"
...,...,...,...,...,...,...,...,...,...
3908.0,3964,8856,9968239890001701,Physics letters.,e,['Yes'],[ Available from 1967-01-02 volume: 24 issue: 1;],3964.0,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
3909.0,3965,8857,9968240180001701,Physics letters.,e,['Yes'],[ Available from 1967-01-09 volume: 24 issue: 1;],3965.0,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
3910.0,3966,8864,9968665290001701,Proceedings and addresses of the American Phil...,e,['Yes'],[ Available from 1927 volume: 1;],3966.0,"[1927, 1928, 1929, 1930, 1931, 1932, 1933, 193..."
3911.0,3967,8868,9967987860001701,The Americas - Academy of American Franciscan ...,e,['Yes'],[ Available from 1944 volume: 1 issue: 1;],3967.0,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195..."


In [74]:
dfe_pcad.columns

Index(['record_index', 'record_index_x', 'MMS_ID', 'Title_bib', 'p_or_e',
       'PCAD?', 'Coverage Information Combined', 'record_index_y',
       'pcad-range'],
      dtype='object')

In [75]:
df.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e',
       'matches_group_id', 'e_coll_info', 'portfolio_info',
       'Coverage Information Combined', 'PCAD?', 'Vendor_key',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'Title (Complete)_PORTICO',
       'Portico Match_PORTICO', 'Portico Title_PORTICO', 'PCA_PORTICO',
       'Status_PORTICO', 'Earliest Year Preserved_PORTICO',
       'Latest Year Preserved_PORTICO', 'curr-lib-loc_x', 'all_item_count',
       'chron', 'curr-lib-loc_ALL', 'chron_as_list', 'chron_ranges_calc'],
      dtype='object')

In [76]:
df_e_range = pd.merge(df, dfe_pcad[['MMS_ID','pcad-range']], how='left', on='MMS_ID')
df_e_range

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range
0,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,,,,[1963]
1,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,,,,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",
2,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189..."
3,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,,,,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,61904,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,...,,,,"[TSCI PER, ZMLAC OWL]",[53],"[1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",
8819,17960,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,...,,,,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",
8820,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195..."
8821,125537,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],...,,,,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198..."
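
The left merge above attaches `pcad-range` to each row of `df` whose `MMS_ID` also appears in `dfe_pcad`; rows without a match (the print records) get `NaN`. The sketch below uses made-up toy data to show one way to check that such a merge neither drops nor multiplies rows. The `validate` and `indicator` arguments are illustrative additions and are not part of the original workflow.

```python
import pandas as pd

# Toy stand-ins for df and dfe_pcad (hypothetical values, for illustration only)
left = pd.DataFrame({
    'MMS_ID': ['e1', 'p1', 'e2'],
    'p_or_e': ['e', 'p', 'e'],
})
right = pd.DataFrame({
    'MMS_ID': ['e1', 'e2'],
    'pcad-range': [[1963], [1944, 1945]],
})

# validate='m:1' raises MergeError if an MMS_ID occurs more than once in `right`;
# indicator=True adds a '_merge' column recording where each row was found.
merged = pd.merge(left, right, how='left', on='MMS_ID',
                  validate='m:1', indicator=True)

print(len(merged) == len(left))
print(merged['_merge'].value_counts())
```

If the real `dfe_pcad` holds repeated `MMS_ID` values, the `validate` check would fail, which is a useful signal that the merged frame would contain extra rows.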


In [77]:
df_e_range.sort_values(['matches_group_id','Title_bib','p_or_e'], inplace=True)
df_e_range

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range
0,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,,,,[1963]
1,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,,,,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",
2,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189..."
3,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,,,,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,61904,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,...,,,,"[TSCI PER, ZMLAC OWL]",[53],"[1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",
8819,17960,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,...,,,,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",
8820,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195..."
8821,125537,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],...,,,,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198..."


#### Process date ranges for SPR data

In [78]:
df_e_range.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e',
       'matches_group_id', 'e_coll_info', 'portfolio_info',
       'Coverage Information Combined', 'PCAD?', 'Vendor_key',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'Title (Complete)_PORTICO',
       'Portico Match_PORTICO', 'Portico Title_PORTICO', 'PCA_PORTICO',
       'Status_PORTICO', 'Earliest Year Preserved_PORTICO',
       'Latest Year Preserved_PORTICO', 'curr-lib-loc_x', 'all_item_count',
       'chron', 'curr-lib-loc_ALL', 'chron_as_list', 'chron_ranges_calc',
       'pcad-range'],
      dtype='object')

In [79]:
df_spr = df_e_range[['MMS_ID', 'Title_bib', 'p_or_e', 'matches_group_id',
                     'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
                     'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR']]
df_spr

Unnamed: 0,MMS_ID,Title_bib,p_or_e,matches_group_id,Title 1 (Print)_BTAA-SPR,Title 2 (Print)_BTAA-SPR,Match?_BTAA-SPR,SPR Holdings_BTAA-SPR
0,9968441380001701,IEEE transactions on ultrasonics engineering,e,5,,,,
1,9963550760001701,IEEE transactions on ultrasonics engineering,p,5,IEEE transactions on ultrasonics engineering.,,YES,10 (1963)
2,9968429800001701,Journal of the Institute of Actuaries,e,12,,,,
3,9939481760001701,Journal of the Institute of Actuaries,p,12,Journal of the Institute of Actuaries.,,YES,66 (1935)-110 (1983)
4,9967115530001701,Giornale degli economisti e annali di economia,e,92,,,,
...,...,...,...,...,...,...,...,...
8818,9959156260001701,Year book - American Philosophical Society,p,101522,,,,
8819,9946768760001701,The Americas,p,101581,The Americas.,,,1 (1969-1970)-45 (1989)
8820,9967987860001701,The Americas - Academy of American Franciscan ...,e,101581,,,,
8821,9968947900001701,International journal of adhesion and adhesives,e,101582,,,,


In [80]:
df_spr = df_spr[df_spr['Match?_BTAA-SPR'].notnull()]
df_spr

Unnamed: 0,MMS_ID,Title_bib,p_or_e,matches_group_id,Title 1 (Print)_BTAA-SPR,Title 2 (Print)_BTAA-SPR,Match?_BTAA-SPR,SPR Holdings_BTAA-SPR
1,9963550760001701,IEEE transactions on ultrasonics engineering,p,5,IEEE transactions on ultrasonics engineering.,,YES,10 (1963)
3,9939481760001701,Journal of the Institute of Actuaries,p,12,Journal of the Institute of Actuaries.,,YES,66 (1935)-110 (1983)
6,9964617400001701,Mathematics of the USSR. Izvestija,p,267,Mathematics of the USSR: Izvestija.,,YES,1 (1967)-39 (1992)
18,9929554030001701,Veterinary research communications.,p,707,Veterinary research communications.,,YES,4 (1980)-33 (2009)
25,9931084310001701,Neuropeptides,p,744,Neuropeptides.,,YES,1 (1980/1981)-40 (2006)
...,...,...,...,...,...,...,...,...
8793,9939180800001701,Annals of the New York Academy of Sciences,p,101219,Annals of the New York Academy of Sciences.,The year in ecology and conservation biology.,YES,"1-8, 10-83, 85-104, 106-126, 128, 130-146, 148..."
8808,9912065260001701,Physics letters,p,101461,Physics letters.,Physics letters.,YES,1 (1962)-23 (1966)
8811,9953202980001701,Physics letters. Section A,p,101461,Physics letters.,,YES,24 (1967)-347 (2005)
8812,9953205880001701,Physics letters. Section B,p,101461,Physics letters.,,YES,"24 (1967)-269 (1991), 271 (1991)-566 (2009), 5..."


In [81]:
# Normalize each holdings statement: trim it, turn semicolons into commas, drop spaces
df_spr['SPR Holdings_BTAA-SPR'] = df_spr['SPR Holdings_BTAA-SPR'].apply(lambda x: x.strip().replace(';',',').replace(' ',''))
df_spr

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,MMS_ID,Title_bib,p_or_e,matches_group_id,Title 1 (Print)_BTAA-SPR,Title 2 (Print)_BTAA-SPR,Match?_BTAA-SPR,SPR Holdings_BTAA-SPR
1,9963550760001701,IEEE transactions on ultrasonics engineering,p,5,IEEE transactions on ultrasonics engineering.,,YES,10(1963)
3,9939481760001701,Journal of the Institute of Actuaries,p,12,Journal of the Institute of Actuaries.,,YES,66(1935)-110(1983)
6,9964617400001701,Mathematics of the USSR. Izvestija,p,267,Mathematics of the USSR: Izvestija.,,YES,1(1967)-39(1992)
18,9929554030001701,Veterinary research communications.,p,707,Veterinary research communications.,,YES,4(1980)-33(2009)
25,9931084310001701,Neuropeptides,p,744,Neuropeptides.,,YES,1(1980/1981)-40(2006)
...,...,...,...,...,...,...,...,...
8793,9939180800001701,Annals of the New York Academy of Sciences,p,101219,Annals of the New York Academy of Sciences.,The year in ecology and conservation biology.,YES,"1-8,10-83,85-104,106-126,128,130-146,148-149,1..."
8808,9912065260001701,Physics letters,p,101461,Physics letters.,Physics letters.,YES,1(1962)-23(1966)
8811,9953202980001701,Physics letters. Section A,p,101461,Physics letters.,,YES,24(1967)-347(2005)
8812,9953205880001701,Physics letters. Section B,p,101461,Physics letters.,,YES,"24(1967)-269(1991),271(1991)-566(2009),571(200..."
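
The `SettingWithCopyWarning` above appears because `df_spr` was produced by filtering another dataframe, so pandas cannot tell whether the assignment is meant to reach the original. In this notebook the new values do land in `df_spr`, as the output shows. One common way to make the intent explicit is to take a copy when filtering. The sketch below uses toy data with shortened, hypothetical column names.

```python
import pandas as pd

src = pd.DataFrame({
    'Match': ['YES', None, 'YES'],
    'Holdings': [' 10 (1963)', None, '1 (1962)-23 (1966); 30 (1970)'],
})

# .copy() makes `subset` an independent dataframe, so later column
# assignments are unambiguous and do not trigger SettingWithCopyWarning.
subset = src[src['Match'].notnull()].copy()

# Same normalization as the cell above, written with vectorized .str methods
subset['Holdings'] = (subset['Holdings']
                      .str.strip()
                      .str.replace(';', ',', regex=False)
                      .str.replace(' ', '', regex=False))

print(subset['Holdings'].tolist())
```

The source frame `src` is left untouched, which is usually what is wanted when a subset is processed separately and merged back later.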


In [82]:
# Split each holdings statement into its comma-separated pieces
df_spr['SPR-holdings'] = df_spr['SPR Holdings_BTAA-SPR'].apply(lambda x: x.split(','))
df_spr

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,MMS_ID,Title_bib,p_or_e,matches_group_id,Title 1 (Print)_BTAA-SPR,Title 2 (Print)_BTAA-SPR,Match?_BTAA-SPR,SPR Holdings_BTAA-SPR,SPR-holdings
1,9963550760001701,IEEE transactions on ultrasonics engineering,p,5,IEEE transactions on ultrasonics engineering.,,YES,10(1963),[10(1963)]
3,9939481760001701,Journal of the Institute of Actuaries,p,12,Journal of the Institute of Actuaries.,,YES,66(1935)-110(1983),[66(1935)-110(1983)]
6,9964617400001701,Mathematics of the USSR. Izvestija,p,267,Mathematics of the USSR: Izvestija.,,YES,1(1967)-39(1992),[1(1967)-39(1992)]
18,9929554030001701,Veterinary research communications.,p,707,Veterinary research communications.,,YES,4(1980)-33(2009),[4(1980)-33(2009)]
25,9931084310001701,Neuropeptides,p,744,Neuropeptides.,,YES,1(1980/1981)-40(2006),[1(1980/1981)-40(2006)]
...,...,...,...,...,...,...,...,...,...
8793,9939180800001701,Annals of the New York Academy of Sciences,p,101219,Annals of the New York Academy of Sciences.,The year in ecology and conservation biology.,YES,"1-8,10-83,85-104,106-126,128,130-146,148-149,1...","[1-8, 10-83, 85-104, 106-126, 128, 130-146, 14..."
8808,9912065260001701,Physics letters,p,101461,Physics letters.,Physics letters.,YES,1(1962)-23(1966),[1(1962)-23(1966)]
8811,9953202980001701,Physics letters. Section A,p,101461,Physics letters.,,YES,24(1967)-347(2005),[24(1967)-347(2005)]
8812,9953205880001701,Physics letters. Section B,p,101461,Physics letters.,,YES,"24(1967)-269(1991),271(1991)-566(2009),571(200...","[24(1967)-269(1991), 271(1991)-566(2009), 571(..."


In [83]:
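def spr_years(holdings):
    """Return the years covered by a list of SPR holdings pieces, e.g. ['66(1935)-110(1983)']."""
    years = []
    for x in holdings:
        # Four-digit years starting with 1 or 2
        yr = re.findall(r'[1-2][0-9]{3}', x)
        if len(yr) > 1:
            # Only the first two years are used, so '1(1980/1981)-40(2006)'
            # expands to [1980, 1981], not 1980 through 2006
            years.extend(range(int(yr[0]), int(yr[1]) + 1))
        elif len(yr) == 1:
            years.append(int(yr[0]))
        else:
            # No year in this piece (for example a volume-only piece such as '1-8'):
            # print the empty match and the full holdings list for manual review
            print(yr)
            print(holdings)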
    return years
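
Because `spr_years` uses only the first two years found in each piece, a statement such as `1(1980/1981)-40(2006)` expands to `[1980, 1981]` rather than 1980 through 2006. The sketch below is a possible variant, not the method used in this notebook. It assumes that all years in one piece describe a single continuous run, and spans from the earliest to the latest year found. The name `spr_years_span` is hypothetical. Pieces without a year are skipped silently instead of printed, and years shared by two pieces are counted once.

```python
import re

def spr_years_span(holdings):
    """Expand each holdings piece to the years between its earliest and latest year."""
    years = set()
    for piece in holdings:
        found = [int(y) for y in re.findall(r'[1-2][0-9]{3}', piece)]
        if found:
            # Assumption: every year in one piece belongs to one continuous run
            years.update(range(min(found), max(found) + 1))
    return sorted(years)

print(spr_years_span(['1(1980/1981)-40(2006)']))
print(spr_years_span(['10(1963)']))
print(spr_years_span(['1-8', '10-83']))
```

Swapping this in would change `SPR-yrs` for every title with a slash-year statement, so any downstream counts would need to be regenerated.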

In [84]:
df_spr['SPR-yrs'] = df_spr['SPR-holdings'].apply(lambda x: sorted(spr_years(x)))
df_spr

[]
['ser.1', '1(1969)-2(1970)', 'ser.2', '1(1971)-7(1977)', 'ser.3', '1(1978)-ser.3', '19(1996)']
[]
['ser.1', '1(1969)-2(1970)', 'ser.2', '1(1971)-7(1977)', 'ser.3', '1(1978)-ser.3', '19(1996)']
[]
['ser.1', '1(1969)-2(1970)', 'ser.2', '1(1971)-7(1977)', 'ser.3', '1(1978)-ser.3', '19(1996)']
[]
['ser.1', 'v.1(1859)-ser.8', 'v.6(1906)', 'ser.9', 'v.1(1907)-ser.14', 'v.6(1942)', 'ser.14', 'v.1(1937)-ser.14', 'v.3(1939)', '85(1943)-146(2004)']
[]
['ser.1', 'v.1(1859)-ser.8', 'v.6(1906)', 'ser.9', 'v.1(1907)-ser.14', 'v.6(1942)', 'ser.14', 'v.1(1937)-ser.14', 'v.3(1939)', '85(1943)-146(2004)']
[]
['ser.1', 'v.1(1859)-ser.8', 'v.6(1906)', 'ser.9', 'v.1(1907)-ser.14', 'v.6(1942)', 'ser.14', 'v.1(1937)-ser.14', 'v.3(1939)', '85(1943)-146(2004)']
[]
['1(1969)-89(2006)', 'Index20-22', '28-36', '38-40']
[]
['1(1969)-89(2006)', 'Index20-22', '28-36', '38-40']
[]
['1(1969)-89(2006)', 'Index20-22', '28-36', '38-40']
[]
['35(1973)-70(2008)', 'Suppl.63', '65-67']
[]
['35(1973)-70(2008)', 'Suppl.63',

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,MMS_ID,Title_bib,p_or_e,matches_group_id,Title 1 (Print)_BTAA-SPR,Title 2 (Print)_BTAA-SPR,Match?_BTAA-SPR,SPR Holdings_BTAA-SPR,SPR-holdings,SPR-yrs
1,9963550760001701,IEEE transactions on ultrasonics engineering,p,5,IEEE transactions on ultrasonics engineering.,,YES,10(1963),[10(1963)],[1963]
3,9939481760001701,Journal of the Institute of Actuaries,p,12,Journal of the Institute of Actuaries.,,YES,66(1935)-110(1983),[66(1935)-110(1983)],"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
6,9964617400001701,Mathematics of the USSR. Izvestija,p,267,Mathematics of the USSR: Izvestija.,,YES,1(1967)-39(1992),[1(1967)-39(1992)],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
18,9929554030001701,Veterinary research communications.,p,707,Veterinary research communications.,,YES,4(1980)-33(2009),[4(1980)-33(2009)],"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198..."
25,9931084310001701,Neuropeptides,p,744,Neuropeptides.,,YES,1(1980/1981)-40(2006),[1(1980/1981)-40(2006)],"[1980, 1981]"
...,...,...,...,...,...,...,...,...,...,...
8793,9939180800001701,Annals of the New York Academy of Sciences,p,101219,Annals of the New York Academy of Sciences.,The year in ecology and conservation biology.,YES,"1-8,10-83,85-104,106-126,128,130-146,148-149,1...","[1-8, 10-83, 85-104, 106-126, 128, 130-146, 14...",[]
8808,9912065260001701,Physics letters,p,101461,Physics letters.,Physics letters.,YES,1(1962)-23(1966),[1(1962)-23(1966)],"[1962, 1963, 1964, 1965, 1966]"
8811,9953202980001701,Physics letters. Section A,p,101461,Physics letters.,,YES,24(1967)-347(2005),[24(1967)-347(2005)],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
8812,9953205880001701,Physics letters. Section B,p,101461,Physics letters.,,YES,"24(1967)-269(1991),271(1991)-566(2009),571(200...","[24(1967)-269(1991), 271(1991)-566(2009), 571(...","[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."


In [85]:
# df_spr keeps the index labels of df_e_range, so SPR-yrs can be attached by index
dfs = pd.merge(df_e_range,df_spr[['SPR-yrs']],how='left',left_index=True,right_index=True)
dfs

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs
0,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,,,[1963],
1,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,,,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963]
2,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...",
3,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,,,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,61904,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,...,,,"[TSCI PER, ZMLAC OWL]",[53],"[1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",,
8819,17960,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,...,,,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",,
8820,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",
8821,125537,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],...,,,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",
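
The merge above aligns on the index rather than on a key column. This works because `df_spr` was filtered from `df_e_range` without resetting the index, so its labels still identify the original rows. A minimal sketch on toy data (hypothetical names and values) shows the same pattern:

```python
import pandas as pd

base = pd.DataFrame({'Title': ['A', 'B', 'C']}, index=[0, 1, 2])

# A filtered subset keeps the index labels of the rows it came from
sub = base[base['Title'] != 'B'].copy()
sub['yrs'] = pd.Series([[1963], [1967, 1968]], index=sub.index)

# Left merge on both indexes: rows missing from `sub` get NaN in 'yrs'
out = pd.merge(base, sub[['yrs']], how='left',
               left_index=True, right_index=True)
print(out)
```

The approach depends on the index labels being unique and unchanged; calling `reset_index` on either frame between the filter and the merge would misalign the rows.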


#### Process Portico coverage

In [86]:
dfs.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e',
       'matches_group_id', 'e_coll_info', 'portfolio_info',
       'Coverage Information Combined', 'PCAD?', 'Vendor_key',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'Title (Complete)_PORTICO',
       'Portico Match_PORTICO', 'Portico Title_PORTICO', 'PCA_PORTICO',
       'Status_PORTICO', 'Earliest Year Preserved_PORTICO',
       'Latest Year Preserved_PORTICO', 'curr-lib-loc_x', 'all_item_count',
       'chron', 'curr-lib-loc_ALL', 'chron_as_list', 'chron_ranges_calc',
       'pcad-range', 'SPR-yrs'],
      dtype='object')

In [88]:
dfp = dfs[['record_index', 'MMS_ID', 'Title_bib', 'p_or_e', 'matches_group_id',
       'Title (Complete)_PORTICO','Portico Match_PORTICO', 
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO']]
dfp

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,matches_group_id,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO
0,128733,9968441380001701,IEEE transactions on ultrasonics engineering,e,5,,,,,,,
1,57684,9963550760001701,IEEE transactions on ultrasonics engineering,p,5,,,,,,,
2,111951,9968429800001701,Journal of the Institute of Actuaries,e,12,,,,,,,
3,88618,9939481760001701,Journal of the Institute of Actuaries,p,12,,,,,,,
4,123907,9967115530001701,Giornale degli economisti e annali di economia,e,92,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
8818,61904,9959156260001701,Year book - American Philosophical Society,p,101522,,,,,,,
8819,17960,9946768760001701,The Americas,p,101581,,,,,,,
8820,117510,9967987860001701,The Americas - Academy of American Franciscan ...,e,101581,,,,,,,
8821,125537,9968947900001701,International journal of adhesion and adhesives,e,101582,,,,,,,


In [89]:
# Keep only rows where the Portico match, PCA, and preserved status all hold
dfp = dfp[(dfp['Portico Match_PORTICO'] == 'Yes') & (dfp['PCA_PORTICO'] == 'Yes') & (dfp['Status_PORTICO'] == 'Preserved')]
dfp

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,matches_group_id,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO
8,62770,9942472340001701,Laboratory techniques in biochemistry and mole...,p,277,Laboratory techniques in biochemistry and mole...,Yes,Laboratory Techniques in Biochemistry and Mole...,Yes,Preserved,,
33,74020,9918228280001701,Annals of agricultural science.,p,1066,Annals of agricultural science.,Yes,Annals of Agricultural Sciences,Yes,Preserved,2011.0,2019.0
36,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001.0,2020.0
37,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001.0,2020.0
38,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001.0,2020.0
...,...,...,...,...,...,...,...,...,...,...,...,...
8704,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954.0,1997.0
8705,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954.0,1997.0
8706,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954.0,1997.0
8787,37420,9931018220001701,Ophelia,p,101035,Ophelia.,Yes,Ophelia,Yes,Preserved,1986.0,2001.0


In [90]:
dfp['Earliest Year Preserved_PORTICO'].isna().value_counts()

False    831
True      24
Name: Earliest Year Preserved_PORTICO, dtype: int64

In [91]:
dfp[(dfp['Earliest Year Preserved_PORTICO'].isna())]

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,matches_group_id,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO
8,62770,9942472340001701,Laboratory techniques in biochemistry and mole...,p,277,Laboratory techniques in biochemistry and mole...,Yes,Laboratory Techniques in Biochemistry and Mole...,Yes,Preserved,,
143,52765,9916006040001701,Semiconductors and semimetals,p,4013,Semiconductors and semimetals [electronic reso...,Yes,Semiconductors and Semimetals,Yes,Preserved,,
145,97135,9934198220001701,Advances in applied mechanics,p,4058,Advances in Applied Mechanics [electronic reso...,Yes,Advances in Applied Mechanics,Yes,Preserved,,
865,33250,9915103410001701,PN review.,p,13001,Poetry nation.,Yes,PN Review,Yes,Preserved,,
867,123843,9967042010001701,Poetry nation.,e,13001,Poetry nation.,Yes,PN Review,Yes,Preserved,,
981,83005,9954507840001701,Reviews of infectious diseases,p,14097,Reviews of infectious diseases.,Yes,Reviews of Infectious Diseases,Yes,Preserved,,
1382,124764,9975238308201701,Journal of the American Pharmaceutical Associa...,e,17896,Journal of the American Pharmaceutical Associa...,Yes,Journal of the American Pharmaceutical Associa...,Yes,Preserved,,
1383,17569,9942533970001701,Journal of the American Pharmaceutical Associa...,p,17896,Journal of the American Pharmaceutical Associa...,Yes,Journal of the American Pharmaceutical Associa...,Yes,Preserved,,
1458,49759,9953227240001701,Progress in optics,p,18743,Progress in optics [electronic resource].,Yes,Progress in Optics,Yes,Preserved,,
1594,38529,9930589110001701,Advances in catalysis,p,20477,Advances in catalysis [electronic resource].,Yes,Advances in Catalysis,Yes,Preserved,,


In [92]:
dfp = dfp[(dfp['Earliest Year Preserved_PORTICO'].notnull())]
dfp

Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,matches_group_id,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO
33,74020,9918228280001701,Annals of agricultural science.,p,1066,Annals of agricultural science.,Yes,Annals of Agricultural Sciences,Yes,Preserved,2011.0,2019.0
36,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001.0,2020.0
37,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001.0,2020.0
38,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001.0,2020.0
39,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001.0,2020.0
...,...,...,...,...,...,...,...,...,...,...,...,...
8704,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954.0,1997.0
8705,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954.0,1997.0
8706,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954.0,1997.0
8787,37420,9931018220001701,Ophelia,p,101035,Ophelia.,Yes,Ophelia,Yes,Preserved,1986.0,2001.0


In [93]:
dfp['Earliest Year Preserved_PORTICO'] = dfp['Earliest Year Preserved_PORTICO'].astype('int64')
dfp

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,matches_group_id,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO
33,74020,9918228280001701,Annals of agricultural science.,p,1066,Annals of agricultural science.,Yes,Annals of Agricultural Sciences,Yes,Preserved,2011,2019.0
36,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020.0
37,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020.0
38,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020.0
39,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020.0
...,...,...,...,...,...,...,...,...,...,...,...,...
8704,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954,1997.0
8705,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954,1997.0
8706,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954,1997.0
8787,37420,9931018220001701,Ophelia,p,101035,Ophelia.,Yes,Ophelia,Yes,Preserved,1986,2001.0


In [94]:
dfp['Latest Year Preserved_PORTICO'] = dfp['Latest Year Preserved_PORTICO'].astype('int64')
dfp

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,matches_group_id,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO
33,74020,9918228280001701,Annals of agricultural science.,p,1066,Annals of agricultural science.,Yes,Annals of Agricultural Sciences,Yes,Preserved,2011,2019
36,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020
37,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020
38,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020
39,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020
...,...,...,...,...,...,...,...,...,...,...,...,...
8704,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954,1997
8705,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954,1997
8706,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954,1997
8787,37420,9931018220001701,Ophelia,p,101035,Ophelia.,Yes,Ophelia,Yes,Preserved,1986,2001


In [95]:
dfp['Earliest Year Preserved_PORTICO'].isna().value_counts()

False    831
Name: Earliest Year Preserved_PORTICO, dtype: int64

In [96]:
dfp.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'p_or_e', 'matches_group_id',
       'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO'],
      dtype='object')

In [97]:
#work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
dfp = dfp.copy()
dfp['Portico-years'] = dfp.apply(lambda row: list(range(row['Earliest Year Preserved_PORTICO'],
                                                        row['Latest Year Preserved_PORTICO']+1)), axis=1)
dfp


Unnamed: 0,record_index,MMS_ID,Title_bib,p_or_e,matches_group_id,Title (Complete)_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,PCA_PORTICO,Status_PORTICO,Earliest Year Preserved_PORTICO,Latest Year Preserved_PORTICO,Portico-years
33,74020,9918228280001701,Annals of agricultural science.,p,1066,Annals of agricultural science.,Yes,Annals of Agricultural Sciences,Yes,Preserved,2011,2019,"[2011, 2012, 2013, 2014, 2015, 2016, 2017, 201..."
36,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200..."
37,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200..."
38,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200..."
39,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",p,1087,"Journal für praktische Chemie, Chemiker-Zeitu...",Yes,Journal für Praktische Chemie (1834-1991) | Jo...,Yes,Preserved,2001,2020,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8704,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954,1997,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196..."
8705,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954,1997,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196..."
8706,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,e,93064,Zentralblatt für Veterinärmedizin.,Yes,Zentralblatt für Veterinärmedizin | Journal of...,Yes,Preserved,1954,1997,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196..."
8787,37420,9931018220001701,Ophelia,p,101035,Ophelia.,Yes,Ophelia,Yes,Preserved,1986,2001,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199..."


In [98]:
df = pd.merge(dfs,dfp[['Portico-years']],how='left',left_index=True,right_index=True)
df

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,Latest Year Preserved_PORTICO,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years
0,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,,[1963],,
1,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],
2,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...",,
3,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,61904,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,...,,"[TSCI PER, ZMLAC OWL]",[53],"[1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",,,
8819,17960,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,...,,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",,,
8820,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,
8821,125537,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],...,,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,


#### Combine repository coverage

In [99]:
df_s_p = df[df['SPR-yrs'].notnull() & df['Portico-years'].notnull()]
df_s_p

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,Latest Year Preserved_PORTICO,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years
422,121903,9967349630001701,Canadian journal of biochemistry and cell biology,['0714-7511'],e,8412,"[['61535212690001701', 'Canadian Science Publi...","[['53537719460001701', 'Canadian journal of bi...",[ Available from 1983 volume: 61 issue: 1 unti...,[Yes],...,1985.0,,,,,,,"[1983, 1984, 1985]","[1983, 1984, 1985]","[1983, 1984, 1985]"
423,20683,9956330160001701,Canadian journal of biochemistry and cell biology,['0714-7511'],p,8412,,,,,...,1985.0,[TBIOM PERS],[6],"[1983, 1984, 1985]",['TBIOM PERS'],"[1983, 1984, 1985]","[(1983, 1985)]",,"[1983, 1984, 1985]","[1983, 1984, 1985]"
462,89929,9941324470001701,Australian journal of statistics,['0004-9581'],p,8786,,,,,...,2019.0,[TWILS PERC],[18],"[1984-85, 1967-70, 1963-66, 1959/1983, 1991, 1...",['TWILS PERC'],"[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196...","[(1959, 1993)]",,"[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196...","[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196..."
463,89929,9941324470001701,Australian journal of statistics,['0004-9581'],p,8786,,,,,...,2019.0,[TWILS PERC],[18],"[1984-85, 1967-70, 1963-66, 1959/1983, 1991, 1...",['TWILS PERC'],"[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196...","[(1959, 1993)]",,"[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196...","[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196..."
586,112463,9968110510001701,Journal of Comparative Pathology and Therapeutics,['0368-1742'],e,10384,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624638360001701', 'Journal of Comparative...",[ Available from 1888 volume: 1 until 1964 vol...,[Yes],...,1964.0,,,,,,,"[1888, 1889, 1890, 1891, 1892, 1893, 1894, 189...","[1891, 1896, 1897, 1898, 1899, 1900, 1901, 190...","[1888, 1889, 1890, 1891, 1892, 1893, 1894, 189..."
587,39855,9965015360001701,Journal of Comparative Pathology and Therapeutics,['0368-1742'],p,10384,,,,,...,1964.0,"[ZMLAC NON, TVET PER]",[43],"[1949-50, 1936-37, 1899-1900, 1914, 1964, 1903...","['ZMLAC NON', 'TVET PER']","[1894, 1895, 1896, 1899, 1900, 1901, 1902, 190...","[(1894, 1896), (1899, 1939), (1943, 1964)]",,"[1891, 1896, 1897, 1898, 1899, 1900, 1901, 190...","[1888, 1889, 1890, 1891, 1892, 1893, 1894, 189..."
1438,121712,9969448190001701,Comparative biochemistry and physiology.,['0300-9629'],e,18625,"[['61535212360001701', 'Elsevier SD Backfile B...","[['53624576870001701', 'Comparative biochemist...",[ Available from 1971 volume: 38 until 1994 vo...,[Yes],...,1997.0,,,,,,,"[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197..."
1439,64008,9947190700001701,Comparative biochemistry and physiology.,['0300-9629'],p,18625,,,,,...,1997.0,[TBIOM PERS],[12],"[1995, 1997, 1994, 1996]",['TBIOM PERS'],"[1994, 1995, 1996, 1997]","[(1994, 1997)]",,"[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197..."
1441,46587,9956317740001701,Comparative biochemistry and physiology. A. Co...,['0300-9629'],p,18625,,,,,...,1997.0,"[TBIOM PERS, ZMLAC OWL]",[90],"[1978, 1984, 1987, 1971, 1977, 1983, 1972, 198...","['TBIOM PERS', 'ZMLAC OWL']","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[(1971, 1991)]",,"[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197..."
1509,96778,9921748030001701,Pharmacology and therapeutics. Part C. Clinica...,['0362-5486'],p,19112,,,,,...,1978.0,[TCOS SN1],[2],"[1976, 1977]",['TCOS SN1'],"[1976, 1977]","[(1976, 1977)]",,"[1976, 1977, 1978]","[1976, 1977, 1978]"


In [104]:
#both should be lists; change index as needed
print(type(df_s_p['SPR-yrs'][1914]))
print(type(df_s_p['Portico-years'][1914]))

<class 'list'>
<class 'list'>
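Checking a single index only confirms the type for that one row. As a hedged alternative, the sketch below counts the Python types across a whole column; `demo`, `type_counts`, and `not_lists` are hypothetical names, and the string value is planted on purpose to show what a failed check looks like.

```python
import pandas as pd

# Hypothetical sample: one SPR value is a string that only looks like a list.
demo = pd.DataFrame({
    'SPR-yrs': [[1983, 1984], [1976, 1977, 1978], '[1990]'],
    'Portico-years': [[1983, 1984, 1985], [1976], [1990, 1991]],
}, index=[422, 1509, 1914])

# Count the Python type of every value in each column.
type_counts = {
    col: demo[col].map(lambda v: type(v).__name__).value_counts().to_dict()
    for col in ['SPR-yrs', 'Portico-years']
}

# Index labels of rows whose SPR value is not a real list.
is_list = demo['SPR-yrs'].map(lambda v: isinstance(v, list))
not_lists = demo.loc[~is_list].index.tolist()

print(type_counts)
print(not_lists)
```

Any label that shows up in `not_lists` would break the list concatenation in the next cell, so it is worth resolving before combining.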


In [105]:
#copy first to avoid SettingWithCopyWarning; repo-coverage is the de-duplicated union of both year lists
df_s_p = df_s_p.copy()
df_s_p['repo-coverage'] = df_s_p.apply(lambda row: list(set(row['SPR-yrs'] + row['Portico-years'])), axis=1)
df_s_p


Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
422,121903,9967349630001701,Canadian journal of biochemistry and cell biology,['0714-7511'],e,8412,"[['61535212690001701', 'Canadian Science Publi...","[['53537719460001701', 'Canadian journal of bi...",[ Available from 1983 volume: 61 issue: 1 unti...,[Yes],...,,,,,,,"[1983, 1984, 1985]","[1983, 1984, 1985]","[1983, 1984, 1985]","[1984, 1985, 1983]"
423,20683,9956330160001701,Canadian journal of biochemistry and cell biology,['0714-7511'],p,8412,,,,,...,[TBIOM PERS],[6],"[1983, 1984, 1985]",['TBIOM PERS'],"[1983, 1984, 1985]","[(1983, 1985)]",,"[1983, 1984, 1985]","[1983, 1984, 1985]","[1984, 1985, 1983]"
462,89929,9941324470001701,Australian journal of statistics,['0004-9581'],p,8786,,,,,...,[TWILS PERC],[18],"[1984-85, 1967-70, 1963-66, 1959/1983, 1991, 1...",['TWILS PERC'],"[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196...","[(1959, 1993)]",,"[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196...","[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196...","[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196..."
463,89929,9941324470001701,Australian journal of statistics,['0004-9581'],p,8786,,,,,...,[TWILS PERC],[18],"[1984-85, 1967-70, 1963-66, 1959/1983, 1991, 1...",['TWILS PERC'],"[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196...","[(1959, 1993)]",,"[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196...","[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196...","[1959, 1960, 1961, 1962, 1963, 1964, 1965, 196..."
586,112463,9968110510001701,Journal of Comparative Pathology and Therapeutics,['0368-1742'],e,10384,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624638360001701', 'Journal of Comparative...",[ Available from 1888 volume: 1 until 1964 vol...,[Yes],...,,,,,,,"[1888, 1889, 1890, 1891, 1892, 1893, 1894, 189...","[1891, 1896, 1897, 1898, 1899, 1900, 1901, 190...","[1888, 1889, 1890, 1891, 1892, 1893, 1894, 189...","[1888, 1889, 1890, 1891, 1892, 1893, 1894, 189..."
587,39855,9965015360001701,Journal of Comparative Pathology and Therapeutics,['0368-1742'],p,10384,,,,,...,"[ZMLAC NON, TVET PER]",[43],"[1949-50, 1936-37, 1899-1900, 1914, 1964, 1903...","['ZMLAC NON', 'TVET PER']","[1894, 1895, 1896, 1899, 1900, 1901, 1902, 190...","[(1894, 1896), (1899, 1939), (1943, 1964)]",,"[1891, 1896, 1897, 1898, 1899, 1900, 1901, 190...","[1888, 1889, 1890, 1891, 1892, 1893, 1894, 189...","[1888, 1889, 1890, 1891, 1892, 1893, 1894, 189..."
1438,121712,9969448190001701,Comparative biochemistry and physiology.,['0300-9629'],e,18625,"[['61535212360001701', 'Elsevier SD Backfile B...","[['53624576870001701', 'Comparative biochemist...",[ Available from 1971 volume: 38 until 1994 vo...,[Yes],...,,,,,,,"[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197..."
1439,64008,9947190700001701,Comparative biochemistry and physiology.,['0300-9629'],p,18625,,,,,...,[TBIOM PERS],[12],"[1995, 1997, 1994, 1996]",['TBIOM PERS'],"[1994, 1995, 1996, 1997]","[(1994, 1997)]",,"[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197..."
1441,46587,9956317740001701,Comparative biochemistry and physiology. A. Co...,['0300-9629'],p,18625,,,,,...,"[TBIOM PERS, ZMLAC OWL]",[90],"[1978, 1984, 1987, 1971, 1977, 1983, 1972, 198...","['TBIOM PERS', 'ZMLAC OWL']","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[(1971, 1991)]",,"[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197...","[1971, 1972, 1973, 1974, 1975, 1976, 1977, 197..."
1509,96778,9921748030001701,Pharmacology and therapeutics. Part C. Clinica...,['0362-5486'],p,19112,,,,,...,[TCOS SN1],[2],"[1976, 1977]",['TCOS SN1'],"[1976, 1977]","[(1976, 1977)]",,"[1976, 1977, 1978]","[1976, 1977, 1978]","[1976, 1977, 1978]"


In [107]:
#the next 3 cells spot-check the combined coverage for one row; use a different index label if needed
df_s_p['SPR-yrs'][587]

[1891,
 1896,
 1897,
 1898,
 1899,
 1900,
 1901,
 1902,
 1903,
 1904,
 1905,
 1906,
 1907,
 1908,
 1909,
 1910,
 1911,
 1912,
 1913,
 1914,
 1915,
 1916,
 1917,
 1918,
 1919,
 1920,
 1921,
 1922,
 1923,
 1924,
 1925,
 1926,
 1927,
 1928,
 1929,
 1930,
 1931,
 1932,
 1933,
 1934,
 1935,
 1936,
 1937,
 1938,
 1939,
 1940,
 1941,
 1942,
 1943,
 1944,
 1945,
 1946,
 1947,
 1948,
 1949,
 1950,
 1951,
 1952,
 1953,
 1954,
 1955,
 1956,
 1957,
 1958,
 1959,
 1960,
 1961,
 1962,
 1963,
 1964]

In [108]:
df_s_p['Portico-years'][587]

[1888,
 1889,
 1890,
 1891,
 1892,
 1893,
 1894,
 1895,
 1896,
 1897,
 1898,
 1899,
 1900,
 1901,
 1902,
 1903,
 1904,
 1905,
 1906,
 1907,
 1908,
 1909,
 1910,
 1911,
 1912,
 1913,
 1914,
 1915,
 1916,
 1917,
 1918,
 1919,
 1920,
 1921,
 1922,
 1923,
 1924,
 1925,
 1926,
 1927,
 1928,
 1929,
 1930,
 1931,
 1932,
 1933,
 1934,
 1935,
 1936,
 1937,
 1938,
 1939,
 1940,
 1941,
 1942,
 1943,
 1944,
 1945,
 1946,
 1947,
 1948,
 1949,
 1950,
 1951,
 1952,
 1953,
 1954,
 1955,
 1956,
 1957,
 1958,
 1959,
 1960,
 1961,
 1962,
 1963,
 1964]

In [109]:
df_s_p['repo-coverage'][587]

[1888,
 1889,
 1890,
 1891,
 1892,
 1893,
 1894,
 1895,
 1896,
 1897,
 1898,
 1899,
 1900,
 1901,
 1902,
 1903,
 1904,
 1905,
 1906,
 1907,
 1908,
 1909,
 1910,
 1911,
 1912,
 1913,
 1914,
 1915,
 1916,
 1917,
 1918,
 1919,
 1920,
 1921,
 1922,
 1923,
 1924,
 1925,
 1926,
 1927,
 1928,
 1929,
 1930,
 1931,
 1932,
 1933,
 1934,
 1935,
 1936,
 1937,
 1938,
 1939,
 1940,
 1941,
 1942,
 1943,
 1944,
 1945,
 1946,
 1947,
 1948,
 1949,
 1950,
 1951,
 1952,
 1953,
 1954,
 1955,
 1956,
 1957,
 1958,
 1959,
 1960,
 1961,
 1962,
 1963,
 1964]
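The list for row 587 happens to come back in ascending order, but `list(set(...))` does not guarantee any order (the first rows of the `repo-coverage` output show `[1984, 1985, 1983]`). If later steps rely on sorted years, a sketch like the following may help; `union_years` is a hypothetical helper and the sample years are shortened for illustration.

```python
def union_years(*year_lists):
    """Return the sorted, de-duplicated union of any number of year lists."""
    merged = set()
    for years in year_lists:
        merged.update(years)
    return sorted(merged)

# Shortened sample lists; the real columns hold many more years.
spr_years = [1891, 1896, 1897, 1898]
portico_years = [1888, 1889, 1890, 1891, 1892]

coverage = union_years(spr_years, portico_years)

# Every year from either source should be present, each exactly once.
covers_both = set(spr_years) <= set(coverage) and set(portico_years) <= set(coverage)
no_duplicates = len(coverage) == len(set(coverage))

print(coverage)
print(covers_both, no_duplicates)
```

The two boolean checks are the same spot-check performed by eye in the three cells above, expressed so they can be run against any row.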

In [110]:
df = pd.merge(df,df_s_p[['repo-coverage']],how='left',left_index=True,right_index=True)
df

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,[1963],,,
1,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],,
2,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...",,,
3,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,61904,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,...,"[TSCI PER, ZMLAC OWL]",[53],"[1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",,,,
8819,17960,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,...,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",,,,
8820,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,
8821,125537,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],...,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,,


In [111]:
df_s = df[df['SPR-yrs'].notnull() & df['Portico-years'].isna()]
df_s

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
1,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],,
3,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,
6,78022,9964617400001701,Mathematics of the USSR. Izvestija,['0025-5726'],p,267,,,,,...,[TMATH PER],[40],"[1970, 1981-1982, 1980-1981, 1978, 1984, 1983-...",['TMATH PER'],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[(1967, 1992)]",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,
18,92015,9929554030001701,Veterinary research communications.,"['0378-4312', '0165-7380']",p,707,,,,,...,[TVET PER],[31],"[1984-85, 2004, 1995, 1987, 1983, 2001, 1997, ...",['TVET PER'],"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[(1980, 2007)]",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,
25,72522,9931084310001701,Neuropeptides,['0143-4179'],p,744,,,,,...,[TBIOM PERS],[35],"[1984-85, 1995, 1987, 2001, 1985, 1997, 1991, ...",['TBIOM PERS'],"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[(1980, 2003)]",,"[1980, 1981]",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8793,43247,9939180800001701,Annals of the New York Academy of Sciences,['0077-8923'],p,101219,,,,,...,"[TBIOM PERS, ZMLAC OWL, TSCI GEN]",[1796],"[2004, 2012, 1940, 2015, 1964, 1987, 1968-1969...","['TBIOM PERS', 'ZMLAC OWL', 'TSCI GEN']","[1877, 1878, 1879, 1880, 1881, 1882, 1883, 188...","[(1877, 1885), (1887, 1894), (1896, 2015)]",,[],,
8808,20011,9912065260001701,Physics letters,"['0031-9163', '1873-2410']",p,101461,,,,,...,[ZMLAC OWL],[14],"[1962-63, 1962, 1964, 1963, 1962/66, 1965, 196...",['ZMLAC OWL'],"[1962, 1963, 1964, 1965, 1966]","[(1962, 1966)]",,"[1962, 1963, 1964, 1965, 1966]",,
8811,49755,9953202980001701,Physics letters. Section A,"['0375-9601', '0031-9163']",p,101461,,,,,...,[TZDS GEN],[306],"[1974/75, 1970, 1978, 1984, 1995, 1982/84, 197...",['TZDS GEN'],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[(1967, 2003)]",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,
8812,91677,9953205880001701,Physics letters. Section B,"['0031-9163', '0370-2693']",p,101461,,,,,...,[TZDS GEN],[517],"[1974/75, 2004, 1970, 1978, 1984, 1995, 1983/8...",['TZDS GEN'],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[(1967, 2004)]",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,


In [112]:
#copy first to avoid SettingWithCopyWarning; with no Portico years, repo coverage is just the SPR years
df_s = df_s.copy()
df_s['repo-coverage'] = df_s['SPR-yrs']
df_s


Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
1,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],,[1963]
3,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
6,78022,9964617400001701,Mathematics of the USSR. Izvestija,['0025-5726'],p,267,,,,,...,[TMATH PER],[40],"[1970, 1981-1982, 1980-1981, 1978, 1984, 1983-...",['TMATH PER'],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[(1967, 1992)]",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
18,92015,9929554030001701,Veterinary research communications.,"['0378-4312', '0165-7380']",p,707,,,,,...,[TVET PER],[31],"[1984-85, 2004, 1995, 1987, 1983, 2001, 1997, ...",['TVET PER'],"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[(1980, 2007)]",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198..."
25,72522,9931084310001701,Neuropeptides,['0143-4179'],p,744,,,,,...,[TBIOM PERS],[35],"[1984-85, 1995, 1987, 2001, 1985, 1997, 1991, ...",['TBIOM PERS'],"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[(1980, 2003)]",,"[1980, 1981]",,"[1980, 1981]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8793,43247,9939180800001701,Annals of the New York Academy of Sciences,['0077-8923'],p,101219,,,,,...,"[TBIOM PERS, ZMLAC OWL, TSCI GEN]",[1796],"[2004, 2012, 1940, 2015, 1964, 1987, 1968-1969...","['TBIOM PERS', 'ZMLAC OWL', 'TSCI GEN']","[1877, 1878, 1879, 1880, 1881, 1882, 1883, 188...","[(1877, 1885), (1887, 1894), (1896, 2015)]",,[],,[]
8808,20011,9912065260001701,Physics letters,"['0031-9163', '1873-2410']",p,101461,,,,,...,[ZMLAC OWL],[14],"[1962-63, 1962, 1964, 1963, 1962/66, 1965, 196...",['ZMLAC OWL'],"[1962, 1963, 1964, 1965, 1966]","[(1962, 1966)]",,"[1962, 1963, 1964, 1965, 1966]",,"[1962, 1963, 1964, 1965, 1966]"
8811,49755,9953202980001701,Physics letters. Section A,"['0375-9601', '0031-9163']",p,101461,,,,,...,[TZDS GEN],[306],"[1974/75, 1970, 1978, 1984, 1995, 1982/84, 197...",['TZDS GEN'],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[(1967, 2003)]",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
8812,91677,9953205880001701,Physics letters. Section B,"['0031-9163', '0370-2693']",p,101461,,,,,...,[TZDS GEN],[517],"[1974/75, 2004, 1970, 1978, 1984, 1995, 1983/8...",['TZDS GEN'],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[(1967, 2004)]",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."


In [113]:
df_s.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e',
       'matches_group_id', 'e_coll_info', 'portfolio_info',
       'Coverage Information Combined', 'PCAD?', 'Vendor_key',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'Title (Complete)_PORTICO',
       'Portico Match_PORTICO', 'Portico Title_PORTICO', 'PCA_PORTICO',
       'Status_PORTICO', 'Earliest Year Preserved_PORTICO',
       'Latest Year Preserved_PORTICO', 'curr-lib-loc_x', 'all_item_count',
       'chron', 'curr-lib-loc_ALL', 'chron_as_list', 'chron_ranges_calc',
       'pcad-range', 'SPR-yrs', 'Portico-years', 'repo-coverage'],
      dtype='object')

In [114]:
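#fill repo-coverage for the SPR-only rows; combine_first aligns on the index and returns the columns in alphabetical order here
df0 = df.combine_first(df_s[['MMS_ID','repo-coverage']])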
df0

Unnamed: 0,Coverage Information Combined,Earliest Year Preserved_PORTICO,ISSN_cluster,Latest Year Preserved_PORTICO,MMS_ID,Match?_BTAA-SPR,PCAD?,PCA_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,...,chron_ranges_calc,curr-lib-loc_ALL,curr-lib-loc_x,e_coll_info,matches_group_id,p_or_e,pcad-range,portfolio_info,record_index,repo-coverage
0,[ Available from 1963 volume: 10 issue: 1 unti...,,"['0893-6706', '2162-1373']",,9968441380001701,,[Yes],,,,...,,,,"[['61619505660001701', 'IEEE/IET Electronic Li...",5,e,[1963],"[['53620359120001701', 'IEEE transactions on u...",128733,
1,,,['0893-6706'],,9963550760001701,YES,,,,,...,"[(1963, 1966)]",['TZDS GEN'],[TZDS GEN],,5,p,,,57684,[1963]
2,[ Available from 1886 volume: 25 issue: 5 unti...,,"['0020-2681', '2058-1009']",,9968429800001701,,[Yes],,,,...,,,,"[['61535216140001701', 'JSTOR Business III Col...",12,e,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[['53540140170001701', 'Journal of the Institu...",111951,
3,,,['0020-2681'],,9939481760001701,YES,,,,,...,"[(1890, 1890), (1892, 1892), (1895, 1895), (19...",['TWILS CLS'],[TWILS CLS],,12,p,,,88618,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,[ Available from 1939 volume: 1 until 2012;],,['0017-0097'],,9967115530001701,,[Yes],,,,...,,,,"[['61745117840001701', 'JSTOR Arts and Science...",92,e,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...","[['53537228640001701', 'Giornale degli economi...",123907,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,,,"['0065-9762', '0003-049X']",,9959156260001701,,,,,,...,"[(1937, 1947), (1952, 1952), (1960, 2003)]","['TSCI PER', 'ZMLAC OWL']","[TSCI PER, ZMLAC OWL]",,101522,p,,,61904,
8819,,,"['1533-6247', '0003-1615']",,9946768760001701,,,,,,...,"[(1944, 2014)]","['ZMLAC OWL', 'TWILS PER']","[ZMLAC OWL, TWILS PER]",,101581,p,,,17960,
8820,[ Available from 1944 volume: 1 issue: 1;],,"['1533-6247', '0003-1615']",,9967987860001701,,[Yes],,,,...,,,,"[['61535211010001701', 'JSTOR Arts and Science...",101581,e,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[['53539160390001701', 'The Americas.']]",117510,
8821,[ Available from 1980-07- volume: 1 issue: 1;],,"['0143-7496', '1879-0127']",,9968947900001701,,[Yes],,,,...,,,,"[['61624504590001701', 'Elsevier ScienceDirect...",101582,e,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[['53624617390001701', 'International journal ...",125537,


In [115]:
#restore the original column order after combine_first
df0 = df0[['record_index', 'MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e',
       'matches_group_id', 'e_coll_info', 'portfolio_info',
       'Coverage Information Combined', 'PCAD?', 'Vendor_key',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR',
       'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO',
       'curr-lib-loc_x', 'all_item_count', 'chron', 'curr-lib-loc_ALL',
       'chron_as_list', 'chron_ranges_calc', 'pcad-range', 'SPR-yrs',
       'Portico-years', 'repo-coverage']]
df0

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,[1963],,,
1,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],,[1963]
2,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...",,,
3,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,61904,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,...,"[TSCI PER, ZMLAC OWL]",[53],"[1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",,,,
8819,17960,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,...,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",,,,
8820,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,
8821,125537,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],...,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,,


In [116]:
# Rows that have Portico coverage but no SPR coverage; both masks must come from df0
df_p = df0[df0['SPR-yrs'].isna() & df0['Portico-years'].notnull()]
df_p

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
33,74020,9918228280001701,Annals of agricultural science.,"['1110-0249', '0570-1783']",p,1066,,,,,...,[ZMLAC OWL],[7],"[1960, 1956, 1959, 1964, 1961, 1965, 1957]",['ZMLAC OWL'],"[1956, 1957, 1959, 1960, 1961, 1964, 1965]","[(1956, 1957), (1959, 1961), (1964, 1965)]",,,"[2011, 2012, 2013, 2014, 2015, 2016, 2017, 201...",
36,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",['0941-1216'],p,1087,,,,,...,[ZMLAC OWL],[10],"[1995, 1993, 1992, 1994, 1996]",['ZMLAC OWL'],"[1992, 1993, 1994, 1995, 1996]","[(1992, 1996)]",,,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200...",
37,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",['0941-1216'],p,1087,,,,,...,[ZMLAC OWL],[10],"[1995, 1993, 1992, 1994, 1996]",['ZMLAC OWL'],"[1992, 1993, 1994, 1995, 1996]","[(1992, 1996)]",,,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200...",
38,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",['0941-1216'],p,1087,,,,,...,[ZMLAC OWL],[10],"[1995, 1993, 1992, 1994, 1996]",['ZMLAC OWL'],"[1992, 1993, 1994, 1995, 1996]","[(1992, 1996)]",,,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200...",
39,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",['0941-1216'],p,1087,,,,,...,[ZMLAC OWL],[10],"[1995, 1993, 1992, 1994, 1996]",['ZMLAC OWL'],"[1992, 1993, 1994, 1995, 1996]","[(1992, 1996)]",,,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8704,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,['0044-4294'],e,93064,"[['61535213310001701', 'Wiley Online Library V...","[['53742266770001701', 'Zentralblatt für Veter...",[ Available from 1954 volume: 1 issue: 1 until...,[Yes],...,,,,,,,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...",,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...",
8705,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,['0044-4294'],e,93064,"[['61535213310001701', 'Wiley Online Library V...","[['53742266770001701', 'Zentralblatt für Veter...",[ Available from 1954 volume: 1 issue: 1 until...,[Yes],...,,,,,,,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...",,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...",
8706,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,['0044-4294'],e,93064,"[['61535213310001701', 'Wiley Online Library V...","[['53742266770001701', 'Zentralblatt für Veter...",[ Available from 1954 volume: 1 issue: 1 until...,[Yes],...,,,,,,,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...",,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...",
8787,37420,9931018220001701,Ophelia,['0078-5326'],p,101035,,,,,...,[TMAGR PER],[20],"[1984-85, 1991-92, 1987-88, 1968-69, 1991, 198...",['TMAGR PER'],"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]",,,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",


In [117]:
# Work on an explicit copy so the assignment does not write to a slice of df0
df_p = df_p.copy()
df_p['repo-coverage'] = df_p['Portico-years']
df_p

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
33,74020,9918228280001701,Annals of agricultural science.,"['1110-0249', '0570-1783']",p,1066,,,,,...,[ZMLAC OWL],[7],"[1960, 1956, 1959, 1964, 1961, 1965, 1957]",['ZMLAC OWL'],"[1956, 1957, 1959, 1960, 1961, 1964, 1965]","[(1956, 1957), (1959, 1961), (1964, 1965)]",,,"[2011, 2012, 2013, 2014, 2015, 2016, 2017, 201...","[2011, 2012, 2013, 2014, 2015, 2016, 2017, 201..."
36,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",['0941-1216'],p,1087,,,,,...,[ZMLAC OWL],[10],"[1995, 1993, 1992, 1994, 1996]",['ZMLAC OWL'],"[1992, 1993, 1994, 1995, 1996]","[(1992, 1996)]",,,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200...","[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200..."
37,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",['0941-1216'],p,1087,,,,,...,[ZMLAC OWL],[10],"[1995, 1993, 1992, 1994, 1996]",['ZMLAC OWL'],"[1992, 1993, 1994, 1995, 1996]","[(1992, 1996)]",,,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200...","[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200..."
38,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",['0941-1216'],p,1087,,,,,...,[ZMLAC OWL],[10],"[1995, 1993, 1992, 1994, 1996]",['ZMLAC OWL'],"[1992, 1993, 1994, 1995, 1996]","[(1992, 1996)]",,,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200...","[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200..."
39,43622,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",['0941-1216'],p,1087,,,,,...,[ZMLAC OWL],[10],"[1995, 1993, 1992, 1994, 1996]",['ZMLAC OWL'],"[1992, 1993, 1994, 1995, 1996]","[(1992, 1996)]",,,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200...","[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8704,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,['0044-4294'],e,93064,"[['61535213310001701', 'Wiley Online Library V...","[['53742266770001701', 'Zentralblatt für Veter...",[ Available from 1954 volume: 1 issue: 1 until...,[Yes],...,,,,,,,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...",,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...","[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196..."
8705,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,['0044-4294'],e,93064,"[['61535213310001701', 'Wiley Online Library V...","[['53742266770001701', 'Zentralblatt für Veter...",[ Available from 1954 volume: 1 issue: 1 until...,[Yes],...,,,,,,,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...",,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...","[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196..."
8706,125990,9970295590001701,Zentralblatt f©ơr Veterin©Þrmedizin,['0044-4294'],e,93064,"[['61535213310001701', 'Wiley Online Library V...","[['53742266770001701', 'Zentralblatt für Veter...",[ Available from 1954 volume: 1 issue: 1 until...,[Yes],...,,,,,,,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...",,"[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196...","[1954, 1955, 1956, 1957, 1958, 1959, 1960, 196..."
8787,37420,9931018220001701,Ophelia,['0078-5326'],p,101035,,,,,...,[TMAGR PER],[20],"[1984-85, 1991-92, 1987-88, 1968-69, 1991, 198...",['TMAGR PER'],"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]",,,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199..."


In [118]:
df_p.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e',
       'matches_group_id', 'e_coll_info', 'portfolio_info',
       'Coverage Information Combined', 'PCAD?', 'Vendor_key',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'Title (Complete)_PORTICO',
       'Portico Match_PORTICO', 'Portico Title_PORTICO', 'PCA_PORTICO',
       'Status_PORTICO', 'Earliest Year Preserved_PORTICO',
       'Latest Year Preserved_PORTICO', 'curr-lib-loc_x', 'all_item_count',
       'chron', 'curr-lib-loc_ALL', 'chron_as_list', 'chron_ranges_calc',
       'pcad-range', 'SPR-yrs', 'Portico-years', 'repo-coverage'],
      dtype='object')

In [119]:
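# combine_first keeps df0's values and fills its nulls from df_p;
# it also sorts the columns alphabetically, so the column order is reset two cells later
df = df0.combine_first(df_p[['repo-coverage']])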
df

Unnamed: 0,Coverage Information Combined,Earliest Year Preserved_PORTICO,ISSN_cluster,Latest Year Preserved_PORTICO,MMS_ID,Match?_BTAA-SPR,PCAD?,PCA_PORTICO,Portico Match_PORTICO,Portico Title_PORTICO,...,chron_ranges_calc,curr-lib-loc_ALL,curr-lib-loc_x,e_coll_info,matches_group_id,p_or_e,pcad-range,portfolio_info,record_index,repo-coverage
0,[ Available from 1963 volume: 10 issue: 1 unti...,,"['0893-6706', '2162-1373']",,9968441380001701,,[Yes],,,,...,,,,"[['61619505660001701', 'IEEE/IET Electronic Li...",5,e,[1963],"[['53620359120001701', 'IEEE transactions on u...",128733,
1,,,['0893-6706'],,9963550760001701,YES,,,,,...,"[(1963, 1966)]",['TZDS GEN'],[TZDS GEN],,5,p,,,57684,[1963]
2,[ Available from 1886 volume: 25 issue: 5 unti...,,"['0020-2681', '2058-1009']",,9968429800001701,,[Yes],,,,...,,,,"[['61535216140001701', 'JSTOR Business III Col...",12,e,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[['53540140170001701', 'Journal of the Institu...",111951,
3,,,['0020-2681'],,9939481760001701,YES,,,,,...,"[(1890, 1890), (1892, 1892), (1895, 1895), (19...",['TWILS CLS'],[TWILS CLS],,12,p,,,88618,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,[ Available from 1939 volume: 1 until 2012;],,['0017-0097'],,9967115530001701,,[Yes],,,,...,,,,"[['61745117840001701', 'JSTOR Arts and Science...",92,e,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...","[['53537228640001701', 'Giornale degli economi...",123907,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,,,"['0065-9762', '0003-049X']",,9959156260001701,,,,,,...,"[(1937, 1947), (1952, 1952), (1960, 2003)]","['TSCI PER', 'ZMLAC OWL']","[TSCI PER, ZMLAC OWL]",,101522,p,,,61904,
8819,,,"['1533-6247', '0003-1615']",,9946768760001701,,,,,,...,"[(1944, 2014)]","['ZMLAC OWL', 'TWILS PER']","[ZMLAC OWL, TWILS PER]",,101581,p,,,17960,
8820,[ Available from 1944 volume: 1 issue: 1;],,"['1533-6247', '0003-1615']",,9967987860001701,,[Yes],,,,...,,,,"[['61535211010001701', 'JSTOR Arts and Science...",101581,e,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[['53539160390001701', 'The Americas.']]",117510,
8821,[ Available from 1980-07- volume: 1 issue: 1;],,"['0143-7496', '1879-0127']",,9968947900001701,,[Yes],,,,...,,,,"[['61624504590001701', 'Elsevier ScienceDirect...",101582,e,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[['53624617390001701', 'International journal ...",125537,


In [120]:
df.columns

Index(['Coverage Information Combined', 'Earliest Year Preserved_PORTICO',
       'ISSN_cluster', 'Latest Year Preserved_PORTICO', 'MMS_ID',
       'Match?_BTAA-SPR', 'PCAD?', 'PCA_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'Portico-years', 'SPR Holdings_BTAA-SPR',
       'SPR-yrs', 'Status_PORTICO', 'Title (Complete)_PORTICO',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR', 'Title_bib',
       'Vendor_key', 'all_item_count', 'chron', 'chron_as_list',
       'chron_ranges_calc', 'curr-lib-loc_ALL', 'curr-lib-loc_x',
       'e_coll_info', 'matches_group_id', 'p_or_e', 'pcad-range',
       'portfolio_info', 'record_index', 'repo-coverage'],
      dtype='object')

In [121]:
df = df[['record_index', 'MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e',
       'matches_group_id', 'e_coll_info', 'portfolio_info',
       'Coverage Information Combined', 'PCAD?', 'Vendor_key',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR',
       'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO',
       'curr-lib-loc_x', 'all_item_count', 'chron', 'curr-lib-loc_ALL',
       'chron_as_list', 'chron_ranges_calc', 'pcad-range', 'SPR-yrs',
       'Portico-years', 'repo-coverage']]
df

Unnamed: 0,record_index,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,128733,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],...,,,,,,,[1963],,,
1,57684,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],,[1963]
2,111951,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],...,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...",,,
3,88618,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,123907,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],...,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,61904,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,...,"[TSCI PER, ZMLAC OWL]",[53],"[1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",,,,
8819,17960,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,...,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",,,,
8820,117510,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],...,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,
8821,125537,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],...,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,,


In [122]:
df.to_pickle(f'all_coverage_as_lists_{today}.pkl')

#### Drop duplicates
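Lists are unhashable, so `drop_duplicates` cannot compare rows that contain them. To drop duplicates, convert the list-type columns to strings, drop the duplicate rows, then convert the strings back to their original types by reading each one as a Python literal with `ast.literal_eval`.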
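
The round trip described above can be sketched on a toy dataframe. This is a minimal illustration with made-up data, not the notebook's actual cells; the names `toy`, `as_text`, `deduped`, and `deduped_alt` are hypothetical. The last step shows one possible alternative, offered as an assumption rather than the method used here: flag duplicates on a stringified copy and filter the original, so nothing has to be converted back.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical, and 'years' mixes lists with NaN.
toy = pd.DataFrame({
    'MMS_ID': ['A', 'A', 'B'],
    'years': [[1963, 1964], [1963, 1964], np.nan],
})

# Step 1: cast the list-type column to strings (lists -> "[1963, 1964]", NaN -> "nan").
as_text = toy.copy()
as_text['years'] = as_text['years'].apply(str)

# Step 2: every cell is now hashable, so duplicate rows can be dropped.
deduped = as_text.drop_duplicates().copy()

# Step 3: read each string back as a Python literal.
# "nan" is not a valid literal, so it is restored by hand.
deduped['years'] = deduped['years'].apply(
    lambda s: ast.literal_eval(s) if s != 'nan' else np.nan
)

# Alternative sketch: build the duplicate mask from a stringified copy
# and use it to filter the original frame, which keeps its types untouched.
is_dup = toy.apply(lambda col: col.map(str)).duplicated()
deduped_alt = toy[~is_dup]
```

Both results keep rows 0 and 2. The alternative compares every column as text, so values that differ only in type (for example `1` and `'1'`) would be treated as equal; whether that matters depends on the data.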

In [230]:
#change filename
df = pd.read_pickle('all_coverage_as_lists_20201112.pkl')
df.columns

Index(['record_index', 'MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e',
       'matches_group_id', 'e_coll_info', 'portfolio_info',
       'Coverage Information Combined', 'PCAD?', 'Vendor_key',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR', 'Title (Complete)_PORTICO',
       'Portico Match_PORTICO', 'Portico Title_PORTICO', 'PCA_PORTICO',
       'Status_PORTICO', 'Earliest Year Preserved_PORTICO',
       'Latest Year Preserved_PORTICO', 'curr-lib-loc_x', 'all_item_count',
       'chron', 'curr-lib-loc_ALL', 'chron_as_list', 'chron_ranges_calc',
       'pcad-range', 'SPR-yrs', 'Portico-years', 'repo-coverage'],
      dtype='object')

In [231]:
df = df[['MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e',
       'matches_group_id', 'e_coll_info', 'portfolio_info',
       'Coverage Information Combined', 'PCAD?', 'Vendor_key',
       'Title 1 (Print)_BTAA-SPR', 'Title 2 (Print)_BTAA-SPR',
       'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR',
       'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO',
       'curr-lib-loc_x', 'all_item_count', 'chron', 'curr-lib-loc_ALL',
       'chron_as_list', 'chron_ranges_calc', 'pcad-range', 'SPR-yrs',
       'Portico-years', 'repo-coverage']]
df

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,,,,[1963],,,
1,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...",,,
3,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],[JSTOR],...,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,,...,"[TSCI PER, ZMLAC OWL]",[53],"[1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",,,,
8819,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,,...,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",,,,
8820,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],[JSTOR],...,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,
8821,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,,


In [232]:
cols = list(df.columns)
cols

['MMS_ID',
 'Title_bib',
 'ISSN_cluster',
 'p_or_e',
 'matches_group_id',
 'e_coll_info',
 'portfolio_info',
 'Coverage Information Combined',
 'PCAD?',
 'Vendor_key',
 'Title 1 (Print)_BTAA-SPR',
 'Title 2 (Print)_BTAA-SPR',
 'Match?_BTAA-SPR',
 'SPR Holdings_BTAA-SPR',
 'Title (Complete)_PORTICO',
 'Portico Match_PORTICO',
 'Portico Title_PORTICO',
 'PCA_PORTICO',
 'Status_PORTICO',
 'Earliest Year Preserved_PORTICO',
 'Latest Year Preserved_PORTICO',
 'curr-lib-loc_x',
 'all_item_count',
 'chron',
 'curr-lib-loc_ALL',
 'chron_as_list',
 'chron_ranges_calc',
 'pcad-range',
 'SPR-yrs',
 'Portico-years',
 'repo-coverage']

In [233]:
len(cols)

31

In [234]:
for x in df.loc[2564]:
    print(type(x))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'numpy.int64'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'str'>
<class 'float'>
<class 'str'>
<class 'str'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'str'>
<class 'list'>
<class 'list'>
<class 'float'>
<class 'list'>
<class 'float'>
<class 'list'>


In [235]:
# Collect the names of the columns whose value in row 0 is a list
list_cols = [col for col, x in zip(cols, df.loc[0]) if isinstance(x, list)]
list_cols

['e_coll_info',
 'portfolio_info',
 'Coverage Information Combined',
 'PCAD?',
 'Vendor_key',
 'pcad-range']

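*A single row does not reveal every list-type column, because a column can hold NaN in that row. There's probably a better way to do this, but until you figure it out: run the next cell several times with different row labels in `df.loc` until `list_cols` stops changing.*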
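
One possible improvement, offered as a sketch rather than the notebook's method: check every row of each column at once, so no row has to be picked by hand. The function name `find_list_columns` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def find_list_columns(frame):
    """Return the names of columns that hold a list in at least one row."""
    return [
        col for col in frame.columns
        if frame[col].map(lambda value: isinstance(value, list)).any()
    ]


# Hypothetical toy frame: 'chron' holds a list only in the second row,
# so inspecting row 0 alone would miss it.
toy = pd.DataFrame({
    'MMS_ID': ['A', 'B'],
    'chron': [np.nan, ['1963-1966']],
    'pcad-range': [[1963], np.nan],
})
list_cols = find_list_columns(toy)
```

In the notebook this would be called as `list_cols = find_list_columns(df)`, which should make the repeated manual runs unnecessary.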

In [242]:
# Add any list-type columns found in this row, then remove repeated names
for col, x in zip(cols, df.loc[987]):
    if isinstance(x, list):
        list_cols.append(col)
list_cols = list(set(list_cols))
print(list_cols)
len(list_cols)

['Coverage Information Combined', 'SPR-yrs', 'Portico-years', 'chron_ranges_calc', 'e_coll_info', 'chron', 'repo-coverage', 'PCAD?', 'all_item_count', 'pcad-range', 'Vendor_key', 'curr-lib-loc_x', 'portfolio_info', 'chron_as_list']


14

In [243]:
# Cast the list-type columns to strings so that rows can be hashed and compared
for col in list_cols:
    df[col] = df[col].apply(str)
df

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[""['61619505660001701', 'IEEE/IET Electronic L...","[""['53620359120001701', 'IEEE transactions on ...",[' Available from 1963 volume: 10 issue: 1 unt...,['Yes'],['other'],...,,,,,,,[1963],,,
1,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,,...,['TZDS GEN'],[1],['1963-1966'],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[""['61535216140001701', 'JSTOR Business III Co...","[""['53540140170001701', 'Journal of the Instit...",[' Available from 1886 volume: 25 issue: 5 unt...,['Yes'],['JSTOR'],...,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...",,,
3,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,,...,['TWILS CLS'],[46],"['', '1890', '1939', '1948', '1915', '1943', '...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[""['61745117840001701', 'JSTOR Arts and Scienc...","[""['53537228640001701', 'Giornale degli econom...",[' Available from 1939 volume: 1 until 2012;'],['Yes'],['JSTOR'],...,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,,...,"['TSCI PER', 'ZMLAC OWL']",[53],"['1970', '1939', '1998/99', '1960', '1978', '1...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",,,,
8819,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,,...,"['ZMLAC OWL', 'TWILS PER']",[74],"['', '1973/1974', '2004', '1958/1959', '1968/1...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",,,,
8820,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[""['61535211010001701', 'JSTOR Arts and Scienc...","[""['53539160390001701', 'The Americas.']""]",[' Available from 1944 volume: 1 issue: 1;'],['Yes'],['JSTOR'],...,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,
8821,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[""['61624504590001701', 'Elsevier ScienceDirec...","[""['53624617390001701', 'International journal...",[' Available from 1980-07- volume: 1 issue: 1;'],['Yes'],['Elsevier'],...,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,,


In [248]:
for x in df.loc[1776]:
    print(type(x))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'numpy.int64'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'float'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


In [249]:
df.drop_duplicates(inplace=True)
df

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"['0893-6706', '2162-1373']",e,5,"[""['61619505660001701', 'IEEE/IET Electronic L...","[""['53620359120001701', 'IEEE transactions on ...",[' Available from 1963 volume: 10 issue: 1 unt...,['Yes'],['other'],...,,,,,,,[1963],,,
1,9963550760001701,IEEE transactions on ultrasonics engineering,['0893-6706'],p,5,,,,,,...,['TZDS GEN'],[1],['1963-1966'],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"['0020-2681', '2058-1009']",e,12,"[""['61535216140001701', 'JSTOR Business III Co...","[""['53540140170001701', 'Journal of the Instit...",[' Available from 1886 volume: 25 issue: 5 unt...,['Yes'],['JSTOR'],...,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...",,,
3,9939481760001701,Journal of the Institute of Actuaries,['0020-2681'],p,12,,,,,,...,['TWILS CLS'],[46],"['', '1890', '1939', '1948', '1915', '1943', '...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,9967115530001701,Giornale degli economisti e annali di economia,['0017-0097'],e,92,"[""['61745117840001701', 'JSTOR Arts and Scienc...","[""['53537228640001701', 'Giornale degli econom...",[' Available from 1939 volume: 1 until 2012;'],['Yes'],['JSTOR'],...,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,9959156260001701,Year book - American Philosophical Society,"['0065-9762', '0003-049X']",p,101522,,,,,,...,"['TSCI PER', 'ZMLAC OWL']",[53],"['1970', '1939', '1998/99', '1960', '1978', '1...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",,,,
8819,9946768760001701,The Americas,"['1533-6247', '0003-1615']",p,101581,,,,,,...,"['ZMLAC OWL', 'TWILS PER']",[74],"['', '1973/1974', '2004', '1958/1959', '1968/1...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",,,,
8820,9967987860001701,The Americas - Academy of American Franciscan ...,"['1533-6247', '0003-1615']",e,101581,"[""['61535211010001701', 'JSTOR Arts and Scienc...","[""['53539160390001701', 'The Americas.']""]",[' Available from 1944 volume: 1 issue: 1;'],['Yes'],['JSTOR'],...,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,
8821,9968947900001701,International journal of adhesion and adhesives,"['0143-7496', '1879-0127']",e,101582,"[""['61624504590001701', 'Elsevier ScienceDirec...","[""['53624617390001701', 'International journal...",[' Available from 1980-07- volume: 1 issue: 1;'],['Yes'],['Elsevier'],...,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,,


In [250]:
cols

['MMS_ID',
 'Title_bib',
 'ISSN_cluster',
 'p_or_e',
 'matches_group_id',
 'e_coll_info',
 'portfolio_info',
 'Coverage Information Combined',
 'PCAD?',
 'Vendor_key',
 'Title 1 (Print)_BTAA-SPR',
 'Title 2 (Print)_BTAA-SPR',
 'Match?_BTAA-SPR',
 'SPR Holdings_BTAA-SPR',
 'Title (Complete)_PORTICO',
 'Portico Match_PORTICO',
 'Portico Title_PORTICO',
 'PCA_PORTICO',
 'Status_PORTICO',
 'Earliest Year Preserved_PORTICO',
 'Latest Year Preserved_PORTICO',
 'curr-lib-loc_x',
 'all_item_count',
 'chron',
 'curr-lib-loc_ALL',
 'chron_as_list',
 'chron_ranges_calc',
 'pcad-range',
 'SPR-yrs',
 'Portico-years',
 'repo-coverage']

In [251]:
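# ISSN_cluster was already stored as a string, so parse it back to a list as well
list_cols.append('ISSN_cluster')
# Read each string back as a Python literal; str(np.nan) is 'nan', which is not a literal
for col in list_cols:
    df[col] = df[col].apply(lambda s: ast.literal_eval(s) if (s != 'nan') else np.nan)
df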

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,,,,[1963],,,
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...",,,
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,9967115530001701,Giornale degli economisti e annali di economia,[0017-0097],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],[JSTOR],...,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8818,9959156260001701,Year book - American Philosophical Society,"[0065-9762, 0003-049X]",p,101522,,,,,,...,"[TSCI PER, ZMLAC OWL]",[53],"[1970, 1939, 1998/99, 1960, 1978, 1940, 1995, ...","['TSCI PER', 'ZMLAC OWL']","[1937, 1938, 1939, 1940, 1941, 1942, 1943, 194...","[(1937, 1947), (1952, 1952), (1960, 2003)]",,,,
8819,9946768760001701,The Americas,"[1533-6247, 0003-1615]",p,101581,,,,,,...,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",,,,
8820,9967987860001701,The Americas - Academy of American Franciscan ...,"[1533-6247, 0003-1615]",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],[JSTOR],...,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,,
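
The recast above can be tried on a toy frame first. The following is a minimal sketch, not the notebook's own code: `parse_list_cell` and the `toy` frame are hypothetical names and values introduced only for illustration. It assumes the columns were turned into text with `astype(str)`, which writes lists as their Python literal form.

```python
import ast

import numpy as np
import pandas as pd


def parse_list_cell(value):
    """Turn a stringified list back into a list; return NaN for missing values."""
    # Covers both a real NaN (not a str) and the text 'nan' left behind by astype(str)
    if not isinstance(value, str) or value == 'nan':
        return np.nan
    return ast.literal_eval(value)


# Toy frame standing in for the main dataframe (illustrative values only)
toy = pd.DataFrame({
    'ISSN_cluster': [['0893-6706', '2162-1373'], ['0893-6706']],
    'SPR-yrs': [np.nan, [1963]],
})

as_text = toy.astype(str)  # lists become text such as "['0893-6706']"
restored = as_text.copy()
for col in ['ISSN_cluster', 'SPR-yrs']:
    restored[col] = as_text[col].apply(parse_list_cell)
```

Checking `isinstance(value, str)` before comparing to `'nan'` means the helper also tolerates columns that still hold real missing values rather than text.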


#### Make sure all groups still have p and e

In [252]:
find_only_p_or_e = df[['matches_group_id','p_or_e','MMS_ID','Title_bib']]
find_only_p_or_e

Unnamed: 0,matches_group_id,p_or_e,MMS_ID,Title_bib
0,5,e,9968441380001701,IEEE transactions on ultrasonics engineering
1,5,p,9963550760001701,IEEE transactions on ultrasonics engineering
2,12,e,9968429800001701,Journal of the Institute of Actuaries
3,12,p,9939481760001701,Journal of the Institute of Actuaries
4,92,e,9967115530001701,Giornale degli economisti e annali di economia
...,...,...,...,...
8818,101522,p,9959156260001701,Year book - American Philosophical Society
8819,101581,p,9946768760001701,The Americas
8820,101581,e,9967987860001701,The Americas - Academy of American Franciscan ...
8821,101582,e,9968947900001701,International journal of adhesion and adhesives


In [253]:
find_only_p_or_e = find_only_p_or_e.groupby(['matches_group_id']).agg(lambda x: sorted(list(set(x)))).reset_index()
find_only_p_or_e

Unnamed: 0,matches_group_id,p_or_e,MMS_ID,Title_bib
0,5,"[e, p]","[9963550760001701, 9968441380001701]",[IEEE transactions on ultrasonics engineering]
1,12,"[e, p]","[9939481760001701, 9968429800001701]",[Journal of the Institute of Actuaries]
2,92,"[e, p]","[9963082590001701, 9967115530001701]",[Giornale degli economisti e annali di economia]
3,267,"[e, p]","[9964617400001701, 9968936470001701]","[Mathematics of the USSR. Izvestija, Mathemati..."
4,277,"[e, p]","[9942472340001701, 9968336850001701]",[Laboratory techniques in biochemistry and mol...
...,...,...,...,...
3538,101448,"[e, p]","[9931514970001701, 9936090000001701, 996927747...","[Gazette, Gazette (Leiden, Netherlands : Onlin..."
3539,101461,"[e, p]","[9912065260001701, 9953202980001701, 995320588...","[Physics letters, Physics letters., Physics le..."
3540,101522,"[e, p]","[9914298940001701, 9918096880001701, 992826753...","[Grantees' reports /, Memoirs of the American ..."
3541,101581,"[e, p]","[9946768760001701, 9967987860001701]","[The Americas, The Americas - Academy of Ameri..."


In [254]:
find_only_p_or_e['pore'] = find_only_p_or_e['p_or_e'].apply(lambda x: ' '.join(x))
find_only_p_or_e

Unnamed: 0,matches_group_id,p_or_e,MMS_ID,Title_bib,pore
0,5,"[e, p]","[9963550760001701, 9968441380001701]",[IEEE transactions on ultrasonics engineering],e p
1,12,"[e, p]","[9939481760001701, 9968429800001701]",[Journal of the Institute of Actuaries],e p
2,92,"[e, p]","[9963082590001701, 9967115530001701]",[Giornale degli economisti e annali di economia],e p
3,267,"[e, p]","[9964617400001701, 9968936470001701]","[Mathematics of the USSR. Izvestija, Mathemati...",e p
4,277,"[e, p]","[9942472340001701, 9968336850001701]",[Laboratory techniques in biochemistry and mol...,e p
...,...,...,...,...,...
3538,101448,"[e, p]","[9931514970001701, 9936090000001701, 996927747...","[Gazette, Gazette (Leiden, Netherlands : Onlin...",e p
3539,101461,"[e, p]","[9912065260001701, 9953202980001701, 995320588...","[Physics letters, Physics letters., Physics le...",e p
3540,101522,"[e, p]","[9914298940001701, 9918096880001701, 992826753...","[Grantees' reports /, Memoirs of the American ...",e p
3541,101581,"[e, p]","[9946768760001701, 9967987860001701]","[The Americas, The Americas - Academy of Ameri...",e p


In [255]:
# If every group has both p and e, only one value ('e p') should appear here
find_only_p_or_e['pore'].value_counts()

e p    3543
Name: pore, dtype: int64
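
If the check ever shows more than one value, the offending groups still need to be found. A possible way to list them directly is sketched below on a hypothetical toy frame (the `toy` data and the variable names are assumptions, not part of the notebook); it counts distinct `p_or_e` values per group instead of joining them into a string.

```python
import pandas as pd

# Toy data: group 92 is missing its electronic record
toy = pd.DataFrame({
    'matches_group_id': [5, 5, 12, 12, 92],
    'p_or_e': ['e', 'p', 'e', 'p', 'p'],
})

# Number of distinct formats (p, e) present in each group
kinds_per_group = toy.groupby('matches_group_id')['p_or_e'].nunique()

# Groups with fewer than 2 distinct formats lack either p or e
incomplete_ids = kinds_per_group[kinds_per_group < 2].index.tolist()
```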

#### Filter out groups with len > 2

In [256]:
grouped = df.groupby('matches_group_id')
print("Total count of groups")
print(len(grouped))
print("Shape of frame of groups bigger than 2")
print(grouped.filter(lambda x: len(x) > 2).shape)
print("Count of groups bigger than 2")
print(len(grouped.filter(lambda x: len(x) > 2).groupby('matches_group_id')))
print("Shape of frame of groups of length 2")
print(grouped.filter(lambda x: len(x) == 2).shape)
print("Count of groups of length 2")
print(len(grouped.filter(lambda x: len(x) == 2).groupby('matches_group_id')))

Total count of groups
3543
Shape of frame of groups bigger than 2
(2670, 31)
Count of groups bigger than 2
701
Shape of frame of groups of length 2
(5684, 31)
Count of groups of length 2
2842
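
The same counts can be produced without calling `filter` several times. This is a sketch under stated assumptions: the `toy` frame and the `summary` dictionary are hypothetical, and `transform('size')` is used here as an alternative to the `grouped.filter(...)` calls above, not as the notebook's method.

```python
import pandas as pd

toy = pd.DataFrame({
    'matches_group_id': [5, 5, 537, 537, 537, 12, 12],
    'p_or_e': ['e', 'p', 'p', 'p', 'e', 'e', 'p'],
})

# Size of the group each row belongs to, aligned with the original index
group_size = toy.groupby('matches_group_id')['p_or_e'].transform('size')

pairs = toy[group_size == 2]       # one print row and one electronic row
big_groups = toy[group_size > 2]   # set aside for review by hand

summary = {
    'total_groups': toy['matches_group_id'].nunique(),
    'pair_rows': len(pairs),
    'pair_groups': pairs['matches_group_id'].nunique(),
    'big_rows': len(big_groups),
    'big_groups': big_groups['matches_group_id'].nunique(),
}
```

As a sanity check on the real output above, the two group counts add up to the total: 2842 + 701 = 3543.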


In [257]:
df['matches_group_id'].max()

101582

In [258]:
pe11 = grouped.filter(lambda x: len(x) == 2)
pe11

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,,,,[1963],,,
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",,[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,,,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...",,,
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,9967115530001701,Giornale degli economisti e annali di economia,[0017-0097],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],[JSTOR],...,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8802,9946092870001701,Bibliothèque d'humanisme et renaissance ;,[0006-1999],p,101380,,,,,,...,[TWILS PER],[90],"[2004, 2012, 2015, 1964, 1987, 2005, 1945, 196...",['TWILS PER'],"[1941, 1942, 1943, 1944, 1945, 1946, 1947, 194...","[(1941, 1948), (1950, 2020)]",,,,
8819,9946768760001701,The Americas,"[1533-6247, 0003-1615]",p,101581,,,,,,...,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]",,,,
8820,9967987860001701,The Americas - Academy of American Franciscan ...,"[1533-6247, 0003-1615]",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],[JSTOR],...,,,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,,


##### Note on next cell:
In early PCAD runs, we reviewed the file of groups larger than 2 titles (rows) and added a column for hand-coding new group ids after examining the records. For example, the script might match 4 rows under one group ID that, on inspection, are really two separate groups, so they get two different new group ids. We did this because the script only calculates overlaps for groups of exactly 2 rows, and recoding increased the number of groups it could analyze. We stopped doing this step because we had enough titles to work with without these groups, but a human should review the list at some point. These lists are probably shorter now, since Sunshine and her team have been cleaning up e-resources with title changes, so more records should match 1:1.

To summarize: ignore these groups for now, knowing that someone needs to review the big groups by hand at some point.

In [259]:
pe_big_groups = grouped.filter(lambda x: len(x) > 2)
pe_big_groups.to_csv('pe_big_groups_' + today + '.txt',sep='\t',index=False)
pe_big_groups

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
12,9965662340001701,Journal of endocrinology,[0022-0795],p,537,,,,,,...,"[TVET PER, TBIOM PERS]",[432],"[2004, 2012, 2015, 1964, 1987, 2005, 1956-57, ...","['TVET PER', 'TBIOM PERS']","[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...","[(1939, 2018)]",,,,
13,9960319340001701,Journal of endocrinology,[0022-0795],p,537,,,,,,...,[ZMLAC UMDN],[154],"[1940/1941, 1970, 1939, 1978, 1960, 1995, 1984...",['ZMLAC UMDN'],"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...","[(1939, 1947), (1949, 1955), (1957, 1999)]",,,,
14,9966705820001701,Journal of endocrinology (Online),"[0022-0795, 1479-6805]",e,537,"[['61808661620001701', 'Society for Endocrinol...","[['53808661570001701', 'Journal of endocrinolo...",[ Available from 1939 volume: 1 issue: 1 until...,[Yes],[other],...,,,,,,,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
17,9966212070001701,Veterinary research communications,"[0165-7380, 1573-7446]",e,707,"[['61535213730001701', 'SpringerLink Historica...","[['53535250180001701', 'Veterinary research co...",[ Available from 1977 volume: 1 issue: 1 until...,[Yes],[Springer],...,,,,,,,"[1977, 1978, 1979, 1980, 1981, 1982, 1983, 198...",,,
18,9929554030001701,Veterinary research communications.,"[0378-4312, 0165-7380]",p,707,,,,,,...,[TVET PER],[31],"[1984-85, 2004, 1995, 1987, 1983, 2001, 1997, ...",['TVET PER'],"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[(1980, 2007)]",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8814,9918096880001701,Memoirs of the American Philosophical Society ...,"[0065-9738, 0065-9762]",p,101522,,,,,,...,"[TSCI REFA, TSCI GEN]",[257],"[2004, 2012, 1940, 1975-76, 1964, 1987, 1981-8...","['TSCI REFA', 'TSCI GEN']","[1935, 1936, 1937, 1938, 1939, 1940, 1943, 194...","[(1935, 1940), (1943, 1944), (1947, 1951), (19...",,,,
8815,9968665290001701,Proceedings and addresses of the American Phil...,"[2325-9248, 0065-972X]",e,101522,"[['61535211010001701', 'JSTOR Arts and Science...","[['53540700110001701', 'Proceedings and addres...",[ Available from 1927 volume: 1;],[Yes],[JSTOR],...,,,,,,,"[1927, 1928, 1929, 1930, 1931, 1932, 1933, 193...",,,
8816,9950698160001701,Proceedings and addresses of the American Phil...,"[0065-972X, 0003-049X]",p,101522,,,,,,...,"[TWILS CLS, TWILS PER]",[82],"[, 2004, 2010, 2012, 2018, 2001-2002, 2008-200...","['TWILS CLS', 'TWILS PER']","[1985, 1986, 1987, 1988, 1989, 1990, 1991, 199...","[(1985, 2019)]",,,,
8817,9914298940001701,Proceedings of the American Philosophical Society,[0003-049X],p,101522,,,,,,...,"[TSCI PER, ZMLAC OWL]",[151],"[2004, 2012, 1933, 1871/72, 1940, 2015, 1904-0...","['TSCI PER', 'ZMLAC OWL']","[1838, 1839, 1840, 1841, 1842, 1843, 1844, 184...","[(1838, 2019)]",,,,
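
If the hand-coding step described in the note above is ever revived, the reviewed file could be folded back in roughly as follows. This is only a sketch: the `new_group_id` column name, the id values, and the `reviewed` frame are hypothetical stand-ins for whatever the reviewer actually records.

```python
import pandas as pd

# Hypothetical reviewed file: a person filled in new_group_id by hand,
# splitting one 4-row group into two print/electronic pairs
reviewed = pd.DataFrame({
    'MMS_ID': ['A1', 'A2', 'A3', 'A4'],
    'p_or_e': ['p', 'e', 'p', 'e'],
    'matches_group_id': [537, 537, 537, 537],
    'new_group_id': ['537a', '537a', '537b', '537b'],
})

# Replace the script-assigned id with the hand-coded one
regrouped = (reviewed
             .assign(matches_group_id=reviewed['new_group_id'])
             .drop(columns='new_group_id'))

sizes = regrouped.groupby('matches_group_id').size()

# Keep only groups the overlap calculation can handle: exactly one p and one e
usable = regrouped.groupby('matches_group_id').filter(
    lambda g: len(g) == 2 and set(g['p_or_e']) == {'e', 'p'}
)
```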


#### Back and forward fill coverage ranges within groups

In [260]:
pegr = pe11.groupby('matches_group_id')
pegr

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000027C97B7D448>

In [261]:
dates_lists = ['chron_as_list','pcad-range','SPR-yrs','Portico-years','repo-coverage']
# pe11 came from groupby().filter(); take an explicit copy so the assignments
# below do not raise SettingWithCopyWarning
pe11 = pe11.copy()
for x in dates_lists:
    # Regroup on each pass so the forward fill sees the back-filled column
    pe11[x] = pe11.groupby('matches_group_id')[x].bfill()
    pe11[x] = pe11.groupby('matches_group_id')[x].ffill()

pe11


Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,,"[1963, 1964, 1965, 1966]",,[1963],[1963],,[1963]
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",[1963],[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...",,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
4,9967115530001701,Giornale degli economisti e annali di economia,[0017-0097],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],[JSTOR],...,,,,,"[1939, 1940, 1941, 1942, 1946, 1947, 1949, 195...",,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8802,9946092870001701,Bibliothèque d'humanisme et renaissance ;,[0006-1999],p,101380,,,,,,...,[TWILS PER],[90],"[2004, 2012, 2015, 1964, 1987, 2005, 1945, 196...",['TWILS PER'],"[1941, 1942, 1943, 1944, 1945, 1946, 1947, 194...","[(1941, 1948), (1950, 2020)]","[1941, 1942, 1943, 1944, 1945, 1946, 1947, 194...",,,
8819,9946768760001701,The Americas,"[1533-6247, 0003-1615]",p,101581,,,,,,...,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,
8820,9967987860001701,The Americas - Academy of American Franciscan ...,"[1533-6247, 0003-1615]",e,101581,"[['61535211010001701', 'JSTOR Arts and Science...","[['53539160390001701', 'The Americas.']]",[ Available from 1944 volume: 1 issue: 1;],[Yes],[JSTOR],...,,,,,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,"[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199..."
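
To see what the group-wise fill does, and why it has to be done per group, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration). Each pair shares its values in both directions, but a group with no value at all stays empty instead of inheriting from its neighbour.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'matches_group_id': [5, 5, 12, 12, 92, 92],
    'p_or_e': ['e', 'p', 'e', 'p', 'e', 'p'],
    'pcad-range': [[1963], np.nan, [1886, 1887], np.nan, [1939, 1940], np.nan],
    'repo-coverage': [np.nan, [1963], np.nan, [1935, 1936], np.nan, np.nan],
})

filled = toy.copy()
for col in ['pcad-range', 'repo-coverage']:
    # Back fill copies the print row's value up to the electronic row ...
    filled[col] = filled.groupby('matches_group_id')[col].bfill()
    # ... and forward fill copies the electronic row's value down to the print row
    filled[col] = filled.groupby('matches_group_id')[col].ffill()
```

Group 92 has no `repo-coverage` on either row, so both rows remain NaN; an ungrouped `ffill()` would have wrongly copied group 12's years into it.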


#### Filter out no repo, no pcad, no chron

In [262]:
pe = pe11[pe11['repo-coverage'].notna()]
pe

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,,"[1963, 1964, 1965, 1966]",,[1963],[1963],,[1963]
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",[1963],[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...",,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
6,9964617400001701,Mathematics of the USSR. Izvestija,[0025-5726],p,267,,,,,,...,[TMATH PER],[40],"[1970, 1981-1982, 1980-1981, 1978, 1984, 1983-...",['TMATH PER'],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[(1967, 1992)]",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,[TMAGR PER],[20],"[1984-85, 1991-92, 1987-88, 1968-69, 1991, 198...",['TMAGR PER'],"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]","[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199..."
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,,,,,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]"
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,[TMAGR PER],[14],"[1985-86, 1996-97, 1979, 1984, 1988, 1994-95, ...",['TMAGR PER'],"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","[(1979, 1999)]","[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]"
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199..."


In [263]:
# Sort while building the frame instead of in place, which avoids SettingWithCopyWarning
pe_no_repo = pe11[pe11['repo-coverage'].isna()].sort_values('matches_group_id')
pe_no_repo


Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
4,9967115530001701,Giornale degli economisti e annali di economia,[0017-0097],e,92,"[['61745117840001701', 'JSTOR Arts and Science...","[['53537228640001701', 'Giornale degli economi...",[ Available from 1939 volume: 1 until 2012;],[Yes],[JSTOR],...,,,,,"[1939, 1940, 1941, 1942, 1946, 1947, 1949, 195...",,"[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
5,9963082590001701,Giornale degli economisti e annali di economia,[0017-0097],p,92,,,,,,...,[TWILS PER],[58],"[1970, 1939, 1969, 1950, 1960, 1978, 1940, 199...",['TWILS PER'],"[1939, 1940, 1941, 1942, 1946, 1947, 1949, 195...","[(1939, 1942), (1946, 1947), (1949, 1951), (19...","[1939, 1940, 1941, 1942, 1943, 1944, 1945, 194...",,,
8,9942472340001701,Laboratory techniques in biochemistry and mole...,[0075-7535],p,277,,,,,,...,"[TBIOM GENS, TMAGR GEN]",[70],"[1970, 1978, 1984, 1995, 1987, 1983, 2009, 197...","['TBIOM GENS', 'TMAGR GEN']","[1969, 1970, 1972, 1975, 1976, 1978, 1980, 198...","[(1969, 1970), (1972, 1972), (1975, 1976), (19...","[2007, 2008, 2009]",,,
9,9968336850001701,Laboratory techniques in biochemistry and mole...,[0075-7535],e,277,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624652880001701', 'Laboratory techniques ...",[ Available from 2007 volume: 32 until 2009 vo...,[Yes],[Elsevier],...,,,,,"[1969, 1970, 1972, 1975, 1976, 1978, 1980, 198...",,"[2007, 2008, 2009]",,,
10,9932958210001701,Revue de métaphysique et de morale,[0035-1571],p,513,,,,,,...,[TWILS PER],[78],"[2004, 2010, 2012, 1948, 2018, 1960, 1978, 195...",['TWILS PER'],"[1945, 1946, 1947, 1948, 1949, 1950, 1951, 195...","[(1945, 2020)]","[1893, 1894, 1895, 1896, 1897, 1898, 1899, 190...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8800,9947993970001701,New Zealand entomologist,"[0077-9962, 1179-3430]",p,101345,,,,,,...,[TMAGR PER],[14],"[1984-1987, 2000-2003, 2004-2007, 1975-1978, 2...",['TMAGR PER'],"[1962, 1963, 1964, 1965, 1966, 1967, 1968, 196...","[(1962, 1973), (1975, 1979), (1984, 2015)]","[1952, 1953, 1954, 1955, 1956, 1957, 1958, 195...",,,
8801,9966748600001701,Bibliothe que d'humanisme et Renaissance,"[0006-1999, 2418-7135]",e,101380,"[['61535214270001701', 'JSTOR Arts and Science...","[['53536372790001701', ""Bibliothèque d'humani...",[ Available from 1941 volume: 1;],[Yes],[JSTOR],...,,,,,"[1941, 1942, 1943, 1944, 1945, 1946, 1947, 194...",,"[1941, 1942, 1943, 1944, 1945, 1946, 1947, 194...",,,
8802,9946092870001701,Bibliothèque d'humanisme et renaissance ;,[0006-1999],p,101380,,,,,,...,[TWILS PER],[90],"[2004, 2012, 2015, 1964, 1987, 2005, 1945, 196...",['TWILS PER'],"[1941, 1942, 1943, 1944, 1945, 1946, 1947, 194...","[(1941, 1948), (1950, 2020)]","[1941, 1942, 1943, 1944, 1945, 1946, 1947, 194...",,,
8819,9946768760001701,The Americas,"[1533-6247, 0003-1615]",p,101581,,,,,,...,"[ZMLAC OWL, TWILS PER]",[74],"[, 1973/1974, 2004, 1958/1959, 1968/1969, 1972...","['ZMLAC OWL', 'TWILS PER']","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...","[(1944, 2014)]","[1944, 1945, 1946, 1947, 1948, 1949, 1950, 195...",,,


In [264]:
pe_no_repo.to_pickle('pe_no_repo_' + today + '.pkl')
pe_no_repo.to_csv('pe_no_repo_' + today + '.txt',sep='\t',index=False)

In [265]:
# Safety check: keep only groups that still have both rows after the repo-coverage filter
pe = pe.groupby('matches_group_id').filter(lambda x: len(x) == 2)
pe

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,,"[1963, 1964, 1965, 1966]",,[1963],[1963],,[1963]
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",[1963],[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...",,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
6,9964617400001701,Mathematics of the USSR. Izvestija,[0025-5726],p,267,,,,,,...,[TMATH PER],[40],"[1970, 1981-1982, 1980-1981, 1978, 1984, 1983-...",['TMATH PER'],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[(1967, 1992)]",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,[TMAGR PER],[20],"[1984-85, 1991-92, 1987-88, 1968-69, 1991, 198...",['TMAGR PER'],"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]","[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199..."
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,,,,,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]"
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,[TMAGR PER],[14],"[1985-86, 1996-97, 1979, 1984, 1988, 1994-95, ...",['TMAGR PER'],"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","[(1979, 1999)]","[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]"
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199..."


In [266]:
# Pull out groups in which any row is missing a pcad-range
peno_no_PCAD = pe.groupby('matches_group_id').filter(lambda x: (x['pcad-range'].isnull().any()))
peno_no_PCAD

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
6,9964617400001701,Mathematics of the USSR. Izvestija,[0025-5726],p,267,,,,,,...,[TMATH PER],[40],"[1970, 1981-1982, 1980-1981, 1978, 1984, 1983-...",['TMATH PER'],"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[(1967, 1992)]",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
7,9968936470001701,Mathematics of the USSR. Izvestija (Online),"[0025-5726, 2169-5075]",e,267,"[['61695747580001701', 'Institute of Physics T...","[['53695747410001701', 'Mathematics of the USS...",[Unknown],[Yes],[other],...,,,,,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
87,9967380620001701,The Journal of the Acoustical Society of America,"[0001-4966, 1520-8524]",e,2557,"[['61549768600001701', 'AIP Digital Archive', ...","[['53804697800001701', 'The Journal of the Aco...",[Unknown],[Yes],[other],...,,,,,"[1929, 1930, 1931, 1932, 1933, 1934, 1935, 193...",,,"[1929, 1930, 1931, 1932, 1933, 1934, 1935, 193...",,"[1929, 1930, 1931, 1932, 1933, 1934, 1935, 193..."
88,9963190950001701,The Journal of the Acoustical Society of America,[0001-4966],p,2557,,,,,,...,"[TSCI PER, TBIOM PERS]",[597],"[, 2004, 2012, 2015, 1964, 1987, 1974/78, 1942...","['TSCI PER', 'TBIOM PERS']","[1929, 1930, 1931, 1932, 1933, 1934, 1935, 193...","[(1929, 2017)]",,"[1929, 1930, 1931, 1932, 1933, 1934, 1935, 193...",,"[1929, 1930, 1931, 1932, 1933, 1934, 1935, 193..."
503,9966238570001701,Acta applicandae mathematicae,"[0167-8019, 1572-9036]",e,9314,"[['61535213400001701', 'SpringerLink Historica...","[['53535251500001701', 'Acta applicandae mathe...",[Unknown],[Yes],[Springer],...,,,,,"[1991, 1992, 1993, 1994, 1995, 1996, 1997, 199...",,,"[1983, 1984, 1985, 1986, 1987, 1988, 1989, 199...",,"[1983, 1984, 1985, 1986, 1987, 1988, 1989, 199..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7208,9934784910001701,Meccanica.,[0025-6455],p,73451,,,,,,...,[TSCI PER],[14],"[2004, 1999/2000, 1997, 2007, 2000, 2003, 1998...",['TSCI PER'],"[1996, 1997, 1998, 1999, 2000, 2001, 2002, 200...","[(1996, 2005), (2007, 2007)]",,"[1966, 1967, 1968, 1969, 1970, 1971, 1972, 197...",,"[1966, 1967, 1968, 1969, 1970, 1971, 1972, 197..."
7290,9949995780001701,Annals of operations research,[0254-5330],p,74717,,,,,,...,[TWILS GEN],[6],[2002],['TWILS GEN'],[2002],"[(2002, 2002)]",,"[1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...",,"[1984, 1985, 1986, 1987, 1988, 1989, 1990, 199..."
7291,9967674060001701,Annals of operations research (Online),"[1572-9338, 0254-5330]",e,74717,"[['61535213690001701', 'SpringerLink Historica...","[['53538467150001701', 'Annals of operations r...",[Unknown],[Yes],[Springer],...,,,,,[2002],,,"[1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...",,"[1984, 1985, 1986, 1987, 1988, 1989, 1990, 199..."
7442,9957936980001701,Inflammation,[0360-3997],p,75974,,,,,,...,[TBIOM PERS],[31],"[2004, 1984, 1995, 1987, 1983, 2001, 1980, 198...",['TBIOM PERS'],"[1975, 1976, 1977, 1978, 1979, 1980, 1981, 198...","[(1975, 2004)]",,"[1975, 1976, 1977, 1978, 1979, 1980, 1981, 198...",,"[1975, 1976, 1977, 1978, 1979, 1980, 1981, 198..."


In [267]:
peno_no_PCAD.to_pickle('peno_no_PCAD_' + today + '.pkl')
peno_no_PCAD.to_csv('peno_no_PCAD_' + today + '.txt',sep='\t',index=False)

In [268]:
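# Groups in which every record has a pcad-range value.
# `not` rather than `~`: on a plain Python bool, `~` yields an int instead of a negated boolean.
peno = pe.groupby('matches_group_id').filter(lambda x: not x['pcad-range'].isnull().any())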
peno

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,,"[1963, 1964, 1965, 1966]",,[1963],[1963],,[1963]
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",[1963],[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...",,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...","[1980, 1981]",,"[1980, 1981]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,[TMAGR PER],[20],"[1984-85, 1991-92, 1987-88, 1968-69, 1991, 198...",['TMAGR PER'],"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]","[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199..."
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,,,,,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]"
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,[TMAGR PER],[14],"[1985-86, 1996-97, 1979, 1984, 1988, 1994-95, ...",['TMAGR PER'],"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","[(1979, 1999)]","[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]"
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199..."
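
The two `groupby(...).filter(...)` calls above split the title groups into complementary sets: a group goes to `peno_no_PCAD` if any of its records lacks a `pcad-range`, and to `peno` otherwise. A minimal sketch on made-up data shows that whole groups are kept or dropped together. The frame `toy`, its values, and the helper `has_missing` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for `pe`: group 1 is complete, group 2 has one missing pcad-range
toy = pd.DataFrame({
    'matches_group_id': [1, 1, 2, 2],
    'p_or_e': ['e', 'p', 'e', 'p'],
    'pcad-range': [[1963], [1963], [1990, 1991], None],
})

def has_missing(group):
    # True when any record in the group has no pcad-range;
    # bool() converts a NumPy boolean to a plain Python bool
    return bool(group['pcad-range'].isnull().any())

missing = toy.groupby('matches_group_id').filter(has_missing)
complete = toy.groupby('matches_group_id').filter(lambda g: not has_missing(g))

print(missing['matches_group_id'].tolist())   # [2, 2]
print(complete['matches_group_id'].tolist())  # [1, 1]
```

Both records of group 2 are routed to the "missing" frame even though only the print record lacks a range, which mirrors how the e and p records of a title stay together in the tables above.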


In [269]:
# Groups in which at least one record has no chron_as_list value
no_chron = pe.groupby('matches_group_id').filter(lambda x: x['chron_as_list'].isnull().any())
no_chron

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
4544,9966792410001701,Physics of the earth and planetary interiors,"[1872-7395, 0031-9201]",e,43006,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624670100001701', 'Physics of the earth a...",[ Available from 1967-10- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,,,,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."
4545,9950202520001701,Physics of the earth and planetary interiors.,[0031-9201],p,43006,,,,,,...,,,,,,,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...","[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197...",,"[1967, 1968, 1969, 1970, 1971, 1972, 1973, 197..."


In [270]:
no_chron.to_pickle('no_chron_peok.pkl')
no_chron.to_csv('no_chron_peok.txt',sep='\t',index=False)

In [288]:
# Keep only groups in which every record has a repo-coverage value
peok = peno.groupby('matches_group_id').filter(lambda x: not x['repo-coverage'].isnull().any())
peok

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,,"[1963, 1964, 1965, 1966]",,[1963],[1963],,[1963]
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",[1963],[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...",,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...","[1980, 1981]",,"[1980, 1981]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,[TMAGR PER],[20],"[1984-85, 1991-92, 1987-88, 1968-69, 1991, 198...",['TMAGR PER'],"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]","[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199..."
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,,,,,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]"
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,[TMAGR PER],[14],"[1985-86, 1996-97, 1979, 1984, 1988, 1994-95, ...",['TMAGR PER'],"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","[(1979, 1999)]","[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]"
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199..."


In [289]:
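# Keep only groups in which every record has a chron_as_list value.
# .copy() returns an independent frame, so later column assignments do not act on a slice.
peok = peok.groupby('matches_group_id').filter(lambda x: not x['chron_as_list'].isnull().any()).copy()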
peok

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_x,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,,"[1963, 1964, 1965, 1966]",,[1963],[1963],,[1963]
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[TZDS GEN],[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",[1963],[1963],,[1963]
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...",,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[TWILS CLS],[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194..."
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...","[1980, 1981]",,"[1980, 1981]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,[TMAGR PER],[20],"[1984-85, 1991-92, 1987-88, 1968-69, 1991, 198...",['TMAGR PER'],"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]","[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199..."
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,,,,,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]"
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,[TMAGR PER],[14],"[1985-86, 1996-97, 1979, 1984, 1988, 1994-95, ...",['TMAGR PER'],"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","[(1979, 1999)]","[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]"
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199..."


In [290]:
peok['matches_group_id'].nunique()

1622

#### Calculate overlap
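
The overlap calculation in this section treats each coverage column as a set of years: the years in `chron_as_list` minus the years in `pcad-range` gives the held years that have no match. A small sketch of that set difference follows, using the values shown for the IEEE transactions rows. The helper name `years_not_covered` is hypothetical and not part of the notebook.

```python
def years_not_covered(held_years, covered_years):
    """Return the sorted years that are held but not covered."""
    return sorted(set(held_years) - set(covered_years))

held = [1963, 1964, 1965, 1966]   # chron_as_list
covered = [1963]                  # pcad-range
print(years_not_covered(held, covered))  # [1964, 1965, 1966]
```

Years that appear only in the covered list are ignored, so a title whose coverage exceeds its print holdings yields an empty list.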

In [291]:
# Years present in chron_as_list but absent from pcad-range
peok['p2e_no_pcad'] = peok.apply(lambda row: sorted(set(row['chron_as_list']) - set(row['pcad-range'])), axis=1)
peok

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage,p2e_no_pcad
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,"[1963, 1964, 1965, 1966]",,[1963],[1963],,[1963],"[1964, 1965, 1966]"
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",[1963],[1963],,[1963],"[1964, 1965, 1966]"
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...",,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[]
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[]
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...","[1980, 1981]",,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,[20],"[1984-85, 1991-92, 1987-88, 1968-69, 1991, 198...",['TMAGR PER'],"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]","[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[]
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,,,,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[]
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,[14],"[1985-86, 1996-97, 1979, 1984, 1988, 1994-95, ...",['TMAGR PER'],"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","[(1979, 1999)]","[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[]
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",[]


In [292]:
# Sort each year list in place of the unsorted original
peok['chron_as_list'] = peok['chron_as_list'].apply(sorted)
peok['pcad-range'] = peok['pcad-range'].apply(sorted)
peok['repo-coverage'] = peok['repo-coverage'].apply(sorted)
peok

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,all_item_count,chron,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage,p2e_no_pcad
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,,,"[1963, 1964, 1965, 1966]",,[1963],[1963],,[1963],"[1964, 1965, 1966]"
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[1],[1963-1966],['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",[1963],[1963],,[1963],"[1964, 1965, 1966]"
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,,,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...",,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[]
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[46],"[, 1890, 1939, 1948, 1915, 1943, 1946, 1913, 1...",['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[]
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...","[1980, 1981]",,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,[20],"[1984-85, 1991-92, 1987-88, 1968-69, 1991, 198...",['TMAGR PER'],"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]","[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[]
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,,,,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[]
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,[14],"[1985-86, 1996-97, 1979, 1984, 1988, 1994-95, ...",['TMAGR PER'],"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","[(1979, 1999)]","[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[]
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",[]


In [293]:
# Number of years in each list
peok['ct_chron'] = peok['chron_as_list'].apply(len)
peok['ct_pcad'] = peok['pcad-range'].apply(len)
peok

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,curr-lib-loc_ALL,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage,p2e_no_pcad,ct_chron,ct_pcad
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,"[1963, 1964, 1965, 1966]",,[1963],[1963],,[1963],"[1964, 1965, 1966]",4,1
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,['TZDS GEN'],"[1963, 1964, 1965, 1966]","[(1963, 1966)]",[1963],[1963],,[1963],"[1964, 1965, 1966]",4,1
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...",,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,['TWILS CLS'],"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...","[1980, 1981]",,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",24,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,['TMAGR PER'],"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]","[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[],31,41
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,['TMAGR PER'],"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","[(1979, 1999)]","[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",[],11,40


In [294]:
# Number of held years with no match in pcad-range
peok['ct_p_no_match'] = peok['p2e_no_pcad'].apply(len)
peok

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,chron_as_list,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage,p2e_no_pcad,ct_chron,ct_pcad,ct_p_no_match
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,"[1963, 1964, 1965, 1966]",,[1963],[1963],,[1963],"[1964, 1965, 1966]",4,1,3
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,"[1963, 1964, 1965, 1966]","[(1963, 1966)]",[1963],[1963],,[1963],"[1964, 1965, 1966]",4,1,3
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...",,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,"[1890, 1892, 1895, 1912, 1913, 1914, 1915, 193...","[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...","[1980, 1981]",,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",24,10,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","[(1964, 1994)]","[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[],31,41,0
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21,0
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","[(1979, 1999)]","[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21,0
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",[],11,40,0


In [295]:
peok.to_pickle('peok_before_calc.pkl')

In [296]:
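#share of print years matched by electronic years; 0 if the title has no chronology years
#the SettingWithCopyWarning below means pandas treats peok as a slice of an earlier dataframe; calling peok = peok.copy() first would silence it
peok['p2e-percent'] = peok.apply(lambda row: round((int(row['ct_chron'])-int(row['ct_p_no_match']))/(int(row['ct_chron'])),2) if int(row['ct_chron']) != 0 else 0,axis=1)
peok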

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,chron_ranges_calc,pcad-range,SPR-yrs,Portico-years,repo-coverage,p2e_no_pcad,ct_chron,ct_pcad,ct_p_no_match,p2e-percent
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,[1963],[1963],,[1963],"[1964, 1965, 1966]",4,1,3,0.25
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,"[(1963, 1966)]",[1963],[1963],,[1963],"[1964, 1965, 1966]",4,1,3,0.25
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,"[(1890, 1890), (1892, 1892), (1895, 1895), (19...","[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,,"[1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...","[1980, 1981]",,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",24,10,15,0.38
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,"[(1964, 1994)]","[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[],31,41,0,1.00
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21,0,1.00
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,"[(1979, 1999)]","[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21,0,1.00
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",[],11,40,0,1.00
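
As the column names suggest, `p2e-percent` is the share of a title's print chronology years that are not counted in `ct_p_no_match`. The standalone sketch below mirrors the lambda in the cell above. The helper name `p2e_percent` is hypothetical; the counts are taken from the rows displayed in the output.

```python
def p2e_percent(ct_chron, ct_p_no_match):
    """Share of print years that have a match, rounded to 2 places.

    Hypothetical helper mirroring the notebook's lambda; returns 0 when a
    title has no chronology years, to avoid dividing by zero.
    """
    ct_chron = int(ct_chron)
    ct_p_no_match = int(ct_p_no_match)
    if ct_chron == 0:
        return 0
    return round((ct_chron - ct_p_no_match) / ct_chron, 2)


# (ct_chron, ct_p_no_match) pairs taken from the output above
print(p2e_percent(4, 3))    # IEEE transactions on ultrasonics engineering
print(p2e_percent(24, 15))  # Neuropeptides
print(p2e_percent(13, 0))   # Journal of the Institute of Actuaries
```

Keeping the calculation in a named function makes the zero-division guard easier to see than in the one-line lambda, and it can still be used with `peok.apply(..., axis=1)`.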


In [297]:
#print years that also appear in pcad-range
peok['p2e_has_pcad'] = peok.apply(lambda row: set(row['chron_as_list'])&set(row['pcad-range']),axis=1)
peok



Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,pcad-range,SPR-yrs,Portico-years,repo-coverage,p2e_no_pcad,ct_chron,ct_pcad,ct_p_no_match,p2e-percent,p2e_has_pcad
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,[1963],[1963],,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963}
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[1963],[1963],,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963}
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1915, 1892, 1895, 1939, 1943, 1912, 191..."
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,"[1886, 1887, 1888, 1889, 1890, 1891, 1892, 189...","[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1915, 1892, 1895, 1939, 1943, 1912, 191..."
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,"[1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...","[1980, 1981]",,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",24,10,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,"[1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...",,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[],31,41,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197..."
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198..."
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,"[1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...",,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198..."
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",[],11,40,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198..."


In [298]:
#of those years, the ones also present in repo-coverage
peok['pcad-repo'] = peok.apply(lambda row: row['p2e_has_pcad'] & set(row['repo-coverage']), axis=1)
peok



Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,SPR-yrs,Portico-years,repo-coverage,p2e_no_pcad,ct_chron,ct_pcad,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,[1963],,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963}
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[1963],,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963}
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1915, 1892, 1895, 1939, 1943, 1912, 191...","{1939, 1943, 1946, 1947, 1948, 1951}"
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1915, 1892, 1895, 1939, 1943, 1912, 191...","{1939, 1943, 1946, 1947, 1948, 1951}"
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,"[1980, 1981]",,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",24,10,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{}
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[],31,41,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199..."
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}"
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}"
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",[],11,40,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198...",{1990}


In [299]:
#share of the p2e_has_pcad years that are also in repo-coverage; 0 if there are none
peok['pcad-repo-percent'] = peok.apply(lambda row: round(len(row['pcad-repo'])/len(row['p2e_has_pcad']),2) if (len(row['p2e_has_pcad']) != 0) else 0,axis=1)
peok



Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,Portico-years,repo-coverage,p2e_no_pcad,ct_chron,ct_pcad,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963},1.00
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963},1.00
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1915, 1892, 1895, 1939, 1943, 1912, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1915, 1892, 1895, 1939, 1943, 1912, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",24,10,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...","[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[],31,41,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,"[1989, 1990, 1991, 1992]","[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",[],11,40,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198...",{1990},0.09
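
The three cells above form a chain of set intersections followed by a ratio. The sketch below runs that chain for a single title. The function name `overlap_with_repo` is hypothetical. Treating a missing value (`None` or `NaN`) as an empty set is an assumption added here, not something the notebook cells do.

```python
def overlap_with_repo(chron_years, pcad_years, repo_years):
    """Return (p2e_has_pcad, pcad_repo, pcad_repo_percent) for one title.

    Hypothetical helper. Inputs that are not list-like (for example None
    or NaN) are treated as empty; this is an assumption for the sketch.
    """
    def as_set(value):
        return set(value) if isinstance(value, (list, tuple, set)) else set()

    # Print years that also appear in pcad-range
    has_pcad = as_set(chron_years) & as_set(pcad_years)
    # Of those, the years that also appear in repo-coverage
    pcad_repo = has_pcad & as_set(repo_years)
    # Share of the first set that survives the second intersection
    percent = round(len(pcad_repo) / len(has_pcad), 2) if has_pcad else 0
    return has_pcad, pcad_repo, percent


# Values for "IEEE transactions on ultrasonics engineering"; the print year
# list is inferred from its chron range (1963, 1966) and is an assumption.
has_pcad, pcad_repo, percent = overlap_with_repo(
    [1963, 1964, 1965, 1966], [1963], [1963]
)
print(has_pcad, pcad_repo, percent)
```

In the notebook itself, `set(row['repo-coverage'])` would raise a `TypeError` on a `NaN` cell, so a guard along these lines may be worth considering if any of the list columns can be missing.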


In [300]:
#overall fraction of the print run that is a candidate for withdrawal
peok['total-to-remove'] = peok.apply(lambda row: round(row['p2e-percent']*row['pcad-repo-percent'],2), axis=1)
peok



Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,repo-coverage,p2e_no_pcad,ct_chron,ct_pcad,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963},1.00,0.25
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963},1.00,0.25
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1915, 1892, 1895, 1939, 1943, 1912, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1915, 1892, 1895, 1939, 1943, 1912, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",24,10,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[],31,41,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,"[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,"[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",[],11,40,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198...",{1990},0.09,0.09


In [301]:
peok.to_pickle(f'peok_{today}.pkl')

In [335]:
#change filename to the peok pickle saved above
peok = pd.read_pickle('peok_20201112.pkl')
peok

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,repo-coverage,p2e_no_pcad,ct_chron,ct_pcad,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963},1.00,0.25
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963},1.00,0.25
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",24,10,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8787,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[],31,41,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,"[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19
8789,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,"[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19
8821,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,"[1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...",[],11,40,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198...",{1990},0.09,0.09


In [336]:
peok[['MMS_ID','curr-lib-loc_ALL']]

Unnamed: 0,MMS_ID,curr-lib-loc_ALL
0,9968441380001701,
1,9963550760001701,['TZDS GEN']
2,9968429800001701,
3,9939481760001701,['TWILS CLS']
24,9966907990001701,
...,...,...
8787,9931018220001701,['TMAGR PER']
8788,9967674080001701,
8789,9942687740001701,['TMAGR PER']
8821,9968947900001701,


In [337]:
peok.columns

Index(['MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e', 'matches_group_id',
       'e_coll_info', 'portfolio_info', 'Coverage Information Combined',
       'PCAD?', 'Vendor_key', 'Title 1 (Print)_BTAA-SPR',
       'Title 2 (Print)_BTAA-SPR', 'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR',
       'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO',
       'curr-lib-loc_x', 'all_item_count', 'chron', 'curr-lib-loc_ALL',
       'chron_as_list', 'chron_ranges_calc', 'pcad-range', 'SPR-yrs',
       'Portico-years', 'repo-coverage', 'p2e_no_pcad', 'ct_chron', 'ct_pcad',
       'ct_p_no_match', 'p2e-percent', 'p2e_has_pcad', 'pcad-repo',
       'pcad-repo-percent', 'total-to-remove'],
      dtype='object')

In [338]:
peok[peok['all_item_count'].isnull()]

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,repo-coverage,p2e_no_pcad,ct_chron,ct_pcad,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,[1963],"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963},1.00,0.25
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,"[1935, 1936, 1937, 1938, 1939, 1940, 1941, 194...",[],13,110,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46
24,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,"[1980, 1981]","[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",24,10,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00
34,9967018760001701,Annals of agricultural sciences.,[0570-1783],e,1066,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624539550001701', 'Annals of agricultural...",[ Available from 2011-06- volume: 56 issue: 1;],[Yes],[Elsevier],...,"[2011, 2012, 2013, 2014, 2015, 2016, 2017, 201...","[1956, 1957, 1959, 1960, 1961, 1964, 1965]",7,9,7,0.00,{},{},0.00,0.00
35,9969535240001701,"Journal fu r praktische Chemie, Chemiker-Zeitu...",[0941-1216],e,1087,"[['61535209520001701', 'Wiley Online Library C...","[['53542636090001701', 'Journal für praktisch...",[ Available from 1992 volume: 334 issue: 1 unt...,[Yes],[Wiley],...,"[2001, 2002, 2003, 2004, 2005, 2006, 2007, 200...",[],5,7,0,1.00,"{1992, 1993, 1994, 1995, 1996}",{},0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8780,9968527570001701,Journal of muscle research and cell motility,"[0142-4319, 1573-2657]",e,100867,"[['61535213730001701', 'SpringerLink Historica...","[['53540364460001701', 'Journal of muscle rese...",[ Available from 1980 volume: 1 issue: 1 until...,[Yes],[Springer],...,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...","[1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004]",25,17,8,0.68,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...","{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...",1.00,0.68
8784,9967155120001701,Journal of the Franklin Institute,"[1879-2693, 0016-0032]",e,101001,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624654960001701', 'Journal of the Frankli...",[ Available from 1826-01- volume: 1 issue: 1;],[Yes],[Elsevier],...,"[1829, 1830, 1831, 1832, 1833, 1834, 1835, 183...",[],178,194,0,1.00,"{1826, 1827, 1828, 1829, 1830, 1831, 1832, 183...","{1829, 1830, 1831, 1832, 1833, 1834, 1835, 183...",0.84,0.84
8786,9977101567901701,Ophelia,[0078-5326],e,101035,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815441630001701', 'Ophelia.'], ['53815434...",[ Available from 1964 volume: 1 issue: 1 until...,[Yes],[Taylor & Francis],...,"[1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",[],31,41,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29
8788,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,"[1989, 1990, 1991, 1992]",[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19


#### Add count of potential volumes to withdraw

In [339]:
pvols = peok[peok['all_item_count'].notnull() & peok['total-to-remove'].notnull()]
pvols = pvols[['MMS_ID','all_item_count','total-to-remove']]
pvols

Unnamed: 0,MMS_ID,all_item_count,total-to-remove
1,9963550760001701,[1],0.25
3,9939481760001701,[46],0.46
25,9931084310001701,[35],0.00
33,9918228280001701,[7],0.00
36,9975846880001701,[10],0.00
...,...,...,...
8781,9913415620001701,[38],0.68
8785,9952125650001701,[312],0.84
8787,9931018220001701,[20],0.29
8789,9942687740001701,[14],0.19


In [340]:
#all_item_count is stored as a one-element list; unwrap it to a number
pvols['all_item_count'] = pvols['all_item_count'].apply(lambda x: x[0])
#multiply by the fraction to remove, then round down to whole volumes
pvols['potential volumes to withdraw'] = pvols.apply(lambda row: row['total-to-remove']*row['all_item_count'], axis=1)
pvols['potential volumes to withdraw'] = pvols['potential volumes to withdraw'].apply(math.floor)
pvols

Unnamed: 0,MMS_ID,all_item_count,total-to-remove,potential volumes to withdraw
1,9963550760001701,1,0.25,0
3,9939481760001701,46,0.46,21
25,9931084310001701,35,0.00,0
33,9918228280001701,7,0.00,0
36,9975846880001701,10,0.00,0
...,...,...,...,...
8781,9913415620001701,38,0.68,25
8785,9952125650001701,312,0.84,262
8787,9931018220001701,20,0.29,5
8789,9942687740001701,14,0.19,2
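
The arithmetic in the cell above is a product followed by rounding down. A minimal sketch with a hypothetical helper, `potential_volumes`, using item counts and fractions from the output above:

```python
import math


def potential_volumes(total_to_remove, all_item_count):
    """Floor of (fraction to remove x item count), as in the cell above."""
    return math.floor(total_to_remove * all_item_count)


# (all_item_count, total-to-remove) pairs taken from the output above
rows = [(1, 0.25), (46, 0.46), (312, 0.84), (20, 0.29), (14, 0.19)]
for count, fraction in rows:
    print(count, fraction, potential_volumes(fraction, count))
```

Rounding down keeps the estimate conservative. One caveat: a product that is a whole number on paper can land just below it in floating point (for example, `0.29 * 100` evaluates to `28.999999999999996`), so the floor can occasionally be one volume lower than expected.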


In [341]:
pvols = pvols.drop_duplicates()
pvols

Unnamed: 0,MMS_ID,all_item_count,total-to-remove,potential volumes to withdraw
1,9963550760001701,1,0.25,0
3,9939481760001701,46,0.46,21
25,9931084310001701,35,0.00,0
33,9918228280001701,7,0.00,0
36,9975846880001701,10,0.00,0
...,...,...,...,...
8781,9913415620001701,38,0.68,25
8785,9952125650001701,312,0.84,262
8787,9931018220001701,20,0.29,5
8789,9942687740001701,14,0.19,2
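
`drop_duplicates()` removes only rows that are identical in every column. If the same `MMS_ID` were still present with different values, the left merge on `MMS_ID` in the next cell would repeat the matching `peok` rows. The sketch below is one way to check for that; the helper name and the sample IDs are hypothetical.

```python
from collections import Counter


def repeated_keys(keys):
    """Return the keys that occur more than once, with their counts."""
    return {key: n for key, n in Counter(keys).items() if n > 1}


# Hypothetical MMS_ID values for illustration only
mms_ids = ['991', '992', '992', '993']
print(repeated_keys(mms_ids))
```

In the notebook, passing `pvols['MMS_ID']` to a check like this should return an empty dictionary before the merge is run.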


In [342]:
df = pd.merge(peok, pvols[['MMS_ID','potential volumes to withdraw']], how='left', on='MMS_ID')
df

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,p2e_no_pcad,ct_chron,ct_pcad,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963},1.00,0.25,
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,"[1964, 1965, 1966]",4,1,3,0.25,{1963},{1963},1.00,0.25,0.0
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,[],13,110,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,[],13,110,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,21.0
4,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,"[1980, 1981, 1982, 1983, 1984, 1985, 1986, 198...",24,10,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3239,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,[],31,41,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29,5.0
3240,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,
3241,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,[],21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,2.0
3242,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,[],11,40,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198...",{1990},0.09,0.09,


In [346]:
#check as many indices as desired
df.loc[12]

MMS_ID                                                              9969162730001701
Title_bib                          Progress in nuclear magnetic resonance spectro...
ISSN_cluster                                                  [0079-6565, 1873-3301]
p_or_e                                                                             e
matches_group_id                                                                2435
e_coll_info                        [['61624504590001701', 'Elsevier ScienceDirect...
portfolio_info                     [['53624687270001701', 'Progress in nuclear ma...
Coverage Information Combined                      [ Available from 1966 volume: 1;]
PCAD?                                                                          [Yes]
Vendor_key                                                                [Elsevier]
Title 1 (Print)_BTAA-SPR                                                         NaN
Title 2 (Print)_BTAA-SPR                                         

#### Normalize group IDs

In [347]:
def new_group_ids(df, identifier_column, group_name):
    # number the groups 0..n-1, in sorted order of identifier_column
    new_col = group_name + '_group_id'
    groups = df.groupby(identifier_column).ngroup().to_frame(name=new_col)
    print('grouped')
    print(groups.columns)

    # attach the new IDs to the original rows by index
    eg = pd.merge(df, groups, left_index=True, right_index=True, how="inner")

    return eg

In [348]:
df1 = new_group_ids(df, 'matches_group_id','final')
df1

grouped
Index(['final_group_id'], dtype='object')


Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,matches_group_id,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,...,ct_chron,ct_pcad,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw,final_group_id
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,5,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],...,4,1,3,0.25,{1963},{1963},1.00,0.25,,0
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,5,,,,,,...,4,1,3,0.25,{1963},{1963},1.00,0.25,0.0,0
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,12,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],...,13,110,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,,1
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,12,,,,,,...,13,110,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,21.0,1
4,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,744,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],...,24,10,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00,,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3239,9931018220001701,Ophelia,[0078-5326],p,101035,,,,,,...,31,41,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29,5.0,1619
3240,9967674080001701,South African journal of zoology,[0254-1858],e,101036,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],...,21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,,1620
3241,9942687740001701,South African journal of zoology,[0254-1858],p,101036,,,,,,...,21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,2.0,1620
3242,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,101582,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],...,11,40,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198...",{1990},0.09,0.09,,1621


In [349]:
#check counts
print(df1['matches_group_id'].nunique())
print(df1['final_group_id'].max())
print(df1['final_group_id'].min())
print(df1['final_group_id'].nunique())

1622
1621
0
1622
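
The renumbering step can be illustrated on a toy frame. This is only a sketch: the frame `toy` is made up for illustration and reuses four of the `matches_group_id` values shown above. It shows how `groupby(...).ngroup()` replaces sparse IDs with consecutive ones starting at 0, which is why the maximum `final_group_id` is one less than the number of unique groups.

```python
import pandas as pd

# hypothetical sample: sparse group IDs like those in matches_group_id
toy = pd.DataFrame({'matches_group_id': [5, 5, 12, 744, 744, 101582]})

# ngroup() numbers each group in sorted key order, starting at 0
toy['final_group_id'] = toy.groupby('matches_group_id').ngroup()

print(toy)
# the new IDs are dense: max + 1 equals the number of distinct groups
print(toy['final_group_id'].max() + 1 == toy['matches_group_id'].nunique())
```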


In [350]:
df1.columns

Index(['MMS_ID', 'Title_bib', 'ISSN_cluster', 'p_or_e', 'matches_group_id',
       'e_coll_info', 'portfolio_info', 'Coverage Information Combined',
       'PCAD?', 'Vendor_key', 'Title 1 (Print)_BTAA-SPR',
       'Title 2 (Print)_BTAA-SPR', 'Match?_BTAA-SPR', 'SPR Holdings_BTAA-SPR',
       'Title (Complete)_PORTICO', 'Portico Match_PORTICO',
       'Portico Title_PORTICO', 'PCA_PORTICO', 'Status_PORTICO',
       'Earliest Year Preserved_PORTICO', 'Latest Year Preserved_PORTICO',
       'curr-lib-loc_x', 'all_item_count', 'chron', 'curr-lib-loc_ALL',
       'chron_as_list', 'chron_ranges_calc', 'pcad-range', 'SPR-yrs',
       'Portico-years', 'repo-coverage', 'p2e_no_pcad', 'ct_chron', 'ct_pcad',
       'ct_p_no_match', 'p2e-percent', 'p2e_has_pcad', 'pcad-repo',
       'pcad-repo-percent', 'total-to-remove', 'potential volumes to withdraw',
       'final_group_id'],
      dtype='object')

In [351]:
df1.drop(columns=['matches_group_id'],inplace=True)
df1

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,Title 1 (Print)_BTAA-SPR,...,ct_chron,ct_pcad,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw,final_group_id
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],,...,4,1,3,0.25,{1963},{1963},1.00,0.25,,0
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,,,,,,IEEE transactions on ultrasonics engineering.,...,4,1,3,0.25,{1963},{1963},1.00,0.25,0.0,0
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],,...,13,110,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,,1
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,,,,,,Journal of the Institute of Actuaries.,...,13,110,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,21.0,1
4,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],,...,24,10,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00,,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3239,9931018220001701,Ophelia,[0078-5326],p,,,,,,,...,31,41,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29,5.0,1619
3240,9967674080001701,South African journal of zoology,[0254-1858],e,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],,...,21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,,1620
3241,9942687740001701,South African journal of zoology,[0254-1858],p,,,,,,,...,21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,2.0,1620
3242,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],,...,11,40,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198...",{1990},0.09,0.09,,1621


#### Sort out single vs. multiple locations

In [352]:
# .copy() so the assignment below changes an independent frame, not a slice of df1
p2 = df1[df1['curr-lib-loc_ALL'].notnull()].copy()
# the location lists are stored as strings; parse them back into Python lists
p2['curr-lib-loc_ALL'] = p2['curr-lib-loc_ALL'].apply(ast.literal_eval)
p2


Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,Title 1 (Print)_BTAA-SPR,...,ct_chron,ct_pcad,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw,final_group_id
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,,,,,,IEEE transactions on ultrasonics engineering.,...,4,1,3,0.25,{1963},{1963},1.00,0.25,0.0,0
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,,,,,,Journal of the Institute of Actuaries.,...,13,110,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,21.0,1
5,9931084310001701,Neuropeptides,[0143-4179],p,,,,,,Neuropeptides.,...,24,10,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00,0.0,2
6,9918228280001701,Annals of agricultural science.,"[1110-0249, 0570-1783]",p,,,,,,,...,7,9,7,0.00,{},{},0.00,0.00,0.0,3
9,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",[0941-1216],p,,,,,,,...,5,7,0,1.00,"{1992, 1993, 1994, 1995, 1996}",{},0.00,0.00,0.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3235,9913415620001701,Journal of muscle research and cell motility.,[0142-4319],p,,,,,,Journal of muscle research and cell motility.,...,25,17,8,0.68,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...","{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...",1.00,0.68,25.0,1617
3237,9952125650001701,Journal of the Franklin Institute,[0016-0032],p,,,,,,Journal of the Franklin Institute.,...,178,194,0,1.00,"{1826, 1827, 1828, 1829, 1830, 1831, 1832, 183...","{1829, 1830, 1831, 1832, 1833, 1834, 1835, 183...",0.84,0.84,262.0,1618
3239,9931018220001701,Ophelia,[0078-5326],p,,,,,,,...,31,41,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29,5.0,1619
3241,9942687740001701,South African journal of zoology,[0254-1858],p,,,,,,,...,21,21,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,2.0,1620


In [353]:
p2['curr-lib-loc_ALL']

1         [TZDS GEN]
3        [TWILS CLS]
5       [TBIOM PERS]
6        [ZMLAC OWL]
9        [ZMLAC OWL]
            ...     
3235    [TBIOM PERS]
3237      [TSCI PER]
3239     [TMAGR PER]
3241     [TMAGR PER]
3243      [TSCI PER]
Name: curr-lib-loc_ALL, Length: 1622, dtype: object

In [354]:
# drop locations containing 'WDN', then count what remains for each title
# (DataFrame.set_value was removed in pandas 1.0, so build the columns directly)
locs = p2['curr-lib-loc_ALL'].apply(lambda loc_list: [x for x in loc_list if 'WDN' not in x])
p2 = p2.assign(loc_count=locs.apply(len), locs=locs)
p2


Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,Title 1 (Print)_BTAA-SPR,...,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw,final_group_id,loc_count,locs
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,,,,,,IEEE transactions on ultrasonics engineering.,...,3,0.25,{1963},{1963},1.00,0.25,0.0,0,1,[TZDS GEN]
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,,,,,,Journal of the Institute of Actuaries.,...,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,21.0,1,1,[TWILS CLS]
5,9931084310001701,Neuropeptides,[0143-4179],p,,,,,,Neuropeptides.,...,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00,0.0,2,1,[TBIOM PERS]
6,9918228280001701,Annals of agricultural science.,"[1110-0249, 0570-1783]",p,,,,,,,...,7,0.00,{},{},0.00,0.00,0.0,3,1,[ZMLAC OWL]
9,9975846880001701,"Journal für praktische Chemie, Chemiker-Zeitung",[0941-1216],p,,,,,,,...,0,1.00,"{1992, 1993, 1994, 1995, 1996}",{},0.00,0.00,0.0,4,1,[ZMLAC OWL]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3235,9913415620001701,Journal of muscle research and cell motility.,[0142-4319],p,,,,,,Journal of muscle research and cell motility.,...,8,0.68,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...","{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...",1.00,0.68,25.0,1617,1,[TBIOM PERS]
3237,9952125650001701,Journal of the Franklin Institute,[0016-0032],p,,,,,,Journal of the Franklin Institute.,...,0,1.00,"{1826, 1827, 1828, 1829, 1830, 1831, 1832, 183...","{1829, 1830, 1831, 1832, 1833, 1834, 1835, 183...",0.84,0.84,262.0,1618,1,[TSCI PER]
3239,9931018220001701,Ophelia,[0078-5326],p,,,,,,,...,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29,5.0,1619,1,[TMAGR PER]
3241,9942687740001701,South African journal of zoology,[0254-1858],p,,,,,,,...,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,2.0,1620,1,[TMAGR PER]


In [355]:
p2[['curr-lib-loc_ALL','locs','loc_count']]

Unnamed: 0,curr-lib-loc_ALL,locs,loc_count
1,[TZDS GEN],[TZDS GEN],1
3,[TWILS CLS],[TWILS CLS],1
5,[TBIOM PERS],[TBIOM PERS],1
6,[ZMLAC OWL],[ZMLAC OWL],1
9,[ZMLAC OWL],[ZMLAC OWL],1
...,...,...,...
3235,[TBIOM PERS],[TBIOM PERS],1
3237,[TSCI PER],[TSCI PER],1
3239,[TMAGR PER],[TMAGR PER],1
3241,[TMAGR PER],[TMAGR PER],1
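
As a self-contained illustration of the location steps above, the following sketch parses stringified lists, drops `WDN` entries, and counts what is left. The frame `toy` and the location code `'TSCI WDN'` are hypothetical; only the column names and the `'WDN'` filter come from the cells above.

```python
import ast
import pandas as pd

# hypothetical sample: location lists stored as strings, one row with no print holdings
toy = pd.DataFrame({
    'curr-lib-loc_ALL': ["['TSCI PER', 'TSCI WDN']", "['ZMLAC OWL', 'TWILS PER']", None],
})

# keep only rows that have locations, and work on an independent copy
held = toy[toy['curr-lib-loc_ALL'].notnull()].copy()
held['curr-lib-loc_ALL'] = held['curr-lib-loc_ALL'].apply(ast.literal_eval)

# drop any location containing 'WDN', then count the remaining ones
held['locs'] = held['curr-lib-loc_ALL'].apply(
    lambda loc_list: [loc for loc in loc_list if 'WDN' not in loc]
)
held['loc_count'] = held['locs'].apply(len)

print(held[['locs', 'loc_count']])
```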


In [356]:
p3 = pd.merge(df1, p2[['locs','loc_count']], how='left',left_index=True, right_index=True)
p3

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,Title 1 (Print)_BTAA-SPR,...,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw,final_group_id,locs,loc_count
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],,...,3,0.25,{1963},{1963},1.00,0.25,,0,,
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,,,,,,IEEE transactions on ultrasonics engineering.,...,3,0.25,{1963},{1963},1.00,0.25,0.0,0,[TZDS GEN],1
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],,...,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,,1,,
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,,,,,,Journal of the Institute of Actuaries.,...,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,21.0,1,[TWILS CLS],1
4,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],,...,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00,,2,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3239,9931018220001701,Ophelia,[0078-5326],p,,,,,,,...,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29,5.0,1619,[TMAGR PER],1
3240,9967674080001701,South African journal of zoology,[0254-1858],e,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],,...,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,,1620,,
3241,9942687740001701,South African journal of zoology,[0254-1858],p,,,,,,,...,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,2.0,1620,[TMAGR PER],1
3242,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],,...,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198...",{1990},0.09,0.09,,1621,,


In [357]:
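# assign the result back; an in-place fillna on a selected column may not update p3 in newer pandas
p3['loc_count'] = p3['loc_count'].fillna(0)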
p3

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,Title 1 (Print)_BTAA-SPR,...,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw,final_group_id,locs,loc_count
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],,...,3,0.25,{1963},{1963},1.00,0.25,,0,,0
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,,,,,,IEEE transactions on ultrasonics engineering.,...,3,0.25,{1963},{1963},1.00,0.25,0.0,0,[TZDS GEN],1
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],,...,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,,1,,0
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,,,,,,Journal of the Institute of Actuaries.,...,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,21.0,1,[TWILS CLS],1
4,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],,...,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00,,2,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3239,9931018220001701,Ophelia,[0078-5326],p,,,,,,,...,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29,5.0,1619,[TMAGR PER],1
3240,9967674080001701,South African journal of zoology,[0254-1858],e,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],,...,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,,1620,,0
3241,9942687740001701,South African journal of zoology,[0254-1858],p,,,,,,,...,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,2.0,1620,[TMAGR PER],1
3242,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],,...,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198...",{1990},0.09,0.09,,1621,,0


In [358]:
multi_locs = p3[p3['loc_count'] > 1]
multi_locs_ids = list(multi_locs['final_group_id'])
multi_loc_data = p3[p3['final_group_id'].isin(multi_locs_ids)]
multi_loc_data

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,Title 1 (Print)_BTAA-SPR,...,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw,final_group_id,locs,loc_count
10,9968016560001701,Annals of the Entomological Society of America,"[0013-8746, 1938-2901]",e,"[['61640618310001701', 'Oxford University Pres...","[['53640944640001701', 'Annals of the Entomolo...",[ Available from 1908 volume: 1 issue: 1 until...,[Yes],[other],,...,17,0.83,"{1908, 1909, 1910, 1911, 1912, 1913, 1914, 191...","{1908, 1909, 1910, 1911, 1912, 1913, 1914, 191...",0.96,0.80,,5,,0
11,9925898310001701,Annals of the Entomological Society of America,"[0013-8746, 1938-2901]",p,,,,,,Annals of the Entomological Society of America.,...,17,0.83,"{1908, 1909, 1910, 1911, 1912, 1913, 1914, 191...","{1908, 1909, 1910, 1911, 1912, 1913, 1914, 191...",0.96,0.80,120.0,5,"[TNRL GEN, ZMLAC OWL]",2
30,9966977060001701,Sociology of education,"[0038-0407, 1939-8573]",e,"[['61786968900001701', 'SAGE Premier 2020', ''...","[['53787025550001701', 'Sociology of education...","[ Available from 2004 until 2009;, Available ...",[Yes],[SAGE],,...,41,0.13,"{2004, 2005, 2006, 2007, 2008, 2009}","{2004, 2005, 2006, 2007, 2008, 2009}",1.00,0.13,,15,,0
31,9926013400001701,Sociology of education,[0038-0407],p,,,,,,Sociology of education.,...,41,0.13,"{2004, 2005, 2006, 2007, 2008, 2009}","{2004, 2005, 2006, 2007, 2008, 2009}",1.00,0.13,5.0,15,"[ZMLAC OWL, TWILS PER]",2
40,9967040110001701,Journal of mathematical behavior (Online),"[0732-3123, 1873-8028]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624646450001701', 'The journal of mathema...",[ Available from 1994-03- volume: 13 issue: 1;],[Yes],[Elsevier],,...,3,0.73,"{1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001}","{1994, 1995, 1996, 1997, 1998, 1999, 2001}",0.88,0.64,,20,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3181,9942475030001701,Journal of lightwave technology,[0733-8724],p,,,,,,Journal of lightwave technology : a joint IEEE...,...,4,0.81,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...","{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...",1.00,0.81,63.0,1590,"[ZMLAC UMDN, TSCI PER]",2
3184,9949962250001701,IEEE robotics & automation magazine,"[1558-223X, 1070-9932]",p,,,,,,IEEE robotics & automation magazine.,...,4,0.60,"{1994, 1995, 1996, 1997, 1998, 1999}","{1994, 1995, 1996, 1997, 1998, 1999}",1.00,0.60,13.0,1592,"[ZMLAC UMDN, TSCI PER]",2
3185,9966891010001701,IEEE robotics & automation magazine (Online),[1070-9932],e,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620436290001701', 'IEEE robotics & automa...",[ Available from 1994 until 1999;],[Yes],[other],,...,4,0.60,"{1994, 1995, 1996, 1997, 1998, 1999}","{1994, 1995, 1996, 1997, 1998, 1999}",1.00,0.60,,1592,,0
3202,9963163570001701,I.R.E. transactions on automatic control,[0096-199X],p,,,,,,IRE transactions on automatic control.,...,0,1.00,"{1956, 1957, 1958, 1959, 1960, 1961, 1962}","{1956, 1957, 1958}",0.43,0.43,3.0,1601,"[ZMLAC UMDN, ZMLAC OWL]",2


In [359]:
single_loc_data = p3[~p3['final_group_id'].isin(multi_locs_ids)]
single_loc_data

Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,Title 1 (Print)_BTAA-SPR,...,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw,final_group_id,locs,loc_count
0,9968441380001701,IEEE transactions on ultrasonics engineering,"[0893-6706, 2162-1373]",e,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620359120001701', 'IEEE transactions on u...",[ Available from 1963 volume: 10 issue: 1 unti...,[Yes],[other],,...,3,0.25,{1963},{1963},1.00,0.25,,0,,0
1,9963550760001701,IEEE transactions on ultrasonics engineering,[0893-6706],p,,,,,,IEEE transactions on ultrasonics engineering.,...,3,0.25,{1963},{1963},1.00,0.25,0.0,0,[TZDS GEN],1
2,9968429800001701,Journal of the Institute of Actuaries,"[0020-2681, 2058-1009]",e,"[['61535216140001701', 'JSTOR Business III Col...","[['53540140170001701', 'Journal of the Institu...",[ Available from 1886 volume: 25 issue: 5 unti...,[Yes],[JSTOR],,...,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,,1,,0
3,9939481760001701,Journal of the Institute of Actuaries,[0020-2681],p,,,,,,Journal of the Institute of Actuaries.,...,0,1.00,"{1890, 1951, 1892, 1947, 1895, 1939, 1943, 191...","{1939, 1943, 1946, 1947, 1948, 1951}",0.46,0.46,21.0,1,[TWILS CLS],1
4,9966907990001701,Neuropeptides,"[1532-2785, 0143-4179]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624660480001701', 'Neuropeptides.']]",[ Available from 1995-01- volume: 28 issue: 1 ...,[Yes],[Elsevier],,...,15,0.38,"{1995, 1996, 1997, 1998, 1999, 2000, 2001, 200...",{},0.00,0.00,,2,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3239,9931018220001701,Ophelia,[0078-5326],p,,,,,,,...,0,1.00,"{1964, 1965, 1966, 1967, 1968, 1969, 1970, 197...","{1986, 1987, 1988, 1989, 1990, 1991, 1992, 199...",0.29,0.29,5.0,1619,[TMAGR PER],1
3240,9967674080001701,South African journal of zoology,[0254-1858],e,"[['61808506700001701', 'Taylor & Francis Biolo...","[['53815442650001701', 'South African journal ...",[ Available from 1979 volume: 14 issue: 1 unti...,[Yes],[Taylor & Francis],,...,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,,1620,,0
3241,9942687740001701,South African journal of zoology,[0254-1858],p,,,,,,,...,0,1.00,"{1979, 1980, 1981, 1982, 1983, 1984, 1985, 198...","{1992, 1989, 1990, 1991}",0.19,0.19,2.0,1620,[TMAGR PER],1
3242,9968947900001701,International journal of adhesion and adhesives,"[0143-7496, 1879-0127]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624617390001701', 'International journal ...",[ Available from 1980-07- volume: 1 issue: 1;],[Yes],[Elsevier],,...,0,1.00,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 198...",{1990},0.09,0.09,,1621,,0


In [360]:
multi_loc_data.to_pickle('multi_loc_data_' + today + '.pkl')
single_loc_data.to_pickle('single_loc_data_' + today + '.pkl')

In [361]:
print(multi_loc_data['final_group_id'].nunique())
print(single_loc_data['final_group_id'].nunique())

311
1311
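
The two counts above add up to 1622, the number of groups, which is what one would expect if the split is a clean partition. A minimal sketch of the same group-level split on a made-up frame (`toy` and its values are hypothetical) shows why: a group goes to the multi-location side as soon as any one of its rows has `loc_count > 1`, and every other group goes to the single-location side.

```python
import pandas as pd

# hypothetical sample: two groups, each with one e row and one p row
toy = pd.DataFrame({
    'final_group_id': [0, 0, 1, 1],
    'p_or_e': ['e', 'p', 'e', 'p'],
    'loc_count': [0, 1, 0, 2],
})

# group IDs in which at least one row has more than one location
multi_ids = set(toy.loc[toy['loc_count'] > 1, 'final_group_id'])

# select whole groups, so the e rows travel with their p rows
is_multi = toy['final_group_id'].isin(multi_ids)
multi = toy[is_multi]
single = toy[~is_multi]

# the two parts cover every row and share no group
print(len(multi) + len(single) == len(toy))
print(set(multi['final_group_id']).isdisjoint(single['final_group_id']))
```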


#### Split multi-loc and single-loc into coverage groups

In [362]:
def split_by_coverage(df, df_name):
    # Split df into three sets of groups by each group's mean 'total-to-remove':
    # exactly 1, exactly 0, or strictly between the two. Each set is saved as a
    # pickle and as a tab-delimited text file named after df_name.
    base_100 = df_name + '_100'
    base_0_100 = df_name + '_gt0_lt100'
    base_0 = df_name + '_0'
    
    base_100_df = df.groupby('final_group_id').filter(lambda x: x['total-to-remove'].mean() == 1)
    base_100_df.to_pickle(base_100 + '.pkl')
    base_100_df.to_csv(base_100 + '.txt',sep='\t')
    
    base_0_df = df.groupby('final_group_id').filter(lambda x: x['total-to-remove'].mean() == 0)
    base_0_df.to_pickle(base_0 + '.pkl')
    base_0_df.to_csv(base_0 + '.txt',sep='\t')
    
    base_0_100_df = df.groupby('final_group_id').filter(lambda x: 0 < x['total-to-remove'].mean() < 1)
    base_0_100_df.to_pickle(base_0_100 + '.pkl')
    base_0_100_df.to_csv(base_0_100 + '.txt',sep='\t')
    
    return base_100_df, base_0_100_df, base_0_df

In [363]:
multi_loc_100, multi_loc_0_100, multi_loc_0 = split_by_coverage(multi_loc_data,'multi_loc_data')
single_loc_100, single_loc_0_100, single_loc_0 = split_by_coverage(single_loc_data,'single_loc_data')

In [364]:
print(multi_loc_100.shape)
print(multi_loc_0_100.shape)
print(multi_loc_0.shape)
print(single_loc_100.shape)
print(single_loc_0_100.shape)
print(single_loc_0.shape)

(126, 43)
(474, 43)
(22, 43)
(566, 43)
(1722, 43)
(334, 43)
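
The split relies on exact comparisons of a floating-point mean with 0 and 1. The sketch below is one possible variant, not the method used above: it computes the group mean once with `transform` and compares with `np.isclose`, on the assumption that the stored `total-to-remove` values could carry small rounding error. The frame `toy` is made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical sample: three groups with full, partial, and zero values
toy = pd.DataFrame({
    'final_group_id': [0, 0, 1, 1, 2, 2],
    'total-to-remove': [1.0, 1.0, 0.46, 0.46, 0.0, 0.0],
})

# mean of total-to-remove per group, broadcast back to every row
group_mean = toy.groupby('final_group_id')['total-to-remove'].transform('mean')

# tolerance-based masks instead of == 1 and == 0
is_full = np.isclose(group_mean, 1.0)
is_zero = np.isclose(group_mean, 0.0)

full = toy[is_full]
zero = toy[is_zero]
partial = toy[~is_full & ~is_zero]

# every row lands in exactly one of the three parts
print(len(full) + len(partial) + len(zero) == len(toy))
```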


#### Split coverage groups by single or multi vendor

In [365]:
def split_by_vendor_ct (df, df_name):
    
    # rows that carry vendor information; .copy() so a column can be added safely below
    df_has_vendors = df[df['Vendor_key'].notnull()].copy()
    has_vendor_groups = list(df_has_vendors['final_group_id'])
    has_vendor_data = df[df['final_group_id'].isin(has_vendor_groups)]
    
    # groups in which no row has vendor information
    no_vendor_data = df[~df['final_group_id'].isin(has_vendor_groups)]
    print('no_vendor')
    print(no_vendor_data.shape)
    no_vendor_data.to_pickle(df_name + 'no_vendor_data.pkl')
    no_vendor_data.to_csv(df_name +'no_vendor_data.txt',sep='\t')
    
    # a group counts as multi-vendor if any of its rows lists more than one vendor
    df_has_vendors['num_of_vend'] = df_has_vendors['Vendor_key'].apply(len)
    
    df_has_multiple_vendors = df_has_vendors[df_has_vendors['num_of_vend'] > 1]
    multi_vendor_groups = list(df_has_multiple_vendors['final_group_id'])
    multi_vendor_data = has_vendor_data[has_vendor_data['final_group_id'].isin(multi_vendor_groups)]
    
    print('has multiple vendors')
    print(multi_vendor_data.shape)
    multi_vendor_data.to_pickle(df_name + 'multi-vendor-data.pkl')
    multi_vendor_data.to_csv(df_name + 'multi-vendor-data.txt',sep='\t', index=False)
    
    single_vendor_data = has_vendor_data[~has_vendor_data['final_group_id'].isin(multi_vendor_groups)]
    print('has single vendor')
    print(single_vendor_data.shape)
    single_vendor_data.to_pickle(df_name +'single-vendor-data.pkl')
    single_vendor_data.to_csv(df_name +'single-vendor-data.txt',sep='\t', index=False)
    
    return single_vendor_data
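
The three-way classification in `split_by_vendor_ct` can be sketched without writing any files. This is an illustrative alternative under stated assumptions, not the function above: `toy`, `vendor_ct`, and `vendor_class` are hypothetical names, and the sample assumes `Vendor_key` holds a list on e rows and is missing on p rows, as in the displayed data.

```python
import pandas as pd

# hypothetical sample: group 0 has one vendor, group 1 has two, group 2 has none
toy = pd.DataFrame({
    'final_group_id': [0, 0, 1, 1, 2],
    'p_or_e': ['e', 'p', 'e', 'p', 'p'],
    'Vendor_key': [['Elsevier'], None, ['JSTOR', 'other'], None, None],
})

# number of vendors listed on each row (0 where Vendor_key is missing)
vendor_ct = toy['Vendor_key'].apply(lambda v: len(v) if isinstance(v, list) else 0)

# largest vendor count found on any row of the group, broadcast to all its rows
group_max = vendor_ct.groupby(toy['final_group_id']).transform('max')

def classify(n):
    # 0 vendors -> 'none', exactly 1 -> 'single', more than 1 -> 'multi'
    if n == 0:
        return 'none'
    return 'single' if n == 1 else 'multi'

toy['vendor_class'] = group_max.map(classify)
print(toy[['final_group_id', 'p_or_e', 'vendor_class']])
```

Labelling rows this way keeps all three outcomes in one frame, which may make it easier to check that no group is dropped or counted twice before the separate files are written.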

In [366]:
multi_loc_100_vendor = split_by_vendor_ct(multi_loc_100, 'multi_loc_100')
multi_loc_100_vendor

no_vendor
(0, 43)
has multiple vendors
(6, 43)
has single vendor
(120, 43)



Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,Title 1 (Print)_BTAA-SPR,...,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw,final_group_id,locs,loc_count
68,9966748700001701,Behavior therapy,"[1878-1888, 0005-7894]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624565280001701', 'Behavior therapy']]",[ Available from 1970-03- volume: 1 issue: 1;],[Yes],[Elsevier],,...,0,1.0,"{1970, 1971, 1972, 1973, 1974, 1975, 1976, 197...","{1970, 1971, 1972, 1973, 1974, 1975, 1976, 197...",1.0,1.0,,34,,0
69,9956703420001701,Behavior therapy,[0005-7894],p,,,,,,Behavior therapy.,...,0,1.0,"{1970, 1971, 1972, 1973, 1974, 1975, 1976, 197...","{1970, 1971, 1972, 1973, 1974, 1975, 1976, 197...",1.0,1.0,30.0,34,"[TBIOM PERS, ZMLAC OWL]",2
200,9968110510001701,Journal of Comparative Pathology and Therapeutics,[0368-1742],e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624638360001701', 'Journal of Comparative...",[ Available from 1888 volume: 1 until 1964 vol...,[Yes],[Elsevier],Journal of comparative pathology and therapeut...,...,0,1.0,"{1920, 1921, 1922, 1923, 1924, 1925, 1926, 192...","{1920, 1921, 1922, 1923, 1924, 1925, 1926, 192...",1.0,1.0,,100,,0
201,9965015360001701,Journal of Comparative Pathology and Therapeutics,[0368-1742],p,,,,,,Journal of comparative pathology and therapeut...,...,0,1.0,"{1920, 1921, 1922, 1923, 1924, 1925, 1926, 192...","{1920, 1921, 1922, 1923, 1924, 1925, 1926, 192...",1.0,1.0,43.0,100,"[ZMLAC NON, TVET PER]",2
212,9966893360001701,Brain and language,"[1090-2155, 0093-934X]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624562670001701', 'Brain and language.']]",[ Available from 1974-01- volume: 1 issue: 1;],[Yes],[Elsevier],,...,0,1.0,"{1974, 1975, 1976, 1977, 1978, 1979, 1980, 198...","{1974, 1975, 1976, 1977, 1978, 1979, 1980, 198...",1.0,1.0,,106,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2855,9926256870001701,Progress in lipid research,[0163-7827],p,,,,,,Progress in lipid research.,...,0,1.0,"{1978, 1979, 1980, 1981, 1982, 1983, 1984, 198...","{1978, 1979, 1980, 1981, 1982, 1983, 1984, 198...",1.0,1.0,32.0,1427,"[TCOS SN1, TZDS GEN]",2
3076,9968437680001701,IRE transactions on communications systems,"[0096-2244, 2162-2132]",e,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620358750001701', 'IRE transactions on co...",[ Available from 1956 volume: 4 issue: 1 until...,[Yes],[other],,...,0,1.0,"{1960, 1961, 1962, 1959}","{1960, 1961, 1962, 1959}",1.0,1.0,,1538,,0
3077,9946180690001701,IRE transactions on communications systems,[0096-2244],p,,,,,,IRE transactions on communications systems.,...,0,1.0,"{1960, 1961, 1962, 1959}","{1960, 1961, 1962, 1959}",1.0,1.0,6.0,1538,"[ZMLAC UMDN, ZMLAC OWL]",2
3166,9955333930001701,Journal of ultrastructure research,[0022-5320],p,,,,,,Journal of ultrastructure research.,...,0,1.0,"{1957, 1958, 1959, 1960, 1961, 1962, 1963, 196...","{1957, 1958, 1959, 1960, 1961, 1962, 1963, 196...",1.0,1.0,158.0,1583,"[TBIOM PERS, ZMLAC UMDN, ZMLAC OWL]",3


In [367]:
single_loc_100_vendor = split_by_vendor_ct(single_loc_100, 'single_loc_100')
single_loc_100_vendor

no_vendor
(0, 43)
has multiple vendors
(30, 43)
has single vendor
(536, 43)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]


Unnamed: 0,MMS_ID,Title_bib,ISSN_cluster,p_or_e,e_coll_info,portfolio_info,Coverage Information Combined,PCAD?,Vendor_key,Title 1 (Print)_BTAA-SPR,...,ct_p_no_match,p2e-percent,p2e_has_pcad,pcad-repo,pcad-repo-percent,total-to-remove,potential volumes to withdraw,final_group_id,locs,loc_count
16,9967030990001701,Progress in solid state chemistry,"[1873-1643, 0079-6786]",e,"[['61624504590001701', 'Elsevier ScienceDirect...","[['53624686980001701', 'Progress in solid stat...",[ Available from 1964 volume: 1;],[Yes],[Elsevier],,...,0,1.0,"{1964, 1965, 1967, 1971, 1973}","{1964, 1965, 1967, 1971, 1973}",1.0,1.0,,8,,0
17,9916137210001701,Progress in solid state chemistry,[0079-6786],p,,,,,,Progress in solid state chemistry.,...,0,1.0,"{1964, 1965, 1967, 1971, 1973}","{1964, 1965, 1967, 1971, 1973}",1.0,1.0,7.0,8,[ZMLAC GEN],1
18,9967833530001701,Progress in the chemistry of fats and other li...,"[0079-6832, 1878-3198]",e,"[['61535212360001701', 'Elsevier SD Backfile B...","[['53611930190001701', 'Progress in the chemis...",[ Available from 1952 volume: 1 until 1978 vol...,[Yes],[Elsevier],,...,0,1.0,"{1952, 1954, 1955, 1957, 1958, 1963, 1964, 196...","{1952, 1954, 1955, 1957, 1958, 1963, 1964, 196...",1.0,1.0,,9,,0
19,9930545140001701,Progress in the chemistry of fats and other li...,[0079-6832],p,,,,,,Progress in the chemistry of fats and other li...,...,0,1.0,"{1952, 1954, 1955, 1957, 1958, 1963, 1964, 196...","{1952, 1954, 1955, 1957, 1958, 1963, 1964, 196...",1.0,1.0,13.0,9,[ZMLAC OWL],1
20,9968670150001701,Journal of the Royal Institute of Internationa...,[1473-799X],e,"[['61535212310001701', 'JSTOR Arts and Science...","[['53540678550001701', 'Journal of the Royal I...",[ Available from 1926 volume: 5 issue: 3 until...,[Yes],[JSTOR],,...,0,1.0,"{1928, 1929, 1930}","{1928, 1929, 1930}",1.0,1.0,,10,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3219,9915270740001701,Journal of virological methods.,[0166-0934],p,,,,,,Journal of virological methods.,...,0,1.0,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...","{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...",1.0,1.0,46.0,1609,[TVET PER],1
3220,9968725470001701,IEEE ASSP magazine,"[1558-1284, 2168-3050, 0740-7467]",e,"[['61619505660001701', 'IEEE/IET Electronic Li...","[['53620358530001701', 'IEEE ASSP magazine : a...",[ Available from 1984 volume: 1 issue: 1 until...,[Yes],[other],,...,0,1.0,"{1984, 1985, 1986, 1987, 1988, 1989, 1990}","{1984, 1985, 1986, 1987, 1988, 1989, 1990}",1.0,1.0,,1610,,0
3221,9937260310001701,IEEE ASSP magazine,[0740-7467],p,,,,,,IEEE ASSP magazine.,...,0,1.0,"{1984, 1985, 1986, 1987, 1988, 1989, 1990}","{1984, 1985, 1986, 1987, 1988, 1989, 1990}",1.0,1.0,3.0,1610,[ZMLAC OWL],1
3224,9966662330001701,SIAM journal on scientific and statistical com...,"[2168-3417, 0196-5204]",e,"[['61535215690001701', ""LOCUS - SIAM''s Online...","[['53536196430001701', 'SIAM journal on scient...",[ Available from 1980 volume: 1 issue: 1 until...,[Yes],[other],,...,0,1.0,"{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...","{1984, 1985, 1986, 1987, 1988, 1989, 1990, 199...",1.0,1.0,,1612,,0


#### Split out single-vendor groups by vendor

In [368]:
def split_by_vendor(df, df_name):
    # Keep only rows that carry a vendor, then take the first vendor in each list
    vends = df[df['Vendor_key'].notnull()]
    vends['vendor'] = vends['Vendor_key'].apply(lambda x: x[0])
    
    # Write one .pkl/.txt pair per named vendor, with every row in the matching groups
    vendor_list = ['Wiley', 'Elsevier', 'SAGE', 'Springer', 'Taylor & Francis']
    for x in vendor_list:
        print(x)
        group_list = list(vends[vends['vendor'] == x]['final_group_id'])
        data = df[df['final_group_id'].isin(group_list)]
        print(data.shape)
        data.to_pickle(df_name + '_' + x + '.pkl')
        data.to_csv(df_name + '_' + x + '.txt', sep='\t', index=False)
        
    # JSTOR and 'other' are combined into a single output
    other_groups = list(vends[vends['vendor'].isin(['JSTOR', 'other'])]['final_group_id'])
    print('other + JSTOR')
    other = df[df['final_group_id'].isin(other_groups)]
    print(other.shape)
    other.to_pickle(df_name + '_other_and_JSTOR.pkl')
    other.to_csv(df_name + '_other_and_JSTOR.txt', sep='\t', index=False)
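
In the output shown earlier, `Vendor_key` is filled in only on the electronic (`e`) rows and is empty on the print (`p`) rows. That appears to be why the function first collects the group IDs of the rows matching a vendor and then selects every row in those groups, rather than filtering on the vendor directly. A minimal sketch of that two-step selection on made-up data (the values below are invented for illustration):

```python
import pandas as pd

# Made-up rows: Vendor_key is a one-item list on e rows and missing on p rows.
df = pd.DataFrame({
    'final_group_id': [8, 8, 10, 10],
    'p_or_e': ['e', 'p', 'e', 'p'],
    'Vendor_key': [['Elsevier'], None, ['JSTOR'], None],
})

# Step 1: find the group IDs whose vendor-bearing row names the vendor.
e_rows = df[df['Vendor_key'].notnull()]
vendor = e_rows['Vendor_key'].apply(lambda x: x[0])
elsevier_groups = e_rows.loc[vendor == 'Elsevier', 'final_group_id']

# Step 2: pull every row in those groups, print rows included.
elsevier = df[df['final_group_id'].isin(elsevier_groups)]

print(elsevier['p_or_e'].tolist())
```

Selecting by group ID keeps each print record next to its electronic counterpart in the per-vendor output.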

In [369]:
split_by_vendor(multi_loc_100_vendor, 'multi_loc_100_vendor')
split_by_vendor(single_loc_100_vendor, 'single_loc_100_vendor')

Wiley
(6, 43)
Elsevier
(80, 43)
SAGE
(2, 43)
Springer
(4, 43)
Taylor & Francis
(2, 43)
other + JSTOR
(26, 43)
Wiley
(60, 43)
Elsevier
(256, 43)
SAGE
(14, 43)
Springer
(66, 43)
Taylor & Francis
(14, 43)
other + JSTOR
(126, 43)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
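
The message above is pandas' `SettingWithCopyWarning`. It most likely comes from the `vends['vendor'] = ...` assignment in `split_by_vendor`, which adds a column to a frame produced by filtering another frame. One common way to avoid the warning is to take an explicit `.copy()` of the filtered frame before adding the column. A minimal sketch on made-up data, offered as a possible adjustment rather than as the code that produced the output above:

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    'final_group_id': [8, 8],
    'Vendor_key': [['Elsevier'], None],
})

# .copy() makes vends an independent frame, so adding a column to it
# is unambiguous and pandas has nothing to warn about.
vends = df[df['Vendor_key'].notnull()].copy()
vends['vendor'] = vends['Vendor_key'].apply(lambda x: x[0])

print(vends['vendor'].tolist())
# The original frame is left untouched.
print('vendor' in df.columns)
```

The results do not change, because the filtered frame was already a separate object; the explicit copy only makes that intent clear to pandas.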
