# LFPL Collections Data

## Data Sources 

This project uses data from the Louisville Metro Open Data site. You can find 
the main info page for this data set here: 
[Library Collection Inventory](https://data.louisvilleky.gov/datasets/LOJIC::louisville-metro-ky-library-collection-inventory-/about). A modified copy of the data is included in this repository at data/raw/books.csv.gz.


This project also scrapes a list of young adult fiction authors from Wikipedia using 01_load_authors.py, which uses the Beautiful Soup library. The Wikipedia article is here:
[List of young adult fiction writers](https://en.m.wikipedia.org/wiki/List_of_young_adult_fiction_writers). The scraped data is included in this repository at data/raw/authors.csv.

In [2]:
import pandas as pd
# Don't display numbers in scientific notation.
pd.set_option('display.float_format', lambda x: '%.5f' % x)



# LFPL data

In [3]:
books = pd.read_csv("data/raw/books.csv.gz")

In [4]:
# Size information
books.shape
#Rows, Columns

(1190176, 10)

### LFPL Data dictionary
LFPL's collection inventory. Updated on a monthly basis.

|Column name | Type | Description | Notes |
| ----------- | ---- | ----------- | ----- |
| BibNum | number:int64| The unique identifier of a bibliographic record within our materials database. Materials with the same bibliographic # will generally have the same cataloging metadata, differing only in the barcode number, assigned location and anything else specific to the individual copy. | |
| Title | string | The name of the material. | |
| Author | string | The writer or creator of the material. | Inconsistent author names; life dates included; some missing |
| ISBN | number:float64 | The International Standard Book Number is a numeric commercial book identifier that is intended to be unique. Publishers purchase ISBNs from an affiliate of the International ISBN Agency. An ISBN is assigned to each separate edition and variation of a publication. | |
| PublicationYear | number:int64 | The year that the material was originally published. | Year 0; years in the future; typos? |
| ItemType | string | Describes the type of material of each item, including Books, Audiobooks, Serials, DVDs, Microforms, Three Dimensional Objects, Kits, and Printed Cartographic Materials. | Constant: always "Book"; can ignore|
| ItemCollection | string | Refers to the collection the material belongs to based on common themes, including but not limited to Adult Fiction, Adult Reference, Mystery, Children’s Fiction, etc. | Compound categories combine several attributes; for example, "Adult Fiction" could be split into "Adult" and "Fiction". Some DVD Video materials? Thought it was all books. |
| ItemLocation | string | The library location where the material was assigned at the time the report was run. | Complex categories: multiple "main" and "remote shelving"; include mobile libraries. Anything interesting there? |
| ItemPrice | number:float64 | The price, in USD, that LFPL purchased the material for. | Really big range; some prices are zero; round to 2 decimals |
| ReportDate | string | Probably the date and time that the report was generated. | Constant; can probably be ignored |

In [5]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1190176 entries, 0 to 1190175
Data columns (total 10 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   BibNum           1190176 non-null  int64  
 1   Title            1190175 non-null  object 
 2   Author           1124225 non-null  object 
 3   ISBN             1153891 non-null  float64
 4   PublicationYear  1190176 non-null  int64  
 5   ItemType         1190176 non-null  object 
 6   ItemCollection   1190036 non-null  object 
 7   ItemLocation     1190176 non-null  object 
 8   ItemPrice        1190176 non-null  float64
 9   ReportDate       1190176 non-null  object 
dtypes: float64(2), int64(2), object(6)
memory usage: 90.8+ MB


In [6]:
books


Unnamed: 0,BibNum,Title,Author,ISBN,PublicationYear,ItemType,ItemCollection,ItemLocation,ItemPrice,ReportDate
0,707409,"Jeff Immelt and the new GE way : innovation, t...","Magee, David, 1965-",9780071605878.00000,2009,Book,Adult Non-Fiction,Main,25.95000,02/01/2023 00:00:00
1,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",9780307451408.00000,2009,Book,Adult Non-Fiction,Southwest,19.99000,02/01/2023 00:00:00
2,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",9780307451408.00000,2009,Book,Adult Non-Fiction,Southwest,19.99000,02/01/2023 00:00:00
3,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",9780307451408.00000,2009,Book,Adult Non-Fiction,Remote Shelving - Main,19.99000,02/01/2023 00:00:00
4,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",9780307451408.00000,2009,Book,Adult Non-Fiction,Remote Shelving - Main,19.99000,02/01/2023 00:00:00
...,...,...,...,...,...,...,...,...,...,...
1190171,2608597,25 ready-to-use sustainable living programs fo...,,9780838936498.00000,2022,Book,Adult Non-Fiction,South Central,63.69000,02/01/2023 00:00:00
1190172,2608598,Crypto basics : a nontechnical introduction to...,"Gomzin, Slava",9781484283202.00000,2022,Book,Adult Non-Fiction,Bon Air,30.09000,02/01/2023 00:00:00
1190173,2608598,Crypto basics : a nontechnical introduction to...,"Gomzin, Slava",9781484283202.00000,2022,Book,Adult Non-Fiction,Newburg,30.09000,02/01/2023 00:00:00
1190174,2608599,Data governance,"Reichental, Jonathan",9781119906773.00000,2023,Book,Adult Non-Fiction,Main,24.34000,02/01/2023 00:00:00


In [7]:
#BibNum
len(books["BibNum"].unique())

439253

In [8]:
#Author
authors = books["Author"]
authors.value_counts()



Author
Patterson, James, 1947-                 5856
Osborne, Mary Pope.                     2063
Steel, Danielle                         1824
Pilkey, Dav, 1966-                      1812
Seuss, Dr.                              1812
                                        ... 
Sánchez Ferlosio, Rafael, 1927-2019       1
Pippen, Kitty, 1919-2018                   1
Adkins, Frank (Francis A.)                 1
Ray, James A.                              1
Kniffke, Sophie.                           1
Name: count, Length: 187472, dtype: int64
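
The counts above show catalog-style names ("Last, First, 1965-"), while the Wikipedia list uses "First Last". As a minimal sketch of how the two could be brought closer together, the hypothetical helper `normalize_author` below (not part of the original notebook) drops a trailing life-date segment and reorders the name. The regular expressions are assumptions based only on the sample values shown here.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Trailing life dates such as ", 1965-" or ", 1927-2019".
LIFE_DATES = re.compile(r",?\s*\d{4}\s*-\s*(?:\d{4})?\.?\s*$")
# A final period after a word of 3+ letters ("Pope."), but not after "Dr." or "A."
TRAILING_PERIOD = re.compile(r"(?<=\w{3})\.$")


def normalize_author(raw):
    """Turn 'Last, First, 1965-' into 'First Last'; pass missing values through."""
    if not isinstance(raw, str):
        return raw  # NaN / None stay untouched
    name = LIFE_DATES.sub("", raw).strip()
    name = TRAILING_PERIOD.sub("", name)
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first.strip()} {last.strip()}"
    return name


print(normalize_author("Magee, David, 1965-"))
print(normalize_author("Osborne, Mary Pope."))
```

It could be applied with `books["Author"].map(normalize_author)`. Names with embedded parentheticals or multiple commas would need extra handling.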

In [9]:
#ISBN
isbns = books['ISBN']
print("unique\t",len(isbns.unique())) 


unique	 410544
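
Because `books.info()` reports ISBN as float64, the values display with a decimal part (for example 9780071605878.00000). A minimal sketch of one way to turn such a column into text identifiers is shown below, using a small toy Series rather than the real data. It assumes every non-missing value is an integer-valued float, which holds for the 13-digit values shown in the sample rows.

```python
import pandas as pd

# Toy stand-in for books["ISBN"]: integer-valued floats plus a missing value.
isbn_float = pd.Series([9780071605878.0, 9780307451408.0, float("nan")])

# float64 -> nullable Int64 (NaN becomes <NA>) -> string, so no ".0" suffix appears.
isbn_text = isbn_float.astype("Int64").astype("string")

print(isbn_text.tolist())
print("unique (excluding missing):", isbn_text.nunique())
```

Note that `len(isbns.unique())` counts NaN as one of the unique values, whereas `nunique()` excludes missing values by default.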


In [10]:
#PublicationYear
from statistics import mean, median

years = books["PublicationYear"]
years.describe()

count   1190176.00000
mean       2004.26355
std         101.77562
min           0.00000
25%        2005.00000
50%        2014.00000
75%        2018.00000
max        9999.00000
Name: PublicationYear, dtype: float64

In [11]:

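# Sorted array of the distinct publication years.
years = years.unique()
years.sort()

print(years)
# Drop the first value (0) and the last two values by position;
# this assumes those are the only placeholder or implausible years.
real_years = years[1:-2]

# Compare mean and median of the distinct years before and after trimming.
stats = pd.DataFrame([{"mean": mean(y), "median": median(y)} for y in (years, real_years)],
                     index=("raw", "cleaned*"))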



[   0 1790 1794 1798 1800 1807 1808 1809 1812 1814 1817 1818 1821 1822
 1823 1825 1828 1829 1830 1831 1832 1833 1835 1836 1837 1838 1839 1840
 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854
 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868
 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882
 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896
 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910
 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924
 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938
 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952
 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966
 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
 2009 
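
Trimming by position depends on knowing exactly how many outliers sit at each end of the sorted array. A sketch of an alternative is to mask years outside a plausible range. The bounds `MIN_YEAR` and `MAX_YEAR` below are assumptions chosen for illustration, and the Series is a toy stand-in for `books["PublicationYear"]`.

```python
import pandas as pd

# Toy stand-in for books["PublicationYear"]; the bounds below are assumptions.
years_raw = pd.Series([0, 1790, 2009, 2014, 2023, 9999])

MIN_YEAR = 1450  # assumed lower bound for a printed book
MAX_YEAR = 2023  # assumed upper bound: the year of the ReportDate

plausible = years_raw.between(MIN_YEAR, MAX_YEAR)  # inclusive on both ends
years_clean = years_raw.where(plausible)           # implausible years become NaN

print(years_clean.describe())
```

Masking keeps the rows in place, so the other columns of those records remain usable even when the year is not.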

In [12]:
#ItemType # always "Book"
books["ItemType"].unique()


array(['Book'], dtype=object)

In [13]:
#ItemCollection
books["ItemCollection"].unique()

array(['Adult Non-Fiction', 'Adult Fiction', 'Mystery',
       'Older Teen Fiction', 'Younger Teen  Fiction', 'Adult Paperback',
       'Science Fiction', "Children's Fiction", 'Western',
       "Children's Picture Paperback", "Children's Paperback",
       "Children's Picture Book", 'International Collection',
       'ELL Collection', 'Teen Non-Fiction', "Children's Non-Fiction",
       'Holiday', 'Natural Resources', 'Kentucky History', 'Oversize',
       'Urban Fiction', 'Bestsellers', 'Storytime Collection',
       "Children's Board Book", "Children's Easy Reader",
       'Preschool  Picture Book', 'Adult Reference', 'Interlibrary Loan',
       nan, 'Adult Paperbacks Tall', "Children's Easy Reader Paperback",
       'Caldecott/Newbery', 'Laptop', 'Government Documents',
       'Large Print', 'Telereference', "Children's Non-Fiction Paperback",
       'Big Book', "Children's Reference", 'Teen Reference',
       'College Shop', 'Magazines and Newspaper',
       'Younger Teen  Paperba

In [14]:
books.ItemCollection.value_counts()

ItemCollection
Adult Non-Fiction                   371433
Adult Fiction                       177604
Children's Non-Fiction               86356
Mystery                              60314
Children's Picture Book              59348
Preschool  Picture Book              51276
Children's Fiction                   48446
Adult Paperback                      45302
Children's Paperback                 45076
Children's Easy Reader               24511
Teen Non-Fiction                     24376
Older Teen Fiction                   23787
Children's Board Book                20057
Younger Teen  Fiction                17532
Kentucky History                     16962
Science Fiction                      16048
Children's Easy Reader Paperback     15959
Holiday                              15583
International Collection             15581
Adult Reference                      11197
Children's Picture Paperback          9731
Urban Fiction                         7601
Caldecott/Newbery                     6
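
Several collection names combine an audience with a category ("Adult Fiction"), and a few contain doubled spaces ("Younger Teen  Fiction"). The sketch below shows one way such values might be split. The `AUDIENCES` tuple and the helper `split_collection` are hypothetical and were read off the values listed above, so they may not cover every collection.

```python
# Assumed audience prefixes, taken from the ItemCollection values shown above.
AUDIENCES = ("Younger Teen", "Older Teen", "Children's", "Preschool", "Adult", "Teen")


def split_collection(value):
    """Return (audience, category); audience is None when no known prefix matches."""
    if not isinstance(value, str):
        return (None, None)  # missing collection (NaN)
    cleaned = " ".join(value.split())  # collapse repeated spaces
    for audience in AUDIENCES:
        if cleaned.startswith(audience + " "):
            return (audience, cleaned[len(audience) + 1:])
    return (None, cleaned)


for name in ("Adult Fiction", "Younger Teen  Fiction", "Mystery"):
    print(name, "->", split_collection(name))
```

Collections without an audience prefix, such as "Mystery", come back with `None` as the audience, leaving that decision open.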

In [15]:
#ItemLocation
print(books["ItemLocation"].unique())
books.ItemLocation.value_counts()

['Main' 'Southwest' 'Remote Shelving - Main' 'Newburg' 'South Central'
 'St Matthews' 'Fairdale' 'Bon Air' 'Jeffersontown' 'Iroquois'
 'Crescent Hill' 'Remote Shelving - Shawnee' 'Northeast'
 'Childrens Main Library' 'Shively' 'Highlands - Shelby Park' 'Middletown'
 'Portland' 'Western' 'Main Teen' 'Shawnee' 'Childrens Bookmobile'
 'Content Management' 'Adult Bookmobile']


ItemLocation
Remote Shelving - Main       139987
Northeast                    124473
Southwest                    122113
Main                         121439
South Central                115837
Bon Air                       74730
St Matthews                   69531
Jeffersontown                 56706
Iroquois                      52382
Highlands - Shelby Park       45539
Crescent Hill                 42837
Childrens Main Library        38994
Middletown                    33120
Shively                       23623
Newburg                       23586
Fairdale                      23149
Shawnee                       22906
Western                       21648
Portland                      13334
Childrens Bookmobile           9129
Remote Shelving - Shawnee      9083
Main Teen                      6024
Content Management                4
Adult Bookmobile                  2
Name: count, dtype: int64

In [16]:
#ItemPrice
prices = books["ItemPrice"]
prices.describe()

count   1190176.00000
mean         18.45097
std          15.99772
min           0.00000
25%          10.95000
50%          15.99000
75%          24.95000
max        1077.00000
Name: ItemPrice, dtype: float64

In [17]:
#ReportDate # Always '02/01/2023 00:00:00'
books["ReportDate"].unique()

array(['02/01/2023 00:00:00'], dtype=object)

# Scraped Wikipedia data


In [18]:
authors = pd.read_csv("data/raw/authors.csv", index_col=0)
authors.shape


(635, 1)

# Data Dictionary

| Column name | Type | Description | Notes |
| ----------- | ---- | ----------- | ----- |
| index | number | unique number per row | |
| Name | string | author's name | Some names include information in parentheses; for disambiguation? Are there authors with the same name? |

In [21]:
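from string import punctuation

parens = set("()")
# All ASCII punctuation except parentheses, which are tracked separately.
punctuation = set(punctuation) - parens
nonalpha_values = list()  # names containing other punctuation (periods, hyphens, apostrophes)
parens_values = list()    # names containing a parenthetical
for value in authors.values:
    value = value[0]  # each row holds a single column: the name
    chars = set(value)
    if chars & parens:
        parens_values.append(value)
    elif chars & punctuation:
        nonalpha_values.append(value)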

['Faridah Àbíké-Íyímídé', 'S.K. Ali', 'Elaine M. Alphin', 'M. T. Anderson', 'V. C. Andrews', 'Amelia Atwater-Rhodes', 'T. A. Barron', 'L. Frank Baum', 'Elizabeth J. Braswell', 'Roseanne A. Brown', 'N. M. Browne', 'Elizabeth C. Bunce', 'W. Bruce Cameron', 'Mary H.K. Choi', 'Rosemary Clement-Moore', 'Sneed B. Collard III', 'Caroline B. Cooney', 'Sharon G. Flake', 'E.R. Frank', 'Barbara C. Freeman', 'M-E Girard', 'Laurell K. Hamilton', 'Alix E. Harrow', 'Robert A. Heinlein', 'S.E. Hinton', 'A. M. Jenkins', 'E. K. Johnston', 'A. S. King', 'E. L. Konigsburg', 'Ursula K. Le Guin', "Madeleine L'Engle", 'E. Lockhart', 'Sarah J. Maas', 'Ann M. Martin', 'Syed M. Masood', 'Sharon E. McKay', 'Anna-Marie McLemore', 'Karen M. McManus', 'O. R. Melling', 'Gloria D. Miklowitz', 'Lorin Morgan-Richards', "Tyne O'Connell", "Scott O'Dell", "Louise O'Neill", 'Emily X.R. Pan', 'Mary E. Pearson', 'K. M. Peyton', 'J. K. Rowling', 'J. D. Salinger', 'V. E. Schwab', 'Andrew A. Smith', 'R. L. Stine', 'Francisco X.

In [22]:
nonalpha_values

['Faridah Àbíké-Íyímídé',
 'S.K. Ali',
 'Elaine M. Alphin',
 'M. T. Anderson',
 'V. C. Andrews',
 'Amelia Atwater-Rhodes',
 'T. A. Barron',
 'L. Frank Baum',
 'Elizabeth J. Braswell',
 'Roseanne A. Brown',
 'N. M. Browne',
 'Elizabeth C. Bunce',
 'W. Bruce Cameron',
 'Mary H.K. Choi',
 'Rosemary Clement-Moore',
 'Sneed B. Collard III',
 'Caroline B. Cooney',
 'Sharon G. Flake',
 'E.R. Frank',
 'Barbara C. Freeman',
 'M-E Girard',
 'Laurell K. Hamilton',
 'Alix E. Harrow',
 'Robert A. Heinlein',
 'S.E. Hinton',
 'A. M. Jenkins',
 'E. K. Johnston',
 'A. S. King',
 'E. L. Konigsburg',
 'Ursula K. Le Guin',
 "Madeleine L'Engle",
 'E. Lockhart',
 'Sarah J. Maas',
 'Ann M. Martin',
 'Syed M. Masood',
 'Sharon E. McKay',
 'Anna-Marie McLemore',
 'Karen M. McManus',
 'O. R. Melling',
 'Gloria D. Miklowitz',
 'Lorin Morgan-Richards',
 "Tyne O'Connell",
 "Scott O'Dell",
 "Louise O'Neill",
 'Emily X.R. Pan',
 'Mary E. Pearson',
 'K. M. Peyton',
 'J. K. Rowling',
 'J. D. Salinger',
 'V. E. Schwab'

In [24]:
parens_values

['Anthony (writer)',
 'Karen Bass (writer)',
 'Joan Bauer (novelist)',
 'Julie Berry (author)',
 'Kevin Brooks (writer)',
 'Christopher Collier (historian)',
 'Terry Davis (author)',
 'Anne Emery (young adult author)',
 'Nancy Farmer (author)',
 'Michael Grant (author, born 1954)',
 'Alex Hall (author)',
 'Barbara Hall (TV producer)',
 'Rachel Hawkins (writer)',
 'Angela Johnson (writer)',
 'Catherine Johnson (novelist)',
 'Leah Johnson (writer)',
 'Carrie Jones (author)',
 'Elizabeth Laird (author)',
 'Michael Lawrence (writer)',
 'Keith Mansfield (writer)',
 'John Marsden (writer)',
 'Patricia McCormick (author)',
 'Mike Mullin (author)',
 'Julie Murphy (author)',
 'William Nicholson (writer)',
 'Richard Peck (writer)',
 'Christopher Pike (author)',
 'David Rees (author)',
 'Alex Sánchez (author)',
 'Elizabeth Scott (author)',
 'Michael Scott (Irish author)',
 'Mark Shulman (author)',
 'L. J. Smith (author)',
 'Nicholas Sparks (author)',
 'Aiden Thomas (author)',
 'Rob Thomas (writer

# Conclusion

Filter out parenthetical values and discard.
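
One possible reading of this step is to strip the trailing parenthetical from each scraped name and keep the rest. The sketch below takes that reading. The helper `strip_disambiguation` and its regular expression are illustrative assumptions, not code from 01_load_authors.py.

```python
import re

# Matches one trailing parenthetical such as " (writer)" or " (author, born 1954)".
TRAILING_PARENS = re.compile(r"\s*\([^()]*\)\s*$")


def strip_disambiguation(name):
    """Remove a trailing '(...)' qualifier from a scraped author name."""
    return TRAILING_PARENS.sub("", name).strip()


sample = ["Karen Bass (writer)", "Michael Grant (author, born 1954)", "J. K. Rowling"]
cleaned = [strip_disambiguation(n) for n in sample]
print(cleaned)
```

If the intent is instead to drop those rows entirely, a filter on `parens_values` would replace the substitution.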