<a href="https://colab.research.google.com/github/archivesunleashed/notebooks/blob/main/arch/filtering_examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!curl "https://webdata.archive-it.org/ait/1796/research_services/download/ARCHIVEIT-14462/WebPagesExtraction/web-pages.csv.gz?access=HQSTAVXQALHVLERXY3RIOS6SDL7YXYZQ" --output web-pages.csv.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2628k    0 2628k    0     0  1026k      0 --:--:--  0:00:02 --:--:-- 1026k


In [2]:
!gunzip web-pages.csv.gz

Filtering with `grep`, and using `wc` to count lines.

Let's count the lines in the csv file.

In [3]:
!wc -l web-pages.csv

5902 web-pages.csv


We can use `grep` to search across the entire CSV, all columns included, so think of it as a full-text search.

We'll search for my last name with the `-i` flag so the search is case insensitive, and pipe the output to `wc -l` to see how many lines match.

In [4]:
!grep -i 'ruest' web-pages.csv | wc -l

1454


Then we can take a look at what the results look like.

In [5]:
!grep -i 'ruest' web-pages.csv | head -n 25

20200624,cs.uwaterloo.ca,https://cs.uwaterloo.ca/~jimmylin/publications/index.html,text/html,application/xhtml+xml,en,"Jimmy Lin » Publications Jimmy Lin University of Waterloo Home Publications Projects Students Teaching Resources Publications Restrict? Restrict? deep learning, neural networks big data, large-scale data processing reproducibility, evaluation issues and methodology Twitter, real-time search and filtering information seeking, user interaction, visualization information retrieval medical and biomedical informatics question answering, document summarization computational social science, digital humanities natural language processing, computational linguistics Jump to: 2020 2019 | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 2009 | 2008 | 2007 | 2006 | 2005 | 2004 | 2003 | 2002 | 2001 | 2000 1999 | 1998 2020 450. Nick Ruest, Jimmy Lin, Ian Milligan, and Samantha Fritz. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly A

If we're happy with that, we can redirect the output to a new file. The header row does not contain "ruest", so it will not be carried over into the new file.

In [6]:
!grep -i 'ruest' web-pages.csv > ruest-web-pages.csv

In [7]:
!head ruest-web-pages.csv

20200624,cs.uwaterloo.ca,https://cs.uwaterloo.ca/~jimmylin/publications/index.html,text/html,application/xhtml+xml,en,"Jimmy Lin » Publications Jimmy Lin University of Waterloo Home Publications Projects Students Teaching Resources Publications Restrict? Restrict? deep learning, neural networks big data, large-scale data processing reproducibility, evaluation issues and methodology Twitter, real-time search and filtering information seeking, user interaction, visualization information retrieval medical and biomedical informatics question answering, document summarization computational social science, digital humanities natural language processing, computational linguistics Jump to: 2020 2019 | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 2009 | 2008 | 2007 | 2006 | 2005 | 2004 | 2003 | 2002 | 2001 | 2000 1999 | 1998 2020 450. Nick Ruest, Jimmy Lin, Ian Milligan, and Samantha Fritz. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly A

You can also use [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) with `grep`.

...create regex example
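As a rough sketch of what a regular expression adds over a plain substring search, the Python snippet below mimics a case-insensitive `grep` line by line. The sample lines and the pattern are made up for illustration and are not taken from `web-pages.csv`.

```python
import re

# Hypothetical sample lines standing in for rows of web-pages.csv.
lines = [
    "20200624,example.org,Nick Ruest and Ian Milligan",
    "20200624,example.org,Jimmy Lin publications",
    "20200625,example.org,The truest account of web archiving",
]

# \b marks a word boundary, so "ruest" only matches as a whole word;
# the alternation (a|b) accepts either surname.
pattern = re.compile(r"\b(ruest|milligan)\b", re.IGNORECASE)

matches = [line for line in lines if pattern.search(line)]
for line in matches:
    print(line)
```

A plain substring search for "ruest" would also pick up the third line, because "truest" contains it; the word boundaries in the pattern rule that out.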


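We can also filter with [pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.filter.html). Note that the cells below use boolean indexing with `str.contains`; `DataFrame.filter` itself selects rows or columns by label, not by content.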

Let's import `pandas` and load our example into a DataFrame.

In [12]:
import pandas as pd

web_pages = pd.read_csv("web-pages.csv")
web_pages.shape[0]

5901
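The DataFrame has 5901 rows, one fewer than the 5902 lines `wc -l` reported. The most likely reason is that `read_csv` treats the first line as the header rather than as data. A minimal sketch with made-up CSV text (not the real file) illustrates the difference:

```python
import io

import pandas as pd

# Hypothetical three-line CSV: one header line plus two records.
csv_text = "crawl_date,domain\n20200624,example.org\n20200625,example.com\n"

# `wc -l` counts newline characters, so the header is included.
line_count = csv_text.count("\n")

# read_csv consumes the first line as column names, so only records are rows.
row_count = pd.read_csv(io.StringIO(csv_text)).shape[0]

print(line_count, row_count)
```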

Let's filter on the `content` column for "ruest", ignoring case and treating `NaN` values as non-matches.



In [20]:
web_pages_ruest = web_pages[web_pages['content'].str.contains("ruest", na=False, case=False)]

In [21]:
web_pages_ruest.shape[0]

1454
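To see what `case=False` and `na=False` do, here is a small sketch on a toy DataFrame. The data is hypothetical and only stands in for the `content` column.

```python
import pandas as pd

# Toy data (hypothetical) with one missing value in the content column.
df = pd.DataFrame({"content": ["Nick Ruest", "Jimmy Lin", None, "RUEST, N."]})

# case=False matches "Ruest" and "RUEST" alike.
# na=False makes missing values count as non-matches instead of NaN,
# which keeps the mask purely boolean so it can be used for indexing.
mask = df["content"].str.contains("ruest", case=False, na=False)

filtered = df[mask]
print(filtered)
```

Without `na=False`, the missing value would leave a gap in the mask, and indexing the DataFrame with it would typically raise an error.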

In [22]:
web_pages_ruest

Unnamed: 0,crawl_date,domain,url,mime_type_web_server,mime_type_tika,language,content
4,20200624,cs.uwaterloo.ca,https://cs.uwaterloo.ca/~jimmylin/publications...,text/html,application/xhtml+xml,en,Jimmy Lin » Publications Jimmy Lin University ...
36,20200624,ianmilligan.ca,https://www.ianmilligan.ca/,text/html,text/html,en,Ian Milligan Search Ian Milligan Home Posts Bo...
56,20200624,ianmilligan.ca,https://www.ianmilligan.ca/post/we-could-but-s...,text/html,text/html,en,"New Paper: We Could, but Should We? Ethical Co..."
57,20200624,ianmilligan.ca,https://www.ianmilligan.ca/post/,text/html,text/html,en,Posts | Ian Milligan Search Ian Milligan Home ...
68,20200625,ianmilligan.ca,https://www.ianmilligan.ca/project/longitudinal/,text/html,text/html,en,A Longitudinal Analysis of the Canadian World ...
...,...,...,...,...,...,...,...
5496,20210624,ianmilligan.ca,https://www.ianmilligan.ca/,text/html,text/html,en,Ian Milligan Search Ian Milligan Home Posts Bo...
5574,20210624,archivesunleashed.org,https://archivesunleashed.org/cohorts/,text/html,text/html,en,Archives Unleashed Cohorts - The Archives Unle...
5684,20210724,cs.uwaterloo.ca,https://cs.uwaterloo.ca/~jimmylin/publications...,text/html,application/xhtml+xml,en,Jimmy Lin » Publications Jimmy Lin University ...
5792,20210724,archivesunleashed.org,https://archivesunleashed.org/cloud/,text/html,text/html,en,The Archives Unleashed Cloud - The Archives Un...


What if we want to filter for multiple terms?

In [23]:
web_pages_multi_term_filter = web_pages[web_pages['content'].str.contains("ruest", na=False, case=False) | web_pages['content'].str.contains("milligan", na=False, case=False)]

In [24]:
web_pages_multi_term_filter.shape[0]

1818
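Since `str.contains` treats its pattern as a regular expression by default, the two conditions can also be folded into a single pattern with `|`. The sketch below uses a toy DataFrame (hypothetical data) to check that both forms select the same rows, and shows `&` for requiring both terms.

```python
import pandas as pd

# Toy data (hypothetical) standing in for the content column.
df = pd.DataFrame(
    {"content": ["Nick Ruest", "Ian Milligan", "Ruest and Milligan", "Jimmy Lin", None]}
)

has_ruest = df["content"].str.contains("ruest", case=False, na=False)
has_milligan = df["content"].str.contains("milligan", case=False, na=False)

# Either term: two masks joined with |, or one regex with alternation.
either = has_ruest | has_milligan
either_regex = df["content"].str.contains("ruest|milligan", case=False, na=False)

# Both terms: join the masks with & instead.
both = has_ruest & has_milligan

print(df[either_regex])
print(df[both])
```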

In [25]:
web_pages_multi_term_filter

Unnamed: 0,crawl_date,domain,url,mime_type_web_server,mime_type_tika,language,content
4,20200624,cs.uwaterloo.ca,https://cs.uwaterloo.ca/~jimmylin/publications...,text/html,application/xhtml+xml,en,Jimmy Lin » Publications Jimmy Lin University ...
36,20200624,ianmilligan.ca,https://www.ianmilligan.ca/,text/html,text/html,en,Ian Milligan Search Ian Milligan Home Posts Bo...
41,20200624,ianmilligan.ca,https://www.ianmilligan.ca/post/summers-review/,text/html,text/html,en,Review Essay featuring History in the Age of A...
42,20200624,ianmilligan.ca,https://www.ianmilligan.ca/post/preserving-our...,text/html,text/html,en,Waterloo Stories: Preserving our Digital Histo...
43,20200624,ianmilligan.ca,https://www.ianmilligan.ca/talk/au-montreal/,text/html,text/html,en,Archives Unleashed Datathon - Montreal (IIPC) ...
...,...,...,...,...,...,...,...
5574,20210624,archivesunleashed.org,https://archivesunleashed.org/cohorts/,text/html,text/html,en,Archives Unleashed Cohorts - The Archives Unle...
5603,20210624,aut.docs.archivesunleashed.org,https://aut.docs.archivesunleashed.org/docs/to...,text/html,text/html,en,Toolkit Walkthrough · Archives Unleashed Toolk...
5684,20210724,cs.uwaterloo.ca,https://cs.uwaterloo.ca/~jimmylin/publications...,text/html,application/xhtml+xml,en,Jimmy Lin » Publications Jimmy Lin University ...
5792,20210724,archivesunleashed.org,https://archivesunleashed.org/cloud/,text/html,text/html,en,The Archives Unleashed Cloud - The Archives Un...


Then we can write out these new filtered DataFrames to csv if we want.

In [29]:
web_pages_multi_term_filter.to_csv("web_pages_multi_term_filter.csv", index=False)

In [30]:
!ls web_pages_multi_term_filter.csv

web_pages_multi_term_filter.csv
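Passing `index=False` keeps the DataFrame's row index out of the written file. A minimal sketch, using an in-memory buffer and made-up data instead of a real file, compares the header line with and without it:

```python
import io

import pandas as pd

# Toy frame (hypothetical) whose index mimics a row label left over from filtering.
df = pd.DataFrame({"crawl_date": [20200624], "domain": ["example.org"]}, index=[36])

with_index = io.StringIO()
df.to_csv(with_index)  # default: the index is written as an unnamed first column

without_index = io.StringIO()
df.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]

print(header_with)
print(header_without)
```

With the default, the header line starts with an empty field for the index; with `index=False`, it starts directly at `crawl_date`.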


In [31]:
!head web_pages_multi_term_filter.csv

crawl_date,domain,url,mime_type_web_server,mime_type_tika,language,content
20200624,cs.uwaterloo.ca,https://cs.uwaterloo.ca/~jimmylin/publications/index.html,text/html,application/xhtml+xml,en,"Jimmy Lin » Publications Jimmy Lin University of Waterloo Home Publications Projects Students Teaching Resources Publications Restrict? Restrict? deep learning, neural networks big data, large-scale data processing reproducibility, evaluation issues and methodology Twitter, real-time search and filtering information seeking, user interaction, visualization information retrieval medical and biomedical informatics question answering, document summarization computational social science, digital humanities natural language processing, computational linguistics Jump to: 2020 2019 | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 2009 | 2008 | 2007 | 2006 | 2005 | 2004 | 2003 | 2002 | 2001 | 2000 1999 | 1998 2020 450. Nick Ruest, Jimmy Lin, Ian Milligan, and Samantha Fritz. The Archives U