# Please don't edit directly in this document. Create your own copy first.

# This is a demo of scraping the [Canadian Archive of Women in STEM](https://biblio.uottawa.ca/en/women-in-stem) with the [requests](https://github.com/psf/requests) and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python libraries.

---



# Extract

First, let's run the cell below to import the necessary libraries. Most commonly used Python libraries are pre-installed; additional ones can be installed with *!pip install [package name]* or *!apt-get install [package name]*.

## 1. Libraries
*   [requests](https://github.com/psf/requests): an elegant and simple HTTP library for Python.
*   [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): a Python library for pulling data out of HTML and XML files.






In [None]:
# install
#!pip install requests
#!pip install beautifulsoup4

# import
import requests
from bs4 import BeautifulSoup

Second, set the URL of the website we'd like to extract data from, then download the page with the requests library that we imported in the first step. If the request succeeds, you should see the output <Response [200]>.

## 2. Set the URL

In [None]:
# Set the URL you want to scrape from
url='https://biblio.uottawa.ca/en/women-in-stem'

# Connect to the URL and download document
response = requests.get(url)
response

<Response [200]>

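Displaying the response is a quick visual check. As a minimal sketch, the same check can be made explicit in code. The helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original demo.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, failing loudly on HTTP errors.

    fetch_html is a hypothetical helper name used only in this sketch.
    """
    # timeout stops the call from waiting forever if the server never answers
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Usage (needs network access, so it is left commented out here):
# html = fetch_html("https://biblio.uottawa.ca/en/women-in-stem")
```

With this approach a failed download stops the notebook at the request step instead of surfacing later as an empty result list.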
Third, parse the HTML with Beautiful Soup.

## 3. Parse HTML file

In [None]:
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
soup

<!DOCTYPE html>

<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lt IE 9]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 9]><html class="ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gt IE 9)|(gt IEMobile 7)]><!--><html dir="ltr" lang="en" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns#"><!--<![endif]-->
<head>
<title>Canadian Archive of Women in STEM | Library | University of Ottawa</title>
<meta content="width" name="MobileOptimized"/>
<meta content="true" name="HandheldFriendly"/>
<meta content="width=device-width" name="viewport"/>
<meta charset="utf-8">
<link href="https://biblio.uottawa.ca/sites/all/themes/custom/uottawa_zen_assets/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<meta content="Drupal 7 (https://www.drupal.org)" name="generator">
<link href="https://biblio.uottawa.ca/en/women-in-stem" rel="canonical"/>
<link href="https://biblio.uottawa.ca/en/women-in-stem" rel="shortlink"/>
<meta cont

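Displaying the whole soup object prints the entire page. A lighter check, sketched here on a small inline HTML string that stands in for `response.text`, is to look at the page title and a short preview. The names `sample_html` and `demo_soup` are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Inline stand-in for response.text so the example runs without network access
sample_html = """<html><head>
<title>Canadian Archive of Women in STEM | Library | University of Ottawa</title>
</head><body><p>Hello</p></body></html>"""

demo_soup = BeautifulSoup(sample_html, "html.parser")

# The <title> tag is a quick sanity check that the expected page was parsed
page_title = demo_soup.title.string
print(page_title)

# prettify() returns the indented markup as a string; slice it for a short preview
preview = demo_soup.prettify()[:200]
print(preview)
```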
Fourth, find elements by tag name and attributes.

### Syntax: find_all(name, attrs)
Returns a list of every element that matches the given tag name and attributes.
### Syntax: find(name, attrs)
Returns only the first matching element, or None if nothing matches.

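To make the difference between the two calls concrete, here is a small sketch on a hand-written HTML snippet. The class names `field title` and `field host` are made up for the example and are shorter than the ones on the real page.

```python
from bs4 import BeautifulSoup

# Hand-written HTML; the class names are invented for this example
sample_html = """
<div class="field title">First fonds</div>
<div class="field title">Second fonds</div>
<div class="field host">Some University</div>
"""
demo_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element only
first = demo_soup.find("div", attrs={"class": "field title"})

# find_all() returns a list-like result with every matching element
all_titles = demo_soup.find_all("div", attrs={"class": "field title"})

# find() returns None when nothing matches, so calling .text on it would fail
missing = demo_soup.find("div", attrs={"class": "no-such-class"})

print(first.text)
print([div.text for div in all_titles])
print(missing)
```

Because `find()` can return None, it is worth checking the result before reading `.text` when the page layout might change.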
## 4. Extract fonds title, hosting institution, description and STEM fields.

In [None]:
# First element only
soup.find('div', attrs={'class': 'field field-name-title-field field-type-text field-label-hidden'}).text


' Margaret M. Allemang fonds'

In [None]:
# Find fonds_title: save as list

title = [fonds_title.text for fonds_title in soup.find_all('div', attrs={'class': 'field field-name-title-field field-type-text field-label-hidden'})]
title

[' Margaret M. Allemang fonds',
 'Academic Faculties and Schools: School of Nursing',
 'Ada Funnell fonds',
 'Adele A. Crowder fonds',
 "Admission of Women to Queen's Medical School (20th Century)",
 'Agnes Isabel Tennant fonds',
 'Agnes Marion Ayre Herbarium collection',
 'Alfreda Jean Attrill collection',
 'Alice Girard fonds ',
 'Allie Vibert Douglas fonds',
 "Alumnae Association of the Women's College Hospital School of Nursing fonds ",
 'Alumnae Association, School of Nursing, Toronto General Hospital fonds',
 'Amelia Taylor Anderson fonds',
 'Amelia Yeomans file',
 'Anila Maskeri fonds',
 'Ann E. McJanet fonds',
 'Anna Stahmer fonds',
 'Annie Elaine Bryenton fonds',
 'Annie Green series (part of Green Family fonds)',
 'Annie Holiday Pelletier Fonds',
 'Association of Consulting Engineers of Canada fonds',
 'Association of Registered Nurses of Newfoundland and Labrador (ARNNL) fonds',
 'Audrey Maureen Cowling fonds',
 'Augusta Stowe-Gullen fonds',
 "Balaclava Junior Women's Instit

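Some titles in the output above carry leading or trailing spaces (for example ' Margaret M. Allemang fonds'). As a sketch of one way to clean them, `get_text(strip=True)` removes the surrounding whitespace. The short class name `title` in the sample HTML is an assumption for the example.

```python
from bs4 import BeautifulSoup

# Two titles copied from the output above, wrapped in made-up markup
sample_html = """
<div class="title"> Margaret M. Allemang fonds</div>
<div class="title">Alice Girard fonds </div>
"""
demo_soup = BeautifulSoup(sample_html, "html.parser")
title_divs = demo_soup.find_all("div", attrs={"class": "title"})

# .text keeps the whitespace exactly as it appears in the page
raw_titles = [div.text for div in title_divs]

# get_text(strip=True) trims whitespace around each piece of text
clean_titles = [div.get_text(strip=True) for div in title_divs]

print(raw_titles)
print(clean_titles)
```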
In [None]:
# Hosting institutions: save as list

hosting = [hosting_institutions.text for hosting_institutions in soup.find_all('div', attrs={'class': 'field field-name-uottawa-women-organization field-type-text field-label-hidden'})]
hosting

['Margaret M. Allemang Centre for the History of Nursing',
 'Trinity Western University',
 "Queen's University",
 "Queen's University",
 "Queen's University",
 'National Defence Headquarters Directorate of History and Heritage',
 'Memorial University of Newfoundland',
 'Health Sciences Centre Winnipeg',
 'University of Montreal',
 "Queen's University",
 "Women's College Hospital",
 'City of Toronto',
 'Glenbow Museum',
 'University of Manitoba',
 'Ryerson University',
 'University of Toronto',
 'Ryerson University',
 'Provincial Archives of New Brunswick ',
 "Queen's University",
 'McGill University ',
 'Library and Archives Canada',
 'Memorial University of Newfoundland',
 'University of Toronto',
 'Victoria University',
 'Grey County Archives',
 'McGill University',
 'Provincial Archives of New Brunswick ',
 'Ryerson University',
 'McGill University',
 'University of Manitoba',
 'University of British Columbia',
 'Concordia University   ',
 'Provincial Archives of New Brunswick ',
 '

In [None]:
# Description: save as list

description = [desc.text for desc in soup.find_all('div', attrs={'class': 'field field-name-uottawa-women-scope field-type-text-long field-label-hidden'})]
description

["Collection consists of records concerning the Canadian Nursing Sisters of World War I and II Oral History Program undertaken in two stages, between 1977 and 1980, and from 1987 to 1995, under the project direction of Margaret Allemang. Records include transcripts, audio cassettes, and photographs of interviewees; the first series contains 19 printed volumes, 46 audio cassettes, and 28 photographs of Canadian Nursing Sisters of WWI interviewees, while the second series consists of 2 printed volumes documenting Canadian Nursing Sisters of WWII, 'Their Lives and Experiences in a Changing Society.' Also included in the collection are observational records from a study on the experiences of eight cardiac patients during a period of hospitalization in a general hospital, performed in cooperation with the Toronto Western Hospital, which resulted in the completion of a report in 1960. Additionally, records from a research project based on educational theory concerning patients’ perceptions o

In [None]:
# STEM Fields: save as list

stem = [stem_fields.text for stem_fields in soup.find_all('div', attrs={'class': 'field field-name-uottawa-women-category field-type-entityreference field-label-hidden'})]
stem

['Nursing',
 'Nursing',
 'Medicine',
 'Biology',
 'Medicine',
 'Nursing',
 'Botany',
 'Nursing',
 'Nursing',
 'Astrophysics',
 'Nursing',
 'Nursing',
 'Nursing',
 'Medicine',
 'Nursing',
 'Architecture',
 'Trades and Technology',
 'Nursing',
 'Medicine',
 'Science',
 'Engineering',
 'Nursing',
 'DentistryNursing',
 'Medicine',
 'Home Economics',
 'Genetics',
 'Nursing',
 'Information technology',
 'Nursing',
 'Biochemistry',
 'Home Economics',
 'Engineering',
 'Public HealthNursing',
 'Nursing',
 'Nursing',
 'Gerontology',
 'Medicine',
 'Nursing',
 'Pharmacy',
 'Physiotherapy',
 'Nursing',
 'Pathology',
 'Biochemistry',
 'Botany',
 'Engineering',
 'Engineering',
 'Nutrition',
 'Nursing',
 'Botany',
 'Nutrition',
 'Information technology',
 'Microbiology',
 'Nursing',
 'Information technology',
 'Nursing',
 'Ecology',
 'Science',
 'MathematicsPhysics ',
 'ScienceEngineering',
 'Medicine',
 'Medicine',
 'Geology',
 'Nursing',
 'Nursing',
 'Nursing',
 'Nursing',
 'Medicine',
 'Nursing',
 

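Values such as 'DentistryNursing' and 'Public HealthNursing' in the output above are two fields fused together, because `.text` joins the text of all child elements with nothing in between. The sketch below assumes each field sits in its own child element, which the fused values suggest but which we have not checked against the real page. The markup and the class name `category` are made up.

```python
from bs4 import BeautifulSoup

# Made-up markup: one record with two fields, one record with a single field
sample_html = """
<div class="category"><div>Dentistry</div><div>Nursing</div></div>
<div class="category"><div>Medicine</div></div>
"""
demo_soup = BeautifulSoup(sample_html, "html.parser")
category_divs = demo_soup.find_all("div", attrs={"class": "category"})

# .text concatenates child text with no separator
fused = [div.text for div in category_divs]

# get_text() accepts a separator that is placed between the pieces of text
separated = [div.get_text(separator="; ", strip=True) for div in category_divs]

print(fused)
print(separated)
```

A separator that does not occur inside the field names, such as "; ", makes it possible to split the values again later.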
# Export to CSV
Import the necessary libraries. The file will be saved in the virtual machine, so to download the CSV file to your local computer you need to import *files* from google.colab.

## 1. Libraries

*   [pandas](https://pandas.pydata.org/): open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.




In [None]:
# import
import pandas as pd
from google.colab import files

You need to combine the four lists (title, hosting institution, description, and STEM fields) into one data frame, with one fonds per row. The easiest approach is to create an empty data frame and then add each list as a column.

## 2. DataFrame

In [None]:
# Create an empty data frame
df = pd.DataFrame()

# Add a title to df
df['Title'] = title

# Add a hosting institution to df
df['Hosting'] = hosting

# Add description to df
df['Description'] = description

# Add STEM fields to df
df['STEM'] = stem

df

| | Title | Hosting | Description | STEM |
|---|---|---|---|---|
| 0 | Margaret M. Allemang fonds | Margaret M. Allemang Centre for the History of... | Collection consists of records concerning the ... | Nursing |
| 1 | Academic Faculties and Schools: School of Nursing | Trinity Western University | The fonds consists of records that document th... | Nursing |
| 2 | Ada Funnell fonds | Queen's University | The fonds consists of a few pieces of correspo... | Medicine |
| 3 | Adele A. Crowder fonds | Queen's University | The fonds consists of three articles (one typs... | Biology |
| 4 | Admission of Women to Queen's Medical School (... | Queen's University | The article details the story of the readmissi... | Medicine |
| ... | ... | ... | ... | ... |
| 413 | Women in Engineering Committee fonds | Ryerson University | Fonds contains records related to the Women in... | Engineering |
| 414 | Women in Medicine Oral History Project Collection | University of Toronto | For information, contact the host institution.... | Medicine |
| 415 | Women's College Hospital Board of Directors fonds | Women's College Hospital | The fonds consists of the records of the Women... | Medicine |
| 416 | Women's College Hospital School of Nursing fonds | Women's College Hospital | The fonds consists of the records of the Women... | Nursing |


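Adding the lists column by column only works when all of them have the same length; otherwise pandas raises a ValueError. As a minimal sketch, the lengths can be checked before the data frame is built. The helper name `build_frame` is an assumption for this example, and the sample rows are copied from the table above.

```python
import pandas as pd


def build_frame(columns: dict) -> pd.DataFrame:
    """Build a data frame from named lists after checking their lengths.

    build_frame is a hypothetical helper name used only in this sketch.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    # More than one distinct length means at least one list is misaligned
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return pd.DataFrame(columns)


demo_df = build_frame({
    "Title": ["Ada Funnell fonds", "Adele A. Crowder fonds"],
    "Hosting": ["Queen's University", "Queen's University"],
    "STEM": ["Medicine", "Biology"],
})
print(demo_df.shape)
```

The error message lists the length of every column, which makes it easier to see which selector missed some records.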
You can save df (the data frame) as a CSV file using df.to_csv().

## 3. Save as CSV

In [None]:
# Save df in the virtual machine provided by Google (sep='\t' writes tab-separated values)
df.to_csv('women_stem.csv', sep='\t', encoding='utf-8')

# Download the file to your computer
files.download("women_stem.csv")

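To check that the saved file can be read back, here is a small round-trip sketch that also runs outside Colab. The file name `women_stem_demo.tsv` and the temporary directory are assumptions for the example, and `index=False` is a variation on the call above rather than what the demo itself uses.

```python
import os
import tempfile

import pandas as pd

# Two sample rows copied from the data frame shown earlier
demo_df = pd.DataFrame({
    "Title": ["Ada Funnell fonds", "Adele A. Crowder fonds"],
    "STEM": ["Medicine", "Biology"],
})

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "women_stem_demo.tsv")
    # index=False leaves out the row numbers, which would otherwise come
    # back as an extra "Unnamed: 0" column when the file is read again
    demo_df.to_csv(path, sep="\t", encoding="utf-8", index=False)
    # The separator used for reading must match the one used for writing
    round_trip = pd.read_csv(path, sep="\t", encoding="utf-8")

print(list(round_trip.columns))
```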
