# How to use this notebook

Install the dependencies listed in `requirements.txt`:

```
pip install -r requirements.txt
```

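Then run the notebook cells in order, from top to bottom. Cells 7 and 8 fetch pages from merlot.org, so they need network access and take a while to finish.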

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
MERLOT_LIST_PAGE_URL = "https://merlot.org/merlot/materials.htm"

In [3]:
MERLOT_VIVEW_PAGE_URL = "https://merlot.org/merlot/viewMaterial.htm"

In [4]:
SCRAPED_PAGE_NUMBERS = range(1, 51)  # list pages 1 through 50

In [5]:
EXTRACTED_IDS = []

In [6]:
EXTRACTED_MATERIAL_ENTITIES = []
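
The two loops that follow issue one HTTP request per list page and one per material, so a single dropped connection stops the whole run. A minimal sketch of a retry wrapper is shown here, assuming that retrying with exponential backoff is acceptable for this site. `fetch_with_retry` and its parameters are hypothetical names, not part of the notebook, and the fetch function is passed in so the sketch runs without network access.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url); on failure wait and try again, doubling the wait each time."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:  # assumption: any exception is worth retrying
            last_error = error
            if attempt < retries - 1:
                # Waits backoff_seconds, then 2x, then 4x, ...
                sleep(backoff_seconds * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

In the notebook, `fetch` could be a small function that calls `requests.get(url, timeout=30)`, then `raise_for_status()`, and returns `res.text`. Catching only `requests.RequestException` instead of `Exception` would be the narrower choice there.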

In [7]:
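for page_num in SCRAPED_PAGE_NUMBERS:
    # Fetch one page of the material list; fail fast on HTTP errors or a hung connection.
    res = requests.get(f"{MERLOT_LIST_PAGE_URL}?page={page_num}", timeout=30)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, "html.parser")
    # Each material on the list page is rendered as one card.
    materials = soup.find_all('div', class_='card merlot-material-item')
    for material in materials:
        # The card's first link carries the material id in its `id=` query parameter.
        title_a = material.find('a')
        material_id = title_a['href'].split('id=')[1]
        EXTRACTED_IDS.append(material_id)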
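
Splitting the link on `'id='` works as long as `id` is the last query parameter and no other parameter name ends in `id`. A more defensive sketch parses the query string with the standard library instead. The `href` values below are made-up examples, and `extract_material_id` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_material_id(href: str) -> Optional[str]:
    """Return the value of the `id` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(href).query)
    values = query.get("id")
    return values[0] if values else None


# Hypothetical links: one with a trailing parameter, one without an id.
print(extract_material_id("/merlot/viewMaterial.htm?id=1234&hitlist=abc"))  # → 1234
print(extract_material_id("/merlot/materials.htm?page=2"))                  # → None
```

With the first link, `split('id=')[1]` would return `1234&hitlist=abc`, which is why parsing the query string is the safer choice if the link format ever changes.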

In [8]:
for material_id in EXTRACTED_IDS:
    # Fetch the detail page of one material.
    res = requests.get(f"{MERLOT_VIVEW_PAGE_URL}?id={material_id}", timeout=30)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, "html.parser")
    material_name = soup.find('h2').text.strip()
    material_desc = soup.find('div', id='material_description').text.strip()

    # Each <li> in the discipline list holds one raw discipline entry.
    discipline_list = soup.find('ul', class_='list-unstyled list-small-mb')
    material_raw_disciplines = [li.text.strip() for li in discipline_list.find_all('li')]

    # Collapse whitespace and replace '/' with '+'; each '+' separates a different discipline.
    material_disciplines = [" ".join(dsc.split()).replace('/', '+') for dsc in material_raw_disciplines]

    # The "more about" blocks are <dt>/<dd> pairs: label -> value.
    material_meta_data = {}
    for row in soup.find_all('div', class_='col detail-more-about'):
        dts = row.find_all('dt')
        dds = row.find_all('dd')
        for label, value in zip(dts, dds):
            material_meta_data[label.text.strip()] = value.text.strip()

    material_entity = {
        'name': material_name,
        'description': material_desc,
        'disciplines': material_disciplines,
    }
    # Add every metadata label as its own key (it becomes a DataFrame column later).
    material_entity.update(material_meta_data)

    EXTRACTED_MATERIAL_ENTITIES.append(material_entity)
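
Because `.strip()` only trims the ends of a string, metadata values keep the line breaks and runs of spaces from the page markup, as in `"Jon Mueller, \nNorth Central College, IL"` in the output further down. The sketch below shows one way to clean such values and to split a raw discipline entry into separate names. Both helper names are hypothetical, and the raw discipline string is an assumed input shape inferred from the `'/'` handling in the cell above.

```python
from typing import List


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return " ".join(text.split())


def split_disciplines(raw: str, separator: str = "/") -> List[str]:
    """Split one raw discipline entry into its cleaned, non-empty parts."""
    return [collapse_whitespace(part) for part in raw.split(separator) if part.strip()]


print(collapse_whitespace("Jon Mueller, \nNorth Central College, IL"))
# → Jon Mueller, North Central College, IL

# Assumed raw form of one <li>: names separated by '/', with stray whitespace.
print(split_disciplines("Business /\n      Marketing / Market Research"))
# → ['Business', 'Marketing', 'Market Research']
```

Storing the parts as a list rather than as one `'+'`-joined string is a design choice: it avoids a second parsing step later, but it would misbehave if a discipline name itself contained a `/`.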

In [9]:
extracted_df = pd.DataFrame(EXTRACTED_MATERIAL_ENTITIES)

In [10]:
extracted_df.to_csv('merlot.csv')
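
One thing to keep in mind when reading `merlot.csv` back: CSV has no list type, so each `disciplines` list is written as its text form, for example `"['Business + Marketing + Market Research']"`. The sketch below reproduces the effect with the standard-library `csv` module as a stand-in for pandas, and restores the list with `ast.literal_eval`. The single row is taken from the output shown further down.

```python
import ast
import csv
import io

rows = [
    {
        "name": "Essentials of Marketing Research",
        "disciplines": ["Business + Marketing + Market Research"],
    }
]

# Write to an in-memory buffer instead of a file; the list is stored as str(list).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "disciplines"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the column now holds plain text, not a list.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
as_text = loaded[0]["disciplines"]

# literal_eval safely parses the text form back into a Python list.
as_list = ast.literal_eval(as_text)
print(type(as_text).__name__, type(as_list).__name__)
```

With pandas, the equivalent step after `pd.read_csv('merlot.csv')` would likely be `df['disciplines'].apply(ast.literal_eval)`, assuming the column has no missing values.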

In [11]:
extracted_df

Unnamed: 0,name,description,disciplines,Material Type:,Date Added to MERLOT:,Date Modified in MERLOT:,Author:,Submitter:,Primary Audience:,Technical Format:,Mobile Compatibility:,Technical Requirements:,Language:,Cost Involved:,Source Code Available:,Accessibility Information Available:,Creative Commons:,Authors:,Languages:
0,Authentic Assessment Toolbox,The Authentic Assessment Toolbox site is a tut...,[Academic Support Services + Faculty Developme...,Tutorial,"January 6, 2003","October 11, 2022","Jon Mueller, \nNorth Central College, IL",Cris Guenter,"College General Ed,\n \n ...",Website,Not specified at this time,Internet Explorer is recommend by the author.,English,No,No,No,This work is licensed under a\n ...,,
1,DNA from the Beginning,DNA from the Beginning is an animated tutorial...,[Science and Technology + Agriculture and Envi...,Simulation,"April 11, 2000","September 21, 2022",Cold Spring Harbor Laboratory,Jeff Bell,College General Ed,Website,Not specified at this time,Flash 3 and RealAudio,English,No,No,No,No,,
2,Assessing Blood Pressure,ADOBE FLASH REQUIRED. This learning module pr...,[Workforce Development + Technical Allied Heal...,Tutorial,"April 21, 2004","July 9, 2022","Andrew Winterstein, \nUniversity of Wisconsin ...",Andrew Winterstein,College General Ed,Website,Not specified at this time,,English,No,No,No,No,,
3,LangMedia Foreign Language Media Archive,This site presents information on culture and ...,[Humanities + World Languages + Multilingual R...,Simulation,"January 10, 2002","May 6, 2022","Lang Media, \n Five Colleges",Elizabeth Pyatt,College General Ed,Website,Not specified at this time,Real Player required.,English,No,No,No,No,,
4,Mathematical Visualization Toolkit,This site consists of a collection of plotting...,[Mathematics and Statistics + Mathematics + Ca...,Simulation,"July 17, 2001","October 13, 2022","University of Colorado at Boulder, Department ...",Kurt Cogswell,College General Ed,"Website,\n \n...",Not specified at this time,"Java-enabled Browser, preferably Internet Expl...",English,No,No,No,No,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1195,Endangered Species,A comprehensive gateway site to information on...,[Science and Technology + Biology + Ecology an...,Reference Material,"December 27, 2002","September 14, 2022",North American Association for Environmental E...,Marty Zahn,"Grade School,\n \n ...",,Not specified at this time,,English,No,No,No,No,,
1196,Escher Web Sketch,Drawing program which transforms into symmetri...,[Mathematics and Statistics + Mathematics + Ge...,Simulation,"June 12, 1997","June 16, 2021",,Ric Stewart,College General Ed,,Not specified at this time,,English,No,No,Unknown,No,Wes Hardaker\n \n ...,
1197,Essay Punch,Commercial product that proposes a topic and l...,[Humanities + World Languages + ESL or EFL + L...,Tutorial,"March 13, 2001","May 21, 2020",Merit Software,Jeanne Gilleland,College General Ed,Website,Not specified at this time,,English,No,Yes,No,No,,
1198,Essentials of Marketing Research,"This is a free, online textbook offered by Boo...",[Business + Marketing + Market Research],Open (Access) Textbook,"January 12, 2011","July 29, 2020","Paurav Shukla, \n Bookboon.com",Cathy Swift,"College Upper Division,\n \n ...",PDF,Not specified at this time,,English,No,Unknown,Unknown,Unknown,,
