## Collecting data from websites

Have you ever needed to collect data from websites where the data is not made readily available? If you have, then you probably spent a significant amount of time copying and pasting from the website to a spreadsheet, and trying to carefully collect only the information that you need, while avoiding mistakes based on copying and pasting or typing. If your project required that data be collected from *many* pages, then this likely became a painful and repetitive effort that occupied a substantial amount of time.

Fortunately, your knowledge of Python can facilitate the data collection process through libraries designed to automate the collection of data from large numbers of pages. This class, we will focus on how to use a few of these libraries to streamline the collection of information from websites. The best way to understand this process is to do it, so we will be walking through the process while learning about why we scrape data the way that we do.


## Parsing websites with Python

Obviously, if we want to scrape a website, we will first want to *access* that website. We can do this with the `requests` library, like we did before to grab some text for our regex experiments.

In [49]:
import requests
myPage = requests.get("https://poshmark.com/category/Women-Bags")

During this lesson, we will be using [Poshmark.com's Women's Bag Listings](https://poshmark.com/category/Women-Bags) as our example. This is a fun website for learning to scrape, because the website listings change all the time, so there is always something new to scrape. But this means that when you run my code you'll probably get different results based on the listings that are currently offered.

We will focus on exploring the page to see what information we can extract.

### Process the HTML

In [50]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(myPage.text)

The code above imports the `BeautifulSoup` library/function, and prepares our requested URL for scraping. When we feed our website into the parser, we need to make sure to pass the `text` attribute of the requested URL, since this is the place in which the full HTML of the page is stored. If we just pass the `myPage` object, then we will be unable to parse the HTML like we want to. Now, we simply store a parsed website as an object (in this case we call it `parsed`), and we are ready to go.

In [51]:
soup.title

<title>Women Bags on Poshmark</title>

    <title>Women Bags on Poshmark</title>



`BeautifulSoup`'s parsed pages are structured based on the HTML tags that are encountered within the page. For example, above we requested the `title` tag from the page, and we got back the full tag, as well as all content within that tag. In order to only return the text inside the tag, we can use the following code:

In [52]:
soup.title.text

'Women Bags on Poshmark'

    'Women Bags on Poshmark'



For a tag with nothing else embedded inside, this is a great way to extract the text. However, many tags will contain one or more other tags, which add to the formatting of the page. Other tags will also repeat multiple times on the same page (unlike the title tag), so we will have to differentiate between them.

The tag that we will be most interested in for now is the `div` tag, which is a generic tag wrapped around each individual listing on the website. Unfortunately, f we just look for the `div` tag like we did with the title, then we will get a whole bunch of stuff, some of which is useful for finding the listings on the page, and some is not.

Let's take a look at a single listing in our developer tools. Follow the link to the women's bags page, and right click the page and choose the "Inspect" tool. You can then hover over each element on the page and see how the code relates to each visible element on the page.

Here is a screenshot highlighting the important stuff:

![](images/listing_tile.png)

The first listing is highlighted, so we can see the code. The `div` tag that creates this listing has a **class** of `card--small` (note the double hyphen!). This is true of all listings. We can extract the first of these listings with the code below.

A BeautifulSoup-parsed document provides us an object that holds a `find` method. We can use this method to search through the parsed document/site for a tag with specific properties. We are going to look for a `div` tag with a class of `card--small`. The class argument is spelled `class_` with an underscore, since `class` is a reserved word in Python. 

In [53]:
soup.find('div', class_="card--small")

<div class="card card--small"><a class="tile__covershot" data-et-element-type="image" data-et-name="listing" data-et-on-page_group_id="689cf4e7e2ac7d82134f21a9" data-et-prop-category_id="00248975d97b4e80ef00a955" data-et-prop-department_id="000e8975d97b4e80ef00a955" data-et-prop-lister_id="637558284527462c43b10041" data-et-prop-listing_id="6890c5e124b20bc714d201dc" data-et-prop-location="listing_tile" data-et-prop-unit_position="0" href="/listing/Black-Wristlet-and-Wallet-lot4-pieces-total-checkbook-card-holder-6890c5e124b20bc714d201dc" target=""><div class="img__container img__container--square"><picture title=""><source srcset="https://di2ponv0v5otw.cloudfront.net/posts/2025/08/04/6890c5e124b20bc714d201dc/s_wp_6890c5f9a9a448b000d7f6ef.webp" type="image/webp"><source srcset="https://di2ponv0v5otw.cloudfront.net/posts/2025/08/04/6890c5e124b20bc714d201dc/s_6890c5f9a9a448b000d7f6ef.jpg" type="image/jpeg"><img alt="Black Wristlet and Wallet lot4 pieces total checkbook card holder" class="

Wow! There sure is a lot of stuff for us to work through within that tag! It turns out that the article tag contains *everything* related to a particular listing, so we will have to work through that information more carefully if we would like to be able to scrape information about each listing. 

The first thing that we need to do, though, is collect ALL of the listings, so that we can parse each one and collect the most useful information.

## Navigating scraped data

Our processed website has some other tools besides being able to search for a single tag. One of the most helpful is a method called `find_all`, which will allow us to look in a specific portion of the page (or across the whole page) for *all instances* of a specific tag. Before, we could only see the first instance of the tag we were searching for, but this will allow us to find all the listings on a page!

In order to not end up with a massive text blob for output, let's store the results of our `find_all` method in a list.

In [54]:
listings = [i for i in soup.find_all('div', class_="card--small")]

To store the article tags in a list, we use a simple list comprehension, so that each separate article tag is a new entry in the list called `listings`. One of the really cool things about `BeautifulSoup` is that each returned object is treated just like the full parsed webpage: we can use tags to walk through each of our new objects in the list, or to run another `find` or `find_all` method.

Let's try finding an `img` tag inside of the first listing, that contains the url to the image of the listed bag:

In [55]:
listings[0].img

<img alt="Black Wristlet and Wallet lot4 pieces total checkbook card holder" class="ovf--h d--b" src="https://di2ponv0v5otw.cloudfront.net/posts/2025/08/04/6890c5e124b20bc714d201dc/s_6890c5f9a9a448b000d7f6ef.jpg" title=""/>

Awesome! We can walk even extract the characteristics of this tag to get that link!

In [56]:
listings[0].img['src']

'https://di2ponv0v5otw.cloudfront.net/posts/2025/08/04/6890c5e124b20bc714d201dc/s_6890c5f9a9a448b000d7f6ef.jpg'

Next, let's see how many articles are stored on each page of search results:

In [57]:
len(listings)

48

It looks like our results page has 48 results. How do we figure this all out? Remember that when we opened a page of results up, and we used the developer "Inspect" tool built into our browser to help us find the part of the page that contains the information we care about. THIS WILL BE DIFFERENT FOR EVERY WEBSITE. As we prepare to scrape a page, we will spend a lot of time going back and forth between the website as we see it, and the code that we are designing to scrape that website. It really is more of an art than a science, and is highly specific to the page that we are scraping.

As we look through our list of articles, though, we will want to start extracting information that will help us learn about each bag. Let's try our hand at finding the name of the listings, and the price of each one. Fortunately, this information won't be TOO hard to find. If we inspect the title of the first result (using the link that we started with at the top of the notebook), we can see that the name of the listing is stored within the div tag using ANOTHER `div` tag, but with a class `title__condition__container`. Let's request that from our list:

In [58]:
listings[0].find('div', class_="title__condition__container")

<div class="title__condition__container"><a class="tile__title tc--b" data-et-element-type="link" data-et-name="listing" data-et-on-page_group_id="689cf4e7e2ac7d82134f21a9" data-et-prop-category_id="00248975d97b4e80ef00a955" data-et-prop-department_id="000e8975d97b4e80ef00a955" data-et-prop-lister_id="637558284527462c43b10041" data-et-prop-listing_id="6890c5e124b20bc714d201dc" data-et-prop-location="listing_tile" data-et-prop-unit_position="0" href="/listing/Black-Wristlet-and-Wallet-lot4-pieces-total-checkbook-card-holder-6890c5e124b20bc714d201dc">
            Black Wristlet and Wallet lot4 pieces total checkbook card holder
          </a><div class="d--fl ai--c m--l--1"><div style="display:none;"><div class="d--fl ai--c"><i class="icon posh-star-small"></i></div></div><div style="display:none;"><span class="condition-tag all-caps tr--uppercase condition-tag--small">
</span></div></div></div>

<br>
Okay, so we got the tag back, and the title is in there, but it's a MESS! How do we get down to just the information we want?

If there is text inside of a tag (that is not itself between the `<` and `>` of a tag), then we can use the `.text` attribute to just pull the text from the tag:

In [59]:
listings[0].find('div', class_="title__condition__container").text

'\n            Black Wristlet and Wallet lot4 pieces total checkbook card holder\n          \n'

<br>

Closer! We got the listing name, but it's still looks a bit messy with those whitespace characters...

We can use the string method `.strip()` to cut the whitespace off of the ends of the label. Let's do that now.

In [60]:
listings[0].find('div', class_="title__condition__container").text.strip()

'Black Wristlet and Wallet lot4 pieces total checkbook card holder'

<br>

There we go! Now we have what we want, so let's create a loop to extract the same information from each listing. Each one is structured in the same way, so we can easily loop through each listing in a list comprehension.

In [61]:
[tile.find('div', class_="title__condition__container").text.strip() for tile in listings]

['Black Wristlet and Wallet lot4 pieces total checkbook card holder',
 'Lilly Pulitzer Estée Lauder Lemon Tropical\u205fCanvas Tote\u205fBag',
 'Gucci Beige/Brown GG Canvas and Leather Web Compact Wallet',
 'Vintage Black Christian Dior Crossbody/Shoulder bag | Good Condition',
 'Lululemon Green Tote Bag',
 'Mytagalongs Velour Fanny\u205fPack Waist Bag',
 'Vintage Roses on Satin Classic Taupe Mini Handbag Leather Handle',
 'NWOT An Reich Teal Pouch',
 'Michael Kors Eva medium tote - EUC',
 '🧩Lululemon All Night Festival Bag 5L - Black',
 'Luli Bebe Petit Monaco Diaper Backpack Ebony Black',
 'Michael Kors Large Travel Continental Wallet Wristlet Black Multi\n          \n      NWT',
 'Urban Fit by Urban Expressions Quilted Puffer Hobo Carry All Tote in Brown',
 'Wandler Anna Belt Bag in Spicy Orange',
 'ALO Yoga Explorer Fanny Pack - Adjustable Waist/Shoulder Strap 3L Nylon\n          \n      NWT',
 'NWT Dodo Bar Or Lili Bag 100% Cotton Tote Purse Blue Purple Embroidered Checker\n      

That was the easy part. Now that we have the listing names, we need to find their prices. We will start by finding their prices on the website itself. If we poke around the website, we can see that the prices are listed in dollars, and that they are always inside of a `div` tag (this site seems to love those) with a bizarre class name of `m--t--1`. Who knows what this means (I sure don't), but it contains the number we want to collect.

At this point, it's worth noting that there are a few other ways that we can move around the website. First, using the `.find()` method, we can search by tag, we can search by class (with the `class_` argument), AND we can search by the text that the tag contains with the `string` argument. So something like `.find("div", string="$")` would search for a div with text that is EXACTLY EQUAL to "$". Sometimes helpful, sometimes not.

We can also move from one tag to the next adjacent (sibling) tag using the `.next_sibling` or `.previous_sibling` attributes of a find result.

For now, though, we just need to find a tag:

In [62]:
listings[0].find('div', class_="m--t--1").text

'\n              $25\n            '

Got it! This isn't so bad if we just move slowly. Unfortunately, the text contains more stuff than just the price in Euros. It turns out that the website just has a blob of text that contains prices, possibly in dollars, possibly in Euros, and possibly both, with some extra text at the end. Since price isn't a consistent number of digits, we need a way to recognize patterns in text and extract only the part that we want.

It turns out that the website just has a blob of text that contains prices in dollars, with a dollar sign in the string and some extra white space around it. Since price isn't a consistent number of digits, we need a way to recognize patterns in text and extract only the part that we want.

Regular expression comes to the rescue!

In [63]:
import re

float(
    re.search(r'(?:[$])(\d{1,3}(?:,\d{3})?)', 
          listings[0].find('div', class_="m--t--1").text).groups()[0].replace(",","")
)

25.0

`r'(?:[$])(\d{1,3}(?:,\d{1,3})?)'` is a regular expression that looks for a dollar sign (in a non-collecting group), then one to three numbers, possibly followed by a comma and three more digits.

When we get back the results from this search, we only need the first group (or value in parentheses), which omits the dollar symbol but includes the entire number. This expression allows for prices from \$1 to \$999,999 (I don't think there are million dollar items on Poshmark, but I could be wrong!). It's a string, but we can easily convert it to a number using the `float()` function once we have removed the commas with a `replace` function.

Now that we know how to find each of the two values that we care about, it is time to start formalizing our code with a `for` loop to grab the same pieces of information from each listing. We can use our loop to walk through the HTML associated with each tile on the results page and extract the relevant information.

In [64]:
import numpy

data = []

for tile in listings:
    row = []
    try:
        row.append(tile.find('div', class_="title__condition__container").text.strip())
    except:
        row.append('')
    try:
        row.append(
            float(
                re.search(r'(?:[$])(\d{1,3}(?:,\d{3})?)', 
                      tile.find('div', class_="m--t--1").text).groups()[0].replace(",","")
            )
        )
    except:
        row.append(np.nan)
    data.append(row)

We created an empty list called `data`, and our `for` loop was used to add rows to that list. Each row consists of a list of two items: listing name and listing price. Once we have created the list representing that row/tile/listing, we simply append it to the `data` list and move on to the next listing.

The next step (below) is to create a Data Frame based on our list called `data`, and to name our columns. This provides easy structure and functionality to our data:

In [65]:
import pandas as pd

data = pd.DataFrame(data, columns = ['listing', 'price'])

data

Unnamed: 0,listing,price
0,Black Wristlet and Wallet lot4 pieces total ch...,25.0
1,Lilly Pulitzer Estée Lauder Lemon Tropical Can...,35.0
2,Gucci Beige/Brown GG Canvas and Leather Web Co...,175.0
3,Vintage Black Christian Dior Crossbody/Shoulde...,558.0
4,Lululemon Green Tote Bag,10.0
5,Mytagalongs Velour Fanny Pack Waist Bag,20.0
6,Vintage Roses on Satin Classic Taupe Mini Hand...,75.0
7,NWOT An Reich Teal Pouch,7.0
8,Michael Kors Eva medium tote - EUC,69.0
9,🧩Lululemon All Night Festival Bag 5L - Black,56.0


## Scraping many pages

Now that we have established a pattern of code that is able to collect the information we desire, it is time to make sure that we can collect the same information from each page of search results. It is typically insufficient to collect only one page of search results, so we want to be able to follow the links in our search from page to page in order to continue collecting data.

Ideally, we can inspect the button that navigates from one page to the next. We find that the element is an `button` tag, with the following text:

```python
"""

    Next
  """
```

Using the `find` method, we can can then extract the `href` parameter from the `a` tag representing the link that takes us to the next page:

In [66]:
nextPage = parsed.find('button', string="""
    Next
  """)

nextPage

<button class="btn btn--pagination">
    Next
  </button>

Tragically, this page seems to use javascript to "turn the page". We can tell because this tag contains no information aside from the text, but when we click it we go to the next page. This means that we will need advance manually, and just figure out the pattern of next pages so that we can describe the urls to our scraper. Given that we get 48 results per page, let's just go for ten pages of results, or just under 500 listings.

    'https://poshmark.com/category/Women-Bags?max_id=2'



Above is the link that I see when I click the "Next" button. This link certainly looks like it will take us to the next page of results! Even better, it looks like there is an obvious way for us to advance by simply changing the number in the URL. We will soon find out. Below is the code that we have collected so far, applied to the second page of results.

In [67]:
nextPage = "https://poshmark.com/category/Women-Bags?max_id=2"

myPage = requests.get(nextPage)

parsed = BeautifulSoup(myPage.text)
listings = [i for i in soup.find_all('div', class_="card--small")]

newData = []

for tile in listings:
    row = []
    try:
        row.append(tile.find('div', class_="title__condition__container").text.strip())
    except:
        row.append('')
    try:
        row.append(
            float(
                re.search(r'(?:[$])(\d{1,3}(?:,\d{3})?)', 
                      tile.find('div', class_="m--t--1").text).groups()[0].replace(",","")
            )
        )
    except:
        row.append(np.nan)
    newData.append(row)

newData = pd.DataFrame(newData, columns = ['listing', 'price'])

newData

Unnamed: 0,listing,price
0,Black Wristlet and Wallet lot4 pieces total ch...,25.0
1,Lilly Pulitzer Estée Lauder Lemon Tropical Can...,35.0
2,Gucci Beige/Brown GG Canvas and Leather Web Co...,175.0
3,Vintage Black Christian Dior Crossbody/Shoulde...,558.0
4,Lululemon Green Tote Bag,10.0
5,Mytagalongs Velour Fanny Pack Waist Bag,20.0
6,Vintage Roses on Satin Classic Taupe Mini Hand...,75.0
7,NWOT An Reich Teal Pouch,7.0
8,Michael Kors Eva medium tote - EUC,69.0
9,🧩Lululemon All Night Festival Bag 5L - Black,56.0


Additionally, we can concatenate our Data Frames so that we have a single Data Frame containing all of the results from our scrape. After we concatenate our data, it is good practice to reset the index using the `.reset_index()` method. This will overwrite the index of the Data Frame so that it does not have any repeat values. Be sure to include the argument `drop=True`, so that the old index isn't added back into your Data Frame.

In [68]:
data = pd.concat([data, newData], axis=0).reset_index(drop=True)

data

Unnamed: 0,listing,price
0,Black Wristlet and Wallet lot4 pieces total ch...,25.0
1,Lilly Pulitzer Estée Lauder Lemon Tropical Can...,35.0
2,Gucci Beige/Brown GG Canvas and Leather Web Co...,175.0
3,Vintage Black Christian Dior Crossbody/Shoulde...,558.0
4,Lululemon Green Tote Bag,10.0
...,...,...
91,FENDI Selleria Leather Peekaboo Long Wallet -...,199.0
92,Kate Spade Hayden Cedar Street Red Leather Sat...,119.0
93,Cream and Gold Quilted Clutch with Chain Strap,24.0
94,GUCCI HYSTERIA ( 2003) collection WALLET- vintage,145.0


## Moving from script to function

We talked about functions earlier in the term as an excellent way to make our code more reusable, and to eliminate the need to copy and paste code with the risk of creating more typos and places for code to be updated. Now that we know how to scrape useful information from a website, let's create a function to do the work for us, so that we don't have to copy and paste the code for each subsequent page of search results.

In order to make our code into a function, we will have to create a function that takes a starting URL (the URL for our search results), and returns a Data Frame after reading through each page of the search results. We will have to perform some abstraction to make our code work on each page, but the differences are pretty minor:

- Use `requests.get()` on the URL passed to the function
- Check whether or not a "next" page exists
    - If there IS a next page, we need to call the function on *that* page, then merge the results
    - If there is NOT a next page, we return the existing data as a Data Frame.
    
Take some time to examine the code below and how each of these changes is made:

In [73]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re
import time

# A function to collect lego sets from search results on brickset.com
def poshmark(startURL, page=None):
    # keep track of what page we are on
    if page==None:
        page = 1
    # Add headers to imitate a real browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Referer': 'https://www.google.com/'
    }
    # Retrieve starting URL
    myPage = requests.get(startURL)

    # Parse the website with Beautiful Soup
    parsed = BeautifulSoup(myPage.text)
    
    # Grab all sets from the page
    listings = [i for i in soup.find_all('div', class_="card--small")]

    # Create and empty data set
    newData = []

    # Iterate over all sets on the page
    for tile in listings:
        row = []
        try:
            row.append(tile.find('div', class_="title__condition__container").text.strip())
        except:
            row.append('')
        try:
            row.append(
                float(
                    re.search(r'(?:[$])(\d{1,3}(?:,\d{3})?)', 
                          tile.find('div', class_="m--t--1").text).groups()[0].replace(",","")
                )
            )
        except:
            row.append(np.nan)
        # Add the row of data to the dataset
        newData.append(row)

    newData = pd.DataFrame(newData, columns = ['listing', 'price'])
    
    # Until we have processed 5 pages, grab the next page of results
    if page<5:
        # Tell our program not to load new pages too fast by "sleeping" for two seconds before
        #   going to the next page
        time.sleep(2)
        # Merge current data with next page
        page += 1
        nextPage = f"https://poshmark.com/category/Women-Bags?max_id={page}"
        print(nextPage)
        return pd.concat([newData, poshmark(nextPage, page=page)], axis=0)
    # Otherwise return the current data
    else:
        return newData

*Note: We sometimes need to use **headers** (text telling the website what kind of browser we are "using") so that we are able to access the website we want to scrape. Mileage will vary by website*
(Shoutout to Kiran Best of Aalto University for finding the right header to keep this site working as an example)

Observe that we use several `try`-`except` blocks. These code blocks permit us to write code that *might* result in an error. This is the code that is indented beneath the `try` keyword. Then, we write code that should be executed whenever an error *does* occur under the `except` keyword. In this way, we prevent errors from breaking our function, and we can better control the data that is recorded in our Data Frame. Let's run the code now:

In [74]:
bags = poshmark("https://poshmark.com/category/Women-Bags")

bags

https://poshmark.com/category/Women-Bags?max_id=2
https://poshmark.com/category/Women-Bags?max_id=3
https://poshmark.com/category/Women-Bags?max_id=4
https://poshmark.com/category/Women-Bags?max_id=5


Unnamed: 0,listing,price
0,Black Wristlet and Wallet lot4 pieces total ch...,25.0
1,Lilly Pulitzer Estée Lauder Lemon Tropical Can...,35.0
2,Gucci Beige/Brown GG Canvas and Leather Web Co...,175.0
3,Vintage Black Christian Dior Crossbody/Shoulde...,558.0
4,Lululemon Green Tote Bag,10.0
...,...,...
43,FENDI Selleria Leather Peekaboo Long Wallet -...,199.0
44,Kate Spade Hayden Cedar Street Red Leather Sat...,119.0
45,Cream and Gold Quilted Clutch with Chain Strap,24.0
46,GUCCI HYSTERIA ( 2003) collection WALLET- vintage,145.0


In [75]:
bags['price'].mean()

181.41666666666666

Based on the data we have collected, the mean price listed bags is about $181.

Now, it's your turn to collect data!

**Solve it**:

Update the code used above to extract the following information regarding Women's Bags from Poshmark (use the starting url [https://poshmark.com/category/Women-Bags](https://poshmark.com/category/Women-Bags)):
- Name of the listing
- Price of the listing
- Number of brand
- Number of size of bag

When you're done, you should have all of your web scraping code built into a function named `poshmark`. You should then call this function in order to collect information for at least 200 results, starting on the main listing page of the women's bags category [https://poshmark.com/category/Women-Bags](https://poshmark.com/category/Women-Bags). Store your results in a Data Frame called `bags`. The columns should be labeled `listing`, `price`, `brand`, `size`, respectively. You will receive points for the following:

- `bags` contains at least 200 entries [1 point]
- Column `listing` contains the names for each listing, and should have an "object" (string) `dtype` [1 point]
- Column `price` contains the prices for each listing, and the column should have a "float64" `dtype` (this makes it possible to assign missing prices a value of `np.nan`) [1 point]
- Column `brand` contains the brand of the bag, and should have an "object" (string) `dtype`. Missing values can either be empty or have some other label indicating that the information was unavailable. [1 point]
- Column `size` contains the size category of the bag, and should have an "object" (string) `dtype`. Missing values can either be empty or have some other label indicating that the information was unavailable. [1 point]

In order to get credit, please print your dataframe at the end of your code execution, so that the table of results renders in your notebook. That way, I can see what you saw at the time you ran the code. This is important since Poshmark listings change so quickly.

Please put ALL NECESSARY CODE into the cell below:

In [None]:
#si-exercise