# Parse Wellesley Hive

November 20, 2020

This is a short exploration of an HTML file that was downloaded from the Wellesley Hive. It contains information about 821 users who were "active" when the file was saved.

I filtered the profiles to include alums who work in Computer Science, Cyber Security, Data Science, Design/Visual Arts, Engineering, and Technology/IT.

We might need to download the file that contains all results.

In [3]:
from bs4 import BeautifulSoup as BS

Read the content of the file, make sure that it has content:

In [4]:
with open('The_Wellesley_Hive_Data.htm') as inputF:
    htmlRaw = inputF.read()
    
len(htmlRaw)

2726319

Create a DOM tree representation for the HTML code:

In [5]:
domTree = BS(htmlRaw, 'html.parser')

While inspecting the DOM in the browser, I identified one class name. Let me look for it:

In [6]:
allCards = domTree.findAll(attrs={'class': 'person-card__name-block'})
len(allCards)

821

It found all 821 cards! Let's check one card:

In [7]:
allCards[0]

<div class="person-card__name-block"><div aria-label="Cabelle Ahn ’12 - View Profile" class="person-card__identity person-card__identity__hover person-card__identity__hover-inactive" role="link" tabindex="0">Cabelle Ahn ’12</div></div>
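
The name block does carry two usable pieces: the name and a two-digit class year. Here is a minimal sketch of splitting them, assuming every entry follows the `Name ’YY` pattern seen in this one card. The helper name `split_name_and_year` is hypothetical, and I have not checked how many of the 821 cards deviate from the pattern, so entries without a year simply come back with `None`.

```python
import re

# Matches "Some Name ’12" or "Some Name '12"; the year suffix is optional.
# Assumption: the class year is always two digits after an apostrophe.
NAME_YEAR = re.compile(r"^(?P<name>.*?)(?:\s+[’'](?P<year>\d{2}))?$")

def split_name_and_year(text):
    """Split the text of a name block into (name, two-digit year or None)."""
    cleaned = text.strip()
    match = NAME_YEAR.match(cleaned)
    if match is None:
        # Unexpected shape (e.g. embedded newlines): return the text untouched.
        return cleaned, None
    return match.group("name"), match.group("year")

print(split_name_and_year("Cabelle Ahn ’12"))
```

The year is kept as a string because two digits alone do not say which century the class belongs to.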

Okay, this card doesn't have enough information, so let me look for another class name in the HTML file in the browser.

In [8]:
biggerCards = domTree.findAll(attrs={'class': 'ant-card-body'})
len(biggerCards)

821

In [9]:
biggerCards[0]

<div class="ant-card-body"><div tabindex="-1"><div class="person-card__card-header"><div aria-label=" Bookmark Cabelle Ahn's profile" class="bookmark-icon bookmark-icon__grid" role="button" tabindex="0"><div tabindex="-1"><svg class="" height="30px" version="1.1" viewbox="0 0 18 30" width="18px" xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg"><g fill="none" fill-rule="evenodd" stroke="none" stroke-width="1"><g class="" fill="#e4e6eb" transform="translate(-1149.000000, -316.000000)"><g transform="translate(250.000000, 249.000000)"><g transform="translate(893.000000, 65.000000)"><g><text font-family="mentor-icons" font-size="30" font-weight="normal"></text><path d="M23.75 1.937V31.37c0 .25-.122.438-.367.563-.184.125-.49.062-.673-.063L15 24.31l-7.71 7.56c-.183.125-.49.188-.673.063-.245-.125-.367-.313-.367-.563V1.937C6.25.875 7.045 0 8.086 0h13.828c1.04 0 1.836.875 1.836 1.937z"></path></g></g></g></g></g></svg></div></div><div class="person-card__image person-card_

This class name looks like exactly what we want. Can I extract the text directly?

In [1]:
#biggerCards[0].text

I can, but the structure of the text is lost and many fields are concatenated into a single string, so that's not a good idea. It's better to search the children of the node:

In [11]:
for child in biggerCards[0].children:
    print(child.text)
    print()

Cabelle Ahn ’12PhD Candidate in History of ArtAmsterdam, NetherlandsWellesley College, Bachelor's Degree, 2012, Art History, Courtauld Institute of Art, Master's Degree, 2013, History of Art & Architecture, Bard Graduate Center, Master's Degree, 2015, Decorative Arts & Design History, Harvard University, PhD - Doctor of Philosophy, History of Art & ArchitectureHarvard UniversityAmsterdam, Netherlands

Let's Connect



That didn't help either; the text is still jumbled together. We need to find the fine-grained structure of the DOM portion for this card. I'll use a function that prints out the nested HTML:

In [12]:
print(biggerCards[0].prettify())

<div class="ant-card-body">
 <div tabindex="-1">
  <div class="person-card__card-header">
   <div aria-label=" Bookmark Cabelle Ahn's profile" class="bookmark-icon bookmark-icon__grid" role="button" tabindex="0">
    <div tabindex="-1">
     <svg class="" height="30px" version="1.1" viewbox="0 0 18 30" width="18px" xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg">
      <g fill="none" fill-rule="evenodd" stroke="none" stroke-width="1">
       <g class="" fill="#e4e6eb" transform="translate(-1149.000000, -316.000000)">
        <g transform="translate(250.000000, 249.000000)">
         <g transform="translate(893.000000, 65.000000)">
          <g>
           <text font-family="mentor-icons" font-size="30" font-weight="normal">
           </text>
           <path d="M23.75 1.937V31.37c0 .25-.122.438-.367.563-.184.125-.49.062-.673-.063L15 24.31l-7.71 7.56c-.183.125-.49.188-.673.063-.245-.125-.367-.313-.367-.563V1.937C6.25.875 7.045 0 8.086 0h13.828c1.04 0 1.836.875 1
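
As a side note, the jumbled text from before can be pulled apart without knowing any class names, because BeautifulSoup can yield each text node separately through `stripped_strings`. The sketch below runs on a hypothetical, heavily simplified stand-in for one card (the real markup is much deeper, and which class holds which field is my assumption here), so it only shows the technique.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one card; not the real Hive markup.
sample = """
<div class="ant-card-body">
  <div class="person-card__name-block"><div>Cabelle Ahn ’12</div></div>
  <div class="person-card__details">PhD Candidate in History of Art</div>
  <div class="person-details-container__info-line">Amsterdam, Netherlands</div>
</div>
"""

card = BeautifulSoup(sample, "html.parser").find(attrs={"class": "ant-card-body"})

# stripped_strings yields each text node on its own, trimmed, and skips
# whitespace-only nodes, so field boundaries survive.
fields = list(card.stripped_strings)
print(fields)
```

This keeps the boundaries between fields but not their meaning, so the class names are still needed to label each piece.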

I identified a few more class names that might be useful. Some of these class names are repeated multiple times:

In [None]:
firstCard = biggerCards[0]

for className in ['person-card__name-block',  
                  'person-card__details', 
                  'person-details-container__info-line']:
    divs = firstCard.findAll(attrs={'class': className})
    for div in divs:
        print(className, div.text)


This worked well. Let's try it with another card:

In [None]:
anotherCard = biggerCards[2]

for className in ['person-card__name-block',  
                  'person-card__details', 
                  'person-details-container__info-line']:
    divs = anotherCard.findAll(attrs={'class': className})
    for div in divs:
        print(className, div.text)


Let's try it out with more cards:

In [None]:
for i in range(10):
    card = biggerCards[i]
    for className in ['person-card__name-block',
                      'person-card__details',
                      'person-details-container__info-line']:
        divs = card.findAll(attrs={'class': className})
        for div in divs:
            print(className, div.text)
        
    print(40*'-')


**To do**

- figure out the meaning of each field in the card. The fields are mostly consistent, but some cards are confusing. Also, the education line lumps multiple institutions together.
- write a function that returns the data for one card
- write code that stores each card as a dictionary
- download the file that contains all Wellesley Hive members.
- make sure not to share the data that contain names with anyone outside your team
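
A possible starting point for the second and third items is sketched below. It reuses the three class names from the loops above; the function name `card_to_dict`, the choice to keep a list of strings per class name, and the sample markup (including the made-up second person) are all assumptions for illustration, not the final design.

```python
from bs4 import BeautifulSoup

# Class names identified during the exploration above.
CLASS_NAMES = [
    "person-card__name-block",
    "person-card__details",
    "person-details-container__info-line",
]

def card_to_dict(card):
    """Return {class name: [text of each matching element]} for one card."""
    return {
        class_name: [
            div.get_text(separator=" ", strip=True)
            for div in card.find_all(attrs={"class": class_name})
        ]
        for class_name in CLASS_NAMES
    }

# Hypothetical, simplified markup standing in for the real file.
sample = """
<div class="ant-card-body">
  <div class="person-card__name-block"><div>Cabelle Ahn ’12</div></div>
  <div class="person-details-container__info-line">Amsterdam, Netherlands</div>
</div>
<div class="ant-card-body">
  <div class="person-card__name-block"><div>Jane Doe ’05</div></div>
  <div class="person-details-container__info-line">Boston, MA</div>
  <div class="person-details-container__info-line">Example University</div>
</div>
"""

dom = BeautifulSoup(sample, "html.parser")
records = [card_to_dict(card) for card in dom.find_all(attrs={"class": "ant-card-body"})]
print(len(records))
```

Each value is a list because some class names repeat within a card. If one of these elements turns out to be nested inside another, the same text will appear under both keys, which is something to check against the real file.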