## Extracting data from Confluence pages

In [1]:
import requests
import bs4
import pandas

### Confluence data must be loaded from a local HTML file saved via the browser's element inspector

In [2]:
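# Fetching the page directly is left commented out; the HTML is read from a local copy instead.
# url = "https://vcdi-dpc.atlassian.net/wiki/spaces/AKB/pages/187301909/Project+Canvas+Template"
# req = requests.get(url)
with open("/users/danielcorcoran/Desktop/test_canvas.html") as html_file:
    soup = bs4.BeautifulSoup(html_file, "html.parser")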

### Inspect the soup

In [3]:
print(soup.prettify())

<div id="aui-flag-container" style="top: 20px;">
 <div class="aui-flag-stack" data-aui-flag-stack="quickreload">
  <div aria-hidden="true" class="aui-flag qr-flag aui-flag-stack-top-item">
   <div class="aui-message aui-message-info info closeable shadowed">
    <p class="title">
     <strong>
      New page edits
     </strong>
    </p>
    <div>
     <div class="qr-notice-authors">
      <span class="qr-author-avatar">
       <span class="aui-avatar aui-avatar-small">
        <span class="aui-avatar-inner">
         <img src="/wiki/aa-avatar/86737cb56d2cb00df0e256b57325cb09?s=48&amp;d=https%3A%2F%2Fvcdi-dpc.atlassian.net%2Fwiki%2Fimages%2Ficons%2Fprofilepics%2Fdefault.png%3FnoRedirect%3Dtrue"/>
        </span>
       </span>
      </span>
     </div>
     <div class="qr-notice-summary">
      <button class="aui-button aui-button-link qr-notice-show aui-button-text">
       Reload page
      </button>
      <span class="qr-notice-summary-text">
       by
       <a class="url fn" href=

### Retrieving title

In [4]:
soup.find("h1").text

'Project Canvas Template'

### Multiple attributes can be used to find a cell; however, this method is unreliable

In [5]:
types_of_analytics_ul = soup.find("td", {"class":"confluenceTd", "colspan":"6"}).find("ul")

In [6]:
types_of_analytics_list_items = types_of_analytics_ul.find_all("li")

### Dealing with checkboxes in Confluence

In [7]:
for item in types_of_analytics_list_items:
    
    class_item = item.get("class")
    
    if class_item == ["checked"]:
        print("Checked:", item.find("span").text)
    else:
        print("Unchecked:", item.find("span").text)

Checked: Prescriptive
Unchecked: Predictive
Unchecked: Diagnostic
Unchecked: Descriptive
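
The comparison `class_item == ["checked"]` only matches when `checked` is the item's sole class. A minimal sketch of a more tolerant check follows; the helper name `read_task_list` and the sample markup (including the extra class) are assumptions made for illustration, not taken from the saved page.

```python
import bs4

def read_task_list(ul):
    """Map each checkbox label in a task list to True (checked) or False."""
    states = {}
    for item in ul.find_all("li"):
        # item.get("class") is None when the <li> has no class attribute
        classes = item.get("class") or []
        label = item.get_text(strip=True)
        states[label] = "checked" in classes
    return states

# Hypothetical markup imitating a Confluence task list
sample_html = """
<ul class="inline-task-list">
  <li class="checked"><span>Prescriptive</span></li>
  <li><span>Predictive</span></li>
  <li class="checked extra"><span>Diagnostic</span></li>
</ul>
"""
sample_ul = bs4.BeautifulSoup(sample_html, "html.parser").find("ul")
print(read_task_list(sample_ul))
```

Returning a dictionary rather than printing makes it easier to load the states into a table later.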


### Targeting Building Blocks Outcomes checklist

In [8]:
all_confluenceTd = soup.find_all("td", {"class":"confluenceTd"})
for block in all_confluenceTd:
    
    #print(block.prettify(),"\n","|"*60)
    h3 = block.find("h3")
    if h3 is not None:
        h3_text = h3.text
        
        if h3_text == "Building Blocks Outcomes:":
            for list_item in block.find("ul",{"class":"inline-task-list"}).find_all("li"):
                if list_item.get("class") == ["checked"]:
                    print("Checked:", list_item.text)

Checked: Jobs Now
Checked: Fairness and Equity
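
Looking a cell up by its `h3` heading is a pattern that recurs throughout this notebook, so it may be worth wrapping in a function. The sketch below is one possible version; `find_cell_by_heading` is a hypothetical helper and the sample table is invented for the example. It compares headings case-insensitively after stripping whitespace.

```python
import bs4

def find_cell_by_heading(soup, heading):
    """Return the first confluenceTd cell whose h3 text matches heading, else None."""
    wanted = heading.strip().lower()
    for cell in soup.find_all("td", {"class": "confluenceTd"}):
        h3 = cell.find("h3")
        if h3 is not None and h3.get_text(strip=True).lower() == wanted:
            return cell
    return None

# Hypothetical markup imitating two canvas cells
sample_html = """
<table><tr>
  <td class="confluenceTd"><h3><span>Benefits:</span></h3>
    <ul><li>Reduce aged debt</li><li>Reduce operational costs</li></ul></td>
  <td class="confluenceTd"><h3>Problem Summary: </h3><p>Some text</p></td>
</tr></table>
"""
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")
benefits = find_cell_by_heading(sample_soup, "benefits:")
print([li.get_text(strip=True) for li in benefits.find_all("li")])
```

Because the function returns `None` when no heading matches, callers should check the result before calling `.find` on it.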


### Targeting Problem Summary box

In [9]:
all_confluenceTd = soup.find_all("td", {"class":"confluenceTd"})

for block in all_confluenceTd:
    
    h3 = block.find("h3")
    
    if h3 is not None:
        
        h3_text = h3.text
        h3_text_lower_strip = h3_text.strip().lower()
        if h3_text_lower_strip == "problem summary:":
            text_block = block.text.strip()
            while "  " in text_block:
                text_block = text_block.replace("  "," ")  # str.replace returns a new string, so reassign
            print(text_block)

Problem Summary: e.g Supporting the financial sustainability of government debt relative to revenue is a key priority for Government. Over the next four years:traffic cameras and other fines are expected to net $3.4 billion for governmentrevenue from speed and red light cameras alone are expected to be $393 million in 2017-18, increasing to $421 million by 2020-21the budget also estimates revenue from on-the-spot fines and toll road evasion will continue to growthere are currently approximately $2 billion unpaid fines with the average being greater than 1 year oldWith these estimates in mind, recovery of fines is a key priority area for government, which will be achieved by smart and effect revenue collection from fines through enforcement and recovery activities.
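
In the output above, the bullet points run into each other ("years:traffic", "governmentrevenue") because `.text` joins text fragments with no separator. One possible alternative, sketched below, is to pass a separator to `get_text` and collapse whitespace with `split`/`join`. The helper name `clean_text` and the sample cell are assumptions for illustration.

```python
import bs4

def clean_text(tag):
    """Join a tag's text fragments with single spaces and collapse whitespace."""
    return " ".join(tag.get_text(separator=" ").split())

# Hypothetical markup imitating a cell with a heading, a paragraph and a list
sample_html = """
<td class="confluenceTd"><h3>Problem Summary:</h3>
  <p>Over the next   four years:</p>
  <ul><li>fines are expected to grow</li><li>recovery is a priority</li></ul></td>
"""
sample_cell = bs4.BeautifulSoup(sample_html, "html.parser").find("td")
print(clean_text(sample_cell))
# Problem Summary: Over the next four years: fines are expected to grow recovery is a priority
```

The separator is also inserted between inline fragments, so text split across `<em>` or `<span>` tags inside a word would gain a space.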


### Targeting macro buttons

In [10]:
macro_button = soup.find("span", {"class":"status-macro"})

In [11]:
macro_button.text

'GREEN'
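
`soup.find` returns only the first status macro. If a page holds several, `find_all` collects them all, as in the sketch below. The sample markup is an assumption: only the `status-macro` class comes from this notebook, and the other class names are placeholders.

```python
import bs4

# Hypothetical markup with two status macros
sample_html = """
<p><span class="status-macro aui-lozenge aui-lozenge-success">GREEN</span></p>
<p><span class="status-macro aui-lozenge aui-lozenge-error">RED</span></p>
"""
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# class_ matches any element that has status-macro among its classes
statuses = [
    span.get_text(strip=True)
    for span in sample_soup.find_all("span", class_="status-macro")
]
print(statuses)  # ['GREEN', 'RED']
```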

### Pulling text from ordered (numbered) lists ("ol")

In [12]:
ordered_list = soup.find("ol")
print(ordered_list.prettify())

<ol>
 <li>
  <span style="color: rgb(0,0,255);">
   <em>
    Reduce aged debt by 10% within 12 months
   </em>
  </span>
 </li>
 <li>
  <span style="color: rgb(0,0,255);">
   <em>
    Reduce average age of overdue fees to less than 1 year within 6 months
   </em>
  </span>
 </li>
 <li>
  <span style="color: rgb(0,0,255);">
   <em>
    Reduce operational costs by 10% by 2020
   </em>
  </span>
 </li>
</ol>


In [13]:
# find_all("li") skips any whitespace text nodes between the list items
for list_item in ordered_list.find_all("li"):
    print(list_item.text)

Reduce aged debt by 10% within 12 months
Reduce average age of overdue fees to less than 1 year within 6 months
Reduce operational costs by 10% by 2020
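
Iterating over a list tag directly yields every child node, including any whitespace strings between items, whereas `find_all("li")` returns only the items. The sketch below shows the difference on invented markup that is indented the way hand-formatted HTML often is.

```python
import bs4

# Hypothetical markup with line breaks and indentation between items
sample_html = """
<ol>
  <li><span><em>Reduce aged debt by 10% within 12 months</em></span></li>
  <li><span><em>Reduce operational costs by 10% by 2020</em></span></li>
</ol>
"""
sample_ol = bs4.BeautifulSoup(sample_html, "html.parser").find("ol")

children = list(sample_ol.children)  # <li> tags plus whitespace strings
items = [li.get_text(strip=True) for li in sample_ol.find_all("li")]

print(len(children), len(items))
print(items)
```

For nested lists, `find_all("li", recursive=False)` restricts the search to direct children.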


### Targeting "Benefits:" box, using h3 tag as an identifier

In [14]:
all_confluenceTd = soup.find_all("td", {"class":"confluenceTd"})

for block in all_confluenceTd:
    
    #print(block.prettify(),"\n","|"*60)
    h3 = block.find("h3")
    if h3 is not None:
        h3_text = h3.text
        
        if h3_text == "Benefits:":
            for list_item in block.find("ul").find_all("li"):
                print(list_item.text)
                
            

Advance IMES's self-service analytics capabilities
Reduce aged debt
Reduce operational costs of managing and collecting
Change behaviour of future defendants


### Pulling text from bullet point lists ("ul")

In [15]:
project_team_box = soup.find("td", {"class":"confluenceTd",
                                   "colspan":"7",
                                   "rowspan":"2"})

In [16]:
print(project_team_box.prettify())

<td class="confluenceTd" colspan="7" rowspan="2">
 <h3 id="ProjectCanvasTemplate-ProjectTeam:">
  <span style="color: rgb(153,51,102);">
   Project Team:
  </span>
 </h3>
 <ul>
  <li>
   <em>
    <span style="color: rgb(0,0,255);">
     'Name'
    </span>
   </em>
   - Analytics Manager
  </li>
  <li>
   <em>
    <span style="color: rgb(0,0,255);">
     'Name'
    </span>
    -
    <span style="color: rgb(0,0,255);">
     'Position Title'
    </span>
   </em>
  </li>
  <li>
   <em>
    <span style="color: rgb(0,0,255);">
     'Name'
    </span>
    -
    <span style="color: rgb(0,0,255);">
     'Position Title'
    </span>
   </em>
  </li>
 </ul>
</td>


In [17]:
project_team_ul = project_team_box.find("ul")
print(project_team_ul.prettify())

<ul>
 <li>
  <em>
   <span style="color: rgb(0,0,255);">
    'Name'
   </span>
  </em>
  - Analytics Manager
 </li>
 <li>
  <em>
   <span style="color: rgb(0,0,255);">
    'Name'
   </span>
   -
   <span style="color: rgb(0,0,255);">
    'Position Title'
   </span>
  </em>
 </li>
 <li>
  <em>
   <span style="color: rgb(0,0,255);">
    'Name'
   </span>
   -
   <span style="color: rgb(0,0,255);">
    'Position Title'
   </span>
  </em>
 </li>
</ul>


In [18]:
# find_all("li") skips any whitespace text nodes between the list items
for list_item in project_team_ul.find_all("li"):
    print(list_item.text)

'Name' - Analytics Manager
'Name' - 'Position Title'
'Name' - 'Position Title'
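
Each team entry follows a "name - role" shape, so the text can be split into structured records. The following is a minimal sketch under that assumption; `parse_team` is a hypothetical helper, and the sample list is a simplified copy of the markup printed above.

```python
import bs4

def parse_team(ul):
    """Split each 'name - role' list item into a dict; role is None if absent."""
    members = []
    for li in ul.find_all("li"):
        # Normalise whitespace so the " - " separator is predictable
        text = " ".join(li.get_text(separator=" ").split())
        name, sep, role = text.partition(" - ")
        members.append({"name": name.strip(), "role": role.strip() if sep else None})
    return members

# Simplified sample based on the Project Team cell
sample_html = """
<ul>
  <li><em><span>'Name'</span></em> - Analytics Manager</li>
  <li><em><span>'Name'</span> - <span>'Position Title'</span></em></li>
</ul>
"""
sample_ul = bs4.BeautifulSoup(sample_html, "html.parser").find("ul")
for member in parse_team(sample_ul):
    print(member)
```

Splitting on `" - "` with surrounding spaces leaves hyphenated names intact, but an entry that contains that separator more than once is split at the first occurrence only.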
