# Confluence Dummy Template Scrape

## About

- Parses a Confluence project page, saved as an **'html file'**, into a single .csv file.
- The page must first be downloaded as html through the browser's element inspector and saved as an .html file before it is passed to this script.
- Tables are pulled from the html as a list of dataframes via the pandas module (`pandas.read_html`).
- Tables are processed individually using a range of custom functions.
- Functions as expected as of 6th June 2018.


## How to obtain 'html file'
- Load canvas page in confluence
- Right click, inspect element
- Right click on html body, copy inner code
- Paste in editor (notepad++, sublime etc)
- Save as .html
- Set `input_path` in this script to the path of the saved file
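
Before passing the saved file to the script, it may help to confirm that it actually contains `<table>` elements. The following is a minimal sketch using only the standard library; `TableCounter`, `count_tables` and `sample_html` are hypothetical names that are not part of the original notebook.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.table_count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook
        if tag == "table":
            self.table_count += 1


def count_tables(html_text):
    """Return the number of <table> elements found in html_text."""
    counter = TableCounter()
    counter.feed(html_text)
    return counter.table_count


# Hypothetical stand-in for the contents of the saved .html file
sample_html = "<div><table><tr><td>Project ID</td><td>000</td></tr></table></div>"
print(count_tables(sample_html))
```

If the count is zero for a real page, the copy step probably captured the wrong element.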

## Setup 

### Import Modules

In [1]:
import pandas
import bs4
import requests
import datetime

### Specify html file path and read its tables into a list with pandas

In [2]:
input_path = "/Users/danielcorcoran/Desktop/github_repos/python_nb_data_pulling/confluence_scrape/data/inputs/project_dummy_page_number_1.html"

output_path = "/Users/danielcorcoran/Desktop/github_repos/python_nb_data_pulling/confluence_scrape/data/outputs/project_page/"

In [3]:
tables = pandas.read_html(input_path)

In [4]:
len(tables)

13

In [5]:
for index in range(len(tables)):
    print("Table at index:", index, "\n\n\n", tables[index], "\n\n\n","=" * 60)

Table at index: 0 


                    0                            1
0         Project ID                          000
1      Project Title  Building Cladding TaskForce
2        Last Update                  24 May 2018
3        VCDI Stream                    Analytics
4       Project Lead            Suhith Illesinghe
5    Key Stakeholder                    TaskForce
6  Executive Sponsor                          TBC
7      Project Start                  01 Mar 2017
8        Project End                  30 Aug 2017
9      Current State                     Inactive 


Table at index: 1 


   Project Selection Score  Priority  Maturity  Complexity
0           75% (324/430)       7.2       7.1         7.0 


Table at index: 2 


                   0         1
0   1 - Pre Project  COMPLETE
1   2 - Feasibility  COMPLETE
2   3 - Foundations  COMPLETE
3   4 - Development  COMPLETE
4      5 - Delivery  COMPLETE
5       6 - Closure  COMPLETE
6  7 - Post Project       NaN 


Table at index: 3 


## Functions
These helper functions reshape the tables into the desired layout and handle repetitive tasks.


### Convert a string containing '>' into a pipe-separated string

In [6]:
def convert_list_to_pipelinestring(string):
    # Split on ">" and strip surrounding whitespace from each segment
    lines_list = [item.strip() for item in string.split(">")]

    # Keep only non-empty segments (avoids removing items while iterating)
    lines_list = [item for item in lines_list if item != ""]

    full_string = " | ".join(lines_list)
    return full_string

In [7]:
string = "> This is the objective of the project> This is another objective of the project"

In [8]:
convert_list_to_pipelinestring(string)

'This is the objective of the project | This is another objective of the project'
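
Removing items from a list while iterating over it can skip elements, which matters when the input yields two consecutive empty segments (for example a string starting with `>>`). The sketch below contrasts that pitfall with a comprehension-based filter; both helper names are hypothetical and exist only for illustration.

```python
def strip_and_remove_in_place(text):
    """Removes empty segments while iterating; can leave some behind."""
    parts = [part.strip() for part in text.split(">")]
    for part in parts:
        if part == "":
            parts.remove(part)  # shifts later items left, so one is skipped
    return parts


def strip_and_filter(text):
    """Builds a new list containing only the non-empty segments."""
    return [part.strip() for part in text.split(">") if part.strip()]


sample = ">> first objective > second objective"
print(strip_and_remove_in_place(sample))
print(strip_and_filter(sample))
```

With this input, the in-place version keeps one empty string, while the filtered version keeps only the two objectives.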

### Drop all rows and columns that are entirely null

In [9]:
def drop_null_rows_and_columns(dataframe):
    dataframe.dropna(how = "all", axis =1, inplace = True)
    dataframe.dropna(how = "all", axis =0, inplace = True)
    return dataframe
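
To illustrate what `how = "all"` does, here is a small sketch on a made-up dataframe (the column names and values are assumptions, not data from the Confluence page). It uses chained, non-in-place calls as an alternative style, so the input frame is left untouched.

```python
import pandas

# Hypothetical sample: one fully empty column and one fully empty row
frame = pandas.DataFrame({
    "Stakeholder": ["VBA", None, None],
    "Contacts": ["TBC", None, "TBC"],
    "Empty": [None, None, None],
})

# Drop columns that are entirely null, then rows that are entirely null
cleaned = frame.dropna(how="all", axis=1).dropna(how="all", axis=0)
print(cleaned)
```

The row with only a missing `Stakeholder` survives because its `Contacts` cell is populated; only rows and columns with no values at all are removed.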

### Transpose table with two columns

In [10]:
def transpose_table_with_two_columns(table):

    table.columns = ["index", "values"]

    index_list = list(table["index"])
    table.index = index_list
    table.drop("index", axis = 1, inplace = True)

    table2 = table.transpose()
    table2.reset_index(inplace = True, drop = True)


    return table2
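
The same reshaping can be sketched without modifying the input table by using `set_index` followed by `transpose`. This is an alternative under stated assumptions, not the notebook's own method; `transpose_two_columns` and `key_info` are hypothetical names.

```python
import pandas


def transpose_two_columns(table):
    """Turn a two-column key/value table into a single wide row."""
    # Use the first column as the index, then flip rows and columns
    wide = table.set_index(table.columns[0]).transpose()
    wide.columns.name = None  # drop the leftover name of the old key column
    return wide.reset_index(drop=True)


# Hypothetical excerpt of the key information table
key_info = pandas.DataFrame({
    0: ["Project ID", "Project Title"],
    1: ["000", "Building Cladding TaskForce"],
})
wide_row = transpose_two_columns(key_info)
print(wide_row)
```

Because nothing is changed in place, `key_info` keeps its original `0` and `1` column labels and the cell can be re-run safely.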

### Convert first row to headers and drop first row

In [11]:
def first_row_to_headers(table):
    
    table.columns = table.iloc[0].tolist()
    table.drop([0], axis = 0, inplace = True)
    
    return table
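
As a sketch of the same idea that leaves the input untouched and also renumbers the remaining rows from zero, one could slice off the first row instead of dropping it in place. `promote_first_row` and `raw` are hypothetical names used for illustration only.

```python
import pandas


def promote_first_row(table):
    """Use the first row as column headers and return the remaining rows."""
    promoted = table.iloc[1:].copy()          # everything after the header row
    promoted.columns = table.iloc[0].tolist() # first row becomes the headers
    return promoted.reset_index(drop=True)    # renumber rows from 0


# Hypothetical single-column table shaped like the status update table
raw = pandas.DataFrame({0: ["Status Update", "Exec summary"]})
promoted = promote_first_row(raw)
print(promoted)
```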

### Compress table vertically

In [12]:
def compress_table_vertically(table):
    
    table_dict = table.to_dict(orient = "list")

    new = {}
    for key in table_dict.keys():
        new[key] = ""
        for item in table_dict[key]:
            new[key] = new[key] + str(item) + " | "

    for key in new.keys():
        item = new[key]
        item_max_length = len(item)
        item2 = item[:item_max_length-3]
        new[key] = item2

    final_dictionary = {}

    for key in new.keys():
        final_dictionary[key] = [new[key]]

    data = pandas.DataFrame(final_dictionary)

    return data
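
For reference, the compression step can be sketched more compactly with `str.join`, which places the separator only between values and so needs no trailing-separator cleanup. `compress_columns` and the `risks` sample are hypothetical; this is an illustration rather than a replacement for the function above.

```python
import pandas


def compress_columns(table, separator=" | "):
    """Join each column's values into a single cell, giving a one-row table."""
    joined = {
        column: [separator.join(str(value) for value in table[column])]
        for column in table.columns
    }
    return pandas.DataFrame(joined)


# Hypothetical register with two rows
risks = pandas.DataFrame({
    "Risk": ["Data delay", "Scope creep"],
    "Owner": ["TBC", "TBC"],
})
compressed = compress_columns(risks)
print(compressed)
```

Note that every value is passed through `str`, so missing cells are stringified as well (typically as `nan`) rather than skipped.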

### Reset table index and drop old index

In [13]:
def reset(table):
    table = table.reset_index(drop = True)
    return table

### Drop null columns and rows in selected dataframes

In [14]:
# Tables 1, 2, 5 and 6 are skipped, so their null cells are kept
excluded_table_indices = {1, 2, 5, 6}
relevant_table_indices = [index for index in range(len(tables))
                          if index not in excluded_table_indices]
for index in relevant_table_indices:
    tables[index] = drop_null_rows_and_columns(tables[index])

### Clean Project Code

In [15]:
def cleancode(var):
    var = str(var)
    var = var.strip()
    maxchar = len(var)
    if maxchar == 1:
        new_var = "00" + var
    elif maxchar == 2:
        new_var = "0" + var
    else:
        new_var = var
        
    return new_var
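
Left-padding to three characters can also be sketched with the built-in `str.zfill`. `pad_project_code` is a hypothetical name, and the behaviour is not identical in every edge case: an empty string becomes `"000"` here, whereas `cleancode` returns it unchanged.

```python
def pad_project_code(value, width=3):
    """Strip whitespace and left-pad with zeros up to the given width."""
    return str(value).strip().zfill(width)


print(pad_project_code(0))
print(pad_project_code(" 7 "))
print(pad_project_code("1234"))
```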

## Inspect and process each table

### Table 0 (KEY INFORMATION)

In [16]:
tables[0]

Unnamed: 0,0,1
0,Project ID,000
1,Project Title,Building Cladding TaskForce
2,Last Update,24 May 2018
3,VCDI Stream,Analytics
4,Project Lead,Suhith Illesinghe
5,Key Stakeholder,TaskForce
6,Executive Sponsor,TBC
7,Project Start,01 Mar 2017
8,Project End,30 Aug 2017
9,Current State,Inactive


In [17]:
tables[0] = transpose_table_with_two_columns(tables[0])
tables[0]

Unnamed: 0,Project ID,Project Title,Last Update,VCDI Stream,Project Lead,Key Stakeholder,Executive Sponsor,Project Start,Project End,Current State
0,0,Building Cladding TaskForce,24 May 2018,Analytics,Suhith Illesinghe,TaskForce,TBC,01 Mar 2017,30 Aug 2017,Inactive


In [18]:
project_code = cleancode(tables[0].iloc[0,0])
tables[0].iloc[0,0] = project_code

### Table 1 (PROJECT SCORES)

In [19]:
tables[1]

Unnamed: 0,Project Selection Score,Priority,Maturity,Complexity
0,75% (324/430),7.2,7.1,7.0


### Table 2 (PROJECT LIFE CYCLE 1/2)

In [20]:
tables[2]

Unnamed: 0,0,1
0,1 - Pre Project,COMPLETE
1,2 - Feasibility,COMPLETE
2,3 - Foundations,COMPLETE
3,4 - Development,COMPLETE
4,5 - Delivery,COMPLETE
5,6 - Closure,COMPLETE
6,7 - Post Project,


In [21]:
tables[2] = transpose_table_with_two_columns(tables[2])
tables[2]

Unnamed: 0,1 - Pre Project,2 - Feasibility,3 - Foundations,4 - Development,5 - Delivery,6 - Closure,7 - Post Project
0,COMPLETE,COMPLETE,COMPLETE,COMPLETE,COMPLETE,COMPLETE,


### Table 3 (PROJECT LIFE CYCLE 2/2)

In [22]:
tables[3]

Unnamed: 0,0,1
0,Current Phase,7 - Post Project
1,Overall Status,5-COMPLETE


In [23]:
tables[3] = transpose_table_with_two_columns(tables[3])
tables[3]

Unnamed: 0,Current Phase,Overall Status
0,7 - Post Project,5-COMPLETE


### Table 4 (STATUS UPDATE)

In [24]:
tables[4]

Unnamed: 0,0
0,Status Update
1,Exec summary


In [25]:
tables[4] = first_row_to_headers(tables[4])
tables[4]

Unnamed: 0,Status Update
1,Exec summary


In [26]:
tables[4] = reset(tables[4])
tables[4]

Unnamed: 0,Status Update
0,Exec summary


### Table 5 (DATA CATEGORIES)

In [27]:
tables[5]

Unnamed: 0,Data Type,Yes / No,Nature of Data / Info Used
0,Personal,Yes,Detailed description about PERSONAL data break...
1,Health,No,Detailed description about HEALTH data breakdo...


In [28]:
columns = list(tables[5].columns)
columns

['Data Type', 'Yes / No', 'Nature of Data / Info Used']

In [29]:
new_tbl5_dict = {}

In [30]:
for index in range(tables[5].shape[0]):
    datatype = tables[5].loc[index, columns[0]]
    type_header =  columns[1] + " " + datatype
    type_selection = tables[5].loc[index, columns[1]]
    info_header = columns[2] + " " + datatype
    info_selection = tables[5].loc[index, columns[2]]
    
    new_tbl5_dict[type_header] = [type_selection]
    new_tbl5_dict[info_header] = [info_selection]
new_tbl5_dict

{'Yes / No Personal': ['Yes'],
 'Nature of Data / Info Used Personal': ['Detailed description about PERSONAL data breakdown schedule'],
 'Yes / No Health': ['No'],
 'Nature of Data / Info Used Health': ['Detailed description about HEALTH data breakdown schedule']}

In [31]:
tbl5_data = pandas.DataFrame(new_tbl5_dict)
tables[5] = tbl5_data
tables[5]

Unnamed: 0,Yes / No Personal,Nature of Data / Info Used Personal,Yes / No Health,Nature of Data / Info Used Health
0,Yes,Detailed description about PERSONAL data break...,No,Detailed description about HEALTH data breakdo...
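
The widening step for this table could be wrapped in a reusable helper. The sketch below assumes the first column holds the category and every other column should be suffixed with it; `widen_by_first_column` and the shortened sample descriptions are hypothetical.

```python
import pandas


def widen_by_first_column(table):
    """Flatten a category table into one row, suffixing headers with the category."""
    key_column, *value_columns = table.columns
    wide = {}
    for _, row in table.iterrows():
        for column in value_columns:
            # e.g. "Yes / No" + " " + "Personal"
            wide[f"{column} {row[key_column]}"] = [row[column]]
    return pandas.DataFrame(wide)


# Hypothetical sample shaped like the data categories table
data_categories = pandas.DataFrame({
    "Data Type": ["Personal", "Health"],
    "Yes / No": ["Yes", "No"],
    "Nature of Data / Info Used": ["Personal details", "Health details"],
})
wide_categories = widen_by_first_column(data_categories)
print(list(wide_categories.columns))
```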


### Table 6 (STAKEHOLDERS)

In [32]:
tables[6]

Unnamed: 0,Stakeholder,Contacts / Description
0,Cladding TaskForce,TBC
1,VBA,TBC
2,,
3,,
4,,


In [33]:
# `subset` takes a list of column labels; a bare string raised a TypeError here
tables[6] = tables[6].dropna(how = "all", subset = ["Stakeholder"], axis = 0)
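
The `subset` argument of `dropna` expects a list-like of column labels; passing a bare string raises the `TypeError` above on older pandas versions, while the list form should work regardless of version. A minimal sketch on made-up stakeholder rows:

```python
import pandas

# Hypothetical sample shaped like the stakeholders table
stakeholders = pandas.DataFrame({
    "Stakeholder": ["Cladding TaskForce", "VBA", None],
    "Contacts / Description": ["TBC", "TBC", None],
})

# Wrap the column label in a list so pandas treats it as a collection
filtered = stakeholders.dropna(subset=["Stakeholder"])
print(filtered)
```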

In [None]:
tables[6]

In [None]:
tables[6]["Project ID"] = project_code

In [None]:
tables[6]

In [None]:
null_stakeholder_column = tables[6]["Stakeholder"].isnull().sum()/tables[6].shape[0]

In [None]:
if null_stakeholder_column == 1:
    tables[6] = tables[6][:1]

### Table 7 (ARTIFACTS)

In [None]:
tables[7]

In [None]:
tables[7] = first_row_to_headers(tables[7])
tables[7]

In [None]:
tables[7] = compress_table_vertically(tables[7])
tables[7]

### Table 8 (PROJECT OBJECTIVE OUTCOMES)

In [None]:
tables[8]

In [None]:
dictionary = {tables[8].iloc[0,0] : [tables[8].iloc[1,0]],
             tables[8].iloc[2,0] : [tables[8].iloc[3,0]]}

dictionary

In [None]:
tables[8] = pandas.DataFrame(dictionary)
tables[8]

In [None]:
tables[8].loc[0,"Objective"] = convert_list_to_pipelinestring(tables[8].loc[0,"Objective"])

In [None]:
tables[8]

### Table 9 (RISK REGISTER)

In [None]:
tables[9]

In [None]:
# Overwrite the second column with the project code in every row
tables[9].iloc[:, 1] = project_code

In [None]:
tables[9] = compress_table_vertically(tables[9])
tables[9]

### Table 10 (ISSUES REGISTER)

In [None]:
tables[10]

In [None]:
# Overwrite the second column with the project code in every row
tables[10].iloc[:, 1] = project_code

In [None]:
tables[10] = compress_table_vertically(tables[10])
tables[10]

### Table 11 (BENEFITS REGISTER)

In [None]:
tables[11]

In [None]:
# Overwrite the second column with the project code in every row
tables[11].iloc[:, 1] = project_code

In [None]:
tables[11] = compress_table_vertically(tables[11])
tables[11]

### Table 12 (DATA REQUEST REGISTER)

In [None]:
tables[12]

In [None]:
# Overwrite the second column with the project code in every row
tables[12].iloc[:, 1] = project_code

In [None]:
tables[12] = compress_table_vertically(tables[12])
tables[12]

## Combine All Tables

- The resulting table is the stakeholders table (table 6) left-joined against the remaining tables combined into one row; the artifact table (7) and the register tables (9 to 12) are excluded.

### Combine tables 0, 1, 2, 3, 4, 5, 8 horizontally

In [None]:
combined_data1 = pandas.concat([tables[0],
                              tables[1], 
                              tables[2], 
                              tables[3], 
                              tables[4], 
                              tables[5],
                              tables[8]], axis = 1)

combined_data1
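
`pandas.concat` with `axis = 1` aligns rows by index label, so every single-row table needs the same index (here `0`) to end up on one row. The sketch below uses two made-up frames to show what happens when one of them still carries a leftover index of `1`, and how resetting the index fixes it.

```python
import pandas

left = pandas.DataFrame({"Project ID": ["000"]})  # index label 0
right = pandas.DataFrame({"Status Update": ["Exec summary"]}, index=[1])  # index label 1

# Index labels differ, so the result has two rows padded with nulls
misaligned = pandas.concat([left, right], axis=1)

# Resetting the index puts both frames on label 0, giving a single row
aligned = pandas.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned.shape, aligned.shape)
```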

In [None]:
combined_data2 = tables[6].merge(combined_data1,
                                 on = "Project ID",
                                 how = "left")
combined_data2
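
The left join repeats the single project row once per stakeholder. A minimal sketch with made-up frames follows; the optional `validate` argument is an addition here, not something the notebook uses, and it makes pandas raise an error if the right-hand side ever contains more than one row per `Project ID`.

```python
import pandas

# Hypothetical stakeholder rows, each tagged with the project code
stakeholders = pandas.DataFrame({
    "Stakeholder": ["Cladding TaskForce", "VBA"],
    "Project ID": ["000", "000"],
})

# Hypothetical one-row project summary
project = pandas.DataFrame({
    "Project ID": ["000"],
    "Project Title": ["Building Cladding TaskForce"],
})

merged = stakeholders.merge(project,
                            on="Project ID",
                            how="left",
                            validate="many_to_one")
print(merged)
```

Both key columns hold the padded string code here; the join relies on the two sides storing `Project ID` with the same type and formatting.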

### Export 

In [None]:
filename = "combined_data_" + str(project_code) + ".csv"

combined_data2.to_csv(output_path + filename, index = False)
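
As an alternative to concatenating path strings, the export could be sketched with `pathlib`, which supplies the path separator itself. The example writes to a temporary directory so it can run anywhere; the `combined` frame and the `project_page` folder name are stand-ins chosen for illustration.

```python
import tempfile
from pathlib import Path

import pandas

# Hypothetical stand-ins for combined_data2 and project_code
combined = pandas.DataFrame({"Project ID": ["000"], "Stakeholder": ["VBA"]})
project_code = "000"

with tempfile.TemporaryDirectory() as temp_dir:
    output_dir = Path(temp_dir) / "project_page"
    output_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing

    output_file = output_dir / f"combined_data_{project_code}.csv"
    combined.to_csv(output_file, index=False)

    # Read back as text so the zero-padded code is not parsed as a number
    round_trip = pandas.read_csv(output_file, dtype=str)

print(round_trip)
```

Reading the file back with `dtype=str` keeps `000` intact; without it, pandas would parse the column as an integer.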