## Excel Extraction Notebook

This notebook is an attempt to extract values and formulas from an Excel spreadsheet.

In [4]:
import pandas as pd

simple_values_df = pd.read_excel("../sample_excel_files/simple_values.xlsx")

# Print the basic info. By default, read_excel reads only the first sheet (sheet_name=0).
print(f"Excel's dimension: {simple_values_df.shape}")
simple_values_df.head()

Excel's dimension: (3, 2)


Unnamed: 0,Text,Value
0,iPhone,100
1,Pixel,90
2,Galaxy Note,90


In [5]:
# However, we can retrieve the second sheet if we explicitly specify its name.
# Reference: https://datatofish.com/read_excel/

simple_values_second_df = pd.read_excel("../sample_excel_files/simple_values.xlsx", sheet_name="Sheet2")

# As you can see here, the column names become "Unnamed" because the table starts in the middle of the sheet, so the empty first row is used as the header.
simple_values_second_df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,,,,
2,,,Exercise,Reps
3,,,Pull-ups,8
4,,,Dips,15
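
The "Unnamed" columns above can be avoided by telling pandas where the table starts. The sketch below is an illustration only: it builds its own throwaway workbook (it does not use `simple_values.xlsx`), and it assumes the table header sits in row 4, columns C and D, which is a guess based on the output above. `header` and `usecols` are regular `read_excel` parameters.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import Workbook

# Build a throwaway workbook whose table starts at C4 (assumed layout).
wb = Workbook()
ws = wb.active
ws.title = "Sheet2"
ws["C4"], ws["D4"] = "Exercise", "Reps"
ws["C5"], ws["D5"] = "Pull-ups", 8
ws["C6"], ws["D6"] = "Dips", 15

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "offset_table.xlsx"
    wb.save(path)
    # header=3 means the 4th sheet row (0-indexed) holds the column names;
    # usecols="C:D" keeps only the two columns that contain the table.
    offset_df = pd.read_excel(path, sheet_name="Sheet2", header=3, usecols="C:D")

print(offset_df)
```

For a real file, the `header` and `usecols` values have to be adjusted to wherever the table actually begins.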


In [6]:
# We can also try opening the whole Excel file as a pd.ExcelFile (instead of a single DataFrame) in Pandas!
# Reference: https://www.datacamp.com/community/tutorials/python-excel-tutorial

simple_values_xl = pd.ExcelFile("../sample_excel_files/simple_values.xlsx")

print("Sheet names:")
print(simple_values_xl.sheet_names)

Sheet names:
['Sheet1', 'Sheet2', 'Sheet3']
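
If every sheet is needed at once, `read_excel` also accepts `sheet_name=None`, which returns a dictionary that maps each sheet name to its DataFrame. The following is a minimal sketch on a small workbook created on the fly; the file name and its contents are made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import Workbook

# Build a small two-sheet workbook (hypothetical data for illustration).
wb = Workbook()
first = wb.active
first.title = "Sheet1"
first.append(["Text", "Value"])
first.append(["iPhone", 100])
second = wb.create_sheet("Sheet2")
second.append(["Exercise", "Reps"])
second.append(["Pull-ups", 8])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "all_sheets.xlsx"
    wb.save(path)
    # sheet_name=None loads every sheet into a dict of DataFrames.
    all_sheets = pd.read_excel(path, sheet_name=None)

for name, frame in all_sheets.items():
    print(f"{name}: {frame.shape}")
```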


In [7]:
# Now, let's try parsing each sheet.

sheet_1_df = simple_values_xl.parse('Sheet1')
sheet_2_df = simple_values_xl.parse('Sheet2')

print("Sheet 2:")
sheet_2_df.head()

# Yeah, the result is the same as with the read_excel method.

Sheet 2:


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,,,,
2,,,Exercise,Reps
3,,,Pull-ups,8
4,,,Dips,15


In [11]:
# What about a sheet with formulas?

sheet_3_df = simple_values_xl.parse('Sheet3')

print("Sheet 3:")
sheet_3_df.head()

# Oh wow, it reads the computed result of the formula (the Total row), not the formula text!

Sheet 3:


Unnamed: 0,Name,Price
0,Pizza,10
1,Samosa,2
2,Satay,2
3,Nasi Goreng,5
4,Total,19


In [16]:
# Attempt to get the values from the DF:

print("Object from row 0:")
print(sheet_3_df.iloc[0])

print(f"Object Name: {sheet_3_df.iloc[0]['Name']}")
print(f"Object Price: {sheet_3_df.iloc[0]['Price']}")

Object from row 0:
Name     Pizza
Price       10
Name: 0, dtype: object
Object Name: Pizza
Object Price: 10
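
Once the values are in a DataFrame, the "Total" row can be checked against the individual prices. The sketch below uses a stand-in DataFrame rebuilt by hand from the Sheet 3 output above rather than the real file, so the names `is_total`, `items_sum`, and `total_value` are only illustrative.

```python
import pandas as pd

# Stand-in for sheet_3_df, rebuilt from the output printed above.
sheet_3_df = pd.DataFrame(
    {
        "Name": ["Pizza", "Samosa", "Satay", "Nasi Goreng", "Total"],
        "Price": [10, 2, 2, 5, 19],
    }
)

# Boolean mask that marks the summary row.
is_total = sheet_3_df["Name"] == "Total"

# Sum of every row except the summary row, and the summary value itself.
items_sum = sheet_3_df.loc[~is_total, "Price"].sum()
total_value = sheet_3_df.loc[is_total, "Price"].iloc[0]

print(f"Sum of items: {items_sum}, Total row: {total_value}")
```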


In [30]:
# What about getting the formula from a cell (instead of the computed value)?
# Reference: https://stackoverflow.com/a/42106262/1448626

from openpyxl import load_workbook
import pandas as pd

simple_workbook = load_workbook("../sample_excel_files/simple_values.xlsx")

print("Workbook sheet names:")
print(simple_workbook.sheetnames)

sheet_3_ranges = simple_workbook['Sheet3']
print("Sheet 3 Ranges:")
print(sheet_3_ranges)

# Since we know which cell holds the formula ('B6'), we can read its value to get the formula text.
b6_cell = sheet_3_ranges['B6']
print(f"Value of B6: {b6_cell.value}")
print(f"Internal value of B6: {b6_cell.internal_value}")

# What's the type of cell B6, anyway?
print(f"Type of B6: {type(b6_cell)}")

Workbook sheet names:
['Sheet1', 'Sheet2', 'Sheet3']
Sheet 3 Ranges:
<Worksheet "Sheet3">
Value of B6: =SUM(B2:B5)
Internal value of B6: =SUM(B2:B5)
Type of B6: <class 'openpyxl.cell.cell.Cell'>
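
Hard-coding 'B6' only works when the formula's location is known in advance. As a rough sketch, the sheet can instead be scanned for formula cells: openpyxl marks them with `data_type == "f"`. The workbook here is built in memory to mirror the Sheet 3 layout, and the `formulas` dictionary is just a name chosen for this example.

```python
from openpyxl import Workbook

# In-memory workbook that mimics the Sheet3 layout (assumed, not the real file).
wb = Workbook()
ws = wb.active
ws.title = "Sheet3"
ws.append(["Name", "Price"])
for name, price in [("Pizza", 10), ("Samosa", 2), ("Satay", 2), ("Nasi Goreng", 5)]:
    ws.append([name, price])
ws.append(["Total", "=SUM(B2:B5)"])

# Collect every cell that openpyxl classifies as a formula.
formulas = {}
for row in ws.iter_rows():
    for cell in row:
        if cell.data_type == "f":
            formulas[cell.coordinate] = cell.value

print(formulas)
```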


## Formula and computed value comparison

Now that we're able to retrieve the formula and value of a cell, how do we compare one to the other?

In [1]:
from openpyxl import load_workbook

computed_value_wb = load_workbook("../sample_excel_files/simple_values.xlsx", data_only=True)
formula_wb = load_workbook("../sample_excel_files/simple_values.xlsx", data_only=False) # False is the default option.

sheet_3_computed_value = computed_value_wb['Sheet3']
sheet_3_formula = formula_wb['Sheet3']

print("Values of B6:")
print(f"Formula: {sheet_3_formula['B6'].value}")
print(f"Computed value: {sheet_3_computed_value['B6'].value}")

Values of B6:
Formula: =SUM(B2:B5)
Computed value: 19
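
One caveat with `data_only=True`: openpyxl does not evaluate formulas itself, it only returns the value cached in the file by the application that last saved it (Excel, for the sample file above). The sketch below pairs each formula with its cached value for a whole sheet. Because the workbook here is written by openpyxl and never opened in Excel, the cached value comes back as `None`. The in-memory workbook and the `comparison` dictionary are assumptions made for this example.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook

# Workbook written by openpyxl only, so no computed values are cached in it.
wb = Workbook()
ws = wb.active
ws.title = "Sheet3"
ws.append(["Name", "Price"])
for name, price in [("Pizza", 10), ("Samosa", 2), ("Satay", 2), ("Nasi Goreng", 5)]:
    ws.append([name, price])
ws.append(["Total", "=SUM(B2:B5)"])

buffer = BytesIO()
wb.save(buffer)
data = buffer.getvalue()

# Load the same bytes twice: once for formulas, once for cached values.
formula_ws = load_workbook(BytesIO(data))["Sheet3"]
value_ws = load_workbook(BytesIO(data), data_only=True)["Sheet3"]

# Map each formula cell to (formula text, cached value).
comparison = {}
for row in formula_ws.iter_rows():
    for cell in row:
        if cell.data_type == "f":
            comparison[cell.coordinate] = (cell.value, value_ws[cell.coordinate].value)

print(comparison)
```

With a file that was last saved by Excel, the second element of each pair would hold the computed number instead, as in the B6 output above.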
