# Reading poorly structured Excel files with Pandas

This is a walkthrough of [Chris Moffitt's Practical Business Python](https://pbpython.com/pandas-excel-range.html) blog post from Oct 19, 2020.

This post came across my feed right as we were looking at Pandas, and I think it has great examples with real-world datasets. In an ideal world, all data would be nicely formatted and easy to work with. That world does not exist: data are messy, and people don't follow best practices when formatting files.

Note that this tutorial requires `openpyxl` >= 3.0.4.

In [1]:
# Download the example file to the current directory.
!wget https://github.com/chris1610/pbpython/raw/master/data/shipping_tables.xlsx

--2020-10-20 11:41:39--  https://github.com/chris1610/pbpython/raw/master/data/shipping_tables.xlsx
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/chris1610/pbpython/master/data/shipping_tables.xlsx [following]
--2020-10-20 11:41:39--  https://raw.githubusercontent.com/chris1610/pbpython/master/data/shipping_tables.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.204.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.204.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16307 (16K) [application/octet-stream]
Saving to: ‘shipping_tables.xlsx.2’


2020-10-20 11:41:39 (1.86 MB/s) - ‘shipping_tables.xlsx.2’ saved [16307/16307]



In [2]:
# Need a module for Excel that isn't installed on HiPerGator.
#  `pip install MODULE --user` installs a module into your user directory.
#  Note: xlrd 2.0 and later read only .xls files; .xlsx files need openpyxl.
!pip install xlrd --user



In [3]:
import pandas as pd

# Let's see what happens if we simply try to read this into a dataframe
df = pd.read_excel('shipping_tables.xlsx')
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Date,2020-01-01 00:00:00
0,,order id,order date,state,priority,item_type,,Notes,,
1,,669165933,2019-01-03 00:00:00,MN,2-day,Baby Food,,Check this one out,,
2,,963881480,2019-01-04 00:00:00,WI,next-day,Cereal,,,,
3,,341417157,2019-01-05 00:00:00,TX,2-day,Office Supplies,,,,
4,,514321792,2019-01-06 00:00:00,CA,next-day,Office Supplies,,,,


Simply reading the file in with `pd.read_excel()` gives messy results because the first row of the first sheet contains only a date, in columns I and J. The real headers are in the second row of the Excel file. Column A has no data, so we can ignore it. The `pd.read_excel()` function has options to deal with both problems.

In [4]:
df = pd.read_excel('shipping_tables.xlsx', header=1, usecols='B:F')
df.head()

Unnamed: 0,order id,order date,state,priority,item_type
0,669165933,2019-01-03,MN,2-day,Baby Food
1,963881480,2019-01-04,WI,next-day,Cereal
2,341417157,2019-01-05,TX,2-day,Office Supplies
3,514321792,2019-01-06,CA,next-day,Office Supplies
4,115456712,2019-01-07,CA,2-day,Office Supplies


Here `header=1` is the 0-based index of the header row, which is the second row in this case.

The `usecols` parameter accepts several kinds of specification: a string of Excel column letters and ranges (as above), a list of integer positions, a list of column names, or a callable that is applied to each column name. The original post also looks at using a lambda function that lower-cases the column names when selecting them, so that multiple files with similar column names can be combined.
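
The different `usecols` forms can be compared side by side. The sketch below is not from the original post: it builds a small stand-in workbook (the file name `demo_orders.xlsx`, its rows, and the variable names are made up for illustration) with the same quirks as `shipping_tables.xlsx`, then reads it four ways.

```python
import os
import tempfile

import pandas as pd
from openpyxl import Workbook

# Hypothetical stand-in for the example file: empty column A,
# a stray date in row 1, and the real headers in row 2.
wb = Workbook()
ws = wb.active
ws["I1"] = "Date"
ws["J1"] = "2020-01-01"
demo_rows = [
    [None, "Order ID", "State", "Priority", "Item_Type"],
    [None, 1001, "MN", "2-day", "Baby Food"],
    [None, 1002, "WI", "next-day", "Cereal"],
]
for r, row in enumerate(demo_rows, start=2):
    for c, value in enumerate(row, start=1):
        ws.cell(row=r, column=c, value=value)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_orders.xlsx")
    wb.save(path)

    # 1. Excel-style column letters
    by_letters = pd.read_excel(path, header=1, usecols="B:E")
    # 2. 0-based integer positions
    by_position = pd.read_excel(path, header=1, usecols=[1, 2, 3, 4])
    # 3. Column names exactly as they appear in the header row
    by_name = pd.read_excel(path, header=1, usecols=["Order ID", "State"])
    # 4. A callable: keep a column if its lower-cased name is wanted
    wanted = {"order id", "state", "priority", "item_type"}
    by_callable = pd.read_excel(
        path, header=1, usecols=lambda name: str(name).lower() in wanted
    )

print(by_callable.head())
```

The callable form is the most forgiving of the four, because it still matches when one file spells a header `Item_Type` and another spells it `item_type`.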

Pandas can also read from (and write to) lots of different types of data sources. Check the [I/O section of the Pandas docs](https://pandas.pydata.org/pandas-docs/stable/reference/io.html).

## Reading from Excel Worksheets, Ranges and Tables

Also notice above that we got the data from the first worksheet only, with no indication that the Excel file has two worksheets.
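
One way to check which worksheets a file contains without leaving pandas is `pd.ExcelFile`, whose `sheet_names` attribute lists them; passing `sheet_name=None` to `pd.read_excel` returns a dict with one dataframe per sheet. A minimal sketch, assuming a made-up two-sheet workbook (`demo_two_sheets.xlsx`) rather than the downloaded file:

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical workbook with two sheets, written only for this demo.
    path = os.path.join(tmp, "demo_two_sheets.xlsx")
    with pd.ExcelWriter(path) as writer:
        pd.DataFrame({"order id": [1, 2]}).to_excel(
            writer, sheet_name="sales", index=False
        )
        pd.DataFrame({"ship_type": ["Cereal"]}).to_excel(
            writer, sheet_name="shipping_rates", index=False
        )

    # List the worksheet names without reading any data.
    with pd.ExcelFile(path) as xlsx:
        sheet_names = xlsx.sheet_names

    # sheet_name=None reads every sheet into a dict keyed by sheet name.
    all_sheets = pd.read_excel(path, sheet_name=None)

print(sheet_names)
print({name: frame.shape for name, frame in all_sheets.items()})
```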

The example in the file may seem extreme, but again, people are people, publishers are publishers, and there is certainly data out there in these formats. One file may not be an issue to fix by hand, but what if you had hundreds of these files to work with?
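
For the many-files case, the same `read_excel` options can be wrapped in a function and applied in a loop. This is a sketch under stated assumptions, not something from the original post: the helper name `read_orders`, the `orders_*.xlsx` naming pattern, and the demo data are all hypothetical, and it assumes every file shares the layout seen above (headers in row 2, data in columns B through F).

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_orders(path):
    """Read one workbook with the same options used for the single file."""
    frame = pd.read_excel(path, header=1, usecols="B:F")
    frame["source_file"] = Path(path).name  # remember where each row came from
    return frame


demo = pd.DataFrame(
    {
        "order id": [1001, 1002],
        "order date": ["2019-01-03", "2019-01-04"],
        "state": ["MN", "WI"],
        "priority": ["2-day", "next-day"],
        "item_type": ["Baby Food", "Cereal"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Write two made-up files with the header in row 2, starting at column B.
    for i in (1, 2):
        demo.to_excel(
            folder / f"orders_{i:02d}.xlsx", startrow=1, startcol=1, index=False
        )

    # Read every matching file and stack the results into one dataframe.
    frames = [read_orders(p) for p in sorted(folder.glob("orders_*.xlsx"))]
    combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Keeping the file name in a `source_file` column makes it easier to trace a suspicious row back to the workbook it came from.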

While we can get part of the way there using `sheet_name` in `pd.read_excel`, that function doesn't know about Table names:

In [5]:
df_rates = pd.read_excel('shipping_tables.xlsx', sheet_name='shipping_rates')

df_rates.head()

Unnamed: 0,ship_type,Notes,ship_cost,Unnamed: 3,Unnamed: 4
0,Baby Food,2-day and next-day,5-7,,
1,Cereal,next-day and 2-day,8-11,,
2,Fruit,next-day and 2-day,5-6,,
3,Office Supplies,2-day and next-day,7-9,,
4,,,,,


To access named tables within the Excel file, we can use another module, `openpyxl`.

In [6]:
from openpyxl import load_workbook

# The path is passed as the `filename` argument (it can also be given positionally).
wb = load_workbook(filename='shipping_tables.xlsx')
type(wb)  # Like Pandas dataframes, openpyxl adds its own data type: Workbook.

openpyxl.workbook.workbook.Workbook

In [7]:
wb.sheetnames

['sales', 'shipping_rates']

In [8]:
# Create a sheet variable with the shipping_rates sheet

sheet = wb['shipping_rates']

Look at the named tables within that sheet.

In [12]:
sheet.tables.keys()


dict_keys(['ship_cost'])

In [14]:
# Get the Excel range for the ship_cost table:
# Again, different than the 
lookup_table = sheet.tables['ship_cost']
lookup_table.ref

'C8:E16'

In [15]:
# Using the above range, convert the data into a data frame

# Access the data in the table range
data = sheet[lookup_table.ref]
rows_list = []

# Loop through each row and get the values in the cells
for row in data:
    # Get a list of all columns in each row
    cols = []
    for col in row:
        cols.append(col.value)
    rows_list.append(cols)

# Create a pandas dataframe from the rows_list.
# The first row is the column names
df = pd.DataFrame(data=rows_list[1:], index=None, columns=rows_list[0])
df.head()

Unnamed: 0,item_type,priority,shipping_cost
0,Baby Food,2-day,5
1,Baby Food,next-day,7
2,Cereal,2-day,8
3,Cereal,next-day,11
4,Fruit,2-day,5
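
The loop above can be folded into a small reusable function. The sketch below is one possible way to do that, not the original post's code: the helper name `table_to_frame` is hypothetical, and the workbook is built in memory with a few rows copied from the output above so that the example runs without the downloaded file.

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.worksheet.table import Table


def table_to_frame(sheet, table_name):
    """Return the named Excel table on `sheet` as a dataframe.

    Assumes the first row of the table range holds the column names.
    """
    ref = sheet.tables[table_name].ref  # e.g. 'C8:E11'
    rows = [[cell.value for cell in row] for row in sheet[ref]]
    return pd.DataFrame(rows[1:], columns=rows[0])


# Build a made-up sheet with a named table that starts at cell C8.
wb = Workbook()
ws = wb.active
ws.title = "shipping_rates"
demo_rows = [
    ("item_type", "priority", "shipping_cost"),
    ("Baby Food", "2-day", 5),
    ("Baby Food", "next-day", 7),
    ("Cereal", "2-day", 8),
]
for r, row in enumerate(demo_rows, start=8):
    for c, value in enumerate(row, start=3):
        ws.cell(row=r, column=c, value=value)
ws.add_table(Table(displayName="ship_cost", ref="C8:E11"))

df_cost = table_to_frame(ws, "ship_cost")
print(df_cost)
```

With the real file, the equivalent call would be `table_to_frame(wb['shipping_rates'], 'ship_cost')` on the workbook loaded earlier.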


## Summary

Chris also points to a paper that I typically assign later in the semester, [Broman and Woo (2018)](https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989), which covers best practices for organizing data in spreadsheets.