![image.png](https://raw.githubusercontent.com/fjvarasc/DSPXI/master/figures/python_logo.png)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Packages-To-Parse-Excel-Files-And-Write-Them-Back-With-Python" data-toc-modified-id="Packages-To-Parse-Excel-Files-And-Write-Them-Back-With-Python-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Packages To Parse Excel Files And Write Them Back With Python</a></span><ul class="toc-item"><li><span><a href="#How-To-Read-and-Write-Excel-Files-With-openpyxl" data-toc-modified-id="How-To-Read-and-Write-Excel-Files-With-openpyxl-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>How To Read and Write Excel Files With openpyxl</a></span></li><li><span><a href="#Reading-And-Formatting-Excel-Files:-xlrd" data-toc-modified-id="Reading-And-Formatting-Excel-Files:-xlrd-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Reading And Formatting Excel Files: xlrd</a></span></li><li><span><a href="#Using-pyexcel-To-Read-.xls-or-.xlsx-Files" data-toc-modified-id="Using-pyexcel-To-Read-.xls-or-.xlsx-Files-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Using pyexcel To Read .xls or .xlsx Files</a></span></li><li><span><a href="#Writing-Files-With-pyexcel" data-toc-modified-id="Writing-Files-With-pyexcel-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Writing Files With pyexcel</a></span></li><li><span><a href="#Reading-and-Writing-.csv-files" data-toc-modified-id="Reading-and-Writing-.csv-files-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Reading and Writing .csv files</a></span></li><li><span><a href="#Final-Check-of-Your-Data" data-toc-modified-id="Final-Check-of-Your-Data-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Final Check of Your Data</a></span></li></ul></li></ul></div>

# Packages To Parse Excel Files And Write Them Back With Python
Besides the Pandas package, which you will probably use most often to load your data, you can also use other packages to get your data into Python. Our overview of the available packages is based on [this page](http://www.python-excel.org/), which lists packages that you can use to work with Excel files in Python.

In what follows, you’ll see how to use these packages with the help of some real-life but simplified examples.

## How To Read and Write Excel Files With openpyxl

This package is generally recommended if you want to read and write .xlsx, .xlsm, .xltx, and .xltm files.

In [157]:
# Import the `load_workbook` function from `openpyxl`
from openpyxl import load_workbook

# Load in the workbook
wb = load_workbook('./IMDB-Movie-Data.xlsx')
#wb = load_workbook('https://github.com/fjvarasc/DSPXI/blob/master/data/IMDB-Movie-Data.xlsx?raw=true')
# Get sheet names
print(wb.sheetnames)

['2010', '2009', '2008', '2007', '2006']


You see that the code chunk above returns the sheet names of the workbook that you loaded in Python. Next, you can use this information to also retrieve separate sheets of the workbook.

You can also check which sheet is currently active with `wb.active`. As you can see in the code below, you can also use it to load in another sheet from your workbook:

In [114]:
# Get a sheet by name 
sheet = wb['2010']

# Print the sheet title 
sheet.title

# Get currently active sheet
anotherSheet = wb.active

# Check `anotherSheet` 
anotherSheet

<Worksheet "2010">
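
If you do not have the IMDB workbook at hand, the same sheet handling can be tried on a workbook built in memory. The following is a minimal sketch under that assumption; the sheet titles are hypothetical stand-ins for the ones in the example file.

```python
from openpyxl import Workbook

# Build a small workbook in memory (hypothetical sheet names, no file needed)
wb = Workbook()

# A new workbook starts with a single sheet, which is also the active one
first = wb.active
first.title = "2010"

# Add two more sheets; they are appended after the existing one
wb.create_sheet("2009")
wb.create_sheet("2008")

print(wb.sheetnames)      # ['2010', '2009', '2008']
print(wb.active.title)    # 2010

# Make the second sheet (index 1) the active one
wb.active = 1
print(wb.active.title)    # 2009
```

Note that `wb.active` only tells you which sheet Excel would show first; retrieving a sheet by name with `wb['2009']` works regardless of which sheet is active.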

You’ll see that with these `Worksheet` objects, you won’t be able to do much at first sight. However, you can retrieve values from certain cells in your workbook's sheet by using square brackets `[]`, to which you pass the exact cell from which you want to retrieve the value.

Note that this looks very similar to selecting and indexing NumPy arrays and Pandas DataFrames, yet it is not all that you need to do to get the value: you also need to add the attribute `value`:

In [115]:
# Retrieve the value of a certain cell
sheet['A1'].value

# Select element 'B2' of your sheet 
c = sheet['B2']

# Retrieve the row number of your element
c.row

# Retrieve the column of your element
# (openpyxl 2.6+ returns the index 2; use `c.column_letter` to get 'B')
c.column

# Retrieve the coordinates of the cell 
c.coordinate

'B2'

As you can see, besides `value`, there are other attributes that you can use to inspect your cell, namely `row`, `column` and `coordinate`.

- The `row` attribute will give back `2`;
- The `column` attribute identifies the column of `c`: older openpyxl versions return the letter `'B'`, while openpyxl 2.6 and later return the index `2` and expose the letter through `column_letter`; and
- The `coordinate` will give back `'B2'`.

You can also retrieve cell values by using the `cell()` function. Pass the `row` and the `column` arguments and add values to these arguments that correspond to the values of the cell that you want to retrieve and, of course, don’t forget to add the attribute `value`:

In [116]:
# Retrieve cell value 
sheet.cell(row=1, column=2).value

# Print out values in column 2 
for i in range(1, 4):
     print(i, sheet.cell(row=i, column=2).value)

1 Title
2 Inception
3 Shutter Island


Note that if you don’t specify the attribute `value`, you’ll get back the cell object itself, shown as something like `<Cell '2010'.B1>`, which doesn’t tell you anything about the value that is contained within that particular cell.

The for loop above uses the `range()` function to print the values in column 2 for rows 1 to 3. If any of those cells are empty, you’ll just get back `None`.
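
To make the two access styles concrete, here is a small sketch on an in-memory sheet. The two data rows are copied from the example output above, and the workbook itself is a stand-in for the IMDB file.

```python
from openpyxl import Workbook

# In-memory stand-in for the '2010' sheet
wb = Workbook()
ws = wb.active
ws.title = "2010"
ws.append(["Rank", "Title", "Genre"])
ws.append([81, "Inception", "Action,Adventure,Sci-Fi"])

# Square-bracket access uses the Excel-style coordinate
c = ws["B2"]
print(c.value, c.row, c.coordinate)    # Inception 2 B2

# `cell()` addresses the same cell with 1-based row and column numbers
same = ws.cell(row=2, column=2)
print(same.value == c.value)           # True

# A cell that was never filled in holds `None`
empty = ws.cell(row=5, column=2)
print(empty.value)                     # None
```

Both styles return `Cell` objects, so `.value` is needed in either case to get at the content.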

What’s more, there are also helper functions that convert between column letters and column numbers: `get_column_letter()` and `column_index_from_string()`.

The names already state more or less what the functions do, but for clarity it’s best to make them explicit: the former returns the letter of a column when you pass its index, and the latter does the reverse and returns the index of a column when you pass its letter. You can see how it works below:

In [117]:
# Import the relevant functions from `openpyxl.utils`
from openpyxl.utils import get_column_letter, column_index_from_string

# Return 'A'
get_column_letter(1)

# Return 1
column_index_from_string('A')

1
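
To see what this conversion involves, the sketch below reimplements it in plain Python. This is not openpyxl's own code; the function names `column_letter` and `column_index` are made up for illustration, and the logic simply treats column letters as a base-26 numbering without a zero digit.

```python
def column_letter(index: int) -> str:
    """Convert a 1-based column index to Excel-style letters (1 -> 'A')."""
    if index < 1:
        raise ValueError("column index must be 1 or greater")
    letters = ""
    while index > 0:
        # Shift by one because 'A' stands for 1, not 0
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def column_index(letters: str) -> int:
    """Convert Excel-style column letters to a 1-based index ('A' -> 1)."""
    index = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column letter: {ch!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


print(column_letter(1), column_letter(27), column_letter(703))  # A AA AAA
print(column_index("A"), column_index("AA"))                    # 1 27
```

In practice you would keep using the `openpyxl.utils` functions; the sketch only shows why column 27 becomes `'AA'` rather than `'BA'`.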

You have already retrieved values for rows that have values in a particular column, but what do you need to do if you want to print out the rows of your file, without just focusing on one single column?

You use another for loop, of course!

Say, for example, that you want to focus on the area between `'A1'` and `'C3'`, where the first specifies the upper-left corner and the second the bottom-right corner of the area.

Slicing the sheet with `sheet['A1':'C3']` returns that area as a sequence of rows, and each of those rows is the `cellObj` that you see in the first line of code below. For each cell in a row, you print the coordinate and the value that the cell contains. At the end of each row, you print a message that signals that the row has been printed.

In [118]:
# Print row per row
for cellObj in sheet['A1':'C3']:
    for cell in cellObj:
        print(cell.coordinate, cell.value)
    print('--- END ---')

A1 Rank
B1 Title
C1 Genre
--- END ---
A2 81
B2 Inception
C2 Action,Adventure,Sci-Fi
--- END ---
A3 139
B3 Shutter Island
C3 Mystery,Thriller
--- END ---


Note again how the selection of the area is very similar to selecting, getting and indexing list and NumPy array elements, where you also use square brackets and a colon `:` to indicate the area of which you want to get the values. In addition, the above loop also makes good use of the cell attributes!
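
An alternative to slicing is the worksheet's `iter_rows()` method, which takes the bounds of the area as numbers. The sketch below assumes a reasonably recent openpyxl (the `values_only` flag is not available in old releases) and uses an in-memory sheet with rows copied from the output above.

```python
from openpyxl import Workbook

# In-memory stand-in for the '2010' sheet
wb = Workbook()
ws = wb.active
ws.append(["Rank", "Title", "Genre"])
ws.append([81, "Inception", "Action,Adventure,Sci-Fi"])
ws.append([139, "Shutter Island", "Mystery,Thriller"])

# Walk over rows 1-3 and columns 1-2; `values_only=True` yields plain
# tuples of values instead of tuples of Cell objects
rows = list(ws.iter_rows(min_row=1, max_row=3, min_col=1, max_col=2,
                         values_only=True))
for row in rows:
    print(row)
# ('Rank', 'Title')
# (81, 'Inception')
# (139, 'Shutter Island')
```

Numeric bounds are handy when the area is computed rather than typed, for example when you loop up to `max_row`.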

Lastly, there are some attributes that you can use to check up on the result of your import, namely `max_row` and `max_column`. These attributes are of course general ways of making sure that you loaded in the data correctly, but nonetheless they can and will be useful.

In [119]:
# Retrieve the number of the last row that contains data
sheet.max_row

61

In [120]:
# Retrieve the number of the last column that contains data
sheet.max_column

12
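
One reason these attributes are worth checking is that a single stray cell changes them. The following sketch, on a hypothetical in-memory sheet, shows the effect.

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["Rank", "Title", "Genre"])
ws.append([81, "Inception", "Action,Adventure,Sci-Fi"])
ws.append([139, "Shutter Island", "Mystery,Thriller"])

print(ws.max_row, ws.max_column)   # 3 3
print(ws.dimensions)               # A1:C3

# A stray value far away from the table stretches the reported extent
ws["E10"] = "stray"
print(ws.max_row, ws.max_column)   # 10 5
```

If `max_row` or `max_column` is larger than you expect, leftover cells outside the table are a likely cause.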

This is all very good, but you may be thinking that this is an awfully hard way to work with these files, especially if you still want to manipulate the data.
Fortunately, `openpyxl` works well with Pandas DataFrames: you can pass the values of a sheet to the `DataFrame()` constructor of the Pandas package:

In [121]:
# Import `pandas` 
import pandas as pd

# Convert Sheet to DataFrame
df = pd.DataFrame(sheet.values)

If you want to specify headers, you need to add a little bit more code:

In [122]:
# Put the sheet values in `data`
data = sheet.values

# Use the first row of the sheet values as the column names
cols = next(data)

# Convert your data to a list
data = list(data)

# Make your DataFrame
pd.DataFrame(data, columns=cols)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,81,Inception,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
1,139,Shutter Island,"Mystery,Thriller","In 1954, a U.S. marshal investigates the disap...",Martin Scorsese,"Leonardo DiCaprio, Emily Mortimer, Mark Ruffal...",2010,138,8.1,855604,127.97,63.0
2,142,Diary of a Wimpy Kid,"Comedy,Family",The adventures of a teenager who is fresh out ...,Thor Freudenthal,"Zachary Gordon, Robert Capron, Rachael Harris,...",2010,94,6.2,34184,64.0,56.0
3,159,Scott Pilgrim vs. the World,"Action,Comedy,Fantasy",Scott Pilgrim must defeat his new girlfriend's...,Edgar Wright,"Michael Cera, Mary Elizabeth Winstead, Kieran ...",2010,112,7.5,291457,31.49,69.0
4,220,Kick-Ass,"Action,Comedy",Dave Lizewski is an unnoticed high school stud...,Matthew Vaughn,"Aaron Taylor-Johnson, Nicolas Cage, Chloë Gra...",2010,117,7.7,456749,48.04,66.0
5,228,Predators,"Action,Adventure,Sci-Fi",A group of elite warriors parachute into an un...,Nimród Antal,"Adrien Brody, Laurence Fishburne, Topher Grace...",2010,107,6.4,179450,52.0,51.0
6,245,Percy Jackson & the Olympians: The Lightning T...,"Adventure,Family,Fantasy",A teenager discovers he's the descendant of a ...,Chris Columbus,"Logan Lerman, Kevin McKidd, Steve Coogan,Brand...",2010,118,5.9,148949,88.76,47.0
7,262,Black Swan,"Drama,Thriller",A committed dancer wins the lead role in a pro...,Darren Aronofsky,"Natalie Portman, Mila Kunis, Vincent Cassel,Wi...",2010,108,8.0,581518,106.95,79.0
8,305,She's Out of My League,"Comedy,Romance","An average Joe meets the perfect woman, but hi...",Jim Field Smith,"Jay Baruchel, Alice Eve, T.J. Miller, Mike Vogel",2010,104,6.4,105619,31.58,46.0
9,325,The Social Network,"Biography,Drama",Harvard student Mark Zuckerberg creates the so...,David Fincher,"Jesse Eisenberg, Andrew Garfield, Justin Timbe...",2010,120,7.7,510100,96.92,95.0


Next, you can start manipulating the data with all the functions that the Pandas package has to offer.
To write your Pandas DataFrames back to an Excel file, you can use the `dataframe_to_rows()` function from the `openpyxl.utils.dataframe` module:

In [139]:
# Import `dataframe_to_rows`
from openpyxl.utils.dataframe import dataframe_to_rows

# Load the existing workbook
wb = load_workbook('./IMDB-Movie-Data.xlsx')

# Get the active worksheet of the workbook
ws = wb.active

# Append the rows of the DataFrame to your worksheet
# (the file on disk only changes once you call `wb.save()` with a file name)
for r in dataframe_to_rows(df, index=True, header=True):
    ws.append(r)

In [140]:
ws

<Worksheet "2010">
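
The appended rows live only in memory until the workbook is saved. The sketch below shows one possible round trip: DataFrame to worksheet, save, reload, and back to a DataFrame. It makes a few assumptions: the DataFrame is a tiny hypothetical one, the workbook is saved to an in-memory buffer instead of a file, and `index=False` is used so that no index column is written.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook, load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# Hypothetical data standing in for the IMDB DataFrame
df = pd.DataFrame({"Title": ["Inception", "Shutter Island"],
                   "Year": [2010, 2010]})

# Write the header and the rows into a fresh worksheet
wb = Workbook()
ws = wb.active
ws.title = "2010"
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)

# Save to an in-memory buffer; a file name such as 'out.xlsx' works the same way
buffer = BytesIO()
wb.save(buffer)
buffer.seek(0)

# Load the saved workbook and rebuild the DataFrame from the sheet values
reloaded = load_workbook(buffer)
values = reloaded["2010"].values
header = list(next(values))
df_back = pd.DataFrame(list(values), columns=header)
print(df_back)
```

With a real file you would pass a path to `wb.save()` and the same path to `load_workbook()`.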

But this is definitely not all! The `openpyxl` package offers you high flexibility on how you write your data back to Excel files, changing cell styles or using the write-only mode, which makes it one of the packages that you definitely need to know when you're often working with spreadsheets.
Tip: read up more on how you can change cell styles, change to the write-only mode or how the package works with NumPy [here](https://openpyxl.readthedocs.io/en/stable/pandas.html).

Now, let's also check out some other packages that you can use to get your spreadsheet data in Python. 

## Reading And Formatting Excel Files: xlrd

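This package is ideal if you want to read data and formatting information from files with the .xls extension. Note that xlrd 2.0 and later read only .xls files, so opening the .xlsx workbook in the example below requires an xlrd version older than 2.0.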

In [147]:
# Import `xlrd`
import xlrd

# Open a workbook 
workbook = xlrd.open_workbook('IMDB-Movie-Data.xlsx')

# Load sheets into memory only when they are requested
workbook = xlrd.open_workbook('IMDB-Movie-Data.xlsx', on_demand = True)

When you don’t want to consider the whole workbook, you might want to use functions such as `sheet_by_name()` or `sheet_by_index()` to retrieve the sheets that you do want to use in your analysis.

In [148]:
# Load a specific sheet by name
worksheet = workbook.sheet_by_name('2010')

# Load a specific sheet by index 
worksheet = workbook.sheet_by_index(0)

# Retrieve the value from the cell at zero-based indices (1, 2): row 2, column C
worksheet.cell(1, 2).value

'Action,Adventure,Sci-Fi'

Lastly, you also see that you can retrieve the value at certain coordinates from your sheet; you express those coordinates as zero-based row and column indices.
Continue with [xlwt](https://xlwt.readthedocs.io/en/latest/) and `xlutils` to learn how they relate to the `xlrd` package!

## Using pyexcel To Read .xls or .xlsx Files

Another package that you can use to read spreadsheet data in Python is `pyexcel`. It’s a Python wrapper that provides one API for reading, manipulating and writing data in .csv, .ods, .xls, .xlsx and .xlsm files. Of course, for this tutorial, you will just focus on the .xls and .xlsx files.
To get your data in an array, you can use the `get_array()` function that is contained within the `pyexcel` package:


In [151]:
# Import `pyexcel`
import pyexcel

# Get an array from the data
my_array = pyexcel.get_array(file_name="IMDB-Movie-Data.xlsx")

You can also get your data in an ordered dictionary of lists. You can use the `get_dict()` function:

In [128]:
# Import the `OrderedDict` class from the standard library
from collections import OrderedDict

# Get your data in an ordered dictionary of lists
my_dict = pyexcel.get_dict(file_name="IMDB-Movie-Data.xlsx", name_columns_by_row=0)

# Get your data in a dictionary of 2D arrays
book_dict = pyexcel.get_book_dict(file_name="IMDB-Movie-Data.xlsx")

As the last line shows, if you want a dictionary of two-dimensional arrays or, stated differently, all the workbook sheets in a single dictionary, you can resort to `get_book_dict()`.
Be aware that the two data structures mentioned above, the arrays and dictionaries of your spreadsheet, allow you to create DataFrames of your data with `pd.DataFrame()`. This will make it easier to handle your data!
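
The conversion to a DataFrame could look like the sketch below. To keep it runnable without the spreadsheet, the two variables are hand-written stand-ins that assume the shapes described above: a list of rows with the header first for `get_array()`, and a mapping from column name to values for `get_dict()`.

```python
import pandas as pd

# Hypothetical data shaped like the result of `pyexcel.get_array()`:
# a list of rows, with the header in the first row
my_array = [["Rank", "Title", "Year"],
            [81, "Inception", 2010],
            [139, "Shutter Island", 2010]]
df_from_array = pd.DataFrame(my_array[1:], columns=my_array[0])

# Hypothetical data shaped like the result of `pyexcel.get_dict()`:
# one list of values per column name
my_dict = {"Rank": [81, 139],
           "Title": ["Inception", "Shutter Island"],
           "Year": [2010, 2010]}
df_from_dict = pd.DataFrame(my_dict)

print(df_from_array)
print(df_from_dict)
```

For the array, the header row has to be split off by hand; for the dictionary, the keys become the column names directly.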
Lastly, you can also retrieve the records with `pyexcel` thanks to the `get_records()` function. Just pass the argument `file_name` to the function and you should get back a list of dictionaries, one per row:

In [129]:
# Retrieve the records of the file
records = pyexcel.get_records(file_name="IMDB-Movie-Data.xlsx")
records

[OrderedDict([('Rank', 81),
              ('Title', 'Inception'),
              ('Genre', 'Action,Adventure,Sci-Fi'),
              ('Description',
               'A thief, who steals corporate secrets through use of dream-sharing technology, is given the inverse task of planting an idea into the mind of a CEO.'),
              ('Director', 'Christopher Nolan'),
              ('Actors',
               'Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page, Ken Watanabe'),
              ('Year', 2010),
              ('Runtime (Minutes)', 148),
              ('Rating', 8.8),
              ('Votes', 1583625),
              ('Revenue (Millions)', 292.57),
              ('Metascore', 74)]),
 OrderedDict([('Rank', 139),
              ('Title', 'Shutter Island'),
              ('Genre', 'Mystery,Thriller'),
              ('Description',
               'In 1954, a U.S. marshal investigates the disappearance of a murderess who escaped from a hospital for the criminally insane.'),
              ('

## Writing Files With pyexcel

Just like it’s easy to load your data into arrays with this package, you can also easily export your arrays back to a spreadsheet. Use the `save_as()` function, pass the array to the `array` argument and the name of the destination file to the `dest_file_name` argument:

In [130]:
# Get the data
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Save the array to a file
pyexcel.save_as(array=data, dest_file_name="array_data.xlsx")

Note that if you want to specify a delimiter, for example when you save to a .csv file, you can add the `dest_delimiter` argument and pass the delimiter symbol as a string.
If, however, you have a dictionary, you’ll need to use the `save_book_as()` function. Pass the dictionary of two-dimensional arrays to `bookdict`, specify the file name, and you’re good:

In [131]:
# The data
twod_array_dictionary = {'Sheet 1': [
                                   ['ID', 'AGE', 'SCORE'],
                                   [1, 22, 5],
                                   [2, 15, 6],
                                   [3, 28, 9]
                                  ],
                       'Sheet 2': [
                                    ['X', 'Y', 'Z'],
                                    [1, 2, 3],
                                    [4, 5, 6],
                                    [7, 8, 9]
                                  ],
                       'Sheet 3': [
                                    ['M', 'N', 'O', 'P'],
                                    [10, 11, 12, 13],
                                    [14, 15, 16, 17],
                                    [18, 19, 20, 21]
                                   ]}

# Save the data to a file                        
pyexcel.save_book_as(bookdict=twod_array_dictionary, dest_file_name="2d_array_data.xlsx")

## Reading and Writing .csv files

If you’re still looking for a way to read and write .csv files besides Pandas, your best option is the `csv` module from the Python standard library:


In [132]:
# Write csv file
import csv
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# `newline=''` lets the csv module control line endings; `with` closes the file
with open('data.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter=';', quotechar='"')
    writer.writerows(data)

In [133]:
# import `csv`
import csv
# Read in csv file; `with` closes the file when the loop is done
with open('data.csv', newline='') as infile:
    for row in csv.reader(infile, delimiter=';'):
        print(row)

['1', '2', '3']
['4', '5', '6']
['7', '8', '9']


Note also that the NumPy package has a function `genfromtxt()` that allows you to load in the data that is contained within `.csv` files in arrays which you can then put in DataFrames.
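
As the output above shows, `csv.reader` hands back every value as a string. The sketch below shows one way to deal with that using `csv.DictWriter` and `csv.DictReader`. It assumes all columns hold integers, writes to an in-memory buffer instead of a file, and borrows the hypothetical `ID`/`AGE`/`SCORE` columns from the pyexcel example.

```python
import csv
import io

# In-memory text buffer standing in for a file opened with newline=''
buffer = io.StringIO(newline="")

# Write a header row followed by one dictionary per record
writer = csv.DictWriter(buffer, fieldnames=["ID", "AGE", "SCORE"], delimiter=";")
writer.writeheader()
writer.writerows([{"ID": 1, "AGE": 22, "SCORE": 5},
                  {"ID": 2, "AGE": 15, "SCORE": 6}])

# Read the records back and convert every field from str to int
buffer.seek(0)
rows = [{key: int(value) for key, value in row.items()}
        for row in csv.DictReader(buffer, delimiter=";")]
print(rows)
# [{'ID': 1, 'AGE': 22, 'SCORE': 5}, {'ID': 2, 'AGE': 15, 'SCORE': 6}]
```

With mixed column types you would convert per column instead, which is one of the chores that Pandas takes off your hands.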

## Final Check of Your Data 

When you have the data available, don’t forget the last step: checking whether the data has been loaded in correctly. If you have put your data in a DataFrame, you can easily and quickly check whether the import was successful by running the following commands:

In [134]:
# Check the first entries of the DataFrame
df1.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,81,Inception,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
1,139,Shutter Island,"Mystery,Thriller","In 1954, a U.S. marshal investigates the disap...",Martin Scorsese,"Leonardo DiCaprio, Emily Mortimer, Mark Ruffal...",2010,138,8.1,855604,127.97,63.0
2,142,Diary of a Wimpy Kid,"Comedy,Family",The adventures of a teenager who is fresh out ...,Thor Freudenthal,"Zachary Gordon, Robert Capron, Rachael Harris,...",2010,94,6.2,34184,64.0,56.0
3,159,Scott Pilgrim vs. the World,"Action,Comedy,Fantasy",Scott Pilgrim must defeat his new girlfriend's...,Edgar Wright,"Michael Cera, Mary Elizabeth Winstead, Kieran ...",2010,112,7.5,291457,31.49,69.0
4,220,Kick-Ass,"Action,Comedy",Dave Lizewski is an unnoticed high school stud...,Matthew Vaughn,"Aaron Taylor-Johnson, Nicolas Cage, Chloë Gra...",2010,117,7.7,456749,48.04,66.0


In [135]:
# Check the last entries of the DataFrame
df1.tail()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
55,942,The Twilight Saga: Eclipse,"Adventure,Drama,Fantasy",As a string of mysterious killings grips Seatt...,David Slade,"Kristen Stewart, Robert Pattinson, Taylor Laut...",2010,124,4.9,192740,300.52,58.0
56,953,Sex and the City 2,"Comedy,Drama,Romance","While wrestling with the pressures of life, lo...",Michael Patrick King,"Sarah Jessica Parker, Kim Cattrall, Kristin Da...",2010,146,4.3,62403,95.33,27.0
57,957,Legion,"Action,Fantasy,Horror",When a group of strangers at a dusty roadside ...,Scott Stewart,"Paul Bettany, Dennis Quaid, Charles S. Dutton,...",2010,100,5.2,84158,40.17,32.0
58,964,I Spit on Your Grave,"Crime,Horror,Thriller",A writer who is brutalized during her cabin re...,Steven R. Monroe,"Sarah Butler, Jeff Branson, Andrew Howard,Dani...",2010,108,6.3,60133,0.09,27.0
59,994,Resident Evil: Afterlife,"Action,Adventure,Horror",While still out to destroy the evil Umbrella C...,Paul W.S. Anderson,"Milla Jovovich, Ali Larter, Wentworth Miller,K...",2010,97,5.9,140900,60.13,37.0
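
Beyond `head()` and `tail()`, a few more checks can be useful after an import. The sketch below runs them on a small hypothetical DataFrame (with one missing Metascore added on purpose) rather than on the IMDB data.

```python
import pandas as pd

# Hypothetical stand-in for the imported data, with one missing value
df = pd.DataFrame({"Title": ["Inception", "Shutter Island", "Kick-Ass"],
                   "Rating": [8.8, 8.1, 7.7],
                   "Metascore": [74.0, 63.0, None]})

# Number of rows and columns
print(df.shape)           # (3, 3)

# Data type that Pandas inferred for each column
print(df.dtypes)

# Number of missing values per column
missing = df.isna().sum()
print(missing)
```

A row count that is off, a numeric column that was read as text, or unexpected missing values are the usual signs that something went wrong during loading.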
