
![image.png](https://raw.githubusercontent.com/fjvarasc/DSPXI/master/figures/python_logo.png)

# Using Python And Excel For Data Science

You will probably already know that Excel is a spreadsheet application developed by Microsoft. You can use this easily accessible tool to organize, analyze and store your data in tables. What’s more, this software is widely used in many different application fields all over the world.

And, whether you like it or not, this applies to data science.

You’ll need to deal with these spreadsheets at some point, but you won’t always want to continue working in them either. That’s why Python developers have implemented ways to read, write and manipulate not only these files, but also many other types of files.

Today’s tutorial will give you some insights into how you can work with Excel and Python. It will provide you with an overview of packages that you can use to load and write these spreadsheets to files with the help of Python. You’ll learn how to work with packages such as `pandas`, `openpyxl`, `xlrd`, `xlutils` and `pyexcel`.

## The Data As Your Starting Point
When you’re starting a data science project, you will often work from data that you have gathered yourself, perhaps through web scraping, but more often from datasets that you download from other places, such as Kaggle, Quandl, etc.

More often than not, you’ll also find data on Google or in repositories that are shared by other users. This data might be in an Excel file or saved to a file with a .csv extension, among others. The possibilities can seem endless sometimes. But whenever you have data, your first step should be to make sure that you’re working with high-quality data.

In the case of a spreadsheet, you should check its quality: you want to know not only whether the data can answer the research question that you have in mind, but also whether you can trust the data that the spreadsheet holds.

## Check The Quality of Your Spreadsheet
To check the overall quality of your spreadsheet, you can go over the following checklist:

- Does the spreadsheet represent static data?
- Does your spreadsheet mix data, calculation, and reporting?
- Is the data in your spreadsheet complete and consistent?
    - Does your spreadsheet have a systematic worksheet structure?
    - Did you check if the live formulas in the spreadsheet are valid?

This list of questions is to make sure that your spreadsheet doesn’t ‘sin’ against the best practices that are generally accepted in the industry. Of course, the above list is not exhaustive: there are many more general rules that you can follow to make sure your spreadsheet is not an ugly duckling. However, the questions above are the most relevant ones when you want to check whether the spreadsheet is of good quality.
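
Once a sheet is loaded into a DataFrame, a couple of these checks can be roughed out in code. The snippet below is only a sketch: `sheet_df` and its values are made up for illustration, and counting missing values and duplicated rows covers just the "complete and consistent" question, not the whole checklist.

```python
import pandas as pd

# Hypothetical data standing in for a sheet that you have already loaded
sheet_df = pd.DataFrame({
    "Title": ["Split", "Sing", "Sing", None],
    "Year": [2016, 2016, 2016, 2014],
    "Rating": [7.3, 7.2, 7.2, None],
})

# Completeness: count the missing values in each column
missing_per_column = sheet_df.isna().sum()

# Consistency: count rows that are exact copies of an earlier row
duplicate_rows = int(sheet_df.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

Non-zero counts are not errors by themselves, but they tell you where to look before you start your analysis.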

## Best Practices For Spreadsheet Data
Before reading your spreadsheet into Python, you also want to consider adjusting your file to meet some basic principles, such as:

- The first row of the spreadsheet is usually reserved for the header, while the first column is used to identify the sampling unit;
- Avoid names, values or fields with blank spaces. Otherwise, each word will be interpreted as a separate variable, resulting in errors that are related to the number of elements per line in your data set. Consider using:
    - Underscores,
    - Dashes,
    - Camel case, where the first letter of each section of text is capitalized, or
    - Concatenating words
- Short names are preferred over longer names;
- Try to avoid using names that contain symbols such as `?`, `$`, `%`, `^`, `&`, `*`, `(`, `)`, `-`, `#`, `,`, `<`, `>`, `/`, `|`, `\`, `[`, `]`, `{`, and `}`;
- Delete any comments that you have made in your file to avoid extra columns or NA’s being added to your file; and
- Make sure that any missing values in your data set are indicated with NA.
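
If you cannot edit the file itself, the naming rules above can also be applied after loading. The following is a minimal sketch under that assumption; the DataFrame `raw` and the choice of lowercase names joined by underscores are illustrative, not a fixed convention.

```python
import pandas as pd

# Hypothetical header row with spaces and symbols in the names
raw = pd.DataFrame({
    "Runtime (Minutes)": [121],
    "Revenue (Millions)": [333.13],
    " Title ": ["Split"],
})

clean_names = (
    raw.columns
    .str.strip()                                     # drop surrounding spaces
    .str.replace(r"[^0-9A-Za-z]+", "_", regex=True)  # symbols and spaces become underscores
    .str.strip("_")                                  # no leading or trailing underscore
    .str.lower()
)

cleaned = raw.copy()
cleaned.columns = clean_names
print(list(cleaned.columns))
```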

Next, after you have made the necessary changes or taken a thorough look at your data, make sure that you save any changes you have made. By doing this, you can revisit the data later to edit it or to add more data, while you preserve the formulas that you may have used to calculate the data.

If you’re working with Microsoft Excel, you’ll see that there are a considerable number of options to save your file: besides the default extensions `.xls` and `.xlsx`, you can go to the “File” tab, click on “Save As” and select one of the extensions that are listed as the “Save as Type” options. The most commonly used extensions to save datasets for data science are `.csv` and `.txt` (as tab-delimited text file). Depending on the saving option that you choose, your data set’s fields are separated by tabs or commas, which will make up the “field separator characters” of your data set.
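
To make the role of the field separator concrete, here is a small sketch that reads the same made-up table once as comma-separated and once as tab-separated text. The strings stand in for files on disk; with a real file you would pass its path to `pd.read_csv()` instead of an `io.StringIO` object.

```python
import io

import pandas as pd

# The same hypothetical table, saved with two different field separators
comma_text = "Title,Year\nSplit,2016\nSing,2016\n"
tab_text = "Title\tYear\nSplit\t2016\nSing\t2016\n"

from_csv = pd.read_csv(io.StringIO(comma_text))            # comma is the default
from_tsv = pd.read_csv(io.StringIO(tab_text), sep="\t")    # tab must be stated

# Both reads produce the same DataFrame
print(from_csv.equals(from_tsv))
```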

Now that you have checked and saved your data, you can start with the preparation of your workspace!

# Load Excel Files As Pandas DataFrames
One of the ways that you’ll often use to import your files for data science is with the help of the Pandas package. As we saw previously, the Pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language.

This powerful and flexible library is very frequently used by (aspiring) data scientists to get their data into data structures that are highly expressive for their analyses.

If you already have Pandas available through Anaconda, you can load your files into Pandas DataFrames with `pd.ExcelFile()`:

In [11]:
# Import pandas
import pandas as pd

# Load spreadsheet
xl = pd.ExcelFile('IMDB-Movie-Data.xlsx')
# Print the sheet names
print(xl.sheet_names)


['MainData', 'IMDB-Movie-Data']


In [12]:
# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('IMDB-Movie-Data')

df1.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


## Pandas read_excel
The code below uses the pandas function `read_excel()`, which reads the data into a Pandas DataFrame. The first argument is the filename and the `sheet_name` argument selects the sheet.

The column labels are then available as `df.columns`.

In [25]:
# Only `pd.read_excel()` is needed here
import pandas as pd
 
df = pd.read_excel('IMDB-Movie-Data.xlsx', sheet_name='IMDB-Movie-Data')
 
print("Column headings:")
print(df.columns)

Column headings:
Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')
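
`read_excel()` also accepts arguments that limit what is loaded. The sketch below first writes a tiny stand-in workbook to a temporary folder so that it runs on its own; the file name `movies_demo.xlsx`, the sheet name `Movies` and the three rows are assumptions for illustration, and an Excel engine such as `openpyxl` needs to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in workbook so the example is self-contained
movies = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split"],
    "Year": [2014, 2012, 2016],
})
path = os.path.join(tempfile.mkdtemp(), "movies_demo.xlsx")
movies.to_excel(path, sheet_name="Movies", index=False)

# Read only two columns and the first two data rows of the named sheet
subset = pd.read_excel(path, sheet_name="Movies", usecols=["Title", "Year"], nrows=2)
print(subset)
```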


Using the DataFrame, we can get all the values in a column. To do so, simply index the DataFrame with the column header; the result is a Pandas Series:

In [18]:
print(df['Title'])

0                      Guardians of the Galaxy
1                                   Prometheus
2                                        Split
3                                         Sing
4                                Suicide Squad
5                               The Great Wall
6                                   La La Land
7                                     Mindhorn
8                           The Lost City of Z
9                                   Passengers
10     Fantastic Beasts and Where to Find Them
11                              Hidden Figures
12                                   Rogue One
13                                       Moana
14                                    Colossal
15                     The Secret Life of Pets
16                               Hacksaw Ridge
17                                Jason Bourne
18                                        Lion
19                                     Arrival
20                                        Gold
21           
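
Selecting a column this way gives you a Pandas Series rather than a plain Python list. If you do need a list, a short sketch with made-up data (the name `df_demo` is hypothetical) looks like this:

```python
import pandas as pd

# Hypothetical two-row DataFrame
df_demo = pd.DataFrame({"Title": ["Split", "Sing"], "Year": [2016, 2016]})

titles = df_demo["Title"]        # a pandas Series
title_list = titles.tolist()     # a plain Python list

print(type(titles).__name__)
print(title_list)
```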

## How To Write Pandas DataFrames to Excel Files
Let’s say that after your analysis of the data, you want to write the data back to a new file. You can write your Pandas DataFrames back to files with the `to_excel()` method.

But, before you use this method, make sure that you have XlsxWriter installed if you want to use it as the engine that writes your data to multiple worksheets in an .xlsx file:
```
# Install `XlsxWriter` 
pip install XlsxWriter
```

In [19]:
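# Specify a writer; the `with` block saves and closes the file on exit
# (`writer.save()` was removed in pandas 2.0)
with pd.ExcelWriter('example.xlsx', engine='xlsxwriter') as writer:
    # Write your DataFrame to a sheet named 'Sheet1'
    df1.to_excel(writer, sheet_name='Sheet1')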

Note that in the code chunk above, you use an ExcelWriter object to output the DataFrame.

Stated differently, you pass the writer variable to the `to_excel()` method and you also specify the sheet name. This way, you add a sheet with the data to the workbook that the writer creates: you can use the ExcelWriter to save multiple, (slightly) different DataFrames to one workbook.

This also means that if you just want to save one DataFrame to a file, you do not need the XlsxWriter package specifically. Leave out the `engine` argument that you would pass to `pd.ExcelWriter()`, and pandas will pick an installed engine that can write .xlsx files, such as `openpyxl`. The rest of the steps stay the same.
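
As a sketch of saving several DataFrames to one workbook, the example below writes two made-up tables to separate sheets and reads the sheet names back. The names `ratings`, `revenue` and `two_sheets.xlsx` are hypothetical, and no `engine` is given, so it assumes that pandas can find an installed .xlsx writer such as `openpyxl`.

```python
import os
import tempfile

import pandas as pd

# Two hypothetical DataFrames that belong in the same workbook
ratings = pd.DataFrame({"Title": ["Split", "Sing"], "Rating": [7.3, 7.2]})
revenue = pd.DataFrame({"Title": ["Split", "Sing"], "Revenue": [138.12, 270.32]})

path = os.path.join(tempfile.mkdtemp(), "two_sheets.xlsx")

# One writer, one workbook, one sheet per DataFrame
with pd.ExcelWriter(path) as writer:
    ratings.to_excel(writer, sheet_name="Ratings", index=False)
    revenue.to_excel(writer, sheet_name="Revenue", index=False)

# Read the sheet names back to confirm what was written
with pd.ExcelFile(path) as workbook:
    sheet_names = workbook.sheet_names

print(sheet_names)
```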

Similar to the functions that you use to read in .csv files, there is also a `to_csv()` method to write the results back to a comma-separated file. It works in much the same way as reading in a file: you pass the file name as the first argument:


In [20]:
# Write the DataFrame to csv
df1.to_csv("example.csv")

If you want to have a tab-separated file, you can pass `'\t'` to the `sep` argument. Note that there are various other functions that you can use to output your files. You can find all of them [here](http://pandas.pydata.org/pandas-docs/stable/api.html#id12).
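
A minimal sketch of the `sep` argument, using a made-up DataFrame: when no file name is given, `to_csv()` returns the text instead of writing it, which makes the separator easy to inspect.

```python
import pandas as pd

# Hypothetical two-row DataFrame
df_demo = pd.DataFrame({"Title": ["Split", "Sing"], "Year": [2016, 2016]})

# No path given, so the tab-separated text is returned as a string
tsv_text = df_demo.to_csv(sep="\t", index=False)
print(tsv_text)
```

To write a file instead, pass a path as the first argument, for example `df_demo.to_csv("example.tsv", sep="\t", index=False)`.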

## Packages To Parse Excel Files And Write Them Back With Python
Besides the Pandas package, which you will probably use very often to load in your data, you can also use other packages to get your data in Python. Our overview of the available packages is based on [this page](http://www.python-excel.org/), which includes a list of packages that you can use to work with Excel files in Python.

In what follows, you’ll see how to use these packages with the help of some real-life but simplified examples.

## How To Read and Write Excel Files With openpyxl

This package is generally recommended if you want to read and write `.xlsx`, `.xlsm`, `.xltx`, and `.xltm` files.

In [24]:
# Import the `load_workbook` function from `openpyxl`
from openpyxl import load_workbook

# Load in the workbook
wb = load_workbook('./test.xlsx')

# Get sheet names
print(wb.sheetnames)

['Sheet1', 'Sheet2', 'Sheet3']


You see that the code chunk above returns the sheet names of the workbook that you loaded in Python. Next, you can use this information to also retrieve separate sheets of the workbook.
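
If you do not have a `test.xlsx` at hand, you can try the same steps on a workbook that is built in memory. This is a sketch only: the workbook `wb_demo` and its sheet names are made up to mirror the output above.

```python
from openpyxl import Workbook

# A new workbook starts with a single sheet; rename it and add two more
wb_demo = Workbook()
wb_demo.active.title = "Sheet1"
wb_demo.create_sheet("Sheet2")
wb_demo.create_sheet("Sheet3")

print(wb_demo.sheetnames)

# Use a name from that list to retrieve a single sheet
if "Sheet2" in wb_demo.sheetnames:
    second = wb_demo["Sheet2"]
    print(second.title)
```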

You can also check which sheet is currently active with `wb.active`. As you can see in the code below, you can also use it to load in another sheet from your workbook:

In [27]:
# Get a sheet by name 
sheet = wb['Sheet3']

# Print the sheet title 
sheet.title

# Get currently active sheet
anotherSheet = wb.active

# Check `anotherSheet` 
anotherSheet

<Worksheet "Sheet3">

You’ll see that with these `Worksheet` objects, you won’t be able to do much at first sight. However, you can retrieve values from certain cells in your workbook's sheet by using square brackets `[]`, to which you pass the exact cell from which you want to retrieve the value.

Note that this seems very similar to selecting, getting and indexing NumPy arrays and Pandas DataFrames, yet this is not all that you need to do to get the value; you need to add the attribute `value`:

In [29]:
# Retrieve the value of a certain cell
sheet['A1'].value

# Select element 'B2' of your sheet 
c = sheet['B2']

# Retrieve the row number of your element
c.row

# Retrieve the column letter of your element
c.column_letter

# Retrieve the coordinates of the cell 
c.coordinate

'B2'

As you can see, besides `value`, there are also other attributes that you can use to inspect your cell, namely `row`, `column_letter` and `coordinate`.

- The `row` attribute will give back `2`;
- The `column_letter` attribute of `c` will give you `'B'` (in openpyxl 2.6 and later, `column` returns the numeric index `2` instead), and
- The `coordinate` will give back `'B2'`.

You can also retrieve cell values by using the `cell()` method. Pass the `row` and `column` arguments, set to the position of the cell that you want to retrieve, and, of course, don’t forget to add the attribute `value`:

In [30]:
# Retrieve cell value 
sheet.cell(row=1, column=2).value

# Print out values in column 2 
for i in range(1, 4):
    print(i, sheet.cell(row=i, column=2).value)

1 N
2 11
3 15


Note that if you don’t specify the attribute `value`, you’ll get back `<Cell Sheet3.B1>`, which doesn’t tell you anything about the value that is contained within that particular cell.

You use a `for` loop with the `range()` function to print the values of rows 1 to 3 in column 2. If any of those cells is empty, you’ll just get back `None`.
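
The hard-coded `range(1, 4)` only works when you already know how many rows there are. A possible variation, sketched here on an in-memory sheet whose values copy the output above, uses the worksheet's `max_row` attribute instead. The names `wb_demo` and `ws` are hypothetical.

```python
from openpyxl import Workbook

# Hypothetical sheet filled with the same values as in the example above
wb_demo = Workbook()
ws = wb_demo.active
for row in (["M", "N", "O"], [10, 11, 12], [14, 15, 16]):
    ws.append(row)

# Walk column 2 from the first row to the last row that holds data
column_b = [ws.cell(row=i, column=2).value for i in range(1, ws.max_row + 1)]
print(column_b)
```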

What’s more, there are also special functions that you can call to get certain other values back, like `get_column_letter()` and `column_index_from_string()`.

The two functions already state more or less what you can retrieve by using them, but for clarity it’s best to make them explicit: while you can retrieve the letter of the column with the former, you can do the reverse or get the index of a column when you pass a letter to the latter. You can see how it works below:

In [31]:
# Import relevant functions from `openpyxl.utils`
from openpyxl.utils import get_column_letter, column_index_from_string

# Return 'A'
get_column_letter(1)

# Return the integer 1
column_index_from_string('A')

1
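
These helpers become more useful past column `Z`, where the letters are harder to work out by hand. A short sketch:

```python
from openpyxl.utils import get_column_letter, column_index_from_string

# Column 27 is the first two-letter column
letter = get_column_letter(27)
index = column_index_from_string("AB")

print(letter, index)

# The two functions are inverses of each other
round_trip = column_index_from_string(get_column_letter(27))
```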

You have already retrieved values for rows that have values in a particular column, but what do you need to do if you want to print out the rows of your file, without just focusing on one single column?

You use another for loop, of course!

Say, for example, that you want to focus on the area that lies between `'A1'` and `'C3'`, where the first specifies the upper-left corner and the second the bottom-right corner of the area on which you want to focus.

Slicing the sheet with `sheet['A1':'C3']` yields that area one row at a time, and each row is the `cellObj` that you see in the first line of code below. For each cell in that row, you print the coordinate and the value that is contained within that cell. After the end of each row, you print a message that signals that the row has been printed.

In [36]:
# Print row per row
for cellObj in sheet['A1':'C3']:
    for cell in cellObj:
        print(cell.coordinate, cell.value)
    print('--- END ---')

A1 M
B1 N
C1 O
--- END ---
A2 10
B2 11
C2 12
--- END ---
A3 14
B3 15
C3 16
--- END ---


Note again how the selection of the area is very similar to selecting, getting and indexing list and NumPy array elements, where you also use square brackets and a colon `:` to indicate the area of which you want to get the values. In addition, the above loop also makes good use of the cell attributes!
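
If you only need the values of such an area, `iter_rows()` with `values_only=True` is an alternative to slicing, and the result can be handed to Pandas. The sketch below rebuilds the same small sheet in memory; `wb_demo`, `ws` and `area_df` are hypothetical names, and treating the first row as the header is an assumption about the data.

```python
import pandas as pd
from openpyxl import Workbook

# Hypothetical sheet with the same values as the area printed above
wb_demo = Workbook()
ws = wb_demo.active
for row in (["M", "N", "O"], [10, 11, 12], [14, 15, 16]):
    ws.append(row)

# Each row of the A1:C3 area comes back as a tuple of plain values
rows = list(ws.iter_rows(min_row=1, max_row=3, min_col=1, max_col=3, values_only=True))

# Assume the first row holds the column names
header, *body = rows
area_df = pd.DataFrame(body, columns=list(header))
print(area_df)
```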