Pre-processing Excel Sheets for Later Data Analysis in R
While cleaning data for a data analysis project in R, I came across a very big (> 100 MB) Excel file (xlsx) containing most of the data. There was no codebook, so I naturally wanted to open the file in Excel first, and have a look at it before loading it into R. It took a very long time to open and when it did, Excel often produced a "not enough resources" error at any modification to the file.
Trying to open the file in R using
read.xlsx2(), or using most of the solutions mentioned here, took even longer than opening it in Excel and wasn't always successful.
In this short guide, I am sharing how I quickly worked around this issue. In particular, how I pre-processed the Excel sheet before reading it into R.
Although my goal was to read the file into R to clean it and analyse it, I learnt some valuable lessons by detouring from the standard flow. I hope you find this guide useful.
How I pre-processed the Excel file
My first objective was to reduce the size of the file so I can manipulate it. I was able to achieve this by doing the following:
The file had 6 sheets. I split each sheet into a separate file. I made copies of the file, iteratively opened each copy in Excel and deleted all but a single sheet. Trying to copy or move a sheet into another file while it is open in Excel consumed more memory so it was not possible in this case.
I checked the file for any color-coded information in preparation for the next step. In my case, colors did not carry any information, but in case it does, I suggest you encode the color-coded information into a variable.
I cleared formats of all the cells in the file.
I then cleared all the data validation rules. These were there to ensure that data entered in certain columns were selected for look-up tables. These look-up tables were available as sheets in the same file.
All formulas were substituted by values. Columns were copied and pasted in-place as values.
All conditional formatting formulas were deleted.
The above steps reduced the file size to 3.5 MB. It is important to mention that the first 4 steps paved the way to steps 5 and 6. Copying and pasting columns (step 5) was impossible without reducing the file size considerably first.
I could have loaded the new Excel file resulting from step 5, into R, but I decided to do more cleaning in Excel first:
I removed empty columns between populated ones, so that all the variables are stacked to the left.
The first row represented the variable names. They were not in English, so to avoid character encoding issues while loading the data into R and to make it easier to manipulate the variables, I translated the variable names to English.
Some records (observations) were invalid, missing required variables. These were filtered and deleted. This could have been easily done in R after the data was loaded, but it was worth doing in Excel since it contributed a lot to reducing the file size, and in some cases, could be an essential step in order to load the data into R. The number of records went down from more than 100,000 to about 9,000.
Since the version of Excel I had had an issue saving the file in Unicode, I saved the file as csv and changed the encoding in Notepad. To do that, I replaced all the
","characters in the file into
" - ". This enabled me to preserve the number of columns in the csv file while using
","as a delimiter. You might want to choose a different substitution or delimiter based on the type of data you are dealing with.
Since the paragraph direction in Excel cells depend on the language of the text in the cell, the sheet had mixed paragraph directions. This could cause columns in the csv file to move before or after another adjacent column, so I set the paragraph direction for all columns to be the same.
The file was finally saved in csv format, re-encoded in Notepad, and loaded into R. But the tidy Excel sheet was also shared with the data providers which they liked and lead to improving the data entry process.
Pre-processing Excel Sheets for Later Data Analysis in R by Ahmad Al-Kashef is licensed under a Creative Commons Attribution 4.0 International License.