# Data consolidation: Part 1

This tutorial shows how to consolidate multiple CSV files into a single dataframe in Python. To follow along, you can download the *[data](https://drive.google.com/open?id=1sFL2MMELasLHEYtoxK1ian4dTxS7UV7S)* (2.12 GB in total).


We are going to use two methods:
* pandas
* Dask with pandas

In [1]:
# import necessary libraries
import pandas as pd
from glob import glob

# **Method 1**: Using Pandas

Step 1: Use `glob()` to list all files that match a pattern, and sort the results.

In [2]:
# sorted function is used to sort the files in order.
sales_files = sorted(glob('Data/Sales_Data*.csv'))
sales_files

['Data\\Sales_Data_01.csv',
 'Data\\Sales_Data_02.csv',
 'Data\\Sales_Data_03.csv',
 'Data\\Sales_Data_04.csv',
 'Data\\Sales_Data_05.csv',
 'Data\\Sales_Data_06.csv',
 'Data\\Sales_Data_07.csv',
 'Data\\Sales_Data_08.csv',
 'Data\\Sales_Data_09.csv',
 'Data\\Sales_Data_10.csv',
 'Data\\Sales_Data_11.csv',
 'Data\\Sales_Data_12.csv',
 'Data\\Sales_Data_13.csv',
 'Data\\Sales_Data_14.csv',
 'Data\\Sales_Data_15.csv',
 'Data\\Sales_Data_16.csv',
 'Data\\Sales_Data_17.csv',
 'Data\\Sales_Data_18.csv',
 'Data\\Sales_Data_19.csv',
 'Data\\Sales_Data_20.csv']
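The backslashes in the listing above come from running the notebook on Windows; on Linux or macOS the same call returns paths with forward slashes. As a minimal sketch of the same step with `pathlib`, which hides that difference, the block below builds a throwaway directory so it runs without the tutorial data. The helper name `list_sales_files` and the stand-in file names are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path


def list_sales_files(data_dir, pattern="Sales_Data*.csv"):
    """Return the matching CSV paths in sorted order (hypothetical helper)."""
    return sorted(Path(data_dir).glob(pattern))


# Demonstration on a temporary directory with made-up, header-only files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Sales_Data_02.csv", "Sales_Data_01.csv", "notes.txt"):
        (Path(tmp) / name).write_text("Region,Product,Date,Sales\n")
    found_names = [p.name for p in list_sales_files(tmp)]

print(found_names)  # ['Sales_Data_01.csv', 'Sales_Data_02.csv']
```

Sorting is alphabetical, so it matches numeric order here only because the file numbers are zero-padded (`01`, `02`, ...).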

Step 2: Use a generator expression to read the files, `assign()` to add a column recording each row's source file, and `concat()` to combine the dataframes.

In [3]:
%%time 
sales = pd.concat((pd.read_csv(file).assign(filename = file) for file in sales_files), ignore_index = True)

Wall time: 1min 31s
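To see what this step produces without loading the full dataset, here is a small sketch of the same read, tag, and concatenate pattern on two tiny stand-in files. The file contents are made up, and only the file name (not the full path) is stored in the `filename` column to keep the output short.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Two tiny stand-in files (made-up values, not the tutorial data).
    Path(tmp, "Sales_Data_01.csv").write_text("Region,Sales\nWest,10.0\nSouth,20.0\n")
    Path(tmp, "Sales_Data_02.csv").write_text("Region,Sales\nNorth,30.0\n")
    files = sorted(Path(tmp).glob("Sales_Data*.csv"))

    # Read each file lazily, tag its rows with the source name, then concatenate once.
    frames = (pd.read_csv(f).assign(filename=f.name) for f in files)
    demo = pd.concat(frames, ignore_index=True)

print(demo)
```

Because of `ignore_index=True`, the combined frame gets a fresh index running from 0 to 2 instead of repeating each file's own row numbers.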


We now have a consolidated dataframe. The `head()` function shows its first five rows.

In [4]:
sales.head()

| | Region | Product | Date | Sales | filename |
|---|---|---|---|---|---|
| 0 | West | Prod T | 2012-09-06 | 53395.177324 | Data\Sales_Data_01.csv |
| 1 | West | Prod K | 2016-02-23 | 116609.694781 | Data\Sales_Data_01.csv |
| 2 | South | Prod F | 2013-09-20 | 72524.095297 | Data\Sales_Data_01.csv |
| 3 | South | Prod J | 2010-12-24 | 22538.478726 | Data\Sales_Data_01.csv |
| 4 | North | Prod D | 2012-03-28 | 45616.532823 | Data\Sales_Data_01.csv |


The `tail()` function shows the last five rows.

In [5]:
sales.tail()

| | Region | Product | Date | Sales | filename |
|---|---|---|---|---|---|
| 49999995 | South | Prod R | 2014-10-22 | 92396.453391 | Data\Sales_Data_20.csv |
| 49999996 | North | Prod E | 2012-02-16 | 43532.044366 | Data\Sales_Data_20.csv |
| 49999997 | South | Prod R | 2015-10-25 | 110877.15507 | Data\Sales_Data_20.csv |
| 49999998 | South | Prod A | 2014-06-05 | 85371.140536 | Data\Sales_Data_20.csv |
| 49999999 | North | Prod E | 2011-04-20 | 28391.187738 | Data\Sales_Data_20.csv |


In [6]:
sales.shape

(50000000, 5)

In [7]:
sales.info(memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000000 entries, 0 to 49999999
Data columns (total 5 columns):
Region      object
Product     object
Date        object
Sales       float64
filename    object
dtypes: float64(1), object(4)
memory usage: 13.0 GB
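Almost all of the 13.0 GB comes from the four text columns, since 50 million `float64` values account for only about 0.4 GB. One possible way to shrink the frame, offered here as a sketch and not as part of the original notebook, is to read the low-cardinality text columns as `category` and parse `Date` as a datetime. The sample rows below are made up and only mimic the layout of the sales files.

```python
import io

import pandas as pd

# Made-up sample in the same column layout as the sales files.
header = "Region,Product,Date,Sales\n"
rows = "West,Prod T,2012-09-06,53395.18\nSouth,Prod F,2013-09-20,72524.10\n"
csv_text = header + rows * 1000

# Default read: text columns are stored as strings.
as_default = pd.read_csv(io.StringIO(csv_text))

# Compact read: repeated labels become categories, dates become datetime64.
compact = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Region": "category", "Product": "category"},
    parse_dates=["Date"],
)

bytes_default = as_default.memory_usage(deep=True).sum()
bytes_compact = compact.memory_usage(deep=True).sum()
print(f"default dtypes: {bytes_default:,} bytes")
print(f"compact dtypes: {bytes_compact:,} bytes")
```

The same `dtype` and `parse_dates` arguments could be passed to the `pd.read_csv` call in Step 2; how much memory that saves on the real data depends on how many distinct regions and products it contains.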


# **Method 2**: Using Dask dataframe and pandas

Dask provides multi-core execution on larger-than-memory datasets. More info about Dask can be found [here](https://github.com/dask/dask-tutorial).

In [8]:
import dask.dataframe as dd
import datetime

start = datetime.datetime.now()

# read multiple files as dask.dataframe
sales_dask = dd.read_csv('Data/Sales_Data*.csv')

# Convert it back to Pandas dataframe
sales_pandas = sales_dask.compute()

end = datetime.datetime.now()

print(f'It took {end-start} secs')

It took 0:00:34.356744 secs


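Reading the files as a Dask dataframe and then converting the result to a pandas dataframe took about 34 seconds, compared with 1 min 31 s for the pandas-only approach. The two runs are not strictly identical, since Method 1 also adds a `filename` column, but the gap suggests that for large inputs it can be a good idea to import the data through Dask and then convert it to a pandas dataframe. Keep in mind that `compute()` loads the whole result into memory, so the final dataframe still has to fit in RAM.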

In [9]:
sales_pandas.shape

(50000000, 4)

In [10]:
sales_pandas.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000000 entries, 0 to 1097681
Data columns (total 4 columns):
Region     object
Product    object
Date       object
Sales      float64
dtypes: float64(1), object(3)
memory usage: 9.7 GB
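Note the index line in this output: 50,000,000 entries, but the last label is 1097681. That suggests each chunk Dask read kept its own 0-based row labels, so labels repeat across the combined frame, unlike the `RangeIndex` produced in Method 1. The effect can be reproduced with plain pandas, as in this small sketch with made-up values.

```python
import pandas as pd

# Two stand-in chunks, each with its own 0-based index (made-up values).
part_a = pd.DataFrame({"Sales": [10.0, 20.0]})
part_b = pd.DataFrame({"Sales": [30.0, 40.0]})

# Concatenating without ignore_index keeps each part's labels, so they repeat.
combined = pd.concat([part_a, part_b])
print(combined.index.tolist())  # [0, 1, 0, 1]

# reset_index(drop=True) replaces them with one unique 0..n-1 range.
clean = combined.reset_index(drop=True)
print(clean.index.tolist())  # [0, 1, 2, 3]
```

For the tutorial data, `sales_pandas = sales_pandas.reset_index(drop=True)` would give the same kind of unique index; this matters mainly if you later select rows by label with `.loc`.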


**Note**: If the data is small enough to fit comfortably in your computer's RAM, there is little need to use a Dask dataframe.

---
The GitHub repository can be found [here](https://github.com/gbganalyst/merge-csv-files-in-python). If you like this write-up, you can also follow me on [Twitter](https://www.twitter.com/gbganalyst) and/or [LinkedIn](https://www.linkedin.com/in/ezekiel-ogundepo/) for more updates on `R`, `Excel`, and `Python` for data science.