# BARC Lakehouse Cookbook
 
#### A stepwise introduction to modern data lakehouse technologies for business users

*Written by Thomas Zeutschler, Analyst at [BARC](https://barc.com) (Würzburg, Germany), sponsored by [Dremio](https://www.dremio.com) (Santa Clara, California, USA), provider of a next-generation Data and Data Lakehouse platform*

-------------------------------------

## Step 2 - Moving from CSV to Parquet files
Think of Parquet files as CSV files on steroids. They do almost everything better than CSV files: they are more compact, much faster to load, more efficient to query, and more reliable, because the data type of every column is stored in the file itself. The only downside is that you can't open them in a text editor. But who cares? We have Python.

### 2.1. Converting a CSV file to a Parquet file
First, we need to convert the CSV file to a Parquet file. This can take some time, but it is worth converting all your CSV files to Parquet; you will see why in a few seconds. Here's how to convert a CSV file to a Parquet file using Python and Pandas (Pandas needs the `pyarrow` or `fastparquet` package installed to write Parquet files).

```python
import pandas as pd

df = pd.read_csv('car_sales.csv')   # Load the CSV file
df.to_parquet('car_sales.parquet')  # Save the DataFrame in the Parquet file format
```

### 2.2. Loading and working with the Parquet file
Now that we have the Parquet file, we can load and work with it. The approach for loading and analysing a Parquet file is the same as with the CSV file. The only difference is the file format. Let's load the Parquet file and find out **how many black BMWs were sold in 2015**.

```python
df = pd.read_parquet('car_sales.parquet') # Load the Parquet file
black_bmw = df.query('make == "BMW" and color == "black" and year == 2015')['vin'].count()
print(f"Number of black BMW sold in 2015: {black_bmw}, counted by VIN (vehicle identification number).")
```

Number of black BMW sold in 2015: 168, counted by VIN (vehicle identification number).


### 2.3. What's different now?
**Nothing and everything**. The approach is identical, but we now have a file that is roughly 5x smaller than the original CSV file. Think of one fifth of the storage cost, and then think in terabytes and petabytes.

More importantly, loading and execution were already much faster. You have processed and analyzed half a million records in a fraction of a second, roughly 5x or more faster than with the CSV file. And please compare that with your Excel-based workflow, if this is where you're coming from.

### 2.4. Final thoughts and takeaways

But **the real deal** is that you now work with exactly the same data format that is used by almost all modern, powerful data and analytics technologies and platforms out there: the [Parquet data format](https://parquet.apache.org).

*Please continue with the next step, where we will show you how to query CSV and Parquet files using DuckDB and SQL...*