<a href="https://colab.research.google.com/github/YesInAJiffy/Utilities/blob/main/csv_to_parquet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The code performs the following tasks:

This notebook converts a CSV file to Parquet format. Before writing the Parquet file, it removes either the rows or the columns that contain NaN values, depending on the `drop_option` parameter. Specifically, the code:

* reads a CSV file into a Pandas DataFrame using the specified input file path
* removes rows or columns with NaN values based on the specified drop_option parameter
* converts the Pandas DataFrame to a PyArrow Table
* writes the PyArrow Table to a Parquet file using the specified output file path
* reads the Parquet file into a PyArrow Table
* converts the PyArrow Table back to a Pandas DataFrame
* prints the first 100 rows of the resulting Pandas DataFrame.
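
The difference between the two `drop_option` choices can be sketched with a few rows of in-memory data. This is a minimal illustration, not the notebook's script: the three-row sample and the variable names `rows_kept` and `cols_kept` are assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as input.csv
csv_text = """Name,Age,Country,Salary
John,30,USA,50000
Jane,35,UK,NaN
Alice,NaN,USA,60000
"""

df = pd.read_csv(io.StringIO(csv_text))

# 'row' option: drop every row that contains at least one NaN
rows_kept = df.dropna()

# 'column' option: drop every column that contains at least one NaN
cols_kept = df.dropna(axis=1)

print(rows_kept.shape)
print(list(cols_kept.columns))
```

With this sample, dropping rows keeps only the one complete record, while dropping columns keeps every record but loses `Age` and `Salary` entirely. Which option is appropriate depends on whether complete records or complete fields matter more.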




**ORIGINAL SOURCE**
https://colab.research.google.com/gist/byteshiva/10c04953d3d3768dd3ed6f1081845031/data_cleaning_customer_data_analysis_csv_parquet.ipynb



---


> ### `!apt update` refreshes the package lists on a Debian-based Linux distribution, such as the Ubuntu image used here. It does not install or upgrade anything by itself; it only ensures that later `apt` commands see the latest available package versions.




In [None]:
!apt update

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:4 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:6 http://ar

### `!apt upgrade -y` upgrades installed packages to the newest versions available in the package lists, answering "yes" to confirmation prompts automatically. Packages whose upgrade would require installing or removing other packages are kept back, as the output below shows.





In [None]:
!apt upgrade -y

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
The following packages have been kept back:
  libcudnn8 libcudnn8-dev libnccl-dev libnccl2
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.


In [None]:
cd /content/

/content


In [None]:
![ -d csv_parq ] || mkdir csv_parq

In [None]:
cd csv_parq

/content/csv_parq


In [None]:
!pwd

/content/csv_parq


In [None]:
cd /content/csv_parq

/content/csv_parq


### This shell command installs the pandas and pyarrow Python packages using pip.





In [None]:
!pip install pandas pyarrow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### This code writes a CSV file named "input.csv" in the current working directory with a predefined set of rows and columns, where some values are NaN, and then prints a message confirming the completion of writing the file.





In [None]:
# Import the necessary modules
import os

# Define the content to be written (the trailing backslash avoids a blank first line in the file)
content = """\
Name,Age,Country,Salary
John,30,USA,50000
Jane,35,UK,NaN
Bob,25,Canada,40000
Alice,NaN,USA,60000
Mike,40,USA,75000
Emma,27,UK,NaN
Dave,33,Australia,NaN
Mary,29,USA,55000
Mark,45,Canada,NaN
Sara,NaN,USA,90000
Chris,31,UK,38000
Steph,38,USA,NaN
Kevin,26,USA,NaN
Julia,37,Canada,NaN
Tom,NaN,Australia,65000
Amy,24,USA,30000
Ben,36,UK,70000
Laura,NaN,Canada,50000
Matt,42,USA,80000
Sophie,28,UK,NaN
Lisa,34,USA,55000
Max,39,Australia,60000
Jenny,30,USA,NaN
Greg,29,UK,NaN
Emily,32,Canada,47000
Ryan,27,USA,NaN
Olivia,NaN,Australia,72000
Sam,35,USA,95000
Lucy,23,UK,40000
Adam,26,USA,30000
Katie,37,USA,NaN
Nate,31,Canada,NaN
Beth,33,USA,60000
Scott,29,USA,50000
Kelly,43,Australia,NaN
Derek,27,USA,NaN
Tina,38,USA,80000
George,NaN,Canada,67000
Jill,24,USA,45000
Oscar,42,UK,NaN
Maggie,NaN,USA,55000
David,29,Australia,NaN
Lena,36,USA,NaN
Joe,33,Canada,48000
Helen,40,USA,90000
Fred,26,USA,NaN
Kate,35,Australia,NaN
"""

# Define the file name and path
file_name = "input.csv"
file_path = os.path.join(os.getcwd(), file_name)

# Open the file in write mode and write the content
with open(file_path, "w") as f:
    f.write(content)

print(f"Content written to {file_name}")


Content written to input.csv


### This code writes a Python script, "main.py", to disk. When run, the script converts a CSV file to a Parquet file using Pandas and PyArrow.












In [None]:
# Import the necessary modules
import os

# Define the content to be written
content = """
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def convert_csv_to_parquet(input_file_path, output_file_path, drop_option):
    # Read CSV file into a Pandas DataFrame
    df = pd.read_csv(input_file_path)

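    # Remove rows or columns with NaN fields based on the drop_option argument
    if drop_option == 'row':
        df = df.dropna()
    elif drop_option == 'column':
        df = df.dropna(axis=1)
    else:
        # Fail loudly instead of silently skipping the cleaning step
        raise ValueError(f"drop_option must be 'row' or 'column', got {drop_option!r}")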

    # Convert Pandas DataFrame to PyArrow Table
    table = pa.Table.from_pandas(df)

    # Write PyArrow Table to Parquet file
    pq.write_table(table, output_file_path)

    # Read the Parquet file back into a PyArrow Table
    table = pq.read_table(output_file_path)

    # Convert the table to a pandas dataframe
    df = table.to_pandas()

    # Print up to the first 100 rows, without the index column
    print(df.head(100).to_string(index=False))


input_file_path = 'input.csv'
output_file_path = 'output.parquet'
drop_option = 'row'  # options: 'row' or 'column'

convert_csv_to_parquet(input_file_path, output_file_path, drop_option)

"""

# Define the file name and path
file_name = "main.py"
file_path = os.path.join(os.getcwd(), file_name)

# Open the file in write mode and write the content
with open(file_path, "w") as f:
    f.write(content)

print(f"Content written to {file_name}")


Content written to main.py


### This command runs the Python script "main.py".





In [None]:
!python main.py

 Name  Age   Country  Salary
 John 30.0       USA 50000.0
  Bob 25.0    Canada 40000.0
 Mike 40.0       USA 75000.0
 Mary 29.0       USA 55000.0
Chris 31.0        UK 38000.0
  Amy 24.0       USA 30000.0
  Ben 36.0        UK 70000.0
 Matt 42.0       USA 80000.0
 Lisa 34.0       USA 55000.0
  Max 39.0 Australia 60000.0
Emily 32.0    Canada 47000.0
  Sam 35.0       USA 95000.0
 Lucy 23.0        UK 40000.0
 Adam 26.0       USA 30000.0
 Beth 33.0       USA 60000.0
Scott 29.0       USA 50000.0
 Tina 38.0       USA 80000.0
 Jill 24.0       USA 45000.0
  Joe 33.0    Canada 48000.0
Helen 40.0       USA 90000.0
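
In the output above, `Age` and `Salary` print as floats (`30.0`, `50000.0`). pandas reads a numeric column that contains NaN as `float64`, and `dropna()` removes the incomplete rows without changing the column type. If integer columns are wanted in the Parquet file, one option is to cast after dropping the rows. The sketch below assumes a small hypothetical sample and is not part of the original script.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as input.csv (Country omitted for brevity)
csv_text = """Name,Age,Salary
John,30,50000
Jane,35,NaN
Alice,NaN,60000
Bob,25,40000
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)  # Age and Salary are float64 because each column contains a NaN

# Dropping rows removes the NaN values but the columns stay float64
cleaned = df.dropna()

# Cast back to integers once no NaN values remain
cleaned = cleaned.astype({"Age": "int64", "Salary": "int64"})
print(cleaned.to_string(index=False))
```

The cast is only safe after the rows with NaN have been removed; applying `astype("int64")` to a column that still holds NaN raises an error.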
