### **🧭 Stage 1 → Lesson 4: Importing & Exporting Data (I/O Operations)**

**🎯 Objective**

By the end of this lesson, you'll be able to:

- Load data from various sources (CSV, Excel, JSON, SQL)
- Handle file paths, encodings, missing headers
- Export clean datasets in multiple formats
- Work with real data import/export patterns used in ETL

#### **🧱 Pandas I/O Ecosystem Overview**

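Pandas reads and writes many data formats through its I/O API: `read_*()` functions load data into a DataFrame, and matching `to_*()` methods write it back out. The most common pairs are: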

| **Format**        | **Read Function**    | **Write Function**   |
|-------------------|----------------------|----------------------|
| CSV               | `read_csv()`         | `to_csv()`           |
| Excel             | `read_excel()`       | `to_excel()`         |
| JSON              | `read_json()`        | `to_json()`          |
| Parquet           | `read_parquet()`     | `to_parquet()`       |
| SQL               | `read_sql()`         | `to_sql()`           |
| Clipboard         | `read_clipboard()`   | `to_clipboard()`     |
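
The read/write pairing in the table can be sketched as a round trip through in-memory buffers, so no files are needed. The sample data and variable names here are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for this illustration.
df = pd.DataFrame({"Product": ["Monitor", "Keyboard"], "Units": [46, 8]})

# Write to CSV text, then read it back through an in-memory buffer.
csv_text = df.to_csv(index=False)
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# The same pattern with JSON (one object per row).
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text))

print(df_from_csv)
print(df_from_json)
```

The same `read_*()` / `to_*()` calls accept a file path instead of a buffer, which is how the rest of this lesson uses them.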


**Importing necessary libraries**

In [1]:
import pandas as pd
import numpy as np

**🧩 Reading CSV Files**

In [16]:
# pandas was imported as pd in the first cell.

# Step 1: Read data from a CSV file
# pd.read_csv() reads a CSV file and loads its contents into a pandas DataFrame.
# The r prefix makes the path a raw string, so backslashes in the Windows path are not treated as escape sequences.
df_SalesData = pd.read_csv(r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\sample_sales.csv")

# Step 2: Display the first few rows of the DataFrame
# .head() is a DataFrame method that returns the first 5 rows by default,
# which is a quick way to check that the data loaded correctly.
print(df_SalesData.head())

         Date     Product  Units  UnitPrice  Revenue
0  01-01-2024     Monitor     46       1999    91954
1  02-01-2024    Keyboard      8       9999    79992
2  03-01-2024  Headphones     25        999    24975
3  04-01-2024     Monitor     35      14999   524965
4  05-01-2024  Headphones     25       2999    74975


**Parameters for `read_csv()` and Other I/O Functions**

| **Parameter**     | **Description**                                                        |
|-------------------|------------------------------------------------------------------------|
| `sep`             | Field separator (default `,`)                                          |
| `header`          | Row number(s) to use as the column names; `None` if there is no header |
| `names`           | Custom column names; pass `header=0` to replace an existing header row |
| `index_col`       | Column(s) to use as the index                                          |
| `usecols`         | Load only the selected columns                                         |
| `nrows`           | Limit the number of rows to read                                       |
| `encoding`        | Text encoding of the file, such as `UTF-8` or `ISO-8859-1`             |
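
Several of these parameters are often combined in one call. The following is a minimal sketch on a small semicolon-separated sample with no header row; the data and column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data with no header row.
raw = (
    "2024-01-01;Monitor;46;91954\n"
    "2024-01-02;Keyboard;8;79992\n"
    "2024-01-03;Headphones;25;24975\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                                        # fields are separated by semicolons
    header=None,                                    # the data has no header row
    names=["Date", "Product", "Units", "Revenue"],  # supply the column names
    usecols=["Date", "Product", "Units"],           # skip the Revenue column
    index_col="Date",                               # use Date as the index
    nrows=2,                                        # read only the first two rows
)
print(df)
```

Passing `header=None` together with `names` is the usual way to handle a file whose first line is already data.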

In [3]:
print("Original_Data:\n", df_SalesData.head())

Original_Data:
          Date     Product  Units  UnitPrice  Revenue
0  01-01-2024     Monitor     46       1999    91954
1  02-01-2024    Keyboard      8       9999    79992
2  03-01-2024  Headphones     25        999    24975
3  04-01-2024     Monitor     35      14999   524965
4  05-01-2024  Headphones     25       2999    74975


In [None]:
# Step 1: Defining the file path
# The path to the CSV file is defined as a raw string (r) to handle any backslashes in the file path correctly.
sales_data_path = r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\sample_sales.csv"

# Step 2: Reading the CSV file with specific columns
# pd.read_csv() is used to read the data from the CSV file.
# The 'usecols' parameter specifies which columns to read from the CSV file.
# In this case, only the 'Product', 'Units', and 'Revenue' columns will be loaded into the DataFrame.
df_SalesData = pd.read_csv(sales_data_path, usecols=['Product', 'Units', 'Revenue'])

# Step 3: Displaying the first few rows of the DataFrame
# .head() returns the first 5 rows of the DataFrame, which is useful for verifying that the correct data has been loaded.
df_SalesData.head()


      Product  Units  Revenue
0     Monitor     46    91954
1    Keyboard      8    79992
2  Headphones     25    24975
3     Monitor     35   524965
4  Headphones     25    74975


**Handling Missing or Bad Data**

In [5]:
# Treat '?', 'NA', and 'Missing' as missing values (NaN) while reading the file.
df = pd.read_csv(sales_data_path, na_values=['?', 'NA', 'Missing'])
df

           Date     Product  Units  UnitPrice  Revenue
0    01-01-2024     Monitor     46       1999    91954
1    02-01-2024    Keyboard      8       9999    79992
2    03-01-2024  Headphones     25        999    24975
3    04-01-2024     Monitor     35      14999   524965
4    05-01-2024  Headphones     25       2999    74975
..          ...         ...    ...        ...      ...
115  25-04-2024  Headphones     26       4999   129974
116  26-04-2024    Keyboard     42      14999   629958
117  27-04-2024      Laptop     22        999    21978
118  28-04-2024     Monitor     14       1999    27986
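
The sample file above does not appear to contain any of these markers, so the effect of `na_values` is easier to see on a small sample that does. This sketch assumes made-up data in which `?` and `Missing` stand for absent values.

```python
import io

import pandas as pd

# Hypothetical data in which "?" and "Missing" mark absent values.
raw = (
    "Product,Units,Revenue\n"
    "Monitor,46,91954\n"
    "Keyboard,?,79992\n"
    "Laptop,22,Missing\n"
)

df = pd.read_csv(io.StringIO(raw), na_values=["?", "NA", "Missing"])

# Count the missing values in each column.
print(df.isna().sum())

# Columns that contain NaN are read as float64, because NaN is a float.
print(df.dtypes)
```

Without `na_values`, the `Units` and `Revenue` columns would be read as text, since `?` and `Missing` are not numbers.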


**Large File Optimization**

In [None]:
import pandas as pd

# Define the path to the Superstore Sales dataset using a raw string
superstore_sales = r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\superstore_sales.csv"

# Try to read the CSV file in chunks, with error handling for possible issues
try:
    # Read the CSV file in chunks of 100 rows at a time using a different encoding (ISO-8859-1)
    # The 'ISO-8859-1' encoding is often used for files containing special characters in European languages.
    df_superstore = pd.read_csv(superstore_sales, encoding="ISO-8859-1", chunksize=100)

    # Iterate through each chunk of 100 rows and print the first 5 rows
    for data in df_superstore:
        print(data.head())  # Using .head() to display just the first 5 rows of each chunk

except FileNotFoundError:
    # If the file is not found, this block will catch it and print a custom error message
    print(f"The file at {superstore_sales} does not exist. Please check the file path.")

except Exception as e:
    # Catch any other unexpected errors and print them
    print(f"An error occurred: {e}")

   Row ID        Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0       1  CA-2016-152156  11-08-2016  11-11-2016    Second Class    CG-12520   
1       2  CA-2016-152156  11-08-2016  11-11-2016    Second Class    CG-12520   
2       3  CA-2016-138688  06-12-2016   6/16/2016    Second Class    DV-13045   
3       4  US-2015-108966  10-11-2015  10/18/2015  Standard Class    SO-20335   
4       5  US-2015-108966  10-11-2015  10/18/2015  Standard Class    SO-20335   

     Customer Name    Segment        Country             City  ...  \
0      Claire Gute   Consumer  United States        Henderson  ...   
1      Claire Gute   Consumer  United States        Henderson  ...   
2  Darrin Van Huff  Corporate  United States      Los Angeles  ...   
3   Sean O'Donnell   Consumer  United States  Fort Lauderdale  ...   
4   Sean O'Donnell   Consumer  United States  Fort Lauderdale  ...   

  Postal Code  Region       Product ID         Category Sub-Category  \
0       42420   Sout
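
Printing each chunk shows that chunking works, but the usual reason for `chunksize` is to compute a result without holding the whole file in memory. Here is a minimal sketch of that pattern, assuming a small made-up dataset encoded as ISO-8859-1; the column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical data with one non-ASCII value, encoded as ISO-8859-1 bytes.
raw = "Region,Sales\nSouth,100\nWest,250\nSouth,50\nQu\u00e9bec,75\nWest,25\n".encode("ISO-8859-1")

totals = {}

# With chunksize set, read_csv returns an iterator of DataFrames instead of one DataFrame.
with pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1", chunksize=2) as reader:
    for chunk in reader:
        # Aggregate each chunk, then fold the partial sums into the running totals.
        for region, sales in chunk.groupby("Region")["Sales"].sum().items():
            totals[region] = totals.get(region, 0) + int(sales)

print(totals)
```

Only one chunk is in memory at a time, so the same loop scales to files much larger than this sample.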

**🧩 Reading Excel Files**

In [None]:
# Import pandas library (make sure you've imported it before running this code)
import pandas as pd

# Define the path to the Excel file containing the dataset.
# Update this path if the file is stored elsewhere on your system.
people_basic = r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\people_basic_data.xlsx"

# Read the Excel file into a pandas DataFrame.
# pd.read_excel() reads the first sheet by default; use the 'sheet_name' parameter to pick another.
df_exl = pd.read_excel(people_basic)

# Display the first 5 rows of the DataFrame to get a quick overview of the data.
# This helps verify that the file was read correctly and understand the structure of the dataset.
print(df_exl.head())


     Name  Age       City  Salary_INR
0   Aarav   23     Mumbai      115059
1  Vivaan   50  Ahmedabad       93035
2  Aditya   46  Ahmedabad       61033
3  Vihaan   37       Pune      187550
4   Arjun   37      Delhi      162866


**🧩 Reading Excel Files (Sheets)**

In [None]:
# Import pandas library (required for data manipulation and reading Excel files)
import pandas as pd

# Define the path to the Excel file.
# The 'r' before the string makes it a raw string, so backslashes are treated literally.
people_basic_sheet = r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\people_basic_data.xlsx"

# Read a specific sheet ("peoplebasic") from the Excel file into a pandas DataFrame.
# The 'sheet_name' parameter specifies which sheet to load, which is useful when the file contains multiple sheets.
df_ex_sheet = pd.read_excel(people_basic_sheet, sheet_name="peoplebasic")

# Print only the first 5 rows of the DataFrame.
# Printing the whole DataFrame instead would show every row, which gets unwieldy for large datasets.
print(df_ex_sheet.head())

     Name  Age       City  Salary_INR
0   Aarav   23     Mumbai      115059
1  Vivaan   50  Ahmedabad       93035
2  Aditya   46  Ahmedabad       61033
3  Vihaan   37       Pune      187550
4   Arjun   37      Delhi      162866


**Multiple Sheets at Once**

In [None]:
# Import pandas library (used for data analysis and Excel file handling)
import pandas as pd

# Define the path to the Excel file containing multiple sheets.
# The 'r' before the string makes it a raw string so backslashes are treated literally.
people_basic_sheet = r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\people_basic_data.xlsx"

# Read all sheets from the Excel file into a dictionary of DataFrames.
# Setting sheet_name=None tells pandas to read *every* sheet in the file.
# The resulting object is a dictionary where:
#   - Keys = sheet names
#   - Values = DataFrames containing the data from each sheet
dfs_multi_sheet = pd.read_excel(people_basic_sheet, sheet_name=None)

# Print the keys (sheet names) of the dictionary.
# This shows which sheets were loaded from the Excel file.
print(dfs_multi_sheet.keys())

dict_keys(['peoplebasic'])
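
Once the sheets are loaded as a dictionary, a common next step is to stack them into one DataFrame. The sketch below builds the dictionary by hand as a stand-in for the result of `pd.read_excel(path, sheet_name=None)`, so it runs without an Excel file; the sheet names are hypothetical.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None): a dict that maps
# sheet names to DataFrames. The sheet names here are hypothetical.
dfs_by_sheet = {
    "north": pd.DataFrame({"Name": ["Aarav", "Vivaan"], "Age": [23, 50]}),
    "south": pd.DataFrame({"Name": ["Aditya"], "Age": [46]}),
}

# Inspect the size of each sheet.
for sheet_name, sheet_df in dfs_by_sheet.items():
    print(sheet_name, sheet_df.shape)

# Stack all sheets into one DataFrame and record the source sheet in a column.
combined = (
    pd.concat(dfs_by_sheet, names=["sheet", "row"])
    .reset_index(level="sheet")
    .reset_index(drop=True)
)
print(combined)
```

Keeping the sheet name as a column makes it possible to trace each row back to its source after the sheets are merged.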


**🧩 Reading JSON Files**

In [None]:
# Import pandas library (required for handling JSON, Excel, CSV, and other data formats)
import pandas as pd

# Define the path to the JSON file.
# The 'r' prefix makes it a raw string literal, so backslashes are treated literally (useful for Windows paths).
sales_json = r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\sales_data.json"

# Read the JSON file into a pandas DataFrame.
# pd.read_json() automatically converts JSON structures (like arrays or objects) into tabular format.
# The JSON file should contain a valid structure such as:
# [
#   {"date": "2025-01-01", "sales": 200, "region": "North"},
#   {"date": "2025-01-02", "sales": 250, "region": "South"}
# ]
df_json = pd.read_json(sales_json)

# Display the DataFrame to verify that the JSON data was loaded correctly.
print(df_json)

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]
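
JSON files come in more than one layout, and `read_json()` needs to be told which one it is reading in some cases. A minimal sketch of two common layouts, using made-up records and in-memory buffers rather than files:

```python
import io

import pandas as pd

# Hypothetical records; real files may be laid out differently.
records_json = '[{"region": "North", "sales": 200}, {"region": "South", "sales": 250}]'
lines_json = '{"region": "North", "sales": 200}\n{"region": "South", "sales": 250}\n'

# A JSON array of objects maps to one row per object.
df_records = pd.read_json(io.StringIO(records_json), orient="records")

# JSON Lines (one object per line) needs lines=True.
df_lines = pd.read_json(io.StringIO(lines_json), lines=True)

print(df_records)
print(df_lines)
```

The JSON Lines layout is common in log exports and streaming pipelines, where each line is written independently.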


**🧩 Reading Data from URLs or APIs**

In [11]:
# Import pandas library (required for reading CSV files and data manipulation)
import pandas as pd

# Define the URL of the CSV file.
# This file is hosted online on GitHub and contains the famous Iris dataset.
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

# Read the CSV file directly from the URL into a pandas DataFrame.
# pd.read_csv() can handle both local file paths and URLs.
df = pd.read_csv(url)

# Display the first 5 rows of the DataFrame.
# This gives a quick overview of the dataset's structure and contents.
df.head()


   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa
2           4.7          3.2           1.3          0.2   setosa
3           4.6          3.1           1.5          0.2   setosa
4           5.0          3.6           1.4          0.2   setosa


In [12]:

import requests

# This JSON export keeps the rows under "data" and the column definitions under meta -> view -> columns.
url = "https://data.wa.gov/api/views/f6w7-q2d2/rows.json?accessType=DOWNLOAD"

# Download the payload and decode the JSON body.
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
invest_data = response.json()

# Each row is a list of values.
rows = invest_data['data']

# The column names come from the metadata section.
columns = [col['name'] for col in invest_data['meta']['view']['columns']]

df = pd.DataFrame(rows, columns=columns)
# head() must be called: print(df.head) without parentheses prints the bound method, not the rows.
print(df.head())


                  sid                                    id  position  \
0  row-qsfa-65pn_v9u2  00000000-0000-0000-E9FC-12701D5586E7         0   
1  row-4tqm_us9b_wbj5  00000000-0000-0000-B530-5CDA42E30834         0   
2  row-rws4_ztb8.gq89  00000000-0000-0000-9BDE-49CFA3DEF257         0   
3  row-vzyr_pt7h_wabd  00000000-0000-0000-3DC6-8268AF9133D9         0   
4  row-jhd9_idhy~hxxg  00000000-0000-0000-EC81-B0E440208B03         0   

        created_at created_meta  
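
The shape of that API response is easier to follow on a tiny payload. The sketch below uses a hand-written dictionary with the same layout (column definitions under `meta` → `view` → `columns`, rows as lists under `data`); the column names and values are invented, and no network call is made.

```python
import pandas as pd

# A tiny, hypothetical payload with the same layout as the API response.
payload = {
    "meta": {"view": {"columns": [{"name": "sid"}, {"name": "position"}, {"name": "city"}]}},
    "data": [
        ["row-a", 0, "Seattle"],
        ["row-b", 0, "Olympia"],
    ],
}

# Pull the column names out of the metadata, then pair them with the row lists.
columns = [col["name"] for col in payload["meta"]["view"]["columns"]]
df_api = pd.DataFrame(payload["data"], columns=columns)

# head() is a method, so it needs parentheses to return the rows.
print(df_api.head())
```

Other APIs structure their responses differently, so the keys used here would need to be adapted after inspecting the payload.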

**🧩 Writing (Exporting) Data**

In [None]:
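# to_csv() writes the file and returns None when a path is given,
# so the result is not assigned to a variable.
# index=False leaves the DataFrame index out of the file.
df.to_csv(r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\superstore_sales_cleaned.csv", index=False)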
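
The return value of `to_csv()` depends on whether a path is passed. This sketch writes to a temporary directory so it can run anywhere; the data and the file name are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Product": ["Monitor", "Keyboard"], "Units": [46, 8]})

# Without a path, to_csv() returns the CSV text.
csv_text = df.to_csv(index=False)

# With a path, it writes the file and returns None.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "sales_cleaned.csv")
    result = df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)  # read the file back to verify the export

print(csv_text)
print(result)
print(reloaded)
```

Reading the exported file back, as done here, is a cheap check that the export contains what was intended.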


**Write to Excel**

In [14]:
# Import pandas library (required for data manipulation and exporting data)
import pandas as pd

# Assuming 'df' is the cleaned sales DataFrame that you want to export

# Reset the index of the DataFrame.
# drop=True ensures that the old index is removed and not added as a column.
df_reset = df.reset_index(drop=True)

# Export the cleaned DataFrame to an Excel file.
# The 'index=False' argument prevents the DataFrame's index from being written to the file.
# This saves the DataFrame to the specified path without the index column, which is often not needed in the output.
df_reset.to_excel(r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\people_basic_data.xlsx", index=False)

# Print the DataFrame that was written, to verify its contents
print(df_reset)

                       sid                                    id  position  \
0       row-qsfa-65pn_v9u2  00000000-0000-0000-E9FC-12701D5586E7         0   
1       row-4tqm_us9b_wbj5  00000000-0000-0000-B530-5CDA42E30834         0   
2       row-rws4_ztb8.gq89  00000000-0000-0000-9BDE-49CFA3DEF257         0   
3       row-vzyr_pt7h_wabd  00000000-0000-0000-3DC6-8268AF9133D9         0   
4       row-jhd9_idhy~hxxg  00000000-0000-0000-EC81-B0E440208B03         0   
...                    ...                                   ...       ...   
269668  row-zsu6~nyk7-8pt5  00000000-0000-0000-01F0-A452036752C6         0   
269669  row-mpbp~2ev5-qjt7  00000000-0000-0000-CC91-101F0076685B         0   
269670  row-2ss3~c9iw.j6jw  00000000-0000-0000-9F59-D918D94029B8         0   
269671  row-8i5u_u3ij~hcwq  00000000-0000-0000-6BF3-C41558A6510B         0   
269672  row-x7px~2fmc~bz2h  00000000-0000-0000-A085-4490E86B2CBD         0   

        created_at created_meta  updated_at updated_meta meta  

**🧩 Reading/Writing from SQL (Preview)**

In [15]:
# import sqlite3
# conn = sqlite3.connect('../datasets/raw/sample.db')

# df = pd.read_sql("SELECT * FROM sales", conn)
# df.to_sql('cleaned_sales', conn, if_exists='replace', index=False)
# conn.close()
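
The commented-out preview depends on a database file that may not exist on your machine. A runnable sketch of the same `to_sql()` / `read_sql()` round trip, assuming an in-memory SQLite database and made-up sales data:

```python
import sqlite3

import pandas as pd

# Hypothetical data; the table name "sales" mirrors the preview above.
df_sales = pd.DataFrame({"Product": ["Monitor", "Keyboard", "Monitor"], "Units": [46, 8, 35]})

# An in-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
try:
    # Write the DataFrame to a table, replacing the table if it already exists.
    df_sales.to_sql("sales", conn, if_exists="replace", index=False)

    # Read an aggregated result back into a DataFrame.
    df_totals = pd.read_sql(
        "SELECT Product, SUM(Units) AS TotalUnits FROM sales GROUP BY Product ORDER BY Product",
        conn,
    )
finally:
    # Close the connection even if one of the calls above fails.
    conn.close()

print(df_totals)
```

For databases other than SQLite, pandas generally expects a SQLAlchemy connection in place of the `sqlite3` one used here.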