In [71]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [72]:
import os
os.chdir('/content/drive/MyDrive/Academics/Visiting Lectures/2026-H1/202601-SDP-AU/Session-08-Data-Management-in-Pandas')

In [73]:
import pandas as pd
import os

folder_name = 'Data'  # Folder that holds the source workbook and every exported file
excel_file_path = os.path.join(folder_name, 'bank_marketing_data.xlsx')

# Load the source workbook into a DataFrame
bank_data = pd.read_excel(excel_file_path)

## **Introduction to Data Export**



#### **The Importance and Variety of Data Export in Pandas**

Data export is a critical step in any data science workflow, allowing us to share results, persist processed data for later use, integrate with other systems, or make data available for further analysis and reporting. Pandas provides robust and flexible functionalities for exporting `DataFrames` into a wide array of file formats, catering to different needs and scenarios.

Here's an overview of common data export options in pandas and their typical use cases:

1.  **CSV (Comma Separated Values):**
    *   **Description:** A plain-text format where values are separated by commas. It's one of the most widely used and universally compatible formats.
    *   **Use Cases:** Ideal for sharing data across different platforms and software, simple data archiving, and when universal compatibility is paramount.

2.  **Excel (.xlsx):**
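    *   **Description:** The Office Open XML spreadsheet format introduced by Microsoft. An `.xlsx` file is a ZIP archive of XML parts, so it is not plain text, but it is a published standard rather than a closed binary format.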
    *   **Use Cases:** Preferred when sharing data with business users who are familiar with Excel, for creating reports with multiple sheets, or when data needs to be easily viewable and manipulated in a spreadsheet program.

3.  **JSON (JavaScript Object Notation):**
    *   **Description:** A lightweight, human-readable data interchange format whose syntax is derived from JavaScript object literals.
    *   **Use Cases:** Excellent for web integration, APIs, configuration files, and when data needs to be easily readable and parsed by programming languages.

4.  **HTML (HyperText Markup Language):**
    *   **Description:** The standard markup language for documents designed to be displayed in a web browser.
    *   **Use Cases:** Useful for embedding tables directly into web pages, generating simple reports that can be viewed in a browser, or for quick display of tabular data.

5.  **Parquet:**
    *   **Description:** A columnar storage file format optimized for efficient data storage and retrieval, especially for large datasets. It's often used in big data processing frameworks like Apache Spark.
    *   **Use Cases:** Highly recommended for performance-critical applications, large datasets, and when working within data warehousing or big data ecosystems due to its efficiency in compression and query performance.

6.  **Feather:**
    *   **Description:** A fast, language-agnostic binary file format for storing `DataFrames`. It's designed for efficient data transfer between Python (pandas) and R.
    *   **Use Cases:** Ideal for fast reading/writing of `DataFrames` within Python or when interoperability between Python and R is required.

7.  **Pickle:**
    *   **Description:** The binary format produced by `pickle`, Python's standard-library module for serializing and de-serializing Python object structures.
    *   **Use Cases:** Best for preserving the exact Python object structure (including custom objects and `DataFrame` metadata) for later use within a Python environment. It is Python-specific and generally not recommended for cross-language data exchange due to security risks and version incompatibility.
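
As a rough illustration of how these formats map onto pandas writer methods, the sketch below picks a writer from the file extension. The `WRITERS` table and the `export_frame` helper are hypothetical names introduced here, not part of pandas, and the toy DataFrame stands in for real data. Only formats that work without optional packages are listed; Excel, Parquet, and Feather additionally need an engine such as `openpyxl` or `pyarrow`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical lookup table: file extension -> pandas writer call.
WRITERS = {
    ".csv": lambda df, path: df.to_csv(path, index=False),
    ".json": lambda df, path: df.to_json(path, orient="records"),
    ".html": lambda df, path: df.to_html(path, index=False),
    ".pkl": lambda df, path: df.to_pickle(path),
}


def export_frame(df, path):
    """Write df to path, choosing the writer from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in WRITERS:
        raise ValueError(f"No writer registered for {ext!r}")
    WRITERS[ext](df, path)
    return path


# Toy data, written once per registered format into a temporary folder.
sample = pd.DataFrame({"age": [56, 37], "job": ["housemaid", "services"]})
out_dir = tempfile.mkdtemp()
written = [
    export_frame(sample, os.path.join(out_dir, "sample" + ext))
    for ext in WRITERS
]
print(sorted(os.listdir(out_dir)))
```

Keeping the mapping in one place makes it easy to see which keyword arguments each format is exported with.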

### **Exporting Data to CSV**

To ensure data portability and compatibility, it's often necessary to export your DataFrame to a common format like CSV. When doing so, it's good practice to specify `index=False` to prevent pandas from writing the DataFrame index as a column in the CSV file, which is usually redundant. Additionally, setting an `encoding` like `'utf-8'` ensures that special characters are handled correctly across different systems.

In [74]:
csv_file_path = os.path.join(folder_name, 'bank_marketing_data.csv')
bank_data.to_csv(csv_file_path, index=False, encoding='utf-8')
print(f"DataFrame successfully exported to: {csv_file_path}")

DataFrame successfully exported to: Data/bank_marketing_data.csv
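
To see what `index=False` changes, the following minimal sketch writes a small toy DataFrame (not the bank dataset) to CSV text both ways and reads the indexed variant back. Calling `to_csv()` without a path returns the CSV as a string, which keeps the example free of files.

```python
import io

import pandas as pd

toy = pd.DataFrame({"age": [56, 57], "marital": ["married", "married"]})

with_index = toy.to_csv()                # index written as an unnamed first column
without_index = toy.to_csv(index=False)  # data columns only

print(with_index.splitlines()[0])        # header line begins with an empty field
print(without_index.splitlines()[0])

# Reading the indexed variant back turns the old index into an extra column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```

If the index does carry meaning, an alternative is to keep it in the file and pass `index_col=0` to `pd.read_csv` when loading.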


### **Exporting to Excel (Single Sheet)**

Exporting data to Excel is crucial for sharing with users who prefer spreadsheet formats. When exporting to a single sheet, it's important to specify the `sheet_name` to make the data easily identifiable within the workbook. Additionally, setting `index=False` prevents pandas from writing the DataFrame's index as a column in the Excel file, which is often unnecessary and can be confusing.

In [75]:
excel_single_sheet_path = os.path.join(folder_name, 'bank_marketing_data_single_sheet.xlsx')
bank_data.to_excel(excel_single_sheet_path, sheet_name='Bank Data', index=False)
print(f"DataFrame successfully exported to: {excel_single_sheet_path}")

DataFrame successfully exported to: Data/bank_marketing_data_single_sheet.xlsx


### **Export to Excel (Multiple Sheets)**

In [76]:
df_married = bank_data[bank_data['marital'] == 'married']
df_single = bank_data[bank_data['marital'] == 'single']

print(f"Created df_married with {len(df_married)} rows.")
print(f"Created df_single with {len(df_single)} rows.")

Created df_married with 24928 rows.
Created df_single with 11568 rows.


In [77]:
excel_multi_sheet_path = os.path.join(folder_name, 'bank_marketing_data_multi_sheet.xlsx')

with pd.ExcelWriter(excel_multi_sheet_path) as writer:
    df_married.to_excel(writer, sheet_name='Married Clients', index=False)
    df_single.to_excel(writer, sheet_name='Single Clients', index=False)

print(f"DataFrame successfully exported to: {excel_multi_sheet_path}")

DataFrame successfully exported to: Data/bank_marketing_data_multi_sheet.xlsx


### **Exporting Data to JSON**

Exporting data to JSON (JavaScript Object Notation) is essential for web applications, APIs, and configuration files due to its human-readable and machine-parseable nature. Pandas provides the `.to_json()` method, which offers various `orient` parameters to control the structure of the output JSON.

Here are the most common `orient` options:

*   **`orient='records'`:** Often the preferred format for row-oriented consumers such as APIs. It exports the DataFrame as a list of JSON objects, where each object represents a row and keys correspond to column names. Index labels are not written. This layout is intuitive and easy to parse in many applications.

    ```json
    [
      {"col1": "value1", "col2": "value2"},
      {"col1": "value3", "col2": "value4"}
    ]
    ```

*   **`orient='index'`:** This format exports the DataFrame as a JSON object, where keys are the DataFrame's index labels, and values are JSON objects representing the rows. This can be useful when the index carries significant meaning or needs to be preserved as primary keys.

    ```json
    {
      "index1": {"col1": "value1", "col2": "value2"},
      "index2": {"col1": "value3", "col2": "value4"}
    }
    ```

*   **`orient='columns'` (Default for DataFrames):** This format exports the DataFrame as a JSON object where keys are the column names, and values are JSON objects where keys are the index labels and values are the cell data. This orientation can be less intuitive for general use but might be suitable for specific data structures or visualizations.

    ```json
    {
      "col1": {"index1": "value1", "index2": "value3"},
      "col2": {"index1": "value2", "index2": "value4"}
    }
    ```

Understanding these orientations allows you to export your data in the most suitable structure for your target application or consumer.
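
The three layouts can be compared side by side on a tiny DataFrame. This is a minimal sketch using the same placeholder names as the JSON snippets above (`col1`, `index1`, and so on); the strings returned by `to_json()` are parsed with the standard `json` module so that the printed structures are easy to read.

```python
import json

import pandas as pd

toy = pd.DataFrame(
    {"col1": ["value1", "value3"], "col2": ["value2", "value4"]},
    index=["index1", "index2"],
)

# With no path argument, to_json returns the JSON text as a string.
as_records = json.loads(toy.to_json(orient="records"))
as_index = json.loads(toy.to_json(orient="index"))
as_columns = json.loads(toy.to_json(orient="columns"))

print(as_records)   # list of row objects, index labels dropped
print(as_index)     # index label -> row object
print(as_columns)   # column name -> {index label: value}
```

Note that only the `index` and `columns` layouts carry the row labels; with `records` they are lost on export.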

The first export writes `bank_data` to `bank_marketing_data.json` with `orient='records'`. Passing `indent=4` pretty-prints the file so it is easier to inspect.



In [78]:
json_records_file_path = os.path.join(folder_name, 'bank_marketing_data.json')
bank_data.to_json(json_records_file_path, orient='records', indent=4)
print(f"DataFrame successfully exported to: {json_records_file_path} with orient='records'")

DataFrame successfully exported to: Data/bank_marketing_data.json with orient='records'


The second export writes the same DataFrame with `orient='index'`, so each row is keyed by its index label.



In [79]:
json_index_file_path = os.path.join(folder_name, 'bank_marketing_data_by_index.json')
bank_data.to_json(json_index_file_path, orient='index', indent=4)
print(f"DataFrame successfully exported to: {json_index_file_path} with orient='index'")

DataFrame successfully exported to: Data/bank_marketing_data_by_index.json with orient='index'


The third export uses `orient='columns'`, which nests the values under each column name.



In [80]:
json_columns_file_path = os.path.join(folder_name, 'bank_marketing_data_by_columns.json')
bank_data.to_json(json_columns_file_path, orient='columns', indent=4)
print(f"DataFrame successfully exported to: {json_columns_file_path} with orient='columns'")

DataFrame successfully exported to: Data/bank_marketing_data_by_columns.json with orient='columns'


With all three JSON files written, list the contents of the `Data` folder to confirm that they sit alongside the other exported files.



In [81]:
print("Contents of the 'Data' folder after all JSON exports:")
print(os.listdir(folder_name))

Contents of the 'Data' folder after all JSON exports:
['bank_marketing_data.xlsx', 'bank_marketing_data.csv', 'bank_marketing_data_single_sheet.xlsx', 'bank_marketing_data_multi_sheet.xlsx', 'bank_marketing_data.json', 'bank_marketing_data_by_index.json', 'bank_marketing_data_by_columns.json', 'bank_marketing_data.html', 'bank_marketing_data.parquet', 'bank_marketing_data.feather', 'bank_marketing_data.pkl']


### **Export to HTML File**

In [82]:
html_file_path = os.path.join(folder_name, 'bank_marketing_data.html')
bank_data.to_html(html_file_path, index=False)
print(f"DataFrame successfully exported to: {html_file_path}")

DataFrame successfully exported to: Data/bank_marketing_data.html


### **Export to Parquet File**



In [83]:
parquet_file_path = os.path.join(folder_name, 'bank_marketing_data.parquet')
bank_data.to_parquet(parquet_file_path)
print(f"DataFrame successfully exported to: {parquet_file_path}")

DataFrame successfully exported to: Data/bank_marketing_data.parquet


### **Export to Feather File**



In [84]:
feather_file_path = os.path.join(folder_name, 'bank_marketing_data.feather')
bank_data.to_feather(feather_file_path)
print(f"DataFrame successfully exported to: {feather_file_path}")

DataFrame successfully exported to: Data/bank_marketing_data.feather


### **Export to Pickle File**



Pickle is Python's standard module for serializing and de-serializing Python object structures. When you export a pandas DataFrame to a Pickle file, you are saving the DataFrame in a binary format that can be easily reloaded later in a Python environment, preserving its structure, index, and data types. This makes Pickle an excellent choice for persisting `DataFrames` for internal use within Python projects.

**Key characteristics of Pickle:**

*   **Python-specific:** Pickle files are designed for Python and are generally not suitable for sharing data with other programming languages or systems due to potential compatibility issues and security concerns.
*   **Preserves object fidelity:** It saves the full object hierarchy, including custom Python objects and complex data structures, ensuring that when reloaded, the object is identical to its original state.
*   **Security risk:** Unpickling data from untrusted sources can be dangerous as it can execute arbitrary code. Therefore, it should only be used with trusted data.
*   **Efficiency:** For Python-to-Python serialization, Pickle can be quite efficient in terms of both storage and performance.

In [85]:
pickle_file_path = os.path.join(folder_name, 'bank_marketing_data.pkl')
bank_data.to_pickle(pickle_file_path)
print(f"DataFrame successfully exported to: {pickle_file_path}")

DataFrame successfully exported to: Data/bank_marketing_data.pkl
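
The fidelity point can be checked with a small experiment. The sketch below uses a toy DataFrame with a datetime column and a categorical column (both invented for illustration, not taken from the bank dataset) and compares a Pickle round trip with a CSV round trip.

```python
import io
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {
        "contact_date": pd.to_datetime(["2026-01-05", "2026-01-06"]),
        "marital": pd.Categorical(["married", "single"]),
    }
)

# Round trip through Pickle (only unpickle files you created or trust).
pkl_path = os.path.join(tempfile.mkdtemp(), "toy.pkl")
toy.to_pickle(pkl_path)
from_pickle = pd.read_pickle(pkl_path)

# Round trip through CSV text for comparison.
from_csv = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Raises AssertionError if values, dtypes, or labels differ.
pd.testing.assert_frame_equal(toy, from_pickle)

print(from_pickle.dtypes)  # datetime and category dtypes are kept
print(from_csv.dtypes)     # CSV stores text only, so dtypes are re-inferred
```

With CSV, the datetime and categorical types would have to be restored by hand, for example with `parse_dates` and `dtype` arguments to `pd.read_csv`.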


## **Summary:**

### Data Analysis Key Findings

*   **Diverse Export Options Demonstrated**: The `bank_data` DataFrame was successfully exported to seven different file formats: CSV, Excel (single and multiple sheets), JSON (records, index, and columns orientations), HTML, Parquet, Feather, and Pickle.
*   **CSV Export**: The `bank_data` DataFrame was exported to `bank_marketing_data.csv` using `index=False` to omit the DataFrame index and `encoding='utf-8'` for broad character support.
*   **Excel Export (Single Sheet)**: The `bank_data` was exported to `bank_marketing_data_single_sheet.xlsx` with the sheet named 'Bank Data' and `index=False`.
*   **Excel Export (Multiple Sheets)**: The `bank_data` was split into two DataFrames based on marital status: `df_married` (24,928 rows) and `df_single` (11,568 rows). Both were exported into a single Excel file, `bank_marketing_data_multi_sheet.xlsx`, with separate sheets named 'Married Clients' and 'Single Clients', respectively.
*   **JSON Export**: The `bank_data` was exported to three distinct JSON files, demonstrating different `orient` parameters:
    *   `bank_marketing_data.json` using `orient='records'` (list of objects, each representing a row).
    *   `bank_marketing_data_by_index.json` using `orient='index'` (object where keys are index labels, values are row objects).
    *   `bank_marketing_data_by_columns.json` using `orient='columns'` (object where keys are column names, values are index-value pairs).
*   **HTML Export**: The `bank_data` was exported to `bank_marketing_data.html` with `index=False`, suitable for web display.
*   **Parquet Export**: The `bank_data` was exported to `bank_marketing_data.parquet`, highlighting its efficiency for large datasets and columnar storage.
*   **Feather Export**: The `bank_data` was exported to `bank_marketing_data.feather`, emphasizing its speed and efficiency for data transfer between Python and R.
*   **Pickle Export**: The `bank_data` was exported to `bank_marketing_data.pkl`, demonstrating Python's native serialization for preserving exact object structures within a Python environment, with a note on its Python-specificity and security considerations.

### Insights or Next Steps

*   **Choosing the Right Format**: The choice of export format should align with the data's intended use, target system, and required performance characteristics. For general sharing, CSV or Excel are common; for web integration, JSON or HTML; for big data and performance, Parquet or Feather; and for Python-internal persistence, Pickle.
*   **Data Persistence beyond Export**: While exporting covers various formats, a logical next step for true data persistence would be to demonstrate how to *read* these exported files back into pandas DataFrames, ensuring data integrity and validating the export process. This would complete the round-trip for data storage and retrieval.
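
A round-trip check of this kind might look like the sketch below. It uses a small toy DataFrame and an in-memory buffer instead of the files on disk, and relies on `pd.testing.assert_frame_equal`, which raises an `AssertionError` describing the first mismatch it finds. The `round_trip_ok` flag is just an illustrative name.

```python
import io

import pandas as pd

original = pd.DataFrame(
    {
        "age": [56, 57, 37],
        "job": ["housemaid", "services", "services"],
        "euribor3m": [4.857, 4.857, 4.856],
    }
)

# Export to an in-memory CSV buffer, then read it straight back.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Compares values, dtypes, column labels, and index.
pd.testing.assert_frame_equal(original, restored)
round_trip_ok = True
print("CSV round trip preserved the data:", round_trip_ok)
```

The same pattern applies to the other formats by swapping the writer and reader pair.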


## **Introduction to Data Import**

### The Importance and Variety of Data Import in Pandas

Data import is a foundational step in any data analysis or data science workflow. It involves bringing raw data from various external sources into a structured format, typically a pandas DataFrame, for manipulation, analysis, and visualization. Without effective data import mechanisms, the valuable insights locked within data remain inaccessible.

pandas, a powerful and flexible data manipulation library in Python, excels at handling a wide array of data formats. Its intuitive functions make it straightforward to read data from different sources directly into DataFrames, streamlining the initial phase of data processing. This versatility is crucial because real-world data rarely comes in a single, standardized format.

In the following sections, we will explore practical examples of importing data from several common file types, demonstrating pandas' robust capabilities:

1.  **CSV (Comma Separated Values):** A universal plain-text format, ideal for simple data exchange.
2.  **Excel (.xlsx):** Widely used for structured data, especially in business contexts.
3.  **JSON (JavaScript Object Notation):** A lightweight, human-readable format, frequently used for web data and APIs.
4.  **HTML (HyperText Markup Language):** Data embedded within web pages, often in tabular form.
5.  **Parquet:** A columnar storage format optimized for large datasets and analytical queries.
6.  **Feather:** A fast, language-agnostic binary format designed for efficient DataFrame transfer.
7.  **Pickle:** Python's native serialization format, perfect for preserving exact Python object structures within a Python environment.

Understanding how to effectively import these diverse formats is key to becoming proficient in data handling with pandas.

### **Import from CSV File**



In [86]:
csv_import_file_path = os.path.join(folder_name, 'bank_marketing_data.csv')
df_csv = pd.read_csv(csv_import_file_path, encoding='utf-8')
print(f"CSV file successfully imported from: {csv_import_file_path}")
print("\nFirst 5 rows of the imported DataFrame (df_csv):")
display(df_csv.head())

CSV file successfully imported from: Data/bank_marketing_data.csv

First 5 rows of the imported DataFrame (df_csv):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


### **Import from Excel (Single Sheet)**



In [87]:
excel_single_sheet_import_path = os.path.join(folder_name, 'bank_marketing_data_single_sheet.xlsx')
df_excel_single = pd.read_excel(excel_single_sheet_import_path, sheet_name='Bank Data')
print(f"Excel file (single sheet) successfully imported from: {excel_single_sheet_import_path}")
print("\nFirst 5 rows of the imported DataFrame (df_excel_single):")
display(df_excel_single.head())

Excel file (single sheet) successfully imported from: Data/bank_marketing_data_single_sheet.xlsx

First 5 rows of the imported DataFrame (df_excel_single):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


### **Import from Excel (Multiple Sheets)**


In [88]:
excel_multi_sheet_import_path = os.path.join(folder_name, 'bank_marketing_data_multi_sheet.xlsx')

df_married_imported = pd.read_excel(excel_multi_sheet_import_path, sheet_name='Married Clients')
df_single_imported = pd.read_excel(excel_multi_sheet_import_path, sheet_name='Single Clients')

print(f"Excel file (multiple sheets) successfully imported from: {excel_multi_sheet_import_path}")
print("\nFirst 5 rows of the 'Married Clients' DataFrame (df_married_imported):")
display(df_married_imported.head())
print(f"Length of df_married_imported: {len(df_married_imported)}")

print("\nFirst 5 rows of the 'Single Clients' DataFrame (df_single_imported):")
display(df_single_imported.head())
print(f"Length of df_single_imported: {len(df_single_imported)}")

Excel file (multiple sheets) successfully imported from: Data/bank_marketing_data_multi_sheet.xlsx

First 5 rows of the 'Married Clients' DataFrame (df_married_imported):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


Length of df_married_imported: 24928

First 5 rows of the 'Single Clients' DataFrame (df_single_imported):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,24,technician,single,professional.course,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,25,services,single,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,25,services,single,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,29,blue-collar,single,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,39,management,single,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


Length of df_single_imported: 11568


### **Import from JSON File**

In [89]:
json_records_import_file_path = os.path.join(folder_name, 'bank_marketing_data.json')
df_json_records = pd.read_json(json_records_import_file_path, orient='records')

print(f"JSON file (orient='records') successfully imported from: {json_records_import_file_path}")
print("\nFirst 5 rows of the imported DataFrame (df_json_records):")
display(df_json_records.head())


JSON file (orient='records') successfully imported from: Data/bank_marketing_data.json

First 5 rows of the imported DataFrame (df_json_records):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no





Next, import the `bank_marketing_data_by_index.json` file (exported with `orient='index'`) using `pd.read_json` with the matching `orient`, and check the resulting structure, paying attention to the index.

In [90]:
json_index_import_file_path = os.path.join(folder_name, 'bank_marketing_data_by_index.json')
df_json_index = pd.read_json(json_index_import_file_path, orient='index')

print(f"JSON file (orient='index') successfully imported from: {json_index_import_file_path}")
print("\nFirst 5 rows of the imported DataFrame (df_json_index):")
display(df_json_index.head())


JSON file (orient='index') successfully imported from: Data/bank_marketing_data_by_index.json

First 5 rows of the imported DataFrame (df_json_index):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
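
Because `bank_data` has a default integer index, the difference between orientations is hard to see in the output above. The following minimal sketch uses a toy DataFrame with string row labels (`client_a` and `client_b` are made-up names) to show what each orientation brings back.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"age": [56, 24], "marital": ["married", "single"]},
    index=["client_a", "client_b"],
)

# orient='index' keeps the row labels as the top-level JSON keys.
by_index = pd.read_json(io.StringIO(toy.to_json(orient="index")), orient="index")

# orient='records' drops them, so a default integer index comes back.
by_records = pd.read_json(io.StringIO(toy.to_json(orient="records")), orient="records")

print(list(by_index.index))
print(list(by_records.index))
```

Wrapping the JSON text in `io.StringIO` avoids passing a literal JSON string to `pd.read_json`, which newer pandas versions discourage.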






Finally, import the `bank_marketing_data_by_columns.json` file (exported with `orient='columns'`) using `pd.read_json`, and check the resulting structure, paying attention to how the columns are handled.

In [91]:
json_columns_import_file_path = os.path.join(folder_name, 'bank_marketing_data_by_columns.json')
df_json_columns = pd.read_json(json_columns_import_file_path, orient='columns')

print(f"JSON file (orient='columns') successfully imported from: {json_columns_import_file_path}")
print("\nFirst 5 rows of the imported DataFrame (df_json_columns):")
display(df_json_columns.head())


JSON file (orient='columns') successfully imported from: Data/bank_marketing_data_by_columns.json

First 5 rows of the imported DataFrame (df_json_columns):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no





### **Import from HTML File**


In [92]:
html_import_file_path = os.path.join(folder_name, 'bank_marketing_data.html')
df_html = pd.read_html(html_import_file_path)[0]

print(f"HTML file successfully imported from: {html_import_file_path}")
print("\nFirst 5 rows of the imported DataFrame (df_html):")
display(df_html.head())


HTML file successfully imported from: Data/bank_marketing_data.html

First 5 rows of the imported DataFrame (df_html):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no





### **Import from Parquet File**

In [93]:
parquet_import_file_path = os.path.join(folder_name, 'bank_marketing_data.parquet')
df_parquet = pd.read_parquet(parquet_import_file_path)

print(f"Parquet file successfully imported from: {parquet_import_file_path}")
print("\nFirst 5 rows of the imported DataFrame (df_parquet):")
display(df_parquet.head())


Parquet file successfully imported from: Data/bank_marketing_data.parquet

First 5 rows of the imported DataFrame (df_parquet):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no





### **Import from Feather File**

In [94]:
feather_import_file_path = os.path.join(folder_name, 'bank_marketing_data.feather')
df_feather = pd.read_feather(feather_import_file_path)

print(f"Feather file successfully imported from: {feather_import_file_path}")
print("\nFirst 5 rows of the imported DataFrame (df_feather):")
display(df_feather.head())


Feather file successfully imported from: Data/bank_marketing_data.feather

First 5 rows of the imported DataFrame (df_feather):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no





### **Import from Pickle File**

In [95]:
pickle_import_file_path = os.path.join(folder_name, 'bank_marketing_data.pkl')
df_pickle = pd.read_pickle(pickle_import_file_path)

print(f"Pickle file successfully imported from: {pickle_import_file_path}")
print("\nFirst 5 rows of the imported DataFrame (df_pickle):")
display(df_pickle.head())


Pickle file successfully imported from: Data/bank_marketing_data.pkl

First 5 rows of the imported DataFrame (df_pickle):


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no



DataFrame Info (df_pickle):


## **Summary**

This section recaps the data import methods covered in this notebook, their typical use cases, and best practices for loading data.

*   **Summary of Data Import Methods Covered:**
    This notebook demonstrated importing data with `pd.read_csv()`, `pd.read_excel()`, `pd.read_json()`, `pd.read_html()`, `pd.read_parquet()`, `pd.read_feather()`, and `pd.read_pickle()`. Together these cover plain-text tabular files, spreadsheets, semi-structured data, HTML tables, and specialized binary formats.

*   **Use Cases for Each Method:**
    *   **CSV (Comma Separated Values):** Ideal for simple, universal data exchange due to its plain-text nature and broad compatibility.
    *   **Excel (.xlsx):** Widely used in business for structured data, often containing multiple sheets or specific formatting. Pass `sheet_name` to choose which sheet to load; by default `pd.read_excel()` reads only the first sheet.
    *   **JSON (JavaScript Object Notation):** Well suited to web data, APIs, and semi-structured data. The `orient` parameter (e.g., 'records', 'index', 'columns') tells pandas how the JSON is laid out, so it must match the structure of the file.
    *   **HTML (HyperText Markup Language):** Useful for extracting tabular data directly from web pages or local HTML files. `pd.read_html()` returns a list of DataFrames, one per table found.
    *   **Parquet:** A columnar storage format optimized for large datasets and analytical queries, offering efficient storage and retrieval, especially for big data ecosystems.
    *   **Feather:** A fast, language-agnostic binary format designed for efficient DataFrame transfer between different programming environments (e.g., Python and R), prioritizing speed.
    *   **Pickle:** Python's native serialization format, perfect for preserving the exact state and data types of Python objects (like DataFrames) within a Python environment, ensuring fidelity but with limited cross-language compatibility.

*   **Data Loading Best Practices:**
    *   **Choose the Right Format:** Select the most appropriate file format based on data source, size, performance requirements, and interoperability needs.
    *   **Specify Encoding:** Always define the `encoding` (e.g., 'utf-8') for text-based formats like CSV to prevent character decoding errors.
    *   **Handle Multi-Sheet Files:** For Excel files, explicitly specify the `sheet_name` or iterate through sheets when multiple datasets are present.
    *   **Understand JSON `orient`:** When importing JSON, determine the `orient` parameter (e.g., 'records', 'index', 'columns') that best matches the structure of your JSON file to ensure correct DataFrame construction.
    *   **Verify Imports:** Always inspect the first few rows (`.head()`) and the DataFrame's information (`.info()`) after import to confirm data integrity, correct column names, data types, and expected number of rows.
    *   **Utilize Efficient Formats:** For large datasets or performance-critical applications, prefer binary formats like Parquet or Feather over CSV or JSON.
    *   **Contextual Use of Pickle:** Use Pickle when preserving the exact Python object structure is critical and the workflow is Python-only. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.
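
As a minimal sketch of the encoding and verification practices above, the snippet below writes a tiny made-up table to a temporary CSV, reads it back with an explicit `encoding`, and checks the result. The `verify_import` helper, the sample values, and the file name `sample.csv` are illustrative assumptions, not part of the notebook.

```python
import os
import tempfile

import pandas as pd


def verify_import(df, expected_rows, expected_columns):
    """Hypothetical helper: fail fast if a loaded DataFrame looks wrong."""
    if len(df) != expected_rows:
        raise ValueError(f"Expected {expected_rows} rows, got {len(df)}")
    missing = [col for col in expected_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df


# Made-up sample data; the non-ASCII character exercises the encoding.
sample = pd.DataFrame({"job": ["admin.", "café owner"], "age": [40, 37]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    sample.to_csv(csv_path, index=False, encoding="utf-8")
    # State the encoding explicitly instead of relying on the default.
    loaded = pd.read_csv(csv_path, encoding="utf-8")

verify_import(loaded, expected_rows=2, expected_columns=["job", "age"])
print(loaded.head())  # inspect the first rows
loaded.info()         # inspect dtypes and non-null counts
```

For the real dataset, the same helper could be called with `expected_rows=41188` and the list of expected column names, so that a silently truncated or mis-parsed file is caught right after loading.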

### Data Analysis Key Findings

*   **CSV Import:** The `bank_marketing_data.csv` file was successfully imported into `df_csv` (41,188 rows, 21 columns) using `pd.read_csv()` with `encoding='utf-8'`.
*   **Excel (Single Sheet) Import:** The `bank_marketing_data_single_sheet.xlsx` file was successfully imported into `df_excel_single` (41,188 rows, 21 columns) using `pd.read_excel()` by specifying `sheet_name='Bank Data'`.
*   **Excel (Multiple Sheets) Import:** The `bank_marketing_data_multi_sheet.xlsx` file was successfully imported into two separate DataFrames: `df_married_imported` (24,928 rows, 21 columns from 'Married Clients' sheet) and `df_single_imported` (11,568 rows, 21 columns from 'Single Clients' sheet) using `pd.read_excel()` with specific `sheet_name` parameters.
*   **JSON Import (`orient='records'`):** The `bank_marketing_data.json` file was successfully imported into `df_json_records` (41,188 rows, 21 columns) using `pd.read_json()` with `orient='records'`, which is suitable for list-of-dictionaries JSON structures.
*   **JSON Import (`orient='index'`):** The `bank_marketing_data_by_index.json` file was successfully imported into `df_json_index` (41,188 rows, 21 columns) using `pd.read_json()` with `orient='index'`, correctly mapping JSON keys to the DataFrame's index.
*   **JSON Import (`orient='columns'`):** The `bank_marketing_data_by_columns.json` file was successfully imported into `df_json_columns` (41,188 rows, 21 columns) using `pd.read_json()` with `orient='columns'`, where JSON keys represent columns.
*   **HTML Import:** The `bank_marketing_data.html` file was successfully imported into `df_html` (41,188 rows, 21 columns) using `pd.read_html()`, which extracts tables from HTML.
*   **Parquet Import:** The `bank_marketing_data.parquet` file was successfully imported into `df_parquet` (41,188 rows, 21 columns) using `pd.read_parquet()`, maintaining data types and structure.
*   **Feather Import:** The `bank_marketing_data.feather` file was successfully imported into `df_feather` (41,188 rows, 21 columns) using `pd.read_feather()`, demonstrating efficient binary import.
*   **Pickle Import:** The `bank_marketing_data.pkl` file was successfully imported into `df_pickle` (41,188 rows, 21 columns) using `pd.read_pickle()`, preserving the exact Python DataFrame object.
*   Every full-dataset import contained 41,188 rows and 21 columns, and the two sheet-specific imports ('Married Clients' and 'Single Clients') contained the expected subsets, showing that pandas loaded the same data accurately from each format.
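
The three JSON layouts named in the findings above can be compared on a tiny made-up DataFrame. This is a sketch rather than the notebook's own code: `sample` and its values are invented for illustration, and the JSON is kept in memory instead of being read from the `bank_marketing_data*.json` files.

```python
from io import StringIO

import pandas as pd

# Made-up two-row sample standing in for the full dataset.
sample = pd.DataFrame({"age": [56, 57], "job": ["housemaid", "services"]})

restored = {}
for orient in ("records", "index", "columns"):
    json_text = sample.to_json(orient=orient)
    print(f"{orient}: {json_text}")
    # Read back with the same orient that was used for writing.
    restored[orient] = pd.read_json(StringIO(json_text), orient=orient)

# Each layout should rebuild a frame with the same shape.
shapes = {orient: df.shape for orient, df in restored.items()}
print(shapes)
```

Printing the raw text for each `orient` makes it easier to recognize which layout an unfamiliar JSON file uses before choosing the matching value for `pd.read_json()`.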

### Insights or Next Steps

*   Understanding the appropriate `pd.read_*` function and its key parameters (e.g., `encoding`, `sheet_name`, `orient`) is crucial for efficient and error-free data loading, adapting to various data structures and file formats.
*   For large-scale data analytics, prioritizing binary storage formats like Parquet or Feather for intermediate data storage can significantly improve I/O performance and reduce storage footprint compared to text-based formats like CSV or JSON.
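
To make the storage comparison concrete, the following sketch writes the same made-up DataFrame to CSV, JSON, and (if a Parquet engine such as `pyarrow` is installed) Parquet, then reports the file sizes. The synthetic data, file names, and row count are assumptions for illustration; actual sizes depend on the data and library versions, so treat the numbers as indicative only.

```python
import os
import tempfile

import pandas as pd

# Synthetic, repetitive data loosely shaped like a few bank marketing columns.
n_rows = 10_000
sample = pd.DataFrame({
    "age": [30 + (i % 40) for i in range(n_rows)],
    "job": ["services" if i % 2 else "admin." for i in range(n_rows)],
    "euribor3m": [4.857] * n_rows,
})

sizes = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    sample.to_csv(csv_path, index=False)
    sizes["csv"] = os.path.getsize(csv_path)

    json_path = os.path.join(tmp_dir, "sample.json")
    sample.to_json(json_path, orient="records")
    sizes["json"] = os.path.getsize(json_path)

    parquet_path = os.path.join(tmp_dir, "sample.parquet")
    try:
        sample.to_parquet(parquet_path, index=False)
        sizes["parquet"] = os.path.getsize(parquet_path)
    except ImportError:
        # No Parquet engine (pyarrow or fastparquet) is available.
        print("Parquet engine not installed; skipping Parquet.")

for fmt, size in sizes.items():
    print(f"{fmt}: {size:,} bytes")
```

On very small tables the fixed metadata overhead of a binary format can outweigh its compression benefit, so the comparison is most meaningful on data of realistic size.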
