# Pandas Tutorial - Part 66: DataFrame Methods (set_index, to_csv)

This notebook covers two important DataFrame methods:
- `set_index()` - Set the DataFrame index using existing columns
- `to_csv()` - Write object to a comma-separated values (csv) file

In [None]:
import pandas as pd
import numpy as np
import os
import tempfile

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

## 1. DataFrame.set_index()

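The `set_index()` method sets the DataFrame index (row labels) using one or more existing columns or arrays. By default the new index replaces the existing one; with `append=True` it is added as an extra level instead. The method returns a new DataFrame unless `inplace=True` is passed.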
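
As a minimal sketch of the default behaviour (the names `sales` and `by_month` are illustrative, not part of the notebook's running example), `set_index()` hands back a new DataFrame and leaves the original untouched, after which rows can be looked up by label with `.loc`:

```python
import pandas as pd

sales = pd.DataFrame({
    'month': [1, 4, 7, 10],
    'sale': [55, 40, 84, 31]
})

# set_index returns a new DataFrame; 'sales' keeps its default index
by_month = sales.set_index('month')

print(list(sales.columns))      # 'month' is still a column in the original
print(by_month.index.name)      # the new index is named after the column
print(by_month.loc[7, 'sale'])  # look up the row labelled 7
```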

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({
    'month': [1, 4, 7, 10],
    'year': [2012, 2014, 2013, 2014],
    'sale': [55, 40, 84, 31]
})

print("Original DataFrame:")
df

In [None]:
# Use the 'month' column as the index
print("Set 'month' as index:")
df.set_index('month')

In [None]:
# Create a MultiIndex using columns 'year' and 'month'
print("Set MultiIndex using 'year' and 'month':")
df.set_index(['year', 'month'])

In [None]:
# Set the index but also keep 'month' as a regular column
print("Set 'month' as index without dropping the column:")
df.set_index('month', drop=False)

In [None]:
# Append to the existing index
# First, set 'year' as index
df_year_index = df.set_index('year')
print("DataFrame with 'year' as index:")
print(df_year_index)

# Then, append 'month' to the index
print("\nAppend 'month' to the existing index:")
df_year_index.set_index('month', append=True)

In [None]:
# Set index inplace
print("Before inplace operation:")
print(df)

df.set_index('month', inplace=True)
print("\nAfter inplace operation:")
print(df)

In [None]:
# Reset the DataFrame for further examples
df = pd.DataFrame({
    'month': [1, 4, 7, 10],
    'year': [2012, 2014, 2013, 2014],
    'sale': [55, 40, 84, 31]
})

In [None]:
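# Using an array-like as index
# A plain list of strings would be read as column labels and raise a KeyError,
# so wrap the values in pd.Index to pass them as the index values themselves.
new_index = pd.Index(['a', 'b', 'c', 'd'])
print("Set index using an array:")
df.set_index(new_index)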

In [None]:
# Using a Series as index
new_index_series = pd.Series(['w', 'x', 'y', 'z'])
print("Set index using a Series:")
df.set_index(new_index_series)

In [None]:
# Create a DataFrame with duplicate values
df_dup = pd.DataFrame({
    'month': [1, 1, 7, 10],
    'year': [2012, 2012, 2013, 2014],
    'sale': [55, 40, 84, 31]
})

print("DataFrame with duplicate values:")
print(df_dup)

In [None]:
# Set index with verify_integrity=True
try:
    df_dup.set_index('month', verify_integrity=True)
except ValueError as e:
    print(f"Error: {e}")

In [None]:
# Set index with verify_integrity=False (default)
print("Set index with verify_integrity=False (default):")
df_dup.set_index('month', verify_integrity=False)

In [None]:
# Create a MultiIndex with duplicate values
print("Set MultiIndex with duplicate values:")
df_dup.set_index(['year', 'month'])
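
One way to inspect duplicates before relying on label-based lookups is to check `Index.is_unique` and `Index.duplicated()`. The sketch below uses a small illustrative frame named `frame` rather than the notebook's `df_dup`:

```python
import pandas as pd

frame = pd.DataFrame({
    'month': [1, 1, 7, 10],
    'sale': [55, 40, 84, 31]
})

indexed = frame.set_index('month')  # duplicates are allowed by default

# is_unique is False because the label 1 appears twice
print(indexed.index.is_unique)

# duplicated() flags the second and later occurrences of each label
dup_mask = indexed.index.duplicated()
print(indexed[dup_mask])

# Selecting a duplicated label returns every matching row
rows_for_month_1 = indexed.loc[1]
print(rows_for_month_1)
```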

## 2. DataFrame.to_csv()

The `to_csv()` method writes the DataFrame to a comma-separated values (CSV) file or buffer; when called without a target, it returns the CSV text as a string. It provides many options for customizing the output format.
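
The sketch below illustrates the return value, assuming an in-memory `io.StringIO` buffer in place of a real file (the names `small` and `buffer` are illustrative). Without a target, `to_csv()` returns the CSV text; with one, it writes to the target and returns `None`.

```python
import io
import pandas as pd

small = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# No path or buffer: the CSV text comes back as a string
text = small.to_csv(index=False)

# With a buffer (or a file path), the text is written there and None is returned
buffer = io.StringIO()
result = small.to_csv(buffer, index=False)

print(type(text).__name__)
print(result)
print(buffer.getvalue() == text)
```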

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'salary': [50000, 60000, np.nan, 80000],
    'department': ['HR', 'IT', 'Finance', 'Marketing']
})

print("Sample DataFrame:")
df

In [None]:
# Basic usage - write to a file
temp_file = tempfile.NamedTemporaryFile(suffix='.csv', delete=False)
temp_file.close()

df.to_csv(temp_file.name)

# Read the file back to see what was written
print("Contents of the CSV file:")
with open(temp_file.name, 'r') as f:
    print(f.read())

# Clean up
os.unlink(temp_file.name)

In [None]:
# Return as string instead of writing to a file
csv_string = df.to_csv()
print("CSV as string:")
print(csv_string)

In [None]:
# Customize the separator
csv_string = df.to_csv(sep='|')
print("CSV with pipe separator:")
print(csv_string)

In [None]:
# Customize how missing values are represented
csv_string = df.to_csv(na_rep='MISSING')
print("CSV with custom NA representation:")
print(csv_string)

In [None]:
# Format floating point numbers
df_float = pd.DataFrame({
    'A': [1.123456, 2.123456],
    'B': [3.123456, 4.123456]
})

csv_string = df_float.to_csv(float_format='%.2f')
print("CSV with formatted float values:")
print(csv_string)
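
Note that `float_format` is applied to floating point values only. A small sketch with an illustrative frame named `mixed`, which has one integer and one float column:

```python
import pandas as pd

mixed = pd.DataFrame({'count': [1, 2], 'ratio': [1.5, 2.71828]})

# float_format affects the float column; the integers are written as-is
csv_text = mixed.to_csv(index=False, float_format='%.2f')
print(csv_text)
```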

In [None]:
# Select specific columns to write
csv_string = df.to_csv(columns=['name', 'age'])
print("CSV with selected columns:")
print(csv_string)

In [None]:
# Control header output
# No header
csv_string = df.to_csv(header=False)
print("CSV without header:")
print(csv_string)

# Custom header names
csv_string = df.to_csv(header=['Name', 'Age', 'Salary', 'Department'])
print("\nCSV with custom header names:")
print(csv_string)

In [None]:
# Control index output
# No index
csv_string = df.to_csv(index=False)
print("CSV without index:")
print(csv_string)

# Custom index label
csv_string = df.to_csv(index_label='ID')
print("\nCSV with custom index label:")
print(csv_string)
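
Because the index is written as an ordinary leading column, reading the data back requires telling `read_csv()` which column holds it. A hedged sketch of the round trip, using an in-memory buffer and an illustrative frame named `people`:

```python
import io
import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# Write the index under the label 'ID'
csv_text = people.to_csv(index_label='ID')

# Without index_col, 'ID' comes back as a regular column
as_column = pd.read_csv(io.StringIO(csv_text))

# With index_col='ID', the saved index is restored
restored = pd.read_csv(io.StringIO(csv_text), index_col='ID')

print(list(as_column.columns))
print(restored.index.name, list(restored.columns))
```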

In [None]:
# Create a DataFrame with MultiIndex
df_multi = pd.DataFrame(
    np.random.randn(4, 2),
    index=pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)]),
    columns=['C1', 'C2']
)

print("DataFrame with MultiIndex:")
print(df_multi)

# Write to CSV
csv_string = df_multi.to_csv()
print("\nMultiIndex DataFrame as CSV:")
print(csv_string)

# With custom index labels
csv_string = df_multi.to_csv(index_label=['Group', 'Number'])
print("\nMultiIndex DataFrame as CSV with custom index labels:")
print(csv_string)
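
To read such a file back with both levels restored, `read_csv()` accepts a list for `index_col`. The following is a sketch with fixed values (the frame `grid` is illustrative) so that the result is reproducible:

```python
import io
import pandas as pd

grid = pd.DataFrame(
    {'C1': [1.0, 2.0, 3.0, 4.0], 'C2': [10.0, 20.0, 30.0, 40.0]},
    index=pd.MultiIndex.from_tuples(
        [('A', 1), ('A', 2), ('B', 1), ('B', 2)],
        names=['Group', 'Number']
    )
)

csv_text = grid.to_csv()

# index_col=[0, 1] rebuilds the two-level index from the first two columns
restored = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])

print(restored.index.nlevels)
print(restored.loc[('B', 1), 'C1'])
```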

In [None]:
# Write to compressed file
temp_file_gz = tempfile.NamedTemporaryFile(suffix='.csv.gz', delete=False)
temp_file_gz.close()

df.to_csv(temp_file_gz.name, compression='gzip')

# Check that the file exists and has content
print(f"Compressed file size: {os.path.getsize(temp_file_gz.name)} bytes")

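# Read the compressed file back (index_col=0 restores the saved index
# instead of loading it as an 'Unnamed: 0' column)
df_read = pd.read_csv(temp_file_gz.name, compression='gzip', index_col=0)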
print("\nDataFrame read from compressed CSV:")
print(df_read)

# Clean up
os.unlink(temp_file_gz.name)

In [None]:
# Write with an explicit encoding
df_encoding = pd.DataFrame({
    'name': ['José', 'María', 'João', 'François'],
    'country': ['Spain', 'Mexico', 'Brazil', 'France']
})

print("DataFrame with non-ASCII characters:")
print(df_encoding)

# Write with UTF-8 encoding (default)
temp_file_utf8 = tempfile.NamedTemporaryFile(suffix='.csv', delete=False)
temp_file_utf8.close()

df_encoding.to_csv(temp_file_utf8.name, encoding='utf-8')

# Read the file back
with open(temp_file_utf8.name, 'r', encoding='utf-8') as f:
    print("\nCSV with UTF-8 encoding:")
    print(f.read())

# Clean up
os.unlink(temp_file_utf8.name)

## Summary

In this notebook, we've explored two important DataFrame methods:

1. **set_index()**: Sets the DataFrame index (row labels) using one or more existing columns or arrays. Key parameters include:
   - `keys`: Column name(s) or array(s) to use as the new index
   - `drop`: Whether to delete the column(s) used as the new index (default `True`)
   - `append`: Whether to append columns to the existing index (default `False`)
   - `inplace`: Whether to modify the DataFrame in place (default `False`)
   - `verify_integrity`: Whether to check the new index for duplicates and raise a `ValueError` if any are found (default `False`)

2. **to_csv()**: Writes the DataFrame to a comma-separated values (CSV) file. This method provides many options for customizing the output format, including:
   - `path_or_buf`: File path or object to write to; if omitted, the CSV is returned as a string
   - `sep`: Field delimiter (default is ',')
   - `na_rep`: How to represent missing values
   - `float_format`: Format string for floating point numbers
   - `columns`: Which columns to write
   - `header`: Whether to write column names and what to call them
   - `index`: Whether to write row names (index)
   - `index_label`: Column label for index column(s)
   - `encoding`: Character encoding for the output
   - `compression`: Compression mode for the output file

These methods are essential for data manipulation and exporting data from pandas to CSV files for sharing or further processing.
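
The two methods combine naturally: index levels set with `set_index()` are written by `to_csv()` as the leading columns. A minimal sketch, reusing the sales data from section 1 under the illustrative name `sales`:

```python
import pandas as pd

sales = pd.DataFrame({
    'month': [1, 4, 7, 10],
    'year': [2012, 2014, 2013, 2014],
    'sale': [55, 40, 84, 31]
})

# The index levels become the leading columns of the CSV
csv_text = sales.set_index(['year', 'month']).to_csv()
print(csv_text)
```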