# Pandas Tutorial - Part 37

This notebook covers:
- More on merging and joining DataFrames
- Plotting with pandas
- Data input/output operations

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os

%matplotlib inline

## Merging and Joining DataFrames

Continuing from Part 36, let's explore more advanced merging and joining operations.

### Appending DataFrames with ignore_index

By default, stacking two DataFrames keeps each input's original index labels, so labels can repeat in the result. Passing `ignore_index=True` discards them and builds a new sequential index starting at 0.

In [None]:
# Create two sample DataFrames
df1 = pd.DataFrame(np.random.randn(6, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(4, 3), columns=['A', 'B', 'C'])

# Display the DataFrames
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

In [None]:
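# Stack df2 below df1 and build a new sequential index with ignore_index=True.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
df = pd.concat([df1, df2], ignore_index=True)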
df
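
To make the effect of `ignore_index` visible, here is a minimal sketch on two tiny frames. The names `top` and `bottom` are illustrative only and are not used elsewhere in this notebook.

```python
import pandas as pd

# Two small frames that both carry the default index 0, 1
top = pd.DataFrame({'A': [1, 2]})
bottom = pd.DataFrame({'A': [3, 4]})

# Without ignore_index the original labels are kept, so they repeat
kept = pd.concat([top, bottom])

# With ignore_index=True the result gets a fresh index 0..n-1
renumbered = pd.concat([top, bottom], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels are not an error, but label-based lookups such as `kept.loc[0]` then return several rows, which is often not what you want.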

### Self Join of a DataFrame

A self join merges a DataFrame with itself. A common use is pairing each row with a related row in the same table, such as the next step within the same group.

In [None]:
# Create a sample DataFrame
df = pd.DataFrame(data={
    'Area': ['A'] * 5 + ['C'] * 2,
    'Bins': [110] * 2 + [160] * 3 + [40] * 2,
    'Test_0': [0, 1, 0, 1, 2, 0, 1],
    'Data': np.random.randn(7)
})
df

In [None]:
# Add a new column Test_1 which is Test_0 - 1
df['Test_1'] = df['Test_0'] - 1
df

In [None]:
# Self join: join the DataFrame with itself
# Join where Test_0 in the left DataFrame equals Test_1 in the right DataFrame
pd.merge(df, df, 
         left_on=['Bins', 'Area', 'Test_0'],
         right_on=['Bins', 'Area', 'Test_1'],
         suffixes=('_L', '_R'))
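
Because the data above is random, the pairing is easier to follow on fixed values. The following sketch uses a hypothetical three-row frame called `steps` with a single grouping key, and applies the same shifted-key idea: each row is matched with the row whose `Test_0` is one higher.

```python
import pandas as pd

# Three consecutive steps in one area, with known values
steps = pd.DataFrame({
    'Area': ['A', 'A', 'A'],
    'Test_0': [0, 1, 2],
    'Data': [10.0, 20.0, 30.0],
})
# Shifted key: a right-hand row with Test_0 == k + 1 has Test_1 == k
steps['Test_1'] = steps['Test_0'] - 1

# Match left Test_0 against right Test_1, so step k pairs with step k + 1
paired = pd.merge(
    steps, steps,
    left_on=['Area', 'Test_0'],
    right_on=['Area', 'Test_1'],
    suffixes=('_L', '_R'),
)

# Step 2 has no successor, so the inner join returns two rows
print(paired[['Area', 'Test_0_L', 'Test_0_R', 'Data_L', 'Data_R']])
```

Columns that exist on both sides and are not shared keys receive the `_L` and `_R` suffixes, which is why `Data` appears as `Data_L` and `Data_R`.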

## Plotting with Pandas

Pandas integrates with Matplotlib to provide plotting capabilities directly from DataFrames and Series.

### Boxplot for Each Quartile of a Stratifying Variable

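`pd.qcut` splits a continuous variable into quantile-based bins that each hold roughly the same number of rows. Grouping a boxplot by those bins then draws one box per quartile.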

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({
    'stratifying_var': np.random.uniform(0, 100, 20),
    'price': np.random.normal(100, 5, 20)
})

# Create quartile categories
df['quartiles'] = pd.qcut(
    df['stratifying_var'],
    4,
    labels=['0-25%', '25-50%', '50-75%', '75-100%']
)

df.head()
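
As a quick check of what `pd.qcut` does, this sketch bins eight known values into quartiles and counts the members of each bin. The series `values` and the labels `Q1` to `Q4` are made up for illustration.

```python
import pandas as pd

# Eight evenly spaced values, so each quartile should receive two of them
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# Bin edges are chosen from the data's quantiles, not from fixed widths
quartiles = pd.qcut(values, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# sort=False reports the counts in category order (Q1 to Q4)
counts = quartiles.value_counts(sort=False)
print(counts)
```

This differs from `pd.cut`, which uses equal-width bins and can therefore produce very uneven group sizes on skewed data.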

In [None]:
# Create a boxplot for each quartile
df.boxplot(column='price', by='quartiles', figsize=(10, 6))
plt.title('Price Distribution by Quartiles of Stratifying Variable')
plt.suptitle('')  # Remove the default suptitle
plt.ylabel('Price')
plt.show()

### Creating a Multi-line Plot

Let's create a multi-line plot to visualize multiple time series.

In [None]:
# Create a DataFrame with multiple time series
dates = pd.date_range('2020-01-01', periods=100)
df = pd.DataFrame({
    'Series1': np.random.randn(100).cumsum(),
    'Series2': np.random.randn(100).cumsum(),
    'Series3': np.random.randn(100).cumsum()
}, index=dates)

df.head()

In [None]:
# Plot all series
ax = df.plot(figsize=(12, 6))
ax.set_title('Multi-line Time Series Plot')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
ax.grid(True)
plt.show()

## Data Input/Output Operations

Pandas provides a wide range of functions for reading from and writing to various file formats.

### Reading and Writing CSV Files

CSV (Comma-Separated Values) is one of the most common file formats for data exchange.

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': np.random.randn(5),
    'B': np.random.randn(5),
    'C': np.random.randn(5),
    'D': np.random.randn(5)
})
df

In [None]:
# Write to CSV
csv_path = 'sample_data.csv'
df.to_csv(csv_path, index=False)
print(f"Data written to {csv_path}")

In [None]:
# Read from CSV
df_read = pd.read_csv(csv_path)
df_read
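
The `index=False` argument matters when the index carries information. The sketch below round-trips a small hypothetical frame named `scores` through an in-memory buffer (`io.StringIO`) instead of a file, once without the index and once with it.

```python
import io
import pandas as pd

# A frame whose index holds meaningful labels
scores = pd.DataFrame({'score': [1.5, 2.5]}, index=['x', 'y'])

# index=False drops the labels; reading back yields a default 0..n-1 index
without_index = pd.read_csv(io.StringIO(scores.to_csv(index=False)))

# The default writes the index as the first column;
# index_col=0 turns that column back into the index when reading
with_index = pd.read_csv(io.StringIO(scores.to_csv()), index_col=0)

print(without_index.index.tolist())  # [0, 1]
print(with_index.index.tolist())     # ['x', 'y']
```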

### Reading CSV Chunk by Chunk

For large files, passing `chunksize` to `read_csv` returns an iterator that yields DataFrames of at most that many rows, so only one chunk needs to be in memory at a time.

In [None]:
# Create a larger DataFrame for demonstration
large_df = pd.DataFrame({
    'A': np.random.randn(1000),
    'B': np.random.randn(1000),
    'C': np.random.randn(1000)
})

# Write to CSV
large_csv_path = 'large_sample_data.csv'
large_df.to_csv(large_csv_path, index=False)
print(f"Large data written to {large_csv_path}")

In [None]:
# Read in chunks
chunk_size = 200
chunks = []

# Read and process each chunk
for chunk in pd.read_csv(large_csv_path, chunksize=chunk_size):
    # Process the chunk (here we're just calculating the mean of column A)
    processed = pd.DataFrame({'A_mean': [chunk['A'].mean()]})
    chunks.append(processed)

# Combine the processed chunks; ignore_index=True avoids every row being labelled 0
result = pd.concat(chunks, ignore_index=True)
result
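
Per-chunk means cannot simply be averaged into an overall mean when the chunks differ in size, which happens whenever the row count is not a multiple of the chunk size. One way around this, sketched here with an in-memory CSV and the hypothetical names `frame`, `total`, and `count`, is to accumulate a running sum and row count.

```python
import io
import numpy as np
import pandas as pd

# Build 1000 rows of sample data and serialise them to an in-memory CSV
rng = np.random.default_rng(0)
frame = pd.DataFrame({'A': rng.normal(size=1000)})
buffer = io.StringIO(frame.to_csv(index=False))

total = 0.0
count = 0
# chunksize=300 yields chunks of 300, 300, 300 and 100 rows
for chunk in pd.read_csv(buffer, chunksize=300):
    total += chunk['A'].sum()
    count += len(chunk)

# The overall mean is recovered from the accumulated sum and count
overall_mean = total / count
print(count, overall_mean)
```

The same pattern works for any statistic that can be rebuilt from partial aggregates, such as counts, sums, minima, and maxima.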

### Reading from and Writing to Excel Files

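Pandas can read from and write to Excel files with `read_excel` and `to_excel`. For `.xlsx` files this relies on an optional engine package such as `openpyxl` being installed.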

In [None]:
# Create a sample DataFrame
df_excel = pd.DataFrame({
    'Name': ['John', 'Jane', 'Bob', 'Alice'],
    'Age': [28, 34, 42, 31],
    'Salary': [50000, 60000, 55000, 65000]
})
df_excel

In [None]:
# Write to Excel
excel_path = 'sample_data.xlsx'
df_excel.to_excel(excel_path, index=False)
print(f"Data written to {excel_path}")

In [None]:
# Read from Excel
df_excel_read = pd.read_excel(excel_path)
df_excel_read

### Reading from Clipboard

Pandas can read data directly from the clipboard, which is useful for quickly importing data that you've copied from another application.

In [None]:
# Note: This requires data to be in the clipboard
# Example usage (commented out as it depends on clipboard content):
# df_clipboard = pd.read_clipboard()
# df_clipboard

### Reading Fixed-Width Files

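Fixed-width files align their columns by padding each field with spaces instead of using a delimiter. By default, `read_fwf` infers the column boundaries from the first rows of the file.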

In [None]:
# Create a sample fixed-width file
fixed_width_data = """
  Name  Age Salary
  John   28  50000
  Jane   34  60000
  Bob    42  55000
  Alice  31  65000
"""

# Write to a file
with open('fixed_width.txt', 'w') as f:
    f.write(fixed_width_data)

# Read the fixed-width file
df_fwf = pd.read_fwf('fixed_width.txt')
df_fwf
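
When inference is not reliable, for example if a field can contain spaces, the column positions can be given explicitly through `colspecs`. This is a minimal sketch on an in-memory sample; the offsets below were counted by hand for this particular text and are an assumption about its layout, not a general rule.

```python
import io
import pandas as pd

# Each column occupies a fixed character range on every line
text = (
    "Name  Age Salary\n"
    "John   28  50000\n"
    "Alice  31  65000\n"
)

# Half-open (start, end) character offsets for Name, Age and Salary
colspecs = [(0, 5), (6, 9), (10, 16)]

people = pd.read_fwf(io.StringIO(text), colspecs=colspecs)
print(people)
```

An alternative is the `widths` parameter, which takes a list of field widths when the columns are contiguous.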

## Conclusion

In this notebook, we've explored:

1. Advanced merging and joining operations in pandas, including:
   - Appending DataFrames with ignore_index
   - Self-joining a DataFrame

2. Plotting capabilities in pandas, including:
   - Creating boxplots for quartiles of a stratifying variable
   - Creating multi-line time series plots

3. Data input/output operations, including:
   - Reading and writing CSV files
   - Reading CSV files chunk by chunk
   - Reading from and writing to Excel files
   - Reading from clipboard
   - Reading fixed-width files

These operations are fundamental for data manipulation and analysis with pandas, allowing you to efficiently work with data from various sources and formats.