#### Pandas Tutorial - Part 39

This notebook covers:
- Working with STATA files
- Advanced data manipulation with pandas.concat
- Creating dummy variables with get_dummies

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os

%matplotlib inline

##### Working with STATA Files

Pandas provides functionality to read and write STATA files, which are commonly used in statistical analysis, particularly in economics and social sciences.

### Reading STATA Files

The `read_stata()` function allows you to read STATA files into pandas DataFrames.

In [None]:
# Example of reading a STATA file (commented out as it requires a .dta file)
"""
# Read a STATA file
df_stata = pd.read_stata('filename.dta')
df_stata.head()
"""
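
Since the cell above needs an existing `.dta` file, a self-contained variant may be easier to try. The sketch below writes a tiny DataFrame to a temporary file and reads it back; the file name `roundtrip.dta` and the sample values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, used only for this round trip
sample = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.dta")
    # write_index=False keeps the RangeIndex out of the file
    sample.to_stata(path, write_index=False)
    # Read the file back into a new DataFrame before the directory is removed
    df_roundtrip = pd.read_stata(path)

print(df_roundtrip)
```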

### Key Parameters for read_stata()

The `read_stata()` function offers several parameters to customize how data is read:

In [None]:
# Example with various parameters (commented out as it requires a .dta file)
"""
# Read a STATA file with specific options
df_stata = pd.read_stata(
    'filename.dta',
    convert_dates=True,           # Convert STATA date variables to pandas datetime values
    convert_categoricals=True,    # Use value labels to convert columns to Categorical
    index_col='id',               # Column to set as index (must be among the columns read)
    convert_missing=False,        # False: missing values become NaN; True: keep StataMissingValue objects
    preserve_dtypes=True,         # Keep STATA data types; False upcasts numeric data to pandas defaults
    columns=['id', 'var1', 'var2'],  # Only read specific columns, including the index column
    order_categoricals=True       # Make the converted categorical data ordered
)
df_stata.head()
"""
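
To see two of these parameters in action without an external file, here is a minimal sketch that writes a small, made-up DataFrame to a temporary `.dta` file and reads back a subset. The column names and values are assumptions for the demo.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, written to a temporary .dta file for the demo
people = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [25, 30, 35],
    "income": [50000.0, 60000.0, 75000.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "people.dta")
    people.to_stata(path, write_index=False)

    # Read only two of the three columns and use 'id' as the index.
    # 'id' has to be among the selected columns so index_col can find it.
    subset = pd.read_stata(path, columns=["id", "age"], index_col="id")

print(subset)
```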

### Reading STATA Files in Chunks

For large STATA files, you can read the data in chunks to avoid memory issues.

In [None]:
# Example of reading a STATA file in chunks (commented out as it requires a .dta file)
"""
# Read a STATA file in chunks of 10,000 rows; the with block closes the file when done
with pd.read_stata('filename.dta', chunksize=10000) as itr:
    # Process each chunk
    for i, chunk in enumerate(itr):
        print(f"Processing chunk {i}, shape: {chunk.shape}")
        # Do something with the chunk
        # For example, calculate summary statistics
        print(chunk.describe())

        # Only process the first 3 chunks for demonstration
        if i >= 2:
            break
"""
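
A runnable version of the chunked pattern, as a sketch: it creates a ten-row temporary file (made-up data) and accumulates a running total chunk by chunk, so that only one chunk is in memory at a time. The chunk size of 4 is chosen only to make the chunk boundaries visible.

```python
import os
import tempfile

import pandas as pd

chunk_sizes = []
total = 0

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunked.dta")
    # Hypothetical data: ten rows holding the values 0..9
    pd.DataFrame({"value": range(10)}).to_stata(path, write_index=False)

    # The reader yields DataFrames of at most `chunksize` rows
    with pd.read_stata(path, chunksize=4) as reader:
        for chunk in reader:
            chunk_sizes.append(len(chunk))
            total += int(chunk["value"].sum())

print(chunk_sizes)  # [4, 4, 2]
print(total)        # 45
```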

### Writing to STATA Files

You can also write pandas DataFrames to STATA files using the `to_stata()` method.

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({
    'id': range(1, 6),
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 75000, 90000, 85000],
    'date': pd.date_range('2020-01-01', periods=5)
})
df

In [None]:
# Write to STATA file
stata_file = 'sample.dta'
df.to_stata(stata_file)
print(f"Data written to {stata_file}")
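
One detail worth knowing: `to_stata()` writes the DataFrame index as an extra column by default. The sketch below (temporary files and made-up values) compares the default with `write_index=False` by reading both files back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data
frame = pd.DataFrame({"age": [25, 30], "income": [50000, 60000]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.dta")
    without_index = os.path.join(tmp_dir, "without_index.dta")

    frame.to_stata(with_index)                        # default: write_index=True
    frame.to_stata(without_index, write_index=False)  # leave the index out

    cols_with_index = list(pd.read_stata(with_index).columns)
    cols_without_index = list(pd.read_stata(without_index).columns)

print(cols_with_index)     # ['index', 'age', 'income']
print(cols_without_index)  # ['age', 'income']
```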

##### Advanced Data Manipulation with pandas.concat

The `pandas.concat()` function is a powerful tool for combining pandas objects along a particular axis with optional set logic along the other axes.

### Basic Concatenation

Let's start with basic examples of concatenating DataFrames.

In [None]:
# Create sample DataFrames
df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

In [None]:
# Concatenate vertically (along axis=0, the default)
result = pd.concat([df1, df2])
result
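
Note that plain concatenation keeps each input's original index labels, so the result above has repeated labels. A small sketch with stand-in frames (`top` and `bottom` are illustrative names) shows how `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

top = pd.DataFrame({"letter": ["a", "b"], "number": [1, 2]})
bottom = pd.DataFrame({"letter": ["c", "d"], "number": [3, 4]})

# Default: the original labels of each frame are kept
stacked = pd.concat([top, bottom])
# ignore_index=True discards them and builds a fresh 0..n-1 index
renumbered = pd.concat([top, bottom], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```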

### Concatenating DataFrames with Different Columns

When the DataFrames have different columns, pandas keeps the union of all columns by default (`join='outer'`) and fills the cells that have no data with NaN.

In [None]:
# Create a DataFrame with an additional column
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], columns=['letter', 'number', 'animal'])

print("DataFrame 3:")
print(df3)

In [None]:
# Concatenate with different columns
result = pd.concat([df1, df3], sort=False)
result
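
To make the NaN filling explicit, the following sketch (with small stand-in frames named `left` and `right`) checks which cells of the extra column end up missing.

```python
import pandas as pd

left = pd.DataFrame({"letter": ["a", "b"], "number": [1, 2]})
right = pd.DataFrame({"letter": ["c"], "number": [3], "animal": ["cat"]})

combined = pd.concat([left, right], ignore_index=True)

# Rows that came from `left` have no 'animal' value, so they hold NaN
animal_missing = combined["animal"].isna().tolist()

print(list(combined.columns))  # ['letter', 'number', 'animal']
print(animal_missing)          # [True, True, False]
```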

### Using the join Parameter

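The `join` parameter controls which columns survive when the inputs do not share the same columns: `join='outer'` (the default) keeps the union of all columns, while `join='inner'` keeps only the columns present in every input.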

In [None]:
# Concatenate with join='inner' to keep only shared columns
result = pd.concat([df1, df3], join="inner")
result
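
A side-by-side comparison of the two `join` modes, as a minimal sketch with stand-in frames (`narrow` and `wide` are illustrative names):

```python
import pandas as pd

narrow = pd.DataFrame({"letter": ["a"], "number": [1]})
wide = pd.DataFrame({"letter": ["c"], "number": [3], "animal": ["cat"]})

# 'outer' keeps every column that appears in any input
outer_cols = list(pd.concat([narrow, wide], join="outer").columns)
# 'inner' keeps only the columns shared by all inputs
inner_cols = list(pd.concat([narrow, wide], join="inner").columns)

print(outer_cols)  # ['letter', 'number', 'animal']
print(inner_cols)  # ['letter', 'number']
```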

### Concatenating Horizontally

You can concatenate DataFrames horizontally by setting `axis=1`. Rows are matched by index label rather than by position.

In [None]:
# Create a new DataFrame
df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']], columns=['animal', 'name'])

print("DataFrame 4:")
print(df4)

In [None]:
# Concatenate horizontally
result = pd.concat([df1, df4], axis=1)
result
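
The example above works cleanly because both frames share the index `0, 1`. When the indexes differ, horizontal concatenation aligns on the labels, as this sketch with made-up labels illustrates.

```python
import pandas as pd

numbers = pd.DataFrame({"number": [1, 2]}, index=["x", "y"])
animals = pd.DataFrame({"animal": ["bird", "monkey"]}, index=["y", "z"])

# Outer join on the index: labels present in only one frame get NaN
side_by_side = pd.concat([numbers, animals], axis=1)
# Inner join on the index: only labels present in both frames remain
overlap_only = pd.concat([numbers, animals], axis=1, join="inner")

print(side_by_side)
print(overlap_only)
```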

### Verifying Integrity

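With `verify_integrity=True`, `concat` raises a `ValueError` if the concatenated axis would contain duplicate labels. The check is off by default because it can be expensive relative to the concatenation itself.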

In [None]:
# Create DataFrames with the same index
df5 = pd.DataFrame([1], index=['a'])
df6 = pd.DataFrame([2], index=['a'])

print("DataFrame 5:")
print(df5)
print("\nDataFrame 6:")
print(df6)

In [None]:
# Concatenate without verifying integrity
result = pd.concat([df5, df6])
result

In [None]:
# Try to concatenate with verify_integrity=True
try:
    result = pd.concat([df5, df6], verify_integrity=True)
except ValueError as e:
    print(f"Error: {e}")
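
If duplicate labels are a problem, two common ways around the error are to tag each input with `keys` or to renumber with `ignore_index=True`. A minimal sketch with stand-in frames (`first` and `second` are illustrative names):

```python
import pandas as pd

first = pd.DataFrame({"value": [1]}, index=["a"])
second = pd.DataFrame({"value": [2]}, index=["a"])

# keys adds an outer index level, so the combined labels are unique
labelled = pd.concat([first, second], keys=["first", "second"],
                     verify_integrity=True)
# ignore_index replaces the labels with 0..n-1
renumbered = pd.concat([first, second], ignore_index=True,
                       verify_integrity=True)

print(labelled.index.tolist())    # [('first', 'a'), ('second', 'a')]
print(renumbered.index.tolist())  # [0, 1]
```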

##### Creating Dummy Variables with get_dummies

The `pandas.get_dummies()` function is used to convert categorical variables into dummy/indicator variables.

### Basic Usage of get_dummies

In [None]:
# Create a sample DataFrame with categorical variables
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'gender': ['F', 'M', 'M', 'M', 'F'],
    'department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'age': [25, 30, 35, 40, 45]
})
df

In [None]:
# Convert gender to dummy variables
dummies = pd.get_dummies(df['gender'])
dummies

In [None]:
# Convert all categorical columns to dummy variables
dummies_all = pd.get_dummies(df, columns=['gender', 'department'])
dummies_all
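
Depending on the pandas version, the indicator columns come back as booleans (pandas 2.0 and later) or as small integers (earlier versions). When 0/1 integers are wanted explicitly, the `dtype` parameter can be passed, as in this sketch with a made-up Series.

```python
import pandas as pd

gender = pd.Series(["F", "M", "M"], name="gender")

# dtype=int requests 0/1 integer indicators regardless of the default
indicators = pd.get_dummies(gender, dtype=int)

print(indicators)
```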

### Customizing Dummy Variable Names with Prefix

In [None]:
# Use custom prefixes for dummy variable names
dummies_prefix = pd.get_dummies(df, columns=['gender', 'department'], 
                               prefix=['Sex', 'Dept'])
dummies_prefix
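
The prefix can also be given as a dict keyed by column name, and `prefix_sep` changes the separator between prefix and category. The sketch below uses a small stand-in frame; the `:` separator is an arbitrary choice for illustration.

```python
import pandas as pd

staff = pd.DataFrame({
    "gender": ["F", "M"],
    "department": ["HR", "IT"],
})

# A dict maps each encoded column to its own prefix
encoded = pd.get_dummies(
    staff,
    columns=["gender", "department"],
    prefix={"gender": "Sex", "department": "Dept"},
    prefix_sep=":",
)

print(list(encoded.columns))  # ['Sex:F', 'Sex:M', 'Dept:HR', 'Dept:IT']
```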

### Dropping the First Category

In [None]:
# Drop the first category to avoid the dummy variable trap
dummies_drop_first = pd.get_dummies(df, columns=['gender', 'department'], 
                                   drop_first=True)
dummies_drop_first
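
With `drop_first=True`, the first category in sorted order becomes the implicit baseline: a row belonging to it has zeros in all remaining indicator columns. A short sketch with made-up data:

```python
import pandas as pd

staff = pd.DataFrame({"department": ["HR", "IT", "Finance", "IT"]})

full = pd.get_dummies(staff, dtype=int)
reduced = pd.get_dummies(staff, drop_first=True, dtype=int)

print(list(full.columns))     # ['department_Finance', 'department_HR', 'department_IT']
print(list(reduced.columns))  # ['department_HR', 'department_IT']
# The 'Finance' row (position 2) is encoded as all zeros
print(reduced.iloc[2].tolist())  # [0, 0]
```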

### Handling Missing Values

In [None]:
# Create a DataFrame with missing values
df_missing = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'gender': ['F', 'M', np.nan, 'M', 'F'],
    'department': ['HR', 'IT', 'Finance', np.nan, 'HR'],
    'age': [25, 30, 35, 40, 45]
})
df_missing

In [None]:
# By default, NaN gets no indicator column, so a row with a missing value is all zeros (False) for that variable
dummies_missing = pd.get_dummies(df_missing, columns=['gender', 'department'])
dummies_missing

In [None]:
# Add a column for missing values
dummies_missing_na = pd.get_dummies(df_missing, columns=['gender', 'department'], 
                                   dummy_na=True)
dummies_missing_na
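
The difference between the two calls is easiest to see from the row sums: without `dummy_na`, a missing value leaves its row with no indicator set; with `dummy_na=True`, every row has exactly one. A minimal sketch on a made-up single-column frame:

```python
import numpy as np
import pandas as pd

survey = pd.DataFrame({"gender": ["F", "M", np.nan]})

default = pd.get_dummies(survey, dtype=int)
with_na = pd.get_dummies(survey, dummy_na=True, dtype=int)

# Number of indicators set per row
default_row_sums = default.sum(axis=1).tolist()
with_na_row_sums = with_na.sum(axis=1).tolist()

print(default_row_sums)  # [1, 1, 0]
print(with_na_row_sums)  # [1, 1, 1]
```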

### Using Sparse Matrices for Efficiency

In [None]:
# Create a DataFrame with many categories
df_large = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], 1000)
})

# Use sparse=True for memory efficiency
dummies_sparse = pd.get_dummies(df_large, sparse=True)

# Compare memory usage
dummies_dense = pd.get_dummies(df_large, sparse=False)

print(f"Sparse dummies memory usage: {dummies_sparse.memory_usage().sum() / 1024:.2f} KB")
print(f"Dense dummies memory usage: {dummies_dense.memory_usage().sum() / 1024:.2f} KB")
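
To confirm that `sparse=True` really produced sparse columns, one can inspect the dtypes and the density (the share of stored, non-fill values). The sketch below uses deterministic made-up data, ten categories repeated evenly, so the density is predictable.

```python
import pandas as pd

# Ten categories, each appearing exactly 100 times
categories = pd.DataFrame({"category": list("ABCDEFGHIJ") * 100})

sparse_dummies = pd.get_dummies(categories, sparse=True)

# Every indicator column should carry a SparseDtype
all_sparse = all(isinstance(dtype, pd.SparseDtype)
                 for dtype in sparse_dummies.dtypes)
# Each row sets one of ten indicators, so about 10% of values are stored
density = sparse_dummies.sparse.density
# Converting back gives an ordinary dense DataFrame
dense_again = sparse_dummies.sparse.to_dense()

print(all_sparse)
print(round(density, 3))
```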

##### Conclusion

In this notebook, we've explored:

1. Working with STATA files, including:
   - Reading STATA files with various options
   - Reading STATA files in chunks
   - Writing pandas DataFrames to STATA files

2. Advanced data manipulation with pandas.concat, including:
   - Basic concatenation
   - Concatenating DataFrames with different columns
   - Using the join parameter
   - Concatenating horizontally
   - Verifying integrity

3. Creating dummy variables with get_dummies, including:
   - Basic usage
   - Customizing dummy variable names with prefix
   - Dropping the first category
   - Handling missing values
   - Using sparse matrices for efficiency

Together, these techniques cover common data-preparation steps in analysis and machine learning workflows: exchanging data with STATA, combining tables, and encoding categorical variables as indicator columns.