#### Pandas Tutorial - Part 39

This notebook covers:
- Working with Stata files
- Advanced data manipulation with pandas.concat
- Creating dummy variables with get_dummies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os

%matplotlib inline

##### Working with Stata Files

Pandas can read and write Stata (`.dta`) files, a format widely used in statistical analysis, particularly in economics and the social sciences.

### Reading Stata Files

The `read_stata()` function reads a Stata file into a pandas DataFrame.
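Since reading requires an existing `.dta` file, the following is a minimal self-contained sketch that first writes a small DataFrame to a temporary Stata file and then reads it back. The file name `example.dta` and the column names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, used only so that there is a file to read
sample = pd.DataFrame({"id": [1, 2, 3], "score": [1.5, 2.5, 3.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.dta")
    sample.to_stata(path, write_index=False)  # create the file

    df_stata = pd.read_stata(path)            # read it back

print(df_stata.head())
print(df_stata.dtypes)
```

The values survive the round trip, but the dtypes may not: Stata has its own set of storage types, so integer columns can come back narrower than the `int64` they started as.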

In [2]:
# Example of reading a Stata file (commented out because it requires a .dta file)
# df_stata = pd.read_stata('filename.dta')
# df_stata.head()

### Key Parameters for read_stata()

The `read_stata()` function offers several parameters to customize how data is read:

In [3]:
# Example with various parameters (commented out because it requires a .dta file)
# df_stata = pd.read_stata(
#     'filename.dta',
#     convert_dates=True,               # Convert Stata date variables to datetime values
#     convert_categoricals=True,        # Apply value labels and return Categorical columns
#     index_col='id',                   # Column to use as the index (must be among `columns`)
#     convert_missing=False,            # False: missing values become NaN
#     preserve_dtypes=True,             # Keep Stata dtypes; False upcasts to int64/float64
#     columns=['id', 'var1', 'var2'],   # Read only these columns
#     order_categoricals=True           # Make the converted categoricals ordered
# )
# df_stata.head()

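A few of these parameters can be tried without an existing file. The sketch below writes a small, made-up DataFrame (the `survey` data and the file name `survey.dta` are assumptions for illustration) and reads it back twice: once with `columns` and `index_col`, and once with `convert_categoricals=False` to see the raw integer codes behind the value labels.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data with a categorical column, which to_stata stores as
# integer codes plus value labels
survey = pd.DataFrame({
    "id": [1, 2, 3],
    "region": pd.Categorical(["north", "south", "north"]),
    "score": [1.5, 2.5, 3.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "survey.dta")
    survey.to_stata(path, write_index=False)

    # Read two columns only and use "id" as the index
    labelled = pd.read_stata(path, columns=["id", "region"], index_col="id")

    # Same columns, but keep the raw integer codes instead of the labels
    raw_codes = pd.read_stata(
        path, columns=["id", "region"], convert_categoricals=False
    )

print(labelled)
print(raw_codes)
```

Note that the column named in `index_col` is included in `columns`; the index is set after the column selection, so it has to be one of the selected columns.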
### Reading Stata Files in Chunks

For large Stata files, pass `chunksize` to read the data a fixed number of rows at a time instead of loading the whole file into memory.

In [4]:
# Example of reading a Stata file in chunks (commented out because it requires a .dta file)
# Read the file 10,000 rows at a time; the reader closes the file when the block exits
# with pd.read_stata('filename.dta', chunksize=10000) as reader:
#     for i, chunk in enumerate(reader):
#         print(f"Processing chunk {i}, shape: {chunk.shape}")
#         # Do something with the chunk, for example calculate summary statistics
#         print(chunk.describe())
#
#         # Only process the first 3 chunks for demonstration
#         if i >= 2:
#             break
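The chunked pattern can also be run end to end on a small temporary file. This is a sketch under assumed data: the 25-row `rows` DataFrame, the file name `rows.dta`, and the chunk size of 10 are chosen only to keep the output short.

```python
import os
import tempfile

import pandas as pd

# 25 hypothetical rows, read back 10 at a time
rows = pd.DataFrame({"x": [i * 0.5 for i in range(25)]})

chunk_shapes = []
running_total = 0.0

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rows.dta")
    rows.to_stata(path, write_index=False)

    # The reader is a context manager, so the file is closed on exit
    with pd.read_stata(path, chunksize=10) as reader:
        for chunk in reader:
            chunk_shapes.append(chunk.shape)
            running_total += chunk["x"].sum()

print(chunk_shapes)
print(running_total)
```

Accumulating a statistic chunk by chunk, as `running_total` does here, keeps only one chunk in memory at a time; the last chunk is simply shorter when the row count is not a multiple of `chunksize`.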

### Writing to Stata Files

You can also write a pandas DataFrame to a Stata file with the `to_stata()` method.

In [5]:
# Create a sample DataFrame
df = pd.DataFrame({
    'id': range(1, 6),
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 75000, 90000, 85000],
    'date': pd.date_range('2020-01-01', periods=5)
})
df

   id     name  age  income       date
0   1    Alice   25   50000 2020-01-01
1   2      Bob   30   60000 2020-01-02
2   3  Charlie   35   75000 2020-01-03
3   4    David   40   90000 2020-01-04
4   5      Eve   45   85000 2020-01-05


In [6]:
# Write to a Stata file (by default the DataFrame index is written too, as an extra column)
stata_file = 'sample.dta'
df.to_stata(stata_file)
print(f"Data written to {stata_file}")

Data written to sample.dta
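`to_stata()` also accepts options that control what ends up in the file. The sketch below uses three of them on made-up data (the `people` DataFrame, the file name, and the label text are assumptions): `write_index`, `convert_dates`, and `variable_labels`. It then reads the file back to check the result.

```python
import os
import tempfile

import pandas as pd

people = pd.DataFrame({
    "id": [1, 2, 3],
    "income": [50000, 60000, 75000],
    "date": pd.date_range("2020-01-01", periods=3),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "people.dta")
    people.to_stata(
        path,
        write_index=False,                # do not add an extra "index" column
        convert_dates={"date": "td"},     # store "date" as a Stata daily date
        variable_labels={"income": "Annual income"},
    )

    restored = pd.read_stata(path)

    # Variable labels are exposed by the reader object, not the DataFrame
    with pd.read_stata(path, iterator=True) as reader:
        labels = reader.variable_labels()

print(restored)
print(labels)
```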


##### Advanced Data Manipulation with pandas.concat

The `pandas.concat()` function is a powerful tool for combining pandas objects along a particular axis with optional set logic along the other axes.

### Basic Concatenation

Let's start with basic examples of concatenating DataFrames.

In [7]:
# Create sample DataFrames
df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

DataFrame 1:
  letter  number
0      a       1
1      b       2

DataFrame 2:
  letter  number
0      c       3
1      d       4


In [8]:
# Concatenate vertically (along axis=0, the default)
result = pd.concat([df1, df2])
result

  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
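The result keeps each input's original row labels, so `0` and `1` appear twice. Two common ways to deal with that are sketched below: `ignore_index=True` renumbers the rows, while `keys` adds an outer index level that records which input each row came from. The key names `"first"` and `"second"` are arbitrary.

```python
import pandas as pd

df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["c", 3], ["d", 4]], columns=["letter", "number"])

# Discard the original row labels and number the rows 0..n-1
renumbered = pd.concat([df1, df2], ignore_index=True)

# Keep the original labels, but add an outer level naming each source
labelled = pd.concat([df1, df2], keys=["first", "second"])

print(renumbered)
print(labelled)
print(labelled.loc["second"])  # rows that came from df2
```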


### Concatenating DataFrames with Different Columns

When concatenating DataFrames with different columns, pandas will include all columns and fill missing values with NaN.

In [9]:
# Create a DataFrame with an additional column
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], columns=['letter', 'number', 'animal'])

print("DataFrame 3:")
print(df3)

DataFrame 3:
  letter  number animal
0      c       3    cat
1      d       4    dog


In [10]:
# Concatenate with different columns
result = pd.concat([df1, df3], sort=False)
result

  letter  number animal
0      a       1    NaN
1      b       2    NaN
0      c       3    cat
1      d       4    dog


### Using the join Parameter

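The `join` parameter controls which columns are kept when the inputs have different columns: `'outer'` (the default) keeps the union of all columns, while `'inner'` keeps only the columns shared by every input.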

In [11]:
# Concatenate with join='inner' to keep only shared columns
result = pd.concat([df1, df3], join="inner")
result

  letter  number
0      a       1
1      b       2
0      c       3
1      d       4


### Concatenating Horizontally

You can concatenate DataFrames horizontally by setting `axis=1`. Rows are matched on their index labels, not on their position.

In [12]:
# Create a new DataFrame
df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']], columns=['animal', 'name'])

print("DataFrame 4:")
print(df4)

DataFrame 4:
   animal    name
0    bird   polly
1  monkey  george


In [13]:
# Concatenate horizontally
result = pd.concat([df1, df4], axis=1)
result

  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george
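The example above lines up neatly because both inputs share the index `0, 1`. When the row labels differ, `axis=1` aligns on the labels, and `join` decides which labels survive. A small sketch with made-up frames `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({"letter": ["a", "b"]}, index=[0, 1])
right = pd.DataFrame({"animal": ["bird", "monkey"]}, index=[1, 2])

# Default join="outer": union of the row labels, gaps filled with NaN
outer = pd.concat([left, right], axis=1)

# join="inner": only the row labels present in both inputs
inner = pd.concat([left, right], axis=1, join="inner")

print(outer)
print(inner)
```

If the intent is to paste frames side by side by position, resetting the index on each input first (for example with `reset_index(drop=True)`) avoids this label alignment.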


### Verifying Integrity

Pass `verify_integrity=True` to make `concat` raise a `ValueError` when the concatenated index would contain duplicate values.

In [14]:
# Create DataFrames with the same index
df5 = pd.DataFrame([1], index=['a'])
df6 = pd.DataFrame([2], index=['a'])

print("DataFrame 5:")
print(df5)
print("\nDataFrame 6:")
print(df6)

DataFrame 5:
   0
a  1

DataFrame 6:
   0
a  2


In [15]:
# Concatenate without verifying integrity
result = pd.concat([df5, df6])
result

   0
a  1
a  2


In [16]:
# Try to concatenate with verify_integrity=True
try:
    result = pd.concat([df5, df6], verify_integrity=True)
except ValueError as e:
    print(f"Error: {e}")

Error: Indexes have overlapping values: Index(['a'], dtype='object')
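When raising an error is too strict, the duplicates can be inspected after the fact instead. The following sketch checks the concatenated index and shows one possible remedy; whether renumbering is appropriate depends on whether the labels carry meaning.

```python
import pandas as pd

df5 = pd.DataFrame([1], index=["a"])
df6 = pd.DataFrame([2], index=["a"])

combined = pd.concat([df5, df6])

# Inspect the index instead of raising an error
has_duplicates = not combined.index.is_unique
duplicated_labels = combined.index[combined.index.duplicated()].unique().tolist()

# One possible fix: discard the labels and renumber the rows
renumbered = pd.concat([df5, df6], ignore_index=True)

print(has_duplicates, duplicated_labels)
print(renumbered.index.is_unique)
```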


##### Creating Dummy Variables with get_dummies

The `pandas.get_dummies()` function is used to convert categorical variables into dummy/indicator variables.

### Basic Usage of get_dummies

In [17]:
# Create a sample DataFrame with categorical variables
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'gender': ['F', 'M', 'M', 'M', 'F'],
    'department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'age': [25, 30, 35, 40, 45]
})
df

      name gender department  age
0    Alice      F         HR   25
1      Bob      M         IT   30
2  Charlie      M    Finance   35
3    David      M         IT   40
4      Eve      F         HR   45


In [18]:
# Convert gender to dummy variables
dummies = pd.get_dummies(df['gender'])
dummies

       F      M
0   True  False
1  False   True
2  False   True
3  False   True
4   True  False
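The indicator columns above are booleans. Some downstream code expects `0`/`1` integers instead, and the `dtype` parameter of `get_dummies` covers that case. A minimal sketch on the same `gender` values:

```python
import pandas as pd

gender = pd.Series(["F", "M", "M", "M", "F"], name="gender")

# Request integer indicators instead of the default dtype
dummies_int = pd.get_dummies(gender, dtype=int)

print(dummies_int)
print(dummies_int.dtypes)
```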


In [19]:
# Convert the selected columns to dummy variables; other columns are left unchanged
dummies_all = pd.get_dummies(df, columns=['gender', 'department'])
dummies_all

      name  age  gender_F  gender_M  department_Finance  department_HR  department_IT
0    Alice   25      True     False               False           True          False
1      Bob   30     False      True               False          False           True
2  Charlie   35     False      True                True          False          False
3    David   40     False      True               False          False           True
4      Eve   45      True     False               False           True          False


### Customizing Dummy Variable Names with Prefix

In [20]:
# Use custom prefixes for dummy variable names
dummies_prefix = pd.get_dummies(df, columns=['gender', 'department'], 
                               prefix=['Sex', 'Dept'])
dummies_prefix

      name  age  Sex_F  Sex_M  Dept_Finance  Dept_HR  Dept_IT
0    Alice   25   True  False         False     True    False
1      Bob   30  False   True         False    False     True
2  Charlie   35  False   True          True    False    False
3    David   40  False   True         False    False     True
4      Eve   45   True  False         False     True    False
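A list of prefixes has to follow the order of `columns`. As an alternative sketch, `prefix` can be a dict keyed by column name, and `prefix_sep` changes the separator between prefix and category. The `staff` frame here is a shortened, made-up version of the data above.

```python
import pandas as pd

staff = pd.DataFrame({
    "gender": ["F", "M", "M"],
    "department": ["HR", "IT", "Finance"],
    "age": [25, 30, 35],
})

# Map each encoded column to its prefix, and join with "." instead of "_"
dummies_named = pd.get_dummies(
    staff,
    columns=["gender", "department"],
    prefix={"gender": "Sex", "department": "Dept"},
    prefix_sep=".",
)

print(dummies_named.columns.tolist())
# ['age', 'Sex.F', 'Sex.M', 'Dept.Finance', 'Dept.HR', 'Dept.IT']
```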


### Dropping the First Category

In [21]:
# Drop the first category to avoid the dummy variable trap
dummies_drop_first = pd.get_dummies(df, columns=['gender', 'department'], 
                                   drop_first=True)
dummies_drop_first

      name  age  gender_M  department_HR  department_IT
0    Alice   25     False           True          False
1      Bob   30      True          False           True
2  Charlie   35      True          False          False
3    David   40      True          False           True
4      Eve   45     False           True          False
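The "dummy variable trap" refers to perfect collinearity: with a full set of dummies, the indicator columns of one variable always sum to 1, which duplicates the intercept column of a regression. The sketch below makes that visible by comparing the column count of a design matrix with its rank. The helper `design_matrix` is a hypothetical function written for this illustration, not part of pandas.

```python
import numpy as np
import pandas as pd

gender = pd.Series(["F", "M", "M", "M", "F"], name="gender")


def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy(dtype=float)])


full = design_matrix(pd.get_dummies(gender))                      # 1, F, M
reduced = design_matrix(pd.get_dummies(gender, drop_first=True))  # 1, M

# F + M equals the intercept column, so the full matrix is rank deficient
print(full.shape[1], np.linalg.matrix_rank(full))
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```

With `drop_first=True` the number of columns matches the rank, and the dropped category (`F` here) becomes the baseline that the remaining indicator is compared against.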


### Handling Missing Values

In [22]:
# Create a DataFrame with missing values
df_missing = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'gender': ['F', 'M', np.nan, 'M', 'F'],
    'department': ['HR', 'IT', 'Finance', np.nan, 'HR'],
    'age': [25, 30, 35, 40, 45]
})
df_missing

      name gender department  age
0    Alice      F         HR   25
1      Bob      M         IT   30
2  Charlie    NaN    Finance   35
3    David      M        NaN   40
4      Eve      F         HR   45


In [23]:
# By default, NaN gets no column of its own: a row with a missing value
# is False in every dummy column for that variable
dummies_missing = pd.get_dummies(df_missing, columns=['gender', 'department'])
dummies_missing

      name  age  gender_F  gender_M  department_Finance  department_HR  department_IT
0    Alice   25      True     False               False           True          False
1      Bob   30     False      True               False          False           True
2  Charlie   35     False     False                True          False          False
3    David   40     False      True               False          False          False
4      Eve   45      True     False               False           True          False


In [24]:
# Add a column for missing values
dummies_missing_na = pd.get_dummies(df_missing, columns=['gender', 'department'], 
                                   dummy_na=True)
dummies_missing_na

      name  age  gender_F  gender_M  gender_nan  department_Finance  department_HR  department_IT  department_nan
0    Alice   25      True     False       False               False           True          False           False
1      Bob   30     False      True       False               False          False           True           False
2  Charlie   35     False     False        True                True          False          False           False
3    David   40     False      True       False               False          False          False            True
4      Eve   45      True     False       False               False           True          False           False
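One quick way to see the difference between the two outputs is to count how many indicators are set in each row. A minimal sketch on the `gender` values alone:

```python
import numpy as np
import pandas as pd

gender = pd.Series(["F", "M", np.nan, "M", "F"], name="gender")

without_na = pd.get_dummies(gender)
with_na = pd.get_dummies(gender, dummy_na=True)

# Number of indicator columns set in each row
print(without_na.sum(axis=1).tolist())  # the NaN row has no indicator set
print(with_na.sum(axis=1).tolist())     # every row has exactly one
```

Without `dummy_na`, a missing value is indistinguishable from "none of the listed categories", which may or may not be what a model should see.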


### Using Sparse Matrices for Efficiency

In [25]:
# Create a DataFrame with many categories
df_large = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], 1000)
})

# Use sparse=True for memory efficiency
dummies_sparse = pd.get_dummies(df_large, sparse=True)

# Compare memory usage
dummies_dense = pd.get_dummies(df_large, sparse=False)

print(f"Sparse dummies memory usage: {dummies_sparse.memory_usage().sum() / 1024:.2f} KB")
print(f"Dense dummies memory usage: {dummies_dense.memory_usage().sum() / 1024:.2f} KB")

Sparse dummies memory usage: 5.01 KB
Dense dummies memory usage: 9.89 KB
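The saving comes from storing only the entries that differ from the fill value. The sketch below looks at the sparse result through the DataFrame `.sparse` accessor. It swaps the random data for a deterministic stand-in (`df_cat`, an assumption made so the numbers are reproducible) with 10 categories of 100 rows each.

```python
import pandas as pd

# Deterministic stand-in for the random data: 10 categories, 100 rows each
df_cat = pd.DataFrame({"category": list("ABCDEFGHIJ") * 100})

dummies_sparse = pd.get_dummies(df_cat, sparse=True)
dummies_dense = pd.get_dummies(df_cat, sparse=False)

# Fraction of stored (non-fill) values across all columns
density = dummies_sparse.sparse.density

# Convert back to ordinary columns when an API needs dense data
round_trip = dummies_sparse.sparse.to_dense()

print(dummies_sparse.dtypes.iloc[0])
print(density)
print((round_trip.to_numpy() == dummies_dense.to_numpy()).all())
```

Each row sets exactly one of the ten indicators, so only about a tenth of the entries need to be stored. The more categories a column has, the lower the density and the larger the potential saving.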


##### Conclusion

In this notebook, we've explored:

1. Working with Stata files, including:
   - Reading Stata files with various options
   - Reading Stata files in chunks
   - Writing pandas DataFrames to Stata files

2. Advanced data manipulation with pandas.concat, including:
   - Basic concatenation
   - Concatenating DataFrames with different columns
   - Using the join parameter
   - Concatenating horizontally
   - Verifying integrity

3. Creating dummy variables with get_dummies, including:
   - Basic usage
   - Customizing dummy variable names with prefix
   - Dropping the first category
   - Handling missing values
   - Using sparse matrices for efficiency

These techniques are essential for data preparation and manipulation in data analysis and machine learning workflows.