# Concat

Concatenation combines multiple DataFrames by stacking them vertically (adding rows) or horizontally (adding columns). In pandas this is done with `pd.concat`. It is useful when data with a shared structure is split across several files or sources.

## Setup data

Let's load a dataset of accidents that we can split apart and recombine to demonstrate concatenation:

In [None]:
import pandas as pd

# Load the main accident data
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")

print(f"Loaded {len(accident_list)} total accidents")
# Parse the dates first: min/max on raw strings compares text, not calendar order
dates = pd.to_datetime(accident_list['date'])
print(f"Date range: {dates.min()} to {dates.max()}")

## Vertical concatenation (combining rows)

Let's split our data and then recombine it to demonstrate vertical concatenation:

In [None]:
# Split data by state for demonstration
california_accidents = accident_list[accident_list["state"] == "CA"]
texas_accidents = accident_list[accident_list["state"] == "TX"]
florida_accidents = accident_list[accident_list["state"] == "FL"]

print(f"California accidents: {len(california_accidents)}")
print(f"Texas accidents: {len(texas_accidents)}")
print(f"Florida accidents: {len(florida_accidents)}")

In [None]:
# Concatenate the state datasets back together
combined_states = pd.concat([california_accidents, texas_accidents, florida_accidents])

print(f"Combined dataset: {len(combined_states)} accidents")
print("\nState breakdown:")
print(combined_states["state"].value_counts())

## Adding row labels during concatenation

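Pass `keys` to label each row with the DataFrame it came from. The keys become the outer level of a hierarchical index (a `MultiIndex`), and the original row labels become the inner level: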

In [None]:
# Concatenate with keys to identify source
labeled_concat = pd.concat(
    [california_accidents, texas_accidents, florida_accidents],
    keys=["CA", "TX", "FL"]
)

print(f"Labeled concatenation shape: {labeled_concat.shape}")
print("\nIndex structure:")
print(labeled_concat.index.names)
print(labeled_concat.head())
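
Once the keys are in the index, they can be used to pull one source back out or be turned into an ordinary column. The sketch below uses two small made-up DataFrames (`ca` and `tx`, with invented values) rather than the accident data, and the level names `source` and `row` are chosen only for illustration:

```python
import pandas as pd

# Toy stand-ins for the per-state DataFrames (values are invented)
ca = pd.DataFrame({"state": ["CA", "CA"], "total_fatalities": [1, 0]}, index=[10, 11])
tx = pd.DataFrame({"state": ["TX"], "total_fatalities": [2]}, index=[20])

# names= labels the two index levels: the key and the original row label
labeled = pd.concat([ca, tx], keys=["CA", "TX"], names=["source", "row"])

# Selecting on the outer level returns only the rows that came from that source
ca_rows = labeled.loc["CA"]
print(ca_rows)

# Moving the key into a regular column makes it easy to filter or group on
flat = labeled.reset_index(level="source")
print(flat)
```

Keeping the key in the index is handy for quick lookups, while a regular column tends to be easier to work with in later grouping or exporting steps.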

## Horizontal concatenation (combining columns)

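Horizontal concatenation (`axis=1`) places DataFrames side by side and matches rows by index label. Let's create additional columns that share the index of `accident_list`: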

In [None]:
# Create summary statistics for each accident
accident_summary = pd.DataFrame({
    'total_people': accident_list['total_fatalities'] + 
                   accident_list['total_serious_injuries'] + 
                   accident_list['total_minor_injuries'],
    'severity_score': accident_list['total_fatalities'] * 3 + 
                     accident_list['total_serious_injuries'] * 2 + 
                     accident_list['total_minor_injuries'] * 1
}, index=accident_list.index)

print("Summary statistics:")
accident_summary.head()

In [None]:
# Concatenate horizontally to add new columns
enhanced_data = pd.concat([accident_list, accident_summary], axis=1)

print(f"Original columns: {len(accident_list.columns)}")
print(f"Enhanced columns: {len(enhanced_data.columns)}")
print(f"New columns added: {list(accident_summary.columns)}")
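
Because `axis=1` lines rows up by index label rather than by position, it helps to see what happens when the two indexes do not match exactly. This is a minimal sketch with invented values, where `right` is out of order and has no row for label `1`:

```python
import pandas as pd

left = pd.DataFrame({"total_fatalities": [1, 0, 2]}, index=[0, 1, 2])

# Deliberately out of order, and missing label 1
right = pd.DataFrame({"severity_score": [6, 3]}, index=[2, 0])

wide = pd.concat([left, right], axis=1)
print(wide)

# Rows are paired by label: label 2 gets the score 6, label 0 gets 3,
# and label 1 has no partner, so its severity_score is NaN
print(wide.loc[2, "severity_score"])
print(wide["severity_score"].isna().sum())
```

If the row counts before and after a horizontal concatenation differ, or unexpected NaN values appear, mismatched index labels are a likely cause.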

## Handling missing data during concatenation

When DataFrames have different columns, `pd.concat` keeps the union of all columns by default (`join="outer"`) and fills the cells a DataFrame did not supply with NaN:

In [None]:
# Create DataFrames with different columns
basic_info = accident_list[['accident_number', 'date', 'state']].head(3)
detailed_info = accident_list[['accident_number', 'location', 'total_fatalities']].head(3)

print("Basic info columns:", list(basic_info.columns))
print("Detailed info columns:", list(detailed_info.columns))

# Concatenate DataFrames with different columns
mixed_concat = pd.concat([basic_info, detailed_info])
print("\nConcatenated result:")
print(mixed_concat)
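
If the NaN-filled columns are not wanted, `join="inner"` keeps only the columns that every DataFrame shares. The sketch below compares the two modes on small made-up DataFrames (`basic` and `detailed` here are toy stand-ins with invented values, not the slices created above):

```python
import pandas as pd

basic = pd.DataFrame({"accident_number": ["A1", "A2"], "state": ["CA", "TX"]})
detailed = pd.DataFrame({"accident_number": ["A3"], "total_fatalities": [2]})

# Default (outer): union of columns, gaps filled with NaN
outer = pd.concat([basic, detailed], ignore_index=True)
print(outer)

# Inner: only the columns present in every DataFrame survive
inner = pd.concat([basic, detailed], join="inner", ignore_index=True)
print(inner)

# NaN cannot be stored in an integer column, so the dtype changes to float
print(outer["total_fatalities"].dtype)
```

The outer join preserves every value at the cost of gaps and possible dtype changes, while the inner join gives a clean table but silently drops the unshared columns.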

## Ignoring index during concatenation

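By default, `pd.concat` keeps each DataFrame's original index labels, so the result can contain gaps or repeated labels. Pass `ignore_index=True` to discard them and number the rows from 0: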

In [None]:
# Concatenate and reset index
reset_index_concat = pd.concat(
    [california_accidents.head(2), texas_accidents.head(2)], 
    ignore_index=True
)

print("Concatenation with reset index:")
print(reset_index_concat[['accident_number', 'state', 'location']])
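
To see why resetting matters, the sketch below stacks two made-up DataFrames that both use the default labels 0 and 1. It also shows `verify_integrity=True`, a `pd.concat` option that raises an error instead of quietly producing repeated labels:

```python
import pandas as pd

ca = pd.DataFrame({"state": ["CA", "CA"]})  # index labels 0, 1
tx = pd.DataFrame({"state": ["TX", "TX"]})  # index labels 0, 1

# Without ignore_index the labels repeat: 0, 1, 0, 1
stacked = pd.concat([ca, tx])
print(stacked.index.has_duplicates)

# A label lookup now matches one row from each source
print(stacked.loc[0])

# ignore_index=True renumbers the rows 0..3
renumbered = pd.concat([ca, tx], ignore_index=True)
print(renumbered)

# verify_integrity=True refuses to build an index with repeated labels
try:
    pd.concat([ca, tx], verify_integrity=True)
    raised = False
except ValueError as error:
    raised = True
    print(f"Refused: {error}")
```

Use `ignore_index=True` when the old labels carry no meaning, and `verify_integrity=True` when they do and duplicates would indicate a problem.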

## Practical example: Combining multiple CSV files

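Data often arrives as one file per period or per source. The commented lines in the cell show the usual pattern for reading and stacking such files. The runnable code simulates it by splitting the accident data by year and recombining it: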

In [None]:
# Simulate reading multiple files and combining them
# In practice, you might do something like:
# files = ['data2020.csv', 'data2021.csv', 'data2022.csv']
# dataframes = [pd.read_csv(f) for f in files]
# combined = pd.concat(dataframes, ignore_index=True)

# For demonstration, split by year and recombine
accident_list['date'] = pd.to_datetime(accident_list['date'])
accident_list['year'] = accident_list['date'].dt.year

year_2015 = accident_list[accident_list['year'] == 2015]
year_2016 = accident_list[accident_list['year'] == 2016]
year_2017 = accident_list[accident_list['year'] == 2017]

# Combine years with source tracking
multi_year = pd.concat(
    [year_2015, year_2016, year_2017],
    keys=['2015', '2016', '2017'],
    names=['source_year', 'original_index']
)

print(f"Combined multi-year data: {len(multi_year)} accidents")
print("\nAccidents by source year:")
print(multi_year.groupby(level=0).size())
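
The commented-out file pattern above can be sketched end to end. Everything here is hypothetical: the file names `data2020.csv` and `data2021.csv`, their contents, and the `source_file` column are invented for illustration, and the files are written to a temporary folder so the example runs on its own:

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Write two small made-up files to stand in for real yearly exports
    pd.DataFrame({"accident_number": ["A1", "A2"]}).to_csv(folder / "data2020.csv", index=False)
    pd.DataFrame({"accident_number": ["B1"]}).to_csv(folder / "data2021.csv", index=False)

    frames = []
    # sorted() makes the read order predictable
    for path in sorted(folder.glob("data*.csv")):
        frame = pd.read_csv(path)
        # Record which file each row came from before the rows are mixed together
        frame["source_file"] = path.name
        frames.append(frame)

# One concat call over the whole list, rather than concatenating inside the loop
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the DataFrames in a list and calling `pd.concat` once avoids copying the growing result on every iteration. A plain `source_file` column is one alternative to `keys` for tracking where each row came from.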

Concatenation is essential when working with data from multiple sources or files. `pd.concat` stacks rows or columns, fills in NaN where structures do not match, and offers `keys` to track where each row came from and `ignore_index` to renumber the result.