# Concatenating and Transforming Data with pandas

Knowing how to **concatenate** and **transform** data is a fundamental skill in data analysis. These operations help restructure, clean, and prepare data so it can be effectively analyzed or consumed by downstream applications.

In this notebook, we demonstrate:
- How to concatenate data from multiple sources
- How to transform data by dropping, adding, and sorting

The examples use **pandas** and **NumPy**, which are standard Python libraries for data manipulation and numerical computing.

In [None]:
import numpy as np
import pandas as pd

from pandas import Series, DataFrame

## Concatenating Data

Concatenation is the process of **combining data from separate sources** into a single data structure.

### Real-world analogy
Imagine sending a direct mail advertisement:
- One table contains **customer ID and name**
- Another table contains **customer ID, mailing address, and age**

If the mailing application requires **only customer name and address**, you would:
1. Combine the two tables by matching rows on customer ID
2. Drop unnecessary columns (such as age)
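
The two steps above can be sketched with made-up data. The table contents and the column names `customer_id`, `name`, `address`, and `age` are assumptions for illustration only. Because the rows are matched on a key column rather than on the index, this sketch uses `merge()` instead of `pd.concat()`.

```python
import pandas as pd

# Hypothetical customer tables; every name and value here is invented.
names = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Ana", "Ben", "Cara"],
})
details = pd.DataFrame({
    "customer_id": [103, 101, 102],
    "address": ["3 Elm St", "1 Oak Ave", "2 Pine Rd"],
    "age": [41, 29, 35],
})

# Step 1: match rows on the shared key column.
# Step 2: drop the column the mailing application does not need.
mailing = names.merge(details, on="customer_id").drop(columns="age")
print(mailing)
```

The row order of `details` does not matter here, since rows are paired by the value of `customer_id`.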

Below, we demonstrate concatenation using pandas.

In [None]:
DF_obj = DataFrame(np.arange(36).reshape(6, 6))
DF_obj

The first DataFrame contains **36 values**, reshaped into **6 rows and 6 columns**, with values ranging from 0 to 35.

In [None]:
DF_obj_2 = DataFrame(np.arange(15).reshape(5, 3))
DF_obj_2

The second DataFrame contains **15 values**, arranged into **5 rows and 3 columns**. Because the two DataFrames differ in shape, concatenation has to fill the positions that have no matching data.

### Concatenation by Row Index (`axis=1`)

In [None]:
pd.concat([DF_obj, DF_obj_2], axis=1)

By specifying `axis=1`, pandas concatenates **column-wise**, joining the DataFrames based on their **row index values**.

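- The result has **9 columns** (6 + 3) and **6 rows**, one for each row label found in either DataFrame
- Where a row label is missing from one DataFrame (here, row 5 in the second), pandas inserts **NaN values**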
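
As a quick check of this behavior, the sketch below rebuilds the two DataFrames under the hypothetical names `left` and `right` and inspects the result. It also shows `join="inner"`, an option of `pd.concat` that keeps only the row labels present in every input.

```python
import numpy as np
import pandas as pd

# Same shapes as the DataFrames above; `left` and `right` are illustrative names.
left = pd.DataFrame(np.arange(36).reshape(6, 6))
right = pd.DataFrame(np.arange(15).reshape(5, 3))

# Default (outer) alignment keeps every row label and fills gaps with NaN.
wide = pd.concat([left, right], axis=1)
print(wide.shape)                      # (6, 9)
print(wide.iloc[5, 6:].isna().all())   # True: row 5 has no data from `right`

# Inner alignment keeps only row labels found in both inputs.
inner = pd.concat([left, right], axis=1, join="inner")
print(inner.shape)                     # (5, 9)
```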

### Concatenation by Column Index (Default)

In [None]:
pd.concat([DF_obj, DF_obj_2])

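When `axis` is omitted (or set to `0`), pandas concatenates **row-wise**, stacking one DataFrame on top of the other and aligning them on their **column labels**. Here the result has **11 rows** (6 + 5); columns 3, 4, and 5 do not exist in the second DataFrame, so its rows hold NaN there.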
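
The following sketch checks the row-wise result on freshly built copies of the two DataFrames. The names `top` and `bottom` are hypothetical and used only for this example.

```python
import numpy as np
import pandas as pd

# Same shapes as the DataFrames above; `top` and `bottom` are illustrative names.
top = pd.DataFrame(np.arange(36).reshape(6, 6))
bottom = pd.DataFrame(np.arange(15).reshape(5, 3))

stacked = pd.concat([top, bottom])     # axis=0 is the default
print(stacked.shape)                   # (11, 6)

# Rows that came from `bottom` have no values for columns 3, 4, and 5.
print(stacked.iloc[6:, 3:].isna().all().all())   # True

# The original row labels are kept, so labels 0 to 4 appear twice.
print(stacked.index.is_unique)         # False
```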

## Transforming Data

Data transformation involves **modifying the structure or content** of data to meet specific requirements.

Common transformations include:
- Dropping rows or columns
- Adding new variables
- Sorting data

We explore each of these below.

### Dropping Data

You can remove unwanted rows or columns using the `drop()` method.

- Rows are dropped by default
- Columns are dropped by specifying `axis=1`

In [None]:
DF_obj.drop([0, 2])

The result omits the rows labeled **0 and 2**. `drop()` matches index labels rather than positions, and it returns a new DataFrame, so `DF_obj` itself is unchanged.

In [None]:
DF_obj.drop([0, 2], axis=1)

By setting `axis=1`, the columns labeled **0 and 2** are removed instead.
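
As an alternative spelling, `drop()` also accepts the `index=` and `columns=` keywords, which some readers find clearer than `axis`. The sketch below uses a hypothetical DataFrame `df` with the same shape as `DF_obj` and confirms that the original is left untouched.

```python
import numpy as np
import pandas as pd

# `df` is an illustrative stand-in with the same shape as DF_obj.
df = pd.DataFrame(np.arange(36).reshape(6, 6))

without_rows = df.drop(index=[0, 2])     # same as df.drop([0, 2])
without_cols = df.drop(columns=[0, 2])   # same as df.drop([0, 2], axis=1)

print(without_rows.index.tolist())     # [1, 3, 4, 5]
print(without_cols.columns.tolist())   # [1, 3, 4, 5]
print(df.shape)                        # (6, 6): the original is unchanged
```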

### Adding Data

New data can be added to a DataFrame by joining or concatenating additional data sources.

Here, we create a new Series and attach it as a column.

In [None]:
series_obj = Series(np.arange(6))
series_obj.name = "added_variable"
series_obj

This Series contains **6 values** (0 to 5). It is explicitly named because `join()` requires a named Series; the name becomes the label of the new column.

In [None]:
variable_added = DF_obj.join(series_obj)
variable_added

The `join()` method adds the Series to the DataFrame by **matching row indices**, resulting in a new column.
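
To see the index matching more clearly, the sketch below joins a Series that covers only some of the row labels. The names `df`, `extra`, and `partial_variable` are hypothetical. With the default left join, every row of the DataFrame is kept and unmatched rows receive NaN.

```python
import numpy as np
import pandas as pd

# `df` is an illustrative stand-in with the same shape as DF_obj.
df = pd.DataFrame(np.arange(36).reshape(6, 6))

# A named Series that only has values for row labels 0, 2, and 4.
extra = pd.Series([10, 20, 30], index=[0, 2, 4], name="partial_variable")

joined = df.join(extra)
print(joined.shape)                            # (6, 7)
print(joined["partial_variable"].isna().sum()) # 3 rows had no match
```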

In [None]:
added_datatable = pd.concat([variable_added, variable_added], ignore_index=False)
added_datatable

When concatenating the DataFrame with itself while keeping `ignore_index=False`, the **original index values are preserved**, resulting in **duplicate indices**.

In [None]:
added_datatable = pd.concat([variable_added, variable_added], ignore_index=True)
added_datatable

By setting `ignore_index=True`, pandas **creates a new integer index**, ensuring unique and sequential row labels.
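
The practical difference between the two settings shows up when selecting by label. The sketch below uses a small hypothetical DataFrame `df` to keep the output short.

```python
import numpy as np
import pandas as pd

# A small illustrative DataFrame; the column names are invented.
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

kept = pd.concat([df, df], ignore_index=False)
reset = pd.concat([df, df], ignore_index=True)

print(kept.index.tolist())    # [0, 1, 2, 0, 1, 2]
print(kept.loc[0].shape)      # (2, 2): label 0 now selects two rows
print(reset.index.tolist())   # [0, 1, 2, 3, 4, 5]
print(reset.index.is_unique)  # True
```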

### Sorting Data

Sorting allows you to reorder rows based on the values in one or more columns.

The `sort_values()` method is used, with:
- `by` specifying the column label (or a list of labels) to sort on
- `ascending` controlling sort order (`True` by default)

In [None]:
DF_sorted = DF_obj.sort_values(by=[5], ascending=[False])
DF_sorted

The DataFrame is sorted in **descending order** by the values in the column labeled **5**.

Each row moves as a unit, so the values in the other columns stay with their original row and index label.
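
Sorting on more than one column works the same way: pass a list to `by` and, optionally, a matching list to `ascending`. The DataFrame `scores` and its column names below are hypothetical.

```python
import pandas as pd

# Illustrative data with repeated values in the first sort column.
scores = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [3, 1, 2, 4],
})

# Sort by group ascending, then by score descending within each group.
ordered = scores.sort_values(by=["group", "score"], ascending=[True, False])
print(ordered)

# The index labels travel with their rows.
print(ordered.index.tolist())   # [3, 1, 0, 2]
```

Like `drop()`, `sort_values()` returns a new DataFrame and leaves `scores` in its original order.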