# Pandas Concatenation

`pd.concat` concatenates a list of `DataFrame` or `Series` objects along either rows (`axis=0`) or columns (`axis=1`).

In [None]:
import numpy as np
import pandas as pd

X = pd.DataFrame(np.r_[:9].reshape(3,3), columns='A B C'.split(), index='x y z'.split())
X

In [None]:
Y = pd.DataFrame(np.r_[:900:100].reshape(3,3), columns='A C F'.split(), index='w x y'.split())
Y

If you concatenate across rows, Pandas tries to align the columns, filling in missing values (`NaN` / `None`) where it can't.

In [None]:
Z = pd.concat([X, Y], sort=False)
Z

Likewise, concatenating across columns tries to align the index.

In [None]:
Z = pd.concat([X, Y], axis=1, sort=False)
Z
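As a rough, self-contained sketch of the alignment described above, the block below rebuilds frames with the same labels as `X` and `Y` and compares the shapes of the two results. The `join="inner"` call is an addition not used in the cells above: it is the `pd.concat` option that keeps only the labels shared by every input instead of padding with missing values.

```python
import numpy as np
import pandas as pd

# Same labels and values as X and Y above, rebuilt so this block runs on its own.
X = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("ABC"), index=list("xyz"))
Y = pd.DataFrame(np.arange(0, 900, 100).reshape(3, 3), columns=list("ACF"), index=list("wxy"))

rows = pd.concat([X, Y], sort=False)           # stack rows, align on column labels
cols = pd.concat([X, Y], axis=1, sort=False)   # stack columns, align on index labels
shared = pd.concat([X, Y], join="inner")       # keep only the columns both frames have

print(rows.shape, cols.shape, shared.shape)
print(rows.isna().sum())                       # missing values per column after row-wise concat
```

Note that the column-wise result carries duplicate column names (`A` and `C` appear twice), because `concat` does not rename anything for you.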

Concatenation *will* copy the underlying data, so a later change to `X` does not show up in `Z`:

In [None]:
X.loc['x', 'A'] = 232

In [None]:
Z

In [None]:
X
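The copy behaviour can also be checked programmatically rather than by eye. This is a minimal sketch that rebuilds the same frames and compares one cell of the concatenated result after the input has been modified.

```python
import numpy as np
import pandas as pd

# Rebuild the inputs so the block is independent of earlier cells.
X = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("ABC"), index=list("xyz"))
Y = pd.DataFrame(np.arange(0, 900, 100).reshape(3, 3), columns=list("ACF"), index=list("wxy"))

Z = pd.concat([X, Y], axis=1, sort=False)

# Z has two columns named 'A', so pick the first cell of row 'x' by position.
value_before = Z.loc["x"].iloc[0]

X.loc["x", "A"] = 232                  # modify the input after concatenating
value_after = Z.loc["x"].iloc[0]

print(X.loc["x", "A"], value_before, value_after)
```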

## Dealing with Scikit-Learn datasets

When dealing with Scikit-Learn datasets, the target column is provided as a separate entry. If we want to store the whole dataset as one object, we need to concatenate it:

In [None]:
from sklearn import datasets

iris = datasets.load_iris()

In [None]:
type(iris)

In [None]:
data = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.Series(iris.target, name='Species')

In [None]:
data.head()

In [None]:
target.head()

In [None]:
# There is an error here -- can you guess what it is before executing?

df_iris = pd.concat([data, target])
df_iris.head()


In [None]:
df_iris.tail()

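With the default `axis=0`, the target `Series` is stacked below the feature rows instead of becoming a new column, so the result has 300 rows, each of them partly filled with `NaN`. Concatenating along `axis=1` aligns the two objects on their shared index instead: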



In [None]:
df_iris = pd.concat([data, target], axis=1)
df_iris.head()
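The difference between the two calls is easy to see on a tiny stand-in for the iris data. The frame and label values below are made up for illustration only; the point is the shapes.

```python
import pandas as pd

# Toy stand-ins for `data` and `target` (values are illustrative, not real iris rows).
features = pd.DataFrame({"sepal_length": [5.1, 4.9, 6.3],
                         "sepal_width": [3.5, 3.0, 3.3]})
labels = pd.Series([0, 0, 2], name="Species")

stacked = pd.concat([features, labels])               # default axis=0: labels become extra rows
side_by_side = pd.concat([features, labels], axis=1)  # labels become an extra column

print(stacked.shape)
print(side_by_side.shape)
print(side_by_side.columns.tolist())
```

The row-wise version doubles the row count and leaves every row partly empty, while the column-wise version keeps one row per sample and uses the `Series` name as the new column label.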

# Merging

Although you can use `concat` to do "joins" (especially on the index), I usually use `pd.merge` for that purpose.

In [None]:
sales = pd.read_csv('./data/kaggle-sales/sales_train.csv.gz', parse_dates=['date'])
sales.head()

In [None]:
items = pd.read_csv('./data/kaggle-sales/items.csv.gz')
items.head()

In [None]:
categories = pd.read_csv('./data/kaggle-sales/item_categories.csv.gz')
categories.head()

We can merge in the item data to sales first...

In [None]:
data = pd.merge(sales, items)  # merge on common column names
# data = pd.merge(sales, items, on='item_id')
# data = pd.merge(sales, items, left_on='item_id', right_on='item_id')
# data = pd.merge(sales, items, left_on='item_id', right_index=True)  # if items has index

data.head()
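Since the CSV files may not be at hand, here is a small sketch of the same kind of merge on toy frames. Apart from `item_id`, the column names and all values are assumptions made for illustration. It also shows two optional `pd.merge` arguments not used above: `how="left"` keeps every sales row even if an item is missing, and `validate="many_to_one"` raises an error if `item_id` is not unique in the right-hand frame.

```python
import pandas as pd

# Toy tables; `item_cnt_day` and `item_category_id` are assumed column names.
sales = pd.DataFrame({"item_id": [1, 2, 2, 3], "item_cnt_day": [1, 2, 1, 5]})
items = pd.DataFrame({"item_id": [1, 2, 3], "item_category_id": [10, 10, 20]})

# Implicit: join on every column name the two frames share (here just item_id).
implicit = pd.merge(sales, items)

# Explicit: name the key, keep all sales rows, and check the key is unique in `items`.
explicit = pd.merge(sales, items, on="item_id", how="left", validate="many_to_one")

print(implicit.columns.tolist())
print(len(sales), len(implicit), len(explicit))
```

Naming the key explicitly is a reasonable default once tables grow, because an unexpected shared column name would otherwise silently become part of the join key.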

... and then merge in the categories to get our 'fully-flattened' data

In [None]:
data = pd.merge(data, categories)
data.head()

Now, we can answer questions like "which categories had the most/fewest transactions?"

In [None]:
# The top-level pd.value_counts is deprecated in recent pandas; use the Series method
data['item_category_name'].value_counts()
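The whole flatten-then-count pipeline can be sketched end to end on toy tables. The category names and the `item_category_id` key are assumed here for illustration; only `item_id` and `item_category_name` appear in the cells above.

```python
import pandas as pd

# Toy versions of the three tables (all values are illustrative).
sales = pd.DataFrame({"item_id": [1, 2, 2, 3]})
items = pd.DataFrame({"item_id": [1, 2, 3], "item_category_id": [10, 10, 20]})
categories = pd.DataFrame({"item_category_id": [10, 20],
                           "item_category_name": ["Books", "Music"]})

# Two merges give one row per transaction with its category name attached.
flat = sales.merge(items, on="item_id").merge(categories, on="item_category_id")

# value_counts sorts descending, so the first entry is the busiest category.
counts = flat["item_category_name"].value_counts()

print(counts)
print(counts.idxmax(), counts.idxmin())
```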

In [None]:
data.info()

Open the [Pandas merging lab][pandas-merging-lab]

[pandas-merging-lab]: ./pandas-merging-lab.ipynb