# Reshaping data in Python

## 1. Reshaping data in NumPy
NumPy offers a number of convenient reshaping commands, and manipulating an object's shape is where NumPy really shines. NumPy handles data with many dimensions well, whereas pandas becomes clunkier as more hierarchical index levels are added. However, NumPy is limited in the content of its objects: all elements of a NumPy array must have the same data type.

### 1.1 Transpose
Let the matrix `A` be defined as follows.

In [None]:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
print(A)
print(A.shape)

In [None]:
A.T

In [None]:
A.transpose()

In [None]:
np.transpose(A)

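The `np.transpose()` function (and the equivalent `.transpose()` method) is more flexible than `.T` because it accepts an optional `axes` argument that specifies the new order of the axes. This lets you transpose along chosen dimensions of a higher-dimensional array.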
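As a minimal sketch of transposing along chosen dimensions, the example below uses a hypothetical three-dimensional array `X` (not part of the examples above) and the `axes` argument of `np.transpose()`.

```python
import numpy as np

# Hypothetical 3-D example array with shape (2, 3, 4)
X = np.arange(24).reshape((2, 3, 4))

# With no axes argument, transposing reverses the order of the axes
print(X.T.shape)                      # (4, 3, 2)

# With an axes argument we choose the new order explicitly:
# keep axis 0 in place and swap axes 1 and 2
Y = np.transpose(X, axes=(0, 2, 1))
print(Y.shape)                        # (2, 4, 3)

# Element [i, j, k] of Y is element [i, k, j] of X
print(Y[1, 3, 2] == X[1, 2, 3])       # True
```

Each two-dimensional slice `Y[i]` is simply the ordinary transpose of `X[i]`.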

### 1.2 np.reshape()
Sometimes you want to change an array's shape. The `reshape` command is very flexible. The key concept to keep in mind is that NumPy arrays are row-major by default (as in the C programming language). When flattening or reshaping an array, NumPy reads the elements of the first row column by column, then proceeds to the second row, and follows the same pattern along any remaining dimensions.

In [None]:
A.reshape((1, 6))

In [None]:
A.reshape((6, 1))

In [None]:
print(A.reshape((1, 2, 3)))
print(A.reshape((1, 2, 3)).shape)

In [None]:
A.reshape((1, 2, 3))[0, :, :]

In [None]:
A.reshape((1, 2, 3))[0, :, 0]

In [None]:
A.reshape((1, 2, 3))[0, :, 1]
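To make the row-major convention concrete, the sketch below compares the default `order='C'` with `order='F'` (column-major, as in Fortran). The array `M` is a stand-in with the same values as `A`, redefined so that the cell runs on its own.

```python
import numpy as np

M = np.array([[1, 2, 3], [4, 5, 6]])

# Default order='C' (row-major): elements are read across each row in turn
print(M.reshape((3, 2)))
# [[1 2]
#  [3 4]
#  [5 6]]

# order='F' (column-major): elements are read, and written, down each column
print(M.reshape((3, 2), order='F'))
# [[1 5]
#  [4 3]
#  [2 6]]
```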

### 1.3 Vectorize: ravel() and flatten()
It is often valuable to vectorize (make one-dimensional) a multi-dimensional array. A common reason is that working with a one-dimensional array can be simpler and, in some cases, faster than working with a higher-dimensional one. The two main tools for this are `ravel()` and `flatten()`, which do almost the same thing. `ravel()` (available both as the function `np.ravel()` and as an array method) returns a view of the original array whenever possible, so changing values in the raveled vector may change elements of the original array. `flatten()` (an array method only; there is no `np.flatten()` function) always returns a copy, so changes to the flattened vector never affect the original array.

In [None]:
A.flatten()

In [None]:
A.ravel()
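The view-versus-copy difference can be checked directly. This is a small sketch on a stand-in array `M` (same values as `A`), so that the original `A` is left untouched.

```python
import numpy as np

M = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = M.flatten()   # always a new copy
flat_view = M.ravel()     # a view here, because M is contiguous in memory

flat_copy[0] = 100        # does not touch M
print(M[0, 0])            # 1

flat_view[0] = -1         # writes through to M
print(M[0, 0])            # -1

# np.shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(M, flat_view), np.shares_memory(M, flat_copy))  # True False
```

Note that `ravel()` falls back to returning a copy when a view is not possible, for example on some non-contiguous arrays, which is why the text says it *may* change the original.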

### 1.4 Stacking
The stacking commands are very easy to use in NumPy. The main commands are `np.hstack()`, `np.vstack()`, and `np.tile()`.

The function `np.hstack()` stacks arrays horizontally. For two-dimensional arrays, each array must have the same size in its first dimension (the same number of rows). For example, define the arrays `A`, `B`, and `C` in the following way.

In [None]:
A = np.array([[1, 2, 3], [4, 5, 6]])
A

In [None]:
B = np.array([-1, -2, -3])
B

In [None]:
C = np.array([[10, 20], [30, 40], [50, 60]])
C

The following won't work because the arrays are not conformable: `A` has shape (2, 3), while `B` is one-dimensional with shape (3,).

In [None]:
np.hstack((A, B))

But the following will work.

In [None]:
np.hstack((C, A.T))

However, the following will not work because transposing a one-dimensional vector has no effect: `B.T` still has shape (3,). To turn `B` into a column, we have to use the `.reshape()` method.

In [None]:
np.hstack((C, A.T, B.T))

In [None]:
B.T.shape

In [None]:
np.hstack((C, A.T, B.reshape((3, 1))))
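The shape bookkeeping behind these `np.hstack()` calls can be sketched as follows. The names `A2`, `B1`, and `C2` are stand-ins with the same values as `A`, `B`, and `C`, and indexing with `np.newaxis` is shown as one alternative to `.reshape((3, 1))`.

```python
import numpy as np

A2 = np.array([[1, 2, 3], [4, 5, 6]])          # shape (2, 3)
B1 = np.array([-1, -2, -3])                    # shape (3,)
C2 = np.array([[10, 20], [30, 40], [50, 60]])  # shape (3, 2)

# Transposing a 1-D array leaves its shape unchanged
print(B1.T.shape)                              # (3,)

# Mixing 2-D and 1-D inputs raises a ValueError
try:
    np.hstack((C2, A2.T, B1.T))
except ValueError as err:
    print("hstack failed:", err)

# Adding a new axis turns the vector into a (3, 1) column
col = B1[:, np.newaxis]
stacked = np.hstack((C2, A2.T, col))
print(stacked.shape)                           # (3, 5)
```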

The function `np.vstack()` stacks arrays vertically. Each array must have the same shape in its second dimension (across columns). We use the same arrays A, B, and C as in the preceding `np.hstack()` example.

In [None]:
A

In [None]:
B

In [None]:
C

Note that the following `np.vstack()` command works with the one-dimensional vector `B`.

In [None]:
np.vstack((A, B))

We could vertically stack all three arrays in the following way.

In [None]:
np.vstack((A, B, C.T))

Finally, `np.tile()` is a useful function for building a larger array by repeating an array of any dimension a given number of times along each axis. This is a very flexible function.

The following command takes a one-dimensional (3,) vector, reshapes it into a 1 x 1 x 3 array, and repeats it twice along the first axis and three times along the second, giving a 2 x 3 x 3 array.

In [None]:
F = np.array([1, 2, 3])
G = np.tile(F.reshape((1, 1, 3)), (2, 3, 1))
G

In [None]:
G[:, :, 0]

In [None]:
G[:, :, 1]
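As a quick check on how the repetition counts determine the result, this sketch repeats the same construction with stand-in names (`F1`, `base`, `G3`) and compares the output shape with the input shape multiplied by the counts.

```python
import numpy as np

F1 = np.array([1, 2, 3])
base = F1.reshape((1, 1, 3))           # shape (1, 1, 3)

# reps=(2, 3, 1): repeat twice along axis 0, three times along axis 1,
# and once (no repetition) along axis 2
reps = (2, 3, 1)
G3 = np.tile(base, reps)
print(G3.shape)                        # (2, 3, 3)

# Each output dimension is the input dimension times its repetition count
print(tuple(s * r for s, r in zip(base.shape, reps)))   # (2, 3, 3)

# Every slice along the last axis is constant
print(G3[:, :, 2])
# [[3 3 3]
#  [3 3 3]]
```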

## 2. Reshaping data in Pandas

### 2.1. Combining and merging data sets
We focus on three functions.
* `pandas.merge`: connects rows in DataFrames based on one or more keys. This is pandas' way of implementing database *join* operations, and it is the main entry point for using these algorithms on your data.
* `pandas.concat`: stacks objects together along an axis.
* `combine_first`: an instance method that splices together overlapping data, filling in missing values in one object with values from another.
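Since `pandas.concat` and `combine_first` are not demonstrated elsewhere in this section, here is a minimal sketch of both. All Series below are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series used only for this illustration
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

# pd.concat stacks along axis 0 (rows) by default
rows = pd.concat([s1, s2])
print(rows.shape)          # (5,)

# With axis=1 the inputs become columns, aligned on the union of the indices
cols = pd.concat([s1, s2], axis=1)
print(cols.shape)          # (5, 2)

# combine_first keeps the caller's values and fills its missing entries
# from the other object, aligning on the index
left = pd.Series([np.nan, 2.5, np.nan], index=['a', 'b', 'c'])
right = pd.Series([1.0, 9.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])
combined = left.combine_first(right)
print(combined.tolist())   # [1.0, 2.5, 3.0, 4.0]
```

The value 2.5 at index `b` survives because `combine_first` only draws on `right` where `left` is missing or has no entry at all.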

#### 2.1.1. pd.merge()

In [None]:
import pandas as pd
from pandas import DataFrame, Series

df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)}) 
df1

In [None]:
df2 = DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
df2

The following is a *many-to-one* merge. `df1` has multiple values for `a` (2, 4, 5) while `df2` has only one value for `a` (0). Likewise, `df1` has multiple values for `b` (0, 1, 6) while `df2` has only one value for `b` (1).

In [None]:
pd.merge(df1, df2)

Two things to note.
* If you don't specify which column to join on, `merge` uses the overlapping column names as the keys--in this case, `key`.
* This command leaves out the `key` of `d` from `df2` and the `key` of `c` from `df1`. This is because these `key` values occur in only one, not both, of the DataFrames being merged.

It is good practice to specify the column on which to merge.

In [None]:
pd.merge(df1, df2, on='key')

If the column names are different in each object, you can specify them separately.

In [None]:
df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df3

In [None]:
df4 = DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
df4

In [None]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Note again that this only includes values for which the key is found in both datasets. This is called an `inner` join, which is the default. Other options for the `how` argument are `left`, `right`, and `outer`. The `inner` join takes the intersection of the keys, and the `outer` join takes the union of the keys.

In [None]:
pd.merge(df1, df2, how='outer')
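To compare the four join types side by side, the sketch below rebuilds the two DataFrames under the stand-in names `left` and `right` (same values as `df1` and `df2`) and reports the number of rows and the keys that survive each merge. It also uses the `indicator=True` option, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='key', how=how)
    print(how, len(merged), sorted(merged['key'].unique()))
# inner 6 ['a', 'b']
# left 7 ['a', 'b', 'c']
# right 7 ['a', 'b', 'd']
# outer 8 ['a', 'b', 'c', 'd']

# indicator=True labels each row as 'both', 'left_only', or 'right_only'
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(outer[['key', '_merge']].drop_duplicates())
```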

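*Many-to-many* merges form the Cartesian product of the matching rows: for each key value, every row with that key in the left DataFrame is paired with every row with that key in the right DataFrame.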

In [None]:
df5 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df5

In [None]:
df6 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
df6

In [None]:
pd.merge(df5, df6, on='key', how='left')

Notice the following:
* Because it was a `left` merge, it included all the values in `df5` but not all of the values in `df6`.
* Since there were 3 `b` rows in `df5` and 2 `b` rows in `df6`, the merged DataFrame has 6 `b` rows.
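The row counts above can be verified with a short sketch. The names `left` and `right` are stand-ins with the same values as `df5` and `df6`.

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
right = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

merged = pd.merge(left, right, on='key', how='left')

# Rows per key in the merged result:
# 'a' has 2 x 2 = 4 rows, 'b' has 3 x 2 = 6 rows,
# and 'c' keeps its single unmatched row (with a missing data2 value)
print(merged['key'].value_counts().sort_index().to_dict())
# {'a': 4, 'b': 6, 'c': 1}

# 'd' appears only in the right DataFrame, so a left merge drops it
print('d' in set(merged['key']))   # False
```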

To merge with multiple keys, pass a list of column names.

In [None]:
df7 = DataFrame({'key1': ['foo', 'foo', 'bar'],
                 'key2': ['one', 'two', 'one'],
                 'lval': [1, 2, 3]})
df7

In [None]:
df8 = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                 'key2': ['one', 'one', 'one', 'two'],
                 'rval': [4, 5, 6, 7]})
df8

In [None]:
pd.merge(df7, df8, on=['key1', 'key2'], how='outer')

When merging data, you often have to deal with column names that overlap. By default, the merge command appends the suffixes `_x` and `_y` to columns that share a name but are not merge keys. Alternatively, you can use the `suffixes` option to specify the suffix labels yourself.

In [None]:
pd.merge(df7, df8, on='key1')

In [None]:
pd.merge(df7, df8, on='key1', suffixes=('_left', '_right'))

Another common practice is to merge using a DataFrame's index as the merge key. You can pass `left_index=True`, `right_index=True`, or both to indicate that the index should be used as the key.

In [None]:
df9 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
df9

In [None]:
df10 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
df10

In [None]:
pd.merge(df9, df10, left_on='key', right_index=True)

You can use the indices on both DataFrames of the merge as the merge keys, and you can deal with hierarchical key merges.
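As a minimal sketch of using the index on both sides, the example below merges two small hypothetical DataFrames (the names, columns, and values are made up for illustration) with `left_index=True` and `right_index=True`.

```python
import pandas as pd

# Hypothetical DataFrames whose indices partially overlap
left = pd.DataFrame({'Ohio': [1, 3, 5]}, index=['a', 'c', 'e'])
right = pd.DataFrame({'Missouri': [7, 9, 11]}, index=['b', 'c', 'e'])

# Both indices serve as merge keys; how='outer' keeps the union of labels
both = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(both)

# Labels found on only one side get missing values in the other column
print(both.loc['a', 'Missouri'])   # nan
print(both.loc['c'].tolist())      # [3.0, 9.0]
```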

#### 2.1.2. DataFrame.join()
The `join()` method of a DataFrame is a more convenient way to merge by index than `pd.merge`. (It is an instance method; there is no top-level `pd.join()` function.) It can also be used to combine many DataFrame objects that have the same or similar indices but non-overlapping columns.

In [None]:
df11 = DataFrame([[1, 2], [3, 4], [5, 6]], index=['a', 'b', 'c'], columns=['Ohio', 'Nevada'])
df11

In [None]:
df12 = DataFrame([[7, 8], [9, 10], [11, 12], [13, 14]],
                 index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])
df12

In [None]:
df11.join(df12, how='outer')

The following command joins a column of the calling DataFrame (`key`) with the index of the passed DataFrame.

In [None]:
df9.join(df10, on='key')
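One way to read `join(..., on='key')` is as a left merge of the caller's column against the other DataFrame's index. The sketch below checks that reading on stand-in DataFrames `obs` and `groups` (same values as `df9` and `df10`).

```python
import pandas as pd

obs = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
groups = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

joined = obs.join(groups, on='key')
merged = pd.merge(obs, groups, left_on='key', right_index=True, how='left')

# join() defaults to a left join, so both results keep all six rows
print(joined.equals(merged))               # True

# The row with key 'c' has no match in groups and gets a missing value
print(joined['group_val'].isna().sum())    # 1
```

This also explains a difference from the earlier `pd.merge(df9, df10, left_on='key', right_index=True)` call: `pd.merge` defaults to an inner join and drops the `c` row, whereas `join()` keeps it.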

### 2.2. Reshaping, pivoting, and groupby

#### 2.2.1. Reshaping with hierarchical indexing

In [None]:
data = DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'),
                 columns=pd.Index(['one', 'two', 'three'], name='number'))
data

The `stack()` method on this data pivots the columns into the rows, producing a hierarchically indexed Series.

In [None]:
result = data.stack()
result

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with `unstack()`.

In [None]:
result.unstack()

By default, the innermost level is unstacked (same with `stack`). You can unstack a different level by passing a level number or name.

In [None]:
result.unstack(0)

In [None]:
result.unstack('state')
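The relationship between `stack()` and `unstack()` can be summarized in a short, self-contained sketch. The names `frame` and `stacked` are stand-ins for `data` and `result`, rebuilt here so that the cell runs on its own.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=pd.Index(['Ohio', 'Colorado'], name='state'),
                     columns=pd.Index(['one', 'two', 'three'], name='number'))

stacked = frame.stack()

# The stacked Series has a two-level index: (state, number)
print(list(stacked.index.names))           # ['state', 'number']
print(stacked[('Colorado', 'two')])        # 4

# Unstacking the innermost level ('number') recovers the original layout
print(stacked.unstack().shape)             # (2, 3)

# Unstacking the outer level ('state') gives the transposed layout
print(stacked.unstack('state').shape)      # (3, 2)
```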

#### 2.2.2. Pivoting between "long" and "wide"
Data are often stored in "long" format, in which each row is a single observation of one characteristic for a particular key, so the same key appears in multiple rows. "Wide" format reshapes the data so that each characteristic observed for a key becomes a separate column, leaving only one row for each unique value of the key.
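As a minimal sketch of moving between the two formats, the example below builds a hypothetical long-format DataFrame (the column names `state`, `number`, and `value` are chosen for illustration), converts it to wide format with `pivot()`, and converts it back with `melt()`.

```python
import pandas as pd

# Hypothetical long-format data: one row per (state, number) observation
long_df = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Ohio', 'Colorado', 'Colorado', 'Colorado'],
    'number': ['one', 'two', 'three', 'one', 'two', 'three'],
    'value': [0, 1, 2, 3, 4, 5],
})

# Long -> wide: one row per state, one column per value of 'number'
wide_df = long_df.pivot(index='state', columns='number', values='value')
print(wide_df.shape)                    # (2, 3)
print(wide_df.loc['Ohio', 'two'])       # 1

# Wide -> long: melt the 'number' columns back into rows
back = wide_df.reset_index().melt(id_vars='state',
                                  var_name='number',
                                  value_name='value')
print(len(back))                        # 6
```

Note that `pivot()` requires each (index, columns) pair to be unique; when duplicates exist, `pivot_table()` with an aggregation function is the usual alternative.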

### 2.3. Data transformation

## References

* McKinney, Wes, Python for Data Analysis, O'Reilly Media, Inc. (2013).
* [Python labs](http://www.acme.byu.edu/?page_id=2067), Applied and Computational Mathematics Emphasis (ACME), Brigham Young University.