# Data Wrangling: join, combine, reshape

This chapter covers tools for combining, joining, and rearranging data so that it is suitable for analysis.

## Index

* [Hierarchical Indexing](#hierarchical-indexing)
    * [Reordering and Sorting Levels](#reordering-and-sorting-levels)
    * [Summary Statistics by Level](#summary-statistics-by-level)
    * [Indexing with DataFrame's Columns](#indexing-with-dataframes-columns)
* [Combining and Merging Datasets](#combining-and-merging-datasets)
    * [Database-Style DataFrame Joins](#database-style-dataframe-joins)
        * [Table: Pandas 'merge' arguments](#pandas-merge-function-arguments)
    * [Merging on Index](#merging-on-index)
        * ['join' method](#join-method)
    * [Concatenating Along an Axis](#concatenating-along-an-axis)
    * [Combining Data with Overlap](#combining-data-with-overlap)
* [Reshaping and Pivoting](#reshaping-and-pivoting)
    * [Reshaping with Hierarchical Indexing](#reshaping-with-hierachical-indexing)
    * [Pivoting 'Long' to 'Wide' Format](#pivoting-long-to-wide-format)
        * [pivot method](#pivot-method)
    * [Pivoting 'Wide' to 'Long' Format](#pivoting-wide-to-long-format)
    

In [1]:
import numpy as np 
import pandas as pd 
import warnings

warnings.filterwarnings("ignore")

## Hierarchical Indexing

Hierarchical indexing lets you have multiple index levels on an axis. It also provides a way to work with higher-dimensional data in a lower-dimensional form.

In [2]:
## Series with two index levels

data = pd.Series(np.random.uniform(size=9), 
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

print(data)

a  1    0.346664
   2    0.140604
   3    0.150648
b  1    0.091401
   3    0.003765
c  1    0.754212
   2    0.645723
d  2    0.331423
   3    0.369343
dtype: float64


In [3]:
print(data.index)

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )


In [4]:
print(f"Data 'b' index: \n{data['b']}")
print(f"\nData 'b' to 'c' index: \n{data['b':'c']}")
print(f"\nData 'b' and 'd' index: \n{data.loc[['b', 'd']]}")
print(f"\nData all value with 2 in second index: \n{data.loc[:, 2]}")
print(f"\nData '1' and '3' from int index: \n{data.iloc[[1, 3]]}")

Data 'b' index: 
1    0.091401
3    0.003765
dtype: float64

Data 'b' to 'c' index: 
b  1    0.091401
   3    0.003765
c  1    0.754212
   2    0.645723
dtype: float64

Data 'b' and 'd' index: 
b  1    0.091401
   3    0.003765
d  2    0.331423
   3    0.369343
dtype: float64

Data all value with 2 in second index: 
a    0.140604
c    0.645723
d    0.331423
dtype: float64

Data '1' and '3' from int index: 
a  2    0.140604
b  1    0.091401
dtype: float64


We can use the `unstack()` method to rearrange this Series into a DataFrame.

In [5]:
data = data.astype('Float64')
print(f"Unstack data Series into a DataFrame: \n{data.unstack()}")
print(f"\nStacking the data again: \n{data.unstack().stack()}")

Unstack data Series into a DataFrame: 
          1         2         3
a  0.346664  0.140604  0.150648
b  0.091401      <NA>  0.003765
c  0.754212  0.645723      <NA>
d      <NA>  0.331423  0.369343

Stacking the data again: 
a  1    0.346664
   2    0.140604
   3    0.150648
b  1    0.091401
   3    0.003765
c  1    0.754212
   2    0.645723
d  2    0.331423
   3    0.369343
dtype: Float64


With a DataFrame, either axis can have a hierarchical index.

In [6]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['z', 'z', 'x'], ['red', 'lime', 'red']])

print(f"DataFrame with two index levels on each axis: \n{frame}")

DataFrame with two index levels on each axis: 
      z        x
    red lime red
a 1   0    1   2
  2   3    4   5
b 1   6    7   8
  2   9   10  11


In [7]:
# Naming the hierarchical levels:

frame.index.names = ['key1', 'key2']
frame.columns.names = ['omega', 'color']

print(f"DataFrame with names in its axis levels: \n{frame}")
print(f"\nDataFrame 'z' columns': \n{frame['z']}")
print(f"\nDataFrame index 1: \n{frame.iloc[1]}")
print(f"\nDataFrame 'a' and 'x' row: \n{frame.loc['a','x']}")


DataFrame with names in its axis levels: 
omega       z        x
color     red lime red
key1 key2             
a    1      0    1   2
     2      3    4   5
b    1      6    7   8
     2      9   10  11

DataFrame 'z' columns': 
color      red  lime
key1 key2           
a    1       0     1
     2       3     4
b    1       6     7
     2       9    10

DataFrame index 1: 
omega  color
z      red      3
       lime     4
x      red      5
Name: (a, 2), dtype: int32

DataFrame 'a' and 'x' row: 
color  red
key2      
1        2
2        5


Creating a MultiIndex:
```python
pd.MultiIndex.from_arrays([["Ohio", "Ohio", "Colorado"],
                           ["Green", "Red", "Green"]],
                           names=["state", "color"])
```
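
As a rough sketch of how such an index can be put to use, the MultiIndex can be passed directly as the columns of a DataFrame. The data and the names `columns` and `frame_mi` are illustrative assumptions, not part of the original notes:

```python
import numpy as np
import pandas as pd

# Build the column MultiIndex from two parallel arrays (one per level).
columns = pd.MultiIndex.from_arrays(
    [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
    names=["state", "color"],
)

# Illustrative data: 2 rows x 3 columns holding the values 0..5.
frame_mi = pd.DataFrame(np.arange(6).reshape((2, 3)),
                        index=["a", "b"], columns=columns)

print(frame_mi)
print(frame_mi["Ohio"])          # select by the outer column level
print(frame_mi.columns.nlevels)  # number of levels in the column index
```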

### Reordering and Sorting Levels

The `swaplevel` method takes two level numbers or names and returns a new
object with those levels interchanged.

`sort_index()` sorts the data by the index levels. Data selection performance is better
when the index is sorted starting with the outermost level: `sort_index(level=0)`.

In [8]:
print(f"Swap level and sort 'key2':"
      f"\n{frame.swaplevel('key1', 'key2').sort_index(level=0)}")

print(f"\nSwap level 0 and 1, filter '1' rows: "
      f"\n{frame.swaplevel('key1', 'key2').loc[1]}")

Swap level and sort 'key2':
omega       z        x
color     red lime red
key2 key1             
1    a      0    1   2
     b      6    7   8
2    a      3    4   5
     b      9   10  11

Swap level 0 and 1, filter '1' rows: 
omega   z        x
color red lime red
key1              
a       0    1   2
b       6    7   8
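
To make the remark about sorted indexes more concrete, here is a minimal sketch that checks whether a MultiIndex is sorted before and after `sort_index(level=0)`. The frame `unsorted` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative frame whose outer index level is deliberately out of order.
unsorted = pd.DataFrame(
    {"value": np.arange(4)},
    index=pd.MultiIndex.from_arrays(
        [["b", "a", "b", "a"], [2, 1, 1, 2]], names=["key1", "key2"]
    ),
)

# Sort starting with the outermost level (remaining levels are sorted too).
sorted_frame = unsorted.sort_index(level=0)

print(unsorted.index.is_monotonic_increasing)      # False
print(sorted_frame.index.is_monotonic_increasing)  # True

# Label slicing on the outer level works on the sorted index.
print(sorted_frame.loc["a":"a"])
```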


### Summary Statistics by Level

DataFrames and Series can be aggregated by index level: pass `level` to `groupby()` on either axis and then apply a descriptive or summary statistic.

In [9]:
frame.groupby(level='key2').sum()


omega   z        x
color red lime red
key2              
1       6    8  10
2      12   14  16


In [10]:
frame.groupby(level='color', axis="columns").sum()

color      lime  red
key1 key2           
a    1        1    2
     2        4    8
b    1        7   14
     2       10   20
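
Recent pandas releases deprecate the `axis` argument of `groupby` (this notebook silences warnings, so the message is not visible here). A possible equivalent, sketched under that assumption, is to transpose, group on the row level, and transpose back. The frame is rebuilt so the snippet runs on its own, and the name `by_color` is illustrative:

```python
import numpy as np
import pandas as pd

# Same structure as the hierarchically indexed frame used above.
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["z", "z", "x"], ["red", "lime", "red"]], names=["omega", "color"]
    ),
)

# Transpose so the column levels become row levels, group, transpose back.
by_color = frame.T.groupby(level="color").sum().T

print(by_color)
```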


### Indexing with DataFrame's Columns


In [11]:
frame = pd.DataFrame({"a": range(7), "b": range(7, 0, -1),
                      "c": ["one","one","one", "two","two","two","two"],
                      "d": [0, 1, 2, 0, 1, 2, 3]})
print(frame)

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3


We can use one or more columns of the previous DataFrame as the row index with
`set_index()`, and move the row index back into the DataFrame's columns with `reset_index()`.

By default, `set_index()` removes the selected columns from the DataFrame, but you can keep them with `drop=False`.
`reset_index()` does the opposite of `set_index()`: the index levels are moved back into columns.


In [12]:
frame2 = frame.set_index(["c", "d"], drop=False)
print(f"\nFrame2 Maintaining columns:\n{frame2}")

frame2 = frame.set_index(["c", "d"])
print(f"\nDefault set_index: \n{frame2}")

frame2 = frame2.reset_index()
print(f"\nReset index from Frame2:\n{frame2}")



Frame2 Maintaining columns:
       a  b    c  d
c   d              
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3

Default set_index: 
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1

Reset index from Frame2:
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1


## Combining and Merging Datasets

|Method|Description|
|---|---|
|`pandas.merge`|Connect rows in DataFrames based on one or more keys. Like `JOIN` in SQL.|
|`pandas.concat`|Concatenate or "stack" objects together along an axis|
|`combine_first`|Splice together overlapping data to fill in missing values in one object with values from another.|

### Database-Style DataFrame Joins

To combine datasets we can use `merge` or `join` operations, which link rows
using one or more keys. The example below is a *many-to-one* join: `df_one` has
several rows per key, while each key appears only once in `df_two`. By default,
`merge()` uses the overlapping column names as the keys, but it is good
practice to specify them explicitly with `on="key"`:

In [13]:
df_one = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                       "data1": pd.Series(range(7), dtype="Int64")})

df_two = pd.DataFrame({"key": ["a", "b", "d"],
                       "data2": pd.Series(range(3), dtype="Int64")})

print(f"DataFrame one: \n{df_one}")
print(f"\nDataFrame two: \n{df_two}")

# Merging dataframes
df_merge = pd.merge(df_one, df_two, on="key")
print(f"\nDataFrame after 'merge': \n{df_merge}")

DataFrame one: 
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6

DataFrame two: 
  key  data2
0   a      0
1   b      1
2   d      2

DataFrame after 'merge': 
  key  data1  data2
0   b      0      1
1   b      1      1
2   a      2      0
3   a      4      0
4   a      5      0
5   b      6      1


*Pandas `merge()` options for the `how` argument:*

|Option|Behavior|
|---|---|
|how="inner" |Use only the key combinations observed in both tables|
|how="left" |Use all key combinations found in the left table|
|how="right" |Use all key combinations found in the right table|
|how="outer" |Use all key combinations observed in both tables together|

The `how="outer"` option takes the union of the keys, combining
the effect of applying both left and right joins. By default `merge` does
an "inner" join, which keeps the intersection, or common set, of the keys
found in both tables.

If the key column names differ between the two DataFrames, we can specify them
separately with `left_on="lkey"` and `right_on="rkey"`.

In [14]:
# Merging with outer-join
df_merge_outer = pd.merge(df_one, df_two, how="outer")

print(f"\nOuter join of DF_one and DF_two: \n{df_merge_outer}")

df_three = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"],
                         "data1": pd.Series(range(7), dtype="Int64")})

df_four = pd.DataFrame({"rkey": ["a", "b", "d"],
                        "data2": pd.Series(range(3), dtype="Int64")})

df_merge = pd.merge(df_three, df_four, 
                    left_on="lkey", right_on="rkey",
                    how="outer")

print(f"\nOuter join (df_three + df_four) with different key column names:"
      f"\n{df_merge}")


Outer join of DF_one and DF_two: 
  key  data1  data2
0   a      2      0
1   a      4      0
2   a      5      0
3   b      0      1
4   b      1      1
5   b      6      1
6   c      3   <NA>
7   d   <NA>      2

Outer join (df_three + df_four) with different key column names:
  lkey  data1 rkey  data2
0    a      2    a      0
1    a      4    a      0
2    a      5    a      0
3    b      0    b      1
4    b      1    b      1
5    b      6    b      1
6    c      3  NaN   <NA>
7  NaN   <NA>    d      2


#### *Pandas `merge` function arguments*

|Argument| Description|
|---:|---|
|left| DataFrame to be merged on the left side.|
|right| DataFrame to be merged on the right side.|
|how| Type of join to apply: one of "inner", "outer", "left", or "right"; defaults to "inner"|
|on| Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names in left and right as the join keys.|
|left_on| Columns in left DataFrame to use as join keys. Can be a single column name or a list of column names.|
|right_on| Analogous to left_on for right DataFrame.|
|left_index| Use row index in left as its join key (or keys, if a MultiIndex).|
|right_index| Analogous to left_index.|
|sort| Sort merged data lexicographically by join keys; False by default.|
|suffixes| Tuple of string values to append to column names in case of overlap; defaults to ("_x", "_y") (e.g., if "data" in both DataFrame objects, would appear as "data_x" and "data_y" in result).|
|copy| If False, avoid copying data into resulting data structure in some exceptional cases; by default always copies.|
|validate| Verifies if the merge is of the specified type, whether one-to-one, one-to-many, or many-to-many. See the docstring for full details on the options.|
|indicator| Adds a special column _merge that indicates the source of each row; values will be "left_only", "right_only", or "both" based on the origin of the joined data in each row.|
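
The `indicator` and `validate` arguments are not exercised elsewhere in these notes, so here is a small sketch of both. The frames `left_df`, `right_df`, and `dup_right` and their values are hypothetical, chosen only to trigger each behavior:

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["a", "b", "c"], "data1": [1, 2, 3]})
right_df = pd.DataFrame({"key": ["a", "b", "d"], "data2": [10, 20, 30]})

# indicator=True adds a "_merge" column telling where each row came from.
merged = pd.merge(left_df, right_df, on="key", how="outer", indicator=True)
print(merged)

# validate raises MergeError when the keys do not match the declared
# relationship; the right keys are duplicated here, so "one_to_one" fails.
dup_right = pd.DataFrame({"key": ["a", "a"], "data2": [10, 11]})
validation_failed = False
try:
    pd.merge(left_df, dup_right, on="key", validate="one_to_one")
except pd.errors.MergeError as exc:
    validation_failed = True
    print(f"MergeError: {exc}")
```

Checking the `_merge` column is a quick way to see which keys failed to match before deciding on the type of join.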

In [15]:
df_1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                     "data1": pd.Series(range(6), dtype="Int64")})
df_2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                     "data2": pd.Series(range(5), dtype="Int64")})

# Left JOIN (df_1 + df_2)
df_merge = pd.merge(df_1, df_2, on="key", how="left")
print(f"\nMerge with left Join: \n{df_merge}")


Merge with left Join: 
   key  data1  data2
0    b      0      1
1    b      0      3
2    b      1      1
3    b      1      3
4    a      2      0
5    a      2      2
6    c      3   <NA>
7    a      4      0
8    a      4      2
9    b      5      1
10   b      5      3


In [16]:
left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": pd.Series([1, 2, 3], dtype='Int64')})
right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                     "key2": ["one", "one", "one", "two"],
                     "rval": pd.Series([4, 5, 6, 7], dtype='Int64')})

# Merge with multiple keys using a list of names

df_merge_lr = pd.merge(left, right, on=["key1", "key2"], how="outer")

print(f"\nMerging with two keys and Outer Join: \n{df_merge_lr}")


Merging with two keys and Outer Join: 
  key1 key2  lval  rval
0  bar  one     3     6
1  bar  two  <NA>     7
2  foo  one     1     4
3  foo  one     1     5
4  foo  two     2  <NA>


When we join columns on columns, the indexes of the passed DataFrame objects
are discarded. To preserve the index values, use `reset_index()` to move
the index into the columns before merging.
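
A minimal sketch of that behavior, assuming two small hypothetical frames (`orders` and `prices`) where the left one carries meaningful row labels:

```python
import pandas as pd

# Illustrative frames; "orders" has non-default row labels.
orders = pd.DataFrame({"key": ["a", "b"], "qty": [1, 2]}, index=["r1", "r2"])
prices = pd.DataFrame({"key": ["b", "a"], "price": [5.0, 3.0]})

# A plain column-on-column merge returns a fresh 0..n-1 index.
plain = pd.merge(orders, prices, on="key")

# reset_index() first copies the labels into a column named "index",
# so they survive the merge.
kept = pd.merge(orders.reset_index(), prices, on="key")

print(plain)
print(kept)
```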

When we merge on one column and both DataFrames also contain another column with the same name,
pandas appends a suffix to the overlapping names (`_x` and `_y` by default). We can choose the suffixes
with the `suffixes=("_left", "_right")` argument.

In [17]:
pd.merge(left, right, on="key1", suffixes=("_left", "_right"))

  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7


### Merging on Index

When merging, the keys can come from a DataFrame's index instead of its columns. To do
this, pass `left_index=True` or `right_index=True` (or both) to indicate
that the index should be used as the merge key.

Remember that `merge` uses an inner join by default (`how='inner'`), so it keeps only
the rows whose keys are found in both objects.

Joining with hierarchically indexed data is equivalent to a multiple-key merge.

In [18]:
left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"],
                      "value": pd.Series(range(6), dtype="Int64")})
right1 = pd.DataFrame({"group_val": [3.5, 7]}, index=["a", "b"])

print(f"First DataFrame: \n{left1}\n"
      f"\nSecond DataFrame: \n{right1}")

# Merging and specify right index
print(f"\nMerging with right index:\n",
      pd.merge(left1, right1, left_on="key", right_index=True))

First DataFrame: 
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

Second DataFrame: 
   group_val
a        3.5
b        7.0

Merging with right index:
   key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0


In [19]:
lefth = pd.DataFrame({"key1": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
                      "key2": [2000, 2001, 2002, 2001, 2002],
                      "data": pd.Series(range(5), dtype="Int64")})

righth_index = pd.MultiIndex.from_arrays(
                    [
                        ["Nevada", "Nevada", "Ohio", "Ohio", "Ohio", "Ohio"],
                        [2001, 2000, 2000, 2000, 2001, 2002]
                    ]
               )
righth = pd.DataFrame({"event1": pd.Series([0, 2, 4, 6, 8, 10], dtype="Int64",
                                           index=righth_index),
                       "event2": pd.Series([1, 3, 5, 7, 9, 11], dtype="Int64",
                                           index=righth_index)})

print(f"Left DataFrame: \n{lefth}\n"
      f"\nRight DataFrame: \n{righth}")

Left DataFrame: 
     key1  key2  data
0    Ohio  2000     0
1    Ohio  2001     1
2    Ohio  2002     2
3  Nevada  2001     3
4  Nevada  2002     4

Right DataFrame: 
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11


In [20]:
"""
    With hierarchically indexed data, we have to pass the columns to merge on
    as a list, because the right DataFrame's index has multiple levels.
"""

mergedh = pd.merge(lefth, righth, left_on=["key1", "key2"], 
                   right_index=True, how="outer")
print(f"Merge with right index and Outer Join:\n{mergedh}")

Merge with right index and Outer Join:
     key1  key2  data  event1  event2
4  Nevada  2000  <NA>       2       3
3  Nevada  2001     3       0       1
4  Nevada  2002     4    <NA>    <NA>
0    Ohio  2000     0       4       5
0    Ohio  2000     0       6       7
1    Ohio  2001     1       8       9
2    Ohio  2002     2      10      11


#### `join()` Method

The `join()` method simplifies merging by index. It can combine DataFrame objects
that have the same or similar indexes but non-overlapping columns. Compared with `merge()`,
`join()` performs a **left join on the join keys by default**.

You can think of `join()` as joining data *into* the object whose `join` method was called.
With the `on` argument, it also supports joining the index of the passed DataFrame on one of the columns
of the calling DataFrame. Finally, you can pass a list of DataFrames to `join()`
as an alternative to using the `pandas.concat()` function.

In [21]:
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                    index=["a", "c", "e"],
                    columns=["Ohio", "Nevada"]).astype("Int64")
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                    index=["b", "c", "d", "e"],
                    columns=["Missouri", "Alabama"]).astype("Int64")
mergel2r2 = pd.merge(left2, right2, how="outer", 
                     left_index=True, right_index=True)
print(f"Merge using both left_ right_index as True and Outer Join:"
      f"\n{mergel2r2}")

print(f"\nAlternative using 'join()' return the same result:\n",
      left2.join(right2, how="outer"))

print(f"\n'join()' but specify 'on' key:\n",
      left1.join(right1, on="key"))

Merge using both left_ right_index as True and Outer Join:
   Ohio  Nevada  Missouri  Alabama
a     1       2      <NA>     <NA>
b  <NA>    <NA>         7        8
c     3       4         9       10
d  <NA>    <NA>        11       12
e     5       6        13       14

Alternative using 'join()' return the same result:
    Ohio  Nevada  Missouri  Alabama
a     1       2      <NA>     <NA>
b  <NA>    <NA>         7        8
c     3       4         9       10
d  <NA>    <NA>        11       12
e     5       6        13       14

'join()' but specify 'on' key:
   key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN


In [38]:
"""
    And lastly, using 'join()' with a list of DataFrames
"""

right_other = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                            index=["a", "c", "e", "f"],
                            columns=["New York", "Oregon"])

print(f"Extra DataFrame: \n{right_other}")

print(f"\n'join()' with a list of DataFrames\n",
      left2.join([right2, right_other]), "\n\n",
      left2.join([right2, right_other], how="outer"), sep='')

Extra DataFrame: 
   New York  Oregon
a       7.0     8.0
c       9.0    10.0
e      11.0    12.0
f      16.0    17.0

'join()' with a list of DataFrames
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a     1       2      <NA>     <NA>       7.0     8.0
c     3       4         9       10       9.0    10.0
e     5       6        13       14      11.0    12.0

   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a     1       2      <NA>     <NA>       7.0     8.0
c     3       4         9       10       9.0    10.0
e     5       6        13       14      11.0    12.0
b  <NA>    <NA>         7        8       NaN     NaN
d  <NA>    <NA>        11       12       NaN     NaN
f  <NA>    <NA>      <NA>     <NA>      16.0    17.0


### Concatenating Along an Axis

Concatenation is also called 'stacking'. NumPy can concatenate arrays
with `numpy.concatenate()`.

pandas has the `concat()` function, which also has to decide what to do if the objects are
indexed differently, whether the chunks of concatenated data need to be
identifiable in the result, and whether the concatenation axis contains data that needs to be preserved.
By default `concat()` works along the 'index' axis (rows). With `axis="columns"`,
concatenating Series produces a DataFrame whose row index is the union of the
input indexes, like an 'outer join'. In the first example below the indexes do not overlap.


*`concat` function arguments*

|Argument| Description|
|---:|----|
|objs |List or dictionary of pandas objects to be concatenated; this is the only required argument|
|axis |Axis to concatenate along; defaults to concatenating along rows (axis="index")|
|join |Either "inner" or "outer" ("outer" by default); whether to intersect (inner) or union (outer) indexes along the other axes|
|keys |Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis; can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple-level arrays passed in levels)|
|levels |Specific indexes to use as hierarchical index level or levels if keys passed|
|names |Names for created hierarchical levels if keys and/or levels passed|
|verify_integrity |Check new axis in concatenated object for duplicates and raise an exception if so; by default (False) allows duplicates|
|ignore_index |Do not preserve indexes along concatenation axis, instead produce a new range(total_length) index|
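
As an illustration of `verify_integrity` and `ignore_index` from the table, here is a short sketch with two hypothetical Series (`s1`, `s2`) that share the label "b":

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3], index=["b", "c"])  # label "b" overlaps with s1

# Default: duplicated labels are allowed in the result.
with_dups = pd.concat([s1, s2])
print(with_dups)

# verify_integrity=True raises ValueError on the duplicated label.
integrity_error = False
try:
    pd.concat([s1, s2], verify_integrity=True)
except ValueError as exc:
    integrity_error = True
    print(f"ValueError: {exc}")

# ignore_index=True sidesteps the issue by assigning a fresh 0..n-1 index.
renumbered = pd.concat([s1, s2], ignore_index=True)
print(renumbered)
```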

In [23]:
ser1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
ser2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
ser3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")
ser4 = pd.concat([ser1, ser3])

# Concatenating series
print(f"Three series concatenated:\n",
      pd.concat([ser1, ser2, ser3]), sep='')

print(f"\nThree series concatenated on axis columns:\n",
      pd.concat([ser1, ser2, ser3], axis="columns"), sep='')

print(f"\nConcatenating Series1 and Series4 on columns:\n",
      pd.concat([ser1, ser4], axis="columns"), sep='')

print(f"\nConcatenating Series1 and Series4 on columns with 'inner join':\n",
      pd.concat([ser1, ser4], axis="columns", join="inner"), sep='')

Three series concatenated:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: Int64

Three series concatenated on axis columns:
      0     1     2
a     0  <NA>  <NA>
b     1  <NA>  <NA>
c  <NA>     2  <NA>
d  <NA>     3  <NA>
e  <NA>     4  <NA>
f  <NA>  <NA>     5
g  <NA>  <NA>     6

Concatenating Series1 and Series4 on columns:
      0  1
a     0  0
b     1  1
f  <NA>  5
g  <NA>  6

Concatenating Series1 and Series4 on columns with 'inner join':
   0  1
a  0  0
b  1  1


#### Concat options

When we pass the `keys` argument, `concat` builds a hierarchical index on the
concatenation axis, with the keys as the outer level. If we concatenate Series with
`axis="columns"`, the keys become the column names instead. DataFrames follow the
same logic: with `axis="columns"` the keys form the outer level of a hierarchical
column index, which helps us identify each of the concatenated DataFrames.
Alternatively, we can pass a dictionary instead of a list of DataFrames, and the
dictionary keys are used as the `keys` argument. The `names` argument lets us name
the created axis levels.

If the index of a DataFrame does not contain relevant data, we can pass
`ignore_index=True` to discard the indexes of the inputs; the data is aligned on
the column names only, and the result gets a new default index.

In [24]:
result = pd.concat([ser1, ser1, ser3, ser4], 
                   keys=["one", "two", "three", "four"])

result_col = pd.concat([ser1, ser1, ser3, ser4], 
                   keys=["one", "two", "three", "four"], axis="columns")

print(f"\nConcat with 'keys' argument as hierarchical index:\n",
      result,"\n\nUnstacking the result:\n", result.unstack(), sep='')

print(f"\nConcat with 'keys' argument and 'columns' axis:\n",
      result_col, sep='')




Concat with 'keys' argument as hierarchical index:
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
four   a    0
       b    1
       f    5
       g    6
dtype: Int64

Unstacking the result:
          a     b     f     g
one       0     1  <NA>  <NA>
two       0     1  <NA>  <NA>
three  <NA>  <NA>     5     6
four      0     1     5     6

Concat with 'keys' argument and 'columns' axis:
    one   two  three  four
a     0     0   <NA>     0
b     1     1   <NA>     1
f  <NA>  <NA>      5     5
g  <NA>  <NA>      6     6


In [25]:
df1 = pd.DataFrame(np.arange(8).reshape(4, 2), index=["a", "b", "c", "d"],
        columns=["one", "two"], dtype="Int64")
df2 = pd.DataFrame(7 + np.arange(6).reshape(3, 2), index=["a", "c", "d"],
        columns=["three", "four"], dtype="Int64")

print(f"DataFrames:\n", df1, "\n\n", df2)

print(f"\nConcatenated DataFrame:\n",
      pd.concat([df1, df2], axis="columns", keys=["level1", "level2"]), sep='')

DataFrames:
    one  two
a    0    1
b    2    3
c    4    5
d    6    7 

    three  four
a      7     8
c      9    10
d     11    12

Concatenated DataFrame:
  level1     level2      
     one two  three  four
a      0   1      7     8
b      2   3   <NA>  <NA>
c      4   5      9    10
d      6   7     11    12


In [26]:
print(f"\nConcatenated DataFrame using dictionary instead of 'keys' option:\n",
      "and adding 'names' argument to name the columns levels:\n\n",
      pd.concat({"level1": df1, "level2": df2}, 
                axis="columns", names=["upper", "lower"]),
      sep='')


Concatenated DataFrame using dictionary instead of 'keys' option:
and adding 'names' argument to name the columns levels:

upper level1     level2      
lower    one two  three  four
a          0   1      7     8
b          2   3   <NA>  <NA>
c          4   5      9    10
d          6   7     11    12


In [27]:
df1 = pd.DataFrame(np.random.standard_normal((3, 4)),
                   columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.random.standard_normal((2, 3)),
                   columns=["b", "d", "a"])

print(f"First DataFrame:\n", df1, sep='')
print(f"\nSecond DataFrame:\n", df2, sep='')

print(f"\nConcatenating DataFrames but ignored non-important index:\n", 
      pd.concat([df1, df2], ignore_index=True), sep='')


First DataFrame:
          a         b         c         d
0  0.235196  0.264797  0.278396 -0.216948
1 -0.821781 -0.785710  1.219959  1.517616
2  0.459172  1.603130 -0.482889  0.434575

Second DataFrame:
          b         d         a
0 -0.557042 -1.103028 -0.552315
1  2.795264  0.685565  0.211977

Concatenating DataFrames but ignored non-important index:
          a         b         c         d
0  0.235196  0.264797  0.278396 -0.216948
1 -0.821781 -0.785710  1.219959  1.517616
2  0.459172  1.603130 -0.482889  0.434575
3 -0.552315 -0.557042       NaN -1.103028
4  0.211977  2.795264       NaN  0.685565


### Combining Data with Overlap

If we have two Series whose indexes overlap in full or in part and which contain NaN values, we can
use the `combine_first()` method to fill the missing values of one Series with values from the other.
`combine_first` lines up values by index label, so the two objects do not need to be in the same order.
It also works on DataFrames, where it does the same thing column by column: NaN values in
the calling DataFrame are filled with the values found at the matching index label and
column in the DataFrame passed as argument.


In [28]:
a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
                index=["f", "e", "d", "c", "b", "a"])
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
                index=["a", "b", "c", "d", "e", "f"])

print(f"\nSeries 'a': \n{a}", sep='')
print(f"\nSeries 'b': \n{b}", sep='')
print(f"\nCombining Series 'a' NaN values with matching Series 'b' values: ",
      f"\n{a.combine_first(b)}", sep='')



Series 'a': 
f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64

Series 'b': 
a    0.0
b    NaN
c    2.0
d    NaN
e    NaN
f    5.0
dtype: float64

Combining Series 'a' NaN values with matching Series 'b' values: 
a    0.0
b    4.5
c    3.5
d    0.0
e    2.5
f    5.0
dtype: float64
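
The DataFrame case can be sketched in the same way. The frames `df_a` and `df_b` below are illustrative; the result contains the union of the row labels and column names:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"a": [1.0, np.nan, 5.0, np.nan],
                     "b": [np.nan, 2.0, np.nan, 6.0],
                     "c": [2.0, 6.0, 10.0, 14.0]})
df_b = pd.DataFrame({"a": [5.0, 4.0, np.nan, 3.0, 7.0],
                     "b": [np.nan, 3.0, 4.0, 6.0, 8.0]})

# Missing values in df_a are filled from df_b, column by column and
# label by label; rows and columns found only in one frame are kept.
combined = df_a.combine_first(df_b)
print(combined)
```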


## Reshaping and Pivoting

### Reshaping with Hierachical Indexing

- `stack`: rotates or pivots from the columns in the data to the rows.
- `unstack`: pivots from the rows into the columns.

By default the innermost level is unstacked. You can unstack a different level by
passing its number or name. Unstacking might introduce missing values if not all of
the values in the level are found in each subgroup. When you unstack a DataFrame,
the level unstacked becomes the lowest level in the result. As with `unstack`, when
calling `stack` we can indicate the name of the axis to stack.

In [29]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(["Ohio", "Colorado"], name="state"),
                    columns=pd.Index(["one", "two", "three"], name="number"))

print(f"DataFrame:\n", data, sep='')

data_stack = data.stack()

print(f"\nDataFrame with 'stack()' is like a Series: \n",
      data_stack, sep='')

# Unstacking on level 0
print("\n'unstack()' recover DF the shape but in level 0:\n", 
      data_stack.unstack(level=0), sep='')
print("\n'unstack()' recover DF the shape passing level 'state':\n", 
      data_stack.unstack(level="state"), sep='')

DataFrame:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

DataFrame with 'stack()' is like a Series: 
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

'unstack()' recover DF the shape but in level 0:
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

'unstack()' recover DF the shape passing level 'state':
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5


In [30]:
"""
    Unstacking a Series can introduce some missing values.
"""

ser1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"], dtype="Int64")
ser2 = pd.Series([4, 5, 6], index=["c", "d", "e"], dtype="Int64")
data2 = pd.concat([ser1, ser2], keys=["one", "two"])

print("Concatenated Series:\n", data2, sep='')
print("\nSeries unstacked:\n", data2.unstack(), sep='')

#
# Stacking data2.unstack() with the 'dropna=False' option keeps the
# missing values, so both chunks end up with the same inner index
#

print("\nSeries unstacked and stacked again:\n",
      data2.unstack().stack(dropna=False), sep='')

Concatenated Series:
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: Int64

Series unstacked:
        a     b  c  d     e
one     0     1  2  3  <NA>
two  <NA>  <NA>  4  5     6

Series unstacked and stacked again:
one  a       0
     b       1
     c       2
     d       3
     e    <NA>
two  a    <NA>
     b    <NA>
     c       4
     d       5
     e       6
dtype: Int64


In [31]:
"""
    When unstacking a DataFrame with multiple hierarchical levels, we pass the
    level number or name; the unstacked level becomes the lowest level of the
    columns in the result.
"""

df = pd.DataFrame({"left": data_stack, "right": data_stack + 5},
                   columns=pd.Index(["left", "right"], name="side"))

print("DataFrame:\n", df, sep='')
print("\n Unstacked DataFrame, level=state: \n", df.unstack(level="state"),
      sep='')

print("\nStack DataFrame again but passing level='side'\n", 
      df.unstack(level="state").stack(level="side"),
      sep='')

DataFrame:
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

 Unstacked DataFrame, level=state: 
side   left          right         
state  Ohio Colorado  Ohio Colorado
number                             
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10

Stack DataFrame again but passing level='side'
state         Ohio  Colorado
number side                 
one    left      0         3
       right     5         8
two    left      1         4
       right     6         9
three  left      2         5
       right     7        10
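
To make the level placement explicit, the following sketch rebuilds a small frame with the same shape as the one above and inspects the level names after each step. It is self-contained, so `data`, `data_stack` and `df` are recreated here from the values shown in the printed output. `wide` and `back` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(["Ohio", "Colorado"], name="state"),
                    columns=pd.Index(["one", "two", "three"], name="number"))
data_stack = data.stack()

df = pd.DataFrame({"left": data_stack, "right": data_stack + 5},
                  columns=pd.Index(["left", "right"], name="side"))

# 'state' moves from the row index to the innermost column level.
wide = df.unstack(level="state")
print(list(wide.columns.names))

# Stacking 'side' moves that column level to the innermost row level.
back = wide.stack(level="side")
print(list(back.index.names))
```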


### Pivoting 'Long' to 'Wide' Format

In long (or stacked) format, each individual value is represented by a single 
row in a table, rather than multiple values per row.
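
As a small illustration of the two layouts, the sketch below writes the same four observations by hand in both forms. The frames `wide_df` and `long_df` are hypothetical. Their values are copied from the first two quarters of the dataset used in this section.

```python
import pandas as pd

# Wide layout: one row per date, one column per measured item.
wide_df = pd.DataFrame(
    {"realgdp": [2710.349, 2778.801], "unemp": [5.8, 5.1]},
    index=pd.Index(["1959-01-01", "1959-04-01"], name="date"),
)

# Long layout: one row per (date, item) observation.
long_df = pd.DataFrame({
    "date": ["1959-01-01", "1959-01-01", "1959-04-01", "1959-04-01"],
    "item": ["realgdp", "unemp", "realgdp", "unemp"],
    "value": [2710.349, 5.8, 2778.801, 5.1],
})

print("wide shape:", wide_df.shape)
print("long shape:", long_df.shape)
```

Both frames hold the same number of values. Only the arrangement differs.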

Dataset:
https://github.com/wesm/pydata-book/blob/3rd-edition/examples/macrodata.csv 



In [39]:
data = pd.read_csv("../datasets/macrodata.csv")
data = data.loc[:, ["year", "quarter", "realgdp", "infl", "unemp"]]

print("DataFrame sample:\n", data.sample(8), sep='')

"""
    Combining the 'year' and 'quarter' columns into a PeriodIndex named
    'date', then using its timestamps as the index of the DataFrame.
"""

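# Note: newer pandas releases (2.2 and later) deprecate these keyword
# arguments in favor of pd.PeriodIndex.from_fields(year=..., quarter=...).
periods = pd.PeriodIndex(year=data.pop("year"),
                         quarter=data.pop("quarter"),
                         name="date")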
print("\nPeriods of time extracted from dataset:\n", periods[:7], sep='')
# First 7 values from 'periods' 

# Using 'periods' as index in the DataFrame
data.index = periods.to_timestamp("D")

"""
    Reordering the columns and giving the column index the name 'item'.
"""

data = data.reindex(columns=["realgdp", "unemp", "infl"])
data.columns.name = "item"

print("\nDataFrame sample, index updated and column name:\n",
      data.sample(8), sep='')

DataFrame sample:
     year  quarter    realgdp   infl  unemp
155  1997        4  10008.874   1.24    4.7
2    1959        3   2775.488   2.74    5.3
95   1982        4   5871.001  -0.82   10.7
23   1964        4   3431.957   2.05    5.0
74   1977        3   5451.921   5.23    6.9
84   1980        1   5908.467  14.60    6.3
163  1999        4  11014.254   2.85    4.1
53   1972        2   4633.101   2.88    5.7

Periods of time extracted from dataset:
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3'],
            dtype='period[Q-DEC]', name='date')

DataFrame sample, index updated and column name:
item          realgdp  unemp  infl
date                              
1986-10-01   7153.359    6.8  4.33
1962-01-01   3031.241    5.6  2.26
1972-04-01   4633.101    5.7  2.88
1991-01-01   7950.164    6.6  1.19
1972-10-01   4754.546    5.3  4.71
1998-01-01  10103.425    4.6  0.49
1991-10-01   8069.046    7.1  3.19
1993-10-01   8643.769    6.6  1.92
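
The period-to-timestamp conversion used in the cell above can be tried in isolation. This is a minimal sketch that builds the quarterly periods from strings instead of from the CSV columns, so it does not need the dataset. `stamps` is an illustrative name.

```python
import pandas as pd

# Quarterly periods with a December year end, matching the dtype
# 'period[Q-DEC]' shown in the output above.
periods = pd.PeriodIndex(["1959Q1", "1959Q2", "1959Q3"],
                         freq="Q-DEC", name="date")

# to_timestamp() uses how="start" by default, so each quarter maps to
# its first day.
stamps = periods.to_timestamp("D")

print(stamps)
```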


In [42]:
"""
    Reshaping 'data' to long format with 'stack()' and turning the index
    levels into columns with 'reset_index()'. The data column, which is
    labelled 0 by default, is renamed to 'value'.
"""

long_data = (data.stack()
             .reset_index()
             .rename(columns={0: "value"}))

print("First 9 rows from Data in long format:\n\n", long_data[:9])


First 9 rows from Data in long format:

         date     item     value
0 1959-01-01  realgdp  2710.349
1 1959-01-01    unemp     5.800
2 1959-01-01     infl     0.000
3 1959-04-01  realgdp  2778.801
4 1959-04-01    unemp     5.100
5 1959-04-01     infl     2.340
6 1959-07-01  realgdp  2775.488
7 1959-07-01    unemp     5.300
8 1959-07-01     infl     2.740
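
A possible shortcut, sketched here on a hand-built frame rather than on the dataset: `Series.reset_index` accepts a `name` argument, which labels the data column directly and avoids the separate `rename` step. The frame `wide` is hypothetical and only mirrors the structure of `data`.

```python
import pandas as pd

wide = pd.DataFrame(
    {"realgdp": [2710.349, 2778.801], "unemp": [5.8, 5.1]},
    index=pd.DatetimeIndex(["1959-01-01", "1959-04-01"], name="date"),
)
wide.columns.name = "item"

# stack() returns a Series indexed by (date, item); 'name' sets the label
# of the column that holds the values.
long_data = wide.stack().reset_index(name="value")

print(long_data)
```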


#### pivot method

Relational databases often store data this way. In some cases the long format
is harder to work with, and we might prefer a DataFrame containing one column
per distinct item, indexed by timestamps. The DataFrame `pivot` method
performs exactly this transformation.
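
A minimal sketch of that transformation on a hand-built long table, assuming only that each (date, item) pair appears once. The values are copied from the first two quarters of the dataset, and the name `wide` is illustrative.

```python
import pandas as pd

long_data = pd.DataFrame({
    "date": pd.to_datetime(["1959-01-01"] * 3 + ["1959-04-01"] * 3),
    "item": ["realgdp", "unemp", "infl"] * 2,
    "value": [2710.349, 5.8, 0.0, 2778.801, 5.1, 2.34],
})

# One row per date, one column per distinct item.
wide = long_data.pivot(index="date", columns="item", values="value")

print(wide)
```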



In [43]:
data_pivot = long_data.pivot(index="date", 
                             columns="item", 
                             values="value")

print("Pivoted long format Data:\n", data_pivot.head(),"\n", sep='')

# Adding new value column
long_data["value2"] = np.random.standard_normal(len(long_data))
print("Long Data with new column:\n",long_data.sample(6))


Pivoted long format Data:
item        infl   realgdp  unemp
date                             
1959-01-01  0.00  2710.349    5.8
1959-04-01  2.34  2778.801    5.1
1959-07-01  2.74  2775.488    5.3
1959-10-01  0.27  2785.204    5.6
1960-01-01  2.31  2847.699    5.2

Long Data with new column:
           date     item     value    value2
218 1977-01-01     infl     8.760 -0.964067
345 1987-10-01  realgdp  7458.022  0.543163
276 1982-01-01  realgdp  5857.333  0.103318
223 1977-07-01    unemp     6.900  0.479803
488 1999-07-01     infl     3.350  2.076974
164 1972-07-01     infl     3.810  0.121510


In [36]:

# Pivoting long data without passing 'values' generates hierarchical columns
data_pivot = long_data.pivot(index="date", 
                             columns="item")

print("Pivoted long format Data:\n", data_pivot.head(),"\n", sep='')
print("\nShowing only value2:\n", data_pivot["value2"].head(),"\n", sep='')


Pivoted long format Data:
           value                    value2                    
item        infl   realgdp unemp      infl   realgdp     unemp
date                                                          
1959-01-01  0.00  2710.349   5.8  1.109704  0.478075  0.778557
1959-04-01  2.34  2778.801   5.1  1.902302 -0.332077  0.835549
1959-07-01  2.74  2775.488   5.3  1.053750  1.354385 -0.812201
1959-10-01  0.27  2785.204   5.6  0.520720 -1.477654 -2.297180
1960-01-01  2.31  2847.699   5.2 -0.483893  0.509806 -1.351570


Showing only value2:
item            infl   realgdp     unemp
date                                    
1959-01-01  1.109704  0.478075  0.778557
1959-04-01  1.902302 -0.332077  0.835549
1959-07-01  1.053750  1.354385 -0.812201
1959-10-01  0.520720 -1.477654 -2.297180
1960-01-01 -0.483893  0.509806 -1.351570



In [37]:
"""
    The 'pivot' method is equivalent to creating a hierarchical index
    with 'set_index' followed by a call to 'unstack'.
"""

data_unstack = long_data.set_index(["date", "item"]).unstack(level="item")

print("Same result of 'pivot()' but using 'unstack()' and 'set_index()'\n",
      data_unstack.head(), sep='')

Same result of 'pivot()' but using 'unstack()' and 'set_index()'
           value                    value2                    
item        infl   realgdp unemp      infl   realgdp     unemp
date                                                          
1959-01-01  0.00  2710.349   5.8  1.109704  0.478075  0.778557
1959-04-01  2.34  2778.801   5.1  1.902302 -0.332077  0.835549
1959-07-01  2.74  2775.488   5.3  1.053750  1.354385 -0.812201
1959-10-01  0.27  2785.204   5.6  0.520720 -1.477654 -2.297180
1960-01-01  2.31  2847.699   5.2 -0.483893  0.509806 -1.351570
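
The equivalence can also be checked programmatically. The sketch below uses a small hand-built table, where the `value2` numbers are made up for illustration, and compares the two results with `DataFrame.equals`.

```python
import pandas as pd

long_data = pd.DataFrame({
    "date": pd.to_datetime(["1959-01-01"] * 2 + ["1959-04-01"] * 2),
    "item": ["realgdp", "unemp"] * 2,
    "value": [2710.349, 5.8, 2778.801, 5.1],
    "value2": [1.0, 2.0, 3.0, 4.0],  # made-up second measurement
})

pivoted = long_data.pivot(index="date", columns="item")
unstacked = long_data.set_index(["date", "item"]).unstack(level="item")

# equals() compares values, dtypes and axis labels.
same = pivoted.equals(unstacked)
print("pivot() matches set_index().unstack():", same)
```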


### Pivoting 'Wide' to 'Long' Format