Note: This notebook was completed alongside the DataCamp course by the same name
# Reshaping Data with pandas
Often data is in a human-readable format, but it’s not suitable for data analysis. This is where pandas can help—it’s a powerful tool for reshaping DataFrames into different formats. In this course, you’ll grow your data scientist and analyst skills as you learn how to wrangle string columns and nested data contained in a DataFrame. You’ll work with real-world data, including FIFA player ratings, book reviews, and churn analysis data, as you learn how to reshape a DataFrame from wide to long format, stack and unstack rows and columns, and get descriptive statistics of a multi-index DataFrame.

**Instructor:** Maria Eugenia Inzaugarat, PhD Data Scientist

In [5]:
import pandas as pd

# $\star$ Chapter 1: Introduction to Data Reshaping
Let's start by understanding the concept of wide and long formats and the advantages of each. You’ll then learn how to pivot data from long to wide format and get summary statistics from a large DataFrame.

* Wide and long formats
* Long to wide transformation
* Wide to long transformation
* Stacking and unstacking columns
* Reshaping and handling complex data, such as string columns or JSON data
* Nested data
* Statistical data formats
* Multi-level index DataFrames

### Shape of data
* The way in which a dataset is organized into rows and columns

#### Wide format
* Each feature is in a separate column
* Each row contains many features of the same player
* Wide format has **no repeated records**, but this **could lead to missing values.**
* This format is preferred to do **simple statistics and imputation**, such as calculating the mean or imputing missing values.

<img src='data/wide_format.png' width="600" height="300" align="center"/>

#### Long format
* Each row represents one feature
* Multiple rows for each player
* Notice that there is no row for the feature `age` for the first player
* This happens because we had a missing value there
* A column (`name`) that identifies the same player through the records
* These are typical characteristics of the long format that is usually seen as the standard for a tidy dataset
* Tidy data:
    * Better to summarize data
    * Key-value pairs
    * Preferred or required for many advanced graphing and analysis techniques

<img src='data/long_format.png' width="400" height="200" align="center"/>

#### Reshaping data
* In a broad sense, reshaping data is transforming a data structure to adjust it for analysis
* Transpose:
    * `fifa_players.set_index('club')[['name', 'nationality']].transpose()`
    * Alternately, `.T`
* **In this course, we will define reshaping data as converting data from wide to long format and vice versa.**
* To decide between using long or wide format, think about the unit of analysis:
    * Long format $\Rightarrow$ characteristic of a player
    * Wide format $\Rightarrow$ each player
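As a minimal sketch of the transpose mentioned above, the snippet below builds a tiny stand-in for `fifa_players` (the rows are hypothetical sample data, not the course dataset) and flips rows and columns.

```python
import pandas as pd

# Hypothetical sample rows; the course's FIFA dataset is not reproduced here.
fifa_players = pd.DataFrame({
    "name": ["Lionel Messi", "Cristiano Ronaldo"],
    "club": ["FC Barcelona", "Juventus"],
    "nationality": ["Argentina", "Portugal"],
})

# Use the club as the row index and keep two columns.
by_club = fifa_players.set_index("club")[["name", "nationality"]]

# Transposing swaps rows and columns: clubs become the column labels.
transposed = by_club.transpose()
print(transposed)
```

`by_club.T` gives the same result as `by_club.transpose()`.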
    
#### Wide to long transformation
* Performed using `pandas` tools such as:
     * the `.melt()` method
     * the `pd.wide_to_long()` function
   
#### Long to wide format
* Transform data using `pandas` methods, for example:
    * `.pivot()`
    * `.pivot_table()`
    
#### Exercises: Flipping players

```
# Set name as index
fifa_transpose = fifa_players.set_index('name')

# Print fifa_transpose
print(fifa_transpose)

# Modify the DataFrame to keep only height and weight columns
fifa_transpose = fifa_players.set_index('name')[['height', 'weight']]
# Print fifa_transpose
print(fifa_transpose)

# Change the DataFrame so rows become columns and vice versa
fifa_transpose = fifa_players.set_index('name')[['height', 'weight']].transpose()

# Print fifa_transpose
print(fifa_transpose)
```

### Reshaping using pivot method
* The long format is usually the most suitable for storing a clean dataset

#### Why go from long to WIDE format?
* Demonstrate relationship between two (or more) columns
* Time series operations with the variables
* Operations that require each column to be a unique variable
* Wide format allows us to **discover patterns**
* The `pivot()` method allows us to reshape the data from a long to a wide format

<img src='data/pivot_method2.png' width="600" height="300" align="center"/>

* The `pivot()` method takes three arguments:
    * **`index`:** takes the name of the column we want to have as an index in the new, pivoted DF.
    * **`columns`:** takes the name of the column we want to have as each column in the new, pivoted DF.
    * **`values`:** takes the name of the column of values with which we want to populate the new, pivoted DF.
* **If an index/column combination has no matching row in the original DataFrame, the method sets that cell to a missing value (see the `NaN` above).**

<img src='data/pivot1.png' width="600" height="300" align="center"/>

```
fifa.pivot(index = 'name', columns = 'variable', values='metric_system')
```
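To make the three arguments concrete, here is a small sketch on a hand-made long table. The rows are hypothetical; only the column names (`name`, `variable`, `metric_system`) follow the example above. One index/column combination is left out on purpose to show where `NaN` comes from.

```python
import pandas as pd

# Hypothetical long-format data: one row per (player, variable) pair.
fifa = pd.DataFrame({
    "name": ["Lionel Messi", "Lionel Messi", "Cristiano Ronaldo"],
    "variable": ["height", "weight", "height"],
    "metric_system": [170.0, 72.0, 187.0],
})

# index -> row labels, columns -> new column labels, values -> cell contents.
fifa_wide = fifa.pivot(index="name", columns="variable", values="metric_system")
print(fifa_wide)

# There is no weight row for Cristiano Ronaldo, so that cell is missing.
print(fifa_wide.isna())
```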

<img src='data/pivot2.png' width="600" height="300" align="center"/>

* **Note** that we can also pass a list of two values to the pivot method:
* `fifa.pivot(index='name', columns='variable', values=['metric_system', 'imperical_system'])`
* In this case, the resulting DataFrame has a hierarchical column index with both column names as demonstrated below:

<img src='data/pivot3.png' width="600" height="300" align="center"/>

***
#### Pivoting multiple columns

<img src='data/pivot4.png' width="600" height="300" align="center"/>

* What if we want to extend the pivot method to all the column values in the DataFrame instead of just one or two?
* We can do this easily by omitting the values argument:
* `df.pivot(index='name', columns='variable')`
* We see that we get the same result as above, when we passed multiple columns to the `values` argument:

<img src='data/pivot5.png' width="600" height="300" align="center"/>
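A short sketch of omitting `values`, again on hypothetical rows that reuse the column names from the examples above. It only illustrates that the result carries a two-level column index.

```python
import pandas as pd

# Hypothetical long-format data; column names follow the examples above.
fifa = pd.DataFrame({
    "name": ["Lionel Messi", "Lionel Messi",
             "Cristiano Ronaldo", "Cristiano Ronaldo"],
    "variable": ["height", "weight", "height", "weight"],
    "metric_system": [170.0, 72.0, 187.0, 83.0],
    "imperical_system": [5.58, 158.7, 6.13, 183.0],
})

# With no values argument, every remaining column is pivoted.
fifa_all = fifa.pivot(index="name", columns="variable")

# The columns form a two-level index: (original column, variable).
print(fifa_all.columns.tolist())

# Selecting the outer level returns an ordinary wide table.
print(fifa_all["metric_system"])
```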

#### Duplicate entries error
* Passing only index and column arguments to the pivot method will work in most cases
* However, pay attention to the 3rd and 5th rows below:

<img src='data/pivot6.png' width="600" height="300" align="center"/>

* If we try to perform the same operation: `another_fifa.pivot(index='name', columns='variable')`, we get:

<img src='data/pivot7.png' width="600" height="300" align="center"/>

* Because pandas doesn't know which of the two values should fill the cell, it raises a `ValueError`.
* We could choose to delete one of the rows (for example the fifth row) and then rerun the command without raising an error.
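The duplicate-entry situation can be reproduced with a few hypothetical rows in which the third and fifth rows share the same `name`/`variable` pair. This is a sketch of the behavior described above, not the course data.

```python
import pandas as pd

# Hypothetical data: rows 2 and 4 (the 3rd and 5th) repeat the same pair.
another_fifa = pd.DataFrame({
    "name": ["L. Messi", "L. Messi", "C. Ronaldo", "C. Ronaldo", "C. Ronaldo"],
    "variable": ["height", "weight", "height", "weight", "height"],
    "metric_system": [170.0, 72.0, 187.0, 83.0, 188.0],
})

# pivot() cannot decide between 187.0 and 188.0, so it raises ValueError.
try:
    another_fifa.pivot(index="name", columns="variable")
    raised = False
except ValueError as err:
    raised = True
    print("pivot failed:", err)

# Dropping the fifth row (label 4) removes the ambiguity.
fixed = another_fifa.drop(4, axis=0).pivot(index="name", columns="variable")
print(fixed)
```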

#### Exercises: Dribbling the pivot method 

```
# Pivot fifa_players to get overall scores indexed by name and identified by movement
fifa_overall = fifa_players.pivot(index='name', columns='movement', values='overall')

# Print fifa_overall
print(fifa_overall)

# Pivot fifa_players to get attacking scores indexed by name and identified by movement
fifa_attacking = fifa_players.pivot(index='name', columns='movement', values='attacking')

# Print fifa_attacking
print(fifa_attacking)

# Use the pivot method to get overall scores indexed by movement and identified by name
fifa_names = fifa_players.pivot(index='movement', columns='name', values='overall')

# Print fifa_names
print(fifa_names)
```

#### Exercises: Offensive or defensive player?

```
# Pivot fifa_players to get overall and attacking scores indexed by name and identified by movement
fifa_over_attack = fifa_players.pivot(index='name', 
                                     columns='movement', 
                                     values=['overall', 'attacking'])

# Print fifa_over_attack
print(fifa_over_attack)

# Use pivot method to get all the scores indexed by name and identified by movement
fifa_all = fifa_players.pivot(index='name',
                              columns='movement',
                              values=['overall', 'attacking'])

# Print fifa_all
print(fifa_all)
```

#### Exercises: Replay that last move!

```
# Drop the fifth row to delete all repeated rows
fifa_no_rep = fifa_players.drop(4, axis=0)

# Print fifa_no_rep
print(fifa_no_rep)

# Drop the fifth row to delete all repeated rows
fifa_no_rep = fifa_players.drop(4, axis=0)

# Pivot fifa players to get all scores by name and movement
fifa_pivot = fifa_no_rep.pivot(index='name', columns='movement', values=['overall', 'attacking']) 

# Print fifa_pivot
print(fifa_pivot)  
```

## Pivot tables

#### Pivot method limitations
* The `.pivot()` method is a great general-purpose pivoting technique, but it has some limitations
* **It requires each index/column pair to be unique**
    * This is mainly because the pivot method cannot aggregate values
    
### Pivot table
* A DataFrame containing statistics that summarize the data of a larger DataFrame
* To convert from the DataFrame in long format on the left to the DataFrame on the right with aggregated values, we can use the **`.pivot_table()`** method
* It is important to note that with this method we can also summarize DataFrames that are not in long format

<img src='data/pivot8.png' width="600" height="300" align="center"/>

* `df.pivot_table(index='Year', columns='Name', values='Weight', aggfunc='mean')`
* **Note** the new, additional parameter: **`aggfunc`**
    * **Default `aggfunc` is `mean`.**
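A minimal sketch of the aggregation, assuming a small hypothetical table with the `Year`, `Name` and `Weight` columns used above. The repeated `(2019, Ann)` pair would break `.pivot()`, while `.pivot_table()` averages it.

```python
import pandas as pd

# Hypothetical measurements; (2019, "Ann") appears twice.
df = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020],
    "Name": ["Ann", "Ann", "Bob", "Ann"],
    "Weight": [60.0, 62.0, 80.0, 61.0],
})

# Duplicated index/column pairs are combined with the aggregation function.
table = df.pivot_table(index="Year", columns="Name", values="Weight", aggfunc="mean")
print(table)

# Leaving aggfunc out falls back to the mean as well.
default_table = df.pivot_table(index="Year", columns="Name", values="Weight")
print(default_table)
```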
    
### Hierarchical indexes
* Another advantage of pivot tables is that we can have multi-level indexes, not only in the columns but also in the rows: here, the row index levels are `first` and `last`

<img src='data/pivot9.png' width="700" height="350" align="center"/>

### Margins
* Finally, we would like to get the number of attacking and overall scores each player has
* In the `pivot_table` method, by omitting a `values` argument, pandas will pivot **all values.**
* But, we will pass the **`margins`** argument
* `fifa_players.pivot_table(index=['first', 'last'], columns='movement', aggfunc='count', margins=True)`
* When the `margins` parameter is set to `True`, pandas adds an extra row and an extra column, both labeled `All`, that apply the aggregation across all rows and all columns.
* In this case (with `aggfunc='count'`) we'll get the total counts for each row and column

<img src='data/pivot10.png' width="600" height="300" align="center"/>
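The margins call can be sketched on a few hypothetical rows (the scores are invented for illustration). Every player has at least one row per movement so that no cell is empty.

```python
import pandas as pd

# Hypothetical scores; the column names follow the example above.
fifa_players = pd.DataFrame({
    "first": ["Lionel", "Lionel", "Cristiano", "Cristiano", "Cristiano"],
    "last": ["Messi", "Messi", "Ronaldo", "Ronaldo", "Ronaldo"],
    "movement": ["shooting", "passing", "shooting", "passing", "shooting"],
    "overall": [92, 92, 93, 82, 90],
    "attacking": [70, 92, 89, 84, 88],
})

# No values argument: both score columns are counted.
# margins=True appends an "All" row and an "All" column with the totals.
counts = fifa_players.pivot_table(index=["first", "last"],
                                  columns="movement",
                                  aggfunc="count",
                                  margins=True)
print(counts)
```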

#### Pivot or pivot table?
* *Does the DataFrame have more than one value for each index/column pair?*
* *Do you need to have a multi-index in your resulting pivoted DataFrame?*
* *Do you need summary statistics of your large DataFrame?*
* **If YES** Use `.pivot_table()`

#### Exercises: Reviewing the moves

```
# Discard the fifth row to delete all repeated rows
fifa_drop = fifa_players.drop(4, axis=0)

# Use pivot method to get all scores by name and movement
fifa_pivot = fifa_drop.pivot(index='name', columns='movement') 

# Print fifa_pivot
print(fifa_pivot)  

# Use pivot table to get all scores by name and movement
fifa_pivot_table = fifa_players.pivot_table(index='name', 
                                     columns='movement', 
                                     aggfunc='mean')
# Print fifa_pivot_table
print(fifa_pivot_table)
```

#### Exercises: Exploring the big match

```
# Use pivot table to display mean age of players by club and nationality 
mean_age_fifa = fifa_players.pivot_table(index='nationality', 
                                  columns='club', 
                                  values='age', 
                                  aggfunc='mean')

# Print mean_age_fifa
print(mean_age_fifa)

# Use pivot table to display max height of any player by club and nationality
tall_players_fifa = fifa_players.pivot_table(index='nationality', 
                                     columns='club', 
                                      values='height', 
                                      aggfunc='max')

# Print tall_players_fifa
print(tall_players_fifa)

# Use pivot table to show the count of players by club and nationality and the total count
players_country = fifa_players.pivot_table(index='nationality', 
                                    columns='club', 
                                    values='name', 
                                    aggfunc='count', 
                                    margins=True)

# Print players_country
print(players_country)
```

#### Exercises: The tallest and the heaviest

```
# Define a pivot table to get the characteristic by nationality and club
fifa_mean = fifa_players.pivot_table(index=['nationality', 'club'], 
                                     columns='year')

# Print fifa_mean
print(fifa_mean)

# Set the appropriate argument to show the maximum values
fifa_mean = fifa_players.pivot_table(index=['nationality', 'club'], 
                                     columns='year', 
                                     aggfunc='max')

# Print fifa_mean
print(fifa_mean)

# Set the argument to get the maximum for each row and column
fifa_mean = fifa_players.pivot_table(index=['nationality', 'club'], 
                                     columns='year', 
                                     aggfunc='max', 
                                     margins=True)

# Print fifa_mean
print(fifa_mean)
```

# $\star$ Chapter 2: Converting Between Wide and Long Format
Master the technique of reshaping DataFrames from wide to long format. In this chapter, you'll learn how to use the melting method and wide to long function before discovering how to handle string columns by concatenating or splitting them.

### Reshaping with melt
* In this lesson, we will learn how to reshape a DataFrame from wide to long format using the `melt` function.

#### Wide to long transformation
* Perform analytics
* Plot different variables in the same graph

<img src='data/wide_to_long.png' width="700" height="350" align="center"/>

* Most data is stored in a wide format
* The first argument to set is **`id_vars`**
* This argument takes the names of the column(s) to use as identifier variables
* **`df.melt(id_vars=["first","last"])`**
* The columns identified in `id_vars` will also appear in the long format table and will help us match all the records for the same observation
    * The rest of the columns are melted
    
<img src='data/pivot11.png' width="700" height="350" align="center"/>

### Values and variables
* What can we do if we do not want to melt all the columns?
* We have other arguments for that purpose: 
    * **`value_vars`:** 
        * Takes the names of the columns we want to melt
        * This can be only one column or a list of many columns
    * **`var_name`:** 
        * Takes the name to use for the column "variable"
        * Default value is `variable`
    * **`value_name`:**
        * Takes the name to use for the column "value"
        * Default value is `value`

<img src='data/pivot12.png' width="700" height="350" align="center"/>

#### Specifying values to melt
* `books.melt(id_vars='title', value_vars=['language_code', 'num_pages'])`
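The `melt` arguments can be tried on a small hypothetical `books` table. The titles and numbers are placeholders, and the names `feature` and `detail` are chosen only for this sketch.

```python
import pandas as pd

# Hypothetical wide-format table: one row per book.
books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "language_code": ["eng", "spa"],
    "num_pages": [418, 280],
    "rating": [4.0, 3.8],
})

# Only id_vars: every other column is melted (3 columns x 2 books = 6 rows).
all_melted = books.melt(id_vars="title")

# value_vars restricts what is melted; var_name and value_name rename
# the default "variable" and "value" columns.
melted = books.melt(id_vars="title",
                    value_vars=["language_code", "num_pages"],
                    var_name="feature",
                    value_name="detail")
print(melted)
```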

#### Exercises: Gothic times

```
# Melt books_gothic using the title column as identifier 
gothic_melted = books_gothic.melt(id_vars='title')

# Print gothic_melted
print(gothic_melted)

# Melt books_gothic using the title, authors, and publisher columns as identifier
gothic_melted_new = books_gothic.melt(id_vars=['title', 'authors', 'publisher'])

# Print gothic_melted_new
print(gothic_melted_new)

# Melt publisher column using title and authors as identifiers
publisher_melted = books_gothic.melt(id_vars=['title', 'authors'], 
                                     value_vars='publisher')

# Print publisher_melted
print(publisher_melted)

# Melt rating and rating_count columns using the title as identifier
rating_melted = books_gothic.melt(id_vars='title', 
                                  value_vars=['rating', 'rating_count'])

# Print rating_melted
print(rating_melted)

# Melt rating and rating_count columns using title and authors as identifier
books_melted = books_gothic.melt(id_vars=['title', 'authors'], 
                                 value_vars=['rating', 'rating_count'])

# Print books_melted
print(books_melted)

# Melt the rating and rating_count using title, authors and publisher as identifiers
books_ratings = books_gothic.melt(id_vars=['title', 'authors', 'publisher'], 
                                  value_vars=['rating', 'rating_count'])

# Print books_ratings
print(books_ratings)

# Assign the name feature to the new variable column
books_ratings = books_gothic.melt(id_vars=['title', 'authors', 'publisher'], 
                                  value_vars=['rating', 'rating_count'], 
                                  var_name='feature')

# Print books_ratings
print(books_ratings)

# Assign the name number to the new column containing the values
books_ratings = books_gothic.melt(id_vars=['title', 'authors', 'publisher'], 
                                  value_vars=['rating', 'rating_count'], 
                                  var_name='feature', 
                                  value_name='number')

# Print books_ratings
print(books_ratings)
```

### Wide to long function
* In addition to `melt`, another function that can help us transform the data from wide to long is the **`pd.wide_to_long()`** function
    * **Notice** that this is a pandas function, and not a dataframe method
    
<img src='data/pivot13.png' width="800" height="400" align="center"/>

* `pd.wide_to_long(books, stubnames=['ratings', 'sold'], i='title', j='year')`

<img src='data/pivot14.png' width="700" height="350" align="center"/>
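A minimal sketch of `pd.wide_to_long()` on hypothetical data whose column names are a stub (`ratings`, `sold`) immediately followed by a year.

```python
import pandas as pd

# Hypothetical wide table: stub name + numeric suffix, no separator.
books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "ratings2019": [4.1, 3.9],
    "ratings2020": [4.3, 4.0],
    "sold2019": [100, 250],
    "sold2020": [120, 240],
})

# i identifies each record; j names the new index level built from the suffix.
books_long = pd.wide_to_long(books, stubnames=["ratings", "sold"], i="title", j="year")
print(books_long)
```

The result is indexed by `title` and `year`, with one column per stub name.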

* **It is important to mention that if we have a DataFrame with a named index and we apply the `wide_to_long` function, the resulting DataFrame will not keep the original index.**

<img src='data/pivot15.png' width="600" height="300" align="center"/>

* If we want to keep a named index, we must modify the original dataframe by resetting the index without dropping it
* Then, apply the transformation including the new column

```
books_with_index.reset_index(drop=False, inplace=True)
pd.wide_to_long(books_with_index, stubnames=['ratings', 'sold'], i=['author', 'title'], j='year')
```

<img src='data/pivot16.png' width="600" height="300" align="center"/>

#### sep argument
* This new dataframe (below) is very similar to the previous one, but the name of the columns contains an underscore between the prefix (`ratings` or `sold`) and the suffix (the year, `2019` or `2020`).

<img src='data/pivot17.png' width="600" height="300" align="center"/>

* If we apply the transformation as before, we'll get an empty DataFrame:

<img src='data/pivot18.png' width="600" height="300" align="center"/>

* This happens because pandas doesn't recognize the name of the columns
* **By default, pandas assumes that the prefix is *immediately* followed by a numeric suffix** (the defaults are `sep=''` and `suffix='\d+'`).
* To overcome this, we can use the `sep` argument

<img src='data/pivot19.png' width="600" height="300" align="center"/>
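As a sketch of the `sep` argument, the hypothetical table below uses an underscore between the stub and the year. Passing `sep='_'` tells pandas how to split the column names.

```python
import pandas as pd

# Hypothetical wide table with an underscore between stub and suffix.
books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "ratings_2019": [4.1, 3.9],
    "ratings_2020": [4.3, 4.0],
    "sold_2019": [100, 250],
    "sold_2020": [120, 240],
})

# sep="_" makes pandas look for columns named <stub>_<number>.
books_long = pd.wide_to_long(books,
                             stubnames=["ratings", "sold"],
                             i="title",
                             j="year",
                             sep="_")
print(books_long)
```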

#### suffix argument
* Finally, the wide column names may not end in a number at all and may instead end in a word such as `one` or `two`. If we apply the same transformation as before, we'll get an empty DataFrame, because pandas assumes the suffixes are numeric
* To solve this, we use the `suffix` argument with a regular expression

<img src='data/pivot20.png' width="600" height="300" align="center"/>
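A sketch of the `suffix` argument on hypothetical columns that end in words. The raw string `r'\w+'` matches any run of word characters, and the name `version` for the new level is an assumption made for this example.

```python
import pandas as pd

# Hypothetical wide table whose suffixes are words, not numbers.
books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "ratings_one": [4.1, 3.9],
    "ratings_two": [4.3, 4.0],
    "sold_one": [100, 250],
    "sold_two": [120, 240],
})

# suffix takes a regular expression describing what follows the separator.
books_long = pd.wide_to_long(books,
                             stubnames=["ratings", "sold"],
                             i="title",
                             j="version",
                             sep="_",
                             suffix=r"\w+")
print(books_long)
```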

#### Exercises: The golden age

```
# Reshape wide to long using title as index and version as new name, and extracting isbn prefix 
isbn_long = pd.wide_to_long(golden_age, 
                            stubnames='isbn', 
                            i='title', 
                            j='version')

# Print isbn_long
print(isbn_long)
```

```
# Reshape wide to long using title and authors as index and version as new name, and prefix as wide column prefix
prefix_long = pd.wide_to_long(golden_age, 
                      stubnames='prefix', 
                      i=['title', 'authors'], 
                      j='version')

# Print prefix_long
print(prefix_long)
```


```
# Reshape wide to long using title and authors as index and version as new name, and prefix and isbn as wide column prefixes
all_long = pd.wide_to_long(golden_age, 
                   stubnames=['isbn', 'prefix'], 
                   i=['title', 'authors'], 
                   j='version')

# Print all_long
print(all_long)
```

#### Exercises: Decrypting the code

```
# Reshape using author and title as index, code as new name and getting the prefix language and publisher
the_code_long = pd.wide_to_long(books_brown, 
                                stubnames=['language', 'publisher'], 
                                i=['author', 'title'], 
                                j='code',
                                sep='_')

# Print the_code_long
print(the_code_long)
```

```
# Specify underscore as the character that separates the variable names
the_code_long = pd.wide_to_long(books_brown, 
                                stubnames=['language', 'publisher'], 
                                i=['author', 'title'], 
                                j='code', sep='_')

# Print the_code_long
print(the_code_long)
```

```
# Specify that wide columns have a suffix containing words
the_code_long = pd.wide_to_long(books_brown, 
                                stubnames=['language', 'publisher'], 
                                i=['author', 'title'], 
                                j='code', 
                                sep='_', 
                                suffix=r'\w+')

# Print the_code_long
print(the_code_long)
```


```
# Modify books_hunger by resetting the index without dropping it
books_hunger.reset_index(drop=False, inplace=True)

# Reshape using title and language as index, feature as new name, publication and page as prefix separated by space and ending in a word
publication_features = pd.wide_to_long(books_hunger, 
                                       stubnames=['publication', 'page'], 
                                       i=['title', 'language'], 
                                       j='feature', 
                                       sep=' ', 
                                       suffix=r'\w+')

# Print publication_features
print(publication_features)
```

### Working with string columns
* pandas Series and Indexes have a set of string processing methods
* Easily accessible with `str` attribute
    * `books['title'].str.split(':')`
    * The method returns a list for each row 
    * Each list contains the two sub-strings obtained from splitting the title by the colon.
    
<img src='data/pivot21.png' width="600" height="300" align="center"/>

* We could also access only one of the resulting elements
* In that case, we would use the `.get()` method from the `str` attribute, passing in the index of the element we want.
    * `books['title'].str.split(":").str.get(0)`
    * In this example we get the element of index zero
    
<img src='data/pivot22.png' width="600" height="300" align="center"/>

* We can also set the `expand` argument of `split` to `True`
* This will return a new DataFrame with two columns, one for each split element

<img src='data/pivot23.png' width="600" height="300" align="center"/>
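The three variants of `split` can be compared on a couple of placeholder titles (hypothetical strings, not the course data).

```python
import pandas as pd

# Hypothetical titles with a colon between main title and subtitle.
books = pd.DataFrame({"title": ["Title A: Subtitle A", "Title B: Subtitle B"]})

# 1) Plain split: each row holds a list of sub-strings.
parts = books["title"].str.split(":")

# 2) Keep only the first element of each list.
main_only = parts.str.get(0)

# 3) expand=True returns a DataFrame with one column per element.
expanded = books["title"].str.split(":", expand=True)
print(expanded)

# Splitting on ":" keeps the space after the colon; str.strip() removes it.
subtitle_clean = expanded[1].str.strip()
print(subtitle_clean)
```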

* This allows us to assign the split elements to columns in the original DataFrame:
* `books[['main_title', 'subtitle']] = books['title'].str.split(":", expand=True)`
* In our example, we first split the column title by the colon, indicating we wanted to expand it to two columns, and assign it to two new columns 
* This is useful because now we can drop the original column title
* And after that, transform the DataFrame by using the new columns as indices, getting a clean, long DataFrame with a multi-level index. 

```
books[['main_title', 'subtitle']] = books['title'].str.split(":", expand=True)
books.drop('title', axis=1, inplace=True)
pd.wide_to_long(books, stubnames=['ratings', 'sold'], i=['main_title', 'subtitle'], j='year')
```

<img src='data/pivot24.png' width="600" height="300" align="center"/>

#### Concatenating two columns
* `books_new['name_author'].str.cat(books_new['lastname_author'], sep=' ')`

<img src='data/pivot25.png' width="600" height="300" align="center"/>

* This is helpful because then we can melt our DataFrame using this new (concatenated) column as an index, instead of using the two original columns
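A small sketch of concatenating two columns and then melting on the combined column. The author names and the two rating columns are hypothetical.

```python
import pandas as pd

# Hypothetical table with the author's name split over two columns.
books_new = pd.DataFrame({
    "name_author": ["Jane", "John"],
    "lastname_author": ["Doe", "Smith"],
    "rating_2019": [4.1, 3.9],
    "rating_2020": [4.3, 4.0],
})

# Join first and last name with a space in between.
books_new["author"] = books_new["name_author"].str.cat(
    books_new["lastname_author"], sep=" ")

# The combined column can now act as the single identifier when melting.
melted = books_new.melt(id_vars="author",
                        value_vars=["rating_2019", "rating_2020"])
print(melted)
```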


### Concatenate index
* The `cat` and `split` methods can also be used for indexes
* To concatenate the index with a column in the DataFrame:
* `comics_marvel.index = comics_marvel.index.str.cat(comics_marvel['subtitle'], sep='-')`

<img src='data/pivot26.png' width="600" height="300" align="center"/>

### Split index
* We can do the same to split the string contained in the index
* `comics_marvel.index = comics_marvel.index.str.split('-', expand=True)`
* We now get a dataframe with a multilevel index

<img src='data/pivot27.png' width="600" height="300" align="center"/>
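Both index operations can be sketched together. The `comics_marvel` rows below are hypothetical placeholders. The sketch concatenates the index with a column and then splits it again into a multi-level index.

```python
import pandas as pd

# Hypothetical data: the main title is stored in the index.
comics_marvel = pd.DataFrame(
    {"subtitle": ["Part 1", "Part 2"], "year": [2010, 2012]},
    index=["Avengers", "Thor"],
)

# Concatenate each index label with the matching subtitle.
comics_marvel.index = comics_marvel.index.str.cat(comics_marvel["subtitle"], sep="-")
concatenated_labels = list(comics_marvel.index)
print(concatenated_labels)

# Splitting with expand=True turns the flat index into a MultiIndex.
comics_marvel.index = comics_marvel.index.str.split("-", expand=True)
print(comics_marvel)
```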

### Concatenate Series
* So far we have only worked with concatenating columns, but we can apply the `cat` method to concatenate a **column** with a **pre-defined list**
* `books_new['name_author'].str.cat(new_list, sep=' ')`

<img src='data/pivot28.png' width="600" height="300" align="center"/>

* As we can see in the output, we obtain a Series where each string in the main title has been concatenated with the corresponding element in the list

#### Exercises: Did you say dystopia

```
# Split the index of books_dys by the hyphen 
books_dys.index = books_dys.index.str.split('-')

# Print books_dys
print(books_dys)
```

```
# Get the first element after splitting the index of books_dys
books_dys.index = books_dys.index.str.split('-').str.get(0)

# Print books_dys
print(books_dys)
```

```
# Split by the hyphen the index of books_dys
books_dys.index = books_dys.index.str.split('-').str.get(0)

# Concatenate the index with the list author_list separated by a hyphen
books_dys.index = books_dys.index.str.cat(author_list, sep='-')

# Print books_dys
print(books_dys)
```

```
# Concatenate the title and subtitle separated by "and" surrounded by spaces
hp_books['full_title'] = hp_books['title'].str.cat(hp_books['subtitle'], sep=' and ')

# Print hp_books
print(hp_books)
```

```
# Concatenate the title and subtitle separated by "and" surrounded by spaces
hp_books['full_title'] = hp_books['title'].str.cat(hp_books['subtitle'], sep=" and ")

# Split the authors into writer and illustrator columns
hp_books[['writer', 'illustrator']] = hp_books['authors'].str.split("/", expand=True) 

# Print hp_books
print(hp_books)
```

```
# Concatenate the title and subtitle separated by "and" surrounded by spaces
hp_books['full_title'] = hp_books['title'].str.cat(hp_books['subtitle'], sep=" and ")

# Split the authors into writer and illustrator columns
hp_books[['writer', 'illustrator']] = hp_books['authors'].str.split('/', expand=True)

# Melt goodreads and amazon columns into a single column
hp_melt = hp_books.melt(id_vars=['full_title', 'writer'], 
                        var_name='source', 
                        value_vars=['goodreads', 'amazon'], 
                        value_name='rating')

# Print hp_melt
print(hp_melt)
```

#### Exercises: Elementary, dear Watson

```
# Split main_title by a colon and assign it to two columns named title and subtitle 
books_sh[['title', 'subtitle']] = books_sh['main_title'].str.split(':', expand=True)

# Print books_sh
print(books_sh)
```

```
# Split main_title by a colon and assign it to two columns named title and subtitle 
books_sh[['title', 'subtitle']] = books_sh['main_title'].str.split(':', expand=True)

# Split version by a space and assign the second element to the column named volume 
books_sh['volume'] = books_sh['version'].str.split(' ').str.get(1)

# Print books_sh
print(books_sh)
```

```
# Split main_title by a colon and assign it to two columns named title and subtitle 
books_sh[['title', 'subtitle']] = books_sh['main_title'].str.split(':', expand=True)

# Split version by a space and assign the second element to the column named volume
books_sh['volume'] = books_sh['version'].str.split(' ').str.get(1)

# Drop the main_title and version columns modifying books_sh
books_sh.drop(['main_title', 'version'], axis=1, inplace=True)

# Print books_sh
print(books_sh)
```

```
# Split main_title by a colon and assign it to two columns named title and subtitle 
books_sh[['title', 'subtitle']] = books_sh['main_title'].str.split(':', expand=True)

# Split version by a space and assign the second element to the column named volume 
books_sh['volume'] = books_sh['version'].str.split(' ').str.get(1)

# Drop the main_title and version columns modifying books_sh
books_sh.drop(['main_title', 'version'], axis=1, inplace=True)

# Reshape using title, subtitle and volume as index, naming the new variable feature, from columns starting with number, separated by an underscore and ending in words
sh_long = pd.wide_to_long(books_sh, 
                          stubnames='number', 
                          i=['title', 'subtitle', 'volume'], 
                          j='feature', 
                          sep='_', 
                          suffix=r'\w+')

# Print sh_long 
print(sh_long)
```

# $\star$ Chapter 3: Stacking and Unstacking DataFrames
In this chapter, you’ll level-up your data manipulation skills using multi-level indexing. You'll learn how to reshape DataFrames by rearranging levels of the row indexes to the column axis, or vice versa. You'll also gain the skills you need to handle missing data generated in the stacking and unstacking processes.

### Stacking DataFrames

* Pandas also has some reshaping methods that are designed to work on DataFrames with multi-level indexes
* A **MultiIndex**, also known as a **multi-level index** allows us to **store and manipulate multidimensional data in simple DataFrames.**

### Creating a MultiIndex
* There are several ways to create a multilevel index.

#### Setting the index
* The simplest way to create a multilevel index is to use the **`set_index()`** method:
* In the below code, we specify that we want the columns `country` and `age` to be set as row indices
* `churn.set_index(['country', 'age'], inplace=True)`

<img src='data/pivot29.png' width="600" height="300" align="center"/>
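A minimal sketch of `set_index()` with two columns, assuming a tiny hypothetical `churn` table. Rows are then addressed with a tuple, one value per index level.

```python
import pandas as pd

# Hypothetical churn records.
churn = pd.DataFrame({
    "country": ["France", "Spain", "Spain"],
    "age": [42, 41, 42],
    "exited": ["Yes", "No", "Yes"],
})

# Both columns move into the row index and form two levels.
churn.set_index(["country", "age"], inplace=True)
print(churn)

# Look up one row by giving a value for each level.
print(churn.loc[("Spain", 41), "exited"])
```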

### MultiIndex from array
* Another option is to use the `from_arrays()` method of `pd.MultiIndex`
* In this case, we define a list of lists named `new_array` (below)
* Each inner list holds the values for one index level
* We call the `from_arrays()` method, passing our array and a list of the names we want for the index levels
* **We assign the result to the original DataFrame's *index* attribute.**

```
new_array = [['yes', 'no', 'yes'], ['no', 'yes', 'yes']]
churn.index = pd.MultiIndex.from_arrays(new_array, names=['member', 'credit_card'])
```

<img src='data/pivot30.png' width="600" height="300" align="center"/>

* We can also define a DataFrame with multi-level indexes on the rows **and** the columns 

<img src='data/pivot31.png' width="600" height="300" align="center"/>

* The process is very similar:
    * We create two MultiIndexes using the method `from_arrays()`: one for the index and one for the columns
    
```
index = pd.MultiIndex.from_arrays([['Wick', 'Wick', 'Shelley', 'Shelley'],
                                  ['John', 'Julien', 'Mary', 'Frank']],
                                 names = ['last', 'first'])
columns = pd.MultiIndex.from_arrays([['2019', '2019', '2020', '2020'],
                                     ['age', 'weight', 'age', 'weight']],
                      names=['year', 'feature'])
```    

* When we create the DataFrame, we set the index and the columns to be the recently created multi-level indexes

```
patients = pd.DataFrame(data, index=index, columns=columns)
patients
```
* As a result, we get a DataFrame with multi-level indexes on the rows and on the columns

<img src='data/pivot32.png' width="600" height="300" align="center"/>
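
The snippet above relies on a `data` object that is not defined in these notes. Here is a minimal runnable sketch, assuming made-up numbers for `data` (the values are hypothetical and will not match the image):

```python
import pandas as pd

# Row and column MultiIndexes, as defined above
index = pd.MultiIndex.from_arrays(
    [['Wick', 'Wick', 'Shelley', 'Shelley'],
     ['John', 'Julien', 'Mary', 'Frank']],
    names=['last', 'first'])
columns = pd.MultiIndex.from_arrays(
    [['2019', '2019', '2020', '2020'],
     ['age', 'weight', 'age', 'weight']],
    names=['year', 'feature'])

# Hypothetical values: one row per patient, one column per (year, feature) pair
data = [[25, 70, 26, 72],
        [31, 82, 32, 81],
        [41, 58, 42, 59],
        [56, 77, 57, 75]]

patients = pd.DataFrame(data, index=index, columns=columns)
print(patients)

# A single cell is addressed with one tuple per axis
print(patients.loc[('Wick', 'John'), ('2020', 'weight')])  # 72
```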

### The .stack() method
* The `stack()` method reshapes a DataFrame with a multi-level index by converting it into a **stacked form**
* `df.stack()`
* In other words, stacking means: **Rearranging the innermost column index to become the innermost row index.**

<img src='data/pivot33.png' width="600" height="300" align="center"/>


### Stack into a series
* If we take a DataFrame with a multi-level index on the rows and a single-level column index, `stack()` moves that only column level into the innermost row level, which produces a Series, as we can see in the output

<img src='data/pivot34.png' width="600" height="300" align="center"/>

<img src='data/pivot35.png' width="600" height="300" align="center"/>
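
A small sketch of this case, assuming a hypothetical `people` DataFrame (names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical DataFrame: two-level row index, single-level columns
index = pd.MultiIndex.from_arrays(
    [['Shelley', 'Wick', 'Wick'], ['Mary', 'John', 'Julien']],
    names=['last', 'first'])
people = pd.DataFrame({'age': [41, 25, 31], 'weight': [58, 70, 82]},
                      index=index)

# The only column level becomes the innermost row level
stacked = people.stack()

print(type(stacked).__name__)  # Series
print(stacked.index.nlevels)   # 3
print(stacked)
```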

### Stack into a DataFrame
* We have a DataFrame with a multi-level index in the columns; we apply the `stack()` method
* As a consequence, `stack()` will compress the last level in the columns to produce a DataFrame, as seen in the output

<img src='data/pivot36.png' width="600" height="300" align="center"/>
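
The same idea with multi-level columns can be sketched as follows. The `patients` values are hypothetical; only the index and column labels come from the earlier cells.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['Wick', 'Wick', 'Shelley', 'Shelley'],
     ['John', 'Julien', 'Mary', 'Frank']],
    names=['last', 'first'])
columns = pd.MultiIndex.from_arrays(
    [['2019', '2019', '2020', '2020'],
     ['age', 'weight', 'age', 'weight']],
    names=['year', 'feature'])
# Hypothetical values, for illustration only
patients = pd.DataFrame([[25, 70, 26, 72], [31, 82, 32, 81],
                         [41, 58, 42, 59], [56, 77, 57, 75]],
                        index=index, columns=columns)

# The innermost column level ('feature') moves to the rows;
# 'year' stays in the columns, so the result is still a DataFrame
stacked = patients.stack()

print(list(stacked.index.names))  # ['last', 'first', 'feature']
print(stacked)
```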

### Stack a Level by Number
* It is also possible to choose which level to stack
* In the example below, we want to stack the first column level, so we set the `level` argument to zero
* Now **the stacked level becomes the new lowest level in the row multi-level index**

<img src='data/pivot37.png' width="600" height="300" align="center"/>

* **It is important to remember that if we don't set the level argument, `stack()` will move the last level by default.**

### Stack a level by name
* If our DataFrame has named column levels, we can also specify the level to stack by passing in the level name
* In the code below, we set `level='year'`
* In the resulting DataFrame, we see that the year level has now become the innermost row level

<img src='data/pivot38.png' width="600" height="300" align="center"/>
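
Selecting the level by number or by name can be sketched together. This assumes the same hypothetical `patients` values as elsewhere in these notes, so the numbers will not match the images.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['Wick', 'Wick', 'Shelley', 'Shelley'],
     ['John', 'Julien', 'Mary', 'Frank']],
    names=['last', 'first'])
columns = pd.MultiIndex.from_arrays(
    [['2019', '2019', '2020', '2020'],
     ['age', 'weight', 'age', 'weight']],
    names=['year', 'feature'])
# Hypothetical values, for illustration only
patients = pd.DataFrame([[25, 70, 26, 72], [31, 82, 32, 81],
                         [41, 58, 42, 59], [56, 77, 57, 75]],
                        index=index, columns=columns)

by_number = patients.stack(level=0)     # first column level, by position
by_name = patients.stack(level='year')  # the same level, by name

print(list(by_number.index.names))  # ['last', 'first', 'year']
print(by_number.equals(by_name))    # True
```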

#### Exercises: Stack the calls!

```
# Predefined list to use as index
new_index = [['California', 'California', 'New York', 'Ohio'], 
             ['Los Angeles', 'San Francisco', 'New York', 'Cleveland']]

# Create a multi-level index using predefined new_index
churn_new = pd.MultiIndex.from_arrays(new_index, names=['state', 'city'])

# Print churn_new
print(churn_new)
```

```
# Predefined list to use as index
new_index = [['California', 'California', 'New York', 'Ohio'], 
             ['Los Angeles', 'San Francisco', 'New York', 'Cleveland']]

# Create a multi-level index using predefined new_index
churn_new = pd.MultiIndex.from_arrays(new_index, names=['state', 'city'])

# Assign the new index to the churn index
churn.index = churn_new

# Print churn
print(churn)
```

```
# Predefined list to use as index
new_index = [['California', 'California', 'New York', 'Ohio'], 
             ['Los Angeles', 'San Francisco', 'New York', 'Cleveland']]

# Create a multi-level index using predefined new_index
churn_new = pd.MultiIndex.from_arrays(new_index, names=['state', 'city'])

# Assign the new index to the churn index
churn.index = churn_new

# Reshape by stacking churn DataFrame
churn_stack = churn.stack()

# Print churn_stack
print(churn_stack)
```

#### Exercises: Phone Directory Index

```
# Set state and city as index modifying the DataFrame
churn.set_index(['state', 'city'], inplace=True)

# Print churn
print(churn)
```

```
# Set state and city as index modifying the DataFrame
churn.set_index(['state', 'city'], inplace=True)

# Reshape by stacking the second level
churn_stack = churn.stack(level=1)

# Print churn_stack
print(churn_stack)
```

#### Exercises: Text me!

```
# Stack churn by the time column level
churn_time = churn.stack(level='time')

# Print churn_time
print(churn_time)
```

```
# Stack churn by the feature column level
churn_feature = churn.stack(level='feature')

# Print churn_feature
print(churn_feature)
```

## Unstacking DataFrames

### Undoing stacking process
* pandas provides us with the `unstack()` method
* **The unstacking process performs exactly the inverse operation of stacking**
* **Unstacking** means rearranging the innermost row index to become the innermost column index.

<img src='data/pivot39.png' width="600" height="300" align="center"/>

* If we take another look at this $\Downarrow$ stacked series, we can see that it has three row index levels:

<img src='data/pivot40.png' width="600" height="300" align="center"/>

* If we apply the `unstack()` method, we can see that the innermost row level has now moved to the innermost column level; this is the same as the original DataFrame we had before the stacking operation, effectively "undoing" the stacking:

<img src='data/pivot41.png' width="600" height="300" align="center"/>
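
A minimal round-trip sketch, assuming a hypothetical `people` DataFrame with made-up values:

```python
import pandas as pd

# Hypothetical DataFrame: two-level row index, single-level columns
index = pd.MultiIndex.from_arrays(
    [['Shelley', 'Wick', 'Wick'], ['Mary', 'John', 'Julien']],
    names=['last', 'first'])
people = pd.DataFrame({'age': [41, 25, 31], 'weight': [58, 70, 82]},
                      index=index)

stacked = people.stack()      # Series with three row levels
restored = stacked.unstack()  # innermost row level moves back to the columns

print(stacked.index.nlevels)   # 3
print(list(restored.columns))  # ['age', 'weight']
print(restored)
```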

### Unstacking a DataFrame
* `unstack()` can also be applied to DataFrames

<img src='data/pivot42.png' width="600" height="300" align="center"/>

<img src='data/pivot43.png' width="600" height="300" align="center"/>

* As a result $\Uparrow$, we can see that the last row level, the `feature` level, has moved to the column level.
* Again, we got the original DataFrame we had before the stacking operation

### Unstack a level
* We can also choose *which* level to unstack by setting the `level` argument to either the **index name** or **index number**, just as we did with the `stack()` method:

<img src='data/pivot44.png' width="600" height="300" align="center"/>

### Unstack level by number
* Remember that if we don't set the `level` argument, `unstack()` moves the last row level by default.

<img src='data/pivot45.png' width="600" height="300" align="center"/>

### Unstack level by name

<img src='data/pivot46.png' width="600" height="300" align="center"/>
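
The three ways of choosing a level can be compared on a small example. The `records` Series below is hypothetical (its names and values are not taken from the images).

```python
import pandas as pd

# Hypothetical Series with three row levels
index = pd.MultiIndex.from_product(
    [['Shelley', 'Wick'], ['2019', '2020'], ['age', 'weight']],
    names=['last', 'year', 'feature'])
records = pd.Series([41, 58, 42, 59, 25, 70, 26, 72], index=index)

default = records.unstack()              # last row level: 'feature'
by_number = records.unstack(level=0)     # first row level: 'last'
by_name = records.unstack(level='year')  # row level selected by name

print(list(default.columns))    # ['age', 'weight']
print(list(by_number.columns))  # ['Shelley', 'Wick']
print(list(by_name.columns))    # ['2019', '2020']
```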

### Sort index
* Note that the `stack()` and `unstack()` methods implicitly sort the index levels
* To change that, we can use the **`sort_index()`** method
* In the example below, we set the `ascending` argument to `False`
* The resulting DataFrame has its row index sorted in descending order

<img src='data/pivot47.png' width="600" height="300" align="center"/>
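
A short sketch of sorting after unstacking, again using a hypothetical `records` Series:

```python
import pandas as pd

# Hypothetical Series with three row levels
index = pd.MultiIndex.from_product(
    [['Shelley', 'Wick'], ['2019', '2020'], ['age', 'weight']],
    names=['last', 'year', 'feature'])
records = pd.Series([41, 58, 42, 59, 25, 70, 26, 72], index=index)

# Unstack the last row level, then sort the remaining row index descending
desc = records.unstack().sort_index(ascending=False)

print(desc.index[0])   # ('Wick', '2020')
print(desc.index[-1])  # ('Shelley', '2019')
print(desc)
```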

### Rearranging levels
* **One useful way to rearrange levels is to chain the stacking and unstacking processes**
* Below we unstack the second row level and then stack the first column level
* In the output, we see that the row level named `first` appears now in the column index
* Also, the column level named `year` has moved to the row index

<img src='data/pivot48.png' width="600" height="300" align="center"/>
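
The chain described above can be sketched with the `patients` layout from earlier. The values are hypothetical. Because not every first name exists under every last name, the unstacking step introduces `NaN` cells, which is expected here.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['Wick', 'Wick', 'Shelley', 'Shelley'],
     ['John', 'Julien', 'Mary', 'Frank']],
    names=['last', 'first'])
columns = pd.MultiIndex.from_arrays(
    [['2019', '2019', '2020', '2020'],
     ['age', 'weight', 'age', 'weight']],
    names=['year', 'feature'])
# Hypothetical values, for illustration only
patients = pd.DataFrame([[25, 70, 26, 72], [31, 82, 32, 81],
                         [41, 58, 42, 59], [56, 77, 57, 75]],
                        index=index, columns=columns)

# Move 'first' from the rows to the columns, then 'year' from the columns to the rows
rearranged = patients.unstack(level=1).stack(level=0)

print(list(rearranged.index.names))    # ['last', 'year']
print(list(rearranged.columns.names))  # ['feature', 'first']
```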

#### Exercises: International caller

```
# Reshape the churn DataFrame by unstacking
churn_unstack = churn.unstack()

# Print churn_unstack
print(churn_unstack)
```

```
# Reshape churn by unstacking the first row level
churn_first = churn.unstack(level=0)

# Print churn_first
print(churn_first)
```

```
# Reshape churn by unstacking the second row level
churn_second = churn.unstack(level=1)

# Print churn_second
print(churn_second)
```

#### Exercises: Call another time

```
# Unstack the time level from churn
churn_time = churn.unstack(level='time')

# Print churn_time
print(churn_time)
```

```
# Sort the index in descending order
churn_time = churn.unstack(level='time').sort_index(ascending=False)

# Print churn_time
print(churn_time)
```

#### Exercises: Organizing your voicemail

```
# Unstack churn by type level
churn_type = churn.unstack(level='type')

# Stack the resulting DataFrame using the first column level
churn_final = churn_type.stack(level=0)

# Print churn_final
print(churn_final)
```

## Working with multiple levels
* Rearranging one level at a time has its limitations

### Rearranging multiple levels
* Swap levels
* Stack and unstack multiple levels at the same time

### Swap levels
* The **`swaplevel()`** method can switch the order of two levels *within the same axis*
* This means that we can swap the order of two row levels or two column levels

<img src='data/pivot49.png' width="600" height="300" align="center"/>

* See the following example with the `cars` dataset:
* We apply the `swaplevel()` method, passing the level numbers zero and two
* In the output, we can see how the first and third row levels are now interchanged

<img src='data/pivot50.png' width="600" height="300" align="center"/>

### Swap levels and unstack
* We can now chain it with the unstacking process
* We can see that the row level containing the `price` and `sold` features was moved to the column index
* If we hadn't changed the order of the levels, the unstacked level would have been the brand level.

<img src='data/pivot51.png' width="600" height="300" align="center"/>
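
A sketch of swapping and then unstacking, assuming a hypothetical `cars` DataFrame. Its level names (`brand`, `model`, `year`) and values are made up, so the layout differs from the images.

```python
import pandas as pd

# Hypothetical cars data with three row levels
index = pd.MultiIndex.from_arrays(
    [['Ford', 'Ford', 'Tesla', 'Tesla'],
     ['Fiesta', 'Focus', 'Model 3', 'Model S'],
     ['2019', '2020', '2019', '2020']],
    names=['brand', 'model', 'year'])
cars = pd.DataFrame({'price': [15000, 18000, 40000, 80000],
                     'sold': [120, 95, 60, 30]}, index=index)

# Swap the first and third row levels; the data itself is unchanged
swapped = cars.swaplevel(0, 2)
print(list(swapped.index.names))  # ['year', 'model', 'brand']

# unstack() always moves the last row level, which is now 'brand'
print(list(swapped.unstack().columns.names))  # [None, 'brand']
# Without the swap, 'year' would have been moved instead
print(list(cars.unstack().columns.names))     # [None, 'year']
```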

***

### Unstack and swap levels
* In the following example, we first unstack the last row index level, then swap the first and second column levels 
* We do this by setting the `axis` parameter to `1`
* We can see how the year appears on top of the brand level

<img src='data/pivot52.png' width="600" height="300" align="center"/>
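
Swapping column levels works the same way once `axis=1` is set. A sketch with a hypothetical `cars` DataFrame (made-up names and values, so the layout differs from the image):

```python
import pandas as pd

# Hypothetical cars data with three row levels
index = pd.MultiIndex.from_arrays(
    [['Ford', 'Ford', 'Tesla', 'Tesla'],
     ['Fiesta', 'Focus', 'Model 3', 'Model S'],
     ['2019', '2020', '2019', '2020']],
    names=['brand', 'model', 'year'])
cars = pd.DataFrame({'price': [15000, 18000, 40000, 80000],
                     'sold': [120, 95, 60, 30]}, index=index)

# Unstack the last row level ('year'), then swap the two column levels
result = cars.unstack().swaplevel(0, 1, axis=1)

print(list(result.columns.names))  # ['year', None]
print(result)
```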

### Swap levels and stack
* Finally, we can also stack the column index of cars
* Then, call the `swaplevel()` method, passing zero and two as arguments
* In the output, we can see that the newly stacked level and the first row level have switched places

<img src='data/pivot53.png' width="600" height="300" align="center"/>
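
The same pattern can be sketched with the `patients` layout used earlier instead of `cars`. The values are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['Wick', 'Wick', 'Shelley', 'Shelley'],
     ['John', 'Julien', 'Mary', 'Frank']],
    names=['last', 'first'])
columns = pd.MultiIndex.from_arrays(
    [['2019', '2019', '2020', '2020'],
     ['age', 'weight', 'age', 'weight']],
    names=['year', 'feature'])
# Hypothetical values, for illustration only
patients = pd.DataFrame([[25, 70, 26, 72], [31, 82, 32, 81],
                         [41, 58, 42, 59], [56, 77, 57, 75]],
                        index=index, columns=columns)

# Stack 'feature' into the rows, then swap the first and third row levels
result = patients.stack().swaplevel(0, 2)

print(list(result.index.names))  # ['feature', 'first', 'last']
```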

### Multiple levels
* The DataFrame here has multi-level indexes on the rows and on the columns 
* So, how do we reshape several of these levels *at the same time*?

<img src='data/pivot54.png' width="600" height="300" align="center"/>

### Unstacking multiple levels
* The `cars` DataFrame has a **multi-index on the rows** with **three levels.**

<img src='data/pivot55.png' width="600" height="300" align="center"/>

### Unstacking levels by number
* Unstacking several levels at the same time is easy
* We just have to pass a list of the index numbers to the level parameter
* `cars.unstack(level=[0,1])`
* In the output, we see that the first and second row levels are now in the column index.
* Only one level remains in the row index; the two unstacked levels have been added to the column index

<img src='data/pivot56.png' width="600" height="300" align="center"/>

### Unstacking levels by name
* We could also use the level names by passing a list of brand and model levels to `unstack()`
* `cars.unstack(level=['brand', 'model'])`
* As a result, we get the same DataFrame as before

<img src='data/pivot57.png' width="600" height="300" align="center"/>
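
Both calls can be checked against each other. The `cars` DataFrame below is hypothetical, with made-up level names and values.

```python
import pandas as pd

# Hypothetical cars data with three row levels
index = pd.MultiIndex.from_arrays(
    [['Ford', 'Ford', 'Tesla', 'Tesla'],
     ['Fiesta', 'Focus', 'Model 3', 'Model S'],
     ['2019', '2020', '2019', '2020']],
    names=['brand', 'model', 'year'])
cars = pd.DataFrame({'price': [15000, 18000, 40000, 80000],
                     'sold': [120, 95, 60, 30]}, index=index)

by_number = cars.unstack(level=[0, 1])
by_name = cars.unstack(level=['brand', 'model'])

print(by_number.equals(by_name))    # True
print(list(by_number.index.names))  # ['year']
print(by_number.columns.nlevels)    # 3
```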

### Stacking multiple levels
* For the following examples, we'll use this $\Downarrow$ DataFrame:

<img src='data/pivot58.png' width="600" height="300" align="center"/>

### Stacking by name or number
* We could pass a list of index numbers or the respective names 
* In both cases, we get a resulting DataFrame where the year and brand levels are now in the row indices
* It is important to notice that the order in which you pass the names matters

<img src='data/pivot59.png' width="600" height="300" align="center"/>
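
Stacking two column levels at once might look like this. The `wide` DataFrame, its `region` index and its values are all hypothetical.

```python
import pandas as pd

# Hypothetical wide DataFrame with three column levels
columns = pd.MultiIndex.from_product(
    [['2019', '2020'], ['Ford', 'Tesla'], ['price', 'sold']],
    names=['year', 'brand', 'feature'])
rows = pd.Index(['north', 'south'], name='region')
wide = pd.DataFrame([list(range(8)), list(range(8, 16))],
                    index=rows, columns=columns)

by_number = wide.stack(level=[0, 1])
by_name = wide.stack(level=['year', 'brand'])

print(list(by_name.index.names))  # ['region', 'year', 'brand']
print(by_name)

# Passing the names in the other order; compare the printed level names
print(list(wide.stack(level=['brand', 'year']).index.names))
```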

#### Exercises: Swap your SIM card

```
# Switch the first and third row index levels in churn
churn_swap = churn.swaplevel(0, 2)

# Print churn_swap
print(churn_swap)
```

```
# Switch the first and third row index levels in churn
churn_swap = churn.swaplevel(0, 2)

# Reshape by unstacking the last row level 
churn_unstack = churn_swap.unstack()

# Print churn_unstack
print(churn_unstack)
```

#### Exercises: Two many calls

```
# Unstack the first and second row level of churn
churn_unstack = churn.unstack(level=[0, 1])

# Print churn_unstack
print(churn_unstack)
```

```
# Unstack the first and second row level of churn
churn_unstack = churn.unstack(level=[0, 1])

# Stack the resulting DataFrame using plan and year
churn_py = churn_unstack.stack(level=['plan', 'year'])

# Print churn_py
print(churn_py)
```

```
# Unstack the first and second row level of churn
churn_unstack = churn.unstack(level=[0, 1])

# Stack the resulting DataFrame using plan and year
churn_py = churn_unstack.stack(['plan', 'year'])

# Switch the first and second column levels
churn_switch = churn_py.swaplevel(0, 1, axis=1)

# Print churn_switch
print(churn_switch)
```

## Handling missing data
* In this lesson, we'll learn how to handle missing data when we stack or unstack DataFrames

### Unstacking leads to missing values
* This happens when: **Subgroups do not have the same set of labels**

<img src='data/pivot60.png' width="600" height="300" align="center"/>

<img src='data/pivot61.png' width="600" height="300" align="center"/>

* In the reshaped data above $\Uparrow$ we can see that the subgroup *Aves Carnivora* shows a missing value `NaN`
* This happens because that combination was not present in the original DataFrame

### Handling NaN with unstack
* Luckily, the parameter `fill_value` of the `unstack()` method allows us to fill those values with any value

<img src='data/pivot62.png' width="600" height="300" align="center"/>
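
A minimal sketch of `fill_value`, assuming a hypothetical `species` Series whose labels and counts are made up:

```python
import pandas as pd

# Hypothetical counts: 'Aves' has no 'Carnivora' entry
index = pd.MultiIndex.from_arrays(
    [['Aves', 'Mammalia', 'Mammalia'],
     ['Passeriformes', 'Carnivora', 'Primates']],
    names=['class', 'order'])
species = pd.Series([3, 5, 2], index=index)

with_nan = species.unstack(level='order')              # missing combinations become NaN
filled = species.unstack(level='order', fill_value=0)  # missing combinations become 0

print(with_nan)
print(filled)
```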

### Stack and missing values
* However, the case of `stack()` is different
* Missing values appear when: **Combinations of index and column values are missing from the original DataFrame**
* We'll work with the following DataFrame $\Downarrow$

<img src='data/pivot63.png' width="600" height="300" align="center"/>

* After applying the stack method, we can see that the combination of rose and size is completely missing $\Downarrow$

<img src='data/pivot64.png' width="600" height="300" align="center"/>

* **This happens because `stack()` has the argument `dropna` set to `True` by *default***
* If we would prefer to keep that information, we need to set the `dropna` argument to `False` $\Downarrow$

<img src='data/pivot65.png' width="600" height="300" align="center"/>

* We can see in the resulting DataFrame that the row with indices rose and size is now present, with all of its values missing.
* We *could* then fill the missing values using the `fillna()` method $\Downarrow$

<img src='data/pivot66.png' width="600" height="300" align="center"/>

* We pass the value with which we want to replace the missing values (in this case `0`).
* The resulting DataFrame $\Uparrow$ will have zeros instead of `NaN`s
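
The `fillna()` step can be sketched on a hypothetical `flowers` DataFrame (made-up labels and values). How `stack()` handles `dropna` has been changing in newer pandas releases, so this sketch relies only on `fillna()` and uses a row that is only partly missing, which is kept either way.

```python
import pandas as pd

nan = float('nan')

# Hypothetical data: the size of the rose is missing for 2019
columns = pd.MultiIndex.from_product(
    [['2019', '2020'], ['petals', 'size']], names=['year', 'feature'])
rows = pd.Index(['daisy', 'rose'], name='flower')
flowers = pd.DataFrame([[20, 5.0, 22, 5.5],
                        [30, nan, 32, 7.0]],
                       index=rows, columns=columns)

stacked = flowers.stack(level='feature')  # rows: (flower, feature); columns: year
filled = stacked.fillna(0)                # replace remaining NaN values with 0

print(stacked)
print(filled)
```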

#### Exercises: A missed phone call

```
# Unstack churn level and fill missing values with zero
churn = churn.unstack(level='churn', fill_value=0)

# Sort by descending voice mail plan and ascending international plan
churn_sorted = churn.sort_index(level=['voice_mail_plan', 'international_plan'], 
                          ascending=[False, True])

# Print final DataFrame and observe pattern
print(churn_sorted)
```

#### Exercises: Don't drop the stack

```
# Stack the level type from churn
churn_stack = churn.stack(level='type')

# Fill the resulting missing values with zero 
churn_fill = churn_stack.fillna(0)

# Print churn_fill
print(churn_fill)
```

```

<img src='data/pivot.png' width="600" height="300" align="center"/>