Note: This notebook was completed alongside the DataCamp course of the same name.
# Reshaping Data with pandas
Often data is in a human-readable format, but it’s not suitable for data analysis. This is where pandas can help—it’s a powerful tool for reshaping DataFrames into different formats. In this course, you’ll grow your data scientist and analyst skills as you learn how to wrangle string columns and nested data contained in a DataFrame. You’ll work with real-world data, including FIFA player ratings, book reviews, and churn analysis data, as you learn how to reshape a DataFrame from wide to long format, stack and unstack rows and columns, and get descriptive statistics of a multi-index DataFrame.

**Instructor:** Maria Eugenia Inzaugarat, PhD Data Scientist

# $\star$ Chapter 1: Introduction to Data Reshaping
Let's start by understanding the concept of wide and long formats and the advantages of using each of them. You’ll then learn how to pivot data from long to wide format, and get summary statistics from a large DataFrame.

* Wide and long formats
* Long to wide transformation
* Wide to long transformation
* Stacking and unstacking columns
* Reshaping and handling complex data, such as string columns or JSON data
* Nested data
* Statistical data formats
* Multi-level index DataFrames

### Shape of data
* The way in which a dataset is organized into rows and columns

#### Wide format
* Each feature is in a separate column
* Each row contains many features of the same player
* Wide format has **no repeated records**, but this **could lead to missing values.**
* This format is preferred to do **simple statistics and imputation**, such as calculating the mean or imputing missing values.

<img src='data/wide_format.png' width="600" height="300" align="center"/>

#### Long format
* Each row represents one feature
* Multiple rows for each player
* Notice that there is no row for the feature `age` for the first player
* This happens because we had a missing value there
* A column (`name`) that identifies the same player through the records
* These are typical characteristics of the long format, which is usually seen as the standard for a tidy dataset
* Tidy data:
    * Better to summarize data
    * Key-value pairs
    * Preferred or required for many advanced graphing and analysis techniques

<img src='data/long_format.png' width="400" height="200" align="center"/>
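
To make the contrast between the two formats concrete, here is a minimal sketch. The `wide` DataFrame and its values are made up for illustration and are not the course data. It uses `melt` (covered in Chapter 2) only to produce the long version, and shows that a missing value in wide format simply has no row in long format.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per player, one column per feature
wide = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "age": [np.nan, 33],       # Player A's age is missing
    "height": [170, 187],
})

# Long format: one row per (player, feature) pair.
# Dropping missing values removes the (Player A, age) row entirely.
long_df = wide.melt(id_vars="name", var_name="feature").dropna(subset=["value"])

print(wide)
print(long_df)
```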

#### Reshaping data
* In a broad sense, reshaping data is transforming a data structure to adjust it for analysis
* Transpose:
    * `fifa_players.set_index('club')[['name', 'nationality']].transpose()`
    * Alternately, `.T`
* **In this course, we will define reshaping data as converting data from wide to long format and vice versa.**
* To decide between using long or wide format, think about the unit of analysis:
    * Long format $\Rightarrow$ characteristic of a player
    * Wide format $\Rightarrow$ each player
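
The transpose call above can be tried on a small stand-in DataFrame. This is only a sketch: the `players` data below is invented for illustration, and only the column names `name`, `club`, and `nationality` come from the notes.

```python
import pandas as pd

# Hypothetical stand-in for the course's fifa_players data
players = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "club": ["Club X", "Club Y"],
    "nationality": ["Argentina", "Portugal"],
})

# Index by club, keep two columns, then swap rows and columns
by_club = players.set_index("club")[["name", "nationality"]]
transposed = by_club.transpose()   # by_club.T gives the same result

print(transposed)
```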
    
#### Wide to long transformation
* Performed using `pandas` tools, such as:
     * the `.melt()` method
     * the `pd.wide_to_long()` function
   
#### Long to wide format
* Transform data using `pandas` methods, for example:
    * `.pivot()`
    * `.pivot_table()`
    
#### Exercises: Flipping players

```
# Set name as index
fifa_transpose = fifa_players.set_index('name')

# Print fifa_transpose
print(fifa_transpose)

# Modify the DataFrame to keep only height and weight columns
fifa_transpose = fifa_players.set_index('name')[['height', 'weight']]
# Print fifa_transpose
print(fifa_transpose)

# Change the DataFrame so rows become columns and vice versa
fifa_transpose = fifa_players.set_index('name')[['height', 'weight']].transpose()

# Print fifa_transpose
print(fifa_transpose)
```

### Reshaping using pivot method
* The long format is usually the most suitable to store a clean dataset

#### Why go from long to WIDE format?
* Demonstrate relationship between two (or more) columns
* Time series operations with the variables
* Operations that require each variable to be in its own column
* Wide format allows us to **discover patterns**
* The `pivot()` method allows us to reshape the data from a long to a wide format

<img src='data/pivot_method2.png' width="600" height="300" align="center"/>

* The `pivot()` method takes three arguments:
    * **`index`:** takes the name of the column we want to have as an index in the new, pivoted DF.
    * **`columns`:** takes the name of the column we want to have as each column in the new, pivoted DF.
    * **`values`:** takes the name of the column of values with which we want to populate the new, pivoted DF.
* **If the method cannot find a row and column matching the original dataframe, it will set that cell value as a missing value (see the `NaN` above).**

<img src='data/pivot1.png' width="600" height="300" align="center"/>

```
fifa.pivot(index = 'name', columns = 'variable', values='metric_system')
```

<img src='data/pivot2.png' width="600" height="300" align="center"/>
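
A runnable sketch of the same call, assuming a tiny made-up long-format DataFrame (`fifa_long` and its values are hypothetical; only the column names follow the notes). One player deliberately has no `height` record, so the pivoted cell is filled with a missing value.

```python
import pandas as pd

# Hypothetical long-format data; Player B has no "height" record
fifa_long = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B"],
    "variable": ["weight", "height", "weight"],
    "metric_system": [72.0, 170.0, 83.0],
})

# One row per name, one column per distinct value of "variable"
wide = fifa_long.pivot(index="name", columns="variable", values="metric_system")

print(wide)
```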

* **Note** that we can also pass a list of two values to the pivot method:
* `fifa.pivot(index='name', columns='variable', values=['metric_system', 'imperical_system'])`
* In this case, the resulting DataFrame has a hierarchical column index with both column names as demonstrated below:

<img src='data/pivot3.png' width="600" height="300" align="center"/>

***
#### Pivoting multiple columns

<img src='data/pivot4.png' width="600" height="300" align="center"/>

* What if we want to extend the pivot method to all the column values in the DataFrame instead of just one or two?
* We can do this easily by omitting the values argument:
* `df.pivot(index='name', columns='variable')`
* We see that we get the same result as above when we identified multiple columns as `values` columns:

<img src='data/pivot5.png' width="600" height="300" align="center"/>
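
The following sketch compares the two approaches on a small invented DataFrame (the data and the name `fifa_long` are assumptions for illustration). Both calls should produce a two-level column index whose first level holds the original value column names.

```python
import pandas as pd

# Hypothetical long-format data with two value columns
fifa_long = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "variable": ["weight", "height", "weight", "height"],
    "metric_system": [72.0, 170.0, 83.0, 187.0],
    "imperial_system": [159.0, 5.58, 183.0, 6.14],
})

# Explicit list of value columns
explicit = fifa_long.pivot(index="name", columns="variable",
                           values=["metric_system", "imperial_system"])

# Omitting `values` pivots every column not used as index or columns
implicit = fifa_long.pivot(index="name", columns="variable")

print(implicit)
print(implicit.columns.nlevels)   # number of levels in the column index
```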

#### Duplicate entries error
* Passing only index and column arguments to the pivot method will work in most cases
* However, pay attention to the 3rd and 5th rows below:

<img src='data/pivot6.png' width="600" height="300" align="center"/>

* If we try to perform the same operation: `another_fifa.pivot(index='name', columns='variable')`, we get:

<img src='data/pivot7.png' width="600" height="300" align="center"/>

* Because pandas doesn't know which of the two values should fill the cell, it raises an error.
* We could choose to delete one of the rows (for example the fifth row) and then rerun the command without raising an error.
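
A small sketch of this situation, using invented data in which one name/variable pair appears twice (the values and row positions are assumptions, not the course data). The failing call is wrapped in `try`/`except` so the example runs to completion.

```python
import pandas as pd

# Hypothetical data: ("Player A", "weight") appears twice (rows 0 and 2)
another_fifa = pd.DataFrame({
    "name": ["Player A", "Player A", "Player A", "Player B"],
    "variable": ["weight", "height", "weight", "weight"],
    "metric_system": [72.0, 170.0, 75.0, 83.0],
})

try:
    another_fifa.pivot(index="name", columns="variable")
    pivot_failed = False
except ValueError as err:
    # pandas refuses to choose between the duplicated values
    pivot_failed = True
    print(err)

# Drop the repeated row by its index label, then pivot again
deduplicated = another_fifa.drop(2, axis=0)
wide = deduplicated.pivot(index="name", columns="variable")
print(wide)
```

Dropping a row is only reasonable when the duplicate is genuinely redundant; when both values matter, aggregating with `pivot_table` is the alternative discussed later in these notes.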

#### Exercises: Dribbling the pivot method 

```
# Pivot fifa_players to get overall scores indexed by name and identified by movement
fifa_overall = fifa_players.pivot(index='name', columns='movement', values='overall')

# Print fifa_overall
print(fifa_overall)

# Pivot fifa_players to get attacking scores indexed by name and identified by movement
fifa_attacking = fifa_players.pivot(index='name', columns='movement', values='attacking')

# Print fifa_attacking
print(fifa_attacking)

# Use the pivot method to get overall scores indexed by movement and identified by name
fifa_names = fifa_players.pivot(index='movement', columns='name', values='overall')

# Print fifa_names
print(fifa_names)
```

#### Exercises: Offensive or defensive player?

```
# Pivot fifa_players to get overall and attacking scores indexed by name and identified by movement
fifa_over_attack = fifa_players.pivot(index='name', 
                                     columns='movement', 
                                     values=['overall', 'attacking'])

# Print fifa_over_attack
print(fifa_over_attack)

# Use pivot method to get all the scores indexed by name and identified by movement
fifa_all = fifa_players.pivot(index='name',
                              columns='movement',
                              values=['overall', 'attacking'])

# Print fifa_all
print(fifa_all)
```

#### Exercises: Replay that last move!

```
# Drop the fifth row to delete all repeated rows
fifa_no_rep = fifa_players.drop(4, axis=0)

# Print fifa_no_rep
print(fifa_no_rep)

# Drop the fifth row to delete all repeated rows
fifa_no_rep = fifa_players.drop(4, axis=0)

# Pivot fifa players to get all scores by name and movement
fifa_pivot = fifa_no_rep.pivot(index='name', columns='movement', values=['overall', 'attacking']) 

# Print fifa_pivot
print(fifa_pivot)  
```

## Pivot tables

#### Pivot method limitations
* The `.pivot()` method has some limitations
* Great general purpose pivoting technique
* However, **it requires each index/column pair to be unique**
    * This is mainly due to the fact that the pivot method cannot aggregate values
    
### Pivot table
* A DataFrame containing statistics that summarize the data of a larger DataFrame
* To convert from the DataFrame in long format on the left to the DataFrame on the right with aggregated values, we can use the **`.pivot_table()`** method
* It is important to note that with this method we can also summarize DataFrames that are not in long format

<img src='data/pivot8.png' width="600" height="300" align="center"/>

* `df.pivot_table(index='Year', columns='Name', values='Weight', aggfunc='mean')`
* **Note** the new, additional parameter: **`aggfunc`**
    * **Default `aggfunc` is `mean`.**
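
As a sketch of how aggregation handles repeated index/column pairs, the example below uses a small invented DataFrame (only the column names `Year`, `Name`, and `Weight` come from the call above; the values are hypothetical).

```python
import pandas as pd

# Hypothetical long-format data with a repeated (Year, Name) pair
df = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020],
    "Name": ["Ana", "Ana", "Ben", "Ana"],
    "Weight": [60.0, 62.0, 80.0, 61.0],
})

# Repeated pairs are aggregated with the mean instead of raising an error
summary = df.pivot_table(index="Year", columns="Name",
                         values="Weight", aggfunc="mean")

print(summary)
```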
    
### Hierarchical indexes
* Another advantage of pivot tables is that we can have multi-level indexes, not only in the columns, but also in the rows: here, the row index has the levels `first` and `last`

<img src='data/pivot9.png' width="700" height="350" align="center"/>

### Margins
* Finally, we would like to get the number of attacking and overall scores each player has
* In the `pivot_table` method, by omitting a `values` argument, pandas will pivot **all values.**
* But, we will pass the **`margins`** argument
* `fifa_players.pivot_table(index=['first', 'last'], columns='movement', aggfunc='count', margins=True)`
* When the `margins` parameter is set to `True`, pandas adds an extra row and column (labeled `All` by default) that apply `aggfunc` across all rows and columns.
* In this case (with `aggfunc='count'`) we'll get the total counts for each row and column

<img src='data/pivot10.png' width="600" height="300" align="center"/>
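
A minimal sketch of the margins call, assuming a small made-up `fifa_players` DataFrame (the scores below are invented; only the column names follow the notes). It combines a two-level row index with `aggfunc='count'` and `margins=True`.

```python
import pandas as pd

# Hypothetical long-format scores with the player name split in two columns
fifa_players = pd.DataFrame({
    "first": ["Lionel", "Lionel", "Lionel", "Cristiano", "Cristiano"],
    "last": ["Messi", "Messi", "Messi", "Ronaldo", "Ronaldo"],
    "movement": ["shooting", "passing", "passing", "shooting", "passing"],
    "overall": [92, 92, 91, 93, 82],
    "attacking": [70, 92, 90, 89, 84],
})

# No `values` argument: every remaining column is counted.
# margins=True appends an "All" row and an "All" column per value column.
counts = fifa_players.pivot_table(index=["first", "last"], columns="movement",
                                  aggfunc="count", margins=True)

print(counts)
```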

#### Pivot or pivot table?
* *Does the DataFrame have more than one value for each index/column pair?*
* *Do you need to have a multi-index in your resulting pivoted DataFrame?*
* *Do you need summary statistics of your large DataFrame?*
* **If YES** Use `.pivot_table()`

#### Exercises: Reviewing the moves

```
# Discard the fifth row to delete all repeated rows
fifa_drop = fifa_players.drop(4, axis=0)

# Use pivot method to get all scores by name and movement
fifa_pivot = fifa_drop.pivot(index='name', columns='movement') 

# Print fifa_pivot
print(fifa_pivot)  

# Use pivot table to get all scores by name and movement
fifa_pivot_table = fifa_players.pivot_table(index='name', 
                                     columns='movement', 
                                     aggfunc='mean')
# Print fifa_pivot_table
print(fifa_pivot_table)
```

#### Exercises: Exploring the big match

```
# Use pivot table to display mean age of players by club and nationality 
mean_age_fifa = fifa_players.pivot_table(index='nationality', 
                                         columns='club', 
                                         values='age', 
                                         aggfunc='mean')

# Print mean_age_fifa
print(mean_age_fifa)

# Use pivot table to display max height of any player by club and nationality
tall_players_fifa = fifa_players.pivot_table(index='nationality', 
                                             columns='club', 
                                             values='height', 
                                             aggfunc='max')

# Print tall_players_fifa
print(tall_players_fifa)

# Use pivot table to show the count of players by club and nationality and the total count
players_country = fifa_players.pivot_table(index='nationality', 
                                    columns='club', 
                                    values='name', 
                                    aggfunc='count', 
                                    margins=True)

# Print players_country
print(players_country)
```

#### Exercises: The tallest and the heaviest

```
# Define a pivot table to get the characteristic by nationality and club
fifa_mean = fifa_players.pivot_table(index=['nationality', 'club'], 
                                     columns='year')

# Print fifa_mean
print(fifa_mean)

# Set the appropriate argument to show the maximum values
fifa_mean = fifa_players.pivot_table(index=['nationality', 'club'], 
                                     columns='year', 
                                     aggfunc='max')

# Print fifa_mean
print(fifa_mean)

# Set the argument to get the maximum for each row and column
fifa_mean = fifa_players.pivot_table(index=['nationality', 'club'], 
                                     columns='year', 
                                     aggfunc='max', 
                                     margins=True)

# Print fifa_mean
print(fifa_mean)
```

# $\star$ Chapter 2: Converting Between Wide and Long Format
Master the technique of reshaping DataFrames from wide to long format. In this chapter, you'll learn how to use the melting method and wide to long function before discovering how to handle string columns by concatenating or splitting them.

### Reshaping with melt
* In this lesson, we will learn how to reshape a DataFrame from wide to long format using the `melt` function.

#### Wide to long transformation
* Perform analytics
* Plot different variables in the same graph

<img src='data/wide_to_long.png' width="700" height="350" align="center"/>

* Most data is stored in a wide format
* The first argument to set is **`id_vars`**
* This argument takes the names of the column(s) to use as identifier variables
* **`df.melt(id_vars=["first","last"])`**
* The columns identified in `id_vars` will also appear in the long format table and will help us match all the records for the same observation
    * The rest of the columns are melted
    
<img src='data/pivot11.png' width="700" height="350" align="center"/>
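
Here is a short sketch of `melt` with two identifier columns. The DataFrame `df` and its values are invented for illustration; only the identifier names `first` and `last` come from the call above.

```python
import pandas as pd

# Hypothetical wide-format data
df = pd.DataFrame({
    "first": ["Lionel", "Cristiano"],
    "last": ["Messi", "Ronaldo"],
    "height": [170, 187],
    "weight": [72, 83],
})

# "first" and "last" are kept as identifiers; height and weight are melted
long_df = df.melt(id_vars=["first", "last"])

print(long_df)
```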

### Values and variables
* What can we do if we do not want to melt all the columns?
* We have other arguments for that purpose: 
    * **`value_vars`:** 
        * Takes the names of the columns we want to melt
        * This can be only one column or a list of many columns
    * **`var_name`:** 
        * Takes the name to use for the column "variable"
        * Default value is `variable`
    * **`value_name`:**
        * Takes the name to use for the column "value"
        * Default value is `value`

<img src='data/pivot12.png' width="700" height="350" align="center"/>

#### Specifying values to melt
* `books.melt(id_vars='title', value_vars=['language_code', 'num_pages'])`
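
The sketch below combines `value_vars`, `var_name`, and `value_name` in one call. The `books` data is made up, and the output column names `feature` and `content` are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical books data
books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "language_code": ["eng", "spa"],
    "num_pages": [418, 280],
    "rating": [3.9, 4.1],
})

# Only language_code and num_pages are melted; rating is left out entirely
books_long = books.melt(id_vars="title",
                        value_vars=["language_code", "num_pages"],
                        var_name="feature",
                        value_name="content")

print(books_long)
```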

#### Exercises: Gothic times

```
# Melt books_gothic using the title column as identifier 
gothic_melted = books_gothic.melt(id_vars='title')

# Print gothic_melted
print(gothic_melted)

# Melt books_gothic using the title, authors, and publisher columns as identifier
gothic_melted_new = books_gothic.melt(id_vars=['title', 'authors', 'publisher'])

# Print gothic_melted_new
print(gothic_melted_new)

# Melt publisher column using title and authors as identifiers
publisher_melted = books_gothic.melt(id_vars=['title', 'authors'], 
                                     value_vars='publisher')

# Print publisher_melted
print(publisher_melted)

# Melt rating and rating_count columns using the title as identifier
rating_melted = books_gothic.melt(id_vars='title', 
                                  value_vars=['rating', 'rating_count'])

# Print rating_melted
print(rating_melted)

# Melt rating and rating_count columns using title and authors as identifier
books_melted = books_gothic.melt(id_vars=['title', 'authors'], 
                                 value_vars=['rating', 'rating_count'])

# Print books_melted
print(books_melted)

# Melt the rating and rating_count using title, authors and publisher as identifiers
books_ratings = books_gothic.melt(id_vars=['title', 'authors', 'publisher'], 
                                  value_vars=['rating', 'rating_count'])

# Print books_ratings
print(books_ratings)

# Assign the name feature to the new variable column
books_ratings = books_gothic.melt(id_vars=['title', 'authors', 'publisher'], 
                                  value_vars=['rating', 'rating_count'], 
                                  var_name='feature')

# Print books_ratings
print(books_ratings)

# Assign the name number to the new column containing the values
books_ratings = books_gothic.melt(id_vars=['title', 'authors', 'publisher'], 
                                  value_vars=['rating', 'rating_count'], 
                                  var_name='feature', 
                                  value_name='number')

# Print books_ratings
print(books_ratings)
```

### Wide to long function
* In addition to `melt`, another function that can help us transform the data from wide to long is the **`pd.wide_to_long()`** function
    * **Notice** that this is a pandas function, and not a dataframe method
    
<img src='data/pivot13.png' width="800" height="400" align="center"/>

* `pd.wide_to_long(books, stubnames=['ratings', 'sold'], i='title', j='year')`

<img src='data/pivot14.png' width="700" height="350" align="center"/>
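
A runnable sketch of the call above, assuming a small invented `books` DataFrame whose wide columns consist of a stub name immediately followed by a year (the values are hypothetical).

```python
import pandas as pd

# Hypothetical wide data: stub name directly followed by a numeric suffix
books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "ratings2019": [4.2, 3.9],
    "ratings2020": [4.3, 4.0],
    "sold2019": [120, 95],
    "sold2020": [150, 80],
})

# The suffix (the year) moves into a new index level named "year"
books_long = pd.wide_to_long(books, stubnames=["ratings", "sold"],
                             i="title", j="year")

print(books_long)
```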

* **It is important to mention that if we have a DataFrame with a named index and we apply the `wide_to_long` function, the resulting DataFrame will not keep the original index.**

<img src='data/pivot15.png' width="600" height="300" align="center"/>

* If we want to keep a named index, we must modify the original dataframe by resetting the index without dropping it
* Then, apply the transformation including the new column

```
books_with_index.reset_index(drop=False, inplace=True)
pd.wide_to_long(books_with_index, stubnames=['ratings', 'sold'], i=['author', 'title'], j='year')
```

<img src='data/pivot16.png' width="600" height="300" align="center"/>
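
The same idea as a self-contained sketch, with made-up data. Unlike the snippet above, it assigns the result of `reset_index` to a new variable instead of using `inplace=True`; that is a stylistic choice for this example, not something the course prescribes.

```python
import pandas as pd

# Hypothetical wide data whose index is named "title"
books_with_index = pd.DataFrame(
    {
        "author": ["Author X", "Author Y"],
        "ratings2019": [4.2, 3.9],
        "ratings2020": [4.3, 4.0],
    },
    index=pd.Index(["Book A", "Book B"], name="title"),
)

# Turn the named index into a regular column so it can act as an identifier
books_reset = books_with_index.reset_index(drop=False)

books_long = pd.wide_to_long(books_reset, stubnames="ratings",
                             i=["author", "title"], j="year")

print(books_long)
```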

#### sep argument
* This new dataframe (below) is very similar to the previous one, but the name of the columns contains an underscore between the prefix (`ratings` or `sold`) and the suffix (the year, `2019` or `2020`).

<img src='data/pivot17.png' width="600" height="300" align="center"/>

* If we apply the transformation as before, we'll get an empty DataFrame:

<img src='data/pivot18.png' width="600" height="300" align="center"/>

* This happens because pandas doesn't recognize the column names
* **By default, pandas assumes that the prefix is *immediately* followed by a numeric suffix.**
* To overcome this, we can use the `sep` argument

<img src='data/pivot19.png' width="600" height="300" align="center"/>
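
A sketch of the effect of `sep`, assuming invented data with an underscore between the stub and the year. The first call reproduces the empty result described above; the second passes `sep='_'`.

```python
import pandas as pd

# Hypothetical wide data with an underscore between prefix and suffix
books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "ratings_2019": [4.2, 3.9],
    "ratings_2020": [4.3, 4.0],
})

# Without sep, no column looks like "ratings" + digits, so nothing is melted
no_sep = pd.wide_to_long(books, stubnames="ratings", i="title", j="year")

# With sep="_", the columns ratings_2019 and ratings_2020 are recognized
with_sep = pd.wide_to_long(books, stubnames="ratings", i="title", j="year",
                           sep="_")

print(len(no_sep), len(with_sep))
print(with_sep)
```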

#### suffix argument
* Finally, if the names of the wide columns do not end in a number (and instead end in a word such as `one` or `two`), applying the same transformation as before returns an empty DataFrame, since pandas assumes the suffixes are numeric
* To solve this, we use the `suffix` argument with a regular expression

<img src='data/pivot20.png' width="600" height="300" align="center"/>
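
A minimal sketch of the `suffix` argument on invented data whose column suffixes are words. The pattern is written as a raw string (`r'\w+'`) so that Python does not try to interpret the backslash itself.

```python
import pandas as pd

# Hypothetical wide data whose suffixes are words rather than numbers
books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "ratings_one": [4.2, 3.9],
    "ratings_two": [4.3, 4.0],
})

# suffix takes a regular expression; \w+ matches one or more word characters
books_long = pd.wide_to_long(books, stubnames="ratings", i="title",
                             j="version", sep="_", suffix=r"\w+")

print(books_long)
```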

#### Exercises: The golden age

```
# Reshape wide to long using title as index and version as new name, and extracting isbn prefix 
isbn_long = pd.wide_to_long(golden_age, 
                            stubnames='isbn', 
                            i='title', 
                            j='version')

# Print isbn_long
print(isbn_long)
```

```
# Reshape wide to long using title and authors as index and version as new name, and prefix as wide column prefix
prefix_long = pd.wide_to_long(golden_age, 
                      stubnames='prefix', 
                      i=['title', 'authors'], 
                      j='version')

# Print prefix_long
print(prefix_long)
```


```
# Reshape wide to long using title and authors as index and version as new name, and prefix and isbn as wide column prefixes
all_long = pd.wide_to_long(golden_age, 
                   stubnames=['isbn', 'prefix'], 
                   i=['title', 'authors'], 
                   j='version')

# Print all_long
print(all_long)
```

#### Exercises: Decrypting the code

```
# Reshape using author and title as index, code as new name and getting the prefix language and publisher
the_code_long = pd.wide_to_long(books_brown, 
                                stubnames=['language', 'publisher'], 
                                i=['author', 'title'], 
                                j='code',
                                sep='_')

# Print the_code_long
print(the_code_long)
```

```
# Specify underscore as the character that separates the variable names
the_code_long = pd.wide_to_long(books_brown, 
                                stubnames=['language', 'publisher'], 
                                i=['author', 'title'], 
                                j='code', sep='_')

# Print the_code_long
print(the_code_long)
```

```
# Specify that wide columns have a suffix containing words
the_code_long = pd.wide_to_long(books_brown, 
                                stubnames=['language', 'publisher'], 
                                i=['author', 'title'], 
                                j='code', 
                                sep='_', 
                                suffix=r'\w+')

# Print the_code_long
print(the_code_long)
```


```
# Modify books_hunger by resetting the index without dropping it
books_hunger.reset_index(drop=False, inplace=True)

# Reshape using title and language as index, feature as new name, publication and page as prefix separated by space and ending in a word
publication_features = pd.wide_to_long(books_hunger, 
                                       stubnames=['publication', 'page'], 
                                       i=['title', 'language'], 
                                       j='feature', 
                                       sep=' ', 
                                       suffix=r'\w+')

# Print publication_features
print(publication_features)
```

<img src='data/pivot.png' width="600" height="300" align="center"/>