Note: This notebook was completed alongside the DataCamp course of the same name.
# Reshaping Data with pandas
Often data is in a human-readable format, but it’s not suitable for data analysis. This is where pandas can help—it’s a powerful tool for reshaping DataFrames into different formats. In this course, you’ll grow your data scientist and analyst skills as you learn how to wrangle string columns and nested data contained in a DataFrame. You’ll work with real-world data, including FIFA player ratings, book reviews, and churn analysis data, as you learn how to reshape a DataFrame from wide to long format, stack and unstack rows and columns, and get descriptive statistics of a multi-index DataFrame.

**Instructor:** Maria Eugenia Inzaugarat, PhD Data Scientist

# $\star$ Chapter 1: Introduction to Data Reshaping
Let's start by understanding the concept of wide and long formats and the advantages of using each of them. You’ll then learn how to pivot data from long to wide format, and get summary statistics from a large DataFrame.

* Wide and long formats
* Long to wide transformation
* Wide to long transformation
* Stacking and unstacking columns
* Reshaping and handling complex data, such as string columns or JSON data
* Nested data
* Statistical data formats
* Multi-level index DataFrames

### Shape of data
* The way in which a dataset is organized into rows and columns

#### Wide format
* Each feature is in a separate column
* Each row contains many features of the same player
* Wide format has **no repeated records**, but this **could lead to missing values.**
* This format is preferred for **simple statistics and imputation**, such as calculating a column mean or filling in missing values.

<img src='data/wide_format.png' width="600" height="300" align="center"/>
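
The wide-format points above can be sketched with a small, made-up DataFrame. The player names and numbers here are hypothetical placeholders, not the course's FIFA data.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per player, one column per feature
wide = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "age": [np.nan, 33],          # Player A's age is missing
    "height": [170, 187],
    "weight": [72, 83],
})

# Simple statistics work column by column; NaN is skipped by default
mean_age = wide["age"].mean()

# Imputation is just as direct: fill the missing age with the column mean
wide_imputed = wide.fillna({"age": mean_age})
```

Each player appears exactly once, and the missing age shows up as an explicit `NaN` cell rather than as an absent row.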

#### Long format
* Each row represents one feature
* Multiple rows for each player
* Notice that there is no row for the feature `age` for the first player
* This happens because we had a missing value there
* A column (`name`) that identifies the same player through the records
* These are typical characteristics of the long format, which is usually seen as the standard for a tidy dataset
* Tidy data:
    * Better to summarize data
    * Key-value pairs
    * Preferred or required for many advanced graphing and analysis techniques

<img src='data/long_format.png' width="400" height="200" align="center"/>
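
As a rough illustration of the long format, here is a made-up DataFrame in which one player has no `age` record. The column names `variable` and `value` and all numbers are assumptions chosen for this sketch.

```python
import pandas as pd

# Hypothetical long-format data: one row per (player, feature) pair.
# Player A has no "age" row because that value was missing.
long_df = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B", "Player B"],
    "variable": ["height", "weight", "age", "height", "weight"],
    "value": [170, 72, 33, 187, 83],
})

# The "name" column ties together the rows that belong to one player
rows_per_player = long_df.groupby("name").size()

# Key-value layout makes it easy to summarize every feature in one step
feature_means = long_df.groupby("variable")["value"].mean()
```

The missing value costs nothing here: the row is simply absent, which is why `Player A` has two rows and `Player B` has three.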

#### Reshaping data
* In a broad sense, reshaping data is transforming a data structure to adjust it for analysis
* Transpose:
    * `fifa_players.set_index('club')[['name', 'nationality']].transpose()`
    * Alternately, `.T`
* **In this course, we will define reshaping data as converting data from wide to long format and vice versa.**
* To decide between using long or wide format, think about the unit of analysis:
    * Long format $\Rightarrow$ the unit of analysis is a single characteristic of a player
    * Wide format $\Rightarrow$ the unit of analysis is the player as a whole
    
#### Wide to long transformation
* Performed using `pandas` functions, such as:
     * `.melt()` (a DataFrame method)
     * `pd.wide_to_long()` (a top-level function)
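
A minimal sketch of both wide-to-long tools, using made-up data. The `score2019`/`score2020` columns are hypothetical and exist only to give `pd.wide_to_long` a shared prefix to work with.

```python
import pandas as pd

# Hypothetical wide data: one row per player
wide = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "height": [170, 187],
    "weight": [72, 83],
})

# DataFrame.melt: every non-id column becomes a (variable, value) pair
long_df = wide.melt(id_vars="name")

# pd.wide_to_long expects columns that share a stub name and differ
# only by a suffix (here "score" + year)
scores = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "score2019": [90, 88],
    "score2020": [92, 89],
})
scores_long = pd.wide_to_long(scores, stubnames="score", i="name", j="year")
```

`melt` returns the columns `name`, `variable`, and `value`, while `wide_to_long` returns a single `score` column indexed by `name` and `year`.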
   
#### Long to wide format
* Transform data using `pandas` methods, for example:
    * `.pivot()`
    * `.pivot_table()`
    
#### Exercises: Flipping players

```
# Set name as index
fifa_transpose = fifa_players.set_index('name')

# Print fifa_transpose
print(fifa_transpose)

# Modify the DataFrame to keep only height and weight columns
fifa_transpose = fifa_players.set_index('name')[['height', 'weight']]

# Print fifa_transpose
print(fifa_transpose)

# Change the DataFrame so rows become columns and vice versa
fifa_transpose = fifa_players.set_index('name')[['height', 'weight']].transpose()

# Print fifa_transpose
print(fifa_transpose)
```

### Reshaping using pivot method
* The long format is usually the most suitable for storing a clean dataset

#### Why go from long to WIDE format?
* Demonstrate relationship between two (or more) columns
* Time series operations with the variables
* Operations that require each variable to be in its own column
* Wide format allows us to **discover patterns**
* The `pivot()` method allows us to reshape the data from a long to a wide format

<img src='data/pivot_method2.png' width="600" height="300" align="center"/>

* The `pivot()` method takes three arguments:
    * **`index`:** takes the name of the column we want to have as an index in the new, pivoted DF.
    * **`columns`:** takes the name of the column we want to have as each column in the new, pivoted DF.
    * **`values`:** takes the name of the column of values with which we want to populate the new, pivoted DF.
* **If an index/column combination does not occur in the original DataFrame, `pivot()` sets that cell to a missing value (see the `NaN` above).**
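
The three arguments can be tried on a small made-up frame. The column names follow the ones used in these notes (`name`, `variable`, `metric_system`); the rows themselves are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data; Player A has no "age" record
fifa = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B", "Player B"],
    "variable": ["height", "weight", "age", "height", "weight"],
    "metric_system": [170.0, 72.0, 33.0, 187.0, 83.0],
})

# index   -> row labels of the result
# columns -> one new column per unique value of "variable"
# values  -> the numbers that fill the cells
wide = fifa.pivot(index="name", columns="variable", values="metric_system")

# No (Player A, age) row exists in the input, so that cell is NaN
player_a_age_missing = pd.isna(wide.loc["Player A", "age"])
```

The result has one row per player and the columns `age`, `height`, and `weight`.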

<img src='data/pivot1.png' width="600" height="300" align="center"/>

```
fifa.pivot(index='name', columns='variable', values='metric_system')
```

<img src='data/pivot2.png' width="600" height="300" align="center"/>

* **Note** that we can also pass a list of column names to the `values` argument of `pivot()`:
* `fifa.pivot(index='name', columns='variable', values=['metric_system', 'imperial_system'])`
* In this case, the resulting DataFrame has a hierarchical column index with both column names as demonstrated below:

<img src='data/pivot3.png' width="600" height="300" align="center"/>
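
A small sketch of how the hierarchical column index can be inspected and sliced. The data is made up, and the `imperial_system` numbers are rough conversions used only for illustration.

```python
import pandas as pd

# Hypothetical long-format data with two value columns
fifa = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "variable": ["height", "weight", "height", "weight"],
    "metric_system": [170.0, 72.0, 187.0, 83.0],      # cm, kg
    "imperial_system": [5.58, 158.7, 6.14, 183.0],    # ft, lb (approximate)
})

wide = fifa.pivot(index="name", columns="variable",
                  values=["metric_system", "imperial_system"])

# The columns form a two-level MultiIndex: (value column, variable)
n_levels = wide.columns.nlevels

# Select a single cell column with a tuple, or a whole block by its top level
height_cm = wide[("metric_system", "height")]
metric_only = wide["metric_system"]
```

Selecting by the top level (`wide["metric_system"]`) gives back an ordinary single-level DataFrame with `height` and `weight` columns.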

***
#### Pivoting multiple columns

<img src='data/pivot4.png' width="600" height="300" align="center"/>

* What if we want to extend the pivot method to all the column values in the DataFrame instead of just one or two?
* We can do this easily by omitting the values argument:
* `df.pivot(index='name', columns='variable')`
* We see that we get the same result as above, when we passed multiple columns to the `values` argument:

<img src='data/pivot5.png' width="600" height="300" align="center"/>
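
To see which columns get pivoted when `values` is left out, here is a sketch on made-up data (the frame and its numbers are hypothetical).

```python
import pandas as pd

# Hypothetical long-format data with two value columns
fifa = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "variable": ["height", "weight", "height", "weight"],
    "metric_system": [170.0, 72.0, 187.0, 83.0],
    "imperial_system": [5.58, 158.7, 6.14, 183.0],
})

# With no `values` argument, every column not used as index or columns
# is pivoted
all_values = fifa.pivot(index="name", columns="variable")

# The top level of the column index lists the columns that were pivoted
value_columns = all_values.columns.get_level_values(0).unique().tolist()
```

Here `value_columns` contains both `metric_system` and `imperial_system`, so omitting `values` is a shortcut for listing every remaining column by hand.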

#### Duplicate entries error
* Passing only the `index` and `columns` arguments to the pivot method will work in most cases
* However, pay attention to the 3rd and 5th rows below:

<img src='data/pivot6.png' width="600" height="300" align="center"/>

* If we try to perform the same operation: `another_fifa.pivot(index='name', columns='variable')`, we get:

<img src='data/pivot7.png' width="600" height="300" align="center"/>

* Because pandas cannot tell which of the two values belongs in that cell, it raises a `ValueError`.
* We could choose to delete one of the duplicate rows (for example, the fifth row) and then rerun the command, which now completes without an error.
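
The error and the fix can be reproduced on a made-up frame in which the third and fifth rows share the same `name`/`variable` pair. The values are hypothetical; only the structure mirrors the example above.

```python
import pandas as pd

# Hypothetical data: rows 3 and 5 (positions 2 and 4) both describe
# Player B's height
another_fifa = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B", "Player B"],
    "variable": ["height", "weight", "height", "weight", "height"],
    "metric_system": [170.0, 72.0, 187.0, 83.0, 186.0],
})

# pivot() refuses to choose between the two candidate values
try:
    another_fifa.pivot(index="name", columns="variable")
    error_message = None
except ValueError as err:
    error_message = str(err)

# Option 1: drop the fifth row by its index label, then pivot again
wide = another_fifa.drop(4).pivot(index="name", columns="variable")

# Option 2: keep the first row of each (name, variable) pair
deduplicated = another_fifa.drop_duplicates(subset=["name", "variable"])
```

Which row to keep is a data-quality decision; dropping the fifth row here simply follows the example in the text.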

<img src='data/NER_example.png' width="600" height="300" align="center"/>