# Pandas4 - Visualization

**[1] Reshape DataFrame for visualization**<br>

**[2] X-axis with categorical data**<br>
- (Line chart)
- Bar chart
- Area chart
- Pie chart

**[3] Numerical data**<br>
- Histogram
- Scatter plot
- Hexagon plot

In [None]:
import pandas as pd

## [1] Reshape DataFrame for visualization

In [None]:
score_df = pd.DataFrame({"ID":["S01","S02","S03","S04","S05","S01","S02","S03","S04","S05"],
                        "Exam":["Exam-1","Exam-1","Exam-1","Exam-1","Exam-1","Exam-2","Exam-2","Exam-2","Exam-2","Exam-2"],
                        "Score":[95, 75, 70, 65, 85, 85, 70, 90, 55, 60]})
score_df

- **Use <code>pivot_table()</code> to convert a dataframe from long-form to wide-form.**

In [None]:
# Target: "ID"
score_df.pivot_table(index = "ID", columns = "Exam", values = "Score")

- **Change the target**

In [None]:
# Target: "Exam"
score_df.pivot_table(index = "Exam", columns = "ID", values = "Score")
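
The reason for reshaping becomes clearer once the wide-form table is plotted: pandas draws one series per column and uses the index for the x-axis. The following is a minimal sketch that reuses the `score_df` data from above; the names `wide_df` and `ax` are introduced here only for illustration.

```python
import pandas as pd

score_df = pd.DataFrame({"ID": ["S01", "S02", "S03", "S04", "S05"] * 2,
                         "Exam": ["Exam-1"] * 5 + ["Exam-2"] * 5,
                         "Score": [95, 75, 70, 65, 85, 85, 70, 90, 55, 60]})

# Wide form: one row per student, one column per exam
wide_df = score_df.pivot_table(index="ID", columns="Exam", values="Score")

# Each column becomes its own group of bars; the index supplies the x-axis labels
ax = wide_df.plot(kind="bar")
```

Swapping `index` and `columns` in `pivot_table()` would put the exams on the x-axis and draw one bar per student instead.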

## [2] X-axis with categorical data
### [2.1] Line chart

A line chart is usually used to visualize the trend of data over a period of time.

- **Series**

In [None]:
series_A = pd.Series([67, 57, 87, 50, 97, 68], 
                     index = ["Jan","Feb","Mar","Apr","May","Jun"])
series_A

In [None]:
series_A.plot()

- **DataFrame: single line chart**

In [None]:
product_df = pd.DataFrame({"month":["Jan","Feb","Mar","Apr","May","Jun"],
                           "sales_A":[67, 57, 87, 50, 97, 68],
                           "sales_B":[78, 102, 113, 98, 80, 84]})
product_df 

In [None]:
product_df.plot(x = "month", y = "sales_A")

- **DataFrame: multiple line chart**

In [None]:
product_df.plot(x = "month", y= ["sales_A","sales_B"])

- **DataFrame: custom style**

In [None]:
product_df.plot(x = "month",
                y = ["sales_A","sales_B"], 
                marker = "o", 
                color = ["red","green"],
                linestyle = 'dashed',
                figsize = (8,3))
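
Beyond markers and colors, a chart usually needs a title and an axis label. The sketch below assumes pandas 1.1 or later, where `plot()` accepts `title` and `ylabel` directly; the label texts are made up for illustration.

```python
import pandas as pd

product_df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
                           "sales_A": [67, 57, 87, 50, 97, 68],
                           "sales_B": [78, 102, 113, 98, 80, 84]})

# title and ylabel are set in the same call as the style options
ax = product_df.plot(x="month",
                     y=["sales_A", "sales_B"],
                     marker="o",
                     title="Sales by month",   # illustrative title
                     ylabel="Units sold",      # illustrative y-axis label
                     figsize=(8, 3))
```

`plot()` returns a matplotlib `Axes` object, so the same labels could also be set afterwards with `ax.set_title()` and `ax.set_ylabel()`.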

## Exercise.A

**(A.1) Given the dataframe <code>expense_df</code>, convert it to the following wide-form format and store the result in a new variable named <code>expense_df_wide</code>.**<br>

Expected result:

||grocery|transportation|
|:-:|:-:|:-:|
|**01-2022**|3050|1050|
|**02-2022**|2800|900|
|**03-2022**|2750|1150|
|**04-2022**|2300|1850|
|**05-2022**|3150|1250|
|**06-2022**|2900|950|

In [None]:
expense_df = pd.DataFrame({'month':['01-2022','02-2022','03-2022','04-2022','05-2022','06-2022', '01-2022','02-2022','03-2022','04-2022','05-2022','06-2022'],
                            'expense':[3050, 2800, 2750, 2300, 3150, 2900,1050, 900, 1150, 1850, 1250, 950],
                            'category':['grocery', 'grocery', 'grocery', 'grocery', 'grocery', 'grocery', 'transportation','transportation','transportation','transportation','transportation','transportation']})
expense_df

**(A.2) Use the dataframe <code>expense_df_wide</code> obtained in (A.1). Draw a multiple line chart to show the monthly grocery and transportation expenses.**

**(A.3) Import dataset <code>fashion.csv</code>. Show the first five rows.**<br>

**(A.4) Show the sales trends of <code>Tiger_of_Sweden</code> with a line chart.**

**(A.5) Show the sales trends of <code>Eton</code>, <code>Levi_s</code>, and <code>Tiger_of_Sweden</code> with a multiple line chart.**<br>

Settings: Use <code>marker = "D"</code>, <code>figsize = (12,4)</code>,  <code>title = "Monthly Sales"</code>, <code>ylabel = "Sales"</code>. 

### [2.2] Bar chart

A bar chart is used to compare values of different categories.

- **Series**

In [None]:
series_A.plot(kind = 'bar')

In [None]:
series_A.plot(kind = 'barh')

- **DataFrame: single bar chart**

In [None]:
product_df.plot(kind = "bar", x = "month", y = "sales_A")

- **DataFrame: multiple bars**

In [None]:
product_df.plot(kind = "bar", x = "month", y= ["sales_A","sales_B"])

- **DataFrame: stacked bar chart**

In [None]:
product_df.plot(kind = "bar", x = "month", y = ["sales_A","sales_B"], stacked = True)

### [2.3] Area chart

- **Stacked area chart**

In [None]:
product_df.plot(kind = "area",  
                x = "month", 
                y = ["sales_A","sales_B"])  # By default, stacked = True for area charts

- **Unstacked area chart**

In [None]:
product_df.plot(kind = "area", 
                x = "month", 
                y = ["sales_A","sales_B"], 
                stacked = False)
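
In a stacked area (or stacked bar) chart, each series is drawn on top of the previous one, so the upper edge of a layer is the running total across the columns. A small sketch of the numbers behind the stacked chart, assuming the same `product_df` as above; `cumulative` is a name introduced here.

```python
import pandas as pd

product_df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
                           "sales_A": [67, 57, 87, 50, 97, 68],
                           "sales_B": [78, 102, 113, 98, 80, 84]})

# Running total across columns: the heights at which each layer ends
cumulative = product_df.set_index("month")[["sales_A", "sales_B"]].cumsum(axis=1)
print(cumulative)
```

The `sales_B` column of `cumulative` is the top of the stacked chart (sales_A + sales_B), whereas the unstacked chart draws both series from zero.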

### [2.4] Pie chart
A pie chart is used to show the proportion of each category to the whole.

In [None]:
spend_df = pd.DataFrame({'Jan':[1050, 250, 850, 3750],
                         'Feb':[1750, 850, 1050, 3050],
                         'Mar':[1150, 450, 950, 3250]},
                         index = ['transportation','dining out',"entertainment", 'grocery'])
spend_df

- **One pie chart**

In [None]:
spend_df.plot(kind = "pie", y = "Jan")

- **Multiple pie charts**

In [None]:
spend_df.plot.pie(subplots = True , figsize= (18,5), legend = False);
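
To make the proportions readable, the percentage of each slice can be computed directly and also printed on the chart. This is a sketch under the assumption that the `autopct` keyword is passed through to matplotlib's pie function, which is how pandas handles extra keyword arguments; `jan_share` is a name introduced here.

```python
import pandas as pd

spend_df = pd.DataFrame({"Jan": [1050, 250, 850, 3750],
                         "Feb": [1750, 850, 1050, 3050],
                         "Mar": [1150, 450, 950, 3250]},
                        index=["transportation", "dining out", "entertainment", "grocery"])

# Share of each category in January, in percent
jan_share = spend_df["Jan"] / spend_df["Jan"].sum() * 100
print(jan_share.round(1))

# autopct formats the percentage label drawn inside each slice
ax = spend_df["Jan"].plot(kind="pie", autopct="%.1f%%")
```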

## Exercise.B

**(B.1) Import dataset <code>parks.csv</code>. Show the first five rows.**

**(B.2) Select the national parks in the following five states and keep columns <code>Park Name</code>, <code>State</code>, and <code>Acres</code>. Use this subset to answer the following questions.**<br>
State: CA, CO, UT, AK, WA

**(B.3) Count the number of national parks in each state. Display the result using a bar graph.**<br>
Hint: (1) Group data using column "State". (2) The x-axis shows each state, and each bar is the number of national parks in each state.

**(B.4) Calculate the total area of national parks in each state. Display the result using a pie chart.**

## [3] Numerical data

In [None]:
diabetes_df = pd.read_csv("../dataset/diabetes.csv", dtype = {"Outcome":object})
diabetes_df.head(5)

In [None]:
# Keep observations with blood pressure greater than 0 (zero readings are treated here as invalid values)
diabetes_df = diabetes_df[diabetes_df.BloodPressure>0]

### [3.1] Histogram

A histogram is used to display the distribution of numerical data.

In [None]:
diabetes_df.plot(kind = "hist", y = "BloodPressure")

In [None]:
print("min:", diabetes_df.BloodPressure.min())
print("max:", diabetes_df.BloodPressure.max())
print("bin width:", (diabetes_df.BloodPressure.max()-diabetes_df.BloodPressure.min())/10)
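
The division by 10 above reflects the default of 10 equal-width bins between the minimum and the maximum. The sketch below checks this with `numpy.histogram` on a small made-up sample; the values are hypothetical and are not taken from `diabetes.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical blood pressure readings, for illustration only
values = pd.Series([24, 50, 62, 64, 66, 70, 72, 74, 80, 88, 90, 122])

# 10 equal-width bins spanning min..max give 11 bin edges
counts, edges = np.histogram(values, bins=10)
width = (values.max() - values.min()) / 10

print("bin edges:", edges)
print("bin width:", width)
print("counts per bin:", counts)
```

Every observation falls into exactly one bin, so the counts add up to the number of values.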

- **Custom bins**

In [None]:
diabetes_df.plot(kind = "hist", 
                 y = "BloodPressure", 
                 bins = 15, 
                 edgecolor = "black")

In [None]:
diabetes_df.plot(kind = "hist", 
                 y = "BloodPressure", 
                 bins = range(0,130,5), 
                 edgecolor = "black")

### [3.2] Scatter plot
Scatter plots are used to observe the relationship between two variables.

In [None]:
diabetes_df.plot(kind = "scatter", x = "Age", y = "BloodPressure")

In [None]:
diabetes_df.plot(kind = "scatter", x = "Age", y = "BloodPressure", c = "BMI", cmap = "viridis")

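### [3.3] Hexagon plot
A hexagonal bin plot groups the points into hexagonal cells and colors each cell by the number of points it contains, which helps when a scatter plot is too dense to read.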

In [None]:
diabetes_df.plot(kind = "hexbin", x = "Age", y= "BloodPressure", gridsize = 15)

## Exercise.C

**(C.1) Import dataset <code>wine.csv</code> and set the first column as the index. Display the first 5 rows.** <br>
Hint: <code>index_col = [0]</code>

Description of each column
- **country**: The country that the wine is from
- **description**: A few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
- **designation**: The vineyard within the winery where the grapes that made the wine are from
- **points**: The number of points WineEnthusiast rated the wine on a scale of 1-100
- **price**: The cost for a bottle of the wine
- **province**: The province or state that the wine is from
- **region_1**: The wine growing area in a province or state
- **region_2**: Sometimes there are more specific regions specified within a wine growing area, but this value can sometimes be blank
- **variety**: The type of grapes used to make the wine
- **winery**: The winery that made the wine

**(C.2) Select a subset that satisfies the following two conditions. Use this subset for the following tasks.**<br>
- Select wines (rows) from Spain, Italy or France (use column <code>country</code>).
- Select wines (rows) with a price of less than 200 (use column <code>price</code>).

**(C.3) Use a histogram to show the price distribution of French wines.**<br>
Hint: Use column <code>price</code>.

**(C.4) Use a scatter plot to show the relationship between price and the points received in the review.**<br>
Hint: Use columns <code>price</code> and <code>points</code>.