# Assignment 3

## Table of Contents
- [Visualization Technique](#visualization-technique)
- [Visualization Library](#visualization-library)
- [Demonstration and Conclusion](#demonstration)

#### NOTE: All files are uploaded to:
Vocareum under the path: /voc/work
- filenames: assignment3_Demo.mp4, assignment3.ipynb, and vehicles.csv

GitHub repository link: https://github.com/duhoang920/assignment3
- filenames: assignment3_Demo.mp4, assignment3.ipynb, and vehicles.csv

Video Demo: https://youtu.be/s9P7w3_Ljcc

<a name="visualization-technique"></a>
### Visualization Technique

In this dashboard, I use a combination of four visualization types (line chart, bar chart, scatter plot, and box plot) to explore the [EPA Auto MPG](https://www.fueleconomy.gov/feg/download.shtml) dataset. These visualizations are designed to work in tandem to answer different questions about fuel efficiency trends and the factors that influence them across vehicle makes, regions, and configurations.
- **Line charts** are very effective at highlighting temporal trends (changes or patterns over a period of time). In this dashboard, the line chart illustrates how average fuel economy (combined MPG) has changed over the model years, making it easier to spot long-term trends and shifts driven by regulatory impacts or technological advancements. This can help users explore which regions are improving faster than others.
- **Bar charts** excel at comparing categories side-by-side, which highlights categorical differences. Here, the bar chart shows the 20 automotive brands with the highest average combined MPG, grouped by region. This allows for direct brand-level comparisons that help identify which automakers produce the most fuel-efficient vehicles. The chart is dynamically filtered by the user's selections for cylinders, fuel type, and region, which allows for more in-depth side-by-side comparisons between automakers.
- **Scatter plots** are ideal for examining relationships between two continuous variables to uncover patterns and outliers that might not be apparent in aggregated data. In this assignment, the scatter plot displays engine displacement (L) against combined MPG, grouped by region. This helps the user examine how engine size correlates with fuel economy. This type of visualization also supports a hover feature that lets users see specific model names and years, adding more depth to the analysis.
- **Box plots** are powerful for summarizing the distribution, spread, and outliers within categories. The box plot in this dashboard summarizes combined MPG distributions across engine size (number of cylinders), broken down by region. This allows for the comparison of fuel efficiency by engine size.
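
To make the box plot summary concrete, here is a minimal sketch of the statistics a box plot typically encodes. The values are made up for illustration and are not taken from the EPA data. It assumes pandas' default linear interpolation for quartiles and the common 1.5 × IQR rule for outliers; a plotting library may compute quartiles slightly differently.

```python
import pandas as pd

# Made-up combined MPG values for illustration only (not from the EPA data).
mpg = pd.Series([10, 20, 22, 24, 26, 28, 30, 60])

q1 = mpg.quantile(0.25)   # first quartile (pandas default: linear interpolation)
median = mpg.median()     # line drawn inside the box
q3 = mpg.quantile(0.75)   # third quartile
iqr = q3 - q1             # height of the box

# Common 1.5 * IQR rule: points beyond these fences are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = mpg[(mpg < lower_fence) | (mpg > upper_fence)].tolist()

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Outliers: {outliers}")  # [10, 60]
```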

Each of these visualizations plays a distinct role, and together they provide a multi-dimensional view of the data. The line chart captures trends over time, the bar chart compares averages across categories, the scatter plot explores the relationships and trade-offs between variables, and the box plot highlights variability and typical values across groups. In this dashboard, they form a cohesive analytical tool where users can move from a high-level overview (trends) to automaker-specific details and engineering-related comparisons.

The dashboard's interactivity comes from three widget filters for region, fuel type, and engine size (number of cylinders). Because these filters control all visualizations dynamically and simultaneously, users can compare the same subset of the data consistently across charts. This turns static plots into an exploratory tool that adapts to the specific questions a user wants to answer, making the dashboard both informative and engaging.

<a name="visualization-library"></a>
### Visualization Library

For this assignment, I use Plotly Express to create interactive visualizations and Panel to build the interactive dashboard, alongside Pandas for data wrangling.

[Panel](https://panel.holoviz.org/), developed by the [HoloViz](https://panel.holoviz.org/about/people.html) team, is an open-source library for building interactive dashboards directly in Jupyter Notebooks. It supports a wide range of interactive widgets, such as checkboxes and selectors, that allow the user to filter and explore the data dynamically. Panel uses a reactive programming model: the @pn.depends decorator lets the dashboard automatically update visualizations when the user changes input values. Lastly, it offers flexible layout management through Rows, Columns, Tabs, and Grids, and it integrates smoothly with Plotly and other Python plotting libraries such as Seaborn.

[Panel Install](https://panel.holoviz.org/getting_started/installation.html):
- Pip Install: pip install panel watchfiles
- Conda Install: conda install panel watchfiles
- NOTE: For both Conda and Pip, installing watchfiles alongside Panel is recommended during development. It provides a significantly better experience with Panel’s autoreload feature when running in --dev mode. It is not needed in production.

[Plotly Express](https://plotly.com/python/plotly-express/) is an open-source library, essentially a high-level wrapper around the Plotly library, developed by [Plotly Inc.](https://plotly.com/), a data visualization company. It enables users to quickly create interactive charts with minimal code. Plotly Express has a declarative interface, meaning that users specify what they want to display, such as axis values or color grouping, without having to manually define every visual element. Plotly Express includes built-in interactivity features such as zooming, panning, tooltips, and the ability to download the plot as a PNG, which enhance the user's experience and data exploration capabilities.

[Plotly Express Install](https://plotly.com/python/getting-started/):
- Pip Install: pip install plotly
- Conda Install: conda install -c conda-forge plotly

The combination of Panel and Plotly Express delivers an interactive and user-friendly experience. Panel follows a reactive model, meaning that user interactions automatically trigger updates to the visualizations. The @pn.depends decorator enables functions to respond dynamically to widget changes, ensuring real-time updates to the dashboard. In contrast, Plotly Express uses a declarative approach: users define what they want to visualize (axis values, groupings, etc.) and the library handles the rest. This makes it easy to create interactive charts with minimal code. Lastly, both integrate smoothly with Jupyter Notebooks, where Panel manages the layout and user input and Plotly Express handles the chart interactivity.

<a name="demonstration"></a>
### Demonstration and Conclusion

The EPA Auto MPG dataset provides information about different vehicle models, their fuel efficiency (MPG), engine size, displacement, model year, and origin. This rich dataset allows for the exploration of relationships between all of these vehicle attributes and their impact on fuel efficiency over time and across different regions.

For https://www.fueleconomy.gov/feg/ws/

#### Import libraries: 
This section imports all of the necessary libraries for the dashboard:
- pandas: data manipulation
- numpy: numerical operations
- os: file management
- plotly.express: interactive visualizations
- panel: framework for building the interactive dashboard.

In [51]:
# Import all of the packages needed
import pandas as pd # Data manipulation and analysis
import numpy as np  # Numerical operations
import os
import panel as pn  # Dashboard framework
import plotly.express as px # Create interactive plots

pn.extension('plotly') # Initializes Plotly support in Panel for use in Jupyter.

#### Load data:
Attempts to load the vehicles.csv file.
- If unsuccessful, a prompt is displayed asking the user to check the file path or download the file manually from the website listed below.

In [52]:
# Load the dataset vehicles.csv and convert to DataFrame

try:
    # Attempt to read the vehicles.csv file and convert to a DataFrame.
    df = pd.read_csv('vehicles.csv')
    print("Dataset loaded successfully.")
    print(f"Initial dataset shape: {df.shape}")
    print(f"Initial columns:\n {df.columns.tolist()}")
except (OSError, pd.errors.ParserError):
    # If the file is not found or cannot be read, print guidance and re-raise,
    # because every later cell depends on df.
    print("Please verify the file path and file name. You might need to manually download at https://www.fueleconomy.gov/feg/download.shtml and place 'vehicles.csv' in the same directory.")
    raise

# The dataset is missing the origin (where the automaker is located), so I looked up every automaker in the original data and created a lookup dictionary.
origin_dict = {
    'AM General': 'USA', 'ASC Incorporated': 'USA', 'Acura': 'Asia', 'Alfa Romeo': 'Europe',
    'American Motors Corporation': 'USA', 'Aston Martin': 'Europe', 'Audi': 'Europe',
    'Aurora Cars Ltd': 'USA', 'Autokraft Limited': 'Europe', 'Avanti Motor Corporation': 'USA',
    'Azure Dynamics': 'USA', 'BMW': 'Europe', 'BMW Alpina': 'Europe', 'BYD': 'Asia',
    'Bentley': 'Europe', 'Bertone': 'Europe', 'Bill Dovell Motor Car Company': 'USA',
    'Bitter Gmbh and Co. Kg': 'Europe', 'Bugatti': 'Europe', 'Bugatti Rimac': 'Europe',
    'Buick': 'USA', 'CCC Engineering': 'USA', 'CODA Automotive': 'Asia', 'CX Automotive': 'USA',
    'Cadillac': 'USA', 'Chevrolet': 'USA', 'Chrysler': 'USA', 'Consulier Industries Inc': 'USA',
    'Dabryan Coach Builders Inc': 'USA', 'Dacia': 'Europe', 'Daewoo': 'Asia',
    'Daihatsu': 'Asia', 'Dodge': 'USA', 'E. P. Dutton, Inc.': 'USA', 'Eagle': 'USA',
    'Environmental Rsch and Devp Corp': 'USA', 'Evans Automobiles': 'USA', 'Excalibur Autos': 'USA',
    'Federal Coach': 'USA', 'Ferrari': 'Europe', 'Fiat': 'Europe', 'Fisker': 'USA',
    'Ford': 'USA', 'GMC': 'USA', 'General Motors': 'USA', 'Genesis': 'Asia', 'Geo': 'USA',
    'Goldacre': 'USA', 'Grumman Allied Industries': 'USA', 'Grumman Olson': 'USA'
}

df['origin'] = df['make'].map(origin_dict) # Creates new 'origin' column and maps the origin to the auto maker

Dataset loaded successfully.
Initial dataset shape: (48671, 84)
Initial columns:
 ['barrels08', 'barrelsA08', 'charge120', 'charge240', 'city08', 'city08U', 'cityA08', 'cityA08U', 'cityCD', 'cityE', 'cityUF', 'co2', 'co2A', 'co2TailpipeAGpm', 'co2TailpipeGpm', 'comb08', 'comb08U', 'combA08', 'combA08U', 'combE', 'combinedCD', 'combinedUF', 'cylinders', 'displ', 'drive', 'engId', 'eng_dscr', 'feScore', 'fuelCost08', 'fuelCostA08', 'fuelType', 'fuelType1', 'ghgScore', 'ghgScoreA', 'highway08', 'highway08U', 'highwayA08', 'highwayA08U', 'highwayCD', 'highwayE', 'highwayUF', 'hlv', 'hpv', 'id', 'lv2', 'lv4', 'make', 'model', 'mpgData', 'phevBlended', 'pv2', 'pv4', 'range', 'rangeCity', 'rangeCityA', 'rangeHwy', 'rangeHwyA', 'trany', 'UCity', 'UCityA', 'UHighway', 'UHighwayA', 'VClass', 'year', 'youSaveSpend', 'baseModel', 'guzzler', 'trans_dscr', 'tCharger', 'sCharger', 'atvType', 'fuelType2', 'rangeA', 'evMotor', 'mfrCode', 'c240Dscr', 'charge240b', 'c240bDscr', 'createdOn', 'modifiedOn', 


Columns (74,75,77) have mixed types. Specify dtype option on import or set low_memory=False.



#### Data Wrangling:
Create a lookup dictionary for the vehicle origin (region), which is missing from the dataset.
- Then map the vehicle origin onto the original dataset.

Clean and prep the final dataset.
- Create a copy of the original DataFrame to avoid overwriting it.
- Select only the relevant columns and rename them.
- Drop rows with missing values (NaN).
- Convert relevant columns to appropriate data types so there are no data type issues later on.
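
One consequence of these steps is worth seeing on a small example: any make that is absent from the lookup dictionary maps to NaN, so its rows are removed by the later drop step. The sketch below uses made-up rows and a deliberately incomplete lookup (both hypothetical, for illustration only) to show how to list unmapped makes before dropping them.

```python
import pandas as pd

# Made-up rows and a deliberately incomplete lookup, for illustration only.
toy = pd.DataFrame({
    "make": ["Acura", "Ford", "Toyota"],
    "cylinders": [4.0, 8.0, 4.0],
})
lookup = {"Acura": "Asia", "Ford": "USA"}  # "Toyota" is intentionally missing

# Series.map returns NaN for any key that is not in the dictionary.
toy["origin"] = toy["make"].map(lookup)

# List the makes that would silently disappear in the dropna step.
unmapped = sorted(toy.loc[toy["origin"].isna(), "make"].unique())
print(f"Makes without an origin: {unmapped}")

# Drop incomplete rows, then convert types on an independent copy.
cleaned = toy.dropna(subset=["cylinders", "origin"]).copy()
cleaned["cylinders"] = cleaned["cylinders"].astype(int)
print(cleaned)
```

Checking the unmapped list first makes it easier to tell whether a large drop in row count comes from genuinely missing data or from gaps in the lookup.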

In [53]:
# Wrangle and clean the dataset
df_clone = df.copy() # Make a copy to avoid modifying the original df dataframe. This helps when you are running multiple cells

# Rename relevant columns to make them easier to understand in the dashboard.
df_clone = df_clone.rename(columns={ 
    'year': 'Model Year',
    'cylinders': 'Cylinders',
    'displ': 'Engine Displacement',
    'trany': 'Transmission',
    'VClass': 'Vehicle Class',
    'fuelType': 'Fuel Type Sub Category',
    'fuelType1': 'Fuel Type',
    'comb08': 'Combined MPG',
    'city08': 'City MPG',
    'highway08': 'Highway MPG',
    'make': 'Make',
    'model': 'Model',
    'drive': 'Drive Type',
    'eng_dscr': 'Engine Description',
    'co2TailpipeGpm': 'CO2 Tailpipe',
    'origin': 'Origin'}
    )

# print(f"\nAfter column rename: \n {df.head()}") # Debugging ONLY

# Select relevant columns for visualization.
vehicle_df = df_clone[['Model Year', 'Cylinders', 'Engine Displacement', 'Transmission', 'Vehicle Class', 'Fuel Type Sub Category', 'Fuel Type', 'Combined MPG', 'City MPG', 'Highway MPG', 'Make', 'Model', 'Drive Type', 'Engine Description', 'CO2 Tailpipe', 'Origin']]
# print(vehicle_df.head()) # Debugging ONLY

# Cleaning data
print(f"\nFinal vehicle dataset shape BEFORE DROP NAN: {vehicle_df.shape}")
vehicle_df = vehicle_df.dropna(subset=['Model Year', 'Cylinders', 'Combined MPG', 'Origin']).copy() # Drop rows with missing data; .copy() makes the result independent so the assignments below do not modify a slice

# Convert relevant columns to appropriate data types so there are no data type issues later on.
vehicle_df['Cylinders'] = vehicle_df['Cylinders'].astype(int) # force integer datatype
vehicle_df['Model Year'] = vehicle_df['Model Year'].astype(int) # force integer datatype
vehicle_df['Origin'] = vehicle_df['Origin'].astype(str) # force string datatype

print(f"\nFinal vehicle dataset shape AFTER DROP NAN: {vehicle_df.shape}")
print("First 5 rows of cleaned data:\n")
print(vehicle_df.head()) # Debugging ONLY



Final vehicle dataset shape BEFORE DROP NAN: (48671, 16)

Final vehicle dataset shape AFTER DROP NAN: (21662, 16)
First 5 rows of cleaned data:

    Model Year  Cylinders  Engine Displacement     Transmission  \
0         1985          4                  2.0     Manual 5-spd   
1         1985         12                  4.9     Manual 5-spd   
2         1985          4                  2.2     Manual 5-spd   
3         1985          8                  5.2  Automatic 3-spd   
14        1985          8                  5.2  Automatic 3-spd   

      Vehicle Class Fuel Type Sub Category         Fuel Type  Combined MPG  \
0       Two Seaters                Regular  Regular Gasoline            21   
1       Two Seaters                Regular  Regular Gasoline            11   
2   Subcompact Cars                Regular  Regular Gasoline            27   
3              Vans                Regular  Regular Gasoline            11   
14             Vans                Regular  Regular Gasoline 

In [54]:
# Generate sorted list of unique values for filters used in widgets
makes = sorted(vehicle_df['Make'].unique().tolist())
# print(f"Number of Auto Makers: {len(makes)}") # Debugging ONLY
# print(makes) # Debugging ONLY
origin_type = sorted(vehicle_df['Origin'].unique().tolist())
# print(origin_type) # Debugging ONLY
num_cylinder = sorted(vehicle_df['Cylinders'].unique().tolist())
# print(num_cylinder) # Debugging ONLY
model_year = sorted(vehicle_df['Model Year'].unique().tolist()) # Did NOT use, but leave if I decide to use later.
# print(model_year) # Debugging ONLY
fuel_type = sorted(vehicle_df['Fuel Type'].unique().tolist())

#### Create Widget:
- Create 3 widgets for filtering: vehicle region, engine size (number of cylinders), and fuel type.

In [55]:
# Create Panel Widgets - allows users to filter data in graphics.
origin_selector = pn.widgets.CheckBoxGroup(name='Vehicle Origin', options=origin_type, value=origin_type)
cyl_selector = pn.widgets.CrossSelector(name='Cylinders', options=num_cylinder, value=num_cylinder)
fuel_selector = pn.widgets.CheckBoxGroup(name='Fuel Type', options=fuel_type, value=fuel_type)

#### Visualization and Filtering Functions
- filtered_data(): Filters the dataset based on the current widget selections.
- lineplot(): Creates a line plot to display how the mean combined MPG changes over Model Years.
- barchart(): Creates a bar chart showing the top 20 automakers with the highest mean combined MPG, grouped by region.
- scatter(): Creates a scatter plot displaying engine displacement (L) vs combined MPG, grouped by region.
- boxplot(): Creates a box plot that allows for the comparison of combined MPG across different engine sizes, grouped by region.
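
The filtering step can be sketched on a toy frame to show one edge case: if the user clears every option in a widget, the selection list is empty and the filter returns no rows, so each chart receives an empty DataFrame. The data and the `filter_rows` helper below are hypothetical and only illustrate the masking logic.

```python
import pandas as pd

# Made-up rows for illustration only (not from the EPA data).
toy = pd.DataFrame({
    "Cylinders": [4, 6, 8],
    "Origin": ["Asia", "Europe", "USA"],
    "Combined MPG": [30, 24, 18],
})

def filter_rows(frame, cyls, origins):
    """Keep rows whose Cylinders and Origin are both in the selected lists."""
    mask = frame["Cylinders"].isin(cyls) & frame["Origin"].isin(origins)
    return frame[mask]

# A typical selection keeps only rows that satisfy every filter at once.
some = filter_rows(toy, [4, 6], ["Asia", "USA"])
print(len(some))

# An empty selection matches nothing, so the result has zero rows.
none = filter_rows(toy, [], ["Asia", "USA"])
print(none.empty)
```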

In [56]:
# Data filtering
@pn.depends(cyl_selector, origin_selector, fuel_selector)
def filtered_data(cyls, origins, fuel_t):
    return vehicle_df[vehicle_df["Cylinders"].isin(cyls) & vehicle_df["Origin"].isin(origins) & vehicle_df["Fuel Type"].isin(fuel_t)] # Filter dataset for widget selections.

In [57]:
# Line Chart
@pn.depends(cyl_selector, origin_selector, fuel_selector)
def lineplot(cyls, origins, fuel_t):
    data = filtered_data(cyls, origins, fuel_t)
    summary = data.groupby(["Model Year", "Origin"])["Combined MPG"].mean().reset_index() # Mean Combined MPG for each Model Year and Origin (region).
    fig = px.line(summary, x="Model Year", y="Combined MPG", color="Origin", title="Average Combined MPG Over Model Years by Region") # Creates the line plot.

    fig.update_layout(title={'x': 0.5}) # Centers the graphic title.
        
    return fig

In [58]:
# Bar Chart
@pn.depends(cyl_selector, origin_selector, fuel_selector)
def barchart(cyls, origins, fuel_t):
    data = filtered_data(cyls, origins, fuel_t)
    avg_mpg = data.groupby(["Make", "Origin"])["Combined MPG"].mean().reset_index() # Mean Combined MPG for each Make and Origin (region).
    top_makes = avg_mpg.groupby("Make")["Combined MPG"].mean().nlargest(20).index # Rank makes by mean Combined MPG and keep the 20 highest.
    avg_mpg_top = avg_mpg[avg_mpg["Make"].isin(top_makes)]

    fig = px.bar(avg_mpg_top, x="Make", y="Combined MPG", color="Origin", barmode="group", title="Top 20 Automakers by Average Combined MPG") # Creates the bar chart

    fig.update_layout(title={'x': 0.5}, xaxis_tickangle=-45) # Centers the graphic title and rotates the x axis labels by 45 degrees.
    
    return fig
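
The ranking logic used for the bar chart can be checked on a small example. The sketch below uses made-up makes and MPG values (hypothetical, for illustration only) and keeps the top 2 instead of the top 20.

```python
import pandas as pd

# Made-up rows for illustration only (not from the EPA data).
toy = pd.DataFrame({
    "Make": ["A", "A", "B", "B", "C"],
    "Combined MPG": [30, 34, 20, 22, 40],
})

# Mean Combined MPG per make: A = 32, B = 21, C = 40.
mean_by_make = toy.groupby("Make")["Combined MPG"].mean()

# nlargest returns the highest means in descending order.
top_makes = mean_by_make.nlargest(2).index.tolist()
print(top_makes)  # ['C', 'A']

# Keep only the rows that belong to the top makes.
top_rows = toy[toy["Make"].isin(top_makes)]
print(len(top_rows))  # 3
```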

In [59]:
# Scatter Plot
@pn.depends(cyl_selector, origin_selector, fuel_selector)
def scatter(cyls, origins, fuel_t):
    data = filtered_data(cyls, origins, fuel_t)
    # Creates the Scatter Plot
    fig = px.scatter(data, x="Engine Displacement", y="Combined MPG", color="Origin", hover_data=["Model", "Model Year"], labels={"Engine Displacement": "Engine Displacement (L)"}, title="Engine Displacement vs Combined MPG by Region")

    fig.update_layout(title={'x': 0.5}) # Centers the graphic title.
    
    return fig            

In [60]:
# Box Plot
@pn.depends(cyl_selector, origin_selector, fuel_selector)
def boxplot(cyls, origins, fuel_t):
    data = filtered_data(cyls, origins, fuel_t)
    # Creates the box plot
    fig = px.box(data, x="Cylinders", y="Combined MPG", color="Origin", labels={"Cylinders": "Number of Cylinders"}, title="Fuel Economy vs Engine Size (# of Cylinders) by Region")

    fig.update_layout(title={'x': 0.5}) # Centers the graphic title.
    
    return fig

#### Dashboard Layout
The dashboard layout is built using pn.Column and pn.Row. These two containers arrange the widgets and plots in a grid: the widgets sit in rows at the top, and the four graphics are placed in two side-by-side columns for comparative analysis.

In [61]:
dashboard = pn.Column(
    pn.pane.Markdown("# EPA Fuel Economy Dashboard"), # Title
    pn.Row("## Select the Region", origin_selector, "## Select Fuel Type", fuel_selector), # Widgets for region and fuel type
    pn.Row("## Select Engine Size \n ## (# of Cylinders)", cyl_selector), # Widget for Engine Size
    pn.Row(
        pn.Column(lineplot, scatter), # Left column: Line and Scatter plots
        pn.Column(barchart, boxplot)  # Right column: Bar and Box plots
    )
)
dashboard # Display the dashboard

<a name="conclusion"></a>
### Conclusion

### Line Plot - Average Combined MPG Over Model Years by Region (Upper Left)
This line plot shows the average combined MPG trend for each automaker region over model years.
- Asian automakers consistently produce the most fuel-efficient vehicles over time.
- European and US automakers have generally moderate fuel efficiency and improve more slowly, although both show an upward trend in recent years.

### Bar Chart - Top 20 Automakers by Average Combined MPG (Upper Right)
This bar chart highlights the top 20 automakers with the best average combined MPG, grouped by region.
- The top-ranking brands are Asian automakers. This supports what was seen in the line plot.
- Some European automakers also make the list.
- US brands are underrepresented among the top performers (i.e., the top 10) and occupy the lower half of the top 20 list. This suggests that US automakers have room to improve fuel efficiency.

### Scatter Plot - Engine Displacement vs Combined MPG by Region (Lower Left)
This scatter plot relates engine size to combined MPG by region.
- As engine displacement increases, fuel efficiency decreases.
- Asian automakers cluster around smaller engine sizes with higher combined MPG compared to their European and US counterparts.
- Although European automakers produce vehicles across the engine displacement spectrum, the data shows that these automakers still prefer smaller, more efficient engines.
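
One way to put a number on the inverse relationship seen in the scatter plot is a Pearson correlation coefficient. The sketch below uses made-up values for illustration only, so the result says nothing about the EPA data itself. On the real dataset, the same call could be applied to the 'Engine Displacement' and 'Combined MPG' columns.

```python
import pandas as pd

# Made-up values for illustration only (not from the EPA data).
toy = pd.DataFrame({
    "Engine Displacement": [1.5, 2.0, 3.0, 5.0],
    "Combined MPG": [35, 30, 24, 15],
})

# Pearson correlation: values near -1 indicate a strong inverse linear relationship.
r = toy["Engine Displacement"].corr(toy["Combined MPG"])
print(f"Pearson r = {r:.2f}")
```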

### Box Plot - Fuel Economy vs Engine Size (# of Cylinders) by Region (Lower Right)
The box plot compares combined MPG distributions across engine sizes.
- As engine size increases, fuel efficiency decreases.
- U.S. automakers have the largest combined MPG spread (greater variability) across all engine sizes. This might be attributable to U.S. automakers having more hybrid vehicles in their portfolios.
- European automakers seem to fall between their Asian and U.S. counterparts.