# Homework 8 - Beyond 1D visualizations
In this homework, you will repeat the exploratory analysis from the [class hands-on](class.ipynb) with another dataset and add your own observations about the data.

Feel free to choose your own dataset (such as the one you are using for the final project) or, to keep things simple, use the Cars dataset.

## Instructions

1. **Project Setup**:  
   - Set up your Python and Jupyter (or VSCode) environment.  
   - Clone or download the repository provided in class (refer to the class notes).

2. **Fill the cells**:
    - Fill in the cells with the code provided in the instructions.
    - You can use the provided code as a starting point and modify it as needed.
    - Make sure to run the code in each cell to see the output.
    - Also answer the markdown questions in the notebook where indicated.

3. **Documentation**:  
   - Comment your code and add markdown explanations for each part of your analysis.

4. **Submission**:  
   - Ensure your notebook is complete and all cells are executed without errors.
   - Save your notebook and export it as either PDF or HTML. If the Altair visualizations do not appear in the HTML export, submit a separate version with the Altair charts saved as HTML. Refer to: https://altair-viz.github.io/getting_started/starting.html#publishing-your-visualization (you can use the `chart.save('chart_file.html')` method).
   - Submit to Canvas.

OK, let's import the packages.

In [1]:
# For Jupyter notebooks, use 'widget' backend for interactivity
%matplotlib widget

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
# make text editable in Illustrator
plt.rcParams['pdf.fonttype'] = 42
plt.rcParams['ps.fonttype'] = 42

# import altair
import altair as alt
alt.data_transformers.disable_max_rows()


DataTransformerRegistry.enable('default')

Now it is time to load a dataset. You can choose your own dataset or use the Cars dataset. If you want to use the Cars dataset, you can load it using the following code:


In [2]:

# Load the Cars dataset (carfeatures.csv)
df = pd.read_csv("../../Datasets/carfeatures.csv") # adjust the path as needed
df.head()

|  | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |


In [9]:
# If you are using the car dataset, let's remove a few entries that
# are problematic for our analysis.
# The dataset seems to have some issues:
# Entry with highest Highway MPG
weirdEntry = df.loc[df['highway MPG'].idxmax()]
print(weirdEntry)
# probably a mistake
# remove it
df_cars = df[df['highway MPG'] < 100]
# also removing electric cars
df_cars = df_cars[df_cars['Engine Fuel Type'] != 'electric']
df_cars.head()

Make                                           Audi
Model                                            A6
Year                                           2017
Engine Fuel Type     premium unleaded (recommended)
Engine HP                                     252.0
Engine Cylinders                                4.0
Transmission Type                  AUTOMATED_MANUAL
Driven_Wheels                     front wheel drive
Number of Doors                                 4.0
Market Category                              Luxury
Vehicle Size                                Midsize
Vehicle Style                                 Sedan
highway MPG                                     354
city mpg                                         24
Popularity                                     3105
MSRP                                          51600
Name: 1119, dtype: object


|  | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |


### 2D Scatter plot
Choose two numeric variables from the dataset and draw a scatterplot using matplotlib, seaborn, or altair.
- Be sure to include appropriate axis labels and a title for the plot.

Try to encode an additional categorical variable (if available) in the scatterplot using color. For instance, for the Cars dataset, you can use `Make` or `Engine Fuel Type`.
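
To illustrate the idea, here is a small sketch of a labeled 2D scatterplot with a categorical variable encoded as color. It uses synthetic data rather than the Cars dataset; the `toy` DataFrame and its column names (`horsepower`, `mpg`, `fuel`) are made up for this example, so substitute your own columns.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real Cars dataset)
rng = np.random.default_rng(0)
n = 150
toy = pd.DataFrame({
    "horsepower": rng.uniform(80, 400, n),
    "fuel": rng.choice(["regular", "premium", "diesel"], n),
})
toy["mpg"] = 45 - 0.07 * toy["horsepower"] + rng.normal(0, 2, n)

fig, ax = plt.subplots(figsize=(6, 4))
# One scatter call per category, so each category gets its own color
# and its own legend entry
for name, group in toy.groupby("fuel"):
    ax.scatter(group["horsepower"], group["mpg"], label=name, alpha=0.7, s=20)

ax.set_xlabel("Horsepower (synthetic)")
ax.set_ylabel("MPG (synthetic)")
ax.set_title("Toy example: MPG vs. horsepower, colored by fuel type")
ax.legend(title="Fuel type")
```

Looping over `groupby` is only one option; seaborn's `scatterplot` with `hue=` or an Altair `color` encoding would express the same mapping more compactly.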

In [None]:
# YOUR CODE HERE

### 3D Scatter plot
Now choose three numeric variables from the dataset and plot a 3D scatterplot. You can use matplotlib's `mpl_toolkits.mplot3d` or any other library that supports 3D plotting.
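
A 3D scatterplot in matplotlib could look roughly like the sketch below. The arrays `x`, `y`, and `z` are random placeholders, not columns of any real dataset, and coloring the points by `z` is just one possible choice.

```python
import matplotlib.pyplot as plt
import numpy as np

# Random placeholder values standing in for three numeric columns
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(80, 400, n)
y = rng.uniform(2, 12, n)
z = 45 - 0.07 * x - 0.8 * y + rng.normal(0, 2, n)

fig = plt.figure(figsize=(6, 5))
# projection="3d" creates a 3D axes provided by mpl_toolkits.mplot3d
ax = fig.add_subplot(projection="3d")
points = ax.scatter(x, y, z, c=z, cmap="viridis", s=15)

ax.set_xlabel("x (placeholder)")
ax.set_ylabel("y (placeholder)")
ax.set_zlabel("z (placeholder)")
ax.set_title("Toy 3D scatterplot")
fig.colorbar(points, ax=ax, label="z value", shrink=0.7)
```

With the `widget` backend enabled at the top of the notebook, the resulting figure can be rotated with the mouse, which often makes the point cloud easier to read.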

In [None]:
# YOUR CODE HERE

### Find the numeric variables in the dataset
Now we will find all the numeric variables in the dataset. This will help us choose the variables for further analysis.

In [6]:
# list of numeric features (get from data)
numeric_features = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
# remove Year
numeric_features.remove('Year')
numeric_features

# Remove any other column you do not want to analyze (like Ids)

['Engine HP',
 'Engine Cylinders',
 'Number of Doors',
 'highway MPG',
 'city mpg',
 'Popularity',
 'MSRP']

### Correlation Matrix
For your selection of numeric variables, plot the correlation matrix using seaborn's `heatmap` or any other visualization library of your choice.

Plot the Pearson correlation first, then the Spearman correlation. Do you see any differences?
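
To see why the two coefficients can differ, the sketch below computes both on synthetic data in which one column is a strictly increasing but nonlinear function of another. The `toy` DataFrame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, 300)
toy = pd.DataFrame({
    "x": x,
    "monotone": np.exp(x),           # strictly increasing, but nonlinear in x
    "noise": rng.normal(size=300),   # unrelated to x
})

# Pearson measures linear association; Spearman is Pearson on the ranks,
# so it measures monotonic association
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

Because `monotone` preserves the ordering of `x`, the Spearman coefficient for that pair is 1, while the Pearson coefficient is lower since the relationship is not a straight line. Either matrix can be passed to `sns.heatmap(..., annot=True, vmin=-1, vmax=1)` to draw it.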

In [None]:
# Pearson correlation matrix
# YOUR CODE HERE

In [None]:
# Spearman correlation matrix
# YOUR CODE HERE

### Scatterplot matrix
To visualize the relationships between all numeric variables, create a scatterplot matrix (also known as a pair plot). You can use seaborn's `pairplot`.

In [7]:
# YOUR CODE HERE

### Parallel Coordinates Plot
Another way to visualize multi-dimensional data is by using a parallel coordinates plot. You can either use pandas or altair for this.
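
The sketch below shows one way to do this with `pandas.plotting.parallel_coordinates`. The data are random placeholders, and the min-max scaling step is an assumption added here so that columns with very different ranges share one vertical axis; it is not required by the assignment.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Random placeholder data with very different value ranges per column
rng = np.random.default_rng(4)
n = 60
toy = pd.DataFrame({
    "power": rng.uniform(80, 400, n),
    "price": rng.uniform(15_000, 90_000, n),
    "mpg": rng.uniform(15, 45, n),
})
toy["size"] = rng.choice(["compact", "midsize", "large"], n)

# Min-max scale each numeric column to [0, 1] so no single column
# dominates the shared vertical axis
cols = ["power", "price", "mpg"]
scaled = toy.copy()
scaled[cols] = (toy[cols] - toy[cols].min()) / (toy[cols].max() - toy[cols].min())

fig, ax = plt.subplots(figsize=(7, 4))
parallel_coordinates(
    scaled, class_column="size", cols=cols, colormap="viridis", alpha=0.5, ax=ax
)
ax.set_ylabel("Min-max scaled value")
ax.set_title("Toy parallel coordinates plot")
```

Each line is one row of the table, and `class_column` determines its color. Plotting a random sample of rows is a common way to reduce clutter on larger datasets.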

In [8]:
# YOUR CODE HERE

### Discussion
Write a brief discussion on your observations from the visualizations above. Consider the following questions:
 - Did you gain any insights from the scatter plots (2D and 3D)?
 - Did you find any interesting correlations in the correlation matrix?
 - Do you think the scatterplot matrix was helpful in understanding the relationships between variables? Was it too crowded or too sparse?
 - What about the parallel coordinates plot? Did it help in understanding the multi-dimensional relationships? Or was it too messy?



 `YOUR DISCUSSION HERE`