# Exploratory Data Analysis in Python

Arthur Kennedy | Charles Rehder | Luca Comba | Mohamed Elshahawi | Roban Shrestha

---

## Introduction

**What is Exploratory Data Analysis (EDA)?**

*"One thing the data analyst has to learn is how to expose himself to what his data are willing–or even anxious–to tell him. Finding clues requires looking in the right places and with the right magnifying glass."*

*– John Tukey, Exploratory Data Analysis, p. 21*

Exploratory Data Analysis is the first step in getting to know a dataset and uncovering what insights it has to offer. Loading, manipulating, and visualizing the data are all key steps in getting the most from the data you have. But, as the quote above suggests, you need the right tools to find the right insights. In this notebook, our group uses **pandas**, **numpy**, **matplotlib**, and **seaborn** to manipulate and visualize the data and to explore the questions we found interesting.

**What data are we exploring in this project?**



Our group is looking into the *Car Features and MSRP* dataset, found [here](https://www.kaggle.com/CooperUnion/cardataset). The dataset contains 16 columns and over 11,000 rows describing different car models. Features include engine details such as the number of cylinders and horsepower, as well as consumer-level information such as price and popularity score. We hope to use this dataset to demonstrate what the Exploratory Data Analysis process can reveal and to home in on a few findings that we found interesting.



---



## 1. Importing the required libraries for EDA

Below we load the libraries needed for Exploratory Data Analysis. Loading all the requirements at the top of the notebook is common practice, and the libraries we use are widely used, well-supported analysis libraries.

**Data loading and manipulation**: pandas, numpy

**Data visualization**: seaborn, matplotlib

*Note: `%matplotlib inline` is IPython-specific functionality that makes plots render directly in the notebook. `sns.set(color_codes=True)` applies seaborn's default theme and remaps matplotlib's shorthand color codes (such as `"b"` and `"g"`) to the corresponding colors in the seaborn palette.*

In [8]:
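# data loading and manipulation
import pandas as pd
import numpy as np
# data visualization
import seaborn as sns
import matplotlib.pyplot as plt
# render plots inline (IPython magic), then apply seaborn's theme and color codes
%matplotlib inline
sns.set(color_codes=True)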
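To see what `color_codes=True` changes, the sketch below compares how matplotlib resolves the shorthand code `"b"` before and after the `sns.set` call. The `sns.set_color_codes("reset")` line is added for illustration only: it restores matplotlib's own definitions, so the comparison still works if the cell above has already run. The variable names are hypothetical.

```python
import matplotlib.colors as mcolors
import seaborn as sns

# Illustration only: start from matplotlib's own shorthand definitions,
# in case seaborn has already remapped them earlier in the session.
sns.set_color_codes("reset")
default_blue = mcolors.to_rgb("b")

# Same call as in the notebook: apply the seaborn theme and remap the codes.
sns.set(color_codes=True)
seaborn_blue = mcolors.to_rgb("b")

print("matplotlib 'b':", default_blue)
print("seaborn 'b':   ", seaborn_blue)
```

After the call, any plot that asks for `color="b"` gets the blue from seaborn's palette rather than matplotlib's pure blue.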



---



## 2. Loading the data into the data frame

Loading data is the second step in the process, and pandas offers plenty of flexibility here. Users can load data from a variety of sources and formats, including the clipboard, MS Excel documents, JSON, and many more. The full list of I/O functions can be found [here](https://pandas.pydata.org/docs/reference/io.html). In this notebook, we use `read_csv` to read in the *Car Features and MSRP* dataset.
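As a small illustration of that flexibility, the sketch below reads the same two records once from CSV text and once from JSON text. The in-memory strings and the three-column subset are stand-ins chosen for this example (the values come from rows shown in this notebook). They are not how the project loads its data.

```python
import io

import pandas as pd

# Hypothetical stand-in data: two records in two different formats.
csv_text = "Make,Model,MSRP\nBMW,1 Series,40650\nAcura,ZDX,46120\n"
json_text = (
    '[{"Make": "BMW", "Model": "1 Series", "MSRP": 40650},'
    ' {"Make": "Acura", "Model": "ZDX", "MSRP": 46120}]'
)

# Each reader takes a path or a file-like object and returns a DataFrame.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls return an ordinary `DataFrame`, so the rest of an analysis does not need to know which format the data came from.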


Pandas also provides functions to peek at a few rows of the dataset, such as `df.head()` and `df.tail()`. These are helpful when analysts want to get an initial feel for the data.

In [9]:
# load data into pandas dataframe
df = pd.read_csv("data/data.csv")

# display first five rows of data
df.head(5)

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |


In [10]:
# display last five rows of data
df.tail(5)

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11909 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 46120 |
| 11910 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 56670 |
| 11911 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 50620 |
| 11912 | Acura | ZDX | 2013 | premium unleaded (recommended) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 50920 |
| 11913 | Lincoln | Zephyr | 2006 | regular unleaded | 221.0 | 6.0 | AUTOMATIC | front wheel drive | 4.0 | Luxury | Midsize | Sedan | 26 | 17 | 61 | 28995 |
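Both calls above pass `5` explicitly, which is also the default when no argument is given. The sketch below shows the row-count argument alongside `shape`, which reports the number of rows and columns without printing any data. The four-row, four-column `sample` frame is a hypothetical stand-in built from rows shown above, not the full dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in: four rows from the output above, trimmed to four columns.
sample_csv = """Make,Model,Year,MSRP
BMW,1 Series M,2011,46135
BMW,1 Series,2011,40650
Acura,ZDX,2012,46120
Lincoln,Zephyr,2006,28995
"""
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # first two rows
print(sample.tail(1))  # last row only
```

Running `df.shape` on the full data frame is a quick way to confirm the row and column counts quoted in the introduction.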
