> ## Acknowledgments 
> Tutorial adapted from: 
> * [Analyzing Cars.csv File in Python – A Complete Guide](https://www.askpython.com/python/examples/analyzing-cars-dataset-in-python)
> * [Exploring Data using Python](https://towardsdatascience.com/exploring-the-data-using-python-47c4bc7b8fa2)
>  
> Data taken from:
> * [Cars Data](https://www.kaggle.com/ljanjughazyan/cars1)  


> Note: You can get practice data sets from [Kaggle](https://www.kaggle.com), but you cannot use them for the project.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Packages
In Python, libraries are distributed as packages.  

To help us with data exploration (and really all of data science), we will use the following packages:
* NumPy: efficient numeric manipulation
* Pandas: load and manipulate data frames
* Seaborn: data visualization (highly compatible with Pandas)
* Matplotlib: basic data visualization (Seaborn is built on top of Matplotlib)  

To load a package, you use the keyword `import`.

```Python
import pandas
```
I can also give the package a shorter alias when I import it, which makes it easier to use in my code:

```Python
import pandas as pd
```
I can also import a specific item (class, function, etc.) from a package:

```Python
from pandas import DataFrame 
```
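
All three import styles reach the same objects. A quick check, as a sketch (it assumes only that `pandas` is installed):

```python
import pandas
import pandas as pd
from pandas import DataFrame

# The alias and the full name refer to the same module object.
same_module = pandas is pd

# The directly imported class is the same class reached through the alias.
same_class = pd.DataFrame is DataFrame

print(same_module, same_class)
```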

In [None]:
# import the libraries we will use
import pandas as pd 
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

# use color codes in the plots generated by seaborn
sns.set(color_codes=True)

# 1. Loading our dataset

In [None]:
# upload sample automobile data for exploration
from google.colab import files 
uploaded_files = files.upload() # upload file

Saving CARS.csv to CARS.csv


In [None]:
import io
data = io.BytesIO(uploaded_files['CARS.csv']) # read file

In [None]:
# load as a data frame

# if doing locally, simply pass the name of the file

# view the first 5 rows to get an initial idea of the columns
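
One possible way to fill in the cell above, as a sketch. To keep it self-contained, it reads a few hand-typed rows from an in-memory string instead of the real `CARS.csv`; the column names `DriveTrain` and `EngineSize` and all the values are assumptions made for illustration.

```python
import io
import pandas as pd

# Hand-typed sample rows that mimic the layout of CARS.csv (illustrative only).
csv_text = """Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower
Acura,MDX,SUV,Asia,All,"$36,945","$33,337",3.5,6,265
Audi,A4,Sedan,Europe,Front,"$25,940","$23,508",1.8,4,170
BMW,X3,SUV,Europe,All,"$37,000","$33,873",3.0,6,225
"""

# read_csv accepts a file path (e.g. "CARS.csv") or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first 5 rows by default (here there are only 3)
print(df.shape)   # tuple of (number of rows, number of columns)
```

In the notebook itself, the `data` object created from the uploaded file could be passed to `pd.read_csv` in the same way.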


In [None]:
# Getting the number of instances and features


In [None]:
# Getting the dimensions of the data frame


# Data Types I

Our ultimate goal (which we will not get to in this tutorial) is to predict the price of a car given its other attributes.  

This is a supervised learning problem. And in the context of our data set, it means that:
* **Outcome**: MSRP (manufacturer's suggested retail price)
* **Predictors**: all other features/columns/variables
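
In code, this split is usually expressed as one Series for the outcome and one data frame for everything else. A minimal sketch with made-up values (the names `X` and `y` are a common convention, not something this tutorial defines):

```python
import pandas as pd

# Tiny made-up frame standing in for the cars data.
df = pd.DataFrame({
    "Horsepower": [265, 170, 225],
    "Cylinders": [6, 4, 6],
    "MSRP": [36945, 25940, 37000],
})

y = df["MSRP"]                 # outcome: the column we want to predict
X = df.drop(columns=["MSRP"])  # predictors: every other column

print(list(X.columns), y.name)
```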

In [None]:
# get a quick feel for your data set (# rows and columns, null-values, data types)
# note: pandas views strings as objects. 


In [None]:
# view summary/stats of each numeric variable/column
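
A sketch of what these two cells might contain, using a small made-up frame so it runs on its own. `info()` and `describe()` are standard pandas `DataFrame` methods; the values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Make": ["Acura", "Audi", "BMW"],
    "MSRP": ["$36,945", "$25,940", "$37,000"],  # text, because of the "$"
    "Horsepower": [265, 170, 225],
})

# Row count, non-null count and data type of every column.
df.info()

# Summary statistics; by default only numeric columns are included,
# so the text column MSRP does not appear here.
stats = df.describe()
print(stats)
```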


# Data Types II

## Numeric
* MSRP (but the `$` is causing a problem)
* Invoice (but the `$` is causing a problem)
* Engine Size
* Cylinders
* Horsepower
* etc.

## Categorical
* Make
* Model
* Type
* Origin
* Drive Train

## Boolean
* Let's make our own 
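
One way to make our own Boolean column, sketched on made-up horsepower values. The column name `Powerful` is hypothetical; the rule (above the 75th percentile) is the one used in the next cell.

```python
import pandas as pd

df = pd.DataFrame({"Horsepower": [100, 150, 200, 250, 300]})

# 75th percentile of the column
threshold = df["Horsepower"].quantile(0.75)

# Comparing a column to a number gives a column of True/False values.
df["Powerful"] = df["Horsepower"] > threshold

print(threshold)
print(df)
```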

In [None]:
# a car is powerful if its horsepower is above the 75th percentile


df.head()

# 2. Removing irrelevant features
In this case, those are features that either provide redundant information (e.g., `Invoice`), or won't help us with prediction (e.g. `Origin`).
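
A minimal sketch of dropping those two columns with `DataFrame.drop`, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Make": ["Acura", "Audi"],
    "Origin": ["Asia", "Europe"],
    "MSRP": [36945, 25940],
    "Invoice": [33337, 23508],
})

# axis=1 means "these labels are columns"; drop returns a new data frame,
# so we assign the result back to df.
df = df.drop(["Invoice", "Origin"], axis=1)

print(list(df.columns))
```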

In [None]:
# a dataframe in pandas has 2 axes:
#   axis 0: rows
#   axis 1: columns


df.head()

# 3. Eliminating duplicates
It's very unlikely that two different cars have the exact same price, so we will treat rows that share the same `MSRP` as duplicates.
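
As a sketch, `drop_duplicates` with `subset` does this kind of column-based de-duplication. The rows below are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "Model": ["A", "B", "A (listed twice)"],
    "MSRP": [25940, 37000, 25940],
})

# Rows count as duplicates when their MSRP matches; keep the first one seen.
deduped = df.drop_duplicates(subset="MSRP", keep="first")

print(deduped)
```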

In [None]:
print("Count before Removing Duplicates: ")
df.count()

In [None]:
# remove rows that have a duplicate MSRP

 
print("Count after Removing Duplicates: ")
df.shape

# 4. Dealing with missing OR null values
Generally, we have 2 options:

## A: Fill in missing value with mean
* Advantage: keeps all the rows, which might be especially valuable if your data set is not that big
* Disadvantage: introduces "fabricated" data

## B: Drop rows that have missing values
* Advantage: all the remaining data is real
* Disadvantage: might lose a lot of data, especially when there are many columns, since every extra column raises the chance that a row has a gap
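
Both options side by side, as a sketch on a made-up frame with one missing `Cylinders` value. The variable names `filled` and `dropped` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cylinders": [4.0, 6.0, np.nan, 8.0],
    "Horsepower": [170, 225, 200, 300],
})

# Option A: replace the missing value with the (rounded) column mean.
cylinders_mean = round(df["Cylinders"].mean())  # mean() skips nulls
filled = df.fillna({"Cylinders": cylinders_mean})

# Option B: drop every row (axis 0) that contains any null value.
dropped = df.dropna(axis=0, how="any")

print(filled)
print(dropped)
```

Option A keeps all four rows, while option B is left with three.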

In [None]:
# Note: the original data had no missing values, but I deleted some on purpose 😅
print(df.isnull().sum())

In [None]:
# Calculate mean of all the values of the column (Cylinders)
cylinders_mean = 
print("The mean of Cylinders is: ",cylinders_mean)

cylinders_mean = 
print("Rounded value of the mean of Cylinders is: ", cylinders_mean)
 
# Replace the null value with the mean of Cylinders


# About inplace=True: if it is True, the original object is modified directly.
# If it is False (the default), the function leaves the original object
# untouched and returns a modified copy, which you have to assign back to the
# original variable to keep the change.

print(df.isnull().sum())

print("Count:\n", df.count())

In [None]:
print("Count before dropping:\n", df.count(), "\n")

# drop any row (axis 0) that has any type of null value


print(df.isnull().sum(), "\n")
print("Count after dropping:\n", df.count())

# 5. Converting object values to numeric
Since `MSRP` values start with `$`, Pandas does not recognize them as numbers. As a result, we cannot plot the column or use it in a linear regression, to name just two of the drawbacks.  
To fix this, we will remove the dollar sign and convert the column to numeric.
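
A sketch of one way to do the conversion. It assumes the prices also contain thousands separators (for example `$36,945`), which the text above does not state; if they do not, the second `replace` is simply a no-op.

```python
import pandas as pd

df = pd.DataFrame({"MSRP": ["$36,945", "$25,940", "$37,000"]})

# Strip the characters that keep pandas from parsing the values as numbers.
cleaned = (
    df["MSRP"]
    .str.replace("$", "", regex=False)  # regex=False: treat "$" literally
    .str.replace(",", "", regex=False)
)

# Convert the cleaned text to a numeric column.
df["MSRP"] = pd.to_numeric(cleaned)

print(df["MSRP"].tolist())
```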

In [None]:

 

 

 
df.head()

# 6. Detecting Outliers
Outliers can skew your data. Thus, it is usually a good idea to remove them. And if you have a very good reason against removing them, then you should at least know about them.
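
A common rule of thumb, and the one the cells below rely on, flags a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) outside the middle 50% of the data. A sketch on a single made-up price column:

```python
import pandas as pd

df = pd.DataFrame({"MSRP": [20000, 22000, 25000, 27000, 30000, 150000]})

Q1 = df["MSRP"].quantile(0.25)  # 25th percentile
Q3 = df["MSRP"].quantile(0.75)  # 75th percentile
IQR = Q3 - Q1                   # spread of the middle 50%

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# True for every row whose price falls outside [lower, upper]
is_outlier = (df["MSRP"] < lower) | (df["MSRP"] > upper)

df1 = df[~is_outlier]           # keep only the non-outliers
print(df1)
```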

In [None]:
# visualize box plot based on main feature


In [None]:
Q1 = 
Q3 = 
IQR = Q3 - Q1
print(IQR)

In [None]:
# remove all points that are either:
#   - more than 1.5*IQR below Q1 OR
#   - more than 1.5*IQR above Q3
df1 = df[ ~ ( (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)) ).any(axis=1)]

In [None]:
sns.boxplot(x=df1['MSRP'])
plt.show() # note: keeps the notebook from printing the Axes object's text representation

# 7. Visualizing Correlation between Variables

## 7.1 Correlation between all numeric values

A heat map can be useful if you want to see which features are related to your dependent variable (a.k.a. the outcome). It shows the correlation between every pair of numeric variables in the data set.  
I can then use the related features to build my model.
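
A sketch of such a heat map with `sns.heatmap`, on a made-up frame. The colour map `"BrBG"` and `annot=True` are arbitrary choices, not something the tutorial prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values, for illustration only.
df1 = pd.DataFrame({
    "Make": ["Audi", "Acura", "BMW", "Porsche"],
    "MSRP": [25940, 36945, 37000, 52000],
    "Horsepower": [170, 265, 225, 340],
    "Cylinders": [4, 6, 6, 8],
})

# Correlation is only defined for numbers, so keep the numeric columns.
c = df1.select_dtypes(include="number").corr()

plt.figure(figsize=(10, 5))
ax = sns.heatmap(c, cmap="BrBG", annot=True)  # annot=True prints each value
plt.show()
```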

In [None]:
plt.figure(figsize=(10,5))
c = df1.select_dtypes(include="number").corr()  # correlate the numeric columns only


## 7.2 Correlation between a pair of variables
I see that `MSRP` and `Horsepower` have a very high correlation of `0.82`.  
Naturally, I wanna take a closer look at their relationship and try to recognize its shape/degree.  
I can do this using a scatter plot.
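
A sketch of the scatter plot using Matplotlib's `Axes.scatter`, with made-up values standing in for the cleaned data frame:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values, for illustration only.
df1 = pd.DataFrame({
    "Horsepower": [170, 225, 265, 340],
    "MSRP": [25940, 37000, 36945, 52000],
})

fig, ax = plt.subplots(figsize=(5, 5))

# One dot per car: horsepower on the x-axis, price on the y-axis.
ax.scatter(df1["Horsepower"], df1["MSRP"])

ax.set_title("Scatter plot between MSRP and Horsepower")
ax.set_xlabel("Horsepower")
ax.set_ylabel("MSRP")
plt.show()
```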

In [None]:
fig, ax = plt.subplots(figsize=(5,5))

plt.title('Scatter plot between MSRP and Horsepower')
ax.set_xlabel('Horsepower')
ax.set_ylabel('MSRP')
plt.show()

# 👩‍🎓👨‍🎓 I wanna learn more

* [10 mins to pandas](https://pandas.pydata.org/docs/user_guide/10min.html#object-creation)
* [NumPy: the absolute basics for beginners](https://numpy.org/doc/stable/user/absolute_beginners.html)
* [An introduction to seaborn](https://seaborn.pydata.org/introduction.html)
* [Python Numpy Tutorial with Jupyter and Colab_Stanford CS231n](https://cs231n.github.io/python-numpy-tutorial/)
