# Lesson 3c - On your Own
Utilize Python to import a dataset, wrangle the data and perform some basic visualizations as well.  Use whatever you need (my code, internet, etc.) to proceed

## 0. Import necessary libraries

In [3]:
import numpy as np
import pandas as pd

## 1. Import the dataset
- First run the following code to create a list of feature names 
- Then import the dataset `imports-85.data`, use the arguments `sep=",", names=headers`

In [23]:
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

In [24]:
df = pd.read_csv("imports-85.data",sep=",", names=headers)


## 2. Look over the data
I like to start with `"filename".head()` to look at the first few rows (when I can). <br>
You should also check for missing values (NaN, ?, blank entries, etc.)

In [25]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


## 3. Python likes NaN better then any of the other missing data formats so replace the "?" with nan
The Numpy `replace` function may be useful

In [26]:
df.replace('?', np.NaN)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470


## 4. Check the dataset for missing values
Hint: I showed how to do this in the data wrangling notebook

In [27]:
print(df.isnull().sum())

symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64


## 5. Deal with missing values
We have now identified several features with missing values.  We are going to deal with different columns by imputing different values to replace the missing values

### Replace by mean:
For the features: "normalized-losses", "stroke", "bore", "horsepower", and "peak-rpm", replace any missing values with the column mean


In [29]:
features = {'normalized-losses', 'stroke', 'bore', 'horsepower', 'peak-rpm'}

for f in features:
    mean = df[f].mean()
    df[f].replace(np.NaN, mean)
    

TypeError: Could not convert 2.682.683.473.403.403.403.403.403.403.402.802.803.193.193.193.393.393.393.033.113.113.233.233.393.233.233.233.393.463.903.413.413.073.413.413.413.413.583.583.583.583.583.583.233.113.113.234.174.172.763.153.153.153.153.15????3.393.393.393.393.393.393.163.643.643.643.643.643.103.103.353.353.123.233.233.233.393.463.463.863.863.863.463.463.463.463.293.473.293.293.293.293.293.293.293.293.473.473.273.273.273.273.273.273.193.523.193.522.193.522.193.523.193.523.213.233.393.233.233.233.463.863.112.902.902.903.113.903.903.073.072.073.073.073.072.362.642.642.642.642.642.642.642.642.642.642.643.033.033.033.033.033.033.033.033.353.353.033.033.033.033.033.083.083.503.503.503.503.503.503.543.353.543.543.543.353.353.353.353.403.403.403.403.403.403.403.403.403.403.403.403.153.153.153.153.153.153.153.152.873.403.15 to numeric

### Replace by mode:
For the feature: "num-of-doors", replace by "mode" of the column.  <br>
Hint: don't try to use the .mode() function since this column is actually categorical, not numeric.  Try `df['name'].value_counts()`

### Drop row:
Any rows with missing values for "price" (which is our response variable) should be dropped.  Use the Pandas function `.dropna` <br>

Then reset the index for the dataframe since you deleted rows. `reset_index`

## 6. Recheck your dataframe for missing values and formating

## 7. Check and fix your datatypes
Ensure all data types are correct for future analysis.  For our problem we want: 

symboling:              int64 <br>
normalized-losses:      int32<br>
make:                object<br>
fuel-type:             object<br>
aspiration:            object<br>
num-of-doors:          object<br>
body-style:            object<br>
drive-wheels:          object<br>
engine-location:       object<br>
wheel-base:           float64<br>
length:               float64<br>
width:                float64<br>
height:               float64<br>
curb-weight:            int64<br>
engine-type:           object<br>
num-of-cylinders:      object<br>
engine-size:            int64<br>
fuel-system:           object<br>
bore:                 float64<br>
stroke:               float64<br>
compression-ratio:    float64<br>
horsepower:            object<br>
peak-rpm:             float64<br>
city-mpg:               int64<br>
highway-mpg:            int64<br>
price:                float64<br>

In Pandas you can use:<br>
`.dtype()` to check the data type <br>
`.astype("type")` to change the data type, common data types are "int" and "float"

## 7. Create scatter plots of all variables against price