##Data Summary

This data is from the UC Irvine Machine Learning Repository. The original dataset contains 205 observations and 26 attributes. The attributes are:

- symboling - the degree to which a car is more risky than its price indicates. This value is an integer ranging from -3 to 3, where 3 indicates the highest risk and -3 the lowest.

- normalized losses - a numerical and continuous value representing the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification and represents the average loss per car per year.

- make - a string representing the name of the car brand/manufacturer.

- fuel-type - a string, either 'diesel' or 'gas', representing what fuel the car engine uses.

- aspiration - a string, either 'std' or 'turbo', representing how air is supplied to the car engine: standard (naturally aspirated) or turbocharged.

- num-of-doors - a string, either 'two' or 'four', representing how many doors the car has.

- body-style - A string indicating the general shape of the car.

- drive-wheels - a string value of either '4wd', 'fwd', or 'rwd' indicating the set of wheels on the vehicle that gets power from the engine.

- engine-location - a string value of either 'front' or 'rear' indicating where the car engine is located.

- wheel-base - a numerical and continuous value representing the length from the center of the front wheel to the center of the rear wheel.

- length - a numerical and continuous value representing the overall length of the car.

- width - a numerical and continuous value representing the overall width of the car.

- height - a numerical and continuous value representing the overall height of the car. 

- curb-weight - a numerical and continuous value representing the weight of the vehicle with all equipment and a full tank included.

- engine-type - a string value indicating what type of engine the car uses.

- num-of-cylinders - a string value indicating how many cylinders the car engine uses to operate.

- engine-size - a numerical and continuous value representing the engine's displacement, i.e. the volume of air and fuel the engine's cylinders can push through. Engine size is commonly quoted in cubic centimeters (Inchcape, reference 3), although the values in this dataset (for example 130 and 152) appear to be in cubic inches.

- fuel-system - a string value indicating the fuel system the car uses. 

- bore - a numerical and continuous value representing the diameter of the cylinder.

- stroke - a numerical and continuous value representing the distance that a piston has to travel in a cylinder.

- compression-ratio - a numerical continuous value representing the ratio between the largest and smallest volume of a car's cylinder.

- horsepower - a numerical continuous value representing how much power a car engine produces.

- peak-rpm - a numerical and continuous value representing how many times a piston goes up and down in a cylinder per minute in the car's engine.

- city-mpg - a numerical and continuous value representing how many miles a car can travel per gallon in city conditions.

- highway-mpg - a numerical and continuous value representing how many miles a car can travel per gallon on the highway.

- price - a numerical and continuous value indicating the cost of the car.

The data set originally contains 205 rows and 26 attributes, so there are 205 observations for each attribute. For this project, only a subset of the attributes will be used for the investigation: make, fuel type, aspiration, number of cylinders, horsepower, peak rpm, and price. If any of these attributes have missing values, the way those values and the affected observations are handled will be explained.

##Data initialization##

In [151]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#pulls the original data set from github and gives each column their respective attribute names
df = pd.read_csv('https://raw.githubusercontent.com/cindysame179/Data-Analytics-Work/main/CarData2.csv', names = ['symboling'
,'normalized losses'
,'make'
,'fuel_type'
,'aspiration'
,'number_of_doors'
,'body_style'
, 'drive_wheels'
,'engine_location'
, 'wheel_base'
, 'length'
,'width'
,'height'
,'curb_weight'
,'engine_type'
,'number_of_cylinders'
,'engine_size'
,'fuel_system'
,'bore'
,'stroke'
,'compression_ratio'
,'horsepower'
,'peak_rpm'
,'city_mpg'
,'highway_mpg'
,'price'])
df


Unnamed: 0,symboling,normalized losses,make,fuel_type,aspiration,number_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.4,23.0,106,4800,26,27,22470


In [152]:
#Creating the Subset dataset that will be used by extracting the necessary columns
ProjDS = df[['make','fuel_type','aspiration','number_of_cylinders','horsepower','peak_rpm','price']]
ProjDS

Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price
0,alfa-romero,gas,std,four,111,5000,13495
1,alfa-romero,gas,std,four,111,5000,16500
2,alfa-romero,gas,std,six,154,5000,16500
3,audi,gas,std,four,102,5500,13950
4,audi,gas,std,five,115,5500,17450
...,...,...,...,...,...,...,...
200,volvo,gas,std,four,114,5400,16845
201,volvo,gas,turbo,four,160,5300,19045
202,volvo,gas,std,six,134,5500,21485
203,volvo,diesel,turbo,six,106,4800,22470


In [153]:
#checking for nulls
ProjDS.columns[ProjDS.isnull().any()]

Index([], dtype='object')

In [154]:
#checking for missing values
ProjDS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   make                 205 non-null    object
 1   fuel_type            205 non-null    object
 2   aspiration           205 non-null    object
 3   number_of_cylinders  205 non-null    object
 4   horsepower           205 non-null    object
 5   peak_rpm             205 non-null    object
 6   price                205 non-null    object
dtypes: object(7)
memory usage: 11.3+ KB


In [155]:
#checking for incorrect values for the non-numerical columns
ProjDS['make'].unique()

array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'mazda', 'mercedes-benz', 'mercury',
       'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'renault',
       'saab', 'subaru', 'toyota', 'volkswagen', 'volvo'], dtype=object)

In [156]:
ProjDS['fuel_type'].unique()

array(['gas', 'diesel'], dtype=object)

In [157]:
ProjDS['aspiration'].unique()

array(['std', 'turbo'], dtype=object)

In [158]:
ProjDS['number_of_cylinders'].unique()

array(['four', 'six', 'five', 'three', 'twelve', 'two', 'eight'],
      dtype=object)

In [159]:
#checking for 0 values in the numerical attributes (these columns are still stored as strings, so min() compares them as text)
ProjDS.min()

make                   alfa-romero
fuel_type                   diesel
aspiration                     std
number_of_cylinders          eight
horsepower                     100
peak_rpm                      4150
price                        10198
dtype: object
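The minimum values reported above come from columns that are still of dtype `object`, so they are compared as text rather than as numbers. The sketch below illustrates the difference on a small, hypothetical Series (the values are chosen for illustration and the variable names are not from the notebook); it assumes `pd.to_numeric` with `errors='coerce'` is an acceptable way to turn '?' into a missing value.

```python
import pandas as pd

# Hypothetical string-typed column, mirroring how horsepower is stored before conversion
horsepower = pd.Series(['111', '48', '100', '?', '262'])

# On strings, min() compares character by character, so '100' sorts before '48'
string_min = horsepower.min()
print(string_min)

# Converting first gives the numeric minimum; '?' becomes NaN and is skipped by min()
numeric_min = pd.to_numeric(horsepower, errors='coerce').min()
print(numeric_min)
```

Checking for zeros on the converted values is therefore more reliable than checking the string minimum.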

In [160]:
#checking for ? values in the dataset
ProjDS[ProjDS.isin(['?']).any(axis=1)]


Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price
9,audi,gas,turbo,five,160,5500,?
44,isuzu,gas,std,four,70,5400,?
45,isuzu,gas,std,four,70,5400,?
129,porsche,gas,std,eight,288,5750,?
130,renault,gas,std,four,?,?,9295
131,renault,gas,std,four,?,?,9895


Checking for '?' values in the dataset, we see that 6 rows contain the '?' value. We will have to deal with these values before we progress.
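As a complement to listing the affected rows, the '?' placeholders can also be counted per column. The following is a minimal sketch on a hypothetical miniature frame (`sample` and its values are illustrative only, not the project data).

```python
import pandas as pd

# Hypothetical miniature of the subset; values are illustrative only
sample = pd.DataFrame({
    'make': ['audi', 'isuzu', 'renault', 'volvo'],
    'horsepower': ['160', '70', '?', '114'],
    'price': ['?', '?', '9295', '16845'],
})

# Number of '?' placeholders in each column
placeholder_counts = (sample == '?').sum()
print(placeholder_counts)

# Rows that contain at least one '?'
rows_with_placeholder = sample[(sample == '?').any(axis=1)]
print(len(rows_with_placeholder))
```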

###Dealing with '?' values

In [161]:
#checking for other audi vehicles with the same first 4 attributes as the row with the '?'
ProjDS.loc[(ProjDS['make'] == 'audi') & (ProjDS['fuel_type'] == 'gas') & (ProjDS['aspiration'] == 'turbo') & (ProjDS['number_of_cylinders'] == 'five')]

Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price
8,audi,gas,turbo,five,140,5500,23875
9,audi,gas,turbo,five,160,5500,?


In [162]:
#replacing the '?' with 23875
ProjDS.loc[9,['price']] = 23875
ProjDS.loc[(ProjDS['make'] == 'audi') & (ProjDS['fuel_type'] == 'gas') & (ProjDS['aspiration'] == 'turbo') & (ProjDS['number_of_cylinders'] == 'five')]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price
8,audi,gas,turbo,five,140,5500,23875
9,audi,gas,turbo,five,160,5500,23875


Here we see another entry with very similar attribute values that has a valid price. Since it is the only other such entry, we replace the '?' value with that entry's price.
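The "A value is trying to be set on a copy of a slice from a DataFrame" message printed with the output above is pandas' SettingWithCopyWarning. It typically appears when a subset is created by selecting columns from another DataFrame and then modified. One common way to avoid it, sketched here on a hypothetical stand-in frame (`full` and `subset` are illustrative names), is to take an explicit copy when creating the subset.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset
full = pd.DataFrame({
    'make': ['audi', 'audi'],
    'horsepower': ['140', '160'],
    'price': ['23875', '?'],
    'length': [192.7, 178.2],
})

# .copy() makes the subset an independent DataFrame rather than a slice of `full`
subset = full[['make', 'horsepower', 'price']].copy()

# Assigning into the explicit copy does not raise the warning
subset.loc[1, 'price'] = '23875'

print(subset.loc[1, 'price'])
print(full.loc[1, 'price'])  # the original frame is left unchanged
```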

In [163]:
#checking all cars made by isuzu
ProjDS.loc[(ProjDS['make'] == 'isuzu')]

Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price
43,isuzu,gas,std,four,78,4800,6785
44,isuzu,gas,std,four,70,5400,?
45,isuzu,gas,std,four,70,5400,?
46,isuzu,gas,std,four,90,5000,11048


In [164]:
#getting the median/average of the isuzu prices
isuzuprice = (6785+11048)/2
#replacing the values of '?'
ProjDS.loc[44,['price']] = isuzuprice
ProjDS.loc[45,['price']] = isuzuprice
ProjDS.loc[(ProjDS['make'] == 'isuzu')]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price
43,isuzu,gas,std,four,78,4800,6785.0
44,isuzu,gas,std,four,70,5400,8916.5
45,isuzu,gas,std,four,70,5400,8916.5
46,isuzu,gas,std,four,90,5000,11048.0


For isuzu, we see two other rows that share most of the same attribute values and have a valid price. Here it is decided to replace the '?' with the mean price of the other two entries. The mean and the median are the same in this case because there are only two entries.
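The same replacement value can be derived from the data instead of typing the two prices by hand. This is a sketch only: the frame below re-creates the four isuzu rows shown above, and `pd.to_numeric` with `errors='coerce'` is assumed as the way to treat '?' as missing.

```python
import pandas as pd

# Prices copied from the isuzu rows shown above; '?' marks the missing values
isuzu = pd.DataFrame(
    {'make': ['isuzu'] * 4, 'price': ['6785', '?', '?', '11048']},
    index=[43, 44, 45, 46],
)

# Convert to numbers; '?' becomes NaN, which mean() ignores by default
prices = pd.to_numeric(isuzu['price'], errors='coerce')
isuzu_mean = prices.mean()

# Fill only the missing entries with the mean of the valid ones
isuzu['price'] = prices.fillna(isuzu_mean)
print(isuzu)
```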

In [165]:
#checking all cars made by porsche
ProjDS.loc[(ProjDS['make'] == 'porsche')]

Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price
125,porsche,gas,std,four,143,5500,22018
126,porsche,gas,std,six,207,5900,32528
127,porsche,gas,std,six,207,5900,34028
128,porsche,gas,std,six,207,5900,37028
129,porsche,gas,std,eight,288,5750,?


In [166]:
#dropping the one entry
ProjDS.drop(index = 129, inplace = True)
ProjDS.loc[(ProjDS['make'] == 'porsche')]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price
125,porsche,gas,std,four,143,5500,22018
126,porsche,gas,std,six,207,5900,32528
127,porsche,gas,std,six,207,5900,34028
128,porsche,gas,std,six,207,5900,37028


After checking how many vehicles were made by porsche, we see several other rows for Porsche, so it is decided to drop the row with the missing price, decreasing the number of observations from 205 to 204.

In [167]:
#Checking how many rows are cars made by renault
renault = ProjDS.loc[(ProjDS['make'] == 'renault')]
renault

Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price
130,renault,gas,std,four,?,?,9295
131,renault,gas,std,four,?,?,9895


In [168]:
#selecting gas, std-aspiration, four-cylinder cars that have valid horsepower and peak_rpm values
noq = ProjDS[(ProjDS['fuel_type'] == 'gas') & (ProjDS['aspiration'] == 'std') & (ProjDS['number_of_cylinders'] == 'four') & (ProjDS['horsepower'] != '?') & (ProjDS['peak_rpm'] != '?')]
#mean horsepower and peak_rpm of those cars, truncated to integers
hmean = int(noq.astype({'horsepower':'int32'}).horsepower.mean())
rpmMean = int(noq.astype({'peak_rpm':'int64'}).peak_rpm.mean())
#replacing the '?' values in the two renault rows
ProjDS.loc[130,['horsepower']] = hmean
ProjDS.loc[131,['horsepower']] = hmean
ProjDS.loc[130,['peak_rpm']] = rpmMean
ProjDS.loc[131,['peak_rpm']] = rpmMean
ProjDS.loc[(ProjDS['make'] == 'renault')]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price
130,renault,gas,std,four,85,5174,9295
131,renault,gas,std,four,85,5174,9895


Since these two rows with '?' values are the only representation we have for the make 'renault', it is decided that the values should be replaced instead of the rows being dropped. Each '?' is replaced by the mean horsepower or mean peak_rpm, truncated to an integer, of the cars where fuel_type is gas, aspiration is std, and number_of_cylinders is four.
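A more general form of this idea is to fill each missing value with the mean of its own group. The sketch below uses `groupby(...).transform('mean')` on a hypothetical frame; the horsepower values are illustrative and not taken from the dataset, and unlike the cell above the result is not truncated to an integer.

```python
import pandas as pd

# Hypothetical rows; the horsepower values are illustrative, not taken from the dataset
cars = pd.DataFrame({
    'make': ['toyota', 'honda', 'renault', 'volvo'],
    'fuel_type': ['gas', 'gas', 'gas', 'diesel'],
    'aspiration': ['std', 'std', 'std', 'turbo'],
    'number_of_cylinders': ['four', 'four', 'four', 'six'],
    'horsepower': ['60', '100', '?', '106'],
})

# Treat '?' as missing
cars['horsepower'] = pd.to_numeric(cars['horsepower'], errors='coerce')

# Mean horsepower of each (fuel_type, aspiration, number_of_cylinders) group,
# aligned row by row with the original frame
group_cols = ['fuel_type', 'aspiration', 'number_of_cylinders']
group_mean = cars.groupby(group_cols)['horsepower'].transform('mean')

# Fill only the missing entries with their group's mean
cars['horsepower'] = cars['horsepower'].fillna(group_mean)
print(cars)
```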

In [169]:
#rechecking for '?' values
ProjDS[ProjDS.isin(['?']).any(axis=1)]

Unnamed: 0,make,fuel_type,aspiration,number_of_cylinders,horsepower,peak_rpm,price


In [176]:
#changing numerical columns to numerical datatypes
ProjDS2 = ProjDS.astype({'horsepower':'int32','peak_rpm':'int64','price':'float64'}, copy = False)
ProjDS2.dtypes

make                    object
fuel_type               object
aspiration              object
number_of_cylinders     object
horsepower               int32
peak_rpm                 int64
price                  float64
dtype: object

##Exploratory Data Analysis (EDA)

In [177]:
ProjDS2.describe()

Unnamed: 0,horsepower,peak_rpm,price
count,204.0,204.0,204.0
mean,103.166667,5122.784314,13217.357843
std,37.491433,476.155843,7935.011217
min,48.0,4150.0,5118.0
25%,70.0,4800.0,7784.75
50%,95.0,5200.0,10270.0
75%,116.0,5500.0,16500.75
max,262.0,6600.0,45400.0


- horsepower: the mean of this attribute is 103.166667 while the median is 95. Since the mean is greater than the median, this suggests the distribution is skewed to the right.
- peak_rpm: the mean of this attribute is 5122.784314 while the median is 5200. Since the mean is less than the median, this suggests the distribution is slightly skewed to the left.
- price: the mean of this attribute is 13217.357843 while the median is 10270. Since the mean is greater than the median, this suggests the distribution is skewed to the right.
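The mean-versus-median comparison is a rule of thumb rather than a formal test. As a small sketch, the comparison can be written out using the mean and 50% values copied from the `describe()` output above (`skew_hint` is a hypothetical helper, not part of pandas).

```python
# Mean and median (50%) values copied from the describe() output above
summary = {
    'horsepower': (103.166667, 95.0),
    'peak_rpm': (5122.784314, 5200.0),
    'price': (13217.357843, 10270.0),
}

def skew_hint(mean, median):
    """Rule-of-thumb direction of skew from comparing mean and median."""
    if mean > median:
        return 'right'
    if mean < median:
        return 'left'
    return 'none'

hints = {name: skew_hint(mean, median) for name, (mean, median) in summary.items()}
print(hints)
```

For a direct measure, pandas also provides `Series.skew()`, which could be applied to the numeric columns of `ProjDS2`.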

##Questions


###2.


In [180]:
df[df['number_of_doors']== '?']

Unnamed: 0,symboling,normalized losses,make,fuel_type,aspiration,number_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
27,1,148,dodge,gas,turbo,?,sedan,fwd,front,93.7,...,98,mpfi,3.03,3.39,7.6,102,5500,24,30,8558
63,0,?,mazda,diesel,std,?,sedan,fwd,front,98.8,...,122,idi,3.39,3.39,22.7,64,4650,36,42,10795


Two missing values occur for number_of_doors.

In [181]:
df[df['price']== '?']

Unnamed: 0,symboling,normalized losses,make,fuel_type,aspiration,number_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
9,0,?,audi,gas,turbo,two,hatchback,4wd,front,99.5,...,131,mpfi,3.13,3.4,7.0,160,5500,16,22,?
44,1,?,isuzu,gas,std,two,sedan,fwd,front,94.5,...,90,2bbl,3.03,3.11,9.6,70,5400,38,43,?
45,0,?,isuzu,gas,std,four,sedan,fwd,front,94.5,...,90,2bbl,3.03,3.11,9.6,70,5400,38,43,?
129,1,?,porsche,gas,std,two,hatchback,rwd,front,98.4,...,203,mpfi,3.94,3.11,10.0,288,5750,17,28,?


Four missing values occur for price.

##References

1. https://www.sweeneychevrolet.com/blog/what-is-the-difference-between-diesel-and-gas/
2. https://www.basilcars.com/diesel-vs-gasoline/
3. https://www.inchcape.co.uk/blog/guides/what-size-engine-do-i-need/
4. https://www.roadandtrack.com/car-culture/a30443334/engine-stroke-vs-bore-explained/
5. https://www.caranddriver.com/research/a31873205/mpg-meaning/
