In [1]:
import pandas as pd

Data exploration and data cleaning are crucial steps in the machine learning workflow. Exploration gives us insight into the characteristics of the data, and cleaning makes the data accurate and consistent, which improves the quality of the models trained on it. To explore and clean the car data, follow the steps below.

1. Open an empty notebook and read in the car dataset.

2. Check information about the data. How many numerical and categorical variables exist in the dataset?

3. Check the statistical description of the data. What are the maximum and minimum prices of the cars? What is the standard deviation of the city fuel consumption of the cars?

4. Check how many cars from each brand are in the dataset.

5. Check the column names of the dataset. Are there any inconsistencies in the column names? Is there a way to resolve them in this case?

**Hint:** A naming convention can be used to resolve inconsistencies. Consider using lowercase letters and short, clear names, such as `hp` instead of `Engine HP`, `drive` instead of `Driven_Wheels`, `price` instead of `MSRP`, etc.

6. Check for duplicate rows. What should we do about the duplicate data?

7. Save the dataframe under a suitable name for further analysis.

8. Check data for null values. Remove the rows with missing values from the dataframe.

**Note:** Removing rows with missing values is only one way to resolve the issue. In the following days, we will learn how to handle missing values systematically.

**1.** Open an empty notebook and read in the car dataset.

In [2]:
car_data = pd.read_csv('./data/cars_data.csv')

**2.** Check information about the data. How many numerical and categorical variables exist in the dataset?

In [3]:
car_data.dtypes

Make                  object
Model                 object
Year                   int64
Engine Fuel Type      object
Engine HP            float64
Engine Cylinders     float64
Transmission Type     object
Driven_Wheels         object
Number of Doors      float64
Market Category       object
Vehicle Size          object
Vehicle Style         object
highway MPG            int64
city mpg               int64
Popularity             int64
MSRP                   int64
dtype: object
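The counts asked for in this step can also be computed rather than read off by eye. The sketch below uses `select_dtypes` on a small toy frame (the values are illustrative stand-ins, not the real dataset). Applied to `car_data`, the same two calls should agree with the listing above: 8 numerical columns (3 `float64` + 5 `int64`) and 8 `object` columns.

```python
import pandas as pd

# Toy frame standing in for car_data (illustrative values only).
toy = pd.DataFrame({
    "Make": ["BMW", "Audi", "FIAT"],
    "Year": [2011, 2012, 2013],
    "Engine HP": [335.0, 300.0, None],
    "MSRP": [46135, 40650, 36350],
})

# Numerical columns: any integer or float dtype.
numeric_cols = toy.select_dtypes(include="number").columns
# Categorical (text) columns: everything that is not numeric.
categorical_cols = toy.select_dtypes(exclude="number").columns

print(len(numeric_cols), "numerical:", list(numeric_cols))
print(len(categorical_cols), "categorical:", list(categorical_cols))
```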

In [4]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Make               11914 non-null  object 
 1   Model              11914 non-null  object 
 2   Year               11914 non-null  int64  
 3   Engine Fuel Type   11911 non-null  object 
 4   Engine HP          11845 non-null  float64
 5   Engine Cylinders   11884 non-null  float64
 6   Transmission Type  11914 non-null  object 
 7   Driven_Wheels      11914 non-null  object 
 8   Number of Doors    11908 non-null  float64
 9   Market Category    8172 non-null   object 
 10  Vehicle Size       11914 non-null  object 
 11  Vehicle Style      11914 non-null  object 
 12  highway MPG        11914 non-null  int64  
 13  city mpg           11914 non-null  int64  
 14  Popularity         11914 non-null  int64  
 15  MSRP               11914 non-null  int64  
dtypes: float64(3), int64(5), object(8)

In [5]:
car_data.shape

(11914, 16)

In [6]:
car_data.head()

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |


**3.** Check the statistical description of the data. What are the maximum and minimum prices of the cars? What is the standard deviation of the city fuel consumption of the cars?



In [7]:
car_data['MSRP'].max()

2065902

In [8]:
car_data['MSRP'].min()

2000

In [9]:
car_data['city mpg'].std()

8.987798160299272

In [10]:
car_data.describe()

| | Year | Engine HP | Engine Cylinders | Number of Doors | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|
| count | 11914.0 | 11845.0 | 11884.0 | 11908.0 | 11914.0 | 11914.0 | 11914.0 | 11914.0 |
| mean | 2010.384338 | 249.38607 | 5.628829 | 3.436093 | 26.637485 | 19.733255 | 1554.911197 | 40594.74 |
| std | 7.57974 | 109.19187 | 1.780559 | 0.881315 | 8.863001 | 8.987798 | 1441.855347 | 60109.1 |
| min | 1990.0 | 55.0 | 0.0 | 2.0 | 12.0 | 7.0 | 2.0 | 2000.0 |
| 25% | 2007.0 | 170.0 | 4.0 | 2.0 | 22.0 | 16.0 | 549.0 | 21000.0 |
| 50% | 2015.0 | 227.0 | 6.0 | 4.0 | 26.0 | 18.0 | 1385.0 | 29995.0 |
| 75% | 2016.0 | 300.0 | 6.0 | 4.0 | 30.0 | 22.0 | 2009.0 | 42231.25 |
| max | 2017.0 | 1001.0 | 16.0 | 4.0 | 354.0 | 137.0 | 5657.0 | 2065902.0 |
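The three statistics asked for in this step can also be collected in a single call instead of three separate cells. The following is a minimal sketch using `DataFrame.agg` with a per-column mapping. The toy frame and its values are illustrative assumptions, not rows taken from the dataset.

```python
import pandas as pd

# Toy frame with the two columns of interest (illustrative values only).
toy = pd.DataFrame({
    "MSRP": [2000, 29995, 46135],
    "city mpg": [16, 18, 22],
})

# Request a different set of statistics per column.
# Cells for statistics that were not requested for a column are NaN.
summary = toy.agg({"MSRP": ["min", "max"], "city mpg": ["std"]})
print(summary)
```

On `car_data`, the same call should reproduce the `min`, `max`, and `std` entries shown in the `describe()` output.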


**4.** Check how many cars from each brand are in the dataset.

In [11]:
car_data['Make'].nunique()

48

In [12]:
car_data['Make'].unique()

array(['BMW', 'Audi', 'FIAT', 'Mercedes-Benz', 'Chrysler', 'Nissan',
       'Volvo', 'Mazda', 'Mitsubishi', 'Ferrari', 'Alfa Romeo', 'Toyota',
       'McLaren', 'Maybach', 'Pontiac', 'Porsche', 'Saab', 'GMC',
       'Hyundai', 'Plymouth', 'Honda', 'Oldsmobile', 'Suzuki', 'Ford',
       'Cadillac', 'Kia', 'Bentley', 'Chevrolet', 'Dodge', 'Lamborghini',
       'Lincoln', 'Subaru', 'Volkswagen', 'Spyker', 'Buick', 'Acura',
       'Rolls-Royce', 'Maserati', 'Lexus', 'Aston Martin', 'Land Rover',
       'Lotus', 'Infiniti', 'Scion', 'Genesis', 'HUMMER', 'Tesla',
       'Bugatti'], dtype=object)

In [13]:
car_data.describe(include='O')

| | Make | Model | Engine Fuel Type | Transmission Type | Driven_Wheels | Market Category | Vehicle Size | Vehicle Style |
|---|---|---|---|---|---|---|---|---|
| count | 11914 | 11914 | 11911 | 11914 | 11914 | 8172 | 11914 | 11914 |
| unique | 48 | 915 | 10 | 5 | 4 | 71 | 3 | 16 |
| top | Chevrolet | Silverado 1500 | regular unleaded | AUTOMATIC | front wheel drive | Crossover | Compact | Sedan |
| freq | 1123 | 156 | 7172 | 8266 | 4787 | 1110 | 4764 | 3048 |
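`nunique()` and `unique()` tell us which brands exist, and `describe(include='O')` reports only the most frequent one (Chevrolet, 1123 rows). To get the number of cars for every brand, `value_counts()` is the usual tool. The sketch below runs it on a toy `Make` column with made-up rows. On the real data the call would be `car_data['Make'].value_counts()`.

```python
import pandas as pd

# Toy 'Make' column (illustrative rows only).
toy = pd.DataFrame({"Make": ["BMW", "Audi", "BMW", "FIAT", "BMW", "Audi"]})

# Count rows per brand, sorted from most to least frequent.
counts = toy["Make"].value_counts()
print(counts)
```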


**5.** Check the column names of the dataset. Are there any inconsistencies in the column names? Is there a way to resolve them in this case?

**Hint:** A naming convention can be used to resolve inconsistencies. Consider using lowercase letters and short, clear names, such as `hp` instead of `Engine HP`, `drive` instead of `Driven_Wheels`, `price` instead of `MSRP`, etc.

In [14]:
car_data.columns

Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style',
       'highway MPG', 'city mpg', 'Popularity', 'MSRP'],
      dtype='object')

In [15]:
car_data.head(2)

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |


In [16]:
car_data.columns = car_data.columns.str.lower()
car_data.columns

Index(['make', 'model', 'year', 'engine fuel type', 'engine hp',
       'engine cylinders', 'transmission type', 'driven_wheels',
       'number of doors', 'market category', 'vehicle size', 'vehicle style',
       'highway mpg', 'city mpg', 'popularity', 'msrp'],
      dtype='object')

In [17]:
car_data.rename(columns={'engine fuel type':'fuel',
                         'engine hp':'hp',
                         'engine cylinders':'cylinders',
                         'transmission type':'transmission',
                         'driven_wheels':'drive',
                         'number of doors':'doors',
                         'market category':'mrk_cat',
                         'vehicle size':'size',
                         'vehicle style':'style',
                         'highway mpg':'highway_mpg',
                         'city mpg':'city_mpg',
                         'msrp':'price'
                        },inplace=True)

car_data.columns

Index(['make', 'model', 'year', 'fuel', 'hp', 'cylinders', 'transmission',
       'drive', 'doors', 'mrk_cat', 'size', 'style', 'highway_mpg', 'city_mpg',
       'popularity', 'price'],
      dtype='object')
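The explicit `rename` mapping above gives full control over the final names. When only a consistent convention is needed (lowercase, no spaces), the column index can also be normalized programmatically. This is a sketch of that alternative, not a replacement for the mapping used in this notebook. The toy frame simply reuses a few of the original column names.

```python
import pandas as pd

# Empty toy frame carrying a few of the original, inconsistent column names.
toy = pd.DataFrame(columns=["Engine HP", "Driven_Wheels", "highway MPG", "city mpg"])

# Strip stray whitespace, lowercase everything, and use underscores for spaces.
toy.columns = (
    toy.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)
print(list(toy.columns))
```

This yields names such as `engine_hp` and `driven_wheels`; shortening them to `hp` or `drive` still requires an explicit mapping like the one above.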

**6.** Check for duplicate rows. What should we do about the duplicate data?

In [18]:
#car_data.duplicated().sum()

In [19]:
#duplicate_data=car_data[car_data.duplicated()]
#dup1=duplicate_data.groupby('make')
#dup1.get_group('BMW')
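The duplicate check itself is commented out in the two cells above, so its output is not shown. As a minimal sketch of what such a check looks like, the toy frame below (made-up rows) contains one exact duplicate. `duplicated()` flags repeated rows, `keep=False` shows every copy for inspection, and `drop_duplicates()` keeps the first occurrence. Judging by the shapes reported in this notebook, the real dataset goes from 11914 to 11199 rows, which means 715 duplicate rows are removed.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical, row 3 differs only in price.
toy = pd.DataFrame({
    "make": ["BMW", "BMW", "Audi", "BMW"],
    "model": ["1 Series", "1 Series", "A4", "1 Series"],
    "price": [29450, 29450, 35000, 34500],
})

# Number of rows that repeat an earlier row exactly.
n_dupes = toy.duplicated().sum()

# Show all copies of duplicated rows, including the first occurrence.
dupes = toy[toy.duplicated(keep=False)]

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(dupes)
print(deduped)
```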

In [20]:
type(car_data)

pandas.core.frame.DataFrame

In [21]:
new_data = car_data.drop_duplicates()
new_data

| | make | model | year | fuel | hp | cylinders | transmission | drive | doors | mrk_cat | size | style | highway_mpg | city_mpg | popularity | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11909 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 46120 |
| 11910 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 56670 |
| 11911 | Acura | ZDX | 2012 | premium unleaded (required) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 50620 |
| 11912 | Acura | ZDX | 2013 | premium unleaded (recommended) | 300.0 | 6.0 | AUTOMATIC | all wheel drive | 4.0 | Crossover,Hatchback,Luxury | Midsize | 4dr Hatchback | 23 | 16 | 204 | 50920 |


In [22]:
new_data.shape

(11199, 16)

In [23]:
new_data.reset_index(drop=True, inplace=True)
new_data.index

RangeIndex(start=0, stop=11199, step=1)

**7.** Save the dataframe under a suitable name for further analysis.

In [24]:
new_data.to_csv('./data/cars_no_dupes.csv', index=False)

**8.** Check data for null values. Remove the rows with missing values from the dataframe.

**Note:** Removing rows with missing values is only one way to resolve the issue. In the following days, we will learn how to handle missing values systematically.

In [25]:
new_data.isnull().sum()

make               0
model              0
year               0
fuel               3
hp                69
cylinders         30
transmission       0
drive              0
doors              6
mrk_cat         3376
size               0
style              0
highway_mpg        0
city_mpg           0
popularity         0
price              0
dtype: int64
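Raw counts are easier to judge as a share of all rows. In the output above, `mrk_cat` is missing in 3376 of 11199 rows, roughly 30%, while every other column is missing in well under 1%. The sketch below computes that percentage per column on a toy frame (illustrative values only); `isnull().mean()` works because `True` counts as 1 when averaged.

```python
import pandas as pd

# Toy frame with a few missing entries (illustrative values only).
toy = pd.DataFrame({
    "hp": [335.0, None, 230.0, 300.0],
    "mrk_cat": ["Luxury", None, None, "Performance"],
    "price": [46135, 40650, 36350, 29450],
})

# Percentage of missing values per column, largest first.
missing_share = (
    toy.isnull().mean().mul(100).round(1).sort_values(ascending=False)
)
print(missing_share)
```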

In [26]:
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11199 entries, 0 to 11198
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   make          11199 non-null  object 
 1   model         11199 non-null  object 
 2   year          11199 non-null  int64  
 3   fuel          11196 non-null  object 
 4   hp            11130 non-null  float64
 5   cylinders     11169 non-null  float64
 6   transmission  11199 non-null  object 
 7   drive         11199 non-null  object 
 8   doors         11193 non-null  float64
 9   mrk_cat       7823 non-null   object 
 10  size          11199 non-null  object 
 11  style         11199 non-null  object 
 12  highway_mpg   11199 non-null  int64  
 13  city_mpg      11199 non-null  int64  
 14  popularity    11199 non-null  int64  
 15  price         11199 non-null  int64  
dtypes: float64(3), int64(5), object(8)
memory usage: 1.4+ MB


In [27]:
data_no_dupes_null = new_data.dropna(axis=0)
data_no_dupes_null.shape

(7735, 16)
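Dropping every row that has any missing value reduces the data from 11199 to 7735 rows, and most of that loss comes from `mrk_cat` alone (3376 nulls). If a column like that is not needed later, `dropna` can be restricted to chosen columns with its `subset` parameter. The sketch below contrasts the two on a toy frame with made-up values. Which columns to require is a modelling decision and an assumption here, not something this exercise prescribes.

```python
import pandas as pd

# Toy frame: 'mrk_cat' is missing often, 'hp' only once (illustrative values).
toy = pd.DataFrame({
    "hp": [335.0, None, 230.0, 300.0],
    "mrk_cat": ["Luxury", "Luxury", None, None],
    "price": [46135, 40650, 36350, 29450],
})

# Drop a row if ANY column is missing (the approach used in this notebook).
all_dropped = toy.dropna(axis=0)

# Drop a row only if 'hp' is missing; rows lacking 'mrk_cat' are kept.
subset_dropped = toy.dropna(subset=["hp"])

print(all_dropped.shape, subset_dropped.shape)
```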

In [28]:
data_no_dupes_null.reset_index(drop=True, inplace=True)
data_no_dupes_null.index

RangeIndex(start=0, stop=7735, step=1)

In [29]:
data_no_dupes_null.to_csv('./data/data_no_dupes_null.csv', index=False)