[__Home__](../README.md) | [__Data Cleaning >>__](./02_Cars4u_data_cleaning.ipynb)


# Cars4u: Car Price Prediction
## Initial Data Exploration

__Dataset:__ [Cars4u](https://www.kaggle.com/datasets/sukhmanibedi/cars4u) \
__Author:__ Dmitry Luchkin \
__Date:__ 2024-07-25

__Objectives:__
   - Import the dataset from a CSV file.
   - Identify missing values and duplicated rows.
   - Explore the data structure.

## Table of Contents

- [Dataset](#dataset)
- [Notebooks](#notebooks)
- [Import Libraries](#import-libraries)
- [Notebook Setup](#notebook-setup)
- [Loading Data](#loading-data)
- [Data Exploration](#data-exploration)
  - [Check Data Types](#check-data-type)
  - [Check Uniqueness of Data](#check-uniqueness-data)
  - [Check Missing Values](#check-missing-values)
  - [Check Duplicated Rows](#check-duplicated-rows)

## Dataset  <a  class="anchor" name='dataset'></a>

This dataset is a CSV file containing 7253 data points with information about used cars.

__Description of Attributes:__

| Attribute         | Description                                                      |
|-------------------|------------------------------------------------------------------|
| S.No.             | A unique identifier for each data point in the dataset.          |
| Name              | The brand and model name of the used car.                        |
| Location          | The city or location where the car is being sold.                |
| Year              | The year the car was manufactured.                               |
| Kilometers_Driven | The total distance the car has been driven, measured in kilometers. |
| Fuel_Type         | The type of fuel the car uses, such as Petrol, Diesel, CNG, etc. |
| Transmission      | The type of transmission in the car, such as Manual or Automatic. |
| Owner_Type        | The ownership status of the car, such as First Owner, Second Owner, etc. |
| Mileage           | The fuel efficiency of the car, recorded in kilometers per liter (kmpl) or, for gas fuels such as CNG and LPG, kilometers per kilogram (km/kg). |
| Engine            | The displacement of the car's engine, typically measured in cubic centimeters (cc). |
| Power             | The power output of the car's engine, measured in brake horsepower (bhp). |
| Seats             | The total number of seats in the car.                            |
| New_Price         | The original price of the car when it was new.                   |
| Price             | The current selling price of the used car.                       |



## Notebooks <a class="anchor" name='notebooks'></a>

+ [__01_Cars4u_initial_data_exploration.ipynb__](./01_Cars4u_initial_data_exploration.ipynb)
+ [02_Cars4u_data_cleaning.ipynb](./02_Cars4u_data_cleaning.ipynb)
+ [03_Cars4u_exploratory_data_analysis.ipynb](./03_Cars4u_exploratory_data_analysis.ipynb)
+ [04_Cars4u_feature_engineering.ipynb](./04_Cars4u_feature_engineering.ipynb)
+ [05_Cars4u_modeling.ipynb](./05_Cars4u_modeling.ipynb)

## Import Libraries <a name='import-libraries'></a>

In [1]:
import datetime
import sys

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Notebook Setup <a name='notebook-setup'></a>

In [2]:
# Pandas settings
pd.options.display.max_columns = None
pd.options.display.max_colwidth = 60
pd.options.display.float_format = '{:,.4f}'.format

# Visualization settings
from matplotlib import rcParams
plt.style.use('fivethirtyeight')
rcParams['figure.figsize'] = (16, 5)   
rcParams['axes.spines.right'] = False
rcParams['axes.spines.top'] = False
rcParams['font.size'] = 12
rcParams['savefig.dpi'] = 300
plt.rc('xtick', labelsize=11)
plt.rc('ytick', labelsize=11)
%config InlineBackend.figure_format = 'retina'

## Loading Data <a name='loading-data'></a>

Take a quick look at the raw file before loading it.

In [3]:
%cat ../00_data/00_raw/used_cars_data.csv | head

S.No.,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5,,1.75
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5,,12.5
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5,8.61 Lakh,4.5
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7,,6
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5,,17.74
5,Hyundai EON LPG Era Plus Option,Hyderabad,2012,75000,LPG,Manual,First,21.1 km/kg,814 CC,55.2 bhp,5,,2.35
6,Nissan Micra Diesel XV,Jaipur,2013,86999,Diesel,Manual,First,23.08 kmpl,1461 CC,63.1 bhp,5,,3.5
7,Toyota Innova Crysta 2.8 GX AT 8S,Mumbai,2016,36000,Diesel,Automatic,First,11.36 kmpl,2755 CC,171.5 bhp,8,21 Lakh,17.5
8,Volkswagen Vento Diesel Comfortline,Pune,

The file contains comma-separated values, with the header in the first row and a period used as the decimal separator.

In [4]:
# loading data
data = pd.read_csv('../00_data/00_raw/used_cars_data.csv')
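
The call above relies on the `read_csv` defaults, which happen to match the layout seen in the preview. A minimal sketch that spells those assumptions out, using three rows copied from the preview in place of the real file (the names `sample_csv` and `sample` are illustrative only):

```python
import io

import pandas as pd

# Three rows copied from the file preview, standing in for the real CSV.
sample_csv = """S.No.,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5,,1.75
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5,,12.5
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5,8.61 Lakh,4.5
"""

# Same behaviour as the defaults, written out explicitly:
# comma separator, header in the first row, '.' as the decimal mark.
sample = pd.read_csv(io.StringIO(sample_csv), sep=',', header=0, decimal='.')

print(sample.shape)
print(sample.dtypes)
```

Empty fields, such as `New_Price` in the first two rows, are read as missing values.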

<a id='data-exploration'></a>

## Data Exploration

In [5]:
print(f'Rows count: {data.shape[0]}, Columns count: {data.shape[1]}')

Rows count: 7253, Columns count: 14


In [6]:
data.columns

Index(['S.No.', 'Name', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type',
       'Transmission', 'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats',
       'New_Price', 'Price'],
      dtype='object')

In [7]:
data.head()

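|   | S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price |
|---|-------|------|----------|------|-------------------|-----------|--------------|------------|---------|--------|-------|-------|-----------|-------|
| 0 | 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 998 CC | 58.16 bhp | 5.0 | NaN | 1.75 |
| 1 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 1582 CC | 126.2 bhp | 5.0 | NaN | 12.5 |
| 2 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 1199 CC | 88.7 bhp | 5.0 | 8.61 Lakh | 4.5 |
| 3 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 1248 CC | 88.76 bhp | 7.0 | NaN | 6.0 |
| 4 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 1968 CC | 140.8 bhp | 5.0 | NaN | 17.74 |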


In [8]:
data.tail()

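|   | S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price |
|---|-------|------|----------|------|-------------------|-----------|--------------|------------|---------|--------|-------|-------|-----------|-------|
| 7248 | 7248 | Volkswagen Vento Diesel Trendline | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 kmpl | 1598 CC | 103.6 bhp | 5.0 | NaN | NaN |
| 7249 | 7249 | Volkswagen Polo GT TSI | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 kmpl | 1197 CC | 103.6 bhp | 5.0 | NaN | NaN |
| 7250 | 7250 | Nissan Micra Diesel XV | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 kmpl | 1461 CC | 63.1 bhp | 5.0 | NaN | NaN |
| 7251 | 7251 | Volkswagen Polo GT TSI | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.2 kmpl | 1197 CC | 103.6 bhp | 5.0 | NaN | NaN |
| 7252 | 7252 | Mercedes-Benz E-Class 2009-2013 E 220 CDI Avantgarde | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.0 kmpl | 2148 CC | 170 bhp | 5.0 | NaN | NaN |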


### Check Data Types <a name='check-data-type'></a>

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7253 entries, 0 to 7252
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   S.No.              7253 non-null   int64  
 1   Name               7253 non-null   object 
 2   Location           7253 non-null   object 
 3   Year               7253 non-null   int64  
 4   Kilometers_Driven  7253 non-null   int64  
 5   Fuel_Type          7253 non-null   object 
 6   Transmission       7253 non-null   object 
 7   Owner_Type         7253 non-null   object 
 8   Mileage            7251 non-null   object 
 9   Engine             7207 non-null   object 
 10  Power              7207 non-null   object 
 11  Seats              7200 non-null   float64
 12  New_Price          1006 non-null   object 
 13  Price              6019 non-null   float64
dtypes: float64(2), int64(3), object(9)
memory usage: 793.4+ KB
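
To make the type breakdown easier to act on, the columns can be grouped into numeric and non-numeric ones. A small sketch on a three-row sample built from the preview rows (the `sample` frame is illustrative; on the full data the same calls would apply to `data`):

```python
import pandas as pd

# A few columns from the first three preview rows.
sample = pd.DataFrame({
    'Year': [2010, 2015, 2011],
    'Fuel_Type': ['CNG', 'Diesel', 'Petrol'],
    'Mileage': ['26.6 km/kg', '19.67 kmpl', '18.2 kmpl'],
    'Engine': ['998 CC', '1582 CC', '1199 CC'],
    'Price': [1.75, 12.5, 4.5],
})

# Integer and float columns.
numeric_cols = sample.select_dtypes(include='number').columns.tolist()

# Everything else; here these are the text columns.
text_cols = sample.select_dtypes(exclude='number').columns.tolist()

print('Numeric:', numeric_cols)
print('Text:   ', text_cols)
```

`Mileage` and `Engine` land in the text group because each value carries a unit suffix.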


### Check Uniqueness of Data <a name='check-uniqueness-data'></a>

In [10]:
num_unique = data.nunique().sort_values()
num_unique

Transmission            2
Owner_Type              4
Fuel_Type               5
Seats                   9
Location               11
Year                   23
Engine                150
Power                 386
Mileage               450
New_Price             625
Price                1373
Name                 2041
Kilometers_Driven    3660
S.No.                7253
dtype: int64

In [11]:
print('--- Average Share per Unique Value (%) ---')
print(100/num_unique)

--- Average Share per Unique Value (%) ---
Transmission        50.0000
Owner_Type          25.0000
Fuel_Type           20.0000
Seats               11.1111
Location             9.0909
Year                 4.3478
Engine               0.6667
Power                0.2591
Mileage              0.2222
New_Price            0.1600
Price                0.0728
Name                 0.0490
Kilometers_Driven    0.0273
S.No.                0.0138
dtype: float64
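
The ratio `100 / num_unique` only gives the share each value would hold if all values were equally frequent. For low-cardinality columns, the observed distribution is often more informative. A sketch using `value_counts(normalize=True)` on the `Fuel_Type` values of the first eight rows shown in the file preview (far too few rows to say anything about the full dataset; the variable names are illustrative):

```python
import pandas as pd

# Fuel_Type values of the first eight rows shown in the file preview.
fuel_type = pd.Series(
    ['CNG', 'Diesel', 'Petrol', 'Diesel', 'Diesel', 'LPG', 'Diesel', 'Diesel'],
    name='Fuel_Type',
)

# Share implied by the unique count alone: every value weighted equally.
uniform_share = 100 / fuel_type.nunique()

# Observed share of each value, in percent.
observed_share = fuel_type.value_counts(normalize=True) * 100

print(f'Uniform share: {uniform_share:.1f}%')
print(observed_share.round(1))
```

On the full data, the same call would be `data['Fuel_Type'].value_counts(normalize=True)`.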


__Observations:__
+ The dataset consists of __7253__ rows and __14__ columns describing used cars.
+ `Name`, `Location`, `Fuel_Type`, `Transmission`, `Owner_Type`, `Mileage`, `Engine`, `Power`, and `New_Price` are stored as __strings__ (`object` dtype).
+ `S.No.`, `Year`, and `Kilometers_Driven` are __integers__.
+ `Seats` and `Price` are __floats__.
+ `S.No.` is a sequential row number that duplicates the default data frame index.\
  <span style="color:blue">_# TODO: Drop S.No. column._</span>
+ `Name` has __2041__ unique values; it combines the brand/manufacturer, the model, and a variant designation.\
  <span style="color:blue">_# TODO: Split Name column into two columns, Brand and Model, drop the remaining variant text, and convert both to a category type._</span>
+ `Location` has __11__ unique values and represents the city where the car is being sold.\
  <span style="color:blue">_# TODO: Convert Location column to a category type._</span>
+ `Fuel_Type` has __5__ unique values.\
  <span style="color:blue">_# TODO: Convert Fuel_Type column to a category type._</span>
+ `Transmission` has __2__ unique values.\
  <span style="color:blue">_# TODO: Convert Transmission column to a category type._</span>
+ `Owner_Type` has __4__ unique values.\
  <span style="color:blue">_# TODO: Convert Owner_Type column to a category type._</span>
+ `Mileage` represents the distance a car can travel on one liter or one kilogram of fuel; each value carries a unit suffix (`kmpl` or `km/kg`).\
  <span style="color:blue">_# TODO: Convert Mileage column to a numeric type and add a column for the units._</span>
+ `Engine` represents the total volume of all the cylinders in the engine; each value carries a `CC` suffix.\
  <span style="color:blue">_# TODO: Convert Engine column to a numeric type._</span>
+ `Power` represents the engine's power output; each value carries a `bhp` suffix.\
  <span style="color:blue">_# TODO: Convert Power column to a numeric type._</span>
+ `Seats` has __9__ unique values and represents the number of seats in the car; it is read as float because the column contains missing values.\
  <span style="color:blue">_# TODO: Convert Seats column to an integer type._</span>
+ `New_Price` represents the price of the car when it was new; values are strings with a unit suffix (e.g. `8.61 Lakh`).\
  <span style="color:blue">_# TODO: Convert New_Price column to a float type._</span>
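
The TODO items above can be sketched on the first three preview rows. This is only an illustration under stated assumptions, not the cleaning procedure of the next notebook: it takes the first word of `Name` as the brand and the second as the model (multi-word brands would need extra handling), and it assumes every value in a column uses a single unit (any other unit in `New_Price`, for instance, would need rescaling). The `sample` frame and the `Brand`, `Model`, and `Mileage_Unit` column names are hypothetical.

```python
import pandas as pd

# First three preview rows, restricted to the columns touched below.
sample = pd.DataFrame({
    'Name': ['Maruti Wagon R LXI CNG', 'Hyundai Creta 1.6 CRDi SX Option', 'Honda Jazz V'],
    'Fuel_Type': ['CNG', 'Diesel', 'Petrol'],
    'Mileage': ['26.6 km/kg', '19.67 kmpl', '18.2 kmpl'],
    'Engine': ['998 CC', '1582 CC', '1199 CC'],
    'Power': ['58.16 bhp', '126.2 bhp', '88.7 bhp'],
    'Seats': [5.0, 5.0, 5.0],
    'New_Price': [None, None, '8.61 Lakh'],
})

# Assumption: brand is the first word of Name, model is the second.
name_parts = sample['Name'].str.split()
sample['Brand'] = name_parts.str[0].astype('category')
sample['Model'] = name_parts.str[1].astype('category')

# Low-cardinality text column -> category.
sample['Fuel_Type'] = sample['Fuel_Type'].astype('category')

# Split "<number> <unit>" strings into a numeric column and a unit column.
mileage = sample['Mileage'].str.extract(r'^(?P<value>[\d.]+)\s*(?P<unit>\S+)$')
sample['Mileage'] = pd.to_numeric(mileage['value'], errors='coerce')
sample['Mileage_Unit'] = mileage['unit']

# Keep only the leading number; anything unparsable becomes NaN.
for col in ['Engine', 'Power', 'New_Price']:
    number = sample[col].str.extract(r'([\d.]+)', expand=False)
    sample[col] = pd.to_numeric(number, errors='coerce')

# Nullable integer type, so missing seat counts can stay missing.
sample['Seats'] = sample['Seats'].astype('Int64')

print(sample.dtypes)
```

The nullable `Int64` dtype is used for `Seats` because a plain `int64` column cannot hold missing values.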

### Check Missing Values <a name='check-missing-values'></a>

In [12]:
data.isnull().sum()

S.No.                   0
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 46
Power                  46
Seats                  53
New_Price            6247
Price                1234
dtype: int64

In [13]:
# percentage of missing values per column
data.isnull().sum()/len(data)*100

S.No.                0.0000
Name                 0.0000
Location             0.0000
Year                 0.0000
Kilometers_Driven    0.0000
Fuel_Type            0.0000
Transmission         0.0000
Owner_Type           0.0000
Mileage              0.0276
Engine               0.6342
Power                0.6342
Seats                0.7307
New_Price           86.1299
Price               17.0136
dtype: float64

__Observations__:
+ `Mileage` has ~0.03% missing values in the column.
+ `Engine` has ~0.6% missing values in the column.
+ `Power` has ~0.6% missing values in the column.
+ `Seats` has ~0.7% missing values in the column.
+ `New_Price` has ~86% missing values in the column.
+ `Price` has ~17% missing values in the column.

<span style="color:blue">_# TODO: Handle missing values in: Mileage, Engine, Power, Seats, New_Price, Price._</span>
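
The counts and percentages from the two cells above can also be combined into one table restricted to the affected columns. A minimal sketch, assuming a helper named `missing_summary` (hypothetical, not part of the notebook) and a toy frame whose values are for illustration only:

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    count = df.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': count,
        'missing_pct': count / len(df) * 100,
    })
    # Keep only columns with at least one missing value, worst first.
    summary = summary[summary['missing_count'] > 0]
    return summary.sort_values('missing_pct', ascending=False)


# Toy frame for illustration only; the values are not taken from the dataset.
toy = pd.DataFrame({
    'Year': [2010, 2015, 2011, 2012],
    'Seats': [5.0, np.nan, 5.0, 7.0],
    'New_Price': [np.nan, np.nan, '8.61 Lakh', np.nan],
})

summary = missing_summary(toy)
print(summary)
```

On the full data, the call would be `missing_summary(data)`.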

### Check Duplicated Rows <a name='check-duplicated-rows'></a>

In [14]:
print(f'No. of entirely duplicated rows: {data.duplicated().sum()}')

No. of entirely duplicated rows: 0


__Observations:__
- There are no entirely duplicated rows (the comparison includes the `S.No.` column).
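
Because `S.No.` is unique for every row, `data.duplicated()` cannot flag any row while that column takes part in the comparison. A sketch of the same check with the identifier excluded, on a toy frame (illustrative values, not taken from the dataset):

```python
import pandas as pd

# Toy frame: rows 0 and 1 differ only in the identifier.
toy = pd.DataFrame({
    'S.No.': [0, 1, 2],
    'Name': ['Honda Jazz V', 'Honda Jazz V', 'Maruti Ertiga VDI'],
    'Location': ['Chennai', 'Chennai', 'Chennai'],
    'Year': [2011, 2011, 2012],
})

# With the unique identifier included, no row can match another one.
full_row_duplicates = toy.duplicated().sum()

# Without it, rows that agree on every remaining column are flagged.
content_duplicates = toy.drop(columns='S.No.').duplicated().sum()

print(f'Including S.No.: {full_row_duplicates}, excluding S.No.: {content_duplicates}')
```

On the full data, the second check would be `data.drop(columns='S.No.').duplicated().sum()`; its result is not shown here.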


---
Cars4u: Car Price Prediction, _August 2024_